These patches are related to two similar patch sets from Ard and Steve:
- https://lkml.kernel.org/r/[email protected]
- https://lkml.kernel.org/r/[email protected]
The code is also heavily inspired by the jump label code, as some of the
concepts are very similar.
There are three separate implementations, depending on what the arch
supports:
1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
objtool and a small amount of arch code
2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
a small amount of arch code
3) If no arch support, fall back to regular function pointers
TODO:
- I'm not sure about the objtool approach. Objtool is (currently)
x86-64 only, which means we have to use the "unoptimized" version
everywhere else. I may experiment with a GCC plugin instead.
- Does this feature have much value without retpolines? If not, should
we make it depend on retpolines somehow?
- Find some actual users of the interfaces (tracepoints? crypto?)
Details (cribbed from comments in include/linux/static_call.h):
------------------------------------------------------------------------
Static calls use code patching to hard-code function pointers into
direct branch instructions. They give the flexibility of function
pointers, but with improved performance. This is especially important
for cases where retpolines would otherwise be used, as retpolines can
significantly impact performance.
API overview:
DECLARE_STATIC_CALL(key, func);
DEFINE_STATIC_CALL(key, func);
static_call(key, args...);
static_call_update(key, func);
Usage example:
# Start with the following functions (with identical prototypes):
int func_a(int arg1, int arg2);
int func_b(int arg1, int arg2);
# Define a 'my_key' reference, associated with func_a() by default
DEFINE_STATIC_CALL(my_key, func_a);
# Call func_a()
static_call(my_key, arg1, arg2);
# Update 'my_key' to point to func_b()
static_call_update(my_key, func_b);
# Call func_b()
static_call(my_key, arg1, arg2);
Implementation details:
There are three different implementations:
1) Optimized static calls (patched call sites)
This requires objtool, which detects all the static_call() sites and
annotates them in the '.static_call_sites' section. By default, the call
sites will call into a temporary per-key trampoline which has an indirect
branch to the current destination function associated with the key.
During system boot (or module init), all call sites are patched to call
their destination functions directly. Updates to a key will patch all
call sites associated with that key.
2) Unoptimized static calls (patched trampolines)
Each static_call() site calls into a permanent trampoline associated with
the key. The trampoline has a direct branch to the default function.
Updates to a key will modify the direct branch in the key's trampoline.
3) Generic implementation
This is the default implementation if the architecture hasn't implemented
CONFIG_HAVE_STATIC_CALL_[UN]OPTIMIZED. In this case, a basic
function pointer is used.
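To make the three variants concrete, here is a rough sketch of what a
static_call(my_key, arg1, arg2) site boils down to (not the verbatim
macro expansion):

  /* 1) and 2): a direct call to the per-key trampoline symbol... */
  ____static_call_tramp_my_key(arg1, arg2);
  /* ...except that with 1), the call site itself is later patched to
     call the destination function directly, bypassing the trampoline */

  /* 3): a plain function pointer call */
  my_key(arg1, arg2);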
Josh Poimboeuf (3):
static_call: Add static call infrastructure
x86/static_call: Add x86 unoptimized static call implementation
x86/static_call: Add optimized static call implementation for 64-bit
arch/Kconfig | 6 +
arch/x86/Kconfig | 4 +-
arch/x86/include/asm/static_call.h | 42 +++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/static_call.c | 84 +++++
include/asm-generic/vmlinux.lds.h | 11 +
include/linux/module.h | 10 +
include/linux/static_call.h | 186 +++++++++++
include/linux/static_call_types.h | 19 ++
kernel/Makefile | 1 +
kernel/module.c | 5 +
kernel/static_call.c | 297 ++++++++++++++++++
tools/objtool/Makefile | 3 +-
tools/objtool/check.c | 126 +++++++-
tools/objtool/check.h | 2 +
tools/objtool/elf.h | 1 +
.../objtool/include/linux/static_call_types.h | 19 ++
tools/objtool/sync-check.sh | 1 +
18 files changed, 815 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/static_call.h
create mode 100644 arch/x86/kernel/static_call.c
create mode 100644 include/linux/static_call.h
create mode 100644 include/linux/static_call_types.h
create mode 100644 kernel/static_call.c
create mode 100644 tools/objtool/include/linux/static_call_types.h
--
2.17.2
Add the optimized static call implementation for x86-64. For each key,
a temporary trampoline is created, named ____static_call_tramp_<key>. The
trampoline has an indirect jump to the destination function.
Objtool uses the trampoline naming convention to detect all the call
sites. It then annotates those call sites in the .static_call_sites
section.
During boot (and module init), the call sites are patched to call
directly into the destination function. The temporary trampoline is
then no longer used.
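For reference, the core of the call-site rewrite (taken from
arch_static_call_transform() below; CALL_INSN_SIZE is the 5-byte x86
call/jmp) is:

  /* compute the new rel32 displacement for the existing e8/e9 opcode */
  dest_relative = (long)(dest) - (long)(insn + CALL_INSN_SIZE);
  opcodes[0] = insn_opcode;
  memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
  /* breakpoint-based live patching of the call site */
  text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
	       static_call_bp_handler);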
Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/x86/Kconfig | 5 +-
arch/x86/include/asm/static_call.h | 24 ++++
arch/x86/kernel/static_call.c | 35 ++++-
tools/objtool/Makefile | 3 +-
tools/objtool/check.c | 126 +++++++++++++++++-
tools/objtool/check.h | 2 +
tools/objtool/elf.h | 1 +
.../objtool/include/linux/static_call_types.h | 19 +++
tools/objtool/sync-check.sh | 1 +
9 files changed, 211 insertions(+), 5 deletions(-)
create mode 100644 tools/objtool/include/linux/static_call_types.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9a83c3edd839..1d9b0e06b3b0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -189,7 +189,8 @@ config X86
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_STACKPROTECTOR if CC_HAS_SANE_STACKPROTECTOR
select HAVE_STACK_VALIDATION if X86_64
- select HAVE_STATIC_CALL_UNOPTIMIZED
+ select HAVE_STATIC_CALL_OPTIMIZED if HAVE_STACK_VALIDATION
+ select HAVE_STATIC_CALL_UNOPTIMIZED if !HAVE_STACK_VALIDATION
select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
@@ -203,6 +204,7 @@ config X86
select RTC_MC146818_LIB
select SPARSE_IRQ
select SRCU
+ select STACK_VALIDATION if HAVE_STACK_VALIDATION && (HAVE_STATIC_CALL_OPTIMIZED || RETPOLINE)
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select USER_STACKTRACE_SUPPORT
@@ -438,7 +440,6 @@ config GOLDFISH
config RETPOLINE
bool "Avoid speculative indirect branches in kernel"
default y
- select STACK_VALIDATION if HAVE_STACK_VALIDATION
help
Compile kernel with the retpoline compiler options to guard against
kernel-to-user data leaks by avoiding speculative indirect
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index de6b032cf809..d3b40c34a00f 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -2,6 +2,27 @@
#ifndef _ASM_STATIC_CALL_H
#define _ASM_STATIC_CALL_H
+#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
+/*
+ * This is a temporary trampoline which is only used during boot (before the
+ * call sites have been patched). It uses the current value of the key->func
+ * pointer to do an indirect jump to the function.
+ *
+ * The name of this function has a magical aspect. Objtool uses it to find
+ * static call sites so that it can create the .static_call_sites section.
+ */
+#define ARCH_STATIC_CALL_TEMPORARY_TRAMP(key) \
+ asm(".pushsection .text, \"ax\" \n" \
+ ".align 4 \n" \
+ ".globl " STATIC_CALL_TRAMP_STR(key) " \n" \
+ ".type " STATIC_CALL_TRAMP_STR(key) ", @function \n" \
+ STATIC_CALL_TRAMP_STR(key) ": \n" \
+ ANNOTATE_RETPOLINE_SAFE " \n" \
+ "jmpq *" __stringify(key) "(%rip) \n" \
+ ".popsection \n")
+
+#else /* !CONFIG_HAVE_STATIC_CALL_OPTIMIZED */
+
/*
* This is a permanent trampoline which is the destination for all static calls
* for the given key. The direct jump gets patched by static_call_update().
@@ -15,4 +36,7 @@
"jmp " #func " \n" \
".popsection \n")
+
+#endif /* !CONFIG_HAVE_STATIC_CALL_OPTIMIZED */
+
#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index 47ddc655ccda..f3176a7ea6d9 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -10,6 +10,22 @@
void static_call_bp_handler(void);
void *bp_handler_dest;
+#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
+
+void *bp_handler_continue;
+
+asm(".pushsection .text, \"ax\" \n"
+ ".globl static_call_bp_handler \n"
+ ".type static_call_bp_handler, @function \n"
+ "static_call_bp_handler: \n"
+ "ANNOTATE_RETPOLINE_SAFE \n"
+ "call *bp_handler_dest \n"
+ "ANNOTATE_RETPOLINE_SAFE \n"
+ "jmp *bp_handler_continue \n"
+ ".popsection \n");
+
+#else /* !CONFIG_HAVE_STATIC_CALL_OPTIMIZED */
+
asm(".pushsection .text, \"ax\" \n"
".globl static_call_bp_handler \n"
".type static_call_bp_handler, @function \n"
@@ -18,6 +34,8 @@ asm(".pushsection .text, \"ax\" \n"
"jmp *bp_handler_dest \n"
".popsection \n");
+#endif /* !CONFIG_HAVE_STATIC_CALL_OPTIMIZED */
+
void arch_static_call_transform(unsigned long insn, void *dest)
{
s32 dest_relative;
@@ -38,8 +56,11 @@ void arch_static_call_transform(unsigned long insn, void *dest)
opcodes[0] = insn_opcode;
memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
- /* Set up the variable for the breakpoint handler: */
+ /* Set up the variables for the breakpoint handler: */
bp_handler_dest = dest;
+#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
+ bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
+#endif
/* Patch the call site: */
text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
@@ -49,3 +70,15 @@ void arch_static_call_transform(unsigned long insn, void *dest)
mutex_unlock(&text_mutex);
}
EXPORT_SYMBOL_GPL(arch_static_call_transform);
+
+#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
+void arch_static_call_poison_tramp(unsigned long insn)
+{
+ unsigned long tramp = insn + CALL_INSN_SIZE + *(s32 *)(insn + 1);
+ unsigned short opcode = INSN_UD2;
+
+ mutex_lock(&text_mutex);
+ text_poke((void *)tramp, &opcode, 2);
+ mutex_unlock(&text_mutex);
+}
+#endif
diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile
index c9d038f91af6..fb1afa34f10d 100644
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -29,7 +29,8 @@ all: $(OBJTOOL)
INCLUDES := -I$(srctree)/tools/include \
-I$(srctree)/tools/arch/$(HOSTARCH)/include/uapi \
- -I$(srctree)/tools/objtool/arch/$(ARCH)/include
+ -I$(srctree)/tools/objtool/arch/$(ARCH)/include \
+ -I$(srctree)/tools/objtool/include
WARNINGS := $(EXTRA_WARNINGS) -Wno-switch-default -Wno-switch-enum -Wno-packed
CFLAGS += -Werror $(WARNINGS) $(KBUILD_HOSTCFLAGS) -g $(INCLUDES)
LDFLAGS += -lelf $(LIBSUBCMD) $(KBUILD_HOSTLDFLAGS)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0414a0d52262..ea1ff9ea2d78 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -27,6 +27,7 @@
#include <linux/hashtable.h>
#include <linux/kernel.h>
+#include <linux/static_call_types.h>
struct alternative {
struct list_head list;
@@ -165,6 +166,7 @@ static int __dead_end_function(struct objtool_file *file, struct symbol *func,
"fortify_panic",
"usercopy_abort",
"machine_real_restart",
+ "rewind_stack_do_exit",
};
if (func->bind == STB_WEAK)
@@ -525,6 +527,10 @@ static int add_jump_destinations(struct objtool_file *file)
} else {
/* sibling call */
insn->jump_dest = 0;
+ if (rela->sym->static_call_tramp) {
+ list_add_tail(&insn->static_call_node,
+ &file->static_call_list);
+ }
continue;
}
@@ -1202,6 +1208,24 @@ static int read_retpoline_hints(struct objtool_file *file)
return 0;
}
+static int read_static_call_tramps(struct objtool_file *file)
+{
+ struct section *sec;
+ struct symbol *func;
+
+ for_each_sec(file, sec) {
+ list_for_each_entry(func, &sec->symbol_list, list) {
+ if (func->bind == STB_GLOBAL &&
+ !strncmp(func->name, STATIC_CALL_TRAMP_PREFIX_STR,
+ strlen(STATIC_CALL_TRAMP_PREFIX_STR)))
+ func->static_call_tramp = true;
+ }
+
+ }
+
+ return 0;
+}
+
static void mark_rodata(struct objtool_file *file)
{
struct section *sec;
@@ -1267,6 +1291,10 @@ static int decode_sections(struct objtool_file *file)
if (ret)
return ret;
+ ret = read_static_call_tramps(file);
+ if (ret)
+ return ret;
+
return 0;
}
@@ -1920,6 +1948,11 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
if (is_fentry_call(insn))
break;
+ if (insn->call_dest->static_call_tramp) {
+ list_add_tail(&insn->static_call_node,
+ &file->static_call_list);
+ }
+
ret = dead_end_function(file, insn->call_dest);
if (ret == 1)
return 0;
@@ -2167,6 +2200,89 @@ static int validate_reachable_instructions(struct objtool_file *file)
return 0;
}
+static int create_static_call_sections(struct objtool_file *file)
+{
+ struct section *sec, *rela_sec;
+ struct rela *rela;
+ struct static_call_site *site;
+ struct instruction *insn;
+ char *key_name;
+ struct symbol *key_sym;
+ int idx;
+
+ sec = find_section_by_name(file->elf, ".static_call_sites");
+ if (sec) {
+ WARN("file already has .static_call_sites section, skipping");
+ return 0;
+ }
+
+ if (list_empty(&file->static_call_list))
+ return 0;
+
+ idx = 0;
+ list_for_each_entry(insn, &file->static_call_list, static_call_node)
+ idx++;
+
+ sec = elf_create_section(file->elf, ".static_call_sites",
+ sizeof(struct static_call_site), idx);
+ if (!sec)
+ return -1;
+
+ rela_sec = elf_create_rela_section(file->elf, sec);
+ if (!rela_sec)
+ return -1;
+
+ idx = 0;
+ list_for_each_entry(insn, &file->static_call_list, static_call_node) {
+
+ site = (struct static_call_site *)sec->data->d_buf + idx;
+ memset(site, 0, sizeof(struct static_call_site));
+
+ /* populate rela for 'addr' */
+ rela = malloc(sizeof(*rela));
+ if (!rela) {
+ perror("malloc");
+ return -1;
+ }
+ memset(rela, 0, sizeof(*rela));
+ rela->sym = insn->sec->sym;
+ rela->addend = insn->offset;
+ rela->type = R_X86_64_PC32;
+ rela->offset = idx * sizeof(struct static_call_site);
+ list_add_tail(&rela->list, &rela_sec->rela_list);
+ hash_add(rela_sec->rela_hash, &rela->hash, rela->offset);
+
+ /* find key symbol */
+ key_name = insn->call_dest->name + strlen(STATIC_CALL_TRAMP_PREFIX_STR);
+ key_sym = find_symbol_by_name(file->elf, key_name);
+ if (!key_sym) {
+ WARN("can't find static call key symbol: %s", key_name);
+ return -1;
+ }
+
+ /* populate rela for 'key' */
+ rela = malloc(sizeof(*rela));
+ if (!rela) {
+ perror("malloc");
+ return -1;
+ }
+ memset(rela, 0, sizeof(*rela));
+ rela->sym = key_sym;
+ rela->addend = 0;
+ rela->type = R_X86_64_PC32;
+ rela->offset = idx * sizeof(struct static_call_site) + 4;
+ list_add_tail(&rela->list, &rela_sec->rela_list);
+ hash_add(rela_sec->rela_hash, &rela->hash, rela->offset);
+
+ idx++;
+ }
+
+ if (elf_rebuild_rela_section(rela_sec))
+ return -1;
+
+ return 0;
+}
+
static void cleanup(struct objtool_file *file)
{
struct instruction *insn, *tmpinsn;
@@ -2191,12 +2307,13 @@ int check(const char *_objname, bool orc)
objname = _objname;
- file.elf = elf_open(objname, orc ? O_RDWR : O_RDONLY);
+ file.elf = elf_open(objname, O_RDWR);
if (!file.elf)
return 1;
INIT_LIST_HEAD(&file.insn_list);
hash_init(file.insn_hash);
+ INIT_LIST_HEAD(&file.static_call_list);
file.whitelist = find_section_by_name(file.elf, ".discard.func_stack_frame_non_standard");
file.c_file = find_section_by_name(file.elf, ".comment");
file.ignore_unreachables = no_unreachable;
@@ -2236,6 +2353,11 @@ int check(const char *_objname, bool orc)
warnings += ret;
}
+ ret = create_static_call_sections(&file);
+ if (ret < 0)
+ goto out;
+ warnings += ret;
+
if (orc) {
ret = create_orc(&file);
if (ret < 0)
@@ -2244,7 +2366,9 @@ int check(const char *_objname, bool orc)
ret = create_orc_sections(&file);
if (ret < 0)
goto out;
+ }
+ if (orc || !list_empty(&file.static_call_list)) {
ret = elf_write(file.elf);
if (ret < 0)
goto out;
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index e6e8a655b556..56b8b7fb1bd1 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -39,6 +39,7 @@ struct insn_state {
struct instruction {
struct list_head list;
struct hlist_node hash;
+ struct list_head static_call_node;
struct section *sec;
unsigned long offset;
unsigned int len;
@@ -60,6 +61,7 @@ struct objtool_file {
struct elf *elf;
struct list_head insn_list;
DECLARE_HASHTABLE(insn_hash, 16);
+ struct list_head static_call_list;
struct section *whitelist;
bool ignore_unreachables, c_file, hints, rodata;
};
diff --git a/tools/objtool/elf.h b/tools/objtool/elf.h
index bc97ed86b9cd..3cf44d7cc3ac 100644
--- a/tools/objtool/elf.h
+++ b/tools/objtool/elf.h
@@ -62,6 +62,7 @@ struct symbol {
unsigned long offset;
unsigned int len;
struct symbol *pfunc, *cfunc;
+ bool static_call_tramp;
};
struct rela {
diff --git a/tools/objtool/include/linux/static_call_types.h b/tools/objtool/include/linux/static_call_types.h
new file mode 100644
index 000000000000..7dd4b3d7ec2b
--- /dev/null
+++ b/tools/objtool/include/linux/static_call_types.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _STATIC_CALL_TYPES_H
+#define _STATIC_CALL_TYPES_H
+
+#include <linux/stringify.h>
+
+#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
+#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
+
+#define STATIC_CALL_TRAMP(key) STATIC_CALL_TRAMP_PREFIX##key
+#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
+
+/* The static call site table is created by objtool. */
+struct static_call_site {
+ s32 addr;
+ s32 key;
+};
+
+#endif /* _STATIC_CALL_TYPES_H */
diff --git a/tools/objtool/sync-check.sh b/tools/objtool/sync-check.sh
index 1470e74e9d66..e1a204bf3556 100755
--- a/tools/objtool/sync-check.sh
+++ b/tools/objtool/sync-check.sh
@@ -10,6 +10,7 @@ arch/x86/include/asm/insn.h
arch/x86/include/asm/inat.h
arch/x86/include/asm/inat_types.h
arch/x86/include/asm/orc_types.h
+include/linux/static_call_types.h
'
check()
--
2.17.2
Add the x86 unoptimized static call implementation. For each key, it
creates a permanent trampoline which is the destination for all static
calls for the given key. The trampoline has a direct jump which gets
patched by static_call_update() when the destination changes.
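An update never touches the call sites; it only rewrites the rel32 of the
trampoline's jmp. Sketch of the update path, as wired up in the
static_call_update() macro from the infrastructure patch:

  /* repatch the trampoline's direct jump (e9 + rel32) to 'func' */
  cpus_read_lock();
  arch_static_call_transform((unsigned long)STATIC_CALL_TRAMP(key), func);
  cpus_read_unlock();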
Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/static_call.h | 18 +++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/static_call.c | 51 ++++++++++++++++++++++++++++++
4 files changed, 71 insertions(+)
create mode 100644 arch/x86/include/asm/static_call.h
create mode 100644 arch/x86/kernel/static_call.c
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b5286ad2a982..9a83c3edd839 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -189,6 +189,7 @@ config X86
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_STACKPROTECTOR if CC_HAS_SANE_STACKPROTECTOR
select HAVE_STACK_VALIDATION if X86_64
+ select HAVE_STATIC_CALL_UNOPTIMIZED
select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
new file mode 100644
index 000000000000..de6b032cf809
--- /dev/null
+++ b/arch/x86/include/asm/static_call.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+/*
+ * This is a permanent trampoline which is the destination for all static calls
+ * for the given key. The direct jump gets patched by static_call_update().
+ */
+#define ARCH_STATIC_CALL_TRAMP(key, func) \
+ asm(".pushsection .text, \"ax\" \n" \
+ ".align 4 \n" \
+ ".globl " STATIC_CALL_TRAMP_STR(key) " \n" \
+ ".type " STATIC_CALL_TRAMP_STR(key) ", @function \n" \
+ STATIC_CALL_TRAMP_STR(key) ": \n" \
+ "jmp " #func " \n" \
+ ".popsection \n")
+
+#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8824d01c0c35..82acc8a28429 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -62,6 +62,7 @@ obj-y += tsc.o tsc_msr.o io_delay.o rtc.o
obj-y += pci-iommu_table.o
obj-y += resource.o
obj-y += irqflags.o
+obj-y += static_call.o
obj-y += process.o
obj-y += fpu/
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
new file mode 100644
index 000000000000..47ddc655ccda
--- /dev/null
+++ b/arch/x86/kernel/static_call.c
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/static_call.h>
+#include <linux/memory.h>
+#include <linux/bug.h>
+#include <asm/text-patching.h>
+#include <asm/nospec-branch.h>
+
+#define CALL_INSN_SIZE 5
+
+void static_call_bp_handler(void);
+void *bp_handler_dest;
+
+asm(".pushsection .text, \"ax\" \n"
+ ".globl static_call_bp_handler \n"
+ ".type static_call_bp_handler, @function \n"
+ "static_call_bp_handler: \n"
+ "ANNOTATE_RETPOLINE_SAFE \n"
+ "jmp *bp_handler_dest \n"
+ ".popsection \n");
+
+void arch_static_call_transform(unsigned long insn, void *dest)
+{
+ s32 dest_relative;
+ unsigned char insn_opcode;
+ unsigned char opcodes[CALL_INSN_SIZE];
+
+ mutex_lock(&text_mutex);
+
+ insn_opcode = *(unsigned char *)insn;
+ if (insn_opcode != 0xe8 && insn_opcode != 0xe9) {
+ WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
+ insn_opcode, (void *)insn);
+ goto done;
+ }
+
+ dest_relative = (long)(dest) - (long)(insn + CALL_INSN_SIZE);
+
+ opcodes[0] = insn_opcode;
+ memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
+
+ /* Set up the variable for the breakpoint handler: */
+ bp_handler_dest = dest;
+
+ /* Patch the call site: */
+ text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
+ static_call_bp_handler);
+
+done:
+ mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(arch_static_call_transform);
--
2.17.2
Add a static call infrastructure. Static calls use code patching to
hard-code function pointers into direct branch instructions. They give
the flexibility of function pointers, but with improved performance.
This is especially important for cases where retpolines would otherwise
be used, as retpolines can significantly impact performance.
This code is heavily inspired by the jump label code (aka "static
jumps"), as some of the concepts are very similar.
There are three implementations, depending on arch support:
1) optimized: patched call sites (CONFIG_HAVE_STATIC_CALL_OPTIMIZED)
2) unoptimized: patched trampolines (CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
3) basic function pointers
For more details, see the comments in include/linux/static_call.h.
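As a quick orientation before the diff: in the generic case (3), the API
degenerates to a plain, WRITE_ONCE-updated function pointer, roughly:

  DEFINE_STATIC_CALL(my_key, func_a);  /* typeof(func_a) *my_key = func_a; */
  static_call(my_key, 1, 2);           /* expands to my_key(1, 2) */
  static_call_update(my_key, func_b);  /* WRITE_ONCE(my_key, func_b) */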
Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/Kconfig | 6 +
include/asm-generic/vmlinux.lds.h | 11 ++
include/linux/module.h | 10 +
include/linux/static_call.h | 186 +++++++++++++++++++
include/linux/static_call_types.h | 19 ++
kernel/Makefile | 1 +
kernel/module.c | 5 +
kernel/static_call.c | 297 ++++++++++++++++++++++++++++++
8 files changed, 535 insertions(+)
create mode 100644 include/linux/static_call.h
create mode 100644 include/linux/static_call_types.h
create mode 100644 kernel/static_call.c
diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540ffa979..9657ad5a80d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -879,6 +879,12 @@ config HAVE_ARCH_PREL32_RELOCATIONS
architectures, and don't require runtime relocation on relocatable
kernels.
+config HAVE_STATIC_CALL_OPTIMIZED
+ bool
+
+config HAVE_STATIC_CALL_UNOPTIMIZED
+ bool
+
source "kernel/gcov/Kconfig"
source "scripts/gcc-plugins/Kconfig"
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 3d7a6a9c2370..45d8ced28bd0 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -320,6 +320,7 @@
__start_ro_after_init = .; \
*(.data..ro_after_init) \
JUMP_TABLE_DATA \
+ STATIC_CALL_SITES \
__end_ro_after_init = .;
#endif
@@ -725,6 +726,16 @@
#define BUG_TABLE
#endif
+#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
+#define STATIC_CALL_SITES \
+ . = ALIGN(8); \
+ __start_static_call_sites = .; \
+ KEEP(*(.static_call_sites)) \
+ __stop_static_call_sites = .;
+#else
+#define STATIC_CALL_SITES
+#endif
+
#ifdef CONFIG_UNWINDER_ORC
#define ORC_UNWIND_TABLE \
. = ALIGN(4); \
diff --git a/include/linux/module.h b/include/linux/module.h
index fce6b4335e36..91baf181247a 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -21,6 +21,7 @@
#include <linux/rbtree_latch.h>
#include <linux/error-injection.h>
#include <linux/tracepoint-defs.h>
+#include <linux/static_call_types.h>
#include <linux/percpu.h>
#include <asm/module.h>
@@ -450,6 +451,10 @@ struct module {
unsigned int num_ftrace_callsites;
unsigned long *ftrace_callsites;
#endif
+#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
+ int num_static_call_sites;
+ struct static_call_site *static_call_sites;
+#endif
#ifdef CONFIG_LIVEPATCH
bool klp; /* Is this a livepatch module? */
@@ -682,6 +687,11 @@ static inline bool is_module_text_address(unsigned long addr)
return false;
}
+static inline bool within_module_init(unsigned long addr, const struct module *mod)
+{
+ return false;
+}
+
/* Get/put a kernel symbol (calls should be symmetric) */
#define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
#define symbol_put(x) do { } while (0)
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
new file mode 100644
index 000000000000..b22987057cf1
--- /dev/null
+++ b/include/linux/static_call.h
@@ -0,0 +1,186 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_STATIC_CALL_H
+#define _LINUX_STATIC_CALL_H
+
+/*
+ * Static call support
+ *
+ * Static calls use code patching to hard-code function pointers into direct
+ * branch instructions. They give the flexibility of function pointers, but
+ * with improved performance. This is especially important for cases where
+ * retpolines would otherwise be used, as retpolines can significantly impact
+ * performance.
+ *
+ *
+ * API overview:
+ *
+ * DECLARE_STATIC_CALL(key, func);
+ * DEFINE_STATIC_CALL(key, func);
+ * static_call(key, args...);
+ * static_call_update(key, func);
+ *
+ *
+ * Usage example:
+ *
+ * # Start with the following functions (with identical prototypes):
+ * int func_a(int arg1, int arg2);
+ * int func_b(int arg1, int arg2);
+ *
+ * # Define a 'my_key' reference, associated with func_a() by default
+ * DEFINE_STATIC_CALL(my_key, func_a);
+ *
+ * # Call func_a()
+ * static_call(my_key, arg1, arg2);
+ *
+ * # Update 'my_key' to point to func_b()
+ * static_call_update(my_key, func_b);
+ *
+ * # Call func_b()
+ * static_call(my_key, arg1, arg2);
+ *
+ *
+ * Implementation details:
+ *
+ * There are three different implementations:
+ *
+ * 1) Optimized static calls (patched call sites)
+ *
+ * This requires objtool, which detects all the static_call() sites and
+ * annotates them in the '.static_call_sites' section. By default, the call
+ * sites will call into a temporary per-key trampoline which has an indirect
+ * branch to the current destination function associated with the key.
+ * During system boot (or module init), all call sites are patched to call
+ * their destination functions directly. Updates to a key will patch all
+ * call sites associated with that key.
+ *
+ * 2) Unoptimized static calls (patched trampolines)
+ *
+ * Each static_call() site calls into a permanent trampoline associated with
+ * the key. The trampoline has a direct branch to the default function.
+ * Updates to a key will modify the direct branch in the key's trampoline.
+ *
+ * 3) Generic implementation
+ *
+ * This is the default implementation if the architecture hasn't implemented
+ * CONFIG_HAVE_STATIC_CALL_[UN]OPTIMIZED. In this case, a basic
+ * function pointer is used.
+ */
+
+#include <linux/types.h>
+#include <linux/cpu.h>
+#include <linux/static_call_types.h>
+
+#if defined(CONFIG_HAVE_STATIC_CALL_OPTIMIZED) || \
+ defined(CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
+
+#include <asm/static_call.h>
+
+extern void arch_static_call_transform(unsigned long insn, void *dest);
+
+#endif /* CONFIG_HAVE_STATIC_CALL_[UN]OPTIMIZED */
+
+#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
+/* Optimized implementation */
+
+struct static_call_mod {
+ struct list_head list;
+ struct module *mod; /* NULL means vmlinux */
+ struct static_call_site *sites;
+};
+
+struct static_call_key {
+ /*
+ * This field should always be first because the trampolines expect it.
+ * This points to the current call destination.
+ */
+ void *func;
+
+ /*
+ * List of modules (including vmlinux) and the call sites which need to
+ * be patched for this key.
+ */
+ struct list_head site_mods;
+};
+
+extern void __static_call_update(struct static_call_key *key, void *func);
+extern void arch_static_call_poison_tramp(unsigned long insn);
+
+#define DECLARE_STATIC_CALL(key, func) \
+ extern struct static_call_key key; \
+ extern typeof(func) STATIC_CALL_TRAMP(key); \
+ /* Preserve the ELF symbol so objtool can access it: */ \
+ __ADDRESSABLE(key)
+
+#define DEFINE_STATIC_CALL(key, _func) \
+ DECLARE_STATIC_CALL(key, _func); \
+ struct static_call_key key = { \
+ .func = _func, \
+ .site_mods = LIST_HEAD_INIT(key.site_mods), \
+ }; \
+ ARCH_STATIC_CALL_TEMPORARY_TRAMP(key)
+
+#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
+
+#define static_call_update(key, func) \
+({ \
+ BUILD_BUG_ON(!__same_type(typeof(func), typeof(STATIC_CALL_TRAMP(key)))); \
+ __static_call_update(&key, func); \
+})
+
+#define EXPORT_STATIC_CALL(key) \
+ EXPORT_SYMBOL(key); \
+ EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
+
+#define EXPORT_STATIC_CALL_GPL(key) \
+ EXPORT_SYMBOL_GPL(key); \
+ EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
+
+
+#elif defined(CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
+/* Unoptimized implementation */
+
+#define DECLARE_STATIC_CALL(key, func) \
+ extern typeof(func) STATIC_CALL_TRAMP(key)
+
+#define DEFINE_STATIC_CALL(key, func) \
+ DECLARE_STATIC_CALL(key, func); \
+ ARCH_STATIC_CALL_TRAMP(key, func)
+
+#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
+
+#define static_call_update(key, func) \
+({ \
+ BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key))); \
+ cpus_read_lock(); \
+ arch_static_call_transform((unsigned long)STATIC_CALL_TRAMP(key),\
+ func); \
+ cpus_read_unlock(); \
+})
+
+#define EXPORT_STATIC_CALL(key) \
+ EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
+
+#define EXPORT_STATIC_CALL_GPL(key) \
+ EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
+
+
+#else /* Generic implementation */
+
+#define DECLARE_STATIC_CALL(key, func) \
+ extern typeof(func) *key
+
+#define DEFINE_STATIC_CALL(key, func) \
+ typeof(func) *key = func
+
+#define static_call(key, args...) \
+ key(args)
+
+#define static_call_update(key, func) \
+ WRITE_ONCE(key, func)
+
+#define EXPORT_STATIC_CALL(key) EXPORT_SYMBOL(key)
+#define EXPORT_STATIC_CALL_GPL(key) EXPORT_SYMBOL_GPL(key)
+
+#endif /* CONFIG_HAVE_STATIC_CALL_OPTIMIZED */
+
+#endif /* _LINUX_STATIC_CALL_H */
diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
new file mode 100644
index 000000000000..7dd4b3d7ec2b
--- /dev/null
+++ b/include/linux/static_call_types.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _STATIC_CALL_TYPES_H
+#define _STATIC_CALL_TYPES_H
+
+#include <linux/stringify.h>
+
+#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
+#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
+
+#define STATIC_CALL_TRAMP(key) STATIC_CALL_TRAMP_PREFIX##key
+#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
+
+/* The static call site table is created by objtool. */
+struct static_call_site {
+ s32 addr;
+ s32 key;
+};
+
+#endif /* _STATIC_CALL_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 7343b3a9bff0..888bb476ddaf 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_CPU_PM) += cpu_pm.o
obj-$(CONFIG_BPF) += bpf/
+obj-$(CONFIG_HAVE_STATIC_CALL_OPTIMIZED) += static_call.o
obj-$(CONFIG_PERF_EVENTS) += events/
diff --git a/kernel/module.c b/kernel/module.c
index 49a405891587..11124d991fcd 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3121,6 +3121,11 @@ static int find_module_sections(struct module *mod, struct load_info *info)
mod->ei_funcs = section_objs(info, "_error_injection_whitelist",
sizeof(*mod->ei_funcs),
&mod->num_ei_funcs);
+#endif
+#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
+ mod->static_call_sites = section_objs(info, ".static_call_sites",
+ sizeof(*mod->static_call_sites),
+ &mod->num_static_call_sites);
#endif
mod->extable = section_objs(info, "__ex_table",
sizeof(*mod->extable), &mod->num_exentries);
diff --git a/kernel/static_call.c b/kernel/static_call.c
new file mode 100644
index 000000000000..599ebc6fc4f1
--- /dev/null
+++ b/kernel/static_call.c
@@ -0,0 +1,297 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/init.h>
+#include <linux/static_call.h>
+#include <linux/bug.h>
+#include <linux/smp.h>
+#include <linux/sort.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/cpu.h>
+#include <linux/processor.h>
+#include <asm/sections.h>
+
+extern struct static_call_site __start_static_call_sites[],
+ __stop_static_call_sites[];
+
+static bool static_call_initialized;
+
+/* mutex to protect key modules/sites */
+static DEFINE_MUTEX(static_call_mutex);
+
+static void static_call_lock(void)
+{
+ mutex_lock(&static_call_mutex);
+}
+
+static void static_call_unlock(void)
+{
+ mutex_unlock(&static_call_mutex);
+}
+
+static inline unsigned long static_call_addr(struct static_call_site *site)
+{
+ return (long)site->addr + (long)&site->addr;
+}
+
+static inline struct static_call_key *static_call_key(const struct static_call_site *site)
+{
+ return (struct static_call_key *)((long)site->key + (long)&site->key);
+}
+
+static int static_call_site_cmp(const void *_a, const void *_b)
+{
+ const struct static_call_site *a = _a;
+ const struct static_call_site *b = _b;
+ const struct static_call_key *key_a = static_call_key(a);
+ const struct static_call_key *key_b = static_call_key(b);
+
+ if (key_a < key_b)
+ return -1;
+
+ if (key_a > key_b)
+ return 1;
+
+ return 0;
+}
+
+static void static_call_site_swap(void *_a, void *_b, int size)
+{
+ long delta = (unsigned long)_a - (unsigned long)_b;
+ struct static_call_site *a = _a;
+ struct static_call_site *b = _b;
+ struct static_call_site tmp = *a;
+
+ a->addr = b->addr - delta;
+ a->key = b->key - delta;
+
+ b->addr = tmp.addr + delta;
+ b->key = tmp.key + delta;
+}
+
+static inline void static_call_sort_entries(struct static_call_site *start,
+ struct static_call_site *stop)
+{
+ sort(start, stop - start, sizeof(struct static_call_site),
+ static_call_site_cmp, static_call_site_swap);
+}
+
+void __static_call_update(struct static_call_key *key, void *func)
+{
+ struct static_call_mod *mod;
+ struct static_call_site *site, *stop;
+
+ cpus_read_lock();
+ static_call_lock();
+
+ if (key->func == func)
+ goto done;
+
+ key->func = func;
+
+ /*
+ * If called before init, leave the call sites unpatched for now.
+ * In the meantime they'll continue to call the temporary trampoline.
+ */
+ if (!static_call_initialized)
+ goto done;
+
+ list_for_each_entry(mod, &key->site_mods, list) {
+ if (!mod->sites) {
+ /*
+ * This can happen if the static call key is defined in
+ * a module which doesn't use it.
+ */
+ continue;
+ }
+
+ stop = __stop_static_call_sites;
+
+#ifdef CONFIG_MODULES
+ if (mod->mod) {
+ stop = mod->mod->static_call_sites +
+ mod->mod->num_static_call_sites;
+ }
+#endif
+
+ for (site = mod->sites;
+ site < stop && static_call_key(site) == key; site++) {
+ unsigned long addr = static_call_addr(site);
+
+ if (!mod->mod && init_section_contains((void *)addr, 1))
+ continue;
+ if (mod->mod && within_module_init(addr, mod->mod))
+ continue;
+
+ arch_static_call_transform(addr, func);
+ }
+ }
+
+done:
+ static_call_unlock();
+ cpus_read_unlock();
+}
+EXPORT_SYMBOL_GPL(__static_call_update);
+
+#ifdef CONFIG_MODULES
+
+static int static_call_add_module(struct module *mod)
+{
+ struct static_call_site *start = mod->static_call_sites;
+ struct static_call_site *stop = mod->static_call_sites +
+ mod->num_static_call_sites;
+ struct static_call_site *site;
+ struct static_call_key *key, *prev_key = NULL;
+ struct static_call_mod *static_call_mod;
+
+ if (start == stop)
+ return 0;
+
+ module_disable_ro(mod);
+ static_call_sort_entries(start, stop);
+ module_enable_ro(mod, false);
+
+ for (site = start; site < stop; site++) {
+ unsigned long addr = static_call_addr(site);
+
+ if (within_module_init(addr, mod))
+ continue;
+
+ key = static_call_key(site);
+ if (key != prev_key) {
+ prev_key = key;
+
+ static_call_mod = kzalloc(sizeof(*static_call_mod), GFP_KERNEL);
+ if (!static_call_mod)
+ return -ENOMEM;
+
+ static_call_mod->mod = mod;
+ static_call_mod->sites = site;
+ list_add_tail(&static_call_mod->list, &key->site_mods);
+
+ if (is_module_address((unsigned long)key)) {
+ /*
+ * The trampoline should no longer be used.
+ * Poison it with a BUG() to catch any
+ * stray callers.
+ */
+ arch_static_call_poison_tramp(addr);
+ }
+ }
+
+ arch_static_call_transform(addr, key->func);
+ }
+
+ return 0;
+}
+
+static void static_call_del_module(struct module *mod)
+{
+ struct static_call_site *start = mod->static_call_sites;
+ struct static_call_site *stop = mod->static_call_sites +
+ mod->num_static_call_sites;
+ struct static_call_site *site;
+ struct static_call_key *key, *prev_key = NULL;
+ struct static_call_mod *static_call_mod;
+
+ for (site = start; site < stop; site++) {
+ key = static_call_key(site);
+ if (key == prev_key)
+ continue;
+ prev_key = key;
+
+ list_for_each_entry(static_call_mod, &key->site_mods, list) {
+ if (static_call_mod->mod == mod) {
+ list_del(&static_call_mod->list);
+ kfree(static_call_mod);
+ break;
+ }
+ }
+ }
+}
+
+static int static_call_module_notify(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ struct module *mod = data;
+ int ret = 0;
+
+ cpus_read_lock();
+ static_call_lock();
+
+ switch (val) {
+ case MODULE_STATE_COMING:
+ ret = static_call_add_module(mod);
+ if (ret) {
+ WARN(1, "Failed to allocate memory for static calls");
+ static_call_del_module(mod);
+ }
+ break;
+ case MODULE_STATE_GOING:
+ static_call_del_module(mod);
+ break;
+ }
+
+ static_call_unlock();
+ cpus_read_unlock();
+
+ return notifier_from_errno(ret);
+}
+
+static struct notifier_block static_call_module_nb = {
+ .notifier_call = static_call_module_notify,
+};
+
+#endif /* CONFIG_MODULES */
+
+static void __init static_call_init(void)
+{
+ struct static_call_site *start = __start_static_call_sites;
+ struct static_call_site *stop = __stop_static_call_sites;
+ struct static_call_site *site;
+
+ if (start == stop) {
+ pr_warn("WARNING: empty static call table\n");
+ return;
+ }
+
+ cpus_read_lock();
+ static_call_lock();
+
+ static_call_sort_entries(start, stop);
+
+ for (site = start; site < stop; site++) {
+ struct static_call_key *key = static_call_key(site);
+ unsigned long addr = static_call_addr(site);
+
+ if (list_empty(&key->site_mods)) {
+ struct static_call_mod *mod;
+
+ mod = kzalloc(sizeof(*mod), GFP_KERNEL);
+ if (!mod) {
+ WARN(1, "Failed to allocate memory for static calls");
+ return;
+ }
+
+ mod->sites = site;
+ list_add_tail(&mod->list, &key->site_mods);
+
+ /*
+ * The trampoline should no longer be used. Poison
+ * it with a BUG() to catch any stray callers.
+ */
+ arch_static_call_poison_tramp(addr);
+ }
+
+ arch_static_call_transform(addr, key->func);
+ }
+
+ static_call_initialized = true;
+
+ static_call_unlock();
+ cpus_read_unlock();
+
+#ifdef CONFIG_MODULES
+ register_module_notifier(&static_call_module_nb);
+#endif
+}
+early_initcall(static_call_init);
--
2.17.2
On Thu, Nov 08, 2018 at 03:15:50PM -0600, Josh Poimboeuf wrote:
> - Does this feature have much value without retpolines? If not, should
> we make it depend on retpolines somehow?
I forgot Andy mentioned that we might be able to use this to clean up
paravirt patching, in which case it would have a lot of value,
retpolines or not...
--
Josh
* Josh Poimboeuf <[email protected]> wrote:
> These patches are related to two similar patch sets from Ard and Steve:
>
> - https://lkml.kernel.org/r/[email protected]
> - https://lkml.kernel.org/r/[email protected]
>
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
>
> There are three separate implementations, depending on what the arch
> supports:
>
> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
> objtool and a small amount of arch code
>
> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
> a small amount of arch code
>
> 3) If no arch support, fall back to regular function pointers
>
>
> TODO:
>
> - I'm not sure about the objtool approach. Objtool is (currently)
> x86-64 only, which means we have to use the "unoptimized" version
> everywhere else. I may experiment with a GCC plugin instead.
I'd prefer the objtool approach. It's a pretty reliable first-principles
approach, while a GCC plugin would have to be replicated for Clang and any
other compilers, etc.
> - Does this feature have much value without retpolines? If not, should
> we make it depend on retpolines somehow?
Paravirt patching, as you mention in your later reply?
> - Find some actual users of the interfaces (tracepoints? crypto?)
I'd be very happy with a demonstrated paravirt optimization already -
i.e. seeing the before/after effect on the vmlinux with an x86 distro
config.
All major Linux distributions enable CONFIG_PARAVIRT=y and
CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
as possible in the 99.999% of cases where it's not used is a primary
concern.
All other usecases are bonus, but it would certainly be interesting to
investigate the impact of using these APIs for tracing: that too is a
feature enabled everywhere but utilized only by a small fraction of Linux
users - so literally every single cycle or instruction saved or hot-path
shortened is a major win.
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
> > - Does this feature have much value without retpolines? If not, should
> > we make it depend on retpolines somehow?
>
> Paravirt patching, as you mention in your later reply?
BTW., to look for candidates of this API, I'd suggest looking at the
function call frequency of my (almost-)distro kernel vmlinux:
$ objdump -d vmlinux | grep -w callq | cut -f3- | sort | uniq -c | sort -n | tail -100
which gives:
502 callq ffffffff8157d050 <nla_put>
522 callq ffffffff81aaf420 <down_write>
536 callq ffffffff81547e60 <_copy_to_user>
615 callq ffffffff81a97700 <snprintf>
624 callq *0xffffffff82648428
624 callq ffffffff810cc810 <__might_sleep>
625 callq ffffffff81a93b90 <strcmp>
649 callq ffffffff81547dd0 <_copy_from_user>
651 callq ffffffff811ba930 <trace_seq_printf>
654 callq ffffffff8170b6f0 <_dev_warn>
691 callq ffffffff81a93790 <strlen>
693 callq ffffffff81a88dc0 <cpumask_next>
709 callq *0xffffffff82648438
723 callq ffffffff811bdbd0 <trace_hardirqs_on>
735 callq ffffffff810feac0 <up_write>
750 callq ffffffff8163e9f0 <acpi_ut_status_exit>
768 callq *0xffffffff82648430
814 callq ffffffff81ab2710 <_raw_spin_lock_irq>
841 callq ffffffff81a9e680 <__memcpy>
863 callq ffffffff812ae3d0 <__kmalloc>
899 callq ffffffff8126ac80 <__might_fault>
912 callq ffffffff81ab2970 <_raw_spin_unlock_irq>
939 callq ffffffff81aaaf10 <_cond_resched>
966 callq ffffffff811bda00 <trace_hardirqs_off>
1069 callq ffffffff81126f50 <rcu_read_lock_sched_held>
1078 callq ffffffff81097760 <__warn_printk>
1081 callq ffffffff8157b140 <__dynamic_dev_dbg>
1351 callq ffffffff8170b630 <_dev_err>
1365 callq ffffffff811050c0 <lock_is_held_type>
1373 callq ffffffff81a977f0 <sprintf>
1390 callq ffffffff8157b090 <__dynamic_pr_debug>
1453 callq ffffffff8155c650 <__list_add_valid>
1501 callq ffffffff812ad6f0 <kmem_cache_alloc_trace>
1509 callq ffffffff8155c6c0 <__list_del_entry_valid>
1513 callq ffffffff81310ce0 <seq_printf>
1571 callq ffffffff81ab2780 <_raw_spin_lock_irqsave>
1624 callq ffffffff81ab29b0 <_raw_spin_unlock_irqrestore>
1661 callq ffffffff81126fd0 <rcu_read_lock_held>
1986 callq ffffffff81104940 <lock_acquire>
2050 callq ffffffff811c5110 <trace_define_field>
2133 callq ffffffff81102c70 <lock_release>
2507 callq ffffffff81ab2560 <_raw_spin_lock>
2676 callq ffffffff81aadc40 <mutex_lock_nested>
3056 callq ffffffff81ab2900 <_raw_spin_unlock>
3294 callq ffffffff81aac610 <mutex_unlock>
3628 callq ffffffff81129100 <rcu_is_watching>
4462 callq ffffffff812ac2c0 <kfree>
6454 callq ffffffff8111a51e <printk>
6676 callq ffffffff81101420 <lockdep_rcu_suspicious>
7328 callq ffffffff81e014b0 <__x86_indirect_thunk_rax>
7598 callq ffffffff81126f30 <debug_lockdep_rcu_enabled>
9065 callq ffffffff810979f0 <__stack_chk_fail>
The most prominent callers which are already function call pointers today
are:
$ objdump -d vmlinux | grep -w callq | grep \* | cut -f3- | sort | uniq -c | sort -n | tail -10
109 callq *0xffffffff82648530
134 callq *0xffffffff82648568
154 callq *0xffffffff826483d0
260 callq *0xffffffff826483d8
297 callq *0xffffffff826483e0
345 callq *0xffffffff82648440
345 callq *0xffffffff82648558
624 callq *0xffffffff82648428
709 callq *0xffffffff82648438
768 callq *0xffffffff82648430
That's all pv_ops->*() method calls:
ffffffff82648300 D pv_ops
ffffffff826485d0 D pv_info
Optimizing those thousands of function pointer calls would already be a
nice improvement.
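For example (hypothetical sketch, reusing existing pv function names; a
real conversion would also have to deal with the special pv_ops calling
conventions):

  /* hypothetical: one pv_ops pointer converted to a static call */
  DEFINE_STATIC_CALL(pv_save_fl, native_save_fl);

  /* call site: was an indirect pv_ops.irq.save_fl() call */
  flags = static_call(pv_save_fl);

  /* e.g. Xen would switch the destination once at boot */
  static_call_update(pv_save_fl, xen_save_fl);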
But retpolines:
7328 callq ffffffff81e014b0 <__x86_indirect_thunk_rax>
ffffffff81e014b0 <__x86_indirect_thunk_rax>:
ffffffff81e014b0: ff e0 jmpq *%rax
... are even more prominent, and turned on in every distro as well,
obviously.
Thanks,
Ingo
Hi Josh,
Thanks a lot for looking into this.
On 8 November 2018 at 22:15, Josh Poimboeuf <[email protected]> wrote:
> Add a static call infrastructure. Static calls use code patching to
> hard-code function pointers into direct branch instructions. They give
> the flexibility of function pointers, but with improved performance.
> This is especially important for cases where retpolines would otherwise
> be used, as retpolines can significantly impact performance.
>
> This code is heavily inspired by the jump label code (aka "static
> jumps"), as some of the concepts are very similar.
>
> There are three implementations, depending on arch support:
>
> 1) optimized: patched call sites (CONFIG_HAVE_STATIC_CALL_OPTIMIZED)
> 2) unoptimized: patched trampolines (CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
Could we move to an idiom like inline/out of line? The unoptimized
variant may be perfectly adequate for many arches, and 'unoptimized'
sounds like there is something wrong with it.
> 3) basic function pointers
>
> For more details, see the comments in include/linux/static_call.h.
>
> Signed-off-by: Josh Poimboeuf <[email protected]>
> ---
> arch/Kconfig | 6 +
> include/asm-generic/vmlinux.lds.h | 11 ++
> include/linux/module.h | 10 +
> include/linux/static_call.h | 186 +++++++++++++++++++
> include/linux/static_call_types.h | 19 ++
> kernel/Makefile | 1 +
> kernel/module.c | 5 +
> kernel/static_call.c | 297 ++++++++++++++++++++++++++++++
> 8 files changed, 535 insertions(+)
> create mode 100644 include/linux/static_call.h
> create mode 100644 include/linux/static_call_types.h
> create mode 100644 kernel/static_call.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index e1e540ffa979..9657ad5a80d1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -879,6 +879,12 @@ config HAVE_ARCH_PREL32_RELOCATIONS
> architectures, and don't require runtime relocation on relocatable
> kernels.
>
> +config HAVE_STATIC_CALL_OPTIMIZED
> + bool
> +
> +config HAVE_STATIC_CALL_UNOPTIMIZED
> + bool
> +
> source "kernel/gcov/Kconfig"
>
> source "scripts/gcc-plugins/Kconfig"
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 3d7a6a9c2370..45d8ced28bd0 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -320,6 +320,7 @@
> __start_ro_after_init = .; \
> *(.data..ro_after_init) \
> JUMP_TABLE_DATA \
> + STATIC_CALL_SITES \
> __end_ro_after_init = .;
> #endif
>
> @@ -725,6 +726,16 @@
> #define BUG_TABLE
> #endif
>
> +#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
> +#define STATIC_CALL_SITES \
> + . = ALIGN(8); \
> + __start_static_call_sites = .; \
> + KEEP(*(.static_call_sites)) \
> + __stop_static_call_sites = .;
> +#else
> +#define STATIC_CALL_SITES
> +#endif
> +
> #ifdef CONFIG_UNWINDER_ORC
> #define ORC_UNWIND_TABLE \
> . = ALIGN(4); \
> diff --git a/include/linux/module.h b/include/linux/module.h
> index fce6b4335e36..91baf181247a 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -21,6 +21,7 @@
> #include <linux/rbtree_latch.h>
> #include <linux/error-injection.h>
> #include <linux/tracepoint-defs.h>
> +#include <linux/static_call_types.h>
>
> #include <linux/percpu.h>
> #include <asm/module.h>
> @@ -450,6 +451,10 @@ struct module {
> unsigned int num_ftrace_callsites;
> unsigned long *ftrace_callsites;
> #endif
> +#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
> + int num_static_call_sites;
> + struct static_call_site *static_call_sites;
> +#endif
>
> #ifdef CONFIG_LIVEPATCH
> bool klp; /* Is this a livepatch module? */
> @@ -682,6 +687,11 @@ static inline bool is_module_text_address(unsigned long addr)
> return false;
> }
>
> +static inline bool within_module_init(unsigned long addr, const struct module *mod)
> +{
> + return false;
> +}
> +
> /* Get/put a kernel symbol (calls should be symmetric) */
> #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
> #define symbol_put(x) do { } while (0)
> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> new file mode 100644
> index 000000000000..b22987057cf1
> --- /dev/null
> +++ b/include/linux/static_call.h
> @@ -0,0 +1,186 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_STATIC_CALL_H
> +#define _LINUX_STATIC_CALL_H
> +
> +/*
> + * Static call support
> + *
> + * Static calls use code patching to hard-code function pointers into direct
> + * branch instructions. They give the flexibility of function pointers, but
> + * with improved performance. This is especially important for cases where
> + * retpolines would otherwise be used, as retpolines can significantly impact
> + * performance.
> + *
> + *
> + * API overview:
> + *
> + * DECLARE_STATIC_CALL(key, func);
> + * DEFINE_STATIC_CALL(key, func);
> + * static_call(key, args...);
> + * static_call_update(key, func);
> + *
> + *
> + * Usage example:
> + *
> + * # Start with the following functions (with identical prototypes):
> + * int func_a(int arg1, int arg2);
> + * int func_b(int arg1, int arg2);
> + *
> + * # Define a 'my_key' reference, associated with func_a() by default
> + * DEFINE_STATIC_CALL(my_key, func_a);
> + *
> + * # Call func_a()
> + * static_call(my_key, arg1, arg2);
> + *
> + * # Update 'my_key' to point to func_b()
> + * static_call_update(my_key, func_b);
> + *
> + * # Call func_b()
> + * static_call(my_key, arg1, arg2);
> + *
Any way we can revert to the default implementation? That would be
useful for, e.g., unloading modules that provide an alternative
implementation.
> + *
> + * Implementation details:
> + *
> + * There are three different implementations:
> + *
> + * 1) Optimized static calls (patched call sites)
> + *
> + * This requires objtool, which detects all the static_call() sites and
> + * annotates them in the '.static_call_sites' section. By default, the call
> + * sites will call into a temporary per-key trampoline which has an indirect
> + * branch to the current destination function associated with the key.
> + * During system boot (or module init), all call sites are patched to call
> + * their destination functions directly. Updates to a key will patch all
> + * call sites associated with that key.
> + *
> + * 2) Unoptimized static calls (patched trampolines)
> + *
> + * Each static_call() site calls into a permanent trampoline associated with
> + * the key. The trampoline has a direct branch to the default function.
> + * Updates to a key will modify the direct branch in the key's trampoline.
> + *
> + * 3) Generic implementation
> + *
> + * This is the default implementation if the architecture hasn't implemented
> + * CONFIG_HAVE_STATIC_CALL_[UN]OPTIMIZED. In this case, a basic
> + * function pointer is used.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/cpu.h>
> +#include <linux/static_call_types.h>
> +
> +#if defined(CONFIG_HAVE_STATIC_CALL_OPTIMIZED) || \
> + defined(CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
> +
> +#include <asm/static_call.h>
> +
> +extern void arch_static_call_transform(unsigned long insn, void *dest);
> +
> +#endif /* CONFIG_HAVE_STATIC_CALL_[UN]OPTIMIZED */
> +
> +#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
> +/* Optimized implementation */
> +
> +struct static_call_mod {
> + struct list_head list;
> + struct module *mod; /* NULL means vmlinux */
> + struct static_call_site *sites;
> +};
> +
> +struct static_call_key {
> + /*
> + * This field should always be first because the trampolines expect it.
> + * This points to the current call destination.
> + */
> + void *func;
> +
> + /*
> + * List of modules (including vmlinux) and the call sites which need to
> + * be patched for this key.
> + */
> + struct list_head site_mods;
> +};
> +
> +extern void __static_call_update(struct static_call_key *key, void *func);
> +extern void arch_static_call_poison_tramp(unsigned long insn);
> +
> +#define DECLARE_STATIC_CALL(key, func) \
> + extern struct static_call_key key; \
> + extern typeof(func) STATIC_CALL_TRAMP(key); \
> + /* Preserve the ELF symbol so objtool can access it: */ \
> + __ADDRESSABLE(key)
> +
> +#define DEFINE_STATIC_CALL(key, _func) \
> + DECLARE_STATIC_CALL(key, _func); \
> + struct static_call_key key = { \
> + .func = _func, \
> + .site_mods = LIST_HEAD_INIT(key.site_mods), \
> + }; \
> + ARCH_STATIC_CALL_TEMPORARY_TRAMP(key)
> +
> +#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
> +
> +#define static_call_update(key, func) \
> +({ \
> + BUILD_BUG_ON(!__same_type(typeof(func), typeof(STATIC_CALL_TRAMP(key)))); \
> + __static_call_update(&key, func); \
> +})
> +
> +#define EXPORT_STATIC_CALL(key) \
> + EXPORT_SYMBOL(key); \
> + EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
> +
> +#define EXPORT_STATIC_CALL_GPL(key) \
> + EXPORT_SYMBOL_GPL(key); \
> + EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
> +
> +
> +#elif defined(CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
> +/* Unoptimized implementation */
> +
> +#define DECLARE_STATIC_CALL(key, func) \
> + extern typeof(func) STATIC_CALL_TRAMP(key)
> +
> +#define DEFINE_STATIC_CALL(key, func) \
> + DECLARE_STATIC_CALL(key, func); \
> + ARCH_STATIC_CALL_TRAMP(key, func)
> +
> +#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
> +
> +#define static_call_update(key, func) \
> +({ \
> + BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key))); \
> + cpus_read_lock(); \
> + arch_static_call_transform((unsigned long)STATIC_CALL_TRAMP(key),\
> + func); \
> + cpus_read_unlock(); \
> +})
> +
> +#define EXPORT_STATIC_CALL(key) \
> + EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
> +
> +#define EXPORT_STATIC_CALL_GPL(key) \
> + EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
> +
> +
> +#else /* Generic implementation */
> +
> +#define DECLARE_STATIC_CALL(key, func) \
> + extern typeof(func) *key
> +
> +#define DEFINE_STATIC_CALL(key, func) \
> + typeof(func) *key = func
> +
> +#define static_call(key, args...) \
> + key(args)
> +
> +#define static_call_update(key, func) \
> + WRITE_ONCE(key, func)
> +
> +#define EXPORT_STATIC_CALL(key) EXPORT_SYMBOL(key)
> +#define EXPORT_STATIC_CALL_GPL(key) EXPORT_SYMBOL_GPL(key)
> +
> +#endif /* CONFIG_HAVE_STATIC_CALL_OPTIMIZED */
> +
> +#endif /* _LINUX_STATIC_CALL_H */
> diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
> new file mode 100644
> index 000000000000..7dd4b3d7ec2b
> --- /dev/null
> +++ b/include/linux/static_call_types.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _STATIC_CALL_TYPES_H
> +#define _STATIC_CALL_TYPES_H
> +
> +#include <linux/stringify.h>
> +
> +#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
> +#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
> +
> +#define STATIC_CALL_TRAMP(key) STATIC_CALL_TRAMP_PREFIX##key
> +#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
> +
I needed to apply
diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
index 7dd4b3d7ec2b..6859b208de6e 100644
--- a/include/linux/static_call_types.h
+++ b/include/linux/static_call_types.h
@@ -7,7 +7,7 @@
#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
-#define STATIC_CALL_TRAMP(key) STATIC_CALL_TRAMP_PREFIX##key
+#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
/* The static call site table is created by objtool. */
or I end up with
0000000000000000 <STATIC_CALL_TRAMP_PREFIXfoo>:
0: 14000000 b a8 <foo_a>
4: d503201f nop
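(The underlying issue: '##' pastes its operands as literal tokens, so
STATIC_CALL_TRAMP_PREFIX##key glues the macro *name* instead of its
expansion. The kernel's __PASTE() helper adds the extra expansion step:

  #define ___PASTE(a,b) a##b /* pastes literal tokens */
  #define __PASTE(a,b) ___PASTE(a,b) /* expands a and b first */

so __PASTE(STATIC_CALL_TRAMP_PREFIX, key) produces the intended
____static_call_tramp_<key> symbol.)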
> +/* The static call site table is created by objtool. */
> +struct static_call_site {
> + s32 addr;
> + s32 key;
> +};
> +
> +#endif /* _STATIC_CALL_TYPES_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 7343b3a9bff0..888bb476ddaf 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -103,6 +103,7 @@ obj-$(CONFIG_TRACEPOINTS) += trace/
> obj-$(CONFIG_IRQ_WORK) += irq_work.o
> obj-$(CONFIG_CPU_PM) += cpu_pm.o
> obj-$(CONFIG_BPF) += bpf/
> +obj-$(CONFIG_HAVE_STATIC_CALL_OPTIMIZED) += static_call.o
>
> obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/module.c b/kernel/module.c
> index 49a405891587..11124d991fcd 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3121,6 +3121,11 @@ static int find_module_sections(struct module *mod, struct load_info *info)
> mod->ei_funcs = section_objs(info, "_error_injection_whitelist",
> sizeof(*mod->ei_funcs),
> &mod->num_ei_funcs);
> +#endif
> +#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
> + mod->static_call_sites = section_objs(info, ".static_call_sites",
> + sizeof(*mod->static_call_sites),
> + &mod->num_static_call_sites);
> #endif
> mod->extable = section_objs(info, "__ex_table",
> sizeof(*mod->extable), &mod->num_exentries);
> diff --git a/kernel/static_call.c b/kernel/static_call.c
> new file mode 100644
> index 000000000000..599ebc6fc4f1
> --- /dev/null
> +++ b/kernel/static_call.c
> @@ -0,0 +1,297 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/init.h>
> +#include <linux/static_call.h>
> +#include <linux/bug.h>
> +#include <linux/smp.h>
> +#include <linux/sort.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/cpu.h>
> +#include <linux/processor.h>
> +#include <asm/sections.h>
> +
> +extern struct static_call_site __start_static_call_sites[],
> + __stop_static_call_sites[];
> +
> +static bool static_call_initialized;
> +
> +/* mutex to protect key modules/sites */
> +static DEFINE_MUTEX(static_call_mutex);
> +
> +static void static_call_lock(void)
> +{
> + mutex_lock(&static_call_mutex);
> +}
> +
> +static void static_call_unlock(void)
> +{
> + mutex_unlock(&static_call_mutex);
> +}
> +
> +static inline unsigned long static_call_addr(struct static_call_site *site)
> +{
> + return (long)site->addr + (long)&site->addr;
> +}
> +
> +static inline struct static_call_key *static_call_key(const struct static_call_site *site)
> +{
> + return (struct static_call_key *)((long)site->key + (long)&site->key);
> +}
> +
> +static int static_call_site_cmp(const void *_a, const void *_b)
> +{
> + const struct static_call_site *a = _a;
> + const struct static_call_site *b = _b;
> + const struct static_call_key *key_a = static_call_key(a);
> + const struct static_call_key *key_b = static_call_key(b);
> +
> + if (key_a < key_b)
> + return -1;
> +
> + if (key_a > key_b)
> + return 1;
> +
> + return 0;
> +}
> +
> +static void static_call_site_swap(void *_a, void *_b, int size)
> +{
> + long delta = (unsigned long)_a - (unsigned long)_b;
> + struct static_call_site *a = _a;
> + struct static_call_site *b = _b;
> + struct static_call_site tmp = *a;
> +
> + a->addr = b->addr - delta;
> + a->key = b->key - delta;
> +
> + b->addr = tmp.addr + delta;
> + b->key = tmp.key + delta;
> +}
> +
> +static inline void static_call_sort_entries(struct static_call_site *start,
> + struct static_call_site *stop)
> +{
> + sort(start, stop - start, sizeof(struct static_call_site),
> + static_call_site_cmp, static_call_site_swap);
> +}
> +
> +void __static_call_update(struct static_call_key *key, void *func)
> +{
> + struct static_call_mod *mod;
> + struct static_call_site *site, *stop;
> +
> + cpus_read_lock();
> + static_call_lock();
> +
> + if (key->func == func)
> + goto done;
> +
> + key->func = func;
> +
> + /*
> + * If called before init, leave the call sites unpatched for now.
> + * In the meantime they'll continue to call the temporary trampoline.
> + */
> + if (!static_call_initialized)
> + goto done;
> +
> + list_for_each_entry(mod, &key->site_mods, list) {
> + if (!mod->sites) {
> + /*
> + * This can happen if the static call key is defined in
> + * a module which doesn't use it.
> + */
> + continue;
> + }
> +
> + stop = __stop_static_call_sites;
> +
> +#ifdef CONFIG_MODULES
> + if (mod->mod) {
> + stop = mod->mod->static_call_sites +
> + mod->mod->num_static_call_sites;
> + }
> +#endif
> +
> + for (site = mod->sites;
> + site < stop && static_call_key(site) == key; site++) {
> + unsigned long addr = static_call_addr(site);
> +
> + if (!mod->mod && init_section_contains((void *)addr, 1))
> + continue;
> + if (mod->mod && within_module_init(addr, mod->mod))
> + continue;
> +
> + arch_static_call_transform(addr, func);
> + }
> + }
> +
> +done:
> + static_call_unlock();
> + cpus_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(__static_call_update);
> +
> +#ifdef CONFIG_MODULES
> +
> +static int static_call_add_module(struct module *mod)
> +{
> + struct static_call_site *start = mod->static_call_sites;
> + struct static_call_site *stop = mod->static_call_sites +
> + mod->num_static_call_sites;
> + struct static_call_site *site;
> + struct static_call_key *key, *prev_key = NULL;
> + struct static_call_mod *static_call_mod;
> +
> + if (start == stop)
> + return 0;
> +
> + module_disable_ro(mod);
> + static_call_sort_entries(start, stop);
> + module_enable_ro(mod, false);
> +
> + for (site = start; site < stop; site++) {
> + unsigned long addr = static_call_addr(site);
> +
> + if (within_module_init(addr, mod))
> + continue;
> +
> + key = static_call_key(site);
> + if (key != prev_key) {
> + prev_key = key;
> +
> + static_call_mod = kzalloc(sizeof(*static_call_mod), GFP_KERNEL);
> + if (!static_call_mod)
> + return -ENOMEM;
> +
> + static_call_mod->mod = mod;
> + static_call_mod->sites = site;
> + list_add_tail(&static_call_mod->list, &key->site_mods);
> +
> + if (is_module_address((unsigned long)key)) {
> + /*
> + * The trampoline should no longer be used.
> +				 * Poison it with a BUG() to catch any stray
> + * callers.
> + */
> + arch_static_call_poison_tramp(addr);
> + }
> + }
> +
> + arch_static_call_transform(addr, key->func);
> + }
> +
> + return 0;
> +}
> +
> +static void static_call_del_module(struct module *mod)
> +{
> + struct static_call_site *start = mod->static_call_sites;
> + struct static_call_site *stop = mod->static_call_sites +
> + mod->num_static_call_sites;
> + struct static_call_site *site;
> + struct static_call_key *key, *prev_key = NULL;
> + struct static_call_mod *static_call_mod;
> +
> + for (site = start; site < stop; site++) {
> + key = static_call_key(site);
> + if (key == prev_key)
> + continue;
> + prev_key = key;
> +
> + list_for_each_entry(static_call_mod, &key->site_mods, list) {
> + if (static_call_mod->mod == mod) {
> + list_del(&static_call_mod->list);
> + kfree(static_call_mod);
> + break;
> + }
> + }
> + }
> +}
> +
> +static int static_call_module_notify(struct notifier_block *nb,
> + unsigned long val, void *data)
> +{
> + struct module *mod = data;
> + int ret = 0;
> +
> + cpus_read_lock();
> + static_call_lock();
> +
> + switch (val) {
> + case MODULE_STATE_COMING:
> + ret = static_call_add_module(mod);
> + if (ret) {
> + WARN(1, "Failed to allocate memory for static calls");
> + static_call_del_module(mod);
> + }
> + break;
> + case MODULE_STATE_GOING:
> + static_call_del_module(mod);
> + break;
> + }
> +
> + static_call_unlock();
> + cpus_read_unlock();
> +
> + return notifier_from_errno(ret);
> +}
> +
> +static struct notifier_block static_call_module_nb = {
> + .notifier_call = static_call_module_notify,
> +};
> +
> +#endif /* CONFIG_MODULES */
> +
> +static void __init static_call_init(void)
> +{
> + struct static_call_site *start = __start_static_call_sites;
> + struct static_call_site *stop = __stop_static_call_sites;
> + struct static_call_site *site;
> +
> + if (start == stop) {
> + pr_warn("WARNING: empty static call table\n");
> + return;
> + }
> +
> + cpus_read_lock();
> + static_call_lock();
> +
> + static_call_sort_entries(start, stop);
> +
> + for (site = start; site < stop; site++) {
> + struct static_call_key *key = static_call_key(site);
> + unsigned long addr = static_call_addr(site);
> +
> + if (list_empty(&key->site_mods)) {
> + struct static_call_mod *mod;
> +
> + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
> + if (!mod) {
> + WARN(1, "Failed to allocate memory for static calls");
> + return;
> + }
> +
> + mod->sites = site;
> + list_add_tail(&mod->list, &key->site_mods);
> +
> + /*
> +			 * The trampoline should no longer be used.  Poison
> +			 * it with a BUG() to catch any stray callers.
> + */
> + arch_static_call_poison_tramp(addr);
> + }
> +
> + arch_static_call_transform(addr, key->func);
> + }
> +
> + static_call_initialized = true;
> +
> + static_call_unlock();
> + cpus_read_unlock();
> +
> +#ifdef CONFIG_MODULES
> + register_module_notifier(&static_call_module_nb);
> +#endif
> +}
> +early_initcall(static_call_init);
> --
> 2.17.2
>
On 8 November 2018 at 22:15, Josh Poimboeuf <[email protected]> wrote:
> Add a static call infrastructure. Static calls use code patching to
> hard-code function pointers into direct branch instructions. They give
> the flexibility of function pointers, but with improved performance.
> This is especially important for cases where retpolines would otherwise
> be used, as retpolines can significantly impact performance.
>
> This code is heavily inspired by the jump label code (aka "static
> jumps"), as some of the concepts are very similar.
>
> There are three implementations, depending on arch support:
>
> 1) optimized: patched call sites (CONFIG_HAVE_STATIC_CALL_OPTIMIZED)
> 2) unoptimized: patched trampolines (CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
> 3) basic function pointers
>
> For more details, see the comments in include/linux/static_call.h.
>
> Signed-off-by: Josh Poimboeuf <[email protected]>
> ---
> arch/Kconfig | 6 +
> include/asm-generic/vmlinux.lds.h | 11 ++
> include/linux/module.h | 10 +
> include/linux/static_call.h | 186 +++++++++++++++++++
> include/linux/static_call_types.h | 19 ++
> kernel/Makefile | 1 +
> kernel/module.c | 5 +
> kernel/static_call.c | 297 ++++++++++++++++++++++++++++++
> 8 files changed, 535 insertions(+)
> create mode 100644 include/linux/static_call.h
> create mode 100644 include/linux/static_call_types.h
> create mode 100644 kernel/static_call.c
>
> diff --git a/kernel/static_call.c b/kernel/static_call.c
> new file mode 100644
> index 000000000000..599ebc6fc4f1
> --- /dev/null
> +++ b/kernel/static_call.c
...
> +static void __init static_call_init(void)
> +{
> + struct static_call_site *start = __start_static_call_sites;
> + struct static_call_site *stop = __stop_static_call_sites;
> + struct static_call_site *site;
> +
> + if (start == stop) {
> + pr_warn("WARNING: empty static call table\n");
> + return;
> + }
> +
> + cpus_read_lock();
> + static_call_lock();
> +
> + static_call_sort_entries(start, stop);
> +
> + for (site = start; site < stop; site++) {
> + struct static_call_key *key = static_call_key(site);
> + unsigned long addr = static_call_addr(site);
> +
> + if (list_empty(&key->site_mods)) {
> + struct static_call_mod *mod;
> +
> + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
> + if (!mod) {
> + WARN(1, "Failed to allocate memory for static calls");
> + return;
> + }
> +
> + mod->sites = site;
> + list_add_tail(&mod->list, &key->site_mods);
> +
> + /*
> + * The trampoline should no longer be used. Poison
> + * it with a BUG() to catch any stray callers.
> + */
> + arch_static_call_poison_tramp(addr);
This patches the wrong thing: the trampoline is at key->func not addr.
However, patching it here means we poison it before all users are
patched. I added this on top
diff --git a/kernel/static_call.c b/kernel/static_call.c
index 599ebc6fc4f1..d9562329bec6 100644
--- a/kernel/static_call.c
+++ b/kernel/static_call.c
@@ -248,6 +248,7 @@ static void __init static_call_init(void)
struct static_call_site *start = __start_static_call_sites;
struct static_call_site *stop = __stop_static_call_sites;
struct static_call_site *site;
+ struct static_call_key *prev_key = NULL;
if (start == stop) {
pr_warn("WARNING: empty static call table\n");
@@ -279,7 +280,9 @@ static void __init static_call_init(void)
* The trampoline should no longer be used. Poison
* it with a BUG() to catch any stray callers.
*/
- arch_static_call_poison_tramp(addr);
+ if (prev_key)
+			arch_static_call_poison_tramp((unsigned long)prev_key->func);
+ prev_key = key;
}
arch_static_call_transform(addr, key->func);
> + }
> +
> + arch_static_call_transform(addr, key->func);
> + }
> +
> + static_call_initialized = true;
> +
> + static_call_unlock();
> + cpus_read_unlock();
> +
> +#ifdef CONFIG_MODULES
> + register_module_notifier(&static_call_module_nb);
> +#endif
> +}
> +early_initcall(static_call_init);
> --
> 2.17.2
>
On 9 November 2018 at 08:28, Ingo Molnar <[email protected]> wrote:
>
> * Josh Poimboeuf <[email protected]> wrote:
>
>> These patches are related to two similar patch sets from Ard and Steve:
>>
>> - https://lkml.kernel.org/r/[email protected]
>> - https://lkml.kernel.org/r/[email protected]
>>
>> The code is also heavily inspired by the jump label code, as some of the
>> concepts are very similar.
>>
>> There are three separate implementations, depending on what the arch
>> supports:
>>
>> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
>> objtool and a small amount of arch code
>>
>> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
>> a small amount of arch code
>>
>> 3) If no arch support, fall back to regular function pointers
>>
>>
>> TODO:
>>
>> - I'm not sure about the objtool approach. Objtool is (currently)
>> x86-64 only, which means we have to use the "unoptimized" version
>> everywhere else. I may experiment with a GCC plugin instead.
>
> I'd prefer the objtool approach. It's a pretty reliable first-principles
> approach while GCC plugin would have to be replicated for Clang and any
> other compilers, etc.
>
I implemented the GCC plugin approach here for arm64
https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls
That implements both the unoptimized and the optimized versions.
I do take your point about GCC and other compilers, but on arm64 we
don't have a lot of choice.
As far as I can tell, the GCC plugin is generic (i.e., it does not
rely on any ARM-specific passes), but obviously this requires a *lot*
of testing and validation to be taken seriously.
>> - Does this feature have much value without retpolines? If not, should
>> we make it depend on retpolines somehow?
>
> Paravirt patching, as you mention in your later reply?
>
>> - Find some actual users of the interfaces (tracepoints? crypto?)
>
> I'd be very happy with a demonstrated paravirt optimization already -
> i.e. seeing the before/after effect on the vmlinux with an x86 distro
> config.
>
> All major Linux distributions enable CONFIG_PARAVIRT=y and
> CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> as possible in the 99.999% cases where it's not used is a primary
> concern.
>
> All other usecases are bonus, but it would certainly be interesting to
> investigate the impact of using these APIs for tracing: that too is a
> feature enabled everywhere but utilized only by a small fraction of Linux
> users - so literally every single cycle or instruction saved or hot-path
> shortened is a major win.
>
> Thanks,
>
> Ingo
On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > - I'm not sure about the objtool approach. Objtool is (currently)
> > x86-64 only, which means we have to use the "unoptimized" version
> > everywhere else. I may experiment with a GCC plugin instead.
>
> I'd prefer the objtool approach. It's a pretty reliable first-principles
> approach while GCC plugin would have to be replicated for Clang and any
> other compilers, etc.
The benefit of a plugin is that we'd only need two of them: GCC and
Clang. And presumably, they'd share a lot of code.
The prospect of porting objtool to all architectures is going to be much
more of a daunting task (though we are at least already considering it
for some arches).
> > - Does this feature have much value without retpolines? If not, should
> > we make it depend on retpolines somehow?
>
> Paravirt patching, as you mention in your later reply?
>
> > - Find some actual users of the interfaces (tracepoints? crypto?)
>
> I'd be very happy with a demonstrated paravirt optimization already -
> i.e. seeing the before/after effect on the vmlinux with an x86 distro
> config.
>
> All major Linux distributions enable CONFIG_PARAVIRT=y and
> CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> as possible in the 99.999% cases where it's not used is a primary
> concern.
For paravirt, I was thinking of it as more of a cleanup than an
optimization. The paravirt patching code already replaces indirect
branches with direct ones -- see paravirt_patch_default().
Though it *would* reduce the instruction footprint a bit, as the 7-byte
indirect calls (later patched to 5-byte direct + 2-byte nop) would
instead be 5-byte direct calls to begin with.
> All other usecases are bonus, but it would certainly be interesting to
> investigate the impact of using these APIs for tracing: that too is a
> feature enabled everywhere but utilized only by a small fraction of Linux
> users - so literally every single cycle or instruction saved or hot-path
> shortened is a major win.
With retpolines, and with tracepoints enabled, it's definitely a major
win. Steve measured an 8.9% general slowdown on hackbench caused by
retpolines.
But with tracepoints disabled, I believe static jumps are used, which
already minimizes the impact on hot paths.
--
Josh
On Fri, Nov 09, 2018 at 10:51:16AM +0100, Ard Biesheuvel wrote:
> Hi Josh,
>
> Thanks a lot for looking into this.
>
> On 8 November 2018 at 22:15, Josh Poimboeuf <[email protected]> wrote:
> > Add a static call infrastructure. Static calls use code patching to
> > hard-code function pointers into direct branch instructions. They give
> > the flexibility of function pointers, but with improved performance.
> > This is especially important for cases where retpolines would otherwise
> > be used, as retpolines can significantly impact performance.
> >
> > This code is heavily inspired by the jump label code (aka "static
> > jumps"), as some of the concepts are very similar.
> >
> > There are three implementations, depending on arch support:
> >
> > 1) optimized: patched call sites (CONFIG_HAVE_STATIC_CALL_OPTIMIZED)
> > 2) unoptimized: patched trampolines (CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
>
> Could we move to an idiom like inline/out of line? The unoptimized
> variant may be perfectly adequate for many arches, and 'unoptimized'
> sounds like there is something wrong with it.
Yeah, inline/out-of-line would be better. It's both more descriptive
and less derogatory :-)
> > + * Usage example:
> > + *
> > + * # Start with the following functions (with identical prototypes):
> > + * int func_a(int arg1, int arg2);
> > + * int func_b(int arg1, int arg2);
> > + *
> > + * # Define a 'my_key' reference, associated with func_a() by default
> > + * DEFINE_STATIC_CALL(my_key, func_a);
> > + *
> > + * # Call func_a()
> > + * static_call(my_key, arg1, arg2);
> > + *
> > + * # Update 'my_key' to point to func_b()
> > + * static_call_update(my_key, func_b);
> > + *
> > + * # Call func_b()
> > + * static_call(my_key, arg1, arg2);
> > + *
>
> Any way we can revert to the default implementation? That would be
> useful for, e.g., unloading modules that provide an alternative
> implementation.
I saw your original implementation had a "reset to default", but from
what I can tell, most users wouldn't need that.
Assuming the caller knows what the original dest function was, can't
they just call static_call_update(my_key, orig_func) directly?
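E.g., a sketch of a module doing exactly that, reusing 'my_key',
func_a() and func_b() from the usage example in static_call.h, and
assuming the DECLARE_STATIC_CALL() lives in some shared header:

	#include <linux/module.h>
	#include <linux/static_call.h>

	/* hypothetical shared header with DECLARE_STATIC_CALL(my_key, func_a); */
	#include "my_key.h"

	static int __init my_mod_init(void)
	{
		static_call_update(my_key, func_b);	/* override the default */
		return 0;
	}

	static void __exit my_mod_exit(void)
	{
		static_call_update(my_key, func_a);	/* restore the default */
	}

	module_init(my_mod_init);
	module_exit(my_mod_exit);
	MODULE_LICENSE("GPL");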
> > diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
> > new file mode 100644
> > index 000000000000..7dd4b3d7ec2b
> > --- /dev/null
> > +++ b/include/linux/static_call_types.h
> > @@ -0,0 +1,19 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _STATIC_CALL_TYPES_H
> > +#define _STATIC_CALL_TYPES_H
> > +
> > +#include <linux/stringify.h>
> > +
> > +#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
> > +#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
> > +
> > +#define STATIC_CALL_TRAMP(key) STATIC_CALL_TRAMP_PREFIX##key
> > +#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
> > +
>
> I needed to apply
>
> diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
> index 7dd4b3d7ec2b..6859b208de6e 100644
> --- a/include/linux/static_call_types.h
> +++ b/include/linux/static_call_types.h
> @@ -7,7 +7,7 @@
> #define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
> #define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
>
> -#define STATIC_CALL_TRAMP(key) STATIC_CALL_TRAMP_PREFIX##key
> +#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
> #define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
>
> /* The static call site table is created by objtool. */
>
> or I end up with
>
> 0000000000000000 <STATIC_CALL_TRAMP_PREFIXfoo>:
> 0: 14000000 b a8 <foo_a>
> 4: d503201f nop
Ha, oops.  That was part of a last-minute cleanup to share the macros
with objtool.
--
Josh
On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
> > + for (site = start; site < stop; site++) {
> > + struct static_call_key *key = static_call_key(site);
> > + unsigned long addr = static_call_addr(site);
> > +
> > + if (list_empty(&key->site_mods)) {
> > + struct static_call_mod *mod;
> > +
> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
> > + if (!mod) {
> > + WARN(1, "Failed to allocate memory for static calls");
> > + return;
> > + }
> > +
> > + mod->sites = site;
> > + list_add_tail(&mod->list, &key->site_mods);
> > +
> > + /*
> > +			 * The trampoline should no longer be used.  Poison
> > +			 * it with a BUG() to catch any stray callers.
> > + */
> > + arch_static_call_poison_tramp(addr);
>
> This patches the wrong thing: the trampoline is at key->func not addr.
If you look at the x86 implementation, it actually does poison the
trampoline.
The address of the trampoline isn't actually known here. key->func
isn't the trampoline address; it's the destination func address.
So instead I passed the address of the call instruction. The arch code
then reads the instruction to find the callee (the trampoline).
The code is a bit confusing. To make it more obvious, maybe we should
add another arch function to read the call destination. Then this code
can pass that into arch_static_call_poison_tramp().
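On x86-64 that could be as small as this (hypothetical helper, not in
the posted patch; assumes the call site is a plain 5-byte 'call rel32',
opcode 0xe8):

	static unsigned long arch_static_call_dest(unsigned long insn)
	{
		/* the rel32 operand follows the 0xe8 opcode byte... */
		s32 rel = *(s32 *)(insn + 1);

		/* ...and is relative to the address of the next instruction */
		return insn + 5 + rel;
	}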
> However, patching it here means we poison it before all users are
> patched. I added this on top
>
> diff --git a/kernel/static_call.c b/kernel/static_call.c
> index 599ebc6fc4f1..d9562329bec6 100644
> --- a/kernel/static_call.c
> +++ b/kernel/static_call.c
> @@ -248,6 +248,7 @@ static void __init static_call_init(void)
> struct static_call_site *start = __start_static_call_sites;
> struct static_call_site *stop = __stop_static_call_sites;
> struct static_call_site *site;
> + struct static_call_key *prev_key = NULL;
>
> if (start == stop) {
> pr_warn("WARNING: empty static call table\n");
> @@ -279,7 +280,9 @@ static void __init static_call_init(void)
> * The trampoline should no longer be used. Poison
> * it with a BUG() to catch any stray callers.
> */
> - arch_static_call_poison_tramp(addr);
> + if (prev_key)
> +			arch_static_call_poison_tramp((unsigned long)prev_key->func);
> + prev_key = key;
> }
>
> arch_static_call_transform(addr, key->func);
While it does indeed poison the trampoline before all users are patched,
I had been thinking that it didn't really matter because this is before
the other CPUs have been booted.
But I believe interrupts are enabled at this point during the boot, so
it would indeed be wise to poison it afterwards, in case an irq handler
makes a static call.
--
Josh
On 9 November 2018 at 16:10, Josh Poimboeuf <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
>> > + for (site = start; site < stop; site++) {
>> > + struct static_call_key *key = static_call_key(site);
>> > + unsigned long addr = static_call_addr(site);
>> > +
>> > + if (list_empty(&key->site_mods)) {
>> > + struct static_call_mod *mod;
>> > +
>> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
>> > + if (!mod) {
>> > + WARN(1, "Failed to allocate memory for static calls");
>> > + return;
>> > + }
>> > +
>> > + mod->sites = site;
>> > + list_add_tail(&mod->list, &key->site_mods);
>> > +
>> > + /*
>> > +			 * The trampoline should no longer be used.  Poison
>> > +			 * it with a BUG() to catch any stray callers.
>> > + */
>> > + arch_static_call_poison_tramp(addr);
>>
>> This patches the wrong thing: the trampoline is at key->func not addr.
>
> If you look at the x86 implementation, it actually does poison the
> trampoline.
>
> The address of the trampoline isn't actually known here. key->func
> isn't the trampoline address; it's the destination func address.
>
> So instead I passed the address of the call instruction. The arch code
> then reads the instruction to find the callee (the trampoline).
>
> The code is a bit confusing. To make it more obvious, maybe we should
> add another arch function to read the call destination. Then this code
> can pass that into arch_static_call_poison_tramp().
>
Ah right, so I am basically missing a dereference in my
arch_static_call_poison_tramp() code if this breaks.
>> However, patching it here means we poison it before all users are
>> patched. I added this on top
>>
>> diff --git a/kernel/static_call.c b/kernel/static_call.c
>> index 599ebc6fc4f1..d9562329bec6 100644
>> --- a/kernel/static_call.c
>> +++ b/kernel/static_call.c
>> @@ -248,6 +248,7 @@ static void __init static_call_init(void)
>> struct static_call_site *start = __start_static_call_sites;
>> struct static_call_site *stop = __stop_static_call_sites;
>> struct static_call_site *site;
>> + struct static_call_key *prev_key = NULL;
>>
>> if (start == stop) {
>> pr_warn("WARNING: empty static call table\n");
>> @@ -279,7 +280,9 @@ static void __init static_call_init(void)
>> * The trampoline should no longer be used. Poison
>> * it with a BUG() to catch any stray callers.
>> */
>> - arch_static_call_poison_tramp(addr);
>> + if (prev_key)
>> +			arch_static_call_poison_tramp((unsigned long)prev_key->func);
>> + prev_key = key;
>> }
>>
>> arch_static_call_transform(addr, key->func);
>
> While it does indeed poison the trampoline before all users are patched,
> I had been thinking that it didn't really matter because this is before
> the other CPUs have been booted.
>
> But I believe interrupts are enabled at this point during the boot, so
> it would indeed be wise to poison it afterwards, in case an irq handler
> makes a static call.
>
And kmalloc(GFP_KERNEL) itself could cascade into lots of other code as well.
On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar <[email protected]> wrote:
>
>
> All other usecases are bonus, but it would certainly be interesting to
> investigate the impact of using these APIs for tracing: that too is a
> feature enabled everywhere but utilized only by a small fraction of Linux
> users - so literally every single cycle or instruction saved or hot-path
> shortened is a major win.
For tracing, we'd want static_call_set_to_nop() or something like that, right?
--Andy
On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> On 9 November 2018 at 08:28, Ingo Molnar <[email protected]> wrote:
> >
> > * Josh Poimboeuf <[email protected]> wrote:
> >
> >> These patches are related to two similar patch sets from Ard and Steve:
> >>
> >> - https://lkml.kernel.org/r/[email protected]
> >> - https://lkml.kernel.org/r/[email protected]
> >>
> >> The code is also heavily inspired by the jump label code, as some of the
> >> concepts are very similar.
> >>
> >> There are three separate implementations, depending on what the arch
> >> supports:
> >>
> >> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
> >> objtool and a small amount of arch code
> >>
> >> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
> >> a small amount of arch code
> >>
> >> 3) If no arch support, fall back to regular function pointers
> >>
> >>
> >> TODO:
> >>
> >> - I'm not sure about the objtool approach. Objtool is (currently)
> >> x86-64 only, which means we have to use the "unoptimized" version
> >> everywhere else. I may experiment with a GCC plugin instead.
> >
> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > approach while GCC plugin would have to be replicated for Clang and any
> > other compilers, etc.
> >
>
> I implemented the GCC plugin approach here for arm64
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls
>
> That implements both the unoptimized and the optimized versions.
Nice! That was fast :-)
> I do take your point about GCC and other compilers, but on arm64 we
> don't have a lot of choice.
>
> As far as I can tell, the GCC plugin is generic (i.e., it does not
> rely on any ARM-specific passes), but obviously this requires a *lot*
> of testing and validation to be taken seriously.
Yeah. I haven't had a chance to try your plugin on x86 yet, but in
theory it should be arch-independent.
--
Josh
On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar <[email protected]> wrote:
> >
> >
> > All other usecases are bonus, but it would certainly be interesting to
> > investigate the impact of using these APIs for tracing: that too is a
> > feature enabled everywhere but utilized only by a small fraction of Linux
> > users - so literally every single cycle or instruction saved or hot-path
> > shortened is a major win.
>
> For tracing, we'd want static_call_set_to_nop() or something like that, right?
Are we talking about tracepoints? Or ftrace?
--
Josh
On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> > On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar <[email protected]> wrote:
> > >
> > >
> > > All other usecases are bonus, but it would certainly be interesting to
> > > investigate the impact of using these APIs for tracing: that too is a
> > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > users - so literally every single cycle or instruction saved or hot-path
> > > shortened is a major win.
> >
> > For tracing, we'd want static_call_set_to_nop() or something like that, right?
>
> Are we talking about tracepoints? Or ftrace?
Since ftrace changes calls to nops, and vice versa, I assume you meant
ftrace. I don't think ftrace is a good candidate for this, as it's
inherently more flexible than this API would reasonably allow.
--
Josh
On 9 November 2018 at 16:14, Ard Biesheuvel <[email protected]> wrote:
> On 9 November 2018 at 16:10, Josh Poimboeuf <[email protected]> wrote:
>> On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
>>> > + for (site = start; site < stop; site++) {
>>> > + struct static_call_key *key = static_call_key(site);
>>> > + unsigned long addr = static_call_addr(site);
>>> > +
>>> > + if (list_empty(&key->site_mods)) {
>>> > + struct static_call_mod *mod;
>>> > +
>>> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
>>> > + if (!mod) {
>>> > + WARN(1, "Failed to allocate memory for static calls");
>>> > + return;
>>> > + }
>>> > +
>>> > + mod->sites = site;
>>> > + list_add_tail(&mod->list, &key->site_mods);
>>> > +
>>> > + /*
>>> > +			 * The trampoline should no longer be used.  Poison
>>> > +			 * it with a BUG() to catch any stray callers.
>>> > + */
>>> > + arch_static_call_poison_tramp(addr);
>>>
>>> This patches the wrong thing: the trampoline is at key->func not addr.
>>
>> If you look at the x86 implementation, it actually does poison the
>> trampoline.
>>
>> The address of the trampoline isn't actually known here. key->func
>> isn't the trampoline address; it's the destination func address.
>>
>> So instead I passed the address of the call instruction. The arch code
>> then reads the instruction to find the callee (the trampoline).
>>
>> The code is a bit confusing. To make it more obvious, maybe we should
>> add another arch function to read the call destination. Then this code
>> can pass that into arch_static_call_poison_tramp().
>>
>
> Ah right, so I am basically missing a dereference in my
> arch_static_call_poison_tramp() code if this breaks.
>
Could we call it 'defuse' rather than 'poison'? On arm64, we will
need to keep it around to bounce function calls that are out of range,
and replace it with a PLT sequence.
On Fri, Nov 09, 2018 at 06:25:24PM +0100, Ard Biesheuvel wrote:
> On 9 November 2018 at 16:14, Ard Biesheuvel <[email protected]> wrote:
> > On 9 November 2018 at 16:10, Josh Poimboeuf <[email protected]> wrote:
> >> On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
> >>> > + for (site = start; site < stop; site++) {
> >>> > + struct static_call_key *key = static_call_key(site);
> >>> > + unsigned long addr = static_call_addr(site);
> >>> > +
> >>> > + if (list_empty(&key->site_mods)) {
> >>> > + struct static_call_mod *mod;
> >>> > +
> >>> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
> >>> > + if (!mod) {
> >>> > + WARN(1, "Failed to allocate memory for static calls");
> >>> > + return;
> >>> > + }
> >>> > +
> >>> > + mod->sites = site;
> >>> > + list_add_tail(&mod->list, &key->site_mods);
> >>> > +
> >>> > + /*
> >>> > +			 * The trampoline should no longer be used.  Poison
> >>> > +			 * it with a BUG() to catch any stray callers.
> >>> > + */
> >>> > + arch_static_call_poison_tramp(addr);
> >>>
> >>> This patches the wrong thing: the trampoline is at key->func not addr.
> >>
> >> If you look at the x86 implementation, it actually does poison the
> >> trampoline.
> >>
> >> The address of the trampoline isn't actually known here. key->func
> >> isn't the trampoline address; it's the destination func address.
> >>
> >> So instead I passed the address of the call instruction. The arch code
> >> then reads the instruction to find the callee (the trampoline).
> >>
> >> The code is a bit confusing. To make it more obvious, maybe we should
> >> add another arch function to read the call destination. Then this code
> >> can pass that into arch_static_call_poison_tramp().
> >>
> >
> > Ah right, so I am basically missing a dereference in my
> > arch_static_call_poison_tramp() code if this breaks.
> >
>
> Could we call it 'defuse' rather than 'poison'? On arm64, we will
> need to keep it around to bounce function calls that are out of range,
> and replace it with a PLT sequence.
Ok, but doesn't that defeat the purpose of the inline approach?
--
Josh
On Fri, Nov 09, 2018 at 11:31:06AM -0600, Josh Poimboeuf wrote:
> On Fri, Nov 09, 2018 at 06:25:24PM +0100, Ard Biesheuvel wrote:
> > On 9 November 2018 at 16:14, Ard Biesheuvel <[email protected]> wrote:
> > > On 9 November 2018 at 16:10, Josh Poimboeuf <[email protected]> wrote:
> > >> On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
> > >>> > + for (site = start; site < stop; site++) {
> > >>> > + struct static_call_key *key = static_call_key(site);
> > >>> > + unsigned long addr = static_call_addr(site);
> > >>> > +
> > >>> > + if (list_empty(&key->site_mods)) {
> > >>> > + struct static_call_mod *mod;
> > >>> > +
> > >>> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
> > >>> > + if (!mod) {
> > >>> > + WARN(1, "Failed to allocate memory for static calls");
> > >>> > + return;
> > >>> > + }
> > >>> > +
> > >>> > + mod->sites = site;
> > >>> > + list_add_tail(&mod->list, &key->site_mods);
> > >>> > +
> > >>> > + /*
> > >>> > +			 * The trampoline should no longer be used.  Poison
> > >>> > +			 * it with a BUG() to catch any stray callers.
> > >>> > + */
> > >>> > + arch_static_call_poison_tramp(addr);
> > >>>
> > >>> This patches the wrong thing: the trampoline is at key->func not addr.
> > >>
> > >> If you look at the x86 implementation, it actually does poison the
> > >> trampoline.
> > >>
> > >> The address of the trampoline isn't actually known here. key->func
> > >> isn't the trampoline address; it's the destination func address.
> > >>
> > >> So instead I passed the address of the call instruction. The arch code
> > >> then reads the instruction to find the callee (the trampoline).
> > >>
> > >> The code is a bit confusing. To make it more obvious, maybe we should
> > >> add another arch function to read the call destination. Then this code
> > >> can pass that into arch_static_call_poison_tramp().
> > >>
> > >
> > > Ah right, so I am basically missing a dereference in my
> > > arch_static_call_poison_tramp() code if this breaks.
> > >
> >
> > Could we call it 'defuse' rather than 'poison'? On arm64, we will
> > need to keep it around to bounce function calls that are out of range,
> > and replace it with a PLT sequence.
>
> Ok, but doesn't that defeat the purpose of the inline approach?
Or are you only going to use the trampoline for out-of-range calls,
otherwise just do direct calls?
--
Josh
On 9 November 2018 at 18:31, Josh Poimboeuf <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 06:25:24PM +0100, Ard Biesheuvel wrote:
>> On 9 November 2018 at 16:14, Ard Biesheuvel <[email protected]> wrote:
>> > On 9 November 2018 at 16:10, Josh Poimboeuf <[email protected]> wrote:
>> >> On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
>> >>> > + for (site = start; site < stop; site++) {
>> >>> > + struct static_call_key *key = static_call_key(site);
>> >>> > + unsigned long addr = static_call_addr(site);
>> >>> > +
>> >>> > + if (list_empty(&key->site_mods)) {
>> >>> > + struct static_call_mod *mod;
>> >>> > +
>> >>> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
>> >>> > + if (!mod) {
>> >>> > + WARN(1, "Failed to allocate memory for static calls");
>> >>> > + return;
>> >>> > + }
>> >>> > +
>> >>> > + mod->sites = site;
>> >>> > + list_add_tail(&mod->list, &key->site_mods);
>> >>> > +
>> >>> > + /*
>> >>> > +			 * The trampoline should no longer be used.  Poison
>> >>> > +			 * it with a BUG() to catch any stray callers.
>> >>> > + */
>> >>> > + arch_static_call_poison_tramp(addr);
>> >>>
>> >>> This patches the wrong thing: the trampoline is at key->func not addr.
>> >>
>> >> If you look at the x86 implementation, it actually does poison the
>> >> trampoline.
>> >>
>> >> The address of the trampoline isn't actually known here. key->func
>> >> isn't the trampoline address; it's the destination func address.
>> >>
>> >> So instead I passed the address of the call instruction. The arch code
>> >> then reads the instruction to find the callee (the trampoline).
>> >>
>> >> The code is a bit confusing. To make it more obvious, maybe we should
>> >> add another arch function to read the call destination. Then this code
>> >> can pass that into arch_static_call_poison_tramp().
>> >>
>> >
>> > Ah right, so I am basically missing a dereference in my
>> > arch_static_call_poison_tramp() code if this breaks.
>> >
>>
>> Could we call it 'defuse' rather than 'poison'? On arm64, we will
>> need to keep it around to bounce function calls that are out of range,
>> and replace it with a PLT sequence.
>
> Ok, but doesn't that defeat the purpose of the inline approach?
>
It does. But this only occurs when a module is loaded far away, and
this will only happen if you have 2 GB range KASLR enabled, or your
128 MB module region gets exhausted for some reason, so the majority
of calls should use a single relative branch.
On Fri, Nov 09, 2018 at 06:33:03PM +0100, Ard Biesheuvel wrote:
> On 9 November 2018 at 18:31, Josh Poimboeuf <[email protected]> wrote:
> > On Fri, Nov 09, 2018 at 06:25:24PM +0100, Ard Biesheuvel wrote:
> >> On 9 November 2018 at 16:14, Ard Biesheuvel <[email protected]> wrote:
> >> > On 9 November 2018 at 16:10, Josh Poimboeuf <[email protected]> wrote:
> >> >> On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
> >> >>> > + for (site = start; site < stop; site++) {
> >> >>> > + struct static_call_key *key = static_call_key(site);
> >> >>> > + unsigned long addr = static_call_addr(site);
> >> >>> > +
> >> >>> > + if (list_empty(&key->site_mods)) {
> >> >>> > + struct static_call_mod *mod;
> >> >>> > +
> >> >>> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
> >> >>> > + if (!mod) {
> >> >>> > + WARN(1, "Failed to allocate memory for static calls");
> >> >>> > + return;
> >> >>> > + }
> >> >>> > +
> >> >>> > + mod->sites = site;
> >> >>> > + list_add_tail(&mod->list, &key->site_mods);
> >> >>> > +
> >> >>> > + /*
> >> >>> > +			 * The trampoline should no longer be used.  Poison
> >> >>> > +			 * it with a BUG() to catch any stray callers.
> >> >>> > + */
> >> >>> > + arch_static_call_poison_tramp(addr);
> >> >>>
> >> >>> This patches the wrong thing: the trampoline is at key->func not addr.
> >> >>
> >> >> If you look at the x86 implementation, it actually does poison the
> >> >> trampoline.
> >> >>
> >> >> The address of the trampoline isn't actually known here. key->func
> >> >> isn't the trampoline address; it's the destination func address.
> >> >>
> >> >> So instead I passed the address of the call instruction. The arch code
> >> >> then reads the instruction to find the callee (the trampoline).
> >> >>
> >> >> The code is a bit confusing. To make it more obvious, maybe we should
> >> >> add another arch function to read the call destination. Then this code
> >> >> can pass that into arch_static_call_poison_tramp().
> >> >>
> >> >
> >> > Ah right, so I am basically missing a dereference in my
> >> > arch_static_call_poison_tramp() code if this breaks.
> >> >
> >>
> >> Could we call it 'defuse' rather than 'poison'? On arm64, we will
> >> need to keep it around to bounce function calls that are out of range,
> >> and replace it with a PLT sequence.
> >
> > Ok, but doesn't that defeat the purpose of the inline approach?
> >
>
> It does. But this only occurs when a module is loaded far away, and
> this will only happen if you have 2 GB range KASLR enabled, or your
> 128 MB module region gets exhausted for some reason, so the majority
> of calls should use a single relative branch.
Makes sense. Do you also account for the possibility that the original
call emitted by GCC was far away and thus used the PLT?
--
Josh
On 9 November 2018 at 18:46, Josh Poimboeuf <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 06:33:03PM +0100, Ard Biesheuvel wrote:
>> On 9 November 2018 at 18:31, Josh Poimboeuf <[email protected]> wrote:
>> > On Fri, Nov 09, 2018 at 06:25:24PM +0100, Ard Biesheuvel wrote:
>> >> On 9 November 2018 at 16:14, Ard Biesheuvel <[email protected]> wrote:
>> >> > On 9 November 2018 at 16:10, Josh Poimboeuf <[email protected]> wrote:
>> >> >> On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
>> >> >>> > + for (site = start; site < stop; site++) {
>> >> >>> > + struct static_call_key *key = static_call_key(site);
>> >> >>> > + unsigned long addr = static_call_addr(site);
>> >> >>> > +
>> >> >>> > + if (list_empty(&key->site_mods)) {
>> >> >>> > + struct static_call_mod *mod;
>> >> >>> > +
>> >> >>> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
>> >> >>> > + if (!mod) {
>> >> >>> > + WARN(1, "Failed to allocate memory for static calls");
>> >> >>> > + return;
>> >> >>> > + }
>> >> >>> > +
>> >> >>> > + mod->sites = site;
>> >> >>> > + list_add_tail(&mod->list, &key->site_mods);
>> >> >>> > +
>> >> >>> > +			 * The trampoline should no longer be used.  Poison
>> >> >>> > +			 * it with a BUG() to catch any stray callers.
>> >> >>> > + * it with a BUG() to catch any stray callers.
>> >> >>> > + */
>> >> >>> > + arch_static_call_poison_tramp(addr);
>> >> >>>
>> >> >>> This patches the wrong thing: the trampoline is at key->func not addr.
>> >> >>
>> >> >> If you look at the x86 implementation, it actually does poison the
>> >> >> trampoline.
>> >> >>
>> >> >> The address of the trampoline isn't actually known here. key->func
>> >> >> isn't the trampoline address; it's the destination func address.
>> >> >>
>> >> >> So instead I passed the address of the call instruction. The arch code
>> >> >> then reads the instruction to find the callee (the trampoline).
>> >> >>
>> >> >> The code is a bit confusing. To make it more obvious, maybe we should
>> >> >> add another arch function to read the call destination. Then this code
>> >> >> can pass that into arch_static_call_poison_tramp().
>> >> >>
>> >> >
>> >> > Ah right, so I am basically missing a dereference in my
>> >> > arch_static_call_poison_tramp() code if this breaks.
>> >> >
>> >>
>> >> Could we call it 'defuse' rather than 'poison'? On arm64, we will
>> >> need to keep it around to bounce function calls that are out of range,
>> >> and replace it with a PLT sequence.
>> >
>> > Ok, but doesn't that defeat the purpose of the inline approach?
>> >
>>
>> It does. But this only occurs when a module is loaded far away, and
>> this will only happen if you have 2 GB range KASLR enabled, or your
>> 128 MB module region gets exhausted for some reason, so the majority
>> of calls should use a single relative branch.
>
> Makes sense. Do you also account for the possibility that the original
> call emitted by GCC was far away and thus used the PLT?
>
That doesn't really happen. vmlinux is never over 128 MB, and the
modules are only partially linked, so the linker never gets to the
stage where it needs to emit veneers.
It's a bit fiddly since inline and out-of-line both use
arch_static_call_transform(), but what I need to do is basically:
- for out-of-line, the trampoline needs to be patched into a
movn/movk/movk/br sequence if the target is too far
- for inline, the trampoline itself needs to be patched from
adrp/ldr/br (which does a load and an indirect branch) to
movn/movk/movk/br (which uses immediates), and the call sites need to
be patched into calls to the veneer if the target is out of range.
So arch_static_call_transform() needs to know where the trampoline is
for this use case, so could we perhaps add a 'void *orig' in the key
struct that keeps track of the original value of 'addr'?
On 9 November 2018 at 18:52, Ard Biesheuvel <[email protected]> wrote:
> On 9 November 2018 at 18:46, Josh Poimboeuf <[email protected]> wrote:
>> On Fri, Nov 09, 2018 at 06:33:03PM +0100, Ard Biesheuvel wrote:
>>> On 9 November 2018 at 18:31, Josh Poimboeuf <[email protected]> wrote:
>>> > On Fri, Nov 09, 2018 at 06:25:24PM +0100, Ard Biesheuvel wrote:
>>> >> On 9 November 2018 at 16:14, Ard Biesheuvel <[email protected]> wrote:
>>> >> > On 9 November 2018 at 16:10, Josh Poimboeuf <[email protected]> wrote:
>>> >> >> On Fri, Nov 09, 2018 at 02:39:17PM +0100, Ard Biesheuvel wrote:
>>> >> >>> > + for (site = start; site < stop; site++) {
>>> >> >>> > + struct static_call_key *key = static_call_key(site);
>>> >> >>> > + unsigned long addr = static_call_addr(site);
>>> >> >>> > +
>>> >> >>> > + if (list_empty(&key->site_mods)) {
>>> >> >>> > + struct static_call_mod *mod;
>>> >> >>> > +
>>> >> >>> > + mod = kzalloc(sizeof(*mod), GFP_KERNEL);
>>> >> >>> > + if (!mod) {
>>> >> >>> > + WARN(1, "Failed to allocate memory for static calls");
>>> >> >>> > + return;
>>> >> >>> > + }
>>> >> >>> > +
>>> >> >>> > + mod->sites = site;
>>> >> >>> > + list_add_tail(&mod->list, &key->site_mods);
>>> >> >>> > +
>>> >> >>> > +			 * The trampoline should no longer be used.  Poison
>>> >> >>> > +			 * it with a BUG() to catch any stray callers.
>>> >> >>> > + * it with a BUG() to catch any stray callers.
>>> >> >>> > + */
>>> >> >>> > + arch_static_call_poison_tramp(addr);
>>> >> >>>
>>> >> >>> This patches the wrong thing: the trampoline is at key->func not addr.
>>> >> >>
>>> >> >> If you look at the x86 implementation, it actually does poison the
>>> >> >> trampoline.
>>> >> >>
>>> >> >> The address of the trampoline isn't actually known here. key->func
>>> >> >> isn't the trampoline address; it's the destination func address.
>>> >> >>
>>> >> >> So instead I passed the address of the call instruction. The arch code
>>> >> >> then reads the instruction to find the callee (the trampoline).
>>> >> >>
>>> >> >> The code is a bit confusing. To make it more obvious, maybe we should
>>> >> >> add another arch function to read the call destination. Then this code
>>> >> >> can pass that into arch_static_call_poison_tramp().
>>> >> >>
>>> >> >
>>> >> > Ah right, so I am basically missing a dereference in my
>>> >> > arch_static_call_poison_tramp() code if this breaks.
>>> >> >
>>> >>
>>> >> Could we call it 'defuse' rather than 'poison'? On arm64, we will
>>> >> need to keep it around to bounce function calls that are out of range,
>>> >> and replace it with a PLT sequence.
>>> >
>>> > Ok, but doesn't that defeat the purpose of the inline approach?
>>> >
>>>
>>> It does. But this only occurs when a module is loaded far away, and
>>> this will only happen if you have 2 GB range KASLR enabled, or your
>>> 128 MB module region gets exhausted for some reason, so the majority
>>> of calls should use a single relative branch.
>>
>> Makes sense. Do you also account for the possibility that the original
>> call emitted by GCC was far away and thus used the PLT?
>>
>
> That doesn't really happen. vmlinux is never over 128 MB, and the
> modules are only partially linked, so the linker never gets to the
> stage where it needs to emit veneers.
>
> It's a bit fiddly since inline and out-of-line both use
> arch_static_call_transform(), but what I need to do is basically:
>
> - for out-of-line, the trampoline needs to be patched into a
> movn/movk/movk/br sequence if the target is too far
> - for inline, the trampoline itself needs to be patched from
> adrp/ldr/br (which does a load and an indirect branch) to
> movn/movk/movk/br (which uses immediates), and the call sites need to
> be patched into calls to the veneer if the target is out of range.
>
> So arch_static_call_transform() needs to know where the trampoline is
> for this use case, so could we perhaps add a 'void *orig' in the key
> struct that keeps track of the original value of 'addr'?
... and pass it to arch_static_call_transform
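Something like this, perhaps (a sketch of the suggestion; the 'orig'
field and the third parameter are not in the posted patch):

	struct static_call_key {
		void *func;			/* current call destination */
		void *orig;			/* original trampoline address */
		struct list_head site_mods;
	};

	/* the extra argument tells the arch code where the trampoline lives */
	extern void arch_static_call_transform(unsigned long insn, void *dest,
					       void *tramp);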
On Thu, 8 Nov 2018 15:15:51 -0600
Josh Poimboeuf <[email protected]> wrote:
> Add a static call infrastructure. Static calls use code patching to
> hard-code function pointers into direct branch instructions. They give
> the flexibility of function pointers, but with improved performance.
> This is especially important for cases where retpolines would otherwise
> be used, as retpolines can significantly impact performance.
>
> This code is heavily inspired by the jump label code (aka "static
> jumps"), as some of the concepts are very similar.
>
> There are three implementations, depending on arch support:
>
> 1) optimized: patched call sites (CONFIG_HAVE_STATIC_CALL_OPTIMIZED)
Thanks for doing this.
> 2) unoptimized: patched trampolines (CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
#2 looks like my approach.
> 3) basic function pointers
>
> For more details, see the comments in include/linux/static_call.h.
>
> Signed-off-by: Josh Poimboeuf <[email protected]>
> ---
> arch/Kconfig | 6 +
> include/asm-generic/vmlinux.lds.h | 11 ++
> include/linux/module.h | 10 +
> include/linux/static_call.h | 186 +++++++++++++++++++
> include/linux/static_call_types.h | 19 ++
> kernel/Makefile | 1 +
> kernel/module.c | 5 +
> kernel/static_call.c | 297 ++++++++++++++++++++++++++++++
> 8 files changed, 535 insertions(+)
> create mode 100644 include/linux/static_call.h
> create mode 100644 include/linux/static_call_types.h
> create mode 100644 kernel/static_call.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index e1e540ffa979..9657ad5a80d1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -879,6 +879,12 @@ config HAVE_ARCH_PREL32_RELOCATIONS
> architectures, and don't require runtime relocation on relocatable
> kernels.
>
> +config HAVE_STATIC_CALL_OPTIMIZED
> + bool
> +
> +config HAVE_STATIC_CALL_UNOPTIMIZED
> + bool
> +
Let's also have
config HAVE_STATIC_CALL
	def_bool y
	depends on HAVE_STATIC_CALL_OPTIMIZED || HAVE_STATIC_CALL_UNOPTIMIZED
> source "kernel/gcov/Kconfig"
>
> source "scripts/gcc-plugins/Kconfig"
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 3d7a6a9c2370..45d8ced28bd0 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -320,6 +320,7 @@
> __start_ro_after_init = .; \
> *(.data..ro_after_init) \
> JUMP_TABLE_DATA \
> + STATIC_CALL_SITES \
> __end_ro_after_init = .;
> #endif
>
> @@ -725,6 +726,16 @@
> #define BUG_TABLE
> #endif
>
> +#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
> +#define STATIC_CALL_SITES \
> + . = ALIGN(8); \
> + __start_static_call_sites = .; \
> + KEEP(*(.static_call_sites)) \
> + __stop_static_call_sites = .;
> +#else
> +#define STATIC_CALL_SITES
> +#endif
> +
> #ifdef CONFIG_UNWINDER_ORC
> #define ORC_UNWIND_TABLE \
> . = ALIGN(4); \
> diff --git a/include/linux/module.h b/include/linux/module.h
> index fce6b4335e36..91baf181247a 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -21,6 +21,7 @@
> #include <linux/rbtree_latch.h>
> #include <linux/error-injection.h>
> #include <linux/tracepoint-defs.h>
> +#include <linux/static_call_types.h>
>
> #include <linux/percpu.h>
> #include <asm/module.h>
> @@ -450,6 +451,10 @@ struct module {
> unsigned int num_ftrace_callsites;
> unsigned long *ftrace_callsites;
> #endif
> +#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
> + int num_static_call_sites;
> + struct static_call_site *static_call_sites;
> +#endif
>
> #ifdef CONFIG_LIVEPATCH
> bool klp; /* Is this a livepatch module? */
> @@ -682,6 +687,11 @@ static inline bool is_module_text_address(unsigned long addr)
> return false;
> }
>
> +static inline bool within_module_init(unsigned long addr, const struct module *mod)
> +{
> + return false;
> +}
> +
> /* Get/put a kernel symbol (calls should be symmetric) */
> #define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
> #define symbol_put(x) do { } while (0)
> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> new file mode 100644
> index 000000000000..b22987057cf1
> --- /dev/null
> +++ b/include/linux/static_call.h
> @@ -0,0 +1,186 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_STATIC_CALL_H
> +#define _LINUX_STATIC_CALL_H
> +
> +/*
> + * Static call support
> + *
> + * Static calls use code patching to hard-code function pointers into direct
> + * branch instructions. They give the flexibility of function pointers, but
> + * with improved performance. This is especially important for cases where
> + * retpolines would otherwise be used, as retpolines can significantly impact
> + * performance.
> + *
> + *
> + * API overview:
> + *
> + * DECLARE_STATIC_CALL(key, func);
> + * DEFINE_STATIC_CALL(key, func);
> + * static_call(key, args...);
> + * static_call_update(key, func);
> + *
> + *
> + * Usage example:
> + *
> + * # Start with the following functions (with identical prototypes):
> + * int func_a(int arg1, int arg2);
> + * int func_b(int arg1, int arg2);
> + *
> + * # Define a 'my_key' reference, associated with func_a() by default
> + * DEFINE_STATIC_CALL(my_key, func_a);
> + *
> + * # Call func_a()
> + * static_call(my_key, arg1, arg2);
> + *
> + * # Update 'my_key' to point to func_b()
> + * static_call_update(my_key, func_b);
> + *
> + * # Call func_b()
> + * static_call(my_key, arg1, arg2);
> + *
> + *
> + * Implementation details:
> + *
> + * There are three different implementations:
> + *
> + * 1) Optimized static calls (patched call sites)
> + *
> + * This requires objtool, which detects all the static_call() sites and
> + * annotates them in the '.static_call_sites' section. By default, the call
> + * sites will call into a temporary per-key trampoline which has an indirect
> + * branch to the current destination function associated with the key.
> + * During system boot (or module init), all call sites are patched to call
> + * their destination functions directly. Updates to a key will patch all
> + * call sites associated with that key.
> + *
> + * 2) Unoptimized static calls (patched trampolines)
> + *
> + * Each static_call() site calls into a permanent trampoline associated with
> + * the key. The trampoline has a direct branch to the default function.
> + * Updates to a key will modify the direct branch in the key's trampoline.
> + *
> + * 3) Generic implementation
> + *
> + * This is the default implementation if the architecture hasn't implemented
> + * CONFIG_HAVE_STATIC_CALL_[UN]OPTIMIZED. In this case, a basic
> + * function pointer is used.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/cpu.h>
> +#include <linux/static_call_types.h>
> +
> +#if defined(CONFIG_HAVE_STATIC_CALL_OPTIMIZED) || \
> + defined(CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
Let's also have:

config HAVE_STATIC_CALL
	def_bool y
	depends on HAVE_STATIC_CALL_OPTIMIZED || HAVE_STATIC_CALL_UNOPTIMIZED

Now we can have:

#ifdef CONFIG_HAVE_STATIC_CALL
> +
> +#include <asm/static_call.h>
> +
> +extern void arch_static_call_transform(unsigned long insn, void *dest);
> +
> +#endif /* CONFIG_HAVE_STATIC_CALL_[UN]OPTIMIZED */
> +
> +#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
> +/* Optimized implementation */
> +
> +struct static_call_mod {
> + struct list_head list;
> + struct module *mod; /* NULL means vmlinux */
> + struct static_call_site *sites;
> +};
> +
> +struct static_call_key {
> + /*
> + * This field should always be first because the trampolines expect it.
> + * This points to the current call destination.
> + */
> + void *func;
> +
> + /*
> + * List of modules (including vmlinux) and the call sites which need to
> + * be patched for this key.
> + */
> + struct list_head site_mods;
> +};
> +
> +extern void __static_call_update(struct static_call_key *key, void *func);
> +extern void arch_static_call_poison_tramp(unsigned long insn);
> +
> +#define DECLARE_STATIC_CALL(key, func) \
> + extern struct static_call_key key; \
> + extern typeof(func) STATIC_CALL_TRAMP(key); \
> + /* Preserve the ELF symbol so objtool can access it: */ \
> + __ADDRESSABLE(key)
Does the __ADDRESSABLE(key) need to be in the DECLARE part?
If so, there needs to be more explanation than just the comment above
it.
> +
> +#define DEFINE_STATIC_CALL(key, _func) \
> + DECLARE_STATIC_CALL(key, _func); \
> + struct static_call_key key = { \
> + .func = _func, \
> + .site_mods = LIST_HEAD_INIT(key.site_mods), \
> + }; \
> + ARCH_STATIC_CALL_TEMPORARY_TRAMP(key)
> +
> +#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
> +
> +#define static_call_update(key, func) \
> +({ \
> + BUILD_BUG_ON(!__same_type(typeof(func), typeof(STATIC_CALL_TRAMP(key)))); \
> + __static_call_update(&key, func); \
> +})
> +
> +#define EXPORT_STATIC_CALL(key) \
> + EXPORT_SYMBOL(key); \
> + EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
> +
> +#define EXPORT_STATIC_CALL_GPL(key) \
> + EXPORT_SYMBOL_GPL(key); \
> + EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
> +
> +
> +#elif defined(CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
> +/* Unoptimized implementation */
> +
> +#define DECLARE_STATIC_CALL(key, func) \
> + extern typeof(func) STATIC_CALL_TRAMP(key)
> +
> +#define DEFINE_STATIC_CALL(key, func) \
> + DECLARE_STATIC_CALL(key, func); \
> + ARCH_STATIC_CALL_TRAMP(key, func)
> +
> +#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
> +
> +#define static_call_update(key, func) \
> +({ \
> + BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key))); \
> + cpus_read_lock(); \
> + arch_static_call_transform((unsigned long)STATIC_CALL_TRAMP(key),\
> + func); \
> + cpus_read_unlock(); \
> +})
> +
> +#define EXPORT_STATIC_CALL(key) \
> + EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
> +
> +#define EXPORT_STATIC_CALL_GPL(key) \
> + EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
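For reference, a sketch of the kind of permanent trampoline an arch
implementation might provide here (x86-64 flavored, illustrative only --
the real macro lives in the arch patches):

#define ARCH_STATIC_CALL_TRAMP(key, func)				\
	asm(".pushsection .text, \"ax\"			\n"	\
	    ".globl " STATIC_CALL_TRAMP_STR(key) "		\n"	\
	    STATIC_CALL_TRAMP_STR(key) ":			\n"	\
	    "	jmp " #func "					\n"	\
	    ".popsection")

static_call_update() then has arch_static_call_transform() rewrite the
jmp destination in place.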
> +
> +
> +#else /* Generic implementation */
> +
> +#define DECLARE_STATIC_CALL(key, func) \
> + extern typeof(func) *key
> +
> +#define DEFINE_STATIC_CALL(key, func) \
> + typeof(func) *key = func
> +
> +#define static_call(key, args...) \
> + key(args)
> +
> +#define static_call_update(key, func) \
> + WRITE_ONCE(key, func)
> +
> +#define EXPORT_STATIC_CALL(key) EXPORT_SYMBOL(key)
> +#define EXPORT_STATIC_CALL_GPL(key) EXPORT_SYMBOL_GPL(key)
> +
> +#endif /* CONFIG_HAVE_STATIC_CALL_OPTIMIZED */
> +
> +#endif /* _LINUX_STATIC_CALL_H */
> diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
> new file mode 100644
> index 000000000000..7dd4b3d7ec2b
> --- /dev/null
> +++ b/include/linux/static_call_types.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _STATIC_CALL_TYPES_H
> +#define _STATIC_CALL_TYPES_H
> +
> +#include <linux/stringify.h>
> +
> +#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
> +#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
> +
> +#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
> +#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
> +
> +/* The static call site table is created by objtool. */
> +struct static_call_site {
> + s32 addr;
> + s32 key;
> +};
> +
> +#endif /* _STATIC_CALL_TYPES_H */
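For reference, each entry amounts to something like this (a sketch with
hypothetical labels; objtool emits the equivalent relocations rather
than assembly):

	.section .static_call_sites, "a"
	.long	call_site - .	/* addr: PC-relative offset to the call */
	.long	my_key - .	/* key: PC-relative offset to the key */

Self-relative s32 offsets keep the table position-independent and half
the size of absolute 64-bit pointers.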
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 7343b3a9bff0..888bb476ddaf 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -103,6 +103,7 @@ obj-$(CONFIG_TRACEPOINTS) += trace/
> obj-$(CONFIG_IRQ_WORK) += irq_work.o
> obj-$(CONFIG_CPU_PM) += cpu_pm.o
> obj-$(CONFIG_BPF) += bpf/
> +obj-$(CONFIG_HAVE_STATIC_CALL_OPTIMIZED) += static_call.o
>
> obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/module.c b/kernel/module.c
> index 49a405891587..11124d991fcd 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -3121,6 +3121,11 @@ static int find_module_sections(struct module *mod, struct load_info *info)
> mod->ei_funcs = section_objs(info, "_error_injection_whitelist",
> sizeof(*mod->ei_funcs),
> &mod->num_ei_funcs);
> +#endif
> +#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
> + mod->static_call_sites = section_objs(info, ".static_call_sites",
> + sizeof(*mod->static_call_sites),
> + &mod->num_static_call_sites);
> #endif
> mod->extable = section_objs(info, "__ex_table",
> sizeof(*mod->extable), &mod->num_exentries);
> diff --git a/kernel/static_call.c b/kernel/static_call.c
> new file mode 100644
> index 000000000000..599ebc6fc4f1
> --- /dev/null
> +++ b/kernel/static_call.c
> @@ -0,0 +1,297 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/init.h>
> +#include <linux/static_call.h>
> +#include <linux/bug.h>
> +#include <linux/smp.h>
> +#include <linux/sort.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/cpu.h>
> +#include <linux/processor.h>
> +#include <asm/sections.h>
> +
> +extern struct static_call_site __start_static_call_sites[],
> + __stop_static_call_sites[];
> +
> +static bool static_call_initialized;
> +
> +/* mutex to protect key modules/sites */
> +static DEFINE_MUTEX(static_call_mutex);
> +
> +static void static_call_lock(void)
> +{
> + mutex_lock(&static_call_mutex);
> +}
> +
> +static void static_call_unlock(void)
> +{
> + mutex_unlock(&static_call_mutex);
> +}
> +
> +/*
> + * The site table stores PC-relative s32 offsets: each field holds the
> + * distance from the field itself to its target, so entries fit in 32
> + * bits and stay valid no matter where the image is loaded.
> + */
> +static inline unsigned long static_call_addr(struct static_call_site *site)
> +{
> +	return (long)site->addr + (long)&site->addr;
> +}
> +
> +static inline struct static_call_key *static_call_key(const struct static_call_site *site)
> +{
> +	return (struct static_call_key *)((long)site->key + (long)&site->key);
> +}
> +
> +static int static_call_site_cmp(const void *_a, const void *_b)
> +{
> + const struct static_call_site *a = _a;
> + const struct static_call_site *b = _b;
> + const struct static_call_key *key_a = static_call_key(a);
> + const struct static_call_key *key_b = static_call_key(b);
> +
> + if (key_a < key_b)
> + return -1;
> +
> + if (key_a > key_b)
> + return 1;
> +
> + return 0;
> +}
> +
> +/*
> + * Because the entries are PC-relative, moving an entry changes the
> + * location its offsets are relative to, so each offset must be rebiased
> + * by the distance between the two slots being swapped.
> + */
> +static void static_call_site_swap(void *_a, void *_b, int size)
> +{
> +	long delta = (unsigned long)_a - (unsigned long)_b;
> + struct static_call_site *a = _a;
> + struct static_call_site *b = _b;
> + struct static_call_site tmp = *a;
> +
> + a->addr = b->addr - delta;
> + a->key = b->key - delta;
> +
> + b->addr = tmp.addr + delta;
> + b->key = tmp.key + delta;
> +}
> +
> +static inline void static_call_sort_entries(struct static_call_site *start,
> + struct static_call_site *stop)
> +{
> + sort(start, stop - start, sizeof(struct static_call_site),
> + static_call_site_cmp, static_call_site_swap);
> +}
> +
> +void __static_call_update(struct static_call_key *key, void *func)
> +{
> + struct static_call_mod *mod;
> + struct static_call_site *site, *stop;
> +
> + cpus_read_lock();
> + static_call_lock();
> +
> + if (key->func == func)
> + goto done;
> +
> + key->func = func;
> +
> + /*
> + * If called before init, leave the call sites unpatched for now.
> + * In the meantime they'll continue to call the temporary trampoline.
> + */
> + if (!static_call_initialized)
> + goto done;
> +
> + list_for_each_entry(mod, &key->site_mods, list) {
Since I'm expecting a lot of sites, I'm wondering if we should just do
this as an array, like I do with the ftrace call sites.
But this can be an enhancement for later. Let's focus on getting this
working first.
> + if (!mod->sites) {
> + /*
> + * This can happen if the static call key is defined in
> + * a module which doesn't use it.
> + */
> + continue;
> + }
> +
> + stop = __stop_static_call_sites;
> +
> +#ifdef CONFIG_MODULES
> + if (mod->mod) {
BTW, "mod" is an unfortunate name.
> + stop = mod->mod->static_call_sites +
> + mod->mod->num_static_call_sites;
> + }
> +#endif
> +
> +		/*
> +		 * The sites are sorted by key, so all sites for this key
> +		 * are contiguous: stop at the first site for another key.
> +		 */
> +		for (site = mod->sites;
> +		     site < stop && static_call_key(site) == key; site++) {
> + unsigned long addr = static_call_addr(site);
> +
> + if (!mod->mod && init_section_contains((void *)addr, 1))
> + continue;
> + if (mod->mod && within_module_init(addr, mod->mod))
> + continue;
> +
So what's the reason for skipping init calls?
-- Steve
> + arch_static_call_transform(addr, func);
> + }
> + }
> +
> +done:
> + static_call_unlock();
> + cpus_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(__static_call_update);
> +
> +#ifdef CONFIG_MODULES
> +
> +static int static_call_add_module(struct module *mod)
> +{
> + struct static_call_site *start = mod->static_call_sites;
> + struct static_call_site *stop = mod->static_call_sites +
> + mod->num_static_call_sites;
> + struct static_call_site *site;
> + struct static_call_key *key, *prev_key = NULL;
> + struct static_call_mod *static_call_mod;
> +
> + if (start == stop)
> + return 0;
> +
> + module_disable_ro(mod);
> + static_call_sort_entries(start, stop);
> + module_enable_ro(mod, false);
> +
> + for (site = start; site < stop; site++) {
> + unsigned long addr = static_call_addr(site);
> +
> + if (within_module_init(addr, mod))
> + continue;
> +
> + key = static_call_key(site);
> + if (key != prev_key) {
> + prev_key = key;
> +
> + static_call_mod = kzalloc(sizeof(*static_call_mod), GFP_KERNEL);
> + if (!static_call_mod)
> + return -ENOMEM;
> +
> + static_call_mod->mod = mod;
> + static_call_mod->sites = site;
> + list_add_tail(&static_call_mod->list, &key->site_mods);
> +
> + if (is_module_address((unsigned long)key)) {
> + /*
> + * The trampoline should no longer be used.
> +				 * Poison it with a BUG() to catch any stray
> + * callers.
> + */
> + arch_static_call_poison_tramp(addr);
> + }
> + }
> +
> + arch_static_call_transform(addr, key->func);
> + }
> +
> + return 0;
> +}
> +
> +static void static_call_del_module(struct module *mod)
> +{
> + struct static_call_site *start = mod->static_call_sites;
> + struct static_call_site *stop = mod->static_call_sites +
> + mod->num_static_call_sites;
> + struct static_call_site *site;
> + struct static_call_key *key, *prev_key = NULL;
> + struct static_call_mod *static_call_mod;
> +
> + for (site = start; site < stop; site++) {
> + key = static_call_key(site);
> + if (key == prev_key)
> + continue;
> + prev_key = key;
> +
> + list_for_each_entry(static_call_mod, &key->site_mods, list) {
> + if (static_call_mod->mod == mod) {
> + list_del(&static_call_mod->list);
> + kfree(static_call_mod);
> + break;
> + }
> + }
> + }
> +}
> +
> +static int static_call_module_notify(struct notifier_block *nb,
> + unsigned long val, void *data)
> +{
> + struct module *mod = data;
> + int ret = 0;
> +
> + cpus_read_lock();
> + static_call_lock();
> +
> + switch (val) {
> + case MODULE_STATE_COMING:
> + ret = static_call_add_module(mod);
> + if (ret) {
> + WARN(1, "Failed to allocate memory for static calls");
> + static_call_del_module(mod);
> + }
> + break;
> + case MODULE_STATE_GOING:
> + static_call_del_module(mod);
> + break;
> + }
> +
> + static_call_unlock();
> + cpus_read_unlock();
> +
> + return notifier_from_errno(ret);
> +}
> +
> +static struct notifier_block static_call_module_nb = {
> + .notifier_call = static_call_module_notify,
> +};
> +
> +#endif /* CONFIG_MODULES */
> +
> +static int __init static_call_init(void)
> +{
> +	struct static_call_site *start = __start_static_call_sites;
> +	struct static_call_site *stop = __stop_static_call_sites;
> +	struct static_call_site *site;
> +	int ret = 0;
> +
> +	if (start == stop) {
> +		pr_warn("WARNING: empty static call table\n");
> +		return 0;
> +	}
> +
> +	cpus_read_lock();
> +	static_call_lock();
> +
> +	static_call_sort_entries(start, stop);
> +
> +	for (site = start; site < stop; site++) {
> +		struct static_call_key *key = static_call_key(site);
> +		unsigned long addr = static_call_addr(site);
> +
> +		if (list_empty(&key->site_mods)) {
> +			struct static_call_mod *mod;
> +
> +			mod = kzalloc(sizeof(*mod), GFP_KERNEL);
> +			if (!mod) {
> +				WARN(1, "Failed to allocate memory for static calls");
> +				ret = -ENOMEM;
> +				goto unlock;
> +			}
> +
> +			mod->sites = site;
> +			list_add_tail(&mod->list, &key->site_mods);
> +
> +			/*
> +			 * The trampoline should no longer be used. Poison
> +			 * it with a BUG() to catch any stray callers.
> +			 */
> +			arch_static_call_poison_tramp(addr);
> +		}
> +
> +		arch_static_call_transform(addr, key->func);
> +	}
> +
> +	static_call_initialized = true;
> +
> +unlock:
> +	static_call_unlock();
> +	cpus_read_unlock();
> +
> +#ifdef CONFIG_MODULES
> +	if (!ret)
> +		register_module_notifier(&static_call_module_nb);
> +#endif
> +
> +	return ret;
> +}
> +early_initcall(static_call_init);
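To make the optimized flow concrete, a sketch (x86-64 flavored, assuming
the arch support from the later patches) of one call site's life cycle:

	/* As compiled, before any patching: */
	call	____static_call_tramp_my_key

	/* The temporary per-key trampoline branches through key->func,
	 * which is why 'func' must be the first field of the key: */
	____static_call_tramp_my_key:
		jmp	*my_key(%rip)	/* indirect branch */

	/* After static_call_init(), and on each static_call_update(): */
	call	func_a			/* direct call, tramp poisoned */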
On Fri, 9 Nov 2018 10:41:37 -0600
Josh Poimboeuf <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> > On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> > > On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar <[email protected]> wrote:
> > > >
> > > >
> > > > All other usecases are bonus, but it would certainly be interesting to
> > > > investigate the impact of using these APIs for tracing: that too is a
> > > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > > users - so literally every single cycle or instruction saved or hot-path
> > > > shortened is a major win.
> > >
> > > For tracing, we'd want static_call_set_to_nop() or something like that, right?
> >
> > Are we talking about tracepoints? Or ftrace?
>
> Since ftrace changes calls to nops, and vice versa, I assume you meant
> ftrace. I don't think ftrace is a good candidate for this, as it's
> inherently more flexible than this API would reasonably allow.
>
Not sure what Andy was talking about, but I'm currently implementing
tracepoints to use this, as tracepoints use indirect calls, and are a
prime candidate for static calls, as I showed in my original RFC of
this feature.
-- Steve
On Fri, Nov 09, 2018 at 06:53:21PM +0100, Ard Biesheuvel wrote:
> > It's a bit fiddly since inline and out-of-line both use
> > arch_static_call_transform(), but what I need to do is basically:
> >
> > - for out-of-line, the trampoline needs to be patched into a
> > movn/movk/movk/br sequence if the target is too far
> > - for inline, the trampoline itself needs to be patched from
> > adrp/ldr/br (which does a load and a indirect branch) to
> > movn/movk/movk/br (which uses immediates), and the call sites need to
> > be patched into calls to the veneer if the target is out of range.
> >
> > So arch_static_call_transform() needs to know where the trampoline is
> > for this use case, so could we perhaps add a 'void *orig' in the key
> > struct that keeps track of the original value of 'addr' ?
>
> ... and pass it to arch_static_call_transform
Ok. So in both cases, you need both the call site address and the tramp
address, right? We could add a 'tramp' pointer to the key struct -- I
assume that's what you mean?
And also change
void arch_static_call_transform(unsigned long insn, void *dest)
to
void arch_static_call_transform(void *call_site, void *tramp, void *dest)
--
Josh
> On Nov 9, 2018, at 10:42 AM, Steven Rostedt <[email protected]> wrote:
>
> On Fri, 9 Nov 2018 10:41:37 -0600
> Josh Poimboeuf <[email protected]> wrote:
>
>>> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
>>>> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
>>>>> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar <[email protected]> wrote:
>>>>>
>>>>>
>>>>> All other usecases are bonus, but it would certainly be interesting to
>>>>> investigate the impact of using these APIs for tracing: that too is a
>>>>> feature enabled everywhere but utilized only by a small fraction of Linux
>>>>> users - so literally every single cycle or instruction saved or hot-path
>>>>> shortened is a major win.
>>>>
>>>> For tracing, we'd want static_call_set_to_nop() or something like that, right?
>>>
>>> Are we talking about tracepoints? Or ftrace?
>>
>> Since ftrace changes calls to nops, and vice versa, I assume you meant
>> ftrace. I don't think ftrace is a good candidate for this, as it's
>> inherently more flexible than this API would reasonably allow.
>>
>
> Not sure what Andy was talking about, but I'm currently implementing
> tracepoints to use this, as tracepoints use indirect calls, and are a
> prime candidate for static calls, as I showed in my original RFC of
> this feature.
>
>
Indeed.
Although I had assumed that tracepoints already had appropriate jump label magic.
On 9 November 2018 at 20:03, Josh Poimboeuf <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 06:53:21PM +0100, Ard Biesheuvel wrote:
>> > It's a bit fiddly since inline and out-of-line both use
>> > arch_static_call_transform(), but what I need to do is basically:
>> >
>> > - for out-of-line, the trampoline needs to be patched into a
>> > movn/movk/movk/br sequence if the target is too far
>> > - for inline, the trampoline itself needs to be patched from
>> > adrp/ldr/br (which does a load and a indirect branch) to
>> > movn/movk/movk/br (which uses immediates), and the call sites need to
>> > be patched into calls to the veneer if the target is out of range.
>> >
>> > So arch_static_call_transform() needs to know where the trampoline is
>> > for this use case, so could we perhaps add a 'void *orig' in the key
>> > struct that keeps track of the original value of 'addr' ?
>>
>> ... and pass it to arch_static_call_transform
>
> Ok. So in both cases, you need both the call site address and the tramp
> address, right? We could add a 'tramp' pointer to the key struct -- I
> assume that's what you mean?
>
Yes.
> And also change
>
> void arch_static_call_transform(unsigned long insn, void *dest)
>
> to
>
> void arch_static_call_transform(void *call_site, void *tramp, void *dest)
>
Indeed.
On Fri, Nov 09, 2018 at 01:33:37PM -0500, Steven Rostedt wrote:
> > +config HAVE_STATIC_CALL_OPTIMIZED
> > + bool
> > +
> > +config HAVE_STATIC_CALL_UNOPTIMIZED
> > + bool
> > +
>
> Let's also have
>
> config HAVE_STATIC_CALL
> 	def_bool y
> depends on HAVE_STATIC_CALL_OPTIMIZED || HAVE_STATIC_CALL_UNOPTIMIZED
>
> > +#if defined(CONFIG_HAVE_STATIC_CALL_OPTIMIZED) || \
> > + defined(CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED)
>
> Now we can have:
>
> #ifdef CONFIG_HAVE_STATIC_CALL
Yeah, that's better.
> > +#define DECLARE_STATIC_CALL(key, func) \
> > + extern struct static_call_key key; \
> > + extern typeof(func) STATIC_CALL_TRAMP(key); \
> > + /* Preserve the ELF symbol so objtool can access it: */ \
> > + __ADDRESSABLE(key)
>
> Does the __ADDRESSABLE(key) need to be in the DECLARE part?
> If so, there needs to be more explanation than just the comment above
> it.
For each call site, objtool creates a struct in .static_call_sites:
struct static_call_site {
s32 addr;
s32 key;
};
In order to do that, it needs to create a relocation which references
the key symbol. If the key is defined in another .o file, then the
current .o will not have an ELF symbol associated with the key. The
__ADDRESSABLE(key) thing tells GCC to leave the key symbol in the .o
file, even though it's not referenced anywhere. That makes objtool's
job easier, so it doesn't have to edit the symbol table.
I could add a comment saying as much, though it's hard to explain it in
fewer words than I just did :-)
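For reference (the macro itself appears in a diff further down the
thread), __ADDRESSABLE(key) boils down to:

	static void *__section(".discard.addressable") __used
		__addressable_key<line> = (void *)&key;

a pointer in a throwaway section whose initializer forces the compiler
to keep the 'key' symbol in the object file.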
> > + /*
> > + * If called before init, leave the call sites unpatched for now.
> > + * In the meantime they'll continue to call the temporary trampoline.
> > + */
> > + if (!static_call_initialized)
> > + goto done;
> > +
> > + list_for_each_entry(mod, &key->site_mods, list) {
>
> Since I'm expecting a lot of sites, I'm wondering if we should just do
> this as an array, like I do with the ftrace call sites.
>
> But this can be an enhancement for later. Let's focus on getting this
> working first.
But it's not a static list. It can grow/shrink as modules are
loaded/unloaded.
>
> > + if (!mod->sites) {
> > + /*
> > + * This can happen if the static call key is defined in
> > + * a module which doesn't use it.
> > + */
> > + continue;
> > + }
> > +
> > + stop = __stop_static_call_sites;
> > +
> > +#ifdef CONFIG_MODULES
> > + if (mod->mod) {
>
> BTW, "mod" is an unfortunate name.
True. I'll change it to 'site_mod'.
>
> > + stop = mod->mod->static_call_sites +
> > + mod->mod->num_static_call_sites;
> > + }
> > +#endif
> > +
> > + for (site = mod->sites;
> > + site < stop && static_call_key(site) == key; site++) {
> > + unsigned long addr = static_call_addr(site);
> > +
> > + if (!mod->mod && init_section_contains((void *)addr, 1))
> > + continue;
> > + if (mod->mod && within_module_init(addr, mod->mod))
> > + continue;
> > +
>
>
> So what's the reason for skipping init calls?
This is the runtime changing code (static_call_update). Presumably the
init sections no longer exist and we shouldn't write to any (former)
call sites there.
That's probably a dangerous assumption though... If
static_call_update() were called early, some init code might not get
patched and then call into the wrong function.
I'm thinking we should just disallow static call sites in init sections.
I can't think of a good reason why they would be needed in init code.
We can WARN when detecting them during boot / module init.
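For vmlinux that check could be as simple as this sketch, reusing the
helper the patch already uses to skip init addresses:

	if (init_section_contains((void *)addr, 1))
		WARN(1, "static call site in init section");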
--
Josh
On Fri, 9 Nov 2018 11:05:51 -0800
Andy Lutomirski <[email protected]> wrote:
> > Not sure what Andy was talking about, but I'm currently implementing
> > tracepoints to use this, as tracepoints use indirect calls, and are a
> > prime candidate for static calls, as I showed in my original RFC of
> > this feature.
> >
> >
>
> Indeed.
>
> Although I had assumed that tracepoints already had appropriate jump label magic.
It does. But that's not the problem I was trying to solve. It's that
tracing took an 8% nosedive with retpolines enabled (hackbench
slowed down by 8% with all the trace events enabled compared to all
trace events enabled without retpoline). That is, normal users (those
not tracing) are not affected by retpoline slowing down trace events.
Those that care about performance when they are tracing are affected
by retpoline, quite drastically.
I'm doing another test run and measurements, to see how much the
unoptimized trampolines help, followed by the optimized case.
-- Steve
On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> On Fri, 9 Nov 2018 11:05:51 -0800
> Andy Lutomirski <[email protected]> wrote:
>
> > > Not sure what Andy was talking about, but I'm currently implementing
> > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > prime candidate for static calls, as I showed in my original RFC of
> > > this feature.
> > >
> > >
> >
> > Indeed.
> >
> > Although I had assumed that tracepoints already had appropriate jump label magic.
>
> It does. But that's not the problem I was trying to solve. It's that
> tracing took an 8% nosedive with retpolines enabled (hackbench
> slowed down by 8% with all the trace events enabled compared to all
> trace events enabled without retpoline). That is, normal users (those
> not tracing) are not affected by retpoline slowing down trace events.
> Those that care about performance when they are tracing are affected
> by retpoline, quite drastically.
>
> I'm doing another test run and measurements, to see how much the
> unoptimized trampolines help, followed by the optimized case.
Are you sure you're using unoptimized? Optimized is the default on
x86-64 (with my third patch).
--
Josh
On Fri, 9 Nov 2018 13:35:05 -0600
Josh Poimboeuf <[email protected]> wrote:
> > > +#define DECLARE_STATIC_CALL(key, func) \
> > > + extern struct static_call_key key; \
> > > + extern typeof(func) STATIC_CALL_TRAMP(key); \
> > > + /* Preserve the ELF symbol so objtool can access it: */ \
> > > + __ADDRESSABLE(key)
> >
> > Does the __ADDRESSABLE(key) need to be in the DECLARE part?
> > If so, there needs to be more explanation than just the comment above
> > it.
>
> For each call site, objtool creates a struct in .static_call_sites:
>
> struct static_call_site {
> s32 addr;
> s32 key;
> };
>
> In order to do that, it needs to create a relocation which references
> the key symbol. If the key is defined in another .o file, then the
> current .o will not have an ELF symbol associated with the key. The
> __ADDRESSABLE(key) thing tells GCC to leave the key symbol in the .o
> file, even though it's not referenced anywhere. That makes objtool's
> job easier, so it doesn't have to edit the symbol table.
>
> I could add a comment saying as much, though it's hard to explain it in
> fewer words than I just did :-)
Does this have to do with adding the references by relative address?
In recordmcount, I just picked an existing symbol and referenced that.
But perhaps this is a cleaner way.
Adding a more in-depth comment won't hurt.
>
> > > + /*
> > > + * If called before init, leave the call sites unpatched for now.
> > > + * In the meantime they'll continue to call the temporary trampoline.
> > > + */
> > > + if (!static_call_initialized)
> > > + goto done;
> > > +
> > > + list_for_each_entry(mod, &key->site_mods, list) {
> >
> > Since I'm expecting a lot of sites, I'm wondering if we should just do
> > this as an array, like I do with the ftrace call sites.
> >
> > But this can be an enhancement for later. Let's focus on getting this
> > working first.
>
> But it's not a static list. It can grow/shrink as modules are
> loaded/unloaded.
Neither is ftrace :-) What I did was make one array for the core kernel
code, and an array for each module, and link list those (single link,
although double link may not be hard either). That saves a lot more
memory than having each instance carry a link pointer, as it only grows
or shrinks in chunks.
>
> >
> > > + if (!mod->sites) {
> > > + /*
> > > + * This can happen if the static call key is defined in
> > > + * a module which doesn't use it.
> > > + */
> > > + continue;
> > > + }
> > > +
> > > + stop = __stop_static_call_sites;
> > > +
> > > +#ifdef CONFIG_MODULES
> > > + if (mod->mod) {
> >
> > BTW, "mod" is an unfortunate name.
>
> True. I'll change it to 'site_mod'.
>
> >
> > > + stop = mod->mod->static_call_sites +
> > > + mod->mod->num_static_call_sites;
> > > + }
> > > +#endif
> > > +
> > > + for (site = mod->sites;
> > > + site < stop && static_call_key(site) == key; site++) {
> > > + unsigned long addr = static_call_addr(site);
> > > +
> > > + if (!mod->mod && init_section_contains((void *)addr, 1))
> > > + continue;
> > > + if (mod->mod && within_module_init(addr, mod->mod))
> > > + continue;
> > > +
> >
> >
> > So what's the reason for skipping init calls?
>
> This is the runtime changing code (static_call_update). Presumably the
> init sections no longer exist and we shouldn't write to any (former)
> call sites there.
>
> That's probably a dangerous assumption though... If
> static_call_update() were called early, some init code might not get
> patched and then call into the wrong function.
>
> I'm thinking we should just disallow static call sites in init sections.
> I can't think of a good reason why they would be needed in init code.
> We can WARN when detecting them during boot / module init.
>
What I would do is to allow init (like ftrace now does). I have
ftrace_free_init_mem() that removes all the mcount references for init
calls from its list. You could add a static_call_free_init() to
kernel_init() in init/main.c too.
-- Steve
On Fri, 9 Nov 2018 13:44:09 -0600
Josh Poimboeuf <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> > On Fri, 9 Nov 2018 11:05:51 -0800
> > Andy Lutomirski <[email protected]> wrote:
> >
> > > > Not sure what Andy was talking about, but I'm currently implementing
> > > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > > prime candidate for static calls, as I showed in my original RFC of
> > > > this feature.
> > > >
> > > >
> > >
> > > Indeed.
> > >
> > > Although I had assumed that tracepoints already had appropriate jump label magic.
> >
> > It does. But that's not the problem I was trying to solve. It's that
> > tracing took an 8% nosedive with retpolines enabled (hackbench
> > slowed down by 8% with all the trace events enabled compared to all
> > trace events enabled without retpoline). That is, normal users (those
> > not tracing) are not affected by retpoline slowing down trace events.
> > Those that care about performance when they are tracing are affected
> > by retpoline, quite drastically.
> >
> > I'm doing another test run and measurements, to see how much the
> > unoptimized trampolines help, followed by the optimized case.
>
> Are you sure you're using unoptimized? Optimized is the default on
> x86-64 (with my third patch).
>
Yes, because I haven't applied that third patch yet ;-)
Then I'll apply it and see how much that improves things.
-- Steve
On Fri, Nov 09, 2018 at 02:57:46PM -0500, Steven Rostedt wrote:
> On Fri, 9 Nov 2018 13:35:05 -0600
> Josh Poimboeuf <[email protected]> wrote:
>
>
> > > > +#define DECLARE_STATIC_CALL(key, func) \
> > > > + extern struct static_call_key key; \
> > > > + extern typeof(func) STATIC_CALL_TRAMP(key); \
> > > > + /* Preserve the ELF symbol so objtool can access it: */ \
> > > > + __ADDRESSABLE(key)
> > >
> > > Does the __ADDRESSABLE(key) need to be in the DECLARE part?
> > > If so, there needs to be more explanation than just the comment above
> > > it.
> >
> > For each call site, objtool creates a struct in .static_call_sites:
> >
> > struct static_call_site {
> > s32 addr;
> > s32 key;
> > };
> >
> > In order to do that, it needs to create a relocation which references
> > the key symbol. If the key is defined in another .o file, then the
> > current .o will not have an ELF symbol associated with the key. The
> > __ADDRESSABLE(key) thing tells GCC to leave the key symbol in the .o
> > file, even though it's not referenced anywhere. That makes objtool's
> > job easier, so it doesn't have to edit the symbol table.
> >
> > I could add a comment saying as much, though it's hard to explain it in
> > fewer words than I just did :-)
>
> Does this have to do with adding the references by relative address?
>
> In recordmcount, I just picked an existing symbol and referenced that.
> But perhaps this is a cleaner way.
I think recordmcount is different. It creates references (in
__mcount_loc) to functions which are already in the object file, so they
already have symbols associated with them.
But in this case, when objtool is creating references, the symbol it
needs to reference is outside the .o file, so there's no symbol to
associate it with.
> Adding a more in-depth comment won't hurt.
>
> >
> > > > + /*
> > > > + * If called before init, leave the call sites unpatched for now.
> > > > + * In the meantime they'll continue to call the temporary trampoline.
> > > > + */
> > > > + if (!static_call_initialized)
> > > > + goto done;
> > > > +
> > > > + list_for_each_entry(mod, &key->site_mods, list) {
> > >
> > > Since I'm expecting a lot of sites, I'm wondering if we should just do
> > > this as an array, like I do with the ftrace call sites.
> > >
> > > But this can be an enhancement for later. Let's focus on getting this
> > > working first.
> >
> > But it's not a static list. It can grow/shrink as modules are
> > loaded/unloaded.
>
> Neither is ftrace :-) What I did was make one array for the core kernel
> code, and an array for each module, and link list those (single link,
> although double link may not be hard either). That saves a lot more
> memory than having each instance carry a link pointer, as it only grows
> or shrinks in chunks.
That sounds exactly like what I did :-)
The site_mods list is a list of static_call_mods. Each static_call_mod
has a pointer to an array (which is that module/vmlinux's
.static_call_site section).
> > > So what's the reason for skipping init calls?
> >
> > This is the runtime changing code (static_call_update). Presumably the
> > init sections no longer exist and we shouldn't write to any (former)
> > call sites there.
> >
> > That's probably a dangerous assumption though... If
> > static_call_update() were called early, some init code might not get
> > patched and then call into the wrong function.
> >
> > I'm thinking we should just disallow static call sites in init sections.
> > I can't think of a good reason why they would be needed in init code.
> > We can WARN when detecting them during boot / module init.
> >
>
> What I would do is to allow init (like ftrace now does). I have
> ftrace_free_init_mem() that removes all the mcount references for init
> calls from its list. You could add a static_call_free_init() to
> kernel_init() in init/main.c too.
That makes sense for ftrace, but I don't see much point in allowing it
for static calls. Maybe we could just add support for it later if it
turns out to be useful.
--
Josh
On Fri, Nov 09, 2018 at 02:59:18PM -0500, Steven Rostedt wrote:
> On Fri, 9 Nov 2018 13:44:09 -0600
> Josh Poimboeuf <[email protected]> wrote:
>
> > On Fri, Nov 09, 2018 at 02:37:03PM -0500, Steven Rostedt wrote:
> > > On Fri, 9 Nov 2018 11:05:51 -0800
> > > Andy Lutomirski <[email protected]> wrote:
> > >
> > > > > Not sure what Andy was talking about, but I'm currently implementing
> > > > > tracepoints to use this, as tracepoints use indirect calls, and are a
> > > > > prime candidate for static calls, as I showed in my original RFC of
> > > > > this feature.
> > > > >
> > > > >
> > > >
> > > > Indeed.
> > > >
> > > > Although I had assumed that tracepoints already had appropriate jump label magic.
> > >
> > > It does. But that's not the problem I was trying to solve. It's that
> > > tracing took an 8% nosedive with retpolines enabled (hackbench
> > > slowed down by 8% with all the trace events enabled compared to all
> > > trace events enabled without retpoline). That is, normal users (those
> > > not tracing) are not affected by retpoline slowing down trace events.
> > > Those that care about performance when they are tracing are affected
> > > by retpoline, quite drastically.
> > >
> > > I'm doing another test run and measurements, to see how much the
> > > unoptimized trampolines help, followed by the optimized case.
> >
> > Are you sure you're using unoptimized? Optimized is the default on
> > x86-64 (with my third patch).
> >
>
> Yes, because I haven't applied that third patch yet ;-)
>
> Then I'll apply it and see how much that improves things.
Ah, good. That will be interesting to see the difference between
optimized/unoptimized.
--
Josh
On 09/11/2018 16.16, Andy Lutomirski wrote:
> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar <[email protected]> wrote:
>>
>>
>> All other usecases are bonus, but it would certainly be interesting to
>> investigate the impact of using these APIs for tracing: that too is a
>> feature enabled everywhere but utilized only by a small fraction of Linux
>> users - so literally every single cycle or instruction saved or hot-path
>> shortened is a major win.
>
> For tracing, we'd want static_call_set_to_nop() or something like that, right?
>
Hm. IIUC, when gcc sees static_call(key, ...), it has to generate code
to put the right values in %rdi, %rsi etc.. Even if the function is void
(*)(void), gcc would still need to shuffle things around (either spill
and reload, or move %rdi to some callee saved register). So if the
static_call is noop'ed out most of the time, that seems like a net loss?
With an unlikely static_key, gcc can do all the parameter setup and
reloading in an out-of-line chunk of code.
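For illustration, a sketch of the concern: with the call patched out,
the site would still look something like

	mov	0x4(%rbx), %esi	/* argument setup stays inline */
	mov	(%rbx), %edi
	nop			/* the patched-out call itself */

whereas an unlikely static_key branch hides the setup and the call
behind a single patched jump.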
static calls seem like a quite useful concept, but only/mostly if
_some_ function needs to be called at that spot.
Aside: there should be some compile-time check that
static_call_set_to_nop can only be used if the return type is void.
Rasmus
On Fri, 9 Nov 2018 14:34:59 -0600
Josh Poimboeuf <[email protected]> wrote:
I'm slowly massaging this to work with tracepoints.
But I hit a snag on this patch.
> On Fri, Nov 09, 2018 at 02:57:46PM -0500, Steven Rostedt wrote:
> > On Fri, 9 Nov 2018 13:35:05 -0600
> > Josh Poimboeuf <[email protected]> wrote:
> >
> >
> > > > > +#define DECLARE_STATIC_CALL(key, func) \
> > > > > + extern struct static_call_key key; \
> > > > > + extern typeof(func) STATIC_CALL_TRAMP(key); \
> > > > > + /* Preserve the ELF symbol so objtool can access it: */ \
> > > > > + __ADDRESSABLE(key)
> > > >
> > > > Does the __ADDRESSABLE(key) need to be in the DECLARE part?
> > > > If so, there needs to be more explanation than just the comment above
> > > > it.
> > >
> > > For each call site, objtool creates a struct in .static_call_sites:
> > >
> > > struct static_call_site {
> > > s32 addr;
> > > s32 key;
> > > };
> > >
> > > In order to do that, it needs to create a relocation which references
> > > the key symbol. If the key is defined in another .o file, then the
> > > current .o will not have an ELF symbol associated with the key. The
> > > __ADDRESSABLE(key) thing tells GCC to leave the key symbol in the .o
> > > file, even though it's not referenced anywhere. That makes objtool's
> > > job easier, so it doesn't have to edit the symbol table.
> > >
> > > I could add a comment saying as much, though it's hard to explain it in
> > > fewer words than I just did :-)
> >
> > Does this have to do with adding the references by relative address?
> >
> > In recordmcount, I just picked an existing symbol and referenced that.
> > But perhaps this is a cleaner way.
>
> I think recordmcount is different. It creates references (in
> __mcount_loc) to functions which are already in the object file, so they
> already have symbols associated with them.
>
> But in this case, when objtool is creating references, the symbol it
> needs to reference is outside the .o file, so there's no symbol to
> associate it with.
>
The __ADDRESSABLE() appears to fail when a header with a
DECLARE_STATIC_CALL() is included in the same compilation unit as the
DEFINE_STATIC_CALL(), because I'm getting this:
In file included from <command-line>:
/work/git/linux-trace.git/include/linux/compiler.h:285:11: error: redefinition of ‘__addressable___tp_func_sys_enter40’
__PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
^~~~~~~~~~~~~~
/work/git/linux-trace.git/include/linux/compiler_types.h:53:23: note: in definition of macro ‘___PASTE’
#define ___PASTE(a,b) a##b
^
/work/git/linux-trace.git/include/linux/compiler.h:285:3: note: in expansion of macro ‘__PASTE’
__PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
^~~~~~~
/work/git/linux-trace.git/include/linux/static_call.h:112:2: note: in expansion of macro ‘__ADDRESSABLE’
__ADDRESSABLE(key)
^~~~~~~~~~~~~
/work/git/linux-trace.git/include/linux/static_call.h:115:2: note: in expansion of macro ‘DECLARE_STATIC_CALL’
DECLARE_STATIC_CALL(key, _func); \
^~~~~~~~~~~~~~~~~~~
/work/git/linux-trace.git/include/linux/tracepoint.h:310:2: note: in expansion of macro ‘DEFINE_STATIC_CALL’
DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
^~~~~~~~~~~~~~~~~~
/work/git/linux-trace.git/include/trace/define_trace.h:42:2: note: in expansion of macro ‘DEFINE_TRACE_FN’
DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
^~~~~~~~~~~~~~~
/work/git/linux-trace.git/include/trace/events/syscalls.h:18:1: note: in expansion of macro ‘TRACE_EVENT_FN’
TRACE_EVENT_FN(sys_enter,
^~~~~~~~~~~~~~
/work/git/linux-trace.git/include/linux/compiler.h:285:11: note: previous definition of ‘__addressable___tp_func_sys_enter40’ was here
__PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
^~~~~~~~~~~~~~
/work/git/linux-trace.git/include/linux/compiler_types.h:53:23: note: in definition of macro ‘___PASTE’
#define ___PASTE(a,b) a##b
^
/work/git/linux-trace.git/include/linux/compiler.h:285:3: note: in expansion of macro ‘__PASTE’
__PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
^~~~~~~
/work/git/linux-trace.git/include/linux/static_call.h:112:2: note: in expansion of macro ‘__ADDRESSABLE’
__ADDRESSABLE(key)
^~~~~~~~~~~~~
/work/git/linux-trace.git/include/linux/tracepoint.h:234:2: note: in expansion of macro ‘DECLARE_STATIC_CALL’
DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
^~~~~~~~~~~~~~~~~~~
/work/git/linux-trace.git/include/linux/tracepoint.h:421:2: note: in expansion of macro ‘__DECLARE_TRACE’
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
^~~~~~~~~~~~~~~
/work/git/linux-trace.git/include/linux/tracepoint.h:560:2: note: in expansion of macro ‘DECLARE_TRACE’
DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
^~~~~~~~~~~~~
/work/git/linux-trace.git/include/trace/events/syscalls.h:18:1: note: in expansion of macro ‘TRACE_EVENT_FN’
TRACE_EVENT_FN(sys_enter,
The complaint is on:
DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
And the previous definition is on:
DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
-- Steve
On 9 November 2018 at 21:34, Josh Poimboeuf <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 02:57:46PM -0500, Steven Rostedt wrote:
>> On Fri, 9 Nov 2018 13:35:05 -0600
>> Josh Poimboeuf <[email protected]> wrote:
>>
..
>> > > So what's the reason for skipping init calls?
>> >
>> > This is the runtime changing code (static_call_update). Presumably the
>> > init sections no longer exist and we shouldn't write to any (former)
>> > call sites there.
>> >
>> > That's probably a dangerous assumption though... If
>> > static_call_update() were called early, some init code might not get
>> > patched and then call into the wrong function.
>> >
>> > I'm thinking we should just disallow static call sites in init sections.
>> > I can't think of a good reason why they would be needed in init code.
>> > We can WARN when detecting them during boot / module init.
>> >
>>
>> What I would do is to allow init (like ftrace now does). I have
>> ftrace_free_init_mem() that removes all the mcount references for init
>> calls from its list. You could add a static_call_free_init() to
>> kernel_init() in init/main.c too.
>
> That makes sense for ftrace, but I don't see much point in allowing it
> for static calls. Maybe we could just add support for it later if it
> turns out to be useful.
>
I don't see how you can prevent that. Some arch may use a static call
in its version of some library code which could be used anywhere,
including in .init code.
On 10 November 2018 at 06:10, Steven Rostedt <[email protected]> wrote:
> On Fri, 9 Nov 2018 14:34:59 -0600
> Josh Poimboeuf <[email protected]> wrote:
>
> I'm slowly massaging this to work with tracepoints.
>
> But I hit a snag on this patch.
>
>> On Fri, Nov 09, 2018 at 02:57:46PM -0500, Steven Rostedt wrote:
>> > On Fri, 9 Nov 2018 13:35:05 -0600
>> > Josh Poimboeuf <[email protected]> wrote:
>> >
>> >
>> > > > > +#define DECLARE_STATIC_CALL(key, func) \
>> > > > > + extern struct static_call_key key; \
>> > > > > + extern typeof(func) STATIC_CALL_TRAMP(key); \
>> > > > > + /* Preserve the ELF symbol so objtool can access it: */ \
>> > > > > + __ADDRESSABLE(key)
>> > > >
>> > > > Does the __ADDRESSABLE(key) need to be in the DECLARE part?
>> > > > If so, there needs to be more explanation than just the comment above
>> > > > it.
>> > >
>> > > For each call site, objtool creates a struct in .static_call_sites:
>> > >
>> > > struct static_call_site {
>> > > s32 addr;
>> > > s32 key;
>> > > };
>> > >
>> > > In order to do that, it needs to create a relocation which references
>> > > the key symbol. If the key is defined in another .o file, then the
>> > > current .o will not have an ELF symbol associated with the key. The
>> > > __ADDRESSABLE(key) thing tells GCC to leave the key symbol in the .o
>> > > file, even though it's not referenced anywhere. That makes objtool's
>> > > job easier, so it doesn't have to edit the symbol table.
>> > >
>> > > I could add a comment saying as much, though it's hard to explain it in
>> > > fewer words than I just did :-)
>> >
>> > Does this have to do with adding the references by relative address?
>> >
>> > In recordmcount, I just picked an existing symbol and referenced that.
>> > But perhaps this is a cleaner way.
>>
>> I think recordmcount is different. It creates references (in
>> __mcount_loc) to functions which are already in the object file, so they
>> already have symbols associated with them.
>>
>> But in this case, when objtool is creating references, the symbol it
>> needs to reference is outside the .o file, so there's no symbol to
>> associate it with.
>>
>
> The __ADDRESSABLE() appears to fail when a header with a
> DECLARE_STATIC_CALL() is included in the same compilation unit as the
> DEFINE_STATIC_CALL(), because I'm getting this:
>
> In file included from <command-line>:
> /work/git/linux-trace.git/include/linux/compiler.h:285:11: error: redefinition of ‘__addressable___tp_func_sys_enter40’
> __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
> ^~~~~~~~~~~~~~
> /work/git/linux-trace.git/include/linux/compiler_types.h:53:23: note: in definition of macro ‘___PASTE’
> #define ___PASTE(a,b) a##b
> ^
> /work/git/linux-trace.git/include/linux/compiler.h:285:3: note: in expansion of macro ‘__PASTE’
> __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
> ^~~~~~~
> /work/git/linux-trace.git/include/linux/static_call.h:112:2: note: in expansion of macro ‘__ADDRESSABLE’
> __ADDRESSABLE(key)
> ^~~~~~~~~~~~~
> /work/git/linux-trace.git/include/linux/static_call.h:115:2: note: in expansion of macro ‘DECLARE_STATIC_CALL’
> DECLARE_STATIC_CALL(key, _func); \
> ^~~~~~~~~~~~~~~~~~~
> /work/git/linux-trace.git/include/linux/tracepoint.h:310:2: note: in expansion of macro ‘DEFINE_STATIC_CALL’
> DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
> ^~~~~~~~~~~~~~~~~~
> /work/git/linux-trace.git/include/trace/define_trace.h:42:2: note: in expansion of macro ‘DEFINE_TRACE_FN’
> DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))
> ^~~~~~~~~~~~~~~
> /work/git/linux-trace.git/include/trace/events/syscalls.h:18:1: note: in expansion of macro ‘TRACE_EVENT_FN’
> TRACE_EVENT_FN(sys_enter,
> ^~~~~~~~~~~~~~
> /work/git/linux-trace.git/include/linux/compiler.h:285:11: note: previous definition of ‘__addressable___tp_func_sys_enter40’ was here
> __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
> ^~~~~~~~~~~~~~
> /work/git/linux-trace.git/include/linux/compiler_types.h:53:23: note: in definition of macro ‘___PASTE’
> #define ___PASTE(a,b) a##b
> ^
> /work/git/linux-trace.git/include/linux/compiler.h:285:3: note: in expansion of macro ‘__PASTE’
> __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
> ^~~~~~~
> /work/git/linux-trace.git/include/linux/static_call.h:112:2: note: in expansion of macro ‘__ADDRESSABLE’
> __ADDRESSABLE(key)
> ^~~~~~~~~~~~~
> /work/git/linux-trace.git/include/linux/tracepoint.h:234:2: note: in expansion of macro ‘DECLARE_STATIC_CALL’
> DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
> ^~~~~~~~~~~~~~~~~~~
> /work/git/linux-trace.git/include/linux/tracepoint.h:421:2: note: in expansion of macro ‘__DECLARE_TRACE’
> __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
> ^~~~~~~~~~~~~~~
> /work/git/linux-trace.git/include/linux/tracepoint.h:560:2: note: in expansion of macro ‘DECLARE_TRACE’
> DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))
> ^~~~~~~~~~~~~
> /work/git/linux-trace.git/include/trace/events/syscalls.h:18:1: note: in expansion of macro ‘TRACE_EVENT_FN’
> TRACE_EVENT_FN(sys_enter,
>
> The complaint is on:
>
> DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
>
> And the previous definition is on:
>
> DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
>
Does the DECLARE really need the __ADDRESSABLE? Its purpose is to
ensure that symbols with static linkage are not optimized away, but
since the reference is from a header file, the symbol should have
external linkage anyway.
On Sat, 10 Nov 2018 12:58:08 +0100
Ard Biesheuvel <[email protected]> wrote:
> > The complaint is on:
> >
> > DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
> >
> > And the previous definition is on:
> >
> > DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
> >
>
> Does the DECLARE really need the __ADDRESSABLE? Its purpose is to
> ensure that symbols with static linkage are not optimized away, but
> since the reference is from a header file, the symbol should have
> external linkage anyway.
I applied the following change and it compiled fine:
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 90b580b95303..5f8a0f0e77be 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -108,8 +108,6 @@ extern void arch_static_call_poison_tramp(unsigned long insn);
#define DECLARE_STATIC_CALL(key, func) \
extern struct static_call_key key; \
extern typeof(func) STATIC_CALL_TRAMP(key); \
- /* Preserve the ELF symbol so objtool can access it: */ \
- __ADDRESSABLE(key)
#define DEFINE_STATIC_CALL(key, _func) \
DECLARE_STATIC_CALL(key, _func); \
@@ -117,7 +115,9 @@ extern void arch_static_call_poison_tramp(unsigned long insn);
.func = _func, \
.site_mods = LIST_HEAD_INIT(key.site_mods), \
}; \
- ARCH_STATIC_CALL_TEMPORARY_TRAMP(key)
+ ARCH_STATIC_CALL_TEMPORARY_TRAMP(key); \
+ /* Preserve the ELF symbol so objtool can access it: */ \
+ __ADDRESSABLE(key)
#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
Thanks!
-- Steve
On Fri, 9 Nov 2018 11:05:51 -0800
Andy Lutomirski <[email protected]> wrote:
>
>
> > On Nov 9, 2018, at 10:42 AM, Steven Rostedt <[email protected]> wrote:
> >
> > On Fri, 9 Nov 2018 10:41:37 -0600
> > Josh Poimboeuf <[email protected]> wrote:
> >
> >>> On Fri, Nov 09, 2018 at 09:21:39AM -0600, Josh Poimboeuf wrote:
> >>>> On Fri, Nov 09, 2018 at 07:16:17AM -0800, Andy Lutomirski wrote:
> >>>>> On Thu, Nov 8, 2018 at 11:28 PM Ingo Molnar <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> All other usecases are bonus, but it would certainly be interesting to
> >>>>> investigate the impact of using these APIs for tracing: that too is a
> >>>>> feature enabled everywhere but utilized only by a small fraction of Linux
> >>>>> users - so literally every single cycle or instruction saved or hot-path
> >>>>> shortened is a major win.
> >>>>
> >>>> For tracing, we'd want static_call_set_to_nop() or something like that, right?
> >>>
> >>> Are we talking about tracepoints? Or ftrace?
> >>
> >> Since ftrace changes calls to nops, and vice versa, I assume you meant
> >> ftrace. I don't think ftrace is a good candidate for this, as it's
> >> inherently more flexible than this API would reasonably allow.
> >>
> >
> > Not sure what Andy was talking about, but I'm currently implementing
> > tracepoints to use this, as tracepoints use indirect calls, and are a
> > prime candidate for static calls, as I showed in my original RFC of
> > this feature.
> >
> >
>
> Indeed.
>
> Although I had assumed that tracepoints already had appropriate jump label magic.
As far as I know, the jump label magic is for reducing the overhead when the
tracepoint is OFF (because it can skip function parameter preparation), and this
static call will be good when the tracepoint is ON (enabled), because it can
avoid the retpoline performance degradation.
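For illustration, a minimal sketch of that combination (hypothetical
names, using the API from this series):

	void my_tp_iter(int a, int b);

	DEFINE_STATIC_KEY_FALSE(my_tp_enabled);
	DEFINE_STATIC_CALL(my_tp_func, my_tp_iter);

	static inline void trace_my_tp(int a, int b)
	{
		/* jump label: skip everything (incl. argument setup) when OFF */
		if (static_branch_unlikely(&my_tp_enabled))
			/* static call: direct, non-retpoline call when ON */
			static_call(my_tp_func, a, b);
	}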
Thank you,
--
Masami Hiramatsu <[email protected]>
On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> On 9 November 2018 at 08:28, Ingo Molnar <[email protected]> wrote:
> >> - I'm not sure about the objtool approach. Objtool is (currently)
> >> x86-64 only, which means we have to use the "unoptimized" version
> >> everywhere else. I may experiment with a GCC plugin instead.
> >
> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > approach while GCC plugin would have to be replicated for Clang and any
> > other compilers, etc.
> >
>
> I implemented the GCC plugin approach here for arm64
I'm confused; I thought we only needed objtool for variable instruction
length architectures, because we can't reliably decode our instruction
stream. Otherwise we can fairly trivially use the DWARF relocation data,
no?
On 11 November 2018 at 00:20, Peter Zijlstra <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
>> On 9 November 2018 at 08:28, Ingo Molnar <[email protected]> wrote:
>> >> - I'm not sure about the objtool approach. Objtool is (currently)
>> >> x86-64 only, which means we have to use the "unoptimized" version
>> >> everywhere else. I may experiment with a GCC plugin instead.
>> >
>> > I'd prefer the objtool approach. It's a pretty reliable first-principles
>> > approach while GCC plugin would have to be replicated for Clang and any
>> > other compilers, etc.
>> >
>>
>> I implemented the GCC plugin approach here for arm64
>
> I'm confused; I thought we only needed objtool for variable instruction
> length architectures, because we can't reliably decode our instruction
> stream. Otherwise we can fairly trivially use the DWARF relocation data,
> no?
How would that work? We could build vmlinux with --emit-relocs, filter
out the static jump/call relocations and resolve the symbol names to
filter the ones associated with calls to trampolines. But then, we
have to build the static_call_sites section and reinject it back into
the image in some way, which is essentially objtool, no?
On Sun, Nov 11, 2018 at 02:42:55PM +0100, Ard Biesheuvel wrote:
> On 11 November 2018 at 00:20, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Nov 09, 2018 at 02:50:27PM +0100, Ard Biesheuvel wrote:
> >> On 9 November 2018 at 08:28, Ingo Molnar <[email protected]> wrote:
> >> >> - I'm not sure about the objtool approach. Objtool is (currently)
> >> >> x86-64 only, which means we have to use the "unoptimized" version
> >> >> everywhere else. I may experiment with a GCC plugin instead.
> >> >
> >> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> >> > approach while GCC plugin would have to be replicated for Clang and any
> >> > other compilers, etc.
> >> >
> >>
> >> I implemented the GCC plugin approach here for arm64
> >
> > I'm confused; I thought we only needed objtool for variable instruction
> > length architectures, because we can't reliably decode our instruction
> > stream. Otherwise we can fairly trivially use the DWARF relocation data,
> > no?
>
> How would that work? We could build vmlinux with --emit-relocs, filter
> out the static jump/call relocations and resolve the symbol names to
> filter the ones associated with calls to trampolines. But then, we
> have to build the static_call_sites section and reinject it back into
> the image in some way, which is essentially objtool, no?
It's a _much_ simpler tool than objtool, but yes, we need a tool that
reads the relocation stuff and (re)injects it in a new section -- we
don't need it on a vmlinux level, it can be done per TU.
Anyway, a GCC plugin (I still have to have a peek at your thing) sounds
like it should work just fine too.
On Sat, Nov 10, 2018 at 08:09:17AM -0500, Steven Rostedt wrote:
> On Sat, 10 Nov 2018 12:58:08 +0100
> Ard Biesheuvel <[email protected]> wrote:
>
>
> > > The complaint is on:
> > >
> > > DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
> > >
> > > And the previous definition is on:
> > >
> > > DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
> > >
> >
> > Does the DECLARE really need the __ADDRESSABLE? Its purpose is to
> > ensure that symbols with static linkage are not optimized away, but
> > since the reference is from a header file, the symbol should have
> > external linkage anyway.
Yes, DECLARE needs the __ADDRESSABLE. In the case where DECLARE
is used, but DEFINE is not, GCC strips the symbol.
> I applied the following change and it compiled fine:
>
> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> index 90b580b95303..5f8a0f0e77be 100644
> --- a/include/linux/static_call.h
> +++ b/include/linux/static_call.h
> @@ -108,8 +108,6 @@ extern void arch_static_call_poison_tramp(unsigned long insn);
> #define DECLARE_STATIC_CALL(key, func) \
> extern struct static_call_key key; \
> extern typeof(func) STATIC_CALL_TRAMP(key); \
> - /* Preserve the ELF symbol so objtool can access it: */ \
> - __ADDRESSABLE(key)
>
> #define DEFINE_STATIC_CALL(key, _func) \
> DECLARE_STATIC_CALL(key, _func); \
> @@ -117,7 +115,9 @@ extern void arch_static_call_poison_tramp(unsigned long insn);
> .func = _func, \
> .site_mods = LIST_HEAD_INIT(key.site_mods), \
> }; \
> - ARCH_STATIC_CALL_TEMPORARY_TRAMP(key)
> + ARCH_STATIC_CALL_TEMPORARY_TRAMP(key); \
> + /* Preserve the ELF symbol so objtool can access it: */ \
> + __ADDRESSABLE(key)
>
> #define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
Adding __ADDRESSABLE to the DEFINE macro doesn't do any good. By
definition, the key is defined in the .o file, so the symbol already
exists.
The issue you're seeing is really an issue with the __ADDRESSABLE macro
not creating unique symbol names. This should fix it:
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 06396c1cf127..4bb73fd918b5 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -282,7 +282,7 @@ unsigned long read_word_at_a_time(const void *addr)
*/
#define __ADDRESSABLE(sym) \
static void * __section(".discard.addressable") __used \
- __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
+ __UNIQUE_ID(__addressable_##sym) = (void *)&sym;
/**
* offset_to_ptr - convert a relative memory offset to an absolute pointer
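The collision arises because __LINE__ expands to the line of the
outermost macro invocation, so two __ADDRESSABLE() expansions reached
from the same source line paste identical names. A sketch (line number
and key name hypothetical):

/* Both expansions originate from line 42: */
static void *__addressable_my_key42 = (void *)&my_key;	/* first */
static void *__addressable_my_key42 = (void *)&my_key;	/* redefinition */

__UNIQUE_ID() pastes __COUNTER__ instead, which increments on every
use, so each expansion gets a distinct name.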
On 12 November 2018 at 04:07, Josh Poimboeuf <[email protected]> wrote:
> On Sat, Nov 10, 2018 at 08:09:17AM -0500, Steven Rostedt wrote:
>> On Sat, 10 Nov 2018 12:58:08 +0100
>> Ard Biesheuvel <[email protected]> wrote:
>>
>>
>> > > The complaint is on:
>> > >
>> > > DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
>> > >
>> > > And the previous definition is on:
>> > >
>> > > DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
>> > >
>> >
>> > Does the DECLARE really need the __ADDRESSABLE? Its purpose is to
>> > ensure that symbols with static linkage are not optimized away, but
>> > since the reference is from a header file, the symbol should have
>> > external linkage anyway.
>
> Yes, DECLARE needs the __ADDRESSABLE. In the case where DECLARE
> is used, but DEFINE is not, GCC strips the symbol.
>
I assume DECLARE() is intended for use in header files, and DEFINE()
for source files, no?
Doesn't that mean that whatever symbol __ADDRESSABLE() refers to
should have external linkage, in which case it is addressable anyway?
Or are we talking about some LTO / --gc-sections use case here?
>> I applied the following change and it compiled fine:
>>
>> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
>> index 90b580b95303..5f8a0f0e77be 100644
>> --- a/include/linux/static_call.h
>> +++ b/include/linux/static_call.h
>> @@ -108,8 +108,6 @@ extern void arch_static_call_poison_tramp(unsigned long insn);
>> #define DECLARE_STATIC_CALL(key, func) \
>> extern struct static_call_key key; \
>> extern typeof(func) STATIC_CALL_TRAMP(key); \
>> - /* Preserve the ELF symbol so objtool can access it: */ \
>> - __ADDRESSABLE(key)
>>
>> #define DEFINE_STATIC_CALL(key, _func) \
>> DECLARE_STATIC_CALL(key, _func); \
>> @@ -117,7 +115,9 @@ extern void arch_static_call_poison_tramp(unsigned long insn);
>> .func = _func, \
>> .site_mods = LIST_HEAD_INIT(key.site_mods), \
>> }; \
>> - ARCH_STATIC_CALL_TEMPORARY_TRAMP(key)
>> + ARCH_STATIC_CALL_TEMPORARY_TRAMP(key); \
>> + /* Preserve the ELF symbol so objtool can access it: */ \
>> + __ADDRESSABLE(key)
>>
>> #define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
>
> Adding __ADDRESSABLE to the DEFINE macro doesn't do any good. By
> definition, the key is defined in the .o file, so the symbol already
> exists.
>
> The issue you're seeing is really an issue with the __ADDRESSABLE macro
> not creating unique symbol names. This should fix it:
>
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 06396c1cf127..4bb73fd918b5 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -282,7 +282,7 @@ unsigned long read_word_at_a_time(const void *addr)
> */
> #define __ADDRESSABLE(sym) \
> static void * __section(".discard.addressable") __used \
> - __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
> + __UNIQUE_ID(__addressable_##sym) = (void *)&sym;
>
> /**
> * offset_to_ptr - convert a relative memory offset to an absolute pointer
Not sure if it matters, but we'll end up with a lot more stuff in
.discard.addressable this way.
On Mon, Nov 12, 2018 at 05:39:38AM +0100, Ard Biesheuvel wrote:
> On 12 November 2018 at 04:07, Josh Poimboeuf <[email protected]> wrote:
> > On Sat, Nov 10, 2018 at 08:09:17AM -0500, Steven Rostedt wrote:
> >> On Sat, 10 Nov 2018 12:58:08 +0100
> >> Ard Biesheuvel <[email protected]> wrote:
> >>
> >>
> >> > > The complaint is on:
> >> > >
> >> > > DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
> >> > >
> >> > > And the previous definition is on:
> >> > >
> >> > > DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
> >> > >
> >> >
> >> > Does the DECLARE really need the __ADDRESSABLE? Its purpose is to
> >> > ensure that symbols with static linkage are not optimized away, but
> >> > since the reference is from a header file, the symbol should have
> >> > external linkage anyway.
> >
> > Yes, DECLARE needs the __ADDRESSABLE. In the case where DECLARE
> > is used, but DEFINE is not, GCC strips the symbol.
> >
>
> I assume DECLARE() is intended for use in header files, and DEFINE()
> for source files, no?
Right.
> Doesn't that mean that whatever symbol __ADDRESSABLE() refers to
> should have external linkage, in which case it is addressable anyway?
> Or are we talking about some LTO / --gc-sections use case here?
If the key is declared, but not used, GCC doesn't put the key's ELF
symbol in the binary's symbol table. That makes objtool's life harder,
because if the file has a call site, then objtool has to add the key
symbol to the symbol table, so that it can create a relocation (in the
call site table) which references the symbol.
--
Josh
On 12 November 2018 at 05:56, Josh Poimboeuf <[email protected]> wrote:
> On Mon, Nov 12, 2018 at 05:39:38AM +0100, Ard Biesheuvel wrote:
>> On 12 November 2018 at 04:07, Josh Poimboeuf <[email protected]> wrote:
>> > On Sat, Nov 10, 2018 at 08:09:17AM -0500, Steven Rostedt wrote:
>> >> On Sat, 10 Nov 2018 12:58:08 +0100
>> >> Ard Biesheuvel <[email protected]> wrote:
>> >>
>> >>
>> >> > > The complaint is on:
>> >> > >
>> >> > > DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);
>> >> > >
>> >> > > And the previous definition is on:
>> >> > >
>> >> > > DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
>> >> > >
>> >> >
>> >> > Does the DECLARE really need the __ADDRESSABLE? Its purpose is to
>> >> > ensure that symbols with static linkage are not optimized away, but
>> >> > since the reference is from a header file, the symbol should have
>> >> > external linkage anyway.
>> >
>> > Yes, DECLARE needs the __ADDRESSABLE. In the case where DECLARE
>> > is used, but DEFINE is not, GCC strips the symbol.
>> >
>>
>> I assume DECLARE() is intended for use in header files, and DEFINE()
>> for source files, no?
>
> Right.
>
>> Doesn't that mean that whatever symbol __ADDRESSABLE() refers to
>> should have external linkage, in which case it it addressable anyway?
>> Or are we talking about some LTO / --gc-sections use case here?
>
> If the key is declared, but not used, GCC doesn't put the key's ELF
> symbol in the binary's symbol table. That makes objtool's life harder,
> because if the file has a call site, then objtool has to add the key
> symbol to the symbol table, so that it can create a relocation (in the
> call site table) which references the symbol.
>
Ah, right. So this is specific to objtool rather than a generic issue.
Should we perhaps wrap __ADDRESSABLE() into something else that
resolves to a NOP unless on X86, and give it a clear name?
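Something along these lines, perhaps (the wrapper name is just a
placeholder; only the objtool-based implementation needs the symbol
preserved):

/* Keep the key's ELF symbol only where objtool will consume it. */
#ifdef CONFIG_HAVE_STATIC_CALL_OPTIMIZED
# define STATIC_CALL_ADDRESSABLE(key)	__ADDRESSABLE(key)
#else
# define STATIC_CALL_ADDRESSABLE(key)
#endif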
* Josh Poimboeuf <[email protected]> wrote:
> On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > x86-64 only, which means we have to use the "unoptimized" version
> > > everywhere else. I may experiment with a GCC plugin instead.
> >
> > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > approach while GCC plugin would have to be replicated for Clang and any
> > other compilers, etc.
>
> The benefit of a plugin is that we'd only need two of them: GCC and
> Clang. And presumably, they'd share a lot of code.
>
> The prospect of porting objtool to all architectures is going to be much
> more of a daunting task (though we are at least already considering it
> for some arches).
Which architectures would benefit from ORC support the most?
I really think that hard reliance on GCC plugins is foolish - but maybe
Clang's plugin infrastructure is a guarantee that it remains a sane and
usable interface.
> > I'd be very happy with a demonstrated paravirt optimization already -
> > i.e. seeing the before/after effect on the vmlinux with an x86 distro
> > config.
> >
> > All major Linux distributions enable CONFIG_PARAVIRT=y and
> > CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> > as possible in the 99.999% cases where it's not used is a primary
> > concern.
>
> For paravirt, I was thinking of it as more of a cleanup than an
> optimization. The paravirt patching code already replaces indirect
> branches with direct ones -- see paravirt_patch_default().
>
> Though it *would* reduce the instruction footprint a bit, as the 7-byte
> indirect calls (later patched to 5-byte direct + 2-byte nop) would
> instead be 5-byte direct calls to begin with.
Yes.
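Roughly, the encodings in question (the xx operand bytes are
placeholders):

ff 14 25 xx xx xx xx	/* 7-byte indirect call through a pv_ops entry */
e8 xx xx xx xx 66 90	/* 5-byte direct call + 2-byte NOP, after patching */
e8 xx xx xx xx		/* 5-byte direct call site from the start */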
> > All other usecases are bonus, but it would certainly be interesting to
> > investigate the impact of using these APIs for tracing: that too is a
> > feature enabled everywhere but utilized only by a small fraction of Linux
> > users - so literally every single cycle or instruction saved or hot-path
> > shortened is a major win.
>
> With retpolines, and with tracepoints enabled, it's definitely a major
> win. Steve measured an 8.9% general slowdown on hackbench caused by
> retpolines.
How much of that slowdown is reversed?
> But with tracepoints disabled, I believe static jumps are used, which
> already minimizes the impact on hot paths.
Yeah.
Thanks,
Ingo
On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
>
> * Josh Poimboeuf <[email protected]> wrote:
>
> > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > > x86-64 only, which means we have to use the "unoptimized" version
> > > > everywhere else. I may experiment with a GCC plugin instead.
> > >
> > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > approach while GCC plugin would have to be replicated for Clang and any
> > > other compilers, etc.
> >
> > The benefit of a plugin is that we'd only need two of them: GCC and
> > Clang. And presumably, they'd share a lot of code.
> >
> > The prospect of porting objtool to all architectures is going to be much
> > more of a daunting task (though we are at least already considering it
> > for some arches).
>
> Which architectures would benefit from ORC support the most?
According to my (limited and potentially flawed) knowledge, I think
arm64 would benefit the most performance-wise, whereas powerpc and s390
gains would be quite a bit less.
We may have to port objtool to arm64 anyway, for live patching. But
that will be a lot more work than it took for Ard to write a GCC plugin.
> I really think that hard reliance on GCC plugins is foolish
Funny, I feel the same way about hard reliance on objtool :-)
> - but maybe Clang's plugin infrastructure is a guarantee that it
> remains a sane and usable interface.
Hopefully so. If it breaks, we could always write another tool, as the
work is straightforward. Or we could make it an objtool subcommand
which works on all arches.
> > > All other usecases are bonus, but it would certainly be interesting to
> > > investigate the impact of using these APIs for tracing: that too is a
> > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > users - so literally every single cycle or instruction saved or hot-path
> > > shortened is a major win.
> >
> > With retpolines, and with tracepoints enabled, it's definitely a major
> > win. Steve measured an 8.9% general slowdown on hackbench caused by
> > retpolines.
>
> How much of that slowdown is reversed?
In theory, it should reverse all of the slowdown, and actually may even
speed it up a little. Steve is working on measuring that now.
--
Josh
On Sun, Nov 11, 2018 at 9:02 PM Ingo Molnar <[email protected]> wrote:
>
>
> * Josh Poimboeuf <[email protected]> wrote:
>
> > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > > x86-64 only, which means we have to use the "unoptimized" version
> > > > everywhere else. I may experiment with a GCC plugin instead.
> > >
> > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > approach while GCC plugin would have to be replicated for Clang and any
> > > other compilers, etc.
> >
> > The benefit of a plugin is that we'd only need two of them: GCC and
> > Clang. And presumably, they'd share a lot of code.
> >
> > The prospect of porting objtool to all architectures is going to be much
> > more of a daunting task (though we are at least already considering it
> > for some arches).
>
> Which architectures would benefit from ORC support the most?
>
> I really think that hard reliance on GCC plugins is foolish - but maybe
> Clang's plugin infrastructure is a guarantee that it remains a sane and
> usable interface.
>
> > > I'd be very happy with a demonstrated paravirt optimization already -
> > > i.e. seeing the before/after effect on the vmlinux with an x86 distro
> > > config.
> > >
> > > All major Linux distributions enable CONFIG_PARAVIRT=y and
> > > CONFIG_PARAVIRT_XXL=y on x86 at the moment, so optimizing it away as much
> > > as possible in the 99.999% cases where it's not used is a primary
> > > concern.
> >
> > For paravirt, I was thinking of it as more of a cleanup than an
> > optimization. The paravirt patching code already replaces indirect
> > branches with direct ones -- see paravirt_patch_default().
> >
> > Though it *would* reduce the instruction footprint a bit, as the 7-byte
> > indirect calls (later patched to 5-byte direct + 2-byte nop) would
> > instead be 5-byte direct calls to begin with.
>
> Yes.
It would be a huge cleanup IMO -- the existing PVOP call stuff is
really quite ugly. Also, the existing stuff tries to emulate the
semantics of passing parameters of unknown types using asm
constraints, and I just don't believe that GCC does what we want it to
do. In general, passing the *value* of a pointer to asm doesn't seem
to convince gcc that the pointed-to value is used by the asm, and this
makes me nervous. See commit 715bd9d12f84d8f5cc8ad21d888f9bc304a8eb0b
as an example of this. In a similar vein, the existing PVOP calls
have a "memory" clobber, and that's not free.
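A sketch of the concern (names hypothetical): an input operand of
"r" (&var) hands the asm only the pointer value, so GCC is free to
reorder or drop accesses to 'var' around the asm; a memory operand
makes the access explicit without a blanket "memory" clobber:

int var;

void example(void)
{
	/* GCC sees only the pointer value being used, not *(&var): */
	asm volatile("" : : "r" (&var));

	/* An explicit memory operand tells GCC that 'var' is accessed: */
	asm volatile("" : "+m" (var));
}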
On Mon, 12 Nov 2018 at 06:31, Josh Poimboeuf <[email protected]> wrote:
>
> On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
> >
> > * Josh Poimboeuf <[email protected]> wrote:
> >
> > > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > > > x86-64 only, which means we have to use the "unoptimized" version
> > > > > everywhere else. I may experiment with a GCC plugin instead.
> > > >
> > > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > > approach while GCC plugin would have to be replicated for Clang and any
> > > > other compilers, etc.
> > >
> > > The benefit of a plugin is that we'd only need two of them: GCC and
> > > Clang. And presumably, they'd share a lot of code.
> > >
Having looked into this, I don't think they will share any code at
all, to be honest. Perhaps some macros and string templates, that's
all.
> > > The prospect of porting objtool to all architectures is going to be much
> > > more of a daunting task (though we are at least already considering it
> > > for some arches).
> >
> > Which architectures would benefit from ORC support the most?
>
> According to my (limited and potentially flawed) knowledge, I think
> arm64 would benefit the most performance-wise, whereas powerpc and s390
> gains would be quite a bit less.
>
What would arm64 gain from ORC and/or objtool?
> We may have to port objtool to arm64 anyway, for live patching.
Is this about the reliable stack traces, i.e., the ability to detect
non-leaf functions that don't create stack frames? I think we should
be able to manage this without objtool on arm64 tbh.
> But
> that will be a lot more work than it took for Ard to write a GCC plugin.
>
> > I really think that hard reliance on GCC plugins is foolish
>
> Funny, I feel the same way about hard reliance on objtool :-)
>
I tend to agree here. I think objtool is a necessary evil (as are
compiler plugins, for that matter) which I hope does not spread to
other architectures.
But the main difference is that the GCC plugin is only ~50 lines (for
this particular use case, and minus another 50 lines of boilerplate),
whereas objtool (AIUI) duplicates lots and lots of functionality of
the compiler, assembler and/or linker, to mangle relocations, create
new sections etc etc. Porting this to other architectures is going to
be a major maintenance effort, especially when I think of, e.g.,
32-bit ARM with its Thumb2 quirks and other idiosyncrasies that are
currently hidden in the toolchain. Other architectures should be first
class citizens if objtool gains support for them, which means that the
x86 people that own it currently are on the hook for testing their
changes against architectures they are not familiar with.
This obviously applies equally to compiler plugins, but those have a
lot more focus.
> > - but maybe Clang's plugin infrastructure is a guarantee that it
> > remains a sane and usable interface.
>
> Hopefully so. If it breaks, we could always write another tool, as the
> work is straightforward. Or we could make it an objtool subcommand
> which works on all arches.
>
> > > > All other usecases are bonus, but it would certainly be interesting to
> > > > investigate the impact of using these APIs for tracing: that too is a
> > > > feature enabled everywhere but utilized only by a small fraction of Linux
> > > > users - so literally every single cycle or instruction saved or hot-path
> > > > shortened is a major win.
> > >
> > > With retpolines, and with tracepoints enabled, it's definitely a major
> > > win. Steve measured an 8.9% general slowdown on hackbench caused by
> > > retpolines.
> >
> > How much of that slowdown is reversed?
>
> In theory, it should reverse all of the slowdown, and actually may even
> speed it up a little. Steve is working on measuring that now.
>
> --
> Josh
On Sun, 11 Nov 2018 23:30:55 -0600
Josh Poimboeuf <[email protected]> wrote:
> > How much of that slowdown is reversed?
>
> In theory, it should reverse all of the slowdown, and actually may even
> speed it up a little. Steve is working on measuring that now.
When I'm able to get it to work! Hopefully that last patch snippet you
posted will help. If not, I'm assuming you'll be in Vancouver this
week, and we could sit down and work it out.
That said, I don't expect a 100% improvement, because retpolines cause
slowdowns in areas other than tracing, which this doesn't fix. I'm
expecting a substantial improvement (I already see a good improvement
with the unoptimized static calls), and hoping for much more with the
optimized one (when I get it working). But not 100%, as stated above.
-- Steve
On Mon, Nov 12, 2018 at 10:39:52AM +0100, Ard Biesheuvel wrote:
> On Mon, 12 Nov 2018 at 06:31, Josh Poimboeuf <[email protected]> wrote:
> >
> > On Mon, Nov 12, 2018 at 06:02:41AM +0100, Ingo Molnar wrote:
> > >
> > > * Josh Poimboeuf <[email protected]> wrote:
> > >
> > > > On Fri, Nov 09, 2018 at 08:28:11AM +0100, Ingo Molnar wrote:
> > > > > > - I'm not sure about the objtool approach. Objtool is (currently)
> > > > > > x86-64 only, which means we have to use the "unoptimized" version
> > > > > > everywhere else. I may experiment with a GCC plugin instead.
> > > > >
> > > > > I'd prefer the objtool approach. It's a pretty reliable first-principles
> > > > > approach while GCC plugin would have to be replicated for Clang and any
> > > > > other compilers, etc.
> > > >
> > > > The benefit of a plugin is that we'd only need two of them: GCC and
> > > > Clang. And presumably, they'd share a lot of code.
> > > >
>
> Having looked into this, I don't think they will share any code at
> all, to be honest. Perhaps some macros and string templates, that's
> all.
Oh well. That should still be easier to maintain than objtool across
all arches at this point.
> > > > The prospect of porting objtool to all architectures is going to be much
> > > > more of a daunting task (though we are at least already considering it
> > > > for some arches).
> > >
> > > Which architectures would benefit from ORC support the most?
> >
> > According to my (limited and potentially flawed) knowledge, I think
> > arm64 would benefit the most performance-wise, whereas powerpc and s390
> > gains would be quite a bit less.
> >
>
> What would arm64 gain from ORC and/or objtool?
Other than live patching, the biggest benefit would be an
across-the-board performance improvement from disabling frame pointers.
It would be interesting to see some arm64 performance numbers there, for
a kernel compiled with -fomit-frame-pointer.
For more details (and benefits of) ORC see
Documentation/x86/orc-unwinder.txt.
Objtool has also come in handy for other cases, like ensuring retpolines
are used everywhere.
Over time, I would like to move some objtool functionality to compiler
plugins, such that it would be easier to port it to other arches.
> > We may have to port objtool to arm64 anyway, for live patching.
>
> Is this about the reliable stack traces, i.e., the ability to detect
> non-leaf functions that don't create stack frames? I think we should
> be able to manage this without objtool on arm64 tbh.
Hm? How else would you ensure all functions honor CONFIG_FRAME_POINTER,
and continue to do so indefinitely?
> > But
> > that will be a lot more work than it took for Ard to write a GCC plugin.
> >
> > > I really think that hard reliance on GCC plugins is foolish
> >
> > Funny, I feel the same way about hard reliance on objtool :-)
> >
>
> I tend to agree here. I think objtool is a necessary evil (as are
> compiler plugins, for that matter) which I hope does not spread to
> other architectures.
I agree that it's a necessary evil, but it may be necessary on arm64 for
live patching.
> But the main difference is that the GCC plugin is only ~50 lines (for
> this particular use case, and minus another 50 lines of boilerplate),
> whereas objtool (AIUI) duplicates lots and lots of functionality of
> the compiler, assembler and/or linker, to mangle relocations, create
> new sections etc etc. Porting this to other architectures is going to
> be a major maintenance effort, especially when I think of, e.g.,
> 32-bit ARM with its Thumb2 quirks and other idiosyncrasies that are
> currently hidden in the toolchain. Other architectures should be first
> class citizens if objtool gains support for them, which means that the
> x86 people that own it currently are on the hook for testing their
> changes against architectures they are not familiar with.
Sounds like we could use you as a co-maintainer then :-)
BTW, AFAIK, there are no plans to support live patching for 32-bit ARM.
--
Josh
On Mon, Nov 12, 2018 at 12:03:34PM -0500, Steven Rostedt wrote:
> On Sun, 11 Nov 2018 23:30:55 -0600
> Josh Poimboeuf <[email protected]> wrote:
>
> > > How much of that slowdown is reversed?
> >
> > In theory, it should reverse all of the slowdown, and actually may even
> > speed it up a little. Steve is working on measuring that now.
>
> When I'm able to get it to work! Hopefully that last patch snippet you
> posted will help. If not, I'm assuming you'll be in Vancouver this
> week, and we could sit down and work it out.
Sure, I'm already in Vancouver. Just grab me if you see me, or ping me
on IRC/email. Or feel free to send me your patches if it's still giving
you trouble.
> That said, I don't expect a 100% improvement, because retpolines cause
> slowdowns in areas other than tracing, which this doesn't fix. I'm
> expecting a substantial improvement (I already see a good improvement
> with the unoptimized static calls), and hoping for much more with the
> optimized one (when I get it working). But not 100%, as stated above.
Ah, ok. Makes sense.
--
Josh