2018-11-26 13:56:33

by Josh Poimboeuf

Subject: [PATCH v2 0/4] Static calls

v2:
- fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
- rename optimized/unoptimized -> inline/out-of-line [Ard]
- tweak arch interfaces for PLT and add key->tramp field [Ard]
- rename 'poison' to 'defuse' and do it after all sites have been patched [Ard]
- fix .init handling [Ard, Steven]
- add CONFIG_HAVE_STATIC_CALL [Steven]
- make interfaces more consistent across configs to allow tracepoints to
use them [Steven]
- move __ADDRESSABLE() to static_call() macro [Steven]
- prevent 2-byte jumps [Steven]
- add offset to asm-offsets.c instead of hard coding key->func offset
- add kernel_text_address() sanity check
- make __ADDRESSABLE() symbols truly unique

TODO:
- port Ard's arm64 patches to the new arch interfaces
- tracepoint performance testing

--------------------

These patches are related to two similar patch sets from Ard and Steve:

- https://lkml.kernel.org/r/[email protected]
- https://lkml.kernel.org/r/[email protected]

The code is also heavily inspired by the jump label code, as some of the
concepts are very similar.

There are three separate implementations, depending on what the arch
supports:

1) CONFIG_HAVE_STATIC_CALL_INLINE: patched call sites - requires
objtool and a small amount of arch code

2) CONFIG_HAVE_STATIC_CALL_OUTLINE: patched trampolines - requires
a small amount of arch code

3) If no arch support, fall back to regular function pointers
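
For reference, a minimal usage sketch of the API (my_key, func_a() and
func_b() are made-up names; this mirrors the example in the static_call.h
comments added by patch 2):

#include <linux/static_call.h>

static int func_a(int arg1, int arg2)
{
	return arg1 + arg2;
}

static int func_b(int arg1, int arg2)
{
	return arg1 * arg2;
}

/* Define 'my_key', associated with func_a() by default: */
DEFINE_STATIC_CALL(my_key, func_a);

static int example(void)
{
	int ret;

	ret = static_call(my_key, 1, 2);	/* calls func_a() */

	static_call_update(my_key, func_b);

	ret += static_call(my_key, 1, 2);	/* now calls func_b() */

	return ret;
}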


Josh Poimboeuf (4):
compiler.h: Make __ADDRESSABLE() symbol truly unique
static_call: Add static call infrastructure
x86/static_call: Add out-of-line static call implementation
x86/static_call: Add inline static call implementation for x86-64

arch/Kconfig | 10 +
arch/x86/Kconfig | 4 +-
arch/x86/include/asm/static_call.h | 52 +++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/asm-offsets.c | 6 +
arch/x86/kernel/static_call.c | 78 ++++
include/asm-generic/vmlinux.lds.h | 11 +
include/linux/compiler.h | 2 +-
include/linux/module.h | 10 +
include/linux/static_call.h | 202 ++++++++++
include/linux/static_call_types.h | 19 +
kernel/Makefile | 1 +
kernel/module.c | 5 +
kernel/static_call.c | 350 ++++++++++++++++++
tools/objtool/Makefile | 3 +-
tools/objtool/check.c | 126 ++++++-
tools/objtool/check.h | 2 +
tools/objtool/elf.h | 1 +
.../objtool/include/linux/static_call_types.h | 19 +
tools/objtool/sync-check.sh | 1 +
20 files changed, 899 insertions(+), 4 deletions(-)
create mode 100644 arch/x86/include/asm/static_call.h
create mode 100644 arch/x86/kernel/static_call.c
create mode 100644 include/linux/static_call.h
create mode 100644 include/linux/static_call_types.h
create mode 100644 kernel/static_call.c
create mode 100644 tools/objtool/include/linux/static_call_types.h

--
2.17.2



2018-11-26 13:56:33

by Josh Poimboeuf

Subject: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

Add the inline static call implementation for x86-64. For each key, a
temporary trampoline is created, named __static_call_tramp_<key>. The
trampoline has an indirect jump to the destination function.

Objtool uses the trampoline naming convention to detect all the call
sites. It then annotates those call sites in the .static_call_sites
section.

During boot (and module init), the call sites are patched to call
directly into the destination function. The temporary trampoline is
then no longer used.
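
Each .static_call_sites entry stores the call site address and the key
address as 32-bit place-relative values (the R_X86_64_PC32 relocations
emitted by objtool). A small user-space sketch of the encode/decode math
used by static_call_addr()/static_call_key(), ignoring the 'init' bit the
kernel stashes in the low bit of the key value (illustration only, not
kernel code):

/* Sketch: place-relative encoding of .static_call_sites entries. */
#include <stdio.h>
#include <stdint.h>

struct static_call_site {
	int32_t addr;	/* call site, relative to &site->addr */
	int32_t key;	/* key, relative to &site->key */
};

static void *site_addr(struct static_call_site *site)
{
	return (void *)((long)site->addr + (long)&site->addr);
}

static void *site_key(struct static_call_site *site)
{
	return (void *)((long)site->key + (long)&site->key);
}

int main(void)
{
	char text[64], key[8];		/* stand-ins for kernel objects */
	struct static_call_site site;

	/* Encode, as the PC32 relocations effectively do: */
	site.addr = (int32_t)((long)text - (long)&site.addr);
	site.key  = (int32_t)((long)key  - (long)&site.key);

	printf("addr ok: %d\n", site_addr(&site) == (void *)text);
	printf("key  ok: %d\n", site_key(&site)  == (void *)key);
	return 0;
}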

Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/x86/Kconfig | 5 +-
arch/x86/include/asm/static_call.h | 28 +++-
arch/x86/kernel/asm-offsets.c | 6 +
arch/x86/kernel/static_call.c | 30 ++++-
include/linux/static_call.h | 2 +-
tools/objtool/Makefile | 3 +-
tools/objtool/check.c | 126 +++++++++++++++++-
tools/objtool/check.h | 2 +
tools/objtool/elf.h | 1 +
.../objtool/include/linux/static_call_types.h | 19 +++
tools/objtool/sync-check.sh | 1 +
11 files changed, 213 insertions(+), 10 deletions(-)
create mode 100644 tools/objtool/include/linux/static_call_types.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a2a10e0ce248..e099ea87ea70 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -189,7 +189,8 @@ config X86
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_STACKPROTECTOR if CC_HAS_SANE_STACKPROTECTOR
select HAVE_STACK_VALIDATION if X86_64
- select HAVE_STATIC_CALL_OUTLINE
+ select HAVE_STATIC_CALL_INLINE if HAVE_STACK_VALIDATION
+ select HAVE_STATIC_CALL_OUTLINE if !HAVE_STACK_VALIDATION
select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
@@ -203,6 +204,7 @@ config X86
select RTC_MC146818_LIB
select SPARSE_IRQ
select SRCU
+ select STACK_VALIDATION if HAVE_STACK_VALIDATION && (HAVE_STATIC_CALL_INLINE || RETPOLINE)
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select USER_STACKTRACE_SUPPORT
@@ -438,7 +440,6 @@ config GOLDFISH
config RETPOLINE
bool "Avoid speculative indirect branches in kernel"
default y
- select STACK_VALIDATION if HAVE_STACK_VALIDATION
help
Compile kernel with the retpoline compiler options to guard against
kernel-to-user data leaks by avoiding speculative indirect
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index 6e9ad5969ec2..27bd7da16150 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -2,6 +2,20 @@
#ifndef _ASM_STATIC_CALL_H
#define _ASM_STATIC_CALL_H

+#include <asm/asm-offsets.h>
+
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+
+/*
+ * This trampoline is only used during boot / module init, so it's safe to use
+ * the indirect branch without a retpoline.
+ */
+#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func) \
+ ANNOTATE_RETPOLINE_SAFE \
+ "jmpq *" __stringify(key) "+" __stringify(SC_KEY_func) "(%rip) \n"
+
+#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+
/*
* Manually construct a 5-byte direct JMP to prevent the assembler from
* optimizing it into a 2-byte JMP.
@@ -12,9 +26,19 @@
".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n" \
__ARCH_STATIC_CALL_JMP_LABEL(key) ":"

+#endif /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+
/*
- * This is a permanent trampoline which does a direct jump to the function.
- * The direct jump get patched by static_call_update().
+ * For CONFIG_HAVE_STATIC_CALL_INLINE, this is a temporary trampoline which
+ * uses the current value of the key->func pointer to do an indirect jump to
+ * the function. This trampoline is only used during boot, before the call
+ * sites get patched by static_call_update(). The name of this trampoline has
+ * a magical aspect: objtool uses it to find static call sites so it can create
+ * the .static_call_sites section.
+ *
+ * For CONFIG_HAVE_STATIC_CALL_OUTLINE, this is a permanent trampoline which
+ * does a direct jump to the function. The direct jump gets patched by
+ * static_call_update().
*/
#define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func) \
asm(".pushsection .text, \"ax\" \n" \
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 72adf6c335dc..da8fd220e4f2 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -12,6 +12,7 @@
#include <linux/hardirq.h>
#include <linux/suspend.h>
#include <linux/kbuild.h>
+#include <linux/static_call.h>
#include <asm/processor.h>
#include <asm/thread_info.h>
#include <asm/sigframe.h>
@@ -104,4 +105,9 @@ void common(void) {
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+ BLANK();
+ OFFSET(SC_KEY_func, static_call_key, func);
+#endif
}
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index 8026d176f25c..d3869295b88d 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -9,13 +9,21 @@

void static_call_bp_handler(void);
void *bp_handler_dest;
+void *bp_handler_continue;

asm(".pushsection .text, \"ax\" \n"
".globl static_call_bp_handler \n"
".type static_call_bp_handler, @function \n"
"static_call_bp_handler: \n"
- "ANNOTATE_RETPOLINE_SAFE \n"
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+ ANNOTATE_RETPOLINE_SAFE
+ "call *bp_handler_dest \n"
+ ANNOTATE_RETPOLINE_SAFE
+ "jmp *bp_handler_continue \n"
+#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
+ ANNOTATE_RETPOLINE_SAFE
"jmp *bp_handler_dest \n"
+#endif
".popsection \n");

void arch_static_call_transform(void *site, void *tramp, void *func)
@@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
unsigned char insn_opcode;
unsigned char opcodes[CALL_INSN_SIZE];

- insn = (unsigned long)tramp;
+ if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
+ insn = (unsigned long)site;
+ else
+ insn = (unsigned long)tramp;

mutex_lock(&text_mutex);

@@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
opcodes[0] = insn_opcode;
memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);

- /* Set up the variable for the breakpoint handler: */
+ /* Set up the variables for the breakpoint handler: */
bp_handler_dest = func;
+ if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
+ bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);

/* Patch the call site: */
text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
@@ -52,3 +65,14 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
mutex_unlock(&text_mutex);
}
EXPORT_SYMBOL_GPL(arch_static_call_transform);
+
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+void arch_static_call_defuse_tramp(void *site, void *tramp)
+{
+ unsigned short opcode = INSN_UD2;
+
+ mutex_lock(&text_mutex);
+ text_poke((void *)tramp, &opcode, 2);
+ mutex_unlock(&text_mutex);
+}
+#endif
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 651f4d784377..6daff586c97d 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -70,7 +70,7 @@
#include <linux/cpu.h>
#include <linux/static_call_types.h>

-#ifdef CONFIG_HAVE_STATIC_CALL
+#if defined(CONFIG_HAVE_STATIC_CALL) && !defined(COMPILE_OFFSETS)
#include <asm/static_call.h>
extern void arch_static_call_transform(void *site, void *tramp, void *func);
#endif
diff --git a/tools/objtool/Makefile b/tools/objtool/Makefile
index c9d038f91af6..fb1afa34f10d 100644
--- a/tools/objtool/Makefile
+++ b/tools/objtool/Makefile
@@ -29,7 +29,8 @@ all: $(OBJTOOL)

INCLUDES := -I$(srctree)/tools/include \
-I$(srctree)/tools/arch/$(HOSTARCH)/include/uapi \
- -I$(srctree)/tools/objtool/arch/$(ARCH)/include
+ -I$(srctree)/tools/objtool/arch/$(ARCH)/include \
+ -I$(srctree)/tools/objtool/include
WARNINGS := $(EXTRA_WARNINGS) -Wno-switch-default -Wno-switch-enum -Wno-packed
CFLAGS += -Werror $(WARNINGS) $(KBUILD_HOSTCFLAGS) -g $(INCLUDES)
LDFLAGS += -lelf $(LIBSUBCMD) $(KBUILD_HOSTLDFLAGS)
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 0414a0d52262..ea1ff9ea2d78 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -27,6 +27,7 @@

#include <linux/hashtable.h>
#include <linux/kernel.h>
+#include <linux/static_call_types.h>

struct alternative {
struct list_head list;
@@ -165,6 +166,7 @@ static int __dead_end_function(struct objtool_file *file, struct symbol *func,
"fortify_panic",
"usercopy_abort",
"machine_real_restart",
+ "rewind_stack_do_exit",
};

if (func->bind == STB_WEAK)
@@ -525,6 +527,10 @@ static int add_jump_destinations(struct objtool_file *file)
} else {
/* sibling call */
insn->jump_dest = 0;
+ if (rela->sym->static_call_tramp) {
+ list_add_tail(&insn->static_call_node,
+ &file->static_call_list);
+ }
continue;
}

@@ -1202,6 +1208,24 @@ static int read_retpoline_hints(struct objtool_file *file)
return 0;
}

+static int read_static_call_tramps(struct objtool_file *file)
+{
+ struct section *sec;
+ struct symbol *func;
+
+ for_each_sec(file, sec) {
+ list_for_each_entry(func, &sec->symbol_list, list) {
+ if (func->bind == STB_GLOBAL &&
+ !strncmp(func->name, STATIC_CALL_TRAMP_PREFIX_STR,
+ strlen(STATIC_CALL_TRAMP_PREFIX_STR)))
+ func->static_call_tramp = true;
+ }
+
+ }
+
+ return 0;
+}
+
static void mark_rodata(struct objtool_file *file)
{
struct section *sec;
@@ -1267,6 +1291,10 @@ static int decode_sections(struct objtool_file *file)
if (ret)
return ret;

+ ret = read_static_call_tramps(file);
+ if (ret)
+ return ret;
+
return 0;
}

@@ -1920,6 +1948,11 @@ static int validate_branch(struct objtool_file *file, struct instruction *first,
if (is_fentry_call(insn))
break;

+ if (insn->call_dest->static_call_tramp) {
+ list_add_tail(&insn->static_call_node,
+ &file->static_call_list);
+ }
+
ret = dead_end_function(file, insn->call_dest);
if (ret == 1)
return 0;
@@ -2167,6 +2200,89 @@ static int validate_reachable_instructions(struct objtool_file *file)
return 0;
}

+static int create_static_call_sections(struct objtool_file *file)
+{
+ struct section *sec, *rela_sec;
+ struct rela *rela;
+ struct static_call_site *site;
+ struct instruction *insn;
+ char *key_name;
+ struct symbol *key_sym;
+ int idx;
+
+ sec = find_section_by_name(file->elf, ".static_call_sites");
+ if (sec) {
+ WARN("file already has .static_call_sites section, skipping");
+ return 0;
+ }
+
+ if (list_empty(&file->static_call_list))
+ return 0;
+
+ idx = 0;
+ list_for_each_entry(insn, &file->static_call_list, static_call_node)
+ idx++;
+
+ sec = elf_create_section(file->elf, ".static_call_sites",
+ sizeof(struct static_call_site), idx);
+ if (!sec)
+ return -1;
+
+ rela_sec = elf_create_rela_section(file->elf, sec);
+ if (!rela_sec)
+ return -1;
+
+ idx = 0;
+ list_for_each_entry(insn, &file->static_call_list, static_call_node) {
+
+ site = (struct static_call_site *)sec->data->d_buf + idx;
+ memset(site, 0, sizeof(struct static_call_site));
+
+ /* populate rela for 'addr' */
+ rela = malloc(sizeof(*rela));
+ if (!rela) {
+ perror("malloc");
+ return -1;
+ }
+ memset(rela, 0, sizeof(*rela));
+ rela->sym = insn->sec->sym;
+ rela->addend = insn->offset;
+ rela->type = R_X86_64_PC32;
+ rela->offset = idx * sizeof(struct static_call_site);
+ list_add_tail(&rela->list, &rela_sec->rela_list);
+ hash_add(rela_sec->rela_hash, &rela->hash, rela->offset);
+
+ /* find key symbol */
+ key_name = insn->call_dest->name + strlen(STATIC_CALL_TRAMP_PREFIX_STR);
+ key_sym = find_symbol_by_name(file->elf, key_name);
+ if (!key_sym) {
+ WARN("can't find static call key symbol: %s", key_name);
+ return -1;
+ }
+
+ /* populate rela for 'key' */
+ rela = malloc(sizeof(*rela));
+ if (!rela) {
+ perror("malloc");
+ return -1;
+ }
+ memset(rela, 0, sizeof(*rela));
+ rela->sym = key_sym;
+ rela->addend = 0;
+ rela->type = R_X86_64_PC32;
+ rela->offset = idx * sizeof(struct static_call_site) + 4;
+ list_add_tail(&rela->list, &rela_sec->rela_list);
+ hash_add(rela_sec->rela_hash, &rela->hash, rela->offset);
+
+ idx++;
+ }
+
+ if (elf_rebuild_rela_section(rela_sec))
+ return -1;
+
+ return 0;
+}
+
static void cleanup(struct objtool_file *file)
{
struct instruction *insn, *tmpinsn;
@@ -2191,12 +2307,13 @@ int check(const char *_objname, bool orc)

objname = _objname;

- file.elf = elf_open(objname, orc ? O_RDWR : O_RDONLY);
+ file.elf = elf_open(objname, O_RDWR);
if (!file.elf)
return 1;

INIT_LIST_HEAD(&file.insn_list);
hash_init(file.insn_hash);
+ INIT_LIST_HEAD(&file.static_call_list);
file.whitelist = find_section_by_name(file.elf, ".discard.func_stack_frame_non_standard");
file.c_file = find_section_by_name(file.elf, ".comment");
file.ignore_unreachables = no_unreachable;
@@ -2236,6 +2353,11 @@ int check(const char *_objname, bool orc)
warnings += ret;
}

+ ret = create_static_call_sections(&file);
+ if (ret < 0)
+ goto out;
+ warnings += ret;
+
if (orc) {
ret = create_orc(&file);
if (ret < 0)
@@ -2244,7 +2366,9 @@ int check(const char *_objname, bool orc)
ret = create_orc_sections(&file);
if (ret < 0)
goto out;
+ }

+ if (orc || !list_empty(&file.static_call_list)) {
ret = elf_write(file.elf);
if (ret < 0)
goto out;
diff --git a/tools/objtool/check.h b/tools/objtool/check.h
index e6e8a655b556..56b8b7fb1bd1 100644
--- a/tools/objtool/check.h
+++ b/tools/objtool/check.h
@@ -39,6 +39,7 @@ struct insn_state {
struct instruction {
struct list_head list;
struct hlist_node hash;
+ struct list_head static_call_node;
struct section *sec;
unsigned long offset;
unsigned int len;
@@ -60,6 +61,7 @@ struct objtool_file {
struct elf *elf;
struct list_head insn_list;
DECLARE_HASHTABLE(insn_hash, 16);
+ struct list_head static_call_list;
struct section *whitelist;
bool ignore_unreachables, c_file, hints, rodata;
};
diff --git a/tools/objtool/elf.h b/tools/objtool/elf.h
index bc97ed86b9cd..3cf44d7cc3ac 100644
--- a/tools/objtool/elf.h
+++ b/tools/objtool/elf.h
@@ -62,6 +62,7 @@ struct symbol {
unsigned long offset;
unsigned int len;
struct symbol *pfunc, *cfunc;
+ bool static_call_tramp;
};

struct rela {
diff --git a/tools/objtool/include/linux/static_call_types.h b/tools/objtool/include/linux/static_call_types.h
new file mode 100644
index 000000000000..6859b208de6e
--- /dev/null
+++ b/tools/objtool/include/linux/static_call_types.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _STATIC_CALL_TYPES_H
+#define _STATIC_CALL_TYPES_H
+
+#include <linux/stringify.h>
+
+#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
+#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
+
+#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
+#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
+
+/* The static call site table is created by objtool. */
+struct static_call_site {
+ s32 addr;
+ s32 key;
+};
+
+#endif /* _STATIC_CALL_TYPES_H */
diff --git a/tools/objtool/sync-check.sh b/tools/objtool/sync-check.sh
index 1470e74e9d66..e1a204bf3556 100755
--- a/tools/objtool/sync-check.sh
+++ b/tools/objtool/sync-check.sh
@@ -10,6 +10,7 @@ arch/x86/include/asm/insn.h
arch/x86/include/asm/inat.h
arch/x86/include/asm/inat_types.h
arch/x86/include/asm/orc_types.h
+include/linux/static_call_types.h
'

check()
--
2.17.2


2018-11-26 13:56:38

by Josh Poimboeuf

Subject: [PATCH v2 1/4] compiler.h: Make __ADDRESSABLE() symbol truly unique

The __ADDRESSABLE() macro uses the __LINE__ macro to create a temporary
symbol which has a unique name. However, if the macro is used multiple
times from within another macro, the line number will always be the
same, resulting in duplicate symbols.

Make the temporary symbols truly unique by using __UNIQUE_ID instead of
__LINE__.
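
To illustrate: if a wrapper macro expands __ADDRESSABLE() twice for the
same symbol, both expansions see the same __LINE__ value and collide,
while a __COUNTER__-based name (which is what __UNIQUE_ID() boils down to
on gcc/clang) stays unique. A user-space sketch of the two naming schemes
(not the actual kernel macros):

/* Illustration of the naming collision (not the actual kernel macros). */
#define PASTE_(a, b)		a##b
#define PASTE(a, b)		PASTE_(a, b)

/* Old scheme: temporary symbol named after __LINE__ */
#define ADDR_BY_LINE(sym) \
	static void *PASTE(__addr_##sym, __LINE__) __attribute__((unused)) = (void *)&sym
/* New scheme: named via __COUNTER__, as __UNIQUE_ID() effectively does */
#define ADDR_BY_COUNTER(sym) \
	static void *PASTE(__addr_##sym, __COUNTER__) __attribute__((unused)) = (void *)&sym

/* Wrapper macros that reference the same symbol twice from one line: */
#define TAKE_TWICE_LINE(sym)	ADDR_BY_LINE(sym); ADDR_BY_LINE(sym)
#define TAKE_TWICE_CNT(sym)	ADDR_BY_COUNTER(sym); ADDR_BY_COUNTER(sym)

static int foo;

/* TAKE_TWICE_LINE(foo);  <- redefinition: both get the same __LINE__ suffix */
TAKE_TWICE_CNT(foo);	/* OK: each expansion gets a fresh __COUNTER__ value */

int main(void)
{
	return 0;
}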

Signed-off-by: Josh Poimboeuf <[email protected]>
---
include/linux/compiler.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 06396c1cf127..4bb73fd918b5 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -282,7 +282,7 @@ unsigned long read_word_at_a_time(const void *addr)
*/
#define __ADDRESSABLE(sym) \
static void * __section(".discard.addressable") __used \
- __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
+ __UNIQUE_ID(__addressable_##sym) = (void *)&sym;

/**
* offset_to_ptr - convert a relative memory offset to an absolute pointer
--
2.17.2


2018-11-26 13:56:48

by Josh Poimboeuf

Subject: [PATCH v2 3/4] x86/static_call: Add out-of-line static call implementation

Add the x86 out-of-line static call implementation. For each key, a
permanent trampoline is created which is the destination for all static
calls for the given key. The trampoline has a direct jump which gets
patched by static_call_update() when the destination function changes.
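
The patching boils down to writing a 5-byte "JMP rel32" (or "CALL rel32")
whose displacement is relative to the end of the instruction, as done in
arch_static_call_transform() below. A small user-space sketch of that
encoding (made-up addresses, illustration only):

/* Sketch: encode the 5-byte JMP rel32 that static_call_update() patches in. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define CALL_INSN_SIZE 5

static void encode_jmp(unsigned char insn[CALL_INSN_SIZE],
		       unsigned long insn_addr, unsigned long func_addr)
{
	int32_t rel = (int32_t)(func_addr - (insn_addr + CALL_INSN_SIZE));

	insn[0] = 0xe9;			/* direct JMP rel32 */
	memcpy(&insn[1], &rel, sizeof(rel));
}

int main(void)
{
	unsigned char insn[CALL_INSN_SIZE];
	unsigned long tramp_addr = 0x1000;	/* made-up trampoline address */
	unsigned long func_addr  = 0x5678;	/* made-up destination function */
	int32_t rel;

	encode_jmp(insn, tramp_addr, func_addr);
	memcpy(&rel, &insn[1], sizeof(rel));

	/* The CPU resolves the target as end-of-instruction + rel32: */
	printf("target ok: %d\n",
	       tramp_addr + CALL_INSN_SIZE + rel == func_addr);
	return 0;
}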

Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/static_call.h | 28 ++++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/static_call.c | 54 ++++++++++++++++++++++++++++++
include/linux/static_call.h | 2 +-
5 files changed, 85 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/static_call.h
create mode 100644 arch/x86/kernel/static_call.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b5286ad2a982..a2a10e0ce248 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -189,6 +189,7 @@ config X86
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_STACKPROTECTOR if CC_HAS_SANE_STACKPROTECTOR
select HAVE_STACK_VALIDATION if X86_64
+ select HAVE_STATIC_CALL_OUTLINE
select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
new file mode 100644
index 000000000000..6e9ad5969ec2
--- /dev/null
+++ b/arch/x86/include/asm/static_call.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+/*
+ * Manually construct a 5-byte direct JMP to prevent the assembler from
+ * optimizing it into a 2-byte JMP.
+ */
+#define __ARCH_STATIC_CALL_JMP_LABEL(key) ".L" __stringify(key ## _after_jmp)
+#define __ARCH_STATIC_CALL_TRAMP_JMP(key, func) \
+ ".byte 0xe9 \n" \
+ ".long " #func " - " __ARCH_STATIC_CALL_JMP_LABEL(key) "\n" \
+ __ARCH_STATIC_CALL_JMP_LABEL(key) ":"
+
+/*
+ * This is a permanent trampoline which does a direct jump to the function.
+ * The direct jump get patched by static_call_update().
+ */
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(key, func) \
+ asm(".pushsection .text, \"ax\" \n" \
+ ".align 4 \n" \
+ ".globl " STATIC_CALL_TRAMP_STR(key) " \n" \
+ ".type " STATIC_CALL_TRAMP_STR(key) ", @function \n" \
+ STATIC_CALL_TRAMP_STR(key) ": \n" \
+ __ARCH_STATIC_CALL_TRAMP_JMP(key, func) " \n" \
+ ".popsection \n")
+
+#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8824d01c0c35..82acc8a28429 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -62,6 +62,7 @@ obj-y += tsc.o tsc_msr.o io_delay.o rtc.o
obj-y += pci-iommu_table.o
obj-y += resource.o
obj-y += irqflags.o
+obj-y += static_call.o

obj-y += process.o
obj-y += fpu/
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
new file mode 100644
index 000000000000..8026d176f25c
--- /dev/null
+++ b/arch/x86/kernel/static_call.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/static_call.h>
+#include <linux/memory.h>
+#include <linux/bug.h>
+#include <asm/text-patching.h>
+#include <asm/nospec-branch.h>
+
+#define CALL_INSN_SIZE 5
+
+void static_call_bp_handler(void);
+void *bp_handler_dest;
+
+asm(".pushsection .text, \"ax\" \n"
+ ".globl static_call_bp_handler \n"
+ ".type static_call_bp_handler, @function \n"
+ "static_call_bp_handler: \n"
+ "ANNOTATE_RETPOLINE_SAFE \n"
+ "jmp *bp_handler_dest \n"
+ ".popsection \n");
+
+void arch_static_call_transform(void *site, void *tramp, void *func)
+{
+ s32 dest_relative;
+ unsigned long insn;
+ unsigned char insn_opcode;
+ unsigned char opcodes[CALL_INSN_SIZE];
+
+ insn = (unsigned long)tramp;
+
+ mutex_lock(&text_mutex);
+
+ insn_opcode = *(unsigned char *)insn;
+ if (insn_opcode != 0xe8 && insn_opcode != 0xe9) {
+ WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
+ insn_opcode, (void *)insn);
+ goto done;
+ }
+
+ dest_relative = (long)(func) - (long)(insn + CALL_INSN_SIZE);
+
+ opcodes[0] = insn_opcode;
+ memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
+
+ /* Set up the variable for the breakpoint handler: */
+ bp_handler_dest = func;
+
+ /* Patch the call site: */
+ text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
+ static_call_bp_handler);
+
+done:
+ mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(arch_static_call_transform);
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index c8d0da1ef6b2..651f4d784377 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -149,7 +149,7 @@ struct static_call_key {
.func = _func, \
.tramp = STATIC_CALL_TRAMP(key), \
}; \
- ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)
+ ARCH_DEFINE_STATIC_CALL_TRAMP(key, _func)

#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)

--
2.17.2


2018-11-26 13:56:57

by Josh Poimboeuf

Subject: [PATCH v2 2/4] static_call: Add static call infrastructure

Add a static call infrastructure. Static calls use code patching to
hard-code function pointers into direct branch instructions. They give
the flexibility of function pointers, but with improved performance.
This is especially important for cases where retpolines would otherwise
be used, as retpolines can significantly impact performance.

The concept and code are an extension of previous work done by Ard
Biesheuvel and Steven Rostedt:

https://lkml.kernel.org/r/[email protected]
https://lkml.kernel.org/r/[email protected]

This code is also heavily inspired by the jump label code (aka "static
jumps"), as some of the concepts are very similar.

There are three implementations, depending on arch support:

1) inline: patched call sites (CONFIG_HAVE_STATIC_CALL_INLINE)
2) out-of-line: patched trampolines (CONFIG_HAVE_STATIC_CALL_OUTLINE)
3) basic function pointers

For more details, see the comments in include/linux/static_call.h.
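
For arches with no static call support (case 3), the API degrades to an
ordinary call through a function pointer. A rough user-space approximation
of those semantics (my_key/func_a/func_b and my_static_call() are made-up
names, not the kernel macros; it mirrors the kernel's 'void *func' member
and relies on the same function-pointer/void-pointer conversion):

/* Rough approximation of the generic (function pointer) fallback. */
#include <stdio.h>

struct static_call_key {
	void *func;
};

static int func_a(int x) { return x + 1; }
static int func_b(int x) { return x * 2; }

/* DEFINE_STATIC_CALL(my_key, func_a) equivalent: */
static struct static_call_key my_key = { .func = func_a };

/* static_call(): cast key.func back to the right type and call through it. */
#define my_static_call(key, type, args...)	(((type)((key).func))(args))

int main(void)
{
	printf("%d\n", my_static_call(my_key, int (*)(int), 5));	/* 6 */

	/* static_call_update() equivalent: just swap the pointer. */
	my_key.func = func_b;
	printf("%d\n", my_static_call(my_key, int (*)(int), 5));	/* 10 */
	return 0;
}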

Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/Kconfig | 10 +
include/asm-generic/vmlinux.lds.h | 11 +
include/linux/module.h | 10 +
include/linux/static_call.h | 202 +++++++++++++++++
include/linux/static_call_types.h | 19 ++
kernel/Makefile | 1 +
kernel/module.c | 5 +
kernel/static_call.c | 350 ++++++++++++++++++++++++++++++
8 files changed, 608 insertions(+)
create mode 100644 include/linux/static_call.h
create mode 100644 include/linux/static_call_types.h
create mode 100644 kernel/static_call.c

diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540ffa979..4474f2958e03 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -879,6 +879,16 @@ config HAVE_ARCH_PREL32_RELOCATIONS
architectures, and don't require runtime relocation on relocatable
kernels.

+config HAVE_STATIC_CALL_INLINE
+ bool
+
+config HAVE_STATIC_CALL_OUTLINE
+ bool
+
+config HAVE_STATIC_CALL
+ def_bool y
+ depends on HAVE_STATIC_CALL_INLINE || HAVE_STATIC_CALL_OUTLINE
+
source "kernel/gcov/Kconfig"

source "scripts/gcc-plugins/Kconfig"
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 3d7a6a9c2370..f2729831c8b8 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -320,6 +320,7 @@
__start_ro_after_init = .; \
*(.data..ro_after_init) \
JUMP_TABLE_DATA \
+ STATIC_CALL_SITES \
__end_ro_after_init = .;
#endif

@@ -725,6 +726,16 @@
#define BUG_TABLE
#endif

+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+#define STATIC_CALL_SITES \
+ . = ALIGN(8); \
+ __start_static_call_sites = .; \
+ KEEP(*(.static_call_sites)) \
+ __stop_static_call_sites = .;
+#else
+#define STATIC_CALL_SITES
+#endif
+
#ifdef CONFIG_UNWINDER_ORC
#define ORC_UNWIND_TABLE \
. = ALIGN(4); \
diff --git a/include/linux/module.h b/include/linux/module.h
index fce6b4335e36..d7c575759931 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -21,6 +21,7 @@
#include <linux/rbtree_latch.h>
#include <linux/error-injection.h>
#include <linux/tracepoint-defs.h>
+#include <linux/static_call_types.h>

#include <linux/percpu.h>
#include <asm/module.h>
@@ -450,6 +451,10 @@ struct module {
unsigned int num_ftrace_callsites;
unsigned long *ftrace_callsites;
#endif
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+ int num_static_call_sites;
+ struct static_call_site *static_call_sites;
+#endif

#ifdef CONFIG_LIVEPATCH
bool klp; /* Is this a livepatch module? */
@@ -682,6 +687,11 @@ static inline bool is_module_text_address(unsigned long addr)
return false;
}

+static inline bool within_module_init(unsigned long addr, const struct module *mod)
+{
+ return false;
+}
+
/* Get/put a kernel symbol (calls should be symmetric) */
#define symbol_get(x) ({ extern typeof(x) x __attribute__((weak)); &(x); })
#define symbol_put(x) do { } while (0)
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
new file mode 100644
index 000000000000..c8d0da1ef6b2
--- /dev/null
+++ b/include/linux/static_call.h
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_STATIC_CALL_H
+#define _LINUX_STATIC_CALL_H
+
+/*
+ * Static call support
+ *
+ * Static calls use code patching to hard-code function pointers into direct
+ * branch instructions. They give the flexibility of function pointers, but
+ * with improved performance. This is especially important for cases where
+ * retpolines would otherwise be used, as retpolines can significantly impact
+ * performance.
+ *
+ *
+ * API overview:
+ *
+ * DECLARE_STATIC_CALL(key, func);
+ * DEFINE_STATIC_CALL(key, func);
+ * static_call(key, args...);
+ * static_call_update(key, func);
+ *
+ *
+ * Usage example:
+ *
+ * # Start with the following functions (with identical prototypes):
+ * int func_a(int arg1, int arg2);
+ * int func_b(int arg1, int arg2);
+ *
+ * # Define a 'my_key' reference, associated with func_a() by default
+ * DEFINE_STATIC_CALL(my_key, func_a);
+ *
+ * # Call func_a()
+ * static_call(my_key, arg1, arg2);
+ *
+ * # Update 'my_key' to point to func_b()
+ * static_call_update(my_key, func_b);
+ *
+ * # Call func_b()
+ * static_call(my_key, arg1, arg2);
+ *
+ *
+ * Implementation details:
+ *
+ * There are three different implementations:
+ *
+ * 1) Inline static calls (patched call sites)
+ *
+ * This requires objtool, which detects all the static_call() sites and
+ * annotates them in the '.static_call_sites' section. By default, the call
+ * sites will call into a temporary per-key trampoline which has an indirect
+ * branch to the current destination function associated with the key.
+ * During system boot (or module init), all call sites are patched to call
+ * their destination functions directly. Updates to a key will patch all
+ * call sites associated with that key.
+ *
+ * 2) Out-of-line static calls (patched trampolines)
+ *
+ * Each static_call() site calls into a permanent trampoline associated with
+ * the key. The trampoline has a direct branch to the default function.
+ * Updates to a key will modify the direct branch in the key's trampoline.
+ *
+ * 3) Generic implementation
+ *
+ * This is the default implementation if the architecture hasn't implemented
+ * static calls (either inline or out-of-line). In this case, a basic
+ * function pointer is used.
+ */
+
+#include <linux/types.h>
+#include <linux/cpu.h>
+#include <linux/static_call_types.h>
+
+#ifdef CONFIG_HAVE_STATIC_CALL
+#include <asm/static_call.h>
+extern void arch_static_call_transform(void *site, void *tramp, void *func);
+#endif
+
+
+#define DECLARE_STATIC_CALL(key, func) \
+ extern struct static_call_key key; \
+ extern typeof(func) STATIC_CALL_TRAMP(key)
+
+
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+
+struct static_call_key {
+ void *func, *tramp;
+ /*
+ * List of modules (including vmlinux) and their call sites associated
+ * with this key.
+ */
+ struct list_head site_mods;
+};
+
+struct static_call_mod {
+ struct list_head list;
+ struct module *mod; /* for vmlinux, mod == NULL */
+ struct static_call_site *sites;
+};
+
+extern void arch_static_call_defuse_tramp(void *site, void *tramp);
+extern void __static_call_update(struct static_call_key *key, void *func);
+extern int static_call_mod_init(struct module *mod);
+
+#define DEFINE_STATIC_CALL(key, _func) \
+ DECLARE_STATIC_CALL(key, _func); \
+ struct static_call_key key = { \
+ .func = _func, \
+ .tramp = STATIC_CALL_TRAMP(key), \
+ .site_mods = LIST_HEAD_INIT(key.site_mods), \
+ }; \
+ ARCH_DEFINE_STATIC_CALL_TRAMP(key, _func)
+
+/*
+ * __ADDRESSABLE() is used to ensure the key symbol doesn't get stripped from
+ * the symbol table so objtool can reference it when it generates the
+ * static_call_site structs.
+ */
+#define static_call(key, args...) \
+({ \
+ __ADDRESSABLE(key); \
+ STATIC_CALL_TRAMP(key)(args); \
+})
+
+#define static_call_update(key, func) \
+({ \
+ BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key))); \
+ __static_call_update(&key, func); \
+})
+
+#define EXPORT_STATIC_CALL(key) \
+ EXPORT_SYMBOL(key); \
+ EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
+
+#define EXPORT_STATIC_CALL_GPL(key) \
+ EXPORT_SYMBOL_GPL(key); \
+ EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
+
+
+#elif defined(CONFIG_HAVE_STATIC_CALL_OUTLINE)
+
+struct static_call_key {
+ void *func, *tramp;
+};
+
+#define DEFINE_STATIC_CALL(key, _func) \
+ DECLARE_STATIC_CALL(key, _func); \
+ struct static_call_key key = { \
+ .func = _func, \
+ .tramp = STATIC_CALL_TRAMP(key), \
+ }; \
+ ARCH_DEFINE_STATIC_CALL_TRAMP(key, func)
+
+#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)
+
+#define __static_call_update(key, func) \
+({ \
+ cpus_read_lock(); \
+ arch_static_call_transform(NULL, key->tramp, func); \
+ cpus_read_unlock(); \
+})
+
+#define static_call_update(key, func) \
+({ \
+ BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key))); \
+})
+
+#define EXPORT_STATIC_CALL(key) \
+ EXPORT_SYMBOL(STATIC_CALL_TRAMP(key))
+
+#define EXPORT_STATIC_CALL_GPL(key) \
+ EXPORT_SYMBOL_GPL(STATIC_CALL_TRAMP(key))
+
+
+#else /* Generic implementation */
+
+struct static_call_key {
+ void *func;
+};
+
+#define DEFINE_STATIC_CALL(key, _func) \
+ DECLARE_STATIC_CALL(key, _func); \
+ struct static_call_key key = { \
+ .func = _func, \
+ }
+
+#define static_call(key, args...) \
+ ((typeof(STATIC_CALL_TRAMP(key))*)(key.func))(args)
+
+#define __static_call_update(key, _func) \
+ WRITE_ONCE(key->func, _func)
+
+#define static_call_update(key, func) \
+ BUILD_BUG_ON(!__same_type(_func, STATIC_CALL_TRAMP(key))); \
+ __static_call_update(key, func)
+
+#define EXPORT_STATIC_CALL(key) EXPORT_SYMBOL(key)
+#define EXPORT_STATIC_CALL_GPL(key) EXPORT_SYMBOL_GPL(key)
+
+#endif /* CONFIG_HAVE_STATIC_CALL_INLINE */
+
+#endif /* _LINUX_STATIC_CALL_H */
diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
new file mode 100644
index 000000000000..6859b208de6e
--- /dev/null
+++ b/include/linux/static_call_types.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _STATIC_CALL_TYPES_H
+#define _STATIC_CALL_TYPES_H
+
+#include <linux/stringify.h>
+
+#define STATIC_CALL_TRAMP_PREFIX ____static_call_tramp_
+#define STATIC_CALL_TRAMP_PREFIX_STR __stringify(STATIC_CALL_TRAMP_PREFIX)
+
+#define STATIC_CALL_TRAMP(key) __PASTE(STATIC_CALL_TRAMP_PREFIX, key)
+#define STATIC_CALL_TRAMP_STR(key) __stringify(STATIC_CALL_TRAMP(key))
+
+/* The static call site table is created by objtool. */
+struct static_call_site {
+ s32 addr;
+ s32 key;
+};
+
+#endif /* _STATIC_CALL_TYPES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 7343b3a9bff0..88bc7fa14eb8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TRACEPOINTS) += trace/
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-$(CONFIG_CPU_PM) += cpu_pm.o
obj-$(CONFIG_BPF) += bpf/
+obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o

obj-$(CONFIG_PERF_EVENTS) += events/

diff --git a/kernel/module.c b/kernel/module.c
index 49a405891587..ecad0ee4ffb5 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3121,6 +3121,11 @@ static int find_module_sections(struct module *mod, struct load_info *info)
mod->ei_funcs = section_objs(info, "_error_injection_whitelist",
sizeof(*mod->ei_funcs),
&mod->num_ei_funcs);
+#endif
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+ mod->static_call_sites = section_objs(info, ".static_call_sites",
+ sizeof(*mod->static_call_sites),
+ &mod->num_static_call_sites);
#endif
mod->extable = section_objs(info, "__ex_table",
sizeof(*mod->extable), &mod->num_exentries);
diff --git a/kernel/static_call.c b/kernel/static_call.c
new file mode 100644
index 000000000000..88996ebe96e2
--- /dev/null
+++ b/kernel/static_call.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/init.h>
+#include <linux/static_call.h>
+#include <linux/bug.h>
+#include <linux/smp.h>
+#include <linux/sort.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/cpu.h>
+#include <linux/processor.h>
+#include <asm/sections.h>
+
+extern struct static_call_site __start_static_call_sites[],
+ __stop_static_call_sites[];
+
+static bool static_call_initialized;
+
+#define STATIC_CALL_INIT 1UL
+
+/* mutex to protect key modules/sites */
+static DEFINE_MUTEX(static_call_mutex);
+
+static void static_call_lock(void)
+{
+ mutex_lock(&static_call_mutex);
+}
+
+static void static_call_unlock(void)
+{
+ mutex_unlock(&static_call_mutex);
+}
+
+static inline void *static_call_addr(struct static_call_site *site)
+{
+ return (void *)((long)site->addr + (long)&site->addr);
+}
+
+
+static inline struct static_call_key *static_call_key(const struct static_call_site *site)
+{
+ return (struct static_call_key *)
+ (((long)site->key + (long)&site->key) & ~STATIC_CALL_INIT);
+}
+
+/* These assume the key is word-aligned. */
+static inline bool static_call_is_init(struct static_call_site *site)
+{
+ return ((long)site->key + (long)&site->key) & STATIC_CALL_INIT;
+}
+
+static inline void static_call_set_init(struct static_call_site *site)
+{
+ site->key = ((long)static_call_key(site) | STATIC_CALL_INIT) -
+ (long)&site->key;
+}
+
+static int static_call_site_cmp(const void *_a, const void *_b)
+{
+ const struct static_call_site *a = _a;
+ const struct static_call_site *b = _b;
+ const struct static_call_key *key_a = static_call_key(a);
+ const struct static_call_key *key_b = static_call_key(b);
+
+ if (key_a < key_b)
+ return -1;
+
+ if (key_a > key_b)
+ return 1;
+
+ return 0;
+}
+
+static void static_call_site_swap(void *_a, void *_b, int size)
+{
+ long delta = (unsigned long)_a - (unsigned long)_b;
+ struct static_call_site *a = _a;
+ struct static_call_site *b = _b;
+ struct static_call_site tmp = *a;
+
+ a->addr = b->addr - delta;
+ a->key = b->key - delta;
+
+ b->addr = tmp.addr + delta;
+ b->key = tmp.key + delta;
+}
+
+static inline void static_call_sort_entries(struct static_call_site *start,
+ struct static_call_site *stop)
+{
+ sort(start, stop - start, sizeof(struct static_call_site),
+ static_call_site_cmp, static_call_site_swap);
+}
+
+void __static_call_update(struct static_call_key *key, void *func)
+{
+ struct static_call_mod *site_mod;
+ struct static_call_site *site, *stop;
+
+ cpus_read_lock();
+ static_call_lock();
+
+ if (key->func == func)
+ goto done;
+
+ key->func = func;
+
+ /*
+ * If called before init, leave the call sites unpatched for now.
+ * In the meantime they'll continue to call the temporary trampoline.
+ */
+ if (!static_call_initialized)
+ goto done;
+
+ list_for_each_entry(site_mod, &key->site_mods, list) {
+ if (!site_mod->sites) {
+ /*
+ * This can happen if the static call key is defined in
+ * a module which doesn't use it.
+ */
+ continue;
+ }
+
+ stop = __stop_static_call_sites;
+
+#ifdef CONFIG_MODULES
+ if (site_mod->mod) {
+ stop = site_mod->mod->static_call_sites +
+ site_mod->mod->num_static_call_sites;
+ }
+#endif
+
+ for (site = site_mod->sites;
+ site < stop && static_call_key(site) == key; site++) {
+ void *site_addr = static_call_addr(site);
+ struct module *mod = site_mod->mod;
+
+ if (static_call_is_init(site)) {
+ /*
+ * Don't write to call sites which were in
+ * initmem and have since been freed.
+ */
+ if (!mod && system_state >= SYSTEM_RUNNING)
+ continue;
+ if (mod && (mod->state == MODULE_STATE_LIVE ||
+ mod->state == MODULE_STATE_GOING))
+ continue;
+ }
+
+ if (!kernel_text_address((unsigned long)site_addr)) {
+ WARN_ONCE(1, "can't patch static call site at %pS",
+ site_addr);
+ continue;
+ }
+
+ arch_static_call_transform(site_addr, key->tramp, func);
+ }
+ }
+
+done:
+ static_call_unlock();
+ cpus_read_unlock();
+}
+EXPORT_SYMBOL_GPL(__static_call_update);
+
+/*
+ * On arches without PLTs, the trampolines will no longer be used and can be
+ * poisoned.
+ *
+ * Other arches may continue to reuse the trampolines in cases where the
+ * destination function is too far away from the call site.
+ */
+static void static_call_defuse_tramps(struct static_call_site *start,
+ struct static_call_site *stop)
+{
+ struct static_call_site *site;
+ struct static_call_key *key;
+ struct static_call_key *prev_key = NULL;
+
+ for (site = start; site < stop; site++) {
+ key = static_call_key(site);
+
+ if (key != prev_key) {
+ prev_key = key;
+ arch_static_call_defuse_tramp(static_call_addr(site),
+ key->tramp);
+ }
+ }
+}
+
+#ifdef CONFIG_MODULES
+
+static int static_call_add_module(struct module *mod)
+{
+ struct static_call_site *start = mod->static_call_sites;
+ struct static_call_site *stop = mod->static_call_sites +
+ mod->num_static_call_sites;
+ struct static_call_site *site;
+ struct static_call_key *key, *prev_key = NULL;
+ struct static_call_mod *site_mod;
+
+ if (start == stop)
+ return 0;
+
+ static_call_sort_entries(start, stop);
+
+ for (site = start; site < stop; site++) {
+ void *site_addr = static_call_addr(site);
+
+ if (within_module_init((unsigned long)site_addr, mod))
+ static_call_set_init(site);
+
+ key = static_call_key(site);
+ if (key != prev_key) {
+ prev_key = key;
+
+ site_mod = kzalloc(sizeof(*site_mod), GFP_KERNEL);
+ if (!site_mod)
+ return -ENOMEM;
+
+ site_mod->mod = mod;
+ site_mod->sites = site;
+ list_add_tail(&site_mod->list, &key->site_mods);
+ }
+
+ arch_static_call_transform(site_addr, key->tramp, key->func);
+ }
+
+ /*
+ * If a tramp is used across modules, it may be defused more than once.
+ * This should be idempotent.
+ */
+ static_call_defuse_tramps(start, stop);
+
+ return 0;
+}
+
+static void static_call_del_module(struct module *mod)
+{
+ struct static_call_site *start = mod->static_call_sites;
+ struct static_call_site *stop = mod->static_call_sites +
+ mod->num_static_call_sites;
+ struct static_call_site *site;
+ struct static_call_key *key, *prev_key = NULL;
+ struct static_call_mod *site_mod;
+
+ for (site = start; site < stop; site++) {
+ key = static_call_key(site);
+ if (key == prev_key)
+ continue;
+ prev_key = key;
+
+ list_for_each_entry(site_mod, &key->site_mods, list) {
+ if (site_mod->mod == mod) {
+ list_del(&site_mod->list);
+ kfree(site_mod);
+ break;
+ }
+ }
+ }
+}
+
+static int static_call_module_notify(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ struct module *mod = data;
+ int ret = 0;
+
+ cpus_read_lock();
+ static_call_lock();
+
+ switch (val) {
+ case MODULE_STATE_COMING:
+ module_disable_ro(mod);
+ ret = static_call_add_module(mod);
+ module_enable_ro(mod, false);
+ if (ret) {
+ WARN(1, "Failed to allocate memory for static calls");
+ static_call_del_module(mod);
+ }
+ break;
+ case MODULE_STATE_GOING:
+ static_call_del_module(mod);
+ break;
+ }
+
+ static_call_unlock();
+ cpus_read_unlock();
+
+ return notifier_from_errno(ret);
+}
+
+static struct notifier_block static_call_module_nb = {
+ .notifier_call = static_call_module_notify,
+};
+
+#endif /* CONFIG_MODULES */
+
+static void __init static_call_init(void)
+{
+ struct static_call_site *start = __start_static_call_sites;
+ struct static_call_site *stop = __stop_static_call_sites;
+ struct static_call_site *site;
+
+ if (start == stop) {
+ pr_warn("WARNING: empty static call table\n");
+ return;
+ }
+
+ cpus_read_lock();
+ static_call_lock();
+
+ static_call_sort_entries(start, stop);
+
+ for (site = start; site < stop; site++) {
+ struct static_call_key *key = static_call_key(site);
+ void *site_addr = static_call_addr(site);
+
+ if (init_section_contains(site_addr, 1))
+ static_call_set_init(site);
+
+ if (list_empty(&key->site_mods)) {
+ struct static_call_mod *site_mod;
+
+ site_mod = kzalloc(sizeof(*site_mod), GFP_KERNEL);
+ if (!site_mod) {
+ WARN(1, "Failed to allocate memory for static calls");
+ goto done;
+ }
+
+ site_mod->sites = site;
+ list_add_tail(&site_mod->list, &key->site_mods);
+ }
+
+ arch_static_call_transform(site_addr, key->tramp, key->func);
+ }
+
+ static_call_defuse_tramps(start, stop);
+
+ static_call_initialized = true;
+
+done:
+ static_call_unlock();
+ cpus_read_unlock();
+
+#ifdef CONFIG_MODULES
+ if (static_call_initialized)
+ register_module_notifier(&static_call_module_nb);
+#endif
+}
+early_initcall(static_call_init);
--
2.17.2


2018-11-26 14:02:50

by Josh Poimboeuf

Subject: Re: [PATCH v2 0/4] Static calls

On Mon, Nov 26, 2018 at 07:54:56AM -0600, Josh Poimboeuf wrote:
> v2:
> - fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
> - rename optimized/unoptimized -> inline/out-of-line [Ard]
> - tweak arch interfaces for PLT and add key->tramp field [Ard]
> - rename 'poison' to 'defuse' and do it after all sites have been patched [Ard]
> - fix .init handling [Ard, Steven]
> - add CONFIG_HAVE_STATIC_CALL [Steven]
> - make interfaces more consistent across configs to allow tracepoints to
> use them [Steven]
> - move __ADDRESSABLE() to static_call() macro [Steven]
> - prevent 2-byte jumps [Steven]
> - add offset to asm-offsets.c instead of hard coding key->func offset
> - add kernel_text_address() sanity check
> - make __ADDRESSABLE() symbols truly unique
>
> TODO:
> - port Ard's arm64 patches to the new arch interfaces
> - tracepoint performance testing

Below is the patch Steve gave me for converting tracepoints to use
static calls.

Steve, if you want me to do the performance testing, send me the test
details and I can give it a try this week.



diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h
index 49ba9cde7e4b..ae16672bea61 100644
--- a/include/linux/tracepoint-defs.h
+++ b/include/linux/tracepoint-defs.h
@@ -11,6 +11,8 @@
#include <linux/atomic.h>
#include <linux/static_key.h>

+struct static_call_key;
+
struct trace_print_flags {
unsigned long mask;
const char *name;
@@ -30,6 +32,8 @@ struct tracepoint_func {
struct tracepoint {
const char *name; /* Tracepoint name */
struct static_key key;
+ struct static_call_key *static_call_key;
+ void *iterator;
int (*regfunc)(void);
void (*unregfunc)(void);
struct tracepoint_func __rcu *funcs;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 538ba1a58f5b..bddaf6043027 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -21,6 +21,7 @@
#include <linux/cpumask.h>
#include <linux/rcupdate.h>
#include <linux/tracepoint-defs.h>
+#include <linux/static_call.h>

struct module;
struct tracepoint;
@@ -94,7 +95,9 @@ extern int syscall_regfunc(void);
extern void syscall_unregfunc(void);
#endif /* CONFIG_HAVE_SYSCALL_TRACEPOINTS */

+#ifndef PARAMS
#define PARAMS(args...) args
+#endif

#define TRACE_DEFINE_ENUM(x)
#define TRACE_DEFINE_SIZEOF(x)
@@ -161,12 +164,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* as "(void *, void)". The DECLARE_TRACE_NOARGS() will pass in just
* "void *data", where as the DECLARE_TRACE() will pass in "void *data, proto".
*/
-#define __DO_TRACE(tp, proto, args, cond, rcuidle) \
+#define __DO_TRACE(name, proto, args, cond, rcuidle) \
do { \
struct tracepoint_func *it_func_ptr; \
- void *it_func; \
- void *__data; \
int __maybe_unused idx = 0; \
+ void *__data; \
\
if (!(cond)) \
return; \
@@ -186,14 +188,11 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
rcu_irq_enter_irqson(); \
} \
\
- it_func_ptr = rcu_dereference_raw((tp)->funcs); \
- \
+ it_func_ptr = \
+ rcu_dereference_raw((&__tracepoint_##name)->funcs); \
if (it_func_ptr) { \
- do { \
- it_func = (it_func_ptr)->func; \
- __data = (it_func_ptr)->data; \
- ((void(*)(proto))(it_func))(args); \
- } while ((++it_func_ptr)->func); \
+ __data = (it_func_ptr)->data; \
+ static_call(__tp_func_##name, args); \
} \
\
if (rcuidle) { \
@@ -209,7 +208,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
static inline void trace_##name##_rcuidle(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
- __DO_TRACE(&__tracepoint_##name, \
+ __DO_TRACE(name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
TP_CONDITION(cond), 1); \
@@ -231,11 +230,13 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* poking RCU a bit.
*/
#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
+ extern int __tracepoint_iter_##name(data_proto); \
+ DECLARE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name); \
extern struct tracepoint __tracepoint_##name; \
static inline void trace_##name(proto) \
{ \
if (static_key_false(&__tracepoint_##name.key)) \
- __DO_TRACE(&__tracepoint_##name, \
+ __DO_TRACE(name, \
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
TP_CONDITION(cond), 0); \
@@ -281,21 +282,43 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
* structures, so we create an array of pointers that will be used for iteration
* on the tracepoints.
*/
-#define DEFINE_TRACE_FN(name, reg, unreg) \
- static const char __tpstrtab_##name[] \
- __attribute__((section("__tracepoints_strings"))) = #name; \
- struct tracepoint __tracepoint_##name \
- __attribute__((section("__tracepoints"), used)) = \
- { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, reg, unreg, NULL };\
- __TRACEPOINT_ENTRY(name);
+#define DEFINE_TRACE_FN(name, reg, unreg, proto, args) \
+ static const char __tpstrtab_##name[] \
+ __attribute__((section("__tracepoints_strings"))) = #name; \
+ extern struct static_call_key __tp_func_##name; \
+ int __tracepoint_iter_##name(void *__data, proto); \
+ struct tracepoint __tracepoint_##name \
+ __attribute__((section("__tracepoints"), used)) = \
+ { __tpstrtab_##name, STATIC_KEY_INIT_FALSE, \
+ &__tp_func_##name, __tracepoint_iter_##name, \
+ reg, unreg, NULL }; \
+ __TRACEPOINT_ENTRY(name); \
+ int __tracepoint_iter_##name(void *__data, proto) \
+ { \
+ struct tracepoint_func *it_func_ptr; \
+ void *it_func; \
+ \
+ it_func_ptr = \
+ rcu_dereference_raw((&__tracepoint_##name)->funcs); \
+ do { \
+ it_func = (it_func_ptr)->func; \
+ __data = (it_func_ptr)->data; \
+ ((void(*)(void *, proto))(it_func))(__data, args); \
+ } while ((++it_func_ptr)->func); \
+ return 0; \
+ } \
+ DEFINE_STATIC_CALL(__tp_func_##name, __tracepoint_iter_##name);

-#define DEFINE_TRACE(name) \
- DEFINE_TRACE_FN(name, NULL, NULL);
+#define DEFINE_TRACE(name, proto, args) \
+ DEFINE_TRACE_FN(name, NULL, NULL, PARAMS(proto), PARAMS(args));

#define EXPORT_TRACEPOINT_SYMBOL_GPL(name) \
- EXPORT_SYMBOL_GPL(__tracepoint_##name)
+ EXPORT_SYMBOL_GPL(__tracepoint_##name); \
+ EXPORT_STATIC_CALL_GPL(__tp_func_##name)
#define EXPORT_TRACEPOINT_SYMBOL(name) \
- EXPORT_SYMBOL(__tracepoint_##name)
+ EXPORT_SYMBOL(__tracepoint_##name); \
+ EXPORT_STATIC_CALL(__tp_func_##name)
+

#else /* !TRACEPOINTS_ENABLED */
#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
@@ -324,8 +347,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
return false; \
}

-#define DEFINE_TRACE_FN(name, reg, unreg)
-#define DEFINE_TRACE(name)
+#define DEFINE_TRACE_FN(name, reg, unreg, proto, args)
+#define DEFINE_TRACE(name, proto, args)
#define EXPORT_TRACEPOINT_SYMBOL_GPL(name)
#define EXPORT_TRACEPOINT_SYMBOL(name)

diff --git a/include/trace/define_trace.h b/include/trace/define_trace.h
index cb30c5532144..c19aea44efb2 100644
--- a/include/trace/define_trace.h
+++ b/include/trace/define_trace.h
@@ -25,7 +25,7 @@

#undef TRACE_EVENT
#define TRACE_EVENT(name, proto, args, tstruct, assign, print) \
- DEFINE_TRACE(name)
+ DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))

#undef TRACE_EVENT_CONDITION
#define TRACE_EVENT_CONDITION(name, proto, args, cond, tstruct, assign, print) \
@@ -39,24 +39,24 @@
#undef TRACE_EVENT_FN
#define TRACE_EVENT_FN(name, proto, args, tstruct, \
assign, print, reg, unreg) \
- DEFINE_TRACE_FN(name, reg, unreg)
+ DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))

#undef TRACE_EVENT_FN_COND
#define TRACE_EVENT_FN_COND(name, proto, args, cond, tstruct, \
assign, print, reg, unreg) \
- DEFINE_TRACE_FN(name, reg, unreg)
+ DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))

#undef DEFINE_EVENT
#define DEFINE_EVENT(template, name, proto, args) \
- DEFINE_TRACE(name)
+ DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))

#undef DEFINE_EVENT_FN
#define DEFINE_EVENT_FN(template, name, proto, args, reg, unreg) \
- DEFINE_TRACE_FN(name, reg, unreg)
+ DEFINE_TRACE_FN(name, reg, unreg, PARAMS(proto), PARAMS(args))

#undef DEFINE_EVENT_PRINT
#define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
- DEFINE_TRACE(name)
+ DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))

#undef DEFINE_EVENT_CONDITION
#define DEFINE_EVENT_CONDITION(template, name, proto, args, cond) \
@@ -64,7 +64,7 @@

#undef DECLARE_TRACE
#define DECLARE_TRACE(name, proto, args) \
- DEFINE_TRACE(name)
+ DEFINE_TRACE(name, PARAMS(proto), PARAMS(args))

#undef TRACE_INCLUDE
#undef __TRACE_INCLUDE
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index a3be42304485..55ccf794f4d3 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -140,7 +140,7 @@ static void debug_print_probes(struct tracepoint_func *funcs)

static struct tracepoint_func *
func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
- int prio)
+ int prio, int *tot_probes)
{
struct tracepoint_func *old, *new;
int nr_probes = 0;
@@ -183,11 +183,12 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
new[nr_probes + 1].func = NULL;
*funcs = new;
debug_print_probes(*funcs);
+ *tot_probes = nr_probes + 1;
return old;
}

static void *func_remove(struct tracepoint_func **funcs,
- struct tracepoint_func *tp_func)
+ struct tracepoint_func *tp_func, int *left)
{
int nr_probes = 0, nr_del = 0, i;
struct tracepoint_func *old, *new;
@@ -241,6 +242,7 @@ static int tracepoint_add_func(struct tracepoint *tp,
struct tracepoint_func *func, int prio)
{
struct tracepoint_func *old, *tp_funcs;
+ int probes = 0;
int ret;

if (tp->regfunc && !static_key_enabled(&tp->key)) {
@@ -251,7 +253,7 @@ static int tracepoint_add_func(struct tracepoint *tp,

tp_funcs = rcu_dereference_protected(tp->funcs,
lockdep_is_held(&tracepoints_mutex));
- old = func_add(&tp_funcs, func, prio);
+ old = func_add(&tp_funcs, func, prio, &probes);
if (IS_ERR(old)) {
WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
return PTR_ERR(old);
@@ -266,6 +268,12 @@ static int tracepoint_add_func(struct tracepoint *tp,
rcu_assign_pointer(tp->funcs, tp_funcs);
if (!static_key_enabled(&tp->key))
static_key_slow_inc(&tp->key);
+
+ if (probes == 1)
+ __static_call_update(tp->static_call_key, tp_funcs->func);
+ else
+ __static_call_update(tp->static_call_key, tp->iterator);
+
release_probes(old);
return 0;
}
@@ -280,10 +288,11 @@ static int tracepoint_remove_func(struct tracepoint *tp,
struct tracepoint_func *func)
{
struct tracepoint_func *old, *tp_funcs;
+ int probes_left = 0;

tp_funcs = rcu_dereference_protected(tp->funcs,
lockdep_is_held(&tracepoints_mutex));
- old = func_remove(&tp_funcs, func);
+ old = func_remove(&tp_funcs, func, &probes_left);
if (IS_ERR(old)) {
WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
return PTR_ERR(old);
@@ -297,6 +306,12 @@ static int tracepoint_remove_func(struct tracepoint *tp,
if (static_key_enabled(&tp->key))
static_key_slow_dec(&tp->key);
}
+
+ if (probes_left == 1)
+ __static_call_update(tp->static_call_key, tp_funcs->func);
+ else
+ __static_call_update(tp->static_call_key, tp->iterator);
+
rcu_assign_pointer(tp->funcs, tp_funcs);
release_probes(old);
return 0;

2018-11-26 15:49:19

by Peter Zijlstra

Subject: Re: [PATCH v2 3/4] x86/static_call: Add out-of-line static call implementation

On Mon, Nov 26, 2018 at 07:54:59AM -0600, Josh Poimboeuf wrote:

> +void static_call_bp_handler(void);
> +void *bp_handler_dest;
> +
> +asm(".pushsection .text, \"ax\" \n"
> + ".globl static_call_bp_handler \n"
> + ".type static_call_bp_handler, @function \n"
> + "static_call_bp_handler: \n"
> + "ANNOTATE_RETPOLINE_SAFE \n"
> + "jmp *bp_handler_dest \n"
> + ".popsection \n");
> +
> +void arch_static_call_transform(void *site, void *tramp, void *func)
> +{
> + s32 dest_relative;
> + unsigned long insn;
> + unsigned char insn_opcode;
> + unsigned char opcodes[CALL_INSN_SIZE];
> +
> + insn = (unsigned long)tramp;
> +
> + mutex_lock(&text_mutex);
> +
> + insn_opcode = *(unsigned char *)insn;
> + if (insn_opcode != 0xe8 && insn_opcode != 0xe9) {
> + WARN_ONCE(1, "unexpected static call insn opcode 0x%x at %pS",
> + insn_opcode, (void *)insn);
> + goto done;
> + }
> +
> + dest_relative = (long)(func) - (long)(insn + CALL_INSN_SIZE);
> +
> + opcodes[0] = insn_opcode;
> + memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> +
> + /* Set up the variable for the breakpoint handler: */
> + bp_handler_dest = func;
> +
> + /* Patch the call site: */
> + text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> + static_call_bp_handler);

I'm confused by the whole static_call_bp_handler thing; why not jump
straight to @func?

Also, what guarantees this other thread will have gotten through
static_call_bp_handler and executed the actual indirect JMP instruction
by the time we re-write @bp_handler_dest again?

> +done:
> + mutex_unlock(&text_mutex);
> +}

2018-11-26 16:11:21

by Peter Zijlstra

Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> index 8026d176f25c..d3869295b88d 100644
> --- a/arch/x86/kernel/static_call.c
> +++ b/arch/x86/kernel/static_call.c
> @@ -9,13 +9,21 @@
>
> void static_call_bp_handler(void);
> void *bp_handler_dest;
> +void *bp_handler_continue;
>
> asm(".pushsection .text, \"ax\" \n"
> ".globl static_call_bp_handler \n"
> ".type static_call_bp_handler, @function \n"
> "static_call_bp_handler: \n"
> - "ANNOTATE_RETPOLINE_SAFE \n"
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> + ANNOTATE_RETPOLINE_SAFE
> + "call *bp_handler_dest \n"
> + ANNOTATE_RETPOLINE_SAFE
> + "jmp *bp_handler_continue \n"
> +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> + ANNOTATE_RETPOLINE_SAFE
> "jmp *bp_handler_dest \n"
> +#endif
> ".popsection \n");
>
> void arch_static_call_transform(void *site, void *tramp, void *func)
> @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> unsigned char insn_opcode;
> unsigned char opcodes[CALL_INSN_SIZE];
>
> - insn = (unsigned long)tramp;
> + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> + insn = (unsigned long)site;
> + else
> + insn = (unsigned long)tramp;
>
> mutex_lock(&text_mutex);
>
> @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> opcodes[0] = insn_opcode;
> memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
>
> - /* Set up the variable for the breakpoint handler: */
> + /* Set up the variables for the breakpoint handler: */
> bp_handler_dest = func;
> + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
>
> /* Patch the call site: */
> text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,

OK, so this is where that static_call_bp_handler comes from; you need
that CALL to frob the stack.

But I still think it is broken; consider:

CPU0 CPU1

bp_handler = ponies;

text_poke_bp(, &static_call_bp_handler)
text_poke(&int3);
on_each_cpu(sync)
<IPI>
...
</IPI>

text_poke(/* all but first bytes */)
on_each_cpu(sync)
<IPI>
...
</IPI>

<int3>
pt_regs->ip = &static_call_bp_handler
</int3>

// VCPU takes a nap...
text_poke(/* first byte */)
on_each_cpu(sync)
<IPI>
...
</IPI>

// VCPU sleeps more
bp_handler = unicorn;

CALL unicorn

*whoops*

Now, granted, that is all rather 'unlikely', but that never stopped
Murphy.

2018-11-26 16:17:21

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > +{
> > + unsigned short opcode = INSN_UD2;
> > +
> > + mutex_lock(&text_mutex);
> > + text_poke((void *)tramp, &opcode, 2);
> > + mutex_unlock(&text_mutex);
> > +}
> > +#endif
>
> I would rather think that makes the trampoline _more_ dangerous, rather
> than less so.
>
> My dictionary sayeth:
>
> defuse: verb
>
> - remove the fuse from (an explosive device) in order to prevent it
> from exploding.
>
> - make (a situation) less tense or dangerous
>
> patching in an UD2 seems to do the exact opposite.

That is my fault.

The original name was 'poison' iirc, but on arm64, we need to retain
the trampoline for cases where the direct branch is out of range, and
so poisoning is semantically inaccurate.

But since you opened your dictionary anyway, any better suggestions? :-)

2018-11-26 16:19:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> +void arch_static_call_defuse_tramp(void *site, void *tramp)
> +{
> + unsigned short opcode = INSN_UD2;
> +
> + mutex_lock(&text_mutex);
> + text_poke((void *)tramp, &opcode, 2);
> + mutex_unlock(&text_mutex);
> +}
> +#endif

I would rather think that makes the trampoline _more_ dangerous, rather
than less so.

My dictionary sayeth:

defuse: verb

- remove the fuse from (an explosive device) in order to prevent it
from exploding.

- make (a situation) less tense or dangerous

patching in an UD2 seems to do the exact opposite.

2018-11-26 16:22:51

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 3/4] x86/static_call: Add out-of-line static call implementation

On Mon, 26 Nov 2018 16:43:56 +0100
Peter Zijlstra <[email protected]> wrote:


> > + /* Patch the call site: */
> > + text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> > + static_call_bp_handler);
>
> I'm confused by the whole static_call_bp_handler thing; why not jump
> straight to @func ?

Interesting, that might work. I can give it a try.

-- Steve

2018-11-26 16:38:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 8:11 AM Ard Biesheuvel
<[email protected]> wrote:
>
> On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > > +{
> > > + unsigned short opcode = INSN_UD2;
> > > +
> > > + mutex_lock(&text_mutex);
> > > + text_poke((void *)tramp, &opcode, 2);
> > > + mutex_unlock(&text_mutex);
> > > +}
> > > +#endif
> >
> > I would rather think that makes the trampoline _more_ dangerous, rather
> > than less so.
> >
> > My dictionary sayeth:
> >
> > defuse: verb
> >
> > - remove the fuse from (an explosive device) in order to prevent it
> > from exploding.
> >
> > - make (a situation) less tense or dangerous
> >
> > patching in an UD2 seems to do the exact opposite.
>
> That is my fault.
>
> The original name was 'poison' iirc, but on arm64, we need to retain
> the trampoline for cases where the direct branch is out of range, and
> so poisoning is semantically inaccurate.
>
> But since you opened your dictionary anyway, any better suggestions? :-)

Release? Finish?

2018-11-26 16:41:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 05:11:05PM +0100, Ard Biesheuvel wrote:
> On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > > +{
> > > + unsigned short opcode = INSN_UD2;
> > > +
> > > + mutex_lock(&text_mutex);
> > > + text_poke((void *)tramp, &opcode, 2);
> > > + mutex_unlock(&text_mutex);
> > > +}
> > > +#endif
> >
> > I would rather think that makes the trampoline _more_ dangerous, rather
> > than less so.
> >
> > My dictionary sayeth:
> >
> > defuse: verb
> >
> > - remove the fuse from (an explosive device) in order to prevent it
> > from exploding.
> >
> > - make (a situation) less tense or dangerous
> >
> > patching in an UD2 seems to do the exact opposite.
>
> That is my fault.
>
> The original name was 'poison' iirc, but on arm64, we need to retain
> the trampoline for cases where the direct branch is out of range, and
> so poisoning is semantically inaccurate.
>
> But since you opened your dictionary anyway, any better suggestions? :-)

I was leaning towards: "prime", but I'm not entirely sure that works
with your case.

2018-11-26 16:47:23

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 05:39:23PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 26, 2018 at 05:11:05PM +0100, Ard Biesheuvel wrote:
> > On Mon, 26 Nov 2018 at 17:08, Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > > +void arch_static_call_defuse_tramp(void *site, void *tramp)
> > > > +{
> > > > + unsigned short opcode = INSN_UD2;
> > > > +
> > > > + mutex_lock(&text_mutex);
> > > > + text_poke((void *)tramp, &opcode, 2);
> > > > + mutex_unlock(&text_mutex);
> > > > +}
> > > > +#endif
> > >
> > > I would rather think that makes the trampoline _more_ dangerous, rather
> > > than less so.
> > >
> > > My dictionary sayeth:
> > >
> > > defuse: verb
> > >
> > > - remove the fuse from (an explosive device) in order to prevent it
> > > from exploding.
> > >
> > > - make (a situation) less tense or dangerous
> > >
> > > patching in an UD2 seems to do the exact opposite.
> >
> > That is my fault.
> >
> > The original name was 'poison' iirc, but on arm64, we need to retain
> > the trampoline for cases where the direct branch is out of range, and
> > so poisoning is semantically inaccurate.
> >
> > But since you opened your dictionary anyway, any better suggestions? :-)
>
> I was leaning towards: "prime", but I'm not entirely sure that works
> with your case.

Maybe we should just go back to "poison", along with a comment that it
will not necessarily be poisoned for all arches. I think "poison" at
least describes the intent, if not always the implementation.

--
Josh

2018-11-26 17:15:03

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > index 8026d176f25c..d3869295b88d 100644
> > --- a/arch/x86/kernel/static_call.c
> > +++ b/arch/x86/kernel/static_call.c
> > @@ -9,13 +9,21 @@
> >
> > void static_call_bp_handler(void);
> > void *bp_handler_dest;
> > +void *bp_handler_continue;
> >
> > asm(".pushsection .text, \"ax\" \n"
> > ".globl static_call_bp_handler \n"
> > ".type static_call_bp_handler, @function \n"
> > "static_call_bp_handler: \n"
> > - "ANNOTATE_RETPOLINE_SAFE \n"
> > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > + ANNOTATE_RETPOLINE_SAFE
> > + "call *bp_handler_dest \n"
> > + ANNOTATE_RETPOLINE_SAFE
> > + "jmp *bp_handler_continue \n"
> > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > + ANNOTATE_RETPOLINE_SAFE
> > "jmp *bp_handler_dest \n"
> > +#endif
> > ".popsection \n");
> >
> > void arch_static_call_transform(void *site, void *tramp, void *func)
> > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > unsigned char insn_opcode;
> > unsigned char opcodes[CALL_INSN_SIZE];
> >
> > - insn = (unsigned long)tramp;
> > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > + insn = (unsigned long)site;
> > + else
> > + insn = (unsigned long)tramp;
> >
> > mutex_lock(&text_mutex);
> >
> > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > opcodes[0] = insn_opcode;
> > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> >
> > - /* Set up the variable for the breakpoint handler: */
> > + /* Set up the variables for the breakpoint handler: */
> > bp_handler_dest = func;
> > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
> >
> > /* Patch the call site: */
> > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
>
> OK, so this is where that static_call_bp_handler comes from; you need
> that CALL to frob the stack.
>
> But I still think it is broken; consider:
>
> CPU0 CPU1
>
> bp_handler = ponies;
>
> text_poke_bp(, &static_call_bp_handler)
> text_poke(&int3);
> on_each_cpu(sync)
> <IPI>
> ...
> </IPI>
>
> text_poke(/* all but first bytes */)
> on_each_cpu(sync)
> <IPI>
> ...
> </IPI>
>
> <int3>
> pt_regs->ip = &static_call_bp_handler
> </int3>
>
> // VCPU takes a nap...
> text_poke(/* first byte */)
> on_each_cpu(sync)
> <IPI>
> ...
> </IPI>
>
> // VCPU sleeps more
> bp_handler = unicorn;
>
> CALL unicorn
>
> *whoops*
>
> Now, granted, that is all rather 'unlikely', but that never stopped
> Murphy.

Good find, thanks Peter.

As we discussed on IRC, we'll need to fix this from within the int3
exception handler by faking the call: putting a fake return address on
the stack (pointing to right after the call) and setting regs->ip to the
called function.

And for the out-of-line case we can just jump straight to the function,
so the function itself will be the text_poke_bp() "handler".

So the static_call_bp_handler() trampoline will go away.
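
Roughly, that int3-time emulation amounts to the following (a sketch only;
the hook signature and names here are assumptions, the real interface is
worked out below):

static void static_call_emulate_call(struct pt_regs *regs,
				     unsigned long func,
				     unsigned long ret_addr)
{
	/* Fake the CALL: push the return address on the stack... */
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = ret_addr;

	/* ...and "return" from the exception into the destination. */
	regs->ip = func;
}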

--
Josh

2018-11-26 18:06:49

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 11:10:36AM -0600, Josh Poimboeuf wrote:
> On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > > index 8026d176f25c..d3869295b88d 100644
> > > --- a/arch/x86/kernel/static_call.c
> > > +++ b/arch/x86/kernel/static_call.c
> > > @@ -9,13 +9,21 @@
> > >
> > > void static_call_bp_handler(void);
> > > void *bp_handler_dest;
> > > +void *bp_handler_continue;
> > >
> > > asm(".pushsection .text, \"ax\" \n"
> > > ".globl static_call_bp_handler \n"
> > > ".type static_call_bp_handler, @function \n"
> > > "static_call_bp_handler: \n"
> > > - "ANNOTATE_RETPOLINE_SAFE \n"
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > + ANNOTATE_RETPOLINE_SAFE
> > > + "call *bp_handler_dest \n"
> > > + ANNOTATE_RETPOLINE_SAFE
> > > + "jmp *bp_handler_continue \n"
> > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > > + ANNOTATE_RETPOLINE_SAFE
> > > "jmp *bp_handler_dest \n"
> > > +#endif
> > > ".popsection \n");
> > >
> > > void arch_static_call_transform(void *site, void *tramp, void *func)
> > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > unsigned char insn_opcode;
> > > unsigned char opcodes[CALL_INSN_SIZE];
> > >
> > > - insn = (unsigned long)tramp;
> > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > + insn = (unsigned long)site;
> > > + else
> > > + insn = (unsigned long)tramp;
> > >
> > > mutex_lock(&text_mutex);
> > >
> > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > opcodes[0] = insn_opcode;
> > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> > >
> > > - /* Set up the variable for the breakpoint handler: */
> > > + /* Set up the variables for the breakpoint handler: */
> > > bp_handler_dest = func;
> > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
> > >
> > > /* Patch the call site: */
> > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> >
> > OK, so this is where that static_call_bp_handler comes from; you need
> > that CALL to frob the stack.
> >
> > But I still think it is broken; consider:
> >
> > CPU0 CPU1
> >
> > bp_handler = ponies;
> >
> > text_poke_bp(, &static_call_bp_handler)
> > text_poke(&int3);
> > on_each_cpu(sync)
> > <IPI>
> > ...
> > </IPI>
> >
> > text_poke(/* all but first bytes */)
> > on_each_cpu(sync)
> > <IPI>
> > ...
> > </IPI>
> >
> > <int3>
> > pt_regs->ip = &static_call_bp_handler
> > </int3>
> >
> > // VCPU takes a nap...
> > text_poke(/* first byte */)
> > on_each_cpu(sync)
> > <IPI>
> > ...
> > </IPI>
> >
> > // VCPU sleeps more
> > bp_handler = unicorn;
> >
> > CALL unicorn
> >
> > *whoops*
> >
> > Now, granted, that is all rather 'unlikely', but that never stopped
> > Murphy.
>
> Good find, thanks Peter.
>
> As we discussed on IRC, we'll need to fix this from within the int3
> exception handler by faking the call: putting a fake return address on
> the stack (pointing to right after the call) and setting regs->ip to the
> called function.
>
> And for the out-of-line case we can just jump straight to the function,
> so the function itself will be the text_poke_bp() "handler".
>
> So the static_call_bp_handler() trampoline will go away.

Peter suggested updating the text_poke_bp() interface to add a handler
which is called from int3 context. This seems to work.

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..7fcaa37c1876 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -20,6 +20,8 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,

extern void *text_poke_early(void *addr, const void *opcode, size_t len);

+typedef void (*bp_handler_t)(struct pt_regs *regs);
+
/*
* Clear and restore the kernel write-protection flag on the local CPU.
* Allows the kernel to edit read-only pages.
@@ -36,7 +38,8 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
-extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void *text_poke_bp(void *addr, const void *opcode, size_t len,
+ bp_handler_t handler, void *resume);
extern int after_bootmem;

#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ebeac487a20c..b6fb645488be 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -738,7 +738,8 @@ static void do_sync_core(void *info)
}

static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static void *bp_int3_resume, *bp_int3_addr;
+static bp_handler_t bp_int3_handler;

int poke_int3_handler(struct pt_regs *regs)
{
@@ -746,11 +747,11 @@ int poke_int3_handler(struct pt_regs *regs)
* Having observed our INT3 instruction, we now must observe
* bp_patching_in_progress.
*
- * in_progress = TRUE INT3
- * WMB RMB
- * write INT3 if (in_progress)
+ * in_progress = TRUE INT3
+ * WMB RMB
+ * write INT3 if (in_progress)
*
- * Idem for bp_int3_handler.
+ * Idem for bp_int3_resume.
*/
smp_rmb();

@@ -760,8 +761,10 @@ int poke_int3_handler(struct pt_regs *regs)
if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
return 0;

- /* set up the specified breakpoint handler */
- regs->ip = (unsigned long) bp_int3_handler;
+ if (bp_int3_handler)
+ bp_int3_handler(regs);
+
+ regs->ip = (unsigned long)bp_int3_resume;

return 1;

@@ -772,7 +775,8 @@ int poke_int3_handler(struct pt_regs *regs)
* @addr: address to patch
* @opcode: opcode of new instruction
* @len: length to copy
- * @handler: address to jump to when the temporary breakpoint is hit
+ * @handler: handler to call from int3 context (optional)
+ * @resume: address to jump to when returning from int3 context
*
* Modify multi-byte instruction by using int3 breakpoint on SMP.
* We completely avoid stop_machine() here, and achieve the
@@ -787,11 +791,13 @@ int poke_int3_handler(struct pt_regs *regs)
* replacing opcode
* - sync cores
*/
-void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void *text_poke_bp(void *addr, const void *opcode, size_t len,
+ bp_handler_t handler, void *resume)
{
unsigned char int3 = 0xcc;

bp_int3_handler = handler;
+ bp_int3_resume = resume;
bp_int3_addr = (u8 *)addr + sizeof(int3);
bp_patching_in_progress = true;

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index aac0c1f7e354..1a54c5c6d9f3 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -90,7 +90,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
return;
}

- text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
+ text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE, NULL,
(void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
}

diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 40b16b270656..5787f48be243 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -446,7 +446,7 @@ void arch_optimize_kprobes(struct list_head *oplist)
insn_buf[0] = RELATIVEJUMP_OPCODE;
*(s32 *)(&insn_buf[1]) = rel;

- text_poke_bp(op->kp.addr, insn_buf, RELATIVEJUMP_SIZE,
+ text_poke_bp(op->kp.addr, insn_buf, RELATIVEJUMP_SIZE, NULL,
op->optinsn.insn);

list_del_init(&op->list);
@@ -461,7 +461,7 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
/* Set int3 to first byte for kprobes */
insn_buf[0] = BREAKPOINT_INSTRUCTION;
memcpy(insn_buf + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
- text_poke_bp(op->kp.addr, insn_buf, RELATIVEJUMP_SIZE,
+ text_poke_bp(op->kp.addr, insn_buf, RELATIVEJUMP_SIZE, NULL,
op->optinsn.insn);
}

diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index d3869295b88d..8fd6c8556750 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -7,24 +7,19 @@

#define CALL_INSN_SIZE 5

-void static_call_bp_handler(void);
-void *bp_handler_dest;
-void *bp_handler_continue;
+unsigned long bp_handler_call_return_addr;

-asm(".pushsection .text, \"ax\" \n"
- ".globl static_call_bp_handler \n"
- ".type static_call_bp_handler, @function \n"
- "static_call_bp_handler: \n"
+static void static_call_bp_handler(struct pt_regs *regs)
+{
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
- ANNOTATE_RETPOLINE_SAFE
- "call *bp_handler_dest \n"
- ANNOTATE_RETPOLINE_SAFE
- "jmp *bp_handler_continue \n"
-#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
- ANNOTATE_RETPOLINE_SAFE
- "jmp *bp_handler_dest \n"
+ /*
+ * Push the return address on the stack so the "called" function will
+ * return to immediately after the call site.
+ */
+ regs->sp -= sizeof(long);
+ *(unsigned long *)regs->sp = bp_handler_call_return_addr;
#endif
- ".popsection \n");
+}

void arch_static_call_transform(void *site, void *tramp, void *func)
{
@@ -52,14 +47,12 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
opcodes[0] = insn_opcode;
memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);

- /* Set up the variables for the breakpoint handler: */
- bp_handler_dest = func;
if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
- bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
+ bp_handler_call_return_addr = insn + CALL_INSN_SIZE;

/* Patch the call site: */
text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
- static_call_bp_handler);
+ static_call_bp_handler, func);

done:
mutex_unlock(&text_mutex);
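
For callers that need no int3-context work, the proposed interface stays
simple: pass a NULL handler and just a resume address. A hypothetical caller
(sketch only; the jump_label and kprobes hunks above are the real
conversions) would look like:

static void patch_insn(void *addr, const void *new_insn, size_t len)
{
	/* No int3-context handler; resume right after the patched insn. */
	text_poke_bp(addr, new_insn, len, NULL, (u8 *)addr + len);
}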

2018-11-26 18:32:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 9:10 AM Josh Poimboeuf <[email protected]> wrote:
>
> On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > > index 8026d176f25c..d3869295b88d 100644
> > > --- a/arch/x86/kernel/static_call.c
> > > +++ b/arch/x86/kernel/static_call.c
> > > @@ -9,13 +9,21 @@
> > >
> > > void static_call_bp_handler(void);
> > > void *bp_handler_dest;
> > > +void *bp_handler_continue;
> > >
> > > asm(".pushsection .text, \"ax\" \n"
> > > ".globl static_call_bp_handler \n"
> > > ".type static_call_bp_handler, @function \n"
> > > "static_call_bp_handler: \n"
> > > - "ANNOTATE_RETPOLINE_SAFE \n"
> > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > + ANNOTATE_RETPOLINE_SAFE
> > > + "call *bp_handler_dest \n"
> > > + ANNOTATE_RETPOLINE_SAFE
> > > + "jmp *bp_handler_continue \n"
> > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > > + ANNOTATE_RETPOLINE_SAFE
> > > "jmp *bp_handler_dest \n"
> > > +#endif
> > > ".popsection \n");
> > >
> > > void arch_static_call_transform(void *site, void *tramp, void *func)
> > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > unsigned char insn_opcode;
> > > unsigned char opcodes[CALL_INSN_SIZE];
> > >
> > > - insn = (unsigned long)tramp;
> > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > + insn = (unsigned long)site;
> > > + else
> > > + insn = (unsigned long)tramp;
> > >
> > > mutex_lock(&text_mutex);
> > >
> > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > opcodes[0] = insn_opcode;
> > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> > >
> > > - /* Set up the variable for the breakpoint handler: */
> > > + /* Set up the variables for the breakpoint handler: */
> > > bp_handler_dest = func;
> > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
> > >
> > > /* Patch the call site: */
> > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> >
> > OK, so this is where that static_call_bp_handler comes from; you need
> > that CALL to frob the stack.
> >
> > But I still think it is broken; consider:
> >
> > CPU0 CPU1
> >
> > bp_handler = ponies;
> >
> > text_poke_bp(, &static_call_bp_handler)
> > text_poke(&int3);
> > on_each_cpu(sync)
> > <IPI>
> > ...
> > </IPI>
> >
> > text_poke(/* all but first bytes */)
> > on_each_cpu(sync)
> > <IPI>
> > ...
> > </IPI>
> >
> > <int3>
> > pt_regs->ip = &static_call_bp_handler
> > </int3>
> >
> > // VCPU takes a nap...
> > text_poke(/* first byte */)
> > on_each_cpu(sync)
> > <IPI>
> > ...
> > </IPI>
> >
> > // VCPU sleeps more
> > bp_handler = unicorn;
> >
> > CALL unicorn
> >
> > *whoops*
> >
> > Now, granted, that is all rather 'unlikely', but that never stopped
> > Murphy.
>
> Good find, thanks Peter.
>
> As we discussed on IRC, we'll need to fix this from within the int3
> exception handler by faking the call: putting a fake return address on
> the stack (pointing to right after the call) and setting regs->ip to the
> called function.

Can you add a comment that it will need updating when kernel CET is added?

2018-11-26 20:03:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote:
> Peter suggested updating the text_poke_bp() interface to add a handler
> which is called from int3 context. This seems to work.

> @@ -760,8 +761,10 @@ int poke_int3_handler(struct pt_regs *regs)
> if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
> return 0;
>
> - /* set up the specified breakpoint handler */
> - regs->ip = (unsigned long) bp_int3_handler;
> + if (bp_int3_handler)
> + bp_int3_handler(regs);
> +
> + regs->ip = (unsigned long)bp_int3_resume;
>
> return 1;
>

Peter also suggested you write that like:

if (bp_int3_handler)
bp_int3_handler(regs, resume);
else
regs->ip = resume;

That allows 'abusing' @resume as a 'data' pointer for @handler, which
allows for more complicated handlers.

2018-11-26 20:09:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote:
> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> index d3869295b88d..8fd6c8556750 100644
> --- a/arch/x86/kernel/static_call.c
> +++ b/arch/x86/kernel/static_call.c
> @@ -7,24 +7,19 @@
>
> #define CALL_INSN_SIZE 5
>
> +unsigned long bp_handler_call_return_addr;
>
> +static void static_call_bp_handler(struct pt_regs *regs)
> +{
> #ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> + /*
> + * Push the return address on the stack so the "called" function will
> + * return to immediately after the call site.
> + */
> + regs->sp -= sizeof(long);
> + *(unsigned long *)regs->sp = bp_handler_call_return_addr;
> #endif
> +}
>
> void arch_static_call_transform(void *site, void *tramp, void *func)
> {
> @@ -52,14 +47,12 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> opcodes[0] = insn_opcode;
> memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
>
> if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> + bp_handler_call_return_addr = insn + CALL_INSN_SIZE;
>
> /* Patch the call site: */
> text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> - static_call_bp_handler);
> + static_call_bp_handler, func);
>
> done:
> mutex_unlock(&text_mutex);


like maybe something along the lines of:

struct sc_data {
unsigned long ret;
unsigned long ip;
};

void sc_handler(struct pt_regs *regs, void *data)
{
struct sc_data *scd = data;

regs->sp -= sizeof(long);
*(unsigned long *)regs->sp = scd->ret;
regs->ip = scd->ip;
}

arch_static_call_transform()
{
...

scd = (struct sc_data){
.ret = insn + CALL_INSN_SIZE,
.ip = (unsigned long)func,
};

text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
sc_handler, (void *)&scd);

...
}

2018-11-26 20:16:07

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 10:28:08AM -0800, Andy Lutomirski wrote:
> On Mon, Nov 26, 2018 at 9:10 AM Josh Poimboeuf <[email protected]> wrote:
> >
> > On Mon, Nov 26, 2018 at 05:02:17PM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 26, 2018 at 07:55:00AM -0600, Josh Poimboeuf wrote:
> > > > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > > > index 8026d176f25c..d3869295b88d 100644
> > > > --- a/arch/x86/kernel/static_call.c
> > > > +++ b/arch/x86/kernel/static_call.c
> > > > @@ -9,13 +9,21 @@
> > > >
> > > > void static_call_bp_handler(void);
> > > > void *bp_handler_dest;
> > > > +void *bp_handler_continue;
> > > >
> > > > asm(".pushsection .text, \"ax\" \n"
> > > > ".globl static_call_bp_handler \n"
> > > > ".type static_call_bp_handler, @function \n"
> > > > "static_call_bp_handler: \n"
> > > > - "ANNOTATE_RETPOLINE_SAFE \n"
> > > > +#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > > > + ANNOTATE_RETPOLINE_SAFE
> > > > + "call *bp_handler_dest \n"
> > > > + ANNOTATE_RETPOLINE_SAFE
> > > > + "jmp *bp_handler_continue \n"
> > > > +#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
> > > > + ANNOTATE_RETPOLINE_SAFE
> > > > "jmp *bp_handler_dest \n"
> > > > +#endif
> > > > ".popsection \n");
> > > >
> > > > void arch_static_call_transform(void *site, void *tramp, void *func)
> > > > @@ -25,7 +33,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > > unsigned char insn_opcode;
> > > > unsigned char opcodes[CALL_INSN_SIZE];
> > > >
> > > > - insn = (unsigned long)tramp;
> > > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > > + insn = (unsigned long)site;
> > > > + else
> > > > + insn = (unsigned long)tramp;
> > > >
> > > > mutex_lock(&text_mutex);
> > > >
> > > > @@ -41,8 +52,10 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > > > opcodes[0] = insn_opcode;
> > > > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> > > >
> > > > - /* Set up the variable for the breakpoint handler: */
> > > > + /* Set up the variables for the breakpoint handler: */
> > > > bp_handler_dest = func;
> > > > + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > > > + bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
> > > >
> > > > /* Patch the call site: */
> > > > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> > >
> > > OK, so this is where that static_call_bp_handler comes from; you need
> > > that CALL to frob the stack.
> > >
> > > But I still think it is broken; consider:
> > >
> > > CPU0 CPU1
> > >
> > > bp_handler = ponies;
> > >
> > > text_poke_bp(, &static_call_bp_handler)
> > > text_poke(&int3);
> > > on_each_cpu(sync)
> > > <IPI>
> > > ...
> > > </IPI>
> > >
> > > text_poke(/* all but first bytes */)
> > > on_each_cpu(sync)
> > > <IPI>
> > > ...
> > > </IPI>
> > >
> > > <int3>
> > > pt_regs->ip = &static_call_bp_handler
> > > </int3>
> > >
> > > // VCPU takes a nap...
> > > text_poke(/* first byte */)
> > > on_each_cpu(sync)
> > > <IPI>
> > > ...
> > > </IPI>
> > >
> > > // VCPU sleeps more
> > > bp_handler = unicorn;
> > >
> > > CALL unicorn
> > >
> > > *whoops*
> > >
> > > Now, granted, that is all rather 'unlikely', but that never stopped
> > > Murphy.
> >
> > Good find, thanks Peter.
> >
> > As we discussed on IRC, we'll need to fix this from within the int3
> > exception handler by faking the call: putting a fake return address on
> > the stack (pointing to right after the call) and setting regs->ip to the
> > called function.
>
> Can you add a comment that it will need updating when kernel CET is added?

Will do, though I get the feeling there's a lot of other (existing) code
that will also need to change for kernel CET.

--
Josh

2018-11-26 20:55:39

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls


Here's the test with the attached config (a Fedora distro config with
localmodconfig run against it), along with two patches that implement
tracepoints with static calls. The first patch makes a tracepoint call
a function pointer directly when there is only one callback, or an
"iterator" that walks the list of callbacks when more than one
callback is associated with the tracepoint.

It adds printk()s where it enables and disables the tracepoints, so
expect to see a lot of output when you enable the tracepoints. This is
to verify that the right code is being assigned.

Here's what I did.

1) I first took the config and turned off CONFIG_RETPOLINE and built
v4.20-rc4 with that. I ran this to see what the effect was without
retpolines. I booted that kernel and did the following (which is also
what I did for every kernel):

# trace-cmd start -e all

To get the same effect you could also do:

# echo 1 > /sys/kernel/debug/tracing/events/enable

# perf stat -r 10 /work/c/hackbench 50

The output was this:

No RETPOLINES:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.351
Time: 1.414
Time: 1.319
Time: 1.277
Time: 1.280
Time: 1.305
Time: 1.294
Time: 1.342
Time: 1.319
Time: 1.288

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,727.44 msec task-clock # 7.397 CPUs utilized ( +- 0.95% )
126,300 context-switches # 11774.138 M/sec ( +- 13.80% )
14,309 cpu-migrations # 1333.973 M/sec ( +- 8.73% )
44,073 page-faults # 4108.652 M/sec ( +- 0.68% )
39,484,799,554 cycles # 3680914.295 GHz ( +- 0.95% )
28,470,896,143 stalled-cycles-frontend # 72.11% frontend cycles idle ( +- 0.95% )
26,521,427,813 instructions # 0.67 insn per cycle
# 1.07 stalled cycles per insn ( +- 0.85% )
4,931,066,096 branches # 459691625.400 M/sec ( +- 0.87% )
19,063,801 branch-misses # 0.39% of all branches ( +- 2.05% )

1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )

Then I enabled CONFIG_RETPOLINE, built and booted that kernel, and ran
it again:

baseline RETPOLINES:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.313
Time: 1.386
Time: 1.335
Time: 1.363
Time: 1.357
Time: 1.369
Time: 1.363
Time: 1.489
Time: 1.357
Time: 1.422

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,162.24 msec task-clock # 7.383 CPUs utilized ( +- 1.11% )
112,882 context-switches # 10113.153 M/sec ( +- 15.86% )
14,255 cpu-migrations # 1277.103 M/sec ( +- 7.78% )
43,067 page-faults # 3858.393 M/sec ( +- 1.04% )
41,076,270,559 cycles # 3680042.874 GHz ( +- 1.12% )
29,669,137,584 stalled-cycles-frontend # 72.23% frontend cycles idle ( +- 1.21% )
26,647,656,812 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.81% )
5,069,504,923 branches # 454179389.091 M/sec ( +- 0.83% )
99,135,413 branch-misses # 1.96% of all branches ( +- 0.87% )

1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )


Then I applied the first tracepoint patch, which makes the tracepoints
call directly (and allows static calls to be used later), and tested that.

Added direct calls for trace_events:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.448
Time: 1.386
Time: 1.404
Time: 1.386
Time: 1.344
Time: 1.397
Time: 1.378
Time: 1.351
Time: 1.369
Time: 1.385

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,249.28 msec task-clock # 7.382 CPUs utilized ( +- 0.64% )
112,058 context-switches # 9961.721 M/sec ( +- 11.15% )
15,535 cpu-migrations # 1381.033 M/sec ( +- 10.34% )
43,673 page-faults # 3882.433 M/sec ( +- 1.14% )
41,407,431,000 cycles # 3681020.455 GHz ( +- 0.63% )
29,842,394,154 stalled-cycles-frontend # 72.07% frontend cycles idle ( +- 0.63% )
26,669,867,181 instructions # 0.64 insn per cycle
# 1.12 stalled cycles per insn ( +- 0.58% )
5,085,122,641 branches # 452055102.392 M/sec ( +- 0.60% )
108,935,006 branch-misses # 2.14% of all branches ( +- 0.57% )

1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )


Then I added patches 1 and 2, applied the second attached patch, and
ran that:

With static calls:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.407
Time: 1.424
Time: 1.352
Time: 1.355
Time: 1.361
Time: 1.416
Time: 1.453
Time: 1.353
Time: 1.341
Time: 1.439

Performance counter stats for '/work/c/hackbench 50' (10 runs):

11,293.08 msec task-clock # 7.390 CPUs utilized ( +- 0.93% )
125,343 context-switches # 11099.462 M/sec ( +- 11.84% )
15,587 cpu-migrations # 1380.272 M/sec ( +- 8.21% )
43,871 page-faults # 3884.890 M/sec ( +- 1.06% )
41,567,508,330 cycles # 3680918.499 GHz ( +- 0.94% )
29,851,271,023 stalled-cycles-frontend # 71.81% frontend cycles idle ( +- 0.99% )
26,878,085,513 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.72% )
5,125,816,911 branches # 453905346.879 M/sec ( +- 0.74% )
107,643,635 branch-misses # 2.10% of all branches ( +- 0.71% )

1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )

Then I applied patch 3 and tested that:

With static call trampolines:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.350
Time: 1.333
Time: 1.369
Time: 1.361
Time: 1.375
Time: 1.352
Time: 1.316
Time: 1.336
Time: 1.339
Time: 1.371

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,964.38 msec task-clock # 7.392 CPUs utilized ( +- 0.41% )
75,986 context-switches # 6930.527 M/sec ( +- 9.23% )
12,464 cpu-migrations # 1136.858 M/sec ( +- 7.93% )
44,476 page-faults # 4056.558 M/sec ( +- 1.12% )
40,354,963,428 cycles # 3680712.468 GHz ( +- 0.42% )
29,057,240,222 stalled-cycles-frontend # 72.00% frontend cycles idle ( +- 0.46% )
26,171,883,339 instructions # 0.65 insn per cycle
# 1.11 stalled cycles per insn ( +- 0.32% )
4,978,193,830 branches # 454053195.523 M/sec ( +- 0.33% )
83,625,127 branch-misses # 1.68% of all branches ( +- 0.33% )

1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )

And finally I added patch 4 and tested that:

Full static calls:

# perf stat -r 10 /work/c/hackbench 50
Time: 1.302
Time: 1.323
Time: 1.356
Time: 1.325
Time: 1.372
Time: 1.373
Time: 1.319
Time: 1.313
Time: 1.362
Time: 1.322

Performance counter stats for '/work/c/hackbench 50' (10 runs):

10,865.10 msec task-clock # 7.373 CPUs utilized ( +- 0.62% )
88,718 context-switches # 8165.823 M/sec ( +- 10.11% )
13,463 cpu-migrations # 1239.125 M/sec ( +- 8.42% )
44,574 page-faults # 4102.673 M/sec ( +- 0.60% )
39,991,476,585 cycles # 3680897.280 GHz ( +- 0.63% )
28,713,229,777 stalled-cycles-frontend # 71.80% frontend cycles idle ( +- 0.68% )
26,289,703,633 instructions # 0.66 insn per cycle
# 1.09 stalled cycles per insn ( +- 0.44% )
4,983,099,105 branches # 458654631.123 M/sec ( +- 0.45% )
83,719,799 branch-misses # 1.68% of all branches ( +- 0.44% )

1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )


In summary, we had this:

No RETPOLINES:
1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )

baseline RETPOLINES:
1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )

Added direct calls for trace_events:
1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )

With static calls:
1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )

With static call trampolines:
1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )

Full static calls:
1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )


Adding Retpolines caused a 1.5120 / 1.4503 = 1.0425 ( 4.25% ) slowdown

Trampolines made it into 1.48328 / 1.4503 = 1.0227 ( 2.27% ) slowdown

With full static calls 1.47364 / 1.4503 = 1.0160 ( 1.6% ) slowdown

Going from 4.25% to 1.6% isn't bad, and I think this is very much worth
the effort. I did not expect it to go to 0%, as there are a lot of other
places where retpolines cause issues, but this shows that it does help
the tracing code.

I originally did the tests with the development config, which has a
bunch of debugging options enabled (hackbench usually takes over 9
seconds, not the 1.5 seen here), and the slowdown was closer to 9% with
retpolines. If people want me to redo the tests with that config I can,
or I can send them the config. Or better yet, the code is here; just
use your own configs.

-- Steve


Attachments:
config-distro (131.70 kB)
0001-tracepoints-Add-a-direct-call-or-an-iterator.patch (11.02 kB)
0002-tracepoints-Implement-it-with-dynamic-functions.patch (5.39 kB)

2018-11-26 21:28:54

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 09:08:01PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 26, 2018 at 11:56:24AM -0600, Josh Poimboeuf wrote:
> > diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> > index d3869295b88d..8fd6c8556750 100644
> > --- a/arch/x86/kernel/static_call.c
> > +++ b/arch/x86/kernel/static_call.c
> > @@ -7,24 +7,19 @@
> >
> > #define CALL_INSN_SIZE 5
> >
> > +unsigned long bp_handler_call_return_addr;
> >
> > +static void static_call_bp_handler(struct pt_regs *regs)
> > +{
> > #ifdef CONFIG_HAVE_STATIC_CALL_INLINE
> > + /*
> > + * Push the return address on the stack so the "called" function will
> > + * return to immediately after the call site.
> > + */
> > + regs->sp -= sizeof(long);
> > + *(unsigned long *)regs->sp = bp_handler_call_return_addr;
> > #endif
> > +}
> >
> > void arch_static_call_transform(void *site, void *tramp, void *func)
> > {
> > @@ -52,14 +47,12 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> > opcodes[0] = insn_opcode;
> > memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
> >
> > if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> > + bp_handler_call_return_addr = insn + CALL_INSN_SIZE;
> >
> > /* Patch the call site: */
> > text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> > - static_call_bp_handler);
> > + static_call_bp_handler, func);
> >
> > done:
> > mutex_unlock(&text_mutex);
>
>
> like maybe something along the lines of:
>
> struct sc_data {
> unsigned long ret;
> unsigned long ip;
> };
>
> void sc_handler(struct pt_regs *regs, void *data)
> {
> struct sc_data *scd = data;
>
> regs->sp -= sizeof(long);
> *(unsigned long *)regs->sp = scd->ret;
> regs->ip = scd->ip;
> }
>
> arch_static_call_transform()
> {
> ...
>
> scd = (struct sc_data){
> .ret = insn + CALL_INSN_SIZE,
> .ip = (unsigned long)func,
> };
>
> text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> sc_handler, (void *)&scd);
>
> ...
> }

Yeah, that's probably better. I assume you also mean that we would have
all text_poke_bp() users create a handler callback? That way the
interface is clear and consistent for everybody. Like:

diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..04d6cf838fb7 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -20,6 +20,8 @@ static inline void apply_paravirt(struct paravirt_patch_site *start,

extern void *text_poke_early(void *addr, const void *opcode, size_t len);

+typedef void (*bp_handler_t)(struct pt_regs *regs, void *data);
+
/*
* Clear and restore the kernel write-protection flag on the local CPU.
* Allows the kernel to edit read-only pages.
@@ -36,7 +38,8 @@ extern void *text_poke_early(void *addr, const void *opcode, size_t len);
*/
extern void *text_poke(void *addr, const void *opcode, size_t len);
extern int poke_int3_handler(struct pt_regs *regs);
-extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
+extern void *text_poke_bp(void *addr, const void *opcode, size_t len,
+ bp_handler_t handler, void *data);
extern int after_bootmem;

#endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ebeac487a20c..547af714bd60 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -738,7 +738,8 @@ static void do_sync_core(void *info)
}

static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static void *bp_int3_data, *bp_int3_addr;
+static bp_handler_t bp_int3_handler;

int poke_int3_handler(struct pt_regs *regs)
{
@@ -746,11 +747,11 @@ int poke_int3_handler(struct pt_regs *regs)
* Having observed our INT3 instruction, we now must observe
* bp_patching_in_progress.
*
- * in_progress = TRUE INT3
- * WMB RMB
- * write INT3 if (in_progress)
+ * in_progress = TRUE INT3
+ * WMB RMB
+ * write INT3 if (in_progress)
*
- * Idem for bp_int3_handler.
+ * Idem for bp_int3_data.
*/
smp_rmb();

@@ -760,8 +761,7 @@ int poke_int3_handler(struct pt_regs *regs)
if (user_mode(regs) || regs->ip != (unsigned long)bp_int3_addr)
return 0;

- /* set up the specified breakpoint handler */
- regs->ip = (unsigned long) bp_int3_handler;
+ bp_int3_handler(regs, bp_int3_data);

return 1;

@@ -772,7 +772,8 @@ int poke_int3_handler(struct pt_regs *regs)
* @addr: address to patch
* @opcode: opcode of new instruction
* @len: length to copy
- * @handler: address to jump to when the temporary breakpoint is hit
+ * @handler: handler to call from int3 context
+ * @data: opaque data passed to handler
*
* Modify multi-byte instruction by using int3 breakpoint on SMP.
* We completely avoid stop_machine() here, and achieve the
@@ -787,11 +788,13 @@ int poke_int3_handler(struct pt_regs *regs)
* replacing opcode
* - sync cores
*/
-void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
+void *text_poke_bp(void *addr, const void *opcode, size_t len,
+ bp_handler_t handler, void *data)
{
unsigned char int3 = 0xcc;

bp_int3_handler = handler;
+ bp_int3_data = data;
bp_int3_addr = (u8 *)addr + sizeof(int3);
bp_patching_in_progress = true;

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index aac0c1f7e354..d4b0abe4912d 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -37,6 +37,11 @@ static void bug_at(unsigned char *ip, int line)
BUG();
}

+static inline void jump_label_bp_handler(struct pt_regs *regs, void *data)
+{
+ regs->ip += JUMP_LABEL_NOP_SIZE - 1;
+}
+
static void __ref __jump_label_transform(struct jump_entry *entry,
enum jump_label_type type,
void *(*poker)(void *, const void *, size_t),
@@ -91,7 +96,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
}

text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
- (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
+ jump_label_bp_handler, NULL);
}

void arch_jump_label_transform(struct jump_entry *entry,
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 40b16b270656..b2dffdd6068d 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -424,6 +424,11 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe *op,
goto out;
}

+static void kprobes_poke_bp_handler(struct pt_regs *regs, void *data)
+{
+ regs->ip = (unsigned long)data;
+}
+
/*
* Replace breakpoints (int3) with relative jumps.
* Caller must call with locking kprobe_mutex and text_mutex.
@@ -447,7 +452,7 @@ void arch_optimize_kprobes(struct list_head *oplist)
*(s32 *)(&insn_buf[1]) = rel;

text_poke_bp(op->kp.addr, insn_buf, RELATIVEJUMP_SIZE,
- op->optinsn.insn);
+ kprobes_poke_bp_handler, op->optinsn.insn);

list_del_init(&op->list);
}
@@ -462,7 +467,7 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
insn_buf[0] = BREAKPOINT_INSTRUCTION;
memcpy(insn_buf + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
text_poke_bp(op->kp.addr, insn_buf, RELATIVEJUMP_SIZE,
- op->optinsn.insn);
+ kprobes_poke_bp_handler, op->optinsn.insn);
}

/*
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index d3869295b88d..e05ebc6d4db5 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -7,24 +7,30 @@

#define CALL_INSN_SIZE 5

-void static_call_bp_handler(void);
-void *bp_handler_dest;
-void *bp_handler_continue;
-
-asm(".pushsection .text, \"ax\" \n"
- ".globl static_call_bp_handler \n"
- ".type static_call_bp_handler, @function \n"
- "static_call_bp_handler: \n"
-#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
- ANNOTATE_RETPOLINE_SAFE
- "call *bp_handler_dest \n"
- ANNOTATE_RETPOLINE_SAFE
- "jmp *bp_handler_continue \n"
-#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */
- ANNOTATE_RETPOLINE_SAFE
- "jmp *bp_handler_dest \n"
-#endif
- ".popsection \n");
+struct static_call_bp_data {
+ unsigned long func, ret;
+};
+
+static void static_call_bp_handler(struct pt_regs *regs, void *_data)
+{
+ struct static_call_bp_data *data = _data;
+
+ /*
+ * For inline static calls, push the return address on the stack so the
+ * "called" function will return to the location immediately after the
+ * call site.
+ *
+ * NOTE: This code will need to be revisited when kernel CET gets
+ * implemented.
+ */
+ if (data->ret) {
+ regs->sp -= sizeof(long);
+ *(unsigned long *)regs->sp = data->ret;
+ }
+
+ /* The exception handler will 'return' to the destination function. */
+ regs->ip = data->func;
+}

void arch_static_call_transform(void *site, void *tramp, void *func)
{
@@ -32,11 +38,17 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
unsigned long insn;
unsigned char insn_opcode;
unsigned char opcodes[CALL_INSN_SIZE];
+ struct static_call_bp_data handler_data;
+
+ handler_data.func = (unsigned long)func;

- if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
+ if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) {
insn = (unsigned long)site;
- else
+ handler_data.ret = insn + CALL_INSN_SIZE;
+ } else {
insn = (unsigned long)tramp;
+ handler_data.ret = 0;
+ }

mutex_lock(&text_mutex);

@@ -52,14 +64,9 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
opcodes[0] = insn_opcode;
memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);

- /* Set up the variables for the breakpoint handler: */
- bp_handler_dest = func;
- if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
- bp_handler_continue = (void *)(insn + CALL_INSN_SIZE);
-
/* Patch the call site: */
text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
- static_call_bp_handler);
+ static_call_bp_handler, &handler_data);

done:
mutex_unlock(&text_mutex);

2018-11-26 22:25:18

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

On Mon, Nov 26, 2018 at 03:54:05PM -0500, Steven Rostedt wrote:
> In summary, we had this:
>
> No RETPOLINES:
> 1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )
>
> baseline RETPOLINES:
> 1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )
>
> Added direct calls for trace_events:
> 1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )
>
> With static calls:
> 1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )
>
> With static call trampolines:
> 1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )
>
> Full static calls:
> 1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )
>
>
> Adding Retpolines caused a 1.5120 / 1.4503 = 1.0425 ( 4.25% ) slowdown
>
> Trampolines made it into 1.48328 / 1.4503 = 1.0227 ( 2.27% ) slowdown
>
> With full static calls 1.47364 / 1.4503 = 1.0160 ( 1.6% ) slowdown
>
> Going from 4.25 to 1.6 isn't bad, and I think this is very much worth
> the effort. I did not expect it to go to 0% as there's a lot of other
> places that retpolines cause issues, but this shows that it does help
> the tracing code.
>
> I originally did the tests with the development config, which has a
> bunch of debugging options enabled (hackbench usually takes over 9
> seconds, not the 1.5 that was done here), and the slowdown was closer
> to 9% with retpolines. If people want me to do this with that, or I can
> send them the config. Or better yet, the code is here, just use your
> own configs.

Thanks a lot for running these. This looks like a nice speedup. Also a
nice reduction in the standard deviation.

Should I add your tracepoint patch to the next version of my patches?

--
Josh

2018-11-26 22:54:17

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

On Mon, 26 Nov 2018 16:24:20 -0600
Josh Poimboeuf <[email protected]> wrote:

> Should I add your tracepoint patch to the next version of my patches?
>

No, not yet. Especially since I haven't totally vetted them.

When yours are ready, I'll post an RFC, and then we can add them in. I
would want an Acked-by from Mathieu Desnoyers too.

-- Steve


2018-11-27 08:46:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 03:26:28PM -0600, Josh Poimboeuf wrote:

> Yeah, that's probably better. I assume you also mean that we would have
> all text_poke_bp() users create a handler callback? That way the
> interface is clear and consistent for everybody. Like:

Can do, it does indeed make the interface less like a hack. It is not
like there are too many users.

> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
> index aac0c1f7e354..d4b0abe4912d 100644
> --- a/arch/x86/kernel/jump_label.c
> +++ b/arch/x86/kernel/jump_label.c
> @@ -37,6 +37,11 @@ static void bug_at(unsigned char *ip, int line)
> BUG();
> }
>
> +static inline void jump_label_bp_handler(struct pt_regs *regs, void *data)
> +{
> + regs->ip += JUMP_LABEL_NOP_SIZE - 1;
> +}
> +
> static void __ref __jump_label_transform(struct jump_entry *entry,
> enum jump_label_type type,
> void *(*poker)(void *, const void *, size_t),
> @@ -91,7 +96,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
> }
>
> text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
> - (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
> + jump_label_bp_handler, NULL);
> }
>
> void arch_jump_label_transform(struct jump_entry *entry,

Per that example..

> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
> index d3869295b88d..e05ebc6d4db5 100644
> --- a/arch/x86/kernel/static_call.c
> +++ b/arch/x86/kernel/static_call.c
> @@ -7,24 +7,30 @@
>
> #define CALL_INSN_SIZE 5
>
> +struct static_call_bp_data {
> + unsigned long func, ret;
> +};
> +
> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> +{
> + struct static_call_bp_data *data = _data;
> +
> + /*
> + * For inline static calls, push the return address on the stack so the
> + * "called" function will return to the location immediately after the
> + * call site.
> + *
> + * NOTE: This code will need to be revisited when kernel CET gets
> + * implemented.
> + */
> + if (data->ret) {
> + regs->sp -= sizeof(long);
> + *(unsigned long *)regs->sp = data->ret;
> + }
> +
> + /* The exception handler will 'return' to the destination function. */
> + regs->ip = data->func;
> +}

Now; if I'm not mistaken, the below @site is in fact @regs->ip - 1, no?

We already patched site with INT3, which is what we just trapped on. So
we could in fact write something like:

static void static_call_bp_handler(struct pt_regs *regs, void *data)
{
struct static_call_bp_data *scd = data;

switch (data->type) {
case CALL_INSN: /* emulate CALL instruction */
regs->sp -= sizeof(unsigned long);
*(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1;
regs->ip = data->func;
break;

case JMP_INSN: /* emulate JMP instruction */
regs->ip = data->func;
break;
}
}

> void arch_static_call_transform(void *site, void *tramp, void *func)
> {
> @@ -32,11 +38,17 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> unsigned long insn;
> unsigned char insn_opcode;
> unsigned char opcodes[CALL_INSN_SIZE];
> + struct static_call_bp_data handler_data;
> +
> + handler_data.func = (unsigned long)func;
>
> - if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE))
> + if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE)) {
> insn = (unsigned long)site;
> + handler_data.ret = insn + CALL_INSN_SIZE;
> + } else {
> insn = (unsigned long)tramp;
> + handler_data.ret = 0;
> + }

handler_data = (struct static_call_bp_data){
.type = IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) ? CALL_INSN : JMP_INSN,
.func = func,
};

> mutex_lock(&text_mutex);
>
> @@ -52,14 +64,9 @@ void arch_static_call_transform(void *site, void *tramp, void *func)
> opcodes[0] = insn_opcode;
> memcpy(&opcodes[1], &dest_relative, CALL_INSN_SIZE - 1);
>
> /* Patch the call site: */
> text_poke_bp((void *)insn, opcodes, CALL_INSN_SIZE,
> + static_call_bp_handler, &handler_data);
>
> done:
> mutex_unlock(&text_mutex);

2018-11-27 08:47:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Nov 26, 2018 at 02:14:49PM -0600, Josh Poimboeuf wrote:
> On Mon, Nov 26, 2018 at 10:28:08AM -0800, Andy Lutomirski wrote:

> > Can you add a comment that it will need updating when kernel CET is added?
>
> Will do, though I get the feeling there's a lot of other (existing) code
> that will also need to change for kernel CET.

Yeah, function graph tracer and kretprobes at the very least. But I
suspect there are a few surprises to be had once they try kernel CET.


2018-11-27 11:16:36

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] compiler.h: Make __ADDRESSABLE() symbol truly unique

On Mon, 26 Nov 2018 at 14:55, Josh Poimboeuf <[email protected]> wrote:
>
> The __ADDRESSABLE() macro uses the __LINE__ macro to create a temporary
> symbol which has a unique name. However, if the macro is used multiple
> times from within another macro, the line number will always be the
> same, resulting in duplicate symbols.
>
> Make the temporary symbols truly unique by using __UNIQUE_ID instead of
> __LINE__.
>
> Signed-off-by: Josh Poimboeuf <[email protected]>

Acked-by: Ard Biesheuvel <[email protected]>

> ---
> include/linux/compiler.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 06396c1cf127..4bb73fd918b5 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -282,7 +282,7 @@ unsigned long read_word_at_a_time(const void *addr)
> */
> #define __ADDRESSABLE(sym) \
> static void * __section(".discard.addressable") __used \
> - __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
> + __UNIQUE_ID(__addressable_##sym) = (void *)&sym;
>
> /**
> * offset_to_ptr - convert a relative memory offset to an absolute pointer
> --
> 2.17.2
>

2018-11-27 11:24:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Tue, Nov 27, 2018 at 09:43:30AM +0100, Peter Zijlstra wrote:
> Now; if I'm not mistaken, the below @site is in fact @regs->ip - 1, no?
>
> We already patched site with INT3, which is what we just trapped on. So
> we could in fact write something like:
>
> static void static_call_bp_handler(struct pt_regs *regs, void *data)
> {
> struct static_call_bp_data *scd = data;
>
> switch (scd->type) {
> case CALL_INSN: /* emulate CALL instruction */
> regs->sp -= sizeof(unsigned long);
> *(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1;
> regs->ip = scd->func;
> break;
>
> case JMP_INSN: /* emulate JMP instruction */
> regs->ip = scd->func;
> break;
> }
> }

> handler_data = (struct static_call_bp_data){
> .type = IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) ? CALL_INSN : JMP_INSN,
> .func = func,
> };

Heck; check this:

static void static_call_bp_handler(struct pt_regs *regs, void *data)
{
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE

/* emulate CALL instruction */
regs->sp -= sizeof(unsigned long);
*(unsigned long *)regs->sp = regs->ip + CALL_INSN_SIZE - 1;
regs->ip = data;

#else /* !CONFIG_HAVE_STATIC_CALL_INLINE */

/* emulate JMP instruction */
regs->ip = data;

#endif
}



2018-11-29 06:08:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

> On Nov 27, 2018, at 12:43 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Mon, Nov 26, 2018 at 03:26:28PM -0600, Josh Poimboeuf wrote:
>>
>> Yeah, that's probably better. I assume you also mean that we would have
>> all text_poke_bp() users create a handler callback? That way the
>> interface is clear and consistent for everybody. Like:
>
> Can do, it does indeed make the interface less like a hack. It is not
> like there are too many users.
>
>> diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
>> index aac0c1f7e354..d4b0abe4912d 100644
>> --- a/arch/x86/kernel/jump_label.c
>> +++ b/arch/x86/kernel/jump_label.c
>> @@ -37,6 +37,11 @@ static void bug_at(unsigned char *ip, int line)
>> BUG();
>> }
>>
>> +static inline void jump_label_bp_handler(struct pt_regs *regs, void *data)
>> +{
>> + regs->ip += JUMP_LABEL_NOP_SIZE - 1;
>> +}
>> +
>> static void __ref __jump_label_transform(struct jump_entry *entry,
>> enum jump_label_type type,
>> void *(*poker)(void *, const void *, size_t),
>> @@ -91,7 +96,7 @@ static void __ref __jump_label_transform(struct jump_entry *entry,
>> }
>>
>> text_poke_bp((void *)jump_entry_code(entry), code, JUMP_LABEL_NOP_SIZE,
>> - (void *)jump_entry_code(entry) + JUMP_LABEL_NOP_SIZE);
>> + jump_label_bp_handler, NULL);
>> }
>>
>> void arch_jump_label_transform(struct jump_entry *entry,
>
> Per that example..
>
>> diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
>> index d3869295b88d..e05ebc6d4db5 100644
>> --- a/arch/x86/kernel/static_call.c
>> +++ b/arch/x86/kernel/static_call.c
>> @@ -7,24 +7,30 @@
>>
>> #define CALL_INSN_SIZE 5
>>
>> +struct static_call_bp_data {
>> + unsigned long func, ret;
>> +};
>> +
>> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
>> +{
>> + struct static_call_bp_data *data = _data;
>> +
>> + /*
>> + * For inline static calls, push the return address on the stack so the
>> + * "called" function will return to the location immediately after the
>> + * call site.
>> + *
>> + * NOTE: This code will need to be revisited when kernel CET gets
>> + * implemented.
>> + */
>> + if (data->ret) {
>> + regs->sp -= sizeof(long);
>> + *(unsigned long *)regs->sp = data->ret;
>> + }

You can’t do this. Depending on the alignment of the old RSP, which
is not guaranteed, this overwrites regs->cs. IRET goes boom.

Maybe it could be fixed by pointing regs->ip at a real trampoline?

This code is subtle and executed rarely, which is a bad combination.
It would be great if we had a test case.

I think it would be great if the implementation could be, literally:

regs->ip -= 1;
return;

IOW, just retry and wait until we get the new patched instruction.
The problem is that, if we're in a context where IRQs are off, then
we're preventing on_each_cpu() from completing and, even if we somehow
just let the code know that we already serialized ourselves, we're
still potentially holding a spinlock that another CPU is waiting for
with IRQs off. Ugh. Anyone have a clever idea to fix that?

2018-11-29 09:43:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:

> >> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> >> +{
> >> + struct static_call_bp_data *data = _data;
> >> +
> >> + /*
> >> + * For inline static calls, push the return address on the stack so the
> >> + * "called" function will return to the location immediately after the
> >> + * call site.
> >> + *
> >> + * NOTE: This code will need to be revisited when kernel CET gets
> >> + * implemented.
> >> + */
> >> + if (data->ret) {
> >> + regs->sp -= sizeof(long);
> >> + *(unsigned long *)regs->sp = data->ret;
> >> + }
>
> You can’t do this. Depending on the alignment of the old RSP, which
> is not guaranteed, this overwrites regs->cs. IRET goes boom.

I don't get it; can you spell that out?

The way I understand it is that we're at a location where a "E8 - Near
CALL" instruction should be, and thus RSP should be the regular kernel
stack, and the above simply does "PUSH ret", which is what that CALL
would've done too.



2018-11-29 13:13:26

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 10:42:10AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:
>
> > >> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> > >> +{
> > >> + struct static_call_bp_data *data = _data;
> > >> +
> > >> + /*
> > >> + * For inline static calls, push the return address on the stack so the
> > >> + * "called" function will return to the location immediately after the
> > >> + * call site.
> > >> + *
> > >> + * NOTE: This code will need to be revisited when kernel CET gets
> > >> + * implemented.
> > >> + */
> > >> + if (data->ret) {
> > >> + regs->sp -= sizeof(long);
> > >> + *(unsigned long *)regs->sp = data->ret;
> > >> + }
> >
> > You can’t do this. Depending on the alignment of the old RSP, which
> > is not guaranteed, this overwrites regs->cs. IRET goes boom.
>
> I don't get it; can you spell that out?

I don't quite follow that either. Maybe Andy is referring to x86-32,
for which regs->sp isn't actually saved: see kernel_stack_pointer().

This code is 64-bit only so that's not a concern.

> The way I understand it is that we're at a location where a "E8 - Near
> CALL" instruction should be, and thus RSP should be the regular kernel
> stack, and the above simply does "PUSH ret", which is what that CALL
> would've done too.

Right.

--
Josh

2018-11-29 13:43:24

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64



> On Nov 29, 2018, at 1:42 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:
>
>>>> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
>>>> +{
>>>> + struct static_call_bp_data *data = _data;
>>>> +
>>>> + /*
>>>> + * For inline static calls, push the return address on the stack so the
>>>> + * "called" function will return to the location immediately after the
>>>> + * call site.
>>>> + *
>>>> + * NOTE: This code will need to be revisited when kernel CET gets
>>>> + * implemented.
>>>> + */
>>>> + if (data->ret) {
>>>> + regs->sp -= sizeof(long);
>>>> + *(unsigned long *)regs->sp = data->ret;
>>>> + }
>>
>> You can’t do this. Depending on the alignment of the old RSP, which
>> is not guaranteed, this overwrites regs->cs. IRET goes boom.
>
> I don't get it; can you spell that out?
>
> The way I understand it is that we're at a location where a "E8 - Near
> CALL" instruction should be, and thus RSP should be the regular kernel
> stack, and the above simply does "PUSH ret", which is what that CALL
> would've done too.
>

int3 isn’t IST anymore, so the int3 instruction conditionally subtracts 8 from RSP and then pushes SS, etc. So my email was obviously wrong wrt “cs”, but you’re still potentially overwriting the int3 IRET frame.
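
To sketch the overlap (assuming the int3 hit kernel mode, so there is no
stack switch and the CPU builds the exception frame directly below the
interrupted RSP, after conditionally aligning it down by 8 -- which is
why the clobber is only "potential"):

	regs->sp	-> the call site's stack
	regs->sp - 8	-> saved SS	(top of the hardware IRET frame,
	regs->sp - 16	-> saved RSP	 or one slot lower if RSP was
	...				 aligned down first)

so writing the emulated return address at regs->sp - 8 can land right on
the saved SS of the frame that the later IRET consumes.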

2018-11-29 14:41:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 05:37:39AM -0800, Andy Lutomirski wrote:
>
>
> > On Nov 29, 2018, at 1:42 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:
> >
> >>>> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> >>>> +{
> >>>> + struct static_call_bp_data *data = _data;
> >>>> +
> >>>> + /*
> >>>> + * For inline static calls, push the return address on the stack so the
> >>>> + * "called" function will return to the location immediately after the
> >>>> + * call site.
> >>>> + *
> >>>> + * NOTE: This code will need to be revisited when kernel CET gets
> >>>> + * implemented.
> >>>> + */
> >>>> + if (data->ret) {
> >>>> + regs->sp -= sizeof(long);
> >>>> + *(unsigned long *)regs->sp = data->ret;
> >>>> + }
> >>
> >> You can’t do this. Depending on the alignment of the old RSP, which
> >> is not guaranteed, this overwrites regs->cs. IRET goes boom.
> >
> > I don't get it; can you spell that out?
> >
> > The way I understand it is that we're at a location where a "E8 - Near
> > CALL" instruction should be, and thus RSP should be the regular kernel
> > stack, and the above simply does "PUSH ret", which is what that CALL
> > would've done too.
> >
>
> int3 isn’t IST anymore, so the int3 instruction conditionally
> subtracts 8 from RSP and then pushes SS, etc. So my email was
> obviously wrong wrt “cs”, but you’re still potentially overwriting the
> int3 IRET frame.

ARGH!..

can't we 'fix' that again? The alternative is moving that IRET-frame and
fixing everything up, which is going to be fragile, ugly and such
things more.

Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack")
doesn't list any strong reasons for why it should NOT be an IST.



2018-11-29 14:44:01

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018, Peter Zijlstra wrote:

> > int3 isn’t IST anymore, so the int3 instruction conditionally
> > subtracts 8 from RSP and then pushes SS, etc. So my email was
> > obviously wrong wrt “cs”, but you’re still potentially overwriting the
> > int3 IRET frame.
>
> ARGH!..
>
> can't we 'fix' that again? The alternative is moving that IRET-frame and
> fixing everything up, which is going to be fragile, ugly and such
> things more.
>
> Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack")
> doesn't list any strong reasons for why it should NOT be an IST.

It's CVE-2018-8897.

--
Jiri Kosina
SUSE Labs


2018-11-29 16:36:47

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 03:38:53PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 05:37:39AM -0800, Andy Lutomirski wrote:
> >
> >
> > > On Nov 29, 2018, at 1:42 AM, Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Wed, Nov 28, 2018 at 10:05:54PM -0800, Andy Lutomirski wrote:
> > >
> > >>>> +static void static_call_bp_handler(struct pt_regs *regs, void *_data)
> > >>>> +{
> > >>>> + struct static_call_bp_data *data = _data;
> > >>>> +
> > >>>> + /*
> > >>>> + * For inline static calls, push the return address on the stack so the
> > >>>> + * "called" function will return to the location immediately after the
> > >>>> + * call site.
> > >>>> + *
> > >>>> + * NOTE: This code will need to be revisited when kernel CET gets
> > >>>> + * implemented.
> > >>>> + */
> > >>>> + if (data->ret) {
> > >>>> + regs->sp -= sizeof(long);
> > >>>> + *(unsigned long *)regs->sp = data->ret;
> > >>>> + }
> > >>
> > >> You can’t do this. Depending on the alignment of the old RSP, which
> > >> is not guaranteed, this overwrites regs->cs. IRET goes boom.
> > >
> > > I don't get it; can you spell that out?
> > >
> > > The way I understand it is that we're at a location where a "E8 - Near
> > > CALL" instruction should be, and thus RSP should be the regular kernel
> > > stack, and the above simply does "PUSH ret", which is what that CALL
> > > would've done too.
> > >
> >
> > int3 isn’t IST anymore, so the int3 instruction conditionally
> > subtracts 8 from RSP and then pushes SS, etc. So my email was
> > obviously wrong wrt “cs”, but you’re still potentially overwriting the
> > int3 IRET frame.
>
> ARGH!..
>
> can't we 'fix' that again? The alternative is moving that IRET-frame and
> fixing everything up, which is going to be fragile, ugly and such
> things more.
>
> Commit d8ba61ba58c8 ("x86/entry/64: Don't use IST entry for #BP stack")
> doesn't list any strong reasons for why it should NOT be an IST.

This seems to work...

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce25d84023c0..184523447d35 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt
* @paranoid == 2 is special: the stub will never switch stacks. This is for
* #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
*/
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
ENTRY(\sym)
UNWIND_HINT_IRET_REGS offset=\has_error_code*8

@@ -891,6 +891,12 @@ ENTRY(\sym)
pushq $-1 /* ORIG_RAX: no syscall to restart */
.endif

+ .if \create_gap == 1
+ .rept 6
+ pushq 5*8(%rsp)
+ .endr
+ .endif
+
.if \paranoid == 1
testb $3, CS-ORIG_RAX(%rsp) /* If coming from userspace, switch stacks */
jnz .Lfrom_usermode_switch_stack_\@
@@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
#endif /* CONFIG_HYPERV */

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3 do_int3 has_error_code=0
+idtentry int3 do_int3 has_error_code=0 create_gap=1
idtentry stack_segment do_stack_segment has_error_code=1

#ifdef CONFIG_XEN_PV
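
For the record, the layout this creates (a sketch, top of stack on the
left; int3 has no error code, so ORIG_RAX was just pushed):

before:  ORIG_RAX RIP CS RFLAGS RSP SS | <interrupted stack>
after:   ORIG_RAX RIP CS RFLAGS RSP SS | ORIG_RAX RIP CS RFLAGS RSP SS | <interrupted stack>
         (copy, used as pt_regs)         (original frame, now a gap)

Each "pushq 5*8(%rsp)" copies the slot five entries above the current
top of stack, so the six pushes duplicate the whole six-word frame one
frame lower.  pt_regs then gets built on the copy, and the original 48
bytes become scratch space: when the #BP handler does regs->sp -= 8 and
stores the return address, the store lands in that gap instead of on the
saved SS of the IRET frame that will actually be used.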

2018-11-29 16:52:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote:
> > can't we 'fix' that again? The alternative is moving that IRET-frame and
> > fixing everything up, which is going to be fragile, ugly and such
> > things more.

> This seems to work...

That's almost too easy... nice!

> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index ce25d84023c0..184523447d35 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt
> * @paranoid == 2 is special: the stub will never switch stacks. This is for
> * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
> */
> -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
> +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
> ENTRY(\sym)
> UNWIND_HINT_IRET_REGS offset=\has_error_code*8
>
> @@ -891,6 +891,12 @@ ENTRY(\sym)
> pushq $-1 /* ORIG_RAX: no syscall to restart */
> .endif
>
> + .if \create_gap == 1
> + .rept 6
> + pushq 5*8(%rsp)
> + .endr
> + .endif
> +
> .if \paranoid == 1
> testb $3, CS-ORIG_RAX(%rsp) /* If coming from userspace, switch stacks */
> jnz .Lfrom_usermode_switch_stack_\@
> @@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
> #endif /* CONFIG_HYPERV */
>
> idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
> -idtentry int3 do_int3 has_error_code=0
> +idtentry int3 do_int3 has_error_code=0 create_gap=1
> idtentry stack_segment do_stack_segment has_error_code=1
>
> #ifdef CONFIG_XEN_PV

2018-11-29 16:52:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 8:33 AM Josh Poimboeuf <[email protected]> wrote:
>
> This seems to work...
>
> + .if \create_gap == 1
> + .rept 6
> + pushq 5*8(%rsp)
> + .endr
> + .endif
> +
> -idtentry int3 do_int3 has_error_code=0
> +idtentry int3 do_int3 has_error_code=0 create_gap=1

Ugh. Doesn't this entirely screw up the stack layout, which then
screws up task_pt_regs(), which then breaks ptrace and friends?

... and you'd only notice it for users that use int3 in user space,
which now writes random locations on the kernel stack, which is then a
huge honking security hole.

It's possible that I'm confused, but let's not play random games with
the stack like this. The entry code is sacred, in scary ways.

So no. Do *not* try to change %rsp on the stack in the bp handler.
Instead, I'd suggest:

- just restart the instruction (with the suggested "ptregs->rip --")

- to avoid any "oh, we're not making progress" issues, just fix the
instruction yourself to be the right call, by looking it up in the
"what needs to be fixed" tables.

No?

Linus

2018-11-29 17:11:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 09:02:23AM -0800, Andy Lutomirski wrote:
> > On Nov 29, 2018, at 8:50 AM, Linus Torvalds <[email protected]> wrote:

> > So no. Do *not* try to change %rsp on the stack in the bp handler.
> > Instead, I'd suggest:
> >
> > - just restart the instruction (with the suggested "ptregs->rip --")
> >
> > - to avoid any "oh, we're not making progress" issues, just fix the
> > instruction yourself to be the right call, by looking it up in the
> > "what needs to be fixed" tables.
> >
> > No?

> Or do you think we can avoid the IPI while the int3 is there?

I'm thinking Linus is suggesting the #BP handler does the text write too
(as a competing store) and then sync_core() and restarts.

But I think that is broken, because then there is no telling what the
other CPUs will observe.

2018-11-29 17:14:43

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
>
>
> > On Nov 29, 2018, at 8:49 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote:
> >>> can't we 'fix' that again? The alternative is moving that IRET-frame and
> >>> fixing everything up, which is going to be fragile, ugly and such
> >>> things more.
> >
> >> This seems to work...
> >
> > That's almost too easy... nice!
>
> It is indeed too easy: you’re putting pt_regs in the wrong place for
> int3 from user mode, which is probably a root hole if you arrange for
> a ptraced process to do int3 and try to write to whatever register
> aliases CS.
>
> If you make it conditional on CPL, do it for 32-bit as well, add
> comments convince yourself that there isn’t a better solution

I could do that - but why subject 32-bit to it? I was going to make it
conditional on CONFIG_HAVE_STATIC_CALL_INLINE which is 64-bit only.

> (like pointing IP at a stub that retpolines to the target by reading
> the function pointer, a la the unoptimizable version), then okay, I
> guess, with only a small amount of grumbling.

I tried that in v2, but Peter pointed out it's racy:

https://lkml.kernel.org/r/[email protected]

--
Josh

2018-11-29 17:15:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 09:02:23 -0800
Andy Lutomirski <[email protected]> wrote:


> > Instead, I'd suggest:
> >
> > - just restart the instruction (with the suggested "ptregs->rip --")
> >
> > - to avoid any "oh, we're not making progress" issues, just fix the
> > instruction yourself to be the right call, by looking it up in the
> > "what needs to be fixed" tables.
> >
> > No?
>
> I thought that too. I think it deadlocks. CPU A does
> text_poke_bp(). CPU B is waiting for a spinlock with IRQs off. CPU
> C holds the spinlock and hits the int3. The int3 never goes away
> because CPU A is waiting for CPU B to handle the sync_core IPI.

I agree that this can happen.

>
> Or do you think we can avoid the IPI while the int3 is there?

No, we really do need to sync after we change the second part of the
command with the int3 on it. Unless there's another way to guarantee
that the full instruction gets seen when we replace the int3 with the
finished command.

To refresh everyone's memory for why we have an IPI (as IPIs have an
implicit memory barrier for the CPU).

We start with:

e8 01 02 03 04

and we want to convert it to:

e8 ab cd ef 01

And let's say the instruction crosses a cache line that breaks it into
e8 01 and 02 03 04.

We add the breakpoint:

cc 01 02 03 04

We do a sync (so now everyone should see the break point), because if
we updated the second part right away, another CPU that has not yet
seen the break point could combine the old first part with the new
second part and see:

e8 01 cd ef 01

Which would not be good.

And we need another sync after we change the code so all CPUs see

cc ab cd ef 01

Because when we remove the break point, we don't want other CPUs to see

e8 ab 02 03 04

Which would also be bad.
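
In code form, that's roughly the shape of the update sequence (a sketch
only, not the exact text_poke_bp() internals; do_sync_core() stands in
for whatever routine the IPI runs on each CPU):

static void poke_call_site(void *addr, const void *new_insn, size_t len)
{
	unsigned char int3 = 0xcc;

	text_poke(addr, &int3, 1);			/* cc 01 02 03 04 */
	on_each_cpu(do_sync_core, NULL, 1);		/* everyone sees the int3 */

	text_poke(addr + 1, new_insn + 1, len - 1);	/* cc ab cd ef 01 */
	on_each_cpu(do_sync_core, NULL, 1);		/* everyone sees the new tail */

	text_poke(addr, new_insn, 1);			/* e8 ab cd ef 01 */
	on_each_cpu(do_sync_core, NULL, 1);		/* breakpoint gone everywhere */
}

The instruction is thus only ever observed as the old bytes, the
breakpoint, or the new bytes -- never a mix of old and new.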

-- Steve

2018-11-29 17:18:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:

> If you make it conditional on CPL, do it for 32-bit as well, add
> comments,

> and convince yourself that there isn’t a better solution
> (like pointing IP at a stub that retpolines to the target by reading
> the function pointer, a la the unoptimizable version), then okay, I
> guess, with only a small amount of grumbling.

Right; so we _could_ grow the trampoline with a retpoline indirect call
and ret. It just makes the trampoline a whole lot bigger, but it could
work.

2018-11-29 17:21:48

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 18:15:39 +0100
Peter Zijlstra <[email protected]> wrote:

> On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
>
> > If you make it conditional on CPL, do it for 32-bit as well, add
> > comments,
>
> > and convince yourself that there isn’t a better solution
> > (like pointing IP at a stub that retpolines to the target by reading
> > the function pointer, a la the unoptimizable version), then okay, I
> > guess, with only a small amount of grumbling.
>
> Right; so we _could_ grow the trampoline with a retpoline indirect call
> and ret. It just makes the trampoline a whole lot bigger, but it could
> work.

Can't we make use of the callee clobbered registers? I mean, we know
that call is being made when the int3 is triggered. Then we can save
the return address in one register, and the jump location in another,
and then just call a trampoline that does:

r8 = return address
r9 = function to call

push r8
jmp *r9

Then have the regs->ip point to that trampoline.

-- Steve

2018-11-29 17:32:40

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64



> On Nov 29, 2018, at 9:07 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Nov 29, 2018 at 09:02:23AM -0800, Andy Lutomirski wrote:
>>> On Nov 29, 2018, at 8:50 AM, Linus Torvalds <[email protected]> wrote:
>
>>> So no. Do *not* try to change %rsp on the stack in the bp handler.
>>> Instead, I'd suggest:
>>>
>>> - just restart the instruction (with the suggested "ptregs->rip --")
>>>
>>> - to avoid any "oh, we're not making progress" issues, just fix the
>>> instruction yourself to be the right call, by looking it up in the
>>> "what needs to be fixed" tables.
>>>
>>> No?
>
>> Or do you think we can avoid the IPI while the int3 is there?
>
> I'm thinking Linus is suggesting the #BP handler does the text write too
> (as a competing store) and then sync_core() and restarts.
>
> But I think that is broken, because then there is no telling what the
> other CPUs will observe.

Does anyone know what the actual hardware semantics are? The SDM is not particularly informative unless I looked at the wrong section.

2018-11-29 17:36:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 9:13 AM Steven Rostedt <[email protected]> wrote:
>
> No, we really do need to sync after we change the second part of the
> command with the int3 on it. Unless there's another way to guarantee
> that the full instruction gets seen when we replace the int3 with the
> finished command.

Making sure the call instruction is aligned with the I$ fetch boundary
should do that.

It's not in the SDM, but neither was our current behavior - we
were/are just relying on "it will work".

Linus

2018-11-29 17:37:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64



> On Nov 29, 2018, at 9:29 AM, Linus Torvalds <[email protected]> wrote:
>
> On Thu, Nov 29, 2018 at 9:02 AM Andy Lutomirski <[email protected]> wrote:
>>>
>>> - just restart the instruction (with the suggested "ptregs->rip --")
>>>
>>> - to avoid any "oh, we're not making progress" issues, just fix the
>>> instruction yourself to be the right call, by looking it up in the
>>> "what needs to be fixed" tables.
>>
>> I thought that too. I think it deadlocks. CPU A does text_poke_bp(). CPU B is waiting for a spinlock with IRQs off. CPU C holds the spinlock and hits the int3. The int3 never goes away because CPU A is waiting for CPU B to handle the sync_core IPI.
>>
>> Or do you think we can avoid the IPI while the int3 is there?
>
> I'm handwaving and thinking that CPU C that hits the int3 can just fix
> up the instruction directly in its own caches, and return.
>
> Yes, it does what the "text_poke" *will* do (so now the instruction
> gets rewritten _twice_), but who cares? It's idempotent.
>
>

But it’s out of order. I’m not concerned about the final IPI — I’m concerned about the IPI after the int3 write and before the int3 is removed again. If one CPU replaces 0xcc with 0xe8, another CPU could observe that before the last couple bytes of the call target are written and observed by all CPUs.

2018-11-29 17:46:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 09:35:11 -0800
Linus Torvalds <[email protected]> wrote:

> On Thu, Nov 29, 2018 at 9:13 AM Steven Rostedt <[email protected]> wrote:
> >
> > No, we really do need to sync after we change the second part of the
> > command with the int3 on it. Unless there's another way to guarantee
> > that the full instruction gets seen when we replace the int3 with the
> > finished command.
>
> Making sure the call instruction is aligned with the I$ fetch boundary
> should do that.
>
> It's not in the SDM, but neither was our current behavior - we
> were/are just relying on "it will work".
>

Well, the current method (as Jiri mentioned) did get the OK from at
least Intel (and that was with a lot of arm twisting to do so).

-- Steve

2018-11-29 17:51:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 09:41:33 -0800
Andy Lutomirski <[email protected]> wrote:

> > On Nov 29, 2018, at 9:21 AM, Steven Rostedt <[email protected]> wrote:
> >
> > On Thu, 29 Nov 2018 12:20:00 -0500
> > Steven Rostedt <[email protected]> wrote:
> >
> >
> >> r8 = return address
> >> r9 = function to call
> >>
> >
> > Bad example, r8 and r9 are args, but r10 and r11 are available.
> >
> > -- Steve
> >
> >> push r8
> >> jmp *r9
> >>
> >> Then have the regs->ip point to that trampoline.
>
> Cute. That’ll need ORC annotations and some kind of retpoline to replace the indirect jump, though.
>

Do we really need to worry about retpoline here?

I'm not fully up on all the current vulnerabilities, but can this
really be taken advantage of when it only happens in the transition of
changing a static call with the small chance of one of those calls
triggering the break point?

If someone can take advantage of that, I almost think they deserve
cracking my box ;-)

-- Steve

2018-11-29 17:53:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt <[email protected]> wrote:
>
> Well, the current method (as Jiri mentioned) did get the OK from at
> least Intel (and that was with a lot of arm twisting to do so).

Guys, when the comparison is to:

- create a huge honking security hole by screwing up the stack frame

or

- corrupt random registers because we "know" they aren't in use

then it really sounds pretty safe to just say "ok, just make it
aligned and update the instruction with an atomic cmpxchg or
something".

Of course, another option is to just say "we don't do the inline case,
then", and only ever do a call to a stub that does a "jmp"
instruction.

Problem solved, at the cost of some I$. Emulating a "jmp" is trivial,
in ways emulating a "call" is not.

Linus

2018-11-29 17:55:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64


> On Nov 29, 2018, at 9:45 AM, Josh Poimboeuf <[email protected]> wrote:
>
>> On Thu, Nov 29, 2018 at 09:41:33AM -0800, Andy Lutomirski wrote:
>>
>>> On Nov 29, 2018, at 9:21 AM, Steven Rostedt <[email protected]> wrote:
>>>
>>> On Thu, 29 Nov 2018 12:20:00 -0500
>>> Steven Rostedt <[email protected]> wrote:
>>>
>>>
>>>> r8 = return address
>>>> r9 = function to call
>>>>
>>>
>>> Bad example, r8 and r9 are args, but r10 and r11 are available.
>>>
>>> -- Steve
>>>
>>>> push r8
>>>> jmp *r9
>>>>
>>>> Then have the regs->ip point to that trampoline.
>>
>> Cute. That’ll need ORC annotations and some kind of retpoline to replace the indirect jump, though.
>
> I'm going with this idea, but the BP is so rare that I really don't see
> why a retpoline would be needed.
>

Without the retpoline in place, you are vulnerable to security researchers causing you a personal denial of service by finding a way to cause the BP to get hit, mistraining the branch predictor, and writing a paper about it :)

2018-11-29 17:55:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 9:50 AM Linus Torvalds
<[email protected]> wrote:
>
> - corrupt random registers because we "know" they aren't in use

Just to clarify: I think that's a completely unacceptable model.

We already have lots of special calling conventions, including ones
that do not have any call-clobbered registers at all, because we have
special magic calls in inline asm.

Some of those might be prime material for doing static calls (ie PV-op
stuff, where the native model does *not* change any registers).

So no. Don't do ugly hacks like that.

Linus

2018-11-29 18:00:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 09:50:28 -0800
Linus Torvalds <[email protected]> wrote:

> On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt <[email protected]> wrote:
> >
> > Well, the current method (as Jiri mentioned) did get the OK from at
> > least Intel (and that was with a lot of arm twisting to do so).
>
> Guys, when the comparison is to:
>
> - create a huge honking security hole by screwing up the stack frame
>
> or
>
> - corrupt random registers because we "know" they aren't in use
>
> then it really sounds pretty safe to just say "ok, just make it
> aligned and update the instruction with an atomic cmpxchg or
> something".

Do you realize that the cmpxchg used by the first attempts of the
dynamic modification of code by ftrace was the source of the e1000e
NVRAM corruption bug.

It's because it happened to do it to IO write only memory, and a
cmpxchg will *always* write, even if it didn't match. It will just
write out what it read.

In the case of the e1000e bug, it read 0xffffffff and that's what it
wrote back out.

So no, I don't think that's a better solution.

-- Steve


>
> Of course, another option is to just say "we don't do the inline case,
> then", and only ever do a call to a stub that does a "jmp"
> instruction.
>
> Problem solved, at the cost of some I$. Emulating a "jmp" is trivial,
> in ways emulating a "call" is not.
>
> Linus


2018-11-29 18:36:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 08:50:16 -0800
Linus Torvalds <[email protected]> wrote:

> Instead, I'd suggest:
>
> - just restart the instruction (with the suggested "ptregs->rip --")
>
> - to avoid any "oh, we're not making progress" issues, just fix the
> instruction yourself to be the right call, by looking it up in the
> "what needs to be fixed" tables.

So basically this will cause the code to go into a spin while we are
doing the update, right?

-- Steve

2018-11-29 18:36:54

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64



> On Nov 29, 2018, at 8:49 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Nov 29, 2018 at 10:33:42AM -0600, Josh Poimboeuf wrote:
>>> can't we 'fix' that again? The alternative is moving that IRET-frame and
>>> fixing everything up, which is going to be fragile, ugly and such
>>> things more.
>
>> This seems to work...
>
> That's almost too easy... nice!

It is indeed too easy: you’re putting pt_regs in the wrong place for int3 from user mode, which is probably a root hole if you arrange for a ptraced process to do int3 and try to write to whatever register aliases CS.

If you make it conditional on CPL, do it for 32-bit as well, add comments, and convince yourself that there isn’t a better solution (like pointing IP at a stub that retpolines to the target by reading the function pointer, a la the unoptimizable version), then okay, I guess, with only a small amount of grumbling.

>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index ce25d84023c0..184523447d35 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt
>> * @paranoid == 2 is special: the stub will never switch stacks. This is for
>> * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
>> */
>> -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
>> +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
>> ENTRY(\sym)
>> UNWIND_HINT_IRET_REGS offset=\has_error_code*8
>>
>> @@ -891,6 +891,12 @@ ENTRY(\sym)
>> pushq $-1 /* ORIG_RAX: no syscall to restart */
>> .endif
>>
>> + .if \create_gap == 1
>> + .rept 6
>> + pushq 5*8(%rsp)
>> + .endr
>> + .endif
>> +
>> .if \paranoid == 1
>> testb $3, CS-ORIG_RAX(%rsp) /* If coming from userspace, switch stacks */
>> jnz .Lfrom_usermode_switch_stack_\@
>> @@ -1126,7 +1132,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
>> #endif /* CONFIG_HYPERV */
>>
>> idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
>> -idtentry int3 do_int3 has_error_code=0
>> +idtentry int3 do_int3 has_error_code=0 create_gap=1
>> idtentry stack_segment do_stack_segment has_error_code=1
>>
>> #ifdef CONFIG_XEN_PV

2018-11-29 18:38:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64



> On Nov 29, 2018, at 8:50 AM, Linus Torvalds <[email protected]> wrote:
>
>> On Thu, Nov 29, 2018 at 8:33 AM Josh Poimboeuf <[email protected]> wrote:
>>
>> This seems to work...
>>
>> + .if \create_gap == 1
>> + .rept 6
>> + pushq 5*8(%rsp)
>> + .endr
>> + .endif
>> +
>> -idtentry int3 do_int3 has_error_code=0
>> +idtentry int3 do_int3 has_error_code=0 create_gap=1
>
> Ugh. Doesn't this entirely screw up the stack layout, which then
> screws up task_pt_regs(), which then breaks ptrace and friends?
>
> ... and you'd only notice it for users that use int3 in user space,
> which now writes random locations on the kernel stack, which is then a
> huge honking security hole.
>
> It's possible that I'm confused, but let's not play random games with
> the stack like this. The entry code is sacred, in scary ways.
>
> So no. Do *not* try to change %rsp on the stack in the bp handler.
> Instead, I'd suggest:
>
> - just restart the instruction (with the suggested "ptregs->rip --")
>
> - to avoid any "oh, we're not making progress" issues, just fix the
> instruction yourself to be the right call, by looking it up in the
> "what needs to be fixed" tables.
>
> No?

I thought that too. I think it deadlocks. CPU A does text_poke_bp(). CPU B is waiting for a spinlock with IRQs off. CPU C holds the spinlock and hits the int3. The int3 never goes away because CPU A is waiting for CPU B to handle the sync_core IPI.

Or do you think we can avoid the IPI while the int3 is there?

2018-11-29 18:39:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 12:20:00 -0500
Steven Rostedt <[email protected]> wrote:


> r8 = return address
> r9 = function to call
>

Bad example, r8 and r9 are args, but r10 and r11 are available.

-- Steve

> push r8
> jmp *r9
>
> Then have the regs->ip point to that trampoline.
>
> -- Steve


2018-11-29 18:40:32

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64


> On Nov 29, 2018, at 9:21 AM, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 29 Nov 2018 12:20:00 -0500
> Steven Rostedt <[email protected]> wrote:
>
>
>> r8 = return address
>> r9 = function to call
>>
>
> Bad example, r8 and r9 are args, but r10 and r11 are available.
>
> -- Steve
>
>> push r8
>> jmp *r9
>>
>> Then have the regs->ip point to that trampoline.

Cute. That’ll need ORC annotations and some kind of retpoline to replace the indirect jump, though.

>>
>> -- Steve
>

2018-11-29 18:41:45

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 09:41:33AM -0800, Andy Lutomirski wrote:
>
> > On Nov 29, 2018, at 9:21 AM, Steven Rostedt <[email protected]> wrote:
> >
> > On Thu, 29 Nov 2018 12:20:00 -0500
> > Steven Rostedt <[email protected]> wrote:
> >
> >
> >> r8 = return address
> >> r9 = function to call
> >>
> >
> > Bad example, r8 and r9 are args, but r10 and r11 are available.
> >
> > -- Steve
> >
> >> push r8
> >> jmp *r9
> >>
> >> Then have the regs->ip point to that trampoline.
>
> Cute. That’ll need ORC annotations and some kind of retpoline to replace the indirect jump, though.

I'm going with this idea, but the BP is so rare that I really don't see
why a retpoline would be needed.
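
Roughly, the trampoline would be something like (a sketch with a made-up
symbol name; ORC annotations and any retpoline, as discussed above, are
left out):

asm(".pushsection .text\n"
    ".globl static_call_bp_tramp\n"
    "static_call_bp_tramp:\n"
    "pushq %r10\n"		/* emulate CALL: push the return address */
    "jmp *%r11\n"		/* jump to the new destination */
    ".popsection\n");

with the #BP handler loading the return address into %r10, the new
destination into %r11, and pointing regs->ip at the trampoline.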

--
Josh

2018-11-29 18:41:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64


> On Nov 29, 2018, at 9:50 AM, Linus Torvalds <[email protected]> wrote:
>
>> On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt <[email protected]> wrote:
>>
>> Well, the current method (as Jiri mentioned) did get the OK from at
>> least Intel (and that was with a lot of arm twisting to do so).
>
> Guys, when the comparison is to:
>
> - create a huge honking security hole by screwing up the stack frame
>
> or
>
> - corrupt random registers because we "know" they aren't in use

For C calls, we do indeed know that. But I guess there could be asm calls.

>
> then it really sounds pretty safe to just say "ok, just make it
> aligned and update the instruction with an atomic cmpxchg or
> something".

And how do we do that? With a gcc plugin and some asm magic?

>
> Of course, another option is to just say "we don't do the inline case,
> then", and only ever do a call to a stub that does a "jmp"
> instruction.

That’s not a terrible idea.

>
> Problem solved, at the cost of some I$. Emulating a "jmp" is trivial,
> in ways emulating a "call" is not.
>
>



2018-11-29 18:44:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 9:59 AM Steven Rostedt <[email protected]> wrote:
>
> Do you realize that the cmpxchg used by the first attempts of the
> dynamic modification of code by ftrace was the source of the e1000e
> NVRAM corruption bug.

If you have a static call in IO memory, you have bigger problems than that.

What's your point?

Again - I will point out that the things you guys have tried to come
up with have been *WORSE*. Much worse.

Linus

2018-11-29 18:47:48

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 06:15:39PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
>
> > If you make it conditional on CPL, do it for 32-bit as well, add
> > comments,
>
> > and convince yourself that there isn’t a better solution
> > (like pointing IP at a stub that retpolines to the target by reading
> > the function pointer, a la the unoptimizable version), then okay, I
> > guess, with only a small amount of grumbling.
>
> Right; so we _could_ grow the trampoline with a retpoline indirect call
> and ret. It just makes the trampoline a whole lot bigger, but it could
> work.

I'm trying to envision how this would work. How would the function (or
stub) know how to return to the call site?

--
Josh

2018-11-29 18:49:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 10:23:44 -0800
Linus Torvalds <[email protected]> wrote:

> On Thu, Nov 29, 2018 at 9:59 AM Steven Rostedt <[email protected]> wrote:
> >
> > Do you realize that the cmpxchg used by the first attempts of the
> > dynamic modification of code by ftrace was the source of the e1000e
> > NVRAM corruption bug.
>
> If you have a static call in IO memory, you have bigger problems than that.
>
> What's your point?

Just that cmpxchg on dynamic modified code brings back bad memories ;-)

>
> Again - I will point out that the things you guys have tried to come
> up with have been *WORSE*. Much worse.

Note, we do have a bit of control at what is getting called. The patch
set requires that the callers are wrapped in macros. We should not
allow just any random callers (like from asm).

This isn't about modifying any function call. This is for a specific
subset, that we can impose rules on.

-- Steve


2018-11-29 18:50:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 10:00 AM Andy Lutomirski <[email protected]> wrote:
> > then it really sounds pretty safe to just say "ok, just make it
> > aligned and update the instruction with an atomic cmpxchg or
> > something".
>
> And how do we do that? With a gcc plugin and some asm magic?

Asm magic.

You already have to mark the call sites with

static_call(fn, arg1, arg2, ...);

and it right now just magically depends on gcc outputting the right
code to call the trampoline. But it could do it as a jmp
instruction (tail-call), and maybe that works right, maybe it doesn't.
And maybe some gcc switch makes it output it as a indirect call due to
instrumentation or something. Doing it with asm magic would, I feel,
be safer anyway, so that we'd know *exactly* how that call gets done.

For example, if gcc does it as a jmp due to a tail-call, the
compiler/linker could in theory turn the jump into a short jump if it
sees that the trampoline is close enough. Does that happen? Probably
not. But I don't see why it *couldn't* happen in the current patch
series. The trampoline is just a regular function, even if it has been
defined by global asm.

Putting the trampoline in a different code section could fix things
like that (maybe there was a patch that did that and I missed it?) but
I do think that doing the call with an asm would *also* fix it.

But the "just always use a trampoline" is certainly the simpler model.

Linus

2018-11-29 18:56:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 10:00:48 -0800
Andy Lutomirski <[email protected]> wrote:

> >
> > Of course, another option is to just say "we don't do the inline case,
> > then", and only ever do a call to a stub that does a "jmp"
> > instruction.
>
> That’s not a terrible idea.

It was the implementation of my first proof of concept that kicked off
this entire idea, where others (Peter and Josh) thought it was better
to modify the calls themselves. It does improve things.

Just a reminder of the benchmarks of enabling all tracepoints (which
use indirect jumps) and running hackbench:

No RETPOLINES:
1.4503 +- 0.0148 seconds time elapsed ( +- 1.02% )

baseline RETPOLINES:
1.5120 +- 0.0133 seconds time elapsed ( +- 0.88% )

Added direct calls for trace_events:
1.5239 +- 0.0139 seconds time elapsed ( +- 0.91% )

With static calls:
1.5282 +- 0.0135 seconds time elapsed ( +- 0.88% )

With static call trampolines:
1.48328 +- 0.00515 seconds time elapsed ( +- 0.35% )

Full static calls:
1.47364 +- 0.00706 seconds time elapsed ( +- 0.48% )


Adding Retpolines caused a 1.5120 / 1.4503 = 1.0425 ( 4.25% ) slowdown

Trampolines made it into 1.48328 / 1.4503 = 1.0227 ( 2.27% ) slowdown

The above is the stub with the jmp case.

With full static calls 1.47364 / 1.4503 = 1.0160 ( 1.6% ) slowdown

Modifying the calls themselves does give an improvement (and the
improvement was much greater when I had debugging enabled).

Perhaps it's not worth the effort, but again, we do have control of
what uses this. It's not a total free-for-all.

Full results here:

http://lkml.kernel.org/r/[email protected]

Although since lore.kernel.org seems to be having issues:

https://marc.info/?l=linux-kernel&m=154326714710686


-- Steve

2018-11-29 19:01:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt <[email protected]> wrote:
>
> Note, we do have a bit of control at what is getting called. The patch
> set requires that the callers are wrapped in macros. We should not
> allow just any random callers (like from asm).

Actually, I'd argue that asm is often more controlled than C code.

Right now you can do odd things if you really want to, and have the
compiler generate indirect calls to those wrapper functions.

For example, I can easily imagine a pre-retpoline compiler turning

if (cond)
fn1(a,b)
else
fn2(a,b);

into a function pointer conditional

(cond ? fn1 : fn2)(a,b);

and honestly, the way "static_call()" works now, can you guarantee
that the call-site doesn't end up doing that, and calling the
trampoline function for two different static calls from one indirect
call?

See what I'm talking about? Saying "callers are wrapped in macros"
doesn't actually protect you from the compiler doing things like that.

In contrast, if the call was wrapped in an inline asm, we'd *know* the
compiler couldn't turn a "call wrapper(%rip)" into anything else.
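
For illustration (a sketch only -- the trampoline name is made up and
argument passing is ignored, which is exactly the hard part raised
further down in the thread), an asm-wrapped site would look like:

extern void my_tramp(void);	/* hypothetical out-of-line trampoline */

static inline void static_call_sketch(void)
{
	/* a direct rel32 call the compiler cannot tail-call, merge or
	   turn into an indirect call; clobbers match a normal C call */
	asm volatile("call my_tramp"
		     : : : "rax", "rcx", "rdx", "rsi", "rdi",
			   "r8", "r9", "r10", "r11", "memory", "cc");
}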

Linus

2018-11-29 19:12:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 9:02 AM Andy Lutomirski <[email protected]> wrote:
> >
> > - just restart the instruction (with the suggested "ptregs->rip --")
> >
> > - to avoid any "oh, we're not making progress" issues, just fix the
> > instruction yourself to be the right call, by looking it up in the
> > "what needs to be fixed" tables.
>
> I thought that too. I think it deadlocks. CPU A does text_poke_bp(). CPU B is waiting for a spinlock with IRQs off. CPU C holds the spinlock and hits the int3. The int3 never goes away because CPU A is waiting for CPU B to handle the sync_core IPI.
>
> Or do you think we can avoid the IPI while the int3 is there?

I'm handwaving and thinking that CPU C that hits the int3 can just fix
up the instruction directly in its own caches, and return.

Yes, it does what the "text_poke" *will* do (so now the instruction
gets rewritten _twice_), but who cares? It's idempotent.

And no, I don't have code, just "maybe some handwaving like this"

Linus

2018-11-29 19:13:53

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018, Andy Lutomirski wrote:

> Does anyone know what the actual hardware semantics are? The SDM is not
> particularly informative unless I looked at the wrong section.

I don't think SDM answers all the questions there, unfortunately.

I vaguely remember that back then when I was preparing the original
text_poke_bp() implementation, hpa had to provide some answers directly
from inner depths of Intel ... see fd4363fff3 ("x86: Introduce int3
(breakpoint)-based instruction patching") for reference.

--
Jiri Kosina
SUSE Labs


2018-11-29 19:14:53

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 11:08:26 -0800
Linus Torvalds <[email protected]> wrote:

> On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > compiler couldn't turn a "call wrapper(%rip)" into anything else.
>
> Actually, I think I have a better model - if the caller is done with inline asm.
>
> What you can do then is basically add a single-byte prefix to the
> "call" instruction that does nothing (say, cs override), and then
> replace *that* with a 'int3' instruction.
>
> Boom. Done.
>
> Now, the "int3" handler can just update the instruction in-place, but
> leave the "int3" in place, and then return to the next instruction
> byte (which is just the normal branch instruction without the prefix
> byte).
>
> The cross-CPU case continues to work, because the 'int3' remains in
> place until after the IPI.
>
> But that would require that we'd mark those call instruction with
>

In my original proof of concept, I tried to implement the callers
with asm, but then the way to handle parameters became a nightmare.

The goal of this (for me) was to replace the tracepoint indirect calls
with static calls, and tracepoints can have any number of parameters to
pass. I ended up needing the compiler to help me with the passing of
parameters.

-- Steve

2018-11-29 19:15:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
<[email protected]> wrote:
>
> What you can do then is basically add a single-byte prefix to the
> "call" instruction that does nothing (say, cs override), and then
> replace *that* with a 'int3' instruction.

Hmm. the segment prefixes are documented as being "reserved" for
branch instructions. I *think* that means just conditional branches
(Intel at one point used the prefixes for static prediction
information), not "call", but who knows..

It might be better to use an empty REX prefix on x86-64 or something like that.

Linus

2018-11-29 19:15:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
<[email protected]> wrote:
>
> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> compiler couldn't turn a "call wrapper(%rip)" into anything else.

Actually, I think I have a better model - if the caller is done with inline asm.

What you can do then is basically add a single-byte prefix to the
"call" instruction that does nothing (say, cs override), and then
replace *that* with a 'int3' instruction.

Boom. Done.

Now, the "int3" handler can just update the instruction in-place, but
leave the "int3" in place, and then return to the next instruction
byte (which is just the normal branch instruction without the prefix
byte).

The cross-CPU case continues to work, because the 'int3' remains in
place until after the IPI.
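
In byte terms the scheme is roughly (a sketch; which do-nothing prefix
to use -- a segment override or an empty REX prefix -- is discussed in
the follow-up):

	<prefix> e8 <old rel32>		prefixed 6-byte direct call
	cc       e8 <old rel32>		prefix byte replaced by int3
	cc       e8 <new rel32>		handler rewrites the target in place,
					then returns to the e8 byte
	<prefix> e8 <new rel32>		int3 replaced again after the IPI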

But that would require that we'd mark those call instruction with

Linus

2018-11-29 19:17:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 10:58:40 -0800
Linus Torvalds <[email protected]> wrote:

> On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt <[email protected]> wrote:
> >
> > Note, we do have a bit of control at what is getting called. The patch
> > set requires that the callers are wrapped in macros. We should not
> > allow just any random callers (like from asm).
>
> Actually, I'd argue that asm is often more controlled than C code.
>
> Right now you can do odd things if you really want to, and have the
> compiler generate indirect calls to those wrapper functions.
>
> For example, I can easily imagine a pre-retpoline compiler turning
>
> if (cond)
> fn1(a,b)
> else
> fn2(a,b);
>
> into a function pointer conditional
>
> (cond ? fn1 : fn2)(a,b);

If we are worried about such a construct, wouldn't a compiler barrier
before and after the static_call solve that?

barrier();
static_call(func...);
barrier();

It should stop tail calls too.

>
> and honestly, the way "static_call()" works now, can you guarantee
> that the call-site doesn't end up doing that, and calling the
> trampoline function for two different static calls from one indirect
> call?
>
> See what I'm talking about? Saying "callers are wrapped in macros"
> doesn't actually protect you from the compiler doing things like that.
>
> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> compiler couldn't turn a "call wrapper(%rip)" into anything else.

But then we need to implement all numbers of parameters.

-- Steve

2018-11-29 19:23:10

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
> > and honestly, the way "static_call()" works now, can you guarantee
> > that the call-site doesn't end up doing that, and calling the
> > trampoline function for two different static calls from one indirect
> > call?
> >
> > See what I'm talking about? Saying "callers are wrapped in macros"
> > doesn't actually protect you from the compiler doing things like that.
> >
> > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > compiler couldn't turn a "call wrapper(%rip)" into anything else.
>
> But then we need to implement all numbers of parameters.

I actually have an old unfinished patch which (ab)used C macros to
detect the number of parameters and then set up the asm constraints
accordingly. At the time, the goal was to optimize the BUG code.

I had wanted to avoid this kind of approach for static calls, because
"ugh", but now it's starting to look much more appealing.

Behold:

diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index aa6b2023d8f8..d63e9240da77 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -32,10 +32,59 @@

#ifdef CONFIG_DEBUG_BUGVERBOSE

-#define _BUG_FLAGS(ins, flags) \
+#define __BUG_ARGS_0(ins, ...) \
+({\
+ asm volatile("1:\t" ins "\n"); \
+})
+#define __BUG_ARGS_1(ins, ...) \
+({\
+ asm volatile("1:\t" ins "\n" \
+ : : "D" (ARG1(__VA_ARGS__))); \
+})
+#define __BUG_ARGS_2(ins, ...) \
+({\
+ asm volatile("1:\t" ins "\n" \
+ : : "D" (ARG1(__VA_ARGS__)), \
+ "S" (ARG2(__VA_ARGS__))); \
+})
+#define __BUG_ARGS_3(ins, ...) \
+({\
+ asm volatile("1:\t" ins "\n" \
+ : : "D" (ARG1(__VA_ARGS__)), \
+ "S" (ARG2(__VA_ARGS__)), \
+ "d" (ARG3(__VA_ARGS__))); \
+})
+#define __BUG_ARGS_4(ins, ...) \
+({\
+ asm volatile("1:\t" ins "\n" \
+ : : "D" (ARG1(__VA_ARGS__)), \
+ "S" (ARG2(__VA_ARGS__)), \
+ "d" (ARG3(__VA_ARGS__)), \
+ "c" (ARG4(__VA_ARGS__))); \
+})
+#define __BUG_ARGS_5(ins, ...) \
+({\
+ register u64 __r8 asm("r8") = (u64)ARG5(__VA_ARGS__); \
+ asm volatile("1:\t" ins "\n" \
+ : : "D" (ARG1(__VA_ARGS__)), \
+ "S" (ARG2(__VA_ARGS__)), \
+ "d" (ARG3(__VA_ARGS__)), \
+ "c" (ARG4(__VA_ARGS__)), \
+ "r" (__r8)); \
+})
+#define __BUG_ARGS_6 foo
+#define __BUG_ARGS_7 foo
+#define __BUG_ARGS_8 foo
+#define __BUG_ARGS_9 foo
+
+#define __BUG_ARGS(ins, num, ...) __BUG_ARGS_ ## num(ins, __VA_ARGS__)
+
+#define _BUG_ARGS(ins, num, ...) __BUG_ARGS(ins, num, __VA_ARGS__)
+
+#define _BUG_FLAGS(ins, flags, ...) \
do { \
- asm volatile("1:\t" ins "\n" \
- ".pushsection __bug_table,\"aw\"\n" \
+ _BUG_ARGS(ins, NUM_ARGS(__VA_ARGS__), __VA_ARGS__); \
+ asm volatile(".pushsection __bug_table,\"aw\"\n" \
"2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n" \
"\t" __BUG_REL(%c0) "\t# bug_entry::file\n" \
"\t.word %c1" "\t# bug_entry::line\n" \
@@ -76,7 +125,7 @@ do { \
unreachable(); \
} while (0)

-#define __WARN_FLAGS(flags) _BUG_FLAGS(ASM_UD0, BUGFLAG_WARNING|(flags))
+#define __WARN_FLAGS(flags, ...) _BUG_FLAGS(ASM_UD0, BUGFLAG_WARNING|(flags), __VA_ARGS__)

#include <asm-generic/bug.h>

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 70c7732c9594..0cb16e912c02 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -58,8 +58,8 @@ struct bug_entry {
#endif

#ifdef __WARN_FLAGS
-#define __WARN_TAINT(taint) __WARN_FLAGS(BUGFLAG_TAINT(taint))
-#define __WARN_ONCE_TAINT(taint) __WARN_FLAGS(BUGFLAG_ONCE|BUGFLAG_TAINT(taint))
+#define __WARN_TAINT(taint, args...) __WARN_FLAGS(BUGFLAG_TAINT(taint), args)
+#define __WARN_ONCE_TAINT(taint, args...) __WARN_FLAGS(BUGFLAG_ONCE|BUGFLAG_TAINT(taint), args)

#define WARN_ON_ONCE(condition) ({ \
int __ret_warn_on = !!(condition); \
@@ -84,11 +84,12 @@ void warn_slowpath_fmt_taint(const char *file, const int line, unsigned taint,
extern void warn_slowpath_null(const char *file, const int line);
#ifdef __WARN_TAINT
#define __WARN() __WARN_TAINT(TAINT_WARN)
+#define __WARN_printf(args...) __WARN_TAINT(TAINT_WARN, args)
#else
#define __WARN() warn_slowpath_null(__FILE__, __LINE__)
+#define __WARN_printf(arg...) warn_slowpath_fmt(__FILE__, __LINE__, arg)
#endif

-#define __WARN_printf(arg...) warn_slowpath_fmt(__FILE__, __LINE__, arg)
#define __WARN_printf_taint(taint, arg...) \
warn_slowpath_fmt_taint(__FILE__, __LINE__, taint, arg)
/* used internally by panic.c */
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 2d2721756abf..e641552e17cf 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -192,6 +192,14 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
# define unreachable() do { } while (1)
#endif

+#define __NUM_ARGS(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, N, ...) N
+#define NUM_ARGS(...) __NUM_ARGS(0, ## __VA_ARGS__, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
+#define ARG1(_1, ...) _1
+#define ARG2(_1, _2, ...) _2
+#define ARG3(_1, _2, _3, ...) _3
+#define ARG4(_1, _2, _3, _4, ...) _4
+#define ARG5(_1, _2, _3, _4, _5, ...) _5
+
/*
* KENTRY - kernel entry point
* This can be used to annotate symbols (functions or data) that are used

--
Josh

2018-11-29 19:27:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt <[email protected]> wrote:
>
> But then we need to implement all numbers of parameters.

Oh, I agree, it's nasty.

But it's actually a nastiness that we've solved before. In particular,
with the system call mappings, which have pretty much the exact same
issue of "map unknown number of arguments to registers".

Yes, it's different - there you map the unknown number of arguments to
a structure access instead. And yes, the macros are unbelievably ugly.
See

arch/x86/include/asm/syscall_wrapper.h

and the __MAP() macro from

include/linux/syscalls.h

so it's not pretty. But it would solve all the problems.

Linus

2018-11-29 19:28:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > compiler couldn't turn a "call wrapper(%rip)" into anything else.
>
> Actually, I think I have a better model - if the caller is done with inline asm.
>
> What you can do then is basically add a single-byte prefix to the
> "call" instruction that does nothing (say, cs override), and then
> replace *that* with a 'int3' instruction.
>
> Boom. Done.
>
> Now, the "int3" handler can just update the instruction in-place, but
> leave the "int3" in place, and then return to the next instruction
> byte (which is just the normal branch instruction without the prefix
> byte).
>
> The cross-CPU case continues to work, because the 'int3' remains in
> place until after the IPI.

Hmm, cute. But then the calls are in inline asm, which results in
giant turds like we have for the pvop vcalls. And, if they start
being used more generally, we potentially have ABI issues where the
calling convention isn't quite what the asm expects, and we explode.

I propose a different solution:

As in this patch set, we have a direct and an indirect version. The
indirect version remains exactly the same as in this patch set. The
direct version only does the patching when all seems well: the
call instruction needs to be 0xe8, and we only do it when the thing
doesn't cross a cache line. Does that work? In the rare case where
the compiler generates something other than 0xe8 or crosses a cache
line, then the thing just remains as a call to the out of line jmp
trampoline. Does that seem reasonable? It's a very minor change to
the patch set.
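
Roughly (a sketch with a made-up helper name, not the actual patch):

static bool can_patch_call_inline(void *addr)
{
	unsigned char *insn = addr;

	/* must be the plain 5-byte direct call: e8 <rel32> */
	if (insn[0] != 0xe8)
		return false;

	/* and the whole instruction must sit inside one 64-byte line */
	return ((unsigned long)addr & 63) + 5 <= 64;
}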

Alternatively, we could actually emulate call instructions like this:

void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
{
struct pt_regs ptregs_copy = *regs;
barrier();
*(unsigned long *)(regs->sp - 8) = whatever; /* may clobber old
regs, but so what? */
asm volatile ("jmp return_to_alternate_ptregs");
}

where return_to_alternate_ptregs points rsp to the ptregs and goes
through the normal return path. It's ugly, but we could have a test
case for it, and it should work fine.

2018-11-29 19:30:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 13:22:11 -0600
Josh Poimboeuf <[email protected]> wrote:

> On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
> > > and honestly, the way "static_call()" works now, can you guarantee
> > > that the call-site doesn't end up doing that, and calling the
> > > trampoline function for two different static calls from one indirect
> > > call?
> > >
> > > See what I'm talking about? Saying "callers are wrapped in macros"
> > > doesn't actually protect you from the compiler doing things like that.
> > >
> > > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > > compiler couldn't turn a "call wrapper(%rip)" into anything else.
> >
> > But then we need to implement all numbers of parameters.
>
> I actually have an old unfinished patch which (ab)used C macros to
> detect the number of parameters and then setup the asm constraints
> accordingly. At the time, the goal was to optimize the BUG code.
>
> I had wanted to avoid this kind of approach for static calls, because
> "ugh", but now it's starting to look much more appealing.
>
> Behold:
>
> diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
> index aa6b2023d8f8..d63e9240da77 100644
> --- a/arch/x86/include/asm/bug.h
> +++ b/arch/x86/include/asm/bug.h
> @@ -32,10 +32,59 @@
>
> #ifdef CONFIG_DEBUG_BUGVERBOSE
>
> -#define _BUG_FLAGS(ins, flags) \
> +#define __BUG_ARGS_0(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n"); \
> +})
> +#define __BUG_ARGS_1(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> + : : "D" (ARG1(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_2(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> + : : "D" (ARG1(__VA_ARGS__)), \
> + "S" (ARG2(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_3(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> + : : "D" (ARG1(__VA_ARGS__)), \
> + "S" (ARG2(__VA_ARGS__)), \
> + "d" (ARG3(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_4(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> + : : "D" (ARG1(__VA_ARGS__)), \
> + "S" (ARG2(__VA_ARGS__)), \
> + "d" (ARG3(__VA_ARGS__)), \
> + "c" (ARG4(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_5(ins, ...) \
> +({\
> + register u64 __r8 asm("r8") = (u64)ARG5(__VA_ARGS__); \
> + asm volatile("1:\t" ins "\n" \
> + : : "D" (ARG1(__VA_ARGS__)), \
> + "S" (ARG2(__VA_ARGS__)), \
> + "d" (ARG3(__VA_ARGS__)), \
> + "c" (ARG4(__VA_ARGS__)), \
> + "r" (__r8)); \
> +})
> +#define __BUG_ARGS_6 foo
> +#define __BUG_ARGS_7 foo
> +#define __BUG_ARGS_8 foo
> +#define __BUG_ARGS_9 foo
> +


There exist tracepoints with 13 arguments.

-- Steve

2018-11-29 19:31:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 11:25 AM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt <[email protected]> wrote:
> >
> > But then we need to implement all numbers of parameters.
>
> Oh, I agree, it's nasty.
>
> But it's actually a nastiness that we've solved before. In particular,
> with the system call mappings, which have pretty much the exact same
> issue of "map unknown number of arguments to registers".
>
> Yes, it's different - there you map the unknown number of arguments to
> a structure access instead. And yes, the macros are unbelievably ugly.
> See
>
> arch/x86/include/asm/syscall_wrapper.h
>
> and the __MAP() macro from
>
> include/linux/syscalls.h
>
> so it's not pretty. But it would solve all the problems.
>

Until someone does:

struct foo foo;
static_call(thingy, foo);

For syscalls, we know better than to do that. For static calls, I'm
less confident.

2018-11-29 19:32:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, 29 Nov 2018 11:24:43 -0800
Linus Torvalds <[email protected]> wrote:

> On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt <[email protected]> wrote:
> >
> > But then we need to implement all numbers of parameters.
>
> Oh, I agree, it's nasty.
>
> But it's actually a nastiness that we've solved before. In particular,
> with the system call mappings, which have pretty much the exact same
> issue of "map unknown number of arguments to registers".
>
> Yes, it's different - there you map the unknown number of arguments to
> a structure access instead. And yes, the macros are unbelievably ugly.
> See
>
> arch/x86/include/asm/syscall_wrapper.h

Those are not doing inline assembly.

>
> and the __MAP() macro from
>
> include/linux/syscalls.h
>
> so it's not pretty. But it would solve all the problems.
>

Again, not inline assembly, and those only handle up to 6 parameters.

My POC started down this route, until I noticed that there are
tracepoints that have 13 parameters! And I need to handle all tracepoints.

Yes, we can argue that we need to change those (if that doesn't break
the API of something using it).

-- Steve


2018-11-29 20:13:55

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 10:58:40AM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt <[email protected]> wrote:
> >
> > Note, we do have a bit of control at what is getting called. The patch
> > set requires that the callers are wrapped in macros. We should not
> > allow just any random callers (like from asm).
>
> Actually, I'd argue that asm is often more controlled than C code.
>
> Right now you can do odd things if you really want to, and have the
> compiler generate indirect calls to those wrapper functions.
>
> For example, I can easily imagine a pre-retpoline compiler turning
>
> if (cond)
> fn1(a,b)
> else
> fn2(a,b);
>
> into a function pointer conditional
>
> (cond ? fn1 : fn2)(a,b);
>
> and honestly, the way "static_call()" works now, can you guarantee
> that the call-site doesn't end up doing that, and calling the
> trampoline function for two different static calls from one indirect
> call?
>
> See what I'm talking about? Saying "callers are wrapped in macros"
> doesn't actually protect you from the compiler doing things like that.
>
> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> compiler couldn't turn a "call wrapper(%rip)" into anything else.

I think objtool could warn about many such issues, including function
pointer references to trampolines and short tail call jumps.

--
Josh

2018-11-29 20:26:29

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > > compiler couldn't turn a "call wrapper(%rip)" into anything else.
> >
> > Actually, I think I have a better model - if the caller is done with inline asm.
> >
> > What you can do then is basically add a single-byte prefix to the
> > "call" instruction that does nothing (say, cs override), and then
> > replace *that* with a 'int3' instruction.
> >
> > Boom. Done.
> >
> > Now, the "int3" handler can just update the instruction in-place, but
> > leave the "int3" in place, and then return to the next instruction
> > byte (which is just the normal branch instruction without the prefix
> > byte).
> >
> > The cross-CPU case continues to work, because the 'int3' remains in
> > place until after the IPI.
>
> Hmm, cute. But then the calls are in inline asm, which results in
> giant turds like we have for the pvop vcalls. And, if they start
> being used more generally, we potentially have ABI issues where the
> calling convention isn't quite what the asm expects, and we explode.
>
> I propose a different solution:
>
> As in this patch set, we have a direct and an indirect version. The
> indirect version remains exactly the same as in this patch set. The
> direct version just only does the patching when all seems well: the
> call instruction needs to be 0xe8, and we only do it when the thing
> doesn't cross a cache line. Does that work? In the rare case where
> the compiler generates something other than 0xe8 or crosses a cache
> line, then the thing just remains as a call to the out of line jmp
> trampoline. Does that seem reasonable? It's a very minor change to
> the patch set.

Maybe that would be ok. If my math is right, we would use the
out-of-line version almost 5% of the time due to cache misalignment of
the address.
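
(Back of the envelope, assuming call sites land at uniformly random
offsets and that it's the 4-byte displacement being rewritten which
must not straddle a 64-byte line: the displacement occupies bytes 1-4
of the 5-byte call, so it straddles only when the call starts at
offset 60, 61 or 62 within the line, i.e. 3/64 ~= 4.7% of the time.)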

> Alternatively, we could actually emulate call instructions like this:
>
> void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
> {
> struct pt_regs ptregs_copy = *regs;
> barrier();
> *(unsigned long *)(regs->sp - 8) = whatever; /* may clobber old
> regs, but so what? */
> asm volatile ("jmp return_to_alternate_ptregs");
> }
>
> where return_to_alternate_ptregs points rsp to the ptregs and goes
> through the normal return path. It's ugly, but we could have a test
> case for it, and it should work fine.

Is that really any better than my patch to create a gap in the stack
(modified for kernel space #BP only)?

--
Josh

2018-11-29 22:04:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:

> > (like pointing IP at a stub that retpolines to the target by reading
> > the function pointer, a la the unoptimizable version), then okay, I
> > guess, with only a small amount of grumbling.
>
> I tried that in v2, but Peter pointed out it's racy:
>
> https://lkml.kernel.org/r/[email protected]

Ah, but that is because it is a global shared trampoline.

Each static_call has its own trampoline, which currently reads
something like:

RETPOLINE_SAFE
JMP *key

which you then 'defuse' by writing an UD2 on. _However_, if you write
that trampoline like:

1: RETPOLINE_SAFE
JMP *key
2: CALL_NOSPEC *key
RET

and have the text_poke_bp() handler jump to 2 (a location you'll never
reach when you enter at 1), it will in fact work I think. The trampoline
is never modified and not shared between different static_call's.

2018-11-29 22:16:04

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
>
> > > (like pointing IP at a stub that retpolines to the target by reading
> > > the function pointer, a la the unoptimizable version), then okay, I
> > > guess, with only a small amount of grumbling.
> >
> > I tried that in v2, but Peter pointed out it's racy:
> >
> > https://lkml.kernel.org/r/[email protected]
>
> Ah, but that is because it is a global shared trampoline.
>
> Each static_call has it's own trampoline; which currently reads
> something like:
>
> RETPOLINE_SAFE
> JMP *key
>
> which you then 'defuse' by writing an UD2 on. _However_, if you write
> that trampoline like:
>
> 1: RETPOLINE_SAFE
> JMP *key
> 2: CALL_NOSPEC *key
> RET
>
> and have the text_poke_bp() handler jump to 2 (a location you'll never
> reach when you enter at 1), it will in fact work I think. The trampoline
> is never modified and not shared between different static_call's.

But after returning from the function to the trampoline, how does it
return from the trampoline to the call site? At that point there is no
return address on the stack.

--
Josh

2018-11-29 22:19:46

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 02:24:52PM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> > On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > In contrast, if the call was wrapped in an inline asm, we'd *know* the
> > > > compiler couldn't turn a "call wrapper(%rip)" into anything else.
> > >
> > > Actually, I think I have a better model - if the caller is done with inline asm.
> > >
> > > What you can do then is basically add a single-byte prefix to the
> > > "call" instruction that does nothing (say, cs override), and then
> > > replace *that* with a 'int3' instruction.
> > >
> > > Boom. Done.
> > >
> > > Now, the "int3" handler can just update the instruction in-place, but
> > > leave the "int3" in place, and then return to the next instruction
> > > byte (which is just the normal branch instruction without the prefix
> > > byte).
> > >
> > > The cross-CPU case continues to work, because the 'int3' remains in
> > > place until after the IPI.
> >
> > Hmm, cute. But then the calls are in inline asm, which results in
> > giant turds like we have for the pvop vcalls. And, if they start
> > being used more generally, we potentially have ABI issues where the
> > calling convention isn't quite what the asm expects, and we explode.
> >
> > I propose a different solution:
> >
> > As in this patch set, we have a direct and an indirect version. The
> > indirect version remains exactly the same as in this patch set. The
> > direct version just only does the patching when all seems well: the
> > call instruction needs to be 0xe8, and we only do it when the thing
> > doesn't cross a cache line. Does that work? In the rare case where
> > the compiler generates something other than 0xe8 or crosses a cache
> > line, then the thing just remains as a call to the out of line jmp
> > trampoline. Does that seem reasonable? It's a very minor change to
> > the patch set.
>
> Maybe that would be ok. If my math is right, we would use the
> out-of-line version almost 5% of the time due to cache misalignment of
> the address.

BTW, this means that if any of a trampoline's callers crosses cache
boundaries then we won't be able to poison the trampoline. Which is
kind of sad.

--
Josh

2018-11-29 22:24:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote:
> On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> >
> > > > (like pointing IP at a stub that retpolines to the target by reading
> > > > the function pointer, a la the unoptimizable version), then okay, I
> > > > guess, with only a small amount of grumbling.
> > >
> > > I tried that in v2, but Peter pointed out it's racy:
> > >
> > > https://lkml.kernel.org/r/[email protected]
> >
> > Ah, but that is because it is a global shared trampoline.
> >
> > Each static_call has it's own trampoline; which currently reads
> > something like:
> >
> > RETPOLINE_SAFE
> > JMP *key
> >
> > which you then 'defuse' by writing an UD2 on. _However_, if you write
> > that trampoline like:
> >
> > 1: RETPOLINE_SAFE
> > JMP *key
> > 2: CALL_NOSPEC *key
> > RET
> >
> > and have the text_poke_bp() handler jump to 2 (a location you'll never
> > reach when you enter at 1), it will in fact work I think. The trampoline
> > is never modified and not shared between different static_call's.
>
> But after returning from the function to the trampoline, how does it
> return from the trampoline to the call site? At that point there is no
> return address on the stack.

Oh, right, so that RET doesn't work. ARGH. Time to go sleep I suppose.

2018-11-29 22:26:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 2:22 PM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote:
> > On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> > > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> > >
> > > > > (like pointing IP at a stub that retpolines to the target by reading
> > > > > the function pointer, a la the unoptimizable version), then okay, I
> > > > > guess, with only a small amount of grumbling.
> > > >
> > > > I tried that in v2, but Peter pointed out it's racy:
> > > >
> > > > https://lkml.kernel.org/r/[email protected]
> > >
> > > Ah, but that is because it is a global shared trampoline.
> > >
> > > Each static_call has it's own trampoline; which currently reads
> > > something like:
> > >
> > > RETPOLINE_SAFE
> > > JMP *key
> > >
> > > which you then 'defuse' by writing an UD2 on. _However_, if you write
> > > that trampoline like:
> > >
> > > 1: RETPOLINE_SAFE
> > > JMP *key
> > > 2: CALL_NOSPEC *key
> > > RET
> > >
> > > and have the text_poke_bp() handler jump to 2 (a location you'll never
> > > reach when you enter at 1), it will in fact work I think. The trampoline
> > > is never modified and not shared between different static_call's.
> >
> > But after returning from the function to the trampoline, how does it
> > return from the trampoline to the call site? At that point there is no
> > return address on the stack.
>
> Oh, right, so that RET don't work. ARGH. Time to go sleep I suppose.

I assume I'm missing something, but can't it just be JMP_NOSPEC *key?
The code would call the trampoline just like any other function and,
if the alignment is bad, we can skip patching it. And, if we want the
performance back, maybe some day we can find a clean way to patch
those misaligned callers, too.

2018-11-29 22:31:14

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 02:25:33PM -0800, Andy Lutomirski wrote:
> On Thu, Nov 29, 2018 at 2:22 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Nov 29, 2018 at 04:14:46PM -0600, Josh Poimboeuf wrote:
> > > On Thu, Nov 29, 2018 at 11:01:48PM +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 29, 2018 at 11:10:50AM -0600, Josh Poimboeuf wrote:
> > > > > On Thu, Nov 29, 2018 at 08:59:31AM -0800, Andy Lutomirski wrote:
> > > >
> > > > > > (like pointing IP at a stub that retpolines to the target by reading
> > > > > > the function pointer, a la the unoptimizable version), then okay, I
> > > > > > guess, with only a small amount of grumbling.
> > > > >
> > > > > I tried that in v2, but Peter pointed out it's racy:
> > > > >
> > > > > https://lkml.kernel.org/r/[email protected]
> > > >
> > > > Ah, but that is because it is a global shared trampoline.
> > > >
> > > > Each static_call has it's own trampoline; which currently reads
> > > > something like:
> > > >
> > > > RETPOLINE_SAFE
> > > > JMP *key
> > > >
> > > > which you then 'defuse' by writing an UD2 on. _However_, if you write
> > > > that trampoline like:
> > > >
> > > > 1: RETPOLINE_SAFE
> > > > JMP *key
> > > > 2: CALL_NOSPEC *key
> > > > RET
> > > >
> > > > and have the text_poke_bp() handler jump to 2 (a location you'll never
> > > > reach when you enter at 1), it will in fact work I think. The trampoline
> > > > is never modified and not shared between different static_call's.
> > >
> > > But after returning from the function to the trampoline, how does it
> > > return from the trampoline to the call site? At that point there is no
> > > return address on the stack.
> >
> > Oh, right, so that RET don't work. ARGH. Time to go sleep I suppose.
>
> I assume I'm missing something, but can't it just be JMP_NOSPEC *key?
> The code would call the trampoline just like any other function and,
> if the alignment is bad, we can skip patching it. And, if we want the
> performance back, maybe some day we can find a clean way to patch
> those misaligned callers, too.

Yeah, this is currently the leading contender, though I believe it will
use a direct jump like the current out-of-line implementation.

--
Josh

2018-11-29 23:07:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf <[email protected]> wrote:
>
> On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> >
> > I propose a different solution:
> >
> > As in this patch set, we have a direct and an indirect version. The
> > indirect version remains exactly the same as in this patch set. The
> > direct version just only does the patching when all seems well: the
> > call instruction needs to be 0xe8, and we only do it when the thing
> > doesn't cross a cache line. Does that work? In the rare case where
> > the compiler generates something other than 0xe8 or crosses a cache
> > line, then the thing just remains as a call to the out of line jmp
> > trampoline. Does that seem reasonable? It's a very minor change to
> > the patch set.
>
> Maybe that would be ok. If my math is right, we would use the
> out-of-line version almost 5% of the time due to cache misalignment of
> the address.

Note that I don't think cache-line alignment is necessarily sufficient.

The I$ fetch from the cacheline can happen in smaller chunks, because
the bus between the I$ and the instruction decode isn't a full
cacheline (well, it is _now_ in modern big cores, but it hasn't always
been).

So even if the cacheline is updated atomically, I could imagine seeing
a partial fetch from the I$ (old values) and then a second partial
fetch (new values).

It would be interesting to know what the exact fetch rules are.

Linus

2018-11-30 16:30:02

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf <[email protected]> wrote:
> >
> > On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> > >
> > > I propose a different solution:
> > >
> > > As in this patch set, we have a direct and an indirect version. The
> > > indirect version remains exactly the same as in this patch set. The
> > > direct version just only does the patching when all seems well: the
> > > call instruction needs to be 0xe8, and we only do it when the thing
> > > doesn't cross a cache line. Does that work? In the rare case where
> > > the compiler generates something other than 0xe8 or crosses a cache
> > > line, then the thing just remains as a call to the out of line jmp
> > > trampoline. Does that seem reasonable? It's a very minor change to
> > > the patch set.
> >
> > Maybe that would be ok. If my math is right, we would use the
> > out-of-line version almost 5% of the time due to cache misalignment of
> > the address.
>
> Note that I don't think cache-line alignment is necessarily sufficient.
>
> The I$ fetch from the cacheline can happen in smaller chunks, because
> the bus between the I$ and the instruction decode isn't a full
> cacheline (well, it is _now_ in modern big cores, but it hasn't always
> been).
>
> So even if the cacheline is updated atomically, I could imagine seeing
> a partial fetch from the I$ (old values) and then a second partial
> fetch (new values).
>
> It would be interesting to know what the exact fetch rules are.

I've been doing some cross-modifying code experiments on Nehalem, with
one CPU writing call destinations while the other CPUs are executing
them. Reliably, one of the readers goes off into the weeds within a few
seconds.

The writing was done with just text_poke(), no #BP.

I wasn't able to figure out the pattern in the addresses of the
corrupted call sites. It wasn't cache line.

That was on Nehalem. Skylake didn't crash at all.

--
Josh

2018-11-30 16:43:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 12:24 PM Josh Poimboeuf <[email protected]> wrote:
>
> > Alternatively, we could actually emulate call instructions like this:
> >
> > void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
> > {
> > struct pt_regs ptregs_copy = *regs;
> > barrier();
> > *(unsigned long *)(regs->sp - 8) = whatever; /* may clobber old
> > regs, but so what? */
> > asm volatile ("jmp return_to_alternate_ptregs");
> > }
> >
> > where return_to_alternate_ptregs points rsp to the ptregs and goes
> > through the normal return path. It's ugly, but we could have a test
> > case for it, and it should work fine.
>
> Is that really any better than my patch to create a gap in the stack
> (modified for kernel space #BP only)?
>

I tend to prefer a nice local hack like mine over a hack that further
complicates the entry in general. This is not to say I'm thrilled by
my idea either.

2018-11-30 18:41:28

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, Nov 30, 2018 at 08:42:26AM -0800, Andy Lutomirski wrote:
> On Thu, Nov 29, 2018 at 12:24 PM Josh Poimboeuf <[email protected]> wrote:
> >
> > > Alternatively, we could actually emulate call instructions like this:
> > >
> > > void __noreturn jump_to_kernel_pt_regs(struct pt_regs *regs, ...)
> > > {
> > > struct pt_regs ptregs_copy = *regs;
> > > barrier();
> > > *(unsigned long *)(regs->sp - 8) = whatever; /* may clobber old
> > > regs, but so what? */
> > > asm volatile ("jmp return_to_alternate_ptregs");
> > > }
> > >
> > > where return_to_alternate_ptregs points rsp to the ptregs and goes
> > > through the normal return path. It's ugly, but we could have a test
> > > case for it, and it should work fine.
> >
> > Is that really any better than my patch to create a gap in the stack
> > (modified for kernel space #BP only)?
> >
>
> I tend to prefer a nice local hack like mine over a hack that further
> complicates the entry in general. This is not to say I'm thrilled by
> my idea either.

They're both mucking with the location of the pt_regs. The above code
just takes that fact and hides it in the corner and hopes that there are
no bugs lurking there.

Even with the CPL check, the "gap" code is simple and self-contained
(see below). The kernel pt_regs can already be anywhere on the stack so
there should be no harm in moving them.

AFAICT, all the other proposed options seem to have major issues.

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce25d84023c0..f487f7daed6c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -876,7 +876,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt
* @paranoid == 2 is special: the stub will never switch stacks. This is for
* #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
*/
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=1
ENTRY(\sym)
UNWIND_HINT_IRET_REGS offset=\has_error_code*8

@@ -896,6 +896,18 @@ ENTRY(\sym)
jnz .Lfrom_usermode_switch_stack_\@
.endif

+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+ .if \create_gap == 1
+ testb $3, CS-ORIG_RAX(%rsp)
+ jnz .Lfrom_usermode_no_gap_\@
+ .rept 6
+ pushq 5*8(%rsp)
+ .endr
+ UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+ .endif
+#endif
+
.if \paranoid
call paranoid_entry
.else
@@ -1126,7 +1138,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
#endif /* CONFIG_HYPERV */

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3 do_int3 has_error_code=0
+idtentry int3 do_int3 has_error_code=0 create_gap=1
idtentry stack_segment do_stack_segment has_error_code=1

#ifdef CONFIG_XEN_PV

2018-11-30 19:47:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf <[email protected]> wrote:
>
> AFAICT, all the other proposed options seem to have major issues.

I still absolutely detest this patch, and in fact it got worse from
the test of the config variable.

Honestly, the entry code being legible and simple is more important
than the extra cycle from branching to a trampoline for static calls.

Just don't do the inline case if it causes this much confusion.

Linus

2018-11-30 20:19:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, Nov 30, 2018 at 11:51 AM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf <[email protected]> wrote:
> >
> > AFAICT, all the other proposed options seem to have major issues.
>
> I still absolutely detest this patch, and in fact it got worse from
> the test of the config variable.
>
> Honestly, the entry code being legible and simple is more important
> than the extra cycle from branching to a trampoline for static calls.
>
> Just don't do the inline case if it causes this much confusion.

With my entry maintainer hat on, I don't mind it so much, although the
implementation needs some work. The #ifdef should just go away, and
there should be another sanity check in the sanity check section.

Or we could replace that IPI with x86's bona fide serialize-all-cpus
primitive and then we can just retry instead of emulating. It's a
piece of cake -- we just trigger an SMI :) /me runs away.

--Andy

2018-11-30 20:31:15

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, 30 Nov 2018 12:18:33 -0800
Andy Lutomirski <[email protected]> wrote:

> Or we could replace that IPI with x86's bona fide serialize-all-cpus
> primitive and then we can just retry instead of emulating. It's a
> piece of cake -- we just trigger an SMI :) /me runs away.

I must have fallen on my head one too many times, because I really like
the idea of synchronizing all the CPUs with an SMI! (If that's even
possible). The IPIs that are sent are only there to force smp_mb() on all
CPUs, which should be something an SMI could do.

/me runs after Andy

-- Steve

2018-11-30 21:02:07

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, Nov 30, 2018 at 12:28 PM Steven Rostedt <[email protected]> wrote:
>
> On Fri, 30 Nov 2018 12:18:33 -0800
> Andy Lutomirski <[email protected]> wrote:
>
> > Or we could replace that IPI with x86's bona fide serialize-all-cpus
> > primitive and then we can just retry instead of emulating. It's a
> > piece of cake -- we just trigger an SMI :) /me runs away.
>
> I must have fallen on my head one too many times, because I really like
> the idea of synchronizing all the CPUs with an SMI! (If that's even
> possible). The IPI's that are sent are only to force smp_mb() on all
> CPUs. Which should be something an SMI could do.
>
> /me runs after Andy

According to the SDM, you can program the APIC ICR to request an SMI.
It's not remotely clear to me what will happen if we do this. For all
I know, the SMI handler will explode and the computer will catch fire.
PeterZ?

2018-11-30 21:03:51

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, 30 Nov 2018 12:59:36 -0800
Andy Lutomirski <[email protected]> wrote:

> For all I know, the SMI handler will explode and the computer will catch fire.

That sounds like an AWESOME feature!!!

-- Steve


2018-11-30 21:11:40

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, Nov 30, 2018 at 12:18:33PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 30, 2018 at 11:51 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf <[email protected]> wrote:
> > >
> > > AFAICT, all the other proposed options seem to have major issues.
> >
> > I still absolutely detest this patch, and in fact it got worse from
> > the test of the config variable.
> >
> > Honestly, the entry code being legible and simple is more important
> > than the extra cycle from branching to a trampoline for static calls.
> >
> > Just don't do the inline case if it causes this much confusion.

I *really* don't want to have to drop the inline feature. The speedup
is measurable and not insignificant. And out-of-line would be a
regression if we ported paravirt to use static calls.

> With my entry maintainer hat on, I don't mind it so much, although the
> implementation needs some work. The #ifdef should just go away, and
> there should be another sanity check in the sanity check section.

Your suggested changes sound good to me. I'll be gone next week, so
here's hoping you'll have this all figured out when I get back!

--
Josh

2018-11-30 21:14:46

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, 30 Nov 2018, Andy Lutomirski wrote:

> According to the SDM, you can program the APIC ICR to request an SMI.
> It's not remotely clear to me what will happen if we do this.

I think one of the known reliable ways to trigger SMI is to write 0x0 to
the SMI command I/O port (0xb2).

> For all I know, the SMI handler will explode and the computer will catch
> fire.

Ha, therefore no one can claim any more that SMIs are always harmful :)

--
Jiri Kosina
SUSE Labs


2018-11-30 22:17:41

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On 29/11/2018 20.22, Josh Poimboeuf wrote:
> On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
>>> and honestly, the way "static_call()" works now, can you guarantee
>>> that the call-site doesn't end up doing that, and calling the
>>> trampoline function for two different static calls from one indirect
>>> call?
>>>
>>> See what I'm talking about? Saying "callers are wrapped in macros"
>>> doesn't actually protect you from the compiler doing things like that.
>>>
>>> In contrast, if the call was wrapped in an inline asm, we'd *know* the
>>> compiler couldn't turn a "call wrapper(%rip)" into anything else.
>>
>> But then we need to implement all numbers of parameters.
>
> I actually have an old unfinished patch which (ab)used C macros to
> detect the number of parameters and then setup the asm constraints
> accordingly. At the time, the goal was to optimize the BUG code.
>
> I had wanted to avoid this kind of approach for static calls, because
> "ugh", but now it's starting to look much more appealing.
>
> Behold:
>
> diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
> index aa6b2023d8f8..d63e9240da77 100644
> --- a/arch/x86/include/asm/bug.h
> +++ b/arch/x86/include/asm/bug.h
> @@ -32,10 +32,59 @@
>
> #ifdef CONFIG_DEBUG_BUGVERBOSE
>
> -#define _BUG_FLAGS(ins, flags) \
> +#define __BUG_ARGS_0(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n"); \
> +})
> +#define __BUG_ARGS_1(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> + : : "D" (ARG1(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_2(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> + : : "D" (ARG1(__VA_ARGS__)), \
> + "S" (ARG2(__VA_ARGS__))); \
> +})
> +#define __BUG_ARGS_3(ins, ...) \
> +({\
> + asm volatile("1:\t" ins "\n" \
> + : : "D" (ARG1(__VA_ARGS__)), \
> + "S" (ARG2(__VA_ARGS__)), \
> + "d" (ARG3(__VA_ARGS__))); \
> +})

wouldn't you need to tie all these to (unused) outputs as well as adding
the remaining caller-saved registers to the clobber list? Maybe not for
the WARN machinery(?), but at least for stuff that should look like a
normal call to gcc? Then there's %rax which is either a clobber or an
output, and if there's not to be a separate static_call_void(), one
would need to do some __builtin_choose_expr(__same_type(void, f(...)), ...).
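
Concretely, to look like a normal call to gcc I'd expect each variant
to need roughly this shape (a sketch only: made-up macro name, two
integer arguments, the standard x86-64 calling convention, and the
void-return problem ignored):

#define STATIC_CALL_2(func, arg1, arg2)					\
({									\
	register long __ret asm("rax");					\
	register long __a1 asm("rdi") = (long)(arg1);			\
	register long __a2 asm("rsi") = (long)(arg2);			\
	asm volatile("call " #func					\
		     : "=r" (__ret), "+r" (__a1), "+r" (__a2)		\
		     :							\
		     : "rdx", "rcx", "r8", "r9", "r10", "r11",		\
		       "memory", "cc");					\
	__ret;								\
})

i.e. the return value as an output and every remaining caller-saved
integer register as a clobber.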

Rasmus

2018-11-30 22:26:54

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Fri, Nov 30, 2018 at 11:16:34PM +0100, Rasmus Villemoes wrote:
> On 29/11/2018 20.22, Josh Poimboeuf wrote:
> > On Thu, Nov 29, 2018 at 02:16:48PM -0500, Steven Rostedt wrote:
> >>> and honestly, the way "static_call()" works now, can you guarantee
> >>> that the call-site doesn't end up doing that, and calling the
> >>> trampoline function for two different static calls from one indirect
> >>> call?
> >>>
> >>> See what I'm talking about? Saying "callers are wrapped in macros"
> >>> doesn't actually protect you from the compiler doing things like that.
> >>>
> >>> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> >>> compiler couldn't turn a "call wrapper(%rip)" into anything else.
> >>
> >> But then we need to implement all numbers of parameters.
> >
> > I actually have an old unfinished patch which (ab)used C macros to
> > detect the number of parameters and then setup the asm constraints
> > accordingly. At the time, the goal was to optimize the BUG code.
> >
> > I had wanted to avoid this kind of approach for static calls, because
> > "ugh", but now it's starting to look much more appealing.
> >
> > Behold:
> >
> > diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
> > index aa6b2023d8f8..d63e9240da77 100644
> > --- a/arch/x86/include/asm/bug.h
> > +++ b/arch/x86/include/asm/bug.h
> > @@ -32,10 +32,59 @@
> >
> > #ifdef CONFIG_DEBUG_BUGVERBOSE
> >
> > -#define _BUG_FLAGS(ins, flags) \
> > +#define __BUG_ARGS_0(ins, ...) \
> > +({\
> > + asm volatile("1:\t" ins "\n"); \
> > +})
> > +#define __BUG_ARGS_1(ins, ...) \
> > +({\
> > + asm volatile("1:\t" ins "\n" \
> > + : : "D" (ARG1(__VA_ARGS__))); \
> > +})
> > +#define __BUG_ARGS_2(ins, ...) \
> > +({\
> > + asm volatile("1:\t" ins "\n" \
> > + : : "D" (ARG1(__VA_ARGS__)), \
> > + "S" (ARG2(__VA_ARGS__))); \
> > +})
> > +#define __BUG_ARGS_3(ins, ...) \
> > +({\
> > + asm volatile("1:\t" ins "\n" \
> > + : : "D" (ARG1(__VA_ARGS__)), \
> > + "S" (ARG2(__VA_ARGS__)), \
> > + "d" (ARG3(__VA_ARGS__))); \
> > +})
>
> wouldn't you need to tie all these to (unused) outputs as well as adding
> the remaining caller-saved registers to the clobber list? Maybe not for
> the WARN machinery(?), but at least for stuff that should look like a
> normal call to gcc? Then there's %rax which is either a clobber or an
> output, and if there's not to be a separate static_call_void(), one
> would need to do some __builtin_choose_expr(__same_type(void, f(...)), ...).

Yes, this is a crappy unfinished patch. It should be ignored, and
perhaps even mercilessly mocked :-)

paravirt_types.h already does something similar today, and it's at least
more correct than this.

What I was trying to show was that you can use macros to count
arguments, like this:

_BUG_ARGS(ins, NUM_ARGS(__VA_ARGS__), __VA_ARGS__);

which can make a macro look and act like a function call. Though as
Steven pointed out, the concept falls apart after 6 arguments.
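
To spell out the counting trick: NUM_ARGS() pads the argument list
with a descending 10..0 sequence and __NUM_ARGS() always returns its
12th parameter, so each real argument shifts the answer up by one.
Schematically:

	NUM_ARGS(a, b, c)
	  -> __NUM_ARGS(0, a, b, c, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
	  -> 3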

--
Josh

2018-12-04 23:10:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls


Where did this end up BTW?

I know that there's controversy about the
CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the
CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
v1 patch 0 description:

There are three separate implementations, depending on what the arch
supports:

1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
objtool and a small amount of arch code

2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
a small amount of arch code

3) If no arch support, fall back to regular function pointers

My benchmarks showed the best improvements with the
STATIC_CALL_OPTIMIZED, but it still showed improvement with the
UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
above (which happen to be the first part of the patch set; 1 comes in
at the end)?

I would also just call it CONFIG_STATIC_CALL. If we ever agree on the
optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
Having an option called UNOPTIMIZED just seems wrong.

-- Steve



On Mon, 26 Nov 2018 07:54:56 -0600
Josh Poimboeuf <[email protected]> wrote:

> v2:
> - fix STATIC_CALL_TRAMP() macro by using __PASTE() [Ard]
> - rename optimized/unoptimized -> inline/out-of-line [Ard]
> - tweak arch interfaces for PLT and add key->tramp field [Ard]
> - rename 'poison' to 'defuse' and do it after all sites have been patched [Ard]
> - fix .init handling [Ard, Steven]
> - add CONFIG_HAVE_STATIC_CALL [Steven]
> - make interfaces more consistent across configs to allow tracepoints to
> use them [Steven]
> - move __ADDRESSABLE() to static_call() macro [Steven]
> - prevent 2-byte jumps [Steven]
> - add offset to asm-offsets.c instead of hard coding key->func offset
> - add kernel_text_address() sanity check
> - make __ADDRESSABLE() symbols truly unique
>
> TODO:
> - port Ard's arm64 patches to the new arch interfaces
> - tracepoint performance testing
>
> --------------------
>
> These patches are related to two similar patch sets from Ard and Steve:
>
> - https://lkml.kernel.org/r/[email protected]
> - https://lkml.kernel.org/r/[email protected]
>
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
>
> There are three separate implementations, depending on what the arch
> supports:
>
> 1) CONFIG_HAVE_STATIC_CALL_INLINE: patched call sites - requires
> objtool and a small amount of arch code
>
> 2) CONFIG_HAVE_STATIC_CALL_OUTLINE: patched trampolines - requires
> a small amount of arch code
>
> 3) If no arch support, fall back to regular function pointers
>
>
> Josh Poimboeuf (4):
> compiler.h: Make __ADDRESSABLE() symbol truly unique
> static_call: Add static call infrastructure
> x86/static_call: Add out-of-line static call implementation
> x86/static_call: Add inline static call implementation for x86-64
>
> arch/Kconfig | 10 +
> arch/x86/Kconfig | 4 +-
> arch/x86/include/asm/static_call.h | 52 +++
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/asm-offsets.c | 6 +
> arch/x86/kernel/static_call.c | 78 ++++
> include/asm-generic/vmlinux.lds.h | 11 +
> include/linux/compiler.h | 2 +-
> include/linux/module.h | 10 +
> include/linux/static_call.h | 202 ++++++++++
> include/linux/static_call_types.h | 19 +
> kernel/Makefile | 1 +
> kernel/module.c | 5 +
> kernel/static_call.c | 350 ++++++++++++++++++
> tools/objtool/Makefile | 3 +-
> tools/objtool/check.c | 126 ++++++-
> tools/objtool/check.h | 2 +
> tools/objtool/elf.h | 1 +
> .../objtool/include/linux/static_call_types.h | 19 +
> tools/objtool/sync-check.sh | 1 +
> 20 files changed, 899 insertions(+), 4 deletions(-)
> create mode 100644 arch/x86/include/asm/static_call.h
> create mode 100644 arch/x86/kernel/static_call.c
> create mode 100644 include/linux/static_call.h
> create mode 100644 include/linux/static_call_types.h
> create mode 100644 kernel/static_call.c
> create mode 100644 tools/objtool/include/linux/static_call_types.h
>


2018-12-04 23:43:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls



> On Dec 4, 2018, at 3:08 PM, Steven Rostedt <[email protected]> wrote:
>
>
> Where did this end up BTW?
>
> I know that there's controversy about the
> CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the
> CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
> v1 patch 0 description:
>
> There are three separate implementations, depending on what the arch
> supports:
>
> 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
> objtool and a small amount of arch code
>
> 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
> a small amount of arch code
>
> 3) If no arch support, fall back to regular function pointers
>
> My benchmarks showed the best improvements with the
> STATIC_CALL_OPTIMIZED, but it still showed improvement with the
> UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
> above (which happen to be the first part of the patch set. 1 comes in
> at the end).

Sounds good to me.

>
> I would also just call it CONFIG_STATIC_CALL. If we every agree on the
> optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
> Have an option called UNOPTIMIZED just seems wrong.

My objection to all the bike shed colors so far is that we *always* have static_call() — it’s just not always static.

Anyway, I have a new objection to Josh’s create_gap proposal: what on Earth will kernel CET do to it? Maybe my longjmp-like hack is actually better.



2018-12-05 15:07:43

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

On Tue, Dec 04, 2018 at 03:41:01PM -0800, Andy Lutomirski wrote:
>
>
> > On Dec 4, 2018, at 3:08 PM, Steven Rostedt <[email protected]> wrote:
> >
> >
> > Where did this end up BTW?
> >
> > I know that there's controversy about the
> > CONFIG_HAVE_STATIC_CALL_OPTIMIZED option, but I don't think the
> > CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED version was controversial. From the
> > v1 patch 0 description:
> >
> > There are three separate implementations, depending on what the arch
> > supports:
> >
> > 1) CONFIG_HAVE_STATIC_CALL_OPTIMIZED: patched call sites - requires
> > objtool and a small amount of arch code
> >
> > 2) CONFIG_HAVE_STATIC_CALL_UNOPTIMIZED: patched trampolines - requires
> > a small amount of arch code
> >
> > 3) If no arch support, fall back to regular function pointers
> >
> > My benchmarks showed the best improvements with the
> > STATIC_CALL_OPTIMIZED, but it still showed improvement with the
> > UNOPTIMIZED version as well. Can we at least apply 2 and 3 from the
> > above (which happen to be the first part of the patch set. 1 comes in
> > at the end).
>
> Sounds good to me.
>
> >
> > I would also just call it CONFIG_STATIC_CALL. If we every agree on the
> > optimized version, then we can call it CONFIG_STATIC_CALL_OPTIMIZED.
> > Have an option called UNOPTIMIZED just seems wrong.

(Poking my head up for a bit, soon to disappear again until next week)

Ard had already objected to "unoptimized", which was why for v2 I
renamed them to CONFIG_STATIC_CALL_OUTLINE and CONFIG_STATIC_CALL_INLINE.

I could rename it to CONFIG_STATIC_CALL and CONFIG_STATIC_CALL_INLINE if
you prefer. I don't have much of an opinion either way.

I'll post a v3 next week or so, with the controversial bits more fully
separated from the non-controversial bits. So at least the out-of-line
implementation can get merged.

> My objection to all the bike shed colors so far is that we *always*
> have static_call() — it’s just not always static.

Hm? Do you mean you don't like that we have a generic function pointer
implementation? or what?

> Anyway, I have a new objection to Josh’s create_gap proposal: what on
> Earth will kernel CET do to it? Maybe my longjmp-like hack is
> actually better.

Does CET even care about iret? I assumed it didn't. If it does, your
proposal would have the same problem, no?

--
Josh

2018-12-05 23:37:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

>> On Dec 5, 2018, at 7:04 AM, Josh Poimboeuf <[email protected]> wrote:
>
>
>> Anyway, I have a new objection to Josh’s create_gap proposal: what on
>> Earth will kernel CET do to it? Maybe my longjmp-like hack is
>> actually better.
>
> Does CET even care about iret? I assumed it didn't. If it does, your
> proposal would have the same problem, no?

I think it doesn’t, but it doesn’t really matter. The shadow stack looks like:

retaddr of function being poked
call do_int3 + 5

And, to emulate a call, you need to stick a new frame right in the
middle. At least with a longjmp-like approach, you can clobber the
“call do_int3 + 5” part and then INCSSP on the way out. To be fair, I
think this also sucks.

PeterZ, can we abuse NMI to make this problem go away? I don't
suppose that we have some rule that NMI handlers never wait for other
CPUs to finish doing anything?

2018-12-07 16:08:48

by Edward Cree

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

Sorry if this has been pointed out before (it's a very long thread), but
in the out-of-line implementation, it appears that static_call_update()
never alters key->func. Am I right in thinking that this should be
fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
arch_static_call_transform() on line 159 of include/linux/static_call.h?
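
For concreteness, a minimal sketch of the repaired out-of-line helper with
that suggestion applied (it mirrors the fix Edward posts later in this
thread, which also renames the parameter to _func so that the .func member
reference survives macro expansion):

#define __static_call_update(key, _func)				\
({									\
	cpus_read_lock();						\
	arch_static_call_transform(NULL, key.tramp, _func);		\
	WRITE_ONCE(key.func, _func);	/* the suggested addition */	\
	cpus_read_unlock();						\
})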

Some background (why does key->func matter for the
CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
combining these static calls with the 'indirect call wrappers' notion
that Paolo Abeni has been working on [1], using runtime instrumentation
to determine a list of potential callees. (This allows us to cope with
cases where the callees are in modules, or where different workloads may
use different sets of callees for a given call site, neither of which is
handled by Paolo's approach).
The core of my design looks something like:

static int dynamic_call_xyz(int (*func)(some_args), some_args)
{
	if (func == dynamic_call_xyz_1.func)
		return static_call(dynamic_call_xyz_1, some_args);
	if (func == dynamic_call_xyz_2.func)
		return static_call(dynamic_call_xyz_2, some_args);
	return (*func)(some_args);
}

albeit with a bunch of extra (and currently rather ugly) stuff to collect
the statistics needed to decide what to put in the static call keys, and
mechanisms (RCU in my current case) to ensure that the static call isn't
changed between checking its .func and actually calling it.

-Ed

PS: not on list, please keep me in CC.

[1] https://lwn.net/Articles/773985/

2018-12-07 16:51:26

by Edward Cree

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

On 07/12/18 16:06, Edward Cree wrote:
> Sorry if this has been pointed out before (it's a very long thread), but
> in the out-of-line implementation, it appears that static_call_update()
> never alters key->func. Am I right in thinking that this should be
> fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
> arch_static_call_transform() on line 159 of include/linux/static_call.h?
On further examination, it's worse than that.

Why does the CONFIG_HAVE_STATIC_CALL_OUTLINE static_call_update() not
 call __static_call_update()?  It contains nothing but a BUILD_BUG_ON,
 which isn't likely to update anything.

-Ed

2018-12-11 00:47:51

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

Hi!

> These patches are related to two similar patch sets from Ard and Steve:
>
> - https://lkml.kernel.org/r/[email protected]
> - https://lkml.kernel.org/r/[email protected]
>
> The code is also heavily inspired by the jump label code, as some of the
> concepts are very similar.
>
> There are three separate implementations, depending on what the arch
> supports:
>
> 1) CONFIG_HAVE_STATIC_CALL_INLINE: patched call sites - requires
> objtool and a small amount of arch code
>
> 2) CONFIG_HAVE_STATIC_CALL_OUTLINE: patched trampolines - requires
> a small amount of arch code
>
> 3) If no arch support, fall back to regular function pointers

Well, it would be nice to mention what these patches do :-).

I guess they are expected to make things slightly faster? If so it
would be nice to mention benchmarks...

(There are even statistics later in the series, but I guess at least a
short explanation should be in the cover letter).

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-12-11 00:47:54

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu 2018-11-29 11:11:50, Linus Torvalds wrote:
> On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > What you can do then is basically add a single-byte prefix to the
> > "call" instruction that does nothing (say, cs override), and then
> > replace *that* with a 'int3' instruction.
>
> Hmm. the segment prefixes are documented as being "reserved" for
> branch instructions. I *think* that means just conditional branches
> (Intel at one point used the prefixes for static prediction
> information), not "call", but who knows..
>
> It might be better to use an empty REX prefix on x86-64 or something like that.

It might be easiest to use plain old NOP, no? :-).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-12-11 06:46:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Mon, Dec 10, 2018 at 3:58 PM Pavel Machek <[email protected]> wrote:
>
> On Thu 2018-11-29 11:11:50, Linus Torvalds wrote:
> >
> > It might be better to use an empty REX prefix on x86-64 or something like that.
>
> It might be easiest to use plain old NOP, no? :-).

No. The whole point would be that the instruction rewriting is atomic wrt fetch.

If it's a "nop" + "second instruction", and the "nop" is overwritten
by "int3", then the second instruction could still be executed after
the "int3" has been written (because the other CPU just finished the
"nop".

So an empty rex prefix is very different from a one-byte nop, exactly
because it's executed atomically with the instruction itself.

Linus

2018-12-11 09:42:51

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

From: Josh Poimboeuf
> Sent: 30 November 2018 16:27
>
> On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote:
> > On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf <[email protected]> wrote:
...
> > > Maybe that would be ok. If my math is right, we would use the
> > > out-of-line version almost 5% of the time due to cache misalignment of
> > > the address.
> >
> > Note that I don't think cache-line alignment is necessarily sufficient.
> >
> > The I$ fetch from the cacheline can happen in smaller chunks, because
> > the bus between the I$ and the instruction decode isn't a full
> > cacheline (well, it is _now_ in modern big cores, but it hasn't always
> > been).
> >
> > So even if the cacheline is updated atomically, I could imagine seeing
> > a partial fetch from the I$ (old values) and then a second partial
> > fetch (new values).
> >
> > It would be interesting to know what the exact fetch rules are.
>
> I've been doing some cross-modifying code experiments on Nehalem, with
> one CPU writing call destinations while the other CPUs are executing
> them. Reliably, one of the readers goes off into the weeds within a few
> seconds.
>
> The writing was done with just text_poke(), no #BP.
>
> I wasn't able to figure out the pattern in the addresses of the
> corrupted call sites. It wasn't cache line.
>
> That was on Nehalem. Skylake didn't crash at all.

Interesting thought?

If it is possible to add a prefix that can be overwritten by an int3
is it also possible to add something that the assembler will use
to align the instruction so that a write to the 4 byte offset
will be atomic?

I'd guess that avoiding 8 byte granularity would be sufficient.
So you'd need a 1, 2 or 3 byte nop depending on the actual
alignment - although a 3 byte one would always do.
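
A rough illustration of the arithmetic behind that idea (a sketch only,
assuming x86-64 and a plain 5-byte call whose rel32 operand starts at
call_addr + 1; the function name is made up):

/*
 * Number of NOP bytes to emit before a 5-byte "call rel32" so that the
 * 4-byte rel32 operand does not cross an 8-byte boundary and can then be
 * rewritten with a single store within one aligned chunk.
 */
static inline unsigned int call_rel32_pad(unsigned long call_addr)
{
	unsigned int off = (call_addr + 1) & 7;	/* rel32 offset in its 8-byte chunk */

	return off <= 4 ? 0 : 8 - off;		/* 0, or a 1-, 2- or 3-byte NOP */
}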

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2018-12-11 17:25:04

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Tue, Dec 11, 2018 at 09:41:37AM +0000, David Laight wrote:
> From: Josh Poimboeuf
> > Sent: 30 November 2018 16:27
> >
> > On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote:
> > > On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf <[email protected]> wrote:
> ...
> > > > Maybe that would be ok. If my math is right, we would use the
> > > > out-of-line version almost 5% of the time due to cache misalignment of
> > > > the address.
> > >
> > > Note that I don't think cache-line alignment is necessarily sufficient.
> > >
> > > The I$ fetch from the cacheline can happen in smaller chunks, because
> > > the bus between the I$ and the instruction decode isn't a full
> > > cacheline (well, it is _now_ in modern big cores, but it hasn't always
> > > been).
> > >
> > > So even if the cacheline is updated atomically, I could imagine seeing
> > > a partial fetch from the I$ (old values) and then a second partial
> > > fetch (new values).
> > >
> > > It would be interesting to know what the exact fetch rules are.
> >
> > I've been doing some cross-modifying code experiments on Nehalem, with
> > one CPU writing call destinations while the other CPUs are executing
> > them. Reliably, one of the readers goes off into the weeds within a few
> > seconds.
> >
> > The writing was done with just text_poke(), no #BP.
> >
> > I wasn't able to figure out the pattern in the addresses of the
> > corrupted call sites. It wasn't cache line.
> >
> > That was on Nehalem. Skylake didn't crash at all.
>
> Interesting thought?
>
> If it is possible to add a prefix that can be overwritten by an int3
> is it also possible to add something that the assembler will use
> to align the instruction so that a write to the 4 byte offset
> will be atomic?
>
> I'd guess that avoiding 8 byte granularity would be sufficient.
> So you'd need a 1, 2 or 3 byte nop depending on the actual
> alignment - although a 3 byte one would always do.

The problem is that the call is done in C code, and we don't have a
feasible way to use inline asm to call functions with more than five
arguments.

BTW, my original experiments (mentioned above) were a bit... flawed. I
used text_poke(), which does memcpy(), which writes one byte at a time.
No wonder it wasn't atomic.
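
For illustration, a sketch of the difference (assuming the rel32 operand
has been made writable, e.g. through a text_poke()-style temporary mapping,
and that it does not straddle whatever boundary turns out to matter):

#include <linux/compiler.h>
#include <linux/string.h>
#include <linux/types.h>

/* Byte at a time, as described above; another CPU may fetch a torn value. */
static void patch_rel32_bytewise(u32 *rel32, u32 val)
{
	memcpy(rel32, &val, sizeof(val));
}

/* One 32-bit store; what the later "32-bit writes" experiments rely on. */
static void patch_rel32_single(u32 *rel32, u32 val)
{
	WRITE_ONCE(*rel32, val);
}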

I'll need to do some more experiments.

--
Josh

2018-12-11 18:07:23

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

On Fri, Dec 07, 2018 at 04:06:32PM +0000, Edward Cree wrote:
> Sorry if this has been pointed out before (it's a very long thread), but
> in the out-of-line implementation, it appears that static_call_update()
> never alters key->func. Am I right in thinking that this should be
> fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
> arch_static_call_transform() on line 159 of include/linux/static_call.h?

Yes, you're right about both bugs in the out-of-line case: key->func
needs to be written, and __static_call_update() needs to be called by
static_call_update. I was so focused on getting the inline case working
that I overlooked those.

> Some background (why does key->func matter for the
> CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
> combining these static calls with the 'indirect call wrappers' notion
> that Paolo Abeni has been working on [1], using runtime instrumentation
> to determine a list of potential callees. (This allows us to cope with
> cases where the callees are in modules, or where different workloads may
> use different sets of callees for a given call site, neither of which is
> handled by Paolo's approach).
> The core of my design looks something like:
>
> static int dynamic_call_xyz(int (*func)(some_args), some_args)
> {
> 	if (func == dynamic_call_xyz_1.func)
> 		return static_call(dynamic_call_xyz_1, some_args);
> 	if (func == dynamic_call_xyz_2.func)
> 		return static_call(dynamic_call_xyz_2, some_args);
> 	return (*func)(some_args);
> }
>
> albeit with a bunch of extra (and currently rather ugly) stuff to collect
> the statistics needed to decide what to put in the static call keys, and
> mechanisms (RCU in my current case) to ensure that the static call isn't
> changed between checking its .func and actually calling it.
>
> -Ed
>
> PS: not on list, please keep me in CC.
>
> [1] https://lwn.net/Articles/773985/

Thanks, this sounds very interesting. Adding Nadav to CC, as he has
been looking at a different approach to solving the same problem:

https://lkml.kernel.org/r/[email protected]

--
Josh

2018-12-12 06:01:40

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

> On Dec 11, 2018, at 10:05 AM, Josh Poimboeuf <[email protected]> wrote:
>
> On Fri, Dec 07, 2018 at 04:06:32PM +0000, Edward Cree wrote:
>> Sorry if this has been pointed out before (it's a very long thread), but
>> in the out-of-line implementation, it appears that static_call_update()
>> never alters key->func. Am I right in thinking that this should be
>> fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
>> arch_static_call_transform() on line 159 of include/linux/static_call.h?
>
> Yes, you're right about both bugs in the out-of-line case: key->func
> needs to be written, and __static_call_update() needs to be called by
> static_call_update. I was so focused on getting the inline case working
> that I overlooked those.
>
>> Some background (why does key->func matter for the
>> CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
>> combining these static calls with the 'indirect call wrappers' notion
>> that Paolo Abeni has been working on [1], using runtime instrumentation
>> to determine a list of potential callees. (This allows us to cope with
>> cases where the callees are in modules, or where different workloads may
>> use different sets of callees for a given call site, neither of which is
>> handled by Paolo's approach).
>> The core of my design looks something like:
>>
>> static int dynamic_call_xyz(int (*func)(some_args), some_args)
>> {
>> 	if (func == dynamic_call_xyz_1.func)
>> 		return static_call(dynamic_call_xyz_1, some_args);
>> 	if (func == dynamic_call_xyz_2.func)
>> 		return static_call(dynamic_call_xyz_2, some_args);
>> 	return (*func)(some_args);
>> }
>>
>> albeit with a bunch of extra (and currently rather ugly) stuff to collect
>> the statistics needed to decide what to put in the static call keys, and
>> mechanisms (RCU in my current case) to ensure that the static call isn't
>> changed between checking its .func and actually calling it.
>>
>> -Ed
>>
>> PS: not on list, please keep me in CC.
>>
>> [1] https://lwn.net/Articles/773985/
>
> Thanks, this sounds very interesting. Adding Nadav to CC, as he has
> been looking at a different approach to solving the same problem:
>
> https://lkml.kernel.org/r/[email protected]

Thanks for cc’ing me. (I didn’t know about the other patch-sets.)

Allow me to share my experience, because I was studying this issue for some
time and have implemented more than I shared, since the code needs more
cleanup. Some of the proposed approaches are things we either considered or
actually implemented (and later dropped). Eventually, our design was guided
by performance profiling and a grain of “what’s academic” consideration.

I think that eventually you would want to go with one central mechanism for
the various situations:

1. Registered targets (static) and automatically learnt targets (dynamic).
Registration does not work in some cases (e.g., file-system function
pointers, there are too many of those). On the other hand, if you know your
target it is obviously simpler/better.

2. With/without retpoline fallback. We have always had the retpoline as
fallback, but if you use a registration mechanism, it’s not necessary.

3. Single and multiple targets. For multiple targets we decided to use
an outline block in order not to inflate the code for no reason. There were
over 10000 indirect branches in our kernel build, but in our workloads only
~500 were actually run.

If you go with the approach that Edward mentioned, you may want to
associate each “function” with an identifier (think about file_operations
having an additional field that holds a unique ID, or using the struct
address). This would allow you to use a “binary search” to find the right
target, which would be slightly more efficient (a rough sketch follows after
this list). We actually used a binary search for a different reason -
learning the most frequent syscalls per process and calling them in this
manner (we actually had an indirection table to choose the right one).

4. Call-chains which are mostly fixed (next). You want to unroll those.

5. Per-process calls. The case that bothered us the most is seccomp. On our
setup, systemd installed 17(!) BPF seccomp programs on Redis server, causing
every syscall to go through 17 indirect branches to invoke them. But
similarly you have mmu-notifiers and others. We used a per-process
trampoline page for this matter.
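
The rough sketch referenced above, for the ID-plus-binary-search dispatch
(every name here is hypothetical; in the real proposal the comparison tree
would be generated from the learnt targets rather than hand-written):

/* Hypothetical sorted IDs, e.g. addresses of the ops structs. */
enum { ID_A = 1, ID_B = 2, ID_C = 3 };

extern int target_a_read(void *arg);	/* hypothetical learnt targets */
extern int target_b_read(void *arg);
extern int target_c_read(void *arg);
extern int fallback_read(void *arg);	/* the retpolined indirect call */

static int dispatch_read(unsigned long id, void *arg)
{
	/* binary search over the sorted IDs; every leaf is a direct call */
	if (id <= ID_B) {
		if (id == ID_A)
			return target_a_read(arg);
		if (id == ID_B)
			return target_b_read(arg);
	} else if (id == ID_C) {
		return target_c_read(arg);
	}
	return fallback_read(arg);
}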

Now there is of course the question of whether to go through automatic
inference of the indirect call sites (using asm macros/GCC plugin) or
manually marking them (using C macros). Initially we used C macros, which we
created using semi-automatically generated Coccinelle scripts. As I
remembered how I was crucified in the past over “uglification” of the code,
I thought a transparent modification of the code would be better, so we went
with asm macros for our prototype.

Finally, I should mention that the impact of most of these mechanisms should
not be significant (or even positive) if retpolines were not used. Having
said that, the automatically-learnt indirect branch promotion (with a single
target) showed up to roughly 2% performance improvement.

Please let me know how you want to proceed. I didn’t know about your
patch-set, but I think that having two similar (yet different) separate
mechanisms is not great. If you want, I’ll finish addressing the issues
you’ve raised and send another RFC.

Regards,
Nadav

2018-12-12 17:12:18

by Edward Cree

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

On 12/12/18 05:59, Nadav Amit wrote:
> Thanks for cc’ing me. (I didn’t know about the other patch-sets.)
Well in my case, that's because I haven't posted any yet.  (Will follow up
 shortly with what I currently have, though it's not pretty.)

Looking at your patches, it seems you've got a much more developed learning
 mechanism.  Mine on the other hand is brutally simple but runs continuously
 (i.e. after we patch we immediately enter the next 'relearning' phase);
 since it never does anything but prod a handful of percpu variables, this
 shouldn't be too costly.

Also, you've got the macrology for making all indirect calls use this,
 whereas at present I just have an open-coded instance on a single call site
 (I went with deliver_skb in the networking stack).

So I think where we probably want to go from here is:
 1) get Josh's static_calls in.  AIUI Linus seems to prefer the out-of-line
    approach; I'd say ditch the inline version (at least for now).
 2) build a relpolines patch series that uses
   i) static_calls for the text-patching part
  ii) as much of Nadav's macrology as is applicable
 iii) either my or Nadav's learning mechanism; we can experiment with both,
      bikeshed it incessantly etc.

Seem reasonable?

-Ed

2018-12-12 17:50:22

by Edward Cree

[permalink] [raw]
Subject: [RFC/WIP PATCH 0/2] dynamic calls

A fix to the static_calls series (on which this series depends), and a really
hacky proof-of-concept of runtime-patched branch trees of static_calls to
avoid indirect calls / retpolines in the hot-path. Rather than any generally
applicable machinery, the patch just open-codes it for one call site (the
pt_prev->func() call in deliver_skb and __netif_receive_skb_one_core()); it
should however be possible to make a macro that takes a 'name' parameter and
expands to the whole thing. Also the _update() function could be shared and
get something useful from its work_struct, rather than needing a separate
copy of the function for every indirect call site.

Performance testing so far has been somewhat inconclusive; I applied this on
net-next, hacked up my Kconfig to use out-of-line static calls on x86-64, and
ran some 1-byte UDP stream tests with the DUT receiving.
On a single stream test, I saw packet rate go up by 7%, from 470Kpps to
504Kpps, with a considerable reduction in variance; however, CPU usage
increased by a larger factor: (packet rate / RX cpu) is a much lower-variance
measurement and went down by 13%. This however may be because it often got
into a state where, while patching the calls (and thus sending all callers
down the slow path) we continue to gather stats and see enough calls to
trigger another update; as there's no code to detect and skip an update that
doesn't change anything, we get into a tight loop of redoing updates. I am
working on this & plan to change it to not collect any stats while an update
is actually in progress.
On a 4-stream test, the variance I saw was too high to draw any conclusions;
the packet rate went down about 2½% but this was not statistically
significant (and the fastest run I saw was with dynamic calls present).

Edward Cree (2):
static_call: fix out-of-line static call implementation
net: core: rather hacky PoC implementation of dynamic calls

include/linux/static_call.h | 6 +-
net/core/dev.c | 222 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 221 insertions(+), 7 deletions(-)


2018-12-12 17:51:34

by Edward Cree

[permalink] [raw]
Subject: [RFC PATCH 1/2] static_call: fix out-of-line static call implementation

Actually call __static_call_update() from static_call_update(), and fix the
former so it can actually compile. Also make it update key.func.

Signed-off-by: Edward Cree <[email protected]>
---
include/linux/static_call.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 6daff586c97d..38d6c1e4c85d 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -153,16 +153,18 @@ struct static_call_key {

#define static_call(key, args...) STATIC_CALL_TRAMP(key)(args)

-#define __static_call_update(key, func) \
+#define __static_call_update(key, _func) \
({ \
cpus_read_lock(); \
- arch_static_call_transform(NULL, key->tramp, func); \
+ arch_static_call_transform(NULL, key.tramp, _func); \
+ WRITE_ONCE(key.func, _func); \
cpus_read_unlock(); \
})

#define static_call_update(key, func) \
({ \
BUILD_BUG_ON(!__same_type(func, STATIC_CALL_TRAMP(key))); \
+ __static_call_update(key, func); \
})

#define EXPORT_STATIC_CALL(key) \


2018-12-12 17:55:07

by Edward Cree

[permalink] [raw]
Subject: [RFC PATCH 2/2] net: core: rather hacky PoC implementation of dynamic calls

Uses runtime instrumentation of callees from an indirect call site
(deliver_skb, and also __netif_receive_skb_one_core()) to populate an
indirect-call-wrapper branch tree. Essentially we're doing indirect
branch prediction in software because the hardware can't be trusted to
get it right; this is sad.

It's also full of printk()s right now to display what it's doing for
debugging purposes; obviously those wouldn't be quite the same in a
finished version.

Signed-off-by: Edward Cree <[email protected]>
---
net/core/dev.c | 222 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 217 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 04a6b7100aac..f69c110c34e3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -145,6 +145,7 @@
#include <linux/sctp.h>
#include <net/udp_tunnel.h>
#include <linux/net_namespace.h>
+#include <linux/static_call.h>

#include "net-sysfs.h"

@@ -1935,14 +1936,223 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
}
EXPORT_SYMBOL_GPL(dev_forward_skb);

-static inline int deliver_skb(struct sk_buff *skb,
- struct packet_type *pt_prev,
- struct net_device *orig_dev)
+static void deliver_skb_update(struct work_struct *unused);
+
+static DECLARE_WORK(deliver_skb_update_work, deliver_skb_update);
+
+typedef int (*deliver_skb_func)(struct sk_buff *, struct net_device *, struct packet_type *, struct net_device *);
+
+struct deliver_skb_candidate {
+ deliver_skb_func func;
+ unsigned long hit_count;
+};
+
+static DEFINE_PER_CPU(struct deliver_skb_candidate[4], deliver_skb_candidates);
+
+static DEFINE_PER_CPU(unsigned long, deliver_skb_miss_count);
+
+/* Used to route around the dynamic version when we're changing it, as well as
+ * as a fallback if none of our static calls match.
+ */
+static int do_deliver_skb(struct sk_buff *skb,
+ struct packet_type *pt_prev,
+ struct net_device *orig_dev)
+{
+ struct deliver_skb_candidate *cands = *this_cpu_ptr(&deliver_skb_candidates);
+ deliver_skb_func func = pt_prev->func;
+ unsigned long total_count;
+ int i;
+
+ for (i = 0; i < 4; i++)
+ if (func == cands[i].func) {
+ cands[i].hit_count++;
+ break;
+ }
+ if (i == 4) /* no match */
+ for (i = 0; i < 4; i++)
+ if (!cands[i].func) {
+ cands[i].func = func;
+ cands[i].hit_count = 1;
+ break;
+ }
+ if (i == 4) /* no space */
+ (*this_cpu_ptr(&deliver_skb_miss_count))++;
+
+ total_count = *this_cpu_ptr(&deliver_skb_miss_count);
+ for (i = 0; i < 4; i++)
+ total_count += cands[i].hit_count;
+ if (total_count > 1000) /* Arbitrary threshold */
+ schedule_work(&deliver_skb_update_work);
+ return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+}
+
+DEFINE_STATIC_CALL(dispatch_deliver_skb, do_deliver_skb);
+
+static int dummy_deliver_skb(struct sk_buff *skb, struct net_device *dev,
+ struct packet_type *pt_prev,
+ struct net_device *orig_dev)
+{
+ WARN_ON_ONCE(1); /* shouldn't ever actually get here */
+ return do_deliver_skb(skb, pt_prev, orig_dev);
+}
+
+DEFINE_STATIC_CALL(dynamic_deliver_skb_1, dummy_deliver_skb);
+DEFINE_STATIC_CALL(dynamic_deliver_skb_2, dummy_deliver_skb);
+
+static DEFINE_PER_CPU(unsigned long, dds1_hit_count);
+static DEFINE_PER_CPU(unsigned long, dds2_hit_count);
+
+static int dynamic_deliver_skb(struct sk_buff *skb,
+ struct packet_type *pt_prev,
+ struct net_device *orig_dev)
+{
+ deliver_skb_func func = pt_prev->func;
+
+ if (func == dynamic_deliver_skb_1.func) {
+ (*this_cpu_ptr(&dds1_hit_count))++;
+ return static_call(dynamic_deliver_skb_1, skb, skb->dev,
+ pt_prev, orig_dev);
+ }
+ if (func == dynamic_deliver_skb_2.func) {
+ (*this_cpu_ptr(&dds2_hit_count))++;
+ return static_call(dynamic_deliver_skb_2, skb, skb->dev,
+ pt_prev, orig_dev);
+ }
+ return do_deliver_skb(skb, pt_prev, orig_dev);
+}
+
+DEFINE_MUTEX(deliver_skb_update_lock);
+
+static void deliver_skb_add_cand(struct deliver_skb_candidate *top,
+ size_t ncands,
+ struct deliver_skb_candidate next)
+{
+ struct deliver_skb_candidate old;
+ int i;
+
+ for (i = 0; i < ncands; i++) {
+ if (next.hit_count > top[i].hit_count) {
+ /* Swap next with top[i], so that the old top[i] can
+ * shunt along all lower scores
+ */
+ old = top[i];
+ top[i] = next;
+ next = old;
+ }
+ }
+}
+
+static void deliver_skb_count_hits(struct deliver_skb_candidate *top,
+ size_t ncands, struct static_call_key *key,
+ unsigned long __percpu *hit_count)
+{
+ struct deliver_skb_candidate next;
+ int cpu;
+
+ next.func = key->func;
+ next.hit_count = 0;
+ for_each_online_cpu(cpu) {
+ next.hit_count += *per_cpu_ptr(hit_count, cpu);
+ *per_cpu_ptr(hit_count, cpu) = 0;
+ }
+
+ printk(KERN_ERR "hit_count for old %pf: %lu\n", next.func,
+ next.hit_count);
+
+ deliver_skb_add_cand(top, ncands, next);
+}
+
+static void deliver_skb_update(struct work_struct *unused)
+{
+ struct deliver_skb_candidate top[4], next, *cands, *cands2;
+ int cpu, i, cpu2, j;
+
+ memset(top, 0, sizeof(top));
+
+ printk(KERN_ERR "deliver_skb_update called\n");
+ mutex_lock(&deliver_skb_update_lock);
+ printk(KERN_ERR "deliver_skb_update_lock acquired\n");
+ /* We don't stop the other CPUs adding to their counts while this is
+ * going on; but it doesn't really matter because this is a heuristic
+ * anyway so we don't care about perfect accuracy.
+ */
+ /* First count up the hits on the existing static branches */
+ deliver_skb_count_hits(top, ARRAY_SIZE(top), &dynamic_deliver_skb_1,
+ &dds1_hit_count);
+ deliver_skb_count_hits(top, ARRAY_SIZE(top), &dynamic_deliver_skb_2,
+ &dds2_hit_count);
+ /* Next count up the callees seen in the fallback path */
+ for_each_online_cpu(cpu) {
+ cands = *per_cpu_ptr(&deliver_skb_candidates, cpu);
+ printk(KERN_ERR "miss_count for %d: %lu\n", cpu,
+ *per_cpu_ptr(&deliver_skb_miss_count, cpu));
+ for (i = 0; i < 4; i++) {
+ next = cands[i];
+ if (next.func == NULL)
+ continue;
+ next.hit_count = 0;
+ for_each_online_cpu(cpu2) {
+ cands2 = *per_cpu_ptr(&deliver_skb_candidates,
+ cpu2);
+ for (j = 0; j < 4; j++) {
+ if (cands2[j].func == next.func) {
+ next.hit_count += cands2[j].hit_count;
+ cands2[j].hit_count = 0;
+ cands2[j].func = NULL;
+ break;
+ }
+ }
+ }
+ printk(KERN_ERR "candidate %d/%d: %pf %lu\n", cpu, i,
+ next.func, next.hit_count);
+ deliver_skb_add_cand(top, ARRAY_SIZE(top), next);
+ }
+ }
+ /* Record our results (for debugging) */
+ for (i = 0; i < ARRAY_SIZE(top); i++) {
+ if (i < 2) /* 2 == number of static calls in the branch tree */
+ printk(KERN_ERR "selected [%d] %pf, score %lu\n", i,
+ top[i].func, top[i].hit_count);
+ else
+ printk(KERN_ERR "runnerup [%d] %pf, score %lu\n", i,
+ top[i].func, top[i].hit_count);
+ }
+ /* It's possible that we could have picked up multiple pushes of the
+ * workitem, so someone already collected most of the count. In that
+ * case, don't make a decision based on only a small number of calls.
+ */
+ if (top[0].hit_count > 250) {
+ /* Divert callers away from the fast path */
+ static_call_update(dispatch_deliver_skb, do_deliver_skb);
+ printk(KERN_ERR "patched dds to %pf\n", dispatch_deliver_skb.func);
+ /* Wait for existing fast path callers to finish */
+ synchronize_rcu();
+ /* Patch the chosen callees into the fast path */
+ static_call_update(dynamic_deliver_skb_1, *top[0].func);
+ printk(KERN_ERR "patched dds1 to %pf\n", dynamic_deliver_skb_1.func);
+ static_call_update(dynamic_deliver_skb_2, *top[1].func);
+ printk(KERN_ERR "patched dds2 to %pf\n", dynamic_deliver_skb_2.func);
+ /* Ensure the new fast path is seen before we direct anyone
+ * into it. This probably isn't necessary (the binary-patching
+ * framework probably takes care of it) but let's be paranoid.
+ */
+ wmb();
+ /* Switch callers back onto the fast path */
+ static_call_update(dispatch_deliver_skb, dynamic_deliver_skb);
+ printk(KERN_ERR "patched dds to %pf\n", dispatch_deliver_skb.func);
+ }
+ mutex_unlock(&deliver_skb_update_lock);
+ printk(KERN_ERR "deliver_skb_update finished\n");
+}
+
+static noinline int deliver_skb(struct sk_buff *skb,
+ struct packet_type *pt_prev,
+ struct net_device *orig_dev)
{
if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
return -ENOMEM;
refcount_inc(&skb->users);
- return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+ return static_call(dispatch_deliver_skb, skb, pt_prev, orig_dev);
}

static inline void deliver_ptype_list_skb(struct sk_buff *skb,
@@ -4951,7 +5161,9 @@ static int __netif_receive_skb_one_core(struct sk_buff *skb, bool pfmemalloc)

ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
if (pt_prev)
- ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+ /* ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); */
+ /* but (hopefully) faster */
+ ret = static_call(dispatch_deliver_skb, skb, pt_prev, orig_dev);
return ret;
}


2018-12-12 18:15:20

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

> On Dec 12, 2018, at 9:11 AM, Edward Cree <[email protected]> wrote:
>
> On 12/12/18 05:59, Nadav Amit wrote:
>> Thanks for cc’ing me. (I didn’t know about the other patch-sets.)
> Well in my case, that's because I haven't posted any yet. (Will follow up
> shortly with what I currently have, though it's not pretty.)
>
> Looking at your patches, it seems you've got a much more developed learning
> mechanism. Mine on the other hand is brutally simple but runs continuously
> (i.e. after we patch we immediately enter the next 'relearning' phase);
> since it never does anything but prod a handful of percpu variables, this
> shouldn't be too costly.
>
> Also, you've got the macrology for making all indirect calls use this,
> whereas at present I just have an open-coded instance on a single call site
> (I went with deliver_skb in the networking stack).
>
> So I think where we probably want to go from here is:
> 1) get Josh's static_calls in. AIUI Linus seems to prefer the out-of-line
> approach; I'd say ditch the inline version (at least for now).
> 2) build a relpolines patch series that uses
> i) static_calls for the text-patching part
> ii) as much of Nadav's macrology as is applicable
> iii) either my or Nadav's learning mechanism; we can experiment with both,
> bikeshed it incessantly etc.
>
> Seem reasonable?

Mostly yes. I have a few reservations (and let’s call them optpolines from
now on, since Josh disliked the previous name).

First, I still have to address the issues that Josh raised before, and try
to use gcc plugin instead of (most) of the macros. Specifically, I need to
bring back (from my PoC code) the part that sets multiple targets.

Second, (2i) is not very intuitive for me. Using the out-of-line static
calls seems to me as less performant than the inline (potentially, I didn’t
check).

Anyhow, the use of out-of-line static calls seems to me as
counter-intuitive. I think (didn’t measure) that it may add more overhead
than it saves due to the additional call, ret, and so on - at least if
retpolines are not used. For multiple targets it may be useful in saving
some memory if the outline block is dynamically allocated (as I did in my
yet unpublished code). But that’s not how it’s done in Josh’s code.

If we talk about the inline implementation, there is a different problem that
prevents me from using Josh’s static-calls as-is. I tried to avoid reading
the compared target from memory and therefore used an immediate. This should
prevent data cache misses and, even when the data is available, is faster by
one cycle. But it requires both the “cmp %target-reg, imm” and the
“call rel-target” to be patched “atomically”. So the static-calls
mechanism wouldn’t be sufficient.

Based on Josh’s previous feedback, I thought of improving the learning using
some hysteresis. Anyhow, note that there are quite a few cases in which you
wouldn’t want optpolines. The question is whether in general it would be an
opt-in or opt-out mechanism.

Let me know what you think.

BTW: When it comes to deliver_skb, you have packet_type as an identifier.
You can use it directly or through an indirection table to figure out the
target. Here’s a chunk of assembly magic that I used in a similar case:

.macro _call_table val:req bit:req max:req val1:req bit1:req
call_table_\val\()_\bit\():
test $(1 << \bit), %al
.if \val1 + (1 << \bit1) >= \max
jnz syscall_relpoline_\val1
jmp syscall_relpoline_\val
.else
jnz call_table_\val1\()_\bit1

# fall-through to no carry, val unchanged, going to next bit
call_table \val,\bit1,\max
call_table \val1,\bit1,\max
.endif
.endm

.macro call_table val:req bit:req max:req
.altmacro
_call_table \val,\bit,\max,%(\val + (1 << \bit)),%(\bit + 1)
.noaltmacro
.endm

ENTRY(direct_syscall)
mov %esi, %eax
call_table val=0 bit=0 max=16
ENDPROC(direct_syscall)
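
For readers not fluent in .altmacro recursion, the generated code is
roughly the decision tree below (shown in C for a smaller max=4, with the
hypothetical relpoline_N standing in for the syscall_relpoline_N stubs).
The point of generating it in assembly is that every leaf is reached via
conditional direct branches only; a like-for-like C switch may well be
compiled into exactly the kind of indirect jump table this avoids.

extern long relpoline_0(void), relpoline_1(void);	/* hypothetical stubs */
extern long relpoline_2(void), relpoline_3(void);

static long direct_dispatch_sketch(unsigned int nr)
{
	if (nr & 1) {				/* test bit 0 */
		if (nr & 2)			/* test bit 1 */
			return relpoline_3();
		return relpoline_1();
	}
	if (nr & 2)
		return relpoline_2();
	return relpoline_0();
}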

2018-12-12 18:32:21

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

On Thu, Nov 29, 2018 at 03:04:20PM -0800, Linus Torvalds wrote:
> On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf <[email protected]> wrote:
> >
> > On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> > >
> > > I propose a different solution:
> > >
> > > As in this patch set, we have a direct and an indirect version. The
> > > indirect version remains exactly the same as in this patch set. The
> > > direct version just only does the patching when all seems well: the
> > > call instruction needs to be 0xe8, and we only do it when the thing
> > > doesn't cross a cache line. Does that work? In the rare case where
> > > the compiler generates something other than 0xe8 or crosses a cache
> > > line, then the thing just remains as a call to the out of line jmp
> > > trampoline. Does that seem reasonable? It's a very minor change to
> > > the patch set.
> >
> > Maybe that would be ok. If my math is right, we would use the
> > out-of-line version almost 5% of the time due to cache misalignment of
> > the address.
>
> Note that I don't think cache-line alignment is necessarily sufficient.
>
> The I$ fetch from the cacheline can happen in smaller chunks, because
> the bus between the I$ and the instruction decode isn't a full
> cacheline (well, it is _now_ in modern big cores, but it hasn't always
> been).
>
> So even if the cacheline is updated atomically, I could imagine seeing
> a partial fetch from the I$ (old values) and then a second partial
> fetch (new values).
>
> It would be interesting to know what the exact fetch rules are.

So I fixed my test case to do 32-bit writes, and now the results are
making a lot more sense. Now I only get crashes when writing across
cache lines. So maybe we should just go with Andy's suggestion above.
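
A sketch of the check that suggestion implies (assuming x86-64, a plain
5-byte 0xe8 call, and that what must not be split across cache lines is
the 4-byte rel32 operand being rewritten):

#include <linux/cache.h>
#include <linux/types.h>

/*
 * Patch the call site inline only if it is a plain "call rel32" and its
 * operand can be rewritten with one 32-bit store that does not straddle
 * a cache line; otherwise leave it calling the out-of-line trampoline.
 */
static bool call_site_patchable(void *insn)
{
	unsigned long rel32 = (unsigned long)insn + 1;

	if (*(u8 *)insn != 0xe8)
		return false;

	return rel32 / L1_CACHE_BYTES == (rel32 + 3) / L1_CACHE_BYTES;
}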

It would be great if some CPU people could confirm that it's safe (for
x86-64 only), since it's not in the SDM. Who can help answer that?

--
Josh

2018-12-12 18:35:22

by Edward Cree

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

On 12/12/18 18:14, Nadav Amit wrote:
> Second, (2i) is not very intuitive for me. Using the out-of-line static
> calls seems to me as less performant than the inline (potentially, I didn’t
> check).
>
> Anyhow, the use of out-of-line static calls seems to me as
> counter-intuitive. I think (didn’t measure) that it may add more overhead
> than it saves due to the additional call, ret, and so on
AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
 additional call and ret.  So I wouldn't expect it to be too expensive.
More to the point, it seems like it's easier to get right than the inline
 version, and if we get the inline version working later we can introduce it
 without any API change, much as Josh's existing patches have both versions
 behind a Kconfig switch.

> I tried to avoid reading
> the compared target from memory and therefore used an immediate. This should
> prevent data cache misses and, even when the data is available, is faster by
> one cycle. But it requires both the “cmp %target-reg, imm” and the
> “call rel-target” to be patched “atomically”. So the static-calls
> mechanism wouldn’t be sufficient.
The approach I took to deal with that (since though I'm doing a read from
 memory, it's key->func in .data rather than the jmp immediate in .text) was
 to have another static_call (though a plain static_key could also be used)
 to 'skip' the fast-path while it's actually being patched.  Then, since all
 my callers were under the rcu_read_lock, I just needed to synchronize_rcu()
 after switching off the fast-path to make sure no threads were still in it.
I'm not sure how that would be generalised to all cases, though; we don't
 want to force every indirect call to take the rcu_read_lock as that means
 no callee can ever synchronize_rcu().  I guess we could have our own
 separate RCU read lock just for indirect call patching?  (What does kgraft
 do?)

> Based on Josh’s previous feedback, I thought of improving the learning using
> some hysteresis. Anyhow, note that there are quite a few cases in which you
> wouldn’t want optpolines. The question is whether in general it would be an
> opt-in or opt-out mechanism.
I was working on the assumption that it would be opt-in, wrapping a macro
 around indirect calls that are known to have a fairly small number of hot
 targets.  There are plenty of indirect calls in the kernel that are only
 called once in a blue moon, e.g. in control-plane operations like ethtool;
 we don't really need to bulk up .text with trampolines for all of them.

-Ed

2018-12-12 21:17:56

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

> On Dec 12, 2018, at 10:33 AM, Edward Cree <[email protected]> wrote:
>
> On 12/12/18 18:14, Nadav Amit wrote:
>> Second, (2i) is not very intuitive for me. Using the out-of-line static
>> calls seems to me as less performant than the inline (potentially, I didn’t
>> check).
>>
>> Anyhow, the use of out-of-line static calls seems to me as
>> counter-intuitive. I think (didn’t measure) that it may add more overhead
>> than it saves due to the additional call, ret, and so on
> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
> additional call and ret. So I wouldn't expect it to be too expensive.
> More to the point, it seems like it's easier to get right than the inline
> version, and if we get the inline version working later we can introduce it
> without any API change, much as Josh's existing patches have both versions
> behind a Kconfig switch.

I see. For my outlined blocks I used the opposite approach - a call followed
by jmp (instead of jmp followed by call that Josh did). Does the stack look
correct when you first do the jmp? It seems to me that you will not see the
calling function on the stack in this case. Can it have an effect on
live-patching, debugging?

>> I tried to avoid reading
>> the compared target from memory and therefore used an immediate. This should
>> prevent data cache misses and, even when the data is available, is faster by
>> one cycle. But it requires both the “cmp %target-reg, imm” and the
>> “call rel-target” to be patched “atomically”. So the static-calls
>> mechanism wouldn’t be sufficient.
> The approach I took to deal with that (since though I'm doing a read from
> memory, it's key->func in .data rather than the jmp immediate in .text) was
> to have another static_call (though a plain static_key could also be used)
> to 'skip' the fast-path while it's actually being patched. Then, since all
> my callers were under the rcu_read_lock, I just needed to synchronize_rcu()
> after switching off the fast-path to make sure no threads were still in it.
> I'm not sure how that would be generalised to all cases, though; we don't
> want to force every indirect call to take the rcu_read_lock as that means
> no callee can ever synchronize_rcu(). I guess we could have our own
> separate RCU read lock just for indirect call patching? (What does kgraft
> do?)

I used a similar approach to a certain extent (I’m going to describe the
implementation following the discussion with Andy Lutomirski): we use a
“restartable section”, so that if we are preempted in this block of code,
we restart the entire section. Then, we use synchronize_rcu() like you do
after patching.

>> Based on Josh’s previous feedback, I thought of improving the learning using
>> some hysteresis. Anyhow, note that there are quite a few cases in which you
>> wouldn’t want optpolines. The question is whether in general it would be an
>> opt-in or opt-out mechanism.
> I was working on the assumption that it would be opt-in, wrapping a macro
> around indirect calls that are known to have a fairly small number of hot
> targets. There are plenty of indirect calls in the kernel that are only
> called once in a blue moon, e.g. in control-plane operations like ethtool;
> we don't really need to bulk up .text with trampolines for all of them.

On the other hand, I’m not sure the static_call interface is so intuitive.
And extending it into “dynamic_call” might be even worse. As I initially
used an opt-in approach, I can tell you that it was very exhausting.

2018-12-12 21:38:40

by Edward Cree

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

On 12/12/18 21:15, Nadav Amit wrote:
>> On Dec 12, 2018, at 10:33 AM, Edward Cree <[email protected]> wrote:
>>
>> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
>> additional call and ret. So I wouldn't expect it to be too expensive.
>> More to the point, it seems like it's easier to get right than the inline
>> version, and if we get the inline version working later we can introduce it
>> without any API change, much as Josh's existing patches have both versions
>> behind a Kconfig switch.
> I see. For my outlined blocks I used the opposite approach - a call followed
> by jmp
That's what Josh did too.  I.e. caller calls the trampoline, which jmps to the
 callee; later it rets, taking it back to the caller.  Perhaps I wasn't clear.
The point is that there's still only one call and one ret.

>> I was working on the assumption that it would be opt-in, wrapping a macro
>> around indirect calls that are known to have a fairly small number of hot
>> targets. There are plenty of indirect calls in the kernel that are only
>> called once in a blue moon, e.g. in control-plane operations like ethtool;
>> we don't really need to bulk up .text with trampolines for all of them.
> On the other hand, I’m not sure the static_call interface is so intuitive.
> And extending it into “dynamic_call” might be even worse. As I initially
> used an opt-in approach, I can tell you that it was very exhausting.
Well, if it's done with a gcc plugin after all, then it wouldn't be too hard
 to make it opt-out.
One advantage of the explicit opt-in dynamic_call, though, which can be seen
 in my patch is that multiple call sites can share the same learning-state,
 if they're expected to call the same set of functions.  An opt-out approach
 would automatically give each indirect call statement its own individual BTB.
Either way, I think the question is orthogonal to what the trampolines
 themselves look like (and even to the inline vs outline question).

-Ed

2018-12-12 21:47:48

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Static calls

> On Dec 12, 2018, at 1:36 PM, Edward Cree <[email protected]> wrote:
>
> On 12/12/18 21:15, Nadav Amit wrote:
>>> On Dec 12, 2018, at 10:33 AM, Edward Cree <[email protected]> wrote:
>>>
>>> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
>>> additional call and ret. So I wouldn't expect it to be too expensive.
>>> More to the point, it seems like it's easier to get right than the inline
>>> version, and if we get the inline version working later we can introduce it
>>> without any API change, much as Josh's existing patches have both versions
>>> behind a Kconfig switch.
>> I see. For my outlined blocks I used the opposite approach - a call followed
>> by jmp
> That's what Josh did too. I.e. caller calls the trampoline, which jmps to the
> callee; later it rets, taking it back to the caller. Perhaps I wasn't clear.
> The point is that there's still only one call and one ret.

Sorry for the misunderstanding.

>
>>> I was working on the assumption that it would be opt-in, wrapping a macro
>>> around indirect calls that are known to have a fairly small number of hot
>>> targets. There are plenty of indirect calls in the kernel that are only
>>> called once in a blue moon, e.g. in control-plane operations like ethtool;
>>> we don't really need to bulk up .text with trampolines for all of them.
>> On the other hand, I’m not sure the static_call interface is so intuitive.
>> And extending it into “dynamic_call” might be even worse. As I initially
>> used an opt-in approach, I can tell you that it was very exhausting.
> Well, if it's done with a gcc plugin after all, then it wouldn't be too hard
> to make it opt-out.
> One advantage of the explicit opt-in dynamic_call, though, which can be seen
> in my patch is that multiple call sites can share the same learning-state,
> if they're expected to call the same set of functions. An opt-out approach
> would automatically give each indirect call statement its own individual BTB.
> Either way, I think the question is orthogonal to what the trampolines
> themselves look like (and even to the inline vs outline question).

Not entirely. If the mechanism is opt-out and outlined, and especially if it
also supports multiple targets, you may not want to allocate all the memory
for them during build-time, and instead use module memory to allocate them
dynamically (that’s what we did).