2018-10-18 00:57:49

by Nadav Amit

[permalink] [raw]
Subject: [RFC PATCH 0/5] x86: dynamic indirect call promotion

This RFC introduces indirect call promotion in runtime, which for the
matter of simplification (and branding) will be called here "relpolines"
(relative call + trampoline). Relpolines are mainly intended as a way
of reducing retpoline overheads due to Spectre v2.

Unlike indirect call promotion through profile guided optimization, the
proposed approach does not require a profiling stage, works well with
modules whose address is unknown and can adapt to changing workloads.

The main idea is simple: for every indirect call, we inject a piece of
code with fast- and slow-path calls. The fast path is used if the target
matches the expected (hot) target. The slow-path uses a retpoline.
During training, the slow-path is set to call a function that saves the
call source and target in a hash-table and keep count for call
frequency. The most common target is then patched into the hot path.

The patching is done on-the-fly by patching the conditional branch
(opcode and offset) that is used to compare the target to the hot
target. This allows to direct all cores to the fast-path, while patching
the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
patch a single byte when the code might be executed by any core. (2)
When patching more than one byte, ensure that all cores do not run the
to-be-patched-code by preventing this code from being preempted, and
using synchronize_sched() after patching the branch that jumps over this
code.

Changing all the indirect calls to use relpolines is done using assembly
macro magic. There are alternative solutions, but this one is
relatively simple and transparent. There is also logic to retrain the
software predictor, but the policy it uses may need to be refined.

Eventually the results are not bad (2 VCPU VM, throughput reported):

base relpoline
---- ---------
nginx 22898 25178 (+10%)
redis-ycsb 24523 25486 (+4%)
dbench 2144 2103 (+2%)

When retpolines are disabled, and if retraining is off, performance
benefits are up to 2% (nginx), but are much less impressive.

There are several open issues: retraining should be done when modules
are removed; CPU hotplug is not supported, x86-32 is probably broken and
the Makefile does not rebuild when the relpoline code is changed. Having
said that, I am worried that some of the approaches I took would
challenge the new code-of-conduct, so I though of getting some feedback
before putting more effort into it.

Nadav Amit (5):
x86: introduce preemption disable prefix
x86: patch indirect branch promotion
x86: interface for accessing indirect branch locations
x86: learning and patching indirect branch targets
x86: relpoline: disabling interface

arch/x86/entry/entry_64.S | 10 +
arch/x86/include/asm/nospec-branch.h | 158 +++++
arch/x86/include/asm/sections.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/asm-offsets.c | 6 +
arch/x86/kernel/macros.S | 1 +
arch/x86/kernel/nospec-branch.c | 899 +++++++++++++++++++++++++++
arch/x86/kernel/vmlinux.lds.S | 7 +
arch/x86/lib/retpoline.S | 75 +++
include/linux/module.h | 5 +
kernel/module.c | 8 +
kernel/seccomp.c | 2 +
12 files changed, 1174 insertions(+)
create mode 100644 arch/x86/kernel/nospec-branch.c

--
2.17.1



2018-10-18 00:56:58

by Nadav Amit

[permalink] [raw]
Subject: [RFC PATCH 5/5] x86: relpoline: disabling interface

In certain cases it is beneficial not to use indirect branch promotion.
One such case is seccomp, which may hold multiple filters and different
ones for different processes. The interface indicates to the macro not
to add a relpoline to the the indirect branch.

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/nospec-branch.h | 25 +++++++++++++++++++++++++
kernel/seccomp.c | 2 ++
2 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 360caad7a890..8b10e8165069 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -246,7 +246,21 @@
.endr
.endm

+.L_DISABLE_INDIRECT_BRANCH_OPT = 0
+
+.macro disable_indirect_branch_opt
+_DISABLE_INDIRECT_BRANCH_OPT = 1
+.endm
+
+.macro enable_indirect_branch_opt
+_DISABLE_INDIRECT_BRANCH_OPT = 0
+.endm
+
.macro call v:vararg
+.ifc _DISABLE_INDIRECT_BRANCH_OPT, "1"
+ # The pseudo-prefix is just to avoid expanding the macro
+ {disp8} call \v
+.else
retpoline = 0
.irp reg_it,ARCH_REG_NAMES
.ifc "\v", "__x86_indirect_thunk_\reg_it"
@@ -257,6 +271,7 @@
.if retpoline == 0
{disp8} call \v
.endif
+.endif
.endm

#else /* __ASSEMBLY__ */
@@ -409,6 +424,16 @@ struct relpoline_entry {
extern const void *indirect_thunks[16];
extern const void *save_relpoline_funcs[16];

+static inline void enable_relpolines(void)
+{
+ asm volatile("enable_indirect_branch_opt");
+}
+
+static inline void disable_relpolines(void)
+{
+ asm volatile("disable_indirect_branch_opt");
+}
+
/* The Intel SPEC CTRL MSR base value cache */
extern u64 x86_spec_ctrl_base;

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fd023ac24e10..c3fbeddfa8fa 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -207,6 +207,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
*/
+ disable_relpolines();
for (; f; f = f->prev) {
u32 cur_ret = BPF_PROG_RUN(f->prog, sd);

@@ -215,6 +216,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
*match = f;
}
}
+ enable_relpolines();
return ret;
}
#endif /* CONFIG_SECCOMP_FILTER */
--
2.17.1


2018-10-18 00:57:08

by Nadav Amit

[permalink] [raw]
Subject: [RFC PATCH 4/5] x86: learning and patching indirect branch targets

During runtime, we collect the targets of indirect branch targets and
patch them in. Patching is done asynchronously, by modifying each of the
relpoline code-paths separately while diverting code execution to the
other path during patching. Preemption is disabled while the code runs,
and we wait for preemption to occur on each core to ensure no core is
executing the patched code.

To make use of relpolines, a worker goes over the experienced indirect
calls targets and sorts them according to frequency. The target that
was encountered most times is patched in.

Periodically, the indirect branches are set back into learning mode to
see whether the targets have changed. The current policy might be too
aggressive.

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/nospec-branch.h | 3 +
arch/x86/kernel/nospec-branch.c | 894 +++++++++++++++++++++++++++
2 files changed, 897 insertions(+)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index a4496c141143..360caad7a890 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -406,6 +406,9 @@ struct relpoline_entry {
u8 reg;
} __packed;

+extern const void *indirect_thunks[16];
+extern const void *save_relpoline_funcs[16];
+
/* The Intel SPEC CTRL MSR base value cache */
extern u64 x86_spec_ctrl_base;

diff --git a/arch/x86/kernel/nospec-branch.c b/arch/x86/kernel/nospec-branch.c
index b3027761442b..2a2cfc2db9d8 100644
--- a/arch/x86/kernel/nospec-branch.c
+++ b/arch/x86/kernel/nospec-branch.c
@@ -1,5 +1,899 @@
#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <linux/sort.h>
+#include <linux/workqueue.h>
+#include <linux/mutex.h>
+#include <linux/memory.h>
+#include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/cpumask.h>
+#include <linux/mm.h>
+#include <linux/debugfs.h>
+#include <linux/jump_label.h>
+#include <linux/rhashtable.h>
#include <asm/nospec-branch.h>
+#include <asm/text-patching.h>
+#include <asm/asm-offsets.h>
+#include <asm/sections.h>
+#include <asm/mmu_context.h>
+
+#define REX_B (0x41)
+#define CMP_REG_IMM32_OPCODE (0x81)
+#define JNZ_REL8_OPCODE (0x75)
+#define JMP_REL8_OPCODE (0xeb)
+#define CALL_REL32_OPCODE (0xe8)
+#define CALL_IND_INS "\xff\xd0"
+
+#ifdef CONFIG_PREEMPT
+#define NO_PREEMPT_PREFIX u8 no_preempt_prefix
+#else
+#define NO_PREEMPT_PREFIX
+#endif
+
+#define RP_READJUST_SECONDS (1)
+#define RP_REENABLE_IN_EPOCH (4)
+#define RP_TARGETS (1)
+#define RP_MAX_DECISIONS (64)
+#define RP_SAMPLE_MSECS (60000)
+
+enum code_state {
+ RELPOLINE_SLOWPATH,
+ RELPOLINE_FASTPATH,
+ RELPOLINE_COND
+};
+
+/*
+ * Reflects the structure of the assembly code. We exclude the compare
+ * opcode which depends on the register.
+ */
+struct relpoline_code {
+ u32 cmp_imm;
+ struct {
+ NO_PREEMPT_PREFIX;
+ u8 opcode;
+ int8_t rel;
+ } __packed jnz;
+ struct {
+ NO_PREEMPT_PREFIX;
+ u8 opcode;
+ int32_t rel;
+ } __packed call;
+ struct {
+ u8 opcode;
+ u8 rel;
+ } __packed jmp_done;
+ struct {
+ NO_PREEMPT_PREFIX;
+ u8 opcode;
+ int32_t rel;
+ } __packed fallback;
+} __packed;
+
+/*
+ * Information for relpolines as it dynamically changes during execution.
+ */
+struct relpoline {
+ struct relpoline_code *code;
+ struct list_head node;
+ struct rhash_head rhash;
+ /*
+ * The following information can be obtained by disassembly, but saving
+ * it here does not consume too much memory and makes the code simpler.
+ */
+ u8 reg : 4;
+ u8 state : 4; /* enum relpoline_state */
+ u8 n_dsts : 7;
+ u8 overflow : 1;
+};
+
+struct relpoline_list {
+ unsigned int num;
+ struct list_head list;
+};
+
+static struct kmem_cache *relpoline_info_cache;
+
+static const struct rhashtable_params relpoline_rht_params = {
+ .automatic_shrinking = true,
+ .key_len = sizeof(uintptr_t),
+ .key_offset = offsetof(struct relpoline, code),
+ .head_offset = offsetof(struct relpoline, rhash),
+};
+
+struct relpoline_transition {
+ u32 dsts[RP_TARGETS];
+ struct relpoline *rp;
+ u8 n_dsts: 7;
+ u8 overflow: 1;
+ u8 prev_n_dsts : 7;
+ u8 reset : 1;
+};

DEFINE_PER_CPU_ALIGNED(struct relpoline_sample[RELPOLINE_SAMPLES_NUM],
relpoline_samples);
+
+enum relpoline_state {
+ RP_STATE_LEARNING,
+ RP_STATE_STABLE,
+ RP_STATE_UNSTABLE,
+ RP_STATE_LAST = RP_STATE_UNSTABLE
+};
+
+#define RP_STATE_NUM (RP_STATE_LAST + 1)
+
+static struct relpoline_list relpoline_list[RP_STATE_NUM];
+
+static struct rhashtable relpolines_rhead;
+
+/*
+ * List of relpolines that are not learning. We do not want misbehaving
+ * relpolines to wreck havoc and prevent us from figuring out the target of
+ * relpolines that have a stable target.
+ */
+static struct mutex relpoline_mutex;
+static struct relpoline_sample *relpoline_samples_copy;
+static struct relpoline_transition cp_changes[RP_MAX_DECISIONS];
+
+struct relpolines_stat {
+ u32 n_dsts[RP_TARGETS];
+ u32 n_overflow;
+};
+
+#ifdef CONFIG_DEBUG_FS
+ /* debugfs file exporting statistics */
+struct dentry *relpoline_dbg_entry;
+#endif
+
+static struct relpolines_stat *relpolines_stat;
+static ktime_t relpolines_sample_time;
+static unsigned int relpolines_resampling_count;
+
+static inline void *kernel_ptr(u32 low_addr)
+{
+ return (void *)(low_addr | 0xffffffff00000000ul);
+}
+
+static inline
+int32_t inline_fastpath_rel(struct relpoline_code *code,
+ const void *dst)
+{
+ return (unsigned long)dst - (unsigned long)code -
+ offsetofend(struct relpoline_code, call);
+}
+
+static inline
+void set_inline_fastpath_rel(struct relpoline_code *code, const void *dst)
+{
+ int32_t rel = inline_fastpath_rel(code, dst);
+
+ text_poke(&code->call.rel, &rel, sizeof(rel));
+}
+
+static inline
+int32_t slowpath_rel(const struct relpoline_code *code, const void *dst)
+{
+ return (unsigned long)dst - (unsigned long)code -
+ offsetofend(struct relpoline_code, fallback);
+}
+
+static void set_slowpath_rel(struct relpoline_code *code, const void *dst)
+{
+ int32_t rel = slowpath_rel(code, dst);
+ u8 opcode = CALL_REL32_OPCODE;
+
+ /*
+ * The opcode will be already ok in most configurations, but not if we
+ * change an indirect branch to direct one, which would happen in
+ * systems which do not use retpolines.
+ */
+ if (code->fallback.opcode != opcode)
+ text_poke(&code->fallback.opcode, &opcode, sizeof(opcode));
+ text_poke(&code->fallback.rel, &rel, sizeof(rel));
+}
+
+static void direct_relpoline(struct relpoline_code *code, enum code_state state)
+{
+ u8 opcode;
+ char rel;
+
+ /*
+ * We set the jump targets based on the previous instruction, since we
+ * have the no-preempt-prefix field which is not always there.
+ */
+ switch (state) {
+ case RELPOLINE_SLOWPATH:
+ opcode = JMP_REL8_OPCODE;
+ rel = offsetof(struct relpoline_code, fallback);
+ break;
+ case RELPOLINE_FASTPATH:
+ opcode = JMP_REL8_OPCODE;
+ rel = offsetof(struct relpoline_code, call);
+ break;
+ case RELPOLINE_COND:
+ opcode = JNZ_REL8_OPCODE;
+ rel = offsetof(struct relpoline_code, fallback);
+ }
+
+ rel -= offsetofend(struct relpoline_code, jnz);
+
+ /*
+ * Our atomicity requirements mean that we can only change one byte and
+ * another one must stay the same.
+ */
+ BUG_ON(opcode != code->jnz.opcode && rel != code->jnz.rel);
+
+ if (code->jnz.opcode != opcode)
+ text_poke(&code->jnz.opcode, &opcode, 1);
+ if (code->jnz.rel != rel)
+ text_poke(&code->jnz.rel, &rel, 1);
+}
+
+static void update_relpoline_info(struct relpoline_transition *transition)
+{
+ struct relpoline *rp = transition->rp;
+
+ /* Update the statistics */
+ if (rp->overflow)
+ relpolines_stat->n_overflow--;
+ else if (rp->n_dsts != 0)
+ relpolines_stat->n_dsts[rp->n_dsts-1]--;
+
+ if (transition->overflow)
+ relpolines_stat->n_overflow++;
+ else if (transition->n_dsts != 0)
+ relpolines_stat->n_dsts[transition->n_dsts-1]++;
+
+
+ /* Store the state */
+ rp->n_dsts = transition->n_dsts;
+ rp->overflow = transition->overflow;
+}
+
+static void make_inline_relpoline(struct relpoline_code *code, u32 dst)
+{
+ /*
+ * Update the compared target; we only compare the low 32-bits,
+ * since the modules and the kernel text always have the same
+ * upper 32-bits.
+ */
+ text_poke(&code->cmp_imm, &dst, sizeof(code->cmp_imm));
+
+ set_inline_fastpath_rel(code, kernel_ptr(dst));
+}
+
+static void clear_learned_relpolines(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ memset(&per_cpu(relpoline_samples, cpu)[0], 0,
+ RELPOLINE_SAMPLES_NUM * sizeof(struct relpoline_sample));
+ }
+}
+
+static void change_relpoline_state(struct relpoline *rp,
+ enum relpoline_state rp_state, bool new_rp)
+{
+ if (!new_rp) {
+ relpoline_list[rp->state].num--;
+ list_del(&rp->node);
+ }
+
+ relpoline_list[rp_state].num++;
+ list_add(&rp->node, &relpoline_list[rp_state].list);
+ rp->state = rp_state;
+}
+
+static int add_relpolines(const struct relpoline_entry *entries,
+ unsigned int n_entries)
+{
+ struct relpoline *rp;
+ int i, r = 0;
+
+ for (i = 0; i < n_entries; i++) {
+ if (init_kernel_text((uintptr_t)entries[i].rip))
+ continue;
+
+ rp = kmem_cache_alloc(relpoline_info_cache,
+ GFP_KERNEL|__GFP_ZERO);
+ if (!rp) {
+ r = -ENOMEM;
+ break;
+ }
+
+ rp->code = entries[i].rip;
+ rp->reg = entries[i].reg;
+
+ r = rhashtable_insert_fast(&relpolines_rhead,
+ &rp->rhash,
+ relpoline_rht_params);
+ if (r < 0) {
+ kmem_cache_free(relpoline_info_cache,
+ rp);
+ break;
+ }
+
+ change_relpoline_state(rp, RP_STATE_LEARNING, true);
+ }
+ if (r < 0)
+ WARN_ONCE(1, "Error loading relpolines\n");
+
+ return r;
+}
+
+static void remove_relpolines(const struct relpoline_entry *entries,
+ unsigned int n_entries)
+{
+ struct relpoline *rp;
+ unsigned int i;
+
+ for (i = 0; i < n_entries; i++) {
+ rp = rhashtable_lookup_fast(&relpolines_rhead, &entries[i].rip,
+ relpoline_rht_params);
+ if (!rp)
+ continue;
+
+ list_del(&rp->node);
+ rhashtable_remove_fast(&relpolines_rhead, &rp->rhash,
+ relpoline_rht_params);
+
+ /* TODO: Relearn everything! */
+ kmem_cache_free(relpoline_info_cache, rp);
+ }
+}
+
+static int
+relpoline_module_notify(struct notifier_block *self, unsigned long val,
+ void *data)
+{
+ struct module *mod = data;
+
+ mutex_lock(&relpoline_mutex);
+
+ switch (val) {
+ case MODULE_STATE_COMING:
+ add_relpolines(mod->relpolines, mod->num_relpolines);
+ break;
+ case MODULE_STATE_GOING:
+ remove_relpolines(mod->relpolines, mod->num_relpolines);
+ }
+
+ mutex_unlock(&relpoline_mutex);
+
+ return 0;
+}
+
+static struct notifier_block relpoline_module_nb = {
+ .notifier_call = relpoline_module_notify,
+ .priority = 1,
+};
+
+static int relpoline_sample_src_cmp_func(const void *l, const void *r)
+{
+ const struct relpoline_sample *s1 = l;
+ const struct relpoline_sample *s2 = r;
+
+ if (s1->src != s2->src)
+ return s1->src - s2->src;
+
+ return s1->dst - s2->dst;
+}
+
+static int relpoline_sample_cnt_cmp_func(const void *l, const void *r)
+{
+ const struct relpoline_sample *s1 = l;
+ const struct relpoline_sample *s2 = r;
+
+ return s2->cnt - s1->cnt;
+}
+
+static int copy_relpolines(void)
+{
+ struct relpoline_sample *p_copy = relpoline_samples_copy;
+ int cpu, i, n_entries;
+
+ for_each_online_cpu(cpu) {
+ struct relpoline_sample *orig;
+
+ orig = per_cpu(relpoline_samples, cpu);
+
+ for (i = 0; i < RELPOLINE_SAMPLES_NUM; i++, orig++) {
+ p_copy->src = READ_ONCE(orig->src);
+
+ /* Do some sanity checks while we are at it */
+ if (p_copy->src == 0)
+ continue;
+
+ if (init_kernel_text((uintptr_t)kernel_ptr(p_copy->src)))
+ continue;
+
+ /* Do the adjusting now to simplify later work */
+ p_copy->src -= offsetofend(struct relpoline_code,
+ fallback);
+
+ /*
+ * Ignore the risk of getting wrong data. We can live
+ * with it (with some performance impact)
+ */
+ p_copy->dst = READ_ONCE(orig->dst);
+ p_copy->cnt = READ_ONCE(orig->cnt);
+ p_copy++;
+ }
+ }
+
+ n_entries = p_copy - relpoline_samples_copy;
+
+ /* Sort by src */
+ sort(relpoline_samples_copy, n_entries, sizeof(*relpoline_samples_copy),
+ relpoline_sample_src_cmp_func, NULL);
+
+ return n_entries;
+}
+
+/*
+ * Goes over the sources starting from src_idx. Returns whether a new decision
+ * that is different than the current behavior has been done.
+ */
+static void relpoline_one_decision(struct relpoline_sample *samples,
+ unsigned int n_entries,
+ struct relpoline_transition *transition)
+{
+ struct relpoline *rp = transition->rp;
+ int i, j, k, n_new = 0, n_dsts = 0;
+
+ if (!transition->reset && rp->n_dsts != 0) {
+ transition->dsts[0] = rp->code->cmp_imm;
+ n_dsts++;
+ }
+
+ /*
+ * Merge destinations with the same source and sum their samples.
+ */
+ for (i = 1, j = 0; i < n_entries; i++) {
+ if (samples[j].dst != samples[i].dst) {
+ /* New target */
+ samples[++j] = samples[i];
+ continue;
+ }
+ /* Known target, add samples */
+ samples[j].cnt += samples[i].cnt;
+ }
+ n_new = j + 1;
+
+ /*
+ * Remove entries that are already set. It is not likely to happen, but
+ * this should not be too expensive. Do not use sort, since it uses
+ * indirect branches, and likely to be slower than silly test.
+ */
+ for (i = 0, j = 0; i < n_new; i++) {
+ for (k = 0; k < n_dsts; k++) {
+ if (samples[i].dst == transition->dsts[k])
+ break;
+ }
+
+ if (k == n_dsts) {
+ /* New entry */
+ samples[j++] = samples[i];
+ }
+ }
+
+ /* Change the new number based on the duplicates we detected */
+ n_new = j;
+
+ sort(samples, n_new, sizeof(*samples), relpoline_sample_cnt_cmp_func,
+ NULL);
+
+ /* Add the new ones */
+ for (i = 0; i < n_new && n_dsts < ARRAY_SIZE(transition->dsts);
+ i++, n_dsts++)
+ transition->dsts[n_dsts] = samples[i].dst;
+
+ if (n_dsts + n_new > 1) {
+ transition->n_dsts = 1;
+ transition->overflow = true;
+ }
+ transition->n_dsts = n_dsts;
+}
+
+static void init_relpoline_transition(struct relpoline_transition *rp_trans,
+ struct relpoline *rp,
+ bool reset)
+{
+ rp_trans->rp = rp;
+ rp_trans->n_dsts = 0;
+ rp_trans->overflow = 0;
+ rp_trans->prev_n_dsts = rp->n_dsts;
+ rp_trans->reset = reset;
+}
+
+/*
+ * Returns the number of decisions.
+ */
+static int make_relpoline_decisions(struct relpoline_transition *transition)
+{
+ unsigned int end, start, n_decisions = 0;
+ int n_copied;
+
+ /*
+ * First we copy the relpolines for a certain hash index to prevent it
+ * from messing up with our data. While we can cope with races that
+ * modify the destination, we need the source rip to be consistent.
+ */
+ n_copied = copy_relpolines();
+
+ for (start = 0, n_decisions = 0;
+ start < n_copied && n_decisions < RP_MAX_DECISIONS; start = end) {
+ struct relpoline_code *code;
+ struct relpoline *rp;
+
+ code = kernel_ptr(relpoline_samples_copy[start].src);
+ rp = rhashtable_lookup_fast(&relpolines_rhead, &code,
+ relpoline_rht_params);
+
+ /* Races might cause the source to be wrong. Live with it */
+ if (!rp) {
+ end = start + 1;
+ continue;
+ }
+
+ /* Find all the relevant entries */
+ for (end = start + 1; end < n_copied; end++) {
+ if (relpoline_samples_copy[start].src !=
+ relpoline_samples_copy[end].src)
+ break;
+ }
+
+ init_relpoline_transition(&transition[n_decisions], rp, false);
+
+ relpoline_one_decision(&relpoline_samples_copy[start],
+ end - start, &transition[n_decisions]);
+
+ n_decisions++;
+ }
+ return n_decisions;
+}
+
+static void change_retpoline_to_indirect(struct relpoline_code *code, u8 reg)
+{
+ u8 copy_len, offset, nop_len, call_len = 2;
+ u8 patched[sizeof(code->fallback)];
+
+ copy_len = offsetofend(typeof(*code), fallback) -
+ offsetof(typeof(*code), fallback.opcode);
+
+ /* Save a byte for no-preempt prefix */
+ if (IS_ENABLED(CONFIG_PREEMPT))
+ call_len++;
+
+ /* Save a byte for call */
+ if (reg >= ARCH_R8)
+ call_len++;
+
+ /* We leave the no-preempt prefix unmodified, so we ignore it */
+ nop_len = copy_len - call_len;
+ memcpy(patched, ideal_nops[nop_len], nop_len);
+
+ offset = nop_len;
+ if (IS_ENABLED(CONFIG_PREEMPT))
+ patched[offset++] = PREEMPT_DISABLE_PREFIX;
+ if (reg >= ARCH_R8)
+ patched[offset++] = REX_B;
+
+ memcpy(&patched[offset], CALL_IND_INS, sizeof(CALL_IND_INS) - 1);
+ patched[copy_len - 1] |= reg & 7;
+
+ text_poke(&code->fallback.opcode, patched, copy_len);
+}
+
+static void update_slowpath(struct relpoline_transition *transition)
+{
+ struct relpoline *rp = transition->rp;
+ struct relpoline_code *code = rp->code;
+ u8 reg = rp->reg;
+
+ if (transition->overflow) {
+ if (static_cpu_has(X86_FEATURE_RETPOLINE))
+ set_slowpath_rel(code, indirect_thunks[reg]);
+ else
+ change_retpoline_to_indirect(code, reg);
+ } else
+ set_slowpath_rel(code, save_relpoline_funcs[reg]);
+}
+
+static void update_relpolines(struct relpoline_transition *transitions, int n)
+{
+ struct relpoline_transition *start, *cur, *end;
+
+ mutex_lock(&text_mutex);
+ start = transitions;
+ end = transitions + n;
+
+ for (cur = start; cur != end; cur++) {
+ update_relpoline_info(cur);
+
+ direct_relpoline(cur->rp->code, RELPOLINE_SLOWPATH);
+ }
+
+ /*
+ * Ensure all cores no longer run the disabled relpolines. Since
+ * preemption is disabled between the relpoline compare and call, this
+ * would mean they are all safe.
+ */
+ synchronize_sched();
+
+ for (cur = start; cur != end; cur++) {
+ /*
+ * During the transition the fast-path behaves as a fallback.
+ */
+ set_inline_fastpath_rel(cur->rp->code,
+ indirect_thunks[cur->rp->reg]);
+
+ /*
+ * Move everything to the fast-path.
+ */
+ direct_relpoline(cur->rp->code, RELPOLINE_FASTPATH);
+ }
+
+ synchronize_sched();
+
+ /* Now we can quietly update the slow-path */
+ for (cur = start; cur != end; cur++) {
+ update_slowpath(cur);
+ direct_relpoline(cur->rp->code, RELPOLINE_SLOWPATH);
+ }
+
+ synchronize_sched();
+
+ /* Update the compared value and the call targets */
+ for (cur = start; cur != end; cur++) {
+ struct relpoline *rp = cur->rp;
+ enum relpoline_state state;
+
+ /*
+ * If there are destinations, enable; otherwise, keep disabled.
+ */
+ if (cur->n_dsts == 0) {
+ state = RP_STATE_LEARNING;
+ } else {
+ make_inline_relpoline(rp->code, cur->dsts[0]);
+ direct_relpoline(rp->code, RELPOLINE_COND);
+
+ if (cur->overflow || cur->n_dsts > 1)
+ state = RP_STATE_UNSTABLE;
+ else
+ state = RP_STATE_STABLE;
+ }
+
+ change_relpoline_state(rp, state, false);
+ }
+ mutex_unlock(&text_mutex);
+}
+
+static void relearn_relpolines(unsigned int n)
+{
+ struct list_head *list = &relpoline_list[RP_STATE_UNSTABLE].list;
+ struct relpoline *rp, *tmp;
+ unsigned int done = 0;
+
+ while (!list_empty(list) && done < n) {
+ unsigned int i = 0;
+
+ list_for_each_entry_safe(rp, tmp, list, node) {
+ init_relpoline_transition(&cp_changes[i], rp, true);
+
+ i++;
+ done++;
+
+ if (i >= RP_MAX_DECISIONS || done >= n)
+ break;
+ }
+ update_relpolines(cp_changes, i);
+ }
+}
+
+static void relpoline_work(struct work_struct *work);
+
+static DECLARE_DELAYED_WORK(c_work, relpoline_work);
+
+static void relpolines_autolearn(void)
+{
+ if (relpolines_resampling_count != 0) {
+ unsigned int resampled;
+
+ resampled = min_t(unsigned int, relpolines_resampling_count,
+ RP_REENABLE_IN_EPOCH);
+
+ relearn_relpolines(resampled);
+
+ relpolines_resampling_count -= resampled;
+
+ /*
+ * We defer starting the next start-time from the end of the
+ * current one.
+ */
+ relpolines_sample_time = ktime_get();
+ return;
+ }
+
+ if (ktime_ms_delta(ktime_get(), relpolines_sample_time) <
+ RP_SAMPLE_MSECS)
+ return;
+
+ /* Start another training period */
+ relpolines_resampling_count = relpoline_list[RP_STATE_UNSTABLE].num;
+}
+
+
+static void relpoline_work(struct work_struct *work)
+{
+ int n;
+
+ mutex_lock(&relpoline_mutex);
+ cpus_read_lock();
+
+ n = make_relpoline_decisions(cp_changes);
+
+ if (n != 0) {
+ update_relpolines(cp_changes, n);
+ clear_learned_relpolines();
+ } else
+ relpolines_autolearn();
+
+ cpus_read_unlock();
+ mutex_unlock(&relpoline_mutex);
+
+ schedule_delayed_work(&c_work, HZ * RP_READJUST_SECONDS);
+}
+
+static int relpoline_debug_show(struct seq_file *f, void *offset)
+{
+ int i;
+
+ seq_puts(f, "Destinations distribution\n");
+ for (i = 0; i < RP_TARGETS; i++)
+ seq_printf(f, "%d, %u\n", i+1, relpolines_stat->n_dsts[i]);
+ seq_printf(f, "%d, %u\n", i+1, relpolines_stat->n_overflow);
+ return 0;
+}
+
+static int relpoline_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, relpoline_debug_show, inode->i_private);
+}
+
+static ssize_t relpoline_debug_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ u8 kbuf[40] = {0};
+ size_t len;
+
+ len = min(count, sizeof(kbuf) - 1);
+
+ if (len == 0)
+ return -EINVAL;
+
+ if (copy_from_user(kbuf, ubuf, len))
+ return -EFAULT;
+
+ kbuf[len] = '\0';
+ if (kbuf[len - 1] == '\n')
+ kbuf[len - 1] = '\0';
+
+ if (strcmp(kbuf, "relearn") == 0) {
+ mutex_lock(&relpoline_mutex);
+ relearn_relpolines(UINT_MAX);
+ mutex_unlock(&relpoline_mutex);
+ } else
+ return -EINVAL;
+
+ return count;
+}
+
+static const struct file_operations relpoline_debug_fops = {
+ .owner = THIS_MODULE,
+ .open = relpoline_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+ .write = relpoline_debug_write,
+};
+
+static int __init create_relpoline_debugfs(void)
+{
+ int r;
+
+ relpoline_dbg_entry = debugfs_create_file("relpolines", 0444, NULL,
+ NULL, &relpoline_debug_fops);
+ if (IS_ERR(relpoline_dbg_entry)) {
+ r = PTR_ERR(relpoline_dbg_entry);
+ pr_err("failed to create debugfs entry, error: %d", r);
+ return r;
+ }
+
+ relpolines_stat = kzalloc(sizeof(*(relpolines_stat)), GFP_KERNEL);
+
+ return 0;
+}
+
+static void release_relpoline_debugfs(void)
+{
+ kfree(relpolines_stat);
+ relpolines_stat = NULL;
+}
+
+static int __init relpoline_init(void)
+{
+ bool relpolines_rhead_initialized = false;
+ int i, r;
+
+ if (IS_ENABLED(CONFIG_DEBUG_FS)) {
+ r = create_relpoline_debugfs();
+ if (r)
+ goto error;
+ }
+
+ mutex_init(&relpoline_mutex);
+
+ r = rhashtable_init(&relpolines_rhead, &relpoline_rht_params);
+ if (r)
+ goto error;
+
+ relpolines_rhead_initialized = true;
+
+ relpoline_info_cache = kmem_cache_create("relpoline_cache",
+ sizeof(struct relpoline), 0, 0, NULL);
+ if (!relpoline_info_cache)
+ goto error;
+
+ relpolines_sample_time = ktime_get();
+
+ /*
+ * TODO: this array needs to be reallocated if CPUs are hotplugged
+ */
+ relpoline_samples_copy = kmalloc_array(num_online_cpus() * RELPOLINE_SAMPLES_NUM,
+ sizeof(*relpoline_samples_copy),
+ GFP_KERNEL);
+
+ if (relpoline_samples_copy == NULL) {
+ r = -ENOMEM;
+ WARN(1, "error allocating relpoline memory");
+ goto error;
+ }
+
+ r = register_module_notifier(&relpoline_module_nb);
+ if (r) {
+ WARN(1, "error initializing relpolines");
+ goto error;
+ }
+
+ for (i = 0; i < RP_STATE_NUM; i++) {
+ INIT_LIST_HEAD(&relpoline_list[i].list);
+ relpoline_list[i].num = 0;
+ }
+
+ /*
+ * Ignoring errors here, only part of the relpolines would be enabled.
+ */
+ add_relpolines(__relpolines, __relpolines_end - __relpolines);
+
+ schedule_delayed_work(&c_work, HZ * RP_READJUST_SECONDS * 10);
+
+ return 0;
+error:
+ kfree(relpoline_samples_copy);
+ relpoline_samples_copy = NULL;
+ unregister_module_notifier(&relpoline_module_nb);
+
+ kmem_cache_destroy(relpoline_info_cache);
+ relpoline_info_cache = NULL;
+
+ if (relpolines_rhead_initialized)
+ rhashtable_destroy(&relpolines_rhead);
+ relpolines_rhead_initialized = false;
+
+ release_relpoline_debugfs();
+ return r;
+}
+late_initcall(relpoline_init);
--
2.17.1


2018-10-18 00:57:09

by Nadav Amit

[permalink] [raw]
Subject: [RFC PATCH 3/5] x86: interface for accessing indirect branch locations

Adding a C interface to access the locations of indirect branches. To be
used for dynamic patching.

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/nospec-branch.h | 1 -
arch/x86/include/asm/sections.h | 2 ++
include/linux/module.h | 5 +++++
kernel/module.c | 8 ++++++++
4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index bd2d3a41e88c..a4496c141143 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -406,7 +406,6 @@ struct relpoline_entry {
u8 reg;
} __packed;

-
/* The Intel SPEC CTRL MSR base value cache */
extern u64 x86_spec_ctrl_base;

diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index 8ea1cfdbeabc..e8dd9c40ecd5 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -4,6 +4,7 @@

#include <asm-generic/sections.h>
#include <asm/extable.h>
+#include <asm/nospec-branch.h>

extern char __brk_base[], __brk_limit[];
extern struct exception_table_entry __stop___ex_table[];
@@ -11,6 +12,7 @@ extern char __end_rodata_aligned[];

#if defined(CONFIG_X86_64)
extern char __end_rodata_hpage_align[];
+extern const struct relpoline_entry __relpolines[], __relpolines_end[];
#endif

#endif /* _ASM_X86_SECTIONS_H */
diff --git a/include/linux/module.h b/include/linux/module.h
index f807f15bebbe..29ff390f8339 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -39,6 +39,7 @@ struct modversion_info {

struct module;
struct exception_table_entry;
+struct relpoline_entry;

struct module_kobject {
struct kobject kobj;
@@ -387,6 +388,10 @@ struct module {
unsigned int num_exentries;
struct exception_table_entry *extable;

+ /* Cachepolines */
+ unsigned int num_relpolines;
+ struct relpoline_entry *relpolines;
+
/* Startup function. */
int (*init)(void);

diff --git a/kernel/module.c b/kernel/module.c
index 49a405891587..e34fc28875bd 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -64,6 +64,7 @@
#include <linux/bsearch.h>
#include <linux/dynamic_debug.h>
#include <linux/audit.h>
+#include <asm/nospec-branch.h>
#include <uapi/linux/module.h>
#include "module-internal.h"

@@ -256,6 +257,7 @@ static void mod_update_bounds(struct module *mod)
#ifdef CONFIG_KGDB_KDB
struct list_head *kdb_modules = &modules; /* kdb needs the list of modules */
#endif /* CONFIG_KGDB_KDB */
+struct list_head *cachepoline_modules = &modules;

static void module_assert_mutex(void)
{
@@ -3125,6 +3127,12 @@ static int find_module_sections(struct module *mod, struct load_info *info)
mod->extable = section_objs(info, "__ex_table",
sizeof(*mod->extable), &mod->num_exentries);

+#ifdef CONFIG_X86_64
+ mod->relpolines = section_objs(info, "__relpolines",
+ sizeof(*mod->relpolines),
+ &mod->num_relpolines);
+#endif
+
if (section_addr(info, "__obsparm"))
pr_warn("%s: Ignoring obsolete parameters\n", mod->name);

--
2.17.1


2018-10-18 00:57:17

by Nadav Amit

[permalink] [raw]
Subject: [RFC PATCH 2/5] x86: patch indirect branch promotion

To perform indirect branch promotion, we need to find all the locations.
Retpolines make it relatively easy to find these branches, by looking at
the assembly and finding calls to the indirect thunks.

An assembly macro named CALL is used to catch all assembly calls, find
these the use indirect thunks and patch them to hold the code that is
needed for indirect branch promotion.

The build-system is slightly broken with this patch, as changes to
nospec-branch.h should trigger a full kernel rebuild, which currently
it does not.

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/nospec-branch.h | 119 +++++++++++++++++++++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/asm-offsets.c | 6 ++
arch/x86/kernel/macros.S | 1 +
arch/x86/kernel/nospec-branch.c | 5 ++
arch/x86/kernel/vmlinux.lds.S | 7 ++
arch/x86/lib/retpoline.S | 75 +++++++++++++++++
7 files changed, 214 insertions(+)
create mode 100644 arch/x86/kernel/nospec-branch.c

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 0267611eb247..bd2d3a41e88c 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -7,6 +7,27 @@
#include <asm/alternative-asm.h>
#include <asm/cpufeatures.h>
#include <asm/msr-index.h>
+#include <asm/percpu.h>
+
+/*
+ * Defining registers with the architectural order
+ */
+#define ARCH_RAX 0
+#define ARCH_RCX 1
+#define ARCH_RDX 2
+#define ARCH_RBX 3
+#define ARCH_RSP 4
+#define ARCH_RBP 5
+#define ARCH_RSI 6
+#define ARCH_RDI 7
+#define ARCH_R8 8
+#define ARCH_R9 9
+#define ARCH_R10 10
+#define ARCH_R11 11
+#define ARCH_R12 12
+#define ARCH_R13 13
+#define ARCH_R14 14
+#define ARCH_R15 15

/*
* Fill the CPU return stack buffer.
@@ -28,6 +49,9 @@
#define RSB_CLEAR_LOOPS 32 /* To forcibly overwrite all entries */
#define RSB_FILL_LOOPS 16 /* To avoid underflow */

+#define RELPOLINE_SAMPLES_NUM (1 << 8)
+#define RELPOLINE_SAMPLES_MASK (RELPOLINE_SAMPLES_NUM - 1)
+
/*
* Google experimented with loop-unrolling and this turned out to be
* the optimal version — two calls, each with their own speculation
@@ -160,6 +184,81 @@
#endif
.endm

+/*
+ * This macro performs the actual relpoline work. The machine-code is hand
+ * coded to avoid assembler optimizations. This code is heavily patched in
+ * runtime to make it do what it should.
+ */
+.macro relpoline_call reg:req
+ # cmp instruction
+ get_reg_num reg=\reg
+.if reg_num == ARCH_RAX
+ .byte 0x48
+ .byte 0x3d
+.else
+.if reg_num >= ARCH_R8
+ .byte 0x49
+.else
+ .byte 0x48
+.endif
+ .byte 0x81
+ .byte 0xf8 | (reg_num & 7) # modrm
+.endif
+1:
+ .long 0
+
+ .section .relpolines,"a"
+ _ASM_PTR 1b
+ .byte reg_num
+ .previous
+
+ # cachepoling-using code
+
+ # jnz 4f, patched to jmp while the target is changed
+ preempt_disable_prefix
+ .byte 0x75, 4f - 2f
+2:
+ # call retpoline
+ preempt_disable_prefix
+ .byte 0xe8
+ .long __x86_indirect_thunk_\reg - 3f
+3:
+ # jmp 5f
+ .byte 0xeb, 5f - 4f
+4:
+ # retpoline space
+ ANNOTATE_NOSPEC_ALTERNATIVE
+ preempt_disable_prefix
+ .byte 0xe8
+ .long save_relpoline_\reg - 5f
+5:
+.endm
+
+#define ARCH_REG_NAMES rax,rcx,rdx,rbx,rsp,rbp,rsi,rdi,r8,r9,r10,r11,r12,r13,r14,r15
+
+.macro get_reg_num reg:req
+ i = 0
+.irp reg_it,ARCH_REG_NAMES
+ .ifc "\reg", "\reg_it"
+ reg_num=i
+ .endif
+ i = i+1
+.endr
+.endm
+
+.macro call v:vararg
+ retpoline = 0
+.irp reg_it,ARCH_REG_NAMES
+.ifc "\v", "__x86_indirect_thunk_\reg_it"
+ relpoline_call reg=\reg_it
+ retpoline = 1
+.endif
+.endr
+.if retpoline == 0
+ {disp8} call \v
+.endif
+.endm
+
#else /* __ASSEMBLY__ */

#define ANNOTATE_NOSPEC_ALTERNATIVE \
@@ -288,6 +387,26 @@ static inline void indirect_branch_prediction_barrier(void)
alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB);
}

+/* Data structure that is used during the learning stage */
+struct relpoline_sample {
+ u32 src;
+ u32 dst;
+ u32 cnt;
+ u32 padding;
+} __packed;
+
+DECLARE_PER_CPU_ALIGNED(struct relpoline_sample[RELPOLINE_SAMPLES_NUM],
+ relpoline_samples);
+
+/*
+ * Information for relpolines as it is saved in the source.
+ */
+struct relpoline_entry {
+ void *rip;
+ u8 reg;
+} __packed;
+
+
/* The Intel SPEC CTRL MSR base value cache */
extern u64 x86_spec_ctrl_base;

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8824d01c0c35..8a50d304093a 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -138,6 +138,7 @@ obj-$(CONFIG_X86_INTEL_UMIP) += umip.o
obj-$(CONFIG_UNWINDER_ORC) += unwind_orc.o
obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o
+obj-$(CONFIG_RETPOLINE) += nospec-branch.o

###
# 64 bit specific files
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 72adf6c335dc..2db2628c79cd 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -18,6 +18,7 @@
#include <asm/bootparam.h>
#include <asm/suspend.h>
#include <asm/tlbflush.h>
+#include <asm/nospec-branch.h>

#ifdef CONFIG_XEN
#include <xen/interface/xen.h>
@@ -104,4 +105,9 @@ void common(void) {
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+ /* Relpolines */
+ OFFSET(RELPOLINE_SAMPLE_src, relpoline_sample, src);
+ OFFSET(RELPOLINE_SAMPLE_dst, relpoline_sample, dst);
+ OFFSET(RELPOLINE_SAMPLE_cnt, relpoline_sample, cnt);
}
diff --git a/arch/x86/kernel/macros.S b/arch/x86/kernel/macros.S
index 161c95059044..3d79f3d62d20 100644
--- a/arch/x86/kernel/macros.S
+++ b/arch/x86/kernel/macros.S
@@ -14,3 +14,4 @@
#include <asm/asm.h>
#include <asm/cpufeature.h>
#include <asm/jump_label.h>
+#include <asm/nospec-branch.h>
diff --git a/arch/x86/kernel/nospec-branch.c b/arch/x86/kernel/nospec-branch.c
new file mode 100644
index 000000000000..b3027761442b
--- /dev/null
+++ b/arch/x86/kernel/nospec-branch.c
@@ -0,0 +1,5 @@
+#include <linux/percpu.h>
+#include <asm/nospec-branch.h>
+
+DEFINE_PER_CPU_ALIGNED(struct relpoline_sample[RELPOLINE_SAMPLES_NUM],
+ relpoline_samples);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 0d618ee634ac..c62735d06d58 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -355,6 +355,13 @@ SECTIONS
.data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) {
NOSAVE_DATA
}
+
+ . = ALIGN(8);
+ .relpolines : AT(ADDR(.relpolines) - LOAD_OFFSET) {
+ __relpolines = .;
+ *(.relpolines)
+ __relpolines_end = .;
+ }
#endif

/* BSS */
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index c909961e678a..f30521c180db 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -7,6 +7,8 @@
#include <asm/alternative-asm.h>
#include <asm/export.h>
#include <asm/nospec-branch.h>
+#include <asm/asm-offsets.h>
+#include <asm/frame.h>

.macro THUNK reg
.section .text.__x86.indirect_thunk
@@ -45,4 +47,77 @@ GENERATE_THUNK(r12)
GENERATE_THUNK(r13)
GENERATE_THUNK(r14)
GENERATE_THUNK(r15)
+
+.macro save_relpoline reg:req
+ENTRY(save_relpoline_\reg\())
+ pushq %rdi
+ pushq %rsi
+ pushq %rcx
+
+ /* First load the destination, for the case rsi is the destination */
+.if "\reg" != "rdi"
+ mov %\reg, %rdi
+.endif
+ mov 24(%rsp), %rsi
+
+ /* Compute the xor as an index in the table */
+ mov %rsi, %rcx
+ xor %rdi, %rcx
+ and $RELPOLINE_SAMPLES_MASK, %ecx
+
+ /* Entry size is 16-bit */
+ shl $4, %ecx
+
+ movl %esi, PER_CPU_VAR(relpoline_samples + RELPOLINE_SAMPLE_src)(%ecx)
+ movl %edi, PER_CPU_VAR(relpoline_samples + RELPOLINE_SAMPLE_dst)(%ecx)
+ incl PER_CPU_VAR(relpoline_samples + RELPOLINE_SAMPLE_cnt)(%ecx)
+
+#ifdef CACHEPOLINE_DEBUG
+ incl PER_CPU_VAR(relpoline_misses)
+#endif
+ popq %rcx
+ popq %rsi
+ popq %rdi
+ ANNOTATE_NOSPEC_ALTERNATIVE
+ ALTERNATIVE __stringify(ANNOTATE_RETPOLINE_SAFE; jmp *%\reg\()),\
+ "jmp __x86_indirect_thunk_\reg", \
+ X86_FEATURE_RETPOLINE
+
+ENDPROC(save_relpoline_\reg\())
+_ASM_NOKPROBE(save_relpoline_\reg\())
+EXPORT_SYMBOL(save_relpoline_\reg\())
+.endm
+
+.irp reg,ARCH_REG_NAMES
+.if \reg != "rsp"
+save_relpoline reg=\reg
+.endif
+.endr
+
+/*
+ * List of indirect thunks
+ */
+.pushsection .rodata
+.global indirect_thunks
+indirect_thunks:
+.irp reg,ARCH_REG_NAMES
+.if \reg != "rsp"
+.quad __x86_indirect_thunk_\reg
+.else
+.quad 0
+.endif
+.endr
+
+.global save_relpoline_funcs
+save_relpoline_funcs:
+.irp reg,ARCH_REG_NAMES
+.if \reg != "rsp"
+.quad save_relpoline_\reg
+.else
+.quad 0
+.endif
+.endr
+.popsection
+
+
#endif
--
2.17.1


2018-10-18 00:58:13

by Nadav Amit

[permalink] [raw]
Subject: [RFC PATCH 1/5] x86: introduce preemption disable prefix

It is sometimes beneficial to prevent preemption for very few
instructions, or prevent preemption for some instructions that precede
a branch (this latter case will be introduced in the next patches).

To provide such functionality on x86-64, we use an empty REX-prefix
(opcode 0x40) as an indication that preemption is disabled for the
following instruction.

It is expected that this opcode is not in common use.

Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/entry/entry_64.S | 10 ++++++++++
arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
2 files changed, 22 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index cb8a5893fd33..31d59aad496e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -643,6 +643,16 @@ retint_kernel:
jnc 1f
0: cmpl $0, PER_CPU_VAR(__preempt_count)
jnz 1f
+
+ /*
+ * Allow to use hint to prevent preemption on a certain instruction.
+ * Consider an instruction with the first byte having REX prefix
+ * without any bits set as an indication for preemption disabled.
+ */
+ movq RIP(%rsp), %rax
+ cmpb $PREEMPT_DISABLE_PREFIX, (%rax)
+ jz 1f
+
call preempt_schedule_irq
jmp 0b
1:
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 80dc14422495..0267611eb247 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -52,6 +52,12 @@
jnz 771b; \
add $(BITS_PER_LONG/8) * nr, sp;

+/*
+ * An empty REX-prefix is an indication that preemption should not take place on
+ * this instruction.
+ */
+#define PREEMPT_DISABLE_PREFIX (0x40)
+
#ifdef __ASSEMBLY__

/*
@@ -148,6 +154,12 @@
#endif
.endm

+.macro preempt_disable_prefix
+#ifdef CONFIG_PREEMPT
+ .byte PREEMPT_DISABLE_PREFIX
+#endif
+.endm
+
#else /* __ASSEMBLY__ */

#define ANNOTATE_NOSPEC_ALTERNATIVE \
--
2.17.1


2018-10-18 01:23:50

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix


> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>
> It is sometimes beneficial to prevent preemption for very few
> instructions, or prevent preemption for some instructions that precede
> a branch (this latter case will be introduced in the next patches).
>
> To provide such functionality on x86-64, we use an empty REX-prefix
> (opcode 0x40) as an indication that preemption is disabled for the
> following instruction.

Nifty!

That being said, I think you have a few bugs. First, you can’t just ignore a rescheduling interrupt, as you introduce unbounded latency when this happens — you’re effectively emulating preempt_enable_no_resched(), which is not a drop-in replacement for preempt_enable(). To fix this, you may need to jump to a slow-path trampoline that calls schedule() at the end or consider rewinding one instruction instead. Or use TF, which is only a little bit terrifying...

You also aren’t accounting for the case where you get an exception that is, in turn, preempted.



>
> It is expected that this opcode is not in common use.
>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> arch/x86/entry/entry_64.S | 10 ++++++++++
> arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
> 2 files changed, 22 insertions(+)
>
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index cb8a5893fd33..31d59aad496e 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -643,6 +643,16 @@ retint_kernel:
> jnc 1f
> 0: cmpl $0, PER_CPU_VAR(__preempt_count)
> jnz 1f
> +
> + /*
> + * Allow to use hint to prevent preemption on a certain instruction.
> + * Consider an instruction with the first byte having REX prefix
> + * without any bits set as an indication for preemption disabled.
> + */
> + movq RIP(%rsp), %rax
> + cmpb $PREEMPT_DISABLE_PREFIX, (%rax)
> + jz 1f
> +
> call preempt_schedule_irq
> jmp 0b
> 1:
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index 80dc14422495..0267611eb247 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -52,6 +52,12 @@
> jnz 771b; \
> add $(BITS_PER_LONG/8) * nr, sp;
>
> +/*
> + * An empty REX-prefix is an indication that preemption should not take place on
> + * this instruction.
> + */
> +#define PREEMPT_DISABLE_PREFIX (0x40)
> +
> #ifdef __ASSEMBLY__
>
> /*
> @@ -148,6 +154,12 @@
> #endif
> .endm
>
> +.macro preempt_disable_prefix
> +#ifdef CONFIG_PREEMPT
> + .byte PREEMPT_DISABLE_PREFIX
> +#endif
> +.endm
> +
> #else /* __ASSEMBLY__ */
>
> #define ANNOTATE_NOSPEC_ALTERNATIVE \
> --
> 2.17.1
>

2018-10-18 03:15:00

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

at 6:22 PM, Andy Lutomirski <[email protected]> wrote:

>
>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>
>> It is sometimes beneficial to prevent preemption for very few
>> instructions, or prevent preemption for some instructions that precede
>> a branch (this latter case will be introduced in the next patches).
>>
>> To provide such functionality on x86-64, we use an empty REX-prefix
>> (opcode 0x40) as an indication that preemption is disabled for the
>> following instruction.
>
> Nifty!
>
> That being said, I think you have a few bugs. First, you can’t just ignore
> a rescheduling interrupt, as you introduce unbounded latency when this
> happens — you’re effectively emulating preempt_enable_no_resched(), which
> is not a drop-in replacement for preempt_enable(). To fix this, you may
> need to jump to a slow-path trampoline that calls schedule() at the end or
> consider rewinding one instruction instead. Or use TF, which is only a
> little bit terrifying…

Yes, I didn’t pay enough attention here. For my use-case, I think that the
easiest solution would be to make synchronize_sched() ignore preemptions
that happen while the prefix is detected. It would slightly change the
meaning of the prefix.

> You also aren’t accounting for the case where you get an exception that
> is, in turn, preempted.

Hmm.. Can you give me an example for such an exception in my use-case? I
cannot think of an exception that might be preempted (assuming #BP, #MC
cannot be preempted).

I agree that for super-general case this might be inappropriate.

2018-10-18 03:28:01

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

at 8:11 PM, Nadav Amit <[email protected]> wrote:

> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>>
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>>
>> Nifty!
>>
>> That being said, I think you have a few bugs. First, you can’t just ignore
>> a rescheduling interrupt, as you introduce unbounded latency when this
>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>> need to jump to a slow-path trampoline that calls schedule() at the end or
>> consider rewinding one instruction instead. Or use TF, which is only a
>> little bit terrifying…
>
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.

Ignore this nonsense that I wrote. I’ll try to come up with a decent
solution.

>> You also aren’t accounting for the case where you get an exception that
>> is, in turn, preempted.
>
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).
>
> I agree that for super-general case this might be inappropriate.


2018-10-18 03:52:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
>
> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>
> >
> >> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
> >>
> >> It is sometimes beneficial to prevent preemption for very few
> >> instructions, or prevent preemption for some instructions that precede
> >> a branch (this latter case will be introduced in the next patches).
> >>
> >> To provide such functionality on x86-64, we use an empty REX-prefix
> >> (opcode 0x40) as an indication that preemption is disabled for the
> >> following instruction.
> >
> > Nifty!
> >
> > That being said, I think you have a few bugs. First, you can’t just ignore
> > a rescheduling interrupt, as you introduce unbounded latency when this
> > happens — you’re effectively emulating preempt_enable_no_resched(), which
> > is not a drop-in replacement for preempt_enable(). To fix this, you may
> > need to jump to a slow-path trampoline that calls schedule() at the end or
> > consider rewinding one instruction instead. Or use TF, which is only a
> > little bit terrifying…
>
> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> easiest solution would be to make synchronize_sched() ignore preemptions
> that happen while the prefix is detected. It would slightly change the
> meaning of the prefix.
>
> > You also aren’t accounting for the case where you get an exception that
> > is, in turn, preempted.
>
> Hmm.. Can you give me an example for such an exception in my use-case? I
> cannot think of an exception that might be preempted (assuming #BP, #MC
> cannot be preempted).
>

Look for cond_local_irq_enable().

--Andy

2018-10-18 07:55:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote:
>
> > On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
> >
> > It is sometimes beneficial to prevent preemption for very few
> > instructions, or prevent preemption for some instructions that precede
> > a branch (this latter case will be introduced in the next patches).
> >
> > To provide such functionality on x86-64, we use an empty REX-prefix
> > (opcode 0x40) as an indication that preemption is disabled for the
> > following instruction.
>
> Nifty!
>
> That being said, I think you have a few bugs.

> First, you can’t just ignore a rescheduling interrupt, as you
> introduce unbounded latency when this happens — you’re effectively
> emulating preempt_enable_no_resched(), which is not a drop-in
> replacement for preempt_enable().

> To fix this, you may need to jump to a slow-path trampoline that calls
> schedule() at the end or consider rewinding one instruction instead.
> Or use TF, which is only a little bit terrifying...

At which point we're very close to in-kernel rseq.

2018-10-18 16:48:43

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

at 8:51 PM, Andy Lutomirski <[email protected]> wrote:

> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>>
>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>>
>>>> It is sometimes beneficial to prevent preemption for very few
>>>> instructions, or prevent preemption for some instructions that precede
>>>> a branch (this latter case will be introduced in the next patches).
>>>>
>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>> following instruction.
>>>
>>> Nifty!
>>>
>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>> consider rewinding one instruction instead. Or use TF, which is only a
>>> little bit terrifying…
>>
>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>> easiest solution would be to make synchronize_sched() ignore preemptions
>> that happen while the prefix is detected. It would slightly change the
>> meaning of the prefix.

So thinking about it further, rewinding the instruction seems the easiest
and most robust solution. I’ll do it.

>>> You also aren’t accounting for the case where you get an exception that
>>> is, in turn, preempted.
>>
>> Hmm.. Can you give me an example for such an exception in my use-case? I
>> cannot think of an exception that might be preempted (assuming #BP, #MC
>> cannot be preempted).
>
> Look for cond_local_irq_enable().

I looked at it. Yet, I still don’t see how exceptions might happen in my
use-case, but having said that - this can be fixed too.

To be frank, I paid relatively little attention to this subject. Any
feedback about the other parts and especially on the high-level approach? Is
modifying the retpolines in the proposed manner (assembly macros)
acceptable?

Thanks,
Nadav

2018-10-18 17:01:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix



> On Oct 18, 2018, at 9:47 AM, Nadav Amit <[email protected]> wrote:
>
> at 8:51 PM, Andy Lutomirski <[email protected]> wrote:
>
>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
>>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>>>
>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>>>
>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>> instructions, or prevent preemption for some instructions that precede
>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>
>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>> following instruction.
>>>>
>>>> Nifty!
>>>>
>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>> little bit terrifying…
>>>
>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>> that happen while the prefix is detected. It would slightly change the
>>> meaning of the prefix.
>
> So thinking about it further, rewinding the instruction seems the easiest
> and most robust solution. I’ll do it.
>
>>>> You also aren’t accounting for the case where you get an exception that
>>>> is, in turn, preempted.
>>>
>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>> cannot be preempted).
>>
>> Look for cond_local_irq_enable().
>
> I looked at it. Yet, I still don’t see how exceptions might happen in my
> use-case, but having said that - this can be fixed too.

I’m not totally certain there’s a case that matters. But it’s worth checking

>
> To be frank, I paid relatively little attention to this subject. Any
> feedback about the other parts and especially on the high-level approach? Is
> modifying the retpolines in the proposed manner (assembly macros)
> acceptable?
>

It’s certainly a neat idea, and it could be a real speedup.

> Thanks,
> Nadav

2018-10-18 17:26:21

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

at 10:00 AM, Andy Lutomirski <[email protected]> wrote:

>
>
>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <[email protected]> wrote:
>>
>> at 8:51 PM, Andy Lutomirski <[email protected]> wrote:
>>
>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
>>>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>>>>
>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>>>>
>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>
>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>> following instruction.
>>>>>
>>>>> Nifty!
>>>>>
>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>> little bit terrifying…
>>>>
>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>> that happen while the prefix is detected. It would slightly change the
>>>> meaning of the prefix.
>>
>> So thinking about it further, rewinding the instruction seems the easiest
>> and most robust solution. I’ll do it.
>>
>>>>> You also aren’t accounting for the case where you get an exception that
>>>>> is, in turn, preempted.
>>>>
>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>> cannot be preempted).
>>>
>>> Look for cond_local_irq_enable().
>>
>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>> use-case, but having said that - this can be fixed too.
>
> I’m not totally certain there’s a case that matters. But it’s worth checking
>
>> To be frank, I paid relatively little attention to this subject. Any
>> feedback about the other parts and especially on the high-level approach? Is
>> modifying the retpolines in the proposed manner (assembly macros)
>> acceptable?
>
> It’s certainly a neat idea, and it could be a real speedup.

Great. So I’ll try to shape things up, and I still wait for other comments
(from others).

I’ll just mention two more patches I need to cleanup (I know I still owe you some
work, so obviously it will be done later):

1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
BPF filters on the Redis server process that are invoked on each
system-call. Invoking each one requires an indirect branch. The patch keeps
a per-process kernel code-page that holds trampolines for these functions.

2. Binary-search for system-calls. Use the per-process kernel code-page also
to hold multiple trampolines for the 16 common system calls of a certain
process. The patch uses an indirection table and a binary-search to find the
proper trampoline.

Thanks again,
Nadav

2018-10-18 17:43:12

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit <[email protected]> wrote:
>
> at 10:00 AM, Andy Lutomirski <[email protected]> wrote:
>
> >
> >
> >> On Oct 18, 2018, at 9:47 AM, Nadav Amit <[email protected]> wrote:
> >>
> >> at 8:51 PM, Andy Lutomirski <[email protected]> wrote:
> >>
> >>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
> >>>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
> >>>>
> >>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
> >>>>>>
> >>>>>> It is sometimes beneficial to prevent preemption for very few
> >>>>>> instructions, or prevent preemption for some instructions that precede
> >>>>>> a branch (this latter case will be introduced in the next patches).
> >>>>>>
> >>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
> >>>>>> (opcode 0x40) as an indication that preemption is disabled for the
> >>>>>> following instruction.
> >>>>>
> >>>>> Nifty!
> >>>>>
> >>>>> That being said, I think you have a few bugs. First, you can’t just ignore
> >>>>> a rescheduling interrupt, as you introduce unbounded latency when this
> >>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
> >>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
> >>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
> >>>>> consider rewinding one instruction instead. Or use TF, which is only a
> >>>>> little bit terrifying…
> >>>>
> >>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
> >>>> easiest solution would be to make synchronize_sched() ignore preemptions
> >>>> that happen while the prefix is detected. It would slightly change the
> >>>> meaning of the prefix.
> >>
> >> So thinking about it further, rewinding the instruction seems the easiest
> >> and most robust solution. I’ll do it.
> >>
> >>>>> You also aren’t accounting for the case where you get an exception that
> >>>>> is, in turn, preempted.
> >>>>
> >>>> Hmm.. Can you give me an example for such an exception in my use-case? I
> >>>> cannot think of an exception that might be preempted (assuming #BP, #MC
> >>>> cannot be preempted).
> >>>
> >>> Look for cond_local_irq_enable().
> >>
> >> I looked at it. Yet, I still don’t see how exceptions might happen in my
> >> use-case, but having said that - this can be fixed too.
> >
> > I’m not totally certain there’s a case that matters. But it’s worth checking
> >
> >> To be frank, I paid relatively little attention to this subject. Any
> >> feedback about the other parts and especially on the high-level approach? Is
> >> modifying the retpolines in the proposed manner (assembly macros)
> >> acceptable?
> >
> > It’s certainly a neat idea, and it could be a real speedup.
>
> Great. So I’ll try to shape things up, and I still wait for other comments
> (from others).
>
> I’ll just mention two more patches I need to cleanup (I know I still owe you some
> work, so obviously it will be done later):
>
> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
> BPF filters on the Redis server process that are invoked on each
> system-call. Invoking each one requires an indirect branch. The patch keeps
> a per-process kernel code-page that holds trampolines for these functions.

I wonder how many levels of branches are needed before the branches
involved exceed the retpoline cost.

>
> 2. Binary-search for system-calls. Use the per-process kernel code-page also
> to hold multiple trampolines for the 16 common system calls of a certain
> process. The patch uses an indirection table and a binary-search to find the
> proper trampoline.

Same comment applies here.

>
> Thanks again,
> Nadav



--
Andy Lutomirski
AMA Capital Management, LLC

2018-10-18 17:44:12

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

at 10:29 AM, Andy Lutomirski <[email protected]> wrote:

> On Thu, Oct 18, 2018 at 10:25 AM Nadav Amit <[email protected]> wrote:
>> at 10:00 AM, Andy Lutomirski <[email protected]> wrote:
>>
>>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <[email protected]> wrote:
>>>>
>>>> at 8:51 PM, Andy Lutomirski <[email protected]> wrote:
>>>>
>>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
>>>>>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>>
>>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>>>>>>
>>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>>
>>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>>> following instruction.
>>>>>>>
>>>>>>> Nifty!
>>>>>>>
>>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>>> little bit terrifying…
>>>>>>
>>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>>> that happen while the prefix is detected. It would slightly change the
>>>>>> meaning of the prefix.
>>>>
>>>> So thinking about it further, rewinding the instruction seems the easiest
>>>> and most robust solution. I’ll do it.
>>>>
>>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>>> is, in turn, preempted.
>>>>>>
>>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>>> cannot be preempted).
>>>>>
>>>>> Look for cond_local_irq_enable().
>>>>
>>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>>> use-case, but having said that - this can be fixed too.
>>>
>>> I’m not totally certain there’s a case that matters. But it’s worth checking
>>>
>>>> To be frank, I paid relatively little attention to this subject. Any
>>>> feedback about the other parts and especially on the high-level approach? Is
>>>> modifying the retpolines in the proposed manner (assembly macros)
>>>> acceptable?
>>>
>>> It’s certainly a neat idea, and it could be a real speedup.
>>
>> Great. So I’ll try to shape things up, and I still wait for other comments
>> (from others).
>>
>> I’ll just mention two more patches I need to cleanup (I know I still owe you some
>> work, so obviously it will be done later):
>>
>> 1. Seccomp trampolines. On my Ubuntu, when I run Redis, systemd installs 17
>> BPF filters on the Redis server process that are invoked on each
>> system-call. Invoking each one requires an indirect branch. The patch keeps
>> a per-process kernel code-page that holds trampolines for these functions.
>
> I wonder how many levels of branches are needed before the branches
> involved exceed the retpoline cost.

In this case there is no hierarchy, but a list of trampolines that are
called one after the other, as the seccomp filter order is predefined. It
does not work if different threads of the same process have different
filters.

>> 2. Binary-search for system-calls. Use the per-process kernel code-page also
>> to hold multiple trampolines for the 16 common system calls of a certain
>> process. The patch uses an indirection table and a binary-search to find the
>> proper trampoline.
>
> Same comment applies here.

Branch misprediction wastes ~7 cycles and a retpoline takes at least 30. So
assuming the branch predictor is not completely stupid 3-4 levels should not
be too much.

2018-10-18 18:15:26

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

at 12:54 AM, Peter Zijlstra <[email protected]> wrote:

> On Wed, Oct 17, 2018 at 06:22:48PM -0700, Andy Lutomirski wrote:
>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>
>>> It is sometimes beneficial to prevent preemption for very few
>>> instructions, or prevent preemption for some instructions that precede
>>> a branch (this latter case will be introduced in the next patches).
>>>
>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>> (opcode 0x40) as an indication that preemption is disabled for the
>>> following instruction.
>>
>> Nifty!
>>
>> That being said, I think you have a few bugs.
>
>> First, you can’t just ignore a rescheduling interrupt, as you
>> introduce unbounded latency when this happens — you’re effectively
>> emulating preempt_enable_no_resched(), which is not a drop-in
>> replacement for preempt_enable().
>
>> To fix this, you may need to jump to a slow-path trampoline that calls
>> schedule() at the end or consider rewinding one instruction instead.
>> Or use TF, which is only a little bit terrifying...
>
> At which point we're very close to in-kernel rseq.

Interesting. I didn’t know about this feature. I’ll see if I can draw some
ideas from there.

2018-10-19 01:09:07

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

at 10:00 AM, Andy Lutomirski <[email protected]> wrote:

>
>
>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <[email protected]> wrote:
>>
>> at 8:51 PM, Andy Lutomirski <[email protected]> wrote:
>>
>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
>>>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>>>>
>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>>>>
>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>
>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>> following instruction.
>>>>>
>>>>> Nifty!
>>>>>
>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>> little bit terrifying…
>>>>
>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>> that happen while the prefix is detected. It would slightly change the
>>>> meaning of the prefix.
>>
>> So thinking about it further, rewinding the instruction seems the easiest
>> and most robust solution. I’ll do it.
>>
>>>>> You also aren’t accounting for the case where you get an exception that
>>>>> is, in turn, preempted.
>>>>
>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>> cannot be preempted).
>>>
>>> Look for cond_local_irq_enable().
>>
>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>> use-case, but having said that - this can be fixed too.
>
> I’m not totally certain there’s a case that matters. But it’s worth checking

I am still checking. But, I wanted to ask you whether the existing code is
correct, since it seems to me that others do the same mistake I did, unless
I don’t understand the code.

Consider for example do_int3(), and see my inlined comments:

dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
...
ist_enter(regs); // => preempt_disable()
cond_local_irq_enable(regs); // => assume it enables IRQs

...
// resched irq can be delivered here. It will not caused rescheduling
// since preemption is disabled

cond_local_irq_disable(regs); // => assume it disables IRQs
ist_exit(regs); // => preempt_enable_no_resched()
}

At this point resched will not happen for unbounded length of time (unless
there is another point when exiting the trap handler that checks if
preemption should take place).

Another example is __BPF_PROG_RUN_ARRAY(), which also uses
preempt_enable_no_resched().

Am I missing something?

Thanks,
Nadav

2018-10-19 04:30:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

> On Oct 18, 2018, at 6:08 PM, Nadav Amit <[email protected]> wrote:
>
> at 10:00 AM, Andy Lutomirski <[email protected]> wrote:
>
>>
>>
>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <[email protected]> wrote:
>>>
>>> at 8:51 PM, Andy Lutomirski <[email protected]> wrote:
>>>
>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
>>>>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>
>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>>>>>
>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>
>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>> following instruction.
>>>>>>
>>>>>> Nifty!
>>>>>>
>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>> little bit terrifying…
>>>>>
>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>> that happen while the prefix is detected. It would slightly change the
>>>>> meaning of the prefix.
>>>
>>> So thinking about it further, rewinding the instruction seems the easiest
>>> and most robust solution. I’ll do it.
>>>
>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>> is, in turn, preempted.
>>>>>
>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>> cannot be preempted).
>>>>
>>>> Look for cond_local_irq_enable().
>>>
>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>> use-case, but having said that - this can be fixed too.
>>
>> I’m not totally certain there’s a case that matters. But it’s worth checking
>
> I am still checking. But, I wanted to ask you whether the existing code is
> correct, since it seems to me that others do the same mistake I did, unless
> I don’t understand the code.
>
> Consider for example do_int3(), and see my inlined comments:
>
> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> {
> ...
> ist_enter(regs); // => preempt_disable()
> cond_local_irq_enable(regs); // => assume it enables IRQs
>
> ...
> // resched irq can be delivered here. It will not caused rescheduling
> // since preemption is disabled
>
> cond_local_irq_disable(regs); // => assume it disables IRQs
> ist_exit(regs); // => preempt_enable_no_resched()
> }
>
> At this point resched will not happen for unbounded length of time (unless
> there is another point when exiting the trap handler that checks if
> preemption should take place).

I think it's only a bug in the cases where someone uses extable to fix
up an int3 (which would be nuts) or that we oops. But I should still
fix it. In the normal case where int3 was in user code, we'll miss
the reschedule in do_trap(), but we'll reschedule in
prepare_exit_to_usermode() -> exit_to_usermode_loop().

>
> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> preempt_enable_no_resched().

Alexei, I think this code is just wrong. Do you know why it uses
preempt_enable_no_resched()?

Oleg, the code in kernel/signal.c:

preempt_disable();
read_unlock(&tasklist_lock);
preempt_enable_no_resched();
freezable_schedule();

looks bogus. I don't get what it's trying to achieve with
preempt_disable(), and I also don't see why no_resched does anything.
Sure, it prevents a reschedule triggered during read_unlock() from
causing a reschedule, but it doesn't prevent an interrupt immediately
after the preempt_enable_no_resched() call from scheduling.

--Andy

>
> Am I missing something?
>
> Thanks,
> Nadav

2018-10-19 04:48:39

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

at 9:29 PM, Andy Lutomirski <[email protected]> wrote:

>> On Oct 18, 2018, at 6:08 PM, Nadav Amit <[email protected]> wrote:
>>
>> at 10:00 AM, Andy Lutomirski <[email protected]> wrote:
>>
>>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <[email protected]> wrote:
>>>>
>>>> at 8:51 PM, Andy Lutomirski <[email protected]> wrote:
>>>>
>>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
>>>>>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>>
>>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
>>>>>>>>
>>>>>>>> It is sometimes beneficial to prevent preemption for very few
>>>>>>>> instructions, or prevent preemption for some instructions that precede
>>>>>>>> a branch (this latter case will be introduced in the next patches).
>>>>>>>>
>>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
>>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
>>>>>>>> following instruction.
>>>>>>>
>>>>>>> Nifty!
>>>>>>>
>>>>>>> That being said, I think you have a few bugs. First, you can’t just ignore
>>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
>>>>>>> happens — you’re effectively emulating preempt_enable_no_resched(), which
>>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
>>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
>>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
>>>>>>> little bit terrifying…
>>>>>>
>>>>>> Yes, I didn’t pay enough attention here. For my use-case, I think that the
>>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
>>>>>> that happen while the prefix is detected. It would slightly change the
>>>>>> meaning of the prefix.
>>>>
>>>> So thinking about it further, rewinding the instruction seems the easiest
>>>> and most robust solution. I’ll do it.
>>>>
>>>>>>> You also aren’t accounting for the case where you get an exception that
>>>>>>> is, in turn, preempted.
>>>>>>
>>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
>>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
>>>>>> cannot be preempted).
>>>>>
>>>>> Look for cond_local_irq_enable().
>>>>
>>>> I looked at it. Yet, I still don’t see how exceptions might happen in my
>>>> use-case, but having said that - this can be fixed too.
>>>
>>> I’m not totally certain there’s a case that matters. But it’s worth checking
>>
>> I am still checking. But, I wanted to ask you whether the existing code is
>> correct, since it seems to me that others do the same mistake I did, unless
>> I don’t understand the code.
>>
>> Consider for example do_int3(), and see my inlined comments:
>>
>> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>> {
>> ...
>> ist_enter(regs); // => preempt_disable()
>> cond_local_irq_enable(regs); // => assume it enables IRQs
>>
>> ...
>> // resched irq can be delivered here. It will not caused rescheduling
>> // since preemption is disabled
>>
>> cond_local_irq_disable(regs); // => assume it disables IRQs
>> ist_exit(regs); // => preempt_enable_no_resched()
>> }
>>
>> At this point resched will not happen for unbounded length of time (unless
>> there is another point when exiting the trap handler that checks if
>> preemption should take place).
>
> I think it's only a bug in the cases where someone uses extable to fix
> up an int3 (which would be nuts) or that we oops. But I should still
> fix it. In the normal case where int3 was in user code, we'll miss
> the reschedule in do_trap(), but we'll reschedule in
> prepare_exit_to_usermode() -> exit_to_usermode_loop().

Thanks for your quick response, and sorry for bothering instead of dealing
with it. Note that do_debug() does something similar to do_int3().

And then there is optimized_callback() that also uses
preempt_enable_no_resched(). I think the original use was correct, but then
a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized
kprobes”) removed the IRQ disabling, while leaving
preempt_enable_no_resched() . No?

2018-10-19 05:02:03

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

>
> >
> > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > preempt_enable_no_resched().
>
> Alexei, I think this code is just wrong.

why 'just wrong' ?

> Do you know why it uses
> preempt_enable_no_resched()?

dont recall precisely.
we could be preemptable at the point where macro is called.
I think the goal of no_resched was to avoid adding scheduling points
where they didn't exist before just because a prog ran for few nsec.
May be Daniel or Roman remember.


2018-10-19 08:22:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Thu, Oct 18, 2018 at 09:29:39PM -0700, Andy Lutomirski wrote:

> > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > preempt_enable_no_resched().
>
> Alexei, I think this code is just wrong. Do you know why it uses
> preempt_enable_no_resched()?

Yes, that's a straight up bug.

It looks like I need to go fix up abuse again :/

> Oleg, the code in kernel/signal.c:
>
> preempt_disable();
> read_unlock(&tasklist_lock);
> preempt_enable_no_resched();
> freezable_schedule();
>

The purpose here is to avoid back-to-back schedule() calls, and this
pattern is one of the few correct uses of preempt_enable_no_resched().

Suppose we got a preemption while holding the read_lock(), then when
we'd do read_unlock(), we'd drop preempt_count to 0 and reschedule, then
when we get back we instantly call into schedule _again_.

What this code does, is it increments preempt_count such that
read_unlock() doesn't hit 0 and doesn't call schedule, then we lower it
to 0 without a call to schedule() and then call schedule() explicitly.


2018-10-19 08:24:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote:
> >
> > >
> > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > > preempt_enable_no_resched().
> >
> > Alexei, I think this code is just wrong.
>
> why 'just wrong' ?

Because you lost a preemption point, this is a no-no.

>
> > Do you know why it uses
> > preempt_enable_no_resched()?
>
> dont recall precisely.
> we could be preemptable at the point where macro is called.
> I think the goal of no_resched was to avoid adding scheduling points
> where they didn't exist before just because a prog ran for few nsec.
> May be Daniel or Roman remember.

No, you did the exact opposite, where there previously was a preemption,
you just ate it. The band saw didn't get stopped in time, you loose your
hand etc..

2018-10-19 08:34:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
> Consider for example do_int3(), and see my inlined comments:
>
> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> {
> ...
> ist_enter(regs); // => preempt_disable()
> cond_local_irq_enable(regs); // => assume it enables IRQs
>
> ...
> // resched irq can be delivered here. It will not caused rescheduling
> // since preemption is disabled
>
> cond_local_irq_disable(regs); // => assume it disables IRQs
> ist_exit(regs); // => preempt_enable_no_resched()
> }
>
> At this point resched will not happen for unbounded length of time (unless
> there is another point when exiting the trap handler that checks if
> preemption should take place).
>
> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> preempt_enable_no_resched().
>
> Am I missing something?

Would not the interrupt return then check for TIF_NEED_RESCHED and call
schedule() ?

I think (and this certainly wants a comment) is that the ist_exit()
thing hard relies on the interrupt-return path doing the reschedule.

2018-10-19 10:39:38

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On 10/18, Andy Lutomirski wrote:
>
> Oleg, the code in kernel/signal.c:
>
> preempt_disable();
> read_unlock(&tasklist_lock);
> preempt_enable_no_resched();
> freezable_schedule();
>
> looks bogus. I don't get what it's trying to achieve with
> preempt_disable(), and I also don't see why no_resched does anything.
> Sure, it prevents a reschedule triggered during read_unlock() from
> causing a reschedule,

Yes. Lets suppose we remove preempt_disable/enable.

Debugger was already woken up, if it runs on the same CPU quite possibly
it will preemt the tracee. After that debugger will spin in wait_task_inactive(),
until it is in turn preempted or calls schedule_timeout(1), so that the tracee
(current) can finally call __schedule(preempt = F) and call deactivate_task() to
become inactive.

> but it doesn't prevent an interrupt immediately
> after the preempt_enable_no_resched() call from scheduling.

Yes, but this is less likely.

Oleg.


2018-10-19 14:32:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix



> On Oct 19, 2018, at 1:33 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
>> Consider for example do_int3(), and see my inlined comments:
>>
>> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>> {
>> ...
>> ist_enter(regs); // => preempt_disable()
>> cond_local_irq_enable(regs); // => assume it enables IRQs
>>
>> ...
>> // resched irq can be delivered here. It will not caused rescheduling
>> // since preemption is disabled
>>
>> cond_local_irq_disable(regs); // => assume it disables IRQs
>> ist_exit(regs); // => preempt_enable_no_resched()
>> }
>>
>> At this point resched will not happen for unbounded length of time (unless
>> there is another point when exiting the trap handler that checks if
>> preemption should take place).
>>
>> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
>> preempt_enable_no_resched().
>>
>> Am I missing something?
>
> Would not the interrupt return then check for TIF_NEED_RESCHED and call
> schedule() ?

The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s fundamentally atomic — it’s running on a percpu stack and it can’t schedule. In theory we could do some evil stack switching, but we don’t.

How does NMI handle this? If an NMI that hit interruptible kernel code overflows a perf counter, how does the wake up work?

(do_int3() is special because it’s not actually IST. But it can hit in odd places due to kprobes, and I’m nervous about recursing incorrectly into RCU and context tracking code if we were to use exception_enter().)

>
> I think (and this certainly wants a comment) is that the ist_exit()
> thing hard relies on the interrupt-return path doing the reschedule.

2018-10-19 14:48:10

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Fri, Oct 19, 2018 at 1:22 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 18, 2018 at 10:00:53PM -0700, Alexei Starovoitov wrote:
> > >
> > > >
> > > > Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> > > > preempt_enable_no_resched().
> > >
> > > Alexei, I think this code is just wrong.
> >
> > why 'just wrong' ?
>
> Because you lost a preemption point, this is a no-no.
>
> >
> > > Do you know why it uses
> > > preempt_enable_no_resched()?
> >
> > dont recall precisely.
> > we could be preemptable at the point where macro is called.
> > I think the goal of no_resched was to avoid adding scheduling points
> > where they didn't exist before just because a prog ran for few nsec.
> > May be Daniel or Roman remember.
>
> No, you did the exact opposite, where there previously was a preemption,
> you just ate it. The band saw didn't get stopped in time, you loose your
> hand etc..

Let me do few experiments then.
We will fix it up.

2018-10-20 01:23:05

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Fri, 19 Oct 2018 04:44:33 +0000
Nadav Amit <[email protected]> wrote:

> at 9:29 PM, Andy Lutomirski <[email protected]> wrote:
>
> >> On Oct 18, 2018, at 6:08 PM, Nadav Amit <[email protected]> wrote:
> >>
> >> at 10:00 AM, Andy Lutomirski <[email protected]> wrote:
> >>
> >>>> On Oct 18, 2018, at 9:47 AM, Nadav Amit <[email protected]> wrote:
> >>>>
> >>>> at 8:51 PM, Andy Lutomirski <[email protected]> wrote:
> >>>>
> >>>>>> On Wed, Oct 17, 2018 at 8:12 PM Nadav Amit <[email protected]> wrote:
> >>>>>> at 6:22 PM, Andy Lutomirski <[email protected]> wrote:
> >>>>>>
> >>>>>>>> On Oct 17, 2018, at 5:54 PM, Nadav Amit <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> It is sometimes beneficial to prevent preemption for very few
> >>>>>>>> instructions, or prevent preemption for some instructions that precede
> >>>>>>>> a branch (this latter case will be introduced in the next patches).
> >>>>>>>>
> >>>>>>>> To provide such functionality on x86-64, we use an empty REX-prefix
> >>>>>>>> (opcode 0x40) as an indication that preemption is disabled for the
> >>>>>>>> following instruction.
> >>>>>>>
> >>>>>>> Nifty!
> >>>>>>>
> >>>>>>> That being said, I think you have a few bugs. First, you can$B!G(Bt just ignore
> >>>>>>> a rescheduling interrupt, as you introduce unbounded latency when this
> >>>>>>> happens $B!=(B you$B!G(Bre effectively emulating preempt_enable_no_resched(), which
> >>>>>>> is not a drop-in replacement for preempt_enable(). To fix this, you may
> >>>>>>> need to jump to a slow-path trampoline that calls schedule() at the end or
> >>>>>>> consider rewinding one instruction instead. Or use TF, which is only a
> >>>>>>> little bit terrifying$B!D(B
> >>>>>>
> >>>>>> Yes, I didn$B!G(Bt pay enough attention here. For my use-case, I think that the
> >>>>>> easiest solution would be to make synchronize_sched() ignore preemptions
> >>>>>> that happen while the prefix is detected. It would slightly change the
> >>>>>> meaning of the prefix.
> >>>>
> >>>> So thinking about it further, rewinding the instruction seems the easiest
> >>>> and most robust solution. I$B!G(Bll do it.
> >>>>
> >>>>>>> You also aren$B!G(Bt accounting for the case where you get an exception that
> >>>>>>> is, in turn, preempted.
> >>>>>>
> >>>>>> Hmm.. Can you give me an example for such an exception in my use-case? I
> >>>>>> cannot think of an exception that might be preempted (assuming #BP, #MC
> >>>>>> cannot be preempted).
> >>>>>
> >>>>> Look for cond_local_irq_enable().
> >>>>
> >>>> I looked at it. Yet, I still don$B!G(Bt see how exceptions might happen in my
> >>>> use-case, but having said that - this can be fixed too.
> >>>
> >>> I$B!G(Bm not totally certain there$B!G(Bs a case that matters. But it$B!G(Bs worth checking
> >>
> >> I am still checking. But, I wanted to ask you whether the existing code is
> >> correct, since it seems to me that others do the same mistake I did, unless
> >> I don$B!G(Bt understand the code.
> >>
> >> Consider for example do_int3(), and see my inlined comments:
> >>
> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >> {
> >> ...
> >> ist_enter(regs); // => preempt_disable()
> >> cond_local_irq_enable(regs); // => assume it enables IRQs
> >>
> >> ...
> >> // resched irq can be delivered here. It will not caused rescheduling
> >> // since preemption is disabled
> >>
> >> cond_local_irq_disable(regs); // => assume it disables IRQs
> >> ist_exit(regs); // => preempt_enable_no_resched()
> >> }
> >>
> >> At this point resched will not happen for unbounded length of time (unless
> >> there is another point when exiting the trap handler that checks if
> >> preemption should take place).
> >
> > I think it's only a bug in the cases where someone uses extable to fix
> > up an int3 (which would be nuts) or that we oops. But I should still
> > fix it. In the normal case where int3 was in user code, we'll miss
> > the reschedule in do_trap(), but we'll reschedule in
> > prepare_exit_to_usermode() -> exit_to_usermode_loop().
>
> Thanks for your quick response, and sorry for bothering instead of dealing
> with it. Note that do_debug() does something similar to do_int3().
>
> And then there is optimized_callback() that also uses
> preempt_enable_no_resched(). I think the original use was correct, but then
> a19b2e3d7839 ("kprobes/x86: Remove IRQ disabling from ftrace-based/optimized
> kprobes$B!I(B) removed the IRQ disabling, while leaving
> preempt_enable_no_resched() . No?

Ah, good catch!
Indeed, we don't need to stick on no_resched anymore.

Thanks!


--
Masami Hiramatsu <[email protected]>

2018-10-23 18:36:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On 10/17/18 5:54 PM, Nadav Amit wrote:
> base relpoline
> ---- ---------
> nginx 22898 25178 (+10%)
> redis-ycsb 24523 25486 (+4%)
> dbench 2144 2103 (+2%)

Just out of curiosity, which indirect branches are the culprits here for
causing the slowdowns?

2018-10-23 20:33:48

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

at 11:36 AM, Dave Hansen <[email protected]> wrote:

> On 10/17/18 5:54 PM, Nadav Amit wrote:
>> base relpoline
>> ---- ---------
>> nginx 22898 25178 (+10%)
>> redis-ycsb 24523 25486 (+4%)
>> dbench 2144 2103 (+2%)
>
> Just out of curiosity, which indirect branches are the culprits here for
> causing the slowdowns?

So I didn’t try to measure exactly which one. There are roughly 500 that
actually “run” in my tests. Initially, I took the silly approach of trying
to patch the C source-code using semi automatically-generated Coccinelle
scripts, so I can tell you it is not just few branches but many. The
network stack is full of function pointers (e.g., tcp_congestion_ops,
tcp_sock_af_ops, dst_ops). The file-system also uses many function pointers
(file_operations specifically). Compound-pages have d’tor and so on.

If you want, you can rebuild the kernel without retpolines and run

perf record -e br_inst_exec.taken_indirect_near_call:k (your workload)

For some reason I didn’t manage to use PEBS (:ppp) from either the guest or
the host, so my results are a bit skewed (i.e., the sampled location is
usually after the call was taken). Running dbench in the VM gives me the
following “hot-spots”:

# Samples: 304 of event 'br_inst_exec.taken_indirect_near_call'
# Event count (approx.): 60800912
#
# Overhead Command Shared Object Symbol
# ........ ....... ....................... .............................................
#
5.26% :197970 [guest.kernel.kallsyms] [g] __fget_light
4.28% :197969 [guest.kernel.kallsyms] [g] __fget_light
3.95% :197969 [guest.kernel.kallsyms] [g] dcache_readdir
3.29% :197970 [guest.kernel.kallsyms] [g] next_positive.isra.14
2.96% :197970 [guest.kernel.kallsyms] [g] __do_sys_kill
2.30% :197970 [guest.kernel.kallsyms] [g] apparmor_file_open
1.97% :197969 [guest.kernel.kallsyms] [g] __do_sys_kill
1.97% :197969 [guest.kernel.kallsyms] [g] next_positive.isra.14
1.97% :197970 [guest.kernel.kallsyms] [g] _raw_spin_lock
1.64% :197969 [guest.kernel.kallsyms] [g] __alloc_file
1.64% :197969 [guest.kernel.kallsyms] [g] common_file_perm
1.64% :197969 [guest.kernel.kallsyms] [g] filldir
1.64% :197970 [guest.kernel.kallsyms] [g] do_dentry_open
1.64% :197970 [guest.kernel.kallsyms] [g] kmem_cache_free
1.32% :197969 [guest.kernel.kallsyms] [g] __raw_callee_save___pv_queued_spin_unlock
1.32% :197969 [guest.kernel.kallsyms] [g] __slab_free

Regards,
Nadav

2018-10-23 20:38:11

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On 10/23/18 1:32 PM, Nadav Amit wrote:
>> On 10/17/18 5:54 PM, Nadav Amit wrote:
>>> base relpoline
>>> ---- ---------
>>> nginx 22898 25178 (+10%)
>>> redis-ycsb 24523 25486 (+4%)
>>> dbench 2144 2103 (+2%)
>> Just out of curiosity, which indirect branches are the culprits here for
>> causing the slowdowns?
> So I didn’t try to measure exactly which one. There are roughly 500 that
> actually “run” in my tests.

OK, cool, that's pretty much all I wanted to know, just that there
aren't 3 of them or something for which we need all this infrastructure.

2018-11-28 16:12:08

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> This RFC introduces indirect call promotion in runtime, which for the
> matter of simplification (and branding) will be called here "relpolines"
> (relative call + trampoline). Relpolines are mainly intended as a way
> of reducing retpoline overheads due to Spectre v2.
>
> Unlike indirect call promotion through profile guided optimization, the
> proposed approach does not require a profiling stage, works well with
> modules whose address is unknown and can adapt to changing workloads.
>
> The main idea is simple: for every indirect call, we inject a piece of
> code with fast- and slow-path calls. The fast path is used if the target
> matches the expected (hot) target. The slow-path uses a retpoline.
> During training, the slow-path is set to call a function that saves the
> call source and target in a hash-table and keep count for call
> frequency. The most common target is then patched into the hot path.
>
> The patching is done on-the-fly by patching the conditional branch
> (opcode and offset) that is used to compare the target to the hot
> target. This allows to direct all cores to the fast-path, while patching
> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> patch a single byte when the code might be executed by any core. (2)
> When patching more than one byte, ensure that all cores do not run the
> to-be-patched-code by preventing this code from being preempted, and
> using synchronize_sched() after patching the branch that jumps over this
> code.
>
> Changing all the indirect calls to use relpolines is done using assembly
> macro magic. There are alternative solutions, but this one is
> relatively simple and transparent. There is also logic to retrain the
> software predictor, but the policy it uses may need to be refined.
>
> Eventually the results are not bad (2 VCPU VM, throughput reported):
>
> base relpoline
> ---- ---------
> nginx 22898 25178 (+10%)
> redis-ycsb 24523 25486 (+4%)
> dbench 2144 2103 (+2%)
>
> When retpolines are disabled, and if retraining is off, performance
> benefits are up to 2% (nginx), but are much less impressive.

Hi Nadav,

Peter pointed me to these patches during a discussion about retpoline
profiling. Personally, I think this is brilliant. This could help
networking and filesystem intensive workloads a lot.

Some high-level comments:

- "Relpoline" looks confusingly a lot like "retpoline". How about
"optpoline"? To avoid confusing myself I will hereafter refer to it
as such :-)

- Instead of patching one byte at a time, is there a reason why
text_poke_bp() can't be used? That would greatly simplify the
patching process, as everything could be patched in a single step.

- In many cases, a single direct call may not be sufficient, as there
could be for example multiple tasks using different network protocols
which need different callbacks for the same call site.

- I'm not sure about the periodic retraining logic, it seems a bit
nondeterministic and bursty.

So I'd propose the following changes:

- In the optpoline, reserve space for multiple (5 or so) comparisons and
direct calls. Maybe the number of reserved cmp/jne/call slots can be
tweaked by the caller somehow. Or maybe it could grow as needed.
Starting out, they would just be NOPs.

- Instead of the temporary learning mode, add permanent tracking to
detect a direct call "miss" -- i.e., when none of the existing direct
calls are applicable and the retpoline will be used.

- In the case of a miss (or N misses), it could trigger a direct call
patching operation to be run later (workqueue or syscall exit). If
all the direct call slots are full, it could patch the least recently
modified one. If this causes thrashing (>x changes over y time), it
could increase the number of direct call slots using a trampoline.
Even if there were several slots, CPU branch prediction would
presumably help make it much faster than a basic retpoline.

Thoughts?

--
Josh

2018-11-28 19:36:00

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
>
> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>> This RFC introduces indirect call promotion in runtime, which for the
>> matter of simplification (and branding) will be called here "relpolines"
>> (relative call + trampoline). Relpolines are mainly intended as a way
>> of reducing retpoline overheads due to Spectre v2.
>>
>> Unlike indirect call promotion through profile guided optimization, the
>> proposed approach does not require a profiling stage, works well with
>> modules whose address is unknown and can adapt to changing workloads.
>>
>> The main idea is simple: for every indirect call, we inject a piece of
>> code with fast- and slow-path calls. The fast path is used if the target
>> matches the expected (hot) target. The slow-path uses a retpoline.
>> During training, the slow-path is set to call a function that saves the
>> call source and target in a hash-table and keep count for call
>> frequency. The most common target is then patched into the hot path.
>>
>> The patching is done on-the-fly by patching the conditional branch
>> (opcode and offset) that is used to compare the target to the hot
>> target. This allows to direct all cores to the fast-path, while patching
>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>> patch a single byte when the code might be executed by any core. (2)
>> When patching more than one byte, ensure that all cores do not run the
>> to-be-patched-code by preventing this code from being preempted, and
>> using synchronize_sched() after patching the branch that jumps over this
>> code.
>>
>> Changing all the indirect calls to use relpolines is done using assembly
>> macro magic. There are alternative solutions, but this one is
>> relatively simple and transparent. There is also logic to retrain the
>> software predictor, but the policy it uses may need to be refined.
>>
>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>
>> base relpoline
>> ---- ---------
>> nginx 22898 25178 (+10%)
>> redis-ycsb 24523 25486 (+4%)
>> dbench 2144 2103 (+2%)
>>
>> When retpolines are disabled, and if retraining is off, performance
>> benefits are up to 2% (nginx), but are much less impressive.
>
> Hi Nadav,
>
> Peter pointed me to these patches during a discussion about retpoline
> profiling. Personally, I think this is brilliant. This could help
> networking and filesystem intensive workloads a lot.

Thanks! I was a bit held-back by the relatively limited number of responses.
I finished another version two weeks ago, and every day I think: "should it
be RFCv2 or v1”, ending up not sending it…

There is one issue that I realized while working on the new version: I’m not
sure it is well-defined what an outline retpoline is allowed to do. The
indirect branch promotion code can change rflags, which might cause
correction issues. In practice, using gcc, it is not a problem.

> Some high-level comments:
>
> - "Relpoline" looks confusingly a lot like "retpoline". How about
> "optpoline"? To avoid confusing myself I will hereafter refer to it
> as such :-)

Sure. For the academic paper we submitted, we used a much worse name that my
colleague came up with. I’m ok with anything other than that name (not
mentioned to prevent double-blinding violations). I’ll go with your name.

> - Instead of patching one byte at a time, is there a reason why
> text_poke_bp() can't be used? That would greatly simplify the
> patching process, as everything could be patched in a single step.

I thought of it and maybe it is somehow possible, but there are several
problems, for which I didn’t find a simple solution:

1. An indirect branch inside the BP handler might be the one we patch

2. An indirect branch inside an interrupt or NMI handler might be the
one we patch

3. Overall, we need to patch more than a single instruction, and
text_poke_bp() is not suitable

> - In many cases, a single direct call may not be sufficient, as there
> could be for example multiple tasks using different network protocols
> which need different callbacks for the same call site.

We want to know during compilation how many targets to use. It is not
super-simple to support multiple inlined targets, but it is feasible if you
are willing to annotate when multiple targets are needed. We have a version
which uses outlined indirect branch promotion when there are multiple
targets, but it’s not ready for prime time, and the code-cache misses can
induce some overheads.

> - I'm not sure about the periodic retraining logic, it seems a bit
> nondeterministic and bursty.

I agree. It can be limited to cases in which modules are loaded/removed,
or when the user explicitly asks for it to take place.

>
> So I'd propose the following changes:
>
> - In the optpoline, reserve space for multiple (5 or so) comparisons and
> direct calls. Maybe the number of reserved cmp/jne/call slots can be
> tweaked by the caller somehow. Or maybe it could grow as needed.
> Starting out, they would just be NOPs.
>
> - Instead of the temporary learning mode, add permanent tracking to
> detect a direct call "miss" -- i.e., when none of the existing direct
> calls are applicable and the retpoline will be used.
>
> - In the case of a miss (or N misses), it could trigger a direct call
> patching operation to be run later (workqueue or syscall exit). If
> all the direct call slots are full, it could patch the least recently
> modified one. If this causes thrashing (>x changes over y time), it
> could increase the number of direct call slots using a trampoline.
> Even if there were several slots, CPU branch prediction would
> presumably help make it much faster than a basic retpoline.
>
> Thoughts?

I’m ok with these changes in general, although having multiple inline
targets is not super-simple. However, there are a few issues:

- There is potentially a negative impact due to code-size increase which
I was worried about.

- I see no reason not to use all the available slots immediately when
we encounter a miss.

- The order of the branches might be “wrong” (unoptimized) if we do not
do any relearning.

- The main question is what to do if we run out of slots and still get
(many?) misses. I presume the right thing is to disable the optpoline
and jump over it to the retpoline.

Thanks again for the feedback, and please let me know what you think about
my concerns.

2018-11-29 00:42:04

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
> >
> > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >> This RFC introduces indirect call promotion in runtime, which for the
> >> matter of simplification (and branding) will be called here "relpolines"
> >> (relative call + trampoline). Relpolines are mainly intended as a way
> >> of reducing retpoline overheads due to Spectre v2.
> >>
> >> Unlike indirect call promotion through profile guided optimization, the
> >> proposed approach does not require a profiling stage, works well with
> >> modules whose address is unknown and can adapt to changing workloads.
> >>
> >> The main idea is simple: for every indirect call, we inject a piece of
> >> code with fast- and slow-path calls. The fast path is used if the target
> >> matches the expected (hot) target. The slow-path uses a retpoline.
> >> During training, the slow-path is set to call a function that saves the
> >> call source and target in a hash-table and keep count for call
> >> frequency. The most common target is then patched into the hot path.
> >>
> >> The patching is done on-the-fly by patching the conditional branch
> >> (opcode and offset) that is used to compare the target to the hot
> >> target. This allows to direct all cores to the fast-path, while patching
> >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >> patch a single byte when the code might be executed by any core. (2)
> >> When patching more than one byte, ensure that all cores do not run the
> >> to-be-patched-code by preventing this code from being preempted, and
> >> using synchronize_sched() after patching the branch that jumps over this
> >> code.
> >>
> >> Changing all the indirect calls to use relpolines is done using assembly
> >> macro magic. There are alternative solutions, but this one is
> >> relatively simple and transparent. There is also logic to retrain the
> >> software predictor, but the policy it uses may need to be refined.
> >>
> >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >>
> >> base relpoline
> >> ---- ---------
> >> nginx 22898 25178 (+10%)
> >> redis-ycsb 24523 25486 (+4%)
> >> dbench 2144 2103 (+2%)
> >>
> >> When retpolines are disabled, and if retraining is off, performance
> >> benefits are up to 2% (nginx), but are much less impressive.
> >
> > Hi Nadav,
> >
> > Peter pointed me to these patches during a discussion about retpoline
> > profiling. Personally, I think this is brilliant. This could help
> > networking and filesystem intensive workloads a lot.
>
> Thanks! I was a bit held-back by the relatively limited number of responses.

It is a rather, erm, ambitious idea, maybe they were speechless :-)

> I finished another version two weeks ago, and every day I think: "should it
> be RFCv2 or v1”, ending up not sending it…
>
> There is one issue that I realized while working on the new version: I’m not
> sure it is well-defined what an outline retpoline is allowed to do. The
> indirect branch promotion code can change rflags, which might cause
> correction issues. In practice, using gcc, it is not a problem.

Callees can clobber flags, so it seems fine to me.

> > Some high-level comments:
> >
> > - "Relpoline" looks confusingly a lot like "retpoline". How about
> > "optpoline"? To avoid confusing myself I will hereafter refer to it
> > as such :-)
>
> Sure. For the academic paper we submitted, we used a much worse name that my
> colleague came up with. I’m ok with anything other than that name (not
> mentioned to prevent double-blinding violations). I’ll go with your name.
>
> > - Instead of patching one byte at a time, is there a reason why
> > text_poke_bp() can't be used? That would greatly simplify the
> > patching process, as everything could be patched in a single step.
>
> I thought of it and maybe it is somehow possible, but there are several
> problems, for which I didn’t find a simple solution:
>
> 1. An indirect branch inside the BP handler might be the one we patch

I _think_ nested INT3s should be doable, because they don't use IST.
Maybe Andy can clarify.

To avoid recursion we'd have to make sure not to use any function
pointers in do_int3() before or during the call to poke_int3_handler().

Incidentally I'm working on a patch which adds an indirect call to
poke_int3_handler(). We'd have to disable optolines for that code.

> 2. An indirect branch inside an interrupt or NMI handler might be the
> one we patch

But INT3s just use the existing stack, and NMIs support nesting, so I'm
thinking that should also be doable. Andy?

> 3. Overall, we need to patch more than a single instruction, and
> text_poke_bp() is not suitable

Hm, I suppose that's true.

> > - In many cases, a single direct call may not be sufficient, as there
> > could be for example multiple tasks using different network protocols
> > which need different callbacks for the same call site.
>
> We want to know during compilation how many targets to use. It is not
> super-simple to support multiple inlined targets, but it is feasible if you
> are willing to annotate when multiple targets are needed.

Why would an annotation be needed? Is there a reason why the 'call'
macro wouldn't work?

I hate to say it, but another option would be a compiler plugin.

> We have a version which uses outlined indirect branch promotion when
> there are multiple targets, but it’s not ready for prime time, and the
> code-cache misses can induce some overheads.

It would be interesting to see some measurements comparing inline and
out-of-line. If there were multiple direct call slots then out-of-line
could have advantages in some cases since it reduces the original
function's i-cache footprint. But maybe it wouldn't be worth it.

> > - I'm not sure about the periodic retraining logic, it seems a bit
> > nondeterministic and bursty.
>
> I agree. It can be limited to cases in which modules are loaded/removed,
> or when the user explicitly asks for it to take place.
>
> >
> > So I'd propose the following changes:
> >
> > - In the optpoline, reserve space for multiple (5 or so) comparisons and
> > direct calls. Maybe the number of reserved cmp/jne/call slots can be
> > tweaked by the caller somehow. Or maybe it could grow as needed.
> > Starting out, they would just be NOPs.
> >
> > - Instead of the temporary learning mode, add permanent tracking to
> > detect a direct call "miss" -- i.e., when none of the existing direct
> > calls are applicable and the retpoline will be used.
> >
> > - In the case of a miss (or N misses), it could trigger a direct call
> > patching operation to be run later (workqueue or syscall exit). If
> > all the direct call slots are full, it could patch the least recently
> > modified one. If this causes thrashing (>x changes over y time), it
> > could increase the number of direct call slots using a trampoline.
> > Even if there were several slots, CPU branch prediction would
> > presumably help make it much faster than a basic retpoline.
> >
> > Thoughts?
>
> I’m ok with these changes in general, although having multiple inline
> targets is not super-simple. However, there are a few issues:
>
> - There is potentially a negative impact due to code-size increase which
> I was worried about.

True. Though out-of-line could help with that (but it would also have
downsides of course).

> - I see no reason not to use all the available slots immediately when
> we encounter a miss.

Agreed.

> - The order of the branches might be “wrong” (unoptimized) if we do not
> do any relearning.

True, though branch prediction would mitigate that to a certain degree.
Also if the number of slots is reasonably small (which it will probably
need to be anyway) then it might be good enough.

> - The main question is what to do if we run out of slots and still get
> (many?) misses. I presume the right thing is to disable the optpoline
> and jump over it to the retpoline.

It may need some experimentation. You could just try a simple
round-robin patching scheme. Possibly only writing after X number of
misses. And if that doesn't help, just disable it. Such a simple
approach might work well enough in practice.

--
Josh

2018-11-29 01:41:20

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <[email protected]> wrote:
>
> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
> > >
> > > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > >> This RFC introduces indirect call promotion in runtime, which for the
> > >> matter of simplification (and branding) will be called here "relpolines"
> > >> (relative call + trampoline). Relpolines are mainly intended as a way
> > >> of reducing retpoline overheads due to Spectre v2.
> > >>
> > >> Unlike indirect call promotion through profile guided optimization, the
> > >> proposed approach does not require a profiling stage, works well with
> > >> modules whose address is unknown and can adapt to changing workloads.
> > >>
> > >> The main idea is simple: for every indirect call, we inject a piece of
> > >> code with fast- and slow-path calls. The fast path is used if the target
> > >> matches the expected (hot) target. The slow-path uses a retpoline.
> > >> During training, the slow-path is set to call a function that saves the
> > >> call source and target in a hash-table and keep count for call
> > >> frequency. The most common target is then patched into the hot path.
> > >>
> > >> The patching is done on-the-fly by patching the conditional branch
> > >> (opcode and offset) that is used to compare the target to the hot
> > >> target. This allows to direct all cores to the fast-path, while patching
> > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> > >> patch a single byte when the code might be executed by any core. (2)
> > >> When patching more than one byte, ensure that all cores do not run the
> > >> to-be-patched-code by preventing this code from being preempted, and
> > >> using synchronize_sched() after patching the branch that jumps over this
> > >> code.
> > >>
> > >> Changing all the indirect calls to use relpolines is done using assembly
> > >> macro magic. There are alternative solutions, but this one is
> > >> relatively simple and transparent. There is also logic to retrain the
> > >> software predictor, but the policy it uses may need to be refined.
> > >>
> > >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>
> > >> base relpoline
> > >> ---- ---------
> > >> nginx 22898 25178 (+10%)
> > >> redis-ycsb 24523 25486 (+4%)
> > >> dbench 2144 2103 (+2%)
> > >>
> > >> When retpolines are disabled, and if retraining is off, performance
> > >> benefits are up to 2% (nginx), but are much less impressive.
> > >
> > > Hi Nadav,
> > >
> > > Peter pointed me to these patches during a discussion about retpoline
> > > profiling. Personally, I think this is brilliant. This could help
> > > networking and filesystem intensive workloads a lot.
> >
> > Thanks! I was a bit held-back by the relatively limited number of responses.
>
> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>
> > I finished another version two weeks ago, and every day I think: "should it
> > be RFCv2 or v1”, ending up not sending it…
> >
> > There is one issue that I realized while working on the new version: I’m not
> > sure it is well-defined what an outline retpoline is allowed to do. The
> > indirect branch promotion code can change rflags, which might cause
> > correction issues. In practice, using gcc, it is not a problem.
>
> Callees can clobber flags, so it seems fine to me.

Just to check I understand your approach right: you made a macro
called "call", and you're therefore causing all instances of "call" to
become magic? This is... terrifying. It's even plausibly worse than
"#define if" :) The scariest bit is that it will impact inline asm as
well. Maybe a gcc plugin would be less alarming?

> >
> > 1. An indirect branch inside the BP handler might be the one we patch
>
> I _think_ nested INT3s should be doable, because they don't use IST.
> Maybe Andy can clarify.

int3 should survive recursion these days. Although I admit I'm
currently wondering what happens if one thread puts a kprobe on an
address that another thread tries to text_poke.

Also, this relpoline magic is likely to start patching text at runtime
on a semi-regular basis. This type of patching is *slow*. Is it a
problem?

> > 2. An indirect branch inside an interrupt or NMI handler might be the
> > one we patch
>
> But INT3s just use the existing stack, and NMIs support nesting, so I'm
> thinking that should also be doable. Andy?
>

In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
should be fine, right? I'd be a little nervous if we get an int3 in
the C code that handles the early part of an NMI from user mode. It's
*probably* okay, but one of the alarming issues is that the int3
return path will implicitly unmask NMI, which isn't fantastic. Maybe
we finally need to dust off my old "return using RET" code to get rid
of that problem.

2018-11-29 02:21:37

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <[email protected]> wrote:
>
> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <[email protected]> wrote:
>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
>>>>
>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>> This RFC introduces indirect call promotion in runtime, which for the
>>>>> matter of simplification (and branding) will be called here "relpolines"
>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
>>>>> of reducing retpoline overheads due to Spectre v2.
>>>>>
>>>>> Unlike indirect call promotion through profile guided optimization, the
>>>>> proposed approach does not require a profiling stage, works well with
>>>>> modules whose address is unknown and can adapt to changing workloads.
>>>>>
>>>>> The main idea is simple: for every indirect call, we inject a piece of
>>>>> code with fast- and slow-path calls. The fast path is used if the target
>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
>>>>> During training, the slow-path is set to call a function that saves the
>>>>> call source and target in a hash-table and keep count for call
>>>>> frequency. The most common target is then patched into the hot path.
>>>>>
>>>>> The patching is done on-the-fly by patching the conditional branch
>>>>> (opcode and offset) that is used to compare the target to the hot
>>>>> target. This allows to direct all cores to the fast-path, while patching
>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>>>>> patch a single byte when the code might be executed by any core. (2)
>>>>> When patching more than one byte, ensure that all cores do not run the
>>>>> to-be-patched-code by preventing this code from being preempted, and
>>>>> using synchronize_sched() after patching the branch that jumps over this
>>>>> code.
>>>>>
>>>>> Changing all the indirect calls to use relpolines is done using assembly
>>>>> macro magic. There are alternative solutions, but this one is
>>>>> relatively simple and transparent. There is also logic to retrain the
>>>>> software predictor, but the policy it uses may need to be refined.
>>>>>
>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>>>>
>>>>> base relpoline
>>>>> ---- ---------
>>>>> nginx 22898 25178 (+10%)
>>>>> redis-ycsb 24523 25486 (+4%)
>>>>> dbench 2144 2103 (+2%)
>>>>>
>>>>> When retpolines are disabled, and if retraining is off, performance
>>>>> benefits are up to 2% (nginx), but are much less impressive.
>>>>
>>>> Hi Nadav,
>>>>
>>>> Peter pointed me to these patches during a discussion about retpoline
>>>> profiling. Personally, I think this is brilliant. This could help
>>>> networking and filesystem intensive workloads a lot.
>>>
>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>>
>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>
>>> I finished another version two weeks ago, and every day I think: "should it
>>> be RFCv2 or v1”, ending up not sending it…
>>>
>>> There is one issue that I realized while working on the new version: I’m not
>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>> indirect branch promotion code can change rflags, which might cause
>>> correction issues. In practice, using gcc, it is not a problem.
>>
>> Callees can clobber flags, so it seems fine to me.
>
> Just to check I understand your approach right: you made a macro
> called "call", and you're therefore causing all instances of "call" to
> become magic? This is... terrifying. It's even plausibly worse than
> "#define if" :) The scariest bit is that it will impact inline asm as
> well. Maybe a gcc plugin would be less alarming?

It is likely to look less alarming. When I looked at the inline retpoline
implementation of gcc, it didn’t look much better than what I did - it
basically just emits assembly instructions.

Anyhow, I look (again) into using gcc-plugins.

>>> 1. An indirect branch inside the BP handler might be the one we patch
>>
>> I _think_ nested INT3s should be doable, because they don't use IST.
>> Maybe Andy can clarify.
>
> int3 should survive recursion these days. Although I admit I'm
> currently wondering what happens if one thread puts a kprobe on an
> address that another thread tries to text_poke.

The issue I regarded is having an indirect call *inside* the the handler.
For example, you try to patch the call to bp_int3_handler and then get an
int3. They can be annotated to prevent them from being patched. Then again,
I need to see how gcc plugins can get these annotations.

>
> Also, this relpoline magic is likely to start patching text at runtime
> on a semi-regular basis. This type of patching is *slow*. Is it a
> problem?

It didn’t appear so. Although there are >10000 indirect branches in the
kernel, you don’t patch too many of them even you are doing relearning.

>
>>> 2. An indirect branch inside an interrupt or NMI handler might be the
>>> one we patch
>>
>> But INT3s just use the existing stack, and NMIs support nesting, so I'm
>> thinking that should also be doable. Andy?
>
> In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
> should be fine, right? I'd be a little nervous if we get an int3 in
> the C code that handles the early part of an NMI from user mode. It's
> *probably* okay, but one of the alarming issues is that the int3
> return path will implicitly unmask NMI, which isn't fantastic. Maybe
> we finally need to dust off my old "return using RET" code to get rid
> of that problem.

So it may be possible. It would require having a new text_poke_bp() variant
for multiple instructions. text_poke_bp() might be slower though.


2018-11-29 03:25:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion


On Nov 28, 2018, at 6:06 PM, Nadav Amit <[email protected]> wrote:

>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <[email protected]> wrote:
>>
>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <[email protected]> wrote:
>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
>>>>>
>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>>> This RFC introduces indirect call promotion in runtime, which for the
>>>>>> matter of simplification (and branding) will be called here "relpolines"
>>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
>>>>>> of reducing retpoline overheads due to Spectre v2.
>>>>>>
>>>>>> Unlike indirect call promotion through profile guided optimization, the
>>>>>> proposed approach does not require a profiling stage, works well with
>>>>>> modules whose address is unknown and can adapt to changing workloads.
>>>>>>
>>>>>> The main idea is simple: for every indirect call, we inject a piece of
>>>>>> code with fast- and slow-path calls. The fast path is used if the target
>>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
>>>>>> During training, the slow-path is set to call a function that saves the
>>>>>> call source and target in a hash-table and keep count for call
>>>>>> frequency. The most common target is then patched into the hot path.
>>>>>>
>>>>>> The patching is done on-the-fly by patching the conditional branch
>>>>>> (opcode and offset) that is used to compare the target to the hot
>>>>>> target. This allows to direct all cores to the fast-path, while patching
>>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>>>>>> patch a single byte when the code might be executed by any core. (2)
>>>>>> When patching more than one byte, ensure that all cores do not run the
>>>>>> to-be-patched-code by preventing this code from being preempted, and
>>>>>> using synchronize_sched() after patching the branch that jumps over this
>>>>>> code.
>>>>>>
>>>>>> Changing all the indirect calls to use relpolines is done using assembly
>>>>>> macro magic. There are alternative solutions, but this one is
>>>>>> relatively simple and transparent. There is also logic to retrain the
>>>>>> software predictor, but the policy it uses may need to be refined.
>>>>>>
>>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>>>>>
>>>>>> base relpoline
>>>>>> ---- ---------
>>>>>> nginx 22898 25178 (+10%)
>>>>>> redis-ycsb 24523 25486 (+4%)
>>>>>> dbench 2144 2103 (+2%)
>>>>>>
>>>>>> When retpolines are disabled, and if retraining is off, performance
>>>>>> benefits are up to 2% (nginx), but are much less impressive.
>>>>>
>>>>> Hi Nadav,
>>>>>
>>>>> Peter pointed me to these patches during a discussion about retpoline
>>>>> profiling. Personally, I think this is brilliant. This could help
>>>>> networking and filesystem intensive workloads a lot.
>>>>
>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>>>
>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>>
>>>> I finished another version two weeks ago, and every day I think: "should it
>>>> be RFCv2 or v1”, ending up not sending it…
>>>>
>>>> There is one issue that I realized while working on the new version: I’m not
>>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>>> indirect branch promotion code can change rflags, which might cause
>>>> correction issues. In practice, using gcc, it is not a problem.
>>>
>>> Callees can clobber flags, so it seems fine to me.
>>
>> Just to check I understand your approach right: you made a macro
>> called "call", and you're therefore causing all instances of "call" to
>> become magic? This is... terrifying. It's even plausibly worse than
>> "#define if" :) The scariest bit is that it will impact inline asm as
>> well. Maybe a gcc plugin would be less alarming?
>
> It is likely to look less alarming. When I looked at the inline retpoline
> implementation of gcc, it didn’t look much better than what I did - it
> basically just emits assembly instructions.

To be clear, that wasn’t a NAK. It was merely a “this is alarming.”

Hey Josh - you could potentially do the same hack to generate the static call tables. Take that, objtool.

>
> Anyhow, I look (again) into using gcc-plugins.
>
>>>> 1. An indirect branch inside the BP handler might be the one we patch
>>>
>>> I _think_ nested INT3s should be doable, because they don't use IST.
>>> Maybe Andy can clarify.
>>
>> int3 should survive recursion these days. Although I admit I'm
>> currently wondering what happens if one thread puts a kprobe on an
>> address that another thread tries to text_poke.
>
> The issue I regarded is having an indirect call *inside* the the handler.
> For example, you try to patch the call to bp_int3_handler and then get an
> int3. They can be annotated to prevent them from being patched. Then again,
> I need to see how gcc plugins can get these annotations.

We could move the relevant code to a separate object file that disables the whole mess.

>
>>
>> Also, this relpoline magic is likely to start patching text at runtime
>> on a semi-regular basis. This type of patching is *slow*. Is it a
>> problem?
>
> It didn’t appear so. Although there are >10000 indirect branches in the
> kernel, you don’t patch too many of them even you are doing relearning.
>
>>
>>>> 2. An indirect branch inside an interrupt or NMI handler might be the
>>>> one we patch
>>>
>>> But INT3s just use the existing stack, and NMIs support nesting, so I'm
>>> thinking that should also be doable. Andy?
>>
>> In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
>> should be fine, right? I'd be a little nervous if we get an int3 in
>> the C code that handles the early part of an NMI from user mode. It's
>> *probably* okay, but one of the alarming issues is that the int3
>> return path will implicitly unmask NMI, which isn't fantastic. Maybe
>> we finally need to dust off my old "return using RET" code to get rid
>> of that problem.
>
> So it may be possible. It would require having a new text_poke_bp() variant
> for multiple instructions. text_poke_bp() might be slower though.
>
>

Can you outline how the patching works at all? You’re getting rid of preempt disabling, right? What’s the actual sequence and how does it work?

2018-11-29 04:37:33

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On Wed, Nov 28, 2018 at 07:24:08PM -0800, Andy Lutomirski wrote:
> To be clear, that wasn’t a NAK. It was merely a “this is alarming.”
>
> Hey Josh - you could potentially do the same hack to generate the
> static call tables. Take that, objtool.

Ha, after witnessing Nadav's glorious hack, I considered that. But I
didn't see a way to pull it off, because asm macro conditionals don't
seem to have a way to test for a regex (or at least a named prefix) for
the call target.

I'd need a way to detect "call __static_call_tramp_<whatever>".

--
Josh

2018-11-29 06:09:49

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <[email protected]> wrote:
>
>
> On Nov 28, 2018, at 6:06 PM, Nadav Amit <[email protected]> wrote:
>
> >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <[email protected]> wrote:
> >>
> >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <[email protected]> wrote:
> >>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> >>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
> >>>>>
> >>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >>>>>> This RFC introduces indirect call promotion in runtime, which for the
> >>>>>> matter of simplification (and branding) will be called here "relpolines"
> >>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
> >>>>>> of reducing retpoline overheads due to Spectre v2.
> >>>>>>
> >>>>>> Unlike indirect call promotion through profile guided optimization, the
> >>>>>> proposed approach does not require a profiling stage, works well with
> >>>>>> modules whose address is unknown and can adapt to changing workloads.
> >>>>>>
> >>>>>> The main idea is simple: for every indirect call, we inject a piece of
> >>>>>> code with fast- and slow-path calls. The fast path is used if the target
> >>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
> >>>>>> During training, the slow-path is set to call a function that saves the
> >>>>>> call source and target in a hash-table and keep count for call
> >>>>>> frequency. The most common target is then patched into the hot path.
> >>>>>>
> >>>>>> The patching is done on-the-fly by patching the conditional branch
> >>>>>> (opcode and offset) that is used to compare the target to the hot
> >>>>>> target. This allows to direct all cores to the fast-path, while patching
> >>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >>>>>> patch a single byte when the code might be executed by any core. (2)
> >>>>>> When patching more than one byte, ensure that all cores do not run the
> >>>>>> to-be-patched-code by preventing this code from being preempted, and
> >>>>>> using synchronize_sched() after patching the branch that jumps over this
> >>>>>> code.
> >>>>>>
> >>>>>> Changing all the indirect calls to use relpolines is done using assembly
> >>>>>> macro magic. There are alternative solutions, but this one is
> >>>>>> relatively simple and transparent. There is also logic to retrain the
> >>>>>> software predictor, but the policy it uses may need to be refined.
> >>>>>>
> >>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >>>>>>
> >>>>>> base relpoline
> >>>>>> ---- ---------
> >>>>>> nginx 22898 25178 (+10%)
> >>>>>> redis-ycsb 24523 25486 (+4%)
> >>>>>> dbench 2144 2103 (+2%)
> >>>>>>
> >>>>>> When retpolines are disabled, and if retraining is off, performance
> >>>>>> benefits are up to 2% (nginx), but are much less impressive.
> >>>>>
> >>>>> Hi Nadav,
> >>>>>
> >>>>> Peter pointed me to these patches during a discussion about retpoline
> >>>>> profiling. Personally, I think this is brilliant. This could help
> >>>>> networking and filesystem intensive workloads a lot.
> >>>>
> >>>> Thanks! I was a bit held-back by the relatively limited number of responses.
> >>>
> >>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> >>>
> >>>> I finished another version two weeks ago, and every day I think: "should it
> >>>> be RFCv2 or v1”, ending up not sending it…
> >>>>
> >>>> There is one issue that I realized while working on the new version: I’m not
> >>>> sure it is well-defined what an outline retpoline is allowed to do. The
> >>>> indirect branch promotion code can change rflags, which might cause
> >>>> correction issues. In practice, using gcc, it is not a problem.
> >>>
> >>> Callees can clobber flags, so it seems fine to me.
> >>
> >> Just to check I understand your approach right: you made a macro
> >> called "call", and you're therefore causing all instances of "call" to
> >> become magic? This is... terrifying. It's even plausibly worse than
> >> "#define if" :) The scariest bit is that it will impact inline asm as
> >> well. Maybe a gcc plugin would be less alarming?
> >
> > It is likely to look less alarming. When I looked at the inline retpoline
> > implementation of gcc, it didn’t look much better than what I did - it
> > basically just emits assembly instructions.
>
> To be clear, that wasn’t a NAK. It was merely a “this is alarming.”

Although... how do you avoid matching on things that really don't want
this treatment? paravirt ops come to mind.

2018-11-29 09:47:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 1/5] x86: introduce preemption disable prefix

On Fri, Oct 19, 2018 at 07:29:45AM -0700, Andy Lutomirski wrote:
> > On Oct 19, 2018, at 1:33 AM, Peter Zijlstra <[email protected]> wrote:
> >
> >> On Fri, Oct 19, 2018 at 01:08:23AM +0000, Nadav Amit wrote:
> >> Consider for example do_int3(), and see my inlined comments:
> >>
> >> dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >> {
> >> ...
> >> ist_enter(regs); // => preempt_disable()
> >> cond_local_irq_enable(regs); // => assume it enables IRQs
> >>
> >> ...
> >> // resched irq can be delivered here. It will not caused rescheduling
> >> // since preemption is disabled
> >>
> >> cond_local_irq_disable(regs); // => assume it disables IRQs
> >> ist_exit(regs); // => preempt_enable_no_resched()
> >> }
> >>
> >> At this point resched will not happen for unbounded length of time (unless
> >> there is another point when exiting the trap handler that checks if
> >> preemption should take place).
> >>
> >> Another example is __BPF_PROG_RUN_ARRAY(), which also uses
> >> preempt_enable_no_resched().
> >>
> >> Am I missing something?
> >
> > Would not the interrupt return then check for TIF_NEED_RESCHED and call
> > schedule() ?
>
> The paranoid exit path doesn’t check TIF_NEED_RESCHED because it’s
> fundamentally atomic — it’s running on a percpu stack and it can’t
> schedule. In theory we could do some evil stack switching, but we
> don’t.
>
> How does NMI handle this? If an NMI that hit interruptible kernel
> code overflows a perf counter, how does the wake up work?

NMIs should never set NEED_RESCHED. What the perf does it self-IPI
(irq_work) and do the wakeup from there.

2018-11-29 15:22:06

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <[email protected]> wrote:
> >
> >
> > On Nov 28, 2018, at 6:06 PM, Nadav Amit <[email protected]> wrote:
> >
> > >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <[email protected]> wrote:
> > >>
> > >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <[email protected]> wrote:
> > >>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> > >>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
> > >>>>>
> > >>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > >>>>>> This RFC introduces indirect call promotion in runtime, which for the
> > >>>>>> matter of simplification (and branding) will be called here "relpolines"
> > >>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
> > >>>>>> of reducing retpoline overheads due to Spectre v2.
> > >>>>>>
> > >>>>>> Unlike indirect call promotion through profile guided optimization, the
> > >>>>>> proposed approach does not require a profiling stage, works well with
> > >>>>>> modules whose address is unknown and can adapt to changing workloads.
> > >>>>>>
> > >>>>>> The main idea is simple: for every indirect call, we inject a piece of
> > >>>>>> code with fast- and slow-path calls. The fast path is used if the target
> > >>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
> > >>>>>> During training, the slow-path is set to call a function that saves the
> > >>>>>> call source and target in a hash-table and keep count for call
> > >>>>>> frequency. The most common target is then patched into the hot path.
> > >>>>>>
> > >>>>>> The patching is done on-the-fly by patching the conditional branch
> > >>>>>> (opcode and offset) that is used to compare the target to the hot
> > >>>>>> target. This allows to direct all cores to the fast-path, while patching
> > >>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> > >>>>>> patch a single byte when the code might be executed by any core. (2)
> > >>>>>> When patching more than one byte, ensure that all cores do not run the
> > >>>>>> to-be-patched-code by preventing this code from being preempted, and
> > >>>>>> using synchronize_sched() after patching the branch that jumps over this
> > >>>>>> code.
> > >>>>>>
> > >>>>>> Changing all the indirect calls to use relpolines is done using assembly
> > >>>>>> macro magic. There are alternative solutions, but this one is
> > >>>>>> relatively simple and transparent. There is also logic to retrain the
> > >>>>>> software predictor, but the policy it uses may need to be refined.
> > >>>>>>
> > >>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>>>>>
> > >>>>>> base relpoline
> > >>>>>> ---- ---------
> > >>>>>> nginx 22898 25178 (+10%)
> > >>>>>> redis-ycsb 24523 25486 (+4%)
> > >>>>>> dbench 2144 2103 (+2%)
> > >>>>>>
> > >>>>>> When retpolines are disabled, and if retraining is off, performance
> > >>>>>> benefits are up to 2% (nginx), but are much less impressive.
> > >>>>>
> > >>>>> Hi Nadav,
> > >>>>>
> > >>>>> Peter pointed me to these patches during a discussion about retpoline
> > >>>>> profiling. Personally, I think this is brilliant. This could help
> > >>>>> networking and filesystem intensive workloads a lot.
> > >>>>
> > >>>> Thanks! I was a bit held-back by the relatively limited number of responses.
> > >>>
> > >>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> > >>>
> > >>>> I finished another version two weeks ago, and every day I think: "should it
> > >>>> be RFCv2 or v1”, ending up not sending it…
> > >>>>
> > >>>> There is one issue that I realized while working on the new version: I’m not
> > >>>> sure it is well-defined what an outline retpoline is allowed to do. The
> > >>>> indirect branch promotion code can change rflags, which might cause
> > >>>> correction issues. In practice, using gcc, it is not a problem.
> > >>>
> > >>> Callees can clobber flags, so it seems fine to me.
> > >>
> > >> Just to check I understand your approach right: you made a macro
> > >> called "call", and you're therefore causing all instances of "call" to
> > >> become magic? This is... terrifying. It's even plausibly worse than
> > >> "#define if" :) The scariest bit is that it will impact inline asm as
> > >> well. Maybe a gcc plugin would be less alarming?
> > >
> > > It is likely to look less alarming. When I looked at the inline retpoline
> > > implementation of gcc, it didn’t look much better than what I did - it
> > > basically just emits assembly instructions.
> >
> > To be clear, that wasn’t a NAK. It was merely a “this is alarming.”
>
> Although... how do you avoid matching on things that really don't want
> this treatment? paravirt ops come to mind.

Paravirt ops don't use retpolines because they're patched into direct
calls during boot. So Nadav's patches won't touch them.

--
Josh

2018-12-01 06:53:44

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

> On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf <[email protected]> wrote:
>
> On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <[email protected]> wrote:
>>> On Nov 28, 2018, at 6:06 PM, Nadav Amit <[email protected]> wrote:
>>>
>>>>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>
>>>>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <[email protected]> wrote:
>>>>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>>>>>> This RFC introduces indirect call promotion in runtime, which for the
>>>>>>>>> matter of simplification (and branding) will be called here "relpolines"
>>>>>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
>>>>>>>>> of reducing retpoline overheads due to Spectre v2.
>>>>>>>>>
>>>>>>>>> Unlike indirect call promotion through profile guided optimization, the
>>>>>>>>> proposed approach does not require a profiling stage, works well with
>>>>>>>>> modules whose address is unknown and can adapt to changing workloads.
>>>>>>>>>
>>>>>>>>> The main idea is simple: for every indirect call, we inject a piece of
>>>>>>>>> code with fast- and slow-path calls. The fast path is used if the target
>>>>>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
>>>>>>>>> During training, the slow-path is set to call a function that saves the
>>>>>>>>> call source and target in a hash-table and keep count for call
>>>>>>>>> frequency. The most common target is then patched into the hot path.
>>>>>>>>>
>>>>>>>>> The patching is done on-the-fly by patching the conditional branch
>>>>>>>>> (opcode and offset) that is used to compare the target to the hot
>>>>>>>>> target. This allows to direct all cores to the fast-path, while patching
>>>>>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>>>>>>>>> patch a single byte when the code might be executed by any core. (2)
>>>>>>>>> When patching more than one byte, ensure that all cores do not run the
>>>>>>>>> to-be-patched-code by preventing this code from being preempted, and
>>>>>>>>> using synchronize_sched() after patching the branch that jumps over this
>>>>>>>>> code.
>>>>>>>>>
>>>>>>>>> Changing all the indirect calls to use relpolines is done using assembly
>>>>>>>>> macro magic. There are alternative solutions, but this one is
>>>>>>>>> relatively simple and transparent. There is also logic to retrain the
>>>>>>>>> software predictor, but the policy it uses may need to be refined.
>>>>>>>>>
>>>>>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>>>>>>>>
>>>>>>>>> base relpoline
>>>>>>>>> ---- ---------
>>>>>>>>> nginx 22898 25178 (+10%)
>>>>>>>>> redis-ycsb 24523 25486 (+4%)
>>>>>>>>> dbench 2144 2103 (+2%)
>>>>>>>>>
>>>>>>>>> When retpolines are disabled, and if retraining is off, performance
>>>>>>>>> benefits are up to 2% (nginx), but are much less impressive.
>>>>>>>>
>>>>>>>> Hi Nadav,
>>>>>>>>
>>>>>>>> Peter pointed me to these patches during a discussion about retpoline
>>>>>>>> profiling. Personally, I think this is brilliant. This could help
>>>>>>>> networking and filesystem intensive workloads a lot.
>>>>>>>
>>>>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>>>>>>
>>>>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>>>>>
>>>>>>> I finished another version two weeks ago, and every day I think: "should it
>>>>>>> be RFCv2 or v1”, ending up not sending it…
>>>>>>>
>>>>>>> There is one issue that I realized while working on the new version: I’m not
>>>>>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>>>>>> indirect branch promotion code can change rflags, which might cause
>>>>>>> correction issues. In practice, using gcc, it is not a problem.
>>>>>>
>>>>>> Callees can clobber flags, so it seems fine to me.
>>>>>
>>>>> Just to check I understand your approach right: you made a macro
>>>>> called "call", and you're therefore causing all instances of "call" to
>>>>> become magic? This is... terrifying. It's even plausibly worse than
>>>>> "#define if" :) The scariest bit is that it will impact inline asm as
>>>>> well. Maybe a gcc plugin would be less alarming?
>>>>
>>>> It is likely to look less alarming. When I looked at the inline retpoline
>>>> implementation of gcc, it didn’t look much better than what I did - it
>>>> basically just emits assembly instructions.
>>>
>>> To be clear, that wasn’t a NAK. It was merely a “this is alarming.”
>>
>> Although... how do you avoid matching on things that really don't want
>> this treatment? paravirt ops come to mind.
>
> Paravirt ops don't use retpolines because they're patched into direct
> calls during boot. So Nadav's patches won't touch them.

Actually, the way it’s handled is slightly more complicated - yes, the CALL
macro should not be applied, as Josh said, but the question is how it is
achieved.

The basic idea is that the CALL macro should only be applied to C
source-files and not to assembly files and for macros.s, which holds the PV
call macros. I will recheck it is done this way.

Regards,
Nadav

2018-12-01 14:26:32

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On Sat, Dec 01, 2018 at 06:52:45AM +0000, Nadav Amit wrote:
> > On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf <[email protected]> wrote:
> >
> > On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
> >> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <[email protected]> wrote:
> >>> On Nov 28, 2018, at 6:06 PM, Nadav Amit <[email protected]> wrote:
> >>>
> >>>>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <[email protected]> wrote:
> >>>>>
> >>>>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <[email protected]> wrote:
> >>>>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
> >>>>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >>>>>>>>> This RFC introduces indirect call promotion in runtime, which for the
> >>>>>>>>> matter of simplification (and branding) will be called here "relpolines"
> >>>>>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
> >>>>>>>>> of reducing retpoline overheads due to Spectre v2.
> >>>>>>>>>
> >>>>>>>>> Unlike indirect call promotion through profile guided optimization, the
> >>>>>>>>> proposed approach does not require a profiling stage, works well with
> >>>>>>>>> modules whose address is unknown and can adapt to changing workloads.
> >>>>>>>>>
> >>>>>>>>> The main idea is simple: for every indirect call, we inject a piece of
> >>>>>>>>> code with fast- and slow-path calls. The fast path is used if the target
> >>>>>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
> >>>>>>>>> During training, the slow-path is set to call a function that saves the
> >>>>>>>>> call source and target in a hash-table and keep count for call
> >>>>>>>>> frequency. The most common target is then patched into the hot path.
> >>>>>>>>>
> >>>>>>>>> The patching is done on-the-fly by patching the conditional branch
> >>>>>>>>> (opcode and offset) that is used to compare the target to the hot
> >>>>>>>>> target. This allows to direct all cores to the fast-path, while patching
> >>>>>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >>>>>>>>> patch a single byte when the code might be executed by any core. (2)
> >>>>>>>>> When patching more than one byte, ensure that all cores do not run the
> >>>>>>>>> to-be-patched-code by preventing this code from being preempted, and
> >>>>>>>>> using synchronize_sched() after patching the branch that jumps over this
> >>>>>>>>> code.
> >>>>>>>>>
> >>>>>>>>> Changing all the indirect calls to use relpolines is done using assembly
> >>>>>>>>> macro magic. There are alternative solutions, but this one is
> >>>>>>>>> relatively simple and transparent. There is also logic to retrain the
> >>>>>>>>> software predictor, but the policy it uses may need to be refined.
> >>>>>>>>>
> >>>>>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >>>>>>>>>
> >>>>>>>>> base relpoline
> >>>>>>>>> ---- ---------
> >>>>>>>>> nginx 22898 25178 (+10%)
> >>>>>>>>> redis-ycsb 24523 25486 (+4%)
> >>>>>>>>> dbench 2144 2103 (+2%)
> >>>>>>>>>
> >>>>>>>>> When retpolines are disabled, and if retraining is off, performance
> >>>>>>>>> benefits are up to 2% (nginx), but are much less impressive.
> >>>>>>>>
> >>>>>>>> Hi Nadav,
> >>>>>>>>
> >>>>>>>> Peter pointed me to these patches during a discussion about retpoline
> >>>>>>>> profiling. Personally, I think this is brilliant. This could help
> >>>>>>>> networking and filesystem intensive workloads a lot.
> >>>>>>>
> >>>>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
> >>>>>>
> >>>>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> >>>>>>
> >>>>>>> I finished another version two weeks ago, and every day I think: "should it
> >>>>>>> be RFCv2 or v1”, ending up not sending it…
> >>>>>>>
> >>>>>>> There is one issue that I realized while working on the new version: I’m not
> >>>>>>> sure it is well-defined what an outline retpoline is allowed to do. The
> >>>>>>> indirect branch promotion code can change rflags, which might cause
> >>>>>>> correction issues. In practice, using gcc, it is not a problem.
> >>>>>>
> >>>>>> Callees can clobber flags, so it seems fine to me.
> >>>>>
> >>>>> Just to check I understand your approach right: you made a macro
> >>>>> called "call", and you're therefore causing all instances of "call" to
> >>>>> become magic? This is... terrifying. It's even plausibly worse than
> >>>>> "#define if" :) The scariest bit is that it will impact inline asm as
> >>>>> well. Maybe a gcc plugin would be less alarming?
> >>>>
> >>>> It is likely to look less alarming. When I looked at the inline retpoline
> >>>> implementation of gcc, it didn’t look much better than what I did - it
> >>>> basically just emits assembly instructions.
> >>>
> >>> To be clear, that wasn’t a NAK. It was merely a “this is alarming.”
> >>
> >> Although... how do you avoid matching on things that really don't want
> >> this treatment? paravirt ops come to mind.
> >
> > Paravirt ops don't use retpolines because they're patched into direct
> > calls during boot. So Nadav's patches won't touch them.
>
> Actually, the way it’s handled is slightly more complicated - yes, the CALL
> macro should not be applied, as Josh said, but the question is how it is
> achieved.
>
> The basic idea is that the CALL macro should only be applied to C
> source-files and not to assembly files and for macros.s, which holds the PV
> call macros. I will recheck it is done this way.

Even if the CALL macro were applied, it would get ignored by your code
because the PARAVIRT_CALL macro doesn't use retpolines. So it would get
skipped by this check:

.ifc "\v", "__x86_indirect_thunk_\reg_it"
relpoline_call reg=\reg_it
retpoline = 1
.endif

--
Josh