Changes v1 => v2:
1. Fix build errors for different configs. (kernel test robot)
Kernel Live Patch (livepatch, or klp) and bpf trampoline are important
features for modern systems. This set allows the two to work on the same
kernel function at the same time.
live patch uses ftrace with IPMODIFY, while bpf trampoline uses direct
ftrace. Existing policy does not allow the two to attach to the same kernel
function. This is changed by fine-tuning the ftrace IPMODIFY policy to allow
one non-DIRECT IPMODIFY ftrace_ops and one non-IPMODIFY DIRECT ftrace_ops
on the same kernel function at the same time. Please see 3/5 for more
details on this.
Note that one of the constraints here is to let bpf trampoline use a direct
call when it is not working on the same function as live patch. This is
achieved by allowing ftrace code to ask bpf trampoline to make changes.
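For context, a minimal sketch (not part of this set) of the two attachment
styles, assuming the ~v5.19 ftrace API; my_handler, my_new_func, my_func and
my_trampoline are hypothetical names:

    /* livepatch style: IPMODIFY callback redirects execution via regs->ip */
    static void my_handler(unsigned long ip, unsigned long parent_ip,
                           struct ftrace_ops *ops, struct ftrace_regs *fregs)
    {
            ftrace_instruction_pointer_set(fregs, (unsigned long)my_new_func);
    }

    static struct ftrace_ops my_ops = {
            .func  = my_handler,
            .flags = FTRACE_OPS_FL_IPMODIFY | FTRACE_OPS_FL_SAVE_REGS,
    };

    /* bpf trampoline style: direct call, regs->ip is left untouched */
    register_ftrace_direct((unsigned long)my_func, (unsigned long)my_trampoline);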
Jiri Olsa (1):
bpf, x64: Allow to use caller address from stack
Song Liu (4):
ftrace: allow customized flags for ftrace_direct_multi ftrace_ops
ftrace: add modify_ftrace_direct_multi_nolock
ftrace: introduce FTRACE_OPS_FL_SHARE_IPMODIFY
bpf: trampoline: support FTRACE_OPS_FL_SHARE_IPMODIFY
arch/x86/net/bpf_jit_comp.c | 13 +-
include/linux/bpf.h | 8 ++
include/linux/ftrace.h | 79 +++++++++++
kernel/bpf/trampoline.c | 109 +++++++++++++--
kernel/trace/ftrace.c | 269 +++++++++++++++++++++++++++++++-----
5 files changed, 424 insertions(+), 54 deletions(-)
--
2.30.2
live patch and BPF trampoline (kfunc/kretfunc in bpftrace) are important
features for modern systems. Currently, it is not possible to use live
patch and BPF trampoline on the same kernel function at the same time.
This is because of the restriction that only one ftrace_ops with flag
FTRACE_OPS_FL_IPMODIFY may be attached to the same kernel function.
BPF trampoline uses direct ftrace_ops, which assumes IPMODIFY. However,
not all direct ftrace_ops would overwrite the actual function. This means
it is possible to have a non-IPMODIFY direct ftrace_ops to share the same
kernel function with an IPMODIFY ftrace_ops.
Introduce FTRACE_OPS_FL_SHARE_IPMODIFY, which allows the direct ftrace_ops
to share with IPMODIFY ftrace_ops. With FTRACE_OPS_FL_SHARE_IPMODIFY flag
set, the direct ftrace_ops would call the target function picked by the
IPMODIFY ftrace_ops.
Comment "IPMODIFY, DIRECT, and SHARE_IPMODIFY" in include/linux/ftrace.h
contains more information about how SHARE_IPMODIFY interacts with IPMODIFY
and DIRECT flags.
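For illustration, a hedged sketch (not part of the patch) of the
register-then-retry flow described in that comment; direct_ops, my_ip and
my_trampoline are hypothetical:

    ftrace_set_filter_ip(&direct_ops, my_ip, 0, 0);
    err = register_ftrace_direct_multi(&direct_ops, (unsigned long)my_trampoline);
    if (err == -EAGAIN) {
            /*
             * An IPMODIFY ops (e.g. livepatch) already owns my_ip.
             * Regenerate the trampoline so that it calls the function
             * at regs->ip, opt in to sharing, and retry.
             */
            direct_ops.flags |= FTRACE_OPS_FL_SHARE_IPMODIFY;
            err = register_ftrace_direct_multi(&direct_ops,
                                               (unsigned long)my_trampoline);
    }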
Signed-off-by: Song Liu <[email protected]>
---
include/linux/ftrace.h | 74 +++++++++++++++++
kernel/trace/ftrace.c | 179 ++++++++++++++++++++++++++++++++++++++---
2 files changed, 242 insertions(+), 11 deletions(-)
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 9023bf69f675..bfacf608de9c 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -98,6 +98,18 @@ static inline int ftrace_mod_get_kallsym(unsigned int symnum, unsigned long *val
}
#endif
+/*
+ * FTRACE_OPS_CMD_* commands allow the ftrace core logic to request changes
+ * to a ftrace_ops.
+ *
+ * ENABLE_SHARE_IPMODIFY - enable FTRACE_OPS_FL_SHARE_IPMODIFY.
+ * DISABLE_SHARE_IPMODIFY - disable FTRACE_OPS_FL_SHARE_IPMODIFY.
+ */
+enum ftrace_ops_cmd {
+ FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY,
+ FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY,
+};
+
#ifdef CONFIG_FUNCTION_TRACER
extern int ftrace_enabled;
@@ -189,6 +201,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops);
* ftrace_enabled.
* DIRECT - Used by the direct ftrace_ops helper for direct functions
* (internal ftrace only, should not be used by others)
+ * SHARE_IPMODIFY - For direct ftrace_ops only. Set when the direct function
+ * is ready to share the same kernel function with an IPMODIFY function
+ * (live patch, etc.).
*/
enum {
FTRACE_OPS_FL_ENABLED = BIT(0),
@@ -209,8 +224,66 @@ enum {
FTRACE_OPS_FL_TRACE_ARRAY = BIT(15),
FTRACE_OPS_FL_PERMANENT = BIT(16),
FTRACE_OPS_FL_DIRECT = BIT(17),
+ FTRACE_OPS_FL_SHARE_IPMODIFY = BIT(18),
};
+/*
+ * IPMODIFY, DIRECT, and SHARE_IPMODIFY.
+ *
+ * ftrace provides IPMODIFY flag for users to replace existing kernel
+ * function with a different version. This is achieved by setting regs->ip.
+ * The top user of IPMODIFY is live patch.
+ *
+ * DIRECT allows user to load custom trampoline on top of ftrace. DIRECT
+ * ftrace does not overwrite regs->ip. Instead, the custom trampoline is
+ * saved separately (for example, orig_ax on x86). The top user of DIRECT
+ * is bpf trampoline.
+ *
+ * It is not super rare to have both live patch and bpf trampoline on the
+ * same kernel function. Therefore, it is necessary to allow the two to work
+ * with each other. Given that IPMODIFY and DIRECT target addresses are
+ * saved separately, this is feasible, but we need to be careful.
+ *
+ * The policy between IPMODIFY and DIRECT is:
+ *
+ * 1. Each kernel function can only have one IPMODIFY ftrace_ops;
+ * 2. Each kernel function can only have one DIRECT ftrace_ops;
+ * 3. DIRECT ftrace_ops may have IPMODIFY or not;
+ * 4. Each kernel function may have one non-DIRECT IPMODIFY ftrace_ops,
+ * and one non-IPMODIFY DIRECT ftrace_ops at the same time. This
+ * requires support from the DIRECT ftrace_ops. Specifically, the
+ * DIRECT trampoline should call the kernel function at regs->ip.
+ * If the DIRECT ftrace_ops supports sharing a function with ftrace_ops
+ * with IPMODIFY, it should set flag SHARE_IPMODIFY.
+ *
+ * Some DIRECT ftrace_ops have an option to enable SHARE_IPMODIFY or not.
+ * Usually, the non-SHARE_IPMODIFY option gives better performance. To take
+ * advantage of this performance benefit, it is necessary to enable
+ * SHARE_IPMODIFY only when it is on the same function as an IPMODIFY
+ * ftrace_ops. There are two cases to consider:
+ *
+ * 1. IPMODIFY ftrace_ops is registered first. When the (non-IPMODIFY, and
+ * non-SHARE_IPMODIFY) DIRECT ftrace_ops is registered later,
+ * register_ftrace_direct_multi() returns -EAGAIN. If the user of
+ * the DIRECT ftrace_ops can support SHARE_IPMODIFY, it should enable
+ * SHARE_IPMODIFY and retry.
+ * 2. (non-IPMODIFY, and non-SHARE_IPMODIFY) DIRECT ftrace_ops is
+ * registered first. When the IPMODIFY ftrace_ops is registered later,
+ * it is necessary to ask the direct ftrace_ops to enable
+ * SHARE_IPMODIFY support. This is achieved via ftrace_ops->ops_func
+ * cmd=FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY. For more details on this
+ * condition, check out prepare_direct_functions_for_ipmodify().
+ */
+
+/*
+ * For most ftrace_ops_cmd,
+ * Returns:
+ * 0 - Success.
+ * -EBUSY - The operation cannot proceed.
+ * -EAGAIN - The operation cannot proceed temporarily.
+ */
+typedef int (*ftrace_ops_func_t)(struct ftrace_ops *op, enum ftrace_ops_cmd cmd);
+
#ifdef CONFIG_DYNAMIC_FTRACE
/* The hash used to know what functions callbacks trace */
struct ftrace_ops_hash {
@@ -253,6 +326,7 @@ struct ftrace_ops {
unsigned long trampoline;
unsigned long trampoline_size;
struct list_head list;
+ ftrace_ops_func_t ops_func;
#endif
};
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 6a419f6bbbf0..868bbc753803 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1865,7 +1865,8 @@ static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops,
/*
* Try to update IPMODIFY flag on each ftrace_rec. Return 0 if it is OK
* or no-needed to update, -EBUSY if it detects a conflict of the flag
- * on a ftrace_rec, and -EINVAL if the new_hash tries to trace all recs.
+ * on a ftrace_rec, -EINVAL if the new_hash tries to trace all recs, and
+ * -EAGAIN if the ftrace_ops needs to enable SHARE_IPMODIFY.
* Note that old_hash and new_hash has below meanings
* - If the hash is NULL, it hits all recs (if IPMODIFY is set, this is rejected)
* - If the hash is EMPTY_HASH, it hits nothing
@@ -1875,6 +1876,7 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
struct ftrace_hash *old_hash,
struct ftrace_hash *new_hash)
{
+ bool is_ipmodify, is_direct, share_ipmodify;
struct ftrace_page *pg;
struct dyn_ftrace *rec, *end = NULL;
int in_old, in_new;
@@ -1883,7 +1885,24 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
return 0;
- if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY))
+ /*
+ * The following are all the valid combinations of is_ipmodify,
+ * is_direct, and share_ipmodify
+ *
+ * is_ipmodify is_direct share_ipmodify
+ * #1 0 0 0
+ * #2 1 0 0
+ * #3 1 1 0
+ * #4 0 1 0
+ * #5 0 1 1
+ */
+
+
+ is_ipmodify = ops->flags & FTRACE_OPS_FL_IPMODIFY;
+ is_direct = ops->flags & FTRACE_OPS_FL_DIRECT;
+
+ /* neither ipmodify nor direct, skip */
+ if (!is_ipmodify && !is_direct) /* combination #1 */
return 0;
/*
@@ -1893,6 +1912,30 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
if (!new_hash || !old_hash)
return -EINVAL;
+ share_ipmodify = ops->flags & FTRACE_OPS_FL_SHARE_IPMODIFY;
+
+ /*
+ * This ops itself doesn't do ip_modify and it can share a fentry
+ * with other ops with ipmodify, nothing to do.
+ */
+ if (!is_ipmodify && share_ipmodify) /* combination #5 */
+ return 0;
+
+ /*
+ * Only three combinations of is_ipmodify, is_direct, and
+ * share_ipmodify for the logic below:
+ * #2 live patch
+ * #3 direct with ipmodify
+ * #4 direct without ipmodify
+ *
+ * is_ipmodify is_direct share_ipmodify
+ * #2 1 0 0
+ * #3 1 1 0
+ * #4 0 1 0
+ *
+ * Only update/rollback rec->flags for is_ipmodify == 1 (#2 and #3)
+ */
+
/* Update rec->flags */
do_for_each_ftrace_rec(pg, rec) {
@@ -1906,12 +1949,18 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
continue;
if (in_new) {
- /* New entries must ensure no others are using it */
- if (rec->flags & FTRACE_FL_IPMODIFY)
- goto rollback;
- rec->flags |= FTRACE_FL_IPMODIFY;
- } else /* Removed entry */
+ if (rec->flags & FTRACE_FL_IPMODIFY) {
+ /* cannot have two ipmodify on same rec */
+ if (is_ipmodify) /* combinations #2 and #3 */
+ goto rollback;
+ /* let user enable share_ipmodify and retry */
+ return -EAGAIN; /* combination #4 */
+ } else if (is_ipmodify) {
+ rec->flags |= FTRACE_FL_IPMODIFY;
+ }
+ } else if (is_ipmodify) {/* Removed entry */
rec->flags &= ~FTRACE_FL_IPMODIFY;
+ }
} while_for_each_ftrace_rec();
return 0;
@@ -3115,14 +3164,14 @@ static inline int ops_traces_mod(struct ftrace_ops *ops)
}
/*
- * Check if the current ops references the record.
+ * Check if the current ops references the given ip.
*
* If the ops traces all functions, then it was already accounted for.
* If the ops does not trace the current record function, skip it.
* If the ops ignores the function via notrace filter, skip it.
*/
static inline bool
-ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
+ops_references_ip(struct ftrace_ops *ops, unsigned long ip)
{
/* If ops isn't enabled, ignore it */
if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
@@ -3134,16 +3183,29 @@ ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
/* The function must be in the filter */
if (!ftrace_hash_empty(ops->func_hash->filter_hash) &&
- !__ftrace_lookup_ip(ops->func_hash->filter_hash, rec->ip))
+ !__ftrace_lookup_ip(ops->func_hash->filter_hash, ip))
return false;
/* If in notrace hash, we ignore it too */
- if (ftrace_lookup_ip(ops->func_hash->notrace_hash, rec->ip))
+ if (ftrace_lookup_ip(ops->func_hash->notrace_hash, ip))
return false;
return true;
}
+/*
+ * Check if the current ops references the record.
+ *
+ * If the ops traces all functions, then it was already accounted for.
+ * If the ops does not trace the current record function, skip it.
+ * If the ops ignores the function via notrace filter, skip it.
+ */
+static inline bool
+ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
+{
+ return ops_references_ip(ops, rec->ip);
+}
+
static int ftrace_update_code(struct module *mod, struct ftrace_page *new_pgs)
{
bool init_nop = ftrace_need_init_nop();
@@ -5519,6 +5581,14 @@ int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
if (ops->flags & FTRACE_OPS_FL_ENABLED)
return -EINVAL;
+ /*
+ * if the ops does ipmodify, it cannot share the same fentry with
+ * another ftrace_ops that also does ipmodify.
+ */
+ if ((ops->flags & FTRACE_OPS_FL_IPMODIFY) &&
+ (ops->flags & FTRACE_OPS_FL_SHARE_IPMODIFY))
+ return -EINVAL;
+
hash = ops->func_hash->filter_hash;
if (ftrace_hash_empty(hash))
return -EINVAL;
@@ -7901,6 +7971,83 @@ int ftrace_is_dead(void)
return ftrace_disabled;
}
+/*
+ * When registering ftrace_ops with IPMODIFY (not direct), it is necessary
+ * to make sure it doesn't conflict with any direct ftrace_ops. If there is
+ * existing direct ftrace_ops on a kernel function being patched, call
+ * FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY on it to enable sharing.
+ *
+ * @ops: ftrace_ops being registered.
+ *
+ * Returns:
+ * 0 - @ops does not have IPMODIFY or @ops itself is DIRECT, no change
+ * needed;
+ * 1 - @ops has IPMODIFY, hold direct_mutex;
+ * -EBUSY - currently registered DIRECT ftrace_ops does not support
+ * SHARE_IPMODIFY, we need to abort the register.
+ * -EAGAIN - cannot make changes to currently registered DIRECT
+ * ftrace_ops at the moment, but we can retry later. This
+ * is needed to avoid potential deadlocks.
+ */
+static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
+ __acquires(&direct_mutex)
+{
+ struct ftrace_func_entry *entry;
+ struct ftrace_hash *hash;
+ struct ftrace_ops *op;
+ int size, i, ret;
+
+ if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY) ||
+ (ops->flags & FTRACE_OPS_FL_DIRECT))
+ return 0;
+
+ mutex_lock(&direct_mutex);
+
+ hash = ops->func_hash->filter_hash;
+ size = 1 << hash->size_bits;
+ for (i = 0; i < size; i++) {
+ hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
+ unsigned long ip = entry->ip;
+ bool found_op = false;
+
+ mutex_lock(&ftrace_lock);
+ do_for_each_ftrace_op(op, ftrace_ops_list) {
+ if (!(op->flags & FTRACE_OPS_FL_DIRECT))
+ continue;
+ if (op->flags & FTRACE_OPS_FL_SHARE_IPMODIFY)
+ break;
+ if (ops_references_ip(op, ip)) {
+ found_op = true;
+ break;
+ }
+ } while_for_each_ftrace_op(op);
+ mutex_unlock(&ftrace_lock);
+
+ if (found_op) {
+ if (!op->ops_func) {
+ ret = -EBUSY;
+ goto err_out;
+ }
+ ret = op->ops_func(op, FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY);
+ if (ret)
+ goto err_out;
+ }
+ }
+ }
+
+ /*
+ * Didn't find any overlap with any direct function, or the direct
+ * function can share with ipmodify. Hold direct_mutex to make sure
+ * this doesn't change until we are done.
+ */
+ return 1;
+
+err_out:
+ mutex_unlock(&direct_mutex);
+ return ret;
+
+}
+
/**
* register_ftrace_function - register a function for profiling
* @ops: ops structure that holds the function for profiling.
@@ -7913,17 +8060,27 @@ int ftrace_is_dead(void)
* recursive loop.
*/
int register_ftrace_function(struct ftrace_ops *ops)
+ __releases(&direct_mutex)
{
+ bool direct_mutex_locked;
int ret;
ftrace_ops_init(ops);
+ ret = prepare_direct_functions_for_ipmodify(ops);
+ if (ret < 0)
+ return ret;
+
+ direct_mutex_locked = ret == 1;
+
mutex_lock(&ftrace_lock);
ret = ftrace_startup(ops, 0);
mutex_unlock(&ftrace_lock);
+ if (direct_mutex_locked)
+ mutex_unlock(&direct_mutex);
return ret;
}
EXPORT_SYMBOL_GPL(register_ftrace_function);
--
2.30.2
This enables users of ftrace_direct_multi to specify the flags based on
the actual use case. For example, some users may not set the IPMODIFY flag.
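A hedged example of what this enables (ops and addr are placeholders): a
caller that really does change regs->ip can now request IPMODIFY itself,
while other callers simply leave it out:

    /* caller-chosen flag; previously clobbered by MULTI_FLAGS */
    ops->flags = FTRACE_OPS_FL_IPMODIFY;
    err = register_ftrace_direct_multi(ops, addr);
    /* ops->flags now also has DIRECT | SAVE_REGS OR-ed in */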
Signed-off-by: Song Liu <[email protected]>
---
kernel/trace/ftrace.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 2fcd17857ff6..afe782ae28d3 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5456,8 +5456,7 @@ int modify_ftrace_direct(unsigned long ip,
}
EXPORT_SYMBOL_GPL(modify_ftrace_direct);
-#define MULTI_FLAGS (FTRACE_OPS_FL_IPMODIFY | FTRACE_OPS_FL_DIRECT | \
- FTRACE_OPS_FL_SAVE_REGS)
+#define MULTI_FLAGS (FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS)
static int check_direct_multi(struct ftrace_ops *ops)
{
@@ -5547,7 +5546,7 @@ int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
}
ops->func = call_direct_funcs;
- ops->flags = MULTI_FLAGS;
+ ops->flags |= MULTI_FLAGS;
ops->trampoline = FTRACE_REGS_ADDR;
err = register_ftrace_function(ops);
--
2.30.2
From: Jiri Olsa <[email protected]>
Currently we call the original function by using the absolute address
given at JIT generation time. That's not usable when the trampoline is
attached to multiple functions, or when the target address changes
dynamically (in case of live patch). In such cases we need to take the return address
from the stack.
Adding support to retrieve the original function address from the stack
by adding a new BPF_TRAMP_F_ORIG_STACK flag for the arch_prepare_bpf_trampoline
function.
Basically we take the return address of the 'fentry' call:
function + 0: call fentry # stores 'function + 5' address on stack
function + 5: ...
The 'function + 5' address will be used as the address for the
original function to call.
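On x86-64, the generated call sequence then looks roughly like this (a
sketch matching the JIT change below; the trampoline saved the fentry
return address at offset 8 from its frame pointer):

    mov rax, qword ptr [rbp + 8]    ; load saved 'function + 5' address
    call rax                        ; call the original function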
Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Song Liu <[email protected]>
---
arch/x86/net/bpf_jit_comp.c | 13 +++++++++----
include/linux/bpf.h | 5 +++++
2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index f298b18a9a3d..c835a9f18fd8 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -2130,10 +2130,15 @@ int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *i
if (flags & BPF_TRAMP_F_CALL_ORIG) {
restore_regs(m, &prog, nr_args, regs_off);
- /* call original function */
- if (emit_call(&prog, orig_call, prog)) {
- ret = -EINVAL;
- goto cleanup;
+ if (flags & BPF_TRAMP_F_ORIG_STACK) {
+ emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, 8);
+ EMIT2(0xff, 0xd0); /* call *rax */
+ } else {
+ /* call original function */
+ if (emit_call(&prog, orig_call, prog)) {
+ ret = -EINVAL;
+ goto cleanup;
+ }
}
/* remember return value in a stack for bpf prog to access */
emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8e6092d0ea95..a6e06f384e81 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -733,6 +733,11 @@ struct btf_func_model {
/* Return the return value of fentry prog. Only used by bpf_struct_ops. */
#define BPF_TRAMP_F_RET_FENTRY_RET BIT(4)
+/* Get original function from stack instead of from provided direct address.
+ * Makes sense for fexit programs only.
+ */
+#define BPF_TRAMP_F_ORIG_STACK BIT(5)
+
/* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50
* bytes on x86.
*/
--
2.30.2
This allows bpf trampoline to attach to kernel functions that are also
modified by live patch. Also, move bpf trampoline to the
*_ftrace_direct_multi APIs, which allow setting different flags on the
ftrace_ops.
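The resulting behavior, summarized as a hedged comment (the two scenarios
mirror the "IPMODIFY, DIRECT, and SHARE_IPMODIFY" comment from patch 3/5):

    /*
     * livepatch first, BPF second:
     *     register_fentry() returns -EAGAIN; set tr->indirect_call,
     *     rebuild the image with BPF_TRAMP_F_ORIG_STACK, set
     *     FTRACE_OPS_FL_SHARE_IPMODIFY, and retry.
     *
     * BPF first, livepatch second:
     *     ftrace invokes tr->fops->ops_func(..., ENABLE_SHARE_IPMODIFY)
     *     with direct_mutex held; the trampoline does the same rebuild,
     *     which is why it uses mutex_trylock(&tr->mutex) and must not
     *     take direct_mutex itself.
     */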
Signed-off-by: Song Liu <[email protected]>
---
include/linux/bpf.h | 3 ++
kernel/bpf/trampoline.c | 109 +++++++++++++++++++++++++++++++++++-----
2 files changed, 99 insertions(+), 13 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a6e06f384e81..20a8ed600ca6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -44,6 +44,7 @@ struct kobject;
struct mem_cgroup;
struct module;
struct bpf_func_state;
+struct ftrace_ops;
extern struct idr btf_idr;
extern spinlock_t btf_idr_lock;
@@ -816,6 +817,7 @@ struct bpf_tramp_image {
struct bpf_trampoline {
/* hlist for trampoline_table */
struct hlist_node hlist;
+ struct ftrace_ops *fops;
/* serializes access to fields of this trampoline */
struct mutex mutex;
refcount_t refcnt;
@@ -838,6 +840,7 @@ struct bpf_trampoline {
struct bpf_tramp_image *cur_image;
u64 selector;
struct module *mod;
+ bool indirect_call;
};
struct bpf_attach_target_info {
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 93c7675f0c9e..447c788c5520 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -27,6 +27,44 @@ static struct hlist_head trampoline_table[TRAMPOLINE_TABLE_SIZE];
/* serializes access to trampoline_table */
static DEFINE_MUTEX(trampoline_mutex);
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex);
+
+static int bpf_tramp_ftrace_ops_func(struct ftrace_ops *ops, enum ftrace_ops_cmd cmd)
+{
+ struct bpf_trampoline *tr = ops->private;
+ int ret;
+
+ /*
+ * The normal locking order is
+ * tr->mutex => direct_mutex (ftrace.c) => ftrace_lock (ftrace.c)
+ *
+ * This is called from prepare_direct_functions_for_ipmodify, with
+ * direct_mutex locked. Use mutex_trylock() to avoid deadlock.
+ * Also, bpf_trampoline_update here should not lock direct_mutex.
+ */
+ if (!mutex_trylock(&tr->mutex))
+ return -EAGAIN;
+
+ switch (cmd) {
+ case FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY:
+ tr->indirect_call = true;
+ ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);
+ break;
+ case FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY:
+ tr->indirect_call = false;
+ tr->fops->flags &= ~FTRACE_OPS_FL_SHARE_IPMODIFY;
+ ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ };
+ mutex_unlock(&tr->mutex);
+ return ret;
+}
+#endif
+
bool bpf_prog_has_trampoline(const struct bpf_prog *prog)
{
enum bpf_attach_type eatype = prog->expected_attach_type;
@@ -87,7 +125,16 @@ static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
tr = kzalloc(sizeof(*tr), GFP_KERNEL);
if (!tr)
goto out;
-
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+ tr->fops = kzalloc(sizeof(struct ftrace_ops), GFP_KERNEL);
+ if (!tr->fops) {
+ kfree(tr);
+ tr = NULL;
+ goto out;
+ }
+ tr->fops->private = tr;
+ tr->fops->ops_func = bpf_tramp_ftrace_ops_func;
+#endif
tr->key = key;
INIT_HLIST_NODE(&tr->hlist);
hlist_add_head(&tr->hlist, head);
@@ -126,7 +173,7 @@ static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr)
int ret;
if (tr->func.ftrace_managed)
- ret = unregister_ftrace_direct((long)ip, (long)old_addr);
+ ret = unregister_ftrace_direct_multi(tr->fops, (long)old_addr);
else
ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, NULL);
@@ -135,15 +182,20 @@ static int unregister_fentry(struct bpf_trampoline *tr, void *old_addr)
return ret;
}
-static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_addr)
+static int modify_fentry(struct bpf_trampoline *tr, void *old_addr, void *new_addr,
+ bool lock_direct_mutex)
{
void *ip = tr->func.addr;
int ret;
- if (tr->func.ftrace_managed)
- ret = modify_ftrace_direct((long)ip, (long)old_addr, (long)new_addr);
- else
+ if (tr->func.ftrace_managed) {
+ if (lock_direct_mutex)
+ ret = modify_ftrace_direct_multi(tr->fops, (long)new_addr);
+ else
+ ret = modify_ftrace_direct_multi_nolock(tr->fops, (long)new_addr);
+ } else {
ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, old_addr, new_addr);
+ }
return ret;
}
@@ -161,10 +213,15 @@ static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
if (bpf_trampoline_module_get(tr))
return -ENOENT;
- if (tr->func.ftrace_managed)
- ret = register_ftrace_direct((long)ip, (long)new_addr);
- else
+ if (tr->func.ftrace_managed) {
+ ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 0, 0);
+ ret = register_ftrace_direct_multi(tr->fops, (long)new_addr);
+ if (ret)
+ ftrace_set_filter_ip(tr->fops, (unsigned long)ip, 1, 0);
+
+ } else {
ret = bpf_arch_text_poke(ip, BPF_MOD_CALL, NULL, new_addr);
+ }
if (ret)
bpf_trampoline_module_put(tr);
@@ -330,7 +387,7 @@ static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key, u32 idx)
return ERR_PTR(err);
}
-static int bpf_trampoline_update(struct bpf_trampoline *tr)
+static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
{
struct bpf_tramp_image *im;
struct bpf_tramp_links *tlinks;
@@ -363,20 +420,45 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
if (ip_arg)
flags |= BPF_TRAMP_F_IP_ARG;
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+again:
+ if (tr->indirect_call)
+ flags |= BPF_TRAMP_F_ORIG_STACK;
+#endif
+
err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
&tr->func.model, flags, tlinks,
tr->func.addr);
if (err < 0)
goto out;
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+ if (tr->indirect_call)
+ tr->fops->flags |= FTRACE_OPS_FL_SHARE_IPMODIFY;
+#endif
+
WARN_ON(tr->cur_image && tr->selector == 0);
WARN_ON(!tr->cur_image && tr->selector);
if (tr->cur_image)
/* progs already running at this address */
- err = modify_fentry(tr, tr->cur_image->image, im->image);
+ err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
else
/* first time registering */
err = register_fentry(tr, im->image);
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
+ if (err == -EAGAIN) {
+ if (WARN_ON_ONCE(tr->indirect_call))
+ goto out;
+ /* should only retry on the first register */
+ if (WARN_ON_ONCE(tr->cur_image))
+ goto out;
+ tr->indirect_call = true;
+ tr->fops->func = NULL;
+ tr->fops->trampoline = 0;
+ goto again;
+ }
+#endif
if (err)
goto out;
if (tr->cur_image)
@@ -460,7 +542,7 @@ int bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_trampoline
hlist_add_head(&link->tramp_hlist, &tr->progs_hlist[kind]);
tr->progs_cnt[kind]++;
- err = bpf_trampoline_update(tr);
+ err = bpf_trampoline_update(tr, true /* lock_direct_mutex */);
if (err) {
hlist_del_init(&link->tramp_hlist);
tr->progs_cnt[kind]--;
@@ -487,7 +569,7 @@ int bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_trampolin
}
hlist_del_init(&link->tramp_hlist);
tr->progs_cnt[kind]--;
- err = bpf_trampoline_update(tr);
+ err = bpf_trampoline_update(tr, true /* lock_direct_mutex */);
out:
mutex_unlock(&tr->mutex);
return err;
@@ -535,6 +617,7 @@ void bpf_trampoline_put(struct bpf_trampoline *tr)
* multiple rcu callbacks.
*/
hlist_del(&tr->hlist);
+ kfree(tr->fops);
kfree(tr);
out:
mutex_unlock(&trampoline_mutex);
--
2.30.2
On Thu, Jun 02, 2022 at 12:37:04PM -0700, Song Liu wrote:
> live patch and BPF trampoline (kfunc/kretfunc in bpftrace) are important
> features for modern systems. Currently, it is not possible to use live
> patch and BPF trampoline on the same kernel function at the same time.
> This is because of the restriction that only one ftrace_ops with flag
> FTRACE_OPS_FL_IPMODIFY may be attached to the same kernel function.
is it hard to make live patch test? would be great to have
selftest for this, or at least sample module that does that,
there are already sample modules for direct interface
>
> BPF trampoline uses direct ftrace_ops, which assumes IPMODIFY. However,
> not all direct ftrace_ops would overwrite the actual function. This means
> it is possible to have a non-IPMODIFY direct ftrace_ops to share the same
> kernel function with an IPMODIFY ftrace_ops.
>
> Introduce FTRACE_OPS_FL_SHARE_IPMODIFY, which allows the direct ftrace_ops
> to share with IPMODIFY ftrace_ops. With FTRACE_OPS_FL_SHARE_IPMODIFY flag
> set, the direct ftrace_ops would call the target function picked by the
> IPMODIFY ftrace_ops.
>
> Comment "IPMODIFY, DIRECT, and SHARE_IPMODIFY" in include/linux/ftrace.h
> contains more information about how SHARE_IPMODIFY interacts with IPMODIFY
> and DIRECT flags.
>
> Signed-off-by: Song Liu <[email protected]>
> ---
> include/linux/ftrace.h | 74 +++++++++++++++++
> kernel/trace/ftrace.c | 179 ++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 242 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 9023bf69f675..bfacf608de9c 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -98,6 +98,18 @@ static inline int ftrace_mod_get_kallsym(unsigned int symnum, unsigned long *val
> }
> #endif
>
> +/*
> + * FTRACE_OPS_CMD_* commands allow the ftrace core logic to request changes
> + * to a ftrace_ops.
> + *
> + * ENABLE_SHARE_IPMODIFY - enable FTRACE_OPS_FL_SHARE_IPMODIFY.
> + * DISABLE_SHARE_IPMODIFY - disable FTRACE_OPS_FL_SHARE_IPMODIFY.
> + */
> +enum ftrace_ops_cmd {
> + FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY,
> + FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY,
> +};
> +
> #ifdef CONFIG_FUNCTION_TRACER
>
> extern int ftrace_enabled;
> @@ -189,6 +201,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops);
> * ftrace_enabled.
> * DIRECT - Used by the direct ftrace_ops helper for direct functions
> * (internal ftrace only, should not be used by others)
> + * SHARE_IPMODIFY - For direct ftrace_ops only. Set when the direct function
> + * is ready to share the same kernel function with an IPMODIFY function
> + * (live patch, etc.).
> */
> enum {
> FTRACE_OPS_FL_ENABLED = BIT(0),
> @@ -209,8 +224,66 @@ enum {
> FTRACE_OPS_FL_TRACE_ARRAY = BIT(15),
> FTRACE_OPS_FL_PERMANENT = BIT(16),
> FTRACE_OPS_FL_DIRECT = BIT(17),
> + FTRACE_OPS_FL_SHARE_IPMODIFY = BIT(18),
> };
>
> +/*
> + * IPMODIFY, DIRECT, and SHARE_IPMODIFY.
> + *
> + * ftrace provides IPMODIFY flag for users to replace existing kernel
> + * function with a different version. This is achieved by setting regs->ip.
> + * The top user of IPMODIFY is live patch.
> + *
> + * DIRECT allows user to load custom trampoline on top of ftrace. DIRECT
> + * ftrace does not overwrite regs->ip. Instead, the custom trampoline is
> + * saved separately (for example, orig_ax on x86). The top user of DIRECT
> + * is bpf trampoline.
> + *
> + * It is not super rare to have both live patch and bpf trampoline on the
> + * same kernel function. Therefore, it is necessary to allow the two to work
> + * with each other. Given that IPMODIFY and DIRECT target addresses are
> + * saved separately, this is feasible, but we need to be careful.
> + *
> + * The policy between IPMODIFY and DIRECT is:
> + *
> + * 1. Each kernel function can only have one IPMODIFY ftrace_ops;
> + * 2. Each kernel function can only have one DIRECT ftrace_ops;
> + * 3. DIRECT ftrace_ops may have IPMODIFY or not;
> + * 4. Each kernel function may have one non-DIRECT IPMODIFY ftrace_ops,
> + * and one non-IPMODIFY DIRECT ftrace_ops at the same time. This
> + * requires support from the DIRECT ftrace_ops. Specifically, the
> + * DIRECT trampoline should call the kernel function at regs->ip.
> + * If the DIRECT ftrace_ops supports sharing a function with ftrace_ops
> + * with IPMODIFY, it should set flag SHARE_IPMODIFY.
> + *
> + * Some DIRECT ftrace_ops have an option to enable SHARE_IPMODIFY or not.
> + * Usually, the non-SHARE_IPMODIFY option gives better performance. To take
> + * advantage of this performance benefit, it is necessary to enable
> + * SHARE_IPMODIFY only when it is on the same function as an IPMODIFY
> + * ftrace_ops. There are two cases to consider:
> + *
> + * 1. IPMODIFY ftrace_ops is registered first. When the (non-IPMODIFY, and
> + * non-SHARE_IPMODIFY) DIRECT ftrace_ops is registered later,
> + * register_ftrace_direct_multi() returns -EAGAIN. If the user of
> + * the DIRECT ftrace_ops can support SHARE_IPMODIFY, it should enable
> + * SHARE_IPMODIFY and retry.
> + * 2. (non-IPMODIFY, and non-SHARE_IPMODIFY) DIRECT ftrace_ops is
> + * registered first. When the IPMODIFY ftrace_ops is registered later,
> + * it is necessary to ask the direct ftrace_ops to enable
> + * SHARE_IPMODIFY support. This is achieved via ftrace_ops->ops_func
> + * cmd=FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY. For more details on this
> + * condition, check out prepare_direct_functions_for_ipmodify().
> + */
> +
> +/*
> + * For most ftrace_ops_cmd,
> + * Returns:
> + * 0 - Success.
> + * -EBUSY - The operation cannot proceed.
> + * -EAGAIN - The operation cannot proceed temporarily.
> + */
> +typedef int (*ftrace_ops_func_t)(struct ftrace_ops *op, enum ftrace_ops_cmd cmd);
> +
> #ifdef CONFIG_DYNAMIC_FTRACE
> /* The hash used to know what functions callbacks trace */
> struct ftrace_ops_hash {
> @@ -253,6 +326,7 @@ struct ftrace_ops {
> unsigned long trampoline;
> unsigned long trampoline_size;
> struct list_head list;
> + ftrace_ops_func_t ops_func;
> #endif
> };
>
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 6a419f6bbbf0..868bbc753803 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1865,7 +1865,8 @@ static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops,
> /*
> * Try to update IPMODIFY flag on each ftrace_rec. Return 0 if it is OK
> * or no-needed to update, -EBUSY if it detects a conflict of the flag
> - * on a ftrace_rec, and -EINVAL if the new_hash tries to trace all recs.
> + * on a ftrace_rec, -EINVAL if the new_hash tries to trace all recs, and
> + * -EAGAIN if the ftrace_ops needs to enable SHARE_IPMODIFY.
> * Note that old_hash and new_hash has below meanings
> * - If the hash is NULL, it hits all recs (if IPMODIFY is set, this is rejected)
> * - If the hash is EMPTY_HASH, it hits nothing
> @@ -1875,6 +1876,7 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
> struct ftrace_hash *old_hash,
> struct ftrace_hash *new_hash)
> {
> + bool is_ipmodify, is_direct, share_ipmodify;
> struct ftrace_page *pg;
> struct dyn_ftrace *rec, *end = NULL;
> int in_old, in_new;
> @@ -1883,7 +1885,24 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
> if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
> return 0;
>
> - if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY))
> + /*
> + * The following are all the valid combinations of is_ipmodify,
> + * is_direct, and share_ipmodify
> + *
> + * is_ipmodify is_direct share_ipmodify
> + * #1 0 0 0
> + * #2 1 0 0
> + * #3 1 1 0
> + * #4 0 1 0
> + * #5 0 1 1
> + */
> +
> +
> + is_ipmodify = ops->flags & FTRACE_OPS_FL_IPMODIFY;
> + is_direct = ops->flags & FTRACE_OPS_FL_DIRECT;
> +
> + /* neither ipmodify nor direct, skip */
> + if (!is_ipmodify && !is_direct) /* combination #1 */
> return 0;
>
> /*
> @@ -1893,6 +1912,30 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
> if (!new_hash || !old_hash)
> return -EINVAL;
>
> + share_ipmodify = ops->flags & FTRACE_OPS_FL_SHARE_IPMODIFY;
> +
> + /*
> + * This ops itself doesn't do ip_modify and it can share a fentry
> + * with other ops with ipmodify, nothing to do.
> + */
> + if (!is_ipmodify && share_ipmodify) /* combination #5 */
> + return 0;
> +
> + /*
> + * Only three combinations of is_ipmodify, is_direct, and
> + * share_ipmodify for the logic below:
> + * #2 live patch
> + * #3 direct with ipmodify
> + * #4 direct without ipmodify
> + *
> + * is_ipmodify is_direct share_ipmodify
> + * #2 1 0 0
> + * #3 1 1 0
> + * #4 0 1 0
> + *
> + * Only update/rollback rec->flags for is_ipmodify == 1 (#2 and #3)
> + */
> +
> /* Update rec->flags */
> do_for_each_ftrace_rec(pg, rec) {
>
> @@ -1906,12 +1949,18 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
> continue;
>
> if (in_new) {
> - /* New entries must ensure no others are using it */
> - if (rec->flags & FTRACE_FL_IPMODIFY)
> - goto rollback;
> - rec->flags |= FTRACE_FL_IPMODIFY;
> - } else /* Removed entry */
> + if (rec->flags & FTRACE_FL_IPMODIFY) {
> + /* cannot have two ipmodify on same rec */
> + if (is_ipmodify) /* combinations #2 and #3 */
> + goto rollback;
> + /* let user enable share_ipmodify and retry */
> + return -EAGAIN; /* combination #4 */
> + } else if (is_ipmodify) {
> + rec->flags |= FTRACE_FL_IPMODIFY;
> + }
> + } else if (is_ipmodify) {/* Removed entry */
> rec->flags &= ~FTRACE_FL_IPMODIFY;
> + }
> } while_for_each_ftrace_rec();
>
> return 0;
> @@ -3115,14 +3164,14 @@ static inline int ops_traces_mod(struct ftrace_ops *ops)
> }
>
> /*
> - * Check if the current ops references the record.
> + * Check if the current ops references the given ip.
> *
> * If the ops traces all functions, then it was already accounted for.
> * If the ops does not trace the current record function, skip it.
> * If the ops ignores the function via notrace filter, skip it.
> */
> static inline bool
> -ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
> +ops_references_ip(struct ftrace_ops *ops, unsigned long ip)
> {
> /* If ops isn't enabled, ignore it */
> if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
> @@ -3134,16 +3183,29 @@ ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
>
> /* The function must be in the filter */
> if (!ftrace_hash_empty(ops->func_hash->filter_hash) &&
> - !__ftrace_lookup_ip(ops->func_hash->filter_hash, rec->ip))
> + !__ftrace_lookup_ip(ops->func_hash->filter_hash, ip))
> return false;
>
> /* If in notrace hash, we ignore it too */
> - if (ftrace_lookup_ip(ops->func_hash->notrace_hash, rec->ip))
> + if (ftrace_lookup_ip(ops->func_hash->notrace_hash, ip))
> return false;
>
> return true;
> }
>
> +/*
> + * Check if the current ops references the record.
> + *
> + * If the ops traces all functions, then it was already accounted for.
> + * If the ops does not trace the current record function, skip it.
> + * If the ops ignores the function via notrace filter, skip it.
> + */
> +static inline bool
> +ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
> +{
> + return ops_references_ip(ops, rec->ip);
> +}
> +
> static int ftrace_update_code(struct module *mod, struct ftrace_page *new_pgs)
> {
> bool init_nop = ftrace_need_init_nop();
> @@ -5519,6 +5581,14 @@ int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
> if (ops->flags & FTRACE_OPS_FL_ENABLED)
> return -EINVAL;
>
> + /*
> + * if the ops does ipmodify, it cannot share the same fentry with
> + * another ftrace_ops that also does ipmodify.
> + */
> + if ((ops->flags & FTRACE_OPS_FL_IPMODIFY) &&
> + (ops->flags & FTRACE_OPS_FL_SHARE_IPMODIFY))
> + return -EINVAL;
> +
> hash = ops->func_hash->filter_hash;
> if (ftrace_hash_empty(hash))
> return -EINVAL;
> @@ -7901,6 +7971,83 @@ int ftrace_is_dead(void)
> return ftrace_disabled;
> }
>
> +/*
> + * When registering ftrace_ops with IPMODIFY (not direct), it is necessary
> + * to make sure it doesn't conflict with any direct ftrace_ops. If there is
> + * existing direct ftrace_ops on a kernel function being patched, call
> + * FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY on it to enable sharing.
> + *
> + * @ops: ftrace_ops being registered.
> + *
> + * Returns:
> + * 0 - @ops does not have IPMODIFY or @ops itself is DIRECT, no change
> + * needed;
> + * 1 - @ops has IPMODIFY, hold direct_mutex;
> + * -EBUSY - currently registered DIRECT ftrace_ops does not support
> + * SHARE_IPMODIFY, we need to abort the register.
> + * -EAGAIN - cannot make changes to currently registered DIRECT
> + * ftrace_ops at the moment, but we can retry later. This
> + * is needed to avoid potential deadlocks.
> + */
> +static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
> + __acquires(&direct_mutex)
> +{
> + struct ftrace_func_entry *entry;
> + struct ftrace_hash *hash;
> + struct ftrace_ops *op;
> + int size, i, ret;
> +
> + if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY) ||
> + (ops->flags & FTRACE_OPS_FL_DIRECT))
> + return 0;
> +
> + mutex_lock(&direct_mutex);
> +
> + hash = ops->func_hash->filter_hash;
> + size = 1 << hash->size_bits;
> + for (i = 0; i < size; i++) {
> + hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
> + unsigned long ip = entry->ip;
> + bool found_op = false;
> +
> + mutex_lock(&ftrace_lock);
> + do_for_each_ftrace_op(op, ftrace_ops_list) {
would it be better to iterate direct_functions hash instead?
all the registered direct functions should be there
hm maybe you would not have the 'op' then..
> + if (!(op->flags & FTRACE_OPS_FL_DIRECT))
> + continue;
> + if (op->flags & FTRACE_OPS_FL_SHARE_IPMODIFY)
> + break;
> + if (ops_references_ip(op, ip)) {
> + found_op = true;
> + break;
> + }
> + } while_for_each_ftrace_op(op);
> + mutex_unlock(&ftrace_lock);
so the 'op' can't go away because it's direct and we hold direct_mutex
even though we unlocked ftrace_lock, right?
> +
> + if (found_op) {
> + if (!op->ops_func) {
> + ret = -EBUSY;
> + goto err_out;
> + }
> + ret = op->ops_func(op, FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY);
I did not find call with FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY flag
jirka
> + if (ret)
> + goto err_out;
> + }
> + }
> + }
> +
> + /*
> + * Didn't find any overlap with any direct function, or the direct
> + * function can share with ipmodify. Hold direct_mutex to make sure
> + * this doesn't change until we are done.
> + */
> + return 1;
> +
> +err_out:
> + mutex_unlock(&direct_mutex);
> + return ret;
> +
> +}
> +
> /**
> * register_ftrace_function - register a function for profiling
> * @ops: ops structure that holds the function for profiling.
> @@ -7913,17 +8060,27 @@ int ftrace_is_dead(void)
> * recursive loop.
> */
> int register_ftrace_function(struct ftrace_ops *ops)
> + __releases(&direct_mutex)
> {
> + bool direct_mutex_locked;
> int ret;
>
> ftrace_ops_init(ops);
>
> + ret = prepare_direct_functions_for_ipmodify(ops);
> + if (ret < 0)
> + return ret;
> +
> + direct_mutex_locked = ret == 1;
> +
> mutex_lock(&ftrace_lock);
>
> ret = ftrace_startup(ops, 0);
>
> mutex_unlock(&ftrace_lock);
>
> + if (direct_mutex_locked)
> + mutex_unlock(&direct_mutex);
> return ret;
> }
> EXPORT_SYMBOL_GPL(register_ftrace_function);
> --
> 2.30.2
>
> On Jun 6, 2022, at 1:20 AM, Jiri Olsa <[email protected]> wrote:
>
> On Thu, Jun 02, 2022 at 12:37:04PM -0700, Song Liu wrote:
>> live patch and BPF trampoline (kfunc/kretfunc in bpftrace) are important
>> features for modern systems. Currently, it is not possible to use live
>> patch and BPF trampoline on the same kernel function at the same time.
>> This is because of the restriction that only one ftrace_ops with flag
>> FTRACE_OPS_FL_IPMODIFY may be attached to the same kernel function.
>
> is it hard to make live patch test? would be great to have
> selftest for this, or at least sample module that does that,
> there are already sample modules for direct interface
It is possible, but a little tricky. I can add some selftests or
samples in a later version.
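For reference, the existing samples/livepatch/livepatch-sample.c boils down
to roughly the following (abbreviated sketch from memory; it patches
cmdline_proc_show() in vmlinux):

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/livepatch.h>
    #include <linux/seq_file.h>

    /* replacement for the kernel's cmdline_proc_show() */
    static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
    {
            seq_printf(m, "%s\n", "this has been live patched");
            return 0;
    }

    static struct klp_func funcs[] = {
            {
                    .old_name = "cmdline_proc_show",
                    .new_func = livepatch_cmdline_proc_show,
            }, { }
    };

    static struct klp_object objs[] = {
            {
                    /* name being NULL means vmlinux */
                    .funcs = funcs,
            }, { }
    };

    static struct klp_patch patch = {
            .mod = THIS_MODULE,
            .objs = objs,
    };

    static int livepatch_init(void)
    {
            return klp_enable_patch(&patch);
    }

    static void livepatch_exit(void)
    {
    }

    module_init(livepatch_init);
    module_exit(livepatch_exit);
    MODULE_LICENSE("GPL");
    MODULE_INFO(livepatch, "Y");

Loading something like this together with a kfunc/fexit program attached to
cmdline_proc_show would exercise the new code paths in both orders.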
>
>>
>> BPF trampoline uses direct ftrace_ops, which assumes IPMODIFY. However,
>> not all direct ftrace_ops would overwrite the actual function. This means
>> it is possible to have a non-IPMODIFY direct ftrace_ops to share the same
>> kernel function with an IPMODIFY ftrace_ops.
>>
>> Introduce FTRACE_OPS_FL_SHARE_IPMODIFY, which allows the direct ftrace_ops
>> to share with IPMODIFY ftrace_ops. With FTRACE_OPS_FL_SHARE_IPMODIFY flag
>> set, the direct ftrace_ops would call the target function picked by the
>> IPMODIFY ftrace_ops.
>>
>> Comment "IPMODIFY, DIRECT, and SHARE_IPMODIFY" in include/linux/ftrace.h
>> contains more information about how SHARE_IPMODIFY interacts with IPMODIFY
>> and DIRECT flags.
>>
>> Signed-off-by: Song Liu <[email protected]>
>>
[...]
>> +static int prepare_direct_functions_for_ipmodify(struct ftrace_ops *ops)
>> + __acquires(&direct_mutex)
>> +{
>> + struct ftrace_func_entry *entry;
>> + struct ftrace_hash *hash;
>> + struct ftrace_ops *op;
>> + int size, i, ret;
>> +
>> + if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY) ||
>> + (ops->flags & FTRACE_OPS_FL_DIRECT))
>> + return 0;
>> +
>> + mutex_lock(&direct_mutex);
>> +
>> + hash = ops->func_hash->filter_hash;
>> + size = 1 << hash->size_bits;
>> + for (i = 0; i < size; i++) {
>> + hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
>> + unsigned long ip = entry->ip;
>> + bool found_op = false;
>> +
>> + mutex_lock(&ftrace_lock);
>> + do_for_each_ftrace_op(op, ftrace_ops_list) {
>
> would it be better to iterate direct_functions hash instead?
> all the registered direct functions should be there
>
> hm maybe you would not have the 'op' then..
Yeah, we need ftrace_ops here.
>
>> + if (!(op->flags & FTRACE_OPS_FL_DIRECT))
>> + continue;
>> + if (op->flags & FTRACE_OPS_FL_SHARE_IPMODIFY)
>> + break;
>> + if (ops_references_ip(op, ip)) {
>> + found_op = true;
>> + break;
>> + }
>> + } while_for_each_ftrace_op(op);
>> + mutex_unlock(&ftrace_lock);
>
> so the 'op' can't go away because it's direct and we hold direct_mutex
> even though we unlocked ftrace_lock, right?
Yep, we need to hold direct_mutex here.
>
>> +
>> + if (found_op) {
>> + if (!op->ops_func) {
>> + ret = -EBUSY;
>> + goto err_out;
>> + }
>> + ret = op->ops_func(op, FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY);
>
> I did not find call with FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY flag
We don't have it yet, and I think we probably don't really need it.
AFAICT, unloading live patch is not a common operation. So not
recovering the performance of !SHARE_IPMODIFY should be acceptable
in those cases. That said, I can add that path if we think it is
important.
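If we add it later, it would probably look something like this (hypothetical
sketch, mirroring prepare_direct_functions_for_ipmodify() but run when the
IPMODIFY ops is unregistered):

    /*
     * For each ip the departing IPMODIFY ops filtered, find the
     * sharing DIRECT ops and let it drop back to the faster
     * non-sharing trampoline.
     */
    if (found_op && op->ops_func)
            op->ops_func(op, FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY);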
Thanks,
Song
[...]
Hi Steven,
> On Jun 2, 2022, at 12:37 PM, Song Liu <[email protected]> wrote:
>
> Changes v1 => v2:
> 1. Fix build errors for different configs. (kernel test robot)
>
> Kernel Live Patch (livepatch, or klp) and bpf trampoline are important
> features for modern systems. This set allows the two to work on the same
> kernel function at the same time.
>
> live patch uses ftrace with IPMODIFY, while bpf trampoline uses direct
> ftrace. Existing policy does not allow the two to attach to the same kernel
> function. This is changed by fine-tuning the ftrace IPMODIFY policy to allow
> one non-DIRECT IPMODIFY ftrace_ops and one non-IPMODIFY DIRECT ftrace_ops
> on the same kernel function at the same time. Please see 3/5 for more
> details on this.
>
> Note that one of the constraints here is to let bpf trampoline use a direct
> call when it is not working on the same function as live patch. This is
> achieved by allowing ftrace code to ask bpf trampoline to make changes.
Could you please share your comments on this set?
Thanks!
Song
>
> Jiri Olsa (1):
> bpf, x64: Allow to use caller address from stack
>
> Song Liu (4):
> ftrace: allow customized flags for ftrace_direct_multi ftrace_ops
> ftrace: add modify_ftrace_direct_multi_nolock
> ftrace: introduce FTRACE_OPS_FL_SHARE_IPMODIFY
> bpf: trampoline: support FTRACE_OPS_FL_SHARE_IPMODIFY
>
> arch/x86/net/bpf_jit_comp.c | 13 +-
> include/linux/bpf.h | 8 ++
> include/linux/ftrace.h | 79 +++++++++++
> kernel/bpf/trampoline.c | 109 +++++++++++++--
> kernel/trace/ftrace.c | 269 +++++++++++++++++++++++++++++++-----
> 5 files changed, 424 insertions(+), 54 deletions(-)
>
> --
> 2.30.2
On Thu, 2 Jun 2022 12:37:06 -0700
Song Liu <[email protected]> wrote:
> --- a/kernel/bpf/trampoline.c
> +++ b/kernel/bpf/trampoline.c
> @@ -27,6 +27,44 @@ static struct hlist_head trampoline_table[TRAMPOLINE_TABLE_SIZE];
> /* serializes access to trampoline_table */
> static DEFINE_MUTEX(trampoline_mutex);
>
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> +static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex);
> +
> +static int bpf_tramp_ftrace_ops_func(struct ftrace_ops *ops, enum ftrace_ops_cmd cmd)
> +{
> + struct bpf_trampoline *tr = ops->private;
> + int ret;
> +
> + /*
> + * The normal locking order is
> + * tr->mutex => direct_mutex (ftrace.c) => ftrace_lock (ftrace.c)
> + *
> + * This is called from prepare_direct_functions_for_ipmodify, with
> + * direct_mutex locked. Use mutex_trylock() to avoid deadlock.
> + * Also, bpf_trampoline_update here should not lock direct_mutex.
> + */
> + if (!mutex_trylock(&tr->mutex))
Can you comment here that returning -EAGAIN will not cause this to repeat.
That it will change things where the next try will not return -EAGAIN?
> + return -EAGAIN;
> +
> + switch (cmd) {
> + case FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY:
> + tr->indirect_call = true;
> + ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);
> + break;
> + case FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY:
> + tr->indirect_call = false;
> + tr->fops->flags &= ~FTRACE_OPS_FL_SHARE_IPMODIFY;
> + ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);
> + break;
> + default:
> + ret = -EINVAL;
> + break;
> + };
> + mutex_unlock(&tr->mutex);
> + return ret;
> +}
> +#endif
> +
>
> @@ -330,7 +387,7 @@ static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key, u32 idx)
> return ERR_PTR(err);
> }
>
> -static int bpf_trampoline_update(struct bpf_trampoline *tr)
> +static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
> {
> struct bpf_tramp_image *im;
> struct bpf_tramp_links *tlinks;
> @@ -363,20 +420,45 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
> if (ip_arg)
> flags |= BPF_TRAMP_F_IP_ARG;
>
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> +again:
> + if (tr->indirect_call)
> + flags |= BPF_TRAMP_F_ORIG_STACK;
> +#endif
> +
> err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
> &tr->func.model, flags, tlinks,
> tr->func.addr);
> if (err < 0)
> goto out;
>
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> + if (tr->indirect_call)
> + tr->fops->flags |= FTRACE_OPS_FL_SHARE_IPMODIFY;
> +#endif
> +
> WARN_ON(tr->cur_image && tr->selector == 0);
> WARN_ON(!tr->cur_image && tr->selector);
> if (tr->cur_image)
> /* progs already running at this address */
> - err = modify_fentry(tr, tr->cur_image->image, im->image);
> + err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
> else
> /* first time registering */
> err = register_fentry(tr, im->image);
> +
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> + if (err == -EAGAIN) {
> + if (WARN_ON_ONCE(tr->indirect_call))
> + goto out;
> + /* should only retry on the first register */
> + if (WARN_ON_ONCE(tr->cur_image))
> + goto out;
> + tr->indirect_call = true;
> + tr->fops->func = NULL;
> + tr->fops->trampoline = 0;
> + goto again;
I'm assuming that the above will prevent a return of -EAGAIN again. If
it can, then this could turn into a deadlock.
Can you please comment that?
Thanks,
-- Steve
> + }
> +#endif
> if (err)
> goto out;
> if (tr->cur_image)
> @@ -460,7 +542,7 @@ int bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_trampoline
>
> hlist_add_head(&link->tramp_hlist, &tr->progs_hlist[kind]);
> tr->progs_cnt[kind]++;
> - err = bpf_trampoline_update(tr);
> + err = bpf_trampoline_update(tr, true /* lock_direct_mutex */);
> if (err) {
> hlist_del_init(&link->tramp_hlist);
> tr->progs_cnt[kind]--;
> @@ -487,7 +569,7 @@ int bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_trampolin
> }
> hlist_del_init(&link->tramp_hlist);
> tr->progs_cnt[kind]--;
> - err = bpf_trampoline_update(tr);
> + err = bpf_trampoline_update(tr, true /* lock_direct_mutex */);
> out:
> mutex_unlock(&tr->mutex);
> return err;
> @@ -535,6 +617,7 @@ void bpf_trampoline_put(struct bpf_trampoline *tr)
> * multiple rcu callbacks.
> */
> hlist_del(&tr->hlist);
> + kfree(tr->fops);
> kfree(tr);
> out:
> mutex_unlock(&trampoline_mutex);
On Wed, 6 Jul 2022 21:37:52 +0000
Song Liu <[email protected]> wrote:
> > Can you comment here that returning -EAGAIN will not cause this to repeat.
> > That it will change things where the next try will not return -EAGAIN?
>
> Hmm.. this is not the guarantee here. This conflict is a real race condition
> that an IPMODIFY function (i.e. livepatch) is being registered at the same time
> when something else, for example bpftrace, is updating the BPF trampoline.
>
> This EAGAIN will propagate to the user of the IPMODIFY function (i.e. livepatch),
> and we need to retry there. In the case of livepatch, the retry is initiated
> from user space.
We need to be careful here then. If there's a userspace application that
runs at real-time and does a:
do {
errno = 0;
regsiter_bpf();
} while (errno != -EAGAIN);
it could in theory preempt the owner of the lock and never make any
progress.
-- Steve
> On Jul 6, 2022, at 12:38 PM, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 2 Jun 2022 12:37:06 -0700
> Song Liu <[email protected]> wrote:
>
>
>> --- a/kernel/bpf/trampoline.c
>> +++ b/kernel/bpf/trampoline.c
>> @@ -27,6 +27,44 @@ static struct hlist_head trampoline_table[TRAMPOLINE_TABLE_SIZE];
>> /* serializes access to trampoline_table */
>> static DEFINE_MUTEX(trampoline_mutex);
>>
>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>> +static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex);
>> +
>> +static int bpf_tramp_ftrace_ops_func(struct ftrace_ops *ops, enum ftrace_ops_cmd cmd)
>> +{
>> + struct bpf_trampoline *tr = ops->private;
>> + int ret;
>> +
>> + /*
>> + * The normal locking order is
>> + * tr->mutex => direct_mutex (ftrace.c) => ftrace_lock (ftrace.c)
>> + *
>> + * This is called from prepare_direct_functions_for_ipmodify, with
>> + * direct_mutex locked. Use mutex_trylock() to avoid deadlock.
>> + * Also, bpf_trampoline_update here should not lock direct_mutex.
>> + */
>> + if (!mutex_trylock(&tr->mutex))
>
> Can you comment here that returning -EAGAIN will not cause this to repeat.
> That it will change things where the next try will not return -EAGAIN?
Hmm.. this is not the guarantee here. This conflict is a real race condition
that an IPMODIFY function (i.e. livepatch) is being registered at the same time
when something else, for example bpftrace, is updating the BPF trampoline.
This EAGAIN will propagate to the user of the IPMODIFY function (i.e. livepatch),
and we need to retry there. In the case of livepatch, the retry is initiated
from user space.
>
>> + return -EAGAIN;
>> +
>> + switch (cmd) {
>> + case FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY:
>> + tr->indirect_call = true;
>> + ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);
>> + break;
>> + case FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY:
>> + tr->indirect_call = false;
>> + tr->fops->flags &= ~FTRACE_OPS_FL_SHARE_IPMODIFY;
>> + ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);
>> + break;
>> + default:
>> + ret = -EINVAL;
>> + break;
>> + };
>> + mutex_unlock(&tr->mutex);
>> + return ret;
>> +}
>> +#endif
>> +
>>
>
>
>> @@ -330,7 +387,7 @@ static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key, u32 idx)
>> return ERR_PTR(err);
>> }
>>
>> -static int bpf_trampoline_update(struct bpf_trampoline *tr)
>> +static int bpf_trampoline_update(struct bpf_trampoline *tr, bool lock_direct_mutex)
>> {
>> struct bpf_tramp_image *im;
>> struct bpf_tramp_links *tlinks;
>> @@ -363,20 +420,45 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr)
>> if (ip_arg)
>> flags |= BPF_TRAMP_F_IP_ARG;
>>
>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>> +again:
>> + if (tr->indirect_call)
>> + flags |= BPF_TRAMP_F_ORIG_STACK;
>> +#endif
>> +
>> err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
>> &tr->func.model, flags, tlinks,
>> tr->func.addr);
>> if (err < 0)
>> goto out;
>>
>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>> + if (tr->indirect_call)
>> + tr->fops->flags |= FTRACE_OPS_FL_SHARE_IPMODIFY;
>> +#endif
>> +
>> WARN_ON(tr->cur_image && tr->selector == 0);
>> WARN_ON(!tr->cur_image && tr->selector);
>> if (tr->cur_image)
>> /* progs already running at this address */
>> - err = modify_fentry(tr, tr->cur_image->image, im->image);
>> + err = modify_fentry(tr, tr->cur_image->image, im->image, lock_direct_mutex);
>> else
>> /* first time registering */
>> err = register_fentry(tr, im->image);
>> +
>> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
>> + if (err == -EAGAIN) {
>> + if (WARN_ON_ONCE(tr->indirect_call))
>> + goto out;
>> + /* should only retry on the first register */
>> + if (WARN_ON_ONCE(tr->cur_image))
>> + goto out;
>> + tr->indirect_call = true;
>> + tr->fops->func = NULL;
>> + tr->fops->trampoline = 0;
>> + goto again;
>
> I'm assuming that the above will prevent a return of -EAGAIN again. If
> it can, then this could turn into a deadlock.
>
> Can you please comment that?
This is a different case. This EAGAIN happens when the live patch came first,
and we register bpf trampoline later. By enabling tr->indirect_call, we should
not get EAGAIN from register_fentry.
I will add more comments for both cases.
Thanks,
Song
> On Jul 6, 2022, at 2:40 PM, Steven Rostedt <[email protected]> wrote:
>
> On Wed, 6 Jul 2022 21:37:52 +0000
> Song Liu <[email protected]> wrote:
>
>>> Can you comment here that returning -EAGAIN will not cause this to repeat.
>>> That it will change things where the next try will not return -EAGAIN?
>>
>> Hmm.. this is not the guarantee here. This conflict is a real race condition
>> that an IPMODIFY function (i.e. livepatch) is being registered at the same time
>> when something else, for example bpftrace, is updating the BPF trampoline.
>>
>> This EAGAIN will propagate to the user of the IPMODIFY function (i.e. livepatch),
>> and we need to retry there. In the case of livepatch, the retry is initiated
>> from user space.
>
> We need to be careful here then. If there's a userspace application that
> runs at real-time and does a:
>
> do {
> errno = 0;
> regsiter_bpf();
> } while (errno != -EAGAIN);
>
> it could in theory preempt the owner of the lock and never make any
> progress.
We can probably work around this with some trick on tr->indirect_call. However,
I don't think this is a real concern from livepatch side. We have seen many
other issues that cause live patch to fail and requires retry. This race
condition in theory shouldn't cause real world issues.
Thanks,
Song
> On Jul 6, 2022, at 2:40 PM, Steven Rostedt <[email protected]> wrote:
>
> On Wed, 6 Jul 2022 21:37:52 +0000
> Song Liu <[email protected]> wrote:
>
>>> Can you comment here that returning -EAGAIN will not cause this to repeat.
>>> That it will change things where the next try will not return -EAGAIN?
>>
>> Hmm.. this is not the guarantee here. This conflict is a real race condition
>> that an IPMODIFY function (i.e. livepatch) is being registered at the same time
>> as something else, for example bpftrace, is updating the BPF trampoline.
>>
>> This EAGAIN will propagate to the user of the IPMODIFY function (i.e. livepatch),
>> and we need to retry there. In the case of livepatch, the retry is initiated
>> from user space.
>
> We need to be careful here then. If there's a userspace application that
> runs at real-time and does a:
>
> do {
> errno = 0;
> register_bpf();
> } while (errno != -EAGAIN);
Actually, do you mean:
do {
errno = 0;
register_bpf();
} while (errno == -EAGAIN);
(== -EAGAIN) here?
In this specific race condition, register_bpf() will succeed, as it already
got tr->mutex. But the IPMODIFY (livepatch) side will fail and retry.
Since both livepatch and bpf trampoline changes are rare operations, I think
the chance of the race condition is low enough.
Does this make sense?
Thanks,
Song
On Wed, 6 Jul 2022 22:15:47 +0000
Song Liu <[email protected]> wrote:
> > On Jul 6, 2022, at 2:40 PM, Steven Rostedt <[email protected]> wrote:
> >
> > On Wed, 6 Jul 2022 21:37:52 +0000
> > Song Liu <[email protected]> wrote:
> >
> >>> Can you comment here that returning -EAGAIN will not cause this to repeat.
> >>> That it will change things where the next try will not return -EAGAIN?
> >>
> >> Hmm.. this is not the guarantee here. This conflict is a real race condition
> >> that an IPMODIFY function (i.e. livepatch) is being registered at the same time
> >> as something else, for example bpftrace, is updating the BPF trampoline.
> >>
> >> This EAGAIN will propagate to the user of the IPMODIFY function (i.e. livepatch),
> >> and we need to retry there. In the case of livepatch, the retry is initiated
> >> from user space.
> >
> > We need to be careful here then. If there's a userspace application that
> > runs at real-time and does a:
> >
> > do {
> > errno = 0;
> > register_bpf();
> > } while (errno != -EAGAIN);
>
> Actually, do you mean:
>
> do {
> errno = 0;
> register_bpf();
> } while (errno == -EAGAIN);
>
> (== -EAGAIN) here?
Yeah, of course.
>
> In this specific race condition, register_bpf() will succeed, as it already
> got tr->mutex. But the IPMODIFY (livepatch) side will fail and retry.
What else takes the tr->mutex ?
If it preempts anything else taking that mutex, when this runs, then it
needs to be careful.
You said this can happen when the live patch came first. This isn't racing
against live patch, it's racing against anything that takes the tr->mutex
and then adds a bpf trampoline to a location that has a live patch.
>
> Since both livepatch and bpf trampoline changes are rare operations, I think
> the chance of the race condition is low enough.
>
> Does this make sense?
>
It's low, and if it is also a privileged operation then there's less to be
concerned about. As if it is not, then we could have a way to deadlock the
system. I'm more concerned that this will lead to a CVE than it just
happening randomly. In other words, it only takes something that can run at
a real-time priority to connect to a live patch location, and something
that runs at a low priority to take a tr->mutex. If an attacker has both,
then it can pin both to a CPU and then cause the deadlock to the system.
One hack to fix this is to add a msleep(1) in the failed case of the
trylock. This will at least give the owner of the lock a millisecond to
release it. This was what the RT patch used to do with spin_trylock() that
was converted to a mutex (and we worked hard to remove all of them).
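Roughly this, in the failed case (a sketch; msleep() needs linux/delay.h,
and tr->mutex stands for whatever mutex the trylock is taken on):

	if (!mutex_trylock(&tr->mutex)) {
		/* Give a possibly preempted lock owner time to run. */
		msleep(1);
		return -EAGAIN;
	}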
-- Steve
> On Jul 6, 2022, at 3:29 PM, Steven Rostedt <[email protected]> wrote:
>
> On Wed, 6 Jul 2022 22:15:47 +0000
> Song Liu <[email protected]> wrote:
>
>>> On Jul 6, 2022, at 2:40 PM, Steven Rostedt <[email protected]> wrote:
>>>
>>> On Wed, 6 Jul 2022 21:37:52 +0000
>>> Song Liu <[email protected]> wrote:
>>>
>>>>> Can you comment here that returning -EAGAIN will not cause this to repeat.
>>>>> That it will change things where the next try will not return -EAGAIN?
>>>>
>>>> Hmm.. this is not the guarantee here. This conflict is a real race condition
>>>> that an IPMODIFY function (i.e. livepatch) is being registered at the same time
>>>> as something else, for example bpftrace, is updating the BPF trampoline.
>>>>
>>>> This EAGAIN will propagate to the user of the IPMODIFY function (i.e. livepatch),
>>>> and we need to retry there. In the case of livepatch, the retry is initiated
>>>> from user space.
>>>
>>> We need to be careful here then. If there's a userspace application that
>>> runs at real-time and does a:
>>>
>>> do {
>>> errno = 0;
>>> register_bpf();
>>> } while (errno != -EAGAIN);
>>
>> Actually, do you mean:
>>
>> do {
>> errno = 0;
>> register_bpf();
>> } while (errno == -EAGAIN);
>>
>> (== -EAGAIN) here?
>
> Yeah, of course.
>
>>
>> In this specific race condition, register_bpf() will succeed, as it already
>> got tr->mutex. But the IPMODIFY (livepatch) side will fail and retry.
>
> What else takes the tr->mutex ?
tr->mutex is the local mutex for a single BPF trampoline; we only need to take
it when we make changes to the trampoline (add/remove fentry/fexit programs).
>
> If it preempts anything else taking that mutex, when this runs, then it
> needs to be careful.
>
> You said this can happen when the live patch came first. This isn't racing
> against live patch, it's racing against anything that takes the tr->mutex
> and then adds a bpf trampoline to a location that has a live patch.
There are a few scenarios here:
1. Live patch is already applied, then a BPF trampoline is being registered
to the same function. In bpf_trampoline_update(), register_fentry returns
-EAGAIN, and this will be resolved.
2. BPF trampoline is already registered, then a live patch is being applied
to the same function. The live patch code needs to ask the bpf trampoline to
prepare the trampoline before live patch. This is done by calling
bpf_tramp_ftrace_ops_func.
2.1 If nothing else is modifying the trampoline at the same time,
bpf_tramp_ftrace_ops_func will succeed.
2.2 In rare cases, if something else is holding tr->mutex to make changes to
the trampoline (add/remove fentry functions, etc.), mutex_trylock in
bpf_tramp_ftrace_ops_func will fail, and live patch will fail. However, the
change to BPF trampoline will still succeed. It is common for live patch to
retry, so we just need to try live patch again when no one is making changes
to the BPF trampoline in parallel.
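The ops_func side of 2.2 looks roughly like this (a sketch of the idea,
not the exact code in the patch; it assumes tr is reachable via
ops->private):

	static int bpf_tramp_ftrace_ops_func(struct ftrace_ops *ops,
					     enum ftrace_ops_cmd cmd)
	{
		struct bpf_trampoline *tr = ops->private;
		int ret;

		if (cmd != FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY)
			return -EINVAL;	/* DISABLE is analogous */

		/*
		 * Someone updating the trampoline may already hold
		 * tr->mutex, so we must not block here.
		 */
		if (!mutex_trylock(&tr->mutex))
			return -EAGAIN;

		/* regenerate the trampoline so that it honors regs->ip */
		tr->indirect_call = true;
		ret = bpf_trampoline_update(tr, false /* lock_direct_mutex */);

		mutex_unlock(&tr->mutex);
		return ret;
	}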
>
>>
>> Since both livepatch and bpf trampoline changes are rare operations, I think
>> the chance of the race condition is low enough.
>>
>> Does this make sense?
>>
>
> It's low, and if it is also a privileged operation then there's less to be
> concerned about.
Both live patch and BPF trampoline are privileged operations.
> As if it is not, then we could have a way to deadlock the
> system. I'm more concerned that this will lead to a CVE than it just
> happening randomly. In other words, it only takes something that can run at
> a real-time priority to connect to a live patch location, and something
> that runs at a low priority to take a tr->mutex. If an attacker has both,
> then it can pin both to a CPU and then cause the deadlock to the system.
>
> One hack to fix this is to add a msleep(1) in the failed case of the
> trylock. This will at least give the owner of the lock a millisecond to
> release it. This was what the RT patch used to do with spin_trylock() that
> was converted to a mutex (and we worked hard to remove all of them).
The fix is really simple. But I still think we don't need it. We only hit
the trylock case for something with IPMODIFY. The non-privileged user
should not be able to do that, right?
Thanks,
Song
On Thu, 7 Jul 2022 00:19:07 +0000
Song Liu <[email protected]> wrote:
> >> In this specific race condition, register_bpf() will succeed, as it already
> >> got tr->mutex. But the IPMODIFY (livepatch) side will fail and retry.
> >
> > What else takes the tr->mutex ?
>
> tr->mutex is the local mutex for a single BPF trampoline; we only need to take
> it when we make changes to the trampoline (add/remove fentry/fexit programs).
>
> >
> > If it preempts anything else taking that mutex, when this runs, then it
> > needs to be careful.
> >
> > You said this can happen when the live patch came first. This isn't racing
> > against live patch, it's racing against anything that takes the tr->mutex
> > and then adds a bpf trampoline to a location that has a live patch.
>
> There are a few scenarios here:
> 1. Live patch is already applied, then a BPF trampoline is being registered
> to the same function. In bpf_trampoline_update(), register_fentry returns
> -EAGAIN, and this will be resolved.
Where will it be resolved?
>
> 2. BPF trampoline is already registered, then a live patch is being applied
> to the same function. The live patch code needs to ask the bpf trampoline to
> prepare the trampoline before live patch. This is done by calling
> bpf_tramp_ftrace_ops_func.
>
> 2.1 If nothing else is modifying the trampoline at the same time,
> bpf_tramp_ftrace_ops_func will succeed.
>
> 2.2 In rare cases, if something else is holding tr->mutex to make changes to
> the trampoline (add/remove fentry functions, etc.), mutex_trylock in
> bpf_tramp_ftrace_ops_func will fail, and live patch will fail. However, the
> change to BPF trampoline will still succeed. It is common for live patch to
> retry, so we just need to try live patch again when no one is making changes
> to the BPF trampoline in parallel.
If the live patch is going to try again, and the task doing the live
patch is SCHED_FIFO, and the task holding the tr->mutex is SCHED_OTHER
(or just a lower priority), then there is a chance that the live patch
task preempted the tr->mutex owner, and let's say the tr->mutex owner
is pinned to the CPU (by the user or whatever), then because the live
patch task is in a loop trying to take that mutex, it will never let
the owner continue.
Yes, this is a real scenario with trylock on mutexes. We hit it all the
time in RT.
>
> >
> >>
> >> Since both livepatch and bpf trampoline changes are rare operations, I think
> >> the chance of the race condition is low enough.
A low race condition in a world that does this a billion times a day,
ends up being not so rare.
I like to say, "I live in a world where the unlikely is very much likely!"
> >>
> >> Does this make sense?
> >>
> >
> > It's low, and if it is also a privileged operation then there's less to be
> > concerned about.
>
> Both live patch and BPF trampoline are privileged operations.
This makes the issue less of an issue, but if there's an application
that does this with setuid or something, there's a chance that it can
be used by an attacker as well.
>
> > As if it is not, then we could have a way to deadlock the
> > system. I'm more concerned that this will lead to a CVE than it just
> > happening randomly. In other words, it only takes something that can run at
> > a real-time priority to connect to a live patch location, and something
> > that runs at a low priority to take a tr->mutex. If an attacker has both,
> > then it can pin both to a CPU and then cause the deadlock to the system.
> >
> > One hack to fix this is to add a msleep(1) in the failed case of the
> > trylock. This will at least give the owner of the lock a millisecond to
> > release it. This was what the RT patch used to do with spin_trylock() that
> > was converted to a mutex (and we worked hard to remove all of them).
>
> The fix is really simple. But I still think we don't need it. We only hit
> the trylock case for something with IPMODIFY. The non-privileged user
> should not be able to do that, right?
For now, perhaps. But what useful applications are there going to be in
the future that performs these privileged operations on behalf of a
non-privileged user?
In other words, if we can fix it now, we should, and avoid debugging
this issue 5 years from now where it may take months to figure out.
-- Steve
> On Jul 6, 2022, at 6:18 PM, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 7 Jul 2022 00:19:07 +0000
> Song Liu <[email protected]> wrote:
>
>>>> In this specific race condition, register_bpf() will succeed, as it already
>>>> got tr->mutex. But the IPMODIFY (livepatch) side will fail and retry.
>>>
>>> What else takes the tr->mutex ?
>>
>> tr->mutex is the local mutex for a single BPF trampoline; we only need to take
>> it when we make changes to the trampoline (add/remove fentry/fexit programs).
>>
>>>
>>> If it preempts anything else taking that mutex, when this runs, then it
>>> needs to be careful.
>>>
>>> You said this can happen when the live patch came first. This isn't racing
>>> against live patch, it's racing against anything that takes the tr->mutex
>>> and then adds a bpf trampoline to a location that has a live patch.
>>
>> There are a few scenarios here:
>> 1. Live patch is already applied, then a BPF trampoline is being registered
>> to the same function. In bpf_trampoline_update(), register_fentry returns
>> -EAGAIN, and this will be resolved.
>
> Where will it be resolved?
bpf_trampoline_update() will set tr->indirect_call and goto again. The
trampoline is then re-prepared so that it can share the function with the
IPMODIFY user, and register_fentry will succeed.
>
>>
>> 2. BPF trampoline is already registered, then a live patch is being applied
>> to the same function. The live patch code needs to ask the bpf trampoline to
>> prepare the trampoline before live patch. This is done by calling
>> bpf_tramp_ftrace_ops_func.
>>
>> 2.1 If nothing else is modifying the trampoline at the same time,
>> bpf_tramp_ftrace_ops_func will succeed.
>>
>> 2.2 In rare cases, if something else is holding tr->mutex to make changes to
>> the trampoline (add/remove fentry functions, etc.), mutex_trylock in
>> bpf_tramp_ftrace_ops_func will fail, and live patch will fail. However, the
>> change to BPF trampoline will still succeed. It is common for live patch to
>> retry, so we just need to try live patch again when no one is making changes
>> to the BPF trampoline in parallel.
>
> If the live patch is going to try again, and the task doing the live
> patch is SCHED_FIFO, and the task holding the tr->mutex is SCHED_OTHER
> (or just a lower priority), then there is a chance that the live patch
> task preempted the tr->mutex owner, and let's say the tr->mutex owner
> is pinned to the CPU (by the user or whatever), then because the live
> patch task is in a loop trying to take that mutex, it will never let
> the owner continue.
Yeah, I got this scenario. I just don't think we should run live patch
with high priority. Well, maybe we shouldn't make such assumptions.
>
> Yes, this is a real scenario with trylock on mutexes. We hit it all the
> time in RT.
>
>>
>>>
>>>>
>>>> Since both livepatch and bpf trampoline changes are rare operations, I think
>>>> the chance of the race condition is low enough.
>
>
> A low race condition in a world that does this a billion times a day,
> ends up being not so rare.
>
> I like to say, "I live in a world where the unlikely is very much likely!"
>
>
>>>>
>>>> Does this make sense?
>>>>
>>>
>>> It's low, and if it is also a privileged operation then there's less to be
>>> concerned about.
>>
>> Both live patch and BPF trampoline are privileged operations.
>
> This makes the issue less of an issue, but if there's an application
> that does this with setuid or something, there's a chance that it can
> be used by an attacker as well.
>
>>
>>> As if it is not, then we could have a way to deadlock the
>>> system. I'm more concerned that this will lead to a CVE than it just
>>> happening randomly. In other words, it only takes something that can run at
>>> a real-time priority to connect to a live patch location, and something
>>> that runs at a low priority to take a tr->mutex. If an attacker has both,
>>> then it can pin both to a CPU and then cause the deadlock to the system.
>>>
>>> One hack to fix this is to add a msleep(1) in the failed case of the
>>> trylock. This will at least give the owner of the lock a millisecond to
>>> release it. This was what the RT patch used to do with spin_trylock() that
>>> was converted to a mutex (and we worked hard to remove all of them).
>>
>> The fix is really simple. But I still think we don't need it. We only hit
>> the trylock case for something with IPMODIFY. The non-privileged user
>> should not be able to do that, right?
>
> For now, perhaps. But what useful applications are there going to be in
> the future that performs these privileged operations on behalf of a
> non-privileged user?
>
> In other words, if we can fix it now, we should, and avoid debugging
> this issue 5 years from now where it may take months to figure out.
Fair enough. I guess we will just add the msleep(1) in the -EAGAIN
path. If this sounds good, I will send v3 with this change and more
comments.
Thanks,
Song
I just realized that none of the live kernel patching folks are Cc'd on
this thread. I think they will care much more about this than I do.
-- Steve
On Thu, 2 Jun 2022 12:37:01 -0700
Song Liu <[email protected]> wrote:
> Changes v1 => v2:
> 1. Fix build errors for different config. (kernel test robot)
>
> Kernel Live Patch (livepatch, or klp) and bpf trampoline are important
> features for modern systems. This set allows the two to work on the same
> kernel function as the same time.
>
> live patch uses ftrace with IPMODIFY, while bpf trampoline use direct
> ftrace. Existing policy does not allow the two to attach to the same kernel
> function. This is changed by fine tuning ftrace IPMODIFY policy, and allows
> one non-DIRECT IPMODIFY ftrace_ops and one non-IPMODIFY DIRECT ftrace_ops
> on the same kernel function at the same time. Please see 3/5 for more
> details on this.
>
> Note that, one of the constraint here is to let bpf trampoline use direct
> call when it is not working on the same function as live patch. This is
> achieved by allowing ftrace code to ask bpf trampoline to make changes.
>
> Jiri Olsa (1):
> bpf, x64: Allow to use caller address from stack
>
> Song Liu (4):
> ftrace: allow customized flags for ftrace_direct_multi ftrace_ops
> ftrace: add modify_ftrace_direct_multi_nolock
> ftrace: introduce FTRACE_OPS_FL_SHARE_IPMODIFY
> bpf: trampoline: support FTRACE_OPS_FL_SHARE_IPMODIFY
>
> arch/x86/net/bpf_jit_comp.c | 13 +-
> include/linux/bpf.h | 8 ++
> include/linux/ftrace.h | 79 +++++++++++
> kernel/bpf/trampoline.c | 109 +++++++++++++--
> kernel/trace/ftrace.c | 269 +++++++++++++++++++++++++++++++-----
> 5 files changed, 424 insertions(+), 54 deletions(-)
>
> --
> 2.30.2
> On Jul 11, 2022, at 4:55 PM, Steven Rostedt <[email protected]> wrote:
>
> I just realized that none of the live kernel patching folks are Cc'd on
> this thread. I think they will care much more about this than I do.
vger.kernel.org often drops my email when the CC list is too long. So I
try to keep the list short. In this case, since we are not changing live
patch code, and there isn't any negative impact for live patch side, I
didn't CC live patch folks.
I will at least CC live-patching@ in the next version.
Thanks,
Song
PS: I am the live patch guy at Meta. :)
>
> -- Steve
>
>
> On Thu, 2 Jun 2022 12:37:01 -0700
> Song Liu <[email protected]> wrote:
>
>> Changes v1 => v2:
>> 1. Fix build errors for different config. (kernel test robot)
>>
>> Kernel Live Patch (livepatch, or klp) and bpf trampoline are important
>> features for modern systems. This set allows the two to work on the same
>> kernel function as the same time.
>>
>> live patch uses ftrace with IPMODIFY, while bpf trampoline use direct
>> ftrace. Existing policy does not allow the two to attach to the same kernel
>> function. This is changed by fine tuning ftrace IPMODIFY policy, and allows
>> one non-DIRECT IPMODIFY ftrace_ops and one non-IPMODIFY DIRECT ftrace_ops
>> on the same kernel function at the same time. Please see 3/5 for more
>> details on this.
>>
>> Note that, one of the constraint here is to let bpf trampoline use direct
>> call when it is not working on the same function as live patch. This is
>> achieved by allowing ftrace code to ask bpf trampoline to make changes.
>>
>> Jiri Olsa (1):
>> bpf, x64: Allow to use caller address from stack
>>
>> Song Liu (4):
>> ftrace: allow customized flags for ftrace_direct_multi ftrace_ops
>> ftrace: add modify_ftrace_direct_multi_nolock
>> ftrace: introduce FTRACE_OPS_FL_SHARE_IPMODIFY
>> bpf: trampoline: support FTRACE_OPS_FL_SHARE_IPMODIFY
>>
>> arch/x86/net/bpf_jit_comp.c | 13 +-
>> include/linux/bpf.h | 8 ++
>> include/linux/ftrace.h | 79 +++++++++++
>> kernel/bpf/trampoline.c | 109 +++++++++++++--
>> kernel/trace/ftrace.c | 269 +++++++++++++++++++++++++++++++-----
>> 5 files changed, 424 insertions(+), 54 deletions(-)
>>
>> --
>> 2.30.2
>
On Tue, 12 Jul 2022 05:15:26 +0000
Song Liu <[email protected]> wrote:
> > On Jul 11, 2022, at 4:55 PM, Steven Rostedt <[email protected]> wrote:
> >
> > I just realized that none of the live kernel patching folks are Cc'd on
> > this thread. I think they will care much more about this than I do.
>
> vger.kernel.org often drops my email when the CC list is too long. So I
Oh, they fixed that. I've had over 20 Cc's and it still works. ;-)
> try to keep the list short. In this case, since we are not changing live
> patch code, and there isn't any negative impact for live patch side, I
> didn't CC live patch folks.
It affects them indirectly, and they should be aware of what is happening
underneath.
>
> I will at least CC live-patching@ in the next version.
Thanks.
-- Steve
On Thu, 2 Jun 2022 12:37:02 -0700
Song Liu <[email protected]> wrote:
> This enables users of ftrace_direct_multi to specify the flags based on
> the actual use case. For example, some users may not set flag IPMODIFY.
If we apply this patch without any of the others, then we are relying on
the caller to get it right?
That is, can we register a direct function with this function and pick a
function with IPMODIFY already attached?
-- Steve
>
> Signed-off-by: Song Liu <[email protected]>
> ---
> kernel/trace/ftrace.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 2fcd17857ff6..afe782ae28d3 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5456,8 +5456,7 @@ int modify_ftrace_direct(unsigned long ip,
> }
> EXPORT_SYMBOL_GPL(modify_ftrace_direct);
>
> -#define MULTI_FLAGS (FTRACE_OPS_FL_IPMODIFY | FTRACE_OPS_FL_DIRECT | \
> - FTRACE_OPS_FL_SAVE_REGS)
> +#define MULTI_FLAGS (FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS)
>
> static int check_direct_multi(struct ftrace_ops *ops)
> {
> @@ -5547,7 +5546,7 @@ int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
> }
>
> ops->func = call_direct_funcs;
> - ops->flags = MULTI_FLAGS;
> + ops->flags |= MULTI_FLAGS;
> ops->trampoline = FTRACE_REGS_ADDR;
>
> err = register_ftrace_function(ops);
> On Jul 13, 2022, at 4:18 PM, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 2 Jun 2022 12:37:02 -0700
> Song Liu <[email protected]> wrote:
>
>> This enables users of ftrace_direct_multi to specify the flags based on
>> the actual use case. For example, some users may not set flag IPMODIFY.
>
> If we apply this patch without any of the others, then we are relying on
> the caller to get it right?
>
> That is, can we register a direct function with this function and pick a
> function with IPMODIFY already attached?
Yes, if the direct function follows regs->ip, it works.
For example, a BPF trampoline with only fentry calls will work with just this patch.
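On the caller side, that means something like this (a sketch; my_ops and
attach() are made-up names):

	static struct ftrace_ops my_ops;	/* note: no IPMODIFY in flags */

	static int attach(unsigned long func_ip, unsigned long tramp_addr)
	{
		int err;

		/* trace only func_ip with this ops */
		err = ftrace_set_filter_ip(&my_ops, func_ip, 0, 0);
		if (err)
			return err;

		/*
		 * With this patch, MULTI_FLAGS is OR-ed into ops->flags,
		 * so IPMODIFY stays unset unless the caller sets it.
		 */
		return register_ftrace_direct_multi(&my_ops, tramp_addr);
	}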
Thanks,
Song
>
> -- Steve
>
>
>>
>> Signed-off-by: Song Liu <[email protected]>
>> ---
>> kernel/trace/ftrace.c | 5 ++---
>> 1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
>> index 2fcd17857ff6..afe782ae28d3 100644
>> --- a/kernel/trace/ftrace.c
>> +++ b/kernel/trace/ftrace.c
>> @@ -5456,8 +5456,7 @@ int modify_ftrace_direct(unsigned long ip,
>> }
>> EXPORT_SYMBOL_GPL(modify_ftrace_direct);
>>
>> -#define MULTI_FLAGS (FTRACE_OPS_FL_IPMODIFY | FTRACE_OPS_FL_DIRECT | \
>> - FTRACE_OPS_FL_SAVE_REGS)
>> +#define MULTI_FLAGS (FTRACE_OPS_FL_DIRECT | FTRACE_OPS_FL_SAVE_REGS)
>>
>> static int check_direct_multi(struct ftrace_ops *ops)
>> {
>> @@ -5547,7 +5546,7 @@ int register_ftrace_direct_multi(struct ftrace_ops *ops, unsigned long addr)
>> }
>>
>> ops->func = call_direct_funcs;
>> - ops->flags = MULTI_FLAGS;
>> + ops->flags |= MULTI_FLAGS;
>> ops->trampoline = FTRACE_REGS_ADDR;
>>
>> err = register_ftrace_function(ops);
>
On Thu, 14 Jul 2022 00:11:53 +0000
Song Liu <[email protected]> wrote:
> > That is, can we register a direct function with this function and pick a
> > function with IPMODIFY already attached?
>
> Yes, if the direct function follows regs->ip, it works.
>
> For example, a BPF trampoline with only fentry calls will work with just this patch.
I replied with my thoughts on this to patch 3, but I disagree with this.
ftrace has no idea if the direct trampoline modifies the IP or not.
ftrace must assume that it does, and the management should be done in
the infrastructure.
As I replied to patch 3, here's my thoughts.
DIRECT is treated as though it changes the IP. If you register it to a
function that has an IPMODIFY already set to it, it will call the
ops->ops_func() asking if the ops can use SHARED_IPMODIFY (which means
a DIRECT can be shared with IPMODIFY). If it can, then it returns true,
and the SHARED_IPMODIFY is set *by ftrace*. The user of the ftrace APIs
should not have to manage this. It should be managed by the ftrace
infrastructure.
Make sense?
-- Steve
On Thu, 2 Jun 2022 12:37:04 -0700
Song Liu <[email protected]> wrote:
> live patch and BPF trampoline (kfunc/kretfunc in bpftrace) are important
> features for modern systems. Currently, it is not possible to use live
> patch and BPF trampoline on the same kernel function at the same time.
> This is because of the resitriction that only one ftrace_ops with flag
> FTRACE_OPS_FL_IPMODIFY on the same kernel function.
>
> BPF trampoline uses direct ftrace_ops, which assumes IPMODIFY. However,
> not all direct ftrace_ops would overwrite the actual function. This means
> it is possible to have a non-IPMODIFY direct ftrace_ops to share the same
> kernel function with an IPMODIFY ftrace_ops.
>
> Introduce FTRACE_OPS_FL_SHARE_IPMODIFY, which allows the direct ftrace_ops
> to share with IPMODIFY ftrace_ops. With FTRACE_OPS_FL_SHARE_IPMODIFY flag
> set, the direct ftrace_ops would call the target function picked by the
> IPMODIFY ftrace_ops.
>
> Comment "IPMODIFY, DIRECT, and SHARE_IPMODIFY" in include/linux/ftrace.h
> contains more information about how SHARE_IPMODIFY interacts with IPMODIFY
> and DIRECT flags.
>
> Signed-off-by: Song Liu <[email protected]>
> ---
> include/linux/ftrace.h | 74 +++++++++++++++++
> kernel/trace/ftrace.c | 179 ++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 242 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 9023bf69f675..bfacf608de9c 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -98,6 +98,18 @@ static inline int ftrace_mod_get_kallsym(unsigned int symnum, unsigned long *val
> }
> #endif
>
> +/*
> + * FTRACE_OPS_CMD_* commands allow the ftrace core logic to request changes
> + * to a ftrace_ops.
> + *
> + * ENABLE_SHARE_IPMODIFY - enable FTRACE_OPS_FL_SHARE_IPMODIFY.
> + * DISABLE_SHARE_IPMODIFY - disable FTRACE_OPS_FL_SHARE_IPMODIFY.
The above comment is basically:
/* Set x to 1 */
x = 1;
Probably something like this:
* FTRACE_OPS_CMD_* commands allow the ftrace core logic to request
changes
* to a ftrace_ops. Note, the requests may fail.
*
* ENABLE_SHARE_IPMODIFY - Request setting the ftrace ops
* SHARE_IPMODIFY flag.
* DISABLE_SHARE_IPMODIFY - Request disabling the ftrace ops
* SHARE_IPMODIFY flag.
> + */
> +enum ftrace_ops_cmd {
> + FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY,
> + FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY,
> +};
> +
> #ifdef CONFIG_FUNCTION_TRACER
>
> extern int ftrace_enabled;
> @@ -189,6 +201,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops);
> * ftrace_enabled.
> * DIRECT - Used by the direct ftrace_ops helper for direct functions
> * (internal ftrace only, should not be used by others)
> + * SHARE_IPMODIFY - For direct ftrace_ops only. Set when the direct function
> + * is ready to share same kernel function with IPMODIFY function
> + * (live patch, etc.).
> */
> enum {
> FTRACE_OPS_FL_ENABLED = BIT(0),
> @@ -209,8 +224,66 @@ enum {
> FTRACE_OPS_FL_TRACE_ARRAY = BIT(15),
> FTRACE_OPS_FL_PERMANENT = BIT(16),
> FTRACE_OPS_FL_DIRECT = BIT(17),
> + FTRACE_OPS_FL_SHARE_IPMODIFY = BIT(18),
> };
>
> +/*
> + * IPMODIFY, DIRECT, and SHARE_IPMODIFY.
> + *
> + * ftrace provides IPMODIFY flag for users to replace existing kernel
> + * function with a different version. This is achieved by setting regs->ip.
> + * The top user of IPMODIFY is live patch.
> + *
> + * DIRECT allows user to load custom trampoline on top of ftrace. DIRECT
> + * ftrace does not overwrite regs->ip. Instead, the custom trampoline is
No need to state if DIRECT modifies regs->ip or not. ftrace must assume
that it does (more below).
> + * saved separately (for example, orig_ax on x86). The top user of DIRECT
> + * is bpf trampoline.
> + *
> + * It is not super rare to have both live patch and bpf trampoline on the
> + * same kernel function. Therefore, it is necessary to allow the two work
"the two to work"
> + * with each other. Given that IPMODIFY and DIRECT target addressese are
"addresses"
> + * saved separately, this is feasible, but we need to be careful.
> + *
> + * The policy between IPMODIFY and DIRECT is:
> + *
> + * 1. Each kernel function can only have one IPMODIFY ftrace_ops;
> + * 2. Each kernel function can only have one DIRECT ftrace_ops;
> + * 3. DIRECT ftrace_ops may have IPMODIFY or not;
I was thinking about this more, and I think by default we should
consider all DIRECT ftrace_ops as the same as IPMODIFY. So perhaps the
first patch is to just remove the IPMODIFY from direct (as you did) but
then make all checks for multiple IPMODIFY also check DIRECT as well.
That is because there's no way that ftrace can verify that a direct
trampoline modifies the IP or not. Thus, it must assume that all do.
> + * 4. Each kernel function may have one non-DIRECT IPMODIFY ftrace_ops,
> + * and one non-IPMODIFY DIRECT ftrace_ops at the same time. This
> + * requires support from the DIRECT ftrace_ops. Specifically, the
> + * DIRECT trampoline should call the kernel function at regs->ip.
> + * If the DIRECT ftrace_ops supports sharing a function with ftrace_ops
> + * with IPMODIFY, it should set flag SHARE_IPMODIFY.
> + *
> + * Some DIRECT ftrace_ops has an option to enable SHARE_IPMODIFY or not.
> + * Usually, the non-SHARE_IPMODIFY option gives better performance. To take
> + * advantage of this performance benefit, is necessary to only enable
The performance part of this comment should not be in ftrace. It's an
implementation detail of the direct trampoline and may not even be
accurate with other implementations.
> + * SHARE_IPMODIFY only when it is on the same function as an IPMODIFY
> + * ftrace_ops. There are two cases to consider:
> + *
> + * 1. IPMODIFY ftrace_ops is registered first. When the (non-IPMODIFY, and
> + * non-SHARE_IPMODIFY) DIRECT ftrace_ops is registered later,
> + * register_ftrace_direct_multi() returns -EAGAIN. If the user of
> + * the DIRECT ftrace_ops can support SHARE_IPMODIFY, it should enable
> + * SHARE_IPMODIFY and retry.
If this ftrace_ops being registered can support SHARE_IPMODIFY, then it
should have the ops_func defined, in which case, why not have it just
call that instead of having to return -EAGAIN?
> + * 2. (non-IPMODIFY, and non-SHARE_IPMODIFY) DIRECT ftrace_ops is
> + * registered first. When the IPMODIFY ftrace_ops is registered later,
> + * it is necessary to ask the direct ftrace_ops to enable
> + * SHARE_IPMODIFY support. This is achieved via ftrace_ops->ops_func
> + * cmd=FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY. For more details on this
> + * condition, check out prepare_direct_functions_for_ipmodify().
> + */
> +
> +/*
> + * For most ftrace_ops_cmd,
> + * Returns:
> + * 0 - Success.
> + * -EBUSY - The operation cannot process
> * -EAGAIN - The operation cannot process temporarily.
> + */
> +typedef int (*ftrace_ops_func_t)(struct ftrace_ops *op, enum ftrace_ops_cmd cmd);
> +
> #ifdef CONFIG_DYNAMIC_FTRACE
> /* The hash used to know what functions callbacks trace */
> struct ftrace_ops_hash {
> @@ -253,6 +326,7 @@ struct ftrace_ops {
> unsigned long trampoline;
> unsigned long trampoline_size;
> struct list_head list;
> + ftrace_ops_func_t ops_func;
> #endif
> };
>
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 6a419f6bbbf0..868bbc753803 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1865,7 +1865,8 @@ static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops,
> /*
> * Try to update IPMODIFY flag on each ftrace_rec. Return 0 if it is OK
> * or no-needed to update, -EBUSY if it detects a conflict of the flag
> - * on a ftrace_rec, and -EINVAL if the new_hash tries to trace all recs.
> + * on a ftrace_rec, -EINVAL if the new_hash tries to trace all recs, and
> + * -EAGAIN if the ftrace_ops need to enable SHARE_IPMODIFY.
It should just call the ftrace_ops() with the command to set it. If you
want, we could add another CMD enum that can be passed for this case.
> * Note that old_hash and new_hash has below meanings
> * - If the hash is NULL, it hits all recs (if IPMODIFY is set, this is rejected)
> * - If the hash is EMPTY_HASH, it hits nothing
> @@ -1875,6 +1876,7 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
> struct ftrace_hash *old_hash,
> struct ftrace_hash *new_hash)
> {
> + bool is_ipmodify, is_direct, share_ipmodify;
> struct ftrace_page *pg;
> struct dyn_ftrace *rec, *end = NULL;
> int in_old, in_new;
> @@ -1883,7 +1885,24 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
> if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
> return 0;
>
> - if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY))
> + /*
> + * The following are all the valid combinations of is_ipmodify,
> + * is_direct, and share_ipmodify
> + *
> + * is_ipmodify is_direct share_ipmodify
> + * #1 0 0 0
> + * #2 1 0 0
> + * #3 1 1 0
I still think that DIRECT should automatically be considered IPMODIFY
(at least in the view of ftrace, whether or not the direct function
modifies the IP).
> + * #4 0 1 0
> + * #5 0 1 1
> + */
> +
> +
> + is_ipmodify = ops->flags & FTRACE_OPS_FL_IPMODIFY;
> + is_direct = ops->flags & FTRACE_OPS_FL_DIRECT;
> +
> + /* neither ipmodify nor direct, skip */
> + if (!is_ipmodify && !is_direct) /* combinations #1 */
> return 0;
>
> /*
> @@ -1893,6 +1912,30 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
> if (!new_hash || !old_hash)
> return -EINVAL;
>
> + share_ipmodify = ops->flags & FTRACE_OPS_FL_SHARE_IPMODIFY;
> +
> + /*
> + * This ops itself doesn't do ip_modify and it can share a fentry
> + * with other ops with ipmodify, nothing to do.
> + */
> + if (!is_ipmodify && share_ipmodify) /* combinations #5 */
> + return 0;
> +
Really, if connecting to a function that already has IPMODIFY, then the
ops_func() needs to be called, and if the ops supports SHARED_IPMODIFY
then it should get set and then continue.
Make sense?
-- Steve
> + /*
> + * Only three combinations of is_ipmodify, is_direct, and
> + * share_ipmodify for the logic below:
> + * #2 live patch
> + * #3 direct with ipmodify
> + * #4 direct without ipmodify
> + *
> + * is_ipmodify is_direct share_ipmodify
> + * #2 1 0 0
> + * #3 1 1 0
> + * #4 0 1 0
> + *
> + * Only update/rollback rec->flags for is_ipmodify == 1 (#2 and #3)
> + */
> +
> /* Update rec->flags */
> do_for_each_ftrace_rec(pg, rec) {
>
> @@ -1906,12 +1949,18 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
> continue;
>
> if (in_new) {
> - /* New entries must ensure no others are using it */
> - if (rec->flags & FTRACE_FL_IPMODIFY)
> - goto rollback;
> - rec->flags |= FTRACE_FL_IPMODIFY;
> - } else /* Removed entry */
> + if (rec->flags & FTRACE_FL_IPMODIFY) {
> + /* cannot have two ipmodify on same rec */
> + if (is_ipmodify) /* combination #2 and #3 */
> + goto rollback;
> + /* let user enable share_ipmodify and retry */
> + return -EAGAIN; /* combination #4 */
> + } else if (is_ipmodify) {
> + rec->flags |= FTRACE_FL_IPMODIFY;
> + }
> + } else if (is_ipmodify) {/* Removed entry */
> rec->flags &= ~FTRACE_FL_IPMODIFY;
> + }
> } while_for_each_ftrace_rec();
>
> return 0;
> On Jul 13, 2022, at 5:38 PM, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 14 Jul 2022 00:11:53 +0000
> Song Liu <[email protected]> wrote:
>
>>> That is, can we register a direct function with this function and pick a
>>> function with IPMODIFY already attached?
>>
>> Yes, if the direct function follows regs->ip, it works.
>>
>> For example, a BPF trampoline with only fentry calls will work with just this patch.
>
> I replied with my thoughts on this to patch 3, but I disagree with this.
>
> ftrace has no idea if the direct trampoline modifies the IP or not.
> ftrace must assume that it does, and the management should be done in
> the infrastructure.
>
> As I replied to patch 3, here's my thoughts.
>
> DIRECT is treated as though it changes the IP. If you register it to a
> function that has an IPMODIFY already set to it, it will call the
> ops->ops_func() asking if the ops can use SHARED_IPMODIFY (which means
> a DIRECT can be shared with IPMODIFY). If it can, then it returns true,
> and the SHARED_IPMODIFY is set *by ftrace*. The user of the ftrace APIs
> should not have to manage this. It should be managed by the ftrace
> infrastructure.
Hmm... I don't think this is gonna work.
First, two IPMODIFY ftrace ops cannot work together on the same kernel
function. So there won't be an ops with both IPMODIFY and SHARE_IPMODIFY.
non-direct ops without IPMODIFY can already share with IPMODIFY ops.
Only direct ops need SHARE_IPMODIFY flag, and it means "I can share with
another ops with IPMODIFY". So there will be different flavors of
direct ops:
1. w/ IPMODIFY, w/o SHARE_IPMODIFY;
2. w/o IPMODIFY, w/o SHARE_IPMODIFY;
3. w/o IPMODIFY, w/ SHARE_IPMODIFY.
#1 can never work on the same function with another IPMODIFY ops, and
we don't plan to make it work. #2 does not work with another IPMODIFY
ops. And #3 works with another IPMODIFY ops.
The owner of the direct trampoline uses these flags to tell ftrace
infrastructure the property of this trampoline.
BPF trampolines with only fentry calls are #3 direct ops. BPF
trampolines with fexit or fmod_ret calls will be #2 trampoline by
default, but it is also possible to generate #3 trampoline for it.
BPF side will try to register #2 trampoline. If ftrace detects another
IPMODIFY ops on the same function, it will let BPF trampoline know
with -EAGAIN from register_ftrace_direct_multi(). Then, BPF side will
regenerate a #3 trampoline and register it again.
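In BPF terms, the difference between #2 and #3 is roughly one flag at
trampoline generation time (a sketch following the diff quoted earlier in
the thread):

	if (tr->indirect_call)
		/*
		 * #3: fetch the traced function's address from the stack
		 * (regs->ip) instead of calling it directly, so that an
		 * IPMODIFY user can redirect it.
		 */
		flags |= BPF_TRAMP_F_ORIG_STACK;

	err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
					  &tr->func.model, flags, tlinks,
					  tr->func.addr);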
I know this somewhat changes the policy with direct ops, but it is the
only way this can work, AFAICT.
Does this make sense? Did I miss something?
Thanks,
Song
On Thu, 14 Jul 2022 01:42:59 +0000
Song Liu <[email protected]> wrote:
> > As I replied to patch 3, here's my thoughts.
> >
> > DIRECT is treated as though it changes the IP. If you register it to a
> > function that has an IPMODIFY already set to it, it will call the
> > ops->ops_func() asking if the ops can use SHARED_IPMODIFY (which means
> > a DIRECT can be shared with IPMODIFY). If it can, then it returns true,
> > and the SHARED_IPMODIFY is set *by ftrace*. The user of the ftrace APIs
> > should not have to manage this. It should be managed by the ftrace
> > infrastructure.
>
> Hmm... I don't think this is gonna work.
>
> First, two IPMODIFY ftrace ops cannot work together on the same kernel
> function. So there won't be an ops with both IPMODIFY and SHARE_IPMODIFY.
I'm not saying that.
I'm saying that ftrace does not have any clue (nor cares) about what a
DIRECT ops does. It might modify the IP or it might not. It doesn't know.
But ftrace has control over the callbacks it does control.
A DIRECT ops knows if it can work with another ops that has IPMODIFY set.
If the DIRECT ops does not work with IPMODIFY (perhaps it wants to modify
the IP), then it will tell ftrace that it can't and ftrace will not allow
it.
That is, ftrace doesn't care if the DIRECT ops modifies the IP or not. It
can only ask (through the ops->ops_func()) if the direct trampoline can
honor the IP that is modified. If it can, then it reports back that it can,
and then ftrace will set that ops to SHARED_IPMODIFY, and the direct function
can switch the ops->func() to one that uses SHARED_IPMODIFY.
>
> non-direct ops without IPMODIFY can already share with IPMODIFY ops.
It can? ftrace sets IPMODIFY for all DIRECT callers to prevent that. Except
for this patch that removes that restriction (which I believe is broken).
> Only direct ops need SHARE_IPMODIFY flag, and it means "I can share with
> another ops with IPMODIFY". So there will be different flavors of
> direct ops:
I agree that only DIRECT ops can have SHARED_IPMODIFY set. That's what I'm
saying. But I'm saying it gets set by ftrace.
>
> 1. w/ IPMODIFY, w/o SHARE_IPMODIFY;
> 2. w/o IPMODIFY, w/o SHARE_IPMODIFY;
> 3. w/o IPMODIFY, w/ SHARE_IPMODIFY.
>
> #1 can never work on the same function with another IPMODIFY ops, and
> we don't plan to make it work. #2 does not work with another IPMODIFY
> ops. And #3 works with another IPMODIFY ops.
Let's look at this differently. What I'm proposing is that registering a
direct ops does not need to tell ftrace if it modifies the IP or not.
That's because ftrace doesn't care. Once ftrace calls the direct trampoline
it loses all control. With the ftrace ops callbacks, it is the one
responsible for setting up the modified IP. That's not the case with the
direct trampolines.
I'm saying that ftrace doesn't care what the DIRECT ops does. But it will
not, by default, allow an IPMODIFY to happen when a DIRECT ops is on the
same function, or vice versa.
What I'm suggesting is when an IPMODIFY tries to attach to a function that
also has a DIRECT ops, or a DIRECT tries to attach to a function that
already has an IPMODIFY ops on it, that ftrace calls the direct
ops->ops_func() asking if it is safe to use with an IPMODIFY function.
If the direct ops modifies the IP itself, it will return a "no", and ftrace
will reject the attachment. If the direct ops can, it returns a "yes" and
then ftrace will set the SHARED_IPMODIFY flag to that ops and continue.
The fact that the ops->ops_func was called will let the caller (bpf) know
that it needs to use the stack to return to the function, or to call it if
it is also tracing the return.
If the IPMODIFY ops is removed, then ftrace will call the ops->ops_func()
telling it it no longer has the IPMODIFY set, and will clear the
SHARED_IPMODIFY flag, and then the owner of the direct ops can go back to
doing whatever it did before (calling the function directly, or
whatever).
>
> The owner of the direct trampoline uses these flags to tell ftrace
> infrastructure the property of this trampoline.
I disagree. The owner shouldn't have to care about the flags. Let ftrace
handle it. This will make things so much more simple for both BPF and
ftrace.
>
> BPF trampolines with only fentry calls are #3 direct ops. BPF
> trampolines with fexit or fmod_ret calls will be #2 trampoline by
> default, but it is also possible to generate #3 trampoline for it.
And ftrace doesn't care about this. But bpf needs to care if the IP is
being modified or not. Which can be done by the ops->ops_func() that you
added.
>
> BPF side will try to register #2 trampoline. If ftrace detects another
> IPMODIFY ops on the same function, it will let BPF trampoline know
> with -EAGAIN from register_ftrace_direct_multi(). Then, BPF side will
> regenerate a #3 trampoline and register it again.
This is too complex. You are missing the simple way.
>
> I know this somewhat changes the policy with direct ops, but it is the
> only way this can work, AFAICT.
I disagree. There's a much better way that this can work.
>
> Does this make sense? Did I miss something?
Let me start from the beginning.
1. Live kernel patching patches function foo.
2. lkp has an ops->flags | IPMODIFY set when it registers.
3. bpf registers a direct trampoline to function foo.
4. bpf has an ops->flags | DIRECT set when it registers
5. ftrace sees that the function already has an ops->flag = IPMODIFY on it,
so it calls bpf ops->ops_func(SHARE_WITH_IPMODIFY)
6. bpf can and does the following
a. if it's the simple #1 trampoline (just traces the start of a function)
it doesn't need to do anything special and returns "yes".
b. if it's the #2 trampoline, it will change the trampoline to use the
stack to find what to call, and returns "yes".
7. ftrace gets "yes" and sets the *ipmodify* ops with SHARED_IPMODIFY
(there's a reason for setting this flag for the ipmodify ops and not the
direct ops).
8. LKP is removed from the function foo.
9. ftrace sees the lkp IPMODIFY ops has SHARED_IPMODIFY on it, and knows
that there's a direct call here too. It removes the IPMODIFY ops, and
then calls the direct ops->ops_func(STOP_SHARE_WITH_IPMODIFY) to let the
direct code know that it is no longer sharing with an IPMODIFY such that
it can change to call the function directly and not use the stack.
See how simple this is? ftrace doesn't have to care if the direct caller
changes the IP or not. It just wants to know if it can be shared with an
IPMODIFY ops. And BPF doesn't have to care about extra flags used to manage
the ftrace infrastructure.
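In pseudo code, the ftrace side of steps 5-7 would be something like this
(a sketch; it reuses the cmd names from this series, and direct_ops /
ipmodify_ops stand for whatever ops ftrace finds on the record):

	int ret;

	/* a new IPMODIFY collides with an existing DIRECT, or vice versa */
	if (!direct_ops->ops_func)
		return -EBUSY;		/* cannot ask, so refuse */

	ret = direct_ops->ops_func(direct_ops,
				   FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY);
	if (ret)
		return ret;		/* the direct ops cannot share */

	/* mark the ipmodify ops so the removal path knows to undo this */
	ipmodify_ops->flags |= FTRACE_OPS_FL_SHARE_IPMODIFY;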
Does this make sense now?
-- Steve
> On Jul 13, 2022, at 7:55 PM, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 14 Jul 2022 01:42:59 +0000
> Song Liu <[email protected]> wrote:
>
>>> As I replied to patch 3, here's my thoughts.
>>>
>>> DIRECT is treated as though it changes the IP. If you register it to a
>>> function that has an IPMODIFY already set to it, it will call the
>>> ops->ops_func() asking if the ops can use SHARED_IPMODIFY (which means
>>> a DIRECT can be shared with IPMODIFY). If it can, then it returns true,
>>> and the SHARED_IPMODIFY is set *by ftrace*. The user of the ftrace APIs
>>> should not have to manage this. It should be managed by the ftrace
>>> infrastructure.
>>
>> Hmm... I don't think this is gonna work.
>>
>> First, two IPMODIFY ftrace ops cannot work together on the same kernel
>> function. So there won't be an ops with both IPMODIFY and SHARE_IPMODIFY.
>
> I'm not saying that.
>
> I'm saying that ftrace does not have any clue (nor cares) about what a
> DIRECT ops does. It might modify the IP or it might not. It doesn't know.
>
> But ftrace has control over the callbacks it does control.
>
> A DIRECT ops knows if it can work with another ops that has IPMODIFY set.
> If the DIRECT ops does not work with IPMODIFY (perhaps it wants to modify
> the IP), then it will tell ftrace that it can't and ftrace will not allow
> it.
>
> That is, ftrace doesn't care if the DIRECT ops modifies the IP or not. It
> can only ask (through the ops->ops_func()) if the direct trampoline can
> honor the IP that is modified. If it can, then it reports back that it can,
> and then ftrace will set that ops to SHARED_IPMODIFY, and the direct function
> can switch the ops->func() to one that uses SHARED_IPMODIFY.
>
>>
>> non-direct ops without IPMODIFY can already share with IPMODIFY ops.
>
> It can? ftrace sets IPMODIFY for all DIRECT callers to prevent that. Except
> for this patch that removes that restriction (which I believe is broken).
I mean "non-direct" ftrace ops, not direct ftrace ops.
>
>> Only direct ops need SHARE_IPMODIFY flag, and it means "I can share with
>> another ops with IPMODIFY". So there will be different flavors of
>> direct ops:
>
> I agree that only DIRECT ops can have SHARED_IPMODIFY set. That's what I'm
> saying. But I'm saying it gets set by ftrace.
>
>>
>> 1. w/ IPMODIFY, w/o SHARE_IPMODIFY;
>> 2. w/o IPMODIFY, w/o SHARE_IPMODIFY;
>> 3. w/o IPMODIFY, w/ SHARE_IPMODIFY.
>>
>> #1 can never work on the same function with another IPMODIFY ops, and
>> we don't plan to make it work. #2 does not work with another IPMODIFY
>> ops. And #3 works with another IPMODIFY ops.
>
> Let's look at this differently. What I'm proposing is that registering a
> direct ops does not need to tell ftrace if it modifies the IP or not.
> That's because ftrace doesn't care. Once ftrace calls the direct trampoline
> it loses all control. With the ftrace ops callbacks, it is the one
> responsible for setting up the modified IP. That's not the case with the
> direct trampolines.
>
> I'm saying that ftrace doesn't care what the DIRECT ops does. But it will
> not, by default, allow an IPMODIFY to happen when a DIRECT ops is on the
> same function, or vice versa.
>
> What I'm suggesting is when an IPMODIFY tries to attach to a function that
> also has a DIRECT ops, or a DIRECT tries to attach to a function that
> already has an IPMODIFY ops on it, that ftrace calls the direct
> ops->ops_func() asking if it is safe to use with an IPMODIFY function.
>
> If the direct ops modifies the IP itself, it will return a "no", and ftrace
> will reject the attachment. If the direct ops can, it returns a "yes" and
> then ftrace will set the SHARED_IPMODIFY flag to that ops and continue.
>
> The fact that the ops->ops_func was called will let the caller (bpf) know
> that it needs to use the stack to return to the function, or to call it if
> it is also tracing the return.
>
> If the IPMODIFY ops is removed, then ftrace will call the ops->ops_func()
> telling it it no longer has the IPMODIFY set, and will clear the
> SHARED_IPMODIFY flag, and then the owner of the direct ops can go back to
> doing whatever it did before (calling the function directly, or
> whatever).
>
>>
>> The owner of the direct trampoline uses these flags to tell ftrace
>> infrastructure the property of this trampoline.
>
> I disagree. The owner shouldn't have to care about the flags. Let ftrace
> handle it. This will make things so much more simple for both BPF and
> ftrace.
>
>>
>> BPF trampolines with only fentry calls are #3 direct ops. BPF
>> trampolines with fexit or fmod_ret calls will be #2 trampoline by
>> default, but it is also possible to generate #3 trampoline for it.
>
> And ftrace doesn't care about this. But bpf needs to care if the IP is
> being modified or not. Which can be done by the ops->ops_func() that you
> added.
>
>>
>> BPF side will try to register #2 trampoline. If ftrace detects another
>> IPMODIFY ops on the same function, it will let BPF trampoline know
>> with -EAGAIN from register_ftrace_direct_multi(). Then, BPF side will
>> regenerate a #3 trampoline and register it again.
>
> This is too complex. You are missing the simple way.
>
>>
>> I know this somewhat changes the policy with direct ops, but it is the
>> only way this can work, AFAICT.
>
> I disagree. There's a much better way that this can work.
>
>>
>> Does this make sense? Did I miss something?
>
>
> Let me start from the beginning.
I got your point now. We replace the flag on direct trampoline with a
callback check. So yes, this works.
>
> 1. Live kernel patching patches function foo.
>
> 2. lkp has an ops->flags | IPMODIFY set when it registers.
>
> 3. bpf registers a direct trampoline to function foo.
>
> 4. bpf has an ops->flags | DIRECT set when it registers
>
> 5. ftrace sees that the function already has an ops->flag = IPMODIFY on it,
> so it calls bpf ops->ops_func(SHARE_WITH_IPMODIFY)
>
> 6. bpf can and does the following
>
> a. if it's the simple #1 trampoline (just traces the start of a function)
> it doesn't need to do anything special and returns "yes".
>
> b. if it's the #2 trampoline, it will change the trampoline to use the
> stack to find what to call, and returns "yes".
>
> 7. ftrace gets "yes" and sets the *ipmodify* ops with SHARED_IPMODIFY
> (there's a reason for setting this flag for the ipmodify ops and not the
> direct ops).
>
>
> 8. LKP is removed from the function foo.
>
> 9. ftrace sees the lkp IPMODIFY ops has SHARED_IPMODIFY on it, and knows
> that there's a direct call here too. It removes the IPMODIFY ops, and
> then calls the direct ops->ops_func(STOP_SHARE_WITH_IPMODIFY) to let the
> direct code know that it is no longer sharing with an IPMODIFY such that
> it can change to call the function directly and not use the stack.
I wonder whether we still need this flag. Alternatively, we can always
find direct calls on the function and call ops_func(STOP_SHARE_WITH_IPMODIFY).
What do you think about this?
Thanks,
Song
On Thu, 14 Jul 2022 04:37:43 +0000
Song Liu <[email protected]> wrote:
> >
> >>
> >> non-direct ops without IPMODIFY can already share with IPMODIFY ops.
> >
> > It can? ftrace sets IPMODIFY for all DIRECT callers to prevent that. Except
> > for this patch that removes that restriction (which I believe is broken).
>
> I mean "non-direct" ftrace ops, not direct ftrace ops.
Ah, sorry misunderstood that.
> > Let me start from the beginning.
>
> I got your point now. We replace the flag on direct trampoline with a
> callback check. So yes, this works.
I'm glad we are on the same page :-)
> > 9. ftrace sees the lkp IPMODIFY ops has SHARED_IPMODIFY on it, and knows
> > that there's a direct call here too. It removes the IPMODIFY ops, and
> > then calls the direct ops->ops_func(STOP_SHARE_WITH_IPMODIFY) to let the
> > direct code know that it is no longer sharing with an IPMODIFY such that
> > it can change to call the function directly and not use the stack.
>
> I wonder whether we still need this flag. Alternatively, we can always
> find direct calls on the function and call ops_func(STOP_SHARE_WITH_IPMODIFY).
Actually, we don't need the new flag, and we don't need to always search. When
a direct ops is attached to the function, rec->flags will have
FTRACE_FL_DIRECT set.
Then, if an IPMODIFY ops is being removed and rec->flags has FTRACE_FL_DIRECT
set, we know to search the ops for the one that has the DIRECT flag and call
the ops_func() on that one.
We should also add a FTRACE_WARN_ON() if a direct is not found but the flag
was set.
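Roughly (a sketch; lookup_direct_ops() is a made-up helper that finds the
direct ops attached to the record):

	if (rec->flags & FTRACE_FL_DIRECT) {
		struct ftrace_ops *dops = lookup_direct_ops(rec);

		/* the flag says a direct ops must be here */
		if (FTRACE_WARN_ON(!dops))
			return;

		if (dops->ops_func)
			dops->ops_func(dops,
				       FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY);
	}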
>
> What do you think about this?
>
I think this works.
Also, on the patch that implements this in the next version, please add to
the change log:
Link: https://lore.kernel.org/all/[email protected]/
so that we have a link to this discussion.
Thanks,
-- Steve
> On Jul 13, 2022, at 5:33 PM, Steven Rostedt <[email protected]> wrote:
>
> On Thu, 2 Jun 2022 12:37:04 -0700
> Song Liu <[email protected]> wrote:
>
>> live patch and BPF trampoline (kfunc/kretfunc in bpftrace) are important
>> features for modern systems. Currently, it is not possible to use live
>> patch and BPF trampoline on the same kernel function at the same time.
>> This is because of the resitriction that only one ftrace_ops with flag
>> FTRACE_OPS_FL_IPMODIFY on the same kernel function.
>>
>> BPF trampoline uses direct ftrace_ops, which assumes IPMODIFY. However,
>> not all direct ftrace_ops would overwrite the actual function. This means
>> it is possible to have a non-IPMODIFY direct ftrace_ops to share the same
>> kernel function with an IPMODIFY ftrace_ops.
>>
>> Introduce FTRACE_OPS_FL_SHARE_IPMODIFY, which allows the direct ftrace_ops
>> to share with IPMODIFY ftrace_ops. With FTRACE_OPS_FL_SHARE_IPMODIFY flag
>> set, the direct ftrace_ops would call the target function picked by the
>> IPMODIFY ftrace_ops.
>>
>> Comment "IPMODIFY, DIRECT, and SHARE_IPMODIFY" in include/linux/ftrace.h
>> contains more information about how SHARE_IPMODIFY interacts with IPMODIFY
>> and DIRECT flags.
>>
>> Signed-off-by: Song Liu <[email protected]>
>> ---
>> include/linux/ftrace.h | 74 +++++++++++++++++
>> kernel/trace/ftrace.c | 179 ++++++++++++++++++++++++++++++++++++++---
>> 2 files changed, 242 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
>> index 9023bf69f675..bfacf608de9c 100644
>> --- a/include/linux/ftrace.h
>> +++ b/include/linux/ftrace.h
>> @@ -98,6 +98,18 @@ static inline int ftrace_mod_get_kallsym(unsigned int symnum, unsigned long *val
>> }
>> #endif
>>
>> +/*
>> + * FTRACE_OPS_CMD_* commands allow the ftrace core logic to request changes
>> + * to a ftrace_ops.
>> + *
>> + * ENABLE_SHARE_IPMODIFY - enable FTRACE_OPS_FL_SHARE_IPMODIFY.
>> + * DISABLE_SHARE_IPMODIFY - disable FTRACE_OPS_FL_SHARE_IPMODIFY.
>
> The above comment is basically:
>
> /* Set x to 1 */
> x = 1;
>
> Probably something like this:
>
> * FTRACE_OPS_CMD_* commands allow the ftrace core logic to request changes
> * to a ftrace_ops. Note, the requests may fail.
> *
> * ENABLE_SHARE_IPMODIFY - Request setting the ftrace ops
> * SHARE_IPMODIFY flag.
> * DISABLE_SHARE_IPMODIFY - Request disabling the ftrace ops
> * SHARE_IPMODIFY flag.
>
>
>> + */
>> +enum ftrace_ops_cmd {
>> + FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY,
>> + FTRACE_OPS_CMD_DISABLE_SHARE_IPMODIFY,
>> +};
>> +
>> #ifdef CONFIG_FUNCTION_TRACER
>>
>> extern int ftrace_enabled;
>> @@ -189,6 +201,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops);
>> * ftrace_enabled.
>> * DIRECT - Used by the direct ftrace_ops helper for direct functions
>> * (internal ftrace only, should not be used by others)
>> + * SHARE_IPMODIFY - For direct ftrace_ops only. Set when the direct function
>> + * is ready to share the same kernel function with an IPMODIFY function
>> + * (live patch, etc.).
>> */
>> enum {
>> FTRACE_OPS_FL_ENABLED = BIT(0),
>> @@ -209,8 +224,66 @@ enum {
>> FTRACE_OPS_FL_TRACE_ARRAY = BIT(15),
>> FTRACE_OPS_FL_PERMANENT = BIT(16),
>> FTRACE_OPS_FL_DIRECT = BIT(17),
>> + FTRACE_OPS_FL_SHARE_IPMODIFY = BIT(18),
>> };
>>
>> +/*
>> + * IPMODIFY, DIRECT, and SHARE_IPMODIFY.
>> + *
>> + * ftrace provides IPMODIFY flag for users to replace existing kernel
>> + * function with a different version. This is achieved by setting regs->ip.
>> + * The top user of IPMODIFY is live patch.
>> + *
>> + * DIRECT allows user to load custom trampoline on top of ftrace. DIRECT
>> + * ftrace does not overwrite regs->ip. Instead, the custom trampoline is
>
> No need to state if DIRECT modifies regs->ip or not. ftrace must assume
> that it does (more below).
>
>> + * saved separately (for example, orig_ax on x86). The top user of DIRECT
>> + * is bpf trampoline.
>> + *
>> + * It is not super rare to have both live patch and bpf trampoline on the
>> + * same kernel function. Therefore, it is necessary to allow the two work
>
> "the two to work"
>
>> + * with each other. Given that IPMODIFY and DIRECT target addressese are
>
> "addresses"
>
>> + * saved separately, this is feasible, but we need to be careful.
>> + *
>> + * The policy between IPMODIFY and DIRECT is:
>> + *
>> + * 1. Each kernel function can only have one IPMODIFY ftrace_ops;
>> + * 2. Each kernel function can only have one DIRECT ftrace_ops;
>> + * 3. DIRECT ftrace_ops may have IPMODIFY or not;
>
> I was thinking about this more, and I think by default we should
> consider all DIRECT ftrace_ops as the same as IPMODIFY. So perhaps the
> first patch is to just remove the IPMODIFY from direct (as you did) but
> then make all checks for multiple IPMODIFY also check DIRECT as well.
>
> That is because there's no way that ftrace can verify that a direct
> trampoline modifies the IP or not. Thus, it must assume that all do.
>
>> + * 4. Each kernel function may have one non-DIRECT IPMODIFY ftrace_ops,
>> + * and one non-IPMODIFY DIRECT ftrace_ops at the same time. This
>> + * requires support from the DIRECT ftrace_ops. Specifically, the
>> + * DIRECT trampoline should call the kernel function at regs->ip.
>> + * If the DIRECT ftrace_ops supports sharing a function with ftrace_ops
>> + * with IPMODIFY, it should set flag SHARE_IPMODIFY.
>> + *
>> + * Some DIRECT ftrace_ops has an option to enable SHARE_IPMODIFY or not.
>> + * Usually, the non-SHARE_IPMODIFY option gives better performance. To take
>> + * advantage of this performance benefit, it is necessary to only enable
>
> The performance part of this comment should not be in ftrace. It's an
> implementation detail of the direct trampoline and may not even be
> accurate with other implementations.
>
>> + * SHARE_IPMODIFY only when it is on the same function as an IPMODIFY
>> + * ftrace_ops. There are two cases to consider:
>> + *
>> + * 1. IPMODIFY ftrace_ops is registered first. When the (non-IPMODIFY, and
>> + * non-SHARE_IPMODIFY) DIRECT ftrace_ops is registered later,
>> + * register_ftrace_direct_multi() returns -EAGAIN. If the user of
>> + * the DIRECT ftrace_ops can support SHARE_IPMODIFY, it should enable
>> + * SHARE_IPMODIFY and retry.
>
> If this ftrace_ops being registered can support SHARE_IPMODIFY, then it
> should have the ops_func defined, in which case, why not have it just
> call that instead of having to return -EAGAIN?
>
>
>> + * 2. (non-IPMODIFY, and non-SHARE_IPMODIFY) DIRECT ftrace_ops is
>> + * registered first. When the IPMODIFY ftrace_ops is registered later,
>> + * it is necessary to ask the direct ftrace_ops to enable
>> + * SHARE_IPMODIFY support. This is achieved via ftrace_ops->ops_func
>> + * cmd=FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY. For more details on this
>> + * condition, check out prepare_direct_functions_for_ipmodify().
>> + */
>> +
>> +/*
>> + * For most ftrace_ops_cmd,
>> + * Returns:
>> + * 0 - Success.
>> + * -EBUSY - The operation cannot process
>> + * -EAGAIN - The operation cannot process temporarily.
>> + */
>> +typedef int (*ftrace_ops_func_t)(struct ftrace_ops *op, enum ftrace_ops_cmd cmd);
>> +
>> #ifdef CONFIG_DYNAMIC_FTRACE
>> /* The hash used to know what functions callbacks trace */
>> struct ftrace_ops_hash {
>> @@ -253,6 +326,7 @@ struct ftrace_ops {
>> unsigned long trampoline;
>> unsigned long trampoline_size;
>> struct list_head list;
>> + ftrace_ops_func_t ops_func;
>> #endif
>> };
>>
>> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
>> index 6a419f6bbbf0..868bbc753803 100644
>> --- a/kernel/trace/ftrace.c
>> +++ b/kernel/trace/ftrace.c
>> @@ -1865,7 +1865,8 @@ static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops,
>> /*
>> * Try to update IPMODIFY flag on each ftrace_rec. Return 0 if it is OK
>> * or no-needed to update, -EBUSY if it detects a conflict of the flag
>> - * on a ftrace_rec, and -EINVAL if the new_hash tries to trace all recs.
>> + * on a ftrace_rec, -EINVAL if the new_hash tries to trace all recs, and
>> + * -EAGAIN if the ftrace_ops need to enable SHARE_IPMODIFY.
>
> It should just call the ftrace_ops() with the command to set it. If you
> want, we could add another CMD enum that can be passed for this case.
>
>> * Note that old_hash and new_hash has below meanings
>> * - If the hash is NULL, it hits all recs (if IPMODIFY is set, this is rejected)
>> * - If the hash is EMPTY_HASH, it hits nothing
>> @@ -1875,6 +1876,7 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
>> struct ftrace_hash *old_hash,
>> struct ftrace_hash *new_hash)
>> {
>> + bool is_ipmodify, is_direct, share_ipmodify;
>> struct ftrace_page *pg;
>> struct dyn_ftrace *rec, *end = NULL;
>> int in_old, in_new;
>> @@ -1883,7 +1885,24 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
>> if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
>> return 0;
>>
>> - if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY))
>> + /*
>> + * The following are all the valid combinations of is_ipmodify,
>> + * is_direct, and share_ipmodify
>> + *
>> + * is_ipmodify is_direct share_ipmodify
>> + * #1 0 0 0
>> + * #2 1 0 0
>> + * #3 1 1 0
>
> I still think that DIRECT should automatically be considered IPMODIFY
> (at least in the view of ftrace, whether or not the direct function
> modifies the IP).
>
>> + * #4 0 1 0
>> + * #5 0 1 1
>> + */
>> +
>> +
>> + is_ipmodify = ops->flags & FTRACE_OPS_FL_IPMODIFY;
>> + is_direct = ops->flags & FTRACE_OPS_FL_DIRECT;
>> +
>> + /* neither ipmodify nor direct, skip */
>> + if (!is_ipmodify && !is_direct) /* combination #1 */
>> return 0;
>>
>> /*
>> @@ -1893,6 +1912,30 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
>> if (!new_hash || !old_hash)
>> return -EINVAL;
>>
>> + share_ipmodify = ops->flags & FTRACE_OPS_FL_SHARE_IPMODIFY;
>> +
>> + /*
>> + * This ops itself doesn't do ip_modify and it can share a fentry
>> + * with other ops with ipmodify, nothing to do.
>> + */
>> + if (!is_ipmodify && share_ipmodify) /* combination #5 */
>> + return 0;
>> +
>
> Really, if connecting to a function that already has IPMODIFY, then the
> ops_func() needs to be called, and if the ops supports SHARED_IPMODIFY
> then it should get set and then continue.
>
> Make sense?
>
> -- Steve
>
I think there is one more problem here. If we force all direct trampolines to
set IPMODIFY and remove the SHARE_IPMODIFY flag, it may cause confusion
and/or extra work here (__ftrace_hash_update_ipmodify).
Say __ftrace_hash_update_ipmodify() tries to attach an ops with IPMODIFY,
and finds the rec already has IPMODIFY. At this point, we have to iterate
over all ftrace ops (do_for_each_ftrace_op) to confirm whether the IPMODIFY
is from
1) a direct ops that can share IPMODIFY, or
2) a direct ops that cannot share IPMODIFY, or
3) another non-direct ops with IPMODIFY.
For 1), the attach works, while for 2) and 3), it doesn't.
OTOH, with the current version (v2), we trust the direct ops to set or clear
IPMODIFY. A rec with IPMODIFY makes it clear that it cannot share with
another ops with IPMODIFY. Then we don't have to iterate over all ftrace
ops here.
Does this make sense? Did I miss some better solutions?
Thanks,
Song
>> + /*
>> + * Only three combinations of is_ipmodify, is_direct, and
>> + * share_ipmodify for the logic below:
>> + * #2 live patch
>> + * #3 direct with ipmodify
>> + * #4 direct without ipmodify
>> + *
>> + * is_ipmodify is_direct share_ipmodify
>> + * #2 1 0 0
>> + * #3 1 1 0
>> + * #4 0 1 0
>> + *
>> + * Only update/rollback rec->flags for is_ipmodify == 1 (#2 and #3)
>> + */
>> +
>> /* Update rec->flags */
>> do_for_each_ftrace_rec(pg, rec) {
>>
>> @@ -1906,12 +1949,18 @@ static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
>> continue;
>>
>> if (in_new) {
>> - /* New entries must ensure no others are using it */
>> - if (rec->flags & FTRACE_FL_IPMODIFY)
>> - goto rollback;
>> - rec->flags |= FTRACE_FL_IPMODIFY;
>> - } else /* Removed entry */
>> + if (rec->flags & FTRACE_FL_IPMODIFY) {
>> + /* cannot have two ipmodify on same rec */
>> + if (is_ipmodify) /* combination #2 and #3 */
>> + goto rollback;
>> + /* let user enable share_ipmodify and retry */
>> + return -EAGAIN; /* combination #4 */
>> + } else if (is_ipmodify) {
>> + rec->flags |= FTRACE_FL_IPMODIFY;
>> + }
>> + } else if (is_ipmodify) {/* Removed entry */
>> rec->flags &= ~FTRACE_FL_IPMODIFY;
>> + }
>> } while_for_each_ftrace_rec();
>>
>> return 0;
On Fri, 15 Jul 2022 00:13:51 +0000
Song Liu <[email protected]> wrote:
> I think there is one more problem here. If we force all direct trampolines to
> set IPMODIFY and remove the SHARE_IPMODIFY flag, it may cause confusion
> and/or extra work here (__ftrace_hash_update_ipmodify).
I'm saying that the caller (BPF) does not need to set IPMODIFY. Just
change ftrace.c to treat IPMODIFY the same as direct. In fact, the
two flags should be mutually exclusive (from a ftrace point of view). That
is, if DIRECT is set, then IPMODIFY must not be.
Again, ftrace will take care of the accounting of whether a rec has both
IPMODIFY and DIRECT on it.
>
> Say __ftrace_hash_update_ipmodify() tries to attach an ops with IPMODIFY,
> and finds the rec already has IPMODIFY. At this point, we have to iterate
> over all ftrace ops (do_for_each_ftrace_op) to confirm whether the IPMODIFY
> is from
What I'm suggesting is that a DIRECT ops will never set IPMODIFY. Like
I mentioned before, ftrace doesn't care if the direct trampoline
modifies IP or not. All ftrace will do is ask the owner of the direct
ops if it is safe to have an IP modify attached to it or not, and at
the same time it will tell the direct ops owner that it is adding an
IPMODIFY ops such that it can take the appropriate actions.
In other words, IPMODIFY on the rec means that it cannot attach
anything else with an IPMODIFY on it.
The direct ops should only set the DIRECT flag. If an IPMODIFY ops is
being added and the rec already has an IPMODIFY, then it will fail right there.
If direct is set, then a loop to find the direct ops will be done and the
callback is made. If the direct ops is OK with the IPMODIFY then it
will adjust itself to work with the IPMODIFY and the IPMODIFY will
continue to be added (and then set the rec IPMODIFY flag).
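Roughly, as an illustration of that flow (untested; reusing the flag and
command names from v2):

    /* adding an IPMODIFY ops to this rec */
    if (rec->flags & FTRACE_FL_IPMODIFY)
        goto rollback;          /* two IPMODIFY users can never share */

    if (rec->flags & FTRACE_FL_DIRECT) {
        struct ftrace_ops *op;

        /* ask the direct ops owner whether it is OK to share */
        do_for_each_ftrace_op(op, ftrace_ops_list) {
            if (!(op->flags & FTRACE_OPS_FL_DIRECT) ||
                !hash_contains_ip(rec->ip, op->func_hash))
                continue;
            if (!op->ops_func ||
                op->ops_func(op, FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY))
                goto rollback;  /* the owner refused to share */
            break;
        } while_for_each_ftrace_op(op);
    }
    rec->flags |= FTRACE_FL_IPMODIFY;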
>
> 1) a direct ops that can share IPMODIFY, or
> 2) a direct ops that cannot share IPMODIFY, or
> 3) another non-direct ops with IPMODIFY.
>
> For 1), the attach works, while for 2) and 3), it doesn't.
>
> OTOH, with the current version (v2), we trust the direct ops to set or clear
> IPMODIFY. A rec with IPMODIFY makes it clear that it cannot share with
> another ops with IPMODIFY. Then we don't have to iterate over all ftrace ops here.
The only time an iteration should be done is if rec->flags has DIRECT and !IPMODIFY.
>
> Does this make sense? Did I miss some better solutions?
Did I fill in the holes? ;-)
-- Steve
> On Jul 14, 2022, at 5:48 PM, Steven Rostedt <[email protected]> wrote:
>
> On Fri, 15 Jul 2022 00:13:51 +0000
> Song Liu <[email protected]> wrote:
>
>> I think there is one more problem here. If we force all direct trampolines to
>> set IPMODIFY and remove the SHARE_IPMODIFY flag, it may cause confusion
>> and/or extra work here (__ftrace_hash_update_ipmodify).
>
> I'm saying that the caller (BPF) does not need to set IPMODIFY. Just
> change ftrace.c to treat IPMODIFY the same as direct. In fact, the
> two flags should be mutually exclusive (from a ftrace point of view). That
> is, if DIRECT is set, then IPMODIFY must not be.
>
> Again, ftrace will take care of the accounting of whether a rec has both
> IPMODIFY and DIRECT on it.
>
>>
>> Say __ftrace_hash_update_ipmodify() tries to attach an ops with IPMODIFY,
>> and finds the rec already has IPMODIFY. At this point, we have to iterate
>> over all ftrace ops (do_for_each_ftrace_op) to confirm whether the IPMODIFY
>> is from
>
> What I'm suggesting is that a DIRECT ops will never set IPMODIFY.
Aha, this is the point I misunderstood. I thought a DIRECT ops would always
set IPMODIFY (as it does now).
> Like
> I mentioned before, ftrace doesn't care if the direct trampoline
> modifies IP or not. All ftrace will do is ask the owner of the direct
> ops if it is safe to have an IP modify attached to it or not, and at
> the same time it will tell the direct ops owner that it is adding an
> IPMODIFY ops such that it can take the appropriate actions.
>
> In other words, IPMODIFY on the rec means that it cannot attach
> anything else with an IPMODIFY on it.
>
> The direct ops should only set the DIRECT flag. If an IPMODIFY ops is
> being added and the rec already has an IPMODIFY, then it will fail right there.
>
> If direct is set, then a loop to find the direct ops will be done and the
> callback is made. If the direct ops is OK with the IPMODIFY then it
> will adjust itself to work with the IPMODIFY and the IPMODIFY will
> continue to be added (and then set the rec IPMODIFY flag).
>
>>
>> 1) a direct ops that can share IPMODIFY, or
>> 2) a direct ops that cannot share IPMODIFY, or
>> 3) another non-direct ops with IPMODIFY.
>>
>> For 1), the attach works, while for 2) and 3), it doesn't.
>>
>> OTOH, with the current version (v2), we trust the direct ops to set or clear
>> IPMODIFY. A rec with IPMODIFY makes it clear that it cannot share with
>> another ops with IPMODIFY. Then we don't have to iterate over all ftrace ops here.
>
> The only time an iteration should be done is if rec->flags has DIRECT and !IPMODIFY.
Yeah, this makes sense. And this shouldn't be too expensive.
>
>>
>> Does this make sense? Did I miss some better solutions?
>
> Did I fill in the holes? ;-)
You sure did. :)
Thanks,
Song
On Fri, 15 Jul 2022 02:04:33 +0000
Song Liu <[email protected]> wrote:
> > What I'm suggesting is that a DIRECT ops will never set IPMODIFY.
>
> Aha, this is the point I misunderstood. I thought a DIRECT ops would always
> set IPMODIFY (as it does now).
My fault. I was probably not being clear when I was suggesting that
DIRECT should *act* like an IPMODIFY, but never explicitly stated that
it should not set the IPMODIFY flag.
The only reason it does so today was to make it easy to act like an
IPMODIFY (because it set the flag). But I'm now suggesting to get rid
of that and just make DIRECT act like an IPMODIFY, as there can only be
one of them on a function, but now we have some cases where DIRECT can
work with IPMODIFY via the callbacks.
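That is, registration could then enforce the exclusion with something as
simple as (a sketch of the invariant, not a patch):

    /* DIRECT and IPMODIFY become mutually exclusive ops flags */
    if ((ops->flags & FTRACE_OPS_FL_DIRECT) &&
        (ops->flags & FTRACE_OPS_FL_IPMODIFY))
        return -EINVAL;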
-- Steve
> On Jul 14, 2022, at 7:46 PM, Steven Rostedt <[email protected]> wrote:
>
> On Fri, 15 Jul 2022 02:04:33 +0000
> Song Liu <[email protected]> wrote:
>
>>> What I'm suggesting is that a DIRECT ops will never set IPMODIFY.
>>
>> Aha, this is the point I misunderstood. I thought a DIRECT ops would always
>> set IPMODIFY (as it does now).
>
> My fault. I was probably not being clear when I was suggesting that
> DIRECT should *act* like an IPMODIFY, but never explicitly stated that
> it should not set the IPMODIFY flag.
>
> The only reason it does so today was to make it easy to act like an
> IPMODIFY (because it set the flag). But I'm now suggesting to get rid
> of that and just make DIRECT act like an IPMODIFY, as there can only be
> one of them on a function, but now we have some cases where DIRECT can
> work with IPMODIFY via the callbacks.
Thanks for the clarification. I think we are finally on the same page on
this. :)
Song
Hi Steven,
> On Jul 14, 2022, at 7:50 PM, Song Liu <[email protected]> wrote:
>
>
>
>> On Jul 14, 2022, at 7:46 PM, Steven Rostedt <[email protected]> wrote:
>>
>> On Fri, 15 Jul 2022 02:04:33 +0000
>> Song Liu <[email protected]> wrote:
>>
>>>> What I'm suggesting is that a DIRECT ops will never set IPMODIFY.
>>>
>>> Aha, this is the point I misunderstood. I thought a DIRECT ops would always
>>> set IPMODIFY (as it does now).
>>
>> My fault. I was probably not being clear when I was suggesting that
>> DIRECT should *act* like an IPMODIFY, but never explicitly stated that
>> it should not set the IPMODIFY flag.
>>
>> The only reason it does so today was to make it easy to act like an
>> IPMODIFY (because it set the flag). But I'm now suggesting to get rid
>> of that and just make DIRECT act like an IPMODIFY, as there can only be
>> one of them on a function, but now we have some cases where DIRECT can
>> work with IPMODIFY via the callbacks.
>
> Thanks for the clarification. I think we are finally on the same page on
> this. :)
A quick update and ask for feedback/clarification.
Based on my understanding, you recommended calling ops_func() from
__ftrace_hash_update_ipmodify() and in ops_func() the direct trampoline
may make changes to the trampoline. Did I get this right?
I am going in this direction, but hit an issue. Specifically, in
__ftrace_hash_update_ipmodify(), ftrace_lock is already held, so the
direct trampoline cannot easily make changes with
modify_ftrace_direct_multi(), which takes both direct_mutex and
ftrace_lock.
One solution would be to have no-lock versions of all the functions called
by modify_ftrace_direct_multi(), but that's a lot of functions and the
code would be pretty ugly. The alternative would be the logic in v2:
__ftrace_hash_update_ipmodify() returns -EAGAIN, and we make changes to
the direct trampoline in other places:
1) if DIRECT ops attached first, the trampoline is updated in
prepare_direct_functions_for_ipmodify(), see 3/5 of v2;
2) if IPMODIFY ops attached first, the trampoline is updated in
bpf_trampoline_update(), see "goto again" path in 5/5 of v2.
Overall, I think this way is still cleaner. What do you think about this?
Thanks,
Song
On Fri, 15 Jul 2022 17:42:55 +0000
Song Liu <[email protected]> wrote:
> A quick update and ask for feedback/clarification.
>
> Based on my understanding, you recommended calling ops_func() from
> __ftrace_hash_update_ipmodify() and in ops_func() the direct trampoline
> may make changes to the trampoline. Did I get this right?
>
>
> I am going in this direction, but hit an issue. Specifically, in
> __ftrace_hash_update_ipmodify(), ftrace_lock is already held, so the
> direct trampoline cannot easily make changes with
> modify_ftrace_direct_multi(), which takes both direct_mutex and
> ftrace_lock.
>
> One solution would be to have no-lock versions of all the functions called
> by modify_ftrace_direct_multi(), but that's a lot of functions and the
> code would be pretty ugly. The alternative would be the logic in v2:
> __ftrace_hash_update_ipmodify() returns -EAGAIN, and we make changes to
> the direct trampoline in other places:
>
> 1) if DIRECT ops attached first, the trampoline is updated in
> prepare_direct_functions_for_ipmodify(), see 3/5 of v2;
>
> 2) if IPMODIFY ops attached first, the trampoline is updated in
> bpf_trampoline_update(), see "goto again" path in 5/5 of v2.
>
> Overall, I think this way is still cleaner. What do you think about this?
What about if we release the lock when doing the callback?
Then we just need to make sure things are the same after reacquiring the
lock, and if they are different, we release the lock again and do the
callback with the new update. Wash, rinse, repeat, until the state is the
same before and after the callback with locks acquired?
This is a common way to handle callbacks that need to do something that
takes the lock held before doing a callback.
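Something along these lines, where snapshot_state() is a made-up
placeholder for however we detect that things changed while the lock was
dropped:

    state = snapshot_state();           /* hypothetical */
    for (;;) {
        mutex_unlock(&ftrace_lock);
        ret = op->ops_func(op, cmd);    /* callback without the lock */
        mutex_lock(&ftrace_lock);
        if (state == snapshot_state())  /* nothing changed, done */
            break;
        state = snapshot_state();       /* wash, rinse, repeat */
    }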
The reason I say this, is because the more we can keep the accounting
inside of ftrace the better.
Wouldn't this need to be done anyway if BPF was first and live kernel
patching needed the update? An -EAGAIN would not suffice.
-- Steve
> On Jul 15, 2022, at 12:12 PM, Steven Rostedt <[email protected]> wrote:
>
> On Fri, 15 Jul 2022 17:42:55 +0000
> Song Liu <[email protected]> wrote:
>
>
>> A quick update and ask for feedback/clarification.
>>
>> Based on my understanding, you recommended calling ops_func() from
>> __ftrace_hash_update_ipmodify() and in ops_func() the direct trampoline
>> may make changes to the trampoline. Did I get this right?
>>
>>
>> I am going in this direction, but hit an issue. Specifically, in
>> __ftrace_hash_update_ipmodify(), ftrace_lock is already held, so the
>> direct trampoline cannot easily make changes with
>> modify_ftrace_direct_multi(), which takes both direct_mutex and
>> ftrace_lock.
>>
>> One solution would be to have no-lock versions of all the functions called
>> by modify_ftrace_direct_multi(), but that's a lot of functions and the
>> code would be pretty ugly. The alternative would be the logic in v2:
>> __ftrace_hash_update_ipmodify() returns -EAGAIN, and we make changes to
>> the direct trampoline in other places:
>>
>> 1) if DIRECT ops attached first, the trampoline is updated in
>> prepare_direct_functions_for_ipmodify(), see 3/5 of v2;
>>
>> 2) if IPMODIFY ops attached first, the trampoline is updated in
>> bpf_trampoline_update(), see "goto again" path in 5/5 of v2.
>>
>> Overall, I think this way is still cleaner. What do you think about this?
>
> What about if we release the lock when doing the callback?
We can probably unlock ftrace_lock here. But we may break locking order
with direct mutex (see below).
>
> Then we just need to make sure things are the same after reacquiring the
> lock, and if they are different, we release the lock again and do the
> callback with the new update. Wash, rinse, repeat, until the state is the
> same before and after the callback with locks acquired?
Personally, I would like to avoid wash-rinse-repeat here.
>
> This is a common way to handle callbacks that need to do something that
> takes the lock held before doing a callback.
>
> The reason I say this, is because the more we can keep the accounting
> inside of ftrace the better.
>
> Wouldn't this need to be done anyway if BPF was first and live kernel
> patching needed the update? An -EAGAIN would not suffice.
prepare_direct_functions_for_ipmodify handles BPF-first-livepatch-later
case. The benefit of prepare_direct_functions_for_ipmodify() is that it
holds direct_mutex before ftrace_lock, and keeps holding it if necessary.
This is enough to make sure we don't need the wash-rinse-repeat.
OTOH, if we wait until __ftrace_hash_update_ipmodify(), we already hold
ftrace_lock, but not direct_mutex. To make changes to bpf trampoline, we
have to unlock ftrace_lock and lock direct_mutex to avoid deadlock.
However, this means we will need the wash-rinse-repeat.
For livepatch-first-BPF-later case, we can probably handle this in
__ftrace_hash_update_ipmodify(), since we hold both direct_mutex and
ftrace_lock. We can unlock ftrace_lock and update the BPF trampoline.
It is safe against changes to direct ops, because we are still holding
direct_mutex. But, is this safe against another IPMODIFY ops? I am not
sure yet... Also, this is pretty weird because we are updating a
direct trampoline before we finish registering it for the first time.
IOW, we are calling modify_ftrace_direct_multi_nolock for the same
trampoline before register_ftrace_direct_multi() returns.
The approach in v2 propagates the -EAGAIN to BPF side, so these are two
independent calls of register_ftrace_direct_multi(). This does require
some protocol between ftrace core and its user, but I still think this
is a cleaner approach.
Does this make sense?
Thanks,
Song
On Fri, 15 Jul 2022 19:49:00 +0000
Song Liu <[email protected]> wrote:
> >
> > What about if we release the lock when doing the callback?
>
> We can probably unlock ftrace_lock here. But we may break locking order
> with direct mutex (see below).
You're talking about the multi registering case, right?
>
> >
> > Then we just need to make sure things are the same after reacquiring the
> > lock, and if they are different, we release the lock again and do the
> > callback with the new update. Wash, rinse, repeat, until the state is the
> > same before and after the callback with locks acquired?
>
> Personally, I would like to avoid wash-rinse-repeat here.
But it's common to do. Keeps your hair cleaner that way ;-)
>
> >
> > This is a common way to handle callbacks that need to do something that
> > takes the lock held before doing a callback.
> >
> > The reason I say this, is because the more we can keep the accounting
> > inside of ftrace the better.
> >
> > Wouldn't this need to be done anyway if BPF was first and live kernel
> > patching needed the update? An -EAGAIN would not suffice.
>
> prepare_direct_functions_for_ipmodify handles BPF-first-livepatch-later
> case. The benefit of prepare_direct_functions_for_ipmodify() is that it
> holds direct_mutex before ftrace_lock, and keeps holding it if necessary.
> This is enough to make sure we don't need the wash-rinse-repeat.
>
> OTOH, if we wait until __ftrace_hash_update_ipmodify(), we already hold
> ftrace_lock, but not direct_mutex. To make changes to bpf trampoline, we
> have to unlock ftrace_lock and lock direct_mutex to avoid deadlock.
> However, this means we will need the wash-rinse-repeat.
>
>
> For livepatch-first-BPF-later case, we can probably handle this in
> __ftrace_hash_update_ipmodify(), since we hold both direct_mutex and
> ftrace_lock. We can unlock ftrace_lock and update the BPF trampoline.
> It is safe against changes to direct ops, because we are still holding
> direct_mutex. But, is this safe against another IPMODIFY ops? I am not
> sure yet... Also, this is pretty weird because we are updating a
> direct trampoline before we finish registering it for the first time.
> IOW, we are calling modify_ftrace_direct_multi_nolock for the same
> trampoline before register_ftrace_direct_multi() returns.
>
> The approach in v2 propagates the -EAGAIN to BPF side, so these are two
> independent calls of register_ftrace_direct_multi(). This does require
> some protocol between ftrace core and its user, but I still think this
> is a cleaner approach.
The issue I have with this approach is it couples BPF and ftrace a bit too
much.
But there is a way with my approach you can still do your approach. That
is, have ops_func() return zero if everything is fine, and otherwise return
a negative value. Then have the register function fail and return whatever
value gets returned by the ops_func().
Then have the bpf ops_func() check (does this direct caller handle
IPMODIFY? if yes, return 0, else return -EAGAIN). Then the registering of
ftrace fails with your -EAGAIN, and then you can change the direct
trampoline to handle IPMODIFY and try again. This time when ops_func() is
called, it sees that the direct trampoline can handle the IPMODIFY and
returns 0.
Basically, it's a way to still implement my suggestion, but let BPF decide
to use -EAGAIN to try again. And then BPF and ftrace don't need to have
these special flags to change the behavior of each other.
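For example, the bpf side's ops_func() could look roughly like this
(tr->flags and the BPF_TRAMP_F_SHARE_IPMODIFY bit are assumptions about how
the trampoline would track this, not existing code):

static int bpf_tramp_ftrace_ops_func(struct ftrace_ops *ops,
                                     enum ftrace_ops_cmd cmd)
{
    struct bpf_trampoline *tr = ops->private;

    if (cmd != FTRACE_OPS_CMD_ENABLE_SHARE_IPMODIFY)
        return -EINVAL;

    /* already built to fetch the target IP from the stack? */
    if (tr->flags & BPF_TRAMP_F_SHARE_IPMODIFY)
        return 0;

    /* not yet: fail registration so the caller can rebuild and retry */
    return -EAGAIN;
}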
-- Steve
> On Jul 15, 2022, at 12:59 PM, Steven Rostedt <[email protected]> wrote:
>
> On Fri, 15 Jul 2022 19:49:00 +0000
> Song Liu <[email protected]> wrote:
>
>>>
>>> What about if we release the lock when doing the callback?
>>
>> We can probably unlock ftrace_lock here. But we may break locking order
>> with direct mutex (see below).
>
> You're talking about the multi registering case, right?
We are using the *_ftrace_direct_multi() API here, to be able to specify
ops_func. The direct single API just uses the shared direct_ops.
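For example (register_ftrace_direct() and register_ftrace_direct_multi()
are the existing APIs; the ops_func assignment is the new part, and
bpf_tramp_ftrace_ops_func is a hypothetical name):

    /* single API: every user shares the one internal direct_ops */
    ret = register_ftrace_direct(ip, (unsigned long)tramp);

    /* multi API: the caller owns the ftrace_ops, so we can set ops_func */
    tr_ops->ops_func = bpf_tramp_ftrace_ops_func;
    ret = register_ftrace_direct_multi(tr_ops, (unsigned long)tramp);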
>
>>
>>>
>>> Then we just need to make sure things are the same after reacquiring the
>>> lock, and if they are different, we release the lock again and do the
>>> callback with the new update. Wash, rinse, repeat, until the state is the
>>> same before and after the callback with locks acquired?
>>
>> Personally, I would like to avoid wash-rinse-repeat here.
>
> But it's common to do. Keeps your hair cleaner that way ;-)
>
>>
>>>
>>> This is a common way to handle callbacks that need to do something that
>>> takes the lock held before doing a callback.
>>>
>>> The reason I say this, is because the more we can keep the accounting
>>> inside of ftrace the better.
>>>
>>> Wouldn't this need to be done anyway if BPF was first and live kernel
>>> patching needed the update? An -EAGAIN would not suffice.
>>
>> prepare_direct_functions_for_ipmodify handles BPF-first-livepatch-later
>> case. The benefit of prepare_direct_functions_for_ipmodify() is that it
>> holds direct_mutex before ftrace_lock, and keeps holding it if necessary.
>> This is enough to make sure we don't need the wash-rinse-repeat.
>>
>> OTOH, if we wait until __ftrace_hash_update_ipmodify(), we already hold
>> ftrace_lock, but not direct_mutex. To make changes to bpf trampoline, we
>> have to unlock ftrace_lock and lock direct_mutex to avoid deadlock.
>> However, this means we will need the wash-rinse-repeat.
What do you think about the prepare_direct_functions_for_ipmodify()
approach? If this is not ideal, maybe we can simplify it so that it only
holds direct_mutex (when necessary). The benefit is that we are sure
direct_mutex is already held in __ftrace_hash_update_ipmodify(). However,
I think it is not safe to unlock ftrace_lock in __ftrace_hash_update_ipmodify().
We can get parallel do_for_each_ftrace_rec(), which is dangerous, no?
>>
>>
>> For livepatch-first-BPF-later case, we can probably handle this in
>> __ftrace_hash_update_ipmodify(), since we hold both direct_mutex and
>> ftrace_lock. We can unlock ftrace_lock and update the BPF trampoline.
>> It is safe against changes to direct ops, because we are still holding
>> direct_mutex. But, is this safe against another IPMODIFY ops? I am not
>> sure yet... Also, this is pretty weird because we are updating a
>> direct trampoline before we finish registering it for the first time.
>> IOW, we are calling modify_ftrace_direct_multi_nolock for the same
>> trampoline before register_ftrace_direct_multi() returns.
>>
>> The approach in v2 propagates the -EAGAIN to BPF side, so these are two
>> independent calls of register_ftrace_direct_multi(). This does require
>> some protocol between ftrace core and its user, but I still think this
>> is a cleaner approach.
>
> The issue I have with this approach is it couples BPF and ftrace a bit too
> much.
>
> But there is a way with my approach you can still do your approach. That
> is, have ops_func() return zero if everything is fine, and otherwise return
> a negative value. Then have the register function fail and return whatever
> value gets returned by the ops_func().
>
> Then have the bpf ops_func() check (does this direct caller handle
> IPMODIFY? if yes, return 0, else return -EAGAIN). Then the registering of
> ftrace fails with your -EAGAIN, and then you can change the direct
> trampoline to handle IPMODIFY and try again. This time when ops_func() is
> called, it sees that the direct trampoline can handle the IPMODIFY and
> returns 0.
>
> Basically, it's a way to still implement my suggestion, but let BPF decide
> to use -EAGAIN to try again. And then BPF and ftrace don't need to have
> these special flags to change the behavior of each other.
I like this one. So there is no protocol about the return value here.
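To spell out how that would look on the bpf side
(register_ftrace_direct_multi() is the existing API; tr->fops, the flag,
and the rebuild helper are placeholders):

again:
    ret = register_ftrace_direct_multi(tr->fops, (unsigned long)im->image);
    if (ret == -EAGAIN) {
        /*
         * The trampoline cannot share with the IPMODIFY user as built:
         * regenerate it to fetch the target IP from the stack, then
         * retry. In practice this retries at most once, because the
         * second attempt has the flag set.
         */
        tr->flags |= BPF_TRAMP_F_SHARE_IPMODIFY;
        im = bpf_tramp_image_rebuild(tr);       /* hypothetical */
        goto again;
    }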
Thanks,
Song
On Fri, 15 Jul 2022 20:21:49 +0000
Song Liu <[email protected]> wrote:
> >>> Wouldn't this need to be done anyway if BPF was first and live kernel
> >>> patching needed the update? An -EAGAIN would not suffice.
> >>
> >> prepare_direct_functions_for_ipmodify handles BPF-first-livepatch-later
> >> case. The benefit of prepare_direct_functions_for_ipmodify() is that it
> >> holds direct_mutex before ftrace_lock, and keeps holding it if necessary.
> >> This is enough to make sure we don't need the wash-rinse-repeat.
> >>
> >> OTOH, if we wait until __ftrace_hash_update_ipmodify(), we already hold
> >> ftrace_lock, but not direct_mutex. To make changes to bpf trampoline, we
> >> have to unlock ftrace_lock and lock direct_mutex to avoid deadlock.
> >> However, this means we will need the wash-rinse-repeat.
>
> What do you think about the prepare_direct_functions_for_ipmodify()
> approach? If this is not ideal, maybe we can simplify it so that it only
> holds direct_mutex (when necessary). The benefit is that we are sure
> direct_mutex is already held in __ftrace_hash_update_ipmodify(). However,
> I think it is not safe to unlock ftrace_lock in __ftrace_hash_update_ipmodify().
> We can get parallel do_for_each_ftrace_rec(), which is dangerous, no?
I'm fine with it. But one nit on the logic:
> int register_ftrace_function(struct ftrace_ops *ops)
> + __releases(&direct_mutex)
> {
> + bool direct_mutex_locked;
> int ret;
>
> ftrace_ops_init(ops);
>
> + ret = prepare_direct_functions_for_ipmodify(ops);
> + if (ret < 0)
> + return ret;
> +
> + direct_mutex_locked = ret == 1;
> +
Please make the above:
    if (ret < 0)
        return ret;
    else if (ret == 1)
        direct_mutex_locked = true;
It's much easier to read that way.
-- Steve
> mutex_lock(&ftrace_lock);
>
> ret = ftrace_startup(ops, 0);
>
> mutex_unlock(&ftrace_lock);
>
> + if (direct_mutex_locked)
> + mutex_unlock(&direct_mutex);
> return ret;
> }
> EXPORT_SYMBOL_GPL(register_ftrace_function);
> --
> On Jul 15, 2022, at 2:29 PM, Steven Rostedt <[email protected]> wrote:
>
> On Fri, 15 Jul 2022 20:21:49 +0000
> Song Liu <[email protected]> wrote:
>
>>>>> Wouldn't this need to be done anyway if BPF was first and live kernel
>>>>> patching needed the update? An -EAGAIN would not suffice.
>>>>
>>>> prepare_direct_functions_for_ipmodify handles BPF-first-livepatch-later
>>>> case. The benefit of prepare_direct_functions_for_ipmodify() is that it
>>>> holds direct_mutex before ftrace_lock, and keeps holding it if necessary.
>>>> This is enough to make sure we don't need the wash-rinse-repeat.
>>>>
>>>> OTOH, if we wait until __ftrace_hash_update_ipmodify(), we already hold
>>>> ftrace_lock, but not direct_mutex. To make changes to bpf trampoline, we
>>>> have to unlock ftrace_lock and lock direct_mutex to avoid deadlock.
>>>> However, this means we will need the wash-rinse-repeat.
>>
>> What do you think about the prepare_direct_functions_for_ipmodify()
>> approach? If this is not ideal, maybe we can simplify it so that it only
>> holds direct_mutex (when necessary). The benefit is that we are sure
>> direct_mutex is already held in __ftrace_hash_update_ipmodify(). However,
>> I think it is not safe to unlock ftrace_lock in __ftrace_hash_update_ipmodify().
>> We can get parallel do_for_each_ftrace_rec(), which is dangerous, no?
>
> I'm fine with it. But one nit on the logic:
>
>> int register_ftrace_function(struct ftrace_ops *ops)
>> + __releases(&direct_mutex)
>> {
>> + bool direct_mutex_locked;
>> int ret;
>>
>> ftrace_ops_init(ops);
>>
>> + ret = prepare_direct_functions_for_ipmodify(ops);
>> + if (ret < 0)
>> + return ret;
>> +
>> + direct_mutex_locked = ret == 1;
>> +
>
> Please make the above:
>
>     if (ret < 0)
>         return ret;
>     else if (ret == 1)
>         direct_mutex_locked = true;
>
> It's much easier to read that way.
Thanks for the clarification!
Song
On Fri, 15 Jul 2022 21:48:21 +0000
Song Liu <[email protected]> wrote:
> >> int register_ftrace_function(struct ftrace_ops *ops)
> >> + __releases(&direct_mutex)
> >> {
> >> + bool direct_mutex_locked;
You'll need:
bool direct_mutex_locked = false;
obviously ;-)
-- Steve
> >> int ret;
> >>
> >> ftrace_ops_init(ops);
> >>
> >> + ret = prepare_direct_functions_for_ipmodify(ops);
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> + direct_mutex_locked = ret == 1;
> >> +
> >
> > Please make the above:
> >
> >     if (ret < 0)
> >         return ret;
> >     else if (ret == 1)
> >         direct_mutex_locked = true;
> >
> > It's much easier to read that way.
>
> Thanks for the clarification!