This implements non-inline static calls for arm64. This is rather
straightforward, as we don't rely on any tooling to look for static
call sites etc. The only minor complication is Clang CFI, which is
already in mainline for arm64, and requires a little tweak to ensure
that we don't end up patching the CFI jump table instead of the static
call trampoline itself.
Changes since v5:
- drop the patch that works around issues with references to symbols
with static linkage from asm blocks; this is specific to Clang+ThinLTO
in versions before 13, so we can just decide not to support that config.
- add a patch to use non-function type symbols for the trampolines, to
ensure that taking the address gives us the trampoline itself rather
than the address of a CFI jump table entry that branches to it.
Changes since v4:
- add preparatory patch to address generic CFI/LTO issues with static
calls
- add comment to patch #2 describing the trampoline layout
- add handling of Clang CFI jump table entries
- add PeterZ's ack to patch #2
Cc: Mark Rutland <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Sami Tolvanen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Steven Rostedt <[email protected]>
Ard Biesheuvel (2):
static_call: use non-function types to refer to the trampolines
arm64: implement support for static call trampolines
arch/arm64/Kconfig | 2 +
arch/arm64/include/asm/static_call.h | 40 ++++++++++
arch/arm64/kernel/patching.c | 77 +++++++++++++++++++-
arch/arm64/kernel/vmlinux.lds.S | 1 +
include/linux/static_call.h | 4 +-
include/linux/static_call_types.h | 11 ++-
6 files changed, 127 insertions(+), 8 deletions(-)
create mode 100644 arch/arm64/include/asm/static_call.h
--
2.30.2
In order to prevent CFI enabled code from grabbing a jump table entry
that jumps to the trampoline, rather than the trampoline itself, use an
incomplete non-function type for the trampoline, and cast it to the
right type only when invoking it.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
include/linux/static_call.h | 4 ++--
include/linux/static_call_types.h | 11 ++++++++---
2 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 3e56a9751c06..616607393273 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -151,7 +151,7 @@ extern void arch_static_call_transform(void *site, void *tramp, void *func, bool
#define static_call_update(name, func) \
({ \
- typeof(&STATIC_CALL_TRAMP(name)) __F = (func); \
+ typeof(&STATIC_CALL_TYPE(name)) __F = (func); \
__static_call_update(&STATIC_CALL_KEY(name), \
STATIC_CALL_TRAMP_ADDR(name), __F); \
})
@@ -306,7 +306,7 @@ static inline void __static_call_nop(void) { }
void *func = READ_ONCE(STATIC_CALL_KEY(name).func); \
if (!func) \
func = &__static_call_nop; \
- (typeof(STATIC_CALL_TRAMP(name))*)func; \
+ (typeof(&STATIC_CALL_TYPE(name)))func; \
})
#define static_call_cond(name) (void)__static_call_cond(name)
diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
index 5a00b8b2cf9f..5e658ef537e4 100644
--- a/include/linux/static_call_types.h
+++ b/include/linux/static_call_types.h
@@ -18,6 +18,9 @@
#define STATIC_CALL_TRAMP(name) __PASTE(STATIC_CALL_TRAMP_PREFIX, name)
#define STATIC_CALL_TRAMP_STR(name) __stringify(STATIC_CALL_TRAMP(name))
+#define STATIC_CALL_TYPE_PREFIX __SCtype__
+#define STATIC_CALL_TYPE(name) __PASTE(STATIC_CALL_TYPE_PREFIX, name)
+
/*
* Flags in the low bits of static_call_site::key.
*/
@@ -36,11 +39,13 @@ struct static_call_site {
#define DECLARE_STATIC_CALL(name, func) \
extern struct static_call_key STATIC_CALL_KEY(name); \
- extern typeof(func) STATIC_CALL_TRAMP(name);
+ extern struct static_call_tramp STATIC_CALL_TRAMP(name); \
+ extern typeof(func) STATIC_CALL_TYPE(name)
#ifdef CONFIG_HAVE_STATIC_CALL
-#define __raw_static_call(name) (&STATIC_CALL_TRAMP(name))
+#define __raw_static_call(name) \
+ ((typeof(&STATIC_CALL_TYPE(name)))&STATIC_CALL_TRAMP(name))
#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
@@ -96,7 +101,7 @@ struct static_call_key {
};
#define static_call(name) \
- ((typeof(STATIC_CALL_TRAMP(name))*)(STATIC_CALL_KEY(name).func))
+ ((typeof(&STATIC_CALL_TYPE(name)))(STATIC_CALL_KEY(name).func))
#endif /* CONFIG_HAVE_STATIC_CALL */
--
2.30.2
Implement arm64 support for the 'unoptimized' static call variety, which
routes all calls through a single trampoline that is patched to perform a
tail call to the selected function.
It is expected that the direct branch instruction will be able to cover
the common case. However, given that static call targets may be located
in modules loaded out of direct branching range, we need a fallback path
that loads the address into R16 and uses a branch-to-register (BR)
instruction to perform an indirect call.
Unlike on x86, there is no pressing need on arm64 to avoid indirect
calls at all cost, but hiding it from the compiler as is done here does
have some benefits:
- the literal is located in .text, which gives us the same robustness
advantage that code patching does;
- no performance hit on CFI enabled Clang builds that decorate compiler
emitted indirect calls with branch target validity checks.
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/Kconfig | 2 +
arch/arm64/include/asm/static_call.h | 40 ++++++++++
arch/arm64/kernel/patching.c | 77 +++++++++++++++++++-
arch/arm64/kernel/vmlinux.lds.S | 1 +
4 files changed, 117 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 176d6fddc4f2..ccc33b85769c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -193,6 +193,8 @@ config ARM64
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_POSIX_CPU_TIMERS_TASK_WORK
+ # https://github.com/ClangBuiltLinux/linux/issues/1354
+ select HAVE_STATIC_CALL if !LTO_CLANG_THIN || CLANG_VERSION >= 130000
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUTEX_CMPXCHG if FUTEX
select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
new file mode 100644
index 000000000000..6ee918991510
--- /dev/null
+++ b/arch/arm64/include/asm/static_call.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+/*
+ * The sequence below is laid out in a way that guarantees that the literal and
+ * the instruction are always covered by the same cacheline, and can be updated
+ * using a single store-pair instruction (provided that we rewrite the BTI C
+ * instruction as well). This means the literal and the instruction are always
+ * in sync when observed via the D-side.
+ *
+ * However, this does not guarantee that the I-side will catch up immediately
+ * as well: until the I-cache maintenance completes, CPUs may branch to the old
+ * target, or execute a stale NOP or RET. We deal with this by writing the
+ * literal unconditionally, even if it is 0x0 or the branch is in range. That
+ * way, a stale NOP will fall through and call the new target via an indirect
+ * call. Stale RETs or Bs will be taken as before, and branch to the old
+ * target.
+ */
+#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn) \
+ asm(" .pushsection .static_call.text, \"ax\" \n" \
+ " .align 4 \n" \
+ " .globl " STATIC_CALL_TRAMP_STR(name) " \n" \
+ "0: .quad 0x0 \n" \
+ STATIC_CALL_TRAMP_STR(name) ": \n" \
+ " hint 34 /* BTI C */ \n" \
+ insn " \n" \
+ " ldr x16, 0b \n" \
+ " cbz x16, 1f \n" \
+ " br x16 \n" \
+ "1: ret \n" \
+ " .popsection \n")
+
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func) \
+ __ARCH_DEFINE_STATIC_CALL_TRAMP(name, "b " #func)
+
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
+ __ARCH_DEFINE_STATIC_CALL_TRAMP(name, "ret")
+
+#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 771f543464e0..a265a87d4d9e 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -3,6 +3,7 @@
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/spinlock.h>
+#include <linux/static_call.h>
#include <linux/stop_machine.h>
#include <linux/uaccess.h>
@@ -66,7 +67,7 @@ int __kprobes aarch64_insn_read(void *addr, u32 *insnp)
return ret;
}
-static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
+static int __kprobes __aarch64_insn_write(void *addr, void *insn, int size)
{
void *waddr = addr;
unsigned long flags = 0;
@@ -75,7 +76,7 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
raw_spin_lock_irqsave(&patch_lock, flags);
waddr = patch_map(addr, FIX_TEXT_POKE0);
- ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);
+ ret = copy_to_kernel_nofault(waddr, insn, size);
patch_unmap(FIX_TEXT_POKE0);
raw_spin_unlock_irqrestore(&patch_lock, flags);
@@ -85,7 +86,77 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
int __kprobes aarch64_insn_write(void *addr, u32 insn)
{
- return __aarch64_insn_write(addr, cpu_to_le32(insn));
+ __le32 i = cpu_to_le32(insn);
+
+ return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
+}
+
+static void *strip_cfi_jt(void *addr)
+{
+ if (IS_ENABLED(CONFIG_CFI_CLANG)) {
+ void *p = addr;
+ u32 insn;
+
+ /*
+ * Taking the address of a function produces the address of the
+ * jump table entry when Clang CFI is enabled. Such entries are
+ * ordinary jump instructions, preceded by a BTI C instruction
+ * if BTI is enabled for the kernel.
+ */
+ if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
+ p += 4;
+
+ insn = le32_to_cpup(p);
+ if (aarch64_insn_is_b(insn))
+ return p + aarch64_get_branch_offset(insn);
+
+ WARN_ON(1);
+ }
+ return addr;
+}
+
+void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
+{
+ /*
+ * -0x8 <literal>
+ * 0x0 bti c <--- trampoline entry point
+ * 0x4 <branch or nop>
+ * 0x8 ldr x16, <literal>
+ * 0xc cbz x16, 20
+ * 0x10 br x16
+ * 0x14 ret
+ */
+ struct {
+ u64 literal;
+ __le32 insn[2];
+ } insns;
+ u32 insn;
+ int ret;
+
+ insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_BTIC);
+ insns.literal = (u64)func;
+ insns.insn[0] = cpu_to_le32(insn);
+
+ if (!func) {
+ insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
+ AARCH64_INSN_BRANCH_RETURN);
+ } else {
+ insn = aarch64_insn_gen_branch_imm((u64)tramp + 4,
+ (u64)strip_cfi_jt(func),
+ AARCH64_INSN_BRANCH_NOLINK);
+
+ /*
+ * Use a NOP if the branch target is out of range, and rely on
+ * the indirect call instead.
+ */
+ if (insn == AARCH64_BREAK_FAULT)
+ insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);
+ }
+ insns.insn[1] = cpu_to_le32(insn);
+
+ ret = __aarch64_insn_write(tramp - 8, &insns, sizeof(insns));
+ if (!WARN_ON(ret))
+ caches_clean_inval_pou((u64)tramp - 8, sizeof(insns));
}
int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 50bab186c49b..e16860a14eaf 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -173,6 +173,7 @@ SECTIONS
HIBERNATE_TEXT
KEXEC_TEXT
TRAMP_TEXT
+ STATIC_CALL_TEXT
*(.gnu.warning)
. = ALIGN(16);
*(.got) /* Global offset table */
--
2.30.2
On Mon, 8 Nov 2021 at 11:23, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Nov 05, 2021 at 03:59:17PM +0100, Ard Biesheuvel wrote:
> > diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
> > new file mode 100644
> > index 000000000000..6ee918991510
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/static_call.h
> > @@ -0,0 +1,40 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_STATIC_CALL_H
> > +#define _ASM_STATIC_CALL_H
> > +
> > +/*
> > + * The sequence below is laid out in a way that guarantees that the literal and
> > + * the instruction are always covered by the same cacheline, and can be updated
> > + * using a single store-pair instruction (provided that we rewrite the BTI C
> > + * instruction as well). This means the literal and the instruction are always
> > + * in sync when observed via the D-side.
> > + *
> > + * However, this does not guarantee that the I-side will catch up immediately
> > + * as well: until the I-cache maintenance completes, CPUs may branch to the old
> > + * target, or execute a stale NOP or RET. We deal with this by writing the
> > + * literal unconditionally, even if it is 0x0 or the branch is in range. That
> > + * way, a stale NOP will fall through and call the new target via an indirect
> > + * call. Stale RETs or Bs will be taken as before, and branch to the old
> > + * target.
> > + */
>
> Thanks for the comment!
>
>
> > diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
> > index 771f543464e0..a265a87d4d9e 100644
> > --- a/arch/arm64/kernel/patching.c
> > +++ b/arch/arm64/kernel/patching.c
>
>
> > +static void *strip_cfi_jt(void *addr)
> > +{
> > + if (IS_ENABLED(CONFIG_CFI_CLANG)) {
> > + void *p = addr;
> > + u32 insn;
> > +
> > + /*
> > + * Taking the address of a function produces the address of the
> > + * jump table entry when Clang CFI is enabled. Such entries are
> > + * ordinary jump instructions, preceded by a BTI C instruction
> > + * if BTI is enabled for the kernel.
> > + */
> > + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> > + p += 4;
>
> Perhaps:
> if (aarch64_insn_is_bti(le32_to_cpup(p)))
That instruction does not exist yet, and it begs the question which
type of BTI instruction we want to detect.
> p += 4;
>
> Perhapser still, add:
> else
> WARN_ON(IS_ENABLED(CONFIG_ARM64_BTI_KERNEL));
>
There's already a WARN() below that will trigger and return the
original address if the entry did not have the expected layout, which
means a direct branch at offset 0x0 or 0x4 depending on whether BTI is
on.
So I could add a WARN() here as well, but I'd prefer to keep the one
at the bottom, which makes the one here slightly redundant.
> > +
> > + insn = le32_to_cpup(p);
> > + if (aarch64_insn_is_b(insn))
> > + return p + aarch64_get_branch_offset(insn);
> > +
> > + WARN_ON(1);
> > + }
> > + return addr;
> > +}
>
> Also, can this please have a comment decrying the lack of built-in for
> this?
Sure.
On Fri, Nov 05, 2021 at 03:59:16PM +0100, Ard Biesheuvel wrote:
> In order to prevent CFI enabled code from grabbing a jump table entry
> that jumps to the trampoline, rather than the trampoline itself, use an
> incomplete non-function type for the trampoline, and cast it to the
> right type only when invoking it.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
Very grudgingly:
Acked-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> include/linux/static_call.h | 4 ++--
> include/linux/static_call_types.h | 11 ++++++++---
> 2 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/static_call.h b/include/linux/static_call.h
> index 3e56a9751c06..616607393273 100644
> --- a/include/linux/static_call.h
> +++ b/include/linux/static_call.h
> @@ -151,7 +151,7 @@ extern void arch_static_call_transform(void *site, void *tramp, void *func, bool
>
> #define static_call_update(name, func) \
> ({ \
> - typeof(&STATIC_CALL_TRAMP(name)) __F = (func); \
> + typeof(&STATIC_CALL_TYPE(name)) __F = (func); \
> __static_call_update(&STATIC_CALL_KEY(name), \
> STATIC_CALL_TRAMP_ADDR(name), __F); \
> })
> @@ -306,7 +306,7 @@ static inline void __static_call_nop(void) { }
> void *func = READ_ONCE(STATIC_CALL_KEY(name).func); \
> if (!func) \
> func = &__static_call_nop; \
> - (typeof(STATIC_CALL_TRAMP(name))*)func; \
> + (typeof(&STATIC_CALL_TYPE(name)))func; \
> })
>
> #define static_call_cond(name) (void)__static_call_cond(name)
> diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
> index 5a00b8b2cf9f..5e658ef537e4 100644
> --- a/include/linux/static_call_types.h
> +++ b/include/linux/static_call_types.h
> @@ -18,6 +18,9 @@
> #define STATIC_CALL_TRAMP(name) __PASTE(STATIC_CALL_TRAMP_PREFIX, name)
> #define STATIC_CALL_TRAMP_STR(name) __stringify(STATIC_CALL_TRAMP(name))
>
> +#define STATIC_CALL_TYPE_PREFIX __SCtype__
> +#define STATIC_CALL_TYPE(name) __PASTE(STATIC_CALL_TYPE_PREFIX, name)
> +
> /*
> * Flags in the low bits of static_call_site::key.
> */
> @@ -36,11 +39,13 @@ struct static_call_site {
>
> #define DECLARE_STATIC_CALL(name, func) \
> extern struct static_call_key STATIC_CALL_KEY(name); \
> - extern typeof(func) STATIC_CALL_TRAMP(name);
> + extern struct static_call_tramp STATIC_CALL_TRAMP(name); \
> + extern typeof(func) STATIC_CALL_TYPE(name)
>
> #ifdef CONFIG_HAVE_STATIC_CALL
>
> -#define __raw_static_call(name) (&STATIC_CALL_TRAMP(name))
> +#define __raw_static_call(name) \
> + ((typeof(&STATIC_CALL_TYPE(name)))&STATIC_CALL_TRAMP(name))
>
> #ifdef CONFIG_HAVE_STATIC_CALL_INLINE
>
> @@ -96,7 +101,7 @@ struct static_call_key {
> };
>
> #define static_call(name) \
> - ((typeof(STATIC_CALL_TRAMP(name))*)(STATIC_CALL_KEY(name).func))
> + ((typeof(&STATIC_CALL_TYPE(name)))(STATIC_CALL_KEY(name).func))
>
> #endif /* CONFIG_HAVE_STATIC_CALL */
>
> --
> 2.30.2
>
On Fri, Nov 05, 2021 at 03:59:17PM +0100, Ard Biesheuvel wrote:
> diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
> new file mode 100644
> index 000000000000..6ee918991510
> --- /dev/null
> +++ b/arch/arm64/include/asm/static_call.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_STATIC_CALL_H
> +#define _ASM_STATIC_CALL_H
> +
> +/*
> + * The sequence below is laid out in a way that guarantees that the literal and
> + * the instruction are always covered by the same cacheline, and can be updated
> + * using a single store-pair instruction (provided that we rewrite the BTI C
> + * instruction as well). This means the literal and the instruction are always
> + * in sync when observed via the D-side.
> + *
> + * However, this does not guarantee that the I-side will catch up immediately
> + * as well: until the I-cache maintenance completes, CPUs may branch to the old
> + * target, or execute a stale NOP or RET. We deal with this by writing the
> + * literal unconditionally, even if it is 0x0 or the branch is in range. That
> + * way, a stale NOP will fall through and call the new target via an indirect
> + * call. Stale RETs or Bs will be taken as before, and branch to the old
> + * target.
> + */
Thanks for the comment!
> diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
> index 771f543464e0..a265a87d4d9e 100644
> --- a/arch/arm64/kernel/patching.c
> +++ b/arch/arm64/kernel/patching.c
> +static void *strip_cfi_jt(void *addr)
> +{
> + if (IS_ENABLED(CONFIG_CFI_CLANG)) {
> + void *p = addr;
> + u32 insn;
> +
> + /*
> + * Taking the address of a function produces the address of the
> + * jump table entry when Clang CFI is enabled. Such entries are
> + * ordinary jump instructions, preceded by a BTI C instruction
> + * if BTI is enabled for the kernel.
> + */
> + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> + p += 4;
Perhaps:
if (aarch64_insn_is_bti(le32_to_cpup(p)))
p += 4;
Perhapser still, add:
else
WARN_ON(IS_ENABLED(CONFIG_ARM64_BTI_KERNEL));
> +
> + insn = le32_to_cpup(p);
> + if (aarch64_insn_is_b(insn))
> + return p + aarch64_get_branch_offset(insn);
> +
> + WARN_ON(1);
> + }
> + return addr;
> +}
Also, can this please have a comment decrying the lack of built-in for
this?
On Mon, Nov 08, 2021 at 12:29:04PM +0100, Ard Biesheuvel wrote:
> On Mon, 8 Nov 2021 at 11:23, Peter Zijlstra <[email protected]> wrote:
> > > +static void *strip_cfi_jt(void *addr)
> > > +{
> > > + if (IS_ENABLED(CONFIG_CFI_CLANG)) {
> > > + void *p = addr;
> > > + u32 insn;
> > > +
> > > + /*
> > > + * Taking the address of a function produces the address of the
> > > + * jump table entry when Clang CFI is enabled. Such entries are
> > > + * ordinary jump instructions, preceded by a BTI C instruction
> > > + * if BTI is enabled for the kernel.
> > > + */
> > > + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> > > + p += 4;
> >
> > Perhaps:
> > if (aarch64_insn_is_bti(le32_to_cpup(p)))
>
> That instruction does not exist yet, and it begs the question which
> type of BTI instruction we want to detect.
Yeah, I actually checked, but I figured the intent was clear enough. I
figured all of them?
> > p += 4;
> >
> > Perhapser still, add:
> > else
> > WARN_ON(IS_ENABLED(CONFIG_ARM64_BTI_KERNEL));
> >
>
> There's already a WARN() below that will trigger and return the
> original address if the entry did not have the expected layout, which
> means a direct branch at offset 0x0 or 0x4 depending on whether BTI is
> on.
>
> So I could add a WARN() here as well, but I'd prefer to keep the one
> at the bottom, which makes the one here slightly redundant.
Sure, that works. The slightly more paranoid me would tell you that the
code as is might match something you didn't want it to.
Eg. without the extra WARN, you could accidentally match a B instruction
without BTI on a BTI kernel build. Or your initial version could even
match:
RET;
B ponies;
on a BTI kernel.
My point being that since we're not exactly sure what a future compiler
will generate for us here, we'd best be maximally paranoid about what
we're willing to accept.
> > > +
> > > + insn = le32_to_cpup(p);
> > > + if (aarch64_insn_is_b(insn))
> > > + return p + aarch64_get_branch_offset(insn);
> > > +
> > > + WARN_ON(1);
> > > + }
> > > + return addr;
> > > +}
> >
> > Also, can this please have a comment decrying the lack of built-in for
> > this?
>
> Sure.
Which ties in with that. Once it's a built-in, we can be sure the
compiler knows what it needs to do to undo its own magic.
On Tue, 9 Nov 2021 at 18:55, Mark Rutland <[email protected]> wrote:
>
> Hi Ard,
>
> On Fri, Nov 05, 2021 at 03:59:17PM +0100, Ard Biesheuvel wrote:
> > +static void *strip_cfi_jt(void *addr)
> > +{
> > + if (IS_ENABLED(CONFIG_CFI_CLANG)) {
> > + void *p = addr;
> > + u32 insn;
> > +
> > + /*
> > + * Taking the address of a function produces the address of the
> > + * jump table entry when Clang CFI is enabled. Such entries are
> > + * ordinary jump instructions, preceded by a BTI C instruction
> > + * if BTI is enabled for the kernel.
> > + */
> > + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> > + p += 4;
> > +
> > + insn = le32_to_cpup(p);
> > + if (aarch64_insn_is_b(insn))
> > + return p + aarch64_get_branch_offset(insn);
> > +
> > + WARN_ON(1);
> > + }
> > + return addr;
> > +}
>
> I'm somewhat uncomfortable with this, because it seems like the compiler could
> easily violate our expectations in future, and then we're in for a massive
> headache. I assume clang doesn't provide any guarantee as to the format of the
> jump table entries (and e.g. I can see scope for branch padding breaking this).
>
> In trying to sidestep that I ended up with:
>
> https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
> ... which I think is a good option for PREEMPT_DYNAMIC, but I don't know if
> there were other places where we believe static calls would be critical for
> performance rather than a nice-to-have, and whether we truly need static calls
> on arm64. My mind is leaning towards "avoid if reasonable" at the moment (or at
> least make that mutually exclusive with CFI so we can avoid that specific fun).
>
> I see you had at least one other user in:
>
> https://lore.kernel.org/r/[email protected]
>
> ... what were your thoughts on the criticality of that?
>
That particular use case does not rely on static calls being fast at
all, so there it doesn't really matter which variety we implement. The
reason I sent it out today is because it gives some test coverage for
static calls used in a way that the API as designed should support,
but which turned out to be slightly broken in practice.
> FWIW other than the above this looks good to me. My major concern here is
> fragility/maintenance, and secondly whether we're gaining much in practice. So
> if you think we really need this, I'm not going to stand in the way.
>
Android relies heavily on tracepoints for vendor hooks, and given the
performance impact of CFI on indirect calls, there has been interest
in enabling static calls to replace them.
Quentin, anything to add here?
Hi Ard,
On Fri, Nov 05, 2021 at 03:59:17PM +0100, Ard Biesheuvel wrote:
> +static void *strip_cfi_jt(void *addr)
> +{
> + if (IS_ENABLED(CONFIG_CFI_CLANG)) {
> + void *p = addr;
> + u32 insn;
> +
> + /*
> + * Taking the address of a function produces the address of the
> + * jump table entry when Clang CFI is enabled. Such entries are
> + * ordinary jump instructions, preceded by a BTI C instruction
> + * if BTI is enabled for the kernel.
> + */
> + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> + p += 4;
> +
> + insn = le32_to_cpup(p);
> + if (aarch64_insn_is_b(insn))
> + return p + aarch64_get_branch_offset(insn);
> +
> + WARN_ON(1);
> + }
> + return addr;
> +}
I'm somewhat uncomfortable with this, because it seems like the compiler could
easily violate our expectations in future, and then we're in for a massive
headache. I assume clang doesn't provide any guarantee as to the format of the
jump table entries (and e.g. I can see scope for branch padding breaking this).
In trying to sidestep that I ended up with:
https://lore.kernel.org/linux-arm-kernel/[email protected]/
... which I think is a good option for PREEMPT_DYNAMIC, but I don't know if
there were other places where we believe static calls would be critical for
performance rather than a nice-to-have, and whether we truly need static calls
on arm64. My mind is leaning towards "avoid if reasonable" at the moment (or at
least make that mutually exclusive with CFI so we can avoid that specific fun).
I see you had at least one other user in:
https://lore.kernel.org/r/[email protected]
... what were your thoughts on the criticality of that?
FWIW other than the above this looks good to me. My major concern here is
fragility/maintenance, and secondly whether we're gaining much in practice. So
if you think we really need this, I'm not going to stand in the way.
Thanks
Mark.
> +void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
> +{
> + /*
> + * -0x8 <literal>
> + * 0x0 bti c <--- trampoline entry point
> + * 0x4 <branch or nop>
> + * 0x8 ldr x16, <literal>
> + * 0xc cbz x16, 20
> + * 0x10 br x16
> + * 0x14 ret
> + */
> + struct {
> + u64 literal;
> + __le32 insn[2];
> + } insns;
> + u32 insn;
> + int ret;
> +
> + insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_BTIC);
> + insns.literal = (u64)func;
> + insns.insn[0] = cpu_to_le32(insn);
> +
> + if (!func) {
> + insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
> + AARCH64_INSN_BRANCH_RETURN);
> + } else {
> + insn = aarch64_insn_gen_branch_imm((u64)tramp + 4,
> + (u64)strip_cfi_jt(func),
> + AARCH64_INSN_BRANCH_NOLINK);
> +
> + /*
> + * Use a NOP if the branch target is out of range, and rely on
> + * the indirect call instead.
> + */
> + if (insn == AARCH64_BREAK_FAULT)
> + insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);
> + }
> + insns.insn[1] = cpu_to_le32(insn);
> +
> + ret = __aarch64_insn_write(tramp - 8, &insns, sizeof(insns));
> + if (!WARN_ON(ret))
> + caches_clean_inval_pou((u64)tramp - 8, sizeof(insns));
> }
>
> int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 50bab186c49b..e16860a14eaf 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -173,6 +173,7 @@ SECTIONS
> HIBERNATE_TEXT
> KEXEC_TEXT
> TRAMP_TEXT
> + STATIC_CALL_TEXT
> *(.gnu.warning)
> . = ALIGN(16);
> *(.got) /* Global offset table */
> --
> 2.30.2
>
On Tuesday 09 Nov 2021 at 19:09:21 (+0100), Ard Biesheuvel wrote:
> On Tue, 9 Nov 2021 at 18:55, Mark Rutland <[email protected]> wrote:
> >
> > Hi Ard,
> >
> > On Fri, Nov 05, 2021 at 03:59:17PM +0100, Ard Biesheuvel wrote:
> > > +static void *strip_cfi_jt(void *addr)
> > > +{
> > > + if (IS_ENABLED(CONFIG_CFI_CLANG)) {
> > > + void *p = addr;
> > > + u32 insn;
> > > +
> > > + /*
> > > + * Taking the address of a function produces the address of the
> > > + * jump table entry when Clang CFI is enabled. Such entries are
> > > + * ordinary jump instructions, preceded by a BTI C instruction
> > > + * if BTI is enabled for the kernel.
> > > + */
> > > + if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
> > > + p += 4;
> > > +
> > > + insn = le32_to_cpup(p);
> > > + if (aarch64_insn_is_b(insn))
> > > + return p + aarch64_get_branch_offset(insn);
> > > +
> > > + WARN_ON(1);
> > > + }
> > > + return addr;
> > > +}
> >
> > I'm somewhat uncomfortable with this, because it seems like the compiler could
> > easily violate our expectations in future, and then we're in for a massive
> > headache. I assume clang doesn't provide any guarantee as to the format of the
> > jump table entries (and e.g. I can see scope for branch padding breaking this).
> >
> > In trying to sidestep that I ended up with:
> >
> > https://lore.kernel.org/linux-arm-kernel/[email protected]/
> >
> > ... which I think is a good option for PREEMPT_DYNAMIC, but I don't know if
> > there were other places where we believe static calls would be critical for
> > performance rather than a nice-to-have, and whether we truly need static calls
> > on arm64. My mind is leaning towards "avoid if reasonable" at the moment (or at
> > least make that mutually exclusive with CFI so we can avoid that specific fun).
> >
> > I see you had at least one other user in:
> >
> > https://lore.kernel.org/r/[email protected]
> >
> > ... what were your thoughts on the criticality of that?
> >
>
> That particular use case does not rely on static calls being fast at
> all, so there it doesn't really matter which variety we implement. The
> reason I sent it out today is because it gives some test coverage for
> static calls used in a way that the API as designed should support,
> but which turned out to be slightly broken in practice.
>
> > FWIW other than the above this looks good to me. My major concern here is
> > fragility/maintenance, and secondly whether we're gaining much in practice. So
> > if you think we really need this, I'm not going to stand in the way.
> >
>
> Android relies heavily on tracepoints for vendor hooks, and given the
> performance impact of CFI on indirect calls, there has been interest
> in enabling static calls to replace them.
>
> Quentin, anything to add here?
Yes, Android should definitely benefit from static calls.
Modules attaching to tracepoints cause a measurable overhead w/ CFI as
the jump target is a bit harder to verify if it is not in-kernel. But
sadly that's a common pattern for GKI. The current 'workaround' in
Android has been to just plain disable CFI around all tracepoints in the
kernel, which is a bit sad from a security PoV. But there was really no
other option at the time, and we needed the performance back. Static
calls would be a far superior solution as they would avoid much of the
CFI overhead, and are not vulnerable in the CFI sense (that is, the
branch target can't be easily overridden with a random OOB write from a
dodgy driver). So yes, we'd really like to have those please :)
Thanks,
Quentin
Hi,
On Tue, Nov 09, 2021 at 07:02:21PM +0000, Quentin Perret wrote:
> On Tuesday 09 Nov 2021 at 19:09:21 (+0100), Ard Biesheuvel wrote:
> > Android relies heavily on tracepoints for vendor hooks, and given the
> > performance impact of CFI on indirect calls, there has been interest
> > in enabling static calls to replace them.
Hhmm.... what exactly is a "vendor hook" in this context, and what is it doing
with a tracepoint? From an upstream perspective that sounds like somewhat
fishy usage.
> > Quentin, anything to add here?
>
> Yes, Android should definitely benefit from static calls.
>
> Modules attaching to tracepoints cause a measurable overhead w/ CFI as
> the jump target is a bit harder to verify if it is not in-kernel.
Where does that additional overhead come from when the target is not in-kernel?
I hope that I am wrong in understanding that __cfi_slowpath_diag() means we're
always doing an out-of-line check when calling into a module?
If that were the case, that would seem to be a much more general problem with
the current clang CFI scheme, and my fear here is that we're adding fragility
and complexity in specific places to work around general problems with the CFI
scheme.
Thanks,
Mark.
> But sadly that's a common pattern for GKI. The current 'workaround' in
> Android has been to just plain disable CFI around all tracepoints in the
> kernel, which is a bit sad from a security PoV. But there was really no other
> option at the time, and we needed the performance back. Static calls would be
> a far superior solution as they would avoid much of the CFI overhead, and are
> not vulnerable in the CFI sense (that is, the branch target can't be easily
> overridden with a random OOB write from a dodgy driver). So yes, we'd really
> like to have those please :)
>
> Thanks,
> Quentin
Hi Mark,
On Wednesday 10 Nov 2021 at 11:09:40 (+0000), Mark Rutland wrote:
> Hi,
>
> On Tue, Nov 09, 2021 at 07:02:21PM +0000, Quentin Perret wrote:
> > On Tuesday 09 Nov 2021 at 19:09:21 (+0100), Ard Biesheuvel wrote:
> > > Android relies heavily on tracepoints for vendor hooks, and given the
> > > performance impact of CFI on indirect calls, there has been interest
> > > in enabling static calls to replace them.
>
> Hhmm.... what exactly is a "vendor hook" in this context, and what is it doing
> with a tracepoint? From an upstream perspective that sounds like somewhat fishy
> usage.
Right, 'vendor hooks' are an ugly Android-specific hack that I hope we
will be able to get rid of over time. And I don't think upstream should
care about any of this TBH. But it's not the only use-case in Android
for having modules attached to tracepoints, and the other one is a bit
more relevant to upstream. So I'd say it makes sense to have that
discussion here.
Specifically, we've got a bunch of 'empty' tracepoints *upstream* in e.g.
the scheduler that don't have any trace events associated with them (see
the exported TPs at the top of kernel/sched/core.c for instance).
They're exported with no in-kernel user on purpose. The only reason they
exist is to allow people to attach modules to them, and do whatever they
need from there (collect stats, write to the trace buffer, ...). That
way the kernel doesn't commit to any userspace ABI, and the maintenance
burden falls on whoever maintains the module instead.
But nowadays virtually every vendor/OEM in the Android world attaches to
those TPs, in production, to gather stats and whatnot. And given that
some of them are hooked in scheduler hot paths, we'd really like those
to be low overhead. I wouldn't be surprised if other distros run into
the same issues at some point FWIW -- they all collect SCHED_DEBUG stats
and such.
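To make the pattern concrete, here is a rough sketch of a module probe
attaching to one of those bare scheduler tracepoints (pelt_cfs_tp is one
of the TPs exported from kernel/sched/core.c; the probe body and module
boilerplate are purely illustrative):

```c
// Sketch of a module attaching to a bare (no-trace-event) tracepoint.
// The probe body is illustrative; real users collect stats or write to
// the trace buffer from here.
#include <linux/module.h>
#include <linux/tracepoint.h>
#include <trace/events/sched.h>

static u64 nr_updates;

/* Probes receive the registered data pointer followed by the TP args. */
static void probe_pelt_cfs(void *data, struct cfs_rq *cfs_rq)
{
	nr_updates++;
}

static int __init stats_init(void)
{
	return register_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
}

static void __exit stats_exit(void)
{
	unregister_trace_pelt_cfs_tp(probe_pelt_cfs, NULL);
	/* make sure no probe is still in flight before unload */
	tracepoint_synchronize_unregister();
}

module_init(stats_init);
module_exit(stats_exit);
MODULE_LICENSE("GPL");
```

With CFI enabled, every firing of the tracepoint pays an indirect-call
check on probe_pelt_cfs, even though the target only changes at module
load/unload time.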
>
> > > Quentin, anything to add here?
> >
> > Yes, Android should definitely benefit from static calls.
> >
> > Modules attaching to tracepoints cause a measurable overhead w/ CFI as
> > the jump target is a bit harder to verify if it is not in-kernel.
>
> Where does that additional overhead come from when the target is not in-kernel?
>
> I hope that I am wrong in understanding that __cfi_slowpath_diag() means we're
> always doing an out-of-line check when calling into a module?
Nope, I think you're right.
> If that were the case, that would seem to be a much more general problem with
> the current clang CFI scheme, and my fear here is that we're adding fragility
> and complexity in specific places to work around general problems with the CFI
> scheme.
Right, no objection from me if we want to optimize the CFI slowpath
instead if we can find a way to do that.
A few thoughts:
- attaching and detaching to TPs is a very infrequent operation, so
having to do CFI checks (however cheap) before every call is a bit
sad as the target doesn't change;
- so far the CFI overhead has been visible in practice mainly for
tracepoints and not really anywhere else. The cost of
kernel-to-module indirect calls for e.g. driver operations seems to
often (though not always) be somewhat small compared to the work
done by the driver itself. And I think that module-to-kernel calls
should be mostly unaffected as we will either resolve them with
PC-relative instructions if within range, or via the module PLT
which doesn't include CFI checks IIRC.
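For reference, this is roughly the shape such a hook would take with the
static call API (sched_stats_hook and the helpers are made-up names, a
sketch only): the hot path becomes a patchable direct branch, and the
CFI-relevant work moves to the infrequent attach/detach path.

```c
#include <linux/static_call.h>

struct cfs_rq;

static void hook_stub(struct cfs_rq *cfs_rq) { }

/* Trampoline initially branches to the stub. */
DEFINE_STATIC_CALL(sched_stats_hook, hook_stub);

/* Hot path: a direct branch to the current target, no CFI check. */
static inline void fire_hook(struct cfs_rq *cfs_rq)
{
	static_call(sched_stats_hook)(cfs_rq);
}

/* Slow path, e.g. on module attach: patch the trampoline in place. */
static void set_hook(void (*fn)(struct cfs_rq *))
{
	static_call_update(sched_stats_hook, fn);
}
```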
Thanks,
Quentin