2021-09-21 03:46:37

by Frederic Weisbecker

Subject: [PATCH 0/4] arm64: Support dynamic preemption

Traditionally the preemption flavour was chosen at Kconfig time and then
fixed in stone. Now, with CONFIG_PREEMPT_DYNAMIC, users can override that
choice at boot with the "preempt=" option (and also through debugfs, but
that's a secret).
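
For example, with a kernel built with CONFIG_PREEMPT_DYNAMIC, the boot
command line selects the flavour (mode names as documented for the
existing x86 implementation):

    preempt=none
    preempt=voluntary
    preempt=full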

Linux distros should be particularly fond of this because it lets them
ship a single kernel image for all preemption flavours.

So far x86 has been the only supported architecture, but interest is
broader.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
preempt/arm

HEAD: 351eaa68b5304b8b0e7c6e7b4470dd917475e65e

Thanks,
Frederic
---

Frederic Weisbecker (3):
sched/preempt: Prepare for supporting !CONFIG_GENERIC_ENTRY dynamic preemption
arm64: Implement IRQ exit preemption static call for dynamic preemption
arm64: Implement HAVE_PREEMPT_DYNAMIC

Ard Biesheuvel (1):
arm64: implement support for static call trampolines


arch/Kconfig | 1 -
arch/arm64/Kconfig | 2 ++
arch/arm64/include/asm/insn.h | 2 ++
arch/arm64/include/asm/preempt.h | 23 ++++++++++++++++++++++-
arch/arm64/include/asm/static_call.h | 28 ++++++++++++++++++++++++++++
arch/arm64/kernel/Makefile | 4 ++--
arch/arm64/kernel/entry-common.c | 15 ++++++++++++---
arch/arm64/kernel/patching.c | 14 +++++++++++---
arch/arm64/kernel/vmlinux.lds.S | 1 +
include/linux/entry-common.h | 3 ++-
kernel/sched/core.c | 6 ++++--
11 files changed, 86 insertions(+), 13 deletions(-)


2021-09-21 03:48:25

by Frederic Weisbecker

Subject: [PATCH 4/4] arm64: Implement HAVE_PREEMPT_DYNAMIC

Provide the static calls for the common preemption points and report
arm64's ability to support dynamic preemption.
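
For context (not part of this patch), the generic CONFIG_PREEMPT_DYNAMIC
code in kernel/sched/core.c owns these call keys and retargets them when
the preemption mode changes. A rough sketch, based on the existing x86
wiring (exact helper names and call sites may differ):

    /* kernel/sched/core.c (sketch) */
    DEFINE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
    DEFINE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);

    /* e.g. "preempt=none": disable the full preemption points ... */
    static_call_update(preempt_schedule, NULL);
    static_call_update(preempt_schedule_notrace, NULL);

    /* ... and "preempt=full": restore the real targets */
    static_call_update(preempt_schedule, __preempt_schedule_func);
    static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);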

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/preempt.h | 20 +++++++++++++++++---
2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5b51b359ccda..e28bcca8954c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -191,6 +191,7 @@ config ARM64
select HAVE_PERF_EVENTS
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select HAVE_PREEMPT_DYNAMIC
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_STATIC_CALL
select HAVE_FUNCTION_ARG_ACCESS_API
diff --git a/arch/arm64/include/asm/preempt.h b/arch/arm64/include/asm/preempt.h
index 4fbbe644532f..69d1cc491d3b 100644
--- a/arch/arm64/include/asm/preempt.h
+++ b/arch/arm64/include/asm/preempt.h
@@ -82,15 +82,29 @@ static inline bool should_resched(int preempt_offset)

#ifdef CONFIG_PREEMPTION
void preempt_schedule(void);
-#define __preempt_schedule() preempt_schedule()
void preempt_schedule_notrace(void);
-#define __preempt_schedule_notrace() preempt_schedule_notrace()
-#endif /* CONFIG_PREEMPTION */

#ifdef CONFIG_PREEMPT_DYNAMIC
+
+#define __preempt_schedule_func preempt_schedule
+DECLARE_STATIC_CALL(preempt_schedule, __preempt_schedule_func);
+#define __preempt_schedule() static_call(preempt_schedule)()
+
+#define __preempt_schedule_notrace_func preempt_schedule_notrace
+DECLARE_STATIC_CALL(preempt_schedule_notrace, __preempt_schedule_notrace_func);
+#define __preempt_schedule_notrace() static_call(preempt_schedule_notrace)()
+
void arm64_preempt_schedule_irq(void);
#define __irqentry_exit_cond_resched_func arm64_preempt_schedule_irq
DECLARE_STATIC_CALL(irqentry_exit_cond_resched, __irqentry_exit_cond_resched_func);
+
+#else /* !CONFIG_PREEMPT_DYNAMIC */
+
+#define __preempt_schedule() preempt_schedule()
+#define __preempt_schedule_notrace() preempt_schedule_notrace()
+
#endif /* CONFIG_PREEMPT_DYNAMIC */

+#endif /* CONFIG_PREEMPTION */
+
#endif /* __ASM_PREEMPT_H */
--
2.25.1

2021-09-21 03:48:44

by Frederic Weisbecker

Subject: [PATCH 2/4] arm64: implement support for static call trampolines

From: Ard Biesheuvel <[email protected]>

[fweisbec: rebased against 5.15-rc2. There have been quite a few changes
on arm64 since then, especially around insn/patching, so some naming may
no longer be accurate]

Implement arm64 support for the 'unoptimized' static call variety, which
routes all calls through a single trampoline that is patched to perform a
tail call to the selected function.

Since static call targets may be located in modules loaded out of direct
branching range, we need an ADRP/LDR pair to load the branch target from a
literal into X16, and a branch-to-register (BR) instruction to perform an
indirect call. Unlike on x86, there is no pressing need on arm64 to avoid
indirect calls at all cost, but hiding the indirection from the compiler as
is done here does have some benefits:
- the literal is located in .rodata, which gives us the same robustness
advantage that code patching does;
- no performance hit on CFI-enabled Clang builds that decorate
compiler-emitted indirect calls with branch target validity checks.
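
For reference, the generic API that these trampolines back (from
include/linux/static_call.h) is used roughly as follows; this is an
illustrative sketch with placeholder names (my_key, my_func,
my_other_func), not part of the patch:

    /* Define a static call key, initially targeting my_func() */
    DEFINE_STATIC_CALL(my_key, my_func);

    /* Call sites branch through the arch trampoline emitted above */
    static_call(my_key)(arg);

    /* Retargeting patches the trampoline to tail-call the new function */
    static_call_update(my_key, my_other_func);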

Signed-off-by: Ard Biesheuvel <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/insn.h | 2 ++
arch/arm64/include/asm/static_call.h | 28 ++++++++++++++++++++++++++++
arch/arm64/kernel/Makefile | 4 ++--
arch/arm64/kernel/patching.c | 14 +++++++++++---
arch/arm64/kernel/vmlinux.lds.S | 1 +
6 files changed, 45 insertions(+), 5 deletions(-)
create mode 100644 arch/arm64/include/asm/static_call.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5c7ae4c3954b..5b51b359ccda 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -192,6 +192,7 @@ config ARM64
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
+ select HAVE_STATIC_CALL
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUTEX_CMPXCHG if FUTEX
select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 6b776c8667b2..681c08b170df 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -547,6 +547,8 @@ u32 aarch64_set_branch_offset(u32 insn, s32 offset);
s32 aarch64_insn_adrp_get_offset(u32 insn);
u32 aarch64_insn_adrp_set_offset(u32 insn, s32 offset);

+int aarch64_literal_write(void *addr, u64 literal);
+
bool aarch32_insn_is_wide(u32 insn);

#define A32_RN_OFFSET 16
diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
new file mode 100644
index 000000000000..665ec2a7cdb2
--- /dev/null
+++ b/arch/arm64/include/asm/static_call.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target) \
+ asm(" .pushsection .static_call.text, \"ax\" \n" \
+ " .align 3 \n" \
+ " .globl " STATIC_CALL_TRAMP_STR(name) " \n" \
+ STATIC_CALL_TRAMP_STR(name) ": \n" \
+ " hint 34 /* BTI C */ \n" \
+ " adrp x16, 1f \n" \
+ " ldr x16, [x16, :lo12:1f] \n" \
+ " cbz x16, 0f \n" \
+ " br x16 \n" \
+ "0: ret \n" \
+ " .popsection \n" \
+ " .pushsection .rodata, \"a\" \n" \
+ " .align 3 \n" \
+ "1: .quad " target " \n" \
+ " .popsection \n")
+
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func) \
+ __ARCH_DEFINE_STATIC_CALL_TRAMP(name, #func)
+
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
+ __ARCH_DEFINE_STATIC_CALL_TRAMP(name, "0x0")
+
+#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 3f1490bfb938..83f03fc1e402 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -28,8 +28,8 @@ obj-y := debug-monitors.o entry.o irq.o fpsimd.o \
return_address.o cpuinfo.o cpu_errata.o \
cpufeature.o alternative.o cacheinfo.o \
smp.o smp_spin_table.o topology.o smccc-call.o \
- syscall.o proton-pack.o idreg-override.o idle.o \
- patching.o
+ syscall.o proton-pack.o static_call.o \
+ idreg-override.o idle.o patching.o

targets += efi-entry.o

diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 771f543464e0..841c0499eca5 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -66,7 +66,7 @@ int __kprobes aarch64_insn_read(void *addr, u32 *insnp)
return ret;
}

-static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
+static int __kprobes __aarch64_insn_write(void *addr, void *insn, int size)
{
void *waddr = addr;
unsigned long flags = 0;
@@ -75,7 +75,7 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
raw_spin_lock_irqsave(&patch_lock, flags);
waddr = patch_map(addr, FIX_TEXT_POKE0);

- ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);
+ ret = copy_to_kernel_nofault(waddr, insn, size);

patch_unmap(FIX_TEXT_POKE0);
raw_spin_unlock_irqrestore(&patch_lock, flags);
@@ -85,7 +85,15 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)

int __kprobes aarch64_insn_write(void *addr, u32 insn)
{
- return __aarch64_insn_write(addr, cpu_to_le32(insn));
+ __le32 i = cpu_to_le32(insn);
+
+ return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
+}
+
+int aarch64_literal_write(void *addr, u64 literal)
+{
+ BUG_ON(!IS_ALIGNED((u64)addr, sizeof(u64)));
+ return __aarch64_insn_write(addr, &literal, sizeof(u64));
}

int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index f6b1a88245db..ceb35c35192c 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -161,6 +161,7 @@ SECTIONS
IDMAP_TEXT
HIBERNATE_TEXT
TRAMP_TEXT
+ STATIC_CALL_TEXT
*(.fixup)
*(.gnu.warning)
. = ALIGN(16);
--
2.25.1

2021-09-21 07:14:47

by Peter Zijlstra

Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Tue, Sep 21, 2021 at 01:32:35AM +0200, Frederic Weisbecker wrote:

> +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target) \
> + asm(" .pushsection .static_call.text, \"ax\" \n" \
> + " .align 3 \n" \
> + " .globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> + STATIC_CALL_TRAMP_STR(name) ": \n" \
> + " hint 34 /* BTI C */ \n" \
> + " adrp x16, 1f \n" \
> + " ldr x16, [x16, :lo12:1f] \n" \
> + " cbz x16, 0f \n" \
> + " br x16 \n" \
> + "0: ret \n" \
> + " .popsection \n" \
> + " .pushsection .rodata, \"a\" \n" \
> + " .align 3 \n" \
> + "1: .quad " target " \n" \
> + " .popsection \n")

So I like what Christophe did for PPC32:

https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu

Where he starts with an unconditional jmp and uses that IFF the offset
fits and only does the data load when it doesn't. Ard, wouldn't that
also make sense on ARM64? I'm thinking most in-kernel function pointers
would actually fit, it's just the module muck that gets to have too
large pointers, no?

> +#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func) \
> + __ARCH_DEFINE_STATIC_CALL_TRAMP(name, #func)
> +
> +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
> + __ARCH_DEFINE_STATIC_CALL_TRAMP(name, "0x0")

2021-09-21 14:46:22

by Ard Biesheuvel

Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Sep 21, 2021 at 01:32:35AM +0200, Frederic Weisbecker wrote:
>
> > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target) \
> > + asm(" .pushsection .static_call.text, \"ax\" \n" \
> > + " .align 3 \n" \
> > + " .globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> > + STATIC_CALL_TRAMP_STR(name) ": \n" \
> > + " hint 34 /* BTI C */ \n" \
> > + " adrp x16, 1f \n" \
> > + " ldr x16, [x16, :lo12:1f] \n" \
> > + " cbz x16, 0f \n" \
> > + " br x16 \n" \
> > + "0: ret \n" \
> > + " .popsection \n" \
> > + " .pushsection .rodata, \"a\" \n" \
> > + " .align 3 \n" \
> > + "1: .quad " target " \n" \
> > + " .popsection \n")
>
> So I like what Christophe did for PPC32:
>
> https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
>
> Where he starts with an unconditional jmp and uses that IFF the offset
> fits and only does the data load when it doesn't. Ard, wouldn't that
> also make sense on ARM64? I'm thinking most in-kernel function pointers
> would actually fit, it's just the module muck that gets to have too
> large pointers, no?
>

Yeah, I'd have to page that back in. But it seems like the following

bti c
<branch>
adrp x16, <literal>
ldr x16, [x16, ...]
br x16

with <branch> either set to 'b target' for the near targets, 'ret' for
the NULL target, and 'nop' for the far targets should work, and the
architecture permits patching branches into NOPs and vice versa
without special synchronization. But I must be missing something here,
or why did we have that long discussion before?

2021-09-21 15:09:48

by Peter Zijlstra

Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Sep 21, 2021 at 01:32:35AM +0200, Frederic Weisbecker wrote:
> >
> > > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target) \
> > > + asm(" .pushsection .static_call.text, \"ax\" \n" \
> > > + " .align 3 \n" \
> > > + " .globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> > > + STATIC_CALL_TRAMP_STR(name) ": \n" \
> > > + " hint 34 /* BTI C */ \n" \
> > > + " adrp x16, 1f \n" \
> > > + " ldr x16, [x16, :lo12:1f] \n" \
> > > + " cbz x16, 0f \n" \
> > > + " br x16 \n" \
> > > + "0: ret \n" \
> > > + " .popsection \n" \
> > > + " .pushsection .rodata, \"a\" \n" \
> > > + " .align 3 \n" \
> > > + "1: .quad " target " \n" \
> > > + " .popsection \n")
> >
> > So I like what Christophe did for PPC32:
> >
> > https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
> >
> > Where he starts with an unconditional jmp and uses that IFF the offset
> > fits and only does the data load when it doesn't. Ard, wouldn't that
> > also make sense on ARM64? I'm thinking most in-kernel function pointers
> > would actually fit, it's just the module muck that gets to have too
> > large pointers, no?
> >
>
> Yeah, I'd have to page that back in. But it seems like the following
>
> bti c
> <branch>
> adrp x16, <literal>
> ldr x16, [x16, ...]
> br x16
>
> with <branch> either set to 'b target' for the near targets, 'ret' for
> the NULL target, and 'nop' for the far targets should work, and the
> architecture permits patching branches into NOPs and vice versa
> without special synchronization. But I must be missing something here,
> or why did we have that long discussion before?

So the fundamental constraint is that we can only modify a single
instruction at a time and need to consider concurrent execution.

I think the first round of discussions was around getting the normal arm
pattern of constructing a long pointer 'working'. My initial suggestion
was to have 2 slots for that, then you came up with this data load
thing.

2021-09-21 15:35:23

by Mark Rutland

Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Sep 21, 2021 at 01:32:35AM +0200, Frederic Weisbecker wrote:
> >
> > > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target) \
> > > + asm(" .pushsection .static_call.text, \"ax\" \n" \
> > > + " .align 3 \n" \
> > > + " .globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> > > + STATIC_CALL_TRAMP_STR(name) ": \n" \
> > > + " hint 34 /* BTI C */ \n" \
> > > + " adrp x16, 1f \n" \
> > > + " ldr x16, [x16, :lo12:1f] \n" \
> > > + " cbz x16, 0f \n" \
> > > + " br x16 \n" \
> > > + "0: ret \n" \
> > > + " .popsection \n" \
> > > + " .pushsection .rodata, \"a\" \n" \
> > > + " .align 3 \n" \
> > > + "1: .quad " target " \n" \
> > > + " .popsection \n")
> >
> > So I like what Christophe did for PPC32:
> >
> > https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
> >
> > Where he starts with an unconditional jmp and uses that IFF the offset
> > fits and only does the data load when it doesn't. Ard, wouldn't that
> > also make sense on ARM64? I'm thinking most in-kernel function pointers
> > would actually fit, it's just the module muck that gets to have too
> > large pointers, no?
> >
>
> Yeah, I'd have to page that back in. But it seems like the following
>
> bti c
> <branch>
> adrp x16, <literal>
> ldr x16, [x16, ...]
> br x16
>
> with <branch> either set to 'b target' for the near targets, 'ret' for
> the NULL target, and 'nop' for the far targets should work, and the
> architecture permits patching branches into NOPs and vice versa
> without special synchronization.

I think so, yes. We can do slightly better with an inline literal pool
and a PC-relative LDR to fold the ADRP+LDR, e.g.

.align 3
tramp:
BTI C
{B <func> | RET | NOP}
LDR X16, 1f
BR X16
1: .quad <literal>

Since that's in the .text, it's RO for regular accesses anyway.

> But I must be missing something here, or why did we have that long
> discussion before?

I think the long discussion was because v2 had some more complex options
(mostly due to trying to use ADRP+ADD) and atomicity/preemption issues
meant we could only transition between some of those one-way, and it was
subtle/complex:

https://lore.kernel.org/linux-arm-kernel/[email protected]/

For v3, that was all gone, but we didn't have a user.

Since the common case *should* be handled by {B <func> | RET | NOP }, I
reckon it's fine to have just that and the literal pool fallback (which
I'll definitely need for the sorts of kernel I run when fuzzing, where
the kernel Image itself can be 100s of MiBs).

Thanks,
Mark.

2021-09-21 15:56:29

by Ard Biesheuvel

Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Tue, 21 Sept 2021 at 17:33, Mark Rutland <[email protected]> wrote:
>
> On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <[email protected]> wrote:
...
> > >
> > > So I like what Christophe did for PPC32:
> > >
> > > https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
> > >
> > > Where he starts with an unconditional jmp and uses that IFF the offset
> > > fits and only does the data load when it doesn't. Ard, wouldn't that
> > > also make sense on ARM64? I'm thinking most in-kernel function pointers
> > > would actually fit, it's just the module muck that gets to have too
> > > large pointers, no?
> > >
> >
> > Yeah, I'd have to page that back in. But it seems like the following
> >
> > bti c
> > <branch>
> > adrp x16, <literal>
> > ldr x16, [x16, ...]
> > br x16
> >
> > with <branch> either set to 'b target' for the near targets, 'ret' for
> > the NULL target, and 'nop' for the far targets should work, and the
> > architecture permits patching branches into NOPs and vice versa
> > without special synchronization.
>
> I think so, yes. We can do slightly better with an inline literal pool
> and a PC-relative LDR to fold the ADRP+LDR, e.g.
>
> .align 3
> tramp:
> BTI C
> {B <func> | RET | NOP}
> LDR X16, 1f
> BR X16
> 1: .quad <literal>
>
> Since that's in the .text, it's RO for regular accesses anyway.
>

I tried to keep the literal in .rodata to avoid inadvertent gadgets
and/or anticipate exec-only mappings of .text, but that may be a bit
overzealous.

> > But I must be missing something here, or why did we have that long
> > discussion before?
>
> I think the long discussion was because v2 had some more complex options
> (mostly due to trying to use ADRP+ADD) and atomicity/preemption issues
> meant we could only transition between some of those one-way, and it was
> subtle/complex:
>
> https://lore.kernel.org/linux-arm-kernel/[email protected]/
>

Ah yes, I was trying to use ADRP/ADD to avoid the load, and this is
what created all the complexity.

> For v3, that was all gone, but we didn't have a user.
>
> Since the common case *should* be handled by {B <func> | RET | NOP }, I
> reckon it's fine to have just that and the literal pool fallback (which
> I'll definitely need for the sorts of kernel I run when fuzzing, where
> the kernel Image itself can be 100s of MiBs).

Ack. So I'll respin this along these lines. Do we care deeply about
the branch and the literal being transiently out of sync?

2021-09-21 16:13:41

by Ard Biesheuvel

Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Tue, 21 Sept 2021 at 01:32, Frederic Weisbecker <[email protected]> wrote:
>
> From: Ard Biesheuvel <[email protected]>
>
> [fweisbec: rebased against 5.15-rc2. There have been quite a few changes
> on arm64 since then, especially around insn/patching, so some naming may
> no longer be accurate]
>

This patch does not include the static_call.c file, a reference to which
is being added to the Makefile below.


> Implement arm64 support for the 'unoptimized' static call variety, which
> routes all calls through a single trampoline that is patched to perform a
> tail call to the selected function.
>
> Since static call targets may be located in modules loaded out of direct
> branching range, we need an ADRP/LDR pair to load the branch target from a
> literal into X16, and a branch-to-register (BR) instruction to perform an
> indirect call. Unlike on x86, there is no pressing need on arm64 to avoid
> indirect calls at all cost, but hiding the indirection from the compiler as
> is done here does have some benefits:
> - the literal is located in .rodata, which gives us the same robustness
> advantage that code patching does;
> - no performance hit on CFI-enabled Clang builds that decorate
> compiler-emitted indirect calls with branch target validity checks.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: James Morse <[email protected]>
> Cc: Will Deacon <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/insn.h | 2 ++
> arch/arm64/include/asm/static_call.h | 28 ++++++++++++++++++++++++++++
> arch/arm64/kernel/Makefile | 4 ++--
> arch/arm64/kernel/patching.c | 14 +++++++++++---
> arch/arm64/kernel/vmlinux.lds.S | 1 +
> 6 files changed, 45 insertions(+), 5 deletions(-)
> create mode 100644 arch/arm64/include/asm/static_call.h
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 5c7ae4c3954b..5b51b359ccda 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -192,6 +192,7 @@ config ARM64
> select HAVE_PERF_REGS
> select HAVE_PERF_USER_STACK_DUMP
> select HAVE_REGS_AND_STACK_ACCESS_API
> + select HAVE_STATIC_CALL
> select HAVE_FUNCTION_ARG_ACCESS_API
> select HAVE_FUTEX_CMPXCHG if FUTEX
> select MMU_GATHER_RCU_TABLE_FREE
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index 6b776c8667b2..681c08b170df 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -547,6 +547,8 @@ u32 aarch64_set_branch_offset(u32 insn, s32 offset);
> s32 aarch64_insn_adrp_get_offset(u32 insn);
> u32 aarch64_insn_adrp_set_offset(u32 insn, s32 offset);
>
> +int aarch64_literal_write(void *addr, u64 literal);
> +
> bool aarch32_insn_is_wide(u32 insn);
>
> #define A32_RN_OFFSET 16
> diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
> new file mode 100644
> index 000000000000..665ec2a7cdb2
> --- /dev/null
> +++ b/arch/arm64/include/asm/static_call.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_STATIC_CALL_H
> +#define _ASM_STATIC_CALL_H
> +
> +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, target) \
> + asm(" .pushsection .static_call.text, \"ax\" \n" \
> + " .align 3 \n" \
> + " .globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> + STATIC_CALL_TRAMP_STR(name) ": \n" \
> + " hint 34 /* BTI C */ \n" \
> + " adrp x16, 1f \n" \
> + " ldr x16, [x16, :lo12:1f] \n" \
> + " cbz x16, 0f \n" \
> + " br x16 \n" \
> + "0: ret \n" \
> + " .popsection \n" \
> + " .pushsection .rodata, \"a\" \n" \
> + " .align 3 \n" \
> + "1: .quad " target " \n" \
> + " .popsection \n")
> +
> +#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func) \
> + __ARCH_DEFINE_STATIC_CALL_TRAMP(name, #func)
> +
> +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
> + __ARCH_DEFINE_STATIC_CALL_TRAMP(name, "0x0")
> +
> +#endif /* _ASM_STATIC_CALL_H */
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index 3f1490bfb938..83f03fc1e402 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -28,8 +28,8 @@ obj-y := debug-monitors.o entry.o irq.o fpsimd.o \
> return_address.o cpuinfo.o cpu_errata.o \
> cpufeature.o alternative.o cacheinfo.o \
> smp.o smp_spin_table.o topology.o smccc-call.o \
> - syscall.o proton-pack.o idreg-override.o idle.o \
> - patching.o
> + syscall.o proton-pack.o static_call.o \
> + idreg-override.o idle.o patching.o
>
> targets += efi-entry.o
>
> diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
> index 771f543464e0..841c0499eca5 100644
> --- a/arch/arm64/kernel/patching.c
> +++ b/arch/arm64/kernel/patching.c
> @@ -66,7 +66,7 @@ int __kprobes aarch64_insn_read(void *addr, u32 *insnp)
> return ret;
> }
>
> -static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
> +static int __kprobes __aarch64_insn_write(void *addr, void *insn, int size)
> {
> void *waddr = addr;
> unsigned long flags = 0;
> @@ -75,7 +75,7 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
> raw_spin_lock_irqsave(&patch_lock, flags);
> waddr = patch_map(addr, FIX_TEXT_POKE0);
>
> - ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);
> + ret = copy_to_kernel_nofault(waddr, insn, size);
>
> patch_unmap(FIX_TEXT_POKE0);
> raw_spin_unlock_irqrestore(&patch_lock, flags);
> @@ -85,7 +85,15 @@ static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
>
> int __kprobes aarch64_insn_write(void *addr, u32 insn)
> {
> - return __aarch64_insn_write(addr, cpu_to_le32(insn));
> + __le32 i = cpu_to_le32(insn);
> +
> + return __aarch64_insn_write(addr, &i, AARCH64_INSN_SIZE);
> +}
> +
> +int aarch64_literal_write(void *addr, u64 literal)
> +{
> + BUG_ON(!IS_ALIGNED((u64)addr, sizeof(u64)));
> + return __aarch64_insn_write(addr, &literal, sizeof(u64));
> }
>
> int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index f6b1a88245db..ceb35c35192c 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -161,6 +161,7 @@ SECTIONS
> IDMAP_TEXT
> HIBERNATE_TEXT
> TRAMP_TEXT
> + STATIC_CALL_TEXT
> *(.fixup)
> *(.gnu.warning)
> . = ALIGN(16);
> --
> 2.25.1
>

2021-09-21 16:30:31

by Mark Rutland

Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Tue, Sep 21, 2021 at 05:55:11PM +0200, Ard Biesheuvel wrote:
> On Tue, 21 Sept 2021 at 17:33, Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <[email protected]> wrote:
> ...
> > > >
> > > > So I like what Christophe did for PPC32:
> > > >
> > > > https://lkml.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
> > > >
> > > > Where he starts with an unconditional jmp and uses that IFF the offset
> > > > fits and only does the data load when it doesn't. Ard, wouldn't that
> > > > also make sense on ARM64? I'm thinking most in-kernel function pointers
> > > > would actually fit, it's just the module muck that gets to have too
> > > > large pointers, no?
> > > >
> > >
> > > Yeah, I'd have to page that back in. But it seems like the following
> > >
> > > bti c
> > > <branch>
> > > adrp x16, <literal>
> > > ldr x16, [x16, ...]
> > > br x16
> > >
> > > with <branch> either set to 'b target' for the near targets, 'ret' for
> > > the NULL target, and 'nop' for the far targets should work, and the
> > > architecture permits patching branches into NOPs and vice versa
> > > without special synchronization.
> >
> > I think so, yes. We can do slightly better with an inline literal pool
> > and a PC-relative LDR to fold the ADRP+LDR, e.g.
> >
> > .align 3
> > tramp:
> > BTI C
> > {B <func> | RET | NOP}
> > LDR X16, 1f
> > BR X16
> > 1: .quad <literal>
> >
> > Since that's in the .text, it's RO for regular accesses anyway.
> >
>
> I tried to keep the literal in .rodata to avoid inadvertent gadgets
> and/or anticipate exec-only mappings of .text, but that may be a bit
> overzealous.

I think that in practice the risk of gadgetisation is minimal, and
having it inline means we only need to record a single address per
trampoline, so there's less risk that we get the patching wrong.

> > > But I must be missing something here, or why did we have that long
> > > discussion before?
> >
> > I think the long discussion was because v2 had some more complex options
> > (mostly due to trying to use ADRP+ADD) and atomicity/preemption issues
> > meant we could only transition between some of those one-way, and it was
> > subtle/complex:
> >
> > https://lore.kernel.org/linux-arm-kernel/[email protected]/
> >
>
> Ah yes, I was trying to use ADRP/ADD to avoid the load, and this is
> what created all the complexity.
>
> > For v3, that was all gone, but we didn't have a user.
> >
> > Since the common case *should* be handled by {B <func> | RET | NOP }, I
> > reckon it's fine to have just that and the literal pool fallback (which
> > I'll definitely need for the sorts of kernel I run when fuzzing, where
> > the kernel Image itself can be 100s of MiBs).
>
> Ack. So I'll respin this along these lines.

Sounds good!

> Do we care deeply about the branch and the literal being transiently
> out of sync?

I don't think we care about the transient window, since even if we just
patched a branch, a thread could be preempted immediately after the
branch and sit around blocked for a while. So it's always necessary to
either handle such threads taking stale branches, or to flip the branch
such that this doesn't matter (e.g. done once at boot time).

That said, I'd suggest that we always patch the literal, then patch the
{B | RET | NOP}, so that outside of patch times those are consistent with
one another and we can't accidentally get into a state where we use a
stale/bogus target after multiple patches. We can align the trampoline
such that we know it falls within a single page, so that we only need to
map/unmap it once (and the cost of the extra STR will be far smaller
than the map/unmap anyhow).
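
A minimal sketch of that ordering, assuming the aarch64_literal_write()
helper from this series, the existing arm64 insn patching API, and the
trampoline layout sketched above (BTI C; {B|RET|NOP}; LDR X16, 1f;
BR X16; literal). None of this is from the series as posted; the offsets
and the range check are assumptions:

    #include <linux/sizes.h>
    #include <linux/static_call.h>
    #include <asm/insn.h>

    void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
    {
            /* Out-of-line trampolines only: 'site' and 'tail' are unused here. */
            unsigned long pc = (unsigned long)tramp + AARCH64_INSN_SIZE; /* {B|RET|NOP} slot */
            void *literal = tramp + 4 * AARCH64_INSN_SIZE;               /* .quad after BR X16 */
            unsigned long target = (unsigned long)func;
            u32 insn;

            /* 1) Patch the fallback literal first, so it is never stale. */
            aarch64_literal_write(literal, (u64)target);

            /*
             * 2) Then flip the branch slot: RET for a NULL target, a direct B
             *    for near targets, or a NOP to fall through to the literal
             *    load for far targets.
             */
            if (!func)
                    insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
                                                       AARCH64_INSN_BRANCH_RETURN);
            else if (target - pc + SZ_128M < SZ_256M) /* within B's +/-128MiB */
                    insn = aarch64_insn_gen_branch_imm(pc, target,
                                                       AARCH64_INSN_BRANCH_NOLINK);
            else
                    insn = aarch64_insn_gen_nop();

            aarch64_insn_patch_text_nosync((void *)pc, insn);
    }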

Thanks,
Mark.

2021-09-25 17:47:48

by David Laight

Subject: RE: [PATCH 2/4] arm64: implement support for static call trampolines

From: Mark Rutland
> Sent: 21 September 2021 17:28
>
> On Tue, Sep 21, 2021 at 05:55:11PM +0200, Ard Biesheuvel wrote:
> > On Tue, 21 Sept 2021 at 17:33, Mark Rutland <[email protected]> wrote:
> > >
> > > On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > > > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <[email protected]> wrote:
> > ...
...
> > >
> > > I think so, yes. We can do slightly better with an inline literal pool
> > > and a PC-relative LDR to fold the ADRP+LDR, e.g.
> > >
> > > .align 3
> > > tramp:
> > > BTI C
> > > {B <func> | RET | NOP}
> > > LDR X16, 1f
> > > BR X16
> > > 1: .quad <literal>
> > >
> > > Since that's in the .text, it's RO for regular accesses anyway.
> > >
> >
> > I tried to keep the literal in .rodata to avoid inadvertent gadgets
> > and/or anticipate exec-only mappings of .text, but that may be a bit
> > overzealous.
>
> I think that in practice the risk of gadgetisation is minimal, and
> having it inline means we only need to record a single address per
> trampoline, so there's less risk that we get the patching wrong.

But doesn't that mean that it is almost certainly a data cache miss?
You really want an instruction that reads the constant from the I-cache.
Or at least be able to 'bunch together' the constants so they
stand a chance of sharing a D-cache line.

David


2021-09-27 09:03:06

by Mark Rutland

Subject: Re: [PATCH 2/4] arm64: implement support for static call trampolines

On Sat, Sep 25, 2021 at 05:46:23PM +0000, David Laight wrote:
> From: Mark Rutland
> > Sent: 21 September 2021 17:28
> >
> > On Tue, Sep 21, 2021 at 05:55:11PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 21 Sept 2021 at 17:33, Mark Rutland <[email protected]> wrote:
> > > >
> > > > On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > > > > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <[email protected]> wrote:
> > > ...
> ...
> > > >
> > > > I think so, yes. We can do slightly better with an inline literal pool
> > > > and a PC-relative LDR to fold the ADRP+LDR, e.g.
> > > >
> > > > .align 3
> > > > tramp:
> > > > BTI C
> > > > {B <func> | RET | NOP}
> > > > LDR X16, 1f
> > > > BR X16
> > > > 1: .quad <literal>
> > > >
> > > > Since that's in the .text, it's RO for regular accesses anyway.
> > > >
> > >
> > > I tried to keep the literal in .rodata to avoid inadvertent gadgets
> > > and/or anticipate exec-only mappings of .text, but that may be a bit
> > > overzealous.
> >
> > I think that in practice the risk of gadgetisation is minimal, and
> > having it inline means we only need to record a single address per
> > trampoline, so there's less risk that we get the patching wrong.
>
> But doesn't that mean that it is almost certainly a data cache miss?
> You really want an instruction that reads the constant from the I-cache.
> Or at least be able to 'bunch together' the constants so they
> stand a chance of sharing a D-cache line.

The idea is that in the common case we don't even use the literal, and
the `B <func>` goes to the target.

The literal is there as a fallback for when the target is a sufficiently
long distance away (more than +/-128MiB from the `BR X16`). By default
we try to keep modules within 128MiB of the kernel image, and this
should only happen in uncommon configs (e.g. my debug kernel configs,
where the kernel Image can be 100s of MiBs).

With that in mind, I'd strongly prefer to optimize for simplicity rather
than making the uncommon case faster.

Thanks,
Mark.