LinuxLists.cc - [PATCH 0/5] x86/bugs: more BHI fixes

2024-04-19 21:10:10

by Josh Poimboeuf

[permalink] [raw]

Subject: [PATCH 0/5] x86/bugs: more BHI fixes

Patch 1 is another iteration of reducing the scope of syscall hardening
in order to improve performance on some CPUs. The feature bit has a new
name, and the commit log and comments are much improved.

The rest of the patches are new:

- Patch 2 fixes the default mitigations for !x86 (reimplementation of
Sean's fix).

- Patch 3 fixes some objtool warnings found by Paul.

- Patch 4 is a documentation cleanup and prep for patch 5.

- Patch 5 adds a requested spectre_bhi=vmexit option.

Josh Poimboeuf (5):
x86/bugs: Only harden syscalls when needed
cpu/speculation: Fix CPU mitigation defaults for !x86
x86/syscall: Mark exit[_group] syscall handlers __noreturn
x86/bugs: Remove duplicate Spectre cmdline option descriptions
x86/bugs: Add 'spectre_bhi=vmexit' cmdline option

Documentation/admin-guide/hw-vuln/spectre.rst | 84 ++-----------------
.../admin-guide/kernel-parameters.txt | 12 ++-
arch/Kconfig | 10 +++
arch/x86/Kconfig | 15 +---
arch/x86/entry/common.c | 15 +++-
arch/x86/entry/syscall_32.c | 11 +--
arch/x86/entry/syscall_64.c | 10 +--
arch/x86/entry/syscall_x32.c | 11 ++-
arch/x86/entry/syscalls/syscall_64.tbl | 6 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/syscall.h | 8 +-
arch/x86/kernel/cpu/bugs.c | 51 +++++++++--
arch/x86/um/sys_call_table_32.c | 1 +
arch/x86/um/sys_call_table_64.c | 1 +
kernel/cpu.c | 4 +-
scripts/syscalltbl.sh | 6 +-
tools/objtool/noreturns.h | 4 +
17 files changed, 126 insertions(+), 124 deletions(-)

--
2.44.0

2024-04-19 21:10:21

by Josh Poimboeuf

[permalink] [raw]

Subject: [PATCH v4 1/5] x86/bugs: Only harden syscalls when needed

Syscall hardening (converting the syscall indirect branch to a series of
direct branches) has shown some performance regressions:

- Red Hat internal testing showed up to 12% slowdowns in database
benchmark testing on Sapphire Rapids when the DB was stressed with 80+
users to cause contention.

- The kernel test robot's will-it-scale benchmarks showed significant
regressions on Skylake with IBRS:
https://lkml.kernel.org/lkml/[email protected]

To fix those slowdowns, only use the syscall direct branches when
indirect branches are considered to be "not OK": meaning Spectre v2+BHI
isn't mitigated by HW and the user hasn't disabled mitigations.

Fixes: 1e3ad78334a6 ("x86/syscall: Don't force use of indirect calls for system calls")
Reviewed-by: Pawan Gupta <[email protected]>
Acked-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/x86/entry/common.c | 15 ++++++++++---
arch/x86/entry/syscall_32.c | 11 +---------
arch/x86/entry/syscall_64.c | 6 -----
arch/x86/entry/syscall_x32.c | 7 +++++-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/syscall.h | 8 ++++++-
arch/x86/kernel/cpu/bugs.c | 35 +++++++++++++++++++++++++++++-
7 files changed, 61 insertions(+), 22 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 51cc9c7cb9bd..db1ef98da3a4 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -49,7 +49,10 @@ static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)

if (likely(unr < NR_syscalls)) {
unr = array_index_nospec(unr, NR_syscalls);
- regs->ax = x64_sys_call(regs, unr);
+ if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
+ regs->ax = sys_call_table[unr](regs);
+ else
+ regs->ax = x64_sys_call(regs, unr);
return true;
}
return false;
@@ -66,7 +69,10 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)

if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(xnr < X32_NR_syscalls)) {
xnr = array_index_nospec(xnr, X32_NR_syscalls);
- regs->ax = x32_sys_call(regs, xnr);
+ if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
+ regs->ax = x32_sys_call_table[xnr](regs);
+ else
+ regs->ax = x32_sys_call(regs, xnr);
return true;
}
return false;
@@ -162,7 +168,10 @@ static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs, int nr)

if (likely(unr < IA32_NR_syscalls)) {
unr = array_index_nospec(unr, IA32_NR_syscalls);
- regs->ax = ia32_sys_call(regs, unr);
+ if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
+ regs->ax = ia32_sys_call_table[unr](regs);
+ else
+ regs->ax = ia32_sys_call(regs, unr);
} else if (nr != -1) {
regs->ax = __ia32_sys_ni_syscall(regs);
}
diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
index c2235bae17ef..aab31760b4e3 100644
--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -14,25 +14,16 @@
#endif

#define __SYSCALL(nr, sym) extern long __ia32_##sym(const struct pt_regs *);
-
#include <asm/syscalls_32.h>
#undef __SYSCALL

-/*
- * The sys_call_table[] is no longer used for system calls, but
- * kernel/trace/trace_syscalls.c still wants to know the system
- * call address.
- */
-#ifdef CONFIG_X86_32
#define __SYSCALL(nr, sym) __ia32_##sym,
-const sys_call_ptr_t sys_call_table[] = {
+const sys_call_ptr_t ia32_sys_call_table[] = {
#include <asm/syscalls_32.h>
};
#undef __SYSCALL
-#endif

#define __SYSCALL(nr, sym) case nr: return __ia32_##sym(regs);
-
long ia32_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 33b3f09e6f15..96ea1f8a1d3f 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -11,11 +11,6 @@
#include <asm/syscalls_64.h>
#undef __SYSCALL

-/*
- * The sys_call_table[] is no longer used for system calls, but
- * kernel/trace/trace_syscalls.c still wants to know the system
- * call address.
- */
#define __SYSCALL(nr, sym) __x64_##sym,
const sys_call_ptr_t sys_call_table[] = {
#include <asm/syscalls_64.h>
@@ -23,7 +18,6 @@ const sys_call_ptr_t sys_call_table[] = {
#undef __SYSCALL

#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-
long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c
index 03de4a932131..5aef4230faca 100644
--- a/arch/x86/entry/syscall_x32.c
+++ b/arch/x86/entry/syscall_x32.c
@@ -11,8 +11,13 @@
#include <asm/syscalls_x32.h>
#undef __SYSCALL

-#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
+#define __SYSCALL(nr, sym) __x64_##sym,
+const sys_call_ptr_t x32_sys_call_table[] = {
+#include <asm/syscalls_x32.h>
+};
+#undef __SYSCALL

+#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 3c7434329661..d64b0a5291f1 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -470,6 +470,7 @@
#define X86_FEATURE_BHI_CTRL (21*32+ 2) /* "" BHI_DIS_S HW control available */
#define X86_FEATURE_CLEAR_BHB_HW (21*32+ 3) /* "" BHI_DIS_S HW control enabled */
#define X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT (21*32+ 4) /* "" Clear branch history at vmexit using SW loop */
+#define X86_FEATURE_INDIRECT_BRANCH_OK (21*32+ 5) /* "" It's OK to use indirect branches */

/*
* BUG word(s)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 2fc7bc3863ff..dfb59521244c 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -16,14 +16,20 @@
#include <asm/thread_info.h> /* for TS_COMPAT */
#include <asm/unistd.h>

-/* This is used purely for kernel/trace/trace_syscalls.c */
typedef long (*sys_call_ptr_t)(const struct pt_regs *);
extern const sys_call_ptr_t sys_call_table[];

+#if defined(CONFIG_X86_32)
+#define ia32_sys_call_table sys_call_table
+#else
/*
* These may not exist, but still put the prototypes in so we
* can use IS_ENABLED().
*/
+extern const sys_call_ptr_t ia32_sys_call_table[];
+extern const sys_call_ptr_t x32_sys_call_table[];
+#endif
+
extern long ia32_sys_call(const struct pt_regs *, unsigned int nr);
extern long x32_sys_call(const struct pt_regs *, unsigned int nr);
extern long x64_sys_call(const struct pt_regs *, unsigned int nr);
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ab18185894df..5fca46c78daf 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1665,6 +1665,12 @@ static void __init bhi_select_mitigation(void)
if (!IS_ENABLED(CONFIG_X86_64))
return;

+ /*
+ * There's no HW mitigation in place. Mark indirect branches as
+ * "not OK".
+ */
+ setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
+
/* Mitigate KVM by default */
setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT);
pr_info("Spectre BHI mitigation: SW BHB clearing on vm exit\n");
@@ -1679,6 +1685,28 @@ static void __init spectre_v2_select_mitigation(void)
enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
enum spectre_v2_mitigation mode = SPECTRE_V2_NONE;

+ /*
+ * X86_FEATURE_INDIRECT_BRANCH_OK indicates that indirect calls are
+ * "OK" to use due to (at least) one of the following being true:
+ *
+ * - the CPU isn't vulnerable to Spectre v2, BHI, etc;
+ *
+ * - a HW mitigation is in place (e.g., IBRS, eIBRS+BHI_DIS_S); or
+ *
+ * - the user disabled mitigations.
+ *
+ * Clearing the bit enables certain indirect branch "easy targets" [*]
+ * to be converted to a series of direct branches.
+ *
+ * Assume innocence until proven guilty: set it now and clear it later
+ * if/when needed.
+ *
+ * [*] The closer the indirect branch is to kernel entry, and the more
+ * user-controlled registers there are, the easier target it may be
+ * for future Spectre v2 variants.
+ */
+ setup_force_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
+
/*
* If the CPU is not affected and the command line mode is NONE or AUTO
* then nothing to do.
@@ -1765,11 +1793,16 @@ static void __init spectre_v2_select_mitigation(void)
break;

case SPECTRE_V2_LFENCE:
+ setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
+ fallthrough;
case SPECTRE_V2_EIBRS_LFENCE:
setup_force_cpu_cap(X86_FEATURE_RETPOLINE_LFENCE);
- fallthrough;
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+ break;

case SPECTRE_V2_RETPOLINE:
+ setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
+ fallthrough;
case SPECTRE_V2_EIBRS_RETPOLINE:
setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
break;
--
2.44.0

2024-04-19 21:10:24

by Josh Poimboeuf

[permalink] [raw]

Subject: [PATCH v4 2/5] cpu/speculation: Fix CPU mitigation defaults for !x86

CPU speculative execution mitigations were inadvertently disabled on
non-x86 arches by the following commit:

f337a6a21e2f ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n")

Fix it by replacing CONFIG_SPECULATION_MITIGATIONS with a new generic
CONFIG_CPU_MITIGATIONS option and moving the x86-specific mitigations to
a separate menu which depends on CONFIG_CPU_MITIGATIONS.

Fixes: f337a6a21e2f ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n")
Reported-by: Stephen Rothwell <[email protected]>
Reported-by: Michael Ellerman <[email protected]>
Reported-by: Geert Uytterhoeven <[email protected]>
Closes: https://lkml.kernel.org/r/20240413115324.53303a68%40canb.auug.org.au
Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/Kconfig | 10 ++++++++++
arch/x86/Kconfig | 15 +++------------
kernel/cpu.c | 4 ++--
3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 9f066785bb71..5c96849eb957 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -11,6 +11,16 @@ source "arch/$(SRCARCH)/Kconfig"

menu "General architecture-dependent options"

+config CPU_MITIGATIONS
+ bool "Mitigations for CPU speculative execution vulnerabilities"
+ default y
+ help
+ Say Y here to enable mitigations for CPU speculative execution
+ vulnerabilities.
+
+ If you say N, all mitigations will be disabled. You really
+ should know what you are doing to say so.
+
config ARCH_HAS_SUBPAGE_FAULTS
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4474bf32d0a4..85a4d57bce1e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2488,17 +2488,8 @@ config PREFIX_SYMBOLS
def_bool y
depends on CALL_PADDING && !CFI_CLANG

-menuconfig SPECULATION_MITIGATIONS
- bool "Mitigations for speculative execution vulnerabilities"
- default y
- help
- Say Y here to enable options which enable mitigations for
- speculative execution hardware vulnerabilities.
-
- If you say N, all mitigations will be disabled. You really
- should know what you are doing to say so.
-
-if SPECULATION_MITIGATIONS
+menu "CPU speculative execution mitigation defaults"
+ depends on CPU_MITIGATIONS

config MITIGATION_PAGE_TABLE_ISOLATION
bool "Remove the kernel mapping in user mode"
@@ -2643,7 +2634,7 @@ config MITIGATION_SPECTRE_BHI
indirect branches.
See <file:Documentation/admin-guide/hw-vuln/spectre.rst>

-endif
+endmenu

config ARCH_HAS_ADD_PAGES
def_bool y
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 07ad53b7f119..bb0ff275fb46 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3207,8 +3207,8 @@ enum cpu_mitigations {
};

static enum cpu_mitigations cpu_mitigations __ro_after_init =
- IS_ENABLED(CONFIG_SPECULATION_MITIGATIONS) ? CPU_MITIGATIONS_AUTO :
- CPU_MITIGATIONS_OFF;
+ IS_ENABLED(CONFIG_CPU_MITIGATIONS) ? CPU_MITIGATIONS_AUTO :
+ CPU_MITIGATIONS_OFF;

static int __init mitigations_parse_cmdline(char *arg)
{
--
2.44.0

2024-04-19 21:10:36

by Josh Poimboeuf

[permalink] [raw]

Subject: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

The direct-call syscall dispatch functions don't know that the exit()
and exit_group() syscall handlers don't return. As a result the call
sites aren't optimized accordingly.

Fix that by marking those exit syscall declarations as __noreturn.

Fixes the following warnings:

vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation

Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
Reported-by: "Paul E. McKenney" <[email protected]>
Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
Signed-off-by: Josh Poimboeuf <[email protected]>
---
arch/x86/entry/syscall_64.c | 4 ++++
arch/x86/entry/syscall_x32.c | 4 ++++
arch/x86/entry/syscalls/syscall_64.tbl | 6 +++---
arch/x86/um/sys_call_table_32.c | 1 +
arch/x86/um/sys_call_table_64.c | 1 +
scripts/syscalltbl.sh | 6 ++++--
tools/objtool/noreturns.h | 4 ++++
7 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 96ea1f8a1d3f..ff36a993a07e 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -8,9 +8,13 @@
#include <asm/syscall.h>

#define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
+#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
#include <asm/syscalls_64.h>
#undef __SYSCALL

+#undef __SYSCALL_NORETURN
+#define __SYSCALL_NORETURN __SYSCALL
+
#define __SYSCALL(nr, sym) __x64_##sym,
const sys_call_ptr_t sys_call_table[] = {
#include <asm/syscalls_64.h>
diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c
index 5aef4230faca..4221ecce6e68 100644
--- a/arch/x86/entry/syscall_x32.c
+++ b/arch/x86/entry/syscall_x32.c
@@ -8,9 +8,13 @@
#include <asm/syscall.h>

#define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
+#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
#include <asm/syscalls_x32.h>
#undef __SYSCALL

+#undef __SYSCALL_NORETURN
+#define __SYSCALL_NORETURN __SYSCALL
+
#define __SYSCALL(nr, sym) __x64_##sym,
const sys_call_ptr_t x32_sys_call_table[] = {
#include <asm/syscalls_x32.h>
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..6695105d21b5 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -2,7 +2,7 @@
# 64-bit system call numbers and entry vectors
#
# The format is:
-# <number> <abi> <name> <entry point>
+# <number> <abi> <name> <entry point> [0 noreturn]
#
# The __x64_sys_*() stubs are created on-the-fly for sys_*() system calls
#
@@ -68,7 +68,7 @@
57 common fork sys_fork
58 common vfork sys_vfork
59 64 execve sys_execve
-60 common exit sys_exit
+60 common exit sys_exit 0 noreturn
61 common wait4 sys_wait4
62 common kill sys_kill
63 common uname sys_newuname
@@ -239,7 +239,7 @@
228 common clock_gettime sys_clock_gettime
229 common clock_getres sys_clock_getres
230 common clock_nanosleep sys_clock_nanosleep
-231 common exit_group sys_exit_group
+231 common exit_group sys_exit_group 0 noreturn
232 common epoll_wait sys_epoll_wait
233 common epoll_ctl sys_epoll_ctl
234 common tgkill sys_tgkill
diff --git a/arch/x86/um/sys_call_table_32.c b/arch/x86/um/sys_call_table_32.c
index 89df5d89d664..c7d4bf955d2b 100644
--- a/arch/x86/um/sys_call_table_32.c
+++ b/arch/x86/um/sys_call_table_32.c
@@ -24,6 +24,7 @@
#define __SYSCALL_WITH_COMPAT(nr, native, compat) __SYSCALL(nr, native)

#define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
+#define __SYSCALL_NORETURN __SYSCALL
#include <asm/syscalls_32.h>

#undef __SYSCALL
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index b0b4cfd2308c..4760c40ae5cd 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -19,6 +19,7 @@
#define sys_ioperm sys_ni_syscall

#define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
+#define __SYSCALL_NORETURN __SYSCALL
#include <asm/syscalls_64.h>

#undef __SYSCALL
diff --git a/scripts/syscalltbl.sh b/scripts/syscalltbl.sh
index 6abe143889ef..16487d47e06a 100755
--- a/scripts/syscalltbl.sh
+++ b/scripts/syscalltbl.sh
@@ -54,7 +54,7 @@ nxt=0

grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {

- while read nr abi name native compat ; do
+ while read nr abi name native compat noreturn; do

if [ $nxt -gt $nr ]; then
echo "error: $infile: syscall table is not sorted or duplicates the same syscall number" >&2
@@ -66,7 +66,9 @@ grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {
nxt=$((nxt + 1))
done

- if [ -n "$compat" ]; then
+ if [ -n "$noreturn" ]; then
+ echo "__SYSCALL_NORETURN($nr, $native)"
+ elif [ -n "$compat" ]; then
echo "__SYSCALL_WITH_COMPAT($nr, $native, $compat)"
elif [ -n "$native" ]; then
echo "__SYSCALL($nr, $native)"
diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h
index 7ebf29c91184..1e8141ef1b15 100644
--- a/tools/objtool/noreturns.h
+++ b/tools/objtool/noreturns.h
@@ -7,12 +7,16 @@
* Yes, this is unfortunate. A better solution is in the works.
*/
NORETURN(__fortify_panic)
+NORETURN(__ia32_sys_exit)
+NORETURN(__ia32_sys_exit_group)
NORETURN(__kunit_abort)
NORETURN(__module_put_and_kthread_exit)
NORETURN(__reiserfs_panic)
NORETURN(__stack_chk_fail)
NORETURN(__tdx_hypercall_failed)
NORETURN(__ubsan_handle_builtin_unreachable)
+NORETURN(__x64_sys_exit)
+NORETURN(__x64_sys_exit_group)
NORETURN(arch_cpu_idle_dead)
NORETURN(bch2_trans_in_restart_error)
NORETURN(bch2_trans_restart_error)
--
2.44.0

2024-04-19 21:10:41

by Josh Poimboeuf

[permalink] [raw]

Subject: [PATCH v4 4/5] x86/bugs: Remove duplicate Spectre cmdline option descriptions

Duplicating the documentation of all the Spectre kernel cmdline options
in two separate places is unwieldy and error-prone. Instead just add a
reference to kernel-parameters.txt from spectre.rst.

Signed-off-by: Josh Poimboeuf <[email protected]>
---
Documentation/admin-guide/hw-vuln/spectre.rst | 84 ++-----------------
1 file changed, 9 insertions(+), 75 deletions(-)

diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index 25a04cda4c2c..f9797ab6b38f 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -592,85 +592,19 @@ Spectre variant 2
Mitigation control on the kernel command line
---------------------------------------------

-Spectre variant 2 mitigation can be disabled or force enabled at the
-kernel command line.
+In general the kernel selects reasonable default mitigations for the
+current CPU.
+
+Spectre default mitigations can be disabled or changed at the kernel
+command line with the following options:

nospectre_v1
-
- [X86,PPC] Disable mitigations for Spectre Variant 1
- (bounds check bypass). With this option data leaks are
- possible in the system.
-
nospectre_v2
+ spectre_v2={option}
+ spectre_v2_user={option}
+ spectre_bhi={option}

- [X86] Disable all mitigations for the Spectre variant 2
- (indirect branch prediction) vulnerability. System may
- allow data leaks with this option, which is equivalent
- to spectre_v2=off.
-
-
- spectre_v2=
-
- [X86] Control mitigation of Spectre variant 2
- (indirect branch speculation) vulnerability.
- The default operation protects the kernel from
- user space attacks.
-
- on
- unconditionally enable, implies
- spectre_v2_user=on
- off
- unconditionally disable, implies
- spectre_v2_user=off
- auto
- kernel detects whether your CPU model is
- vulnerable
-
- Selecting 'on' will, and 'auto' may, choose a
- mitigation method at run time according to the
- CPU, the available microcode, the setting of the
- CONFIG_MITIGATION_RETPOLINE configuration option,
- and the compiler with which the kernel was built.
-
- Selecting 'on' will also enable the mitigation
- against user space to user space task attacks.
-
- Selecting 'off' will disable both the kernel and
- the user space protections.
-
- Specific mitigations can also be selected manually:
-
- retpoline auto pick between generic,lfence
- retpoline,generic Retpolines
- retpoline,lfence LFENCE; indirect branch
- retpoline,amd alias for retpoline,lfence
- eibrs Enhanced/Auto IBRS
- eibrs,retpoline Enhanced/Auto IBRS + Retpolines
- eibrs,lfence Enhanced/Auto IBRS + LFENCE
- ibrs use IBRS to protect kernel
-
- Not specifying this option is equivalent to
- spectre_v2=auto.
-
- In general the kernel by default selects
- reasonable mitigations for the current CPU. To
- disable Spectre variant 2 mitigations, boot with
- spectre_v2=off. Spectre variant 1 mitigations
- cannot be disabled.
-
- spectre_bhi=
-
- [X86] Control mitigation of Branch History Injection
- (BHI) vulnerability. This setting affects the deployment
- of the HW BHI control and the SW BHB clearing sequence.
-
- on
- (default) Enable the HW or SW mitigation as
- needed.
- off
- Disable the mitigation.
-
-For spectre_v2_user see Documentation/admin-guide/kernel-parameters.txt
+For more details on the available options, refer to Documentation/admin-guide/kernel-parameters.txt

Mitigation selection guide
--------------------------
--
2.44.0

2024-04-19 21:10:52

by Josh Poimboeuf

[permalink] [raw]

Subject: [PATCH v4 5/5] x86/bugs: Add 'spectre_bhi=vmexit' cmdline option

In cloud environments it can be useful to *only* enable the vmexit
mitigation and leave syscalls vulnerable. Add that as an option.

This is similar to the old spectre_v2=auto option which was removed with
the following commit:

36d4fe147c87 ("x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto")

with the main difference being that this has a more descriptive name and
is disabled by default.

Requested-by: Maksim Davydov <[email protected]>
Signed-off-by: Josh Poimboeuf <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 12 +++++++++---
arch/x86/kernel/cpu/bugs.c | 16 +++++++++++-----
2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 902ecd92a29f..83c4889b88d2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6069,9 +6069,15 @@
deployment of the HW BHI control and the SW BHB
clearing sequence.

- on - (default) Enable the HW or SW mitigation
- as needed.
- off - Disable the mitigation.
+ on - (default) Enable the HW or SW mitigation as
+ needed. This protects the kernel from
+ both syscalls and VMs.
+ vmexit - On systems which don't have the HW mitigation
+ available, enable the SW mitigation on vmexit
+ ONLY. On such systems, the host kernel is
+ protected from VM-originated BHI attacks, but
+ may still be vulnerable to syscall attacks.
+ off - Disable the mitigation.

spectre_v2= [X86,EARLY] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 5fca46c78daf..575a4bb5a78d 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1625,6 +1625,7 @@ static bool __init spec_ctrl_bhi_dis(void)
enum bhi_mitigations {
BHI_MITIGATION_OFF,
BHI_MITIGATION_ON,
+ BHI_MITIGATION_VMEXIT_ONLY,
};

static enum bhi_mitigations bhi_mitigation __ro_after_init =
@@ -1639,6 +1640,8 @@ static int __init spectre_bhi_parse_cmdline(char *str)
bhi_mitigation = BHI_MITIGATION_OFF;
else if (!strcmp(str, "on"))
bhi_mitigation = BHI_MITIGATION_ON;
+ else if (!strcmp(str, "vmexit"))
+ bhi_mitigation = BHI_MITIGATION_VMEXIT_ONLY;
else
pr_err("Ignoring unknown spectre_bhi option (%s)", str);

@@ -1659,6 +1662,7 @@ static void __init bhi_select_mitigation(void)
return;
}

+ /* Mitigate in hardware if supported */
if (spec_ctrl_bhi_dis())
return;

@@ -1671,13 +1675,15 @@ static void __init bhi_select_mitigation(void)
*/
setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);

- /* Mitigate KVM by default */
- setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT);
- pr_info("Spectre BHI mitigation: SW BHB clearing on vm exit\n");
+ if (bhi_mitigation == BHI_MITIGATION_VMEXIT_ONLY) {
+ pr_info("Spectre BHI mitigation: SW BHB clearing on vm exit only\n");
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT);
+ return;
+ }

- /* Mitigate syscalls when the mitigation is forced =on */
+ pr_info("Spectre BHI mitigation: SW BHB clearing on syscall and vm exit\n");
setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP);
- pr_info("Spectre BHI mitigation: SW BHB clearing on syscall\n");
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT);
}

static void __init spectre_v2_select_mitigation(void)
--
2.44.0

2024-04-19 21:46:44

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v4 5/5] x86/bugs: Add 'spectre_bhi=vmexit' cmdline option

On Fri, Apr 19, 2024 at 02:09:51PM -0700, Josh Poimboeuf wrote:
> In cloud environments it can be useful to *only* enable the vmexit
> mitigation and leave syscalls vulnerable. Add that as an option.
>
> This is similar to the old spectre_v2=auto option which was removed with

As Daniel pointed out this should be /v2/bhi/

--
Josh

2024-04-20 00:09:43

by Sean Christopherson

[permalink] [raw]

Subject: Re: [PATCH v4 2/5] cpu/speculation: Fix CPU mitigation defaults for !x86

On Fri, Apr 19, 2024, Josh Poimboeuf wrote:
> CPU speculative execution mitigations were inadvertently disabled on
> non-x86 arches by the following commit:
>
> f337a6a21e2f ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n")
>
> Fix it by replacing CONFIG_SPECULATION_MITIGATIONS with a new generic
> CONFIG_CPU_MITIGATIONS option and moving the x86-specific mitigations to
> a separate menu which depends on CONFIG_CPU_MITIGATIONS.

Ah drat, I didn't check my mailbox until after Cc'ing Linus my own version[*].

I don't have a strong preference between the two, though I do think it's worth
nothing that this will (obvioulsy) allow disabling mitigations at compile time
on all architectures, which may or may not be desirable.

[*] https://lore.kernel.org/all/[email protected]

> Fixes: f337a6a21e2f ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n")
> Reported-by: Stephen Rothwell <[email protected]>
> Reported-by: Michael Ellerman <[email protected]>
> Reported-by: Geert Uytterhoeven <[email protected]>
> Closes: https://lkml.kernel.org/r/20240413115324.53303a68%40canb.auug.org.au
> Signed-off-by: Josh Poimboeuf <[email protected]>
> ---
> arch/Kconfig | 10 ++++++++++
> arch/x86/Kconfig | 15 +++------------
> kernel/cpu.c | 4 ++--
> 3 files changed, 15 insertions(+), 14 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 9f066785bb71..5c96849eb957 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -11,6 +11,16 @@ source "arch/$(SRCARCH)/Kconfig"
>
> menu "General architecture-dependent options"
>
> +config CPU_MITIGATIONS
> + bool "Mitigations for CPU speculative execution vulnerabilities"
> + default y
> + help
> + Say Y here to enable mitigations for CPU speculative execution
> + vulnerabilities.
> +
> + If you say N, all mitigations will be disabled. You really
> + should know what you are doing to say so.

2024-04-20 13:59:09

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Fri, Apr 19, 2024 at 02:09:49PM -0700, Josh Poimboeuf wrote:
> The direct-call syscall dispatch functions don't know that the exit()
> and exit_group() syscall handlers don't return. As a result the call
> sites aren't optimized accordingly.
>
> Fix that by marking those exit syscall declarations as __noreturn.
>
> Fixes the following warnings:
>
> vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
> vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation
>
> Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
> Reported-by: "Paul E. McKenney" <[email protected]>
> Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
> Signed-off-by: Josh Poimboeuf <[email protected]>

Looks good, but it does not apply on top of current -next and I don't
trust myself to hand-apply it (something about having just got off of
a flight across the big pond).

Could you please let me know what else do I need to pull in to be able
to cleanly apply this one?

Thanx, Paul

> ---
> arch/x86/entry/syscall_64.c | 4 ++++
> arch/x86/entry/syscall_x32.c | 4 ++++
> arch/x86/entry/syscalls/syscall_64.tbl | 6 +++---
> arch/x86/um/sys_call_table_32.c | 1 +
> arch/x86/um/sys_call_table_64.c | 1 +
> scripts/syscalltbl.sh | 6 ++++--
> tools/objtool/noreturns.h | 4 ++++
> 7 files changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
> index 96ea1f8a1d3f..ff36a993a07e 100644
> --- a/arch/x86/entry/syscall_64.c
> +++ b/arch/x86/entry/syscall_64.c
> @@ -8,9 +8,13 @@
> #include <asm/syscall.h>
>
> #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
> +#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
> #include <asm/syscalls_64.h>
> #undef __SYSCALL
>
> +#undef __SYSCALL_NORETURN
> +#define __SYSCALL_NORETURN __SYSCALL
> +
> #define __SYSCALL(nr, sym) __x64_##sym,
> const sys_call_ptr_t sys_call_table[] = {
> #include <asm/syscalls_64.h>
> diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c
> index 5aef4230faca..4221ecce6e68 100644
> --- a/arch/x86/entry/syscall_x32.c
> +++ b/arch/x86/entry/syscall_x32.c
> @@ -8,9 +8,13 @@
> #include <asm/syscall.h>
>
> #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
> +#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
> #include <asm/syscalls_x32.h>
> #undef __SYSCALL
>
> +#undef __SYSCALL_NORETURN
> +#define __SYSCALL_NORETURN __SYSCALL
> +
> #define __SYSCALL(nr, sym) __x64_##sym,
> const sys_call_ptr_t x32_sys_call_table[] = {
> #include <asm/syscalls_x32.h>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 7e8d46f4147f..6695105d21b5 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -2,7 +2,7 @@
> # 64-bit system call numbers and entry vectors
> #
> # The format is:
> -# <number> <abi> <name> <entry point>
> +# <number> <abi> <name> <entry point> [0 noreturn]
> #
> # The __x64_sys_*() stubs are created on-the-fly for sys_*() system calls
> #
> @@ -68,7 +68,7 @@
> 57 common fork sys_fork
> 58 common vfork sys_vfork
> 59 64 execve sys_execve
> -60 common exit sys_exit
> +60 common exit sys_exit 0 noreturn
> 61 common wait4 sys_wait4
> 62 common kill sys_kill
> 63 common uname sys_newuname
> @@ -239,7 +239,7 @@
> 228 common clock_gettime sys_clock_gettime
> 229 common clock_getres sys_clock_getres
> 230 common clock_nanosleep sys_clock_nanosleep
> -231 common exit_group sys_exit_group
> +231 common exit_group sys_exit_group 0 noreturn
> 232 common epoll_wait sys_epoll_wait
> 233 common epoll_ctl sys_epoll_ctl
> 234 common tgkill sys_tgkill
> diff --git a/arch/x86/um/sys_call_table_32.c b/arch/x86/um/sys_call_table_32.c
> index 89df5d89d664..c7d4bf955d2b 100644
> --- a/arch/x86/um/sys_call_table_32.c
> +++ b/arch/x86/um/sys_call_table_32.c
> @@ -24,6 +24,7 @@
> #define __SYSCALL_WITH_COMPAT(nr, native, compat) __SYSCALL(nr, native)
>
> #define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
> +#define __SYSCALL_NORETURN __SYSCALL
> #include <asm/syscalls_32.h>
>
> #undef __SYSCALL
> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
> index b0b4cfd2308c..4760c40ae5cd 100644
> --- a/arch/x86/um/sys_call_table_64.c
> +++ b/arch/x86/um/sys_call_table_64.c
> @@ -19,6 +19,7 @@
> #define sys_ioperm sys_ni_syscall
>
> #define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
> +#define __SYSCALL_NORETURN __SYSCALL
> #include <asm/syscalls_64.h>
>
> #undef __SYSCALL
> diff --git a/scripts/syscalltbl.sh b/scripts/syscalltbl.sh
> index 6abe143889ef..16487d47e06a 100755
> --- a/scripts/syscalltbl.sh
> +++ b/scripts/syscalltbl.sh
> @@ -54,7 +54,7 @@ nxt=0
>
> grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {
>
> - while read nr abi name native compat ; do
> + while read nr abi name native compat noreturn; do
>
> if [ $nxt -gt $nr ]; then
> echo "error: $infile: syscall table is not sorted or duplicates the same syscall number" >&2
> @@ -66,7 +66,9 @@ grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {
> nxt=$((nxt + 1))
> done
>
> - if [ -n "$compat" ]; then
> + if [ -n "$noreturn" ]; then
> + echo "__SYSCALL_NORETURN($nr, $native)"
> + elif [ -n "$compat" ]; then
> echo "__SYSCALL_WITH_COMPAT($nr, $native, $compat)"
> elif [ -n "$native" ]; then
> echo "__SYSCALL($nr, $native)"
> diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h
> index 7ebf29c91184..1e8141ef1b15 100644
> --- a/tools/objtool/noreturns.h
> +++ b/tools/objtool/noreturns.h
> @@ -7,12 +7,16 @@
> * Yes, this is unfortunate. A better solution is in the works.
> */
> NORETURN(__fortify_panic)
> +NORETURN(__ia32_sys_exit)
> +NORETURN(__ia32_sys_exit_group)
> NORETURN(__kunit_abort)
> NORETURN(__module_put_and_kthread_exit)
> NORETURN(__reiserfs_panic)
> NORETURN(__stack_chk_fail)
> NORETURN(__tdx_hypercall_failed)
> NORETURN(__ubsan_handle_builtin_unreachable)
> +NORETURN(__x64_sys_exit)
> +NORETURN(__x64_sys_exit_group)
> NORETURN(arch_cpu_idle_dead)
> NORETURN(bch2_trans_in_restart_error)
> NORETURN(bch2_trans_restart_error)
> --
> 2.44.0
>

2024-04-21 05:25:59

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Sat, Apr 20, 2024 at 06:58:58AM -0700, Paul E. McKenney wrote:
> On Fri, Apr 19, 2024 at 02:09:49PM -0700, Josh Poimboeuf wrote:
> > The direct-call syscall dispatch functions don't know that the exit()
> > and exit_group() syscall handlers don't return. As a result the call
> > sites aren't optimized accordingly.
> >
> > Fix that by marking those exit syscall declarations as __noreturn.
> >
> > Fixes the following warnings:
> >
> > vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
> > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation
> >
> > Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
> > Reported-by: "Paul E. McKenney" <[email protected]>
> > Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
> > Signed-off-by: Josh Poimboeuf <[email protected]>
>
> Looks good, but it does not apply on top of current -next and I don't
> trust myself to hand-apply it (something about having just got off of
> a flight across the big pond).
>
> Could you please let me know what else do I need to pull in to be able
> to cleanly apply this one?

This patch has a dependency on an earlier patch in the set:

https://lkml.kernel.org/lkml/982d05a2f669140f26500bee643011896d661094.1713559768.git.jpoimboe@kernel.org

Though I think it's not a hard dependency and I could reverse the order
of the patches if needed.

--
Josh

2024-04-21 20:41:04

by Paul McKenney

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

They apply fine as is, so I have started tests with that pair of patches.

Thanx, Paul

On Sat, Apr 20, 2024 at 10:25 PM Josh Poimboeuf <[email protected]> wrote:
>
> On Sat, Apr 20, 2024 at 06:58:58AM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 19, 2024 at 02:09:49PM -0700, Josh Poimboeuf wrote:
> > > The direct-call syscall dispatch functions don't know that the exit()
> > > and exit_group() syscall handlers don't return. As a result the call
> > > sites aren't optimized accordingly.
> > >
> > > Fix that by marking those exit syscall declarations as __noreturn.
> > >
> > > Fixes the following warnings:
> > >
> > > vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
> > > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation
> > >
> > > Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
> > > Reported-by: "Paul E. McKenney" <[email protected]>
> > > Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
> > > Signed-off-by: Josh Poimboeuf <[email protected]>
> >
> > Looks good, but it does not apply on top of current -next and I don't
> > trust myself to hand-apply it (something about having just got off of
> > a flight across the big pond).
> >
> > Could you please let me know what else do I need to pull in to be able
> > to cleanly apply this one?
>
> This patch has a dependency on an earlier patch in the set:
>
> https://lkml.kernel.org/lkml/982d05a2f669140f26500bee643011896d661094.1713559768.git.jpoimboe@kernel.org
>
> Though I think it's not a hard dependency and I could reverse the order
> of the patches if needed.
>
> --
> Josh

2024-04-21 21:47:45

by Paul McKenney

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

And this definitely helped, thank you!

However, this one still remains:

vmlinux.o: warning: objtool: ia32_sys_call+0x29b6:
__ia32_sys_exit_group() is missing a __noreturn annotation

Please see below for my diffs against next-20240419, in case I messed
something up.

I attached a copy as well, given that I am away from mutt, hence using
gmail directly.

Thanx, Paul

-----------------------------------------------------

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6de50b80702e6..9810ba2857a5c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -49,7 +49,10 @@ static __always_inline bool do_syscall_x64(struct
pt_regs *regs, int nr)

if (likely(unr < NR_syscalls)) {
unr = array_index_nospec(unr, NR_syscalls);
- regs->ax = x64_sys_call(regs, unr);
+ if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
+ regs->ax = sys_call_table[unr](regs);
+ else
+ regs->ax = x64_sys_call(regs, unr);
return true;
}
return false;
@@ -66,7 +69,10 @@ static __always_inline bool do_syscall_x32(struct
pt_regs *regs, int nr)

if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(xnr < X32_NR_syscalls)) {
xnr = array_index_nospec(xnr, X32_NR_syscalls);
- regs->ax = x32_sys_call(regs, xnr);
+ if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
+ regs->ax = x32_sys_call_table[xnr](regs);
+ else
+ regs->ax = x32_sys_call(regs, xnr);
return true;
}
return false;
@@ -162,7 +168,10 @@ static __always_inline void
do_syscall_32_irqs_on(struct pt_regs *regs, int nr)

if (likely(unr < IA32_NR_syscalls)) {
unr = array_index_nospec(unr, IA32_NR_syscalls);
- regs->ax = ia32_sys_call(regs, unr);
+ if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
+ regs->ax = ia32_sys_call_table[unr](regs);
+ else
+ regs->ax = ia32_sys_call(regs, unr);
} else if (nr != -1) {
regs->ax = __ia32_sys_ni_syscall(regs);
}
diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
index c2235bae17ef6..aab31760b4e3e 100644
--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -14,25 +14,16 @@
#endif

#define __SYSCALL(nr, sym) extern long __ia32_##sym(const struct pt_regs *);
-
#include <asm/syscalls_32.h>
#undef __SYSCALL

-/*
- * The sys_call_table[] is no longer used for system calls, but
- * kernel/trace/trace_syscalls.c still wants to know the system
- * call address.
- */
-#ifdef CONFIG_X86_32
#define __SYSCALL(nr, sym) __ia32_##sym,
-const sys_call_ptr_t sys_call_table[] = {
+const sys_call_ptr_t ia32_sys_call_table[] = {
#include <asm/syscalls_32.h>
};
#undef __SYSCALL
-#endif

#define __SYSCALL(nr, sym) case nr: return __ia32_##sym(regs);
-
long ia32_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 33b3f09e6f151..ff36a993a07e0 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -8,14 +8,13 @@
#include <asm/syscall.h>

#define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
+#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn
__x64_##sym(const struct pt_regs *);
#include <asm/syscalls_64.h>
#undef __SYSCALL

-/*
- * The sys_call_table[] is no longer used for system calls, but
- * kernel/trace/trace_syscalls.c still wants to know the system
- * call address.
- */
+#undef __SYSCALL_NORETURN
+#define __SYSCALL_NORETURN __SYSCALL
+
#define __SYSCALL(nr, sym) __x64_##sym,
const sys_call_ptr_t sys_call_table[] = {
#include <asm/syscalls_64.h>
@@ -23,7 +22,6 @@ const sys_call_ptr_t sys_call_table[] = {
#undef __SYSCALL

#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-
long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c
index 03de4a9321318..4221ecce6e689 100644
--- a/arch/x86/entry/syscall_x32.c
+++ b/arch/x86/entry/syscall_x32.c
@@ -8,11 +8,20 @@
#include <asm/syscall.h>

#define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
+#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn
__x64_##sym(const struct pt_regs *);
#include <asm/syscalls_x32.h>
#undef __SYSCALL

-#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
+#undef __SYSCALL_NORETURN
+#define __SYSCALL_NORETURN __SYSCALL
+
+#define __SYSCALL(nr, sym) __x64_##sym,
+const sys_call_ptr_t x32_sys_call_table[] = {
+#include <asm/syscalls_x32.h>
+};
+#undef __SYSCALL

+#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
b/arch/x86/entry/syscalls/syscall_64.tbl
index a396f6e6ab5bf..7ec68d94eb593 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -2,7 +2,7 @@
# 64-bit system call numbers and entry vectors
#
# The format is:
-# <number> <abi> <name> <entry point>
+# <number> <abi> <name> <entry point> [0 noreturn]
#
# The __x64_sys_*() stubs are created on-the-fly for sys_*() system calls
#
@@ -68,7 +68,7 @@
57 common fork sys_fork
58 common vfork sys_vfork
59 64 execve sys_execve
-60 common exit sys_exit
+60 common exit sys_exit 0 noreturn
61 common wait4 sys_wait4
62 common kill sys_kill
63 common uname sys_newuname
@@ -239,7 +239,7 @@
228 common clock_gettime sys_clock_gettime
229 common clock_getres sys_clock_getres
230 common clock_nanosleep sys_clock_nanosleep
-231 common exit_group sys_exit_group
+231 common exit_group sys_exit_group 0 noreturn
232 common epoll_wait sys_epoll_wait
233 common epoll_ctl sys_epoll_ctl
234 common tgkill sys_tgkill
diff --git a/arch/x86/include/asm/cpufeatures.h
b/arch/x86/include/asm/cpufeatures.h
index 3c7434329661c..d64b0a5291f10 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -470,6 +470,7 @@
#define X86_FEATURE_BHI_CTRL (21*32+ 2) /* "" BHI_DIS_S HW control available */
#define X86_FEATURE_CLEAR_BHB_HW (21*32+ 3) /* "" BHI_DIS_S HW
control enabled */
#define X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT (21*32+ 4) /* "" Clear
branch history at vmexit using SW loop */
+#define X86_FEATURE_INDIRECT_BRANCH_OK (21*32+ 5) /* "" It's OK to
use indirect branches */

/*
* BUG word(s)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 2fc7bc3863ff6..dfb59521244c2 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -16,14 +16,20 @@
#include <asm/thread_info.h> /* for TS_COMPAT */
#include <asm/unistd.h>

-/* This is used purely for kernel/trace/trace_syscalls.c */
typedef long (*sys_call_ptr_t)(const struct pt_regs *);
extern const sys_call_ptr_t sys_call_table[];

+#if defined(CONFIG_X86_32)
+#define ia32_sys_call_table sys_call_table
+#else
/*
* These may not exist, but still put the prototypes in so we
* can use IS_ENABLED().
*/
+extern const sys_call_ptr_t ia32_sys_call_table[];
+extern const sys_call_ptr_t x32_sys_call_table[];
+#endif
+
extern long ia32_sys_call(const struct pt_regs *, unsigned int nr);
extern long x32_sys_call(const struct pt_regs *, unsigned int nr);
extern long x64_sys_call(const struct pt_regs *, unsigned int nr);
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ab18185894dfd..5fca46c78daf2 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1665,6 +1665,12 @@ static void __init bhi_select_mitigation(void)
if (!IS_ENABLED(CONFIG_X86_64))
return;

+ /*
+ * There's no HW mitigation in place. Mark indirect branches as
+ * "not OK".
+ */
+ setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
+
/* Mitigate KVM by default */
setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT);
pr_info("Spectre BHI mitigation: SW BHB clearing on vm exit\n");
@@ -1679,6 +1685,28 @@ static void __init spectre_v2_select_mitigation(void)
enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
enum spectre_v2_mitigation mode = SPECTRE_V2_NONE;

+ /*
+ * X86_FEATURE_INDIRECT_BRANCH_OK indicates that indirect calls are
+ * "OK" to use due to (at least) one of the following being true:
+ *
+ * - the CPU isn't vulnerable to Spectre v2, BHI, etc;
+ *
+ * - a HW mitigation is in place (e.g., IBRS, eIBRS+BHI_DIS_S); or
+ *
+ * - the user disabled mitigations.
+ *
+ * Clearing the bit enables certain indirect branch "easy targets" [*]
+ * to be converted to a series of direct branches.
+ *
+ * Assume innocence until proven guilty: set it now and clear it later
+ * if/when needed.
+ *
+ * [*] The closer the indirect branch is to kernel entry, and the more
+ * user-controlled registers there are, the easier target it may be
+ * for future Spectre v2 variants.
+ */
+ setup_force_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
+
/*
* If the CPU is not affected and the command line mode is NONE or AUTO
* then nothing to do.
@@ -1765,11 +1793,16 @@ static void __init spectre_v2_select_mitigation(void)
break;

case SPECTRE_V2_LFENCE:
+ setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
+ fallthrough;
case SPECTRE_V2_EIBRS_LFENCE:
setup_force_cpu_cap(X86_FEATURE_RETPOLINE_LFENCE);
- fallthrough;
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+ break;

case SPECTRE_V2_RETPOLINE:
+ setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
+ fallthrough;
case SPECTRE_V2_EIBRS_RETPOLINE:
setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
break;
diff --git a/arch/x86/um/sys_call_table_32.c b/arch/x86/um/sys_call_table_32.c
index 89df5d89d6640..c7d4bf955d2ba 100644
--- a/arch/x86/um/sys_call_table_32.c
+++ b/arch/x86/um/sys_call_table_32.c
@@ -24,6 +24,7 @@
#define __SYSCALL_WITH_COMPAT(nr, native, compat) __SYSCALL(nr, native)

#define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long,
unsigned long, unsigned long, unsigned long, unsigned long, unsigned
long);
+#define __SYSCALL_NORETURN __SYSCALL
#include <asm/syscalls_32.h>

#undef __SYSCALL
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index b0b4cfd2308c8..4760c40ae5cd0 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -19,6 +19,7 @@
#define sys_ioperm sys_ni_syscall

#define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long,
unsigned long, unsigned long, unsigned long, unsigned long, unsigned
long);
+#define __SYSCALL_NORETURN __SYSCALL
#include <asm/syscalls_64.h>

#undef __SYSCALL
diff --git a/scripts/syscalltbl.sh b/scripts/syscalltbl.sh
index 6abe143889ef6..16487d47e06a3 100755
--- a/scripts/syscalltbl.sh
+++ b/scripts/syscalltbl.sh
@@ -54,7 +54,7 @@ nxt=0

grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {

- while read nr abi name native compat ; do
+ while read nr abi name native compat noreturn; do

if [ $nxt -gt $nr ]; then
echo "error: $infile: syscall table is not sorted or duplicates the
same syscall number" >&2
@@ -66,7 +66,9 @@ grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {
nxt=$((nxt + 1))
done

- if [ -n "$compat" ]; then
+ if [ -n "$noreturn" ]; then
+ echo "__SYSCALL_NORETURN($nr, $native)"
+ elif [ -n "$compat" ]; then
echo "__SYSCALL_WITH_COMPAT($nr, $native, $compat)"
elif [ -n "$native" ]; then
echo "__SYSCALL($nr, $native)"
diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h
index 7ebf29c911849..1e8141ef1b15d 100644
--- a/tools/objtool/noreturns.h
+++ b/tools/objtool/noreturns.h
@@ -7,12 +7,16 @@
* Yes, this is unfortunate. A better solution is in the works.
*/
NORETURN(__fortify_panic)
+NORETURN(__ia32_sys_exit)
+NORETURN(__ia32_sys_exit_group)
NORETURN(__kunit_abort)
NORETURN(__module_put_and_kthread_exit)
NORETURN(__reiserfs_panic)
NORETURN(__stack_chk_fail)
NORETURN(__tdx_hypercall_failed)
NORETURN(__ubsan_handle_builtin_unreachable)
+NORETURN(__x64_sys_exit)
+NORETURN(__x64_sys_exit_group)
NORETURN(arch_cpu_idle_dead)
NORETURN(bch2_trans_in_restart_error)
NORETURN(bch2_trans_restart_error)

On Sun, Apr 21, 2024 at 1:40 PM Paul McKenney <[email protected]> wrote:
>
> They apply fine as is, so I have started tests with that pair of patches.
>
> Thanx, Paul
>
> On Sat, Apr 20, 2024 at 10:25 PM Josh Poimboeuf <[email protected]> wrote:
> >
> > On Sat, Apr 20, 2024 at 06:58:58AM -0700, Paul E. McKenney wrote:
> > > On Fri, Apr 19, 2024 at 02:09:49PM -0700, Josh Poimboeuf wrote:
> > > > The direct-call syscall dispatch functions don't know that the exit()
> > > > and exit_group() syscall handlers don't return. As a result the call
> > > > sites aren't optimized accordingly.
> > > >
> > > > Fix that by marking those exit syscall declarations as __noreturn.
> > > >
> > > > Fixes the following warnings:
> > > >
> > > > vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
> > > > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation
> > > >
> > > > Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
> > > > Reported-by: "Paul E. McKenney" <[email protected]>
> > > > Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
> > > > Signed-off-by: Josh Poimboeuf <[email protected]>
> > >
> > > Looks good, but it does not apply on top of current -next and I don't
> > > trust myself to hand-apply it (something about having just got off of
> > > a flight across the big pond).
> > >
> > > Could you please let me know what else do I need to pull in to be able
> > > to cleanly apply this one?
> >
> > This patch has a dependency on an earlier patch in the set:
> >
> > https://lkml.kernel.org/lkml/982d05a2f669140f26500bee643011896d6610941713559768.git.jpoimboe@kernel.org
> >
> > Though I think it's not a hard dependency and I could reverse the order
> > of the patches if needed.
> >
> > --
> > Josh

Attachments:

diffs (11.45 kB)

2024-04-22 08:19:24

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH v4 1/5] x86/bugs: Only harden syscalls when needed

On Fri, Apr 19, 2024 at 02:09:47PM -0700, Josh Poimboeuf wrote:
> Syscall hardening (converting the syscall indirect branch to a series of
> direct branches) has shown some performance regressions:
>
> - Red Hat internal testing showed up to 12% slowdowns in database
> benchmark testing on Sapphire Rapids when the DB was stressed with 80+
> users to cause contention.
>
> - The kernel test robot's will-it-scale benchmarks showed significant
> regressions on Skylake with IBRS:
> https://lkml.kernel.org/lkml/[email protected]

To clarify, we reported a +1.4% improvement (not regression) of
will-it-scale futex4 benchmark on Skylake. Meanwhile we did observe some
regressions by running other benchmarks on Ice Lake, such as:

stress-ng.null.ops_per_sec -4.0% regression on Intel Xeon Gold 6346 (Ice Lake)
unixbench.fsbuffer.throughput -1.4% regression on Intel Xeon Gold 6346 (Ice Lake)

>
> To fix those slowdowns, only use the syscall direct branches when
> indirect branches are considered to be "not OK": meaning Spectre v2+BHI
> isn't mitigated by HW and the user hasn't disabled mitigations.

2024-04-23 14:10:57

by Sean Christopherson

[permalink] [raw]

Subject: Re: [PATCH v4 2/5] cpu/speculation: Fix CPU mitigation defaults for !x86

On Fri, Apr 19, 2024, Sean Christopherson wrote:
> On Fri, Apr 19, 2024, Josh Poimboeuf wrote:
> > CPU speculative execution mitigations were inadvertently disabled on
> > non-x86 arches by the following commit:
> >
> > f337a6a21e2f ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n")
> >
> > Fix it by replacing CONFIG_SPECULATION_MITIGATIONS with a new generic
> > CONFIG_CPU_MITIGATIONS option and moving the x86-specific mitigations to
> > a separate menu which depends on CONFIG_CPU_MITIGATIONS.
>
> Ah drat, I didn't check my mailbox until after Cc'ing Linus my own version[*].
>
> I don't have a strong preference between the two, though I do think it's worth
> nothing that this will (obvioulsy) allow disabling mitigations at compile time
> on all architectures, which may or may not be desirable.
>
> [*] https://lore.kernel.org/all/[email protected]

Josh, when you get a chance, can you weigh in on my menu-preserving approach?

I want to get this resolved asap so that we're not scrambing on Friday again :-)

2024-04-24 05:36:04

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v4 2/5] cpu/speculation: Fix CPU mitigation defaults for !x86

On Tue, Apr 23, 2024 at 07:10:23AM -0700, Sean Christopherson wrote:
> On Fri, Apr 19, 2024, Sean Christopherson wrote:
> > On Fri, Apr 19, 2024, Josh Poimboeuf wrote:
> > > CPU speculative execution mitigations were inadvertently disabled on
> > > non-x86 arches by the following commit:
> > >
> > > f337a6a21e2f ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n")
> > >
> > > Fix it by replacing CONFIG_SPECULATION_MITIGATIONS with a new generic
> > > CONFIG_CPU_MITIGATIONS option and moving the x86-specific mitigations to
> > > a separate menu which depends on CONFIG_CPU_MITIGATIONS.
> >
> > Ah drat, I didn't check my mailbox until after Cc'ing Linus my own version[*].
> >
> > I don't have a strong preference between the two, though I do think it's worth
> > nothing that this will (obvioulsy) allow disabling mitigations at compile time
> > on all architectures, which may or may not be desirable.
> >
> > [*] https://lore.kernel.org/all/[email protected]
>
> Josh, when you get a chance, can you weigh in on my menu-preserving approach?
>
> I want to get this resolved asap so that we're not scrambing on Friday again :-)

Yeah, yours looks good. Lemme go ack.

--
Josh

2024-05-02 23:48:37

by Paul McKenney

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Sun, Apr 21, 2024 at 2:47 PM Paul McKenney <[email protected]> wrote:
>
> And this definitely helped, thank you!
>
> However, this one still remains:
>
> vmlinux.o: warning: objtool: ia32_sys_call+0x29b6:
> __ia32_sys_exit_group() is missing a __noreturn annotation

And looking at the patched code, this function looks to me to be
correctly marked.

No idea... :-/

Thanx, Paul

> Please see below for my diffs against next-20240419, in case I messed
> something up.
>
> I attached a copy as well, given that I am away from mutt, hence using
> gmail directly.
>
> Thanx, Paul
>
> -----------------------------------------------------
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 6de50b80702e6..9810ba2857a5c 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -49,7 +49,10 @@ static __always_inline bool do_syscall_x64(struct
> pt_regs *regs, int nr)
>
> if (likely(unr < NR_syscalls)) {
> unr = array_index_nospec(unr, NR_syscalls);
> - regs->ax = x64_sys_call(regs, unr);
> + if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
> + regs->ax = sys_call_table[unr](regs);
> + else
> + regs->ax = x64_sys_call(regs, unr);
> return true;
> }
> return false;
> @@ -66,7 +69,10 @@ static __always_inline bool do_syscall_x32(struct
> pt_regs *regs, int nr)
>
> if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(xnr < X32_NR_syscalls)) {
> xnr = array_index_nospec(xnr, X32_NR_syscalls);
> - regs->ax = x32_sys_call(regs, xnr);
> + if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
> + regs->ax = x32_sys_call_table[xnr](regs);
> + else
> + regs->ax = x32_sys_call(regs, xnr);
> return true;
> }
> return false;
> @@ -162,7 +168,10 @@ static __always_inline void
> do_syscall_32_irqs_on(struct pt_regs *regs, int nr)
>
> if (likely(unr < IA32_NR_syscalls)) {
> unr = array_index_nospec(unr, IA32_NR_syscalls);
> - regs->ax = ia32_sys_call(regs, unr);
> + if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
> + regs->ax = ia32_sys_call_table[unr](regs);
> + else
> + regs->ax = ia32_sys_call(regs, unr);
> } else if (nr != -1) {
> regs->ax = __ia32_sys_ni_syscall(regs);
> }
> diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
> index c2235bae17ef6..aab31760b4e3e 100644
> --- a/arch/x86/entry/syscall_32.c
> +++ b/arch/x86/entry/syscall_32.c
> @@ -14,25 +14,16 @@
> #endif
>
> #define __SYSCALL(nr, sym) extern long __ia32_##sym(const struct pt_regs *);
> -
> #include <asm/syscalls_32.h>
> #undef __SYSCALL
>
> -/*
> - * The sys_call_table[] is no longer used for system calls, but
> - * kernel/trace/trace_syscalls.c still wants to know the system
> - * call address.
> - */
> -#ifdef CONFIG_X86_32
> #define __SYSCALL(nr, sym) __ia32_##sym,
> -const sys_call_ptr_t sys_call_table[] = {
> +const sys_call_ptr_t ia32_sys_call_table[] = {
> #include <asm/syscalls_32.h>
> };
> #undef __SYSCALL
> -#endif
>
> #define __SYSCALL(nr, sym) case nr: return __ia32_##sym(regs);
> -
> long ia32_sys_call(const struct pt_regs *regs, unsigned int nr)
> {
> switch (nr) {
> diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
> index 33b3f09e6f151..ff36a993a07e0 100644
> --- a/arch/x86/entry/syscall_64.c
> +++ b/arch/x86/entry/syscall_64.c
> @@ -8,14 +8,13 @@
> #include <asm/syscall.h>
>
> #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
> +#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn
> __x64_##sym(const struct pt_regs *);
> #include <asm/syscalls_64.h>
> #undef __SYSCALL
>
> -/*
> - * The sys_call_table[] is no longer used for system calls, but
> - * kernel/trace/trace_syscalls.c still wants to know the system
> - * call address.
> - */
> +#undef __SYSCALL_NORETURN
> +#define __SYSCALL_NORETURN __SYSCALL
> +
> #define __SYSCALL(nr, sym) __x64_##sym,
> const sys_call_ptr_t sys_call_table[] = {
> #include <asm/syscalls_64.h>
> @@ -23,7 +22,6 @@ const sys_call_ptr_t sys_call_table[] = {
> #undef __SYSCALL
>
> #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> -
> long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
> {
> switch (nr) {
> diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c
> index 03de4a9321318..4221ecce6e689 100644
> --- a/arch/x86/entry/syscall_x32.c
> +++ b/arch/x86/entry/syscall_x32.c
> @@ -8,11 +8,20 @@
> #include <asm/syscall.h>
>
> #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
> +#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn
> __x64_##sym(const struct pt_regs *);
> #include <asm/syscalls_x32.h>
> #undef __SYSCALL
>
> -#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> +#undef __SYSCALL_NORETURN
> +#define __SYSCALL_NORETURN __SYSCALL
> +
> +#define __SYSCALL(nr, sym) __x64_##sym,
> +const sys_call_ptr_t x32_sys_call_table[] = {
> +#include <asm/syscalls_x32.h>
> +};
> +#undef __SYSCALL
>
> +#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
> {
> switch (nr) {
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index a396f6e6ab5bf..7ec68d94eb593 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -2,7 +2,7 @@
> # 64-bit system call numbers and entry vectors
> #
> # The format is:
> -# <number> <abi> <name> <entry point>
> +# <number> <abi> <name> <entry point> [0 noreturn]
> #
> # The __x64_sys_*() stubs are created on-the-fly for sys_*() system calls
> #
> @@ -68,7 +68,7 @@
> 57 common fork sys_fork
> 58 common vfork sys_vfork
> 59 64 execve sys_execve
> -60 common exit sys_exit
> +60 common exit sys_exit 0 noreturn
> 61 common wait4 sys_wait4
> 62 common kill sys_kill
> 63 common uname sys_newuname
> @@ -239,7 +239,7 @@
> 228 common clock_gettime sys_clock_gettime
> 229 common clock_getres sys_clock_getres
> 230 common clock_nanosleep sys_clock_nanosleep
> -231 common exit_group sys_exit_group
> +231 common exit_group sys_exit_group 0 noreturn
> 232 common epoll_wait sys_epoll_wait
> 233 common epoll_ctl sys_epoll_ctl
> 234 common tgkill sys_tgkill
> diff --git a/arch/x86/include/asm/cpufeatures.h
> b/arch/x86/include/asm/cpufeatures.h
> index 3c7434329661c..d64b0a5291f10 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -470,6 +470,7 @@
> #define X86_FEATURE_BHI_CTRL (21*32+ 2) /* "" BHI_DIS_S HW control available */
> #define X86_FEATURE_CLEAR_BHB_HW (21*32+ 3) /* "" BHI_DIS_S HW
> control enabled */
> #define X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT (21*32+ 4) /* "" Clear
> branch history at vmexit using SW loop */
> +#define X86_FEATURE_INDIRECT_BRANCH_OK (21*32+ 5) /* "" It's OK to
> use indirect branches */
>
> /*
> * BUG word(s)
> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> index 2fc7bc3863ff6..dfb59521244c2 100644
> --- a/arch/x86/include/asm/syscall.h
> +++ b/arch/x86/include/asm/syscall.h
> @@ -16,14 +16,20 @@
> #include <asm/thread_info.h> /* for TS_COMPAT */
> #include <asm/unistd.h>
>
> -/* This is used purely for kernel/trace/trace_syscalls.c */
> typedef long (*sys_call_ptr_t)(const struct pt_regs *);
> extern const sys_call_ptr_t sys_call_table[];
>
> +#if defined(CONFIG_X86_32)
> +#define ia32_sys_call_table sys_call_table
> +#else
> /*
> * These may not exist, but still put the prototypes in so we
> * can use IS_ENABLED().
> */
> +extern const sys_call_ptr_t ia32_sys_call_table[];
> +extern const sys_call_ptr_t x32_sys_call_table[];
> +#endif
> +
> extern long ia32_sys_call(const struct pt_regs *, unsigned int nr);
> extern long x32_sys_call(const struct pt_regs *, unsigned int nr);
> extern long x64_sys_call(const struct pt_regs *, unsigned int nr);
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index ab18185894dfd..5fca46c78daf2 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -1665,6 +1665,12 @@ static void __init bhi_select_mitigation(void)
> if (!IS_ENABLED(CONFIG_X86_64))
> return;
>
> + /*
> + * There's no HW mitigation in place. Mark indirect branches as
> + * "not OK".
> + */
> + setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
> +
> /* Mitigate KVM by default */
> setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT);
> pr_info("Spectre BHI mitigation: SW BHB clearing on vm exit\n");
> @@ -1679,6 +1685,28 @@ static void __init spectre_v2_select_mitigation(void)
> enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
> enum spectre_v2_mitigation mode = SPECTRE_V2_NONE;
>
> + /*
> + * X86_FEATURE_INDIRECT_BRANCH_OK indicates that indirect calls are
> + * "OK" to use due to (at least) one of the following being true:
> + *
> + * - the CPU isn't vulnerable to Spectre v2, BHI, etc;
> + *
> + * - a HW mitigation is in place (e.g., IBRS, eIBRS+BHI_DIS_S); or
> + *
> + * - the user disabled mitigations.
> + *
> + * Clearing the bit enables certain indirect branch "easy targets" [*]
> + * to be converted to a series of direct branches.
> + *
> + * Assume innocence until proven guilty: set it now and clear it later
> + * if/when needed.
> + *
> + * [*] The closer the indirect branch is to kernel entry, and the more
> + * user-controlled registers there are, the easier target it may be
> + * for future Spectre v2 variants.
> + */
> + setup_force_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
> +
> /*
> * If the CPU is not affected and the command line mode is NONE or AUTO
> * then nothing to do.
> @@ -1765,11 +1793,16 @@ static void __init spectre_v2_select_mitigation(void)
> break;
>
> case SPECTRE_V2_LFENCE:
> + setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
> + fallthrough;
> case SPECTRE_V2_EIBRS_LFENCE:
> setup_force_cpu_cap(X86_FEATURE_RETPOLINE_LFENCE);
> - fallthrough;
> + setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> + break;
>
> case SPECTRE_V2_RETPOLINE:
> + setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
> + fallthrough;
> case SPECTRE_V2_EIBRS_RETPOLINE:
> setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> break;
> diff --git a/arch/x86/um/sys_call_table_32.c b/arch/x86/um/sys_call_table_32.c
> index 89df5d89d6640..c7d4bf955d2ba 100644
> --- a/arch/x86/um/sys_call_table_32.c
> +++ b/arch/x86/um/sys_call_table_32.c
> @@ -24,6 +24,7 @@
> #define __SYSCALL_WITH_COMPAT(nr, native, compat) __SYSCALL(nr, native)
>
> #define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long,
> unsigned long, unsigned long, unsigned long, unsigned long, unsigned
> long);
> +#define __SYSCALL_NORETURN __SYSCALL
> #include <asm/syscalls_32.h>
>
> #undef __SYSCALL
> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
> index b0b4cfd2308c8..4760c40ae5cd0 100644
> --- a/arch/x86/um/sys_call_table_64.c
> +++ b/arch/x86/um/sys_call_table_64.c
> @@ -19,6 +19,7 @@
> #define sys_ioperm sys_ni_syscall
>
> #define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long,
> unsigned long, unsigned long, unsigned long, unsigned long, unsigned
> long);
> +#define __SYSCALL_NORETURN __SYSCALL
> #include <asm/syscalls_64.h>
>
> #undef __SYSCALL
> diff --git a/scripts/syscalltbl.sh b/scripts/syscalltbl.sh
> index 6abe143889ef6..16487d47e06a3 100755
> --- a/scripts/syscalltbl.sh
> +++ b/scripts/syscalltbl.sh
> @@ -54,7 +54,7 @@ nxt=0
>
> grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {
>
> - while read nr abi name native compat ; do
> + while read nr abi name native compat noreturn; do
>
> if [ $nxt -gt $nr ]; then
> echo "error: $infile: syscall table is not sorted or duplicates the
> same syscall number" >&2
> @@ -66,7 +66,9 @@ grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {
> nxt=$((nxt + 1))
> done
>
> - if [ -n "$compat" ]; then
> + if [ -n "$noreturn" ]; then
> + echo "__SYSCALL_NORETURN($nr, $native)"
> + elif [ -n "$compat" ]; then
> echo "__SYSCALL_WITH_COMPAT($nr, $native, $compat)"
> elif [ -n "$native" ]; then
> echo "__SYSCALL($nr, $native)"
> diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h
> index 7ebf29c911849..1e8141ef1b15d 100644
> --- a/tools/objtool/noreturns.h
> +++ b/tools/objtool/noreturns.h
> @@ -7,12 +7,16 @@
> * Yes, this is unfortunate. A better solution is in the works.
> */
> NORETURN(__fortify_panic)
> +NORETURN(__ia32_sys_exit)
> +NORETURN(__ia32_sys_exit_group)
> NORETURN(__kunit_abort)
> NORETURN(__module_put_and_kthread_exit)
> NORETURN(__reiserfs_panic)
> NORETURN(__stack_chk_fail)
> NORETURN(__tdx_hypercall_failed)
> NORETURN(__ubsan_handle_builtin_unreachable)
> +NORETURN(__x64_sys_exit)
> +NORETURN(__x64_sys_exit_group)
> NORETURN(arch_cpu_idle_dead)
> NORETURN(bch2_trans_in_restart_error)
> NORETURN(bch2_trans_restart_error)
>
> On Sun, Apr 21, 2024 at 1:40 PM Paul McKenney <[email protected]> wrote:
> >
> > They apply fine as is, so I have started tests with that pair of patches.
> >
> > Thanx, Paul
> >
> > On Sat, Apr 20, 2024 at 10:25 PM Josh Poimboeuf <[email protected]> wrote:
> > >
> > > On Sat, Apr 20, 2024 at 06:58:58AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Apr 19, 2024 at 02:09:49PM -0700, Josh Poimboeuf wrote:
> > > > > The direct-call syscall dispatch functions don't know that the exit()
> > > > > and exit_group() syscall handlers don't return. As a result the call
> > > > > sites aren't optimized accordingly.
> > > > >
> > > > > Fix that by marking those exit syscall declarations as __noreturn.
> > > > >
> > > > > Fixes the following warnings:
> > > > >
> > > > > vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
> > > > > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation
> > > > >
> > > > > Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
> > > > > Reported-by: "Paul E. McKenney" <[email protected]>
> > > > > Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
> > > > > Signed-off-by: Josh Poimboeuf <[email protected]>
> > > >
> > > > Looks good, but it does not apply on top of current -next and I don't
> > > > trust myself to hand-apply it (something about having just got off of
> > > > a flight across the big pond).
> > > >
> > > > Could you please let me know what else do I need to pull in to be able
> > > > to cleanly apply this one?
> > >
> > > This patch has a dependency on an earlier patch in the set:
> > >
> > > https://lkml.kernel.org/lkml/982d05a2f669140f26500bee643011896d661094.1713559768.git.jpoimboe@kernel.org
> > >
> > > Though I think it's not a hard dependency and I could reverse the order
> > > of the patches if needed.
> > >
> > > --
> > > Josh

2024-05-03 15:39:17

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Thu, May 02, 2024 at 04:48:13PM -0700, Paul McKenney wrote:
> On Sun, Apr 21, 2024 at 2:47 PM Paul McKenney <[email protected]> wrote:
> >
> > And this definitely helped, thank you!
> >
> > However, this one still remains:
> >
> > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6:
> > __ia32_sys_exit_group() is missing a __noreturn annotation
>
> And looking at the patched code, this function looks to me to be
> correctly marked.
>
> No idea... :-/

But thank you for getting rid of the other warning:

Tested-by: Paul E. McKenney <[email protected]>

> > Please see below for my diffs against next-20240419, in case I messed
> > something up.
> >
> > I attached a copy as well, given that I am away from mutt, hence using
> > gmail directly.
> >
> > Thanx, Paul
> >
> > -----------------------------------------------------
> >
> > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> > index 6de50b80702e6..9810ba2857a5c 100644
> > --- a/arch/x86/entry/common.c
> > +++ b/arch/x86/entry/common.c
> > @@ -49,7 +49,10 @@ static __always_inline bool do_syscall_x64(struct
> > pt_regs *regs, int nr)
> >
> > if (likely(unr < NR_syscalls)) {
> > unr = array_index_nospec(unr, NR_syscalls);
> > - regs->ax = x64_sys_call(regs, unr);
> > + if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
> > + regs->ax = sys_call_table[unr](regs);
> > + else
> > + regs->ax = x64_sys_call(regs, unr);
> > return true;
> > }
> > return false;
> > @@ -66,7 +69,10 @@ static __always_inline bool do_syscall_x32(struct
> > pt_regs *regs, int nr)
> >
> > if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(xnr < X32_NR_syscalls)) {
> > xnr = array_index_nospec(xnr, X32_NR_syscalls);
> > - regs->ax = x32_sys_call(regs, xnr);
> > + if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
> > + regs->ax = x32_sys_call_table[xnr](regs);
> > + else
> > + regs->ax = x32_sys_call(regs, xnr);
> > return true;
> > }
> > return false;
> > @@ -162,7 +168,10 @@ static __always_inline void
> > do_syscall_32_irqs_on(struct pt_regs *regs, int nr)
> >
> > if (likely(unr < IA32_NR_syscalls)) {
> > unr = array_index_nospec(unr, IA32_NR_syscalls);
> > - regs->ax = ia32_sys_call(regs, unr);
> > + if (likely(cpu_feature_enabled(X86_FEATURE_INDIRECT_BRANCH_OK)))
> > + regs->ax = ia32_sys_call_table[unr](regs);
> > + else
> > + regs->ax = ia32_sys_call(regs, unr);
> > } else if (nr != -1) {
> > regs->ax = __ia32_sys_ni_syscall(regs);
> > }
> > diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
> > index c2235bae17ef6..aab31760b4e3e 100644
> > --- a/arch/x86/entry/syscall_32.c
> > +++ b/arch/x86/entry/syscall_32.c
> > @@ -14,25 +14,16 @@
> > #endif
> >
> > #define __SYSCALL(nr, sym) extern long __ia32_##sym(const struct pt_regs *);
> > -
> > #include <asm/syscalls_32.h>
> > #undef __SYSCALL
> >
> > -/*
> > - * The sys_call_table[] is no longer used for system calls, but
> > - * kernel/trace/trace_syscalls.c still wants to know the system
> > - * call address.
> > - */
> > -#ifdef CONFIG_X86_32
> > #define __SYSCALL(nr, sym) __ia32_##sym,
> > -const sys_call_ptr_t sys_call_table[] = {
> > +const sys_call_ptr_t ia32_sys_call_table[] = {
> > #include <asm/syscalls_32.h>
> > };
> > #undef __SYSCALL
> > -#endif
> >
> > #define __SYSCALL(nr, sym) case nr: return __ia32_##sym(regs);
> > -
> > long ia32_sys_call(const struct pt_regs *regs, unsigned int nr)
> > {
> > switch (nr) {
> > diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
> > index 33b3f09e6f151..ff36a993a07e0 100644
> > --- a/arch/x86/entry/syscall_64.c
> > +++ b/arch/x86/entry/syscall_64.c
> > @@ -8,14 +8,13 @@
> > #include <asm/syscall.h>
> >
> > #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
> > +#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn
> > __x64_##sym(const struct pt_regs *);
> > #include <asm/syscalls_64.h>
> > #undef __SYSCALL
> >
> > -/*
> > - * The sys_call_table[] is no longer used for system calls, but
> > - * kernel/trace/trace_syscalls.c still wants to know the system
> > - * call address.
> > - */
> > +#undef __SYSCALL_NORETURN
> > +#define __SYSCALL_NORETURN __SYSCALL
> > +
> > #define __SYSCALL(nr, sym) __x64_##sym,
> > const sys_call_ptr_t sys_call_table[] = {
> > #include <asm/syscalls_64.h>
> > @@ -23,7 +22,6 @@ const sys_call_ptr_t sys_call_table[] = {
> > #undef __SYSCALL
> >
> > #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> > -
> > long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
> > {
> > switch (nr) {
> > diff --git a/arch/x86/entry/syscall_x32.c b/arch/x86/entry/syscall_x32.c
> > index 03de4a9321318..4221ecce6e689 100644
> > --- a/arch/x86/entry/syscall_x32.c
> > +++ b/arch/x86/entry/syscall_x32.c
> > @@ -8,11 +8,20 @@
> > #include <asm/syscall.h>
> >
> > #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
> > +#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn
> > __x64_##sym(const struct pt_regs *);
> > #include <asm/syscalls_x32.h>
> > #undef __SYSCALL
> >
> > -#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> > +#undef __SYSCALL_NORETURN
> > +#define __SYSCALL_NORETURN __SYSCALL
> > +
> > +#define __SYSCALL(nr, sym) __x64_##sym,
> > +const sys_call_ptr_t x32_sys_call_table[] = {
> > +#include <asm/syscalls_x32.h>
> > +};
> > +#undef __SYSCALL
> >
> > +#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> > long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
> > {
> > switch (nr) {
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index a396f6e6ab5bf..7ec68d94eb593 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -2,7 +2,7 @@
> > # 64-bit system call numbers and entry vectors
> > #
> > # The format is:
> > -# <number> <abi> <name> <entry point>
> > +# <number> <abi> <name> <entry point> [0 noreturn]
> > #
> > # The __x64_sys_*() stubs are created on-the-fly for sys_*() system calls
> > #
> > @@ -68,7 +68,7 @@
> > 57 common fork sys_fork
> > 58 common vfork sys_vfork
> > 59 64 execve sys_execve
> > -60 common exit sys_exit
> > +60 common exit sys_exit 0 noreturn
> > 61 common wait4 sys_wait4
> > 62 common kill sys_kill
> > 63 common uname sys_newuname
> > @@ -239,7 +239,7 @@
> > 228 common clock_gettime sys_clock_gettime
> > 229 common clock_getres sys_clock_getres
> > 230 common clock_nanosleep sys_clock_nanosleep
> > -231 common exit_group sys_exit_group
> > +231 common exit_group sys_exit_group 0 noreturn
> > 232 common epoll_wait sys_epoll_wait
> > 233 common epoll_ctl sys_epoll_ctl
> > 234 common tgkill sys_tgkill
> > diff --git a/arch/x86/include/asm/cpufeatures.h
> > b/arch/x86/include/asm/cpufeatures.h
> > index 3c7434329661c..d64b0a5291f10 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -470,6 +470,7 @@
> > #define X86_FEATURE_BHI_CTRL (21*32+ 2) /* "" BHI_DIS_S HW control available */
> > #define X86_FEATURE_CLEAR_BHB_HW (21*32+ 3) /* "" BHI_DIS_S HW
> > control enabled */
> > #define X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT (21*32+ 4) /* "" Clear
> > branch history at vmexit using SW loop */
> > +#define X86_FEATURE_INDIRECT_BRANCH_OK (21*32+ 5) /* "" It's OK to
> > use indirect branches */
> >
> > /*
> > * BUG word(s)
> > diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> > index 2fc7bc3863ff6..dfb59521244c2 100644
> > --- a/arch/x86/include/asm/syscall.h
> > +++ b/arch/x86/include/asm/syscall.h
> > @@ -16,14 +16,20 @@
> > #include <asm/thread_info.h> /* for TS_COMPAT */
> > #include <asm/unistd.h>
> >
> > -/* This is used purely for kernel/trace/trace_syscalls.c */
> > typedef long (*sys_call_ptr_t)(const struct pt_regs *);
> > extern const sys_call_ptr_t sys_call_table[];
> >
> > +#if defined(CONFIG_X86_32)
> > +#define ia32_sys_call_table sys_call_table
> > +#else
> > /*
> > * These may not exist, but still put the prototypes in so we
> > * can use IS_ENABLED().
> > */
> > +extern const sys_call_ptr_t ia32_sys_call_table[];
> > +extern const sys_call_ptr_t x32_sys_call_table[];
> > +#endif
> > +
> > extern long ia32_sys_call(const struct pt_regs *, unsigned int nr);
> > extern long x32_sys_call(const struct pt_regs *, unsigned int nr);
> > extern long x64_sys_call(const struct pt_regs *, unsigned int nr);
> > diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> > index ab18185894dfd..5fca46c78daf2 100644
> > --- a/arch/x86/kernel/cpu/bugs.c
> > +++ b/arch/x86/kernel/cpu/bugs.c
> > @@ -1665,6 +1665,12 @@ static void __init bhi_select_mitigation(void)
> > if (!IS_ENABLED(CONFIG_X86_64))
> > return;
> >
> > + /*
> > + * There's no HW mitigation in place. Mark indirect branches as
> > + * "not OK".
> > + */
> > + setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
> > +
> > /* Mitigate KVM by default */
> > setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT);
> > pr_info("Spectre BHI mitigation: SW BHB clearing on vm exit\n");
> > @@ -1679,6 +1685,28 @@ static void __init spectre_v2_select_mitigation(void)
> > enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
> > enum spectre_v2_mitigation mode = SPECTRE_V2_NONE;
> >
> > + /*
> > + * X86_FEATURE_INDIRECT_BRANCH_OK indicates that indirect calls are
> > + * "OK" to use due to (at least) one of the following being true:
> > + *
> > + * - the CPU isn't vulnerable to Spectre v2, BHI, etc;
> > + *
> > + * - a HW mitigation is in place (e.g., IBRS, eIBRS+BHI_DIS_S); or
> > + *
> > + * - the user disabled mitigations.
> > + *
> > + * Clearing the bit enables certain indirect branch "easy targets" [*]
> > + * to be converted to a series of direct branches.
> > + *
> > + * Assume innocence until proven guilty: set it now and clear it later
> > + * if/when needed.
> > + *
> > + * [*] The closer the indirect branch is to kernel entry, and the more
> > + * user-controlled registers there are, the easier target it may be
> > + * for future Spectre v2 variants.
> > + */
> > + setup_force_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
> > +
> > /*
> > * If the CPU is not affected and the command line mode is NONE or AUTO
> > * then nothing to do.
> > @@ -1765,11 +1793,16 @@ static void __init spectre_v2_select_mitigation(void)
> > break;
> >
> > case SPECTRE_V2_LFENCE:
> > + setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
> > + fallthrough;
> > case SPECTRE_V2_EIBRS_LFENCE:
> > setup_force_cpu_cap(X86_FEATURE_RETPOLINE_LFENCE);
> > - fallthrough;
> > + setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> > + break;
> >
> > case SPECTRE_V2_RETPOLINE:
> > + setup_clear_cpu_cap(X86_FEATURE_INDIRECT_BRANCH_OK);
> > + fallthrough;
> > case SPECTRE_V2_EIBRS_RETPOLINE:
> > setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> > break;
> > diff --git a/arch/x86/um/sys_call_table_32.c b/arch/x86/um/sys_call_table_32.c
> > index 89df5d89d6640..c7d4bf955d2ba 100644
> > --- a/arch/x86/um/sys_call_table_32.c
> > +++ b/arch/x86/um/sys_call_table_32.c
> > @@ -24,6 +24,7 @@
> > #define __SYSCALL_WITH_COMPAT(nr, native, compat) __SYSCALL(nr, native)
> >
> > #define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long,
> > unsigned long, unsigned long, unsigned long, unsigned long, unsigned
> > long);
> > +#define __SYSCALL_NORETURN __SYSCALL
> > #include <asm/syscalls_32.h>
> >
> > #undef __SYSCALL
> > diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
> > index b0b4cfd2308c8..4760c40ae5cd0 100644
> > --- a/arch/x86/um/sys_call_table_64.c
> > +++ b/arch/x86/um/sys_call_table_64.c
> > @@ -19,6 +19,7 @@
> > #define sys_ioperm sys_ni_syscall
> >
> > #define __SYSCALL(nr, sym) extern asmlinkage long sym(unsigned long,
> > unsigned long, unsigned long, unsigned long, unsigned long, unsigned
> > long);
> > +#define __SYSCALL_NORETURN __SYSCALL
> > #include <asm/syscalls_64.h>
> >
> > #undef __SYSCALL
> > diff --git a/scripts/syscalltbl.sh b/scripts/syscalltbl.sh
> > index 6abe143889ef6..16487d47e06a3 100755
> > --- a/scripts/syscalltbl.sh
> > +++ b/scripts/syscalltbl.sh
> > @@ -54,7 +54,7 @@ nxt=0
> >
> > grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {
> >
> > - while read nr abi name native compat ; do
> > + while read nr abi name native compat noreturn; do
> >
> > if [ $nxt -gt $nr ]; then
> > echo "error: $infile: syscall table is not sorted or duplicates the
> > same syscall number" >&2
> > @@ -66,7 +66,9 @@ grep -E "^[0-9]+[[:space:]]+$abis" "$infile" | {
> > nxt=$((nxt + 1))
> > done
> >
> > - if [ -n "$compat" ]; then
> > + if [ -n "$noreturn" ]; then
> > + echo "__SYSCALL_NORETURN($nr, $native)"
> > + elif [ -n "$compat" ]; then
> > echo "__SYSCALL_WITH_COMPAT($nr, $native, $compat)"
> > elif [ -n "$native" ]; then
> > echo "__SYSCALL($nr, $native)"
> > diff --git a/tools/objtool/noreturns.h b/tools/objtool/noreturns.h
> > index 7ebf29c911849..1e8141ef1b15d 100644
> > --- a/tools/objtool/noreturns.h
> > +++ b/tools/objtool/noreturns.h
> > @@ -7,12 +7,16 @@
> > * Yes, this is unfortunate. A better solution is in the works.
> > */
> > NORETURN(__fortify_panic)
> > +NORETURN(__ia32_sys_exit)
> > +NORETURN(__ia32_sys_exit_group)
> > NORETURN(__kunit_abort)
> > NORETURN(__module_put_and_kthread_exit)
> > NORETURN(__reiserfs_panic)
> > NORETURN(__stack_chk_fail)
> > NORETURN(__tdx_hypercall_failed)
> > NORETURN(__ubsan_handle_builtin_unreachable)
> > +NORETURN(__x64_sys_exit)
> > +NORETURN(__x64_sys_exit_group)
> > NORETURN(arch_cpu_idle_dead)
> > NORETURN(bch2_trans_in_restart_error)
> > NORETURN(bch2_trans_restart_error)
> >
> > On Sun, Apr 21, 2024 at 1:40 PM Paul McKenney <[email protected]> wrote:
> > >
> > > They apply fine as is, so I have started tests with that pair of patches.
> > >
> > > Thanx, Paul
> > >
> > > On Sat, Apr 20, 2024 at 10:25 PM Josh Poimboeuf <[email protected]> wrote:
> > > >
> > > > On Sat, Apr 20, 2024 at 06:58:58AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Apr 19, 2024 at 02:09:49PM -0700, Josh Poimboeuf wrote:
> > > > > > The direct-call syscall dispatch functions don't know that the exit()
> > > > > > and exit_group() syscall handlers don't return. As a result the call
> > > > > > sites aren't optimized accordingly.
> > > > > >
> > > > > > Fix that by marking those exit syscall declarations as __noreturn.
> > > > > >
> > > > > > Fixes the following warnings:
> > > > > >
> > > > > > vmlinux.o: warning: objtool: x64_sys_call+0x2804: __x64_sys_exit() is missing a __noreturn annotation
> > > > > > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6: __ia32_sys_exit_group() is missing a __noreturn annotation
> > > > > >
> > > > > > Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
> > > > > > Reported-by: "Paul E. McKenney" <[email protected]>
> > > > > > Closes: https://lkml.kernel.org/lkml/6dba9b32-db2c-4e6d-9500-7a08852f17a3@paulmck-laptop
> > > > > > Signed-off-by: Josh Poimboeuf <[email protected]>
> > > > >
> > > > > Looks good, but it does not apply on top of current -next and I don't
> > > > > trust myself to hand-apply it (something about having just got off of
> > > > > a flight across the big pond).
> > > > >
> > > > > Could you please let me know what else do I need to pull in to be able
> > > > > to cleanly apply this one?
> > > >
> > > > This patch has a dependency on an earlier patch in the set:
> > > >
> > > > https://lkml.kernel.org/lkml/982d05a2f669140f26500bee643011896d661094.1713559768.git.jpoimboe@kernel.org
> > > >
> > > > Though I think it's not a hard dependency and I could reverse the order
> > > > of the patches if needed.
> > > >
> > > > --
> > > > Josh

2024-05-03 19:57:07

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Thu, May 02, 2024 at 04:48:13PM -0700, Paul McKenney wrote:
> On Sun, Apr 21, 2024 at 2:47 PM Paul McKenney <[email protected]> wrote:
> >
> > And this definitely helped, thank you!
> >
> > However, this one still remains:
> >
> > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6:
> > __ia32_sys_exit_group() is missing a __noreturn annotation
>
> And looking at the patched code, this function looks to me to be
> correctly marked.
>
> No idea... :-/

Ah, I think I missed fixing syscall_32.tbl. Lemme see how to do that...

--
Josh

2024-05-03 20:44:27

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Fri, May 03, 2024 at 12:57:00PM -0700, Josh Poimboeuf wrote:
> On Thu, May 02, 2024 at 04:48:13PM -0700, Paul McKenney wrote:
> > On Sun, Apr 21, 2024 at 2:47 PM Paul McKenney <[email protected]> wrote:
> > >
> > > And this definitely helped, thank you!
> > >
> > > However, this one still remains:
> > >
> > > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6:
> > > __ia32_sys_exit_group() is missing a __noreturn annotation
> >
> > And looking at the patched code, this function looks to me to be
> > correctly marked.
> >
> > No idea... :-/
>
> Ah, I think I missed fixing syscall_32.tbl. Lemme see how to do that...

Can you try adding this on top?

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 5f8591ce7f25..f30b608d14dc 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -12,7 +12,7 @@
# The abi is always "i386" for this file.
#
0 i386 restart_syscall sys_restart_syscall
-1 i386 exit sys_exit
+1 i386 exit sys_exit 0 noreturn
2 i386 fork sys_fork
3 i386 read sys_read
4 i386 write sys_write
@@ -263,7 +263,7 @@
249 i386 io_cancel sys_io_cancel
250 i386 fadvise64 sys_ia32_fadvise64
# 251 is available for reuse (was briefly sys_set_zone_reclaim)
-252 i386 exit_group sys_exit_group
+252 i386 exit_group sys_exit_group 0 noreturn
253 i386 lookup_dcookie
254 i386 epoll_create sys_epoll_create
255 i386 epoll_ctl sys_epoll_ctl

2024-05-03 23:33:07

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Fri, May 03, 2024 at 01:44:17PM -0700, Josh Poimboeuf wrote:
> On Fri, May 03, 2024 at 12:57:00PM -0700, Josh Poimboeuf wrote:
> > On Thu, May 02, 2024 at 04:48:13PM -0700, Paul McKenney wrote:
> > > On Sun, Apr 21, 2024 at 2:47 PM Paul McKenney <[email protected]> wrote:
> > > >
> > > > And this definitely helped, thank you!
> > > >
> > > > However, this one still remains:
> > > >
> > > > vmlinux.o: warning: objtool: ia32_sys_call+0x29b6:
> > > > __ia32_sys_exit_group() is missing a __noreturn annotation
> > >
> > > And looking at the patched code, this function looks to me to be
> > > correctly marked.
> > >
> > > No idea... :-/
> >
> > Ah, I think I missed fixing syscall_32.tbl. Lemme see how to do that...
>
> Can you try adding this on top?
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 5f8591ce7f25..f30b608d14dc 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -12,7 +12,7 @@
> # The abi is always "i386" for this file.
> #
> 0 i386 restart_syscall sys_restart_syscall
> -1 i386 exit sys_exit
> +1 i386 exit sys_exit 0 noreturn
> 2 i386 fork sys_fork
> 3 i386 read sys_read
> 4 i386 write sys_write
> @@ -263,7 +263,7 @@
> 249 i386 io_cancel sys_io_cancel
> 250 i386 fadvise64 sys_ia32_fadvise64
> # 251 is available for reuse (was briefly sys_set_zone_reclaim)
> -252 i386 exit_group sys_exit_group
> +252 i386 exit_group sys_exit_group 0 noreturn
> 253 i386 lookup_dcookie
> 254 i386 epoll_create sys_epoll_create
> 255 i386 epoll_ctl sys_epoll_ctl

Thank you!

But for non-KCSAN builds, I get the following diagnostics:

In file included from arch/x86/entry/syscall_32.c:17:
/arch/x86/include/generated/asm/syscalls_32.h:2:20: error: expected declaration specifiers or ‘...’ before numeric constant
2 | __SYSCALL_NORETURN(1, sys_exit)

For KCSAN builds, I instead get the following diagnostics:

In file included from arch/x86/entry/syscall_32.c:17:
/arch/x86/include/generated/asm/syscalls_32.h:2:20: error: expected parameter declarator
__SYSCALL_NORETURN(1, sys_exit)

Does arch/x86/entry/syscall_32.c need the following additional patch?

A quick smoke test passes, but perhaps I am just getting lucky...

Thanx, Paul

------------------------------------------------------------------------

diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
index aab31760b4e3e..d9ae910ea6f33 100644
--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -14,9 +14,13 @@
#endif

#define __SYSCALL(nr, sym) extern long __ia32_##sym(const struct pt_regs *);
+#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __ia32_##sym(const struct pt_regs *);
#include <asm/syscalls_32.h>
#undef __SYSCALL

+#undef __SYSCALL_NORETURN
+#define __SYSCALL_NORETURN __SYSCALL
+
#define __SYSCALL(nr, sym) __ia32_##sym,
const sys_call_ptr_t ia32_sys_call_table[] = {
#include <asm/syscalls_32.h>

2024-05-03 23:48:46

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Fri, May 03, 2024 at 04:33:00PM -0700, Paul E. McKenney wrote:
> Does arch/x86/entry/syscall_32.c need the following additional patch?
>
> A quick smoke test passes, but perhaps I am just getting lucky...
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
> index aab31760b4e3e..d9ae910ea6f33 100644
> --- a/arch/x86/entry/syscall_32.c
> +++ b/arch/x86/entry/syscall_32.c
> @@ -14,9 +14,13 @@
> #endif
>
> #define __SYSCALL(nr, sym) extern long __ia32_##sym(const struct pt_regs *);
> +#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __ia32_##sym(const struct pt_regs *);
> #include <asm/syscalls_32.h>
> #undef __SYSCALL
>
> +#undef __SYSCALL_NORETURN
> +#define __SYSCALL_NORETURN __SYSCALL
> +
> #define __SYSCALL(nr, sym) __ia32_##sym,
> const sys_call_ptr_t ia32_sys_call_table[] = {
> #include <asm/syscalls_32.h>

Ah, yeah, that looks right.

--
Josh

2024-05-04 16:48:35

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [PATCH v4 3/5] x86/syscall: Mark exit[_group] syscall handlers __noreturn

On Fri, May 03, 2024 at 04:48:34PM -0700, Josh Poimboeuf wrote:
> On Fri, May 03, 2024 at 04:33:00PM -0700, Paul E. McKenney wrote:
> > Does arch/x86/entry/syscall_32.c need the following additional patch?
> >
> > A quick smoke test passes, but perhaps I am just getting lucky...
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
> > index aab31760b4e3e..d9ae910ea6f33 100644
> > --- a/arch/x86/entry/syscall_32.c
> > +++ b/arch/x86/entry/syscall_32.c
> > @@ -14,9 +14,13 @@
> > #endif
> >
> > #define __SYSCALL(nr, sym) extern long __ia32_##sym(const struct pt_regs *);
> > +#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __ia32_##sym(const struct pt_regs *);
> > #include <asm/syscalls_32.h>
> > #undef __SYSCALL
> >
> > +#undef __SYSCALL_NORETURN
> > +#define __SYSCALL_NORETURN __SYSCALL
> > +
> > #define __SYSCALL(nr, sym) __ia32_##sym,
> > const sys_call_ptr_t ia32_sys_call_table[] = {
> > #include <asm/syscalls_32.h>
>
> Ah, yeah, that looks right.

And the overnight testing went well, so...

For your three patches and this one:

Tested-by: Paul E. McKenney <[email protected]>

2024-05-07 05:18:22

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v4 1/5] x86/bugs: Only harden syscalls when needed

On Mon, Apr 22, 2024 at 04:09:33PM +0800, Yujie Liu wrote:
> On Fri, Apr 19, 2024 at 02:09:47PM -0700, Josh Poimboeuf wrote:
> > Syscall hardening (converting the syscall indirect branch to a series of
> > direct branches) has shown some performance regressions:
> >
> > - Red Hat internal testing showed up to 12% slowdowns in database
> > benchmark testing on Sapphire Rapids when the DB was stressed with 80+
> > users to cause contention.
> >
> > - The kernel test robot's will-it-scale benchmarks showed significant
> > regressions on Skylake with IBRS:
> > https://lkml.kernel.org/lkml/[email protected]
>
> To clarify, we reported a +1.4% improvement (not regression) of
> will-it-scale futex4 benchmark on Skylake. Meanwhile we did observe some
> regressions by running other benchmarks on Ice Lake, such as:
>
> stress-ng.null.ops_per_sec -4.0% regression on Intel Xeon Gold 6346 (Ice Lake)
> unixbench.fsbuffer.throughput -1.4% regression on Intel Xeon Gold 6346 (Ice Lake)

Thanks for clarifying that. I'm not sure what I was looking at.

I also saw your email where Ice Lake showed a ~10% regression for
1e3ad78334a6. Unfortunately my patch wouldn't help with that, as it's
designed to help with older systems (e.g., Skylake) and newer (e.g.,
Sapphire Rapids) but not Ice/Cascade Lake.

Whether 1e3ad78334a6 helps or hurts seems very workload-dependent.

It would be especially interesting to see if my patch helps on the newer
systems which have the HW mitigation: Raptor Lake, Meteor Lake, Sapphire
Rapids, Emerald Rapids.

For now, maybe I'll just table this patch until we have more data.

--
Josh

2024-05-20 05:21:49

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH v4 1/5] x86/bugs: Only harden syscalls when needed

On Mon, May 06, 2024 at 10:17:41PM -0700, Josh Poimboeuf wrote:
> On Mon, Apr 22, 2024 at 04:09:33PM +0800, Yujie Liu wrote:
> > On Fri, Apr 19, 2024 at 02:09:47PM -0700, Josh Poimboeuf wrote:
> > > Syscall hardening (converting the syscall indirect branch to a series of
> > > direct branches) has shown some performance regressions:
> > >
> > > - Red Hat internal testing showed up to 12% slowdowns in database
> > > benchmark testing on Sapphire Rapids when the DB was stressed with 80+
> > > users to cause contention.
> > >
> > > - The kernel test robot's will-it-scale benchmarks showed significant
> > > regressions on Skylake with IBRS:
> > > https://lkml.kernel.org/lkml/[email protected]
> >
> > To clarify, we reported a +1.4% improvement (not regression) of
> > will-it-scale futex4 benchmark on Skylake. Meanwhile we did observe some
> > regressions by running other benchmarks on Ice Lake, such as:
> >
> > stress-ng.null.ops_per_sec -4.0% regression on Intel Xeon Gold 6346 (Ice Lake)
> > unixbench.fsbuffer.throughput -1.4% regression on Intel Xeon Gold 6346 (Ice Lake)
>
> Thanks for clarifying that. I'm not sure what I was looking at.
>
> I also saw your email where Ice Lake showed a ~10% regression for
> 1e3ad78334a6. Unfortunately my patch wouldn't help with that, as it's
> designed to help with older systems (e.g., Skylake) and newer (e.g.,
> Sapphire Rapids) but not Ice/Cascade Lake.
>
> Whether 1e3ad78334a6 helps or hurts seems very workload-dependent.
>
> It would be especially interesting to see if my patch helps on the newer
> systems which have the HW mitigation: Raptor Lake, Meteor Lake, Sapphire
> Rapids, Emerald Rapids.
>
> For now, maybe I'll just table this patch until we have more data.

FYI, we tested this patch on a Sapphire Rapids machine in our test
infrastructure. There are both improvements and regressions across
different benchmarks, which does show that the impact may be very
workload-dependent.

Intel Xeon Platinum 8480CTDX (Sapphire Rapids) with 512G Memory:

+-----------------------+----------+---------------+------------------+
| | mainline | + this patch | (percent change) |
+-----------------------+----------+---------------+------------------+
| will-it-scale.futex4 | 5015892 | 5118744 | +2.05% |
+-----------------------+----------+---------------+------------------+
| will-it-scale.pread1 | 2720498 | 2721206 | +0.03% |
+-----------------------+----------+---------------+------------------+
| will-it-scale.pwrite1 | 2143302 | 2164511 | +0.99% |
+-----------------------+----------+---------------+------------------+
| will-it-scale.poll1 | 4046943 | 4095046 | +1.19% |
+-----------------------+----------+---------------+------------------+
| stress-ng.null | 19400 | 18689 | -3.66% |
+-----------------------+----------+---------------+------------------+
| unixbench.fsbuffer | 369494 | 364181 | -1.44% |
+-----------------------+----------+---------------+------------------+

Thanks,
Yujie