From: Ira Weiny <[email protected]>
Rik reported that knowing which CPUs are seeing faults can help in
determining which CPUs are failing in a large data center.[1]
Storing the CPU at exception entry time allows this print to report the CPU
which actually took the exception. This may still not be the failing CPU,
but it should be closer.
Dave and Boris recognized that the auxiliary pt_regs work I did for the PKS
series could help to store this value and avoid passing the CPU throughout the
fault handler call stack.
I'm posting this RFC for a few reasons.
1) I've left in arch_restore_aux_pt_regs(). This is called on exception exit
and is not needed for this use case but I believe it is better to leave it
for symmetry within the generic entry code. This also means that patch
1/5 could be dropped completely.
2) I want to see whether 0day finds any issues with the Kconfig option changes
   I made, which might creep in from a 32-bit build.
3) The final patch could be squashed with Rik's but it seemed better to leave
them split for authorship clarity.
Compile tested only.
[1] https://lore.kernel.org/all/[email protected]/
Ira Weiny (4):
entry: Pass pt_regs to irqentry_exit_cond_resched()
entry: Add calls for save/restore auxiliary pt_regs
x86/entry: Add auxiliary pt_regs space
x86/entry: Store CPU info on exception entry
Rik van Riel (1):
x86,mm: print likely CPU at segfault time
arch/arm64/include/asm/preempt.h | 2 +-
arch/arm64/kernel/entry-common.c | 4 +--
arch/x86/Kconfig | 4 +++
arch/x86/entry/calling.h | 19 ++++++++++++++
arch/x86/entry/common.c | 2 +-
arch/x86/entry/entry_64.S | 22 ++++++++++++++++
arch/x86/entry/entry_64_compat.S | 6 +++++
arch/x86/include/asm/entry-common.h | 12 +++++++++
arch/x86/include/asm/ptrace.h | 19 ++++++++++++++
arch/x86/kernel/asm-offsets_64.c | 15 +++++++++++
arch/x86/kernel/head_64.S | 6 +++++
arch/x86/mm/fault.c | 10 ++++++++
include/linux/entry-common.h | 25 +++++++++++++-----
kernel/entry/common.c | 29 ++++++++++++++++-----
kernel/sched/core.c | 40 ++++++++++++++---------------
15 files changed, 178 insertions(+), 37 deletions(-)
base-commit: b2a88c212e652e94f1e4b635910972ac57ba4e97
--
2.35.3
From: Ira Weiny <[email protected]>
Rik van Riel reports that knowledge of where a fault hits is valuable in
detecting CPU failures in large data centers.[0] Having auxiliary
pt_regs space is a useful place to store the CPU and avoids passing
additional data through the exception call stacks.
Two possible places for preserving this state were originally
considered: irqentry_state_t or pt_regs.[1] Extending pt_regs directly was
much more complicated and potentially fraught with unintended consequences.[2]
However, Andy Lutomirski came up with a way to hide additional values on
the stack which could be accessed as "extended_pt_regs".[3] This method
allows any function that already has access to pt_regs to reach the extra
information without expanding the use of irqentry_state_t, while leaving
pt_regs intact for compatibility with outside tools like BPF.
Prepare the assembly code to add a hidden auxiliary pt_regs space. To
simplify, the assembly code only adds space on the stack whose size is
defined by the C code that needs it. The use of this space is left to the
C code, which must select ARCH_HAS_PTREGS_AUXILIARY to enable this
support.
Each nested exception gets another copy of this auxiliary space allowing
for any number of levels of exception handling.
Initially the space is left empty, which results in no code changes because
ARCH_HAS_PTREGS_AUXILIARY is not set. Subsequent patches adding data to
pt_regs_auxiliary must select ARCH_HAS_PTREGS_AUXILIARY or a build failure
will occur. Gating on ARCH_HAS_PTREGS_AUXILIARY also avoids the
introduction of two instructions (addq/subq) on every entry call when the
extra space is not needed.
32-bit is specifically excluded.
Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in its development.
[0] https://lore.kernel.org/all/[email protected]/
[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/
Cc: Rik van Riel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Borislav Petkov <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Forward port from PKS series
https://lore.kernel.org/lkml/[email protected]/
---
arch/x86/Kconfig | 4 ++++
arch/x86/entry/calling.h | 19 +++++++++++++++++++
arch/x86/entry/entry_64.S | 22 ++++++++++++++++++++++
arch/x86/entry/entry_64_compat.S | 6 ++++++
arch/x86/include/asm/ptrace.h | 18 ++++++++++++++++++
arch/x86/kernel/asm-offsets_64.c | 15 +++++++++++++++
arch/x86/kernel/head_64.S | 6 ++++++
7 files changed, 90 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fb5900e2c29a..b35f6a472e09 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1874,6 +1874,10 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
If unsure, say y.
+config ARCH_HAS_PTREGS_AUXILIARY
+ depends on X86_64
+ bool
+
choice
prompt "TSX enable mode"
depends on CPU_SUP_INTEL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index f6907627172b..b7515f8b0092 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -65,6 +65,25 @@ For 32-bit we have the following conventions - kernel is built with
* for assembly code:
*/
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+.macro PUSH_PTREGS_AUXILIARY
+ /* add space for pt_regs_auxiliary */
+ subq $PTREGS_AUX_SIZE, %rsp
+.endm
+
+.macro POP_PTREGS_AUXILIARY
+ /* remove space for pt_regs_auxiliary */
+ addq $PTREGS_AUX_SIZE, %rsp
+.endm
+
+#else
+
+#define PUSH_PTREGS_AUXILIARY
+#define POP_PTREGS_AUXILIARY
+
+#endif
+
.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
.if \save_ret
pushq %rsi /* pt_regs->si */
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9953d966d124..4f9f7f5cb563 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -362,7 +362,9 @@ SYM_CODE_END(xen_error_entry)
movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
.endif
+ PUSH_PTREGS_AUXILIARY
call \cfunc
+ POP_PTREGS_AUXILIARY
/* For some configurations \cfunc ends up being a noreturn. */
REACHABLE
@@ -472,7 +474,9 @@ SYM_CODE_START(\asmsym)
movq %rsp, %rdi /* pt_regs pointer */
+ PUSH_PTREGS_AUXILIARY
call \cfunc
+ POP_PTREGS_AUXILIARY
jmp paranoid_exit
@@ -535,7 +539,9 @@ SYM_CODE_START(\asmsym)
* stack.
*/
movq %rsp, %rdi /* pt_regs pointer */
+ PUSH_PTREGS_AUXILIARY
call vc_switch_off_ist
+ POP_PTREGS_AUXILIARY
movq %rax, %rsp /* Switch to new stack */
ENCODE_FRAME_POINTER
@@ -547,7 +553,9 @@ SYM_CODE_START(\asmsym)
movq %rsp, %rdi /* pt_regs pointer */
+ PUSH_PTREGS_AUXILIARY
call kernel_\cfunc
+ POP_PTREGS_AUXILIARY
/*
* No need to switch back to the IST stack. The current stack is either
@@ -584,7 +592,9 @@ SYM_CODE_START(\asmsym)
movq %rsp, %rdi /* pt_regs pointer into first argument */
movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ PUSH_PTREGS_AUXILIARY
call \cfunc
+ POP_PTREGS_AUXILIARY
/* For some configurations \cfunc ends up being a noreturn. */
REACHABLE
@@ -838,7 +848,9 @@ SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
movq %rdi, %rsp /* we don't return, adjust the stack frame */
UNWIND_HINT_REGS
+ PUSH_PTREGS_AUXILIARY
call xen_pv_evtchn_do_upcall
+ POP_PTREGS_AUXILIARY
jmp error_return
SYM_CODE_END(exc_xen_hypervisor_callback)
@@ -1062,7 +1074,9 @@ SYM_CODE_START_LOCAL(error_entry)
.Lerror_entry_from_usermode_after_swapgs:
/* Put us onto the real thread stack. */
+ PUSH_PTREGS_AUXILIARY
call sync_regs
+ POP_PTREGS_AUXILIARY
RET
/*
@@ -1119,7 +1133,9 @@ SYM_CODE_START_LOCAL(error_entry)
* as if we faulted immediately after IRET.
*/
leaq 8(%rsp), %rdi /* arg0 = pt_regs pointer */
+ PUSH_PTREGS_AUXILIARY
call fixup_bad_iret
+ POP_PTREGS_AUXILIARY
mov %rax, %rdi
jmp .Lerror_entry_from_usermode_after_swapgs
SYM_CODE_END(error_entry)
@@ -1229,7 +1245,9 @@ SYM_CODE_START(asm_exc_nmi)
movq %rsp, %rdi
movq $-1, %rsi
+ PUSH_PTREGS_AUXILIARY
call exc_nmi
+ POP_PTREGS_AUXILIARY
/*
* Return back to user mode. We must *not* do the normal exit
@@ -1265,6 +1283,8 @@ SYM_CODE_START(asm_exc_nmi)
* +---------------------------------------------------------+
* | pt_regs |
* +---------------------------------------------------------+
+ * | (Optionally) pt_regs_extended |
+ * +---------------------------------------------------------+
*
* The "original" frame is used by hardware. Before re-enabling
* NMIs, we need to be done with it, and we need to leave enough
@@ -1443,7 +1463,9 @@ end_repeat_nmi:
movq %rsp, %rdi
movq $-1, %rsi
+ PUSH_PTREGS_AUXILIARY
call exc_nmi
+ POP_PTREGS_AUXILIARY
/* Always restore stashed SPEC_CTRL value (see paranoid_entry) */
IBRS_EXIT save_reg=%r15
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 682338e7e2a3..7f1e670f7b06 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -117,7 +117,9 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
.Lsysenter_flags_fixed:
movq %rsp, %rdi
+ PUSH_PTREGS_AUXILIARY
call do_SYSENTER_32
+ POP_PTREGS_AUXILIARY
/* XEN PV guests always use IRET path */
ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -212,7 +214,9 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
UNTRAIN_RET
movq %rsp, %rdi
+ PUSH_PTREGS_AUXILIARY
call do_fast_syscall_32
+ POP_PTREGS_AUXILIARY
/* XEN PV guests always use IRET path */
ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -351,6 +355,8 @@ SYM_CODE_START(entry_INT80_compat)
UNTRAIN_RET
movq %rsp, %rdi
+ PUSH_PTREGS_AUXILIARY
call do_int80_syscall_32
+ POP_PTREGS_AUXILIARY
jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(entry_INT80_compat)
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index f4db78b09c8f..5a9c85893459 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -2,6 +2,7 @@
#ifndef _ASM_X86_PTRACE_H
#define _ASM_X86_PTRACE_H
+#include <linux/container_of.h>
#include <asm/segment.h>
#include <asm/page_types.h>
#include <uapi/asm/ptrace.h>
@@ -91,6 +92,23 @@ struct pt_regs {
/* top of stack page */
};
+/*
+ * NOTE: Features which add data to pt_regs_auxiliary must select
+ * ARCH_HAS_PTREGS_AUXILIARY. Failure to do so will result in a build failure.
+ */
+struct pt_regs_auxiliary {
+};
+
+struct pt_regs_extended {
+ struct pt_regs_auxiliary aux;
+ struct pt_regs pt_regs __aligned(8);
+};
+
+static inline struct pt_regs_extended *to_extended_pt_regs(struct pt_regs *regs)
+{
+ return container_of(regs, struct pt_regs_extended, pt_regs);
+}
+
#endif /* !__i386__ */
#ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 9b698215d261..413fe632445b 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -4,6 +4,7 @@
#endif
#include <asm/ia32.h>
+#include <asm/ptrace.h>
#if defined(CONFIG_KVM_GUEST)
#include <asm/kvm_para.h>
@@ -60,5 +61,19 @@ int main(void)
DEFINE(stack_canary_offset, offsetof(struct fixed_percpu_data, stack_canary));
BLANK();
#endif
+
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+ /* Size of Auxiliary pt_regs data */
+ DEFINE(PTREGS_AUX_SIZE, sizeof(struct pt_regs_extended) -
+ sizeof(struct pt_regs));
+#else
+ /*
+ * Adding data to struct pt_regs_auxiliary requires setting
+ * ARCH_HAS_PTREGS_AUXILIARY
+ */
+ BUILD_BUG_ON((sizeof(struct pt_regs_extended) -
+ sizeof(struct pt_regs)) != 0);
+#endif
+
return 0;
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d860d437631b..3a41273acb1c 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -398,8 +398,10 @@ SYM_CODE_START_NOALIGN(vc_boot_ghcb)
movq %rsp, %rdi
movq ORIG_RAX(%rsp), %rsi
movq initial_vc_handler(%rip), %rax
+ PUSH_PTREGS_AUXILIARY
ANNOTATE_RETPOLINE_SAFE
call *%rax
+ POP_PTREGS_AUXILIARY
/* Unwind pt_regs */
POP_REGS
@@ -479,7 +481,9 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
UNWIND_HINT_REGS
movq %rsp,%rdi /* RDI = pt_regs; RSI is already trapnr */
+ PUSH_PTREGS_AUXILIARY
call do_early_exception
+ POP_PTREGS_AUXILIARY
decl early_recursion_flag(%rip)
jmp restore_regs_and_return_to_kernel
@@ -508,7 +512,9 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
/* Call C handler */
movq %rsp, %rdi
movq ORIG_RAX(%rsp), %rsi
+ PUSH_PTREGS_AUXILIARY
call do_vc_no_ghcb
+ POP_PTREGS_AUXILIARY
/* Unwind pt_regs */
POP_REGS
--
2.35.3
From: Rik van Riel <[email protected]>
In a large enough fleet of computers, it is common to have a few bad CPUs.
Those can often be identified by seeing that some commonly run kernel code,
which runs fine everywhere else, keeps crashing on the same CPU core on one
particular bad system.
However, the failure modes in CPUs that have gone bad over the years are
often oddly specific, and the only bad behavior seen might be segfaults
in programs like bash, python, or various system daemons that run fine
everywhere else.
Add a printk() to show_signal_msg() to print the CPU, core, and socket
at segfault time. This is not perfect, since the task might get rescheduled
on another CPU between when the fault hit and when the message is printed,
but in practice this has been good enough to help us identify several bad
CPU cores.
segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
This printk can be controlled through /proc/sys/debug/exception-trace.
Signed-off-by: Rik van Riel <[email protected]>
CC: Dave Jones <[email protected]>
---
arch/x86/mm/fault.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 971977c438fc..82cf23975aa1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
unsigned long address, struct task_struct *tsk)
{
const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
+ /* This is a racy snapshot, but it's better than nothing. */
+ int cpu = raw_smp_processor_id();
if (!unhandled_signal(tsk, SIGSEGV))
return;
@@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
print_vma_addr(KERN_CONT " in ", regs->ip);
+ /*
+ * Dump the likely CPU where the fatal segfault happened.
+ * This can help identify faulty hardware.
+ */
+ printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
+ topology_core_id(cpu), topology_physical_package_id(cpu));
+
+
printk(KERN_CONT "\n");
show_opcodes(regs, loglvl);
--
2.35.3
From: Ira Weiny <[email protected]>
Auxiliary pt_regs space needs to be manipulated by the generic
entry/exit code.
Ideally irqentry_exit() would take care of handling any auxiliary
pt_regs on exit. Unfortunately, irqentry_exit() is not the only exit
from the exception path. The call to irqentry_exit_cond_resched() from
xen_pv_evtchn_do_upcall() bypasses irqentry_exit().
Make irqentry_exit_cond_resched() symmetrical with irqentry_enter() by
passing pt_regs to it. This makes irqentry_exit_cond_resched() capable
of handling auxiliary pt_regs in future patches.
Cc: Rik van Riel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Forward ported from PKS series:
https://lore.kernel.org/lkml/[email protected]/
---
arch/arm64/include/asm/preempt.h | 2 +-
arch/arm64/kernel/entry-common.c | 4 ++--
arch/x86/entry/common.c | 2 +-
include/linux/entry-common.h | 17 ++++++++------
kernel/entry/common.c | 13 +++++++----
kernel/sched/core.c | 40 ++++++++++++++++----------------
6 files changed, 43 insertions(+), 35 deletions(-)
diff --git a/arch/arm64/include/asm/preempt.h b/arch/arm64/include/asm/preempt.h
index 0159b625cc7f..bd185a214096 100644
--- a/arch/arm64/include/asm/preempt.h
+++ b/arch/arm64/include/asm/preempt.h
@@ -87,7 +87,7 @@ void preempt_schedule_notrace(void);
#ifdef CONFIG_PREEMPT_DYNAMIC
-DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
+DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched_internal);
void dynamic_preempt_schedule(void);
#define __preempt_schedule() dynamic_preempt_schedule()
void dynamic_preempt_schedule_notrace(void);
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index c75ca36b4a49..a1cc8795b729 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -224,9 +224,9 @@ static void noinstr arm64_exit_el1_dbg(struct pt_regs *regs)
}
#ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
+DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched_internal);
#define need_irq_preemption() \
- (static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
+ (static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched_internal))
#else
#define need_irq_preemption() (IS_ENABLED(CONFIG_PREEMPTION))
#endif
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..f1ba770d035d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -309,7 +309,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
- irqentry_exit_cond_resched();
+ irqentry_exit_cond_resched(regs);
instrumentation_end();
restore_inhcall(inhcall);
} else {
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 84a466b176cf..976cce7cf803 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -412,23 +412,26 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
/**
* irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
+ * @regs: Pointer to pt_regs of interrupted context
*
* Conditional reschedule with additional sanity checks.
*/
+void irqentry_exit_cond_resched(struct pt_regs *regs);
+
void raw_irqentry_exit_cond_resched(void);
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
-#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
-#define irqentry_exit_cond_resched_dynamic_disabled NULL
-DECLARE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
-#define irqentry_exit_cond_resched() static_call(irqentry_exit_cond_resched)()
+#define irqentry_exit_cond_resched_internal_dynamic_enabled raw_irqentry_exit_cond_resched
+#define irqentry_exit_cond_resched_internal_dynamic_disabled NULL
+DECLARE_STATIC_CALL(irqentry_exit_cond_resched_internal, raw_irqentry_exit_cond_resched);
+#define irqentry_exit_cond_resched_internal() static_call(irqentry_exit_cond_resched_internal)()
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
+DECLARE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched_internal);
void dynamic_irqentry_exit_cond_resched(void);
-#define irqentry_exit_cond_resched() dynamic_irqentry_exit_cond_resched()
+#define irqentry_exit_cond_resched_internal() dynamic_irqentry_exit_cond_resched()
#endif
#else /* CONFIG_PREEMPT_DYNAMIC */
-#define irqentry_exit_cond_resched() raw_irqentry_exit_cond_resched()
+#define irqentry_exit_cond_resched_internal() raw_irqentry_exit_cond_resched()
#endif /* CONFIG_PREEMPT_DYNAMIC */
/**
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 063068a9ea9b..8c0f334c4b75 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -387,18 +387,23 @@ void raw_irqentry_exit_cond_resched(void)
}
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
-DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
+DEFINE_STATIC_CALL(irqentry_exit_cond_resched_internal, raw_irqentry_exit_cond_resched);
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
+DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched_internal);
void dynamic_irqentry_exit_cond_resched(void)
{
- if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
+ if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched_internal))
return;
raw_irqentry_exit_cond_resched();
}
#endif
#endif
+void irqentry_exit_cond_resched(struct pt_regs *regs)
+{
+ irqentry_exit_cond_resched_internal();
+}
+
noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
{
lockdep_assert_irqs_disabled();
@@ -425,7 +430,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
instrumentation_begin();
if (IS_ENABLED(CONFIG_PREEMPTION))
- irqentry_exit_cond_resched();
+ irqentry_exit_cond_resched_internal();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 189999007f32..38dd74ba1c86 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8419,29 +8419,29 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* SC:might_resched
* SC:preempt_schedule
* SC:preempt_schedule_notrace
- * SC:irqentry_exit_cond_resched
+ * SC:irqentry_exit_cond_resched_internal
*
*
* NONE:
- * cond_resched <- __cond_resched
- * might_resched <- RET0
- * preempt_schedule <- NOP
- * preempt_schedule_notrace <- NOP
- * irqentry_exit_cond_resched <- NOP
+ * cond_resched <- __cond_resched
+ * might_resched <- RET0
+ * preempt_schedule <- NOP
+ * preempt_schedule_notrace <- NOP
+ * irqentry_exit_cond_resched_internal <- NOP
*
* VOLUNTARY:
- * cond_resched <- __cond_resched
- * might_resched <- __cond_resched
- * preempt_schedule <- NOP
- * preempt_schedule_notrace <- NOP
- * irqentry_exit_cond_resched <- NOP
+ * cond_resched <- __cond_resched
+ * might_resched <- __cond_resched
+ * preempt_schedule <- NOP
+ * preempt_schedule_notrace <- NOP
+ * irqentry_exit_cond_resched_internal <- NOP
*
* FULL:
- * cond_resched <- RET0
- * might_resched <- RET0
- * preempt_schedule <- preempt_schedule
- * preempt_schedule_notrace <- preempt_schedule_notrace
- * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * cond_resched <- RET0
+ * might_resched <- RET0
+ * preempt_schedule <- preempt_schedule
+ * preempt_schedule_notrace <- preempt_schedule_notrace
+ * irqentry_exit_cond_resched_internal <- irqentry_exit_cond_resched_internal
*/
enum {
@@ -8487,7 +8487,7 @@ void sched_dynamic_update(int mode)
preempt_dynamic_enable(might_resched);
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
- preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_enable(irqentry_exit_cond_resched_internal);
switch (mode) {
case preempt_dynamic_none:
@@ -8495,7 +8495,7 @@ void sched_dynamic_update(int mode)
preempt_dynamic_disable(might_resched);
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
- preempt_dynamic_disable(irqentry_exit_cond_resched);
+ preempt_dynamic_disable(irqentry_exit_cond_resched_internal);
pr_info("Dynamic Preempt: none\n");
break;
@@ -8504,7 +8504,7 @@ void sched_dynamic_update(int mode)
preempt_dynamic_enable(might_resched);
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
- preempt_dynamic_disable(irqentry_exit_cond_resched);
+ preempt_dynamic_disable(irqentry_exit_cond_resched_internal);
pr_info("Dynamic Preempt: voluntary\n");
break;
@@ -8513,7 +8513,7 @@ void sched_dynamic_update(int mode)
preempt_dynamic_disable(might_resched);
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
- preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_enable(irqentry_exit_cond_resched_internal);
pr_info("Dynamic Preempt: full\n");
break;
}
--
2.35.3
From: Ira Weiny <[email protected]>
The CPU on which an exception occurred is useful in determining where bad
CPUs are in a large data center.
Define arch_save_aux_pt_regs() and make ARCH_HAS_PTREGS_AUXILIARY default
to yes.
Store the CPU on exception entry and use it later.
Cc: Rik van Riel <[email protected]>
Suggested-by: Borislav Petkov <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
arch/x86/Kconfig | 2 +-
arch/x86/include/asm/entry-common.h | 12 ++++++++++++
arch/x86/include/asm/ptrace.h | 1 +
arch/x86/mm/fault.c | 4 ++--
4 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b35f6a472e09..707650a6ecb2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1876,7 +1876,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
config ARCH_HAS_PTREGS_AUXILIARY
depends on X86_64
- bool
+ def_bool y
choice
prompt "TSX enable mode"
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 674ed46d3ced..eb145106929a 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -95,4 +95,16 @@ static __always_inline void arch_exit_to_user_mode(void)
}
#define arch_exit_to_user_mode arch_exit_to_user_mode
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
+{
+ struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+
+ aux_pt_regs->cpu = raw_smp_processor_id();
+}
+#define arch_save_aux_pt_regs arch_save_aux_pt_regs
+
+#endif
+
#endif
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 5a9c85893459..b403b469996f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -97,6 +97,7 @@ struct pt_regs {
* ARCH_HAS_PTREGS_AUXILIARY. Failure to do so will result in a build failure.
*/
struct pt_regs_auxiliary {
+ int cpu;
};
struct pt_regs_extended {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 82cf23975aa1..5df99fe49494 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -768,9 +768,9 @@ static inline void
show_signal_msg(struct pt_regs *regs, unsigned long error_code,
unsigned long address, struct task_struct *tsk)
{
+ struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
- /* This is a racy snapshot, but it's better than nothing. */
- int cpu = raw_smp_processor_id();
+ int cpu = aux_pt_regs->cpu;
if (!unhandled_signal(tsk, SIGSEGV))
return;
--
2.35.3
From: Ira Weiny <[email protected]>
Some architectures have auxiliary pt_regs space available to store
information on the stack during exceptions. This information is easier
to obtain and store in C code than in arch-specific assembly.
Define empty stubs for the architecture-specific save and restore auxiliary
pt_regs functions. Call these functions on generic entry/exit.
NOTE: Due to the split nature of the Xen exit code
irqentry_exit_cond_resched() requires an unbalanced call to
arch_restore_aux_pt_regs().
Cc: Rik van Riel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Forward ported from PKS series
https://lore.kernel.org/lkml/[email protected]/
---
include/linux/entry-common.h | 8 ++++++++
kernel/entry/common.c | 16 ++++++++++++++--
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 976cce7cf803..1c09ba64ad28 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -79,6 +79,14 @@ static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs);
static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs) {}
#endif
+#ifndef arch_save_aux_pt_regs
+static inline void arch_save_aux_pt_regs(struct pt_regs *regs) { }
+#endif
+
+#ifndef arch_restore_aux_pt_regs
+static inline void arch_restore_aux_pt_regs(struct pt_regs *regs) { }
+#endif
+
/**
* enter_from_user_mode - Establish state when coming from user mode
*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8c0f334c4b75..a70a0f314aee 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -317,7 +317,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
- return ret;
+ goto aux_save;
}
/*
@@ -356,7 +356,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
instrumentation_end();
ret.exit_rcu = true;
- return ret;
+ goto aux_save;
}
/*
@@ -371,6 +371,11 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
instrumentation_end();
+aux_save:
+ instrumentation_begin();
+ arch_save_aux_pt_regs(regs);
+ instrumentation_end();
+
return ret;
}
@@ -401,6 +406,7 @@ void dynamic_irqentry_exit_cond_resched(void)
void irqentry_exit_cond_resched(struct pt_regs *regs)
{
+ arch_restore_aux_pt_regs(regs);
irqentry_exit_cond_resched_internal();
}
@@ -408,6 +414,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
{
lockdep_assert_irqs_disabled();
+ instrumentation_begin();
+ arch_restore_aux_pt_regs(regs);
+ instrumentation_end();
+
/* Check whether this returns to user mode */
if (user_mode(regs)) {
irqentry_exit_to_user_mode(regs);
@@ -459,6 +469,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
instrumentation_begin();
trace_hardirqs_off_finish();
ftrace_nmi_enter();
+ arch_save_aux_pt_regs(regs);
instrumentation_end();
return irq_state;
@@ -467,6 +478,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
{
instrumentation_begin();
+ arch_restore_aux_pt_regs(regs);
ftrace_nmi_exit();
if (irq_state.lockdep) {
trace_hardirqs_on_prepare();
--
2.35.3
On Fri, 2022-08-05 at 10:30 -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> Auxiliary pt_regs space needs to be manipulated by the generic
> entry/exit code.
>
> Ideally irqentry_exit() would take care of handling any auxiliary
> pt_regs on exit. Unfortunately, irqentry_exit() is not the only exit
> from exception path. The call to irqentry_exit_cond_resched() from
> xen_pv_evtchn_do_upcall() bypasses irqentry_exit().
>
> Make irqentry_exit_cond_resched() symmetrical with irqentry_enter()
> by
> passing pt_regs to it. This makes irqentry_exit_cond_resched()
> capable
> of handling auxiliary pt_regs in future patches.
>
> Cc: Rik van Riel <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Fri, 2022-08-05 at 10:30 -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> Some architectures have auxiliary pt_regs space available to store
> information on the stack during exceptions. This information is
> easier
> to obtain and store within C code rather than in arch specific
> assembly.
>
> Define empty calls to architecture specific save and restore
> auxiliary
> pt_regs functions. Call these functions on generic entry/exit.
>
> NOTE: Due to the split nature of the Xen exit code
> irqentry_exit_cond_resched() requires an unbalanced call to
> arch_restore_aux_pt_regs().
>
> Cc: Rik van Riel <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
Acked-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Fri, 2022-08-05 at 10:30 -0700, [email protected] wrote:
>
> Prepare the assembly code to add a hidden auxiliary pt_regs space.
> To
> simplify, the assembly code only adds space on the stack as defined
> by
> the C code which needs it. The use of this space is left to the C
> code
> which is required to select ARCH_HAS_PTREGS_AUXILIARY to enable this
> support.
>
Acked-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On Fri, 2022-08-05 at 10:30 -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> The CPU information of an exception is useful in determining where
> bad
> CPUs are in a large data center.
>
> Define arch_{save|restore}_auxiliary_pt_regs() and set
> ARCH_HAS_PTREGS_AUXILIARY default to yes.
>
> Store the CPU on exception entry and use it later.
>
> Cc: Rik van Riel <[email protected]>
> Suggested-by: Borislav Petkov <[email protected]>
> Suggested-by: Dave Hansen <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All Rights Reversed.
On 8/5/22 10:30, [email protected] wrote:
> +static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
> +{
> + struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
> +
> + aux_pt_regs->cpu = raw_smp_processor_id();
> +}
This is in a fast path that all interrupt and exception entry uses. So,
I was curious what the overhead is.
Code generation in irqentry_enter() gets a _bit_ more complicated
because arch_save_aux_pt_regs() has to be done on the way out of the
function and the compiler can't (for instance) do a
mov $0x1,%eax
ret
to return. But, the gist of the change is still only two instructions
that read a pretty hot, read-only per-cpu cacheline:
mov %gs:0x7e21fa4a(%rip),%eax # 15a38 <cpu_number>
mov %eax,-0x8(%rbx)
That doesn't seem too bad.
* Dave Hansen <[email protected]> wrote:
> On 8/5/22 10:30, [email protected] wrote:
> > +static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
> > +{
> > + struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
> > +
> > + aux_pt_regs->cpu = raw_smp_processor_id();
> > +}
>
> This is in a fast path that all interrupt and exception entry uses. So,
> I was curious what the overhead is.
>
> Code generation in irqentry_enter() gets a _bit_ more complicated
> because arch_save_aux_pt_regs() has to be done on the way out of the
> function and the compiler can't (for instance) do a
>
> mov $0x1,%eax
> ret
>
> to return. But, the gist of the change is still only two instructions
> that read a pretty hot, read-only per-cpu cacheline:
>
> mov %gs:0x7e21fa4a(%rip),%eax # 15a38 <cpu_number>
> mov %eax,-0x8(%rbx)
>
> That doesn't seem too bad.
It's still 2 instructions more than what we had before, while the
fault-time CPU number is only needed infrequently AFAICS.
Thanks,
Ingo
On Sat, Aug 06, 2022 at 11:01:06AM +0200, Ingo Molnar wrote:
> It's still 2 instructions more than what we had before, while the
> fault-time CPU number is only needed infrequently AFAICS.
With the amount of logical cores ever increasing and how CPU packages
(nodes, L3 sharing, you name it) get more and more complex topology,
I'd say the 2 insns to show the CPU number in every exception is a good
thing to do.
Arguably, we probably should've even done it already...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Borislav Petkov <[email protected]> wrote:
> On Sat, Aug 06, 2022 at 11:01:06AM +0200, Ingo Molnar wrote:
> > It's still 2 instructions more than what we had before, while the
> > fault-time CPU number is only needed infrequently AFAICS.
>
> With the amount of logical cores ever increasing and how CPU packages
> (nodes, L3 sharing, you name it) get more and more complex topology,
> I'd say the 2 insns to show the CPU number in every exception is a good
> thing to do.
We can show it - I'm arguing against extracting it too early, which costs
us 2 instructions in the exception fast path - while in 99.999999999% of
the cases we don't use that field at all ...
> Arguably, we probably should've even done it already...
Yeah, so I'm not against Rik's patch that prints the CPU number - that's
indeed useful and I'd like to see it merged.
I'm arguing against extracting the CPU so early as to impact the exception
fast path.
Thanks,
Ingo
On Sun, Aug 07, 2022 at 12:02:41PM +0200, Ingo Molnar wrote:
> * Borislav Petkov <[email protected]> wrote:
> > With the amount of logical cores ever increasing and how CPU packages
> > (nodes, L3 sharing, you name it) get more and more complex topology,
> > I'd say the 2 insns to show the CPU number in every exception is a good
> > thing to do.
>
> We can show it - I'm arguing against extracting it too early, which costs
Not early - more correct. We can say which CPU executed the exception
handler *exactly*. Not which CPU executed the exception handler *maybe*.
> us 2 instructions in the exception fast path
2 insns? They don't matter at all. FWIW, they'll pull in the per-CPU
cacheline earlier which should be a net win later, for code which does
smp_processor_id().
> - while in 99.999999999% of the cases we don't use that field at all ...
See my text above about the ever-increasing complexity of CPU topology.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Sun, Aug 07, 2022 at 12:35:03PM +0200, Borislav Petkov wrote:
> On Sun, Aug 07, 2022 at 12:02:41PM +0200, Ingo Molnar wrote:
> > * Borislav Petkov <[email protected]> wrote:
> > > With the amount of logical cores ever increasing and how CPU packages
> > > (nodes, L3 sharing, you name it) get more and more complex topology,
> > > I'd say the 2 insns to show the CPU number in every exception is a good
> > > thing to do.
> >
> > We can show it - I'm arguing against extracting it too early, which costs
>
> Not early - more correct. We can say which CPU executed the exception
> handler *exactly*. Not which CPU executed the exception handler *maybe*.
>
> > us 2 instructions in the exception fast path
>
> 2 insns? They don't matter at all. FWIW, they'll pull in the per-CPU
> cacheline earlier which should be a net win later, for code which does
> smp_processor_id().
I agree with Boris; however I feel that I have to mention that in patch 3/5 you
also have 1 instruction on each of entry and exit to push the extra stack
space. So all told it would cost 4 instructions.
Again, I don't believe this is too much overhead but I don't want people to say
it was not discussed.
Ira
>
> > - while in 99.999999999% of the cases we don't use that field at all ...
>
> See my text above about the ever-increasing complexity of CPU topology.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Aug 05, 2022 at 10:30:05AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> Auxiliary pt_regs space needs to be manipulated by the generic
> entry/exit code.
>
> Ideally irqentry_exit() would take care of handling any auxiliary
> pt_regs on exit. Unfortunately, irqentry_exit() is not the only exit
> from exception path. The call to irqentry_exit_cond_resched() from
> xen_pv_evtchn_do_upcall() bypasses irqentry_exit().
>
> Make irqentry_exit_cond_resched() symmetrical with irqentry_enter() by
> passing pt_regs to it. This makes irqentry_exit_cond_resched() capable
> of handling auxiliary pt_regs in future patches.
>
> Cc: Rik van Riel <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Forward ported from PKS series:
> https://lore.kernel.org/lkml/[email protected]/
> ---
> arch/arm64/include/asm/preempt.h | 2 +-
> arch/arm64/kernel/entry-common.c | 4 ++--
> arch/x86/entry/common.c | 2 +-
> include/linux/entry-common.h | 17 ++++++++------
> kernel/entry/common.c | 13 +++++++----
> kernel/sched/core.c | 40 ++++++++++++++++----------------
> 6 files changed, 43 insertions(+), 35 deletions(-)
Why all this churn?
Why can't you add a parameter to irqentry_exit():
noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state, bool cond_resched);
and then have all callers except xen_pv_evtchn_do_upcall() pass in false
and this way have all exit paths end up in irqentry_exit()?
And, ofc, move the true case which is the body of
raw_irqentry_exit_cond_resched() to irqentry_exit() and then get rid of
former.
xen_pv_evtchn_do_upcall() will, ofc, do:
if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
irqentry_exit(regs, state, true);
instrumentation_end();
restore_inhcall(inhcall);
} else {
instrumentation_end();
irqentry_exit(regs, state, false);
Hmmm?
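For readers following along, the control flow Boris is proposing can be sketched outside the kernel with stub types. This is a hedged userspace illustration only: the type and function names mirror the kernel's, but the bodies are stand-ins, not the real entry code.

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel types; illustration only. */
struct pt_regs { int dummy; };
typedef struct { bool exit_rcu; } irqentry_state_t;

static int resched_ran;   /* records whether the cond_resched body executed */
static int exits_taken;   /* records every pass through the single exit path */

/*
 * Sketch of the suggestion: fold the body of the former
 * raw_irqentry_exit_cond_resched() into irqentry_exit() behind a flag,
 * so every exit path funnels through one function.
 */
static void irqentry_exit(struct pt_regs *regs, irqentry_state_t state,
			  bool cond_resched)
{
	(void)regs;
	(void)state;
	if (cond_resched)
		resched_ran++;	/* body of the former cond_resched helper */
	exits_taken++;		/* common exit work lives here exactly once */
}
```

With this shape, xen_pv_evtchn_do_upcall() is the only caller passing true, and all exit paths share the common tail.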
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Ira Weiny <[email protected]> wrote:
> On Sun, Aug 07, 2022 at 12:35:03PM +0200, Borislav Petkov wrote:
> > On Sun, Aug 07, 2022 at 12:02:41PM +0200, Ingo Molnar wrote:
> > > * Borislav Petkov <[email protected]> wrote:
> > > > With the amount of logical cores ever increasing and how CPU packages
> > > > (nodes, L3 sharing, you name it) get more and more complex topology,
> > > > I'd say the 2 insns to show the CPU number in every exception is a good
> > > > thing to do.
> > >
> > > We can show it - I'm arguing against extracting it too early, which costs
> >
> > Not early - more correct. We can say which CPU executed the exception
> > handler *exactly*. Not which CPU executed the exception handler *maybe*.
> >
> > > us 2 instructions in the exception fast path
> >
> > 2 insns? They don't matter at all. FWIW, they'll pull in the per-CPU
> > cacheline earlier which should be a net win later, for code which does
> > smp_processor_id().
I'd like to hear what Andy Lutomirski thinks about the notion that
"2 instructions don't matter at all" ...
Especially since it's now 4 instructions:
> I agree with Boris; however I feel that I have to mention that in patch
> 3/5 you also have 1 instruction on each of entry and exit to push the
> extra stack space. So all told it would cost 4 instructions.
... 4 instructions in the exception path is a non-trivial impact.
> Again, I don't believe this is too much overhead but I don't want people
> to say it was not discussed.
Is it necessary to do this, what are the alternatives, can this overhead be
avoided?
Thanks,
Ingo
On Mon, Aug 08, 2022 at 01:03:24PM +0200, Ingo Molnar wrote:
> I'd like to hear what Andy Lutomirski thinks about the notion that
> "2 instructions don't matter at all" ...
>
> Especially since it's now 4 instructions:
He wasn't opposed to it when we talked on IRC last week.
> ... 4 instructions in the exception path is a non-trivial impact.
How do I measure this "impact"?
Hell, we recently added retbleed - and IBRS especially on Intel - on
the entry path which is whopping 30% perf impact in some cases. And
now we're arguing about a handful of insns. I'm sceptical they'll be
anything else but "in-the-noise" in any sensible workload.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 8/8/22 04:03, Ingo Molnar wrote:
>> Again, I don't believe this is too much overhead but I don't want people
>> to say it was not discussed.
> Is it necessary to do this, what are the alternatives, can this overhead be
> avoided?
One thing Andy mentioned is that we _could_ get it down to two instructions:
rdgsbase $reg
push $reg
This could be hidden in:
PUSH_PTREGS_AUXILIARY
where, today, it would only net add a single instruction. But, if we
ever add more stuff to PUSH_PTREGS_AUXILIARY, it would move back to
needing two instructions since we'd need both the:
subq $PTREGS_AUX_SIZE, %rsp
and something to write gsbase to the stack.
That doesn't get us the smp_processor_id() directly, but we can derive
it later on from the gsbase value.
The downside is that we're doing it in assembly. We'd also have
something additional which is a bit uglier and that reads memory on
!X86_FEATURE_FSGSBASE, probably:
movq PER_CPU_VAR(cpu_number), %reg
push %reg
Which would require some different code to decode what was there:
int read_exception_cpu_number(ext_pt_regs *e)
{
if (cpu_feature_enabled(X86_FEATURE_FSGSBASE))
return gsbase_to_cpu_number(e->ext_cpu_nr);
else
return e->ext_cpu_nr;
}
I'm thinking that the whole racy smp_processor_id() thing wasn't so bad
in the first place.
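Dave's decode helper above can be made concrete as a userspace sketch. The feature check and the gsbase-to-CPU mapping are stubbed with made-up values here; only the branching structure reflects his pseudocode.

```c
#include <stdbool.h>

/* Stub standing in for cpu_feature_enabled(X86_FEATURE_FSGSBASE). */
static bool fsgsbase_enabled;

/*
 * Pretend gsbase values are per-CPU areas spaced 0x1000 apart from a
 * fake base -- a stand-in for the real per_cpu_offset arithmetic.
 */
static int gsbase_to_cpu_number(unsigned long long gsbase)
{
	return (int)((gsbase - 0xffff888000000000ULL) / 0x1000);
}

struct ext_pt_regs {
	unsigned long long ext_cpu_nr; /* raw CPU number, or raw gsbase */
};

/* Mirror of the sketch: decode whichever value the entry code stashed. */
static int read_exception_cpu_number(struct ext_pt_regs *e)
{
	if (fsgsbase_enabled)
		return gsbase_to_cpu_number(e->ext_cpu_nr);
	else
		return (int)e->ext_cpu_nr;
}
```

The cost of the !FSGSBASE fallback is pushed from the entry path to the rare decode path, which is the trade-off being weighed here.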
On Mon, 2022-08-08 at 09:16 -0700, Dave Hansen wrote:
> On 8/8/22 04:03, Ingo Molnar wrote:
> > > Again, I don't believe this is too much overhead but I don't want
> > > people
> > > to say it was not discussed.
> > Is it necessary to do this, what are the alternatives, can this
> > overhead be
> > avoided?
>
> I'm thinking that the whole racy smp_processor_id() thing wasn't so
> bad
> in the first place.
>
FWIW, just grabbing the CPU number in show_signal_msg()
appears to be good enough for our use.
It will typically show >90% of the errors happening on the
CPU core that went bad, which is more than enough to diagnose
that a server has a hardware issue and should probably have
the CPU repaired.
--
All Rights Reversed.
On Mon, Aug 08, 2022 at 10:34:19AM -0700, Ira Weiny wrote:
> I thought about that but generally have been steered away from using bool
> arguments like this.
The reason being?
> Looking this over I think this may be a reasonable use of a flag like
> that.
>
> I'm not sure if this series is going to land given the discussion of
> performance.
Well, the only way to talk performance is to have something to measure
it with. But let me finish going through the rest of your set first.
> Would this clean up be welcome on its own? I'm willing to respin
> this patch and submit if so.
Depends on who steered you away from the bool thing and why?
:-)
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Aug 08, 2022 at 10:43:35AM -0700, Dave Hansen wrote:
> Might have been me. Function calls that look like this:
>
> foo(&ptr, false, true, false, true, 1, 0);
>
> are incomprehensible. A true/false is effectively a magic number here
> and you have to go looking at the code implementing 'foo()' or at least
> the declaration hoping that the variable names help (if the declaration
> has variable names).
Yap, agreed.
It would start getting on my nerves after the second bool. :)
> I think I've encouraged Ira to do something like this instead:
>
> enum foo_mode {
> MODE_BAR,
> MODE_BAZ
> }
>
> where the call ends up looking like:
>
> foo(&ptr, MODE_BAR);
>
> which is much more self-documenting.
Yap, that's much better.
I suggested the bool thing in thinking that this would be the only
exception to the usage, i.e., a one-off thing. I probably should talk
to Jürgen whether we even need this one-off thing and maybe solve it
differently.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 8/8/22 10:38, Borislav Petkov wrote:
> On Mon, Aug 08, 2022 at 10:34:19AM -0700, Ira Weiny wrote:
>> I thought about that but generally have been steered away from using bool
>> arguments like this.
> The reason being?
Might have been me. Function calls that look like this:
foo(&ptr, false, true, false, true, 1, 0);
are incomprehensible. A true/false is effectively a magic number here
and you have to go looking at the code implementing 'foo()' or at least
the declaration hoping that the variable names help (if the declaration
has variable names).
I think I've encouraged Ira to do something like this instead:
enum foo_mode {
MODE_BAR,
MODE_BAZ
}
where the call ends up looking like:
foo(&ptr, MODE_BAR);
which is much more self-documenting.
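Dave's readability point can be seen in a small compilable sketch; the enum and function names below are hypothetical, not from the series.

```c
/* Hypothetical mode enum: the call site documents itself. */
enum resched_mode {
	RESCHED_NONE,
	RESCHED_COND,
};

static int last_mode = -1;

/*
 * With an enum, foo(&x, RESCHED_COND) reads clearly at the call site,
 * whereas foo(&x, true) forces the reader to dig up the prototype to
 * learn what the bare boolean means.
 */
static void foo(int *ptr, enum resched_mode mode)
{
	(void)ptr;
	last_mode = mode;
}
```

The generated code is identical either way; the win is purely for the reader at the call site.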
On Mon, Aug 08, 2022 at 12:38:08PM +0200, Borislav Petkov wrote:
> On Fri, Aug 05, 2022 at 10:30:05AM -0700, [email protected] wrote:
> > From: Ira Weiny <[email protected]>
> >
> > Auxiliary pt_regs space needs to be manipulated by the generic
> > entry/exit code.
> >
> > Ideally irqentry_exit() would take care of handling any auxiliary
> > pt_regs on exit. Unfortunately, irqentry_exit() is not the only exit
> > from exception path. The call to irqentry_exit_cond_resched() from
> > xen_pv_evtchn_do_upcall() bypasses irqentry_exit().
> >
> > Make irqentry_exit_cond_resched() symmetrical with irqentry_enter() by
> > passing pt_regs to it. This makes irqentry_exit_cond_resched() capable
> > of handling auxiliary pt_regs in future patches.
> >
> > Cc: Rik van Riel <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Signed-off-by: Ira Weiny <[email protected]>
> >
> > ---
> > Forward ported from PKS series:
> > https://lore.kernel.org/lkml/[email protected]/
> > ---
> > arch/arm64/include/asm/preempt.h | 2 +-
> > arch/arm64/kernel/entry-common.c | 4 ++--
> > arch/x86/entry/common.c | 2 +-
> > include/linux/entry-common.h | 17 ++++++++------
> > kernel/entry/common.c | 13 +++++++----
> > kernel/sched/core.c | 40 ++++++++++++++++----------------
> > 6 files changed, 43 insertions(+), 35 deletions(-)
>
> Why all this churn?
>
> Why can't you add a parameter to irqentry_exit():
>
> noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state, bool cond_resched);
I thought about that but generally have been steered away from using bool
arguments like this.
>
> and then have all callers except xen_pv_evtchn_do_upcall() pass in false
> and this way have all exit paths end up in irqentry_exit()?
>
> And, ofc, move the true case which is the body of
> raw_irqentry_exit_cond_resched() to irqentry_exit() and then get rid of
> former.
Looking this over I think this may be a reasonable use of a flag like that.
I'm not sure if this series is going to land given the discussion of
performance. Would this clean up be welcome on its own? I'm willing to respin
this patch and submit if so.
Ira
>
> xen_pv_evtchn_do_upcall() will, ofc, do:
>
> if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> irqentry_exit(regs, state, true);
> instrumentation_end();
> restore_inhcall(inhcall);
> } else {
> instrumentation_end();
> irqentry_exit(regs, state, false);
>
> Hmmm?
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Aug 05, 2022 at 10:30:06AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> Some architectures have auxiliary pt_regs space available to store
> information on the stack during exceptions. This information is easier
> to obtain and store within C code rather than in arch specific assembly.
There are others?
Because I would've done this whole thing in arch/x86/ only...
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 8c0f334c4b75..a70a0f314aee 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -317,7 +317,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>
> if (user_mode(regs)) {
> irqentry_enter_from_user_mode(regs);
> - return ret;
> + goto aux_save;
Why do you have to goto and do the instrumentation sandwich around it
Why do you have to goto and do the instrumentation sandwich around it
at the goto label?
Why not simply do
if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
arch_save_aux_pt_regs(regs);
return ret;
and so on?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Aug 09, 2022 at 11:38:03AM -0700, Ira Weiny wrote:
> Thomas did a lot of work to make the entry code generic. So I was
> keeping that work consistent. This also helps to ensure I did not miss
> any places.
How about you worry about the other arches when you actually cross that
bridge?
> I don't believe this is correct because instrumentation is not enabled
> here.
Why do you have to run
arch_save_aux_pt_regs()
with instrumentation enabled?
Patch 5 does
+ struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+
+ aux_pt_regs->cpu = raw_smp_processor_id();
only?
Why does that need to run with instrumentation enabled?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Aug 09, 2022 at 02:05:15PM +0200, Borislav Petkov wrote:
> On Fri, Aug 05, 2022 at 10:30:06AM -0700, [email protected] wrote:
> > From: Ira Weiny <[email protected]>
> >
> > Some architectures have auxiliary pt_regs space available to store
> > information on the stack during exceptions. This information is easier
> > to obtain and store within C code rather than in arch specific assembly.
>
> There are others?
Other archs? not now.
>
> Because I would've done this whole thing in arch/x86/ only...
Thomas did a lot of work to make the entry code generic. So I was keeping that
work consistent. This also helps to ensure I did not miss any places.
>
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 8c0f334c4b75..a70a0f314aee 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -317,7 +317,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> >
> > if (user_mode(regs)) {
> > irqentry_enter_from_user_mode(regs);
> > - return ret;
> > + goto aux_save;
>
> Why do you have to goto and do the instrumentation sandwitch around it
> at the goto label?
>
> Why not simply do
>
> if (user_mode(regs)) {
> irqentry_enter_from_user_mode(regs);
> arch_save_aux_pt_regs(regs);
> return ret;
I don't believe this is correct because instrumentation is not enabled here.
See below.[1]
There are 3 exit paths from irqentry_enter(). All must call
arch_save_aux_pt_regs(). I felt the maintenance of the code would be easier if
I did not scatter those calls through the function and simply exited in one
place.[1]
FWICT calling instrumentation_{begin,end}() is a noop in production code. So
there is no real cost to calling instrumentation_begin() -> end -> begin ->
end. And the goto seemed low enough overhead. Given the current discussion I
admit I may have made the wrong choice. But I think I would add some comments
to the below to help future developers.
Ira
[1]
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -317,6 +317,9 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
+ instrumentation_begin();
+ arch_save_aux_pt_regs(regs);
+ instrumentation_end();
return ret;
}
@@ -353,6 +356,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
ct_irq_enter();
instrumentation_begin();
trace_hardirqs_off_finish();
+ arch_save_aux_pt_regs(regs);
instrumentation_end();
ret.exit_rcu = true;
@@ -369,6 +373,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
instrumentation_begin();
rcu_irq_enter_check_tick();
trace_hardirqs_off_finish();
+ arch_save_aux_pt_regs(regs);
instrumentation_end();
return ret;
On Mon, Aug 08 2022 at 14:01, Borislav Petkov wrote:
> On Mon, Aug 08, 2022 at 01:03:24PM +0200, Ingo Molnar wrote:
>> I'd like to hear what Andy Lutomirski thinks about the notion that
>> "2 instructions don't matter at all" ...
>>
>> Especially since it's now 4 instructions:
>
> He wasn't opposed to it when we talked on IRC last week.
>
>> ... 4 instructions in the exception path is a non-trivial impact.
>
> How do I measure this "impact"?
>
> Hell, we recently added retbleed - and IBRS especially on Intel - on
> the entry path which is whopping 30% perf impact in some cases. And
> now we're arguing about a handful of insns. I'm sceptical they'll be
> anything else but "in-the-noise" in any sensible workload.
I'm not worried about the 4 instructions per se, but storing the CPU
number on every exception and interrupt entry just to use it in exactly
one place, i.e. the user mode #PF handler, does not make any sense at
all.
Get the CPU number before enabling interrupts and hand it in to the
bad area variants.
I know that the aux reg code is required for other things, but just for
this it's complete overkill.
Thanks,
tglx
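Thomas' alternative, snapshotting the CPU before interrupts are enabled and handing it down to the bad area variants, can be sketched as follows. These are userspace stand-ins with hypothetical names; the real handler is exc_page_fault() with the show_signal_msg() path underneath.

```c
/* Stub for raw_smp_processor_id(); a fixed value for the sketch. */
static int raw_smp_processor_id(void) { return 5; }

static int reported_cpu = -1;

/*
 * The "bad area" reporting path takes the CPU as an argument instead of
 * re-deriving it later (possibly on a different CPU after migration).
 */
static void bad_area_report(unsigned long address, int cpu)
{
	(void)address;
	reported_cpu = cpu;
}

static void fake_exc_page_fault(unsigned long address)
{
	/* Snapshot while "interrupts are still disabled". */
	int cpu = raw_smp_processor_id();

	/* ... interrupts enabled, fault handled ... */
	bad_area_report(address, cpu);
}
```

This keeps the entry fast path untouched: the extra work happens only inside the page fault handler, not on every exception.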
On Tue, Aug 09 2022 at 20:49, Borislav Petkov wrote:
> On Tue, Aug 09, 2022 at 11:38:03AM -0700, Ira Weiny wrote:
>> Thomas did a lot of work to make the entry code generic. So I was
>> keeping that work consistent. This also helps to ensure I did not miss
>> any places.
>
> How about you worry about the other arches when you actually cross that
> bridge?
Ira is right. If we want it for everything, then the generic code is the
right place.
>> I don't believe this is correct because instrumentation is not enabled
>> here.
>
> Why do you have to run
>
> arch_save_aux_pt_regs()
>
> with instrumentation enabled?
>
> Patch 5 does
>
> + struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
> +
> + aux_pt_regs->cpu = raw_smp_processor_id();
>
> only?
>
> Why does that need to run with instrumentation enabled?
There is no reason and actually _if_ we go there then this _has_ to
happen _before_ instrumentation comes into play as instrumentation might
want to have access to extended pt_regs. Not necessarily the cpu number,
but the other things which need to go there for a real reason.
Thanks,
tglx
On Mon, Aug 8, 2022, at 9:16 AM, Dave Hansen wrote:
> On 8/8/22 04:03, Ingo Molnar wrote:
>>> Again, I don't believe this is too much overhead but I don't want people
>>> to say it was not discussed.
>> Is it necessary to do this, what are the alternatives, can this overhead be
>> avoided?
>
> One thing Andy mentioned is that we _could_ get it down to two instructions:
>
> rdgsbase $reg
> push $reg
>
> This could be hidden in:
>
> PUSH_PTREGS_AUXILIARY
>
> where, today, it would only net add a single instruction. But, if we
> ever add more stuff to PUSH_PTREGS_AUXILIARY, it would move back to
> needing two instructions since we'd need both the:
>
> subq $PTREGS_AUX_SIZE, %rsp
>
> and something to write gsbase to the stack.
>
> That doesn't get us the smp_processor_id() directly, but we can derive
> it later on from the gsbase value.
>
> The downside is that we're doing it in assembly. We'd also have
> something additional which is a bit uglier and that reads memory on
> !X86_FEATURE_FSGSBASE, probably:
>
> movq PER_CPU_VAR(cpu_number), %reg
> push %reg
Nah, I believe the same value that RDGSBASE reads is already in percpu memory as 'per_cpu_offset', so the alternative can just read that and the code that uses it doesn’t need to care about the alternative.
>
> Which would require some different code to decode what was there:
>
> int read_exception_cpu_number(ext_pt_regs *e)
> {
> if (cpu_feature_enabled(X86_FEATURE_FSGSBASE))
> return gsbase_to_cpu_number(e->ext_cpu_nr);
> else
> return e->ext_cpu_nr;
> }
>
> I'm thinking that the whole racy smp_processor_id() thing wasn't so bad
> in the first place.
On Tue, Aug 09, 2022 at 11:14:19PM +0200, Thomas Gleixner wrote:
> Ira is right. If we want it for everything, then the generic code is the
> right place.
But what is "everything"? Currently, and AFAIU, it is for the PKS use
case on x86 only.
I'm not saying it should not be done this way eventually - all I'm
saying is we should not design "preemptively" before it is really needed
for other arches.
Unless putting it in generic code makes it all simpler and cleaner to
do, that is.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Aug 09, 2022 at 08:49:47PM +0200, Borislav Petkov wrote:
> On Tue, Aug 09, 2022 at 11:38:03AM -0700, Ira Weiny wrote:
> > Thomas did a lot of work to make the entry code generic. So I was
> > keeping that work consistent. This also helps to ensure I did not miss
> > any places.
>
> How about you worry about the other arches when you actually cross that
> bridge?
I guess I misunderstood the reasoning behind Thomas' work. No one mentioned
trying to isolate this to x86 during the PKS review.
>
> > I don't believe this is correct because instrumentation is not enabled
> > here.
>
> Why do you have to run
>
> arch_save_aux_pt_regs()
>
> with instrumentation enabled?
Sorry, that was carried over from the PKS series where I did need to call it.
I pulled this series over quickly to show basically what needed to be done.
But I did not review it for this detail.
>
> Patch 5 does
>
> + struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
> +
> + aux_pt_regs->cpu = raw_smp_processor_id();
>
> only?
>
> Why does that need to run with instrumentation enabled?
This does not.
Am I wrong that instrumentation begin/end are 0 overhead in production builds?
Ira
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Aug 09, 2022 at 11:38:48PM +0200, Borislav Petkov wrote:
> On Tue, Aug 09, 2022 at 11:14:19PM +0200, Thomas Gleixner wrote:
> > Ira is right. If we want it for everything, then the generic code is the
> > right place.
>
> But what is "everything"? Currently, and AFAIU, it is for the PKS use
> case on x86 only.
>
> I'm not saying it should not be done this way eventually - all I'm
> saying is we should not design "preemptively" before it is really needed
> for other arches.
>
> Unless putting it in generic code makes it all simpler and cleaner to
> do, that is.
For the cpu use case we could limit the number of call sites. However, for PKS
the patch would have required changing x86 code in approximately 9 places for
the enter code.
$ git grep 'irqentry_enter(regs)' arch/x86 | wc -l
9
How about we drop patch 1 (I'll rework it to be less churn and submit it for
clean up separately because it will no longer be needed). Keep patch 3 as is.
Then combine 2 and 5 as below. The saving of the CPU can be lifted later if
needed.
Ira
commit 4c1d646888dd7471ae71a24109d587901a00f87d
Author: Ira Weiny <[email protected]>
Date: Mon Jan 10 15:06:07 2022 -0800
x86/entry: Store CPU info on exception entry
x86 has auxiliary pt_regs space available to store information on the
stack during exceptions. This information is easier to obtain and store
within C code.
The CPU information of a page fault is useful in determining where bad
CPUs are in a large data center.
Define aux_pt_regs_save_cpu() and set ARCH_HAS_PTREGS_AUXILIARY default
to yes.
Store the CPU on page fault entry and use it later.
Cc: Rik van Riel <[email protected]>
Suggested-by: Borislav Petkov <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
---
Changes from RFC:
New patch combining 2 and 5 from original series and modified.
Boris/Thomas - eliminate generic calls to save the cpu and call
only from exc_page_fault
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b35f6a472e09..707650a6ecb2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1876,7 +1876,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
config ARCH_HAS_PTREGS_AUXILIARY
depends on X86_64
- bool
+ def_bool y
choice
prompt "TSX enable mode"
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 5a9c85893459..b403b469996f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -97,6 +97,7 @@ struct pt_regs {
* ARCH_HAS_PTREGS_AUXILIARY. Failure to do so will result in a build failure.
*/
struct pt_regs_auxiliary {
+ int cpu;
};
struct pt_regs_extended {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 82cf23975aa1..b9b8344b69ad 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -768,9 +768,9 @@ static inline void
show_signal_msg(struct pt_regs *regs, unsigned long error_code,
unsigned long address, struct task_struct *tsk)
{
+ struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
- /* This is a racy snapshot, but it's better than nothing. */
- int cpu = raw_smp_processor_id();
+ int cpu = aux_pt_regs->cpu;
if (!unhandled_signal(tsk, SIGSEGV))
return;
@@ -1503,6 +1503,13 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
}
}
+static void aux_pt_regs_save_cpu(struct pt_regs *regs)
+{
+ struct pt_regs_auxiliary *aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+
+ aux_pt_regs->cpu = raw_smp_processor_id();
+}
+
DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
unsigned long address = read_cr2();
@@ -1546,6 +1553,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
*/
state = irqentry_enter(regs);
+ aux_pt_regs_save_cpu(regs);
instrumentation_begin();
handle_page_fault(regs, error_code, address);
instrumentation_end();
On Mon, Aug 08 2022 at 12:38, Borislav Petkov wrote:
> On Fri, Aug 05, 2022 at 10:30:05AM -0700, [email protected] wrote:
>> ---
>> arch/arm64/include/asm/preempt.h | 2 +-
>> arch/arm64/kernel/entry-common.c | 4 ++--
>> arch/x86/entry/common.c | 2 +-
>> include/linux/entry-common.h | 17 ++++++++------
>> kernel/entry/common.c | 13 +++++++----
>> kernel/sched/core.c | 40 ++++++++++++++++----------------
>> 6 files changed, 43 insertions(+), 35 deletions(-)
>
> Why all this churn?
>
> Why can't you add a parameter to irqentry_exit():
>
> noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state, bool cond_resched);
>
> and then have all callers except xen_pv_evtchn_do_upcall() pass in false
> and this way have all exit paths end up in irqentry_exit()?
>
> And, ofc, move the true case which is the body of
> raw_irqentry_exit_cond_resched() to irqentry_exit() and then get rid of
> former.
>
> xen_pv_evtchn_do_upcall() will, ofc, do:
>
> if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> irqentry_exit(regs, state, true);
> instrumentation_end();
> restore_inhcall(inhcall);
> } else {
> instrumentation_end();
> irqentry_exit(regs, state, false);
>
How is that less churn? Care to do:
git grep 'irqentry_exit(' arch/
The real question is:
Why is it required that irqentry_exit_cond_resched() handles any of
the auxiliary pt regs space?
That's completely unanswered by the changelog and absolutely irrelevant
for the problem at hand, i.e. storing the CPU number on irq/exception
entry.
So why is this patch in this series at all?
But even for future purposes it is more than questionable. Why?
Contrary to the claim of the changelog xen_pv_evtchn_do_upcall() is not
really a bypass of irqentry_exit(). The point is:
The hypercall is issued by the kernel from privcmd_ioctl_hypercall()
which does:
xen_preemptible_hcall_begin();
hypercall();
xen_preemptible_hcall_end();
So any upcall from the hypervisor to the guest will semantically hit
regular, interrupt-enabled guest kernel space, which means that if that
code called irqentry_exit() it would run through the kernel exit code
path:
} else if (!regs_irqs_disabled(regs)) {
instrumentation_begin();
if (IS_ENABLED(CONFIG_PREEMPTION))
irqentry_exit_cond_resched();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();
} ....
Which would fail to invoke irqentry_exit_cond_resched() if
CONFIG_PREEMPTION=n. But the whole point of this exercise is to allow
preemption from the upcall even when the kernel has CONFIG_PREEMPTION=n.
Staring at this XENPV code there are two issues:
1) That code lacks a trace_hardirqs_on() after the call to
irqentry_exit_cond_resched(). My bad.
2) CONFIG_PREEMPT_DYNAMIC broke this mechanism.
If the static call/key is disabled then the call becomes a NOP.
Frederic?
Both clearly demonstrate how well tested this XEN_PV muck is, which means
we should just delete it.
Anyway. This wants the fix below.
If there is still a need to do anything about this XEN cond_resched()
muck for the PREEMPTION=n case for future auxregs usage then this can be
simply hidden in this new XEN helper and does not need any change to the
rest of the code.
I doubt that this is required, but if so then there needs to be a very
concise explanation and not just incomprehensible hand-waving word
salad.
Thanks,
tglx
---
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -283,9 +283,18 @@ static __always_inline void restore_inhc
{
__this_cpu_write(xen_in_preemptible_hcall, inhcall);
}
+
+static __always_inline void xenpv_irqentry_exit_cond_resched(void)
+{
+ instrumentation_begin();
+ raw_irqentry_exit_cond_resched();
+ trace_hardirqs_on();
+ instrumentation_end();
+}
#else
static __always_inline bool get_and_clear_inhcall(void) { return false; }
static __always_inline void restore_inhcall(bool inhcall) { }
+static __always_inline void xenpv_irqentry_exit_cond_resched(void) { }
#endif
static void __xen_pv_evtchn_do_upcall(struct pt_regs *regs)
@@ -306,11 +315,11 @@ static void __xen_pv_evtchn_do_upcall(st
instrumentation_begin();
run_sysvec_on_irqstack_cond(__xen_pv_evtchn_do_upcall, regs);
+ instrumentation_end();
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
- irqentry_exit_cond_resched();
- instrumentation_end();
+ xenpv_irqentry_exit_cond_resched();
restore_inhcall(inhcall);
} else {
instrumentation_end();
On Tue, Aug 09 2022 at 14:49, Ira Weiny wrote:
> On Tue, Aug 09, 2022 at 08:49:47PM +0200, Borislav Petkov wrote:
>> Why does that need to run with instrumentation enabled?
>
> This does not.
>
> Am I wrong that instrumentation begin/end are 0 overhead in production builds?
instrumentation_begin/end() are annotations and never emit executable
code. Neither in production nor in debug builds. Why do you keep harping
on this question? Is it so hard to figure out what it does?
But that's not the question at all. The question is where auxregs need
to be set up.
Once the entry code leaves non-instrumentable code it's obviously
desired to have consistent state for instrumentation.
This obviously applies also to the auxreg space unless there is a
compelling reason why this is not required.
Thanks,
tglx
On Wed, Aug 10 2022 at 01:18, Thomas Gleixner wrote:
> Both clearly demonstrate how well tested this XEN_PV muck is which means
> we should just delete it.
>
> Anyway. This wants the fix below.
which is not sufficient. If CONFIG_PREEMPT_DYNAMIC is set then
CONFIG_PREEMPTION is set too, which makes the preemptible hypercall magic
a complete NOP. So if the dynamic call is disabled, then such a
several-seconds (sic!) hypercall cannot be preempted at all.
Something like the below then. Oh well...
Thanks,
tglx
---
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -253,7 +253,8 @@ SYSCALL_DEFINE0(ni_syscall)
}
#ifdef CONFIG_XEN_PV
-#ifndef CONFIG_PREEMPTION
+
+#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
/*
* Some hypercalls issued by the toolstack can take many 10s of
* seconds. Allow tasks running hypercalls via the privcmd driver to
@@ -283,9 +284,18 @@ static __always_inline void restore_inhc
{
__this_cpu_write(xen_in_preemptible_hcall, inhcall);
}
+
+static __always_inline void xenpv_irqentry_exit_cond_resched(void)
+{
+ instrumentation_begin();
+ raw_irqentry_exit_cond_resched();
+ trace_hardirqs_on();
+ instrumentation_end();
+}
#else
static __always_inline bool get_and_clear_inhcall(void) { return false; }
static __always_inline void restore_inhcall(bool inhcall) { }
+static __always_inline void xenpv_irqentry_exit_cond_resched(void) { }
#endif
static void __xen_pv_evtchn_do_upcall(struct pt_regs *regs)
@@ -306,11 +316,11 @@ static void __xen_pv_evtchn_do_upcall(st
instrumentation_begin();
run_sysvec_on_irqstack_cond(__xen_pv_evtchn_do_upcall, regs);
+ instrumentation_end();
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
- irqentry_exit_cond_resched();
- instrumentation_end();
+ xenpv_irqentry_exit_cond_resched();
restore_inhcall(inhcall);
} else {
instrumentation_end();
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -416,6 +416,7 @@ irqentry_state_t noinstr irqentry_enter(
* Conditional reschedule with additional sanity checks.
*/
void raw_irqentry_exit_cond_resched(void);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -194,7 +194,7 @@ bool xen_running_on_version_or_later(uns
void xen_efi_runtime_setup(void);
-#if defined(CONFIG_XEN_PV) && !defined(CONFIG_PREEMPTION)
+#if defined(CONFIG_XEN_PV) && (!defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC))
DECLARE_PER_CPU(bool, xen_in_preemptible_hcall);
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -385,6 +385,7 @@ void raw_irqentry_exit_cond_resched(void
preempt_schedule_irq();
}
}
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);