2023-04-10 08:41:36

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 00/33] x86: enable FRED for x86-64

This patch set enables FRED for x86-64.

The Intel flexible return and event delivery (FRED) architecture defines simple
new transitions that change privilege level (ring transitions). The FRED
architecture was designed with the following goals:
1) Improve overall performance and response time by replacing event delivery
through the interrupt descriptor table (IDT event delivery) and event return by
the IRET instruction with lower latency transitions.
2) Improve software robustness by ensuring that event delivery establishes the
full supervisor context and that event return establishes the full user context.

The new transitions defined by the FRED architecture are FRED event delivery and,
for returning from events, two FRED return instructions. FRED event delivery can
effect a transition from ring 3 to ring 0, but it is used also to deliver events
incident to ring 0. One FRED instruction (ERETU) effects a return from ring 0 to
ring 3, while the other (ERETS) returns while remaining in ring 0.

Search for the latest FRED spec in most search engines with this search pattern:

site:intel.com FRED (flexible return and event delivery) specification

As of now there is no publicly avaiable CPU supporting FRED, thus the Intel
SimicsĀ® Simulator is used as software development and testing vehicles. And
it can be downloaded from:
https://www.intel.com/content/www/us/en/developer/articles/tool/simics-simulator.html

To enable FRED, the Simics package 8112 QSP-CPU needs to be installed with CPU
model configured as:
$cpu_comp_class = "x86-experimental-fred"

Longer term, we should refactor common code shared by FRED and IDT into common
shared files, and contain IDT code using a new config CONFIG_X86_IDT.

Changes since v7:
* Always call external_interrupt() for VMX IRQ handling on x86_64, thus avoid
re-entering the noinstr code.
* Create a FRED stack frame when FRED is compiled-in but not enabled, which
uses some extra stack space but simplifies the code.
* Add a log message when FRED is enabled.

Changes since v6:
* Add a comment to explain why it is safe to write to a previous FRED stack
frame. (Lai Jiangshan).
* Export fred_entrypoint_kernel(), required when kvm-intel built as a module.
* Reserve a REDZONE for CALL emulation and Align RSP to a 64-byte boundary
before pushing a new FRED stack frame.
* Replace pt_regs csx flags prefix FRED_CSL_ with FRED_CSX_.

Changes since v5:
* Initialize system_interrupt_handlers with dispatch_table_spurious_interrupt()
instead of NULL to get rid of a branch (Peter Zijlstra).
* Disallow #DB inside #MCE for robustness sake (Peter Zijlstra).
* Add a comment for FRED stack level settings (Lai Jiangshan).
* Move the NMI bit from an invalid stack frame, which caused ERETU to fault,
to the fault handler's stack frame, thus to unblock NMI ASAP if NMI is blocked
(Lai Jiangshan).
* Refactor VMX_DO_EVENT_IRQOFF to handle IRQ/NMI in IRQ/NMI induced VM exits
when FRED is enabled (Sean Christopherson).

Changes since v4:
* Do NOT use the term "injection", which in the KVM context means to
reinject an event into the guest (Sean Christopherson).
* Add the explanation of why to execute "int $2" to invoke the NMI handler
in NMI caused VM exits (Sean Christopherson).
* Use cs/ss instead of csx/ssx when initializing the pt_regs structure
for calling external_interrupt(), otherwise it breaks i386 build.

Changes since v3:
* Call external_interrupt() to handle IRQ in IRQ caused VM exits.
* Execute "int $2" to handle NMI in NMI caused VM exits.
* Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
(Andrew Cooper).

Changes since v2:
* Improve comments for changes in arch/x86/include/asm/idtentry.h.

Changes since v1:
* call irqentry_nmi_{enter,exit}() in both IDT and FRED debug fault kernel
handler (Peter Zijlstra).
* Initialize a FRED exception handler to fred_bad_event() instead of NULL
if no FRED handler defined for an exception vector (Peter Zijlstra).
* Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
down into individual FRED exception handlers, instead of in the dispatch
framework (Peter Zijlstra).


H. Peter Anvin (Intel) (24):
x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR
x86/fred: make unions for the cs and ss fields in struct pt_regs
x86/traps: add a system interrupt table for system interrupt dispatch
x86/traps: add external_interrupt() to dispatch external interrupts
x86/cpufeature: add the cpu feature bit for FRED
x86/opcode: add ERETU, ERETS instructions to x86-opcode-map
x86/objtool: teach objtool about ERETU and ERETS
x86/cpu: add X86_CR4_FRED macro
x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED)
x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support
x86/cpu: add MSR numbers for FRED configuration
x86/fred: header file with FRED definitions
x86/fred: reserve space for the FRED stack frame
x86/fred: add a page fault entry stub for FRED
x86/fred: add a debug fault entry stub for FRED
x86/fred: add a NMI entry stub for FRED
x86/fred: FRED entry/exit and dispatch code
x86/fred: FRED initialization code
x86/fred: update MSR_IA32_FRED_RSP0 during task switch
x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is
enabled
x86/fred: disallow the swapgs instruction when FRED is enabled
x86/fred: no ESPFIX needed when FRED is enabled
x86/fred: allow single-step trap and NMI when starting a new thread
x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f

Xin Li (9):
x86/traps: add install_system_interrupt_handler()
x86/fred: header file for event types
x86/fred: add a machine check entry stub for FRED
x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user
x86/ia32: do not modify the DPL bits for a null selector
x86/fred: allow dynamic stack frame size
x86/fred: BUG() when ERETU with %rsp not equal to that when the ring 3
event was just delivered
x86/fred: disable FRED by default in its early stage
KVM: x86/vmx: refactor VMX_DO_EVENT_IRQOFF to generate FRED stack
frames

.../admin-guide/kernel-parameters.txt | 4 +
arch/x86/Kconfig | 9 +
arch/x86/entry/Makefile | 5 +-
arch/x86/entry/entry_32.S | 2 +-
arch/x86/entry/entry_64.S | 5 +
arch/x86/entry/entry_64_fred.S | 62 +++++
arch/x86/entry/entry_fred.c | 222 ++++++++++++++++++
arch/x86/entry/vsyscall/vsyscall_64.c | 2 +-
arch/x86/include/asm/asm-prototypes.h | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/entry-common.h | 3 +
arch/x86/include/asm/event-type.h | 17 ++
arch/x86/include/asm/extable_fixup_types.h | 4 +-
arch/x86/include/asm/fred.h | 150 ++++++++++++
arch/x86/include/asm/idtentry.h | 76 +++++-
arch/x86/include/asm/irq.h | 5 +
arch/x86/include/asm/irq_vectors.h | 15 +-
arch/x86/include/asm/msr-index.h | 13 +-
arch/x86/include/asm/processor.h | 12 +-
arch/x86/include/asm/ptrace.h | 34 ++-
arch/x86/include/asm/switch_to.h | 10 +-
arch/x86/include/asm/thread_info.h | 35 +--
arch/x86/include/asm/traps.h | 13 +
arch/x86/include/asm/vmx.h | 17 +-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/apic/apic.c | 11 +-
arch/x86/kernel/apic/vector.c | 8 +-
arch/x86/kernel/cpu/acrn.c | 7 +-
arch/x86/kernel/cpu/common.c | 88 ++++---
arch/x86/kernel/cpu/mce/core.c | 15 ++
arch/x86/kernel/cpu/mshyperv.c | 22 +-
arch/x86/kernel/espfix_64.c | 8 +
arch/x86/kernel/fred.c | 81 +++++++
arch/x86/kernel/head_32.S | 3 +-
arch/x86/kernel/idt.c | 6 +-
arch/x86/kernel/irq.c | 6 +-
arch/x86/kernel/irqinit.c | 7 +-
arch/x86/kernel/kvm.c | 4 +-
arch/x86/kernel/nmi.c | 19 ++
arch/x86/kernel/process.c | 5 +
arch/x86/kernel/process_64.c | 21 +-
arch/x86/kernel/signal_32.c | 21 +-
arch/x86/kernel/traps.c | 175 ++++++++++++--
arch/x86/kvm/vmx/vmenter.S | 78 +++++-
arch/x86/kvm/vmx/vmx.c | 12 +-
arch/x86/lib/x86-opcode-map.txt | 2 +-
arch/x86/mm/extable.c | 84 +++++++
arch/x86/mm/fault.c | 20 +-
drivers/xen/events/events_base.c | 5 +-
kernel/fork.c | 6 +
tools/arch/x86/include/asm/cpufeatures.h | 1 +
.../arch/x86/include/asm/disabled-features.h | 8 +-
tools/arch/x86/include/asm/msr-index.h | 13 +-
tools/arch/x86/lib/x86-opcode-map.txt | 2 +-
tools/objtool/arch/x86/decode.c | 19 +-
57 files changed, 1306 insertions(+), 179 deletions(-)
create mode 100644 arch/x86/entry/entry_64_fred.S
create mode 100644 arch/x86/entry/entry_fred.c
create mode 100644 arch/x86/include/asm/event-type.h
create mode 100644 arch/x86/include/asm/fred.h
create mode 100644 arch/x86/kernel/fred.c

--
2.34.1


2023-04-10 08:41:38

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

From: "H. Peter Anvin (Intel)" <[email protected]>

IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
into common_interrupt() just before the spurious interrupt handling.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/kernel/irq.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 766ffe3ba313..7e125fff45ab 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -248,6 +248,10 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
desc = __this_cpu_read(vector_irq[vector]);
if (likely(!IS_ERR_OR_NULL(desc))) {
handle_irq(desc, regs);
+#ifdef CONFIG_SMP
+ } else if (vector == IRQ_MOVE_CLEANUP_VECTOR) {
+ sysvec_irq_move_cleanup(regs);
+#endif
} else {
ack_APIC_irq();

--
2.34.1

2023-04-10 08:41:43

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 02/33] x86/fred: make unions for the cs and ss fields in struct pt_regs

From: "H. Peter Anvin (Intel)" <[email protected]>

Make the cs and ss fields in struct pt_regs unions between the actual
selector and the unsigned long stack slot. FRED uses this space to
store additional flags.

The printk changes are simply due to the cs and ss fields changed to
unsigned short from unsigned long.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v3:
* Rename csl/ssl of the pt_regs structure to csx/ssx (x for extended)
(Andrew Cooper).
---
arch/x86/entry/vsyscall/vsyscall_64.c | 2 +-
arch/x86/include/asm/ptrace.h | 34 ++++++++++++++++++++++++---
arch/x86/kernel/process_64.c | 2 +-
3 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index d234ca797e4a..2429ad0df068 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -76,7 +76,7 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
if (!show_unhandled_signals)
return;

- printk_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+ printk_ratelimited("%s%s[%d] %s ip:%lx cs:%x sp:%lx ax:%lx si:%lx di:%lx\n",
level, current->comm, task_pid_nr(current),
message, regs->ip, regs->cs,
regs->sp, regs->ax, regs->si, regs->di);
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index f4db78b09c8f..2abb23e6c1e2 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -82,12 +82,40 @@ struct pt_regs {
* On hw interrupt, it's IRQ number:
*/
unsigned long orig_ax;
-/* Return frame for iretq */
+/* Return frame for iretq/eretu/erets */
unsigned long ip;
- unsigned long cs;
+ union {
+ unsigned long csx; /* cs extended: CS + any fields above it */
+ struct __attribute__((__packed__)) {
+ unsigned short cs; /* CS selector proper */
+ unsigned int current_stack_level: 2;
+ unsigned int __csx_resv1 : 6;
+ unsigned int interrupt_shadowed : 1;
+ unsigned int software_initiated : 1;
+ unsigned int __csx_resv2 : 2;
+ unsigned int nmi : 1;
+ unsigned int __csx_resv3 : 3;
+ unsigned int __csx_resv4 : 32;
+ };
+ };
unsigned long flags;
unsigned long sp;
- unsigned long ss;
+ union {
+ unsigned long ssx; /* ss extended: SS + any fields above it */
+ struct __attribute__((__packed__)) {
+ unsigned short ss; /* SS selector proper */
+ unsigned int __ssx_resv1 : 16;
+ unsigned int vector : 8;
+ unsigned int __ssx_resv2 : 8;
+ unsigned int type : 4;
+ unsigned int __ssx_resv3 : 4;
+ unsigned int enclv : 1;
+ unsigned int long_mode : 1;
+ unsigned int nested : 1;
+ unsigned int __ssx_resv4 : 1;
+ unsigned int instr_len : 4;
+ };
+ };
/* top of stack page */
};

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index bb65a68b4b49..a1aa74864c8b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -116,7 +116,7 @@ void __show_regs(struct pt_regs *regs, enum show_regs_mode mode,

printk("%sFS: %016lx(%04x) GS:%016lx(%04x) knlGS:%016lx\n",
log_lvl, fs, fsindex, gs, gsindex, shadowgs);
- printk("%sCS: %04lx DS: %04x ES: %04x CR0: %016lx\n",
+ printk("%sCS: %04x DS: %04x ES: %04x CR0: %016lx\n",
log_lvl, regs->cs, ds, es, cr0);
printk("%sCR2: %016lx CR3: %016lx CR4: %016lx\n",
log_lvl, cr2, cr3, cr4);
--
2.34.1

2023-04-10 08:41:47

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 05/33] x86/traps: add external_interrupt() to dispatch external interrupts

From: "H. Peter Anvin (Intel)" <[email protected]>

Add external_interrupt() to dispatch external interrupts to their handlers.

If an external interrupt is a system interrupt, dipatch it through
system_interrupt_handlers table, otherwise to dispatch_common_interrupt().

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Co-developed-by: Xin Li <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v5:
* Initialize system_interrupt_handlers with dispatch_table_spurious_interrupt()
instead of NULL to get rid of a branch (Peter Zijlstra).
---
arch/x86/kernel/traps.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 12072e2af4a6..f86cd233b00b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1511,6 +1511,32 @@ static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {

#undef SYSV

+/*
+ * External interrupt dispatch function.
+ *
+ * Until/unless dispatch_common_interrupt() can be taught to deal with the
+ * special system vectors, split the dispatch.
+ *
+ * Note: dispatch_common_interrupt() already deals with IRQ_MOVE_CLEANUP_VECTOR.
+ */
+int external_interrupt(struct pt_regs *regs)
+{
+ unsigned int vector = regs->vector;
+ unsigned int sysvec = vector - FIRST_SYSTEM_VECTOR;
+
+ if (vector < FIRST_EXTERNAL_VECTOR) {
+ pr_err("invalid external interrupt vector %d\n", vector);
+ return -EINVAL;
+ }
+
+ if (sysvec < NR_SYSTEM_VECTORS)
+ system_interrupt_handlers[sysvec](regs);
+ else
+ dispatch_common_interrupt(regs, vector);
+
+ return 0;
+}
+
#endif /* CONFIG_X86_64 */

void __init install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr)
--
2.34.1

2023-04-10 08:41:59

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 06/33] x86/cpufeature: add the cpu feature bit for FRED

From: "H. Peter Anvin (Intel)" <[email protected]>

Add the CPU feature bit for FRED (Flexible Return and Event Delivery).

The Intel flexible return and event delivery (FRED) architecture defines simple
new transitions that change privilege level (ring transitions). The FRED
architecture was designed with the following goals:
1) Improve overall performance and response time by replacing event delivery
through the interrupt descriptor table (IDT event delivery) and event return by
the IRET instruction with lower latency transitions.
2) Improve software robustness by ensuring that event delivery establishes the
full supervisor context and that event return establishes the full user context.

The new transitions defined by the FRED architecture are FRED event delivery and,
for returning from events, two FRED return instructions. FRED event delivery can
effect a transition from ring 3 to ring 0, but it is used also to deliver events
incident to ring 0. One FRED instruction (ERETU) effects a return from ring 0 to
ring 3, while the other (ERETS) returns while remaining in ring 0.

Search for the latest FRED spec in most search engines with this search pattern:

site:intel.com FRED (flexible return and event delivery) specification

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
tools/arch/x86/include/asm/cpufeatures.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 73c9672c123b..1fa444478d33 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -318,6 +318,7 @@
#define X86_FEATURE_FZRM (12*32+10) /* "" Fast zero-length REP MOVSB */
#define X86_FEATURE_FSRS (12*32+11) /* "" Fast short REP STOSB */
#define X86_FEATURE_FSRC (12*32+12) /* "" Fast short REP {CMPSB,SCASB} */
+#define X86_FEATURE_FRED (12*32+17) /* Flexible Return and Event Delivery */
#define X86_FEATURE_LKGS (12*32+18) /* "" Load "kernel" (userspace) GS */
#define X86_FEATURE_AMX_FP16 (12*32+21) /* "" AMX fp16 Support */
#define X86_FEATURE_AVX_IFMA (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index b89005819cd5..e9064f4a011a 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -312,6 +312,7 @@
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
#define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */
#define X86_FEATURE_CMPCCXADD (12*32+ 7) /* "" CMPccXADD instructions */
+#define X86_FEATURE_FRED (12*32+17) /* Flexible Return and Event Delivery */
#define X86_FEATURE_LKGS (12*32+18) /* "" Load "kernel" (userspace) GS */
#define X86_FEATURE_AMX_FP16 (12*32+21) /* "" AMX fp16 Support */
#define X86_FEATURE_AVX_IFMA (12*32+23) /* "" Support for VPMADD52[H,L]UQ */
--
2.34.1

2023-04-10 08:42:04

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 08/33] x86/objtool: teach objtool about ERETU and ERETS

From: "H. Peter Anvin (Intel)" <[email protected]>

Update the objtool decoder to know about the ERETU and ERETS
instructions (type INSN_CONTEXT_SWITCH.)

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
tools/objtool/arch/x86/decode.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index 9ef024fd648c..8e9c802f78ec 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -509,11 +509,20 @@ int arch_decode_instruction(struct objtool_file *file, const struct section *sec

if (op2 == 0x01) {

- if (modrm == 0xca)
- insn->type = INSN_CLAC;
- else if (modrm == 0xcb)
- insn->type = INSN_STAC;
-
+ switch (insn_last_prefix_id(&ins)) {
+ case INAT_PFX_REPE:
+ case INAT_PFX_REPNE:
+ if (modrm == 0xca)
+ /* eretu/erets */
+ insn->type = INSN_CONTEXT_SWITCH;
+ break;
+ default:
+ if (modrm == 0xca)
+ insn->type = INSN_CLAC;
+ else if (modrm == 0xcb)
+ insn->type = INSN_STAC;
+ break;
+ }
} else if (op2 >= 0x80 && op2 <= 0x8f) {

insn->type = INSN_JUMP_CONDITIONAL;
--
2.34.1

2023-04-10 08:42:07

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 04/33] x86/traps: add install_system_interrupt_handler()

Some kernel components install system interrupt handlers into the IDT,
and we need to do the same for system_interrupt_handlers. A new function
install_system_interrupt_handler() is added to install a system interrupt
handler into both the IDT and system_interrupt_handlers.

Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/traps.h | 2 ++
arch/x86/kernel/cpu/acrn.c | 7 +++++--
arch/x86/kernel/cpu/mshyperv.c | 22 ++++++++++++++--------
arch/x86/kernel/kvm.c | 4 +++-
arch/x86/kernel/traps.c | 10 ++++++++++
drivers/xen/events/events_base.c | 5 ++++-
6 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 28c8ba5fd81c..46f5e4e2a346 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -41,6 +41,8 @@ void math_emulate(struct math_emu_info *);

bool fault_in_kernel_space(unsigned long address);

+void install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr);
+
#ifdef CONFIG_VMAP_STACK
void __noreturn handle_stack_overflow(struct pt_regs *regs,
unsigned long fault_address,
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
index 485441b7f030..9351bf183a9e 100644
--- a/arch/x86/kernel/cpu/acrn.c
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -18,6 +18,7 @@
#include <asm/hypervisor.h>
#include <asm/idtentry.h>
#include <asm/irq_regs.h>
+#include <asm/traps.h>

static u32 __init acrn_detect(void)
{
@@ -26,8 +27,10 @@ static u32 __init acrn_detect(void)

static void __init acrn_init_platform(void)
{
- /* Setup the IDT for ACRN hypervisor callback */
- alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_acrn_hv_callback);
+ /* Install system interrupt handler for ACRN hypervisor callback */
+ install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
+ asm_sysvec_acrn_hv_callback,
+ sysvec_acrn_hv_callback);

x86_platform.calibrate_tsc = acrn_get_tsc_khz;
x86_platform.calibrate_cpu = acrn_get_tsc_khz;
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f1197366a97d..c102c9192cbb 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -29,6 +29,7 @@
#include <asm/i8259.h>
#include <asm/apic.h>
#include <asm/timer.h>
+#include <asm/traps.h>
#include <asm/reboot.h>
#include <asm/nmi.h>
#include <clocksource/hyperv_timer.h>
@@ -491,19 +492,24 @@ static void __init ms_hyperv_init_platform(void)
*/
x86_platform.apic_post_init = hyperv_init;
hyperv_setup_mmu_ops();
- /* Setup the IDT for hypervisor callback */
- alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_hyperv_callback);

- /* Setup the IDT for reenlightenment notifications */
+ /* Install system interrupt handler for hypervisor callback */
+ install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
+ asm_sysvec_hyperv_callback,
+ sysvec_hyperv_callback);
+
+ /* Install system interrupt handler for reenlightenment notifications */
if (ms_hyperv.features & HV_ACCESS_REENLIGHTENMENT) {
- alloc_intr_gate(HYPERV_REENLIGHTENMENT_VECTOR,
- asm_sysvec_hyperv_reenlightenment);
+ install_system_interrupt_handler(HYPERV_REENLIGHTENMENT_VECTOR,
+ asm_sysvec_hyperv_reenlightenment,
+ sysvec_hyperv_reenlightenment);
}

- /* Setup the IDT for stimer0 */
+ /* Install system interrupt handler for stimer0 */
if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE) {
- alloc_intr_gate(HYPERV_STIMER0_VECTOR,
- asm_sysvec_hyperv_stimer0);
+ install_system_interrupt_handler(HYPERV_STIMER0_VECTOR,
+ asm_sysvec_hyperv_stimer0,
+ sysvec_hyperv_stimer0);
}

# ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1cceac5984da..5c684df6de7a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -829,7 +829,9 @@ static void __init kvm_guest_init(void)

if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) {
static_branch_enable(&kvm_async_pf_enabled);
- alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_kvm_asyncpf_interrupt);
+ install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
+ asm_sysvec_kvm_asyncpf_interrupt,
+ sysvec_kvm_asyncpf_interrupt);
}

#ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2cbe7e7e8b96..12072e2af4a6 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1513,6 +1513,16 @@ static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {

#endif /* CONFIG_X86_64 */

+void __init install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr)
+{
+ BUG_ON(n < FIRST_SYSTEM_VECTOR);
+
+#ifdef CONFIG_X86_64
+ system_interrupt_handlers[n - FIRST_SYSTEM_VECTOR] = (system_interrupt_handler)addr;
+#endif
+ alloc_intr_gate(n, asm_addr);
+}
+
void __init trap_init(void)
{
/* Init cpu_entry_area before IST entries are set up */
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c7715f8bd452..cf1a5ca3bf62 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -45,6 +45,7 @@
#include <asm/irq.h>
#include <asm/io_apic.h>
#include <asm/i8259.h>
+#include <asm/traps.h>
#include <asm/xen/cpuid.h>
#include <asm/xen/pci.h>
#endif
@@ -2249,7 +2250,9 @@ static __init void xen_alloc_callback_vector(void)
return;

pr_info("Xen HVM callback vector for event delivery is enabled\n");
- alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_xen_hvm_callback);
+ install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
+ asm_sysvec_xen_hvm_callback,
+ sysvec_xen_hvm_callback);
}
#else
void xen_setup_callback_vector(void) {}
--
2.34.1

2023-04-10 08:42:09

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 10/33] x86/fred: add Kconfig option for FRED (CONFIG_X86_FRED)

From: "H. Peter Anvin (Intel)" <[email protected]>

Add the configuration option CONFIG_X86_FRED to enable FRED.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/Kconfig | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..da62178bb246 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -500,6 +500,15 @@ config X86_CPU_RESCTRL

Say N if unsure.

+config X86_FRED
+ bool "Flexible Return and Event Delivery"
+ depends on X86_64
+ help
+ When enabled, try to use Flexible Return and Event Delivery
+ instead of the legacy SYSCALL/SYSENTER/IDT architecture for
+ ring transitions and exception/interrupt handling if the
+ system supports.
+
if X86_32
config X86_BIGSMP
bool "Support for big SMP systems with more than 8 CPUs"
--
2.34.1

2023-04-10 08:42:11

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 07/33] x86/opcode: add ERETU, ERETS instructions to x86-opcode-map

From: "H. Peter Anvin (Intel)" <[email protected]>

Add the instruction opcodes used by FRED: ERETU, ERETS.

Opcode number is per public FRED draft spec v3.0.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/lib/x86-opcode-map.txt | 2 +-
tools/arch/x86/lib/x86-opcode-map.txt | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 5168ee0360b2..7a269e269dc0 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -1052,7 +1052,7 @@ EndTable

GrpTable: Grp7
0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
-1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
+1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B) | ERETU (F3),(010),(11B) | ERETS (F2),(010),(11B)
2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
3: LIDT Ms
4: SMSW Mw/Rv
diff --git a/tools/arch/x86/lib/x86-opcode-map.txt b/tools/arch/x86/lib/x86-opcode-map.txt
index 5168ee0360b2..7a269e269dc0 100644
--- a/tools/arch/x86/lib/x86-opcode-map.txt
+++ b/tools/arch/x86/lib/x86-opcode-map.txt
@@ -1052,7 +1052,7 @@ EndTable

GrpTable: Grp7
0: SGDT Ms | VMCALL (001),(11B) | VMLAUNCH (010),(11B) | VMRESUME (011),(11B) | VMXOFF (100),(11B) | PCONFIG (101),(11B) | ENCLV (000),(11B)
-1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B)
+1: SIDT Ms | MONITOR (000),(11B) | MWAIT (001),(11B) | CLAC (010),(11B) | STAC (011),(11B) | ENCLS (111),(11B) | ERETU (F3),(010),(11B) | ERETS (F2),(010),(11B)
2: LGDT Ms | XGETBV (000),(11B) | XSETBV (001),(11B) | VMFUNC (100),(11B) | XEND (101)(11B) | XTEST (110)(11B) | ENCLU (111),(11B)
3: LIDT Ms
4: SMSW Mw/Rv
--
2.34.1

2023-04-10 08:42:19

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 09/33] x86/cpu: add X86_CR4_FRED macro

From: "H. Peter Anvin (Intel)" <[email protected]>

Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit should be a
pinned bit, not to be changed after initialization.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/uapi/asm/processor-flags.h | 2 ++
arch/x86/kernel/cpu/common.c | 11 ++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index c47cc7f2feeb..a90933f1ff41 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -132,6 +132,8 @@
#define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT)
#define X86_CR4_CET_BIT 23 /* enable Control-flow Enforcement Technology */
#define X86_CR4_CET _BITUL(X86_CR4_CET_BIT)
+#define X86_CR4_FRED_BIT 32 /* enable FRED kernel entry */
+#define X86_CR4_FRED _BITULL(X86_CR4_FRED_BIT)

/*
* x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8cd4126d8253..e8cf6f4cfb52 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -412,10 +412,15 @@ static __always_inline void setup_umip(struct cpuinfo_x86 *c)
cr4_clear_bits(X86_CR4_UMIP);
}

-/* These bits should not change their value after CPU init is finished. */
+/*
+ * These bits should not change their value after CPU init is finished.
+ * The explicit cast to unsigned long suppresses a warning on i386 for
+ * x86-64 only feature bits >= 32.
+ */
static const unsigned long cr4_pinned_mask =
- X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
- X86_CR4_FSGSBASE | X86_CR4_CET;
+ (unsigned long)
+ (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
+ X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED);
static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
static unsigned long cr4_pinned_bits __ro_after_init;

--
2.34.1

2023-04-10 08:42:20

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 12/33] x86/cpu: add MSR numbers for FRED configuration

From: "H. Peter Anvin (Intel)" <[email protected]>

Add MSR numbers for the FRED configuration registers.

Originally-by: Megha Dey <[email protected]>
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/msr-index.h | 13 ++++++++++++-
tools/arch/x86/include/asm/msr-index.h | 13 ++++++++++++-
2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ad35355ee43e..87db728f8bbc 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -36,8 +36,19 @@
#define EFER_FFXSR (1<<_EFER_FFXSR)
#define EFER_AUTOIBRS (1<<_EFER_AUTOIBRS)

-/* Intel MSRs. Some also available on other CPUs */
+/* FRED MSRs */
+#define MSR_IA32_FRED_RSP0 0x1cc /* Level 0 stack pointer */
+#define MSR_IA32_FRED_RSP1 0x1cd /* Level 1 stack pointer */
+#define MSR_IA32_FRED_RSP2 0x1ce /* Level 2 stack pointer */
+#define MSR_IA32_FRED_RSP3 0x1cf /* Level 3 stack pointer */
+#define MSR_IA32_FRED_STKLVLS 0x1d0 /* Exception stack levels */
+#define MSR_IA32_FRED_SSP0 MSR_IA32_PL0_SSP /* Level 0 shadow stack pointer */
+#define MSR_IA32_FRED_SSP1 0x1d1 /* Level 1 shadow stack pointer */
+#define MSR_IA32_FRED_SSP2 0x1d2 /* Level 2 shadow stack pointer */
+#define MSR_IA32_FRED_SSP3 0x1d3 /* Level 3 shadow stack pointer */
+#define MSR_IA32_FRED_CONFIG 0x1d4 /* Entrypoint and interrupt stack level */

+/* Intel MSRs. Some also available on other CPUs */
#define MSR_TEST_CTRL 0x00000033
#define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT 29
#define MSR_TEST_CTRL_SPLIT_LOCK_DETECT BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)
diff --git a/tools/arch/x86/include/asm/msr-index.h b/tools/arch/x86/include/asm/msr-index.h
index ad35355ee43e..87db728f8bbc 100644
--- a/tools/arch/x86/include/asm/msr-index.h
+++ b/tools/arch/x86/include/asm/msr-index.h
@@ -36,8 +36,19 @@
#define EFER_FFXSR (1<<_EFER_FFXSR)
#define EFER_AUTOIBRS (1<<_EFER_AUTOIBRS)

-/* Intel MSRs. Some also available on other CPUs */
+/* FRED MSRs */
+#define MSR_IA32_FRED_RSP0 0x1cc /* Level 0 stack pointer */
+#define MSR_IA32_FRED_RSP1 0x1cd /* Level 1 stack pointer */
+#define MSR_IA32_FRED_RSP2 0x1ce /* Level 2 stack pointer */
+#define MSR_IA32_FRED_RSP3 0x1cf /* Level 3 stack pointer */
+#define MSR_IA32_FRED_STKLVLS 0x1d0 /* Exception stack levels */
+#define MSR_IA32_FRED_SSP0 MSR_IA32_PL0_SSP /* Level 0 shadow stack pointer */
+#define MSR_IA32_FRED_SSP1 0x1d1 /* Level 1 shadow stack pointer */
+#define MSR_IA32_FRED_SSP2 0x1d2 /* Level 2 shadow stack pointer */
+#define MSR_IA32_FRED_SSP3 0x1d3 /* Level 3 shadow stack pointer */
+#define MSR_IA32_FRED_CONFIG 0x1d4 /* Entrypoint and interrupt stack level */

+/* Intel MSRs. Some also available on other CPUs */
#define MSR_TEST_CTRL 0x00000033
#define MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT 29
#define MSR_TEST_CTRL_SPLIT_LOCK_DETECT BIT(MSR_TEST_CTRL_SPLIT_LOCK_DETECT_BIT)
--
2.34.1

2023-04-10 08:42:24

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 11/33] x86/fred: if CONFIG_X86_FRED is disabled, disable FRED support

From: "H. Peter Anvin (Intel)" <[email protected]>

Add CONFIG_X86_FRED to <asm/disabled-features.h> to make
cpu_feature_enabled() work correctly with FRED.

Originally-by: Megha Dey <[email protected]>
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/disabled-features.h | 8 +++++++-
tools/arch/x86/include/asm/disabled-features.h | 8 +++++++-
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dfa4fb76f4b..56838de9cb23 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif

+#ifdef CONFIG_X86_FRED
+# define DISABLE_FRED 0
+#else
+# define DISABLE_FRED (1 << (X86_FEATURE_FRED & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -115,7 +121,7 @@
#define DISABLED_MASK10 0
#define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
DISABLE_CALL_DEPTH_TRACKING)
-#define DISABLED_MASK12 0
+#define DISABLED_MASK12 (DISABLE_FRED)
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
diff --git a/tools/arch/x86/include/asm/disabled-features.h b/tools/arch/x86/include/asm/disabled-features.h
index 5dfa4fb76f4b..56838de9cb23 100644
--- a/tools/arch/x86/include/asm/disabled-features.h
+++ b/tools/arch/x86/include/asm/disabled-features.h
@@ -99,6 +99,12 @@
# define DISABLE_TDX_GUEST (1 << (X86_FEATURE_TDX_GUEST & 31))
#endif

+#ifdef CONFIG_X86_FRED
+# define DISABLE_FRED 0
+#else
+# define DISABLE_FRED (1 << (X86_FEATURE_FRED & 31))
+#endif
+
/*
* Make sure to add features to the correct mask
*/
@@ -115,7 +121,7 @@
#define DISABLED_MASK10 0
#define DISABLED_MASK11 (DISABLE_RETPOLINE|DISABLE_RETHUNK|DISABLE_UNRET| \
DISABLE_CALL_DEPTH_TRACKING)
-#define DISABLED_MASK12 0
+#define DISABLED_MASK12 (DISABLE_FRED)
#define DISABLED_MASK13 0
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
--
2.34.1

2023-04-10 08:42:27

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 03/33] x86/traps: add a system interrupt table for system interrupt dispatch

From: "H. Peter Anvin (Intel)" <[email protected]>

On x86, external interrupts are divided into the following two groups
1) system interrupts
2) external device interrupts

External device interrupts are all routed to common_interrupt(), which
dispatches external device interrupts through a per-CPU external interrupt
dispatch table vector_irq.

For system interrupts, add a system interrupt handler table for dispatching
a system interrupt to its corresponding handler directly. Thus a software
based dispatch function will be:

void external_interrupt(struct pt_regs *regs)
{
u8 vector = regs->vector;
if (is_system_interrupt(vector))
system_interrupt_handlers[vector_to_sysvec(vector)](regs);
else /* external device interrupt */
common_interrupt(regs);
}

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Co-developed-by: Xin Li <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/idtentry.h | 64 +++++++++++++++++++++++++++------
arch/x86/include/asm/traps.h | 7 ++++
arch/x86/kernel/traps.c | 62 ++++++++++++++++++++++++++++++++
3 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b241af4ce9b4..2876ddae02bc 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -167,17 +167,22 @@ __visible noinstr void func(struct pt_regs *regs, unsigned long error_code)

/**
* DECLARE_IDTENTRY_IRQ - Declare functions for device interrupt IDT entry
- * points (common/spurious)
+ * points (common/spurious) and their corresponding
+ * software based dispatch handlers in the non-noinstr
+ * text section
* @vector: Vector number (ignored for C)
* @func: Function name of the entry point
*
- * Maps to DECLARE_IDTENTRY_ERRORCODE()
+ * Maps to DECLARE_IDTENTRY_ERRORCODE(), plus a dispatch function prototype
*/
#define DECLARE_IDTENTRY_IRQ(vector, func) \
- DECLARE_IDTENTRY_ERRORCODE(vector, func)
+ DECLARE_IDTENTRY_ERRORCODE(vector, func); \
+ void dispatch_##func(struct pt_regs *regs, unsigned long error_code)

/**
* DEFINE_IDTENTRY_IRQ - Emit code for device interrupt IDT entry points
+ * and their corresponding software based dispatch
+ * handlers in the non-noinstr text section
* @func: Function name of the entry point
*
* The vector number is pushed by the low level entry stub and handed
@@ -187,6 +192,9 @@ __visible noinstr void func(struct pt_regs *regs, unsigned long error_code)
* irq_enter/exit_rcu() are invoked before the function body and the
* KVM L1D flush request is set. Stack switching to the interrupt stack
* has to be done in the function body if necessary.
+ *
+ * dispatch_func() is a software based dispatch handler in the non-noinstr
+ * text section.
*/
#define DEFINE_IDTENTRY_IRQ(func) \
static void __##func(struct pt_regs *regs, u32 vector); \
@@ -204,31 +212,48 @@ __visible noinstr void func(struct pt_regs *regs, \
irqentry_exit(regs, state); \
} \
\
+void dispatch_##func(struct pt_regs *regs, unsigned long error_code) \
+{ \
+ u32 vector = (u32)(u8)error_code; \
+ \
+ kvm_set_cpu_l1tf_flush_l1d(); \
+ run_irq_on_irqstack_cond(__##func, regs, vector); \
+} \
+ \
static noinline void __##func(struct pt_regs *regs, u32 vector)

/**
* DECLARE_IDTENTRY_SYSVEC - Declare functions for system vector entry points
+ * and their corresponding software based dispatch
+ * handlers in the non-noinstr text section
* @vector: Vector number (ignored for C)
* @func: Function name of the entry point
*
- * Declares three functions:
+ * Declares four functions:
* - The ASM entry point: asm_##func
* - The XEN PV trap entry point: xen_##func (maybe unused)
* - The C handler called from the ASM entry point
+ * - The C handler used in the system interrupt handler table
*
- * Maps to DECLARE_IDTENTRY().
+ * Maps to DECLARE_IDTENTRY(), plus a dispatch table function prototype
*/
#define DECLARE_IDTENTRY_SYSVEC(vector, func) \
- DECLARE_IDTENTRY(vector, func)
+ DECLARE_IDTENTRY(vector, func); \
+ void dispatch_table_##func(struct pt_regs *regs)

/**
* DEFINE_IDTENTRY_SYSVEC - Emit code for system vector IDT entry points
+ * and their corresponding software based dispatch
+ * handlers in the non-noinstr text section
* @func: Function name of the entry point
*
* irqentry_enter/exit() and irq_enter/exit_rcu() are invoked before the
* function body. KVM L1D flush request is set.
*
- * Runs the function on the interrupt stack if the entry hit kernel mode
+ * Runs the function on the interrupt stack if the entry hit kernel mode.
+ *
+ * dispatch_table_func() is used in the system interrupt handler table for
+ * system interrupts dispatching.
*/
#define DEFINE_IDTENTRY_SYSVEC(func) \
static void __##func(struct pt_regs *regs); \
@@ -244,11 +269,19 @@ __visible noinstr void func(struct pt_regs *regs) \
irqentry_exit(regs, state); \
} \
\
+void dispatch_table_##func(struct pt_regs *regs) \
+{ \
+ kvm_set_cpu_l1tf_flush_l1d(); \
+ run_sysvec_on_irqstack_cond(__##func, regs); \
+} \
+ \
static noinline void __##func(struct pt_regs *regs)

/**
* DEFINE_IDTENTRY_SYSVEC_SIMPLE - Emit code for simple system vector IDT
- * entry points
+ * entry points and their corresponding
+ * software based dispatch handlers in
+ * the non-noinstr text section
* @func: Function name of the entry point
*
* Runs the function on the interrupted stack. No switch to IRQ stack and
@@ -256,6 +289,9 @@ static noinline void __##func(struct pt_regs *regs)
*
* Only use for 'empty' vectors like reschedule IPI and KVM posted
* interrupt vectors.
+ *
+ * dispatch_table_func() is used in the system interrupt handler table for
+ * system interrupts dispatching.
*/
#define DEFINE_IDTENTRY_SYSVEC_SIMPLE(func) \
static __always_inline void __##func(struct pt_regs *regs); \
@@ -273,6 +309,14 @@ __visible noinstr void func(struct pt_regs *regs) \
irqentry_exit(regs, state); \
} \
\
+void dispatch_table_##func(struct pt_regs *regs) \
+{ \
+ __irq_enter_raw(); \
+ kvm_set_cpu_l1tf_flush_l1d(); \
+ __##func (regs); \
+ __irq_exit_raw(); \
+} \
+ \
static __always_inline void __##func(struct pt_regs *regs)

/**
@@ -634,9 +678,7 @@ DECLARE_IDTENTRY(X86_TRAP_VE, exc_virtualization_exception);

/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, common_interrupt);
-#ifdef CONFIG_X86_LOCAL_APIC
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, spurious_interrupt);
-#endif

/* System vector entry points */
#ifdef CONFIG_X86_LOCAL_APIC
@@ -647,7 +689,7 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR, sysvec_x86_platform_ipi);
#endif

#ifdef CONFIG_SMP
-DECLARE_IDTENTRY(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
+DECLARE_IDTENTRY_SYSVEC(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR, sysvec_irq_move_cleanup);
DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR, sysvec_reboot);
DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR, sysvec_call_function_single);
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..28c8ba5fd81c 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,11 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
struct stack_info *info);
#endif

+/*
+ * How system interrupt handlers are called.
+ */
+#define DECLARE_SYSTEM_INTERRUPT_HANDLER(f) \
+ void f (struct pt_regs *regs)
+typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));
+
#endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..2cbe7e7e8b96 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1451,6 +1451,68 @@ DEFINE_IDTENTRY_SW(iret_error)
}
#endif

+#ifdef CONFIG_X86_64
+
+#ifndef CONFIG_X86_LOCAL_APIC
+/*
+ * Used when local APIC is not configured to build into the kernel, but
+ * dispatch_table_spurious_interrupt() needs dispatch_spurious_interrupt().
+ */
+DEFINE_IDTENTRY_IRQ(spurious_interrupt)
+{
+ pr_info("Spurious interrupt (vector 0x%x) on CPU#%d, should never happen.\n",
+ vector, smp_processor_id());
+}
+#endif
+
+static void dispatch_table_spurious_interrupt(struct pt_regs *regs)
+{
+ dispatch_spurious_interrupt(regs, regs->vector);
+}
+
+#define SYSV(x,y) [(x) - FIRST_SYSTEM_VECTOR] = y
+
+static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {
+ [0 ... NR_SYSTEM_VECTORS-1] = dispatch_table_spurious_interrupt,
+#ifdef CONFIG_SMP
+ SYSV(RESCHEDULE_VECTOR, dispatch_table_sysvec_reschedule_ipi),
+ SYSV(CALL_FUNCTION_VECTOR, dispatch_table_sysvec_call_function),
+ SYSV(CALL_FUNCTION_SINGLE_VECTOR, dispatch_table_sysvec_call_function_single),
+ SYSV(REBOOT_VECTOR, dispatch_table_sysvec_reboot),
+#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+ SYSV(THERMAL_APIC_VECTOR, dispatch_table_sysvec_thermal),
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+ SYSV(THRESHOLD_APIC_VECTOR, dispatch_table_sysvec_threshold),
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+ SYSV(DEFERRED_ERROR_VECTOR, dispatch_table_sysvec_deferred_error),
+#endif
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ SYSV(LOCAL_TIMER_VECTOR, dispatch_table_sysvec_apic_timer_interrupt),
+ SYSV(X86_PLATFORM_IPI_VECTOR, dispatch_table_sysvec_x86_platform_ipi),
+# ifdef CONFIG_HAVE_KVM
+ SYSV(POSTED_INTR_VECTOR, dispatch_table_sysvec_kvm_posted_intr_ipi),
+ SYSV(POSTED_INTR_WAKEUP_VECTOR, dispatch_table_sysvec_kvm_posted_intr_wakeup_ipi),
+ SYSV(POSTED_INTR_NESTED_VECTOR, dispatch_table_sysvec_kvm_posted_intr_nested_ipi),
+# endif
+# ifdef CONFIG_IRQ_WORK
+ SYSV(IRQ_WORK_VECTOR, dispatch_table_sysvec_irq_work),
+# endif
+ SYSV(SPURIOUS_APIC_VECTOR, dispatch_table_sysvec_spurious_apic_interrupt),
+ SYSV(ERROR_APIC_VECTOR, dispatch_table_sysvec_error_interrupt),
+#endif
+};
+
+#undef SYSV
+
+#endif /* CONFIG_X86_64 */
+
void __init trap_init(void)
{
/* Init cpu_entry_area before IST entries are set up */
--
2.34.1

2023-04-10 08:42:39

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 15/33] x86/fred: reserve space for the FRED stack frame

From: "H. Peter Anvin (Intel)" <[email protected]>

When using FRED, reserve space at the top of the stack frame, just
like i386 does. A future version of FRED might have dynamic frame
sizes, though, in which case it might be necessary to make
TOP_OF_KERNEL_STACK_PADDING a variable instead of a constant.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/thread_info.h | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f1cccba52eb9..998483078d5f 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -31,7 +31,9 @@
* In vm86 mode, the hardware frame is much longer still, so add 16
* bytes to make room for the real-mode segments.
*
- * x86_64 has a fixed-length stack frame.
+ * x86-64 has a fixed-length stack frame, but it depends on whether
+ * or not FRED is enabled. Future versions of FRED might make this
+ * dynamic, but for now it is always 2 words longer.
*/
#ifdef CONFIG_X86_32
# ifdef CONFIG_VM86
@@ -39,8 +41,12 @@
# else
# define TOP_OF_KERNEL_STACK_PADDING 8
# endif
-#else
-# define TOP_OF_KERNEL_STACK_PADDING 0
+#else /* x86-64 */
+# ifdef CONFIG_X86_FRED
+# define TOP_OF_KERNEL_STACK_PADDING (2*8)
+# else
+# define TOP_OF_KERNEL_STACK_PADDING 0
+# endif
#endif

/*
--
2.34.1

2023-04-10 08:42:46

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 13/33] x86/fred: header file for event types

FRED inherits the Intel VT-x enhancement of classified events with
a two-level event dispatch logic. The first-level dispatch is on
the event type, not the event vector as used in the IDT architecture.
This also means that vectors in different event types are orthogonal,
e.g., vectors 0x10-0x1f become available as hardware interrupts.

Add a header file for event types, and also use it in <asm/vmx.h>.

Suggested-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/event-type.h | 17 +++++++++++++++++
arch/x86/include/asm/vmx.h | 17 +++++++++--------
2 files changed, 26 insertions(+), 8 deletions(-)
create mode 100644 arch/x86/include/asm/event-type.h

diff --git a/arch/x86/include/asm/event-type.h b/arch/x86/include/asm/event-type.h
new file mode 100644
index 000000000000..fedaa0e492c5
--- /dev/null
+++ b/arch/x86/include/asm/event-type.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_EVENT_TYPE_H
+#define _ASM_X86_EVENT_TYPE_H
+
+/*
+ * Event type codes: these are the same that are used by VTx.
+ */
+#define EVENT_TYPE_HWINT 0 /* Maskable external interrupt */
+#define EVENT_TYPE_RESERVED 1
+#define EVENT_TYPE_NMI 2 /* Non-maskable interrupt */
+#define EVENT_TYPE_HWFAULT 3 /* Hardware exceptions (e.g., page fault) */
+#define EVENT_TYPE_SWINT 4 /* Software interrupt (INT n) */
+#define EVENT_TYPE_PRIVSW 5 /* INT1 (ICEBP) */
+#define EVENT_TYPE_SWFAULT 6 /* Software exception (INT3 or INTO) */
+#define EVENT_TYPE_OTHER 7 /* FRED: SYSCALL/SYSENTER */
+
+#endif
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 498dc600bd5c..8d9b8b0d8e56 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -15,6 +15,7 @@
#include <linux/bitops.h>
#include <linux/types.h>
#include <uapi/asm/vmx.h>
+#include <asm/event-type.h>
#include <asm/vmxfeatures.h>

#define VMCS_CONTROL_BIT(x) BIT(VMX_FEATURE_##x & 0x1f)
@@ -372,14 +373,14 @@ enum vmcs_field {
#define VECTORING_INFO_DELIVER_CODE_MASK INTR_INFO_DELIVER_CODE_MASK
#define VECTORING_INFO_VALID_MASK INTR_INFO_VALID_MASK

-#define INTR_TYPE_EXT_INTR (0 << 8) /* external interrupt */
-#define INTR_TYPE_RESERVED (1 << 8) /* reserved */
-#define INTR_TYPE_NMI_INTR (2 << 8) /* NMI */
-#define INTR_TYPE_HARD_EXCEPTION (3 << 8) /* processor exception */
-#define INTR_TYPE_SOFT_INTR (4 << 8) /* software interrupt */
-#define INTR_TYPE_PRIV_SW_EXCEPTION (5 << 8) /* ICE breakpoint - undocumented */
-#define INTR_TYPE_SOFT_EXCEPTION (6 << 8) /* software exception */
-#define INTR_TYPE_OTHER_EVENT (7 << 8) /* other event */
+#define INTR_TYPE_EXT_INTR (EVENT_TYPE_HWINT << 8) /* external interrupt */
+#define INTR_TYPE_RESERVED (EVENT_TYPE_RESERVED << 8) /* reserved */
+#define INTR_TYPE_NMI_INTR (EVENT_TYPE_NMI << 8) /* NMI */
+#define INTR_TYPE_HARD_EXCEPTION (EVENT_TYPE_HWFAULT << 8) /* processor exception */
+#define INTR_TYPE_SOFT_INTR (EVENT_TYPE_SWINT << 8) /* software interrupt */
+#define INTR_TYPE_PRIV_SW_EXCEPTION (EVENT_TYPE_PRIVSW << 8) /* ICE breakpoint - undocumented */
+#define INTR_TYPE_SOFT_EXCEPTION (EVENT_TYPE_SWFAULT << 8) /* software exception */
+#define INTR_TYPE_OTHER_EVENT (EVENT_TYPE_OTHER << 8) /* other event */

/* GUEST_INTERRUPTIBILITY_INFO flags. */
#define GUEST_INTR_STATE_STI 0x00000001
--
2.34.1

2023-04-10 08:42:46

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 14/33] x86/fred: header file with FRED definitions

From: "H. Peter Anvin (Intel)" <[email protected]>

Add a header file for FRED prototypes and definitions.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v6:
* Replace pt_regs csx flags prefix FRED_CSL_ with FRED_CSX_.
---
arch/x86/include/asm/fred.h | 106 ++++++++++++++++++++++++++++++++++++
1 file changed, 106 insertions(+)
create mode 100644 arch/x86/include/asm/fred.h

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
new file mode 100644
index 000000000000..b59397411ab9
--- /dev/null
+++ b/arch/x86/include/asm/fred.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * arch/x86/include/asm/fred.h
+ *
+ * Macros for Flexible Return and Event Delivery (FRED)
+ */
+
+#ifndef ASM_X86_FRED_H
+#define ASM_X86_FRED_H
+
+#ifdef CONFIG_X86_FRED
+
+#include <linux/const.h>
+#include <asm/asm.h>
+
+/*
+ * FRED return instructions
+ *
+ * Replace with "ERETS"/"ERETU" once binutils support FRED return instructions.
+ * The binutils version supporting FRED instructions is still TBD, and will
+ * update once we have it.
+ */
+#define ERETS _ASM_BYTES(0xf2,0x0f,0x01,0xca)
+#define ERETU _ASM_BYTES(0xf3,0x0f,0x01,0xca)
+
+/*
+ * RSP is aligned to a 64-byte boundary before used to push a new stack frame
+ */
+#define FRED_STACK_FRAME_RSP_MASK _AT(unsigned long, (~0x3f))
+
+/*
+ * Event stack level macro for the FRED_STKLVLS MSR.
+ * Usage example: FRED_STKLVL(X86_TRAP_DF, 3)
+ * Multiple values can be ORd together.
+ */
+#define FRED_STKLVL(v,l) (_AT(unsigned long, l) << (2*(v)))
+
+/* FRED_CONFIG MSR */
+#define FRED_CONFIG_CSL_MASK 0x3
+#define FRED_CONFIG_REDZONE_AMOUNT 1 /* measured in 64-byte cache lines */
+#define FRED_CONFIG_REDZONE (_AT(unsigned long, FRED_CONFIG_REDZONE_AMOUNT) << 6)
+#define FRED_CONFIG_INT_STKLVL(l) (_AT(unsigned long, l) << 9)
+#define FRED_CONFIG_ENTRYPOINT(p) _AT(unsigned long, (p))
+
+/* FRED event type and vector bit width and counts */
+#define FRED_EVENT_TYPE_BITS 3 /* only 3 bits used in FRED 3.0 */
+#define FRED_EVENT_TYPE_COUNT _BITUL(FRED_EVENT_TYPE_BITS)
+#define FRED_EVENT_VECTOR_BITS 8
+#define FRED_EVENT_VECTOR_COUNT _BITUL(FRED_EVENT_VECTOR_BITS)
+
+/* FRED EVENT_TYPE_OTHER vector numbers */
+#define FRED_SYSCALL 1
+#define FRED_SYSENTER 2
+
+/* Flags above the CS selector (regs->csx) */
+#define FRED_CSX_ENABLE_NMI _BITUL(28)
+#define FRED_CSX_ALLOW_SINGLE_STEP _BITUL(25)
+#define FRED_CSX_INTERRUPT_SHADOW _BITUL(24)
+
+#ifndef __ASSEMBLY__
+
+#include <linux/kernel.h>
+#include <asm/ptrace.h>
+
+/* FRED stack frame information */
+struct fred_info {
+ unsigned long edata; /* Event data: CR2, DR6, ... */
+ unsigned long resv;
+};
+
+/* Full format of the FRED stack frame */
+struct fred_frame {
+ struct pt_regs regs;
+ struct fred_info info;
+};
+
+/* Getting the FRED frame information from a pt_regs pointer */
+static __always_inline struct fred_info *fred_info(struct pt_regs *regs)
+{
+ return &container_of(regs, struct fred_frame, regs)->info;
+}
+
+static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
+{
+ return fred_info(regs)->edata;
+}
+
+/*
+ * How FRED event handlers are called.
+ *
+ * FRED event delivery establishes the full supervisor context
+ * by pushing everything related to the event being delivered
+ * to the FRED stack frame, e.g., the faulting linear address
+ * of a #PF is pushed as event data of the FRED #PF stack frame.
+ * Thus a struct pt_regs has everything needed and it's the only
+ * input parameter required for a FRED event handler.
+ */
+#define DECLARE_FRED_HANDLER(f) void f (struct pt_regs *regs)
+#define DEFINE_FRED_HANDLER(f) noinstr DECLARE_FRED_HANDLER(f)
+typedef DECLARE_FRED_HANDLER((*fred_handler));
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* CONFIG_X86_FRED */
+
+#endif /* ASM_X86_FRED_H */
--
2.34.1

2023-04-10 08:42:46

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 16/33] x86/fred: add a page fault entry stub for FRED

From: "H. Peter Anvin (Intel)" <[email protected]>

Add a page fault entry stub for FRED.

On a FRED system, the faulting address (CR2) is passed on the stack,
to avoid the problem of transient state. Thus we get the page fault
address from the stack instead of CR2.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/fred.h | 2 ++
arch/x86/mm/fault.c | 20 ++++++++++++++++++--
2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index b59397411ab9..4ff05d350066 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -99,6 +99,8 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
#define DEFINE_FRED_HANDLER(f) noinstr DECLARE_FRED_HANDLER(f)
typedef DECLARE_FRED_HANDLER((*fred_handler));

+DECLARE_FRED_HANDLER(fred_exc_page_fault);
+
#endif /* __ASSEMBLY__ */

#endif /* CONFIG_X86_FRED */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a498ae1fbe66..0f946121de14 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,7 @@
#include <asm/kvm_para.h> /* kvm_handle_async_pf */
#include <asm/vdso.h> /* fixup_vdso_exception() */
#include <asm/irq_stack.h>
+#include <asm/fred.h> /* fred_event_data() */

#define CREATE_TRACE_POINTS
#include <asm/trace/exceptions.h>
@@ -1507,9 +1508,10 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
}
}

-DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
+static __always_inline void page_fault_common(struct pt_regs *regs,
+ unsigned int error_code,
+ unsigned long address)
{
- unsigned long address = read_cr2();
irqentry_state_t state;

prefetchw(&current->mm->mmap_lock);
@@ -1556,3 +1558,17 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)

irqentry_exit(regs, state);
}
+
+DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
+{
+ page_fault_common(regs, error_code, read_cr2());
+}
+
+#ifdef CONFIG_X86_FRED
+
+DEFINE_FRED_HANDLER(fred_exc_page_fault)
+{
+ page_fault_common(regs, regs->orig_ax, fred_event_data(regs));
+}
+
+#endif /* CONFIG_X86_FRED */
--
2.34.1

2023-04-10 08:42:48

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 18/33] x86/fred: add a NMI entry stub for FRED

From: "H. Peter Anvin (Intel)" <[email protected]>

On a FRED system, NMIs nest both with themselves and faults, transient
information is saved into the stack frame, and NMI unblocking only
happens when the stack frame indicates that so should happen.

Thus, the NMI entry stub for FRED is really quite small...

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/fred.h | 1 +
arch/x86/kernel/nmi.c | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index f670430aaa54..9ce2a6439091 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -99,6 +99,7 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
#define DEFINE_FRED_HANDLER(f) noinstr DECLARE_FRED_HANDLER(f)
typedef DECLARE_FRED_HANDLER((*fred_handler));

+DECLARE_FRED_HANDLER(fred_exc_nmi);
DECLARE_FRED_HANDLER(fred_exc_debug);
DECLARE_FRED_HANDLER(fred_exc_page_fault);

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 776f4b1e395b..e7b2abe42583 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -34,6 +34,7 @@
#include <asm/cache.h>
#include <asm/nospec-branch.h>
#include <asm/sev.h>
+#include <asm/fred.h>

#define CREATE_TRACE_POINTS
#include <trace/events/nmi.h>
@@ -643,6 +644,24 @@ void nmi_backtrace_stall_check(const struct cpumask *btp)

#endif

+#ifdef CONFIG_X86_FRED
+DEFINE_FRED_HANDLER(fred_exc_nmi)
+{
+ /*
+ * With FRED, CR2 and DR6 are pushed atomically on faults,
+ * so we don't have to worry about saving and restoring them.
+ * Breakpoint faults nest, so assume it is OK to leave DR7
+ * enabled.
+ */
+ irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
+ inc_irq_stat(__nmi_count);
+ default_do_nmi(regs);
+
+ irqentry_nmi_exit(regs, irq_state);
+}
+#endif
+
void stop_nmi(void)
{
ignore_nmis++;
--
2.34.1

2023-04-10 08:42:50

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 17/33] x86/fred: add a debug fault entry stub for FRED

From: "H. Peter Anvin (Intel)" <[email protected]>

Add a debug fault entry stub for FRED.

On a FRED system, the debug trap status information (DR6) is passed
on the stack, to avoid the problem of transient state. Furthermore,
FRED transitions avoid a lot of ugly corner cases the handling of which
can, and should be, skipped.

The FRED debug trap status information saved on the stack differs from DR6
in both stickiness and polarity; it is exactly what debug_read_clear_dr6()
returns, and exc_debug_user()/exc_debug_kernel() expect.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v1:
* call irqentry_nmi_{enter,exit}() in both IDT and FRED debug fault kernel
handler (Peter Zijlstra).
---
arch/x86/include/asm/fred.h | 1 +
arch/x86/kernel/traps.c | 56 +++++++++++++++++++++++++++----------
2 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 4ff05d350066..f670430aaa54 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -99,6 +99,7 @@ static __always_inline unsigned long fred_event_data(struct pt_regs *regs)
#define DEFINE_FRED_HANDLER(f) noinstr DECLARE_FRED_HANDLER(f)
typedef DECLARE_FRED_HANDLER((*fred_handler));

+DECLARE_FRED_HANDLER(fred_exc_debug);
DECLARE_FRED_HANDLER(fred_exc_page_fault);

#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f86cd233b00b..549f7f962f8f 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -47,6 +47,7 @@
#include <asm/debugreg.h>
#include <asm/realmode.h>
#include <asm/text-patching.h>
+#include <asm/fred.h>
#include <asm/ftrace.h>
#include <asm/traps.h>
#include <asm/desc.h>
@@ -1020,21 +1021,9 @@ static bool notify_debug(struct pt_regs *regs, unsigned long *dr6)
return false;
}

-static __always_inline void exc_debug_kernel(struct pt_regs *regs,
- unsigned long dr6)
+static __always_inline void debug_kernel_common(struct pt_regs *regs,
+ unsigned long dr6)
{
- /*
- * Disable breakpoints during exception handling; recursive exceptions
- * are exceedingly 'fun'.
- *
- * Since this function is NOKPROBE, and that also applies to
- * HW_BREAKPOINT_X, we can't hit a breakpoint before this (XXX except a
- * HW_BREAKPOINT_W on our stack)
- *
- * Entry text is excluded for HW_BP_X and cpu_entry_area, which
- * includes the entry stack is excluded for everything.
- */
- unsigned long dr7 = local_db_save();
irqentry_state_t irq_state = irqentry_nmi_enter(regs);
instrumentation_begin();

@@ -1062,7 +1051,8 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
* Catch SYSENTER with TF set and clear DR_STEP. If this hit a
* watchpoint at the same time then that will still be handled.
*/
- if ((dr6 & DR_STEP) && is_sysenter_singlestep(regs))
+ if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+ (dr6 & DR_STEP) && is_sysenter_singlestep(regs))
dr6 &= ~DR_STEP;

/*
@@ -1090,7 +1080,25 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
out:
instrumentation_end();
irqentry_nmi_exit(regs, irq_state);
+}

+static __always_inline void exc_debug_kernel(struct pt_regs *regs,
+ unsigned long dr6)
+{
+ /*
+ * Disable breakpoints during exception handling; recursive exceptions
+ * are exceedingly 'fun'.
+ *
+ * Since this function is NOKPROBE, and that also applies to
+ * HW_BREAKPOINT_X, we can't hit a breakpoint before this (XXX except a
+ * HW_BREAKPOINT_W on our stack)
+ *
+ * Entry text is excluded for HW_BP_X and cpu_entry_area, which
+ * includes the entry stack is excluded for everything.
+ */
+ unsigned long dr7 = local_db_save();
+
+ debug_kernel_common(regs, dr6);
local_db_restore(dr7);
}

@@ -1179,6 +1187,24 @@ DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
{
exc_debug_user(regs, debug_read_clear_dr6());
}
+
+# ifdef CONFIG_X86_FRED
+DEFINE_FRED_HANDLER(fred_exc_debug)
+{
+ /*
+ * The FRED debug information saved onto stack differs from
+ * DR6 in both stickiness and polarity; it is exactly what
+ * debug_read_clear_dr6() returns.
+ */
+ unsigned long dr6 = fred_event_data(regs);
+
+ if (user_mode(regs))
+ exc_debug_user(regs, dr6);
+ else
+ debug_kernel_common(regs, dr6);
+}
+# endif /* CONFIG_X86_FRED */
+
#else
/* 32 bit does not have separate entry points. */
DEFINE_IDTENTRY_RAW(exc_debug)
--
2.34.1

2023-04-10 08:42:55

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 19/33] x86/fred: add a machine check entry stub for FRED

Add a machine check entry stub for FRED.

Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v5:
* Disallow #DB inside #MCE for robustness sake (Peter Zijlstra).
---
arch/x86/include/asm/fred.h | 1 +
arch/x86/kernel/cpu/mce/core.c | 15 +++++++++++++++
2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 9ce2a6439091..61048aa4e01d 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -102,6 +102,7 @@ typedef DECLARE_FRED_HANDLER((*fred_handler));
DECLARE_FRED_HANDLER(fred_exc_nmi);
DECLARE_FRED_HANDLER(fred_exc_debug);
DECLARE_FRED_HANDLER(fred_exc_page_fault);
+DECLARE_FRED_HANDLER(fred_exc_machine_check);

#endif /* __ASSEMBLY__ */

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2eec60f50057..859331a6a7ad 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -52,6 +52,7 @@
#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/reboot.h>
+#include <asm/fred.h>

#include "internal.h"

@@ -2111,6 +2112,20 @@ DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
exc_machine_check_user(regs);
local_db_restore(dr7);
}
+
+#ifdef CONFIG_X86_FRED
+DEFINE_FRED_HANDLER(fred_exc_machine_check)
+{
+ unsigned long dr7;
+
+ dr7 = local_db_save();
+ if (user_mode(regs))
+ exc_machine_check_user(regs);
+ else
+ exc_machine_check_kernel(regs);
+ local_db_restore(dr7);
+}
+#endif
#else
/* 32bit unified entry point */
DEFINE_IDTENTRY_RAW(exc_machine_check)
--
2.34.1

2023-04-10 08:43:04

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 20/33] x86/fred: FRED entry/exit and dispatch code

From: "H. Peter Anvin (Intel)" <[email protected]>

The code to actually handle kernel and event entry/exit using
FRED. It is split up into two files thus:

- entry_64_fred.S contains the actual entrypoints and exit code, and
saves and restores registers.
- entry_fred.c contains the two-level event dispatch code for FRED.
The first-level dispatch is on the event type, and the second-level
is on the event vector.

Originally-by: Megha Dey <[email protected]>
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Co-developed-by: Xin Li <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v1:
* Initialize a FRED exception handler to fred_bad_event() instead of NULL
if no FRED handler defined for an exception vector (Peter Zijlstra).
* Push calling irqentry_{enter,exit}() and instrumentation_{begin,end}()
down into individual FRED exception handlers, instead of in the dispatch
framework (Peter Zijlstra).
---
arch/x86/entry/Makefile | 5 +-
arch/x86/entry/entry_64_fred.S | 57 +++++++++
arch/x86/entry/entry_fred.c | 220 ++++++++++++++++++++++++++++++++
arch/x86/include/asm/idtentry.h | 8 ++
4 files changed, 289 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/entry/entry_64_fred.S
create mode 100644 arch/x86/entry/entry_fred.c

diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index ca2fe186994b..c93e7f5c2a06 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -18,6 +18,9 @@ obj-y += vdso/
obj-y += vsyscall/

obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o
+CFLAGS_entry_fred.o += -fno-stack-protector
+CFLAGS_REMOVE_entry_fred.o += -pg $(CC_FLAGS_FTRACE)
+obj-$(CONFIG_X86_FRED) += entry_64_fred.o entry_fred.o
+
obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o
obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o
-
diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
new file mode 100644
index 000000000000..d975cacd060f
--- /dev/null
+++ b/arch/x86/entry/entry_64_fred.S
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * arch/x86/entry/entry_64_fred.S
+ *
+ * The actual FRED entry points.
+ */
+#include <linux/linkage.h>
+#include <asm/errno.h>
+#include <asm/asm-offsets.h>
+#include <asm/fred.h>
+
+#include "calling.h"
+
+ .code64
+ .section ".noinstr.text", "ax"
+
+.macro FRED_ENTER
+ UNWIND_HINT_EMPTY
+ PUSH_AND_CLEAR_REGS
+ movq %rsp, %rdi /* %rdi -> pt_regs */
+.endm
+
+.macro FRED_EXIT
+ UNWIND_HINT_REGS
+ POP_REGS
+ addq $8,%rsp /* Drop error code */
+.endm
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * IA32_FRED_CONFIG & ~FFFH for events that occur in ring 3.
+ * Thus the FRED ring 3 entry point must be 4K page aligned.
+ */
+ .align 4096
+
+SYM_CODE_START_NOALIGN(fred_entrypoint_user)
+ FRED_ENTER
+ call fred_entry_from_user
+SYM_INNER_LABEL(fred_exit_user, SYM_L_GLOBAL)
+ FRED_EXIT
+ ERETU
+SYM_CODE_END(fred_entrypoint_user)
+
+.fill fred_entrypoint_kernel - ., 1, 0xcc
+
+/*
+ * The new RIP value that FRED event delivery establishes is
+ * (IA32_FRED_CONFIG & ~FFFH) + 256 for events that occur in
+ * ring 0, i.e., fred_entrypoint_user + 256.
+ */
+ .org fred_entrypoint_user+256
+SYM_CODE_START_NOALIGN(fred_entrypoint_kernel)
+ FRED_ENTER
+ call fred_entry_from_kernel
+ FRED_EXIT
+ ERETS
+SYM_CODE_END(fred_entrypoint_kernel)
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
new file mode 100644
index 000000000000..0bff1db913b8
--- /dev/null
+++ b/arch/x86/entry/entry_fred.c
@@ -0,0 +1,220 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * arch/x86/entry/entry_fred.c
+ *
+ * This contains the dispatch functions called from the entry point
+ * assembly.
+ */
+
+#include <linux/kernel.h>
+#include <linux/kdebug.h> /* oops_begin/end, ... */
+#include <linux/nospec.h>
+#include <asm/event-type.h>
+#include <asm/fred.h>
+#include <asm/idtentry.h>
+#include <asm/syscall.h>
+#include <asm/trapnr.h>
+#include <asm/traps.h>
+#include <asm/kdebug.h>
+
+/*
+ * Badness...
+ */
+static DEFINE_FRED_HANDLER(fred_bad_event)
+{
+ irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+
+ /* Panic on events from a high stack level */
+ if (regs->current_stack_level > 0) {
+ pr_emerg("PANIC: invalid or fatal FRED event; event type %u "
+ "vector %u error 0x%lx aux 0x%lx at %04x:%016lx\n",
+ regs->type, regs->vector, regs->orig_ax,
+ fred_event_data(regs), regs->cs, regs->ip);
+ die("invalid or fatal FRED event", regs, regs->orig_ax);
+ panic("invalid or fatal FRED event");
+ } else {
+ unsigned long flags = oops_begin();
+ int sig = SIGKILL;
+
+ pr_alert("BUG: invalid or fatal FRED event; event type %u "
+ "vector %u error 0x%lx aux 0x%lx at %04x:%016lx\n",
+ regs->type, regs->vector, regs->orig_ax,
+ fred_event_data(regs), regs->cs, regs->ip);
+
+ if (__die("Invalid or fatal FRED event", regs, regs->orig_ax))
+ sig = 0;
+
+ oops_end(flags, regs, sig);
+ }
+
+ instrumentation_end();
+ irqentry_nmi_exit(regs, irq_state);
+}
+
+noinstr void fred_exc_double_fault(struct pt_regs *regs)
+{
+ exc_double_fault(regs, regs->orig_ax);
+}
+
+/*
+ * Exception entry
+ */
+static DEFINE_FRED_HANDLER(fred_exception)
+{
+ /*
+ * Exceptions that cannot happen on FRED h/w are set to fred_bad_event().
+ */
+ static const fred_handler exception_handlers[NUM_EXCEPTION_VECTORS] = {
+ [X86_TRAP_DE] = exc_divide_error,
+ [X86_TRAP_DB] = fred_exc_debug,
+ [X86_TRAP_NMI] = fred_bad_event, /* A separate event type, not handled here */
+ [X86_TRAP_BP] = exc_int3,
+ [X86_TRAP_OF] = exc_overflow,
+ [X86_TRAP_BR] = exc_bounds,
+ [X86_TRAP_UD] = exc_invalid_op,
+ [X86_TRAP_NM] = exc_device_not_available,
+ [X86_TRAP_DF] = fred_exc_double_fault,
+ [X86_TRAP_OLD_MF] = fred_bad_event, /* 387 only! */
+ [X86_TRAP_TS] = fred_exc_invalid_tss,
+ [X86_TRAP_NP] = fred_exc_segment_not_present,
+ [X86_TRAP_SS] = fred_exc_stack_segment,
+ [X86_TRAP_GP] = fred_exc_general_protection,
+ [X86_TRAP_PF] = fred_exc_page_fault,
+ [X86_TRAP_SPURIOUS] = fred_bad_event, /* Interrupts are their own event type */
+ [X86_TRAP_MF] = exc_coprocessor_error,
+ [X86_TRAP_AC] = fred_exc_alignment_check,
+ [X86_TRAP_MC] = fred_exc_machine_check,
+ [X86_TRAP_XF] = exc_simd_coprocessor_error,
+ [X86_TRAP_VE...NUM_EXCEPTION_VECTORS-1] = fred_bad_event
+ };
+ u8 vector = array_index_nospec((u8)regs->vector, NUM_EXCEPTION_VECTORS);
+
+ exception_handlers[vector](regs);
+}
+
+static __always_inline void fred_emulate_trap(struct pt_regs *regs)
+{
+ regs->type = EVENT_TYPE_SWFAULT;
+ regs->orig_ax = 0;
+ fred_exception(regs);
+}
+
+static __always_inline void fred_emulate_fault(struct pt_regs *regs)
+{
+ regs->ip -= regs->instr_len;
+ fred_emulate_trap(regs);
+}
+
+/*
+ * Emulate SYSENTER if applicable. This is not the preferred system
+ * call in 32-bit mode under FRED, rather int $0x80 is preferred and
+ * exported in the vdso. SYSCALL proper has a hard-coded early out in
+ * fred_entry_from_user().
+ */
+static DEFINE_FRED_HANDLER(fred_syscall_slow)
+{
+ if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
+ likely(regs->vector == FRED_SYSENTER)) {
+ /* Convert frame to a syscall frame */
+ regs->orig_ax = regs->ax;
+ regs->ax = -ENOSYS;
+ do_fast_syscall_32(regs);
+ } else {
+ regs->vector = X86_TRAP_UD;
+ fred_emulate_fault(regs);
+ }
+}
+
+/*
+ * Some software exceptions can also be triggered as int instructions,
+ * for historical reasons. Implement those here. The performance-critical
+ * int $0x80 (32-bit system call) has a hard-coded early out.
+ */
+static DEFINE_FRED_HANDLER(fred_sw_interrupt_user)
+{
+ if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
+ likely(regs->vector == IA32_SYSCALL_VECTOR)) {
+ /* Convert frame to a syscall frame */
+ regs->orig_ax = regs->ax;
+ regs->ax = -ENOSYS;
+ return do_int80_syscall_32(regs);
+ }
+
+ switch (regs->vector) {
+ case X86_TRAP_BP:
+ case X86_TRAP_OF:
+ fred_emulate_trap(regs);
+ break;
+ default:
+ regs->vector = X86_TRAP_GP;
+ fred_emulate_fault(regs);
+ break;
+ }
+}
+
+static DEFINE_FRED_HANDLER(fred_hw_interrupt)
+{
+ irqentry_state_t state = irqentry_enter(regs);
+
+ instrumentation_begin();
+ external_interrupt(regs);
+ instrumentation_end();
+ irqentry_exit(regs, state);
+}
+
+__visible noinstr void fred_entry_from_user(struct pt_regs *regs)
+{
+ static const fred_handler user_handlers[FRED_EVENT_TYPE_COUNT] =
+ {
+ [EVENT_TYPE_HWINT] = fred_hw_interrupt,
+ [EVENT_TYPE_RESERVED] = fred_bad_event,
+ [EVENT_TYPE_NMI] = fred_exc_nmi,
+ [EVENT_TYPE_SWINT] = fred_sw_interrupt_user,
+ [EVENT_TYPE_HWFAULT] = fred_exception,
+ [EVENT_TYPE_SWFAULT] = fred_exception,
+ [EVENT_TYPE_PRIVSW] = fred_exception,
+ [EVENT_TYPE_OTHER] = fred_syscall_slow
+ };
+
+ /*
+ * FRED employs a two-level event dispatch mechanism, with
+ * the first-level on the type of an event and the second-level
+ * on its vector. Thus a dispatch typically induces 2 calls.
+ * We optimize it by using early outs for the most frequent
+ * events, and syscalls are the first. We may also need early
+ * outs for page faults.
+ */
+ if (likely(regs->type == EVENT_TYPE_OTHER &&
+ regs->vector == FRED_SYSCALL)) {
+ /* Convert frame to a syscall frame */
+ regs->orig_ax = regs->ax;
+ regs->ax = -ENOSYS;
+ do_syscall_64(regs, regs->orig_ax);
+ } else {
+ /* Not a system call */
+ u8 type = array_index_nospec((u8)regs->type, FRED_EVENT_TYPE_COUNT);
+
+ user_handlers[type](regs);
+ }
+}
+
+__visible noinstr void fred_entry_from_kernel(struct pt_regs *regs)
+{
+ static const fred_handler kernel_handlers[FRED_EVENT_TYPE_COUNT] =
+ {
+ [EVENT_TYPE_HWINT] = fred_hw_interrupt,
+ [EVENT_TYPE_RESERVED] = fred_bad_event,
+ [EVENT_TYPE_NMI] = fred_exc_nmi,
+ [EVENT_TYPE_SWINT] = fred_bad_event,
+ [EVENT_TYPE_HWFAULT] = fred_exception,
+ [EVENT_TYPE_SWFAULT] = fred_exception,
+ [EVENT_TYPE_PRIVSW] = fred_exception,
+ [EVENT_TYPE_OTHER] = fred_bad_event
+ };
+ u8 type = array_index_nospec((u8)regs->type, FRED_EVENT_TYPE_COUNT);
+
+ /* The pt_regs frame on entry here is an exception frame */
+ kernel_handlers[type](regs);
+}
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 2876ddae02bc..bd43866f9c3e 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -82,6 +82,7 @@ static __always_inline void __##func(struct pt_regs *regs)
#define DECLARE_IDTENTRY_ERRORCODE(vector, func) \
asmlinkage void asm_##func(void); \
asmlinkage void xen_asm_##func(void); \
+ __visible void fred_##func(struct pt_regs *regs); \
__visible void func(struct pt_regs *regs, unsigned long error_code)

/**
@@ -106,6 +107,11 @@ __visible noinstr void func(struct pt_regs *regs, \
irqentry_exit(regs, state); \
} \
\
+__visible noinstr void fred_##func(struct pt_regs *regs) \
+{ \
+ func (regs, regs->orig_ax); \
+} \
+ \
static __always_inline void __##func(struct pt_regs *regs, \
unsigned long error_code)

@@ -622,6 +628,8 @@ DECLARE_IDTENTRY_RAW(X86_TRAP_MC, exc_machine_check);
#ifdef CONFIG_XEN_PV
DECLARE_IDTENTRY_RAW(X86_TRAP_MC, xenpv_exc_machine_check);
#endif
+#else
+#define fred_exc_machine_check fred_bad_event
#endif

/* NMI */
--
2.34.1

2023-04-10 08:43:06

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 22/33] x86/fred: update MSR_IA32_FRED_RSP0 during task switch

From: "H. Peter Anvin (Intel)" <[email protected]>

MSR_IA32_FRED_RSP0 is used during ring 3 event delivery, and needs to
be updated to point to the top of next task stack during task switch.

Update MSR_IA32_FRED_RSP0 with WRMSR instruction for now, and will use
WRMSRNS/WRMSRLIST for performance once it gets upstreamed.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/switch_to.h | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 5c91305d09d2..00fd85abc1d2 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -68,9 +68,16 @@ static inline void update_task_stack(struct task_struct *task)
#ifdef CONFIG_X86_32
this_cpu_write(cpu_tss_rw.x86_tss.sp1, task->thread.sp0);
#else
- /* Xen PV enters the kernel on the thread stack. */
- if (cpu_feature_enabled(X86_FEATURE_XENPV))
+ if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+ /*
+ * Will use WRMSRNS/WRMSRLIST for performance once it's upstreamed.
+ */
+ wrmsrl(MSR_IA32_FRED_RSP0,
+ task_top_of_stack(task) + TOP_OF_KERNEL_STACK_PADDING);
+ } else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
+ /* Xen PV enters the kernel on the thread stack. */
load_sp0(task_top_of_stack(task));
+ }
#endif
}

--
2.34.1

2023-04-10 08:43:16

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 21/33] x86/fred: FRED initialization code

From: "H. Peter Anvin (Intel)" <[email protected]>

The code to initialize FRED when it's available and _not_ disabled.

cpu_init_fred_exceptions() is the core function to initialize FRED,
which
1. Sets up FRED entrypoints for events happening in ring 0 and 3.
2. Sets up a default stack for event handling.
3. Sets up dedicated event stacks for DB/NMI/MC/DF, equivalent to
the IDT IST stacks.
4. Forces 32-bit system calls to use "int $0x80" only.
5. Enables FRED and invalidtes IDT.

When the FRED is used, cpu_init_exception_handling() initializes FRED
through calling cpu_init_fred_exceptions(), otherwise it sets up TSS
IST and loads IDT.

As FRED uses the ring 3 FRED entrypoint for SYSCALL and SYSENTER,
it skips setting up SYSCALL/SYSENTER related MSRs, e.g., MSR_LSTAR.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Co-developed-by: Xin Li <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v5:
* Add a comment for FRED stack level settings (Lai Jiangshan).
* Define #DB/NMI/#MC/#DF stack levels using macros.
---
arch/x86/include/asm/fred.h | 27 +++++++++++++
arch/x86/include/asm/traps.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/cpu/common.c | 74 +++++++++++++++++++++-------------
arch/x86/kernel/fred.c | 78 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/irqinit.c | 7 +++-
arch/x86/kernel/traps.c | 16 +++++++-
7 files changed, 175 insertions(+), 30 deletions(-)
create mode 100644 arch/x86/kernel/fred.c

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 61048aa4e01d..c5fbc4f18059 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -57,6 +57,19 @@
#define FRED_CSX_ALLOW_SINGLE_STEP _BITUL(25)
#define FRED_CSX_INTERRUPT_SHADOW _BITUL(24)

+/* #DB in the kernel would imply the use of a kernel debugger. */
+#define FRED_DB_STACK_LEVEL 1
+#define FRED_NMI_STACK_LEVEL 2
+#define FRED_MC_STACK_LEVEL 2
+/*
+ * #DF is the highest level because a #DF means "something went wrong
+ * *while delivering an exception*." The number of cases for which that
+ * can happen with FRED is drastically reduced and basically amounts to
+ * "the stack you pointed me to is broken." Thus, always change stacks
+ * on #DF, which means it should be at the highest level.
+ */
+#define FRED_DF_STACK_LEVEL 3
+
#ifndef __ASSEMBLY__

#include <linux/kernel.h>
@@ -104,8 +117,22 @@ DECLARE_FRED_HANDLER(fred_exc_debug);
DECLARE_FRED_HANDLER(fred_exc_page_fault);
DECLARE_FRED_HANDLER(fred_exc_machine_check);

+/*
+ * The actual assembly entry and exit points
+ */
+extern __visible void fred_entrypoint_user(void);
+
+/*
+ * Initialization
+ */
+void cpu_init_fred_exceptions(void);
+void fred_setup_apic(void);
+
#endif /* __ASSEMBLY__ */

+#else
+#define cpu_init_fred_exceptions() BUG()
+#define fred_setup_apic() BUG()
#endif /* CONFIG_X86_FRED */

#endif /* ASM_X86_FRED_H */
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 46f5e4e2a346..612b3d6fec53 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -56,4 +56,6 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
void f (struct pt_regs *regs)
typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));

+system_interrupt_handler get_system_interrupt_handler(unsigned int i);
+
#endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index dd61752f4c96..08d9c0a0bfbe 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -47,6 +47,7 @@ obj-y += platform-quirks.o
obj-y += process_$(BITS).o signal.o signal_$(BITS).o
obj-y += traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
obj-y += time.o ioport.o dumpstack.o nmi.o
+obj-$(CONFIG_X86_FRED) += fred.o
obj-$(CONFIG_MODIFY_LDT_SYSCALL) += ldt.o
obj-y += setup.o x86_init.o i8259.o irqinit.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e8cf6f4cfb52..eea41cb8722e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -58,6 +58,7 @@
#include <asm/microcode_intel.h>
#include <asm/intel-family.h>
#include <asm/cpu_device_id.h>
+#include <asm/fred.h>
#include <asm/uv/uv.h>
#include <asm/sigframe.h>
#include <asm/traps.h>
@@ -2054,28 +2055,6 @@ static void wrmsrl_cstar(unsigned long val)
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
- wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
- wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
-
-#ifdef CONFIG_IA32_EMULATION
- wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
- /*
- * This only works on Intel CPUs.
- * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
- * This does not cause SYSENTER to jump to the wrong location, because
- * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
- */
- wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
- wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
- (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
- wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
-#else
- wrmsrl_cstar((unsigned long)ignore_sysret);
- wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
- wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
- wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
-#endif
-
/*
* Flags to clear on syscall; clear as much as possible
* to minimize user space-kernel interference.
@@ -2086,6 +2065,41 @@ void syscall_init(void)
X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
X86_EFLAGS_AC|X86_EFLAGS_ID);
+
+ /*
+ * The default user and kernel segments
+ */
+ wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
+
+ if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+ /* Both sysexit and sysret cause #UD when FRED is enabled */
+ wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
+ wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+ wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+ } else {
+ wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+
+#ifdef CONFIG_IA32_EMULATION
+ wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
+ /*
+ * This only works on Intel CPUs.
+ * On AMD CPUs these MSRs are 32-bit, CPU truncates
+ * MSR_IA32_SYSENTER_EIP.
+ * This does not cause SYSENTER to jump to the wrong
+ * location, because AMD doesn't allow SYSENTER in
+ * long mode (either 32- or 64-bit).
+ */
+ wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
+ wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+ (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
+ wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
+#else
+ wrmsrl_cstar((unsigned long)ignore_sysret);
+ wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
+ wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+ wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+#endif
+ }
}

#else /* CONFIG_X86_64 */
@@ -2218,18 +2232,24 @@ void cpu_init_exception_handling(void)
/* paranoid_entry() gets the CPU number from the GDT */
setup_getcpu(cpu);

- /* IST vectors need TSS to be set up. */
- tss_setup_ist(tss);
+ /* Set up the TSS */
tss_setup_io_bitmap(tss);
set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
-
load_TR_desc();

/* GHCB needs to be setup to handle #VC. */
setup_ghcb();

- /* Finally load the IDT */
- load_current_idt();
+ if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+ /* Set up FRED exception handling */
+ cpu_init_fred_exceptions();
+ } else {
+ /* IST vectors need TSS to be set up. */
+ tss_setup_ist(tss);
+
+ /* Finally load the IDT */
+ load_current_idt();
+ }
}

/*
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
new file mode 100644
index 000000000000..907abfd69f3f
--- /dev/null
+++ b/arch/x86/kernel/fred.c
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/kernel.h>
+#include <asm/desc.h>
+#include <asm/fred.h>
+#include <asm/tlbflush.h> /* For cr4_set_bits() */
+#include <asm/traps.h>
+
+/*
+ * Initialize FRED on this CPU. This cannot be __init as it is called
+ * during CPU hotplug.
+ */
+void cpu_init_fred_exceptions(void)
+{
+ wrmsrl(MSR_IA32_FRED_CONFIG,
+ FRED_CONFIG_REDZONE | /* Reserve for CALL emulation */
+ FRED_CONFIG_INT_STKLVL(0) |
+ FRED_CONFIG_ENTRYPOINT(fred_entrypoint_user));
+
+ /*
+ * The purpose of separate stacks for NMI, #DB and #MC *in the kernel*
+ * (remember that user space faults are always taken on stack level 0)
+ * is to avoid overflowing the kernel stack.
+ */
+ wrmsrl(MSR_IA32_FRED_STKLVLS,
+ FRED_STKLVL(X86_TRAP_DB, FRED_DB_STACK_LEVEL) |
+ FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) |
+ FRED_STKLVL(X86_TRAP_MC, FRED_MC_STACK_LEVEL) |
+ FRED_STKLVL(X86_TRAP_DF, FRED_DF_STACK_LEVEL));
+
+ /* The FRED equivalents to IST stacks... */
+ wrmsrl(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
+ wrmsrl(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
+ wrmsrl(MSR_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
+
+ /* Not used with FRED */
+ wrmsrl(MSR_LSTAR, 0ULL);
+ wrmsrl(MSR_CSTAR, 0ULL);
+ wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
+ wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+ wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+
+ /* Enable FRED */
+ cr4_set_bits(X86_CR4_FRED);
+ idt_invalidate(); /* Any further IDT use is a bug */
+
+ /* Use int $0x80 for 32-bit system calls in FRED mode */
+ setup_clear_cpu_cap(X86_FEATURE_SYSENTER32);
+ setup_clear_cpu_cap(X86_FEATURE_SYSCALL32);
+}
+
+/*
+ * Initialize system vectors from a FRED perspective, so
+ * lapic_assign_system_vectors() can do its job.
+ */
+void __init fred_setup_apic(void)
+{
+ int i;
+
+ for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
+ set_bit(i, system_vectors);
+
+ /*
+ * Don't set the non assigned system vectors in the
+ * system_vectors bitmap. Otherwise they show up in
+ * /proc/interrupts.
+ */
+#ifdef CONFIG_SMP
+ set_bit(IRQ_MOVE_CLEANUP_VECTOR, system_vectors);
+#endif
+
+ for (i = 0; i < NR_SYSTEM_VECTORS; i++) {
+ if (get_system_interrupt_handler(i) != NULL) {
+ set_bit(i + FIRST_SYSTEM_VECTOR, system_vectors);
+ }
+ }
+
+ /* The rest are fair game... */
+}
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index c683666876f1..2a510f72dd11 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -28,6 +28,7 @@
#include <asm/setup.h>
#include <asm/i8259.h>
#include <asm/traps.h>
+#include <asm/fred.h>
#include <asm/prom.h>

/*
@@ -96,7 +97,11 @@ void __init native_init_IRQ(void)
/* Execute any quirks before the call gates are initialised: */
x86_init.irqs.pre_vector_init();

- idt_setup_apic_and_irq_gates();
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ fred_setup_apic();
+ else
+ idt_setup_apic_and_irq_gates();
+
lapic_assign_system_vectors();

if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs()) {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 549f7f962f8f..ecfaf4d647bb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1537,6 +1537,14 @@ static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {

#undef SYSV

+system_interrupt_handler get_system_interrupt_handler(unsigned int i)
+{
+ if (i >= NR_SYSTEM_VECTORS)
+ return NULL;
+
+ return system_interrupt_handlers[i];
+}
+
/*
* External interrupt dispatch function.
*
@@ -1572,7 +1580,8 @@ void __init install_system_interrupt_handler(unsigned int n, const void *asm_add
#ifdef CONFIG_X86_64
system_interrupt_handlers[n - FIRST_SYSTEM_VECTOR] = (system_interrupt_handler)addr;
#endif
- alloc_intr_gate(n, asm_addr);
+ if (!cpu_feature_enabled(X86_FEATURE_FRED))
+ alloc_intr_gate(n, asm_addr);
}

void __init trap_init(void)
@@ -1585,7 +1594,10 @@ void __init trap_init(void)

/* Initialize TSS before setting up traps so ISTs work */
cpu_init_exception_handling();
+
/* Setup traps as cpu_init() might #GP */
- idt_setup_traps();
+ if (!cpu_feature_enabled(X86_FEATURE_FRED))
+ idt_setup_traps();
+
cpu_init();
}
--
2.34.1

2023-04-10 08:43:17

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 23/33] x86/fred: let ret_from_fork() jmp to fred_exit_user when FRED is enabled

From: "H. Peter Anvin (Intel)" <[email protected]>

Let ret_from_fork() jmp to fred_exit_user when FRED is enabled,
otherwise the existing IDT code is chosen.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/entry/entry_64.S | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index eccc3431e515..5b595a9b2ffb 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -299,7 +299,12 @@ SYM_CODE_START_NOALIGN(ret_from_fork)
UNWIND_HINT_REGS
movq %rsp, %rdi
call syscall_exit_to_user_mode /* returns with IRQs disabled */
+#ifdef CONFIG_X86_FRED
+ ALTERNATIVE "jmp swapgs_restore_regs_and_return_to_usermode", \
+ "jmp fred_exit_user", X86_FEATURE_FRED
+#else
jmp swapgs_restore_regs_and_return_to_usermode
+#endif

1:
/* kernel thread */
--
2.34.1

2023-04-10 08:43:20

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 26/33] x86/fred: allow single-step trap and NMI when starting a new thread

From: "H. Peter Anvin (Intel)" <[email protected]>

Allow single-step trap and NMI when starting a new thread, thus once
the new thread returns to ring3, single-step trap and NMI are both
enabled immediately.

High-order 48 bits above the lowest 16 bit CS are discarded by the
legacy IRET instruction, thus can be set unconditionally, even when
FRED is not enabled.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/fred.h | 11 +++++++++++
arch/x86/kernel/process_64.c | 13 +++++++------
2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index c5fbc4f18059..f7caf3b2f3f7 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -70,6 +70,14 @@
*/
#define FRED_DF_STACK_LEVEL 3

+/*
+ * High-order 48 bits above the lowest 16 bit CS are discarded by the
+ * legacy IRET instruction, thus can be set unconditionally, even when
+ * FRED is not enabled.
+ */
+#define CSX_PROCESS_START \
+ (FRED_CSX_ENABLE_NMI | FRED_CSX_ALLOW_SINGLE_STEP)
+
#ifndef __ASSEMBLY__

#include <linux/kernel.h>
@@ -133,6 +141,9 @@ void fred_setup_apic(void);
#else
#define cpu_init_fred_exceptions() BUG()
#define fred_setup_apic() BUG()
+
+#define CSX_PROCESS_START 0
+
#endif /* CONFIG_X86_FRED */

#endif /* ASM_X86_FRED_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 2bea86073646..c732d9dbff3a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -55,6 +55,7 @@
#include <asm/resctrl.h>
#include <asm/unistd.h>
#include <asm/fsgsbase.h>
+#include <asm/fred.h>
#ifdef CONFIG_IA32_EMULATION
/* Not included via unistd.h */
#include <asm/unistd_32_ia32.h>
@@ -506,7 +507,7 @@ void x86_gsbase_write_task(struct task_struct *task, unsigned long gsbase)
static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
unsigned long new_sp,
- unsigned int _cs, unsigned int _ss, unsigned int _ds)
+ u16 _cs, u16 _ss, u16 _ds)
{
WARN_ON_ONCE(regs != current_pt_regs());

@@ -521,11 +522,11 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
loadsegment(ds, _ds);
load_gs_index(0);

- regs->ip = new_ip;
- regs->sp = new_sp;
- regs->cs = _cs;
- regs->ss = _ss;
- regs->flags = X86_EFLAGS_IF;
+ regs->ip = new_ip;
+ regs->sp = new_sp;
+ regs->csx = _cs | CSX_PROCESS_START;
+ regs->ssx = _ss;
+ regs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
}

void
--
2.34.1

2023-04-10 08:43:20

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 24/33] x86/fred: disallow the swapgs instruction when FRED is enabled

From: "H. Peter Anvin (Intel)" <[email protected]>

The FRED architecture establishes the full supervisor/user through:
1) FRED event delivery swaps the value of the GS base address and
that of the IA32_KERNEL_GS_BASE MSR.
2) ERETU swaps the value of the GS base address and that of the
IA32_KERNEL_GS_BASE MSR.
Thus, the swapgs instruction is disallowed when FRED is enabled,
otherwise it causes #UD.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/kernel/process_64.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index a1aa74864c8b..2bea86073646 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -165,7 +165,8 @@ static noinstr unsigned long __rdgsbase_inactive(void)

lockdep_assert_irqs_disabled();

- if (!cpu_feature_enabled(X86_FEATURE_XENPV)) {
+ if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+ !cpu_feature_enabled(X86_FEATURE_XENPV)) {
native_swapgs();
gsbase = rdgsbase();
native_swapgs();
@@ -190,7 +191,8 @@ static noinstr void __wrgsbase_inactive(unsigned long gsbase)
{
lockdep_assert_irqs_disabled();

- if (!cpu_feature_enabled(X86_FEATURE_XENPV)) {
+ if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+ !cpu_feature_enabled(X86_FEATURE_XENPV)) {
native_swapgs();
wrgsbase(gsbase);
native_swapgs();
--
2.34.1

2023-04-10 08:43:50

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 29/33] x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f

From: "H. Peter Anvin (Intel)" <[email protected]>

FRED inherits the Intel VT-x enhancement of classified events with
a two-level event dispatch logic. The first-level dispatch is on
the event type, and the second-level is on the event vector. This
also means that vectors in different event types are orthogonal,
thus, vectors 0x10-0x1f become available as hardware interrupts.

Enable interrupt vectors 0x10-0x1f on FRED systems (interrupt 0x80 is
already enabled.) Most of these changes are about removing the
assumption that the lowest-priority vector is hard-wired to 0x20.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/include/asm/idtentry.h | 4 ++--
arch/x86/include/asm/irq.h | 5 +++++
arch/x86/include/asm/irq_vectors.h | 15 +++++++++++----
arch/x86/kernel/apic/apic.c | 11 ++++++++---
arch/x86/kernel/apic/vector.c | 8 +++++++-
arch/x86/kernel/fred.c | 4 ++--
arch/x86/kernel/idt.c | 6 +++---
arch/x86/kernel/irq.c | 2 +-
arch/x86/kernel/traps.c | 2 ++
9 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index bd43866f9c3e..57c891148b59 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -546,8 +546,8 @@ __visible noinstr void func(struct pt_regs *regs, \
*/
.align IDT_ALIGN
SYM_CODE_START(irq_entries_start)
- vector=FIRST_EXTERNAL_VECTOR
- .rept NR_EXTERNAL_VECTORS
+ vector=FIRST_EXTERNAL_VECTOR_IDT
+ .rept FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR_IDT
UNWIND_HINT_IRET_REGS
0 :
ENDBR
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 768aa234cbb4..e4be6f8409ad 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -11,6 +11,11 @@
#include <asm/apicdef.h>
#include <asm/irq_vectors.h>

+/*
+ * The first available IRQ vector
+ */
+extern unsigned int __ro_after_init first_external_vector;
+
/*
* The irq entry code is in the noinstr section and the start/end of
* __irqentry_text is emitted via labels. Make the build fail if
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..cb3670a7c18f 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -31,15 +31,23 @@

/*
* IDT vectors usable for external interrupt sources start at 0x20.
- * (0x80 is the syscall vector, 0x30-0x3f are for ISA)
+ * (0x80 is the syscall vector, 0x30-0x3f are for ISA).
+ *
+ * With FRED we can also use 0x10-0x1f even though those overlap
+ * exception vectors as FRED distinguishes exceptions and interrupts.
+ * Therefore, FIRST_EXTERNAL_VECTOR is no longer a constant.
*/
-#define FIRST_EXTERNAL_VECTOR 0x20
+#define FIRST_EXTERNAL_VECTOR_IDT 0x20
+#define FIRST_EXTERNAL_VECTOR_FRED 0x10
+#define FIRST_EXTERNAL_VECTOR first_external_vector

/*
* Reserve the lowest usable vector (and hence lowest priority) 0x20 for
* triggering cleanup after irq migration. 0x21-0x2f will still be used
* for device interrupts.
*/
+#define IRQ_MOVE_CLEANUP_VECTOR_IDT FIRST_EXTERNAL_VECTOR_IDT
+#define IRQ_MOVE_CLEANUP_VECTOR_FRED FIRST_EXTERNAL_VECTOR_FRED
#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR

#define IA32_SYSCALL_VECTOR 0x80
@@ -48,7 +56,7 @@
* Vectors 0x30-0x3f are used for ISA interrupts.
* round up to the next 16-vector boundary
*/
-#define ISA_IRQ_VECTOR(irq) (((FIRST_EXTERNAL_VECTOR + 16) & ~15) + irq)
+#define ISA_IRQ_VECTOR(irq) (((FIRST_EXTERNAL_VECTOR_IDT + 16) & ~15) + irq)

/*
* Special IRQ vectors used by the SMP architecture, 0xf0-0xff
@@ -114,7 +122,6 @@
#define FIRST_SYSTEM_VECTOR NR_VECTORS
#endif

-#define NR_EXTERNAL_VECTORS (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
#define NR_SYSTEM_VECTORS (NR_VECTORS - FIRST_SYSTEM_VECTOR)

/*
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 20d9a604da7c..eef67f64aa81 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1621,12 +1621,17 @@ static void setup_local_APIC(void)
/*
* Set Task Priority to 'accept all except vectors 0-31'. An APIC
* vector in the 16-31 range could be delivered if TPR == 0, but we
- * would think it's an exception and terrible things will happen. We
- * never change this later on.
+ * would think it's an exception and terrible things will happen,
+ * unless we are using FRED in which case interrupts and
+ * exceptions are distinguished by type code.
+ *
+ * We never change this later on.
*/
+ BUG_ON(!first_external_vector);
+
value = apic_read(APIC_TASKPRI);
value &= ~APIC_TPRI_MASK;
- value |= 0x10;
+ value |= (first_external_vector - 0x10) & APIC_TPRI_MASK;
apic_write(APIC_TASKPRI, value);

/* Clear eventually stale ISR/IRR bits */
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index c1efebd27e6c..f4325445fd78 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -46,6 +46,7 @@ static struct irq_matrix *vector_matrix;
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
#endif
+unsigned int first_external_vector = FIRST_EXTERNAL_VECTOR_IDT;

void lock_vector_lock(void)
{
@@ -796,7 +797,12 @@ int __init arch_early_irq_init(void)
* Allocate the vector matrix allocator data structure and limit the
* search area.
*/
- vector_matrix = irq_alloc_matrix(NR_VECTORS, FIRST_EXTERNAL_VECTOR,
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ first_external_vector = FIRST_EXTERNAL_VECTOR_FRED;
+ else
+ first_external_vector = FIRST_EXTERNAL_VECTOR_IDT;
+
+ vector_matrix = irq_alloc_matrix(NR_VECTORS, first_external_vector,
FIRST_SYSTEM_VECTOR);
BUG_ON(!vector_matrix);

diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index 907abfd69f3f..8d6dda8c8e71 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -56,7 +56,7 @@ void __init fred_setup_apic(void)
{
int i;

- for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
+ for (i = 0; i < FIRST_EXTERNAL_VECTOR_FRED; i++)
set_bit(i, system_vectors);

/*
@@ -65,7 +65,7 @@ void __init fred_setup_apic(void)
* /proc/interrupts.
*/
#ifdef CONFIG_SMP
- set_bit(IRQ_MOVE_CLEANUP_VECTOR, system_vectors);
+ set_bit(IRQ_MOVE_CLEANUP_VECTOR_FRED, system_vectors);
#endif

for (i = 0; i < NR_SYSTEM_VECTORS; i++) {
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..d3fd86f85de9 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -131,7 +131,7 @@ static const __initconst struct idt_data apic_idts[] = {
INTG(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
INTG(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
INTG(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
- INTG(IRQ_MOVE_CLEANUP_VECTOR, asm_sysvec_irq_move_cleanup),
+ INTG(IRQ_MOVE_CLEANUP_VECTOR_IDT, asm_sysvec_irq_move_cleanup),
INTG(REBOOT_VECTOR, asm_sysvec_reboot),
#endif

@@ -274,13 +274,13 @@ static void __init idt_map_in_cea(void)
*/
void __init idt_setup_apic_and_irq_gates(void)
{
- int i = FIRST_EXTERNAL_VECTOR;
+ int i = FIRST_EXTERNAL_VECTOR_IDT;
void *entry;

idt_setup_from_table(idt_table, apic_idts, ARRAY_SIZE(apic_idts), true);

for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
- entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
+ entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR_IDT);
set_intr_gate(i, entry);
}

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 7e125fff45ab..b7511e02959c 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -359,7 +359,7 @@ void fixup_irqs(void)
* vector_lock because the cpu is already marked !online, so
* nothing else will touch it.
*/
- for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
+ for (vector = first_external_vector; vector < NR_VECTORS; vector++) {
if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
continue;

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ecfaf4d647bb..73471053ed02 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1489,6 +1489,8 @@ DEFINE_IDTENTRY_IRQ(spurious_interrupt)
pr_info("Spurious interrupt (vector 0x%x) on CPU#%d, should never happen.\n",
vector, smp_processor_id());
}
+
+unsigned int first_external_vector = FIRST_EXTERNAL_VECTOR_IDT;
#endif

static void dispatch_table_spurious_interrupt(struct pt_regs *regs)
--
2.34.1

2023-04-10 08:44:00

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 25/33] x86/fred: no ESPFIX needed when FRED is enabled

From: "H. Peter Anvin (Intel)" <[email protected]>

Because FRED always restores the full value of %rsp, ESPFIX is
no longer needed when it's enabled.

Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/kernel/espfix_64.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 16f9814c9be0..48d133a54f45 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -106,6 +106,10 @@ void __init init_espfix_bsp(void)
pgd_t *pgd;
p4d_t *p4d;

+ /* FRED systems don't need ESPFIX */
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ return;
+
/* Install the espfix pud into the kernel page directory */
pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
@@ -129,6 +133,10 @@ void init_espfix_ap(int cpu)
void *stack_page;
pteval_t ptemask;

+ /* FRED systems don't need ESPFIX */
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ return;
+
/* We only have to do this once... */
if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
--
2.34.1

2023-04-10 08:44:05

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 30/33] x86/fred: allow dynamic stack frame size

A FRED stack frame could contain different amount of information for
different event types, or perhaps even for different instances of the
same event type. Thus we need to eliminate the need of any advance
information of the stack frame size to allow dynamic stack frame size.

Implement it through:
1) add a new field user_pt_regs to thread_info, and initialize it
with a pointer to a virtual pt_regs structure at the top of a
thread stack.
2) save a pointer to the user-space pt_regs structure created by
fred_entrypoint_user() to user_pt_regs in fred_entry_from_user().
3) initialize the init_thread_info's user_pt_regs with a pointer to
a virtual pt_regs structure at the top of init stack.

This approach also works for IDT, thus we unify the code.

Suggested-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/entry/entry_32.S | 2 +-
arch/x86/entry/entry_fred.c | 2 ++
arch/x86/include/asm/entry-common.h | 3 +++
arch/x86/include/asm/processor.h | 12 +++------
arch/x86/include/asm/switch_to.h | 3 +--
arch/x86/include/asm/thread_info.h | 41 ++++-------------------------
arch/x86/kernel/head_32.S | 3 +--
arch/x86/kernel/process.c | 5 ++++
kernel/fork.c | 6 +++++
9 files changed, 27 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 91397f58ac30..5adc4cf33d92 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1244,7 +1244,7 @@ SYM_CODE_START(rewind_stack_and_make_dead)
xorl %ebp, %ebp

movl PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %esi
- leal -TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
+ leal -PTREGS_SIZE(%esi), %esp

call make_task_dead
1: jmp 1b
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index 0bff1db913b8..18852d9d83fb 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -178,6 +178,8 @@ __visible noinstr void fred_entry_from_user(struct pt_regs *regs)
[EVENT_TYPE_OTHER] = fred_syscall_slow
};

+ current->thread_info.user_pt_regs = regs;
+
/*
* FRED employs a two-level event dispatch mechanism, with
* the first-level on the type of an event and the second-level
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 117903881fe4..5b7d0f47f188 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -12,6 +12,9 @@
/* Check that the stack and regs on entry from user mode are sane. */
static __always_inline void arch_enter_from_user_mode(struct pt_regs *regs)
{
+ if (!cpu_feature_enabled(X86_FEATURE_FRED))
+ current->thread_info.user_pt_regs = regs;
+
if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) {
/*
* Make sure that the entry code gave us a sensible EFLAGS
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 8d73004e4cac..4a50d2a2c14b 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -626,17 +626,11 @@ static inline void spin_lock_prefetch(const void *x)
prefetchw(x);
}

-#define TOP_OF_INIT_STACK ((unsigned long)&init_stack + sizeof(init_stack) - \
- TOP_OF_KERNEL_STACK_PADDING)
+#define TOP_OF_INIT_STACK ((unsigned long)&init_stack + sizeof(init_stack))

-#define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
+#define task_top_of_stack(task) ((unsigned long)task_stack_page(task) + THREAD_SIZE)

-#define task_pt_regs(task) \
-({ \
- unsigned long __ptr = (unsigned long)task_stack_page(task); \
- __ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING; \
- ((struct pt_regs *)__ptr) - 1; \
-})
+#define task_pt_regs(task) ((task)->thread_info.user_pt_regs)

#ifdef CONFIG_X86_32
#define INIT_THREAD { \
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 00fd85abc1d2..0a31da150808 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -72,8 +72,7 @@ static inline void update_task_stack(struct task_struct *task)
/*
* Will use WRMSRNS/WRMSRLIST for performance once it's upstreamed.
*/
- wrmsrl(MSR_IA32_FRED_RSP0,
- task_top_of_stack(task) + TOP_OF_KERNEL_STACK_PADDING);
+ wrmsrl(MSR_IA32_FRED_RSP0, task_top_of_stack(task));
} else if (cpu_feature_enabled(X86_FEATURE_XENPV)) {
/* Xen PV enters the kernel on the thread stack. */
load_sp0(task_top_of_stack(task));
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 998483078d5f..ced0a01e0a3e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -13,42 +13,6 @@
#include <asm/percpu.h>
#include <asm/types.h>

-/*
- * TOP_OF_KERNEL_STACK_PADDING is a number of unused bytes that we
- * reserve at the top of the kernel stack. We do it because of a nasty
- * 32-bit corner case. On x86_32, the hardware stack frame is
- * variable-length. Except for vm86 mode, struct pt_regs assumes a
- * maximum-length frame. If we enter from CPL 0, the top 8 bytes of
- * pt_regs don't actually exist. Ordinarily this doesn't matter, but it
- * does in at least one case:
- *
- * If we take an NMI early enough in SYSENTER, then we can end up with
- * pt_regs that extends above sp0. On the way out, in the espfix code,
- * we can read the saved SS value, but that value will be above sp0.
- * Without this offset, that can result in a page fault. (We are
- * careful that, in this case, the value we read doesn't matter.)
- *
- * In vm86 mode, the hardware frame is much longer still, so add 16
- * bytes to make room for the real-mode segments.
- *
- * x86-64 has a fixed-length stack frame, but it depends on whether
- * or not FRED is enabled. Future versions of FRED might make this
- * dynamic, but for now it is always 2 words longer.
- */
-#ifdef CONFIG_X86_32
-# ifdef CONFIG_VM86
-# define TOP_OF_KERNEL_STACK_PADDING 16
-# else
-# define TOP_OF_KERNEL_STACK_PADDING 8
-# endif
-#else /* x86-64 */
-# ifdef CONFIG_X86_FRED
-# define TOP_OF_KERNEL_STACK_PADDING (2*8)
-# else
-# define TOP_OF_KERNEL_STACK_PADDING 0
-# endif
-#endif
-
/*
* low level task data that entry.S needs immediate access to
* - this struct should fit entirely inside of one cache line
@@ -56,6 +20,7 @@
*/
#ifndef __ASSEMBLY__
struct task_struct;
+struct pt_regs;
#include <asm/cpufeature.h>
#include <linux/atomic.h>

@@ -66,11 +31,14 @@ struct thread_info {
#ifdef CONFIG_SMP
u32 cpu; /* current CPU */
#endif
+ struct pt_regs *user_pt_regs;
};

+#define INIT_TASK_PT_REGS ((struct pt_regs *)TOP_OF_INIT_STACK - 1)
#define INIT_THREAD_INFO(tsk) \
{ \
.flags = 0, \
+ .user_pt_regs = INIT_TASK_PT_REGS, \
}

#else /* !__ASSEMBLY__ */
@@ -240,6 +208,7 @@ static inline int arch_within_stack_frames(const void * const stack,

extern void arch_task_cache_init(void);
extern int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src);
+extern void arch_init_user_pt_regs(struct task_struct *tsk);
extern void arch_release_task_struct(struct task_struct *tsk);
extern void arch_setup_new_exec(void);
#define arch_setup_new_exec arch_setup_new_exec
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 67c8ed99144b..0201ddcd7576 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -517,8 +517,7 @@ SYM_DATA_END(initial_page_table)
* reliably detect the end of the stack.
*/
SYM_DATA(initial_stack,
- .long init_thread_union + THREAD_SIZE -
- SIZEOF_PTREGS - TOP_OF_KERNEL_STACK_PADDING)
+ .long init_thread_union + THREAD_SIZE - SIZEOF_PTREGS)

__INITRODATA
int_msg:
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b650cde3f64d..e1c6350290ae 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -98,6 +98,11 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
return 0;
}

+void arch_init_user_pt_regs(struct task_struct *tsk)
+{
+ tsk->thread_info.user_pt_regs = (struct pt_regs *)task_top_of_stack(tsk)- 1;
+}
+
#ifdef CONFIG_X86_64
void arch_release_task_struct(struct task_struct *tsk)
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 0c92f224c68c..8976d6b540c4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -956,6 +956,10 @@ int __weak arch_dup_task_struct(struct task_struct *dst,
return 0;
}

+void __weak arch_init_user_pt_regs(struct task_struct *tsk)
+{
+}
+
void set_task_stack_end_magic(struct task_struct *tsk)
{
unsigned long *stackend;
@@ -983,6 +987,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
if (err)
goto free_tsk;

+ arch_init_user_pt_regs(tsk);
+
#ifdef CONFIG_THREAD_INFO_IN_TASK
refcount_set(&tsk->stack_refcount, 1);
#endif
--
2.34.1

2023-04-10 08:44:05

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 27/33] x86/fred: fixup fault on ERETU by jumping to fred_entrypoint_user

If the stack frame contains an invalid user context (e.g. due to invalid SS,
a non-canonical RIP, etc.) the ERETU instruction will trap (#SS or #GP).

From a Linux point of view, this really should be considered a user space
failure, so use the standard fault fixup mechanism to intercept the fault,
fix up the exception frame, and redirect execution to fred_entrypoint_user.
The end result is that it appears just as if the hardware had taken the
exception immediately after completing the transition to user space.

Suggested-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v6:
* Add a comment to explain why it is safe to write to the previous FRED stack
frame. (Lai Jiangshan).

Changes since v5:
* Move the NMI bit from an invalid stack frame, which caused ERETU to fault,
to the fault handler's stack frame, thus to unblock NMI ASAP if NMI is blocked
(Lai Jiangshan).
---
arch/x86/entry/entry_64_fred.S | 8 ++-
arch/x86/include/asm/extable_fixup_types.h | 4 +-
arch/x86/mm/extable.c | 76 ++++++++++++++++++++++
3 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index d975cacd060f..efe2bcd11273 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -5,8 +5,10 @@
* The actual FRED entry points.
*/
#include <linux/linkage.h>
-#include <asm/errno.h>
+#include <asm/asm.h>
#include <asm/asm-offsets.h>
+#include <asm/errno.h>
+#include <asm/export.h>
#include <asm/fred.h>

#include "calling.h"
@@ -38,7 +40,9 @@ SYM_CODE_START_NOALIGN(fred_entrypoint_user)
call fred_entry_from_user
SYM_INNER_LABEL(fred_exit_user, SYM_L_GLOBAL)
FRED_EXIT
- ERETU
+1: ERETU
+
+ _ASM_EXTABLE_TYPE(1b, fred_entrypoint_user, EX_TYPE_ERETU)
SYM_CODE_END(fred_entrypoint_user)

.fill fred_entrypoint_kernel - ., 1, 0xcc
diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 991e31cfde94..1585c798a02f 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -64,6 +64,8 @@
#define EX_TYPE_UCOPY_LEN4 (EX_TYPE_UCOPY_LEN | EX_DATA_IMM(4))
#define EX_TYPE_UCOPY_LEN8 (EX_TYPE_UCOPY_LEN | EX_DATA_IMM(8))

-#define EX_TYPE_ZEROPAD 20 /* longword load with zeropad on fault */
+#define EX_TYPE_ZEROPAD 20 /* longword load with zeropad on fault */
+
+#define EX_TYPE_ERETU 21

#endif
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 60814e110a54..9d82193adf3c 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -6,6 +6,7 @@
#include <xen/xen.h>

#include <asm/fpu/api.h>
+#include <asm/fred.h>
#include <asm/sev.h>
#include <asm/traps.h>
#include <asm/kdebug.h>
@@ -195,6 +196,77 @@ static bool ex_handler_ucopy_len(const struct exception_table_entry *fixup,
return ex_handler_uaccess(fixup, regs, trapnr);
}

+#ifdef CONFIG_X86_FRED
+static bool ex_handler_eretu(const struct exception_table_entry *fixup,
+ struct pt_regs *regs, unsigned long error_code)
+{
+ struct pt_regs *uregs = (struct pt_regs *)(regs->sp - offsetof(struct pt_regs, ip));
+ unsigned short ss = uregs->ss;
+ unsigned short cs = uregs->cs;
+
+ /*
+ * Move the NMI bit from the invalid stack frame, which caused ERETU
+ * to fault, to the fault handler's stack frame, thus to unblock NMI
+ * with the fault handler's ERETS instruction ASAP if NMI is blocked.
+ */
+ regs->nmi = uregs->nmi;
+
+ fred_info(uregs)->edata = fred_event_data(regs);
+ uregs->ssx = regs->ssx;
+ uregs->ss = ss;
+ uregs->csx = regs->csx;
+ uregs->nmi = 0; /* The NMI bit was moved away above */
+ uregs->current_stack_level = 0;
+ uregs->cs = cs;
+
+ /*
+ * Copy error code to uregs and adjust stack pointer accordingly.
+ *
+ * The RSP used by FRED to push a stack frame is not the value in %rsp,
+ * it is calculated from %rsp with the following 2 steps:
+ * 1) RSP = %rsp - (IA32_FRED_CONFIG & 0x1c0) // Reserve N*64 bytes
+ * 2) RSP = RSP & ~0x3f // Align to a 64-byte cache line
+ * when the event delivery doesn't trigger a stack level change.
+ *
+ * Here is an example with N*64 (N=1) bytes reserved:
+ *
+ * 64-byte cache line ==> ______________
+ * |___Reserved___|
+ * |__Event_data__|
+ * |_____SS_______|
+ * |_____RSP______|
+ * |_____FLAGS____|
+ * |_____CS_______|
+ * |_____IP_______| <== ERETU stack frame
+ * 64-byte cache line ==> |__Error_code__|
+ * |______________|
+ * |______________|
+ * |______________|
+ * |______________|
+ * |______________|
+ * |______________|
+ * |______________| <== RSP after step 1)
+ * 64-byte cache line ==> |______________| <== RSP after step 2)
+ * |___Reserved___|
+ * |__Event_data__|
+ * |_____SS_______|
+ * |_____RSP______|
+ * |_____FLAGS____|
+ * |_____CS_______|
+ * |_____IP_______| <== ERETS stack frame
+ * 64-byte cache line ==> |__Error_code__|
+ *
+ * Thus a new FRED stack frame will always be pushed below a previous
+ * FRED stack frame ((N*64) bytes may be reserved between), and it is
+ * safe to write to a previous FRED stack frame as they never overlap.
+ */
+ uregs->orig_ax = error_code;
+ regs->sp -= 8;
+
+ return ex_handler_default(fixup, regs);
+}
+#endif
+
int ex_get_fixup_type(unsigned long ip)
{
const struct exception_table_entry *e = search_exception_tables(ip);
@@ -272,6 +344,10 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
return ex_handler_ucopy_len(e, regs, trapnr, reg, imm);
case EX_TYPE_ZEROPAD:
return ex_handler_zeropad(e, regs, fault_addr);
+#ifdef CONFIG_X86_FRED
+ case EX_TYPE_ERETU:
+ return ex_handler_eretu(e, regs, error_code);
+#endif
}
BUG();
}
--
2.34.1

2023-04-10 08:44:10

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 28/33] x86/ia32: do not modify the DPL bits for a null selector

When a null selector is to be loaded into a segment register,
reload_segments() sets its DPL bits to 3. Later when the IRET
instruction loads it, it zeros the segment register. The two
operations offset each other to actually effect a nop.

Unlike IRET, ERETU does not make any of DS, ES, FS, or GS null
if it is found to have DPL < 3. It is expected that a FRED-enabled
operating system will return to ring 3 (in compatibility mode)
only when those segments all have DPL = 3.

Thus when FRED is enabled, we end up with having 3 in a segment
register even when it is initially set to 0.

Fix it by not modifying the DPL bits for a null selector.

Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/kernel/signal_32.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..7796cf84fca2 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -36,22 +36,27 @@
#ifdef CONFIG_IA32_EMULATION
#include <asm/ia32_unistd.h>

+static inline u16 usrseg(u16 sel)
+{
+ return sel <= 3 ? sel : sel | 3;
+}
+
static inline void reload_segments(struct sigcontext_32 *sc)
{
unsigned int cur;

savesegment(gs, cur);
- if ((sc->gs | 0x03) != cur)
- load_gs_index(sc->gs | 0x03);
+ if (usrseg(sc->gs) != cur)
+ load_gs_index(usrseg(sc->gs));
savesegment(fs, cur);
- if ((sc->fs | 0x03) != cur)
- loadsegment(fs, sc->fs | 0x03);
+ if (usrseg(sc->fs) != cur)
+ loadsegment(fs, usrseg(sc->fs));
savesegment(ds, cur);
- if ((sc->ds | 0x03) != cur)
- loadsegment(ds, sc->ds | 0x03);
+ if (usrseg(sc->ds) != cur)
+ loadsegment(ds, usrseg(sc->ds));
savesegment(es, cur);
- if ((sc->es | 0x03) != cur)
- loadsegment(es, sc->es | 0x03);
+ if (usrseg(sc->es) != cur)
+ loadsegment(es, usrseg(sc->es));
}

#define sigset32_t compat_sigset_t
--
2.34.1

2023-04-10 08:44:28

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 32/33] x86/fred: disable FRED by default in its early stage

Disable FRED by default in its early stage.

To enable FRED, a new kernel command line option "fred" needs to be added.

Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v7:
* Add a log message when FRED is enabled.
---
Documentation/admin-guide/kernel-parameters.txt | 4 ++++
arch/x86/kernel/cpu/common.c | 3 +++
arch/x86/kernel/fred.c | 3 +++
3 files changed, 10 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6221a1d057dd..c55ea60e1a0c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1498,6 +1498,10 @@
Warning: use of this parameter will taint the kernel
and may cause unknown problems.

+ fred
+ Forcefully enable flexible return and event delivery,
+ which is otherwise disabled by default.
+
ftrace=[tracer]
[FTRACE] will set and start the specified tracer
as early as possible in order to facilitate early
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index eea41cb8722e..4db5e619fc97 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1467,6 +1467,9 @@ static void __init cpu_parse_early_param(void)
char *argptr = arg, *opt;
int arglen, taint = 0;

+ if (!cmdline_find_option_bool(boot_command_line, "fred"))
+ setup_clear_cpu_cap(X86_FEATURE_FRED);
+
#ifdef CONFIG_X86_32
if (cmdline_find_option_bool(boot_command_line, "no387"))
#ifdef CONFIG_MATH_EMULATION
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
index 8d6dda8c8e71..f4d48f4f06b4 100644
--- a/arch/x86/kernel/fred.c
+++ b/arch/x86/kernel/fred.c
@@ -11,6 +11,9 @@
*/
void cpu_init_fred_exceptions(void)
{
+ /* When FRED is enabled by default, this log message may not needed */
+ pr_info("Initialize FRED on CPU%d\n", smp_processor_id());
+
wrmsrl(MSR_IA32_FRED_CONFIG,
FRED_CONFIG_REDZONE | /* Reserve for CALL emulation */
FRED_CONFIG_INT_STKLVL(0) |
--
2.34.1

2023-04-10 08:44:39

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 31/33] x86/fred: BUG() when ERETU with %rsp not equal to that when the ring 3 event was just delivered

A FRED stack frame generated by a ring 3 event should never be messed up, and
the first thing we must make sure is that at the time an ERETU instruction is
executed, %rsp must have the same address as that when the user level event
was just delivered.

However we don't want to bother the normal code path of ERETU because it's on
the hotest code path, a good choice is to do this check when ERETU faults.

Suggested-by: H. Peter Anvin (Intel) <[email protected]>
Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---
arch/x86/mm/extable.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 9d82193adf3c..be297d4b137b 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -204,6 +204,14 @@ static bool ex_handler_eretu(const struct exception_table_entry *fixup,
unsigned short ss = uregs->ss;
unsigned short cs = uregs->cs;

+ /*
+ * A FRED stack frame generated by a ring 3 event should never be
+ * messed up, and the first thing we must make sure is that at the
+ * time an ERETU instruction is executed, %rsp must have the same
+ * address as that when the user level event was just delivered.
+ */
+ BUG_ON(uregs != current->thread_info.user_pt_regs);
+
/*
* Move the NMI bit from the invalid stack frame, which caused ERETU
* to fault, to the fault handler's stack frame, thus to unblock NMI
--
2.34.1

2023-04-10 08:45:16

by Li, Xin3

[permalink] [raw]
Subject: [PATCH v8 33/33] KVM: x86/vmx: refactor VMX_DO_EVENT_IRQOFF to generate FRED stack frames

Comparing to an IDT stack frame, a FRED stack frame has extra 16 bytes of
information pushed at the regular stack top and 8 bytes of error code _always_
pushed at the regular stack bottom, VMX_DO_EVENT_IRQOFF can be refactored
to generate FRED stack frames with event type and vector properly set. Thus,
IRQ/NMI can be handled with the existing approach when FRED is enabled.

As a FRED stack frame always contains an error code pushed by hardware, call
a trampoline code first to have the return instruction address pushed on
the regular stack. Then the trampoline code pushes an error code (0 for
both IRQ and NMI) and jumps to fred_entrypoint_kernel() for NMI handling or
calls external_interrupt() for IRQ handling.

The trampoline code for IRQ handling pushes general purpose registers to
form a pt_regs structure and then use it to call external_interrupt(). As a
result, IRQ handling no longer re-enter the noinstr code.

Export fred_entrypoint_kernel() and external_interrupt() for above changes.

Tested-by: Shan Kang <[email protected]>
Signed-off-by: Xin Li <[email protected]>
---

Changes since v7:
* Always call external_interrupt() for IRQ handling on x86_64, thus avoid
re-entering the noinstr code.
* Create a FRED stack frame when FRED is compiled-in but not enabled, which
uses some extra stack space but simplifies the code.

Changes since v6:
* Export fred_entrypoint_kernel(), required when kvm-intel built as a module.
* Reserve a REDZONE for CALL emulation and align RSP to a 64-byte boundary
before pushing a new FRED stack frame.
---
arch/x86/entry/entry_64_fred.S | 1 +
arch/x86/include/asm/asm-prototypes.h | 1 +
arch/x86/include/asm/fred.h | 1 +
arch/x86/include/asm/traps.h | 2 +
arch/x86/kernel/traps.c | 5 ++
arch/x86/kvm/vmx/vmenter.S | 78 +++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmx.c | 12 +++--
7 files changed, 93 insertions(+), 7 deletions(-)

diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S
index efe2bcd11273..de74ab97ff00 100644
--- a/arch/x86/entry/entry_64_fred.S
+++ b/arch/x86/entry/entry_64_fred.S
@@ -59,3 +59,4 @@ SYM_CODE_START_NOALIGN(fred_entrypoint_kernel)
FRED_EXIT
ERETS
SYM_CODE_END(fred_entrypoint_kernel)
+EXPORT_SYMBOL(fred_entrypoint_kernel)
diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h
index b1a98fa38828..076bf8dee702 100644
--- a/arch/x86/include/asm/asm-prototypes.h
+++ b/arch/x86/include/asm/asm-prototypes.h
@@ -12,6 +12,7 @@
#include <asm/special_insns.h>
#include <asm/preempt.h>
#include <asm/asm.h>
+#include <asm/fred.h>
#include <asm/gsseg.h>

#ifndef CONFIG_X86_CMPXCHG64
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index f7caf3b2f3f7..d00b9cab6aa6 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -129,6 +129,7 @@ DECLARE_FRED_HANDLER(fred_exc_machine_check);
* The actual assembly entry and exit points
*/
extern __visible void fred_entrypoint_user(void);
+extern __visible void fred_entrypoint_kernel(void);

/*
* Initialization
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 612b3d6fec53..017b95624325 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -58,4 +58,6 @@ typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));

system_interrupt_handler get_system_interrupt_handler(unsigned int i);

+int external_interrupt(struct pt_regs *regs);
+
#endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 73471053ed02..0f1fcd53cb52 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1573,6 +1573,11 @@ int external_interrupt(struct pt_regs *regs)
return 0;
}

+#if IS_ENABLED(CONFIG_KVM_INTEL)
+/* For KVM VMX to handle IRQs in IRQ induced VM exits. */
+EXPORT_SYMBOL_GPL(external_interrupt);
+#endif
+
#endif /* CONFIG_X86_64 */

void __init install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr)
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 631fd7da2bc3..f2e1f8e61be9 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -2,12 +2,14 @@
#include <linux/linkage.h>
#include <asm/asm.h>
#include <asm/bitsperlong.h>
+#include <asm/fred.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/nospec-branch.h>
#include <asm/percpu.h>
#include <asm/segment.h>
#include "kvm-asm-offsets.h"
#include "run_flags.h"
+#include "../../entry/calling.h"

#define WORD_SIZE (BITS_PER_LONG / 8)

@@ -31,7 +33,7 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif

-.macro VMX_DO_EVENT_IRQOFF call_insn call_target
+.macro VMX_DO_EVENT_IRQOFF call_insn call_target fred=0 nmi=0
/*
* Unconditionally create a stack frame, getting the correct RSP on the
* stack (for x86-64) would take two instructions anyways, and RBP can
@@ -41,16 +43,55 @@
mov %_ASM_SP, %_ASM_BP

#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_FRED
+ /*
+ * It's not necessary to change current stack level for handling IRQ/NMI
+ * because the state of the kernel stack is well defined in this place
+ * in the code, and it is known not to be deep in a bunch of nested I/O
+ * layer handlers that eat up the stack.
+ *
+ * Before starting to push a FRED stack frame, FRED reserves a redzone
+ * (for CALL emulation) and aligns RSP to a 64-byte boundary.
+ */
+ sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
+ and $FRED_STACK_FRAME_RSP_MASK, %rsp
+
+ /*
+ * A FRED stack frame has extra 16 bytes of information pushed at the
+ * regular stack top comparing to an IDT stack frame.
+ */
+ push $0 /* Reserved by FRED, must be 0 */
+ push $0 /* FRED event data, 0 for NMI and external interrupts */
+#else
/*
* Align RSP to a 16-byte boundary (to emulate CPU behavior) before
* creating the synthetic interrupt stack frame for the IRQ/NMI.
*/
and $-16, %rsp
- push $__KERNEL_DS
+#endif
+
+ .if \fred
+ .if \nmi
+ mov $(2 << 32 | 2 << 48), %rax /* NMI event type and vector */
+ .else
+ mov %rdi, %rax
+ shl $32, %rax /* External interrupt vector */
+ .endif
+ add $__KERNEL_DS, %rax
+ bts $57, %rax /* Set 64-bit mode */
+ .else
+ mov $__KERNEL_DS, %rax
+ .endif
+ push %rax
+
push %rbp
#endif
pushf
- push $__KERNEL_CS
+ mov $__KERNEL_CS, %_ASM_AX
+ .if \fred && \nmi
+ bts $28, %_ASM_AX /* Set the NMI bit */
+ .endif
+ push %_ASM_AX
\call_insn \call_target

/*
@@ -299,8 +340,19 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)

SYM_FUNC_END(__vmx_vcpu_run)

+SYM_CODE_START(vmx_do_nmi_trampoline)
+#ifdef CONFIG_X86_FRED
+ ALTERNATIVE "jmp .Lno_errorcode_push", "", X86_FEATURE_FRED
+ push $0 /* FRED error code, 0 for NMI */
+ jmp fred_entrypoint_kernel
+#endif
+
+.Lno_errorcode_push:
+ jmp asm_exc_nmi_kvm_vmx
+SYM_CODE_END(vmx_do_nmi_trampoline)
+
SYM_FUNC_START(vmx_do_nmi_irqoff)
- VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
+ VMX_DO_EVENT_IRQOFF call vmx_do_nmi_trampoline fred=1 nmi=1
SYM_FUNC_END(vmx_do_nmi_irqoff)


@@ -357,6 +409,24 @@ SYM_FUNC_START(vmread_error_trampoline)
SYM_FUNC_END(vmread_error_trampoline)
#endif

+#ifdef CONFIG_X86_64
+SYM_CODE_START(vmx_do_interrupt_trampoline)
+ push $0 /* FRED error code, 0 for NMI and external interrupts */
+ PUSH_REGS
+
+ movq %rsp, %rdi /* %rdi -> pt_regs */
+ call external_interrupt
+
+ POP_REGS
+ addq $8,%rsp /* Drop FRED error code */
+ RET
+SYM_CODE_END(vmx_do_interrupt_trampoline)
+#endif
+
SYM_FUNC_START(vmx_do_interrupt_irqoff)
+#ifdef CONFIG_X86_64
+ VMX_DO_EVENT_IRQOFF call vmx_do_interrupt_trampoline fred=1
+#else
VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
+#endif
SYM_FUNC_END(vmx_do_interrupt_irqoff)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d2d6e1b6c788..d85bcfd191b7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6874,7 +6874,7 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
}

-void vmx_do_interrupt_irqoff(unsigned long entry);
+void vmx_do_interrupt_irqoff(unsigned long entry_or_vector);
void vmx_do_nmi_irqoff(void);

static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
@@ -6916,14 +6916,20 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
{
u32 intr_info = vmx_get_intr_info(vcpu);
unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
- gate_desc *desc = (gate_desc *)host_idt_base + vector;
+ unsigned long entry_or_vector;
+
+#ifdef CONFIG_X86_64
+ entry_or_vector = vector;
+#else
+ entry_or_vector = gate_offset((gate_desc *)host_idt_base + vector);
+#endif

if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
"unexpected VM-Exit interrupt info: 0x%x", intr_info))
return;

kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
- vmx_do_interrupt_irqoff(gate_offset(desc));
+ vmx_do_interrupt_irqoff(entry_or_vector);
kvm_after_interrupt(vcpu);

vcpu->arch.at_instruction_boundary = true;
--
2.34.1

2023-04-10 18:40:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 00/33] x86: enable FRED for x86-64

On 4/10/23 01:14, Xin Li wrote:
> This patch set enables FRED for x86-64.

I'm worried we're just in a patchbomb-once-a-week mode with FRED at this
point. There wasn't a single comment on v7 so, of course, here's v8 a
week later.

FRED is a rare CPU feature because it's universally wanted. Us software
folks have been begging for it for a looooooooong time. Is there anyone
out there that has any doubts that the kernel will support (this) FRED
eventually?

The code also looks pretty reasonable.

I do think it's missing some Documentation, and the cover letter is bit
sparse. It would be nice to see some high-level information about
things like, for instance, why there needs to be FRED refactoring for
NMI/#PF/#DB/#MC specifically, but not other exceptions.

There also aren't any new selftests. I faintly recall some tweak to the
selftests recently that was FRED-oriented, but I'd still expect all the
selftests that poke at the entry code to be perturbed by this a bit.

Basically, if anyone else has been procrastinating on reviewing this
set, now is probably the time to dig in. (I'll include myself in that
category, btw)

2023-04-10 19:01:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 00/33] x86: enable FRED for x86-64

Oh, and one other nit. We do have a specific maintainer for the entry code:

> X86 ENTRY CODE
> M: Andy Lutomirski <[email protected]>
> L: [email protected]
> S: Maintained
> T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/asm
> F: arch/x86/entry/

Please make sure to cc Andy on FRED going forward.

This is probably also a good cue to go and make sure you didn't miss any
other folks that need to see this series.

2023-04-10 19:16:27

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 00/33] x86: enable FRED for x86-64

> I do think it's missing some Documentation, and the cover letter is bit sparse. It
> would be nice to see some high-level information about things like, for instance,
> why there needs to be FRED refactoring for NMI/#PF/#DB/#MC specifically, but
> not other exceptions.

We do have some comments in the commit messages or around the code changes.

However a high level document in the Documentation/x86/ directory probably
works better, I can do that.

>
> There also aren't any new selftests. I faintly recall some tweak to the selftests
> recently that was FRED-oriented, but I'd still expect all the selftests that poke at
> the entry code to be perturbed by this a bit.

Because FRED code majorly replaces the IDT entry/dispatch/return code, and
makes few changes to the event handlers, our focus was more of to check if
all the event handlers are properly called and returned, which is very well
covered by the existing IDT selftests.

One area we need to add selftests to is the FRED event dispatch framework, to
make sure we cover all the possible dispatch paths.

> Basically, if anyone else has been procrastinating on reviewing this set, now is
> probably the time to dig in. (I'll include myself in that category, btw)

I really appreciate it!

Thanks!
Xin

2023-04-10 19:30:38

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 00/33] x86: enable FRED for x86-64

> Oh, and one other nit. We do have a specific maintainer for the entry code:
>
> > X86 ENTRY CODE
> > M: Andy Lutomirski <[email protected]>
> > L: [email protected]
> > S: Maintained
> > T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/asm
> > F: arch/x86/entry/
>
> Please make sure to cc Andy on FRED going forward.
>
> This is probably also a good cue to go and make sure you didn't miss any other
> folks that need to see this series.

My bad, and surely will add Andy.

2023-04-10 19:34:23

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 00/33] x86: enable FRED for x86-64

On Mon, Apr 10, 2023 at 07:14:21PM +0000, Li, Xin3 wrote:
> > Basically, if anyone else has been procrastinating on reviewing this set, now is
> > probably the time to dig in. (I'll include myself in that category, btw)
>
> I really appreciate it!

That doesn't mean that you should patch-bomb people once a week even
without any review happening.

Is FRED in any hardware incarnation to rush it?

If no, be patient, address the review comments once you have them and do
not spam once a week just because. As Dave said, this is wanted by all
and it will get reviewed eventually. But it is not something that needs
to go in now so you don't have to create unnecessary pressure.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-10 19:41:10

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 00/33] x86: enable FRED for x86-64

On 4/10/23 12:32, Borislav Petkov wrote:
> On Mon, Apr 10, 2023 at 07:14:21PM +0000, Li, Xin3 wrote:
>>> Basically, if anyone else has been procrastinating on reviewing this set, now is
>>> probably the time to dig in. (I'll include myself in that category, btw)
>> I really appreciate it!
> That doesn't mean that you should patch-bomb people once a week even
> without any review happening.
>
> Is FRED in any hardware incarnation to rush it?

Which reminds me... It is always appreciated when hardware vendors can
put a stake in the ground about where and how a feature will show up.
It's great that you said that it is SIMICS-only for now. That's a start.

But, if you could, for instance say things like (and I'm totally pulling
these out of my rear end):

FRED will be available only on server CPUs
or
FRED will be available across all Intel CPUs
or
FRED will start showing up in CPUs that are released
in roughly 2033

it would be nice to know that.

Here's what I said for protection keys, for example:

> https://lore.kernel.org/linux-mm/[email protected]/


2023-04-10 20:54:07

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 00/33] x86: enable FRED for x86-64

> > Is FRED in any hardware incarnation to rush it?
>
> Which reminds me... It is always appreciated when hardware vendors can put a
> stake in the ground about where and how a feature will show up.
> It's great that you said that it is SIMICS-only for now. That's a start.
>
> But, if you could, for instance say things like (and I'm totally pulling these out of
> my rear end):
>
> FRED will be available only on server CPUs or
> FRED will be available across all Intel CPUs or
> FRED will start showing up in CPUs that are released
> in roughly 2033
>
> it would be nice to know that.
>
> Here's what I said for protection keys, for example:
>
> > https://lore.kernel.org/linux-mm/[email protected]/

Good point, I will find out what I could provide in LKML and update later.

Thanks!
Xin

2023-04-10 22:01:50

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v8 33/33] KVM: x86/vmx: refactor VMX_DO_EVENT_IRQOFF to generate FRED stack frames

"KVM: VMX:" for the shortlog please.

On Mon, Apr 10, 2023, Xin Li wrote:
> -.macro VMX_DO_EVENT_IRQOFF call_insn call_target
> +.macro VMX_DO_EVENT_IRQOFF call_insn call_target fred=0 nmi=0
> /*
> * Unconditionally create a stack frame, getting the correct RSP on the
> * stack (for x86-64) would take two instructions anyways, and RBP can
> @@ -41,16 +43,55 @@
> mov %_ASM_SP, %_ASM_BP
>
> #ifdef CONFIG_X86_64
> +#ifdef CONFIG_X86_FRED
> + /*
> + * It's not necessary to change current stack level for handling IRQ/NMI
> + * because the state of the kernel stack is well defined in this place
> + * in the code, and it is known not to be deep in a bunch of nested I/O
> + * layer handlers that eat up the stack.
> + *
> + * Before starting to push a FRED stack frame, FRED reserves a redzone
> + * (for CALL emulation) and aligns RSP to a 64-byte boundary.
> + */
> + sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
> + and $FRED_STACK_FRAME_RSP_MASK, %rsp
> +
> + /*
> + * A FRED stack frame has extra 16 bytes of information pushed at the
> + * regular stack top comparing to an IDT stack frame.
> + */
> + push $0 /* Reserved by FRED, must be 0 */
> + push $0 /* FRED event data, 0 for NMI and external interrupts */
> +#else
> /*
> * Align RSP to a 16-byte boundary (to emulate CPU behavior) before
> * creating the synthetic interrupt stack frame for the IRQ/NMI.
> */
> and $-16, %rsp
> - push $__KERNEL_DS
> +#endif
> +
> + .if \fred
> + .if \nmi
> + mov $(2 << 32 | 2 << 48), %rax /* NMI event type and vector */
> + .else
> + mov %rdi, %rax
> + shl $32, %rax /* External interrupt vector */
> + .endif
> + add $__KERNEL_DS, %rax
> + bts $57, %rax /* Set 64-bit mode */
> + .else
> + mov $__KERNEL_DS, %rax
> + .endif
> + push %rax

This is painfully difficult to read, and the trampolines only add to that pain.
Using macros instead of magic numbers would alleviate a small amount of pain, but
but the #ifdefs and .if \fred/\nmi are the real culprits.

> static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
> @@ -6916,14 +6916,20 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> {
> u32 intr_info = vmx_get_intr_info(vcpu);
> unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
> - gate_desc *desc = (gate_desc *)host_idt_base + vector;
> + unsigned long entry_or_vector;
> +
> +#ifdef CONFIG_X86_64
> + entry_or_vector = vector;
> +#else
> + entry_or_vector = gate_offset((gate_desc *)host_idt_base + vector);
> +#endif

And then this is equally gross. Rather than funnel FRED+legacy into a single
function only to split them back out, just route FRED into its own asm subroutine.
The common bits are basically the creation/destruction of the stack frame and the
CALL itself, i.e. the truly interesting bits are what's different.

Pretty much all of the #ifdeffery goes away, the helpers just need #ifdefs to
play nice with CONFIG_X86_FRED=n. E.g. something like the below as a starting
point (it most definitely doesn't compile, and most definitely isn't 100% correct).

---
arch/x86/kvm/vmx/vmenter.S | 72 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 19 ++++++++--
2 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 631fd7da2bc3..a6929c78e038 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -2,12 +2,14 @@
#include <linux/linkage.h>
#include <asm/asm.h>
#include <asm/bitsperlong.h>
+#include <asm/fred.h>
#include <asm/kvm_vcpu_regs.h>
#include <asm/nospec-branch.h>
#include <asm/percpu.h>
#include <asm/segment.h>
#include "kvm-asm-offsets.h"
#include "run_flags.h"
+#include "../../entry/calling.h"

#define WORD_SIZE (BITS_PER_LONG / 8)

@@ -31,6 +33,62 @@
#define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
#endif

+#ifdef CONFIG_X86_FRED
+.macro VMX_DO_FRED_EVENT_IRQOFF call_target cs_val
+ /*
+ * Unconditionally create a stack frame, getting the correct RSP on the
+ * stack (for x86-64) would take two instructions anyways, and RBP can
+ * be used to restore RSP to make objtool happy (see below).
+ */
+ push %_ASM_BP
+ mov %_ASM_SP, %_ASM_BP
+
+ /*
+ * Don't check the FRED stack level, the call stack leading to this
+ * helper is effectively constant and shallow (relatively speaking).
+ *
+ * Emulate the FRED-defined redzone and stack alignment (128 bytes and
+ * 64 bytes respectively).
+ */
+ sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
+ and $FRED_STACK_FRAME_RSP_MASK, %rsp
+
+ /*
+ * A FRED stack frame has extra 16 bytes of information pushed at the
+ * regular stack top compared to an IDT stack frame.
+ */
+ push $0 /* Reserved by FRED, must be 0 */
+ push $0 /* FRED event data, 0 for NMI and external interrupts */
+ shl $32, %rax
+ orq $__KERNEL_DS | $FRED_64_BIT_MODE, %ax
+ push %rax /* Vector (from the "caller") and DS */
+
+ push %rbp
+ pushf
+ push \cs_val
+
+ push $0 /* FRED error code, 0 for NMI and external interrupts */
+ PUSH_REGS
+
+ /* Load @pt_regs */
+ movq %rsp, %_ASM_ARG1
+
+ call \call_target
+
+ POP_REGS
+
+ /*
+ * "Restore" RSP from RBP, even though IRET has already unwound RSP to
+ * the correct value. objtool doesn't know the callee will IRET and,
+ * without the explicit restore, thinks the stack is getting walloped.
+ * Using an unwind hint is problematic due to x86-64's dynamic alignment.
+ */
+ mov %_ASM_BP, %_ASM_SP
+ pop %_ASM_BP
+ RET
+.endm
+#endif
+
.macro VMX_DO_EVENT_IRQOFF call_insn call_target
/*
* Unconditionally create a stack frame, getting the correct RSP on the
@@ -299,6 +357,14 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)

SYM_FUNC_END(__vmx_vcpu_run)

+#ifdef CONFIG_X86_FRED
+SYM_FUNC_START(vmx_do_fred_nmi_irqoff)
+ push $FRED_NMI_ERROR_CODE
+ mov $NMI_VECTOR | $FRED_NMI_SOMETHING, %eax
+ VMX_DO_FRED_EVENT_IRQOFF call fred_entrypoint_kernel $FRED_NMI_CS_VAL
+SYM_FUNC_END(vmx_do_nmi_irqoff)
+#endif
+
SYM_FUNC_START(vmx_do_nmi_irqoff)
VMX_DO_EVENT_IRQOFF call asm_exc_nmi_kvm_vmx
SYM_FUNC_END(vmx_do_nmi_irqoff)
@@ -357,6 +423,12 @@ SYM_FUNC_START(vmread_error_trampoline)
SYM_FUNC_END(vmread_error_trampoline)
#endif

+#ifdef CONFIG_X86_FRED
+SYM_FUNC_START(vmx_do_fred_interrupt_irqoff)
+ mov %_ASM_ARG1, %rax
+ VMX_DO_FRED_EVENT_IRQOFF call external_interrupt
+#endif
+
SYM_FUNC_START(vmx_do_interrupt_irqoff)
VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1
SYM_FUNC_END(vmx_do_interrupt_irqoff)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 11080a649f60..42f50b0cc125 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6891,6 +6891,14 @@ static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
}

+#ifdef CONFIG_X86_FRED
+void vmx_do_fred_interrupt_irqoff(unsigned int vector);
+void vmx_do_fred_nmi_irqoff(unsigned int vector);
+#else
+#define vmx_do_fred_interrupt_irqoff(x) BUG();
+#define vmx_do_fred_nmi_irqoff(x) BUG();
+#endif
+
void vmx_do_interrupt_irqoff(unsigned long entry);
void vmx_do_nmi_irqoff(void);

@@ -6933,14 +6941,16 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
{
u32 intr_info = vmx_get_intr_info(vcpu);
unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
- gate_desc *desc = (gate_desc *)host_idt_base + vector;

if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
"unexpected VM-Exit interrupt info: 0x%x", intr_info))
return;

kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
- vmx_do_interrupt_irqoff(gate_offset(desc));
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ vmx_do_fred_interrupt_irqoff(vector);
+ else
+ vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
kvm_after_interrupt(vcpu);

vcpu->arch.at_instruction_boundary = true;
@@ -7226,7 +7236,10 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
is_nmi(vmx_get_intr_info(vcpu))) {
kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
- vmx_do_nmi_irqoff();
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ vmx_do_fred_nmi_irqoff();
+ else
+ vmx_do_nmi_irqoff();
kvm_after_interrupt(vcpu);
}


base-commit: 33d1a64081c98e390e064db18738428d6fb96f95
--

2023-04-11 04:18:52

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 00/33] x86: enable FRED for x86-64

> Is FRED in any hardware incarnation to rush it?

Current plan is to have FRED HW available in Intel developer Cloud in 1H'2024.

As the cover letter mentioned, for now, it's available through Intel Simics
https://www.intel.com/content/www/us/en/developer/articles/tool/simics-simulator.html
, and a developer has to install it on a dev machine.

FRED will also be available publicly through a Simics cloud service in Q2'23,
in which a Simics instance is created for development/QA use.

FRED is a baseline feature, and there are some HW features on top of it. Some of the
new features might be launched with the first FRED HW.

In addition, the KVM FRED patch set is on top of this patch set.

> If no, be patient, address the review comments once you have them and do not
> spam once a week just because. As Dave said, this is wanted by all and it will get
> reviewed eventually. But it is not something that needs to go in now so you don't
> have to create unnecessary pressure.

It wasn't my intension to patch-bomb the list. But I will pay attention
regarding this concern.

Thanks!
Xin

2023-04-11 05:17:22

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 33/33] KVM: x86/vmx: refactor VMX_DO_EVENT_IRQOFF to generate FRED stack frames

>
> kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> - vmx_do_interrupt_irqoff(gate_offset(desc));
> + if (cpu_feature_enabled(X86_FEATURE_FRED))
> + vmx_do_fred_interrupt_irqoff(vector);
> + else
> + vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base


external_interrupt() is always available on x86_64, even when CONFIG_X86_FRED
is not defined. I prefer to always call external_interrupt() on x86_64 for IRQ
handling, which avoids re-entering noinstr code. how do you think? Too
aggressive?

Thanks!
Xin


> +
> +vector));
> kvm_after_interrupt(vcpu);

2023-04-11 18:42:28

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v8 33/33] KVM: x86/vmx: refactor VMX_DO_EVENT_IRQOFF to generate FRED stack frames

On Tue, Apr 11, 2023, Xin3 Li wrote:
> >
> > kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
> > - vmx_do_interrupt_irqoff(gate_offset(desc));
> > + if (cpu_feature_enabled(X86_FEATURE_FRED))
> > + vmx_do_fred_interrupt_irqoff(vector);
> > + else
> > + vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base
>
>
> external_interrupt() is always available on x86_64, even when CONFIG_X86_FRED
> is not defined. I prefer to always call external_interrupt() on x86_64 for IRQ
> handling, which avoids re-entering noinstr code. how do you think? Too
> aggressive?

I think it's completely orthogonal to FRED enabling. If you or anyone else wants
to convert the non-FRED handling to external_interrupt(), then do so after FRED
lands, or at the very least in a separate patch after enabling FRED in KVM.

2023-04-11 22:58:52

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 33/33] KVM: x86/vmx: refactor VMX_DO_EVENT_IRQOFF to generate FRED stack frames


> > external_interrupt() is always available on x86_64, even when CONFIG_X86_FRED
> > is not defined. I prefer to always call external_interrupt() on x86_64 for IRQ
> > handling, which avoids re-entering noinstr code. how do you think? Too
> > aggressive?
>
> I think it's completely orthogonal to FRED enabling. If you or anyone else wants
> to convert the non-FRED handling to external_interrupt(), then do so after FRED
> lands, or at the very least in a separate patch after enabling FRED in KVM.

That sounds a reasonable plan.

2023-04-12 18:34:05

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 33/33] KVM: x86/vmx: refactor VMX_DO_EVENT_IRQOFF to generate FRED stack frames


> And then this is equally gross. Rather than funnel FRED+legacy into a single
> function only to split them back out, just route FRED into its own asm subroutine.
> The common bits are basically the creation/destruction of the stack frame and
> the CALL itself, i.e. the truly interesting bits are what's different.

I try to catch up with you but am still confused.

Because a FRED stack frame always contains an error code pushed after RIP,
the FRED entry code doesn't push any error code.

Thus I introduced a trampoline code, which is called to have the return
instruction address pushed first. Then the trampoline code pushes an error
code (0 for both IRQ and NMI) and jumps to fred_entrypoint_kernel() for NMI
handling or calls external_interrupt() for IRQ handling.

The return RIP is used to return from fred_entrypoint_kernel(), but not
external_interrupt().


> Pretty much all of the #ifdeffery goes away, the helpers just need #ifdefs to play
> nice with CONFIG_X86_FRED=n. E.g. something like the below as a starting point
> (it most definitely doesn't compile, and most definitely isn't 100% correct).
>
> ---
> arch/x86/kvm/vmx/vmenter.S | 72
> ++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/vmx.c | 19 ++++++++--
> 2 files changed, 88 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index
> 631fd7da2bc3..a6929c78e038 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -2,12 +2,14 @@
> #include <linux/linkage.h>
> #include <asm/asm.h>
> #include <asm/bitsperlong.h>
> +#include <asm/fred.h>
> #include <asm/kvm_vcpu_regs.h>
> #include <asm/nospec-branch.h>
> #include <asm/percpu.h>
> #include <asm/segment.h>
> #include "kvm-asm-offsets.h"
> #include "run_flags.h"
> +#include "../../entry/calling.h"
>
> #define WORD_SIZE (BITS_PER_LONG / 8)
>
> @@ -31,6 +33,62 @@
> #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
> #endif
>
> +#ifdef CONFIG_X86_FRED
> +.macro VMX_DO_FRED_EVENT_IRQOFF call_target cs_val
> + /*
> + * Unconditionally create a stack frame, getting the correct RSP on the
> + * stack (for x86-64) would take two instructions anyways, and RBP can
> + * be used to restore RSP to make objtool happy (see below).
> + */
> + push %_ASM_BP
> + mov %_ASM_SP, %_ASM_BP
> +
> + /*
> + * Don't check the FRED stack level, the call stack leading to this
> + * helper is effectively constant and shallow (relatively speaking).
> + *
> + * Emulate the FRED-defined redzone and stack alignment (128 bytes and
> + * 64 bytes respectively).
> + */
> + sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
> + and $FRED_STACK_FRAME_RSP_MASK, %rsp
> +
> + /*
> + * A FRED stack frame has extra 16 bytes of information pushed at the
> + * regular stack top compared to an IDT stack frame.
> + */
> + push $0 /* Reserved by FRED, must be 0 */
> + push $0 /* FRED event data, 0 for NMI and external interrupts */
> + shl $32, %rax
> + orq $__KERNEL_DS | $FRED_64_BIT_MODE, %ax
> + push %rax /* Vector (from the "caller") and DS */
> +
> + push %rbp
> + pushf
> + push \cs_val

We need to push the RIP of the next instruction here. Or are you suggesting
we don't need to care about it because it may not be used to return from the
callee?

As mentioned above, the return RIP is used when returning from NMI handling.

Or I totally missed a key idea to build a FRED stack frame?

Thanks!
Xin

> + push $0 /* FRED error code, 0 for NMI and external interrupts */
> + PUSH_REGS
> +
> + /* Load @pt_regs */
> + movq %rsp, %_ASM_ARG1
> +
> + call \call_target
> +
> + POP_REGS
> +
> + /*
> + * "Restore" RSP from RBP, even though IRET has already unwound RSP
> to
> + * the correct value. objtool doesn't know the callee will IRET and,
> + * without the explicit restore, thinks the stack is getting walloped.
> + * Using an unwind hint is problematic due to x86-64's dynamic
> alignment.
> + */
> + mov %_ASM_BP, %_ASM_SP
> + pop %_ASM_BP
> + RET
> +.endm
> +#endif
> +

2023-04-12 19:46:41

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v8 33/33] KVM: x86/vmx: refactor VMX_DO_EVENT_IRQOFF to generate FRED stack frames

On Wed, Apr 12, 2023, Xin3 Li wrote:
>
> > And then this is equally gross. Rather than funnel FRED+legacy into a single
> > function only to split them back out, just route FRED into its own asm subroutine.
> > The common bits are basically the creation/destruction of the stack frame and
> > the CALL itself, i.e. the truly interesting bits are what's different.
>
> I try to catch up with you but am still confused.
>
> Because a FRED stack frame always contains an error code pushed after RIP,
> the FRED entry code doesn't push any error code.
>
> Thus I introduced a trampoline code, which is called to have the return
> instruction address pushed first. Then the trampoline code pushes an error
> code (0 for both IRQ and NMI) and jumps to fred_entrypoint_kernel() for NMI
> handling or calls external_interrupt() for IRQ handling.
>
> The return RIP is used to return from fred_entrypoint_kernel(), but not
> external_interrupt().

...

> > + /*
> > + * A FRED stack frame has extra 16 bytes of information pushed at the
> > + * regular stack top compared to an IDT stack frame.
> > + */
> > + push $0 /* Reserved by FRED, must be 0 */
> > + push $0 /* FRED event data, 0 for NMI and external interrupts */
> > + shl $32, %rax
> > + orq $__KERNEL_DS | $FRED_64_BIT_MODE, %ax
> > + push %rax /* Vector (from the "caller") and DS */
> > +
> > + push %rbp
> > + pushf
> > + push \cs_val
>
> We need to push the RIP of the next instruction here. Or are you suggesting
> we don't need to care about it because it may not be used to return from the
> callee?

...

> > + push $0 /* FRED error code, 0 for NMI and external interrupts */
> > + PUSH_REGS
> > +
> > + /* Load @pt_regs */
> > + movq %rsp, %_ASM_ARG1
> > +
> > + call \call_target

The CALL here would push RIP, I missed/forgot the detail that the error code needs
to be pushed _after_ RIP, not before.

Unless CET complains, there's no need for a trampoline, just LEA+PUSH the return
RIP, PUSH the error code, and JMP to the handler. IMO, that isn't any weirder than
a trampoline, and it's a bit more obviously weird, e.g. the LEA+PUSH can have a nice
big comment.

2023-05-07 12:07:58

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

On Mon, Apr 10, 2023 at 01:14:06AM -0700, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <[email protected]>
>
> IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
> is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
> into common_interrupt() just before the spurious interrupt handling.

I'm missing here the "why":

I can go forward into the set and see that you're splitting the handling
based on vector types and there's event classification and the lowest
prio vector is not going to be hardwired to 0x20, yadda yadda...

but some of that should be in the text here so that it is clear why it
is being done.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-06-03 10:20:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v8 02/33] x86/fred: make unions for the cs and ss fields in struct pt_regs

On Mon, Apr 10, 2023 at 01:14:07AM -0700, Xin Li wrote:
> diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
> index f4db78b09c8f..2abb23e6c1e2 100644
> --- a/arch/x86/include/asm/ptrace.h
> +++ b/arch/x86/include/asm/ptrace.h
> @@ -82,12 +82,40 @@ struct pt_regs {
> * On hw interrupt, it's IRQ number:
> */
> unsigned long orig_ax;
> -/* Return frame for iretq */
> +/* Return frame for iretq/eretu/erets */
> unsigned long ip;
> - unsigned long cs;
> + union {
> + unsigned long csx; /* cs extended: CS + any fields above it */
> + struct __attribute__((__packed__)) {

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 2abb23e6c1e2..d850d3072263 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -86,7 +86,7 @@ struct pt_regs {
unsigned long ip;
union {
unsigned long csx; /* cs extended: CS + any fields above it */
- struct __attribute__((__packed__)) {
+ struct {
unsigned short cs; /* CS selector proper */
unsigned int current_stack_level: 2;
unsigned int __csx_resv1 : 6;
@@ -96,13 +96,13 @@ struct pt_regs {
unsigned int nmi : 1;
unsigned int __csx_resv3 : 3;
unsigned int __csx_resv4 : 32;
- };
+ } __packed;
};
unsigned long flags;
unsigned long sp;
union {
unsigned long ssx; /* ss extended: SS + any fields above it */
- struct __attribute__((__packed__)) {
+ struct {
unsigned short ss; /* SS selector proper */
unsigned int __ssx_resv1 : 16;
unsigned int vector : 8;
@@ -114,7 +114,7 @@ struct pt_regs {
unsigned int nested : 1;
unsigned int __ssx_resv4 : 1;
unsigned int instr_len : 4;
- };
+ } __packed;
};
/* top of stack page */
};

> + unsigned short cs; /* CS selector proper */

Also, please put all those comments above the members, like the rest of the
file does.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-06-03 19:54:21

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

> > IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
> > is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
> > into common_interrupt() just before the spurious interrupt handling.
>
> I'm missing here the "why":
>
> I can go forward into the set and see that you're splitting the handling based on
> vector types and there's event classification and the lowest prio vector is not
> going to be hardwired to 0x20, yadda yadda...
>
> but some of that should be in the text here so that it is clear why it is being done.

Per Dave's ask, I am adding details about the benefits that FRED introduces,
and then why we make these changes, which should make it clearer.

Thanks!
Xin

2023-06-03 21:01:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
> is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
> into common_interrupt() just before the spurious interrupt handling.

This is a complete NOOP on not FRED enabled systems as the IDT entry is
still separate. So this change makes no sense outside of the FRED
universe. Can we pretty please make this consistent?

Aside of that the change comes with zero justification. I can see why
this is done, i.e. to spare range checking in the FRED exception entry
code, but that brings up an interesting question:

IRQ_MOVE_CLEANUP_VECTOR is at vector 0x20 on purpose. 0x20 is the lowest
priority vector so that the following (mostly theoretical) situation
gets resolved:

sysvec_irq_move_cleanup()
if (is_pending_in_apic_IRR(vector_to_clean_up))
apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);

I.e. when for whatever reasons the to be cleaned up vector is still
pending in the local APIC IRR the function retriggers
IRQ_MOVE_CLEANUP_VECTOR and returns. As the pending to be cleaned up
vector has higher priority it gets handled _before_ the cleanup
vector. Otherwise this ends up in a live lock.

So the question is whether FRED is changing that priority scheme or not.

> @@ -248,6 +248,10 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
> desc = __this_cpu_read(vector_irq[vector]);
> if (likely(!IS_ERR_OR_NULL(desc))) {
> handle_irq(desc, regs);
> +#ifdef CONFIG_SMP
> + } else if (vector == IRQ_MOVE_CLEANUP_VECTOR) {
> + sysvec_irq_move_cleanup(regs);

This nests IDTENTRY:

common_interrupt()
irqentry_enter();
kvm_set_cpu_l1tf_flush_l1d();
run_irq_on_irqstack_cond(__common_interrupt, ....)
__common_interrupt()
sysvec_irq_move_cleanup()
irqentry_enter(); <- FAIL
kvm_set_cpu_l1tf_flush_l1d(); <- WHY again?
run_sysvec_on_irqstack_cond(__sysvec_irq_move_cleanup);
__sysvec_irq_move_cleanup();
irqentry_exit();

It does not matter whether the cleanup vector is a slow path or
not. Regular interrupts are not nesting, period. Exceptions nest, but
IRQ_MOVE_CLEANUP_VECTOR is not an exception and we don't make an
exception for it.

Stop this mindless hackery and clean it up so it is correct for all
variants.

Thanks,

tglx

2023-06-05 08:45:49

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 03/33] x86/traps: add a system interrupt table for system interrupt dispatch

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>
> /**
> * DECLARE_IDTENTRY_IRQ - Declare functions for device interrupt IDT entry
> - * points (common/spurious)
> + * points (common/spurious) and their corresponding
> + * software based dispatch handlers in the non-noinstr
> + * text section
> * @vector: Vector number (ignored for C)
> * @func: Function name of the entry point
> *
> - * Maps to DECLARE_IDTENTRY_ERRORCODE()
> + * Maps to DECLARE_IDTENTRY_ERRORCODE(), plus a dispatch function prototype
> */
> #define DECLARE_IDTENTRY_IRQ(vector, func) \
> - DECLARE_IDTENTRY_ERRORCODE(vector, func)
> + DECLARE_IDTENTRY_ERRORCODE(vector, func); \
> + void dispatch_##func(struct pt_regs *regs, unsigned long
> error_code)

Can these IDTENTRY changes please be separate from the actual table
implementation?

>
> +#ifdef CONFIG_X86_64
> +
> +#ifndef CONFIG_X86_LOCAL_APIC

Seriously? You _cannot_ disable local APIC on x8664 builds.

> +/*
> + * Used when local APIC is not configured to build into the kernel, but
> + * dispatch_table_spurious_interrupt() needs
> dispatch_spurious_interrupt().

What? If you there is something which is not used in a certain
configuration then just exclude it via #ifdef in the table or provide a

#define foo NULL

if you want to spare the #ifdeffery.

> + */
> +DEFINE_IDTENTRY_IRQ(spurious_interrupt)
> +{
> + pr_info("Spurious interrupt (vector 0x%x) on CPU#%d, should never happen.\n",
> + vector, smp_processor_id());
> +}

But mindlessly copying code which is even never compiled is a pretty
pointless exercise.

> +#endif
> +
> +static void dispatch_table_spurious_interrupt(struct pt_regs *regs)
> +{
> + dispatch_spurious_interrupt(regs, regs->vector);
> +}
> +
> +#define SYSV(x,y) [(x) - FIRST_SYSTEM_VECTOR] = y
> +
> +static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {
> + [0 ... NR_SYSTEM_VECTORS-1] = dispatch_table_spurious_interrupt,
> +#ifdef CONFIG_SMP
> + SYSV(RESCHEDULE_VECTOR, dispatch_table_sysvec_reschedule_ipi),
> + SYSV(CALL_FUNCTION_VECTOR, dispatch_table_sysvec_call_function),
> + SYSV(CALL_FUNCTION_SINGLE_VECTOR, dispatch_table_sysvec_call_function_single),
> + SYSV(REBOOT_VECTOR, dispatch_table_sysvec_reboot),
> +#endif
> +
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> + SYSV(THERMAL_APIC_VECTOR, dispatch_table_sysvec_thermal),
> +#endif
> +
> +#ifdef CONFIG_X86_MCE_THRESHOLD
> + SYSV(THRESHOLD_APIC_VECTOR, dispatch_table_sysvec_threshold),
> +#endif
> +
> +#ifdef CONFIG_X86_MCE_AMD
> + SYSV(DEFERRED_ERROR_VECTOR, dispatch_table_sysvec_deferred_error),
> +#endif
> +
> +#ifdef CONFIG_X86_LOCAL_APIC
> + SYSV(LOCAL_TIMER_VECTOR, dispatch_table_sysvec_apic_timer_interrupt),
> + SYSV(X86_PLATFORM_IPI_VECTOR, dispatch_table_sysvec_x86_platform_ipi),
> +# ifdef CONFIG_HAVE_KVM
> + SYSV(POSTED_INTR_VECTOR, dispatch_table_sysvec_kvm_posted_intr_ipi),
> + SYSV(POSTED_INTR_WAKEUP_VECTOR, dispatch_table_sysvec_kvm_posted_intr_wakeup_ipi),
> + SYSV(POSTED_INTR_NESTED_VECTOR, dispatch_table_sysvec_kvm_posted_intr_nested_ipi),
> +# endif
> +# ifdef CONFIG_IRQ_WORK
> + SYSV(IRQ_WORK_VECTOR, dispatch_table_sysvec_irq_work),
> +# endif
> + SYSV(SPURIOUS_APIC_VECTOR, dispatch_table_sysvec_spurious_apic_interrupt),

This is clearly in the #ifdef CONFIG_X86_LOCAL_APIC, so what is the
above hackery useful for?

> + SYSV(ERROR_APIC_VECTOR, dispatch_table_sysvec_error_interrupt),
> +#endif
> +};

Thanks,

tglx

2023-06-05 08:48:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 03/33] x86/traps: add a system interrupt table for system interrupt dispatch

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> #ifdef CONFIG_SMP
> -DECLARE_IDTENTRY(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
> +DECLARE_IDTENTRY_SYSVEC(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);

Please do not hide unrelated semantical changes in a big pile of
supposed to be mechanical changes. Split it out and provide a proper
explanation why this is correct and required.

> +/*
> + * How system interrupt handlers are called.
> + */
> +#define DECLARE_SYSTEM_INTERRUPT_HANDLER(f) \
> + void f (struct pt_regs *regs)
> +typedef DECLARE_SYSTEM_INTERRUPT_HANDLER((*system_interrupt_handler));

How is this related to the other changes and why is it required. Please
make this reviewable.

Thanks,

tglx

2023-06-05 08:55:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 03/33] x86/traps: add a system interrupt table for system interrupt dispatch

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> -#ifdef CONFIG_X86_LOCAL_APIC
> DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER, spurious_interrupt);
> -#endif

Breaks 32bit.

2023-06-05 09:06:25

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 04/33] x86/traps: add install_system_interrupt_handler()

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> Some kernel components install system interrupt handlers into the IDT,
> and we need to do the same for system_interrupt_handlers.

We need to the same? This sentence does not make any sense at all.

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html

> +void install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr);

Why is this void *?

>
> +void __init install_system_interrupt_handler(unsigned int n, const void *asm_addr, const void *addr)
> +{
> + BUG_ON(n < FIRST_SYSTEM_VECTOR);
> +
> +#ifdef CONFIG_X86_64
> + system_interrupt_handlers[n - FIRST_SYSTEM_VECTOR] = (system_interrupt_handler)addr;

Just that you can add a silly typecast here, right?

Oh well.

> - alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_xen_hvm_callback);
> + install_system_interrupt_handler(HYPERVISOR_CALLBACK_VECTOR,
> + asm_sysvec_xen_hvm_callback,
> + sysvec_xen_hvm_callback);

Can we please make this less convoluted?

#ifdef CONFIG_X86_64
static inline void sysvec_setup_fred(unsigned int vector, void (*func)(struct pt_regs*))
{
...
}
#else
static inline void sysvec_setup_fred(unsigned int vector, void (*func)(struct pt_regs*))
{
}
#endif

#define sysvec_install(vector, func) { \
sysvec_setup_fred(vector, func); \
alloc_intr_gate(vector, asm_##func); \
}

- alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_xen_hvm_callback);
+ sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_xen_hvm_callback);

No?

Thanks,

tglx

2023-06-05 12:00:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 05/33] x86/traps: add external_interrupt() to dispatch external interrupts

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> Add external_interrupt() to dispatch external interrupts to their handlers.
>
> If an external interrupt is a system interrupt, dipatch it through
> system_interrupt_handlers table, otherwise to
> dispatch_common_interrupt().

This naming convention sucks. external interrupts which can be system
interrupts. Come on.

> +/*
> + * External interrupt dispatch function.
> + *
> + * Until/unless dispatch_common_interrupt() can be taught to deal with the
> + * special system vectors, split the dispatch.

More gibberish. It's not useful to add your sloppy personal notes which
are not understandable for anyone else. That comment might eventually
make some sense right in the code where the condition is.

> + * Note: dispatch_common_interrupt() already deals with IRQ_MOVE_CLEANUP_VECTOR.
> + */
> +int external_interrupt(struct pt_regs *regs)
> +{
> + unsigned int vector = regs->vector;
> + unsigned int sysvec = vector - FIRST_SYSTEM_VECTOR;
> +
> + if (vector < FIRST_EXTERNAL_VECTOR) {

unlikely(...)

> + pr_err("invalid external interrupt vector %d\n", vector);
> + return -EINVAL;
> + }
> +
> + if (sysvec < NR_SYSTEM_VECTORS)
> + system_interrupt_handlers[sysvec](regs);
> + else
> + dispatch_common_interrupt(regs, vector);

How is this supposed to work once the vector space gets extended in a
later version of FRED?

Can we please think about this _now_ and not rewrite all of this two
years down the road?

Even if that's not fully specified yet, here is the obvious question:

What are we going to do with the system vectors. Are they going to
stay just in the middle of the expanded vector space?

That would be completely non-sensical as we'd end up with yet another
segmentation of the vector space.

So the obvious solution is to segment the vector space in the following
way:

0 - 31 Exceptions/traps - Cannot be moved
32 IRQ_MOVE_CLEANUP_VECTOR
33 - X System vectors including APIC_SPURIOUS
X+1 - MAX External interrupts

This spares the IRQ_MOVE_CLEANUP_VECTOR hackery. It requires to move the
ISA vectors, but that's not rocket science.

That makes the external interrupt vector space trivially expandable, no?

Coming back to that comment:

> + * Until/unless dispatch_common_interrupt() can be taught to deal with the
> + * special system vectors, split the dispatch.

That's simply wishful thinking. Because all what this would achieve is
moving the condition and table lookup into dispatch_common_interrupt().

What's the win aside of convoluted code?

There are only two options to deal with that:

1) Have the condition in external_interrupt()

if (unlikely(vector < LEGACY_MAX_EXCEPTION_VECTORS))
yell_and_bail();

if (vector < FIRST_DEVICE_VECTOR)
sysvec_handler[vector](regs)
else
fred_common_interrupt(regs, vector);

2) Make the sysvec_handler[] fred wrapper functions take the vector
argument and allocate the sysvec_handler[] array dynamically
sized based on the maximum number of supported vectors in a
particular [FRED]APIC implementation.

Not sure whether that's worth it as FRED allows up to 64k vectors
if I understand the reserved space layout in the spec correctly
and that would cost 512k memory just for the table.

Why has all of the above not been thought through and addressed _before_
posting this pile?

<Kernel development educational template #11>

1) Implementing a Prove of Concept for early validation is perfectly
fine.

2) Trying to sell that PoC upstream by polishing it a bit is doomed.

3) The proper tools for upstream development are brain and taste, not
duct tape and stapler.

</Kernel development educational template #11>

Thanks,

tglx

2023-06-05 12:04:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 09/33] x86/cpu: add X86_CR4_FRED macro

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:

> From: "H. Peter Anvin (Intel)" <[email protected]>
>
> Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit should be a

s/should/must/ no?

> +/*
> + * These bits should not change their value after CPU init is finished.
> + * The explicit cast to unsigned long suppresses a warning on i386 for
> + * x86-64 only feature bits >= 32.
> + */
> static const unsigned long cr4_pinned_mask =
> - X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
> - X86_CR4_FSGSBASE | X86_CR4_CET;
> + (unsigned long)

That type cast is required because:

+#define X86_CR4_FRED _BITULL(X86_CR4_FRED_BIT)

Bah. Fred is 64 bit only. So why defining this as 1ULL << 32
unconditionally and stripping the bit off on 32bit via the type cast?

#ifdef CONFIG_X86_64
#define X86_CR4_FRED _BITUL(X86_CR4_FRED_BIT)
#else
#define X86_CR4_FRED 0
#endif

would be too obvious, right?

> + (X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
> + X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED);

Thanks,

tglx

2023-06-05 12:30:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 02/33] x86/fred: make unions for the cs and ss fields in struct pt_regs

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> + union {
> + unsigned long csx; /* cs extended: CS + any fields above it */
> + struct __attribute__((__packed__)) {
> + unsigned short cs; /* CS selector proper */
> + unsigned int current_stack_level: 2;
> + unsigned int __csx_resv1 : 6;
> + unsigned int interrupt_shadowed : 1;
> + unsigned int software_initiated : 1;
> + unsigned int __csx_resv2 : 2;
> + unsigned int nmi : 1;
> + unsigned int __csx_resv3 : 3;
> + unsigned int __csx_resv4 : 32;
> + };
> + };
> unsigned long flags;
> unsigned long sp;
> - unsigned long ss;
> + union {
> + unsigned long ssx; /* ss extended: SS + any fields above it */
> + struct __attribute__((__packed__)) {
> + unsigned short ss; /* SS selector proper */
> + unsigned int __ssx_resv1 : 16;
> + unsigned int vector : 8;
> + unsigned int __ssx_resv2 : 8;
> + unsigned int type : 4;
> + unsigned int __ssx_resv3 : 4;
> + unsigned int enclv : 1;
> + unsigned int long_mode : 1;
> + unsigned int nested : 1;
> + unsigned int __ssx_resv4 : 1;
> + unsigned int instr_len : 4;
> + };
> + };

This does not match section

5.2.1 Saving Information on the Regular Stack?

of version 4 and later of the specification.

Thanks,

tglx

2023-06-05 12:30:43

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 21/33] x86/fred: FRED initialization code

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <[email protected]>
>
> The code to initialize FRED when it's available and _not_ disabled.
>
> cpu_init_fred_exceptions() is the core function to initialize FRED,
> which
> 1. Sets up FRED entrypoints for events happening in ring 0 and 3.
> 2. Sets up a default stack for event handling.
> 3. Sets up dedicated event stacks for DB/NMI/MC/DF, equivalent to
> the IDT IST stacks.
> 4. Forces 32-bit system calls to use "int $0x80" only.
> 5. Enables FRED and invalidtes IDT.
>
> When the FRED is used, cpu_init_exception_handling() initializes FRED
> through calling cpu_init_fred_exceptions(), otherwise it sets up TSS
> IST and loads IDT.
>
> As FRED uses the ring 3 FRED entrypoint for SYSCALL and SYSENTER,
> it skips setting up SYSCALL/SYSENTER related MSRs, e.g., MSR_LSTAR.

So how is this supposed to work? FRED is enabled in Kconfig, the feature
is detected and FRED is initialized _before_ the rest of the required
changes is in place.

Documentation/process/* is not just there because people have nothing
better to do than writing pointless documents.

Thanks,

tglx



2023-06-05 13:35:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 20/33] x86/fred: FRED entry/exit and dispatch code

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * arch/x86/entry/entry_fred.c

Please do not add these completely pointless file names. They are
useless _and_ never get updated when a file is moved.

> + *
> + * This contains the dispatch functions called from the entry point
> + * assembly.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/kdebug.h> /* oops_begin/end, ... */

Please remove this useless tail comment. We really do not have to list
which particular things are pulled in from which header file.

> +#include <linux/nospec.h>

New line between linux and asm includes please.

> +#include <asm/event-type.h>
> +#include <asm/fred.h>
> +#include <asm/idtentry.h>
> +#include <asm/syscall.h>
> +#include <asm/trapnr.h>
> +#include <asm/traps.h>
> +#include <asm/kdebug.h>
> +
> +/*
> + * Badness...

Really useful comment. Not.

> +
> +noinstr void fred_exc_double_fault(struct pt_regs *regs)

Has to be global because the only user is the table below, right?

> +{
> + exc_double_fault(regs, regs->orig_ax);
> +}

Also why is this here and not next to the double fault implementation?

> +/*
> + * Exception entry
> + */
> +static DEFINE_FRED_HANDLER(fred_exception)

Lacks noinstr as most of the functions here.

> +{
> + /*
> + * Exceptions that cannot happen on FRED h/w are set to fred_bad_event().
> + */
> + static const fred_handler exception_handlers[NUM_EXCEPTION_VECTORS] = {
> + [X86_TRAP_DE] = exc_divide_error,
> + [X86_TRAP_DB] = fred_exc_debug,
> + [X86_TRAP_NMI] = fred_bad_event, /* A separate event type, not handled here */

Please make this tabular aligned and get rid of these horrible tail
comments.

> + [X86_TRAP_BP] = exc_int3,
> + [X86_TRAP_OF] = exc_overflow,
> + [X86_TRAP_BR] = exc_bounds,
> + [X86_TRAP_UD] = exc_invalid_op,
> + [X86_TRAP_NM] = exc_device_not_available,
> + [X86_TRAP_DF] = fred_exc_double_fault,
> + [X86_TRAP_OLD_MF] = fred_bad_event, /* 387 only! */
> + [X86_TRAP_TS] = fred_exc_invalid_tss,
> + [X86_TRAP_NP] = fred_exc_segment_not_present,
> + [X86_TRAP_SS] = fred_exc_stack_segment,
> + [X86_TRAP_GP] = fred_exc_general_protection,
> + [X86_TRAP_PF] = fred_exc_page_fault,
> + [X86_TRAP_SPURIOUS] = fred_bad_event, /* Interrupts are their own event type */
> + [X86_TRAP_MF] = exc_coprocessor_error,
> + [X86_TRAP_AC] = fred_exc_alignment_check,
> + [X86_TRAP_MC] = fred_exc_machine_check,
> + [X86_TRAP_XF] = exc_simd_coprocessor_error,

> + [X86_TRAP_VE...NUM_EXCEPTION_VECTORS-1] = fred_bad_event

Can we please have something which makes it entirely clear that anything
from #VE on are exceptions which are installed during boot?

> + };
> + u8 vector = array_index_nospec((u8)regs->vector, NUM_EXCEPTION_VECTORS);

This only "works" when NUM_EXCEPTION_VECTORS is power of two. Also what
catches an out of bounds vector? I.e. vector 0x20 will end up as vector
0x0 due to array_index_nospec(). I know, FRED hardware is going to be
perfect...

> + exception_handlers[vector](regs);
> +}
> +
> +static __always_inline void fred_emulate_trap(struct pt_regs *regs)
> +{
> + regs->type = EVENT_TYPE_SWFAULT;

This type information is used where?

> + regs->orig_ax = 0;
> + fred_exception(regs);
> +}
> +
> +static __always_inline void fred_emulate_fault(struct pt_regs *regs)
> +{
> + regs->ip -= regs->instr_len;
> + fred_emulate_trap(regs);
> +}
> +
> +/*
> + * Emulate SYSENTER if applicable. This is not the preferred system
> + * call in 32-bit mode under FRED, rather int $0x80 is preferred and
> + * exported in the vdso. SYSCALL proper has a hard-coded early out in
> + * fred_entry_from_user().

So we have it nicely distributed all over the code....

> + */
> +static DEFINE_FRED_HANDLER(fred_syscall_slow)
> +{
> + if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
> + likely(regs->vector == FRED_SYSENTER)) {
> + /* Convert frame to a syscall frame */
> + regs->orig_ax = regs->ax;
> + regs->ax = -ENOSYS;
> + do_fast_syscall_32(regs);
> + } else {
> + regs->vector = X86_TRAP_UD;
> + fred_emulate_fault(regs);
> + }
> +}
> +
> +/*
> + * Some software exceptions can also be triggered as int instructions,
> + * for historical reasons. Implement those here. The performance-critical
> + * int $0x80 (32-bit system call) has a hard-coded early out.

This comment starts to annoy me. Can you put comments next to the code
where they actually make sense?

> + */
> +static DEFINE_FRED_HANDLER(fred_sw_interrupt_user)
> +{

i.e.

/*
* In compat mode INT $0x80 (32bit system call) is
* performance-critical. Handle it first.
*/

> + if (IS_ENABLED(CONFIG_IA32_EMULATION) &&
> + likely(regs->vector == IA32_SYSCALL_VECTOR)) {
> + /* Convert frame to a syscall frame */
> + regs->orig_ax = regs->ax;
> + regs->ax = -ENOSYS;
> + return do_int80_syscall_32(regs);

> + }
> +
> + switch (regs->vector) {
> + case X86_TRAP_BP:
> + case X86_TRAP_OF:
> + fred_emulate_trap(regs);
> + break;
> + default:
> + regs->vector = X86_TRAP_GP;
> + fred_emulate_fault(regs);
> + break;
> + }
> +}
> +
> +static DEFINE_FRED_HANDLER(fred_hw_interrupt)
> +{
> + irqentry_state_t state = irqentry_enter(regs);
> +
> + instrumentation_begin();
> + external_interrupt(regs);
> + instrumentation_end();
> + irqentry_exit(regs, state);
> +}
> +
> +__visible noinstr void fred_entry_from_user(struct pt_regs *regs)
> +{
> + static const fred_handler user_handlers[FRED_EVENT_TYPE_COUNT] =
> + {
> + [EVENT_TYPE_HWINT] = fred_hw_interrupt,
> + [EVENT_TYPE_RESERVED] = fred_bad_event,
> + [EVENT_TYPE_NMI] = fred_exc_nmi,
> + [EVENT_TYPE_SWINT] = fred_sw_interrupt_user,
> + [EVENT_TYPE_HWFAULT] = fred_exception,
> + [EVENT_TYPE_SWFAULT] = fred_exception,
> + [EVENT_TYPE_PRIVSW] = fred_exception,
> + [EVENT_TYPE_OTHER] = fred_syscall_slow
> + };
> +
> + /*
> + * FRED employs a two-level event dispatch mechanism, with
> + * the first-level on the type of an event and the second-level
> + * on its vector. Thus a dispatch typically induces 2 calls.
> + * We optimize it by using early outs for the most frequent
> + * events, and syscalls are the first. We may also need early
> + * outs for page faults.

I'm not really convinced that adding more special cases and conditionals
is a win. This should be a true two-level dispatch first for _all_ event
types and then in a separate step optimizations with proper performance
numbers and justifications. Premature optimization is the enemy of
correctness. Don't do it.

> + */
> + if (likely(regs->type == EVENT_TYPE_OTHER &&
> + regs->vector == FRED_SYSCALL)) {
> + /* Convert frame to a syscall frame */
> + regs->orig_ax = regs->ax;
> + regs->ax = -ENOSYS;
> + do_syscall_64(regs, regs->orig_ax);
> + } else {
> + /* Not a system call */
> + u8 type = array_index_nospec((u8)regs->type, FRED_EVENT_TYPE_COUNT);

What's the u8 buying here and in all the other places? This has the same
table issue as all other table handling in this file.

> + user_handlers[type](regs);
> + }
> +}

> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 2876ddae02bc..bd43866f9c3e 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -82,6 +82,7 @@ static __always_inline void __##func(struct pt_regs *regs)
> #define DECLARE_IDTENTRY_ERRORCODE(vector, func) \
> asmlinkage void asm_##func(void); \
> asmlinkage void xen_asm_##func(void); \
> + __visible void fred_##func(struct pt_regs *regs); \

Wants to be a separate change.

> __visible void func(struct pt_regs *regs, unsigned long error_code)
>
> /**
> @@ -106,6 +107,11 @@ __visible noinstr void func(struct pt_regs *regs, \
> irqentry_exit(regs, state); \
> } \
> \
> +__visible noinstr void fred_##func(struct pt_regs *regs) \
> +{ \
> + func (regs, regs->orig_ax); \

func() ....

Thanks,

tglx

2023-06-05 14:04:25

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 24/33] x86/fred: disallow the swapgs instruction when FRED is enabled

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <[email protected]>
>
> The FRED architecture establishes the full supervisor/user through:
> 1) FRED event delivery swaps the value of the GS base address and
> that of the IA32_KERNEL_GS_BASE MSR.
> 2) ERETU swaps the value of the GS base address and that of the
> IA32_KERNEL_GS_BASE MSR.
> Thus, the swapgs instruction is disallowed when FRED is enabled,
> otherwise it causes #UD.

Which does not explain why writing directly to the IA32_KERNEL_GS_BASE
MSR is doing the right thing.

Thanks,

tglx

2023-06-05 14:15:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 26/33] x86/fred: allow single-step trap and NMI when starting a new thread

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> Allow single-step trap and NMI when starting a new thread, thus once
> the new thread returns to ring3, single-step trap and NMI are both
> enabled immediately.
>
> High-order 48 bits above the lowest 16 bit CS are discarded by the
> legacy IRET instruction, thus can be set unconditionally, even when
> FRED is not enabled.

I assume this has been validated to be true on _all_ CPU incarnations of
_all_ x86 vendors.

If so, then please document it. If not, then go back to the drawing
board.

Thanks,

tglx


2023-06-05 14:18:41

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 29/33] x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> From: "H. Peter Anvin (Intel)" <[email protected]>
>
> FRED inherits the Intel VT-x enhancement of classified events with
> a two-level event dispatch logic. The first-level dispatch is on
> the event type, and the second-level is on the event vector. This
> also means that vectors in different event types are orthogonal,
> thus, vectors 0x10-0x1f become available as hardware interrupts.
>
> Enable interrupt vectors 0x10-0x1f on FRED systems (interrupt 0x80 is
> already enabled.) Most of these changes are about removing the
> assumption that the lowest-priority vector is hard-wired to 0x20.

I'm not really interested in this again premature optimization.

Can we please clarify how the final result of FRED vector layout will
look like?

I rather give up on reclaiming these 16 vectors than making _all_ system
vectors dynamically assignable to avoid an extra partitioning of the
vector space.

Thanks,

tglx

2023-06-05 14:20:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 30/33] x86/fred: allow dynamic stack frame size

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> A FRED stack frame could contain different amount of information for
> different event types, or perhaps even for different instances of the
> same event type. Thus we need to eliminate the need of any advance
> information of the stack frame size to allow dynamic stack frame size.
>
> Implement it through:
> 1) add a new field user_pt_regs to thread_info, and initialize it
> with a pointer to a virtual pt_regs structure at the top of a
> thread stack.
> 2) save a pointer to the user-space pt_regs structure created by
> fred_entrypoint_user() to user_pt_regs in fred_entry_from_user().
> 3) initialize the init_thread_info's user_pt_regs with a pointer to
> a virtual pt_regs structure at the top of init stack.
>
> This approach also works for IDT, thus we unify the code.

And thereby remove the useful comment and replace it with an
undocumented macro mess.

I'm simply refusing to review this. It's not my job to understand this
undocumented hackery.

Thanks,

tglx

2023-06-05 14:23:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 21/33] x86/fred: FRED initialization code

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>
> +/*
> + * The actual assembly entry and exit points
> + */
> +extern __visible void fred_entrypoint_user(void);

Why is this defined in this patch and not at the point where the
function is introduced?

> +/*
> + * Initialization
> + */
> +void cpu_init_fred_exceptions(void);
> +void fred_setup_apic(void);
> +
> #endif /* __ASSEMBLY__ */
>
> +#else
> +#define cpu_init_fred_exceptions() BUG()
> +#define fred_setup_apic() BUG()

static inline stubs please.

> @@ -2054,28 +2055,6 @@ static void wrmsrl_cstar(unsigned long val)
> /* May not be marked __init: used by software suspend */
> void syscall_init(void)
> {
> - wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
> - wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
> -
> -#ifdef CONFIG_IA32_EMULATION
> - wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
> - /*
> - * This only works on Intel CPUs.
> - * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
> - * This does not cause SYSENTER to jump to the wrong location, because
> - * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
> - */
> - wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
> - wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
> - (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
> - wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
> -#else
> - wrmsrl_cstar((unsigned long)ignore_sysret);
> - wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
> - wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
> - wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
> -#endif
> -
> /*
> * Flags to clear on syscall; clear as much as possible
> * to minimize user space-kernel interference.
> @@ -2086,6 +2065,41 @@ void syscall_init(void)
> X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
> X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
> X86_EFLAGS_AC|X86_EFLAGS_ID);
> +
> + /*
> + * The default user and kernel segments
> + */
> + wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
> +
> + if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> + /* Both sysexit and sysret cause #UD when FRED is enabled */
> + wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
> + wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
> + wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
> + } else {
> + wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
> +
> +#ifdef CONFIG_IA32_EMULATION
> + wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
> + /*
> + * This only works on Intel CPUs.
> + * On AMD CPUs these MSRs are 32-bit, CPU truncates
> + * MSR_IA32_SYSENTER_EIP.
> + * This does not cause SYSENTER to jump to the wrong
> + * location, because AMD doesn't allow SYSENTER in
> + * long mode (either 32- or 64-bit).
> + */
> + wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
> + wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
> + (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
> + wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
> +#else
> + wrmsrl_cstar((unsigned long)ignore_sysret);
> + wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
> + wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
> + wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
> +#endif
> + }
> }

Sigh. Can you please split this into

static void idt_syscall_init(void)
{
All the existing gunk
}

void syscall_init(void)
{
/* The default user and kernel segments */
wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);

idt_syscall_init();
}

in a first step and then in the next patch add the FRED muck?

> +/*
> + * Initialize FRED on this CPU. This cannot be __init as it is called
> + * during CPU hotplug.

Really no need to repeat this comment vs. __init all over the place.

> + */
> +void cpu_init_fred_exceptions(void)
> +{
> + wrmsrl(MSR_IA32_FRED_CONFIG,
> + FRED_CONFIG_REDZONE | /* Reserve for CALL emulation */

Please don't use tail comments. Nowhere.

> + FRED_CONFIG_INT_STKLVL(0) |
> + FRED_CONFIG_ENTRYPOINT(fred_entrypoint_user));
> +
> +/*
> + * Initialize system vectors from a FRED perspective, so
> + * lapic_assign_system_vectors() can do its job.
> + */
> +void __init fred_setup_apic(void)
> +{
> + int i;
> +
> + for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
> + set_bit(i, system_vectors);

> + /*
> + * Don't set the non assigned system vectors in the
> + * system_vectors bitmap. Otherwise they show up in
> + * /proc/interrupts.
> + */
> +#ifdef CONFIG_SMP
> + set_bit(IRQ_MOVE_CLEANUP_VECTOR, system_vectors);
> +#endif
> +
> + for (i = 0; i < NR_SYSTEM_VECTORS; i++) {
> + if (get_system_interrupt_handler(i) != NULL) {

This _cannot be NULL. The system vector table must be fully populated
with either the real handler or the spurious handler. Otherwise you need
a NULL pointer check in the dispatch path.

> + set_bit(i + FIRST_SYSTEM_VECTOR, system_vectors);
> + }
> + }


> +
> + /* The rest are fair game... */

Can you please refrain from adding useless comments. Commenting the
obvious is a distraction and not helpful in any way. Comment the things
which are not obvious in the first place.

> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -1537,6 +1537,14 @@ static system_interrupt_handler system_interrupt_handlers[NR_SYSTEM_VECTORS] = {
>
> #undef SYSV
>
> +system_interrupt_handler get_system_interrupt_handler(unsigned int i)
> +{
> + if (i >= NR_SYSTEM_VECTORS)
> + return NULL;

Seriously?

> + return system_interrupt_handlers[i];

Get rid of this completely confusing and useless function and look the
table up at the only call site. I'm all for defensive programming, but
this is hideous.

Thanks,

tglx

2023-06-05 14:24:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 26/33] x86/fred: allow single-step trap and NMI when starting a new thread

On Mon, Jun 05 2023 at 15:50, Thomas Gleixner wrote:

> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>> Allow single-step trap and NMI when starting a new thread, thus once
>> the new thread returns to ring3, single-step trap and NMI are both
>> enabled immediately.
>>
>> High-order 48 bits above the lowest 16 bit CS are discarded by the
>> legacy IRET instruction, thus can be set unconditionally, even when
>> FRED is not enabled.
>
> I assume this has been validated to be true on _all_ CPU incarnations of
> _all_ x86 vendors.

It's also ensured that VMMs do not get confused by this, right?

> If so, then please document it. If not, then go back to the drawing
> board.
>
> Thanks,
>
> tglx

2023-06-05 14:48:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 31/33] x86/fred: BUG() when ERETU with %rsp not equal to that when the ring 3 event was just delivered

On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> A FRED stack frame generated by a ring 3 event should never be messed up, and
> the first thing we must make sure is that at the time an ERETU instruction is
> executed, %rsp must have the same address as that when the user level event
> was just delivered.
>
> However we don't want to bother the normal code path of ERETU because it's on
> the hotest code path, a good choice is to do this check when ERETU
> faults.

Which might be not catching bugs where the wrong frame makes ERETU not
fault.

We have CONFIG_DEBUG_ENTRY for catching this at the proper place.

Thanks,

tglx

2023-06-05 17:22:42

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v8 29/33] x86/fred: allow FRED systems to use interrupt vectors 0x10-0x1f



On 6/5/23 07:06, Thomas Gleixner wrote:
> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>> From: "H. Peter Anvin (Intel)" <[email protected]>
>>
>> FRED inherits the Intel VT-x enhancement of classified events with
>> a two-level event dispatch logic. The first-level dispatch is on
>> the event type, and the second-level is on the event vector. This
>> also means that vectors in different event types are orthogonal,
>> thus, vectors 0x10-0x1f become available as hardware interrupts.
>>
>> Enable interrupt vectors 0x10-0x1f on FRED systems (interrupt 0x80 is
>> already enabled.) Most of these changes are about removing the
>> assumption that the lowest-priority vector is hard-wired to 0x20.
>
> I'm not really interested in this again premature optimization.
>
> Can we please clarify how the final result of FRED vector layout will
> look like?
>
> I rather give up on reclaiming these 16 vectors than making _all_ system
> vectors dynamically assignable to avoid an extra partitioning of the
> vector space.
>

So this patch was meant to go *after* the FRED patch set proper.

This patch does not change any of the system vectors except, by
necessity, IRQ_MOVE_CLEANUP_VECTOR, nor does it change the way they are
assigned to fixed vectors today.

It does make the lowest *non*-system vector a variable, but the system
vectors are at the end.

This is something that should be commented: the setting of bits in
system_vector is really misleading: those aren't "system vectors", they
are the vectors reserved for hardware use. On IDT, this means vectors
0-31 because they are used for exceptions, but on FRED interrupts and
exceptions are separate; however vectors 0-15 still need to be reserved
because the APIC doesn't support them.

We *could* change that, but that is completely independent of this.

-hpa

2023-06-05 17:22:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v8 09/33] x86/cpu: add X86_CR4_FRED macro

On 6/5/23 05:01, Thomas Gleixner wrote:
> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>
>> From: "H. Peter Anvin (Intel)" <[email protected]>
>>
>> Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit should be a
>
> s/should/must/ no?
>

Technically no bit "must" be set in the fixed bit variable, but it would
obviously be insane not to. But it makes it a "should", both in
dictionary and RFC 2119 definitions.

Incidentally, I strongly advice everyone to use the RFC 2119 definitions
of technical requirement terms when possible.

-hpa

2023-06-05 17:23:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v8 31/33] x86/fred: BUG() when ERETU with %rsp not equal to that when the ring 3 event was just delivered



On 6/5/23 07:15, Thomas Gleixner wrote:
> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>> A FRED stack frame generated by a ring 3 event should never be messed up, and
>> the first thing we must make sure is that at the time an ERETU instruction is
>> executed, %rsp must have the same address as that when the user level event
>> was just delivered.
>>
>> However we don't want to bother the normal code path of ERETU because it's on
>> the hotest code path, a good choice is to do this check when ERETU
>> faults.
>
> Which might be not catching bugs where the wrong frame makes ERETU not
> fault.
>
> We have CONFIG_DEBUG_ENTRY for catching this at the proper place.
>

This is true, but this BUG() is a cheap test on a slow path, and thus
can be included in production code.

-hpa


2023-06-05 17:25:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

On 6/5/23 10:07, Thomas Gleixner wrote:
> On Sat, Jun 03 2023 at 22:51, Thomas Gleixner wrote:
>> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>>> IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
>>> is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
>>> into common_interrupt() just before the spurious interrupt handling.
>>
>> This is a complete NOOP on not FRED enabled systems as the IDT entry is
>> still separate. So this change makes no sense outside of the FRED
>> universe. Can we pretty please make this consistent?
>
> The right thing to make this consistent is to get rid of this vector
> completely.
>
> There is zero reason for this to be an IPI. This can be delegated to a
> worker or whatever delayed mechanism.
>

As we discussed offline, I agree that this is a better solution (and
should be a separate changeset before the FRED one.)

-hpa

2023-06-05 17:31:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

On Sat, Jun 03 2023 at 22:51, Thomas Gleixner wrote:
> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>> IRQ_MOVE_CLEANUP_VECTOR is the only one of the system IRQ vectors that
>> is *below* FIRST_SYSTEM_VECTOR. It is a slow path, so just push it
>> into common_interrupt() just before the spurious interrupt handling.
>
> This is a complete NOOP on not FRED enabled systems as the IDT entry is
> still separate. So this change makes no sense outside of the FRED
> universe. Can we pretty please make this consistent?

The right thing to make this consistent is to get rid of this vector
completely.

There is zero reason for this to be an IPI. This can be delegated to a
worker or whatever delayed mechanism.

Thanks,

tglx

2023-06-05 17:31:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 31/33] x86/fred: BUG() when ERETU with %rsp not equal to that when the ring 3 event was just delivered

On Mon, Jun 05 2023 at 09:42, H. Peter Anvin wrote:
> On 6/5/23 07:15, Thomas Gleixner wrote:
>> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>>> A FRED stack frame generated by a ring 3 event should never be messed up, and
>>> the first thing we must make sure is that at the time an ERETU instruction is
>>> executed, %rsp must have the same address as that when the user level event
>>> was just delivered.
>>>
>>> However we don't want to bother the normal code path of ERETU because it's on
>>> the hotest code path, a good choice is to do this check when ERETU
>>> faults.
>>
>> Which might be not catching bugs where the wrong frame makes ERETU not
>> fault.
>>
>> We have CONFIG_DEBUG_ENTRY for catching this at the proper place.
>>
>
> This is true, but this BUG() is a cheap test on a slow path, and thus
> can be included in production code.

No objection.

2023-06-05 17:41:47

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v8 02/33] x86/fred: make unions for the cs and ss fields in struct pt_regs

>
> This does not match section
>
> 5.2.1 Saving Information on the Regular Stack?
>
> of version 4 and later of the specification.
>

Yes, this is correct. This patchset was posted in April, and once we got
some very late requests for changes I asked Xin to hold off any further
iterations until the spec had restabilized.

-hpa

2023-06-05 17:43:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v8 00/33] x86: enable FRED for x86-64

So I feel obliged to throw in some defending of Xin here, mostly because
in some of these cases I'm the perkeleen vittupƤƤ[1] and not Xin.

The other thing is that this patchset is from April and predates the
recent FRED changes (which were unanticipated at the time); I explicitly
asked Xin to hold off making updates until the spec had restabilized.

-hpa

[1] https://lkml.org/lkml/2013/7/13/132

2023-06-05 17:47:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v8 09/33] x86/cpu: add X86_CR4_FRED macro

On 6/5/23 05:01, Thomas Gleixner wrote:
> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
>
>> From: "H. Peter Anvin (Intel)" <[email protected]>
>>
>> Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit should be a
>
> s/should/must/ no?
>
>> +/*
>> + * These bits should not change their value after CPU init is finished.
>> + * The explicit cast to unsigned long suppresses a warning on i386 for
>> + * x86-64 only feature bits >= 32.
>> + */
>> static const unsigned long cr4_pinned_mask =
>> - X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
>> - X86_CR4_FSGSBASE | X86_CR4_CET;
>> + (unsigned long)
>
> That type cast is required because:
>
> +#define X86_CR4_FRED _BITULL(X86_CR4_FRED_BIT)
>
> Bah. Fred is 64 bit only. So why defining this as 1ULL << 32
> unconditionally and stripping the bit off on 32bit via the type cast?
>
> #ifdef CONFIG_X86_64
> #define X86_CR4_FRED _BITUL(X86_CR4_FRED_BIT)
> #else
> #define X86_CR4_FRED 0
> #endif
>
> would be too obvious, right?
>

It also adds an #ifdef mess to avoid a simple typecast. Is that the
right tradeoff?

I'm not saying it is or it isn't, it is an open question.

-hpa



2023-06-05 17:53:36

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH v8 00/33] x86: enable FRED for x86-64

On Mon, Apr 10 2023 at 19:16, Li, Xin3 wrote:
>> Oh, and one other nit. We do have a specific maintainer for the entry code:
>>
>> > X86 ENTRY CODE
>> > M: Andy Lutomirski <[email protected]>
>> > L: [email protected]
>> > S: Maintained
>> > T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/asm
>> > F: arch/x86/entry/
>>
>> Please make sure to cc Andy on FRED going forward.
>>
>> This is probably also a good cue to go and make sure you didn't miss any other
>> folks that need to see this series.
>
> My bad, and surely will add Andy.

He's on CC indirectly via the [email protected] mail exploder.

2023-06-05 18:00:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 00/33] x86: enable FRED for x86-64

On Mon, Jun 05 2023 at 10:22, H. Peter Anvin wrote:
> So I feel obliged to throw in some defending of Xin here, mostly because
> in some of these cases I'm the perkeleen vittupƤƤ[1] and not Xin.

That's clear from the authorship and the style of the patches and not
all was criticism was directed at Xin obviously. He's just the messenger
in that case.

Thanks,

tglx

2023-06-05 18:01:02

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 02/33] x86/fred: make unions for the cs and ss fields in struct pt_regs

On Mon, Jun 05 2023 at 10:12, H. Peter Anvin wrote:
>>
>> This does not match section
>>
>> 5.2.1 Saving Information on the Regular Stack?
>>
>> of version 4 and later of the specification.
>>
>
> Yes, this is correct. This patchset was posted in April, and once we got
> some very late requests for changes I asked Xin to hold off any further
> iterations until the spec had restabilized.

I assumed that, but I stumbled over it when trying to review :)

2023-06-05 18:02:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 05/33] x86/traps: add external_interrupt() to dispatch external interrupts

On Mon, Jun 05 2023 at 13:56, Thomas Gleixner wrote:
> On Mon, Apr 10 2023 at 01:14, Xin Li wrote:
> How is this supposed to work once the vector space gets extended in a
> later version of FRED?
>
> Can we please think about this _now_ and not rewrite all of this two
> years down the road?
>
> Even if that's not fully specified yet, here is the obvious question:
>
> What are we going to do with the system vectors. Are they going to
> stay just in the middle of the expanded vector space?
>
> That would be completely non-sensical as we'd end up with yet another
> segmentation of the vector space.
>
> So the obvious solution is to segment the vector space in the following
> way:
>
> 0 - 31 Exceptions/traps - Cannot be moved
> 32 IRQ_MOVE_CLEANUP_VECTOR
> 33 - X System vectors including APIC_SPURIOUS
> X+1 - MAX External interrupts
>
> This spares the IRQ_MOVE_CLEANUP_VECTOR hackery. It requires to move the
> ISA vectors, but that's not rocket science.

Which we just discussed completely away. :)

> That makes the external interrupt vector space trivially expandable, no?

So there is a theoretical problem here that device interrupts could
starve system vectors due to the priority scheme. Needs some thought.

That whole APIC priority muck is pretty useless as long as CR8 writes
are slower than sti/cli. When I tested that last (Broadwell) they were
significantly slower.

Also it's unclear how that expansion vector space is handled
vs. priorities.

Ideally event delivery would be FIFO because that's the only guarantee
for preventing starvation without having to configure priorities (which
is mostly a wrong guess anyway).

Thanks,

tglx

2023-06-06 06:18:26

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 04/33] x86/traps: add install_system_interrupt_handler()

> #ifdef CONFIG_X86_64
> static inline void sysvec_setup_fred(unsigned int vector, void (*func)(struct
> pt_regs*)) {
> ...
> }
> #else
> static inline void sysvec_setup_fred(unsigned int vector, void (*func)(struct
> pt_regs*)) { } #endif
>
> #define sysvec_install(vector, func) { \
> sysvec_setup_fred(vector, func); \
> alloc_intr_gate(vector, asm_##func); \
> }
>
> - alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR,
> asm_sysvec_xen_hvm_callback);
> + sysvec_install(HYPERVISOR_CALLBACK_VECTOR,
> sysvec_xen_hvm_callback);

This is a better way, and I will do so in the next iteration.

Thanks!
Xin

2023-06-06 06:31:21

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 30/33] x86/fred: allow dynamic stack frame size

> > A FRED stack frame could contain different amount of information for
> > different event types, or perhaps even for different instances of the
> > same event type. Thus we need to eliminate the need of any advance
> > information of the stack frame size to allow dynamic stack frame size.
> >
> > Implement it through:
> > 1) add a new field user_pt_regs to thread_info, and initialize it
> > with a pointer to a virtual pt_regs structure at the top of a
> > thread stack.
> > 2) save a pointer to the user-space pt_regs structure created by
> > fred_entrypoint_user() to user_pt_regs in fred_entry_from_user().
> > 3) initialize the init_thread_info's user_pt_regs with a pointer to
> > a virtual pt_regs structure at the top of init stack.
> >
> > This approach also works for IDT, thus we unify the code.
>
> And thereby remove the useful comment and replace it with an undocumented
> macro mess.
>
> I'm simply refusing to review this. It's not my job to understand this
> undocumented hackery.
>

I believe it's a nice idea to allow dynamic stack frame size, at least for
FRED. It's totally my bad that I didn't make it meet the minimum standards,
I will rewrite the commit message and add better comments.

After a second thought, I probably should only apply the change to FRED for
2 reasons, the change seems problematic with ESPFIX (which FRED doesn't need),
and such corner cases are hard to test (self-tests needed?).

Thanks!
Xin


2023-06-06 08:57:07

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 03/33] x86/traps: add a system interrupt table for system interrupt dispatch

> > +#ifdef CONFIG_X86_64
> > +
> > +#ifndef CONFIG_X86_LOCAL_APIC
>
> Seriously? You _cannot_ disable local APIC on x8664 builds.

I didn't see this is explicit from Kconfig, which caused the mess...

Thanks!
Xin

2023-06-06 14:00:32

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH v8 30/33] x86/fred: allow dynamic stack frame size

On Tue, Jun 06 2023 at 06:18, Xin3 Li wrote:
>> > A FRED stack frame could contain different amount of information for
>> > This approach also works for IDT, thus we unify the code.
>>
>> And thereby remove the useful comment and replace it with an undocumented
>> macro mess.
>>
>> I'm simply refusing to review this. It's not my job to understand this
>> undocumented hackery.
>>
>
> I believe it's a nice idea to allow dynamic stack frame size, at least for
> FRED.

Believe belongs in the realm of religion. What we need here are proper
facts, explanations and justifications. Nice ideas are not helpful when
they are not having a value.

> It's totally my bad that I didn't make it meet the minimum standards,
> I will rewrite the commit message and add better comments.
>
> After a second thought, I probably should only apply the change to FRED for
> 2 reasons, the change seems problematic with ESPFIX (which FRED
> doesn't need),

Indeed. Making this FRED only is going to need even more justification.

> and such corner cases are hard to test (self-tests needed?)

There is a test. It's not that hard to find:

# git grep -li ESPFIX tools/testing/selftests/
tools/testing/selftests/x86/sigreturn.c

Thanks,

tglx

2023-06-06 20:15:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

On Mon, Jun 05 2023 at 10:09, H. Peter Anvin wrote:
> On 6/5/23 10:07, Thomas Gleixner wrote:
>> There is zero reason for this to be an IPI. This can be delegated to a
>> worker or whatever delayed mechanism.
>>
> As we discussed offline, I agree that this is a better solution (and
> should be a separate changeset before the FRED one.)

The untested below should do the trick. Wants to be split in several
patches, but you get the idea.

Thanks,

tglx
---
Subject: x86/vector: Get rid of IRQ_MOVE_CLEANUP_VECTOR
From: Thomas Gleixner <[email protected]>

No point to waste a vector for cleaning up the leftovers of a moved
interrupt. Aside of that this must be the lowest priority of all vectors
which makes FRED systems utilizing vectors 0x10-0x1f more complicated
than necessary.

Schedule a timer instead.

Not-Yet-Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/hw_irq.h | 4 -
arch/x86/include/asm/idtentry.h | 1
arch/x86/include/asm/irq_vectors.h | 7 ---
arch/x86/kernel/apic/vector.c | 83 ++++++++++++++++++++++++++----------
arch/x86/kernel/idt.c | 1
arch/x86/platform/uv/uv_irq.c | 2
drivers/iommu/amd/iommu.c | 2
drivers/iommu/hyperv-iommu.c | 4 -
drivers/iommu/intel/irq_remapping.c | 2
9 files changed, 68 insertions(+), 38 deletions(-)

--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -97,10 +97,10 @@ extern struct irq_cfg *irqd_cfg(struct i
extern void lock_vector_lock(void);
extern void unlock_vector_lock(void);
#ifdef CONFIG_SMP
-extern void send_cleanup_vector(struct irq_cfg *);
+extern void vector_schedule_cleanup(struct irq_cfg *);
extern void irq_complete_move(struct irq_cfg *cfg);
#else
-static inline void send_cleanup_vector(struct irq_cfg *c) { }
+static inline void vector_schedule_cleanup(struct irq_cfg *c) { }
static inline void irq_complete_move(struct irq_cfg *c) { }
#endif

--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -648,7 +648,6 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI

#ifdef CONFIG_SMP
DECLARE_IDTENTRY(RESCHEDULE_VECTOR, sysvec_reschedule_ipi);
-DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR, sysvec_irq_move_cleanup);
DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR, sysvec_reboot);
DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR, sysvec_call_function_single);
DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR, sysvec_call_function);
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -35,13 +35,6 @@
*/
#define FIRST_EXTERNAL_VECTOR 0x20

-/*
- * Reserve the lowest usable vector (and hence lowest priority) 0x20 for
- * triggering cleanup after irq migration. 0x21-0x2f will still be used
- * for device interrupts.
- */
-#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR
-
#define IA32_SYSCALL_VECTOR 0x80

/*
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -44,7 +44,18 @@ static cpumask_var_t vector_searchmask;
static struct irq_chip lapic_controller;
static struct irq_matrix *vector_matrix;
#ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
+
+static void vector_cleanup_callback(struct timer_list *tmr);
+
+struct vector_cleanup {
+ struct hlist_head head;
+ struct timer_list timer;
+};
+
+static DEFINE_PER_CPU(struct vector_cleanup, vector_cleanup) = {
+ .head = HLIST_HEAD_INIT,
+ .timer = __TIMER_INITIALIZER(vector_cleanup_callback, TIMER_PINNED),
+};
#endif

void lock_vector_lock(void)
@@ -843,8 +854,12 @@ void lapic_online(void)

void lapic_offline(void)
{
+ struct vector_cleanup *cl = this_cpu_ptr(&vector_cleanup);
+
lock_vector_lock();
irq_matrix_offline(vector_matrix);
+ WARN_ON_ONCE(try_to_del_timer_sync(&cl->timer) < 0);
+ WARN_ON_ONCE(!hlist_empty(&cl->head));
unlock_vector_lock();
}

@@ -934,62 +949,86 @@ static void free_moved_vector(struct api
apicd->move_in_progress = 0;
}

-DEFINE_IDTENTRY_SYSVEC(sysvec_irq_move_cleanup)
+static void vector_cleanup_callback(struct timer_list *tmr)
{
- struct hlist_head *clhead = this_cpu_ptr(&cleanup_list);
+ struct vector_cleanup *cl = container_of(tmr, typeof(*cl), timer);
struct apic_chip_data *apicd;
struct hlist_node *tmp;
+ bool rearm = false;

- ack_APIC_irq();
/* Prevent vectors vanishing under us */
- raw_spin_lock(&vector_lock);
+ raw_spin_lock_irq(&vector_lock);

- hlist_for_each_entry_safe(apicd, tmp, clhead, clist) {
+ hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
unsigned int irr, vector = apicd->prev_vector;

/*
* Paranoia: Check if the vector that needs to be cleaned
- * up is registered at the APICs IRR. If so, then this is
- * not the best time to clean it up. Clean it up in the
- * next attempt by sending another IRQ_MOVE_CLEANUP_VECTOR
- * to this CPU. IRQ_MOVE_CLEANUP_VECTOR is the lowest
- * priority external vector, so on return from this
- * interrupt the device interrupt will happen first.
+ * up is registered at the APICs IRR. That's clearly a
+ * hardware issue if the vector arrived on the old target
+ * _after_ interrupts were disabled above. Keep @apicd
+ * on the list and schedule the timer again to give the CPU
+ * a chance to handle the pending interrupt.
*/
irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
if (irr & (1U << (vector % 32))) {
- apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
+ pr_warn_once("Moved interrupt pending in old target APIC %u\n", apicd->irq);
+ rearm = true;
continue;
}
free_moved_vector(apicd);
}

- raw_spin_unlock(&vector_lock);
+ /*
+ * Must happen under vector_lock to make the timer_pending() check
+ * in __vector_schedule_cleanup() race free against the rearm here.
+ */
+ if (rearm)
+ mod_timer(tmr, jiffies + 1);
+
+ raw_spin_unlock_irq(&vector_lock);
}

-static void __send_cleanup_vector(struct apic_chip_data *apicd)
+static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
{
- unsigned int cpu;
+ unsigned int cpu = apicd->prev_cpu;

raw_spin_lock(&vector_lock);
apicd->move_in_progress = 0;
- cpu = apicd->prev_cpu;
if (cpu_online(cpu)) {
- hlist_add_head(&apicd->clist, per_cpu_ptr(&cleanup_list, cpu));
- apic->send_IPI(cpu, IRQ_MOVE_CLEANUP_VECTOR);
+ struct vector_cleanup *cl = per_cpu_ptr(&vector_cleanup, cpu);
+
+ /*
+ * The lockless timer_pending() check is safe here. If it
+ * returns true, then the callback will observe this new
+ * apic data in the hlist as everything is serialized by
+ * vector lock.
+ *
+ * If it returns false then the timer is either not armed
+ * or the other CPU executes the callback, which again
+ * would be blocked on vector lock. Rearming it in the
+ * latter case makes it fire for nothing.
+ *
+ * This is also safe against the callback rearming the timer
+ * because that's serialized via vector lock too.
+ */
+ if (!timer_pending(&cl->timer)) {
+ cl->timer.expires = jiffies + 1;
+ add_timer_on(&cl->timer, cpu);
+ }
} else {
apicd->prev_vector = 0;
}
raw_spin_unlock(&vector_lock);
}

-void send_cleanup_vector(struct irq_cfg *cfg)
+void vector_schedule_cleanup(struct irq_cfg *cfg)
{
struct apic_chip_data *apicd;

apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
if (apicd->move_in_progress)
- __send_cleanup_vector(apicd);
+ __vector_schedule_cleanup(apicd);
}

void irq_complete_move(struct irq_cfg *cfg)
@@ -1007,7 +1046,7 @@ void irq_complete_move(struct irq_cfg *c
* on the same CPU.
*/
if (apicd->cpu == smp_processor_id())
- __send_cleanup_vector(apicd);
+ __vector_schedule_cleanup(apicd);
}

/*
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -131,7 +131,6 @@ static const __initconst struct idt_data
INTG(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
INTG(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
INTG(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
- INTG(IRQ_MOVE_CLEANUP_VECTOR, asm_sysvec_irq_move_cleanup),
INTG(REBOOT_VECTOR, asm_sysvec_reboot),
#endif

--- a/arch/x86/platform/uv/uv_irq.c
+++ b/arch/x86/platform/uv/uv_irq.c
@@ -58,7 +58,7 @@ uv_set_irq_affinity(struct irq_data *dat
ret = parent->chip->irq_set_affinity(parent, mask, force);
if (ret >= 0) {
uv_program_mmr(cfg, data->chip_data);
- send_cleanup_vector(cfg);
+ vector_schedule_cleanup(cfg);
}

return ret;
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3639,7 +3639,7 @@ static int amd_ir_set_affinity(struct ir
* at the new destination. So, time to cleanup the previous
* vector allocation.
*/
- send_cleanup_vector(cfg);
+ vector_schedule_cleanup(cfg);

return IRQ_SET_MASK_OK_DONE;
}
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -51,7 +51,7 @@ static int hyperv_ir_set_affinity(struct
if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
return ret;

- send_cleanup_vector(cfg);
+ vector_schedule_cleanup(cfg);

return 0;
}
@@ -257,7 +257,7 @@ static int hyperv_root_ir_set_affinity(s
if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
return ret;

- send_cleanup_vector(cfg);
+ vector_schedule_cleanup(cfg);

return 0;
}
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1180,7 +1180,7 @@ intel_ir_set_affinity(struct irq_data *d
* at the new destination. So, time to cleanup the previous
* vector allocation.
*/
- send_cleanup_vector(cfg);
+ vector_schedule_cleanup(cfg);

return IRQ_SET_MASK_OK_DONE;
}

2023-06-06 23:22:26

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

> The untested below should do the trick. Wants to be split in several patches, but
> you get the idea.

I will continue the work from what you posted. Thanks a lot!

Xin




> ---
> Subject: x86/vector: Get rid of IRQ_MOVE_CLEANUP_VECTOR
> From: Thomas Gleixner <[email protected]>
>
> No point to waste a vector for cleaning up the leftovers of a moved interrupt.
> Aside of that this must be the lowest priority of all vectors which makes FRED
> systems utilizing vectors 0x10-0x1f more complicated than necessary.
>
> Schedule a timer instead.
>
> Not-Yet-Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> arch/x86/include/asm/hw_irq.h | 4 -
> arch/x86/include/asm/idtentry.h | 1
> arch/x86/include/asm/irq_vectors.h | 7 ---
> arch/x86/kernel/apic/vector.c | 83 ++++++++++++++++++++++++++----------
> arch/x86/kernel/idt.c | 1
> arch/x86/platform/uv/uv_irq.c | 2
> drivers/iommu/amd/iommu.c | 2
> drivers/iommu/hyperv-iommu.c | 4 -
> drivers/iommu/intel/irq_remapping.c | 2
> 9 files changed, 68 insertions(+), 38 deletions(-)
>
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -97,10 +97,10 @@ extern struct irq_cfg *irqd_cfg(struct i extern void
> lock_vector_lock(void); extern void unlock_vector_lock(void); #ifdef
> CONFIG_SMP -extern void send_cleanup_vector(struct irq_cfg *);
> +extern void vector_schedule_cleanup(struct irq_cfg *);
> extern void irq_complete_move(struct irq_cfg *cfg); #else -static inline void
> send_cleanup_vector(struct irq_cfg *c) { }
> +static inline void vector_schedule_cleanup(struct irq_cfg *c) { }
> static inline void irq_complete_move(struct irq_cfg *c) { } #endif
>
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -648,7 +648,6 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI
>
> #ifdef CONFIG_SMP
> DECLARE_IDTENTRY(RESCHEDULE_VECTOR,
> sysvec_reschedule_ipi);
> -DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR,
> sysvec_irq_move_cleanup);
> DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,
> sysvec_reboot);
> DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,
> sysvec_call_function_single);
> DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,
> sysvec_call_function);
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -35,13 +35,6 @@
> */
> #define FIRST_EXTERNAL_VECTOR 0x20
>
> -/*
> - * Reserve the lowest usable vector (and hence lowest priority) 0x20 for
> - * triggering cleanup after irq migration. 0x21-0x2f will still be used
> - * for device interrupts.
> - */
> -#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR
> -
> #define IA32_SYSCALL_VECTOR 0x80
>
> /*
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -44,7 +44,18 @@ static cpumask_var_t vector_searchmask; static struct
> irq_chip lapic_controller; static struct irq_matrix *vector_matrix; #ifdef
> CONFIG_SMP -static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
> +
> +static void vector_cleanup_callback(struct timer_list *tmr);
> +
> +struct vector_cleanup {
> + struct hlist_head head;
> + struct timer_list timer;
> +};
> +
> +static DEFINE_PER_CPU(struct vector_cleanup, vector_cleanup) = {
> + .head = HLIST_HEAD_INIT,
> + .timer = __TIMER_INITIALIZER(vector_cleanup_callback,
> TIMER_PINNED),
> +};
> #endif
>
> void lock_vector_lock(void)
> @@ -843,8 +854,12 @@ void lapic_online(void)
>
> void lapic_offline(void)
> {
> + struct vector_cleanup *cl = this_cpu_ptr(&vector_cleanup);
> +
> lock_vector_lock();
> irq_matrix_offline(vector_matrix);
> + WARN_ON_ONCE(try_to_del_timer_sync(&cl->timer) < 0);
> + WARN_ON_ONCE(!hlist_empty(&cl->head));
> unlock_vector_lock();
> }
>
> @@ -934,62 +949,86 @@ static void free_moved_vector(struct api
> apicd->move_in_progress = 0;
> }
>
> -DEFINE_IDTENTRY_SYSVEC(sysvec_irq_move_cleanup)
> +static void vector_cleanup_callback(struct timer_list *tmr)
> {
> - struct hlist_head *clhead = this_cpu_ptr(&cleanup_list);
> + struct vector_cleanup *cl = container_of(tmr, typeof(*cl), timer);
> struct apic_chip_data *apicd;
> struct hlist_node *tmp;
> + bool rearm = false;
>
> - ack_APIC_irq();
> /* Prevent vectors vanishing under us */
> - raw_spin_lock(&vector_lock);
> + raw_spin_lock_irq(&vector_lock);
>
> - hlist_for_each_entry_safe(apicd, tmp, clhead, clist) {
> + hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
> unsigned int irr, vector = apicd->prev_vector;
>
> /*
> * Paranoia: Check if the vector that needs to be cleaned
> - * up is registered at the APICs IRR. If so, then this is
> - * not the best time to clean it up. Clean it up in the
> - * next attempt by sending another
> IRQ_MOVE_CLEANUP_VECTOR
> - * to this CPU. IRQ_MOVE_CLEANUP_VECTOR is the lowest
> - * priority external vector, so on return from this
> - * interrupt the device interrupt will happen first.
> + * up is registered at the APICs IRR. That's clearly a
> + * hardware issue if the vector arrived on the old target
> + * _after_ interrupts were disabled above. Keep @apicd
> + * on the list and schedule the timer again to give the CPU
> + * a chance to handle the pending interrupt.
> */
> irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
> if (irr & (1U << (vector % 32))) {
> - apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
> + pr_warn_once("Moved interrupt pending in old target
> APIC %u\n", apicd->irq);
> + rearm = true;
> continue;
> }
> free_moved_vector(apicd);
> }
>
> - raw_spin_unlock(&vector_lock);
> + /*
> + * Must happen under vector_lock to make the timer_pending() check
> + * in __vector_schedule_cleanup() race free against the rearm here.
> + */
> + if (rearm)
> + mod_timer(tmr, jiffies + 1);
> +
> + raw_spin_unlock_irq(&vector_lock);
> }
>
> -static void __send_cleanup_vector(struct apic_chip_data *apicd)
> +static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
> {
> - unsigned int cpu;
> + unsigned int cpu = apicd->prev_cpu;
>
> raw_spin_lock(&vector_lock);
> apicd->move_in_progress = 0;
> - cpu = apicd->prev_cpu;
> if (cpu_online(cpu)) {
> - hlist_add_head(&apicd->clist, per_cpu_ptr(&cleanup_list, cpu));
> - apic->send_IPI(cpu, IRQ_MOVE_CLEANUP_VECTOR);
> + struct vector_cleanup *cl = per_cpu_ptr(&vector_cleanup, cpu);
> +
> + /*
> + * The lockless timer_pending() check is safe here. If it
> + * returns true, then the callback will observe this new
> + * apic data in the hlist as everything is serialized by
> + * vector lock.
> + *
> + * If it returns false then the timer is either not armed
> + * or the other CPU executes the callback, which again
> + * would be blocked on vector lock. Rearming it in the
> + * latter case makes it fire for nothing.
> + *
> + * This is also safe against the callback rearming the timer
> + * because that's serialized via vector lock too.
> + */
> + if (!timer_pending(&cl->timer)) {
> + cl->timer.expires = jiffies + 1;
> + add_timer_on(&cl->timer, cpu);
> + }
> } else {
> apicd->prev_vector = 0;
> }
> raw_spin_unlock(&vector_lock);
> }
>
> -void send_cleanup_vector(struct irq_cfg *cfg)
> +void vector_schedule_cleanup(struct irq_cfg *cfg)
> {
> struct apic_chip_data *apicd;
>
> apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
> if (apicd->move_in_progress)
> - __send_cleanup_vector(apicd);
> + __vector_schedule_cleanup(apicd);
> }
>
> void irq_complete_move(struct irq_cfg *cfg) @@ -1007,7 +1046,7 @@ void
> irq_complete_move(struct irq_cfg *c
> * on the same CPU.
> */
> if (apicd->cpu == smp_processor_id())
> - __send_cleanup_vector(apicd);
> + __vector_schedule_cleanup(apicd);
> }
>
> /*
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -131,7 +131,6 @@ static const __initconst struct idt_data
> INTG(RESCHEDULE_VECTOR,
> asm_sysvec_reschedule_ipi),
> INTG(CALL_FUNCTION_VECTOR,
> asm_sysvec_call_function),
> INTG(CALL_FUNCTION_SINGLE_VECTOR,
> asm_sysvec_call_function_single),
> - INTG(IRQ_MOVE_CLEANUP_VECTOR,
> asm_sysvec_irq_move_cleanup),
> INTG(REBOOT_VECTOR, asm_sysvec_reboot),
> #endif
>
> --- a/arch/x86/platform/uv/uv_irq.c
> +++ b/arch/x86/platform/uv/uv_irq.c
> @@ -58,7 +58,7 @@ uv_set_irq_affinity(struct irq_data *dat
> ret = parent->chip->irq_set_affinity(parent, mask, force);
> if (ret >= 0) {
> uv_program_mmr(cfg, data->chip_data);
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
> }
>
> return ret;
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3639,7 +3639,7 @@ static int amd_ir_set_affinity(struct ir
> * at the new destination. So, time to cleanup the previous
> * vector allocation.
> */
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
>
> return IRQ_SET_MASK_OK_DONE;
> }
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -51,7 +51,7 @@ static int hyperv_ir_set_affinity(struct
> if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
> return ret;
>
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
>
> return 0;
> }
> @@ -257,7 +257,7 @@ static int hyperv_root_ir_set_affinity(s
> if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
> return ret;
>
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
>
> return 0;
> }
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -1180,7 +1180,7 @@ intel_ir_set_affinity(struct irq_data *d
> * at the new destination. So, time to cleanup the previous
> * vector allocation.
> */
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
>
> return IRQ_SET_MASK_OK_DONE;
> }

2023-06-07 00:03:35

by H. Peter Anvin

[permalink] [raw]
Subject: RE: [PATCH v8 30/33] x86/fred: allow dynamic stack frame size

On June 6, 2023 6:27:25 AM PDT, Thomas Gleixner <[email protected]> wrote:
>On Tue, Jun 06 2023 at 06:18, Xin3 Li wrote:
>>> > A FRED stack frame could contain different amount of information for
>>> > This approach also works for IDT, thus we unify the code.
>>>
>>> And thereby remove the useful comment and replace it with an undocumented
>>> macro mess.
>>>
>>> I'm simply refusing to review this. It's not my job to understand this
>>> undocumented hackery.
>>>
>>
>> I believe it's a nice idea to allow dynamic stack frame size, at least for
>> FRED.
>
>Believe belongs in the realm of religion. What we need here are proper
>facts, explanations and justifications. Nice ideas are not helpful when
>they are not having a value.
>
>> It's totally my bad that I didn't make it meet the minimum standards,
>> I will rewrite the commit message and add better comments.
>>
>> After a second thought, I probably should only apply the change to FRED for
>> 2 reasons, the change seems problematic with ESPFIX (which FRED
>> doesn't need),
>
>Indeed. Making this FRED only is going to need even more justification.
>
>> and such corner cases are hard to test (self-tests needed?)
>
>There is a test. It's not that hard to find:
>
># git grep -li ESPFIX tools/testing/selftests/
>tools/testing/selftests/x86/sigreturn.c
>
>Thanks,
>
> tglx

For what it is worth, I am working on a FRED forward compatibly document at the moment.

2023-06-19 08:09:49

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

> The untested below should do the trick. Wants to be split in several
> patches, but you get the idea.

Hi Thomas,

How do you want to split the patch?

To me it's better to keep the changes in one patch, thus the differences
are more obvious.

We need a second patch to do vector cleanup in lapic_offline() in case the
vector cleanup timer has not expired. But I don't see a good way to split
your draft patch (I only have a minor fix on it).

Thanks!
Xin

>
> Thanks,
>
> tglx
> ---
> Subject: x86/vector: Get rid of IRQ_MOVE_CLEANUP_VECTOR
> From: Thomas Gleixner <[email protected]>
>
> No point to waste a vector for cleaning up the leftovers of a moved
> interrupt. Aside of that this must be the lowest priority of all vectors
> which makes FRED systems utilizing vectors 0x10-0x1f more complicated
> than necessary.
>
> Schedule a timer instead.
>
> Not-Yet-Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> arch/x86/include/asm/hw_irq.h | 4 -
> arch/x86/include/asm/idtentry.h | 1
> arch/x86/include/asm/irq_vectors.h | 7 ---
> arch/x86/kernel/apic/vector.c | 83 ++++++++++++++++++++++++++----------
> arch/x86/kernel/idt.c | 1
> arch/x86/platform/uv/uv_irq.c | 2
> drivers/iommu/amd/iommu.c | 2
> drivers/iommu/hyperv-iommu.c | 4 -
> drivers/iommu/intel/irq_remapping.c | 2
> 9 files changed, 68 insertions(+), 38 deletions(-)
>
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -97,10 +97,10 @@ extern struct irq_cfg *irqd_cfg(struct i
> extern void lock_vector_lock(void);
> extern void unlock_vector_lock(void);
> #ifdef CONFIG_SMP
> -extern void send_cleanup_vector(struct irq_cfg *);
> +extern void vector_schedule_cleanup(struct irq_cfg *);
> extern void irq_complete_move(struct irq_cfg *cfg);
> #else
> -static inline void send_cleanup_vector(struct irq_cfg *c) { }
> +static inline void vector_schedule_cleanup(struct irq_cfg *c) { }
> static inline void irq_complete_move(struct irq_cfg *c) { }
> #endif
>
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -648,7 +648,6 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI
>
> #ifdef CONFIG_SMP
> DECLARE_IDTENTRY(RESCHEDULE_VECTOR,
> sysvec_reschedule_ipi);
> -DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR,
> sysvec_irq_move_cleanup);
> DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR, sysvec_reboot);
> DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,
> sysvec_call_function_single);
> DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,
> sysvec_call_function);
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -35,13 +35,6 @@
> */
> #define FIRST_EXTERNAL_VECTOR 0x20
>
> -/*
> - * Reserve the lowest usable vector (and hence lowest priority) 0x20 for
> - * triggering cleanup after irq migration. 0x21-0x2f will still be used
> - * for device interrupts.
> - */
> -#define IRQ_MOVE_CLEANUP_VECTOR FIRST_EXTERNAL_VECTOR
> -
> #define IA32_SYSCALL_VECTOR 0x80
>
> /*
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -44,7 +44,18 @@ static cpumask_var_t vector_searchmask;
> static struct irq_chip lapic_controller;
> static struct irq_matrix *vector_matrix;
> #ifdef CONFIG_SMP
> -static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
> +
> +static void vector_cleanup_callback(struct timer_list *tmr);
> +
> +struct vector_cleanup {
> + struct hlist_head head;
> + struct timer_list timer;
> +};
> +
> +static DEFINE_PER_CPU(struct vector_cleanup, vector_cleanup) = {
> + .head = HLIST_HEAD_INIT,
> + .timer = __TIMER_INITIALIZER(vector_cleanup_callback, TIMER_PINNED),
> +};
> #endif
>
> void lock_vector_lock(void)
> @@ -843,8 +854,12 @@ void lapic_online(void)
>
> void lapic_offline(void)
> {
> + struct vector_cleanup *cl = this_cpu_ptr(&vector_cleanup);
> +
> lock_vector_lock();
> irq_matrix_offline(vector_matrix);
> + WARN_ON_ONCE(try_to_del_timer_sync(&cl->timer) < 0);
> + WARN_ON_ONCE(!hlist_empty(&cl->head));
> unlock_vector_lock();
> }
>
> @@ -934,62 +949,86 @@ static void free_moved_vector(struct api
> apicd->move_in_progress = 0;
> }
>
> -DEFINE_IDTENTRY_SYSVEC(sysvec_irq_move_cleanup)
> +static void vector_cleanup_callback(struct timer_list *tmr)
> {
> - struct hlist_head *clhead = this_cpu_ptr(&cleanup_list);
> + struct vector_cleanup *cl = container_of(tmr, typeof(*cl), timer);
> struct apic_chip_data *apicd;
> struct hlist_node *tmp;
> + bool rearm = false;
>
> - ack_APIC_irq();
> /* Prevent vectors vanishing under us */
> - raw_spin_lock(&vector_lock);
> + raw_spin_lock_irq(&vector_lock);
>
> - hlist_for_each_entry_safe(apicd, tmp, clhead, clist) {
> + hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
> unsigned int irr, vector = apicd->prev_vector;
>
> /*
> * Paranoia: Check if the vector that needs to be cleaned
> - * up is registered at the APICs IRR. If so, then this is
> - * not the best time to clean it up. Clean it up in the
> - * next attempt by sending another IRQ_MOVE_CLEANUP_VECTOR
> - * to this CPU. IRQ_MOVE_CLEANUP_VECTOR is the lowest
> - * priority external vector, so on return from this
> - * interrupt the device interrupt will happen first.
> + * up is registered at the APICs IRR. That's clearly a
> + * hardware issue if the vector arrived on the old target
> + * _after_ interrupts were disabled above. Keep @apicd
> + * on the list and schedule the timer again to give the CPU
> + * a chance to handle the pending interrupt.
> */
> irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
> if (irr & (1U << (vector % 32))) {
> - apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
> + pr_warn_once("Moved interrupt pending in old target APIC
> %u\n", apicd->irq);
> + rearm = true;
> continue;
> }
> free_moved_vector(apicd);
> }
>
> - raw_spin_unlock(&vector_lock);
> + /*
> + * Must happen under vector_lock to make the timer_pending() check
> + * in __vector_schedule_cleanup() race free against the rearm here.
> + */
> + if (rearm)
> + mod_timer(tmr, jiffies + 1);
> +
> + raw_spin_unlock_irq(&vector_lock);
> }
>
> -static void __send_cleanup_vector(struct apic_chip_data *apicd)
> +static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
> {
> - unsigned int cpu;
> + unsigned int cpu = apicd->prev_cpu;
>
> raw_spin_lock(&vector_lock);
> apicd->move_in_progress = 0;
> - cpu = apicd->prev_cpu;
> if (cpu_online(cpu)) {
> - hlist_add_head(&apicd->clist, per_cpu_ptr(&cleanup_list, cpu));
> - apic->send_IPI(cpu, IRQ_MOVE_CLEANUP_VECTOR);
> + struct vector_cleanup *cl = per_cpu_ptr(&vector_cleanup, cpu);
> +
> + /*
> + * The lockless timer_pending() check is safe here. If it
> + * returns true, then the callback will observe this new
> + * apic data in the hlist as everything is serialized by
> + * vector lock.
> + *
> + * If it returns false then the timer is either not armed
> + * or the other CPU executes the callback, which again
> + * would be blocked on vector lock. Rearming it in the
> + * latter case makes it fire for nothing.
> + *
> + * This is also safe against the callback rearming the timer
> + * because that's serialized via vector lock too.
> + */
> + if (!timer_pending(&cl->timer)) {
> + cl->timer.expires = jiffies + 1;
> + add_timer_on(&cl->timer, cpu);
> + }
> } else {
> apicd->prev_vector = 0;
> }
> raw_spin_unlock(&vector_lock);
> }
>
> -void send_cleanup_vector(struct irq_cfg *cfg)
> +void vector_schedule_cleanup(struct irq_cfg *cfg)
> {
> struct apic_chip_data *apicd;
>
> apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
> if (apicd->move_in_progress)
> - __send_cleanup_vector(apicd);
> + __vector_schedule_cleanup(apicd);
> }
>
> void irq_complete_move(struct irq_cfg *cfg)
> @@ -1007,7 +1046,7 @@ void irq_complete_move(struct irq_cfg *c
> * on the same CPU.
> */
> if (apicd->cpu == smp_processor_id())
> - __send_cleanup_vector(apicd);
> + __vector_schedule_cleanup(apicd);
> }
>
> /*
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -131,7 +131,6 @@ static const __initconst struct idt_data
> INTG(RESCHEDULE_VECTOR,
> asm_sysvec_reschedule_ipi),
> INTG(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
> INTG(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
> - INTG(IRQ_MOVE_CLEANUP_VECTOR,
> asm_sysvec_irq_move_cleanup),
> INTG(REBOOT_VECTOR, asm_sysvec_reboot),
> #endif
>
> --- a/arch/x86/platform/uv/uv_irq.c
> +++ b/arch/x86/platform/uv/uv_irq.c
> @@ -58,7 +58,7 @@ uv_set_irq_affinity(struct irq_data *dat
> ret = parent->chip->irq_set_affinity(parent, mask, force);
> if (ret >= 0) {
> uv_program_mmr(cfg, data->chip_data);
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
> }
>
> return ret;
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3639,7 +3639,7 @@ static int amd_ir_set_affinity(struct ir
> * at the new destination. So, time to cleanup the previous
> * vector allocation.
> */
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
>
> return IRQ_SET_MASK_OK_DONE;
> }
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -51,7 +51,7 @@ static int hyperv_ir_set_affinity(struct
> if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
> return ret;
>
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
>
> return 0;
> }
> @@ -257,7 +257,7 @@ static int hyperv_root_ir_set_affinity(s
> if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
> return ret;
>
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
>
> return 0;
> }
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -1180,7 +1180,7 @@ intel_ir_set_affinity(struct irq_data *d
> * at the new destination. So, time to cleanup the previous
> * vector allocation.
> */
> - send_cleanup_vector(cfg);
> + vector_schedule_cleanup(cfg);
>
> return IRQ_SET_MASK_OK_DONE;
> }

2023-06-19 14:39:01

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

On Mon, Jun 19 2023 at 08:00, Li, Xin3 wrote:
> To me it's better to keep the changes in one patch, thus the differences
> are more obvious.

The rename to vector_schedule_cleanup() can be obviously done first.

> We need a second patch to do vector cleanup in lapic_offline() in case the
> vector cleanup timer has not expired.

Right. I was lazy and just put a WARN_ON() there under the assumption
that you will figure it out.

But a second patch?

We don't switch things over into a broken state first and then fix it up
afterwards.

Thanks,

tglx


2023-06-19 18:55:57

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

> > To me it's better to keep the changes in one patch, thus the
> > differences are more obvious.
>
> The rename to vector_schedule_cleanup() can be obviously done first.

Okay, it's a bit wired to me to rename before any actual code logic change.

>
> > We need a second patch to do vector cleanup in lapic_offline() in case
> > the vector cleanup timer has not expired.
>
> Right. I was lazy and just put a WARN_ON() there under the assumption that you
> will figure it out.

I see that, as your changes to lapic_offline() are completely new.

> But a second patch?
>
> We don't switch things over into a broken state first and then fix it up afterwards.

Make sense!

2023-06-19 19:53:35

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 05/33] x86/traps: add external_interrupt() to dispatch external interrupts

> > Add external_interrupt() to dispatch external interrupts to their handlers.
> >
> > If an external interrupt is a system interrupt, dipatch it through
> > system_interrupt_handlers table, otherwise to
> > dispatch_common_interrupt().
>
> This naming convention sucks. external interrupts which can be system
> interrupts. Come on.

This name dispatch_common_interrupt() comes from arch/x86/kernel/irq.c:

/*
* common_interrupt() handles all normal device IRQ's (the special SMP
* cross-CPU interrupts have their own entry points).
*/
DEFINE_IDTENTRY_IRQ(common_interrupt)

Should we rename it to device_intertupt()?

Thanks!
Xin

2023-06-19 20:06:43

by H. Peter Anvin

[permalink] [raw]
Subject: RE: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

On June 19, 2023 11:47:08 AM PDT, "Li, Xin3" <[email protected]> wrote:
>> > To me it's better to keep the changes in one patch, thus the
>> > differences are more obvious.
>>
>> The rename to vector_schedule_cleanup() can be obviously done first.
>
>Okay, it's a bit wired to me to rename before any actual code logic change.
>

Weird or not, that's the established practice.

However, if you think about it, it makes sense: that way your code logic patch doesn't contain a bunch of names which will almost immediately be outdated. That is *really* confusing when you are going back through the git history, for example.

>>
>> > We need a second patch to do vector cleanup in lapic_offline() in case
>> > the vector cleanup timer has not expired.
>>
>> Right. I was lazy and just put a WARN_ON() there under the assumption that you
>> will figure it out.
>
>I see that, as your changes to lapic_offline() are completely new.
>
>> But a second patch?
>>
>> We don't switch things over into a broken state first and then fix it up afterwards.
>
>Make sense!
>


2023-06-19 21:26:07

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [PATCH v8 05/33] x86/traps: add external_interrupt() to dispatch external interrupts

On Mon, Jun 19 2023 at 19:16, Li, Xin3 wrote:
>> > Add external_interrupt() to dispatch external interrupts to their handlers.
>> >
>> > If an external interrupt is a system interrupt, dipatch it through
>> > system_interrupt_handlers table, otherwise to
>> > dispatch_common_interrupt().
>>
>> This naming convention sucks. external interrupts which can be system
>> interrupts. Come on.
>
> This name dispatch_common_interrupt() comes from arch/x86/kernel/irq.c:

That's not the point. Your changelog says:

If an external interrupt is a system interrupt...

It's either an external interrupt which goes through common_interrupt()
or it is a system interrupt which goes through it's very own handler,
no?

Thanks,

tglx






2023-06-20 00:16:49

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 01/33] x86/traps: let common_interrupt() handle IRQ_MOVE_CLEANUP_VECTOR

> >Okay, it's a bit wired to me to rename before any actual code logic change.
> >
>
> Weird or not, that's the established practice.
>
> However, if you think about it, it makes sense: that way your code logic patch
> doesn't contain a bunch of names which will almost immediately be outdated. That
> is *really* confusing when you are going back through the git history, for example.

Thanks for the clarification!
Xin

2023-06-20 00:40:17

by Li, Xin3

[permalink] [raw]
Subject: RE: [PATCH v8 05/33] x86/traps: add external_interrupt() to dispatch external interrupts

> That's not the point. Your changelog says:
>
> If an external interrupt is a system interrupt...
>
> It's either an external interrupt which goes through common_interrupt()
> or it is a system interrupt which goes through it's very own handler,
> no?

Ah, then it looks more of a problem in the way how I described it.

What I wanted to describe is the dispatch logic _inside_ the new function
external_interrupt(), what about:

external_interrupt() dispatches all external interrupts: it checks if an
external interrupt is a system interrupt, if yes it dipatches it through
the system_interrupt_handlers table, otherwise to dispatch_common_interrupt().

Thanks!
Xin