2023-05-08 07:11:36

by Ard Biesheuvel

Subject: [PATCH v2 00/20] efi/x86: Avoid bare metal decompressor during EFI boot

This is v2 of [2], which updates the x86 boot path to avoid the legacy
decompressor when booting via the EFI stub.

TL;DR: the bare metal decompressor inherits the loader's 1:1 mapping of
DRAM when entering in 64-bit mode, and assumes that all of it is mapped
read/write/execute, which will no longer be the case on hardware built
to comply with recently tightened logo requirements (*).

Changes since v1:
- streamline existing 4/5 level switching code and call it directly from
the EFI stub - this is covered by the first 9 patches, which can be
applied in isolation, if desired;
- deal with SEV/SNP init explicitly;
- clear BSS when booting via the 'handover protocol';
- switch to kernel CS before calling SEV init code in kernel proper.

Conceptually, this should not conflict with the memory acceptance work
which is being done by Kirill [3], but some lexical conflicts are not
unlikely. After applying this series, the allocations done by the EFI
stub for the trampoline and the decompressed kernel will use the EFI
page allocation APIs, and will therefore not need explicit acceptance.

---- v1 cover letter follows ----

This series is conceptually a combination of Evgeny's series [0] and
mine [1], both of which attempt to make the early decompressor code more
amenable to executing in the EFI environment with stricter handling of
memory permissions.

My series [1] implemented zboot for x86, by getting rid of the entire
x86 decompressor, and replacing it with existing EFI code that does the
same but in a generic way. The downside of this is that only EFI boot is
supported, making it unviable for distros, which need to support BIOS
boot and hybrid EFI boot modes that omit the EFI stub.

Evgeny's series [0] adapted the entire decompressor code flow to allow
it to execute in the EFI context as well as the bare metal context, and
this involves changes to the 1:1 mapping code and the page fault
handlers etc, none of which are really needed when doing EFI boot in the
first place.

So this series attempts to occupy the middle ground here: it makes
minimal changes to the existing decompressor so some of it can be called
from the EFI stub. Then, it reimplements the EFI boot flow to decompress
the kernel and boot it directly, without relying on the trampoline
allocation code, page table code or page fault handling code. This
allows us to get rid of quite a bit of unsavory EFI stub code, and
replace it with two clear invocations of the EFI firmware APIs to clear
NX restrictions from allocations that have been populated with
executable code.
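
As a concrete illustration (a sketch only, not code from this series):
with the EFI memory attributes protocol, dropping the no-execute
restriction from such an allocation is a single protocol call. The
helper below assumes the efi_memory_attribute_protocol_t definition
from efistub.h and a protocol pointer that has already been located;
mixed mode would have to go through efi_call_proto() instead.

/*
 * Illustrative sketch: clear the XP (no-execute) attribute from a range
 * that the stub has populated with executable code. Error handling and
 * the protocol lookup are omitted for brevity.
 */
static efi_status_t make_range_executable(efi_memory_attribute_protocol_t *memattr,
                                          efi_physical_addr_t start, u64 size)
{
        return memattr->clear_memory_attributes(memattr, start, size,
                                                EFI_MEMORY_XP);
}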

The only code that is being reused is the decompression library itself,
along with the minimal ELF parsing that is required to copy the ELF
segments in place, and the relocation processing that fixes up absolute
symbol references to refer to the correct virtual addresses.

Note that some of Evgeny's changes to clean up the PE/COFF header
generation will still be needed, but I've omitted those here for
brevity.

(*) IMHO the following developments are likely to occur:
- the Windows boot chain (along with 3rd party drivers) is cleaned up so
that it never relies on memory being writable and executable at the
same time when running under the EFI boot services;
- the EFI reference implementation gets updated to map all memory NX by
default, and to require read-only permissions for executable mappings;
- BIOS vendors incorporate these changes into their codebases, and
deploy it more widely than just the 'secure' SKUs;
- OEMs only care about the Windows sticker, so they only boot test
Windows, which works fine in this more restricted context;
- Linux boot no longer works reliably on new hardware built for Windows
unless we clean up our boot chain as well.

Cc: Evgeniy Baskov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexey Khoroshilov <[email protected]>
Cc: Peter Jones <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Mario Limonciello <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>

[0] https://lore.kernel.org/all/[email protected]/
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/

Ard Biesheuvel (20):
x86: decompressor: Use proper sequence to take the address of the GOT
x86: decompressor: Store boot_params pointer in callee save register
x86: decompressor: Call trampoline as a normal function
x86: decompressor: Use standard calling convention for trampoline
x86: decompressor: Avoid the need for a stack in the 32-bit trampoline
x86: decompressor: Call trampoline directly from C code
x86: decompressor: Only call the trampoline when changing paging
levels
x86: decompressor: Merge trampoline cleanup with switching code
x86: efistub: Perform 4/5 level paging switch from the stub
x86: efistub: Prefer EFI memory attributes protocol over DXE services
decompress: Use 8 byte alignment
x86: decompressor: Move global symbol references to C code
x86: decompressor: Factor out kernel decompression and relocation
x86: head_64: Store boot_params pointer in callee-preserved register
x86: head_64: Switch to kernel CS before enabling memory encryption
efi: libstub: Add limit argument to efi_random_alloc()
x86: efistub: Check SEV/SNP support while running in the firmware
x86: efistub: Avoid legacy decompressor when doing EFI boot
x86: efistub: Clear BSS in EFI handover protocol entrypoint
x86: decompressor: Avoid magic offsets for EFI handover entrypoint

arch/x86/boot/compressed/Makefile | 5 +
arch/x86/boot/compressed/efi_mixed.S | 58 +---
arch/x86/boot/compressed/head_32.S | 34 +-
arch/x86/boot/compressed/head_64.S | 196 +++--------
arch/x86/boot/compressed/misc.c | 44 ++-
arch/x86/boot/compressed/pgtable.h | 6 +-
arch/x86/boot/compressed/pgtable_64.c | 72 ++--
arch/x86/boot/compressed/sev.c | 12 +-
arch/x86/include/asm/boot.h | 8 +
arch/x86/include/asm/efi.h | 7 +-
arch/x86/include/asm/sev.h | 4 +
arch/x86/kernel/head_64.S | 33 +-
drivers/firmware/efi/libstub/arm64-stub.c | 2 +-
drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
drivers/firmware/efi/libstub/efistub.h | 2 +-
drivers/firmware/efi/libstub/randomalloc.c | 10 +-
drivers/firmware/efi/libstub/x86-stub.c | 353 +++++++++++++-------
drivers/firmware/efi/libstub/zboot.c | 2 +-
include/linux/decompress/mm.h | 2 +-
19 files changed, 400 insertions(+), 454 deletions(-)

--
2.39.2


2023-05-08 07:11:58

by Ard Biesheuvel

Subject: [PATCH v2 12/20] x86: decompressor: Move global symbol references to C code

We no longer have to be cautious when referring to global variables in
the position independent decompressor code, now that we use PIE codegen
and assert in the linker script that no GOT entries exist (which would
require adjustment for the actual runtime load address of the
decompressor binary).

This means we can simply refer to global variables directly, instead of
passing them into C routines from asm code. Let's do so for the code
that will be called directly from the EFI stub after a subsequent patch,
and avoid the need to pass these quantities as function arguments.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_32.S | 8 --------
arch/x86/boot/compressed/head_64.S | 8 +-------
arch/x86/boot/compressed/misc.c | 16 +++++++++-------
3 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 53cbee1e2a93efce..8ee8749007595fcc 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -179,13 +179,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
*/
/* push arguments for extract_kernel: */

- pushl output_len@GOTOFF(%ebx) /* decompressed length, end of relocs */
pushl %ebp /* output address */
- pushl input_len@GOTOFF(%ebx) /* input_len */
- leal input_data@GOTOFF(%ebx), %eax
- pushl %eax /* input_data */
- leal boot_heap@GOTOFF(%ebx), %eax
- pushl %eax /* heap area */
pushl %esi /* real mode pointer */
call extract_kernel /* returns kernel entry point in %eax */
addl $24, %esp
@@ -213,8 +207,6 @@ SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
*/
.bss
.balign 4
-boot_heap:
- .fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 21af9cfd0dd0afb7..bcf678aed81468e3 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -519,11 +519,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
* Do the extraction, and jump to the new kernel..
*/
movq %r15, %rdi /* pass struct boot_params pointer */
- leaq boot_heap(%rip), %rsi /* malloc area for uncompression */
- leaq input_data(%rip), %rdx /* input_data */
- movl input_len(%rip), %ecx /* input_len */
- movq %rbp, %r8 /* output target address */
- movl output_len(%rip), %r9d /* decompressed length, end of relocs */
+ movq %rbp, %rsi /* output target address */
call extract_kernel /* returns kernel entry point in %rax */

/*
@@ -651,8 +647,6 @@ SYM_DATA_END_LABEL(boot_idt, SYM_L_GLOBAL, boot_idt_end)
*/
.bss
.balign 4
-SYM_DATA_LOCAL(boot_heap, .fill BOOT_HEAP_SIZE, 1, 0)
-
SYM_DATA_START_LOCAL(boot_stack)
.fill BOOT_STACK_SIZE, 1, 0
.balign 16
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 014ff222bf4b35c2..1cd40cb9fb4e5027 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -330,6 +330,11 @@ static size_t parse_elf(void *output)
return ehdr.e_entry - LOAD_PHYSICAL_ADDR;
}

+static u8 boot_heap[BOOT_HEAP_SIZE] __aligned(4);
+
+extern unsigned char input_data[];
+extern unsigned int input_len, output_len;
+
/*
* The compressed kernel image (ZO), has been moved so that its position
* is against the end of the buffer used to hold the uncompressed kernel
@@ -347,14 +352,11 @@ static size_t parse_elf(void *output)
* |-------uncompressed kernel image---------|
*
*/
-asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
- unsigned char *input_data,
- unsigned long input_len,
- unsigned char *output,
- unsigned long output_len)
+asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
{
const unsigned long kernel_total_size = VO__end - VO__text;
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
+ memptr heap = (memptr)boot_heap;
unsigned long needed_size;
size_t entry_offset;

@@ -412,7 +414,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
* entries. This ensures the full mapped area is usable RAM
* and doesn't include any reserved areas.
*/
- needed_size = max(output_len, kernel_total_size);
+ needed_size = max((unsigned long)output_len, kernel_total_size);
#ifdef CONFIG_X86_64
needed_size = ALIGN(needed_size, MIN_KERNEL_ALIGN);
#endif
@@ -443,7 +445,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
#ifdef CONFIG_X86_64
if (heap > 0x3fffffffffffUL)
error("Destination address too large");
- if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE)
+ if (virt_addr + needed_size > KERNEL_IMAGE_SIZE)
error("Destination virtual address is beyond the kernel mapping area");
#else
if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
--
2.39.2

2023-05-08 07:12:39

by Ard Biesheuvel

Subject: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware

The decompressor executes in an environment with little or no access to
a console, and without any ability to return an error back to the caller
(the bootloader). So the only recourse we have when the SEV/SNP context
is not quite what the kernel expects is to terminate the guest entirely.

This is a bit harsh, and also unnecessary when booting via the EFI stub,
given that it provides all the support that SEV guests need to probe the
underlying platform.

So let's do the SEV initialization and SNP feature check before calling
ExitBootServices(), and simply return with an error if the SNP feature
mask is not as expected.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/sev.c | 12 ++++++++----
arch/x86/include/asm/sev.h | 4 ++++
drivers/firmware/efi/libstub/x86-stub.c | 17 +++++++++++++++++
3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 014b89c890887b9a..19c40873fdd209b5 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -315,20 +315,24 @@ static void enforce_vmpl0(void)
*/
#define SNP_FEATURES_PRESENT (0)

+u64 snp_get_unsupported_features(void)
+{
+ if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
+ return 0;
+ return sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
+}
+
void snp_check_features(void)
{
u64 unsupported;

- if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
- return;
-
/*
* Terminate the boot if hypervisor has enabled any feature lacking
* guest side implementation. Pass on the unsupported features mask through
* EXIT_INFO_2 of the GHCB protocol so that those features can be reported
* as part of the guest boot failure.
*/
- unsupported = sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
+ unsupported = snp_get_unsupported_features();
if (unsupported) {
if (ghcb_version < 2 || (!boot_ghcb && !early_setup_ghcb()))
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 13dc2a9d23c1eb25..bf27b91644d0da5a 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -157,6 +157,7 @@ static __always_inline void sev_es_nmi_complete(void)
__sev_es_nmi_complete();
}
extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
+extern void sev_enable(struct boot_params *bp);

static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs)
{
@@ -202,12 +203,14 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void __init __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
+u64 snp_get_unsupported_features(void);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
static inline void sev_es_nmi_complete(void) { }
static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
+static inline void sev_enable(struct boot_params *bp) { }
static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate) { return 0; }
static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
static inline void setup_ghcb(void) { }
@@ -225,6 +228,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+static inline u64 snp_get_unsupported_features(void) { return 0; }
#endif

#endif
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index ce8434fce0c37982..33d11ba78f1d8c4f 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -15,6 +15,7 @@
#include <asm/setup.h>
#include <asm/desc.h>
#include <asm/boot.h>
+#include <asm/sev.h>

#include "efistub.h"

@@ -714,6 +715,22 @@ static efi_status_t exit_boot_func(struct efi_boot_memmap *map,
&p->efi->efi_memmap, &p->efi->efi_memmap_hi);
p->efi->efi_memmap_size = map->map_size;

+ /*
+ * Call the SEV init code while still running with the firmware's
+ * GDT/IDT, so #VC exceptions will be handled by EFI.
+ */
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ u64 unsupported;
+
+ sev_enable(p->boot_params);
+ unsupported = snp_get_unsupported_features();
+ if (unsupported) {
+ efi_err("Unsupported SEV-SNP features detected: 0x%llx\n",
+ unsupported);
+ return EFI_UNSUPPORTED;
+ }
+ }
+
return EFI_SUCCESS;
}

--
2.39.2

2023-05-08 07:13:01

by Ard Biesheuvel

Subject: [PATCH v2 15/20] x86: head_64: Switch to kernel CS before enabling memory encryption

The SME initialization triggers #VC exceptions due to the use of CPUID
instructions, and returning from an exception restores the code segment
that was active when the exception was taken.

This means we should ensure that we switch the code segment to one that
is described in the GDT we just loaded before running the SME init code.

Reported-by: Tom Lendacky <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 95b12fdae10e1dc9..a128ac62956ff7c4 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -76,6 +76,15 @@ SYM_CODE_START_NOALIGN(startup_64)

call startup_64_setup_env

+ /* Now switch to __KERNEL_CS so IRET works reliably */
+ pushq $__KERNEL_CS
+ leaq .Lon_kernel_cs(%rip), %rax
+ pushq %rax
+ lretq
+
+.Lon_kernel_cs:
+ UNWIND_HINT_END_OF_STACK
+
#ifdef CONFIG_AMD_MEM_ENCRYPT
/*
* Activate SEV/SME memory encryption if supported/enabled. This needs to
@@ -87,15 +96,6 @@ SYM_CODE_START_NOALIGN(startup_64)
call sme_enable
#endif

- /* Now switch to __KERNEL_CS so IRET works reliably */
- pushq $__KERNEL_CS
- leaq .Lon_kernel_cs(%rip), %rax
- pushq %rax
- lretq
-
-.Lon_kernel_cs:
- UNWIND_HINT_END_OF_STACK
-
/* Sanitize CPU configuration */
call verify_cpu

--
2.39.2

2023-05-08 07:13:01

by Ard Biesheuvel

Subject: [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub

In preparation for updating the EFI stub boot flow to avoid the bare
metal decompressor code altogether, implement the support code for
switching between 4 and 5 levels of paging before jumping to the kernel
proper.

This reuses the newly refactored trampoline that the bare metal
decompressor uses, but relies on EFI APIs to allocate 32-bit addressable
memory and remap it with the appropriate permissions. Given that the
bare metal decompressor will no longer call into the trampoline if the
number of paging levels is already set correctly, we no longer need to
remove NX restrictions from the memory range where this trampoline may
end up.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
drivers/firmware/efi/libstub/x86-stub.c | 119 ++++++++++++++++----
2 files changed, 102 insertions(+), 21 deletions(-)

diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
index 1e0203d74691ffcc..fc5f3b4c45e91401 100644
--- a/drivers/firmware/efi/libstub/efi-stub-helper.c
+++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
@@ -16,6 +16,8 @@

#include "efistub.h"

+extern bool efi_no5lvl;
+
bool efi_nochunk;
bool efi_nokaslr = !IS_ENABLED(CONFIG_RANDOMIZE_BASE);
bool efi_novamap;
@@ -73,6 +75,8 @@ efi_status_t efi_parse_options(char const *cmdline)
efi_loglevel = CONSOLE_LOGLEVEL_QUIET;
} else if (!strcmp(param, "noinitrd")) {
efi_noinitrd = true;
+ } else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
+ efi_no5lvl = true;
} else if (!strcmp(param, "efi") && val) {
efi_nochunk = parse_option_str(val, "nochunk");
efi_novamap |= parse_option_str(val, "novamap");
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0bfd31358ba97b1..fb83a72ad905ad6e 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -267,32 +267,11 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
}
}

-/*
- * Trampoline takes 2 pages and can be loaded in first megabyte of memory
- * with its end placed between 128k and 640k where BIOS might start.
- * (see arch/x86/boot/compressed/pgtable_64.c)
- *
- * We cannot find exact trampoline placement since memory map
- * can be modified by UEFI, and it can alter the computed address.
- */
-
-#define TRAMPOLINE_PLACEMENT_BASE ((128 - 8)*1024)
-#define TRAMPOLINE_PLACEMENT_SIZE (640*1024 - (128 - 8)*1024)
-
void startup_32(struct boot_params *boot_params);

static void
setup_memory_protection(unsigned long image_base, unsigned long image_size)
{
- /*
- * Allow execution of possible trampoline used
- * for switching between 4- and 5-level page tables
- * and relocated kernel image.
- */
-
- adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
- TRAMPOLINE_PLACEMENT_SIZE);
-
#ifdef CONFIG_64BIT
if (image_base != (unsigned long)startup_32)
adjust_memory_range_protection(image_base, image_size);
@@ -760,6 +739,96 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
return EFI_SUCCESS;
}

+bool efi_no5lvl;
+
+static void (*la57_toggle)(void *trampoline, bool enable_5lvl);
+
+extern void trampoline_32bit_src(void *, bool);
+extern const u16 trampoline_ljmp_imm_offset;
+
+/*
+ * Enabling (or disabling) 5 level paging is tricky, because it can only be
+ * done from 32-bit mode with paging disabled. This means not only that the
+ * code itself must be running from 32-bit addressable physical memory, but
+ * also that the root page table must be 32-bit addressable, as we cannot
+ * program a 64-bit value into CR3 when running in 32-bit mode.
+ */
+static efi_status_t efi_setup_5level_paging(void)
+{
+ u8 tmpl_size = (u8 *)&trampoline_ljmp_imm_offset - (u8 *)&trampoline_32bit_src;
+ efi_status_t status;
+ u8 *la57_code;
+
+ if (!efi_is_64bit())
+ return EFI_SUCCESS;
+
+ /* check for 5 level paging support */
+ if (native_cpuid_eax(0) < 7 ||
+ !(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
+ return EFI_SUCCESS;
+
+ /* allocate some 32-bit addressable memory for code and a page table */
+ status = efi_allocate_pages(2 * PAGE_SIZE, (unsigned long *)&la57_code,
+ U32_MAX);
+ if (status != EFI_SUCCESS)
+ return status;
+
+ la57_toggle = memcpy(la57_code, trampoline_32bit_src, tmpl_size);
+ memset(la57_code + tmpl_size, 0x90, PAGE_SIZE - tmpl_size);
+
+ /*
+ * To avoid having to allocate a 32-bit addressable stack, we use a
+ * ljmp to switch back to long mode. However, this takes an absolute
+ * address, so we have to poke it in at runtime.
+ */
+ *(u32 *)&la57_code[trampoline_ljmp_imm_offset] += (unsigned long)la57_code;
+
+ adjust_memory_range_protection((unsigned long)la57_toggle, PAGE_SIZE);
+
+ return EFI_SUCCESS;
+}
+
+static void efi_5level_switch(void)
+{
+#ifdef CONFIG_X86_64
+ static const struct desc_struct gdt[] = {
+ [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
+ [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
+ };
+
+ bool want_la57 = IS_ENABLED(CONFIG_X86_5LEVEL) && !efi_no5lvl;
+ bool have_la57 = native_read_cr4() & X86_CR4_LA57;
+ bool need_toggle = want_la57 ^ have_la57;
+ u64 *pgt = (void *)la57_toggle + PAGE_SIZE;
+ u64 *cr3 = (u64 *)__native_read_cr3();
+ u64 *new_cr3;
+
+ if (!la57_toggle || !need_toggle)
+ return;
+
+ if (!have_la57) {
+ /*
+ * We are going to enable 5 level paging, so we need to
+ * allocate a root level page from the 32-bit addressable
+ * physical region, and plug the existing hierarchy into it.
+ */
+ new_cr3 = memset(pgt, 0, PAGE_SIZE);
+ new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
+ } else {
+ // take the new root table pointer from the current entry #0
+ new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
+
+ // copy the new root level table if it is not 32-bit addressable
+ if ((u64)new_cr3 > U32_MAX)
+ new_cr3 = memcpy(pgt, new_cr3, PAGE_SIZE);
+ }
+
+ native_load_gdt(&(struct desc_ptr){ sizeof(gdt) - 1, (u64)gdt });
+
+ la57_toggle(new_cr3, want_la57);
+#endif
+}
+
/*
* On success, we return the address of startup_32, which has potentially been
* relocated by efi_relocate_kernel.
@@ -787,6 +856,12 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
efi_dxe_table = NULL;
}

+ status = efi_setup_5level_paging();
+ if (status != EFI_SUCCESS) {
+ efi_err("efi_setup_5level_paging() failed!\n");
+ goto fail;
+ }
+
/*
* If the kernel isn't already loaded at a suitable address,
* relocate it.
@@ -905,6 +980,8 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
goto fail;
}

+ efi_5level_switch();
+
return bzimage_addr;
fail:
efi_err("efi_main() failed!\n");
--
2.39.2

2023-05-08 07:13:21

by Ard Biesheuvel

Subject: [PATCH v2 16/20] efi: libstub: Add limit argument to efi_random_alloc()

x86 will need to limit the kernel memory allocation to the lowest 512
MiB of memory, to match the behavior of the existing bare metal KASLR
physical randomization logic. So in preparation for that, add a limit
parameter to efi_random_alloc() and wire it up.
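
As a hedged illustration of how the limit could then be consumed on the
x86 side (the 512 MiB literal and the wrapper are assumptions, not part
of this patch, which only threads the argument through the existing
callers):

/*
 * Sketch only: confine the randomized kernel placement to the low 512 MiB,
 * mirroring the bare metal KASLR behavior. Existing callers in the diff
 * below simply pass EFI_ALLOC_LIMIT to preserve current behavior.
 */
static efi_status_t alloc_kernel_below_512m(unsigned long size,
                                            unsigned long align,
                                            unsigned long *addr,
                                            unsigned long seed)
{
        return efi_random_alloc(size, align, addr, seed,
                                EFI_LOADER_CODE, SZ_512M);
}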

Signed-off-by: Ard Biesheuvel <[email protected]>
---
drivers/firmware/efi/libstub/arm64-stub.c | 2 +-
drivers/firmware/efi/libstub/efistub.h | 2 +-
drivers/firmware/efi/libstub/randomalloc.c | 10 ++++++----
drivers/firmware/efi/libstub/zboot.c | 2 +-
4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/firmware/efi/libstub/arm64-stub.c b/drivers/firmware/efi/libstub/arm64-stub.c
index 770b8ecb73984c61..8c40fc89f5f99209 100644
--- a/drivers/firmware/efi/libstub/arm64-stub.c
+++ b/drivers/firmware/efi/libstub/arm64-stub.c
@@ -106,7 +106,7 @@ efi_status_t handle_kernel_image(unsigned long *image_addr,
*/
status = efi_random_alloc(*reserve_size, min_kimg_align,
reserve_addr, phys_seed,
- EFI_LOADER_CODE);
+ EFI_LOADER_CODE, EFI_ALLOC_LIMIT);
if (status != EFI_SUCCESS)
efi_warn("efi_random_alloc() failed: 0x%lx\n", status);
} else {
diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index 67d5a20802e0b7c6..03e3cec87ffbe2d1 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -955,7 +955,7 @@ efi_status_t efi_get_random_bytes(unsigned long size, u8 *out);

efi_status_t efi_random_alloc(unsigned long size, unsigned long align,
unsigned long *addr, unsigned long random_seed,
- int memory_type);
+ int memory_type, unsigned long alloc_limit);

efi_status_t efi_random_get_seed(void);

diff --git a/drivers/firmware/efi/libstub/randomalloc.c b/drivers/firmware/efi/libstub/randomalloc.c
index 32c7a54923b4c127..674a064b8f7adc68 100644
--- a/drivers/firmware/efi/libstub/randomalloc.c
+++ b/drivers/firmware/efi/libstub/randomalloc.c
@@ -16,7 +16,8 @@
*/
static unsigned long get_entry_num_slots(efi_memory_desc_t *md,
unsigned long size,
- unsigned long align_shift)
+ unsigned long align_shift,
+ u64 alloc_limit)
{
unsigned long align = 1UL << align_shift;
u64 first_slot, last_slot, region_end;
@@ -29,7 +30,7 @@ static unsigned long get_entry_num_slots(efi_memory_desc_t *md,
return 0;

region_end = min(md->phys_addr + md->num_pages * EFI_PAGE_SIZE - 1,
- (u64)EFI_ALLOC_LIMIT);
+ alloc_limit);
if (region_end < size)
return 0;

@@ -54,7 +55,8 @@ efi_status_t efi_random_alloc(unsigned long size,
unsigned long align,
unsigned long *addr,
unsigned long random_seed,
- int memory_type)
+ int memory_type,
+ unsigned long alloc_limit)
{
unsigned long total_slots = 0, target_slot;
unsigned long total_mirrored_slots = 0;
@@ -76,7 +78,7 @@ efi_status_t efi_random_alloc(unsigned long size,
efi_memory_desc_t *md = (void *)map->map + map_offset;
unsigned long slots;

- slots = get_entry_num_slots(md, size, ilog2(align));
+ slots = get_entry_num_slots(md, size, ilog2(align), alloc_limit);
MD_NUM_SLOTS(md) = slots;
total_slots += slots;
if (md->attribute & EFI_MEMORY_MORE_RELIABLE)
diff --git a/drivers/firmware/efi/libstub/zboot.c b/drivers/firmware/efi/libstub/zboot.c
index e5d7fa1f1d8fd160..bdb17eac0cb401be 100644
--- a/drivers/firmware/efi/libstub/zboot.c
+++ b/drivers/firmware/efi/libstub/zboot.c
@@ -119,7 +119,7 @@ efi_zboot_entry(efi_handle_t handle, efi_system_table_t *systab)
}

status = efi_random_alloc(alloc_size, min_kimg_align, &image_base,
- seed, EFI_LOADER_CODE);
+ seed, EFI_LOADER_CODE, EFI_ALLOC_LIMIT);
if (status != EFI_SUCCESS) {
efi_err("Failed to allocate memory\n");
goto free_cmdline;
--
2.39.2

2023-05-08 07:13:34

by Ard Biesheuvel

Subject: [PATCH v2 04/20] x86: decompressor: Use standard calling convention for trampoline

Update the trampoline code so its arguments are passed via RDI and RSI,
which matches the ordinary SysV calling convention for x86_64. This will
allow us to call this code directly from C.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 30 +++++++++-----------
arch/x86/boot/compressed/pgtable.h | 2 +-
2 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 81b53b576cdd05ae..b1f8a867777120bb 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -454,9 +454,9 @@ SYM_CODE_START(startup_64)
movq %r15, %rdi /* pass struct boot_params pointer */
call paging_prepare

- /* Save the trampoline address in RCX */
- movq %rax, %rcx
-
+ /* Pass the trampoline address and boolean flag as args #1 and #2 */
+ movq %rax, %rdi
+ movq %rdx, %rsi
leaq TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
call *%rax

@@ -560,11 +560,11 @@ SYM_FUNC_END(.Lrelocated)
/*
* This is the 32-bit trampoline that will be copied over to low memory.
*
- * ECX contains the base address of the trampoline memory.
- * Non zero RDX means trampoline needs to enable 5-level paging.
+ * EDI contains the base address of the trampoline memory.
+ * Non-zero ESI means trampoline needs to enable 5-level paging.
*/
SYM_CODE_START(trampoline_32bit_src)
- popq %rdi
+ popq %r8
/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
pushq $__KERNEL32_CS
leaq 0f(%rip), %rax
@@ -578,7 +578,7 @@ SYM_CODE_START(trampoline_32bit_src)
movl %eax, %ss

/* Set up new stack */
- leal TRAMPOLINE_32BIT_STACK_END(%ecx), %esp
+ leal TRAMPOLINE_32BIT_STACK_END(%edi), %esp

/* Disable paging */
movl %cr0, %eax
@@ -586,7 +586,7 @@ SYM_CODE_START(trampoline_32bit_src)
movl %eax, %cr0

/* Check what paging mode we want to be in after the trampoline */
- testl %edx, %edx
+ testl %esi, %esi
jz 1f

/* We want 5-level paging: don't touch CR3 if it already points to 5-level page tables */
@@ -601,21 +601,17 @@ SYM_CODE_START(trampoline_32bit_src)
jz 3f
2:
/* Point CR3 to the trampoline's new top level page table */
- leal TRAMPOLINE_32BIT_PGTABLE_OFFSET(%ecx), %eax
+ leal TRAMPOLINE_32BIT_PGTABLE_OFFSET(%edi), %eax
movl %eax, %cr3
3:
/* Set EFER.LME=1 as a precaution in case hypervsior pulls the rug */
- pushl %ecx
- pushl %edx
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
/* Avoid writing EFER if no change was made (for TDX guest) */
jc 1f
wrmsr
-1: popl %edx
- popl %ecx
-
+1:
#ifdef CONFIG_X86_MCE
/*
* Preserve CR4.MCE if the kernel will enable #MC support.
@@ -632,14 +628,14 @@ SYM_CODE_START(trampoline_32bit_src)

/* Enable PAE and LA57 (if required) paging modes */
orl $X86_CR4_PAE, %eax
- testl %edx, %edx
+ testl %esi, %esi
jz 1f
orl $X86_CR4_LA57, %eax
1:
movl %eax, %cr4

/* Calculate address of paging_enabled() once we are executing in the trampoline */
- leal .Lpaging_enabled - trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_OFFSET(%ecx), %eax
+ leal .Lpaging_enabled - trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_OFFSET(%edi), %eax

/* Prepare the stack for far return to Long Mode */
pushl $__KERNEL_CS
@@ -656,7 +652,7 @@ SYM_CODE_END(trampoline_32bit_src)
.code64
SYM_FUNC_START_LOCAL_NOALIGN(.Lpaging_enabled)
/* Return from the trampoline */
- jmp *%rdi
+ jmp *%r8
SYM_FUNC_END(.Lpaging_enabled)

/*
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 91dbb99203fbce2d..4e8cef135226bcbb 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -14,7 +14,7 @@

extern unsigned long *trampoline_32bit;

-extern void trampoline_32bit_src(void *return_ptr);
+extern void trampoline_32bit_src(void *trampoline, bool enable_5lvl);

#endif /* __ASSEMBLER__ */
#endif /* BOOT_COMPRESSED_PAGETABLE_H */
--
2.39.2

2023-05-08 07:14:00

by Ard Biesheuvel

Subject: [PATCH v2 07/20] x86: decompressor: Only call the trampoline when changing paging levels

Since we know the current and desired number of paging levels when
preparing the trampoline, let's not call the trampoline at all when we
know that calling it is not going to result in a change to the number of
paging levels. Given that we are already running in long mode, the PAE
and LA57 settings are necessarily consistent with the currently active
page tables - the only difference is that we will always preserve
CR4.MCE in this case, but this will be cleared by the real kernel
startup code if CONFIG_X86_MCE is not enabled.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 21 +-------------------
arch/x86/boot/compressed/pgtable_64.c | 18 +++++++----------
2 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 94b614ecb7c2fd55..ccdfe7e55c36a40f 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -398,10 +398,6 @@ SYM_CODE_START(startup_64)
* For the trampoline, we need the top page table to reside in lower
* memory as we don't have a way to load 64-bit values into CR3 in
* 32-bit mode.
- *
- * We go though the trampoline even if we don't have to: if we're
- * already in a desired paging mode. This way the trampoline code gets
- * tested on every boot.
*/

/* Make sure we have GDT with 32-bit code segment */
@@ -563,25 +559,10 @@ SYM_CODE_START(trampoline_32bit_src)
btrl $X86_CR0_PG_BIT, %eax
movl %eax, %cr0

- /* Check what paging mode we want to be in after the trampoline */
- testl %esi, %esi
- jz 1f
-
- /* We want 5-level paging: don't touch CR3 if it already points to 5-level page tables */
- movl %cr4, %eax
- testl $X86_CR4_LA57, %eax
- jnz 3f
- jmp 2f
-1:
- /* We want 4-level paging: don't touch CR3 if it already points to 4-level page tables */
- movl %cr4, %eax
- testl $X86_CR4_LA57, %eax
- jz 3f
-2:
/* Point CR3 to the trampoline's new top level page table */
leal TRAMPOLINE_32BIT_PGTABLE_OFFSET(%edi), %eax
movl %eax, %cr3
-3:
+
/* Set EFER.LME=1 as a precaution in case hypervsior pulls the rug */
movl $MSR_EFER, %ecx
rdmsr
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index b62b6819dcdd01be..b92cf1d6e156d5f6 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -128,6 +128,13 @@ asmlinkage void set_paging_levels(void *rmode)
l5_required = true;
}

+ /*
+ * We are not going to use the trampoline if we
+ * are already in the desired paging mode.
+ */
+ if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
+ return;
+
trampoline_32bit = (unsigned long *)find_trampoline_placement();

/* Preserve trampoline memory */
@@ -155,18 +162,8 @@ asmlinkage void set_paging_levels(void *rmode)
*
* The new page table will be used by trampoline code for switching
* from 4- to 5-level paging or vice versa.
- *
- * If switching is not required, the page table is unused: trampoline
- * code wouldn't touch CR3.
*/

- /*
- * We are not going to use the page table in trampoline memory if we
- * are already in the desired paging mode.
- */
- if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
- goto out;
-
if (l5_required) {
/*
* For 4- to 5-level paging transition, set up current CR3 as
@@ -189,7 +186,6 @@ asmlinkage void set_paging_levels(void *rmode)
(void *)src, PAGE_SIZE);
}

-out:
toggle_la57(trampoline_32bit, l5_required);
}

--
2.39.2

2023-05-08 07:14:32

by Ard Biesheuvel

Subject: [PATCH v2 14/20] x86: head_64: Store boot_params pointer in callee-preserved register

Instead of pushing/popping %RSI to/from the stack every time we call a
function from startup_64(), just store it in a callee-preserved register
until we're ready to pass it on.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index a5df3e994f04f10f..95b12fdae10e1dc9 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -60,6 +60,7 @@ SYM_CODE_START_NOALIGN(startup_64)
* compiled to run at we first fixup the physical addresses in our page
* tables and then reload them.
*/
+ mov %rsi, %r15 /* Preserve boot_params pointer */

/* Set up the stack for verify_cpu() */
leaq (__end_init_task - PTREGS_SIZE)(%rip), %rsp
@@ -73,9 +74,7 @@ SYM_CODE_START_NOALIGN(startup_64)
shrq $32, %rdx
wrmsr

- pushq %rsi
call startup_64_setup_env
- popq %rsi

#ifdef CONFIG_AMD_MEM_ENCRYPT
/*
@@ -84,10 +83,8 @@ SYM_CODE_START_NOALIGN(startup_64)
* which needs to be done before any CPUID instructions are executed in
* subsequent code.
*/
- movq %rsi, %rdi
- pushq %rsi
+ movq %r15, %rdi
call sme_enable
- popq %rsi
#endif

/* Now switch to __KERNEL_CS so IRET works reliably */
@@ -109,9 +106,7 @@ SYM_CODE_START_NOALIGN(startup_64)
* programmed into CR3.
*/
leaq _text(%rip), %rdi
- pushq %rsi
call __startup_64
- popq %rsi

/* Form the CR3 value being sure to include the CR3 modifier */
addq $(early_top_pgt - __START_KERNEL_map), %rax
@@ -200,10 +195,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
* %rsi carries pointer to realmode data and is callee-clobbered. Save
* and restore it.
*/
- pushq %rsi
movq %rax, %rdi
call sev_verify_cbit
- popq %rsi

/*
* Switch to new page-table
@@ -294,9 +287,7 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
wrmsr

/* Setup and Load IDT */
- pushq %rsi
call early_setup_idt
- popq %rsi

/* Check if nx is implemented */
movl $0x80000001, %eax
@@ -334,7 +325,7 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)

/* rsi is pointer to real mode structure with interesting info.
pass it to C */
- movq %rsi, %rdi
+ movq %r15, %rdi

.Ljump_to_C_code:
/*
--
2.39.2

2023-05-08 07:14:32

by Ard Biesheuvel

Subject: [PATCH v2 03/20] x86: decompressor: Call trampoline as a normal function

Move the long return to switch to 32-bit mode into the trampoline code
so we can call it as an ordinary function. This will allow us to call it
directly from C code in a subsequent patch.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 25 +++++++++-----------
arch/x86/boot/compressed/pgtable.h | 2 +-
2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 59340e533dff8369..81b53b576cdd05ae 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -457,18 +457,9 @@ SYM_CODE_START(startup_64)
/* Save the trampoline address in RCX */
movq %rax, %rcx

- /*
- * Load the address of trampoline_return() into RDI.
- * It will be used by the trampoline to return to the main code.
- */
- leaq trampoline_return(%rip), %rdi
-
- /* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
- pushq $__KERNEL32_CS
leaq TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
- pushq %rax
- lretq
-trampoline_return:
+ call *%rax
+
/* Restore the stack, the 32-bit trampoline uses its own stack */
leaq rva(boot_stack_end)(%rbx), %rsp

@@ -566,16 +557,22 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
jmp *%rax
SYM_FUNC_END(.Lrelocated)

- .code32
/*
* This is the 32-bit trampoline that will be copied over to low memory.
*
- * RDI contains the return address (might be above 4G).
* ECX contains the base address of the trampoline memory.
* Non zero RDX means trampoline needs to enable 5-level paging.
*/
SYM_CODE_START(trampoline_32bit_src)
- /* Set up data and stack segments */
+ popq %rdi
+ /* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
+ pushq $__KERNEL32_CS
+ leaq 0f(%rip), %rax
+ pushq %rax
+ lretq
+
+ .code32
+0: /* Set up data and stack segments */
movl $__KERNEL_DS, %eax
movl %eax, %ds
movl %eax, %ss
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index cc9b2529a08634b4..91dbb99203fbce2d 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,7 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0

#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
+#define TRAMPOLINE_32BIT_CODE_SIZE 0xA0

#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE

--
2.39.2

2023-05-08 07:15:44

by Ard Biesheuvel

Subject: [PATCH v2 20/20] x86: decompressor: Avoid magic offsets for EFI handover entrypoint

The EFI handover protocol entrypoint offset relative to the
startup_XX address is described in struct boot_params as
handover_offset, so that a Linux/x86 aware EFI loader can find
it there.

When mixed mode is enabled, this single field has to describe this
offset for both the 32-bit and 64-bit entrypoints, so their respective
relative offsets have to be identical.

Currently, we use hard-coded fixed offsets to ensure this, but the only
requirement is that the entrypoints are 0x200 bytes apart, and this only
matters when EFI mixed mode is configured to begin with.

So just set the required offset directly. This could potentially result
in a build error if the 32-bit startup code is much smaller than the
64-bit code but this is currently far from the case, and easily fixed
when that situation does arise.
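
For reference, the loader-side use of handover_offset looks roughly like
the sketch below. This is illustrative only and not part of the patch;
the field name comes from struct setup_header, and the +0x200 adjustment
for 64-bit entry follows Documentation/x86/boot.rst.

/*
 * Hypothetical loader-side sketch: locate and enter the EFI handover
 * entrypoint embedded in the PE kernel image. A 64-bit loader enters at
 * handover_offset + 0x200, which is why the two stub entrypoints must be
 * exactly 0x200 bytes apart.
 */
typedef void (*handover_fn_t)(void *handle, efi_system_table_t *sys_table,
                              struct boot_params *bp);

static void enter_kernel_via_handover(void *pm_kernel, struct boot_params *bp,
                                      void *handle, efi_system_table_t *sys_table,
                                      bool entering_64bit)
{
        unsigned long entry = (unsigned long)pm_kernel + bp->hdr.handover_offset;

        if (entering_64bit)
                entry += 0x200; /* 64-bit entrypoint sits 0x200 past the 32-bit one */

        ((handover_fn_t)entry)(handle, sys_table, bp);
}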

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index b7599cbbd2ea1136..72780644a2272af8 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -282,7 +282,6 @@ SYM_FUNC_START(startup_32)
SYM_FUNC_END(startup_32)

#if IS_ENABLED(CONFIG_EFI_MIXED) && IS_ENABLED(CONFIG_EFI_HANDOVER_PROTOCOL)
- .org 0x190
SYM_FUNC_START(efi32_stub_entry)
add $0x4, %esp /* Discard return address */
popl %ecx
@@ -455,7 +454,9 @@ SYM_CODE_START(startup_64)
SYM_CODE_END(startup_64)

#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
- .org 0x390
+#ifdef CONFIG_EFI_MIXED
+ .org efi32_stub_entry + 0x200
+#endif
SYM_FUNC_START(efi64_stub_entry)
and $~0xf, %rsp /* realign the stack */
call efi_handover_entry
--
2.39.2

2023-05-08 07:16:05

by Ard Biesheuvel

Subject: [PATCH v2 08/20] x86: decompressor: Merge trampoline cleanup with switching code

Now that the trampoline setup code and the actual invocation of it are
all done from the C routine, we can merge the trampoline cleanup into
that as well, instead of returning to asm just to call another C
function.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 13 +++------
arch/x86/boot/compressed/pgtable_64.c | 28 ++++++++------------
2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index ccdfe7e55c36a40f..21af9cfd0dd0afb7 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -440,19 +440,14 @@ SYM_CODE_START(startup_64)
* set_paging_levels() updates the number of paging levels using a
* trampoline in 32-bit addressable memory if the current number does
* not match the desired number.
+ *
+ * RSI is the relocated address of the page table to use instead of
+ * page table in trampoline memory (if required).
*/
movq %r15, %rdi /* pass struct boot_params pointer */
+ leaq rva(top_pgtable)(%rbx), %rsi
call set_paging_levels

- /*
- * cleanup_trampoline() would restore trampoline memory.
- *
- * RDI is address of the page table to use instead of page table
- * in trampoline memory (if required).
- */
- leaq rva(top_pgtable)(%rbx), %rdi
- call cleanup_trampoline
-
/* Zero EFLAGS */
pushq $0
popfq
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index b92cf1d6e156d5f6..eeddad8c8335655e 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -101,9 +101,10 @@ static unsigned long find_trampoline_placement(void)
return bios_start - TRAMPOLINE_32BIT_SIZE;
}

-asmlinkage void set_paging_levels(void *rmode)
+asmlinkage void set_paging_levels(void *rmode, void *pgtable)
{
void (*toggle_la57)(void *trampoline, bool enable_5lvl);
+ void *trampoline_pgtable;
bool l5_required = false;

/* Initialize boot_params. Required for cmdline_find_option_bool(). */
@@ -133,7 +134,7 @@ asmlinkage void set_paging_levels(void *rmode)
* are already in the desired paging mode.
*/
if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
- return;
+ goto out;

trampoline_32bit = (unsigned long *)find_trampoline_placement();

@@ -163,6 +164,8 @@ asmlinkage void set_paging_levels(void *rmode)
* The new page table will be used by trampoline code for switching
* from 4- to 5-level paging or vice versa.
*/
+ trampoline_pgtable = trampoline_32bit +
+ TRAMPOLINE_32BIT_PGTABLE_OFFSET / sizeof(unsigned long);

if (l5_required) {
/*
@@ -182,31 +185,21 @@ asmlinkage void set_paging_levels(void *rmode)
* may be above 4G.
*/
src = *(unsigned long *)__native_read_cr3() & PAGE_MASK;
- memcpy(trampoline_32bit + TRAMPOLINE_32BIT_PGTABLE_OFFSET / sizeof(unsigned long),
- (void *)src, PAGE_SIZE);
+ memcpy(trampoline_pgtable, (void *)src, PAGE_SIZE);
}

toggle_la57(trampoline_32bit, l5_required);
-}
-
-void cleanup_trampoline(void *pgtable)
-{
- void *trampoline_pgtable;
-
- trampoline_pgtable = trampoline_32bit + TRAMPOLINE_32BIT_PGTABLE_OFFSET / sizeof(unsigned long);

/*
- * Move the top level page table out of trampoline memory,
- * if it's there.
+ * Move the top level page table out of trampoline memory.
*/
- if ((void *)__native_read_cr3() == trampoline_pgtable) {
- memcpy(pgtable, trampoline_pgtable, PAGE_SIZE);
- native_write_cr3((unsigned long)pgtable);
- }
+ memcpy(pgtable, trampoline_pgtable, PAGE_SIZE);
+ native_write_cr3((unsigned long)pgtable);

/* Restore trampoline memory */
memcpy(trampoline_32bit, trampoline_save, TRAMPOLINE_32BIT_SIZE);

+out:
/* Initialize variables for 5-level paging */
#ifdef CONFIG_X86_5LEVEL
if (__read_cr4() & X86_CR4_LA57) {
@@ -215,4 +208,5 @@ void cleanup_trampoline(void *pgtable)
ptrs_per_p4d = 512;
}
#endif
+ return;
}
--
2.39.2

2023-05-08 07:17:44

by Ard Biesheuvel

Subject: [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline

The 32-bit trampoline no longer uses the stack for anything except
performing a long return back to long mode. Currently, this stack is
allocated in the same page that carries the trampoline code, which means
this page must be mapped writable and executable, and the stack is
therefore executable as well.

So let's do a long jump instead: that way, we can pre-calculate the
return address and poke it into the code before we call it. In a later
patch, we will take advantage of this by removing writable permissions
(and adding executable ones) explicitly when booting via the EFI stub.

Not playing with the stack pointer also makes it more straightforward
to call the trampoline code as an ordinary 64-bit function from C code.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 34 ++++----------------
arch/x86/boot/compressed/pgtable.h | 6 ++--
arch/x86/boot/compressed/pgtable_64.c | 12 ++++++-
3 files changed, 21 insertions(+), 31 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index b1f8a867777120bb..3b5fc851737ffc39 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -460,9 +460,6 @@ SYM_CODE_START(startup_64)
leaq TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
call *%rax

- /* Restore the stack, the 32-bit trampoline uses its own stack */
- leaq rva(boot_stack_end)(%rbx), %rsp
-
/*
* cleanup_trampoline() would restore trampoline memory.
*
@@ -563,24 +560,17 @@ SYM_FUNC_END(.Lrelocated)
* EDI contains the base address of the trampoline memory.
* Non-zero ESI means trampoline needs to enable 5-level paging.
*/
+ .section ".rodata", "a", @progbits
SYM_CODE_START(trampoline_32bit_src)
- popq %r8
/* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
pushq $__KERNEL32_CS
leaq 0f(%rip), %rax
pushq %rax
lretq
+.Lret: retq

.code32
-0: /* Set up data and stack segments */
- movl $__KERNEL_DS, %eax
- movl %eax, %ds
- movl %eax, %ss
-
- /* Set up new stack */
- leal TRAMPOLINE_32BIT_STACK_END(%edi), %esp
-
- /* Disable paging */
+0: /* Disable paging */
movl %cr0, %eax
btrl $X86_CR0_PG_BIT, %eax
movl %eax, %cr0
@@ -634,26 +624,16 @@ SYM_CODE_START(trampoline_32bit_src)
1:
movl %eax, %cr4

- /* Calculate address of paging_enabled() once we are executing in the trampoline */
- leal .Lpaging_enabled - trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_OFFSET(%edi), %eax
-
- /* Prepare the stack for far return to Long Mode */
- pushl $__KERNEL_CS
- pushl %eax
-
/* Enable paging again. */
movl %cr0, %eax
btsl $X86_CR0_PG_BIT, %eax
movl %eax, %cr0

- lret
+.Ljmp: ljmpl $__KERNEL_CS, $(.Lret - trampoline_32bit_src)
SYM_CODE_END(trampoline_32bit_src)

- .code64
-SYM_FUNC_START_LOCAL_NOALIGN(.Lpaging_enabled)
- /* Return from the trampoline */
- jmp *%r8
-SYM_FUNC_END(.Lpaging_enabled)
+/* keep this right after trampoline_32bit_src() so we can infer its size */
+SYM_DATA(trampoline_ljmp_imm_offset, .word .Ljmp + 1 - trampoline_32bit_src)

/*
* The trampoline code has a size limit.
@@ -662,7 +642,7 @@ SYM_FUNC_END(.Lpaging_enabled)
*/
.org trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_SIZE

- .code32
+ .text
SYM_FUNC_START_LOCAL_NOALIGN(.Lno_longmode)
/* This isn't an x86-64 CPU, so hang intentionally, we cannot continue */
1:
diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
index 4e8cef135226bcbb..131488f50af55d0a 100644
--- a/arch/x86/boot/compressed/pgtable.h
+++ b/arch/x86/boot/compressed/pgtable.h
@@ -6,9 +6,7 @@
#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0

#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0xA0
-
-#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80

#ifndef __ASSEMBLER__

@@ -16,5 +14,7 @@ extern unsigned long *trampoline_32bit;

extern void trampoline_32bit_src(void *trampoline, bool enable_5lvl);

+extern const u16 trampoline_ljmp_imm_offset;
+
#endif /* __ASSEMBLER__ */
#endif /* BOOT_COMPRESSED_PAGETABLE_H */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 2ac12ff4111bf8c0..09fc18180929fab3 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -109,6 +109,7 @@ static unsigned long find_trampoline_placement(void)
struct paging_config paging_prepare(void *rmode)
{
struct paging_config paging_config = {};
+ void *tramp_code;

/* Initialize boot_params. Required for cmdline_find_option_bool(). */
boot_params = rmode;
@@ -143,9 +144,18 @@ struct paging_config paging_prepare(void *rmode)
memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);

/* Copy trampoline code in place */
- memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
+ tramp_code = memcpy(trampoline_32bit +
+ TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);

+ /*
+ * Avoid the need for a stack in the 32-bit trampoline code, by using
+ * LJMP rather than LRET to return back to long mode. LJMP takes an
+ * immediate absolute address, so we have to adjust that based on the
+ * placement of the trampoline.
+ */
+ *(u32 *)(tramp_code + trampoline_ljmp_imm_offset) += (unsigned long)tramp_code;
+
/*
* The code below prepares page table in trampoline memory.
*
--
2.39.2

2023-05-08 07:37:50

by Ard Biesheuvel

Subject: [PATCH v2 19/20] x86: efistub: Clear BSS in EFI handover protocol entrypoint

The so-called EFI handover protocol is a value-add from the distros that
permits a loader to simply copy a PE kernel image into memory and call
an alternative entrypoint that is described by an embedded boot_params
structure.

Most implementations of this protocol do not bother to check the PE
header for minimum alignment, section placement, etc., and therefore also
don't clear the image's BSS, or even allocate enough memory for it.

Allocating more memory on the fly is rather difficult, but let's at
least clear the BSS explicitly when entering in this manner, so that the
decompressor's pseudo-malloc() does not get confused by global variables
that were not zero-initialized correctly.

Note that we shouldn't clear BSS before calling efi_main() if we are not
booting via the handover protocol, but this entrypoint is no longer
shared with the pure EFI entrypoint so we can ignore that case here.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_32.S | 6 -----
arch/x86/boot/compressed/head_64.S | 2 +-
drivers/firmware/efi/libstub/x86-stub.c | 24 +++++++++++++++++---
3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 3f9b80726070a8e7..cd9587fcd5084f22 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -137,12 +137,6 @@ SYM_FUNC_START(startup_32)
jmp *%eax
SYM_FUNC_END(startup_32)

-#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
-SYM_FUNC_START(efi32_stub_entry)
- jmp efi_main
-SYM_FUNC_END(efi32_stub_entry)
-#endif
-
.text
SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 320e2825ff0b32da..b7599cbbd2ea1136 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -458,7 +458,7 @@ SYM_CODE_END(startup_64)
.org 0x390
SYM_FUNC_START(efi64_stub_entry)
and $~0xf, %rsp /* realign the stack */
- call efi_main
+ call efi_handover_entry
SYM_FUNC_END(efi64_stub_entry)
#endif

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 59076e16c1ac11ee..0528db3e36cf636b 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -891,9 +891,9 @@ static void __noreturn enter_kernel(unsigned long kernel_addr,
* On success, we jump to the relocated kernel directly and never return.
* On failure, we exit to the firmware via efi_exit instead of returning.
*/
-asmlinkage void __noreturn efi_main(efi_handle_t handle,
- efi_system_table_t *sys_table_arg,
- struct boot_params *boot_params)
+static void __noreturn efi_main(efi_handle_t handle,
+ efi_system_table_t *sys_table_arg,
+ struct boot_params *boot_params)
{
efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;
struct setup_header *hdr = &boot_params->hdr;
@@ -1002,3 +1002,21 @@ asmlinkage void __noreturn efi_main(efi_handle_t handle,

efi_exit(handle, status);
}
+
+#ifdef CONFIG_EFI_HANDOVER_PROTOCOL
+void efi_handover_entry(efi_handle_t handle, efi_system_table_t *sys_table_arg,
+ struct boot_params *boot_params)
+{
+ extern char _bss[], _ebss[];
+
+ /* Ensure that BSS is zeroed when booting via the handover protocol */
+ memset(_bss, 0, _ebss - _bss);
+ efi_main(handle, sys_table_arg, boot_params);
+}
+
+#ifdef CONFIG_X86_32
+extern __alias(efi_handover_entry)
+void efi32_stub_entry(efi_handle_t handle, efi_system_table_t *sys_table_arg,
+ struct boot_params *boot_params);
+#endif
+#endif
--
2.39.2

2023-05-15 14:07:47

by Kirill A. Shutemov

Subject: Re: [PATCH v2 04/20] x86: decompressor: Use standard calling convention for trampoline

On Mon, May 08, 2023 at 09:03:14AM +0200, Ard Biesheuvel wrote:
> Update the trampoline code so its arguments are passed via RDI and RSI,
> which matches the ordinary SysV calling convention for x86_64. This will
> allow us to call this code directly from C.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-15 14:09:49

by Kirill A. Shutemov

Subject: Re: [PATCH v2 03/20] x86: decompressor: Call trampoline as a normal function

On Mon, May 08, 2023 at 09:03:13AM +0200, Ard Biesheuvel wrote:
> Move the long return to switch to 32-bit mode into the trampoline code
> so we can call it as an ordinary function. This will allow us to call it
> directly from C code in a subsequent patch.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-15 14:15:52

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline

On Mon, May 08, 2023 at 09:03:15AM +0200, Ard Biesheuvel wrote:
> The 32-bit trampoline no longer uses the stack for anything except
> performing a long return back to long mode. Currently, this stack is
> allocated in the same page that carries the trampoline code, which means
> this page must be mapped writable and executable, and the stack is
> therefore executable as well.
>
> So let's do a long jump instead: that way, we can pre-calculate the
> return address and poke it into the code before we call it. In a later
> patch, we will take advantage of this by removing writable permissions
> (and adding executable ones) explicitly when booting via the EFI stub.
>
> Not playing with the stack pointer also makes it more straight-forward
> to call the trampoline code as an ordinary 64-bit function from C code.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-15 14:21:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub

On Mon, May 08, 2023 at 09:03:19AM +0200, Ard Biesheuvel wrote:
> In preparation for updating the EFI stub boot flow to avoid the bare
> metal decompressor code altogether, implement the support code for
> switching between 4 and 5 levels of paging before jumping to the kernel
> proper.
>
> This reuses the newly refactored trampoline that the bare metal
> decompressor uses, but relies on EFI APIs to allocate 32-bit addressable
> memory and remap it with the appropriate permissions. Given that the
> bare metal decompressor will no longer call into the trampoline if the
> number of paging levels is already set correctly, we no longer need to
> remove NX restrictions from the memory range where this trampoline may
> end up.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
> drivers/firmware/efi/libstub/x86-stub.c | 119 ++++++++++++++++----
> 2 files changed, 102 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
> index 1e0203d74691ffcc..fc5f3b4c45e91401 100644
> --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> @@ -16,6 +16,8 @@
>
> #include "efistub.h"
>
> +extern bool efi_no5lvl;
> +
> bool efi_nochunk;
> bool efi_nokaslr = !IS_ENABLED(CONFIG_RANDOMIZE_BASE);
> bool efi_novamap;
> @@ -73,6 +75,8 @@ efi_status_t efi_parse_options(char const *cmdline)
> efi_loglevel = CONSOLE_LOGLEVEL_QUIET;
> } else if (!strcmp(param, "noinitrd")) {
> efi_noinitrd = true;
> + } else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
> + efi_no5lvl = true;
> } else if (!strcmp(param, "efi") && val) {
> efi_nochunk = parse_option_str(val, "nochunk");
> efi_novamap |= parse_option_str(val, "novamap");
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index a0bfd31358ba97b1..fb83a72ad905ad6e 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -267,32 +267,11 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
> }
> }
>
> -/*
> - * Trampoline takes 2 pages and can be loaded in first megabyte of memory
> - * with its end placed between 128k and 640k where BIOS might start.
> - * (see arch/x86/boot/compressed/pgtable_64.c)
> - *
> - * We cannot find exact trampoline placement since memory map
> - * can be modified by UEFI, and it can alter the computed address.
> - */
> -
> -#define TRAMPOLINE_PLACEMENT_BASE ((128 - 8)*1024)
> -#define TRAMPOLINE_PLACEMENT_SIZE (640*1024 - (128 - 8)*1024)
> -
> void startup_32(struct boot_params *boot_params);
>
> static void
> setup_memory_protection(unsigned long image_base, unsigned long image_size)
> {
> - /*
> - * Allow execution of possible trampoline used
> - * for switching between 4- and 5-level page tables
> - * and relocated kernel image.
> - */
> -
> - adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
> - TRAMPOLINE_PLACEMENT_SIZE);
> -
> #ifdef CONFIG_64BIT
> if (image_base != (unsigned long)startup_32)
> adjust_memory_range_protection(image_base, image_size);
> @@ -760,6 +739,96 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
> return EFI_SUCCESS;
> }
>
> +bool efi_no5lvl;
> +
> +static void (*la57_toggle)(void *trampoline, bool enable_5lvl);
> +
> +extern void trampoline_32bit_src(void *, bool);
> +extern const u16 trampoline_ljmp_imm_offset;
> +
> +/*
> + * Enabling (or disabling) 5 level paging is tricky, because it can only be
> + * done from 32-bit mode with paging disabled. This means not only that the
> + * code itself must be running from 32-bit addressable physical memory, but
> + * also that the root page table must be 32-bit addressable, as we cannot
> + * program a 64-bit value into CR3 when running in 32-bit mode.
> + */
> +static efi_status_t efi_setup_5level_paging(void)
> +{
> + u8 tmpl_size = (u8 *)&trampoline_ljmp_imm_offset - (u8 *)&trampoline_32bit_src;
> + efi_status_t status;
> + u8 *la57_code;
> +
> + if (!efi_is_64bit())
> + return EFI_SUCCESS;
> +
> + /* check for 5 level paging support */
> + if (native_cpuid_eax(0) < 7 ||
> + !(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
> + return EFI_SUCCESS;
> +
> + /* allocate some 32-bit addressable memory for code and a page table */
> + status = efi_allocate_pages(2 * PAGE_SIZE, (unsigned long *)&la57_code,
> + U32_MAX);
> + if (status != EFI_SUCCESS)
> + return status;
> +
> + la57_toggle = memcpy(la57_code, trampoline_32bit_src, tmpl_size);
> + memset(la57_code + tmpl_size, 0x90, PAGE_SIZE - tmpl_size);
> +
> + /*
> + * To avoid having to allocate a 32-bit addressable stack, we use a
> + * ljmp to switch back to long mode. However, this takes an absolute
> + * address, so we have to poke it in at runtime.
> + */
> + *(u32 *)&la57_code[trampoline_ljmp_imm_offset] += (unsigned long)la57_code;
> +
> + adjust_memory_range_protection((unsigned long)la57_toggle, PAGE_SIZE);
> +
> + return EFI_SUCCESS;
> +}
> +
> +static void efi_5level_switch(void)
> +{
> +#ifdef CONFIG_X86_64
> + static const struct desc_struct gdt[] = {
> + [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
> + [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
> + };
> +
> + bool want_la57 = IS_ENABLED(CONFIG_X86_5LEVEL) && !efi_no5lvl;
> + bool have_la57 = native_read_cr4() & X86_CR4_LA57;
> + bool need_toggle = want_la57 ^ have_la57;
> + u64 *pgt = (void *)la57_toggle + PAGE_SIZE;
> + u64 *cr3 = (u64 *)__native_read_cr3();
> + u64 *new_cr3;
> +
> + if (!la57_toggle || !need_toggle)
> + return;
> +
> + if (!have_la57) {
> + /*
> + * We are going to enable 5 level paging, so we need to
> + * allocate a root level page from the 32-bit addressable
> + * physical region, and plug the existing hierarchy into it.
> + */
> + new_cr3 = memset(pgt, 0, PAGE_SIZE);
> + new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
> + } else {
> + // take the new root table pointer from the current entry #0
> + new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
> +
> + // copy the new root level table if it is not 32-bit addressable
> + if ((u64)new_cr3 > U32_MAX)
> + new_cr3 = memcpy(pgt, new_cr3, PAGE_SIZE);
> + }
> +
> + native_load_gdt(&(struct desc_ptr){ sizeof(gdt) - 1, (u64)gdt });
> +
> + la57_toggle(new_cr3, want_la57);
> +#endif
> +}
> +

Nit: I would prefer to have the #ifdef around whole function with dummy
function for !CONFIG_X86_64 case, if IS_ENABLED(CONFIG_X86_64) is not an
option.

Otherwise:

Acked-by: Kirill A. Shutemov <[email protected]>
--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-15 14:32:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] x86: decompressor: Only call the trampoline when changing paging levels

On Mon, May 08, 2023 at 09:03:17AM +0200, Ard Biesheuvel wrote:
> Since we know the current and desired number of paging levels when
> preparing the trampoline, let's not call the trampoline at all when we
> know that calling it is not going to result in a change to the number of
> paging levels. Given that we are already running in long mode, the PAE
> and LA57 settings are necessarily consistent with the currently active
> page tables - the only difference is that we will always preserve
> CR4.MCE in this case, but this will be cleared by the real kernel
> startup code if CONFIG_X86_MCE is not enabled.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-16 18:07:16

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v2 09/20] x86: efistub: Perform 4/5 level paging switch from the stub

On Mon, 15 May 2023 at 16:14, Kirill A. Shutemov <[email protected]> wrote:
>
> On Mon, May 08, 2023 at 09:03:19AM +0200, Ard Biesheuvel wrote:
> > In preparation for updating the EFI stub boot flow to avoid the bare
> > metal decompressor code altogether, implement the support code for
> > switching between 4 and 5 levels of paging before jumping to the kernel
> > proper.
> >
> > This reuses the newly refactored trampoline that the bare metal
> > decompressor uses, but relies on EFI APIs to allocate 32-bit addressable
> > memory and remap it with the appropriate permissions. Given that the
> > bare metal decompressor will no longer call into the trampoline if the
> > number of paging levels is already set correctly, we no longer need to
> > remove NX restrictions from the memory range where this trampoline may
> > end up.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
> > drivers/firmware/efi/libstub/x86-stub.c | 119 ++++++++++++++++----
> > 2 files changed, 102 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
> > index 1e0203d74691ffcc..fc5f3b4c45e91401 100644
> > --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> > +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> > @@ -16,6 +16,8 @@
> >
> > #include "efistub.h"
> >
> > +extern bool efi_no5lvl;
> > +
> > bool efi_nochunk;
> > bool efi_nokaslr = !IS_ENABLED(CONFIG_RANDOMIZE_BASE);
> > bool efi_novamap;
> > @@ -73,6 +75,8 @@ efi_status_t efi_parse_options(char const *cmdline)
> > efi_loglevel = CONSOLE_LOGLEVEL_QUIET;
> > } else if (!strcmp(param, "noinitrd")) {
> > efi_noinitrd = true;
> > + } else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
> > + efi_no5lvl = true;
> > } else if (!strcmp(param, "efi") && val) {
> > efi_nochunk = parse_option_str(val, "nochunk");
> > efi_novamap |= parse_option_str(val, "novamap");
> > diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> > index a0bfd31358ba97b1..fb83a72ad905ad6e 100644
> > --- a/drivers/firmware/efi/libstub/x86-stub.c
> > +++ b/drivers/firmware/efi/libstub/x86-stub.c
> > @@ -267,32 +267,11 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
> > }
> > }
> >
> > -/*
> > - * Trampoline takes 2 pages and can be loaded in first megabyte of memory
> > - * with its end placed between 128k and 640k where BIOS might start.
> > - * (see arch/x86/boot/compressed/pgtable_64.c)
> > - *
> > - * We cannot find exact trampoline placement since memory map
> > - * can be modified by UEFI, and it can alter the computed address.
> > - */
> > -
> > -#define TRAMPOLINE_PLACEMENT_BASE ((128 - 8)*1024)
> > -#define TRAMPOLINE_PLACEMENT_SIZE (640*1024 - (128 - 8)*1024)
> > -
> > void startup_32(struct boot_params *boot_params);
> >
> > static void
> > setup_memory_protection(unsigned long image_base, unsigned long image_size)
> > {
> > - /*
> > - * Allow execution of possible trampoline used
> > - * for switching between 4- and 5-level page tables
> > - * and relocated kernel image.
> > - */
> > -
> > - adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
> > - TRAMPOLINE_PLACEMENT_SIZE);
> > -
> > #ifdef CONFIG_64BIT
> > if (image_base != (unsigned long)startup_32)
> > adjust_memory_range_protection(image_base, image_size);
> > @@ -760,6 +739,96 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
> > return EFI_SUCCESS;
> > }
> >
> > +bool efi_no5lvl;
> > +
> > +static void (*la57_toggle)(void *trampoline, bool enable_5lvl);
> > +
> > +extern void trampoline_32bit_src(void *, bool);
> > +extern const u16 trampoline_ljmp_imm_offset;
> > +
> > +/*
> > + * Enabling (or disabling) 5 level paging is tricky, because it can only be
> > + * done from 32-bit mode with paging disabled. This means not only that the
> > + * code itself must be running from 32-bit addressable physical memory, but
> > + * also that the root page table must be 32-bit addressable, as we cannot
> > + * program a 64-bit value into CR3 when running in 32-bit mode.
> > + */
> > +static efi_status_t efi_setup_5level_paging(void)
> > +{
> > + u8 tmpl_size = (u8 *)&trampoline_ljmp_imm_offset - (u8 *)&trampoline_32bit_src;
> > + efi_status_t status;
> > + u8 *la57_code;
> > +
> > + if (!efi_is_64bit())
> > + return EFI_SUCCESS;
> > +
> > + /* check for 5 level paging support */
> > + if (native_cpuid_eax(0) < 7 ||
> > + !(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
> > + return EFI_SUCCESS;
> > +
> > + /* allocate some 32-bit addressable memory for code and a page table */
> > + status = efi_allocate_pages(2 * PAGE_SIZE, (unsigned long *)&la57_code,
> > + U32_MAX);
> > + if (status != EFI_SUCCESS)
> > + return status;
> > +
> > + la57_toggle = memcpy(la57_code, trampoline_32bit_src, tmpl_size);
> > + memset(la57_code + tmpl_size, 0x90, PAGE_SIZE - tmpl_size);
> > +
> > + /*
> > + * To avoid having to allocate a 32-bit addressable stack, we use a
> > + * ljmp to switch back to long mode. However, this takes an absolute
> > + * address, so we have to poke it in at runtime.
> > + */
> > + *(u32 *)&la57_code[trampoline_ljmp_imm_offset] += (unsigned long)la57_code;
> > +
> > + adjust_memory_range_protection((unsigned long)la57_toggle, PAGE_SIZE);
> > +
> > + return EFI_SUCCESS;
> > +}
> > +
> > +static void efi_5level_switch(void)
> > +{
> > +#ifdef CONFIG_X86_64
> > + static const struct desc_struct gdt[] = {
> > + [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
> > + [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
> > + };
> > +
> > + bool want_la57 = IS_ENABLED(CONFIG_X86_5LEVEL) && !efi_no5lvl;
> > + bool have_la57 = native_read_cr4() & X86_CR4_LA57;
> > + bool need_toggle = want_la57 ^ have_la57;
> > + u64 *pgt = (void *)la57_toggle + PAGE_SIZE;
> > + u64 *cr3 = (u64 *)__native_read_cr3();
> > + u64 *new_cr3;
> > +
> > + if (!la57_toggle || !need_toggle)
> > + return;
> > +
> > + if (!have_la57) {
> > + /*
> > + * We are going to enable 5 level paging, so we need to
> > + * allocate a root level page from the 32-bit addressable
> > + * physical region, and plug the existing hierarchy into it.
> > + */
> > + new_cr3 = memset(pgt, 0, PAGE_SIZE);
> > + new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
> > + } else {
> > + // take the new root table pointer from the current entry #0
> > + new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
> > +
> > + // copy the new root level table if it is not 32-bit addressable
> > + if ((u64)new_cr3 > U32_MAX)
> > + new_cr3 = memcpy(pgt, new_cr3, PAGE_SIZE);
> > + }
> > +
> > + native_load_gdt(&(struct desc_ptr){ sizeof(gdt) - 1, (u64)gdt });
> > +
> > + la57_toggle(new_cr3, want_la57);
> > +#endif
> > +}
> > +
>
> Nit: I would prefer to have the #ifdef around whole function with dummy
> function for !CONFIG_X86_64 case, if IS_ENABLED(CONFIG_X86_64) is not an
> option.
>

The latter is not an option because 32-bit builds have no definition of
GDT_ENTRY_KERNEL32_CS.

Personally, I prefer having only a single function declaration.
Alternatively, we could move this code to a separate file and only
include it in the build on 64-bit. That would allow the use of
IS_ENABLED() at the call site, as the calls would just disappear from
the object file on 32-bit builds.
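
Roughly along these lines (file and function names here are hypothetical,
just to illustrate the pattern, and the sketch assumes the usual stub
headers):

#include <linux/kconfig.h>

/*
 * Sketch only: if efi_5level_switch() lived in an object that is built
 * exclusively for 64-bit, the call site could drop the #ifdef. On 32-bit,
 * IS_ENABLED(CONFIG_X86_64) is constant-false, the compiler eliminates the
 * call, and no undefined reference to efi_5level_switch() remains in the
 * object file.
 */
void efi_5level_switch(void);		/* defined only in the 64-bit object */

static void example_call_site(void)	/* hypothetical caller in x86-stub.c */
{
	if (IS_ENABLED(CONFIG_X86_64))
		efi_5level_switch();
}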

> Otherwise:
>
> Acked-by: Kirill A. Shutemov <[email protected]>

Thanks.

2023-05-16 18:10:08

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v2 08/20] x86: decompressor: Merge trampoline cleanup with switching code

On Mon, 15 May 2023 at 16:10, Kirill A. Shutemov <[email protected]> wrote:
>
> On Mon, May 08, 2023 at 09:03:18AM +0200, Ard Biesheuvel wrote:
> > @@ -215,4 +208,5 @@ void cleanup_trampoline(void *pgtable)
> > ptrs_per_p4d = 512;
> > }
> > #endif
> > + return;
> > }
>
> Return is redundant here.

The return (or at least a bare semicolon) is needed here because of the
goto label right before the #ifdef block: when that block is compiled
out, the function would end with a label, and the compiler does not
accept that.
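
A tiny standalone illustration of that constraint (the names are made up):

/* A label must be followed by a statement, so it cannot sit directly
 * before the closing brace of a function.
 */
static int work_pending;

void cleanup_example(void)
{
	if (!work_pending)
		goto out;

	work_pending = 0;
out:
	return;		/* without this (or at least a bare ";"), the label
			 * would be the last thing in the function body and
			 * the compiler rejects it */
}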

> Otherwise:
>
> Acked-by: Kirill A. Shutemov <[email protected]>
>

Thanks.

2023-05-17 19:11:38

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v2 15/20] x86: head_64: Switch to kernel CS before enabling memory encryption

On 5/8/23 02:03, Ard Biesheuvel wrote:
> The SME initialization triggers #VC exceptions due to the use of CPUID
> instructions, and returning from an exception restores the code segment
> that was active when the exception was taken.
>
> This means we should ensure that we switch the code segment to one that
> is described in the GDT we just loaded before running the SME init code.
>
> Reported-by: Tom Lendacky <[email protected]>
> Signed-off-by: Ard Biesheuvel <[email protected]>

Ah, just saw this as I was going through my email backlog... I submitted
a separate patch just a little earlier today for this issue. I guess we'll
let the maintainers decide how they want to handle it.

Thanks,
Tom

> ---
> arch/x86/kernel/head_64.S | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 95b12fdae10e1dc9..a128ac62956ff7c4 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -76,6 +76,15 @@ SYM_CODE_START_NOALIGN(startup_64)
>
> call startup_64_setup_env
>
> + /* Now switch to __KERNEL_CS so IRET works reliably */
> + pushq $__KERNEL_CS
> + leaq .Lon_kernel_cs(%rip), %rax
> + pushq %rax
> + lretq
> +
> +.Lon_kernel_cs:
> + UNWIND_HINT_END_OF_STACK
> +
> #ifdef CONFIG_AMD_MEM_ENCRYPT
> /*
> * Activate SEV/SME memory encryption if supported/enabled. This needs to
> @@ -87,15 +96,6 @@ SYM_CODE_START_NOALIGN(startup_64)
> call sme_enable
> #endif
>
> - /* Now switch to __KERNEL_CS so IRET works reliably */
> - pushq $__KERNEL_CS
> - leaq .Lon_kernel_cs(%rip), %rax
> - pushq %rax
> - lretq
> -
> -.Lon_kernel_cs:
> - UNWIND_HINT_END_OF_STACK
> -
> /* Sanitize CPU configuration */
> call verify_cpu
>

2023-05-17 22:57:43

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline

On 5/8/23 02:03, Ard Biesheuvel wrote:
> The 32-bit trampoline no longer uses the stack for anything except
> performing a long return back to long mode. Currently, this stack is
> allocated in the same page that carries the trampoline code, which means
> this page must be mapped writable and executable, and the stack is
> therefore executable as well.
>
> So let's do a long jump instead: that way, we can pre-calculate the
> return address and poke it into the code before we call it. In a later
> patch, we will take advantage of this by removing writable permissions
> (and adding executable ones) explicitly when booting via the EFI stub.
>
> Not playing with the stack pointer also makes it more straight-forward
> to call the trampoline code as an ordinary 64-bit function from C code.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/x86/boot/compressed/head_64.S | 34 ++++----------------
> arch/x86/boot/compressed/pgtable.h | 6 ++--
> arch/x86/boot/compressed/pgtable_64.c | 12 ++++++-
> 3 files changed, 21 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index b1f8a867777120bb..3b5fc851737ffc39 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -460,9 +460,6 @@ SYM_CODE_START(startup_64)
> leaq TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
> call *%rax
>
> - /* Restore the stack, the 32-bit trampoline uses its own stack */
> - leaq rva(boot_stack_end)(%rbx), %rsp
> -
> /*
> * cleanup_trampoline() would restore trampoline memory.
> *
> @@ -563,24 +560,17 @@ SYM_FUNC_END(.Lrelocated)
> * EDI contains the base address of the trampoline memory.
> * Non-zero ESI means trampoline needs to enable 5-level paging.
> */
> + .section ".rodata", "a", @progbits
> SYM_CODE_START(trampoline_32bit_src)
> - popq %r8
> /* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
> pushq $__KERNEL32_CS
> leaq 0f(%rip), %rax
> pushq %rax
> lretq
> +.Lret: retq

Maybe just add a comment above this to explain that this is a target of
the long jump below to get back into long mode and be able to return
without setting up a new stack for the 32-bit code.

And then a corresponding comment on the long jump itself. I think it would
make it easier to understand what is going on in this part of the code.

Thanks,
Tom

>
> .code32
> -0: /* Set up data and stack segments */
> - movl $__KERNEL_DS, %eax
> - movl %eax, %ds
> - movl %eax, %ss
> -
> - /* Set up new stack */
> - leal TRAMPOLINE_32BIT_STACK_END(%edi), %esp
> -
> - /* Disable paging */
> +0: /* Disable paging */
> movl %cr0, %eax
> btrl $X86_CR0_PG_BIT, %eax
> movl %eax, %cr0
> @@ -634,26 +624,16 @@ SYM_CODE_START(trampoline_32bit_src)
> 1:
> movl %eax, %cr4
>
> - /* Calculate address of paging_enabled() once we are executing in the trampoline */
> - leal .Lpaging_enabled - trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_OFFSET(%edi), %eax
> -
> - /* Prepare the stack for far return to Long Mode */
> - pushl $__KERNEL_CS
> - pushl %eax
> -
> /* Enable paging again. */
> movl %cr0, %eax
> btsl $X86_CR0_PG_BIT, %eax
> movl %eax, %cr0
>
> - lret
> +.Ljmp: ljmpl $__KERNEL_CS, $(.Lret - trampoline_32bit_src)
> SYM_CODE_END(trampoline_32bit_src)
>
> - .code64
> -SYM_FUNC_START_LOCAL_NOALIGN(.Lpaging_enabled)
> - /* Return from the trampoline */
> - jmp *%r8
> -SYM_FUNC_END(.Lpaging_enabled)
> +/* keep this right after trampoline_32bit_src() so we can infer its size */
> +SYM_DATA(trampoline_ljmp_imm_offset, .word .Ljmp + 1 - trampoline_32bit_src)
>
> /*
> * The trampoline code has a size limit.
> @@ -662,7 +642,7 @@ SYM_FUNC_END(.Lpaging_enabled)
> */
> .org trampoline_32bit_src + TRAMPOLINE_32BIT_CODE_SIZE
>
> - .code32
> + .text
> SYM_FUNC_START_LOCAL_NOALIGN(.Lno_longmode)
> /* This isn't an x86-64 CPU, so hang intentionally, we cannot continue */
> 1:
> diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
> index 4e8cef135226bcbb..131488f50af55d0a 100644
> --- a/arch/x86/boot/compressed/pgtable.h
> +++ b/arch/x86/boot/compressed/pgtable.h
> @@ -6,9 +6,7 @@
> #define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0
>
> #define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
> -#define TRAMPOLINE_32BIT_CODE_SIZE 0xA0
> -
> -#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
> +#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
>
> #ifndef __ASSEMBLER__
>
> @@ -16,5 +14,7 @@ extern unsigned long *trampoline_32bit;
>
> extern void trampoline_32bit_src(void *trampoline, bool enable_5lvl);
>
> +extern const u16 trampoline_ljmp_imm_offset;
> +
> #endif /* __ASSEMBLER__ */
> #endif /* BOOT_COMPRESSED_PAGETABLE_H */
> diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
> index 2ac12ff4111bf8c0..09fc18180929fab3 100644
> --- a/arch/x86/boot/compressed/pgtable_64.c
> +++ b/arch/x86/boot/compressed/pgtable_64.c
> @@ -109,6 +109,7 @@ static unsigned long find_trampoline_placement(void)
> struct paging_config paging_prepare(void *rmode)
> {
> struct paging_config paging_config = {};
> + void *tramp_code;
>
> /* Initialize boot_params. Required for cmdline_find_option_bool(). */
> boot_params = rmode;
> @@ -143,9 +144,18 @@ struct paging_config paging_prepare(void *rmode)
> memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
>
> /* Copy trampoline code in place */
> - memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
> + tramp_code = memcpy(trampoline_32bit +
> + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
> &trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
>
> + /*
> + * Avoid the need for a stack in the 32-bit trampoline code, by using
> + * LJMP rather than LRET to return back to long mode. LJMP takes an
> + * immediate absolute address, so we have to adjust that based on the
> + * placement of the trampoline.
> + */
> + *(u32 *)(tramp_code + trampoline_ljmp_imm_offset) += (unsigned long)tramp_code;
> +
> /*
> * The code below prepares page table in trampoline memory.
> *

2023-05-18 15:06:23

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v2 05/20] x86: decompressor: Avoid the need for a stack in the 32-bit trampoline

On Thu, 18 May 2023 at 00:40, Tom Lendacky <[email protected]> wrote:
>
> On 5/8/23 02:03, Ard Biesheuvel wrote:
> > The 32-bit trampoline no longer uses the stack for anything except
> > performing a long return back to long mode. Currently, this stack is
> > allocated in the same page that carries the trampoline code, which means
> > this page must be mapped writable and executable, and the stack is
> > therefore executable as well.
> >
> > So let's do a long jump instead: that way, we can pre-calculate the
> > return address and poke it into the code before we call it. In a later
> > patch, we will take advantage of this by removing writable permissions
> > (and adding executable ones) explicitly when booting via the EFI stub.
> >
> > Not playing with the stack pointer also makes it more straight-forward
> > to call the trampoline code as an ordinary 64-bit function from C code.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/x86/boot/compressed/head_64.S | 34 ++++----------------
> > arch/x86/boot/compressed/pgtable.h | 6 ++--
> > arch/x86/boot/compressed/pgtable_64.c | 12 ++++++-
> > 3 files changed, 21 insertions(+), 31 deletions(-)
> >
> > diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> > index b1f8a867777120bb..3b5fc851737ffc39 100644
> > --- a/arch/x86/boot/compressed/head_64.S
> > +++ b/arch/x86/boot/compressed/head_64.S
> > @@ -460,9 +460,6 @@ SYM_CODE_START(startup_64)
> > leaq TRAMPOLINE_32BIT_CODE_OFFSET(%rax), %rax
> > call *%rax
> >
> > - /* Restore the stack, the 32-bit trampoline uses its own stack */
> > - leaq rva(boot_stack_end)(%rbx), %rsp
> > -
> > /*
> > * cleanup_trampoline() would restore trampoline memory.
> > *
> > @@ -563,24 +560,17 @@ SYM_FUNC_END(.Lrelocated)
> > * EDI contains the base address of the trampoline memory.
> > * Non-zero ESI means trampoline needs to enable 5-level paging.
> > */
> > + .section ".rodata", "a", @progbits
> > SYM_CODE_START(trampoline_32bit_src)
> > - popq %r8
> > /* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
> > pushq $__KERNEL32_CS
> > leaq 0f(%rip), %rax
> > pushq %rax
> > lretq
> > +.Lret: retq
>
> Maybe just add a comment above this to explain that this is a target of
> the long jump below to get back into long mode and be able to return
> without setting up a new stack for the 32-bit code.
>
> And then a corresponding comment on the long jump itself. I think it would
> make it easier to understand what is going on in this part of the code.
>

Fair point. I'll add that in the next version.

2023-05-18 20:36:59

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware

On 5/8/23 02:03, Ard Biesheuvel wrote:
> The decompressor executes in an environment with little or no access to
> a console, and without any ability to return an error back to the caller
> (the bootloader). So the only recourse we have when the SEV/SNP context
> is not quite what the kernel expects is to terminate the guest entirely.
>
> This is a bit harsh, and also unnecessary when booting via the EFI stub,
> given that it provides all the support that SEV guests need to probe the
> underlying platform.
>
> So let's do the SEV initialization and SNP feature check before calling
> ExitBootServices(), and simply return with an error if the SNP feature
> mask is not as expected.

My SEV-ES / SEV-SNP guests started crashing when I applied this patch.
Turns out that sev_es_negotiate_protocol() used to be called when no #VC
exceptions were being generated before a valid GHCB was setup. Because
of that the current GHCB MSR value was not saved/restored. But now,
sev_es_negotiate_protocol() is called earlier in the boot process and
there are still messages being issued by UEFI, e.g.:

SetUefiImageMemoryAttributes - 0x000000007F6D7000 - 0x0000000000006000 (0x0000000000000008)

Similarly, get_hv_features() didn't worry about saving the GHCB MSR value
and an SNP guest crashed after fixing sev_es_negotiate_protocol().

The following changes got me past everything:

diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index 3a5b0c9c4fcc..23450628d41c 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -106,15 +106,19 @@ static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
*/
static u64 get_hv_features(void)
{
- u64 val;
+ u64 val, save;

if (ghcb_version < 2)
return 0;

+ save = sev_es_rd_ghcb_msr();
+
sev_es_wr_ghcb_msr(GHCB_MSR_HV_FT_REQ);
VMGEXIT();
-
val = sev_es_rd_ghcb_msr();
+
+ sev_es_wr_ghcb_msr(save);
+
if (GHCB_RESP_CODE(val) != GHCB_MSR_HV_FT_RESP)
return 0;

@@ -139,13 +143,17 @@ static void snp_register_ghcb_early(unsigned long paddr)

static bool sev_es_negotiate_protocol(void)
{
- u64 val;
+ u64 val, save;
+
+ save = sev_es_rd_ghcb_msr();

/* Do the GHCB protocol version negotiation */
sev_es_wr_ghcb_msr(GHCB_MSR_SEV_INFO_REQ);
VMGEXIT();
val = sev_es_rd_ghcb_msr();

+ sev_es_wr_ghcb_msr(save);
+
if (GHCB_MSR_INFO(val) != GHCB_MSR_SEV_INFO_RESP)
return false;


Thanks,
Tom

>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/x86/boot/compressed/sev.c | 12 ++++++++----
> arch/x86/include/asm/sev.h | 4 ++++
> drivers/firmware/efi/libstub/x86-stub.c | 17 +++++++++++++++++
> 3 files changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> index 014b89c890887b9a..19c40873fdd209b5 100644
> --- a/arch/x86/boot/compressed/sev.c
> +++ b/arch/x86/boot/compressed/sev.c
> @@ -315,20 +315,24 @@ static void enforce_vmpl0(void)
> */
> #define SNP_FEATURES_PRESENT (0)
>
> +u64 snp_get_unsupported_features(void)
> +{
> + if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
> + return 0;
> + return sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
> +}
> +
> void snp_check_features(void)
> {
> u64 unsupported;
>
> - if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
> - return;
> -
> /*
> * Terminate the boot if hypervisor has enabled any feature lacking
> * guest side implementation. Pass on the unsupported features mask through
> * EXIT_INFO_2 of the GHCB protocol so that those features can be reported
> * as part of the guest boot failure.
> */
> - unsupported = sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
> + unsupported = snp_get_unsupported_features();
> if (unsupported) {
> if (ghcb_version < 2 || (!boot_ghcb && !early_setup_ghcb()))
> sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> index 13dc2a9d23c1eb25..bf27b91644d0da5a 100644
> --- a/arch/x86/include/asm/sev.h
> +++ b/arch/x86/include/asm/sev.h
> @@ -157,6 +157,7 @@ static __always_inline void sev_es_nmi_complete(void)
> __sev_es_nmi_complete();
> }
> extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> +extern void sev_enable(struct boot_params *bp);
>
> static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs)
> {
> @@ -202,12 +203,14 @@ void snp_set_wakeup_secondary_cpu(void);
> bool snp_init(struct boot_params *bp);
> void __init __noreturn snp_abort(void);
> int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
> +u64 snp_get_unsupported_features(void);
> #else
> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> static inline void sev_es_ist_exit(void) { }
> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
> static inline void sev_es_nmi_complete(void) { }
> static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> +static inline void sev_enable(struct boot_params *bp) { }
> static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate) { return 0; }
> static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
> static inline void setup_ghcb(void) { }
> @@ -225,6 +228,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
> {
> return -ENOTTY;
> }
> +static inline u64 snp_get_unsupported_features(void) { return 0; }
> #endif
>
> #endif
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index ce8434fce0c37982..33d11ba78f1d8c4f 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -15,6 +15,7 @@
> #include <asm/setup.h>
> #include <asm/desc.h>
> #include <asm/boot.h>
> +#include <asm/sev.h>
>
> #include "efistub.h"
>
> @@ -714,6 +715,22 @@ static efi_status_t exit_boot_func(struct efi_boot_memmap *map,
> &p->efi->efi_memmap, &p->efi->efi_memmap_hi);
> p->efi->efi_memmap_size = map->map_size;
>
> + /*
> + * Call the SEV init code while still running with the firmware's
> + * GDT/IDT, so #VC exceptions will be handled by EFI.
> + */
> + if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
> + u64 unsupported;
> +
> + sev_enable(p->boot_params);
> + unsupported = snp_get_unsupported_features();
> + if (unsupported) {
> + efi_err("Unsupported SEV-SNP features detected: 0x%llx\n",
> + unsupported);
> + return EFI_UNSUPPORTED;
> + }
> + }
> +
> return EFI_SUCCESS;
> }
>

2023-05-18 22:31:25

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware

On Thu, 18 May 2023 at 22:16, Tom Lendacky <[email protected]> wrote:
>
> On 5/8/23 02:03, Ard Biesheuvel wrote:
> > The decompressor executes in an environment with little or no access to
> > a console, and without any ability to return an error back to the caller
> > (the bootloader). So the only recourse we have when the SEV/SNP context
> > is not quite what the kernel expects is to terminate the guest entirely.
> >
> > This is a bit harsh, and also unnecessary when booting via the EFI stub,
> > given that it provides all the support that SEV guests need to probe the
> > underlying platform.
> >
> > So let's do the SEV initialization and SNP feature check before calling
> > ExitBootServices(), and simply return with an error if the SNP feature
> > mask is not as expected.
>
> My SEV-ES / SEV-SNP guests started crashing when I applied this patch.
> Turns out that sev_es_negotiate_protocol() used to be called when no #VC
> exceptions were being generated before a valid GHCB was setup. Because
> of that the current GHCB MSR value was not saved/restored. But now,
> sev_es_negotiate_protocol() is called earlier in the boot process and
> there are still messages being issued by UEFI, e.g.:
>
> SetUefiImageMemoryAttributes - 0x000000007F6D7000 - 0x0000000000006000 (0x0000000000000008)
>
> Similarly, get_hv_features() didn't worry about saving the GHCB MSR value
> and an SNP guest crashed after fixing sev_es_negotiate_protocol().
>

Thanks for the clarification

So the underlying assumption here is that performing these checks
before ExitBootServices() is better because we can still return to the
bootloader, which -like GRUB does- could simply attempt booting the
next kernel in the list.

I was obviously unaware of the complication you are hitting here. So I
wonder what your take is on this: should we defer this check until
after ExitBootServices(), and crash and burn like before if the test
fails? Or is the change below reasonable in your opinion, and we can
incorporate it? Or is there a third option, i.e., is there a SEV
specific EFI protocol that the stub should be using to establish
whether the underlying platform meets the kernel's expectations?


> The following changes got me past everything:
>
> diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
> index 3a5b0c9c4fcc..23450628d41c 100644
> --- a/arch/x86/kernel/sev-shared.c
> +++ b/arch/x86/kernel/sev-shared.c
> @@ -106,15 +106,19 @@ static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
> */
> static u64 get_hv_features(void)
> {
> - u64 val;
> + u64 val, save;
>
> if (ghcb_version < 2)
> return 0;
>
> + save = sev_es_rd_ghcb_msr();
> +
> sev_es_wr_ghcb_msr(GHCB_MSR_HV_FT_REQ);
> VMGEXIT();
> -
> val = sev_es_rd_ghcb_msr();
> +
> + sev_es_wr_ghcb_msr(save);
> +
> if (GHCB_RESP_CODE(val) != GHCB_MSR_HV_FT_RESP)
> return 0;
>
> @@ -139,13 +143,17 @@ static void snp_register_ghcb_early(unsigned long paddr)
>
> static bool sev_es_negotiate_protocol(void)
> {
> - u64 val;
> + u64 val, save;
> +
> + save = sev_es_rd_ghcb_msr();
>
> /* Do the GHCB protocol version negotiation */
> sev_es_wr_ghcb_msr(GHCB_MSR_SEV_INFO_REQ);
> VMGEXIT();
> val = sev_es_rd_ghcb_msr();
>
> + sev_es_wr_ghcb_msr(save);
> +
> if (GHCB_MSR_INFO(val) != GHCB_MSR_SEV_INFO_RESP)
> return false;
>
>
> Thanks,
> Tom
>
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/x86/boot/compressed/sev.c | 12 ++++++++----
> > arch/x86/include/asm/sev.h | 4 ++++
> > drivers/firmware/efi/libstub/x86-stub.c | 17 +++++++++++++++++
> > 3 files changed, 29 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
> > index 014b89c890887b9a..19c40873fdd209b5 100644
> > --- a/arch/x86/boot/compressed/sev.c
> > +++ b/arch/x86/boot/compressed/sev.c
> > @@ -315,20 +315,24 @@ static void enforce_vmpl0(void)
> > */
> > #define SNP_FEATURES_PRESENT (0)
> >
> > +u64 snp_get_unsupported_features(void)
> > +{
> > + if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
> > + return 0;
> > + return sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
> > +}
> > +
> > void snp_check_features(void)
> > {
> > u64 unsupported;
> >
> > - if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
> > - return;
> > -
> > /*
> > * Terminate the boot if hypervisor has enabled any feature lacking
> > * guest side implementation. Pass on the unsupported features mask through
> > * EXIT_INFO_2 of the GHCB protocol so that those features can be reported
> > * as part of the guest boot failure.
> > */
> > - unsupported = sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
> > + unsupported = snp_get_unsupported_features();
> > if (unsupported) {
> > if (ghcb_version < 2 || (!boot_ghcb && !early_setup_ghcb()))
> > sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
> > diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
> > index 13dc2a9d23c1eb25..bf27b91644d0da5a 100644
> > --- a/arch/x86/include/asm/sev.h
> > +++ b/arch/x86/include/asm/sev.h
> > @@ -157,6 +157,7 @@ static __always_inline void sev_es_nmi_complete(void)
> > __sev_es_nmi_complete();
> > }
> > extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
> > +extern void sev_enable(struct boot_params *bp);
> >
> > static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs)
> > {
> > @@ -202,12 +203,14 @@ void snp_set_wakeup_secondary_cpu(void);
> > bool snp_init(struct boot_params *bp);
> > void __init __noreturn snp_abort(void);
> > int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
> > +u64 snp_get_unsupported_features(void);
> > #else
> > static inline void sev_es_ist_enter(struct pt_regs *regs) { }
> > static inline void sev_es_ist_exit(void) { }
> > static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
> > static inline void sev_es_nmi_complete(void) { }
> > static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
> > +static inline void sev_enable(struct boot_params *bp) { }
> > static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate) { return 0; }
> > static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
> > static inline void setup_ghcb(void) { }
> > @@ -225,6 +228,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
> > {
> > return -ENOTTY;
> > }
> > +static inline u64 snp_get_unsupported_features(void) { return 0; }
> > #endif
> >
> > #endif
> > diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> > index ce8434fce0c37982..33d11ba78f1d8c4f 100644
> > --- a/drivers/firmware/efi/libstub/x86-stub.c
> > +++ b/drivers/firmware/efi/libstub/x86-stub.c
> > @@ -15,6 +15,7 @@
> > #include <asm/setup.h>
> > #include <asm/desc.h>
> > #include <asm/boot.h>
> > +#include <asm/sev.h>
> >
> > #include "efistub.h"
> >
> > @@ -714,6 +715,22 @@ static efi_status_t exit_boot_func(struct efi_boot_memmap *map,
> > &p->efi->efi_memmap, &p->efi->efi_memmap_hi);
> > p->efi->efi_memmap_size = map->map_size;
> >
> > + /*
> > + * Call the SEV init code while still running with the firmware's
> > + * GDT/IDT, so #VC exceptions will be handled by EFI.
> > + */
> > + if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
> > + u64 unsupported;
> > +
> > + sev_enable(p->boot_params);
> > + unsupported = snp_get_unsupported_features();
> > + if (unsupported) {
> > + efi_err("Unsupported SEV-SNP features detected: 0x%llx\n",
> > + unsupported);
> > + return EFI_UNSUPPORTED;
> > + }
> > + }
> > +
> > return EFI_SUCCESS;
> > }
> >

2023-05-19 14:25:38

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware

On 5/18/23 17:27, Ard Biesheuvel wrote:
> On Thu, 18 May 2023 at 22:16, Tom Lendacky <[email protected]> wrote:
>>
>> On 5/8/23 02:03, Ard Biesheuvel wrote:
>>> The decompressor executes in an environment with little or no access to
>>> a console, and without any ability to return an error back to the caller
>>> (the bootloader). So the only recourse we have when the SEV/SNP context
>>> is not quite what the kernel expects is to terminate the guest entirely.
>>>
>>> This is a bit harsh, and also unnecessary when booting via the EFI stub,
>>> given that it provides all the support that SEV guests need to probe the
>>> underlying platform.
>>>
>>> So let's do the SEV initialization and SNP feature check before calling
>>> ExitBootServices(), and simply return with an error if the SNP feature
>>> mask is not as expected.
>>
>> My SEV-ES / SEV-SNP guests started crashing when I applied this patch.
>> Turns out that sev_es_negotiate_protocol() used to be called when no #VC
>> exceptions were being generated before a valid GHCB was setup. Because
>> of that the current GHCB MSR value was not saved/restored. But now,
>> sev_es_negotiate_protocol() is called earlier in the boot process and
>> there are still messages being issued by UEFI, e.g.:
>>
>> SetUefiImageMemoryAttributes - 0x000000007F6D7000 - 0x0000000000006000 (0x0000000000000008)
>>
>> Similarly, get_hv_features() didn't worry about saving the GHCB MSR value
>> and an SNP guest crashed after fixing sev_es_negotiate_protocol().
>>
>
> Thanks for the clarification
>
> So the underlying assumption here is that performing these checks
> before ExitBootServices() is better because we can still return to the
> bootloader, which -like GRUB does- could simply attempt booting the
> next kernel in the list.
>
> I was obviously unaware of the complication you are hitting here. So I
> wonder what your take is on this: should we defer this check until
> after ExitBootServices(), and crash and burn like before if the test
> fails? Or is the change below reasonable in your opinion, and we can

Deferring the checks is probably the safest thing to do, since that would
match the way things are done today and known to work. I'm not sure what
other things might pop up if we stay with this approach, for example, page
state change calls using the GHCB MSR protocol that also don't
save/restore the MSR value.

It is possible to audit these areas and stay with this approach, but I'm
wondering if that wouldn't be better done as a separate patch series.

Adding @Joerg for any additional thoughts he might have around this area, too.

> incorporate it? Or is there a third option, i.e., is there a SEV
> specific EFI protocol that the stub should be using to establish
> whether the underlying platform meets the kernel's expectations?

There isn't currently an EFI protocol to do that.

Thanks,
Tom

>
>
>> The following changes got me past everything:
>>
>> diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
>> index 3a5b0c9c4fcc..23450628d41c 100644
>> --- a/arch/x86/kernel/sev-shared.c
>> +++ b/arch/x86/kernel/sev-shared.c
>> @@ -106,15 +106,19 @@ static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
>> */
>> static u64 get_hv_features(void)
>> {
>> - u64 val;
>> + u64 val, save;
>>
>> if (ghcb_version < 2)
>> return 0;
>>
>> + save = sev_es_rd_ghcb_msr();
>> +
>> sev_es_wr_ghcb_msr(GHCB_MSR_HV_FT_REQ);
>> VMGEXIT();
>> -
>> val = sev_es_rd_ghcb_msr();
>> +
>> + sev_es_wr_ghcb_msr(save);
>> +
>> if (GHCB_RESP_CODE(val) != GHCB_MSR_HV_FT_RESP)
>> return 0;
>>
>> @@ -139,13 +143,17 @@ static void snp_register_ghcb_early(unsigned long paddr)
>>
>> static bool sev_es_negotiate_protocol(void)
>> {
>> - u64 val;
>> + u64 val, save;
>> +
>> + save = sev_es_rd_ghcb_msr();
>>
>> /* Do the GHCB protocol version negotiation */
>> sev_es_wr_ghcb_msr(GHCB_MSR_SEV_INFO_REQ);
>> VMGEXIT();
>> val = sev_es_rd_ghcb_msr();
>>
>> + sev_es_wr_ghcb_msr(save);
>> +
>> if (GHCB_MSR_INFO(val) != GHCB_MSR_SEV_INFO_RESP)
>> return false;
>>
>>
>> Thanks,
>> Tom
>>
>>>
>>> Signed-off-by: Ard Biesheuvel <[email protected]>
>>> ---
>>> arch/x86/boot/compressed/sev.c | 12 ++++++++----
>>> arch/x86/include/asm/sev.h | 4 ++++
>>> drivers/firmware/efi/libstub/x86-stub.c | 17 +++++++++++++++++
>>> 3 files changed, 29 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
>>> index 014b89c890887b9a..19c40873fdd209b5 100644
>>> --- a/arch/x86/boot/compressed/sev.c
>>> +++ b/arch/x86/boot/compressed/sev.c
>>> @@ -315,20 +315,24 @@ static void enforce_vmpl0(void)
>>> */
>>> #define SNP_FEATURES_PRESENT (0)
>>>
>>> +u64 snp_get_unsupported_features(void)
>>> +{
>>> + if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
>>> + return 0;
>>> + return sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
>>> +}
>>> +
>>> void snp_check_features(void)
>>> {
>>> u64 unsupported;
>>>
>>> - if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
>>> - return;
>>> -
>>> /*
>>> * Terminate the boot if hypervisor has enabled any feature lacking
>>> * guest side implementation. Pass on the unsupported features mask through
>>> * EXIT_INFO_2 of the GHCB protocol so that those features can be reported
>>> * as part of the guest boot failure.
>>> */
>>> - unsupported = sev_status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT;
>>> + unsupported = snp_get_unsupported_features();
>>> if (unsupported) {
>>> if (ghcb_version < 2 || (!boot_ghcb && !early_setup_ghcb()))
>>> sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
>>> diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
>>> index 13dc2a9d23c1eb25..bf27b91644d0da5a 100644
>>> --- a/arch/x86/include/asm/sev.h
>>> +++ b/arch/x86/include/asm/sev.h
>>> @@ -157,6 +157,7 @@ static __always_inline void sev_es_nmi_complete(void)
>>> __sev_es_nmi_complete();
>>> }
>>> extern int __init sev_es_efi_map_ghcbs(pgd_t *pgd);
>>> +extern void sev_enable(struct boot_params *bp);
>>>
>>> static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs)
>>> {
>>> @@ -202,12 +203,14 @@ void snp_set_wakeup_secondary_cpu(void);
>>> bool snp_init(struct boot_params *bp);
>>> void __init __noreturn snp_abort(void);
>>> int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
>>> +u64 snp_get_unsupported_features(void);
>>> #else
>>> static inline void sev_es_ist_enter(struct pt_regs *regs) { }
>>> static inline void sev_es_ist_exit(void) { }
>>> static inline int sev_es_setup_ap_jump_table(struct real_mode_header *rmh) { return 0; }
>>> static inline void sev_es_nmi_complete(void) { }
>>> static inline int sev_es_efi_map_ghcbs(pgd_t *pgd) { return 0; }
>>> +static inline void sev_enable(struct boot_params *bp) { }
>>> static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate) { return 0; }
>>> static inline int rmpadjust(unsigned long vaddr, bool rmp_psize, unsigned long attrs) { return 0; }
>>> static inline void setup_ghcb(void) { }
>>> @@ -225,6 +228,7 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
>>> {
>>> return -ENOTTY;
>>> }
>>> +static inline u64 snp_get_unsupported_features(void) { return 0; }
>>> #endif
>>>
>>> #endif
>>> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
>>> index ce8434fce0c37982..33d11ba78f1d8c4f 100644
>>> --- a/drivers/firmware/efi/libstub/x86-stub.c
>>> +++ b/drivers/firmware/efi/libstub/x86-stub.c
>>> @@ -15,6 +15,7 @@
>>> #include <asm/setup.h>
>>> #include <asm/desc.h>
>>> #include <asm/boot.h>
>>> +#include <asm/sev.h>
>>>
>>> #include "efistub.h"
>>>
>>> @@ -714,6 +715,22 @@ static efi_status_t exit_boot_func(struct efi_boot_memmap *map,
>>> &p->efi->efi_memmap, &p->efi->efi_memmap_hi);
>>> p->efi->efi_memmap_size = map->map_size;
>>>
>>> + /*
>>> + * Call the SEV init code while still running with the firmware's
>>> + * GDT/IDT, so #VC exceptions will be handled by EFI.
>>> + */
>>> + if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
>>> + u64 unsupported;
>>> +
>>> + sev_enable(p->boot_params);
>>> + unsupported = snp_get_unsupported_features();
>>> + if (unsupported) {
>>> + efi_err("Unsupported SEV-SNP features detected: 0x%llx\n",
>>> + unsupported);
>>> + return EFI_UNSUPPORTED;
>>> + }
>>> + }
>>> +
>>> return EFI_SUCCESS;
>>> }
>>>

2023-05-22 13:25:19

by Jörg Rödel

[permalink] [raw]
Subject: Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware

On Fri, May 19, 2023 at 09:04:46AM -0500, Tom Lendacky wrote:
> Deferring the checks is probably the safest thing to do, since that would
> match the way things are done today and known to work. I'm not sure what
> other things might pop up if we stay with this approach, for example, page
> state change calls using the GHCB MSR protocol that also don't save/restore
> the MSR value.
>
> It is possible to audit these areas and stay with this approach, but I'm
> wondering if that wouldn't be better done as a separate patch series.
>
> Adding @Joerg for any additional thoughts he might have around this area, too.

If I got it correctly the patch actually moves two things before
ExitBootServices:

1) SEV features check

2) SEV initialization

I think it makes a lot of sense to have 1) before ExitBootServices. It
allows us to soft-fail in case the kernel does not support all required
SEV-SNP features and to move on to a kernel which does. This check also
only needs the SEV_STATUS MSR and not any GHCB calls.

The problem is the GHCB protocol negotiation with the HV, but the GHCB
protocol is downward-compatible, so an older kernel can work with a
newer HV.

But 2) needs to stay after ExitBootServices, as it needs resources owned
by UEFI, e.g. the GHCB MSR and potentially the configured GHCB itself.
Fiddling around with the GHCB MSR while it is still owned by UEFI will
bite us in one way or another (e.g. UEFI, before ExitBootServices, is
free to take IRQs with handlers that rely on the GHCB MSR content).

Regards,

Joerg


2023-05-22 13:28:03

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware

On Mon, 22 May 2023 at 14:48, Joerg Roedel <[email protected]> wrote:
>
> On Fri, May 19, 2023 at 09:04:46AM -0500, Tom Lendacky wrote:
> > Deferring the checks is probably the safest thing to do, since that would
> > match the way things are done today and known to work. I'm not sure what
> > other things might pop up if we stay with this approach, for example, page
> > state change calls using the GHCB MSR protocol that also don't save/restore
> > the MSR value.
> >
> > It is possible to audit these areas and stay with this approach, but I'm
> > wondering if that wouldn't be better done as a separate patch series.
> >
> > Adding @Joerg for any additional thoughts he might have around this area, too.
>
> If I got it correctly the patch actually moves two things before
> ExitBootServices:
>
> 1) SEV features check
>
> 2) SEV initialization
>
> I think it makes a lot of sense to have 1) before ExitBootServices. It
> allows us to soft-fail in case the kernel does not support all required
> SEV-SNP features and to move on to a kernel which does. This check also
> only needs the SEV_STATUS MSR and not any GHCB calls.
>
> The problem is the GHCB protocol negotiation with the HV, but the GHCB
> protocol is downward-compatible, so an older kernel can work with a
> newer HV.
>
> But 2) needs to stay after ExitBootServices, as it needs resources owned
> by UEFI, e.g. the GHCB MSR and potentially the configured GHCB itself.
> Fiddling around with the GHCB MSR while it is still owned by UEFI will
> bite us in one way or another (e.g. UEFI, before ExitBootServices, is
> free to take IRQs with handlers that rely on the GHCB MSR content).
>

Thanks for the insight. Note that I have sent a v3 in the mean time
that moves all of this *after* ExitBootServices() [0], but I failed to
cc you - apologies.

So IIUC, we could just read sev_status much earlier just to perform
the SNP feature check, and fail the boot gracefully on a mismatch. And
the sev_enable() call needs to move after ExitBootServices(), right?

That would result in only very minor duplication, afaict. I'll have a
stab at implementing this for v4.
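
Roughly something like the sketch below, with hypothetical helper names
and assuming the usual stub/SEV headers; this is not the code that will
actually land in v4. The point is that before ExitBootServices() we only
read the SEV status MSR (no GHCB MSR writes, no VMGEXITs), so the
firmware keeps ownership of the GHCB and a feature mismatch can be
reported back to the bootloader instead of terminating the guest:

/* hypothetical helpers, declared here only to keep the sketch self-contained */
bool sev_snp_active(struct boot_params *bp);
u64 read_sev_status_msr(void);

static efi_status_t check_snp_features(struct boot_params *bp)
{
	u64 status;

	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT) || !sev_snp_active(bp))
		return EFI_SUCCESS;

	status = read_sev_status_msr();
	if (status & SNP_FEATURES_IMPL_REQ & ~SNP_FEATURES_PRESENT)
		return EFI_UNSUPPORTED;

	return EFI_SUCCESS;
}

sev_enable(), which does the GHCB protocol negotiation, would then only
run after ExitBootServices(), as described above.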

Thanks,




[0] https://lore.kernel.org/all/[email protected]/

2023-05-22 13:53:25

by Jörg Rödel

[permalink] [raw]
Subject: Re: [PATCH v2 17/20] x86: efistub: Check SEV/SNP support while running in the firmware

On Mon, May 22, 2023 at 03:07:12PM +0200, Ard Biesheuvel wrote:
> So IIUC, we could just read sev_status much earlier just to perform
> the SNP feature check, and fail the boot gracefully on a mismatch. And
> the sev_enable() call needs to move after ExitBootServices(), right?

Right, sev_enable() negotiates the GHCB protocol version, which needs
the GHCB MSR, so that has to stay after ExitBootServices(). The
SEV feature check on the other side only needs to read the sev-status
MSR, which is no problem before ExitBootServices() (as long as it is
only read on SEV platforms).

> That would result in only very minor duplication, afaict. I'll have a
> stab at implementing this for v4.

Thanks,

--
Jörg Rödel
[email protected]

SUSE Software Solutions Germany GmbH
Frankenstraße 146
90461 Nürnberg
Germany

(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman