This series is conceptually a combination of Evgeny's series [0] and
mine [1], both of which attempt to make the early decompressor code more
amenable to executing in the EFI environment with stricter handling of
memory permissions.
My series [1] implemented zboot for x86, by getting rid of the entire
x86 decompressor, and replacing it with existing EFI code that does the
same but in a generic way. The downside of this is that only EFI boot is
supported, making it unviable for distros, which need to support BIOS
boot and hybrid EFI boot modes that omit the EFI stub.
Evgeny's series [0] adapted the entire decompressor code flow to allow
it to execute in the EFI context as well as the bare metal context, and
this involves changes to the 1:1 mapping code and the page fault
handlers etc, none of which are really needed when doing EFI boot in the
first place.
So this series attempts to occupy the middle ground here: it makes
minimal changes to the existing decompressor so some of it can be called
from the EFI stub. Then, it reimplements the EFI boot flow to decompress
the kernel and boot it directly, without relying on the trampoline code,
page table code or page fault handling code. This allows us to get rid
of quite a bit of unsavory EFI stub code, and replace it with two clear
invocations of the EFI firmware APIs to clear NX restrictions from
allocations that have been populated with executable code.
The only code that is being reused is the decompression library itself,
along with the minimal ELF parsing that is required to copy the ELF
segments in place, and the relocation processing that fixes up absolute
symbol references to refer to the correct virtual addresses.
Note that some of Evgeny's changes to clean up the PE/COFF header
generation will still be needed, but I've omitted those here for
brevity.
Cc: Evgeniy Baskov <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Alexey Khoroshilov <[email protected]>
Cc: Peter Jones <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Mario Limonciello <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
[0] https://lore.kernel.org/all/[email protected]/
[1] https://lore.kernel.org/all/[email protected]/
Ard Biesheuvel (6):
x86: decompressor: Move global symbol references to C code
x86: decompressor: Factor out kernel decompression and relocation
x86: efistub: Obtain ACPI RSDP address while running in the stub
x86: efistub: Perform 4/5 level paging switch from the stub
x86: efistub: Prefer EFI memory attributes protocol over DXE services
x86: efistub: Avoid legacy decompressor when doing EFI boot
arch/x86/boot/compressed/efi_mixed.S | 55 ---
arch/x86/boot/compressed/head_32.S | 24 --
arch/x86/boot/compressed/head_64.S | 39 +--
arch/x86/boot/compressed/misc.c | 44 ++-
arch/x86/include/asm/efi.h | 2 +
drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
drivers/firmware/efi/libstub/x86-stub.c | 360 +++++++++++++-------
7 files changed, 279 insertions(+), 249 deletions(-)
--
2.39.2
One of the actions performed by the decompressor is populating the RSDP
address field in the boot_params struct, and when doing EFI boot, EFI
configuration tables are the preferred source for this information.
In preparation for removing the decompressor code from the EFI stub boot
path, set this field from the EFI stub code.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
drivers/firmware/efi/libstub/x86-stub.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0bfd31358ba97b1..e136c94037dda8d3 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -787,6 +787,11 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
efi_dxe_table = NULL;
}
+ if (!boot_params->acpi_rsdp_addr)
+ boot_params->acpi_rsdp_addr = (unsigned long)
+ (get_efi_config_table(ACPI_20_TABLE_GUID) ?:
+ get_efi_config_table(ACPI_TABLE_GUID));
+
/*
* If the kernel isn't already loaded at a suitable address,
* relocate it.
--
2.39.2
In preparation for updating the EFI stub boot flow to avoid the bare
metal decompressor code altogether, implement the support code for
switching between 4 and 5 levels of paging before jumping to the kernel
proper.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
drivers/firmware/efi/libstub/x86-stub.c | 145 ++++++++++++++++++++
2 files changed, 149 insertions(+)
diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
index 1e0203d74691ffcc..fc5f3b4c45e91401 100644
--- a/drivers/firmware/efi/libstub/efi-stub-helper.c
+++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
@@ -16,6 +16,8 @@
#include "efistub.h"
+extern bool efi_no5lvl;
+
bool efi_nochunk;
bool efi_nokaslr = !IS_ENABLED(CONFIG_RANDOMIZE_BASE);
bool efi_novamap;
@@ -73,6 +75,8 @@ efi_status_t efi_parse_options(char const *cmdline)
efi_loglevel = CONSOLE_LOGLEVEL_QUIET;
} else if (!strcmp(param, "noinitrd")) {
efi_noinitrd = true;
+ } else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
+ efi_no5lvl = true;
} else if (!strcmp(param, "efi") && val) {
efi_nochunk = parse_option_str(val, "nochunk");
efi_novamap |= parse_option_str(val, "novamap");
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index e136c94037dda8d3..7b8717cbb96a1246 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -760,6 +760,139 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
return EFI_SUCCESS;
}
+#ifdef CONFIG_X86_64
+bool efi_no5lvl;
+
+static const struct desc_struct gdt[] = {
+ [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
+ [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
+ [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
+};
+
+static void (*la57_toggle)(void *cr3, void *gdt);
+
+static void __naked tmpl_toggle(void *cr3, void *gdt)
+{
+ /*
+ * This is template code that will be copied into a 32-bit addressable
+ * buffer, allowing us to drop to 32-bit mode with paging disabled,
+ * which is required to be able to toggle the CR4.LA57 bit.
+ *
+ * The first MOVB instruction is only there to capture the size of the
+ * sequence, and implicitly, the offset to the LJMP's immediate, which
+ * will be populated with the correct absolute address after copying.
+ */
+ asm("0: movb $(4f - .), %%al \n\t"
+ " lgdt (%%rsi) \n\t"
+ " movw %[ds], %%ax \n\t"
+ " movw %%ax, %%ds \n\t"
+ " movw %%ax, %%ss \n\t"
+ " leaq 2f(%%rip), %%rax \n\t"
+ " pushq %[cs32] \n\t"
+ " pushq %%rax \n\t"
+ " lretq \n\t"
+ "1: retq \n\t"
+ " .code32 \n\t"
+ "2: movl %%cr0, %%eax \n\t"
+ " btrl %[pg], %%eax \n\t"
+ " movl %%eax, %%cr0 \n\t"
+ " jmp 3f \n\t"
+ "3: movl %%cr4, %%ecx \n\t"
+ " btcl %[la57], %%ecx \n\t"
+ " movl %%ecx, %%cr4 \n\t"
+ " movl %%edi, %%cr3 \n\t"
+ " btsl %[pg], %%eax \n\t"
+ " movl %%eax, %%cr0 \n\t"
+ " ljmpl %[cs], $(1b - 0b) \n\t"
+ "4: .code64"
+ :
+ : [cs32] "i"(__KERNEL32_CS),
+ [cs] "i"(__KERNEL_CS),
+ [ds] "i"(__KERNEL_DS),
+ [pg] "i"(X86_CR0_PG_BIT),
+ [la57] "i"(X86_CR4_LA57_BIT));
+}
+
+/*
+ * Enabling (or disabling) 5 level paging is tricky, because it can only be
+ * done from 32-bit mode with paging disabled. This means not only that the
+ * code itself must be running from 32-bit addressable physical memory, but
+ * also that the root page table must be 32-bit addressable, as we cannot
+ * program a 64-bit value into CR3 when running in 32-bit mode.
+ */
+static efi_status_t efi_setup_5level_paging(void)
+{
+ const u8 tmpl_size = ((u8 *)tmpl_toggle)[1];
+ efi_status_t status;
+ u8 *la57_code;
+
+ if (!efi_is_64bit())
+ return EFI_SUCCESS;
+
+ /* check for 5 level paging support */
+ if (native_cpuid_eax(0) < 7 ||
+ !(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31))))
+ return EFI_SUCCESS;
+
+ /* allocate some 32-bit addressable memory for code and a page table */
+ status = efi_allocate_pages(2 * PAGE_SIZE, (unsigned long *)&la57_code,
+ U32_MAX);
+ if (status != EFI_SUCCESS)
+ return status;
+
+ la57_toggle = memcpy(la57_code, tmpl_toggle, tmpl_size);
+ memset(la57_code + tmpl_size, 0x90, PAGE_SIZE - tmpl_size);
+
+ /*
+ * To avoid having to allocate a 32-bit addressable stack, we use a
+ * ljmp to switch back to long mode. However, this takes an absolute
+ * address, so we have to poke it in at runtime. The dummy MOVB
+ * instruction at the beginning can be used to locate the immediate.
+ */
+ *(u32 *)&la57_code[tmpl_size - 6] += (unsigned long)la57_code;
+
+ adjust_memory_range_protection((unsigned long)la57_code, PAGE_SIZE);
+
+ return EFI_SUCCESS;
+}
+
+static void efi_5level_switch(void)
+{
+ bool want_la57 = IS_ENABLED(CONFIG_X86_5LEVEL) && !efi_no5lvl;
+ bool have_la57 = native_read_cr4() & X86_CR4_LA57;
+ bool need_toggle = want_la57 ^ have_la57;
+ u64 *pgt = (void *)la57_toggle + PAGE_SIZE;
+ u64 *cr3 = (u64 *)__native_read_cr3();
+ struct desc_ptr desc;
+ u64 *new_cr3;
+
+ if (!la57_toggle || !need_toggle)
+ return;
+
+ if (!have_la57) {
+ /*
+ * We are going to enable 5 level paging, so we need to
+ * allocate a root level page from the 32-bit addressable
+ * physical region, and plug the existing hierarchy into it.
+ */
+ new_cr3 = memset(pgt, 0, PAGE_SIZE);
+ new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
+ } else {
+ // take the new root table pointer from the current entry #0
+ new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
+
+ // copy the new root level table if it is not 32-bit addressable
+ if ((u64)new_cr3 > U32_MAX)
+ new_cr3 = memcpy(pgt, new_cr3, PAGE_SIZE);
+ }
+
+ desc.size = sizeof(gdt) - 1;
+ desc.address = (u64)gdt;
+
+ la57_toggle(new_cr3, &desc);
+}
+#endif
+
/*
* On success, we return the address of startup_32, which has potentially been
* relocated by efi_relocate_kernel.
@@ -792,6 +925,14 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
(get_efi_config_table(ACPI_20_TABLE_GUID) ?:
get_efi_config_table(ACPI_TABLE_GUID));
+#ifdef CONFIG_X86_64
+ status = efi_setup_5level_paging();
+ if (status != EFI_SUCCESS) {
+ efi_err("efi_setup_5level_paging() failed!\n");
+ goto fail;
+ }
+#endif
+
/*
* If the kernel isn't already loaded at a suitable address,
* relocate it.
@@ -910,6 +1051,10 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
goto fail;
}
+#ifdef CONFIG_X86_64
+ efi_5level_switch();
+#endif
+
return bzimage_addr;
fail:
efi_err("efi_main() failed!\n");
--
2.39.2
Currently, we rely on DXE services in some cases to clear non-execute
restrictions from page allocations that need to be executable. This is
dodgy, because DXE services are not specified by UEFI but by PI, and
they are not intended for consumption by OS loaders. However, no
alternative existed at the time.
Now, there is a new UEFI protocol that should be used instead, so if it
exists, prefer it over the DXE services calls.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
drivers/firmware/efi/libstub/x86-stub.c | 28 ++++++++++++++------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 7b8717cbb96a1246..ea4024a6a04e507f 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -25,6 +25,7 @@ const efi_system_table_t *efi_system_table;
const efi_dxe_services_table_t *efi_dxe_table;
u32 image_offset __section(".data");
static efi_loaded_image_t *image = NULL;
+static efi_memory_attribute_protocol_t *memattr;
static efi_status_t
preserve_pci_rom_image(efi_pci_io_protocol_t *pci, struct pci_setup_rom **__rom)
@@ -221,12 +222,18 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
unsigned long rounded_start, rounded_end;
unsigned long unprotect_start, unprotect_size;
- if (efi_dxe_table == NULL)
- return;
-
rounded_start = rounddown(start, EFI_PAGE_SIZE);
rounded_end = roundup(start + size, EFI_PAGE_SIZE);
+ if (memattr != NULL) {
+ efi_call_proto(memattr, clear_memory_attributes, rounded_start,
+ rounded_end - rounded_start, EFI_MEMORY_XP);
+ return;
+ }
+
+ if (efi_dxe_table == NULL)
+ return;
+
/*
* Don't modify memory region attributes, they are
* already suitable, to lower the possibility to
@@ -913,13 +920,18 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
if (efi_system_table->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE)
efi_exit(handle, EFI_INVALID_PARAMETER);
- efi_dxe_table = get_efi_config_table(EFI_DXE_SERVICES_TABLE_GUID);
- if (efi_dxe_table &&
- efi_dxe_table->hdr.signature != EFI_DXE_SERVICES_TABLE_SIGNATURE) {
- efi_warn("Ignoring DXE services table: invalid signature\n");
- efi_dxe_table = NULL;
+ if (IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES)) {
+ efi_dxe_table = get_efi_config_table(EFI_DXE_SERVICES_TABLE_GUID);
+ if (efi_dxe_table &&
+ efi_dxe_table->hdr.signature != EFI_DXE_SERVICES_TABLE_SIGNATURE) {
+ efi_warn("Ignoring DXE services table: invalid signature\n");
+ efi_dxe_table = NULL;
+ }
}
+ /* grab the memory attributes protocol if it exists */
+ efi_bs_call(locate_protocol, &guid, NULL, (void **)&memattr);
+
if (!boot_params->acpi_rsdp_addr)
boot_params->acpi_rsdp_addr = (unsigned long)
(get_efi_config_table(ACPI_20_TABLE_GUID) ?:
--
2.39.2
We no longer have to be cautious when referring to global variables in
the position independent decompressor code, now that we use PIE codegen
and assert in the linker script that no GOT entries exist (which would
require adjustment for the actual runtime load address of the
decompressor binary).
This means we can simply refer to global variables directly, instead of
passing them into C routines from asm code. Let's do so for the code
that will be called directly from the EFI stub after a subsequent patch,
and avoid the need to pass these quantities directly.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/head_32.S | 8 --------
arch/x86/boot/compressed/head_64.S | 8 +-------
arch/x86/boot/compressed/misc.c | 17 ++++++++++-------
3 files changed, 11 insertions(+), 22 deletions(-)
diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 987ae727cf9f0d04..3ecc1bbe971e1d43 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -179,13 +179,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
*/
/* push arguments for extract_kernel: */
- pushl output_len@GOTOFF(%ebx) /* decompressed length, end of relocs */
pushl %ebp /* output address */
- pushl input_len@GOTOFF(%ebx) /* input_len */
- leal input_data@GOTOFF(%ebx), %eax
- pushl %eax /* input_data */
- leal boot_heap@GOTOFF(%ebx), %eax
- pushl %eax /* heap area */
pushl %esi /* real mode pointer */
call extract_kernel /* returns kernel entry point in %eax */
addl $24, %esp
@@ -213,8 +207,6 @@ SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
*/
.bss
.balign 4
-boot_heap:
- .fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 03c4328a88cbd5d0..5ab014be179c1103 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -564,11 +564,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
*/
pushq %rsi /* Save the real mode argument */
movq %rsi, %rdi /* real mode address */
- leaq boot_heap(%rip), %rsi /* malloc area for uncompression */
- leaq input_data(%rip), %rdx /* input_data */
- movl input_len(%rip), %ecx /* input_len */
- movq %rbp, %r8 /* output target address */
- movl output_len(%rip), %r9d /* decompressed length, end of relocs */
+ movq %rbp, %rsi /* output target address */
call extract_kernel /* returns kernel entry point in %rax */
popq %rsi
@@ -726,8 +722,6 @@ SYM_DATA_END_LABEL(boot_idt, SYM_L_GLOBAL, boot_idt_end)
*/
.bss
.balign 4
-SYM_DATA_LOCAL(boot_heap, .fill BOOT_HEAP_SIZE, 1, 0)
-
SYM_DATA_START_LOCAL(boot_stack)
.fill BOOT_STACK_SIZE, 1, 0
.balign 16
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 014ff222bf4b35c2..3b79667412ceb388 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -330,6 +330,11 @@ static size_t parse_elf(void *output)
return ehdr.e_entry - LOAD_PHYSICAL_ADDR;
}
+static u8 boot_heap[BOOT_HEAP_SIZE] __aligned(4);
+
+extern unsigned char input_data[];
+extern unsigned int input_len, output_len;
+
/*
* The compressed kernel image (ZO), has been moved so that its position
* is against the end of the buffer used to hold the uncompressed kernel
@@ -347,14 +352,12 @@ static size_t parse_elf(void *output)
* |-------uncompressed kernel image---------|
*
*/
-asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
- unsigned char *input_data,
- unsigned long input_len,
- unsigned char *output,
- unsigned long output_len)
+asmlinkage __visible void *extract_kernel(void *rmode,
+ unsigned char *output)
{
const unsigned long kernel_total_size = VO__end - VO__text;
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
+ memptr heap = (memptr)boot_heap;
unsigned long needed_size;
size_t entry_offset;
@@ -412,7 +415,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
* entries. This ensures the full mapped area is usable RAM
* and doesn't include any reserved areas.
*/
- needed_size = max(output_len, kernel_total_size);
+ needed_size = max((unsigned long)output_len, kernel_total_size);
#ifdef CONFIG_X86_64
needed_size = ALIGN(needed_size, MIN_KERNEL_ALIGN);
#endif
@@ -443,7 +446,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
#ifdef CONFIG_X86_64
if (heap > 0x3fffffffffffUL)
error("Destination address too large");
- if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE)
+ if (virt_addr + needed_size > KERNEL_IMAGE_SIZE)
error("Destination virtual address is beyond the kernel mapping area");
#else
if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
--
2.39.2
The bare metal decompressor code was never really intended to run in a
hosted environment such as the EFI boot services, and does a few things
that are problematic in the context of EFI boot now that the logo
requirements are getting tighter.
In particular, the decompressor moves itself around in memory, and
assumes that arbitrary chunks of low memory can be temporarily emptied,
populated with trampoline code and restored again afterwards, and this
is difficult to support in a context where memory is not permitted to be
mapped writable and executable at the same time or, at the very least,
is mapped non-executable by default, and needs special treatment for
this restriction to be lifted.
Since EFI already maps all of memory 1:1, we don't need to create new
page tables or handle page faults when decompressing the kernel. That
means there is also no need to set up special exception handlers for
SEV. Generally, there is little need to do anything that the
decompressor does beyond
- perform the 4/5 level paging switch, if needed,
- decompress the kernel
- relocate the kernel
So let's do all of this from the EFI stub code, and avoid the bare metal
decompressor altogether.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/efi_mixed.S | 55 ------
arch/x86/boot/compressed/head_32.S | 16 --
arch/x86/boot/compressed/head_64.S | 31 ----
arch/x86/include/asm/efi.h | 2 +
drivers/firmware/efi/libstub/x86-stub.c | 188 ++++++++------------
5 files changed, 75 insertions(+), 217 deletions(-)
diff --git a/arch/x86/boot/compressed/efi_mixed.S b/arch/x86/boot/compressed/efi_mixed.S
index 4ca70bf93dc0bdcd..075dd6410f9a4d60 100644
--- a/arch/x86/boot/compressed/efi_mixed.S
+++ b/arch/x86/boot/compressed/efi_mixed.S
@@ -245,10 +245,6 @@ SYM_FUNC_START(efi32_entry)
jmp startup_32
SYM_FUNC_END(efi32_entry)
-#define ST32_boottime 60 // offsetof(efi_system_table_32_t, boottime)
-#define BS32_handle_protocol 88 // offsetof(efi_boot_services_32_t, handle_protocol)
-#define LI32_image_base 32 // offsetof(efi_loaded_image_32_t, image_base)
-
/*
* efi_status_t efi32_pe_entry(efi_handle_t image_handle,
* efi_system_table_32_t *sys_table)
@@ -256,8 +252,6 @@ SYM_FUNC_END(efi32_entry)
SYM_FUNC_START(efi32_pe_entry)
pushl %ebp
movl %esp, %ebp
- pushl %eax // dummy push to allocate loaded_image
-
pushl %ebx // save callee-save registers
pushl %edi
@@ -266,48 +260,8 @@ SYM_FUNC_START(efi32_pe_entry)
movl $0x80000003, %eax // EFI_UNSUPPORTED
jnz 2f
- call 1f
-1: pop %ebx
-
- /* Get the loaded image protocol pointer from the image handle */
- leal -4(%ebp), %eax
- pushl %eax // &loaded_image
- leal (loaded_image_proto - 1b)(%ebx), %eax
- pushl %eax // pass the GUID address
- pushl 8(%ebp) // pass the image handle
-
- /*
- * Note the alignment of the stack frame.
- * sys_table
- * handle <-- 16-byte aligned on entry by ABI
- * return address
- * frame pointer
- * loaded_image <-- local variable
- * saved %ebx <-- 16-byte aligned here
- * saved %edi
- * &loaded_image
- * &loaded_image_proto
- * handle <-- 16-byte aligned for call to handle_protocol
- */
-
- movl 12(%ebp), %eax // sys_table
- movl ST32_boottime(%eax), %eax // sys_table->boottime
- call *BS32_handle_protocol(%eax) // sys_table->boottime->handle_protocol
- addl $12, %esp // restore argument space
- testl %eax, %eax
- jnz 2f
-
movl 8(%ebp), %ecx // image_handle
movl 12(%ebp), %edx // sys_table
- movl -4(%ebp), %esi // loaded_image
- movl LI32_image_base(%esi), %esi // loaded_image->image_base
- leal (startup_32 - 1b)(%ebx), %ebp // runtime address of startup_32
- /*
- * We need to set the image_offset variable here since startup_32() will
- * use it before we get to the 64-bit efi_pe_entry() in C code.
- */
- subl %esi, %ebp // calculate image_offset
- movl %ebp, (image_offset - 1b)(%ebx) // save image_offset
xorl %esi, %esi
jmp efi32_entry // pass %ecx, %edx, %esi
// no other registers remain live
@@ -318,15 +272,6 @@ SYM_FUNC_START(efi32_pe_entry)
RET
SYM_FUNC_END(efi32_pe_entry)
- .section ".rodata"
- /* EFI loaded image protocol GUID */
- .balign 4
-SYM_DATA_START_LOCAL(loaded_image_proto)
- .long 0x5b1b31a1
- .word 0x9562, 0x11d2
- .byte 0x8e, 0x3f, 0x00, 0xa0, 0xc9, 0x69, 0x72, 0x3b
-SYM_DATA_END(loaded_image_proto)
-
.data
.balign 8
SYM_DATA_START_LOCAL(efi32_boot_gdt)
diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 3ecc1bbe971e1d43..a578fe178f1cfee7 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -84,19 +84,6 @@ SYM_FUNC_START(startup_32)
#ifdef CONFIG_RELOCATABLE
leal startup_32@GOTOFF(%edx), %ebx
-
-#ifdef CONFIG_EFI_STUB
-/*
- * If we were loaded via the EFI LoadImage service, startup_32() will be at an
- * offset to the start of the space allocated for the image. efi_pe_entry() will
- * set up image_offset to tell us where the image actually starts, so that we
- * can use the full available buffer.
- * image_offset = startup_32 - image_base
- * Otherwise image_offset will be zero and has no effect on the calculations.
- */
- subl image_offset@GOTOFF(%edx), %ebx
-#endif
-
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
@@ -153,10 +140,7 @@ SYM_FUNC_END(startup_32)
#ifdef CONFIG_EFI_STUB
SYM_FUNC_START(efi32_stub_entry)
add $0x4, %esp
- movl 8(%esp), %esi /* save boot_params pointer */
call efi_main
- /* efi_main returns the possibly relocated address of startup_32 */
- jmp *%eax
SYM_FUNC_END(efi32_stub_entry)
SYM_FUNC_ALIAS(efi_stub_entry, efi32_stub_entry)
#endif
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 5ab014be179c1103..eb8ead7bce9fc577 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -146,19 +146,6 @@ SYM_FUNC_START(startup_32)
#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
-
-#ifdef CONFIG_EFI_STUB
-/*
- * If we were loaded via the EFI LoadImage service, startup_32 will be at an
- * offset to the start of the space allocated for the image. efi_pe_entry will
- * set up image_offset to tell us where the image actually starts, so that we
- * can use the full available buffer.
- * image_offset = startup_32 - image_base
- * Otherwise image_offset will be zero and has no effect on the calculations.
- */
- subl rva(image_offset)(%ebp), %ebx
-#endif
-
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
@@ -346,20 +333,6 @@ SYM_CODE_START(startup_64)
/* Start with the delta to where the kernel will run at. */
#ifdef CONFIG_RELOCATABLE
leaq startup_32(%rip) /* - $startup_32 */, %rbp
-
-#ifdef CONFIG_EFI_STUB
-/*
- * If we were loaded via the EFI LoadImage service, startup_32 will be at an
- * offset to the start of the space allocated for the image. efi_pe_entry will
- * set up image_offset to tell us where the image actually starts, so that we
- * can use the full available buffer.
- * image_offset = startup_32 - image_base
- * Otherwise image_offset will be zero and has no effect on the calculations.
- */
- movl image_offset(%rip), %eax
- subq %rax, %rbp
-#endif
-
movl BP_kernel_alignment(%rsi), %eax
decl %eax
addq %rax, %rbp
@@ -529,11 +502,7 @@ SYM_CODE_END(startup_64)
#endif
SYM_FUNC_START(efi64_stub_entry)
and $~0xf, %rsp /* realign the stack */
- movq %rdx, %rbx /* save boot_params pointer */
call efi_main
- movq %rbx,%rsi
- leaq rva(startup_64)(%rax), %rax
- jmp *%rax
SYM_FUNC_END(efi64_stub_entry)
SYM_FUNC_ALIAS(efi_stub_entry, efi64_stub_entry)
#endif
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 419280d263d2e3f2..36b848e7d088a59b 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -216,6 +216,8 @@ efi_status_t efi_set_virtual_address_map(unsigned long memory_map_size,
#ifdef CONFIG_EFI_MIXED
+#define EFI_ALLOC_LIMIT (efi_is_64bit() ? ULONG_MAX : U32_MAX)
+
#define ARCH_HAS_EFISTUB_WRAPPERS
static inline bool efi_is_64bit(void)
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index ea4024a6a04e507f..dc45bcb74d1552c2 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -18,12 +18,8 @@
#include "efistub.h"
-/* Maximum physical address for 64-bit kernel with 4-level paging */
-#define MAXMEM_X86_64_4LEVEL (1ull << 46)
-
const efi_system_table_t *efi_system_table;
const efi_dxe_services_table_t *efi_dxe_table;
-u32 image_offset __section(".data");
static efi_loaded_image_t *image = NULL;
static efi_memory_attribute_protocol_t *memattr;
@@ -274,54 +270,9 @@ adjust_memory_range_protection(unsigned long start, unsigned long size)
}
}
-/*
- * Trampoline takes 2 pages and can be loaded in first megabyte of memory
- * with its end placed between 128k and 640k where BIOS might start.
- * (see arch/x86/boot/compressed/pgtable_64.c)
- *
- * We cannot find exact trampoline placement since memory map
- * can be modified by UEFI, and it can alter the computed address.
- */
-
-#define TRAMPOLINE_PLACEMENT_BASE ((128 - 8)*1024)
-#define TRAMPOLINE_PLACEMENT_SIZE (640*1024 - (128 - 8)*1024)
-
-void startup_32(struct boot_params *boot_params);
-
-static void
-setup_memory_protection(unsigned long image_base, unsigned long image_size)
-{
- /*
- * Allow execution of possible trampoline used
- * for switching between 4- and 5-level page tables
- * and relocated kernel image.
- */
-
- adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
- TRAMPOLINE_PLACEMENT_SIZE);
-
-#ifdef CONFIG_64BIT
- if (image_base != (unsigned long)startup_32)
- adjust_memory_range_protection(image_base, image_size);
-#else
- /*
- * Clear protection flags on a whole range of possible
- * addresses used for KASLR. We don't need to do that
- * on x86_64, since KASLR/extraction is performed after
- * dedicated identity page tables are built and we only
- * need to remove possible protection on relocated image
- * itself disregarding further relocations.
- */
- adjust_memory_range_protection(LOAD_PHYSICAL_ADDR,
- KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR);
-#endif
-}
-
static const efi_char16_t apple[] = L"Apple";
-static void setup_quirks(struct boot_params *boot_params,
- unsigned long image_base,
- unsigned long image_size)
+static void setup_quirks(struct boot_params *boot_params)
{
efi_char16_t *fw_vendor = (efi_char16_t *)(unsigned long)
efi_table_attr(efi_system_table, fw_vendor);
@@ -330,9 +281,6 @@ static void setup_quirks(struct boot_params *boot_params,
if (IS_ENABLED(CONFIG_APPLE_PROPERTIES))
retrieve_apple_device_properties(boot_params);
}
-
- if (IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES))
- setup_memory_protection(image_base, image_size);
}
/*
@@ -485,7 +433,6 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
}
image_base = efi_table_attr(image, image_base);
- image_offset = (void *)startup_32 - image_base;
status = efi_allocate_pages(sizeof(struct boot_params),
(unsigned long *)&boot_params, ULONG_MAX);
@@ -900,19 +847,78 @@ static void efi_5level_switch(void)
}
#endif
+static void efi_get_seed(void *seed, int size)
+{
+ efi_get_random_bytes(size, seed);
+
+ // TODO mix in RDRAND/TSC
+}
+
+static void error(char *str)
+{
+ efi_warn("Decompression failed: %s\n", str);
+}
+
+int decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
+ void (*error)(char *x));
+
+static efi_status_t efi_decompress_kernel(unsigned long *kernel_entry)
+{
+ unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
+ extern const unsigned long kernel_total_size;
+ extern unsigned int input_len, output_len;
+ extern void *input_data;
+ unsigned long addr, alloc_size, entry;
+ unsigned long seed[2] = {};
+ efi_status_t status;
+
+ /* determine the required size of the allocation */
+ alloc_size = ALIGN(max((unsigned long)output_len, kernel_total_size),
+ MIN_KERNEL_ALIGN);
+
+ if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && !efi_nokaslr) {
+ u64 range = KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR - kernel_total_size;
+
+ efi_get_seed(seed, sizeof(seed));
+
+ virt_addr += (range * (u32)seed[1]) >> 32;
+ virt_addr &= ~(CONFIG_PHYSICAL_ALIGN - 1);
+ }
+
+ status = efi_random_alloc(alloc_size, CONFIG_PHYSICAL_ALIGN, &addr,
+ seed[0], EFI_LOADER_CODE);
+ if (status != EFI_SUCCESS)
+ return status;
+
+ entry = decompress_kernel((void *)addr, virt_addr, error);
+ *kernel_entry = addr + entry;
+
+ adjust_memory_range_protection(addr, kernel_total_size);
+
+ return EFI_SUCCESS;
+}
+
+static void __noreturn enter_kernel(unsigned long kernel_addr,
+ struct boot_params *boot_params)
+{
+ /* enter decompressed kernel with boot_params pointer in RSI/ESI */
+ asm("jmp *%0"::"r"(kernel_addr), "S"(boot_params));
+
+ unreachable();
+}
+
/*
- * On success, we return the address of startup_32, which has potentially been
- * relocated by efi_relocate_kernel.
- * On failure, we exit to the firmware via efi_exit instead of returning.
+ * On success, we jump to the relocated kernel directly and never return;
+ * on failure, we exit to the firmware via efi_exit instead of returning.
*/
asmlinkage unsigned long efi_main(efi_handle_t handle,
efi_system_table_t *sys_table_arg,
struct boot_params *boot_params)
{
- unsigned long bzimage_addr = (unsigned long)startup_32;
- unsigned long buffer_start, buffer_end;
+ efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;
struct setup_header *hdr = &boot_params->hdr;
const struct linux_efi_initrd *initrd = NULL;
+ unsigned long kernel_entry;
efi_status_t status;
efi_system_table = sys_table_arg;
@@ -945,60 +951,6 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
}
#endif
- /*
- * If the kernel isn't already loaded at a suitable address,
- * relocate it.
- *
- * It must be loaded above LOAD_PHYSICAL_ADDR.
- *
- * The maximum address for 64-bit is 1 << 46 for 4-level paging. This
- * is defined as the macro MAXMEM, but unfortunately that is not a
- * compile-time constant if 5-level paging is configured, so we instead
- * define our own macro for use here.
- *
- * For 32-bit, the maximum address is complicated to figure out, for
- * now use KERNEL_IMAGE_SIZE, which will be 512MiB, the same as what
- * KASLR uses.
- *
- * Also relocate it if image_offset is zero, i.e. the kernel wasn't
- * loaded by LoadImage, but rather by a bootloader that called the
- * handover entry. The reason we must always relocate in this case is
- * to handle the case of systemd-boot booting a unified kernel image,
- * which is a PE executable that contains the bzImage and an initrd as
- * COFF sections. The initrd section is placed after the bzImage
- * without ensuring that there are at least init_size bytes available
- * for the bzImage, and thus the compressed kernel's startup code may
- * overwrite the initrd unless it is moved out of the way.
- */
-
- buffer_start = ALIGN(bzimage_addr - image_offset,
- hdr->kernel_alignment);
- buffer_end = buffer_start + hdr->init_size;
-
- if ((buffer_start < LOAD_PHYSICAL_ADDR) ||
- (IS_ENABLED(CONFIG_X86_32) && buffer_end > KERNEL_IMAGE_SIZE) ||
- (IS_ENABLED(CONFIG_X86_64) && buffer_end > MAXMEM_X86_64_4LEVEL) ||
- (image_offset == 0)) {
- extern char _bss[];
-
- status = efi_relocate_kernel(&bzimage_addr,
- (unsigned long)_bss - bzimage_addr,
- hdr->init_size,
- hdr->pref_address,
- hdr->kernel_alignment,
- LOAD_PHYSICAL_ADDR);
- if (status != EFI_SUCCESS) {
- efi_err("efi_relocate_kernel() failed!\n");
- goto fail;
- }
- /*
- * Now that we've copied the kernel elsewhere, we no longer
- * have a set up block before startup_32(), so reset image_offset
- * to zero in case it was set earlier.
- */
- image_offset = 0;
- }
-
#ifdef CONFIG_CMDLINE_BOOL
status = efi_parse_options(CONFIG_CMDLINE);
if (status != EFI_SUCCESS) {
@@ -1016,6 +968,12 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
}
}
+ status = efi_decompress_kernel(&kernel_entry);
+ if (status != EFI_SUCCESS) {
+ efi_err("Failed to decompress kernel\n");
+ goto fail;
+ }
+
/*
* At this point, an initrd may already have been loaded by the
* bootloader and passed via bootparams. We permit an initrd loaded
@@ -1055,7 +1013,7 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
setup_efi_pci(boot_params);
- setup_quirks(boot_params, bzimage_addr, buffer_end - buffer_start);
+ setup_quirks(boot_params);
status = exit_boot(boot_params, handle);
if (status != EFI_SUCCESS) {
@@ -1067,7 +1025,7 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
efi_5level_switch();
#endif
- return bzimage_addr;
+ enter_kernel(kernel_entry, boot_params);
fail:
efi_err("efi_main() failed!\n");
--
2.39.2
Factor out the decompressor sequence that invokes the decompressor,
parses the ELF and applies the relocations so that we will be able to
call it directly from the EFI stub.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/misc.c | 27 ++++++++++++++++----
1 file changed, 22 insertions(+), 5 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 3b79667412ceb388..0fa5df49ce14f196 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -330,11 +330,32 @@ static size_t parse_elf(void *output)
return ehdr.e_entry - LOAD_PHYSICAL_ADDR;
}
+const unsigned long kernel_total_size = VO__end - VO__text;
+
static u8 boot_heap[BOOT_HEAP_SIZE] __aligned(4);
extern unsigned char input_data[];
extern unsigned int input_len, output_len;
+unsigned long decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
+ void (*error)(char *x))
+{
+ unsigned long entry;
+
+ if (!free_mem_ptr) {
+ free_mem_ptr = (unsigned long)boot_heap;
+ free_mem_end_ptr = (unsigned long)boot_heap + sizeof(boot_heap);
+ }
+
+ __decompress(input_data, input_len, NULL, NULL, outbuf, output_len,
+ NULL, error);
+
+ entry = parse_elf(outbuf);
+ handle_relocations(outbuf, output_len, virt_addr);
+
+ return entry;
+}
+
/*
* The compressed kernel image (ZO), has been moved so that its position
* is against the end of the buffer used to hold the uncompressed kernel
@@ -355,7 +376,6 @@ extern unsigned int input_len, output_len;
asmlinkage __visible void *extract_kernel(void *rmode,
unsigned char *output)
{
- const unsigned long kernel_total_size = VO__end - VO__text;
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
memptr heap = (memptr)boot_heap;
unsigned long needed_size;
@@ -458,10 +478,7 @@ asmlinkage __visible void *extract_kernel(void *rmode,
#endif
debug_putstr("\nDecompressing Linux... ");
- __decompress(input_data, input_len, NULL, NULL, output, output_len,
- NULL, error);
- entry_offset = parse_elf(output);
- handle_relocations(output, output_len, virt_addr);
+ entry_offset = decompress_kernel(output, virt_addr, error);
debug_putstr("done.\nBooting the kernel (entry_offset: 0x");
debug_puthex(entry_offset);
--
2.39.2
On Mon, Apr 24, 2023 at 06:57:20PM +0200, Ard Biesheuvel wrote:
> arch/x86/boot/compressed/efi_mixed.S | 55 ---
> arch/x86/boot/compressed/head_32.S | 24 --
> arch/x86/boot/compressed/head_64.S | 39 +--
> arch/x86/boot/compressed/misc.c | 44 ++-
> arch/x86/include/asm/efi.h | 2 +
> drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
> drivers/firmware/efi/libstub/x86-stub.c | 360 +++++++++++++-------
> 7 files changed, 279 insertions(+), 249 deletions(-)
Upon a quick scan, I can't argue with that diffstat and would prefer
a lot more if we did this instead of Evgeny's pile which touches a lot
of nasty and hard to debug code which gets executed on *everything*.
So if people agree with that approach, I'd gladly give it a more
detailed look.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Apr 24, 2023 at 06:57:24PM +0200, Ard Biesheuvel wrote:
> In preparation for updating the EFI stub boot flow to avoid the bare
> metal decompressor code altogether, implement the support code for
> switching between 4 and 5 levels of paging before jumping to the kernel
> proper.
I must admit it is neat. I like it a lot.
Any chance we can share the code with the traditional decompressor?
There's not much that EFI specific here. It should be possible to isolate
it from the rest, no?
> @@ -792,6 +925,14 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
> (get_efi_config_table(ACPI_20_TABLE_GUID) ?:
> get_efi_config_table(ACPI_TABLE_GUID));
>
> +#ifdef CONFIG_X86_64
> + status = efi_setup_5level_paging();
> + if (status != EFI_SUCCESS) {
> + efi_err("efi_setup_5level_paging() failed!\n");
> + goto fail;
> + }
> +#endif
> +
> /*
> * If the kernel isn't already loaded at a suitable address,
> * relocate it.
> @@ -910,6 +1051,10 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
> goto fail;
> }
>
> +#ifdef CONFIG_X86_64
> + efi_5level_switch();
> +#endif
> +
> return bzimage_addr;
> fail:
> efi_err("efi_main() failed!\n");
Maybe use IS_ENABLED() + dummy efi_setup_5level_paging()/efi_5level_switch()
instead of #ifdefs?
--
Kiryl Shutsemau / Kirill A. Shutemov
On Wed, 26 Apr 2023 at 11:18, Borislav Petkov <[email protected]> wrote:
>
> On Mon, Apr 24, 2023 at 06:57:20PM +0200, Ard Biesheuvel wrote:
> > arch/x86/boot/compressed/efi_mixed.S | 55 ---
> > arch/x86/boot/compressed/head_32.S | 24 --
> > arch/x86/boot/compressed/head_64.S | 39 +--
> > arch/x86/boot/compressed/misc.c | 44 ++-
> > arch/x86/include/asm/efi.h | 2 +
> > drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
> > drivers/firmware/efi/libstub/x86-stub.c | 360 +++++++++++++-------
> > 7 files changed, 279 insertions(+), 249 deletions(-)
>
> Upon a quick scan, I can't argue with that diffstat and would prefer
> a lot more if we did this instead of Evgeny's pile which touches a lot
> of nasty and hard to debug code which gets executed on *everything*.
>
> So if people agree with that approach, I'd gladly give it a more
> detailed look.
>
I think the general approach is better, but to be honest, I may have
missed a thing or two, so it would be good if people more familiar
with the history could chime in.
On Wed, 26 Apr 2023 at 11:42, Kirill A . Shutemov
<[email protected]> wrote:
>
> On Mon, Apr 24, 2023 at 06:57:24PM +0200, Ard Biesheuvel wrote:
> > In preparation for updating the EFI stub boot flow to avoid the bare
> > metal decompressor code altogether, implement the support code for
> > switching between 4 and 5 levels of paging before jumping to the kernel
> > proper.
>
> I must admit it is neat. I like it a lot.
>
Thanks!
> Any chance we can share the code with the traditional decompressor?
> There's not much that EFI specific here. It should be possible to isolate
> it from the rest, no?
>
I agree. The EFI boot code should still avoid the bare metal
trampoline allocation/deallocation, but the actual payload could be
the same - it's just an indirect call with the GDT and page table
pointers as arguments.
>
> > @@ -792,6 +925,14 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
> > (get_efi_config_table(ACPI_20_TABLE_GUID) ?:
> > get_efi_config_table(ACPI_TABLE_GUID));
> >
> > +#ifdef CONFIG_X86_64
> > + status = efi_setup_5level_paging();
> > + if (status != EFI_SUCCESS) {
> > + efi_err("efi_setup_5level_paging() failed!\n");
> > + goto fail;
> > + }
> > +#endif
> > +
> > /*
> > * If the kernel isn't already loaded at a suitable address,
> > * relocate it.
> > @@ -910,6 +1051,10 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
> > goto fail;
> > }
> >
> > +#ifdef CONFIG_X86_64
> > + efi_5level_switch();
> > +#endif
> > +
> > return bzimage_addr;
> > fail:
> > efi_err("efi_main() failed!\n");
>
> Maybe use IS_ENABLED() + dummy efi_setup_5level_paging()/efi_5level_switch()
> instead of #ifdefs?
>
These are functions returning void so I can just move the #ifdef into
the function implementation. Wo do need #ifdefs at some level, as i386
does not provide a definition for __KERNEL32_CS
On 2023-04-24 19:57, Ard Biesheuvel wrote:
> This series is conceptually a combination of Evgeny's series [0] and
> mine [1], both of which attempt to make the early decompressor code
> more
> amenable to executing in the EFI environment with stricter handling of
> memory permissions.
>
> My series [1] implemented zboot for x86, by getting rid of the entire
> x86 decompressor, and replacing it with existing EFI code that does the
> same but in a generic way. The downside of this is that only EFI boot
> is
> supported, making it unviable for distros, which need to support BIOS
> boot and hybrid EFI boot modes that omit the EFI stub.
>
> Evgeny's series [0] adapted the entire decompressor code flow to allow
> it to execute in the EFI context as well as the bare metal context, and
> this involves changes to the 1:1 mapping code and the page fault
> handlers etc, none of which are really needed when doing EFI boot in
> the
> first place.
>
> So this series attempts to occupy the middle ground here: it makes
> minimal changes to the existing decompressor so some of it can be
> called
> from the EFI stub. Then, it reimplements the EFI boot flow to
> decompress
> the kernel and boot it directly, without relying on the trampoline
> code,
> page table code or page fault handling code. This allows us to get rid
> of quite a bit of unsavory EFI stub code, and replace it with two clear
> invocations of the EFI firmware APIs to clear NX restrictions from
> allocations that have been populated with executable code.
>
> The only code that is being reused is the decompression library itself,
> along with the minimal ELF parsing that is required to copy the ELF
> segments in place, and the relocation processing that fixes up absolute
> symbol references to refer to the correct virtual addresses.
>
> Note that some of Evgeny's changes to clean up the PE/COFF header
> generation will still be needed, but I've omitted those here for
> brevity.
My series also implements W^X for both UEFI and non-UEFI boot paths, but
I
agree that we can just consider non-UEFI code legacy and it would be
better
to avoid touching it and encourage everyone to use UEFI code path on x86
instead. If PE format will also get fixed with either my patches or some
others, I do like your approach more than mine, as it removes a lot of
old
cruft but does not break things (as far as I see). Seems like a perfect
compromise between [1] and my approach.
I've briefly tested the patches and looked through them and they look
good
to me. Two things I've noticed:
* there's one TSC-related TODO;
* probably we want to clear .bss in efi32_stub_entry and
efi64_stub_entry
for UEFI handover protocol, since it's unfortunately still present
and
.bss will contain garbage.
I'll probably do some more testing on the weekend and let you know if I
find something.
Please tell me if/when you are going to merge these or similar, and I
will
clean up and rebase PE-related patches on top of these.
I'd also like to send W^X patches for EFISTUB (omitting the non-UEFI
boot
path) as a follow up after the PE file header will get fixed. They will
be
considerably smaller with this approach and will not touch legacy code.
Thanks,
Evgeniy Baskov
>
> Cc: Evgeniy Baskov <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Alexey Khoroshilov <[email protected]>
> Cc: Peter Jones <[email protected]>
> Cc: Gerd Hoffmann <[email protected]>
> Cc: Dave Young <[email protected]>
> Cc: Mario Limonciello <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Linus Torvalds <[email protected]>
>
> [0] https://lore.kernel.org/all/[email protected]/
> [1]
> https://lore.kernel.org/all/[email protected]/
>
> Ard Biesheuvel (6):
> x86: decompressor: Move global symbol references to C code
> x86: decompressor: Factor out kernel decompression and relocation
> x86: efistub: Obtain ACPI RSDP address while running in the stub
> x86: efistub: Perform 4/5 level paging switch from the stub
> x86: efistub: Prefer EFI memory attributes protocol over DXE services
> x86: efistub: Avoid legacy decompressor when doing EFI boot
>
> arch/x86/boot/compressed/efi_mixed.S | 55 ---
> arch/x86/boot/compressed/head_32.S | 24 --
> arch/x86/boot/compressed/head_64.S | 39 +--
> arch/x86/boot/compressed/misc.c | 44 ++-
> arch/x86/include/asm/efi.h | 2 +
> drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
> drivers/firmware/efi/libstub/x86-stub.c | 360
> +++++++++++++-------
> 7 files changed, 279 insertions(+), 249 deletions(-)
On Fri, 28 Apr 2023 at 14:23, Evgeniy Baskov <[email protected]> wrote:
>
> On 2023-04-24 19:57, Ard Biesheuvel wrote:
> > This series is conceptually a combination of Evgeny's series [0] and
> > mine [1], both of which attempt to make the early decompressor code
> > more
> > amenable to executing in the EFI environment with stricter handling of
> > memory permissions.
> >
> > My series [1] implemented zboot for x86, by getting rid of the entire
> > x86 decompressor, and replacing it with existing EFI code that does the
> > same but in a generic way. The downside of this is that only EFI boot
> > is
> > supported, making it unviable for distros, which need to support BIOS
> > boot and hybrid EFI boot modes that omit the EFI stub.
> >
> > Evgeny's series [0] adapted the entire decompressor code flow to allow
> > it to execute in the EFI context as well as the bare metal context, and
> > this involves changes to the 1:1 mapping code and the page fault
> > handlers etc, none of which are really needed when doing EFI boot in
> > the
> > first place.
> >
> > So this series attempts to occupy the middle ground here: it makes
> > minimal changes to the existing decompressor so some of it can be
> > called
> > from the EFI stub. Then, it reimplements the EFI boot flow to
> > decompress
> > the kernel and boot it directly, without relying on the trampoline
> > code,
> > page table code or page fault handling code. This allows us to get rid
> > of quite a bit of unsavory EFI stub code, and replace it with two clear
> > invocations of the EFI firmware APIs to clear NX restrictions from
> > allocations that have been populated with executable code.
> >
> > The only code that is being reused is the decompression library itself,
> > along with the minimal ELF parsing that is required to copy the ELF
> > segments in place, and the relocation processing that fixes up absolute
> > symbol references to refer to the correct virtual addresses.
> >
> > Note that some of Evgeny's changes to clean up the PE/COFF header
> > generation will still be needed, but I've omitted those here for
> > brevity.
>
> My series also implements W^X for both UEFI and non-UEFI boot paths, but
> I
> agree that we can just consider non-UEFI code legacy and it would be
> better
> to avoid touching it and encourage everyone to use UEFI code path on x86
> instead. If PE format will also get fixed with either my patches or some
> others, I do like your approach more than mine, as it removes a lot of
> old
> cruft but does not break things (as far as I see). Seems like a perfect
> compromise between [1] and my approach.
>
Thanks, I'm glad we agree.
> I've briefly tested the patches and looked through them and they look
> good
> to me. Two things I've noticed:
> * there's one TSC-related TODO;
> * probably we want to clear .bss in efi32_stub_entry and
> efi64_stub_entry
> for UEFI handover protocol, since it's unfortunately still present
> and
> .bss will contain garbage.
I'll look into that.
> I'll probably do some more testing on the weekend and let you know if I
> find something.
>
Yes, please.
> Please tell me if/when you are going to merge these or similar, and I
> will
> clean up and rebase PE-related patches on top of these.
>
> I'd also like to send W^X patches for EFISTUB (omitting the non-UEFI
> boot
> path) as a follow up after the PE file header will get fixed. They will
> be
> considerably smaller with this approach and will not touch legacy code.
>
On 4/24/23 11:57, Ard Biesheuvel wrote:
> This series is conceptually a combination of Evgeny's series [0] and
> mine [1], both of which attempt to make the early decompressor code more
> amenable to executing in the EFI environment with stricter handling of
> memory permissions.
>
> My series [1] implemented zboot for x86, by getting rid of the entire
> x86 decompressor, and replacing it with existing EFI code that does the
> same but in a generic way. The downside of this is that only EFI boot is
> supported, making it unviable for distros, which need to support BIOS
> boot and hybrid EFI boot modes that omit the EFI stub.
>
> Evgeny's series [0] adapted the entire decompressor code flow to allow
> it to execute in the EFI context as well as the bare metal context, and
> this involves changes to the 1:1 mapping code and the page fault
> handlers etc, none of which are really needed when doing EFI boot in the
> first place.
>
> So this series attempts to occupy the middle ground here: it makes
> minimal changes to the existing decompressor so some of it can be called
> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
> the kernel and boot it directly, without relying on the trampoline code,
> page table code or page fault handling code. This allows us to get rid
> of quite a bit of unsavory EFI stub code, and replace it with two clear
> invocations of the EFI firmware APIs to clear NX restrictions from
> allocations that have been populated with executable code.
>
> The only code that is being reused is the decompression library itself,
> along with the minimal ELF parsing that is required to copy the ELF
> segments in place, and the relocation processing that fixes up absolute
> symbol references to refer to the correct virtual addresses.
>
> Note that some of Evgeny's changes to clean up the PE/COFF header
> generation will still be needed, but I've omitted those here for
> brevity.
I tried booting an SEV and an SEV-ES guest using this and both failed to boot:
EFI stub: WARNING: Decompression failed: Out of memory while allocating
z_stream
I'll have to take a closer look as to why, but it might be a couple of
days before I can get to it.
Thanks,
Tom
>
> Cc: Evgeniy Baskov <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Alexey Khoroshilov <[email protected]>
> Cc: Peter Jones <[email protected]>
> Cc: Gerd Hoffmann <[email protected]>
> Cc: Dave Young <[email protected]>
> Cc: Mario Limonciello <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Linus Torvalds <[email protected]>
>
> [0] https://lore.kernel.org/all/[email protected]/
> [1] https://lore.kernel.org/all/[email protected]/
>
> Ard Biesheuvel (6):
> x86: decompressor: Move global symbol references to C code
> x86: decompressor: Factor out kernel decompression and relocation
> x86: efistub: Obtain ACPI RSDP address while running in the stub
> x86: efistub: Perform 4/5 level paging switch from the stub
> x86: efistub: Prefer EFI memory attributes protocol over DXE services
> x86: efistub: Avoid legacy decompressor when doing EFI boot
>
> arch/x86/boot/compressed/efi_mixed.S | 55 ---
> arch/x86/boot/compressed/head_32.S | 24 --
> arch/x86/boot/compressed/head_64.S | 39 +--
> arch/x86/boot/compressed/misc.c | 44 ++-
> arch/x86/include/asm/efi.h | 2 +
> drivers/firmware/efi/libstub/efi-stub-helper.c | 4 +
> drivers/firmware/efi/libstub/x86-stub.c | 360 +++++++++++++-------
> 7 files changed, 279 insertions(+), 249 deletions(-)
>
On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
>
> On 4/24/23 11:57, Ard Biesheuvel wrote:
> > This series is conceptually a combination of Evgeny's series [0] and
> > mine [1], both of which attempt to make the early decompressor code more
> > amenable to executing in the EFI environment with stricter handling of
> > memory permissions.
> >
> > My series [1] implemented zboot for x86, by getting rid of the entire
> > x86 decompressor, and replacing it with existing EFI code that does the
> > same but in a generic way. The downside of this is that only EFI boot is
> > supported, making it unviable for distros, which need to support BIOS
> > boot and hybrid EFI boot modes that omit the EFI stub.
> >
> > Evgeny's series [0] adapted the entire decompressor code flow to allow
> > it to execute in the EFI context as well as the bare metal context, and
> > this involves changes to the 1:1 mapping code and the page fault
> > handlers etc, none of which are really needed when doing EFI boot in the
> > first place.
> >
> > So this series attempts to occupy the middle ground here: it makes
> > minimal changes to the existing decompressor so some of it can be called
> > from the EFI stub. Then, it reimplements the EFI boot flow to decompress
> > the kernel and boot it directly, without relying on the trampoline code,
> > page table code or page fault handling code. This allows us to get rid
> > of quite a bit of unsavory EFI stub code, and replace it with two clear
> > invocations of the EFI firmware APIs to clear NX restrictions from
> > allocations that have been populated with executable code.
> >
> > The only code that is being reused is the decompression library itself,
> > along with the minimal ELF parsing that is required to copy the ELF
> > segments in place, and the relocation processing that fixes up absolute
> > symbol references to refer to the correct virtual addresses.
> >
> > Note that some of Evgeny's changes to clean up the PE/COFF header
> > generation will still be needed, but I've omitted those here for
> > brevity.
>
> I tried booting an SEV and an SEV-ES guest using this and both failed to boot:
>
> EFI stub: WARNING: Decompression failed: Out of memory while allocating
> z_stream
>
> I'll have to take a closer look as to why, but it might be a couple of
> days before I can get to it.
>
Thanks Tom.
The internal malloc() seems to be failing, which is often caused by
BSS clearing problems. Could you elaborate a little bit on the boot
environment you are using here?
On 5/2/23 08:39, Ard Biesheuvel wrote:
> On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
>>
>> On 4/24/23 11:57, Ard Biesheuvel wrote:
>>> This series is conceptually a combination of Evgeny's series [0] and
>>> mine [1], both of which attempt to make the early decompressor code more
>>> amenable to executing in the EFI environment with stricter handling of
>>> memory permissions.
>>>
>>> My series [1] implemented zboot for x86, by getting rid of the entire
>>> x86 decompressor, and replacing it with existing EFI code that does the
>>> same but in a generic way. The downside of this is that only EFI boot is
>>> supported, making it unviable for distros, which need to support BIOS
>>> boot and hybrid EFI boot modes that omit the EFI stub.
>>>
>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
>>> it to execute in the EFI context as well as the bare metal context, and
>>> this involves changes to the 1:1 mapping code and the page fault
>>> handlers etc, none of which are really needed when doing EFI boot in the
>>> first place.
>>>
>>> So this series attempts to occupy the middle ground here: it makes
>>> minimal changes to the existing decompressor so some of it can be called
>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
>>> the kernel and boot it directly, without relying on the trampoline code,
>>> page table code or page fault handling code. This allows us to get rid
>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
>>> invocations of the EFI firmware APIs to clear NX restrictions from
>>> allocations that have been populated with executable code.
>>>
>>> The only code that is being reused is the decompression library itself,
>>> along with the minimal ELF parsing that is required to copy the ELF
>>> segments in place, and the relocation processing that fixes up absolute
>>> symbol references to refer to the correct virtual addresses.
>>>
>>> Note that some of Evgeny's changes to clean up the PE/COFF header
>>> generation will still be needed, but I've omitted those here for
>>> brevity.
>>
>> I tried booting an SEV and an SEV-ES guest using this and both failed to boot:
>>
>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
>> z_stream
>>
>> I'll have to take a closer look as to why, but it might be a couple of
>> days before I can get to it.
>>
>
> Thanks Tom.
>
> The internal malloc() seems to be failing, which is often caused by
> BSS clearing problems. Could you elaborate a little bit on the boot
> environment you are using here?
I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
host/hypervisor and guest kernel and the current OVMF tree built using
OvmfPkgX64.dsc.
I was originally using the current merge window Linux, but moved to the
release version just to . With the release version SEV and SEV-ES still fail to
boot, but SEV actually #GPs now. And some of the register contents look
like encrypted data:
ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
!!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - 00000000 !!!!
ExceptionData - 0000000000000000
RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
RSI - 0000000001000000, RDI - 1FBA02A45943DE68
R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
R14 - 0000000001000000, R15 - 000000007E0A5198
DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
GS - 0000000000000030, SS - 0000000000000030
CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
CR4 - 0000000000000668, CR8 - 0000000000000000
DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
FXSAVE_STATE - 000000007FD96F80
!!!! Find image based on IP(0x597E71C1) /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
So, yes, probably an area of memory that was zeroes when mapped
unencrypted, but wasn't cleared after changing the mapping to
encrypted.
Thanks,
Tom
On Tue, 2 May 2023 at 18:08, Tom Lendacky <[email protected]> wrote:
>
> On 5/2/23 08:39, Ard Biesheuvel wrote:
> > On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
> >>
> >> On 4/24/23 11:57, Ard Biesheuvel wrote:
> >>> This series is conceptually a combination of Evgeny's series [0] and
> >>> mine [1], both of which attempt to make the early decompressor code more
> >>> amenable to executing in the EFI environment with stricter handling of
> >>> memory permissions.
> >>>
> >>> My series [1] implemented zboot for x86, by getting rid of the entire
> >>> x86 decompressor, and replacing it with existing EFI code that does the
> >>> same but in a generic way. The downside of this is that only EFI boot is
> >>> supported, making it unviable for distros, which need to support BIOS
> >>> boot and hybrid EFI boot modes that omit the EFI stub.
> >>>
> >>> Evgeny's series [0] adapted the entire decompressor code flow to allow
> >>> it to execute in the EFI context as well as the bare metal context, and
> >>> this involves changes to the 1:1 mapping code and the page fault
> >>> handlers etc, none of which are really needed when doing EFI boot in the
> >>> first place.
> >>>
> >>> So this series attempts to occupy the middle ground here: it makes
> >>> minimal changes to the existing decompressor so some of it can be called
> >>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
> >>> the kernel and boot it directly, without relying on the trampoline code,
> >>> page table code or page fault handling code. This allows us to get rid
> >>> of quite a bit of unsavory EFI stub code, and replace it with two clear
> >>> invocations of the EFI firmware APIs to clear NX restrictions from
> >>> allocations that have been populated with executable code.
> >>>
> >>> The only code that is being reused is the decompression library itself,
> >>> along with the minimal ELF parsing that is required to copy the ELF
> >>> segments in place, and the relocation processing that fixes up absolute
> >>> symbol references to refer to the correct virtual addresses.
> >>>
> >>> Note that some of Evgeny's changes to clean up the PE/COFF header
> >>> generation will still be needed, but I've omitted those here for
> >>> brevity.
> >>
> >> I tried booting an SEV and an SEV-ES guest using this and both failed to boot:
> >>
> >> EFI stub: WARNING: Decompression failed: Out of memory while allocating
> >> z_stream
> >>
> >> I'll have to take a closer look as to why, but it might be a couple of
> >> days before I can get to it.
> >>
> >
> > Thanks Tom.
> >
> > The internal malloc() seems to be failing, which is often caused by
> > BSS clearing problems. Could you elaborate a little bit on the boot
> > environment you are using here?
>
> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
> host/hypervisor and guest kernel and the current OVMF tree built using
> OvmfPkgX64.dsc.
>
> I was originally using the current merge window Linux, but moved to the
> release version just to . With the release version SEV and SEV-ES still fail to
> boot, but SEV actually #GPs now. And some of the register contents look
> like encrypted data:
>
> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - 00000000 !!!!
> ExceptionData - 0000000000000000
> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
> RSI - 0000000001000000, RDI - 1FBA02A45943DE68
> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
> R14 - 0000000001000000, R15 - 000000007E0A5198
> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
> GS - 0000000000000030, SS - 0000000000000030
> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
> CR4 - 0000000000000668, CR8 - 0000000000000000
> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
> FXSAVE_STATE - 000000007FD96F80
> !!!! Find image based on IP(0x597E71C1) /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
>
> So, yes, probably an area of memory that was zeroes when mapped
> unencrypted, but wasn't cleared after changing the mapping to
> encrypted.
>
Thanks.
It seems I was a bit naive and underestimated the amount of SEV
related processing that goes on in the decompressor after the EFI stub
has handed over. I will have to take some time and go through this,
and decide whether there is a way we can share this code with the EFI
stub without introducing yet another permutation that requires testing
and maintenance.
Any suggestions on how to test this stuff is appreciated - does QEMU
emulate any of this? Does consumer-level AMD hardware implement the
pieces I'd need to run a SEV host with SNP support etc?
On 5/2/23 11:08, Tom Lendacky wrote:
> On 5/2/23 08:39, Ard Biesheuvel wrote:
>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
>>>
>>> On 4/24/23 11:57, Ard Biesheuvel wrote:
>>>> This series is conceptually a combination of Evgeny's series [0] and
>>>> mine [1], both of which attempt to make the early decompressor code more
>>>> amenable to executing in the EFI environment with stricter handling of
>>>> memory permissions.
>>>>
>>>> My series [1] implemented zboot for x86, by getting rid of the entire
>>>> x86 decompressor, and replacing it with existing EFI code that does the
>>>> same but in a generic way. The downside of this is that only EFI boot is
>>>> supported, making it unviable for distros, which need to support BIOS
>>>> boot and hybrid EFI boot modes that omit the EFI stub.
>>>>
>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
>>>> it to execute in the EFI context as well as the bare metal context, and
>>>> this involves changes to the 1:1 mapping code and the page fault
>>>> handlers etc, none of which are really needed when doing EFI boot in the
>>>> first place.
>>>>
>>>> So this series attempts to occupy the middle ground here: it makes
>>>> minimal changes to the existing decompressor so some of it can be called
>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
>>>> the kernel and boot it directly, without relying on the trampoline code,
>>>> page table code or page fault handling code. This allows us to get rid
>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
>>>> invocations of the EFI firmware APIs to clear NX restrictions from
>>>> allocations that have been populated with executable code.
>>>>
>>>> The only code that is being reused is the decompression library itself,
>>>> along with the minimal ELF parsing that is required to copy the ELF
>>>> segments in place, and the relocation processing that fixes up absolute
>>>> symbol references to refer to the correct virtual addresses.
>>>>
>>>> Note that some of Evgeny's changes to clean up the PE/COFF header
>>>> generation will still be needed, but I've omitted those here for
>>>> brevity.
>>>
>>> I tried booting an SEV and an SEV-ES guest using this and both failed
>>> to boot:
>>>
>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
>>> z_stream
>>>
>>> I'll have to take a closer look as to why, but it might be a couple of
>>> days before I can get to it.
>>>
>>
>> Thanks Tom.
>>
>> The internal malloc() seems to be failing, which is often caused by
>> BSS clearing problems. Could you elaborate a little bit on the boot
>> environment you are using here?
>
> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
> host/hypervisor and guest kernel and the current OVMF tree built using
> OvmfPkgX64.dsc.
>
> I was originally using the current merge window Linux, but moved to the
> release version just to . With the release version SEV and SEV-ES still
> fail to
> boot, but SEV actually #GPs now. And some of the register contents look
> like encrypted data:
>
> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
> !!!! X64 Exception Type - 0D(#GP - General Protection)Â CPU Apic ID -
> 00000000 !!!!
> ExceptionData - 0000000000000000
> RIPÂ - 00000000597E71C1, CSÂ - 0000000000000038, RFLAGS - 0000000000210206
> RAXÂ - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
> RBXÂ - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
> RSIÂ - 0000000001000000, RDI - 1FBA02A45943DE68
> R8Â Â - 0000000003EF3C94, R9Â - 0000000000000000, R10 - 000000007D7C6018
> R11Â - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
> R14Â - 0000000001000000, R15 - 000000007E0A5198
> DSÂ Â - 0000000000000030, ESÂ - 0000000000000030, FSÂ - 0000000000000030
> GSÂ Â - 0000000000000030, SSÂ - 0000000000000030
> CR0Â - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
> CR4Â - 0000000000000668, CR8 - 0000000000000000
> DR0Â - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
> DR3Â - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
> IDTR - 000000007F34C018 0000000000000FFF,  TR - 0000000000000000
> FXSAVE_STATE - 000000007FD96F80
> !!!! Find image based on IP(0x597E71C1)
> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
>
> So, yes, probably an area of memory that was zeroes when mapped
> unencrypted, but wasn't cleared after changing the mapping to
> encrypted.
Yes, looks like a bss clearing issue. If I add __section(".data") to
free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and
to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can
boot an SEV guest.
However, an SEV-ES guest is triple faulting. This looks to be because
we're still on the EFI CS of 0x38 after switching GDTs in
arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before
switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable)
and things blow up on the iretq. Moving the block headed by the comment
"Now switch to __KERNEL_CS so IRET works reliably" to just after calling
startup_64_setup_env() fixes it and an SEV-ES guest can boot.
This worked before because I believe we switched off the EFI CS as part of
the kernel decompressor support and so this bug wasn't exposed. But this
needs to be fixed regardless of this series.
Thanks,
Tom
>
> Thanks,
> Tom
>
>
On Wed, 3 May 2023 at 19:58, Tom Lendacky <[email protected]> wrote:
>
> On 5/2/23 11:08, Tom Lendacky wrote:
> > On 5/2/23 08:39, Ard Biesheuvel wrote:
> >> On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
> >>>
> >>> On 4/24/23 11:57, Ard Biesheuvel wrote:
> >>>> This series is conceptually a combination of Evgeny's series [0] and
> >>>> mine [1], both of which attempt to make the early decompressor code more
> >>>> amenable to executing in the EFI environment with stricter handling of
> >>>> memory permissions.
> >>>>
> >>>> My series [1] implemented zboot for x86, by getting rid of the entire
> >>>> x86 decompressor, and replacing it with existing EFI code that does the
> >>>> same but in a generic way. The downside of this is that only EFI boot is
> >>>> supported, making it unviable for distros, which need to support BIOS
> >>>> boot and hybrid EFI boot modes that omit the EFI stub.
> >>>>
> >>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
> >>>> it to execute in the EFI context as well as the bare metal context, and
> >>>> this involves changes to the 1:1 mapping code and the page fault
> >>>> handlers etc, none of which are really needed when doing EFI boot in the
> >>>> first place.
> >>>>
> >>>> So this series attempts to occupy the middle ground here: it makes
> >>>> minimal changes to the existing decompressor so some of it can be called
> >>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
> >>>> the kernel and boot it directly, without relying on the trampoline code,
> >>>> page table code or page fault handling code. This allows us to get rid
> >>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
> >>>> invocations of the EFI firmware APIs to clear NX restrictions from
> >>>> allocations that have been populated with executable code.
> >>>>
> >>>> The only code that is being reused is the decompression library itself,
> >>>> along with the minimal ELF parsing that is required to copy the ELF
> >>>> segments in place, and the relocation processing that fixes up absolute
> >>>> symbol references to refer to the correct virtual addresses.
> >>>>
> >>>> Note that some of Evgeny's changes to clean up the PE/COFF header
> >>>> generation will still be needed, but I've omitted those here for
> >>>> brevity.
> >>>
> >>> I tried booting an SEV and an SEV-ES guest using this and both failed
> >>> to boot:
> >>>
> >>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
> >>> z_stream
> >>>
> >>> I'll have to take a closer look as to why, but it might be a couple of
> >>> days before I can get to it.
> >>>
> >>
> >> Thanks Tom.
> >>
> >> The internal malloc() seems to be failing, which is often caused by
> >> BSS clearing problems. Could you elaborate a little bit on the boot
> >> environment you are using here?
> >
> > I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
> > host/hypervisor and guest kernel and the current OVMF tree built using
> > OvmfPkgX64.dsc.
> >
> > I was originally using the current merge window Linux, but moved to the
> > release version just to . With the release version SEV and SEV-ES still
> > fail to
> > boot, but SEV actually #GPs now. And some of the register contents look
> > like encrypted data:
> >
> > ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
> > !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID -
> > 00000000 !!!!
> > ExceptionData - 0000000000000000
> > RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
> > RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
> > RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
> > RSI - 0000000001000000, RDI - 1FBA02A45943DE68
> > R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
> > R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
> > R14 - 0000000001000000, R15 - 000000007E0A5198
> > DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
> > GS - 0000000000000030, SS - 0000000000000030
> > CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
> > CR4 - 0000000000000668, CR8 - 0000000000000000
> > DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
> > DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
> > GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
> > IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
> > FXSAVE_STATE - 000000007FD96F80
> > !!!! Find image based on IP(0x597E71C1)
> > /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
> > RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
> >
> > So, yes, probably an area of memory that was zeroes when mapped
> > unencrypted, but wasn't cleared after changing the mapping to
> > encrypted.
>
> Yes, looks like a bss clearing issue. If I add __section(".data") to
> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and
> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can
> boot an SEV guest.
>
> However, an SEV-ES guest is triple faulting. This looks to be because
> we're still on the EFI CS of 0x38 after switching GDTs in
> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before
> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable)
> and things blow up on the iretq. Moving the block headed by the comment
> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling
> startup_64_setup_env() fixes it and an SEV-ES guest can boot.
>
> This worked before because I believe we switched off the EFI CS as part of
> the kernel decompressor support and so this bug wasn't exposed. But this
> needs to be fixed regardless of this series.
>
Very interesting. I was under the assumption that everything that goes
on in sev_enable() in the decompressor would be rather indispensable
to boot in SEV mode (which I only spotted today) so I am quite
surprised that things just appear to work. (There is some 32-bit SEV
code in the decompressor as well that obviously never gets called when
booting in long mode, but I hadn't quite grasped how much other SEV
code there actually is)
On Wed, May 03, 2023 at 08:17:28PM +0200, Ard Biesheuvel wrote:
> (There is some 32-bit SEV code in the decompressor as well that
> obviously never gets called when booting in long mode, but I hadn't
> quite grasped how much other SEV code there actually is)
If you mean the thing in startup_32 which does get_sev_encryption_bit()
etc, that is going away.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Wed, 3 May 2023 at 20:24, Borislav Petkov <[email protected]> wrote:
>
> On Wed, May 03, 2023 at 08:17:28PM +0200, Ard Biesheuvel wrote:
> > (There is some 32-bit SEV code in the decompressor as well that
> > obviously never gets called when booting in long mode, but I hadn't
> > quite grasped how much other SEV code there actually is)
>
> If you mean the thing in startup_32 which does get_sev_encryption_bit()
> etc, that is going away.
>
OK, good to know.
On 5/3/23 13:17, Ard Biesheuvel wrote:
> On Wed, 3 May 2023 at 19:58, Tom Lendacky <[email protected]> wrote:
>>
>> On 5/2/23 11:08, Tom Lendacky wrote:
>>> On 5/2/23 08:39, Ard Biesheuvel wrote:
>>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
>>>>>
>>>>> On 4/24/23 11:57, Ard Biesheuvel wrote:
>>>>>> This series is conceptually a combination of Evgeny's series [0] and
>>>>>> mine [1], both of which attempt to make the early decompressor code more
>>>>>> amenable to executing in the EFI environment with stricter handling of
>>>>>> memory permissions.
>>>>>>
>>>>>> My series [1] implemented zboot for x86, by getting rid of the entire
>>>>>> x86 decompressor, and replacing it with existing EFI code that does the
>>>>>> same but in a generic way. The downside of this is that only EFI boot is
>>>>>> supported, making it unviable for distros, which need to support BIOS
>>>>>> boot and hybrid EFI boot modes that omit the EFI stub.
>>>>>>
>>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
>>>>>> it to execute in the EFI context as well as the bare metal context, and
>>>>>> this involves changes to the 1:1 mapping code and the page fault
>>>>>> handlers etc, none of which are really needed when doing EFI boot in the
>>>>>> first place.
>>>>>>
>>>>>> So this series attempts to occupy the middle ground here: it makes
>>>>>> minimal changes to the existing decompressor so some of it can be called
>>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
>>>>>> the kernel and boot it directly, without relying on the trampoline code,
>>>>>> page table code or page fault handling code. This allows us to get rid
>>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
>>>>>> invocations of the EFI firmware APIs to clear NX restrictions from
>>>>>> allocations that have been populated with executable code.
>>>>>>
>>>>>> The only code that is being reused is the decompression library itself,
>>>>>> along with the minimal ELF parsing that is required to copy the ELF
>>>>>> segments in place, and the relocation processing that fixes up absolute
>>>>>> symbol references to refer to the correct virtual addresses.
>>>>>>
>>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header
>>>>>> generation will still be needed, but I've omitted those here for
>>>>>> brevity.
>>>>>
>>>>> I tried booting an SEV and an SEV-ES guest using this and both failed
>>>>> to boot:
>>>>>
>>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
>>>>> z_stream
>>>>>
>>>>> I'll have to take a closer look as to why, but it might be a couple of
>>>>> days before I can get to it.
>>>>>
>>>>
>>>> Thanks Tom.
>>>>
>>>> The internal malloc() seems to be failing, which is often caused by
>>>> BSS clearing problems. Could you elaborate a little bit on the boot
>>>> environment you are using here?
>>>
>>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
>>> host/hypervisor and guest kernel and the current OVMF tree built using
>>> OvmfPkgX64.dsc.
>>>
>>> I was originally using the current merge window Linux, but moved to the
>>> release version just to . With the release version SEV and SEV-ES still
>>> fail to
>>> boot, but SEV actually #GPs now. And some of the register contents look
>>> like encrypted data:
>>>
>>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
>>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID -
>>> 00000000 !!!!
>>> ExceptionData - 0000000000000000
>>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
>>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
>>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
>>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68
>>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
>>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
>>> R14 - 0000000001000000, R15 - 000000007E0A5198
>>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
>>> GS - 0000000000000030, SS - 0000000000000030
>>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
>>> CR4 - 0000000000000668, CR8 - 0000000000000000
>>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
>>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
>>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
>>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
>>> FXSAVE_STATE - 000000007FD96F80
>>> !!!! Find image based on IP(0x597E71C1)
>>> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
>>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
>>>
>>> So, yes, probably an area of memory that was zeroes when mapped
>>> unencrypted, but wasn't cleared after changing the mapping to
>>> encrypted.
>>
>> Yes, looks like a bss clearing issue. If I add __section(".data") to
>> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and
>> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can
>> boot an SEV guest.
>>
>> However, an SEV-ES guest is triple faulting. This looks to be because
>> we're still on the EFI CS of 0x38 after switching GDTs in
>> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before
>> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable)
>> and things blow up on the iretq. Moving the block headed by the comment
>> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling
>> startup_64_setup_env() fixes it and an SEV-ES guest can boot.
>>
>> This worked before because I believe we switched off the EFI CS as part of
>> the kernel decompressor support and so this bug wasn't exposed. But this
>> needs to be fixed regardless of this series.
>>
>
> Very interesting. I was under the assumption that everything that goes
> on in sev_enable() in the decompressor would be rather indispensable
> to boot in SEV mode (which I only spotted today) so I am quite
> surprised that things just appear to work. (There is some 32-bit SEV
> code in the decompressor as well that obviously never gets called when
> booting in long mode, but I hadn't quite grasped how much other SEV
> code there actually is)
The sev_enable() function for SEV and SEV-ES is more for ensuring a proper
environment (ensuring the proper CPUID bits are present for the guest,
checking the GHCB protocol level, properly preparing pagetables, etc.).
Since we're still in EFI and using the EFI #VC handler and pagetables, we
don't require some of that stuff, but some of the it would need to be
performed at some point (I wasn't trying to be a malicious hypervisor).
I haven't tested SEV-SNP yet because the hypervisor support is not
upstream, yet, and I haven't applied your series to an SNP hypervisor
tree, yet. There is a lot of support in sev_enable() for SNP that may
likely cause a problem compared to SEV and SEV-ES.
SEV-SNP has a lot more checking and validation going on and it ensures it
gets the confidential computing blob from EFI into the boot parameters.
I'm not sure when I'll be able to test SNP, since I'm off next week and
trying to wrap up a bunch of stuff before I leave.
Thanks,
Tom
On 5/3/23 12:44, Ard Biesheuvel wrote:
> On Tue, 2 May 2023 at 18:08, Tom Lendacky <[email protected]> wrote:
>>
>> On 5/2/23 08:39, Ard Biesheuvel wrote:
>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
>>>>
>>>> On 4/24/23 11:57, Ard Biesheuvel wrote:
>>>>> This series is conceptually a combination of Evgeny's series [0] and
>>>>> mine [1], both of which attempt to make the early decompressor code more
>>>>> amenable to executing in the EFI environment with stricter handling of
>>>>> memory permissions.
>>>>>
>>>>> My series [1] implemented zboot for x86, by getting rid of the entire
>>>>> x86 decompressor, and replacing it with existing EFI code that does the
>>>>> same but in a generic way. The downside of this is that only EFI boot is
>>>>> supported, making it unviable for distros, which need to support BIOS
>>>>> boot and hybrid EFI boot modes that omit the EFI stub.
>>>>>
>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
>>>>> it to execute in the EFI context as well as the bare metal context, and
>>>>> this involves changes to the 1:1 mapping code and the page fault
>>>>> handlers etc, none of which are really needed when doing EFI boot in the
>>>>> first place.
>>>>>
>>>>> So this series attempts to occupy the middle ground here: it makes
>>>>> minimal changes to the existing decompressor so some of it can be called
>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
>>>>> the kernel and boot it directly, without relying on the trampoline code,
>>>>> page table code or page fault handling code. This allows us to get rid
>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
>>>>> invocations of the EFI firmware APIs to clear NX restrictions from
>>>>> allocations that have been populated with executable code.
>>>>>
>>>>> The only code that is being reused is the decompression library itself,
>>>>> along with the minimal ELF parsing that is required to copy the ELF
>>>>> segments in place, and the relocation processing that fixes up absolute
>>>>> symbol references to refer to the correct virtual addresses.
>>>>>
>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header
>>>>> generation will still be needed, but I've omitted those here for
>>>>> brevity.
>>>>
>>>> I tried booting an SEV and an SEV-ES guest using this and both failed to boot:
>>>>
>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
>>>> z_stream
>>>>
>>>> I'll have to take a closer look as to why, but it might be a couple of
>>>> days before I can get to it.
>>>>
>>>
>>> Thanks Tom.
>>>
>>> The internal malloc() seems to be failing, which is often caused by
>>> BSS clearing problems. Could you elaborate a little bit on the boot
>>> environment you are using here?
>>
>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
>> host/hypervisor and guest kernel and the current OVMF tree built using
>> OvmfPkgX64.dsc.
>>
>> I was originally using the current merge window Linux, but moved to the
>> release version just to . With the release version SEV and SEV-ES still fail to
>> boot, but SEV actually #GPs now. And some of the register contents look
>> like encrypted data:
>>
>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID - 00000000 !!!!
>> ExceptionData - 0000000000000000
>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68
>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
>> R14 - 0000000001000000, R15 - 000000007E0A5198
>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
>> GS - 0000000000000030, SS - 0000000000000030
>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
>> CR4 - 0000000000000668, CR8 - 0000000000000000
>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
>> FXSAVE_STATE - 000000007FD96F80
>> !!!! Find image based on IP(0x597E71C1) /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
>>
>> So, yes, probably an area of memory that was zeroes when mapped
>> unencrypted, but wasn't cleared after changing the mapping to
>> encrypted.
>>
>
> Thanks.
>
> It seems I was a bit naive and underestimated the amount of SEV
> related processing that goes on in the decompressor after the EFI stub
> has handed over. I will have to take some time and go through this,
> and decide whether there is a way we can share this code with the EFI
> stub without introducing yet another permutation that requires testing
> and maintenance.
>
> Any suggestions on how to test this stuff is appreciated - does QEMU
> emulate any of this? Does consumer-level AMD hardware implement the
> pieces I'd need to run a SEV host with SNP support etc?
Unfortunately Qemu doesn't emulate any of this and SEV support,
specifically the SEV firmware, isn't available on consumer-level parts, so
you would need an EPYC part to do SEV.
Thanks,
Tom
On Wed, 3 May 2023 at 20:49, Tom Lendacky <[email protected]> wrote:
>
> On 5/3/23 13:17, Ard Biesheuvel wrote:
> > On Wed, 3 May 2023 at 19:58, Tom Lendacky <[email protected]> wrote:
> >>
> >> On 5/2/23 11:08, Tom Lendacky wrote:
> >>> On 5/2/23 08:39, Ard Biesheuvel wrote:
> >>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
> >>>>>
> >>>>> On 4/24/23 11:57, Ard Biesheuvel wrote:
> >>>>>> This series is conceptually a combination of Evgeny's series [0] and
> >>>>>> mine [1], both of which attempt to make the early decompressor code more
> >>>>>> amenable to executing in the EFI environment with stricter handling of
> >>>>>> memory permissions.
> >>>>>>
> >>>>>> My series [1] implemented zboot for x86, by getting rid of the entire
> >>>>>> x86 decompressor, and replacing it with existing EFI code that does the
> >>>>>> same but in a generic way. The downside of this is that only EFI boot is
> >>>>>> supported, making it unviable for distros, which need to support BIOS
> >>>>>> boot and hybrid EFI boot modes that omit the EFI stub.
> >>>>>>
> >>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
> >>>>>> it to execute in the EFI context as well as the bare metal context, and
> >>>>>> this involves changes to the 1:1 mapping code and the page fault
> >>>>>> handlers etc, none of which are really needed when doing EFI boot in the
> >>>>>> first place.
> >>>>>>
> >>>>>> So this series attempts to occupy the middle ground here: it makes
> >>>>>> minimal changes to the existing decompressor so some of it can be called
> >>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
> >>>>>> the kernel and boot it directly, without relying on the trampoline code,
> >>>>>> page table code or page fault handling code. This allows us to get rid
> >>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
> >>>>>> invocations of the EFI firmware APIs to clear NX restrictions from
> >>>>>> allocations that have been populated with executable code.
> >>>>>>
> >>>>>> The only code that is being reused is the decompression library itself,
> >>>>>> along with the minimal ELF parsing that is required to copy the ELF
> >>>>>> segments in place, and the relocation processing that fixes up absolute
> >>>>>> symbol references to refer to the correct virtual addresses.
> >>>>>>
> >>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header
> >>>>>> generation will still be needed, but I've omitted those here for
> >>>>>> brevity.
> >>>>>
> >>>>> I tried booting an SEV and an SEV-ES guest using this and both failed
> >>>>> to boot:
> >>>>>
> >>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
> >>>>> z_stream
> >>>>>
> >>>>> I'll have to take a closer look as to why, but it might be a couple of
> >>>>> days before I can get to it.
> >>>>>
> >>>>
> >>>> Thanks Tom.
> >>>>
> >>>> The internal malloc() seems to be failing, which is often caused by
> >>>> BSS clearing problems. Could you elaborate a little bit on the boot
> >>>> environment you are using here?
> >>>
> >>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
> >>> host/hypervisor and guest kernel and the current OVMF tree built using
> >>> OvmfPkgX64.dsc.
> >>>
> >>> I was originally using the current merge window Linux, but moved to the
> >>> release version just to . With the release version SEV and SEV-ES still
> >>> fail to
> >>> boot, but SEV actually #GPs now. And some of the register contents look
> >>> like encrypted data:
> >>>
> >>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
> >>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID -
> >>> 00000000 !!!!
> >>> ExceptionData - 0000000000000000
> >>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
> >>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
> >>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
> >>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68
> >>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
> >>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
> >>> R14 - 0000000001000000, R15 - 000000007E0A5198
> >>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
> >>> GS - 0000000000000030, SS - 0000000000000030
> >>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
> >>> CR4 - 0000000000000668, CR8 - 0000000000000000
> >>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
> >>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
> >>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
> >>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
> >>> FXSAVE_STATE - 000000007FD96F80
> >>> !!!! Find image based on IP(0x597E71C1)
> >>> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
> >>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
> >>>
> >>> So, yes, probably an area of memory that was zeroes when mapped
> >>> unencrypted, but wasn't cleared after changing the mapping to
> >>> encrypted.
> >>
> >> Yes, looks like a bss clearing issue. If I add __section(".data") to
> >> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and
> >> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can
> >> boot an SEV guest.
> >>
> >> However, an SEV-ES guest is triple faulting. This looks to be because
> >> we're still on the EFI CS of 0x38 after switching GDTs in
> >> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before
> >> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable)
> >> and things blow up on the iretq. Moving the block headed by the comment
> >> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling
> >> startup_64_setup_env() fixes it and an SEV-ES guest can boot.
> >>
> >> This worked before because I believe we switched off the EFI CS as part of
> >> the kernel decompressor support and so this bug wasn't exposed. But this
> >> needs to be fixed regardless of this series.
> >>
> >
> > Very interesting. I was under the assumption that everything that goes
> > on in sev_enable() in the decompressor would be rather indispensable
> > to boot in SEV mode (which I only spotted today) so I am quite
> > surprised that things just appear to work. (There is some 32-bit SEV
> > code in the decompressor as well that obviously never gets called when
> > booting in long mode, but I hadn't quite grasped how much other SEV
> > code there actually is)
>
> The sev_enable() function for SEV and SEV-ES is more for ensuring a proper
> environment (ensuring the proper CPUID bits are present for the guest,
> checking the GHCB protocol level, properly preparing pagetables, etc.).
> Since we're still in EFI and using the EFI #VC handler and pagetables, we
> don't require some of that stuff, but some of the it would need to be
> performed at some point (I wasn't trying to be a malicious hypervisor).
>
OK, good to know.
> I haven't tested SEV-SNP yet because the hypervisor support is not
> upstream, yet, and I haven't applied your series to an SNP hypervisor
> tree, yet. There is a lot of support in sev_enable() for SNP that may
> likely cause a problem compared to SEV and SEV-ES.
>
> SEV-SNP has a lot more checking and validation going on and it ensures it
> gets the confidential computing blob from EFI into the boot parameters.
> I'm not sure when I'll be able to test SNP, since I'm off next week and
> trying to wrap up a bunch of stuff before I leave.
>
Would it make sense, given the apparent direct dependency on EFI, to
move that stuff into the EFI stub instead of doing it from the
decompressor?
Thanks for giving this a spin - I can wait for more testing until you
get back, and hopefully, we'll have converged a bit more by that time
and there will be a new revision of the series available for testing.
BTW any clue whether your boot path is relying on the EFI handover
protocol? I.e., QemuStartLegacyImage() in OVMF? Evgeny mentioned that
BSS needs clearing in that case, and I was wondering if that is the
case you are hitting.
(If you have a DEBUG OVMF boot log you could share, I can go through it myself)
On 5/3/23 13:59, Ard Biesheuvel wrote:
> On Wed, 3 May 2023 at 20:49, Tom Lendacky <[email protected]> wrote:
>>
>> On 5/3/23 13:17, Ard Biesheuvel wrote:
>>> On Wed, 3 May 2023 at 19:58, Tom Lendacky <[email protected]> wrote:
>>>>
>>>> On 5/2/23 11:08, Tom Lendacky wrote:
>>>>> On 5/2/23 08:39, Ard Biesheuvel wrote:
>>>>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
>>>>>>>
>>>>>>> On 4/24/23 11:57, Ard Biesheuvel wrote:
>>>>>>>> This series is conceptually a combination of Evgeny's series [0] and
>>>>>>>> mine [1], both of which attempt to make the early decompressor code more
>>>>>>>> amenable to executing in the EFI environment with stricter handling of
>>>>>>>> memory permissions.
>>>>>>>>
>>>>>>>> My series [1] implemented zboot for x86, by getting rid of the entire
>>>>>>>> x86 decompressor, and replacing it with existing EFI code that does the
>>>>>>>> same but in a generic way. The downside of this is that only EFI boot is
>>>>>>>> supported, making it unviable for distros, which need to support BIOS
>>>>>>>> boot and hybrid EFI boot modes that omit the EFI stub.
>>>>>>>>
>>>>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
>>>>>>>> it to execute in the EFI context as well as the bare metal context, and
>>>>>>>> this involves changes to the 1:1 mapping code and the page fault
>>>>>>>> handlers etc, none of which are really needed when doing EFI boot in the
>>>>>>>> first place.
>>>>>>>>
>>>>>>>> So this series attempts to occupy the middle ground here: it makes
>>>>>>>> minimal changes to the existing decompressor so some of it can be called
>>>>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
>>>>>>>> the kernel and boot it directly, without relying on the trampoline code,
>>>>>>>> page table code or page fault handling code. This allows us to get rid
>>>>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
>>>>>>>> invocations of the EFI firmware APIs to clear NX restrictions from
>>>>>>>> allocations that have been populated with executable code.
>>>>>>>>
>>>>>>>> The only code that is being reused is the decompression library itself,
>>>>>>>> along with the minimal ELF parsing that is required to copy the ELF
>>>>>>>> segments in place, and the relocation processing that fixes up absolute
>>>>>>>> symbol references to refer to the correct virtual addresses.
>>>>>>>>
>>>>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header
>>>>>>>> generation will still be needed, but I've omitted those here for
>>>>>>>> brevity.
>>>>>>>
>>>>>>> I tried booting an SEV and an SEV-ES guest using this and both failed
>>>>>>> to boot:
>>>>>>>
>>>>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
>>>>>>> z_stream
>>>>>>>
>>>>>>> I'll have to take a closer look as to why, but it might be a couple of
>>>>>>> days before I can get to it.
>>>>>>>
>>>>>>
>>>>>> Thanks Tom.
>>>>>>
>>>>>> The internal malloc() seems to be failing, which is often caused by
>>>>>> BSS clearing problems. Could you elaborate a little bit on the boot
>>>>>> environment you are using here?
>>>>>
>>>>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
>>>>> host/hypervisor and guest kernel and the current OVMF tree built using
>>>>> OvmfPkgX64.dsc.
>>>>>
>>>>> I was originally using the current merge window Linux, but moved to the
>>>>> release version just to . With the release version SEV and SEV-ES still
>>>>> fail to
>>>>> boot, but SEV actually #GPs now. And some of the register contents look
>>>>> like encrypted data:
>>>>>
>>>>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
>>>>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID -
>>>>> 00000000 !!!!
>>>>> ExceptionData - 0000000000000000
>>>>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
>>>>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
>>>>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
>>>>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68
>>>>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
>>>>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
>>>>> R14 - 0000000001000000, R15 - 000000007E0A5198
>>>>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
>>>>> GS - 0000000000000030, SS - 0000000000000030
>>>>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
>>>>> CR4 - 0000000000000668, CR8 - 0000000000000000
>>>>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
>>>>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
>>>>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
>>>>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
>>>>> FXSAVE_STATE - 000000007FD96F80
>>>>> !!!! Find image based on IP(0x597E71C1)
>>>>> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
>>>>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
>>>>>
>>>>> So, yes, probably an area of memory that was zeroes when mapped
>>>>> unencrypted, but wasn't cleared after changing the mapping to
>>>>> encrypted.
>>>>
>>>> Yes, looks like a bss clearing issue. If I add __section(".data") to
>>>> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and
>>>> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can
>>>> boot an SEV guest.
>>>>
>>>> However, an SEV-ES guest is triple faulting. This looks to be because
>>>> we're still on the EFI CS of 0x38 after switching GDTs in
>>>> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before
>>>> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable)
>>>> and things blow up on the iretq. Moving the block headed by the comment
>>>> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling
>>>> startup_64_setup_env() fixes it and an SEV-ES guest can boot.
>>>>
>>>> This worked before because I believe we switched off the EFI CS as part of
>>>> the kernel decompressor support and so this bug wasn't exposed. But this
>>>> needs to be fixed regardless of this series.
>>>>
>>>
>>> Very interesting. I was under the assumption that everything that goes
>>> on in sev_enable() in the decompressor would be rather indispensable
>>> to boot in SEV mode (which I only spotted today) so I am quite
>>> surprised that things just appear to work. (There is some 32-bit SEV
>>> code in the decompressor as well that obviously never gets called when
>>> booting in long mode, but I hadn't quite grasped how much other SEV
>>> code there actually is)
>>
>> The sev_enable() function for SEV and SEV-ES is more for ensuring a proper
>> environment (ensuring the proper CPUID bits are present for the guest,
>> checking the GHCB protocol level, properly preparing pagetables, etc.).
>> Since we're still in EFI and using the EFI #VC handler and pagetables, we
>> don't require some of that stuff, but some of the it would need to be
>> performed at some point (I wasn't trying to be a malicious hypervisor).
>>
>
> OK, good to know.
>
>> I haven't tested SEV-SNP yet because the hypervisor support is not
>> upstream, yet, and I haven't applied your series to an SNP hypervisor
>> tree, yet. There is a lot of support in sev_enable() for SNP that may
>> likely cause a problem compared to SEV and SEV-ES.
>>
>> SEV-SNP has a lot more checking and validation going on and it ensures it
>> gets the confidential computing blob from EFI into the boot parameters.
>> I'm not sure when I'll be able to test SNP, since I'm off next week and
>> trying to wrap up a bunch of stuff before I leave.
>>
>
> Would it make sense, given the apparent direct dependency on EFI, to
> move that stuff into the EFI stub instead of doing it from the
> decompressor?
We tried to make the support for SNP not be reliant on EFI. If EFI is
present, then get the confidential computing blob from it. Otherwise, the
cc blob can be supplied as part of the setup data in the Linux boot
protocol (see find_cc_block() in arch/x86/boot/compressed/sev.c).
So I guess some of it could be moved, but I'm not sure how much.
>
> Thanks for giving this a spin - I can wait for more testing until you
> get back, and hopefully, we'll have converged a bit more by that time
> and there will be a new revision of the series available for testing.
>
> BTW any clue whether your boot path is relying on the EFI handover
> protocol? I.e., QemuStartLegacyImage() in OVMF? Evgeny mentioned that
I don't believe it is. It's using the Grub EFI bootloader to load and boot
Linux.
> BSS needs clearing in that case, and I was wondering if that is the
> case you are hitting.
> (If you have a DEBUG OVMF boot log you could share, I can go through it myself)
Sure, I'll send you a copy in a separate email.
Thanks,
Tom
On Wed, 3 May 2023 at 23:23, Tom Lendacky <[email protected]> wrote:
>
> On 5/3/23 13:59, Ard Biesheuvel wrote:
> > On Wed, 3 May 2023 at 20:49, Tom Lendacky <[email protected]> wrote:
> >>
> >> On 5/3/23 13:17, Ard Biesheuvel wrote:
> >>> On Wed, 3 May 2023 at 19:58, Tom Lendacky <[email protected]> wrote:
> >>>>
> >>>> On 5/2/23 11:08, Tom Lendacky wrote:
> >>>>> On 5/2/23 08:39, Ard Biesheuvel wrote:
> >>>>>> On Tue, 2 May 2023 at 15:37, Tom Lendacky <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 4/24/23 11:57, Ard Biesheuvel wrote:
> >>>>>>>> This series is conceptually a combination of Evgeny's series [0] and
> >>>>>>>> mine [1], both of which attempt to make the early decompressor code more
> >>>>>>>> amenable to executing in the EFI environment with stricter handling of
> >>>>>>>> memory permissions.
> >>>>>>>>
> >>>>>>>> My series [1] implemented zboot for x86, by getting rid of the entire
> >>>>>>>> x86 decompressor, and replacing it with existing EFI code that does the
> >>>>>>>> same but in a generic way. The downside of this is that only EFI boot is
> >>>>>>>> supported, making it unviable for distros, which need to support BIOS
> >>>>>>>> boot and hybrid EFI boot modes that omit the EFI stub.
> >>>>>>>>
> >>>>>>>> Evgeny's series [0] adapted the entire decompressor code flow to allow
> >>>>>>>> it to execute in the EFI context as well as the bare metal context, and
> >>>>>>>> this involves changes to the 1:1 mapping code and the page fault
> >>>>>>>> handlers etc, none of which are really needed when doing EFI boot in the
> >>>>>>>> first place.
> >>>>>>>>
> >>>>>>>> So this series attempts to occupy the middle ground here: it makes
> >>>>>>>> minimal changes to the existing decompressor so some of it can be called
> >>>>>>>> from the EFI stub. Then, it reimplements the EFI boot flow to decompress
> >>>>>>>> the kernel and boot it directly, without relying on the trampoline code,
> >>>>>>>> page table code or page fault handling code. This allows us to get rid
> >>>>>>>> of quite a bit of unsavory EFI stub code, and replace it with two clear
> >>>>>>>> invocations of the EFI firmware APIs to clear NX restrictions from
> >>>>>>>> allocations that have been populated with executable code.
> >>>>>>>>
> >>>>>>>> The only code that is being reused is the decompression library itself,
> >>>>>>>> along with the minimal ELF parsing that is required to copy the ELF
> >>>>>>>> segments in place, and the relocation processing that fixes up absolute
> >>>>>>>> symbol references to refer to the correct virtual addresses.
> >>>>>>>>
> >>>>>>>> Note that some of Evgeny's changes to clean up the PE/COFF header
> >>>>>>>> generation will still be needed, but I've omitted those here for
> >>>>>>>> brevity.
> >>>>>>>
> >>>>>>> I tried booting an SEV and an SEV-ES guest using this and both failed
> >>>>>>> to boot:
> >>>>>>>
> >>>>>>> EFI stub: WARNING: Decompression failed: Out of memory while allocating
> >>>>>>> z_stream
> >>>>>>>
> >>>>>>> I'll have to take a closer look as to why, but it might be a couple of
> >>>>>>> days before I can get to it.
> >>>>>>>
> >>>>>>
> >>>>>> Thanks Tom.
> >>>>>>
> >>>>>> The internal malloc() seems to be failing, which is often caused by
> >>>>>> BSS clearing problems. Could you elaborate a little bit on the boot
> >>>>>> environment you are using here?
> >>>>>
> >>>>> I'm using Qemu v7.2.1 as my VMM, Linux 6.3 with your series applied for my
> >>>>> host/hypervisor and guest kernel and the current OVMF tree built using
> >>>>> OvmfPkgX64.dsc.
> >>>>>
> >>>>> I was originally using the current merge window Linux, but moved to the
> >>>>> release version just to . With the release version SEV and SEV-ES still
> >>>>> fail to
> >>>>> boot, but SEV actually #GPs now. And some of the register contents look
> >>>>> like encrypted data:
> >>>>>
> >>>>> ConvertPages: range 1000000 - 4FA1FFF covers multiple entries
> >>>>> !!!! X64 Exception Type - 0D(#GP - General Protection) CPU Apic ID -
> >>>>> 00000000 !!!!
> >>>>> ExceptionData - 0000000000000000
> >>>>> RIP - 00000000597E71C1, CS - 0000000000000038, RFLAGS - 0000000000210206
> >>>>> RAX - 1FBA02A45943B920, RCX - 0000000000AF7009, RDX - A9DAE761B64A1F1B
> >>>>> RBX - 1FBA02A45943B8C0, RSP - 000000007FD97320, RBP - 0000000002000000
> >>>>> RSI - 0000000001000000, RDI - 1FBA02A45943DE68
> >>>>> R8 - 0000000003EF3C94, R9 - 0000000000000000, R10 - 000000007D7C6018
> >>>>> R11 - 0000000000000000, R12 - 0000000001000000, R13 - 00000000597EDD98
> >>>>> R14 - 0000000001000000, R15 - 000000007E0A5198
> >>>>> DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
> >>>>> GS - 0000000000000030, SS - 0000000000000030
> >>>>> CR0 - 0000000080010033, CR2 - 0000000000000000, CR3 - 000000007FA01000
> >>>>> CR4 - 0000000000000668, CR8 - 0000000000000000
> >>>>> DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
> >>>>> DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
> >>>>> GDTR - 000000007F7DC000 0000000000000047, LDTR - 0000000000000000
> >>>>> IDTR - 000000007F34C018 0000000000000FFF, TR - 0000000000000000
> >>>>> FXSAVE_STATE - 000000007FD96F80
> >>>>> !!!! Find image based on IP(0x597E71C1)
> >>>>> /root/kernels/ovmf-build-X64/Build/OvmfX64/DEBUG_GCC5/X64/MdeModulePkg/Universal/Variable/RuntimeDxe/VariableRuntimeDxe/DEBUG/Variable
> >>>>> RuntimeDxe.dll (ImageBase=0000000000D4792C, EntryPoint=0000000000D50CC3) !!!!
> >>>>>
> >>>>> So, yes, probably an area of memory that was zeroes when mapped
> >>>>> unencrypted, but wasn't cleared after changing the mapping to
> >>>>> encrypted.
> >>>>
> >>>> Yes, looks like a bss clearing issue. If I add __section(".data") to
> >>>> free_mem_ptr and free_mem_end_ptr in arch/x86/boot/compressed/misc.c and
> >>>> to malloc_ptr and malloc_cnt in include/linux/decompress/mm.h, then I can
> >>>> boot an SEV guest.
> >>>>
> >>>> However, an SEV-ES guest is triple faulting. This looks to be because
> >>>> we're still on the EFI CS of 0x38 after switching GDTs in
> >>>> arch/x86/kernel/head_64.S by calling startup_64_setup_env(). Before
> >>>> switching to the kernel CS, we take a #VC (from CPUID calls in sme_enable)
> >>>> and things blow up on the iretq. Moving the block headed by the comment
> >>>> "Now switch to __KERNEL_CS so IRET works reliably" to just after calling
> >>>> startup_64_setup_env() fixes it and an SEV-ES guest can boot.
> >>>>
> >>>> This worked before because I believe we switched off the EFI CS as part of
> >>>> the kernel decompressor support and so this bug wasn't exposed. But this
> >>>> needs to be fixed regardless of this series.
> >>>>
> >>>
> >>> Very interesting. I was under the assumption that everything that goes
> >>> on in sev_enable() in the decompressor would be rather indispensable
> >>> to boot in SEV mode (which I only spotted today) so I am quite
> >>> surprised that things just appear to work. (There is some 32-bit SEV
> >>> code in the decompressor as well that obviously never gets called when
> >>> booting in long mode, but I hadn't quite grasped how much other SEV
> >>> code there actually is)
> >>
> >> The sev_enable() function for SEV and SEV-ES is more for ensuring a proper
> >> environment (ensuring the proper CPUID bits are present for the guest,
> >> checking the GHCB protocol level, properly preparing pagetables, etc.).
> >> Since we're still in EFI and using the EFI #VC handler and pagetables, we
> >> don't require some of that stuff, but some of the it would need to be
> >> performed at some point (I wasn't trying to be a malicious hypervisor).
> >>
> >
> > OK, good to know.
> >
> >> I haven't tested SEV-SNP yet because the hypervisor support is not
> >> upstream, yet, and I haven't applied your series to an SNP hypervisor
> >> tree, yet. There is a lot of support in sev_enable() for SNP that may
> >> likely cause a problem compared to SEV and SEV-ES.
> >>
> >> SEV-SNP has a lot more checking and validation going on and it ensures it
> >> gets the confidential computing blob from EFI into the boot parameters.
> >> I'm not sure when I'll be able to test SNP, since I'm off next week and
> >> trying to wrap up a bunch of stuff before I leave.
> >>
> >
> > Would it make sense, given the apparent direct dependency on EFI, to
> > move that stuff into the EFI stub instead of doing it from the
> > decompressor?
>
> We tried to make the support for SNP not be reliant on EFI. If EFI is
> present, then get the confidential computing blob from it. Otherwise, the
> cc blob can be supplied as part of the setup data in the Linux boot
> protocol (see find_cc_block() in arch/x86/boot/compressed/sev.c).
>
> So I guess some of it could be moved, but I'm not sure how much.
>
Fair enough.
> >
> > Thanks for giving this a spin - I can wait for more testing until you
> > get back, and hopefully, we'll have converged a bit more by that time
> > and there will be a new revision of the series available for testing.
> >
> > BTW any clue whether your boot path is relying on the EFI handover
> > protocol? I.e., QemuStartLegacyImage() in OVMF? Evgeny mentioned that
>
> I don't believe it is. It's using the Grub EFI bootloader to load and boot
> Linux.
>
Looking at the log you just sent (thanks), it appears you are using
distro shim+grub, which implement their own rough approximation of PE
file loading, and typically use the EFI handover protocol. This is
good news, because we already know how broken shim and GRUB are, and I
was concerned that there might be another issue to track down here.
I'll implement Evgeny's suggestion to clear BSS explicitly in the EFI
handover protocol entrypoints for the next revision of the series.
Thanks,