2024-02-27 15:19:32

by Ard Biesheuvel

Subject: [PATCH v7 0/9] x86: Confine early 1:1 mapped startup code

From: Ard Biesheuvel <[email protected]>

This is resend #2 of v5 [0] with some touchups applied.

Changes since v6:
- Drop flawed patch to move some SME/SEV related calls out of the early
boot path to avoid the potential need for backporting patches #6/#7
to kernels where SEV support may not be crucial. This problem will be
dealt with if/when it arises while doing those backports.

Changes since v5:
- drop patches that have been merged
- rebase onto latest tip/x86/boot
- fix comment regarding CR4.PGE wrt flushing of global TLB entries
- avoid adding startup code to .noinstr.text as it triggers objtool
warnings

[0] https://lore.kernel.org/all/[email protected]/

Cc: Kevin Loughlin <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Dionna Glaze <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>

Ard Biesheuvel (9):
x86/startup_64: Simplify CR4 handling in startup code
x86/startup_64: Defer assignment of 5-level paging global variables
x86/startup_64: Simplify calculation of initial page table address
x86/startup_64: Simplify virtual switch on primary boot
efi/libstub: Add generic support for parsing mem_encrypt=
x86/boot: Move mem_encrypt= parsing to the decompressor
x86/sme: Move early SME kernel encryption handling into .head.text
x86/sev: Move early startup code into .head.text section
x86/startup_64: Drop global variables keeping track of LA57 state

arch/x86/boot/compressed/misc.c | 15 ++++
arch/x86/boot/compressed/misc.h | 4 -
arch/x86/boot/compressed/pgtable_64.c | 12 ---
arch/x86/boot/compressed/sev.c | 3 +
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/include/asm/mem_encrypt.h | 8 +-
arch/x86/include/asm/pgtable_64_types.h | 43 ++++-----
arch/x86/include/asm/sev.h | 10 +--
arch/x86/include/uapi/asm/bootparam.h | 1 +
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/head64.c | 61 ++-----------
arch/x86/kernel/head_64.S | 93 ++++++++------------
arch/x86/kernel/sev-shared.c | 23 +++--
arch/x86/kernel/sev.c | 14 +--
arch/x86/lib/Makefile | 13 ---
arch/x86/mm/kasan_init_64.c | 3 -
arch/x86/mm/mem_encrypt_identity.c | 83 +++++------------
drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++
drivers/firmware/efi/libstub/efistub.h | 2 +-
drivers/firmware/efi/libstub/x86-stub.c | 3 +
20 files changed, 147 insertions(+), 255 deletions(-)

--
2.44.0.rc1.240.g4c46232300-goog



2024-02-27 15:19:39

by Ard Biesheuvel

Subject: [PATCH v7 1/9] x86/startup_64: Simplify CR4 handling in startup code

From: Ard Biesheuvel <[email protected]>

When paging is enabled, the CR4.PAE and CR4.LA57 control bits cannot be
changed, and so they can simply be preserved, rather than reasoning
about whether or not they need to be set. CR4.MCE should be preserved
unless the kernel was built without CONFIG_X86_MCE, in which case it
must be cleared.

CR4.PSE should be set explicitly, regardless of whether or not it was
set before.

CR4.PGE is currently set explicitly, and then cleared and set again
after programming CR3, in order to flush TLB entries based on global
translations. This makes the first assignment redundant, so it can be
omitted: clear PGE by omitting it from the preserve mask, and set it
again explicitly after switching to the new page tables.
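
The net effect of the change below can be summarized in C as follows
(an illustrative sketch only, not code from the patch; the read/write
helpers stand in for the MOVs to and from %cr4/%cr3):

        unsigned long mask = X86_CR4_PAE | X86_CR4_LA57; /* fixed while paging is on */

        if (IS_ENABLED(CONFIG_X86_MCE))
                mask |= X86_CR4_MCE;            /* keep #MC enablement, if configured */

        cr4 = read_cr4() & mask;                /* PGE omitted: global TLB entries go */
        cr4 |= X86_CR4_PSE;                     /* set PSE uniformly on all CPUs */
        write_cr4(cr4);
        write_cr3(pgt);                         /* switch to the new page tables */
        write_cr4(cr4 | X86_CR4_PGE);           /* re-enable global translations */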

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 30 ++++++++------------
1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d295bf68bf94..1b054585bfd1 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -185,6 +185,11 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

+ /*
+ * Create a mask of CR4 bits to preserve. Omit PGE in order to clean
+ * global 1:1 translations from the TLBs.
+ */
+ movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
#ifdef CONFIG_X86_MCE
/*
* Preserve CR4.MCE if the kernel will enable #MC support.
@@ -193,20 +198,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
* configured will crash the system regardless of the CR4.MCE value set
* here.
*/
- movq %cr4, %rcx
- andl $X86_CR4_MCE, %ecx
-#else
- movl $0, %ecx
+ orl $X86_CR4_MCE, %edx
#endif
+ movq %cr4, %rcx
+ andl %edx, %ecx

- /* Enable PAE mode, PSE, PGE and LA57 */
- orl $(X86_CR4_PAE | X86_CR4_PSE | X86_CR4_PGE), %ecx
-#ifdef CONFIG_X86_5LEVEL
- testb $1, __pgtable_l5_enabled(%rip)
- jz 1f
- orl $X86_CR4_LA57, %ecx
-1:
-#endif
+ /* Even if ignored in long mode, set PSE uniformly on all logical CPUs. */
+ btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

/* Setup early boot stage 4-/5-level pagetables. */
@@ -223,14 +221,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
movq %rax, %cr3

/*
- * Do a global TLB flush after the CR3 switch to make sure the TLB
- * entries from the identity mapping are flushed.
+ * Set CR4.PGE to re-enable global translations.
*/
- movq %cr4, %rcx
- movq %rcx, %rax
- xorq $X86_CR4_PGE, %rcx
+ btsl $X86_CR4_PGE_BIT, %ecx
movq %rcx, %cr4
- movq %rax, %cr4

/* Ensure I am executing from virtual addresses */
movq $1f, %rax
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-27 15:20:09

by Ard Biesheuvel

Subject: [PATCH v7 3/9] x86/startup_64: Simplify calculation of initial page table address

From: Ard Biesheuvel <[email protected]>

Determining the address of the initial page table to program into CR3
involves:
- taking the physical address
- adding the SME encryption mask

On the primary entry path, the code is mapped using a 1:1 virtual to
physical translation, so the physical address can be taken directly
using a RIP-relative LEA instruction.

On the secondary entry path, the address can be obtained by taking the
offset from the virtual kernel base (__START_KERNEL_map) and adding the
physical kernel base.

This is implemented in a slightly confusing way, so clean this up.
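
In other words, the two entry paths reduce to the following (informal
pseudocode, distilled from the diff below; on the primary path, %rax
holds the SME mask returned by __startup_64()):

        /* primary boot, running from the 1:1 mapping: */
        cr3 = rip_relative_address(early_top_pgt) + sme_mask;

        /* secondary boot, running from the kernel virtual mapping: */
        cr3 = phys_base + (init_top_pgt - __START_KERNEL_map) + sme_me_mask;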

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 25 ++++++--------------
1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 1b054585bfd1..c451a72bc92b 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -111,13 +111,11 @@ SYM_CODE_START_NOALIGN(startup_64)
call __startup_64

/* Form the CR3 value being sure to include the CR3 modifier */
- addq $(early_top_pgt - __START_KERNEL_map), %rax
+ leaq early_top_pgt(%rip), %rcx
+ addq %rcx, %rax

#ifdef CONFIG_AMD_MEM_ENCRYPT
mov %rax, %rdi
- mov %rax, %r14
-
- addq phys_base(%rip), %rdi

/*
* For SEV guests: Verify that the C-bit is correct. A malicious
@@ -126,12 +124,6 @@ SYM_CODE_START_NOALIGN(startup_64)
* the next RET instruction.
*/
call sev_verify_cbit
-
- /*
- * Restore CR3 value without the phys_base which will be added
- * below, before writing %cr3.
- */
- mov %r14, %rax
#endif

jmp 1f
@@ -171,18 +163,18 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Clear %R15 which holds the boot_params pointer on the boot CPU */
xorq %r15, %r15

+ /* Derive the runtime physical address of init_top_pgt[] */
+ movq phys_base(%rip), %rax
+ addq $(init_top_pgt - __START_KERNEL_map), %rax
+
/*
* Retrieve the modifier (SME encryption mask if SME is active) to be
* added to the initial pgdir entry that will be programmed into CR3.
*/
#ifdef CONFIG_AMD_MEM_ENCRYPT
- movq sme_me_mask, %rax
-#else
- xorq %rax, %rax
+ addq sme_me_mask(%rip), %rax
#endif

- /* Form the CR3 value being sure to include the CR3 modifier */
- addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

/*
@@ -207,9 +199,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

- /* Setup early boot stage 4-/5-level pagetables. */
- addq phys_base(%rip), %rax
-
/*
* Switch to new page-table
*
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-27 15:20:31

by Ard Biesheuvel

Subject: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

From: Ard Biesheuvel <[email protected]>

Assigning the 5-level paging related global variables from the earliest
C code using explicit references that use the 1:1 translation of memory
is unnecessary, as the startup code itself does not rely on them to
create the initial page tables, and this is all it should be doing. So
defer these assignments to the primary C entry code that executes via
the ordinary kernel virtual mapping.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/include/asm/pgtable_64_types.h | 2 +-
arch/x86/kernel/head64.c | 44 +++++++-------------
2 files changed, 15 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 38b54b992f32..9053dfe9fa03 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -21,9 +21,9 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;

-#ifdef CONFIG_X86_5LEVEL
extern unsigned int __pgtable_l5_enabled;

+#ifdef CONFIG_X86_5LEVEL
#ifdef USE_EARLY_PGTABLE_L5
/*
* cpu_feature_enabled() is not available in early boot code.
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 72351c3121a6..deaaea3280d9 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -23,6 +23,7 @@
#include <linux/pgtable.h>

#include <asm/asm.h>
+#include <asm/page_64.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
@@ -77,24 +78,11 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-#ifdef CONFIG_X86_5LEVEL
-static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
-{
- return ptr - (void *)_text + (void *)physaddr;
-}
-
-static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
+static inline bool check_la57_support(void)
{
- return fixup_pointer(ptr, physaddr);
-}
-
-static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
-{
- return fixup_pointer(ptr, physaddr);
-}
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return false;

-static bool __head check_la57_support(unsigned long physaddr)
-{
/*
* 5-level paging is detected and enabled at kernel decompression
* stage. Only check if it has been enabled there.
@@ -102,21 +90,8 @@ static bool __head check_la57_support(unsigned long physaddr)
if (!(native_read_cr4() & X86_CR4_LA57))
return false;

- *fixup_int(&__pgtable_l5_enabled, physaddr) = 1;
- *fixup_int(&pgdir_shift, physaddr) = 48;
- *fixup_int(&ptrs_per_p4d, physaddr) = 512;
- *fixup_long(&page_offset_base, physaddr) = __PAGE_OFFSET_BASE_L5;
- *fixup_long(&vmalloc_base, physaddr) = __VMALLOC_BASE_L5;
- *fixup_long(&vmemmap_base, physaddr) = __VMEMMAP_BASE_L5;
-
return true;
}
-#else
-static bool __head check_la57_support(unsigned long physaddr)
-{
- return false;
-}
-#endif

static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
{
@@ -180,7 +155,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
bool la57;
int i;

- la57 = check_la57_support(physaddr);
+ la57 = check_la57_support();

/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
@@ -465,6 +440,15 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

+ if (check_la57_support()) {
+ __pgtable_l5_enabled = 1;
+ pgdir_shift = 48;
+ ptrs_per_p4d = 512;
+ page_offset_base = __PAGE_OFFSET_BASE_L5;
+ vmalloc_base = __VMALLOC_BASE_L5;
+ vmemmap_base = __VMEMMAP_BASE_L5;
+ }
+
cr4_init_shadow();

/* Kill off the identity-map trampoline */
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-27 15:20:33

by Ard Biesheuvel

Subject: [PATCH v7 4/9] x86/startup_64: Simplify virtual switch on primary boot

From: Ard Biesheuvel <[email protected]>

The secondary startup code is used on the primary boot path as well, but
in this case, the initial part runs from a 1:1 mapping, until an
explicit cross-jump is made to the kernel virtual mapping of the same
code.

On the secondary boot path, this jump is pointless as the code already
executes from the mapping targeted by the jump. So combine this
cross-jump with the jump from startup_64() into the common boot path.
This simplifies the execution flow, and clearly separates code that runs
from a 1:1 mapping from code that runs from the kernel virtual mapping.

Note that this requires a page table switch, so hoist the CR3 assignment
into startup_64() as well. And since absolute symbol references will no
longer be permitted in .head.text once we enable the associated build
time checks, a RIP-relative memory operand is used in the JMP
instruction, referring to an absolute constant in the .init.rodata
section.

Given that the secondary startup code does not require a special
placement inside the executable, move it to the .text section.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 42 ++++++++++----------
1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index c451a72bc92b..87929f615048 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -39,7 +39,6 @@ L4_START_KERNEL = l4_index(__START_KERNEL_map)

L3_START_KERNEL = pud_index(__START_KERNEL_map)

- .text
__HEAD
.code64
SYM_CODE_START_NOALIGN(startup_64)
@@ -126,9 +125,21 @@ SYM_CODE_START_NOALIGN(startup_64)
call sev_verify_cbit
#endif

- jmp 1f
+ /*
+ * Switch to early_top_pgt which still has the identity mappings
+ * present.
+ */
+ movq %rax, %cr3
+
+ /* Branch to the common startup code at its kernel virtual address */
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *0f(%rip)
SYM_CODE_END(startup_64)

+ __INITRODATA
+0: .quad common_startup_64
+
+ .text
SYM_CODE_START(secondary_startup_64)
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR
@@ -174,8 +185,15 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
#ifdef CONFIG_AMD_MEM_ENCRYPT
addq sme_me_mask(%rip), %rax
#endif
+ /*
+ * Switch to the init_top_pgt here, away from the trampoline_pgd and
+ * unmap the identity mapped ranges.
+ */
+ movq %rax, %cr3

-1:
+SYM_INNER_LABEL(common_startup_64, SYM_L_LOCAL)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR

/*
* Create a mask of CR4 bits to preserve. Omit PGE in order to clean
@@ -199,30 +217,12 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

- /*
- * Switch to new page-table
- *
- * For the boot CPU this switches to early_top_pgt which still has the
- * identity mappings present. The secondary CPUs will switch to the
- * init_top_pgt here, away from the trampoline_pgd and unmap the
- * identity mapped ranges.
- */
- movq %rax, %cr3
-
/*
* Set CR4.PGE to re-enable global translations.
*/
btsl $X86_CR4_PGE_BIT, %ecx
movq %rcx, %cr4

- /* Ensure I am executing from virtual addresses */
- movq $1f, %rax
- ANNOTATE_RETPOLINE_SAFE
- jmp *%rax
-1:
- UNWIND_HINT_END_OF_STACK
- ANNOTATE_NOENDBR // above
-
#ifdef CONFIG_SMP
/*
* For parallel boot, the APIC ID is read from the APIC, and then
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-27 15:20:47

by Ard Biesheuvel

Subject: [PATCH v7 6/9] x86/boot: Move mem_encrypt= parsing to the decompressor

From: Ard Biesheuvel <[email protected]>

The early SME/SEV code parses the command line very early, in order to
decide whether or not memory encryption should be enabled, which needs
to occur even before the initial page tables are created.

This is problematic for a number of reasons:
- this early code runs from the 1:1 mapping provided by the decompressor
or firmware, which uses a different translation than the one assumed by
the linker, and so the code needs to be built in a special way;
- parsing external input while the entire kernel image is still mapped
writable is a bad idea in general, and really does not belong in
security-minded code;
- the current code ignores the built-in command line entirely (although
this appears to be the case for the entire decompressor)

Given that the decompressor/EFI stub is an intrinsic part of the x86
bootable kernel image, move the command line parsing there and out of
the core kernel. This removes the need to build lib/cmdline.o in a
special way, or to use RIP-relative LEA instructions in inline asm
blocks.

This involves a new xloadflag in the setup header to indicate
that mem_encrypt=on appeared on the kernel command line.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/misc.c | 15 +++++++++
arch/x86/include/uapi/asm/bootparam.h | 1 +
arch/x86/lib/Makefile | 13 --------
arch/x86/mm/mem_encrypt_identity.c | 32 ++------------------
drivers/firmware/efi/libstub/x86-stub.c | 3 ++
5 files changed, 22 insertions(+), 42 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index bd6857a9f15a..408507e305be 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -371,6 +371,19 @@ unsigned long decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
return entry;
}

+/*
+ * Set the memory encryption xloadflag based on the mem_encrypt= command line
+ * parameter, if provided.
+ */
+static void parse_mem_encrypt(struct setup_header *hdr)
+{
+ int on = cmdline_find_option_bool("mem_encrypt=on");
+ int off = cmdline_find_option_bool("mem_encrypt=off");
+
+ if (on > off)
+ hdr->xloadflags |= XLF_MEM_ENCRYPTION;
+}
+
/*
* The compressed kernel image (ZO), has been moved so that its position
* is against the end of the buffer used to hold the uncompressed kernel
@@ -401,6 +414,8 @@ asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
/* Clear flags intended for solely in-kernel use. */
boot_params_ptr->hdr.loadflags &= ~KASLR_FLAG;

+ parse_mem_encrypt(&boot_params_ptr->hdr);
+
sanitize_boot_params(boot_params_ptr);

if (boot_params_ptr->screen_info.orig_video_mode == 7) {
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..eeea058cf602 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -38,6 +38,7 @@
#define XLF_EFI_KEXEC (1<<4)
#define XLF_5LEVEL (1<<5)
#define XLF_5LEVEL_ENABLED (1<<6)
+#define XLF_MEM_ENCRYPTION (1<<7)

#ifndef __ASSEMBLY__

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index ea3a28e7b613..f0dae4fb6d07 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -14,19 +14,6 @@ ifdef CONFIG_KCSAN
CFLAGS_REMOVE_delay.o = $(CC_FLAGS_FTRACE)
endif

-# Early boot use of cmdline; don't instrument it
-ifdef CONFIG_AMD_MEM_ENCRYPT
-KCOV_INSTRUMENT_cmdline.o := n
-KASAN_SANITIZE_cmdline.o := n
-KCSAN_SANITIZE_cmdline.o := n
-
-ifdef CONFIG_FUNCTION_TRACER
-CFLAGS_REMOVE_cmdline.o = -pg
-endif
-
-CFLAGS_cmdline.o := -fno-stack-protector -fno-jump-tables
-endif
-
inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
quiet_cmd_inat_tables = GEN $@
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 0166ab1780cc..d210c7fc8fa2 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -43,7 +43,6 @@

#include <asm/setup.h>
#include <asm/sections.h>
-#include <asm/cmdline.h>
#include <asm/coco.h>
#include <asm/sev.h>

@@ -95,9 +94,6 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");

-static char sme_cmdline_arg[] __initdata = "mem_encrypt";
-static char sme_cmdline_on[] __initdata = "on";
-
static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
@@ -504,11 +500,9 @@ void __init sme_encrypt_kernel(struct boot_params *bp)

void __init sme_enable(struct boot_params *bp)
{
- const char *cmdline_ptr, *cmdline_arg, *cmdline_on;
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;
unsigned long me_mask;
- char buffer[16];
bool snp;
u64 msr;

@@ -551,6 +545,9 @@ void __init sme_enable(struct boot_params *bp)

/* Check if memory encryption is enabled */
if (feature_mask == AMD_SME_BIT) {
+ if (!(bp->hdr.xloadflags & XLF_MEM_ENCRYPTION))
+ return;
+
/*
* No SME if Hypervisor bit is set. This check is here to
* prevent a guest from trying to enable SME. For running as a
@@ -570,31 +567,8 @@ void __init sme_enable(struct boot_params *bp)
msr = __rdmsr(MSR_AMD64_SYSCFG);
if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
return;
- } else {
- /* SEV state cannot be controlled by a command line option */
- goto out;
}

- /*
- * Fixups have not been applied to phys_base yet and we're running
- * identity mapped, so we must obtain the address to the SME command
- * line argument data using rip-relative addressing.
- */
- asm ("lea sme_cmdline_arg(%%rip), %0"
- : "=r" (cmdline_arg)
- : "p" (sme_cmdline_arg));
- asm ("lea sme_cmdline_on(%%rip), %0"
- : "=r" (cmdline_on)
- : "p" (sme_cmdline_on));
-
- cmdline_ptr = (const char *)((u64)bp->hdr.cmd_line_ptr |
- ((u64)bp->ext_cmd_line_ptr << 32));
-
- if (cmdline_find_option(cmdline_ptr, cmdline_arg, buffer, sizeof(buffer)) < 0 ||
- strncmp(buffer, cmdline_on, sizeof(buffer)))
- return;
-
-out:
RIP_REL_REF(sme_me_mask) = me_mask;
physical_mask &= ~me_mask;
cc_vendor = CC_VENDOR_AMD;
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 99429bc4b0c7..0336ed175e67 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -884,6 +884,9 @@ void __noreturn efi_stub_entry(efi_handle_t handle,
}
}

+ if (efi_mem_encrypt > 0)
+ hdr->xloadflags |= XLF_MEM_ENCRYPTION;
+
status = efi_decompress_kernel(&kernel_entry);
if (status != EFI_SUCCESS) {
efi_err("Failed to decompress kernel\n");
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-27 15:23:39

by Ard Biesheuvel

Subject: [PATCH v7 5/9] efi/libstub: Add generic support for parsing mem_encrypt=

From: Ard Biesheuvel <[email protected]>

Parse the mem_encrypt= command line parameter from the EFI stub if
CONFIG_ARCH_HAS_MEM_ENCRYPT=y, so that it can be passed to the early
boot code by the arch code in the stub.

This avoids the need for the core kernel to do any string parsing very
early in the boot.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++++++++
drivers/firmware/efi/libstub/efistub.h | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
index bfa30625f5d0..3dc2f9aaf08d 100644
--- a/drivers/firmware/efi/libstub/efi-stub-helper.c
+++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
@@ -24,6 +24,8 @@ static bool efi_noinitrd;
static bool efi_nosoftreserve;
static bool efi_disable_pci_dma = IS_ENABLED(CONFIG_EFI_DISABLE_PCI_DMA);

+int efi_mem_encrypt;
+
bool __pure __efi_soft_reserve_enabled(void)
{
return !efi_nosoftreserve;
@@ -75,6 +77,12 @@ efi_status_t efi_parse_options(char const *cmdline)
efi_noinitrd = true;
} else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
efi_no5lvl = true;
+ } else if (IS_ENABLED(CONFIG_ARCH_HAS_MEM_ENCRYPT) &&
+ !strcmp(param, "mem_encrypt") && val) {
+ if (parse_option_str(val, "on"))
+ efi_mem_encrypt = 1;
+ else if (parse_option_str(val, "off"))
+ efi_mem_encrypt = -1;
} else if (!strcmp(param, "efi") && val) {
efi_nochunk = parse_option_str(val, "nochunk");
efi_novamap |= parse_option_str(val, "novamap");
diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index c04b82ea40f2..fc18fd649ed7 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -37,8 +37,8 @@ extern bool efi_no5lvl;
extern bool efi_nochunk;
extern bool efi_nokaslr;
extern int efi_loglevel;
+extern int efi_mem_encrypt;
extern bool efi_novamap;
-
extern const efi_system_table_t *efi_system_table;

typedef union efi_dxe_services_table efi_dxe_services_table_t;
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-27 15:24:03

by Ard Biesheuvel

Subject: [PATCH v7 7/9] x86/sme: Move early SME kernel encryption handling into .head.text

From: Ard Biesheuvel <[email protected]>

The .head.text section contains the initial primary entry point of the
core kernel, which is entered with the CPU executing from a 1:1 mapping
of memory. Such code must never access global variables using absolute
references, as these are based on the kernel virtual mapping, which is
not active yet at this point.

Given that the SME startup code is also called from this early execution
context, move it into .head.text as well. This will allow more thorough
build time checks in the future to ensure that early startup code only
uses RIP-relative references to global variables.

Also replace some occurrences of __pa_symbol() [which relies on the
compiler generating an absolute reference, which is not guaranteed] and
an open-coded RIP-relative access with RIP_REL_REF().
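
For context, RIP_REL_REF() forces the access to be emitted as a
RIP-relative reference regardless of what the compiler would otherwise
generate. From memory, its definition is roughly the following (a
sketch; see <asm/asm.h> for the authoritative version):

        static __always_inline void *rip_rel_ptr(void *p)
        {
                asm("leaq %c1(%%rip), %0" : "=r" (p) : "i" (p));

                return p;
        }

        #define RIP_REL_REF(var)        (*(typeof(&(var)))rip_rel_ptr(&(var)))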

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/include/asm/mem_encrypt.h | 8 ++--
arch/x86/mm/mem_encrypt_identity.c | 42 ++++++++------------
2 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index b31eb9fd5954..f922b682b9b4 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -47,8 +47,8 @@ void __init sme_unmap_bootdata(char *real_mode_data);

void __init sme_early_init(void);

-void __init sme_encrypt_kernel(struct boot_params *bp);
-void __init sme_enable(struct boot_params *bp);
+void sme_encrypt_kernel(struct boot_params *bp);
+void sme_enable(struct boot_params *bp);

int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size);
int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
@@ -81,8 +81,8 @@ static inline void __init sme_unmap_bootdata(char *real_mode_data) { }

static inline void __init sme_early_init(void) { }

-static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
-static inline void __init sme_enable(struct boot_params *bp) { }
+static inline void sme_encrypt_kernel(struct boot_params *bp) { }
+static inline void sme_enable(struct boot_params *bp) { }

static inline void sev_es_init_vc_handling(void) { }

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index d210c7fc8fa2..64b5005d49e5 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -41,6 +41,7 @@
#include <linux/mem_encrypt.h>
#include <linux/cc_platform.h>

+#include <asm/init.h>
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/coco.h>
@@ -94,7 +95,7 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");

-static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
+static void __head sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
pgd_t *pgd_p;
@@ -109,7 +110,7 @@ static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
memset(pgd_p, 0, pgd_size);
}

-static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
+static pud_t __head *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -146,7 +147,7 @@ static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
return pud;
}

-static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
+static void __head sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -162,7 +163,7 @@ static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
set_pmd(pmd, __pmd(ppd->paddr | ppd->pmd_flags));
}

-static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
+static void __head sme_populate_pgd(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -188,7 +189,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
}

-static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
+static void __head __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd_large(ppd);
@@ -198,7 +199,7 @@ static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
}
}

-static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
+static void __head __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd(ppd);
@@ -208,7 +209,7 @@ static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
}
}

-static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
+static void __head __sme_map_range(struct sme_populate_pgd_data *ppd,
pmdval_t pmd_flags, pteval_t pte_flags)
{
unsigned long vaddr_end;
@@ -232,22 +233,22 @@ static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
__sme_map_range_pte(ppd);
}

-static void __init sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_ENC, PTE_FLAGS_ENC);
}

-static void __init sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC, PTE_FLAGS_DEC);
}

-static void __init sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC_WP, PTE_FLAGS_DEC_WP);
}

-static unsigned long __init sme_pgtable_calc(unsigned long len)
+static unsigned long __head sme_pgtable_calc(unsigned long len)
{
unsigned long entries = 0, tables = 0;

@@ -284,7 +285,7 @@ static unsigned long __init sme_pgtable_calc(unsigned long len)
return entries + tables;
}

-void __init sme_encrypt_kernel(struct boot_params *bp)
+void __head sme_encrypt_kernel(struct boot_params *bp)
{
unsigned long workarea_start, workarea_end, workarea_len;
unsigned long execute_start, execute_end, execute_len;
@@ -319,9 +320,8 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
* memory from being cached.
*/

- /* Physical addresses gives us the identity mapped virtual addresses */
- kernel_start = __pa_symbol(_text);
- kernel_end = ALIGN(__pa_symbol(_end), PMD_SIZE);
+ kernel_start = (unsigned long)RIP_REL_REF(_text);
+ kernel_end = ALIGN((unsigned long)RIP_REL_REF(_end), PMD_SIZE);
kernel_len = kernel_end - kernel_start;

initrd_start = 0;
@@ -338,14 +338,6 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
}
#endif

- /*
- * We're running identity mapped, so we must obtain the address to the
- * SME encryption workarea using rip-relative addressing.
- */
- asm ("lea sme_workarea(%%rip), %0"
- : "=r" (workarea_start)
- : "p" (sme_workarea));
-
/*
* Calculate required number of workarea bytes needed:
* executable encryption area size:
@@ -355,7 +347,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
* pagetable structures for the encryption of the kernel
* pagetable structures for workarea (in case not currently mapped)
*/
- execute_start = workarea_start;
+ execute_start = workarea_start = (unsigned long)RIP_REL_REF(sme_workarea);
execute_end = execute_start + (PAGE_SIZE * 2) + PMD_SIZE;
execute_len = execute_end - execute_start;

@@ -498,7 +490,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
native_write_cr3(__native_read_cr3());
}

-void __init sme_enable(struct boot_params *bp)
+void __head sme_enable(struct boot_params *bp)
{
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-27 15:24:09

by Ard Biesheuvel

Subject: [PATCH v7 8/9] x86/sev: Move early startup code into .head.text section

From: Ard Biesheuvel <[email protected]>

In preparation for implementing rigorous build time checks to enforce
that only code that can support it will be called from the early 1:1
mapping of memory, move SEV init code that is called in this manner to
the .head.text section.
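
(For reference: the __head annotation used throughout the diff is, if
memory serves, simply a section attribute defined in <asm/init.h>:

        #define __head  __section(".head.text")

as also suggested by the #undef/#define pair the decompressor uses to
neutralize it below.)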

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/sev.c | 3 +++
arch/x86/include/asm/sev.h | 10 ++++-----
arch/x86/kernel/sev-shared.c | 23 +++++++++-----------
arch/x86/kernel/sev.c | 14 +++++++-----
4 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 073291832f44..bea0719d70f2 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -116,6 +116,9 @@ static bool fault_in_kernel_space(unsigned long address)
#undef __init
#define __init

+#undef __head
+#define __head
+
#define __BOOT_COMPRESSED

/* Basic instruction decoding support needed */
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index bed95e1f4d52..cf671138feef 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -213,16 +213,16 @@ static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate)
struct snp_guest_request_ioctl;

void setup_ghcb(void);
-void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
- unsigned long npages);
-void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
- unsigned long npages);
+void early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages);
+void early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages);
void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op);
void snp_set_memory_shared(unsigned long vaddr, unsigned long npages);
void snp_set_memory_private(unsigned long vaddr, unsigned long npages);
void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
-void __init __noreturn snp_abort(void);
+void __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 snp_get_unsupported_features(u64 status);
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index ae79f9505298..0bd7ccbe8732 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -93,7 +93,8 @@ static bool __init sev_es_check_cpu_features(void)
return true;
}

-static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
+static void __head __noreturn
+sev_es_terminate(unsigned int set, unsigned int reason)
{
u64 val = GHCB_MSR_TERM_REQ;

@@ -330,13 +331,7 @@ static int sev_cpuid_hv(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid
*/
static const struct snp_cpuid_table *snp_cpuid_get_table(void)
{
- void *ptr;
-
- asm ("lea cpuid_table_copy(%%rip), %0"
- : "=r" (ptr)
- : "p" (&cpuid_table_copy));
-
- return ptr;
+ return &RIP_REL_REF(cpuid_table_copy);
}

/*
@@ -395,7 +390,7 @@ static u32 snp_cpuid_calc_xsave_size(u64 xfeatures_en, bool compacted)
return xsave_size;
}

-static bool
+static bool __head
snp_cpuid_get_validated_func(struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -532,7 +527,8 @@ static int snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
* Returns -EOPNOTSUPP if feature not enabled. Any other non-zero return value
* should be treated as fatal by caller.
*/
-static int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
+static int __head
+snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();

@@ -574,7 +570,7 @@ static int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_le
* page yet, so it only supports the MSR based communication with the
* hypervisor and only the CPUID exit-code.
*/
-void __init do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
+void __head do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
{
unsigned int subfn = lower_bits(regs->cx, 32);
unsigned int fn = lower_bits(regs->ax, 32);
@@ -1025,7 +1021,8 @@ struct cc_setup_data {
* Search for a Confidential Computing blob passed in as a setup_data entry
* via the Linux Boot Protocol.
*/
-static struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
+static __head
+struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
{
struct cc_setup_data *sd = NULL;
struct setup_data *hdr;
@@ -1052,7 +1049,7 @@ static struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
* mapping needs to be updated in sync with all the changes to virtual memory
* layout and related mapping facilities throughout the boot process.
*/
-static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
+static void __head setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
{
const struct snp_cpuid_table *cpuid_table_fw, *cpuid_table;
int i;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1ef7ae806a01..33c14aa1f06c 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -25,6 +25,7 @@
#include <linux/psp-sev.h>
#include <uapi/linux/sev-guest.h>

+#include <asm/init.h>
#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
#include <asm/sev.h>
@@ -682,8 +683,9 @@ static u64 __init get_jump_table_addr(void)
return ret;
}

-static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
- unsigned long npages, enum psc_op op)
+static void __head
+early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -739,7 +741,7 @@ static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
}

-void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+void __head early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
unsigned long npages)
{
/*
@@ -2062,7 +2064,7 @@ bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
*
* Scan for the blob in that order.
*/
-static __init struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
+static __head struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;

@@ -2088,7 +2090,7 @@ static __init struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
return cc_info;
}

-bool __init snp_init(struct boot_params *bp)
+bool __head snp_init(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;

@@ -2110,7 +2112,7 @@ bool __init snp_init(struct boot_params *bp)
return true;
}

-void __init __noreturn snp_abort(void)
+void __head __noreturn snp_abort(void)
{
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
}
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-27 15:24:24

by Ard Biesheuvel

Subject: [PATCH v7 9/9] x86/startup_64: Drop global variables keeping track of LA57 state

From: Ard Biesheuvel <[email protected]>

On x86_64, the core kernel is entered in long mode, which implies that
paging is enabled. This means that the CR4.LA57 control bit is
guaranteed to be in sync with the number of paging levels used by the
kernel, and there is no need to store this in a variable.

There is also no need to use variables for storing the calculations of
pgdir_shift and ptrs_per_p4d, as they are easily determined on the fly.

This removes the need for two different sources of truth for determining
whether 5-level paging is in use: CR4.LA57 always reflects the actual
state, and never changes from the point of view of the 64-bit core
kernel. The only potential concern is the cost of CR4 accesses, which
can be mitigated using alternatives patching based on feature detection.

Note that even the decompressor does not manipulate any page tables
before updating CR4.LA57, so it can also avoid the associated global
variables entirely. However, as it does not implement alternatives
patching, the associated ELF sections need to be discarded.
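
Conceptually, the new pgtable_l5_enabled() boils down to the plain-C
check below; the actual implementation in the diff uses
ALTERNATIVE_TERNARY so that the CR4 access can be patched away once
X86_FEATURE_LA57 has been established (sketch only):

        static inline bool pgtable_l5_enabled(void)
        {
                return IS_ENABLED(CONFIG_X86_5LEVEL) &&
                       (native_read_cr4() & X86_CR4_LA57);
        }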

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/misc.h | 4 --
arch/x86/boot/compressed/pgtable_64.c | 12 ------
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/include/asm/pgtable_64_types.h | 43 ++++++++++----------
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/head64.c | 33 +--------------
arch/x86/mm/kasan_init_64.c | 3 --
arch/x86/mm/mem_encrypt_identity.c | 9 ----
8 files changed, 25 insertions(+), 82 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index b353a7be380c..e4ab7b4d8698 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,9 +16,6 @@

#define __NO_FORTIFY

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
/*
* Boot stub deals with identity mappings, physical and virtual addresses are
* the same, so override these defines.
@@ -181,7 +178,6 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#endif

/* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
extern void kernel_add_identity_map(unsigned long start, unsigned long end);

/* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 51f957b24ba7..ae72f53f5e77 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -9,13 +9,6 @@
#define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */
#define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */

-#ifdef CONFIG_X86_5LEVEL
-/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
-unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") pgdir_shift = 39;
-unsigned int __section(".data") ptrs_per_p4d = 1;
-#endif
-
/* Buffer to preserve trampoline memory */
static char trampoline_save[TRAMPOLINE_32BIT_SIZE];

@@ -125,11 +118,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
native_cpuid_eax(0) >= 7 &&
(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
l5_required = true;
-
- /* Initialize variables for 5-level paging */
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
}

/*
diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
index 083ec6d7722a..06358bb067fe 100644
--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -81,6 +81,7 @@ SECTIONS
*(.dynamic) *(.dynsym) *(.dynstr) *(.dynbss)
*(.hash) *(.gnu.hash)
*(.note.*)
+ *(.altinstructions .altinstr_replacement)
}

.got.plt (INFO) : {
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 9053dfe9fa03..2fac8ba9564a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -6,7 +6,10 @@

#ifndef __ASSEMBLY__
#include <linux/types.h>
+#include <asm/alternative.h>
+#include <asm/cpufeatures.h>
#include <asm/kaslr.h>
+#include <asm/processor-flags.h>

/*
* These are used to make use of C type-checking..
@@ -21,28 +24,24 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;

-extern unsigned int __pgtable_l5_enabled;
-
-#ifdef CONFIG_X86_5LEVEL
-#ifdef USE_EARLY_PGTABLE_L5
-/*
- * cpu_feature_enabled() is not available in early boot code.
- * Use variable instead.
- */
-static inline bool pgtable_l5_enabled(void)
+static __always_inline __pure bool pgtable_l5_enabled(void)
{
- return __pgtable_l5_enabled;
-}
-#else
-#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
-#endif /* USE_EARLY_PGTABLE_L5 */
+ unsigned long r;
+ bool ret;

-#else
-#define pgtable_l5_enabled() 0
-#endif /* CONFIG_X86_5LEVEL */
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return false;

-extern unsigned int pgdir_shift;
-extern unsigned int ptrs_per_p4d;
+ asm(ALTERNATIVE_TERNARY(
+ "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
+ %P[feat], "stc", "clc")
+ : [reg] "=&r" (r), CC_OUT(c) (ret)
+ : [feat] "i" (X86_FEATURE_LA57),
+ [la57] "i" (X86_CR4_LA57_BIT)
+ : "cc");
+
+ return ret;
+}

#endif /* !__ASSEMBLY__ */

@@ -53,7 +52,7 @@ extern unsigned int ptrs_per_p4d;
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
-#define PGDIR_SHIFT pgdir_shift
+#define PGDIR_SHIFT (pgtable_l5_enabled() ? 48 : 39)
#define PTRS_PER_PGD 512

/*
@@ -61,7 +60,7 @@ extern unsigned int ptrs_per_p4d;
*/
#define P4D_SHIFT 39
#define MAX_PTRS_PER_P4D 512
-#define PTRS_PER_P4D ptrs_per_p4d
+#define PTRS_PER_P4D (pgtable_l5_enabled() ? 512 : 1)
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE - 1))

@@ -76,6 +75,8 @@ extern unsigned int ptrs_per_p4d;
#define PTRS_PER_PGD 512
#define MAX_PTRS_PER_P4D 1

+#define MAX_POSSIBLE_PHYSMEM_BITS 46
+
#endif /* CONFIG_X86_5LEVEL */

/*
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9e35e276c55a..d88e4be88868 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,6 +1,4 @@
// SPDX-License-Identifier: GPL-2.0-only
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5

#include <linux/memblock.h>
#include <linux/linkage.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index deaaea3280d9..789ed2c53527 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -7,9 +7,6 @@

#define DISABLE_BRANCH_PROFILING

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/init.h>
#include <linux/linkage.h>
#include <linux/types.h>
@@ -52,14 +49,6 @@ extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);

-#ifdef CONFIG_X86_5LEVEL
-unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int pgdir_shift __ro_after_init = 39;
-EXPORT_SYMBOL(pgdir_shift);
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-EXPORT_SYMBOL(ptrs_per_p4d);
-#endif
-
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
@@ -78,21 +67,6 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-static inline bool check_la57_support(void)
-{
- if (!IS_ENABLED(CONFIG_X86_5LEVEL))
- return false;
-
- /*
- * 5-level paging is detected and enabled at kernel decompression
- * stage. Only check if it has been enabled there.
- */
- if (!(native_read_cr4() & X86_CR4_LA57))
- return false;
-
- return true;
-}
-
static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
{
unsigned long vaddr, vaddr_end;
@@ -155,7 +129,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
bool la57;
int i;

- la57 = check_la57_support();
+ la57 = pgtable_l5_enabled();

/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
@@ -440,10 +414,7 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

- if (check_la57_support()) {
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
+ if (pgtable_l5_enabled()) {
page_offset_base = __PAGE_OFFSET_BASE_L5;
vmalloc_base = __VMALLOC_BASE_L5;
vmemmap_base = __VMEMMAP_BASE_L5;
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0302491d799d..85ae1ef840cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -2,9 +2,6 @@
#define DISABLE_BRANCH_PROFILING
#define pr_fmt(fmt) "kasan: " fmt

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/memblock.h>
#include <linux/kasan.h>
#include <linux/kdebug.h>
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 64b5005d49e5..a857945af177 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -27,15 +27,6 @@
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS

-/*
- * This code runs before CPU feature bits are set. By default, the
- * pgtable_l5_enabled() function uses bit X86_FEATURE_LA57 to determine if
- * 5-level paging is active, so that won't work here. USE_EARLY_PGTABLE_L5
- * is provided to handle this situation and, instead, use a variable that
- * has been set by the early boot code.
- */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mem_encrypt.h>
--
2.44.0.rc1.240.g4c46232300-goog


2024-02-28 14:10:10

by Borislav Petkov

Subject: Re: [PATCH v7 1/9] x86/startup_64: Simplify CR4 handling in startup code

On Tue, Feb 27, 2024 at 04:19:09PM +0100, Ard Biesheuvel wrote:
> + /*
> + * Create a mask of CR4 bits to preserve. Omit PGE in order to clean
> + * global 1:1 translations from the TLBs.

Brian raised the question of when exactly global entries get flushed,
and I was looking for the exact definition in the SDM. Here's what I'll
do on top:

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 2d8762887c6a..24df91535062 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -186,8 +186,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
1:

/*
- * Create a mask of CR4 bits to preserve. Omit PGE in order to clean
+ * Create a mask of CR4 bits to preserve. Omit PGE in order to flush
* global 1:1 translations from the TLBs.
+ *
+ * From the SDM:
+ * "If CR4.PGE is changing from 0 to 1, there were no global TLB
+ * entries before the execution; if CR4.PGE is changing from 1 to 0,
+ * there will be no global TLB entries after the execution."
*/
movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
#ifdef CONFIG_X86_MCE
---

And now it is perfectly clear.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-28 21:13:20

by Borislav Petkov

Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On Tue, Feb 27, 2024 at 04:19:10PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <[email protected]>
>
> Assigning the 5-level paging related global variables from the earliest
> C code using explicit references that use the 1:1 translation of memory
> is unnecessary, as the startup code itself does not rely on them to
> create the initial page tables, and this is all it should be doing. So
> defer these assignments to the primary C entry code that executes via
> the ordinary kernel virtual mapping.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/x86/include/asm/pgtable_64_types.h | 2 +-
> arch/x86/kernel/head64.c | 44 +++++++-------------
> 2 files changed, 15 insertions(+), 31 deletions(-)

Reviewed-by: Borislav Petkov (AMD) <[email protected]>

Those should probably be tested on a 5-level machine, just in case.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-28 21:28:45

by Borislav Petkov

Subject: Re: [PATCH v7 3/9] x86/startup_64: Simplify calculation of initial page table address

On Tue, Feb 27, 2024 at 04:19:11PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <[email protected]>
>
> Determining the address of the initial page table to program into CR3
> involves:
> - taking the physical address
> - adding the SME encryption mask
>
> On the primary entry path, the code is mapped using a 1:1 virtual to
> physical translation, so the physical address can be taken directly
> using a RIP-relative LEA instruction.
>
> On the secondary entry path, the address can be obtained by taking the
> offset from the virtual kernel base (__START_KERNEL_map) and adding the
> physical kernel base.
>
> This is implemented in a slightly confusing way, so clean this up.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/x86/kernel/head_64.S | 25 ++++++--------------
> 1 file changed, 7 insertions(+), 18 deletions(-)

Reviewed-by: Borislav Petkov (AMD) <[email protected]>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-29 10:38:12

by Borislav Petkov

Subject: Re: [PATCH v7 4/9] x86/startup_64: Simplify virtual switch on primary boot

First

Reviewed-by: Borislav Petkov (AMD) <[email protected]>

for the patch.

On Tue, Feb 27, 2024 at 04:19:12PM +0100, Ard Biesheuvel wrote:
> + /*
> + * Switch to early_top_pgt which still has the identity mappings
> + * present.

I was wondering why we've had this "discrepancy" forever - the boot CPU
would have early_top_pgt *with* the ident mappings while the APs would do
init_top_pgt.

But we end up loading init_top_pgt on the BSP too in init_mem_mapping()
so there's a short time during boot where we have this difference.
I haven't found a reason to have it yet except "why bother"...

And now some details just for future reference:

On the BSP:

=> 0x10000a0: mov %rax,%cr3

cr3 0x9922000
111850: ffffffff89922000 8192 OBJECT GLOBAL DEFAULT 22 early_top_pgt

(gdb) p/x early_top_pgt
$3 = {{pgd = 0x9924063}, {pgd = 0x9924063}, {pgd = 0x0} <repeats 509 times>, {pgd = 0x2418067}}

first two PGDs and the last one are populated.

On the AP:

cr3 0x2416000
104747: ffffffff82416000 8192 OBJECT GLOBAL DEFAULT 12 init_top_pgt

(gdb) p/x (long[512])*0xffffffff82416000
$8 = {0x0 <repeats 273 times>, 0xbe01067, 0x0 <repeats 128 times>, 0xc000067, 0xc001067, 0xc002067, 0xc003067, 0xc004067, 0xc005067,
0xc006067, 0xc007067, 0xc008067, 0xc009067, 0xc00a067, 0xc00b067, 0xc00c067, 0xc00d067, 0xc00e067, 0xc00f067, 0xc010067, 0xc011067,
0xc012067, 0xc013067, 0xc014067, 0xc015067, 0xc016067, 0xc017067, 0xc018067, 0xc019067, 0xc01a067, 0xc01b067, 0xc01c067, 0xc01d067,
0xc01e067, 0xc01f067, 0xc020067, 0xc021067, 0xc022067, 0xc023067, 0xc024067, 0xc025067, 0xc026067, 0xc027067, 0xc028067, 0xc029067,
0xc02a067, 0xc02b067, 0xc02c067, 0xc02d067, 0xc02e067, 0xc02f067, 0xc030067, 0xc031067, 0xc032067, 0xc033067, 0xc034067, 0xc035067,
0xc036067, 0xc037067, 0xc038067, 0xc039067, 0xc03a067, 0xc03b067, 0xc03c067, 0xc03d067, 0xc03e067, 0xc03f067, 0x0, 0x0, 0x7ffd3067,
0x0 <repeats 39 times>, 0x7ffd1067, 0x0, 0x9b11067, 0x2418067}

and that one becomes the swapper_pg_dir which is the kernel pagetable we
use.

PTI then does two separate ones, which is a whole different topic.

:-)

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-29 22:36:56

by Ard Biesheuvel

Subject: Re: [PATCH v7 1/9] x86/startup_64: Simplify CR4 handling in startup code

On Wed, 28 Feb 2024 at 14:45, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 04:19:09PM +0100, Ard Biesheuvel wrote:
> > + /*
> > + * Create a mask of CR4 bits to preserve. Omit PGE in order to clean
> > + * global 1:1 translations from the TLBs.
>
> Brian raised this question when exactly global entries get flushed and
> I was looking for the exact definition in the SDM, here's what I'll do
> ontop:
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 2d8762887c6a..24df91535062 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -186,8 +186,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> 1:
>
> /*
> - * Create a mask of CR4 bits to preserve. Omit PGE in order to clean
> + * Create a mask of CR4 bits to preserve. Omit PGE in order to flush
> * global 1:1 translations from the TLBs.
> + *
> + * From the SDM:
> + * "If CR4.PGE is changing from 0 to 1, there were no global TLB
> + * entries before the execution; if CR4.PGE is changing from 1 to 0,
> + * there will be no global TLB entries after the execution."
> */
> movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
> #ifdef CONFIG_X86_MCE
> ---
>
> And now it is perfectly clear.
>

Looks good to me - thanks.

2024-02-29 22:38:21

by Ard Biesheuvel

Subject: Re: [PATCH v7 4/9] x86/startup_64: Simplify virtual switch on primary boot

On Thu, 29 Feb 2024 at 11:38, Borislav Petkov <[email protected]> wrote:
>
> First
>
> Reviewed-by: Borislav Petkov (AMD) <[email protected]>
>
> for the patch.
>

Thanks.

> On Tue, Feb 27, 2024 at 04:19:12PM +0100, Ard Biesheuvel wrote:
> > + /*
> > + * Switch to early_top_pgt which still has the identity mappings
> > + * present.
>
> I was wondering why we've had this "discrepancy" forever - the boot CPU
> would have early_top_pgt *with* the ident mappings while the APs would do
> init_top_pgt.
>
> But we end up loading init_top_pgt on the BSP too in init_mem_mapping()
> so there's a short time during boot where we have this difference.
> I haven't found a reason to have it yet except "why bother"...
>

Because we enter with a 1:1 mapping, we can only switch to another
set of page tables that also includes this 1:1 mapping. Once we are
running from the kernel mapping we could drop the 1:1 mapping, but up
to that point we still need it.

What we could do for robustness is reduce this 1:1 mapping to text +
rodata, and make it read-only, but I'm not sure it's worth the churn.

> And now some details just for future reference:
>
> On the BSP:
>
> => 0x10000a0: mov %rax,%cr3
>
> cr3 0x9922000
> 111850: ffffffff89922000 8192 OBJECT GLOBAL DEFAULT 22 early_top_pgt
>
> (gdb) p/x early_top_pgt
> $3 = {{pgd = 0x9924063}, {pgd = 0x9924063}, {pgd = 0x0} <repeats 509 times>, {pgd = 0x2418067}}
>
> first two PGDs and the last one are populated.
>
> On the AP:
>
> cr3 0x2416000
> 104747: ffffffff82416000 8192 OBJECT GLOBAL DEFAULT 12 init_top_pgt
>
> (gdb) p/x (long[512])*0xffffffff82416000
> $8 = {0x0 <repeats 273 times>, 0xbe01067, 0x0 <repeats 128 times>, 0xc000067, 0xc001067, 0xc002067, 0xc003067, 0xc004067, 0xc005067,
> 0xc006067, 0xc007067, 0xc008067, 0xc009067, 0xc00a067, 0xc00b067, 0xc00c067, 0xc00d067, 0xc00e067, 0xc00f067, 0xc010067, 0xc011067,
> 0xc012067, 0xc013067, 0xc014067, 0xc015067, 0xc016067, 0xc017067, 0xc018067, 0xc019067, 0xc01a067, 0xc01b067, 0xc01c067, 0xc01d067,
> 0xc01e067, 0xc01f067, 0xc020067, 0xc021067, 0xc022067, 0xc023067, 0xc024067, 0xc025067, 0xc026067, 0xc027067, 0xc028067, 0xc029067,
> 0xc02a067, 0xc02b067, 0xc02c067, 0xc02d067, 0xc02e067, 0xc02f067, 0xc030067, 0xc031067, 0xc032067, 0xc033067, 0xc034067, 0xc035067,
> 0xc036067, 0xc037067, 0xc038067, 0xc039067, 0xc03a067, 0xc03b067, 0xc03c067, 0xc03d067, 0xc03e067, 0xc03f067, 0x0, 0x0, 0x7ffd3067,
> 0x0 <repeats 39 times>, 0x7ffd1067, 0x0, 0x9b11067, 0x2418067}
>
> and that one becomes the swapper_pg_dir which is the kernel pagetable we
> use.
>
> PTI then does two separate ones, which is a whole different topic.
>
> :-)
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2024-03-01 10:03:15

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On Wed, 28 Feb 2024 at 21:56, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 04:19:10PM +0100, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <[email protected]>
> >
> > Assigning the 5-level paging related global variables from the earliest
> > C code using explicit references that use the 1:1 translation of memory
> > is unnecessary, as the startup code itself does not rely on them to
> > create the initial page tables, and this is all it should be doing. So
> > defer these assignments to the primary C entry code that executes via
> > the ordinary kernel virtual mapping.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/x86/include/asm/pgtable_64_types.h | 2 +-
> > arch/x86/kernel/head64.c | 44 +++++++-------------
> > 2 files changed, 15 insertions(+), 31 deletions(-)
>
> Reviewed-by: Borislav Petkov (AMD) <[email protected]>
>
> Those should probably be tested on a 5level machine, just in case.
>

I have tested this myself on QEMU with -cpu qemu64,+la57 and -cpu host+kvm using
- EFI boot (OVMF)
- legacy BIOS boot (SeaBIOS)
- with and without no5lvl on the command line
- with and without CONFIG_X86_5LEVEL

The scenario that I have not managed to test is entering from EFI with
5 levels of paging enabled, and switching back to 4 levels (which
should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
existence actually supports that today, and I am pretty sure that this
code has never been tested under those conditions to begin with. (OVMF
patches are under review atm to allow 5-level paging to be enabled in
the firmware)

I currently don't have access to real hardware with LA57 support so
any additional coverage there is highly appreciated (same for the last
patch in this series)

2024-03-01 16:11:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On Fri, Mar 01, 2024 at 11:01:33AM +0100, Ard Biesheuvel wrote:
> The scenario that I have not managed to test is entering from EFI with
> 5 levels of paging enabled, and switching back to 4 levels (which
> should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
> existence actually supports that today, and I am pretty sure that this
> code has never been tested under those conditions to begin with. (OVMF
> patches are under review atm to allow 5-level paging to be enabled in
> the firmware)

Aha.

> I currently don't have access to real hardware with LA57 support so
> any additional coverage there is highly appreciated (same for the last
> patch in this series)

Right, I'm sure dhansen could dig up such a machine. We'll ask him
nicely to test when the set is ready.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-01 16:13:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 4/9] x86/startup_64: Simplify virtual switch on primary boot

On Thu, Feb 29, 2024 at 11:36:01PM +0100, Ard Biesheuvel wrote:
> Because we enter with a 1:1 mapping, we can only switch to another
> set of page tables that also includes this 1:1 mapping. Once we are
> running from the kernel mapping we could drop the 1:1 mapping, but up
> to that point we still need it.
>
> What we could do for robustness is reduce this 1:1 mapping to text +
> rodata, and make it read-only, but I'm not sure it's worth the churn.

Yeah, I was experimenting a bit with some shenanigans with those two
pagetables yesterday and arrived at a similar conclusion - there's no
point in trying to unify them.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-01 16:18:24

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 5/9] efi/libstub: Add generic support for parsing mem_encrypt=

On Tue, Feb 27, 2024 at 04:19:13PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <[email protected]>
>
> Parse the mem_encrypt= command line parameter from the EFI stub if
> CONFIG_ARCH_HAS_MEM_ENCRYPT=y, so that it can be passed to the early
> boot code by the arch code in the stub.
>
> This avoids the need for the core kernel to do any string parsing very
> early in the boot.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++++++++
> drivers/firmware/efi/libstub/efistub.h | 2 +-
> 2 files changed, 9 insertions(+), 1 deletion(-)

Reviewed-by: Borislav Petkov (AMD) <[email protected]>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-01 17:10:59

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On Fri, 1 Mar 2024 at 17:09, Borislav Petkov <[email protected]> wrote:
>
> On Fri, Mar 01, 2024 at 11:01:33AM +0100, Ard Biesheuvel wrote:
> > The scenario that I have not managed to test is entering from EFI with
> > 5 levels of paging enabled, and switching back to 4 levels (which
> > should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
> > existence actually supports that today, and I am pretty sure that this
> > code has never been tested under those conditions to begin with. (OVMF
> > patches are under review atm to allow 5-level paging to be enabled in
> > the firmware)
>
> Aha.
>

I've built a debug OVMF image using the latest version of the series,
and put it at [0]

Run like this

qemu-system-x86_64 -M q35 \
-cpu qemu64,+la57 -smp 4 \
-bios OVMF-5level.fd \
-kernel arch/x86/boot/bzImage \
-append console=ttyS0\ earlyprintk=ttyS0 \
-vga none -nographic -m 1g \
-initrd <initrd.img>

and you will get loads of DEBUG output from the firmware first, and
then boot into Linux. (initrd can be omitted)

Right before entering, it will print

CpuDxe: 5-Level Paging = 1

which confirms that the firmware is running with 5 levels of paging.

I've confirmed that this boots happily with this series applied,
including when using 'no5lvl' on the command line, or when disabling
CONFIG_X86_5LEVEL [confirmed by inspecting
/sys/kernel/debug/page_tables/kernel].


[0] http://files.workofard.com/OVMF-5level.fd.gz

2024-03-01 17:43:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On Fri, Mar 01, 2024 at 06:09:53PM +0100, Ard Biesheuvel wrote:
> On Fri, 1 Mar 2024 at 17:09, Borislav Petkov <[email protected]> wrote:
> >
> > On Fri, Mar 01, 2024 at 11:01:33AM +0100, Ard Biesheuvel wrote:
> > > The scenario that I have not managed to test is entering from EFI with
> > > 5 levels of paging enabled, and switching back to 4 levels (which
> > > should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
> > > existence actually supports that today, and I am pretty sure that this
> > > code has never been tested under those conditions to begin with. (OVMF
> > > patches are under review atm to allow 5-level paging to be enabled in
> > > the firmware)
> >
> > Aha.
> >
>
> I've built a debug OVMF image using the latest version of the series,
> and put it at [0]
>
> Run like this
>
> qemu-system-x86_64 -M q35 \
> -cpu qemu64,+la57 -smp 4 \
> -bios OVMF-5level.fd \
> -kernel arch/x86/boot/bzImage \
> -append console=ttyS0\ earlyprintk=ttyS0 \
> -vga none -nographic -m 1g \
> -initrd <initrd.img>
>
> and you will get loads of DEBUG output from the firmware first, and
> then boot into Linux. (initrd can be omitted)
>
> Right before entering, it will print
>
> CpuDxe: 5-Level Paging = 1
>
> which confirms that the firmware is running with 5 levels of paging.
>
> I've confirmed that this boots happily with this series applied,
> including when using 'no5lvl' on the command line, or when disabling
> CONFIG_X86_5LEVEL [confirmed by inspecting
> /sys/kernel/debug/page_tables/kernel].
>
>
> [0] http://files.workofard.com/OVMF-5level.fd.gz

Nice, that might come in handy for other testing too.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-01 19:14:03

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On 3/1/24 11:33, Borislav Petkov wrote:
> On Fri, Mar 01, 2024 at 06:09:53PM +0100, Ard Biesheuvel wrote:
>> On Fri, 1 Mar 2024 at 17:09, Borislav Petkov <[email protected]> wrote:
>>>
>>> On Fri, Mar 01, 2024 at 11:01:33AM +0100, Ard Biesheuvel wrote:
>>>> The scenario that I have not managed to test is entering from EFI with
>>>> 5 levels of paging enabled, and switching back to 4 levels (which
>>>> should work regardless of CONFIG_X86_5LEVEL). However, no firmware in
>>>> existence actually supports that today, and I am pretty sure that this
>>>> code has never been tested under those conditions to begin with. (OVMF
>>>> patches are under review atm to allow 5-level paging to be enabled in
>>>> the firmware)
>>>
>>> Aha.
>>>
>>
>> I've built a debug OVMF image using the latest version of the series,
>> and put it at [0]
>>
>> Run like this
>>
>> qemu-system-x86_64 -M q35 \
>> -cpu qemu64,+la57 -smp 4 \
>> -bios OVMF-5level.fd \
>> -kernel arch/x86/boot/bzImage \
>> -append console=ttyS0\ earlyprintk=ttyS0 \
>> -vga none -nographic -m 1g \
>> -initrd <initrd.img>
>>
>> and you will get loads of DEBUG output from the firmware first, and
>> then boot into Linux. (initrd can be omitted)
>>
>> Right before entering, it will print
>>
>> CpuDxe: 5-Level Paging = 1
>>
>> which confirms that the firmware is running with 5 levels of paging.
>>
>> I've confirmed that this boots happily with this series applied,
>> including when using 'no5lvl' on the command line, or when disabling
>> CONFIG_X86_5LEVEL [confirmed by inspecting
>> /sys/kernel/debug/page_tables/kernel].
>>
>>
>> [0] http://files.workofard.com/OVMF-5level.fd.gz
>
> Nice, that might come in handy for other testing too.

Be aware that additional work will need to be done in OVMF to support
5-level paging for SEV VMs.

The initial SEV implementation happened when there wasn't a page table
library, so SEV support had to roll its own page table modifications. A
page table library has since been created and 5-level support was added
to it, but the SEV code hasn't been converted over to use the new
library yet.

Thanks,
Tom

>
> Thx.
>

2024-03-01 19:17:39

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 6/9] x86/boot: Move mem_encrypt= parsing to the decompressor

On Tue, Feb 27, 2024 at 04:19:14PM +0100, Ard Biesheuvel wrote:
> diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
> index 01d19fc22346..eeea058cf602 100644
> --- a/arch/x86/include/uapi/asm/bootparam.h
> +++ b/arch/x86/include/uapi/asm/bootparam.h
> @@ -38,6 +38,7 @@
> #define XLF_EFI_KEXEC (1<<4)
> #define XLF_5LEVEL (1<<5)
> #define XLF_5LEVEL_ENABLED (1<<6)
> +#define XLF_MEM_ENCRYPTION (1<<7)

Needs documenting in Documentation/arch/x86/boot.rst.

And yes, those 5LEVEL things are not documented either but I'm even
questioning the justification for their existence. We'll see...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-01 19:20:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] x86/startup_64: Drop global variables keeping track of LA57 state

On Tue, Feb 27, 2024 at 04:19:17PM +0100, Ard Biesheuvel wrote:
> + asm(ALTERNATIVE_TERNARY(
> + "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> + %P[feat], "stc", "clc")
> + : [reg] "=&r" (r), CC_OUT(c) (ret)
> + : [feat] "i" (X86_FEATURE_LA57),
> + [la57] "i" (X86_CR4_LA57_BIT)
> + : "cc");
> +
> + return ret;

Yeah, I said this is creative but an alternative here looks like an
overkill.

Can we use a RIP_REL_REF(global var) instead pls?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette
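
For context: RIP_REL_REF() forces a RIP-relative access so that a
global variable is referenced correctly while executing from the early
1:1 mapping, before the kernel runs at its link-time virtual address.
Roughly (a sketch of the helper in <asm/asm.h>; details may differ
slightly):

static __always_inline __pure void *rip_rel_ptr(void *p)
{
	/* Emit a RIP-relative LEA so the address is valid even when
	 * executing from the early 1:1 mapping. */
	asm("leaq %c1(%%rip), %0" : "=r" (p) : "i" (p));

	return p;
}
#define RIP_REL_REF(var)	(*(typeof(&(var)))rip_rel_ptr(&(var)))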

2024-03-01 23:47:46

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 6/9] x86/boot: Move mem_encrypt= parsing to the decompressor

On Fri, 1 Mar 2024 at 20:17, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 04:19:14PM +0100, Ard Biesheuvel wrote:
> > diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
> > index 01d19fc22346..eeea058cf602 100644
> > --- a/arch/x86/include/uapi/asm/bootparam.h
> > +++ b/arch/x86/include/uapi/asm/bootparam.h
> > @@ -38,6 +38,7 @@
> > #define XLF_EFI_KEXEC (1<<4)
> > #define XLF_5LEVEL (1<<5)
> > #define XLF_5LEVEL_ENABLED (1<<6)
> > +#define XLF_MEM_ENCRYPTION (1<<7)
>
> Needs documenting in Documentation/arch/x86/boot.rst.
>

Ack.

> And yes, those 5LEVEL things are not documented either but I'm even
> questioning the justification for their existence. We'll see...
>

Yeah that seems unnecessary to me. They are only used by kexec, and
only for cases where you want to kexec a [much] older kernel that
cannot deal with 5-level paging at all. AFAICT 5-level support was
added in v4.13.

So I think we might be able to drop these entirely, no?

2024-03-01 23:56:15

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] x86/startup_64: Drop global variables keeping track of LA57 state

On Fri, 1 Mar 2024 at 20:20, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Feb 27, 2024 at 04:19:17PM +0100, Ard Biesheuvel wrote:
> > + asm(ALTERNATIVE_TERNARY(
> > + "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> > + %P[feat], "stc", "clc")
> > + : [reg] "=&r" (r), CC_OUT(c) (ret)
> > + : [feat] "i" (X86_FEATURE_LA57),
> > + [la57] "i" (X86_CR4_LA57_BIT)
> > + : "cc");
> > +
> > + return ret;
>
> Yeah, I said this is creative but an alternative here looks like an
> overkill.
>
> Can we use a RIP_REL_REF(global var) instead pls?
>

I don't see the point of that, tbh. Patch #2 already ripped out all
the fixup_pointer() occurrences. This patch gets rid of the need to
#define USE_EARLY_PGTABLE_L5 in each translation unit that contains
code that might execute before alternatives patching has occurred.

Today, pgtable_l5_enabled() is used in many places, most of which
resolve to cpu_feature_enabled(), and I don't think you are suggesting
to replace all of those with a variable load, right? So that means
we'll have to stick with early and late variants of
pgtable_l5_enabled() like we have today, and we should just drop this
patch instead - I put it at the end of the series for a reason.
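
For reference, the pattern under discussion looks roughly like this in
<asm/pgtable_64_types.h> (a sketch with details elided):

extern unsigned int __pgtable_l5_enabled;

#ifdef USE_EARLY_PGTABLE_L5
/*
 * cpu_feature_enabled() is not available in early boot code, so fall
 * back to the variable assigned by the startup code.
 */
static inline bool pgtable_l5_enabled(void)
{
	return __pgtable_l5_enabled;
}
#else
#define pgtable_l5_enabled()	cpu_feature_enabled(X86_FEATURE_LA57)
#endif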

2024-03-02 14:51:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 6/9] x86/boot: Move mem_encrypt= parsing to the decompressor

On Sat, Mar 02, 2024 at 12:46:06AM +0100, Ard Biesheuvel wrote:
> Yeah that seems unnecessary to me. They are only used by kexec, and
> only for cases where you want to kexec a [much] older kernel that
> cannot deal with 5-level paging at all. AFAICT 5-level support was
> added in v4.13.

Yeah, I can't imagine a use case where you'd have two different kernels
- one for normal boot and one for kexec - whose .configs differ such
that one supports 5-level pagetables and the other doesn't...

> So I think we might be able to drop these entirely, no?

Yeah:

https://lore.kernel.org/r/[email protected]


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-02 15:17:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] x86/startup_64: Drop global variables keeping track of LA57 state

On Sat, Mar 02, 2024 at 12:55:14AM +0100, Ard Biesheuvel wrote:
> Today, pgtable_l5_enabled() is used in many places, most of which
> resolve to cpu_feature_enabled(), and I don't think you are suggesting
> to replace all of those with a variable load, right?

pgtable_l5_enabled() is the odd one out - a special interface which
should have been cpu_feature_enabled(), as the latter is the default way
we query what features the CPU supports. But 5level got done long ago
and we hadn't settled upon cpu_feature_enabled() back then.

So should we replace it with it?

Yap, eventually.

> So that means we'll have to stick with early and late variants of
> pgtable_l5_enabled() like we have today,

I don't mind at all if we had a

early_pgtable_l5_enabled()

which does the RIP_REL_REF() thing and it should be used only in 1:1
mapping code and the late stuff should start to get replaced with
cpu_feature_enabled() until pgtable_l5_enabled() is completely gone.
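
Sketched as a hypothetical helper (not merged code), that could look
like:

/* Hypothetical: safe to call from code running at the 1:1 mapping */
static __always_inline bool early_pgtable_l5_enabled(void)
{
	return RIP_REL_REF(__pgtable_l5_enabled);
}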

And then we can even see whether we can open-code the handful of places.

> and we should just drop this patch instead - I put it at the end of
> the series for a reason.

Yeah, we can leave that one aside for now but use it to start a cleanup
series.

If you're bored and feel like doing it, be my guest but for the next
cycle. Or I'll put it on my evergrowing TODO and get to it eventually.

For now, lemme test your set without this one and see whether we can
queue them even now.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-02 15:32:42

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] x86/startup_64: Drop global variables keeping track of LA57 state

On Sat, 2 Mar 2024 at 16:17, Borislav Petkov <[email protected]> wrote:
>
> On Sat, Mar 02, 2024 at 12:55:14AM +0100, Ard Biesheuvel wrote:
> > Today, pgtable_l5_enabled() is used in many places, most of which
> > resolve to cpu_feature_enabled(), and I don't think you are suggesting
> > to replace all of those with a variable load, right?
>
> pgtable_l5_enabled() is the odd one out, special thing which should
> have been cpu_feature_enabled() as latter is the default interface we
> use to query what features the CPU supports. But 5level got done long
> ago and we hadn't decided upon cpu_feature_enabled() back then.
>
> So should we replace it with it?
>
> Yap, eventually.
>

That is not the issue here. The issue is that cpu_feature_enabled()
will produce the wrong value if you call it too early.

The 'late' version of pgtable_l5_enabled() already uses
cpu_feature_enabled(), and I don't think that needs to change. Or are
you saying that pgtable_l5_enabled() should not exist at all, and all
ordinary users should use cpu_feature_enabled() directly? I suspect
that would cause problems where pgtable_l5_enabled() is used in static
inlines in header files, and it is left up to the .c file to set the
#define to convert all #include'd occurrences of pgtable_l5_enabled()
into the 'early' variant.

> > So that means we'll have to stick with early and late variants of
> > pgtable_l5_enabled() like we have today,
>
> I don't mind at all if we had a
>
> early_pgtable_l5_enabled()
>

That wouldn't work with the static inline users of pgtable_l5_enabled().

> which does the RIP_REL_REF() thing and it should be used only in 1:1
> mapping code and the late stuff should start to get replaced with
> cpu_feature_enabled() until pgtable_l5_enabled() is completely gone.
>

All the references to pgtable_l5_enabled() are already gone from the
code that executes from a 1:1 mapping. That is why this patch is
optional: it is just cleanup that goes on top of the earlier patch
that gets rid of potentially problematic uses of fixup_pointer().

The issue being solved here is that we duplicate the value of CR4.LA57
into a global variable, in a way that requires us to reason about
whether routines in a certain compilation unit might be called before
cpu_feature_enabled() may be used. LA57 is an example of a feature
where we cannot simply assume it is absent until we have determined
otherwise, the way we can for most CPU features that are truly
optional.

But I am not aware of any issues around this, although the early
accessors are probably used more widely than necessary at this point.

So an alternative approach might be to not use cpu_feature_enabled()
at all, and always rely on the global variables. But that might tickle
a hot path in the wrong way and cause a noticeable slowdown on some
benchmark.

> And then we even see whether we can opencode the handful places.
>
> > and we should just drop this patch instead - I put it at the end of
> > the series for a reason.
>
> Yeah, we can leave that one aside for now but use it to start a cleanup
> series.
>
> If you're bored and feel like doing it, be my guest but for the next
> cycle. Or I'll put it on my evergrowing TODO and get to it eventually.
>

I don't mind revisiting this next cycle.

> For now, lemme test your set without this one and see whether we can
> queue them even now.
>

Cheers.

2024-03-02 18:23:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 9/9] x86/startup_64: Drop global variables keeping track of LA57 state

On Sat, Mar 02, 2024 at 04:32:19PM +0100, Ard Biesheuvel wrote:
> That is not the issue here. The issue is that cpu_feature_enabled()
> will produce the wrong value if you call it too early.

Looks like I didn't express myself too clearly: the early variant,
early_pgtable_l5_enabled(), should use a simple variable like now - not
cpu_feature_enabled().

And regardless, cpu_feature_enabled() should work even before
alternatives have run because we do dynamic testing in that case... and
so on and so on.

BUT(!), let's put a pin in this and let me first have an indepth look
after the merge window so that I can poke at the details and produce
a concrete diff which we can talk about.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-03 19:27:40

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On Fri, Mar 01, 2024 at 06:33:23PM +0100, Borislav Petkov wrote:
> > I've built a debug OVMF image using the latest version of the series,
> > and put it at [0]
> >
> > Run like this
> >
> > qemu-system-x86_64 -M q35 \
> > -cpu qemu64,+la57 -smp 4 \
> > -bios OVMF-5level.fd \
> > -kernel arch/x86/boot/bzImage \
> > -append console=ttyS0\ earlyprintk=ttyS0 \
> > -vga none -nographic -m 1g \
> > -initrd <initrd.img>
> >
> > and you will get loads of DEBUG output from the firmware first, and
> > then boot into Linux. (initrd can be omitted)
> >
> > Right before entering, it will print
> >
> > CpuDxe: 5-Level Paging = 1
> >
> > which confirms that the firmware is running with 5 levels of paging.
> >
> > I've confirmed that this boots happily with this series applied,
> > including when using 'no5lvl' on the command line, or when disabling
> > CONFIG_X86_5LEVEL [confirmed by inspecting
> > /sys/kernel/debug/page_tables/kernel].
> >
> >
> > [0] http://files.workofard.com/OVMF-5level.fd.gz
>
> Nice, that might come in handy for other testing too.

Btw, on a semi-related note, do you have an idea whether a normal guest
kernel using OVMF instead of SeaBIOS would even be able to boot a kernel
supplied with -kernel like above but without an -initrd?

I have everything builtin and the same kernel boots fine in a guest with
a
[ 0.000000] SMBIOS 3.0.0 present.
[ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014

but if I try to boot the respective guest installed with the OVMF BIOS
from the debian package:

[ 0.000000] efi: EFI v2.7 by Debian distribution of EDK II
[ 0.000000] efi: SMBIOS=0x7f788000 SMBIOS 3.0=0x7f786000 ACPI=0x7f97e000 ACPI 2.0=0x7f97e014 MEMATTR=0x7ddfe018

it fails looking up the /dev/root device major/minor deep in the bowels
of the vfs:

[ 2.565651] do_new_mount:
[ 2.566380] vfs_get_tree: fc->root: 0000000000000000
[ 2.567298] kern_path: filename: ffff88800d666000 of name: /dev/root
[ 2.568418] kern_path: ret: 0
[ 2.569009] lookup_bdev: kern_path(/dev/root, , path: ffff88800e537380), error: 0
[ 2.571645] lookup_bdev: inode->i_rdev: 0x0
[ 2.572417] get_tree_bdev: lookup_bdev(/dev/root, dev: 0x0), error: 0
^^^^^^^^^

That dev_t should be 0x800002 - the major and minor of /dev/sda2 but it
looks like something else is missing in this case...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-03 21:57:10

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On Sun, 3 Mar 2024 at 20:27, Borislav Petkov <[email protected]> wrote:
>
..
>
> Btw, on a semi-related note, do you have an idea whether a normal guest
> kernel using OVMF istead of seabios would be even able to boot a kernel
> supplied with -kernel like above but without an -initrd?
>

How are you passing the root device to the kernel? Via root= on the
command line?

> I have everything builtin and the same kernel boots fine in a guest with
> a
> [ 0.000000] SMBIOS 3.0.0 present.
> [ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>

OK, so this is SeaBIOS

> but if I try to boot the respective guest installed with the OVMF BIOS
> from the debian package:
>
> [ 0.000000] efi: EFI v2.7 by Debian distribution of EDK II
> [ 0.000000] efi: SMBIOS=0x7f788000 SMBIOS 3.0=0x7f786000 ACPI=0x7f97e000 ACPI 2.0=0x7f97e014 MEMATTR=0x7ddfe018
>

and this is OVMF.

I have tried both of these, with i440fx as well as q35, and they all
work happily with my Debian guest image passed via -hda to QEMU, and
with root=/dev/sda2 on the kernel command line.


> it fails looking up the /dev/root device major/minor deep in the bowels
> of the vfs:
>
> [ 2.565651] do_new_mount:
> [ 2.566380] vfs_get_tree: fc->root: 0000000000000000
> [ 2.567298] kern_path: filename: ffff88800d666000 of name: /dev/root
> [ 2.568418] kern_path: ret: 0
> [ 2.569009] lookup_bdev: kern_path(/dev/root, , path: ffff88800e537380), error: 0
> [ 2.571645] lookup_bdev: inode->i_rdev: 0x0
> [ 2.572417] get_tree_bdev: lookup_bdev(/dev/root, dev: 0x0), error: 0
> ^^^^^^^^^
>
> That dev_t should be 0x800002 - the major and minor of /dev/sda2 but it
> looks like something else is missing in this case...
>

How did you get this output? Are these debug printk()s you added yourself?

2024-03-03 22:10:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 2/9] x86/startup_64: Defer assignment of 5-level paging global variables

On Sun, Mar 03, 2024 at 10:56:49PM +0100, Ard Biesheuvel wrote:
> How are you passing the root device to the kernel? Via root= on the
> command line?

Yeah:

qemu
..
-kernel arch/x86/boot/bzImage
-append "root=/dev/sda2 resume=/dev/sda3 ...

> and this is OVMF.

Yap.

> I have tried both of these, with i440fx as well as q35, and they all
> work happily with my Debian guest image passed via -hda to QEMU, and
> with root=/dev/sda2 on the kernel command line.

Interesting. I'm not passing any machine type. Maybe I should, even
though I've never done it before.

/me goes and tries machine type.

Well, I'll be damned!

-machine type=pc-i440fx-2.8 - no workie BUT

-machine type=pc-q35-2.8

booted.

Now on to figuring out what's different with q35, why it is magical, and
why it finds the root device just fine:

[ 2.732908] mount_root_generic: i: 2, fs_name: ext4
[ 2.734275] do_mount_root: name: /dev/root
[ 2.735093] kern_path: filename: ffff88800d4de000 of name: /root
[ 2.736954] kern_path: ret: 0
[ 2.737727] init_mount: kern_path(/root), ret: 0
[ 2.738964] path_mount: will do_new_mount
[ 2.739784] do_new_mount: 1, fc source: (null)
[ 2.740961] do_new_mount: 2, err: 0
[ 2.741722] do_new_mount: 3, err: 0
[ 2.742448] do_new_mount: 4, err: 0
[ 2.743164] vfs_get_tree: fc->root: 0000000000000000
[ 2.744095] kern_path: filename: ffff88800d4de000 of name: /dev/root
[ 2.745352] kern_path: ret: 0
[ 2.745994] lookup_bdev: kern_path(/dev/root, , path: ffff88800cf163c0), error: 0
[ 2.747288] lookup_bdev: inode->i_rdev: 0x800002
[ 2.748163] get_tree_bdev: lookup_bdev(/dev/root, dev: 0x800002), error: 0
^^^^^^^^^

> How did you get this output? Are these debug printk()s you added yourself?

Yeah, the good old "sprinkle printks" debugging method. Figured I should
look at the VFS code out of interest. :-)

Thanks a lot for the suggestions, especially about q35!

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-04 17:09:01

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v7 0/9] x86: Confine early 1:1 mapped startup code

On 2/27/24 09:19, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <[email protected]>
>
> This is resend #2 of v5 [0] with some touchups applied.
>
> Changes since v6:
> - Drop flawed patch to move some SME/SEV related calls out of the early
> boot path to avoid the potential need for backporting patches #6/#7
> to kernels where SEV support may not be crucial. This problem will be
> dealt with if/when it arises while doing those backports.
>
> Changes since v5:
> - drop patches that have been merged
> - rebase onto latest tip/x86/boot
> - fix comment regarding CR4.PGE wrt flushing of global TLB entries
> - avoid adding startup code to .noinstr.text as it triggers objtool
> warnings
>
> [0] https://lore.kernel.org/all/[email protected]/
>
> Cc: Kevin Loughlin <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Dionna Glaze <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Brian Gerst <[email protected]>

For the series, tested bare metal boots with mem_encrypt=on /
mem_encrypt=off and boots of SVM, SEV, SEV-ES and SEV-SNP guests.

Tested-by: Tom Lendacky <[email protected]>

>
> Ard Biesheuvel (9):
> x86/startup_64: Simplify CR4 handling in startup code
> x86/startup_64: Defer assignment of 5-level paging global variables
> x86/startup_64: Simplify calculation of initial page table address
> x86/startup_64: Simplify virtual switch on primary boot
> efi/libstub: Add generic support for parsing mem_encrypt=
> x86/boot: Move mem_encrypt= parsing to the decompressor
> x86/sme: Move early SME kernel encryption handling into .head.text
> x86/sev: Move early startup code into .head.text section
> x86/startup_64: Drop global variables keeping track of LA57 state
>
> arch/x86/boot/compressed/misc.c | 15 ++++
> arch/x86/boot/compressed/misc.h | 4 -
> arch/x86/boot/compressed/pgtable_64.c | 12 ---
> arch/x86/boot/compressed/sev.c | 3 +
> arch/x86/boot/compressed/vmlinux.lds.S | 1 +
> arch/x86/include/asm/mem_encrypt.h | 8 +-
> arch/x86/include/asm/pgtable_64_types.h | 43 ++++-----
> arch/x86/include/asm/sev.h | 10 +--
> arch/x86/include/uapi/asm/bootparam.h | 1 +
> arch/x86/kernel/cpu/common.c | 2 -
> arch/x86/kernel/head64.c | 61 ++-----------
> arch/x86/kernel/head_64.S | 93 ++++++++------------
> arch/x86/kernel/sev-shared.c | 23 +++--
> arch/x86/kernel/sev.c | 14 +--
> arch/x86/lib/Makefile | 13 ---
> arch/x86/mm/kasan_init_64.c | 3 -
> arch/x86/mm/mem_encrypt_identity.c | 83 +++++------------
> drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++
> drivers/firmware/efi/libstub/efistub.h | 2 +-
> drivers/firmware/efi/libstub/x86-stub.c | 3 +
> 20 files changed, 147 insertions(+), 255 deletions(-)
>

2024-03-04 19:15:17

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 0/9] x86: Confine early 1:1 mapped startup code

On Mon, 4 Mar 2024 at 18:07, Tom Lendacky <[email protected]> wrote:
>
> On 2/27/24 09:19, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <[email protected]>
> >
> > This is resend #2 of v5 [0] with some touchups applied.
> >
> > Changes since v6:
> > - Drop flawed patch to move some SME/SEV related calls out of the early
> > boot path to avoid the potential need for backporting patches #6/#7
> > to kernels where SEV support may not be crucial. This problem will be
> > dealt with if/when it arises while doing those backports.
> >
> > Changes since v5:
> > - drop patches that have been merged
> > - rebase onto latest tip/x86/boot
> > - fix comment regarding CR4.PGE wrt flushing of global TLB entries
> > - avoid adding startup code to .noinstr.text as it triggers objtool
> > warnings
> >
> > [0] https://lore.kernel.org/all/[email protected]/
> >
> > Cc: Kevin Loughlin <[email protected]>
> > Cc: Tom Lendacky <[email protected]>
> > Cc: Dionna Glaze <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Dave Hansen <[email protected]>
> > Cc: Andy Lutomirski <[email protected]>
> > Cc: Brian Gerst <[email protected]>
>
> For the series, tested bare metal boots with mem_encrypt=on /
> mem_encrypt=off and boots of SVM, SEV, SEV-ES and SEV-SNP guests.
>
> Tested-by: Tom Lendacky <[email protected]>
>

Thanks a lot! I take it this was a kernel built with GCC?

2024-03-04 22:12:31

by tip-bot2 for Ard Biesheuvel

[permalink] [raw]
Subject: [tip: x86/boot] x86/sev: Move early startup code into .head.text section

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: 428080c9b19bfda37c478cd626dbd3851db1aff9
Gitweb: https://git.kernel.org/tip/428080c9b19bfda37c478cd626dbd3851db1aff9
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Tue, 27 Feb 2024 16:19:16 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 04 Mar 2024 18:12:37 +01:00

x86/sev: Move early startup code into .head.text section

In preparation for implementing rigorous build time checks to enforce
that only code that can support it will be called from the early 1:1
mapping of memory, move SEV init code that is called in this manner to
the .head.text section.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/boot/compressed/sev.c | 3 +++
arch/x86/include/asm/sev.h | 10 +++++-----
arch/x86/kernel/sev-shared.c | 23 ++++++++++-------------
arch/x86/kernel/sev.c | 14 ++++++++------
4 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 0732918..bea0719 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -116,6 +116,9 @@ static bool fault_in_kernel_space(unsigned long address)
#undef __init
#define __init

+#undef __head
+#define __head
+
#define __BOOT_COMPRESSED

/* Basic instruction decoding support needed */
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index bed95e1..cf67113 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -213,16 +213,16 @@ static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate)
struct snp_guest_request_ioctl;

void setup_ghcb(void);
-void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
- unsigned long npages);
-void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
- unsigned long npages);
+void early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages);
+void early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages);
void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op);
void snp_set_memory_shared(unsigned long vaddr, unsigned long npages);
void snp_set_memory_private(unsigned long vaddr, unsigned long npages);
void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
-void __init __noreturn snp_abort(void);
+void __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 snp_get_unsupported_features(u64 status);
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index ae79f95..0bd7ccb 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -93,7 +93,8 @@ static bool __init sev_es_check_cpu_features(void)
return true;
}

-static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
+static void __head __noreturn
+sev_es_terminate(unsigned int set, unsigned int reason)
{
u64 val = GHCB_MSR_TERM_REQ;

@@ -330,13 +331,7 @@ static int sev_cpuid_hv(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid
*/
static const struct snp_cpuid_table *snp_cpuid_get_table(void)
{
- void *ptr;
-
- asm ("lea cpuid_table_copy(%%rip), %0"
- : "=r" (ptr)
- : "p" (&cpuid_table_copy));
-
- return ptr;
+ return &RIP_REL_REF(cpuid_table_copy);
}

/*
@@ -395,7 +390,7 @@ static u32 snp_cpuid_calc_xsave_size(u64 xfeatures_en, bool compacted)
return xsave_size;
}

-static bool
+static bool __head
snp_cpuid_get_validated_func(struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -532,7 +527,8 @@ static int snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
* Returns -EOPNOTSUPP if feature not enabled. Any other non-zero return value
* should be treated as fatal by caller.
*/
-static int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
+static int __head
+snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();

@@ -574,7 +570,7 @@ static int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_le
* page yet, so it only supports the MSR based communication with the
* hypervisor and only the CPUID exit-code.
*/
-void __init do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
+void __head do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
{
unsigned int subfn = lower_bits(regs->cx, 32);
unsigned int fn = lower_bits(regs->ax, 32);
@@ -1025,7 +1021,8 @@ struct cc_setup_data {
* Search for a Confidential Computing blob passed in as a setup_data entry
* via the Linux Boot Protocol.
*/
-static struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
+static __head
+struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
{
struct cc_setup_data *sd = NULL;
struct setup_data *hdr;
@@ -1052,7 +1049,7 @@ static struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
* mapping needs to be updated in sync with all the changes to virtual memory
* layout and related mapping facilities throughout the boot process.
*/
-static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
+static void __head setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
{
const struct snp_cpuid_table *cpuid_table_fw, *cpuid_table;
int i;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1ef7ae8..33c14aa 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -25,6 +25,7 @@
#include <linux/psp-sev.h>
#include <uapi/linux/sev-guest.h>

+#include <asm/init.h>
#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
#include <asm/sev.h>
@@ -682,8 +683,9 @@ static u64 __init get_jump_table_addr(void)
return ret;
}

-static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
- unsigned long npages, enum psc_op op)
+static void __head
+early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -739,7 +741,7 @@ e_term:
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
}

-void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+void __head early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
unsigned long npages)
{
/*
@@ -2062,7 +2064,7 @@ fail:
*
* Scan for the blob in that order.
*/
-static __init struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
+static __head struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;

@@ -2088,7 +2090,7 @@ found_cc_info:
return cc_info;
}

-bool __init snp_init(struct boot_params *bp)
+bool __head snp_init(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;

@@ -2110,7 +2112,7 @@ bool __init snp_init(struct boot_params *bp)
return true;
}

-void __init __noreturn snp_abort(void)
+void __head __noreturn snp_abort(void)
{
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
}

2024-03-04 22:12:48

by tip-bot2 for Ard Biesheuvel

[permalink] [raw]
Subject: [tip: x86/boot] x86/boot: Move mem_encrypt= parsing to the decompressor

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: cd0d9d92c8bb46e77de62efd7df13069ddd61e7d
Gitweb: https://git.kernel.org/tip/cd0d9d92c8bb46e77de62efd7df13069ddd61e7d
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Tue, 27 Feb 2024 16:19:14 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 04 Mar 2024 18:12:28 +01:00

x86/boot: Move mem_encrypt= parsing to the decompressor

The early SME/SEV code parses the command line very early, in order to
decide whether or not memory encryption should be enabled, which needs
to occur even before the initial page tables are created.

This is problematic for a number of reasons:
- this early code runs from the 1:1 mapping provided by the decompressor
or firmware, which uses a different translation than the one assumed by
the linker, and so the code needs to be built in a special way;
- parsing external input while the entire kernel image is still mapped
writable is a bad idea in general, and really does not belong in
security minded code;
- the current code ignores the built-in command line entirely (although
this appears to be the case for the entire decompressor)

Given that the decompressor/EFI stub is an intrinsic part of the x86
bootable kernel image, move the command line parsing there and out of
the core kernel. This removes the need to build lib/cmdline.o in a
special way, or to use RIP-relative LEA instructions in inline asm
blocks.

This involves a new xloadflag in the setup header to indicate
that mem_encrypt=on appeared on the kernel command line.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/boot/compressed/misc.c | 15 +++++++++++-
arch/x86/include/uapi/asm/bootparam.h | 1 +-
arch/x86/lib/Makefile | 13 +----------
arch/x86/mm/mem_encrypt_identity.c | 32 ++----------------------
drivers/firmware/efi/libstub/x86-stub.c | 3 ++-
5 files changed, 22 insertions(+), 42 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index bd6857a..408507e 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -372,6 +372,19 @@ unsigned long decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
}

/*
+ * Set the memory encryption xloadflag based on the mem_encrypt= command line
+ * parameter, if provided.
+ */
+static void parse_mem_encrypt(struct setup_header *hdr)
+{
+ int on = cmdline_find_option_bool("mem_encrypt=on");
+ int off = cmdline_find_option_bool("mem_encrypt=off");
+
+ if (on > off)
+ hdr->xloadflags |= XLF_MEM_ENCRYPTION;
+}
+
+/*
* The compressed kernel image (ZO), has been moved so that its position
* is against the end of the buffer used to hold the uncompressed kernel
* image (VO) and the execution environment (.bss, .brk), which makes sure
@@ -401,6 +414,8 @@ asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
/* Clear flags intended for solely in-kernel use. */
boot_params_ptr->hdr.loadflags &= ~KASLR_FLAG;

+ parse_mem_encrypt(&boot_params_ptr->hdr);
+
sanitize_boot_params(boot_params_ptr);

if (boot_params_ptr->screen_info.orig_video_mode == 7) {
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc..eeea058 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -38,6 +38,7 @@
#define XLF_EFI_KEXEC (1<<4)
#define XLF_5LEVEL (1<<5)
#define XLF_5LEVEL_ENABLED (1<<6)
+#define XLF_MEM_ENCRYPTION (1<<7)

#ifndef __ASSEMBLY__

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index ea3a28e..f0dae4f 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -14,19 +14,6 @@ ifdef CONFIG_KCSAN
CFLAGS_REMOVE_delay.o = $(CC_FLAGS_FTRACE)
endif

-# Early boot use of cmdline; don't instrument it
-ifdef CONFIG_AMD_MEM_ENCRYPT
-KCOV_INSTRUMENT_cmdline.o := n
-KASAN_SANITIZE_cmdline.o := n
-KCSAN_SANITIZE_cmdline.o := n
-
-ifdef CONFIG_FUNCTION_TRACER
-CFLAGS_REMOVE_cmdline.o = -pg
-endif
-
-CFLAGS_cmdline.o := -fno-stack-protector -fno-jump-tables
-endif
-
inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
quiet_cmd_inat_tables = GEN $@
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 0166ab1..d210c7f 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -43,7 +43,6 @@

#include <asm/setup.h>
#include <asm/sections.h>
-#include <asm/cmdline.h>
#include <asm/coco.h>
#include <asm/sev.h>

@@ -95,9 +94,6 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");

-static char sme_cmdline_arg[] __initdata = "mem_encrypt";
-static char sme_cmdline_on[] __initdata = "on";
-
static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
@@ -504,11 +500,9 @@ void __init sme_encrypt_kernel(struct boot_params *bp)

void __init sme_enable(struct boot_params *bp)
{
- const char *cmdline_ptr, *cmdline_arg, *cmdline_on;
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;
unsigned long me_mask;
- char buffer[16];
bool snp;
u64 msr;

@@ -551,6 +545,9 @@ void __init sme_enable(struct boot_params *bp)

/* Check if memory encryption is enabled */
if (feature_mask == AMD_SME_BIT) {
+ if (!(bp->hdr.xloadflags & XLF_MEM_ENCRYPTION))
+ return;
+
/*
* No SME if Hypervisor bit is set. This check is here to
* prevent a guest from trying to enable SME. For running as a
@@ -570,31 +567,8 @@ void __init sme_enable(struct boot_params *bp)
msr = __rdmsr(MSR_AMD64_SYSCFG);
if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
return;
- } else {
- /* SEV state cannot be controlled by a command line option */
- goto out;
}

- /*
- * Fixups have not been applied to phys_base yet and we're running
- * identity mapped, so we must obtain the address to the SME command
- * line argument data using rip-relative addressing.
- */
- asm ("lea sme_cmdline_arg(%%rip), %0"
- : "=r" (cmdline_arg)
- : "p" (sme_cmdline_arg));
- asm ("lea sme_cmdline_on(%%rip), %0"
- : "=r" (cmdline_on)
- : "p" (sme_cmdline_on));
-
- cmdline_ptr = (const char *)((u64)bp->hdr.cmd_line_ptr |
- ((u64)bp->ext_cmd_line_ptr << 32));
-
- if (cmdline_find_option(cmdline_ptr, cmdline_arg, buffer, sizeof(buffer)) < 0 ||
- strncmp(buffer, cmdline_on, sizeof(buffer)))
- return;
-
-out:
RIP_REL_REF(sme_me_mask) = me_mask;
physical_mask &= ~me_mask;
cc_vendor = CC_VENDOR_AMD;
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 99429bc..0336ed1 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -884,6 +884,9 @@ void __noreturn efi_stub_entry(efi_handle_t handle,
}
}

+ if (efi_mem_encrypt > 0)
+ hdr->xloadflags |= XLF_MEM_ENCRYPTION;
+
status = efi_decompress_kernel(&kernel_entry);
if (status != EFI_SUCCESS) {
efi_err("Failed to decompress kernel\n");
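
(A note on the on/off comparison in parse_mem_encrypt() above:
cmdline_find_option_bool() returns the option's position on the command
line, or 0 if absent, so "on > off" makes mem_encrypt=on take effect
only when it is present and appears later on the command line than
mem_encrypt=off, comparing first occurrences.)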

2024-03-04 22:13:08

by tip-bot2 for Ard Biesheuvel

[permalink] [raw]
Subject: [tip: x86/boot] efi/libstub: Add generic support for parsing mem_encrypt=

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: 7205f06e847422b66c1506eee01b9998ffc75d76
Gitweb: https://git.kernel.org/tip/7205f06e847422b66c1506eee01b9998ffc75d76
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Tue, 27 Feb 2024 16:19:13 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 04 Mar 2024 18:12:24 +01:00

efi/libstub: Add generic support for parsing mem_encrypt=

Parse the mem_encrypt= command line parameter from the EFI stub if
CONFIG_ARCH_HAS_MEM_ENCRYPT=y, so that it can be passed to the early
boot code by the arch code in the stub.

This avoids the need for the core kernel to do any string parsing very
early in the boot.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++++++++
drivers/firmware/efi/libstub/efistub.h | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
index bfa3062..3dc2f9a 100644
--- a/drivers/firmware/efi/libstub/efi-stub-helper.c
+++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
@@ -24,6 +24,8 @@ static bool efi_noinitrd;
static bool efi_nosoftreserve;
static bool efi_disable_pci_dma = IS_ENABLED(CONFIG_EFI_DISABLE_PCI_DMA);

+int efi_mem_encrypt;
+
bool __pure __efi_soft_reserve_enabled(void)
{
return !efi_nosoftreserve;
@@ -75,6 +77,12 @@ efi_status_t efi_parse_options(char const *cmdline)
efi_noinitrd = true;
} else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
efi_no5lvl = true;
+ } else if (IS_ENABLED(CONFIG_ARCH_HAS_MEM_ENCRYPT) &&
+ !strcmp(param, "mem_encrypt") && val) {
+ if (parse_option_str(val, "on"))
+ efi_mem_encrypt = 1;
+ else if (parse_option_str(val, "off"))
+ efi_mem_encrypt = -1;
} else if (!strcmp(param, "efi") && val) {
efi_nochunk = parse_option_str(val, "nochunk");
efi_novamap |= parse_option_str(val, "novamap");
diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index c04b82e..fc18fd6 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -37,8 +37,8 @@ extern bool efi_no5lvl;
extern bool efi_nochunk;
extern bool efi_nokaslr;
extern int efi_loglevel;
+extern int efi_mem_encrypt;
extern bool efi_novamap;
-
extern const efi_system_table_t *efi_system_table;

typedef union efi_dxe_services_table efi_dxe_services_table_t;

2024-03-04 22:13:49

by tip-bot2 for Ard Biesheuvel

[permalink] [raw]
Subject: [tip: x86/boot] x86/startup_64: Simplify calculation of initial page table address

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: d6a41f184dcea0814260af2780e147022c11dca8
Gitweb: https://git.kernel.org/tip/d6a41f184dcea0814260af2780e147022c11dca8
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Tue, 27 Feb 2024 16:19:11 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 04 Mar 2024 18:12:16 +01:00

x86/startup_64: Simplify calculation of initial page table address

Determining the address of the initial page table to program into CR3
involves:
- taking the physical address
- adding the SME encryption mask

On the primary entry path, the code is mapped using a 1:1 virtual to
physical translation, so the physical address can be taken directly
using a RIP-relative LEA instruction.

On the secondary entry path, the address can be obtained by taking the
offset from the virtual kernel base (__START_KERNEL_map) and adding the
physical kernel base.

This is implemented in a slightly confusing way, so clean this up.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/head_64.S | 25 +++++++------------------
1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 54207e7..b8b7118 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -111,13 +111,11 @@ SYM_CODE_START_NOALIGN(startup_64)
call __startup_64

/* Form the CR3 value being sure to include the CR3 modifier */
- addq $(early_top_pgt - __START_KERNEL_map), %rax
+ leaq early_top_pgt(%rip), %rcx
+ addq %rcx, %rax

#ifdef CONFIG_AMD_MEM_ENCRYPT
mov %rax, %rdi
- mov %rax, %r14
-
- addq phys_base(%rip), %rdi

/*
* For SEV guests: Verify that the C-bit is correct. A malicious
@@ -126,12 +124,6 @@ SYM_CODE_START_NOALIGN(startup_64)
* the next RET instruction.
*/
call sev_verify_cbit
-
- /*
- * Restore CR3 value without the phys_base which will be added
- * below, before writing %cr3.
- */
- mov %r14, %rax
#endif

jmp 1f
@@ -171,18 +163,18 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Clear %R15 which holds the boot_params pointer on the boot CPU */
xorl %r15d, %r15d

+ /* Derive the runtime physical address of init_top_pgt[] */
+ movq phys_base(%rip), %rax
+ addq $(init_top_pgt - __START_KERNEL_map), %rax
+
/*
* Retrieve the modifier (SME encryption mask if SME is active) to be
* added to the initial pgdir entry that will be programmed into CR3.
*/
#ifdef CONFIG_AMD_MEM_ENCRYPT
- movq sme_me_mask, %rax
-#else
- xorl %eax, %eax
+ addq sme_me_mask(%rip), %rax
#endif

- /* Form the CR3 value being sure to include the CR3 modifier */
- addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

/*
@@ -212,9 +204,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

- /* Setup early boot stage 4-/5-level pagetables. */
- addq phys_base(%rip), %rax
-
/*
* Switch to new page-table
*

2024-03-04 22:13:52

by tip-bot2 for Ard Biesheuvel

[permalink] [raw]
Subject: [tip: x86/boot] x86/startup_64: Defer assignment of 5-level paging global variables

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: 63bed96604205fa0b23c91d268df5f1f1b26faf6
Gitweb: https://git.kernel.org/tip/63bed96604205fa0b23c91d268df5f1f1b26faf6
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Tue, 27 Feb 2024 16:19:10 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 04 Mar 2024 18:12:06 +01:00

x86/startup_64: Defer assignment of 5-level paging global variables

Assigning the 5-level paging related global variables from the earliest
C code using explicit references that use the 1:1 translation of memory
is unnecessary, as the startup code itself does not rely on them to
create the initial page tables, and this is all it should be doing. So
defer these assignments to the primary C entry code that executes via
the ordinary kernel virtual mapping.
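
For reference, the pattern this removes (see the hunk below) forced
every early assignment through an explicit fixup that rebases a
link-time virtual address onto the physical load address, e.g.:

    static void *fixup_pointer(void *ptr, unsigned long physaddr)
    {
            /* shift from the kernel virtual mapping to the 1:1 mapping */
            return ptr - (void *)_text + (void *)physaddr;
    }

    /* old call site (sketch): */
    *(unsigned int *)fixup_pointer(&pgdir_shift, physaddr) = 48;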

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/pgtable_64_types.h | 2 +-
arch/x86/kernel/head64.c | 44 +++++++-----------------
2 files changed, 15 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 38b54b9..9053dfe 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -21,9 +21,9 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;

-#ifdef CONFIG_X86_5LEVEL
extern unsigned int __pgtable_l5_enabled;

+#ifdef CONFIG_X86_5LEVEL
#ifdef USE_EARLY_PGTABLE_L5
/*
* cpu_feature_enabled() is not available in early boot code.
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index fd77a26..212e8e0 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -23,6 +23,7 @@
#include <linux/pgtable.h>

#include <asm/asm.h>
+#include <asm/page_64.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
@@ -68,24 +69,11 @@ unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
EXPORT_SYMBOL(vmemmap_base);
#endif

-#ifdef CONFIG_X86_5LEVEL
-static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
-{
- return ptr - (void *)_text + (void *)physaddr;
-}
-
-static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
+static inline bool check_la57_support(void)
{
- return fixup_pointer(ptr, physaddr);
-}
-
-static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
-{
- return fixup_pointer(ptr, physaddr);
-}
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return false;

-static bool __head check_la57_support(unsigned long physaddr)
-{
/*
* 5-level paging is detected and enabled at kernel decompression
* stage. Only check if it has been enabled there.
@@ -93,21 +81,8 @@ static bool __head check_la57_support(unsigned long physaddr)
if (!(native_read_cr4() & X86_CR4_LA57))
return false;

- *fixup_int(&__pgtable_l5_enabled, physaddr) = 1;
- *fixup_int(&pgdir_shift, physaddr) = 48;
- *fixup_int(&ptrs_per_p4d, physaddr) = 512;
- *fixup_long(&page_offset_base, physaddr) = __PAGE_OFFSET_BASE_L5;
- *fixup_long(&vmalloc_base, physaddr) = __VMALLOC_BASE_L5;
- *fixup_long(&vmemmap_base, physaddr) = __VMEMMAP_BASE_L5;
-
return true;
}
-#else
-static bool __head check_la57_support(unsigned long physaddr)
-{
- return false;
-}
-#endif

static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
{
@@ -171,7 +146,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
bool la57;
int i;

- la57 = check_la57_support(physaddr);
+ la57 = check_la57_support();

/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
@@ -456,6 +431,15 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

+ if (check_la57_support()) {
+ __pgtable_l5_enabled = 1;
+ pgdir_shift = 48;
+ ptrs_per_p4d = 512;
+ page_offset_base = __PAGE_OFFSET_BASE_L5;
+ vmalloc_base = __VMALLOC_BASE_L5;
+ vmemmap_base = __VMEMMAP_BASE_L5;
+ }
+
cr4_init_shadow();

/* Kill off the identity-map trampoline */

2024-03-04 22:14:11

by tip-bot2 for Ard Biesheuvel

[permalink] [raw]
Subject: [tip: x86/boot] x86/startup_64: Simplify CR4 handling in startup code

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: dada8587068c820ba5e5d09b9c32d8bc28c4dbe6
Gitweb: https://git.kernel.org/tip/dada8587068c820ba5e5d09b9c32d8bc28c4dbe6
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Tue, 27 Feb 2024 16:19:09 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 04 Mar 2024 18:11:34 +01:00

x86/startup_64: Simplify CR4 handling in startup code

When paging is enabled, the CR4.PAE and CR4.LA57 control bits cannot be
changed, and so they can simply be preserved rather than reason about
whether or not they need to be set. CR4.MCE should be preserved unless
the kernel was built without CONFIG_X86_MCE, in which case it must be
cleared.

CR4.PSE should be set explicitly, regardless of whether or not it was
set before.

CR4.PGE is set explicitly, and then cleared and set again after
programming CR3 in order to flush TLB entries based on global
translations. This makes the first assignment redundant, and can
therefore be omitted. So clear PGE by omitting it from the preserve
mask, and set it again explicitly after switching to the new page
tables.

[ bp: Document the exact operation of CR4.PGE ]
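
Rendered as C for clarity (a sketch only; the real sequence must stay
in assembly, as it runs before any stack or C environment exists):

    unsigned long keep = X86_CR4_PAE | X86_CR4_LA57;  /* fixed while CR0.PG=1 */
    unsigned long cr4;

    if (IS_ENABLED(CONFIG_X86_MCE))
            keep |= X86_CR4_MCE;         /* a #MC without a handler is fatal anyway */

    cr4 = native_read_cr4() & keep;      /* PGE omitted: 1->0 flushes global entries */
    cr4 |= X86_CR4_PSE;                  /* ignored in long mode, set uniformly */
    native_write_cr4(cr4);
    /* ... CR3 is programmed here ... */
    native_write_cr4(cr4 | X86_CR4_PGE); /* re-enable global translations */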

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/head_64.S | 35 +++++++++++++++++------------------
1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 86136a7..54207e7 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -185,6 +185,16 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

+ /*
+ * Create a mask of CR4 bits to preserve. Omit PGE in order to flush
+ * global 1:1 translations from the TLBs.
+ *
+ * From the SDM:
+ * "If CR4.PGE is changing from 0 to 1, there were no global TLB
+ * entries before the execution; if CR4.PGE is changing from 1 to 0,
+ * there will be no global TLB entries after the execution."
+ */
+ movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
#ifdef CONFIG_X86_MCE
/*
* Preserve CR4.MCE if the kernel will enable #MC support.
@@ -193,20 +203,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
* configured will crash the system regardless of the CR4.MCE value set
* here.
*/
- movq %cr4, %rcx
- andl $X86_CR4_MCE, %ecx
-#else
- movl $0, %ecx
+ orl $X86_CR4_MCE, %edx
#endif
+ movq %cr4, %rcx
+ andl %edx, %ecx

- /* Enable PAE mode, PSE, PGE and LA57 */
- orl $(X86_CR4_PAE | X86_CR4_PSE | X86_CR4_PGE), %ecx
-#ifdef CONFIG_X86_5LEVEL
- testb $1, __pgtable_l5_enabled(%rip)
- jz 1f
- orl $X86_CR4_LA57, %ecx
-1:
-#endif
+ /* Even if ignored in long mode, set PSE uniformly on all logical CPUs. */
+ btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

/* Setup early boot stage 4-/5-level pagetables. */
@@ -223,14 +226,10 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
movq %rax, %cr3

/*
- * Do a global TLB flush after the CR3 switch to make sure the TLB
- * entries from the identity mapping are flushed.
+ * Set CR4.PGE to re-enable global translations.
*/
- movq %cr4, %rcx
- movq %rcx, %rax
- xorq $X86_CR4_PGE, %rcx
+ btsl $X86_CR4_PGE_BIT, %ecx
movq %rcx, %cr4
- movq %rax, %cr4

/* Ensure I am executing from virtual addresses */
movq $1f, %rax

2024-03-04 22:31:39

by tip-bot2 for Ard Biesheuvel

[permalink] [raw]
Subject: [tip: x86/boot] x86/sme: Move early SME kernel encryption handling into .head.text

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: 48204aba801f1b512b3abed10b8e1a63e03f3dd1
Gitweb: https://git.kernel.org/tip/48204aba801f1b512b3abed10b8e1a63e03f3dd1
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Tue, 27 Feb 2024 16:19:15 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 04 Mar 2024 18:12:33 +01:00

x86/sme: Move early SME kernel encryption handling into .head.text

The .head.text section is the initial primary entrypoint of the core
kernel, and is entered with the CPU executing from a 1:1 mapping of
memory. Such code must never access global variables using absolute
references, as these are based on the kernel virtual mapping which is
not active yet at this point.

Given that the SME startup code is also called from this early execution
context, move it into .head.text as well. This will allow more thorough
build time checks in the future to ensure that early startup code only
uses RIP-relative references to global variables.

Also replace some occurrences of __pa_symbol() [which relies on the
compiler generating an absolute reference, which is not guaranteed] and
an open coded RIP-relative access with RIP_REL_REF().
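
Sketch of the substitution (both old forms appear in the hunks below):

    kernel_start   = __pa_symbol(_text);       /* old: may emit an absolute ref */
    asm ("lea sme_workarea(%%rip), %0"         /* old: open-coded RIP-relative */
         : "=r" (workarea_start) : "p" (sme_workarea));

    kernel_start   = (unsigned long)RIP_REL_REF(_text);        /* new */
    workarea_start = (unsigned long)RIP_REL_REF(sme_workarea); /* new */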

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/mem_encrypt.h | 8 +++---
arch/x86/mm/mem_encrypt_identity.c | 42 +++++++++++------------------
2 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index b31eb9f..f922b68 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -47,8 +47,8 @@ void __init sme_unmap_bootdata(char *real_mode_data);

void __init sme_early_init(void);

-void __init sme_encrypt_kernel(struct boot_params *bp);
-void __init sme_enable(struct boot_params *bp);
+void sme_encrypt_kernel(struct boot_params *bp);
+void sme_enable(struct boot_params *bp);

int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size);
int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
@@ -81,8 +81,8 @@ static inline void __init sme_unmap_bootdata(char *real_mode_data) { }

static inline void __init sme_early_init(void) { }

-static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
-static inline void __init sme_enable(struct boot_params *bp) { }
+static inline void sme_encrypt_kernel(struct boot_params *bp) { }
+static inline void sme_enable(struct boot_params *bp) { }

static inline void sev_es_init_vc_handling(void) { }

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index d210c7f..64b5005 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -41,6 +41,7 @@
#include <linux/mem_encrypt.h>
#include <linux/cc_platform.h>

+#include <asm/init.h>
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/coco.h>
@@ -94,7 +95,7 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");

-static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
+static void __head sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
pgd_t *pgd_p;
@@ -109,7 +110,7 @@ static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
memset(pgd_p, 0, pgd_size);
}

-static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
+static pud_t __head *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -146,7 +147,7 @@ static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
return pud;
}

-static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
+static void __head sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -162,7 +163,7 @@ static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
set_pmd(pmd, __pmd(ppd->paddr | ppd->pmd_flags));
}

-static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
+static void __head sme_populate_pgd(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -188,7 +189,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
}

-static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
+static void __head __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd_large(ppd);
@@ -198,7 +199,7 @@ static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
}
}

-static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
+static void __head __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd(ppd);
@@ -208,7 +209,7 @@ static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
}
}

-static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
+static void __head __sme_map_range(struct sme_populate_pgd_data *ppd,
pmdval_t pmd_flags, pteval_t pte_flags)
{
unsigned long vaddr_end;
@@ -232,22 +233,22 @@ static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
__sme_map_range_pte(ppd);
}

-static void __init sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_ENC, PTE_FLAGS_ENC);
}

-static void __init sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC, PTE_FLAGS_DEC);
}

-static void __init sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC_WP, PTE_FLAGS_DEC_WP);
}

-static unsigned long __init sme_pgtable_calc(unsigned long len)
+static unsigned long __head sme_pgtable_calc(unsigned long len)
{
unsigned long entries = 0, tables = 0;

@@ -284,7 +285,7 @@ static unsigned long __init sme_pgtable_calc(unsigned long len)
return entries + tables;
}

-void __init sme_encrypt_kernel(struct boot_params *bp)
+void __head sme_encrypt_kernel(struct boot_params *bp)
{
unsigned long workarea_start, workarea_end, workarea_len;
unsigned long execute_start, execute_end, execute_len;
@@ -319,9 +320,8 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
* memory from being cached.
*/

- /* Physical addresses gives us the identity mapped virtual addresses */
- kernel_start = __pa_symbol(_text);
- kernel_end = ALIGN(__pa_symbol(_end), PMD_SIZE);
+ kernel_start = (unsigned long)RIP_REL_REF(_text);
+ kernel_end = ALIGN((unsigned long)RIP_REL_REF(_end), PMD_SIZE);
kernel_len = kernel_end - kernel_start;

initrd_start = 0;
@@ -339,14 +339,6 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
#endif

/*
- * We're running identity mapped, so we must obtain the address to the
- * SME encryption workarea using rip-relative addressing.
- */
- asm ("lea sme_workarea(%%rip), %0"
- : "=r" (workarea_start)
- : "p" (sme_workarea));
-
- /*
* Calculate required number of workarea bytes needed:
* executable encryption area size:
* stack page (PAGE_SIZE)
@@ -355,7 +347,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
* pagetable structures for the encryption of the kernel
* pagetable structures for workarea (in case not currently mapped)
*/
- execute_start = workarea_start;
+ execute_start = workarea_start = (unsigned long)RIP_REL_REF(sme_workarea);
execute_end = execute_start + (PAGE_SIZE * 2) + PMD_SIZE;
execute_len = execute_end - execute_start;

@@ -498,7 +490,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
native_write_cr3(__native_read_cr3());
}

-void __init sme_enable(struct boot_params *bp)
+void __head sme_enable(struct boot_params *bp)
{
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;

2024-03-04 22:32:39

by tip-bot2 for Ard Biesheuvel

[permalink] [raw]
Subject: [tip: x86/boot] x86/startup_64: Simplify virtual switch on primary boot

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: 828263957611c210da00c1820db73fac217135b6
Gitweb: https://git.kernel.org/tip/828263957611c210da00c1820db73fac217135b6
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Tue, 27 Feb 2024 16:19:12 +01:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Mon, 04 Mar 2024 18:12:20 +01:00

x86/startup_64: Simplify virtual switch on primary boot

The secondary startup code is used on the primary boot path as well, but
in this case, the initial part runs from a 1:1 mapping, until an
explicit cross-jump is made to the kernel virtual mapping of the same
code.

On the secondary boot path, this jump is pointless as the code already
executes from the mapping targeted by the jump. So combine this
cross-jump with the jump from startup_64() into the common boot path.
This simplifies the execution flow, and clearly separates code that runs
from a 1:1 mapping from code that runs from the kernel virtual mapping.

Note that this requires a page table switch, so hoist the CR3 assignment
into startup_64() as well. And since absolute symbol references will no
longer be permitted in .head.text once we enable the associated build
time checks, a RIP-relative memory operand is used in the JMP
instruction, referring to an absolute constant in the .init.rodata
section.

Given that the secondary startup code does not require a special
placement inside the executable, move it to the .text section.
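
A conceptual C equivalent of the cross-jump (names and attributes are
illustrative; the real implementation is the assembly below):

    extern void common_startup_64(void);

    /* absolute pointer in .init.rodata; the load of the pointer itself
       is RIP-relative, so the 1:1-mapped code stays position independent */
    static void (* const virt_entry)(void)
            __attribute__((__section__(".init.rodata"))) = common_startup_64;

    static void primary_startup(unsigned long cr3)
    {
            native_write_cr3(cr3);  /* early_top_pgt: identity map still present */
            virt_entry();           /* lands on the kernel virtual alias of the code */
    }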

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/head_64.S | 42 +++++++++++++++++++-------------------
1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b8b7118..79f7c34 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -39,7 +39,6 @@ L4_START_KERNEL = l4_index(__START_KERNEL_map)

L3_START_KERNEL = pud_index(__START_KERNEL_map)

- .text
__HEAD
.code64
SYM_CODE_START_NOALIGN(startup_64)
@@ -126,9 +125,21 @@ SYM_CODE_START_NOALIGN(startup_64)
call sev_verify_cbit
#endif

- jmp 1f
+ /*
+ * Switch to early_top_pgt which still has the identity mappings
+ * present.
+ */
+ movq %rax, %cr3
+
+ /* Branch to the common startup code at its kernel virtual address */
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *0f(%rip)
SYM_CODE_END(startup_64)

+ __INITRODATA
+0: .quad common_startup_64
+
+ .text
SYM_CODE_START(secondary_startup_64)
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR
@@ -174,8 +185,15 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
#ifdef CONFIG_AMD_MEM_ENCRYPT
addq sme_me_mask(%rip), %rax
#endif
+ /*
+ * Switch to the init_top_pgt here, away from the trampoline_pgd and
+ * unmap the identity mapped ranges.
+ */
+ movq %rax, %cr3

-1:
+SYM_INNER_LABEL(common_startup_64, SYM_L_LOCAL)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR

/*
* Create a mask of CR4 bits to preserve. Omit PGE in order to flush
@@ -205,29 +223,11 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
movq %rcx, %cr4

/*
- * Switch to new page-table
- *
- * For the boot CPU this switches to early_top_pgt which still has the
- * identity mappings present. The secondary CPUs will switch to the
- * init_top_pgt here, away from the trampoline_pgd and unmap the
- * identity mapped ranges.
- */
- movq %rax, %cr3
-
- /*
* Set CR4.PGE to re-enable global translations.
*/
btsl $X86_CR4_PGE_BIT, %ecx
movq %rcx, %cr4

- /* Ensure I am executing from virtual addresses */
- movq $1f, %rax
- ANNOTATE_RETPOLINE_SAFE
- jmp *%rax
-1:
- UNWIND_HINT_END_OF_STACK
- ANNOTATE_NOENDBR // above
-
#ifdef CONFIG_SMP
/*
* For parallel boot, the APIC ID is read from the APIC, and then

2024-03-04 22:41:37

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v7 0/9] x86: Confine early 1:1 mapped startup code

On 3/4/24 13:13, Ard Biesheuvel wrote:
> On Mon, 4 Mar 2024 at 18:07, Tom Lendacky <[email protected]> wrote:
>>
>> On 2/27/24 09:19, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <[email protected]>
>>> [...]
>>
>> For the series, tested bare metal boots with mem_encrypt=on /
>> mem_encrypt=off and boots of SVM, SEV, SEV-ES and SEV-SNP guests.
>>
>> Tested-by: Tom Lendacky <[email protected]>
>>
>
> Thanks a lot! I take it this was a kernel built with GCC?

Yes, it was a GCC build, so I went back and rebuilt with Clang-14 and
everything was fine there, too.

Thanks,
Tom

2024-03-05 08:47:02

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 0/9] x86: Confine early 1:1 mapped startup code

On Mon, 4 Mar 2024 at 23:41, Tom Lendacky <[email protected]> wrote:
>
> On 3/4/24 13:13, Ard Biesheuvel wrote:
> > On Mon, 4 Mar 2024 at 18:07, Tom Lendacky <[email protected]> wrote:
> >>
> >> On 2/27/24 09:19, Ard Biesheuvel wrote:
> >>> From: Ard Biesheuvel <[email protected]>
> >>> [...]
> >>
> >> For the series, tested bare metal boots with mem_encrypt=on /
> >> mem_encrypt=off and boots of SVM, SEV, SEV-ES and SEV-SNP guests.
> >>
> >> Tested-by: Tom Lendacky <[email protected]>
> >>
> >
> > Thanks a lot! I take it this was a kernel built with GCC?
>
> Yes, it was a GCC build, so I went back and rebuilt with Clang-14 and
> everything was fine there, too.
>

OK, thanks again.