2024-02-21 11:36:38

by Ard Biesheuvel

Subject: [PATCH v5 00/16] x86: Confine early 1:1 mapped startup code

From: Ard Biesheuvel <[email protected]>

This is a follow-up to [0] which implemented rigorous build time checks
to ensure that any code that is executed during early startup supports
running from the initial 1:1 mapping of memory, which is how the kernel
is entered from the decompressor or the EFI firmware.

Using PIC codegen and introducing new magic sections into generic code
would create a maintenance burden, and more experimentation is needed
there. One issue with PIC codegen is that it still permits the compiler
to make assumptions about the runtime address of global objects (modulo
runtime relocation), which is incompatible with how the kernel is
entered, i.e., running a fully linked and relocated executable from the
wrong runtime address.

The RIP_REL_REF() macro that was introduced recently [1] is actually
more appropriate for this use case, as it hides the access from the
compiler entirely, and so the compiler can never predict its result.
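
For reference, RIP_REL_REF() boils down to an inline asm LEA that forces
a RIP-relative reference, so the result always reflects the runtime
address. Roughly (simplified from the version introduced in [1], shown
here only for illustration):

static __always_inline __pure void *rip_rel_ptr(void *p)
{
	/* Emit a RIP-relative LEA so the runtime address is produced */
	asm("leaq %c1(%%rip), %0" : "=r"(p) : "i"(p));
	return p;
}

#define RIP_REL_REF(var)	(*(typeof(&(var)))rip_rel_ptr(&(var)))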

To make incremental progress on this, this v5 drops the special
instrumentation for .pi.text and PIC codegen, but retains all the
cleanup work on the startup code to make it more maintainable and more
obviously correct.

In particular, this involves:
- getting rid of early accesses to global objects, either by moving them
to the stack, deferring the access until later, or dropping the
globals entirely;
- moving all code that runs early via the 1:1 mapping into .head.text,
and moving code that does not out of it, so that build time checks can
be added later to ensure that no inadvertent absolute references were
emitted into code that does not tolerate them;
- removing fixup_pointer() and occurrences of __pa_symbol(), both of which
rely on the compiler emitting absolute references, which is not
guaranteed even without -fpic (the compiler may still emit RIP-relative
references in some cases); see the sketch below this list.
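
To illustrate the last point (the helper below is the existing one being
removed, reproduced only for reference): fixup_pointer() is only correct
if the compiler emitted an absolute reference to begin with, whereas
RIP_REL_REF() always yields the runtime address:

static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
	/* Only correct if 'ptr' carries the link-time (absolute) address */
	return ptr - (void *)_text + (void *)physaddr;
}

	/* before */	pmd_entry &= *(pteval_t *)fixup_pointer(&__supported_pte_mask, physaddr);
	/* after  */	pmd_entry &= RIP_REL_REF(__supported_pte_mask);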

Changes since v4 [2]:
- incorporate Boris's tweaked version of patch #1
- split __startup64() changes into multiple patches, and align more
closely with the original logic
- fix build for CONFIG_X86_5LEVEL=n
- add comment to clarify that CR4.PSE is always set deliberately
- add separate SME startup change to remove SME/SEV related calls from
the non-SME/SEV boot path (this can be backported further back more
easily than the changes needed for SEV guest boot)

Changes since v3:
- dropped half of the patches and added a couple of new ones
- applied feedback from Boris to patches that were retained, mostly
related to some minor oversights on my part, and to some style issues

[0] https://lkml.kernel.org/r/20240129180502.4069817-21-ardb%2Bgit%40google.com
[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/sev&id=1c811d403afd73f0
[2] https://lkml.kernel.org/r/20240213124143.1484862-13-ardb%2Bgit%40google.com

Cc: Kevin Loughlin <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Dionna Glaze <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Justin Stitt <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Ard Biesheuvel (16):
x86/startup_64: Simplify global variable accesses in GDT/IDT
programming
x86/startup_64: Use RIP_REL_REF() to assign phys_base
x86/startup_64: Use RIP_REL_REF() to access early_dynamic_pgts[]
x86/startup_64: Use RIP_REL_REF() to access __supported_pte_mask
x86/startup_64: Use RIP_REL_REF() to access early page tables
x86/startup_64: Use RIP_REL_REF() to access early_top_pgt[]
x86/startup_64: Simplify CR4 handling in startup code
x86/startup_64: Defer assignment of 5-level paging global variables
x86/startup_64: Simplify calculation of initial page table address
x86/startup_64: Simplify virtual switch on primary boot
x86/sme: Avoid SME/SEV related checks on non-SME/SEV platforms
efi/libstub: Add generic support for parsing mem_encrypt=
x86/boot: Move mem_encrypt= parsing to the decompressor
x86/sme: Move early SME kernel encryption handling into .head.text
x86/sev: Move early startup code into .head.text section
x86/startup_64: Drop global variables keeping track of LA57 state

arch/x86/boot/compressed/misc.c | 15 ++
arch/x86/boot/compressed/misc.h | 4 -
arch/x86/boot/compressed/pgtable_64.c | 12 --
arch/x86/boot/compressed/sev.c | 3 +
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/include/asm/mem_encrypt.h | 8 +-
arch/x86/include/asm/pgtable_64_types.h | 43 ++---
arch/x86/include/asm/setup.h | 2 +-
arch/x86/include/asm/sev.h | 10 +-
arch/x86/include/uapi/asm/bootparam.h | 1 +
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/head64.c | 195 ++++++--------------
arch/x86/kernel/head_64.S | 95 ++++------
arch/x86/kernel/sev-shared.c | 23 +--
arch/x86/kernel/sev.c | 14 +-
arch/x86/lib/Makefile | 13 --
arch/x86/mm/kasan_init_64.c | 3 -
arch/x86/mm/mem_encrypt_identity.c | 89 +++------
drivers/firmware/efi/libstub/efi-stub-helper.c | 8 +
drivers/firmware/efi/libstub/efistub.h | 2 +-
drivers/firmware/efi/libstub/x86-stub.c | 3 +
21 files changed, 203 insertions(+), 343 deletions(-)


base-commit: ee8ff8768735edc3e013837c4416f819543ddc17
--
2.44.0.rc0.258.g7320e95886-goog



2024-02-21 11:37:03

by Ard Biesheuvel

Subject: [PATCH v5 02/16] x86/startup_64: Use RIP_REL_REF() to assign phys_base

From: Ard Biesheuvel <[email protected]>

phys_base is assigned from code that executes from a 1:1 mapping so it
cannot use a plain access from C. Replace the use of fixup_pointer()
with RIP_REL_REF(), which is better and simpler.

While at it, move the assignment to before the addition of the SME mask
so there is no need to subtract it again, and drop the unnecessary
addition (phys_base is statically initialized to 0x0).

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head64.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 1d6865eafe6a..f98f5b6a06b5 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -192,6 +192,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* and the address I am actually running at.
*/
load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+ RIP_REL_REF(phys_base) = load_delta;

/* Is the address not 2M aligned? */
if (load_delta & ~PMD_MASK)
@@ -301,12 +302,6 @@ unsigned long __head __startup_64(unsigned long physaddr,
for (; i < PTRS_PER_PMD; i++)
pmd[i] &= ~_PAGE_PRESENT;

- /*
- * Fixup phys_base - remove the memory encryption mask to obtain
- * the true physical address.
- */
- *fixup_long(&phys_base, physaddr) += load_delta - sme_get_me_mask();
-
return sme_postprocess_startup(bp, pmd);
}

--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:37:17

by Ard Biesheuvel

Subject: [PATCH v5 01/16] x86/startup_64: Simplify global variable accesses in GDT/IDT programming

From: Ard Biesheuvel <[email protected]>

There are two code paths in the startup code to program an IDT: one that
runs from the 1:1 mapping and one that runs from the virtual kernel
mapping. Currently, these are strictly separate because fixup_pointer()
is used on the 1:1 path, which will produce the wrong value when used
while executing from the virtual kernel mapping.

Switch to RIP_REL_REF() so that the two code paths can be merged. Also,
move the GDT and IDT descriptors to the stack so that they can be
referenced directly, rather than via RIP_REL_REF().

Rename startup_64_setup_env() to startup_64_setup_gdt_idt() while at it,
to make the call from assembler self-documenting.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/include/asm/setup.h | 2 +-
arch/x86/kernel/head64.c | 75 ++++++++------------
arch/x86/kernel/head_64.S | 4 +-
3 files changed, 32 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 5c83729c8e71..e61e68d71cba 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -48,7 +48,7 @@ extern unsigned long saved_video_mode;
extern void reserve_standard_io_resources(void);
extern void i386_reserve_resources(void);
extern unsigned long __startup_64(unsigned long physaddr, struct boot_params *bp);
-extern void startup_64_setup_env(unsigned long physbase);
+extern void startup_64_setup_gdt_idt(void);
extern void early_setup_idt(void);
extern void __init do_early_exception(struct pt_regs *regs, int trapnr);

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dc0956067944..1d6865eafe6a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -22,6 +22,7 @@
#include <linux/cc_platform.h>
#include <linux/pgtable.h>

+#include <asm/asm.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
@@ -76,15 +77,6 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-/*
- * Address needs to be set at runtime because it references the startup_gdt
- * while the kernel still uses a direct mapping.
- */
-static struct desc_ptr startup_gdt_descr __initdata = {
- .size = sizeof(startup_gdt)-1,
- .address = 0,
-};
-
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
return ptr - (void *)_text + (void *)physaddr;
@@ -569,62 +561,52 @@ void __init __noreturn x86_64_start_reservations(char *real_mode_data)
*/
static gate_desc bringup_idt_table[NUM_EXCEPTION_VECTORS] __page_aligned_data;

-static struct desc_ptr bringup_idt_descr = {
- .size = (NUM_EXCEPTION_VECTORS * sizeof(gate_desc)) - 1,
- .address = 0, /* Set at runtime */
-};
-
-static void set_bringup_idt_handler(gate_desc *idt, int n, void *handler)
+/* This may run while still in the direct mapping */
+static void __head startup_64_load_idt(void *vc_handler)
{
-#ifdef CONFIG_AMD_MEM_ENCRYPT
+ struct desc_ptr desc = {
+ .address = (unsigned long)&RIP_REL_REF(bringup_idt_table),
+ .size = sizeof(bringup_idt_table) - 1,
+ };
struct idt_data data;
- gate_desc desc;
-
- init_idt_data(&data, n, handler);
- idt_init_desc(&desc, &data);
- native_write_idt_entry(idt, n, &desc);
-#endif
-}
+ gate_desc idt_desc;

-/* This runs while still in the direct mapping */
-static void __head startup_64_load_idt(unsigned long physbase)
-{
- struct desc_ptr *desc = fixup_pointer(&bringup_idt_descr, physbase);
- gate_desc *idt = fixup_pointer(bringup_idt_table, physbase);
-
-
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
- void *handler;
-
- /* VMM Communication Exception */
- handler = fixup_pointer(vc_no_ghcb, physbase);
- set_bringup_idt_handler(idt, X86_TRAP_VC, handler);
+ /* @vc_handler is set only for a VMM Communication Exception */
+ if (vc_handler) {
+ init_idt_data(&data, X86_TRAP_VC, vc_handler);
+ idt_init_desc(&idt_desc, &data);
+ native_write_idt_entry((gate_desc *)desc.address, X86_TRAP_VC, &idt_desc);
}

- desc->address = (unsigned long)idt;
- native_load_idt(desc);
+ native_load_idt(&desc);
}

/* This is used when running on kernel addresses */
void early_setup_idt(void)
{
- /* VMM Communication Exception */
+ void *handler = NULL;
+
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
setup_ghcb();
- set_bringup_idt_handler(bringup_idt_table, X86_TRAP_VC, vc_boot_ghcb);
+ handler = vc_boot_ghcb;
}

- bringup_idt_descr.address = (unsigned long)bringup_idt_table;
- native_load_idt(&bringup_idt_descr);
+ startup_64_load_idt(handler);
}

/*
* Setup boot CPU state needed before kernel switches to virtual addresses.
*/
-void __head startup_64_setup_env(unsigned long physbase)
+void __head startup_64_setup_gdt_idt(void)
{
+ void *handler = NULL;
+
+ struct desc_ptr startup_gdt_descr = {
+ .address = (unsigned long)&RIP_REL_REF(startup_gdt),
+ .size = sizeof(startup_gdt) - 1,
+ };
+
/* Load GDT */
- startup_gdt_descr.address = (unsigned long)fixup_pointer(startup_gdt, physbase);
native_load_gdt(&startup_gdt_descr);

/* New GDT is live - reload data segment registers */
@@ -632,5 +614,8 @@ void __head startup_64_setup_env(unsigned long physbase)
"movl %%eax, %%ss\n"
"movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");

- startup_64_load_idt(physbase);
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+ handler = &RIP_REL_REF(vc_no_ghcb);
+
+ startup_64_load_idt(handler);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d4918d03efb4..3cac98c61066 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -68,8 +68,6 @@ SYM_CODE_START_NOALIGN(startup_64)
/* Set up the stack for verify_cpu() */
leaq (__end_init_task - PTREGS_SIZE)(%rip), %rsp

- leaq _text(%rip), %rdi
-
/* Setup GSBASE to allow stack canary access for C code */
movl $MSR_GS_BASE, %ecx
leaq INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
@@ -77,7 +75,7 @@ SYM_CODE_START_NOALIGN(startup_64)
shrq $32, %rdx
wrmsr

- call startup_64_setup_env
+ call startup_64_setup_gdt_idt

/* Now switch to __KERNEL_CS so IRET works reliably */
pushq $__KERNEL_CS
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:37:18

by Ard Biesheuvel

Subject: [PATCH v5 03/16] x86/startup_64: Use RIP_REL_REF() to access early_dynamic_pgts[]

From: Ard Biesheuvel <[email protected]>

early_dynamic_pgts[] and next_early_pgt are accessed from code that
executes from a 1:1 mapping, so plain accesses from C cannot be used.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head64.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index f98f5b6a06b5..2ac904110f6a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -170,6 +170,7 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
unsigned long __head __startup_64(unsigned long physaddr,
struct boot_params *bp)
{
+ pmd_t (*early_pgts)[PTRS_PER_PMD] = RIP_REL_REF(early_dynamic_pgts);
unsigned long load_delta, *p;
unsigned long pgtable_flags;
pgdval_t *pgd;
@@ -179,7 +180,6 @@ unsigned long __head __startup_64(unsigned long physaddr,
pteval_t *mask_ptr;
bool la57;
int i;
- unsigned int *next_pgt_ptr;

la57 = check_la57_support(physaddr);

@@ -231,15 +231,14 @@ unsigned long __head __startup_64(unsigned long physaddr,
* it avoids problems around wraparound.
*/

- next_pgt_ptr = fixup_pointer(&next_early_pgt, physaddr);
- pud = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
- pmd = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
+ pud = &early_pgts[0]->pmd;
+ pmd = &early_pgts[1]->pmd;
+ RIP_REL_REF(next_early_pgt) = 2;

pgtable_flags = _KERNPG_TABLE_NOENC + sme_get_me_mask();

if (la57) {
- p4d = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++],
- physaddr);
+ p4d = &early_pgts[RIP_REL_REF(next_early_pgt)++]->pmd;

i = (physaddr >> PGDIR_SHIFT) % PTRS_PER_PGD;
pgd[i + 0] = (pgdval_t)p4d + pgtable_flags;
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:37:30

by Ard Biesheuvel

Subject: [PATCH v5 04/16] x86/startup_64: Use RIP_REL_REF() to access __supported_pte_mask

From: Ard Biesheuvel <[email protected]>

__supported_pte_mask is accessed from code that executes from a 1:1
mapping so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head64.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2ac904110f6a..e2573ddae32f 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -177,7 +177,6 @@ unsigned long __head __startup_64(unsigned long physaddr,
p4dval_t *p4d;
pudval_t *pud;
pmdval_t *pmd, pmd_entry;
- pteval_t *mask_ptr;
bool la57;
int i;

@@ -259,8 +258,7 @@ unsigned long __head __startup_64(unsigned long physaddr,

pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
/* Filter out unsupported __PAGE_KERNEL_* bits: */
- mask_ptr = fixup_pointer(&__supported_pte_mask, physaddr);
- pmd_entry &= *mask_ptr;
+ pmd_entry &= RIP_REL_REF(__supported_pte_mask);
pmd_entry += sme_get_me_mask();
pmd_entry += physaddr;

--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:37:45

by Ard Biesheuvel

Subject: [PATCH v5 05/16] x86/startup_64: Use RIP_REL_REF() to access early page tables

From: Ard Biesheuvel <[email protected]>

The early statically allocated page tables are populated from code that
executes from a 1:1 mapping so it cannot use plain accesses from C.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head64.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index e2573ddae32f..7e2c9b581d58 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -215,13 +215,11 @@ unsigned long __head __startup_64(unsigned long physaddr,
p4d[511] += load_delta;
}

- pud = fixup_pointer(level3_kernel_pgt, physaddr);
- pud[510] += load_delta;
- pud[511] += load_delta;
+ RIP_REL_REF(level3_kernel_pgt)[PTRS_PER_PUD - 2].pud += load_delta;
+ RIP_REL_REF(level3_kernel_pgt)[PTRS_PER_PUD - 1].pud += load_delta;

- pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
for (i = FIXMAP_PMD_TOP; i > FIXMAP_PMD_TOP - FIXMAP_PMD_NUM; i--)
- pmd[i] += load_delta;
+ RIP_REL_REF(level2_fixmap_pgt)[i].pmd += load_delta;

/*
* Set up the identity mapping for the switchover. These
@@ -284,7 +282,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* error, causing the BIOS to halt the system.
*/

- pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+ pmd = &RIP_REL_REF(level2_kernel_pgt)->pmd;

/* invalidate pages before the kernel image */
for (i = 0; i < pmd_index((unsigned long)_text); i++)
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:37:59

by Ard Biesheuvel

Subject: [PATCH v5 06/16] x86/startup_64: Use RIP_REL_REF() to access early_top_pgt[]

From: Ard Biesheuvel <[email protected]>

early_top_pgt[] is assigned from code that executes from a 1:1 mapping
so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.

For legibility and to align with the code that populates the lower page
table levels, statically initialize the root level page table with an
entry pointing to level3_kernel_pgt[], and overwrite it when needed to
enable 5-level paging.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head64.c | 21 +++++++++-----------
arch/x86/kernel/head_64.S | 3 ++-
2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 7e2c9b581d58..72351c3121a6 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,6 +77,7 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

+#ifdef CONFIG_X86_5LEVEL
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
return ptr - (void *)_text + (void *)physaddr;
@@ -87,7 +88,6 @@ static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
return fixup_pointer(ptr, physaddr);
}

-#ifdef CONFIG_X86_5LEVEL
static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
{
return fixup_pointer(ptr, physaddr);
@@ -165,14 +165,14 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
* doesn't have to generate PC-relative relocations when accessing globals from
* that function. Clang actually does not generate them, which leads to
* boot-time crashes. To work around this problem, every global pointer must
- * be adjusted using fixup_pointer().
+ * be accessed using RIP_REL_REF().
*/
unsigned long __head __startup_64(unsigned long physaddr,
struct boot_params *bp)
{
pmd_t (*early_pgts)[PTRS_PER_PMD] = RIP_REL_REF(early_dynamic_pgts);
- unsigned long load_delta, *p;
unsigned long pgtable_flags;
+ unsigned long load_delta;
pgdval_t *pgd;
p4dval_t *p4d;
pudval_t *pud;
@@ -202,17 +202,14 @@ unsigned long __head __startup_64(unsigned long physaddr,

/* Fixup the physical addresses in the page table */

- pgd = fixup_pointer(early_top_pgt, physaddr);
- p = pgd + pgd_index(__START_KERNEL_map);
- if (la57)
- *p = (unsigned long)level4_kernel_pgt;
- else
- *p = (unsigned long)level3_kernel_pgt;
- *p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
+ pgd = &RIP_REL_REF(early_top_pgt)->pgd;
+ pgd[pgd_index(__START_KERNEL_map)] += load_delta;

if (la57) {
- p4d = fixup_pointer(level4_kernel_pgt, physaddr);
- p4d[511] += load_delta;
+ p4d = (p4dval_t *)&RIP_REL_REF(level4_kernel_pgt);
+ p4d[MAX_PTRS_PER_P4D - 1] += load_delta;
+
+ pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)p4d | _PAGE_TABLE_NOENC;
}

RIP_REL_REF(level3_kernel_pgt)[PTRS_PER_PUD - 2].pud += load_delta;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 3cac98c61066..fb2a98c29094 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -653,7 +653,8 @@ SYM_CODE_END(vc_no_ghcb)
.balign 4

SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
- .fill 512,8,0
+ .fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
.fill PTI_USER_PGD_FILL,8,0
SYM_DATA_END(early_top_pgt)

--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:38:08

by Ard Biesheuvel

Subject: [PATCH v5 07/16] x86/startup_64: Simplify CR4 handling in startup code

From: Ard Biesheuvel <[email protected]>

When paging is enabled, the CR4.PAE and CR4.LA57 control bits cannot be
changed, and so they can simply be preserved rather than reasoning about
whether or not they need to be set. CR4.MCE should be preserved unless
the kernel was built without CONFIG_X86_MCE, in which case it must be
cleared.

CR4.PSE should be set explicitly, regardless of whether or not it was
set before.

CR4.PGE is set explicitly, and then cleared and set again after
programming CR3 in order to flush TLB entries based on global
translations. This makes the first assignment redundant, and it can
therefore be omitted.
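
Expressed in C for clarity (an illustrative sketch only; the change
below is in assembly, and the generic CR4 accessors are used here merely
as stand-ins):

	unsigned long cr4 = native_read_cr4();
	unsigned long mask = X86_CR4_PAE | X86_CR4_LA57;	/* always preserved */

	if (IS_ENABLED(CONFIG_X86_MCE))
		mask |= X86_CR4_MCE;			/* preserve the #MC setting */

	cr4 &= mask;
	cr4 |= X86_CR4_PSE;				/* set PSE unconditionally */
	native_write_cr4(cr4);

	/* ... CR3 is programmed here ... */

	cr4 |= X86_CR4_PGE;	/* setting PGE after the CR3 switch flushes global TLB entries */
	native_write_cr4(cr4);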

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 24 +++++++-------------
1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index fb2a98c29094..426f6fdc0075 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -185,6 +185,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

+ /* Create a mask of CR4 bits to preserve */
+ movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
#ifdef CONFIG_X86_MCE
/*
* Preserve CR4.MCE if the kernel will enable #MC support.
@@ -193,20 +195,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
* configured will crash the system regardless of the CR4.MCE value set
* here.
*/
- movq %cr4, %rcx
- andl $X86_CR4_MCE, %ecx
-#else
- movl $0, %ecx
+ orl $X86_CR4_MCE, %edx
#endif
+ movq %cr4, %rcx
+ andl %edx, %ecx

- /* Enable PAE mode, PSE, PGE and LA57 */
- orl $(X86_CR4_PAE | X86_CR4_PSE | X86_CR4_PGE), %ecx
-#ifdef CONFIG_X86_5LEVEL
- testb $1, __pgtable_l5_enabled(%rip)
- jz 1f
- orl $X86_CR4_LA57, %ecx
-1:
-#endif
+ /* Even if ignored in long mode, set PSE uniformly on all logical CPUs. */
+ btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

/* Setup early boot stage 4-/5-level pagetables. */
@@ -226,11 +221,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
* Do a global TLB flush after the CR3 switch to make sure the TLB
* entries from the identity mapping are flushed.
*/
- movq %cr4, %rcx
- movq %rcx, %rax
- xorq $X86_CR4_PGE, %rcx
+ btsl $X86_CR4_PGE_BIT, %ecx
movq %rcx, %cr4
- movq %rax, %cr4

/* Ensure I am executing from virtual addresses */
movq $1f, %rax
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:38:27

by Ard Biesheuvel

Subject: [PATCH v5 08/16] x86/startup_64: Defer assignment of 5-level paging global variables

From: Ard Biesheuvel <[email protected]>

Assigning the 5-level paging related global variables from the earliest
C code using explicit references that use the 1:1 translation of memory
is unnecessary, as the startup code itself does not rely on them to
create the initial page tables, and this is all it should be doing. So
defer these assignments to the primary C entry code that executes via
the ordinary kernel virtual mapping.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/include/asm/pgtable_64_types.h | 2 +-
arch/x86/kernel/head64.c | 44 +++++++-------------
2 files changed, 15 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 38b54b992f32..9053dfe9fa03 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -21,9 +21,9 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;

-#ifdef CONFIG_X86_5LEVEL
extern unsigned int __pgtable_l5_enabled;

+#ifdef CONFIG_X86_5LEVEL
#ifdef USE_EARLY_PGTABLE_L5
/*
* cpu_feature_enabled() is not available in early boot code.
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 72351c3121a6..deaaea3280d9 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -23,6 +23,7 @@
#include <linux/pgtable.h>

#include <asm/asm.h>
+#include <asm/page_64.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
@@ -77,24 +78,11 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-#ifdef CONFIG_X86_5LEVEL
-static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
-{
- return ptr - (void *)_text + (void *)physaddr;
-}
-
-static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
+static inline bool check_la57_support(void)
{
- return fixup_pointer(ptr, physaddr);
-}
-
-static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
-{
- return fixup_pointer(ptr, physaddr);
-}
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return false;

-static bool __head check_la57_support(unsigned long physaddr)
-{
/*
* 5-level paging is detected and enabled at kernel decompression
* stage. Only check if it has been enabled there.
@@ -102,21 +90,8 @@ static bool __head check_la57_support(unsigned long physaddr)
if (!(native_read_cr4() & X86_CR4_LA57))
return false;

- *fixup_int(&__pgtable_l5_enabled, physaddr) = 1;
- *fixup_int(&pgdir_shift, physaddr) = 48;
- *fixup_int(&ptrs_per_p4d, physaddr) = 512;
- *fixup_long(&page_offset_base, physaddr) = __PAGE_OFFSET_BASE_L5;
- *fixup_long(&vmalloc_base, physaddr) = __VMALLOC_BASE_L5;
- *fixup_long(&vmemmap_base, physaddr) = __VMEMMAP_BASE_L5;
-
return true;
}
-#else
-static bool __head check_la57_support(unsigned long physaddr)
-{
- return false;
-}
-#endif

static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
{
@@ -180,7 +155,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
bool la57;
int i;

- la57 = check_la57_support(physaddr);
+ la57 = check_la57_support();

/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
@@ -465,6 +440,15 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

+ if (check_la57_support()) {
+ __pgtable_l5_enabled = 1;
+ pgdir_shift = 48;
+ ptrs_per_p4d = 512;
+ page_offset_base = __PAGE_OFFSET_BASE_L5;
+ vmalloc_base = __VMALLOC_BASE_L5;
+ vmemmap_base = __VMEMMAP_BASE_L5;
+ }
+
cr4_init_shadow();

/* Kill off the identity-map trampoline */
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:38:35

by Ard Biesheuvel

Subject: [PATCH v5 09/16] x86/startup_64: Simplify calculation of initial page table address

From: Ard Biesheuvel <[email protected]>

Determining the address of the initial page table to program into CR3
involves:
- taking the physical address
- adding the SME encryption mask

On the primary entry path, the code is mapped using a 1:1 virtual to
physical translation, so the physical address can be taken directly
using a RIP-relative LEA instruction.

On the secondary entry path, the address can be obtained by taking the
offset from the virtual kernel base (__START_KERNEL_map) and adding the
physical kernel base.

This is implemented in a slightly confusing way, so clean this up.
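
A C-style sketch of the intended calculation (purely illustrative; the
change below is in assembly, and sme_get_me_mask() merely stands in for
the SME modifier handling on each path):

	/* Primary boot: running from the 1:1 mapping, so a RIP-relative
	 * reference to early_top_pgt already yields its physical address. */
	unsigned long cr3_primary = (unsigned long)&RIP_REL_REF(early_top_pgt) +
				    sme_get_me_mask();

	/* Secondary boot: link-time offset from __START_KERNEL_map plus the
	 * runtime physical base of the kernel image. */
	unsigned long cr3_secondary = phys_base +
				      ((unsigned long)init_top_pgt - __START_KERNEL_map) +
				      sme_get_me_mask();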

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 25 ++++++--------------
1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 426f6fdc0075..b92031d7e006 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -111,13 +111,11 @@ SYM_CODE_START_NOALIGN(startup_64)
call __startup_64

/* Form the CR3 value being sure to include the CR3 modifier */
- addq $(early_top_pgt - __START_KERNEL_map), %rax
+ leaq early_top_pgt(%rip), %rcx
+ addq %rcx, %rax

#ifdef CONFIG_AMD_MEM_ENCRYPT
mov %rax, %rdi
- mov %rax, %r14
-
- addq phys_base(%rip), %rdi

/*
* For SEV guests: Verify that the C-bit is correct. A malicious
@@ -126,12 +124,6 @@ SYM_CODE_START_NOALIGN(startup_64)
* the next RET instruction.
*/
call sev_verify_cbit
-
- /*
- * Restore CR3 value without the phys_base which will be added
- * below, before writing %cr3.
- */
- mov %r14, %rax
#endif

jmp 1f
@@ -171,18 +163,18 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Clear %R15 which holds the boot_params pointer on the boot CPU */
xorq %r15, %r15

+ /* Derive the runtime physical address of init_top_pgt[] */
+ movq phys_base(%rip), %rax
+ addq $(init_top_pgt - __START_KERNEL_map), %rax
+
/*
* Retrieve the modifier (SME encryption mask if SME is active) to be
* added to the initial pgdir entry that will be programmed into CR3.
*/
#ifdef CONFIG_AMD_MEM_ENCRYPT
- movq sme_me_mask, %rax
-#else
- xorq %rax, %rax
+ addq sme_me_mask(%rip), %rax
#endif

- /* Form the CR3 value being sure to include the CR3 modifier */
- addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

/* Create a mask of CR4 bits to preserve */
@@ -204,9 +196,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

- /* Setup early boot stage 4-/5-level pagetables. */
- addq phys_base(%rip), %rax
-
/*
* Switch to new page-table
*
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:39:06

by Ard Biesheuvel

Subject: [PATCH v5 11/16] x86/sme: Avoid SME/SEV related checks on non-SME/SEV platforms

From: Ard Biesheuvel <[email protected]>

Reorganize the early SME/SEV init code so that SME/SEV related calls are
deferred until it has been determined that the platform actually
supports this, and so those calls can actually make sense.

This removes logic from the early boot path that executes from the 1:1
mapping when booting a CONFIG_AMD_MEM_ENCRYPT=y kernel on a system that
does not implement it (i.e., 99% of distro kernels).

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/include/asm/mem_encrypt.h | 4 ++--
arch/x86/kernel/head64.c | 6 +++---
arch/x86/mm/mem_encrypt_identity.c | 8 +++-----
3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index b31eb9fd5954..b1437ba0b3b8 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -48,7 +48,7 @@ void __init sme_unmap_bootdata(char *real_mode_data);
void __init sme_early_init(void);

void __init sme_encrypt_kernel(struct boot_params *bp);
-void __init sme_enable(struct boot_params *bp);
+void sme_enable(struct boot_params *bp);

int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size);
int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
@@ -82,7 +82,7 @@ static inline void __init sme_unmap_bootdata(char *real_mode_data) { }
static inline void __init sme_early_init(void) { }

static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
-static inline void __init sme_enable(struct boot_params *bp) { }
+static inline void sme_enable(struct boot_params *bp) { }

static inline void sev_es_init_vc_handling(void) { }

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 0b827cbf6ee4..b33f47489505 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -98,9 +98,6 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
unsigned long vaddr, vaddr_end;
int i;

- /* Encrypt the kernel and related (if SME is active) */
- sme_encrypt_kernel(bp);
-
/*
* Clear the memory encryption mask from the .bss..decrypted section.
* The bss section will be memset to zero later in the initialization so
@@ -108,6 +105,9 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
* attribute.
*/
if (sme_get_me_mask()) {
+ /* Encrypt the kernel and related */
+ sme_encrypt_kernel(bp);
+
vaddr = (unsigned long)__start_bss_decrypted;
vaddr_end = (unsigned long)__end_bss_decrypted;

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 0166ab1780cc..7ddcf960e92a 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -45,6 +45,7 @@
#include <asm/sections.h>
#include <asm/cmdline.h>
#include <asm/coco.h>
+#include <asm/init.h>
#include <asm/sev.h>

#include "mm_internal.h"
@@ -502,18 +503,15 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
native_write_cr3(__native_read_cr3());
}

-void __init sme_enable(struct boot_params *bp)
+void __head sme_enable(struct boot_params *bp)
{
const char *cmdline_ptr, *cmdline_arg, *cmdline_on;
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;
unsigned long me_mask;
char buffer[16];
- bool snp;
u64 msr;

- snp = snp_init(bp);
-
/* Check for the SME/SEV support leaf */
eax = 0x80000000;
ecx = 0;
@@ -546,7 +544,7 @@ void __init sme_enable(struct boot_params *bp)
feature_mask = (msr & MSR_AMD64_SEV_ENABLED) ? AMD_SEV_BIT : AMD_SME_BIT;

/* The SEV-SNP CC blob should never be present unless SEV-SNP is enabled. */
- if (snp && !(msr & MSR_AMD64_SEV_SNP_ENABLED))
+ if (snp_init(bp) && !(msr & MSR_AMD64_SEV_SNP_ENABLED))
snp_abort();

/* Check if memory encryption is enabled */
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:39:39

by Ard Biesheuvel

Subject: [PATCH v5 13/16] x86/boot: Move mem_encrypt= parsing to the decompressor

From: Ard Biesheuvel <[email protected]>

The early SME/SEV code parses the command line very early, in order to
decide whether or not memory encryption should be enabled, which needs
to occur even before the initial page tables are created.

This is problematic for a number of reasons:
- this early code runs from the 1:1 mapping provided by the decompressor
or firmware, which uses a different translation than the one assumed by
the linker, and so the code needs to be built in a special way;
- parsing external input while the entire kernel image is still mapped
writable is a bad idea in general, and really does not belong in
security-minded code;
- the current code ignores the built-in command line entirely (although
this appears to be the case for the entire decompressor)

Given that the decompressor/EFI stub is an intrinsic part of the x86
bootable kernel image, move the command line parsing there and out of
the core kernel. This removes the need to build lib/cmdline.o in a
special way, or to use RIP-relative LEA instructions in inline asm
blocks.

This involves a new xloadflag in the setup header to indicate
that mem_encrypt=on appeared on the kernel command line.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/misc.c | 15 +++++++++
arch/x86/include/uapi/asm/bootparam.h | 1 +
arch/x86/lib/Makefile | 13 --------
arch/x86/mm/mem_encrypt_identity.c | 32 ++------------------
drivers/firmware/efi/libstub/x86-stub.c | 3 ++
5 files changed, 22 insertions(+), 42 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index b99e08e6815b..6c5c190a4d86 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -357,6 +357,19 @@ unsigned long decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
return entry;
}

+/*
+ * Set the memory encryption xloadflag based on the mem_encrypt= command line
+ * parameter, if provided.
+ */
+static void parse_mem_encrypt(struct setup_header *hdr)
+{
+ int on = cmdline_find_option_bool("mem_encrypt=on");
+ int off = cmdline_find_option_bool("mem_encrypt=off");
+
+ if (on > off)
+ hdr->xloadflags |= XLF_MEM_ENCRYPTION;
+}
+
/*
* The compressed kernel image (ZO), has been moved so that its position
* is against the end of the buffer used to hold the uncompressed kernel
@@ -387,6 +400,8 @@ asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
/* Clear flags intended for solely in-kernel use. */
boot_params_ptr->hdr.loadflags &= ~KASLR_FLAG;

+ parse_mem_encrypt(&boot_params_ptr->hdr);
+
sanitize_boot_params(boot_params_ptr);

if (boot_params_ptr->screen_info.orig_video_mode == 7) {
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..eeea058cf602 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -38,6 +38,7 @@
#define XLF_EFI_KEXEC (1<<4)
#define XLF_5LEVEL (1<<5)
#define XLF_5LEVEL_ENABLED (1<<6)
+#define XLF_MEM_ENCRYPTION (1<<7)

#ifndef __ASSEMBLY__

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index ea3a28e7b613..f0dae4fb6d07 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -14,19 +14,6 @@ ifdef CONFIG_KCSAN
CFLAGS_REMOVE_delay.o = $(CC_FLAGS_FTRACE)
endif

-# Early boot use of cmdline; don't instrument it
-ifdef CONFIG_AMD_MEM_ENCRYPT
-KCOV_INSTRUMENT_cmdline.o := n
-KASAN_SANITIZE_cmdline.o := n
-KCSAN_SANITIZE_cmdline.o := n
-
-ifdef CONFIG_FUNCTION_TRACER
-CFLAGS_REMOVE_cmdline.o = -pg
-endif
-
-CFLAGS_cmdline.o := -fno-stack-protector -fno-jump-tables
-endif
-
inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
quiet_cmd_inat_tables = GEN $@
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 7ddcf960e92a..0180fbbcc940 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -43,7 +43,6 @@

#include <asm/setup.h>
#include <asm/sections.h>
-#include <asm/cmdline.h>
#include <asm/coco.h>
#include <asm/init.h>
#include <asm/sev.h>
@@ -96,9 +95,6 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");

-static char sme_cmdline_arg[] __initdata = "mem_encrypt";
-static char sme_cmdline_on[] __initdata = "on";
-
static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
@@ -505,11 +501,9 @@ void __init sme_encrypt_kernel(struct boot_params *bp)

void __head sme_enable(struct boot_params *bp)
{
- const char *cmdline_ptr, *cmdline_arg, *cmdline_on;
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;
unsigned long me_mask;
- char buffer[16];
u64 msr;

/* Check for the SME/SEV support leaf */
@@ -549,6 +543,9 @@ void __head sme_enable(struct boot_params *bp)

/* Check if memory encryption is enabled */
if (feature_mask == AMD_SME_BIT) {
+ if (!(bp->hdr.xloadflags & XLF_MEM_ENCRYPTION))
+ return;
+
/*
* No SME if Hypervisor bit is set. This check is here to
* prevent a guest from trying to enable SME. For running as a
@@ -568,31 +565,8 @@ void __head sme_enable(struct boot_params *bp)
msr = __rdmsr(MSR_AMD64_SYSCFG);
if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
return;
- } else {
- /* SEV state cannot be controlled by a command line option */
- goto out;
}

- /*
- * Fixups have not been applied to phys_base yet and we're running
- * identity mapped, so we must obtain the address to the SME command
- * line argument data using rip-relative addressing.
- */
- asm ("lea sme_cmdline_arg(%%rip), %0"
- : "=r" (cmdline_arg)
- : "p" (sme_cmdline_arg));
- asm ("lea sme_cmdline_on(%%rip), %0"
- : "=r" (cmdline_on)
- : "p" (sme_cmdline_on));
-
- cmdline_ptr = (const char *)((u64)bp->hdr.cmd_line_ptr |
- ((u64)bp->ext_cmd_line_ptr << 32));
-
- if (cmdline_find_option(cmdline_ptr, cmdline_arg, buffer, sizeof(buffer)) < 0 ||
- strncmp(buffer, cmdline_on, sizeof(buffer)))
- return;
-
-out:
RIP_REL_REF(sme_me_mask) = me_mask;
physical_mask &= ~me_mask;
cc_vendor = CC_VENDOR_AMD;
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 0d510c9a06a4..9a25ec16b344 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -879,6 +879,9 @@ void __noreturn efi_stub_entry(efi_handle_t handle,
}
}

+ if (efi_mem_encrypt > 0)
+ hdr->xloadflags |= XLF_MEM_ENCRYPTION;
+
status = efi_decompress_kernel(&kernel_entry);
if (status != EFI_SUCCESS) {
efi_err("Failed to decompress kernel\n");
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:39:53

by Ard Biesheuvel

Subject: [PATCH v5 14/16] x86/sme: Move early SME kernel encryption handling into .head.text

From: Ard Biesheuvel <[email protected]>

The .head.text section is the initial primary entrypoint of the core
kernel, and is entered with the CPU executing from a 1:1 mapping of
memory. Such code must never access global variables using absolute
references, as these are based on the kernel virtual mapping which is
not active yet at this point.

Given that the SME startup code is also called from this early execution
context, move it into .head.text as well. This will allow more thorough
build time checks in the future to ensure that early startup code only
uses RIP-relative references to global variables.

Also replace some occurrences of __pa_symbol() [which relies on the
compiler generating an absolute reference, which is not guaranteed] and
an open coded RIP-relative access with RIP_REL_REF().
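
For context, the __head annotation used throughout the diff below is
simply a section attribute (as defined in arch/x86/include/asm/init.h):

#define __head	__section(".head.text")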

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/include/asm/mem_encrypt.h | 4 +-
arch/x86/mm/mem_encrypt_identity.c | 40 ++++++++------------
2 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index b1437ba0b3b8..f922b682b9b4 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -47,7 +47,7 @@ void __init sme_unmap_bootdata(char *real_mode_data);

void __init sme_early_init(void);

-void __init sme_encrypt_kernel(struct boot_params *bp);
+void sme_encrypt_kernel(struct boot_params *bp);
void sme_enable(struct boot_params *bp);

int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size);
@@ -81,7 +81,7 @@ static inline void __init sme_unmap_bootdata(char *real_mode_data) { }

static inline void __init sme_early_init(void) { }

-static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
+static inline void sme_encrypt_kernel(struct boot_params *bp) { }
static inline void sme_enable(struct boot_params *bp) { }

static inline void sev_es_init_vc_handling(void) { }
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 0180fbbcc940..174a7192c9cb 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -41,6 +41,7 @@
#include <linux/mem_encrypt.h>
#include <linux/cc_platform.h>

+#include <asm/init.h>
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/coco.h>
@@ -95,7 +96,7 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");

-static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
+static void __head sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
pgd_t *pgd_p;
@@ -110,7 +111,7 @@ static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
memset(pgd_p, 0, pgd_size);
}

-static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
+static pud_t __head *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -147,7 +148,7 @@ static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
return pud;
}

-static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
+static void __head sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -163,7 +164,7 @@ static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
set_pmd(pmd, __pmd(ppd->paddr | ppd->pmd_flags));
}

-static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
+static void __head sme_populate_pgd(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -189,7 +190,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
}

-static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
+static void __head __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd_large(ppd);
@@ -199,7 +200,7 @@ static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
}
}

-static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
+static void __head __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd(ppd);
@@ -209,7 +210,7 @@ static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
}
}

-static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
+static void __head __sme_map_range(struct sme_populate_pgd_data *ppd,
pmdval_t pmd_flags, pteval_t pte_flags)
{
unsigned long vaddr_end;
@@ -233,22 +234,22 @@ static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
__sme_map_range_pte(ppd);
}

-static void __init sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_ENC, PTE_FLAGS_ENC);
}

-static void __init sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC, PTE_FLAGS_DEC);
}

-static void __init sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC_WP, PTE_FLAGS_DEC_WP);
}

-static unsigned long __init sme_pgtable_calc(unsigned long len)
+static unsigned long __head sme_pgtable_calc(unsigned long len)
{
unsigned long entries = 0, tables = 0;

@@ -285,7 +286,7 @@ static unsigned long __init sme_pgtable_calc(unsigned long len)
return entries + tables;
}

-void __init sme_encrypt_kernel(struct boot_params *bp)
+void __head sme_encrypt_kernel(struct boot_params *bp)
{
unsigned long workarea_start, workarea_end, workarea_len;
unsigned long execute_start, execute_end, execute_len;
@@ -320,9 +321,8 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
* memory from being cached.
*/

- /* Physical addresses gives us the identity mapped virtual addresses */
- kernel_start = __pa_symbol(_text);
- kernel_end = ALIGN(__pa_symbol(_end), PMD_SIZE);
+ kernel_start = (unsigned long)RIP_REL_REF(_text);
+ kernel_end = ALIGN((unsigned long)RIP_REL_REF(_end), PMD_SIZE);
kernel_len = kernel_end - kernel_start;

initrd_start = 0;
@@ -339,14 +339,6 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
}
#endif

- /*
- * We're running identity mapped, so we must obtain the address to the
- * SME encryption workarea using rip-relative addressing.
- */
- asm ("lea sme_workarea(%%rip), %0"
- : "=r" (workarea_start)
- : "p" (sme_workarea));
-
/*
* Calculate required number of workarea bytes needed:
* executable encryption area size:
@@ -356,7 +348,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
* pagetable structures for the encryption of the kernel
* pagetable structures for workarea (in case not currently mapped)
*/
- execute_start = workarea_start;
+ execute_start = workarea_start = (unsigned long)RIP_REL_REF(sme_workarea);
execute_end = execute_start + (PAGE_SIZE * 2) + PMD_SIZE;
execute_len = execute_end - execute_start;

--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:39:55

by Ard Biesheuvel

Subject: [PATCH v5 10/16] x86/startup_64: Simplify virtual switch on primary boot

From: Ard Biesheuvel <[email protected]>

The secondary startup code is used on the primary boot path as well, but
in this case, the initial part runs from a 1:1 mapping, until an
explicit cross-jump is made to the kernel virtual mapping of the same
code.

On the secondary boot path, this jump is pointless as the code already
executes from the mapping targeted by the jump. So combine this
cross-jump with the jump from startup_64() into the common boot path.
This simplifies the execution flow, and clearly separates code that runs
from a 1:1 mapping from code that runs from the kernel virtual mapping.

Note that this requires a page table switch, so hoist the CR3 assignment
into startup_64() as well. And since absolute symbol references will no
longer be permitted in .head.text once we enable the associated build
time checks, a RIP-relative memory operand is used in the JMP
instruction, referring to an absolute constant in the .init.rodata
section.

Given that the secondary startup code does not require a special
placement inside the executable, move it to the .noinstr.text section.
This requires the use of a subsection so that the payload is placed
after the page aligned Xen hypercall page, as otherwise, objtool will
complain about the resulting JMP instruction emitted by the assembler
being unreachable.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head64.c | 2 +-
arch/x86/kernel/head_64.S | 43 ++++++++++----------
2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index deaaea3280d9..0b827cbf6ee4 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -553,7 +553,7 @@ static void __head startup_64_load_idt(void *vc_handler)
}

/* This is used when running on kernel addresses */
-void early_setup_idt(void)
+void noinstr early_setup_idt(void)
{
void *handler = NULL;

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index b92031d7e006..03268bf0214a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -39,7 +39,6 @@ L4_START_KERNEL = l4_index(__START_KERNEL_map)

L3_START_KERNEL = pud_index(__START_KERNEL_map)

- .text
__HEAD
.code64
SYM_CODE_START_NOALIGN(startup_64)
@@ -126,9 +125,22 @@ SYM_CODE_START_NOALIGN(startup_64)
call sev_verify_cbit
#endif

- jmp 1f
+ /*
+ * Switch to early_top_pgt which still has the identity mappings
+ * present.
+ */
+ movq %rax, %cr3
+
+ /* Branch to the common startup code at its kernel virtual address */
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *0f(%rip)
SYM_CODE_END(startup_64)

+ __INITRODATA
+0: .quad common_startup_64
+
+ .section .noinstr.text, "ax"
+ .subsection 1
SYM_CODE_START(secondary_startup_64)
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR
@@ -174,8 +186,15 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
#ifdef CONFIG_AMD_MEM_ENCRYPT
addq sme_me_mask(%rip), %rax
#endif
+ /*
+ * Switch to the init_top_pgt here, away from the trampoline_pgd and
+ * unmap the identity mapped ranges.
+ */
+ movq %rax, %cr3

-1:
+SYM_INNER_LABEL(common_startup_64, SYM_L_LOCAL)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR

/* Create a mask of CR4 bits to preserve */
movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
@@ -196,16 +215,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

- /*
- * Switch to new page-table
- *
- * For the boot CPU this switches to early_top_pgt which still has the
- * identity mappings present. The secondary CPUs will switch to the
- * init_top_pgt here, away from the trampoline_pgd and unmap the
- * identity mapped ranges.
- */
- movq %rax, %cr3
-
/*
* Do a global TLB flush after the CR3 switch to make sure the TLB
* entries from the identity mapping are flushed.
@@ -213,14 +222,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PGE_BIT, %ecx
movq %rcx, %cr4

- /* Ensure I am executing from virtual addresses */
- movq $1f, %rax
- ANNOTATE_RETPOLINE_SAFE
- jmp *%rax
-1:
- UNWIND_HINT_END_OF_STACK
- ANNOTATE_NOENDBR // above
-
#ifdef CONFIG_SMP
/*
* For parallel boot, the APIC ID is read from the APIC, and then
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:40:11

by Ard Biesheuvel

Subject: [PATCH v5 15/16] x86/sev: Move early startup code into .head.text section

From: Ard Biesheuvel <[email protected]>

In preparation for implementing rigorous build time checks to enforce
that only code that can support it will be called from the early 1:1
mapping of memory, move SEV init code that is called in this manner to
the .head.text section.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/sev.c | 3 +++
arch/x86/include/asm/sev.h | 10 ++++-----
arch/x86/kernel/sev-shared.c | 23 +++++++++-----------
arch/x86/kernel/sev.c | 14 +++++++-----
4 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 073291832f44..bea0719d70f2 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -116,6 +116,9 @@ static bool fault_in_kernel_space(unsigned long address)
#undef __init
#define __init

+#undef __head
+#define __head
+
#define __BOOT_COMPRESSED

/* Basic instruction decoding support needed */
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index bed95e1f4d52..cf671138feef 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -213,16 +213,16 @@ static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate)
struct snp_guest_request_ioctl;

void setup_ghcb(void);
-void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
- unsigned long npages);
-void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
- unsigned long npages);
+void early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages);
+void early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages);
void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op);
void snp_set_memory_shared(unsigned long vaddr, unsigned long npages);
void snp_set_memory_private(unsigned long vaddr, unsigned long npages);
void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
-void __init __noreturn snp_abort(void);
+void __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 snp_get_unsupported_features(u64 status);
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index ae79f9505298..0bd7ccbe8732 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -93,7 +93,8 @@ static bool __init sev_es_check_cpu_features(void)
return true;
}

-static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
+static void __head __noreturn
+sev_es_terminate(unsigned int set, unsigned int reason)
{
u64 val = GHCB_MSR_TERM_REQ;

@@ -330,13 +331,7 @@ static int sev_cpuid_hv(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid
*/
static const struct snp_cpuid_table *snp_cpuid_get_table(void)
{
- void *ptr;
-
- asm ("lea cpuid_table_copy(%%rip), %0"
- : "=r" (ptr)
- : "p" (&cpuid_table_copy));
-
- return ptr;
+ return &RIP_REL_REF(cpuid_table_copy);
}

/*
@@ -395,7 +390,7 @@ static u32 snp_cpuid_calc_xsave_size(u64 xfeatures_en, bool compacted)
return xsave_size;
}

-static bool
+static bool __head
snp_cpuid_get_validated_func(struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -532,7 +527,8 @@ static int snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
* Returns -EOPNOTSUPP if feature not enabled. Any other non-zero return value
* should be treated as fatal by caller.
*/
-static int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
+static int __head
+snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();

@@ -574,7 +570,7 @@ static int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_le
* page yet, so it only supports the MSR based communication with the
* hypervisor and only the CPUID exit-code.
*/
-void __init do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
+void __head do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
{
unsigned int subfn = lower_bits(regs->cx, 32);
unsigned int fn = lower_bits(regs->ax, 32);
@@ -1025,7 +1021,8 @@ struct cc_setup_data {
* Search for a Confidential Computing blob passed in as a setup_data entry
* via the Linux Boot Protocol.
*/
-static struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
+static __head
+struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
{
struct cc_setup_data *sd = NULL;
struct setup_data *hdr;
@@ -1052,7 +1049,7 @@ static struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
* mapping needs to be updated in sync with all the changes to virtual memory
* layout and related mapping facilities throughout the boot process.
*/
-static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
+static void __head setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
{
const struct snp_cpuid_table *cpuid_table_fw, *cpuid_table;
int i;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1ef7ae806a01..33c14aa1f06c 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -25,6 +25,7 @@
#include <linux/psp-sev.h>
#include <uapi/linux/sev-guest.h>

+#include <asm/init.h>
#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
#include <asm/sev.h>
@@ -682,8 +683,9 @@ static u64 __init get_jump_table_addr(void)
return ret;
}

-static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
- unsigned long npages, enum psc_op op)
+static void __head
+early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -739,7 +741,7 @@ static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
}

-void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+void __head early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
unsigned long npages)
{
/*
@@ -2062,7 +2064,7 @@ bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
*
* Scan for the blob in that order.
*/
-static __init struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
+static __head struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;

@@ -2088,7 +2090,7 @@ static __init struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
return cc_info;
}

-bool __init snp_init(struct boot_params *bp)
+bool __head snp_init(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;

@@ -2110,7 +2112,7 @@ bool __init snp_init(struct boot_params *bp)
return true;
}

-void __init __noreturn snp_abort(void)
+void __head __noreturn snp_abort(void)
{
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
}
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:40:20

by Ard Biesheuvel

[permalink] [raw]
Subject: [PATCH v5 16/16] x86/startup_64: Drop global variables keeping track of LA57 state

From: Ard Biesheuvel <[email protected]>

On x86_64, the core kernel is entered in long mode, which implies that
paging is enabled. This means that the CR4.LA57 control bit is
guaranteed to be in sync with the number of paging levels used by the
kernel, and there is no need to store this in a variable.

There is also no need to use variables for storing the calculations of
pgdir_shift and ptrs_per_p4d, as they are easily determined on the fly.

This removes the need for two different sources of truth for determining
whether 5-level paging is in use: CR4.LA57 always reflects the actual
state, and never changes from the point of view of the 64-bit core
kernel. The only potential concern is the cost of CR4 accesses, which
can be mitigated using alternatives patching based on feature detection.

Note that even the decompressor does not manipulate any page tables
before updating CR4.LA57, so it can also avoid the associated global
variables entirely. However, as it does not implement alternatives
patching, the associated ELF sections need to be discarded.
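
(Conceptually, the new pgtable_l5_enabled() just reads CR4.LA57; a plain
equivalent without alternatives patching would look roughly like the
sketch below. The actual implementation in the hunk further down uses
ALTERNATIVE_TERNARY so the CR4 read is patched away once feature
detection has run. The helper name here is illustrative only.)

/* Illustrative sketch only -- not the code added by this patch */
static inline bool pgtable_l5_enabled_plain(void)
{
	if (!IS_ENABLED(CONFIG_X86_5LEVEL))
		return false;

	return native_read_cr4() & X86_CR4_LA57;
}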

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/misc.h | 4 --
arch/x86/boot/compressed/pgtable_64.c | 12 ------
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/include/asm/pgtable_64_types.h | 43 ++++++++++----------
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/head64.c | 33 +--------------
arch/x86/mm/kasan_init_64.c | 3 --
arch/x86/mm/mem_encrypt_identity.c | 9 ----
8 files changed, 25 insertions(+), 82 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index bc2f0f17fb90..2b15ddd0e177 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,9 +16,6 @@

#define __NO_FORTIFY

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
/*
* Boot stub deals with identity mappings, physical and virtual addresses are
* the same, so override these defines.
@@ -178,7 +175,6 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#endif

/* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
extern void kernel_add_identity_map(unsigned long start, unsigned long end);

/* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 51f957b24ba7..ae72f53f5e77 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -9,13 +9,6 @@
#define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */
#define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */

-#ifdef CONFIG_X86_5LEVEL
-/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
-unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") pgdir_shift = 39;
-unsigned int __section(".data") ptrs_per_p4d = 1;
-#endif
-
/* Buffer to preserve trampoline memory */
static char trampoline_save[TRAMPOLINE_32BIT_SIZE];

@@ -125,11 +118,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
native_cpuid_eax(0) >= 7 &&
(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
l5_required = true;
-
- /* Initialize variables for 5-level paging */
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
}

/*
diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
index 083ec6d7722a..06358bb067fe 100644
--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -81,6 +81,7 @@ SECTIONS
*(.dynamic) *(.dynsym) *(.dynstr) *(.dynbss)
*(.hash) *(.gnu.hash)
*(.note.*)
+ *(.altinstructions .altinstr_replacement)
}

.got.plt (INFO) : {
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 9053dfe9fa03..2fac8ba9564a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -6,7 +6,10 @@

#ifndef __ASSEMBLY__
#include <linux/types.h>
+#include <asm/alternative.h>
+#include <asm/cpufeatures.h>
#include <asm/kaslr.h>
+#include <asm/processor-flags.h>

/*
* These are used to make use of C type-checking..
@@ -21,28 +24,24 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;

-extern unsigned int __pgtable_l5_enabled;
-
-#ifdef CONFIG_X86_5LEVEL
-#ifdef USE_EARLY_PGTABLE_L5
-/*
- * cpu_feature_enabled() is not available in early boot code.
- * Use variable instead.
- */
-static inline bool pgtable_l5_enabled(void)
+static __always_inline __pure bool pgtable_l5_enabled(void)
{
- return __pgtable_l5_enabled;
-}
-#else
-#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
-#endif /* USE_EARLY_PGTABLE_L5 */
+ unsigned long r;
+ bool ret;

-#else
-#define pgtable_l5_enabled() 0
-#endif /* CONFIG_X86_5LEVEL */
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return false;

-extern unsigned int pgdir_shift;
-extern unsigned int ptrs_per_p4d;
+ asm(ALTERNATIVE_TERNARY(
+ "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
+ %P[feat], "stc", "clc")
+ : [reg] "=&r" (r), CC_OUT(c) (ret)
+ : [feat] "i" (X86_FEATURE_LA57),
+ [la57] "i" (X86_CR4_LA57_BIT)
+ : "cc");
+
+ return ret;
+}

#endif /* !__ASSEMBLY__ */

@@ -53,7 +52,7 @@ extern unsigned int ptrs_per_p4d;
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
-#define PGDIR_SHIFT pgdir_shift
+#define PGDIR_SHIFT (pgtable_l5_enabled() ? 48 : 39)
#define PTRS_PER_PGD 512

/*
@@ -61,7 +60,7 @@ extern unsigned int ptrs_per_p4d;
*/
#define P4D_SHIFT 39
#define MAX_PTRS_PER_P4D 512
-#define PTRS_PER_P4D ptrs_per_p4d
+#define PTRS_PER_P4D (pgtable_l5_enabled() ? 512 : 1)
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE - 1))

@@ -76,6 +75,8 @@ extern unsigned int ptrs_per_p4d;
#define PTRS_PER_PGD 512
#define MAX_PTRS_PER_P4D 1

+#define MAX_POSSIBLE_PHYSMEM_BITS 46
+
#endif /* CONFIG_X86_5LEVEL */

/*
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9e35e276c55a..d88e4be88868 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,6 +1,4 @@
// SPDX-License-Identifier: GPL-2.0-only
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5

#include <linux/memblock.h>
#include <linux/linkage.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b33f47489505..348fd69252ac 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -7,9 +7,6 @@

#define DISABLE_BRANCH_PROFILING

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/init.h>
#include <linux/linkage.h>
#include <linux/types.h>
@@ -52,14 +49,6 @@ extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);

-#ifdef CONFIG_X86_5LEVEL
-unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int pgdir_shift __ro_after_init = 39;
-EXPORT_SYMBOL(pgdir_shift);
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-EXPORT_SYMBOL(ptrs_per_p4d);
-#endif
-
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
@@ -78,21 +67,6 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-static inline bool check_la57_support(void)
-{
- if (!IS_ENABLED(CONFIG_X86_5LEVEL))
- return false;
-
- /*
- * 5-level paging is detected and enabled at kernel decompression
- * stage. Only check if it has been enabled there.
- */
- if (!(native_read_cr4() & X86_CR4_LA57))
- return false;
-
- return true;
-}
-
static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
{
unsigned long vaddr, vaddr_end;
@@ -155,7 +129,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
bool la57;
int i;

- la57 = check_la57_support();
+ la57 = pgtable_l5_enabled();

/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
@@ -440,10 +414,7 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

- if (check_la57_support()) {
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
+ if (pgtable_l5_enabled()) {
page_offset_base = __PAGE_OFFSET_BASE_L5;
vmalloc_base = __VMALLOC_BASE_L5;
vmemmap_base = __VMEMMAP_BASE_L5;
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0302491d799d..85ae1ef840cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -2,9 +2,6 @@
#define DISABLE_BRANCH_PROFILING
#define pr_fmt(fmt) "kasan: " fmt

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/memblock.h>
#include <linux/kasan.h>
#include <linux/kdebug.h>
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 174a7192c9cb..3da15e9a8c7d 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -27,15 +27,6 @@
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS

-/*
- * This code runs before CPU feature bits are set. By default, the
- * pgtable_l5_enabled() function uses bit X86_FEATURE_LA57 to determine if
- * 5-level paging is active, so that won't work here. USE_EARLY_PGTABLE_L5
- * is provided to handle this situation and, instead, use a variable that
- * has been set by the early boot code.
- */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mem_encrypt.h>
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 11:40:23

by Ard Biesheuvel

[permalink] [raw]
Subject: [PATCH v5 12/16] efi/libstub: Add generic support for parsing mem_encrypt=

From: Ard Biesheuvel <[email protected]>

Parse the mem_encrypt= command line parameter from the EFI stub if
CONFIG_ARCH_HAS_MEM_ENCRYPT=y, so that it can be passed to the early
boot code by the arch code in the stub.

This avoids the need for the core kernel to do any string parsing very
early in the boot.
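
(The resulting efi_mem_encrypt value is a tri-state: 0 when mem_encrypt=
is absent, 1 for mem_encrypt=on and -1 for mem_encrypt=off. A minimal
sketch of how arch stub code might consume it -- the helper name below is
hypothetical and not part of this patch:)

/* Hypothetical helper, for illustration only */
static bool mem_encrypt_requested(void)
{
	return IS_ENABLED(CONFIG_ARCH_HAS_MEM_ENCRYPT) && efi_mem_encrypt > 0;
}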

Signed-off-by: Ard Biesheuvel <[email protected]>
---
drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++++++++
drivers/firmware/efi/libstub/efistub.h | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
index bfa30625f5d0..3dc2f9aaf08d 100644
--- a/drivers/firmware/efi/libstub/efi-stub-helper.c
+++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
@@ -24,6 +24,8 @@ static bool efi_noinitrd;
static bool efi_nosoftreserve;
static bool efi_disable_pci_dma = IS_ENABLED(CONFIG_EFI_DISABLE_PCI_DMA);

+int efi_mem_encrypt;
+
bool __pure __efi_soft_reserve_enabled(void)
{
return !efi_nosoftreserve;
@@ -75,6 +77,12 @@ efi_status_t efi_parse_options(char const *cmdline)
efi_noinitrd = true;
} else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
efi_no5lvl = true;
+ } else if (IS_ENABLED(CONFIG_ARCH_HAS_MEM_ENCRYPT) &&
+ !strcmp(param, "mem_encrypt") && val) {
+ if (parse_option_str(val, "on"))
+ efi_mem_encrypt = 1;
+ else if (parse_option_str(val, "off"))
+ efi_mem_encrypt = -1;
} else if (!strcmp(param, "efi") && val) {
efi_nochunk = parse_option_str(val, "nochunk");
efi_novamap |= parse_option_str(val, "novamap");
diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index 212687c30d79..a1c6ab24cd99 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -37,8 +37,8 @@ extern bool efi_no5lvl;
extern bool efi_nochunk;
extern bool efi_nokaslr;
extern int efi_loglevel;
+extern int efi_mem_encrypt;
extern bool efi_novamap;
-
extern const efi_system_table_t *efi_system_table;

typedef union efi_dxe_services_table efi_dxe_services_table_t;
--
2.44.0.rc0.258.g7320e95886-goog


2024-02-21 15:22:15

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH v5 07/16] x86/startup_64: Simplify CR4 handling in startup code

On Wed, Feb 21, 2024 at 6:35 AM Ard Biesheuvel <[email protected]> wrote:
>
> From: Ard Biesheuvel <[email protected]>
>
> When paging is enabled, the CR4.PAE and CR4.LA57 control bits cannot be
> changed, and so they can simply be preserved rather than reasoning about
> whether or not they need to be set. CR4.MCE should be preserved unless
> the kernel was built without CONFIG_X86_MCE, in which case it must be
> cleared.
>
> CR4.PSE should be set explicitly, regardless of whether or not it was
> set before.
>
> CR4.PGE is set explicitly, and then cleared and set again after
> programming CR3 in order to flush TLB entries based on global
> translations. This makes the first assignment redundant, and can
> therefore be omitted.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/x86/kernel/head_64.S | 24 +++++++-------------
> 1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index fb2a98c29094..426f6fdc0075 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -185,6 +185,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> addq $(init_top_pgt - __START_KERNEL_map), %rax
> 1:
>
> + /* Create a mask of CR4 bits to preserve */
> + movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
> #ifdef CONFIG_X86_MCE
> /*
> * Preserve CR4.MCE if the kernel will enable #MC support.
> @@ -193,20 +195,13 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> * configured will crash the system regardless of the CR4.MCE value set
> * here.
> */
> - movq %cr4, %rcx
> - andl $X86_CR4_MCE, %ecx
> -#else
> - movl $0, %ecx
> + orl $X86_CR4_MCE, %edx
> #endif
> + movq %cr4, %rcx
> + andl %edx, %ecx
>
> - /* Enable PAE mode, PSE, PGE and LA57 */
> - orl $(X86_CR4_PAE | X86_CR4_PSE | X86_CR4_PGE), %ecx
> -#ifdef CONFIG_X86_5LEVEL
> - testb $1, __pgtable_l5_enabled(%rip)
> - jz 1f
> - orl $X86_CR4_LA57, %ecx
> -1:
> -#endif
> + /* Even if ignored in long mode, set PSE uniformly on all logical CPUs. */
> + btsl $X86_CR4_PSE_BIT, %ecx
> movq %rcx, %cr4

This CR4 write now does the global flush - see below.

>
> /* Setup early boot stage 4-/5-level pagetables. */
> @@ -226,11 +221,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> * Do a global TLB flush after the CR3 switch to make sure the TLB
> * entries from the identity mapping are flushed.
> */

This comment is misleading now since it's the first CR4 write above
(with PGE clear) that actually does the global flush.
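
(For illustration, a C rendition of the new sequence -- not actual kernel
code, and the function name is made up -- showing where the flush now
happens relative to the CR3 switch:)

static void cr4_cr3_switch_sketch(unsigned long new_cr3)
{
	unsigned long cr4 = native_read_cr4();
	unsigned long keep = X86_CR4_PAE | X86_CR4_LA57;

	if (IS_ENABLED(CONFIG_X86_MCE))
		keep |= X86_CR4_MCE;

	cr4 &= keep;			/* PGE ends up clear here */
	cr4 |= X86_CR4_PSE;
	native_write_cr4(cr4);		/* a PGE 1->0 transition flushes global TLB entries */

	native_write_cr3(new_cr3);

	native_write_cr4(cr4 | X86_CR4_PGE);	/* re-enable global pages */
}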

> - movq %cr4, %rcx
> - movq %rcx, %rax
> - xorq $X86_CR4_PGE, %rcx
> + btsl $X86_CR4_PGE_BIT, %ecx
> movq %rcx, %cr4
> - movq %rax, %cr4
>
> /* Ensure I am executing from virtual addresses */
> movq $1f, %rax
> --
> 2.44.0.rc0.258.g7320e95886-goog
>

Brian Gerst

2024-02-23 13:11:33

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v5 10/16] x86/startup_64: Simplify virtual switch on primary boot

On Wed, 21 Feb 2024 at 12:36, Ard Biesheuvel <[email protected]> wrote:
>
> From: Ard Biesheuvel <[email protected]>
>
> The secondary startup code is used on the primary boot path as well, but
> in this case, the initial part runs from a 1:1 mapping, until an
> explicit cross-jump is made to the kernel virtual mapping of the same
> code.
>
> On the secondary boot path, this jump is pointless as the code already
> executes from the mapping targeted by the jump. So combine this
> cross-jump with the jump from startup_64() into the common boot path.
> This simplifies the execution flow, and clearly separates code that runs
> from a 1:1 mapping from code that runs from the kernel virtual mapping.
>
> Note that this requires a page table switch, so hoist the CR3 assignment
> into startup_64() as well. And since absolute symbol references will no
> longer be permitted in .head.text once we enable the associated build
> time checks, a RIP-relative memory operand is used in the JMP
> instruction, referring to an absolute constant in the .init.rodata
> section.
>
> Given that the secondary startup code does not require a special
> placement inside the executable, move it to the .noinstr.text section.

This should be the .text section, or we get

vmlinux.o: warning: objtool: early_setup_idt+0x4: call to
startup_64_load_idt() leaves .noinstr.text section

It would be better to have the secondary startup code in
.noinstr.text, but that is only a very minor improvement. I'll be
looking into teaching objtool to be strict about .head.text code and
to reject references that use absolute addressing, so I might be able
to tweak it to permit this use case as well, but at this point we
should probably just move it into ordinary .text.


> This requires the use of a subsection so that the payload is placed
> after the page aligned Xen hypercall page, as otherwise, objtool will
> complain about the resulting JMP instruction emitted by the assembler
> being unreachable.
>

^^^ This bit can be dropped, and the following hunk applied (apologies
if my Gmail mangles this - I can resend the patch or the series).


diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 4bee33d8e1dc..e16df01791be 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -515,7 +515,7 @@
}

/* This is used when running on kernel addresses */
-void noinstr early_setup_idt(void)
+void early_setup_idt(void)
{
void *handler = NULL;

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 32fefb23f4df..c9ee92550508 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -139,8 +139,7 @@
__INITRODATA
0: .quad common_startup_64

- .section .noinstr.text, "ax"
- .subsection 1
+ .text
SYM_CODE_START(secondary_startup_64)
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR

Subject: [tip: x86/boot] x86/boot/64: Use RIP_REL_REF() to access early page tables

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: eb54c2ae4a4825c42a6a2b4022926bda7448f735
Gitweb: https://git.kernel.org/tip/eb54c2ae4a4825c42a6a2b4022926bda7448f735
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Wed, 21 Feb 2024 12:35:12 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 26 Feb 2024 12:58:35 +01:00

x86/boot/64: Use RIP_REL_REF() to access early page tables

The early statically allocated page tables are populated from code that
executes from a 1:1 mapping so it cannot use plain accesses from C.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.
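
(For reference, RIP_REL_REF() hides the address computation from the
compiler entirely by forcing a RIP-relative LEA; its definition,
paraphrased from <asm/asm.h> as introduced by the commit referenced in
the cover letter, is roughly:)

static __always_inline __pure void *rip_rel_ptr(void *p)
{
	asm("leaq %c1(%%rip), %0" : "=r"(p) : "i"(p));

	return p;
}

#define RIP_REL_REF(var)	(*(typeof(&(var)))rip_rel_ptr(&(var)))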

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/head64.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index e2573dd..7e2c9b5 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -215,13 +215,11 @@ unsigned long __head __startup_64(unsigned long physaddr,
p4d[511] += load_delta;
}

- pud = fixup_pointer(level3_kernel_pgt, physaddr);
- pud[510] += load_delta;
- pud[511] += load_delta;
+ RIP_REL_REF(level3_kernel_pgt)[PTRS_PER_PUD - 2].pud += load_delta;
+ RIP_REL_REF(level3_kernel_pgt)[PTRS_PER_PUD - 1].pud += load_delta;

- pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
for (i = FIXMAP_PMD_TOP; i > FIXMAP_PMD_TOP - FIXMAP_PMD_NUM; i--)
- pmd[i] += load_delta;
+ RIP_REL_REF(level2_fixmap_pgt)[i].pmd += load_delta;

/*
* Set up the identity mapping for the switchover. These
@@ -284,7 +282,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* error, causing the BIOS to halt the system.
*/

- pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+ pmd = &RIP_REL_REF(level2_kernel_pgt)->pmd;

/* invalidate pages before the kernel image */
for (i = 0; i < pmd_index((unsigned long)_text); i++)

Subject: [tip: x86/boot] x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: 533568e06b157b175912a960efe5ebce8710b4f9
Gitweb: https://git.kernel.org/tip/533568e06b157b175912a960efe5ebce8710b4f9
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Wed, 21 Feb 2024 12:35:13 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 26 Feb 2024 12:58:35 +01:00

x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]

early_top_pgt[] is assigned from code that executes from a 1:1 mapping
so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.

For legibility and to align with the code that populates the lower page
table levels, statically initialize the root level page table with an
entry pointing to level3_kernel_pgt[], and overwrite it when needed to
enable 5-level paging.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/head64.c | 21 +++++++++------------
arch/x86/kernel/head_64.S | 3 ++-
2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 7e2c9b5..72351c3 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,6 +77,7 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

+#ifdef CONFIG_X86_5LEVEL
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
return ptr - (void *)_text + (void *)physaddr;
@@ -87,7 +88,6 @@ static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
return fixup_pointer(ptr, physaddr);
}

-#ifdef CONFIG_X86_5LEVEL
static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
{
return fixup_pointer(ptr, physaddr);
@@ -165,14 +165,14 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
* doesn't have to generate PC-relative relocations when accessing globals from
* that function. Clang actually does not generate them, which leads to
* boot-time crashes. To work around this problem, every global pointer must
- * be adjusted using fixup_pointer().
+ * be accessed using RIP_REL_REF().
*/
unsigned long __head __startup_64(unsigned long physaddr,
struct boot_params *bp)
{
pmd_t (*early_pgts)[PTRS_PER_PMD] = RIP_REL_REF(early_dynamic_pgts);
- unsigned long load_delta, *p;
unsigned long pgtable_flags;
+ unsigned long load_delta;
pgdval_t *pgd;
p4dval_t *p4d;
pudval_t *pud;
@@ -202,17 +202,14 @@ unsigned long __head __startup_64(unsigned long physaddr,

/* Fixup the physical addresses in the page table */

- pgd = fixup_pointer(early_top_pgt, physaddr);
- p = pgd + pgd_index(__START_KERNEL_map);
- if (la57)
- *p = (unsigned long)level4_kernel_pgt;
- else
- *p = (unsigned long)level3_kernel_pgt;
- *p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
+ pgd = &RIP_REL_REF(early_top_pgt)->pgd;
+ pgd[pgd_index(__START_KERNEL_map)] += load_delta;

if (la57) {
- p4d = fixup_pointer(level4_kernel_pgt, physaddr);
- p4d[511] += load_delta;
+ p4d = (p4dval_t *)&RIP_REL_REF(level4_kernel_pgt);
+ p4d[MAX_PTRS_PER_P4D - 1] += load_delta;
+
+ pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)p4d | _PAGE_TABLE_NOENC;
}

RIP_REL_REF(level3_kernel_pgt)[PTRS_PER_PUD - 2].pud += load_delta;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index e09cf0b..d295bf6 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -624,7 +624,8 @@ SYM_CODE_END(vc_no_ghcb)
.balign 4

SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
- .fill 512,8,0
+ .fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
.fill PTI_USER_PGD_FILL,8,0
SYM_DATA_END(early_top_pgt)


Subject: [tip: x86/boot] x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: 4f8b6cf25f5c119b117cb2a4bacb604a6cd00ff1
Gitweb: https://git.kernel.org/tip/4f8b6cf25f5c119b117cb2a4bacb604a6cd00ff1
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Wed, 21 Feb 2024 12:35:11 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 26 Feb 2024 12:58:35 +01:00

x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'

'__supported_pte_mask' is accessed from code that executes from a 1:1
mapping so it cannot use a plain access from C. Replace the use of
fixup_pointer() with RIP_REL_REF(), which is better and simpler.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/head64.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2ac9041..e2573dd 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -177,7 +177,6 @@ unsigned long __head __startup_64(unsigned long physaddr,
p4dval_t *p4d;
pudval_t *pud;
pmdval_t *pmd, pmd_entry;
- pteval_t *mask_ptr;
bool la57;
int i;

@@ -259,8 +258,7 @@ unsigned long __head __startup_64(unsigned long physaddr,

pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
/* Filter out unsupported __PAGE_KERNEL_* bits: */
- mask_ptr = fixup_pointer(&__supported_pte_mask, physaddr);
- pmd_entry &= *mask_ptr;
+ pmd_entry &= RIP_REL_REF(__supported_pte_mask);
pmd_entry += sme_get_me_mask();
pmd_entry += physaddr;


Subject: [tip: x86/boot] x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: b0fe5fb6095be0f68b570c4cf4cd86b7e70c2e38
Gitweb: https://git.kernel.org/tip/b0fe5fb6095be0f68b570c4cf4cd86b7e70c2e38
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Wed, 21 Feb 2024 12:35:10 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 26 Feb 2024 12:58:35 +01:00

x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]

early_dynamic_pgts[] and next_early_pgt are accessed from code that
executes from a 1:1 mapping so it cannot use a plain access from C.
Replace the use of fixup_pointer() with RIP_REL_REF(), which is better
and simpler.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/head64.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index f98f5b6..2ac9041 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -170,6 +170,7 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
unsigned long __head __startup_64(unsigned long physaddr,
struct boot_params *bp)
{
+ pmd_t (*early_pgts)[PTRS_PER_PMD] = RIP_REL_REF(early_dynamic_pgts);
unsigned long load_delta, *p;
unsigned long pgtable_flags;
pgdval_t *pgd;
@@ -179,7 +180,6 @@ unsigned long __head __startup_64(unsigned long physaddr,
pteval_t *mask_ptr;
bool la57;
int i;
- unsigned int *next_pgt_ptr;

la57 = check_la57_support(physaddr);

@@ -231,15 +231,14 @@ unsigned long __head __startup_64(unsigned long physaddr,
* it avoids problems around wraparound.
*/

- next_pgt_ptr = fixup_pointer(&next_early_pgt, physaddr);
- pud = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
- pmd = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
+ pud = &early_pgts[0]->pmd;
+ pmd = &early_pgts[1]->pmd;
+ RIP_REL_REF(next_early_pgt) = 2;

pgtable_flags = _KERNPG_TABLE_NOENC + sme_get_me_mask();

if (la57) {
- p4d = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++],
- physaddr);
+ p4d = &early_pgts[RIP_REL_REF(next_early_pgt)++]->pmd;

i = (physaddr >> PGDIR_SHIFT) % PTRS_PER_PGD;
pgd[i + 0] = (pgdval_t)p4d + pgtable_flags;

Subject: [tip: x86/boot] x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: d9ec1158056bedb6da8f4e02de30d300914b31f8
Gitweb: https://git.kernel.org/tip/d9ec1158056bedb6da8f4e02de30d300914b31f8
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Wed, 21 Feb 2024 12:35:09 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 26 Feb 2024 12:58:35 +01:00

x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'

'phys_base' is assigned from code that executes from a 1:1 mapping so it
cannot use a plain access from C. Replace the use of fixup_pointer()
with RIP_REL_REF(), which is better and simpler.

While at it, move the assignment to before the addition of the SME mask
so there is no need to subtract it again, and drop the unnecessary
addition ('phys_base' is statically initialized to 0x0).

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/head64.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 1d6865e..f98f5b6 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -192,6 +192,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* and the address I am actually running at.
*/
load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+ RIP_REL_REF(phys_base) = load_delta;

/* Is the address not 2M aligned? */
if (load_delta & ~PMD_MASK)
@@ -301,12 +302,6 @@ unsigned long __head __startup_64(unsigned long physaddr,
for (; i < PTRS_PER_PMD; i++)
pmd[i] &= ~_PAGE_PRESENT;

- /*
- * Fixup phys_base - remove the memory encryption mask to obtain
- * the true physical address.
- */
- *fixup_long(&phys_base, physaddr) += load_delta - sme_get_me_mask();
-
return sme_postprocess_startup(bp, pmd);
}


Subject: [tip: x86/boot] x86/boot/64: Simplify global variable accesses in GDT/IDT programming

The following commit has been merged into the x86/boot branch of tip:

Commit-ID: 5da793671957e8e99fa74423fab2737bf8c772a8
Gitweb: https://git.kernel.org/tip/5da793671957e8e99fa74423fab2737bf8c772a8
Author: Ard Biesheuvel <[email protected]>
AuthorDate: Wed, 21 Feb 2024 12:35:08 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Mon, 26 Feb 2024 12:58:11 +01:00

x86/boot/64: Simplify global variable accesses in GDT/IDT programming

There are two code paths in the startup code to program an IDT: one that
runs from the 1:1 mapping and one that runs from the virtual kernel
mapping. Currently, these are strictly separate because fixup_pointer()
is used on the 1:1 path, which will produce the wrong value when used
while executing from the virtual kernel mapping.

Switch to RIP_REL_REF() so that the two code paths can be merged. Also,
move the GDT and IDT descriptors to the stack so that they can be
referenced directly, rather than via RIP_REL_REF().

Rename startup_64_setup_env() to startup_64_setup_gdt_idt() while at it,
to make the call from assembler self-documenting.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/setup.h | 2 +-
arch/x86/kernel/head64.c | 75 ++++++++++++++---------------------
arch/x86/kernel/head_64.S | 4 +--
3 files changed, 32 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 5c83729..e61e68d 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -48,7 +48,7 @@ extern unsigned long saved_video_mode;
extern void reserve_standard_io_resources(void);
extern void i386_reserve_resources(void);
extern unsigned long __startup_64(unsigned long physaddr, struct boot_params *bp);
-extern void startup_64_setup_env(unsigned long physbase);
+extern void startup_64_setup_gdt_idt(void);
extern void early_setup_idt(void);
extern void __init do_early_exception(struct pt_regs *regs, int trapnr);

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dc09560..1d6865e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -22,6 +22,7 @@
#include <linux/cc_platform.h>
#include <linux/pgtable.h>

+#include <asm/asm.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
@@ -76,15 +77,6 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-/*
- * Address needs to be set at runtime because it references the startup_gdt
- * while the kernel still uses a direct mapping.
- */
-static struct desc_ptr startup_gdt_descr __initdata = {
- .size = sizeof(startup_gdt)-1,
- .address = 0,
-};
-
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
return ptr - (void *)_text + (void *)physaddr;
@@ -569,62 +561,52 @@ void __init __noreturn x86_64_start_reservations(char *real_mode_data)
*/
static gate_desc bringup_idt_table[NUM_EXCEPTION_VECTORS] __page_aligned_data;

-static struct desc_ptr bringup_idt_descr = {
- .size = (NUM_EXCEPTION_VECTORS * sizeof(gate_desc)) - 1,
- .address = 0, /* Set at runtime */
-};
-
-static void set_bringup_idt_handler(gate_desc *idt, int n, void *handler)
+/* This may run while still in the direct mapping */
+static void __head startup_64_load_idt(void *vc_handler)
{
-#ifdef CONFIG_AMD_MEM_ENCRYPT
+ struct desc_ptr desc = {
+ .address = (unsigned long)&RIP_REL_REF(bringup_idt_table),
+ .size = sizeof(bringup_idt_table) - 1,
+ };
struct idt_data data;
- gate_desc desc;
-
- init_idt_data(&data, n, handler);
- idt_init_desc(&desc, &data);
- native_write_idt_entry(idt, n, &desc);
-#endif
-}
+ gate_desc idt_desc;

-/* This runs while still in the direct mapping */
-static void __head startup_64_load_idt(unsigned long physbase)
-{
- struct desc_ptr *desc = fixup_pointer(&bringup_idt_descr, physbase);
- gate_desc *idt = fixup_pointer(bringup_idt_table, physbase);
-
-
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
- void *handler;
-
- /* VMM Communication Exception */
- handler = fixup_pointer(vc_no_ghcb, physbase);
- set_bringup_idt_handler(idt, X86_TRAP_VC, handler);
+ /* @vc_handler is set only for a VMM Communication Exception */
+ if (vc_handler) {
+ init_idt_data(&data, X86_TRAP_VC, vc_handler);
+ idt_init_desc(&idt_desc, &data);
+ native_write_idt_entry((gate_desc *)desc.address, X86_TRAP_VC, &idt_desc);
}

- desc->address = (unsigned long)idt;
- native_load_idt(desc);
+ native_load_idt(&desc);
}

/* This is used when running on kernel addresses */
void early_setup_idt(void)
{
- /* VMM Communication Exception */
+ void *handler = NULL;
+
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
setup_ghcb();
- set_bringup_idt_handler(bringup_idt_table, X86_TRAP_VC, vc_boot_ghcb);
+ handler = vc_boot_ghcb;
}

- bringup_idt_descr.address = (unsigned long)bringup_idt_table;
- native_load_idt(&bringup_idt_descr);
+ startup_64_load_idt(handler);
}

/*
* Setup boot CPU state needed before kernel switches to virtual addresses.
*/
-void __head startup_64_setup_env(unsigned long physbase)
+void __head startup_64_setup_gdt_idt(void)
{
+ void *handler = NULL;
+
+ struct desc_ptr startup_gdt_descr = {
+ .address = (unsigned long)&RIP_REL_REF(startup_gdt),
+ .size = sizeof(startup_gdt) - 1,
+ };
+
/* Load GDT */
- startup_gdt_descr.address = (unsigned long)fixup_pointer(startup_gdt, physbase);
native_load_gdt(&startup_gdt_descr);

/* New GDT is live - reload data segment registers */
@@ -632,5 +614,8 @@ void __head startup_64_setup_env(unsigned long physbase)
"movl %%eax, %%ss\n"
"movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");

- startup_64_load_idt(physbase);
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+ handler = &RIP_REL_REF(vc_no_ghcb);
+
+ startup_64_load_idt(handler);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index bfbac50..e09cf0b 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -68,8 +68,6 @@ SYM_CODE_START_NOALIGN(startup_64)
/* Set up the stack for verify_cpu() */
leaq (__end_init_task - PTREGS_SIZE)(%rip), %rsp

- leaq _text(%rip), %rdi
-
/* Setup GSBASE to allow stack canary access for C code */
movl $MSR_GS_BASE, %ecx
leaq INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
@@ -77,7 +75,7 @@ SYM_CODE_START_NOALIGN(startup_64)
shrq $32, %rdx
wrmsr

- call startup_64_setup_env
+ call startup_64_setup_gdt_idt

/* Now switch to __KERNEL_CS so IRET works reliably */
pushq $__KERNEL_CS