2024-02-13 12:42:30

by Ard Biesheuvel

Subject: [PATCH v4 00/11] x86: Confine early 1:1 mapped startup code

From: Ard Biesheuvel <[email protected]>

This is a follow-up to [0] which implemented rigorous build time checks
to ensure that any code that is executed during early startup supports
running from the initial 1:1 mapping of memory, which is how the kernel
is entered from the decompressor or the EFI firmware.

Using PIC codegen and introducing new magic sections into generic code
would create a maintenance burden, and more experimentation is needed
there. One issue with PIC codegen is that it still permits the compiler
to make assumptions about the runtime address of global objects (modulo
runtime relocation), which is incompatible with how the kernel is
entered, i.e., running a fully linked and relocated executable from the
wrong runtime address.

The RIP_REL_REF() macro that was introduced recently [1] is actually
more appropriate for this use case, as it hides the access from the
compiler entirely, and so the compiler can never predict its result.
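
For reference, RIP_REL_REF() is built on a tiny helper that forces a
RIP-relative LEA via inline asm, so the symbol's address is materialized
at runtime rather than derived by the compiler. Roughly, per [1]:

  static __always_inline __pure void *rip_rel_ptr(void *p)
  {
          asm("leaq %c1(%%rip), %0" : "=r" (p) : "i" (p));

          return p;
  }
  #define RIP_REL_REF(var)  (*(typeof(&(var)))rip_rel_ptr(&(var)))

Because the symbol only appears as an "i" constraint inside the asm
block, the compiler cannot fold, cache or otherwise reason about the
resulting address.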

To make incremental progress on this, this v4 drops the special
instrumentation for .pi.text and PIC codegen, but retains all the
cleanup work on the startup code to make it more maintainable and more
obviously correct.

In particular, this involves:
- getting rid of early accesses to global objects, either by moving them
  to the stack, deferring the access until later, or dropping the
  globals entirely;
- moving all code that runs early via the 1:1 mapping into .head.text,
  and moving code that does not out of it, so that build time checks can
  be added later to ensure that no inadvertent absolute references were
  emitted into code that does not tolerate them;
- removing fixup_pointer() and occurrences of __pa_symbol(), both of
  which rely on the compiler emitting absolute references, which is not
  guaranteed (even without -fpic, the compiler may use RIP-relative
  references in some cases); the sketch after this list illustrates the
  failure mode.
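
A minimal sketch (not code from this series): fixup_pointer() rebases a
link-time virtual address onto the 1:1 physical mapping,

  static void *fixup_pointer(void *ptr, unsigned long physaddr)
  {
          return ptr - (void *)_text + (void *)physaddr;
  }

which is only correct if the compiler materialized the address with an
absolute reference:

  movabsq $global_var, %rax        # absolute: fixup yields the 1:1 address
  leaq    global_var(%rip), %rax   # RIP-relative: already correct, so the
                                   # fixup rebases it a second time

With the second form, the result points somewhere in the weeds, hence
the switch to RIP_REL_REF().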

Changes since v3:
- dropped half of the patches and added a couple of new ones
- applied feedback from Boris to patches that were retained, mostly
related to some minor oversights on my part, and to some style issues

[0] https://lkml.kernel.org/r/20240129180502.4069817-21-ardb%2Bgit%40google.com
[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/sev&id=1c811d403afd73f0

Cc: Kevin Loughlin <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Dionna Glaze <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Justin Stitt <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Ard Biesheuvel (11):
x86/startup_64: Simplify global variable accesses in GDT/IDT
programming
x86/startup_64: Replace pointer fixups with RIP-relative references
x86/startup_64: Simplify CR4 handling in startup code
x86/startup_64: Defer assignment of 5-level paging global variables
x86/startup_64: Simplify calculation of initial page table address
x86/startup_64: Simplify virtual switch on primary boot
efi/libstub: Add generic support for parsing mem_encrypt=
x86/boot: Move mem_encrypt= parsing to the decompressor
x86/sme: Move early SME kernel encryption handling into .head.text
x86/sev: Move early startup code into .head.text section
x86/startup_64: Drop global variables keeping track of LA57 state

arch/x86/boot/compressed/misc.c | 15 ++
arch/x86/boot/compressed/misc.h | 4 -
arch/x86/boot/compressed/pgtable_64.c | 12 --
arch/x86/boot/compressed/sev.c | 3 +
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/include/asm/mem_encrypt.h | 8 +-
arch/x86/include/asm/pgtable_64_types.h | 58 +++----
arch/x86/include/asm/setup.h | 2 +-
arch/x86/include/asm/sev.h | 10 +-
arch/x86/include/uapi/asm/bootparam.h | 1 +
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/head64.c | 172 ++++++--------------
arch/x86/kernel/head_64.S | 91 ++++-------
arch/x86/kernel/sev-shared.c | 23 ++-
arch/x86/kernel/sev.c | 14 +-
arch/x86/lib/Makefile | 13 --
arch/x86/mm/kasan_init_64.c | 3 -
arch/x86/mm/mem_encrypt_identity.c | 83 +++-------
drivers/firmware/efi/libstub/efi-stub-helper.c | 8 +
drivers/firmware/efi/libstub/efistub.h | 2 +-
drivers/firmware/efi/libstub/x86-stub.c | 3 +
21 files changed, 186 insertions(+), 342 deletions(-)


base-commit: 1c811d403afd73f04bde82b83b24c754011bd0e8
--
2.43.0.687.g38aa6559b0-goog



2024-02-13 12:42:45

by Ard Biesheuvel

Subject: [PATCH v4 01/11] x86/startup_64: Simplify global variable accesses in GDT/IDT programming

From: Ard Biesheuvel <[email protected]>

There are two code paths in the startup code to program an IDT: one that
runs from the 1:1 mapping and one that runs from the virtual kernel
mapping. Currently, these are strictly separate because fixup_pointer()
is used on the 1:1 path, which will produce the wrong value when used
while executing from the virtual kernel mapping.

Switch to RIP_REL_REF() so that the two code paths can be merged. Also,
move the GDT and IDT descriptors to the stack so that they can be
referenced directly, rather than via RIP_REL_REF().

Rename startup_64_setup_env() to startup_64_setup_gdt_idt() while at it,
to make the call from assembler self-documenting.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/include/asm/setup.h | 2 +-
arch/x86/kernel/head64.c | 56 +++++++-------------
arch/x86/kernel/head_64.S | 4 +-
3 files changed, 22 insertions(+), 40 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 5c83729c8e71..e61e68d71cba 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -48,7 +48,7 @@ extern unsigned long saved_video_mode;
extern void reserve_standard_io_resources(void);
extern void i386_reserve_resources(void);
extern unsigned long __startup_64(unsigned long physaddr, struct boot_params *bp);
-extern void startup_64_setup_env(unsigned long physbase);
+extern void startup_64_setup_gdt_idt(void);
extern void early_setup_idt(void);
extern void __init do_early_exception(struct pt_regs *regs, int trapnr);

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dc0956067944..9d7f12829f2d 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -22,6 +22,7 @@
#include <linux/cc_platform.h>
#include <linux/pgtable.h>

+#include <asm/asm.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
@@ -76,15 +77,6 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-/*
- * Address needs to be set at runtime because it references the startup_gdt
- * while the kernel still uses a direct mapping.
- */
-static struct desc_ptr startup_gdt_descr __initdata = {
- .size = sizeof(startup_gdt)-1,
- .address = 0,
-};
-
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
return ptr - (void *)_text + (void *)physaddr;
@@ -569,12 +561,7 @@ void __init __noreturn x86_64_start_reservations(char *real_mode_data)
*/
static gate_desc bringup_idt_table[NUM_EXCEPTION_VECTORS] __page_aligned_data;

-static struct desc_ptr bringup_idt_descr = {
- .size = (NUM_EXCEPTION_VECTORS * sizeof(gate_desc)) - 1,
- .address = 0, /* Set at runtime */
-};
-
-static void set_bringup_idt_handler(gate_desc *idt, int n, void *handler)
+static void __head set_bringup_idt_handler(gate_desc *idt, int n, void *handler)
{
#ifdef CONFIG_AMD_MEM_ENCRYPT
struct idt_data data;
@@ -586,45 +573,42 @@ static void set_bringup_idt_handler(gate_desc *idt, int n, void *handler)
#endif
}

-/* This runs while still in the direct mapping */
-static void __head startup_64_load_idt(unsigned long physbase)
+/* This may run while still in the direct mapping */
+static void __head startup_64_load_idt(void *handler)
{
- struct desc_ptr *desc = fixup_pointer(&bringup_idt_descr, physbase);
- gate_desc *idt = fixup_pointer(bringup_idt_table, physbase);
-
-
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
- void *handler;
+ struct desc_ptr desc = {
+ .address = (unsigned long)&RIP_REL_REF(bringup_idt_table),
+ .size = sizeof(bringup_idt_table) - 1,
+ };
+ gate_desc *idt = (gate_desc *)desc.address;

+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
/* VMM Communication Exception */
- handler = fixup_pointer(vc_no_ghcb, physbase);
set_bringup_idt_handler(idt, X86_TRAP_VC, handler);
- }

- desc->address = (unsigned long)idt;
- native_load_idt(desc);
+ native_load_idt(&desc);
}

/* This is used when running on kernel addresses */
void early_setup_idt(void)
{
- /* VMM Communication Exception */
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
setup_ghcb();
- set_bringup_idt_handler(bringup_idt_table, X86_TRAP_VC, vc_boot_ghcb);
- }

- bringup_idt_descr.address = (unsigned long)bringup_idt_table;
- native_load_idt(&bringup_idt_descr);
+ startup_64_load_idt(vc_boot_ghcb);
}

/*
* Setup boot CPU state needed before kernel switches to virtual addresses.
*/
-void __head startup_64_setup_env(unsigned long physbase)
+void __head startup_64_setup_gdt_idt(void)
{
+ struct desc_ptr startup_gdt_descr = {
+ .address = (unsigned long)&RIP_REL_REF(startup_gdt),
+ .size = sizeof(startup_gdt) - 1,
+ };
+
/* Load GDT */
- startup_gdt_descr.address = (unsigned long)fixup_pointer(startup_gdt, physbase);
native_load_gdt(&startup_gdt_descr);

/* New GDT is live - reload data segment registers */
@@ -632,5 +616,5 @@ void __head startup_64_setup_env(unsigned long physbase)
"movl %%eax, %%ss\n"
"movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");

- startup_64_load_idt(physbase);
+ startup_64_load_idt(&RIP_REL_REF(vc_no_ghcb));
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d4918d03efb4..3cac98c61066 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -68,8 +68,6 @@ SYM_CODE_START_NOALIGN(startup_64)
/* Set up the stack for verify_cpu() */
leaq (__end_init_task - PTREGS_SIZE)(%rip), %rsp

- leaq _text(%rip), %rdi
-
/* Setup GSBASE to allow stack canary access for C code */
movl $MSR_GS_BASE, %ecx
leaq INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
@@ -77,7 +75,7 @@ SYM_CODE_START_NOALIGN(startup_64)
shrq $32, %rdx
wrmsr

- call startup_64_setup_env
+ call startup_64_setup_gdt_idt

/* Now switch to __KERNEL_CS so IRET works reliably */
pushq $__KERNEL_CS
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:42:58

by Ard Biesheuvel

Subject: [PATCH v4 02/11] x86/startup_64: Replace pointer fixups with RIP-relative references

From: Ard Biesheuvel <[email protected]>

The code in __startup_64() runs from a 1:1 mapping of memory, and uses
explicit pointer arithmetic to convert symbol references generated by
the compiler into references that work correctly via this 1:1 mapping.

This relies on the compiler generating absolute symbol references, which
will be resolved by the linker using the kernel virtual mapping.
However, the compiler may just as well emit RIP-relative references,
even when not operating in PIC mode, and so this explicit pointer
arithmetic is fragile and should be avoided. The fixup routines also
take a 'physical base' argument, which then needs to be passed around.

Convert these pointer fixups to RIP-relative references, which are
guaranteed to produce the correct values without any explicit
arithmetic, removing the need to pass around the physical load address.
It also makes the code substantially easier to understand.

Replace bare 510/511 constants with the appropriate symbolic constants
while at it. Note that pgd_index(__START_KERNEL_map) always produces
the value 511, regardless of the number of paging levels used, so a
symbolic constant is used here as well.
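
As a quick check: __START_KERNEL_map is 0xffffffff80000000, so

  4-level: pgd_index = (0xffffffff80000000 >> 39) & 511 = 511
  5-level: pgd_index = (0xffffffff80000000 >> 48) & 511 = 511

i.e. the kernel mapping always lives in the last top-level slot,
regardless of the paging mode.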

The remaining fixup_int()/fixup_long() calls related to 5-level paging
will be removed in a subsequent patch.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head64.c | 57 ++++++++------------
1 file changed, 21 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 9d7f12829f2d..4b08e321d168 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,6 +77,7 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

+#ifdef CONFIG_X86_5LEVEL
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
return ptr - (void *)_text + (void *)physaddr;
@@ -87,7 +88,6 @@ static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
return fixup_pointer(ptr, physaddr);
}

-#ifdef CONFIG_X86_5LEVEL
static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
{
return fixup_pointer(ptr, physaddr);
@@ -164,22 +164,21 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
/* Code in __startup_64() can be relocated during execution, but the compiler
* doesn't have to generate PC-relative relocations when accessing globals from
* that function. Clang actually does not generate them, which leads to
- * boot-time crashes. To work around this problem, every global pointer must
- * be adjusted using fixup_pointer().
+ * boot-time crashes. To work around this problem, every global variable must
+ * be accessed using RIP_REL_REF().
*/
unsigned long __head __startup_64(unsigned long physaddr,
struct boot_params *bp)
{
- unsigned long load_delta, *p;
+ pmd_t (*early_pgts)[PTRS_PER_PMD] = RIP_REL_REF(early_dynamic_pgts);
unsigned long pgtable_flags;
+ unsigned long load_delta;
pgdval_t *pgd;
p4dval_t *p4d;
pudval_t *pud;
pmdval_t *pmd, pmd_entry;
- pteval_t *mask_ptr;
bool la57;
int i;
- unsigned int *next_pgt_ptr;

la57 = check_la57_support(physaddr);

@@ -192,6 +191,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* and the address I am actually running at.
*/
load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+ RIP_REL_REF(phys_base) = load_delta;

/* Is the address not 2M aligned? */
if (load_delta & ~PMD_MASK)
@@ -201,25 +201,19 @@ unsigned long __head __startup_64(unsigned long physaddr,
load_delta += sme_get_me_mask();

/* Fixup the physical addresses in the page table */
-
- pgd = fixup_pointer(early_top_pgt, physaddr);
- p = pgd + pgd_index(__START_KERNEL_map);
- if (la57)
- *p = (unsigned long)level4_kernel_pgt;
- else
- *p = (unsigned long)level3_kernel_pgt;
- *p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
-
if (la57) {
- p4d = fixup_pointer(level4_kernel_pgt, physaddr);
- p4d[511] += load_delta;
+ p4d = (p4dval_t *)&RIP_REL_REF(level4_kernel_pgt);
+ p4d[MAX_PTRS_PER_P4D - 1] += load_delta;
}

- pud = fixup_pointer(level3_kernel_pgt, physaddr);
- pud[510] += load_delta;
- pud[511] += load_delta;
+ pud = &RIP_REL_REF(level3_kernel_pgt)->pud;
+ pud[PTRS_PER_PUD - 2] += load_delta;
+ pud[PTRS_PER_PUD - 1] += load_delta;
+
+ pgd = &RIP_REL_REF(early_top_pgt)->pgd;
+ pgd[PTRS_PER_PGD - 1] = (pgdval_t)(la57 ? p4d : pud) | _PAGE_TABLE_NOENC;

- pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
+ pmd = &RIP_REL_REF(level2_fixmap_pgt)->pmd;
for (i = FIXMAP_PMD_TOP; i > FIXMAP_PMD_TOP - FIXMAP_PMD_NUM; i--)
pmd[i] += load_delta;

@@ -230,16 +224,14 @@ unsigned long __head __startup_64(unsigned long physaddr,
* it avoids problems around wraparound.
*/

- next_pgt_ptr = fixup_pointer(&next_early_pgt, physaddr);
- pud = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
- pmd = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++], physaddr);
+ pud = &early_pgts[0]->pmd;
+ pmd = &early_pgts[1]->pmd;
+ p4d = &early_pgts[2]->pmd;
+ RIP_REL_REF(next_early_pgt) = 3;

pgtable_flags = _KERNPG_TABLE_NOENC + sme_get_me_mask();

if (la57) {
- p4d = fixup_pointer(early_dynamic_pgts[(*next_pgt_ptr)++],
- physaddr);
-
i = (physaddr >> PGDIR_SHIFT) % PTRS_PER_PGD;
pgd[i + 0] = (pgdval_t)p4d + pgtable_flags;
pgd[i + 1] = (pgdval_t)p4d + pgtable_flags;
@@ -259,8 +251,7 @@ unsigned long __head __startup_64(unsigned long physaddr,

pmd_entry = __PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL;
/* Filter out unsupported __PAGE_KERNEL_* bits: */
- mask_ptr = fixup_pointer(&__supported_pte_mask, physaddr);
- pmd_entry &= *mask_ptr;
+ pmd_entry &= RIP_REL_REF(__supported_pte_mask);
pmd_entry += sme_get_me_mask();
pmd_entry += physaddr;

@@ -286,7 +277,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* error, causing the BIOS to halt the system.
*/

- pmd = fixup_pointer(level2_kernel_pgt, physaddr);
+ pmd = &RIP_REL_REF(level2_kernel_pgt)->pmd;

/* invalidate pages before the kernel image */
for (i = 0; i < pmd_index((unsigned long)_text); i++)
@@ -301,12 +292,6 @@ unsigned long __head __startup_64(unsigned long physaddr,
for (; i < PTRS_PER_PMD; i++)
pmd[i] &= ~_PAGE_PRESENT;

- /*
- * Fixup phys_base - remove the memory encryption mask to obtain
- * the true physical address.
- */
- *fixup_long(&phys_base, physaddr) += load_delta - sme_get_me_mask();
-
return sme_postprocess_startup(bp, pmd);
}

--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:43:08

by Ard Biesheuvel

Subject: [PATCH v4 03/11] x86/startup_64: Simplify CR4 handling in startup code

From: Ard Biesheuvel <[email protected]>

When paging is enabled, the CR4.PAE and CR4.LA57 control bits cannot be
changed, and so they can simply be preserved rather than reasoning about
whether or not they need to be set. CR4.MCE should be preserved unless
the kernel was built without CONFIG_X86_MCE, in which case it must be
cleared.

CR4.PSE should be set explicitly, regardless of whether or not it was
set before.

CR4.PGE is set explicitly, and then cleared and set again after
programming CR3 in order to flush TLB entries based on global
translations. This makes the first assignment redundant, and can
therefore be omitted.
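
The net effect of the new sequence, in C-like pseudocode for clarity
(here mce stands for X86_CR4_MCE if CONFIG_X86_MCE=y and 0 otherwise):

  cr4 = read_cr4();
  cr4 &= X86_CR4_PAE | X86_CR4_LA57 | mce;  /* preserve */
  cr4 |= X86_CR4_PSE;                       /* set unconditionally */
  write_cr4(cr4);
  /* ... CR3 switch ... */
  cr4 |= X86_CR4_PGE;                       /* (re)enable global pages, */
  write_cr4(cr4);                           /* flushing the TLB */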

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 24 ++++++--------------
1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 3cac98c61066..7e76cc0b442a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -185,6 +185,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

+ /* Create a mask of CR4 bits to preserve */
+ movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
#ifdef CONFIG_X86_MCE
/*
* Preserve CR4.MCE if the kernel will enable #MC support.
@@ -193,20 +195,11 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
* configured will crash the system regardless of the CR4.MCE value set
* here.
*/
- movq %cr4, %rcx
- andl $X86_CR4_MCE, %ecx
-#else
- movl $0, %ecx
-#endif
-
- /* Enable PAE mode, PSE, PGE and LA57 */
- orl $(X86_CR4_PAE | X86_CR4_PSE | X86_CR4_PGE), %ecx
-#ifdef CONFIG_X86_5LEVEL
- testb $1, __pgtable_l5_enabled(%rip)
- jz 1f
- orl $X86_CR4_LA57, %ecx
-1:
+ orl $X86_CR4_MCE, %edx
#endif
+ movq %cr4, %rcx
+ andl %edx, %ecx
+ btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

/* Setup early boot stage 4-/5-level pagetables. */
@@ -226,11 +219,8 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
* Do a global TLB flush after the CR3 switch to make sure the TLB
* entries from the identity mapping are flushed.
*/
- movq %cr4, %rcx
- movq %rcx, %rax
- xorq $X86_CR4_PGE, %rcx
+ btsl $X86_CR4_PGE_BIT, %ecx
movq %rcx, %cr4
- movq %rax, %cr4

/* Ensure I am executing from virtual addresses */
movq $1f, %rax
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:43:30

by Ard Biesheuvel

Subject: [PATCH v4 04/11] x86/startup_64: Defer assignment of 5-level paging global variables

From: Ard Biesheuvel <[email protected]>

Assigning the 5-level paging related global variables from the earliest
C code using explicit references that use the 1:1 translation of memory
is unnecessary, as the startup code itself does not rely on them to
create the initial page tables, which is all it should be doing. So
defer these assignments to the primary C entry code that executes via
the ordinary kernel virtual mapping.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head64.c | 44 +++++++-------------
1 file changed, 14 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 4b08e321d168..4bcbd4ae2dc6 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -23,6 +23,7 @@
#include <linux/pgtable.h>

#include <asm/asm.h>
+#include <asm/page_64.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
@@ -77,24 +78,11 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-#ifdef CONFIG_X86_5LEVEL
-static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
-{
- return ptr - (void *)_text + (void *)physaddr;
-}
-
-static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
+static inline bool check_la57_support(void)
{
- return fixup_pointer(ptr, physaddr);
-}
-
-static unsigned int __head *fixup_int(void *ptr, unsigned long physaddr)
-{
- return fixup_pointer(ptr, physaddr);
-}
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return false;

-static bool __head check_la57_support(unsigned long physaddr)
-{
/*
* 5-level paging is detected and enabled at kernel decompression
* stage. Only check if it has been enabled there.
@@ -102,21 +90,8 @@ static bool __head check_la57_support(unsigned long physaddr)
if (!(native_read_cr4() & X86_CR4_LA57))
return false;

- *fixup_int(&__pgtable_l5_enabled, physaddr) = 1;
- *fixup_int(&pgdir_shift, physaddr) = 48;
- *fixup_int(&ptrs_per_p4d, physaddr) = 512;
- *fixup_long(&page_offset_base, physaddr) = __PAGE_OFFSET_BASE_L5;
- *fixup_long(&vmalloc_base, physaddr) = __VMALLOC_BASE_L5;
- *fixup_long(&vmemmap_base, physaddr) = __VMEMMAP_BASE_L5;
-
return true;
}
-#else
-static bool __head check_la57_support(unsigned long physaddr)
-{
- return false;
-}
-#endif

static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
{
@@ -180,7 +155,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
bool la57;
int i;

- la57 = check_la57_support(physaddr);
+ la57 = check_la57_support();

/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
@@ -463,6 +438,15 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

+ if (check_la57_support()) {
+ __pgtable_l5_enabled = 1;
+ pgdir_shift = 48;
+ ptrs_per_p4d = 512;
+ page_offset_base = __PAGE_OFFSET_BASE_L5;
+ vmalloc_base = __VMALLOC_BASE_L5;
+ vmemmap_base = __VMEMMAP_BASE_L5;
+ }
+
cr4_init_shadow();

/* Kill off the identity-map trampoline */
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:43:43

by Ard Biesheuvel

Subject: [PATCH v4 05/11] x86/startup_64: Simplify calculation of initial page table address

From: Ard Biesheuvel <[email protected]>

Determining the address of the initial page table to program into CR3
involves:
- taking the physical address
- adding the SME encryption mask

On the primary entry path, the code is mapped using a 1:1 virtual to
physical translation, so the physical address can be taken directly
using a RIP-relative LEA instruction.

On the secondary entry path, the address can be obtained by taking the
offset from the virtual kernel base (__START_KERNEL_map) and adding the
physical kernel base.

This is implemented in a slightly confusing way, so clean this up.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 25 ++++++--------------
1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 7e76cc0b442a..6dcc2f7f4108 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -111,13 +111,11 @@ SYM_CODE_START_NOALIGN(startup_64)
call __startup_64

/* Form the CR3 value being sure to include the CR3 modifier */
- addq $(early_top_pgt - __START_KERNEL_map), %rax
+ leaq early_top_pgt(%rip), %rcx
+ addq %rcx, %rax

#ifdef CONFIG_AMD_MEM_ENCRYPT
mov %rax, %rdi
- mov %rax, %r14
-
- addq phys_base(%rip), %rdi

/*
* For SEV guests: Verify that the C-bit is correct. A malicious
@@ -126,12 +124,6 @@ SYM_CODE_START_NOALIGN(startup_64)
* the next RET instruction.
*/
call sev_verify_cbit
-
- /*
- * Restore CR3 value without the phys_base which will be added
- * below, before writing %cr3.
- */
- mov %r14, %rax
#endif

jmp 1f
@@ -171,18 +163,18 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
/* Clear %R15 which holds the boot_params pointer on the boot CPU */
xorq %r15, %r15

+ /* Derive the runtime physical address of init_top_pgt[] */
+ movq phys_base(%rip), %rax
+ addq $(init_top_pgt - __START_KERNEL_map), %rax
+
/*
* Retrieve the modifier (SME encryption mask if SME is active) to be
* added to the initial pgdir entry that will be programmed into CR3.
*/
#ifdef CONFIG_AMD_MEM_ENCRYPT
- movq sme_me_mask, %rax
-#else
- xorq %rax, %rax
+ addq sme_me_mask(%rip), %rax
#endif

- /* Form the CR3 value being sure to include the CR3 modifier */
- addq $(init_top_pgt - __START_KERNEL_map), %rax
1:

/* Create a mask of CR4 bits to preserve */
@@ -202,9 +194,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

- /* Setup early boot stage 4-/5-level pagetables. */
- addq phys_base(%rip), %rax
-
/*
* Switch to new page-table
*
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:43:53

by Ard Biesheuvel

[permalink] [raw]
Subject: [PATCH v4 06/11] x86/startup_64: Simplify virtual switch on primary boot

From: Ard Biesheuvel <[email protected]>

The secondary startup code is used on the primary boot path as well, but
in this case, the initial part runs from a 1:1 mapping, until an
explicit cross-jump is made to the kernel virtual mapping of the same
code.

On the secondary boot path, this jump is pointless as the code already
executes from the mapping targeted by the jump. So combine this
cross-jump with the jump from startup_64() into the common boot path.
This simplifies the execution flow, and clearly separates code that runs
from a 1:1 mapping from code that runs from the kernel virtual mapping.

Note that this requires a page table switch, so hoist the CR3 assignment
into startup_64() as well. And since absolute symbol references will no
longer be permitted in .head.text once we enable the associated build
time checks, a RIP-relative memory operand is used in the JMP
instruction, referring to an absolute constant in the .init.rodata
section.
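
The trick in isolation: the JMP needs only a RIP-relative memory
operand, while the 64-bit absolute target lives in .init.rodata as
data, so no absolute reference is emitted into .head.text itself:

	ANNOTATE_RETPOLINE_SAFE
	jmp	*0f(%rip)		// RIP-relative load of the target

	__INITRODATA
  0:	.quad	common_startup_64	// absolute value, but stored as data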

Given that the secondary startup code does not require a special
placement inside the executable, move it to the .noinstr.text section.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/kernel/head_64.S | 42 ++++++++++----------
1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 6dcc2f7f4108..3fed0aafcb41 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -39,7 +39,6 @@ L4_START_KERNEL = l4_index(__START_KERNEL_map)

L3_START_KERNEL = pud_index(__START_KERNEL_map)

- .text
__HEAD
.code64
SYM_CODE_START_NOALIGN(startup_64)
@@ -126,9 +125,21 @@ SYM_CODE_START_NOALIGN(startup_64)
call sev_verify_cbit
#endif

- jmp 1f
+ /*
+ * Switch to early_top_pgt which still has the identity mappings
+ * present.
+ */
+ movq %rax, %cr3
+
+ /* Branch to the common startup code at its kernel virtual address */
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *0f(%rip)
SYM_CODE_END(startup_64)

+ __INITRODATA
+0: .quad common_startup_64
+
+ .section .noinstr.text, "ax"
SYM_CODE_START(secondary_startup_64)
UNWIND_HINT_END_OF_STACK
ANNOTATE_NOENDBR
@@ -174,8 +185,15 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
#ifdef CONFIG_AMD_MEM_ENCRYPT
addq sme_me_mask(%rip), %rax
#endif
+ /*
+ * Switch to the init_top_pgt here, away from the trampoline_pgd and
+ * unmap the identity mapped ranges.
+ */
+ movq %rax, %cr3

-1:
+SYM_INNER_LABEL(common_startup_64, SYM_L_LOCAL)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR

/* Create a mask of CR4 bits to preserve */
movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
@@ -194,16 +212,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PSE_BIT, %ecx
movq %rcx, %cr4

- /*
- * Switch to new page-table
- *
- * For the boot CPU this switches to early_top_pgt which still has the
- * identity mappings present. The secondary CPUs will switch to the
- * init_top_pgt here, away from the trampoline_pgd and unmap the
- * identity mapped ranges.
- */
- movq %rax, %cr3
-
/*
* Do a global TLB flush after the CR3 switch to make sure the TLB
* entries from the identity mapping are flushed.
@@ -211,14 +219,6 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
btsl $X86_CR4_PGE_BIT, %ecx
movq %rcx, %cr4

- /* Ensure I am executing from virtual addresses */
- movq $1f, %rax
- ANNOTATE_RETPOLINE_SAFE
- jmp *%rax
-1:
- UNWIND_HINT_END_OF_STACK
- ANNOTATE_NOENDBR // above
-
#ifdef CONFIG_SMP
/*
* For parallel boot, the APIC ID is read from the APIC, and then
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:44:23

by Ard Biesheuvel

Subject: [PATCH v4 07/11] efi/libstub: Add generic support for parsing mem_encrypt=

From: Ard Biesheuvel <[email protected]>

Parse the mem_encrypt= command line parameter from the EFI stub if
CONFIG_ARCH_HAS_MEM_ENCRYPT=y, so that it can be passed to the early
boot code by the arch code in the stub.

This avoids the need for the core kernel to do any string parsing very
early in the boot.
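
As the hunk below shows, the result is recorded as a tri-state, so the
arch stub can tell 'off' apart from 'not specified':

  efi_mem_encrypt ==  0   /* mem_encrypt= not present on the command line */
  efi_mem_encrypt ==  1   /* mem_encrypt=on */
  efi_mem_encrypt == -1   /* mem_encrypt=off */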

Signed-off-by: Ard Biesheuvel <[email protected]>
---
drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++++++++
drivers/firmware/efi/libstub/efistub.h | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
index bfa30625f5d0..3dc2f9aaf08d 100644
--- a/drivers/firmware/efi/libstub/efi-stub-helper.c
+++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
@@ -24,6 +24,8 @@ static bool efi_noinitrd;
static bool efi_nosoftreserve;
static bool efi_disable_pci_dma = IS_ENABLED(CONFIG_EFI_DISABLE_PCI_DMA);

+int efi_mem_encrypt;
+
bool __pure __efi_soft_reserve_enabled(void)
{
return !efi_nosoftreserve;
@@ -75,6 +77,12 @@ efi_status_t efi_parse_options(char const *cmdline)
efi_noinitrd = true;
} else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
efi_no5lvl = true;
+ } else if (IS_ENABLED(CONFIG_ARCH_HAS_MEM_ENCRYPT) &&
+ !strcmp(param, "mem_encrypt") && val) {
+ if (parse_option_str(val, "on"))
+ efi_mem_encrypt = 1;
+ else if (parse_option_str(val, "off"))
+ efi_mem_encrypt = -1;
} else if (!strcmp(param, "efi") && val) {
efi_nochunk = parse_option_str(val, "nochunk");
efi_novamap |= parse_option_str(val, "novamap");
diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index 212687c30d79..a1c6ab24cd99 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -37,8 +37,8 @@ extern bool efi_no5lvl;
extern bool efi_nochunk;
extern bool efi_nokaslr;
extern int efi_loglevel;
+extern int efi_mem_encrypt;
extern bool efi_novamap;
-
extern const efi_system_table_t *efi_system_table;

typedef union efi_dxe_services_table efi_dxe_services_table_t;
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:44:46

by Ard Biesheuvel

Subject: [PATCH v4 08/11] x86/boot: Move mem_encrypt= parsing to the decompressor

From: Ard Biesheuvel <[email protected]>

The early SME/SEV code parses the command line very early, in order to
decide whether or not memory encryption should be enabled, which needs
to occur even before the initial page tables are created.

This is problematic for a number of reasons:
- this early code runs from the 1:1 mapping provided by the decompressor
  or firmware, which uses a different translation than the one assumed by
  the linker, and so the code needs to be built in a special way;
- parsing external input while the entire kernel image is still mapped
  writable is a bad idea in general, and really does not belong in
  security-minded code;
- the current code ignores the built-in command line entirely (although
  this appears to be the case for the entire decompressor)

Given that the decompressor/EFI stub is an intrinsic part of the x86
bootable kernel image, move the command line parsing there and out of
the core kernel. This removes the need to build lib/cmdline.o in a
special way, or to use RIP-relative LEA instructions in inline asm
blocks.

This involves a new xloadflag in the setup header to indicate
that mem_encrypt=on appeared on the kernel command line.
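
One subtlety in the decompressor hunk below: cmdline_find_option_bool()
returns the position of the match on the command line (0 when absent),
so comparing the two results makes the last occurrence win. Assuming
that return convention:

  "mem_encrypt=off mem_encrypt=on"  ->  on > off  ->  XLF_MEM_ENCRYPTION set
  "mem_encrypt=on mem_encrypt=off"  ->  on < off  ->  flag left clear
  (no mem_encrypt= at all)          ->  0 == 0    ->  flag left clear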

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/misc.c | 15 +++++++++
arch/x86/include/uapi/asm/bootparam.h | 1 +
arch/x86/lib/Makefile | 13 --------
arch/x86/mm/mem_encrypt_identity.c | 32 ++------------------
drivers/firmware/efi/libstub/x86-stub.c | 3 ++
5 files changed, 22 insertions(+), 42 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index b99e08e6815b..6c5c190a4d86 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -357,6 +357,19 @@ unsigned long decompress_kernel(unsigned char *outbuf, unsigned long virt_addr,
return entry;
}

+/*
+ * Set the memory encryption xloadflag based on the mem_encrypt= command line
+ * parameter, if provided.
+ */
+static void parse_mem_encrypt(struct setup_header *hdr)
+{
+ int on = cmdline_find_option_bool("mem_encrypt=on");
+ int off = cmdline_find_option_bool("mem_encrypt=off");
+
+ if (on > off)
+ hdr->xloadflags |= XLF_MEM_ENCRYPTION;
+}
+
/*
* The compressed kernel image (ZO), has been moved so that its position
* is against the end of the buffer used to hold the uncompressed kernel
@@ -387,6 +400,8 @@ asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
/* Clear flags intended for solely in-kernel use. */
boot_params_ptr->hdr.loadflags &= ~KASLR_FLAG;

+ parse_mem_encrypt(&boot_params_ptr->hdr);
+
sanitize_boot_params(boot_params_ptr);

if (boot_params_ptr->screen_info.orig_video_mode == 7) {
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..eeea058cf602 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -38,6 +38,7 @@
#define XLF_EFI_KEXEC (1<<4)
#define XLF_5LEVEL (1<<5)
#define XLF_5LEVEL_ENABLED (1<<6)
+#define XLF_MEM_ENCRYPTION (1<<7)

#ifndef __ASSEMBLY__

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index ea3a28e7b613..f0dae4fb6d07 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -14,19 +14,6 @@ ifdef CONFIG_KCSAN
CFLAGS_REMOVE_delay.o = $(CC_FLAGS_FTRACE)
endif

-# Early boot use of cmdline; don't instrument it
-ifdef CONFIG_AMD_MEM_ENCRYPT
-KCOV_INSTRUMENT_cmdline.o := n
-KASAN_SANITIZE_cmdline.o := n
-KCSAN_SANITIZE_cmdline.o := n
-
-ifdef CONFIG_FUNCTION_TRACER
-CFLAGS_REMOVE_cmdline.o = -pg
-endif
-
-CFLAGS_cmdline.o := -fno-stack-protector -fno-jump-tables
-endif
-
inat_tables_script = $(srctree)/arch/x86/tools/gen-insn-attr-x86.awk
inat_tables_maps = $(srctree)/arch/x86/lib/x86-opcode-map.txt
quiet_cmd_inat_tables = GEN $@
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 0166ab1780cc..d210c7fc8fa2 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -43,7 +43,6 @@

#include <asm/setup.h>
#include <asm/sections.h>
-#include <asm/cmdline.h>
#include <asm/coco.h>
#include <asm/sev.h>

@@ -95,9 +94,6 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");

-static char sme_cmdline_arg[] __initdata = "mem_encrypt";
-static char sme_cmdline_on[] __initdata = "on";
-
static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
@@ -504,11 +500,9 @@ void __init sme_encrypt_kernel(struct boot_params *bp)

void __init sme_enable(struct boot_params *bp)
{
- const char *cmdline_ptr, *cmdline_arg, *cmdline_on;
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;
unsigned long me_mask;
- char buffer[16];
bool snp;
u64 msr;

@@ -551,6 +545,9 @@ void __init sme_enable(struct boot_params *bp)

/* Check if memory encryption is enabled */
if (feature_mask == AMD_SME_BIT) {
+ if (!(bp->hdr.xloadflags & XLF_MEM_ENCRYPTION))
+ return;
+
/*
* No SME if Hypervisor bit is set. This check is here to
* prevent a guest from trying to enable SME. For running as a
@@ -570,31 +567,8 @@ void __init sme_enable(struct boot_params *bp)
msr = __rdmsr(MSR_AMD64_SYSCFG);
if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
return;
- } else {
- /* SEV state cannot be controlled by a command line option */
- goto out;
}

- /*
- * Fixups have not been applied to phys_base yet and we're running
- * identity mapped, so we must obtain the address to the SME command
- * line argument data using rip-relative addressing.
- */
- asm ("lea sme_cmdline_arg(%%rip), %0"
- : "=r" (cmdline_arg)
- : "p" (sme_cmdline_arg));
- asm ("lea sme_cmdline_on(%%rip), %0"
- : "=r" (cmdline_on)
- : "p" (sme_cmdline_on));
-
- cmdline_ptr = (const char *)((u64)bp->hdr.cmd_line_ptr |
- ((u64)bp->ext_cmd_line_ptr << 32));
-
- if (cmdline_find_option(cmdline_ptr, cmdline_arg, buffer, sizeof(buffer)) < 0 ||
- strncmp(buffer, cmdline_on, sizeof(buffer)))
- return;
-
-out:
RIP_REL_REF(sme_me_mask) = me_mask;
physical_mask &= ~me_mask;
cc_vendor = CC_VENDOR_AMD;
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 0d510c9a06a4..9a25ec16b344 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -879,6 +879,9 @@ void __noreturn efi_stub_entry(efi_handle_t handle,
}
}

+ if (efi_mem_encrypt > 0)
+ hdr->xloadflags |= XLF_MEM_ENCRYPTION;
+
status = efi_decompress_kernel(&kernel_entry);
if (status != EFI_SUCCESS) {
efi_err("Failed to decompress kernel\n");
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:46:00

by Ard Biesheuvel

Subject: [PATCH v4 11/11] x86/startup_64: Drop global variables keeping track of LA57 state

From: Ard Biesheuvel <[email protected]>

On x86_64, the core kernel is entered in long mode, which implies that
paging is enabled. This means that the CR4.LA57 control bit is
guaranteed to be in sync with the number of paging levels used by the
kernel, and there is no need to store this in a variable.

There is also no need to use variables for storing the calculations of
pgdir_shift and ptrs_per_p4d, as they are easily determined on the fly.

This removes the need for two different sources of truth for determining
whether 5-level paging is in use: CR4.LA57 always reflects the actual
state, and never changes from the point of view of the 64-bit core
kernel. The only potential concern is the cost of CR4 accesses, which
can be mitigated using alternatives patching based on feature detection.
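
Concretely, the pgtable_l5_enabled() implementation below uses
ALTERNATIVE_TERNARY so that the CR4 read only survives until
alternatives are applied; a sketch of the two sequences:

  /* before patching (and wherever alternatives never run): */
  movq %cr4, %reg
  btl  $X86_CR4_LA57_BIT, %reg	/* CF = CR4.LA57 */

  /* after patching, based on X86_FEATURE_LA57: */
  stc				/* feature set:   CF = 1 */
  clc				/* feature clear: CF = 0 */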

Note that even the decompressor does not manipulate any page tables
before updating CR4.LA57, so it can also avoid the associated global
variables entirely. However, as it does not implement alternatives
patching, the associated ELF sections need to be discarded.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/misc.h | 4 --
arch/x86/boot/compressed/pgtable_64.c | 12 ----
arch/x86/boot/compressed/vmlinux.lds.S | 1 +
arch/x86/include/asm/pgtable_64_types.h | 58 ++++++++------------
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/head64.c | 33 +----------
arch/x86/mm/kasan_init_64.c | 3 -
arch/x86/mm/mem_encrypt_identity.c | 9 ---
8 files changed, 27 insertions(+), 95 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index bc2f0f17fb90..2b15ddd0e177 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,9 +16,6 @@

#define __NO_FORTIFY

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
/*
* Boot stub deals with identity mappings, physical and virtual addresses are
* the same, so override these defines.
@@ -178,7 +175,6 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#endif

/* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
extern void kernel_add_identity_map(unsigned long start, unsigned long end);

/* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 51f957b24ba7..ae72f53f5e77 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -9,13 +9,6 @@
#define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */
#define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */

-#ifdef CONFIG_X86_5LEVEL
-/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
-unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") pgdir_shift = 39;
-unsigned int __section(".data") ptrs_per_p4d = 1;
-#endif
-
/* Buffer to preserve trampoline memory */
static char trampoline_save[TRAMPOLINE_32BIT_SIZE];

@@ -125,11 +118,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
native_cpuid_eax(0) >= 7 &&
(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
l5_required = true;
-
- /* Initialize variables for 5-level paging */
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
}

/*
diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
index 083ec6d7722a..06358bb067fe 100644
--- a/arch/x86/boot/compressed/vmlinux.lds.S
+++ b/arch/x86/boot/compressed/vmlinux.lds.S
@@ -81,6 +81,7 @@ SECTIONS
*(.dynamic) *(.dynsym) *(.dynstr) *(.dynbss)
*(.hash) *(.gnu.hash)
*(.note.*)
+ *(.altinstructions .altinstr_replacement)
}

.got.plt (INFO) : {
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 38b54b992f32..6a57bfdff52b 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -6,7 +6,10 @@

#ifndef __ASSEMBLY__
#include <linux/types.h>
+#include <asm/alternative.h>
+#include <asm/cpufeatures.h>
#include <asm/kaslr.h>
+#include <asm/processor-flags.h>

/*
* These are used to make use of C type-checking..
@@ -21,63 +24,50 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;

-#ifdef CONFIG_X86_5LEVEL
-extern unsigned int __pgtable_l5_enabled;
-
-#ifdef USE_EARLY_PGTABLE_L5
-/*
- * cpu_feature_enabled() is not available in early boot code.
- * Use variable instead.
- */
-static inline bool pgtable_l5_enabled(void)
+static __always_inline __pure bool pgtable_l5_enabled(void)
{
- return __pgtable_l5_enabled;
-}
-#else
-#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
-#endif /* USE_EARLY_PGTABLE_L5 */
+ unsigned long r;
+ bool ret;

-#else
-#define pgtable_l5_enabled() 0
-#endif /* CONFIG_X86_5LEVEL */
+ if (!IS_ENABLED(CONFIG_X86_5LEVEL))
+ return false;
+
+ asm(ALTERNATIVE_TERNARY(
+ "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
+ %P[feat], "stc", "clc")
+ : [reg] "=&r" (r), CC_OUT(c) (ret)
+ : [feat] "i" (X86_FEATURE_LA57),
+ [la57] "i" (X86_CR4_LA57_BIT)
+ : "cc");

-extern unsigned int pgdir_shift;
-extern unsigned int ptrs_per_p4d;
+ return ret;
+}

#endif /* !__ASSEMBLY__ */

#define SHARED_KERNEL_PMD 0

-#ifdef CONFIG_X86_5LEVEL
-
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
-#define PGDIR_SHIFT pgdir_shift
+#define PGDIR_SHIFT (pgtable_l5_enabled() ? 48 : 39)
#define PTRS_PER_PGD 512

/*
* 4th level page in 5-level paging case
*/
#define P4D_SHIFT 39
+#ifdef CONFIG_X86_5LEVEL
#define MAX_PTRS_PER_P4D 512
-#define PTRS_PER_P4D ptrs_per_p4d
+#else
+#define MAX_PTRS_PER_P4D 1
+#endif
+#define PTRS_PER_P4D (pgtable_l5_enabled() ? 512 : 1)
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE - 1))

#define MAX_POSSIBLE_PHYSMEM_BITS 52

-#else /* CONFIG_X86_5LEVEL */
-
-/*
- * PGDIR_SHIFT determines what a top-level page table entry can map
- */
-#define PGDIR_SHIFT 39
-#define PTRS_PER_PGD 512
-#define MAX_PTRS_PER_P4D 1
-
-#endif /* CONFIG_X86_5LEVEL */
-
/*
* 3rd level page
*/
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9e35e276c55a..d88e4be88868 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,6 +1,4 @@
// SPDX-License-Identifier: GPL-2.0-only
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5

#include <linux/memblock.h>
#include <linux/linkage.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 4bcbd4ae2dc6..aee99cfda4eb 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -7,9 +7,6 @@

#define DISABLE_BRANCH_PROFILING

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/init.h>
#include <linux/linkage.h>
#include <linux/types.h>
@@ -52,14 +49,6 @@ extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);

-#ifdef CONFIG_X86_5LEVEL
-unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int pgdir_shift __ro_after_init = 39;
-EXPORT_SYMBOL(pgdir_shift);
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-EXPORT_SYMBOL(ptrs_per_p4d);
-#endif
-
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
@@ -78,21 +67,6 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-static inline bool check_la57_support(void)
-{
- if (!IS_ENABLED(CONFIG_X86_5LEVEL))
- return false;
-
- /*
- * 5-level paging is detected and enabled at kernel decompression
- * stage. Only check if it has been enabled there.
- */
- if (!(native_read_cr4() & X86_CR4_LA57))
- return false;
-
- return true;
-}
-
static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdval_t *pmd)
{
unsigned long vaddr, vaddr_end;
@@ -155,7 +129,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
bool la57;
int i;

- la57 = check_la57_support();
+ la57 = pgtable_l5_enabled();

/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
@@ -438,10 +412,7 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

- if (check_la57_support()) {
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
+ if (pgtable_l5_enabled()) {
page_offset_base = __PAGE_OFFSET_BASE_L5;
vmalloc_base = __VMALLOC_BASE_L5;
vmemmap_base = __VMEMMAP_BASE_L5;
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0302491d799d..85ae1ef840cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -2,9 +2,6 @@
#define DISABLE_BRANCH_PROFILING
#define pr_fmt(fmt) "kasan: " fmt

-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/memblock.h>
#include <linux/kasan.h>
#include <linux/kdebug.h>
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 64b5005d49e5..a857945af177 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -27,15 +27,6 @@
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS

-/*
- * This code runs before CPU feature bits are set. By default, the
- * pgtable_l5_enabled() function uses bit X86_FEATURE_LA57 to determine if
- * 5-level paging is active, so that won't work here. USE_EARLY_PGTABLE_L5
- * is provided to handle this situation and, instead, use a variable that
- * has been set by the early boot code.
- */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mem_encrypt.h>
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:46:39

by Ard Biesheuvel

Subject: [PATCH v4 10/11] x86/sev: Move early startup code into .head.text section

From: Ard Biesheuvel <[email protected]>

In preparation for implementing rigorous build time checks to enforce
that only code that can support it will be called from the early 1:1
mapping of memory, move SEV init code that is called in this manner to
the .head.text section.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/compressed/sev.c | 3 +++
arch/x86/include/asm/sev.h | 10 ++++-----
arch/x86/kernel/sev-shared.c | 23 +++++++++-----------
arch/x86/kernel/sev.c | 14 +++++++-----
4 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 073291832f44..bea0719d70f2 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -116,6 +116,9 @@ static bool fault_in_kernel_space(unsigned long address)
#undef __init
#define __init

+#undef __head
+#define __head
+
#define __BOOT_COMPRESSED

/* Basic instruction decoding support needed */
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index bed95e1f4d52..cf671138feef 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -213,16 +213,16 @@ static inline int pvalidate(unsigned long vaddr, bool rmp_psize, bool validate)
struct snp_guest_request_ioctl;

void setup_ghcb(void);
-void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
- unsigned long npages);
-void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
- unsigned long npages);
+void early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages);
+void early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages);
void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op);
void snp_set_memory_shared(unsigned long vaddr, unsigned long npages);
void snp_set_memory_private(unsigned long vaddr, unsigned long npages);
void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
-void __init __noreturn snp_abort(void);
+void __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, struct snp_guest_request_ioctl *rio);
void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 snp_get_unsupported_features(u64 status);
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index ae79f9505298..0bd7ccbe8732 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -93,7 +93,8 @@ static bool __init sev_es_check_cpu_features(void)
return true;
}

-static void __noreturn sev_es_terminate(unsigned int set, unsigned int reason)
+static void __head __noreturn
+sev_es_terminate(unsigned int set, unsigned int reason)
{
u64 val = GHCB_MSR_TERM_REQ;

@@ -330,13 +331,7 @@ static int sev_cpuid_hv(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid
*/
static const struct snp_cpuid_table *snp_cpuid_get_table(void)
{
- void *ptr;
-
- asm ("lea cpuid_table_copy(%%rip), %0"
- : "=r" (ptr)
- : "p" (&cpuid_table_copy));
-
- return ptr;
+ return &RIP_REL_REF(cpuid_table_copy);
}

/*
@@ -395,7 +390,7 @@ static u32 snp_cpuid_calc_xsave_size(u64 xfeatures_en, bool compacted)
return xsave_size;
}

-static bool
+static bool __head
snp_cpuid_get_validated_func(struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();
@@ -532,7 +527,8 @@ static int snp_cpuid_postprocess(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
* Returns -EOPNOTSUPP if feature not enabled. Any other non-zero return value
* should be treated as fatal by caller.
*/
-static int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
+static int __head
+snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_leaf *leaf)
{
const struct snp_cpuid_table *cpuid_table = snp_cpuid_get_table();

@@ -574,7 +570,7 @@ static int snp_cpuid(struct ghcb *ghcb, struct es_em_ctxt *ctxt, struct cpuid_le
* page yet, so it only supports the MSR based communication with the
* hypervisor and only the CPUID exit-code.
*/
-void __init do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
+void __head do_vc_no_ghcb(struct pt_regs *regs, unsigned long exit_code)
{
unsigned int subfn = lower_bits(regs->cx, 32);
unsigned int fn = lower_bits(regs->ax, 32);
@@ -1025,7 +1021,8 @@ struct cc_setup_data {
* Search for a Confidential Computing blob passed in as a setup_data entry
* via the Linux Boot Protocol.
*/
-static struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
+static __head
+struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
{
struct cc_setup_data *sd = NULL;
struct setup_data *hdr;
@@ -1052,7 +1049,7 @@ static struct cc_blob_sev_info *find_cc_blob_setup_data(struct boot_params *bp)
* mapping needs to be updated in sync with all the changes to virtual memory
* layout and related mapping facilities throughout the boot process.
*/
-static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
+static void __head setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
{
const struct snp_cpuid_table *cpuid_table_fw, *cpuid_table;
int i;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1ef7ae806a01..33c14aa1f06c 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -25,6 +25,7 @@
#include <linux/psp-sev.h>
#include <uapi/linux/sev-guest.h>

+#include <asm/init.h>
#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
#include <asm/sev.h>
@@ -682,8 +683,9 @@ static u64 __init get_jump_table_addr(void)
return ret;
}

-static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
- unsigned long npages, enum psc_op op)
+static void __head
+early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -739,7 +741,7 @@ static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
}

-void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
+void __head early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr,
unsigned long npages)
{
/*
@@ -2062,7 +2064,7 @@ bool __init handle_vc_boot_ghcb(struct pt_regs *regs)
*
* Scan for the blob in that order.
*/
-static __init struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
+static __head struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;

@@ -2088,7 +2090,7 @@ static __init struct cc_blob_sev_info *find_cc_blob(struct boot_params *bp)
return cc_info;
}

-bool __init snp_init(struct boot_params *bp)
+bool __head snp_init(struct boot_params *bp)
{
struct cc_blob_sev_info *cc_info;

@@ -2110,7 +2112,7 @@ bool __init snp_init(struct boot_params *bp)
return true;
}

-void __init __noreturn snp_abort(void)
+void __head __noreturn snp_abort(void)
{
sev_es_terminate(SEV_TERM_SET_GEN, GHCB_SNP_UNSUPPORTED);
}
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 12:47:21

by Ard Biesheuvel

Subject: [PATCH v4 09/11] x86/sme: Move early SME kernel encryption handling into .head.text

From: Ard Biesheuvel <[email protected]>

The .head.text section is the initial primary entrypoint of the core
kernel, and is entered with the CPU executing from a 1:1 mapping of
memory. Such code must never access global variables using absolute
references, as these are based on the kernel virtual mapping which is
not active yet at this point.

Given that the SME startup code is also called from this early execution
context, move it into .head.text as well. This will allow more thorough
build time checks in the future to ensure that early startup code only
uses RIP-relative references to global variables.

Also replace some occurrences of __pa_symbol() [which relies on the
compiler generating an absolute reference, which is not guaranteed] and
an open coded RIP-relative access with RIP_REL_REF().
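
The reason RIP_REL_REF() can stand in for __pa_symbol() here is that
this code only ever runs via the 1:1 mapping, where the RIP-relative
address of a symbol is its physical address. A minimal sketch
(sme_workarea chosen purely as an example):

  /* running from the 1:1 mapping, so &RIP_REL_REF(x) == __pa(x) */
  unsigned long pa = (unsigned long)&RIP_REL_REF(sme_workarea);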

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/include/asm/mem_encrypt.h | 8 ++--
arch/x86/mm/mem_encrypt_identity.c | 42 ++++++++------------
2 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
index b31eb9fd5954..f922b682b9b4 100644
--- a/arch/x86/include/asm/mem_encrypt.h
+++ b/arch/x86/include/asm/mem_encrypt.h
@@ -47,8 +47,8 @@ void __init sme_unmap_bootdata(char *real_mode_data);

void __init sme_early_init(void);

-void __init sme_encrypt_kernel(struct boot_params *bp);
-void __init sme_enable(struct boot_params *bp);
+void sme_encrypt_kernel(struct boot_params *bp);
+void sme_enable(struct boot_params *bp);

int __init early_set_memory_decrypted(unsigned long vaddr, unsigned long size);
int __init early_set_memory_encrypted(unsigned long vaddr, unsigned long size);
@@ -81,8 +81,8 @@ static inline void __init sme_unmap_bootdata(char *real_mode_data) { }

static inline void __init sme_early_init(void) { }

-static inline void __init sme_encrypt_kernel(struct boot_params *bp) { }
-static inline void __init sme_enable(struct boot_params *bp) { }
+static inline void sme_encrypt_kernel(struct boot_params *bp) { }
+static inline void sme_enable(struct boot_params *bp) { }

static inline void sev_es_init_vc_handling(void) { }

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index d210c7fc8fa2..64b5005d49e5 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -41,6 +41,7 @@
#include <linux/mem_encrypt.h>
#include <linux/cc_platform.h>

+#include <asm/init.h>
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/coco.h>
@@ -94,7 +95,7 @@ struct sme_populate_pgd_data {
*/
static char sme_workarea[2 * PMD_SIZE] __section(".init.scratch");

-static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
+static void __head sme_clear_pgd(struct sme_populate_pgd_data *ppd)
{
unsigned long pgd_start, pgd_end, pgd_size;
pgd_t *pgd_p;
@@ -109,7 +110,7 @@ static void __init sme_clear_pgd(struct sme_populate_pgd_data *ppd)
memset(pgd_p, 0, pgd_size);
}

-static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
+static pud_t __head *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -146,7 +147,7 @@ static pud_t __init *sme_prepare_pgd(struct sme_populate_pgd_data *ppd)
return pud;
}

-static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
+static void __head sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -162,7 +163,7 @@ static void __init sme_populate_pgd_large(struct sme_populate_pgd_data *ppd)
set_pmd(pmd, __pmd(ppd->paddr | ppd->pmd_flags));
}

-static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
+static void __head sme_populate_pgd(struct sme_populate_pgd_data *ppd)
{
pud_t *pud;
pmd_t *pmd;
@@ -188,7 +189,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
}

-static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
+static void __head __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd_large(ppd);
@@ -198,7 +199,7 @@ static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
}
}

-static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
+static void __head __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
{
while (ppd->vaddr < ppd->vaddr_end) {
sme_populate_pgd(ppd);
@@ -208,7 +209,7 @@ static void __init __sme_map_range_pte(struct sme_populate_pgd_data *ppd)
}
}

-static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
+static void __head __sme_map_range(struct sme_populate_pgd_data *ppd,
pmdval_t pmd_flags, pteval_t pte_flags)
{
unsigned long vaddr_end;
@@ -232,22 +233,22 @@ static void __init __sme_map_range(struct sme_populate_pgd_data *ppd,
__sme_map_range_pte(ppd);
}

-static void __init sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_encrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_ENC, PTE_FLAGS_ENC);
}

-static void __init sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_decrypted(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC, PTE_FLAGS_DEC);
}

-static void __init sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
+static void __head sme_map_range_decrypted_wp(struct sme_populate_pgd_data *ppd)
{
__sme_map_range(ppd, PMD_FLAGS_DEC_WP, PTE_FLAGS_DEC_WP);
}

-static unsigned long __init sme_pgtable_calc(unsigned long len)
+static unsigned long __head sme_pgtable_calc(unsigned long len)
{
unsigned long entries = 0, tables = 0;

@@ -284,7 +285,7 @@ static unsigned long __init sme_pgtable_calc(unsigned long len)
return entries + tables;
}

-void __init sme_encrypt_kernel(struct boot_params *bp)
+void __head sme_encrypt_kernel(struct boot_params *bp)
{
unsigned long workarea_start, workarea_end, workarea_len;
unsigned long execute_start, execute_end, execute_len;
@@ -319,9 +320,8 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
* memory from being cached.
*/

- /* Physical addresses gives us the identity mapped virtual addresses */
- kernel_start = __pa_symbol(_text);
- kernel_end = ALIGN(__pa_symbol(_end), PMD_SIZE);
+ kernel_start = (unsigned long)RIP_REL_REF(_text);
+ kernel_end = ALIGN((unsigned long)RIP_REL_REF(_end), PMD_SIZE);
kernel_len = kernel_end - kernel_start;

initrd_start = 0;
@@ -338,14 +338,6 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
}
#endif

- /*
- * We're running identity mapped, so we must obtain the address to the
- * SME encryption workarea using rip-relative addressing.
- */
- asm ("lea sme_workarea(%%rip), %0"
- : "=r" (workarea_start)
- : "p" (sme_workarea));
-
/*
* Calculate required number of workarea bytes needed:
* executable encryption area size:
@@ -355,7 +347,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
* pagetable structures for the encryption of the kernel
* pagetable structures for workarea (in case not currently mapped)
*/
- execute_start = workarea_start;
+ execute_start = workarea_start = (unsigned long)RIP_REL_REF(sme_workarea);
execute_end = execute_start + (PAGE_SIZE * 2) + PMD_SIZE;
execute_len = execute_end - execute_start;

@@ -498,7 +490,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
native_write_cr3(__native_read_cr3());
}

-void __init sme_enable(struct boot_params *bp)
+void __head sme_enable(struct boot_params *bp)
{
unsigned int eax, ebx, ecx, edx;
unsigned long feature_mask;
--
2.43.0.687.g38aa6559b0-goog


2024-02-13 20:06:34

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] x86/startup_64: Simplify global variable accesses in GDT/IDT programming

On Tue, Feb 13, 2024 at 01:41:45PM +0100, Ard Biesheuvel wrote:
> @@ -632,5 +616,5 @@ void __head startup_64_setup_env(unsigned long physbase)
> "movl %%eax, %%ss\n"
> "movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");
>
> - startup_64_load_idt(physbase);
> + startup_64_load_idt(&RIP_REL_REF(vc_no_ghcb));

It took me a while to figure out that even if we pass in one of the two
GHCB handler pointers, we only set it if CONFIG_AMD_MEM_ENCRYPT.

I think this ontop of yours is a bit more readable as it makes it
perfectly clear *when* the pointer is valid.

Yeah, if handler is set, we set it for the X86_TRAP_VC vector
unconditionally but that can be changed later, if really needed.

But this way it is clear by having the callers select the pointer. I.e.,
it is a more common coding pattern this way, I'd say.

Thx.

---

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 9d7f12829f2d..e39114c348e2 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -582,8 +582,8 @@ static void __head startup_64_load_idt(void *handler)
};
gate_desc *idt = (gate_desc *)desc.address;

- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
- /* VMM Communication Exception */
+ /* @handler is set only for a VMM Communication Exception */
+ if (handler)
set_bringup_idt_handler(idt, X86_TRAP_VC, handler);

native_load_idt(&desc);
@@ -592,10 +592,14 @@ static void __head startup_64_load_idt(void *handler)
/* This is used when running on kernel addresses */
void early_setup_idt(void)
{
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+ void *handler = NULL;
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
setup_ghcb();
+ handler = vc_boot_ghcb;
+ }

- startup_64_load_idt(vc_boot_ghcb);
+ startup_64_load_idt(handler);
}

/*
@@ -603,6 +607,8 @@ void early_setup_idt(void)
*/
void __head startup_64_setup_gdt_idt(void)
{
+ void *handler = NULL;
+
struct desc_ptr startup_gdt_descr = {
.address = (unsigned long)&RIP_REL_REF(startup_gdt),
.size = sizeof(startup_gdt) - 1,
@@ -616,5 +622,8 @@ void __head startup_64_setup_gdt_idt(void)
"movl %%eax, %%ss\n"
"movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");

- startup_64_load_idt(&RIP_REL_REF(vc_no_ghcb));
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+ handler = &RIP_REL_REF(vc_no_ghcb);
+
+ startup_64_load_idt(handler);
}

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-13 22:02:10

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] x86/startup_64: Simplify global variable accesses in GDT/IDT programming

On Tue, 13 Feb 2024 at 21:06, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Feb 13, 2024 at 01:41:45PM +0100, Ard Biesheuvel wrote:
> > @@ -632,5 +616,5 @@ void __head startup_64_setup_env(unsigned long physbase)
> > "movl %%eax, %%ss\n"
> > "movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");
> >
> > - startup_64_load_idt(physbase);
> > + startup_64_load_idt(&RIP_REL_REF(vc_no_ghcb));
>
> It took me a while to figure out that even if we pass in one of the two
> GHCB handler pointers, we only set it if CONFIG_AMD_MEM_ENCRYPT.
>
> I think this ontop of yours is a bit more readable as it makes it
> perfectly clear *when* the pointer is valid.
>

Looks fine to me.

> Yeah, if handler is set, we set it for the X86_TRAP_VC vector
> unconditionally but that can be changed later, if really needed.
>

We might call the parameter 'vc_handler' to make this clearer.

2024-02-14 07:29:16

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] x86/startup_64: Simplify global variable accesses in GDT/IDT programming

On Tue, 13 Feb 2024 at 22:53, Ard Biesheuvel <[email protected]> wrote:
>
> On Tue, 13 Feb 2024 at 21:06, Borislav Petkov <[email protected]> wrote:
> >
> > On Tue, Feb 13, 2024 at 01:41:45PM +0100, Ard Biesheuvel wrote:
> > > @@ -632,5 +616,5 @@ void __head startup_64_setup_env(unsigned long physbase)
> > > "movl %%eax, %%ss\n"
> > > "movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");
> > >
> > > - startup_64_load_idt(physbase);
> > > + startup_64_load_idt(&RIP_REL_REF(vc_no_ghcb));
> >
> > It took me a while to figure out that even if we pass in one of the two
> > GHCB handler pointers, we only set it if CONFIG_AMD_MEM_ENCRYPT.
> >
> > I think this ontop of yours is a bit more readable as it makes it
> > perfectly clear *when* the pointer is valid.
> >
>
> Looks fine to me.
>
> > Yeah, if handler is set, we set it for the X86_TRAP_VC vector
> > unconditionally but that can be changed later, if really needed.
> >
>
> We might call the parameter 'vc_handler' to make this clearer.

Actually, we can merge set_bringup_idt_handler() into its caller as well:

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index aee99cfda4eb..804ba9a2214f 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -501,30 +501,22 @@ void __init __noreturn x86_64_start_reservations(char *real_mode_data)
*/
static gate_desc bringup_idt_table[NUM_EXCEPTION_VECTORS] __page_aligned_data;

-static void __head set_bringup_idt_handler(gate_desc *idt, int n, void *handler)
-{
-#ifdef CONFIG_AMD_MEM_ENCRYPT
- struct idt_data data;
- gate_desc desc;
-
- init_idt_data(&data, n, handler);
- idt_init_desc(&desc, &data);
- native_write_idt_entry(idt, n, &desc);
-#endif
-}
-
/* This may run while still in the direct mapping */
-static void __head startup_64_load_idt(void *handler)
+static void __head startup_64_load_idt(void *vc_handler)
{
struct desc_ptr desc = {
.address = (unsigned long)&RIP_REL_REF(bringup_idt_table),
.size = sizeof(bringup_idt_table) - 1,
};
- gate_desc *idt = (gate_desc *)desc.address;
+ struct idt_data data;
+ gate_desc idt_desc;

- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
- /* VMM Communication Exception */
- set_bringup_idt_handler(idt, X86_TRAP_VC, handler);
+ if (vc_handler) {
+ init_idt_data(&data, X86_TRAP_VC, vc_handler);
+ idt_init_desc(&idt_desc, &data);
+ native_write_idt_entry((gate_desc *)desc.address,
+ X86_TRAP_VC, &idt_desc);
+ }

native_load_idt(&desc);
}


(^^^ plus your changes boot tested on SEV-SNP)

2024-02-14 15:44:35

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH v4 11/11] x86/startup_64: Drop global variables keeping track of LA57 state

On Tue, Feb 13, 2024 at 7:42 AM Ard Biesheuvel <[email protected]> wrote:
>
> From: Ard Biesheuvel <[email protected]>
>
> On x86_64, the core kernel is entered in long mode, which implies that
> paging is enabled. This means that the CR4.LA57 control bit is
> guaranteed to be in sync with the number of paging levels used by the
> kernel, and there is no need to store this in a variable.
>
> There is also no need to use variables for storing the calculations of
> pgdir_shift and ptrs_per_p4d, as they are easily determined on the fly.
>
> This removes the need for two different sources of truth for determining
> whether 5-level paging is in use: CR4.LA57 always reflects the actual
> state, and never changes from the point of view of the 64-bit core
> kernel. The only potential concern is the cost of CR4 accesses, which
> can be mitigated using alternatives patching based on feature detection.
>
> Note that even the decompressor does not manipulate any page tables
> before updating CR4.LA57, so it can also avoid the associated global
> variables entirely. However, as it does not implement alternatives
> patching, the associated ELF sections need to be discarded.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/x86/boot/compressed/misc.h | 4 --
> arch/x86/boot/compressed/pgtable_64.c | 12 ----
> arch/x86/boot/compressed/vmlinux.lds.S | 1 +
> arch/x86/include/asm/pgtable_64_types.h | 58 ++++++++------------
> arch/x86/kernel/cpu/common.c | 2 -
> arch/x86/kernel/head64.c | 33 +----------
> arch/x86/mm/kasan_init_64.c | 3 -
> arch/x86/mm/mem_encrypt_identity.c | 9 ---
> 8 files changed, 27 insertions(+), 95 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index bc2f0f17fb90..2b15ddd0e177 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -16,9 +16,6 @@
>
> #define __NO_FORTIFY
>
> -/* cpu_feature_enabled() cannot be used this early */
> -#define USE_EARLY_PGTABLE_L5
> -
> /*
> * Boot stub deals with identity mappings, physical and virtual addresses are
> * the same, so override these defines.
> @@ -178,7 +175,6 @@ static inline int count_immovable_mem_regions(void) { return 0; }
> #endif
>
> /* ident_map_64.c */
> -extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
> extern void kernel_add_identity_map(unsigned long start, unsigned long end);
>
> /* Used by PAGE_KERN* macros: */
> diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
> index 51f957b24ba7..ae72f53f5e77 100644
> --- a/arch/x86/boot/compressed/pgtable_64.c
> +++ b/arch/x86/boot/compressed/pgtable_64.c
> @@ -9,13 +9,6 @@
> #define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */
> #define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */
>
> -#ifdef CONFIG_X86_5LEVEL
> -/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
> -unsigned int __section(".data") __pgtable_l5_enabled;
> -unsigned int __section(".data") pgdir_shift = 39;
> -unsigned int __section(".data") ptrs_per_p4d = 1;
> -#endif
> -
> /* Buffer to preserve trampoline memory */
> static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
>
> @@ -125,11 +118,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
> native_cpuid_eax(0) >= 7 &&
> (native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
> l5_required = true;
> -
> - /* Initialize variables for 5-level paging */
> - __pgtable_l5_enabled = 1;
> - pgdir_shift = 48;
> - ptrs_per_p4d = 512;
> }
>
> /*
> diff --git a/arch/x86/boot/compressed/vmlinux.lds.S b/arch/x86/boot/compressed/vmlinux.lds.S
> index 083ec6d7722a..06358bb067fe 100644
> --- a/arch/x86/boot/compressed/vmlinux.lds.S
> +++ b/arch/x86/boot/compressed/vmlinux.lds.S
> @@ -81,6 +81,7 @@ SECTIONS
> *(.dynamic) *(.dynsym) *(.dynstr) *(.dynbss)
> *(.hash) *(.gnu.hash)
> *(.note.*)
> + *(.altinstructions .altinstr_replacement)
> }
>
> .got.plt (INFO) : {
> diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> index 38b54b992f32..6a57bfdff52b 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -6,7 +6,10 @@
>
> #ifndef __ASSEMBLY__
> #include <linux/types.h>
> +#include <asm/alternative.h>
> +#include <asm/cpufeatures.h>
> #include <asm/kaslr.h>
> +#include <asm/processor-flags.h>
>
> /*
> * These are used to make use of C type-checking..
> @@ -21,63 +24,50 @@ typedef unsigned long pgprotval_t;
> typedef struct { pteval_t pte; } pte_t;
> typedef struct { pmdval_t pmd; } pmd_t;
>
> -#ifdef CONFIG_X86_5LEVEL
> -extern unsigned int __pgtable_l5_enabled;
> -
> -#ifdef USE_EARLY_PGTABLE_L5
> -/*
> - * cpu_feature_enabled() is not available in early boot code.
> - * Use variable instead.
> - */
> -static inline bool pgtable_l5_enabled(void)
> +static __always_inline __pure bool pgtable_l5_enabled(void)
> {
> - return __pgtable_l5_enabled;
> -}
> -#else
> -#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
> -#endif /* USE_EARLY_PGTABLE_L5 */
> + unsigned long r;
> + bool ret;
>
> -#else
> -#define pgtable_l5_enabled() 0
> -#endif /* CONFIG_X86_5LEVEL */
> + if (!IS_ENABLED(CONFIG_X86_5LEVEL))
> + return false;
> +
> + asm(ALTERNATIVE_TERNARY(
> + "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> + %P[feat], "stc", "clc")
> + : [reg] "=&r" (r), CC_OUT(c) (ret)
> + : [feat] "i" (X86_FEATURE_LA57),
> + [la57] "i" (X86_CR4_LA57_BIT)
> + : "cc");

This should be more like _static_cpu_has(), where the runtime test is
out of line in a discardable section, and the inline part is just a
JMP or NOP.
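
(To make that concrete - a rough sketch of how _static_cpu_has() is
structured, from memory:

    call site:         jmp 6f / nop      <- rewritten by alternatives
    out of line:   6:  testb $bit, boot_cpu_data.x86_capability+byte
                       jnz yes ; jmp no  <- only reached before patching

so once patching has run, each call site costs a single jmp or nop and
the out-of-line runtime test can be discarded.)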

Brian Gerst

2024-02-14 16:10:04

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 11/11] x86/startup_64: Drop global variables keeping track of LA57 state

On Wed, 14 Feb 2024 at 16:24, Brian Gerst <[email protected]> wrote:
>
> On Tue, Feb 13, 2024 at 7:42 AM Ard Biesheuvel <[email protected]> wrote:
> >
> > [...]
> > diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> > index 38b54b992f32..6a57bfdff52b 100644
> > --- a/arch/x86/include/asm/pgtable_64_types.h
> > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > @@ -6,7 +6,10 @@
> >
> > #ifndef __ASSEMBLY__
> > #include <linux/types.h>
> > +#include <asm/alternative.h>
> > +#include <asm/cpufeatures.h>
> > #include <asm/kaslr.h>
> > +#include <asm/processor-flags.h>
> >
> > /*
> > * These are used to make use of C type-checking..
> > @@ -21,63 +24,50 @@ typedef unsigned long pgprotval_t;
> > typedef struct { pteval_t pte; } pte_t;
> > typedef struct { pmdval_t pmd; } pmd_t;
> >
> > -#ifdef CONFIG_X86_5LEVEL
> > -extern unsigned int __pgtable_l5_enabled;
> > -
> > -#ifdef USE_EARLY_PGTABLE_L5
> > -/*
> > - * cpu_feature_enabled() is not available in early boot code.
> > - * Use variable instead.
> > - */
> > -static inline bool pgtable_l5_enabled(void)
> > +static __always_inline __pure bool pgtable_l5_enabled(void)
> > {
> > - return __pgtable_l5_enabled;
> > -}
> > -#else
> > -#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
> > -#endif /* USE_EARLY_PGTABLE_L5 */
> > + unsigned long r;
> > + bool ret;
> >
> > -#else
> > -#define pgtable_l5_enabled() 0
> > -#endif /* CONFIG_X86_5LEVEL */
> > + if (!IS_ENABLED(CONFIG_X86_5LEVEL))
> > + return false;
> > +
> > + asm(ALTERNATIVE_TERNARY(
> > + "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> > + %P[feat], "stc", "clc")
> > + : [reg] "=&r" (r), CC_OUT(c) (ret)
> > + : [feat] "i" (X86_FEATURE_LA57),
> > + [la57] "i" (X86_CR4_LA57_BIT)
> > + : "cc");
>
> This should be more like _static_cpu_has(), where the runtime test is
> out of line in a discardable section, and the inline part is just a
> JMP or NOP.
>

Why exactly? It matters very little in terms of space: a cross-section
jump is 5 bytes, and movq+btl is 7 bytes.

If you are referring to the use of the C flag: this way, it is left up
to the compiler to decide whether a branch or a conditional move is
more suitable, rather than forcing the use of a branch in one of the
two cases.
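
As a concrete illustration of what this buys (a sketch only, using the
constant values from the dropped globals): callers along the lines of

    #define pgdir_shift   (pgtable_l5_enabled() ? 48 : 39)
    #define ptrs_per_p4d  (pgtable_l5_enabled() ? 512 : 1)

can then be lowered to either a conditional move between the two
constants or a branch, whichever the compiler prefers at each site.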

2024-02-14 20:26:50

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH v4 11/11] x86/startup_64: Drop global variables keeping track of LA57 state

On Wed, Feb 14, 2024 at 10:45 AM Ard Biesheuvel <[email protected]> wrote:
>
> On Wed, 14 Feb 2024 at 16:24, Brian Gerst <[email protected]> wrote:
> >
> > On Tue, Feb 13, 2024 at 7:42 AM Ard Biesheuvel <[email protected]> wrote:
> > >
> > > [...]
> > > diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> > > index 38b54b992f32..6a57bfdff52b 100644
> > > --- a/arch/x86/include/asm/pgtable_64_types.h
> > > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > > @@ -6,7 +6,10 @@
> > >
> > > #ifndef __ASSEMBLY__
> > > #include <linux/types.h>
> > > +#include <asm/alternative.h>
> > > +#include <asm/cpufeatures.h>
> > > #include <asm/kaslr.h>
> > > +#include <asm/processor-flags.h>
> > >
> > > /*
> > > * These are used to make use of C type-checking..
> > > @@ -21,63 +24,50 @@ typedef unsigned long pgprotval_t;
> > > typedef struct { pteval_t pte; } pte_t;
> > > typedef struct { pmdval_t pmd; } pmd_t;
> > >
> > > -#ifdef CONFIG_X86_5LEVEL
> > > -extern unsigned int __pgtable_l5_enabled;
> > > -
> > > -#ifdef USE_EARLY_PGTABLE_L5
> > > -/*
> > > - * cpu_feature_enabled() is not available in early boot code.
> > > - * Use variable instead.
> > > - */
> > > -static inline bool pgtable_l5_enabled(void)
> > > +static __always_inline __pure bool pgtable_l5_enabled(void)
> > > {
> > > - return __pgtable_l5_enabled;
> > > -}
> > > -#else
> > > -#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
> > > -#endif /* USE_EARLY_PGTABLE_L5 */
> > > + unsigned long r;
> > > + bool ret;
> > >
> > > -#else
> > > -#define pgtable_l5_enabled() 0
> > > -#endif /* CONFIG_X86_5LEVEL */
> > > + if (!IS_ENABLED(CONFIG_X86_5LEVEL))
> > > + return false;
> > > +
> > > + asm(ALTERNATIVE_TERNARY(
> > > + "movq %%cr4, %[reg] \n\t btl %[la57], %k[reg]" CC_SET(c),
> > > + %P[feat], "stc", "clc")
> > > + : [reg] "=&r" (r), CC_OUT(c) (ret)
> > > + : [feat] "i" (X86_FEATURE_LA57),
> > > + [la57] "i" (X86_CR4_LA57_BIT)
> > > + : "cc");
> >
> > This should be more like _static_cpu_has(), where the runtime test is
> > out of line in a discardable section, and the inline part is just a
> > JMP or NOP.
> >
>
> Why exactly? It matters very little in terms of space: a cross-section
> jump is 5 bytes, and movq+btl is 7 bytes.
>
> If you are referring to the use of the C flag: this way, it is left up
> to the compiler to decide whether a branch or a conditional move is
> more suitable, rather than forcing the use of a branch in one of the
> two cases.

You're probably right in that many uses of pgtable_l5_enabled() are
choosing between two constants, and static_cpu_has() does not handle
that case very efficiently. Something like static_cpu_choose(feature,
yes_val, no_val) would be a possible idea to explore.
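
Interface-wise, perhaps something like this (purely hypothetical
sketch, no such helper exists today):

    /* patched by alternatives to yield yes_val or no_val directly */
    unsigned int shift = static_cpu_choose(X86_FEATURE_LA57, 48, 39);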

Brian Gerst

2024-02-15 14:20:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 01/11] x86/startup_64: Simplify global variable accesses in GDT/IDT programming

On Wed, Feb 14, 2024 at 08:28:41AM +0100, Ard Biesheuvel wrote:
> Actually, we can merge set_bringup_idt_handler() into its caller as well:

Yap, here's the final version I have here and yes, it boots fine as
a SNP guest:

From: Ard Biesheuvel <[email protected]>
Date: Tue, 13 Feb 2024 13:41:45 +0100
Subject: [PATCH] x86/startup_64: Simplify global variable accesses in GDT/IDT
programming

There are two code paths in the startup code to program an IDT: one that
runs from the 1:1 mapping and one that runs from the virtual kernel
mapping. Currently, these are strictly separate because fixup_pointer()
is used on the 1:1 path, which will produce the wrong value when used
while executing from the virtual kernel mapping.

Switch to RIP_REL_REF() so that the two code paths can be merged. Also,
move the GDT and IDT descriptors to the stack so that they can be
referenced directly, rather than via RIP_REL_REF().

Rename startup_64_setup_env() to startup_64_setup_gdt_idt() while at it,
to make the call from assembler self-documenting.

Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/setup.h | 2 +-
arch/x86/kernel/head64.c | 75 +++++++++++++++---------------------
arch/x86/kernel/head_64.S | 4 +-
3 files changed, 32 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 5c83729c8e71..e61e68d71cba 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -48,7 +48,7 @@ extern unsigned long saved_video_mode;
extern void reserve_standard_io_resources(void);
extern void i386_reserve_resources(void);
extern unsigned long __startup_64(unsigned long physaddr, struct boot_params *bp);
-extern void startup_64_setup_env(unsigned long physbase);
+extern void startup_64_setup_gdt_idt(void);
extern void early_setup_idt(void);
extern void __init do_early_exception(struct pt_regs *regs, int trapnr);

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index dc0956067944..cdff748bf5cb 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -22,6 +22,7 @@
#include <linux/cc_platform.h>
#include <linux/pgtable.h>

+#include <asm/asm.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
@@ -76,15 +77,6 @@ static struct desc_struct startup_gdt[GDT_ENTRIES] __initdata = {
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
};

-/*
- * Address needs to be set at runtime because it references the startup_gdt
- * while the kernel still uses a direct mapping.
- */
-static struct desc_ptr startup_gdt_descr __initdata = {
- .size = sizeof(startup_gdt)-1,
- .address = 0,
-};
-
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
return ptr - (void *)_text + (void *)physaddr;
@@ -569,62 +561,52 @@ void __init __noreturn x86_64_start_reservations(char *real_mode_data)
*/
static gate_desc bringup_idt_table[NUM_EXCEPTION_VECTORS] __page_aligned_data;

-static struct desc_ptr bringup_idt_descr = {
- .size = (NUM_EXCEPTION_VECTORS * sizeof(gate_desc)) - 1,
- .address = 0, /* Set at runtime */
-};
-
-static void set_bringup_idt_handler(gate_desc *idt, int n, void *handler)
+/* This may run while still in the direct mapping */
+static void __head startup_64_load_idt(void *vc_handler)
{
-#ifdef CONFIG_AMD_MEM_ENCRYPT
+ struct desc_ptr desc = {
+ .address = (unsigned long)&RIP_REL_REF(bringup_idt_table),
+ .size = sizeof(bringup_idt_table) - 1,
+ };
struct idt_data data;
- gate_desc desc;
-
- init_idt_data(&data, n, handler);
- idt_init_desc(&desc, &data);
- native_write_idt_entry(idt, n, &desc);
-#endif
-}
+ gate_desc idt_desc;

-/* This runs while still in the direct mapping */
-static void __head startup_64_load_idt(unsigned long physbase)
-{
- struct desc_ptr *desc = fixup_pointer(&bringup_idt_descr, physbase);
- gate_desc *idt = fixup_pointer(bringup_idt_table, physbase);
-
-
- if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
- void *handler;
-
- /* VMM Communication Exception */
- handler = fixup_pointer(vc_no_ghcb, physbase);
- set_bringup_idt_handler(idt, X86_TRAP_VC, handler);
+ /* @vc_handler is set only for a VMM Communication Exception */
+ if (vc_handler) {
+ init_idt_data(&data, X86_TRAP_VC, vc_handler);
+ idt_init_desc(&idt_desc, &data);
+ native_write_idt_entry((gate_desc *)desc.address, X86_TRAP_VC, &idt_desc);
}

- desc->address = (unsigned long)idt;
- native_load_idt(desc);
+ native_load_idt(&desc);
}

/* This is used when running on kernel addresses */
void early_setup_idt(void)
{
- /* VMM Communication Exception */
+ void *handler = NULL;
+
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
setup_ghcb();
- set_bringup_idt_handler(bringup_idt_table, X86_TRAP_VC, vc_boot_ghcb);
+ handler = vc_boot_ghcb;
}

- bringup_idt_descr.address = (unsigned long)bringup_idt_table;
- native_load_idt(&bringup_idt_descr);
+ startup_64_load_idt(handler);
}

/*
* Setup boot CPU state needed before kernel switches to virtual addresses.
*/
-void __head startup_64_setup_env(unsigned long physbase)
+void __head startup_64_setup_gdt_idt(void)
{
+ void *handler = NULL;
+
+ struct desc_ptr startup_gdt_descr = {
+ .address = (unsigned long)&RIP_REL_REF(startup_gdt),
+ .size = sizeof(startup_gdt) - 1,
+ };
+
/* Load GDT */
- startup_gdt_descr.address = (unsigned long)fixup_pointer(startup_gdt, physbase);
native_load_gdt(&startup_gdt_descr);

/* New GDT is live - reload data segment registers */
@@ -632,5 +614,8 @@ void __head startup_64_setup_env(unsigned long physbase)
"movl %%eax, %%ss\n"
"movl %%eax, %%es\n" : : "a"(__KERNEL_DS) : "memory");

- startup_64_load_idt(physbase);
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+ handler = &RIP_REL_REF(vc_no_ghcb);
+
+ startup_64_load_idt(handler);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index a8eaecbd5c81..bcbebab2cc03 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -68,8 +68,6 @@ SYM_CODE_START_NOALIGN(startup_64)
/* Set up the stack for verify_cpu() */
leaq (__end_init_task - PTREGS_SIZE)(%rip), %rsp

- leaq _text(%rip), %rdi
-
/* Setup GSBASE to allow stack canary access for C code */
movl $MSR_GS_BASE, %ecx
leaq INIT_PER_CPU_VAR(fixed_percpu_data)(%rip), %rdx
@@ -77,7 +75,7 @@ SYM_CODE_START_NOALIGN(startup_64)
shrq $32, %rdx
wrmsr

- call startup_64_setup_env
+ call startup_64_setup_gdt_idt

/* Now switch to __KERNEL_CS so IRET works reliably */
pushq $__KERNEL_CS
--
2.43.0

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-17 12:51:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 02/11] x86/startup_64: Replace pointer fixups with RIP-relative references

On Tue, Feb 13, 2024 at 01:41:46PM +0100, Ard Biesheuvel wrote:
> @@ -201,25 +201,19 @@ unsigned long __head __startup_64(unsigned long physaddr,
> load_delta += sme_get_me_mask();
>
> /* Fixup the physical addresses in the page table */
> -
> - pgd = fixup_pointer(early_top_pgt, physaddr);
> - p = pgd + pgd_index(__START_KERNEL_map);
> - if (la57)
> - *p = (unsigned long)level4_kernel_pgt;
> - else
> - *p = (unsigned long)level3_kernel_pgt;
> - *p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
> -
> if (la57) {
> - p4d = fixup_pointer(level4_kernel_pgt, physaddr);
> - p4d[511] += load_delta;
> + p4d = (p4dval_t *)&RIP_REL_REF(level4_kernel_pgt);
> + p4d[MAX_PTRS_PER_P4D - 1] += load_delta;
> }
>
> - pud = fixup_pointer(level3_kernel_pgt, physaddr);
> - pud[510] += load_delta;
> - pud[511] += load_delta;
> + pud = &RIP_REL_REF(level3_kernel_pgt)->pud;
> + pud[PTRS_PER_PUD - 2] += load_delta;
> + pud[PTRS_PER_PUD - 1] += load_delta;
> +
> + pgd = &RIP_REL_REF(early_top_pgt)->pgd;

Let's do the pgd assignment above, where it was, so that we have the
natural order of pgd -> p4d -> pud -> pmd etc. manipulations.

> + pgd[PTRS_PER_PGD - 1] = (pgdval_t)(la57 ? p4d : pud) | _PAGE_TABLE_NOENC;

I see what you mean with pgd_index(__START_KERNEL_map) always being 511
but this:

pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)(la57 ? p4d : pud) | _PAGE_TABLE_NOENC;

says exactly what gets mapped there in the pagetable while

PTRS_PER_PGD - 1

makes me wonder what's that last pud supposed to map.
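
(For the record, the arithmetic behind the two being equal: with
__START_KERNEL_map = 0xffffffff80000000,

    (0xffffffff80000000 >> 39) & 511 = 0x1ffffff & 0x1ff = 511  (4-level)
    (0xffffffff80000000 >> 48) & 511 = 0xffff & 0x1ff = 511     (5-level)

so pgd_index(__START_KERNEL_map) is PTRS_PER_PGD - 1 with either paging
mode.)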

Other than that, my gut feeling right now is, this would need extensive
testing so that we make sure there's no fallout from it.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-17 13:58:52

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 02/11] x86/startup_64: Replace pointer fixups with RIP-relative references

On Sat, 17 Feb 2024 at 13:51, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Feb 13, 2024 at 01:41:46PM +0100, Ard Biesheuvel wrote:
> > @@ -201,25 +201,19 @@ unsigned long __head __startup_64(unsigned long physaddr,
> > load_delta += sme_get_me_mask();
> >
> > /* Fixup the physical addresses in the page table */
> > -
> > - pgd = fixup_pointer(early_top_pgt, physaddr);
> > - p = pgd + pgd_index(__START_KERNEL_map);
> > - if (la57)
> > - *p = (unsigned long)level4_kernel_pgt;
> > - else
> > - *p = (unsigned long)level3_kernel_pgt;
> > - *p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
> > -
> > if (la57) {
> > - p4d = fixup_pointer(level4_kernel_pgt, physaddr);
> > - p4d[511] += load_delta;
> > + p4d = (p4dval_t *)&RIP_REL_REF(level4_kernel_pgt);
> > + p4d[MAX_PTRS_PER_P4D - 1] += load_delta;
> > }
> >
> > - pud = fixup_pointer(level3_kernel_pgt, physaddr);
> > - pud[510] += load_delta;
> > - pud[511] += load_delta;
> > + pud = &RIP_REL_REF(level3_kernel_pgt)->pud;
> > + pud[PTRS_PER_PUD - 2] += load_delta;
> > + pud[PTRS_PER_PUD - 1] += load_delta;
> > +
> > + pgd = &RIP_REL_REF(early_top_pgt)->pgd;
>
> Let's do the pgd assignment above, where it was, so that we have the
> natural order of pgd -> p4d -> pud -> pmd etc. manipulations.
>

pud and p4d need to be assigned first, unless we want to keep taking
the addresses of level4_kernel_pgt and level3_kernel_pgt twice as
before.

> > + pgd[PTRS_PER_PGD - 1] = (pgdval_t)(la57 ? p4d : pud) | _PAGE_TABLE_NOENC;
>
> I see what you mean with pgd_index(__START_KERNEL_map) always being 511
> but this:
>
> pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)(la57 ? p4d : pud) | _PAGE_TABLE_NOENC;
>
> says exactly what gets mapped there in the pagetable while
>
> PTRS_PER_PGD - 1
>
> makes me wonder what's that last pud supposed to map.
>

Fair enough. But the same applies to p4d[] and pud[].

> Other than that, my gut feeling right now is, this would need extensive
> testing so that we make sure there's no fallout from it.
>

More testing is always good, but I am not particularly nervous about
these changes.

I could split this up into 3+ patches so we could bisect any resulting
issues more effectively.

2024-02-17 16:10:48

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 02/11] x86/startup_64: Replace pointer fixups with RIP-relative references

On Sat, 17 Feb 2024 at 14:58, Ard Biesheuvel <[email protected]> wrote:
>
> On Sat, 17 Feb 2024 at 13:51, Borislav Petkov <[email protected]> wrote:
> >
> > On Tue, Feb 13, 2024 at 01:41:46PM +0100, Ard Biesheuvel wrote:
> > > @@ -201,25 +201,19 @@ unsigned long __head __startup_64(unsigned long physaddr,
> > > load_delta += sme_get_me_mask();
> > >
> > > /* Fixup the physical addresses in the page table */
> > > -
> > > - pgd = fixup_pointer(early_top_pgt, physaddr);
> > > - p = pgd + pgd_index(__START_KERNEL_map);
> > > - if (la57)
> > > - *p = (unsigned long)level4_kernel_pgt;
> > > - else
> > > - *p = (unsigned long)level3_kernel_pgt;
> > > - *p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
> > > -
> > > if (la57) {
> > > - p4d = fixup_pointer(level4_kernel_pgt, physaddr);
> > > - p4d[511] += load_delta;
> > > + p4d = (p4dval_t *)&RIP_REL_REF(level4_kernel_pgt);
> > > + p4d[MAX_PTRS_PER_P4D - 1] += load_delta;
> > > }
> > >
> > > - pud = fixup_pointer(level3_kernel_pgt, physaddr);
> > > - pud[510] += load_delta;
> > > - pud[511] += load_delta;
> > > + pud = &RIP_REL_REF(level3_kernel_pgt)->pud;
> > > + pud[PTRS_PER_PUD - 2] += load_delta;
> > > + pud[PTRS_PER_PUD - 1] += load_delta;
> > > +
> > > + pgd = &RIP_REL_REF(early_top_pgt)->pgd;
> >
> > Let's do the pgd assignment above, where it was, so that we have the
> > natural order of pgd -> p4d -> pud -> pmd etc. manipulations.
> >
>
> pud and p4d need to be assigned first, unless we want to keep taking
> the addresses of level4_kernel_pgt and level3_kernel_pgt twice as
> before.
>
> > > + pgd[PTRS_PER_PGD - 1] = (pgdval_t)(la57 ? p4d : pud) | _PAGE_TABLE_NOENC;
> >
> > I see what you mean with pgd_index(__START_KERNEL_map) always being 511
> > but this:
> >
> > pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)(la57 ? p4d : pud) | _PAGE_TABLE_NOENC;
> >
> > says exactly what gets mapped there in the pagetable while
> >
> > PTRS_PER_PGD - 1
> >
> > makes me wonder what's that last pud supposed to map.
> >
>
> Fair enough. But the same applies to p4d[] and pud[].
>
> > Other than that, my gut feeling right now is, this would need extensive
> > testing so that we make sure there's no fallout from it.
> >
>
> More testing is always good, but I am not particularly nervous about
> these changes.
>
> I could split this up into 3+ patches so we could bisect any resulting
> issues more effectively.

Maybe this is better?

--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -165,14 +165,14 @@
* doesn't have to generate PC-relative relocations when accessing globals from
* that function. Clang actually does not generate them, which leads to
* boot-time crashes. To work around this problem, every global pointer must
- * be adjusted using fixup_pointer().
+ * be accessed using RIP_REL_REF().
*/
unsigned long __head __startup_64(unsigned long physaddr,
struct boot_params *bp)
{
pmd_t (*early_pgts)[PTRS_PER_PMD] = RIP_REL_REF(early_dynamic_pgts);
- unsigned long load_delta, *p;
unsigned long pgtable_flags;
+ unsigned long load_delta;
pgdval_t *pgd;
p4dval_t *p4d;
pudval_t *pud;
@@ -202,17 +202,14 @@ unsigned long __head __startup_64(unsigned long physaddr,

/* Fixup the physical addresses in the page table */

- pgd = fixup_pointer(early_top_pgt, physaddr);
- p = pgd + pgd_index(__START_KERNEL_map);
- if (la57)
- *p = (unsigned long)level4_kernel_pgt;
- else
- *p = (unsigned long)level3_kernel_pgt;
- *p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;
+ pgd = &RIP_REL_REF(early_top_pgt)->pgd;
+ pgd[pgd_index(__START_KERNEL_map)] += load_delta;

if (la57) {
- p4d = fixup_pointer(level4_kernel_pgt, physaddr);
- p4d[511] += load_delta;
+ p4d = (p4dval_t *)&RIP_REL_REF(level4_kernel_pgt);
+ p4d[MAX_PTRS_PER_P4D - 1] += load_delta;
+
+ pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)p4d |
_PAGE_TABLE_NOENC;
}

RIP_REL_REF(level3_kernel_pgt)[PTRS_PER_PUD - 2].pud += load_delta;
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 3cac98c61066..fb2a98c29094 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -653,7 +653,8 @@ SYM_CODE_END(vc_no_ghcb)
.balign 4

SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
- .fill 512,8,0
+ .fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
.fill PTI_USER_PGD_FILL,8,0
SYM_DATA_END(early_top_pgt)

2024-02-19 09:56:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 02/11] x86/startup_64: Replace pointer fixups with RIP-relative references

On Sat, Feb 17, 2024 at 05:10:27PM +0100, Ard Biesheuvel wrote:
> Maybe this is better?

Yap, looks better.

Btw, when you paste those diffs on top, can you pls make sure you paste
them in applicable form, so that I can apply them and look at them in
detail?

gmail mangles them:

> +
> + pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)p4d |
> _PAGE_TABLE_NOENC;

For example, linebreak here.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-19 10:01:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 02/11] x86/startup_64: Replace pointer fixups with RIP-relative references

On Sat, Feb 17, 2024 at 02:58:29PM +0100, Ard Biesheuvel wrote:
> More testing is always good, but I am not particularly nervous about
> these changes.

Perhaps, but there's a big difference between testing everything as much
as one can and *then* queueing it - vs testing a bit, not being really
nervous about the changes, and then someone reporting a snafu when the
patches are already in Linus' tree.

Means dropping everything and getting on that. And then imagine a couple
more breakages happening in parallel and needing urgent attention.

Not something you wanna deal with. Speaking from my experience, at
least.

> I could split this up into 3+ patches so we could bisect any resulting
> issues more effectively.

Yeah, splitting changes into separate bits - ala, one logical change per
patch - is always a good idea.

In this particular case, I don't mind splitting them even more, so that
it is perfectly clear what happens and people looking at those changes
don't have to go dig through the source to figure out what the change
actually looks like when applied, in order to fully grok it.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-19 10:41:35

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 03/11] x86/startup_64: Simplify CR4 handling in startup code

On Tue, Feb 13, 2024 at 01:41:47PM +0100, Ard Biesheuvel wrote:
> -#ifdef CONFIG_X86_5LEVEL
> - testb $1, __pgtable_l5_enabled(%rip)
> - jz 1f
> - orl $X86_CR4_LA57, %ecx
> -1:
> + orl $X86_CR4_MCE, %edx
> #endif
> + movq %cr4, %rcx
> + andl %edx, %ecx

Let's just state it explicitly:

/* Even if ignored in long mode, set PSE uniformly on all logical CPUs. */

> + btsl $X86_CR4_PSE_BIT, %ecx
> movq %rcx, %cr4

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-19 10:45:42

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 02/11] x86/startup_64: Replace pointer fixups with RIP-relative references

On Mon, 19 Feb 2024 at 10:56, Borislav Petkov <[email protected]> wrote:
>
> On Sat, Feb 17, 2024 at 05:10:27PM +0100, Ard Biesheuvel wrote:
> > Maybe this is better?
>
> Yap, looks better.
>
> Btw, when you paste those diffs ontop, can pls make sure you paste them
> in applicable form so that I can apply them and look at them in detail?
>
> gmail mangles them:
>
> > +
> > + pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)p4d |
> > _PAGE_TABLE_NOENC;
>
> For example, linebreak here.
>

Yeah, sorry about that.

2024-02-19 10:48:22

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 02/11] x86/startup_64: Replace pointer fixups with RIP-relative references

On Mon, 19 Feb 2024 at 11:01, Borislav Petkov <[email protected]> wrote:
>
> On Sat, Feb 17, 2024 at 02:58:29PM +0100, Ard Biesheuvel wrote:
> > More testing is always good, but I am not particularly nervous about
> > these changes.
>
> Perhaps but there's a big difference between testing everything as much
> as one can and *then* queueing it - vs testing a bit, not being really
> nervous about the changes and then someone reporting a snafu when the
> patches are already in Linus' tree.
>
> Means dropping everything and getting on that. And then imagine a couple
> more breakages happening in parallel and needing urgent attention.
>
> Not something you wanna deal with. Speaking from my experience, at
> least.
>

Not disagreeing with that.

> > I could split this up into 3+ patches so we could bisect any resulting
> > issues more effectively.
>
> Yeah, splitting changes into separate bits - ala, one logical change per
> patch - is always a good idea.
>
> In this particular case, I don't mind splitting them even more so that
> it is perfectly clear what happens and looking at those changes doesn't
> make people have to go look at the source to figure out what the change
> actually looks like applied, in order to fully grok it.
>

I split this into 5 patches for v5. The final patch in this v4 is
broken for CONFIG_X86_5LEVEL=n so I was going to have to respin
anyway. (I'll pick up the latest version of patch #1 you pasted)

2024-02-19 17:00:41

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v4 07/11] efi/libstub: Add generic support for parsing mem_encrypt=

On 2/13/24 06:41, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <[email protected]>
>
> Parse the mem_encrypt= command line parameter from the EFI stub if
> CONFIG_ARCH_HAS_MEM_ENCRYPT=y, so that it can be passed to the early
> boot code by the arch code in the stub.
>
> This avoids the need for the core kernel to do any string parsing very
> early in the boot.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++++++++
> drivers/firmware/efi/libstub/efistub.h | 2 +-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
> index bfa30625f5d0..3dc2f9aaf08d 100644
> --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> @@ -24,6 +24,8 @@ static bool efi_noinitrd;
> static bool efi_nosoftreserve;
> static bool efi_disable_pci_dma = IS_ENABLED(CONFIG_EFI_DISABLE_PCI_DMA);
>
> +int efi_mem_encrypt;
> +
> bool __pure __efi_soft_reserve_enabled(void)
> {
> return !efi_nosoftreserve;
> @@ -75,6 +77,12 @@ efi_status_t efi_parse_options(char const *cmdline)
> efi_noinitrd = true;
> } else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
> efi_no5lvl = true;
> + } else if (IS_ENABLED(CONFIG_ARCH_HAS_MEM_ENCRYPT) &&
> + !strcmp(param, "mem_encrypt") && val) {
> + if (parse_option_str(val, "on"))
> + efi_mem_encrypt = 1;
> + else if (parse_option_str(val, "off"))
> + efi_mem_encrypt = -1;

With CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT having recently been
removed, I'm not sure what parsing for mem_encrypt=off does.

(Same thing in the next patch.)

Thanks,
Tom

> } else if (!strcmp(param, "efi") && val) {
> efi_nochunk = parse_option_str(val, "nochunk");
> efi_novamap |= parse_option_str(val, "novamap");
> diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
> index 212687c30d79..a1c6ab24cd99 100644
> --- a/drivers/firmware/efi/libstub/efistub.h
> +++ b/drivers/firmware/efi/libstub/efistub.h
> @@ -37,8 +37,8 @@ extern bool efi_no5lvl;
> extern bool efi_nochunk;
> extern bool efi_nokaslr;
> extern int efi_loglevel;
> +extern int efi_mem_encrypt;
> extern bool efi_novamap;
> -
> extern const efi_system_table_t *efi_system_table;
>
> typedef union efi_dxe_services_table efi_dxe_services_table_t;

2024-02-19 17:28:04

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 07/11] efi/libstub: Add generic support for parsing mem_encrypt=

On Mon, 19 Feb 2024 at 18:00, Tom Lendacky <[email protected]> wrote:
>
> On 2/13/24 06:41, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <[email protected]>
> >
> > Parse the mem_encrypt= command line parameter from the EFI stub if
> > CONFIG_ARCH_HAS_MEM_ENCRYPT=y, so that it can be passed to the early
> > boot code by the arch code in the stub.
> >
> > This avoids the need for the core kernel to do any string parsing very
> > early in the boot.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++++++++
> > drivers/firmware/efi/libstub/efistub.h | 2 +-
> > 2 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
> > index bfa30625f5d0..3dc2f9aaf08d 100644
> > --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> > +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> > @@ -24,6 +24,8 @@ static bool efi_noinitrd;
> > static bool efi_nosoftreserve;
> > static bool efi_disable_pci_dma = IS_ENABLED(CONFIG_EFI_DISABLE_PCI_DMA);
> >
> > +int efi_mem_encrypt;
> > +
> > bool __pure __efi_soft_reserve_enabled(void)
> > {
> > return !efi_nosoftreserve;
> > @@ -75,6 +77,12 @@ efi_status_t efi_parse_options(char const *cmdline)
> > efi_noinitrd = true;
> > } else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
> > efi_no5lvl = true;
> > + } else if (IS_ENABLED(CONFIG_ARCH_HAS_MEM_ENCRYPT) &&
> > + !strcmp(param, "mem_encrypt") && val) {
> > + if (parse_option_str(val, "on"))
> > + efi_mem_encrypt = 1;
> > + else if (parse_option_str(val, "off"))
> > + efi_mem_encrypt = -1;
>
> With CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT having recently been
> removed, I'm not sure what parsing mem_encrypt=off accomplishes.
>
> (Same thing in the next patch.)
>

We have to deal with both mem_encrypt=on and mem_encrypt=off occurring
on the command line. efi_parse_options() may be called more than once,
e.g., when there is a default command line baked into the image that
can be overridden at runtime. So if the baked-in one has
mem_encrypt=on, a mem_encrypt=off appearing later should override it.

The same applies to the next patch, although the decompressor appears
to ignore the built-in command line entirely (I made a note of that in
the commit log).
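
To make the last-one-wins behaviour concrete, here is a minimal
standalone sketch. parse_mem_encrypt() is a hypothetical stand-in for
the mem_encrypt= branch of efi_parse_options() quoted above (the real
code goes through parse_option_str()), but the tri-state values match
the patch: 1 for "on", -1 for "off", 0 when the option never appears.

#include <stdio.h>
#include <string.h>

static int efi_mem_encrypt;	/* 0 = unset, 1 = on, -1 = off */

/*
 * Hypothetical stand-in for the mem_encrypt= handling in
 * efi_parse_options(); each call simply overwrites the result of
 * any earlier call.
 */
static void parse_mem_encrypt(const char *val)
{
	if (!strcmp(val, "on"))
		efi_mem_encrypt = 1;
	else if (!strcmp(val, "off"))
		efi_mem_encrypt = -1;
}

int main(void)
{
	parse_mem_encrypt("on");	/* baked-in default command line */
	parse_mem_encrypt("off");	/* runtime command line, parsed later */
	printf("efi_mem_encrypt = %d\n", efi_mem_encrypt);	/* prints -1 */
	return 0;
}

The later call wins, so a runtime mem_encrypt=off counters a baked-in
mem_encrypt=on, which is exactly the case described above.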

2024-02-20 18:50:05

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 04/11] x86/startup_64: Defer assignment of 5-level paging global variables

On Tue, Feb 13, 2024 at 01:41:48PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <[email protected]>
>
> Assigning the 5-level paging related global variables from the earliest
> C code using explicit references that use the 1:1 translation of memory
> is unnecessary, as the startup code itself does not rely on them to
> create the initial page tables, and this is all it should be doing. So
> defer these assignments to the primary C entry code that executes via
> the ordinary kernel virtual mapping.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/x86/kernel/head64.c | 44 +++++++-------------
> 1 file changed, 14 insertions(+), 30 deletions(-)

Whoops:

arch/x86/kernel/head64.c: In function ‘x86_64_start_kernel’:
arch/x86/kernel/head64.c:442:17: error: ‘__pgtable_l5_enabled’ undeclared (first use in this function); did you mean ‘pgtable_l5_enabled’?
442 | __pgtable_l5_enabled = 1;
| ^~~~~~~~~~~~~~~~~~~~
| pgtable_l5_enabled
arch/x86/kernel/head64.c:442:17: note: each undeclared identifier is reported only once for each function it appears in
make[4]: *** [scripts/Makefile.build:243: arch/x86/kernel/head64.o] Error 1
make[3]: *** [scripts/Makefile.build:481: arch/x86/kernel] Error 2
make[2]: *** [scripts/Makefile.build:481: arch/x86] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [/mnt/kernel/kernel/2nd/linux/Makefile:1921: .] Error 2
make: *** [Makefile:240: __sub-make] Error 2


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-20 19:28:19

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v4 07/11] efi/libstub: Add generic support for parsing mem_encrypt=

On 2/19/24 11:06, Ard Biesheuvel wrote:
> On Mon, 19 Feb 2024 at 18:00, Tom Lendacky <[email protected]> wrote:
>>
>> On 2/13/24 06:41, Ard Biesheuvel wrote:
>>> From: Ard Biesheuvel <[email protected]>
>>>
>>> Parse the mem_encrypt= command line parameter from the EFI stub if
>>> CONFIG_ARCH_HAS_MEM_ENCRYPT=y, so that it can be passed to the early
>>> boot code by the arch code in the stub.
>>>
>>> This avoids the need for the core kernel to do any string parsing very
>>> early in the boot.
>>>
>>> Signed-off-by: Ard Biesheuvel <[email protected]>
>>> ---
>>> drivers/firmware/efi/libstub/efi-stub-helper.c | 8 ++++++++
>>> drivers/firmware/efi/libstub/efistub.h | 2 +-
>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
>>> index bfa30625f5d0..3dc2f9aaf08d 100644
>>> --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
>>> +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
>>> @@ -24,6 +24,8 @@ static bool efi_noinitrd;
>>> static bool efi_nosoftreserve;
>>> static bool efi_disable_pci_dma = IS_ENABLED(CONFIG_EFI_DISABLE_PCI_DMA);
>>>
>>> +int efi_mem_encrypt;
>>> +
>>> bool __pure __efi_soft_reserve_enabled(void)
>>> {
>>> return !efi_nosoftreserve;
>>> @@ -75,6 +77,12 @@ efi_status_t efi_parse_options(char const *cmdline)
>>> efi_noinitrd = true;
>>> } else if (IS_ENABLED(CONFIG_X86_64) && !strcmp(param, "no5lvl")) {
>>> efi_no5lvl = true;
>>> + } else if (IS_ENABLED(CONFIG_ARCH_HAS_MEM_ENCRYPT) &&
>>> + !strcmp(param, "mem_encrypt") && val) {
>>> + if (parse_option_str(val, "on"))
>>> + efi_mem_encrypt = 1;
>>> + else if (parse_option_str(val, "off"))
>>> + efi_mem_encrypt = -1;
>>
>> With CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT having recently been
>> removed, I'm not sure what parsing mem_encrypt=off accomplishes.
>>
>> (Same thing in the next patch.)
>>
>
> We have to deal with both mem_encrypt=on and mem_encrypt=off occurring
> on the command line. efi_parse_options() may be called more than once,
> e.g., when there is a default command line baked into the image that
> can be overridden at runtime. So if the baked-in one has
> mem_encrypt=on, a mem_encrypt=off appearing later should override it.
>
> The same applies to the next patch, although the decompressor appears
> to ignore the built-in command line entirely (I made a note of that in
> the commit log).

Ah, makes sense.

Thanks,
Tom


2024-02-20 23:33:31

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 04/11] x86/startup_64: Defer assignment of 5-level paging global variables

On Tue, 20 Feb 2024 at 19:45, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Feb 13, 2024 at 01:41:48PM +0100, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <[email protected]>
> >
> > Assigning the 5-level paging related global variables from the earliest
> > C code using explicit references that use the 1:1 translation of memory
> > is unnecessary, as the startup code itself does not rely on them to
> > create the initial page tables, and this is all it should be doing. So
> > defer these assignments to the primary C entry code that executes via
> > the ordinary kernel virtual mapping.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/x86/kernel/head64.c | 44 +++++++-------------
> > 1 file changed, 14 insertions(+), 30 deletions(-)
>
> Whoops:
>

Right, this is the same issue as in #11 - in both cases, the extern
declaration of __pgtable_l5_enabled needs to be visible regardless of
CONFIG_X86_5LEVEL.

I'll fix both cases for v5.

> arch/x86/kernel/head64.c: In function ‘x86_64_start_kernel’:
> arch/x86/kernel/head64.c:442:17: error: ‘__pgtable_l5_enabled’ undeclared (first use in this function); did you mean ‘pgtable_l5_enabled’?
> 442 | __pgtable_l5_enabled = 1;
> | ^~~~~~~~~~~~~~~~~~~~
> | pgtable_l5_enabled
> arch/x86/kernel/head64.c:442:17: note: each undeclared identifier is reported only once for each function it appears in
> make[4]: *** [scripts/Makefile.build:243: arch/x86/kernel/head64.o] Error 1
> make[3]: *** [scripts/Makefile.build:481: arch/x86/kernel] Error 2
> make[2]: *** [scripts/Makefile.build:481: arch/x86] Error 2
> make[2]: *** Waiting for unfinished jobs....
> make[1]: *** [/mnt/kernel/kernel/2nd/linux/Makefile:1921: .] Error 2
> make: *** [Makefile:240: __sub-make] Error 2
>
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
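
For context, a standalone sketch of the fix Ard describes, under the
assumption that the assignment in x86_64_start_kernel() is compiled
unconditionally: once the declaration (and definition) of
__pgtable_l5_enabled is visible regardless of CONFIG_X86_5LEVEL, the
reference that triggered the error above compiles fine. The helper
name below is illustrative only.

#include <stdio.h>

/* Previously guarded by #ifdef CONFIG_X86_5LEVEL; now always visible. */
unsigned int __pgtable_l5_enabled;

/*
 * Illustrative stand-in for the deferred assignment in
 * x86_64_start_kernel(): with the declaration unconditional, this
 * builds whether or not 5-level paging support is configured.
 */
static void assign_l5_globals(int la57_active)
{
	if (la57_active)
		__pgtable_l5_enabled = 1;
}

int main(void)
{
	assign_l5_globals(1);
	printf("__pgtable_l5_enabled = %u\n", __pgtable_l5_enabled);
	return 0;
}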

2024-02-21 10:11:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 04/11] x86/startup_64: Defer assignment of 5-level paging global variables

On Wed, Feb 21, 2024 at 12:33:08AM +0100, Ard Biesheuvel wrote:
> Right, this is the same issue as in #11 - in both cases, the extern
> declaration of __pgtable_l5_enabled needs to be visible regardless of
> CONFIG_X86_5LEVEL.

Yap, I don't mind something like below.

5LEVEL will be enabled practically everywhere.

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 158da0fd01d2..eeb1744215f2 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -52,13 +52,11 @@ extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);

-#ifdef CONFIG_X86_5LEVEL
unsigned int __pgtable_l5_enabled __ro_after_init;
unsigned int pgdir_shift __ro_after_init = 39;
EXPORT_SYMBOL(pgdir_shift);
unsigned int ptrs_per_p4d __ro_after_init = 1;
EXPORT_SYMBOL(ptrs_per_p4d);
-#endif

#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-21 10:20:33

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 04/11] x86/startup_64: Defer assignment of 5-level paging global variables

On Wed, 21 Feb 2024 at 11:09, Borislav Petkov <[email protected]> wrote:
>
> On Wed, Feb 21, 2024 at 12:33:08AM +0100, Ard Biesheuvel wrote:
> > Right, this is the same issue as in #11 - in both cases, the extern
> > declaration of __pgtable_l5_enabled needs to be visible regardless of
> > CONFIG_X86_5LEVEL.
>
> Yap, I don't mind something like below.
>
> 5LEVEL will be enabled practically everywhere.
>
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 158da0fd01d2..eeb1744215f2 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -52,13 +52,11 @@ extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
> static unsigned int __initdata next_early_pgt;
> pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
>
> -#ifdef CONFIG_X86_5LEVEL
> unsigned int __pgtable_l5_enabled __ro_after_init;
> unsigned int pgdir_shift __ro_after_init = 39;
> EXPORT_SYMBOL(pgdir_shift);
> unsigned int ptrs_per_p4d __ro_after_init = 1;
> EXPORT_SYMBOL(ptrs_per_p4d);
> -#endif
>

Just the below should be sufficient:

--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -22,7 +22,7 @@ typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;

-#ifdef CONFIG_X86_5LEVEL
extern unsigned int __pgtable_l5_enabled;

+#ifdef CONFIG_X86_5LEVEL
#ifdef USE_EARLY_PGTABLE_L5
/*

2024-02-21 11:13:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 04/11] x86/startup_64: Defer assignment of 5-level paging global variables

On Wed, Feb 21, 2024 at 11:20:13AM +0100, Ard Biesheuvel wrote:
> Just the below should be sufficient:
>
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -22,7 +22,7 @@ typedef struct { pteval_t pte; } pte_t;
> typedef struct { pmdval_t pmd; } pmd_t;
>
> -#ifdef CONFIG_X86_5LEVEL
> extern unsigned int __pgtable_l5_enabled;
>
> +#ifdef CONFIG_X86_5LEVEL
> #ifdef USE_EARLY_PGTABLE_L5

Perhaps, but the CONFIG_X86_5LEVEL ifdeffery is just ugly and is
becoming unnecessary.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-02-21 11:22:28

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v4 04/11] x86/startup_64: Defer assignment of 5-level paging global variables

On Wed, 21 Feb 2024 at 12:13, Borislav Petkov <[email protected]> wrote:
>
> On Wed, Feb 21, 2024 at 11:20:13AM +0100, Ard Biesheuvel wrote:
> > Just the below should be sufficient:
> >
> > --- a/arch/x86/include/asm/pgtable_64_types.h
> > +++ b/arch/x86/include/asm/pgtable_64_types.h
> > @@ -22,7 +22,7 @@ typedef struct { pteval_t pte; } pte_t;
> > typedef struct { pmdval_t pmd; } pmd_t;
> >
> > -#ifdef CONFIG_X86_5LEVEL
> > extern unsigned int __pgtable_l5_enabled;
> >
> > +#ifdef CONFIG_X86_5LEVEL
> > #ifdef USE_EARLY_PGTABLE_L5
>
> Perhaps, but the CONFIG_X86_5LEVEL ifdeffery is just ugly and is
> becoming unnecessary.
>

That all gets ripped out in the last patch.


Btw, v5 is good to go, in case you're OK with switching to that:

https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie-for-sev-v5

2024-02-21 11:24:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v4 04/11] x86/startup_64: Defer assignment of 5-level paging global variables

On Wed, Feb 21, 2024 at 12:21:20PM +0100, Ard Biesheuvel wrote:
> Btw v5 is good to go, in case you're ok switching to that:

Sure, send it on.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette