2024-05-28 11:03:47

by Kirill A. Shutemov

Subject: [PATCHv11 00/19] x86/tdx: Add kexec support

The patchset adds bits and pieces to get kexec (and crashkernel) working on
a TDX guest.

The last patch implements CPU offlining according to the approved ACPI
spec change proposal[1]. It unlocks kexec with all CPUs visible in the target
kernel. It requires BIOS-side enabling. If it is missing, we fall back to
booting the second kernel with a single CPU.

Please review. I would be glad for any feedback.

[1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher

v11:
- Rebased onto current tip/master;
- Rename CONFIG_X86_ACPI_MADT_WAKEUP to CONFIG_ACPI_MADT_WAKEUP;
- Drop CC_ATTR_GUEST_MEM_ENCRYPT checks around x86_platform.guest.enc_kexec_*
callbacks;
- Rename x86_platform.guest.enc_kexec_* callbacks;
- Report the error code if the VMM call fails in __set_memory_enc_pgtable();
- Update commit messages and comments;
- Add Reviewed-bys;
v10:
- Rebased to current tip/master;
- Preserve CR4.MCE instead of setting it unconditionally;
- Fix build error in Hyper-V code after rebase;
- Include Ashish's patch for real;
v9:
- Rebased;
- Keep page tables that map E820_TYPE_ACPI (Ashish);
- Ack/Reviewed/Tested-bys from Sathya, Kai, Tao;
- Minor printk() message adjustments;
v8:
- Rework serialization around conversion of memory back to private;
- Print ACPI_MADT_TYPE_MULTIPROC_WAKEUP in acpi_table_print_madt_entry();
- Drop debugfs interface to dump info on shared memory;
- Adjust comments and commit messages;
- Reviewed-bys by Baoquan, Dave and Thomas;
v7:
- Call enc_kexec_stop_conversion() and enc_kexec_unshare_mem() after shutting
down IO-APIC, lapic and hpet. It meets AMD requirements.
- Minor style changes;
- Add Acked/Reviewed-bys;
v6:
- Rebased to v6.8-rc1;
- Provide default noop callbacks for .enc_kexec_stop_conversion and
.enc_kexec_unshare_mem;
- Split off patch that introduces .enc_kexec_* callbacks;
- asm_acpi_mp_play_dead(): program CR3 directly from RSI, no MOV to RAX
required;
- Restructure how smp_ops.stop_this_cpu() is hooked up in crash_nmi_callback();
- kvmclock patch got merged via KVM tree;
v5:
- Rename smp_ops.crash_play_dead to smp_ops.stop_this_cpu and use it in
stop_this_cpu();
- Split off enc_kexec_stop_conversion() from enc_kexec_unshare_mem();
- Introduce kernel_ident_mapping_free();
- Add explicit include for alternatives and stringify;
- Add barrier() after setting conversion_allowed to false;
- Mark cpu_hotplug_offline_disabled __ro_after_init;
- Print error if failed to hand over CPU to BIOS;
- Update comments and commit messages;
v4:
- Fix build for !KEXEC_CORE;
- Cleaner ALTERNATIVE use;
- Update commit messages and comments;
- Add Reviewed-bys;
v3:
- Rework acpi_mp_crash_stop_other_cpus() to avoid invoking hotplug state
machine;
- Free page tables if reset vector setup failed;
- Change asm_acpi_mp_play_dead() to pass reset vector and PGD as arguments;
- Mark acpi_mp_* variables as static and __ro_after_init;
- Use u32 for apicid;
- Disable CPU offlining if reset vector setup failed;
- Rename madt.S -> madt_playdead.S;
- Mark tdx_kexec_unshare_mem() as static;
- Rebase onto up-to-date tip/master;
- Whitespace fixes;
- Reorder patches;
- Add Reviewed-bys;
- Update comments and commit messages;
v2:
- Rework how unsharing hooks up into the kexec codepath;
- Rework kvmclock_disable() fix based on Sean's;
- s/cpu_hotplug_not_supported()/cpu_hotplug_disable_offlining()/;
- use play_dead_common() to implement acpi_mp_play_dead();
- cond_resched() in tdx_shared_memory_show();
- s/target kernel/second kernel/;
- Update commit messages and comments;

Ashish Kalra (1):
x86/mm: Do not zap page table entries mapping unaccepted memory table
during kdump.

Borislav Petkov (1):
x86/relocate_kernel: Use named labels for less confusion

Kirill A. Shutemov (17):
x86/acpi: Extract ACPI MADT wakeup code into a separate file
x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init
cpu/hotplug: Add support for declaring CPU offlining not supported
cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup
x86/kexec: Keep CR4.MCE set during kexec for TDX guest
x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
x86/mm: Return correct level from lookup_address() if pte is none
x86/tdx: Account shared memory
x86/mm: Add callbacks to prepare encrypted memory for kexec
x86/tdx: Convert shared memory back to private on kexec
x86/mm: Make e820__end_ram_pfn() cover E820_TYPE_ACPI ranges
x86/acpi: Rename fields in acpi_madt_multiproc_wakeup structure
x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
x86/smp: Add smp_ops.stop_this_cpu() callback
x86/mm: Introduce kernel_ident_mapping_free()
x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
ACPI: tables: Print MULTIPROC_WAKEUP when MADT is parsed

arch/x86/Kconfig | 7 +
arch/x86/coco/core.c | 1 -
arch/x86/coco/tdx/tdx.c | 96 ++++++++-
arch/x86/hyperv/ivm.c | 22 +-
arch/x86/include/asm/acpi.h | 7 +
arch/x86/include/asm/init.h | 3 +
arch/x86/include/asm/pgtable.h | 5 +
arch/x86/include/asm/pgtable_types.h | 1 +
arch/x86/include/asm/set_memory.h | 3 +
arch/x86/include/asm/smp.h | 1 +
arch/x86/include/asm/x86_init.h | 13 +-
arch/x86/kernel/acpi/Makefile | 1 +
arch/x86/kernel/acpi/boot.c | 86 +-------
arch/x86/kernel/acpi/madt_playdead.S | 28 +++
arch/x86/kernel/acpi/madt_wakeup.c | 292 +++++++++++++++++++++++++++
arch/x86/kernel/crash.c | 12 ++
arch/x86/kernel/e820.c | 9 +-
arch/x86/kernel/process.c | 7 +
arch/x86/kernel/reboot.c | 18 ++
arch/x86/kernel/relocate_kernel_64.S | 25 ++-
arch/x86/kernel/x86_init.c | 8 +-
arch/x86/mm/ident_map.c | 73 +++++++
arch/x86/mm/init_64.c | 16 +-
arch/x86/mm/mem_encrypt_amd.c | 8 +-
arch/x86/mm/pat/set_memory.c | 74 +++++--
drivers/acpi/tables.c | 14 ++
include/acpi/actbl2.h | 19 +-
include/linux/cc_platform.h | 10 -
include/linux/cpu.h | 2 +
kernel/cpu.c | 12 +-
30 files changed, 707 insertions(+), 166 deletions(-)
create mode 100644 arch/x86/kernel/acpi/madt_playdead.S
create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

--
2.43.0



2024-05-28 11:06:51

by Kirill A. Shutemov

Subject: [PATCHv11 19/19] ACPI: tables: Print MULTIPROC_WAKEUP when MADT is parsed

When MADT is parsed, print MULTIPROC_WAKEUP information:

ACPI: MP Wakeup (version[1], mailbox[0x7fffd000], reset[0x7fffe068])

This debug information will be very helpful during bring up.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Acked-by: Kai Huang <[email protected]>
Tested-by: Tao Liu <[email protected]>
---
drivers/acpi/tables.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index b976e5fc3fbc..9e1b01c35070 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -198,6 +198,20 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
}
break;

+ case ACPI_MADT_TYPE_MULTIPROC_WAKEUP:
+ {
+ struct acpi_madt_multiproc_wakeup *p =
+ (struct acpi_madt_multiproc_wakeup *)header;
+ u64 reset_vector = 0;
+
+ if (p->version >= ACPI_MADT_MP_WAKEUP_VERSION_V1)
+ reset_vector = p->reset_vector;
+
+ pr_debug("MP Wakeup (version[%d], mailbox[%#llx], reset[%#llx])\n",
+ p->version, p->mailbox_address, reset_vector);
+ }
+ break;
+
case ACPI_MADT_TYPE_CORE_PIC:
{
struct acpi_madt_core_pic *p = (struct acpi_madt_core_pic *)header;
--
2.43.0
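
The version check in patch 19/19 can be illustrated outside the kernel.
The following is a minimal userspace sketch, not the kernel code: the
struct mp_wakeup_entry type and the format_mp_wakeup() helper are
hypothetical stand-ins, and MP_WAKEUP_VERSION_V1 is assumed to mirror
ACPI_MADT_MP_WAKEUP_VERSION_V1. It shows why the reset vector is
reported as 0 for entries older than version 1.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror ACPI_MADT_MP_WAKEUP_VERSION_V1 (illustration only). */
#define MP_WAKEUP_VERSION_V1 1

/* Hypothetical, simplified stand-in for struct acpi_madt_multiproc_wakeup. */
struct mp_wakeup_entry {
	uint16_t version;
	uint64_t mailbox_address;
	uint64_t reset_vector;	/* only meaningful for version >= 1 */
};

/*
 * Format the entry the way the patch prints it. The reset vector field
 * is only read when the structure version says it is present.
 * Returns the number of characters snprintf() would have written.
 */
static int format_mp_wakeup(char *buf, size_t len,
			    const struct mp_wakeup_entry *p)
{
	uint64_t reset_vector = 0;

	if (p->version >= MP_WAKEUP_VERSION_V1)
		reset_vector = p->reset_vector;

	return snprintf(buf, len,
			"MP Wakeup (version[%d], mailbox[%#" PRIx64 "], reset[%#" PRIx64 "])",
			p->version, p->mailbox_address, reset_vector);
}
```

With a version 1 entry the sketch produces the same line as the commit
message example; with a version 0 entry the stale reset_vector field is
ignored.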


2024-05-28 11:09:32

by Kirill A. Shutemov

Subject: [PATCHv11 01/19] x86/acpi: Extract ACPI MADT wakeup code into a separate file

In order to prepare for the expansion of support for the ACPI MADT
wakeup method, move the relevant code into a separate file.

Introduce a new configuration option to clearly indicate dependencies
without the use of ifdefs.

There have been no functional changes.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Acked-by: Kai Huang <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Tested-by: Tao Liu <[email protected]>
---
arch/x86/Kconfig | 7 +++
arch/x86/include/asm/acpi.h | 5 ++
arch/x86/kernel/acpi/Makefile | 1 +
arch/x86/kernel/acpi/boot.c | 86 +-----------------------------
arch/x86/kernel/acpi/madt_wakeup.c | 82 ++++++++++++++++++++++++++++
5 files changed, 96 insertions(+), 85 deletions(-)
create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e8837116704c..e30ea4129d2c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1118,6 +1118,13 @@ config X86_LOCAL_APIC
depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
select IRQ_DOMAIN_HIERARCHY

+config ACPI_MADT_WAKEUP
+ def_bool y
+ depends on X86_64
+ depends on ACPI
+ depends on SMP
+ depends on X86_LOCAL_APIC
+
config X86_IO_APIC
def_bool y
depends on X86_LOCAL_APIC || X86_UP_IOAPIC
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 5af926c050f0..ceacac2b335d 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -78,6 +78,11 @@ static inline bool acpi_skip_set_wakeup_address(void)

#define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address

+union acpi_subtable_headers;
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end);
+
/*
* Check if the CPU can handle C2 and deeper
*/
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index fc17b3f136fe..2feba7257665 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_ACPI) += boot.o
obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
obj-$(CONFIG_ACPI_APEI) += apei.o
obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
+obj-$(CONFIG_ACPI_MADT_WAKEUP) += madt_wakeup.o

ifneq ($(CONFIG_ACPI_PROCESSOR),)
obj-y += cstate.o
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 4bf82dbd2a6b..9f4618dcd704 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -67,13 +67,6 @@ static bool has_lapic_cpus __initdata;
static bool acpi_support_online_capable;
#endif

-#ifdef CONFIG_X86_64
-/* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
-/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
-#endif
-
#ifdef CONFIG_X86_IO_APIC
/*
* Locks related to IOAPIC hotplug
@@ -341,60 +334,6 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e

return 0;
}
-
-#ifdef CONFIG_X86_64
-static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
-{
- /*
- * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
- *
- * Wakeup of secondary CPUs is fully serialized in the core code.
- * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
- */
- if (!acpi_mp_wake_mailbox) {
- acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
- sizeof(*acpi_mp_wake_mailbox),
- MEMREMAP_WB);
- }
-
- /*
- * Mailbox memory is shared between the firmware and OS. Firmware will
- * listen on mailbox command address, and once it receives the wakeup
- * command, the CPU associated with the given apicid will be booted.
- *
- * The value of 'apic_id' and 'wakeup_vector' must be visible to the
- * firmware before the wakeup command is visible. smp_store_release()
- * ensures ordering and visibility.
- */
- acpi_mp_wake_mailbox->apic_id = apicid;
- acpi_mp_wake_mailbox->wakeup_vector = start_ip;
- smp_store_release(&acpi_mp_wake_mailbox->command,
- ACPI_MP_WAKE_COMMAND_WAKEUP);
-
- /*
- * Wait for the CPU to wake up.
- *
- * The CPU being woken up is essentially in a spin loop waiting to be
- * woken up. It should not take long for it wake up and acknowledge by
- * zeroing out ->command.
- *
- * ACPI specification doesn't provide any guidance on how long kernel
- * has to wait for a wake up acknowledgement. It also doesn't provide
- * a way to cancel a wake up request if it takes too long.
- *
- * In TDX environment, the VMM has control over how long it takes to
- * wake up secondary. It can postpone scheduling secondary vCPU
- * indefinitely. Giving up on wake up request and reporting error opens
- * possible attack vector for VMM: it can wake up a secondary CPU when
- * kernel doesn't expect it. Wait until positive result of the wake up
- * request.
- */
- while (READ_ONCE(acpi_mp_wake_mailbox->command))
- cpu_relax();
-
- return 0;
-}
-#endif /* CONFIG_X86_64 */
#endif /* CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_IO_APIC
@@ -1124,29 +1063,6 @@ static int __init acpi_parse_madt_lapic_entries(void)
}
return 0;
}
-
-#ifdef CONFIG_X86_64
-static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
- const unsigned long end)
-{
- struct acpi_madt_multiproc_wakeup *mp_wake;
-
- if (!IS_ENABLED(CONFIG_SMP))
- return -ENODEV;
-
- mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
- if (BAD_MADT_ENTRY(mp_wake, end))
- return -EINVAL;
-
- acpi_table_print_madt_entry(&header->common);
-
- acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
-
- apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
-
- return 0;
-}
-#endif /* CONFIG_X86_64 */
#endif /* CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_IO_APIC
@@ -1343,7 +1259,7 @@ static void __init acpi_process_madt(void)
smp_found_config = 1;
}

-#ifdef CONFIG_X86_64
+#ifdef CONFIG_ACPI_MADT_WAKEUP
/*
* Parse MADT MP Wake entry.
*/
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
new file mode 100644
index 000000000000..7f164d38bd0b
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/acpi.h>
+#include <linux/io.h>
+#include <asm/apic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+
+static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
+{
+ /*
+ * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
+ *
+ * Wakeup of secondary CPUs is fully serialized in the core code.
+ * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
+ */
+ if (!acpi_mp_wake_mailbox) {
+ acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+ sizeof(*acpi_mp_wake_mailbox),
+ MEMREMAP_WB);
+ }
+
+ /*
+ * Mailbox memory is shared between the firmware and OS. Firmware will
+ * listen on mailbox command address, and once it receives the wakeup
+ * command, the CPU associated with the given apicid will be booted.
+ *
+ * The value of 'apic_id' and 'wakeup_vector' must be visible to the
+ * firmware before the wakeup command is visible. smp_store_release()
+ * ensures ordering and visibility.
+ */
+ acpi_mp_wake_mailbox->apic_id = apicid;
+ acpi_mp_wake_mailbox->wakeup_vector = start_ip;
+ smp_store_release(&acpi_mp_wake_mailbox->command,
+ ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+ /*
+ * Wait for the CPU to wake up.
+ *
+ * The CPU being woken up is essentially in a spin loop waiting to be
+ * woken up. It should not take long for it wake up and acknowledge by
+ * zeroing out ->command.
+ *
+ * ACPI specification doesn't provide any guidance on how long kernel
+ * has to wait for a wake up acknowledgment. It also doesn't provide
+ * a way to cancel a wake up request if it takes too long.
+ *
+ * In TDX environment, the VMM has control over how long it takes to
+ * wake up secondary. It can postpone scheduling secondary vCPU
+ * indefinitely. Giving up on wake up request and reporting error opens
+ * possible attack vector for VMM: it can wake up a secondary CPU when
+ * kernel doesn't expect it. Wait until positive result of the wake up
+ * request.
+ */
+ while (READ_ONCE(acpi_mp_wake_mailbox->command))
+ cpu_relax();
+
+ return 0;
+}
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_madt_multiproc_wakeup *mp_wake;
+
+ mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+ if (BAD_MADT_ENTRY(mp_wake, end))
+ return -EINVAL;
+
+ acpi_table_print_madt_entry(&header->common);
+
+ acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+ apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
+
+ return 0;
+}
--
2.43.0
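
The mailbox handshake that acpi_wakeup_cpu() performs can be modelled in
userspace. This is a rough sketch under stated assumptions: the struct
mailbox layout is simplified, the MP_WAKE_COMMAND_* values are assumed,
and firmware_poll() is a hypothetical stand-in for the firmware side,
which in reality runs on the parked CPU. C11 atomics play the role of
smp_store_release() and READ_ONCE().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed command values, for illustration only. */
#define MP_WAKE_COMMAND_NOOP	0
#define MP_WAKE_COMMAND_WAKEUP	1

/* Simplified mailbox; the real structure has more fields. */
struct mailbox {
	_Atomic uint16_t command;
	uint32_t apic_id;
	uint64_t wakeup_vector;
};

/*
 * OS side: fill in the payload first, then publish the command with
 * release semantics so the payload is visible before the command is.
 */
static void os_post_wakeup(struct mailbox *mb, uint32_t apicid,
			   uint64_t start_ip)
{
	mb->apic_id = apicid;
	mb->wakeup_vector = start_ip;
	atomic_store_explicit(&mb->command, MP_WAKE_COMMAND_WAKEUP,
			      memory_order_release);
}

/* OS side: the wakeup is acknowledged once the command reads as zero. */
static int os_wakeup_acked(struct mailbox *mb)
{
	return atomic_load_explicit(&mb->command, memory_order_acquire) ==
	       MP_WAKE_COMMAND_NOOP;
}

/*
 * Hypothetical firmware side: when a wakeup command is pending, consume
 * the payload and acknowledge by zeroing the command. Returns 1 if a
 * command was consumed, 0 otherwise.
 */
static int firmware_poll(struct mailbox *mb, uint32_t *apic_id,
			 uint64_t *entry)
{
	if (atomic_load_explicit(&mb->command, memory_order_acquire) !=
	    MP_WAKE_COMMAND_WAKEUP)
		return 0;

	*apic_id = mb->apic_id;
	*entry = mb->wakeup_vector;
	atomic_store_explicit(&mb->command, MP_WAKE_COMMAND_NOOP,
			      memory_order_release);
	return 1;
}
```

In the kernel the OS side spins on the command field with cpu_relax()
and never times out, for the reasons given in the code comment of the
patch.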


2024-05-28 11:55:55

by Kirill A. Shutemov

Subject: [PATCHv11 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest

TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
that bit is cleared during CR4 register reprogramming during boot or
kexec flows, a #VE exception will be raised which the guest kernel
cannot handle.

Therefore, make sure the CR4.MCE setting is preserved over kexec too and
avoid raising any #VEs.

The change doesn't affect non-TDX-guest environments.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/relocate_kernel_64.S | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 085eef5c3904..b668a6be4f6f 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -5,6 +5,8 @@
*/

#include <linux/linkage.h>
+#include <linux/stringify.h>
+#include <asm/alternative.h>
#include <asm/page_types.h>
#include <asm/kexec.h>
#include <asm/processor-flags.h>
@@ -143,15 +145,17 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)

/*
* Set cr4 to a known state:
- * - physical address extension enabled
* - 5-level paging, if it was enabled before
+ * - Machine check exception on TDX guest, if it was enabled before.
+ * Clearing MCE might not be allowed in TDX guests, depending on setup.
+ * - physical address extension enabled
*/
- movl $X86_CR4_PAE, %eax
- testq $X86_CR4_LA57, %r13
- jz .Lno_la57
- orl $X86_CR4_LA57, %eax
-.Lno_la57:
+ movl $X86_CR4_LA57, %eax
+ ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %eax), X86_FEATURE_TDX_GUEST

+ /* R13 contains the original CR4 value, read in relocate_kernel() */
+ andl %r13d, %eax
+ orl $X86_CR4_PAE, %eax
movq %rax, %cr4

jmp 1f
--
2.43.0
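
The CR4 computation in patch 06/19 reads more easily in C than in
assembly. The sketch below is an illustration only: kexec_cr4() is a
hypothetical helper, and the tdx_guest flag stands in for the
ALTERNATIVE patching on X86_FEATURE_TDX_GUEST. The bit positions are the
architectural ones (PAE is bit 5, MCE is bit 6, LA57 is bit 12).

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_PAE	(1u << 5)	/* physical address extension */
#define X86_CR4_MCE	(1u << 6)	/* machine check enable */
#define X86_CR4_LA57	(1u << 12)	/* 5-level paging */

/*
 * Hypothetical helper mirroring the assembly sequence:
 *   eax  = LA57 (| MCE on a TDX guest)   bits that may be inherited
 *   eax &= original CR4                  keep them only if already set
 *   eax |= PAE                           always required in long mode
 */
static uint32_t kexec_cr4(uint64_t orig_cr4, bool tdx_guest)
{
	uint32_t inherit = X86_CR4_LA57;

	if (tdx_guest)
		inherit |= X86_CR4_MCE;

	return (inherit & (uint32_t)orig_cr4) | X86_CR4_PAE;
}
```

The mask-then-or form replaces the earlier test-and-branch on LA57 and
makes it clear that MCE is never newly set, only preserved.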


2024-05-28 12:45:53

by Kirill A. Shutemov

Subject: [PATCHv11 04/19] cpu/hotplug, x86/acpi: Disable CPU offlining for ACPI MADT wakeup

The ACPI MADT wakeup method doesn't allow offlining a CPU after it has been
woken up.

Currently CPU hotplug is prevented based on the confidential computing
attribute which is set for Intel TDX. But TDX is not the only possible
user of the wakeup method. Any platform that uses the ACPI MADT wakeup
method cannot offline CPUs.

Disable CPU offlining on ACPI MADT wakeup enumeration.

The change has no visible effects for users: currently, TDX guest is the
only platform that uses the ACPI MADT wakeup method.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Tested-by: Tao Liu <[email protected]>
---
arch/x86/coco/core.c | 1 -
arch/x86/kernel/acpi/madt_wakeup.c | 3 +++
include/linux/cc_platform.h | 10 ----------
kernel/cpu.c | 3 +--
4 files changed, 4 insertions(+), 13 deletions(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index b31ef2424d19..0f81f70aca82 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -29,7 +29,6 @@ static bool noinstr intel_cc_platform_has(enum cc_attr attr)
{
switch (attr) {
case CC_ATTR_GUEST_UNROLL_STRING_IO:
- case CC_ATTR_HOTPLUG_DISABLED:
case CC_ATTR_GUEST_MEM_ENCRYPT:
case CC_ATTR_MEM_ENCRYPT:
return true;
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index cf79ea6f3007..d222be8d7a07 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-or-later
#include <linux/acpi.h>
+#include <linux/cpu.h>
#include <linux/io.h>
#include <asm/apic.h>
#include <asm/barrier.h>
@@ -76,6 +77,8 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,

acpi_mp_wake_mailbox_paddr = mp_wake->base_address;

+ cpu_hotplug_disable_offlining();
+
apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);

return 0;
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 60693a145894..caa4b4430634 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -81,16 +81,6 @@ enum cc_attr {
*/
CC_ATTR_GUEST_SEV_SNP,

- /**
- * @CC_ATTR_HOTPLUG_DISABLED: Hotplug is not supported or disabled.
- *
- * The platform/OS is running as a guest/virtual machine does not
- * support CPU hotplug feature.
- *
- * Examples include TDX Guest.
- */
- CC_ATTR_HOTPLUG_DISABLED,
-
/**
* @CC_ATTR_HOST_SEV_SNP: AMD SNP enabled on the host.
*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 4c15b478e2bc..a609385c7f99 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1481,8 +1481,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
* If the platform does not support hotplug, report it explicitly to
* differentiate it from a transient offlining failure.
*/
- if (cc_platform_has(CC_ATTR_HOTPLUG_DISABLED) ||
- cpu_hotplug_offline_disabled)
+ if (cpu_hotplug_offline_disabled)
return -EOPNOTSUPP;
if (cpu_hotplug_disabled)
return -EBUSY;
--
2.43.0
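
The resulting check in cpu_down_maps_locked() distinguishes a permanent
"not supported" condition from a transient one. A minimal userspace
sketch of that decision follows; struct hotplug_state, disable_offlining()
and cpu_down_check() are hypothetical names used for illustration, not
kernel interfaces.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two flags consulted on the offline path. */
struct hotplug_state {
	bool offline_disabled;	/* set once at enumeration, never cleared */
	int  disabled;		/* transient, e.g. during suspend */
};

/* Stand-in for cpu_hotplug_disable_offlining(). */
static void disable_offlining(struct hotplug_state *s)
{
	s->offline_disabled = true;
}

/*
 * Report -EOPNOTSUPP when the platform can never offline a CPU, and
 * -EBUSY when offlining is only temporarily blocked.
 */
static int cpu_down_check(const struct hotplug_state *s)
{
	if (s->offline_disabled)
		return -EOPNOTSUPP;
	if (s->disabled)
		return -EBUSY;
	return 0;
}
```

After this patch the permanent flag is set by the MADT wakeup
enumeration itself rather than being derived from a confidential
computing attribute.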


2024-05-28 12:56:14

by Kirill A. Shutemov

Subject: [PATCHv11 02/19] x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init

acpi_mp_wake_mailbox_paddr and acpi_mp_wake_mailbox are initialized once
during ACPI MADT init and never changed.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Kai Huang <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Tested-by: Tao Liu <[email protected]>
---
arch/x86/kernel/acpi/madt_wakeup.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 7f164d38bd0b..cf79ea6f3007 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -6,10 +6,10 @@
#include <asm/processor.h>

/* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
+static u64 acpi_mp_wake_mailbox_paddr __ro_after_init;

/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox __ro_after_init;

static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip)
{
--
2.43.0


2024-05-28 12:56:17

by Kirill A. Shutemov

Subject: [PATCHv11 16/19] x86/smp: Add smp_ops.stop_this_cpu() callback

If the helper is defined, it is called instead of halt() to stop the CPU
at the end of stop_this_cpu() and on crash CPU shutdown.

ACPI MADT will use it to hand over the CPU to BIOS in order to be able
to wake it up again after kexec.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Kai Huang <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Tested-by: Tao Liu <[email protected]>
---
arch/x86/include/asm/smp.h | 1 +
arch/x86/kernel/process.c | 7 +++++++
arch/x86/kernel/reboot.c | 6 ++++++
3 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index a35936b512fe..ca073f40698f 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -35,6 +35,7 @@ struct smp_ops {
int (*cpu_disable)(void);
void (*cpu_die)(unsigned int cpu);
void (*play_dead)(void);
+ void (*stop_this_cpu)(void);

void (*send_call_func_ipi)(const struct cpumask *mask);
void (*send_call_func_single_ipi)(int cpu);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b8441147eb5e..f63f8fd00a91 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -835,6 +835,13 @@ void __noreturn stop_this_cpu(void *dummy)
*/
cpumask_clear_cpu(cpu, &cpus_stop_mask);

+#ifdef CONFIG_SMP
+ if (smp_ops.stop_this_cpu) {
+ smp_ops.stop_this_cpu();
+ unreachable();
+ }
+#endif
+
for (;;) {
/*
* Use native_halt() so that memory contents don't change
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 097313147ad3..513809b5b27c 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -880,6 +880,12 @@ static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
cpu_emergency_disable_virtualization();

atomic_dec(&waiting_for_crash_ipi);
+
+ if (smp_ops.stop_this_cpu) {
+ smp_ops.stop_this_cpu();
+ unreachable();
+ }
+
/* Assume hlt works */
halt();
for (;;)
--
2.43.0
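
The optional-callback pattern added in patch 16/19 can be sketched as
follows. This is a userspace model with hypothetical names
(struct smp_ops_sketch, stop_cpu(), hand_over_to_firmware()); in the
kernel neither path returns, whereas the sketch returns which path was
taken so that the behaviour can be observed.

```c
#include <stddef.h>

/* Hypothetical, reduced version of struct smp_ops. */
struct smp_ops_sketch {
	void (*stop_this_cpu)(void);	/* optional; NULL means "just halt" */
};

enum stop_path {
	STOP_VIA_CALLBACK,
	STOP_VIA_HALT,
};

/* Counts how often the (pretend) firmware handover was invoked. */
static int handover_count;

/* Stand-in for a platform hook that hands the CPU back to firmware. */
static void hand_over_to_firmware(void)
{
	handover_count++;
}

/*
 * Prefer the platform callback when one is registered, otherwise fall
 * back to the default halt path. The real code never returns from
 * either branch.
 */
static enum stop_path stop_cpu(const struct smp_ops_sketch *ops)
{
	if (ops->stop_this_cpu) {
		ops->stop_this_cpu();
		return STOP_VIA_CALLBACK;
	}
	return STOP_VIA_HALT;
}
```

Leaving the pointer NULL keeps existing platforms on the halt path
without any change on their side.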


2024-05-28 12:59:15

by Kirill A. Shutemov

Subject: [PATCHv11 13/19] x86/mm: Do not zap page table entries mapping unaccepted memory table during kdump.

From: Ashish Kalra <[email protected]>

During crashkernel boot only pre-allocated crash memory is presented as
E820_TYPE_RAM. This can cause page table entries mapping the unaccepted
memory table to be zapped during phys_pte_init(), phys_pmd_init(),
phys_pud_init() and phys_p4d_init(), because SNP/TDX guests use
E820_TYPE_ACPI to store the unaccepted memory table and pass it between
the kernels on kexec/kdump.

E820_TYPE_ACPI covers not only ACPI data, but also EFI tables and might
be required by kernel to function properly.

The problem was discovered during debugging kdump for SNP guest. The
unaccepted memory table stored with E820_TYPE_ACPI and passed between
the kernels on kdump was getting zapped as the PMD entry mapping this
is above the E820_TYPE_RAM range for the reserved crashkernel memory.

Signed-off-by: Ashish Kalra <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/mm/init_64.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7e177856ee4f..28002cc7a37d 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -469,7 +469,9 @@ phys_pte_init(pte_t *pte_page, unsigned long paddr, unsigned long paddr_end,
!e820__mapped_any(paddr & PAGE_MASK, paddr_next,
E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PAGE_MASK, paddr_next,
- E820_TYPE_RESERVED_KERN))
+ E820_TYPE_RESERVED_KERN) &&
+ !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
+ E820_TYPE_ACPI))
set_pte_init(pte, __pte(0), init);
continue;
}
@@ -524,7 +526,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
!e820__mapped_any(paddr & PMD_MASK, paddr_next,
E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PMD_MASK, paddr_next,
- E820_TYPE_RESERVED_KERN))
+ E820_TYPE_RESERVED_KERN) &&
+ !e820__mapped_any(paddr & PMD_MASK, paddr_next,
+ E820_TYPE_ACPI))
set_pmd_init(pmd, __pmd(0), init);
continue;
}
@@ -611,7 +615,9 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
!e820__mapped_any(paddr & PUD_MASK, paddr_next,
E820_TYPE_RAM) &&
!e820__mapped_any(paddr & PUD_MASK, paddr_next,
- E820_TYPE_RESERVED_KERN))
+ E820_TYPE_RESERVED_KERN) &&
+ !e820__mapped_any(paddr & PUD_MASK, paddr_next,
+ E820_TYPE_ACPI))
set_pud_init(pud, __pud(0), init);
continue;
}
@@ -698,7 +704,9 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
!e820__mapped_any(paddr & P4D_MASK, paddr_next,
E820_TYPE_RAM) &&
!e820__mapped_any(paddr & P4D_MASK, paddr_next,
- E820_TYPE_RESERVED_KERN))
+ E820_TYPE_RESERVED_KERN) &&
+ !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+ E820_TYPE_ACPI))
set_p4d_init(p4d, __p4d(0), init);
continue;
}
--
2.43.0
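
The zap condition that patch 13/19 extends can be modelled with a toy
memory map. The sketch below is an illustration under assumptions:
struct region, mapped_any() and should_zap() are hypothetical userspace
stand-ins for the e820 table, e820__mapped_any() and the condition in
phys_*_init(). Ranges are half-open, [start, end).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum region_type {
	TYPE_RAM,
	TYPE_RESERVED_KERN,
	TYPE_ACPI,
	TYPE_RESERVED,
};

/* One entry of a toy memory map covering [start, end). */
struct region {
	uint64_t start;
	uint64_t end;
	enum region_type type;
};

/* True if any part of [start, end) overlaps a region of the given type. */
static bool mapped_any(const struct region *map, size_t n,
		       uint64_t start, uint64_t end, enum region_type type)
{
	for (size_t i = 0; i < n; i++) {
		if (map[i].type != type)
			continue;
		if (map[i].start < end && start < map[i].end)
			return true;
	}
	return false;
}

/*
 * A page table entry is zapped only if the range it maps contains no
 * RAM, no kernel-reserved memory and, with this patch, no ACPI memory.
 */
static bool should_zap(const struct region *map, size_t n,
		       uint64_t start, uint64_t end)
{
	return !mapped_any(map, n, start, end, TYPE_RAM) &&
	       !mapped_any(map, n, start, end, TYPE_RESERVED_KERN) &&
	       !mapped_any(map, n, start, end, TYPE_ACPI);
}
```

Because the check is "mapped any", a 2M entry that only partially
overlaps a small ACPI region is still kept.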


2024-05-28 13:48:27

by Borislav Petkov

Subject: Re: [PATCHv11 01/19] x86/acpi: Extract ACPI MADT wakeup code into a separate file

On Tue, May 28, 2024 at 12:55:04PM +0300, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, move the relevant code into a separate file.
>
> Introduce a new configuration option to clearly indicate dependencies
> without the use of ifdefs.
>
> There have been no functional changes.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
> Acked-by: Kai Huang <[email protected]>
> Reviewed-by: Baoquan He <[email protected]>
> Reviewed-by: Thomas Gleixner <[email protected]>
> Tested-by: Tao Liu <[email protected]>
> ---
> arch/x86/Kconfig | 7 +++
> arch/x86/include/asm/acpi.h | 5 ++
> arch/x86/kernel/acpi/Makefile | 1 +
> arch/x86/kernel/acpi/boot.c | 86 +-----------------------------
> arch/x86/kernel/acpi/madt_wakeup.c | 82 ++++++++++++++++++++++++++++
> 5 files changed, 96 insertions(+), 85 deletions(-)
> create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

Acked-by: Borislav Petkov (AMD) <[email protected]>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-28 14:01:46

by Kirill A. Shutemov

Subject: [PATCHv11 08/19] x86/mm: Return correct level from lookup_address() if pte is none

Currently, lookup_address() returns two things:
1. A "pte_t" (which might be a p[g4um]d_t)
2. The 'level' of the page tables where the "pte_t" was found
(returned via a pointer)

If no pte_t is found, 'level' is essentially garbage.

Always fill out the level. For NULL "pte_t"s, fill in the level where
the p*d_none() entry was found mirroring the "found" behavior.

Always filling out the level allows using lookup_address() to precisely
skip over holes when walking kernel page tables.

Add one more entry into enum pg_level to indicate the size of the VA
covered by one PGD entry in 5-level paging mode.

Update comments for lookup_address() and lookup_address_in_pgd() to
reflect changes in the interface.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Tested-by: Tao Liu <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 1 +
arch/x86/mm/pat/set_memory.c | 21 ++++++++++-----------
2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b78644962626..2f321137736c 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -549,6 +549,7 @@ enum pg_level {
PG_LEVEL_2M,
PG_LEVEL_1G,
PG_LEVEL_512G,
+ PG_LEVEL_256T,
PG_LEVEL_NUM
};

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 498812f067cd..a7a7a6c6a3fb 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -662,8 +662,9 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star

/*
* Lookup the page table entry for a virtual address in a specific pgd.
- * Return a pointer to the entry, the level of the mapping, and the effective
- * NX and RW bits of all page table levels.
+ * Return a pointer to the entry (or NULL if the entry does not exist),
+ * the level of the entry, and the effective NX and RW bits of all
+ * page table levels.
*/
pte_t *lookup_address_in_pgd_attr(pgd_t *pgd, unsigned long address,
unsigned int *level, bool *nx, bool *rw)
@@ -672,13 +673,14 @@ pte_t *lookup_address_in_pgd_attr(pgd_t *pgd, unsigned long address,
pud_t *pud;
pmd_t *pmd;

- *level = PG_LEVEL_NONE;
+ *level = PG_LEVEL_256T;
*nx = false;
*rw = true;

if (pgd_none(*pgd))
return NULL;

+ *level = PG_LEVEL_512G;
*nx |= pgd_flags(*pgd) & _PAGE_NX;
*rw &= pgd_flags(*pgd) & _PAGE_RW;

@@ -686,10 +688,10 @@ pte_t *lookup_address_in_pgd_attr(pgd_t *pgd, unsigned long address,
if (p4d_none(*p4d))
return NULL;

- *level = PG_LEVEL_512G;
if (p4d_leaf(*p4d) || !p4d_present(*p4d))
return (pte_t *)p4d;

+ *level = PG_LEVEL_1G;
*nx |= p4d_flags(*p4d) & _PAGE_NX;
*rw &= p4d_flags(*p4d) & _PAGE_RW;

@@ -697,10 +699,10 @@ pte_t *lookup_address_in_pgd_attr(pgd_t *pgd, unsigned long address,
if (pud_none(*pud))
return NULL;

- *level = PG_LEVEL_1G;
if (pud_leaf(*pud) || !pud_present(*pud))
return (pte_t *)pud;

+ *level = PG_LEVEL_2M;
*nx |= pud_flags(*pud) & _PAGE_NX;
*rw &= pud_flags(*pud) & _PAGE_RW;

@@ -708,15 +710,13 @@ pte_t *lookup_address_in_pgd_attr(pgd_t *pgd, unsigned long address,
if (pmd_none(*pmd))
return NULL;

- *level = PG_LEVEL_2M;
if (pmd_leaf(*pmd) || !pmd_present(*pmd))
return (pte_t *)pmd;

+ *level = PG_LEVEL_4K;
*nx |= pmd_flags(*pmd) & _PAGE_NX;
*rw &= pmd_flags(*pmd) & _PAGE_RW;

- *level = PG_LEVEL_4K;
-
return pte_offset_kernel(pmd, address);
}

@@ -736,9 +736,8 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
* Lookup the page table entry for a virtual address. Return a pointer
* to the entry and the level of the mapping.
*
- * Note: We return pud and pmd either when the entry is marked large
- * or when the present bit is not set. Otherwise we would return a
- * pointer to a nonexisting mapping.
+ * Note: the function returns p4d, pud or pmd either when the entry is marked
+ * large or when the present bit is not set. Otherwise it returns NULL.
*/
pte_t *lookup_address(unsigned long address, unsigned int *level)
{
--
2.43.0


2024-05-28 15:01:00

by Kirill A. Shutemov

Subject: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

From: Borislav Petkov <[email protected]>

That identity_mapped() function was loving that "1" label to the point
of completely confusing its readers.

Use named labels in each place for clarity.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/relocate_kernel_64.S | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..085eef5c3904 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
*/
movl $X86_CR4_PAE, %eax
testq $X86_CR4_LA57, %r13
- jz 1f
+ jz .Lno_la57
orl $X86_CR4_LA57, %eax
-1:
+.Lno_la57:
+
movq %rax, %cr4

jmp 1f
@@ -165,9 +166,9 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
* used by kexec. Flush the caches before copying the kernel.
*/
testq %r12, %r12
- jz 1f
+ jz .Lsme_off
wbinvd
-1:
+.Lsme_off:

movq %rcx, %r11
call swap_pages
@@ -187,7 +188,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
*/

testq %r11, %r11
- jnz 1f
+ jnz .Lrelocate
xorl %eax, %eax
xorl %ebx, %ebx
xorl %ecx, %ecx
@@ -208,7 +209,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
ret
int3

-1:
+.Lrelocate:
popq %rdx
leaq PAGE_SIZE(%r10), %rsp
ANNOTATE_RETPOLINE_SAFE
--
2.43.0


2024-05-29 10:48:32

by Nikolay Borisov

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion



On 28.05.24 г. 12:55 ч., Kirill A. Shutemov wrote:
> From: Borislav Petkov <[email protected]>
>
> That identity_mapped() function was loving that "1" label to the point
> of completely confusing its readers.
>
> Use named labels in each place for clarity.
>
> No functional changes.
>
> Signed-off-by: Borislav Petkov (AMD) <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/kernel/relocate_kernel_64.S | 13 +++++++------
> 1 file changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 56cab1bb25f5..085eef5c3904 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> */
> movl $X86_CR4_PAE, %eax
> testq $X86_CR4_LA57, %r13
> - jz 1f
> + jz .Lno_la57
> orl $X86_CR4_LA57, %eax
> -1:
> +.Lno_la57:
> +
> movq %rax, %cr4
>
> jmp 1f

That jmp 1f becomes redundant now as it simply jumps 1 line below.

> @@ -165,9 +166,9 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> * used by kexec. Flush the caches before copying the kernel.
> */
> testq %r12, %r12
> - jz 1f
> + jz .Lsme_off
> wbinvd
> -1:
> +.Lsme_off:
>
> movq %rcx, %r11
> call swap_pages
> @@ -187,7 +188,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> */
>
> testq %r11, %r11
> - jnz 1f
> + jnz .Lrelocate
> xorl %eax, %eax
> xorl %ebx, %ebx
> xorl %ecx, %ecx
> @@ -208,7 +209,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> ret
> int3
>
> -1:
> +.Lrelocate:
> popq %rdx
> leaq PAGE_SIZE(%r10), %rsp
> ANNOTATE_RETPOLINE_SAFE

2024-05-29 11:18:22

by Kirill A. Shutemov

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On Wed, May 29, 2024 at 01:47:50PM +0300, Nikolay Borisov wrote:
>
>
> On 28.05.24 г. 12:55 ч., Kirill A. Shutemov wrote:
> > From: Borislav Petkov <[email protected]>
> >
> > That identity_mapped() function was loving that "1" label to the point
> > of completely confusing its readers.
> >
> > Use named labels in each place for clarity.
> >
> > No functional changes.
> >
> > Signed-off-by: Borislav Petkov (AMD) <[email protected]>
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/kernel/relocate_kernel_64.S | 13 +++++++------
> > 1 file changed, 7 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > index 56cab1bb25f5..085eef5c3904 100644
> > --- a/arch/x86/kernel/relocate_kernel_64.S
> > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > */
> > movl $X86_CR4_PAE, %eax
> > testq $X86_CR4_LA57, %r13
> > - jz 1f
> > + jz .Lno_la57
> > orl $X86_CR4_LA57, %eax
> > -1:
> > +.Lno_la57:
> > +
> > movq %rax, %cr4
> > jmp 1f
>
> That jmp 1f becomes redundant now as it simply jumps 1 line below.
>

Nothing changed wrt this jump. It dates back to initial kexec
implementation.

See 5234f5eb04ab ("[PATCH] kexec: x86_64 kexec implementation").

But I don't see functional need in it.

Anyway, it is outside of the scope of the patch.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-05-29 11:29:42

by Borislav Petkov

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On Wed, May 29, 2024 at 02:17:29PM +0300, Kirill A. Shutemov wrote:
> > That jmp 1f becomes redundant now as it simply jumps 1 line below.
> >
>
> Nothing changed wrt this jump. It dates back to initial kexec
> implementation.
>
> See 5234f5eb04ab ("[PATCH] kexec: x86_64 kexec implementation").
>
> But I don't see functional need in it.
>
> Anyway, it is outside of the scope of the patch.

Yap, Kirill did what Nikolay should've done - git archeology. Please
don't forget to do that next time.

And back in the day they didn't comment non-obvious things because
commenting is for losers. :-\

So that unconditional forward jump either flushes branch prediction on
some old uarch or something else weird, uarch-special.

I doubt we can remove it just like that.

Lemme add Andy - he should know.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-29 11:39:51

by Nikolay Borisov

Subject: Re: [PATCHv11 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest



On 28.05.24 г. 12:55 ч., Kirill A. Shutemov wrote:
> TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
> that bit is cleared during CR4 register reprogramming during boot or
> kexec flows, a #VE exception will be raised which the guest kernel
> cannot handle it.
>
> Therefore, make sure the CR4.MCE setting is preserved over kexec too and
> avoid raising any #VEs.
>
> The change doesn't affect non-TDX-guest environments.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Reviewed-by: Nikolay Borisov <[email protected]>

2024-05-29 12:33:47

by Andrew Cooper

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On 29/05/2024 12:28 pm, Borislav Petkov wrote:
> On Wed, May 29, 2024 at 02:17:29PM +0300, Kirill A. Shutemov wrote:
>>> That jmp 1f becomes redundant now as it simply jumps 1 line below.
>>>
>> Nothing changed wrt this jump. It dates back to initial kexec
>> implementation.
>>
>> See 5234f5eb04ab ("[PATCH] kexec: x86_64 kexec implementation").
>>
>> But I don't see functional need in it.
>>
>> Anyway, it is outside of the scope of the patch.
> Yap, Kirill did what Nikolay should've done - git archeology. Please
> don't forget to do that next time.
>
> And back in the day they didn't comment non-obvious things because
> commenting is for losers. :-\
>
> So that unconditional forward jump either flushes branch prediction on
> some old uarch or something else weird, uarch-special.
>
> I doubt we can remove it just like that.
>
> Lemme add Andy - he should know.

Seems I've gained a reputation...

jmp 1f dates back to ye olde 8086, which started the whole trend of the
instruction pointer just being a figment of the ISA's imagination[1].

Hardware maintains the pointer to the next byte to fetch (the prefetch
queue was up to 6 bytes), and there was a micro-op to subtract the
current length of the prefetch queue from the accumulator.

In those days, the prefetch queue was not coherent with main memory, and
jumps (being a discontinuity in the instruction stream) simply flushed
the prefetch queue.

This was necessary after modifying executable code, because otherwise
you could end up executing stale bytes from the prefetch queue and then
non-stale bytes thereafter.  (Otherwise known as the way to distinguish
the 8086 from the 8088 because the latter only had a 4 byte prefetch queue.)

Anyway.  It's how you used to spell "serialising operation" before that
term ever entered the architecture.  Linux still supports CPUs prior to
the Pentium, so still needs to care about prefetch queues in the 486.

However, this example appears to be in 64bit code and following a write
to CR4 which will be fully serialising, so it's probably copy&paste from
32bit code where it would be necessary in principle.

~Andrew

[1]
https://www.righto.com/2023/01/inside-8086-processors-instruction.html#fn:pc

In fact, anyone who hasn't should read the entire series on the 8086,
https://www.righto.com/p/index.html

2024-05-29 15:35:01

by Borislav Petkov

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On Wed, May 29, 2024 at 01:33:35PM +0100, Andrew Cooper wrote:
> Seems I've gained a reputation...

Yes you have. You have this weird interest in very deep uarch details
that I can't share. Not at that detail. :-P

> jmp 1f dates back to ye olde 8086, which started the whole trend of the
> instruction pointer just being a figment of the ISA's imagination[1].
>
> Hardware maintains the pointer to the next byte to fetch (the prefetch
> queue was up to 6 bytes), and there was a micro-op to subtract the
> current length of the prefetch queue from the accumulator.
>
> In those days, the prefetch queue was not coherent with main memory, and
> jumps (being a discontinuity in the instruction stream) simply flushed
> the prefetch queue.
>
> This was necessary after modifying executable code, because otherwise
> you could end up executing stale bytes from the prefetch queue and then
> non-stale bytes thereafter.  (Otherwise known as the way to distinguish
> the 8086 from the 8088 because the latter only had a 4 byte prefetch queue.)

Thanks - that certainly wakes up a long-asleep neuron in the back of my
mind...

> Anyway.  It's how you used to spell "serialising operation" before that
> term ever entered the architecture.  Linux still supports CPUs prior to
> the Pentium, so still needs to care about prefetch queues in the 486.
>
> However, this example appears to be in 64bit code and following a write
> to CR4 which will be fully serialising, so it's probably copy&paste from
> 32bit code where it would be necessary in principle.

Yap, fully agreed. We could try to remove it and see what complains.

Nikolay, wanna do a patch which properly explains the situation?

> https://www.righto.com/2023/01/inside-8086-processors-instruction.html#fn:pc
>
> In fact, anyone who hasn't should read the entire series on the 8086,
> https://www.righto.com/p/index.html

Oh yeah, already bookmarked.

Thanks Andy!

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-30 23:37:03

by Kalra, Ashish

Subject: [PATCH v7 0/3] x86/snp: Add kexec support

From: Ashish Kalra <[email protected]>

The patchset adds bits and pieces to get kexec (and crashkernel) working on
SNP guests.

The series is based off of and tested against Kirill Shutemov's tree:
https://github.com/intel/tdx.git guest-kexec

----

v7:
- Rebased onto current tip/master;
- Moved back to checking the md attribute instead of checking the
efi_setup for detecting if running under kexec kernel as
suggested in upstream review feedback.

v6:
- Updated and restructured the commit message for patch 1/3 to
explain the issue in detail.
- Updated inline comments in patch 1/3 to explain the issue in
detail.
- Moved back to checking efi_setup for detecting if running
under kexec kernel.

v5:
- Removed sev_es_enabled() function and using sev_status directly to
check for SEV-ES/SEV-SNP guest.
- used --base option to generate patches to specify Kirill's TDX guest
kexec patches as prerequisite patches to fix kernel test robot
build errors.

v4:
- Rebased to current tip/master.
- Reviewed-bys from Sathya.
- Remove snp_kexec_unprep_rom_memory() as it is not needed any more as
SEV-SNP code is not validating the ROM range in probe_roms() anymore.
- Fix kernel test robot build error/warnings.

v3:
- Rebased;
- moved Keep page tables that maps E820_TYPE_ACPI patch to Kirill's tdx
guest kexec patch series.
- checking the md attribute instead of checking the efi_setup for
detecting if running under kexec kernel.
- added new sev_es_enabled() function.
- skip video memory access in decompressor for SEV-ES/SNP systems to
prevent guest termination as boot stage2 #VC handler does not handle
MMIO.

v2:
- address zeroing of unaccepted memory table mappings at all page table levels
adding phys_pte_init(), phys_pud_init() and phys_p4d_init().
- include skip efi_arch_mem_reserve() in case of kexec as part of this
patch set.
- rename last_address_shd_kexec to a more appropriate
kexec_last_address_to_make_private.
- remove duplicate code shared with TDX and use common interfaces
defined for SNP and TDX for kexec/kdump.
- remove set_pte_enc() dependency on pg_level_to_pfn() and make the
function simpler.
- rename unshare_pte() to make_pte_private().
- clarify and make the comment for using kexec_last_address_to_make_private
more understandable.
- general cleanup.

Ashish Kalra (3):
efi/x86: Fix EFI memory map corruption with kexec
x86/boot/compressed: Skip Video Memory access in Decompressor for
SEV-ES/SNP.
x86/snp: Convert shared memory back to private on kexec

arch/x86/boot/compressed/misc.c | 6 +-
arch/x86/include/asm/sev.h | 4 +
arch/x86/kernel/sev.c | 162 ++++++++++++++++++++++++++++++++
arch/x86/mm/mem_encrypt_amd.c | 3 +
arch/x86/platform/efi/quirks.c | 30 +++++-
5 files changed, 200 insertions(+), 5 deletions(-)


base-commit: f8441cd55885e43eb0d4e8eedc6c5ab15d2dabf1
prerequisite-patch-id: a911f230c2524bd791c47f62f17f0a93cbf726b6
prerequisite-patch-id: bfe2fa046349978ac1825275eb205acecfbc22f3
prerequisite-patch-id: 5e60d292457c7cd98fd3e45c23127e9463b56a69
prerequisite-patch-id: 1f97d0a2edb7509dd58276f628d1a4bda62c154c
prerequisite-patch-id: 6e07f4d4ac95ad1d2c7750ebd3e87483fb9fd48f
prerequisite-patch-id: 24ec385d6a89cf2c8553c6d29515cc513643a68a
prerequisite-patch-id: 6a8bda2b3cf9bfab8177acdcfc8dd0408ed129fa
prerequisite-patch-id: 99382c42348b9a076ba930eca0dfc9d000ec951d
prerequisite-patch-id: 469a0a3c78b0eca82527cd85e2205fb8fb89d645
prerequisite-patch-id: 2be870cdf58bdc6a10ca3c18bf874e5c6cfb7e42
prerequisite-patch-id: 7fc62697fb6bdade0bab66ba2b45a19759008f9e
prerequisite-patch-id: 95356474298029468750a9c1bc2224fb09a86eed
prerequisite-patch-id: d4966ae63e86d24b0bf578da4dae871cd9002b12
prerequisite-patch-id: fccde6f1fa385b5af0195f81fcb95acd71822428
prerequisite-patch-id: 16048ee15e392b0b9217b8923939b0059311abd2
prerequisite-patch-id: 5c9ae9aa294f72f63ae2c3551507dfbd92525803
prerequisite-patch-id: 758bdb686290c018cbd5b7d005354019f9d15248
prerequisite-patch-id: c85fd0bb6d183a40da73720eaa607481b1d51daf
prerequisite-patch-id: 60760e0c98ab7ccd2ca22ae3e9f20ff5a94c6e91
--
2.34.1


2024-05-30 23:37:20

by Kalra, Ashish

Subject: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

From: Ashish Kalra <[email protected]>

With SNP guest kexec, the following EFI memmap corruption is observed:

[ 0.000000] efi: EFI v2.7 by EDK II
[ 0.000000] efi: SMBIOS=0x7e33f000 SMBIOS 3.0=0x7e33d000 ACPI=0x7e57e000 ACPI 2.0=0x7e57e014 MEMATTR=0x7cc3c018 Unaccepted=0x7c09e018
[ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
[ 0.000000] efi: mem03: [type=269370880|attr=0x0e42100e42180e41] range=[0x0486200e41038c18-0x200e898a0eee713ac17] (invalid)
[ 0.000000] efi: mem04: [type=12336|attr=0x0e410686300e4105] range=[0x100e420000000176-0x8c290f26248d200e175] (invalid)
[ 0.000000] efi: mem06: [type=1124304408|attr=0x000030b400000028] range=[0x0e51300e45280e77-0xb44ed2142f460c1e76] (invalid)
[ 0.000000] efi: mem08: [type=68|attr=0x300e540583280e41] range=[0x0000011affff3cd8-0x486200e54b38c0bcd7] (invalid)
[ 0.000000] efi: mem10: [type=1107529240|attr=0x0e42280e41300e41] range=[0x300e41058c280e42-0x38010ae54c5c328ee41] (invalid)
[ 0.000000] efi: mem11: [type=189335566|attr=0x048d200e42038e18] range=[0x0000318c00000048-0xe42029228ce4200047] (invalid)
[ 0.000000] efi: mem12: [type=239142534|attr=0x0000002400000b4b] range=[0x0e41380e0a7d700e-0x80f26238f22bfe500d] (invalid)
[ 0.000000] efi: mem14: [type=239207055|attr=0x0e41300e43380e0a] range=[0x8c280e42048d200e-0xc70b028f2f27cc0a00d] (invalid)
[ 0.000000] efi: mem15: [type=239210510|attr=0x00080e660b47080e] range=[0x0000324c0000001c-0xa78028634ce490001b] (invalid)
[ 0.000000] efi: mem16: [type=4294848528|attr=0x0000329400000014] range=[0x0e410286100e4100-0x80f252036a218f20ff] (invalid)
[ 0.000000] efi: mem19: [type=2250772033|attr=0x42180e42200e4328] range=[0x41280e0ab9020683-0xe0e538c28b39e62682] (invalid)
[ 0.000000] efi: mem20: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x00000008ffff4438-0xffff44340090333c437] (invalid)
[ 0.000000] efi: mem22: [Reserved |attr=0x000000c1ffff4420] range=[0xffff442400003398-0x1033a04240003f397] (invalid)
[ 0.000000] efi: mem23: [type=1141080856|attr=0x080e41100e43180e] range=[0x280e66300e4b280e-0x440dc5ee7141f4c080d] (invalid)
[ 0.000000] efi: mem25: [Reserved |attr=0x0000000affff44a0] range=[0xffff44a400003428-0x1034304a400013427] (invalid)
[ 0.000000] efi: mem28: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x0000000affff4488-0xffff448400b034bc487] (invalid)
[ 0.000000] efi: mem30: [Reserved |attr=0x0000000affff4470] range=[0xffff447400003518-0x10352047400013517] (invalid)
[ 0.000000] efi: mem33: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x0000000affff4458-0xffff445400b035ac457] (invalid)
[ 0.000000] efi: mem35: [type=269372416|attr=0x0e42100e42180e41] range=[0x0486200e44038c18-0x200e8b8a0eee823ac17] (invalid)
[ 0.000000] efi: mem37: [type=2351435330|attr=0x0e42100e42180e42] range=[0x470783380e410686-0x2002b2a041c2141e685] (invalid)
[ 0.000000] efi: mem38: [type=1093668417|attr=0x100e420000000270] range=[0x42100e42180e4220-0xfff366a4e421b78c21f] (invalid)
[ 0.000000] efi: mem39: [type=76357646|attr=0x180e42200e42280e] range=[0x0e410686300e4105-0x4130f251a0710ae5104] (invalid)
[ 0.000000] efi: mem40: [type=940444268|attr=0x0e42200e42280e41] range=[0x180e42200e42280e-0x300fc71c300b4f2480d] (invalid)
[ 0.000000] efi: mem41: [MMIO |attr=0x8c280e42048d200e] range=[0xffff479400003728-0x42138e0c87820292727] (invalid)
[ 0.000000] efi: mem42: [type=1191674680|attr=0x0000004c0000000b] range=[0x300e41380e0a0246-0x470b0f26238f22b8245] (invalid)
[ 0.000000] efi: mem43: [type=2010|attr=0x0301f00e4d078338] range=[0x45038e180e42028f-0xe4556bf118f282528e] (invalid)
[ 0.000000] efi: mem44: [type=1109921345|attr=0x300e44000000006c] range=[0x44080e42100e4218-0xfff39254e42138ac217] (invalid)
..

This EFI memmap corruption happens when efi_arch_mem_reserve() is invoked
during a kexec boot.

efi_arch_mem_reserve() is invoked with the following call stack:

[ 0.310010] efi_arch_mem_reserve+0xb1/0x220
[ 0.311382] efi_mem_reserve+0x36/0x60
[ 0.311973] efi_bgrt_init+0x17d/0x1a0
[ 0.313265] acpi_parse_bgrt+0x12/0x20
[ 0.313858] acpi_table_parse+0x77/0xd0
[ 0.314463] acpi_boot_init+0x362/0x630
[ 0.315069] setup_arch+0xa88/0xf80
[ 0.315629] start_kernel+0x68/0xa90
[ 0.316194] x86_64_start_reservations+0x1c/0x30
[ 0.316921] x86_64_start_kernel+0xbf/0x110
[ 0.317582] common_startup_64+0x13e/0x141

efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for the
EFI memory map. Because this happens early in boot, the allocation comes
from memblock.

Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
in case of a kexec-ed kernel boot.

This function kexec_enter_virtual_mode() installs the new EFI memory map by
calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.

Subsequently, when memblock is freed later in boot flow, this remapped
efi_memmap will have random corruption (similar to a use-after-free scenario).

The corrupted EFI memory map is then passed to the next kexec-ed kernel
which causes a panic when trying to use the corrupted EFI memory map.

Fix this EFI memory map corruption by skipping efi_arch_mem_reserve() for kexec.

Additionally, efi_mem_reserve() is used to reserve boot service memory,
e.g. BGRT, but this is not necessary for a kexec boot, as there are no
boot services at all after the first kernel's ExitBootServices().

The UEFI memmap passed to the kexec kernel includes not only the runtime
service memory map but also the boot service memory ranges that were
reserved by the first kernel with efi_mem_reserve(), and those boot service
memory ranges have already been marked with the EFI_MEMORY_RUNTIME attribute.

This is an additional reason why efi_mem_reserve() can be skipped for
kexec boots. The skip is detected by checking for the EFI_MEMORY_RUNTIME
attribute.

Suggested-by: Dave Young <[email protected]>
[Dave Young: checking the md attribute instead of checking the efi_setup]
Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/platform/efi/quirks.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f0cc00032751..6f398c59278a 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -255,15 +255,39 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
struct efi_memory_map_data data = { 0 };
struct efi_mem_range mr;
efi_memory_desc_t md;
- int num_entries;
+ int num_entries, ret;
void *new;

- if (efi_mem_desc_lookup(addr, &md) ||
- md.type != EFI_BOOT_SERVICES_DATA) {
+ /*
+ * efi_mem_reserve() is used to reserve boot service memory, eg. bgrt,
+ * but it is not necessary for kexec, as there are no boot services in
+ * kexec reboot at all after the first kernel's ExitBootServices().
+ *
+ * Additionally kexec_enter_virtual_mode() during late init will remap
+ * the efi_memmap physical pages allocated here via memblock & then
+ * subsequently cause random EFI memmap corruption once memblock is freed.
+ *
+ * Therefore, skip efi_mem_reserve for kexec booting by checking the
+ * EFI_MEMORY_RUNTIME attribute which indicates boot service memory
+ * ranges reserved by the first kernel using efi_mem_reserve and marked
+ * with EFI_MEMORY_RUNTIME attribute.
+ */
+
+ ret = efi_mem_desc_lookup(addr, &md);
+ if (ret) {
pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
return;
}

+ if (md.type != EFI_BOOT_SERVICES_DATA) {
+ pr_err("Skip reserving non EFI Boot Service Data memory for %pa\n", &addr);
+ return;
+ }
+
+ /* Kexec copied the efi memmap from the first kernel, thus skip the case */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
return;
--
2.34.1


2024-05-30 23:38:37

by Kalra, Ashish

Subject: [PATCH v7 2/3] x86/boot/compressed: Skip Video Memory access in Decompressor for SEV-ES/SNP.

From: Ashish Kalra <[email protected]>

Accessing guest video memory/RAM in the kernel decompressor
causes guest termination, because the boot stage2 #VC handler for
SEV-ES/SNP systems does not support MMIO handling.

This issue is observed with SEV-ES/SNP guest kexec as
kexec -c adds screen_info to the boot parameters
passed to the kexec kernel, which causes console output to
be dumped to both video and serial.

As the decompressor output gets cleared really fast, it is
preferable to get the console output only on serial, hence,
skip accessing video RAM during decompressor stage to
prevent guest termination.

Serial console output during the decompressor stage works because
the boot stage2 #VC handler already supports handling port I/O.

Suggested-by: Thomas Lendacky <[email protected]>
Signed-off-by: Ashish Kalra <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/boot/compressed/misc.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index b70e4a21c15f..3b9f96b3dbcc 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -427,8 +427,10 @@ asmlinkage __visible void *extract_kernel(void *rmode, unsigned char *output)
vidport = 0x3d4;
}

- lines = boot_params_ptr->screen_info.orig_video_lines;
- cols = boot_params_ptr->screen_info.orig_video_cols;
+ if (!(sev_status & MSR_AMD64_SEV_ES_ENABLED)) {
+ lines = boot_params_ptr->screen_info.orig_video_lines;
+ cols = boot_params_ptr->screen_info.orig_video_cols;
+ }

init_default_io_ops();

--
2.34.1


2024-05-30 23:39:03

by Kalra, Ashish

Subject: [PATCH v7 3/3] x86/snp: Convert shared memory back to private on kexec

From: Ashish Kalra <[email protected]>

SNP guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The second kernel has no idea what memory is converted this way. It only
sees E820_TYPE_RAM.

Accessing shared memory via a private mapping will cause unrecoverable RMP
page-faults.

On kexec, walk the direct mapping and convert all shared memory back to
private. This makes all RAM private again, so the second kernel can use it
normally. Additionally, for SNP guests, convert all bss decrypted section
pages back to private.

The conversion occurs in two steps: stopping new conversions and
unsharing all memory. In the case of normal kexec, the stopping of
conversions takes place while scheduling is still functioning. This
allows for waiting until any ongoing conversions are finished. The
second step is carried out when all CPUs except one are inactive and
interrupts are disabled. This prevents any conflicts with code that may
access shared memory.

Signed-off-by: Ashish Kalra <[email protected]>
---
arch/x86/include/asm/sev.h | 4 +
arch/x86/kernel/sev.c | 162 ++++++++++++++++++++++++++++++++++
arch/x86/mm/mem_encrypt_amd.c | 3 +
3 files changed, 169 insertions(+)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index ca20cc4e5826..f9b0a4eb1980 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -229,6 +229,8 @@ void snp_accept_memory(phys_addr_t start, phys_addr_t end);
u64 snp_get_unsupported_features(u64 status);
u64 sev_get_status(void);
void sev_show_status(void);
+void snp_kexec_finish(void);
+void snp_kexec_begin(bool crash);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -258,6 +260,8 @@ static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
static inline u64 snp_get_unsupported_features(u64 status) { return 0; }
static inline u64 sev_get_status(void) { return 0; }
static inline void sev_show_status(void) { }
+static inline void snp_kexec_finish(void) { }
+static inline void snp_kexec_begin(bool crash) { }
#endif

#ifdef CONFIG_KVM_AMD_SEV
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 3342ed58e168..941f3996a9b6 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -42,6 +42,8 @@
#include <asm/apic.h>
#include <asm/cpuid.h>
#include <asm/cmdline.h>
+#include <asm/pgtable.h>
+#include <asm/set_memory.h>

#define DR7_RESET_VALUE 0x400

@@ -92,6 +94,9 @@ static struct ghcb *boot_ghcb __section(".data");
/* Bitmap of SEV features supported by the hypervisor */
static u64 sev_hv_features __ro_after_init;

+/* Last address to be switched to private during kexec */
+static unsigned long kexec_last_addr_to_make_private;
+
/* #VC handler runtime per-CPU data */
struct sev_es_runtime_data {
struct ghcb ghcb_page;
@@ -913,6 +918,163 @@ void snp_accept_memory(phys_addr_t start, phys_addr_t end)
set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
}

+static bool set_pte_enc(pte_t *kpte, int level, void *va)
+{
+ pte_t new_pte;
+
+ if (pte_none(*kpte))
+ return false;
+
+ /*
+ * Change the physical page attribute from C=0 to C=1. Flush the
+ * caches to ensure that data gets accessed with the correct C-bit.
+ */
+ if (pte_present(*kpte))
+ clflush_cache_range(va, page_level_size(level));
+
+ new_pte = __pte(cc_mkenc(pte_val(*kpte)));
+ set_pte_atomic(kpte, new_pte);
+
+ return true;
+}
+
+static bool make_pte_private(pte_t *pte, unsigned long addr, int pages, int level)
+{
+ struct sev_es_runtime_data *data;
+ struct ghcb *ghcb;
+
+ data = this_cpu_read(runtime_data);
+ ghcb = &data->ghcb_page;
+
+ /* Check for GHCB for being part of a PMD range. */
+ if ((unsigned long)ghcb >= addr &&
+ (unsigned long)ghcb <= (addr + (pages * PAGE_SIZE))) {
+ /*
+ * Ensure that the current cpu's GHCB is made private
+ * at the end of unshared loop so that we continue to use the
+ * optimized GHCB protocol and not force the switch to
+ * MSR protocol till the very end.
+ */
+ pr_debug("setting boot_ghcb to NULL for this cpu ghcb\n");
+ kexec_last_addr_to_make_private = addr;
+ return true;
+ }
+
+ if (!set_pte_enc(pte, level, (void *)addr))
+ return false;
+
+ snp_set_memory_private(addr, pages);
+
+ return true;
+}
+
+static void unshare_all_memory(void)
+{
+ unsigned long addr, end;
+
+ /*
+ * Walk the direct mapping and convert all shared memory back to private.
+ */
+
+ addr = PAGE_OFFSET;
+ end = PAGE_OFFSET + get_max_mapped();
+
+ while (addr < end) {
+ unsigned long size;
+ unsigned int level;
+ pte_t *pte;
+
+ pte = lookup_address(addr, &level);
+ size = page_level_size(level);
+
+ /*
+ * pte_none() check is required to skip physical memory holes in the direct mapping.
+ */
+ if (pte && pte_decrypted(*pte) && !pte_none(*pte)) {
+ int pages = size / PAGE_SIZE;
+
+ if (!make_pte_private(pte, addr, pages, level)) {
+ pr_err("Failed to unshare range %#lx-%#lx\n",
+ addr, addr + size);
+ }
+
+ }
+
+ addr += size;
+ }
+ __flush_tlb_all();
+
+}
+
+static void unshare_all_bss_decrypted_memory(void)
+{
+ unsigned long vaddr, vaddr_end;
+ unsigned int level;
+ unsigned int npages;
+ pte_t *pte;
+
+ vaddr = (unsigned long)__start_bss_decrypted;
+ vaddr_end = (unsigned long)__start_bss_decrypted_unused;
+ npages = (vaddr_end - vaddr) >> PAGE_SHIFT;
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+ pte = lookup_address(vaddr, &level);
+ if (!pte || !pte_decrypted(*pte) || pte_none(*pte))
+ continue;
+
+ set_pte_enc(pte, level, (void *)vaddr);
+ }
+ vaddr = (unsigned long)__start_bss_decrypted;
+ snp_set_memory_private(vaddr, npages);
+}
+
+/* Stop new private<->shared conversions */
+void snp_kexec_begin(bool crash)
+{
+ /*
+ * Crash kernel reaches here with interrupts disabled: can't wait for
+ * conversions to finish.
+ *
+ * If a race happened, just report and proceed.
+ */
+ bool wait_for_lock = !crash;
+
+ if (!set_memory_enc_stop_conversion(wait_for_lock))
+ pr_warn("Failed to stop shared<->private conversions\n");
+}
+
+/* Walk direct mapping and convert all shared memory back to private */
+void snp_kexec_finish(void)
+{
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return;
+
+ unshare_all_memory();
+
+ unshare_all_bss_decrypted_memory();
+
+ if (kexec_last_addr_to_make_private) {
+ unsigned long size;
+ unsigned int level;
+ pte_t *pte;
+
+ /*
+ * Switch to using the MSR protocol to change this CPU's
+ * GHCB to private.
+ * All the per-CPU GHCBs have been switched back to private,
+ * so no more GHCB calls can be made to the hypervisor beyond
+ * this point until the kexec kernel starts running.
+ */
+ boot_ghcb = NULL;
+ sev_cfg.ghcbs_initialized = false;
+
+ pr_debug("boot ghcb 0x%lx\n", kexec_last_addr_to_make_private);
+ pte = lookup_address(kexec_last_addr_to_make_private, &level);
+ size = page_level_size(level);
+ set_pte_enc(pte, level, (void *)kexec_last_addr_to_make_private);
+ snp_set_memory_private(kexec_last_addr_to_make_private, (size / PAGE_SIZE));
+ }
+}
+
static int snp_set_vmsa(void *va, bool vmsa)
{
u64 attrs;
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index e7b67519ddb5..3ba792cd28ef 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -468,6 +468,9 @@ void __init sme_early_init(void)
x86_platform.guest.enc_tlb_flush_required = amd_enc_tlb_flush_required;
x86_platform.guest.enc_cache_flush_required = amd_enc_cache_flush_required;

+ x86_platform.guest.enc_kexec_begin = snp_kexec_begin;
+ x86_platform.guest.enc_kexec_finish = snp_kexec_finish;
+
/*
* AMD-SEV-ES intercepts the RDMSR to read the X2APIC ID in the
* parallel bringup low level code. That raises #VC which cannot be
--
2.34.1
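
The deferral done by make_pte_private() and snp_kexec_finish() above can be
sketched in plain userspace C. This is only an illustration under stated
assumptions: struct range, range_contains() and unshare_all() are hypothetical
names, the "conversion" is just a flag flip, and the sketch uses a half-open
interval test where the patch compares the upper bound with <=.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE_4K 4096UL

/* Hypothetical model of one shared mapping in the direct map. */
struct range {
	unsigned long addr;	/* start virtual address */
	unsigned long pages;	/* length in 4K pages */
	bool shared;		/* true until converted back to private */
};

/* Half-open containment test: addr <= p < addr + pages * PAGE_SIZE_4K. */
static bool range_contains(const struct range *r, unsigned long p)
{
	return p >= r->addr && p < r->addr + r->pages * PAGE_SIZE_4K;
}

/*
 * Convert every shared range to private, but postpone the one that holds
 * the communication page still in use (the GHCB in the patch).
 * Returns the index of the range converted last, or -1 if none was deferred.
 */
static int unshare_all(struct range *r, size_t n, unsigned long ghcb)
{
	int deferred = -1;

	assert(r != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!r[i].shared)
			continue;
		if (range_contains(&r[i], ghcb)) {
			deferred = (int)i;	/* remember it, convert later */
			continue;
		}
		r[i].shared = false;
	}

	if (deferred >= 0)
		r[deferred].shared = false;	/* the very last conversion */

	return deferred;
}
```

The point of the ordering is that the range holding the in-use page stays
shared while all other ranges are converted, and is flipped only once nothing
else needs it.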


2024-05-31 09:32:06

by Alexander Kuleshov

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On 30.05.2024 23:36, Ashish Kalra wrote:
>From: Ashish Kalra <[email protected]>
>+ * but it is not neccasery for kexec, as there are no boot services in

A typo in necessary

2024-06-03 08:57:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Thu, May 30, 2024 at 11:36:55PM +0000, Ashish Kalra wrote:
> From: Ashish Kalra <[email protected]>
>
> With SNP guest kexec, the following EFI memmap corruption is observed:
>
> [ 0.000000] efi: EFI v2.7 by EDK II
> [ 0.000000] efi: SMBIOS=0x7e33f000 SMBIOS 3.0=0x7e33d000 ACPI=0x7e57e000 ACPI 2.0=0x7e57e014 MEMATTR=0x7cc3c018 Unaccepted=0x7c09e018
> [ 0.000000] efi: [Firmware Bug]: Invalid EFI memory map entries:
> [ 0.000000] efi: mem03: [type=269370880|attr=0x0e42100e42180e41] range=[0x0486200e41038c18-0x200e898a0eee713ac17] (invalid)
> [ 0.000000] efi: mem04: [type=12336|attr=0x0e410686300e4105] range=[0x100e420000000176-0x8c290f26248d200e175] (invalid)
> [ 0.000000] efi: mem06: [type=1124304408|attr=0x000030b400000028] range=[0x0e51300e45280e77-0xb44ed2142f460c1e76] (invalid)
> [ 0.000000] efi: mem08: [type=68|attr=0x300e540583280e41] range=[0x0000011affff3cd8-0x486200e54b38c0bcd7] (invalid)
> [ 0.000000] efi: mem10: [type=1107529240|attr=0x0e42280e41300e41] range=[0x300e41058c280e42-0x38010ae54c5c328ee41] (invalid)
> [ 0.000000] efi: mem11: [type=189335566|attr=0x048d200e42038e18] range=[0x0000318c00000048-0xe42029228ce4200047] (invalid)
> [ 0.000000] efi: mem12: [type=239142534|attr=0x0000002400000b4b] range=[0x0e41380e0a7d700e-0x80f26238f22bfe500d] (invalid)
> [ 0.000000] efi: mem14: [type=239207055|attr=0x0e41300e43380e0a] range=[0x8c280e42048d200e-0xc70b028f2f27cc0a00d] (invalid)
> [ 0.000000] efi: mem15: [type=239210510|attr=0x00080e660b47080e] range=[0x0000324c0000001c-0xa78028634ce490001b] (invalid)
> [ 0.000000] efi: mem16: [type=4294848528|attr=0x0000329400000014] range=[0x0e410286100e4100-0x80f252036a218f20ff] (invalid)
> [ 0.000000] efi: mem19: [type=2250772033|attr=0x42180e42200e4328] range=[0x41280e0ab9020683-0xe0e538c28b39e62682] (invalid)
> [ 0.000000] efi: mem20: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x00000008ffff4438-0xffff44340090333c437] (invalid)
> [ 0.000000] efi: mem22: [Reserved |attr=0x000000c1ffff4420] range=[0xffff442400003398-0x1033a04240003f397] (invalid)
> [ 0.000000] efi: mem23: [type=1141080856|attr=0x080e41100e43180e] range=[0x280e66300e4b280e-0x440dc5ee7141f4c080d] (invalid)
> [ 0.000000] efi: mem25: [Reserved |attr=0x0000000affff44a0] range=[0xffff44a400003428-0x1034304a400013427] (invalid)
> [ 0.000000] efi: mem28: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x0000000affff4488-0xffff448400b034bc487] (invalid)
> [ 0.000000] efi: mem30: [Reserved |attr=0x0000000affff4470] range=[0xffff447400003518-0x10352047400013517] (invalid)
> [ 0.000000] efi: mem33: [type=16| | | | | | | | | | |WB| |WC| ] range=[0x0000000affff4458-0xffff445400b035ac457] (invalid)
> [ 0.000000] efi: mem35: [type=269372416|attr=0x0e42100e42180e41] range=[0x0486200e44038c18-0x200e8b8a0eee823ac17] (invalid)
> [ 0.000000] efi: mem37: [type=2351435330|attr=0x0e42100e42180e42] range=[0x470783380e410686-0x2002b2a041c2141e685] (invalid)
> [ 0.000000] efi: mem38: [type=1093668417|attr=0x100e420000000270] range=[0x42100e42180e4220-0xfff366a4e421b78c21f] (invalid)
> [ 0.000000] efi: mem39: [type=76357646|attr=0x180e42200e42280e] range=[0x0e410686300e4105-0x4130f251a0710ae5104] (invalid)
> [ 0.000000] efi: mem40: [type=940444268|attr=0x0e42200e42280e41] range=[0x180e42200e42280e-0x300fc71c300b4f2480d] (invalid)
> [ 0.000000] efi: mem41: [MMIO |attr=0x8c280e42048d200e] range=[0xffff479400003728-0x42138e0c87820292727] (invalid)
> [ 0.000000] efi: mem42: [type=1191674680|attr=0x0000004c0000000b] range=[0x300e41380e0a0246-0x470b0f26238f22b8245] (invalid)
> [ 0.000000] efi: mem43: [type=2010|attr=0x0301f00e4d078338] range=[0x45038e180e42028f-0xe4556bf118f282528e] (invalid)
> [ 0.000000] efi: mem44: [type=1109921345|attr=0x300e44000000006c] range=[0x44080e42100e4218-0xfff39254e42138ac217] (invalid)
> ...
>
> This EFI memmap corruption is happening with efi_arch_mem_reserve() invocation in case of kexec boot.
>
> ( efi_arch_mem_reserve() is invoked with the following call-stack: )
>
> [ 0.310010] efi_arch_mem_reserve+0xb1/0x220
> [ 0.311382] efi_mem_reserve+0x36/0x60
> [ 0.311973] efi_bgrt_init+0x17d/0x1a0
> [ 0.313265] acpi_parse_bgrt+0x12/0x20
> [ 0.313858] acpi_table_parse+0x77/0xd0
> [ 0.314463] acpi_boot_init+0x362/0x630
> [ 0.315069] setup_arch+0xa88/0xf80
> [ 0.315629] start_kernel+0x68/0xa90
> [ 0.316194] x86_64_start_reservations+0x1c/0x30
> [ 0.316921] x86_64_start_kernel+0xbf/0x110
> [ 0.317582] common_startup_64+0x13e/0x141
>
> efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> EFI memory map and due to early allocation it uses memblock allocation.
>
> Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> in case of a kexec-ed kernel boot.
>
> This function kexec_enter_virtual_mode() installs the new EFI memory map by
> calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
> in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
>
> Subsequently, when memblock is freed later in boot flow, this remapped
> efi_memmap will have random corruption (similar to a use-after-free scenario).
>
> The corrupted EFI memory map is then passed to the next kexec-ed kernel
> which causes a panic when trying to use the corrupted EFI memory map.

This sounds fishy: memblock allocated memory is not freed later in the
boot - it remains reserved. Only free memory is freed from memblock to
the buddy allocator.

Or is the problem that memblock-allocated memory cannot be memremapped
because *raisins*?

Mike?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-03 13:08:13

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On 6/3/2024 3:56 AM, Borislav Petkov wrote

>> EFI memory map and due to early allocation it uses memblock allocation.
>>
>> Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
>> in case of a kexec-ed kernel boot.
>>
>> This function kexec_enter_virtual_mode() installs the new EFI memory map by
>> calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
>> in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
>>
>> Subsequently, when memblock is freed later in boot flow, this remapped
>> efi_memmap will have random corruption (similar to a use-after-free scenario).
>>
>> The corrupted EFI memory map is then passed to the next kexec-ed kernel
>> which causes a panic when trying to use the corrupted EFI memory map.
> This sounds fishy: memblock allocated memory is not freed later in the
> boot - it remains reserved. Only free memory is freed from memblock to
> the buddy allocator.
>
> Or is the problem that memblock-allocated memory cannot be memremapped
> because *raisins*?

This is what seems to be happening:

efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
EFI memory map and due to early allocation it uses memblock allocation.

And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
in case of a kexec-ed kernel boot.

This function kexec_enter_virtual_mode() installs the new EFI memory map by
calling efi_memmap_init_late() which does memremap() on memblock-allocated memory.

Thanks, Ashish

>
> Mike?
>

2024-06-03 13:41:54

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote:
> On 6/3/2024 3:56 AM, Borislav Petkov wrote
>
> > > EFI memory map and due to early allocation it uses memblock allocation.
> > >
> > > Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> > > in case of a kexec-ed kernel boot.
> > >
> > > This function kexec_enter_virtual_mode() installs the new EFI memory map by
> > > calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
> > > in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
> > >
> > > Subsequently, when memblock is freed later in boot flow, this remapped
> > > efi_memmap will have random corruption (similar to a use-after-free scenario).
> > >
> > > The corrupted EFI memory map is then passed to the next kexec-ed kernel
> > > which causes a panic when trying to use the corrupted EFI memory map.
> > This sounds fishy: memblock allocated memory is not freed later in the
> > boot - it remains reserved. Only free memory is freed from memblock to
> > the buddy allocator.
> >
> > Or is the problem that memblock-allocated memory cannot be memremapped
> > because *raisins*?
>
> This is what seems to be happening:
>
> efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> EFI memory map and due to early allocation it uses memblock allocation.
>
> And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> in case of a kexec-ed kernel boot.
>
> This function kexec_enter_virtual_mode() installs the new EFI memory map by
> calling efi_memmap_init_late() which does memremap() on memblock-allocated memory.

Does the issue happen only with SNP?

I didn't really dig, but my theory would be that it has something to do
with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c

> Thanks, Ashish

--
Sincerely yours,
Mike.

2024-06-03 14:02:12

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On 6/3/2024 8:39 AM, Mike Rapoport wrote:

> On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote:
>> On 6/3/2024 3:56 AM, Borislav Petkov wrote
>>
>>>> EFI memory map and due to early allocation it uses memblock allocation.
>>>>
>>>> Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
>>>> in case of a kexec-ed kernel boot.
>>>>
>>>> This function kexec_enter_virtual_mode() installs the new EFI memory map by
>>>> calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
>>>> in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
>>>>
>>>> Subsequently, when memblock is freed later in boot flow, this remapped
>>>> efi_memmap will have random corruption (similar to a use-after-free scenario).
>>>>
>>>> The corrupted EFI memory map is then passed to the next kexec-ed kernel
>>>> which causes a panic when trying to use the corrupted EFI memory map.
>>> This sounds fishy: memblock allocated memory is not freed later in the
>>> boot - it remains reserved. Only free memory is freed from memblock to
>>> the buddy allocator.
>>>
>>> Or is the problem that memblock-allocated memory cannot be memremapped
>>> because *raisins*?
>> This is what seems to be happening:
>>
>> efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
>> EFI memory map and due to early allocation it uses memblock allocation.
>>
>> And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
>> in case of a kexec-ed kernel boot.
>>
>> This function kexec_enter_virtual_mode() installs the new EFI memory map by
>> calling efi_memmap_init_late() which does memremap() on memblock-allocated memory.
> Does the issue happen only with SNP?

This is observed under SNP as efi_arch_mem_reserve() is only being
called with SNP enabled and then efi_arch_mem_reserve() allocates EFI
memory map using memblock.

If we skip efi_arch_mem_reserve() (which should probably be anyway
skipped for kexec case), then for kexec boot, EFI memmap is memremapped
in the same virtual address as the first kernel and not the allocated
memblock address.

Thanks, Ashish

>
> I didn't really dig, but my theory would be that it has something to do
> with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c
>
>> Thanks, Ashish

2024-06-03 14:47:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> for kexec case), then for kexec boot, EFI memmap is memremapped in the same
> virtual address as the first kernel and not the allocated memblock address.

Are you saying that we should simply do

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index fdf07dd6f459..410cb0743289 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
return;

+ if (kexec_in_progress)
+ return;
+
if (!memblock_is_region_reserved(addr, size))
memblock_reserve(addr, size);

and skip that whole call?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-03 15:31:10

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> On 6/3/2024 8:39 AM, Mike Rapoport wrote:
>
> > On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote:
> > > On 6/3/2024 3:56 AM, Borislav Petkov wrote
> > >
> > > > > EFI memory map and due to early allocation it uses memblock allocation.
> > > > >
> > > > > Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> > > > > in case of a kexec-ed kernel boot.
> > > > >
> > > > > This function kexec_enter_virtual_mode() installs the new EFI memory map by
> > > > > calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
> > > > > in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
> > > > >
> > > > > Subsequently, when memblock is freed later in boot flow, this remapped
> > > > > efi_memmap will have random corruption (similar to a use-after-free scenario).
> > > > >
> > > > > The corrupted EFI memory map is then passed to the next kexec-ed kernel
> > > > > which causes a panic when trying to use the corrupted EFI memory map.
> > > > This sounds fishy: memblock allocated memory is not freed later in the
> > > > boot - it remains reserved. Only free memory is freed from memblock to
> > > > the buddy allocator.
> > > >
> > > > Or is the problem that memblock-allocated memory cannot be memremapped
> > > > because *raisins*?
> > > This is what seems to be happening:
> > >
> > > efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> > > EFI memory map and due to early allocation it uses memblock allocation.
> > >
> > > And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> > > in case of a kexec-ed kernel boot.
> > >
> > > This function kexec_enter_virtual_mode() installs the new EFI memory map by
> > > calling efi_memmap_init_late() which does memremap() on memblock-allocated memory.
> > Does the issue happen only with SNP?
>
> This is observed under SNP as efi_arch_mem_reserve() is only being called
> with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map
> using memblock.

I don't see how efi_arch_mem_reserve() is only called with SNP. What did I
miss?

> If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> for kexec case), then for kexec boot, EFI memmap is memremapped in the same
> virtual address as the first kernel and not the allocated memblock address.

Maybe we should skip efi_arch_mem_reserve() for kexec case, but I think we
still need to understand what's causing memory corruption.

> Thanks, Ashish
>
> >
> > I didn't really dig, but my theory would be that it has something to do
> > with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c
> > > Thanks, Ashish

--
Sincerely yours,
Mike.

2024-06-03 15:33:43

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote:
> On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> > for kexec case), then for kexec boot, EFI memmap is memremapped in the same
> > virtual address as the first kernel and not the allocated memblock address.
>
> Are you saying that we should simply do
>
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index fdf07dd6f459..410cb0743289 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
> if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
> return;
>
> + if (kexec_in_progress)
> + return;
> +
> if (!memblock_is_region_reserved(addr, size))
> memblock_reserve(addr, size);
>
> and skip that whole call?

I think Ashish suggested rather

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index fdf07dd6f459..eccc10ab15a4 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -580,6 +580,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
if (!memblock_is_region_reserved(addr, size))
memblock_reserve(addr, size);

+ if (kexec_in_progress)
+ return;
+
/*
* Some architectures (x86) reserve all boot services ranges
* until efi_free_boot_services() because of buggy firmware

> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

--
Sincerely yours,
Mike.
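
The two hunks above differ only in where the early return sits relative to
memblock_reserve(). A minimal userspace sketch of that difference follows. It
assumes that the code after the comment in the second hunk is the call into
efi_arch_mem_reserve(); struct sim and both function names are hypothetical
stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the real kernel symbols. */
struct sim {
	bool kexec_boot;	/* models the condition tested in both diffs */
	bool memblock_reserved;	/* models memblock_reserve() having run */
	bool arch_reserved;	/* models efi_arch_mem_reserve() having run */
};

/* First diff: return before anything is reserved. */
static void reserve_check_first(struct sim *s)
{
	assert(s != 0);

	if (s->kexec_boot)
		return;
	s->memblock_reserved = true;
	s->arch_reserved = true;
}

/* Second diff: keep the memblock reservation, skip only the arch hook. */
static void reserve_check_after_memblock(struct sim *s)
{
	assert(s != 0);

	s->memblock_reserved = true;
	if (s->kexec_boot)
		return;
	s->arch_reserved = true;
}
```

With the condition true, the first variant leaves the range unreserved in
memblock as well, while the second still reserves it and only avoids the
memory map rewrite.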

2024-06-03 16:48:18

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On 6/3/2024 10:31 AM, Mike Rapoport wrote:

> On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote:
>> On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
>>> If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
>>> for kexec case), then for kexec boot, EFI memmap is memremapped in the same
>>> virtual address as the first kernel and not the allocated memblock address.
>> Are you saying that we should simply do
>>
>> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
>> index fdf07dd6f459..410cb0743289 100644
>> --- a/drivers/firmware/efi/efi.c
>> +++ b/drivers/firmware/efi/efi.c
>> @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
>> if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
>> return;
>>
>> + if (kexec_in_progress)
>> + return;
>> +
>> if (!memblock_is_region_reserved(addr, size))
>> memblock_reserve(addr, size);
>>
>> and skip that whole call?
> I think Ashish suggested rather
>
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index fdf07dd6f459..eccc10ab15a4 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -580,6 +580,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
> if (!memblock_is_region_reserved(addr, size))
> memblock_reserve(addr, size);
>
> + if (kexec_in_progress)
> + return;
> +
> /*
> * Some architectures (x86) reserve all boot services ranges
> * until efi_free_boot_services() because of buggy firmware
>
Yes, something similar as above, as efi_mem_reserve() is used to reserve
boot service memory and is not necessary for kexec boot.

So, Dave Young ([email protected]) had suggested that we skip
efi_arch_mem_reserve() for kexec by checking the set EFI_MEMORY_RUNTIME
attribute as below:

diff
<https://lore.kernel.org/lkml/[email protected]/T/#iZ2e.:..:f4be03b8488665f56a1e5c6e6459f447352dfcf5.1717111180.git.ashish.kalra::40amd.com:1arch:x86:platform:efi:quirks.c>
--git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f0cc00032751..6f398c59278a 100644 ---
a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@
-255,15 +255,39 @@ void __init efi_arch_mem_reserve(phys_addr_t addr,
u64 size) struct efi_memory_map_data data = { 0 };
struct efi_mem_range mr;
efi_memory_desc_t md;
- int num_entries; + int num_entries, ret; void *new;

- if (efi_mem_desc_lookup(addr, &md) || - md.type !=
EFI_BOOT_SERVICES_DATA) { + /* + * efi_mem_reserve() is used to reserve
boot service memory, eg. bgrt, + * but it is not neccasery for kexec, as
there are no boot services in + * kexec reboot at all after the first
kernel's ExitBootServices(). + * + * Therefore, skip efi_mem_reserve for
kexec booting by checking the + * EFI_MEMORY_RUNTIME attribute which
indicates boot service memory + * ranges reserved by the first kernel
using efi_mem_reserve and marked + * with EFI_MEMORY_RUNTIME attribute.
+ */ + + ret = efi_mem_desc_lookup(addr, &md); + if (ret) { pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
return;
}

+ if (md.type != EFI_BOOT_SERVICES_DATA) { + pr_err("Skip reserving non
EFI Boot Service Data memory for %pa\n", &addr); + return; + } + + /*
Kexec copied the efi memmap from the first kernel, thus skip the case */
+ if (md.attribute & EFI_MEMORY_RUNTIME) + return; + if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
return;
--
2.34.1

>> --
>> Regards/Gruss,
>> Boris.
>>
>> https://people.kernel.org/tglx/notes-about-netiquette

2024-06-03 16:56:23

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On 6/3/2024 10:29 AM, Mike Rapoport wrote:

> On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
>> On 6/3/2024 8:39 AM, Mike Rapoport wrote:
>>
>>> On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote:
>>>> On 6/3/2024 3:56 AM, Borislav Petkov wrote
>>>>
>>>>>> EFI memory map and due to early allocation it uses memblock allocation.
>>>>>>
>>>>>> Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
>>>>>> in case of a kexec-ed kernel boot.
>>>>>>
>>>>>> This function kexec_enter_virtual_mode() installs the new EFI memory map by
>>>>>> calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
>>>>>> in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
>>>>>>
>>>>>> Subsequently, when memblock is freed later in boot flow, this remapped
>>>>>> efi_memmap will have random corruption (similar to a use-after-free scenario).
>>>>>>
>>>>>> The corrupted EFI memory map is then passed to the next kexec-ed kernel
>>>>>> which causes a panic when trying to use the corrupted EFI memory map.
>>>>> This sounds fishy: memblock allocated memory is not freed later in the
>>>>> boot - it remains reserved. Only free memory is freed from memblock to
>>>>> the buddy allocator.
>>>>>
>>>>> Or is the problem that memblock-allocated memory cannot be memremapped
>>>>> because *raisins*?
>>>> This is what seems to be happening:
>>>>
>>>> efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
>>>> EFI memory map and due to early allocation it uses memblock allocation.
>>>>
>>>> And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
>>>> in case of a kexec-ed kernel boot.
>>>>
>>>> This function kexec_enter_virtual_mode() installs the new EFI memory map by
>>>> calling efi_memmap_init_late() which does memremap() on memblock-allocated memory.
>>> Does the issue happen only with SNP?
>> This is observed under SNP as efi_arch_mem_reserve() is only being called
>> with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map
>> using memblock.
> I don't see how efi_arch_mem_reserve() is only called with SNP. What did I
> miss?
>

This is the call stack for efi_arch_mem_reserve():

[ 0.310010] efi_arch_mem_reserve+0xb1/0x220
[ 0.311382] efi_mem_reserve+0x36/0x60
[ 0.311973] efi_bgrt_init+0x17d/0x1a0
[ 0.313265] acpi_parse_bgrt+0x12/0x20
[ 0.313858] acpi_table_parse+0x77/0xd0
[ 0.314463] acpi_boot_init+0x362/0x630
[ 0.315069] setup_arch+0xa88/0xf80
[ 0.315629] start_kernel+0x68/0xa90
[ 0.316194] x86_64_start_reservations+0x1c/0x30
[ 0.316921] x86_64_start_kernel+0xbf/0x110
[ 0.317582] common_startup_64+0x13e/0x141

So probably it is being invoked specifically on AMD platforms?

>> If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
>> for kexec case), then for kexec boot, EFI memmap is memremapped in the same
>> virtual address as the first kernel and not the allocated memblock address.
> Maybe we should skip efi_arch_mem_reserve() for kexec case, but I think we
> still need to understand what's causing memory corruption.

When efi_arch_mem_reserve() allocates memory for the EFI memory map using
memblock and kexec_enter_virtual_mode() later does memremap() on this
memblock-allocated memory, I subsequently see EFI memory map corruption.
So are there any issues with doing memremap() on memblock-allocated
memory?

Thanks, Ashish

>>> I didn't really dig, but my theory would be that it has something to do
>>> with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c
>>>> Thanks, Ashish

2024-06-03 17:06:11

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

Re-sending this, last response got garbled.

On 6/3/2024 11:48 AM, Kalra, Ashish wrote:
> On 6/3/2024 10:31 AM, Mike Rapoport wrote:
>
>> On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote:
>>> On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
>>>> If we skip efi_arch_mem_reserve() (which should probably be anyway
>>>> skipped
>>>> for kexec case), then for kexec boot, EFI memmap is memremapped in
>>>> the same
>>>> virtual address as the first kernel and not the allocated memblock
>>>> address.
>>> Are you saying that we should simply do
>>>
>>> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
>>> index fdf07dd6f459..410cb0743289 100644
>>> --- a/drivers/firmware/efi/efi.c
>>> +++ b/drivers/firmware/efi/efi.c
>>> @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr,
>>> u64 size)
>>>       if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
>>>           return;
>>>   +    if (kexec_in_progress)
>>> +        return;
>>> +
>>>       if (!memblock_is_region_reserved(addr, size))
>>>           memblock_reserve(addr, size);
>>>   and skip that whole call?
>> I think Ashish suggested rather
>>
>> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
>> index fdf07dd6f459..eccc10ab15a4 100644
>> --- a/drivers/firmware/efi/efi.c
>> +++ b/drivers/firmware/efi/efi.c
>> @@ -580,6 +580,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64
>> size)
>>       if (!memblock_is_region_reserved(addr, size))
>>           memblock_reserve(addr, size);
>>   +    if (kexec_in_progress)
>> +        return;
>> +
>>       /*
>>        * Some architectures (x86) reserve all boot services ranges
>>        * until efi_free_boot_services() because of buggy firmware
> Yes, something similar as above, as efi_mem_reserve() is used to
> reserve boot service memory and is not necessary for kexec boot.
>
> So, Dave Young ([email protected]) had suggested that we skip
> efi_arch_mem_reserve() for kexec by checking the set
> EFI_MEMORY_RUNTIME attribute as below:
>
diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index f0cc00032751..6f398c59278a 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -255,15 +255,39 @@ void __init efi_arch_mem_reserve(phys_addr_t addr,
u64 size)
        struct efi_memory_map_data data = { 0 };
        struct efi_mem_range mr;
        efi_memory_desc_t md;
-       int num_entries;
+       int num_entries, ret;
        void *new;

-       if (efi_mem_desc_lookup(addr, &md) ||
-           md.type != EFI_BOOT_SERVICES_DATA) {
+       /*
+        * efi_mem_reserve() is used to reserve boot service memory, eg.
bgrt,
+        * but it is not neccasery for kexec, as there are no boot
services in
+        * kexec reboot at all after the first kernel's ExitBootServices().
+        *
+        * Therefore, skip efi_mem_reserve for kexec booting by checking the
+        * EFI_MEMORY_RUNTIME attribute which indicates boot service memory
+        * ranges reserved by the first kernel using efi_mem_reserve and
marked
+        * with EFI_MEMORY_RUNTIME attribute.
+        */
+
+       ret = efi_mem_desc_lookup(addr, &md);

+       if (ret) {

                pr_err("Failed to lookup EFI memory descriptor for
%pa\n", &addr);
                return;
        }

+       if (md.type != EFI_BOOT_SERVICES_DATA) {
+               pr_err("Skip reserving non EFI Boot Service Data memory
for %pa\n", &addr);
+               return;
+       }
+
+       /* Kexec copied the efi memmap from the first kernel, thus skip
the case */
+       if (md.attribute & EFI_MEMORY_RUNTIME)
+               return;
+
        if (addr + size > md.phys_addr + (md.num_pages <<
EFI_PAGE_SHIFT)) {
                pr_err("Region spans EFI memory descriptors, %pa\n",
&addr);
                return;

Thanks, Ashish
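
The order of checks in the diff above can be modelled outside the kernel. The
sketch below is an illustration only: struct md is a reduced, hypothetical
descriptor, classify() and the verdict names are invented for the example, and
the constant values are the ones the UEFI specification assigns
(EfiBootServicesData is type 4, EFI_MEMORY_RUNTIME is bit 63, EFI pages are
4K). The efi_mem_desc_lookup() failure path is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EFI_BOOT_SERVICES_DATA	4
#define EFI_MEMORY_RUNTIME	(1ULL << 63)
#define EFI_PAGE_SHIFT		12

/* Reduced, hypothetical version of an EFI memory descriptor. */
struct md {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

enum verdict {
	RESERVE,		/* go on and mark the range EFI_MEMORY_RUNTIME */
	SKIP_NOT_BS_DATA,	/* not boot services data */
	SKIP_ALREADY_RUNTIME,	/* already marked by the first kernel */
	SKIP_SPANS,		/* request crosses the descriptor end */
};

/* Same order of checks as in the diff: type, attribute, then bounds. */
static enum verdict classify(const struct md *md, uint64_t addr, uint64_t size)
{
	assert(md != 0);

	if (md->type != EFI_BOOT_SERVICES_DATA)
		return SKIP_NOT_BS_DATA;
	if (md->attribute & EFI_MEMORY_RUNTIME)
		return SKIP_ALREADY_RUNTIME;
	if (addr + size > md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT))
		return SKIP_SPANS;
	return RESERVE;
}
```

A descriptor inherited from the first kernel with EFI_MEMORY_RUNTIME already
set hits the second check, so no new memory map is allocated for it.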

2024-06-03 17:07:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, Jun 03, 2024 at 11:48:03AM -0500, Kalra, Ashish wrote:
> Yes, something similar as above, as efi_mem_reserve() is used to reserve
> boot service memory and is not necessary for kexec boot.
>
> So, Dave Young ([email protected]) had suggested that we skip
> efi_arch_mem_reserve() for kexec by checking the set EFI_MEMORY_RUNTIME
> attribute as below:

efi_arch_mem_reserve() or efi_mem_reserve() altogether?

Btw, that below got really gibberished by your mail client. Snipped.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-03 17:09:38

by Kalra, Ashish

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On 6/3/2024 11:57 AM, Borislav Petkov wrote:

> On Mon, Jun 03, 2024 at 11:48:03AM -0500, Kalra, Ashish wrote:
>> Yes, something similar as above, as efi_mem_reserve() is used to reserve
>> boot service memory and is not necessary for kexec boot.
>>
>> So, Dave Young ([email protected]) had suggested that we skip
>> efi_arch_mem_reserve() for kexec by checking the set EFI_MEMORY_RUNTIME
>> attribute as below:
> efi_arch_mem_reserve() or efi_mem_reserve() altogether?

efi_arch_mem_reserve().

Thanks, Ashish

>
> Btw, that below got really gibberished by your mail client. Snipped.
>

2024-06-03 17:10:55

by Borislav Petkov

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, Jun 03, 2024 at 12:05:45PM -0500, Kalra, Ashish wrote:
> Re-sending this, last response got garbled.

And this got linewrapped.

Thunderbird section in Documentation/process/email-clients.rst.

> index f0cc00032751..6f398c59278a 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -255,15 +255,39 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64
> size)

^^^

>         struct efi_memory_map_data data = { 0 };
>         struct efi_mem_range mr;
>         efi_memory_desc_t md;
> -       int num_entries;
> +       int num_entries, ret;
>         void *new;
>
> -       if (efi_mem_desc_lookup(addr, &md) ||
> -           md.type != EFI_BOOT_SERVICES_DATA) {
> +       /*
> +        * efi_mem_reserve() is used to reserve boot service memory, eg.
> bgrt,

^^^

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-03 17:13:16

by Borislav Petkov

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
> efi_arch_mem_reserve().

Now it only remains for you to explain why...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-03 17:43:40

by Mike Rapoport

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, Jun 03, 2024 at 11:56:01AM -0500, Kalra, Ashish wrote:
> On 6/3/2024 10:29 AM, Mike Rapoport wrote:
>
> > On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> > > On 6/3/2024 8:39 AM, Mike Rapoport wrote:
> > >
> > > > On Mon, Jun 03, 2024 at 08:06:56AM -0500, Kalra, Ashish wrote:
> > > > > On 6/3/2024 3:56 AM, Borislav Petkov wrote
> > > > >
> > > > > > > EFI memory map and due to early allocation it uses memblock allocation.
> > > > > > >
> > > > > > > Later during boot, efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> > > > > > > in case of a kexec-ed kernel boot.
> > > > > > >
> > > > > > > This function kexec_enter_virtual_mode() installs the new EFI memory map by
> > > > > > > calling efi_memmap_init_late() which remaps the efi_memmap physically allocated
> > > > > > > in efi_arch_mem_reserve(), but this remapping is still using memblock allocation.
> > > > > > >
> > > > > > > Subsequently, when memblock is freed later in boot flow, this remapped
> > > > > > > efi_memmap will have random corruption (similar to a use-after-free scenario).
> > > > > > >
> > > > > > > The corrupted EFI memory map is then passed to the next kexec-ed kernel
> > > > > > > which causes a panic when trying to use the corrupted EFI memory map.
> > > > > > This sounds fishy: memblock allocated memory is not freed later in the
> > > > > > boot - it remains reserved. Only free memory is freed from memblock to
> > > > > > the buddy allocator.
> > > > > >
> > > > > > Or is the problem that memblock-allocated memory cannot be memremapped
> > > > > > because *raisins*?
> > > > > This is what seems to be happening:
> > > > >
> > > > > efi_arch_mem_reserve() calls efi_memmap_alloc() to allocate memory for
> > > > > EFI memory map and due to early allocation it uses memblock allocation.
> > > > >
> > > > > And later efi_enter_virtual_mode() calls kexec_enter_virtual_mode()
> > > > > in case of a kexec-ed kernel boot.
> > > > >
> > > > > This function kexec_enter_virtual_mode() installs the new EFI memory map by
> > > > > calling efi_memmap_init_late() which does memremap() on memblock-allocated memory.
> > > > Does the issue happen only with SNP?
> > > This is observed under SNP as efi_arch_mem_reserve() is only being called
> > > with SNP enabled and then efi_arch_mem_reserve() allocates EFI memory map
> > > using memblock.
> > I don't see how efi_arch_mem_reserve() is only called with SNP. What did I
> > miss?
>
> This is the call stack for efi_arch_mem_reserve():
>
> [ 0.310010] efi_arch_mem_reserve+0xb1/0x220
> [ 0.311382] efi_mem_reserve+0x36/0x60
> [ 0.311973] efi_bgrt_init+0x17d/0x1a0
> [ 0.313265] acpi_parse_bgrt+0x12/0x20
> [ 0.313858] acpi_table_parse+0x77/0xd0
> [ 0.314463] acpi_boot_init+0x362/0x630
> [ 0.315069] setup_arch+0xa88/0xf80
> [ 0.315629] start_kernel+0x68/0xa90
> [ 0.316194] x86_64_start_reservations+0x1c/0x30
> [ 0.316921] x86_64_start_kernel+0xbf/0x110
> [ 0.317582] common_startup_64+0x13e/0x141
>
> So, probably it is being invoked specifically for AMD platform ?

AFAIU, efi_bgrt_init() can be called for any x86 platform, with or without
encryption.
So if my understanding is correct, efi_arch_mem_reserve() will be called with SNP
disabled as well. And if kexec works ok without SNP but fails with SNP, this
may give us a clue to the root cause of the failure.

> > > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> > > for kexec case), then for kexec boot, EFI memmap is memremapped in the same
> > > virtual address as the first kernel and not the allocated memblock address.
> > Maybe we should skip efi_arch_mem_reserve() for kexec case, but I think we
> > still need to understand what's causing memory corruption.
>
> When, efi_arch_mem_reserve() allocates memory for EFI memory map using
> memblock and then later in boot, kexec_enter_virtual_mode() does memremap on
> this memblock allocated memory, subsequently after this i see EFI memory map
> corruption, so are there are any issues doing memremap on memblock-allocated
> memory ?

memblock-allocated memory is just RAM, so my take is that memremap() cannot
figure out the encryption bits properly.

You can check if there are issues with memremap()ing memblock-allocated
memory by sticking memblock_phys_alloc() somewhere, filling that memory with a
pattern and then calling memremap(addr, size, MEMREMAP_WB) and checking if
the pattern is still there.

> Thanks, Ashish
>
> > > > I didn't really dig, but my theory would be that it has something to do
> > > > with arch_memremap_can_ram_remap() in arch/x86/mm/ioremap.c
> > > > > Thanks, Ashish

--
Sincerely yours,
Mike.

2024-06-03 23:47:56

by H. Peter Anvin

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On 5/29/24 03:47, Nikolay Borisov wrote:
>>
>> diff --git a/arch/x86/kernel/relocate_kernel_64.S
>> b/arch/x86/kernel/relocate_kernel_64.S
>> index 56cab1bb25f5..085eef5c3904 100644
>> --- a/arch/x86/kernel/relocate_kernel_64.S
>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>> @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>        */
>>       movl    $X86_CR4_PAE, %eax
>>       testq    $X86_CR4_LA57, %r13
>> -    jz    1f
>> +    jz    .Lno_la57
>>       orl    $X86_CR4_LA57, %eax
>> -1:
>> +.Lno_la57:
>> +
>>       movq    %rax, %cr4
>>       jmp 1f
>

Sorry if this is a duplicate; something strange happened with my email.

If you are cleaning up this code anyway...

this whole piece of code can be simplified to:

and $(X86_CR4_PAE | X86_CR4_LA57), %r13d
mov %r13, %cr4

The PAE bit in %r13 is guaranteed to be set, and %r13 is dead after this.
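
The equivalence claimed here can be checked with a small user-space model. It is a sketch only: cr4_with_branch() and cr4_with_mask() are names invented for this illustration, and the bit positions are the architectural CR4 ones (PAE is bit 5, LA57 is bit 12). Note that a 32-bit operation on %r13d zero-extends into %r13, so the upper half is cleared as well.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural CR4 bit positions. */
#define X86_CR4_PAE	(1ULL << 5)
#define X86_CR4_LA57	(1ULL << 12)

/* Original sequence: start from PAE, add LA57 if it was set before. */
static uint64_t cr4_with_branch(uint64_t orig_cr4)
{
	uint64_t val = X86_CR4_PAE;

	if (orig_cr4 & X86_CR4_LA57)
		val |= X86_CR4_LA57;
	return val;
}

/* Suggested sequence: mask the saved CR4 down to the two bits of interest. */
static uint64_t cr4_with_mask(uint64_t orig_cr4)
{
	/* Precondition stated in the mail: PAE is set in the saved CR4. */
	assert(orig_cr4 & X86_CR4_PAE);
	return orig_cr4 & (X86_CR4_PAE | X86_CR4_LA57);
}
```

For any saved CR4 value that has PAE set, both helpers return the same result, which is why the test-and-branch can be dropped.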

-hpa

2024-06-03 23:49:39

by H. Peter Anvin

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On 5/29/24 03:47, Nikolay Borisov wrote:
>>
>> diff --git a/arch/x86/kernel/relocate_kernel_64.S
>> b/arch/x86/kernel/relocate_kernel_64.S
>> index 56cab1bb25f5..085eef5c3904 100644
>> --- a/arch/x86/kernel/relocate_kernel_64.S
>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>> @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>        */
>>       movl    $X86_CR4_PAE, %eax
>>       testq    $X86_CR4_LA57, %r13
>> -    jz    1f
>> +    jz    .Lno_la57
>>       orl    $X86_CR4_LA57, %eax
>> -1:
>> +.Lno_la57:
>> +
>>       movq    %rax, %cr4
>>       jmp 1f
>
> That jmp 1f becomes redundant now as it simply jumps 1 line below.
>

Uh... am I the only person to notice that ALL that is needed here is:

andl $(X86_CR4_PAE|X86_CR4_LA57), %r13d
movq %r13, %rax

... since %r13 is dead afterwards, and PAE *will* have been set in %r13
already?

I don't believe that this specific jmp is actually needed -- there are
several more synchronizing jumps later -- but it doesn't hurt.

However, if the effort is for improving the readability, it might be
worthwhile to encapsulate the "jmp 1f; 1:" as a macro, e.g. "SYNC_CODE".

-hpa

2024-06-04 00:45:20

by H. Peter Anvin

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

Trying one more time; sorry (again) if someone receives this in duplicate.

>>>
>>> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
>>> index 56cab1bb25f5..085eef5c3904 100644
>>> --- a/arch/x86/kernel/relocate_kernel_64.S
>>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>>> @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>> */
>>> movl $X86_CR4_PAE, %eax
>>> testq $X86_CR4_LA57, %r13
>>> - jz 1f
>>> + jz .Lno_la57
>>> orl $X86_CR4_LA57, %eax
>>> -1:
>>> +.Lno_la57:
>>> +
>>> movq %rax, %cr4

If we are cleaning up this code... the above can simply be:

andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
movq %r13, %cr4

%r13 is dead afterwards, and the PAE bit *will* be set in %r13 anyway.

-hpa


2024-06-04 01:24:02

by Dave Young

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Mon, 3 Jun 2024 at 23:33, Mike Rapoport <[email protected]> wrote:
>
> On Mon, Jun 03, 2024 at 04:46:39PM +0200, Borislav Petkov wrote:
> > On Mon, Jun 03, 2024 at 09:01:49AM -0500, Kalra, Ashish wrote:
> > > If we skip efi_arch_mem_reserve() (which should probably be anyway skipped
> > > for kexec case), then for kexec boot, EFI memmap is memremapped in the same
> > > virtual address as the first kernel and not the allocated memblock address.
> >
> > Are you saying that we should simply do
> >
> > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> > index fdf07dd6f459..410cb0743289 100644
> > --- a/drivers/firmware/efi/efi.c
> > +++ b/drivers/firmware/efi/efi.c
> > @@ -577,6 +577,9 @@ void __init efi_mem_reserve(phys_addr_t addr, u64 size)
> > if (WARN_ON_ONCE(efi_enabled(EFI_PARAVIRT)))
> > return;
> >
> > + if (kexec_in_progress)
> > + return;
> > +

kexec_in_progress is only for checking if this is in a reboot (kexec) code path.
But efi_mem_reserve is only called during the boot time so checking
kexec_in_progress is meaningless here.
current_kernel_is_booted_via_kexec != is_rebooting_with_kexec

The code change below in the patch looks good to me, but I'm not sure
what caused the memory corruption. It is indeed worth some more digging;
maybe it is SEV/SNP related.
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;

Thanks
Dave


2024-06-04 09:16:05

by Borislav Petkov

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On Mon, Jun 03, 2024 at 05:24:00PM -0700, H. Peter Anvin wrote:
> Trying one more time; sorry (again) if someone receives this in duplicate.
>
> > > >
> > > > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > > > index 56cab1bb25f5..085eef5c3904 100644
> > > > --- a/arch/x86/kernel/relocate_kernel_64.S
> > > > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > > > @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > > */
> > > > movl $X86_CR4_PAE, %eax
> > > > testq $X86_CR4_LA57, %r13
> > > > - jz 1f
> > > > + jz .Lno_la57
> > > > orl $X86_CR4_LA57, %eax
> > > > -1:
> > > > +.Lno_la57:
> > > > +
> > > > movq %rax, %cr4
>
> If we are cleaning up this code... the above can simply be:
>
> andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
> movq %r13, %cr4
>
> %r13 is dead afterwards, and the PAE bit *will* be set in %r13 anyway.

Yeah, with a proper comment. The testing of bits is not really needed.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-04 09:45:45

by Borislav Petkov

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Tue, Jun 04, 2024 at 09:23:58AM +0800, Dave Young wrote:
> kexec_in_progress is only for checking if this is in a reboot (kexec) code path.
> But efi_mem_reserve is only called during the boot time so checking
> kexec_in_progress is meaningless here.
> current_kernel_is_booted_via_kexec != is_rebooting_with_kexec

That's exactly what I wanna check: whether this is a kexec-ed kernel. Or
is there a better helper for that?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-04 11:09:52

by Dave Young

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Tue, 4 Jun 2024 at 17:44, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Jun 04, 2024 at 09:23:58AM +0800, Dave Young wrote:
> > kexec_in_progress is only for checking if this is in a reboot (kexec) code path.
> > But efi_mem_reserve is only called during the boot time so checking
> > kexec_in_progress is meaningless here.
> > current_kernel_is_booted_via_kexec != is_rebooting_with_kexec
>
> That's exactly what I wanna check: whether this is a kexec-ed kernel. Or
> is there a better helper for that?

There is no general way to check whether a kernel was kexec-ed. On x86
one can check efi_setup as Ashish's original patch did, since a
kexec-booted kernel (EFI boot) will have EFI setup_data passed in.

Otherwise there is the type_of_loader field in the x86 boot protocol;
kexec-tools uses 0x0D, and kexec_file_load uses it as well. But
type_of_loader was only added to the kexec-tools code when Yinghai
worked on the kexec-tools bzImage64 load, so older kexec-tools will
not set this field. The in-kernel kexec_file_load code for x86 has
set 0x0D as the loader type from the beginning.

Anyway there is not such a helper for all cases.
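
As a rough sketch of the type_of_loader approach, and not an existing kernel helper: loader_is_kexec() and LOADER_ID_KEXEC are hypothetical names, and the code assumes the loader ID sits in the upper four bits of type_of_loader, as the x86 boot protocol documentation describes. Per the caveat above, a kernel loaded by an older kexec-tools would not be detected this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Loader ID mentioned above for kexec-tools and kexec_file_load. */
#define LOADER_ID_KEXEC	0x0D

/*
 * Hypothetical helper: returns true when the boot protocol's
 * type_of_loader byte names kexec as the loader. Assumes the ID is in
 * the upper nibble and the loader version in the lower nibble.
 */
static bool loader_is_kexec(uint8_t type_of_loader)
{
	return (type_of_loader >> 4) == LOADER_ID_KEXEC;
}
```

A false result therefore only means "not known to be kexec", which is the reason it cannot serve as a helper for all cases.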

>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>


2024-06-04 15:22:01

by Kirill A. Shutemov

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On Tue, Jun 04, 2024 at 11:15:03AM +0200, Borislav Petkov wrote:
> On Mon, Jun 03, 2024 at 05:24:00PM -0700, H. Peter Anvin wrote:
> > Trying one more time; sorry (again) if someone receives this in duplicate.
> >
> > > > >
> > > > > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > > > > index 56cab1bb25f5..085eef5c3904 100644
> > > > > --- a/arch/x86/kernel/relocate_kernel_64.S
> > > > > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > > > > @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > > > > */
> > > > > movl $X86_CR4_PAE, %eax
> > > > > testq $X86_CR4_LA57, %r13
> > > > > - jz 1f
> > > > > + jz .Lno_la57
> > > > > orl $X86_CR4_LA57, %eax
> > > > > -1:
> > > > > +.Lno_la57:
> > > > > +
> > > > > movq %rax, %cr4
> >
> > If we are cleaning up this code... the above can simply be:
> >
> > andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
> > movq %r13, %cr4
> >
> > %r13 is dead afterwards, and the PAE bit *will* be set in %r13 anyway.
>
> Yeah, with a proper comment. The testing of bits is not really needed.

I think it is a better fit for the next patch.

What about this?

From b45fe48092abad2612c2bafbb199e4de80c99545 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Fri, 10 Feb 2023 12:53:11 +0300
Subject: [PATCHv11.1 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest

TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
that bit is cleared during CR4 register reprogramming during boot or
kexec flows, a #VE exception will be raised which the guest kernel
cannot handle.

Therefore, make sure the CR4.MCE setting is preserved over kexec too and
avoid raising any #VEs.

The change doesn't affect non-TDX-guest environments.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/relocate_kernel_64.S | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 085eef5c3904..9c2cf70c5f54 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -5,6 +5,8 @@
*/

#include <linux/linkage.h>
+#include <linux/stringify.h>
+#include <asm/alternative.h>
#include <asm/page_types.h>
#include <asm/kexec.h>
#include <asm/processor-flags.h>
@@ -145,14 +147,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
* Set cr4 to a known state:
* - physical address extension enabled
* - 5-level paging, if it was enabled before
+ * - Machine check exception on TDX guest, if it was enabled before.
+ * Clearing MCE might not be allowed in TDX guests, depending on setup.
+ *
+ * Use R13 that contains the original CR4 value, read in relocate_kernel().
+ * PAE is always set in the original CR4.
*/
- movl $X86_CR4_PAE, %eax
- testq $X86_CR4_LA57, %r13
- jz .Lno_la57
- orl $X86_CR4_LA57, %eax
-.Lno_la57:
-
- movq %rax, %cr4
+ andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
+ ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
+ movq %r13, %cr4

jmp 1f
1:
--
Kiryl Shutsemau / Kirill A. Shutemov
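
The effect of the three instructions added by this patch can be modelled in plain C. This is a sketch, not kernel code: kexec_cr4() is an invented name, and the is_tdx_guest parameter stands in for the X86_FEATURE_TDX_GUEST alternative, which in the real code is patched in at boot rather than tested at run time. The bit positions are the architectural CR4 ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions. */
#define X86_CR4_PAE	(1ULL << 5)
#define X86_CR4_MCE	(1ULL << 6)
#define X86_CR4_LA57	(1ULL << 12)

/*
 * Model of the patched sequence:
 *   andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
 *   orl  $X86_CR4_MCE, %r13d        (only on a TDX guest)
 *   movq %r13, %cr4
 * Returns the value that would be written to CR4.
 */
static uint64_t kexec_cr4(uint64_t orig_cr4, bool is_tdx_guest)
{
	uint64_t val = orig_cr4 & (X86_CR4_PAE | X86_CR4_LA57);

	if (is_tdx_guest)
		val |= X86_CR4_MCE;
	return val;
}
```

On a non-TDX system the result is the same as before the patch (PAE plus LA57 if it was enabled); on a TDX guest MCE stays set, so no #VE is raised by the CR4 write.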

2024-06-04 18:04:36

by Borislav Petkov

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Tue, Jun 04, 2024 at 07:09:56PM +0800, Dave Young wrote:
> Anyway there is not such a helper for all cases.

But maybe there should be...

This is not the first case where the need arises to be able to say:

if (am I a kexeced kernel)

in code.

Perhaps we should have a global var kexeced or so which gets incremented
on each kexec-ed kernel, somewhere in very early boot of the kexec-ed
kernel we do

kexeced++;

and then other code can query it and know whether this is a kexec-ed
kernel and how many times it got kexec-ed...
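
One way to picture the counter idea is the sketch below. It is purely hypothetical: the mail does not say how the count would be carried from one kernel to the next, so struct kexec_handoff and early_boot_kexec_count() are assumptions made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data that an outgoing kernel leaves for the next one. */
struct kexec_handoff {
	unsigned int kexeced;	/* how many kexecs led to the outgoing kernel */
};

/*
 * Hypothetical early-boot hook. A NULL handoff models a regular
 * firmware boot; otherwise the incoming kernel does the "kexeced++".
 */
static unsigned int early_boot_kexec_count(const struct kexec_handoff *prev)
{
	if (!prev)
		return 0;
	return prev->kexeced + 1;
}
```

Code could then test the returned count for non-zero to answer "am I a kexec-ed kernel", and read the value itself to learn how many kexecs preceded it.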

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-04 18:07:19

by Borislav Petkov

Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On Tue, Jun 04, 2024 at 06:21:27PM +0300, Kirill A. Shutemov wrote:
> What about this?

Yeah, LGTM.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-04 22:13:40

by Kalra, Ashish

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On 6/3/2024 12:12 PM, Borislav Petkov wrote:

> On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
>> efi_arch_mem_reserve().
> Now it only remains for you to explain why...

Here is a detailed explanation of what is causing the EFI memory map corruption, with added debug logs and memblock debugging enabled:

Initially at boot, efi_memblock_x86_reserve_range() does early_memremap() of the EFI memory map passed as part of setup_data, as the following logs show:

...

[ 0.000000] efi: in efi_memblock_x86_reserve_range, phys map 0x27fff9110
[ 0.000000] memblock_reserve: [0x000000027fff9110-0x000000027fffa12f] efi_memblock_x86_reserve_range+0x168/0x2a0

...

Later, efi_arch_mem_reserve() is invoked, which calls efi_memmap_alloc(), which does memblock_phys_alloc() to insert a new EFI memory descriptor into efi.memmap:

...

[ 0.733263] memblock_reserve: [0x000000027ffcaf80-0x000000027ffcbfff] memblock_alloc_range_nid+0xf1/0x1b0
[ 0.734787] efi: efi_arch_mem_reserve, efi phys map 0x27ffcaf80

...

Finally, at the end of boot, kexec_enter_virtual_mode() is called.

It does mapping of efi regions which were passed via setup_data.

So it unregisters the early mem-remapped EFI memmap and installs the new EFI memory map as below:

( Because of efi_arch_mem_reserve() getting invoked, the new EFI memmap phys base being remapped now is the memblock allocation done in efi_arch_mem_reserve()).

[ 4.042160] efi: efi memmap phys map 0x27ffcaf80

So, kexec_enter_virtual_mode() does the following :

if (efi_memmap_init_late(efi.memmap.phys_map, <---- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
                         efi.memmap.desc_size * efi.memmap.nr_map)) { ...

This late init does a memremap() on this memblock-allocated memory, but then immediately frees it:

drivers/firmware/efi/memmap.c:

int __init __efi_memmap_init(struct efi_memory_map_data *data)
{

..

phys_map = data->phys_map; <----------------------- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().

if (data->flags & EFI_MEMMAP_LATE)
        map.map = memremap(phys_map, data->size, MEMREMAP_WB);
...
...
if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
        __efi_memmap_free(efi.memmap.phys_map,
                          efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
}

map.phys_map = data->phys_map;

...

efi.memmap = map;

...

This happens as kexec_enter_virtual_mode() can only handle the early mapped EFI memmap and not the one which is memblock allocated by efi_arch_mem_reserve(). As seen above this memblock allocated (EFI_MEMMAP_MEMBLOCK tagged) memory gets freed.

This is confirmed by memblock debugging:

[ 4.044057] memblock_free_late: [0x000000027ffcaf80-0x000000027ffcbfff] __efi_memmap_free+0x66/0x80

So while this memory is memremapped, it has also been freed and then it gets into a use-after-free condition and subsequently gets corrupted.

This corruption is seen just before kexec-ing into the new kernel:

...

[ 11.045522] PEFILE: Unsigned PE binary^M
[ 11.060801] kexec-bzImage64: efi memmap phys map 0x27ffcaf80
...
[ 11.061220] kexec-bzImage64: mmap entry, type = 11, va = 0xfffffffeffc00000, pa = 0xffc00000, np = 0x400, attr = 0x8000000000000001^M
[ 11.061225] kexec-bzImage64: mmap entry, type = 6, va = 0xfffffffeffb04000, pa = 0x7f704000, np = 0x84, attr = 0x800000000000000f^M
[ 11.061228] kexec-bzImage64: mmap entry, type = 4, va = 0xfffffffeff700000, pa = 0x7f100000, np = 0x300, attr = 0x0^M
[ 11.061231] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M <---------------- CORRUPTED!!!
[ 11.061234] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061236] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061239] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061241] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061243] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061245] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061248] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061250] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061252] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061255] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061257] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061259] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061262] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061264] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061266] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061268] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061271] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061273] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061275] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061278] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061280] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061282] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061284] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061287] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061289] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061291] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061294] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061296] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061298] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061301] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061303] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061305] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061307] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061310] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061312] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0^M
[ 11.061314] kexec-bzImage64: mmap entry, type = 14080, va = 0x14f29, pa = 0x36c0, np = 0x0, attr = 0x0^M
[ 11.061317] kexec-bzImage64: mmap entry, type = 85808, va = 0x0, pa = 0x0, np = 0x72, attr = 0x14f40

...

In the above logs, the EFI memmap phys map address 0x27ffcaf80 is memremapped, then freed, and then used after the free (while setting up the EFI memory map for the next kernel with kexec -s), which confirms the use-after-free case.

Looking at the above code flow, it makes sense to skip efi_arch_mem_reserve() to fix this issue, as it anyway needs to be skipped for kexec case.
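
The sequence described above can be reduced to a toy user-space model. It is a sketch under stated assumptions and not the kernel implementation: struct memmap, struct region, memmap_init_late() and map_is_dangling() are invented for illustration, the flag values are illustrative, and only the order of operations (install the new map, free the old one if it was memblock-allocated) is taken from the quoted __efi_memmap_init() excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values for this model. */
#define EFI_MEMMAP_LATE		(1UL << 0)
#define EFI_MEMMAP_MEMBLOCK	(1UL << 1)

/* Hypothetical, minimal stand-in for efi.memmap. */
struct memmap {
	uint64_t phys_map;
	unsigned long flags;
};

/* Toy tracker for one memblock allocation: is it still reserved? */
struct region {
	uint64_t phys;
	bool reserved;
};

/*
 * Model of the late init: the old map's backing memory is freed when
 * the old map was memblock-allocated, then the new map is installed.
 */
static void memmap_init_late(struct memmap *cur, const struct memmap *data,
			     struct region *r)
{
	struct memmap new_map = { data->phys_map, data->flags };

	if ((cur->flags & EFI_MEMMAP_MEMBLOCK) && cur->phys_map == r->phys)
		r->reserved = false;	/* models memblock_free_late() */

	*cur = new_map;
}

/* True when the installed map points at memory that has been freed. */
static bool map_is_dangling(const struct memmap *cur, const struct region *r)
{
	return cur->phys_map == r->phys && !r->reserved;
}
```

When the current map is the memblock allocation made by efi_arch_mem_reserve() and the same physical address is passed back in as the new map, the model ends with a dangling map. When efi_arch_mem_reserve() is skipped, the current map is still the one passed via setup_data, nothing is freed, and the map stays valid.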

Thanks, Ashish


2024-06-04 22:36:16

by Kalra, Ashish

Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

Re-sending as the earlier response got line-wrapped.

On 6/3/2024 12:12 PM, Borislav Petkov wrote:
> On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
>> efi_arch_mem_reserve().
> Now it only remains for you to explain why...

Here is a detailed explanation of what is causing the EFI memory map corruption, with added debug logs and memblock debugging enabled:

Initially at boot, efi_memblock_x86_reserve_range() does early_memremap() of the EFI memory map passed as part of setup_data, as the following logs show:

...

[ 0.000000] efi: in efi_memblock_x86_reserve_range, phys map 0x27fff9110
[ 0.000000] memblock_reserve: [0x000000027fff9110-0x000000027fffa12f] efi_memblock_x86_reserve_range+0x168/0x2a0

...

Later, efi_arch_mem_reserve() is invoked, which calls efi_memmap_alloc(), which does memblock_phys_alloc() to insert a new EFI memory descriptor into efi.memmap:

...

[ 0.733263] memblock_reserve: [0x000000027ffcaf80-0x000000027ffcbfff] memblock_alloc_range_nid+0xf1/0x1b0
[ 0.734787] efi: efi_arch_mem_reserve, efi phys map 0x27ffcaf80

...

Finally, at the end of boot, kexec_enter_virtual_mode() is called.

It does mapping of efi regions which were passed via setup_data.

So it unregisters the early mem-remapped EFI memmap and installs the new EFI memory map as below:

( Because of efi_arch_mem_reserve() getting invoked, the new EFI memmap phys base being remapped now is the memblock allocation done in efi_arch_mem_reserve()).

[ 4.042160] efi: efi memmap phys map 0x27ffcaf80

So, kexec_enter_virtual_mode() does the following :

if (efi_memmap_init_late(efi.memmap.phys_map, <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
efi.memmap.desc_size * efi.memmap.nr_map)) { ...

This late init does a memremap() on this memblock-allocated memory, but then immediately frees it:

drivers/firmware/efi/memmap.c:

int __init __efi_memmap_init(struct efi_memory_map_data *data)
{

..

phys_map = data->phys_map; <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().

if (data->flags & EFI_MEMMAP_LATE)
map.map = memremap(phys_map, data->size, MEMREMAP_WB);
...
...
if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
__efi_memmap_free(efi.memmap.phys_map,
efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
}

...
map.phys_map = data->phys_map;

...

efi.memmap = map;

...

This happens as kexec_enter_virtual_mode() can only handle the early mapped EFI memmap and not the one which is memblock allocated by efi_arch_mem_reserve(). As seen above this memblock allocated (EFI_MEMMAP_MEMBLOCK tagged) memory gets freed.

This is confirmed by memblock debugging:

[ 4.044057] memblock_free_late: [0x000000027ffcaf80-0x000000027ffcbfff] __efi_memmap_free+0x66/0x80

So while this memory is memremapped, it has also been freed and then it gets into a use-after-free condition and subsequently gets corrupted.

This corruption is seen just before kexec-ing into the new kernel:

...
[ 11.045522] PEFILE: Unsigned PE binary
[ 11.060801] kexec-bzImage64: efi memmap phys map 0x27ffcaf80
...
[ 11.061220] kexec-bzImage64: mmap entry, type = 11, va = 0xfffffffeffc00000, pa = 0xffc00000, np = 0x400, attr = 0x8000000000000001
[ 11.061225] kexec-bzImage64: mmap entry, type = 6, va = 0xfffffffeffb04000, pa = 0x7f704000, np = 0x84, attr = 0x800000000000000f
[ 11.061228] kexec-bzImage64: mmap entry, type = 4, va = 0xfffffffeff700000, pa = 0x7f100000, np = 0x300, attr = 0x0
[ 11.061231] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0 <- CORRUPTION!!!
[ 11.061234] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061236] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061239] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061241] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061243] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061245] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061248] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061250] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061252] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061255] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061257] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061259] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061262] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061264] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061266] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061268] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061271] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061273] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061275] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061278] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061280] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061282] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061284] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061287] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061289] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061291] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061294] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061296] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061298] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061301] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061303] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061305] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061307] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061310] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061312] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0
[ 11.061314] kexec-bzImage64: mmap entry, type = 14080, va = 0x14f29, pa = 0x36c0, np = 0x0, attr = 0x0
[ 11.061317] kexec-bzImage64: mmap entry, type = 85808, va = 0x0, pa = 0x0, np = 0x72, attr = 0x14f40
[ 11.061320] kexec-bzImage64: mmap entry, type = 0, va = 0x14f4b, pa = 0x65, np = 0x1, attr = 0x0
[ 11.061323] kexec-bzImage64: mmap entry, type = 85840, va = 0x0, pa = 0x2, np = 0x69, attr = 0x14f59
[ 11.061325] kexec-bzImage64: mmap entry, type = 0, va = 0x14f65, pa = 0x6c, np = 0x0, attr = 0x0
[ 11.061328] kexec-bzImage64: mmap entry, type = 85871, va = 0x0, pa = 0x0, np = 0x7a, attr = 0x14f7f


...

The EFI phys map address 0x27ffcaf80 is mem-remapped and then freed while it is still in use. The logs above, captured while setting up the EFI memory map for the next kernel with kexec -s, confirm the use-after-free.

Looking at the above code flow, it makes sense to skip efi_arch_mem_reserve() to fix this issue, as it needs to be skipped in the kexec case anyway.
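
To make the aliasing concrete, here is a minimal userspace C sketch (not kernel code) of the condition that goes undetected in this flow: the map being installed occupies the same range as the dynamically allocated map that is about to be freed. The struct, the helper name and the SKETCH_* flag values are assumptions made for illustration; only the addresses and sizes are taken from the memblock logs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the EFI_MEMMAP_* flags; values are assumed. */
#define SKETCH_MEMMAP_MEMBLOCK (1UL << 1)
#define SKETCH_MEMMAP_SLAB     (1UL << 2)

/* Reduced model of the fields of efi.memmap that matter here. */
struct sketch_memmap {
	uint64_t phys_map;    /* physical base of the descriptor array */
	uint64_t size;        /* desc_size * nr_map */
	unsigned long flags;  /* how the backing memory was allocated */
};

/*
 * Returns true when replacing 'cur' with 'next' would free a dynamically
 * allocated range that 'next' still points into, i.e. the use-after-free
 * described above.
 */
static bool sketch_install_frees_live_map(const struct sketch_memmap *cur,
					  const struct sketch_memmap *next)
{
	bool cur_is_dynamic =
		(cur->flags & (SKETCH_MEMMAP_MEMBLOCK | SKETCH_MEMMAP_SLAB)) != 0;
	bool overlap = cur->phys_map < next->phys_map + next->size &&
		       next->phys_map < cur->phys_map + cur->size;

	return cur_is_dynamic && overlap;
}
```

With the range from the logs (base 0x27ffcaf80, size 0x1080, memblock-allocated) passed as both the current and the new map, the helper returns true, which matches what happens in kexec_enter_virtual_mode(). With the early map at 0x27fff9110 (size 0x1020, no allocation flag) as the current one, it returns false.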

Thanks, Ashish


2024-06-05 01:48:32

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Wed, 5 Jun 2024 at 06:36, Kalra, Ashish <[email protected]> wrote:
>
> Re-sending as the earlier response got line-wrapped.
>
> On 6/3/2024 12:12 PM, Borislav Petkov wrote:
> > On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
> >> efi_arch_mem_reserve().
> > Now it only remains for you to explain why...
>
> Here is a detailed explanation of what is causing the EFI memory map corruption, with added debug logs and memblock debugging enabled:
>
> Initially at boot, efi_memblock_x86_reserve_range() does early_memremap() of the EFI memory map passed as part of setup_data, as the following logs show:
>
> ...
>
> [ 0.000000] efi: in efi_memblock_x86_reserve_range, phys map 0x27fff9110
> [ 0.000000] memblock_reserve: [0x000000027fff9110-0x000000027fffa12f] efi_memblock_x86_reserve_range+0x168/0x2a0
>
> ...
>
> Later, efi_arch_mem_reserve() is invoked, which calls efi_memmap_alloc() which does memblock_phys_alloc() to insert new EFI memory descriptor into efi.memap:
>
> ...
>
> [ 0.733263] memblock_reserve: [0x000000027ffcaf80-0x000000027ffcbfff] memblock_alloc_range_nid+0xf1/0x1b0
> [ 0.734787] efi: efi_arch_mem_reserve, efi phys map 0x27ffcaf80
>
> ...
>
> Finally, at the end of boot, kexec_enter_virtual_mode() is called.
>
> It does mapping of efi regions which were passed via setup_data.
>
> So it unregisters the early mem-remapped EFI memmap and installs the new EFI memory map as below:
>
> ( Because of efi_arch_mem_reserve() getting invoked, the new EFI memmap phys base being remapped now is the memblock allocation done in efi_arch_mem_reserve()).
>
> [ 4.042160] efi: efi memmap phys map 0x27ffcaf80
>
> So, kexec_enter_virtual_mode() does the following :
>
> if (efi_memmap_init_late(efi.memmap.phys_map, <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
> efi.memmap.desc_size * efi.memmap.nr_map)) { ...
>
> This late init, does a memremap() on this memblock-allocated memory, but then immediately frees it :
>
> drivers/firmware/efi/memmap.c:
>
> int __init __efi_memmap_init(struct efi_memory_map_data *data)
> {
>
> ..
>
> phys_map = data->phys_map; <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
>
> if (data->flags & EFI_MEMMAP_LATE)
> map.map = memremap(phys_map, data->size, MEMREMAP_WB);
> ...
> ...
> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
> __efi_memmap_free(efi.memmap.phys_map,
> efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
> }

From your debugging, the memmap should not be freed. This piece of
code was added in the commit below; I have added Dan Williams to the Cc list:
commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
Author: Dan Williams <[email protected]>
Date: Mon Jan 13 18:22:44 2020 +0100

efi: Fix efi_memmap_alloc() leaks

With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
updated and replaced multiple times. When that happens a previous
dynamically allocated efi memory map can be garbage collected. Use the
new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
allocated memory map is being replaced.


>
> ...
> map.phys_map = data->phys_map;
>
> ...
>
> efi.memmap = map;
>
> ...
>
> This happens as kexec_enter_virtual_mode() can only handle the early mapped EFI memmap and not the one which is memblock allocated by efi_arch_mem_reserve(). As seen above this memblock allocated (EFI_MEMMAP_MEMBLOCK tagged) memory gets freed.
>
> This is confirmed by memblock debugging:
>
> [ 4.044057] memblock_free_late: [0x000000027ffcaf80-0x000000027ffcbfff] __efi_memmap_free+0x66/0x80
>
> So while this memory is memremapped, it has also been freed and then it gets into a use-after-free condition and subsequently gets corrupted.
>
> This corruption is seen just before kexec-ing into the new kernel:
>
> ...
> [ 11.045522] PEFILE: Unsigned PE binary
> [ 11.060801] kexec-bzImage64: efi memmap phys map 0x27ffcaf80
> ...
> [ 11.061220] kexec-bzImage64: mmap entry, type = 11, va = 0xfffffffeffc00000, pa = 0xffc00000, np = 0x400, attr = 0x8000000000000001
> [ 11.061225] kexec-bzImage64: mmap entry, type = 6, va = 0xfffffffeffb04000, pa = 0x7f704000, np = 0x84, attr = 0x800000000000000f
> [ 11.061228] kexec-bzImage64: mmap entry, type = 4, va = 0xfffffffeff700000, pa = 0x7f100000, np = 0x300, attr = 0x0
> [ 11.061231] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0 <- CORRUPTION!!!
> ...
> [ 11.061314] kexec-bzImage64: mmap entry, type = 14080, va = 0x14f29, pa = 0x36c0, np = 0x0, attr = 0x0
> [ 11.061317] kexec-bzImage64: mmap entry, type = 85808, va = 0x0, pa = 0x0, np = 0x72, attr = 0x14f40
> [ 11.061320] kexec-bzImage64: mmap entry, type = 0, va = 0x14f4b, pa = 0x65, np = 0x1, attr = 0x0
> [ 11.061323] kexec-bzImage64: mmap entry, type = 85840, va = 0x0, pa = 0x2, np = 0x69, attr = 0x14f59
> [ 11.061325] kexec-bzImage64: mmap entry, type = 0, va = 0x14f65, pa = 0x6c, np = 0x0, attr = 0x0
> [ 11.061328] kexec-bzImage64: mmap entry, type = 85871, va = 0x0, pa = 0x0, np = 0x7a, attr = 0x14f7f
>
>
> ...
>
> This EFI phys map address 0x27ffcaf80 is being mem-remapped and also getting freed and then in use after free condition (while setting up the EFI memory map for the next kernel with kexec -s) in the above logs confirm the use-after-free case.
>
> Looking at the above code flow, it makes sense to skip efi_arch_mem_reserve() to fix this issue, as it anyway needs to be skipped for kexec case.
>
> Thanks, Ashish
>


2024-06-05 01:52:03

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

> > ...
> > if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
> > __efi_memmap_free(efi.memmap.phys_map,
> > efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
> > }
>
> From your debugging the memmap should not be freed. This piece of
> code was added in below commit, added Dan Williams in cc list:
> commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
> Author: Dan Williams <[email protected]>
> Date: Mon Jan 13 18:22:44 2020 +0100
>
> efi: Fix efi_memmap_alloc() leaks
>
> With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> updated and replaced multiple times. When that happens a previous
> dynamically allocated efi memory map can be garbage collected. Use the
> new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> allocated memory map is being replaced.
>

Dan, should those regions perhaps be freed only for the "fake" memmap?


2024-06-05 01:58:52

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Wed, 5 Jun 2024 at 09:52, Dave Young <[email protected]> wrote:
>
> > > ...
> > > if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
> > > __efi_memmap_free(efi.memmap.phys_map,
> > > efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
> > > }
> >
> > From your debugging the memmap should not be freed. This piece of
> > code was added in below commit, added Dan Williams in cc list:
> > commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
> > Author: Dan Williams <[email protected]>
> > Date: Mon Jan 13 18:22:44 2020 +0100
> >
> > efi: Fix efi_memmap_alloc() leaks
> >
> > With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> > updated and replaced multiple times. When that happens a previous
> > dynamically allocated efi memory map can be garbage collected. Use the
> > new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> > allocated memory map is being replaced.
> >
>
> Dan, probably those regions should be freed only for "fake" memmap?

Ashish, can you comment out the __efi_memmap_free() call and see if it
works for you, just to confirm the behavior?


2024-06-05 02:10:17

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

Hello Dave,

On 6/4/2024 8:58 PM, Dave Young wrote:
> On Wed, 5 Jun 2024 at 09:52, Dave Young <[email protected]> wrote:
>>>> ...
>>>> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
>>>> __efi_memmap_free(efi.memmap.phys_map,
>>>> efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
>>>> }
>>> From your debugging the memmap should not be freed. This piece of
>>> code was added in below commit, added Dan Williams in cc list:
>>> commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
>>> Author: Dan Williams <[email protected]>
>>> Date: Mon Jan 13 18:22:44 2020 +0100
>>>
>>> efi: Fix efi_memmap_alloc() leaks
>>>
>>> With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
>>> updated and replaced multiple times. When that happens a previous
>>> dynamically allocated efi memory map can be garbage collected. Use the
>>> new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
>>> allocated memory map is being replaced.
>>>
>> Dan, probably those regions should be freed only for "fake" memmap?
> Ashish, can you comment out the __efi_memmap_free see if it works for
> you just confirm about the behavior.

Yes, I have already tried and tested that: if I avoid __efi_memmap_free(), I don't see this memory map corruption.

Thanks, Ashish


2024-06-05 02:15:13

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On 6/4/2024 8:48 PM, Dave Young wrote:

> On Wed, 5 Jun 2024 at 06:36, Kalra, Ashish <[email protected]> wrote:
>> Re-sending as the earlier response got line-wrapped.
>>
>> On 6/3/2024 12:12 PM, Borislav Petkov wrote:
>>> On Mon, Jun 03, 2024 at 12:08:48PM -0500, Kalra, Ashish wrote:
>>>> efi_arch_mem_reserve().
>>> Now it only remains for you to explain why...
>> Here is a detailed explanation of what is causing the EFI memory map corruption, with added debug logs and memblock debugging enabled:
>>
>> Initially at boot, efi_memblock_x86_reserve_range() does early_memremap() of the EFI memory map passed as part of setup_data, as the following logs show:
>>
>> ...
>>
>> [ 0.000000] efi: in efi_memblock_x86_reserve_range, phys map 0x27fff9110
>> [ 0.000000] memblock_reserve: [0x000000027fff9110-0x000000027fffa12f] efi_memblock_x86_reserve_range+0x168/0x2a0
>>
>> ...
>>
>> Later, efi_arch_mem_reserve() is invoked, which calls efi_memmap_alloc() which does memblock_phys_alloc() to insert new EFI memory descriptor into efi.memap:
>>
>> ...
>>
>> [ 0.733263] memblock_reserve: [0x000000027ffcaf80-0x000000027ffcbfff] memblock_alloc_range_nid+0xf1/0x1b0
>> [ 0.734787] efi: efi_arch_mem_reserve, efi phys map 0x27ffcaf80
>>
>> ...
>>
>> Finally, at the end of boot, kexec_enter_virtual_mode() is called.
>>
>> It does mapping of efi regions which were passed via setup_data.
>>
>> So it unregisters the early mem-remapped EFI memmap and installs the new EFI memory map as below:
>>
>> ( Because of efi_arch_mem_reserve() getting invoked, the new EFI memmap phys base being remapped now is the memblock allocation done in efi_arch_mem_reserve()).
>>
>> [ 4.042160] efi: efi memmap phys map 0x27ffcaf80
>>
>> So, kexec_enter_virtual_mode() does the following :
>>
>> if (efi_memmap_init_late(efi.memmap.phys_map, <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
>> efi.memmap.desc_size * efi.memmap.nr_map)) { ...
>>
>> This late init, does a memremap() on this memblock-allocated memory, but then immediately frees it :
>>
>> drivers/firmware/efi/memmap.c:
>>
>> int __init __efi_memmap_init(struct efi_memory_map_data *data)
>> {
>>
>> ..
>>
>> phys_map = data->phys_map; <- refers to the new EFI memmap phys base allocated via memblock in efi_arch_mem_reserve().
>>
>> if (data->flags & EFI_MEMMAP_LATE)
>> map.map = memremap(phys_map, data->size, MEMREMAP_WB);
>> ...
>> ...
>> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
>> __efi_memmap_free(efi.memmap.phys_map,
>> efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
>> }
> From your debugging the memmap should not be freed.

Yes, it looks like it should not be freed, as the new and previous EFI memory maps can be the same.

Thanks, Ashish

> This piece of
> code was added in below commit, added Dan Williams in cc list:
> commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
> Author: Dan Williams <[email protected]>
> Date: Mon Jan 13 18:22:44 2020 +0100
>
> efi: Fix efi_memmap_alloc() leaks
>
> With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> updated and replaced multiple times. When that happens a previous
> dynamically allocated efi memory map can be garbage collected. Use the
> new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> allocated memory map is being replaced.
>
>
>> ...
>> map.phys_map = data->phys_map;
>>
>> ...
>>
>> efi.memmap = map;
>>
>> ...
>>
>> This happens as kexec_enter_virtual_mode() can only handle the early mapped EFI memmap and not the one which is memblock allocated by efi_arch_mem_reserve(). As seen above this memblock allocated (EFI_MEMMAP_MEMBLOCK tagged) memory gets freed.
>>
>> This is confirmed by memblock debugging:
>>
>> [ 4.044057] memblock_free_late: [0x000000027ffcaf80-0x000000027ffcbfff] __efi_memmap_free+0x66/0x80
>>
>> So while this memory is memremapped, it has also been freed and then it gets into a use-after-free condition and subsequently gets corrupted.
>>
>> This corruption is seen just before kexec-ing into the new kernel:
>>
>> ...
>> [ 11.045522] PEFILE: Unsigned PE binary
>> [ 11.060801] kexec-bzImage64: efi memmap phys map 0x27ffcaf80
>> ...
>> [ 11.061220] kexec-bzImage64: mmap entry, type = 11, va = 0xfffffffeffc00000, pa = 0xffc00000, np = 0x400, attr = 0x8000000000000001
>> [ 11.061225] kexec-bzImage64: mmap entry, type = 6, va = 0xfffffffeffb04000, pa = 0x7f704000, np = 0x84, attr = 0x800000000000000f
>> [ 11.061228] kexec-bzImage64: mmap entry, type = 4, va = 0xfffffffeff700000, pa = 0x7f100000, np = 0x300, attr = 0x0
>> [ 11.061231] kexec-bzImage64: mmap entry, type = 0, va = 0x0, pa = 0x0, np = 0x0, attr = 0x0 <- CORRUPTION!!!
>> ...
>> [ 11.061314] kexec-bzImage64: mmap entry, type = 14080, va = 0x14f29, pa = 0x36c0, np = 0x0, attr = 0x0
>> [ 11.061317] kexec-bzImage64: mmap entry, type = 85808, va = 0x0, pa = 0x0, np = 0x72, attr = 0x14f40
>> [ 11.061320] kexec-bzImage64: mmap entry, type = 0, va = 0x14f4b, pa = 0x65, np = 0x1, attr = 0x0
>> [ 11.061323] kexec-bzImage64: mmap entry, type = 85840, va = 0x0, pa = 0x2, np = 0x69, attr = 0x14f59
>> [ 11.061325] kexec-bzImage64: mmap entry, type = 0, va = 0x14f65, pa = 0x6c, np = 0x0, attr = 0x0
>> [ 11.061328] kexec-bzImage64: mmap entry, type = 85871, va = 0x0, pa = 0x0, np = 0x7a, attr = 0x14f7f
>>
>>
>> ...
>>
>> This EFI phys map address 0x27ffcaf80 is being mem-remapped and also getting freed and then in use after free condition (while setting up the EFI memory map for the next kernel with kexec -s) in the above logs confirm the use-after-free case.
>>
>> Looking at the above code flow, it makes sense to skip efi_arch_mem_reserve() to fix this issue, as it anyway needs to be skipped for kexec case.
>>
>> Thanks, Ashish
>>

2024-06-05 02:28:13

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Wed, 5 Jun 2024 at 10:09, Kalra, Ashish <[email protected]> wrote:
>
> Hello Dave,
>
> On 6/4/2024 8:58 PM, Dave Young wrote:
> > On Wed, 5 Jun 2024 at 09:52, Dave Young <[email protected]> wrote:
> >>>> ...
> >>>> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB)) {
> >>>> __efi_memmap_free(efi.memmap.phys_map,
> >>>> efi.memmap.desc_size * efi.memmap.nr_map, efi.memmap.flags);
> >>>> }
> >>> From your debugging the memmap should not be freed. This piece of
> >>> code was added in below commit, added Dan Williams in cc list:
> >>> commit f0ef6523475f18ccd213e22ee593dfd131a2c5ea
> >>> Author: Dan Williams <[email protected]>
> >>> Date: Mon Jan 13 18:22:44 2020 +0100
> >>>
> >>> efi: Fix efi_memmap_alloc() leaks
> >>>
> >>> With efi_fake_memmap() and efi_arch_mem_reserve() the efi table may be
> >>> updated and replaced multiple times. When that happens a previous
> >>> dynamically allocated efi memory map can be garbage collected. Use the
> >>> new EFI_MEMMAP_{SLAB,MEMBLOCK} flags to detect when a dynamically
> >>> allocated memory map is being replaced.
> >>>
> >> Dan, probably those regions should be freed only for "fake" memmap?
> > Ashish, can you comment out the __efi_memmap_free see if it works for
> > you just confirm about the behavior.
>
> Yes, i have already tried and tested that, if i avoid __efi_memmap_free(), then i don't see this memory map corruption.

Ok, thanks! I think the right way is to create two patches: one to
remove the __efi_memmap_free() call, and another to skip
efi_arch_mem_reserve() when the EFI_MEMORY_RUNTIME bit is already set.
But the first one should be the fix for the root cause.
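
As a rough illustration of the second patch idea, the attribute check could look like the sketch below. The helper name is hypothetical and a real change would live in efi_arch_mem_reserve(); the only fact relied on is that EFI_MEMORY_RUNTIME is bit 63 of the descriptor attribute, which is visible in the attr = 0x8000000000000001 entries in the logs.

```c
#include <stdbool.h>
#include <stdint.h>

/* EFI_MEMORY_RUNTIME is bit 63 of the memory descriptor attribute field. */
#define SKETCH_EFI_MEMORY_RUNTIME 0x8000000000000000ULL

/*
 * Hypothetical helper: true if the region is already marked as a runtime
 * region, in which case reserving it again could be skipped.
 */
static bool sketch_already_runtime(uint64_t attribute)
{
	return (attribute & SKETCH_EFI_MEMORY_RUNTIME) != 0;
}
```

For the first two valid entries in the log (attr = 0x8000000000000001 and 0x800000000000000f) this returns true; for the type = 4 entry with attr = 0x0 it returns false.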

EFI fake mem is only for debugging purposes; the "memleak" mentioned
in commit f0ef6523475f should be solved in another way if needed (are
they really leaked, or just not useful anymore?).

Anyway, this is my opinion; please wait for input from the x86 and EFI reviewers.

>
> Thanks, Ashish
>


2024-06-05 02:53:43

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Wed, 5 Jun 2024 at 02:03, Borislav Petkov <[email protected]> wrote:
>
> On Tue, Jun 04, 2024 at 07:09:56PM +0800, Dave Young wrote:
> > Anyway there is not such a helper for all cases.
>
> But maybe there should be...
>
> This is not the first case where the need arises to be able to say:
>
> if (am I a kexeced kernel)
>
> in code.
>
> Perhaps we should have a global var kexeced or so which gets incremented
> on each kexec-ed kernel, somewhere in very early boot of the kexec-ed
> kernel we do
>
> kexeced++;
>
> and then other code can query it and know whether this is a kexec-ed
> kernel and how many times it got kexec-ed...

It's good to have but not a must for the time being. I also have no
idea how to save the status across boots. For the EFI boot case an
EFI variable could probably be used, but how would it be cleared on a
physical boot? Otherwise, maybe by injecting some kernel parameters.
Anyway, this needs more thinking.

>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>


2024-06-05 07:44:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Wed, Jun 05, 2024 at 10:53:44AM +0800, Dave Young wrote:
> It's something good to have but not must for the time being, also no
> idea how to save the status across boot, for EFI boot case probably a
> EFI var can be used;

Yes.

> but how can it be cleared in case of physical boot. Otherwise
> probably injecting some kernel parameters, anyway this needs more
> thinking.

Yeah, this'll need proper analysis whether we can even do that reliably.

We need to increment it only on the kexec reboot paths and clear it on
the normal reboot paths.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-05 08:22:35

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Wed, 5 Jun 2024 at 09:43, Borislav Petkov <[email protected]> wrote:
>
> On Wed, Jun 05, 2024 at 10:53:44AM +0800, Dave Young wrote:
> > It's something good to have but not must for the time being, also no
> > idea how to save the status across boot, for EFI boot case probably a
> > EFI var can be used;
>
> Yes.
>
> > but how can it be cleared in case of physical boot. Otherwise
> > probably injecting some kernel parameters, anyway this needs more
> > thinking.
>
> Yeah, this'll need proper analysis whether we can even do that reliably.
>
> We need to increment it only on the kexec reboot paths and clear it on
> the normal reboot paths.
>

I'd argue for the opposite: ideally, the difference between the first
boot and not-the-first-boot should be abstracted away by the
'bootloader' side of kexec as much as possible, so that the tricky
early startup code doesn't have to be riddled with different code
paths depending on !kexec vs kexec.

TDX is a good case in point here: rather than add more conditionals,
I'd urge to remove them so the TDX startup code doesn't have to care
about the difference at all. If there is anything special that needs
to be done, it belongs in the kexec implementation of the previous
kernel.

2024-06-05 11:11:05

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

Moving Ard and Dan to To:

On Wed, Jun 05, 2024 at 10:28:18AM +0800, Dave Young wrote:
> Ok, thanks! I think the right way is creating two patches, one to
> remove the __efi_memmap_free,

Yap, that

f0ef6523475f ("efi: Fix efi_memmap_alloc() leaks")

needs revisiting.

So AFAIU, the flow is this:

In a kexec-ed kernel:

1. efi_arch_mem_reserve() gets called by bgrt, erst, mokvar... whatever
to hold on to boot services regions for longer otherwise EFI
"implementations" explode.

2. On same kexec-ed kernel, we call into kexec_enter_virtual_mode()
because it needs to get the runtime services regions from the first
kernel

3. As part of that call, it'll do
efi_memmap_init_late->__efi_memmap_init():

if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB))
__efi_memmap_free(efi.memmap.phys_map,

and the memory which got allocated in step 1 is gone, thus reverting
what efi_arch_mem_reserve() is trying to fix.

IOW, we need a

EFI_MEMMAP_DO_NOT_TOUCH_MY_MEMORY

flag which'll stop this from happening. But I'd prefer it if Ard decides
what the right thing to do here is.
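
Purely as an illustration of what such a flag would gate, here is a userspace C sketch. The SKETCH_* names and bit values are assumptions, the "keep" bit stands in for the hypothetical EFI_MEMMAP_DO_NOT_TOUCH_MY_MEMORY, and none of this is existing kernel API.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the EFI_MEMMAP_* flags; values are assumed. */
#define SKETCH_MEMMAP_MEMBLOCK (1UL << 1)
#define SKETCH_MEMMAP_SLAB     (1UL << 2)
/* Hypothetical bit: the current map must survive a later re-init. */
#define SKETCH_MEMMAP_KEEP     (1UL << 3)

/*
 * Decide whether the old map may be released when a new one is installed.
 * Only dynamically allocated maps are candidates, and the keep bit vetoes
 * the free.
 */
static bool sketch_should_free_old_map(unsigned long flags)
{
	if (flags & SKETCH_MEMMAP_KEEP)
		return false;

	return (flags & (SKETCH_MEMMAP_MEMBLOCK | SKETCH_MEMMAP_SLAB)) != 0;
}
```

In this model the map allocated in step 1 would carry the keep bit, so the check in step 3 would leave it alone, while maps without the bit would still be collected as before.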

> another is skip efi_arch_mem_reserve when the EFI_MEMORY_RUNTIME bit
> was set already.

Can that even happen?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-05 11:16:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Wed, Jun 05, 2024 at 10:17:22AM +0200, Ard Biesheuvel wrote:
> I'd argue for the opposite: ideally, the difference between the first
> boot and not-the-first-boot should be abstracted away by the
> 'bootloader' side of kexec as much as possible, so that the tricky
> early startup code doesn't have to be riddled with different code
> paths depending on !kexec vs kexec.

Well, off and on we end up needing to be able to ask whether the current
kernel is kexec-ed. So you need to be able to access that aspect in
kernel code - not in the bootloader. Perhaps read it from the
bootloader, sure.

But see my other mail from just now - we might end up not needing it
after all, and I'd prefer if we never ever have to ask that question.
But just from staring at the EFI code, I was reminded that we already
do ask that question:

if (efi_setup)
kexec_enter_virtual_mode();
else
__efi_enter_virtual_mode();

*exactly* because of EFI and that virtual_map call nonsense of allowing
it only once.

And we check efi_setup here because that works. But you can't use that
globally. And so on...

> TDX is a good case in point here: rather than add more conditionals,
> I'd urge to remove them so the TDX startup code doesn't have to care
> about the difference at all. If there is anything special that needs
> to be done, it belongs in the kexec implementation of the previous
> kernel.

Sure, but reality is not as easy sometimes.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-06 01:52:35

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v7 1/3] efi/x86: Fix EFI memory map corruption with kexec

On Wed, 5 Jun 2024 at 19:09, Borislav Petkov <[email protected]> wrote:
>
> Moving Ard and Dan to To:
>
> On Wed, Jun 05, 2024 at 10:28:18AM +0800, Dave Young wrote:
> > Ok, thanks! I think the right way is creating two patches, one to
> > remove the __efi_memmap_free,
>
> Yap, that
>
> f0ef6523475f ("efi: Fix efi_memmap_alloc() leaks")
>
> needs revisiting.
>
> So AFAIU, the flow is this:
>
> In a kexec-ed kernel:
>
> 1. efi_arch_mem_reserve() gets called by bgrt, erst, mokvar... whatever
> to hold on to boot services regions for longer otherwise EFI
> "implementations" explode.
>
> 2. On same kexec-ed kernel, we call into kexec_enter_virtual_mode()
> because it needs to get the runtime services regions from the first
> kernel
>
> 3. As part of that call, it'll do
> efi_memmap_init_late->__efi_memmap_init():
>
> if (efi.memmap.flags & (EFI_MEMMAP_MEMBLOCK | EFI_MEMMAP_SLAB))
> __efi_memmap_free(efi.memmap.phys_map,
>
> and the memory which got allocated in step 1 is gone, thus reverting
> what efi_arch_mem_reserve() is trying to fix.
>
> IOW, we need a
>
> EFI_MEMMAP_DO_NOT_TOUCH_MY_MEMORY
>
> flag which'll stop this from happening. But I'd prefer it if Ard decides
> what the right thing to do here is.
>
> > another is skip efi_arch_mem_reserve when the EFI_MEMORY_RUNTIME bit
> > was set already.
>
> Can that even happen?

Yes. Let's say we have two different cases that both go through
drivers/firmware/efi/efi-bgrt.c -> efi_mem_reserve ->
efi_arch_mem_reserve:
1. normal boot (non kexec-ed)
The bgrt region is reserved and marked as EFI_MEMORY_RUNTIME with a
new efi mem range which is inserted in the memmap. Later kexec will
carry it over to the 2nd kernel (dropping those boot service areas
without EFI_MEMORY_RUNTIME).
2. kexec-ed boot
In the same call path, the bgrt region saved by the previous kernel
already has EFI_MEMORY_RUNTIME set, but it is re-reserved with a new
mem entry in the memmap. This is unnecessary and a duplicate. I did
not check whether the efi boot code de-duplicates the memmap later,
but either way this is useless and should be skipped.
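
A rough sketch of the skip, as a userspace model in C rather than the
actual efi_arch_mem_reserve() code. struct md_model and needs_reserve()
are hypothetical names; the EFI_MEMORY_RUNTIME value is the attribute
bit from the UEFI specification:

```c
#include <stdbool.h>
#include <stdint.h>

/* EFI_MEMORY_RUNTIME attribute bit as defined by the UEFI specification. */
#define EFI_MEMORY_RUNTIME 0x8000000000000000ULL

/* Reduced stand-in for an EFI memory descriptor (hypothetical type). */
struct md_model {
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

/*
 * Returns true when the region still needs a new reserved entry.
 * Case 1 (normal boot): the bit is clear, so the region gets reserved.
 * Case 2 (kexec-ed boot): the previous kernel already set the bit, so
 * inserting another entry would only duplicate it.
 */
static bool needs_reserve(const struct md_model *md)
{
	return (md->attribute & EFI_MEMORY_RUNTIME) == 0;
}
```

A caller would return early when needs_reserve() is false, before
allocating a new memmap.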

Thanks
Dave


2024-06-06 09:21:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v7 2/3] x86/boot/compressed: Skip Video Memory access in Decompressor for SEV-ES/SNP.

On Thu, May 30, 2024 at 11:37:14PM +0000, Ashish Kalra wrote:
> - lines = boot_params_ptr->screen_info.orig_video_lines;
> - cols = boot_params_ptr->screen_info.orig_video_cols;
> + if (!(sev_status & MSR_AMD64_SEV_ES_ENABLED)) {
> + lines = boot_params_ptr->screen_info.orig_video_lines;
> + cols = boot_params_ptr->screen_info.orig_video_cols;
> + }

By now I get an allergic reaction from this sprinkling of "if sev..."
everywhere in the code.

> init_default_io_ops();

<--- right here there's a call to

early_tdx_detect();

You can add an early_sev_detect() counterpart here and clear lines and
cols in it along with an explanation why it is being done.

This is at least a bit cleaner than this.
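
Something along these lines, written as a standalone C model and not as
decompressor code. struct screen_model and early_sev_detect_model() are
hypothetical names, and the SEV-ES bit position in the status value is
an assumption of the sketch:

```c
#include <stdint.h>

/* SEV-ES enabled bit in the SEV status value; assumed for this sketch. */
#define SEV_ES_ENABLED (1ULL << 1)

/* Hypothetical stand-in for the console geometry globals (lines, cols). */
struct screen_model {
	int lines;
	int cols;
};

/*
 * Modelled on the early_sev_detect() suggestion: with SEV-ES active,
 * zero the console geometry in one place so that later console code
 * has no screen area to write to, instead of testing sev_status at
 * every call site.
 */
static void early_sev_detect_model(uint64_t sev_status, struct screen_model *s)
{
	if (sev_status & SEV_ES_ENABLED) {
		s->lines = 0;
		s->cols = 0;
	}
}
```

The geometry would be read from boot_params unconditionally first, and
the helper would then be called next to early_tdx_detect().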

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-11 18:45:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On 6/4/24 08:21, Kirill A. Shutemov wrote:
>
> From b45fe48092abad2612c2bafbb199e4de80c99545 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Fri, 10 Feb 2023 12:53:11 +0300
> Subject: [PATCHv11.1 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
>
> TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
> that bit is cleared during CR4 register reprogramming during boot or
> kexec flows, a #VE exception will be raised which the guest kernel
> cannot handle.
>
> Therefore, make sure the CR4.MCE setting is preserved over kexec too and
> avoid raising any #VEs.
>
> The change doesn't affect non-TDX-guest environments.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/kernel/relocate_kernel_64.S | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 085eef5c3904..9c2cf70c5f54 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -5,6 +5,8 @@
> */
>
> #include <linux/linkage.h>
> +#include <linux/stringify.h>
> +#include <asm/alternative.h>
> #include <asm/page_types.h>
> #include <asm/kexec.h>
> #include <asm/processor-flags.h>
> @@ -145,14 +147,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> * Set cr4 to a known state:
> * - physical address extension enabled
> * - 5-level paging, if it was enabled before
> + * - Machine check exception on TDX guest, if it was enabled before.
> + * Clearing MCE might not be allowed in TDX guests, depending on setup.
> + *
> + * Use R13 that contains the original CR4 value, read in relocate_kernel().
> + * PAE is always set in the original CR4.
> */
> - movl $X86_CR4_PAE, %eax
> - testq $X86_CR4_LA57, %r13
> - jz .Lno_la57
> - orl $X86_CR4_LA57, %eax
> -.Lno_la57:
> -
> - movq %rax, %cr4
> + andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
> + ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
> + movq %r13, %cr4
>

If this is the case, I don't really see a reason to clear MCE per se as
I'm guessing a machine check here will be fatal anyway? It just changes
the method of death.

Also, is there a reason to save %cr4, run code, and *then* clear the
relevant bits? Wouldn't it be better to sanitize %cr4 as soon as possible?
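
For reference, the CR4 value computed before and after the patch can be
modelled in plain C. This is only a sketch of the bit arithmetic, not
the assembly itself: cr4_old() and cr4_new() are hypothetical names, and
the tdx_guest argument stands in for the ALTERNATIVE keyed on
X86_FEATURE_TDX_GUEST:

```c
#include <stdbool.h>
#include <stdint.h>

/* CR4 bit positions: PAE is bit 5, MCE is bit 6, LA57 is bit 12. */
#define X86_CR4_PAE  (1ULL << 5)
#define X86_CR4_MCE  (1ULL << 6)
#define X86_CR4_LA57 (1ULL << 12)

/* Old flow: start from PAE and add LA57 only if it was set before. */
static uint64_t cr4_old(uint64_t orig)
{
	uint64_t v = X86_CR4_PAE;

	if (orig & X86_CR4_LA57)
		v |= X86_CR4_LA57;
	return v;
}

/* Patched flow: mask the saved value, then force MCE on a TDX guest. */
static uint64_t cr4_new(uint64_t orig, bool tdx_guest)
{
	uint64_t v = orig & (X86_CR4_PAE | X86_CR4_LA57);

	if (tdx_guest)
		v |= X86_CR4_MCE;
	return v;
}
```

As long as PAE is set in the saved value, the two agree outside a TDX
guest; inside one, the only difference is that MCE stays set.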

-hpa

2024-06-12 12:10:33

by Nikolay Borisov

[permalink] [raw]
Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion



On 3.06.24 17:43, H. Peter Anvin wrote:
> On 5/29/24 03:47, Nikolay Borisov wrote:
>>>
>>> diff --git a/arch/x86/kernel/relocate_kernel_64.S
>>> b/arch/x86/kernel/relocate_kernel_64.S
>>> index 56cab1bb25f5..085eef5c3904 100644
>>> --- a/arch/x86/kernel/relocate_kernel_64.S
>>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>>> @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>>        */
>>>       movl    $X86_CR4_PAE, %eax
>>>       testq    $X86_CR4_LA57, %r13
>>> -    jz    1f
>>> +    jz    .Lno_la57
>>>       orl    $X86_CR4_LA57, %eax
>>> -1:
>>> +.Lno_la57:
>>> +
>>>       movq    %rax, %cr4
>>>       jmp 1f
>>
>> That jmp 1f becomes redundant now as it simply jumps 1 line below.
>>
>
> Uh... am I the only person to notice that ALL that is needed here is:
>
>     andl $(X86_CR4_PAE|X86_CR4_LA57), %r13d
>     movq %r13, %rax
>
> ... since %r13 is dead afterwards, and PAE *will* have been set in %r13
> already?
>
> I don't believe that this specific jmp is actually needed -- there are
> several more synchronizing jumps later -- but it doesn't hurt.
>
> However, if the effort is for improving the readability, it might be
> worthwhile to encapsulate the "jmp 1f; 1:" as a macro, e.g. "SYNC_CODE".


The preceding move to CR4 is itself a serializing instruction, no?

>
>     -hpa

2024-06-12 13:37:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On Tue, Jun 11, 2024 at 11:26:17AM -0700, H. Peter Anvin wrote:
> On 6/4/24 08:21, Kirill A. Shutemov wrote:
> >
> > From b45fe48092abad2612c2bafbb199e4de80c99545 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <[email protected]>
> > Date: Fri, 10 Feb 2023 12:53:11 +0300
> > Subject: [PATCHv11.1 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
> >
> > TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
> > that bit is cleared during CR4 register reprogramming during boot or
> > kexec flows, a #VE exception will be raised which the guest kernel
> > cannot handle.
> >
> > Therefore, make sure the CR4.MCE setting is preserved over kexec too and
> > avoid raising any #VEs.
> >
> > The change doesn't affect non-TDX-guest environments.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/kernel/relocate_kernel_64.S | 17 ++++++++++-------
> > 1 file changed, 10 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > index 085eef5c3904..9c2cf70c5f54 100644
> > --- a/arch/x86/kernel/relocate_kernel_64.S
> > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > @@ -5,6 +5,8 @@
> > */
> > #include <linux/linkage.h>
> > +#include <linux/stringify.h>
> > +#include <asm/alternative.h>
> > #include <asm/page_types.h>
> > #include <asm/kexec.h>
> > #include <asm/processor-flags.h>
> > @@ -145,14 +147,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> > * Set cr4 to a known state:
> > * - physical address extension enabled
> > * - 5-level paging, if it was enabled before
> > + * - Machine check exception on TDX guest, if it was enabled before.
> > + * Clearing MCE might not be allowed in TDX guests, depending on setup.
> > + *
> > + * Use R13 that contains the original CR4 value, read in relocate_kernel().
> > + * PAE is always set in the original CR4.
> > */
> > - movl $X86_CR4_PAE, %eax
> > - testq $X86_CR4_LA57, %r13
> > - jz .Lno_la57
> > - orl $X86_CR4_LA57, %eax
> > -.Lno_la57:
> > -
> > - movq %rax, %cr4
> > + andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
> > + ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
> > + movq %r13, %cr4
>
> If this is the case, I don't really see a reason to clear MCE per se as I'm
> guessing a machine check here will be fatal anyway? It just changes the
> method of death.

Andrew had a strong opinion on method of death here.

https://lore.kernel.org/all/[email protected]

> Also, is there a reason to save %cr4, run code, and *then* clear the
> relevant bits? Wouldn't it be better to sanitize %cr4 as soon as possible?

You mean set new CR4 directly in relocate_kernel() before switching CR3?
I guess it is possible.

But I can't say I see a huge benefit in changing it. Such a change would
have its own risks.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-06-12 23:06:21

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On 12/06/2024 10:22 am, Kirill A. Shutemov wrote:
> On Tue, Jun 11, 2024 at 11:26:17AM -0700, H. Peter Anvin wrote:
>> On 6/4/24 08:21, Kirill A. Shutemov wrote:
>>> From b45fe48092abad2612c2bafbb199e4de80c99545 Mon Sep 17 00:00:00 2001
>>> From: "Kirill A. Shutemov" <[email protected]>
>>> Date: Fri, 10 Feb 2023 12:53:11 +0300
>>> Subject: [PATCHv11.1 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
>>>
>>> TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
>>> that bit is cleared during CR4 register reprogramming during boot or
>>> kexec flows, a #VE exception will be raised which the guest kernel
>>> cannot handle.
>>>
>>> Therefore, make sure the CR4.MCE setting is preserved over kexec too and
>>> avoid raising any #VEs.
>>>
>>> The change doesn't affect non-TDX-guest environments.
>>>
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> ---
>>> arch/x86/kernel/relocate_kernel_64.S | 17 ++++++++++-------
>>> 1 file changed, 10 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
>>> index 085eef5c3904..9c2cf70c5f54 100644
>>> --- a/arch/x86/kernel/relocate_kernel_64.S
>>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>>> @@ -5,6 +5,8 @@
>>> */
>>> #include <linux/linkage.h>
>>> +#include <linux/stringify.h>
>>> +#include <asm/alternative.h>
>>> #include <asm/page_types.h>
>>> #include <asm/kexec.h>
>>> #include <asm/processor-flags.h>
>>> @@ -145,14 +147,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>> * Set cr4 to a known state:
>>> * - physical address extension enabled
>>> * - 5-level paging, if it was enabled before
>>> + * - Machine check exception on TDX guest, if it was enabled before.
>>> + * Clearing MCE might not be allowed in TDX guests, depending on setup.
>>> + *
>>> + * Use R13 that contains the original CR4 value, read in relocate_kernel().
>>> + * PAE is always set in the original CR4.
>>> */
>>> - movl $X86_CR4_PAE, %eax
>>> - testq $X86_CR4_LA57, %r13
>>> - jz .Lno_la57
>>> - orl $X86_CR4_LA57, %eax
>>> -.Lno_la57:
>>> -
>>> - movq %rax, %cr4
>>> + andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
>>> + ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
>>> + movq %r13, %cr4
>> If this is the case, I don't really see a reason to clear MCE per se as I'm
>> guessing a machine check here will be fatal anyway? It just changes the
>> method of death.
> Andrew had a strong opinion on method of death here.
>
> https://lore.kernel.org/all/[email protected]

Not sure if I intended it to come across that strongly, but given a
choice, the !CR4.MCE death is cleaner because at least you're not
interpreting garbage and trying to use it as a valid IDT.

~Andrew

2024-06-12 23:26:42

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion

On June 12, 2024 4:06:07 PM PDT, Andrew Cooper <[email protected]> wrote:
>On 12/06/2024 10:22 am, Kirill A. Shutemov wrote:
>> On Tue, Jun 11, 2024 at 11:26:17AM -0700, H. Peter Anvin wrote:
>>> On 6/4/24 08:21, Kirill A. Shutemov wrote:
>>>> From b45fe48092abad2612c2bafbb199e4de80c99545 Mon Sep 17 00:00:00 2001
>>>> From: "Kirill A. Shutemov" <[email protected]>
>>>> Date: Fri, 10 Feb 2023 12:53:11 +0300
>>>> Subject: [PATCHv11.1 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
>>>>
>>>> TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
>>>> that bit is cleared during CR4 register reprogramming during boot or
>>>> kexec flows, a #VE exception will be raised which the guest kernel
>>>> cannot handle.
>>>>
>>>> Therefore, make sure the CR4.MCE setting is preserved over kexec too and
>>>> avoid raising any #VEs.
>>>>
>>>> The change doesn't affect non-TDX-guest environments.
>>>>
>>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>>> ---
>>>> arch/x86/kernel/relocate_kernel_64.S | 17 ++++++++++-------
>>>> 1 file changed, 10 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
>>>> index 085eef5c3904..9c2cf70c5f54 100644
>>>> --- a/arch/x86/kernel/relocate_kernel_64.S
>>>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>>>> @@ -5,6 +5,8 @@
>>>> */
>>>> #include <linux/linkage.h>
>>>> +#include <linux/stringify.h>
>>>> +#include <asm/alternative.h>
>>>> #include <asm/page_types.h>
>>>> #include <asm/kexec.h>
>>>> #include <asm/processor-flags.h>
>>>> @@ -145,14 +147,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>>> * Set cr4 to a known state:
>>>> * - physical address extension enabled
>>>> * - 5-level paging, if it was enabled before
>>>> + * - Machine check exception on TDX guest, if it was enabled before.
>>>> + * Clearing MCE might not be allowed in TDX guests, depending on setup.
>>>> + *
>>>> + * Use R13 that contains the original CR4 value, read in relocate_kernel().
>>>> + * PAE is always set in the original CR4.
>>>> */
>>>> - movl $X86_CR4_PAE, %eax
>>>> - testq $X86_CR4_LA57, %r13
>>>> - jz .Lno_la57
>>>> - orl $X86_CR4_LA57, %eax
>>>> -.Lno_la57:
>>>> -
>>>> - movq %rax, %cr4
>>>> + andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
>>>> + ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
>>>> + movq %r13, %cr4
>>> If this is the case, I don't really see a reason to clear MCE per se as I'm
>>> guessing a machine check here will be fatal anyway? It just changes the
>>> method of death.
>> Andrew had a strong opinion on method of death here.
>>
>> https://lore.kernel.org/all/[email protected]
>
>Not sure if I intended it to come across that strongly, but given a
>choice, the !CR4.MCE death is cleaner because at least you're not
>interpreting garbage and trying to use it as a valid IDT.
>
>~Andrew

Zorch the IDT if it isn't valid?