2023-10-05 14:57:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 00/13] x86/tdx: Add kexec support

The patchset adds bits and pieces to get kexec (and crashkernel) work on
TDX guest.

They bring kexec support to the point when we can start the new kernel,
but it will only be able to use single CPU. It should be enough to cover
the most common case: crashkernel.

The last patch implements CPU offlining according to the approved ACPI
spec change poposal[1]. It unlocks kexec with all CPUs visible in the target
kernel.

Please review. I would be glad for any feedback.

[1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher

Kirill A. Shutemov (13):
x86/acpi: Extract ACPI MADT wakeup code into a separate file
kernel/cpu: Add support for declaring CPU hotplug not supported
cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
x86/kvm: Do not try to disable kvmclock if it was not enabled
x86/kexec: Keep CR4.MCE set during kexec for TDX guest
x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
x86/mm: Return correct level from lookup_address() if pte is none
KVM: x86: Add config option to gate emergency virt callback support
x86/tdx: Account shared memory
x86/tdx: Convert shared memory back to private on kexec
x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

arch/x86/Kconfig | 8 +
arch/x86/coco/core.c | 1 -
arch/x86/coco/tdx/kexec.c | 0
arch/x86/coco/tdx/tdx.c | 220 +++++++++++++++++++++-
arch/x86/hyperv/ivm.c | 9 +-
arch/x86/include/asm/acpi.h | 5 +
arch/x86/include/asm/pgtable_types.h | 1 +
arch/x86/include/asm/reboot.h | 4 +-
arch/x86/include/asm/x86_init.h | 4 +-
arch/x86/kernel/acpi/Makefile | 11 +-
arch/x86/kernel/acpi/boot.c | 88 +--------
arch/x86/kernel/acpi/madt.S | 28 +++
arch/x86/kernel/acpi/madt_wakeup.c | 262 +++++++++++++++++++++++++++
arch/x86/kernel/e820.c | 9 +-
arch/x86/kernel/kvmclock.c | 9 +-
arch/x86/kernel/reboot.c | 4 +-
arch/x86/kernel/relocate_kernel_64.S | 5 +
arch/x86/kernel/x86_init.c | 4 +-
arch/x86/kvm/Kconfig | 5 +
arch/x86/mm/mem_encrypt_amd.c | 8 +-
arch/x86/mm/pat/set_memory.c | 17 +-
include/acpi/actbl2.h | 19 +-
include/linux/cc_platform.h | 10 -
include/linux/cpu.h | 2 +
kernel/cpu.c | 17 +-
25 files changed, 604 insertions(+), 146 deletions(-)
create mode 100644 arch/x86/coco/tdx/kexec.c
create mode 100644 arch/x86/kernel/acpi/madt.S
create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

--
2.41.0


2023-10-05 15:58:33

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest

TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
to #VE.

Use alternatives to keep the flag during kexec for TDX guests.

The change doesn't affect non-TDX environments.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/relocate_kernel_64.S | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..bea89814b48e 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -145,11 +145,16 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
* Set cr4 to a known state:
* - physical address extension enabled
* - 5-level paging, if it was enabled before
+ * - Machine check exception on TDX guest. Clearing MCE is not allowed
+ * in TDX guests.
*/
movl $X86_CR4_PAE, %eax
testq $X86_CR4_LA57, %r13
jz 1f
orl $X86_CR4_LA57, %eax
+1:
+ ALTERNATIVE "jmp 1f", "", X86_FEATURE_TDX_GUEST
+ orl $X86_CR4_MCE, %eax
1:
movq %rax, %cr4

--
2.41.0

2023-10-05 16:00:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

MADT mailbox version 1 brings support of CPU offlining: BIOS provides
a reset vector where the CPU has to jump to offline itself. The new TEST
mailbox command can be used to test the CPU offlined successfully and
BIOS has control over it.

Add CPU offling support for ACPI MADT wakeup method by implementing
custom cpu_die, play_dead and stop_other_cpus SMP operations.

CPU offlining makes possible to hand over secondary CPUs over kexec, not
limiting the target kernel with single CPU.

The change conforms to the approved ACPI spec change proposal. See the
Link.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
---
arch/x86/kernel/acpi/Makefile | 2 +-
arch/x86/kernel/acpi/boot.c | 2 +
arch/x86/kernel/acpi/madt.S | 28 +++++
arch/x86/kernel/acpi/madt_wakeup.c | 191 ++++++++++++++++++++++++++---
include/acpi/actbl2.h | 19 ++-
5 files changed, 223 insertions(+), 19 deletions(-)
create mode 100644 arch/x86/kernel/acpi/madt.S

diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index 8c7329c88a75..ccb8198dd8d1 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -4,7 +4,7 @@ obj-$(CONFIG_ACPI) += boot.o
obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
obj-$(CONFIG_ACPI_APEI) += apei.o
obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
-obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o madt.o

ifneq ($(CONFIG_ACPI_PROCESSOR),)
obj-y += cstate.o
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 111bd226ad99..d537dbffa697 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -22,6 +22,7 @@
#include <linux/efi-bgrt.h>
#include <linux/serial_core.h>
#include <linux/pgtable.h>
+#include <linux/sched/hotplug.h>

#include <asm/e820/api.h>
#include <asm/irqdomain.h>
@@ -33,6 +34,7 @@
#include <asm/smp.h>
#include <asm/i8259.h>
#include <asm/setup.h>
+#include <asm/init.h>

#include "sleep.h" /* To include x86_acpi_suspend_lowlevel */
static int __initdata acpi_force = 0;
diff --git a/arch/x86/kernel/acpi/madt.S b/arch/x86/kernel/acpi/madt.S
new file mode 100644
index 000000000000..5d00d315e44e
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt.S
@@ -0,0 +1,28 @@
+#include <linux/linkage.h>
+#include <asm/nospec-branch.h>
+#include <asm/page_types.h>
+#include <asm/processor-flags.h>
+
+ .text
+ .align PAGE_SIZE
+SYM_FUNC_START(asm_acpi_mp_play_dead)
+ /* Load address of reset vector into RCX to jump when kernel is ready */
+ movq acpi_mp_reset_vector_paddr(%rip), %rcx
+
+ /* zero out flags, and disable interrupts */
+ pushq $0
+ popfq
+
+ /* Turn off global entries. Following CR3 write will flush them. */
+ movq %cr4, %rdx
+ andq $~(X86_CR4_PGE), %rdx
+ movq %rdx, %cr4
+
+ /* Switch to identity mapping */
+ movq acpi_mp_pgd(%rip), %rax
+ movq %rax, %cr3
+
+ /* Jump to reset vector */
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *%rcx
+SYM_FUNC_END(asm_acpi_mp_play_dead)
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
index 4e92d1d4a5fa..2cc8590ec7a5 100644
--- a/arch/x86/kernel/acpi/madt_wakeup.c
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -1,12 +1,162 @@
#include <linux/acpi.h>
#include <linux/cpu.h>
+#include <linux/delay.h>
+#include <linux/memblock.h>
+#include <linux/pgtable.h>
+#include <linux/sched/hotplug.h>
#include <asm/apic.h>
+#include <asm/init.h>

/* Physical address of the Multiprocessor Wakeup Structure mailbox */
static u64 acpi_mp_wake_mailbox_paddr;
/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;

+unsigned long acpi_mp_pgd;
+u64 acpi_mp_reset_vector_paddr;
+
+void asm_acpi_mp_play_dead(void);
+
+static void __init *alloc_pgt_page(void *context)
+{
+ return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+/*
+ * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at
+ * the same place as in the kernel page tables. The function switches to
+ * the identity mapping and has be present at the same spot in before and
+ * after transition.
+ */
+static int __init init_transition_pgtable(pgd_t *pgd)
+{
+ pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
+ unsigned long vaddr, paddr;
+ int result = -ENOMEM;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ vaddr = (unsigned long)asm_acpi_mp_play_dead;
+ pgd += pgd_index(vaddr);
+ if (!pgd_present(*pgd)) {
+ p4d = (p4d_t *)alloc_pgt_page(NULL);
+ if (!p4d)
+ goto err;
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ }
+ p4d = p4d_offset(pgd, vaddr);
+ if (!p4d_present(*p4d)) {
+ pud = (pud_t *)alloc_pgt_page(NULL);
+ if (!pud)
+ goto err;
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+ }
+ pud = pud_offset(p4d, vaddr);
+ if (!pud_present(*pud)) {
+ pmd = (pmd_t *)alloc_pgt_page(NULL);
+ if (!pmd)
+ goto err;
+ set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+ }
+ pmd = pmd_offset(pud, vaddr);
+ if (!pmd_present(*pmd)) {
+ pte = (pte_t *)alloc_pgt_page(NULL);
+ if (!pte)
+ goto err;
+ set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
+ }
+ pte = pte_offset_kernel(pmd, vaddr);
+
+ paddr = __pa(vaddr);
+ set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
+
+ return 0;
+err:
+ return result;
+}
+
+static void acpi_mp_play_dead(void)
+{
+ idle_task_exit();
+ cpuhp_ap_report_dead();
+ asm_acpi_mp_play_dead();
+}
+
+static void acpi_mp_cpu_die(unsigned int cpu)
+{
+ int apicid = per_cpu(x86_cpu_to_apicid, cpu);
+ unsigned long timeout;
+
+ /*
+ * Use TEST mailbox command to prove that BIOS got control over
+ * the CPU before declaring it dead.
+ *
+ * BIOS has to clear 'command' field of the mailbox.
+ */
+ acpi_mp_wake_mailbox->apic_id = apicid;
+ smp_store_release(&acpi_mp_wake_mailbox->command,
+ ACPI_MP_WAKE_COMMAND_TEST);
+
+ /* Don't wait longer than a second. */
+ timeout = USEC_PER_SEC;
+ while (READ_ONCE(acpi_mp_wake_mailbox->command) && timeout--)
+ udelay(1);
+}
+
+static void acpi_mp_stop_other_cpus(int wait)
+{
+ smp_shutdown_nonboot_cpus(smp_processor_id());
+}
+
+static void acpi_mp_crash_stop_other_cpus(void)
+{
+ smp_shutdown_nonboot_cpus(smp_processor_id());
+
+ /* The kernel is broken so disable interrupts */
+ local_irq_disable();
+}
+
+static int __init acpi_mp_setup_reset(u64 reset_vector)
+{
+ pgd_t *pgd;
+ struct x86_mapping_info info = {
+ .alloc_pgt_page = alloc_pgt_page,
+ .page_flag = __PAGE_KERNEL_LARGE_EXEC,
+ .kernpg_flag = _KERNPG_TABLE_NOENC,
+ };
+
+ pgd = alloc_pgt_page(NULL);
+
+ for (int i = 0; i < nr_pfn_mapped; i++) {
+ unsigned long mstart, mend;
+ mstart = pfn_mapped[i].start << PAGE_SHIFT;
+ mend = pfn_mapped[i].end << PAGE_SHIFT;
+ if (kernel_ident_mapping_init(&info, pgd, mstart, mend))
+ return -ENOMEM;
+ }
+
+ if (kernel_ident_mapping_init(&info, pgd,
+ PAGE_ALIGN_DOWN(reset_vector),
+ PAGE_ALIGN(reset_vector + 1))) {
+ return -ENOMEM;
+ }
+
+ if (init_transition_pgtable(pgd))
+ return -ENOMEM;
+
+ smp_ops.play_dead = acpi_mp_play_dead;
+ smp_ops.cpu_die = acpi_mp_cpu_die;
+ smp_ops.stop_other_cpus = acpi_mp_stop_other_cpus;
+ smp_ops.crash_stop_other_cpus = acpi_mp_crash_stop_other_cpus;
+
+ acpi_mp_reset_vector_paddr = reset_vector;
+ acpi_mp_pgd = __pa(pgd);
+
+ return 0;
+}
+
static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
{
if (!acpi_mp_wake_mailbox_paddr) {
@@ -73,27 +223,38 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
return -ENODEV;

mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
- if (BAD_MADT_ENTRY(mp_wake, end))
+ if (!mp_wake)
+ return -EINVAL;
+
+ if (end - (unsigned long)mp_wake < ACPI_MADT_MP_WAKEUP_SIZE_V0)
+ return -EINVAL;
+ if (mp_wake->header.length < ACPI_MADT_MP_WAKEUP_SIZE_V0)
return -EINVAL;

acpi_table_print_madt_entry(&header->common);

- acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+ acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address;

- /* Disable CPU onlining/offlining */
- cpu_hotplug_not_supported();
+ if (mp_wake->version >= ACPI_MADT_MP_WAKEUP_VERSION_V1 &&
+ mp_wake->header.length >= ACPI_MADT_MP_WAKEUP_SIZE_V1) {
+ acpi_mp_setup_reset(mp_wake->reset_vector);
+ } else {
+ /* Disable CPU onlining/offlining */
+ cpu_hotplug_not_supported();

- /*
- * ACPI MADT doesn't allow to offline CPU after it got woke up.
- * It limits kexec: target kernel won't be able to use more than
- * one CPU.
- *
- * Zero out mailbox address in the ACPI MADT wakeup structure to
- * indicate that the mailbox is not usable.
- *
- * This is Linux-specific protocol and not reflected in ACPI spec.
- */
- mp_wake->base_address = 0;
+ /*
+ * Without reset vector support, ACPI MADT doesn't allow to
+ * offline CPU after it got woke up. It limits kexec: target
+ * kernel won't be able to use more than one CPU.
+ *
+ * Zero out mailbox address in the ACPI MADT wakeup structure
+ * to indicate that the mailbox is not usable.
+ *
+ * This is Linux-specific protocol and not reflected in ACPI
+ * spec.
+ */
+ mp_wake->mailbox_address = 0;
+ }

apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index 3751ae69432f..8348bf46a648 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -1109,11 +1109,23 @@ struct acpi_madt_generic_translator {

struct acpi_madt_multiproc_wakeup {
struct acpi_subtable_header header;
- u16 mailbox_version;
+ u16 version;
u32 reserved; /* reserved - must be zero */
- u64 base_address;
+ u64 mailbox_address;
+ u64 reset_vector;
};

+/* Values for Version field above */
+
+enum acpi_madt_multiproc_wakeup_version {
+ ACPI_MADT_MP_WAKEUP_VERSION_NONE = 0,
+ ACPI_MADT_MP_WAKEUP_VERSION_V1 = 1,
+ ACPI_MADT_MP_WAKEUP_VERSION_RESERVED = 2, /* 2 and greater are reserved */
+};
+
+#define ACPI_MADT_MP_WAKEUP_SIZE_V0 16
+#define ACPI_MADT_MP_WAKEUP_SIZE_V1 24
+
#define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE 2032
#define ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE 2048

@@ -1126,7 +1138,8 @@ struct acpi_madt_multiproc_wakeup_mailbox {
u8 reserved_firmware[ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE]; /* reserved for firmware use */
};

-#define ACPI_MP_WAKE_COMMAND_WAKEUP 1
+#define ACPI_MP_WAKE_COMMAND_WAKEUP 1
+#define ACPI_MP_WAKE_COMMAND_TEST 2

/* 17: CPU Core Interrupt Controller (ACPI 6.5) */

--
2.41.0

2023-10-05 16:02:01

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The target kernel has no idea what memory is converted this way. It only
sees E820_TYPE_RAM.

Accessing shared memory via private mapping is fatal. It leads to
unrecoverable TD exit.

On TD shutdown (also covers kexec), walk direct mapping and convert all
shared memory back to private. It makes all RAM private again and target
kernel may use it normally.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/coco/tdx/kexec.c | 0
arch/x86/coco/tdx/tdx.c | 137 +++++++++++++++++++++++++++++++++++++-
3 files changed, 136 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/coco/tdx/kexec.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7368d254d01f..b5acf9fb4c70 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
select X86_MEM_ENCRYPT
select X86_MCE
select UNACCEPTED_MEMORY
+ select EMERGENCY_VIRT_CALLBACK
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 56e152126f20..ac0745303983 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -6,6 +6,7 @@

#include <linux/cpufeature.h>
#include <linux/debugfs.h>
+#include <linux/delay.h>
#include <linux/export.h>
#include <linux/io.h>
#include <asm/coco.h>
@@ -14,6 +15,8 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
+#include <asm/reboot.h>
+#include <asm/set_memory.h>

/* MMIO direction */
#define EPT_READ 0
@@ -40,6 +43,9 @@

static atomic_long_t nr_shared;

+static atomic_t conversions_in_progress;
+static bool conversion_allowed = true;
+
static inline bool pte_decrypted(pte_t pte)
{
return cc_mkdec(pte_val(pte)) == pte_val(pte);
@@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)

static bool tdx_cache_flush_required(void)
{
+ /*
+ * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
+ * stopped. Otherwise it can race with unshare_all_memory() and trigger
+ * implicit conversion to shared.
+ */
+ if (!conversion_allowed)
+ return false;
+
/*
* AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
* TDX doesn't have such capability.
@@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
bool enc)
{
+ atomic_inc(&conversions_in_progress);
+
+ /*
+ * Check after bumping conversions_in_progress to serialize
+ * against tdx_shutdown().
+ */
+ if (!conversion_allowed) {
+ atomic_dec(&conversions_in_progress);
+ return -EBUSY;
+ }
+
/*
* Only handle shared->private conversion here.
* See the comment in tdx_early_init().
*/
- if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+ if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+ atomic_dec(&conversions_in_progress);
return -EIO;
+ }

return 0;
}
@@ -804,17 +831,115 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
* Only handle private->shared conversion here.
* See the comment in tdx_early_init().
*/
- if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+ if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
+ atomic_dec(&conversions_in_progress);
return -EIO;
+ }

if (enc)
atomic_long_sub(numpages, &nr_shared);
else
atomic_long_add(numpages, &nr_shared);

+ atomic_dec(&conversions_in_progress);
+
return 0;
}

+static void unshare_all_memory(bool unmap)
+{
+ unsigned long addr, end;
+ long found = 0, shared;
+
+ /*
+ * Walk direct mapping and convert all shared memory back to private,
+ */
+
+ addr = PAGE_OFFSET;
+ end = PAGE_OFFSET + get_max_mapped();
+
+ while (addr < end) {
+ unsigned long size;
+ unsigned int level;
+ pte_t *pte;
+
+ pte = lookup_address(addr, &level);
+ size = page_level_size(level);
+
+ if (pte && pte_decrypted(*pte)) {
+ int pages = size / PAGE_SIZE;
+
+ /*
+ * Touching memory with shared bit set triggers implicit
+ * conversion to shared.
+ *
+ * Make sure nobody touches the shared range from
+ * now on.
+ *
+ * Bypass unmapping for crash scenario. Unmapping
+ * requires sleepable context, but in crash case kernel
+ * hits the code path with interrupts disabled.
+ * It shouldn't be a problem as all secondary CPUs are
+ * down and kernel runs with interrupts disabled, so
+ * there is no room for race.
+ */
+ if (unmap)
+ set_memory_np(addr, pages);
+
+ if (!tdx_enc_status_changed(addr, pages, true)) {
+ pr_err("Failed to unshare range %#lx-%#lx\n",
+ addr, addr + size);
+ }
+
+ found += pages;
+ }
+
+ addr += size;
+ }
+
+ shared = atomic_long_read(&nr_shared);
+ if (shared != found) {
+ pr_err("shared page accounting is off\n");
+ pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+ }
+}
+
+static void tdx_shutdown(void)
+{
+ unsigned long timeout;
+
+ /*
+ * Stop new private<->shared conversions and wait for in-flight
+ * conversions to complete.
+ *
+ * Do not wait more than 30 seconds.
+ */
+ timeout = 30 * USEC_PER_SEC;
+ conversion_allowed = false;
+ while (atomic_read(&conversions_in_progress) && timeout--)
+ udelay(1);
+
+ if (!timeout)
+ pr_warn("Failed to finish shared<->private conversions\n");
+
+ unshare_all_memory(true);
+
+ native_machine_shutdown();
+}
+
+static void tdx_crash_shutdown(void)
+{
+ /*
+ * Crash can race with private<->shared conversion.
+ *
+ * There's no clean way out: report and proceed.
+ */
+ if (atomic_read(&conversions_in_progress))
+ pr_warn("Failed to finish shared<->private conversions\n");
+
+ unshare_all_memory(false);
+}
+
void __init tdx_early_init(void)
{
struct tdx_module_args args = {
@@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
*/
x86_cpuinit.parallel_bringup = false;

+ machine_ops.shutdown = tdx_shutdown;
+
+ /*
+ * KVM overrides machine_ops.crash_shutdown, use emergency
+ * virt callback instead.
+ */
+ cpu_emergency_register_virt_callback(tdx_crash_shutdown);
+
pr_info("Guest detected\n");
}

--
2.41.0

2023-10-05 16:02:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file

In order to prepare for the expansion of support for the ACPI MADT
wakeup method, the relevant code has been moved into a separate file.
A new configuration option has been introduced to clearly indicate
dependencies without the use of ifdefs.

There have been no functional changes.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 7 +++
arch/x86/include/asm/acpi.h | 5 ++
arch/x86/kernel/acpi/Makefile | 11 ++--
arch/x86/kernel/acpi/boot.c | 86 +-----------------------------
arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
5 files changed, 99 insertions(+), 90 deletions(-)
create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3154dbc49cf5..7368d254d01f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
select IRQ_DOMAIN_HIERARCHY

+config X86_ACPI_MADT_WAKEUP
+ def_bool y
+ depends on X86_64
+ depends on ACPI
+ depends on SMP
+ depends on X86_LOCAL_APIC
+
config X86_IO_APIC
def_bool y
depends on X86_LOCAL_APIC || X86_UP_IOAPIC
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index c8a7fc23f63c..b536b5a6a57b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)

#define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address

+union acpi_subtable_headers;
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end);
+
/*
* Check if the CPU can handle C2 and deeper
*/
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index fc17b3f136fe..8c7329c88a75 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -1,11 +1,12 @@
# SPDX-License-Identifier: GPL-2.0

-obj-$(CONFIG_ACPI) += boot.o
-obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
-obj-$(CONFIG_ACPI_APEI) += apei.o
-obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
+obj-$(CONFIG_ACPI) += boot.o
+obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_APEI) += apei.o
+obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
+obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o

ifneq ($(CONFIG_ACPI_PROCESSOR),)
-obj-y += cstate.o
+obj-y += cstate.o
endif

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 2a0ea38955df..111bd226ad99 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -66,13 +66,6 @@ static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
static bool acpi_support_online_capable;
#endif

-#ifdef CONFIG_X86_64
-/* Physical address of the Multiprocessor Wakeup Structure mailbox */
-static u64 acpi_mp_wake_mailbox_paddr;
-/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
-static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
-#endif
-
#ifdef CONFIG_X86_IO_APIC
/*
* Locks related to IOAPIC hotplug
@@ -357,60 +350,6 @@ acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long e

return 0;
}
-
-#ifdef CONFIG_X86_64
-static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
-{
- /*
- * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
- *
- * Wakeup of secondary CPUs is fully serialized in the core code.
- * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
- */
- if (!acpi_mp_wake_mailbox) {
- acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
- sizeof(*acpi_mp_wake_mailbox),
- MEMREMAP_WB);
- }
-
- /*
- * Mailbox memory is shared between the firmware and OS. Firmware will
- * listen on mailbox command address, and once it receives the wakeup
- * command, the CPU associated with the given apicid will be booted.
- *
- * The value of 'apic_id' and 'wakeup_vector' must be visible to the
- * firmware before the wakeup command is visible. smp_store_release()
- * ensures ordering and visibility.
- */
- acpi_mp_wake_mailbox->apic_id = apicid;
- acpi_mp_wake_mailbox->wakeup_vector = start_ip;
- smp_store_release(&acpi_mp_wake_mailbox->command,
- ACPI_MP_WAKE_COMMAND_WAKEUP);
-
- /*
- * Wait for the CPU to wake up.
- *
- * The CPU being woken up is essentially in a spin loop waiting to be
- * woken up. It should not take long for it wake up and acknowledge by
- * zeroing out ->command.
- *
- * ACPI specification doesn't provide any guidance on how long kernel
- * has to wait for a wake up acknowledgement. It also doesn't provide
- * a way to cancel a wake up request if it takes too long.
- *
- * In TDX environment, the VMM has control over how long it takes to
- * wake up secondary. It can postpone scheduling secondary vCPU
- * indefinitely. Giving up on wake up request and reporting error opens
- * possible attack vector for VMM: it can wake up a secondary CPU when
- * kernel doesn't expect it. Wait until positive result of the wake up
- * request.
- */
- while (READ_ONCE(acpi_mp_wake_mailbox->command))
- cpu_relax();
-
- return 0;
-}
-#endif /* CONFIG_X86_64 */
#endif /* CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_IO_APIC
@@ -1160,29 +1099,6 @@ static int __init acpi_parse_madt_lapic_entries(void)
}
return 0;
}
-
-#ifdef CONFIG_X86_64
-static int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
- const unsigned long end)
-{
- struct acpi_madt_multiproc_wakeup *mp_wake;
-
- if (!IS_ENABLED(CONFIG_SMP))
- return -ENODEV;
-
- mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
- if (BAD_MADT_ENTRY(mp_wake, end))
- return -EINVAL;
-
- acpi_table_print_madt_entry(&header->common);
-
- acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
-
- apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
-
- return 0;
-}
-#endif /* CONFIG_X86_64 */
#endif /* CONFIG_X86_LOCAL_APIC */

#ifdef CONFIG_X86_IO_APIC
@@ -1379,7 +1295,7 @@ static void __init acpi_process_madt(void)
smp_found_config = 1;
}

-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_ACPI_MADT_WAKEUP
/*
* Parse MADT MP Wake entry.
*/
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
new file mode 100644
index 000000000000..1b9747bfd5b9
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -0,0 +1,80 @@
+#include <linux/acpi.h>
+#include <asm/apic.h>
+
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr;
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+
+static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
+{
+ /*
+ * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
+ *
+ * Wakeup of secondary CPUs is fully serialized in the core code.
+ * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
+ */
+ if (!acpi_mp_wake_mailbox) {
+ acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+ sizeof(*acpi_mp_wake_mailbox),
+ MEMREMAP_WB);
+ }
+
+ /*
+ * Mailbox memory is shared between the firmware and OS. Firmware will
+ * listen on mailbox command address, and once it receives the wakeup
+ * command, the CPU associated with the given apicid will be booted.
+ *
+ * The value of 'apic_id' and 'wakeup_vector' must be visible to the
+ * firmware before the wakeup command is visible. smp_store_release()
+ * ensures ordering and visibility.
+ */
+ acpi_mp_wake_mailbox->apic_id = apicid;
+ acpi_mp_wake_mailbox->wakeup_vector = start_ip;
+ smp_store_release(&acpi_mp_wake_mailbox->command,
+ ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+ /*
+ * Wait for the CPU to wake up.
+ *
+ * The CPU being woken up is essentially in a spin loop waiting to be
+ * woken up. It should not take long for it wake up and acknowledge by
+ * zeroing out ->command.
+ *
+ * ACPI specification doesn't provide any guidance on how long kernel
+ * has to wait for a wake up acknowledgement. It also doesn't provide
+ * a way to cancel a wake up request if it takes too long.
+ *
+ * In TDX environment, the VMM has control over how long it takes to
+ * wake up secondary. It can postpone scheduling secondary vCPU
+ * indefinitely. Giving up on wake up request and reporting error opens
+ * possible attack vector for VMM: it can wake up a secondary CPU when
+ * kernel doesn't expect it. Wait until positive result of the wake up
+ * request.
+ */
+ while (READ_ONCE(acpi_mp_wake_mailbox->command))
+ cpu_relax();
+
+ return 0;
+}
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_madt_multiproc_wakeup *mp_wake;
+
+ if (!IS_ENABLED(CONFIG_SMP))
+ return -ENODEV;
+
+ mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+ if (BAD_MADT_ENTRY(mp_wake, end))
+ return -EINVAL;
+
+ acpi_table_print_madt_entry(&header->common);
+
+ acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
+
+ apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
+
+ return 0;
+}
--
2.41.0

2023-10-05 16:04:25

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 06/13] x86/mm: Make x86_platform.guest.enc_status_change_*() return errno

TDX is going to have more than one reason to fail
enc_status_change_prepare().

Change the callback to return errno instead of assuming -EIO;
enc_status_change_finish() changed too to keep the interface symmetric.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 20 +++++++++++---------
arch/x86/hyperv/ivm.c | 9 +++------
arch/x86/include/asm/x86_init.h | 4 ++--
arch/x86/kernel/x86_init.c | 4 ++--
arch/x86/mm/mem_encrypt_amd.c | 8 ++++----
arch/x86/mm/pat/set_memory.c | 9 +++++----
6 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 3e6dbd2199cf..46022283d955 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -776,28 +776,30 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
return true;
}

-static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
- bool enc)
+static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
+ bool enc)
{
/*
* Only handle shared->private conversion here.
* See the comment in tdx_early_init().
*/
- if (enc)
- return tdx_enc_status_changed(vaddr, numpages, enc);
- return true;
+ if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+ return -EIO;
+
+ return 0;
}

-static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
+static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
bool enc)
{
/*
* Only handle private->shared conversion here.
* See the comment in tdx_early_init().
*/
- if (!enc)
- return tdx_enc_status_changed(vaddr, numpages, enc);
- return true;
+ if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
+ return -EIO;
+
+ return 0;
}

void __init tdx_early_init(void)
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index c1088d3661d5..7d2241059d49 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -510,13 +510,12 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
* with host. This function works as wrap of hv_mark_gpa_visibility()
* with memory base and size.
*/
-static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
+static int hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bool enc)
{
enum hv_mem_host_visibility visibility = enc ?
VMBUS_PAGE_NOT_VISIBLE : VMBUS_PAGE_VISIBLE_READ_WRITE;
u64 *pfn_array;
int ret = 0;
- bool result = true;
int i, pfn;

pfn_array = kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL);
@@ -530,17 +529,15 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
ret = hv_mark_gpa_visibility(pfn, pfn_array,
visibility);
- if (ret) {
- result = false;
+ if (ret)
goto err_free_pfn_array;
- }
pfn = 0;
}
}

err_free_pfn_array:
kfree(pfn_array);
- return result;
+ return ret;
}

static bool hv_vtom_tlb_flush_required(bool private)
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5240d88db52a..5031cbc6e211 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -150,8 +150,8 @@ struct x86_init_acpi {
* @enc_cache_flush_required Returns true if a cache flush is needed before changing page encryption status
*/
struct x86_guest {
- bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
- bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
+ int (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
+ int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
bool (*enc_tlb_flush_required)(bool enc);
bool (*enc_cache_flush_required)(void);
};
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a37ebd3b4773..f0f54e109eb9 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -131,8 +131,8 @@ struct x86_cpuinit_ops x86_cpuinit = {

static void default_nmi_init(void) { };

-static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
-static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
+static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
+static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
static bool enc_tlb_flush_required_noop(bool enc) { return false; }
static bool enc_cache_flush_required_noop(void) { return false; }
static bool is_private_mmio_noop(u64 addr) {return false; }
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 6faea41e99b6..9cbdfbf8cd45 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -318,7 +318,7 @@ static void enc_dec_hypercall(unsigned long vaddr, unsigned long size, bool enc)
#endif
}

-static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
+static int amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
{
/*
* To maintain the security guarantees of SEV-SNP guests, make sure
@@ -327,11 +327,11 @@ static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
snp_set_memory_shared(vaddr, npages);

- return true;
+ return 0;
}

/* Return true unconditionally: return value doesn't matter for the SEV side */
-static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
+static int amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
{
/*
* After memory is mapped encrypted in the page table, validate it
@@ -343,7 +343,7 @@ static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool e
if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
enc_dec_hypercall(vaddr, npages << PAGE_SHIFT, enc);

- return true;
+ return 0;
}

static void __init __set_clr_pte_enc(pte_t *kpte, int level, bool enc)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f129835e..6fbf22d5fa56 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2152,8 +2152,9 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());

/* Notify hypervisor that we are about to set/clr encryption attribute. */
- if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
- return -EIO;
+ ret = x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
+ if (ret)
+ return ret;

ret = __change_page_attr_set_clr(&cpa, 1);

@@ -2168,8 +2169,8 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)

/* Notify hypervisor that we have successfully set/clr encryption attribute. */
if (!ret) {
- if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
- ret = -EIO;
+ ret = x86_platform.guest.enc_status_change_finish(addr,
+ numpages, enc);
}

return ret;
--
2.41.0

2023-10-05 18:41:54

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

Hello Kirill,

On 10/5/2023 8:13 AM, Kirill A. Shutemov wrote:
> TDX guests allocate shared buffers to perform I/O. It is done by
> allocating pages normally from the buddy allocator and converting them
> to shared with set_memory_decrypted().
>
> The target kernel has no idea what memory is converted this way. It only
> sees E820_TYPE_RAM.
>
> Accessing shared memory via private mapping is fatal. It leads to
> unrecoverable TD exit.
>
> On TD shutdown (also covers kexec), walk direct mapping and convert all
> shared memory back to private. It makes all RAM private again and target
> kernel may use it normally.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/coco/tdx/kexec.c | 0
> arch/x86/coco/tdx/tdx.c | 137 +++++++++++++++++++++++++++++++++++++-
> 3 files changed, 136 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/coco/tdx/kexec.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
> select X86_MEM_ENCRYPT
> select X86_MCE
> select UNACCEPTED_MEMORY
> + select EMERGENCY_VIRT_CALLBACK
> help
> Support running as a guest under Intel TDX. Without this support,
> the guest kernel can not boot or run under TDX.
> diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
> new file mode 100644
> index 000000000000..e69de29bb2d1
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 56e152126f20..ac0745303983 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -6,6 +6,7 @@
>
> #include <linux/cpufeature.h>
> #include <linux/debugfs.h>
> +#include <linux/delay.h>
> #include <linux/export.h>
> #include <linux/io.h>
> #include <asm/coco.h>
> @@ -14,6 +15,8 @@
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <asm/pgtable.h>
> +#include <asm/reboot.h>
> +#include <asm/set_memory.h>
>
> /* MMIO direction */
> #define EPT_READ 0
> @@ -40,6 +43,9 @@
>
> static atomic_long_t nr_shared;
>
> +static atomic_t conversions_in_progress;
> +static bool conversion_allowed = true;
> +
> static inline bool pte_decrypted(pte_t pte)
> {
> return cc_mkdec(pte_val(pte)) == pte_val(pte);
> @@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)
>
> static bool tdx_cache_flush_required(void)
> {
> + /*
> + * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
> + * stopped. Otherwise it can race with unshare_all_memory() and trigger
> + * implicit conversion to shared.
> + */
> + if (!conversion_allowed)
> + return false;
> +
> /*
> * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
> * TDX doesn't have such capability.
> @@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
> bool enc)
> {
> + atomic_inc(&conversions_in_progress);
> +
> + /*
> + * Check after bumping conversions_in_progress to serialize
> + * against tdx_shutdown().
> + */
> + if (!conversion_allowed) {
> + atomic_dec(&conversions_in_progress);
> + return -EBUSY;
> + }
> +
> /*
> * Only handle shared->private conversion here.
> * See the comment in tdx_early_init().
> */
> - if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> + if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> + atomic_dec(&conversions_in_progress);
> return -EIO;
> + }
>
> return 0;
> }
> @@ -804,17 +831,115 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
> * Only handle private->shared conversion here.
> * See the comment in tdx_early_init().
> */
> - if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> + if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> + atomic_dec(&conversions_in_progress);
> return -EIO;
> + }
>
> if (enc)
> atomic_long_sub(numpages, &nr_shared);
> else
> atomic_long_add(numpages, &nr_shared);
>
> + atomic_dec(&conversions_in_progress);
> +
> return 0;
> }
>
> +static void unshare_all_memory(bool unmap)
> +{
> + unsigned long addr, end;
> + long found = 0, shared;
> +
> + /*
> + * Walk direct mapping and convert all shared memory back to private,
> + */
> +
> + addr = PAGE_OFFSET;
> + end = PAGE_OFFSET + get_max_mapped();
> +
> + while (addr < end) {
> + unsigned long size;
> + unsigned int level;
> + pte_t *pte;
> +
> + pte = lookup_address(addr, &level);

IIRC, you were earlier walking the direct mapping using
walk_page_range_novma(), any particular reason to use lookup_address()
instead ?

> + size = page_level_size(level);
> +
> + if (pte && pte_decrypted(*pte)) {

Additionally need to add check for pte_none() here to handle physical
memory holes in direct mapping.

> + int pages = size / PAGE_SIZE;
> +
> + /*
> + * Touching memory with shared bit set triggers implicit
> + * conversion to shared.
> + *
> + * Make sure nobody touches the shared range from
> + * now on.
> + *
> + * Bypass unmapping for crash scenario. Unmapping
> + * requires sleepable context, but in crash case kernel
> + * hits the code path with interrupts disabled.

In case of SNP we will need to temporarily enable interrupts during this
unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON()
in cpa_flush() if interrupts are disabled.

Thanks,
Ashish

> + * It shouldn't be a problem as all secondary CPUs are
> + * down and kernel runs with interrupts disabled, so
> + * there is no room for race.
> + */
> + if (unmap)
> + set_memory_np(addr, pages);
> +
> + if (!tdx_enc_status_changed(addr, pages, true)) {
> + pr_err("Failed to unshare range %#lx-%#lx\n",
> + addr, addr + size);
> + }
> +
> + found += pages;
> + }
> +
> + addr += size;
> + }
> +
> + shared = atomic_long_read(&nr_shared);
> + if (shared != found) {
> + pr_err("shared page accounting is off\n");
> + pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> + }
> +}
> +
> +static void tdx_shutdown(void)
> +{
> + unsigned long timeout;
> +
> + /*
> + * Stop new private<->shared conversions and wait for in-flight
> + * conversions to complete.
> + *
> + * Do not wait more than 30 seconds.
> + */
> + timeout = 30 * USEC_PER_SEC;
> + conversion_allowed = false;
> + while (atomic_read(&conversions_in_progress) && timeout--)
> + udelay(1);
> +
> + if (!timeout)
> + pr_warn("Failed to finish shared<->private conversions\n");
> +
> + unshare_all_memory(true);
> +
> + native_machine_shutdown();
> +}
> +
> +static void tdx_crash_shutdown(void)
> +{
> + /*
> + * Crash can race with private<->shared conversion.
> + *
> + * There's no clean way out: report and proceed.
> + */
> + if (atomic_read(&conversions_in_progress))
> + pr_warn("Failed to finish shared<->private conversions\n");
> +
> + unshare_all_memory(false);
> +}
> +
> void __init tdx_early_init(void)
> {
> struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
> */
> x86_cpuinit.parallel_bringup = false;
>
> + machine_ops.shutdown = tdx_shutdown;
> +
> + /*
> + * KVM overrides machine_ops.crash_shutdown, use emergency
> + * virt callback instead.
> + */
> + cpu_emergency_register_virt_callback(tdx_crash_shutdown);
> +
> pr_info("Guest detected\n");
> }
>
>

2023-10-05 21:28:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > +static void unshare_all_memory(bool unmap)
> > +{
> > + unsigned long addr, end;
> > + long found = 0, shared;
> > +
> > + /*
> > + * Walk direct mapping and convert all shared memory back to private,
> > + */
> > +
> > + addr = PAGE_OFFSET;
> > + end = PAGE_OFFSET + get_max_mapped();
> > +
> > + while (addr < end) {
> > + unsigned long size;
> > + unsigned int level;
> > + pte_t *pte;
> > +
> > + pte = lookup_address(addr, &level);
>
> IIRC, you were earlier walking the direct mapping using
> walk_page_range_novma(), any particular reason to use lookup_address()
> instead ?

walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
we run here from atomic context in case of crash.

I considered using trylock to bypass the limitation, but it is a hack.

>
> > + size = page_level_size(level);
> > +
> > + if (pte && pte_decrypted(*pte)) {
>
> Additionally need to add check for pte_none() here to handle physical memory
> holes in direct mapping.

lookup_address() returns NULL for none entries.

> > + int pages = size / PAGE_SIZE;
> > +
> > + /*
> > + * Touching memory with shared bit set triggers implicit
> > + * conversion to shared.
> > + *
> > + * Make sure nobody touches the shared range from
> > + * now on.
> > + *
> > + * Bypass unmapping for crash scenario. Unmapping
> > + * requires sleepable context, but in crash case kernel
> > + * hits the code path with interrupts disabled.
>
> In case of SNP we will need to temporarily enable interrupts during this
> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
> cpa_flush() if interrupts are disabled.

Do you really need full set_memory_encrypted()? Can't you do something
ligher?


--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-05 22:01:50

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
>>> +static void unshare_all_memory(bool unmap)
>>> +{
>>> + unsigned long addr, end;
>>> + long found = 0, shared;
>>> +
>>> + /*
>>> + * Walk direct mapping and convert all shared memory back to private,
>>> + */
>>> +
>>> + addr = PAGE_OFFSET;
>>> + end = PAGE_OFFSET + get_max_mapped();
>>> +
>>> + while (addr < end) {
>>> + unsigned long size;
>>> + unsigned int level;
>>> + pte_t *pte;
>>> +
>>> + pte = lookup_address(addr, &level);
>>
>> IIRC, you were earlier walking the direct mapping using
>> walk_page_range_novma(), any particular reason to use lookup_address()
>> instead ?
>
> walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> we run here from atomic context in case of crash.
>
> I considered using trylock to bypass the limitation, but it is a hack.
>
>>
>>> + size = page_level_size(level);
>>> +
>>> + if (pte && pte_decrypted(*pte)) {
>>
>> Additionally need to add check for pte_none() here to handle physical memory
>> holes in direct mapping.
>
> lookup_address() returns NULL for none entries.
>

Looking at lookup_address_in_pgd(), at pte level it is simply returning
pte_offset_kernel() and there does not seem to be a check for returning
NULL if pte_none() ?

>>> + int pages = size / PAGE_SIZE;
>>> +
>>> + /*
>>> + * Touching memory with shared bit set triggers implicit
>>> + * conversion to shared.
>>> + *
>>> + * Make sure nobody touches the shared range from
>>> + * now on.
>>> + *
>>> + * Bypass unmapping for crash scenario. Unmapping
>>> + * requires sleepable context, but in crash case kernel
>>> + * hits the code path with interrupts disabled.
>>
>> In case of SNP we will need to temporarily enable interrupts during this
>> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
>> cpa_flush() if interrupts are disabled.
>
> Do you really need full set_memory_encrypted()? Can't you do something
> ligher?
>
We need to modify the PTE for setting c-bit to 1 so that will require
cpa_flush(), though probably can add something lighter to do
clflush_cache_range() directly ?

Thanks,
Ashish

2023-10-05 22:29:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > +static void unshare_all_memory(bool unmap)
> > > > +{
> > > > + unsigned long addr, end;
> > > > + long found = 0, shared;
> > > > +
> > > > + /*
> > > > + * Walk direct mapping and convert all shared memory back to private,
> > > > + */
> > > > +
> > > > + addr = PAGE_OFFSET;
> > > > + end = PAGE_OFFSET + get_max_mapped();
> > > > +
> > > > + while (addr < end) {
> > > > + unsigned long size;
> > > > + unsigned int level;
> > > > + pte_t *pte;
> > > > +
> > > > + pte = lookup_address(addr, &level);
> > >
> > > IIRC, you were earlier walking the direct mapping using
> > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > instead ?
> >
> > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > we run here from atomic context in case of crash.
> >
> > I considered using trylock to bypass the limitation, but it is a hack.
> >
> > >
> > > > + size = page_level_size(level);
> > > > +
> > > > + if (pte && pte_decrypted(*pte)) {
> > >
> > > Additionally need to add check for pte_none() here to handle physical memory
> > > holes in direct mapping.
> >
> > lookup_address() returns NULL for none entries.
> >
>
> Looking at lookup_address_in_pgd(), at pte level it is simply returning
> pte_offset_kernel() and there does not seem to be a check for returning NULL
> if pte_none() ?

Hm. You are right.

I think it yet another quirk in how lookup_address() implemented. We need
to make it straight too.

There's two options: either make lookup_address() return pointer for entry
even if it is NULL, or add check for pte_none() after pte_offset_kernel()
and return NULL if it is true.

I like the first option more as it allows caller to populate the entry if
it wants.

> > > > + int pages = size / PAGE_SIZE;
> > > > +
> > > > + /*
> > > > + * Touching memory with shared bit set triggers implicit
> > > > + * conversion to shared.
> > > > + *
> > > > + * Make sure nobody touches the shared range from
> > > > + * now on.
> > > > + *
> > > > + * Bypass unmapping for crash scenario. Unmapping
> > > > + * requires sleepable context, but in crash case kernel
> > > > + * hits the code path with interrupts disabled.
> > >
> > > In case of SNP we will need to temporarily enable interrupts during this
> > > unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
> > > cpa_flush() if interrupts are disabled.
> >
> > Do you really need full set_memory_encrypted()? Can't you do something
> > ligher?
> >
> We need to modify the PTE for setting c-bit to 1 so that will require
> cpa_flush(), though probably can add something lighter to do
> clflush_cache_range() directly ?

For TDX, I don't touch shared bit as nobody suppose to touch the memory
after the point (ans set_memory_np() enforces it for !crash case).

Can't SNP do the same?

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-06 10:23:16

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file

On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, the relevant code has been moved into a separate file.
> A new configuration option has been introduced to clearly indicate
> dependencies without the use of ifdefs.

Use imperative mood?  

- Move the relevant code into ...
- Introduce a new configuration option to ...

>
> There have been no functional changes.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/Kconfig | 7 +++
> arch/x86/include/asm/acpi.h | 5 ++
> arch/x86/kernel/acpi/Makefile | 11 ++--
> arch/x86/kernel/acpi/boot.c | 86 +-----------------------------
> arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
> 5 files changed, 99 insertions(+), 90 deletions(-)
> create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3154dbc49cf5..7368d254d01f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
> depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
> select IRQ_DOMAIN_HIERARCHY
>
> +config X86_ACPI_MADT_WAKEUP
> + def_bool y
> + depends on X86_64
> + depends on ACPI
> + depends on SMP
> + depends on X86_LOCAL_APIC
> +
> config X86_IO_APIC
> def_bool y
> depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index c8a7fc23f63c..b536b5a6a57b 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
>
> #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
>
> +union acpi_subtable_headers;
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> + const unsigned long end);
> +
> /*
> * Check if the CPU can handle C2 and deeper
> */
> diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
> index fc17b3f136fe..8c7329c88a75 100644
> --- a/arch/x86/kernel/acpi/Makefile
> +++ b/arch/x86/kernel/acpi/Makefile
> @@ -1,11 +1,12 @@
> # SPDX-License-Identifier: GPL-2.0
>
> -obj-$(CONFIG_ACPI) += boot.o
> -obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
> -obj-$(CONFIG_ACPI_APEI) += apei.o
> -obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
> +obj-$(CONFIG_ACPI) += boot.o
> +obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
> +obj-$(CONFIG_ACPI_APEI) += apei.o
> +obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
> +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o
>
> ifneq ($(CONFIG_ACPI_PROCESSOR),)
> -obj-y += cstate.o
> +obj-y += cstate.o
> endif

unintended code change?

[...]

>
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> new file mode 100644
> index 000000000000..1b9747bfd5b9
> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -0,0 +1,80 @@
> +#include <linux/acpi.h>
> +#include <asm/apic.h>

Functions like memremap(), smp_store_release() and cpu_relax() are used in this
file. Should we explicitly include the relevant headers?

> +
> +/* Physical address of the Multiprocessor Wakeup Structure mailbox */
> +static u64 acpi_mp_wake_mailbox_paddr;
> +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> +static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> +
> +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> +{
> + /*
> + * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> + *
> + * Wakeup of secondary CPUs is fully serialized in the core code.
> + * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
> + */
> + if (!acpi_mp_wake_mailbox) {
> + acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
> + sizeof(*acpi_mp_wake_mailbox),
> + MEMREMAP_WB);
> + }
> +
> + /*
> + * Mailbox memory is shared between the firmware and OS. Firmware will
> + * listen on mailbox command address, and once it receives the wakeup
> + * command, the CPU associated with the given apicid will be booted.
> + *
> + * The value of 'apic_id' and 'wakeup_vector' must be visible to the
> + * firmware before the wakeup command is visible. smp_store_release()
> + * ensures ordering and visibility.
> + */
> + acpi_mp_wake_mailbox->apic_id = apicid;
> + acpi_mp_wake_mailbox->wakeup_vector = start_ip;
> + smp_store_release(&acpi_mp_wake_mailbox->command,
> + ACPI_MP_WAKE_COMMAND_WAKEUP);
> +
> + /*
> + * Wait for the CPU to wake up.
> + *
> + * The CPU being woken up is essentially in a spin loop waiting to be
> + * woken up. It should not take long for it wake up and acknowledge by
> + * zeroing out ->command.
> + *
> + * ACPI specification doesn't provide any guidance on how long kernel
> + * has to wait for a wake up acknowledgement. It also doesn't provide
> + * a way to cancel a wake up request if it takes too long.
> + *
> + * In TDX environment, the VMM has control over how long it takes to
> + * wake up secondary. It can postpone scheduling secondary vCPU
> + * indefinitely. Giving up on wake up request and reporting error opens
> + * possible attack vector for VMM: it can wake up a secondary CPU when
> + * kernel doesn't expect it. Wait until positive result of the wake up
> + * request.
> + */
> + while (READ_ONCE(acpi_mp_wake_mailbox->command))
> + cpu_relax();
> +
> + return 0;
> +}
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> + const unsigned long end)
> +{
> + struct acpi_madt_multiproc_wakeup *mp_wake;
> +
> + if (!IS_ENABLED(CONFIG_SMP))
> + return -ENODEV;

Now you have made X86_ACPI_MADT_WAKEUP depend on SMP, and this file will only be
compiled when X86_ACPI_MADT_WAKEUP is turned on. IIUC this essentially means
IS_ENABLED(CONFIG_SMP) will always report true?

> +
> + mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
> + if (BAD_MADT_ENTRY(mp_wake, end))
> + return -EINVAL;
> +
> + acpi_table_print_madt_entry(&header->common);
> +
> + acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> +
> + apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
> +
> + return 0;
> +}

2023-10-06 12:01:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file

On Fri, Oct 06, 2023 at 10:22:54AM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> > In order to prepare for the expansion of support for the ACPI MADT
> > wakeup method, the relevant code has been moved into a separate file.
> > A new configuration option has been introduced to clearly indicate
> > dependencies without the use of ifdefs.
>
> Use imperative mood? ?
>
> - Move the relevant code into ...
> - Introduce a new configuration option to ...

Okay.

> > There have been no functional changes.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/Kconfig | 7 +++
> > arch/x86/include/asm/acpi.h | 5 ++
> > arch/x86/kernel/acpi/Makefile | 11 ++--
> > arch/x86/kernel/acpi/boot.c | 86 +-----------------------------
> > arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
> > 5 files changed, 99 insertions(+), 90 deletions(-)
> > create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 3154dbc49cf5..7368d254d01f 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
> > depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
> > select IRQ_DOMAIN_HIERARCHY
> >
> > +config X86_ACPI_MADT_WAKEUP
> > + def_bool y
> > + depends on X86_64
> > + depends on ACPI
> > + depends on SMP
> > + depends on X86_LOCAL_APIC
> > +
> > config X86_IO_APIC
> > def_bool y
> > depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> > diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> > index c8a7fc23f63c..b536b5a6a57b 100644
> > --- a/arch/x86/include/asm/acpi.h
> > +++ b/arch/x86/include/asm/acpi.h
> > @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
> >
> > #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
> >
> > +union acpi_subtable_headers;
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > + const unsigned long end);
> > +
> > /*
> > * Check if the CPU can handle C2 and deeper
> > */
> > diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
> > index fc17b3f136fe..8c7329c88a75 100644
> > --- a/arch/x86/kernel/acpi/Makefile
> > +++ b/arch/x86/kernel/acpi/Makefile
> > @@ -1,11 +1,12 @@
> > # SPDX-License-Identifier: GPL-2.0
> >
> > -obj-$(CONFIG_ACPI) += boot.o
> > -obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
> > -obj-$(CONFIG_ACPI_APEI) += apei.o
> > -obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
> > +obj-$(CONFIG_ACPI) += boot.o
> > +obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
> > +obj-$(CONFIG_ACPI_APEI) += apei.o
> > +obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
> > +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o
> >
> > ifneq ($(CONFIG_ACPI_PROCESSOR),)
> > -obj-y += cstate.o
> > +obj-y += cstate.o
> > endif
>
> unintended code change?

No. It keeps += aligned as they are before the patch.
>
> [...]
>
> >
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> > new file mode 100644
> > index 000000000000..1b9747bfd5b9
> > --- /dev/null
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -0,0 +1,80 @@
> > +#include <linux/acpi.h>
> > +#include <asm/apic.h>
>
> Functions like memremap(), smp_store_release() and cpu_relax() are used in this
> file. Should we explicitly include the relevant headers?

Okay, will do.

> > +
> > +/* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > +static u64 acpi_mp_wake_mailbox_paddr;
> > +/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> > +static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> > +
> > +static int acpi_wakeup_cpu(int apicid, unsigned long start_ip)
> > +{
> > + /*
> > + * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
> > + *
> > + * Wakeup of secondary CPUs is fully serialized in the core code.
> > + * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
> > + */
> > + if (!acpi_mp_wake_mailbox) {
> > + acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
> > + sizeof(*acpi_mp_wake_mailbox),
> > + MEMREMAP_WB);
> > + }
> > +
> > + /*
> > + * Mailbox memory is shared between the firmware and OS. Firmware will
> > + * listen on mailbox command address, and once it receives the wakeup
> > + * command, the CPU associated with the given apicid will be booted.
> > + *
> > + * The value of 'apic_id' and 'wakeup_vector' must be visible to the
> > + * firmware before the wakeup command is visible. smp_store_release()
> > + * ensures ordering and visibility.
> > + */
> > + acpi_mp_wake_mailbox->apic_id = apicid;
> > + acpi_mp_wake_mailbox->wakeup_vector = start_ip;
> > + smp_store_release(&acpi_mp_wake_mailbox->command,
> > + ACPI_MP_WAKE_COMMAND_WAKEUP);
> > +
> > + /*
> > + * Wait for the CPU to wake up.
> > + *
> > + * The CPU being woken up is essentially in a spin loop waiting to be
> > + * woken up. It should not take long for it wake up and acknowledge by
> > + * zeroing out ->command.
> > + *
> > + * ACPI specification doesn't provide any guidance on how long kernel
> > + * has to wait for a wake up acknowledgement. It also doesn't provide
> > + * a way to cancel a wake up request if it takes too long.
> > + *
> > + * In TDX environment, the VMM has control over how long it takes to
> > + * wake up secondary. It can postpone scheduling secondary vCPU
> > + * indefinitely. Giving up on wake up request and reporting error opens
> > + * possible attack vector for VMM: it can wake up a secondary CPU when
> > + * kernel doesn't expect it. Wait until positive result of the wake up
> > + * request.
> > + */
> > + while (READ_ONCE(acpi_mp_wake_mailbox->command))
> > + cpu_relax();
> > +
> > + return 0;
> > +}
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > + const unsigned long end)
> > +{
> > + struct acpi_madt_multiproc_wakeup *mp_wake;
> > +
> > + if (!IS_ENABLED(CONFIG_SMP))
> > + return -ENODEV;
>
> Now you have made X86_ACPI_MADT_WAKEUP depend on SMP, and this file will only be
> compiled when X86_ACPI_MADT_WAKEUP is turned on. IIUC this essentially means
> IS_ENABLED(CONFIG_SMP) will always report true?

Good catch. I didn't have 'depends SMP' initially, but it triggered build
issue, so I added it, but forgot to drop IS_ENABLED().

> > +
> > + mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
> > + if (BAD_MADT_ENTRY(mp_wake, end))
> > + return -EINVAL;
> > +
> > + acpi_table_print_madt_entry(&header->common);
> > +
> > + acpi_mp_wake_mailbox_paddr = mp_wake->base_address;
> > +
> > + apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
> > +
> > + return 0;
> > +}
>

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-06 14:58:17

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
> select X86_MEM_ENCRYPT
> select X86_MCE
> select UNACCEPTED_MEMORY
> + select EMERGENCY_VIRT_CALLBACK
> help
> Support running as a guest under Intel TDX. Without this support,
> the guest kernel can not boot or run under TDX.

...

> void __init tdx_early_init(void)
> {
> struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
> */
> x86_cpuinit.parallel_bringup = false;
>
> + machine_ops.shutdown = tdx_shutdown;
> +
> + /*
> + * KVM overrides machine_ops.crash_shutdown, use emergency

This is going to be super confusing. KVM utilizes the emergency virt callback.
The KVM paravirt guest code uses .crash_shutdown(). People that are passingly
familiar with virt and know what KVM is, but don't already know the difference
between the two are going to be all kinds of confused.

I also feel like you're playing with fire, e.g. what's to stop the hypervisor
specific paravirt guest support from using .shutdown() in the future?

And the callback is invoked for far more than just kexec(). I don't see how the
host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
stuff register state to emulate INIT or RESET. Unless I'm missing something,
converting shared memory back to private for a shutdown or reboot is undesirable
as adds one more thing that can go wrong and prevent the system from cleanly
shutting down ASAP (for some definitions of "cleanly").

Lastly, doesn't SEV need similar behavior? This seems like core functionality
for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT). Why not make the
"unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?

2023-10-06 15:12:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On Fri, Oct 06, 2023 at 07:58:03AM -0700, Sean Christopherson wrote:
> On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 7368d254d01f..b5acf9fb4c70 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
> > select X86_MEM_ENCRYPT
> > select X86_MCE
> > select UNACCEPTED_MEMORY
> > + select EMERGENCY_VIRT_CALLBACK
> > help
> > Support running as a guest under Intel TDX. Without this support,
> > the guest kernel can not boot or run under TDX.
>
> ...
>
> > void __init tdx_early_init(void)
> > {
> > struct tdx_module_args args = {
> > @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
> > */
> > x86_cpuinit.parallel_bringup = false;
> >
> > + machine_ops.shutdown = tdx_shutdown;
> > +
> > + /*
> > + * KVM overrides machine_ops.crash_shutdown, use emergency
>
> This is going to be super confusing. KVM utilizes the emergency virt callback.
> The KVM paravirt guest code uses .crash_shutdown(). People that are passingly
> familiar with virt and know what KVM is, but don't already know the difference
> between the two are going to be all kinds of confused.
>
> I also feel like you're playing with fire, e.g. what's to stop the hypervisor
> specific paravirt guest support from using .shutdown() in the future?
>
> And the callback is invoked for far more than just kexec(). I don't see how the
> host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
> stuff register state to emulate INIT or RESET. Unless I'm missing something,
> converting shared memory back to private for a shutdown or reboot is undesirable
> as adds one more thing that can go wrong and prevent the system from cleanly
> shutting down ASAP (for some definitions of "cleanly").

Okay, fair enough. I will look for better way to hookup into kexec
process. That was the best fit I found so far, but yes it is not ideal.

> Lastly, doesn't SEV need similar behavior? This seems like core functionality
> for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT). Why not make the
> "unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?

I don't know SEV specifics. I am open to collaboration on this.

Tom, Ashish, let me know if you need this in generic code. I can arrange
that.

--
Kiryl Shutsemau / Kirill A. Shutemov

Subject: Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file

Hi Kirill,

On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> In order to prepare for the expansion of support for the ACPI MADT
> wakeup method, the relevant code has been moved into a separate file.
> A new configuration option has been introduced to clearly indicate
> dependencies without the use of ifdefs.
>
> There have been no functional changes.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/Kconfig | 7 +++
> arch/x86/include/asm/acpi.h | 5 ++
> arch/x86/kernel/acpi/Makefile | 11 ++--
> arch/x86/kernel/acpi/boot.c | 86 +-----------------------------
> arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
> 5 files changed, 99 insertions(+), 90 deletions(-)
> create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 3154dbc49cf5..7368d254d01f 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
> depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
> select IRQ_DOMAIN_HIERARCHY
>
> +config X86_ACPI_MADT_WAKEUP
> + def_bool y
> + depends on X86_64
> + depends on ACPI
> + depends on SMP
> + depends on X86_LOCAL_APIC
> +
> config X86_IO_APIC
> def_bool y
> depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> index c8a7fc23f63c..b536b5a6a57b 100644
> --- a/arch/x86/include/asm/acpi.h
> +++ b/arch/x86/include/asm/acpi.h
> @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
>
> #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
>
> +union acpi_subtable_headers;
> +
> +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> + const unsigned long end);
> +

IMO, you don't need to declare acpi_parse_mp_wake() in asm/acpi.h. Since the
only user of this function is in arch/x86/kernel/acpi, you can either create
a header file there or re-use sleep.h.

If you want to leave it here, do you want to protect it with
CONFIG_X86_ACPI_MADT_WAKEUP?


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2023-10-06 19:24:52

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec


On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
>> On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
>>> On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
>>>>> +static void unshare_all_memory(bool unmap)
>>>>> +{
>>>>> + unsigned long addr, end;
>>>>> + long found = 0, shared;
>>>>> +
>>>>> + /*
>>>>> + * Walk direct mapping and convert all shared memory back to private,
>>>>> + */
>>>>> +
>>>>> + addr = PAGE_OFFSET;
>>>>> + end = PAGE_OFFSET + get_max_mapped();
>>>>> +
>>>>> + while (addr < end) {
>>>>> + unsigned long size;
>>>>> + unsigned int level;
>>>>> + pte_t *pte;
>>>>> +
>>>>> + pte = lookup_address(addr, &level);
>>>>
>>>> IIRC, you were earlier walking the direct mapping using
>>>> walk_page_range_novma(), any particular reason to use lookup_address()
>>>> instead ?
>>>
>>> walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
>>> we run here from atomic context in case of crash.
>>>
>>> I considered using trylock to bypass the limitation, but it is a hack.
>>>
>>>>
>>>>> + size = page_level_size(level);
>>>>> +
>>>>> + if (pte && pte_decrypted(*pte)) {
>>>>
>>>> Additionally need to add check for pte_none() here to handle physical memory
>>>> holes in direct mapping.
>>>
>>> lookup_address() returns NULL for none entries.
>>>
>>
>> Looking at lookup_address_in_pgd(), at pte level it is simply returning
>> pte_offset_kernel() and there does not seem to be a check for returning NULL
>> if pte_none() ?
>
> Hm. You are right.
>
> I think it yet another quirk in how lookup_address() implemented. We need
> to make it straight too.
>
> There's two options: either make lookup_address() return pointer for entry
> even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> and return NULL if it is true.
>
> I like the first option more as it allows caller to populate the entry if
> it wants.

Yes, i like the first option.

>
>>>>> + int pages = size / PAGE_SIZE;
>>>>> +
>>>>> + /*
>>>>> + * Touching memory with shared bit set triggers implicit
>>>>> + * conversion to shared.
>>>>> + *
>>>>> + * Make sure nobody touches the shared range from
>>>>> + * now on.
>>>>> + *
>>>>> + * Bypass unmapping for crash scenario. Unmapping
>>>>> + * requires sleepable context, but in crash case kernel
>>>>> + * hits the code path with interrupts disabled.
>>>>
>>>> In case of SNP we will need to temporarily enable interrupts during this
>>>> unsharing as we invoke set_memory_encrypted() which then hits a BUG_ON() in
>>>> cpa_flush() if interrupts are disabled.
>>>
>>> Do you really need full set_memory_encrypted()? Can't you do something
>>> ligher?
>>>
>> We need to modify the PTE for setting c-bit to 1 so that will require
>> cpa_flush(), though probably can add something lighter to do
>> clflush_cache_range() directly ?
>
> For TDX, I don't touch shared bit as nobody suppose to touch the memory
> after the point (ans set_memory_np() enforces it for !crash case).
>
> Can't SNP do the same?
>

No, we need to make the PSC call for HV to update the RMP, then set
C-bit=1 in the PTE and then do a PVALIDATE to switch the page back to
private, so it needs something like a full set_memory_encrypted().

Thanks,
Ashish

2023-10-06 22:16:04

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec


On 10/6/2023 10:11 AM, Kirill A. Shutemov wrote:
> On Fri, Oct 06, 2023 at 07:58:03AM -0700, Sean Christopherson wrote:
>> On Thu, Oct 05, 2023, Kirill A. Shutemov wrote:
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 7368d254d01f..b5acf9fb4c70 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
>>> select X86_MEM_ENCRYPT
>>> select X86_MCE
>>> select UNACCEPTED_MEMORY
>>> + select EMERGENCY_VIRT_CALLBACK
>>> help
>>> Support running as a guest under Intel TDX. Without this support,
>>> the guest kernel can not boot or run under TDX.
>>
>> ...
>>
>>> void __init tdx_early_init(void)
>>> {
>>> struct tdx_module_args args = {
>>> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
>>> */
>>> x86_cpuinit.parallel_bringup = false;
>>>
>>> + machine_ops.shutdown = tdx_shutdown;
>>> +
>>> + /*
>>> + * KVM overrides machine_ops.crash_shutdown, use emergency
>>
>> This is going to be super confusing. KVM utilizes the emergency virt callback.
>> The KVM paravirt guest code uses .crash_shutdown(). People that are passingly
>> familiar with virt and know what KVM is, but don't already know the difference
>> between the two are going to be all kinds of confused.
>>
>> I also feel like you're playing with fire, e.g. what's to stop the hypervisor
>> specific paravirt guest support from using .shutdown() in the future?
>>
>> And the callback is invoked for far more than just kexec(). I don't see how the
>> host can emulate a reboot without destroying and rebuilding the VM, e.g. it can't
>> stuff register state to emulate INIT or RESET. Unless I'm missing something,
>> converting shared memory back to private for a shutdown or reboot is undesirable
>> as adds one more thing that can go wrong and prevent the system from cleanly
>> shutting down ASAP (for some definitions of "cleanly").
>
> Okay, fair enough. I will look for better way to hookup into kexec
> process. That was the best fit I found so far, but yes it is not ideal.
>
>> Lastly, doesn't SEV need similar behavior? This seems like core functionality
>> for any guest with cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT). Why not make the
>> "unshare on kexec" code common and gate it with CC_ATTR_GUEST_MEM_ENCRYPT?
>
> I don't know SEV specifics. I am open to collaboration on this.
>
> Tom, Ashish, let me know if you need this in generic code. I can arrange
> that.
>

Yes, some kind of a generic interface like unshare_on_kexec() gated with
CC_ATTR_GUEST_MEM_ENCRYPT is needed, we can then add SNP specific kexec
functionality as part of this.

Thanks,
Ashish

2023-10-08 08:36:43

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> TDX guests allocate shared buffers to perform I/O. It is done by
> allocating pages normally from the buddy allocator and converting them
> to shared with set_memory_decrypted().
>
> The target kernel has no idea what memory is converted this way. It only
~~~~~~~~~~~~~
> sees E820_TYPE_RAM.

I finally realized it means the 2nd kernel of kexec rebooting. Maybe we
can call it 2nd kernel always, it works for both kexec and kdump
jumping.

>
> Accessing shared memory via private mapping is fatal. It leads to
> unrecoverable TD exit.
>
> On TD shutdown (also covers kexec), walk direct mapping and convert all
> shared memory back to private. It makes all RAM private again and target
> kernel may use it normally.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/coco/tdx/kexec.c | 0
> arch/x86/coco/tdx/tdx.c | 137 +++++++++++++++++++++++++++++++++++++-
> 3 files changed, 136 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/coco/tdx/kexec.c
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 7368d254d01f..b5acf9fb4c70 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -884,6 +884,7 @@ config INTEL_TDX_GUEST
> select X86_MEM_ENCRYPT
> select X86_MCE
> select UNACCEPTED_MEMORY
> + select EMERGENCY_VIRT_CALLBACK
> help
> Support running as a guest under Intel TDX. Without this support,
> the guest kernel can not boot or run under TDX.
> diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
> new file mode 100644
> index 000000000000..e69de29bb2d1
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 56e152126f20..ac0745303983 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -6,6 +6,7 @@
>
> #include <linux/cpufeature.h>
> #include <linux/debugfs.h>
> +#include <linux/delay.h>
> #include <linux/export.h>
> #include <linux/io.h>
> #include <asm/coco.h>
> @@ -14,6 +15,8 @@
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <asm/pgtable.h>
> +#include <asm/reboot.h>
> +#include <asm/set_memory.h>
>
> /* MMIO direction */
> #define EPT_READ 0
> @@ -40,6 +43,9 @@
>
> static atomic_long_t nr_shared;
>
> +static atomic_t conversions_in_progress;
> +static bool conversion_allowed = true;
> +
> static inline bool pte_decrypted(pte_t pte)
> {
> return cc_mkdec(pte_val(pte)) == pte_val(pte);
> @@ -704,6 +710,14 @@ static bool tdx_tlb_flush_required(bool private)
>
> static bool tdx_cache_flush_required(void)
> {
> + /*
> + * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
> + * stopped. Otherwise it can race with unshare_all_memory() and trigger
> + * implicit conversion to shared.
> + */
> + if (!conversion_allowed)
> + return false;
> +
> /*
> * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
> * TDX doesn't have such capability.
> @@ -787,12 +801,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
> bool enc)
> {
> + atomic_inc(&conversions_in_progress);
> +
> + /*
> + * Check after bumping conversions_in_progress to serialize
> + * against tdx_shutdown().
> + */
> + if (!conversion_allowed) {
> + atomic_dec(&conversions_in_progress);
> + return -EBUSY;
> + }
> +
> /*
> * Only handle shared->private conversion here.
> * See the comment in tdx_early_init().
> */
> - if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> + if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> + atomic_dec(&conversions_in_progress);
> return -EIO;
> + }
>
> return 0;
> }
> @@ -804,17 +831,115 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
> * Only handle private->shared conversion here.
> * See the comment in tdx_early_init().
> */
> - if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> + if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> + atomic_dec(&conversions_in_progress);
> return -EIO;
> + }
>
> if (enc)
> atomic_long_sub(numpages, &nr_shared);
> else
> atomic_long_add(numpages, &nr_shared);
>
> + atomic_dec(&conversions_in_progress);
> +
> return 0;
> }
>
> +static void unshare_all_memory(bool unmap)
> +{
> + unsigned long addr, end;
> + long found = 0, shared;
> +
> + /*
> + * Walk direct mapping and convert all shared memory back to private,
> + */
> +
> + addr = PAGE_OFFSET;
> + end = PAGE_OFFSET + get_max_mapped();
> +
> + while (addr < end) {
> + unsigned long size;
> + unsigned int level;
> + pte_t *pte;
> +
> + pte = lookup_address(addr, &level);
> + size = page_level_size(level);
> +
> + if (pte && pte_decrypted(*pte)) {
> + int pages = size / PAGE_SIZE;
> +
> + /*
> + * Touching memory with shared bit set triggers implicit
> + * conversion to shared.
> + *
> + * Make sure nobody touches the shared range from
> + * now on.
> + *
> + * Bypass unmapping for crash scenario. Unmapping
> + * requires sleepable context, but in crash case kernel
> + * hits the code path with interrupts disabled.
> + * It shouldn't be a problem as all secondary CPUs are
> + * down and kernel runs with interrupts disabled, so
> + * there is no room for race.
> + */
> + if (unmap)
> + set_memory_np(addr, pages);
> +
> + if (!tdx_enc_status_changed(addr, pages, true)) {
> + pr_err("Failed to unshare range %#lx-%#lx\n",
> + addr, addr + size);
> + }
> +
> + found += pages;
> + }
> +
> + addr += size;
> + }
> +
> + shared = atomic_long_read(&nr_shared);
> + if (shared != found) {
> + pr_err("shared page accounting is off\n");
> + pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> + }
> +}
> +
> +static void tdx_shutdown(void)
> +{
> + unsigned long timeout;
> +
> + /*
> + * Stop new private<->shared conversions and wait for in-flight
> + * conversions to complete.
> + *
> + * Do not wait more than 30 seconds.
> + */
> + timeout = 30 * USEC_PER_SEC;
> + conversion_allowed = false;
> + while (atomic_read(&conversions_in_progress) && timeout--)
> + udelay(1);
> +
> + if (!timeout)
> + pr_warn("Failed to finish shared<->private conversions\n");
> +
> + unshare_all_memory(true);
> +
> + native_machine_shutdown();
> +}
> +
> +static void tdx_crash_shutdown(void)
> +{
> + /*
> + * Crash can race with private<->shared conversion.
> + *
> + * There's no clean way out: report and proceed.
> + */
> + if (atomic_read(&conversions_in_progress))
> + pr_warn("Failed to finish shared<->private conversions\n");
> +
> + unshare_all_memory(false);
> +}
> +
> void __init tdx_early_init(void)
> {
> struct tdx_module_args args = {
> @@ -882,6 +1007,14 @@ void __init tdx_early_init(void)
> */
> x86_cpuinit.parallel_bringup = false;
>
> + machine_ops.shutdown = tdx_shutdown;
> +
> + /*
> + * KVM overrides machine_ops.crash_shutdown, use emergency
> + * virt callback instead.
> + */
> + cpu_emergency_register_virt_callback(tdx_crash_shutdown);
> +
> pr_info("Guest detected\n");
> }
>
> --
> 2.41.0
>
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
>

2023-10-08 23:51:00

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH 00/13] x86/tdx: Add kexec support

On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> The patchset adds bits and pieces to get kexec (and crashkernel) work on
> TDX guest.
>
> They bring kexec support to the point when we can start the new kernel,
> but it will only be able to use single CPU. It should be enough to cover
> the most common case: crashkernel.

Not sure if this question has been raised and answered in the past. Please
forgive my bad memory if it has. The one cpu is fine to kdump kernel most
of time, while we enable all CPUs by default when kexec rebooting. And kdump
kernel with multiple cpu is allowed too. Wondering if there's plan to
support the multiple cpu on TDX in the future.

Thanks
Baoquan

>
> The last patch implements CPU offlining according to the approved ACPI
> spec change poposal[1]. It unlocks kexec with all CPUs visible in the target
> kernel.
>
> Please review. I would be glad for any feedback.
>
> [1] https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher
>
> Kirill A. Shutemov (13):
> x86/acpi: Extract ACPI MADT wakeup code into a separate file
> kernel/cpu: Add support for declaring CPU hotplug not supported
> cpu/hotplug, x86/acpi: Disable CPU hotplug for ACPI MADT wakeup
> x86/kvm: Do not try to disable kvmclock if it was not enabled
> x86/kexec: Keep CR4.MCE set during kexec for TDX guest
> x86/mm: Make x86_platform.guest.enc_status_change_*() return errno
> x86/mm: Return correct level from lookup_address() if pte is none
> KVM: x86: Add config option to gate emergency virt callback support
> x86/tdx: Account shared memory
> x86/tdx: Convert shared memory back to private on kexec
> x86/mm: Make e820_end_ram_pfn() cover E820_TYPE_ACPI ranges
> x86/acpi: Do not attempt to bring up secondary CPUs in kexec case
> x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
>
> arch/x86/Kconfig | 8 +
> arch/x86/coco/core.c | 1 -
> arch/x86/coco/tdx/kexec.c | 0
> arch/x86/coco/tdx/tdx.c | 220 +++++++++++++++++++++-
> arch/x86/hyperv/ivm.c | 9 +-
> arch/x86/include/asm/acpi.h | 5 +
> arch/x86/include/asm/pgtable_types.h | 1 +
> arch/x86/include/asm/reboot.h | 4 +-
> arch/x86/include/asm/x86_init.h | 4 +-
> arch/x86/kernel/acpi/Makefile | 11 +-
> arch/x86/kernel/acpi/boot.c | 88 +--------
> arch/x86/kernel/acpi/madt.S | 28 +++
> arch/x86/kernel/acpi/madt_wakeup.c | 262 +++++++++++++++++++++++++++
> arch/x86/kernel/e820.c | 9 +-
> arch/x86/kernel/kvmclock.c | 9 +-
> arch/x86/kernel/reboot.c | 4 +-
> arch/x86/kernel/relocate_kernel_64.S | 5 +
> arch/x86/kernel/x86_init.c | 4 +-
> arch/x86/kvm/Kconfig | 5 +
> arch/x86/mm/mem_encrypt_amd.c | 8 +-
> arch/x86/mm/pat/set_memory.c | 17 +-
> include/acpi/actbl2.h | 19 +-
> include/linux/cc_platform.h | 10 -
> include/linux/cpu.h | 2 +
> kernel/cpu.c | 17 +-
> 25 files changed, 604 insertions(+), 146 deletions(-)
> create mode 100644 arch/x86/coco/tdx/kexec.c
> create mode 100644 arch/x86/kernel/acpi/madt.S
> create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
>
> --
> 2.41.0
>
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
>

2023-10-09 12:33:15

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest

On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> to #VE.
>
> Use alternatives to keep the flag during kexec for TDX guests.
>
> The change doesn't affect non-TDX environments.

Nit: non-TDX-guest environments. ?

>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Reviewed-by: Kai Huang <[email protected]>

> ---
> arch/x86/kernel/relocate_kernel_64.S | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 56cab1bb25f5..bea89814b48e 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -145,11 +145,16 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> * Set cr4 to a known state:
> * - physical address extension enabled
> * - 5-level paging, if it was enabled before
> + * - Machine check exception on TDX guest. Clearing MCE is not allowed
> + * in TDX guests.
> */
> movl $X86_CR4_PAE, %eax
> testq $X86_CR4_LA57, %r13
> jz 1f
> orl $X86_CR4_LA57, %eax
> +1:
> + ALTERNATIVE "jmp 1f", "", X86_FEATURE_TDX_GUEST
> + orl $X86_CR4_MCE, %eax
> 1:
> movq %rax, %cr4
>

2023-10-09 13:32:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 01/13] x86/acpi: Extract ACPI MADT wakeup code into a separate file

On Fri, Oct 06, 2023 at 11:33:47AM -0700, Kuppuswamy Sathyanarayanan wrote:
> Hi Kirill,
>
> On 10/5/2023 6:13 AM, Kirill A. Shutemov wrote:
> > In order to prepare for the expansion of support for the ACPI MADT
> > wakeup method, the relevant code has been moved into a separate file.
> > A new configuration option has been introduced to clearly indicate
> > dependencies without the use of ifdefs.
> >
> > There have been no functional changes.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > arch/x86/Kconfig | 7 +++
> > arch/x86/include/asm/acpi.h | 5 ++
> > arch/x86/kernel/acpi/Makefile | 11 ++--
> > arch/x86/kernel/acpi/boot.c | 86 +-----------------------------
> > arch/x86/kernel/acpi/madt_wakeup.c | 80 +++++++++++++++++++++++++++
> > 5 files changed, 99 insertions(+), 90 deletions(-)
> > create mode 100644 arch/x86/kernel/acpi/madt_wakeup.c
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 3154dbc49cf5..7368d254d01f 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1108,6 +1108,13 @@ config X86_LOCAL_APIC
> > depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI
> > select IRQ_DOMAIN_HIERARCHY
> >
> > +config X86_ACPI_MADT_WAKEUP
> > + def_bool y
> > + depends on X86_64
> > + depends on ACPI
> > + depends on SMP
> > + depends on X86_LOCAL_APIC
> > +
> > config X86_IO_APIC
> > def_bool y
> > depends on X86_LOCAL_APIC || X86_UP_IOAPIC
> > diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
> > index c8a7fc23f63c..b536b5a6a57b 100644
> > --- a/arch/x86/include/asm/acpi.h
> > +++ b/arch/x86/include/asm/acpi.h
> > @@ -73,6 +73,11 @@ static inline bool acpi_skip_set_wakeup_address(void)
> >
> > #define acpi_skip_set_wakeup_address acpi_skip_set_wakeup_address
> >
> > +union acpi_subtable_headers;
> > +
> > +int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
> > + const unsigned long end);
> > +
>
> IMO, you don't need to declare acpi_parse_mp_wake() in asm/acpi.h. Since the
> only user of this function is in arch/x86/kernel/acpi, you can either create
> a header file there or re-use sleep.h.

Is it a really a bid deal? I don't see how it fits into sleep.h and
introducing one more header file for one declaration seems excessive.

> If you want to leave it here, do you want to protect it with
> CONFIG_X86_ACPI_MADT_WAKEUP?

Declarations are harmless if nobody uses them. Needless ifdeffery hurts
eyes :P

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-09 13:33:17

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 05/13] x86/kexec: Keep CR4.MCE set during kexec for TDX guest

On Mon, Oct 09, 2023 at 12:30:55PM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:13 +0300, Kirill A. Shutemov wrote:
> > TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> > to #VE.
> >
> > Use alternatives to keep the flag during kexec for TDX guests.
> >
> > The change doesn't affect non-TDX environments.
>
> Nit: non-TDX-guest environments. ?

Okay, will fix.

> > Signed-off-by: Kirill A. Shutemov <[email protected]>
>
> Reviewed-by: Kai Huang <[email protected]>

Thanks.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-09 13:35:32

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On Sun, Oct 08, 2023 at 04:35:27PM +0800, Baoquan He wrote:
> On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > TDX guests allocate shared buffers to perform I/O. It is done by
> > allocating pages normally from the buddy allocator and converting them
> > to shared with set_memory_decrypted().
> >
> > The target kernel has no idea what memory is converted this way. It only
> ~~~~~~~~~~~~~
> > sees E820_TYPE_RAM.
>
> I finally realized it means the 2nd kernel of kexec rebooting. Maybe we
> can call it 2nd kernel always, it works for both kexec and kdump
> jumping.

Okay. Will fix. I am new to kexec and I don't know proper terminology :)

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-09 13:37:25

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 00/13] x86/tdx: Add kexec support

On Mon, Oct 09, 2023 at 07:49:35AM +0800, Baoquan He wrote:
> On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > The patchset adds bits and pieces to get kexec (and crashkernel) work on
> > TDX guest.
> >
> > They bring kexec support to the point when we can start the new kernel,
> > but it will only be able to use single CPU. It should be enough to cover
> > the most common case: crashkernel.
>
> Not sure if this question has been raised and answered in the past. Please
> forgive my bad memory if it has. The one cpu is fine to kdump kernel most
> of time, while we enable all CPUs by default when kexec rebooting. And kdump
> kernel with multiple cpu is allowed too. Wondering if there's plan to
> support the multiple cpu on TDX in the future.

Sorry, I didn't update this part of cover letter properly, but the last
patch of the patchset makes possible to kexec with multiple CPUs and the
2nd kernel will see them all. It requires support on BIOS side, otherwise
we fallback to single CPU kexec.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-09 14:14:45

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH 00/13] x86/tdx: Add kexec support

On 10/09/23 at 04:36pm, Kirill A. Shutemov wrote:
> On Mon, Oct 09, 2023 at 07:49:35AM +0800, Baoquan He wrote:
> > On 10/05/23 at 04:13pm, Kirill A. Shutemov wrote:
> > > The patchset adds bits and pieces to get kexec (and crashkernel) work on
> > > TDX guest.
> > >
> > > They bring kexec support to the point when we can start the new kernel,
> > > but it will only be able to use single CPU. It should be enough to cover
> > > the most common case: crashkernel.
> >
> > Not sure if this question has been raised and answered in the past. Please
> > forgive my bad memory if it has. The one cpu is fine to kdump kernel most
> > of time, while we enable all CPUs by default when kexec rebooting. And kdump
> > kernel with multiple cpu is allowed too. Wondering if there's plan to
> > support the multiple cpu on TDX in the future.
>
> Sorry, I didn't update this part of cover letter properly, but the last
> patch of the patchset makes possible to kexec with multiple CPUs and the
> 2nd kernel will see them all. It requires support on BIOS side, otherwise
> we fallback to single CPU kexec.

Oops, I didn't read them carefully. You have mentioned that in the last
paragraph of cover letter. That's a great news, thanks.

2023-10-20 09:22:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On Fri, Oct 06, 2023 at 02:24:11PM -0500, Kalra, Ashish wrote:
>
> On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> > On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> > > On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > > > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > > > +static void unshare_all_memory(bool unmap)
> > > > > > +{
> > > > > > + unsigned long addr, end;
> > > > > > + long found = 0, shared;
> > > > > > +
> > > > > > + /*
> > > > > > + * Walk direct mapping and convert all shared memory back to private,
> > > > > > + */
> > > > > > +
> > > > > > + addr = PAGE_OFFSET;
> > > > > > + end = PAGE_OFFSET + get_max_mapped();
> > > > > > +
> > > > > > + while (addr < end) {
> > > > > > + unsigned long size;
> > > > > > + unsigned int level;
> > > > > > + pte_t *pte;
> > > > > > +
> > > > > > + pte = lookup_address(addr, &level);
> > > > >
> > > > > IIRC, you were earlier walking the direct mapping using
> > > > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > > > instead ?
> > > >
> > > > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > > > we run here from atomic context in case of crash.
> > > >
> > > > I considered using trylock to bypass the limitation, but it is a hack.
> > > >
> > > > >
> > > > > > + size = page_level_size(level);
> > > > > > +
> > > > > > + if (pte && pte_decrypted(*pte)) {
> > > > >
> > > > > Additionally need to add check for pte_none() here to handle physical memory
> > > > > holes in direct mapping.
> > > >
> > > > lookup_address() returns NULL for none entries.
> > > >
> > >
> > > Looking at lookup_address_in_pgd(), at pte level it is simply returning
> > > pte_offset_kernel() and there does not seem to be a check for returning NULL
> > > if pte_none() ?
> >
> > Hm. You are right.
> >
> > I think it yet another quirk in how lookup_address() implemented. We need
> > to make it straight too.
> >
> > There's two options: either make lookup_address() return pointer for entry
> > even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> > and return NULL if it is true.
> >
> > I like the first option more as it allows caller to populate the entry if
> > it wants.
>
> Yes, i like the first option.

I tried to this, but lookup_address() has to many callers. It gets beyond
the scope of the patchset. I will add pte_none() check on unshare side for
now.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-20 09:40:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86/tdx: Convert shared memory back to private on kexec

On Fri, Oct 20, 2023 at 12:21:11PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 06, 2023 at 02:24:11PM -0500, Kalra, Ashish wrote:
> >
> > On 10/5/2023 5:28 PM, Kirill A. Shutemov wrote:
> > > On Thu, Oct 05, 2023 at 05:01:23PM -0500, Kalra, Ashish wrote:
> > > > On 10/5/2023 4:28 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Oct 05, 2023 at 01:41:38PM -0500, Kalra, Ashish wrote:
> > > > > > > +static void unshare_all_memory(bool unmap)
> > > > > > > +{
> > > > > > > + unsigned long addr, end;
> > > > > > > + long found = 0, shared;
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Walk direct mapping and convert all shared memory back to private,
> > > > > > > + */
> > > > > > > +
> > > > > > > + addr = PAGE_OFFSET;
> > > > > > > + end = PAGE_OFFSET + get_max_mapped();
> > > > > > > +
> > > > > > > + while (addr < end) {
> > > > > > > + unsigned long size;
> > > > > > > + unsigned int level;
> > > > > > > + pte_t *pte;
> > > > > > > +
> > > > > > > + pte = lookup_address(addr, &level);
> > > > > >
> > > > > > IIRC, you were earlier walking the direct mapping using
> > > > > > walk_page_range_novma(), any particular reason to use lookup_address()
> > > > > > instead ?
> > > > >
> > > > > walk_page_range_novma() wants mmap lock to be taken, but it is tricky as
> > > > > we run here from atomic context in case of crash.
> > > > >
> > > > > I considered using trylock to bypass the limitation, but it is a hack.
> > > > >
> > > > > >
> > > > > > > + size = page_level_size(level);
> > > > > > > +
> > > > > > > + if (pte && pte_decrypted(*pte)) {
> > > > > >
> > > > > > Additionally need to add check for pte_none() here to handle physical memory
> > > > > > holes in direct mapping.
> > > > >
> > > > > lookup_address() returns NULL for none entries.
> > > > >
> > > >
> > > > Looking at lookup_address_in_pgd(), at pte level it is simply returning
> > > > pte_offset_kernel() and there does not seem to be a check for returning NULL
> > > > if pte_none() ?
> > >
> > > Hm. You are right.
> > >
> > > I think it yet another quirk in how lookup_address() implemented. We need
> > > to make it straight too.
> > >
> > > There's two options: either make lookup_address() return pointer for entry
> > > even if it is NULL, or add check for pte_none() after pte_offset_kernel()
> > > and return NULL if it is true.
> > >
> > > I like the first option more as it allows caller to populate the entry if
> > > it wants.
> >
> > Yes, i like the first option.
>
> I tried to this, but lookup_address() has to many callers. It gets beyond
> the scope of the patchset. I will add pte_none() check on unshare side for
> now.

Ah. pte_none() is not need for TDX implementation, as pte_decrypted()
check will fail for it. SEV implementation would need an additional check.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-20 09:50:12

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
>  struct acpi_madt_multiproc_wakeup {
>   struct acpi_subtable_header header;
> - u16 mailbox_version;
> + u16 version;
>   u32 reserved; /* reserved - must be zero */
> - u64 base_address;
> + u64 mailbox_address;
> + u64 reset_vector;
>  };

I don't quite understand the connection between the renaming and what this patch
wants to achieve? What's the reason to rename? If needed, perhaps put into a
separate patch with proper justification?

2023-10-20 10:42:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

On Fri, Oct 20, 2023 at 09:49:59AM +0000, Huang, Kai wrote:
> On Thu, 2023-10-05 at 16:14 +0300, Kirill A. Shutemov wrote:
> > ?struct acpi_madt_multiproc_wakeup {
> > ? struct acpi_subtable_header header;
> > - u16 mailbox_version;
> > + u16 version;
> > ? u32 reserved; /* reserved - must be zero */
> > - u64 base_address;
> > + u64 mailbox_address;
> > + u64 reset_vector;
> > ?};
>
> I don't quite understand the connection between the renaming and what this patch
> wants to achieve? What's the reason to rename?

Names are bad: the version field guides version of the structure, not
version of the mailbox. And it is not clear what base base_address
specifies.

> If needed, perhaps put into a separate patch with proper justification?

Hm. Okay...

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-10-20 11:21:57

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method


> --- /dev/null
> +++ b/arch/x86/kernel/acpi/madt.S
> @@ -0,0 +1,28 @@
> +#include <linux/linkage.h>
> +#include <asm/nospec-branch.h>
> +#include <asm/page_types.h>
> +#include <asm/processor-flags.h>
> +
> + .text
> + .align PAGE_SIZE
> +SYM_FUNC_START(asm_acpi_mp_play_dead)
> + /* Load address of reset vector into RCX to jump when kernel is ready */
> + movq acpi_mp_reset_vector_paddr(%rip), %rcx
> +
> + /* zero out flags, and disable interrupts */
> + pushq $0
> + popfq
> +
> + /* Turn off global entries. Following CR3 write will flush them. */
> + movq %cr4, %rdx
> + andq $~(X86_CR4_PGE), %rdx
> + movq %rdx, %cr4
> +
> + /* Switch to identity mapping */
> + movq acpi_mp_pgd(%rip), %rax
> + movq %rax, %cr3
> +
> + /* Jump to reset vector */
> + ANNOTATE_RETPOLINE_SAFE
> + jmp *%rcx
> +SYM_FUNC_END(asm_acpi_mp_play_dead)
> diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> index 4e92d1d4a5fa..2cc8590ec7a5 100644
> --- a/arch/x86/kernel/acpi/madt_wakeup.c
> +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> @@ -1,12 +1,162 @@
> #include <linux/acpi.h>
> #include <linux/cpu.h>
> +#include <linux/delay.h>
> +#include <linux/memblock.h>
> +#include <linux/pgtable.h>
> +#include <linux/sched/hotplug.h>
> #include <asm/apic.h>
> +#include <asm/init.h>
>
> /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> static u64 acpi_mp_wake_mailbox_paddr;
> /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
>
> +unsigned long acpi_mp_pgd;
> +u64 acpi_mp_reset_vector_paddr;

Nit: 

They are both physical address. It's a little bit odd for one to use 'unsigned
long' and the other to use 'u64'.

> +
> +void asm_acpi_mp_play_dead(void);
> +
> +static void __init *alloc_pgt_page(void *context)
> +{
> + return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> +}

If I am reading correclty, @context is never used. It's not used inside this
function, and all the callers call this with @context = NULL.


[...]

> +
> +static void acpi_mp_play_dead(void)
> +{
> + idle_task_exit();
> + cpuhp_ap_report_dead();
> + asm_acpi_mp_play_dead();
> +}
> +

Looks you can use play_dead_common() here, if you take IRQ disable part out from
the assembly, because play_dead_common() does:

void play_dead_common(void)
{
idle_task_exit();

cpuhp_ap_report_dead();

local_irq_disable();
}

2023-10-20 12:34:46

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 13/13] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method

On Fri, Oct 20, 2023 at 11:21:34AM +0000, Huang, Kai wrote:
>
> > --- /dev/null
> > +++ b/arch/x86/kernel/acpi/madt.S
> > @@ -0,0 +1,28 @@
> > +#include <linux/linkage.h>
> > +#include <asm/nospec-branch.h>
> > +#include <asm/page_types.h>
> > +#include <asm/processor-flags.h>
> > +
> > + .text
> > + .align PAGE_SIZE
> > +SYM_FUNC_START(asm_acpi_mp_play_dead)
> > + /* Load address of reset vector into RCX to jump when kernel is ready */
> > + movq acpi_mp_reset_vector_paddr(%rip), %rcx
> > +
> > + /* zero out flags, and disable interrupts */
> > + pushq $0
> > + popfq
> > +
> > + /* Turn off global entries. Following CR3 write will flush them. */
> > + movq %cr4, %rdx
> > + andq $~(X86_CR4_PGE), %rdx
> > + movq %rdx, %cr4
> > +
> > + /* Switch to identity mapping */
> > + movq acpi_mp_pgd(%rip), %rax
> > + movq %rax, %cr3
> > +
> > + /* Jump to reset vector */
> > + ANNOTATE_RETPOLINE_SAFE
> > + jmp *%rcx
> > +SYM_FUNC_END(asm_acpi_mp_play_dead)
> > diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
> > index 4e92d1d4a5fa..2cc8590ec7a5 100644
> > --- a/arch/x86/kernel/acpi/madt_wakeup.c
> > +++ b/arch/x86/kernel/acpi/madt_wakeup.c
> > @@ -1,12 +1,162 @@
> > #include <linux/acpi.h>
> > #include <linux/cpu.h>
> > +#include <linux/delay.h>
> > +#include <linux/memblock.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/sched/hotplug.h>
> > #include <asm/apic.h>
> > +#include <asm/init.h>
> >
> > /* Physical address of the Multiprocessor Wakeup Structure mailbox */
> > static u64 acpi_mp_wake_mailbox_paddr;
> > /* Virtual address of the Multiprocessor Wakeup Structure mailbox */
> > static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
> >
> > +unsigned long acpi_mp_pgd;
> > +u64 acpi_mp_reset_vector_paddr;
>
> Nit:?
>
> They are both physical address. It's a little bit odd for one to use 'unsigned
> long' and the other to use 'u64'.

Okay, I will make them both u64.

> > +
> > +void asm_acpi_mp_play_dead(void);
> > +
> > +static void __init *alloc_pgt_page(void *context)
> > +{
> > + return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> > +}
>
> If I am reading correclty, @context is never used. It's not used inside this
> function, and all the callers call this with @context = NULL.

Yes. We need the argument to conform to x86_mapping_info::alloc_pgt_page
interface.

> > +
> > +static void acpi_mp_play_dead(void)
> > +{
> > + idle_task_exit();
> > + cpuhp_ap_report_dead();
> > + asm_acpi_mp_play_dead();
> > +}
> > +
>
> Looks you can use play_dead_common() here, if you take IRQ disable part out from
> the assembly, because play_dead_common() does:
>
> void play_dead_common(void)
> {
> idle_task_exit();
>
> cpuhp_ap_report_dead();
>
> local_irq_disable();
> }

Okay, fair enough.

--
Kiryl Shutsemau / Kirill A. Shutemov