2023-03-30 11:53:22

by Kirill A. Shutemov

Subject: [PATCHv9 00/14] mm, x86/cc: Implement support for unaccepted memory

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. It leads
to table fragmentation, and there's a limited number of entries in the
e820 table.

Another option is to mark such memory as usable in e820 and track if the
range has been accepted in a bitmap. One bit in the bitmap represents
2MiB in the address space: one 4k page is enough to track 64GiB of
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
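The arithmetic behind these numbers can be checked with a small sketch
(not part of the patchset; bitmap_bytes() is a hypothetical helper for
illustration only):

```c
#define PMD_SIZE (1ULL << 21)	/* one bitmap bit covers 2MiB */

/* Bytes of bitmap needed to track 'size' bytes of address space. */
static unsigned long long bitmap_bytes(unsigned long long size)
{
	unsigned long long bits = (size + PMD_SIZE - 1) / PMD_SIZE;

	return (bits + 7) / 8;
}
```

A single 4k page holds 4096 * 8 = 32768 bits, i.e. 32768 * 2MiB = 64GiB
of address space, and 4PiB works out to 256MiB of bitmap.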

Any unaccepted memory that is not aligned to 2M gets accepted upfront.

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for a 4G TDX VM and ~4x faster for a 64G one.

TDX-specific code is isolated from the core of unaccepted memory support.
It is supposed to help plug in different implementations of unaccepted
memory, such as SEV-SNP.

-- Fragmentation study --

Vlastimil and Mel were concerned about the effect of unaccepted memory on
fragmentation prevention measures in the page allocator. I tried to evaluate
it, but it is tricky. As suggested, I tried to run multiple parallel kernel
builds and track how often kmem:mm_page_alloc_extfrag gets hit.

I don't like the results. Not because unaccepted memory is slow, but
because the test is way too noisy to produce sensible information.

So, I ran 8 parallel builds of the kernel, 16 jobs each, in a VM with 16 CPUs
and 16G of RAM. I compared the unpatched baseline (mm-unstable) with the tree
that has the patchset applied. For the latter case, all memory above 1G was
considered unaccepted with fake_unaccepted_start=1G. Around 14G of RAM was
accounted as unaccepted after the boot. The test got all memory accepted.

kmem:mm_page_alloc_extfrag Time elapsed
Base line:
Run 1 837 1803.21s
Run 2 3,845 1785.87s
Run 3 1,704 1883.59s

Patched:
Run 1 905 1876.02s
Run 2 845 1758.50s
Run 3 1276 1876.13s

As you can see, the numbers are all over the place.

The good news is that unaccepted memory doesn't make the picture notably worse.

I am open to suggestions on how to test it better.

Also, feel free to play with it yourself. fake_unaccepted_start= doesn't
require any special setup.

--

The tree can be found here:

https://github.com/intel/tdx.git guest-unaccepted-memory

The patchset depends on MAX_ORDER changes in MM tree.

v9:
- Accept memory up to high watermark when kernel runs out of free memory;
- Treat unaccepted memory as unusable in __zone_watermark_unusable_free();
- Per-zone unaccepted memory accounting;
- All pages on unaccepted list are MAX_ORDER now;
- accept_memory=eager in cmdline to pre-accept memory during the boot;
- Implement fake unaccepted memory;
- Sysfs handle to accept memory manually;
- Drop PageUnaccepted();
- Rename unaccepted_pages static key to zones_with_unaccepted_pages;
v8:
- Rewrite core-mm support for unaccepted memory (patch 02/14);
- s/UnacceptedPages/Unaccepted/ in meminfo;
- Drop arch/x86/boot/compressed/compiler.h;
- Fix build errors;
- Adjust commit messages and comments;
- Reviewed-bys from Dave and Borislav;
- Rebased to tip/master.
v7:
- Rework meminfo counter to use PageUnaccepted() and move to generic code;
- Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
- Add Reviewed-by from David;
v6:
- Fix load_unaligned_zeropad() on machine with unaccepted memory;
- Clear PageUnaccepted() on merged pages, leaving it only on head;
- Clarify error handling in allocate_e820();
- Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
- Disable kexec at boottime instead of build conflict;
- Rebased to tip/master;
- Spelling fixes;
- Add Reviewed-by from Mike and David;
v5:
- Updates comments and commit messages;
+ Explain options for unaccepted memory handling;
- Expose amount of unaccepted memory in /proc/meminfo;
- Adjust check in page_expected_state();
- Fix error code handling in allocate_e820();
- Centralize __pa()/__va() definitions in the boot stub;
- Avoid includes from the main kernel in the boot stub;
- Use an existing hole in boot_param for unaccepted_memory, instead of adding
to the end of the structure;
- Extract allocate_unaccepted_memory() form allocate_e820();
- Complain if there's unaccepted memory, but kernel does not support it;
- Fix vmstat counter;
- Split up few preparatory patches;
- Random readability adjustments;
v4:
- PageBuddyUnaccepted() -> PageUnaccepted;
- Use separate page_type, not shared with offline;
- Rework interface between core-mm and arch code;
- Adjust commit messages;
- Ack from Mike;

Kirill A. Shutemov (14):
x86/boot: Centralize __pa()/__va() definitions
mm: Add support for unaccepted memory
mm/page_alloc: Fake unaccepted memory
mm/page_alloc: Add sysfs handle to accept memory
efi/x86: Get full memory map in allocate_e820()
x86/boot: Add infrastructure required for unaccepted memory support
efi/x86: Implement support for unaccepted memory
x86/boot/compressed: Handle unaccepted memory
x86/mm: Reserve unaccepted memory bitmap
x86/mm: Provide helpers for unaccepted memory
x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
boot stub
x86/tdx: Refactor try_accept_one()
x86/tdx: Add unaccepted memory support

Documentation/x86/zero-page.rst | 1 +
arch/x86/Kconfig | 2 +
arch/x86/boot/bitops.h | 40 ++++
arch/x86/boot/compressed/Makefile | 3 +-
arch/x86/boot/compressed/align.h | 14 ++
arch/x86/boot/compressed/bitmap.c | 43 ++++
arch/x86/boot/compressed/bitmap.h | 49 +++++
arch/x86/boot/compressed/bits.h | 36 ++++
arch/x86/boot/compressed/efi.h | 1 +
arch/x86/boot/compressed/error.c | 19 ++
arch/x86/boot/compressed/error.h | 1 +
arch/x86/boot/compressed/find.c | 54 +++++
arch/x86/boot/compressed/find.h | 79 ++++++++
arch/x86/boot/compressed/ident_map_64.c | 8 -
arch/x86/boot/compressed/kaslr.c | 35 ++--
arch/x86/boot/compressed/math.h | 37 ++++
arch/x86/boot/compressed/mem.c | 122 +++++++++++
arch/x86/boot/compressed/minmax.h | 61 ++++++
arch/x86/boot/compressed/misc.c | 6 +
arch/x86/boot/compressed/misc.h | 15 ++
arch/x86/boot/compressed/pgtable_types.h | 25 +++
arch/x86/boot/compressed/sev.c | 2 -
arch/x86/boot/compressed/tdx-shared.c | 2 +
arch/x86/boot/compressed/tdx.c | 39 ++++
arch/x86/coco/tdx/Makefile | 2 +-
arch/x86/coco/tdx/tdx-shared.c | 95 +++++++++
arch/x86/coco/tdx/tdx.c | 118 +----------
arch/x86/include/asm/page.h | 3 +
arch/x86/include/asm/shared/tdx.h | 53 +++++
arch/x86/include/asm/tdx.h | 21 +-
arch/x86/include/asm/unaccepted_memory.h | 16 ++
arch/x86/include/uapi/asm/bootparam.h | 2 +-
arch/x86/kernel/e820.c | 17 ++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/unaccepted_memory.c | 107 ++++++++++
drivers/base/node.c | 7 +
drivers/firmware/efi/Kconfig | 14 ++
drivers/firmware/efi/efi.c | 1 +
drivers/firmware/efi/libstub/x86-stub.c | 98 +++++++--
fs/proc/meminfo.c | 5 +
include/linux/efi.h | 3 +-
include/linux/mmzone.h | 8 +
mm/internal.h | 13 ++
mm/memblock.c | 9 +
mm/mm_init.c | 7 +
mm/page_alloc.c | 246 +++++++++++++++++++++++
mm/vmstat.c | 3 +
47 files changed, 1368 insertions(+), 176 deletions(-)
create mode 100644 arch/x86/boot/compressed/align.h
create mode 100644 arch/x86/boot/compressed/bitmap.c
create mode 100644 arch/x86/boot/compressed/bitmap.h
create mode 100644 arch/x86/boot/compressed/bits.h
create mode 100644 arch/x86/boot/compressed/find.c
create mode 100644 arch/x86/boot/compressed/find.h
create mode 100644 arch/x86/boot/compressed/math.h
create mode 100644 arch/x86/boot/compressed/mem.c
create mode 100644 arch/x86/boot/compressed/minmax.h
create mode 100644 arch/x86/boot/compressed/pgtable_types.h
create mode 100644 arch/x86/boot/compressed/tdx-shared.c
create mode 100644 arch/x86/coco/tdx/tdx-shared.c
create mode 100644 arch/x86/include/asm/unaccepted_memory.h
create mode 100644 arch/x86/mm/unaccepted_memory.c

--
2.39.2


2023-03-30 11:53:30

by Kirill A. Shutemov

Subject: [PATCHv9 08/14] x86/boot/compressed: Handle unaccepted memory

The firmware will pre-accept the memory used to run the stub. But, the
stub is responsible for accepting the memory into which it decompresses
the main kernel. Accept memory just before decompression starts.

The stub is also responsible for choosing a physical address in which to
place the decompressed kernel image. The KASLR mechanism will randomize
this physical address. Since the unaccepted memory region is relatively
small, KASLR would be quite ineffective if it only used the pre-accepted
area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes among the
entire physical address space by also including EFI_UNACCEPTED_MEMORY.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/boot/compressed/efi.h | 1 +
arch/x86/boot/compressed/kaslr.c | 35 ++++++++++++++++--------
arch/x86/boot/compressed/mem.c | 18 ++++++++++++
arch/x86/boot/compressed/misc.c | 6 ++++
arch/x86/boot/compressed/misc.h | 6 ++++
arch/x86/include/asm/unaccepted_memory.h | 2 ++
7 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index f62c02348f9a..74f7adee46ad 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,7 +107,7 @@ endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
-vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/find.o $(obj)/mem.o

vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
index 7db2f41b54cd..cf475243b6d5 100644
--- a/arch/x86/boot/compressed/efi.h
+++ b/arch/x86/boot/compressed/efi.h
@@ -32,6 +32,7 @@ typedef struct {
} efi_table_hdr_t;

#define EFI_CONVENTIONAL_MEMORY 7
+#define EFI_UNACCEPTED_MEMORY 15

#define EFI_MEMORY_MORE_RELIABLE \
((u64)0x0000000000010000ULL) /* higher reliability */
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 454757fbdfe5..749f0fe7e446 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -672,6 +672,28 @@ static bool process_mem_region(struct mem_vector *region,
}

#ifdef CONFIG_EFI
+
+/*
+ * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
+ * guaranteed to be free.
+ *
+ * It is more conservative in picking free memory than the EFI spec allows:
+ *
+ * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory
+ * and thus available to place the kernel image into, but in practice there's
+ * firmware where using that memory leads to crashes.
+ */
+static inline bool memory_type_is_free(efi_memory_desc_t *md)
+{
+ if (md->type == EFI_CONVENTIONAL_MEMORY)
+ return true;
+
+ if (md->type == EFI_UNACCEPTED_MEMORY)
+ return IS_ENABLED(CONFIG_UNACCEPTED_MEMORY);
+
+ return false;
+}
+
/*
* Returns true if we processed the EFI memmap, which we prefer over the E820
* table if it is available.
@@ -716,18 +738,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
for (i = 0; i < nr_desc; i++) {
md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);

- /*
- * Here we are more conservative in picking free memory than
- * the EFI spec allows:
- *
- * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
- * free memory and thus available to place the kernel image into,
- * but in practice there's firmware where using that memory leads
- * to crashes.
- *
- * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
- */
- if (md->type != EFI_CONVENTIONAL_MEMORY)
+ if (!memory_type_is_free(md))
continue;

if (efi_soft_reserve_enabled() &&
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 6b15a0ed8b54..de858a5180b6 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -3,12 +3,15 @@
#include "../cpuflags.h"
#include "bitmap.h"
#include "error.h"
+#include "find.h"
#include "math.h"

#define PMD_SHIFT 21
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE - 1))

+extern struct boot_params *boot_params;
+
static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
{
/* Platform-specific memory-acceptance call goes here */
@@ -71,3 +74,18 @@ void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
bitmap_set((unsigned long *)params->unaccepted_memory,
start / PMD_SIZE, (end - start) / PMD_SIZE);
}
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long range_start, range_end;
+ unsigned long *bitmap, bitmap_size;
+
+ bitmap = (unsigned long *)boot_params->unaccepted_memory;
+ range_start = start / PMD_SIZE;
+ bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
+
+ for_each_set_bitrange_from(range_start, range_end, bitmap, bitmap_size) {
+ __accept_memory(range_start * PMD_SIZE, range_end * PMD_SIZE);
+ bitmap_clear(bitmap, range_start, range_end - range_start);
+ }
+}
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 014ff222bf4b..186bfd53e042 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -455,6 +455,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
#endif

debug_putstr("\nDecompressing Linux... ");
+
+ if (boot_params->unaccepted_memory) {
+ debug_putstr("Accepting memory... ");
+ accept_memory(__pa(output), __pa(output) + needed_size);
+ }
+
__decompress(input_data, input_len, NULL, NULL, output, output_len,
NULL, error);
entry_offset = parse_elf(output);
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 2f155a0e3041..9663d1839f54 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -247,4 +247,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
}
#endif /* CONFIG_EFI */

+#ifdef CONFIG_UNACCEPTED_MEMORY
+void accept_memory(phys_addr_t start, phys_addr_t end);
+#else
+static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
+#endif
+
#endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index df0736d32858..41fbfc798100 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -7,4 +7,6 @@ struct boot_params;

void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);

+void accept_memory(phys_addr_t start, phys_addr_t end);
+
#endif
--
2.39.2

2023-03-30 11:53:36

by Kirill A. Shutemov

Subject: [PATCHv9 06/14] x86/boot: Add infrastructure required for unaccepted memory support

Pull functionality from the main kernel headers and lib/ that is
required for unaccepted memory support.

This is a preparatory patch. The users of the functionality will come in
the following patches.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Borislav Petkov (AMD) <[email protected]>
---
arch/x86/boot/bitops.h | 40 ++++++++++++
arch/x86/boot/compressed/align.h | 14 +++++
arch/x86/boot/compressed/bitmap.c | 43 +++++++++++++
arch/x86/boot/compressed/bitmap.h | 49 +++++++++++++++
arch/x86/boot/compressed/bits.h | 36 +++++++++++
arch/x86/boot/compressed/find.c | 54 ++++++++++++++++
arch/x86/boot/compressed/find.h | 79 ++++++++++++++++++++++++
arch/x86/boot/compressed/math.h | 37 +++++++++++
arch/x86/boot/compressed/minmax.h | 61 ++++++++++++++++++
arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++
10 files changed, 438 insertions(+)
create mode 100644 arch/x86/boot/compressed/align.h
create mode 100644 arch/x86/boot/compressed/bitmap.c
create mode 100644 arch/x86/boot/compressed/bitmap.h
create mode 100644 arch/x86/boot/compressed/bits.h
create mode 100644 arch/x86/boot/compressed/find.c
create mode 100644 arch/x86/boot/compressed/find.h
create mode 100644 arch/x86/boot/compressed/math.h
create mode 100644 arch/x86/boot/compressed/minmax.h
create mode 100644 arch/x86/boot/compressed/pgtable_types.h

diff --git a/arch/x86/boot/bitops.h b/arch/x86/boot/bitops.h
index 8518ae214c9b..38badf028543 100644
--- a/arch/x86/boot/bitops.h
+++ b/arch/x86/boot/bitops.h
@@ -41,4 +41,44 @@ static inline void set_bit(int nr, void *addr)
asm("btsl %1,%0" : "+m" (*(u32 *)addr) : "Ir" (nr));
}

+static __always_inline void __set_bit(long nr, volatile unsigned long *addr)
+{
+ asm volatile(__ASM_SIZE(bts) " %1,%0" : : "m" (*(volatile long *) addr),
+ "Ir" (nr) : "memory");
+}
+
+static __always_inline void __clear_bit(long nr, volatile unsigned long *addr)
+{
+ asm volatile(__ASM_SIZE(btr) " %1,%0" : : "m" (*(volatile long *) addr),
+ "Ir" (nr) : "memory");
+}
+
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+static __always_inline unsigned long __ffs(unsigned long word)
+{
+ asm("rep; bsf %1,%0"
+ : "=r" (word)
+ : "rm" (word));
+ return word;
+}
+
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+static __always_inline unsigned long ffz(unsigned long word)
+{
+ asm("rep; bsf %1,%0"
+ : "=r" (word)
+ : "r" (~word));
+ return word;
+}
+
#endif /* BOOT_BITOPS_H */
diff --git a/arch/x86/boot/compressed/align.h b/arch/x86/boot/compressed/align.h
new file mode 100644
index 000000000000..7ccabbc5d1b8
--- /dev/null
+++ b/arch/x86/boot/compressed/align.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_ALIGN_H
+#define BOOT_ALIGN_H
+#define _LINUX_ALIGN_H /* Inhibit inclusion of <linux/align.h> */
+
+/* @a is a power of 2 value */
+#define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
+#define ALIGN_DOWN(x, a) __ALIGN_KERNEL((x) - ((a) - 1), (a))
+#define __ALIGN_MASK(x, mask) __ALIGN_KERNEL_MASK((x), (mask))
+#define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a)))
+#define PTR_ALIGN_DOWN(p, a) ((typeof(p))ALIGN_DOWN((unsigned long)(p), (a)))
+#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
+
+#endif
diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
new file mode 100644
index 000000000000..789ecadeb521
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "bitmap.h"
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+ unsigned long *p = map + BIT_WORD(start);
+ const unsigned int size = start + len;
+ int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+ unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+ while (len - bits_to_set >= 0) {
+ *p |= mask_to_set;
+ len -= bits_to_set;
+ bits_to_set = BITS_PER_LONG;
+ mask_to_set = ~0UL;
+ p++;
+ }
+ if (len) {
+ mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+ *p |= mask_to_set;
+ }
+}
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+ unsigned long *p = map + BIT_WORD(start);
+ const unsigned int size = start + len;
+ int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+ unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+ while (len - bits_to_clear >= 0) {
+ *p &= ~mask_to_clear;
+ len -= bits_to_clear;
+ bits_to_clear = BITS_PER_LONG;
+ mask_to_clear = ~0UL;
+ p++;
+ }
+ if (len) {
+ mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+ *p &= ~mask_to_clear;
+ }
+}
diff --git a/arch/x86/boot/compressed/bitmap.h b/arch/x86/boot/compressed/bitmap.h
new file mode 100644
index 000000000000..35357f5feda2
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_BITMAP_H
+#define BOOT_BITMAP_H
+#define __LINUX_BITMAP_H /* Inhibit inclusion of <linux/bitmap.h> */
+
+#include "../bitops.h"
+#include "../string.h"
+#include "align.h"
+
+#define BITMAP_MEM_ALIGNMENT 8
+#define BITMAP_MEM_MASK (BITMAP_MEM_ALIGNMENT - 1)
+
+#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
+#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
+
+#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len);
+void __bitmap_clear(unsigned long *map, unsigned int start, int len);
+
+static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
+ unsigned int nbits)
+{
+ if (__builtin_constant_p(nbits) && nbits == 1)
+ __set_bit(start, map);
+ else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+ IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+ __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+ IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
+ memset((char *)map + start / 8, 0xff, nbits / 8);
+ else
+ __bitmap_set(map, start, nbits);
+}
+
+static __always_inline void bitmap_clear(unsigned long *map, unsigned int start,
+ unsigned int nbits)
+{
+ if (__builtin_constant_p(nbits) && nbits == 1)
+ __clear_bit(start, map);
+ else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+ IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+ __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+ IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
+ memset((char *)map + start / 8, 0, nbits / 8);
+ else
+ __bitmap_clear(map, start, nbits);
+}
+
+#endif
diff --git a/arch/x86/boot/compressed/bits.h b/arch/x86/boot/compressed/bits.h
new file mode 100644
index 000000000000..b0ffa007ee19
--- /dev/null
+++ b/arch/x86/boot/compressed/bits.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_BITS_H
+#define BOOT_BITS_H
+#define __LINUX_BITS_H /* Inhibit inclusion of <linux/bits.h> */
+
+#ifdef __ASSEMBLY__
+#define _AC(X,Y) X
+#define _AT(T,X) X
+#else
+#define __AC(X,Y) (X##Y)
+#define _AC(X,Y) __AC(X,Y)
+#define _AT(T,X) ((T)(X))
+#endif
+
+#define _UL(x) (_AC(x, UL))
+#define _ULL(x) (_AC(x, ULL))
+#define UL(x) (_UL(x))
+#define ULL(x) (_ULL(x))
+
+#define BIT(nr) (UL(1) << (nr))
+#define BIT_ULL(nr) (ULL(1) << (nr))
+#define BIT_MASK(nr) (UL(1) << ((nr) % BITS_PER_LONG))
+#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
+#define BIT_ULL_MASK(nr) (ULL(1) << ((nr) % BITS_PER_LONG_LONG))
+#define BIT_ULL_WORD(nr) ((nr) / BITS_PER_LONG_LONG)
+#define BITS_PER_BYTE 8
+
+#define GENMASK(h, l) \
+ (((~UL(0)) - (UL(1) << (l)) + 1) & \
+ (~UL(0) >> (BITS_PER_LONG - 1 - (h))))
+
+#define GENMASK_ULL(h, l) \
+ (((~ULL(0)) - (ULL(1) << (l)) + 1) & \
+ (~ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h))))
+
+#endif
diff --git a/arch/x86/boot/compressed/find.c b/arch/x86/boot/compressed/find.c
new file mode 100644
index 000000000000..b97a9e7c8085
--- /dev/null
+++ b/arch/x86/boot/compressed/find.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "bitmap.h"
+#include "find.h"
+#include "math.h"
+#include "minmax.h"
+
+static __always_inline unsigned long swab(const unsigned long y)
+{
+#if __BITS_PER_LONG == 64
+ return __builtin_bswap64(y);
+#else /* __BITS_PER_LONG == 32 */
+ return __builtin_bswap32(y);
+#endif
+}
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long nbits,
+ unsigned long start, unsigned long invert, unsigned long le)
+{
+ unsigned long tmp, mask;
+
+ if (start >= nbits)
+ return nbits;
+
+ tmp = addr1[start / BITS_PER_LONG];
+ if (addr2)
+ tmp &= addr2[start / BITS_PER_LONG];
+ tmp ^= invert;
+
+ /* Handle 1st word. */
+ mask = BITMAP_FIRST_WORD_MASK(start);
+ if (le)
+ mask = swab(mask);
+
+ tmp &= mask;
+
+ start = round_down(start, BITS_PER_LONG);
+
+ while (!tmp) {
+ start += BITS_PER_LONG;
+ if (start >= nbits)
+ return nbits;
+
+ tmp = addr1[start / BITS_PER_LONG];
+ if (addr2)
+ tmp &= addr2[start / BITS_PER_LONG];
+ tmp ^= invert;
+ }
+
+ if (le)
+ tmp = swab(tmp);
+
+ return min(start + __ffs(tmp), nbits);
+}
diff --git a/arch/x86/boot/compressed/find.h b/arch/x86/boot/compressed/find.h
new file mode 100644
index 000000000000..903574b9d57a
--- /dev/null
+++ b/arch/x86/boot/compressed/find.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_FIND_H
+#define BOOT_FIND_H
+#define __LINUX_FIND_H /* Inhibit inclusion of <linux/find.h> */
+
+#include "../bitops.h"
+#include "align.h"
+#include "bits.h"
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long nbits,
+ unsigned long start, unsigned long invert, unsigned long le);
+
+/**
+ * find_next_bit - find the next set bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
+ unsigned long offset)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val;
+
+ if (offset >= size)
+ return size;
+
+ val = *addr & GENMASK(size - 1, offset);
+ return val ? __ffs(val) : size;
+ }
+
+ return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+}
+
+/**
+ * find_next_zero_bit - find the next cleared bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number of the next zero bit
+ * If no bits are zero, returns @size.
+ */
+static inline
+unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
+ unsigned long offset)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val;
+
+ if (offset >= size)
+ return size;
+
+ val = *addr | ~GENMASK(size - 1, offset);
+ return val == ~0UL ? size : ffz(val);
+ }
+
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+}
+
+/**
+ * for_each_set_bitrange_from - iterate over all set bit ranges [b; e)
+ * @b: bit offset of start of current bitrange (first set bit); must be initialized
+ * @e: bit offset of end of current bitrange (first unset bit)
+ * @addr: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_set_bitrange_from(b, e, addr, size) \
+ for ((b) = find_next_bit((addr), (size), (b)), \
+ (e) = find_next_zero_bit((addr), (size), (b) + 1); \
+ (b) < (size); \
+ (b) = find_next_bit((addr), (size), (e) + 1), \
+ (e) = find_next_zero_bit((addr), (size), (b) + 1))
+#endif
diff --git a/arch/x86/boot/compressed/math.h b/arch/x86/boot/compressed/math.h
new file mode 100644
index 000000000000..f7eede84bbc2
--- /dev/null
+++ b/arch/x86/boot/compressed/math.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_MATH_H
+#define BOOT_MATH_H
+#define __LINUX_MATH_H /* Inhibit inclusion of <linux/math.h> */
+
+/*
+ *
+ * This looks more complex than it should be. But we need to
+ * get the type for the ~ right in round_down (it needs to be
+ * as wide as the result!), and we want to evaluate the macro
+ * arguments just once each.
+ */
+#define __round_mask(x, y) ((__typeof__(x))((y)-1))
+
+/**
+ * round_up - round up to next specified power of 2
+ * @x: the value to round
+ * @y: multiple to round up to (must be a power of 2)
+ *
+ * Rounds @x up to next multiple of @y (which must be a power of 2).
+ * To perform arbitrary rounding up, use roundup() below.
+ */
+#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
+
+/**
+ * round_down - round down to next specified power of 2
+ * @x: the value to round
+ * @y: multiple to round down to (must be a power of 2)
+ *
+ * Rounds @x down to next multiple of @y (which must be a power of 2).
+ * To perform arbitrary rounding down, use rounddown() below.
+ */
+#define round_down(x, y) ((x) & ~__round_mask(x, y))
+
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+
+#endif
diff --git a/arch/x86/boot/compressed/minmax.h b/arch/x86/boot/compressed/minmax.h
new file mode 100644
index 000000000000..4efd05673260
--- /dev/null
+++ b/arch/x86/boot/compressed/minmax.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_MINMAX_H
+#define BOOT_MINMAX_H
+#define __LINUX_MINMAX_H /* Inhibit inclusion of <linux/minmax.h> */
+
+/*
+ * This returns a constant expression while determining if an argument is
+ * a constant expression, most importantly without evaluating the argument.
+ * Glory to Martin Uecker <[email protected]>
+ */
+#define __is_constexpr(x) \
+ (sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))
+
+/*
+ * min()/max()/clamp() macros must accomplish three things:
+ *
+ * - avoid multiple evaluations of the arguments (so side-effects like
+ * "x++" happen only once) when non-constant.
+ * - perform strict type-checking (to generate warnings instead of
+ * nasty runtime surprises). See the "unnecessary" pointer comparison
+ * in __typecheck().
+ * - retain result as a constant expressions when called with only
+ * constant expressions (to avoid tripping VLA warnings in stack
+ * allocation usage).
+ */
+#define __typecheck(x, y) \
+ (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
+
+#define __no_side_effects(x, y) \
+ (__is_constexpr(x) && __is_constexpr(y))
+
+#define __safe_cmp(x, y) \
+ (__typecheck(x, y) && __no_side_effects(x, y))
+
+#define __cmp(x, y, op) ((x) op (y) ? (x) : (y))
+
+#define __cmp_once(x, y, unique_x, unique_y, op) ({ \
+ typeof(x) unique_x = (x); \
+ typeof(y) unique_y = (y); \
+ __cmp(unique_x, unique_y, op); })
+
+#define __careful_cmp(x, y, op) \
+ __builtin_choose_expr(__safe_cmp(x, y), \
+ __cmp(x, y, op), \
+ __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
+
+/**
+ * min - return minimum of two values of the same or compatible types
+ * @x: first value
+ * @y: second value
+ */
+#define min(x, y) __careful_cmp(x, y, <)
+
+/**
+ * max - return maximum of two values of the same or compatible types
+ * @x: first value
+ * @y: second value
+ */
+#define max(x, y) __careful_cmp(x, y, >)
+
+#endif
diff --git a/arch/x86/boot/compressed/pgtable_types.h b/arch/x86/boot/compressed/pgtable_types.h
new file mode 100644
index 000000000000..8f1d87a69efc
--- /dev/null
+++ b/arch/x86/boot/compressed/pgtable_types.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_COMPRESSED_PGTABLE_TYPES_H
+#define BOOT_COMPRESSED_PGTABLE_TYPES_H
+#define _ASM_X86_PGTABLE_DEFS_H /* Inhibit inclusion of <asm/pgtable_types.h> */
+
+#define PAGE_SHIFT 12
+
+#ifdef CONFIG_X86_64
+#define PTE_SHIFT 9
+#elif defined CONFIG_X86_PAE
+#define PTE_SHIFT 9
+#else /* 2-level */
+#define PTE_SHIFT 10
+#endif
+
+enum pg_level {
+ PG_LEVEL_NONE,
+ PG_LEVEL_4K,
+ PG_LEVEL_2M,
+ PG_LEVEL_1G,
+ PG_LEVEL_512G,
+ PG_LEVEL_NUM
+};
+
+#endif
--
2.39.2

2023-03-30 11:53:38

by Kirill A. Shutemov

Subject: [PATCHv9 11/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory

load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.

But, this approach does not work for unaccepted memory. For TDX, a load
from unaccepted memory will not lead to a recoverable exception within
the guest. The guest will exit to the VMM where the only recourse is to
terminate the guest.

There are three parts to fix this issue and comprehensively avoid access
to unaccepted memory. Together these ensure that an extra "guard" page
is accepted in addition to the memory that needs to be used.

1. Implicitly extend the range_contains_unaccepted_memory(start, end)
   checks up to end+2M if 'end' is aligned on a 2M boundary. It may
   require checking a 2M chunk beyond the end of RAM. The bitmap
   allocation is modified to accommodate this.
2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
aligned on a 2M boundary.
3. Set PageUnaccepted() on both memory that itself needs to be accepted
*and* memory where the next page needs to be accepted. Essentially,
make PageUnaccepted(page) a marker for whether work needs to be done
to make 'page' usable. That work might include accepting pages in
addition to 'page' itself.
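Parts 1 and 2 boil down to one rule, which can be sketched as a small,
hypothetical helper (the function name is illustrative, not the patch's
code; PMD_SIZE stands in for the kernel's 2M constant):

```c
#include <assert.h>

#define PMD_SIZE (2UL << 20)	/* 2M */

/*
 * Hypothetical sketch of the rule from parts 1 and 2: when 'end' falls
 * on a 2M boundary, a load_unaligned_zeropad() from the last page below
 * 'end' may stray into the next 2M chunk, so the checked/accepted range
 * is implicitly extended by one chunk. Unaligned 'end' is left alone.
 */
static unsigned long extend_for_zeropad(unsigned long end)
{
	if (!(end % PMD_SIZE))
		end += PMD_SIZE;
	return end;
}
```

This mirrors the `if (!(end % PMD_SIZE)) end += PMD_SIZE;` statements
added in accept_memory() and range_contains_unaccepted_memory() in the
diff below.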

Side note: This leads to something strange. Pages which were accepted
at boot, marked by the firmware as accepted and will never
_need_ to be accepted might have PageUnaccepted() set on
them. PageUnaccepted(page) is a cue to ensure that the next
page is accepted before 'page' can be used.

This is an actual, real-world problem which was discovered during TDX
testing.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
---
arch/x86/mm/unaccepted_memory.c | 39 +++++++++++++++++++++++++
drivers/firmware/efi/libstub/x86-stub.c | 7 +++++
2 files changed, 46 insertions(+)

diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index 1df918b21469..a0a58486eb74 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -23,6 +23,38 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
bitmap = __va(boot_params.unaccepted_memory);
range_start = start / PMD_SIZE;

+ /*
+ * load_unaligned_zeropad() can lead to unwanted loads across page
+ * boundaries. The unwanted loads are typically harmless. But, they
+ * might be made to totally unrelated or even unmapped memory.
+ * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
+ * #VE) to recover from these unwanted loads.
+ *
+ * But, this approach does not work for unaccepted memory. For TDX, a
+ * load from unaccepted memory will not lead to a recoverable exception
+ * within the guest. The guest will exit to the VMM where the only
+ * recourse is to terminate the guest.
+ *
+ * There are three parts to fix this issue and comprehensively avoid
+ * access to unaccepted memory. Together these ensure that an extra
+ * "guard" page is accepted in addition to the memory that needs to be
+ * used:
+ *
+ * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
+ * checks up to end+2M if 'end' is aligned on a 2M boundary.
+ *
+ * 2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
+ * aligned on a 2M boundary. (immediately following this comment)
+ *
+ * 3. Set PageUnaccepted() on both memory that itself needs to be
+ * accepted *and* memory where the next page needs to be accepted.
+ * Essentially, make PageUnaccepted(page) a marker for whether work
+ * needs to be done to make 'page' usable. That work might include
+ * accepting pages in addition to 'page' itself.
+ */
+ if (!(end % PMD_SIZE))
+ end += PMD_SIZE;
+
spin_lock_irqsave(&unaccepted_memory_lock, flags);
for_each_set_bitrange_from(range_start, range_end, bitmap,
DIV_ROUND_UP(end, PMD_SIZE)) {
@@ -46,6 +78,13 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)

bitmap = __va(boot_params.unaccepted_memory);

+ /*
+ * Also consider the unaccepted state of the *next* page. See fix #1 in
+ * the comment on load_unaligned_zeropad() in accept_memory().
+ */
+ if (!(end % PMD_SIZE))
+ end += PMD_SIZE;
+
spin_lock_irqsave(&unaccepted_memory_lock, flags);
while (start < end) {
if (test_bit(start / PMD_SIZE, bitmap)) {
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 1643ddbde249..1afe7b5b02e1 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -715,6 +715,13 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
return EFI_SUCCESS;
}

+ /*
+ * range_contains_unaccepted_memory() may need to check one 2M chunk
+ * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
+ * sure that the bitmap is large enough to handle it.
+ */
+ max_addr += PMD_SIZE;
+
/*
* If unaccepted memory is present, allocate a bitmap to track what
* memory has to be accepted before access.
--
2.39.2

2023-03-30 11:53:40

by Kirill A. Shutemov

Subject: [PATCHv9 07/14] efi/x86: Implement support for unaccepted memory

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Acceptance happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed: it lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via the memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. This
leads to table fragmentation, and there is a limited number of entries
in the e820 table.

Another option is to mark such memory as usable in e820 and track
whether the range has been accepted in a bitmap. One bit in the bitmap
represents 2MiB in the address space: one 4k page is enough to track
64GiB of physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.

Any unaccepted memory that is not aligned to 2M gets accepted upfront.
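The sizing arithmetic above can be checked with a small sketch
(hypothetical helper; the constants follow the text, with one bitmap
bit covering 2MiB):

```c
#include <assert.h>

#define PMD_SIZE	(2ULL << 20)	/* one bitmap bit covers 2MiB */
#define BITS_PER_BYTE	8

/*
 * Bytes of bitmap needed to cover max_addr of physical address space,
 * rounded up. One byte of bitmap covers 8 * 2MiB = 16MiB.
 */
static unsigned long long bitmap_bytes(unsigned long long max_addr)
{
	unsigned long long per_byte = PMD_SIZE * BITS_PER_BYTE;

	return (max_addr + per_byte - 1) / per_byte;
}
```

A 64GiB address space needs 4096 bytes (one 4k page) of bitmap, and
4PiB needs 256MiB, matching the worst case above.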

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via boot_params. allocate_e820() allocates the bitmap if
unaccepted memory is present, according to the maximum address in the
memory map.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/x86/zero-page.rst | 1 +
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/mem.c | 73 ++++++++++++++++++++++++
arch/x86/include/asm/unaccepted_memory.h | 10 ++++
arch/x86/include/uapi/asm/bootparam.h | 2 +-
drivers/firmware/efi/Kconfig | 14 +++++
drivers/firmware/efi/efi.c | 1 +
drivers/firmware/efi/libstub/x86-stub.c | 65 +++++++++++++++++++++
include/linux/efi.h | 3 +-
9 files changed, 168 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/boot/compressed/mem.c
create mode 100644 arch/x86/include/asm/unaccepted_memory.h

diff --git a/Documentation/x86/zero-page.rst b/Documentation/x86/zero-page.rst
index 45aa9cceb4f1..f21905e61ade 100644
--- a/Documentation/x86/zero-page.rst
+++ b/Documentation/x86/zero-page.rst
@@ -20,6 +20,7 @@ Offset/Size Proto Name Meaning
060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
(struct ist_info)
070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table
+078/008 ALL unaccepted_memory Bitmap of unaccepted memory (1bit == 2M)
080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
0A0/010 ALL sys_desc_table System description table (struct sys_desc_table),
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b6cfe607bdb..f62c02348f9a 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,6 +107,7 @@ endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o

vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
new file mode 100644
index 000000000000..6b15a0ed8b54
--- /dev/null
+++ b/arch/x86/boot/compressed/mem.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../cpuflags.h"
+#include "bitmap.h"
+#include "error.h"
+#include "math.h"
+
+#define PMD_SHIFT 21
+#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
+#define PMD_MASK (~(PMD_SIZE - 1))
+
+static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ /* Platform-specific memory-acceptance call goes here */
+ error("Cannot accept memory");
+}
+
+/*
+ * The accepted memory bitmap only works at PMD_SIZE granularity. Take
+ * unaligned start/end addresses and either:
+ * 1. Accepts the memory immediately and in its entirety
+ * 2. Accepts unaligned parts, and marks *some* aligned part unaccepted
+ *
+ * The function will never reach the bitmap_set() with zero bits to set.
+ */
+void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
+{
+ /*
+ * Ensure that at least one bit will be set in the bitmap by
+ * immediately accepting all regions under 2*PMD_SIZE. This is
+ * imprecise and may immediately accept some areas that could
+ * have been represented in the bitmap. But, it results in simpler
+ * code below.
+ *
+ * Consider case like this:
+ *
+ * | 4k | 2044k | 2048k |
+ * ^ 0x0 ^ 2MB ^ 4MB
+ *
+ * Only the first 4k has been accepted. The 0MB->2MB region can not be
+ * represented in the bitmap. The 2MB->4MB region can be represented in
+ * the bitmap. But, the 0MB->4MB region is <2*PMD_SIZE and will be
+ * immediately accepted in its entirety.
+ */
+ if (end - start < 2 * PMD_SIZE) {
+ __accept_memory(start, end);
+ return;
+ }
+
+ /*
+ * No matter how the start and end are aligned, at least one unaccepted
+ * PMD_SIZE area will remain to be marked in the bitmap.
+ */
+
+ /* Immediately accept a <PMD_SIZE piece at the start: */
+ if (start & ~PMD_MASK) {
+ __accept_memory(start, round_up(start, PMD_SIZE));
+ start = round_up(start, PMD_SIZE);
+ }
+
+ /* Immediately accept a <PMD_SIZE piece at the end: */
+ if (end & ~PMD_MASK) {
+ __accept_memory(round_down(end, PMD_SIZE), end);
+ end = round_down(end, PMD_SIZE);
+ }
+
+ /*
+ * 'start' and 'end' are now both PMD-aligned.
+ * Record the range as being unaccepted:
+ */
+ bitmap_set((unsigned long *)params->unaccepted_memory,
+ start / PMD_SIZE, (end - start) / PMD_SIZE);
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644
index 000000000000..df0736d32858
--- /dev/null
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+struct boot_params;
+
+void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
+
+#endif
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..630a54046af0 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -189,7 +189,7 @@ struct boot_params {
__u64 tboot_addr; /* 0x058 */
struct ist_info ist_info; /* 0x060 */
__u64 acpi_rsdp_addr; /* 0x070 */
- __u8 _pad3[8]; /* 0x078 */
+ __u64 unaccepted_memory; /* 0x078 */
__u8 hd0_info[16]; /* obsolete! */ /* 0x080 */
__u8 hd1_info[16]; /* obsolete! */ /* 0x090 */
struct sys_desc_table sys_desc_table; /* obsolete! */ /* 0x0a0 */
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 043ca31c114e..231f1c70d1db 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -269,6 +269,20 @@ config EFI_COCO_SECRET
virt/coco/efi_secret module to access the secrets, which in turn
allows userspace programs to access the injected secrets.

+config UNACCEPTED_MEMORY
+ bool
+ depends on EFI_STUB
+ help
+ Some Virtual Machine platforms, such as Intel TDX, require
+ some memory to be "accepted" by the guest before it can be used.
+ This mechanism helps prevent malicious hosts from making changes
+ to guest memory.
+
+ UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
+
+ This option adds support for unaccepted memory and makes such memory
+ usable by the kernel.
+
config EFI_EMBEDDED_FIRMWARE
bool
select CRYPTO_LIB_SHA256
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index abeff7dc0b58..7dce06e419c5 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -843,6 +843,7 @@ static __initdata char memory_type_name[][13] = {
"MMIO Port",
"PAL Code",
"Persistent",
+ "Unaccepted",
};

char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index fff81843169c..1643ddbde249 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -15,6 +15,7 @@
#include <asm/setup.h>
#include <asm/desc.h>
#include <asm/boot.h>
+#include <asm/unaccepted_memory.h>

#include "efistub.h"

@@ -613,6 +614,16 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
e820_type = E820_TYPE_PMEM;
break;

+ case EFI_UNACCEPTED_MEMORY:
+ if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
+ efi_warn_once(
+"The system has unaccepted memory, but kernel does not support it\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
+ continue;
+ }
+ e820_type = E820_TYPE_RAM;
+ process_unaccepted_memory(params, d->phys_addr,
+ d->phys_addr + PAGE_SIZE * d->num_pages);
+ break;
default:
continue;
}
@@ -677,6 +688,57 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
return status;
}

+static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
+ __u32 nr_desc,
+ struct efi_boot_memmap *map)
+{
+ unsigned long *mem = NULL;
+ u64 size, max_addr = 0;
+ efi_status_t status;
+ bool found = false;
+ int i;
+
+ /* Check if there's any unaccepted memory and find the max address */
+ for (i = 0; i < nr_desc; i++) {
+ efi_memory_desc_t *d;
+ unsigned long m = (unsigned long)map->map;
+
+ d = efi_early_memdesc_ptr(m, map->desc_size, i);
+ if (d->type == EFI_UNACCEPTED_MEMORY)
+ found = true;
+ if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
+ max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
+ }
+
+ if (!found) {
+ params->unaccepted_memory = 0;
+ return EFI_SUCCESS;
+ }
+
+ /*
+ * If unaccepted memory is present, allocate a bitmap to track what
+ * memory has to be accepted before access.
+ *
+ * One bit in the bitmap represents 2MiB in the address space:
+ * A 4k bitmap can track 64GiB of physical address space.
+ *
+ * In the worst case scenario -- a huge hole in the middle of the
+ * address space -- it needs 256MiB to handle 4PiB of the address
+ * space.
+ *
+ * The bitmap will be populated in setup_e820() according to the memory
+ * map after efi_exit_boot_services().
+ */
+ size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
+ status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
+ if (status == EFI_SUCCESS) {
+ memset(mem, 0, size);
+ params->unaccepted_memory = (unsigned long)mem;
+ }
+
+ return status;
+}
+
static efi_status_t allocate_e820(struct boot_params *params,
struct setup_data **e820ext,
u32 *e820ext_size)
@@ -697,6 +759,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
}

+ if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
+ status = allocate_unaccepted_bitmap(params, nr_desc, map);
+
efi_bs_call(free_pool, map);
return status;
}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 04a733f0ba95..1d4f0343c710 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef struct {
#define EFI_MEMORY_MAPPED_IO_PORT_SPACE 12
#define EFI_PAL_CODE 13
#define EFI_PERSISTENT_MEMORY 14
-#define EFI_MAX_MEMORY_TYPE 15
+#define EFI_UNACCEPTED_MEMORY 15
+#define EFI_MAX_MEMORY_TYPE 16

/* Attribute values: */
#define EFI_MEMORY_UC ((u64)0x0000000000000001ULL) /* uncached */
--
2.39.2

2023-03-30 11:53:42

by Kirill A. Shutemov

Subject: [PATCHv9 14/14] x86/tdx: Add unaccepted memory support

Hook up TDX-specific code to accept memory.

Accepting the memory is the same process as converting memory from
shared to private: the kernel notifies the VMM with the MAP_GPA
hypercall and then accepts the pages with the ACCEPT_PAGE module call.

The implementation in the core kernel uses tdx_enc_status_changed().
It is already used for converting memory to shared and back for I/O
transactions.

The boot stub provides its own implementation of tdx_accept_memory().
It is similar in structure to tdx_enc_status_changed(), but only cares
about converting memory to private.
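The shape of the acceptance loop can be sketched in miniature
(hypothetical stand-ins: the size check replaces the real
TDX_ACCEPT_PAGE module call, which can of course fail for other
reasons; the real try_accept_one() is in the diff below):

```c
#include <assert.h>

#define SZ_4K	0x1000ULL
#define SZ_2M	0x200000ULL
#define SZ_1G	0x40000000ULL

/*
 * Hypothetical model of one acceptance attempt: succeed with the
 * largest page size that is both aligned at 'start' and fits in the
 * remaining length. Larger sizes are tried first so the VMM can keep
 * 1G/2M Secure EPT entries and fewer calls are needed.
 */
static unsigned long long model_try_accept_one(unsigned long long start,
					       unsigned long long len)
{
	static const unsigned long long sizes[] = { SZ_1G, SZ_2M, SZ_4K };
	int i;

	for (i = 0; i < 3; i++) {
		if (!(start % sizes[i]) && len >= sizes[i])
			return sizes[i];
	}
	return 0;
}

/* Count how many acceptance calls covering [start, end) takes */
static int model_accept_calls(unsigned long long start, unsigned long long end)
{
	int calls = 0;

	while (start < end) {
		unsigned long long sz = model_try_accept_one(start, end - start);

		if (!sz)
			return -1;
		start += sz;
		calls++;
	}
	return calls;
}
```

For example, a 1G-aligned gigabyte costs a single call, while the same
range starting at 2M degrades to 511 separate 2M calls.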

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 2 +
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/boot/compressed/error.c | 19 ++++++
arch/x86/boot/compressed/error.h | 1 +
arch/x86/boot/compressed/mem.c | 33 +++++++++-
arch/x86/boot/compressed/tdx-shared.c | 2 +
arch/x86/boot/compressed/tdx.c | 39 +++++++++++
arch/x86/coco/tdx/Makefile | 2 +-
arch/x86/coco/tdx/tdx-shared.c | 95 +++++++++++++++++++++++++++
arch/x86/coco/tdx/tdx.c | 86 +-----------------------
arch/x86/include/asm/shared/tdx.h | 2 +
arch/x86/include/asm/tdx.h | 2 +
arch/x86/mm/unaccepted_memory.c | 9 ++-
13 files changed, 206 insertions(+), 88 deletions(-)
create mode 100644 arch/x86/boot/compressed/tdx-shared.c
create mode 100644 arch/x86/coco/tdx/tdx-shared.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index df21fba77db1..448cd869f0bd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -884,9 +884,11 @@ config INTEL_TDX_GUEST
bool "Intel TDX (Trust Domain Extensions) - Guest Support"
depends on X86_64 && CPU_SUP_INTEL
depends on X86_X2APIC
+ depends on EFI_STUB
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
select X86_MCE
+ select UNACCEPTED_MEMORY
help
Support running as a guest under Intel TDX. Without this support,
the guest kernel can not boot or run under TDX.
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 74f7adee46ad..71d9f71c13eb 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -106,7 +106,7 @@ ifdef CONFIG_X86_64
endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
-vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdx-shared.o $(obj)/tdcall.o
vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/find.o $(obj)/mem.o

vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
diff --git a/arch/x86/boot/compressed/error.c b/arch/x86/boot/compressed/error.c
index c881878e56d3..5313c5cb2b80 100644
--- a/arch/x86/boot/compressed/error.c
+++ b/arch/x86/boot/compressed/error.c
@@ -22,3 +22,22 @@ void error(char *m)
while (1)
asm("hlt");
}
+
+/* EFI libstub provides vsnprintf() */
+#ifdef CONFIG_EFI_STUB
+void panic(const char *fmt, ...)
+{
+ static char buf[1024];
+ va_list args;
+ int len;
+
+ va_start(args, fmt);
+ len = vsnprintf(buf, sizeof(buf), fmt, args);
+ va_end(args);
+
+ if (len && buf[len - 1] == '\n')
+ buf[len - 1] = '\0';
+
+ error(buf);
+}
+#endif
diff --git a/arch/x86/boot/compressed/error.h b/arch/x86/boot/compressed/error.h
index 1de5821184f1..86fe33b93715 100644
--- a/arch/x86/boot/compressed/error.h
+++ b/arch/x86/boot/compressed/error.h
@@ -6,5 +6,6 @@

void warn(char *m);
void error(char *m) __noreturn;
+void panic(const char *fmt, ...) __noreturn __cold;

#endif /* BOOT_COMPRESSED_ERROR_H */
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index de858a5180b6..e6b92e822ddd 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -5,6 +5,8 @@
#include "error.h"
#include "find.h"
#include "math.h"
+#include "tdx.h"
+#include <asm/shared/tdx.h>

#define PMD_SHIFT 21
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
@@ -12,10 +14,39 @@

extern struct boot_params *boot_params;

+/*
+ * accept_memory() and process_unaccepted_memory() are called from the
+ * EFI stub, which runs before the decompressor and its early_tdx_detect().
+ *
+ * Enumerate TDX directly for these early users.
+ */
+static bool early_is_tdx_guest(void)
+{
+ static bool once;
+ static bool is_tdx;
+
+ if (!IS_ENABLED(CONFIG_INTEL_TDX_GUEST))
+ return false;
+
+ if (!once) {
+ u32 eax, sig[3];
+
+ cpuid_count(TDX_CPUID_LEAF_ID, 0, &eax,
+ &sig[0], &sig[2], &sig[1]);
+ is_tdx = !memcmp(TDX_IDENT, sig, sizeof(sig));
+ once = true;
+ }
+
+ return is_tdx;
+}
+
static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
{
/* Platform-specific memory-acceptance call goes here */
- error("Cannot accept memory");
+ if (early_is_tdx_guest())
+ tdx_accept_memory(start, end);
+ else
+ error("Cannot accept memory: unknown platform\n");
}

/*
diff --git a/arch/x86/boot/compressed/tdx-shared.c b/arch/x86/boot/compressed/tdx-shared.c
new file mode 100644
index 000000000000..5ac43762fe13
--- /dev/null
+++ b/arch/x86/boot/compressed/tdx-shared.c
@@ -0,0 +1,2 @@
+#include "error.h"
+#include "../../coco/tdx/tdx-shared.c"
diff --git a/arch/x86/boot/compressed/tdx.c b/arch/x86/boot/compressed/tdx.c
index 918a7606f53c..de1d4a87418d 100644
--- a/arch/x86/boot/compressed/tdx.c
+++ b/arch/x86/boot/compressed/tdx.c
@@ -3,12 +3,17 @@
#include "../cpuflags.h"
#include "../string.h"
#include "../io.h"
+#include "align.h"
#include "error.h"
+#include "pgtable_types.h"

#include <vdso/limits.h>
#include <uapi/asm/vmx.h>

#include <asm/shared/tdx.h>
+#include <asm/page_types.h>
+
+static u64 cc_mask;

/* Called from __tdx_hypercall() for unrecoverable failure */
void __tdx_hypercall_failed(void)
@@ -16,6 +21,38 @@ void __tdx_hypercall_failed(void)
error("TDVMCALL failed. TDX module bug?");
}

+static u64 get_cc_mask(void)
+{
+ struct tdx_module_output out;
+ unsigned int gpa_width;
+
+ /*
+ * TDINFO TDX module call is used to get the TD execution environment
+ * information like GPA width, number of available vcpus, debug mode
+ * information, etc. More details about the ABI can be found in TDX
+ * Guest-Host-Communication Interface (GHCI), section 2.4.2 TDCALL
+ * [TDG.VP.INFO].
+ *
+ * The GPA width that comes out of this call is critical. TDX guests
+ * can not meaningfully run without it.
+ */
+ if (__tdx_module_call(TDX_GET_INFO, 0, 0, 0, 0, &out))
+ error("TDCALL GET_INFO failed (Buggy TDX module!)\n");
+
+ gpa_width = out.rcx & GENMASK(5, 0);
+
+ /*
+ * The highest bit of a guest physical address is the "sharing" bit.
+ * Set it for shared pages and clear it for private pages.
+ */
+ return BIT_ULL(gpa_width - 1);
+}
+
+u64 cc_mkdec(u64 val)
+{
+ return val & ~cc_mask;
+}
+
static inline unsigned int tdx_io_in(int size, u16 port)
{
struct tdx_hypercall_args args = {
@@ -70,6 +107,8 @@ void early_tdx_detect(void)
if (memcmp(TDX_IDENT, sig, sizeof(sig)))
return;

+ cc_mask = get_cc_mask();
+
/* Use hypercalls instead of I/O instructions */
pio_ops.f_inb = tdx_inb;
pio_ops.f_outb = tdx_outb;
diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
index 46c55998557d..2c7dcbf1458b 100644
--- a/arch/x86/coco/tdx/Makefile
+++ b/arch/x86/coco/tdx/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0

-obj-y += tdx.o tdcall.o
+obj-y += tdx.o tdx-shared.o tdcall.o
diff --git a/arch/x86/coco/tdx/tdx-shared.c b/arch/x86/coco/tdx/tdx-shared.c
new file mode 100644
index 000000000000..ee74f7bbe806
--- /dev/null
+++ b/arch/x86/coco/tdx/tdx-shared.c
@@ -0,0 +1,95 @@
+#include <asm/tdx.h>
+#include <asm/pgtable.h>
+
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+ enum pg_level pg_level)
+{
+ unsigned long accept_size = page_level_size(pg_level);
+ u64 tdcall_rcx;
+ u8 page_size;
+
+ if (!IS_ALIGNED(start, accept_size))
+ return 0;
+
+ if (len < accept_size)
+ return 0;
+
+ /*
+ * Pass the page physical address to the TDX module to accept the
+ * pending, private page.
+ *
+ * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
+ */
+ switch (pg_level) {
+ case PG_LEVEL_4K:
+ page_size = 0;
+ break;
+ case PG_LEVEL_2M:
+ page_size = 1;
+ break;
+ case PG_LEVEL_1G:
+ page_size = 2;
+ break;
+ default:
+ return 0;
+ }
+
+ tdcall_rcx = start | page_size;
+ if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
+ return 0;
+
+ return accept_size;
+}
+
+bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end, bool enc)
+{
+ if (!enc) {
+ /* Set the shared (decrypted) bits: */
+ start |= cc_mkdec(0);
+ end |= cc_mkdec(0);
+ }
+
+ /*
+ * Notify the VMM about page mapping conversion. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface (GHCI),
+ * section "TDG.VP.VMCALL<MapGPA>"
+ */
+ if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+ return false;
+
+ /* private->shared conversion requires only MapGPA call */
+ if (!enc)
+ return true;
+
+ /*
+ * For shared->private conversion, accept the page using
+ * TDX_ACCEPT_PAGE TDX module call.
+ */
+ while (start < end) {
+ unsigned long len = end - start;
+ unsigned long accept_size;
+
+ /*
+ * Try larger accepts first. It gives chance to VMM to keep
+ * 1G/2M Secure EPT entries where possible and speeds up
+ * process by cutting number of hypercalls (if successful).
+ */
+
+ accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+ if (!accept_size)
+ accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+ if (!accept_size)
+ accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+ if (!accept_size)
+ return false;
+ start += accept_size;
+ }
+
+ return true;
+}
+
+void tdx_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ if (!tdx_enc_status_changed_phys(start, end, true))
+ panic("Accepting memory failed: %#llx-%#llx\n", start, end);
+}
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 9e6557d7514c..1392ebc3b406 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -713,46 +713,6 @@ static bool tdx_cache_flush_required(void)
return true;
}

-static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
- enum pg_level pg_level)
-{
- unsigned long accept_size = page_level_size(pg_level);
- u64 tdcall_rcx;
- u8 page_size;
-
- if (!IS_ALIGNED(start, accept_size))
- return 0;
-
- if (len < accept_size)
- return 0;
-
- /*
- * Pass the page physical address to the TDX module to accept the
- * pending, private page.
- *
- * Bits 2:0 of RCX encode page size: 0 - 4K, 1 - 2M, 2 - 1G.
- */
- switch (pg_level) {
- case PG_LEVEL_4K:
- page_size = 0;
- break;
- case PG_LEVEL_2M:
- page_size = 1;
- break;
- case PG_LEVEL_1G:
- page_size = 2;
- break;
- default:
- return 0;
- }
-
- tdcall_rcx = start | page_size;
- if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
- return 0;
-
- return accept_size;
-}
-
/*
* Inform the VMM of the guest's intent for this physical page: shared with
* the VMM or private to the guest. The VMM is expected to change its mapping
@@ -761,51 +721,9 @@ static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
{
phys_addr_t start = __pa(vaddr);
- phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
-
- if (!enc) {
- /* Set the shared (decrypted) bits: */
- start |= cc_mkdec(0);
- end |= cc_mkdec(0);
- }
-
- /*
- * Notify the VMM about page mapping conversion. More info about ABI
- * can be found in TDX Guest-Host-Communication Interface (GHCI),
- * section "TDG.VP.VMCALL<MapGPA>"
- */
- if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
- return false;
-
- /* private->shared conversion requires only MapGPA call */
- if (!enc)
- return true;
+ phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);

- /*
- * For shared->private conversion, accept the page using
- * TDX_ACCEPT_PAGE TDX module call.
- */
- while (start < end) {
- unsigned long len = end - start;
- unsigned long accept_size;
-
- /*
- * Try larger accepts first. It gives chance to VMM to keep
- * 1G/2M Secure EPT entries where possible and speeds up
- * process by cutting number of hypercalls (if successful).
- */
-
- accept_size = try_accept_one(start, len, PG_LEVEL_1G);
- if (!accept_size)
- accept_size = try_accept_one(start, len, PG_LEVEL_2M);
- if (!accept_size)
- accept_size = try_accept_one(start, len, PG_LEVEL_4K);
- if (!accept_size)
- return false;
- start += accept_size;
- }
-
- return true;
+ return tdx_enc_status_changed_phys(start, end, enc);
}

void __init tdx_early_init(void)
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 562b3f4cbde8..3afbba545a0d 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -92,5 +92,7 @@ struct tdx_module_output {
u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
struct tdx_module_output *out);

+void tdx_accept_memory(phys_addr_t start, phys_addr_t end);
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 234197ec17e4..3a7340ad9a4b 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -50,6 +50,8 @@ bool tdx_early_handle_ve(struct pt_regs *regs);

int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);

+bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end, bool enc);
+
#else

static inline void tdx_early_init(void) { };
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index a0a58486eb74..a521f8c0987d 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -6,6 +6,7 @@

#include <asm/io.h>
#include <asm/setup.h>
+#include <asm/shared/tdx.h>
#include <asm/unaccepted_memory.h>

/* Protects unaccepted memory bitmap */
@@ -61,7 +62,13 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
unsigned long len = range_end - range_start;

/* Platform-specific memory-acceptance call goes here */
- panic("Cannot accept memory: unknown platform\n");
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+ tdx_accept_memory(range_start * PMD_SIZE,
+ range_end * PMD_SIZE);
+ } else {
+ panic("Cannot accept memory: unknown platform\n");
+ }
+
bitmap_clear(bitmap, range_start, len);
}
spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
--
2.39.2

2023-03-30 11:53:51

by Kirill A. Shutemov

Subject: [PATCHv9 05/14] efi/x86: Get full memory map in allocate_e820()

Currently allocate_e820() is only interested in the size of the memory
map and the size of each memory descriptor to determine how many e820
entries the kernel needs.

UEFI Specification version 2.9 introduces a new memory type --
unaccepted memory. To track unaccepted memory, the kernel needs to
allocate a bitmap. The size of the bitmap is dependent on the maximum
physical address present in the system. A full memory map is required
to find the maximum address.

Modify allocate_e820() to get a full memory map.
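The scan that the full map enables can be sketched with a hypothetical
helper (simplified argument types; the unaccepted-memory patch later
does this inline over efi_memory_desc_t entries):

```c
#include <assert.h>

#define PAGE_SIZE 4096ULL

/*
 * Fold one memory descriptor's end address into a running maximum.
 * Calling this for every descriptor in the map yields the maximum
 * physical address, which sizes the unaccepted-memory bitmap.
 */
static unsigned long long update_max(unsigned long long max_addr,
				     unsigned long long phys_addr,
				     unsigned long long num_pages)
{
	unsigned long long end = phys_addr + num_pages * PAGE_SIZE;

	return end > max_addr ? end : max_addr;
}
```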

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
---
drivers/firmware/efi/libstub/x86-stub.c | 26 +++++++++++--------------
1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0bfd31358ba..fff81843169c 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -681,28 +681,24 @@ static efi_status_t allocate_e820(struct boot_params *params,
struct setup_data **e820ext,
u32 *e820ext_size)
{
- unsigned long map_size, desc_size, map_key;
+ struct efi_boot_memmap *map;
efi_status_t status;
- __u32 nr_desc, desc_version;
+ __u32 nr_desc;

- /* Only need the size of the mem map and size of each mem descriptor */
- map_size = 0;
- status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
- &desc_size, &desc_version);
- if (status != EFI_BUFFER_TOO_SMALL)
- return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
-
- nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
+ status = efi_get_memory_map(&map, false);
+ if (status != EFI_SUCCESS)
+ return status;

- if (nr_desc > ARRAY_SIZE(params->e820_table)) {
- u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+ nr_desc = map->map_size / map->desc_size;
+ if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+ u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
+ EFI_MMAP_NR_SLACK_SLOTS;

status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
- if (status != EFI_SUCCESS)
- return status;
}

- return EFI_SUCCESS;
+ efi_bs_call(free_pool, map);
+ return status;
}

struct exit_boot_struct {
--
2.39.2

2023-03-30 11:53:52

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv9 09/14] x86/mm: Reserve unaccepted memory bitmap

A given page of memory can only be accepted once. The kernel has to
accept memory both in the early decompression stage and during normal
runtime.

A bitmap is used to communicate the acceptance state of each page
between the decompression stage and normal runtime.

boot_params is used to communicate the location of the bitmap
throughout boot. The bitmap is allocated and initially populated in the
EFI stub. The decompression stage accepts the pages required for the
kernel/initrd and marks them accordingly in the bitmap. The main kernel
picks up the bitmap from the same boot_params and uses it to determine
what has to be accepted on allocation.

In the runtime kernel, reserve the bitmap's memory to ensure nothing
overwrites it.

The size of the bitmap is determined with e820__end_of_ram_pfn(), which
relies on setup_e820() marking unaccepted memory as E820_TYPE_RAM.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
---
arch/x86/kernel/e820.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..483c36a28d2e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1316,6 +1316,23 @@ void __init e820__memblock_setup(void)
int i;
u64 end;

+ /*
+ * Mark the unaccepted memory bitmap reserved.
+ *
+ * This kind of reservation is usually done from early_reserve_memory(),
+ * but early_reserve_memory() is called before e820__memory_setup(), so
+ * the e820_table is not finalized there and e820__end_of_ram_pfn()
+ * cannot be used to get the correct RAM size.
+ */
+ if (boot_params.unaccepted_memory) {
+ unsigned long size;
+
+ /* One bit per 2MB */
+ size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
+ PMD_SIZE * BITS_PER_BYTE);
+ memblock_reserve(boot_params.unaccepted_memory, size);
+ }
+
/*
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
--
2.39.2

2023-03-30 11:53:58

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv9 10/14] x86/mm: Provide helpers for unaccepted memory

Core-mm requires a few helpers to support unaccepted memory:

- accept_memory() checks the range of addresses against the bitmap and
accepts memory if needed.

- range_contains_unaccepted_memory() checks if anything within the
range requires acceptance.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/page.h | 3 ++
arch/x86/include/asm/unaccepted_memory.h | 4 ++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/unaccepted_memory.c | 61 ++++++++++++++++++++++++
4 files changed, 70 insertions(+)
create mode 100644 arch/x86/mm/unaccepted_memory.c

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index d18e5c332cb9..4bab2bb2c9c0 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,9 @@
struct page;

#include <linux/range.h>
+
+#include <asm/unaccepted_memory.h>
+
extern struct range pfn_mapped[];
extern int nr_pfn_mapped;

diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index 41fbfc798100..89fc91c61560 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -7,6 +7,10 @@ struct boot_params;

void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);

+#ifdef CONFIG_UNACCEPTED_MEMORY
+
void accept_memory(phys_addr_t start, phys_addr_t end);
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);

#endif
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index c80febc44cd2..b0ef1755e5c8 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -67,3 +67,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o

obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+
+obj-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
new file mode 100644
index 000000000000..1df918b21469
--- /dev/null
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/spinlock.h>
+
+#include <asm/io.h>
+#include <asm/setup.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long range_start, range_end;
+ unsigned long *bitmap;
+ unsigned long flags;
+
+ if (!boot_params.unaccepted_memory)
+ return;
+
+ bitmap = __va(boot_params.unaccepted_memory);
+ range_start = start / PMD_SIZE;
+
+ spin_lock_irqsave(&unaccepted_memory_lock, flags);
+ for_each_set_bitrange_from(range_start, range_end, bitmap,
+ DIV_ROUND_UP(end, PMD_SIZE)) {
+ unsigned long len = range_end - range_start;
+
+ /* Platform-specific memory-acceptance call goes here */
+ panic("Cannot accept memory: unknown platform\n");
+ bitmap_clear(bitmap, range_start, len);
+ }
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long *bitmap;
+ unsigned long flags;
+ bool ret = false;
+
+ if (!boot_params.unaccepted_memory)
+ return 0;
+
+ bitmap = __va(boot_params.unaccepted_memory);
+
+ spin_lock_irqsave(&unaccepted_memory_lock, flags);
+ while (start < end) {
+ if (test_bit(start / PMD_SIZE, bitmap)) {
+ ret = true;
+ break;
+ }
+
+ start += PMD_SIZE;
+ }
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+ return ret;
+}
--
2.39.2

2023-04-03 13:31:20

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCHv9 11/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory

On 3/30/23 13:49, Kirill A. Shutemov wrote:
> load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> The unwanted loads are typically harmless. But, they might be made to
> totally unrelated or even unmapped memory. load_unaligned_zeropad()
> relies on exception fixup (#PF, #GP and now #VE) to recover from these
> unwanted loads.
>
> But, this approach does not work for unaccepted memory. For TDX, a load
> from unaccepted memory will not lead to a recoverable exception within
> the guest. The guest will exit to the VMM where the only recourse is to
> terminate the guest.
>
> There are three parts to fix this issue and comprehensively avoid access
> to unaccepted memory. Together these ensure that an extra "guard" page
> is accepted in addition to the memory that needs to be used.
>
> 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> checks up to end+2M if 'end' is aligned on a 2M boundary. It may
> require checking 2M chunk beyond end of RAM. The bitmap allocation is
> modified to accommodate this.
> 2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
> aligned on a 2M boundary.
> 3. Set PageUnaccepted() on both memory that itself needs to be accepted
> *and* memory where the next page needs to be accepted. Essentially,
> make PageUnaccepted(page) a marker for whether work needs to be done
> to make 'page' usable. That work might include accepting pages in
> addition to 'page' itself.
>
> Side note: This leads to something strange. Pages which were accepted
> at boot, marked by the firmware as accepted and will never
> _need_ to be accepted might have PageUnaccepted() set on
> them. PageUnaccepted(page) is a cue to ensure that the next
> page is accepted before 'page' can be used.

At least the part about PageUnaccepted() is obsolete in v9, no?

> This is an actual, real-world problem which was discovered during TDX
> testing.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Dave Hansen <[email protected]>
> ---
> arch/x86/mm/unaccepted_memory.c | 39 +++++++++++++++++++++++++
> drivers/firmware/efi/libstub/x86-stub.c | 7 +++++
> 2 files changed, 46 insertions(+)
>
> diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
> index 1df918b21469..a0a58486eb74 100644
> --- a/arch/x86/mm/unaccepted_memory.c
> +++ b/arch/x86/mm/unaccepted_memory.c
> @@ -23,6 +23,38 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
> bitmap = __va(boot_params.unaccepted_memory);
> range_start = start / PMD_SIZE;
>
> + /*
> + * load_unaligned_zeropad() can lead to unwanted loads across page
> + * boundaries. The unwanted loads are typically harmless. But, they
> + * might be made to totally unrelated or even unmapped memory.
> + * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
> + * #VE) to recover from these unwanted loads.
> + *
> + * But, this approach does not work for unaccepted memory. For TDX, a
> + * load from unaccepted memory will not lead to a recoverable exception
> + * within the guest. The guest will exit to the VMM where the only
> + * recourse is to terminate the guest.
> + *
> + * There are three parts to fix this issue and comprehensively avoid
> + * access to unaccepted memory. Together these ensure that an extra
> + * "guard" page is accepted in addition to the memory that needs to be
> + * used:
> + *
> + * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> + * checks up to end+2M if 'end' is aligned on a 2M boundary.
> + *
> + * 2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
> + * aligned on a 2M boundary. (immediately following this comment)
> + *
> + * 3. Set PageUnaccepted() on both memory that itself needs to be
> + * accepted *and* memory where the next page needs to be accepted.
> + * Essentially, make PageUnaccepted(page) a marker for whether work
> + * needs to be done to make 'page' usable. That work might include
> + * accepting pages in addition to 'page' itself.
> + */

And here.

> + if (!(end % PMD_SIZE))
> + end += PMD_SIZE;
> +
> spin_lock_irqsave(&unaccepted_memory_lock, flags);
> for_each_set_bitrange_from(range_start, range_end, bitmap,
> DIV_ROUND_UP(end, PMD_SIZE)) {
> @@ -46,6 +78,13 @@ bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
>
> bitmap = __va(boot_params.unaccepted_memory);
>
> + /*
> + * Also consider the unaccepted state of the *next* page. See fix #1 in
> + * the comment on load_unaligned_zeropad() in accept_memory().
> + */
> + if (!(end % PMD_SIZE))
> + end += PMD_SIZE;
> +
> spin_lock_irqsave(&unaccepted_memory_lock, flags);
> while (start < end) {
> if (test_bit(start / PMD_SIZE, bitmap)) {
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index 1643ddbde249..1afe7b5b02e1 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -715,6 +715,13 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
> return EFI_SUCCESS;
> }
>
> + /*
> + * range_contains_unaccepted_memory() may need to check one 2M chunk
> + * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
> + * sure that the bitmap is large enough handle it.
> + */
> + max_addr += PMD_SIZE;
> +
> /*
> * If unaccepted memory is present, allocate a bitmap to track what
> * memory has to be accepted before access.

2023-04-03 14:43:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv9 11/14] x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory

On Mon, Apr 03, 2023 at 03:28:36PM +0200, Vlastimil Babka wrote:
> On 3/30/23 13:49, Kirill A. Shutemov wrote:
> > load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
> > The unwanted loads are typically harmless. But, they might be made to
> > totally unrelated or even unmapped memory. load_unaligned_zeropad()
> > relies on exception fixup (#PF, #GP and now #VE) to recover from these
> > unwanted loads.
> >
> > But, this approach does not work for unaccepted memory. For TDX, a load
> > from unaccepted memory will not lead to a recoverable exception within
> > the guest. The guest will exit to the VMM where the only recourse is to
> > terminate the guest.
> >
> > There are three parts to fix this issue and comprehensively avoid access
> > to unaccepted memory. Together these ensure that an extra "guard" page
> > is accepted in addition to the memory that needs to be used.
> >
> > 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> > checks up to end+2M if 'end' is aligned on a 2M boundary. It may
> > require checking 2M chunk beyond end of RAM. The bitmap allocation is
> > modified to accommodate this.
> > 2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
> > aligned on a 2M boundary.
> > 3. Set PageUnaccepted() on both memory that itself needs to be accepted
> > *and* memory where the next page needs to be accepted. Essentially,
> > make PageUnaccepted(page) a marker for whether work needs to be done
> > to make 'page' usable. That work might include accepting pages in
> > addition to 'page' itself.
> >
> > Side note: This leads to something strange. Pages which were accepted
> > at boot, marked by the firmware as accepted and will never
> > _need_ to be accepted might have PageUnaccepted() set on
> > them. PageUnaccepted(page) is a cue to ensure that the next
> > page is accepted before 'page' can be used.
>
> At least the part about PageUnaccepted() is obsolete in v9, no?

Ah, right. Nice catch.

I removed PageUnaccepted() late in patchset preparation and forgot about
this.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-03 14:44:14

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCHv9 00/14] mm, x86/cc: Implement support for unaccepted memory

On 3/30/23 13:49, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, requiring memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific for the Virtual
> Machine platform.
>
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
>
> The kernel needs to know what memory has been accepted. Firmware
> communicates this information via memory map: a new memory type --
> EFI_UNACCEPTED_MEMORY -- indicates such memory.
>
> Range-based tracking works fine for firmware, but it gets bulky for
> the kernel: e820 has to be modified on every page acceptance. It leads
> to table fragmentation, but there's a limited number of entries in the
> e820 table
>
> Another option is to mark such memory as usable in e820 and track if the
> range has been accepted in a bitmap. One bit in the bitmap represents
> 2MiB in the address space: one 4k page is enough to track 64GiB or
> physical address space.
>
> In the worst-case scenario -- a huge hole in the middle of the
> address space -- It needs 256MiB to handle 4PiB of the address
> space.
>
> Any unaccepted memory that is not aligned to 2M gets accepted upfront.
>
> The approach lowers boot time substantially. Boot to shell is ~2.5x
> faster for 4G TDX VM and ~4x faster for 64G.
>
> TDX-specific code isolated from the core of unaccepted memory support. It
> supposed to help to plug-in different implementation of unaccepted memory
> such as SEV-SNP.
>
> -- Fragmentation study --
>
> Vlastimil and Mel were concern about effect of unaccepted memory on
> fragmentation prevention measures in page allocator. I tried to evaluate
> it, but it is tricky. As suggested I tried to run multiple parallel kernel
> builds and follow how often kmem:mm_page_alloc_extfrag gets hit.
>
> I don't like the results. Not because unaccepted memory is slow, but
> because the test is way too noisy to produce sensible information.

Hmm yeah it can be noisy. Did you try to only count events that have
fragmenting=1 and/or MIGRATE_MOVABLE as fallback_migratetype? As those are
the really bad events.

> So, I run 8 parallel builds of kernel, 16 jobs each in a VM with 16 CPU
> and 16G of RAM. I compared unpatched base-line (mm-unstable) with the tree
> that has patchset applied. For the later case, all memory above 1G was
> considered unaccepted with fake_unaccepted_start=1G. Around 14G of RAM was
> accounted as unaccepted after the boot. The test got all memory accepted.
>
> kmem:mm_page_alloc_extfrag Time elapsed
> Base line:
> Run 1 837 1803.21s
> Run 2 3,845 1785.87s
> Run 3 1,704 1883.59s
>
> Patched:
> Run 1 905 1876.02s
> Run 2 845 1758.50s
> Run 3 1276 1876.13s
>
> As you can see the numbers are all over the place.
>
> The good news is that unaccepted memory doesn't make picture notably worse.

Yeah that at least somehow confirms no big surprises. I wouldn't expect any
with the v9 design of watermarks handling.

> I am open to suggestions on how to test it better.
>
> Also, feel free to play with it yourself. fake_unaccepted_start= doesn't
> require any special setup.
>
> --
>
> The tree can be found here:
>
> https://github.com/intel/tdx.git guest-unaccepted-memory
>
> The patchset depends on MAX_ORDER changes in MM tree.
>
> v9:
> - Accept memory up to high watermark when kernel runs out of free memory;
> - Treat unaccepted memory as unusable in __zone_watermark_unusable_free();
> - Per-zone unaccepted memory accounting;
> - All pages on unaccepted list are MAX_ORDER now;
> - accept_memory=eager in cmdline to pre-accept memory during the boot;
> - Implement fake unaccepted memory;
> - Sysfs handle to accept memory manually;
> - Drop PageUnaccepted();
> - Rename unaccepted_pages static key to zones_with_unaccepted_pages;
> v8:
> - Rewrite core-mm support for unaccepted memory (patch 02/14);
> - s/UnacceptedPages/Unaccepted/ in meminfo;
> - Drop arch/x86/boot/compressed/compiler.h;
> - Fix build errors;
> - Adjust commit messages and comments;
> - Reviewed-bys from Dave and Borislav;
> - Rebased to tip/master.
> v7:
> - Rework meminfo counter to use PageUnaccepted() and move to generic code;
> - Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
> - Add Reviewed-by from David;
> v6:
> - Fix load_unaligned_zeropad() on machine with unaccepted memory;
> - Clear PageUnaccepted() on merged pages, leaving it only on head;
> - Clarify error handling in allocate_e820();
> - Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
> - Disable kexec at boottime instead of build conflict;
> - Rebased to tip/master;
> - Spelling fixes;
> - Add Reviewed-by from Mike and David;
> v5:
> - Updates comments and commit messages;
> + Explain options for unaccepted memory handling;
> - Expose amount of unaccepted memory in /proc/meminfo
> - Adjust check in page_expected_state();
> - Fix error code handling in allocate_e820();
> - Centralize __pa()/__va() definitions in the boot stub;
> - Avoid includes from the main kernel in the boot stub;
> - Use an existing hole in boot_param for unaccepted_memory, instead of adding
> to the end of the structure;
> - Extract allocate_unaccepted_memory() form allocate_e820();
> - Complain if there's unaccepted memory, but kernel does not support it;
> - Fix vmstat counter;
> - Split up few preparatory patches;
> - Random readability adjustments;
> v4:
> - PageBuddyUnaccepted() -> PageUnaccepted;
> - Use separate page_type, not shared with offline;
> - Rework interface between core-mm and arch code;
> - Adjust commit messages;
> - Ack from Mike;
>
> Kirill A. Shutemov (14):
> x86/boot: Centralize __pa()/__va() definitions
> mm: Add support for unaccepted memory
> mm/page_alloc: Fake unaccepted memory
> mm/page_alloc: Add sysfs handle to accept accept_memory
> efi/x86: Get full memory map in allocate_e820()
> x86/boot: Add infrastructure required for unaccepted memory support
> efi/x86: Implement support for unaccepted memory
> x86/boot/compressed: Handle unaccepted memory
> x86/mm: Reserve unaccepted memory bitmap
> x86/mm: Provide helpers for unaccepted memory
> x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
> x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
> boot stub
> x86/tdx: Refactor try_accept_one()
> x86/tdx: Add unaccepted memory support
>
> Documentation/x86/zero-page.rst | 1 +
> arch/x86/Kconfig | 2 +
> arch/x86/boot/bitops.h | 40 ++++
> arch/x86/boot/compressed/Makefile | 3 +-
> arch/x86/boot/compressed/align.h | 14 ++
> arch/x86/boot/compressed/bitmap.c | 43 ++++
> arch/x86/boot/compressed/bitmap.h | 49 +++++
> arch/x86/boot/compressed/bits.h | 36 ++++
> arch/x86/boot/compressed/efi.h | 1 +
> arch/x86/boot/compressed/error.c | 19 ++
> arch/x86/boot/compressed/error.h | 1 +
> arch/x86/boot/compressed/find.c | 54 +++++
> arch/x86/boot/compressed/find.h | 79 ++++++++
> arch/x86/boot/compressed/ident_map_64.c | 8 -
> arch/x86/boot/compressed/kaslr.c | 35 ++--
> arch/x86/boot/compressed/math.h | 37 ++++
> arch/x86/boot/compressed/mem.c | 122 +++++++++++
> arch/x86/boot/compressed/minmax.h | 61 ++++++
> arch/x86/boot/compressed/misc.c | 6 +
> arch/x86/boot/compressed/misc.h | 15 ++
> arch/x86/boot/compressed/pgtable_types.h | 25 +++
> arch/x86/boot/compressed/sev.c | 2 -
> arch/x86/boot/compressed/tdx-shared.c | 2 +
> arch/x86/boot/compressed/tdx.c | 39 ++++
> arch/x86/coco/tdx/Makefile | 2 +-
> arch/x86/coco/tdx/tdx-shared.c | 95 +++++++++
> arch/x86/coco/tdx/tdx.c | 118 +----------
> arch/x86/include/asm/page.h | 3 +
> arch/x86/include/asm/shared/tdx.h | 53 +++++
> arch/x86/include/asm/tdx.h | 21 +-
> arch/x86/include/asm/unaccepted_memory.h | 16 ++
> arch/x86/include/uapi/asm/bootparam.h | 2 +-
> arch/x86/kernel/e820.c | 17 ++
> arch/x86/mm/Makefile | 2 +
> arch/x86/mm/unaccepted_memory.c | 107 ++++++++++
> drivers/base/node.c | 7 +
> drivers/firmware/efi/Kconfig | 14 ++
> drivers/firmware/efi/efi.c | 1 +
> drivers/firmware/efi/libstub/x86-stub.c | 98 +++++++--
> fs/proc/meminfo.c | 5 +
> include/linux/efi.h | 3 +-
> include/linux/mmzone.h | 8 +
> mm/internal.h | 13 ++
> mm/memblock.c | 9 +
> mm/mm_init.c | 7 +
> mm/page_alloc.c | 246 +++++++++++++++++++++++
> mm/vmstat.c | 3 +
> 47 files changed, 1368 insertions(+), 176 deletions(-)
> create mode 100644 arch/x86/boot/compressed/align.h
> create mode 100644 arch/x86/boot/compressed/bitmap.c
> create mode 100644 arch/x86/boot/compressed/bitmap.h
> create mode 100644 arch/x86/boot/compressed/bits.h
> create mode 100644 arch/x86/boot/compressed/find.c
> create mode 100644 arch/x86/boot/compressed/find.h
> create mode 100644 arch/x86/boot/compressed/math.h
> create mode 100644 arch/x86/boot/compressed/mem.c
> create mode 100644 arch/x86/boot/compressed/minmax.h
> create mode 100644 arch/x86/boot/compressed/pgtable_types.h
> create mode 100644 arch/x86/boot/compressed/tdx-shared.c
> create mode 100644 arch/x86/coco/tdx/tdx-shared.c
> create mode 100644 arch/x86/include/asm/unaccepted_memory.h
> create mode 100644 arch/x86/mm/unaccepted_memory.c
>

2023-04-04 17:41:08

by Tom Lendacky

[permalink] [raw]
Subject: [PATCH v7 0/6] Provide SEV-SNP support for unaccepted memory

This series adds SEV-SNP support for unaccepted memory to the patch series
titled:

[PATCHv9 00/14] mm, x86/cc: Implement support for unaccepted memory

Currently, when changing the state of a page under SNP, the page state
change structure is kmalloc()'d. This leads to hangs during boot when
accepting memory because the allocation can trigger the need to accept
more memory. Additionally, the page state change operations are not
optimized under Linux since it was expected that all memory has been
validated already, resulting in poor performance when adding basic
support for unaccepted memory.

This series consists of six patches:
- One pre-patch fix which can be taken regardless of this series.

- A pre-patch to switch from a kmalloc()'d page state change structure
to a (smaller) stack-based page state change structure.

- A pre-patch to allow the use of the early boot GHCB in the core kernel
path.

- A pre-patch to allow for use of 2M page state change requests and 2M
page validation.

- SNP support for unaccepted memory.

- EFI support to inform EFI/OVMF to not accept all memory on exit boot
services (from Dionna Glaze <[email protected]>)

The series is based off of and tested against Kirill Shutemov's tree:
https://github.com/intel/tdx.git guest-unaccepted-memory

---

Changes since v6:
- Add missing Kconfig dependency on EFI_STUB.
- Include an EFI patch to tell OVMF to not accept all memory on the
exit boot services call (from Dionna Glaze <[email protected]>)

Changes since v5:
- Replace the two fix patches with a single fix patch that uses a new
function signature to address the problem instead of using casting.
- Adjust the #define WARN definition in arch/x86/kernel/sev-shared.c to
allow use from arch/x86/boot/compressed path without moving them outside
of the if statements.

Changes since v4:
- Two fixes for when an unsigned int used as the number of pages to
process, it needs to be converted to an unsigned long before being
used to calculate ending addresses, otherwise a value >= 0x100000
results in adding 0 in the calculation.
- Commit message and comment updates.

Changes since v3:
- Reworks the PSC process to greatly improve performance:
- Optimize the PSC process to use 2M pages when applicable.
- Optimize the page validation process to use 2M pages when applicable.
- Use the early GHCB in both the decompression phase and core kernel
boot phase in order to minimize the use of the MSR protocol. The MSR
protocol only allows for a single 4K page to be updated at a time.
- Move the ghcb_percpu_ready flag into the sev_config structure and
rename it to ghcbs_initialized.

Changes since v2:
- Improve code comments in regards to when to use the per-CPU GHCB vs
the MSR protocol and why a single global value is valid for both
the BSP and APs.
- Add a comment related to the number of PSC entries and how it can
impact the size of the struct and, therefore, stack usage.
- Add a WARN_ON_ONCE() for invoking vmgexit_psc() when per-CPU GHCBs
haven't been created or registered, yet.
- Use the compiler support for clearing the PSC struct instead of
issuing memset().

Changes since v1:
- Change from using a per-CPU PSC structure to a (smaller) stack PSC
structure.


Dionna Glaze (1):
x86/efi: Safely enable unaccepted memory in UEFI

Tom Lendacky (5):
x86/sev: Fix calculation of end address based on number of pages
x86/sev: Put PSC struct on the stack in prep for unaccepted memory
support
x86/sev: Allow for use of the early boot GHCB for PSC requests
x86/sev: Use large PSC requests if applicable
x86/sev: Add SNP-specific unaccepted memory support

arch/x86/Kconfig | 2 +
arch/x86/boot/compressed/mem.c | 3 +
arch/x86/boot/compressed/sev.c | 54 ++++-
arch/x86/boot/compressed/sev.h | 23 +++
arch/x86/include/asm/sev-common.h | 9 +-
arch/x86/include/asm/sev.h | 23 ++-
arch/x86/kernel/sev-shared.c | 103 ++++++++++
arch/x86/kernel/sev.c | 256 ++++++++++--------------
arch/x86/mm/unaccepted_memory.c | 4 +
drivers/firmware/efi/libstub/x86-stub.c | 36 ++++
include/linux/efi.h | 3 +
11 files changed, 357 insertions(+), 159 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h

--
2.40.0

2023-04-04 17:47:34

by Tom Lendacky

[permalink] [raw]
Subject: [PATCH v7 4/6] x86/sev: Use large PSC requests if applicable

In advance of providing support for unaccepted memory, issue 2M Page
State Change (PSC) requests when the address range allows for it. By using
a 2M page size, more PSC operations can be handled in a single request to
the hypervisor. The hypervisor will determine if it can accommodate the
larger request by checking the mapping in the nested page table. If mapped
as a large page, then the 2M page request can be performed, otherwise the
2M page request will be broken down into 512 4K page requests. This is
still more efficient than having the guest perform multiple PSC requests
in order to process the 512 4K pages.

In conjunction with the 2M PSC requests, attempt to perform the associated
PVALIDATE instruction of the page using the 2M page size. If PVALIDATE
fails with a size mismatch, then fall back to validating 512 4K pages. To
do this, page validation is modified to work with the PSC structure and
not just a virtual address range.

Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/include/asm/sev.h | 4 ++
arch/x86/kernel/sev.c | 125 ++++++++++++++++++++++++-------------
2 files changed, 84 insertions(+), 45 deletions(-)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index a0a58c4122ec..91b4f712ef18 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -78,11 +78,15 @@ extern void vc_no_ghcb(void);
extern void vc_boot_ghcb(void);
extern bool handle_vc_boot_ghcb(struct pt_regs *regs);

+/* PVALIDATE return codes */
+#define PVALIDATE_FAIL_SIZEMISMATCH 6
+
/* Software defined (when rFlags.CF = 1) */
#define PVALIDATE_FAIL_NOUPDATE 255

/* RMP page size */
#define RMP_PG_SIZE_4K 0
+#define RMP_PG_SIZE_2M 1

#define RMPADJUST_VMSA_PAGE_BIT BIT(16)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 1717bc4558f7..93de70340427 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,32 +655,58 @@ static u64 __init get_jump_table_addr(void)
return ret;
}

-static void pvalidate_pages(unsigned long vaddr, unsigned long npages, bool validate)
+static void pvalidate_pages(struct snp_psc_desc *desc)
{
- unsigned long vaddr_end;
+ struct psc_entry *e;
+ unsigned long vaddr;
+ unsigned int size;
+ unsigned int i;
+ bool validate;
int rc;

- vaddr = vaddr & PAGE_MASK;
- vaddr_end = vaddr + (npages << PAGE_SHIFT);
+ for (i = 0; i <= desc->hdr.end_entry; i++) {
+ e = &desc->entries[i];
+
+ vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+ size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+ validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+ rc = pvalidate(vaddr, size, validate);
+ if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+ unsigned long vaddr_end = vaddr + PMD_SIZE;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+ rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ if (rc)
+ break;
+ }
+ }

- while (vaddr < vaddr_end) {
- rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
-
- vaddr = vaddr + PAGE_SIZE;
}
}

-static void early_set_pages_state(unsigned long paddr, unsigned long npages, enum psc_op op)
+static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
+ unsigned long npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
+ int ret;
+
+ vaddr = vaddr & PAGE_MASK;

paddr = paddr & PAGE_MASK;
paddr_end = paddr + (npages << PAGE_SHIFT);

while (paddr < paddr_end) {
+ if (op == SNP_PAGE_STATE_SHARED) {
+ /* Page validation must be rescinded before changing to shared */
+ ret = pvalidate(vaddr, RMP_PG_SIZE_4K, false);
+ if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+ goto e_term;
+ }
+
/*
* Use the MSR protocol because this function can be called before
* the GHCB is established.
@@ -701,7 +727,15 @@ static void early_set_pages_state(unsigned long paddr, unsigned long npages, enu
paddr, GHCB_MSR_PSC_RESP_VAL(val)))
goto e_term;

- paddr = paddr + PAGE_SIZE;
+ if (op == SNP_PAGE_STATE_PRIVATE) {
+ /* Page validation must be performed after changing to private */
+ ret = pvalidate(vaddr, RMP_PG_SIZE_4K, true);
+ if (WARN(ret, "Failed to validate address 0x%lx ret %d", paddr, ret))
+ goto e_term;
+ }
+
+ vaddr += PAGE_SIZE;
+ paddr += PAGE_SIZE;
}

return;
@@ -726,10 +760,7 @@ void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long padd
* Ask the hypervisor to mark the memory pages as private in the RMP
* table.
*/
- early_set_pages_state(paddr, npages, SNP_PAGE_STATE_PRIVATE);
-
- /* Validate the memory pages after they've been added in the RMP table. */
- pvalidate_pages(vaddr, npages, true);
+ early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_PRIVATE);
}

void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr,
@@ -744,11 +775,8 @@ void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr
if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED))
return;

- /* Invalidate the memory pages before they are marked shared in the RMP table. */
- pvalidate_pages(vaddr, npages, false);
-
/* Ask hypervisor to mark the memory pages shared in the RMP table. */
- early_set_pages_state(paddr, npages, SNP_PAGE_STATE_SHARED);
+ early_set_pages_state(vaddr, paddr, npages, SNP_PAGE_STATE_SHARED);
}

void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op)
@@ -832,10 +860,11 @@ static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
return ret;
}

-static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
- unsigned long vaddr_end, int op)
+static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
+ unsigned long vaddr_end, int op)
{
struct ghcb_state state;
+ bool use_large_entry;
struct psc_hdr *hdr;
struct psc_entry *e;
unsigned long flags;
@@ -849,27 +878,37 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
memset(data, 0, sizeof(*data));
i = 0;

- while (vaddr < vaddr_end) {
- if (is_vmalloc_addr((void *)vaddr))
+ while (vaddr < vaddr_end && i < ARRAY_SIZE(data->entries)) {
+ hdr->end_entry = i;
+
+ if (is_vmalloc_addr((void *)vaddr)) {
pfn = vmalloc_to_pfn((void *)vaddr);
- else
+ use_large_entry = false;
+ } else {
pfn = __pa(vaddr) >> PAGE_SHIFT;
+ use_large_entry = true;
+ }

e->gfn = pfn;
e->operation = op;
- hdr->end_entry = i;

- /*
- * Current SNP implementation doesn't keep track of the RMP page
- * size so use 4K for simplicity.
- */
- e->pagesize = RMP_PG_SIZE_4K;
+ if (use_large_entry && IS_ALIGNED(vaddr, PMD_SIZE) &&
+ (vaddr_end - vaddr) >= PMD_SIZE) {
+ e->pagesize = RMP_PG_SIZE_2M;
+ vaddr += PMD_SIZE;
+ } else {
+ e->pagesize = RMP_PG_SIZE_4K;
+ vaddr += PAGE_SIZE;
+ }

- vaddr = vaddr + PAGE_SIZE;
e++;
i++;
}

+ /* Page validation must be rescinded before changing to shared */
+ if (op == SNP_PAGE_STATE_SHARED)
+ pvalidate_pages(data);
+
local_irq_save(flags);

if (sev_cfg.ghcbs_initialized)
@@ -877,6 +916,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
else
ghcb = boot_ghcb;

+ /* Invoke the hypervisor to perform the page state changes */
if (!ghcb || vmgexit_psc(ghcb, data))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);

@@ -884,29 +924,28 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
__sev_put_ghcb(&state);

local_irq_restore(flags);
+
+ /* Page validation must be performed after changing to private */
+ if (op == SNP_PAGE_STATE_PRIVATE)
+ pvalidate_pages(data);
+
+ return vaddr;
}

static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
{
- unsigned long vaddr_end, next_vaddr;
struct snp_psc_desc desc;
+ unsigned long vaddr_end;

/* Use the MSR protocol when a GHCB is not available. */
if (!boot_ghcb)
- return early_set_pages_state(__pa(vaddr), npages, op);
+ return early_set_pages_state(vaddr, __pa(vaddr), npages, op);

vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);

- while (vaddr < vaddr_end) {
- /* Calculate the last vaddr that fits in one struct snp_psc_desc. */
- next_vaddr = min_t(unsigned long, vaddr_end,
- (VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);
-
- __set_pages_state(&desc, vaddr, next_vaddr, op);
-
- vaddr = next_vaddr;
- }
+ while (vaddr < vaddr_end)
+ vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
}

void snp_set_memory_shared(unsigned long vaddr, unsigned long npages)
@@ -914,8 +953,6 @@ void snp_set_memory_shared(unsigned long vaddr, unsigned long npages)
if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
return;

- pvalidate_pages(vaddr, npages, false);
-
set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
}

@@ -925,8 +962,6 @@ void snp_set_memory_private(unsigned long vaddr, unsigned long npages)
return;

set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
-
- pvalidate_pages(vaddr, npages, true);
}

static int snp_set_vmsa(void *va, bool vmsa)
--
2.40.0

2023-04-04 17:47:45

by Tom Lendacky

Subject: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

From: Dionna Glaze <[email protected]>

The UEFI v2.9 specification includes a new memory type to be used in
environments where the OS must accept memory that is provided from its
host. Before the introduction of this memory type, all memory was
accepted eagerly in the firmware. In order for the firmware to safely
stop accepting memory on the OS's behalf, the OS must affirmatively
indicate support to the firmware. This is only a problem for AMD
SEV-SNP, which Linux has supported since 5.19. The other
technology that can make use of unaccepted memory, Intel TDX, does not
yet have Linux support, so it can strictly require unaccepted memory
support as a dependency of CONFIG_TDX and not require communication with
the firmware.

Enabling unaccepted memory requires calling a 0-argument enablement
protocol before ExitBootServices. This call is only made if the kernel
is compiled with UNACCEPTED_MEMORY=y.

This protocol will be removed after the end of life of the first LTS
that includes it, in order to give firmware implementations an
expiration date for it. When the protocol is removed, firmware will
strictly infer that a SEV-SNP VM is running an OS that supports the
unaccepted memory type. At the earliest convenience, when unaccepted
memory support is added to Linux, SEV-SNP may take a strict dependency on
it. After the firmware removes support for the protocol, this patch
should be reverted.

[tl: address some checkscript warnings]

Cc: Ard Biesheuvel <[email protected]>
Cc: "Min M. Xu" <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: Jiewen Yao <[email protected]>
Cc: Erdem Aktas <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Borislav Petkov <[email protected]>
Signed-off-by: Dionna Glaze <[email protected]>
Signed-off-by: Tom Lendacky <[email protected]>
---
drivers/firmware/efi/libstub/x86-stub.c | 36 +++++++++++++++++++++++++
include/linux/efi.h | 3 +++
2 files changed, 39 insertions(+)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 1afe7b5b02e1..119e201cfc68 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -27,6 +27,17 @@ const efi_dxe_services_table_t *efi_dxe_table;
u32 image_offset __section(".data");
static efi_loaded_image_t *image = NULL;

+typedef union sev_memory_acceptance_protocol sev_memory_acceptance_protocol_t;
+union sev_memory_acceptance_protocol {
+ struct {
+ efi_status_t (__efiapi * allow_unaccepted_memory)(
+ sev_memory_acceptance_protocol_t *);
+ };
+ struct {
+ u32 allow_unaccepted_memory;
+ } mixed_mode;
+};
+
static efi_status_t
preserve_pci_rom_image(efi_pci_io_protocol_t *pci, struct pci_setup_rom **__rom)
{
@@ -311,6 +322,29 @@ setup_memory_protection(unsigned long image_base, unsigned long image_size)
#endif
}

+static void setup_unaccepted_memory(void)
+{
+ efi_guid_t mem_acceptance_proto = OVMF_SEV_MEMORY_ACCEPTANCE_PROTOCOL_GUID;
+ sev_memory_acceptance_protocol_t *proto;
+ efi_status_t status;
+
+ if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY))
+ return;
+
+ /*
+ * Enable unaccepted memory before calling exit boot services in order
+ * for the UEFI to not accept all memory on EBS.
+ */
+ status = efi_bs_call(locate_protocol, &mem_acceptance_proto, NULL,
+ (void **)&proto);
+ if (status != EFI_SUCCESS)
+ return;
+
+ status = efi_call_proto(proto, allow_unaccepted_memory);
+ if (status != EFI_SUCCESS)
+ efi_err("Memory acceptance protocol failed\n");
+}
+
static const efi_char16_t apple[] = L"Apple";

static void setup_quirks(struct boot_params *boot_params,
@@ -967,6 +1001,8 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,

setup_quirks(boot_params, bzimage_addr, buffer_end - buffer_start);

+ setup_unaccepted_memory();
+
status = exit_boot(boot_params, handle);
if (status != EFI_SUCCESS) {
efi_err("exit_boot() failed!\n");
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 1d4f0343c710..e728b8cf6b73 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -436,6 +436,9 @@ void efi_native_runtime_setup(void);
#define DELLEMC_EFI_RCI2_TABLE_GUID EFI_GUID(0x2d9f28a2, 0xa886, 0x456a, 0x97, 0xa8, 0xf1, 0x1e, 0xf2, 0x4f, 0xf4, 0x55)
#define AMD_SEV_MEM_ENCRYPT_GUID EFI_GUID(0x0cf29b71, 0x9e51, 0x433a, 0xa3, 0xb7, 0x81, 0xf3, 0xab, 0x16, 0xb8, 0x75)

+/* OVMF protocol GUIDs */
+#define OVMF_SEV_MEMORY_ACCEPTANCE_PROTOCOL_GUID EFI_GUID(0xc5a010fe, 0x38a7, 0x4531, 0x8a, 0x4a, 0x05, 0x00, 0xd2, 0xfd, 0x16, 0x49)
+
typedef struct {
efi_guid_t guid;
u64 table;
--
2.40.0

2023-04-04 17:48:58

by Tom Lendacky

Subject: [PATCH v7 3/6] x86/sev: Allow for use of the early boot GHCB for PSC requests

Using a GHCB for a page state change (as opposed to the MSR protocol)
allows for multiple pages to be processed in a single request. In prep
for early PSC requests in support of unaccepted memory, update the
invocation of vmgexit_psc() to be able to use the early boot GHCB and not
just the per-CPU GHCB structure.

In order to use the proper GHCB (early boot vs per-CPU), set a flag that
indicates when the per-CPU GHCBs are available and registered. For APs,
the per-CPU GHCBs are created before they are started and registered upon
startup, so this flag can be used globally for the BSP and APs instead of
creating a per-CPU flag. This will allow for a significant reduction in
the number of MSR protocol page state change requests when accepting
memory.

Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/kernel/sev.c | 61 +++++++++++++++++++++++++++----------------
1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index a3d76eb344d6..1717bc4558f7 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -117,7 +117,19 @@ static DEFINE_PER_CPU(struct sev_es_save_area *, sev_vmsa);

struct sev_config {
__u64 debug : 1,
- __reserved : 63;
+
+ /*
+ * A flag used by __set_pages_state() that indicates when the
+ * per-CPU GHCB has been created and registered and thus can be
+ * used by the BSP instead of the early boot GHCB.
+ *
+ * For APs, the per-CPU GHCB is created before they are started
+ * and registered upon startup, so this flag can be used globally
+ * for the BSP and APs.
+ */
+ ghcbs_initialized : 1,
+
+ __reserved : 62;
};

static struct sev_config sev_cfg __read_mostly;
@@ -660,7 +672,7 @@ static void pvalidate_pages(unsigned long vaddr, unsigned long npages, bool vali
}
}

-static void __init early_set_pages_state(unsigned long paddr, unsigned long npages, enum psc_op op)
+static void early_set_pages_state(unsigned long paddr, unsigned long npages, enum psc_op op)
{
unsigned long paddr_end;
u64 val;
@@ -754,26 +766,13 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
WARN(1, "invalid memory op %d\n", op);
}

-static int vmgexit_psc(struct snp_psc_desc *desc)
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
{
int cur_entry, end_entry, ret = 0;
struct snp_psc_desc *data;
- struct ghcb_state state;
struct es_em_ctxt ctxt;
- unsigned long flags;
- struct ghcb *ghcb;

- /*
- * __sev_get_ghcb() needs to run with IRQs disabled because it is using
- * a per-CPU GHCB.
- */
- local_irq_save(flags);
-
- ghcb = __sev_get_ghcb(&state);
- if (!ghcb) {
- ret = 1;
- goto out_unlock;
- }
+ vc_ghcb_invalidate(ghcb);

/* Copy the input desc into GHCB shared buffer */
data = (struct snp_psc_desc *)ghcb->shared_buffer;
@@ -830,20 +829,18 @@ static int vmgexit_psc(struct snp_psc_desc *desc)
}

out:
- __sev_put_ghcb(&state);
-
-out_unlock:
- local_irq_restore(flags);
-
return ret;
}

static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
unsigned long vaddr_end, int op)
{
+ struct ghcb_state state;
struct psc_hdr *hdr;
struct psc_entry *e;
+ unsigned long flags;
unsigned long pfn;
+ struct ghcb *ghcb;
int i;

hdr = &data->hdr;
@@ -873,8 +870,20 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
i++;
}

- if (vmgexit_psc(data))
+ local_irq_save(flags);
+
+ if (sev_cfg.ghcbs_initialized)
+ ghcb = __sev_get_ghcb(&state);
+ else
+ ghcb = boot_ghcb;
+
+ if (!ghcb || vmgexit_psc(ghcb, data))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ if (sev_cfg.ghcbs_initialized)
+ __sev_put_ghcb(&state);
+
+ local_irq_restore(flags);
}

static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
@@ -882,6 +891,10 @@ static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
unsigned long vaddr_end, next_vaddr;
struct snp_psc_desc desc;

+ /* Use the MSR protocol when a GHCB is not available. */
+ if (!boot_ghcb)
+ return early_set_pages_state(__pa(vaddr), npages, op);
+
vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);

@@ -1259,6 +1272,8 @@ void setup_ghcb(void)
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
snp_register_per_cpu_ghcb();

+ sev_cfg.ghcbs_initialized = true;
+
return;
}

--
2.40.0

2023-04-04 17:55:36

by Tom Lendacky

Subject: [PATCH v7 5/6] x86/sev: Add SNP-specific unaccepted memory support

Add SNP-specific hooks to the unaccepted memory support in the boot
path (__accept_memory()) and the core kernel (accept_memory()) in order
to support booting SNP guests when unaccepted memory is present. Without
this support, SNP guests will fail to boot and/or panic() when unaccepted
memory is present in the EFI memory map.

The process of accepting memory under SNP involves invoking the hypervisor
to perform a page state change for the page to private memory and then
issuing a PVALIDATE instruction to accept the page.

Since the boot path and the core kernel paths perform similar operations,
move the pvalidate_pages() and vmgexit_psc() functions into sev-shared.c
to avoid code duplication.

Create the new header file arch/x86/boot/compressed/sev.h because adding
the function declaration to any of the existing SEV related header files
pulls in too many other header files, causing the build to fail.

Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/Kconfig | 2 +
arch/x86/boot/compressed/mem.c | 3 +
arch/x86/boot/compressed/sev.c | 54 ++++++++++++++-
arch/x86/boot/compressed/sev.h | 23 +++++++
arch/x86/include/asm/sev.h | 3 +
arch/x86/kernel/sev-shared.c | 103 +++++++++++++++++++++++++++++
arch/x86/kernel/sev.c | 112 ++++----------------------------
arch/x86/mm/unaccepted_memory.c | 4 ++
8 files changed, 205 insertions(+), 99 deletions(-)
create mode 100644 arch/x86/boot/compressed/sev.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 448cd869f0bd..9fd69128a7d8 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1543,11 +1543,13 @@ config X86_MEM_ENCRYPT
config AMD_MEM_ENCRYPT
bool "AMD Secure Memory Encryption (SME) support"
depends on X86_64 && CPU_SUP_AMD
+ depends on EFI_STUB
select DMA_COHERENT_POOL
select ARCH_USE_MEMREMAP_PROT
select INSTRUCTION_DECODER
select ARCH_HAS_CC_PLATFORM
select X86_MEM_ENCRYPT
+ select UNACCEPTED_MEMORY
help
Say yes to enable support for the encryption of system memory.
This requires an AMD processor that supports Secure Memory
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index e6b92e822ddd..e6ab88dcfd7e 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -6,6 +6,7 @@
#include "find.h"
#include "math.h"
#include "tdx.h"
+#include "sev.h"
#include <asm/shared/tdx.h>

#define PMD_SHIFT 21
@@ -45,6 +46,8 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
/* Platform-specific memory-acceptance call goes here */
if (early_is_tdx_guest())
tdx_accept_memory(start, end);
+ else if (sev_snp_enabled())
+ snp_accept_memory(start, end);
else
error("Cannot accept memory: unknown platform\n");
}
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 014b89c89088..09dc8c187b3c 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -115,7 +115,7 @@ static enum es_result vc_read_mem(struct es_em_ctxt *ctxt,
/* Include code for early handlers */
#include "../../kernel/sev-shared.c"

-static inline bool sev_snp_enabled(void)
+bool sev_snp_enabled(void)
{
return sev_status & MSR_AMD64_SEV_SNP_ENABLED;
}
@@ -181,6 +181,58 @@ static bool early_setup_ghcb(void)
return true;
}

+static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
+ phys_addr_t pa, phys_addr_t pa_end)
+{
+ struct psc_hdr *hdr;
+ struct psc_entry *e;
+ unsigned int i;
+
+ hdr = &desc->hdr;
+ memset(hdr, 0, sizeof(*hdr));
+
+ e = desc->entries;
+
+ i = 0;
+ while (pa < pa_end && i < VMGEXIT_PSC_MAX_ENTRY) {
+ hdr->end_entry = i;
+
+ e->gfn = pa >> PAGE_SHIFT;
+ e->operation = SNP_PAGE_STATE_PRIVATE;
+ if (IS_ALIGNED(pa, PMD_SIZE) && (pa_end - pa) >= PMD_SIZE) {
+ e->pagesize = RMP_PG_SIZE_2M;
+ pa += PMD_SIZE;
+ } else {
+ e->pagesize = RMP_PG_SIZE_4K;
+ pa += PAGE_SIZE;
+ }
+
+ e++;
+ i++;
+ }
+
+ if (vmgexit_psc(boot_ghcb, desc))
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ pvalidate_pages(desc);
+
+ return pa;
+}
+
+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ struct snp_psc_desc desc = {};
+ unsigned int i;
+ phys_addr_t pa;
+
+ if (!boot_ghcb && !early_setup_ghcb())
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
+
+ pa = start;
+ while (pa < end)
+ pa = __snp_accept_memory(&desc, pa, end);
+}
+
void sev_es_shutdown_ghcb(void)
{
if (!boot_ghcb)
diff --git a/arch/x86/boot/compressed/sev.h b/arch/x86/boot/compressed/sev.h
new file mode 100644
index 000000000000..fc725a981b09
--- /dev/null
+++ b/arch/x86/boot/compressed/sev.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * AMD SEV header for early boot related functions.
+ *
+ * Author: Tom Lendacky <[email protected]>
+ */
+
+#ifndef BOOT_COMPRESSED_SEV_H
+#define BOOT_COMPRESSED_SEV_H
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+
+bool sev_snp_enabled(void);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
+static inline bool sev_snp_enabled(void) { return false; }
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
+
+#endif
+
+#endif
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 91b4f712ef18..67e81141a873 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -201,6 +201,7 @@ void snp_set_wakeup_secondary_cpu(void);
bool snp_init(struct boot_params *bp);
void __init __noreturn snp_abort(void);
int snp_issue_guest_request(u64 exit_code, struct snp_req_data *input, unsigned long *fw_err);
+void snp_accept_memory(phys_addr_t start, phys_addr_t end);
#else
static inline void sev_es_ist_enter(struct pt_regs *regs) { }
static inline void sev_es_ist_exit(void) { }
@@ -225,6 +226,8 @@ static inline int snp_issue_guest_request(u64 exit_code, struct snp_req_data *in
{
return -ENOTTY;
}
+
+static inline void snp_accept_memory(phys_addr_t start, phys_addr_t end) { }
#endif

#endif
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index 3a5b0c9c4fcc..be312db48a49 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -12,6 +12,9 @@
#ifndef __BOOT_COMPRESSED
#define error(v) pr_err(v)
#define has_cpuflag(f) boot_cpu_has(f)
+#else
+#undef WARN
+#define WARN(condition, format...) (!!(condition))
#endif

/* I/O parameters for CPUID-related helpers */
@@ -991,3 +994,103 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
cpuid_ext_range_max = fn->eax;
}
}
+
+static void pvalidate_pages(struct snp_psc_desc *desc)
+{
+ struct psc_entry *e;
+ unsigned long vaddr;
+ unsigned int size;
+ unsigned int i;
+ bool validate;
+ int rc;
+
+ for (i = 0; i <= desc->hdr.end_entry; i++) {
+ e = &desc->entries[i];
+
+ vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
+ size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+ validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
+
+ rc = pvalidate(vaddr, size, validate);
+ if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+ unsigned long vaddr_end = vaddr + PMD_SIZE;
+
+ for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
+ rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ if (rc)
+ break;
+ }
+ }
+
+ if (rc) {
+ WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
+ }
+ }
+}
+
+static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
+{
+ int cur_entry, end_entry, ret = 0;
+ struct snp_psc_desc *data;
+ struct es_em_ctxt ctxt;
+
+ vc_ghcb_invalidate(ghcb);
+
+ /* Copy the input desc into GHCB shared buffer */
+ data = (struct snp_psc_desc *)ghcb->shared_buffer;
+ memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
+
+ /*
+ * As per the GHCB specification, the hypervisor can resume the guest
+ * before processing all the entries. Check whether all the entries
+ * are processed. If not, then keep retrying. Note, the hypervisor
+ * will update the data memory directly to indicate the status, so
+ * reference the data->hdr everywhere.
+ *
+ * The strategy here is to wait for the hypervisor to change the page
+ * state in the RMP table before guest accesses the memory pages. If the
+ * page state change was not successful, then later memory access will
+ * result in a crash.
+ */
+ cur_entry = data->hdr.cur_entry;
+ end_entry = data->hdr.end_entry;
+
+ while (data->hdr.cur_entry <= data->hdr.end_entry) {
+ ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
+
+ /* This will advance the shared buffer data points to. */
+ ret = sev_es_ghcb_hv_call(ghcb, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
+
+ /*
+ * Page State Change VMGEXIT can pass error code through
+ * exit_info_2.
+ */
+ if (WARN(ret || ghcb->save.sw_exit_info_2,
+ "SNP: PSC failed ret=%d exit_info_2=%llx\n",
+ ret, ghcb->save.sw_exit_info_2)) {
+ ret = 1;
+ goto out;
+ }
+
+ /* Verify that reserved bit is not set */
+ if (WARN(data->hdr.reserved, "Reserved bit is set in the PSC header\n")) {
+ ret = 1;
+ goto out;
+ }
+
+ /*
+ * Sanity check that entry processing is not going backwards.
+ * This will happen only if hypervisor is tricking us.
+ */
+ if (WARN(data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry,
+"SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
+ end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry)) {
+ ret = 1;
+ goto out;
+ }
+ }
+
+out:
+ return ret;
+}
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 93de70340427..48cb926e28ae 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -655,38 +655,6 @@ static u64 __init get_jump_table_addr(void)
return ret;
}

-static void pvalidate_pages(struct snp_psc_desc *desc)
-{
- struct psc_entry *e;
- unsigned long vaddr;
- unsigned int size;
- unsigned int i;
- bool validate;
- int rc;
-
- for (i = 0; i <= desc->hdr.end_entry; i++) {
- e = &desc->entries[i];
-
- vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
- size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
- validate = (e->operation == SNP_PAGE_STATE_PRIVATE) ? true : false;
-
- rc = pvalidate(vaddr, size, validate);
- if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
- unsigned long vaddr_end = vaddr + PMD_SIZE;
-
- for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
- rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
- if (rc)
- break;
- }
- }
-
- if (WARN(rc, "Failed to validate address 0x%lx ret %d", vaddr, rc))
- sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
- }
-}
-
static void early_set_pages_state(unsigned long vaddr, unsigned long paddr,
unsigned long npages, enum psc_op op)
{
@@ -794,72 +762,6 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
WARN(1, "invalid memory op %d\n", op);
}

-static int vmgexit_psc(struct ghcb *ghcb, struct snp_psc_desc *desc)
-{
- int cur_entry, end_entry, ret = 0;
- struct snp_psc_desc *data;
- struct es_em_ctxt ctxt;
-
- vc_ghcb_invalidate(ghcb);
-
- /* Copy the input desc into GHCB shared buffer */
- data = (struct snp_psc_desc *)ghcb->shared_buffer;
- memcpy(ghcb->shared_buffer, desc, min_t(int, GHCB_SHARED_BUF_SIZE, sizeof(*desc)));
-
- /*
- * As per the GHCB specification, the hypervisor can resume the guest
- * before processing all the entries. Check whether all the entries
- * are processed. If not, then keep retrying. Note, the hypervisor
- * will update the data memory directly to indicate the status, so
- * reference the data->hdr everywhere.
- *
- * The strategy here is to wait for the hypervisor to change the page
- * state in the RMP table before guest accesses the memory pages. If the
- * page state change was not successful, then later memory access will
- * result in a crash.
- */
- cur_entry = data->hdr.cur_entry;
- end_entry = data->hdr.end_entry;
-
- while (data->hdr.cur_entry <= data->hdr.end_entry) {
- ghcb_set_sw_scratch(ghcb, (u64)__pa(data));
-
- /* This will advance the shared buffer data points to. */
- ret = sev_es_ghcb_hv_call(ghcb, &ctxt, SVM_VMGEXIT_PSC, 0, 0);
-
- /*
- * Page State Change VMGEXIT can pass error code through
- * exit_info_2.
- */
- if (WARN(ret || ghcb->save.sw_exit_info_2,
- "SNP: PSC failed ret=%d exit_info_2=%llx\n",
- ret, ghcb->save.sw_exit_info_2)) {
- ret = 1;
- goto out;
- }
-
- /* Verify that reserved bit is not set */
- if (WARN(data->hdr.reserved, "Reserved bit is set in the PSC header\n")) {
- ret = 1;
- goto out;
- }
-
- /*
- * Sanity check that entry processing is not going backwards.
- * This will happen only if hypervisor is tricking us.
- */
- if (WARN(data->hdr.end_entry > end_entry || cur_entry > data->hdr.cur_entry,
-"SNP: PSC processing going backward, end_entry %d (got %d) cur_entry %d (got %d)\n",
- end_entry, data->hdr.end_entry, cur_entry, data->hdr.cur_entry)) {
- ret = 1;
- goto out;
- }
- }
-
-out:
- return ret;
-}
-
static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
unsigned long vaddr_end, int op)
{
@@ -964,6 +866,20 @@ void snp_set_memory_private(unsigned long vaddr, unsigned long npages)
set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
}

+void snp_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long vaddr;
+ unsigned int npages;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return;
+
+ vaddr = (unsigned long)__va(start);
+ npages = (end - start) >> PAGE_SHIFT;
+
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+}
+
static int snp_set_vmsa(void *va, bool vmsa)
{
u64 attrs;
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index a521f8c0987d..7e08f7a8bd63 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -8,6 +8,7 @@
#include <asm/setup.h>
#include <asm/shared/tdx.h>
#include <asm/unaccepted_memory.h>
+#include <asm/sev.h>

/* Protects unaccepted memory bitmap */
static DEFINE_SPINLOCK(unaccepted_memory_lock);
@@ -65,6 +66,9 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
tdx_accept_memory(range_start * PMD_SIZE,
range_end * PMD_SIZE);
+ } else if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ snp_accept_memory(range_start * PMD_SIZE,
+ range_end * PMD_SIZE);
} else {
panic("Cannot accept memory: unknown platform\n");
}
--
2.40.0

2023-04-04 17:57:50

by Tom Lendacky

Subject: [PATCH v7 2/6] x86/sev: Put PSC struct on the stack in prep for unaccepted memory support

In advance of providing support for unaccepted memory, switch from using
kmalloc() for allocating the Page State Change (PSC) structure to using a
local variable that lives on the stack. This is needed to avoid a possible
recursive call into set_pages_state() if the kmalloc() call requires
(more) memory to be accepted, which would result in a hang.

The current size of the PSC struct is 2,032 bytes. To make the struct more
stack friendly, reduce the number of PSC entries from 253 down to 64,
resulting in a size of 520 bytes. This is a nice compromise on struct size
and total PSC requests while still allowing parallel PSC operations across
vCPUs.

If the reduction in PSC entries results in any kind of performance issue
(that is not seen at the moment), use of a larger static PSC struct, with
fallback to the smaller stack version, can be investigated.

For more background info on this decision, see the subthread in the Link:
tag below.

Signed-off-by: Tom Lendacky <[email protected]>
Link: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com
---
arch/x86/include/asm/sev-common.h | 9 +++++++--
arch/x86/kernel/sev.c | 10 ++--------
2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index b8357d6ecd47..8ddfdbe521d4 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -106,8 +106,13 @@ enum psc_op {
#define GHCB_HV_FT_SNP BIT_ULL(0)
#define GHCB_HV_FT_SNP_AP_CREATION BIT_ULL(1)

-/* SNP Page State Change NAE event */
-#define VMGEXIT_PSC_MAX_ENTRY 253
+/*
+ * SNP Page State Change NAE event
+ * The VMGEXIT_PSC_MAX_ENTRY determines the size of the PSC structure, which
+ * is a local stack variable in set_pages_state(). Do not increase this value
+ * without evaluating the impact to stack usage.
+ */
+#define VMGEXIT_PSC_MAX_ENTRY 64

struct psc_hdr {
u16 cur_entry;
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 72ef1e1fef84..a3d76eb344d6 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -880,11 +880,7 @@ static void __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
{
unsigned long vaddr_end, next_vaddr;
- struct snp_psc_desc *desc;
-
- desc = kmalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
- if (!desc)
- panic("SNP: failed to allocate memory for PSC descriptor\n");
+ struct snp_psc_desc desc;

vaddr = vaddr & PAGE_MASK;
vaddr_end = vaddr + (npages << PAGE_SHIFT);
@@ -894,12 +890,10 @@ static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
next_vaddr = min_t(unsigned long, vaddr_end,
(VMGEXIT_PSC_MAX_ENTRY * PAGE_SIZE) + vaddr);

- __set_pages_state(desc, vaddr, next_vaddr, op);
+ __set_pages_state(&desc, vaddr, next_vaddr, op);

vaddr = next_vaddr;
}
-
- kfree(desc);
}

void snp_set_memory_shared(unsigned long vaddr, unsigned long npages)
--
2.40.0

2023-04-04 17:58:11

by Kirill A. Shutemov

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Tue, Apr 04, 2023 at 12:23:06PM -0500, Tom Lendacky wrote:
> From: Dionna Glaze <[email protected]>
>
> The UEFI v2.9 specification includes a new memory type to be used in
> environments where the OS must accept memory that is provided from its
> host. Before the introduction of this memory type, all memory was
> accepted eagerly in the firmware. In order for the firmware to safely
> stop accepting memory on the OS's behalf, the OS must affirmatively
> indicate support to the firmware. This is only a problem for AMD
> SEV-SNP, since Linux has had support for it since 5.19. The other
> technology that can make use of unaccepted memory, Intel TDX, does not
> yet have Linux support, so it can strictly require unaccepted memory
> support as a dependency of CONFIG_TDX and not require communication with
> the firmware.
>
> Enabling unaccepted memory requires calling a 0-argument enablement
> protocol before ExitBootServices. This call is only made if the kernel
> is compiled with UNACCEPTED_MEMORY=y
>
> This protocol will be removed after the end of life of the first LTS
> that includes it, in order to give firmware implementations an
> expiration date for it. When the protocol is removed, firmware will
> strictly infer that a SEV-SNP VM is running an OS that supports the
> unaccepted memory type. At the earliest convenience, when unaccepted
> memory support is added to Linux, SEV-SNP may take strict dependence in
> it. After the firmware removes support for the protocol, this patch
> should be reverted.
>
> [tl: address some checkscript warnings]
>
> Cc: Ard Biescheuvel <[email protected]>
> Cc: "Min M. Xu" <[email protected]>
> Cc: Gerd Hoffmann <[email protected]>
> Cc: James Bottomley <[email protected]>
> Cc: Tom Lendacky <[email protected]>
> Cc: Jiewen Yao <[email protected]>
> Cc: Erdem Aktas <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Signed-off-by: Dionna Glaze <[email protected]>
> Signed-off-by: Tom Lendacky <[email protected]>

I still think it is a bad idea.

As I asked before, please include my

Nacked-by: Kirill A. Shutemov <[email protected]>

into the patch.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-04 18:03:44

by Dave Hansen

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On 4/4/23 10:45, Kirill A. Shutemov wrote:
> I still think it is a bad idea.
>
> As I asked before, please include my
>
> Nacked-by: Kirill A. Shutemov <[email protected]>
>
> into the patch.

I was pretty opposed to this when I first saw it too. But, Tom and
company have worn down my opposition a bit.

The fact is that we have upstream kernels out there with SEV-SNP support
that don't know anything about unaccepted memory. They're either
relegated to using the pre-accepted memory (4GB??) or _some_ entity
needs to accept the memory. That entity obviously can't be the kernel
unless we backport unaccepted memory support.

This both lets the BIOS be the page-accepting entity _and_ allows the
entity to delegate that to the kernel when it needs to.

As much as I want to nak this and pretend that those existing
kernels don't exist, my powers of self-delusion do have their limits.

If our AMD friends don't do this, what is their alternative?

2023-04-04 18:21:19

by Kirill A. Shutemov

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Tue, Apr 04, 2023 at 10:57:52AM -0700, Dave Hansen wrote:
> On 4/4/23 10:45, Kirill A. Shutemov wrote:
> > I still think it is a bad idea.
> >
> > As I asked before, please include my
> >
> > Nacked-by: Kirill A. Shutemov <[email protected]>
> >
> > into the patch.
>
> I was pretty opposed to this when I first saw it too. But, Tom and
> company have worn down my opposition a bit.
>
> The fact is that we have upstream kernels out there with SEV-SNP support
> that don't know anything about unaccepted memory. They're either
> relegated to using the pre-accepted memory (4GB??) or _some_ entity
> needs to accept the memory. That entity obviously can't be the kernel
> unless we backport unaccepted memory support.
>
> This both lets the BIOS be the page-accepting entity _and_ allows the
> entity to delegate that to the kernel when it needs to.
>
> As much as I want to nak this and pretend that that those existing
> kernel's don't exist, my powers of self-delusion do have their limits.
>
> If our AMD friends don't do this, what is their alternative?

The alternative is coordination on the host side: the VMM can load a BIOS
that pre-accepts all memory if the kernel is older.

I know that it is not convenient for the VMM, but it is technically possible.

Introducing an ABI with an expiration date is much uglier. And nobody
will care about the expiration date until you try to remove it.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-04 19:31:56

by Dave Hansen

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On 4/4/23 11:09, Kirill A. Shutemov wrote:
>> If our AMD friends don't do this, what is their alternative?
> The alternative is coordination on the host side: VMM can load a BIOS that
> pre-accepts all memory if the kernel is older.
>
> I know that it is not convenient for VMM, but it is technically possible.

Yeah, either a specific BIOS or a knob to tell the BIOS what it has to
do. But, either way, that requires coordination between the BIOS (or
BIOS configuration) and the specific guest. I can see why that's
unpalatable.

> Introduce an ABI with an expiration date is much more ugly. And nobody
> will care about the expiration date, until you will try to remove it.

Yeah, the only real expiration date for an ABI is "never". I don't
believe for a second that we'll ever be able to remove the interface.

Either way, I'd love to hear more from folks about why a BIOS-side
option (configuration or otherwise) is not a good option. I know we've
discussed this in a few mail threads, but it would be even better to get
it into the cover letter or documentation.

2023-04-04 19:55:38

by Ard Biesheuvel

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Tue, 4 Apr 2023 at 20:09, Kirill A. Shutemov <[email protected]> wrote:
>
> On Tue, Apr 04, 2023 at 10:57:52AM -0700, Dave Hansen wrote:
> > On 4/4/23 10:45, Kirill A. Shutemov wrote:
> > > I still think it is a bad idea.
> > >
> > > As I asked before, please include my
> > >
> > > Nacked-by: Kirill A. Shutemov <[email protected]>
> > >
> > > into the patch.
> >
> > I was pretty opposed to this when I first saw it too. But, Tom and
> > company have worn down my opposition a bit.
> >
> > The fact is that we have upstream kernels out there with SEV-SNP support
> > that don't know anything about unaccepted memory. They're either
> > relegated to using the pre-accepted memory (4GB??) or _some_ entity
> > needs to accept the memory. That entity obviously can't be the kernel
> > unless we backport unaccepted memory support.
> >
> > This both lets the BIOS be the page-accepting entity _and_ allows the
> > entity to delegate that to the kernel when it needs to.
> >
> > As much as I want to nak this and pretend that that those existing
> > kernel's don't exist, my powers of self-delusion do have their limits.
> >
> > If our AMD friends don't do this, what is their alternative?
>
> The alternative is coordination on the host side: VMM can load a BIOS that
> pre-accepts all memory if the kernel is older.
>

And how does one identify such a kernel? How does the VMM know which
kernel the guest is going to load after it boots?

> I know that it is not convenient for VMM, but it is technically possible.
>
> Introduce an ABI with an expiration date is much more ugly. And nobody
> will care about the expiration date, until you will try to remove it.
>

None of us are thrilled about this, but the simple reality is that
there are kernels that do not understand unaccepted memory. EFI being
an extensible, generic, protocol-based programmatic interface, the
best way of informing the loader that a kernel does understand it is
/not/ by adding some flag to some highly arch- and OS-specific header,
but to discover a protocol and call it.

We're past arguing that a legitimate need exists for a solution to
this problem. So what solution are you proposing?

2023-04-04 20:29:35

by Kirill A. Shutemov

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Tue, Apr 04, 2023 at 09:49:52PM +0200, Ard Biesheuvel wrote:
> On Tue, 4 Apr 2023 at 20:09, Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Tue, Apr 04, 2023 at 10:57:52AM -0700, Dave Hansen wrote:
> > > On 4/4/23 10:45, Kirill A. Shutemov wrote:
> > > > I still think it is a bad idea.
> > > >
> > > > As I asked before, please include my
> > > >
> > > > Nacked-by: Kirill A. Shutemov <[email protected]>
> > > >
> > > > into the patch.
> > >
> > > I was pretty opposed to this when I first saw it too. But, Tom and
> > > company have worn down my opposition a bit.
> > >
> > > The fact is that we have upstream kernels out there with SEV-SNP support
> > > that don't know anything about unaccepted memory. They're either
> > > relegated to using the pre-accepted memory (4GB??) or _some_ entity
> > > needs to accept the memory. That entity obviously can't be the kernel
> > > unless we backport unaccepted memory support.
> > >
> > > This both lets the BIOS be the page-accepting entity _and_ allows the
> > > entity to delegate that to the kernel when it needs to.
> > >
> > > As much as I want to nak this and pretend that that those existing
> > > kernel's don't exist, my powers of self-delusion do have their limits.
> > >
> > > If our AMD friends don't do this, what is their alternative?
> >
> > The alternative is coordination on the host side: VMM can load a BIOS that
> > pre-accepts all memory if the kernel is older.
> >
>
> And how does one identify such a kernel? How does the VMM know which
> kernel the guest is going to load after it boots?

The VMM has to know what it is running. Yes, it is cumbersome. But the
enabling phase for a feature is often rough. It will get smoother over time.

> > I know that it is not convenient for VMM, but it is technically possible.
> >
> > Introduce an ABI with an expiration date is much more ugly. And nobody
> > will care about the expiration date, until you will try to remove it.
> >
>
> None of us are thrilled about this, but the simple reality is that
> there are kernels that do not understand unaccepted memory.

How is it different from any other feature the kernel is not [yet] aware
of?

Like if we boot a legacy kernel on a machine with persistent memory or
memory attached over CXL, it will not see it as conventional memory.

> EFI being
> an extensible, generic, protocol based programmatic interface, the
> best way of informing the loader that a kernel does understand it is
> /not/ by adding some flag to some highly arch and OS specific header,
> but to discover a protocol and call it.
>
> We're past arguing that a legitimate need exists for a solution to
> this problem. So what solution are you proposing?

I described the solution multiple times. You just don't like it.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-04 20:45:58

by Ard Biesheuvel

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Tue, 4 Apr 2023 at 22:24, Kirill A. Shutemov <[email protected]> wrote:
>
> On Tue, Apr 04, 2023 at 09:49:52PM +0200, Ard Biesheuvel wrote:
> > On Tue, 4 Apr 2023 at 20:09, Kirill A. Shutemov <[email protected]> wrote:
> > >
> > > On Tue, Apr 04, 2023 at 10:57:52AM -0700, Dave Hansen wrote:
> > > > On 4/4/23 10:45, Kirill A. Shutemov wrote:
> > > > > I still think it is a bad idea.
> > > > >
> > > > > As I asked before, please include my
> > > > >
> > > > > Nacked-by: Kirill A. Shutemov <[email protected]>
> > > > >
> > > > > into the patch.
> > > >
> > > > I was pretty opposed to this when I first saw it too. But, Tom and
> > > > company have worn down my opposition a bit.
> > > >
> > > > The fact is that we have upstream kernels out there with SEV-SNP support
> > > > that don't know anything about unaccepted memory. They're either
> > > > relegated to using the pre-accepted memory (4GB??) or _some_ entity
> > > > needs to accept the memory. That entity obviously can't be the kernel
> > > > unless we backport unaccepted memory support.
> > > >
> > > > This both lets the BIOS be the page-accepting entity _and_ allows the
> > > > entity to delegate that to the kernel when it needs to.
> > > >
> > > > As much as I want to nak this and pretend that that those existing
> > > > kernel's don't exist, my powers of self-delusion do have their limits.
> > > >
> > > > If our AMD friends don't do this, what is their alternative?
> > >
> > > The alternative is coordination on the host side: VMM can load a BIOS that
> > > pre-accepts all memory if the kernel is older.
> > >
> >
> > And how does one identify such a kernel? How does the VMM know which
> > kernel the guest is going to load after it boots?
>
> VMM has to know what it is running. Yes, it is cumbersome. But enabling
> phase for a feature is often rough. It will get smoother overtime.
>

So how does the VMM get informed about what it is running? How does it
distinguish between kernels that support unaccepted memory and ones
that don't? And how does it predict which kernel a guest is going to
load?

If the solution you described many times addresses these questions,
could you please share a link?

2023-04-04 21:05:14

by Kirill A. Shutemov

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Tue, Apr 04, 2023 at 10:41:02PM +0200, Ard Biesheuvel wrote:
> On Tue, 4 Apr 2023 at 22:24, Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Tue, Apr 04, 2023 at 09:49:52PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 4 Apr 2023 at 20:09, Kirill A. Shutemov <[email protected]> wrote:
> > > >
> > > > On Tue, Apr 04, 2023 at 10:57:52AM -0700, Dave Hansen wrote:
> > > > > On 4/4/23 10:45, Kirill A. Shutemov wrote:
> > > > > > I still think it is a bad idea.
> > > > > >
> > > > > > As I asked before, please include my
> > > > > >
> > > > > > Nacked-by: Kirill A. Shutemov <[email protected]>
> > > > > >
> > > > > > into the patch.
> > > > >
> > > > > I was pretty opposed to this when I first saw it too. But, Tom and
> > > > > company have worn down my opposition a bit.
> > > > >
> > > > > The fact is that we have upstream kernels out there with SEV-SNP support
> > > > > that don't know anything about unaccepted memory. They're either
> > > > > relegated to using the pre-accepted memory (4GB??) or _some_ entity
> > > > > needs to accept the memory. That entity obviously can't be the kernel
> > > > > unless we backport unaccepted memory support.
> > > > >
> > > > > This both lets the BIOS be the page-accepting entity _and_ allows the
> > > > > entity to delegate that to the kernel when it needs to.
> > > > >
> > > > > As much as I want to nak this and pretend that that those existing
> > > > > kernel's don't exist, my powers of self-delusion do have their limits.
> > > > >
> > > > > If our AMD friends don't do this, what is their alternative?
> > > >
> > > > The alternative is coordination on the host side: VMM can load a BIOS that
> > > > pre-accepts all memory if the kernel is older.
> > > >
> > >
> > > And how does one identify such a kernel? How does the VMM know which
> > > kernel the guest is going to load after it boots?
> >
> > VMM has to know what it is running. Yes, it is cumbersome. But enabling
> > phase for a feature is often rough. It will get smoother overtime.
> >
>
> So how does the VMM get informed about what it is running? How does it
> distinguish between kernels that support unaccepted memory and ones
> that don't? And how does it predict which kernel a guest is going to
> load?

The user will specify whether they want unaccepted memory for the VM. And
if they do, it is their responsibility to run a kernel that supports it.

And you have not addressed my question:

How is it different from any other feature the kernel is not [yet] aware
of?

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-05 07:53:40

by Ard Biesheuvel

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Tue, 4 Apr 2023 at 23:02, Kirill A. Shutemov <[email protected]> wrote:
>
> On Tue, Apr 04, 2023 at 10:41:02PM +0200, Ard Biesheuvel wrote:
> > On Tue, 4 Apr 2023 at 22:24, Kirill A. Shutemov <[email protected]> wrote:
> > >
> > > On Tue, Apr 04, 2023 at 09:49:52PM +0200, Ard Biesheuvel wrote:
> > > > On Tue, 4 Apr 2023 at 20:09, Kirill A. Shutemov <[email protected]> wrote:
> > > > >
> > > > > On Tue, Apr 04, 2023 at 10:57:52AM -0700, Dave Hansen wrote:
> > > > > > On 4/4/23 10:45, Kirill A. Shutemov wrote:
> > > > > > > I still think it is a bad idea.
> > > > > > >
> > > > > > > As I asked before, please include my
> > > > > > >
> > > > > > > Nacked-by: Kirill A. Shutemov <[email protected]>
> > > > > > >
> > > > > > > into the patch.
> > > > > >
> > > > > > I was pretty opposed to this when I first saw it too. But, Tom and
> > > > > > company have worn down my opposition a bit.
> > > > > >
> > > > > > The fact is that we have upstream kernels out there with SEV-SNP support
> > > > > > that don't know anything about unaccepted memory. They're either
> > > > > > relegated to using the pre-accepted memory (4GB??) or _some_ entity
> > > > > > needs to accept the memory. That entity obviously can't be the kernel
> > > > > > unless we backport unaccepted memory support.
> > > > > >
> > > > > > This both lets the BIOS be the page-accepting entity _and_ allows the
> > > > > > entity to delegate that to the kernel when it needs to.
> > > > > >
> > > > > > As much as I want to nak this and pretend that that those existing
> > > > > > kernel's don't exist, my powers of self-delusion do have their limits.
> > > > > >
> > > > > > If our AMD friends don't do this, what is their alternative?
> > > > >
> > > > > The alternative is coordination on the host side: VMM can load a BIOS that
> > > > > pre-accepts all memory if the kernel is older.
> > > > >
> > > >
> > > > And how does one identify such a kernel? How does the VMM know which
> > > > kernel the guest is going to load after it boots?
> > >
> > > VMM has to know what it is running. Yes, it is cumbersome. But enabling
> > > phase for a feature is often rough. It will get smoother overtime.
> > >
> >
> > So how does the VMM get informed about what it is running? How does it
> > distinguish between kernels that support unaccepted memory and ones
> > that don't? And how does it predict which kernel a guest is going to
> > load?
>
> User will specify if it wants unaccepted memory or not for the VM. And if
> it does it is his responsibility to have kernel that supports it.
>
> And you have not addressed my question:
>
> How is it different from any other feature the kernel is not [yet] aware
> of?
>

It is the same problem, but this is just a better solution. Having a
BIOS menu option (or similar) to choose between unaccepted memory or
not (or to expose CXL memory via the EFI memory map, which is another
hack I have seen) is just an unnecessary complication, if the kernel can
simply inform the loader about what it supports. We do this all the
time with things like OsIndications.

We can phase out the protocol implementation from the firmware once we
no longer need it, at which point the LocateProtocol() call just
becomes a NOP (we do the same thing for UGA support, which disappeared
a long time ago, but we still look for the protocol in the
EFI stub).

Once the firmware stops exposing this protocol (and ceases to accept
memory on the OS's behalf), we can phase it out from the kernel as
well.

The only other potential solution I see is exposing the unaccepted
memory as coldplugged ACPI memory objects, and implementing the accept
calls via PRM methods. But PRM has had very little test coverage, so
it is anybody's guess whether it works for the stable kernels that we
need to support with this. It would also mean that the new unaccepted
memory logic would need to be updated to cross-reference these memory
regions with EFI unaccepted memory regions and avoid claiming them
both.

2023-04-05 10:15:47

by Gerd Hoffmann

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

Hi,

> User will specify if it wants unaccepted memory or not for the VM. And if
> it does it is his responsibility to have kernel that supports it.
>
> And you have not addressed my question:
>
> How is it different from any other feature the kernel is not [yet] aware
> of?

Come on. Automatic feature negotiation is standard procedure in many
places. It's not like we're inventing something totally new here.

Just one example: when a virtio device learns a new trick, a feature flag
is added for it, and if both guest and host support it, it can be
enabled; otherwise not. There is no need for the user to configure the
virtio device features manually according to the capabilities of the
kernel it is going to boot.

take care,
Gerd

2023-04-05 13:04:03

by Dave Hansen

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On 4/5/23 00:46, Ard Biesheuvel wrote:
> Once the firmware stops exposing this protocol (and ceases to accept
> memory on the OS's behalf), we can phase it out from the kernel as
> well.

This is a part of the story that I have doubts about.

How and when do you think this phase-out would happen, realistically?

The firmware will need the unaccepted memory protocol support as long as
there are guests around that need it, right?

People like to keep running old kernels for a _long_ time. Doesn't that
mean _some_ firmware will need to keep doing this dance for a long time?

As long as there is firmware out there in the wild that people want to
run new kernels on, the support needs to stay in mainline. It can't be
dropped.

2023-04-05 13:51:49

by Kirill A. Shutemov

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Wed, Apr 05, 2023 at 09:46:59AM +0200, Ard Biesheuvel wrote:
> On Tue, 4 Apr 2023 at 23:02, Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Tue, Apr 04, 2023 at 10:41:02PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 4 Apr 2023 at 22:24, Kirill A. Shutemov <[email protected]> wrote:
> > > >
> > > > On Tue, Apr 04, 2023 at 09:49:52PM +0200, Ard Biesheuvel wrote:
> > > > > On Tue, 4 Apr 2023 at 20:09, Kirill A. Shutemov <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Apr 04, 2023 at 10:57:52AM -0700, Dave Hansen wrote:
> > > > > > > On 4/4/23 10:45, Kirill A. Shutemov wrote:
> > > > > > > > I still think it is a bad idea.
> > > > > > > >
> > > > > > > > As I asked before, please include my
> > > > > > > >
> > > > > > > > Nacked-by: Kirill A. Shutemov <[email protected]>
> > > > > > > >
> > > > > > > > into the patch.
> > > > > > >
> > > > > > > I was pretty opposed to this when I first saw it too. But, Tom and
> > > > > > > company have worn down my opposition a bit.
> > > > > > >
> > > > > > > The fact is that we have upstream kernels out there with SEV-SNP support
> > > > > > > that don't know anything about unaccepted memory. They're either
> > > > > > > relegated to using the pre-accepted memory (4GB??) or _some_ entity
> > > > > > > needs to accept the memory. That entity obviously can't be the kernel
> > > > > > > unless we backport unaccepted memory support.
> > > > > > >
> > > > > > > This both lets the BIOS be the page-accepting entity _and_ allows the
> > > > > > > entity to delegate that to the kernel when it needs to.
> > > > > > >
> > > > > > > As much as I want to nak this and pretend that that those existing
> > > > > > > kernel's don't exist, my powers of self-delusion do have their limits.
> > > > > > >
> > > > > > > If our AMD friends don't do this, what is their alternative?
> > > > > >
> > > > > > The alternative is coordination on the host side: VMM can load a BIOS that
> > > > > > pre-accepts all memory if the kernel is older.
> > > > > >
> > > > >
> > > > > And how does one identify such a kernel? How does the VMM know which
> > > > > kernel the guest is going to load after it boots?
> > > >
> > > > VMM has to know what it is running. Yes, it is cumbersome. But enabling
> > > > phase for a feature is often rough. It will get smoother overtime.
> > > >
> > >
> > > So how does the VMM get informed about what it is running? How does it
> > > distinguish between kernels that support unaccepted memory and ones
> > > that don't? And how does it predict which kernel a guest is going to
> > > load?
> >
> > User will specify if it wants unaccepted memory or not for the VM. And if
> > it does it is his responsibility to have kernel that supports it.
> >
> > And you have not addressed my question:
> >
> > How is it different from any other feature the kernel is not [yet] aware
> > of?
> >
>
> It is the same problem, but this is just a better solution.

Okay, we at least agree that there is more than one solution to the
problem.

> Having a BIOS menu option (or similar) to choose between unaccepted
> memory or not (or to expose CXL memory via the EFI memory map, which is
> another hack I have seen) is just unnecessary complication, if the
> kernel can simply inform the loader about what it supports. We do this
> all the time with things like OsIndications.

It assumes that the kernel calls ExitBootServices(), which is not always
true. A bootloader in between will make it impossible for the kernel to
use any of the features exposed this way.

But we talked about this before.

BTW, can we at least acknowledge the limitation in the commit message?

> We can phase out the protocol implementation from the firmware once we
> no longer need it, at which point the LocateProtocol() call just
> becomes a NOP (we do the same thing for UGA support, which has
> disappeared a long time ago, but we still look for the protocol in the
> EFI stub).
>
> Once the firmware stops exposing this protocol (and ceases to accept
> memory on the OS's behalf), we can phase it out from the kernel as
> well.

It is unlikely to ever happen. In a few years everybody will forget about
this conversation, regardless of what is written in the commit message.

Everything works, why bother?

> The only other potential solution I see is exposing the unaccepted
> memory as coldplugged ACPI memory objects, and implementing the accept
> calls via PRM methods. But PRM has had very little test coverage, so
> it is anybody's guess whether it works for the stable kernels that we
> need to support with this. It would also mean that the new unaccepted
> memory logic would need to be updated and cross reference these memory
> regions with EFI unaccepted memory regions and avoid claiming them
> both.

Nah. That is a lot of complexity for no particular reason.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-05 13:52:59

by Ard Biesheuvel

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Wed, 5 Apr 2023 at 15:00, Dave Hansen <[email protected]> wrote:
>
> On 4/5/23 00:46, Ard Biesheuvel wrote:
> > Once the firmware stops exposing this protocol (and ceases to accept
> > memory on the OS's behalf), we can phase it out from the kernel as
> > well.
>
> This is a part of the story that I have doubts about.
>
> How and when do you think this phase-out would happen, realistically?
>
> The firmware will need the unaccepted memory protocol support as long as
> there are guests around that need it, right?
>

Current firmware will accept all memory on behalf of the OS unless the
OS invokes the protocol to prevent it from doing so.

Future firmware will simply never accept all memory on behalf of the
OS, and not expose the protocol at all.

So the difference of opinion mainly comes down to whether the
intermediate first step is needed.

Unenlightened OS kernels will not invoke the protocol, and will
therefore need current firmware in order to see all of their memory.

Enlightened OS kernels will invoke the protocol if it exists, and so
will be able to accept their memory lazily on both current and future
firmware.

We will be able to move to future firmware once we no longer need to
support unenlightened kernels.

> People like to keep running old kernels for a _long_ time. Doesn't that
> mean _some_ firmware will need to keep doing this dance for a long time?
>

Yes.

> As long as there is firmware out there in the wild that people want to
> run new kernels on, the support needs to stay in mainline. It can't be
> dropped.

The penalty for not calling the protocol on firmware that implements
it is a much slower boot, but everything works as it should beyond
that.

Given that the intent here is to retain compatibility with
unenlightened workloads (i.e., which do not upgrade their kernels), I
think it is perfectly reasonable to drop this from mainline at some
point.

2023-04-05 13:55:14

by Ard Biesheuvel

Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Wed, 5 Apr 2023 at 15:42, Kirill A. Shutemov <[email protected]> wrote:
>
> On Wed, Apr 05, 2023 at 09:46:59AM +0200, Ard Biesheuvel wrote:
> > On Tue, 4 Apr 2023 at 23:02, Kirill A. Shutemov <[email protected]> wrote:
> > >
> > > On Tue, Apr 04, 2023 at 10:41:02PM +0200, Ard Biesheuvel wrote:
> > > > On Tue, 4 Apr 2023 at 22:24, Kirill A. Shutemov <[email protected]> wrote:
> > > > >
> > > > > On Tue, Apr 04, 2023 at 09:49:52PM +0200, Ard Biesheuvel wrote:
> > > > > > On Tue, 4 Apr 2023 at 20:09, Kirill A. Shutemov <[email protected]> wrote:
> > > > > > >
> > > > > > > On Tue, Apr 04, 2023 at 10:57:52AM -0700, Dave Hansen wrote:
> > > > > > > > On 4/4/23 10:45, Kirill A. Shutemov wrote:
> > > > > > > > > I still think it is a bad idea.
> > > > > > > > >
> > > > > > > > > As I asked before, please include my
> > > > > > > > >
> > > > > > > > > Nacked-by: Kirill A. Shutemov <[email protected]>
> > > > > > > > >
> > > > > > > > > into the patch.
> > > > > > > >
> > > > > > > > I was pretty opposed to this when I first saw it too. But, Tom and
> > > > > > > > company have worn down my opposition a bit.
> > > > > > > >
> > > > > > > > The fact is that we have upstream kernels out there with SEV-SNP support
> > > > > > > > that don't know anything about unaccepted memory. They're either
> > > > > > > > relegated to using the pre-accepted memory (4GB??) or _some_ entity
> > > > > > > > needs to accept the memory. That entity obviously can't be the kernel
> > > > > > > > unless we backport unaccepted memory support.
> > > > > > > >
> > > > > > > > This both lets the BIOS be the page-accepting entity _and_ allows the
> > > > > > > > entity to delegate that to the kernel when it needs to.
> > > > > > > >
> > > > > > > > As much as I want to nak this and pretend that that those existing
> > > > > > > > kernel's don't exist, my powers of self-delusion do have their limits.
> > > > > > > >
> > > > > > > > If our AMD friends don't do this, what is their alternative?
> > > > > > >
> > > > > > > The alternative is coordination on the host side: VMM can load a BIOS that
> > > > > > > pre-accepts all memory if the kernel is older.
> > > > > > >
> > > > > >
> > > > > > And how does one identify such a kernel? How does the VMM know which
> > > > > > kernel the guest is going to load after it boots?
> > > > >
> > > > > VMM has to know what it is running. Yes, it is cumbersome. But the
> > > > > enabling phase for a feature is often rough. It will get smoother over time.
> > > > >
> > > >
> > > > So how does the VMM get informed about what it is running? How does it
> > > > distinguish between kernels that support unaccepted memory and ones
> > > > that don't? And how does it predict which kernel a guest is going to
> > > > load?
> > >
> > > The user will specify whether they want unaccepted memory for the VM. If
> > > they do, it is their responsibility to run a kernel that supports it.
> > >
> > > And you have not addressed my question:
> > >
> > > How is it different from any other feature the kernel is not [yet] aware
> > > of?
> > >
> >
> > It is the same problem, but this is just a better solution.
>
> Okay, we at least agree that there is more than one solution to the
> problem.
>
> > Having a BIOS menu option (or similar) to choose between unaccepted
> > memory or not (or to expose CXL memory via the EFI memory map, which is
> > another hack I have seen) is just unnecessary complication, if the
> > kernel can simply inform the loader about what it supports. We do this
> > all the time with things like OsIndications.
>
> It assumes that the kernel calls ExitBootServices() itself, which is not
> always true. A bootloader in between makes it impossible for the kernel
> to use any of the features exposed this way.
>
> But we talked about this before.
>

Yes, we have. But this is a theoretical concern, as nobody who is
deploying this stuff is interested in booting the kernel without the
stub: even the trenchboot folks are bending over backwards to
incorporate execution of the kernel's EFI stub into the D-RTM
bootflow, and all of the confidential compute attestation logic is
based on EFI protocols as well. So using a bootloader that calls
ExitBootServices() and subsequently boots the Linux kernel using the
legacy boot protocol is simply not something anyone is interested in
doing. But don't take my word for it.

> BTW, can we at least acknowledge the limitation in the commit message?
>

Sure.

> > We can phase out the protocol implementation from the firmware once we
> > no longer need it, at which point the LocateProtocol() call just
> > becomes a NOP (we do the same thing for UGA support, which disappeared
> > a long time ago, but we still look for the protocol in the
> > EFI stub).
> >
> > Once the firmware stops exposing this protocol (and ceases to accept
> > memory on the OS's behalf), we can phase it out from the kernel as
> > well.
>
> It is unlikely to ever happen. In a few years everybody will forget about
> this conversation, regardless of what is written in the commit message.
>
> Everything works, why bother?
>

That is a good question. If it doesn't get in the way and does not
prevent us from doing any of the things we want to do, why would we
even care?

But as I argued in my reply to Dave, we can actually drop it from
mainline later if we provide an upgrade path for legacy workloads that
want to upgrade their kernels.

> > The only other potential solution I see is exposing the unaccepted
> > memory as coldplugged ACPI memory objects, and implementing the accept
> > calls via PRM methods. But PRM has had very little test coverage, so
> > it is anybody's guess whether it works for the stable kernels that we
> > need to support with this. It would also mean that the new unaccepted
> > memory logic would need to be updated and cross reference these memory
> > regions with EFI unaccepted memory regions and avoid claiming them
> > both.
>
> Nah. That is a lot of complexity for no particular reason.
>

Good, at least we agree on that :-)
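For readers following along, the LocateProtocol()-as-NOP pattern Ard describes (the same approach used for the long-gone UGA protocol) can be sketched roughly like this. All names and status values here are invented stand-ins for illustration, not the real EFI stub API:

```c
#include <assert.h>
#include <stddef.h>

#define EFI_SUCCESS	0
#define EFI_NOT_FOUND	14	/* simplified; real EFI status codes differ */

/* Stand-in for boot_services->LocateProtocol() on firmware that has
 * dropped the memory-acceptance protocol. */
static int locate_protocol(void **iface)
{
	*iface = NULL;
	return EFI_NOT_FOUND;
}

/* Returns 1 if the stub told firmware to skip eager acceptance,
 * 0 if the protocol is gone and the call degrades to a no-op. */
static int negotiate_unaccepted_memory(void)
{
	void *proto;

	if (locate_protocol(&proto) != EFI_SUCCESS)
		return 0;	/* harmless NOP on future firmware */
	/* ... would invoke the protocol's acceptance-control method here ... */
	return 1;
}
```

On firmware that no longer exposes the protocol, the lookup fails and the stub simply proceeds; no kernel-side change is needed.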

2023-04-05 16:18:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On 4/5/23 06:44, Ard Biesheuvel wrote:
> Given that the intent here is to retain compatibility with
> unenlightened workloads (i.e., which do not upgrade their kernels), I
> think it is perfectly reasonable to drop this from mainline at some
> point.

OK, so there are three firmware types that matter:

1. Today's SEV-SNP deployed firmware.
2. Near future SEV-SNP firmware that exposes the new ExitBootServices()
protocol that allows guests that speak the protocol to boot faster
by participating in the unaccepted memory dance.
3. Far future firmware that doesn't have the ExitBootServices() protocol

There are also three kernel types:
1. Old kernels with zero unaccepted memory support: no
ExitBootServices() protocol support and no hypercalls to accept pages
2. Kernels that can accept pages and twiddle the ExitBootServices() flag
3. Future kernels that can accept pages, but have had ExitBootServices()
support removed.

That leads to nine possible mix-and-match firmware/kernel combos. I'm
personally assuming that folks are going to *try* to run with all of
these combos and will send us kernel folks bug reports if they see
regressions. Let's just enumerate all of them and their implications
before we go consult our crystal balls about what folks will actually do
in the future.

So, here we go:

            |                  Kernel                    |
            |                                            |
            | Unenlightened | Enlightened | Dropped UEFI |
Firmware    |    ~5.19??    |   ~6.4??    |   protocol   |
------------|---------------+-------------+--------------|
Deployed    |   Slow boot   |  Slow boot  |  Slow boot   |
Near future |   Slow boot   |  Fast boot  |  Slow boot   |
Far future  |   Crashes??   |  Fast boot  |  Fast boot   |

I hope I got that all right.

The thing that worries me is the "Near future firmware" where someone
runs a ~6.4 kernel and has a fast boot experience. They upgrade to a
newer, "dropped protocol" kernel and their boot gets slower.

I'm also a little fuzzy about what an ancient enlightened kernel would
do on a "far future" firmware that requires unaccepted memory support.
I _think_ those kernels would hit some unaccepted memory, and
#VC/#VE/#whatever and die. Is that right, or is there some fallback there?

2023-04-05 19:16:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Wed, Apr 05, 2023 at 09:15:15AM -0700, Dave Hansen wrote:
> On 4/5/23 06:44, Ard Biesheuvel wrote:
> > Given that the intent here is to retain compatibility with
> > unenlightened workloads (i.e., which do not upgrade their kernels), I
> > think it is perfectly reasonable to drop this from mainline at some
> > point.
>
> OK, so there are three firmware types that matter:
>
> 1. Today's SEV-SNP deployed firmware.
> 2. Near future SEV-SNP firmware that exposes the new ExitBootServices()
> protocol that allows guests that speak the protocol to boot faster
> by participating in the unaccepted memory dance.
> 3. Far future firmware that doesn't have the ExitBootServices() protocol
>
> There are also three kernel types:
> 1. Old kernels with zero unaccepted memory support: no
> ExitBootServices() protocol support and no hypercalls to accept pages
> 2. Kernels that can accept pages and twiddle the ExitBootServices() flag
> 3. Future kernels that can accept pages, but have had ExitBootServices()
> support removed.
>
> That leads to nine possible mix-and-match firmware/kernel combos. I'm
> personally assuming that folks are going to *try* to run with all of
> these combos and will send us kernel folks bug reports if they see
> regressions. Let's just enumerate all of them and their implications
> before we go consult our crystal balls about what folks will actually do
> in the future.
>
> So, here we go:
>
>             |                  Kernel                    |
>             |                                            |
>             | Unenlightened | Enlightened | Dropped UEFI |
> Firmware    |    ~5.19??    |   ~6.4??    |   protocol   |
> ------------|---------------+-------------+--------------|
> Deployed    |   Slow boot   |  Slow boot  |  Slow boot   |
> Near future |   Slow boot   |  Fast boot  |  Slow boot   |
> Far future  |   Crashes??   |  Fast boot  |  Fast boot   |
>
> I hope I got that all right.
>
> The thing that worries me is the "Near future firmware" where someone
> runs a ~6.4 kernel and has a fast boot experience. They upgrade to a
> newer, "dropped protocol" kernel and their boot gets slower.
>
> I'm also a little fuzzy about what an ancient enlightened kernel would
> do on a "far future" firmware that requires unaccepted memory support.
> I _think_ those kernels would hit some unaccepted memory, and
> #VC/#VE/#whatever and die. Is that right, or is there some fallback there?

The far future firmware in this scheme would expose unaccepted memory in
the EFI memory map without requiring the kernel to declare unaccepted
memory support. An unenlightened kernel will not be able to use that
memory and will consider it reserved; only memory accepted by the firmware
will be accessible. Depending on how much memory the firmware pre-accepts,
the guest could OOM, but more likely it will boot fine with only a
fraction of memory usable.
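A toy model of what the unenlightened kernel does here: it counts only memory map types it recognizes as usable, so the new unaccepted type simply falls through to "reserved". The type values follow the UEFI 2.9 numbering, but the struct is a simplified stand-in, not the real efi_memory_desc_t:

```c
#include <assert.h>
#include <stdint.h>

#define EFI_CONVENTIONAL_MEMORY	 7
#define EFI_UNACCEPTED_MEMORY	15	/* unknown to pre-unaccepted kernels */

struct efi_desc {
	uint32_t type;
	uint64_t num_pages;
};

/* An old kernel's view: count only types it recognizes as usable RAM. */
static uint64_t usable_pages(const struct efi_desc *map, int nr)
{
	uint64_t total = 0;

	for (int i = 0; i < nr; i++)
		if (map[i].type == EFI_CONVENTIONAL_MEMORY)
			total += map[i].num_pages;
	return total;	/* unaccepted ranges stay reserved, no crash */
}
```

The kernel never touches the unrecognized ranges, so it cannot fault on unaccepted pages; it just sees less RAM.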

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-05 20:19:07

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On 4/5/23 14:06, Kirill A. Shutemov wrote:
> On Wed, Apr 05, 2023 at 09:15:15AM -0700, Dave Hansen wrote:
>> On 4/5/23 06:44, Ard Biesheuvel wrote:
>>> Given that the intent here is to retain compatibility with
>>> unenlightened workloads (i.e., which do not upgrade their kernels), I
>>> think it is perfectly reasonable to drop this from mainline at some
>>> point.
>>
>> OK, so there are three firmware types that matter:
>>
>> 1. Today's SEV-SNP deployed firmware.

SNP support was originally made available as part of the edk2-stable202202 release.

>> 2. Near future SEV-SNP firmware that exposes the new ExitBootServices()
>> protocol that allows guests that speak the protocol to boot faster
>> by participating in the unaccepted memory dance.

This is already out and available as part of the edk2-stable202302 release.

But it did come out after general SNP support, so the near future
terminology works.

>> 3. Far future firmware that doesn't have the ExitBootServices() protocol
>>
>> There are also three kernel types:
>> 1. Old kernels with zero unaccepted memory support: no
>> ExitBootServices() protocol support and no hypercalls to accept pages
>> 2. Kernels that can accept pages and twiddle the ExitBootServices() flag
>> 3. Future kernels that can accept pages, but have had ExitBootServices()
>> support removed.
>>
>> That leads to nine possible mix-and-match firmware/kernel combos. I'm
>> personally assuming that folks are going to *try* to run with all of
>> these combos and will send us kernel folks bug reports if they see
>> regressions. Let's just enumerate all of them and their implications
>> before we go consult our crystal balls about what folks will actually do
>> in the future.
>>
>> So, here we go:
>>
>>             |                  Kernel                    |
>>             |                                            |
>>             | Unenlightened | Enlightened | Dropped UEFI |
>> Firmware    |    ~5.19??    |   ~6.4??    |   protocol   |
>> ------------|---------------+-------------+--------------|
>> Deployed    |   Slow boot   |  Slow boot  |  Slow boot   |
>> Near future |   Slow boot   |  Fast boot  |  Slow boot   |
>> Far future  |   Crashes??   |  Fast boot  |  Fast boot   |
>>
>> I hope I got that all right.

Looks correct to me (with Kirill's description below in place of the
"Crashes??").

>>
>> The thing that worries me is the "Near future firmware" where someone
>> runs a ~6.4 kernel and has a fast boot experience. They upgrade to a
>> newer, "dropped protocol" kernel and their boot gets slower.

Right, and that raises the question of when to actually drop the call. Or
does it really need to be dropped? It's a small patch to execute a boot
services call; I guess I don't see the big deal of it being there. If the
firmware still has the protocol, the call is made; if it doesn't, it's not.
In the overall support for unaccepted memory, this seems to be a very
minor piece.

>>
>> I'm also a little fuzzy about what an ancient enlightened kernel would
>> do on a "far future" firmware that requires unaccepted memory support.
>> I _think_ those kernels would hit some unaccepted memory, and
>> #VC/#VE/#whatever and die. Is that right, or is there some fallback there?
>
> The far future firmware in this scheme would expose unaccepted memory in
> the EFI memory map without requiring the kernel to declare unaccepted
> memory support. An unenlightened kernel will not be able to use that
> memory and will consider it reserved; only memory accepted by the firmware
> will be accessible. Depending on how much memory the firmware pre-accepts,
> the guest could OOM, but more likely it will boot fine with only a
> fraction of memory usable.

Right, since a typical QEMU VM has a 2GB hole for PCI/MMIO, the guest is
likely to see only 2GB of memory available to it.

Thanks,
Tom

>

2023-04-05 21:27:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On 4/5/23 13:11, Tom Lendacky wrote:
>>> The thing that worries me is the "Near future firmware" where someone
>>> runs a ~6.4 kernel and has a fast boot experience.  They upgrade to a
>>> newer, "dropped protocol" kernel and their boot gets slower.
>
> Right, and that raises the question of when to actually drop the call.
> Or does it really need to be dropped? It's a small patch to execute a
> boot services call; I guess I don't see the big deal of it being there.
> If the firmware still has the protocol, the call is made; if it doesn't,
> it's not. In the overall support for unaccepted memory, this seems to be
> a very minor piece.

I honestly don't think it's a big deal either, at least on the kernel
side. Maybe it's a bigger deal to the firmware folks on their side.

So, the corrected table looks something like this:

            |                  Kernel                    |
            |                                            |
            | Unenlightened | Enlightened | Dropped UEFI |
Firmware    |    ~5.19??    |   ~6.4??    |   protocol   |
------------|---------------+-------------+--------------|
Deployed    |   Slow boot   |  Slow boot  |  Slow boot   |
Near future |   Slow boot   |  Fast boot  |  Slow boot   |
Far future  |  2GB limited  |  Fast boot  |  Fast boot   |


But, honestly, I don't see much benefit to the "dropped UEFI protocol".
It adds complexity and will represent a regression either in boot
speeds, or in unenlightened kernels losing RAM when moving to newer
firmware. Neither of those is great.

Looking at this _purely_ from the kernel perspective, I think I'd prefer
this situation:

            |           Kernel            |
            |                             |
            | Unenlightened | Enlightened |
Firmware    |    ~5.19??    |   ~6.4??    |
------------|---------------+-------------+
Deployed    |   Slow boot   |  Slow boot  |
Future      |   Slow boot   |  Fast boot  |

and not have future firmware drop support for the handshake protocol.
That way there are no potential regressions.

Is there a compelling reason on the firmware side to drop the
ExitBootServices() protocol that I'm missing?
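Dave's nine-combo matrix can be written down as a small lookup table, which makes the "dropped protocol" regression he worries about easy to point at. The enum and string names below are invented for this sketch; they do not appear in the patches:

```c
#include <assert.h>
#include <string.h>

enum firmware { FW_DEPLOYED, FW_NEAR_FUTURE, FW_FAR_FUTURE };
enum kernel   { K_UNENLIGHTENED, K_ENLIGHTENED, K_DROPPED_PROTOCOL };

/* Boot outcome for each firmware/kernel combination from the table. */
static const char *boot_outcome[3][3] = {
	/*                 unenlightened   enlightened  dropped proto */
	[FW_DEPLOYED]    = { "slow",         "slow",      "slow" },
	[FW_NEAR_FUTURE] = { "slow",         "fast",      "slow" },
	[FW_FAR_FUTURE]  = { "2GB-limited",  "fast",      "fast" },
};
```

The regression is visible in the FW_NEAR_FUTURE row: an enlightened kernel boots fast, but upgrading to a dropped-protocol kernel falls back to slow boot.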

2023-04-05 21:35:42

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v7 6/6] x86/efi: Safely enable unaccepted memory in UEFI

On Wed, 5 Apr 2023 at 23:23, Dave Hansen <[email protected]> wrote:
>
> On 4/5/23 13:11, Tom Lendacky wrote:
> >>> The thing that worries me is the "Near future firmware" where someone
> >>> runs a ~6.4 kernel and has a fast boot experience. They upgrade to a
> >>> newer, "dropped protocol" kernel and their boot gets slower.
> >
> > Right, and that raises the question of when to actually drop the call.
> > Or does it really need to be dropped? It's a small patch to execute
> > a boot services call; I guess I don't see the big deal of it being there.
> > If the firmware still has the protocol, the call is made; if it doesn't,
> > it's not. In the overall support for unaccepted memory, this seems to be
> > a very minor piece.
>
> I honestly don't think it's a big deal either, at least on the kernel
> side. Maybe it's a bigger deal to the firmware folks on their side.
>
> So, the corrected table looks something like this:
>
>             |                  Kernel                    |
>             |                                            |
>             | Unenlightened | Enlightened | Dropped UEFI |
> Firmware    |    ~5.19??    |   ~6.4??    |   protocol   |
> ------------|---------------+-------------+--------------|
> Deployed    |   Slow boot   |  Slow boot  |  Slow boot   |
> Near future |   Slow boot   |  Fast boot  |  Slow boot   |
> Far future  |  2GB limited  |  Fast boot  |  Fast boot   |
>

I don't think there is any agreement on the firmware side on what
constitutes a reasonable minimum to accept when lazy accept is in use, so
the 2 GiB is really the upper bound here, and it could be substantially
less.

>
> But, honestly, I don't see much benefit to the "dropped UEFI protocol".
> It adds complexity and will represent a regression either in boot
> speeds, or in unenlightened kernels losing RAM when moving to newer
> firmware. Neither of those is great.
>
> Looking at this _purely_ from the kernel perspective, I think I'd prefer
> this situation:
>
>             |           Kernel            |
>             |                             |
>             | Unenlightened | Enlightened |
> Firmware    |    ~5.19??    |   ~6.4??    |
> ------------|---------------+-------------+
> Deployed    |   Slow boot   |  Slow boot  |
> Future      |   Slow boot   |  Fast boot  |
>
> and not have future firmware drop support for the handshake protocol.
> That way there are no potential regressions.
>
> Is there a compelling reason on the firmware side to drop the
> ExitBootServices() protocol that I'm missing?

The protocol only exists to stop the firmware from eagerly accepting
all memory on behalf of the OS. So from the firmware side, it would be
more about removing that functionality (making the protocol call moot)
rather than removing the protocol itself.

2023-04-16 19:27:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv9 00/14] mm, x86/cc: Implement support for unaccepted memory

On Mon, Apr 03, 2023 at 04:42:54PM +0200, Vlastimil Babka wrote:
> Hmm yeah it can be noisy. Did you try to only count events that have
> fragmenting=1 and/or MIGRATE_MOVABLE as fallback_migratetype? As those are
> the really bad events.

I finally got around to retest it.

            total  fragmenting  movable  fragmenting&&movable
base-1:       957          583      353                     0
base-2:      2715         2343      359                     0
base-3:      2033         1669      353                     0
patched-1:   1325          929      371                     0
patched-2:   2844         2451      371                     0
patched-3:   1304          917      361                     0

fragmenting=1 is defined as fallback_order < pageblock_order, which covers
most of them.
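The tallying described here can be sketched as follows. Field names mirror the mm_page_alloc_extfrag tracepoint; the struct and the sample values fed in are made up for illustration:

```c
#include <assert.h>

#define MIGRATE_MOVABLE 1	/* matches fallback_migratetype=1 above */

struct extfrag_event {
	int fallback_order;
	int fallback_migratetype;
};

struct counts {
	int total, fragmenting, movable, both;
};

static struct counts count_events(const struct extfrag_event *ev, int nr,
				  int pageblock_order)
{
	struct counts c = { 0 };

	for (int i = 0; i < nr; i++) {
		int frag = ev[i].fallback_order < pageblock_order;
		int mov  = ev[i].fallback_migratetype == MIGRATE_MOVABLE;

		c.total++;
		c.fragmenting += frag;
		c.movable += mov;
		c.both += frag && mov;	/* the "really bad" events */
	}
	return c;
}
```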

The patched kernel showed slightly elevated movable (fallback_migratetype=1)
counts. Is that critical?

There are no allocations that are both fragmenting and movable. Hm.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-04-17 07:56:50

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCHv9 00/14] mm, x86/cc: Implement support for unaccepted memory

On 4/16/23 21:19, Kirill A. Shutemov wrote:
> On Mon, Apr 03, 2023 at 04:42:54PM +0200, Vlastimil Babka wrote:
>> Hmm yeah it can be noisy. Did you try to only count events that have
>> fragmenting=1 and/or MIGRATE_MOVABLE as fallback_migratetype? As those are
>> the really bad events.
>
> I finally got around to retest it.
>
>             total  fragmenting  movable  fragmenting&&movable
> base-1:       957          583      353                     0
> base-2:      2715         2343      359                     0
> base-3:      2033         1669      353                     0
> patched-1:   1325          929      371                     0
> patched-2:   2844         2451      371                     0
> patched-3:   1304          917      361                     0
>
> fragmenting=1 is defined as fallback_order < pageblock_order, which covers
> most of them.
>
> The patched kernel showed slightly elevated movable (fallback_migratetype=1)
> counts. Is that critical?

Maybe it's still not statistically significant anyway, and it's also not as
critical as fragmenting&&movable.

> There are no allocations that are both fragmenting and movable. Hm.

It probably means your test wasn't stressful enough to inflict a mix of
rapid movable and unmovable allocations when memory is nearly full. But at
that point the memory is all accepted, so we don't need such a scenario.
The important thing is that these kinds of events didn't start happening
during the gradual memory-accepting phase.