2023-05-08 00:12:32

by Kirill A. Shutemov

Subject: [PATCHv10 00/11] mm, x86/cc: Implement support for unaccepted memory

UEFI Specification version 2.9 introduces the concept of memory
acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes the VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed. This lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. It leads
to table fragmentation, but there's a limited number of entries in the
e820 table.

Another option is to mark such memory as usable in e820 and track if the
range has been accepted in a bitmap. One bit in the bitmap represents
2MiB in the address space: one 4k page is enough to track 64GiB of
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
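
The sizing figures above can be checked with a small stand-alone sketch.
This is not code from the series: UNIT_SIZE and both helper names are
made up for illustration, and only the "one bit per 2MiB" rule is taken
from the text.

```c
#include <assert.h>
#include <stdint.h>

#define UNIT_SIZE	(UINT64_C(2) << 20)	/* 2MiB tracked per bit */
#define BITS_PER_BYTE	8

/* Bytes of bitmap needed to cover physical addresses [0, max_addr). */
static uint64_t unaccepted_bitmap_bytes(uint64_t max_addr)
{
	uint64_t covered_per_byte = UNIT_SIZE * BITS_PER_BYTE; /* 16MiB */

	/* Guard the round-up below against wrapping around. */
	assert(max_addr <= UINT64_MAX - (covered_per_byte - 1));

	return (max_addr + covered_per_byte - 1) / covered_per_byte;
}

/* Amount of address space a bitmap of bitmap_bytes bytes can describe. */
static uint64_t bitmap_coverage(uint64_t bitmap_bytes)
{
	return bitmap_bytes * BITS_PER_BYTE * UNIT_SIZE;
}
```

With these helpers a 4096-byte bitmap covers 64GiB, and a 4PiB address
space needs a 256MiB bitmap, matching the numbers quoted above.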

Any unaccepted memory that is not aligned to 2M gets accepted upfront.
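
One way to picture the alignment rule is to split a range into an
unaligned head and tail, which are accepted immediately, and a
2MiB-aligned middle, which is only recorded in the bitmap. The sketch
below is a simplified illustration under that assumption; struct split
and split_range() are hypothetical names, and the exact thresholds used
by the patches may differ.

```c
#include <assert.h>
#include <stdint.h>

#define UNIT_SIZE	(UINT64_C(2) << 20)	/* 2MiB */

struct split {
	uint64_t head_start, head_end;	/* accepted upfront */
	uint64_t mid_start, mid_end;	/* tracked in the bitmap */
	uint64_t tail_start, tail_end;	/* accepted upfront */
};

/* Split [start, end) at 2MiB boundaries. */
static struct split split_range(uint64_t start, uint64_t end)
{
	struct split s;
	uint64_t up, down;

	assert(start <= end);
	assert(start <= UINT64_MAX - (UNIT_SIZE - 1));

	up = (start + UNIT_SIZE - 1) & ~(UNIT_SIZE - 1);	/* round up */
	down = end & ~(UNIT_SIZE - 1);				/* round down */

	if (up >= down) {
		/* No whole 2MiB unit inside: everything is accepted upfront. */
		s.head_start = start;
		s.head_end = end;
		s.mid_start = s.mid_end = end;
		s.tail_start = s.tail_end = end;
		return s;
	}

	s.head_start = start;
	s.head_end = up;
	s.mid_start = up;
	s.mid_end = down;
	s.tail_start = down;
	s.tail_end = end;
	return s;
}
```

For example, a range from 1MiB to 7MiB yields a head of [1MiB, 2MiB), a
bitmap-tracked middle of [2MiB, 6MiB) and a tail of [6MiB, 7MiB).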

The approach lowers boot time substantially. Boot to shell is ~2.5x
faster for 4G TDX VM and ~4x faster for 64G.

TDX-specific code is isolated from the core of unaccepted memory support.
This is supposed to make it easier to plug in different implementations of
unaccepted memory, such as SEV-SNP.

-- Fragmentation study --

Vlastimil and Mel were concerned about the effect of unaccepted memory on
fragmentation prevention measures in the page allocator. I tried to evaluate
it, but it is tricky. As suggested, I tried to run multiple parallel kernel
builds and follow how often kmem:mm_page_alloc_extfrag gets hit.

See the results in v9 of the patchset[1][2].

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]

--

The tree can be found here:

https://github.com/intel/tdx.git guest-unaccepted-memory

v10:
- Rebased to v6.4-rc1;
- Restructure code around zones_with_unaccepted_pages static branch to avoid
unnecessary function calls (Suggested by Vlastimil);
- Drop mentions of PageUnaccepted();
- Drop patches that add fake unaccepted memory support and sysfs handle to
accept memory manually;
- Add Reviewed-by from Vlastimil;
v9:
- Accept memory up to high watermark when kernel runs out of free memory;
- Treat unaccepted memory as unusable in __zone_watermark_unusable_free();
- Per-zone unaccepted memory accounting;
- All pages on unaccepted list are MAX_ORDER now;
- accept_memory=eager in cmdline to pre-accept memory during the boot;
- Implement fake unaccepted memory;
- Sysfs handle to accept memory manually;
- Drop PageUnaccepted();
- Rename unaccepted_pages static key to zones_with_unaccepted_pages;
v8:
- Rewrite core-mm support for unaccepted memory (patch 02/14);
- s/UnacceptedPages/Unaccepted/ in meminfo;
- Drop arch/x86/boot/compressed/compiler.h;
- Fix build errors;
- Adjust commit messages and comments;
- Reviewed-bys from Dave and Borislav;
- Rebased to tip/master.
v7:
- Rework meminfo counter to use PageUnaccepted() and move to generic code;
- Fix range_contains_unaccepted_memory() on machines without unaccepted memory;
- Add Reviewed-by from David;
v6:
- Fix load_unaligned_zeropad() on machine with unaccepted memory;
- Clear PageUnaccepted() on merged pages, leaving it only on head;
- Clarify error handling in allocate_e820();
- Fix build with CONFIG_UNACCEPTED_MEMORY=y, but without TDX;
- Disable kexec at boottime instead of build conflict;
- Rebased to tip/master;
- Spelling fixes;
- Add Reviewed-by from Mike and David;
v5:
- Updates comments and commit messages;
+ Explain options for unaccepted memory handling;
- Expose amount of unaccepted memory in /proc/meminfo
- Adjust check in page_expected_state();
- Fix error code handling in allocate_e820();
- Centralize __pa()/__va() definitions in the boot stub;
- Avoid includes from the main kernel in the boot stub;
- Use an existing hole in boot_param for unaccepted_memory, instead of adding
to the end of the structure;
- Extract allocate_unaccepted_memory() from allocate_e820();
- Complain if there's unaccepted memory, but kernel does not support it;
- Fix vmstat counter;
- Split up few preparatory patches;
- Random readability adjustments;
v4:
- PageBuddyUnaccepted() -> PageUnaccepted;
- Use separate page_type, not shared with offline;
- Rework interface between core-mm and arch code;
- Adjust commit messages;
- Ack from Mike;

Kirill A. Shutemov (11):
mm: Add support for unaccepted memory
efi/x86: Get full memory map in allocate_e820()
x86/boot: Add infrastructure required for unaccepted memory support
efi/x86: Implement support for unaccepted memory
x86/boot/compressed: Handle unaccepted memory
x86/mm: Reserve unaccepted memory bitmap
x86/mm: Provide helpers for unaccepted memory
x86/mm: Avoid load_unaligned_zeropad() stepping into unaccepted memory
x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in
boot stub
x86/tdx: Refactor try_accept_one()
x86/tdx: Add unaccepted memory support

Documentation/arch/x86/zero-page.rst | 1 +
arch/x86/Kconfig | 2 +
arch/x86/boot/bitops.h | 40 ++++++
arch/x86/boot/compressed/Makefile | 3 +-
arch/x86/boot/compressed/align.h | 14 ++
arch/x86/boot/compressed/bitmap.c | 43 ++++++
arch/x86/boot/compressed/bitmap.h | 49 +++++++
arch/x86/boot/compressed/bits.h | 36 +++++
arch/x86/boot/compressed/efi.h | 1 +
arch/x86/boot/compressed/error.c | 19 +++
arch/x86/boot/compressed/error.h | 1 +
arch/x86/boot/compressed/find.c | 54 +++++++
arch/x86/boot/compressed/find.h | 79 +++++++++++
arch/x86/boot/compressed/kaslr.c | 35 +++--
arch/x86/boot/compressed/math.h | 37 +++++
arch/x86/boot/compressed/mem.c | 122 ++++++++++++++++
arch/x86/boot/compressed/minmax.h | 61 ++++++++
arch/x86/boot/compressed/misc.c | 6 +
arch/x86/boot/compressed/misc.h | 6 +
arch/x86/boot/compressed/pgtable_types.h | 25 ++++
arch/x86/boot/compressed/tdx-shared.c | 2 +
arch/x86/boot/compressed/tdx.c | 39 +++++
arch/x86/coco/tdx/Makefile | 2 +-
arch/x86/coco/tdx/tdx-shared.c | 95 +++++++++++++
arch/x86/coco/tdx/tdx.c | 118 +---------------
arch/x86/include/asm/page.h | 3 +
arch/x86/include/asm/shared/tdx.h | 53 +++++++
arch/x86/include/asm/tdx.h | 21 +--
arch/x86/include/asm/unaccepted_memory.h | 16 +++
arch/x86/include/uapi/asm/bootparam.h | 2 +-
arch/x86/kernel/e820.c | 17 +++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/unaccepted_memory.c | 101 +++++++++++++
drivers/base/node.c | 7 +
drivers/firmware/efi/Kconfig | 14 ++
drivers/firmware/efi/efi.c | 1 +
drivers/firmware/efi/libstub/x86-stub.c | 98 +++++++++++--
fs/proc/meminfo.c | 5 +
include/linux/efi.h | 3 +-
include/linux/mmzone.h | 8 ++
mm/internal.h | 13 ++
mm/memblock.c | 9 ++
mm/mm_init.c | 7 +
mm/page_alloc.c | 173 +++++++++++++++++++++++
mm/vmstat.c | 3 +
45 files changed, 1280 insertions(+), 166 deletions(-)
create mode 100644 arch/x86/boot/compressed/align.h
create mode 100644 arch/x86/boot/compressed/bitmap.c
create mode 100644 arch/x86/boot/compressed/bitmap.h
create mode 100644 arch/x86/boot/compressed/bits.h
create mode 100644 arch/x86/boot/compressed/find.c
create mode 100644 arch/x86/boot/compressed/find.h
create mode 100644 arch/x86/boot/compressed/math.h
create mode 100644 arch/x86/boot/compressed/mem.c
create mode 100644 arch/x86/boot/compressed/minmax.h
create mode 100644 arch/x86/boot/compressed/pgtable_types.h
create mode 100644 arch/x86/boot/compressed/tdx-shared.c
create mode 100644 arch/x86/coco/tdx/tdx-shared.c
create mode 100644 arch/x86/include/asm/unaccepted_memory.h
create mode 100644 arch/x86/mm/unaccepted_memory.c

--
2.39.3


2023-05-08 00:20:54

by Kirill A. Shutemov

Subject: [PATCHv10 06/11] x86/mm: Reserve unaccepted memory bitmap

A given page of memory can only be accepted once. The kernel has to
accept memory both in the early decompression stage and during normal
runtime.

A bitmap is used to communicate the acceptance state of each page
between the decompression stage and normal runtime.

boot_params is used to communicate the location of the bitmap throughout
the boot. The bitmap is allocated and initially populated in the EFI stub.
The decompression stage accepts pages required for kernel/initrd and marks
these pages accordingly in the bitmap. The main kernel picks up the
bitmap from the same boot_params and uses it to determine what has to
be accepted on allocation.
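
The handoff can be modelled in user space as follows. This is only a
sketch, assuming that a set bit means "this 2MiB unit is still
unaccepted"; accept_units(), test_bit_(), clear_bit_() and the
accept_unit callback are hypothetical names, not functions from the
series.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

static int test_bit_(const unsigned long *map, size_t nr)
{
	return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

static void clear_bit_(unsigned long *map, size_t nr)
{
	map[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
}

/*
 * Accept every still-unaccepted 2MiB unit in [first, last) and clear its
 * bit, so that no unit is ever accepted twice. accept_unit stands in for
 * the platform-specific acceptance call and may be NULL.
 * Returns the number of units accepted by this call.
 */
static size_t accept_units(unsigned long *map, size_t first, size_t last,
			   void (*accept_unit)(size_t unit))
{
	size_t nr, accepted = 0;

	assert(first <= last);

	for (nr = first; nr < last; nr++) {
		if (!test_bit_(map, nr))
			continue;	/* accepted by an earlier stage */
		if (accept_unit)
			accept_unit(nr);
		clear_bit_(map, nr);
		accepted++;
	}

	return accepted;
}
```

Because each stage clears the bits of the units it accepts, a later
stage that walks the same bitmap skips them, which is how the
"accepted only once" requirement is kept across the decompressor and
the runtime kernel in this model.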

In the runtime kernel, reserve the bitmap's memory to ensure nothing
overwrites it.

The size of the bitmap is determined with e820__end_of_ram_pfn(), which
relies on setup_e820() marking unaccepted memory as E820_TYPE_RAM.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Mike Rapoport <[email protected]>
---
arch/x86/kernel/e820.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index fb8cf953380d..483c36a28d2e 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1316,6 +1316,23 @@ void __init e820__memblock_setup(void)
int i;
u64 end;

+ /*
+ * Mark unaccepted memory bitmap reserved.
+ *
+ * This kind of reservation is usually done from early_reserve_memory(),
+ * but early_reserve_memory() is called before e820__memory_setup(), so
+ * e820_table is not finalized and e820__end_of_ram_pfn() cannot be
+ * used to get correct RAM size.
+ */
+ if (boot_params.unaccepted_memory) {
+ unsigned long size;
+
+ /* One bit per 2MB */
+ size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
+ PMD_SIZE * BITS_PER_BYTE);
+ memblock_reserve(boot_params.unaccepted_memory, size);
+ }
+
/*
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
--
2.39.3

2023-05-08 00:24:51

by Kirill A. Shutemov

Subject: [PATCHv10 02/11] efi/x86: Get full memory map in allocate_e820()

Currently allocate_e820() is only interested in the size of the map and
the size of a memory descriptor to determine how many e820 entries the
kernel needs.

UEFI Specification version 2.9 introduces a new memory type --
unaccepted memory. To track unaccepted memory, the kernel needs to
allocate a bitmap. The size of the bitmap depends on the maximum physical
address present in the system. A full memory map is required to find
the maximum address.
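
To illustrate why the full map is needed, here is a minimal sketch of
finding the highest address by walking every descriptor. struct
mem_desc is a simplified, hypothetical stand-in for the EFI memory
descriptor (the real one has more fields), and max_phys_addr() is not a
function from this series; only the 4KiB EFI page size is taken as
given.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EFI_PAGE_SHIFT	12	/* EFI pages are 4KiB */

/* Simplified, hypothetical view of one memory map entry. */
struct mem_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t num_pages;
};

/* Return the highest end address described by any entry in the map. */
static uint64_t max_phys_addr(const struct mem_desc *map, size_t nr_desc)
{
	uint64_t max = 0;
	size_t i;

	assert(map != NULL || nr_desc == 0);

	for (i = 0; i < nr_desc; i++) {
		uint64_t end = map[i].phys_addr +
			       (map[i].num_pages << EFI_PAGE_SHIFT);

		if (end > max)
			max = end;
	}

	return max;
}
```

The map size and descriptor size alone say how many entries exist, but
not where they end, so the old size-only query cannot provide this
value.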

Modify allocate_e820() to get a full memory map.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Borislav Petkov <[email protected]>
---
drivers/firmware/efi/libstub/x86-stub.c | 26 +++++++++++--------------
1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0bfd31358ba..fff81843169c 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -681,28 +681,24 @@ static efi_status_t allocate_e820(struct boot_params *params,
struct setup_data **e820ext,
u32 *e820ext_size)
{
- unsigned long map_size, desc_size, map_key;
+ struct efi_boot_memmap *map;
efi_status_t status;
- __u32 nr_desc, desc_version;
+ __u32 nr_desc;

- /* Only need the size of the mem map and size of each mem descriptor */
- map_size = 0;
- status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
- &desc_size, &desc_version);
- if (status != EFI_BUFFER_TOO_SMALL)
- return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
+ status = efi_get_memory_map(&map, false);
+ if (status != EFI_SUCCESS)
+ return status;

- nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
-
- if (nr_desc > ARRAY_SIZE(params->e820_table)) {
- u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
+ nr_desc = map->map_size / map->desc_size;
+ if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
+ u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
+ EFI_MMAP_NR_SLACK_SLOTS;

status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
- if (status != EFI_SUCCESS)
- return status;
}

- return EFI_SUCCESS;
+ efi_bs_call(free_pool, map);
+ return status;
}

struct exit_boot_struct {
--
2.39.3

2023-05-08 00:26:40

by Kirill A. Shutemov

Subject: [PATCHv10 03/11] x86/boot: Add infrastructure required for unaccepted memory support

Pull functionality from the main kernel headers and lib/ that is
required for unaccepted memory support.

This is a preparatory patch. The users of the functionality will come in
the following patches.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Borislav Petkov (AMD) <[email protected]>
---
arch/x86/boot/bitops.h | 40 ++++++++++++
arch/x86/boot/compressed/align.h | 14 +++++
arch/x86/boot/compressed/bitmap.c | 43 +++++++++++++
arch/x86/boot/compressed/bitmap.h | 49 +++++++++++++++
arch/x86/boot/compressed/bits.h | 36 +++++++++++
arch/x86/boot/compressed/find.c | 54 ++++++++++++++++
arch/x86/boot/compressed/find.h | 79 ++++++++++++++++++++++++
arch/x86/boot/compressed/math.h | 37 +++++++++++
arch/x86/boot/compressed/minmax.h | 61 ++++++++++++++++++
arch/x86/boot/compressed/pgtable_types.h | 25 ++++++++
10 files changed, 438 insertions(+)
create mode 100644 arch/x86/boot/compressed/align.h
create mode 100644 arch/x86/boot/compressed/bitmap.c
create mode 100644 arch/x86/boot/compressed/bitmap.h
create mode 100644 arch/x86/boot/compressed/bits.h
create mode 100644 arch/x86/boot/compressed/find.c
create mode 100644 arch/x86/boot/compressed/find.h
create mode 100644 arch/x86/boot/compressed/math.h
create mode 100644 arch/x86/boot/compressed/minmax.h
create mode 100644 arch/x86/boot/compressed/pgtable_types.h

diff --git a/arch/x86/boot/bitops.h b/arch/x86/boot/bitops.h
index 8518ae214c9b..38badf028543 100644
--- a/arch/x86/boot/bitops.h
+++ b/arch/x86/boot/bitops.h
@@ -41,4 +41,44 @@ static inline void set_bit(int nr, void *addr)
asm("btsl %1,%0" : "+m" (*(u32 *)addr) : "Ir" (nr));
}

+static __always_inline void __set_bit(long nr, volatile unsigned long *addr)
+{
+ asm volatile(__ASM_SIZE(bts) " %1,%0" : : "m" (*(volatile long *) addr),
+ "Ir" (nr) : "memory");
+}
+
+static __always_inline void __clear_bit(long nr, volatile unsigned long *addr)
+{
+ asm volatile(__ASM_SIZE(btr) " %1,%0" : : "m" (*(volatile long *) addr),
+ "Ir" (nr) : "memory");
+}
+
+/**
+ * __ffs - find first set bit in word
+ * @word: The word to search
+ *
+ * Undefined if no bit exists, so code should check against 0 first.
+ */
+static __always_inline unsigned long __ffs(unsigned long word)
+{
+ asm("rep; bsf %1,%0"
+ : "=r" (word)
+ : "rm" (word));
+ return word;
+}
+
+/**
+ * ffz - find first zero bit in word
+ * @word: The word to search
+ *
+ * Undefined if no zero exists, so code should check against ~0UL first.
+ */
+static __always_inline unsigned long ffz(unsigned long word)
+{
+ asm("rep; bsf %1,%0"
+ : "=r" (word)
+ : "r" (~word));
+ return word;
+}
+
#endif /* BOOT_BITOPS_H */
diff --git a/arch/x86/boot/compressed/align.h b/arch/x86/boot/compressed/align.h
new file mode 100644
index 000000000000..7ccabbc5d1b8
--- /dev/null
+++ b/arch/x86/boot/compressed/align.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_ALIGN_H
+#define BOOT_ALIGN_H
+#define _LINUX_ALIGN_H /* Inhibit inclusion of <linux/align.h> */
+
+/* @a is a power of 2 value */
+#define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
+#define ALIGN_DOWN(x, a) __ALIGN_KERNEL((x) - ((a) - 1), (a))
+#define __ALIGN_MASK(x, mask) __ALIGN_KERNEL_MASK((x), (mask))
+#define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a)))
+#define PTR_ALIGN_DOWN(p, a) ((typeof(p))ALIGN_DOWN((unsigned long)(p), (a)))
+#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
+
+#endif
diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
new file mode 100644
index 000000000000..789ecadeb521
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "bitmap.h"
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+ unsigned long *p = map + BIT_WORD(start);
+ const unsigned int size = start + len;
+ int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+ unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+ while (len - bits_to_set >= 0) {
+ *p |= mask_to_set;
+ len -= bits_to_set;
+ bits_to_set = BITS_PER_LONG;
+ mask_to_set = ~0UL;
+ p++;
+ }
+ if (len) {
+ mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+ *p |= mask_to_set;
+ }
+}
+
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+ unsigned long *p = map + BIT_WORD(start);
+ const unsigned int size = start + len;
+ int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+ unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+ while (len - bits_to_clear >= 0) {
+ *p &= ~mask_to_clear;
+ len -= bits_to_clear;
+ bits_to_clear = BITS_PER_LONG;
+ mask_to_clear = ~0UL;
+ p++;
+ }
+ if (len) {
+ mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+ *p &= ~mask_to_clear;
+ }
+}
diff --git a/arch/x86/boot/compressed/bitmap.h b/arch/x86/boot/compressed/bitmap.h
new file mode 100644
index 000000000000..35357f5feda2
--- /dev/null
+++ b/arch/x86/boot/compressed/bitmap.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_BITMAP_H
+#define BOOT_BITMAP_H
+#define __LINUX_BITMAP_H /* Inhibit inclusion of <linux/bitmap.h> */
+
+#include "../bitops.h"
+#include "../string.h"
+#include "align.h"
+
+#define BITMAP_MEM_ALIGNMENT 8
+#define BITMAP_MEM_MASK (BITMAP_MEM_ALIGNMENT - 1)
+
+#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
+#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
+
+#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
+
+void __bitmap_set(unsigned long *map, unsigned int start, int len);
+void __bitmap_clear(unsigned long *map, unsigned int start, int len);
+
+static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
+ unsigned int nbits)
+{
+ if (__builtin_constant_p(nbits) && nbits == 1)
+ __set_bit(start, map);
+ else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+ IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+ __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+ IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
+ memset((char *)map + start / 8, 0xff, nbits / 8);
+ else
+ __bitmap_set(map, start, nbits);
+}
+
+static __always_inline void bitmap_clear(unsigned long *map, unsigned int start,
+ unsigned int nbits)
+{
+ if (__builtin_constant_p(nbits) && nbits == 1)
+ __clear_bit(start, map);
+ else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
+ IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
+ __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
+ IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
+ memset((char *)map + start / 8, 0, nbits / 8);
+ else
+ __bitmap_clear(map, start, nbits);
+}
+
+#endif
diff --git a/arch/x86/boot/compressed/bits.h b/arch/x86/boot/compressed/bits.h
new file mode 100644
index 000000000000..b0ffa007ee19
--- /dev/null
+++ b/arch/x86/boot/compressed/bits.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_BITS_H
+#define BOOT_BITS_H
+#define __LINUX_BITS_H /* Inhibit inclusion of <linux/bits.h> */
+
+#ifdef __ASSEMBLY__
+#define _AC(X,Y) X
+#define _AT(T,X) X
+#else
+#define __AC(X,Y) (X##Y)
+#define _AC(X,Y) __AC(X,Y)
+#define _AT(T,X) ((T)(X))
+#endif
+
+#define _UL(x) (_AC(x, UL))
+#define _ULL(x) (_AC(x, ULL))
+#define UL(x) (_UL(x))
+#define ULL(x) (_ULL(x))
+
+#define BIT(nr) (UL(1) << (nr))
+#define BIT_ULL(nr) (ULL(1) << (nr))
+#define BIT_MASK(nr) (UL(1) << ((nr) % BITS_PER_LONG))
+#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
+#define BIT_ULL_MASK(nr) (ULL(1) << ((nr) % BITS_PER_LONG_LONG))
+#define BIT_ULL_WORD(nr) ((nr) / BITS_PER_LONG_LONG)
+#define BITS_PER_BYTE 8
+
+#define GENMASK(h, l) \
+ (((~UL(0)) - (UL(1) << (l)) + 1) & \
+ (~UL(0) >> (BITS_PER_LONG - 1 - (h))))
+
+#define GENMASK_ULL(h, l) \
+ (((~ULL(0)) - (ULL(1) << (l)) + 1) & \
+ (~ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h))))
+
+#endif
diff --git a/arch/x86/boot/compressed/find.c b/arch/x86/boot/compressed/find.c
new file mode 100644
index 000000000000..b97a9e7c8085
--- /dev/null
+++ b/arch/x86/boot/compressed/find.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "bitmap.h"
+#include "find.h"
+#include "math.h"
+#include "minmax.h"
+
+static __always_inline unsigned long swab(const unsigned long y)
+{
+#if __BITS_PER_LONG == 64
+ return __builtin_bswap64(y);
+#else /* __BITS_PER_LONG == 32 */
+ return __builtin_bswap32(y);
+#endif
+}
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long nbits,
+ unsigned long start, unsigned long invert, unsigned long le)
+{
+ unsigned long tmp, mask;
+
+ if (start >= nbits)
+ return nbits;
+
+ tmp = addr1[start / BITS_PER_LONG];
+ if (addr2)
+ tmp &= addr2[start / BITS_PER_LONG];
+ tmp ^= invert;
+
+ /* Handle 1st word. */
+ mask = BITMAP_FIRST_WORD_MASK(start);
+ if (le)
+ mask = swab(mask);
+
+ tmp &= mask;
+
+ start = round_down(start, BITS_PER_LONG);
+
+ while (!tmp) {
+ start += BITS_PER_LONG;
+ if (start >= nbits)
+ return nbits;
+
+ tmp = addr1[start / BITS_PER_LONG];
+ if (addr2)
+ tmp &= addr2[start / BITS_PER_LONG];
+ tmp ^= invert;
+ }
+
+ if (le)
+ tmp = swab(tmp);
+
+ return min(start + __ffs(tmp), nbits);
+}
diff --git a/arch/x86/boot/compressed/find.h b/arch/x86/boot/compressed/find.h
new file mode 100644
index 000000000000..903574b9d57a
--- /dev/null
+++ b/arch/x86/boot/compressed/find.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_FIND_H
+#define BOOT_FIND_H
+#define __LINUX_FIND_H /* Inhibit inclusion of <linux/find.h> */
+
+#include "../bitops.h"
+#include "align.h"
+#include "bits.h"
+
+unsigned long _find_next_bit(const unsigned long *addr1,
+ const unsigned long *addr2, unsigned long nbits,
+ unsigned long start, unsigned long invert, unsigned long le);
+
+/**
+ * find_next_bit - find the next set bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
+ unsigned long offset)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val;
+
+ if (offset >= size)
+ return size;
+
+ val = *addr & GENMASK(size - 1, offset);
+ return val ? __ffs(val) : size;
+ }
+
+ return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+}
+
+/**
+ * find_next_zero_bit - find the next cleared bit in a memory region
+ * @addr: The address to base the search on
+ * @offset: The bitnumber to start searching at
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number of the next zero bit
+ * If no bits are zero, returns @size.
+ */
+static inline
+unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
+ unsigned long offset)
+{
+ if (small_const_nbits(size)) {
+ unsigned long val;
+
+ if (offset >= size)
+ return size;
+
+ val = *addr | ~GENMASK(size - 1, offset);
+ return val == ~0UL ? size : ffz(val);
+ }
+
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+}
+
+/**
+ * for_each_set_bitrange_from - iterate over all set bit ranges [b; e)
+ * @b: bit offset of start of current bitrange (first set bit); must be initialized
+ * @e: bit offset of end of current bitrange (first unset bit)
+ * @addr: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_set_bitrange_from(b, e, addr, size) \
+ for ((b) = find_next_bit((addr), (size), (b)), \
+ (e) = find_next_zero_bit((addr), (size), (b) + 1); \
+ (b) < (size); \
+ (b) = find_next_bit((addr), (size), (e) + 1), \
+ (e) = find_next_zero_bit((addr), (size), (b) + 1))
+#endif
diff --git a/arch/x86/boot/compressed/math.h b/arch/x86/boot/compressed/math.h
new file mode 100644
index 000000000000..f7eede84bbc2
--- /dev/null
+++ b/arch/x86/boot/compressed/math.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_MATH_H
+#define BOOT_MATH_H
+#define __LINUX_MATH_H /* Inhibit inclusion of <linux/math.h> */
+
+/*
+ *
+ * This looks more complex than it should be. But we need to
+ * get the type for the ~ right in round_down (it needs to be
+ * as wide as the result!), and we want to evaluate the macro
+ * arguments just once each.
+ */
+#define __round_mask(x, y) ((__typeof__(x))((y)-1))
+
+/**
+ * round_up - round up to next specified power of 2
+ * @x: the value to round
+ * @y: multiple to round up to (must be a power of 2)
+ *
+ * Rounds @x up to next multiple of @y (which must be a power of 2).
+ * To perform arbitrary rounding up, use roundup() below.
+ */
+#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
+
+/**
+ * round_down - round down to next specified power of 2
+ * @x: the value to round
+ * @y: multiple to round down to (must be a power of 2)
+ *
+ * Rounds @x down to next multiple of @y (which must be a power of 2).
+ * To perform arbitrary rounding down, use rounddown() below.
+ */
+#define round_down(x, y) ((x) & ~__round_mask(x, y))
+
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+
+#endif
diff --git a/arch/x86/boot/compressed/minmax.h b/arch/x86/boot/compressed/minmax.h
new file mode 100644
index 000000000000..4efd05673260
--- /dev/null
+++ b/arch/x86/boot/compressed/minmax.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef BOOT_MINMAX_H
+#define BOOT_MINMAX_H
+#define __LINUX_MINMAX_H /* Inhibit inclusion of <linux/minmax.h> */
+
+/*
+ * This returns a constant expression while determining if an argument is
+ * a constant expression, most importantly without evaluating the argument.
+ * Glory to Martin Uecker <[email protected]>
+ */
+#define __is_constexpr(x) \
+ (sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))
+
+/*
+ * min()/max()/clamp() macros must accomplish three things:
+ *
+ * - avoid multiple evaluations of the arguments (so side-effects like
+ * "x++" happen only once) when non-constant.
+ * - perform strict type-checking (to generate warnings instead of
+ * nasty runtime surprises). See the "unnecessary" pointer comparison
+ * in __typecheck().
+ * - retain result as a constant expressions when called with only
+ * constant expressions (to avoid tripping VLA warnings in stack
+ * allocation usage).
+ */
+#define __typecheck(x, y) \
+ (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
+
+#define __no_side_effects(x, y) \
+ (__is_constexpr(x) && __is_constexpr(y))
+
+#define __safe_cmp(x, y) \
+ (__typecheck(x, y) && __no_side_effects(x, y))
+
+#define __cmp(x, y, op) ((x) op (y) ? (x) : (y))
+
+#define __cmp_once(x, y, unique_x, unique_y, op) ({ \
+ typeof(x) unique_x = (x); \
+ typeof(y) unique_y = (y); \
+ __cmp(unique_x, unique_y, op); })
+
+#define __careful_cmp(x, y, op) \
+ __builtin_choose_expr(__safe_cmp(x, y), \
+ __cmp(x, y, op), \
+ __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
+
+/**
+ * min - return minimum of two values of the same or compatible types
+ * @x: first value
+ * @y: second value
+ */
+#define min(x, y) __careful_cmp(x, y, <)
+
+/**
+ * max - return maximum of two values of the same or compatible types
+ * @x: first value
+ * @y: second value
+ */
+#define max(x, y) __careful_cmp(x, y, >)
+
+#endif
diff --git a/arch/x86/boot/compressed/pgtable_types.h b/arch/x86/boot/compressed/pgtable_types.h
new file mode 100644
index 000000000000..8f1d87a69efc
--- /dev/null
+++ b/arch/x86/boot/compressed/pgtable_types.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BOOT_COMPRESSED_PGTABLE_TYPES_H
+#define BOOT_COMPRESSED_PGTABLE_TYPES_H
+#define _ASM_X86_PGTABLE_DEFS_H /* Inhibit inclusion of <asm/pgtable_types.h> */
+
+#define PAGE_SHIFT 12
+
+#ifdef CONFIG_X86_64
+#define PTE_SHIFT 9
+#elif defined CONFIG_X86_PAE
+#define PTE_SHIFT 9
+#else /* 2-level */
+#define PTE_SHIFT 10
+#endif
+
+enum pg_level {
+ PG_LEVEL_NONE,
+ PG_LEVEL_4K,
+ PG_LEVEL_2M,
+ PG_LEVEL_1G,
+ PG_LEVEL_512G,
+ PG_LEVEL_NUM
+};
+
+#endif
--
2.39.3

2023-05-08 00:28:49

by Kirill A. Shutemov

Subject: [PATCHv10 05/11] x86/boot/compressed: Handle unaccepted memory

The firmware will pre-accept the memory used to run the stub. But the
stub is responsible for accepting the memory into which it decompresses
the main kernel. Accept memory just before decompression starts.

The stub is also responsible for choosing a physical address in which to
place the decompressed kernel image. The KASLR mechanism will randomize
this physical address. Since the accepted memory region is relatively
small, KASLR would be quite ineffective if it only used the pre-accepted
area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes among the
entire physical address space by also including EFI_UNACCEPTED_MEMORY.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/boot/compressed/efi.h | 1 +
arch/x86/boot/compressed/kaslr.c | 35 ++++++++++++++++--------
arch/x86/boot/compressed/mem.c | 18 ++++++++++++
arch/x86/boot/compressed/misc.c | 6 ++++
arch/x86/boot/compressed/misc.h | 6 ++++
arch/x86/include/asm/unaccepted_memory.h | 2 ++
7 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index f62c02348f9a..74f7adee46ad 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,7 +107,7 @@ endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
-vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/find.o $(obj)/mem.o

vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
index 7db2f41b54cd..cf475243b6d5 100644
--- a/arch/x86/boot/compressed/efi.h
+++ b/arch/x86/boot/compressed/efi.h
@@ -32,6 +32,7 @@ typedef struct {
} efi_table_hdr_t;

#define EFI_CONVENTIONAL_MEMORY 7
+#define EFI_UNACCEPTED_MEMORY 15

#define EFI_MEMORY_MORE_RELIABLE \
((u64)0x0000000000010000ULL) /* higher reliability */
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 454757fbdfe5..749f0fe7e446 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -672,6 +672,28 @@ static bool process_mem_region(struct mem_vector *region,
}

#ifdef CONFIG_EFI
+
+/*
+ * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
+ * guaranteed to be free.
+ *
+ * It is more conservative in picking free memory than the EFI spec allows:
+ *
+ * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory
+ * and thus available to place the kernel image into, but in practice there's
+ * firmware where using that memory leads to crashes.
+ */
+static inline bool memory_type_is_free(efi_memory_desc_t *md)
+{
+ if (md->type == EFI_CONVENTIONAL_MEMORY)
+ return true;
+
+ if (md->type == EFI_UNACCEPTED_MEMORY)
+ return IS_ENABLED(CONFIG_UNACCEPTED_MEMORY);
+
+ return false;
+}
+
/*
* Returns true if we processed the EFI memmap, which we prefer over the E820
* table if it is available.
@@ -716,18 +738,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
for (i = 0; i < nr_desc; i++) {
md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);

- /*
- * Here we are more conservative in picking free memory than
- * the EFI spec allows:
- *
- * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
- * free memory and thus available to place the kernel image into,
- * but in practice there's firmware where using that memory leads
- * to crashes.
- *
- * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
- */
- if (md->type != EFI_CONVENTIONAL_MEMORY)
+ if (!memory_type_is_free(md))
continue;

if (efi_soft_reserve_enabled() &&
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index 6b15a0ed8b54..de858a5180b6 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -3,12 +3,15 @@
#include "../cpuflags.h"
#include "bitmap.h"
#include "error.h"
+#include "find.h"
#include "math.h"

#define PMD_SHIFT 21
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE - 1))

+extern struct boot_params *boot_params;
+
static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
{
/* Platform-specific memory-acceptance call goes here */
@@ -71,3 +74,18 @@ void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
bitmap_set((unsigned long *)params->unaccepted_memory,
start / PMD_SIZE, (end - start) / PMD_SIZE);
}
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long range_start, range_end;
+ unsigned long *bitmap, bitmap_size;
+
+ bitmap = (unsigned long *)boot_params->unaccepted_memory;
+ range_start = start / PMD_SIZE;
+ bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
+
+ for_each_set_bitrange_from(range_start, range_end, bitmap, bitmap_size) {
+ __accept_memory(range_start * PMD_SIZE, range_end * PMD_SIZE);
+ bitmap_clear(bitmap, range_start, range_end - range_start);
+ }
+}
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 014ff222bf4b..186bfd53e042 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -455,6 +455,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
#endif

debug_putstr("\nDecompressing Linux... ");
+
+ if (boot_params->unaccepted_memory) {
+ debug_putstr("Accepting memory... ");
+ accept_memory(__pa(output), __pa(output) + needed_size);
+ }
+
__decompress(input_data, input_len, NULL, NULL, output, output_len,
NULL, error);
entry_offset = parse_elf(output);
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 2f155a0e3041..9663d1839f54 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -247,4 +247,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
}
#endif /* CONFIG_EFI */

+#ifdef CONFIG_UNACCEPTED_MEMORY
+void accept_memory(phys_addr_t start, phys_addr_t end);
+#else
+static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
+#endif
+
#endif /* BOOT_COMPRESSED_MISC_H */
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index df0736d32858..41fbfc798100 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -7,4 +7,6 @@ struct boot_params;

void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);

+void accept_memory(phys_addr_t start, phys_addr_t end);
+
#endif
--
2.39.3

2023-05-08 00:34:39

by Kirill A. Shutemov

Subject: [PATCHv10 10/11] x86/tdx: Refactor try_accept_one()

Rework try_accept_one() to return the accepted size instead of modifying
'start' inside the helper. It makes 'start' an in-only argument and
streamlines the code on the caller side.
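
The calling convention described above can be tried out in userspace. The
following is only a sketch under stated assumptions: mock_accept_page() is a
hypothetical stand-in for the TDX module call and always succeeds, and
accept_range() is an illustrative wrapper that is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Hypothetical stand-in for the TDX module call; always succeeds here. */
static int mock_accept_page(uint64_t start, uint64_t size)
{
	(void)start;
	(void)size;
	return 0;
}

/* Returns the number of bytes accepted, or 0 if this size cannot be used. */
static uint64_t try_accept_one(uint64_t start, uint64_t len, uint64_t size)
{
	if (start % size)	/* start must be aligned to the page size */
		return 0;
	if (len < size)		/* the remaining range must hold a full page */
		return 0;
	if (mock_accept_page(start, size))
		return 0;
	return size;
}

/* Accepts [start, end); returns the number of accept calls, -1 on failure. */
static long accept_range(uint64_t start, uint64_t end)
{
	long calls = 0;

	assert(start <= end);
	while (start < end) {
		uint64_t len = end - start;
		uint64_t accepted;

		/* Try the largest page size first, then fall back */
		accepted = try_accept_one(start, len, SZ_1G);
		if (!accepted)
			accepted = try_accept_one(start, len, SZ_2M);
		if (!accepted)
			accepted = try_accept_one(start, len, SZ_4K);
		if (!accepted)
			return -1;
		start += accepted;	/* only the caller advances 'start' */
		calls++;
	}
	return calls;
}
```

With this shape the helper never writes through a pointer, so the loop state
lives in one place: a range starting 4k below a 2M boundary takes one 4k
accept followed by one 2M accept.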

Signed-off-by: Kirill A. Shutemov <[email protected]>
Suggested-by: Borislav Petkov <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index e6f4c2758a68..0d5fe6e24e45 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -713,18 +713,18 @@ static bool tdx_cache_flush_required(void)
return true;
}

-static bool try_accept_one(phys_addr_t *start, unsigned long len,
- enum pg_level pg_level)
+static unsigned long try_accept_one(phys_addr_t start, unsigned long len,
+ enum pg_level pg_level)
{
unsigned long accept_size = page_level_size(pg_level);
u64 tdcall_rcx;
u8 page_size;

- if (!IS_ALIGNED(*start, accept_size))
- return false;
+ if (!IS_ALIGNED(start, accept_size))
+ return 0;

if (len < accept_size)
- return false;
+ return 0;

/*
* Pass the page physical address to the TDX module to accept the
@@ -743,15 +743,14 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
page_size = 2;
break;
default:
- return false;
+ return 0;
}

- tdcall_rcx = *start | page_size;
+ tdcall_rcx = start | page_size;
if (__tdx_module_call(TDX_ACCEPT_PAGE, tdcall_rcx, 0, 0, 0, NULL))
- return false;
+ return 0;

- *start += accept_size;
- return true;
+ return accept_size;
}

/*
@@ -788,21 +787,22 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
*/
while (start < end) {
unsigned long len = end - start;
+ unsigned long accept_size;

/*
* Try larger accepts first. It gives chance to VMM to keep
- * 1G/2M SEPT entries where possible and speeds up process by
- * cutting number of hypercalls (if successful).
+ * 1G/2M Secure EPT entries where possible and speeds up
+ * process by cutting number of hypercalls (if successful).
*/

- if (try_accept_one(&start, len, PG_LEVEL_1G))
- continue;
-
- if (try_accept_one(&start, len, PG_LEVEL_2M))
- continue;
-
- if (!try_accept_one(&start, len, PG_LEVEL_4K))
+ accept_size = try_accept_one(start, len, PG_LEVEL_1G);
+ if (!accept_size)
+ accept_size = try_accept_one(start, len, PG_LEVEL_2M);
+ if (!accept_size)
+ accept_size = try_accept_one(start, len, PG_LEVEL_4K);
+ if (!accept_size)
return false;
+ start += accept_size;
}

return true;
--
2.39.3

2023-05-08 00:35:02

by Kirill A. Shutemov

Subject: [PATCHv10 01/11] mm: Add support for unaccepted memory

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

1. Accept all the memory during the boot. It is easy to implement and
it doesn't have runtime cost once the system is booted. The downside
is very long boot time.

Accept can be parallelized to multiple CPUs to keep it manageable
(e.g. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
memory bandwidth and does not scale beyond that point.

2. Accept a block of memory on the first use. It requires more
infrastructure and changes in page allocator to make it work, but
it provides good boot time.

On-demand memory accept means latency spikes every time the kernel
steps onto a new memory block. The spikes will go away once the
workload's data set size stabilizes or all memory gets accepted.

3. Accept all memory in the background. Introduce a thread (or multiple)
that gets memory accepted proactively. It will minimize the time the
system experiences latency spikes on memory allocation while keeping
boot time low.

This approach cannot function on its own. It is an extension of #2:
background memory acceptance requires a functional scheduler, but the
page allocator may need to tap into unaccepted memory before that.

The downside of the approach is that these threads also steal CPU
cycles and memory bandwidth from the user's workload and may hurt
user experience.

The patch implements #1 and #2 for now. #2 is the default. Some
workloads may want to use #1 with accept_memory=eager on the kernel
command line. #3 can be implemented later based on user demand.

Support of unaccepted memory requires a few changes in core-mm code:

- memblock has to accept memory on allocation;

- page allocator has to accept memory on the first allocation of the
page;

Memblock change is trivial.

The page allocator is modified to accept pages. New memory gets accepted
before putting pages on free lists. It is done lazily: only accept new
pages when we run out of already accepted memory. The memory gets
accepted until the high watermark is reached.
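
The watermark rule above can be modelled in userspace. This is a sketch, not
the kernel code: struct toy_zone and toy_try_to_accept() are hypothetical
names, and MAX_ORDER_NR_PAGES is assumed to be 1024 pages here. As in the
patch, the free page count includes the pages that still wait on the
unaccepted list, so they are subtracted to get the usable amount.

```c
#include <assert.h>

#define MAX_ORDER_NR_PAGES 1024L	/* assumed chunk size, in pages */

struct toy_zone {
	long free_pages;	/* free pages, including unaccepted ones */
	long unaccepted_pages;	/* pages still on the unaccepted list */
	long high_wmark;	/* high watermark, in pages */
};

/* Returns the number of MAX_ORDER chunks that were accepted. */
static long toy_try_to_accept(struct toy_zone *z)
{
	long usable = z->free_pages - z->unaccepted_pages;
	long to_accept = z->high_wmark - usable;
	long chunks = 0;

	assert(z->unaccepted_pages % MAX_ORDER_NR_PAGES == 0);

	/* Accept at least one chunk, then continue up to the watermark */
	do {
		if (z->unaccepted_pages == 0)
			break;
		z->unaccepted_pages -= MAX_ORDER_NR_PAGES;
		to_accept -= MAX_ORDER_NR_PAGES;
		chunks++;
	} while (to_accept > 0);

	return chunks;
}
```

The do-while shape means one chunk is accepted even if the zone is already
above the high watermark, which matches the "Accept at least one page"
comment in try_to_accept_memory().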

An architecture has to provide two helpers if it wants to support
unaccepted memory:

- accept_memory() makes a range of physical addresses accepted.

- range_contains_unaccepted_memory() checks whether anything within the
range of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Mike Rapoport <[email protected]> # memblock
Reviewed-by: Vlastimil Babka <[email protected]>
---
drivers/base/node.c | 7 ++
fs/proc/meminfo.c | 5 ++
include/linux/mmzone.h | 8 ++
mm/internal.h | 13 ++++
mm/memblock.c | 9 +++
mm/mm_init.c | 7 ++
mm/page_alloc.c | 173 +++++++++++++++++++++++++++++++++++++++++
mm/vmstat.c | 3 +
8 files changed, 225 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index b46db17124f3..655975946ef6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -448,6 +448,9 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d ShmemPmdMapped: %8lu kB\n"
"Node %d FileHugePages: %8lu kB\n"
"Node %d FilePmdMapped: %8lu kB\n"
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ "Node %d Unaccepted: %8lu kB\n"
#endif
,
nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
@@ -477,6 +480,10 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)),
nid, K(node_page_state(pgdat, NR_FILE_THPS)),
nid, K(node_page_state(pgdat, NR_FILE_PMDMAPPED))
+#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ ,
+ nid, K(sum_zone_node_page_state(nid, NR_UNACCEPTED))
#endif
);
len += hugetlb_report_node_meminfo(buf, len, nid);
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b43d0bd42762..8dca4d6d96c7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -168,6 +168,11 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
global_zone_page_state(NR_FREE_CMA_PAGES));
#endif

+#ifdef CONFIG_UNACCEPTED_MEMORY
+ show_val_kb(m, "Unaccepted: ",
+ global_zone_page_state(NR_UNACCEPTED));
+#endif
+
hugetlb_report_meminfo(m);

arch_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4889c9d4055..6c1c2fc13017 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -143,6 +143,9 @@ enum zone_stat_item {
NR_ZSPAGES, /* allocated in zsmalloc */
#endif
NR_FREE_CMA_PAGES,
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ NR_UNACCEPTED,
+#endif
NR_VM_ZONE_STAT_ITEMS };

enum node_stat_item {
@@ -910,6 +913,11 @@ struct zone {
/* free areas of different sizes */
struct free_area free_area[MAX_ORDER + 1];

+#ifdef CONFIG_UNACCEPTED_MEMORY
+ /* Pages to be accepted. All pages on the list are MAX_ORDER */
+ struct list_head unaccepted_pages;
+#endif
+
/* zone flags, see below */
unsigned long flags;

diff --git a/mm/internal.h b/mm/internal.h
index 68410c6d97ac..ed042e366d49 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1099,4 +1099,17 @@ struct vma_prepare {
struct vm_area_struct *remove;
struct vm_area_struct *remove2;
};
+
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline bool range_contains_unaccepted_memory(phys_addr_t start,
+ phys_addr_t end)
+{
+ return false;
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index 3feafea06ab2..50b921119600 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1436,6 +1436,15 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
*/
kmemleak_alloc_phys(found, size, 0);

+ /*
+ * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP,
+ * require memory to be accepted before it can be used by the
+ * guest.
+ *
+ * Accept the memory of the allocated buffer.
+ */
+ accept_memory(found, found + size);
+
return found;
}

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..1cfc08e25f93 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1375,6 +1375,10 @@ static void __meminit zone_init_free_lists(struct zone *zone)
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
zone->free_area[order].nr_free = 0;
}
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ INIT_LIST_HEAD(&zone->unaccepted_pages);
+#endif
}

void __meminit init_currently_empty_zone(struct zone *zone,
@@ -1960,6 +1964,9 @@ static void __init deferred_free_range(unsigned long pfn,
return;
}

+ /* Accept chunks smaller than MAX_ORDER upfront */
+ accept_memory(PFN_PHYS(pfn), PFN_PHYS(pfn + nr_pages));
+
for (i = 0; i < nr_pages; i++, page++, pfn++) {
if (pageblock_aligned(pfn))
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..d239fba3f31c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -387,6 +387,12 @@ EXPORT_SYMBOL(nr_node_ids);
EXPORT_SYMBOL(nr_online_nodes);
#endif

+static bool page_contains_unaccepted(struct page *page, unsigned int order);
+static void accept_page(struct page *page, unsigned int order);
+static bool try_to_accept_memory(struct zone *zone, unsigned int order);
+static inline bool has_unaccepted_memory(void);
+static bool __free_unaccepted(struct page *page);
+
int page_group_by_mobility_disabled __read_mostly;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -1481,6 +1487,13 @@ void __free_pages_core(struct page *page, unsigned int order)

atomic_long_add(nr_pages, &page_zone(page)->managed_pages);

+ if (page_contains_unaccepted(page, order)) {
+ if (order == MAX_ORDER && __free_unaccepted(page))
+ return;
+
+ accept_page(page, order);
+ }
+
/*
* Bypass PCP and place fresh pages right to the tail, primarily
* relevant for memory onlining.
@@ -3159,6 +3172,9 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
if (!(alloc_flags & ALLOC_CMA))
unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ unusable_free += zone_page_state(z, NR_UNACCEPTED);
+#endif

return unusable_free;
}
@@ -3458,6 +3474,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
gfp_mask)) {
int ret;

+ if (has_unaccepted_memory()) {
+ if (try_to_accept_memory(zone, order))
+ goto try_this_zone;
+ }
+
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* Watermark failed for this zone, but see if we can
@@ -3510,6 +3531,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,

return page;
} else {
+ if (has_unaccepted_memory()) {
+ if (try_to_accept_memory(zone, order))
+ goto try_this_zone;
+ }
+
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/* Try again if zone has deferred pages */
if (deferred_pages_enabled()) {
@@ -7215,3 +7241,150 @@ bool has_managed_dma(void)
return false;
}
#endif /* CONFIG_ZONE_DMA */
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+/* Counts number of zones with unaccepted pages. */
+static DEFINE_STATIC_KEY_FALSE(zones_with_unaccepted_pages);
+
+static bool lazy_accept = true;
+
+static int __init accept_memory_parse(char *p)
+{
+ if (!strcmp(p, "lazy")) {
+ lazy_accept = true;
+ return 0;
+ } else if (!strcmp(p, "eager")) {
+ lazy_accept = false;
+ return 0;
+ } else {
+ return -EINVAL;
+ }
+}
+early_param("accept_memory", accept_memory_parse);
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+ phys_addr_t start = page_to_phys(page);
+ phys_addr_t end = start + (PAGE_SIZE << order);
+
+ return range_contains_unaccepted_memory(start, end);
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+ phys_addr_t start = page_to_phys(page);
+
+ accept_memory(start, start + (PAGE_SIZE << order));
+}
+
+static bool try_to_accept_memory_one(struct zone *zone)
+{
+ unsigned long flags;
+ struct page *page;
+ bool last;
+
+ if (list_empty(&zone->unaccepted_pages))
+ return false;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ page = list_first_entry_or_null(&zone->unaccepted_pages,
+ struct page, lru);
+ if (!page) {
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return false;
+ }
+
+ list_del(&page->lru);
+ last = list_empty(&zone->unaccepted_pages);
+
+ __mod_zone_freepage_state(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+ __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES);
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ accept_page(page, MAX_ORDER);
+
+ __free_pages_ok(page, MAX_ORDER, FPI_TO_TAIL);
+
+ if (last)
+ static_branch_dec(&zones_with_unaccepted_pages);
+
+ return true;
+}
+
+static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+ long to_accept;
+ int ret = false;
+
+ /* How much to accept to get to high watermark? */
+ to_accept = high_wmark_pages(zone) -
+ (zone_page_state(zone, NR_FREE_PAGES) -
+ __zone_watermark_unusable_free(zone, order, 0));
+
+ /* Accept at least one page */
+ do {
+ if (!try_to_accept_memory_one(zone))
+ break;
+ ret = true;
+ to_accept -= MAX_ORDER_NR_PAGES;
+ } while (to_accept > 0);
+
+ return ret;
+}
+
+static inline bool has_unaccepted_memory(void)
+{
+ return static_branch_unlikely(&zones_with_unaccepted_pages);
+}
+
+static bool __free_unaccepted(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ unsigned long flags;
+ bool first = false;
+
+ if (!lazy_accept)
+ return false;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ first = list_empty(&zone->unaccepted_pages);
+ list_add_tail(&page->lru, &zone->unaccepted_pages);
+ __mod_zone_freepage_state(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE);
+ __mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES);
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ if (first)
+ static_branch_inc(&zones_with_unaccepted_pages);
+
+ return true;
+}
+
+#else
+
+static bool page_contains_unaccepted(struct page *page, unsigned int order)
+{
+ return false;
+}
+
+static void accept_page(struct page *page, unsigned int order)
+{
+}
+
+static bool try_to_accept_memory(struct zone *zone, unsigned int order)
+{
+ return false;
+}
+
+static inline bool has_unaccepted_memory(void)
+{
+ return false;
+}
+
+static bool __free_unaccepted(struct page *page)
+{
+ BUILD_BUG();
+ return false;
+}
+
+#endif /* CONFIG_UNACCEPTED_MEMORY */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c28046371b45..282349cabf01 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1180,6 +1180,9 @@ const char * const vmstat_text[] = {
"nr_zspages",
#endif
"nr_free_cma",
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ "nr_unaccepted",
+#endif

/* enum numa_stat_item counters */
#ifdef CONFIG_NUMA
--
2.39.3

2023-05-08 00:35:56

by Kirill A. Shutemov

Subject: [PATCHv10 07/11] x86/mm: Provide helpers for unaccepted memory

Core-mm requires a few helpers to support unaccepted memory:

- accept_memory() checks the range of addresses against the bitmap and
accepts memory if needed.

- range_contains_unaccepted_memory() checks if anything within the
range requires acceptance.
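
The behaviour of the two helpers can be sketched in userspace against a
64-bit toy bitmap. Everything prefixed with toy_ is hypothetical; the
platform-specific accept call is replaced by a counter, and locking is left
out. The sketch walks 2M unit indices and issues one "accept" per run of
consecutive set bits, which is the effect of the bit-range iteration in
accept_memory().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_SIZE (1ULL << 21)
#define TOY_BITS 64

static uint64_t toy_bitmap;		/* bit N set: 2M unit N is unaccepted */
static unsigned int toy_accept_calls;	/* stands in for the platform call */

static bool toy_test_bit(uint64_t unit)
{
	return toy_bitmap & (1ULL << unit);
}

static bool toy_range_contains_unaccepted(uint64_t start, uint64_t end)
{
	uint64_t unit = start / PMD_SIZE;
	uint64_t last = (end + PMD_SIZE - 1) / PMD_SIZE;

	assert(start <= end && last <= TOY_BITS);
	for (; unit < last; unit++)
		if (toy_test_bit(unit))
			return true;
	return false;
}

static void toy_accept_memory(uint64_t start, uint64_t end)
{
	uint64_t unit = start / PMD_SIZE;
	uint64_t last = (end + PMD_SIZE - 1) / PMD_SIZE;

	assert(start <= end && last <= TOY_BITS);
	while (unit < last) {
		/* Skip units that are already accepted */
		if (!toy_test_bit(unit)) {
			unit++;
			continue;
		}

		/* One accept per run of set bits, then clear the run */
		toy_accept_calls++;
		while (unit < last && toy_test_bit(unit)) {
			toy_bitmap &= ~(1ULL << unit);
			unit++;
		}
	}
}
```

For example, with bits 2..5 set, accepting the first five units clears bits
2..4 with a single accept and leaves bit 5 set for a later caller.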

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/page.h | 3 ++
arch/x86/include/asm/unaccepted_memory.h | 4 ++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/unaccepted_memory.c | 61 ++++++++++++++++++++++++
4 files changed, 70 insertions(+)
create mode 100644 arch/x86/mm/unaccepted_memory.c

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index d18e5c332cb9..4bab2bb2c9c0 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -19,6 +19,9 @@
struct page;

#include <linux/range.h>
+
+#include <asm/unaccepted_memory.h>
+
extern struct range pfn_mapped[];
extern int nr_pfn_mapped;

diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index 41fbfc798100..89fc91c61560 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -7,6 +7,10 @@ struct boot_params;

void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);

+#ifdef CONFIG_UNACCEPTED_MEMORY
+
void accept_memory(phys_addr_t start, phys_addr_t end);
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);

#endif
+#endif
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index c80febc44cd2..b0ef1755e5c8 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -67,3 +67,5 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o

obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
+
+obj-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
new file mode 100644
index 000000000000..1df918b21469
--- /dev/null
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/pfn.h>
+#include <linux/spinlock.h>
+
+#include <asm/io.h>
+#include <asm/setup.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long range_start, range_end;
+ unsigned long *bitmap;
+ unsigned long flags;
+
+ if (!boot_params.unaccepted_memory)
+ return;
+
+ bitmap = __va(boot_params.unaccepted_memory);
+ range_start = start / PMD_SIZE;
+
+ spin_lock_irqsave(&unaccepted_memory_lock, flags);
+ for_each_set_bitrange_from(range_start, range_end, bitmap,
+ DIV_ROUND_UP(end, PMD_SIZE)) {
+ unsigned long len = range_end - range_start;
+
+ /* Platform-specific memory-acceptance call goes here */
+ panic("Cannot accept memory: unknown platform\n");
+ bitmap_clear(bitmap, range_start, len);
+ }
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long *bitmap;
+ unsigned long flags;
+ bool ret = false;
+
+ if (!boot_params.unaccepted_memory)
+ return 0;
+
+ bitmap = __va(boot_params.unaccepted_memory);
+
+ spin_lock_irqsave(&unaccepted_memory_lock, flags);
+ while (start < end) {
+ if (test_bit(start / PMD_SIZE, bitmap)) {
+ ret = true;
+ break;
+ }
+
+ start += PMD_SIZE;
+ }
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+ return ret;
+}
--
2.39.3

2023-05-08 00:36:38

by Kirill A. Shutemov

Subject: [PATCHv10 09/11] x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub

Memory acceptance requires a hypercall and one or multiple module calls.

Make helpers for the calls available in the boot stub. It has to accept
memory where the kernel image and initrd are placed.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 32 -------------------
arch/x86/include/asm/shared/tdx.h | 51 +++++++++++++++++++++++++++++++
arch/x86/include/asm/tdx.h | 19 ------------
3 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index e146b599260f..e6f4c2758a68 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -14,20 +14,6 @@
#include <asm/insn-eval.h>
#include <asm/pgtable.h>

-/* TDX module Call Leaf IDs */
-#define TDX_GET_INFO 1
-#define TDX_GET_VEINFO 3
-#define TDX_GET_REPORT 4
-#define TDX_ACCEPT_PAGE 6
-#define TDX_WR 8
-
-/* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */
-#define TDCS_NOTIFY_ENABLES 0x9100000000000010
-
-/* TDX hypercall Leaf IDs */
-#define TDVMCALL_MAP_GPA 0x10001
-#define TDVMCALL_REPORT_FATAL_ERROR 0x10003
-
/* MMIO direction */
#define EPT_READ 0
#define EPT_WRITE 1
@@ -51,24 +37,6 @@

#define TDREPORT_SUBTYPE_0 0

-/*
- * Wrapper for standard use of __tdx_hypercall with no output aside from
- * return code.
- */
-static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
-{
- struct tdx_hypercall_args args = {
- .r10 = TDX_HYPERCALL_STANDARD,
- .r11 = fn,
- .r12 = r12,
- .r13 = r13,
- .r14 = r14,
- .r15 = r15,
- };
-
- return __tdx_hypercall(&args);
-}
-
/* Called from __tdx_hypercall() for unrecoverable failure */
noinstr void __tdx_hypercall_failed(void)
{
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 2631e01f6e0f..1ff0ee822961 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -10,6 +10,20 @@
#define TDX_CPUID_LEAF_ID 0x21
#define TDX_IDENT "IntelTDX "

+/* TDX module Call Leaf IDs */
+#define TDX_GET_INFO 1
+#define TDX_GET_VEINFO 3
+#define TDX_GET_REPORT 4
+#define TDX_ACCEPT_PAGE 6
+#define TDX_WR 8
+
+/* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */
+#define TDCS_NOTIFY_ENABLES 0x9100000000000010
+
+/* TDX hypercall Leaf IDs */
+#define TDVMCALL_MAP_GPA 0x10001
+#define TDVMCALL_REPORT_FATAL_ERROR 0x10003
+
#ifndef __ASSEMBLY__

/*
@@ -37,8 +51,45 @@ struct tdx_hypercall_args {
u64 __tdx_hypercall(struct tdx_hypercall_args *args);
u64 __tdx_hypercall_ret(struct tdx_hypercall_args *args);

+/*
+ * Wrapper for standard use of __tdx_hypercall with no output aside from
+ * return code.
+ */
+static inline u64 _tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15)
+{
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = fn,
+ .r12 = r12,
+ .r13 = r13,
+ .r14 = r14,
+ .r15 = r15,
+ };
+
+ return __tdx_hypercall(&args);
+}
+
+
/* Called from __tdx_hypercall() for unrecoverable failure */
void __tdx_hypercall_failed(void);

+/*
+ * Used in __tdx_module_call() to gather the output registers' values of the
+ * TDCALL instruction when requesting services from the TDX module. This is a
+ * software only structure and not part of the TDX module/VMM ABI
+ */
+struct tdx_module_output {
+ u64 rcx;
+ u64 rdx;
+ u64 r8;
+ u64 r9;
+ u64 r10;
+ u64 r11;
+};
+
+/* Used to communicate with the TDX module */
+u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
+ struct tdx_module_output *out);
+
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_SHARED_TDX_H */
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..234197ec17e4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -20,21 +20,6 @@

#ifndef __ASSEMBLY__

-/*
- * Used to gather the output registers values of the TDCALL and SEAMCALL
- * instructions when requesting services from the TDX module.
- *
- * This is a software only structure and not part of the TDX module/VMM ABI.
- */
-struct tdx_module_output {
- u64 rcx;
- u64 rdx;
- u64 r8;
- u64 r9;
- u64 r10;
- u64 r11;
-};
-
/*
* Used by the #VE exception handler to gather the #VE exception
* info from the TDX module. This is a software only structure
@@ -55,10 +40,6 @@ struct ve_info {

void __init tdx_early_init(void);

-/* Used to communicate with the TDX module */
-u64 __tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
- struct tdx_module_output *out);
-
void tdx_get_ve_info(struct ve_info *ve);

bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
--
2.39.3

2023-05-08 00:37:39

by Kirill A. Shutemov

Subject: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until memory is needed. It lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 has to be modified on every page acceptance. It leads
to table fragmentation, and there's a limited number of entries in the
e820 table.

Another option is to mark such memory as usable in e820 and track if the
range has been accepted in a bitmap. One bit in the bitmap represents
2MiB in the address space: one 4k page is enough to track 64GiB of
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
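
The sizes quoted above follow directly from the 2MiB-per-bit granularity.
A small sketch of the arithmetic, with hypothetical helper names that do
not exist in the patch:

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE (1ULL << 21)	/* 2MiB tracked by each bit */

/* Bytes of bitmap needed to cover an address space of the given size. */
static uint64_t bitmap_bytes_for(uint64_t address_space_bytes)
{
	uint64_t bits;

	/* Guard the round-up below against overflow */
	assert(address_space_bytes <= UINT64_MAX - (PMD_SIZE - 1));
	bits = (address_space_bytes + PMD_SIZE - 1) / PMD_SIZE;
	return (bits + 7) / 8;
}

/* Bytes of address space that a bitmap of the given size can track. */
static uint64_t coverage_of(uint64_t bitmap_bytes)
{
	return bitmap_bytes * 8 * PMD_SIZE;
}
```

A 4096-byte page holds 32768 bits, and 32768 * 2MiB = 64GiB. In the other
direction, 4PiB / 2MiB = 2^31 bits, which is 2^28 bytes = 256MiB.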

Any unaccepted memory that is not aligned to 2M gets accepted upfront.
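
How a range is divided between upfront acceptance and the bitmap can be
sketched as follows. It mirrors the rule used by process_unaccepted_memory()
(ranges shorter than 2*PMD_SIZE are accepted whole; otherwise only the
unaligned head and tail are), but struct toy_split and toy_split_range() are
hypothetical and only compute the split rather than accepting anything.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE (1ULL << 21)
#define PMD_MASK (~(PMD_SIZE - 1))

struct toy_split {
	uint64_t eager_bytes;	/* bytes that must be accepted upfront */
	uint64_t first_unit;	/* first 2M unit to mark in the bitmap */
	uint64_t nr_units;	/* number of 2M units to mark */
};

static struct toy_split toy_split_range(uint64_t start, uint64_t end)
{
	struct toy_split s = { 0, 0, 0 };
	uint64_t aligned_start, aligned_end;

	assert(start <= end);

	/* Too small to guarantee an aligned 2M unit: accept everything */
	if (end - start < 2 * PMD_SIZE) {
		s.eager_bytes = end - start;
		return s;
	}

	aligned_start = (start + PMD_SIZE - 1) & PMD_MASK;	/* round up */
	aligned_end = end & PMD_MASK;				/* round down */

	/* Unaligned head and tail are accepted upfront */
	s.eager_bytes = (aligned_start - start) + (end - aligned_end);
	s.first_unit = aligned_start / PMD_SIZE;
	s.nr_units = (aligned_end - aligned_start) / PMD_SIZE;
	return s;
}
```

For a range from 1M to 7M this gives 2M accepted upfront (1M at each end) and
two units, starting at unit 1, recorded as unaccepted.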

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via boot_params. allocate_e820() allocates the bitmap if
unaccepted memory is present, according to the maximum address in the
memory map.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/arch/x86/zero-page.rst | 1 +
arch/x86/boot/compressed/Makefile | 1 +
arch/x86/boot/compressed/mem.c | 73 ++++++++++++++++++++++++
arch/x86/include/asm/unaccepted_memory.h | 10 ++++
arch/x86/include/uapi/asm/bootparam.h | 2 +-
drivers/firmware/efi/Kconfig | 14 +++++
drivers/firmware/efi/efi.c | 1 +
drivers/firmware/efi/libstub/x86-stub.c | 65 +++++++++++++++++++++
include/linux/efi.h | 3 +-
9 files changed, 168 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/boot/compressed/mem.c
create mode 100644 arch/x86/include/asm/unaccepted_memory.h

diff --git a/Documentation/arch/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst
index 45aa9cceb4f1..f21905e61ade 100644
--- a/Documentation/arch/x86/zero-page.rst
+++ b/Documentation/arch/x86/zero-page.rst
@@ -20,6 +20,7 @@ Offset/Size Proto Name Meaning
060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
(struct ist_info)
070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table
+078/008 ALL unaccepted_memory Bitmap of unaccepted memory (1bit == 2M)
080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
0A0/010 ALL sys_desc_table System description table (struct sys_desc_table),
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b6cfe607bdb..f62c02348f9a 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,6 +107,7 @@ endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o

vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
new file mode 100644
index 000000000000..6b15a0ed8b54
--- /dev/null
+++ b/arch/x86/boot/compressed/mem.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../cpuflags.h"
+#include "bitmap.h"
+#include "error.h"
+#include "math.h"
+
+#define PMD_SHIFT 21
+#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
+#define PMD_MASK (~(PMD_SIZE - 1))
+
+static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ /* Platform-specific memory-acceptance call goes here */
+ error("Cannot accept memory");
+}
+
+/*
+ * The accepted memory bitmap only works at PMD_SIZE granularity. Take
+ * unaligned start/end addresses and either:
+ * 1. Accepts the memory immediately and in its entirety
+ * 2. Accepts unaligned parts, and marks *some* aligned part unaccepted
+ *
+ * The function will never reach the bitmap_set() with zero bits to set.
+ */
+void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
+{
+ /*
+ * Ensure that at least one bit will be set in the bitmap by
+ * immediately accepting all regions under 2*PMD_SIZE. This is
+ * imprecise and may immediately accept some areas that could
+ * have been represented in the bitmap. But, results in simpler
+ * code below
+ *
+ * Consider case like this:
+ *
+ * | 4k | 2044k | 2048k |
+ * ^ 0x0 ^ 2MB ^ 4MB
+ *
+ * Only the first 4k has been accepted. The 0MB->2MB region can not be
+ * represented in the bitmap. The 2MB->4MB region can be represented in
+ * the bitmap. But, the 0MB->4MB region is <2*PMD_SIZE and will be
+ * immediately accepted in its entirety.
+ */
+ if (end - start < 2 * PMD_SIZE) {
+ __accept_memory(start, end);
+ return;
+ }
+
+ /*
+ * No matter how the start and end are aligned, at least one unaccepted
+ * PMD_SIZE area will remain to be marked in the bitmap.
+ */
+
+ /* Immediately accept a <PMD_SIZE piece at the start: */
+ if (start & ~PMD_MASK) {
+ __accept_memory(start, round_up(start, PMD_SIZE));
+ start = round_up(start, PMD_SIZE);
+ }
+
+ /* Immediately accept a <PMD_SIZE piece at the end: */
+ if (end & ~PMD_MASK) {
+ __accept_memory(round_down(end, PMD_SIZE), end);
+ end = round_down(end, PMD_SIZE);
+ }
+
+ /*
+ * 'start' and 'end' are now both PMD-aligned.
+ * Record the range as being unaccepted:
+ */
+ bitmap_set((unsigned long *)params->unaccepted_memory,
+ start / PMD_SIZE, (end - start) / PMD_SIZE);
+}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
new file mode 100644
index 000000000000..df0736d32858
--- /dev/null
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2020 Intel Corporation */
+#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
+#define _ASM_X86_UNACCEPTED_MEMORY_H
+
+struct boot_params;
+
+void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
+
+#endif
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 01d19fc22346..630a54046af0 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -189,7 +189,7 @@ struct boot_params {
__u64 tboot_addr; /* 0x058 */
struct ist_info ist_info; /* 0x060 */
__u64 acpi_rsdp_addr; /* 0x070 */
- __u8 _pad3[8]; /* 0x078 */
+ __u64 unaccepted_memory; /* 0x078 */
__u8 hd0_info[16]; /* obsolete! */ /* 0x080 */
__u8 hd1_info[16]; /* obsolete! */ /* 0x090 */
struct sys_desc_table sys_desc_table; /* obsolete! */ /* 0x0a0 */
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 043ca31c114e..231f1c70d1db 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -269,6 +269,20 @@ config EFI_COCO_SECRET
virt/coco/efi_secret module to access the secrets, which in turn
allows userspace programs to access the injected secrets.

+config UNACCEPTED_MEMORY
+ bool
+ depends on EFI_STUB
+ help
+ Some Virtual Machine platforms, such as Intel TDX, require
+ some memory to be "accepted" by the guest before it can be used.
+ This mechanism helps prevent malicious hosts from making changes
+ to guest memory.
+
+ UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
+
+ This option adds support for unaccepted memory and makes such memory
+ usable by the kernel.
+
config EFI_EMBEDDED_FIRMWARE
bool
select CRYPTO_LIB_SHA256
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index abeff7dc0b58..7dce06e419c5 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -843,6 +843,7 @@ static __initdata char memory_type_name[][13] = {
"MMIO Port",
"PAL Code",
"Persistent",
+ "Unaccepted",
};

char * __init efi_md_typeattr_format(char *buf, size_t size,
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index fff81843169c..1643ddbde249 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -15,6 +15,7 @@
#include <asm/setup.h>
#include <asm/desc.h>
#include <asm/boot.h>
+#include <asm/unaccepted_memory.h>

#include "efistub.h"

@@ -613,6 +614,16 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
e820_type = E820_TYPE_PMEM;
break;

+ case EFI_UNACCEPTED_MEMORY:
+ if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
+ efi_warn_once(
+"The system has unaccepted memory, but kernel does not support it\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
+ continue;
+ }
+ e820_type = E820_TYPE_RAM;
+ process_unaccepted_memory(params, d->phys_addr,
+ d->phys_addr + PAGE_SIZE * d->num_pages);
+ break;
default:
continue;
}
@@ -677,6 +688,57 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
return status;
}

+static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
+ __u32 nr_desc,
+ struct efi_boot_memmap *map)
+{
+ unsigned long *mem = NULL;
+ u64 size, max_addr = 0;
+ efi_status_t status;
+ bool found = false;
+ int i;
+
+ /* Check if there's any unaccepted memory and find the max address */
+ for (i = 0; i < nr_desc; i++) {
+ efi_memory_desc_t *d;
+ unsigned long m = (unsigned long)map->map;
+
+ d = efi_early_memdesc_ptr(m, map->desc_size, i);
+ if (d->type == EFI_UNACCEPTED_MEMORY)
+ found = true;
+ if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
+ max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
+ }
+
+ if (!found) {
+ params->unaccepted_memory = 0;
+ return EFI_SUCCESS;
+ }
+
+ /*
+ * If unaccepted memory is present, allocate a bitmap to track what
+ * memory has to be accepted before access.
+ *
+ * One bit in the bitmap represents 2MiB in the address space:
+ * A 4k bitmap can track 64GiB of physical address space.
+ *
+ * In the worst case scenario -- a huge hole in the middle of the
+ * address space -- it needs 256MiB to handle 4PiB of the address
+ * space.
+ *
+ * The bitmap will be populated in setup_e820() according to the memory
+ * map after efi_exit_boot_services().
+ */
+ size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
+ status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
+ if (status == EFI_SUCCESS) {
+ memset(mem, 0, size);
+ params->unaccepted_memory = (unsigned long)mem;
+ }
+
+ return status;
+}
+
static efi_status_t allocate_e820(struct boot_params *params,
struct setup_data **e820ext,
u32 *e820ext_size)
@@ -697,6 +759,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
}

+ if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
+ status = allocate_unaccepted_bitmap(params, nr_desc, map);
+
efi_bs_call(free_pool, map);
return status;
}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 7aa62c92185f..efbe14641638 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -108,7 +108,8 @@ typedef struct {
#define EFI_MEMORY_MAPPED_IO_PORT_SPACE 12
#define EFI_PAL_CODE 13
#define EFI_PERSISTENT_MEMORY 14
-#define EFI_MAX_MEMORY_TYPE 15
+#define EFI_UNACCEPTED_MEMORY 15
+#define EFI_MAX_MEMORY_TYPE 16

/* Attribute values: */
#define EFI_MEMORY_UC ((u64)0x0000000000000001ULL) /* uncached */
--
2.39.3
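
The sizing figures in the comment above allocate_unaccepted_bitmap() can be checked with a few lines of arithmetic. What follows is a minimal userspace sketch, not code from the patch: unaccepted_bitmap_bytes() is a hypothetical helper that mirrors the DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE) expression, assuming one bit per 2MiB.

```c
#include <stdint.h>

#define UNIT_SHIFT	21				/* assumed: one bit covers 2MiB */
#define UNIT_SIZE	(UINT64_C(1) << UNIT_SHIFT)
#define BITS_PER_BYTE	8

/* Bytes of bitmap needed to cover physical addresses [0, max_addr). */
uint64_t unaccepted_bitmap_bytes(uint64_t max_addr)
{
	/* Each bitmap byte describes 8 * 2MiB = 16MiB of address space. */
	uint64_t per_byte = UNIT_SIZE * BITS_PER_BYTE;

	/* Round up so a partially covered final byte is still allocated. */
	return (max_addr + per_byte - 1) / per_byte;
}
```

Under these assumptions a 64GiB maximum address needs a 4096-byte bitmap and a 4PiB maximum address needs 256MiB, which agrees with the numbers quoted in the commit message.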

2023-05-08 07:42:09

by Ard Biesheuvel

Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

Hello Kirill,

On Mon, 8 May 2023 at 01:46, Kirill A. Shutemov
<[email protected]> wrote:
>
> UEFI Specification version 2.9 introduces the concept of memory
> acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> SEV-SNP, require memory to be accepted before it can be used by the
> guest. Accepting happens via a protocol specific to the Virtual
> Machine platform.
>
> Accepting memory is costly and it makes VMM allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until memory is needed. It lowers boot time and reduces
> memory overhead.
>
> The kernel needs to know what memory has been accepted. Firmware
> communicates this information via memory map: a new memory type --
> EFI_UNACCEPTED_MEMORY -- indicates such memory.
>
> Range-based tracking works fine for firmware, but it gets bulky for
> the kernel: e820 has to be modified on every page acceptance. This
> leads to table fragmentation, and the e820 table has a limited number
> of entries.
>
> Another option is to mark such memory as usable in e820 and track if the
> range has been accepted in a bitmap. One bit in the bitmap represents
> 2MiB in the address space: one 4k page is enough to track 64GiB of
> physical address space.
>
> In the worst-case scenario -- a huge hole in the middle of the
> address space -- it needs 256MiB to handle 4PiB of the address
> space.
>
> Any unaccepted memory that is not aligned to 2M gets accepted upfront.
>
> The bitmap is allocated and constructed in the EFI stub and passed down
> to the kernel via boot_params. allocate_e820() allocates the bitmap if
> unaccepted memory is present, according to the maximum address in the
> memory map.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> Documentation/arch/x86/zero-page.rst | 1 +
> arch/x86/boot/compressed/Makefile | 1 +
> arch/x86/boot/compressed/mem.c | 73 ++++++++++++++++++++++++
> arch/x86/include/asm/unaccepted_memory.h | 10 ++++
> arch/x86/include/uapi/asm/bootparam.h | 2 +-
> drivers/firmware/efi/Kconfig | 14 +++++
> drivers/firmware/efi/efi.c | 1 +
> drivers/firmware/efi/libstub/x86-stub.c | 65 +++++++++++++++++++++
> include/linux/efi.h | 3 +-
> 9 files changed, 168 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/boot/compressed/mem.c
> create mode 100644 arch/x86/include/asm/unaccepted_memory.h
>
> diff --git a/Documentation/arch/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst
> index 45aa9cceb4f1..f21905e61ade 100644
> --- a/Documentation/arch/x86/zero-page.rst
> +++ b/Documentation/arch/x86/zero-page.rst
> @@ -20,6 +20,7 @@ Offset/Size Proto Name Meaning
> 060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
> (struct ist_info)
> 070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table
> +078/008 ALL unaccepted_memory Bitmap of unaccepted memory (1bit == 2M)

Unaccepted memory is a generic EFI feature, and will need to be
supported on other architectures as well.

Could we perhaps use an EFI configuration table to pass the bitmap to
the core kernel, instead of adding more cruft to this archaic header?
That could be implemented in an arch-agnostic manner, even in cases
where the bootloader is the agent that calls ExitBootServices(), as it
would be the loader that allocates and populates the bitmap in that
case.
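
For readers unfamiliar with the mechanism: an EFI configuration table is an entry in an array of (GUID, pointer) pairs hanging off the EFI system table, and a consumer finds its data by matching the GUID. The sketch below shows only that lookup in plain userspace C. The struct names, the GUID value and demo_lookup() are made up for illustration; nothing here is taken from the patchset or from an assigned GUID.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 128-bit GUID with the same field layout as the UEFI spec's EFI_GUID. */
struct guid {
	uint32_t data1;
	uint16_t data2;
	uint16_t data3;
	uint8_t  data4[8];
};

/* One entry of the configuration table array (GUID plus table pointer). */
struct config_table_entry {
	struct guid vendor_guid;
	void *vendor_table;
};

/* Made-up GUID for illustration only; not an assigned value. */
static const struct guid UNACCEPTED_TABLE_GUID_EXAMPLE = {
	0x12345678, 0x9abc, 0xdef0, { 0, 1, 2, 3, 4, 5, 6, 7 }
};

/* Return the table registered under 'wanted', or NULL if there is none. */
void *find_config_table(const struct config_table_entry *tables, size_t nr,
			const struct guid *wanted)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!memcmp(&tables[i].vendor_guid, wanted, sizeof(*wanted)))
			return tables[i].vendor_table;
	}
	return NULL;
}

/* Usage: returns 1 if the bitmap installed under the example GUID is found. */
int demo_lookup(void)
{
	static unsigned long bitmap[4];
	struct guid other = { 1, 2, 3, { 0 } };
	struct config_table_entry tables[2];

	tables[0].vendor_guid = other;
	tables[0].vendor_table = NULL;
	tables[1].vendor_guid = UNACCEPTED_TABLE_GUID_EXAMPLE;
	tables[1].vendor_table = bitmap;

	return find_config_table(tables, 2, &UNACCEPTED_TABLE_GUID_EXAMPLE) == bitmap;
}
```

Because the lookup depends only on the GUID, whichever agent calls ExitBootServices() could install the table, which is the arch-agnostic property asked for above.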


> 080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
> 090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
> 0A0/010 ALL sys_desc_table System description table (struct sys_desc_table),
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index 6b6cfe607bdb..f62c02348f9a 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -107,6 +107,7 @@ endif
>
> vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
> vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
> +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o
>
> vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
> vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
> diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
> new file mode 100644
> index 000000000000..6b15a0ed8b54
> --- /dev/null
> +++ b/arch/x86/boot/compressed/mem.c
> @@ -0,0 +1,73 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include "../cpuflags.h"
> +#include "bitmap.h"
> +#include "error.h"
> +#include "math.h"
> +
> +#define PMD_SHIFT 21
> +#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> +#define PMD_MASK (~(PMD_SIZE - 1))
> +
> +static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> + /* Platform-specific memory-acceptance call goes here */
> + error("Cannot accept memory");
> +}
> +
> +/*
> + * The accepted memory bitmap only works at PMD_SIZE granularity. Take
> + * unaligned start/end addresses and either:
> + * 1. Accepts the memory immediately and in its entirety
> + * 2. Accepts unaligned parts, and marks *some* aligned part unaccepted
> + *
> + * The function will never reach the bitmap_set() with zero bits to set.
> + */
> +void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
> +{
> + /*
> + * Ensure that at least one bit will be set in the bitmap by
> + * immediately accepting all regions under 2*PMD_SIZE. This is
> + * imprecise and may immediately accept some areas that could
> + * have been represented in the bitmap. But, results in simpler
> + * code below
> + *
> + * Consider case like this:
> + *
> + * | 4k | 2044k | 2048k |
> + * ^ 0x0 ^ 2MB ^ 4MB
> + *
> + * Only the first 4k has been accepted. The 0MB->2MB region can not be
> + * represented in the bitmap. The 2MB->4MB region can be represented in
> + * the bitmap. But, the 0MB->4MB region is <2*PMD_SIZE and will be
> + * immediately accepted in its entirety.
> + */
> + if (end - start < 2 * PMD_SIZE) {
> + __accept_memory(start, end);
> + return;
> + }
> +
> + /*
> + * No matter how the start and end are aligned, at least one unaccepted
> + * PMD_SIZE area will remain to be marked in the bitmap.
> + */
> +
> + /* Immediately accept a <PMD_SIZE piece at the start: */
> + if (start & ~PMD_MASK) {
> + __accept_memory(start, round_up(start, PMD_SIZE));
> + start = round_up(start, PMD_SIZE);
> + }
> +
> + /* Immediately accept a <PMD_SIZE piece at the end: */
> + if (end & ~PMD_MASK) {
> + __accept_memory(round_down(end, PMD_SIZE), end);
> + end = round_down(end, PMD_SIZE);
> + }
> +
> + /*
> + * 'start' and 'end' are now both PMD-aligned.
> + * Record the range as being unaccepted:
> + */
> + bitmap_set((unsigned long *)params->unaccepted_memory,
> + start / PMD_SIZE, (end - start) / PMD_SIZE);
> +}
> diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> new file mode 100644
> index 000000000000..df0736d32858
> --- /dev/null
> +++ b/arch/x86/include/asm/unaccepted_memory.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (C) 2020 Intel Corporation */
> +#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
> +#define _ASM_X86_UNACCEPTED_MEMORY_H
> +
> +struct boot_params;
> +
> +void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
> +
> +#endif
> diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
> index 01d19fc22346..630a54046af0 100644
> --- a/arch/x86/include/uapi/asm/bootparam.h
> +++ b/arch/x86/include/uapi/asm/bootparam.h
> @@ -189,7 +189,7 @@ struct boot_params {
> __u64 tboot_addr; /* 0x058 */
> struct ist_info ist_info; /* 0x060 */
> __u64 acpi_rsdp_addr; /* 0x070 */
> - __u8 _pad3[8]; /* 0x078 */
> + __u64 unaccepted_memory; /* 0x078 */
> __u8 hd0_info[16]; /* obsolete! */ /* 0x080 */
> __u8 hd1_info[16]; /* obsolete! */ /* 0x090 */
> struct sys_desc_table sys_desc_table; /* obsolete! */ /* 0x0a0 */
> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index 043ca31c114e..231f1c70d1db 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -269,6 +269,20 @@ config EFI_COCO_SECRET
> virt/coco/efi_secret module to access the secrets, which in turn
> allows userspace programs to access the injected secrets.
>
> +config UNACCEPTED_MEMORY
> + bool
> + depends on EFI_STUB
> + help
> + Some Virtual Machine platforms, such as Intel TDX, require
> + some memory to be "accepted" by the guest before it can be used.
> + This mechanism helps prevent malicious hosts from making changes
> + to guest memory.
> +
> + UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type.
> +
> + This option adds support for unaccepted memory and makes such memory
> + usable by the kernel.
> +
> config EFI_EMBEDDED_FIRMWARE
> bool
> select CRYPTO_LIB_SHA256
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index abeff7dc0b58..7dce06e419c5 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -843,6 +843,7 @@ static __initdata char memory_type_name[][13] = {
> "MMIO Port",
> "PAL Code",
> "Persistent",
> + "Unaccepted",
> };
>
> char * __init efi_md_typeattr_format(char *buf, size_t size,
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index fff81843169c..1643ddbde249 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -15,6 +15,7 @@
> #include <asm/setup.h>
> #include <asm/desc.h>
> #include <asm/boot.h>
> +#include <asm/unaccepted_memory.h>
>
> #include "efistub.h"
>
> @@ -613,6 +614,16 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
> e820_type = E820_TYPE_PMEM;
> break;
>
> + case EFI_UNACCEPTED_MEMORY:
> + if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
> + efi_warn_once(
> +"The system has unaccepted memory, but kernel does not support it\nConsider enabling CONFIG_UNACCEPTED_MEMORY\n");
> + continue;
> + }
> + e820_type = E820_TYPE_RAM;
> + process_unaccepted_memory(params, d->phys_addr,
> + d->phys_addr + PAGE_SIZE * d->num_pages);
> + break;
> default:
> continue;
> }
> @@ -677,6 +688,57 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
> return status;
> }
>
> +static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
> + __u32 nr_desc,
> + struct efi_boot_memmap *map)
> +{
> + unsigned long *mem = NULL;
> + u64 size, max_addr = 0;
> + efi_status_t status;
> + bool found = false;
> + int i;
> +
> + /* Check if there's any unaccepted memory and find the max address */
> + for (i = 0; i < nr_desc; i++) {
> + efi_memory_desc_t *d;
> + unsigned long m = (unsigned long)map->map;
> +
> + d = efi_early_memdesc_ptr(m, map->desc_size, i);
> + if (d->type == EFI_UNACCEPTED_MEMORY)
> + found = true;
> + if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> + max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> + }
> +
> + if (!found) {
> + params->unaccepted_memory = 0;
> + return EFI_SUCCESS;
> + }
> +
> + /*
> + * If unaccepted memory is present, allocate a bitmap to track what
> + * memory has to be accepted before access.
> + *
> + * One bit in the bitmap represents 2MiB in the address space:
> + * A 4k bitmap can track 64GiB of physical address space.
> + *
> + * In the worst case scenario -- a huge hole in the middle of the
> + * address space -- It needs 256MiB to handle 4PiB of the address
> + * space.
> + *
> + * The bitmap will be populated in setup_e820() according to the memory
> + * map after efi_exit_boot_services().
> + */
> + size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> + status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
> + if (status == EFI_SUCCESS) {
> + memset(mem, 0, size);
> + params->unaccepted_memory = (unsigned long)mem;
> + }
> +
> + return status;
> +}
> +
> static efi_status_t allocate_e820(struct boot_params *params,
> struct setup_data **e820ext,
> u32 *e820ext_size)
> @@ -697,6 +759,9 @@ static efi_status_t allocate_e820(struct boot_params *params,
> status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
> }
>
> + if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
> + status = allocate_unaccepted_bitmap(params, nr_desc, map);
> +
> efi_bs_call(free_pool, map);
> return status;
> }
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index 7aa62c92185f..efbe14641638 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -108,7 +108,8 @@ typedef struct {
> #define EFI_MEMORY_MAPPED_IO_PORT_SPACE 12
> #define EFI_PAL_CODE 13
> #define EFI_PERSISTENT_MEMORY 14
> -#define EFI_MAX_MEMORY_TYPE 15
> +#define EFI_UNACCEPTED_MEMORY 15
> +#define EFI_MAX_MEMORY_TYPE 16
>
> /* Attribute values: */
> #define EFI_MEMORY_UC ((u64)0x0000000000000001ULL) /* uncached */
> --
> 2.39.3
>
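
As a worked illustration of the splitting rule in process_unaccepted_memory(), the sketch below reproduces the arithmetic in userspace C. split_range() and struct split_result are hypothetical names; instead of calling an acceptance hook or bitmap_set(), the function only reports how many bytes would be accepted immediately and which bits would be set.

```c
#include <stdint.h>

#define PMD_SIZE	(UINT64_C(1) << 21)	/* 2MiB, as in the patch */

struct split_result {
	uint64_t accepted_now;	/* bytes accepted immediately */
	uint64_t first_bit;	/* first bitmap bit that would be set */
	uint64_t nr_bits;	/* number of bits set; 0 if nothing is deferred */
};

struct split_result split_range(uint64_t start, uint64_t end)
{
	struct split_result r = { 0, 0, 0 };
	uint64_t aligned_start, aligned_end;

	/* Small ranges are accepted whole, so nr_bits is never zero below. */
	if (end - start < 2 * PMD_SIZE) {
		r.accepted_now = end - start;
		return r;
	}

	/* Trim the unaligned head and tail; both are accepted right away. */
	aligned_start = (start + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
	aligned_end = end & ~(PMD_SIZE - 1);

	r.accepted_now = (aligned_start - start) + (end - aligned_end);
	r.first_bit = aligned_start / PMD_SIZE;
	r.nr_bits = (aligned_end - aligned_start) / PMD_SIZE;
	return r;
}
```

For the range 4k..4MiB from the comment, everything is accepted up front. For 1MiB..5MiB, one megabyte at each end is accepted and exactly one bit (bit 1, covering 2MiB..4MiB) is left to mark as unaccepted.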

2023-05-08 08:02:42

by Ard Biesheuvel

Subject: Re: [PATCHv10 05/11] x86/boot/compressed: Handle unaccepted memory

On Mon, 8 May 2023 at 01:46, Kirill A. Shutemov
<[email protected]> wrote:
>
> The firmware will pre-accept the memory used to run the stub. But, the
> stub is responsible for accepting the memory into which it decompresses
> the main kernel. Accept memory just before decompression starts.
>
> The stub is also responsible for choosing a physical address in which to
> place the decompressed kernel image. The KASLR mechanism will randomize
> this physical address. Since the pre-accepted memory region is relatively
> small, KASLR would be quite ineffective if it only used the pre-accepted
> area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes across the
> entire physical address space by also including EFI_UNACCEPTED_MEMORY.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Acked-by: Ard Biesheuvel <[email protected]>

> ---
> arch/x86/boot/compressed/Makefile | 2 +-
> arch/x86/boot/compressed/efi.h | 1 +
> arch/x86/boot/compressed/kaslr.c | 35 ++++++++++++++++--------
> arch/x86/boot/compressed/mem.c | 18 ++++++++++++
> arch/x86/boot/compressed/misc.c | 6 ++++
> arch/x86/boot/compressed/misc.h | 6 ++++
> arch/x86/include/asm/unaccepted_memory.h | 2 ++
> 7 files changed, 57 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
> index f62c02348f9a..74f7adee46ad 100644
> --- a/arch/x86/boot/compressed/Makefile
> +++ b/arch/x86/boot/compressed/Makefile
> @@ -107,7 +107,7 @@ endif
>
> vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
> vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdcall.o
> -vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o
> +vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/find.o $(obj)/mem.o
>
> vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
> vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
> diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
> index 7db2f41b54cd..cf475243b6d5 100644
> --- a/arch/x86/boot/compressed/efi.h
> +++ b/arch/x86/boot/compressed/efi.h
> @@ -32,6 +32,7 @@ typedef struct {
> } efi_table_hdr_t;
>
> #define EFI_CONVENTIONAL_MEMORY 7
> +#define EFI_UNACCEPTED_MEMORY 15
>
> #define EFI_MEMORY_MORE_RELIABLE \
> ((u64)0x0000000000010000ULL) /* higher reliability */
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index 454757fbdfe5..749f0fe7e446 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -672,6 +672,28 @@ static bool process_mem_region(struct mem_vector *region,
> }
>
> #ifdef CONFIG_EFI
> +
> +/*
> + * Only EFI_CONVENTIONAL_MEMORY and EFI_UNACCEPTED_MEMORY (if supported) are
> + * guaranteed to be free.
> + *
> + * It is more conservative in picking free memory than the EFI spec allows:
> + *
> + * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also free memory
> + * and thus available to place the kernel image into, but in practice there's
> + * firmware where using that memory leads to crashes.
> + */
> +static inline bool memory_type_is_free(efi_memory_desc_t *md)
> +{
> + if (md->type == EFI_CONVENTIONAL_MEMORY)
> + return true;
> +
> + if (md->type == EFI_UNACCEPTED_MEMORY)
> + return IS_ENABLED(CONFIG_UNACCEPTED_MEMORY);
> +
> + return false;
> +}
> +
> /*
> * Returns true if we processed the EFI memmap, which we prefer over the E820
> * table if it is available.
> @@ -716,18 +738,7 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
> for (i = 0; i < nr_desc; i++) {
> md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);
>
> - /*
> - * Here we are more conservative in picking free memory than
> - * the EFI spec allows:
> - *
> - * According to the spec, EFI_BOOT_SERVICES_{CODE|DATA} are also
> - * free memory and thus available to place the kernel image into,
> - * but in practice there's firmware where using that memory leads
> - * to crashes.
> - *
> - * Only EFI_CONVENTIONAL_MEMORY is guaranteed to be free.
> - */
> - if (md->type != EFI_CONVENTIONAL_MEMORY)
> + if (!memory_type_is_free(md))
> continue;
>
> if (efi_soft_reserve_enabled() &&
> diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
> index 6b15a0ed8b54..de858a5180b6 100644
> --- a/arch/x86/boot/compressed/mem.c
> +++ b/arch/x86/boot/compressed/mem.c
> @@ -3,12 +3,15 @@
> #include "../cpuflags.h"
> #include "bitmap.h"
> #include "error.h"
> +#include "find.h"
> #include "math.h"
>
> #define PMD_SHIFT 21
> #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> #define PMD_MASK (~(PMD_SIZE - 1))
>
> +extern struct boot_params *boot_params;
> +
> static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
> {
> /* Platform-specific memory-acceptance call goes here */
> @@ -71,3 +74,18 @@ void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
> bitmap_set((unsigned long *)params->unaccepted_memory,
> start / PMD_SIZE, (end - start) / PMD_SIZE);
> }
> +
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> + unsigned long range_start, range_end;
> + unsigned long *bitmap, bitmap_size;
> +
> + bitmap = (unsigned long *)boot_params->unaccepted_memory;
> + range_start = start / PMD_SIZE;
> + bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
> +
> + for_each_set_bitrange_from(range_start, range_end, bitmap, bitmap_size) {
> + __accept_memory(range_start * PMD_SIZE, range_end * PMD_SIZE);
> + bitmap_clear(bitmap, range_start, range_end - range_start);
> + }
> +}
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index 014ff222bf4b..186bfd53e042 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -455,6 +455,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
> #endif
>
> debug_putstr("\nDecompressing Linux... ");
> +
> + if (boot_params->unaccepted_memory) {
> + debug_putstr("Accepting memory... ");
> + accept_memory(__pa(output), __pa(output) + needed_size);
> + }
> +
> __decompress(input_data, input_len, NULL, NULL, output, output_len,
> NULL, error);
> entry_offset = parse_elf(output);
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index 2f155a0e3041..9663d1839f54 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -247,4 +247,10 @@ static inline unsigned long efi_find_vendor_table(struct boot_params *bp,
> }
> #endif /* CONFIG_EFI */
>
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +#else
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end) {}
> +#endif
> +
> #endif /* BOOT_COMPRESSED_MISC_H */
> diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> index df0736d32858..41fbfc798100 100644
> --- a/arch/x86/include/asm/unaccepted_memory.h
> +++ b/arch/x86/include/asm/unaccepted_memory.h
> @@ -7,4 +7,6 @@ struct boot_params;
>
> void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
>
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +
> #endif
> --
> 2.39.3
>
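
The accept_memory() loop in this patch walks runs of set bits that overlap the requested range, accepts each run, and clears it. The sketch below models that behaviour in userspace C under stated assumptions: it uses plain loops rather than for_each_set_bitrange_from(), and platform_accept(), units_accepted and demo_accept() are stand-ins invented for the example.

```c
#include <stdint.h>

#define PMD_SIZE	(UINT64_C(1) << 21)
#define BITS_PER_LONG	(8 * sizeof(unsigned long))

static int bit_is_set(const unsigned long *map, uint64_t bit)
{
	return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL;
}

static void bit_clear(unsigned long *map, uint64_t bit)
{
	map[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
}

/* Stand-in for the platform-specific call; it only counts 2MiB units. */
uint64_t units_accepted;

static void platform_accept(uint64_t start, uint64_t end)
{
	units_accepted += (end - start) / PMD_SIZE;
}

/* Accept every still-unaccepted 2MiB unit that overlaps [start, end). */
void accept_range(unsigned long *bitmap, uint64_t start, uint64_t end)
{
	uint64_t bit = start / PMD_SIZE;
	uint64_t last = (end + PMD_SIZE - 1) / PMD_SIZE; /* one past the end */

	while (bit < last) {
		uint64_t run_start;

		if (!bit_is_set(bitmap, bit)) {
			bit++;
			continue;
		}

		/* Extend the run while bits stay set, clearing as we go. */
		run_start = bit;
		while (bit < last && bit_is_set(bitmap, bit))
			bit_clear(bitmap, bit++);

		platform_accept(run_start * PMD_SIZE, bit * PMD_SIZE);
	}
}

/* Usage: units 1-3 and 6 start out unaccepted; accept [4MiB, 10MiB). */
unsigned long demo_accept(void)
{
	unsigned long bitmap[1] = { 0x4eUL };	/* bits 1, 2, 3 and 6 */

	units_accepted = 0;
	accept_range(bitmap, 2 * PMD_SIZE, 5 * PMD_SIZE);
	return bitmap[0];
}
```

In the usage example only units 2 and 3 fall inside the requested range, so two units are accepted and bits 1 and 6 remain set for later.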

2023-05-08 08:03:41

by Ard Biesheuvel

Subject: Re: [PATCHv10 02/11] efi/x86: Get full memory map in allocate_e820()

On Mon, 8 May 2023 at 01:46, Kirill A. Shutemov
<[email protected]> wrote:
>
> Currently allocate_e820() is only interested in the size of the map and
> the size of a memory descriptor, to determine how many e820 entries the
> kernel needs.
>
> UEFI Specification version 2.9 introduces a new memory type --
> unaccepted memory. To track unaccepted memory, the kernel needs to allocate
> a bitmap. The size of the bitmap is dependent on the maximum physical
> address present in the system. A full memory map is required to find
> the maximum address.
>
> Modify allocate_e820() to get a full memory map.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Borislav Petkov <[email protected]>

Acked-by: Ard Biesheuvel <[email protected]>

> ---
> drivers/firmware/efi/libstub/x86-stub.c | 26 +++++++++++--------------
> 1 file changed, 11 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index a0bfd31358ba..fff81843169c 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -681,28 +681,24 @@ static efi_status_t allocate_e820(struct boot_params *params,
> struct setup_data **e820ext,
> u32 *e820ext_size)
> {
> - unsigned long map_size, desc_size, map_key;
> + struct efi_boot_memmap *map;
> efi_status_t status;
> - __u32 nr_desc, desc_version;
> + __u32 nr_desc;
>
> - /* Only need the size of the mem map and size of each mem descriptor */
> - map_size = 0;
> - status = efi_bs_call(get_memory_map, &map_size, NULL, &map_key,
> - &desc_size, &desc_version);
> - if (status != EFI_BUFFER_TOO_SMALL)
> - return (status != EFI_SUCCESS) ? status : EFI_UNSUPPORTED;
> + status = efi_get_memory_map(&map, false);
> + if (status != EFI_SUCCESS)
> + return status;
>
> - nr_desc = map_size / desc_size + EFI_MMAP_NR_SLACK_SLOTS;
> -
> - if (nr_desc > ARRAY_SIZE(params->e820_table)) {
> - u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table);
> + nr_desc = map->map_size / map->desc_size;
> + if (nr_desc > ARRAY_SIZE(params->e820_table) - EFI_MMAP_NR_SLACK_SLOTS) {
> + u32 nr_e820ext = nr_desc - ARRAY_SIZE(params->e820_table) +
> + EFI_MMAP_NR_SLACK_SLOTS;
>
> status = alloc_e820ext(nr_e820ext, e820ext, e820ext_size);
> - if (status != EFI_SUCCESS)
> - return status;
> }
>
> - return EFI_SUCCESS;
> + efi_bs_call(free_pool, map);
> + return status;
> }
>
> struct exit_boot_struct {
> --
> 2.39.3
>
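
The reworked slack handling in allocate_e820() can be checked in isolation. In the sketch below, nr_e820ext_needed() is a hypothetical helper, and the two constants are assumptions standing in for ARRAY_SIZE(params->e820_table) and EFI_MMAP_NR_SLACK_SLOTS; the values 128 and 8 are used only to make the arithmetic concrete.

```c
#include <stdint.h>

#define E820_TABLE_ENTRIES	128	/* assumed ARRAY_SIZE(params->e820_table) */
#define SLACK_SLOTS		8	/* assumed EFI_MMAP_NR_SLACK_SLOTS */

/*
 * Number of extended e820 entries to allocate for a memory map with
 * nr_desc descriptors, keeping SLACK_SLOTS spare entries in reserve.
 */
uint32_t nr_e820ext_needed(uint32_t nr_desc)
{
	if (nr_desc > E820_TABLE_ENTRIES - SLACK_SLOTS)
		return nr_desc - E820_TABLE_ENTRIES + SLACK_SLOTS;

	return 0;	/* everything fits in the zero-page table */
}
```

With these values a map of up to 120 descriptors needs no extension, 121 descriptors need one extra entry, and 200 descriptors need 80. This matches what the old code computed after adding the slack to nr_desc first.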

2023-05-08 19:19:12

by Kirill A. Shutemov

Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

On Mon, May 08, 2023 at 09:30:22AM +0200, Ard Biesheuvel wrote:
> Hello Kirill,
>
> On Mon, 8 May 2023 at 01:46, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > UEFI Specification version 2.9 introduces the concept of memory
> > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > SEV-SNP, require memory to be accepted before it can be used by the
> > guest. Accepting happens via a protocol specific to the Virtual
> > Machine platform.
> >
> > Accepting memory is costly and it makes VMM allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until memory is needed. It lowers boot time and reduces
> > memory overhead.
> >
> > The kernel needs to know what memory has been accepted. Firmware
> > communicates this information via memory map: a new memory type --
> > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> >
> > Range-based tracking works fine for firmware, but it gets bulky for
> > the kernel: e820 has to be modified on every page acceptance. This
> > leads to table fragmentation, and the e820 table has a limited number
> > of entries.
> >
> > Another option is to mark such memory as usable in e820 and track if the
> > range has been accepted in a bitmap. One bit in the bitmap represents
> > 2MiB in the address space: one 4k page is enough to track 64GiB of
> > physical address space.
> >
> > In the worst-case scenario -- a huge hole in the middle of the
> > address space -- it needs 256MiB to handle 4PiB of the address
> > space.
> >
> > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> >
> > The bitmap is allocated and constructed in the EFI stub and passed down
> > to the kernel via boot_params. allocate_e820() allocates the bitmap if
> > unaccepted memory is present, according to the maximum address in the
> > memory map.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > Documentation/arch/x86/zero-page.rst | 1 +
> > arch/x86/boot/compressed/Makefile | 1 +
> > arch/x86/boot/compressed/mem.c | 73 ++++++++++++++++++++++++
> > arch/x86/include/asm/unaccepted_memory.h | 10 ++++
> > arch/x86/include/uapi/asm/bootparam.h | 2 +-
> > drivers/firmware/efi/Kconfig | 14 +++++
> > drivers/firmware/efi/efi.c | 1 +
> > drivers/firmware/efi/libstub/x86-stub.c | 65 +++++++++++++++++++++
> > include/linux/efi.h | 3 +-
> > 9 files changed, 168 insertions(+), 2 deletions(-)
> > create mode 100644 arch/x86/boot/compressed/mem.c
> > create mode 100644 arch/x86/include/asm/unaccepted_memory.h
> >
> > diff --git a/Documentation/arch/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst
> > index 45aa9cceb4f1..f21905e61ade 100644
> > --- a/Documentation/arch/x86/zero-page.rst
> > +++ b/Documentation/arch/x86/zero-page.rst
> > @@ -20,6 +20,7 @@ Offset/Size Proto Name Meaning
> > 060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
> > (struct ist_info)
> > 070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table
> > +078/008 ALL unaccepted_memory Bitmap of unaccepted memory (1bit == 2M)
>
> Unaccepted memory is a generic EFI feature, and will need to be
> supported on other architectures as well.
>
> Could we perhaps use an EFI configuration table to pass the bitmap to
> the core kernel, instead of adding more cruft to this archaic header?
> That could be implemented in an arch-agnostic manner, even in cases
> where the bootloader is the agent that calls ExitBootServices(), as it
> would be the loader that allocates and populates the bitmap in that
> case.

Okay, that's a fair point.

Below is my take on this. It is on top of the whole patchset. It seems to
be functional, but more testing is required.

While there, I also removed the hardcoded 1 bit == 2MB.
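
To illustrate what a variable unit size means for the bitmap, here is a
minimal userspace sketch (not part of the patch). The helper names
bitmap_bytes() and unit_index() are made up for illustration; the
arithmetic mirrors the DIV_ROUND_UP(max_addr, unit * BITS_PER_BYTE) sizing
and the start / unit_size indexing used in the diff below.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8ULL

/*
 * Hypothetical helper: bytes of bitmap needed to cover [0, max_addr)
 * when one bit tracks unit_size bytes of physical address space.
 */
static uint64_t bitmap_bytes(uint64_t max_addr, uint64_t unit_size)
{
    uint64_t span;

    assert(unit_size != 0);
    span = unit_size * BITS_PER_BYTE;   /* address space per bitmap byte */
    return (max_addr + span - 1) / span; /* round up */
}

/* Hypothetical helper: index of the bit tracking the unit that holds phys. */
static uint64_t unit_index(uint64_t phys, uint64_t unit_size)
{
    assert(unit_size != 0);
    return phys / unit_size;
}
```

With unit_size == 2MB this gives the numbers from the cover letter: 4096
bytes of bitmap for 64GiB, and 256MiB of bitmap for 4PiB.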

My EFI knowledge is rather superficial. I would be glad for feedback.

diff --git a/Documentation/arch/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst
index f21905e61ade..45aa9cceb4f1 100644
--- a/Documentation/arch/x86/zero-page.rst
+++ b/Documentation/arch/x86/zero-page.rst
@@ -20,7 +20,6 @@ Offset/Size Proto Name Meaning
060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
(struct ist_info)
070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table
-078/008 ALL unaccepted_memory Bitmap of unaccepted memory (1bit == 2M)
080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
0A0/010 ALL sys_desc_table System description table (struct sys_desc_table),
diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
index cf475243b6d5..5b63628912da 100644
--- a/arch/x86/boot/compressed/efi.h
+++ b/arch/x86/boot/compressed/efi.h
@@ -105,6 +105,15 @@ struct efi_setup_data {
u64 reserved[8];
};

+struct efi_unaccepted_memory {
+ u32 version;
+ u32 unit_size;
+ u64 size;
+ u64 bitmap[];
+};
+
+extern struct efi_unaccepted_memory *unaccepted_table;
+
static inline int efi_guidcmp (efi_guid_t left, efi_guid_t right)
{
return memcmp(&left, &right, sizeof (efi_guid_t));
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index e6b92e822ddd..7d50aea0d6b1 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -1,7 +1,9 @@
// SPDX-License-Identifier: GPL-2.0-only

+#include <linux/uuid.h>
#include "../cpuflags.h"
#include "bitmap.h"
+#include "efi.h"
#include "error.h"
#include "find.h"
#include "math.h"
@@ -12,8 +14,6 @@
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE - 1))

-extern struct boot_params *boot_params;
-
/*
* accept_memory() and process_unaccepted_memory() called from EFI stub which
* runs before decompresser and its early_tdx_detect().
@@ -57,66 +57,77 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
*
* The function will never reach the bitmap_set() with zero bits to set.
*/
-void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
+void process_unaccepted_memory(u64 start, u64 end)
{
+ u64 unit_size = unaccepted_table->unit_size;
+ u64 unit_mask = unaccepted_table->unit_size - 1;
+ u64 bitmap_size = unaccepted_table->size;
+
/*
* Ensure that at least one bit will be set in the bitmap by
- * immediately accepting all regions under 2*PMD_SIZE. This is
+ * immediately accepting all regions under 2*unit_size. This is
* imprecise and may immediately accept some areas that could
* have been represented in the bitmap. But, results in simpler
* code below
*
- * Consider case like this:
+ * Consider case like this (assuming unit_size == 2MB):
*
* | 4k | 2044k | 2048k |
* ^ 0x0 ^ 2MB ^ 4MB
*
* Only the first 4k has been accepted. The 0MB->2MB region can not be
* represented in the bitmap. The 2MB->4MB region can be represented in
- * the bitmap. But, the 0MB->4MB region is <2*PMD_SIZE and will be
+ * the bitmap. But, the 0MB->4MB region is <2*unit_size and will be
* immediately accepted in its entirety.
*/
- if (end - start < 2 * PMD_SIZE) {
+ if (end - start < 2 * unit_size) {
__accept_memory(start, end);
return;
}

/*
* No matter how the start and end are aligned, at least one unaccepted
- * PMD_SIZE area will remain to be marked in the bitmap.
+ * unit_size area will remain to be marked in the bitmap.
*/

- /* Immediately accept a <PMD_SIZE piece at the start: */
- if (start & ~PMD_MASK) {
- __accept_memory(start, round_up(start, PMD_SIZE));
- start = round_up(start, PMD_SIZE);
+ /* Immediately accept a <unit_size piece at the start: */
+ if (start & unit_mask) {
+ __accept_memory(start, round_up(start, unit_size));
+ start = round_up(start, unit_size);
}

- /* Immediately accept a <PMD_SIZE piece at the end: */
- if (end & ~PMD_MASK) {
- __accept_memory(round_down(end, PMD_SIZE), end);
- end = round_down(end, PMD_SIZE);
+ /* Immediately accept a <unit_size piece at the end: */
+ if (end & unit_mask) {
+ __accept_memory(round_down(end, unit_size), end);
+ end = round_down(end, unit_size);
+ }
+
+ /* Accept everything that cannot be recorded into the bitmap */
+ if (end > bitmap_size * unit_size * BITS_PER_BYTE) {
+ __accept_memory(bitmap_size * unit_size * BITS_PER_BYTE, end);
+ end = bitmap_size * unit_size * BITS_PER_BYTE;
}

/*
- * 'start' and 'end' are now both PMD-aligned.
+ * 'start' and 'end' are now both unit_size-aligned.
* Record the range as being unaccepted:
*/
- bitmap_set((unsigned long *)params->unaccepted_memory,
- start / PMD_SIZE, (end - start) / PMD_SIZE);
+ bitmap_set((unsigned long *)unaccepted_table->bitmap,
+ start / unit_size, (end - start) / unit_size);
}

void accept_memory(phys_addr_t start, phys_addr_t end)
{
unsigned long range_start, range_end;
unsigned long *bitmap, bitmap_size;
+ u64 unit_size = unaccepted_table->unit_size;

- bitmap = (unsigned long *)boot_params->unaccepted_memory;
- range_start = start / PMD_SIZE;
- bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
+ bitmap = (unsigned long *)unaccepted_table->bitmap;
+ range_start = start / unit_size;
+ bitmap_size = DIV_ROUND_UP(end, unit_size);

for_each_set_bitrange_from(range_start, range_end, bitmap, bitmap_size) {
- __accept_memory(range_start * PMD_SIZE, range_end * PMD_SIZE);
+ __accept_memory(range_start * unit_size, range_end * unit_size);
bitmap_clear(bitmap, range_start, range_end - range_start);
}
}
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 186bfd53e042..f481f0b30873 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -456,7 +456,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,

debug_putstr("\nDecompressing Linux... ");

- if (boot_params->unaccepted_memory) {
+ if (unaccepted_table) {
debug_putstr("Accepting memory... ");
accept_memory(__pa(output), __pa(output) + needed_size);
}
diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index 89fc91c61560..9f695bdde01c 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -3,9 +3,7 @@
#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
#define _ASM_X86_UNACCEPTED_MEMORY_H

-struct boot_params;
-
-void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
+void process_unaccepted_memory(u64 start, u64 num);

#ifdef CONFIG_UNACCEPTED_MEMORY

diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 630a54046af0..01d19fc22346 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -189,7 +189,7 @@ struct boot_params {
__u64 tboot_addr; /* 0x058 */
struct ist_info ist_info; /* 0x060 */
__u64 acpi_rsdp_addr; /* 0x070 */
- __u64 unaccepted_memory; /* 0x078 */
+ __u8 _pad3[8]; /* 0x078 */
__u8 hd0_info[16]; /* obsolete! */ /* 0x080 */
__u8 hd1_info[16]; /* obsolete! */ /* 0x090 */
struct sys_desc_table sys_desc_table; /* obsolete! */ /* 0x0a0 */
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 483c36a28d2e..8ee6b756712f 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -16,6 +16,7 @@
#include <linux/firmware-map.h>
#include <linux/sort.h>
#include <linux/memory_hotplug.h>
+#include <linux/efi.h>

#include <asm/e820/api.h>
#include <asm/setup.h>
@@ -1324,13 +1325,15 @@ void __init e820__memblock_setup(void)
* e820_table is not finalized and e820__end_of_ram_pfn() cannot be
* used to get correct RAM size.
*/
- if (boot_params.unaccepted_memory) {
+ if (efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
+ struct efi_unaccepted_memory *unaccepted;
unsigned long size;

- /* One bit per 2MB */
- size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
- PMD_SIZE * BITS_PER_BYTE);
- memblock_reserve(boot_params.unaccepted_memory, size);
+ unaccepted = __va(efi.unaccepted);
+
+ size = sizeof(struct efi_unaccepted_memory);
+ size += unaccepted->size;
+ memblock_reserve(efi.unaccepted, size);
}

/*
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
index f61174d4c3cb..ec2b616ef32e 100644
--- a/arch/x86/mm/unaccepted_memory.c
+++ b/arch/x86/mm/unaccepted_memory.c
@@ -1,4 +1,5 @@
// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/efi.h>
#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/pfn.h>
@@ -14,15 +15,19 @@ static DEFINE_SPINLOCK(unaccepted_memory_lock);

void accept_memory(phys_addr_t start, phys_addr_t end)
{
+ struct efi_unaccepted_memory *unaccepted;
unsigned long range_start, range_end;
- unsigned long *bitmap;
- unsigned long flags;
+ unsigned long flags, *bitmap;
+ u64 unit_size;

- if (!boot_params.unaccepted_memory)
+ if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
return;

- bitmap = __va(boot_params.unaccepted_memory);
- range_start = start / PMD_SIZE;
+ unaccepted = __va(efi.unaccepted);
+ unit_size = unaccepted->unit_size;
+ bitmap = (unsigned long *)unaccepted->bitmap;
+
+ range_start = start / unit_size;

/*
* load_unaligned_zeropad() can lead to unwanted loads across page
@@ -42,23 +47,25 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
* used:
*
* 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
- * checks up to end+2M if 'end' is aligned on a 2M boundary.
+ * checks up to end+unit_size if 'end' is aligned on a unit_size
+ * boundary.
*
- * 2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
- * aligned on a 2M boundary. (immediately following this comment)
+ * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
+ * 'end' is aligned on a unit_size boundary. (immediately following
+ * this comment)
*/
- if (!(end % PMD_SIZE))
- end += PMD_SIZE;
+ if (!(end % unit_size))
+ end += unit_size;

spin_lock_irqsave(&unaccepted_memory_lock, flags);
for_each_set_bitrange_from(range_start, range_end, bitmap,
- DIV_ROUND_UP(end, PMD_SIZE)) {
+ DIV_ROUND_UP(end, unit_size)) {
unsigned long len = range_end - range_start;

/* Platform-specific memory-acceptance call goes here */
if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
- tdx_accept_memory(range_start * PMD_SIZE,
- range_end * PMD_SIZE);
+ tdx_accept_memory(range_start * unit_size,
+ range_end * unit_size);
} else {
panic("Cannot accept memory: unknown platform\n");
}
@@ -70,30 +77,33 @@ void accept_memory(phys_addr_t start, phys_addr_t end)

bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
{
- unsigned long *bitmap;
- unsigned long flags;
+ struct efi_unaccepted_memory *unaccepted;
+ unsigned long flags, *bitmap;
bool ret = false;
+ u64 unit_size;

- if (!boot_params.unaccepted_memory)
+ if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
return 0;

- bitmap = __va(boot_params.unaccepted_memory);
+ unaccepted = __va(efi.unaccepted);
+ unit_size = unaccepted->unit_size;
+ bitmap = (unsigned long *)unaccepted->bitmap;

/*
* Also consider the unaccepted state of the *next* page. See fix #1 in
* the comment on load_unaligned_zeropad() in accept_memory().
*/
- if (!(end % PMD_SIZE))
- end += PMD_SIZE;
+ if (!(end % unit_size))
+ end += unit_size;

spin_lock_irqsave(&unaccepted_memory_lock, flags);
while (start < end) {
- if (test_bit(start / PMD_SIZE, bitmap)) {
+ if (test_bit(start / unit_size, bitmap)) {
ret = true;
break;
}

- start += PMD_SIZE;
+ start += unit_size;
}
spin_unlock_irqrestore(&unaccepted_memory_lock, flags);

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 7dce06e419c5..e15a2005ed93 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
#ifdef CONFIG_EFI_COCO_SECRET
.coco_secret = EFI_INVALID_TABLE_ADDR,
#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ .unaccepted = EFI_INVALID_TABLE_ADDR,
+#endif
};
EXPORT_SYMBOL(efi);

@@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
#ifdef CONFIG_EFI_COCO_SECRET
{LINUX_EFI_COCO_SECRET_AREA_GUID, &efi.coco_secret, "CocoSecret" },
#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ {LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID, &efi.unaccepted, "Unaccepted" },
+#endif
#ifdef CONFIG_EFI_GENERIC_STUB
{LINUX_EFI_SCREEN_INFO_TABLE_GUID, &screen_info_table },
#endif
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 1afe7b5b02e1..4953b40f30c3 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -26,6 +26,7 @@ const efi_system_table_t *efi_system_table;
const efi_dxe_services_table_t *efi_dxe_table;
u32 image_offset __section(".data");
static efi_loaded_image_t *image = NULL;
+struct efi_unaccepted_memory *unaccepted_table;

static efi_status_t
preserve_pci_rom_image(efi_pci_io_protocol_t *pci, struct pci_setup_rom **__rom)
@@ -621,7 +622,7 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
continue;
}
e820_type = E820_TYPE_RAM;
- process_unaccepted_memory(params, d->phys_addr,
+ process_unaccepted_memory(d->phys_addr,
d->phys_addr + PAGE_SIZE * d->num_pages);
break;
default:
@@ -692,12 +693,22 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
__u32 nr_desc,
struct efi_boot_memmap *map)
{
- unsigned long *mem = NULL;
- u64 size, max_addr = 0;
+ efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
+ u64 bitmap_size, max_addr = 0;
efi_status_t status;
bool found = false;
int i;

+ /* Check if the table is already installed */
+ unaccepted_table = get_efi_config_table(unaccepted_table_guid);
+ if (unaccepted_table) {
+ if (unaccepted_table->version != 0) {
+ efi_err("Unknown version of unaccepted memory table\n");
+ return EFI_UNSUPPORTED;
+ }
+ return EFI_SUCCESS;
+ }
+
/* Check if there's any unaccepted memory and find the max address */
for (i = 0; i < nr_desc; i++) {
efi_memory_desc_t *d;
@@ -710,11 +721,6 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
}

- if (!found) {
- params->unaccepted_memory = 0;
- return EFI_SUCCESS;
- }
-
/*
* range_contains_unaccepted_memory() may need to check one 2M chunk
* beyond the end of RAM to deal with load_unaligned_zeropad(). Make
@@ -736,11 +742,26 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
* The bitmap will be populated in setup_e820() according to the memory
* map after efi_exit_boot_services().
*/
- size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
- status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
- if (status == EFI_SUCCESS) {
- memset(mem, 0, size);
- params->unaccepted_memory = (unsigned long)mem;
+ bitmap_size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
+
+ status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
+ sizeof(*unaccepted_table) + bitmap_size,
+ (void **)&unaccepted_table);
+ if (status != EFI_SUCCESS) {
+ efi_err("Failed to allocate unaccepted memory config table\n");
+ return status;
+ }
+
+ unaccepted_table->version = 0;
+ unaccepted_table->unit_size = PMD_SIZE;
+ unaccepted_table->size = bitmap_size;
+ memset(unaccepted_table->bitmap, 0, bitmap_size);
+
+ status = efi_bs_call(install_configuration_table,
+ &unaccepted_table_guid, unaccepted_table);
+ if (status != EFI_SUCCESS) {
+ efi_bs_call(free_pool, unaccepted_table);
+ efi_err("Failed to install unaccepted memory config table!\n");
}

return status;
diff --git a/include/linux/efi.h b/include/linux/efi.h
index efbe14641638..f765266a81b3 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -418,6 +418,7 @@ void efi_native_runtime_setup(void);
#define LINUX_EFI_MOK_VARIABLE_TABLE_GUID EFI_GUID(0xc451ed2b, 0x9694, 0x45d3, 0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89)
#define LINUX_EFI_COCO_SECRET_AREA_GUID EFI_GUID(0xadf956ad, 0xe98c, 0x484c, 0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47)
#define LINUX_EFI_BOOT_MEMMAP_GUID EFI_GUID(0x800f683f, 0xd08b, 0x423a, 0xa2, 0x93, 0x96, 0x5c, 0x3c, 0x6f, 0xe2, 0xb4)
+#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9, 0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)

#define RISCV_EFI_BOOT_PROTOCOL_GUID EFI_GUID(0xccd15fec, 0x6f73, 0x4eec, 0x83, 0x95, 0x3e, 0x69, 0xe4, 0xb9, 0x40, 0xbf)

@@ -535,6 +536,13 @@ struct efi_boot_memmap {
efi_memory_desc_t map[];
};

+struct efi_unaccepted_memory {
+ u32 version;
+ u32 unit_size;
+ u64 size;
+ u64 bitmap[];
+};
+
/*
* Architecture independent structure for describing a memory map for the
* benefit of efi_memmap_init_early(), and for passing context between
@@ -637,6 +645,7 @@ extern struct efi {
unsigned long tpm_final_log; /* TPM2 Final Events Log table */
unsigned long mokvar_table; /* MOK variable config table */
unsigned long coco_secret; /* Confidential computing secret table */
+ unsigned long unaccepted;

efi_get_time_t *get_time;
efi_set_time_t *set_time;
--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-08 22:25:36

by Ard Biesheuvel

Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

On Mon, 8 May 2023 at 21:00, Kirill A. Shutemov <[email protected]> wrote:
>
> On Mon, May 08, 2023 at 09:30:22AM +0200, Ard Biesheuvel wrote:
> > Hello Kirill,
> >
> > On Mon, 8 May 2023 at 01:46, Kirill A. Shutemov
> > <[email protected]> wrote:
> > >
> > > UEFI Specification version 2.9 introduces the concept of memory
> > > acceptance: some Virtual Machine platforms, such as Intel TDX or AMD
> > > SEV-SNP, require memory to be accepted before it can be used by the
> > > guest. Accepting happens via a protocol specific for the Virtual
> > > Machine platform.
> > >
> > > Accepting memory is costly and it makes VMM allocate memory for the
> > > accepted guest physical address range. It's better to postpone memory
> > > acceptance until memory is needed. It lowers boot time and reduces
> > > memory overhead.
> > >
> > > The kernel needs to know what memory has been accepted. Firmware
> > > communicates this information via memory map: a new memory type --
> > > EFI_UNACCEPTED_MEMORY -- indicates such memory.
> > >
> > > Range-based tracking works fine for firmware, but it gets bulky for
> > > the kernel: e820 has to be modified on every page acceptance. It leads
> > > to table fragmentation, but there's a limited number of entries in the
> > > e820 table.
> > >
> > > Another option is to mark such memory as usable in e820 and track if the
> > > range has been accepted in a bitmap. One bit in the bitmap represents
> > > 2MiB in the address space: one 4k page is enough to track 64GiB of
> > > physical address space.
> > >
> > > In the worst-case scenario -- a huge hole in the middle of the
> > > address space -- it needs 256MiB to handle 4PiB of the address
> > > space.
> > >
> > > Any unaccepted memory that is not aligned to 2M gets accepted upfront.
> > >
> > > The bitmap is allocated and constructed in the EFI stub and passed down
> > > to the kernel via boot_params. allocate_e820() allocates the bitmap if
> > > unaccepted memory is present, according to the maximum address in the
> > > memory map.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > ---
> > > Documentation/arch/x86/zero-page.rst | 1 +
> > > arch/x86/boot/compressed/Makefile | 1 +
> > > arch/x86/boot/compressed/mem.c | 73 ++++++++++++++++++++++++
> > > arch/x86/include/asm/unaccepted_memory.h | 10 ++++
> > > arch/x86/include/uapi/asm/bootparam.h | 2 +-
> > > drivers/firmware/efi/Kconfig | 14 +++++
> > > drivers/firmware/efi/efi.c | 1 +
> > > drivers/firmware/efi/libstub/x86-stub.c | 65 +++++++++++++++++++++
> > > include/linux/efi.h | 3 +-
> > > 9 files changed, 168 insertions(+), 2 deletions(-)
> > > create mode 100644 arch/x86/boot/compressed/mem.c
> > > create mode 100644 arch/x86/include/asm/unaccepted_memory.h
> > >
> > > diff --git a/Documentation/arch/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst
> > > index 45aa9cceb4f1..f21905e61ade 100644
> > > --- a/Documentation/arch/x86/zero-page.rst
> > > +++ b/Documentation/arch/x86/zero-page.rst
> > > @@ -20,6 +20,7 @@ Offset/Size Proto Name Meaning
> > > 060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
> > > (struct ist_info)
> > > 070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table
> > > +078/008 ALL unaccepted_memory Bitmap of unaccepted memory (1bit == 2M)
> >
> > Unaccepted memory is a generic EFI feature, and will need to be
> > supported on other architectures as well.
> >
> > Could we perhaps use an EFI configuration table to pass the bitmap to
> > the core kernel, instead of adding more cruft to this archaic header?
> > That could be implemented in an arch-agnostic manner, even in cases
> > where the bootloader is the agent that calls ExitBootServices(), as it
> > would be the loader that allocates and populates the bitmap in that
> > case.
>
> Okay, that's a fair point.
>
> Below is my take on this. It is on top of whole patchset. It seems to be
> functional, but more testing is required.
>

Thanks a lot for having a stab at this. Some minor nits below, but
this generally looks like the way to do it.


> While there I also removed hardcoded 1b == 2MB.
>
> My EFI knowledge is rather superficial. I would be glad for feedback.
>
...
> diff --git a/arch/x86/boot/compressed/efi.h b/arch/x86/boot/compressed/efi.h
> index cf475243b6d5..5b63628912da 100644
> --- a/arch/x86/boot/compressed/efi.h
> +++ b/arch/x86/boot/compressed/efi.h
> @@ -105,6 +105,15 @@ struct efi_setup_data {
> u64 reserved[8];
> };
>
> +struct efi_unaccepted_memory {
> + u32 version;
> + u32 unit_size;
> + u64 size;

Could we add a base here too? DRAM could be anywhere in the PA space
on some architectures.
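
A rough userspace sketch of what a base would change, purely as an
illustration: the field name phys_base, the struct name and the helper
phys_to_bit() are assumptions and do not appear in the patch. The idea is
that bit 0 tracks the unit starting at phys_base rather than at physical
address 0, so the bitmap only has to span the DRAM window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: 'phys_base' is NOT part of the quoted patch. */
struct unaccepted_table_sketch {
    uint32_t version;
    uint32_t unit_size;
    uint64_t phys_base;  /* assumed: physical address tracked by bit 0 */
    uint64_t size;       /* bitmap size in bytes */
};

/*
 * Translate a physical address into a bit index relative to phys_base.
 * Returns false if the address lies outside the window the bitmap covers.
 */
static bool phys_to_bit(const struct unaccepted_table_sketch *t,
                        uint64_t phys, uint64_t *bit)
{
    uint64_t nbits, idx;

    assert(t->unit_size != 0);
    nbits = t->size * 8;

    if (phys < t->phys_base)
        return false;

    idx = (phys - t->phys_base) / t->unit_size;
    if (idx >= nbits)
        return false;

    *bit = idx;
    return true;
}
```

With DRAM starting at 1TiB, for example, a 4k bitmap would then still
cover 64GiB of memory instead of being consumed by the hole below it.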


> + u64 bitmap[];

Should this be unsigned long[] ?

> +};
> +
> +extern struct efi_unaccepted_memory *unaccepted_table;
> +
> static inline int efi_guidcmp (efi_guid_t left, efi_guid_t right)
> {
> return memcmp(&left, &right, sizeof (efi_guid_t));
> diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
> index e6b92e822ddd..7d50aea0d6b1 100644
> --- a/arch/x86/boot/compressed/mem.c
> +++ b/arch/x86/boot/compressed/mem.c
> @@ -1,7 +1,9 @@
> // SPDX-License-Identifier: GPL-2.0-only
>
> +#include <linux/uuid.h>
> #include "../cpuflags.h"
> #include "bitmap.h"
> +#include "efi.h"
> #include "error.h"
> #include "find.h"
> #include "math.h"
> @@ -12,8 +14,6 @@
> #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> #define PMD_MASK (~(PMD_SIZE - 1))
>
> -extern struct boot_params *boot_params;
> -
> /*
> * accept_memory() and process_unaccepted_memory() called from EFI stub which
> * runs before decompresser and its early_tdx_detect().
> @@ -57,66 +57,77 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
> *
> * The function will never reach the bitmap_set() with zero bits to set.
> */
> -void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
> +void process_unaccepted_memory(u64 start, u64 end)
> {
> + u64 unit_size = unaccepted_table->unit_size;
> + u64 unit_mask = unaccepted_table->unit_size - 1;
> + u64 bitmap_size = unaccepted_table->size;
> +
> /*
> * Ensure that at least one bit will be set in the bitmap by
> - * immediately accepting all regions under 2*PMD_SIZE. This is
> + * immediately accepting all regions under 2*unit_size. This is
> * imprecise and may immediately accept some areas that could
> * have been represented in the bitmap. But, results in simpler
> * code below
> *
> - * Consider case like this:
> + * Consider case like this (assuming unit_size == 2MB):
> *
> * | 4k | 2044k | 2048k |
> * ^ 0x0 ^ 2MB ^ 4MB
> *
> * Only the first 4k has been accepted. The 0MB->2MB region can not be
> * represented in the bitmap. The 2MB->4MB region can be represented in
> - * the bitmap. But, the 0MB->4MB region is <2*PMD_SIZE and will be
> + * the bitmap. But, the 0MB->4MB region is <2*unit_size and will be
> * immediately accepted in its entirety.
> */
> - if (end - start < 2 * PMD_SIZE) {
> + if (end - start < 2 * unit_size) {
> __accept_memory(start, end);
> return;
> }
>
> /*
> * No matter how the start and end are aligned, at least one unaccepted
> - * PMD_SIZE area will remain to be marked in the bitmap.
> + * unit_size area will remain to be marked in the bitmap.
> */
>
> - /* Immediately accept a <PMD_SIZE piece at the start: */
> - if (start & ~PMD_MASK) {
> - __accept_memory(start, round_up(start, PMD_SIZE));
> - start = round_up(start, PMD_SIZE);
> + /* Immediately accept a <unit_size piece at the start: */
> + if (start & unit_mask) {
> + __accept_memory(start, round_up(start, unit_size));
> + start = round_up(start, unit_size);
> }
>
> - /* Immediately accept a <PMD_SIZE piece at the end: */
> - if (end & ~PMD_MASK) {
> - __accept_memory(round_down(end, PMD_SIZE), end);
> - end = round_down(end, PMD_SIZE);
> + /* Immediately accept a <unit_size piece at the end: */
> + if (end & unit_mask) {
> + __accept_memory(round_down(end, unit_size), end);
> + end = round_down(end, unit_size);
> + }
> +
> + /* Accept everything that cannot be recorded into the bitmap */
> + if (end > bitmap_size * unit_size * BITS_PER_BYTE) {
> + __accept_memory(bitmap_size * unit_size * BITS_PER_BYTE, end);
> + end = bitmap_size * unit_size * BITS_PER_BYTE;
> }
>
> /*
> - * 'start' and 'end' are now both PMD-aligned.
> + * 'start' and 'end' are now both unit_size-aligned.
> * Record the range as being unaccepted:
> */
> - bitmap_set((unsigned long *)params->unaccepted_memory,
> - start / PMD_SIZE, (end - start) / PMD_SIZE);
> + bitmap_set((unsigned long *)unaccepted_table->bitmap,
> + start / unit_size, (end - start) / unit_size);
> }
>
> void accept_memory(phys_addr_t start, phys_addr_t end)
> {
> unsigned long range_start, range_end;
> unsigned long *bitmap, bitmap_size;
> + u64 unit_size = unaccepted_table->unit_size;
>
> - bitmap = (unsigned long *)boot_params->unaccepted_memory;
> - range_start = start / PMD_SIZE;
> - bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
> + bitmap = (unsigned long *)unaccepted_table->bitmap;
> + range_start = start / unit_size;
> + bitmap_size = DIV_ROUND_UP(end, unit_size);
>
> for_each_set_bitrange_from(range_start, range_end, bitmap, bitmap_size) {
> - __accept_memory(range_start * PMD_SIZE, range_end * PMD_SIZE);
> + __accept_memory(range_start * unit_size, range_end * unit_size);
> bitmap_clear(bitmap, range_start, range_end - range_start);
> }
> }
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index 186bfd53e042..f481f0b30873 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -456,7 +456,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
>
> debug_putstr("\nDecompressing Linux... ");
>
> - if (boot_params->unaccepted_memory) {
> + if (unaccepted_table) {
> debug_putstr("Accepting memory... ");
> accept_memory(__pa(output), __pa(output) + needed_size);
> }
> diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
> index 89fc91c61560..9f695bdde01c 100644
> --- a/arch/x86/include/asm/unaccepted_memory.h
> +++ b/arch/x86/include/asm/unaccepted_memory.h
> @@ -3,9 +3,7 @@
> #ifndef _ASM_X86_UNACCEPTED_MEMORY_H
> #define _ASM_X86_UNACCEPTED_MEMORY_H
>
> -struct boot_params;
> -
> -void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
> +void process_unaccepted_memory(u64 start, u64 num);
>
> #ifdef CONFIG_UNACCEPTED_MEMORY
>
> diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
> index 630a54046af0..01d19fc22346 100644
> --- a/arch/x86/include/uapi/asm/bootparam.h
> +++ b/arch/x86/include/uapi/asm/bootparam.h
> @@ -189,7 +189,7 @@ struct boot_params {
> __u64 tboot_addr; /* 0x058 */
> struct ist_info ist_info; /* 0x060 */
> __u64 acpi_rsdp_addr; /* 0x070 */
> - __u64 unaccepted_memory; /* 0x078 */
> + __u8 _pad3[8]; /* 0x078 */
> __u8 hd0_info[16]; /* obsolete! */ /* 0x080 */
> __u8 hd1_info[16]; /* obsolete! */ /* 0x090 */
> struct sys_desc_table sys_desc_table; /* obsolete! */ /* 0x0a0 */
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 483c36a28d2e..8ee6b756712f 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -16,6 +16,7 @@
> #include <linux/firmware-map.h>
> #include <linux/sort.h>
> #include <linux/memory_hotplug.h>
> +#include <linux/efi.h>
>
> #include <asm/e820/api.h>
> #include <asm/setup.h>
> @@ -1324,13 +1325,15 @@ void __init e820__memblock_setup(void)
> * e820_table is not finalized and e820__end_of_ram_pfn() cannot be
> * used to get correct RAM size.
> */
> - if (boot_params.unaccepted_memory) {
> + if (efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> + struct efi_unaccepted_memory *unaccepted;
> unsigned long size;
>
> - /* One bit per 2MB */
> - size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> - PMD_SIZE * BITS_PER_BYTE);
> - memblock_reserve(boot_params.unaccepted_memory, size);
> + unaccepted = __va(efi.unaccepted);
> +
> + size = sizeof(struct efi_unaccepted_memory);
> + size += unaccepted->size;
> + memblock_reserve(efi.unaccepted, size);
> }
>

This could be moved to generic code (but we'll need to use early_memremap()).
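
For reference, the size that has to be reserved is the fixed header plus
the bitmap, and the bitmap length is only known after the header has been
read. That is why the table needs to be mapped before it can be reserved.
Below is a userspace sketch of the size computation only; the mapping
itself is not modelled, and the struct and helper names are stand-ins
rather than kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in mirroring the layout proposed in the patch. */
struct efi_unaccepted_memory_sketch {
    uint32_t version;
    uint32_t unit_size;
    uint64_t size;       /* bitmap size in bytes */
    uint64_t bitmap[];   /* flexible array: not counted by sizeof */
};

/*
 * Hypothetical helper: total number of bytes occupied by the table,
 * i.e. the length to pass to a memblock_reserve()-style call.
 */
static uint64_t unaccepted_table_bytes(const struct efi_unaccepted_memory_sketch *hdr)
{
    assert(hdr != NULL);
    return sizeof(*hdr) + hdr->size;
}
```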

> /*
> diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
> index f61174d4c3cb..ec2b616ef32e 100644
> --- a/arch/x86/mm/unaccepted_memory.c
> +++ b/arch/x86/mm/unaccepted_memory.c
> @@ -1,4 +1,5 @@
> // SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/efi.h>
> #include <linux/memblock.h>
> #include <linux/mm.h>
> #include <linux/pfn.h>
> @@ -14,15 +15,19 @@ static DEFINE_SPINLOCK(unaccepted_memory_lock);
>
> void accept_memory(phys_addr_t start, phys_addr_t end)
> {
> + struct efi_unaccepted_memory *unaccepted;
> unsigned long range_start, range_end;
> - unsigned long *bitmap;
> - unsigned long flags;
> + unsigned long flags, *bitmap;
> + u64 unit_size;
>
> - if (!boot_params.unaccepted_memory)
> + if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
> return;
>
> - bitmap = __va(boot_params.unaccepted_memory);
> - range_start = start / PMD_SIZE;
> + unaccepted = __va(efi.unaccepted);
> + unit_size = unaccepted->unit_size;
> + bitmap = (unsigned long *)unaccepted->bitmap;
> +
> + range_start = start / unit_size;
>
> /*
> * load_unaligned_zeropad() can lead to unwanted loads across page
> @@ -42,23 +47,25 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
> * used:
> *
> * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> - * checks up to end+2M if 'end' is aligned on a 2M boundary.
> + * checks up to end+unit_size if 'end' is aligned on a unit_size
> + * boundary.
> *
> - * 2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
> - * aligned on a 2M boundary. (immediately following this comment)
> + * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
> + * 'end' is aligned on a unit_size boundary. (immediately following
> + * this comment)
> */
> - if (!(end % PMD_SIZE))
> - end += PMD_SIZE;
> + if (!(end % unit_size))
> + end += unit_size;
>
> spin_lock_irqsave(&unaccepted_memory_lock, flags);
> for_each_set_bitrange_from(range_start, range_end, bitmap,
> - DIV_ROUND_UP(end, PMD_SIZE)) {
> + DIV_ROUND_UP(end, unit_size)) {
> unsigned long len = range_end - range_start;
>
> /* Platform-specific memory-acceptance call goes here */
> if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
> - tdx_accept_memory(range_start * PMD_SIZE,
> - range_end * PMD_SIZE);
> + tdx_accept_memory(range_start * unit_size,
> + range_end * unit_size);
> } else {
> panic("Cannot accept memory: unknown platform\n");
> }
> @@ -70,30 +77,33 @@ void accept_memory(phys_addr_t start, phys_addr_t end)
>
> bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
> {
> - unsigned long *bitmap;
> - unsigned long flags;
> + struct efi_unaccepted_memory *unaccepted;
> + unsigned long flags, *bitmap;
> bool ret = false;
> + u64 unit_size;
>
> - if (!boot_params.unaccepted_memory)
> + if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
> return 0;
>
> - bitmap = __va(boot_params.unaccepted_memory);
> + unaccepted = __va(efi.unaccepted);
> + unit_size = unaccepted->unit_size;
> + bitmap = (unsigned long *)unaccepted->bitmap;
>
> /*
> * Also consider the unaccepted state of the *next* page. See fix #1 in
> * the comment on load_unaligned_zeropad() in accept_memory().
> */
> - if (!(end % PMD_SIZE))
> - end += PMD_SIZE;
> + if (!(end % unit_size))
> + end += unit_size;
>
> spin_lock_irqsave(&unaccepted_memory_lock, flags);
> while (start < end) {
> - if (test_bit(start / PMD_SIZE, bitmap)) {
> + if (test_bit(start / unit_size, bitmap)) {
> ret = true;
> break;
> }
>
> - start += PMD_SIZE;
> + start += unit_size;
> }
> spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
>
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 7dce06e419c5..e15a2005ed93 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
> #ifdef CONFIG_EFI_COCO_SECRET
> .coco_secret = EFI_INVALID_TABLE_ADDR,
> #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> + .unaccepted = EFI_INVALID_TABLE_ADDR,
> +#endif
> };
> EXPORT_SYMBOL(efi);
>
> @@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
> #ifdef CONFIG_EFI_COCO_SECRET
> {LINUX_EFI_COCO_SECRET_AREA_GUID, &efi.coco_secret, "CocoSecret" },
> #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> + {LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID, &efi.unaccepted, "Unaccepted" },
> +#endif
> #ifdef CONFIG_EFI_GENERIC_STUB
> {LINUX_EFI_SCREEN_INFO_TABLE_GUID, &screen_info_table },
> #endif
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index 1afe7b5b02e1..4953b40f30c3 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -26,6 +26,7 @@ const efi_system_table_t *efi_system_table;
> const efi_dxe_services_table_t *efi_dxe_table;
> u32 image_offset __section(".data");
> static efi_loaded_image_t *image = NULL;
> +struct efi_unaccepted_memory *unaccepted_table;
>
> static efi_status_t
> preserve_pci_rom_image(efi_pci_io_protocol_t *pci, struct pci_setup_rom **__rom)
> @@ -621,7 +622,7 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
> continue;
> }
> e820_type = E820_TYPE_RAM;
> - process_unaccepted_memory(params, d->phys_addr,
> + process_unaccepted_memory(d->phys_addr,
> d->phys_addr + PAGE_SIZE * d->num_pages);
> break;
> default:
> @@ -692,12 +693,22 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
> __u32 nr_desc,
> struct efi_boot_memmap *map)
> {
> - unsigned long *mem = NULL;
> - u64 size, max_addr = 0;
> + efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
> + u64 bitmap_size, max_addr = 0;
> efi_status_t status;
> bool found = false;
> int i;
>
> + /* Check if the table is already installed */
> + unaccepted_table = get_efi_config_table(unaccepted_table_guid);
> + if (unaccepted_table) {
> + if (unaccepted_table->version != 0) {
> + efi_err("Unknown version of unaccepted memory table\n");
> + return EFI_UNSUPPORTED;
> + }
> + return EFI_SUCCESS;
> + }
> +
> /* Check if there's any unaccepted memory and find the max address */
> for (i = 0; i < nr_desc; i++) {
> efi_memory_desc_t *d;
> @@ -710,11 +721,6 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
> max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> }
>
> - if (!found) {
> - params->unaccepted_memory = 0;
> - return EFI_SUCCESS;
> - }
> -
> /*
> * range_contains_unaccepted_memory() may need to check one 2M chunk
> * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
> @@ -736,11 +742,26 @@ static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
> * The bitmap will be populated in setup_e820() according to the memory
> * map after efi_exit_boot_services().
> */
> - size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> - status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
> - if (status == EFI_SUCCESS) {
> - memset(mem, 0, size);
> - params->unaccepted_memory = (unsigned long)mem;
> + bitmap_size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> +
> + status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
> + sizeof(*unaccepted_table) + bitmap_size,
> + (void **)&unaccepted_table);
> + if (status != EFI_SUCCESS) {
> + efi_err("Failed to allocate unaccepted memory config table\n");
> + return status;
> + }
> +
> + unaccepted_table->version = 0;
> + unaccepted_table->unit_size = PMD_SIZE;
> + unaccepted_table->size = bitmap_size;
> + memset(unaccepted_table->bitmap, 0, bitmap_size);
> +
> + status = efi_bs_call(install_configuration_table,
> + &unaccepted_table_guid, unaccepted_table);
> + if (status != EFI_SUCCESS) {
> + efi_bs_call(free_pool, unaccepted_table);
> + efi_err("Failed to install unaccepted memory config table!\n");
> }
>
> return status;
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index efbe14641638..f765266a81b3 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -418,6 +418,7 @@ void efi_native_runtime_setup(void);
> #define LINUX_EFI_MOK_VARIABLE_TABLE_GUID EFI_GUID(0xc451ed2b, 0x9694, 0x45d3, 0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89)
> #define LINUX_EFI_COCO_SECRET_AREA_GUID EFI_GUID(0xadf956ad, 0xe98c, 0x484c, 0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47)
> #define LINUX_EFI_BOOT_MEMMAP_GUID EFI_GUID(0x800f683f, 0xd08b, 0x423a, 0xa2, 0x93, 0x96, 0x5c, 0x3c, 0x6f, 0xe2, 0xb4)
> +#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9, 0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
>
> #define RISCV_EFI_BOOT_PROTOCOL_GUID EFI_GUID(0xccd15fec, 0x6f73, 0x4eec, 0x83, 0x95, 0x3e, 0x69, 0xe4, 0xb9, 0x40, 0xbf)
>
> @@ -535,6 +536,13 @@ struct efi_boot_memmap {
> efi_memory_desc_t map[];
> };
>
> +struct efi_unaccepted_memory {
> + u32 version;
> + u32 unit_size;
> + u64 size;
> + u64 bitmap[];
> +};
> +
> /*
> * Architecture independent structure for describing a memory map for the
> * benefit of efi_memmap_init_early(), and for passing context between
> @@ -637,6 +645,7 @@ extern struct efi {
> unsigned long tpm_final_log; /* TPM2 Final Events Log table */
> unsigned long mokvar_table; /* MOK variable config table */
> unsigned long coco_secret; /* Confidential computing secret table */
> + unsigned long unaccepted;
>
> efi_get_time_t *get_time;
> efi_set_time_t *set_time;
> --
> Kiryl Shutsemau / Kirill A. Shutemov

2023-05-09 01:05:05

by Kirill A. Shutemov

Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

On Tue, May 09, 2023 at 12:11:41AM +0200, Ard Biesheuvel wrote:
> > @@ -1324,13 +1325,15 @@ void __init e820__memblock_setup(void)
> > * e820_table is not finalized and e820__end_of_ram_pfn() cannot be
> > * used to get correct RAM size.
> > */
> > - if (boot_params.unaccepted_memory) {
> > + if (efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> > + struct efi_unaccepted_memory *unaccepted;
> > unsigned long size;
> >
> > - /* One bit per 2MB */
> > - size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > - PMD_SIZE * BITS_PER_BYTE);
> > - memblock_reserve(boot_params.unaccepted_memory, size);
> > + unaccepted = __va(efi.unaccepted);
> > +
> > + size = sizeof(struct efi_unaccepted_memory);
> > + size += unaccepted->size;
> > + memblock_reserve(efi.unaccepted, size);
> > }
> >
>
> This could be moved to generic code (but we'll need to use early_memremap())

I don't understand why early_memremap() is needed. EFI_LOADER_DATA is already
mapped into the direct mapping. We only need to reserve the memory so that it
cannot be reallocated for other things. Hm?
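
To make the size of that reservation concrete, here is a minimal userspace
sketch (not kernel code) of the arithmetic used in the hunk above. The struct
mirrors the layout from the patch; bitmap_bytes() and reservation_bytes() are
hypothetical helper names used only for illustration. In the kernel the
result would be passed to memblock_reserve(efi.unaccepted, size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the table layout from the patch (include/linux/efi.h). */
struct efi_unaccepted_memory {
	uint32_t version;
	uint32_t unit_size;	/* bytes of address space covered by one bit */
	uint64_t size;		/* bitmap size in bytes */
	uint64_t bitmap[];	/* one bit per unit_size bytes */
};

/* Hypothetical helper: bitmap bytes needed to cover [0, max_addr). */
static uint64_t bitmap_bytes(uint64_t max_addr, uint64_t unit_size)
{
	assert(unit_size != 0);
	/* Same as DIV_ROUND_UP(max_addr, unit_size * BITS_PER_BYTE). */
	return (max_addr + unit_size * 8 - 1) / (unit_size * 8);
}

/* Hypothetical helper: fixed header plus the variable-length bitmap. */
static uint64_t reservation_bytes(const struct efi_unaccepted_memory *t)
{
	return sizeof(*t) + t->size;
}
```

With a 2MiB unit, 64GiB of address space needs a 4KiB bitmap, so the
reservation is the 16-byte header plus 4096 bytes.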

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-09 07:07:05

by Ard Biesheuvel

Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

On Tue, 9 May 2023 at 02:56, Kirill A. Shutemov <[email protected]> wrote:
>
> On Tue, May 09, 2023 at 12:11:41AM +0200, Ard Biesheuvel wrote:
> > > @@ -1324,13 +1325,15 @@ void __init e820__memblock_setup(void)
> > > * e820_table is not finalized and e820__end_of_ram_pfn() cannot be
> > > * used to get correct RAM size.
> > > */
> > > - if (boot_params.unaccepted_memory) {
> > > + if (efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> > > + struct efi_unaccepted_memory *unaccepted;
> > > unsigned long size;
> > >
> > > - /* One bit per 2MB */
> > > - size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > > - PMD_SIZE * BITS_PER_BYTE);
> > > - memblock_reserve(boot_params.unaccepted_memory, size);
> > > + unaccepted = __va(efi.unaccepted);
> > > +
> > > + size = sizeof(struct efi_unaccepted_memory);
> > > + size += unaccepted->size;
> > > + memblock_reserve(efi.unaccepted, size);
> > > }
> > >
> >
> > This could be moved to generic code (but we'll need to use early_memremap())
>
> I don't understand why early_memremap() is needed. EFI_LOADER_DATA is already
> mapped into the direct mapping. We only need to reserve the memory so that it
> cannot be reallocated for other things. Hm?
>

*If* we move this to generic code, we have to ensure that we don't
rely on x86 specific semantics. When parsing the EFI configuration
tables, other architectures don't have a complete direct map yet, as
they receive the memory description from EFI not from a translated
E820 map.

Note that this is only for getting the size of the reservation. Later
on, when we actually consume the contents of the bitmap, generic or
non-x86 code will need to use the ordinary memremap() API to map this
memory, and this amounts to a __va() call when the memory is already
mapped. But I am not suggesting changing that part for this series.
And even the hunk above can remain as you suggest - we can revisit it
once other architectures gain support for this.
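
As an illustration of the "map just enough to learn the size" step described
above, here is a userspace model, not kernel code. model_early_map() and
model_early_unmap() are hypothetical stand-ins for early_memremap() and
early_memunmap(), and fake_phys[] stands in for physical memory; the point is
only that the header is mapped temporarily, read, and unmapped before the
reservation size is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size part of struct efi_unaccepted_memory (everything but bitmap[]). */
struct table_header {
	uint32_t version;
	uint32_t unit_size;
	uint64_t size;		/* bitmap size in bytes */
};

/* Stand-in for physical memory; "physical addresses" are offsets into it. */
static uint8_t fake_phys[4096];
static int live_mappings;	/* temporary mappings not yet released */

/* Hypothetical stand-in for early_memremap(): map a small window. */
static void *model_early_map(uint64_t phys, size_t len)
{
	assert(phys + len <= sizeof(fake_phys));
	live_mappings++;
	return &fake_phys[phys];
}

/* Hypothetical stand-in for early_memunmap(). */
static void model_early_unmap(void *va, size_t len)
{
	(void)va;
	(void)len;
	live_mappings--;
}

/*
 * Map only the header, copy it out, drop the mapping, and return how many
 * bytes starting at 'phys' would need to be reserved.
 */
static uint64_t table_reservation_size(uint64_t phys)
{
	struct table_header hdr;
	void *va = model_early_map(phys, sizeof(hdr));

	memcpy(&hdr, va, sizeof(hdr));
	model_early_unmap(va, sizeof(hdr));

	return sizeof(hdr) + hdr.size;
}
```

The bitmap itself is never touched here, which is why a short-lived mapping
of the header is enough at this stage.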

The main thing I would like to avoid at this point in time is to add
new fields to struct bootparams that loaders such as GRUB may start to
populate as well - I don't think there is a very strong case for
pseudo-EFI boot [where GRUB calls ExitBootServices()] on confidential
VMs (as it prevents the EFI stub and the kernel from accessing the
measurement and attestation APIs), but let's not create more struct
bootparams based API if we can avoid it.

2023-05-12 02:13:34

by Kirill A. Shutemov

Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

On Tue, May 09, 2023 at 08:47:38AM +0200, Ard Biesheuvel wrote:
> On Tue, 9 May 2023 at 02:56, Kirill A. Shutemov <[email protected]> wrote:
> >
> > On Tue, May 09, 2023 at 12:11:41AM +0200, Ard Biesheuvel wrote:
> > > > @@ -1324,13 +1325,15 @@ void __init e820__memblock_setup(void)
> > > > * e820_table is not finalized and e820__end_of_ram_pfn() cannot be
> > > > * used to get correct RAM size.
> > > > */
> > > > - if (boot_params.unaccepted_memory) {
> > > > + if (efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> > > > + struct efi_unaccepted_memory *unaccepted;
> > > > unsigned long size;
> > > >
> > > > - /* One bit per 2MB */
> > > > - size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > > > - PMD_SIZE * BITS_PER_BYTE);
> > > > - memblock_reserve(boot_params.unaccepted_memory, size);
> > > > + unaccepted = __va(efi.unaccepted);
> > > > +
> > > > + size = sizeof(struct efi_unaccepted_memory);
> > > > + size += unaccepted->size;
> > > > + memblock_reserve(efi.unaccepted, size);
> > > > }
> > > >
> > >
> > > This could be moved to generic code (but we'll need to use early_memremap())
> >
> > I don't understand why early_memremap() is needed. EFI_LOADER_DATA is already
> > mapped into the direct mapping. We only need to reserve the memory so that it
> > cannot be reallocated for other things. Hm?
> >
>
> *If* we move this to generic code, we have to ensure that we don't
> rely on x86 specific semantics. When parsing the EFI configuration
> tables, other architectures don't have a complete direct map yet, as
> they receive the memory description from EFI not from a translated
> E820 map.
>
> Note that this is only for getting the size of the reservation. Later
> on, when we actually consume the contents of the bitmap, generic or
> non-x86 code will need to use the ordinary memremap() API to map this
> memory, and this amounts to a __va() call when the memory is already
> mapped. But I am not suggesting changing that part for this series.
> And even the hunk above can remain as you suggest - we can revisit it
> once other architectures gain support for this.
>
> The main thing I would like to avoid at this point in time is to add
> new fields to struct bootparams that loaders such as GRUB may start to
> populate as well - I don't think there is a very strong case for
> pseudo-EFI boot [where GRUB calls ExitBootServices()] on confidential
> VMs (as it prevents the EFI stub and the kernel from accessing the
> measurement and attestation APIs), but let's not create more struct
> bootparams based API if we can avoid it.

Below is the updated version of the fixup. I believe I have addressed all
your feedback.

I moved most of the unaccepted memory code into generic EFI code and the EFI
stub. I hope it looks fine.

early_memremap() for the reservation works fine, but when I tried to use
memremap(), as you suggested, to get the mapping of the table instead of
__va(), it failed. I didn't find the root cause. I guess I tried to use it
too early for memremap() to be functional. I made the arch provide an
arch-specific way to get the mapping, which is implemented as __va() on
x86.
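
Roughly, the split looks like the following userspace model (not the code
from the fixup). model_arch_get_table() is a hypothetical name for the arch
hook, MODEL_TABLE_PHYS is a made-up address, and the union is just backing
store for the example. The generic side never translates the physical
address itself; it only asks the hook for a usable pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace mirror of the table layout from include/linux/efi.h. */
struct efi_unaccepted_memory {
	uint32_t version;
	uint32_t unit_size;	/* bytes of address space covered by one bit */
	uint64_t size;		/* bitmap size in bytes */
	uint64_t bitmap[];
};

/* Backing store for the model: 16-byte header plus 8 bitmap words. */
static union {
	struct efi_unaccepted_memory table;
	uint64_t raw[2 + 8];
} model_mem;

#define MODEL_TABLE_PHYS 0x100000ULL	/* made-up physical address */

/* Hypothetical arch hook; on x86 the real one would amount to __va(phys). */
static struct efi_unaccepted_memory *model_arch_get_table(uint64_t phys)
{
	assert(phys == MODEL_TABLE_PHYS);
	return &model_mem.table;
}

/* Generic side: look up the bit that covers 'addr' via the arch hook. */
static bool model_addr_is_unaccepted(uint64_t table_phys, uint64_t addr)
{
	struct efi_unaccepted_memory *t = model_arch_get_table(table_phys);
	uint64_t bit = addr / t->unit_size;

	if (bit >= t->size * 8)		/* beyond the end of the bitmap */
		return false;
	return (t->bitmap[bit / 64] >> (bit % 64)) & 1;
}
```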

While moving code from the decompressor to the EFI stub, I removed a few
headers as, it *seems*, the EFI stub has a different policy on re-using
headers from the main kernel image.

Borislav, is it okay with you, or does the EFI stub also have to carry its
own copy of the headers?

If everything is fine, I will fold the fixup in properly and prepare v11 of
the patchset.

Documentation/arch/x86/zero-page.rst | 1 -
arch/x86/boot/bitops.h | 40 ----
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/boot/compressed/bitmap.c | 43 -----
arch/x86/boot/compressed/bitmap.h | 49 -----
arch/x86/boot/compressed/bits.h | 36 ----
arch/x86/boot/compressed/find.c | 54 ------
arch/x86/boot/compressed/find.h | 79 --------
arch/x86/boot/compressed/math.h | 37 ----
arch/x86/boot/compressed/mem.c | 81 +--------
arch/x86/boot/compressed/minmax.h | 61 -------
arch/x86/boot/compressed/misc.c | 2 +-
arch/x86/include/asm/page.h | 2 -
arch/x86/include/asm/unaccepted_memory.h | 24 ++-
arch/x86/include/uapi/asm/bootparam.h | 2 +-
arch/x86/kernel/e820.c | 17 --
arch/x86/mm/Makefile | 2 -
arch/x86/mm/unaccepted_memory.c | 101 -----------
drivers/firmware/efi/Makefile | 1 +
drivers/firmware/efi/efi.c | 25 +++
drivers/firmware/efi/libstub/Makefile | 2 +
drivers/firmware/efi/libstub/efistub.h | 6 +
drivers/firmware/efi/libstub/find.c | 43 +++++
drivers/firmware/efi/libstub/unaccepted_memory.c | 221 +++++++++++++++++++++++
drivers/firmware/efi/libstub/x86-stub.c | 62 +------
drivers/firmware/efi/unaccepted_memory.c | 138 ++++++++++++++
include/linux/efi.h | 12 ++
mm/internal.h | 9 +-
28 files changed, 480 insertions(+), 672 deletions(-)

diff --git a/Documentation/arch/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst
index f21905e61ade..45aa9cceb4f1 100644
--- a/Documentation/arch/x86/zero-page.rst
+++ b/Documentation/arch/x86/zero-page.rst
@@ -20,7 +20,6 @@ Offset/Size Proto Name Meaning
060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
(struct ist_info)
070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table
-078/008 ALL unaccepted_memory Bitmap of unaccepted memory (1bit == 2M)
080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
0A0/010 ALL sys_desc_table System description table (struct sys_desc_table),
diff --git a/arch/x86/boot/bitops.h b/arch/x86/boot/bitops.h
index 38badf028543..8518ae214c9b 100644
--- a/arch/x86/boot/bitops.h
+++ b/arch/x86/boot/bitops.h
@@ -41,44 +41,4 @@ static inline void set_bit(int nr, void *addr)
asm("btsl %1,%0" : "+m" (*(u32 *)addr) : "Ir" (nr));
}

-static __always_inline void __set_bit(long nr, volatile unsigned long *addr)
-{
- asm volatile(__ASM_SIZE(bts) " %1,%0" : : "m" (*(volatile long *) addr),
- "Ir" (nr) : "memory");
-}
-
-static __always_inline void __clear_bit(long nr, volatile unsigned long *addr)
-{
- asm volatile(__ASM_SIZE(btr) " %1,%0" : : "m" (*(volatile long *) addr),
- "Ir" (nr) : "memory");
-}
-
-/**
- * __ffs - find first set bit in word
- * @word: The word to search
- *
- * Undefined if no bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __ffs(unsigned long word)
-{
- asm("rep; bsf %1,%0"
- : "=r" (word)
- : "rm" (word));
- return word;
-}
-
-/**
- * ffz - find first zero bit in word
- * @word: The word to search
- *
- * Undefined if no zero exists, so code should check against ~0UL first.
- */
-static __always_inline unsigned long ffz(unsigned long word)
-{
- asm("rep; bsf %1,%0"
- : "=r" (word)
- : "r" (~word));
- return word;
-}
-
#endif /* BOOT_BITOPS_H */
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 71d9f71c13eb..09d57937640a 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -107,7 +107,7 @@ endif

vmlinux-objs-$(CONFIG_ACPI) += $(obj)/acpi.o
vmlinux-objs-$(CONFIG_INTEL_TDX_GUEST) += $(obj)/tdx.o $(obj)/tdx-shared.o $(obj)/tdcall.o
-vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/find.o $(obj)/mem.o
+vmlinux-objs-$(CONFIG_UNACCEPTED_MEMORY) += $(obj)/bitmap.o $(obj)/mem.o

vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
diff --git a/arch/x86/boot/compressed/bitmap.c b/arch/x86/boot/compressed/bitmap.c
deleted file mode 100644
index 789ecadeb521..000000000000
--- a/arch/x86/boot/compressed/bitmap.c
+++ /dev/null
@@ -1,43 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-
-#include "bitmap.h"
-
-void __bitmap_set(unsigned long *map, unsigned int start, int len)
-{
- unsigned long *p = map + BIT_WORD(start);
- const unsigned int size = start + len;
- int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
- unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
-
- while (len - bits_to_set >= 0) {
- *p |= mask_to_set;
- len -= bits_to_set;
- bits_to_set = BITS_PER_LONG;
- mask_to_set = ~0UL;
- p++;
- }
- if (len) {
- mask_to_set &= BITMAP_LAST_WORD_MASK(size);
- *p |= mask_to_set;
- }
-}
-
-void __bitmap_clear(unsigned long *map, unsigned int start, int len)
-{
- unsigned long *p = map + BIT_WORD(start);
- const unsigned int size = start + len;
- int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
- unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
-
- while (len - bits_to_clear >= 0) {
- *p &= ~mask_to_clear;
- len -= bits_to_clear;
- bits_to_clear = BITS_PER_LONG;
- mask_to_clear = ~0UL;
- p++;
- }
- if (len) {
- mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
- *p &= ~mask_to_clear;
- }
-}
diff --git a/arch/x86/boot/compressed/bitmap.h b/arch/x86/boot/compressed/bitmap.h
deleted file mode 100644
index 35357f5feda2..000000000000
--- a/arch/x86/boot/compressed/bitmap.h
+++ /dev/null
@@ -1,49 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-#ifndef BOOT_BITMAP_H
-#define BOOT_BITMAP_H
-#define __LINUX_BITMAP_H /* Inhibit inclusion of <linux/bitmap.h> */
-
-#include "../bitops.h"
-#include "../string.h"
-#include "align.h"
-
-#define BITMAP_MEM_ALIGNMENT 8
-#define BITMAP_MEM_MASK (BITMAP_MEM_ALIGNMENT - 1)
-
-#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
-#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
-
-#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
-
-void __bitmap_set(unsigned long *map, unsigned int start, int len);
-void __bitmap_clear(unsigned long *map, unsigned int start, int len);
-
-static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
- unsigned int nbits)
-{
- if (__builtin_constant_p(nbits) && nbits == 1)
- __set_bit(start, map);
- else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
- IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
- __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
- IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
- memset((char *)map + start / 8, 0xff, nbits / 8);
- else
- __bitmap_set(map, start, nbits);
-}
-
-static __always_inline void bitmap_clear(unsigned long *map, unsigned int start,
- unsigned int nbits)
-{
- if (__builtin_constant_p(nbits) && nbits == 1)
- __clear_bit(start, map);
- else if (__builtin_constant_p(start & BITMAP_MEM_MASK) &&
- IS_ALIGNED(start, BITMAP_MEM_ALIGNMENT) &&
- __builtin_constant_p(nbits & BITMAP_MEM_MASK) &&
- IS_ALIGNED(nbits, BITMAP_MEM_ALIGNMENT))
- memset((char *)map + start / 8, 0, nbits / 8);
- else
- __bitmap_clear(map, start, nbits);
-}
-
-#endif
diff --git a/arch/x86/boot/compressed/bits.h b/arch/x86/boot/compressed/bits.h
deleted file mode 100644
index b0ffa007ee19..000000000000
--- a/arch/x86/boot/compressed/bits.h
+++ /dev/null
@@ -1,36 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-#ifndef BOOT_BITS_H
-#define BOOT_BITS_H
-#define __LINUX_BITS_H /* Inhibit inclusion of <linux/bits.h> */
-
-#ifdef __ASSEMBLY__
-#define _AC(X,Y) X
-#define _AT(T,X) X
-#else
-#define __AC(X,Y) (X##Y)
-#define _AC(X,Y) __AC(X,Y)
-#define _AT(T,X) ((T)(X))
-#endif
-
-#define _UL(x) (_AC(x, UL))
-#define _ULL(x) (_AC(x, ULL))
-#define UL(x) (_UL(x))
-#define ULL(x) (_ULL(x))
-
-#define BIT(nr) (UL(1) << (nr))
-#define BIT_ULL(nr) (ULL(1) << (nr))
-#define BIT_MASK(nr) (UL(1) << ((nr) % BITS_PER_LONG))
-#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
-#define BIT_ULL_MASK(nr) (ULL(1) << ((nr) % BITS_PER_LONG_LONG))
-#define BIT_ULL_WORD(nr) ((nr) / BITS_PER_LONG_LONG)
-#define BITS_PER_BYTE 8
-
-#define GENMASK(h, l) \
- (((~UL(0)) - (UL(1) << (l)) + 1) & \
- (~UL(0) >> (BITS_PER_LONG - 1 - (h))))
-
-#define GENMASK_ULL(h, l) \
- (((~ULL(0)) - (ULL(1) << (l)) + 1) & \
- (~ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h))))
-
-#endif
diff --git a/arch/x86/boot/compressed/find.c b/arch/x86/boot/compressed/find.c
deleted file mode 100644
index b97a9e7c8085..000000000000
--- a/arch/x86/boot/compressed/find.c
+++ /dev/null
@@ -1,54 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-#include "bitmap.h"
-#include "find.h"
-#include "math.h"
-#include "minmax.h"
-
-static __always_inline unsigned long swab(const unsigned long y)
-{
-#if __BITS_PER_LONG == 64
- return __builtin_bswap32(y);
-#else /* __BITS_PER_LONG == 32 */
- return __builtin_bswap64(y);
-#endif
-}
-
-unsigned long _find_next_bit(const unsigned long *addr1,
- const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
-{
- unsigned long tmp, mask;
-
- if (start >= nbits)
- return nbits;
-
- tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
-
- /* Handle 1st word. */
- mask = BITMAP_FIRST_WORD_MASK(start);
- if (le)
- mask = swab(mask);
-
- tmp &= mask;
-
- start = round_down(start, BITS_PER_LONG);
-
- while (!tmp) {
- start += BITS_PER_LONG;
- if (start >= nbits)
- return nbits;
-
- tmp = addr1[start / BITS_PER_LONG];
- if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
- }
-
- if (le)
- tmp = swab(tmp);
-
- return min(start + __ffs(tmp), nbits);
-}
diff --git a/arch/x86/boot/compressed/find.h b/arch/x86/boot/compressed/find.h
deleted file mode 100644
index 903574b9d57a..000000000000
--- a/arch/x86/boot/compressed/find.h
+++ /dev/null
@@ -1,79 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-#ifndef BOOT_FIND_H
-#define BOOT_FIND_H
-#define __LINUX_FIND_H /* Inhibit inclusion of <linux/find.h> */
-
-#include "../bitops.h"
-#include "align.h"
-#include "bits.h"
-
-unsigned long _find_next_bit(const unsigned long *addr1,
- const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le);
-
-/**
- * find_next_bit - find the next set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The bitmap size in bits
- *
- * Returns the bit number for the next set bit
- * If no bits are set, returns @size.
- */
-static inline
-unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
- unsigned long offset)
-{
- if (small_const_nbits(size)) {
- unsigned long val;
-
- if (offset >= size)
- return size;
-
- val = *addr & GENMASK(size - 1, offset);
- return val ? __ffs(val) : size;
- }
-
- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
-}
-
-/**
- * find_next_zero_bit - find the next cleared bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The bitmap size in bits
- *
- * Returns the bit number of the next zero bit
- * If no bits are zero, returns @size.
- */
-static inline
-unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
- unsigned long offset)
-{
- if (small_const_nbits(size)) {
- unsigned long val;
-
- if (offset >= size)
- return size;
-
- val = *addr | ~GENMASK(size - 1, offset);
- return val == ~0UL ? size : ffz(val);
- }
-
- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
-}
-
-/**
- * for_each_set_bitrange_from - iterate over all set bit ranges [b; e)
- * @b: bit offset of start of current bitrange (first set bit); must be initialized
- * @e: bit offset of end of current bitrange (first unset bit)
- * @addr: bitmap address to base the search on
- * @size: bitmap size in number of bits
- */
-#define for_each_set_bitrange_from(b, e, addr, size) \
- for ((b) = find_next_bit((addr), (size), (b)), \
- (e) = find_next_zero_bit((addr), (size), (b) + 1); \
- (b) < (size); \
- (b) = find_next_bit((addr), (size), (e) + 1), \
- (e) = find_next_zero_bit((addr), (size), (b) + 1))
-#endif
diff --git a/arch/x86/boot/compressed/math.h b/arch/x86/boot/compressed/math.h
deleted file mode 100644
index f7eede84bbc2..000000000000
--- a/arch/x86/boot/compressed/math.h
+++ /dev/null
@@ -1,37 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-#ifndef BOOT_MATH_H
-#define BOOT_MATH_H
-#define __LINUX_MATH_H /* Inhibit inclusion of <linux/math.h> */
-
-/*
- *
- * This looks more complex than it should be. But we need to
- * get the type for the ~ right in round_down (it needs to be
- * as wide as the result!), and we want to evaluate the macro
- * arguments just once each.
- */
-#define __round_mask(x, y) ((__typeof__(x))((y)-1))
-
-/**
- * round_up - round up to next specified power of 2
- * @x: the value to round
- * @y: multiple to round up to (must be a power of 2)
- *
- * Rounds @x up to next multiple of @y (which must be a power of 2).
- * To perform arbitrary rounding up, use roundup() below.
- */
-#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)
-
-/**
- * round_down - round down to next specified power of 2
- * @x: the value to round
- * @y: multiple to round down to (must be a power of 2)
- *
- * Rounds @x down to next multiple of @y (which must be a power of 2).
- * To perform arbitrary rounding down, use rounddown() below.
- */
-#define round_down(x, y) ((x) & ~__round_mask(x, y))
-
-#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
-
-#endif
diff --git a/arch/x86/boot/compressed/mem.c b/arch/x86/boot/compressed/mem.c
index e6b92e822ddd..8138f4bd1959 100644
--- a/arch/x86/boot/compressed/mem.c
+++ b/arch/x86/boot/compressed/mem.c
@@ -1,19 +1,14 @@
// SPDX-License-Identifier: GPL-2.0-only

#include "../cpuflags.h"
-#include "bitmap.h"
+#include "../string.h"
#include "error.h"
-#include "find.h"
-#include "math.h"
-#include "tdx.h"
#include <asm/shared/tdx.h>

#define PMD_SHIFT 21
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE - 1))

-extern struct boot_params *boot_params;
-
/*
* accept_memory() and process_unaccepted_memory() called from EFI stub which
* runs before decompresser and its early_tdx_detect().
@@ -40,7 +35,7 @@ static bool early_is_tdx_guest(void)
return is_tdx;
}

-static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
+void arch_accept_memory(phys_addr_t start, phys_addr_t end)
{
/* Platform-specific memory-acceptance call goes here */
if (early_is_tdx_guest())
@@ -48,75 +43,3 @@ static inline void __accept_memory(phys_addr_t start, phys_addr_t end)
else
error("Cannot accept memory: unknown platform\n");
}
-
-/*
- * The accepted memory bitmap only works at PMD_SIZE granularity. Take
- * unaligned start/end addresses and either:
- * 1. Accepts the memory immediately and in its entirety
- * 2. Accepts unaligned parts, and marks *some* aligned part unaccepted
- *
- * The function will never reach the bitmap_set() with zero bits to set.
- */
-void process_unaccepted_memory(struct boot_params *params, u64 start, u64 end)
-{
- /*
- * Ensure that at least one bit will be set in the bitmap by
- * immediately accepting all regions under 2*PMD_SIZE. This is
- * imprecise and may immediately accept some areas that could
- * have been represented in the bitmap. But, results in simpler
- * code below
- *
- * Consider case like this:
- *
- * | 4k | 2044k | 2048k |
- * ^ 0x0 ^ 2MB ^ 4MB
- *
- * Only the first 4k has been accepted. The 0MB->2MB region can not be
- * represented in the bitmap. The 2MB->4MB region can be represented in
- * the bitmap. But, the 0MB->4MB region is <2*PMD_SIZE and will be
- * immediately accepted in its entirety.
- */
- if (end - start < 2 * PMD_SIZE) {
- __accept_memory(start, end);
- return;
- }
-
- /*
- * No matter how the start and end are aligned, at least one unaccepted
- * PMD_SIZE area will remain to be marked in the bitmap.
- */
-
- /* Immediately accept a <PMD_SIZE piece at the start: */
- if (start & ~PMD_MASK) {
- __accept_memory(start, round_up(start, PMD_SIZE));
- start = round_up(start, PMD_SIZE);
- }
-
- /* Immediately accept a <PMD_SIZE piece at the end: */
- if (end & ~PMD_MASK) {
- __accept_memory(round_down(end, PMD_SIZE), end);
- end = round_down(end, PMD_SIZE);
- }
-
- /*
- * 'start' and 'end' are now both PMD-aligned.
- * Record the range as being unaccepted:
- */
- bitmap_set((unsigned long *)params->unaccepted_memory,
- start / PMD_SIZE, (end - start) / PMD_SIZE);
-}
-
-void accept_memory(phys_addr_t start, phys_addr_t end)
-{
- unsigned long range_start, range_end;
- unsigned long *bitmap, bitmap_size;
-
- bitmap = (unsigned long *)boot_params->unaccepted_memory;
- range_start = start / PMD_SIZE;
- bitmap_size = DIV_ROUND_UP(end, PMD_SIZE);
-
- for_each_set_bitrange_from(range_start, range_end, bitmap, bitmap_size) {
- __accept_memory(range_start * PMD_SIZE, range_end * PMD_SIZE);
- bitmap_clear(bitmap, range_start, range_end - range_start);
- }
-}
diff --git a/arch/x86/boot/compressed/minmax.h b/arch/x86/boot/compressed/minmax.h
deleted file mode 100644
index 4efd05673260..000000000000
--- a/arch/x86/boot/compressed/minmax.h
+++ /dev/null
@@ -1,61 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-#ifndef BOOT_MINMAX_H
-#define BOOT_MINMAX_H
-#define __LINUX_MINMAX_H /* Inhibit inclusion of <linux/minmax.h> */
-
-/*
- * This returns a constant expression while determining if an argument is
- * a constant expression, most importantly without evaluating the argument.
- * Glory to Martin Uecker <[email protected]>
- */
-#define __is_constexpr(x) \
- (sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))
-
-/*
- * min()/max()/clamp() macros must accomplish three things:
- *
- * - avoid multiple evaluations of the arguments (so side-effects like
- * "x++" happen only once) when non-constant.
- * - perform strict type-checking (to generate warnings instead of
- * nasty runtime surprises). See the "unnecessary" pointer comparison
- * in __typecheck().
- * - retain result as a constant expressions when called with only
- * constant expressions (to avoid tripping VLA warnings in stack
- * allocation usage).
- */
-#define __typecheck(x, y) \
- (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
-
-#define __no_side_effects(x, y) \
- (__is_constexpr(x) && __is_constexpr(y))
-
-#define __safe_cmp(x, y) \
- (__typecheck(x, y) && __no_side_effects(x, y))
-
-#define __cmp(x, y, op) ((x) op (y) ? (x) : (y))
-
-#define __cmp_once(x, y, unique_x, unique_y, op) ({ \
- typeof(x) unique_x = (x); \
- typeof(y) unique_y = (y); \
- __cmp(unique_x, unique_y, op); })
-
-#define __careful_cmp(x, y, op) \
- __builtin_choose_expr(__safe_cmp(x, y), \
- __cmp(x, y, op), \
- __cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
-
-/**
- * min - return minimum of two values of the same or compatible types
- * @x: first value
- * @y: second value
- */
-#define min(x, y) __careful_cmp(x, y, <)
-
-/**
- * max - return maximum of two values of the same or compatible types
- * @x: first value
- * @y: second value
- */
-#define max(x, y) __careful_cmp(x, y, >)
-
-#endif
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 186bfd53e042..eb8df0d4ad51 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -456,7 +456,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,

debug_putstr("\nDecompressing Linux... ");

- if (boot_params->unaccepted_memory) {
+ if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) {
debug_putstr("Accepting memory... ");
accept_memory(__pa(output), __pa(output) + needed_size);
}
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 4bab2bb2c9c0..92f27d67408f 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -20,8 +20,6 @@ struct page;

#include <linux/range.h>

-#include <asm/unaccepted_memory.h>
-
extern struct range pfn_mapped[];
extern int nr_pfn_mapped;

diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h
index 89fc91c61560..32aff182fd67 100644
--- a/arch/x86/include/asm/unaccepted_memory.h
+++ b/arch/x86/include/asm/unaccepted_memory.h
@@ -3,14 +3,24 @@
#ifndef _ASM_X86_UNACCEPTED_MEMORY_H
#define _ASM_X86_UNACCEPTED_MEMORY_H

-struct boot_params;
+#include <linux/efi.h>
+#include <asm/tdx.h>

-void process_unaccepted_memory(struct boot_params *params, u64 start, u64 num);
+static inline void arch_accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ /* Platform-specific memory-acceptance call goes here */
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+ tdx_accept_memory(start, end);
+ } else {
+ panic("Cannot accept memory: unknown platform\n");
+ }
+}

-#ifdef CONFIG_UNACCEPTED_MEMORY
-
-void accept_memory(phys_addr_t start, phys_addr_t end);
-bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
+static inline struct efi_unaccepted_memory *efi_get_unaccepted_table(void)
+{
+ if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
+ return NULL;

-#endif
+ return __va(efi.unaccepted);
+}
#endif
diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h
index 630a54046af0..01d19fc22346 100644
--- a/arch/x86/include/uapi/asm/bootparam.h
+++ b/arch/x86/include/uapi/asm/bootparam.h
@@ -189,7 +189,7 @@ struct boot_params {
__u64 tboot_addr; /* 0x058 */
struct ist_info ist_info; /* 0x060 */
__u64 acpi_rsdp_addr; /* 0x070 */
- __u64 unaccepted_memory; /* 0x078 */
+ __u8 _pad3[8]; /* 0x078 */
__u8 hd0_info[16]; /* obsolete! */ /* 0x080 */
__u8 hd1_info[16]; /* obsolete! */ /* 0x090 */
struct sys_desc_table sys_desc_table; /* obsolete! */ /* 0x0a0 */
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 483c36a28d2e..fb8cf953380d 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1316,23 +1316,6 @@ void __init e820__memblock_setup(void)
int i;
u64 end;

- /*
- * Mark unaccepted memory bitmap reserved.
- *
- * This kind of reservation usually done from early_reserve_memory(),
- * but early_reserve_memory() called before e820__memory_setup(), so
- * e820_table is not finalized and e820__end_of_ram_pfn() cannot be
- * used to get correct RAM size.
- */
- if (boot_params.unaccepted_memory) {
- unsigned long size;
-
- /* One bit per 2MB */
- size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
- PMD_SIZE * BITS_PER_BYTE);
- memblock_reserve(boot_params.unaccepted_memory, size);
- }
-
/*
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index b0ef1755e5c8..c80febc44cd2 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -67,5 +67,3 @@ obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_amd.o

obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_identity.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt_boot.o
-
-obj-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o
diff --git a/arch/x86/mm/unaccepted_memory.c b/arch/x86/mm/unaccepted_memory.c
deleted file mode 100644
index f61174d4c3cb..000000000000
--- a/arch/x86/mm/unaccepted_memory.c
+++ /dev/null
@@ -1,101 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-#include <linux/memblock.h>
-#include <linux/mm.h>
-#include <linux/pfn.h>
-#include <linux/spinlock.h>
-
-#include <asm/io.h>
-#include <asm/setup.h>
-#include <asm/shared/tdx.h>
-#include <asm/unaccepted_memory.h>
-
-/* Protects unaccepted memory bitmap */
-static DEFINE_SPINLOCK(unaccepted_memory_lock);
-
-void accept_memory(phys_addr_t start, phys_addr_t end)
-{
- unsigned long range_start, range_end;
- unsigned long *bitmap;
- unsigned long flags;
-
- if (!boot_params.unaccepted_memory)
- return;
-
- bitmap = __va(boot_params.unaccepted_memory);
- range_start = start / PMD_SIZE;
-
- /*
- * load_unaligned_zeropad() can lead to unwanted loads across page
- * boundaries. The unwanted loads are typically harmless. But, they
- * might be made to totally unrelated or even unmapped memory.
- * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
- * #VE) to recover from these unwanted loads.
- *
- * But, this approach does not work for unaccepted memory. For TDX, a
- * load from unaccepted memory will not lead to a recoverable exception
- * within the guest. The guest will exit to the VMM where the only
- * recourse is to terminate the guest.
- *
- * There are two parts to fix this issue and comprehensively avoid
- * access to unaccepted memory. Together these ensure that an extra
- * "guard" page is accepted in addition to the memory that needs to be
- * used:
- *
- * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
- * checks up to end+2M if 'end' is aligned on a 2M boundary.
- *
- * 2. Implicitly extend accept_memory(start, end) to end+2M if 'end' is
- * aligned on a 2M boundary. (immediately following this comment)
- */
- if (!(end % PMD_SIZE))
- end += PMD_SIZE;
-
- spin_lock_irqsave(&unaccepted_memory_lock, flags);
- for_each_set_bitrange_from(range_start, range_end, bitmap,
- DIV_ROUND_UP(end, PMD_SIZE)) {
- unsigned long len = range_end - range_start;
-
- /* Platform-specific memory-acceptance call goes here */
- if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
- tdx_accept_memory(range_start * PMD_SIZE,
- range_end * PMD_SIZE);
- } else {
- panic("Cannot accept memory: unknown platform\n");
- }
-
- bitmap_clear(bitmap, range_start, len);
- }
- spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
-}
-
-bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
-{
- unsigned long *bitmap;
- unsigned long flags;
- bool ret = false;
-
- if (!boot_params.unaccepted_memory)
- return 0;
-
- bitmap = __va(boot_params.unaccepted_memory);
-
- /*
- * Also consider the unaccepted state of the *next* page. See fix #1 in
- * the comment on load_unaligned_zeropad() in accept_memory().
- */
- if (!(end % PMD_SIZE))
- end += PMD_SIZE;
-
- spin_lock_irqsave(&unaccepted_memory_lock, flags);
- while (start < end) {
- if (test_bit(start / PMD_SIZE, bitmap)) {
- ret = true;
- break;
- }
-
- start += PMD_SIZE;
- }
- spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
-
- return ret;
-}
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index b51f2a4c821e..e489fefd23da 100644
--- a/drivers/firmware/efi/Makefile
+++ b/drivers/firmware/efi/Makefile
@@ -41,3 +41,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER) += capsule-loader.o
obj-$(CONFIG_EFI_EARLYCON) += earlycon.o
obj-$(CONFIG_UEFI_CPER_ARM) += cper-arm.o
obj-$(CONFIG_UEFI_CPER_X86) += cper-x86.o
+obj-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 7dce06e419c5..bddb5aeb0d12 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
#ifdef CONFIG_EFI_COCO_SECRET
.coco_secret = EFI_INVALID_TABLE_ADDR,
#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ .unaccepted = EFI_INVALID_TABLE_ADDR,
+#endif
};
EXPORT_SYMBOL(efi);

@@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
#ifdef CONFIG_EFI_COCO_SECRET
{LINUX_EFI_COCO_SECRET_AREA_GUID, &efi.coco_secret, "CocoSecret" },
#endif
+#ifdef CONFIG_UNACCEPTED_MEMORY
+ {LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID, &efi.unaccepted, "Unaccepted" },
+#endif
#ifdef CONFIG_EFI_GENERIC_STUB
{LINUX_EFI_SCREEN_INFO_TABLE_GUID, &screen_info_table },
#endif
@@ -759,6 +765,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
}
}

+ if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
+ efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
+ struct efi_unaccepted_memory *unaccepted;
+
+ unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
+ if (unaccepted) {
+ unsigned long size;
+
+ if (unaccepted->version == 0) {
+ size = sizeof(*unaccepted) + unaccepted->size;
+ memblock_reserve(efi.unaccepted, size);
+ } else {
+ efi.unaccepted = EFI_INVALID_TABLE_ADDR;
+ }
+
+ early_memunmap(unaccepted, sizeof(*unaccepted));
+ }
+ }
+
return 0;
}

diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
index 3abb2b357482..a09edfbd7cfc 100644
--- a/drivers/firmware/efi/libstub/Makefile
+++ b/drivers/firmware/efi/libstub/Makefile
@@ -96,6 +96,8 @@ CFLAGS_arm32-stub.o := -DTEXT_OFFSET=$(TEXT_OFFSET)
zboot-obj-$(CONFIG_RISCV) := lib-clz_ctz.o lib-ashldi3.o
lib-$(CONFIG_EFI_ZBOOT) += zboot.o $(zboot-obj-y)

+lib-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o find.o
+
extra-y := $(lib-y)
lib-y := $(patsubst %.o,%.stub.o,$(lib-y))

diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index 67d5a20802e0..8659a01664b8 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -1133,4 +1133,10 @@ const u8 *__efi_get_smbios_string(const struct efi_smbios_record *record,
void efi_remap_image(unsigned long image_base, unsigned alloc_size,
unsigned long code_size);

+efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
+ struct efi_boot_memmap *map);
+void process_unaccepted_memory(u64 start, u64 end);
+void accept_memory(phys_addr_t start, phys_addr_t end);
+void arch_accept_memory(phys_addr_t start, phys_addr_t end);
+
#endif
diff --git a/drivers/firmware/efi/libstub/find.c b/drivers/firmware/efi/libstub/find.c
new file mode 100644
index 000000000000..4e7740d28987
--- /dev/null
+++ b/drivers/firmware/efi/libstub/find.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/bitmap.h>
+#include <linux/math.h>
+#include <linux/minmax.h>
+
+/*
+ * Common helper for find_next_bit() function family
+ * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
+ * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
+ * @size: The bitmap size in bits
+ * @start: The bitnumber to start searching at
+ */
+#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) \
+({ \
+ unsigned long mask, idx, tmp, sz = (size), __start = (start); \
+ \
+ if (unlikely(__start >= sz)) \
+ goto out; \
+ \
+ mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start)); \
+ idx = __start / BITS_PER_LONG; \
+ \
+ for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) { \
+ if ((idx + 1) * BITS_PER_LONG >= sz) \
+ goto out; \
+ idx++; \
+ } \
+ \
+ sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz); \
+out: \
+ sz; \
+})
+
+unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
+{
+ return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
+}
+
+unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
+ unsigned long start)
+{
+ return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
+}
diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
new file mode 100644
index 000000000000..6c19d8fa563e
--- /dev/null
+++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <asm/efi.h>
+#include "efistub.h"
+
+static struct efi_unaccepted_memory *unaccepted_table;
+
+efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
+ struct efi_boot_memmap *map)
+{
+ efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
+ u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
+ efi_status_t status;
+ int i;
+
+ /* Check if the table is already installed */
+ unaccepted_table = get_efi_config_table(unaccepted_table_guid);
+ if (unaccepted_table) {
+ if (unaccepted_table->version != 0) {
+ efi_err("Unknown version of unaccepted memory table\n");
+ return EFI_UNSUPPORTED;
+ }
+ return EFI_SUCCESS;
+ }
+
+ /* Check if there's any unaccepted memory and find the max address */
+ for (i = 0; i < nr_desc; i++) {
+ efi_memory_desc_t *d;
+ unsigned long m = (unsigned long)map->map;
+
+ d = efi_early_memdesc_ptr(m, map->desc_size, i);
+ if (d->type != EFI_UNACCEPTED_MEMORY)
+ continue;
+
+ unaccepted_start = min(unaccepted_start, d->phys_addr);
+ unaccepted_end = max(unaccepted_end,
+ d->phys_addr + d->num_pages * PAGE_SIZE);
+ }
+
+ if (unaccepted_start == ULLONG_MAX)
+ return EFI_SUCCESS;
+
+ unaccepted_start = round_down(unaccepted_start, PMD_SIZE);
+ unaccepted_end = round_up(unaccepted_end, PMD_SIZE);
+
+ /*
+ * If unaccepted memory is present, allocate a bitmap to track what
+ * memory has to be accepted before access.
+ *
+ * One bit in the bitmap represents 2MiB in the address space:
+ * A 4k bitmap can track 64GiB of physical address space.
+ *
+ * In the worst case scenario -- a huge hole in the middle of the
+ * address space -- It needs 256MiB to handle 4PiB of the address
+ * space.
+ *
+ * The bitmap will be populated in setup_e820() according to the memory
+ * map after efi_exit_boot_services().
+ */
+ bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
+ PMD_SIZE * BITS_PER_BYTE);
+
+ status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
+ sizeof(*unaccepted_table) + bitmap_size,
+ (void **)&unaccepted_table);
+ if (status != EFI_SUCCESS) {
+ efi_err("Failed to allocate unaccepted memory config table\n");
+ return status;
+ }
+
+ unaccepted_table->version = 0;
+ unaccepted_table->unit_size = PMD_SIZE;
+ unaccepted_table->phys_base = unaccepted_start;
+ unaccepted_table->size = bitmap_size;
+ memset(unaccepted_table->bitmap, 0, bitmap_size);
+
+ status = efi_bs_call(install_configuration_table,
+ &unaccepted_table_guid, unaccepted_table);
+ if (status != EFI_SUCCESS) {
+ efi_bs_call(free_pool, unaccepted_table);
+ efi_err("Failed to install unaccepted memory config table!\n");
+ }
+
+ return status;
+}
+
+/*
+ * The unaccepted memory bitmap only works at unit_size granularity. Take
+ * unaligned start/end addresses and either:
+ * 1. Accept the memory immediately and in its entirety
+ * 2. Accept unaligned parts, and mark *some* aligned part unaccepted
+ *
+ * The function will never reach the bitmap_set() with zero bits to set.
+ */
+void process_unaccepted_memory(u64 start, u64 end)
+{
+ u64 unit_size = unaccepted_table->unit_size;
+ u64 unit_mask = unaccepted_table->unit_size - 1;
+ u64 bitmap_size = unaccepted_table->size;
+
+ /*
+ * Ensure that at least one bit will be set in the bitmap by
+ * immediately accepting all regions under 2*unit_size. This is
+ * imprecise and may immediately accept some areas that could
+ * have been represented in the bitmap. But it results in simpler
+ * code below.
+ *
+ * Consider a case like this (assuming unit_size == 2MB):
+ *
+ * | 4k | 2044k | 2048k |
+ * ^ 0x0 ^ 2MB ^ 4MB
+ *
+ * Only the first 4k has been accepted. The 0MB->2MB region can not be
+ * represented in the bitmap. The 2MB->4MB region can be represented in
+ * the bitmap. But, the 0MB->4MB region is <2*unit_size and will be
+ * immediately accepted in its entirety.
+ */
+ if (end - start < 2 * unit_size) {
+ arch_accept_memory(start, end);
+ return;
+ }
+
+ /*
+ * No matter how the start and end are aligned, at least one unaccepted
+ * unit_size area will remain to be marked in the bitmap.
+ */
+
+ /* Immediately accept a <unit_size piece at the start: */
+ if (start & unit_mask) {
+ arch_accept_memory(start, round_up(start, unit_size));
+ start = round_up(start, unit_size);
+ }
+
+ /* Immediately accept a <unit_size piece at the end: */
+ if (end & unit_mask) {
+ arch_accept_memory(round_down(end, unit_size), end);
+ end = round_down(end, unit_size);
+ }
+
+ /*
+ * Accept the part of the range that lies before phys_base and cannot
+ * be recorded in the bitmap.
+ */
+ if (start < unaccepted_table->phys_base) {
+ arch_accept_memory(start,
+ min(unaccepted_table->phys_base, end));
+ start = unaccepted_table->phys_base;
+ }
+
+ /* Nothing to record */
+ if (end < unaccepted_table->phys_base)
+ return;
+
+ /* Translate to offsets from the beginning of the bitmap */
+ start -= unaccepted_table->phys_base;
+ end -= unaccepted_table->phys_base;
+
+ /* Accept memory that doesn't fit into bitmap */
+ if (end > bitmap_size * unit_size * BITS_PER_BYTE) {
+ unsigned long phys_start, phys_end;
+
+ phys_start = bitmap_size * unit_size * BITS_PER_BYTE +
+ unaccepted_table->phys_base;
+ phys_end = end + unaccepted_table->phys_base;
+
+ arch_accept_memory(phys_start, phys_end);
+ end = bitmap_size * unit_size * BITS_PER_BYTE;
+ }
+
+ /*
+ * 'start' and 'end' are now both unit_size-aligned.
+ * Record the range as being unaccepted:
+ */
+ bitmap_set(unaccepted_table->bitmap,
+ start / unit_size, (end - start) / unit_size);
+}
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long range_start, range_end;
+ unsigned long bitmap_size;
+ u64 unit_size;
+
+ if (!unaccepted_table)
+ return;
+
+ unit_size = unaccepted_table->unit_size;
+
+ /*
+ * Only care for the part of the range that is represented
+ * in the bitmap.
+ */
+ if (start < unaccepted_table->phys_base)
+ start = unaccepted_table->phys_base;
+ if (end < unaccepted_table->phys_base)
+ return;
+
+ /* Translate to offsets from the beginning of the bitmap */
+ start -= unaccepted_table->phys_base;
+ end -= unaccepted_table->phys_base;
+
+ /* Make sure not to overrun the bitmap */
+ if (end > unaccepted_table->size * unit_size * BITS_PER_BYTE)
+ end = unaccepted_table->size * unit_size * BITS_PER_BYTE;
+
+ range_start = start / unit_size;
+ bitmap_size = DIV_ROUND_UP(end, unit_size);
+
+ for_each_set_bitrange_from(range_start, range_end,
+ unaccepted_table->bitmap, bitmap_size) {
+ unsigned long phys_start, phys_end;
+
+ phys_start = range_start * unit_size + unaccepted_table->phys_base;
+ phys_end = range_end * unit_size + unaccepted_table->phys_base;
+
+ arch_accept_memory(phys_start, phys_end);
+ bitmap_clear(unaccepted_table->bitmap,
+ range_start, range_end - range_start);
+ }
+}
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 1afe7b5b02e1..16ea5e76907f 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -621,7 +621,7 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
continue;
}
e820_type = E820_TYPE_RAM;
- process_unaccepted_memory(params, d->phys_addr,
+ process_unaccepted_memory(d->phys_addr,
d->phys_addr + PAGE_SIZE * d->num_pages);
break;
default:
@@ -688,64 +688,6 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
return status;
}

-static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
- __u32 nr_desc,
- struct efi_boot_memmap *map)
-{
- unsigned long *mem = NULL;
- u64 size, max_addr = 0;
- efi_status_t status;
- bool found = false;
- int i;
-
- /* Check if there's any unaccepted memory and find the max address */
- for (i = 0; i < nr_desc; i++) {
- efi_memory_desc_t *d;
- unsigned long m = (unsigned long)map->map;
-
- d = efi_early_memdesc_ptr(m, map->desc_size, i);
- if (d->type == EFI_UNACCEPTED_MEMORY)
- found = true;
- if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
- max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
- }
-
- if (!found) {
- params->unaccepted_memory = 0;
- return EFI_SUCCESS;
- }
-
- /*
- * range_contains_unaccepted_memory() may need to check one 2M chunk
- * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
- * sure that the bitmap is large enough handle it.
- */
- max_addr += PMD_SIZE;
-
- /*
- * If unaccepted memory is present, allocate a bitmap to track what
- * memory has to be accepted before access.
- *
- * One bit in the bitmap represents 2MiB in the address space:
- * A 4k bitmap can track 64GiB of physical address space.
- *
- * In the worst case scenario -- a huge hole in the middle of the
- * address space -- It needs 256MiB to handle 4PiB of the address
- * space.
- *
- * The bitmap will be populated in setup_e820() according to the memory
- * map after efi_exit_boot_services().
- */
- size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
- status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
- if (status == EFI_SUCCESS) {
- memset(mem, 0, size);
- params->unaccepted_memory = (unsigned long)mem;
- }
-
- return status;
-}
-
static efi_status_t allocate_e820(struct boot_params *params,
struct setup_data **e820ext,
u32 *e820ext_size)
@@ -767,7 +709,7 @@ static efi_status_t allocate_e820(struct boot_params *params,
}

if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
- status = allocate_unaccepted_bitmap(params, nr_desc, map);
+ status = allocate_unaccepted_bitmap(nr_desc, map);

efi_bs_call(free_pool, map);
return status;
diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
new file mode 100644
index 000000000000..3d1ca60916dd
--- /dev/null
+++ b/drivers/firmware/efi/unaccepted_memory.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/efi.h>
+#include <linux/memblock.h>
+#include <linux/spinlock.h>
+#include <asm/unaccepted_memory.h>
+
+/* Protects unaccepted memory bitmap */
+static DEFINE_SPINLOCK(unaccepted_memory_lock);
+
+void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+ struct efi_unaccepted_memory *unaccepted;
+ unsigned long range_start, range_end;
+ unsigned long flags;
+ u64 unit_size;
+
+ if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
+ return;
+
+ unaccepted = efi_get_unaccepted_table();
+ if (!unaccepted)
+ return;
+
+ unit_size = unaccepted->unit_size;
+
+ /*
+ * Only care for the part of the range that is represented
+ * in the bitmap.
+ */
+ if (start < unaccepted->phys_base)
+ start = unaccepted->phys_base;
+ if (end < unaccepted->phys_base)
+ return;
+
+ /* Translate to offsets from the beginning of the bitmap */
+ start -= unaccepted->phys_base;
+ end -= unaccepted->phys_base;
+
+ /*
+ * load_unaligned_zeropad() can lead to unwanted loads across page
+ * boundaries. The unwanted loads are typically harmless. But, they
+ * might be made to totally unrelated or even unmapped memory.
+ * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
+ * #VE) to recover from these unwanted loads.
+ *
+ * But, this approach does not work for unaccepted memory. For TDX, a
+ * load from unaccepted memory will not lead to a recoverable exception
+ * within the guest. The guest will exit to the VMM where the only
+ * recourse is to terminate the guest.
+ *
+ * There are two parts to fix this issue and comprehensively avoid
+ * access to unaccepted memory. Together these ensure that an extra
+ * "guard" page is accepted in addition to the memory that needs to be
+ * used:
+ *
+ * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
+ * checks up to end+unit_size if 'end' is aligned on a unit_size
+ * boundary.
+ *
+ * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
+ * 'end' is aligned on a unit_size boundary. (immediately following
+ * this comment)
+ */
+ if (!(end % unit_size))
+ end += unit_size;
+
+ /* Make sure not to overrun the bitmap */
+ if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+ end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+ range_start = start / unit_size;
+
+ spin_lock_irqsave(&unaccepted_memory_lock, flags);
+ for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
+ DIV_ROUND_UP(end, unit_size)) {
+ unsigned long phys_start, phys_end;
+ unsigned long len = range_end - range_start;
+
+ phys_start = range_start * unit_size + unaccepted->phys_base;
+ phys_end = range_end * unit_size + unaccepted->phys_base;
+
+ arch_accept_memory(phys_start, phys_end);
+ bitmap_clear(unaccepted->bitmap, range_start, len);
+ }
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+}
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
+{
+ struct efi_unaccepted_memory *unaccepted;
+ unsigned long flags;
+ bool ret = false;
+ u64 unit_size;
+
+ unaccepted = efi_get_unaccepted_table();
+ if (!unaccepted)
+ return false;
+
+ unit_size = unaccepted->unit_size;
+
+ /*
+ * Only care for the part of the range that is represented
+ * in the bitmap.
+ */
+ if (start < unaccepted->phys_base)
+ start = unaccepted->phys_base;
+ if (end < unaccepted->phys_base)
+ return false;
+
+ /* Translate to offsets from the beginning of the bitmap */
+ start -= unaccepted->phys_base;
+ end -= unaccepted->phys_base;
+
+ /*
+ * Also consider the unaccepted state of the *next* page. See fix #1 in
+ * the comment on load_unaligned_zeropad() in accept_memory().
+ */
+ if (!(end % unit_size))
+ end += unit_size;
+
+ /* Make sure not to overrun the bitmap */
+ if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
+ end = unaccepted->size * unit_size * BITS_PER_BYTE;
+
+ spin_lock_irqsave(&unaccepted_memory_lock, flags);
+ while (start < end) {
+ if (test_bit(start / unit_size, unaccepted->bitmap)) {
+ ret = true;
+ break;
+ }
+
+ start += unit_size;
+ }
+ spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
+
+ return ret;
+}
diff --git a/include/linux/efi.h b/include/linux/efi.h
index efbe14641638..0f4620060ed8 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -418,6 +418,7 @@ void efi_native_runtime_setup(void);
#define LINUX_EFI_MOK_VARIABLE_TABLE_GUID EFI_GUID(0xc451ed2b, 0x9694, 0x45d3, 0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89)
#define LINUX_EFI_COCO_SECRET_AREA_GUID EFI_GUID(0xadf956ad, 0xe98c, 0x484c, 0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47)
#define LINUX_EFI_BOOT_MEMMAP_GUID EFI_GUID(0x800f683f, 0xd08b, 0x423a, 0xa2, 0x93, 0x96, 0x5c, 0x3c, 0x6f, 0xe2, 0xb4)
+#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9, 0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)

#define RISCV_EFI_BOOT_PROTOCOL_GUID EFI_GUID(0xccd15fec, 0x6f73, 0x4eec, 0x83, 0x95, 0x3e, 0x69, 0xe4, 0xb9, 0x40, 0xbf)

@@ -535,6 +536,16 @@ struct efi_boot_memmap {
efi_memory_desc_t map[];
};

+struct efi_unaccepted_memory {
+ u32 version;
+ u32 unit_size;
+ u64 phys_base;
+ u64 size;
+ unsigned long bitmap[];
+};
+
+void __init efi_unaccepted_table_init(void);
+
/*
* Architecture independent structure for describing a memory map for the
* benefit of efi_memmap_init_early(), and for passing context between
@@ -637,6 +648,7 @@ extern struct efi {
unsigned long tpm_final_log; /* TPM2 Final Events Log table */
unsigned long mokvar_table; /* MOK variable config table */
unsigned long coco_secret; /* Confidential computing secret table */
+ unsigned long unaccepted; /* Unaccepted memory table */

efi_get_time_t *get_time;
efi_set_time_t *set_time;
diff --git a/mm/internal.h b/mm/internal.h
index ed042e366d49..2e70f22d1b3f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1100,7 +1100,13 @@ struct vma_prepare {
struct vm_area_struct *remove2;
};

-#ifndef CONFIG_UNACCEPTED_MEMORY
+#ifdef CONFIG_UNACCEPTED_MEMORY
+
+bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
+void accept_memory(phys_addr_t start, phys_addr_t end);
+
+#else
+
static inline bool range_contains_unaccepted_memory(phys_addr_t start,
phys_addr_t end)
{
@@ -1110,6 +1116,7 @@ static inline bool range_contains_unaccepted_memory(phys_addr_t start,
static inline void accept_memory(phys_addr_t start, phys_addr_t end)
{
}
+
#endif

#endif /* __MM_INTERNAL_H */
--
Kiryl Shutsemau / Kirill A. Shutemov
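
The range handling in process_unaccepted_memory() above can be modelled in
userspace. The following is a minimal sketch, not the patch's code: struct
model, accept_now(), bitmap_bytes() and process_range() are hypothetical
names, the bitmap is a single 64-bit word, and acceptance is only counted in
bytes rather than performed. It assumes every range begins inside the area
the bitmap covers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants, not taken from the patch */
#define UNIT_SIZE   (2ULL << 20)   /* 2MiB per bitmap bit, like unit_size */
#define BITMAP_BITS 64ULL          /* toy bitmap: one 64-bit word */

/* Hypothetical model state; the real table is struct efi_unaccepted_memory */
struct model {
	uint64_t phys_base;     /* physical address represented by bit 0 */
	uint64_t bitmap;        /* set bit == unit is still unaccepted */
	uint64_t accepted_now;  /* bytes accepted immediately */
};

/* Stand-in for arch_accept_memory(): only counts the bytes */
static void accept_now(struct model *m, uint64_t start, uint64_t end)
{
	m->accepted_now += end - start;
}

/* Bitmap bytes needed to cover 'span' bytes, one bit per UNIT_SIZE */
static uint64_t bitmap_bytes(uint64_t span)
{
	return (span + UNIT_SIZE * 8 - 1) / (UNIT_SIZE * 8);
}

static void process_range(struct model *m, uint64_t start, uint64_t end)
{
	const uint64_t limit = BITMAP_BITS * UNIT_SIZE;
	uint64_t i;

	/* Ranges shorter than two units are accepted whole */
	if (end - start < 2 * UNIT_SIZE) {
		accept_now(m, start, end);
		return;
	}

	/* Accept the unaligned piece at the start */
	if (start % UNIT_SIZE) {
		uint64_t aligned = (start / UNIT_SIZE + 1) * UNIT_SIZE;

		accept_now(m, start, aligned);
		start = aligned;
	}

	/* Accept the unaligned piece at the end */
	if (end % UNIT_SIZE) {
		uint64_t aligned = end / UNIT_SIZE * UNIT_SIZE;

		accept_now(m, aligned, end);
		end = aligned;
	}

	/* Anything below phys_base cannot be recorded: accept it now */
	if (start < m->phys_base) {
		uint64_t stop = end < m->phys_base ? end : m->phys_base;

		accept_now(m, start, stop);
		start = m->phys_base;
	}

	/* The patch tests '<'; '<=' also skips the empty case */
	if (end <= m->phys_base)
		return;

	/* Translate to offsets from the beginning of the bitmap */
	start -= m->phys_base;
	end -= m->phys_base;

	/* Model assumption: the range begins inside the bitmap */
	assert(start <= limit);

	/* Accept whatever does not fit into the bitmap */
	if (end > limit) {
		accept_now(m, limit + m->phys_base, end + m->phys_base);
		end = limit;
	}

	/* Record the remaining aligned units as unaccepted */
	for (i = start / UNIT_SIZE; i < end / UNIT_SIZE; i++)
		m->bitmap |= 1ULL << i;
}
```

With phys_base at 0, a range from 4k to 8MiB+4k accepts 2MiB immediately
(the unaligned head and tail) and leaves bits 1 through 3 set. The same
one-bit-per-2MiB arithmetic gives the sizes quoted in the bitmap comment:
4096 bytes for 64GiB and 256MiB for 4PiB.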

2023-05-12 07:54:26

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

On Fri, 12 May 2023 at 03:59, Kirill A. Shutemov <[email protected]> wrote:
>
> On Tue, May 09, 2023 at 08:47:38AM +0200, Ard Biesheuvel wrote:
> > On Tue, 9 May 2023 at 02:56, Kirill A. Shutemov <[email protected]> wrote:
> > >
> > > On Tue, May 09, 2023 at 12:11:41AM +0200, Ard Biesheuvel wrote:
> > > > > @@ -1324,13 +1325,15 @@ void __init e820__memblock_setup(void)
> > > > > * e820_table is not finalized and e820__end_of_ram_pfn() cannot be
> > > > > * used to get correct RAM size.
> > > > > */
> > > > > - if (boot_params.unaccepted_memory) {
> > > > > + if (efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> > > > > + struct efi_unaccepted_memory *unaccepted;
> > > > > unsigned long size;
> > > > >
> > > > > - /* One bit per 2MB */
> > > > > - size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > > > > - PMD_SIZE * BITS_PER_BYTE);
> > > > > - memblock_reserve(boot_params.unaccepted_memory, size);
> > > > > + unaccepted = __va(efi.unaccepted);
> > > > > +
> > > > > + size = sizeof(struct efi_unaccepted_memory);
> > > > > + size += unaccepted->size;
> > > > > + memblock_reserve(efi.unaccepted, size);
> > > > > }
> > > > >
> > > >
> > > > This could be moved to generic code (but we'll need to use early_memremap())
> > >
> > > I don't understand why early_memremap() is needed. EFI_LOADER_DATA is already
> > > mapped into the direct mapping. We only need to reserve the memory so it
> > > cannot be reallocated for other things. Hm?
> > >
> >
> > *If* we move this to generic code, we have to ensure that we don't
> > rely on x86 specific semantics. When parsing the EFI configuration
> > tables, other architectures don't have a complete direct map yet, as
> > they receive the memory description from EFI not from a translated
> > E820 map.
> >
> > Note that this is only for getting the size of the reservation. Later
> > on, when we actually consume the contents of the bitmap, generic or
> > non-x86 code will need to use the ordinary memremap() API to map this
> > memory, and this amounts to a __va() call when the memory is already
> > mapped. But I am not suggesting changing that part for this series.
> > And even the hunk above can remain as you suggest - we can revisit it
> > once other architectures gain support for this.
> >
> > The main thing I would like to avoid at this point in time is to add
> > new fields to struct bootparams that loaders such as GRUB may start to
> > populate as well - I don't think there is a very strong case for
> > pseudo-EFI boot [where GRUB calls ExitBootServices()] on confidential
> > VMs (as it prevents the EFI stub and the kernel from accessing the
> > measurement and attestation APIs), but let's not create more struct
> > bootparams based API if we can avoid it.
>
> Below is an updated version of the fixup. I believe I have addressed all
> your feedback.
>
> I moved most of the unaccepted memory code into generic EFI and the EFI
> stub. I hope it looks fine.
>

Yes this looks excellent. I left some comments below, primarily about
the use of PMD_SIZE and the version field, but other than that, this
looks ready to go.

> early_memremap() works fine for the reservation, but when I tried to use
> memremap() as you suggested to get the mapping of the table instead of
> __va(), it failed. I didn't find the root cause. I guess I tried to use it
> too early for memremap() to be functional. I made the arch provide an
> arch-specific way to get the mapping, which is implemented as __va() on
> x86.
>

Fair enough - we'll cross that bridge when we have to.

> While moving code from the decompressor to the EFI stub, I removed a few
> headers because, it *seems*, the EFI stub has a different policy about
> re-using headers from the main kernel image.
>
> Borislav, is that okay with you, or does the EFI stub also have to carry
> its own copy of the headers?
>

I'd prefer to avoid that - I'm not familiar with the motivation behind
this, but I don't remember any issues with the EFI stub that would
justify this.

> If everything is fine, I will fold the fixup properly and prepare v11 of
> the patchset.
>

That works for me. I'll coordinate with Boris on how to merge this.



> Documentation/arch/x86/zero-page.rst | 1 -
> arch/x86/boot/bitops.h | 40 ----
> arch/x86/boot/compressed/Makefile | 2 +-
> arch/x86/boot/compressed/bitmap.c | 43 -----
> arch/x86/boot/compressed/bitmap.h | 49 -----
> arch/x86/boot/compressed/bits.h | 36 ----
> arch/x86/boot/compressed/find.c | 54 ------
> arch/x86/boot/compressed/find.h | 79 --------
> arch/x86/boot/compressed/math.h | 37 ----
> arch/x86/boot/compressed/mem.c | 81 +--------
> arch/x86/boot/compressed/minmax.h | 61 -------
> arch/x86/boot/compressed/misc.c | 2 +-
> arch/x86/include/asm/page.h | 2 -
> arch/x86/include/asm/unaccepted_memory.h | 24 ++-
> arch/x86/include/uapi/asm/bootparam.h | 2 +-
> arch/x86/kernel/e820.c | 17 --
> arch/x86/mm/Makefile | 2 -
> arch/x86/mm/unaccepted_memory.c | 101 -----------
> drivers/firmware/efi/Makefile | 1 +
> drivers/firmware/efi/efi.c | 25 +++
> drivers/firmware/efi/libstub/Makefile | 2 +
> drivers/firmware/efi/libstub/efistub.h | 6 +
> drivers/firmware/efi/libstub/find.c | 43 +++++
> drivers/firmware/efi/libstub/unaccepted_memory.c | 221 +++++++++++++++++++++++
> drivers/firmware/efi/libstub/x86-stub.c | 62 +------
> drivers/firmware/efi/unaccepted_memory.c | 138 ++++++++++++++
> include/linux/efi.h | 12 ++
> mm/internal.h | 9 +-
> 28 files changed, 480 insertions(+), 672 deletions(-)
>
...
> diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
> index b51f2a4c821e..e489fefd23da 100644
> --- a/drivers/firmware/efi/Makefile
> +++ b/drivers/firmware/efi/Makefile
> @@ -41,3 +41,4 @@ obj-$(CONFIG_EFI_CAPSULE_LOADER) += capsule-loader.o
> obj-$(CONFIG_EFI_EARLYCON) += earlycon.o
> obj-$(CONFIG_UEFI_CPER_ARM) += cper-arm.o
> obj-$(CONFIG_UEFI_CPER_X86) += cper-x86.o
> +obj-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 7dce06e419c5..bddb5aeb0d12 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -50,6 +50,9 @@ struct efi __read_mostly efi = {
> #ifdef CONFIG_EFI_COCO_SECRET
> .coco_secret = EFI_INVALID_TABLE_ADDR,
> #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> + .unaccepted = EFI_INVALID_TABLE_ADDR,
> +#endif
> };
> EXPORT_SYMBOL(efi);
>
> @@ -605,6 +608,9 @@ static const efi_config_table_type_t common_tables[] __initconst = {
> #ifdef CONFIG_EFI_COCO_SECRET
> {LINUX_EFI_COCO_SECRET_AREA_GUID, &efi.coco_secret, "CocoSecret" },
> #endif
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> + {LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID, &efi.unaccepted, "Unaccepted" },
> +#endif
> #ifdef CONFIG_EFI_GENERIC_STUB
> {LINUX_EFI_SCREEN_INFO_TABLE_GUID, &screen_info_table },
> #endif
> @@ -759,6 +765,25 @@ int __init efi_config_parse_tables(const efi_config_table_t *config_tables,
> }
> }
>
> + if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
> + efi.unaccepted != EFI_INVALID_TABLE_ADDR) {
> + struct efi_unaccepted_memory *unaccepted;
> +
> + unaccepted = early_memremap(efi.unaccepted, sizeof(*unaccepted));
> + if (unaccepted) {
> + unsigned long size;
> +
> + if (unaccepted->version == 0) {
> + size = sizeof(*unaccepted) + unaccepted->size;
> + memblock_reserve(efi.unaccepted, size);
> + } else {
> + efi.unaccepted = EFI_INVALID_TABLE_ADDR;
> + }
> +
> + early_memunmap(unaccepted, sizeof(*unaccepted));
> + }
> + }
> +
> return 0;
> }
>
> diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
> index 3abb2b357482..a09edfbd7cfc 100644
> --- a/drivers/firmware/efi/libstub/Makefile
> +++ b/drivers/firmware/efi/libstub/Makefile
> @@ -96,6 +96,8 @@ CFLAGS_arm32-stub.o := -DTEXT_OFFSET=$(TEXT_OFFSET)
> zboot-obj-$(CONFIG_RISCV) := lib-clz_ctz.o lib-ashldi3.o
> lib-$(CONFIG_EFI_ZBOOT) += zboot.o $(zboot-obj-y)
>
> +lib-$(CONFIG_UNACCEPTED_MEMORY) += unaccepted_memory.o find.o
> +
> extra-y := $(lib-y)
> lib-y := $(patsubst %.o,%.stub.o,$(lib-y))
>
> diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
> index 67d5a20802e0..8659a01664b8 100644
> --- a/drivers/firmware/efi/libstub/efistub.h
> +++ b/drivers/firmware/efi/libstub/efistub.h
> @@ -1133,4 +1133,10 @@ const u8 *__efi_get_smbios_string(const struct efi_smbios_record *record,
> void efi_remap_image(unsigned long image_base, unsigned alloc_size,
> unsigned long code_size);
>
> +efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> + struct efi_boot_memmap *map);
> +void process_unaccepted_memory(u64 start, u64 end);
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +void arch_accept_memory(phys_addr_t start, phys_addr_t end);
> +
> #endif
> diff --git a/drivers/firmware/efi/libstub/find.c b/drivers/firmware/efi/libstub/find.c
> new file mode 100644
> index 000000000000..4e7740d28987
> --- /dev/null
> +++ b/drivers/firmware/efi/libstub/find.c
> @@ -0,0 +1,43 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/bitmap.h>
> +#include <linux/math.h>
> +#include <linux/minmax.h>
> +
> +/*
> + * Common helper for find_next_bit() function family
> + * @FETCH: The expression that fetches and pre-processes each word of bitmap(s)
> + * @MUNGE: The expression that post-processes a word containing found bit (may be empty)
> + * @size: The bitmap size in bits
> + * @start: The bitnumber to start searching at
> + */
> +#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) \
> +({ \
> + unsigned long mask, idx, tmp, sz = (size), __start = (start); \
> + \
> + if (unlikely(__start >= sz)) \
> + goto out; \
> + \
> + mask = MUNGE(BITMAP_FIRST_WORD_MASK(__start)); \
> + idx = __start / BITS_PER_LONG; \
> + \
> + for (tmp = (FETCH) & mask; !tmp; tmp = (FETCH)) { \
> + if ((idx + 1) * BITS_PER_LONG >= sz) \
> + goto out; \
> + idx++; \
> + } \
> + \
> + sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(tmp)), sz); \
> +out: \
> + sz; \
> +})
> +
> +unsigned long _find_next_bit(const unsigned long *addr, unsigned long nbits, unsigned long start)
> +{
> + return FIND_NEXT_BIT(addr[idx], /* nop */, nbits, start);
> +}
> +
> +unsigned long _find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
> + unsigned long start)
> +{
> + return FIND_NEXT_BIT(~addr[idx], /* nop */, nbits, start);
> +}
> diff --git a/drivers/firmware/efi/libstub/unaccepted_memory.c b/drivers/firmware/efi/libstub/unaccepted_memory.c
> new file mode 100644
> index 000000000000..6c19d8fa563e
> --- /dev/null
> +++ b/drivers/firmware/efi/libstub/unaccepted_memory.c
> @@ -0,0 +1,221 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <asm/efi.h>
> +#include "efistub.h"
> +
> +static struct efi_unaccepted_memory *unaccepted_table;
> +
> +efi_status_t allocate_unaccepted_bitmap(__u32 nr_desc,
> + struct efi_boot_memmap *map)
> +{
> + efi_guid_t unaccepted_table_guid = LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID;
> + u64 unaccepted_start = ULLONG_MAX, unaccepted_end = 0, bitmap_size;
> + efi_status_t status;
> + int i;
> +
> + /* Check if the table is already installed */
> + unaccepted_table = get_efi_config_table(unaccepted_table_guid);
> + if (unaccepted_table) {
> + if (unaccepted_table->version != 0) {
> + efi_err("Unknown version of unaccepted memory tatble\n");

Typo ^^^

Also, can we start with version 1 rather than 0? That way, we can spot
the difference between a valid table and memory that has been cleared
inadvertently.
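
A minimal sketch of why a non-zero starting version helps. The constant UNACCEPTED_TABLE_VERSION, the trimmed-down struct unaccepted_hdr, and the helper unaccepted_hdr_valid() are hypothetical names used only for illustration; they are not part of the patch, which currently uses version 0. With version 1, a header whose memory was zeroed by accident reads back as version 0 and is rejected:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the patch under review still uses version 0. */
#define UNACCEPTED_TABLE_VERSION 1u

/* Mirrors the fixed fields of struct efi_unaccepted_memory (bitmap omitted). */
struct unaccepted_hdr {
	uint32_t version;
	uint32_t unit_size;
	uint64_t phys_base;
	uint64_t size;
};

/* Accept only a header that carries the expected, non-zero version. */
bool unaccepted_hdr_valid(const struct unaccepted_hdr *hdr)
{
	assert(hdr != NULL);
	return hdr->version == UNACCEPTED_TABLE_VERSION;
}
```

If version 0 were the valid value, the zeroed header would pass the same check, which is the case the comment above wants to rule out.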


> + return EFI_UNSUPPORTED;
> + }
> + return EFI_SUCCESS;
> + }
> +
> + /* Check if there's any unaccepted memory and find the max address */
> + for (i = 0; i < nr_desc; i++) {
> + efi_memory_desc_t *d;
> + unsigned long m = (unsigned long)map->map;
> +
> + d = efi_early_memdesc_ptr(m, map->desc_size, i);
> + if (d->type != EFI_UNACCEPTED_MEMORY)
> + continue;
> +
> + unaccepted_start = min(unaccepted_start, d->phys_addr);
> + unaccepted_end = max(unaccepted_end,
> + d->phys_addr + d->num_pages * PAGE_SIZE);
> + }
> +
> + if (unaccepted_start == ULLONG_MAX)
> + return EFI_SUCCESS;
> +
> + unaccepted_start = round_down(unaccepted_start, PMD_SIZE);
> + unaccepted_end = round_up(unaccepted_end, PMD_SIZE);
> +

Please replace PMD_SIZE with something along the lines of
EFI_UNACCEPTED_UNIT_SIZE and #define it to PMD_SIZE in
arch/x86/include/asm/efi.h.

The comment below about the size of the bitmap vs the size of the
address space should probably move there as well.
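
For reference, the arithmetic in that comment can be checked with a small standalone sketch. EFI_UNACCEPTED_UNIT_SIZE is the name proposed above, defined here as 2 MiB on the assumption that it maps to x86 PMD_SIZE; unaccepted_bitmap_bytes() is a hypothetical helper that mirrors the DIV_ROUND_UP() expression in the patch and takes already-aligned bounds:

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8u
/* Name proposed in the review; 2 MiB assumes it is defined as x86 PMD_SIZE. */
#define EFI_UNACCEPTED_UNIT_SIZE (UINT64_C(2) << 20)

/* Bytes of bitmap needed to cover [start, end) at one bit per unit. */
uint64_t unaccepted_bitmap_bytes(uint64_t start, uint64_t end,
				 uint64_t unit_size)
{
	uint64_t span_per_byte;

	assert(unit_size != 0 && end >= start);

	/* Each bitmap byte describes BITS_PER_BYTE units of address space. */
	span_per_byte = unit_size * BITS_PER_BYTE;

	/* Same rounding as DIV_ROUND_UP(end - start, unit_size * BITS_PER_BYTE). */
	return (end - start + span_per_byte - 1) / span_per_byte;
}
```

With a 2 MiB unit, 64 GiB of address space needs 4096 bytes of bitmap and 4 PiB needs 256 MiB, which matches the figures in the comment.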

> + /*
> + * If unaccepted memory is present, allocate a bitmap to track what
> + * memory has to be accepted before access.
> + *
> + * One bit in the bitmap represents 2MiB in the address space:
> + * A 4k bitmap can track 64GiB of physical address space.
> + *
> + * In the worst case scenario -- a huge hole in the middle of the
> + * address space -- It needs 256MiB to handle 4PiB of the address
> + * space.
> + *
> + * The bitmap will be populated in setup_e820() according to the memory
> + * map after efi_exit_boot_services().
> + */
> + bitmap_size = DIV_ROUND_UP(unaccepted_end - unaccepted_start,
> + PMD_SIZE * BITS_PER_BYTE);

PMD_SIZE ^^^

> +
> + status = efi_bs_call(allocate_pool, EFI_LOADER_DATA,
> + sizeof(*unaccepted_table) + bitmap_size,
> + (void **)&unaccepted_table);
> + if (status != EFI_SUCCESS) {
> + efi_err("Failed to allocate unaccepted memory config table\n");
> + return status;
> + }
> +
> + unaccepted_table->version = 0;
> + unaccepted_table->unit_size = PMD_SIZE;

And here

> + unaccepted_table->phys_base = unaccepted_start;
> + unaccepted_table->size = bitmap_size;
> + memset(unaccepted_table->bitmap, 0, bitmap_size);
> +
> + status = efi_bs_call(install_configuration_table,
> + &unaccepted_table_guid, unaccepted_table);
> + if (status != EFI_SUCCESS) {
> + efi_bs_call(free_pool, unaccepted_table);
> + efi_err("Failed to install unaccepted memory config table!\n");
> + }
> +
> + return status;
> +}
> +
> +/*
> + * The accepted memory bitmap only works at PMD_SIZE granularity. Take
> + * unaligned start/end addresses and either:
> + * 1. Accepts the memory immediately and in its entirety
> + * 2. Accepts unaligned parts, and marks *some* aligned part unaccepted
> + *
> + * The function will never reach the bitmap_set() with zero bits to set.
> + */
> +void process_unaccepted_memory(u64 start, u64 end)
> +{
> + u64 unit_size = unaccepted_table->unit_size;
> + u64 unit_mask = unaccepted_table->unit_size - 1;
> + u64 bitmap_size = unaccepted_table->size;
> +
> + /*
> + * Ensure that at least one bit will be set in the bitmap by
> + * immediately accepting all regions under 2*unit_size. This is
> + * imprecise and may immediately accept some areas that could
> + * have been represented in the bitmap. But, results in simpler
> + * code below
> + *
> + * Consider case like this (assuming unit_size == 2MB):
> + *
> + * | 4k | 2044k | 2048k |
> + * ^ 0x0 ^ 2MB ^ 4MB
> + *
> + * Only the first 4k has been accepted. The 0MB->2MB region can not be
> + * represented in the bitmap. The 2MB->4MB region can be represented in
> + * the bitmap. But, the 0MB->4MB region is <2*unit_size and will be
> + * immediately accepted in its entirety.
> + */
> + if (end - start < 2 * unit_size) {
> + arch_accept_memory(start, end);
> + return;
> + }
> +
> + /*
> + * No matter how the start and end are aligned, at least one unaccepted
> + * unit_size area will remain to be marked in the bitmap.
> + */
> +
> + /* Immediately accept a <unit_size piece at the start: */
> + if (start & unit_mask) {
> + arch_accept_memory(start, round_up(start, unit_size));
> + start = round_up(start, unit_size);
> + }
> +
> + /* Immediately accept a <unit_size piece at the end: */
> + if (end & unit_mask) {
> + arch_accept_memory(round_down(end, unit_size), end);
> + end = round_down(end, unit_size);
> + }
> +
> + /*
> + * Accept part of the range that before phys_base and cannot be recorded
> + * into the bitmap.
> + */
> + if (start < unaccepted_table->phys_base) {
> + arch_accept_memory(start,
> + min(unaccepted_table->phys_base, end));
> + start = unaccepted_table->phys_base;
> + }
> +
> + /* Nothing to record */
> + if (end < unaccepted_table->phys_base)
> + return;
> +
> + /* Translate to offsets from the beginning of the bitmap */
> + start -= unaccepted_table->phys_base;
> + end -= unaccepted_table->phys_base;
> +
> + /* Accept memory that doesn't fit into bitmap */
> + if (end > bitmap_size * unit_size * BITS_PER_BYTE) {
> + unsigned long phys_start, phys_end;
> +
> + phys_start = bitmap_size * unit_size * BITS_PER_BYTE +
> + unaccepted_table->phys_base;
> + phys_end = end + unaccepted_table->phys_base;
> +
> + arch_accept_memory(phys_start, phys_end);
> + end = bitmap_size * unit_size * BITS_PER_BYTE;
> + }
> +
> + /*
> + * 'start' and 'end' are now both unit_size-aligned.
> + * Record the range as being unaccepted:
> + */
> + bitmap_set(unaccepted_table->bitmap,
> + start / unit_size, (end - start) / unit_size);
> +}
> +
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> + unsigned long range_start, range_end;
> + unsigned long bitmap_size;
> + u64 unit_size;
> +
> + if (!unaccepted_table)
> + return;
> +
> + unit_size = unaccepted_table->unit_size;
> +
> + /*
> + * Only care for the part of the range that is represented
> + * in the bitmap.
> + */
> + if (start < unaccepted_table->phys_base)
> + start = unaccepted_table->phys_base;
> + if (end < unaccepted_table->phys_base)
> + return;
> +
> + /* Translate to offsets from the beginning of the bitmap */
> + start -= unaccepted_table->phys_base;
> + end -= unaccepted_table->phys_base;
> +
> + /* Make sure not to overrun the bitmap */
> + if (end > unaccepted_table->size * unit_size * BITS_PER_BYTE)
> + end = unaccepted_table->size * unit_size * BITS_PER_BYTE;
> +

Should we warn here?


> + range_start = start / unit_size;
> + bitmap_size = DIV_ROUND_UP(end, unit_size);
> +
> + for_each_set_bitrange_from(range_start, range_end,
> + unaccepted_table->bitmap, bitmap_size) {
> + unsigned long phys_start, phys_end;
> +
> + phys_start = range_start * unit_size + unaccepted_table->phys_base;
> + phys_end = range_end * unit_size + unaccepted_table->phys_base;
> +
> + arch_accept_memory(phys_start, phys_end);
> + bitmap_clear(unaccepted_table->bitmap,
> + range_start, range_end - range_start);
> + }
> +}
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
> index 1afe7b5b02e1..16ea5e76907f 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -621,7 +621,7 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
> continue;
> }
> e820_type = E820_TYPE_RAM;
> - process_unaccepted_memory(params, d->phys_addr,
> + process_unaccepted_memory(d->phys_addr,
> d->phys_addr + PAGE_SIZE * d->num_pages);
> break;
> default:
> @@ -688,64 +688,6 @@ static efi_status_t alloc_e820ext(u32 nr_desc, struct setup_data **e820ext,
> return status;
> }
>
> -static efi_status_t allocate_unaccepted_bitmap(struct boot_params *params,
> - __u32 nr_desc,
> - struct efi_boot_memmap *map)
> -{
> - unsigned long *mem = NULL;
> - u64 size, max_addr = 0;
> - efi_status_t status;
> - bool found = false;
> - int i;
> -
> - /* Check if there's any unaccepted memory and find the max address */
> - for (i = 0; i < nr_desc; i++) {
> - efi_memory_desc_t *d;
> - unsigned long m = (unsigned long)map->map;
> -
> - d = efi_early_memdesc_ptr(m, map->desc_size, i);
> - if (d->type == EFI_UNACCEPTED_MEMORY)
> - found = true;
> - if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr)
> - max_addr = d->phys_addr + d->num_pages * PAGE_SIZE;
> - }
> -
> - if (!found) {
> - params->unaccepted_memory = 0;
> - return EFI_SUCCESS;
> - }
> -
> - /*
> - * range_contains_unaccepted_memory() may need to check one 2M chunk
> - * beyond the end of RAM to deal with load_unaligned_zeropad(). Make
> - * sure that the bitmap is large enough handle it.
> - */
> - max_addr += PMD_SIZE;
> -
> - /*
> - * If unaccepted memory is present, allocate a bitmap to track what
> - * memory has to be accepted before access.
> - *
> - * One bit in the bitmap represents 2MiB in the address space:
> - * A 4k bitmap can track 64GiB of physical address space.
> - *
> - * In the worst case scenario -- a huge hole in the middle of the
> - * address space -- It needs 256MiB to handle 4PiB of the address
> - * space.
> - *
> - * The bitmap will be populated in setup_e820() according to the memory
> - * map after efi_exit_boot_services().
> - */
> - size = DIV_ROUND_UP(max_addr, PMD_SIZE * BITS_PER_BYTE);
> - status = efi_allocate_pages(size, (unsigned long *)&mem, ULONG_MAX);
> - if (status == EFI_SUCCESS) {
> - memset(mem, 0, size);
> - params->unaccepted_memory = (unsigned long)mem;
> - }
> -
> - return status;
> -}
> -
> static efi_status_t allocate_e820(struct boot_params *params,
> struct setup_data **e820ext,
> u32 *e820ext_size)
> @@ -767,7 +709,7 @@ static efi_status_t allocate_e820(struct boot_params *params,
> }
>
> if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && status == EFI_SUCCESS)
> - status = allocate_unaccepted_bitmap(params, nr_desc, map);
> + status = allocate_unaccepted_bitmap(nr_desc, map);
>
> efi_bs_call(free_pool, map);
> return status;
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> new file mode 100644
> index 000000000000..3d1ca60916dd
> --- /dev/null
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <linux/memblock.h>
> +#include <linux/spinlock.h>
> +#include <asm/unaccepted_memory.h>
> +
> +/* Protects unaccepted memory bitmap */
> +static DEFINE_SPINLOCK(unaccepted_memory_lock);
> +
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> + struct efi_unaccepted_memory *unaccepted;
> + unsigned long range_start, range_end;
> + unsigned long flags;
> + u64 unit_size;
> +
> + if (efi.unaccepted == EFI_INVALID_TABLE_ADDR)
> + return;
> +
> + unaccepted = efi_get_unaccepted_table();
> + if (!unaccepted)
> + return;
> +
> + unit_size = unaccepted->unit_size;
> +
> + /*
> + * Only care for the part of the range that is represented
> + * in the bitmap.
> + */
> + if (start < unaccepted->phys_base)
> + start = unaccepted->phys_base;
> + if (end < unaccepted->phys_base)
> + return;
> +
> + /* Translate to offsets from the beginning of the bitmap */
> + start -= unaccepted->phys_base;
> + end -= unaccepted->phys_base;
> +
> + /*
> + * load_unaligned_zeropad() can lead to unwanted loads across page
> + * boundaries. The unwanted loads are typically harmless. But, they
> + * might be made to totally unrelated or even unmapped memory.
> + * load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now
> + * #VE) to recover from these unwanted loads.
> + *
> + * But, this approach does not work for unaccepted memory. For TDX, a
> + * load from unaccepted memory will not lead to a recoverable exception
> + * within the guest. The guest will exit to the VMM where the only
> + * recourse is to terminate the guest.
> + *
> + * There are two parts to fix this issue and comprehensively avoid
> + * access to unaccepted memory. Together these ensure that an extra
> + * "guard" page is accepted in addition to the memory that needs to be
> + * used:
> + *
> + * 1. Implicitly extend the range_contains_unaccepted_memory(start, end)
> + * checks up to end+unit_size if 'end' is aligned on a unit_size
> + * boundary.
> + *
> + * 2. Implicitly extend accept_memory(start, end) to end+unit_size if
> + * 'end' is aligned on a unit_size boundary. (immediately following
> + * this comment)
> + */
> + if (!(end % unit_size))
> + end += unit_size;
> +
> + /* Make sure not to overrun the bitmap */
> + if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> + end = unaccepted->size * unit_size * BITS_PER_BYTE;
> +
> + range_start = start / unit_size;
> +
> + spin_lock_irqsave(&unaccepted_memory_lock, flags);
> + for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
> + DIV_ROUND_UP(end, unit_size)) {
> + unsigned long phys_start, phys_end;
> + unsigned long len = range_end - range_start;
> +
> + phys_start = range_start * unit_size + unaccepted->phys_base;
> + phys_end = range_end * unit_size + unaccepted->phys_base;
> +
> + arch_accept_memory(phys_start, phys_end);
> + bitmap_clear(unaccepted->bitmap, range_start, len);
> + }
> + spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}
> +
> +bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
> +{
> + struct efi_unaccepted_memory *unaccepted;
> + unsigned long flags;
> + bool ret = false;
> + u64 unit_size;
> +
> + unaccepted = efi_get_unaccepted_table();
> + if (!unaccepted)
> + return false;
> +
> + unit_size = unaccepted->unit_size;
> +
> + /*
> + * Only care for the part of the range that is represented
> + * in the bitmap.
> + */
> + if (start < unaccepted->phys_base)
> + start = unaccepted->phys_base;
> + if (end < unaccepted->phys_base)
> + return false;
> +
> + /* Translate to offsets from the beginning of the bitmap */
> + start -= unaccepted->phys_base;
> + end -= unaccepted->phys_base;
> +
> + /*
> + * Also consider the unaccepted state of the *next* page. See fix #1 in
> + * the comment on load_unaligned_zeropad() in accept_memory().
> + */
> + if (!(end % unit_size))
> + end += unit_size;
> +
> + /* Make sure not to overrun the bitmap */
> + if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> + end = unaccepted->size * unit_size * BITS_PER_BYTE;
> +
> + spin_lock_irqsave(&unaccepted_memory_lock, flags);
> + while (start < end) {
> + if (test_bit(start / unit_size, unaccepted->bitmap)) {
> + ret = true;
> + break;
> + }
> +
> + start += unit_size;
> + }
> + spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +
> + return ret;
> +}
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index efbe14641638..0f4620060ed8 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -418,6 +418,7 @@ void efi_native_runtime_setup(void);
> #define LINUX_EFI_MOK_VARIABLE_TABLE_GUID EFI_GUID(0xc451ed2b, 0x9694, 0x45d3, 0xba, 0xba, 0xed, 0x9f, 0x89, 0x88, 0xa3, 0x89)
> #define LINUX_EFI_COCO_SECRET_AREA_GUID EFI_GUID(0xadf956ad, 0xe98c, 0x484c, 0xae, 0x11, 0xb5, 0x1c, 0x7d, 0x33, 0x64, 0x47)
> #define LINUX_EFI_BOOT_MEMMAP_GUID EFI_GUID(0x800f683f, 0xd08b, 0x423a, 0xa2, 0x93, 0x96, 0x5c, 0x3c, 0x6f, 0xe2, 0xb4)
> +#define LINUX_EFI_UNACCEPTED_MEM_TABLE_GUID EFI_GUID(0xd5d1de3c, 0x105c, 0x44f9, 0x9e, 0xa9, 0xbc, 0xef, 0x98, 0x12, 0x00, 0x31)
>
> #define RISCV_EFI_BOOT_PROTOCOL_GUID EFI_GUID(0xccd15fec, 0x6f73, 0x4eec, 0x83, 0x95, 0x3e, 0x69, 0xe4, 0xb9, 0x40, 0xbf)
>
> @@ -535,6 +536,16 @@ struct efi_boot_memmap {
> efi_memory_desc_t map[];
> };
>
> +struct efi_unaccepted_memory {
> + u32 version;
> + u32 unit_size;
> + u64 phys_base;
> + u64 size;
> + unsigned long bitmap[];
> +};
> +
> +void __init efi_unaccepted_table_init(void);
> +
> /*
> * Architecture independent structure for describing a memory map for the
> * benefit of efi_memmap_init_early(), and for passing context between
> @@ -637,6 +648,7 @@ extern struct efi {
> unsigned long tpm_final_log; /* TPM2 Final Events Log table */
> unsigned long mokvar_table; /* MOK variable config table */
> unsigned long coco_secret; /* Confidential computing secret table */
> + unsigned long unaccepted; /* Unaccepted memory table */
>
> efi_get_time_t *get_time;
> efi_set_time_t *set_time;
> diff --git a/mm/internal.h b/mm/internal.h
> index ed042e366d49..2e70f22d1b3f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1100,7 +1100,13 @@ struct vma_prepare {
> struct vm_area_struct *remove2;
> };
>
> -#ifndef CONFIG_UNACCEPTED_MEMORY
> +#ifdef CONFIG_UNACCEPTED_MEMORY
> +
> +bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
> +void accept_memory(phys_addr_t start, phys_addr_t end);
> +
> +#else
> +
> static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> phys_addr_t end)
> {
> @@ -1110,6 +1116,7 @@ static inline bool range_contains_unaccepted_memory(phys_addr_t start,
> static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> {
> }
> +
> #endif
>
> #endif /* __MM_INTERNAL_H */
> --
> Kiryl Shutsemau / Kirill A. Shutemov
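
The guard-unit handling described in the load_unaligned_zeropad() comment quoted above can be sketched in isolation. effective_end() is a hypothetical helper, not a function from the patch; it takes an end offset that is already relative to phys_base and applies the same two steps as the quoted code: extend by one unit when 'end' is unit-aligned, then clamp to the span the bitmap covers.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8u

/*
 * Return the end offset that is actually checked or accepted:
 * one extra guard unit if 'end' sits on a unit boundary, never
 * past the address span described by 'bitmap_bytes' of bitmap.
 */
uint64_t effective_end(uint64_t end, uint64_t unit_size,
		       uint64_t bitmap_bytes)
{
	uint64_t covered;

	assert(unit_size != 0);
	covered = bitmap_bytes * unit_size * BITS_PER_BYTE;

	/* Aligned end: include the next unit as a guard. */
	if (!(end % unit_size))
		end += unit_size;

	/* Make sure not to overrun the bitmap. */
	if (end > covered)
		end = covered;

	return end;
}
```

An unaligned end needs no extension because the unit containing it is already covered by rounding up when the bit index is computed.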

2023-05-12 11:17:55

by Ard Biesheuvel

Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

On Fri, 12 May 2023 at 12:59, Kirill A. Shutemov wrote:
>
> On Fri, May 12, 2023 at 09:39:30AM +0200, Ard Biesheuvel wrote:
> > Please replace PMD_SIZE with something along the lines of
> > EFI_UNACCEPTED_UNIT_SIZE and #define it to PMD_SIZE in
> > arch/x86/include/asm/efi.h.
> >
> > The comment below about the size of the bitmap vs the size of the
> > address space should probably move there as well.
>
> Okay, will do.
>
> > > +void accept_memory(phys_addr_t start, phys_addr_t end)
> > > +{
> > > + unsigned long range_start, range_end;
> > > + unsigned long bitmap_size;
> > > + u64 unit_size;
> > > +
> > > + if (!unaccepted_table)
> > > + return;
> > > +
> > > + unit_size = unaccepted_table->unit_size;
> > > +
> > > + /*
> > > + * Only care for the part of the range that is represented
> > > + * in the bitmap.
> > > + */
> > > + if (start < unaccepted_table->phys_base)
> > > + start = unaccepted_table->phys_base;
> > > + if (end < unaccepted_table->phys_base)
> > > + return;
> > > +
> > > + /* Translate to offsets from the beginning of the bitmap */
> > > + start -= unaccepted_table->phys_base;
> > > + end -= unaccepted_table->phys_base;
> > > +
> > > + /* Make sure not to overrun the bitmap */
> > > + if (end > unaccepted_table->size * unit_size * BITS_PER_BYTE)
> > > + end = unaccepted_table->size * unit_size * BITS_PER_BYTE;
> > > +
> >
> > Should we warn here?
>
> No. accept_memory() is a no-op for conventional memory (memblock calls it
> unconditionally).
>
> With the fixup, we only allocate the bitmap for the range of physical
> address space where we have unaccepted memory. So if there's conventional
> memory after the unaccepted memory, the bitmap will not cover it.
>

Fair enough.

2023-05-12 11:26:20

by Kirill A. Shutemov

Subject: Re: [PATCHv10 04/11] efi/x86: Implement support for unaccepted memory

On Fri, May 12, 2023 at 09:39:30AM +0200, Ard Biesheuvel wrote:
> Please replace PMD_SIZE with something along the lines of
> EFI_UNACCEPTED_UNIT_SIZE and #define it to PMD_SIZE in
> arch/x86/include/asm/efi.h.
>
> The comment below about the size of the bitmap vs the size of the
> address space should probably move there as well.

Okay, will do.

> > +void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > + unsigned long range_start, range_end;
> > + unsigned long bitmap_size;
> > + u64 unit_size;
> > +
> > + if (!unaccepted_table)
> > + return;
> > +
> > + unit_size = unaccepted_table->unit_size;
> > +
> > + /*
> > + * Only care for the part of the range that is represented
> > + * in the bitmap.
> > + */
> > + if (start < unaccepted_table->phys_base)
> > + start = unaccepted_table->phys_base;
> > + if (end < unaccepted_table->phys_base)
> > + return;
> > +
> > + /* Translate to offsets from the beginning of the bitmap */
> > + start -= unaccepted_table->phys_base;
> > + end -= unaccepted_table->phys_base;
> > +
> > + /* Make sure not to overrun the bitmap */
> > + if (end > unaccepted_table->size * unit_size * BITS_PER_BYTE)
> > + end = unaccepted_table->size * unit_size * BITS_PER_BYTE;
> > +
>
> Should we warn here?

No. accept_memory() is a no-op for conventional memory (memblock calls it
unconditionally).

With the fixup, we only allocate the bitmap for the range of physical
address space where we have unaccepted memory. So if there's conventional
memory after the unaccepted memory, the bitmap will not cover it.

--
Kiryl Shutsemau / Kirill A. Shutemov
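
The clamping behaviour discussed above can be illustrated with a standalone sketch. struct bit_range and bits_for_range() are hypothetical names, and the function only computes which bitmap bits a physical range would touch; it follows the clamping steps of the quoted stub accept_memory() but does not accept anything. A range that lies entirely outside the bitmap yields zero bits, so the call is silently a no-op:

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8u

/* Hypothetical result type: first bit index and number of bits touched. */
struct bit_range {
	uint64_t first;
	uint64_t count;
};

struct bit_range bits_for_range(uint64_t start, uint64_t end,
				uint64_t phys_base, uint64_t unit_size,
				uint64_t bitmap_bytes)
{
	struct bit_range r = { 0, 0 };
	uint64_t covered, first, limit;

	assert(unit_size != 0 && start <= end);
	covered = bitmap_bytes * unit_size * BITS_PER_BYTE;

	/* Entirely below the bitmap: nothing to do. */
	if (end < phys_base)
		return r;
	if (start < phys_base)
		start = phys_base;

	/* Translate to offsets from the beginning of the bitmap. */
	start -= phys_base;
	end -= phys_base;

	/* Clamp instead of warning: memory past the bitmap is conventional. */
	if (end > covered)
		end = covered;

	first = start / unit_size;
	limit = (end + unit_size - 1) / unit_size;
	if (first >= limit)
		return r;

	r.first = first;
	r.count = limit - first;
	return r;
}
```

For a bitmap of one byte based at 4 GiB with 2 MiB units, a range starting at 8 GiB maps to no bits at all, while a range that straddles the end of the bitmap is trimmed to the bits that exist.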