2023-03-14 10:16:06

by Evgeniy Baskov

Subject: [PATCH v5 00/27] x86_64: Improvements at compressed kernel stage

This patchset aims:
* to improve UEFI compatibility of the compressed kernel code for x86_64,
* to set up proper memory access attributes for the code and rodata sections,
* to implement a W^X protection policy throughout the whole execution
of the compressed kernel for the EFISTUB code path.

The kernel is made more compatible with the PE image specification [3],
allowing it to be loaded successfully by stricter PE loader
implementations like the one from [2]. There is at least one
known implementation that uses that loader in production [4].
There are also ongoing efforts to upstream these changes.

The patchset also adds support for EFI_MEMORY_ATTRIBUTE_PROTOCOL,
included in the EFI specification since version 2.10, as a better
alternative to using DXE services for manipulating memory protection
attributes, since it is defined by the UEFI specification itself and
not by the UEFI PI specification. This protocol is not yet widely
available, so the code using DXE services is kept in place as a
fallback for implementations that do not support the new protocol.
One EFI implementation that already supports
EFI_MEMORY_ATTRIBUTE_PROTOCOL is Microsoft Project Mu [5].
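
For illustration, the intended probing order looks roughly like the
sketch below (the protocol typedef and GUID macro names here are
assumptions; efi_bs_call()/efi_call_proto() and
efi_adjust_memory_range_protection() are the helpers used in this
series):

	efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;	/* assumed name */
	efi_memory_attribute_protocol_t *memattr;		/* assumed type */
	efi_status_t status;

	/* Prefer the protocol defined by the UEFI specification itself... */
	status = efi_bs_call(locate_protocol, &guid, NULL, (void **)&memattr);
	if (status == EFI_SUCCESS)
		status = efi_call_proto(memattr, set_memory_attributes,
					addr, size, EFI_MEMORY_XP);
	else
		/* ...and fall back to the DXE-services-based helper. */
		status = efi_adjust_memory_range_protection(addr, size,
							    EFI_MEMORY_XP);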

The kernel image generation tool (tools/build.c) is refactored as part
of the changes that make the PE image more compatible.

The patchset implements memory protection for the compressed kernel
code while executing both inside and outside EFI boot services.
For the EFISTUB code path, the W^X protection policy is maintained
throughout the whole execution of the compressed kernel. The latter
is achieved by extracting the kernel directly from the EFI environment
and jumping to its head immediately after exiting EFI boot services.
As a side effect of this change, one page table rebuild and one copy
of the kernel image are removed.

Memory protection inside the EFI environment is controlled by the
CONFIG_DXE_MEM_ATTRIBUTES option. With these patches, this option
also controls the use of EFI_MEMORY_ATTRIBUTE_PROTOCOL and the memory
protection attributes of PE sections, and not only DXE services as
the name might suggest.

Changes in v2:
* Fix spelling.
* Rebase code to current master.
* Split huge patches into smaller ones.
* Remove unneeded forward declarations.
* Make direct extraction unconditional.
* Also make it work for x86_32.
* Reduce lower limit of KASLR to 64M.
* Make callback interface more logically consistent.
* Actually declare callbacks structure before using it.
* Mention effect on x86_32 in commit message of
"x86/build: Remove RWX sections and align on 4KB".
* Clarify commit message of
"x86/boot: Increase boot page table size".
* Remove "startup32_" prefix on startup32_enable_nx_if_supported.
* Move linker generated sections outside of function scope.
* Drop some unintended changes.
* Drop generating 2 reloc entries.
(as I've misread the documentation and there's no need for this change.)
* Set has_nx from enable_nx_if_supported correctly.
* Move ELF header check to build time.
* Set WP at the same time as PG in trampoline code,
as it is more logically consistent.
* Put x86-specific EFISTUB definitions in x86-stub.h header.
* Catch presence of ELF segments violating W^X during build.
* Move PE definitions from build.c to a new header file.
* Fix generation of PE '.compat' section.

I decided to keep protection of the compressed kernel blob and '.rodata'
separate from '.text' for now, since it does not add much overhead.

Otherwise, all comments on v1 seem to be addressed.
I have also included Peter's patches [6-7] into the series for simplicity.

Changes in v3:
* Set up the IDT before issuing cpuid, so that the AMD SEV #VC handler is set.
* Replace memcpy with strncpy to prevent out-of-bounds reads in tools/build.c.
* Zero BSS before entering efi_main(), since it can contain garbage
when booting via EFI handover protocol.
* When booting via EFI, don't require init_size bytes of RAM, since in-place
unpacking is not used with that interface anyway. This saves ~40M of memory
with the Debian .config.
* Set up section memory protection in efi_main() to cover the EFI handover
protocol, where the PE sections are likely not properly protected.

Changes in v4:
* Add one missing identity mapping.
* Include the following patches improving the use of DXE services:
- efi/x86: don't try to set page attributes on 0-sized regions.
- efi/x86: don't set unsupported memory attributes

Changes in v5:
* Fix some warnings reported by the build bot.
* Clarify comments about initial page table buffer size and
kernel extraction buffer allocation and some commit messages.
* Fix make dependencies tracking for W^X compile time check.
* Remove unneeded explicit identity mapping of struct efi_info in
process_efi_entries() -- it is already mapped at the time of the
function execution.
* Fix 'nokaslr' command line option.
* Remove memory attributes definition patch. It is already upstreamed.
* Reduce diff size for "x86/build: Cleanup tools/build.c"
* Clean up 32-bit EFISTUB assembly entry: use simpler relative offset
expressions and remove unused instruction.
* Don't set alignment flags for PE sections. They are only meant for
object files.
* Rework "x86/build: Make generated PE more spec compliant", so that
it does not generate sections dynamically and is overall simpler.
* Add Ard's patch removing the DOS stub, and another one adding a local
copy of boot_params from [8].
* Revert to the old behavior of only relaxing memory attributes
with DXE services. Stricter memory attributes are set only when
EFI_MEMORY_ATTRIBUTE_PROTOCOL is available.

The reworked x86 kernel image build tool patches are based on
ideas from Ard's RFC patches at [8] and the comments on v4 at [9].

Patch "x86/boot: Support 4KB pages for identity mapping" might need review
from x86/mm team.

Many thanks to Ard Biesheuvel <[email protected]> and
Andrew Cooper <[email protected]> for reviewing the patches, and to
Peter Jones <[email protected]>, Mario Limonciello <[email protected]> and
Joey Lee <[email protected]> for additional testing!

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://github.com/acidanthera/audk/tree/secure_pe
[3] https://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/pecoff_v83.docx
[4] https://www.ispras.ru/en/technologies/asperitas/
[5] https://github.com/microsoft/mu_tiano_platforms
[6] https://lore.kernel.org/lkml/[email protected]/
[7] https://lore.kernel.org/lkml/[email protected]/
[8] https://lore.kernel.org/linux-efi/[email protected]/
[9] https://lore.kernel.org/lkml/CAMj1kXGu0uFynyt=MostXo58A4f4Zu6cFFiSShFZChU5LWt1ZQ@mail.gmail.com/

Ard Biesheuvel (2):
x86: decompressor: Remove the 'bugger off' message
efi: x86: Use private copy of struct setup_header

Evgeniy Baskov (23):
x86/boot: Align vmlinuz sections on page size
x86/build: Remove RWX sections and align on 4KB
x86/boot: Set cr0 to known state in trampoline
x86/boot: Increase boot page table size
x86/boot: Support 4KB pages for identity mapping
x86/boot: Setup memory protection for bzImage code
x86/build: Check W^X of vmlinux during build
x86/boot: Map memory explicitly
x86/boot: Remove mapping from page fault handler
efi/libstub: Move helper function to related file
x86/boot: Make console interface more abstract
x86/boot: Make kernel_add_identity_map() a pointer
x86/boot: Split trampoline and pt init code
x86/boot: Add EFI kernel extraction interface
efi/x86: Support extracting kernel from libstub
x86/boot: Reduce lower limit of physical KASLR
tools/include: Add simplified version of pe.h
x86/build: Cleanup tools/build.c
x86/build: Add SETUP_HEADER_OFFSET constant
x86/build: set type_of_loader for EFISTUB
efi/libstub: Don't set ramdisk_image/ramdisk_size
x86/build: Make generated PE more spec compliant
efi/libstub: Use memory attribute protocol

Peter Jones (2):
efi/libstub: make memory protection warnings include newlines.
efi/x86: don't try to set page attributes on 0-sized regions.

arch/x86/boot/Makefile | 2 +-
arch/x86/boot/compressed/Makefile | 9 +-
arch/x86/boot/compressed/acpi.c | 21 +-
arch/x86/boot/compressed/efi.c | 19 +-
arch/x86/boot/compressed/head_32.S | 43 +-
arch/x86/boot/compressed/head_64.S | 89 +++-
arch/x86/boot/compressed/ident_map_64.c | 123 +++--
arch/x86/boot/compressed/kaslr.c | 6 +-
arch/x86/boot/compressed/misc.c | 284 ++++++-----
arch/x86/boot/compressed/misc.h | 23 +-
arch/x86/boot/compressed/pgtable.h | 20 -
arch/x86/boot/compressed/pgtable_64.c | 71 +--
arch/x86/boot/compressed/putstr.c | 130 +++++
arch/x86/boot/compressed/sev.c | 6 +-
arch/x86/boot/compressed/vmlinux.lds.S | 16 +-
arch/x86/boot/header.S | 109 ++--
arch/x86/boot/setup.ld | 7 +-
arch/x86/boot/tools/build.c | 475 +++++++++++-------
arch/x86/include/asm/boot.h | 28 +-
arch/x86/include/asm/init.h | 8 +-
arch/x86/include/asm/shared/extract.h | 26 +
arch/x86/include/asm/shared/pgtable.h | 29 ++
arch/x86/kernel/vmlinux.lds.S | 15 +-
arch/x86/mm/ident_map.c | 185 +++++--
drivers/firmware/efi/Kconfig | 2 +
drivers/firmware/efi/libstub/Makefile | 2 +-
drivers/firmware/efi/libstub/efistub.h | 9 +-
drivers/firmware/efi/libstub/mem.c | 194 +++++++
.../firmware/efi/libstub/x86-extract-direct.c | 217 ++++++++
drivers/firmware/efi/libstub/x86-stub.c | 227 +--------
drivers/firmware/efi/libstub/x86-stub.h | 14 +
tools/include/linux/pe.h | 150 ++++++
32 files changed, 1753 insertions(+), 806 deletions(-)
delete mode 100644 arch/x86/boot/compressed/pgtable.h
create mode 100644 arch/x86/boot/compressed/putstr.c
create mode 100644 arch/x86/include/asm/shared/extract.h
create mode 100644 arch/x86/include/asm/shared/pgtable.h
create mode 100644 drivers/firmware/efi/libstub/x86-extract-direct.c
create mode 100644 drivers/firmware/efi/libstub/x86-stub.h
create mode 100644 tools/include/linux/pe.h

--
2.39.2



2023-03-14 10:16:59

by Evgeniy Baskov

Subject: [PATCH v5 07/27] x86/build: Check W^X of vmlinux during build

Check if there are simultaneously writable and executable
program segments in the vmlinux ELF image and fail the build if there are any.

This prevents accidental introduction of RWX segments.

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b6cfe607bdb..0c6e25279ec1 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -112,12 +112,17 @@ vmlinux-objs-$(CONFIG_EFI) += $(obj)/efi.o
vmlinux-objs-$(CONFIG_EFI_MIXED) += $(obj)/efi_mixed.o
vmlinux-objs-$(CONFIG_EFI_STUB) += $(objtree)/drivers/firmware/efi/libstub/lib.a

+quiet_cmd_objcopy_and_wx_check = $(quiet_cmd_objcopy)
+cmd_objcopy_and_wx_check = if $(OBJDUMP) -p $< | grep "flags .wx" > /dev/null; then \
+ (echo >&2 "$<: Simultaneously writable and executable sections are prohibited"; \
+ /bin/false); else $(cmd_objcopy); fi
+
$(obj)/vmlinux: $(vmlinux-objs-y) FORCE
$(call if_changed,ld)

OBJCOPYFLAGS_vmlinux.bin := -R .comment -S
$(obj)/vmlinux.bin: vmlinux FORCE
- $(call if_changed,objcopy)
+ $(call if_changed,objcopy_and_wx_check)

targets += $(patsubst $(obj)/%,%,$(vmlinux-objs-y)) vmlinux.bin.all vmlinux.relocs

--
2.39.2


2023-03-14 10:17:02

by Evgeniy Baskov

Subject: [PATCH v5 09/27] x86/boot: Remove mapping from page fault handler

Now that every implicit mapping has been removed, this code is no longer
needed.

Remove memory mapping from the page fault handler to ensure that there are
no hidden invalid memory accesses.

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/ident_map_64.c | 26 ++++++++++---------------
1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index eb28ce9812c5..378f99b1d7e8 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -393,27 +393,21 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
{
unsigned long address = native_read_cr2();
unsigned long end;
- bool ghcb_fault;
+ char *msg;

- ghcb_fault = sev_es_check_ghcb_fault(address);
+ if (sev_es_check_ghcb_fault(address))
+ msg = "Page-fault on GHCB page:";
+ else
+ msg = "Unexpected page-fault:";

address &= PMD_MASK;
end = address + PMD_SIZE;

/*
- * Check for unexpected error codes. Unexpected are:
- * - Faults on present pages
- * - User faults
- * - Reserved bits set
- */
- if (error_code & (X86_PF_PROT | X86_PF_USER | X86_PF_RSVD))
- do_pf_error("Unexpected page-fault:", error_code, address, regs->ip);
- else if (ghcb_fault)
- do_pf_error("Page-fault on GHCB page:", error_code, address, regs->ip);
-
- /*
- * Error code is sane - now identity map the 2M region around
- * the faulting address.
+ * Since all memory allocations are made explicit
+ * now, every page fault at this stage is an
+ * error and the error handler is there only
+ * for debug purposes.
*/
- kernel_add_identity_map(address, end, MAP_WRITE);
+ do_pf_error(msg, error_code, address, regs->ip);
}
--
2.39.2


2023-03-14 10:17:05

by Evgeniy Baskov

Subject: [PATCH v5 08/27] x86/boot: Map memory explicitly

Implicit mappings hide possible memory errors, e.g. allocations for
ACPI tables were not accounted for in the boot page table size.

Replace all implicit mappings made via the page fault handler with
explicit mappings.
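
The pattern applied below is to map a table header first, and then
extend the mapping once the actual table length is known (a sketch
distilled from the ACPI hunks in this patch):

	/* Make the header readable before touching any of its fields. */
	kernel_add_identity_map((unsigned long)header,
				(unsigned long)header + sizeof(*header), 0);

	/* header->length is now safe to read; map the whole table. */
	kernel_add_identity_map((unsigned long)header,
				(unsigned long)header + header->length, 0);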

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/acpi.c | 21 ++++++++++++++++++++-
arch/x86/boot/compressed/efi.c | 19 ++++++++++++++++++-
arch/x86/boot/compressed/kaslr.c | 2 ++
3 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/acpi.c b/arch/x86/boot/compressed/acpi.c
index 9caf89063e77..633ac56262ee 100644
--- a/arch/x86/boot/compressed/acpi.c
+++ b/arch/x86/boot/compressed/acpi.c
@@ -93,6 +93,8 @@ static u8 *scan_mem_for_rsdp(u8 *start, u32 length)

end = start + length;

+ kernel_add_identity_map((unsigned long)start, (unsigned long)end, 0);
+
/* Search from given start address for the requested length */
for (address = start; address < end; address += ACPI_RSDP_SCAN_STEP) {
/*
@@ -128,6 +130,9 @@ static acpi_physical_address bios_get_rsdp_addr(void)
unsigned long address;
u8 *rsdp;

+ kernel_add_identity_map((unsigned long)ACPI_EBDA_PTR_LOCATION,
+ (unsigned long)ACPI_EBDA_PTR_LOCATION + 2, 0);
+
/* Get the location of the Extended BIOS Data Area (EBDA) */
address = *(u16 *)ACPI_EBDA_PTR_LOCATION;
address <<= 4;
@@ -215,6 +220,9 @@ static unsigned long get_acpi_srat_table(void)
if (!rsdp)
return 0;

+ kernel_add_identity_map((unsigned long)rsdp,
+ (unsigned long)(rsdp + 1), 0);
+
/* Get ACPI root table from RSDP.*/
if (!(cmdline_find_option("acpi", arg, sizeof(arg)) == 4 &&
!strncmp(arg, "rsdt", 4)) &&
@@ -235,6 +243,9 @@ static unsigned long get_acpi_srat_table(void)
if (len < sizeof(struct acpi_table_header) + size)
return 0;

+ kernel_add_identity_map((unsigned long)header,
+ (unsigned long)header + len, 0);
+
num_entries = (len - sizeof(struct acpi_table_header)) / size;
entry = (u8 *)(root_table + sizeof(struct acpi_table_header));

@@ -247,8 +258,16 @@ static unsigned long get_acpi_srat_table(void)
if (acpi_table) {
header = (struct acpi_table_header *)acpi_table;

- if (ACPI_COMPARE_NAMESEG(header->signature, ACPI_SIG_SRAT))
+ kernel_add_identity_map(acpi_table,
+ acpi_table + sizeof(*header),
+ 0);
+
+ if (ACPI_COMPARE_NAMESEG(header->signature, ACPI_SIG_SRAT)) {
+ kernel_add_identity_map(acpi_table,
+ acpi_table + header->length,
+ 0);
return acpi_table;
+ }
}
entry += size;
}
diff --git a/arch/x86/boot/compressed/efi.c b/arch/x86/boot/compressed/efi.c
index 6edd034b0b30..ce70103fbbc0 100644
--- a/arch/x86/boot/compressed/efi.c
+++ b/arch/x86/boot/compressed/efi.c
@@ -57,10 +57,14 @@ enum efi_type efi_get_type(struct boot_params *bp)
*/
unsigned long efi_get_system_table(struct boot_params *bp)
{
- unsigned long sys_tbl_pa;
+ static unsigned long sys_tbl_pa __section(".data");
struct efi_info *ei;
+ unsigned long sys_tbl_size;
enum efi_type et;

+ if (sys_tbl_pa)
+ return sys_tbl_pa;
+
/* Get systab from boot params. */
ei = &bp->efi_info;
#ifdef CONFIG_X86_64
@@ -73,6 +77,13 @@ unsigned long efi_get_system_table(struct boot_params *bp)
return 0;
}

+ if (efi_get_type(bp) == EFI_TYPE_64)
+ sys_tbl_size = sizeof(efi_system_table_64_t);
+ else
+ sys_tbl_size = sizeof(efi_system_table_32_t);
+
+ kernel_add_identity_map(sys_tbl_pa, sys_tbl_pa + sys_tbl_size, 0);
+
return sys_tbl_pa;
}

@@ -92,6 +103,10 @@ static struct efi_setup_data *get_kexec_setup_data(struct boot_params *bp,

pa_data = bp->hdr.setup_data;
while (pa_data) {
+ unsigned long pa_data_end = pa_data + sizeof(struct setup_data)
+ + sizeof(struct efi_setup_data);
+ kernel_add_identity_map(pa_data, pa_data_end, 0);
+
data = (struct setup_data *)pa_data;
if (data->type == SETUP_EFI) {
esd = (struct efi_setup_data *)(pa_data + sizeof(struct setup_data));
@@ -160,6 +175,8 @@ int efi_get_conf_table(struct boot_params *bp, unsigned long *cfg_tbl_pa,
return -EINVAL;
}

+ kernel_add_identity_map(*cfg_tbl_pa, *cfg_tbl_pa + *cfg_tbl_len, 0);
+
return 0;
}

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 454757fbdfe5..69966481b82d 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -704,6 +704,8 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
pmap = (e->efi_memmap | ((__u64)e->efi_memmap_hi << 32));
#endif

+ kernel_add_identity_map(pmap, pmap + e->efi_memmap_size, 0);
+
nr_desc = e->efi_memmap_size / e->efi_memdesc_size;
for (i = 0; i < nr_desc; i++) {
md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);
--
2.39.2


2023-03-14 10:17:08

by Evgeniy Baskov

Subject: [PATCH v5 10/27] efi/libstub: Move helper function to related file

efi_adjust_memory_range_protection() can be useful outside x86-stub.c.

Move it to mem.c, where memory-related code resides, and make it
non-static.

Change its behavior to disallow making memory regions writable and
executable simultaneously.
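
Callers now pass the attributes to keep set, e.g. (illustrative values,
not taken from this patch):

	/* Data range: allow writing, but do not relax EFI_MEMORY_XP. */
	status = efi_adjust_memory_range_protection(addr, size, EFI_MEMORY_XP);

	/*
	 * Clearing both EFI_MEMORY_RO and EFI_MEMORY_XP now triggers the
	 * W^X warning added in the hunk below.
	 */
	status = efi_adjust_memory_range_protection(addr, size, 0);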

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
drivers/firmware/efi/libstub/efistub.h | 4 +
drivers/firmware/efi/libstub/mem.c | 99 +++++++++++++++++++++++++
drivers/firmware/efi/libstub/x86-stub.c | 66 ++---------------
3 files changed, 109 insertions(+), 60 deletions(-)

diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index 6bd3bb86d967..01ae165731a5 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -995,6 +995,10 @@ efi_status_t efi_relocate_kernel(unsigned long *image_addr,
unsigned long alignment,
unsigned long min_addr);

+efi_status_t efi_adjust_memory_range_protection(unsigned long start,
+ unsigned long size,
+ unsigned long attributes);
+
efi_status_t efi_parse_options(char const *cmdline);

void efi_parse_option_graphics(char *option);
diff --git a/drivers/firmware/efi/libstub/mem.c b/drivers/firmware/efi/libstub/mem.c
index 4f1fa302234d..134f17d078d1 100644
--- a/drivers/firmware/efi/libstub/mem.c
+++ b/drivers/firmware/efi/libstub/mem.c
@@ -128,3 +128,102 @@ void efi_free(unsigned long size, unsigned long addr)
nr_pages = round_up(size, EFI_ALLOC_ALIGN) / EFI_PAGE_SIZE;
efi_bs_call(free_pages, addr, nr_pages);
}
+
+/**
+ * efi_adjust_memory_range_protection() - change memory range protection attributes
+ * @start: memory range start address
+ * @size: memory range size
+ * @attributes: memory protection attributes to apply (a combination of
+ * EFI_MEMORY_RO and EFI_MEMORY_XP)
+ *
+ * The actual memory range for which the memory attributes are modified is
+ * the smallest range with start address and size aligned to EFI_PAGE_SIZE
+ * that includes [start, start + size].
+ *
+ * Return: status code
+ */
+efi_status_t efi_adjust_memory_range_protection(unsigned long start,
+ unsigned long size,
+ unsigned long attributes)
+{
+ efi_status_t status;
+ efi_gcd_memory_space_desc_t desc;
+ efi_physical_addr_t end, next;
+ efi_physical_addr_t rounded_start, rounded_end;
+ efi_physical_addr_t unprotect_start, unprotect_size;
+
+ if (efi_dxe_table == NULL)
+ return EFI_UNSUPPORTED;
+
+ /*
+ * This function should not be used to modify attributes
+ * other than writable/executable.
+ */
+
+ if ((attributes & ~(EFI_MEMORY_RO | EFI_MEMORY_XP)) != 0)
+ return EFI_INVALID_PARAMETER;
+
+ rounded_start = rounddown(start, EFI_PAGE_SIZE);
+ rounded_end = roundup(start + size, EFI_PAGE_SIZE);
+
+ /*
+ * Disallow simultaneously executable and writable memory
+ * to enforce the W^X policy if the direct extraction code is enabled.
+ */
+
+ if ((attributes & (EFI_MEMORY_RO | EFI_MEMORY_XP)) == 0) {
+ efi_warn("W^X violation at [%08lx,%08lx]\n",
+ (unsigned long)rounded_start,
+ (unsigned long)rounded_end);
+ }
+
+ for (end = start + size; start < end; start = next) {
+
+ status = efi_dxe_call(get_memory_space_descriptor,
+ start, &desc);
+
+ if (status != EFI_SUCCESS) {
+ efi_warn("Unable to get memory descriptor at %lx\n",
+ (unsigned long)start);
+ return status;
+ }
+
+ next = desc.base_address + desc.length;
+
+ /*
+ * Only system memory is suitable for trampoline/kernel image
+ * placement, so only this type of memory needs its attributes
+ * to be modified.
+ */
+
+ if (desc.gcd_memory_type != EfiGcdMemoryTypeSystemMemory) {
+ efi_warn("Attempted to change protection of special memory range\n");
+ return EFI_UNSUPPORTED;
+ }
+
+ /*
+ * Don't modify memory region attributes if they are already
+ * suitable, to lower the possibility of encountering firmware
+ * bugs. We do not apply XP/RO, we only remove them.
+ */
+ if ((desc.attributes & ~attributes) == 0)
+ continue;
+
+ desc.attributes &= attributes | ~(EFI_MEMORY_RO | EFI_MEMORY_XP);
+
+ unprotect_start = max(rounded_start, desc.base_address);
+ unprotect_size = min(rounded_end, next) - unprotect_start;
+
+ status = efi_dxe_call(set_memory_space_attributes,
+ unprotect_start, unprotect_size,
+ desc.attributes);
+
+ if (status != EFI_SUCCESS) {
+ efi_warn("Unable to unprotect memory range [%08lx,%08lx]: %lx\n",
+ (unsigned long)unprotect_start,
+ (unsigned long)(unprotect_start + unprotect_size),
+ status);
+ return status;
+ }
+ }
+
+ return EFI_SUCCESS;
+}
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index a0bfd31358ba..7fb1eff88a18 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -212,61 +212,6 @@ static void retrieve_apple_device_properties(struct boot_params *boot_params)
}
}

-static void
-adjust_memory_range_protection(unsigned long start, unsigned long size)
-{
- efi_status_t status;
- efi_gcd_memory_space_desc_t desc;
- unsigned long end, next;
- unsigned long rounded_start, rounded_end;
- unsigned long unprotect_start, unprotect_size;
-
- if (efi_dxe_table == NULL)
- return;
-
- rounded_start = rounddown(start, EFI_PAGE_SIZE);
- rounded_end = roundup(start + size, EFI_PAGE_SIZE);
-
- /*
- * Don't modify memory region attributes, they are
- * already suitable, to lower the possibility to
- * encounter firmware bugs.
- */
-
- for (end = start + size; start < end; start = next) {
-
- status = efi_dxe_call(get_memory_space_descriptor, start, &desc);
-
- if (status != EFI_SUCCESS)
- return;
-
- next = desc.base_address + desc.length;
-
- /*
- * Only system memory is suitable for trampoline/kernel image placement,
- * so only this type of memory needs its attributes to be modified.
- */
-
- if (desc.gcd_memory_type != EfiGcdMemoryTypeSystemMemory ||
- (desc.attributes & (EFI_MEMORY_RO | EFI_MEMORY_XP)) == 0)
- continue;
-
- unprotect_start = max(rounded_start, (unsigned long)desc.base_address);
- unprotect_size = min(rounded_end, next) - unprotect_start;
-
- status = efi_dxe_call(set_memory_space_attributes,
- unprotect_start, unprotect_size,
- EFI_MEMORY_WB);
-
- if (status != EFI_SUCCESS) {
- efi_warn("Unable to unprotect memory range [%08lx,%08lx]: %lx\n",
- unprotect_start,
- unprotect_start + unprotect_size,
- status);
- }
- }
-}
-
/*
* Trampoline takes 2 pages and can be loaded in first megabyte of memory
* with its end placed between 128k and 640k where BIOS might start.
@@ -290,12 +235,12 @@ setup_memory_protection(unsigned long image_base, unsigned long image_size)
* and relocated kernel image.
*/

- adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
- TRAMPOLINE_PLACEMENT_SIZE);
+ efi_adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
+ TRAMPOLINE_PLACEMENT_SIZE, 0);

#ifdef CONFIG_64BIT
if (image_base != (unsigned long)startup_32)
- adjust_memory_range_protection(image_base, image_size);
+ efi_adjust_memory_range_protection(image_base, image_size, 0);
#else
/*
* Clear protection flags on a whole range of possible
@@ -305,8 +250,9 @@ setup_memory_protection(unsigned long image_base, unsigned long image_size)
* need to remove possible protection on relocated image
* itself disregarding further relocations.
*/
- adjust_memory_range_protection(LOAD_PHYSICAL_ADDR,
- KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR);
+ efi_adjust_memory_range_protection(LOAD_PHYSICAL_ADDR,
+ KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR,
+ 0);
#endif
}

--
2.39.2


2023-03-14 10:17:11

by Evgeniy Baskov

Subject: [PATCH v5 11/27] x86/boot: Make console interface more abstract

To be able to extract the kernel from EFI, the console output functions
need to be replaceable by alternative implementations.

Make all of those functions pointers.
Move the serial console code to a separate file.
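
A later consumer can then install its own output routines, roughly
like this (the EFI-side helpers here are hypothetical):

	static void efi_putstr(const char *s)
	{
		/* e.g. forward to the EFI simple text output protocol */
	}

	static void efi_puthex(unsigned long value)
	{
		/* e.g. format value as hex and feed it to efi_putstr() */
	}

	/* during early EFI stub setup: */
	init_console_func(efi_putstr, efi_puthex);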

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/boot/compressed/misc.c | 109 +------------------------
arch/x86/boot/compressed/misc.h | 9 ++-
arch/x86/boot/compressed/putstr.c | 130 ++++++++++++++++++++++++++++++
4 files changed, 139 insertions(+), 111 deletions(-)
create mode 100644 arch/x86/boot/compressed/putstr.c

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 0c6e25279ec1..97e4a0e8fbba 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -93,7 +93,7 @@ $(obj)/misc.o: $(obj)/../voffset.h

vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/kernel_info.o $(obj)/head_$(BITS).o \
$(obj)/misc.o $(obj)/string.o $(obj)/cmdline.o $(obj)/error.o \
- $(obj)/piggy.o $(obj)/cpuflags.o
+ $(obj)/piggy.o $(obj)/cpuflags.o $(obj)/putstr.o

vmlinux-objs-$(CONFIG_EARLY_PRINTK) += $(obj)/early_serial_console.o
vmlinux-objs-$(CONFIG_RANDOMIZE_BASE) += $(obj)/kaslr.o
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index efecd8656414..76773a989364 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -53,13 +53,6 @@ struct port_io_ops pio_ops;
memptr free_mem_ptr;
memptr free_mem_end_ptr;

-static char *vidmem;
-static int vidport;
-
-/* These might be accessed before .bss is cleared, so use .data instead. */
-static int lines __section(".data");
-static int cols __section(".data");
-
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif
@@ -92,95 +85,6 @@ static int cols __section(".data");
* ../header.S.
*/

-static void scroll(void)
-{
- int i;
-
- memmove(vidmem, vidmem + cols * 2, (lines - 1) * cols * 2);
- for (i = (lines - 1) * cols * 2; i < lines * cols * 2; i += 2)
- vidmem[i] = ' ';
-}
-
-#define XMTRDY 0x20
-
-#define TXR 0 /* Transmit register (WRITE) */
-#define LSR 5 /* Line Status */
-static void serial_putchar(int ch)
-{
- unsigned timeout = 0xffff;
-
- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
- cpu_relax();
-
- outb(ch, early_serial_base + TXR);
-}
-
-void __putstr(const char *s)
-{
- int x, y, pos;
- char c;
-
- if (early_serial_base) {
- const char *str = s;
- while (*str) {
- if (*str == '\n')
- serial_putchar('\r');
- serial_putchar(*str++);
- }
- }
-
- if (lines == 0 || cols == 0)
- return;
-
- x = boot_params->screen_info.orig_x;
- y = boot_params->screen_info.orig_y;
-
- while ((c = *s++) != '\0') {
- if (c == '\n') {
- x = 0;
- if (++y >= lines) {
- scroll();
- y--;
- }
- } else {
- vidmem[(x + cols * y) * 2] = c;
- if (++x >= cols) {
- x = 0;
- if (++y >= lines) {
- scroll();
- y--;
- }
- }
- }
- }
-
- boot_params->screen_info.orig_x = x;
- boot_params->screen_info.orig_y = y;
-
- pos = (x + cols * y) * 2; /* Update cursor position */
- outb(14, vidport);
- outb(0xff & (pos >> 9), vidport+1);
- outb(15, vidport);
- outb(0xff & (pos >> 1), vidport+1);
-}
-
-void __puthex(unsigned long value)
-{
- char alpha[2] = "0";
- int bits;
-
- for (bits = sizeof(value) * 8 - 4; bits >= 0; bits -= 4) {
- unsigned long digit = (value >> bits) & 0xf;
-
- if (digit < 0xA)
- alpha[0] = '0' + digit;
- else
- alpha[0] = 'a' + (digit - 0xA);
-
- __putstr(alpha);
- }
-}
-
#ifdef CONFIG_X86_NEED_RELOCS
static void handle_relocations(void *output, unsigned long output_len,
unsigned long virt_addr)
@@ -409,17 +313,6 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,

sanitize_boot_params(boot_params);

- if (boot_params->screen_info.orig_video_mode == 7) {
- vidmem = (char *) 0xb0000;
- vidport = 0x3b4;
- } else {
- vidmem = (char *) 0xb8000;
- vidport = 0x3d4;
- }
-
- lines = boot_params->screen_info.orig_video_lines;
- cols = boot_params->screen_info.orig_video_cols;
-
init_default_io_ops();

/*
@@ -430,7 +323,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
*/
early_tdx_detect();

- console_init();
+ init_bare_console();

/*
* Save RSDP address for later use. Have this after console_init()
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index aea3621faf29..e20f60bfe91b 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -57,8 +57,8 @@ extern memptr free_mem_end_ptr;
void *malloc(int size);
void free(void *where);
extern struct boot_params *boot_params;
-void __putstr(const char *s);
-void __puthex(unsigned long value);
+extern void (*__putstr)(const char *s);
+extern void (*__puthex)(unsigned long value);
#define error_putstr(__x) __putstr(__x)
#define error_puthex(__x) __puthex(__x)

@@ -128,6 +128,11 @@ static inline void console_init(void)
{ }
#endif

+/* putstr.c */
+void init_bare_console(void);
+void init_console_func(void (*putstr_)(const char *),
+ void (*puthex_)(unsigned long));
+
#ifdef CONFIG_AMD_MEM_ENCRYPT
void sev_enable(struct boot_params *bp);
void snp_check_features(void);
diff --git a/arch/x86/boot/compressed/putstr.c b/arch/x86/boot/compressed/putstr.c
new file mode 100644
index 000000000000..44a4c3dacec5
--- /dev/null
+++ b/arch/x86/boot/compressed/putstr.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "misc.h"
+
+/* These might be accessed before .bss is cleared, so use .data instead. */
+static char *vidmem __section(".data");
+static int vidport __section(".data");
+static int lines __section(".data");
+static int cols __section(".data");
+
+void (*__putstr)(const char *s);
+void (*__puthex)(unsigned long value);
+
+static void putstr(const char *s);
+static void puthex(unsigned long value);
+
+void init_console_func(void (*putstr_)(const char *),
+ void (*puthex_)(unsigned long))
+{
+ __putstr = putstr_;
+ __puthex = puthex_;
+}
+
+void init_bare_console(void)
+{
+ init_console_func(putstr, puthex);
+
+ if (boot_params->screen_info.orig_video_mode == 7) {
+ vidmem = (char *) 0xb0000;
+ vidport = 0x3b4;
+ } else {
+ vidmem = (char *) 0xb8000;
+ vidport = 0x3d4;
+ }
+
+ lines = boot_params->screen_info.orig_video_lines;
+ cols = boot_params->screen_info.orig_video_cols;
+
+ console_init();
+}
+
+static void scroll(void)
+{
+ int i;
+
+ memmove(vidmem, vidmem + cols * 2, (lines - 1) * cols * 2);
+ for (i = (lines - 1) * cols * 2; i < lines * cols * 2; i += 2)
+ vidmem[i] = ' ';
+}
+
+#define XMTRDY 0x20
+
+#define TXR 0 /* Transmit register (WRITE) */
+#define LSR 5 /* Line Status */
+
+static void serial_putchar(int ch)
+{
+ unsigned int timeout = 0xffff;
+
+ while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ cpu_relax();
+
+ outb(ch, early_serial_base + TXR);
+}
+
+static void putstr(const char *s)
+{
+ int x, y, pos;
+ char c;
+
+ if (early_serial_base) {
+ const char *str = s;
+
+ while (*str) {
+ if (*str == '\n')
+ serial_putchar('\r');
+ serial_putchar(*str++);
+ }
+ }
+
+ if (lines == 0 || cols == 0)
+ return;
+
+ x = boot_params->screen_info.orig_x;
+ y = boot_params->screen_info.orig_y;
+
+ while ((c = *s++) != '\0') {
+ if (c == '\n') {
+ x = 0;
+ if (++y >= lines) {
+ scroll();
+ y--;
+ }
+ } else {
+ vidmem[(x + cols * y) * 2] = c;
+ if (++x >= cols) {
+ x = 0;
+ if (++y >= lines) {
+ scroll();
+ y--;
+ }
+ }
+ }
+ }
+
+ boot_params->screen_info.orig_x = x;
+ boot_params->screen_info.orig_y = y;
+
+ pos = (x + cols * y) * 2; /* Update cursor position */
+ outb(14, vidport);
+ outb(0xff & (pos >> 9), vidport+1);
+ outb(15, vidport);
+ outb(0xff & (pos >> 1), vidport+1);
+}
+
+static void puthex(unsigned long value)
+{
+ char alpha[2] = "0";
+ int bits;
+
+ for (bits = sizeof(value) * 8 - 4; bits >= 0; bits -= 4) {
+ unsigned long digit = (value >> bits) & 0xf;
+
+ if (digit < 0xA)
+ alpha[0] = '0' + digit;
+ else
+ alpha[0] = 'a' + (digit - 0xA);
+
+ putstr(alpha);
+ }
+}
--
2.39.2


2023-03-14 10:17:53

by Evgeniy Baskov

Subject: [PATCH v5 13/27] x86/boot: Split trampoline and pt init code

When booting via libstub, the trampoline allocation is performed
separately, so it needs to be skipped here.

Split the trampoline allocation and initialization code into two
functions so that they can be invoked separately.
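
After the split, the EFISTUB path can do its own placement and call
only the shared page table setup, roughly as follows (the allocation
helper is hypothetical):

	/*
	 * EFISTUB side: memory for the trampoline is allocated via EFI
	 * boot services instead of find_trampoline_placement().
	 */
	trampoline_32bit = (unsigned long *)efi_allocate_trampoline();

	/* Shared part: build the page tables in the trampoline buffer. */
	l5_required = trampoline_pgtable_init(boot_params);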

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/pgtable_64.c | 69 ++++++++++++++++-----------
1 file changed, 42 insertions(+), 27 deletions(-)

diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index c7cf5a1059a8..d71a62b7cf06 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -106,12 +106,8 @@ static unsigned long find_trampoline_placement(void)
return bios_start - TRAMPOLINE_32BIT_SIZE;
}

-struct paging_config paging_prepare(void *rmode)
+bool trampoline_pgtable_init(struct boot_params *boot_params)
{
- struct paging_config paging_config = {};
-
- /* Initialize boot_params. Required for cmdline_find_option_bool(). */
- boot_params = rmode;

/*
* Check if LA57 is desired and supported.
@@ -125,26 +121,10 @@ struct paging_config paging_prepare(void *rmode)
*
* That's substitute for boot_cpu_has() in early boot code.
*/
- if (IS_ENABLED(CONFIG_X86_5LEVEL) &&
- !cmdline_find_option_bool("no5lvl") &&
- native_cpuid_eax(0) >= 7 &&
- (native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
- paging_config.l5_required = 1;
- }
-
- paging_config.trampoline_start = find_trampoline_placement();
-
- trampoline_32bit = (unsigned long *)paging_config.trampoline_start;
-
- /* Preserve trampoline memory */
- memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
-
- /* Clear trampoline memory first */
- memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
-
- /* Copy trampoline code in place */
- memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
- &trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
+ bool l5_required = IS_ENABLED(CONFIG_X86_5LEVEL) &&
+ !cmdline_find_option_bool("no5lvl") &&
+ native_cpuid_eax(0) >= 7 &&
+ (native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)));

/*
* The code below prepares page table in trampoline memory.
@@ -160,10 +140,10 @@ struct paging_config paging_prepare(void *rmode)
* We are not going to use the page table in trampoline memory if we
* are already in the desired paging mode.
*/
- if (paging_config.l5_required == !!(native_read_cr4() & X86_CR4_LA57))
+ if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
goto out;

- if (paging_config.l5_required) {
+ if (l5_required) {
/*
* For 4- to 5-level paging transition, set up current CR3 as
* the first and the only entry in a new top-level page table.
@@ -185,6 +165,41 @@ struct paging_config paging_prepare(void *rmode)
(void *)src, PAGE_SIZE);
}

+out:
+ return l5_required;
+}
+
+struct paging_config paging_prepare(void *rmode)
+{
+ struct paging_config paging_config = {};
+
+ /* Initialize boot_params. Required for cmdline_find_option_bool(). */
+ boot_params = rmode;
+
+ /*
+ * We only need to find the trampoline placement if we have
+ * not already done it from libstub.
+ */
+
+ paging_config.trampoline_start = find_trampoline_placement();
+ trampoline_32bit = (unsigned long *)paging_config.trampoline_start;
+
+ /*
+ * Preserve trampoline memory. When the trampoline is located in memory
+ * owned by us, i.e. allocated in the EFISTUB, we don't care about the
+ * previous contents of this memory, so copying can also be skipped.
+ */
+ memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
+
+ /* Clear trampoline memory first */
+ memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
+
+ /* Copy trampoline code in place */
+ memcpy(trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
+ &trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
+
+ paging_config.l5_required = trampoline_pgtable_init(boot_params);
+
out:
return paging_config;
}
--
2.39.2


2023-03-14 10:17:56

by Evgeniy Baskov

Subject: [PATCH v5 12/27] x86/boot: Make kernel_add_identity_map() a pointer

Convert kernel_add_identity_map() into a function pointer so that
alternative implementations of this function can be provided. This is
required to enable calling the code that uses it from the EFI environment.
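
With the pointer in place, an alternative environment can install its
own mapper before running the shared code, e.g. (the EFI-side
implementation here is hypothetical):

	static unsigned long efi_map_range(unsigned long start,
					   unsigned long end,
					   unsigned int flags)
	{
		/* back the request with EFI memory allocation/attributes */
		return start;
	}

	kernel_add_identity_map = efi_map_range;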

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>

Warnings on the previous version were
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
---
arch/x86/boot/compressed/ident_map_64.c | 7 ++++---
arch/x86/boot/compressed/misc.c | 24 ++++++++++++++++++++++++
arch/x86/boot/compressed/misc.h | 15 +++------------
3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 378f99b1d7e8..1995560d3b43 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -92,9 +92,9 @@ bool has_nx; /* set in head_64.S */
/*
* Adds the specified range to the identity mappings.
*/
-unsigned long kernel_add_identity_map(unsigned long start,
- unsigned long end,
- unsigned int flags)
+static unsigned long kernel_add_identity_map_impl(unsigned long start,
+ unsigned long end,
+ unsigned int flags)
{
int ret;

@@ -143,6 +143,7 @@ void initialize_identity_maps(void *rmode)
struct setup_data *sd;

boot_params = rmode;
+ kernel_add_identity_map = kernel_add_identity_map_impl;

/* Exclude the encryption mask from __PHYSICAL_MASK */
physical_mask &= ~sme_me_mask;
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 76773a989364..74f028cf2dfd 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -277,6 +277,22 @@ static size_t parse_elf(void *output, unsigned long output_len,
return ehdr.e_entry - LOAD_PHYSICAL_ADDR;
}

+/*
+ * This points to the actual implementation of the mapping function
+ * for the current environment: either an EFI API wrapper, our own
+ * implementation, or the dummy implementation below.
+ */
+unsigned long (*kernel_add_identity_map)(unsigned long start,
+ unsigned long end,
+ unsigned int flags);
+
+static unsigned long kernel_add_identity_map_dummy(unsigned long start,
+ unsigned long end,
+ unsigned int flags)
+{
+ return start;
+}
+
/*
* The compressed kernel image (ZO), has been moved so that its position
* is against the end of the buffer used to hold the uncompressed kernel
@@ -315,6 +331,14 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,

init_default_io_ops();

+ /*
+ * On 64-bit this pointer is set during page table initialization,
+ * but on 32-bit it remains uninitialized, since paging is disabled.
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ kernel_add_identity_map = kernel_add_identity_map_dummy;
+
+
/*
* Detect TDX guest environment.
*
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index e20f60bfe91b..fe201b45b038 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -182,18 +182,9 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#ifdef CONFIG_X86_5LEVEL
extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
#endif
-#ifdef CONFIG_X86_64
-extern unsigned long kernel_add_identity_map(unsigned long start,
- unsigned long end,
- unsigned int flags);
-#else
-static inline unsigned long kernel_add_identity_map(unsigned long start,
- unsigned long end,
- unsigned int flags)
-{
- return start;
-}
-#endif
+extern unsigned long (*kernel_add_identity_map)(unsigned long start,
+ unsigned long end,
+ unsigned int flags);
/* Used by PAGE_KERN* macros: */
extern pteval_t __default_kernel_pte_mask;

--
2.39.2


2023-03-14 10:20:25

by Evgeniy Baskov

Subject: [PATCH v5 16/27] x86/boot: Reduce lower limit of physical KASLR

Set the lower limit of physical KASLR to 64M.

Previously it was set to 512M when the kernel was loaded higher than
that. This prevented physical KASLR from being performed on x86_32,
where the upper limit is also set to 512M. The limit is fairly
arbitrary; the most important thing is to keep it above the ISA hole,
i.e. higher than 16M.

This was not that important before, but now the kernel is not relocated
to a lower address when booting via EFI, which exposes the KASLR
failures.

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/kaslr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index 69966481b82d..806df3912396 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -850,10 +850,10 @@ void choose_random_location(unsigned long input,

/*
* Low end of the randomization range should be the
- * smaller of 512M or the initial kernel image
+ * smaller of 64M or the initial kernel image
* location:
*/
- min_addr = min(*output, 512UL << 20);
+ min_addr = min(*output, 64UL << 20);
/* Make sure minimum is aligned. */
min_addr = ALIGN(min_addr, CONFIG_PHYSICAL_ALIGN);

--
2.39.2


2023-03-14 10:20:29

by Evgeniy Baskov

Subject: [PATCH v5 17/27] x86: decompressor: Remove the 'bugger off' message

From: Ard Biesheuvel <[email protected]>

Ancient (pre-2003) x86 kernels could boot from a floppy disk straight from
the BIOS, using a small real mode boot stub at the start of the image
where the BIOS would expect the boot record (or boot block) to appear.

Due to its limitations (kernel size < 1 MiB, no support for IDE, USB or
El Torito floppy emulation), this support was dropped, and a Linux aware
bootloader is now always required to boot the kernel.

To smoothen this transition, the boot stub was not removed entirely, but
replaced with one that just prints an error message telling you to
install a bootloader.

As it is unlikely that anyone doing direct floppy boot with such an
ancient kernel is going to upgrade to v6.4+ and expect that this boot
method still works, printing this message is kind of pointless, and so
we should be able to remove the logic that emits it.

Let's free up this space so we can use it to expand the PE header in a
subsequent patch.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/header.S | 49 ------------------------------------------
arch/x86/boot/setup.ld | 7 +++---
2 files changed, 4 insertions(+), 52 deletions(-)

diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 9338c68e7413..d4e16edf1198 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -38,64 +38,15 @@ SYSSEG = 0x1000 /* historical load address >> 4 */

.code16
.section ".bstext", "ax"
-
- .global bootsect_start
-bootsect_start:
#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.word MZ_MAGIC
-#endif
-
- # Normalize the start address
- ljmp $BOOTSEG, $start2
-
-start2:
- movw %cs, %ax
- movw %ax, %ds
- movw %ax, %es
- movw %ax, %ss
- xorw %sp, %sp
- sti
- cld
-
- movw $bugger_off_msg, %si
-
-msg_loop:
- lodsb
- andb %al, %al
- jz bs_die
- movb $0xe, %ah
- movw $7, %bx
- int $0x10
- jmp msg_loop
-
-bs_die:
- # Allow the user to press a key, then reboot
- xorw %ax, %ax
- int $0x16
- int $0x19
-
- # int 0x19 should never return. In case it does anyway,
- # invoke the BIOS reset code...
- ljmp $0xf000,$0xfff0
-
-#ifdef CONFIG_EFI_STUB
.org 0x38
#
# Offset to the PE header.
#
.long LINUX_PE_MAGIC
.long pe_header
-#endif /* CONFIG_EFI_STUB */
-
- .section ".bsdata", "a"
-bugger_off_msg:
- .ascii "Use a boot loader.\r\n"
- .ascii "\n"
- .ascii "Remove disk and press any key to reboot...\r\n"
- .byte 0
-
-#ifdef CONFIG_EFI_STUB
pe_header:
.long PE_MAGIC

diff --git a/arch/x86/boot/setup.ld b/arch/x86/boot/setup.ld
index 49546c247ae2..31419b7c9c3f 100644
--- a/arch/x86/boot/setup.ld
+++ b/arch/x86/boot/setup.ld
@@ -10,10 +10,11 @@ ENTRY(_start)
SECTIONS
{
. = 0;
- .bstext : { *(.bstext) }
- .bsdata : { *(.bsdata) }
+ .bstext : {
+ *(.bstext)
+ . = 495;
+ } =0xff

- . = 495;
.header : { *(.header) }
.entrytext : { *(.entrytext) }
.inittext : { *(.inittext) }
--
2.39.2


2023-03-14 10:20:32

by Evgeniy Baskov

Subject: [PATCH v5 14/27] x86/boot: Add EFI kernel extraction interface

To enable extraction of the kernel image directly from the EFI stub
code, the extraction code needs a separate interface that avoids part
of the low-level initialization logic, e.g. the serial port setup.

Add a kernel extraction function callable from libstub as part of the
preparation for extracting the kernel directly from the EFI environment.
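
Usage from libstub is expected to look roughly like this (the callback
implementations are placeholders; the interface itself is defined in
extract.h below):

	struct efi_extract_callbacks cb = {
		.putstr    = efi_putstr,	/* hypothetical, cf. patch 11 */
		.puthex    = efi_puthex,
		.map_range = efi_map_range,	/* hypothetical, cf. patch 12 */
	};

	void *entry = efi_extract_kernel(boot_params, &cb, input_data,
					 input_len, output_len);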

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/head_32.S | 3 +-
arch/x86/boot/compressed/head_64.S | 2 +-
arch/x86/boot/compressed/misc.c | 100 +++++++++++++++++---------
arch/x86/boot/compressed/misc.h | 1 +
arch/x86/include/asm/shared/extract.h | 26 +++++++
5 files changed, 96 insertions(+), 36 deletions(-)
create mode 100644 arch/x86/include/asm/shared/extract.h

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 987ae727cf9f..15545781ea48 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -213,8 +213,7 @@ SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
*/
.bss
.balign 4
-boot_heap:
- .fill BOOT_HEAP_SIZE, 1, 0
+SYM_DATA(boot_heap, .fill BOOT_HEAP_SIZE, 1, 0)
boot_stack:
.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 7774daf90a19..8af7835de3b1 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -747,7 +747,7 @@ SYM_DATA_END_LABEL(boot_idt, SYM_L_GLOBAL, boot_idt_end)
*/
.bss
.balign 4
-SYM_DATA_LOCAL(boot_heap, .fill BOOT_HEAP_SIZE, 1, 0)
+SYM_DATA(boot_heap, .fill BOOT_HEAP_SIZE, 1, 0)

SYM_DATA_START_LOCAL(boot_stack)
.fill BOOT_STACK_SIZE, 1, 0
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 74f028cf2dfd..925774d0288f 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -310,11 +310,11 @@ static unsigned long kernel_add_identity_map_dummy(unsigned long start,
* |-------uncompressed kernel image---------|
*
*/
-asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
- unsigned char *input_data,
- unsigned long input_len,
- unsigned char *output,
- unsigned long output_len)
+static void *do_extract_kernel(void *rmode,
+ unsigned char *input_data,
+ unsigned long input_len,
+ unsigned char *output,
+ unsigned long output_len)
{
const unsigned long kernel_total_size = VO__end - VO__text;
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
@@ -329,26 +329,6 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,

sanitize_boot_params(boot_params);

- init_default_io_ops();
-
- /*
- * On 64-bit this pointer is set during page table initialization,
- * but on 32-bit it remains uninitialized, since paging is disabled.
- */
- if (IS_ENABLED(CONFIG_X86_32))
- kernel_add_identity_map = kernel_add_identity_map_dummy;
-
-
- /*
- * Detect TDX guest environment.
- *
- * It has to be done before console_init() in order to use
- * paravirtualized port I/O operations if needed.
- */
- early_tdx_detect();
-
- init_bare_console();
-
/*
* Save RSDP address for later use. Have this after console_init()
* so that early debugging output from the RSDP parsing code can be
@@ -356,11 +336,6 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
*/
boot_params->acpi_rsdp_addr = get_rsdp_addr();

- debug_putstr("early console in extract_kernel\n");
-
- free_mem_ptr = heap; /* Heap */
- free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
-
/*
* The memory hole needed for the kernel is the larger of either
* the entire decompressed kernel plus relocation table, or the
@@ -418,12 +393,12 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
if (virt_addr & (MIN_KERNEL_ALIGN - 1))
error("Destination virtual address inappropriately aligned");
#ifdef CONFIG_X86_64
- if (heap > 0x3fffffffffffUL)
+ if (phys_addr > 0x3fffffffffffUL)
error("Destination address too large");
if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE)
error("Destination virtual address is beyond the kernel mapping area");
#else
- if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
+ if (phys_addr > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
error("Destination address too large");
#endif
#ifndef CONFIG_RELOCATABLE
@@ -440,12 +415,71 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
debug_puthex(entry_offset);
debug_putstr(").\n");

+ return output + entry_offset;
+}
+
+asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
+ unsigned char *input_data,
+ unsigned long input_len,
+ unsigned char *output,
+ unsigned long output_len)
+{
+ void *entry;
+
+ init_default_io_ops();
+
+ /*
+ * On 64-bit this pointer is set during page table initialization,
+ * but on 32-bit it remains uninitialized, since paging is disabled.
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ kernel_add_identity_map = kernel_add_identity_map_dummy;
+
+ /*
+ * Detect TDX guest environment.
+ *
+ * It has to be done before console_init() in order to use
+ * paravirtualized port I/O operations if needed.
+ */
+ early_tdx_detect();
+
+ init_bare_console();
+
+ debug_putstr("early console in extract_kernel\n");
+
+ free_mem_ptr = heap; /* Heap */
+ free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
+
+ entry = do_extract_kernel(rmode, input_data,
+ input_len, output, output_len);
+
/* Disable exception handling before booting the kernel */
cleanup_exception_handling();

- return output + entry_offset;
+ return entry;
}

+void *efi_extract_kernel(struct boot_params *rmode,
+ struct efi_extract_callbacks *cb,
+ unsigned char *input_data,
+ unsigned long input_len,
+ unsigned long output_len)
+{
+ extern char boot_heap[BOOT_HEAP_SIZE];
+
+ free_mem_ptr = (unsigned long)boot_heap; /* Heap */
+ free_mem_end_ptr = (unsigned long)boot_heap + BOOT_HEAP_SIZE;
+
+ init_console_func(cb->putstr, cb->puthex);
+ kernel_add_identity_map = cb->map_range;
+
+ return do_extract_kernel(rmode, input_data,
+ input_len, (void *)LOAD_PHYSICAL_ADDR, output_len);
+}
+
void fortify_panic(const char *name)
{
error("detected buffer overflow");
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index fe201b45b038..6c67152c862d 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -26,6 +26,7 @@
#include <asm/boot.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
+#include <asm/shared/extract.h>

#include "tdx.h"

diff --git a/arch/x86/include/asm/shared/extract.h b/arch/x86/include/asm/shared/extract.h
new file mode 100644
index 000000000000..46bf56348a86
--- /dev/null
+++ b/arch/x86/include/asm/shared/extract.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_SHARED_EXTRACT_H
+#define ASM_SHARED_EXTRACT_H
+
+#include <asm/bootparam.h>
+
+#define MAP_WRITE 0x02 /* Writable memory */
+#define MAP_EXEC 0x04 /* Executable memory */
+#define MAP_ALLOC 0x10 /* Range needs to be allocated */
+#define MAP_PROTECT 0x20 /* Set exact memory attributes for memory range */
+
+struct efi_extract_callbacks {
+ void (*putstr)(const char *msg);
+ void (*puthex)(unsigned long x);
+ unsigned long (*map_range)(unsigned long start,
+ unsigned long end,
+ unsigned int flags);
+};
+
+void *efi_extract_kernel(struct boot_params *rmode,
+ struct efi_extract_callbacks *cb,
+ unsigned char *input_data,
+ unsigned long input_len,
+ unsigned long output_len);
+
+#endif /* ASM_SHARED_EXTRACT_H */
--
2.39.2


2023-03-14 10:20:53

by Evgeniy Baskov

Subject: [PATCH v5 19/27] x86/build: Cleanup tools/build.c

Use a newer C standard. Since the kernel now requires a C99 compiler,
we can make use of the new features to make the code more readable.

Also use mmap() for reading files to make things simpler.

Replace most magic numbers with defines and use PE structure
definitions instead of raw offsets.

There should be no functional changes. This is done in preparation for the
following patches that make the generated PE header more spec-compliant.

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/tools/build.c | 348 +++++++++++++++++++++++-------------
1 file changed, 219 insertions(+), 129 deletions(-)

diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c
index bd247692b701..73e88d30ebce 100644
--- a/arch/x86/boot/tools/build.c
+++ b/arch/x86/boot/tools/build.c
@@ -35,11 +35,14 @@
#include <fcntl.h>
#include <sys/mman.h>
#include <tools/le_byteshift.h>
+#include <linux/pe.h>

typedef unsigned char u8;
typedef unsigned short u16;
typedef unsigned int u32;

+#define round_up(x, n) (((x) + (n) - 1) & ~((n) - 1))
+
#define DEFAULT_MAJOR_ROOT 0
#define DEFAULT_MINOR_ROOT 0
#define DEFAULT_ROOT_DEV (DEFAULT_MAJOR_ROOT << 8 | DEFAULT_MINOR_ROOT)
@@ -48,8 +51,13 @@ typedef unsigned int u32;
#define SETUP_SECT_MIN 5
#define SETUP_SECT_MAX 64

+#define PARAGRAPH_SIZE 16
+#define SECTOR_SIZE 512
+#define FILE_ALIGNMENT 512
+#define SECTION_ALIGNMENT 4096
+
/* This must be large enough to hold the entire setup */
-u8 buf[SETUP_SECT_MAX*512];
+u8 buf[SETUP_SECT_MAX*SECTOR_SIZE];

#define PECOFF_RELOC_RESERVE 0x20

@@ -59,6 +67,40 @@ u8 buf[SETUP_SECT_MAX*512];
#define PECOFF_COMPAT_RESERVE 0x0
#endif

+#define RELOC_SECTION_SIZE 10
+
+/* PE header has different format depending on the architecture */
+#ifdef CONFIG_X86_64
+typedef struct pe32plus_opt_hdr pe_opt_hdr;
+#else
+typedef struct pe32_opt_hdr pe_opt_hdr;
+#endif
+
+static inline struct pe_hdr *get_pe_header(u8 *buf)
+{
+ u32 pe_offset = get_unaligned_le32(buf+MZ_HEADER_PEADDR_OFFSET);
+ return (struct pe_hdr *)(buf + pe_offset);
+}
+
+static inline pe_opt_hdr *get_pe_opt_header(u8 *buf)
+{
+ return (pe_opt_hdr *)(get_pe_header(buf) + 1);
+}
+
+static inline struct section_header *get_sections(u8 *buf)
+{
+ pe_opt_hdr *hdr = get_pe_opt_header(buf);
+ u32 n_data_dirs = get_unaligned_le32(&hdr->data_dirs);
+ u8 *sections = (u8 *)(hdr + 1) + n_data_dirs*sizeof(struct data_dirent);
+ return (struct section_header *)sections;
+}
+
+static inline struct data_directory *get_data_dirs(u8 *buf)
+{
+ pe_opt_hdr *hdr = get_pe_opt_header(buf);
+ return (struct data_directory *)(hdr + 1);
+}
+
static unsigned long efi32_stub_entry;
static unsigned long efi64_stub_entry;
static unsigned long efi_pe_entry;
@@ -152,40 +194,86 @@ static void usage(void)
die("Usage: build setup system zoffset.h image");
}

+static void *map_file(const char *path, size_t *psize)
+{
+ struct stat statbuf;
+ size_t size;
+ void *addr;
+ int fd;
+
+ fd = open(path, O_RDONLY);
+ if (fd < 0)
+ die("Unable to open `%s': %m", path);
+ if (fstat(fd, &statbuf))
+ die("Unable to stat `%s': %m", path);
+
+ size = statbuf.st_size;
+ /*
+ * Map one byte more, to allow adding a null terminator
+ * for text files.
+ */
+ addr = mmap(NULL, size + 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+ if (addr == MAP_FAILED)
+ die("Unable to mmap '%s': %m", path);
+
+ close(fd);
+
+ *psize = size;
+ return addr;
+}
+
+static void unmap_file(void *addr, size_t size)
+{
+ munmap(addr, size + 1);
+}
+
+static void *map_output_file(const char *path, size_t size)
+{
+ void *addr;
+ int fd;
+
+ fd = open(path, O_RDWR | O_CREAT, 0660);
+ if (fd < 0)
+ die("Unable to create `%s': %m", path);
+
+ if (ftruncate(fd, size))
+ die("Unable to resize `%s': %m", path);
+
+ addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED)
+ die("Unable to mmap '%s': %m", path);
+
+ return addr;
+}
+
#ifdef CONFIG_EFI_STUB

static void update_pecoff_section_header_fields(char *section_name, u32 vma, u32 size, u32 datasz, u32 offset)
{
- unsigned int pe_header;
unsigned short num_sections;
- u8 *section;
+ struct section_header *section;

- pe_header = get_unaligned_le32(&buf[0x3c]);
- num_sections = get_unaligned_le16(&buf[pe_header + 6]);
-
-#ifdef CONFIG_X86_32
- section = &buf[pe_header + 0xa8];
-#else
- section = &buf[pe_header + 0xb8];
-#endif
+ struct pe_hdr *hdr = get_pe_header(buf);
+ num_sections = get_unaligned_le16(&hdr->sections);
+ section = get_sections(buf);

while (num_sections > 0) {
- if (strncmp((char*)section, section_name, 8) == 0) {
+ if (strncmp(section->name, section_name, 8) == 0) {
/* section header size field */
- put_unaligned_le32(size, section + 0x8);
+ put_unaligned_le32(size, &section->virtual_size);

/* section header vma field */
- put_unaligned_le32(vma, section + 0xc);
+ put_unaligned_le32(vma, &section->virtual_address);

/* section header 'size of initialised data' field */
- put_unaligned_le32(datasz, section + 0x10);
+ put_unaligned_le32(datasz, &section->raw_data_size);

/* section header 'file offset' field */
- put_unaligned_le32(offset, section + 0x14);
+ put_unaligned_le32(offset, &section->data_addr);

break;
}
- section += 0x28;
+ section++;
num_sections--;
}
}
@@ -197,7 +285,7 @@ static void update_pecoff_section_header(char *section_name, u32 offset, u32 siz

static void update_pecoff_setup_and_reloc(unsigned int size)
{
- u32 setup_offset = 0x200;
+ u32 setup_offset = SECTOR_SIZE;
u32 reloc_offset = size - PECOFF_RELOC_RESERVE - PECOFF_COMPAT_RESERVE;
#ifdef CONFIG_EFI_MIXED
u32 compat_offset = reloc_offset + PECOFF_RELOC_RESERVE;
@@ -211,8 +299,8 @@ static void update_pecoff_setup_and_reloc(unsigned int size)
* Modify .reloc section contents with a single entry. The
* relocation is applied to offset 10 of the relocation section.
*/
- put_unaligned_le32(reloc_offset + 10, &buf[reloc_offset]);
- put_unaligned_le32(10, &buf[reloc_offset + 4]);
+ put_unaligned_le32(reloc_offset + RELOC_SECTION_SIZE, &buf[reloc_offset]);
+ put_unaligned_le32(RELOC_SECTION_SIZE, &buf[reloc_offset + 4]);

#ifdef CONFIG_EFI_MIXED
update_pecoff_section_header(".compat", compat_offset, PECOFF_COMPAT_RESERVE);
@@ -224,19 +312,17 @@ static void update_pecoff_setup_and_reloc(unsigned int size)
*/
buf[compat_offset] = 0x1;
buf[compat_offset + 1] = 0x8;
- put_unaligned_le16(0x14c, &buf[compat_offset + 2]);
+ put_unaligned_le16(IMAGE_FILE_MACHINE_I386, &buf[compat_offset + 2]);
put_unaligned_le32(efi32_pe_entry + size, &buf[compat_offset + 4]);
#endif
}

-static void update_pecoff_text(unsigned int text_start, unsigned int file_sz,
+static unsigned int update_pecoff_sections(unsigned int text_start, unsigned int text_sz,
unsigned int init_sz)
{
- unsigned int pe_header;
- unsigned int text_sz = file_sz - text_start;
+ unsigned int file_sz = text_start + text_sz;
unsigned int bss_sz = init_sz - file_sz;
-
- pe_header = get_unaligned_le32(&buf[0x3c]);
+ pe_opt_hdr *hdr = get_pe_opt_header(buf);

/*
* The PE/COFF loader may load the image at an address which is
@@ -254,28 +340,30 @@ static void update_pecoff_text(unsigned int text_start, unsigned int file_sz,
* Size of code: Subtract the size of the first sector (512 bytes)
* which includes the header.
*/
- put_unaligned_le32(file_sz - 512 + bss_sz, &buf[pe_header + 0x1c]);
+ put_unaligned_le32(file_sz - SECTOR_SIZE + bss_sz, &hdr->text_size);

/* Size of image */
- put_unaligned_le32(init_sz, &buf[pe_header + 0x50]);
+ put_unaligned_le32(init_sz, &hdr->image_size);

/*
* Address of entry point for PE/COFF executable
*/
- put_unaligned_le32(text_start + efi_pe_entry, &buf[pe_header + 0x28]);
+ put_unaligned_le32(text_start + efi_pe_entry, &hdr->entry_point);

update_pecoff_section_header_fields(".text", text_start, text_sz + bss_sz,
text_sz, text_start);
+
+ return text_start + file_sz;
}

static int reserve_pecoff_reloc_section(int c)
{
- /* Reserve 0x20 bytes for .reloc section */
+ /* Reserve space for .reloc section */
memset(buf+c, 0, PECOFF_RELOC_RESERVE);
return PECOFF_RELOC_RESERVE;
}

-static void efi_stub_defaults(void)
+static void efi_stub_update_defaults(void)
{
/* Defaults for old kernel */
#ifdef CONFIG_X86_32
@@ -298,7 +386,7 @@ static void efi_stub_entry_update(void)

#ifdef CONFIG_EFI_MIXED
if (efi32_stub_entry != addr)
- die("32-bit and 64-bit EFI entry points do not match\n");
+ die("32-bit and 64-bit EFI entry points do not match");
#endif
#endif
put_unaligned_le32(addr, &buf[0x264]);
@@ -310,7 +398,7 @@ static inline void update_pecoff_setup_and_reloc(unsigned int size) {}
static inline void update_pecoff_text(unsigned int text_start,
unsigned int file_sz,
unsigned int init_sz) {}
-static inline void efi_stub_defaults(void) {}
+static inline void efi_stub_update_defaults(void) {}
static inline void efi_stub_entry_update(void) {}

static inline int reserve_pecoff_reloc_section(int c)
@@ -321,7 +409,7 @@ static inline int reserve_pecoff_reloc_section(int c)

static int reserve_pecoff_compat_section(int c)
{
- /* Reserve 0x20 bytes for .compat section */
+ /* Reserve space for .compat section */
memset(buf+c, 0, PECOFF_COMPAT_RESERVE);
return PECOFF_COMPAT_RESERVE;
}
@@ -338,20 +426,15 @@ static int reserve_pecoff_compat_section(int c)

static void parse_zoffset(char *fname)
{
- FILE *file;
- char *p;
- int c;
+ size_t size;
+ char *data, *p;

- file = fopen(fname, "r");
- if (!file)
- die("Unable to open `%s': %m", fname);
- c = fread(buf, 1, sizeof(buf) - 1, file);
- if (ferror(file))
- die("read-error on `zoffset.h'");
- fclose(file);
- buf[c] = 0;
+ data = map_file(fname, &size);
+
+ /* We can do that, since we mapped one byte more */
+ data[size] = 0;

- p = (char *)buf;
+ p = (char *)data;

while (p && *p) {
PARSE_ZOFS(p, efi32_stub_entry);
@@ -367,82 +450,99 @@ static void parse_zoffset(char *fname)
while (p && (*p == '\r' || *p == '\n'))
p++;
}
+
+ unmap_file(data, size);
}

-int main(int argc, char ** argv)
+static unsigned int read_setup(char *path)
{
- unsigned int i, sz, setup_sectors, init_sz;
- int c;
- u32 sys_size;
- struct stat sb;
- FILE *file, *dest;
- int fd;
- void *kernel;
- u32 crc = 0xffffffffUL;
-
- efi_stub_defaults();
-
- if (argc != 5)
- usage();
- parse_zoffset(argv[3]);
-
- dest = fopen(argv[4], "w");
- if (!dest)
- die("Unable to write `%s': %m", argv[4]);
+ FILE *file;
+ unsigned int setup_size, file_size;

/* Copy the setup code */
- file = fopen(argv[1], "r");
+ file = fopen(path, "r");
if (!file)
- die("Unable to open `%s': %m", argv[1]);
- c = fread(buf, 1, sizeof(buf), file);
+ die("Unable to open `%s': %m", path);
+
+ file_size = fread(buf, 1, sizeof(buf), file);
if (ferror(file))
die("read-error on `setup'");
- if (c < 1024)
+
+ if (file_size < 2 * SECTOR_SIZE)
die("The setup must be at least 1024 bytes");
- if (get_unaligned_le16(&buf[510]) != 0xAA55)
+
+ if (get_unaligned_le16(&buf[SECTOR_SIZE - 2]) != 0xAA55)
die("Boot block hasn't got boot flag (0xAA55)");
+
fclose(file);

- c += reserve_pecoff_compat_section(c);
- c += reserve_pecoff_reloc_section(c);
+ /* Reserve space for PE sections */
+ file_size += reserve_pecoff_compat_section(file_size);
+ file_size += reserve_pecoff_reloc_section(file_size);

/* Pad unused space with zeros */
- setup_sectors = (c + 511) / 512;
- if (setup_sectors < SETUP_SECT_MIN)
- setup_sectors = SETUP_SECT_MIN;
- i = setup_sectors*512;
- memset(buf+c, 0, i-c);

- update_pecoff_setup_and_reloc(i);
+ setup_size = round_up(file_size, SECTOR_SIZE);
+
+ if (setup_size < SETUP_SECT_MIN * SECTOR_SIZE)
+ setup_size = SETUP_SECT_MIN * SECTOR_SIZE;
+
+ /*
+ * The global buffer is already initialised
+ * to 0, but zero out the padding just in case.
+ */
+
+ memset(buf + file_size, 0, setup_size - file_size);
+
+ return setup_size;
+}
+
+int main(int argc, char **argv)
+{
+ size_t kern_file_size;
+ unsigned int setup_size;
+ unsigned int setup_sectors;
+ unsigned int init_size;
+ unsigned int total_size;
+ unsigned int kern_size;
+ void *kernel;
+ u32 crc = 0xffffffffUL;
+ u8 *output;
+
+ if (argc != 5)
+ usage();
+
+ efi_stub_update_defaults();
+ parse_zoffset(argv[3]);
+
+ setup_size = read_setup(argv[1]);
+
+ setup_sectors = setup_size/SECTOR_SIZE;

/* Set the default root device */
put_unaligned_le16(DEFAULT_ROOT_DEV, &buf[508]);

- /* Open and stat the kernel file */
- fd = open(argv[2], O_RDONLY);
- if (fd < 0)
- die("Unable to open `%s': %m", argv[2]);
- if (fstat(fd, &sb))
- die("Unable to stat `%s': %m", argv[2]);
- sz = sb.st_size;
- kernel = mmap(NULL, sz, PROT_READ, MAP_SHARED, fd, 0);
- if (kernel == MAP_FAILED)
- die("Unable to mmap '%s': %m", argv[2]);
- /* Number of 16-byte paragraphs, including space for a 4-byte CRC */
- sys_size = (sz + 15 + 4) / 16;
+ /* Map kernel file to memory */
+ kernel = map_file(argv[2], &kern_file_size);
+
#ifdef CONFIG_EFI_STUB
- /*
- * COFF requires minimum 32-byte alignment of sections, and
- * adding a signature is problematic without that alignment.
- */
- sys_size = (sys_size + 1) & ~1;
+ /* The PE specification requires 512-byte minimum section file alignment */
+ kern_size = round_up(kern_file_size + 4, FILE_ALIGNMENT);
+ update_pecoff_setup_and_reloc(setup_size);
+#else
+ /* Number of 16-byte paragraphs, including space for a 4-byte CRC */
+ kern_size = round_up(kern_file_size + 4, PARAGRAPH_SIZE);
#endif

/* Patch the setup code with the appropriate size parameters */
- buf[0x1f1] = setup_sectors-1;
- put_unaligned_le32(sys_size, &buf[0x1f4]);
+ buf[0x1f1] = setup_sectors - 1;
+ put_unaligned_le32(kern_size/PARAGRAPH_SIZE, &buf[0x1f4]);
+
+ /* Update kernel_info offset. */
+ put_unaligned_le32(kernel_info, &buf[0x268]);
+
+ init_size = get_unaligned_le32(&buf[0x260]);

- init_sz = get_unaligned_le32(&buf[0x260]);
#ifdef CONFIG_EFI_STUB
/*
* The decompression buffer will start at ImageBase. When relocating
@@ -458,45 +558,35 @@ int main(int argc, char ** argv)
* For future-proofing, increase init_sz if necessary.
*/

- if (init_sz - _end < i + _ehead) {
- init_sz = (i + _ehead + _end + 4095) & ~4095;
- put_unaligned_le32(init_sz, &buf[0x260]);
+ if (init_size - _end < setup_size + _ehead) {
+ init_size = round_up(setup_size + _ehead + _end, SECTION_ALIGNMENT);
+ put_unaligned_le32(init_size, &buf[0x260]);
}
-#endif
- update_pecoff_text(setup_sectors * 512, i + (sys_size * 16), init_sz);

- efi_stub_entry_update();
+ total_size = update_pecoff_sections(setup_size, kern_size, init_size);

- /* Update kernel_info offset. */
- put_unaligned_le32(kernel_info, &buf[0x268]);
-
- crc = partial_crc32(buf, i, crc);
- if (fwrite(buf, 1, i, dest) != i)
- die("Writing setup failed");
+ efi_stub_entry_update();
+#else
+ (void)init_size;
+ total_size = setup_size + kern_size;
+#endif

- /* Copy the kernel code */
- crc = partial_crc32(kernel, sz, crc);
- if (fwrite(kernel, 1, sz, dest) != sz)
- die("Writing kernel failed");
+ output = map_output_file(argv[4], total_size);

- /* Add padding leaving 4 bytes for the checksum */
- while (sz++ < (sys_size*16) - 4) {
- crc = partial_crc32_one('\0', crc);
- if (fwrite("\0", 1, 1, dest) != 1)
- die("Writing padding failed");
- }
+ memcpy(output, buf, setup_size);
+ memcpy(output + setup_size, kernel, kern_file_size);
+ memset(output + setup_size + kern_file_size, 0, kern_size - kern_file_size);

- /* Write the CRC */
- put_unaligned_le32(crc, buf);
- if (fwrite(buf, 1, 4, dest) != 4)
- die("Writing CRC failed");
+ /* Calculate and write kernel checksum. */
+ crc = partial_crc32(output, total_size - 4, crc);
+ put_unaligned_le32(crc, &output[total_size - 4]);

- /* Catch any delayed write failures */
- if (fclose(dest))
- die("Writing image failed");
+ /* Catch any delayed write failures. */
+ if (munmap(output, total_size) < 0)
+ die("Writing kernel failed");

- close(fd);
+ unmap_file(kernel, kern_file_size);

- /* Everything is OK */
+ /* Everything is OK. */
return 0;
}
--
2.39.2


2023-03-14 10:22:40

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 20/27] efi: x86: Use private copy of struct setup_header

From: Ard Biesheuvel <[email protected]>

The native EFI entrypoint does not take a struct boot_params from the
loader, but instead, it constructs one from scratch, using the setup
header data from the start of the image.

This setup header is placed in a way that permits legacy loaders to
manipulate the contents (i.e., to pass the kernel command line or the
address and size of an initial ramdisk), but EFI boot does not use it in
that way - it only copies the contents that were placed there at build
time, but EFI loaders will not (and should not) manipulate the setup
header to configure the boot. (Commit 63bf28ceb3ebbe76 "efi: x86: Wipe
setup_data on pure EFI boot" deals with some of the fallout of using
setup_data in a way that breaks EFI boot.) So having a pristine, private
copy of the setup header rather than copying the one legacy boot loaders
use would be an advantage for EFI boot.

As it turns out, there is another reason why copying this header is
slightly problematic: the fixed offset of 0x1f1 bytes into the image
makes it difficult to describe all the sections of the image, as we are
running out of space. We could mitigate this by placing some parts of
the PE/COFF header after this setup header (which will happen in a
subsequent patch), but this makes the setup_header fundamentally part of
the PE header as well, which means that it is no longer part of the
'in-memory' representation of the PE image, but only part of the
'on-disk' representation. This means copying the setup header from the
running image may not work, as the PE loader may not have put it in
memory to begin with.

So let's work around this, and simplify things at the same time, by
just copying the setup header into an EFI stub private allocation at
build time, and using that instead of copying it from the loaded image
after it has been started.
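
(For reference: struct setup_header starts at offset 0x1f1 in the
legacy boot sector layout and, per the boot protocol, can extend at
most to offset 0x290, so the build-time copy added below transfers
0x290 - 0x1f1 = 0x9f bytes into the stub's private allocation.)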

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/boot/Makefile | 2 +-
arch/x86/boot/tools/build.c | 8 +++++
drivers/firmware/efi/libstub/x86-stub.c | 43 ++++---------------------
3 files changed, 15 insertions(+), 38 deletions(-)

diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index 9e38ffaadb5d..8203f1a23f7a 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -91,7 +91,7 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE

SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))

-sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|efi32_pe_entry\|input_data\|kernel_info\|_end\|_ehead\|_text\|z_.*\)$$/\#define ZO_\2 0x\1/p'
+sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|efi32_pe_entry\|efi_boot_params\|input_data\|kernel_info\|_end\|_ehead\|_text\|z_.*\)$$/\#define ZO_\2 0x\1/p'

quiet_cmd_zoffset = ZOFFSET $@
cmd_zoffset = $(NM) $< | sed -n $(sed-zoffset) > $@
diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c
index 73e88d30ebce..84d5a5cc7756 100644
--- a/arch/x86/boot/tools/build.c
+++ b/arch/x86/boot/tools/build.c
@@ -105,6 +105,7 @@ static unsigned long efi32_stub_entry;
static unsigned long efi64_stub_entry;
static unsigned long efi_pe_entry;
static unsigned long efi32_pe_entry;
+static unsigned long efi_boot_params;
static unsigned long kernel_info;
static unsigned long startup_64;
static unsigned long _ehead;
@@ -441,6 +442,7 @@ static void parse_zoffset(char *fname)
PARSE_ZOFS(p, efi64_stub_entry);
PARSE_ZOFS(p, efi_pe_entry);
PARSE_ZOFS(p, efi32_pe_entry);
+ PARSE_ZOFS(p, efi_boot_params);
PARSE_ZOFS(p, kernel_info);
PARSE_ZOFS(p, startup_64);
PARSE_ZOFS(p, _ehead);
@@ -577,6 +579,12 @@ int main(int argc, char **argv)
memcpy(output + setup_size, kernel, kern_file_size);
memset(output + setup_size + kern_file_size, 0, kern_size - kern_file_size);

+#ifdef CONFIG_EFI_STUB
+ /* Copy the setup header */
+ memcpy(output + setup_size + efi_boot_params + 0x1f1, &buf[0x1f1],
+ 0x290 - 0x1f1 /* == max possible sizeof(struct setup_header) */);
+#endif
+
/* Calculate and write kernel checksum. */
crc = partial_crc32(output, total_size - 4, crc);
put_unaligned_le32(crc, &output[total_size - 4]);
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 1d1ab1911fd3..5dbc9c7a4aa3 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -347,6 +347,8 @@ void __noreturn efi_stub_entry(efi_handle_t handle,
efi_system_table_t *sys_table_arg,
struct boot_params *boot_params);

+struct boot_params efi_boot_params __section(".data") __aligned(SZ_4K);
+
/*
* Because the x86 boot code expects to be passed a boot_params we
* need to create one ourselves (usually the bootloader would create
@@ -355,7 +357,6 @@ void __noreturn efi_stub_entry(efi_handle_t handle,
efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
efi_system_table_t *sys_table_arg)
{
- struct boot_params *boot_params;
struct setup_header *hdr;
void *image_base;
efi_guid_t proto = LOADED_IMAGE_PROTOCOL_GUID;
@@ -378,55 +379,23 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
image_base = efi_table_attr(image, image_base);
image_offset = (void *)startup_32 - image_base;

- status = efi_allocate_pages(sizeof(struct boot_params),
- (unsigned long *)&boot_params, ULONG_MAX);
- if (status != EFI_SUCCESS) {
- efi_err("Failed to allocate lowmem for boot params\n");
- efi_exit(handle, status);
- }
-
- memset(boot_params, 0x0, sizeof(struct boot_params));
-
- hdr = &boot_params->hdr;
-
- /* Copy the setup header from the second sector to boot_params */
- memcpy(&hdr->jump, image_base + 512,
- sizeof(struct setup_header) - offsetof(struct setup_header, jump));
-
- /*
- * Fill out some of the header fields ourselves because the
- * EFI firmware loader doesn't load the first sector.
- */
- hdr->root_flags = 1;
- hdr->vid_mode = 0xffff;
- hdr->boot_flag = 0xAA55;
-
- hdr->type_of_loader = 0x21;
+ hdr = &efi_boot_params.hdr;

/* Convert unicode cmdline to ascii */
cmdline_ptr = efi_convert_cmdline(image, &options_size);
if (!cmdline_ptr)
goto fail;

- efi_set_u64_split((unsigned long)cmdline_ptr,
- &hdr->cmd_line_ptr, &boot_params->ext_cmd_line_ptr);
+ efi_set_u64_split((unsigned long)cmdline_ptr, &hdr->cmd_line_ptr,
+ &efi_boot_params.ext_cmd_line_ptr);

hdr->ramdisk_image = 0;
hdr->ramdisk_size = 0;

- /*
- * Disregard any setup data that was provided by the bootloader:
- * setup_data could be pointing anywhere, and we have no way of
- * authenticating or validating the payload.
- */
- hdr->setup_data = 0;
-
- efi_stub_entry(handle, sys_table_arg, boot_params);
+ efi_stub_entry(handle, sys_table_arg, &efi_boot_params);
/* not reached */

fail:
- efi_free(sizeof(struct boot_params), (unsigned long)boot_params);
-
efi_exit(handle, status);
}

--
2.39.2


2023-03-14 10:22:53

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 03/27] x86/boot: Set cr0 to known state in trampoline

Ensure the WP bit is set to prevent boot code from writing to
non-writable memory pages.
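
For reference, CR0_STATE is defined in
arch/x86/include/asm/processor-flags.h as:

	#define CR0_STATE	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
				 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
				 X86_CR0_PG)

so loading it into %cr0 both re-enables paging (PG) and sets write
protection (WP) in a single move.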

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/head_64.S | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 03c4328a88cb..01fa42d31648 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -660,9 +660,8 @@ SYM_CODE_START(trampoline_32bit_src)
pushl $__KERNEL_CS
pushl %eax

- /* Enable paging again. */
- movl %cr0, %eax
- btsl $X86_CR0_PG_BIT, %eax
+ /* Enable paging and set CR0 to known state (this also sets WP flag) */
+ movl $CR0_STATE, %eax
movl %eax, %cr0

lret
--
2.39.2


2023-03-14 10:22:57

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 02/27] x86/build: Remove RWX sections and align on 4KB

Avoid creating sections that are simultaneously writable and executable
to prepare for the W^X implementation for the kernel itself (not the
decompressor). Align kernel sections on page size (4KB) to allow
protecting them in the page tables.

Split the init code from the ".init" segment into a separate R_X
".inittext" segment and make the ".init" segment non-executable.

Also add these segments to the x86_32 architecture for consistency.
Currently paging is disabled for x86_32 in the compressed kernel, so
protection is not applied anyway, but .init code was incorrectly
placed in the non-executable ".data" segment. This should not change
anything meaningful in the memory layout now, but might be required if
memory protection is also implemented in the compressed kernel for
x86_32.
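
For reference, the FLAGS() values in the PHDRS hunk below are the
standard ELF segment permission bits:

	#define PF_X 0x1	/* segment is executable */
	#define PF_W 0x2	/* segment is writable */
	#define PF_R 0x4	/* segment is readable */

so FLAGS(5) is PF_R | PF_X (R_E), FLAGS(6) is PF_R | PF_W (RW_), and
the removed FLAGS(7) was the PF_R | PF_W | PF_X (RWE) case. The absence
of RWX segments can be double-checked on the built vmlinux with
"readelf -l vmlinux".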

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/kernel/vmlinux.lds.S | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 25f155205770..81ea1236d293 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -102,12 +102,11 @@ jiffies = jiffies_64;
PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
data PT_LOAD FLAGS(6); /* RW_ */
-#ifdef CONFIG_X86_64
-#ifdef CONFIG_SMP
+#if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
percpu PT_LOAD FLAGS(6); /* RW_ */
#endif
- init PT_LOAD FLAGS(7); /* RWE */
-#endif
+ inittext PT_LOAD FLAGS(5); /* R_E */
+ init PT_LOAD FLAGS(6); /* RW_ */
note PT_NOTE FLAGS(0); /* ___ */
}

@@ -226,9 +225,10 @@ SECTIONS
#endif

INIT_TEXT_SECTION(PAGE_SIZE)
-#ifdef CONFIG_X86_64
- :init
-#endif
+ :inittext
+
+ . = ALIGN(PAGE_SIZE);
+

/*
* Section for code used exclusively before alternatives are run. All
@@ -240,6 +240,7 @@ SECTIONS
.altinstr_aux : AT(ADDR(.altinstr_aux) - LOAD_OFFSET) {
*(.altinstr_aux)
}
+ :init

INIT_DATA_SECTION(16)

--
2.39.2


2023-03-14 10:23:00

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 06/27] x86/boot: Setup memory protection for bzImage code

Use the previously added code to map memory with 4KB pages. Map the
compressed and uncompressed kernel with appropriate memory protection
attributes. For the compressed kernel, set them up manually. For the
uncompressed kernel, use the flags specified in the ELF header.
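
With the new interface, callers pass explicit protection flags and get
back the (possibly rounded) start of the mapped range. A minimal usage
sketch, using the flag names from the asm/shared/pgtable.h header
introduced below:

	/* Map a read-only data range, deferring the TLB flush. */
	unsigned long start;

	start = kernel_add_identity_map((unsigned long)_rodata,
					(unsigned long)_erodata,
					MAP_NOFLUSH);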

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>

---
arch/x86/boot/compressed/head_64.S | 24 +++++-
arch/x86/boot/compressed/ident_map_64.c | 98 ++++++++++++++++---------
arch/x86/boot/compressed/misc.c | 69 +++++++++++++++--
arch/x86/boot/compressed/misc.h | 22 +++++-
arch/x86/boot/compressed/pgtable.h | 20 -----
arch/x86/boot/compressed/pgtable_64.c | 2 +-
arch/x86/boot/compressed/sev.c | 6 +-
arch/x86/include/asm/shared/pgtable.h | 29 ++++++++
8 files changed, 204 insertions(+), 66 deletions(-)
delete mode 100644 arch/x86/boot/compressed/pgtable.h
create mode 100644 arch/x86/include/asm/shared/pgtable.h

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 01fa42d31648..7774daf90a19 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -29,13 +29,14 @@
#include <linux/linkage.h>
#include <asm/segment.h>
#include <asm/boot.h>
+#include <asm/cpufeatures.h>
#include <asm/msr.h>
#include <asm/processor-flags.h>
#include <asm/asm-offsets.h>
#include <asm/bootparam.h>
#include <asm/desc_defs.h>
#include <asm/trapnr.h>
-#include "pgtable.h"
+#include <asm/shared/pgtable.h>

/*
* Fix alignment at 16 bytes. Following CONFIG_FUNCTION_ALIGNMENT will result
@@ -554,6 +555,7 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
pushq %rsi
call load_stage2_idt

+ call enable_nx_if_supported
/* Pass boot_params to initialize_identity_maps() */
movq (%rsp), %rdi
call initialize_identity_maps
@@ -578,6 +580,26 @@ SYM_FUNC_START_LOCAL_NOALIGN(.Lrelocated)
jmp *%rax
SYM_FUNC_END(.Lrelocated)

+SYM_FUNC_START_LOCAL_NOALIGN(enable_nx_if_supported)
+ pushq %rbx
+
+ mov $0x80000001, %eax
+ cpuid
+ btl $(X86_FEATURE_NX & 31), %edx
+ jnc .Lnonx
+
+ movl $MSR_EFER, %ecx
+ rdmsr
+ btsl $_EFER_NX, %eax
+ wrmsr
+
+ movb $1, has_nx(%rip)
+
+.Lnonx:
+ popq %rbx
+ RET
+SYM_FUNC_END(enable_nx_if_supported)
+
.code32
/*
* This is the 32-bit trampoline that will be copied over to low memory.
diff --git a/arch/x86/boot/compressed/ident_map_64.c b/arch/x86/boot/compressed/ident_map_64.c
index 321a5011042d..eb28ce9812c5 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -28,6 +28,7 @@
#include <asm/trap_pf.h>
#include <asm/trapnr.h>
#include <asm/init.h>
+#include <asm/shared/pgtable.h>
/* Use the static base for this part of the boot process */
#undef __PAGE_OFFSET
#define __PAGE_OFFSET __PAGE_OFFSET_BASE
@@ -86,24 +87,53 @@ phys_addr_t physical_mask = (1ULL << __PHYSICAL_MASK_SHIFT) - 1;
* Due to relocation, pointers must be assigned at run time not build time.
*/
static struct x86_mapping_info mapping_info;
+bool has_nx; /* set in head_64.S */

/*
* Adds the specified range to the identity mappings.
*/
-void kernel_add_identity_map(unsigned long start, unsigned long end)
+unsigned long kernel_add_identity_map(unsigned long start,
+ unsigned long end,
+ unsigned int flags)
{
int ret;

- /* Align boundary to 2M. */
- start = round_down(start, PMD_SIZE);
- end = round_up(end, PMD_SIZE);
+ /* Align boundary to page size. */
+ start = round_down(start, PAGE_SIZE);
+ end = round_up(end, PAGE_SIZE);
if (start >= end)
- return;
+ return start;
+
+ /*
+ * Warn if W^X is violated.
+ * Only do that if CONFIG_RANDOMIZE_BASE is set, since otherwise we need
+ * to create an RWX region to handle overlapping memory regions for the
+ * compressed and uncompressed kernel.
+ */
+
+ if ((IS_ENABLED(CONFIG_RANDOMIZE_BASE) &&
+ !cmdline_find_option_bool("nokaslr")) &&
+ (flags & (MAP_EXEC | MAP_WRITE)) == (MAP_EXEC | MAP_WRITE))
+ warn("W^X violation\n");
+
+ bool nx = !(flags & MAP_EXEC) && has_nx;
+ bool ro = !(flags & MAP_WRITE);
+
+ mapping_info.page_flag = sme_me_mask | (nx ?
+ (ro ? __PAGE_KERNEL_RO : __PAGE_KERNEL) :
+ (ro ? __PAGE_KERNEL_ROX : __PAGE_KERNEL_EXEC));

/* Build the mapping. */
- ret = kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt, start, end);
+ ret = kernel_ident_mapping_init(&mapping_info,
+ (pgd_t *)top_level_pgt,
+ start, end);
if (ret)
error("Error: kernel_ident_mapping_init() failed\n");
+
+ if (!(flags & MAP_NOFLUSH))
+ write_cr3(top_level_pgt);
+
+ return start;
}

/* Locates and clears a region for a new top level page table. */
@@ -112,14 +142,17 @@ void initialize_identity_maps(void *rmode)
unsigned long cmdline;
struct setup_data *sd;

+ boot_params = rmode;
+
/* Exclude the encryption mask from __PHYSICAL_MASK */
physical_mask &= ~sme_me_mask;

/* Init mapping_info with run-time function/buffer pointers. */
mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
- mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
+ mapping_info.page_flag = __PAGE_KERNEL_EXEC | sme_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE;
+ mapping_info.allow_4kpages = 1;

/*
* It should be impossible for this not to already be true,
@@ -154,15 +187,29 @@ void initialize_identity_maps(void *rmode)
/*
* New page-table is set up - map the kernel image, boot_params and the
* command line. The uncompressed kernel requires boot_params and the
- * command line to be mapped in the identity mapping. Map them
- * explicitly here in case the compressed kernel does not touch them,
- * or does not touch all the pages covering them.
+ * command line to be mapped in the identity mapping.
+ * Every other accessed memory region is mapped later, if required.
*/
- kernel_add_identity_map((unsigned long)_head, (unsigned long)_end);
- boot_params = rmode;
- kernel_add_identity_map((unsigned long)boot_params, (unsigned long)(boot_params + 1));
+ kernel_add_identity_map((unsigned long)_head,
+ (unsigned long)_ehead, MAP_EXEC | MAP_NOFLUSH);
+
+ kernel_add_identity_map((unsigned long)_compressed,
+ (unsigned long)_ecompressed, MAP_WRITE | MAP_NOFLUSH);
+
+ kernel_add_identity_map((unsigned long)_text,
+ (unsigned long)_etext, MAP_EXEC | MAP_NOFLUSH);
+
+ kernel_add_identity_map((unsigned long)_rodata,
+ (unsigned long)_erodata, MAP_NOFLUSH);
+
+ kernel_add_identity_map((unsigned long)_data,
+ (unsigned long)_end, MAP_WRITE | MAP_NOFLUSH);
+
+ kernel_add_identity_map((unsigned long)boot_params,
+ (unsigned long)(boot_params + 1), MAP_WRITE | MAP_NOFLUSH);
+
cmdline = get_cmd_line_ptr();
- kernel_add_identity_map(cmdline, cmdline + COMMAND_LINE_SIZE);
+ kernel_add_identity_map(cmdline, cmdline + COMMAND_LINE_SIZE, MAP_NOFLUSH);

/*
* Also map the setup_data entries passed via boot_params in case they
@@ -172,7 +219,7 @@ void initialize_identity_maps(void *rmode)
while (sd) {
unsigned long sd_addr = (unsigned long)sd;

- kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len);
+ kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len, MAP_NOFLUSH);
sd = (struct setup_data *)sd->next;
}

@@ -191,26 +238,11 @@ void initialize_identity_maps(void *rmode)
static pte_t *split_large_pmd(struct x86_mapping_info *info,
pmd_t *pmdp, unsigned long __address)
{
- unsigned long page_flags;
- unsigned long address;
- pte_t *pte;
- pmd_t pmd;
- int i;
-
- pte = (pte_t *)info->alloc_pgt_page(info->context);
+ unsigned long address = __address & PMD_MASK;
+ pte_t *pte = ident_split_large_pmd(info, pmdp, address);
if (!pte)
return NULL;

- address = __address & PMD_MASK;
- /* No large page - clear PSE flag */
- page_flags = info->page_flag & ~_PAGE_PSE;
-
- /* Populate the PTEs */
- for (i = 0; i < PTRS_PER_PMD; i++) {
- set_pte(&pte[i], __pte(address | page_flags));
- address += PAGE_SIZE;
- }
-
/*
* Ideally we need to clear the large PMD first and do a TLB
* flush before we write the new PMD. But the 2M range of the
@@ -220,7 +252,7 @@ static pte_t *split_large_pmd(struct x86_mapping_info *info,
* also the only user of the page-table, so there is no chance
* of a TLB multihit.
*/
- pmd = __pmd((unsigned long)pte | info->kernpg_flag);
+ pmd_t pmd = __pmd((unsigned long)pte | info->kernpg_flag);
set_pmd(pmdp, pmd);
/* Flush TLB to establish the new PMD */
write_cr3(top_level_pgt);
@@ -383,5 +415,5 @@ void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
* Error code is sane - now identity map the 2M region around
* the faulting address.
*/
- kernel_add_identity_map(address, end);
+ kernel_add_identity_map(address, end, MAP_WRITE);
}
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 014ff222bf4b..efecd8656414 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -14,10 +14,10 @@

#include "misc.h"
#include "error.h"
-#include "pgtable.h"
#include "../string.h"
#include "../voffset.h"
#include <asm/bootparam_utils.h>
+#include <asm/shared/pgtable.h>

/*
* WARNING!!
@@ -277,7 +277,8 @@ static inline void handle_relocations(void *output, unsigned long output_len,
{ }
#endif

-static size_t parse_elf(void *output)
+static size_t parse_elf(void *output, unsigned long output_len,
+ unsigned long virt_addr)
{
#ifdef CONFIG_X86_64
Elf64_Ehdr ehdr;
@@ -287,6 +288,7 @@ static size_t parse_elf(void *output)
Elf32_Phdr *phdrs, *phdr;
#endif
void *dest;
+ unsigned long addr;
int i;

memcpy(&ehdr, output, sizeof(ehdr));
@@ -321,10 +323,51 @@ static size_t parse_elf(void *output)
#endif
memmove(dest, output + phdr->p_offset, phdr->p_filesz);
break;
- default: /* Ignore other PT_* */ break;
+ default:
+ /* Ignore other PT_* */
+ break;
}
}

+ handle_relocations(output, output_len, virt_addr);
+
+ if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE) ||
+ cmdline_find_option_bool("nokaslr"))
+ goto skip_protect;
+
+ for (i = 0; i < ehdr.e_phnum; i++) {
+ phdr = &phdrs[i];
+
+ switch (phdr->p_type) {
+ case PT_LOAD:
+#ifdef CONFIG_RELOCATABLE
+ addr = (unsigned long)output;
+ addr += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
+#else
+ addr = phdr->p_paddr;
+#endif
+ /*
+ * Simultaneously writable and executable segments
+ * violate W^X and should not be present in the vmlinux image.
+ * The absence of such segments is checked during build.
+ */
+
+ unsigned int flags = MAP_PROTECT;
+
+ if (phdr->p_flags & PF_X)
+ flags |= MAP_EXEC;
+ if (phdr->p_flags & PF_W)
+ flags |= MAP_WRITE;
+
+ kernel_add_identity_map(addr, addr + phdr->p_memsz, flags);
+ break;
+ default:
+ /* Ignore other PT_* */
+ break;
+ }
+ }
+
+skip_protect:
free(phdrs);

return ehdr.e_entry - LOAD_PHYSICAL_ADDR;
@@ -435,6 +478,23 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
needed_size,
&virt_addr);

+ unsigned long phys_addr = (unsigned long)output;
+
+ /*
+ * If KASLR is disabled, the input and output regions may overlap.
+ * In this case we need to map the region executable as well.
+ */
+ unsigned long map_flags = MAP_ALLOC | MAP_WRITE;
+
+ if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE) ||
+ cmdline_find_option_bool("nokaslr"))
+ map_flags |= MAP_EXEC;
+
+ phys_addr = kernel_add_identity_map(phys_addr,
+ phys_addr + needed_size,
+ map_flags);
+ output = (unsigned char *)phys_addr;
+
/* Validate memory location choices. */
if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
error("Destination physical address inappropriately aligned");
@@ -457,8 +517,7 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
debug_putstr("\nDecompressing Linux... ");
__decompress(input_data, input_len, NULL, NULL, output, output_len,
NULL, error);
- entry_offset = parse_elf(output);
- handle_relocations(output, output_len, virt_addr);
+ entry_offset = parse_elf(output, output_len, virt_addr);

debug_putstr("done.\nBooting the kernel (entry_offset: 0x");
debug_puthex(entry_offset);
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 20118fb7c53b..aea3621faf29 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -44,8 +44,12 @@
#define memptr unsigned
#endif

-/* boot/compressed/vmlinux start and end markers */
-extern char _head[], _end[];
+/* Compressed kernel section start/end markers. */
+extern char _head[], _ehead[];
+extern char _compressed[], _ecompressed[];
+extern char _text[], _etext[];
+extern char _rodata[], _erodata[];
+extern char _data[], _end[];

/* misc.c */
extern memptr free_mem_ptr;
@@ -173,8 +177,18 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#ifdef CONFIG_X86_5LEVEL
extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
#endif
-extern void kernel_add_identity_map(unsigned long start, unsigned long end);
-
+#ifdef CONFIG_X86_64
+extern unsigned long kernel_add_identity_map(unsigned long start,
+ unsigned long end,
+ unsigned int flags);
+#else
+static inline unsigned long kernel_add_identity_map(unsigned long start,
+ unsigned long end,
+ unsigned int flags)
+{
+ return start;
+}
+#endif
/* Used by PAGE_KERN* macros: */
extern pteval_t __default_kernel_pte_mask;

diff --git a/arch/x86/boot/compressed/pgtable.h b/arch/x86/boot/compressed/pgtable.h
deleted file mode 100644
index cc9b2529a086..000000000000
--- a/arch/x86/boot/compressed/pgtable.h
+++ /dev/null
@@ -1,20 +0,0 @@
-#ifndef BOOT_COMPRESSED_PAGETABLE_H
-#define BOOT_COMPRESSED_PAGETABLE_H
-
-#define TRAMPOLINE_32BIT_SIZE (2 * PAGE_SIZE)
-
-#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0
-
-#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
-#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
-
-#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
-
-#ifndef __ASSEMBLER__
-
-extern unsigned long *trampoline_32bit;
-
-extern void trampoline_32bit_src(void *return_ptr);
-
-#endif /* __ASSEMBLER__ */
-#endif /* BOOT_COMPRESSED_PAGETABLE_H */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 2ac12ff4111b..c7cf5a1059a8 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -2,7 +2,7 @@
#include "misc.h"
#include <asm/e820/types.h>
#include <asm/processor.h>
-#include "pgtable.h"
+#include <asm/shared/pgtable.h>
#include "../string.h"
#include "efi.h"

diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index d63ad8f99f83..aa3905885e2a 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -13,6 +13,7 @@
#include "misc.h"

#include <asm/pgtable_types.h>
+#include <asm/shared/pgtable.h>
#include <asm/sev.h>
#include <asm/trapnr.h>
#include <asm/trap_pf.h>
@@ -505,10 +506,11 @@ void sev_prep_identity_maps(unsigned long top_level_pgt)
unsigned long cc_info_pa = boot_params->cc_blob_address;
struct cc_blob_sev_info *cc_info;

- kernel_add_identity_map(cc_info_pa, cc_info_pa + sizeof(*cc_info));
+ kernel_add_identity_map(cc_info_pa, cc_info_pa + sizeof(*cc_info), MAP_NOFLUSH);

cc_info = (struct cc_blob_sev_info *)cc_info_pa;
- kernel_add_identity_map(cc_info->cpuid_phys, cc_info->cpuid_phys + cc_info->cpuid_len);
+ kernel_add_identity_map(cc_info->cpuid_phys,
+ cc_info->cpuid_phys + cc_info->cpuid_len, MAP_NOFLUSH);
}

sev_verify_cbit(top_level_pgt);
diff --git a/arch/x86/include/asm/shared/pgtable.h b/arch/x86/include/asm/shared/pgtable.h
new file mode 100644
index 000000000000..6527dadf39d6
--- /dev/null
+++ b/arch/x86/include/asm/shared/pgtable.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_SHARED_PAGETABLE_H
+#define ASM_SHARED_PAGETABLE_H
+
+#define MAP_WRITE 0x02 /* Writable memory */
+#define MAP_EXEC 0x04 /* Executable memory */
+#define MAP_ALLOC 0x10 /* Range needs to be allocated */
+#define MAP_PROTECT 0x20 /* Set exact memory attributes for memory range */
+#define MAP_NOFLUSH 0x40 /* Avoid flushing TLB */
+
+#define TRAMPOLINE_32BIT_SIZE (3 * PAGE_SIZE)
+
+#define TRAMPOLINE_32BIT_PLACEMENT_MAX (0xA0000)
+
+#define TRAMPOLINE_32BIT_PGTABLE_OFFSET 0
+
+#define TRAMPOLINE_32BIT_CODE_OFFSET PAGE_SIZE
+#define TRAMPOLINE_32BIT_CODE_SIZE 0x80
+
+#define TRAMPOLINE_32BIT_STACK_END TRAMPOLINE_32BIT_SIZE
+
+#ifndef __ASSEMBLER__
+
+extern unsigned long *trampoline_32bit;
+
+extern void trampoline_32bit_src(void *return_ptr);
+
+#endif /* __ASSEMBLER__ */
+#endif /* ASM_SHARED_PAGETABLE_H */
--
2.39.2


2023-03-14 10:23:03

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 05/27] x86/boot: Support 4KB pages for identity mapping

The current identity mapping code only supports 2M and 1G pages.
4KB pages are desirable for better memory protection granularity
in the compressed kernel code.

Change the identity mapping code to support 4KB pages and
remapping of memory with different attributes.
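
A minimal usage sketch: callers opt in to 4KB granularity via the new
allow_4kpages field (it defaults to 0, so existing users keep the
2M/1G-only behaviour); the values here mirror how the decompressor sets
up this structure elsewhere in the series:

	struct x86_mapping_info info = {
		.alloc_pgt_page	= alloc_pgt_page,
		.context	= &pgt_data,
		.page_flag	= __PAGE_KERNEL_EXEC | sme_me_mask,
		.kernpg_flag	= _KERNPG_TABLE,
		.allow_4kpages	= 1,
	};

	kernel_ident_mapping_init(&info, (pgd_t *)top_level_pgt,
				  start, end);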

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>

Warnings caused by the previous version were
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
---
arch/x86/include/asm/init.h | 8 +-
arch/x86/mm/ident_map.c | 185 +++++++++++++++++++++++++++++-------
2 files changed, 160 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 5f1d3c421f68..d4e790435bac 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -8,10 +8,16 @@ struct x86_mapping_info {
unsigned long page_flag; /* page flag for PMD or PUD entry */
unsigned long offset; /* ident mapping offset */
bool direct_gbpages; /* PUD level 1GB page support */
+ bool allow_4kpages; /* Allow more granular mappings with 4K pages */
unsigned long kernpg_flag; /* kernel pagetable flag override */
};

int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
- unsigned long pstart, unsigned long pend);
+ unsigned long pstart, unsigned long pend);
+pte_t *ident_split_large_pmd(struct x86_mapping_info *info,
+ pmd_t *pmdp, unsigned long page_addr);
+pmd_t *ident_split_large_pud(struct x86_mapping_info *info,
+ pud_t *pudp, unsigned long page_addr);
+

#endif /* _ASM_X86_INIT_H */
diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index 968d7005f4a7..662e794a325d 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -4,24 +4,127 @@
* included by both the compressed kernel and the regular kernel.
*/

-static void ident_pmd_init(struct x86_mapping_info *info, pmd_t *pmd_page,
- unsigned long addr, unsigned long end)
+static void ident_pte_init(struct x86_mapping_info *info, pte_t *pte_page,
+ unsigned long addr, unsigned long end,
+ unsigned long flags)
{
- addr &= PMD_MASK;
- for (; addr < end; addr += PMD_SIZE) {
+ addr &= PAGE_MASK;
+ for (; addr < end; addr += PAGE_SIZE) {
+ pte_t *pte = pte_page + pte_index(addr);
+
+ set_pte(pte, __pte((addr - info->offset) | flags));
+ }
+}
+
+pte_t *ident_split_large_pmd(struct x86_mapping_info *info,
+ pmd_t *pmdp, unsigned long page_addr)
+{
+ unsigned long pmd_addr, page_flags;
+ pte_t *pte;
+
+ pte = (pte_t *)info->alloc_pgt_page(info->context);
+ if (!pte)
+ return NULL;
+
+ pmd_addr = page_addr & PMD_MASK;
+
+ /* Not a large page - clear PSE flag */
+ page_flags = pmd_flags(*pmdp) & ~_PSE;
+ ident_pte_init(info, pte, pmd_addr, pmd_addr + PMD_SIZE, page_flags);
+
+ return pte;
+}
+
+static int ident_pmd_init(struct x86_mapping_info *info, pmd_t *pmd_page,
+ unsigned long addr, unsigned long end,
+ unsigned long flags)
+{
+ unsigned long next;
+ bool new_table = 0;
+
+ for (; addr < end; addr = next) {
pmd_t *pmd = pmd_page + pmd_index(addr);
+ pte_t *pte;

- if (pmd_present(*pmd))
+ next = (addr & PMD_MASK) + PMD_SIZE;
+ if (next > end)
+ next = end;
+
+ /*
+ * Use 2M pages if 4k pages are not allowed or
+ * we are not mapping extra, i.e. address and size are aligned.
+ */
+
+ if (!info->allow_4kpages ||
+ (!(addr & ~PMD_MASK) && next == addr + PMD_SIZE)) {
+
+ pmd_t pmdval;
+
+ addr &= PMD_MASK;
+ pmdval = __pmd((addr - info->offset) | flags | _PSE);
+ set_pmd(pmd, pmdval);
continue;
+ }
+
+ /*
+ * If the currently mapped page is large, we need to split it.
+ * The case where we can simply remap a 2M page as a 2M page
+ * with different flags is already covered above.
+ *
+ * If nothing is mapped at the desired address,
+ * we need to allocate a new page table.
+ */

- set_pmd(pmd, __pmd((addr - info->offset) | info->page_flag));
+ if (pmd_large(*pmd)) {
+ pte = ident_split_large_pmd(info, pmd, addr);
+ new_table = 1;
+ } else if (!pmd_present(*pmd)) {
+ pte = (pte_t *)info->alloc_pgt_page(info->context);
+ new_table = 1;
+ } else {
+ pte = pte_offset_kernel(pmd, 0);
+ new_table = 0;
+ }
+
+ if (!pte)
+ return -ENOMEM;
+
+ ident_pte_init(info, pte, addr, next, flags);
+
+ if (new_table)
+ set_pmd(pmd, __pmd(__pa(pte) | info->kernpg_flag));
}
+
+ return 0;
}

+
+pmd_t *ident_split_large_pud(struct x86_mapping_info *info,
+ pud_t *pudp, unsigned long page_addr)
+{
+ unsigned long pud_addr, page_flags;
+ pmd_t *pmd;
+
+ pmd = (pmd_t *)info->alloc_pgt_page(info->context);
+ if (!pmd)
+ return NULL;
+
+ pud_addr = page_addr & PUD_MASK;
+
+ /* Not a large page - clear PSE flag */
+ page_flags = pud_flags(*pudp) & ~_PSE;
+ ident_pmd_init(info, pmd, pud_addr, pud_addr + PUD_SIZE, page_flags);
+
+ return pmd;
+}
+
+
static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
unsigned long addr, unsigned long end)
{
unsigned long next;
+ bool new_table = 0;
+ int result;

for (; addr < end; addr = next) {
pud_t *pud = pud_page + pud_index(addr);
@@ -31,28 +134,39 @@ static int ident_pud_init(struct x86_mapping_info *info, pud_t *pud_page,
if (next > end)
next = end;

+ /* Use 1G pages only if forced, even if they are supported. */
if (info->direct_gbpages) {
pud_t pudval;
-
- if (pud_present(*pud))
- continue;
+ unsigned long flags;

addr &= PUD_MASK;
- pudval = __pud((addr - info->offset) | info->page_flag);
+ flags = info->page_flag | _PSE;
+ pudval = __pud((addr - info->offset) | flags);
+
set_pud(pud, pudval);
continue;
}

- if (pud_present(*pud)) {
+ if (pud_large(*pud)) {
+ pmd = ident_split_large_pud(info, pud, addr);
+ new_table = 1;
+ } else if (!pud_present(*pud)) {
+ pmd = (pmd_t *)info->alloc_pgt_page(info->context);
+ new_table = 1;
+ } else {
pmd = pmd_offset(pud, 0);
- ident_pmd_init(info, pmd, addr, next);
- continue;
+ new_table = 0;
}
- pmd = (pmd_t *)info->alloc_pgt_page(info->context);
+
if (!pmd)
return -ENOMEM;
- ident_pmd_init(info, pmd, addr, next);
- set_pud(pud, __pud(__pa(pmd) | info->kernpg_flag));
+
+ result = ident_pmd_init(info, pmd, addr, next, info->page_flag);
+ if (result)
+ return result;
+
+ if (new_table)
+ set_pud(pud, __pud(__pa(pmd) | info->kernpg_flag));
}

return 0;
@@ -63,6 +177,7 @@ static int ident_p4d_init(struct x86_mapping_info *info, p4d_t *p4d_page,
{
unsigned long next;
int result;
+ bool new_table = 0;

for (; addr < end; addr = next) {
p4d_t *p4d = p4d_page + p4d_index(addr);
@@ -72,15 +187,14 @@ static int ident_p4d_init(struct x86_mapping_info *info, p4d_t *p4d_page,
if (next > end)
next = end;

- if (p4d_present(*p4d)) {
+ if (!p4d_present(*p4d)) {
+ pud = (pud_t *)info->alloc_pgt_page(info->context);
+ new_table = 1;
+ } else {
pud = pud_offset(p4d, 0);
- result = ident_pud_init(info, pud, addr, next);
- if (result)
- return result;
-
- continue;
+ new_table = 0;
}
- pud = (pud_t *)info->alloc_pgt_page(info->context);
+
if (!pud)
return -ENOMEM;

@@ -88,19 +202,22 @@ static int ident_p4d_init(struct x86_mapping_info *info, p4d_t *p4d_page,
if (result)
return result;

- set_p4d(p4d, __p4d(__pa(pud) | info->kernpg_flag));
+ if (new_table)
+ set_p4d(p4d, __p4d(__pa(pud) | info->kernpg_flag));
}

return 0;
}

-int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
- unsigned long pstart, unsigned long pend)
+int kernel_ident_mapping_init(struct x86_mapping_info *info,
+ pgd_t *pgd_page, unsigned long pstart,
+ unsigned long pend)
{
unsigned long addr = pstart + info->offset;
unsigned long end = pend + info->offset;
unsigned long next;
int result;
+ bool new_table;

/* Set the default pagetable flags if not supplied */
if (!info->kernpg_flag)
@@ -117,20 +234,24 @@ int kernel_ident_mapping_init(struct x86_mapping_info *info, pgd_t *pgd_page,
if (next > end)
next = end;

- if (pgd_present(*pgd)) {
+ if (!pgd_present(*pgd)) {
+ p4d = (p4d_t *)info->alloc_pgt_page(info->context);
+ new_table = 1;
+ } else {
p4d = p4d_offset(pgd, 0);
- result = ident_p4d_init(info, p4d, addr, next);
- if (result)
- return result;
- continue;
+ new_table = 0;
}

- p4d = (p4d_t *)info->alloc_pgt_page(info->context);
if (!p4d)
return -ENOMEM;
+
result = ident_p4d_init(info, p4d, addr, next);
if (result)
return result;
+
+ if (!new_table)
+ continue;
+
if (pgtable_l5_enabled()) {
set_pgd(pgd, __pgd(__pa(p4d) | info->kernpg_flag));
} else {
--
2.39.2


2023-03-14 10:23:20

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 04/27] x86/boot: Increase boot page table size

The previous upper limit ignored pages implicitly mapped from the #PF
handler by code accessing ACPI tables (boot/compressed/{acpi.c,efi.c}),
so the theoretical upper limit is higher than what was set.

Using 4KB pages is desirable for better memory protection granularity,
but approximately twice as much memory is required for them.

Increase the initial page table size to 64 4KB page tables.
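
As a worked check of the estimate in the comment below: 1 (level4) +
(3+3)*2 = 12 (param and cmd_line) + (2+2+6)*2 = 20 (kernel and
randomized kernel, with S = 6 sections) + 3 (video RAM) gives
1 + 12 + 20 + 3 = 36 pages, rounded up to 64 to leave headroom for the
ACPI and UEFI table mappings.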

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/include/asm/boot.h | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/boot.h b/arch/x86/include/asm/boot.h
index 9191280d9ea3..88836067f88c 100644
--- a/arch/x86/include/asm/boot.h
+++ b/arch/x86/include/asm/boot.h
@@ -41,22 +41,26 @@
# define BOOT_STACK_SIZE 0x4000

# define BOOT_INIT_PGT_SIZE (6*4096)
-# ifdef CONFIG_RANDOMIZE_BASE
/*
* Assuming all cross the 512GB boundary:
* 1 page for level4
- * (2+2)*4 pages for kernel, param, cmd_line, and randomized kernel
- * 2 pages for first 2M (video RAM: CONFIG_X86_VERBOSE_BOOTUP).
- * Total is 19 pages.
+ * (3+3)*2 pages for param and cmd_line
+ * (2+2+S)*2 pages for kernel and randomized kernel, where S is total number
+ * of sections of kernel. Explanation: 2+2 are upper level page tables.
+ * We can have only S unaligned parts of section: 1 at the end of the kernel
+ * and (S-1) at the section borders. The start address of the kernel is
+ * aligned, so an extra page table. There are at most S=6 sections in
+ * vmlinux ELF image.
+ * 3 pages for first 2M (video RAM: CONFIG_X86_VERBOSE_BOOTUP).
+ * Total is 36 pages.
+ *
+ * Some number of page tables is also required for the ACPI and UEFI table
+ * mappings, so we round up 36 to 64. Since ACPI tables are generally
+ * allocated in a few contiguous regions of memory, they are very unlikely
+ * to be spread out enough to require more than 28 extra page tables, and
+ * this should work fine in all more or less sane environments.
*/
-# ifdef CONFIG_X86_VERBOSE_BOOTUP
-# define BOOT_PGT_SIZE (19*4096)
-# else /* !CONFIG_X86_VERBOSE_BOOTUP */
-# define BOOT_PGT_SIZE (17*4096)
-# endif
-# else /* !CONFIG_RANDOMIZE_BASE */
-# define BOOT_PGT_SIZE BOOT_INIT_PGT_SIZE
-# endif
+# define BOOT_PGT_SIZE (64*4096)

#else /* !CONFIG_X86_64 */
# define BOOT_STACK_SIZE 0x1000
--
2.39.2


2023-03-14 10:25:36

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 22/27] x86/build: set type_of_loader for EFISTUB

After switching to the local copy of boot_params, EFISTUB stopped
setting type_of_loader, leaving it at the default value of 0. Restore
that behavior by assigning the right value at build time.
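
(For reference: type_of_loader lives at offset 0x210 in the legacy
boot layout and the setup header starts at 0x1f1, so its offset within
the header is 0x210 - 0x1f1 = 0x1f, hence the 0x1F index in the hunk
below. The value 0x21 matches what efi_pe_entry assigned at run time
before the switch to the private boot_params copy.)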

Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/tools/build.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c
index 476ef05f16fb..5ac4f08ed923 100644
--- a/arch/x86/boot/tools/build.c
+++ b/arch/x86/boot/tools/build.c
@@ -588,6 +588,8 @@ int main(int argc, char **argv)
memcpy(output + setup_size + efi_boot_params + SETUP_HEADER_OFFSET,
setup_header, 0x290 - SETUP_HEADER_OFFSET
/* == max possible sizeof(struct setup_header) */);
+ /* Set type_of_loader in the local copy to the value EFISTUB uses */
+ output[setup_size + efi_boot_params + SETUP_HEADER_OFFSET + 0x1F] = 0x21;
#endif

/* Calculate and write kernel checksum. */
--
2.39.2


2023-03-14 10:53:03

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 27/27] efi/x86: don't try to set page attributes on 0-sized regions.

From: Peter Jones <[email protected]>

In "efi/x86: Explicitly set sections memory attributes", the following
region is defined to help compute page permissions:

/* .setup [image_base, _head] */
efi_adjust_memory_range_protection(image_base,
(unsigned long)_head - image_base,
EFI_MEMORY_RO | EFI_MEMORY_XP);

In at least some cases, that will result in a size of 0, which
produces an error and a message on the console, though it does not
actually cause the boot process to fail.

This patch checks that case in efi_adjust_memory_range_protection() and
returns the error without logging.
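
(For instance, once the '.setup' section is no longer part of the
loaded image, the first section may start directly at image_base,
making _head equal to image_base and the [image_base, _head] range
empty.)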

Signed-off-by: Peter Jones <[email protected]>
---
drivers/firmware/efi/libstub/mem.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/firmware/efi/libstub/mem.c b/drivers/firmware/efi/libstub/mem.c
index d58b552739ed..f18c797785ac 100644
--- a/drivers/firmware/efi/libstub/mem.c
+++ b/drivers/firmware/efi/libstub/mem.c
@@ -251,6 +251,9 @@ efi_status_t efi_adjust_memory_range_protection(unsigned long start,
efi_physical_addr_t rounded_start, rounded_end;
unsigned long attr_clear;

+ if (size == 0)
+ return EFI_INVALID_PARAMETER;
+
/*
* This function should not be used to modify attributes
* other than writable/executable.
--
2.39.2


2023-03-14 10:53:08

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 24/27] x86/build: Make generated PE more spec compliant

Currently the kernel image is not a fully compliant PE image, so it
may fail to boot with stricter implementations of UEFI PE loaders.
Align the generated PE image with loader expectations and with the PE
documentation [1] referenced by the UEFI specification [2].

Set minimal alignments and sizes according to the spec. Align data
structures to their natural alignments to prevent unaligned data
accesses.

Stop generating the '.setup' section, as it is no longer used. It was
previously needed to read boot_params from the kernel image. EFISTUB
was switched to use a local copy of boot_params, so this section is no
longer necessary.

Don't reserve init_size (the buffer size required for the extracted
kernel) worth of memory for the PE image, since in-place extraction is
no longer used when booting via UEFI.

Split '.text' into '.data' and '.text' to apply proper memory
protection attributes. Make '.data' non-executable and '.text'
non-writable to implement W^X.

Fill SizeOfCode, SizeOfInitializedData, SizeOfUninitializedData,
AddressOfEntryPoint, BaseOfCode and the reloc data directory with
appropriate values in tools/build.c.

Remove alignment flags on sections, since they are only allowed
on object files.

This patch and several previous ones incorporate some ideas from the
RFC at [3].

[1] https://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/pecoff_v83.docx
[2] https://uefi.org/sites/default/files/resources/UEFI_Spec_2_9_2021_03_18.pdf
[3] https://lore.kernel.org/linux-efi/[email protected]/
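
With the linux/pe.h structures, filling these fields in tools/build.c
becomes direct member stores instead of raw offsets. A minimal sketch
(field names as in struct pe32plus_opt_hdr; the *_sz and *_rva
variables are illustrative):

	pe_opt_hdr *hdr = get_pe_opt_header(buf);

	put_unaligned_le32(text_sz, &hdr->text_size);	  /* SizeOfCode */
	put_unaligned_le32(text_rva, &hdr->code_base);	  /* BaseOfCode */
	put_unaligned_le32(entry_rva, &hdr->entry_point); /* AddressOfEntryPoint */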

Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/Makefile | 2 +-
arch/x86/boot/header.S | 62 +++++++-------
arch/x86/boot/tools/build.c | 156 ++++++++++++++++++------------------
3 files changed, 106 insertions(+), 114 deletions(-)

diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile
index 8203f1a23f7a..0c61fbb082bb 100644
--- a/arch/x86/boot/Makefile
+++ b/arch/x86/boot/Makefile
@@ -91,7 +91,7 @@ $(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE

SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))

-sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|efi32_pe_entry\|efi_boot_params\|input_data\|kernel_info\|_end\|_ehead\|_text\|z_.*\)$$/\#define ZO_\2 0x\1/p'
+sed-zoffset := -e 's/^\([0-9a-fA-F]*\) [a-zA-Z] \(startup_32\|startup_64\|efi32_stub_entry\|efi64_stub_entry\|efi_pe_entry\|efi32_pe_entry\|efi_boot_params\|input_data\|kernel_info\|_end\|_data\|z_.*\)$$/\#define ZO_\2 0x\1/p'

quiet_cmd_zoffset = ZOFFSET $@
cmd_zoffset = $(NM) $< | sed -n $(sed-zoffset) > $@
diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index d4e16edf1198..0e96dcac91c0 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -47,6 +47,8 @@ SYSSEG = 0x1000 /* historical load address >> 4 */
#
.long LINUX_PE_MAGIC
.long pe_header
+
+ .align 8
pe_header:
.long PE_MAGIC

@@ -75,16 +77,13 @@ optional_header:
.byte 0x02 # MajorLinkerVersion
.byte 0x14 # MinorLinkerVersion

- # Filled in by build.c
+ # All of these are filled in by build.c
.long 0 # SizeOfCode
-
.long 0 # SizeOfInitializedData
.long 0 # SizeOfUninitializedData
-
- # Filled in by build.c
.long 0x0000 # AddressOfEntryPoint
+ .long 0x0000 # BaseOfCode

- .long 0x0200 # BaseOfCode
#ifdef CONFIG_X86_32
.long 0 # data
#endif
@@ -97,8 +96,8 @@ extra_header_fields:
#else
.quad image_base # ImageBase
#endif
- .long 0x20 # SectionAlignment
- .long 0x20 # FileAlignment
+ .long 0x1000 # SectionAlignment
+ .long 0x200 # FileAlignment
.word 0 # MajorOperatingSystemVersion
.word 0 # MinorOperatingSystemVersion
.word LINUX_EFISTUB_MAJOR_VERSION # MajorImageVersion
@@ -143,26 +142,6 @@ extra_header_fields:

# Section table
section_table:
- #
- # The offset & size fields are filled in by build.c.
- #
- .ascii ".setup"
- .byte 0
- .byte 0
- .long 0
- .long 0x0 # startup_{32,64}
- .long 0 # Size of initialized data
- # on disk
- .long 0x0 # startup_{32,64}
- .long 0 # PointerToRelocations
- .long 0 # PointerToLineNumbers
- .word 0 # NumberOfRelocations
- .word 0 # NumberOfLineNumbers
- .long IMAGE_SCN_CNT_CODE | \
- IMAGE_SCN_MEM_READ | \
- IMAGE_SCN_MEM_EXECUTE | \
- IMAGE_SCN_ALIGN_16BYTES # Characteristics
-
#
# The EFI application loader requires a relocation section
# because EFI applications must be relocatable. The .reloc
@@ -181,8 +160,7 @@ section_table:
.word 0 # NumberOfLineNumbers
.long IMAGE_SCN_CNT_INITIALIZED_DATA | \
IMAGE_SCN_MEM_READ | \
- IMAGE_SCN_MEM_DISCARDABLE | \
- IMAGE_SCN_ALIGN_1BYTES # Characteristics
+ IMAGE_SCN_MEM_DISCARDABLE # Characteristics

#ifdef CONFIG_EFI_MIXED
#
@@ -200,8 +178,7 @@ section_table:
.word 0 # NumberOfLineNumbers
.long IMAGE_SCN_CNT_INITIALIZED_DATA | \
IMAGE_SCN_MEM_READ | \
- IMAGE_SCN_MEM_DISCARDABLE | \
- IMAGE_SCN_ALIGN_1BYTES # Characteristics
+ IMAGE_SCN_MEM_DISCARDABLE # Characteristics
#endif

#
@@ -222,8 +199,27 @@ section_table:
.word 0 # NumberOfLineNumbers
.long IMAGE_SCN_CNT_CODE | \
IMAGE_SCN_MEM_READ | \
- IMAGE_SCN_MEM_EXECUTE | \
- IMAGE_SCN_ALIGN_16BYTES # Characteristics
+ IMAGE_SCN_MEM_EXECUTE # Characteristics
+
+ #
+ # The offset & size fields are filled in by build.c.
+ #
+ .ascii ".data"
+ .byte 0
+ .byte 0
+ .byte 0
+ .long 0
+ .long 0x0 # startup_{32,64}
+ .long 0 # Size of initialized data
+ # on disk
+ .long 0x0 # startup_{32,64}
+ .long 0 # PointerToRelocations
+ .long 0 # PointerToLineNumbers
+ .word 0 # NumberOfRelocations
+ .word 0 # NumberOfLineNumbers
+ .long IMAGE_SCN_CNT_INITIALIZED_DATA | \
+ IMAGE_SCN_MEM_READ | \
+ IMAGE_SCN_MEM_WRITE # Characteristics

.set section_count, (. - section_table) / 40
#endif /* CONFIG_EFI_STUB */
diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c
index 5ac4f08ed923..92a3d4a8ed6b 100644
--- a/arch/x86/boot/tools/build.c
+++ b/arch/x86/boot/tools/build.c
@@ -57,19 +57,20 @@ typedef unsigned int u32;
#define SECTOR_SIZE 512
#define FILE_ALIGNMENT 512
#define SECTION_ALIGNMENT 4096
+#define BASE_RVA 0x1000

/* This must be large enough to hold the entire setup */
u8 buf[SETUP_SECT_MAX*SECTOR_SIZE];

-#define PECOFF_RELOC_RESERVE 0x20
+#define PECOFF_RELOC_RESERVE round_up(0x20, FILE_ALIGNMENT)

#ifdef CONFIG_EFI_MIXED
-#define PECOFF_COMPAT_RESERVE 0x20
+#define PECOFF_COMPAT_RESERVE round_up(0x20, FILE_ALIGNMENT)
#else
#define PECOFF_COMPAT_RESERVE 0x0
#endif

-#define RELOC_SECTION_SIZE 10
+#define RELOC_SECTION_SIZE 12

/* PE header has different format depending on the architecture */
#ifdef CONFIG_X86_64
@@ -110,7 +111,7 @@ static unsigned long efi32_pe_entry;
static unsigned long efi_boot_params;
static unsigned long kernel_info;
static unsigned long startup_64;
-static unsigned long _ehead;
+static unsigned long _data;
static unsigned long _end;

/*----------------------------------------------------------------------*/
@@ -251,7 +252,7 @@ static void *map_output_file(const char *path, size_t size)

#ifdef CONFIG_EFI_STUB

-static void update_pecoff_section_header_fields(char *section_name, u32 vma, u32 size, u32 datasz, u32 offset)
+static void update_pecoff_section_header(char *section_name, u32 vma, u32 size, u32 datasz, u32 offset)
{
unsigned short num_sections;
struct section_header *section;
@@ -281,32 +282,44 @@ static void update_pecoff_section_header_fields(char *section_name, u32 vma, u32
}
}

-static void update_pecoff_section_header(char *section_name, u32 offset, u32 size)
-{
- update_pecoff_section_header_fields(section_name, offset, size, size, offset);
-}

-static void update_pecoff_setup_and_reloc(unsigned int size)
+static unsigned int update_pecoff_reloc_and_compat(unsigned int setup_size)
{
- u32 setup_offset = SECTOR_SIZE;
- u32 reloc_offset = size - PECOFF_RELOC_RESERVE - PECOFF_COMPAT_RESERVE;
+ unsigned int current_rva = BASE_RVA;
+ struct data_directory *dir;
+ u32 reloc_offset = setup_size - PECOFF_RELOC_RESERVE - PECOFF_COMPAT_RESERVE;
+ u32 reloc_memsz = round_up(PECOFF_RELOC_RESERVE, SECTION_ALIGNMENT);
+
#ifdef CONFIG_EFI_MIXED
u32 compat_offset = reloc_offset + PECOFF_RELOC_RESERVE;
+ u32 compat_memsz = round_up(PECOFF_COMPAT_RESERVE, SECTION_ALIGNMENT);
#endif
- u32 setup_size = reloc_offset - setup_offset;

- update_pecoff_section_header(".setup", setup_offset, setup_size);
- update_pecoff_section_header(".reloc", reloc_offset, PECOFF_RELOC_RESERVE);
+ update_pecoff_section_header(".reloc", current_rva, reloc_memsz,
+ PECOFF_RELOC_RESERVE, reloc_offset);
+
+ /* Update PE data directory to point to '.reloc' section */
+ dir = (struct data_directory *)(get_pe_opt_header(buf) + 1);
+ put_unaligned_le32(current_rva, &dir->base_relocations.virtual_address);
+ put_unaligned_le32(RELOC_SECTION_SIZE, &dir->base_relocations.size);

/*
- * Modify .reloc section contents with a single entry. The
- * relocation is applied to offset 10 of the relocation section.
+ * Modify .reloc section contents with two no-op entries. The
+ * relocation is applied to offset 12 of the relocation section.
+ * There are two entries since, according to the PE documentation,
+ * every base relocation block should start on a 32-bit boundary.
+ * There is only one block here, but some loaders incorrectly
+ * require the size to be 32-bit aligned even for the last block.
*/
- put_unaligned_le32(reloc_offset + RELOC_SECTION_SIZE, &buf[reloc_offset]);
+ put_unaligned_le32(current_rva + RELOC_SECTION_SIZE, &buf[reloc_offset]);
put_unaligned_le32(RELOC_SECTION_SIZE, &buf[reloc_offset + 4]);

+ current_rva += reloc_memsz;
+
#ifdef CONFIG_EFI_MIXED
- update_pecoff_section_header(".compat", compat_offset, PECOFF_COMPAT_RESERVE);
+ update_pecoff_section_header(".compat", current_rva, compat_memsz,
+ PECOFF_COMPAT_RESERVE, compat_offset);
+ current_rva += compat_memsz;

/*
* Put the IA-32 machine type (0x14c) and the associated entry point
@@ -316,47 +329,56 @@ static void update_pecoff_setup_and_reloc(unsigned int size)
buf[compat_offset] = 0x1;
buf[compat_offset + 1] = 0x8;
put_unaligned_le16(IMAGE_FILE_MACHINE_I386, &buf[compat_offset + 2]);
- put_unaligned_le32(efi32_pe_entry + size, &buf[compat_offset + 4]);
+ put_unaligned_le32(efi32_pe_entry + setup_size, &buf[compat_offset + 4]);
#endif
+ return current_rva;
}

-static unsigned int update_pecoff_sections(unsigned int text_start, unsigned int text_sz,
- unsigned int init_sz)
+static void update_pecoff_sections(unsigned int setup_sz,
+ unsigned int text_sz,
+ unsigned int file_sz,
+ unsigned int image_sz)
{
- unsigned int file_sz = text_start + text_sz;
- unsigned int bss_sz = init_sz - file_sz;
+ unsigned int current_rva;
pe_opt_hdr *hdr = get_pe_opt_header(buf);

- /*
- * The PE/COFF loader may load the image at an address which is
- * misaligned with respect to the kernel_alignment field in the setup
- * header.
- *
- * In order to avoid relocating the kernel to correct the misalignment,
- * add slack to allow the buffer to be aligned within the declared size
- * of the image.
- */
- bss_sz += CONFIG_PHYSICAL_ALIGN;
- init_sz += CONFIG_PHYSICAL_ALIGN;
+ current_rva = update_pecoff_reloc_and_compat(setup_sz);

- /*
- * Size of code: Subtract the size of the first sector (512 bytes)
- * which includes the header.
- */
- put_unaligned_le32(file_sz - SECTOR_SIZE + bss_sz, &hdr->text_size);
+ /* Update sizes inside PE header: */

- /* Size of image */
- put_unaligned_le32(init_sz, &hdr->image_size);
+ /* Text size */
+ put_unaligned_le32(text_sz, &hdr->text_size);

/*
- * Address of entry point for PE/COFF executable
+ * Initialized data size.
+ * Consider .compat, .reloc and .data sections as data
*/
- put_unaligned_le32(text_start + efi_pe_entry, &hdr->entry_point);
+ put_unaligned_le32(current_rva - BASE_RVA + file_sz - text_sz,
+ &hdr->data_size);
+
+ /* Uninitialized data size */
+ put_unaligned_le32(image_sz - file_sz, &hdr->bss_size);
+
+ /* Total image size. Consider all sections and headers. */
+ put_unaligned_le32(current_rva + image_sz, &hdr->image_size);
+
+ /* Address of entry point for PE/COFF executable */
+ put_unaligned_le32(current_rva + efi_pe_entry, &hdr->entry_point);
+
+ /* Base of the text section */
+ put_unaligned_le32(current_rva, &hdr->code_base);

- update_pecoff_section_header_fields(".text", text_start, text_sz + bss_sz,
- text_sz, text_start);
+ /* Update PE section offsets: */
+
+ /* Text section */
+ update_pecoff_section_header(".text", current_rva, text_sz,
+ text_sz, setup_sz);
+ current_rva += text_sz;
+
+ /* Data section */
+ update_pecoff_section_header(".data", current_rva, image_sz - text_sz,
+ file_sz - text_sz, setup_sz + text_sz);

- return text_start + file_sz;
}

static int reserve_pecoff_reloc_section(int c)
@@ -397,10 +419,10 @@ static void efi_stub_entry_update(void)

#else

-static inline void update_pecoff_setup_and_reloc(unsigned int size) {}
-static inline void update_pecoff_text(unsigned int text_start,
- unsigned int file_sz,
- unsigned int init_sz) {}
+static inline void update_pecoff_sections(unsigned int setup_sz,
+ unsigned int text_sz,
+ unsigned int file_sz,
+ unsigned int image_sz) {}
static inline void efi_stub_update_defaults(void) {}
static inline void efi_stub_entry_update(void) {}

@@ -447,7 +469,7 @@ static void parse_zoffset(char *fname)
PARSE_ZOFS(p, efi_boot_params);
PARSE_ZOFS(p, kernel_info);
PARSE_ZOFS(p, startup_64);
- PARSE_ZOFS(p, _ehead);
+ PARSE_ZOFS(p, _data);
PARSE_ZOFS(p, _end);

p = strchr(p, '\n');
@@ -506,7 +528,6 @@ int main(int argc, char **argv)
size_t kern_file_size;
unsigned int setup_size;
unsigned int setup_sectors;
- unsigned int init_size;
unsigned int total_size;
unsigned int kern_size;
void *kernel;
@@ -534,7 +555,6 @@ int main(int argc, char **argv)
#ifdef CONFIG_EFI_STUB
/* PE specification require 512-byte minimum section file alignment */
kern_size = round_up(kern_file_size + 4, FILE_ALIGNMENT);
- update_pecoff_setup_and_reloc(setup_size);
#else
/* Number of 16-byte paragraphs, including space for a 4-byte CRC */
kern_size = round_up(kern_file_size + 4, PARAGRAPH_SIZE);
@@ -547,36 +567,12 @@ int main(int argc, char **argv)
/* Update kernel_info_offset. */
put_unaligned_le32(kernel_info, &setup_header[0x77]);

- init_size = get_unaligned_le32(&setup_header[0x6F]);
-
#ifdef CONFIG_EFI_STUB
- /*
- * The decompression buffer will start at ImageBase. When relocating
- * the compressed kernel to its end, we must ensure that the head
- * section does not get overwritten. The head section occupies
- * [i, i + _ehead), and the destination is [init_sz - _end, init_sz).
- *
- * At present these should never overlap, because 'i' is at most 32k
- * because of SETUP_SECT_MAX, '_ehead' is less than 1k, and the
- * calculation of INIT_SIZE in boot/header.S ensures that
- * 'init_sz - _end' is at least 64k.
- *
- * For future-proofing, increase init_sz if necessary.
- */
-
- if (init_size - _end < setup_size + _ehead) {
- init_size = round_up(setup_size + _ehead + _end, SECTION_ALIGNMENT);
- put_unaligned_le32(init_size, &setup_header[0x6F]);
- }
-
- total_size = update_pecoff_sections(setup_size, kern_size, init_size);
-
+ update_pecoff_sections(setup_size, _data, kern_size, _end);
efi_stub_entry_update();
-#else
- (void)init_size;
- total_size = setup_size + kern_size;
#endif

+ total_size = setup_size + kern_size;
output = map_output_file(argv[4], total_size);

memcpy(output, buf, setup_size);
--
2.39.2


2023-03-14 11:02:26

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 26/27] efi/libstub: make memory protection warnings include newlines.

From: Peter Jones <[email protected]>

efi_warn() doesn't append newlines to messages, and warnings that lack
newlines are hard to read.

Signed-off-by: Peter Jones <[email protected]>
---
drivers/firmware/efi/libstub/mem.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/libstub/mem.c b/drivers/firmware/efi/libstub/mem.c
index 77cf745ade0d..d58b552739ed 100644
--- a/drivers/firmware/efi/libstub/mem.c
+++ b/drivers/firmware/efi/libstub/mem.c
@@ -298,7 +298,7 @@ efi_status_t efi_adjust_memory_range_protection(unsigned long start,
rounded_end - rounded_start,
attr_clear);
if (status != EFI_SUCCESS) {
- efi_warn("Failed to clear memory attributes at [%08lx,%08lx]: %lx",
+ efi_warn("Failed to clear memory attributes at [%08lx,%08lx]: %lx\n",
(unsigned long)rounded_start,
(unsigned long)rounded_end,
status);
@@ -311,7 +311,7 @@ efi_status_t efi_adjust_memory_range_protection(unsigned long start,
rounded_end - rounded_start,
attributes);
if (status != EFI_SUCCESS) {
- efi_warn("Failed to set memory attributes at [%08lx,%08lx]: %lx",
+ efi_warn("Failed to set memory attributes at [%08lx,%08lx]: %lx\n",
(unsigned long)rounded_start,
(unsigned long)rounded_end,
status);
--
2.39.2


2023-03-14 11:02:28

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 23/27] efi/libstub: Don't set ramdisk_image/ramdisk_size

The local copy of boot_params made at build time is now used, so
setting the ramdisk_image/ramdisk_size fields is no longer needed:
they are already set to 0.

Remove the no-longer-required assignments.

Signed-off-by: Evgeniy Baskov <[email protected]>
---
drivers/firmware/efi/libstub/x86-stub.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 5dbc9c7a4aa3..7c5561aaba71 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -389,9 +389,6 @@ efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
efi_set_u64_split((unsigned long)cmdline_ptr, &hdr->cmd_line_ptr,
&efi_boot_params.ext_cmd_line_ptr);

- hdr->ramdisk_image = 0;
- hdr->ramdisk_size = 0;
-
efi_stub_entry(handle, sys_table_arg, &efi_boot_params);
/* not reached */

--
2.39.2


2023-03-14 11:02:32

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 21/27] x86/build: Add SETUP_HEADER_OFFSET constant

Add and use a SETUP_HEADER_OFFSET constant in tools/build.c for
readability. It equals the offset of the setup header within
struct boot_params (and within the kernel image).
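
For reference, the offsets used against setup_header in this patch
line up with the boot protocol layout; a standalone sketch (offsets
per Documentation/x86/boot.rst, not part of the patch):

	#include <assert.h>

	#define SETUP_HEADER_OFFSET 0x1f1

	int main(void)
	{
		/* setup_header[off] is the same byte as buf[SETUP_HEADER_OFFSET + off] */
		assert(SETUP_HEADER_OFFSET + 0x00 == 0x1f1); /* setup_sects */
		assert(SETUP_HEADER_OFFSET + 0x03 == 0x1f4); /* syssize */
		assert(SETUP_HEADER_OFFSET + 0x0B == 0x1fc); /* root_dev */
		assert(SETUP_HEADER_OFFSET + 0x0D == 0x1fe); /* boot_flag (0xAA55) */
		assert(SETUP_HEADER_OFFSET + 0x6F == 0x260); /* init_size */
		assert(SETUP_HEADER_OFFSET + 0x77 == 0x268); /* kernel_info_offset */
		return 0;
	}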

Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/tools/build.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/boot/tools/build.c b/arch/x86/boot/tools/build.c
index 84d5a5cc7756..476ef05f16fb 100644
--- a/arch/x86/boot/tools/build.c
+++ b/arch/x86/boot/tools/build.c
@@ -51,6 +51,8 @@ typedef unsigned int u32;
#define SETUP_SECT_MIN 5
#define SETUP_SECT_MAX 64

+#define SETUP_HEADER_OFFSET 0x1f1
+
#define PARAGRAPH_SIZE 16
#define SECTOR_SIZE 512
#define FILE_ALIGNMENT 512
@@ -473,7 +475,7 @@ static unsigned int read_setup(char *path)
if (file_size < 2 * SECTOR_SIZE)
die("The setup must be at least 1024 bytes");

- if (get_unaligned_le16(&buf[SECTOR_SIZE - 2]) != 0xAA55)
+ if (get_unaligned_le16(&buf[SETUP_HEADER_OFFSET + 0xD]) != 0xAA55)
die("Boot block hasn't got boot flag (0xAA55)");

fclose(file);
@@ -509,6 +511,7 @@ int main(int argc, char **argv)
unsigned int kern_size;
void *kernel;
u32 crc = 0xffffffffUL;
+ u8 *setup_header;
u8 *output;

if (argc != 5)
@@ -520,9 +523,10 @@ int main(int argc, char **argv)
setup_size = read_setup(argv[1]);

setup_sectors = setup_size/SECTOR_SIZE;
+ setup_header = buf + SETUP_HEADER_OFFSET;

/* Set the default root device */
- put_unaligned_le16(DEFAULT_ROOT_DEV, &buf[508]);
+ put_unaligned_le16(DEFAULT_ROOT_DEV, &setup_header[0xB]);

/* Map kernel file to memory */
kernel = map_file(argv[2], &kern_file_size);
@@ -537,13 +541,13 @@ int main(int argc, char **argv)
#endif

/* Patch the setup code with the appropriate size parameters */
- buf[0x1f1] = setup_sectors - 1;
- put_unaligned_le32(kern_size/PARAGRAPH_SIZE, &buf[0x1f4]);
+ setup_header[0] = setup_sectors - 1;
+ put_unaligned_le32(kern_size/PARAGRAPH_SIZE, &setup_header[3]);

- /* Update kernel_info offset. */
- put_unaligned_le32(kernel_info, &buf[0x268]);
+ /* Update kernel_info_offset. */
+ put_unaligned_le32(kernel_info, &setup_header[0x77]);

- init_size = get_unaligned_le32(&buf[0x260]);
+ init_size = get_unaligned_le32(&setup_header[0x6F]);

#ifdef CONFIG_EFI_STUB
/*
@@ -562,7 +566,7 @@ int main(int argc, char **argv)

if (init_size - _end < setup_size + _ehead) {
init_size = round_up(setup_size + _ehead + _end, SECTION_ALIGNMENT);
- put_unaligned_le32(init_size, &buf[0x260]);
+ put_unaligned_le32(init_size, &setup_header[0x6F]);
}

total_size = update_pecoff_sections(setup_size, kern_size, init_size);
@@ -581,8 +585,9 @@ int main(int argc, char **argv)

#ifdef CONFIG_EFI_STUB
/* Copy the setup header */
- memcpy(output + setup_size + efi_boot_params + 0x1f1, &buf[0x1f1],
- 0x290 - 0x1f1 /* == max possible sizeof(struct setup_header) */);
+ memcpy(output + setup_size + efi_boot_params + SETUP_HEADER_OFFSET,
+ setup_header, 0x290 - SETUP_HEADER_OFFSET
+ /* == max possible sizeof(struct setup_header) */);
#endif

/* Calculate and write kernel checksum. */
--
2.39.2


2023-03-14 11:02:35

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 25/27] efi/libstub: Use memory attribute protocol

Add EFI_MEMORY_ATTRIBUTE_PROTOCOL as preferred alternative to DXE
services for changing memory attributes in the EFISTUB.

Use DXE services only as a fallback in case the aforementioned protocol
is not supported by the UEFI implementation. DXE services are still not
used for applying stricter attributes, only more relaxed ones, unlike
EFI_MEMORY_ATTRIBUTE_PROTOCOL, which, when available, is also used for
applying stricter attributes.

Move the DXE services initialization code closer to the place where it
is used, to match the EFI_MEMORY_ATTRIBUTE_PROTOCOL initialization code.
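
To make the set/clear split concrete: unlike the DXE services call,
the protocol only applies the bits passed to set_memory_attributes, so
bits that must end up cleared have to be cleared explicitly. Below is
a standalone model of the clear-mask computation used in
efi_adjust_memory_range_protection() (not part of the patch;
EFI_MEMORY_* values as in include/linux/efi.h):

	#include <assert.h>
	#include <stdint.h>

	#define EFI_MEMORY_RP 0x0000000000002000ULL /* read-protect */
	#define EFI_MEMORY_XP 0x0000000000004000ULL /* execute-protect */
	#define EFI_MEMORY_RO 0x0000000000020000ULL /* write-protect */

	/* any managed bit that is not requested must be cleared */
	static uint64_t attrs_to_clear(uint64_t attributes)
	{
		return ~attributes & (EFI_MEMORY_RO | EFI_MEMORY_XP | EFI_MEMORY_RP);
	}

	int main(void)
	{
		/* requesting read-only code (RO) must clear XP and RP... */
		assert(attrs_to_clear(EFI_MEMORY_RO) == (EFI_MEMORY_XP | EFI_MEMORY_RP));
		/* ...and requesting data (XP) must clear RO and RP */
		assert(attrs_to_clear(EFI_MEMORY_XP) == (EFI_MEMORY_RO | EFI_MEMORY_RP));
		return 0;
	}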

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>

Warnings in the previous version were
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/
---
drivers/firmware/efi/libstub/efistub.h | 5 +-
drivers/firmware/efi/libstub/mem.c | 170 ++++++++++++++++++------
drivers/firmware/efi/libstub/x86-stub.c | 8 --
3 files changed, 133 insertions(+), 50 deletions(-)

diff --git a/drivers/firmware/efi/libstub/efistub.h b/drivers/firmware/efi/libstub/efistub.h
index 01ae165731a5..d679d881647b 100644
--- a/drivers/firmware/efi/libstub/efistub.h
+++ b/drivers/firmware/efi/libstub/efistub.h
@@ -40,9 +40,6 @@ extern bool efi_novamap;

extern const efi_system_table_t *efi_system_table;

-typedef union efi_dxe_services_table efi_dxe_services_table_t;
-extern const efi_dxe_services_table_t *efi_dxe_table;
-
efi_status_t __efiapi efi_pe_entry(efi_handle_t handle,
efi_system_table_t *sys_table_arg);

@@ -392,6 +389,8 @@ typedef struct {
void *device_handle;
} efi_gcd_memory_space_desc_t;

+typedef union efi_dxe_services_table efi_dxe_services_table_t;
+
/*
* EFI DXE Services table
*/
diff --git a/drivers/firmware/efi/libstub/mem.c b/drivers/firmware/efi/libstub/mem.c
index 134f17d078d1..77cf745ade0d 100644
--- a/drivers/firmware/efi/libstub/mem.c
+++ b/drivers/firmware/efi/libstub/mem.c
@@ -5,6 +5,9 @@

#include "efistub.h"

+static const efi_dxe_services_table_t *efi_dxe_table;
+static efi_memory_attribute_protocol_t *efi_mem_attrib_proto;
+
/**
* efi_get_memory_map() - get memory map
* @map: pointer to memory map pointer to which to assign the
@@ -129,53 +132,33 @@ void efi_free(unsigned long size, unsigned long addr)
efi_bs_call(free_pages, addr, nr_pages);
}

-/**
- * efi_adjust_memory_range_protection() - change memory range protection attributes
- * @start: memory range start address
- * @size: memory range size
- *
- * Actual memory range for which memory attributes are modified is
- * the smallest ranged with start address and size aligned to EFI_PAGE_SIZE
- * that includes [start, start + size].
- *
- * @return: status code
- */
-efi_status_t efi_adjust_memory_range_protection(unsigned long start,
- unsigned long size,
- unsigned long attributes)
+static void retrieve_dxe_table(void)
+{
+ efi_dxe_table = get_efi_config_table(EFI_DXE_SERVICES_TABLE_GUID);
+ if (efi_dxe_table &&
+ efi_dxe_table->hdr.signature != EFI_DXE_SERVICES_TABLE_SIGNATURE) {
+ efi_warn("Ignoring DXE services table: invalid signature\n");
+ efi_dxe_table = NULL;
+ }
+}
+
+static efi_status_t adjust_mem_attrib_dxe(efi_physical_addr_t rounded_start,
+ efi_physical_addr_t rounded_end,
+ unsigned long attributes)
{
efi_status_t status;
efi_gcd_memory_space_desc_t desc;
- efi_physical_addr_t end, next;
- efi_physical_addr_t rounded_start, rounded_end;
+ efi_physical_addr_t end, next, start;
efi_physical_addr_t unprotect_start, unprotect_size;

- if (efi_dxe_table == NULL)
- return EFI_UNSUPPORTED;
+ if (!efi_dxe_table) {
+ retrieve_dxe_table();

- /*
- * This function should not be used to modify attributes
- * other than writable/executable.
- */
-
- if ((attributes & ~(EFI_MEMORY_RO | EFI_MEMORY_XP)) != 0)
- return EFI_INVALID_PARAMETER;
-
- rounded_start = rounddown(start, EFI_PAGE_SIZE);
- rounded_end = roundup(start + size, EFI_PAGE_SIZE);
-
- /*
- * Disallow simultaniously executable and writable memory
- * to inforce W^X policy if direct extraction code is enabled.
- */
-
- if ((attributes & (EFI_MEMORY_RO | EFI_MEMORY_XP)) == 0) {
- efi_warn("W^X violation at [%08lx,%08lx]\n",
- (unsigned long)rounded_start,
- (unsigned long)rounded_end);
+ if (!efi_dxe_table)
+ return EFI_UNSUPPORTED;
}

- for (end = start + size; start < end; start = next) {
+ for (start = rounded_start, end = rounded_end; start < end; start = next) {

status = efi_dxe_call(get_memory_space_descriptor,
start, &desc);
@@ -227,3 +210,112 @@ efi_status_t efi_adjust_memory_range_protection(unsigned long start,

return EFI_SUCCESS;
}
+
+static void retrieve_memory_attributes_proto(void)
+{
+ efi_status_t status;
+ efi_guid_t guid = EFI_MEMORY_ATTRIBUTE_PROTOCOL_GUID;
+
+ status = efi_bs_call(locate_protocol, &guid, NULL,
+ (void **)&efi_mem_attrib_proto);
+ if (status != EFI_SUCCESS)
+ efi_mem_attrib_proto = NULL;
+}
+
+/**
+ * efi_adjust_memory_range_protection() - change memory range protection attributes
+ * @start: memory range start address
+ * @size: memory range size
+ * @attributes: memory range attributes
+ *
+ * The only supported memory attributes are EFI_MEMORY_RO and EFI_MEMORY_XP.
+ *
+ * Actual memory range for which memory attributes are modified is
+ * the smallest range with start address and size aligned to EFI_PAGE_SIZE
+ * that includes [start, start + size].
+ *
+ * This function first attempts to use EFI_MEMORY_ATTRIBUTE_PROTOCOL,
+ * which has been part of the UEFI Specification since version 2.10.
+ *
+ * If the protocol is unavailable it falls back to DXE services functions,
+ * in which case it does not apply stricter memory attributes, only relaxed
+ * ones.
+ *
+ * @return: status code
+ */
+efi_status_t efi_adjust_memory_range_protection(unsigned long start,
+ unsigned long size,
+ unsigned long attributes)
+{
+ efi_status_t status;
+ efi_physical_addr_t rounded_start, rounded_end;
+ unsigned long attr_clear;
+
+ /*
+ * This function should not be used to modify attributes
+ * other than writable/executable.
+ */
+
+ if ((attributes & ~(EFI_MEMORY_RO | EFI_MEMORY_XP)) != 0)
+ return EFI_INVALID_PARAMETER;
+
+ rounded_start = rounddown(start, EFI_PAGE_SIZE);
+ rounded_end = roundup(start + size, EFI_PAGE_SIZE);
+
+ /*
+ * Warn if requested to make memory simultaneously
+ * executable and writable to enforce W^X policy.
+ */
+
+ if ((attributes & (EFI_MEMORY_RO | EFI_MEMORY_XP)) == 0) {
+ efi_warn("W^X violation at [%08lx,%08lx]",
+ (unsigned long)rounded_start,
+ (unsigned long)rounded_end);
+ }
+
+ if (!efi_mem_attrib_proto) {
+ retrieve_memory_attributes_proto();
+
+ /* Fall back to DXE services if unsupported */
+ if (!efi_mem_attrib_proto) {
+ return adjust_mem_attrib_dxe(rounded_start,
+ rounded_end,
+ attributes);
+ }
+ }
+
+ /*
+ * Unlike DXE services, EFI_MEMORY_ATTRIBUTE_PROTOCOL does not clear
+ * unset protection bits, so they need to be cleared explicitly.
+ */
+
+ attr_clear = ~attributes &
+ (EFI_MEMORY_RO | EFI_MEMORY_XP | EFI_MEMORY_RP);
+
+ status = efi_call_proto(efi_mem_attrib_proto,
+ clear_memory_attributes,
+ rounded_start,
+ rounded_end - rounded_start,
+ attr_clear);
+ if (status != EFI_SUCCESS) {
+ efi_warn("Failed to clear memory attributes at [%08lx,%08lx]: %lx",
+ (unsigned long)rounded_start,
+ (unsigned long)rounded_end,
+ status);
+ return status;
+ }
+
+ status = efi_call_proto(efi_mem_attrib_proto,
+ set_memory_attributes,
+ rounded_start,
+ rounded_end - rounded_start,
+ attributes);
+ if (status != EFI_SUCCESS) {
+ efi_warn("Failed to set memory attributes at [%08lx,%08lx]: %lx",
+ (unsigned long)rounded_start,
+ (unsigned long)rounded_end,
+ status);
+ }
+
+ return status;
+}
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 7c5561aaba71..9394ac319d18 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -23,7 +23,6 @@
#define MAXMEM_X86_64_4LEVEL (1ull << 46)

const efi_system_table_t *efi_system_table;
-const efi_dxe_services_table_t *efi_dxe_table;
u32 image_offset __section(".data");
static efi_loaded_image_t *image __section(".data");

@@ -643,13 +642,6 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
if (efi_system_table->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE)
efi_exit(handle, EFI_INVALID_PARAMETER);

- efi_dxe_table = get_efi_config_table(EFI_DXE_SERVICES_TABLE_GUID);
- if (efi_dxe_table &&
- efi_dxe_table->hdr.signature != EFI_DXE_SERVICES_TABLE_SIGNATURE) {
- efi_warn("Ignoring DXE services table: invalid signature\n");
- efi_dxe_table = NULL;
- }
-
#ifdef CONFIG_CMDLINE_BOOL
status = efi_parse_options(CONFIG_CMDLINE);
if (status != EFI_SUCCESS) {
--
2.39.2


2023-03-14 11:08:17

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 18/27] tools/include: Add simplified version of pe.h

This is needed to remove magic numbers from the x86 bzImage build tool
(arch/x86/boot/tools/build.c).
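
For context, MZ_HEADER_PEADDR_OFFSET is how a consumer finds the PE
header inside a bzImage-style buffer. A standalone sketch (not part of
the patch; assumes a little-endian host, as the build tool effectively
does via get_unaligned_le32()):

	#include <assert.h>
	#include <stdint.h>
	#include <string.h>

	#define MZ_HEADER_PEADDR_OFFSET 0x3c /* as defined in the new header */

	/* read the 32-bit PE header offset stored in the MZ stub */
	static uint32_t pe_header_offset(const uint8_t *image)
	{
		uint32_t off;

		memcpy(&off, image + MZ_HEADER_PEADDR_OFFSET, sizeof(off));
		return off;
	}

	int main(void)
	{
		uint8_t image[0x40 + 4] = { 'M', 'Z' };

		/* pretend the PE header lives right after the MZ stub */
		image[MZ_HEADER_PEADDR_OFFSET] = 0x40;
		assert(pe_header_offset(image) == 0x40);
		return 0;
	}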

Tested-by: Mario Limonciello <[email protected]>
Signed-off-by: Evgeniy Baskov <[email protected]>
---
tools/include/linux/pe.h | 150 +++++++++++++++++++++++++++++++++++++++
1 file changed, 150 insertions(+)
create mode 100644 tools/include/linux/pe.h

diff --git a/tools/include/linux/pe.h b/tools/include/linux/pe.h
new file mode 100644
index 000000000000..41c09ec371d8
--- /dev/null
+++ b/tools/include/linux/pe.h
@@ -0,0 +1,150 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Simplified version of include/linux/pe.h:
+ * Copyright 2011 Red Hat, Inc. All rights reserved.
+ * Author(s): Peter Jones <[email protected]>
+ */
+#ifndef __LINUX_PE_H
+#define __LINUX_PE_H
+
+#include <linux/types.h>
+
+#define IMAGE_FILE_MACHINE_I386 0x014c
+
+#define IMAGE_SCN_CNT_CODE 0x00000020 /* .text */
+#define IMAGE_SCN_CNT_INITIALIZED_DATA 0x00000040 /* .data */
+#define IMAGE_SCN_ALIGN_4096BYTES 0x00d00000
+#define IMAGE_SCN_MEM_DISCARDABLE 0x02000000 /* scn can be discarded */
+#define IMAGE_SCN_MEM_EXECUTE 0x20000000 /* can be executed as code */
+#define IMAGE_SCN_MEM_READ 0x40000000 /* readable */
+#define IMAGE_SCN_MEM_WRITE 0x80000000 /* writeable */
+
+#define MZ_HEADER_PEADDR_OFFSET 0x3c
+
+struct pe_hdr {
+ uint32_t magic; /* PE magic */
+ uint16_t machine; /* machine type */
+ uint16_t sections; /* number of sections */
+ uint32_t timestamp; /* time_t */
+ uint32_t symbol_table; /* symbol table offset */
+ uint32_t symbols; /* number of symbols */
+ uint16_t opt_hdr_size; /* size of optional header */
+ uint16_t flags; /* flags */
+};
+
+/* the fact that pe32 isn't padded where pe32+ is 64-bit means union won't
+ * work right. vomit. */
+struct pe32_opt_hdr {
+ /* "standard" header */
+ uint16_t magic; /* file type */
+ uint8_t ld_major; /* linker major version */
+ uint8_t ld_minor; /* linker minor version */
+ uint32_t text_size; /* size of text section(s) */
+ uint32_t data_size; /* size of data section(s) */
+ uint32_t bss_size; /* size of bss section(s) */
+ uint32_t entry_point; /* file offset of entry point */
+ uint32_t code_base; /* relative code addr in ram */
+ uint32_t data_base; /* relative data addr in ram */
+ /* "windows" header */
+ uint32_t image_base; /* preferred load address */
+ uint32_t section_align; /* alignment in bytes */
+ uint32_t file_align; /* file alignment in bytes */
+ uint16_t os_major; /* major OS version */
+ uint16_t os_minor; /* minor OS version */
+ uint16_t image_major; /* major image version */
+ uint16_t image_minor; /* minor image version */
+ uint16_t subsys_major; /* major subsystem version */
+ uint16_t subsys_minor; /* minor subsystem version */
+ uint32_t win32_version; /* reserved, must be 0 */
+ uint32_t image_size; /* image size */
+ uint32_t header_size; /* header size rounded up to
+ file_align */
+ uint32_t csum; /* checksum */
+ uint16_t subsys; /* subsystem */
+ uint16_t dll_flags; /* more flags! */
+ uint32_t stack_size_req;/* amt of stack requested */
+ uint32_t stack_size; /* amt of stack required */
+ uint32_t heap_size_req; /* amt of heap requested */
+ uint32_t heap_size; /* amt of heap required */
+ uint32_t loader_flags; /* reserved, must be 0 */
+ uint32_t data_dirs; /* number of data dir entries */
+};
+
+struct pe32plus_opt_hdr {
+ uint16_t magic; /* file type */
+ uint8_t ld_major; /* linker major version */
+ uint8_t ld_minor; /* linker minor version */
+ uint32_t text_size; /* size of text section(s) */
+ uint32_t data_size; /* size of data section(s) */
+ uint32_t bss_size; /* size of bss section(s) */
+ uint32_t entry_point; /* file offset of entry point */
+ uint32_t code_base; /* relative code addr in ram */
+ /* "windows" header */
+ uint64_t image_base; /* preferred load address */
+ uint32_t section_align; /* alignment in bytes */
+ uint32_t file_align; /* file alignment in bytes */
+ uint16_t os_major; /* major OS version */
+ uint16_t os_minor; /* minor OS version */
+ uint16_t image_major; /* major image version */
+ uint16_t image_minor; /* minor image version */
+ uint16_t subsys_major; /* major subsystem version */
+ uint16_t subsys_minor; /* minor subsystem version */
+ uint32_t win32_version; /* reserved, must be 0 */
+ uint32_t image_size; /* image size */
+ uint32_t header_size; /* header size rounded up to
+ file_align */
+ uint32_t csum; /* checksum */
+ uint16_t subsys; /* subsystem */
+ uint16_t dll_flags; /* more flags! */
+ uint64_t stack_size_req;/* amt of stack requested */
+ uint64_t stack_size; /* amt of stack required */
+ uint64_t heap_size_req; /* amt of heap requested */
+ uint64_t heap_size; /* amt of heap required */
+ uint32_t loader_flags; /* reserved, must be 0 */
+ uint32_t data_dirs; /* number of data dir entries */
+};
+
+struct data_dirent {
+ uint32_t virtual_address; /* relative to load address */
+ uint32_t size;
+};
+
+struct data_directory {
+ struct data_dirent exports; /* .edata */
+ struct data_dirent imports; /* .idata */
+ struct data_dirent resources; /* .rsrc */
+ struct data_dirent exceptions; /* .pdata */
+ struct data_dirent certs; /* certs */
+ struct data_dirent base_relocations; /* .reloc */
+ struct data_dirent debug; /* .debug */
+ struct data_dirent arch; /* reserved */
+ struct data_dirent global_ptr; /* global pointer reg. Size=0 */
+ struct data_dirent tls; /* .tls */
+ struct data_dirent load_config; /* load configuration structure */
+ struct data_dirent bound_imports; /* no idea */
+ struct data_dirent import_addrs; /* import address table */
+ struct data_dirent delay_imports; /* delay-load import table */
+ struct data_dirent clr_runtime_hdr; /* .cor (object only) */
+ struct data_dirent reserved;
+};
+
+struct section_header {
+ char name[8]; /* name or "/12\0" string tbl offset */
+ uint32_t virtual_size; /* size of loaded section in ram */
+ uint32_t virtual_address; /* relative virtual address */
+ uint32_t raw_data_size; /* size of the section */
+ uint32_t data_addr; /* file pointer to first page of sec */
+ uint32_t relocs; /* file pointer to relocation entries */
+ uint32_t line_numbers; /* line numbers! */
+ uint16_t num_relocs; /* number of relocations */
+ uint16_t num_lin_numbers; /* srsly. */
+ uint32_t flags;
+};
+
+struct coff_reloc {
+ uint32_t virtual_address;
+ uint32_t symbol_table_index;
+ uint16_t data;
+};
+
+#endif /* __LINUX_PE_H */
--
2.39.2


2023-03-14 11:08:21

by Evgeniy Baskov

[permalink] [raw]
Subject: [PATCH v5 15/27] efi/x86: Support extracting kernel from libstub

Extracting the kernel directly from the EFI environment allows setting
up stricter memory attributes, simplifies the boot code path, and
removes a potential relocation of the kernel image.

Wire up the required interfaces and minimally initialize the zero page
fields needed for it to function correctly.
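
For orientation, the mapping callback translates MAP_* flags into EFI
memory attributes roughly as modeled below (a standalone sketch of the
attribute logic in do_map_range() from this patch; the MAP_* values
here are made-up placeholders, the real ones live in
asm/shared/extract.h):

	#include <assert.h>

	#define MAP_WRITE 0x1 /* hypothetical value */
	#define MAP_EXEC  0x2 /* hypothetical value */
	#define EFI_MEMORY_XP 0x0000000000004000ULL
	#define EFI_MEMORY_RO 0x0000000000020000ULL

	static unsigned long long map_flags_to_attr(unsigned int flags)
	{
		unsigned long long attr = 0;

		if (!(flags & MAP_EXEC))
			attr |= EFI_MEMORY_XP; /* not executable */
		if (!(flags & MAP_WRITE))
			attr |= EFI_MEMORY_RO; /* not writable */
		return attr;
	}

	int main(void)
	{
		/* W^X: a mapping may be writable or executable, never both */
		assert(map_flags_to_attr(MAP_WRITE) == EFI_MEMORY_XP);
		assert(map_flags_to_attr(MAP_EXEC) == EFI_MEMORY_RO);
		return 0;
	}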

Signed-off-by: Evgeniy Baskov <[email protected]>
---
arch/x86/boot/compressed/head_32.S | 40 +++-
arch/x86/boot/compressed/head_64.S | 58 ++++-
drivers/firmware/efi/Kconfig | 2 +
drivers/firmware/efi/libstub/Makefile | 2 +-
.../firmware/efi/libstub/x86-extract-direct.c | 217 ++++++++++++++++++
drivers/firmware/efi/libstub/x86-stub.c | 119 +---------
drivers/firmware/efi/libstub/x86-stub.h | 14 ++
7 files changed, 337 insertions(+), 115 deletions(-)
create mode 100644 drivers/firmware/efi/libstub/x86-extract-direct.c
create mode 100644 drivers/firmware/efi/libstub/x86-stub.h

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index 15545781ea48..49df36f78b62 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -152,11 +152,47 @@ SYM_FUNC_END(startup_32)

#ifdef CONFIG_EFI_STUB
SYM_FUNC_START(efi32_stub_entry)
+
+ /* Clear BSS */
+ call 1f
+1: popl %ebx
+ xorl %eax, %eax
+ leal (_bss - 1b)(%ebx), %edi
+ leal (_ebss - 1b)(%ebx), %ecx
+ sub %edi, %ecx
+ shrl $2, %ecx
+ rep stosl
+
add $0x4, %esp
movl 8(%esp), %esi /* save boot_params pointer */
call efi_main
- /* efi_main returns the possibly relocated address of startup_32 */
- jmp *%eax
+ movl %eax, %ecx
+
+ /*
+ * efi_main returns the possibly relocated address of the
+ * extracted kernel entry point.
+ */
+
+ cli
+
+ /* Load new GDT */
+ leal (gdt - 1b)(%ebx), %eax
+ movl %eax, 2(%eax)
+ lgdt (%eax)
+
+ /* Load segment registers with our descriptors */
+ movl $__BOOT_DS, %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %fs
+ movl %eax, %gs
+ movl %eax, %ss
+
+ /* Zero EFLAGS */
+ pushl $0
+ popfl
+
+ jmp *%ecx
SYM_FUNC_END(efi32_stub_entry)
SYM_FUNC_ALIAS(efi_stub_entry, efi32_stub_entry)
#endif
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 8af7835de3b1..554015d21950 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -529,12 +529,64 @@ SYM_CODE_END(startup_64)
.org 0x390
#endif
SYM_FUNC_START(efi64_stub_entry)
+ /* Preserve first parameter */
+ movq %rdi, %r10
+
+ /* Clear BSS */
+ xorl %eax, %eax
+ leaq _bss(%rip), %rdi
+ leaq _ebss(%rip), %rcx
+ subq %rdi, %rcx
+ shrq $3, %rcx
+ rep stosq
+
and $~0xf, %rsp /* realign the stack */
movq %rdx, %rbx /* save boot_params pointer */
+ movq %r10, %rdi
call efi_main
- movq %rbx,%rsi
- leaq rva(startup_64)(%rax), %rax
- jmp *%rax
+
+ cld
+ cli
+
+ movq %rbx, %rdi /* boot_params */
+ movq %rax, %rsi /* decompressed kernel address */
+
+ /* Make sure we have GDT with 32-bit code segment */
+ leaq gdt64(%rip), %rax
+ addq %rax, 2(%rax)
+ lgdt (%rax)
+
+ /* Setup data segments. */
+ xorl %eax, %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+ movl %eax, %fs
+ movl %eax, %gs
+
+ pushq %rsi
+ pushq %rdi
+
+ call load_stage1_idt
+ call enable_nx_if_supported
+
+ call trampoline_pgtable_init
+ movq %rax, %rdx
+
+ /* Swap %rdi and %rsi */
+ popq %rsi
+ popq %rdi
+
+ /* Save the trampoline address in RCX */
+ movq trampoline_32bit(%rip), %rcx
+
+ /* Switch to compatibility mode (CS.L = 0 CS.D = 1) via far return */
+ pushq $__KERNEL32_CS
+ leaq TRAMPOLINE_32BIT_CODE_OFFSET(%rcx), %rax
+ pushq %rax
+ lretq
+
SYM_FUNC_END(efi64_stub_entry)
SYM_FUNC_ALIAS(efi_stub_entry, efi64_stub_entry)
#endif
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 043ca31c114e..dc833ca6375b 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -58,6 +58,8 @@ config EFI_DXE_MEM_ATTRIBUTES
Use DXE services to check and alter memory protection
attributes during boot via EFISTUB to ensure that memory
ranges used by the kernel are writable and executable.
+ This option also enables stricter memory attributes
+ on compressed kernel PE images.

config EFI_PARAMS_FROM_FDT
bool
diff --git a/drivers/firmware/efi/libstub/Makefile b/drivers/firmware/efi/libstub/Makefile
index 80d85a5169fb..53ad6ef27cb3 100644
--- a/drivers/firmware/efi/libstub/Makefile
+++ b/drivers/firmware/efi/libstub/Makefile
@@ -88,7 +88,7 @@ lib-$(CONFIG_EFI_GENERIC_STUB) += efi-stub.o string.o intrinsics.o systable.o \

lib-$(CONFIG_ARM) += arm32-stub.o
lib-$(CONFIG_ARM64) += arm64.o arm64-stub.o smbios.o
-lib-$(CONFIG_X86) += x86-stub.o
+lib-$(CONFIG_X86) += x86-stub.o x86-extract-direct.o
lib-$(CONFIG_RISCV) += riscv.o riscv-stub.o
lib-$(CONFIG_LOONGARCH) += loongarch.o loongarch-stub.o

diff --git a/drivers/firmware/efi/libstub/x86-extract-direct.c b/drivers/firmware/efi/libstub/x86-extract-direct.c
new file mode 100644
index 000000000000..40c56f94428d
--- /dev/null
+++ b/drivers/firmware/efi/libstub/x86-extract-direct.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/acpi.h>
+#include <linux/efi.h>
+#include <linux/elf.h>
+#include <linux/stddef.h>
+
+#include <asm/efi.h>
+#include <asm/e820/types.h>
+#include <asm/desc.h>
+#include <asm/boot.h>
+#include <asm/bootparam_utils.h>
+#include <asm/shared/extract.h>
+#include <asm/shared/pgtable.h>
+
+#include "efistub.h"
+#include "x86-stub.h"
+
+static efi_handle_t image_handle;
+
+static void do_puthex(unsigned long value)
+{
+ efi_printk("%08lx", value);
+}
+
+static void do_putstr(const char *msg)
+{
+ efi_printk("%s", msg);
+}
+
+static unsigned long do_map_range(unsigned long start,
+ unsigned long end,
+ unsigned int flags)
+{
+ efi_status_t status;
+ unsigned long size = end - start;
+
+ /*
+ * When the address for kernel extraction is being selected, the
+ * buffer should be allocated using UEFI services, since boot
+ * services have not yet been exited at this stage. This is
+ * requested by calling this function with the MAP_ALLOC flag.
+ */
+
+ if (flags & MAP_ALLOC) {
+ unsigned long addr;
+
+ status = efi_low_alloc_above(size, CONFIG_PHYSICAL_ALIGN,
+ &addr, start);
+ if (status != EFI_SUCCESS) {
+ efi_err("Unable to allocate memory for uncompressed kernel");
+ efi_exit(image_handle, EFI_OUT_OF_RESOURCES);
+ }
+
+ if (start != addr) {
+ efi_debug("Unable to allocate at given address"
+ " (desired=0x%lx, actual=0x%lx)",
+ (unsigned long)start, addr);
+ start = addr;
+ }
+ }
+
+ if ((flags & (MAP_PROTECT | MAP_ALLOC)) &&
+ IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES)) {
+ unsigned long attr = 0;
+
+ if (!(flags & MAP_EXEC))
+ attr |= EFI_MEMORY_XP;
+
+ if (!(flags & MAP_WRITE))
+ attr |= EFI_MEMORY_RO;
+
+ status = efi_adjust_memory_range_protection(start, size, attr);
+ if (status != EFI_SUCCESS)
+ efi_err("Unable to protect memory range");
+ }
+
+ return start;
+}
+
+/*
+ * Trampoline takes 3 pages and can be loaded in the first megabyte of memory
+ * with its end placed between 0 and 640k where BIOS might start.
+ * (see arch/x86/boot/compressed/pgtable_64.c)
+ */
+
+#ifdef CONFIG_64BIT
+static efi_status_t prepare_trampoline(void)
+{
+ efi_status_t status;
+ unsigned long trampoline_start;
+ void *caddr;
+
+ status = efi_allocate_pages(TRAMPOLINE_32BIT_SIZE,
+ (unsigned long *)&trampoline_32bit,
+ TRAMPOLINE_32BIT_PLACEMENT_MAX);
+
+ if (status != EFI_SUCCESS)
+ return status;
+
+ trampoline_start = (unsigned long)trampoline_32bit;
+
+ memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
+
+ if (IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES)) {
+ /* First page of trampoline is a top level page table */
+ efi_adjust_memory_range_protection(trampoline_start,
+ PAGE_SIZE,
+ EFI_MEMORY_XP);
+ }
+
+ /* Second page of trampoline is the code (with a padding) */
+
+ caddr = (void *)trampoline_32bit + TRAMPOLINE_32BIT_CODE_OFFSET;
+
+ memcpy(caddr, trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
+
+ if (IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES)) {
+ efi_adjust_memory_range_protection((unsigned long)caddr,
+ PAGE_SIZE,
+ EFI_MEMORY_RO);
+
+ /* And the last page of trampoline is the stack */
+
+ efi_adjust_memory_range_protection(trampoline_start + 2 * PAGE_SIZE,
+ PAGE_SIZE,
+ EFI_MEMORY_XP);
+ }
+
+ return EFI_SUCCESS;
+}
+#else
+static inline efi_status_t prepare_trampoline(void)
+{
+ return EFI_SUCCESS;
+}
+#endif
+
+static efi_status_t init_loader_data(efi_handle_t handle,
+ struct boot_params *params,
+ struct efi_boot_memmap **map)
+{
+ struct efi_info *efi = (void *)&params->efi_info;
+ efi_status_t status;
+ const char *signature;
+
+ status = efi_get_memory_map(map, false);
+
+ if (status != EFI_SUCCESS) {
+ efi_err("Unable to get EFI memory map...\n");
+ return status;
+ }
+
+ signature = efi_is_64bit() ? EFI64_LOADER_SIGNATURE
+ : EFI32_LOADER_SIGNATURE;
+
+ memcpy(&efi->efi_loader_signature, signature, sizeof(__u32));
+
+ efi->efi_memdesc_size = (*map)->desc_size;
+ efi->efi_memdesc_version = (*map)->desc_ver;
+ efi->efi_memmap_size = (*map)->map_size;
+
+ efi_set_u64_split((unsigned long)(*map)->map,
+ &efi->efi_memmap, &efi->efi_memmap_hi);
+
+ efi_set_u64_split((unsigned long)efi_system_table,
+ &efi->efi_systab, &efi->efi_systab_hi);
+
+ image_handle = handle;
+
+ return EFI_SUCCESS;
+}
+
+static void free_loader_data(struct boot_params *params, struct efi_boot_memmap *map)
+{
+ struct efi_info *efi = (void *)&params->efi_info;
+
+ efi_bs_call(free_pool, map);
+
+ efi->efi_memdesc_size = 0;
+ efi->efi_memdesc_version = 0;
+ efi->efi_memmap_size = 0;
+ efi_set_u64_split(0, &efi->efi_memmap, &efi->efi_memmap_hi);
+}
+
+extern unsigned char input_data[];
+extern unsigned int input_len, output_len;
+
+unsigned long extract_kernel_direct(efi_handle_t handle, struct boot_params *params)
+{
+
+ struct efi_extract_callbacks cb = { 0 };
+ struct efi_boot_memmap *map = NULL;
+ efi_status_t status;
+ void *res;
+
+ status = prepare_trampoline();
+
+ if (status != EFI_SUCCESS)
+ return 0;
+
+ /* Prepare environment for do_extract_kernel() call */
+ status = init_loader_data(handle, params, &map);
+
+ if (status != EFI_SUCCESS)
+ return 0;
+
+ cb.puthex = do_puthex;
+ cb.putstr = do_putstr;
+ cb.map_range = do_map_range;
+
+ res = efi_extract_kernel(params, &cb, input_data, input_len, output_len);
+
+ free_loader_data(params, map);
+
+ return (unsigned long)res;
+}
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index 7fb1eff88a18..1d1ab1911fd3 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -17,6 +17,7 @@
#include <asm/boot.h>

#include "efistub.h"
+#include "x86-stub.h"

/* Maximum physical address for 64-bit kernel with 4-level paging */
#define MAXMEM_X86_64_4LEVEL (1ull << 46)
@@ -24,7 +25,7 @@
const efi_system_table_t *efi_system_table;
const efi_dxe_services_table_t *efi_dxe_table;
u32 image_offset __section(".data");
-static efi_loaded_image_t *image = NULL;
+static efi_loaded_image_t *image __section(".data");

static efi_status_t
preserve_pci_rom_image(efi_pci_io_protocol_t *pci, struct pci_setup_rom **__rom)
@@ -212,55 +213,9 @@ static void retrieve_apple_device_properties(struct boot_params *boot_params)
}
}

-/*
- * Trampoline takes 2 pages and can be loaded in first megabyte of memory
- * with its end placed between 128k and 640k where BIOS might start.
- * (see arch/x86/boot/compressed/pgtable_64.c)
- *
- * We cannot find exact trampoline placement since memory map
- * can be modified by UEFI, and it can alter the computed address.
- */
-
-#define TRAMPOLINE_PLACEMENT_BASE ((128 - 8)*1024)
-#define TRAMPOLINE_PLACEMENT_SIZE (640*1024 - (128 - 8)*1024)
-
-void startup_32(struct boot_params *boot_params);
-
-static void
-setup_memory_protection(unsigned long image_base, unsigned long image_size)
-{
- /*
- * Allow execution of possible trampoline used
- * for switching between 4- and 5-level page tables
- * and relocated kernel image.
- */
-
- efi_adjust_memory_range_protection(TRAMPOLINE_PLACEMENT_BASE,
- TRAMPOLINE_PLACEMENT_SIZE, 0);
-
-#ifdef CONFIG_64BIT
- if (image_base != (unsigned long)startup_32)
- efi_adjust_memory_range_protection(image_base, image_size, 0);
-#else
- /*
- * Clear protection flags on a whole range of possible
- * addresses used for KASLR. We don't need to do that
- * on x86_64, since KASLR/extraction is performed after
- * dedicated identity page tables are built and we only
- * need to remove possible protection on relocated image
- * itself disregarding further relocations.
- */
- efi_adjust_memory_range_protection(LOAD_PHYSICAL_ADDR,
- KERNEL_IMAGE_SIZE - LOAD_PHYSICAL_ADDR,
- 0);
-#endif
-}
-
static const efi_char16_t apple[] = L"Apple";

-static void setup_quirks(struct boot_params *boot_params,
- unsigned long image_base,
- unsigned long image_size)
+static void setup_quirks(struct boot_params *boot_params)
{
efi_char16_t *fw_vendor = (efi_char16_t *)(unsigned long)
efi_table_attr(efi_system_table, fw_vendor);
@@ -269,9 +224,6 @@ static void setup_quirks(struct boot_params *boot_params,
if (IS_ENABLED(CONFIG_APPLE_PROPERTIES))
retrieve_apple_device_properties(boot_params);
}
-
- if (IS_ENABLED(CONFIG_EFI_DXE_MEM_ATTRIBUTES))
- setup_memory_protection(image_base, image_size);
}

/*
@@ -384,7 +336,7 @@ static void setup_graphics(struct boot_params *boot_params)
}


-static void __noreturn efi_exit(efi_handle_t handle, efi_status_t status)
+void __noreturn efi_exit(efi_handle_t handle, efi_status_t status)
{
efi_bs_call(exit, handle, status, 0, NULL);
for(;;)
@@ -707,8 +659,7 @@ static efi_status_t exit_boot(struct boot_params *boot_params, void *handle)
}

/*
- * On success, we return the address of startup_32, which has potentially been
- * relocated by efi_relocate_kernel.
+ * On success, we return the extracted kernel entry point.
* On failure, we exit to the firmware via efi_exit instead of returning.
*/
asmlinkage unsigned long efi_main(efi_handle_t handle,
@@ -733,60 +684,6 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,
efi_dxe_table = NULL;
}

- /*
- * If the kernel isn't already loaded at a suitable address,
- * relocate it.
- *
- * It must be loaded above LOAD_PHYSICAL_ADDR.
- *
- * The maximum address for 64-bit is 1 << 46 for 4-level paging. This
- * is defined as the macro MAXMEM, but unfortunately that is not a
- * compile-time constant if 5-level paging is configured, so we instead
- * define our own macro for use here.
- *
- * For 32-bit, the maximum address is complicated to figure out, for
- * now use KERNEL_IMAGE_SIZE, which will be 512MiB, the same as what
- * KASLR uses.
- *
- * Also relocate it if image_offset is zero, i.e. the kernel wasn't
- * loaded by LoadImage, but rather by a bootloader that called the
- * handover entry. The reason we must always relocate in this case is
- * to handle the case of systemd-boot booting a unified kernel image,
- * which is a PE executable that contains the bzImage and an initrd as
- * COFF sections. The initrd section is placed after the bzImage
- * without ensuring that there are at least init_size bytes available
- * for the bzImage, and thus the compressed kernel's startup code may
- * overwrite the initrd unless it is moved out of the way.
- */
-
- buffer_start = ALIGN(bzimage_addr - image_offset,
- hdr->kernel_alignment);
- buffer_end = buffer_start + hdr->init_size;
-
- if ((buffer_start < LOAD_PHYSICAL_ADDR) ||
- (IS_ENABLED(CONFIG_X86_32) && buffer_end > KERNEL_IMAGE_SIZE) ||
- (IS_ENABLED(CONFIG_X86_64) && buffer_end > MAXMEM_X86_64_4LEVEL) ||
- (image_offset == 0)) {
- extern char _bss[];
-
- status = efi_relocate_kernel(&bzimage_addr,
- (unsigned long)_bss - bzimage_addr,
- hdr->init_size,
- hdr->pref_address,
- hdr->kernel_alignment,
- LOAD_PHYSICAL_ADDR);
- if (status != EFI_SUCCESS) {
- efi_err("efi_relocate_kernel() failed!\n");
- goto fail;
- }
- /*
- * Now that we've copied the kernel elsewhere, we no longer
- * have a set up block before startup_32(), so reset image_offset
- * to zero in case it was set earlier.
- */
- image_offset = 0;
- }
-
#ifdef CONFIG_CMDLINE_BOOL
status = efi_parse_options(CONFIG_CMDLINE);
if (status != EFI_SUCCESS) {
@@ -843,7 +740,11 @@ asmlinkage unsigned long efi_main(efi_handle_t handle,

setup_efi_pci(boot_params);

- setup_quirks(boot_params, bzimage_addr, buffer_end - buffer_start);
+ setup_quirks(boot_params);
+
+ bzimage_addr = extract_kernel_direct(handle, boot_params);
+ if (!bzimage_addr)
+ goto fail;

status = exit_boot(boot_params, handle);
if (status != EFI_SUCCESS) {
diff --git a/drivers/firmware/efi/libstub/x86-stub.h b/drivers/firmware/efi/libstub/x86-stub.h
new file mode 100644
index 000000000000..baecc7c6e602
--- /dev/null
+++ b/drivers/firmware/efi/libstub/x86-stub.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _DRIVERS_FIRMWARE_EFI_X86STUB_H
+#define _DRIVERS_FIRMWARE_EFI_X86STUB_H
+
+#include <linux/efi.h>
+
+#include <asm/bootparam.h>
+
+void __noreturn efi_exit(efi_handle_t handle, efi_status_t status);
+unsigned long extract_kernel_direct(efi_handle_t handle, struct boot_params *boot_params);
+void startup_32(struct boot_params *boot_params);
+
+#endif
--
2.39.2


2023-03-14 20:34:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 09/27] x86/boot: Remove mapping from page fault handler

On Tue, Mar 14, 2023, at 3:13 AM, Evgeniy Baskov wrote:
> After every implicit mapping is removed, this code is no longer needed.
>
> Remove memory mapping from page fault handler to ensure that there are
> no hidden invalid memory accesses.

This patch is *by far* the scariest of the bunch in my book. And it violates a basic principle of kernel development: it's better to run in degraded mode than to fail outright, unless running in degraded mode is dangerous for some reason.

And this boot code is not actually meaningfully exposed to attack. Anyone who can get the boot code to consume garbage likely *already* controls the system, including anything that we might write to TPM or any other verification mechanism.

So I think this should log an error, set a flag to make sure we print an even louder error after full boot, but still add the mapping and keep trying.

--Andy

>
> Tested-by: Mario Limonciello <[email protected]>
> Signed-off-by: Evgeniy Baskov <[email protected]>
> ---
> arch/x86/boot/compressed/ident_map_64.c | 26 ++++++++++---------------
> 1 file changed, 10 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/ident_map_64.c
> b/arch/x86/boot/compressed/ident_map_64.c
> index eb28ce9812c5..378f99b1d7e8 100644
> --- a/arch/x86/boot/compressed/ident_map_64.c
> +++ b/arch/x86/boot/compressed/ident_map_64.c
> @@ -393,27 +393,21 @@ void do_boot_page_fault(struct pt_regs *regs,
> unsigned long error_code)
> {
> unsigned long address = native_read_cr2();
> unsigned long end;
> - bool ghcb_fault;
> + char *msg;
>
> - ghcb_fault = sev_es_check_ghcb_fault(address);
> + if (sev_es_check_ghcb_fault(address))
> + msg = "Page-fault on GHCB page:";
> + else
> + msg = "Unexpected page-fault:";
>
> address &= PMD_MASK;
> end = address + PMD_SIZE;
>
> /*
> - * Check for unexpected error codes. Unexpected are:
> - * - Faults on present pages
> - * - User faults
> - * - Reserved bits set
> - */
> - if (error_code & (X86_PF_PROT | X86_PF_USER | X86_PF_RSVD))
> - do_pf_error("Unexpected page-fault:", error_code, address, regs->ip);
> - else if (ghcb_fault)
> - do_pf_error("Page-fault on GHCB page:", error_code, address, regs->ip);
> -
> - /*
> - * Error code is sane - now identity map the 2M region around
> - * the faulting address.
> + * Since all memory allocations are made explicit
> + * now, every page fault at this stage is an
> + * error and the error handler is there only
> + * for debug purposes.
> */
> - kernel_add_identity_map(address, end, MAP_WRITE);
> + do_pf_error(msg, error_code, address, regs->ip);
> }
> --
> 2.39.2

2023-03-14 21:24:39

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] x86_64: Improvements at compressed kernel stage

On Tue, Mar 14, 2023, at 3:13 AM, Evgeniy Baskov wrote:
> This patchset is aimed
> * to improve UEFI compatibility of compressed kernel code for x86_64
> * to setup proper memory access attributes for code and rodata sections
> * to implement W^X protection policy throughout the whole execution
> of compressed kernel for EFISTUB code path.

The overall code quality seems okay, but I have some questions as to what this is for. The early boot environment is not exposed to most sorts of attacks -- there's no userspace, there's no network, and there is not a whole lot of input that isn't implicitly completely trusted.

What parts of this series are actually needed to get these fancy new bootloaders to boot Linux? And why?

>
> Kernel is made to be more compatible with PE image specification [3],
> allowing it to be successfully loaded by stricter PE loader
> implementations like the one from [2]. There is at least one
> known implementation that uses that loader in production [4].
> There are also ongoing efforts to upstream these changes.

Can you clarify

>
> Also the patchset adds EFI_MEMORY_ATTTRIBUTE_PROTOCOL, included into
> EFI specification since version 2.10, as a better alternative to
> using DXE services for memory protection attributes manipulation,
> since it is defined by the UEFI specification itself and not UEFI PI
> specification. This protocol is not widely available so the code
> using DXE services is kept in place as a fallback in case specific
> implementation does not support the new protocol.
> One of EFI implementations that already support
> EFI_MEMORY_ATTTRIBUTE_PROTOCOL is Microsoft Project Mu [5].

Maybe make this a separate series?

>
> Kernel image generation tool (tools/build.c) is refactored as a part
> of changes that makes PE image more compatible.
>
> The patchset implements memory protection for compressed kernel
> code while executing both inside EFI boot services and outside of
> them. For EFISTUB code path W^X protection policy is maintained
> throughout the whole execution of compressed kernel. The latter
> is achieved by extracting the kernel directly from EFI environment
> and jumping to it's head immediately after exiting EFI boot services.
> As a side effect of this change one page table rebuild and a copy of
> the kernel image is removed.

I have no problem with this, but what's it needed for?

>
> Memory protection inside EFI environment is controlled by the
> CONFIG_DXE_MEM_ATTRIBUTES option, although with these patches this
> option also control the use EFI_MEMORY_ATTTRIBUTE_PROTOCOL and memory
> protection attributes of PE sections and not only DXE services as the
> name might suggest.
>

> [1]
> https://lore.kernel.org/lkml/[email protected]/
> [2] https://github.com/acidanthera/audk/tree/secure_pe

Link is broken

> [3]
> https://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/pecoff_v83.docx

I skimmed this very briefly, and I have no idea what I'm supposed to look at. This is the entire PE spec!

2023-03-14 23:21:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] x86_64: Improvements at compressed kernel stage



On Tue, Mar 14, 2023, at 2:23 PM, Andy Lutomirski wrote:
> On Tue, Mar 14, 2023, at 3:13 AM, Evgeniy Baskov wrote:
>>
>> Kernel is made to be more compatible with PE image specification [3],
>> allowing it to be successfully loaded by stricter PE loader
>> implementations like the one from [2]. There is at least one
>> known implementation that uses that loader in production [4].
>> There are also ongoing efforts to upstream these changes.
>
> Can you clarify

Sorry, lost part of a sentence. Can you clarify in what respect the loader is stricter?


Anyway, I did some research. I found:

https://github.com/rhboot/shim/pull/459/commits/99a8d19326f69665e0b86bcfa6a59d554f662fba

which gives a somewhat incoherent-sounding description in which setting EFI_IMAGE_DLLCHARACTERISTICS_NX_COMPAT apparently enables allocating memory that isn't RWX. But this seems odd: EFI_IMAGE_DLLCHARACTERISTICS_NX_COMPAT is a property of the EFI *program*, not of the boot services implementation. And I'd be surprised if a flag on the application changed the behavior of boot services, but, OTOH, this is Microsoft.

And the PE 89 spec does say that EFI_IMAGE_DLLCHARACTERISTICS_NX_COMPAT means "Image is NX compatible" and that is the sole mention of NX in the document.

And *this* seems to be the actual issue:

https://github.com/rhboot/shim/pull/459/commits/825d99361b4aaa16144392dc6cea43e24c8472ae

I assume that MS required this change as a condition for signing, but what do I know? Anyway, the rules appear to be that the PE sections must not be both W and X at the same time. (For those who are familiar with the abomination known as ELF but not with the abomination known as PE, a "section" is a range in the file that gets mapped into memory. Like a PT_LOAD segment in ELF.)
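
Concretely, the per-section check such a loader makes is trivial; a
sketch using the characteristics bits from the simplified pe.h earlier
in this series (standalone, compile it yourself):

	#include <stdbool.h>
	#include <stdint.h>

	#define IMAGE_SCN_MEM_EXECUTE 0x20000000 /* can be executed as code */
	#define IMAGE_SCN_MEM_WRITE   0x80000000 /* writeable */

	/* a section that is both writable and executable violates W^X */
	static bool section_violates_wx(uint32_t characteristics)
	{
		return (characteristics & IMAGE_SCN_MEM_EXECUTE) &&
		       (characteristics & IMAGE_SCN_MEM_WRITE);
	}

	int main(void)
	{
		return section_violates_wx(IMAGE_SCN_MEM_EXECUTE |
					   IMAGE_SCN_MEM_WRITE) ? 0 : 1;
	}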

Now I don't know whether anything prevents us from doing something awful like mapping the EFI stuff RX and then immediately *re*mapping everything RWX. (Not that I'm seriously suggesting that.) And it's not immediately clear to me how the rest of this series fits in, what this has to do with the identity map, etc.

Anyway, I think the series needs to document what's going on, in the changelog and relevant comments. And if the demand-population of the identity map is a problem, then there should be a comment like (made up -- don't say this unless it's correct):

A sufficiently paranoid EFI implementation may enforce W^X when mapping memory through the boot services protocols. And creating identity mappings in the page fault handler needs to use the boot services protocols to do so because [fill this in] [or it would be a bit of an abomination to do an end run around them by modifying the page tables ourselves] [or whatever is actually happening]. While we *could* look at the actual fault type and create an R or RW or RX mapping as appropriate, it's better to figure out what needs to be mapped for real and to map it with the correct permissions before faulting.

But I still think we should keep the demand-faulting code as a fallback, even if it's hardcoded as RW, and just log the fault mode and address. We certainly shouldn't be *executing* code that wasn't identity mapped. Unless that code is boot services and we're creating the boot services mappings!
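
Something like this, as a sketch only (the helper names here are made
up and don't necessarily match what's actually in
arch/x86/boot/compressed/ident_map_64.c):

  /* Hypothetical fallback: complain loudly about the unexpected fault,
   * then identity-map the page RW anyway and keep going. */
  void do_boot_page_fault(struct pt_regs *regs, unsigned long error_code)
  {
          unsigned long address = native_read_cr2();

          /* A fault here indicates a kernel bug: log it... */
          boot_fault_warn(address, error_code);   /* made-up helper */

          /* ...but run in degraded mode instead of dying: map one page,
           * hardcoded RW, and let the faulting access retry. */
          add_identity_map(address & PAGE_MASK,
                           (address & PAGE_MASK) + PAGE_SIZE);
  }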

For that matter, how confident are we that there aren't crappy boot services implementations out there that require that we fix up page faults? After all, it's not like EFI implementations, especially early ones, are any good.

--Andy

2023-03-15 09:07:31

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] x86_64: Improvements at compressed kernel stage

Hi,

> And *this* seems to be the actual issue:
>
> https://github.com/rhboot/shim/pull/459/commits/825d99361b4aaa16144392dc6cea43e24c8472ae
>
> I assume that MS required this change as a condition for signing, but what do I know?

Your guess is correct. The UEFI world is moving toward being stricter, for
example setting page permissions according to the allocation type (RW for
data, RX for code).
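
For illustration, on the allocation side that means the EFISTUB has to
request the memory type matching the intended use; a sketch in Linux
EFISTUB style (not a quote of the actual series):

  /* Sketch: on strict implementations, EfiLoaderCode pages may come
   * back RX rather than RWX, so code must be allocated as code. */
  static efi_status_t alloc_code_pages(unsigned long nr_pages,
                                       efi_physical_addr_t *addr)
  {
          return efi_bs_call(allocate_pages, EFI_ALLOCATE_ANY_PAGES,
                             EFI_LOADER_CODE, nr_pages, addr);
  }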

Microsoft raised the bar for PE binaries when it comes to secure boot
signing as part of that effort. Being a valid PE binary according to
the PE spec is not good enough; some additional constraints, like sections
not overlapping and sections with different load flags not sharing pages
(so setting strict page permissions is actually possible), are required
now. Stuff that has been standard elsewhere for years.
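
The page-sharing rule in particular is easy to state in code: if every
section starts on a page boundary and sections don't overlap, no two
sections with different flags can share a page. A sketch (abbreviated
field names, not actual loader code):

  #include <stdint.h>

  struct pe_section {                    /* abbreviated */
          uint32_t virtual_address;
          uint32_t virtual_size;
  };

  /* Sections sorted by address: each must start on a page boundary and
   * must not overlap the previous one. With that, sections with
   * different load flags never share a 4K page. */
  static int layout_is_strict(const struct pe_section *s, int n)
  {
          for (int i = 0; i < n; i++) {
                  if (s[i].virtual_address & 0xfff)
                          return 0;
                  if (i && s[i].virtual_address <
                           s[i - 1].virtual_address + s[i - 1].virtual_size)
                          return 0;
          }
          return 1;
  }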

> Anyway, the rules appear to be that the PE sections must not be both W and X at the same time.

That too.

> But I still think we should keep the demand-faulting code as a
> fallback, even if it's hardcoded as RW, and just log the fault mode
> and address. We certainly shouldn't be *executing* code that wasn't
> identity mapped. Unless that code is boot services and we're creating
> the boot services mappings!

Agree.

> For that matter, how confident are we that there aren't crappy boot
> services implementations out there that require that we fix up page
> faults? After all, it's not like EFI implementations, especially early
> ones, are any good.

I don't expect many problems here. Early EFI implementations don't
bother setting page permissions and just identity-map everything using
rwx huge pages, or run with paging turned off (hello ia32).

But playing it safe (and keeping the demand-faulting just in case) is a
good idea nevertheless.

take care,
Gerd


2023-03-15 13:25:54

by Evgeniy Baskov

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] x86_64: Improvements at compressed kernel stage

On 2023-03-15 00:23, Andy Lutomirski wrote:
> On Tue, Mar 14, 2023, at 3:13 AM, Evgeniy Baskov wrote:
>> This patchset is aimed
>> * to improve UEFI compatibility of compressed kernel code for x86_64
>> * to setup proper memory access attributes for code and rodata
>> sections
>> * to implement W^X protection policy throughout the whole execution
>> of compressed kernel for EFISTUB code path.
>
> The overall code quality seems okay, but I have some questions as to
> what this is for. The early boot environment is not exposed to most
> sorts of attacks -- there's no userspace, there's no network, and
> there is not a whole lot of input that isn't implicitly completely
> trusted.
>
> What parts of this series are actually needed to get these fancy new
> bootloaders to boot Linux? And why?

Well, most of the series is needed, except maybe adding W^X for
the non-UEFI boot path (patches 3-9), but those patches add changes
required for booting via UEFI, like the memory protection callbacks.
And since the important callbacks are already in place, W^X for the
non-UEFI path is not an undesirable property either.

The larger part of this series (patches 3-16, 26 and 27) implements W^X,
and the remaining patches improve the PE compatibility, which
includes:

* Removing W+X sections (which, as Gerd has already mentioned, is now
required, or at least very much desired).
* Aligning sections to the page size in memory and to the minimal
file alignment in the file.
* Aligning data structures on their natural alignment
(e.g. [2] requires it).
* Filling more PE header fields with their actual values.
* Removing alignment flags on sections, which, according to
the spec, are only for object files.
* Filling in the relocation data directory and rounding its size
to 4 bytes.

Most of this work is done in patch 24 "x86/build: Make generated
PE more spec compliant", but it also requires working W^X due to
the removal of W+X sections, and some clean-up work from patches
17-23.

>
>>
>> Kernel is made to be more compatible with PE image specification [3],
>> allowing it to be successfully loaded by stricter PE loader
>> implementations like the one from [2]. There is at least one
>> known implementation that uses that loader in production [4].
>> There are also ongoing efforts to upstream these changes.
>
> Can you clarify
>
>>
>> Also the patchset adds EFI_MEMORY_ATTTRIBUTE_PROTOCOL, included into
>> EFI specification since version 2.10, as a better alternative to
>> using DXE services for memory protection attributes manipulation,
>> since it is defined by the UEFI specification itself and not UEFI PI
>> specification. This protocol is not widely available so the code
>> using DXE services is kept in place as a fallback in case specific
>> implementation does not support the new protocol.
>> One of EFI implementations that already support
>> EFI_MEMORY_ATTTRIBUTE_PROTOCOL is Microsoft Project Mu [5].
>
> Maybe make this a separate series?

This is now just one fairly straightforward patch, since the protocol
definitions have already been accepted and the protocol is used elsewhere
in the EFISTUB. This patch would also have to be replaced, rather than
removed, if it were made a separate series, since it adds a warning about
W+X mappings.

>
>>
>> Kernel image generation tool (tools/build.c) is refactored as a part
>> of changes that makes PE image more compatible.
>>
>> The patchset implements memory protection for compressed kernel
>> code while executing both inside EFI boot services and outside of
>> them. For EFISTUB code path W^X protection policy is maintained
>> throughout the whole execution of compressed kernel. The latter
>> is achieved by extracting the kernel directly from EFI environment
>> and jumping to it's head immediately after exiting EFI boot services.
>> As a side effect of this change one page table rebuild and a copy of
>> the kernel image is removed.
>
> I have no problem with this, but what's it needed for?

The one hard part that made the series more complicated is that the
non-UEFI (or rather, the only) boot path relocates the kernel, which
messes up the memory protection set on the sections by UEFI. I did not
want to remove support for in-place extraction and relocation (used
when the kernel is loaded at an unsuitable address) from the non-UEFI
boot path, which is why extraction from within boot services was
implemented. Proper W^X in the EFISTUB is a side effect, but a desired
one.

The alternative would be to make the whole image RWX after the EFISTUB
execution, but the current approach is a much nicer solution.

>
>>
>> Memory protection inside EFI environment is controlled by the
>> CONFIG_DXE_MEM_ATTRIBUTES option, although with these patches this
>> option also control the use EFI_MEMORY_ATTTRIBUTE_PROTOCOL and memory
>> protection attributes of PE sections and not only DXE services as the
>> name might suggest.
>>
>
>> [1]
>> https://lore.kernel.org/lkml/[email protected]/
>> [2] https://github.com/acidanthera/audk/tree/secure_pe
>
> Link is broken

Ah, sorry, the branch was merged into master after I first posted
the series, so the working link is:

https://github.com/acidanthera/audk

The loader itself is here:

https://github.com/acidanthera/audk/tree/master/MdePkg/Library/BasePeCoffLib2

>
>> [3]
>> https://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/pecoff_v83.docx
>
> I skimmed this very briefly, and I have no idea what I'm supposed to
> look at. This is the entire PE spec!

I gave some explanations above, which mostly duplicate the commit
message of patch 24 "x86/build: Make generated PE more spec
compliant".

Thanks,
Evgeniy Baskov

2023-03-15 13:25:57

by Evgeniy Baskov

[permalink] [raw]
Subject: Re: [PATCH v5 09/27] x86/boot: Remove mapping from page fault handler

On 2023-03-14 23:33, Andy Lutomirski wrote:
> On Tue, Mar 14, 2023, at 3:13 AM, Evgeniy Baskov wrote:
>> After every implicit mapping is removed, this code is no longer
>> needed.
>>
>> Remove memory mapping from page fault handler to ensure that there are
>> no hidden invalid memory accesses.
>
> This patch is *by far* the scariest of the bunch in my book. And it
> violates a basic principle of kernel development: it's better to run
> in degraded mode than to fail outright unless running in degraded mode
> is dangerous for some reason.
>
> And this boot code is not actually meaningfully exposed to attack.
> Anyone who can get the boot code to consume garbage likely *already*
> controls the system, including anything that we might write to TPM or
> any other verification mechanism.
>
> So I think this should log an error, set a flag to make sure we print
> an even louder error after full boot, but still add the mapping and
> keep trying.

Good point. This patch can be dropped and replaced by the loud warning,
since it is not required for the functioning of the rest of the series;
it is here mainly to indicate bugs in the kernel rather than for
increased protection. But I would not expect any faults on working
systems, as I did my best to map everything explicitly. And since
no code but the extraction code is supposed to run (interrupts
are disabled and we are not using any UEFI services there), it should
be practically safe to remove the implicit mapping.

And at least this allowed me to find out about the insufficient size of
the boot page tables, which did not account for the ACPI and UEFI
mappings (patch 4 "x86/boot: Increase boot page table size").

If this patch is dropped now, I can send a follow-up patch later
adding the warning.

Thanks,
Evgeniy Baskov

>
> --Andy
>
>>
>> Tested-by: Mario Limonciello <[email protected]>
>> Signed-off-by: Evgeniy Baskov <[email protected]>
>> ---
>> arch/x86/boot/compressed/ident_map_64.c | 26
>> ++++++++++---------------
>> 1 file changed, 10 insertions(+), 16 deletions(-)
>>
...

2023-03-15 17:58:43

by Peter Jones

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] x86_64: Improvements at compressed kernel stage

On Tue, Mar 14, 2023 at 04:20:43PM -0700, Andy Lutomirski wrote:
>
>
> On Tue, Mar 14, 2023, at 2:23 PM, Andy Lutomirski wrote:
> > On Tue, Mar 14, 2023, at 3:13 AM, Evgeniy Baskov wrote:
> >>
> >> Kernel is made to be more compatible with PE image specification [3],
> >> allowing it to be successfully loaded by stricter PE loader
> >> implementations like the one from [2]. There is at least one
> >> known implementation that uses that loader in production [4].
> >> There are also ongoing efforts to upstream these changes.
> >
> > Can you clarify
>
> Sorry, lost part of a sentence. Can you clarify in what respect the loader is stricter?
>
>
> Anyway, I did some research. I found:
>
> https://github.com/rhboot/shim/pull/459/commits/99a8d19326f69665e0b86bcfa6a59d554f662fba
>
> which gives a somewhat incoherent-sounding description in which
> setting EFI_IMAGE_DLLCHARACTERISTICS_NX_COMPAT apparently enables
> allocating memory that isn't RWX. But this seems odd:
> EFI_IMAGE_DLLCHARACTERISTICS_NX_COMPAT is a property of the EFI
> *program*, not the boot services implementation.

Well, "is this binary compatible" is a property of the program, yes.
It's up to the loader to decide if it *cares*, and the compatibility
flag allows it to do that.
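
(For reference, it's literally one bit in the optional header, so a
loader that cares can just test it; an abbreviated sketch:)

  #include <stdint.h>

  #define IMAGE_DLLCHARACTERISTICS_NX_COMPAT 0x0100u

  struct pe_opt_header {                 /* abbreviated */
          uint16_t dll_characteristics;
  };

  static int image_is_nx_compat(const struct pe_opt_header *opt)
  {
          return !!(opt->dll_characteristics &
                    IMAGE_DLLCHARACTERISTICS_NX_COMPAT);
  }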

> And I'd be surprised if a flag on the application changes the behavior
> of boot services, but, OTOH, this is Microsoft.

There has been discussion of implementing a compatibility mode that
enables NX support by default, but only breaks the old assumptions
(that the stack and memory allocations will be executable) if the
flag is set, so that newer OSes get the mitigations we need, but
older OSes still work. I don't think anyone has actually implemented
this *yet*, but some hardware vendors have made noises that sound like
they may intend to. (I realize that sounds cagey as hell. Sorry.)

Currently I think the only shipping systems that implement
NX-requirements are from Microsoft - the Surface product line and
Windows Dev Kit - and they don't allow you to disable it at all. Other
vendors have produced firmware that isn't shipping yet (I *think*) that
has it as a setting in the firmware menu, and they're looking to move to
enabling it by default on some product lines. We'd like to not be left
behind.

> And the PE spec does say that
> EFI_IMAGE_DLLCHARACTERISTICS_NX_COMPAT means "Image is NX compatible"
> and that is the sole mention of NX in the document.

Yeah, the PE spec is not very good in a lot of ways unrelated to how
abominable the thing it's describing is.

> And *this* seems to be the actual issue:
>
> https://github.com/rhboot/shim/pull/459/commits/825d99361b4aaa16144392dc6cea43e24c8472ae
>
> I assume that MS required this change as a condition for signing, but
> what do I know?

Yes, they have, but it's not as if they did it in a vacuum. I think the
idea was originally Kees Cook's actually, and there's been a significant
effort on the firmware and bootloader side to enable it. And there's
good reason to do this, too - more and more of this surface is being
attacked, and recently we've seen the first "bootkit" that actually
includes a Secure Boot breakout in the wild: https://www.welivesecurity.com/2023/03/01/blacklotus-uefi-bootkit-myth-confirmed/

While that particular malware (somewhat ironically) only uses code
developed for linux systems *after* the exploit, it could easily have
gone the other way, and we're definitely a target here. We need NX in
our boot path, and soon.

> Anyway, the rules appear to be that the PE sections
> must not be both W and X at the same time. (For those who are
> familiar with the abomination known as ELF but not with the
> abomination known as PE, a "section" is a range in the file that gets
> mapped into memory. Like a PT_LOAD segment in ELF.)
>
> Now I don't know whether anything prevents us from doing something
> awful like mapping the EFI stuff RX and then immediately *re*mapping
> everything RWX. (Not that I'm seriously suggesting that.)

Once we've taken over paging, nothing stops us at all. Until then,
SetMemoryAttributes() (which is more or less mprotect()) might prevent
it.
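
For reference, this is roughly the shape of the UEFI 2.10
EFI_MEMORY_ATTRIBUTE_PROTOCOL calls involved; a sketch in Linux EFISTUB
style, not the actual series code:

  /* Sketch: make [addr, addr + size) read-only and executable by
   * setting the RO bit and clearing the no-execute (XP) bit. */
  static efi_status_t remap_as_code(efi_memory_attribute_protocol_t *memattr,
                                    efi_physical_addr_t addr, u64 size)
  {
          efi_status_t status;

          status = efi_call_proto(memattr, set_memory_attributes,
                                  addr, size, EFI_MEMORY_RO);
          if (status != EFI_SUCCESS)
                  return status;

          return efi_call_proto(memattr, clear_memory_attributes,
                                addr, size, EFI_MEMORY_XP);
  }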

> And it's not immediately clear to me how the rest of this series fits
> in, what this has to do with the identity map, etc.

I'll let Evgeniy address that and the rest of this.

--
Peter


2023-04-05 16:20:02

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v5 00/27] x86_64: Improvements at compressed kernel stage

On Wed, Mar 15, 2023 at 01:57:33PM -0400, Peter Jones wrote:
> Currently I think the only shipping systems that implement
> NX-requirements are from Microsoft - the Surface product line and
> Windows Dev Kit - and they don't allow you to disable it at all. Other
> vendors have produced firmware that isn't shipping yet (I *think*) that
> has it as a setting in the firmware menu, and they're looking to move to
> enabling it by default on some product lines.

I hope they realize that they must leave the off switch in the BIOS for
older kernels...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-05 17:42:53

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v5 02/27] x86/build: Remove RWX sections and align on 4KB

On Tue, Mar 14, 2023 at 01:13:29PM +0300, Evgeniy Baskov wrote:
> Avoid creating sections simultaneously writable and readable to prepare
> for W^X implementation for the kernel itself (not the decompressor).
> Align kernel sections on page size (4KB) to allow protecting them in the
> page tables.
>
> Split init code form ".init" segment into separate R_X ".inittext"

s/form/from/

> segment and make ".init" segment non-executable.

"... and make the .init segment RW_."

> Also add these segments to x86_32 architecture for consistency.

Same comment as before: please refrain from talking about the *what* in
a commit message but about the *why*.

And considering the matter, you have a *lot* of *why* to talk about. :-)

Pls check your whole set.

> Currently paging is disabled in x86_32 in compressed kernel, so
> protection is not applied anyways, but .init code was incorrectly
> placed in non-executable ".data" segment. This should not change
> anything meaningful in memory layout now, but might be required in case
> memory protection will also be implemented in compressed kernel for
> x86_32.

I highly doubt that - no one cares about 32-bit x86 anymore.

> @@ -226,9 +225,10 @@ SECTIONS
> #endif
>
> INIT_TEXT_SECTION(PAGE_SIZE)
> -#ifdef CONFIG_X86_64
> - :init
> -#endif
> + :inittext
> +
> + . = ALIGN(PAGE_SIZE);
> +
>
> /*
> * Section for code used exclusively before alternatives are run. All
> @@ -240,6 +240,7 @@ SECTIONS
> .altinstr_aux : AT(ADDR(.altinstr_aux) - LOAD_OFFSET) {
> *(.altinstr_aux)
> }
> + :init

Why isn't this placed after inittext but here?

I'm thinking you wanna have:

:inittext
. = ALIGN..
:init
<rest>

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-05 17:56:13

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v5 03/27] x86/boot: Set cr0 to known state in trampoline

On Tue, Mar 14, 2023 at 01:13:30PM +0300, Evgeniy Baskov wrote:
> Ensure WP bit to be set to prevent boot code from writing to
> non-writable memory pages.
>
> Tested-by: Mario Limonciello <[email protected]>
> Signed-off-by: Evgeniy Baskov <[email protected]>
> ---
> arch/x86/boot/compressed/head_64.S | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index 03c4328a88cb..01fa42d31648 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -660,9 +660,8 @@ SYM_CODE_START(trampoline_32bit_src)
> pushl $__KERNEL_CS
> pushl %eax
>
> - /* Enable paging again. */
> - movl %cr0, %eax
> - btsl $X86_CR0_PG_BIT, %eax
> + /* Enable paging and set CR0 to known state (this also sets WP flag) */
> + movl $CR0_STATE, %eax

This sets a lot more than WP. Why?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-04-06 12:03:17

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [PATCH v5 02/27] x86/build: Remove RWX sections and align on 4KB

Hi,

> > Currently paging is disabled in x86_32 in compressed kernel, so
> > protection is not applied anyways, but .init code was incorrectly
> > placed in non-executable ".data" segment. This should not change
> > anything meaningful in memory layout now, but might be required in case
> > memory protection will also be implemented in compressed kernel for
> > x86_32.
>
> I highly doubt that - no one cares about 32-bit x86 anymore.

Indeed. ia32 edk2 runs without paging even in latest tianocore/edk2,
and I don't expect that to change until ia32 support gets removed.

take care,
Gerd

2023-04-08 15:10:15

by Evgeniy Baskov

[permalink] [raw]
Subject: Re: [PATCH v5 02/27] x86/build: Remove RWX sections and align on 4KB

On 2023-04-05 20:40, Borislav Petkov wrote:
> On Tue, Mar 14, 2023 at 01:13:29PM +0300, Evgeniy Baskov wrote:
>> Avoid creating sections simultaneously writable and readable to
>> prepare
>> for W^X implementation for the kernel itself (not the decompressor).
>> Align kernel sections on page size (4KB) to allow protecting them in
>> the
>> page tables.
>>
>> Split init code form ".init" segment into separate R_X ".inittext"
>
> s/form/from/

Thanks!

>
>> segment and make ".init" segment non-executable.
>
> "... and make the .init segment RW_."

Will fix.

>
>> Also add these segments to x86_32 architecture for consistency.
>
> Same comment as before: please refrain from talking about the *what* in
> a commit message but about the *why*.
>
> And considering the matter, you have a *lot* of *why* to talk about.
> :-)
>
> Pls check your whole set.

I'll try to make the descriptions of the patches more elaborate and to
better reflect the reasoning behind the changes before resubmitting, thanks.

>
>> Currently paging is disabled in x86_32 in compressed kernel, so
>> protection is not applied anyways, but .init code was incorrectly
>> placed in non-executable ".data" segment. This should not change
>> anything meaningful in memory layout now, but might be required in
>> case
>> memory protection will also be implemented in compressed kernel for
>> x86_32.
>
> I highly doubt that - no one cares about 32-bit x86 anymore.
>

True, but in theory it's still possible, and the change also
makes things more correct.

>> @@ -226,9 +225,10 @@ SECTIONS
>> #endif
>>
>> INIT_TEXT_SECTION(PAGE_SIZE)
>> -#ifdef CONFIG_X86_64
>> - :init
>> -#endif
>> + :inittext
>> +
>> + . = ALIGN(PAGE_SIZE);
>> +
>>
>> /*
>> * Section for code used exclusively before alternatives are run.
>> All
>> @@ -240,6 +240,7 @@ SECTIONS
>> .altinstr_aux : AT(ADDR(.altinstr_aux) - LOAD_OFFSET) {
>> *(.altinstr_aux)
>> }
>> + :init
>
> Why isn't this placed after inittext but here?

Because, AFAIK, :init is part of the output section syntax, so it must
come after the closing brace, at least according to the documentation:

https://sourceware.org/binutils/docs/ld/PHDRS.html
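
A toy example of that syntax (not the actual layout from the patch):

  PHDRS {
          inittext PT_LOAD FLAGS(5);      /* R_X */
          init     PT_LOAD FLAGS(6);      /* RW_ */
  }
  SECTIONS {
          .inittext : { *(.inittext) } :inittext

          . = ALIGN(4096);

          .init.data : { *(.init.data) } :init
  }

So the ALIGN itself does go between the output sections, but the :phdr
assignment can only follow an output section's closing brace.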

>
> I'm thinking you wanna have:
>
> :inittext
> . = ALIGN..
> :init
> <rest>
>
> Thx.

2023-04-08 15:10:44

by Evgeniy Baskov

[permalink] [raw]
Subject: Re: [PATCH v5 03/27] x86/boot: Set cr0 to known state in trampoline

On 2023-04-05 20:54, Borislav Petkov wrote:
> On Tue, Mar 14, 2023 at 01:13:30PM +0300, Evgeniy Baskov wrote:
>> Ensure WP bit to be set to prevent boot code from writing to
>> non-writable memory pages.
>>
>> Tested-by: Mario Limonciello <[email protected]>
>> Signed-off-by: Evgeniy Baskov <[email protected]>
>> ---
>> arch/x86/boot/compressed/head_64.S | 5 ++---
>> 1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/boot/compressed/head_64.S
>> b/arch/x86/boot/compressed/head_64.S
>> index 03c4328a88cb..01fa42d31648 100644
>> --- a/arch/x86/boot/compressed/head_64.S
>> +++ b/arch/x86/boot/compressed/head_64.S
>> @@ -660,9 +660,8 @@ SYM_CODE_START(trampoline_32bit_src)
>> pushl $__KERNEL_CS
>> pushl %eax
>>
>> - /* Enable paging again. */
>> - movl %cr0, %eax
>> - btsl $X86_CR0_PG_BIT, %eax
>> + /* Enable paging and set CR0 to known state (this also sets WP flag)
>> */
>> + movl $CR0_STATE, %eax
>
> This sets a lot more than WP. Why?

Because there are code paths where the CR0 state is not initialized
(e.g. the EFISTUB code path) and it's better to know it exactly.
We don't actually care about the MP, ET, NE and AM flags, but they
should all be supported, so the choice was somewhat arbitrary. They
are also already initialized to this value on one code path -- when
the kernel starts execution via startup_32.
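
For reference, CR0_STATE in arch/x86/include/asm/processor-flags.h
should expand to the following combination (quoting from memory):

  #define CR0_STATE       (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
                           X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
                           X86_CR0_PG)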

Thanks.