by Ashish Kalra

Subject: Re: [PATCH v2 1/3] efi/x86: skip efi_arch_mem_reserve() in case of kexec.

Hello,

On 3/18/2024 11:00 PM, Dave Young wrote:
> Hi,
>
> Added Ard in cc.
>
> On 03/18/24 at 07:02am, Ashish Kalra wrote:
>> From: Ashish Kalra <[email protected]>
>>
>> For kexec use case, need to use and stick to the EFI memmap passed
>> from the first kernel via boot-params/setup data, hence,
>> skip efi_arch_mem_reserve() during kexec.
>>
>> Additionally during SNP guest kexec testing discovered that EFI memmap
>> is corrupted during chained kexec. kexec_enter_virtual_mode() during
>> late init will remap the efi_memmap physical pages allocated in
>> efi_arch_mem_reserve() via memblock & then subsequently cause random
>> EFI memmap corruption once memblock is freed/torn down.
>>
>> Signed-off-by: Ashish Kalra <[email protected]>
>> ---
>> arch/x86/platform/efi/quirks.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
>> index f0cc00032751..d4562d074371 100644
>> --- a/arch/x86/platform/efi/quirks.c
>> +++ b/arch/x86/platform/efi/quirks.c
>> @@ -258,6 +258,16 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
>> int num_entries;
>> void *new;
>>
>> + /*
>> + * For kexec use case, we need to use the EFI memmap passed from the first
>> + * kernel via setup data, so we need to skip this.
>> + * Additionally kexec_enter_virtual_mode() during late init will remap
>> + * the efi_memmap physical pages allocated here via memblock & then
>> + * subsequently cause random EFI memmap corruption once memblock is freed.
> Can you elaborate a bit about the corruption, is it reproducible without
> SNP?

This is only reproducible on SNP.

This is the call stack leading to efi_arch_mem_reserve():

[    0.313377] efi_arch_mem_reserve+0x64/0x220
[    0.314060] ? memblock_add_range+0x2a0/0x2e0
[    0.314763] efi_mem_reserve+0x36/0x60
[    0.315360] efi_bgrt_init+0x17d/0x1a0
[    0.315959] ? __pfx_acpi_parse_bgrt+0x10/0x10
[    0.316711] acpi_parse_bgrt+0x12/0x20
[    0.317310] acpi_table_parse+0x77/0xd0
[    0.317922] acpi_boot_init+0x362/0x630
[    0.318535] setup_arch+0xa4e/0xf90
[    0.319091] start_kernel+0x68/0xa70
[    0.319664] x86_64_start_reservations+0x1c/0x30
[    0.320431] x86_64_start_kernel+0xbf/0x110
[    0.321099] secondary_startup_64_no_verify+0x179/0x17b

efi_arch_mem_reserve() calls efi_memmap_alloc(), which in turn calls
__efi_memmap_alloc_early(), which does memblock_phys_alloc(). It then
calls efi_memmap_install(), which does early_memremap() of the EFI
memmap at this memblock-allocated physical memory. So the EFI memmap
now lives in memblock-allocated memory.

Later, kexec_enter_virtual_mode() calls efi_memmap_init_late(), which
memremap()s the EFI memmap at the above memblock-allocated physical
range.

When memblock memory is later freed during late init, this
memblock-allocated physical range gets freed and re-allocated, and the
new owner eventually overwrites and corrupts the EFI memmap, leading to
a crash on the subsequent kexec boot.
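
To illustrate the sequence above, here is a minimal user-space sketch of
the same pattern. It is a toy model, not kernel code: the fixed arena
stands in for memblock, and early_alloc(), early_teardown(),
install_memmap() and demo_corruption() are hypothetical names made up
for this example. It only shows why a pointer kept across the teardown
of an early allocator ends up reading someone else's data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the early (memblock-like) allocator: one fixed arena. */
#define ARENA_SIZE 64
static unsigned char arena[ARENA_SIZE];
static size_t arena_used;

/* Hypothetical helper: bump-allocate from the arena. */
static void *early_alloc(size_t size)
{
	void *p;

	if (size > ARENA_SIZE - arena_used)
		return NULL;
	p = &arena[arena_used];
	arena_used += size;
	return p;
}

/* Hypothetical helper: tear down the early allocator; its memory is reusable. */
static void early_teardown(void)
{
	arena_used = 0;
}

/* Toy "memory map" pointer that outlives the early allocator. */
static unsigned char *memmap;

/* Copy the map into early-allocated memory and start using that copy. */
static int install_memmap(const unsigned char *src, size_t size)
{
	unsigned char *buf = early_alloc(size);

	if (!buf)
		return -1;
	memcpy(buf, src, size);
	memmap = buf;
	return 0;
}

/* Returns 1 if the installed map was corrupted after teardown, 0 if not. */
static int demo_corruption(void)
{
	const unsigned char map[4] = { 1, 2, 3, 4 };
	unsigned char *other;

	if (install_memmap(map, sizeof(map)))
		return -1;
	if (memcmp(memmap, map, sizeof(map)) != 0)
		return -1;

	/* Early allocator goes away; memmap still points into its memory. */
	early_teardown();

	/* An unrelated user gets the same bytes and writes to them. */
	other = early_alloc(sizeof(map));
	if (!other)
		return -1;
	memset(other, 0xAA, sizeof(map));

	return memcmp(memmap, map, sizeof(map)) != 0;
}
```

In this model, skipping install_memmap() entirely (the analogue of
returning early from efi_arch_mem_reserve() on kexec) leaves the map
pointer on memory that the early allocator never owned.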

>> + */
>> + if (efi_setup)
>> + return;
>> +
> How about checking the md attribute instead of checking the efi_setup,
> personally I feel it a bit better, something like below:

I based the above on the following code checking for kexec boot:

void __init efi_enter_virtual_mode(void)
{
        ...

        if (efi_setup)
                kexec_enter_virtual_mode();
        else
                __efi_enter_virtual_mode();

But I have tested the code you shared below, which checks the md
attribute, and it works, so I can resend my v2 patch based on it.
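
For reference, the decision order in the md-attribute variant can be
sketched as a standalone predicate. This is only an illustration under
assumptions: struct toy_md, enum reserve_action and classify() are
hypothetical names, and the two constants are written out here with the
values I believe the UEFI specification gives them rather than taken
from the kernel headers. The reading that a descriptor already carrying
EFI_MEMORY_RUNTIME came from the first kernel's memmap follows the
comment in the quoted diff.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, per the UEFI memory type and attribute definitions. */
#define EFI_BOOT_SERVICES_DATA	4
#define EFI_MEMORY_RUNTIME	((uint64_t)0x8000000000000000ULL)

/* Hypothetical, cut-down descriptor with only the fields checked here. */
struct toy_md {
	uint32_t type;
	uint64_t attribute;
};

enum reserve_action {
	RESERVE_REJECT,		/* not boot services data: report and bail out */
	RESERVE_SKIP,		/* already marked runtime: nothing to do */
	RESERVE_PROCEED,	/* go on and update the memmap */
};

/* Mirrors the order of the checks in the suggested diff. */
static enum reserve_action classify(const struct toy_md *md)
{
	if (md->type != EFI_BOOT_SERVICES_DATA)
		return RESERVE_REJECT;
	if (md->attribute & EFI_MEMORY_RUNTIME)
		return RESERVE_SKIP;
	return RESERVE_PROCEED;
}
```

Unlike the efi_setup check, this predicate depends only on the
descriptor itself, so it does not need to know how the kernel was
booted.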

Thanks, Ashish

>
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index f0cc00032751..699332b075bb 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -255,15 +255,24 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> struct efi_memory_map_data data = { 0 };
> struct efi_mem_range mr;
> efi_memory_desc_t md;
> - int num_entries;
> + int num_entries, ret;
> void *new;
>
> - if (efi_mem_desc_lookup(addr, &md) ||
> - md.type != EFI_BOOT_SERVICES_DATA) {
> + ret = efi_mem_desc_lookup(addr, &md);
> + if (ret) {
> pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
> return;
> }
>
> + if (md.type != EFI_BOOT_SERVICES_DATA) {
> + pr_err("Skip reserving non EFI Boot Service Data memory for %pa\n", &addr);
> + return;
> + }
> +
> + /* Kexec copied the efi memmap from the 1st kernel, thus skip the case. */
> + if (md.attribute & EFI_MEMORY_RUNTIME)
> + return;
> +
> if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
> pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
> return;
>
>