2017-03-08 20:46:20

by Omar Sandoval

[permalink] [raw]
Subject: kexec regression since 4.9 caused by efi

Hi, everyone,

Since 4.9, kexec results in the following panic on some of our servers:

[ 0.001000] general protection fault: 0000 [#1] SMP
[ 0.001000] Modules linked in:
[ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1 #53
[ 0.001000] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 09/30/2016
[ 0.001000] task: ffffffff81e0e4c0 task.stack: ffffffff81e00000
[ 0.001000] RIP: 0010:virt_efi_set_variable+0x85/0x1a0
[ 0.001000] RSP: 0000:ffffffff81e03e18 EFLAGS: 00010202
[ 0.001000] RAX: afafafafafafafaf RBX: ffffffff81e3a4e0 RCX: 0000000000000007
[ 0.001000] RDX: ffffffff81e03e70 RSI: ffffffff81e3a4e0 RDI: ffff88407f8c2de0
[ 0.001000] RBP: ffffffff81e03e60 R08: 0000000000000000 R09: 0000000000000000
[ 0.001000] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81e03e70
[ 0.001000] R13: 0000000000000007 R14: 0000000000000000 R15: 0000000000000000
[ 0.001000] FS: 0000000000000000(0000) GS:ffff881fff600000(0000) knlGS:0000000000000000
[ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.001000] CR2: ffff88407f30f000 CR3: 0000001fff102000 CR4: 00000000000406b0
[ 0.001000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.001000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.001000] Call Trace:
[ 0.001000] efi_delete_dummy_variable+0x7a/0x80
[ 0.001000] efi_enter_virtual_mode+0x3e2/0x494
[ 0.001000] start_kernel+0x392/0x418
[ 0.001000] ? set_init_arg+0x55/0x55
[ 0.001000] x86_64_start_reservations+0x2a/0x2c
[ 0.001000] x86_64_start_kernel+0xea/0xed
[ 0.001000] start_cpu+0x14/0x14
[ 0.001000] Code: 42 25 8d ff 80 3d 43 77 95 00 00 75 68 9c 8f 04 24 48 8b 05 3e 7d 7e 00 48 89 de 4d 89 f9 4d 89 f0 44 89 e9 4c 89 e2 48 8b 40 58 <48> 8b 78 58 31 c0 e8 90 e4 92 ff 48 8b 3c 24 48 c7 c6 2b 0a ca
[ 0.001000] RIP: virt_efi_set_variable+0x85/0x1a0 RSP: ffffffff81e03e18
[ 0.001000] ---[ end trace 0bd213e540e9b19f ]---
[ 0.001000] Kernel panic - not syncing: Fatal exception
[ 0.001000] ---[ end Kernel panic - not syncing: Fatal exception

Booting normally (i.e., not kexec) still works.

The decoded code is:


0: 42 25 8d ff 80 3d rex.X and $0x3d80ff8d,%eax
6: 43 77 95 rex.XB ja 0xffffffffffffff9e
9: 00 00 add %al,(%rax)
b: 75 68 jne 0x75
d: 9c pushfq
e: 8f 04 24 popq (%rsp)
11: 48 8b 05 3e 7d 7e 00 mov 0x7e7d3e(%rip),%rax # 0x7e7d56
18: 48 89 de mov %rbx,%rsi
1b: 4d 89 f9 mov %r15,%r9
1e: 4d 89 f0 mov %r14,%r8
21: 44 89 e9 mov %r13d,%ecx
24: 4c 89 e2 mov %r12,%rdx
27: 48 8b 40 58 mov 0x58(%rax),%rax
2b:* 48 8b 78 58 mov 0x58(%rax),%rdi <-- trapping instruction
2f: 31 c0 xor %eax,%eax
31: e8 90 e4 92 ff callq 0xffffffffff92e4c6
36: 48 8b 3c 24 mov (%rsp),%rdi
3a: 48 rex.W
3b: c7 .byte 0xc7
3c: c6 (bad)
3d: 2b 0a sub (%rdx),%ecx
3f: ca .byte 0xca

If I'm reading this correctly, efi.systab->runtime == 0xafafafafafafafaf,
and we're crashing when we try to dereference that.

Here is the output of efi=debug from before the crash:

[ 0.000000] Linux version 4.11.0-rc1 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 12:07:16 PST 2017
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.6.7-34_fbk7_2504_g8275185 ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 fsck.repair=yes pcie_pme=nomsi netconsole=+6666@2401:db00:0011:b03e:face:0000:0009:0000/eth0,1514@2401:db00:eef0:a59::/02:90:fb:5b:b7:1e crashkernel=128M console=tty0 co
nsole=ttyS1,57600 efi=debug
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000009ffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000750bdfff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000750be000-0x0000000075ddbfff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000075ddc000-0x0000000075e32fff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x0000000075e33000-0x000000007642dfff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x000000007642e000-0x000000007bcd9fff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000007bcda000-0x000000007bcdafff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007bcdb000-0x000000007bd60fff] reserved
[ 0.000000] BIOS-e820: [mem 0x000000007bd61000-0x000000007bffffff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007c000000-0x000000008fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000407fffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] extended physical RAM map:
[ 0.000000] reserve setup_data: [mem 0x0000000000000100-0x000000000009ffff] usable
[ 0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000000010006f] usable
[ 0.000000] reserve setup_data: [mem 0x0000000000100070-0x00000000750bdfff] usable
[ 0.000000] reserve setup_data: [mem 0x00000000750be000-0x0000000075ddbfff] reserved
[ 0.000000] reserve setup_data: [mem 0x0000000075ddc000-0x0000000075e32fff] ACPI data
[ 0.000000] reserve setup_data: [mem 0x0000000075e33000-0x000000007642dfff] ACPI NVS
[ 0.000000] reserve setup_data: [mem 0x000000007642e000-0x000000007bcd9fff] reserved
[ 0.000000] reserve setup_data: [mem 0x000000007bcda000-0x000000007bcdafff] usable
[ 0.000000] reserve setup_data: [mem 0x000000007bcdb000-0x000000007bd60fff] reserved
[ 0.000000] reserve setup_data: [mem 0x000000007bd61000-0x000000007bffffff] usable
[ 0.000000] reserve setup_data: [mem 0x000000007c000000-0x000000008fffffff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
[ 0.000000] reserve setup_data: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
[ 0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000407fffffff] usable
[ 0.000000] efi: EFI v2.40 by American Megatrends
[ 0.000000] efi: ACPI=0x75f5c000 ACPI 2.0=0x75f5c000 ESRT=0x7bc4d018 SMBIOS=0xf05e0 SMBIOS 3.0=0x7bb2f000 MPS=0xfc9e0
[ 0.000000] efi: mem00: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007642e000-0x000000007bc50fff] (88MB)
[ 0.000000] efi: mem01: [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007bc51000-0x000000007bcd9fff] (0MB)
[ 0.000000] efi: mem02: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007bcdb000-0x000000007bd60fff] (0MB)
[ 0.000000] efi: mem03: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x0000000080000000-0x000000008fffffff] (256MB)
[ 0.000000] efi: mem04: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x00000000fed1c000-0x00000000fed44fff] (0MB)
[ 0.000000] efi: mem05: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x00000000ff000000-0x00000000ffffffff] (16MB)
[ 0.000000] SMBIOS 3.0.0 present.
[ 0.000000] DMI: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 09/30/2016
[ 0.000000] e820: last_pfn = 0x4080000 max_arch_pfn = 0x400000000
[ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT
[ 0.000000] x2apic: enabled by BIOS, switching to x2apic ops
[ 0.000000] e820: last_pfn = 0x7c000 max_arch_pfn = 0x400000000
[ 0.000000] esrt: Reserving ESRT space from 0x000000007bc4d018 to 0x000000007bc4d050.

Reverting commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and
avoid a kmalloc()") on top of v4.11-rc1 fixes the problem. Bisecting
this was interesting:

- Up until 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a
kmalloc()"), kexec worked.

- From 8e80632fb23f to 9d80448ac92b ("efi/arm64: Add debugfs node to
dump UEFI runtime page tables"), kexec just hung after the
"kexec_core: Starting new kernel" message.

- From 3dad6f7f6975 ("x86/efi: Defer efi_esrt_init until after
memblock_x86_fill") to 0a637ee61247 ("x86/efi: Allow invocation of
arbitrary boot services"), kexec hit the BUG_ON(!efi.systab) in
kexec_enter_virtual_mode().

- Finally, after 92dc33501bfb ("x86/efi: Round EFI memmap reservations
to EFI_PAGE_SIZE"), I get the panic described above.

Let me know if there is any more information I can provide.

Thanks!


2017-03-09 02:59:26

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 03/08/17 at 12:16pm, Omar Sandoval wrote:
> Hi, everyone,
>
> Since 4.9, kexec results in the following panic on some of our servers:
>
> [ 0.001000] general protection fault: 0000 [#1] SMP
> [ 0.001000] Modules linked in:
> [ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1 #53
> [ 0.001000] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 09/30/2016
> [ 0.001000] task: ffffffff81e0e4c0 task.stack: ffffffff81e00000
> [ 0.001000] RIP: 0010:virt_efi_set_variable+0x85/0x1a0
> [ 0.001000] RSP: 0000:ffffffff81e03e18 EFLAGS: 00010202
> [ 0.001000] RAX: afafafafafafafaf RBX: ffffffff81e3a4e0 RCX: 0000000000000007
> [ 0.001000] RDX: ffffffff81e03e70 RSI: ffffffff81e3a4e0 RDI: ffff88407f8c2de0
> [ 0.001000] RBP: ffffffff81e03e60 R08: 0000000000000000 R09: 0000000000000000
> [ 0.001000] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81e03e70
> [ 0.001000] R13: 0000000000000007 R14: 0000000000000000 R15: 0000000000000000
> [ 0.001000] FS: 0000000000000000(0000) GS:ffff881fff600000(0000) knlGS:0000000000000000
> [ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.001000] CR2: ffff88407f30f000 CR3: 0000001fff102000 CR4: 00000000000406b0
> [ 0.001000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 0.001000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 0.001000] Call Trace:
> [ 0.001000] efi_delete_dummy_variable+0x7a/0x80
> [ 0.001000] efi_enter_virtual_mode+0x3e2/0x494
> [ 0.001000] start_kernel+0x392/0x418
> [ 0.001000] ? set_init_arg+0x55/0x55
> [ 0.001000] x86_64_start_reservations+0x2a/0x2c
> [ 0.001000] x86_64_start_kernel+0xea/0xed
> [ 0.001000] start_cpu+0x14/0x14
> [ 0.001000] Code: 42 25 8d ff 80 3d 43 77 95 00 00 75 68 9c 8f 04 24 48 8b 05 3e 7d 7e 00 48 89 de 4d 89 f9 4d 89 f0 44 89 e9 4c 89 e2 48 8b 40 58 <48> 8b 78 58 31 c0 e8 90 e4 92 ff 48 8b 3c 24 48 c7 c6 2b 0a ca
> [ 0.001000] RIP: virt_efi_set_variable+0x85/0x1a0 RSP: ffffffff81e03e18
> [ 0.001000] ---[ end trace 0bd213e540e9b19f ]---
> [ 0.001000] Kernel panic - not syncing: Fatal exception
> [ 0.001000] ---[ end Kernel panic - not syncing: Fatal exception
>
> Booting normally (i.e., not kexec) still works.
>
> The decoded code is:
>
>
> 0: 42 25 8d ff 80 3d rex.X and $0x3d80ff8d,%eax
> 6: 43 77 95 rex.XB ja 0xffffffffffffff9e
> 9: 00 00 add %al,(%rax)
> b: 75 68 jne 0x75
> d: 9c pushfq
> e: 8f 04 24 popq (%rsp)
> 11: 48 8b 05 3e 7d 7e 00 mov 0x7e7d3e(%rip),%rax # 0x7e7d56
> 18: 48 89 de mov %rbx,%rsi
> 1b: 4d 89 f9 mov %r15,%r9
> 1e: 4d 89 f0 mov %r14,%r8
> 21: 44 89 e9 mov %r13d,%ecx
> 24: 4c 89 e2 mov %r12,%rdx
> 27: 48 8b 40 58 mov 0x58(%rax),%rax
> 2b:* 48 8b 78 58 mov 0x58(%rax),%rdi <-- trapping instruction
> 2f: 31 c0 xor %eax,%eax
> 31: e8 90 e4 92 ff callq 0xffffffffff92e4c6
> 36: 48 8b 3c 24 mov (%rsp),%rdi
> 3a: 48 rex.W
> 3b: c7 .byte 0xc7
> 3c: c6 (bad)
> 3d: 2b 0a sub (%rdx),%ecx
> 3f: ca .byte 0xca
>
> If I'm reading this correctly, efi.systab->runtime == 0xafafafafafafafaf,
> and we're crashing when we try to dereference that.
>
> Here is the output of efi=debug from before the crash:
>
> [ 0.000000] Linux version 4.11.0-rc1 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 12:07:16 PST 2017
> [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.6.7-34_fbk7_2504_g8275185 ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 fsck.repair=yes pcie_pme=nomsi netconsole=+6666@2401:db00:0011:b03e:face:0000:0009:0000/eth0,1514@2401:db00:eef0:a59::/02:90:fb:5b:b7:1e crashkernel=128M console=tty0 co
> nsole=ttyS1,57600 efi=debug
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
> [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000009ffff] usable
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000750bdfff] usable
> [ 0.000000] BIOS-e820: [mem 0x00000000750be000-0x0000000075ddbfff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000075ddc000-0x0000000075e32fff] ACPI data
> [ 0.000000] BIOS-e820: [mem 0x0000000075e33000-0x000000007642dfff] ACPI NVS
> [ 0.000000] BIOS-e820: [mem 0x000000007642e000-0x000000007bcd9fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x000000007bcda000-0x000000007bcdafff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000007bcdb000-0x000000007bd60fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x000000007bd61000-0x000000007bffffff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000007c000000-0x000000008fffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000407fffffff] usable
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] extended physical RAM map:
> [ 0.000000] reserve setup_data: [mem 0x0000000000000100-0x000000000009ffff] usable
> [ 0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000000010006f] usable
> [ 0.000000] reserve setup_data: [mem 0x0000000000100070-0x00000000750bdfff] usable
> [ 0.000000] reserve setup_data: [mem 0x00000000750be000-0x0000000075ddbfff] reserved
> [ 0.000000] reserve setup_data: [mem 0x0000000075ddc000-0x0000000075e32fff] ACPI data
> [ 0.000000] reserve setup_data: [mem 0x0000000075e33000-0x000000007642dfff] ACPI NVS
> [ 0.000000] reserve setup_data: [mem 0x000000007642e000-0x000000007bcd9fff] reserved
> [ 0.000000] reserve setup_data: [mem 0x000000007bcda000-0x000000007bcdafff] usable
> [ 0.000000] reserve setup_data: [mem 0x000000007bcdb000-0x000000007bd60fff] reserved
> [ 0.000000] reserve setup_data: [mem 0x000000007bd61000-0x000000007bffffff] usable
> [ 0.000000] reserve setup_data: [mem 0x000000007c000000-0x000000008fffffff] reserved
> [ 0.000000] reserve setup_data: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
> [ 0.000000] reserve setup_data: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
> [ 0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000407fffffff] usable
> [ 0.000000] efi: EFI v2.40 by American Megatrends
> [ 0.000000] efi: ACPI=0x75f5c000 ACPI 2.0=0x75f5c000 ESRT=0x7bc4d018 SMBIOS=0xf05e0 SMBIOS 3.0=0x7bb2f000 MPS=0xfc9e0
> [ 0.000000] efi: mem00: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007642e000-0x000000007bc50fff] (88MB)
> [ 0.000000] efi: mem01: [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007bc51000-0x000000007bcd9fff] (0MB)
> [ 0.000000] efi: mem02: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007bcdb000-0x000000007bd60fff] (0MB)
> [ 0.000000] efi: mem03: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x0000000080000000-0x000000008fffffff] (256MB)
> [ 0.000000] efi: mem04: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x00000000fed1c000-0x00000000fed44fff] (0MB)
> [ 0.000000] efi: mem05: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x00000000ff000000-0x00000000ffffffff] (16MB)
> [ 0.000000] SMBIOS 3.0.0 present.
> [ 0.000000] DMI: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 09/30/2016
> [ 0.000000] e820: last_pfn = 0x4080000 max_arch_pfn = 0x400000000
> [ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT
> [ 0.000000] x2apic: enabled by BIOS, switching to x2apic ops
> [ 0.000000] e820: last_pfn = 0x7c000 max_arch_pfn = 0x400000000
> [ 0.000000] esrt: Reserving ESRT space from 0x000000007bc4d018 to 0x000000007bc4d050.
>
> Reverting commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and
> avoid a kmalloc()") on top of v4.11-rc1 fixes the problem. Bisecting
> this was interesting:

I have no esrt machine to test, can you share the full kernel log with
efi=debug in kernel cmdline?

*) normal boot kernel log without the reverting
*) kexec boot log with and without the reverting


>
> - Up until 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a
> kmalloc()"), kexec worked.
>
> - From 8e80632fb23f to 9d80448ac92b ("efi/arm64: Add debugfs node to
> dump UEFI runtime page tables"), kexec just hung after the
> "kexec_core: Starting new kernel" message.
>
> - From 3dad6f7f6975 ("x86/efi: Defer efi_esrt_init until after
> memblock_x86_fill") to 0a637ee61247 ("x86/efi: Allow invocation of
> arbitrary boot services"), kexec hit the BUG_ON(!efi.systab) in
> kexec_enter_virtual_mode().
>
> - Finally, after 92dc33501bfb ("x86/efi: Round EFI memmap reservations
> to EFI_PAGE_SIZE"), I get the panic described above.
>
> Let me know if there is any more information I can provide.
>

Thanks
Dave

2017-03-09 06:38:26

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

Add efi/kexec list.

On 03/08/17 at 12:16pm, Omar Sandoval wrote:
> Hi, everyone,
>
> Since 4.9, kexec results in the following panic on some of our servers:
>
> [ 0.001000] general protection fault: 0000 [#1] SMP
> [ 0.001000] Modules linked in:
> [ 0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1 #53
> [ 0.001000] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 09/30/2016
> [ 0.001000] task: ffffffff81e0e4c0 task.stack: ffffffff81e00000
> [ 0.001000] RIP: 0010:virt_efi_set_variable+0x85/0x1a0
> [ 0.001000] RSP: 0000:ffffffff81e03e18 EFLAGS: 00010202
> [ 0.001000] RAX: afafafafafafafaf RBX: ffffffff81e3a4e0 RCX: 0000000000000007
> [ 0.001000] RDX: ffffffff81e03e70 RSI: ffffffff81e3a4e0 RDI: ffff88407f8c2de0
> [ 0.001000] RBP: ffffffff81e03e60 R08: 0000000000000000 R09: 0000000000000000
> [ 0.001000] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81e03e70
> [ 0.001000] R13: 0000000000000007 R14: 0000000000000000 R15: 0000000000000000
> [ 0.001000] FS: 0000000000000000(0000) GS:ffff881fff600000(0000) knlGS:0000000000000000
> [ 0.001000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.001000] CR2: ffff88407f30f000 CR3: 0000001fff102000 CR4: 00000000000406b0
> [ 0.001000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 0.001000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 0.001000] Call Trace:
> [ 0.001000] efi_delete_dummy_variable+0x7a/0x80
> [ 0.001000] efi_enter_virtual_mode+0x3e2/0x494
> [ 0.001000] start_kernel+0x392/0x418
> [ 0.001000] ? set_init_arg+0x55/0x55
> [ 0.001000] x86_64_start_reservations+0x2a/0x2c
> [ 0.001000] x86_64_start_kernel+0xea/0xed
> [ 0.001000] start_cpu+0x14/0x14
> [ 0.001000] Code: 42 25 8d ff 80 3d 43 77 95 00 00 75 68 9c 8f 04 24 48 8b 05 3e 7d 7e 00 48 89 de 4d 89 f9 4d 89 f0 44 89 e9 4c 89 e2 48 8b 40 58 <48> 8b 78 58 31 c0 e8 90 e4 92 ff 48 8b 3c 24 48 c7 c6 2b 0a ca
> [ 0.001000] RIP: virt_efi_set_variable+0x85/0x1a0 RSP: ffffffff81e03e18
> [ 0.001000] ---[ end trace 0bd213e540e9b19f ]---
> [ 0.001000] Kernel panic - not syncing: Fatal exception
> [ 0.001000] ---[ end Kernel panic - not syncing: Fatal exception
>
> Booting normally (i.e., not kexec) still works.
>
> The decoded code is:
>
>
> 0: 42 25 8d ff 80 3d rex.X and $0x3d80ff8d,%eax
> 6: 43 77 95 rex.XB ja 0xffffffffffffff9e
> 9: 00 00 add %al,(%rax)
> b: 75 68 jne 0x75
> d: 9c pushfq
> e: 8f 04 24 popq (%rsp)
> 11: 48 8b 05 3e 7d 7e 00 mov 0x7e7d3e(%rip),%rax # 0x7e7d56
> 18: 48 89 de mov %rbx,%rsi
> 1b: 4d 89 f9 mov %r15,%r9
> 1e: 4d 89 f0 mov %r14,%r8
> 21: 44 89 e9 mov %r13d,%ecx
> 24: 4c 89 e2 mov %r12,%rdx
> 27: 48 8b 40 58 mov 0x58(%rax),%rax
> 2b:* 48 8b 78 58 mov 0x58(%rax),%rdi <-- trapping instruction
> 2f: 31 c0 xor %eax,%eax
> 31: e8 90 e4 92 ff callq 0xffffffffff92e4c6
> 36: 48 8b 3c 24 mov (%rsp),%rdi
> 3a: 48 rex.W
> 3b: c7 .byte 0xc7
> 3c: c6 (bad)
> 3d: 2b 0a sub (%rdx),%ecx
> 3f: ca .byte 0xca
>
> If I'm reading this correctly, efi.systab->runtime == 0xafafafafafafafaf,

I have no more clue yet from your provided log, but the runtime value is
odd to me. It is set in below code:

arch/x86/platform/efi/efi.c: efi_systab_init()
efi_systab.runtime = data ?
(void *)(unsigned long)data->runtime :
(void *)(unsigne long)systab64->runtime;

Here data is the setup_data passed by kexec-tools from normal kernel to
kexec kernel, efi_setup_data structure is like below:
struct efi_setup_data {
u64 fw_vendor;
u64 runtime;
u64 tables;
u64 smbios;
u64 reserved[8];
};

kexec-tools get the runtime address from /sys/firmware/efi/runtime

So can you do some debuggin on your side, eg. see the sysfs runtime
value is correct or not. And add some printk in efi init path etc.

> and we're crashing when we try to dereference that.
>
> Here is the output of efi=debug from before the crash:
>
> [ 0.000000] Linux version 4.11.0-rc1 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #53 SMP Wed Mar 8 12:07:16 PST 2017
> [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.6.7-34_fbk7_2504_g8275185 ro root=LABEL=/ ipv6.autoconf=0 erst_disable biosdevname=0 net.ifnames=0 fsck.repair=yes pcie_pme=nomsi netconsole=+6666@2401:db00:0011:b03e:face:0000:0009:0000/eth0,1514@2401:db00:eef0:a59::/02:90:fb:5b:b7:1e crashkernel=128M console=tty0 co
> nsole=ttyS1,57600 efi=debug
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
> [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000009ffff] usable
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000750bdfff] usable
> [ 0.000000] BIOS-e820: [mem 0x00000000750be000-0x0000000075ddbfff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000075ddc000-0x0000000075e32fff] ACPI data
> [ 0.000000] BIOS-e820: [mem 0x0000000075e33000-0x000000007642dfff] ACPI NVS
> [ 0.000000] BIOS-e820: [mem 0x000000007642e000-0x000000007bcd9fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x000000007bcda000-0x000000007bcdafff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000007bcdb000-0x000000007bd60fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x000000007bd61000-0x000000007bffffff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000007c000000-0x000000008fffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000407fffffff] usable
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] extended physical RAM map:
> [ 0.000000] reserve setup_data: [mem 0x0000000000000100-0x000000000009ffff] usable
> [ 0.000000] reserve setup_data: [mem 0x0000000000100000-0x000000000010006f] usable
> [ 0.000000] reserve setup_data: [mem 0x0000000000100070-0x00000000750bdfff] usable
> [ 0.000000] reserve setup_data: [mem 0x00000000750be000-0x0000000075ddbfff] reserved
> [ 0.000000] reserve setup_data: [mem 0x0000000075ddc000-0x0000000075e32fff] ACPI data
> [ 0.000000] reserve setup_data: [mem 0x0000000075e33000-0x000000007642dfff] ACPI NVS
> [ 0.000000] reserve setup_data: [mem 0x000000007642e000-0x000000007bcd9fff] reserved
> [ 0.000000] reserve setup_data: [mem 0x000000007bcda000-0x000000007bcdafff] usable
> [ 0.000000] reserve setup_data: [mem 0x000000007bcdb000-0x000000007bd60fff] reserved
> [ 0.000000] reserve setup_data: [mem 0x000000007bd61000-0x000000007bffffff] usable
> [ 0.000000] reserve setup_data: [mem 0x000000007c000000-0x000000008fffffff] reserved
> [ 0.000000] reserve setup_data: [mem 0x00000000fed1c000-0x00000000fed44fff] reserved
> [ 0.000000] reserve setup_data: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
> [ 0.000000] reserve setup_data: [mem 0x0000000100000000-0x000000407fffffff] usable
> [ 0.000000] efi: EFI v2.40 by American Megatrends
> [ 0.000000] efi: ACPI=0x75f5c000 ACPI 2.0=0x75f5c000 ESRT=0x7bc4d018 SMBIOS=0xf05e0 SMBIOS 3.0=0x7bb2f000 MPS=0xfc9e0
> [ 0.000000] efi: mem00: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007642e000-0x000000007bc50fff] (88MB)
> [ 0.000000] efi: mem01: [Runtime Code |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007bc51000-0x000000007bcd9fff] (0MB)
> [ 0.000000] efi: mem02: [Runtime Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x000000007bcdb000-0x000000007bd60fff] (0MB)
> [ 0.000000] efi: mem03: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x0000000080000000-0x000000008fffffff] (256MB)
> [ 0.000000] efi: mem04: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x00000000fed1c000-0x00000000fed44fff] (0MB)
> [ 0.000000] efi: mem05: [Memory Mapped I/O |RUN| | | | | | | | | | |UC] range=[0x00000000ff000000-0x00000000ffffffff] (16MB)
> [ 0.000000] SMBIOS 3.0.0 present.
> [ 0.000000] DMI: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM05 09/30/2016
> [ 0.000000] e820: last_pfn = 0x4080000 max_arch_pfn = 0x400000000
> [ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT
> [ 0.000000] x2apic: enabled by BIOS, switching to x2apic ops
> [ 0.000000] e820: last_pfn = 0x7c000 max_arch_pfn = 0x400000000
> [ 0.000000] esrt: Reserving ESRT space from 0x000000007bc4d018 to 0x000000007bc4d050.
>
> Reverting commit 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and
> avoid a kmalloc()") on top of v4.11-rc1 fixes the problem. Bisecting
> this was interesting:
>
> - Up until 8e80632fb23f ("efi/esrt: Use efi_mem_reserve() and avoid a
> kmalloc()"), kexec worked.
>
> - From 8e80632fb23f to 9d80448ac92b ("efi/arm64: Add debugfs node to
> dump UEFI runtime page tables"), kexec just hung after the
> "kexec_core: Starting new kernel" message.
>
> - From 3dad6f7f6975 ("x86/efi: Defer efi_esrt_init until after
> memblock_x86_fill") to 0a637ee61247 ("x86/efi: Allow invocation of
> arbitrary boot services"), kexec hit the BUG_ON(!efi.systab) in
> kexec_enter_virtual_mode().
>
> - Finally, after 92dc33501bfb ("x86/efi: Round EFI memmap reservations
> to EFI_PAGE_SIZE"), I get the panic described above.
>
> Let me know if there is any more information I can provide.
>
> Thanks!

Thanks
Dave

2017-03-09 07:42:21

by Omar Sandoval

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On Thu, Mar 09, 2017 at 10:21:36AM +0800, Dave Young wrote:
> I have no esrt machine to test, can you share the full kernel log with
> efi=debug in kernel cmdline?
>
> *) normal boot kernel log without the reverting
> *) kexec boot log with and without the reverting

Attached.


Attachments:
(No filename) (279.00 B)
normal_boot.txt (43.68 kB)
kexec_no_revert.txt (19.77 kB)
kexec_revert.txt (38.92 kB)
Download all attachments

2017-03-09 09:55:59

by Omar Sandoval

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On Thu, Mar 09, 2017 at 02:38:06PM +0800, Dave Young wrote:
> Add efi/kexec list.
>
> On 03/08/17 at 12:16pm, Omar Sandoval wrote:

[snip]

> I have no more clue yet from your provided log, but the runtime value is
> odd to me. It is set in below code:
>
> arch/x86/platform/efi/efi.c: efi_systab_init()
> efi_systab.runtime = data ?
> (void *)(unsigned long)data->runtime :
> (void *)(unsigne long)systab64->runtime;
>
> Here data is the setup_data passed by kexec-tools from normal kernel to
> kexec kernel, efi_setup_data structure is like below:
> struct efi_setup_data {
> u64 fw_vendor;
> u64 runtime;
> u64 tables;
> u64 smbios;
> u64 reserved[8];
> };
>
> kexec-tools get the runtime address from /sys/firmware/efi/runtime
>
> So can you do some debuggin on your side, eg. see the sysfs runtime
> value is correct or not. And add some printk in efi init path etc.

The attached patch fixes this for me.


Attachments:
(No filename) (974.00 B)
0001-efi-adjust-virt_addr-when-splitting-descriptors-in-e.patch (2.17 kB)
Download all attachments

2017-03-09 11:55:30

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 9 March 2017 at 10:54, Omar Sandoval <[email protected]> wrote:
> On Thu, Mar 09, 2017 at 02:38:06PM +0800, Dave Young wrote:
>> Add efi/kexec list.
>>
>> On 03/08/17 at 12:16pm, Omar Sandoval wrote:
>
> [snip]
>
>> I have no more clue yet from your provided log, but the runtime value is
>> odd to me. It is set in below code:
>>
>> arch/x86/platform/efi/efi.c: efi_systab_init()
>> efi_systab.runtime = data ?
>> (void *)(unsigned long)data->runtime :
>> (void *)(unsigne long)systab64->runtime;
>>
>> Here data is the setup_data passed by kexec-tools from normal kernel to
>> kexec kernel, efi_setup_data structure is like below:
>> struct efi_setup_data {
>> u64 fw_vendor;
>> u64 runtime;
>> u64 tables;
>> u64 smbios;
>> u64 reserved[8];
>> };
>>
>> kexec-tools get the runtime address from /sys/firmware/efi/runtime
>>
>> So can you do some debuggin on your side, eg. see the sysfs runtime
>> value is correct or not. And add some printk in efi init path etc.
>
> The attached patch fixes this for me.

Hi Omar,

Thanks for tracking this down.

I wonder if this is an unintended side effect of the way we repurpose
the EFI_MEMORY_RUNTIME attribute in efi_arch_mem_reserve(). AFAIUI,
splitting memory map entries should only be necessary for regions that
are not runtime memory regions to begin with, and so whether their
virtual mapping address makes sense or not should be irrelevant.

Perhaps this only illustrates my lack of understanding of the x86 way
of doing this, so perhaps Matt can shed some light on this?

Thanks,
Ard.

2017-03-10 01:40:21

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 03/09/17 at 12:53pm, Ard Biesheuvel wrote:
> On 9 March 2017 at 10:54, Omar Sandoval <[email protected]> wrote:
> > On Thu, Mar 09, 2017 at 02:38:06PM +0800, Dave Young wrote:
> >> Add efi/kexec list.
> >>
> >> On 03/08/17 at 12:16pm, Omar Sandoval wrote:
> >
> > [snip]
> >
> >> I have no more clue yet from your provided log, but the runtime value is
> >> odd to me. It is set in below code:
> >>
> >> arch/x86/platform/efi/efi.c: efi_systab_init()
> >> efi_systab.runtime = data ?
> >> (void *)(unsigned long)data->runtime :
> >> (void *)(unsigne long)systab64->runtime;
> >>
> >> Here data is the setup_data passed by kexec-tools from normal kernel to
> >> kexec kernel, efi_setup_data structure is like below:
> >> struct efi_setup_data {
> >> u64 fw_vendor;
> >> u64 runtime;
> >> u64 tables;
> >> u64 smbios;
> >> u64 reserved[8];
> >> };
> >>
> >> kexec-tools get the runtime address from /sys/firmware/efi/runtime
> >>
> >> So can you do some debuggin on your side, eg. see the sysfs runtime
> >> value is correct or not. And add some printk in efi init path etc.
> >
> > The attached patch fixes this for me.
>
> Hi Omar,
>
> Thanks for tracking this down.
>
> I wonder if this is an unintended side effect of the way we repurpose
> the EFI_MEMORY_RUNTIME attribute in efi_arch_mem_reserve(). AFAIUI,
> splitting memory map entries should only be necessary for regions that
> are not runtime memory regions to begin with, and so whether their
> virtual mapping address makes sense or not should be irrelevant.

In this case the esrt chunk are Runtime Data which is not necessary to
be reserved explicitly. I think efi_arch_mem_reserve are for boot areas.

Probably there could be esrt data which belongs to boot data? If we are
sure they are all runtime, the better fix may be just dropping the
efi_mem_reserve in esrt.c

>
> Perhaps this only illustrates my lack of understanding of the x86 way
> of doing this, so perhaps Matt can shed some light on this?
>
> Thanks,
> Ard.

Thanks
Dave

2017-03-10 01:44:13

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 03/09/17 at 01:54am, Omar Sandoval wrote:
> On Thu, Mar 09, 2017 at 02:38:06PM +0800, Dave Young wrote:
> > Add efi/kexec list.
> >
> > On 03/08/17 at 12:16pm, Omar Sandoval wrote:
>
> [snip]
>
> > I have no more clue yet from your provided log, but the runtime value is
> > odd to me. It is set in below code:
> >
> > arch/x86/platform/efi/efi.c: efi_systab_init()
> > efi_systab.runtime = data ?
> > (void *)(unsigned long)data->runtime :
> > (void *)(unsigne long)systab64->runtime;
> >
> > Here data is the setup_data passed by kexec-tools from normal kernel to
> > kexec kernel, efi_setup_data structure is like below:
> > struct efi_setup_data {
> > u64 fw_vendor;
> > u64 runtime;
> > u64 tables;
> > u64 smbios;
> > u64 reserved[8];
> > };
> >
> > kexec-tools get the runtime address from /sys/firmware/efi/runtime
> >
> > So can you do some debuggin on your side, eg. see the sysfs runtime
> > value is correct or not. And add some printk in efi init path etc.
>
> The attached patch fixes this for me.
> From 4b343f0b0b408469f28c973ea52877797a166313 Mon Sep 17 00:00:00 2001
> Message-Id: <4b343f0b0b408469f28c973ea52877797a166313.1489053164.git.osandov@fb.com>
> From: Omar Sandoval <[email protected]>
> Date: Thu, 9 Mar 2017 01:46:19 -0800
> Subject: [PATCH] efi: adjust virt_addr when splitting descriptors in
> efi_memmap_insert()
>
> When we split efi memory descriptors, we adjust the physical address but
> not the virtual address it maps to. This leads to bogus memory mappings
> later when these virtual addresses are used.
>
> This fixes a kexec boot regression since 8e80632fb23f ("efi/esrt: Use
> efi_mem_reserve() and avoid a kmalloc()"), although the bug was only
> exposed by that commit.
>
> Signed-off-by: Omar Sandoval <[email protected]>
> ---
> drivers/firmware/efi/memmap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/firmware/efi/memmap.c b/drivers/firmware/efi/memmap.c
> index 78686443cb37..ca614db76faf 100644
> --- a/drivers/firmware/efi/memmap.c
> +++ b/drivers/firmware/efi/memmap.c
> @@ -298,6 +298,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
> memcpy(new, old, old_memmap->desc_size);
> md = new;
> md->phys_addr = m_end + 1;
> + md->virt_addr += md->phys_addr - start;
> md->num_pages = (end - md->phys_addr + 1) >>
> EFI_PAGE_SHIFT;
> }
> @@ -312,6 +313,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
> md = new;
> md->attribute |= m_attr;
> md->phys_addr = m_start;
> + md->virt_addr += md->phys_addr - start;
> md->num_pages = (m_end - m_start + 1) >>
> EFI_PAGE_SHIFT;
> /* last part */
> @@ -319,6 +321,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
> memcpy(new, old, old_memmap->desc_size);
> md = new;
> md->phys_addr = m_end + 1;
> + md->virt_addr += md->phys_addr - start;
> md->num_pages = (end - m_end) >>
> EFI_PAGE_SHIFT;
> }
> @@ -333,6 +336,7 @@ void __init efi_memmap_insert(struct efi_memory_map *old_memmap, void *buf,
> memcpy(new, old, old_memmap->desc_size);
> md = new;
> md->phys_addr = m_start;
> + md->virt_addr += md->phys_addr - start;
> md->num_pages = (end - md->phys_addr + 1) >>
> EFI_PAGE_SHIFT;
> md->attribute |= m_attr;
> --
> 2.12.0
>

Nice, thanks for the debugging, so the problem is clear now.

Just Runtime areas are not necessarily to be reserved, for boot areas no
need to update the virt address. But I'm not sure about the fakemem
usage of this.

So need comments from Matt..

Thanks
Dave

2017-03-13 07:38:09

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 03/09/17 at 01:54am, Omar Sandoval wrote:
> On Thu, Mar 09, 2017 at 02:38:06PM +0800, Dave Young wrote:
> > Add efi/kexec list.
> >
> > On 03/08/17 at 12:16pm, Omar Sandoval wrote:
>
> [snip]
>
> > I have no more clue yet from your provided log, but the runtime value is
> > odd to me. It is set in below code:
> >
> > arch/x86/platform/efi/efi.c: efi_systab_init()
> > efi_systab.runtime = data ?
> > (void *)(unsigned long)data->runtime :
> > (void *)(unsigne long)systab64->runtime;
> >
> > Here data is the setup_data passed by kexec-tools from normal kernel to
> > kexec kernel, efi_setup_data structure is like below:
> > struct efi_setup_data {
> > u64 fw_vendor;
> > u64 runtime;
> > u64 tables;
> > u64 smbios;
> > u64 reserved[8];
> > };
> >
> > kexec-tools get the runtime address from /sys/firmware/efi/runtime
> >
> > So can you do some debuggin on your side, eg. see the sysfs runtime
> > value is correct or not. And add some printk in efi init path etc.
>
> The attached patch fixes this for me.

Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
correct to be used in efi_arch_mem_reserve, if it passed your test, I
can rewrite patch log with more background and send it out:

for_each_efi_memory_desc(md) {
[snip]
if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
md->type != EFI_BOOT_SERVICES_DATA &&
md->type != EFI_RUNTIME_SERVICES_DATA) {
continue;
}

In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
data or runtime data, this is wrong for efi_mem_reserve, because we are
reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
running time. Just is happened to work and we did not capture the error.

Signed-off-by: Dave Young <[email protected]>
---
arch/x86/platform/efi/quirks.c | 4 +++-
drivers/firmware/efi/efi.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/efi.h | 1 +
3 files changed, 43 insertions(+), 1 deletion(-)

--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -191,7 +191,9 @@ void __init efi_arch_mem_reserve(phys_ad
int num_entries;
void *new;

- if (efi_mem_desc_lookup(addr, &md)) {
+ if (efi_md_lookup_boot_data(addr, &md)) {
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
pr_err("Failed to lookup EFI memory descriptor for %pa\n", &addr);
return;
}
--- linux-x86.orig/drivers/firmware/efi/efi.c
+++ linux-x86/drivers/firmware/efi/efi.c
@@ -353,6 +353,45 @@ err_put:
subsys_initcall(efisubsys_init);

/*
+ * Find the efi memory descriptor for a given physical address.
+ * Given a physical address, if it exists within an EFI memory map
+ * entry of type EFI_BOOT_SERVICES_DATA and the entry has no attribute
+ * EFI_MEMORY_RUNTIME, then populate the supplied memory descriptor
+ * with the appropriate data.
+ */
+int __init efi_md_lookup_boot_data(u64 phys_addr,
+ efi_memory_desc_t *out_md)
+{
+ efi_memory_desc_t *md;
+
+ if (!efi_enabled(EFI_MEMMAP)) {
+ pr_err_once("EFI_MEMMAP is not enabled.\n");
+ return -EINVAL;
+ }
+
+ if (!out_md) {
+ pr_err_once("out_md is null.\n");
+ return -EINVAL;
+ }
+
+ for_each_efi_memory_desc(md) {
+ u64 size;
+ u64 end;
+
+ if (md->type != EFI_BOOT_SERVICES_DATA)
+ continue;
+
+ size = md->num_pages << EFI_PAGE_SHIFT;
+ end = md->phys_addr + size;
+ if (phys_addr >= md->phys_addr && phys_addr < end) {
+ memcpy(out_md, md, sizeof(*out_md));
+ return 0;
+ }
+ }
+ return -ENOENT;
+}
+
+/*
* Find the efi memory descriptor for a given physical address. Given a
* physical address, determine if it exists within an EFI Memory Map entry,
* and if so, populate the supplied memory descriptor with the appropriate
--- linux-x86.orig/include/linux/efi.h
+++ linux-x86/include/linux/efi.h
@@ -979,6 +979,7 @@ extern u64 efi_mem_attribute (unsigned l
extern int __init efi_uart_console_only (void);
extern u64 efi_mem_desc_end(efi_memory_desc_t *md);
extern int efi_mem_desc_lookup(u64 phys_addr, efi_memory_desc_t *out_md);
+extern int efi_md_lookup_boot_data(u64 phys_addr, efi_memory_desc_t *out_md);
extern void efi_mem_reserve(phys_addr_t addr, u64 size);
extern void efi_initialize_iomem_resources(struct resource *code_resource,
struct resource *data_resource, struct resource *bss_resource);

2017-03-16 12:16:00

by Matt Fleming

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On Thu, 09 Mar, at 12:53:36PM, Ard Biesheuvel wrote:
>
> Hi Omar,
>
> Thanks for tracking this down.
>
> I wonder if this is an unintended side effect of the way we repurpose
> the EFI_MEMORY_RUNTIME attribute in efi_arch_mem_reserve(). AFAIUI,
> splitting memory map entries should only be necessary for regions that
> are not runtime memory regions to begin with, and so whether their
> virtual mapping address makes sense or not should be irrelevant.
>
> Perhaps this only illustrates my lack of understanding of the x86 way
> of doing this, so perhaps Matt can shed some light on this?

Sorry for the delay.

Yes, Ard is correct. It's not necessary to split/reserve memory
regions that already have the EFI_MEMORY_RUNTIME attribute.

2017-03-16 12:42:14

by Matt Fleming

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On Mon, 13 Mar, at 03:37:48PM, Dave Young wrote:
>
> Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
> correct to be used in efi_arch_mem_reserve, if it passed your test, I
> can rewrite patch log with more background and send it out:
>
> for_each_efi_memory_desc(md) {
> [snip]
> if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> md->type != EFI_BOOT_SERVICES_DATA &&
> md->type != EFI_RUNTIME_SERVICES_DATA) {
> continue;
> }
>
> In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
> data or runtime data, this is wrong for efi_mem_reserve, because we are
> reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
> running time. Just is happened to work and we did not capture the error.

Wouldn't something like this be simpler?

---

diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
index 30031d5293c4..cdfe8c628959 100644
--- a/arch/x86/platform/efi/quirks.c
+++ b/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
return;
}

+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);

2017-03-16 17:51:42

by Omar Sandoval

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On Thu, Mar 16, 2017 at 12:41:32PM +0000, Matt Fleming wrote:
> On Mon, 13 Mar, at 03:37:48PM, Dave Young wrote:
> >
> > Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
> > correct to be used in efi_arch_mem_reserve, if it passed your test, I
> > can rewrite patch log with more background and send it out:
> >
> > for_each_efi_memory_desc(md) {
> > [snip]
> > if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > md->type != EFI_BOOT_SERVICES_DATA &&
> > md->type != EFI_RUNTIME_SERVICES_DATA) {
> > continue;
> > }
> >
> > In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
> > data or runtime data, this is wrong for efi_mem_reserve, because we are
> > reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
> > running time. Just is happened to work and we did not capture the error.
>
> Wouldn't something like this be simpler?
>
> ---
>
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 30031d5293c4..cdfe8c628959 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> return;
> }
>
> + /* No need to reserve regions that will never be freed. */
> + if (md.attribute & EFI_MEMORY_RUNTIME)
> + return;
> +
> size += addr % EFI_PAGE_SIZE;
> size = round_up(size, EFI_PAGE_SIZE);
> addr = round_down(addr, EFI_PAGE_SIZE);

This works for me.

Reported-and-tested-by: Omar Sandoval <[email protected]>

2017-03-17 02:20:55

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 03/16/17 at 12:41pm, Matt Fleming wrote:
> On Mon, 13 Mar, at 03:37:48PM, Dave Young wrote:
> >
> > Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
> > correct to be used in efi_arch_mem_reserve, if it passed your test, I
> > can rewrite patch log with more background and send it out:
> >
> > for_each_efi_memory_desc(md) {
> > [snip]
> > if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > md->type != EFI_BOOT_SERVICES_DATA &&
> > md->type != EFI_RUNTIME_SERVICES_DATA) {
> > continue;
> > }
> >
> > In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
> > data or runtime data, this is wrong for efi_mem_reserve, because we are
> > reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
> > running time. Just is happened to work and we did not capture the error.
>
> Wouldn't something like this be simpler?
>
> ---
>
> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 30031d5293c4..cdfe8c628959 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> return;
> }
>
> + /* No need to reserve regions that will never be freed. */
> + if (md.attribute & EFI_MEMORY_RUNTIME)
> + return;
> +

Matt, I think it should be fine although I think the md type checking in
efi_mem_desc_lookup() is causing confusion and not easy to understand..

How about move the if chunk early like below because it seems no need
to sanity check the addr + size any more if the md is still RUNTIME?

--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -196,6 +196,10 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}

+ /* No need to reserve regions that will never be freed. */
+ if (md.attribute & EFI_MEMORY_RUNTIME)
+ return;
+
if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
return;

Thanks
Dave

2017-03-17 13:27:06

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 17 March 2017 at 02:09, Dave Young <[email protected]> wrote:
> On 03/16/17 at 12:41pm, Matt Fleming wrote:
>> On Mon, 13 Mar, at 03:37:48PM, Dave Young wrote:
>> >
>> > Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
>> > correct to be used in efi_arch_mem_reserve, if it passed your test, I
>> > can rewrite patch log with more background and send it out:
>> >
>> > for_each_efi_memory_desc(md) {
>> > [snip]
>> > if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
>> > md->type != EFI_BOOT_SERVICES_DATA &&
>> > md->type != EFI_RUNTIME_SERVICES_DATA) {
>> > continue;
>> > }
>> >
>> > In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
>> > data or runtime data, this is wrong for efi_mem_reserve, because we are
>> > reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
>> > running time. Just is happened to work and we did not capture the error.
>>
>> Wouldn't something like this be simpler?
>>
>> ---
>>
>> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
>> index 30031d5293c4..cdfe8c628959 100644
>> --- a/arch/x86/platform/efi/quirks.c
>> +++ b/arch/x86/platform/efi/quirks.c
>> @@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
>> return;
>> }
>>
>> + /* No need to reserve regions that will never be freed. */
>> + if (md.attribute & EFI_MEMORY_RUNTIME)
>> + return;
>> +
>
> Matt, I think it should be fine although I think the md type checking in
> efi_mem_desc_lookup() is causing confusion and not easy to understand..
>
> How about move the if chunk early like below because it seems no need
> to sanity check the addr + size any more if the md is still RUNTIME?
>
> --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> +++ linux-x86/arch/x86/platform/efi/quirks.c
> @@ -196,6 +196,10 @@ void __init efi_arch_mem_reserve(phys_ad
> return;
> }
>
> + /* No need to reserve regions that will never be freed. */
> + if (md.attribute & EFI_MEMORY_RUNTIME)
> + return;
> +
> if (addr + size > md.phys_addr + (md.num_pages << EFI_PAGE_SHIFT)) {
> pr_err("Region spans EFI memory descriptors, %pa\n", &addr);
> return;
>

This way, we suppress the error it the region spans multiple
descriptors, and only the first one has the runtime attribute. So the
two patches are not equivalent. I don't have a strong preference
either way, though.

2017-03-17 13:32:56

by Matt Fleming

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
>
> Matt, I think it should be fine although I think the md type checking in
> efi_mem_desc_lookup() is causing confusion and not easy to understand..

Could you make that a separate patch if you think of improvements
there?

> How about move the if chunk early like below because it seems no need
> to sanity check the addr + size any more if the md is still RUNTIME?

My original version did as you suggest, but I changed it because we
*really* want to know if someone tries to reserve a range that spans
regions. That would be totally unexpected and a warning about a
potential bug/issue.

2017-03-20 02:15:09

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 03/17/17 at 01:32pm, Matt Fleming wrote:
> On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
> >
> > Matt, I think it should be fine although I think the md type checking in
> > efi_mem_desc_lookup() is causing confusion and not easy to understand..
>
> Could you make that a separate patch if you think of improvements
> there?

Duplicate the lookup function is indeed a little ugly, will do it when I
have a better idea, we can leave it as is since it works.

>
> > How about move the if chunk early like below because it seems no need
> > to sanity check the addr + size any more if the md is still RUNTIME?
>
> My original version did as you suggest, but I changed it because we
> *really* want to know if someone tries to reserve a range that spans
> regions. That would be totally unexpected and a warning about a
> potential bug/issue.

Matt, I'm fine if you prefer to capture the range checking errors.
Would you like me to post it or just you send it out?

Thanks
Dave

2017-03-21 07:49:12

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 03/20/17 at 10:14am, Dave Young wrote:
> On 03/17/17 at 01:32pm, Matt Fleming wrote:
> > On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
> > >
> > > Matt, I think it should be fine although I think the md type checking in
> > > efi_mem_desc_lookup() is causing confusion and not easy to understand..
> >
> > Could you make that a separate patch if you think of improvements
> > there?
>
> Duplicate the lookup function is indeed a little ugly, will do it when I
> have a better idea, we can leave it as is since it works.

Matt, rethinking about this, how about doint something below, not
tested, just seeking for idea and opinons, in this way no need duplicate
a function, but there is an assumption that no overlapped mem ranges in
efi memmap.

Also the case Omar reported is the esrt memory range type is
RUNTIME_DATA, that is a little different with the mem attribute of
RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
attribute, like bgrt in kexec reboot. Should we distinguish the two
cases and give out some warnings or debug info?


---
arch/x86/platform/efi/quirks.c | 5 +++++
drivers/firmware/efi/efi.c | 6 ------
drivers/firmware/efi/esrt.c | 7 +++++++
3 files changed, 12 insertions(+), 6 deletions(-)

--- linux-x86.orig/drivers/firmware/efi/efi.c
+++ linux-x86/drivers/firmware/efi/efi.c
@@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
u64 size;
u64 end;

- if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
- md->type != EFI_BOOT_SERVICES_DATA &&
- md->type != EFI_RUNTIME_SERVICES_DATA) {
- continue;
- }
-
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;
if (phys_addr >= md->phys_addr && phys_addr < end) {
--- linux-x86.orig/drivers/firmware/efi/esrt.c
+++ linux-x86/drivers/firmware/efi/esrt.c
@@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
return;
}

+ if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
+ md->type != EFI_BOOT_SERVICES_DATA &&
+ md->type != EFI_RUNTIME_SERVICES_DATA) {
+ pr_err("ESRT header memory map type is invalid\n");
+ return;
+ }
+
max = efi_mem_desc_end(&md);
if (max < efi.esrt) {
pr_err("EFI memory descriptor is invalid. (esrt: %p max: %p)\n",
--- linux-x86.orig/arch/x86/platform/efi/quirks.c
+++ linux-x86/arch/x86/platform/efi/quirks.c
@@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
return;
}

+ if (md->attribute & EFI_MEMORY_RUNTIME ||
+ md->type != EFI_BOOT_SERVICES_DATA) {
+ return;
+ }
+
size += addr % EFI_PAGE_SIZE;
size = round_up(size, EFI_PAGE_SIZE);
addr = round_down(addr, EFI_PAGE_SIZE);

>
> >
> > > How about move the if chunk early like below because it seems no need
> > > to sanity check the addr + size any more if the md is still RUNTIME?
> >
> > My original version did as you suggest, but I changed it because we
> > *really* want to know if someone tries to reserve a range that spans
> > regions. That would be totally unexpected and a warning about a
> > potential bug/issue.
>
> Matt, I'm fine if you prefer to capture the range checking errors.
> Would you like me to post it or just you send it out?
>
> Thanks
> Dave

Thanks
Dave

2017-03-22 16:11:01

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 21 March 2017 at 07:48, Dave Young <[email protected]> wrote:
> On 03/20/17 at 10:14am, Dave Young wrote:
>> On 03/17/17 at 01:32pm, Matt Fleming wrote:
>> > On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
>> > >
>> > > Matt, I think it should be fine although I think the md type checking in
>> > > efi_mem_desc_lookup() is causing confusion and not easy to understand..
>> >
>> > Could you make that a separate patch if you think of improvements
>> > there?
>>
>> Duplicate the lookup function is indeed a little ugly, will do it when I
>> have a better idea, we can leave it as is since it works.
>
> Matt, rethinking about this, how about doint something below, not
> tested, just seeking for idea and opinons, in this way no need duplicate
> a function, but there is an assumption that no overlapped mem ranges in
> efi memmap.
>
> Also the case Omar reported is the esrt memory range type is
> RUNTIME_DATA, that is a little different with the mem attribute of
> RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
> attribute, like bgrt in kexec reboot. Should we distinguish the two
> cases and give out some warnings or debug info?
>
>
> ---
> arch/x86/platform/efi/quirks.c | 5 +++++
> drivers/firmware/efi/efi.c | 6 ------
> drivers/firmware/efi/esrt.c | 7 +++++++
> 3 files changed, 12 insertions(+), 6 deletions(-)
>
> --- linux-x86.orig/drivers/firmware/efi/efi.c
> +++ linux-x86/drivers/firmware/efi/efi.c
> @@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
> u64 size;
> u64 end;
>
> - if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> - md->type != EFI_BOOT_SERVICES_DATA &&
> - md->type != EFI_RUNTIME_SERVICES_DATA) {
> - continue;
> - }
> -
> size = md->num_pages << EFI_PAGE_SHIFT;
> end = md->phys_addr + size;
> if (phys_addr >= md->phys_addr && phys_addr < end) {
> --- linux-x86.orig/drivers/firmware/efi/esrt.c
> +++ linux-x86/drivers/firmware/efi/esrt.c
> @@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
> return;
> }
>
> + if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> + md->type != EFI_BOOT_SERVICES_DATA &&
> + md->type != EFI_RUNTIME_SERVICES_DATA) {
> + pr_err("ESRT header memory map type is invalid\n");
> + return;
> + }
> +

This looks wrong to me. While the meanings get convoluted in practice,
the EFI_MEMORY_RUNTIME attribute only means that the firmware requests
a virtual mapping for the region. It is perfectly legal for a
EFI_RUNTIME_SERVICES_DATA region not to have the EFI_MEMORY_RUNTIME
attribute, if the region is never accessed by the runtime services
themselves, and this is not entirely unlikely for tables that the
firmware exposes to the OS

> max = efi_mem_desc_end(&md);
> if (max < efi.esrt) {
> pr_err("EFI memory descriptor is invalid. (esrt: %p max: %p)\n",
> --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> +++ linux-x86/arch/x86/platform/efi/quirks.c
> @@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
> return;
> }
>
> + if (md->attribute & EFI_MEMORY_RUNTIME ||
> + md->type != EFI_BOOT_SERVICES_DATA) {
> + return;
> + }
> +
> size += addr % EFI_PAGE_SIZE;
> size = round_up(size, EFI_PAGE_SIZE);
> addr = round_down(addr, EFI_PAGE_SIZE);
>
>>
>> >
>> > > How about move the if chunk early like below because it seems no need
>> > > to sanity check the addr + size any more if the md is still RUNTIME?
>> >
>> > My original version did as you suggest, but I changed it because we
>> > *really* want to know if someone tries to reserve a range that spans
>> > regions. That would be totally unexpected and a warning about a
>> > potential bug/issue.
>>
>> Matt, I'm fine if you prefer to capture the range checking errors.
>> Would you like me to post it or just you send it out?
>>
>> Thanks
>> Dave
>
> Thanks
> Dave
> --
> To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2017-03-23 02:43:46

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 03/22/17 at 04:10pm, Ard Biesheuvel wrote:
> On 21 March 2017 at 07:48, Dave Young <[email protected]> wrote:
> > On 03/20/17 at 10:14am, Dave Young wrote:
> >> On 03/17/17 at 01:32pm, Matt Fleming wrote:
> >> > On Fri, 17 Mar, at 10:09:51AM, Dave Young wrote:
> >> > >
> >> > > Matt, I think it should be fine although I think the md type checking in
> >> > > efi_mem_desc_lookup() is causing confusion and not easy to understand..
> >> >
> >> > Could you make that a separate patch if you think of improvements
> >> > there?
> >>
> >> Duplicate the lookup function is indeed a little ugly, will do it when I
> >> have a better idea, we can leave it as is since it works.
> >
> > Matt, rethinking about this, how about doint something below, not
> > tested, just seeking for idea and opinons, in this way no need duplicate
> > a function, but there is an assumption that no overlapped mem ranges in
> > efi memmap.
> >
> > Also the case Omar reported is the esrt memory range type is
> > RUNTIME_DATA, that is a little different with the mem attribute of
> > RUNTIME which also includes BOOT_DATA which has been set the RUNTIME
> > attribute, like bgrt in kexec reboot. Should we distinguish the two
> > cases and give out some warnings or debug info?
> >
> >
> > ---
> > arch/x86/platform/efi/quirks.c | 5 +++++
> > drivers/firmware/efi/efi.c | 6 ------
> > drivers/firmware/efi/esrt.c | 7 +++++++
> > 3 files changed, 12 insertions(+), 6 deletions(-)
> >
> > --- linux-x86.orig/drivers/firmware/efi/efi.c
> > +++ linux-x86/drivers/firmware/efi/efi.c
> > @@ -376,12 +376,6 @@ int __init efi_mem_desc_lookup(u64 phys_
> > u64 size;
> > u64 end;
> >
> > - if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > - md->type != EFI_BOOT_SERVICES_DATA &&
> > - md->type != EFI_RUNTIME_SERVICES_DATA) {
> > - continue;
> > - }
> > -
> > size = md->num_pages << EFI_PAGE_SHIFT;
> > end = md->phys_addr + size;
> > if (phys_addr >= md->phys_addr && phys_addr < end) {
> > --- linux-x86.orig/drivers/firmware/efi/esrt.c
> > +++ linux-x86/drivers/firmware/efi/esrt.c
> > @@ -258,6 +258,13 @@ void __init efi_esrt_init(void)
> > return;
> > }
> >
> > + if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > + md->type != EFI_BOOT_SERVICES_DATA &&
> > + md->type != EFI_RUNTIME_SERVICES_DATA) {
> > + pr_err("ESRT header memory map type is invalid\n");
> > + return;
> > + }
> > +
>
> This looks wrong to me. While the meanings get convoluted in practice,
> the EFI_MEMORY_RUNTIME attribute only means that the firmware requests
> a virtual mapping for the region. It is perfectly legal for a
> EFI_RUNTIME_SERVICES_DATA region not to have the EFI_MEMORY_RUNTIME
> attribute, if the region is never accessed by the runtime services
> themselves, and this is not entirely unlikely for tables that the
> firmware exposes to the OS

Thanks for the comment, if so "!(md->attribute & EFI_MEMORY_RUNTIME) &&"
should be dropped.

BTW, md->type should be md.type, bgrt reserving works fine with this
change but I have no esrt machine to test it. I would like to wait for
Matt's opinions about this first before an update.

Also cc Peter about the esrt piece.
>
> > max = efi_mem_desc_end(&md);
> > if (max < efi.esrt) {
> > pr_err("EFI memory descriptor is invalid. (esrt: %p max: %p)\n",
> > --- linux-x86.orig/arch/x86/platform/efi/quirks.c
> > +++ linux-x86/arch/x86/platform/efi/quirks.c
> > @@ -201,6 +201,11 @@ void __init efi_arch_mem_reserve(phys_ad
> > return;
> > }
> >
> > + if (md->attribute & EFI_MEMORY_RUNTIME ||
> > + md->type != EFI_BOOT_SERVICES_DATA) {
> > + return;
> > + }
> > +
> > size += addr % EFI_PAGE_SIZE;
> > size = round_up(size, EFI_PAGE_SIZE);
> > addr = round_down(addr, EFI_PAGE_SIZE);
> >
> >>
> >> >
> >> > > How about move the if chunk early like below because it seems no need
> >> > > to sanity check the addr + size any more if the md is still RUNTIME?
> >> >
> >> > My original version did as you suggest, but I changed it because we
> >> > *really* want to know if someone tries to reserve a range that spans
> >> > regions. That would be totally unexpected and a warning about a
> >> > potential bug/issue.
> >>
> >> Matt, I'm fine if you prefer to capture the range checking errors.
> >> Would you like me to post it or just you send it out?
> >>
> >> Thanks
> >> Dave
> >
> > Thanks
> > Dave
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-efi" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html

2017-04-03 23:55:32

by Omar Sandoval

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On Thu, Mar 16, 2017 at 10:50:48AM -0700, Omar Sandoval wrote:
> On Thu, Mar 16, 2017 at 12:41:32PM +0000, Matt Fleming wrote:
> > On Mon, 13 Mar, at 03:37:48PM, Dave Young wrote:
> > >
> > > Omar, could you try below patch? Looking at the efi_mem_desc_lookup, it is not
> > > correct to be used in efi_arch_mem_reserve, if it passed your test, I
> > > can rewrite patch log with more background and send it out:
> > >
> > > for_each_efi_memory_desc(md) {
> > > [snip]
> > > if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
> > > md->type != EFI_BOOT_SERVICES_DATA &&
> > > md->type != EFI_RUNTIME_SERVICES_DATA) {
> > > continue;
> > > }
> > >
> > > In above code, it meant to get a md of EFI_MEMORY_RUNTIME of either boot
> > > data or runtime data, this is wrong for efi_mem_reserve, because we are
> > > reserving boot data which has no EFI_MEMORY_RUNTIME attribute at the
> > > running time. Just is happened to work and we did not capture the error.
> >
> > Wouldn't something like this be simpler?
> >
> > ---
> >
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 30031d5293c4..cdfe8c628959 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -201,6 +201,10 @@ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
> > return;
> > }
> >
> > + /* No need to reserve regions that will never be freed. */
> > + if (md.attribute & EFI_MEMORY_RUNTIME)
> > + return;
> > +
> > size += addr % EFI_PAGE_SIZE;
> > size = round_up(size, EFI_PAGE_SIZE);
> > addr = round_down(addr, EFI_PAGE_SIZE);
>
> This works for me.
>
> Reported-and-tested-by: Omar Sandoval <[email protected]>

Is this going to go in for 4.11?

2017-04-04 13:37:24

by Matt Fleming

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On Mon, 20 Mar, at 10:14:12AM, Dave Young wrote:
>
> Matt, I'm fine if you prefer to capture the range checking errors.
> Would you like me to post it or just you send it out?

Can you please send out the patch with the minimal change to
efi_arch_mem_reserve() and we'll get it into urgent ASAP.

2017-04-05 01:23:52

by Dave Young

[permalink] [raw]
Subject: Re: kexec regression since 4.9 caused by efi

On 04/04/17 at 02:37pm, Matt Fleming wrote:
> On Mon, 20 Mar, at 10:14:12AM, Dave Young wrote:
> >
> > Matt, I'm fine if you prefer to capture the range checking errors.
> > Would you like me to post it or just you send it out?
>
> Can you please send out the patch with the minimal change to
> efi_arch_mem_reserve() and we'll get it into urgent ASAP.

Omar has sent it out, for the lookup function issue I think I can do it
after this one later.

Thanks
Dave