The SME workarea used for in-place encryption of the kernel during
early boot is situated on a 2MB boundary after the end of the kernel
text, data, etc. sections (_end). This works well during an initial
boot of a compressed kernel because of the relocation used when
decompressing the kernel. But when performing a kexec boot, there is
a chance that the SME workarea will not be mapped by the kexec
pagetables, or that some of the other data used by kexec will reside
in this range.
Create a section for SME in vmlinux.lds.S. Position it after "_end"
so that the memory will be reclaimed during boot and, since it is all
zeroes, it compresses well. Since this new section will be part of the
kernel, kexec will account for it in pagetable mappings and placement of
data after the kernel.
Here's an example of the kernel sizes without and with the SME section:
without:
vmlinux: 36,501,616
bzImage: 6,497,344
100000000-47f37ffff : System RAM
1e4000000-1e47677d4 : Kernel code (0x7677d4)
1e47677d5-1e4e2e0bf : Kernel data (0x6c68ea)
1e5074000-1e5372fff : Kernel bss (0x2fefff)
with:
vmlinux: 44,419,408
bzImage: 6,503,136
880000000-c7ff7ffff : System RAM
8cf000000-8cf7677d4 : Kernel code (0x7677d4)
8cf7677d5-8cfe2e0bf : Kernel data (0x6c68ea)
8d0074000-8d0372fff : Kernel bss (0x2fefff)
Cc: Baoquan He <[email protected]>
Cc: Lianbo Jiang <[email protected]>
Signed-off-by: Tom Lendacky <[email protected]>
---
arch/x86/kernel/vmlinux.lds.S | 16 ++++++++++++++++
arch/x86/mm/mem_encrypt_identity.c | 22 ++++++++++++++++++++--
2 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 0850b5149345..8c4377983e54 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -379,6 +379,22 @@ SECTIONS
. = ALIGN(PAGE_SIZE); /* keep VO_INIT_SIZE page aligned */
_end = .;
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ /*
+ * SME workarea section: Lives outside of the kernel proper
+ * (_text - _end) for performing in-place encryption. Resides
+ * on a 2MB boundary to simplify the pagetable setup used for
+ * the encryption.
+ */
+ . = ALIGN(HPAGE_SIZE);
+ .sme : AT(ADDR(.sme) - LOAD_OFFSET) {
+ __sme_begin = .;
+ *(.sme)
+ . = ALIGN(HPAGE_SIZE);
+ __sme_end = .;
+ }
+#endif
+
STABS_DEBUG
DWARF_DEBUG
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 4aa9b1480866..c55c2ec8fb12 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -73,6 +73,19 @@ struct sme_populate_pgd_data {
unsigned long vaddr_end;
};
+/*
+ * This work area lives in the .sme section, which lives outside of
+ * the kernel proper. It is sized to hold the intermediate copy buffer
+ * and more than enough pagetable pages.
+ *
+ * By using this section, the kernel can be encrypted in place and we
+ * avoid any possibility of boot parameters or initramfs images being
+ * placed such that the in-place encryption logic overwrites them. This
+ * section is 2MB aligned to allow for simple pagetable setup using only
+ * PMD entries (see vmlinux.lds.S).
+ */
+static char sme_workarea[2 * PMD_PAGE_SIZE] __section(.sme);
+
static char sme_cmdline_arg[] __initdata = "mem_encrypt";
static char sme_cmdline_on[] __initdata = "on";
static char sme_cmdline_off[] __initdata = "off";
@@ -314,8 +327,13 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
}
#endif
- /* Set the encryption workarea to be immediately after the kernel */
- workarea_start = kernel_end;
+ /*
+ * We're running identity mapped, so we must obtain the address to the
+ * SME encryption workarea using rip-relative addressing.
+ */
+ asm ("lea sme_workarea(%%rip), %0"
+ : "=r" (workarea_start)
+ : "p" (sme_workarea));
/*
* Calculate required number of workarea bytes needed:
--
2.17.1
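For background on the last hunk: the code runs identity mapped at that
point, so a normal reference to sme_workarea would resolve to the
kernel virtual address rather than the identity-mapped address the
code is actually executing at. A RIP-relative LEA yields the run-time
address instead. A minimal standalone sketch of the technique
(illustrative only, not the kernel code itself):

	/*
	 * Sketch: obtain the run-time address of a symbol when the
	 * code may execute at an address other than its link-time
	 * address. The "p" input operand keeps the symbol referenced;
	 * it mirrors the constraint used in the patch above.
	 */
	static char workarea[4096];

	static unsigned long workarea_runtime_addr(void)
	{
		unsigned long addr;

		asm ("lea workarea(%%rip), %0"
		     : "=r" (addr)
		     : "p" (workarea));

		return addr;
	}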
On 6/12/19 6:32 AM, Lendacky, Thomas wrote:
> Create a section for SME in vmlinux.lds.S. Position it after "_end"
> so that the memory will be reclaimed during boot and, since it is all
> zeroes, it compresses well.
I don't think I realized that things after _end get reclaimed. Do we do
that at the same spot that we do init data or somewhere else?
How much does compressing 8MB of zeros slow down my kernel compile? ;)
On 6/12/19 10:00 AM, Dave Hansen wrote:
> On 6/12/19 6:32 AM, Lendacky, Thomas wrote:
>> Create a section for SME in vmlinux.lds.S. Position it after "_end"
>> so that the memory will be reclaimed during boot and, since it is all
>> zeroes, it compresses well.
>
> I don't think I realized that things after _end get reclaimed. Do we do
> that at the same spot that we do init data or somewhere else?
I was looking at the start of setup_arch() in arch/x86/kernel/setup.c,
where there's a memblock_reserve() done for the kernel (it reserves from
_text to __bss_stop, not all the way to _end, and later the brk area
is reserved). At that point, my take was that the memory outside the
reserved area is now available (and there's a comment below that to that
effect, also), so the .sme section would basically be discarded and
re-claimed for general page usage.
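The reservation in question looks roughly like this (paraphrased from
arch/x86/kernel/setup.c; the exact code varies by kernel version):

	/*
	 * Reserve the kernel image from _text to __bss_stop. Memory
	 * after __bss_stop, such as .brk (reserved separately, later)
	 * and the new .sme section (not reserved at all), falls
	 * outside this range.
	 */
	memblock_reserve(__pa_symbol(_text),
			 (unsigned long)__bss_stop - (unsigned long)_text);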
>
> How much does compressing 8MB of zeros slow down my kernel compile? ;)
Heh, I didn't measure that. :)
Thanks,
Tom
On 06/12/19 at 01:32pm, Lendacky, Thomas wrote:
> The SME workarea used for in-place encryption of the kernel during
> early boot is situated on a 2MB boundary after the end of the kernel
> text, data, etc. sections (_end).
> [...]
This patch looks good to me.
It fixes an issue that breaks kexec/kdump kernel booting, and it can
always be reproduced when 'nokaslr' is added to the kexec/kdump kernel
cmdline. We usually suggest that users add 'nokaslr' to the kdump
kernel cmdline, since KASLR doesn't make any sense for a kdump kernel,
and disabling it avoids unnecessary checking if the kdump kernel
malfunctions. So this is an important fix for SME support on
kexec/kdump.
Thanks for the quick response and for making this patch, Tom.
Reviewed-by: Baoquan He <[email protected]>
Thanks
Baoquan
Hi,
After applying this patch, I tested it on the following machines. It works well.
[a] HPE ProLiant DL385 Gen10 (878612-B21) AMD EPYC 7251 8-Core Processor
[b] Dell PowerEdge R7425 AMD EPYC 7401 24-Core Processor
[c] AMD Speedway AMD Eng Sample: 2S1405A3VIHF4_28/14_N
[d] AMD Ethanol X AMD Eng Sample: 2S1402E2VJUG2_20/14_N
On machine [a], it works well in the crashkernel=160M case, but
CONFIG_DEBUG_INFO needs to be removed from .config, otherwise an OOM
can occur. It also works well in all crashkernel >= 256M cases (no
need to remove the above option from .config).
On machines [b], [c] and [d], it also works fine in the
crashkernel=256M case, again with the debug info removed to avoid the
OOM. It also works well in all crashkernel >= 320M cases.
Thanks.
Tested-by: Lianbo Jiang <[email protected]>
On 6/12/19 10:46 AM, Lendacky, Thomas wrote:
> On 6/12/19 10:00 AM, Dave Hansen wrote:
>> On 6/12/19 6:32 AM, Lendacky, Thomas wrote:
>>> Create a section for SME in vmlinux.lds.S. Position it after "_end"
>>> so that the memory will be reclaimed during boot and, since it is all
>>> zeroes, it compresses well.
>>
>> I don't think I realized that things after _end get reclaimed. Do we do
>> that at the same spot that we do init data or somewhere else?
>
> I was looking at the start of setup_arch() in arch/x86/kernel/setup.c,
> where there's a memblock_reserve() done for the kernel (it reserves from
> _text to __bss_stop, not all the way to _end, and later the brk area
> is reserved). At that point, my take was that the memory outside the
> reserved area is now available (and there's a comment below that to that
> effect, also), so the .sme section would basically be discarded and
> re-claimed for general page usage.
This seems awfully subtle. This would be the only section treated this
way because, as you note, even the '.brk' area ends up getting
memblock_reserve()'d. Also, this odd property is not commented on at all.
That's not the end of the world. But, if we're going to do this, it
seems like we need to move the:
/* Sections to be discarded */
comment up above your new area. It also seems like we need something
explicit in there near __bss_stop saying:
/*
* Everything between _text and here is automatically reserved
* in setup_arch(). Everything after here must either have its
* own memblock_reserve(), or it will be treated as available
* memory and freed at boot.
*/
Actually, I wonder if we should add a symbol called
'__end_of_kernel_reserve' and use *that* instead of __bss_stop in
setup_arch().
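Something like this, as a rough sketch (the symbol name and placement
are just a suggestion):

	/*
	 * Hypothetical: in vmlinux.lds.S, add a marker right after
	 * __bss_stop:
	 *
	 *	__end_of_kernel_reserve = .;
	 *
	 * and have setup_arch() reserve up to it instead:
	 */
	memblock_reserve(__pa_symbol(_text),
			 (unsigned long)__end_of_kernel_reserve -
			 (unsigned long)_text);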
After I say all that... Why can't you just stick your data in a normal,
vanilla __init variable? Wouldn't that be a lot less subtle?
On 6/13/19 12:47 PM, Dave Hansen wrote:
> On 6/12/19 10:46 AM, Lendacky, Thomas wrote:
>> On 6/12/19 10:00 AM, Dave Hansen wrote:
>>> On 6/12/19 6:32 AM, Lendacky, Thomas wrote:
>>>> Create a section for SME in vmlinux.lds.S. Position it after "_end"
>>>> so that the memory will be reclaimed during boot and, since it is all
>>>> zeroes, it compresses well.
>>>
>>> I don't think I realized that things after _end get reclaimed. Do we do
>>> that at the same spot that we do init data or somewhere else?
>>
>> I was looking at the start of setup_arch() in arch/x86/kernel/setup.c,
>> where there's a memblock_reserve() done for the kernel (it reserves from
>> _text to __bss_stop, not all the way to _end, and later the brk area
>> is reserved). At that point, my take was that the memory outside the
>> reserved area is now available (and there's a comment below that to that
>> effect, also), so the .sme section would basically be discarded and
>> re-claimed for general page usage.
>
> This seems awfully subtle. This would be the only section treated this
> way because, as you note, even the '.brk' area ends up getting
> memblock_reserve()'d. Also, this odd property is not commented on at all.
>
> That's not the end of the world. But, if we're going to do this, it
> seems like we need to move the:
>
> /* Sections to be discarded */
>
> comment up above your new area. It also seems like we need something
> explicit in there near __bss_stop saying:
>
> /*
> * Everything between _text and here is automatically reserved
> * in setup_arch(). Everything after here must either have its
> * own memblock_reserve(), or it will be treated as available
> * memory and freed at boot.
> */
That all makes sense.
>
> Actually, I wonder if we should add a symbol called
> '__end_of_kernel_reserve' and use *that* instead of __bss_stop in
> setup_arch().
If everyone thinks that's best, I can do that. Probably best to do it
as a pre-patch and make this into a 2-patch series, then.
>
> After I say all that... Why can't you just stick your data in a normal,
> vanilla __init variable? Wouldn't that be a lot less subtle?
The area needs to be outside of the kernel proper as the kernel is
encrypted "in place." So an __init variable won't work here.
Thanks,
Tom
On 6/13/19 10:59 AM, Lendacky, Thomas wrote:
>> After I say all that... Why can't you just stick your data in a normal,
>> vanilla __init variable? Wouldn't that be a lot less subtle?
> The area needs to be outside of the kernel proper as the kernel is
> encrypted "in place." So an __init variable won't work here.
Ahh, that makes sense. Also sounds like good changelog fodder.
FWIW, you *could* use an __init area, but I think you'd have to work
around it in sme_encrypt_kernel(), right? Basically in the
kernel_start/end logic you'd need to skip over it. That's probably more
fragile than what you have here, though.
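Hypothetically, that workaround might look something like the
following (sme_encrypt_range() and the hole boundaries are made up
purely for illustration):

	/*
	 * Illustrative only: if the workarea were __init data inside
	 * the kernel image, the in-place encryption would have to be
	 * split around that hole instead of covering a single range.
	 */
	sme_encrypt_range(kernel_start, workarea_start);
	sme_encrypt_range(workarea_end, kernel_end);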
On 6/13/19 1:06 PM, Dave Hansen wrote:
> On 6/13/19 10:59 AM, Lendacky, Thomas wrote:
>>> After I say all that... Why can't you just stick your data in a normal,
>>> vanilla __init variable? Wouldn't that be a lot less subtle?
>> The area needs to be outside of the kernel proper as the kernel is
>> encrypted "in place." So an __init variable won't work here.
>
> Ahh, that makes sense. Also sounds like good changelog fodder.
>
> FWIW, you *could* use an __init area, but I think you'd have to work
> around it in sme_encrypt_kernel(), right? Basically in the
> kernel_start/end logic you'd need to skip over it. That's probably more
> fragile than what you have here, though.
Yes, I think having the workarea outside the kernel is best.
I'll send a V2 with the pre-patch and suggested changes.
Thanks,
Tom