2020-10-25 08:22:42

by Arvind Sankar

[permalink] [raw]
Subject: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

Hi, I think the definition of BOOT_PGT_SIZE in
arch/x86/include/asm/boot.h is insufficient, especially after
ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")

Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.

- The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
debug messages, but warnings/errors are always output to VGA memory,
so the two extra pages for mapping video RAM are always needed.

- The calculation wasn't updated for X86_5LEVEL, which requires at least
one more page for the P4D level, and in theory could require two extra
pages for each of the 4 mappings (compressed kernel, output kernel,
boot_params and command line), though that would require a system with
truly ginormous amounts of RAM.

- If KASLR is disabled, there are only 6 pages, but now that we're
always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
two PUD and two PMD pages for kernel, boot_params and command line),
and 2 more pages for the video RAM, and more for 5-level. Even for
!RELOCATABLE, 13 pages might be needed.

- SEV-ES needs one more page because it needs to do a PTE-level mapping
for the GHCB page.

- The static calculation is also busted because
boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
configuration tables and the EFI memmap, and none of these are
accounted for. They used to be scanned while still on the
firmware/bootloader page tables, but now our page tables have to cover
them as well. Trying to add up the worst case for all of these, and
anything else the compressed kernel might potentially access seems
like a lost cause.

We could do something similar to what the main kernel does with
early_dynamic_pgts: map the compressed kernel at a fixed virtual
address (in negative address space, say); recycle all the other mappings
until we're done with decompression, and then map the output,
boot_params and command line. The number of pages needed for this can be
statically calculated, for 4-level paging we'd need 2 pages for the
fixed mapping, 12 pages for the other three, and one PGD page.

Thoughts?


2020-10-25 09:10:50

by Arvind Sankar

[permalink] [raw]
Subject: Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

On Sat, Oct 24, 2020 at 08:41:58PM -0400, Arvind Sankar wrote:
> Hi, I think the definition of BOOT_PGT_SIZE in
> arch/x86/include/asm/boot.h is insufficient, especially after
> ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
>
> Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
> 19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.
>
> - The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
> debug messages, but warnings/errors are always output to VGA memory,
> so the two extra pages for mapping video RAM are always needed.
>
> - The calculation wasn't updated for X86_5LEVEL, which requires at least
> one more page for the P4D level, and in theory could require two extra
> pages for each of the 4 mappings (compressed kernel, output kernel,
> boot_params and command line), though that would require a system with
> truly ginormous amounts of RAM.
>
> - If KASLR is disabled, there are only 6 pages, but now that we're
> always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
> two PUD and two PMD pages for kernel, boot_params and command line),
> and 2 more pages for the video RAM, and more for 5-level. Even for
> !RELOCATABLE, 13 pages might be needed.
>
> - SEV-ES needs one more page because it needs to do a PTE-level mapping
> for the GHCB page.
>
> - The static calculation is also busted because
> boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
> configuration tables and the EFI memmap, and none of these are
> accounted for. They used to be scanned while still on the
> firmware/bootloader page tables, but now our page tables have to cover
> them as well. Trying to add up the worst case for all of these, and
> anything else the compressed kernel might potentially access seems
> like a lost cause.
>
> We could do something similar to what the main kernel does with
> early_dynamic_pgts: map the compressed kernel at a fixed virtual
> address (in negative address space, say); recycle all the other mappings
> until we're done with decompression, and then map the output,
> boot_params and command line. The number of pages needed for this can be
> statically calculated, for 4-level paging we'd need 2 pages for the
> fixed mapping, 12 pages for the other three, and one PGD page.
>
> Thoughts?

Or just bump BOOT_PGT_SIZE to some largeish number?

2020-10-27 14:53:32

by Arvind Sankar

[permalink] [raw]
Subject: Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

On Tue, Oct 27, 2020 at 03:40:07PM +0300, Kirill A. Shutemov wrote:
> On Sat, Oct 24, 2020 at 08:41:58PM -0400, Arvind Sankar wrote:
> > Hi, I think the definition of BOOT_PGT_SIZE in
> > arch/x86/include/asm/boot.h is insufficient, especially after
> > ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
> >
> > Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
> > 19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.
> >
> > - The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
> > debug messages, but warnings/errors are always output to VGA memory,
> > so the two extra pages for mapping video RAM are always needed.
> >
> > - The calculation wasn't updated for X86_5LEVEL, which requires at least
> > one more page for the P4D level, and in theory could require two extra
> > pages for each of the 4 mappings (compressed kernel, output kernel,
> > boot_params and command line), though that would require a system with
> > truly ginormous amounts of RAM.
>
> Or sparse physical memory map. I hacked QEMU before for testing 5-level
> paging:
>
> https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
>
> > - If KASLR is disabled, there are only 6 pages, but now that we're
> > always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
> > two PUD and two PMD pages for kernel, boot_params and command line),
> > and 2 more pages for the video RAM, and more for 5-level. Even for
> > !RELOCATABLE, 13 pages might be needed.
>
> The comment for BOOT_PGT_SIZE has to be updated.
>
> BTW, what happens if we underestimate BOOT_PGT_SIZE? Do we overwrite
> something?

No, it checks whether it ran out of pages, so it will just error out and
hang.

>
> > - SEV-ES needs one more page because it needs to do a PTE-level mapping
> > for the GHCB page.
> >
> > - The static calculation is also busted because
> > boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
> > configuration tables and the EFI memmap, and none of these are
> > accounted for. They used to be scanned while still on the
> > firmware/bootloader page tables, but now our page tables have to cover
> > them as well. Trying to add up the worst case for all of these, and
> > anything else the compressed kernel might potentially access seems
> > like a lost cause.
> >
> > We could do something similar to what the main kernel does with
> > early_dynamic_pgts: map the compressed kernel at a fixed virtual
> > address (in negative address space, say); recycle all the other mappings
> > until we're done with decompression, and then map the output,
> > boot_params and command line. The number of pages needed for this can be
> > statically calculated, for 4-level paging we'd need 2 pages for the
> > fixed mapping, 12 pages for the other three, and one PGD page.
>
> Recycling idea look promising to me, but it would require handling #PF in
> decompression code, right? It is considerable complication of the code.
>

The #PF handler is already there now with the SEV-ES series, but I agree
it would still complicate things. It's simpler to just increase
BOOT_PGT_SIZE and make it unconditional (i.e. bump it to say 32 or 64
even if !KASLR). It's @nobits anyway so it would not increase the size
of the bzImage, just require a slightly larger memory allocation by the
bootloader.

Another alternative is reusing the KASLR code, which contains a memory
allocator, and use it to find system memory for the page tables, but
that also seems like an over-engineered approach.

2020-10-27 16:58:05

by Joerg Roedel

[permalink] [raw]
Subject: Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

On Tue, Oct 27, 2020 at 03:40:07PM +0300, Kirill A. Shutemov wrote:
> BTW, what happens if we underestimate BOOT_PGT_SIZE? Do we overwrite
> something?

The boot code will print an error and stop the machine when allocating a
page-table page fails.

I also think that bumping BOOT_PGT_SIZE up to have more pages available
is a good short-term solution. Recycling pages will also need to take
page encryption attributes into account, as for SEV-ES the GHCB page
needs to be mapped unencrypted.

Another option to safe some memory is to make use of GB pages in the
decompression code. Machines where the current BOOT_PGT_SIZE is too
small will likely support GB pages too.

Regards,

Joerg

2020-10-27 18:12:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

On Tue, Oct 27, 2020 at 09:16:17AM -0400, Arvind Sankar wrote:
> The #PF handler is already there now with the SEV-ES series, but I agree
> it would still complicate things. It's simpler to just increase
> BOOT_PGT_SIZE and make it unconditional (i.e. bump it to say 32 or 64
> even if !KASLR). It's @nobits anyway so it would not increase the size
> of the bzImage, just require a slightly larger memory allocation by the
> bootloader.

I guess it's fine. But I'm worried we leave the picture in the same
fragile state. We still rely on the magic number.

--
Kirill A. Shutemov

2020-10-28 07:34:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

On Sat, Oct 24, 2020 at 08:41:58PM -0400, Arvind Sankar wrote:
> Hi, I think the definition of BOOT_PGT_SIZE in
> arch/x86/include/asm/boot.h is insufficient, especially after
> ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
>
> Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
> 19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.
>
> - The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
> debug messages, but warnings/errors are always output to VGA memory,
> so the two extra pages for mapping video RAM are always needed.
>
> - The calculation wasn't updated for X86_5LEVEL, which requires at least
> one more page for the P4D level, and in theory could require two extra
> pages for each of the 4 mappings (compressed kernel, output kernel,
> boot_params and command line), though that would require a system with
> truly ginormous amounts of RAM.

Or sparse physical memory map. I hacked QEMU before for testing 5-level
paging:

https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d

> - If KASLR is disabled, there are only 6 pages, but now that we're
> always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
> two PUD and two PMD pages for kernel, boot_params and command line),
> and 2 more pages for the video RAM, and more for 5-level. Even for
> !RELOCATABLE, 13 pages might be needed.

The comment for BOOT_PGT_SIZE has to be updated.

BTW, what happens if we underestimate BOOT_PGT_SIZE? Do we overwrite
something?

> - SEV-ES needs one more page because it needs to do a PTE-level mapping
> for the GHCB page.
>
> - The static calculation is also busted because
> boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
> configuration tables and the EFI memmap, and none of these are
> accounted for. They used to be scanned while still on the
> firmware/bootloader page tables, but now our page tables have to cover
> them as well. Trying to add up the worst case for all of these, and
> anything else the compressed kernel might potentially access seems
> like a lost cause.
>
> We could do something similar to what the main kernel does with
> early_dynamic_pgts: map the compressed kernel at a fixed virtual
> address (in negative address space, say); recycle all the other mappings
> until we're done with decompression, and then map the output,
> boot_params and command line. The number of pages needed for this can be
> statically calculated, for 4-level paging we'd need 2 pages for the
> fixed mapping, 12 pages for the other three, and one PGD page.

Recycling idea look promising to me, but it would require handling #PF in
decompression code, right? It is considerable complication of the code.

--
Kirill A. Shutemov