2022-12-28 15:11:40

by Jason A. Donenfeld

[permalink] [raw]
Subject: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

The setup_data links are appended to the compressed kernel image. Since
the kernel image is typically loaded at 0x100000, setup_data lives at
`0x100000 + compressed_size`, which does not get relocated during the
kernel's boot process.

The kernel typically decompresses the image starting at address
0x1000000 (note: there's one more zero there than the decompressed image
above). This usually is fine for most kernels.

However, if the compressed image is actually quite large, then
setup_data will live at a `0x100000 + compressed_size` that extends into
the decompressed zone at 0x1000000. In other words, if compressed_size
is larger than `0x1000000 - 0x100000`, then the decompression step will
clobber setup_data, resulting in crashes.

Fix this by detecting that possibility, and if it occurs, put setup_data
*after* the end of the decompressed kernel, so that it doesn't get
clobbered.

One caveat is that this only works for images less than around 64
megabytes, so just bail out in that case. This is unfortunate, but I
don't currently have a way of fixing it.

Cc: [email protected]
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
hw/i386/x86.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 78cc131926..628fd2b2e9 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1077,6 +1077,36 @@ void x86_load_linux(X86MachineState *x86ms,
}
fclose(f);

+ /* If a setup_data is going to be used, make sure that the bootloader won't
+ * decompress into it and clobber those bytes. */
+ if (dtb_filename || !legacy_no_rng_seed) {
+ uint32_t payload_offset = ldl_p(setup + 0x248);
+ uint32_t payload_length = ldl_p(setup + 0x24c);
+ uint32_t target_address = ldl_p(setup + 0x258);
+ uint32_t decompressed_length = ldl_p(kernel + payload_offset + payload_length - 4);
+
+ uint32_t estimated_setup_data_length = 4096 * 16;
+ uint32_t start_setup_data = prot_addr + kernel_size;
+ uint32_t end_setup_data = start_setup_data + estimated_setup_data_length;
+ uint32_t start_target = target_address;
+ uint32_t end_target = target_address + decompressed_length;
+
+ if ((start_setup_data >= start_target && start_setup_data < end_target) ||
+ (end_setup_data >= start_target && end_setup_data < end_target)) {
+ uint32_t padded_size = target_address + decompressed_length - prot_addr;
+
+ /* The early stage can't address past around 64 MB from the original
+ * mapping, so just give up in that case. */
+ if (padded_size < 62 * 1024 * 1024)
+ kernel_size = padded_size;
+ else {
+ fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");
+ dtb_filename = NULL;
+ legacy_no_rng_seed = true;
+ }
+ }
+ }
+
/* append dtb to kernel */
if (dtb_filename) {
if (protocol < 0x209) {
--
2.39.0


2022-12-28 16:44:49

by Philippe Mathieu-Daudé

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

Hi Jason,

On 28/12/22 15:38, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
>
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the decompressed image
> above). This usually is fine for most kernels.
>
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
>
> Fix this by detecting that possibility, and if it occurs, put setup_data
> *after* the end of the decompressed kernel, so that it doesn't get
> clobbered.
>
> One caveat is that this only works for images less than around 64
> megabytes, so just bail out in that case. This is unfortunate, but I
> don't currently have a way of fixing it.
>
> Cc: [email protected]
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> hw/i386/x86.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 78cc131926..628fd2b2e9 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -1077,6 +1077,36 @@ void x86_load_linux(X86MachineState *x86ms,
> }
> fclose(f);
>
> + /* If a setup_data is going to be used, make sure that the bootloader won't
> + * decompress into it and clobber those bytes. */
> + if (dtb_filename || !legacy_no_rng_seed) {
> + uint32_t payload_offset = ldl_p(setup + 0x248);
> + uint32_t payload_length = ldl_p(setup + 0x24c);
> + uint32_t target_address = ldl_p(setup + 0x258);

Nitpicking, can the Linux kernel add these magic values in
arch/x86/include/uapi/asm/bootparam.h? Or can we use
offsetof(setup_header) to get them?

> + uint32_t decompressed_length = ldl_p(kernel + payload_offset + payload_length - 4);
> +
> + uint32_t estimated_setup_data_length = 4096 * 16;
> + uint32_t start_setup_data = prot_addr + kernel_size;
> + uint32_t end_setup_data = start_setup_data + estimated_setup_data_length;
> + uint32_t start_target = target_address;
> + uint32_t end_target = target_address + decompressed_length;

Maybe we can simply use 'unsigned' type.

> + if ((start_setup_data >= start_target && start_setup_data < end_target) ||
> + (end_setup_data >= start_target && end_setup_data < end_target)) {
> + uint32_t padded_size = target_address + decompressed_length - prot_addr;
> +
> + /* The early stage can't address past around 64 MB from the original
> + * mapping, so just give up in that case. */
> + if (padded_size < 62 * 1024 * 1024)

You mention 64 but check for 62, is that expected? You can use the MiB
definitions to ease code review: 64 * MiB.

> + kernel_size = padded_size;
> + else {
> + fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");
> + dtb_filename = NULL;
> + legacy_no_rng_seed = true;
> + }
> + }
> + }
Fix looks good, glad you figured out the problem.

Regards,

Phil.

2022-12-28 16:53:10

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Wed, Dec 28, 2022 at 05:02:22PM +0100, Philippe Mathieu-Daudé wrote:
> Hi Jason,
>
> On 28/12/22 15:38, Jason A. Donenfeld wrote:
> > The setup_data links are appended to the compressed kernel image. Since
> > the kernel image is typically loaded at 0x100000, setup_data lives at
> > `0x100000 + compressed_size`, which does not get relocated during the
> > kernel's boot process.
> >
> > The kernel typically decompresses the image starting at address
> > 0x1000000 (note: there's one more zero there than the decompressed image
*compressed image

> > + uint32_t target_address = ldl_p(setup + 0x258);
>
> Nitpicking, can the Linux kernel add these magic values in
> arch/x86/include/uapi/asm/bootparam.h? Or can we use
> offsetof(setup_header) to get them?

I suspect the reason that x86.c has always had those hardcoded offsets
is because this is how it's documented in Documentation/x86/boot.rst?

> > + if ((start_setup_data >= start_target && start_setup_data < end_target) ||
> > + (end_setup_data >= start_target && end_setup_data < end_target)) {
> > + uint32_t padded_size = target_address + decompressed_length - prot_addr;
> > +
> > + /* The early stage can't address past around 64 MB from the original
> > + * mapping, so just give up in that case. */
> > + if (padded_size < 62 * 1024 * 1024)
>
> You mention 64 but check for 62, is that expected? You can use the MiB
> definitions to ease code review: 64 * MiB.

62 is intentional. But I'm still not really sure what's up. 63 doesn't
work. I haven't totally worked out why this is, or why the 64 MiB limit
exists in the first place. Maybe because this is a very early mapping
set up by real mode? Or because another mapping is placed over it that's
executable? There's that 2MiB*4096 gdt entry, but that'd cover all 4
gigs. So I really don't know yet. I'll continue to poke at it, but on
the off chance somebody here understands what's up, that'd save me a
bunch of head scratching.

> Fix looks good, glad you figured out the problem.

I mean, kind of. The solution here sucks, especially given that in the
worst case, setup_data just gets dropped. I'm half inclined to consider
this a kernel bug instead, and add some code to relocate setup_data
prior to decompression, and then fix up all the links. It seems like
this would be a lot more robust.

I just wish the people who wrote this stuff would chime in. I've had
[email protected] CC'd but so far, no input from them.

Jason

2022-12-28 17:25:39

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

HELLO H. PETER ANVIN,
E
L
L
O

On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote:
> > Fix looks good, glad you figured out the problem.
>
> I mean, kind of. The solution here sucks, especially given that in the
> worst case, setup_data just gets dropped. I'm half inclined to consider
> this a kernel bug instead, and add some code to relocate setup_data
> prior to decompression, and then fix up all the links. It seems like
> this would be a lot more robust.
>
> I just wish the people who wrote this stuff would chime in. I've had
> [email protected] CC'd but so far, no input from them.

Apparently you are the x86 boot guru. What do you want to happen here?
Your input would be very instrumental.

Jason

2022-12-29 00:11:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On December 28, 2022 8:57:54 AM PST, "Jason A. Donenfeld" <[email protected]> wrote:
>HELLO H. PETER ANVIN,
>E
>L
>L
>O
>
>On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote:
>> > Fix looks good, glad you figured out the problem.
>>
>> I mean, kind of. The solution here sucks, especially given that in the
>> worst case, setup_data just gets dropped. I'm half inclined to consider
>> this a kernel bug instead, and add some code to relocate setup_data
>> prior to decompression, and then fix up all the links. It seems like
>> this would be a lot more robust.
>>
>> I just wish the people who wrote this stuff would chime in. I've had
>> [email protected] CC'd but so far, no input from them.
>
>Apparently you are the x86 boot guru. What do you want to happen here?
>Your input would be very instrumental.
>
>Jason

Hi!

Glad you asked.

So the kernel load addresses are parameterized in the kernel image setup header. One of the things that are so parameterized are the size and possible realignment of the kernel image in memory.

I'm very confused where you are getting the 64 MB number from. There should not be any such limitation.

In general, setup_data should be able to go anywhere the initrd can go, and so is subject to the same address cap (896 MB for old kernels, 4 GB on newer ones; this address too is enumerated in the header.)

If you want to put setup_data above 4 GB, it *should* be ok if and only if the kernel supports loading the initrd high, too (again, enumerated in the header.

TL;DR: put setup_data where you put the initrd (before or after doesn't matter.)

To be maximally conservative, link the setup_data list in order from lowest to highest address; currently there is no such item of relevance, but in the future there may be setup_data items needed by the BIOS part of the bootstrap in which case they would have to be < 1 MB and precede any items > 1 MB for obvious reasons. That being said, with BIOS dying it is not all that likely that such entries will ever be needed.

2022-12-29 02:18:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On 12/28/22 15:58, H. Peter Anvin wrote:
> On December 28, 2022 8:57:54 AM PST, "Jason A. Donenfeld" <[email protected]> wrote:
>> HELLO H. PETER ANVIN,
>> E
>> L
>> L
>> O
>>
>> On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote:
>>>> Fix looks good, glad you figured out the problem.
>>>
>>> I mean, kind of. The solution here sucks, especially given that in the
>>> worst case, setup_data just gets dropped. I'm half inclined to consider
>>> this a kernel bug instead, and add some code to relocate setup_data
>>> prior to decompression, and then fix up all the links. It seems like
>>> this would be a lot more robust.
>>>
>>> I just wish the people who wrote this stuff would chime in. I've had
>>> [email protected] CC'd but so far, no input from them.
>>
>> Apparently you are the x86 boot guru. What do you want to happen here?
>> Your input would be very instrumental.
>>
>> Jason
>
> Hi!
>
> Glad you asked.
>
> So the kernel load addresses are parameterized in the kernel image
> setup header. One of the things that are so parameterized are the
> size and possible realignment of the kernel image in memory.
>
> I'm very confused where you are getting the 64 MB number from. There
> should not be any such limitation.
>
> In general, setup_data should be able to go anywhere the initrd can
> go, and so is subject to the same address cap (896 MB for old
> kernels, 4 GB on newer ones; this address too is enumerated in the
> header.)
>
> If you want to put setup_data above 4 GB, it *should* be ok if and
> only if the kernel supports loading the initrd high, too (again,
> enumerated in the header.
>
> TL;DR: put setup_data where you put the initrd (before or after
> doesn't matter.)
>
> To be maximally conservative, link the setup_data list in order from
> lowest to highest address; currently there is no such item of
> relevance, but in the future there may be setup_data items needed by
> the BIOS part of the bootstrap in which case they would have to be <
> 1 MB and precede any items > 1 MB for obvious reasons. That being
> said, with BIOS dying it is not all that likely that such entries
> will ever be needed.
>

So let me try for an algorithm. Attached as a text file to avoid line
break damage.

-hpa


Attachments:
kernel-data-addresses.txt (5.71 kB)

2022-12-29 03:10:54

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

Hi,

Read this message in a fixed width text editor with a lot of columns.

On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
> Glad you asked.
>
> So the kernel load addresses are parameterized in the kernel image
> setup header. One of the things that are so parameterized are the size
> and possible realignment of the kernel image in memory.
>
> I'm very confused where you are getting the 64 MB number from. There
> should not be any such limitation.

Currently, QEMU appends it to the kernel image, not to the initramfs as
you suggest below. So, that winds up looking, currently, like:

kernel image setup_data
|--------------------------||----------------|
0x100000 0x100000+l1 0x100000+l1+l2

The problem is that this decompresses to 0x1000000 (one more zero). So
if l1 is > (0x1000000-0x100000), then this winds up looking like:

kernel image setup_data
|--------------------------||----------------|
0x100000 0x100000+l1 0x100000+l1+l2

d e c o m p r e s s e d k e r n e l
|-------------------------------------------------------------|
0x1000000 0x1000000+l3

The decompressed kernel seemingly overwriting the compressed kernel
image isn't a problem, because that gets relocated to a higher address
early on in the boot process. setup_data, however, stays in the same
place, since those links are self referential and nothing fixes them up.
So the decompressed kernel clobbers it.

The solution in this commit adds a bunch of padding between the kernel
image and setup_data to avoid this. That looks like this:

kernel image padding setup_data
|--------------------------||---------------------------------------------------||----------------|
0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2

d e c o m p r e s s e d k e r n e l
|-------------------------------------------------------------|
0x1000000 0x1000000+l3

This way, the decompressed kernel doesn't clobber setup_data.

The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
then the bootloader crashes when trying to dereference setup_data's
->len param at the end of initialize_identity_maps() in ident_map_64.c.
I don't know why it does this. If I could remove the 62 megabyte
restriction, then I could keep with this technique and all would be
well.

> In general, setup_data should be able to go anywhere the initrd can
> go, and so is subject to the same address cap (896 MB for old kernels,
> 4 GB on newer ones; this address too is enumerated in the header.)

It would be theoretically possible to attach it to the initrd image
instead of to the kernel image. As a last resort, I guess I can look
into doing that. However, that's going to require some serious rework
and plumbing of a lot of different components. So if I can make it work
as is, that'd be ideal. However, I need to figure out this weird 62 meg
limitation.

Any ideas on that?

Jason

2022-12-29 07:34:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <[email protected]> wrote:
>Hi,
>
>Read this message in a fixed width text editor with a lot of columns.
>
>On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> Glad you asked.
>>
>> So the kernel load addresses are parameterized in the kernel image
>> setup header. One of the things that are so parameterized are the size
>> and possible realignment of the kernel image in memory.
>>
>> I'm very confused where you are getting the 64 MB number from. There
>> should not be any such limitation.
>
>Currently, QEMU appends it to the kernel image, not to the initramfs as
>you suggest below. So, that winds up looking, currently, like:
>
> kernel image setup_data
> |--------------------------||----------------|
>0x100000 0x100000+l1 0x100000+l1+l2
>
>The problem is that this decompresses to 0x1000000 (one more zero). So
>if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
> kernel image setup_data
> |--------------------------||----------------|
>0x100000 0x100000+l1 0x100000+l1+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
>The decompressed kernel seemingly overwriting the compressed kernel
>image isn't a problem, because that gets relocated to a higher address
>early on in the boot process. setup_data, however, stays in the same
>place, since those links are self referential and nothing fixes them up.
>So the decompressed kernel clobbers it.
>
>The solution in this commit adds a bunch of padding between the kernel
>image and setup_data to avoid this. That looks like this:
>
> kernel image padding setup_data
> |--------------------------||---------------------------------------------------||----------------|
>0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
>This way, the decompressed kernel doesn't clobber setup_data.
>
>The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
>then the bootloader crashes when trying to dereference setup_data's
>->len param at the end of initialize_identity_maps() in ident_map_64.c.
>I don't know why it does this. If I could remove the 62 megabyte
>restriction, then I could keep with this technique and all would be
>well.
>
>> In general, setup_data should be able to go anywhere the initrd can
>> go, and so is subject to the same address cap (896 MB for old kernels,
>> 4 GB on newer ones; this address too is enumerated in the header.)
>
>It would be theoretically possible to attach it to the initrd image
>instead of to the kernel image. As a last resort, I guess I can look
>into doing that. However, that's going to require some serious rework
>and plumbing of a lot of different components. So if I can make it work
>as is, that'd be ideal. However, I need to figure out this weird 62 meg
>limitation.
>
>Any ideas on that?
>
>Jason

As far as a crash... that sounds like a big and a pretty serious one at that.

Could you let me know what kernel you are using and how *exactly* you are booting it?

2022-12-29 07:36:26

by Philippe Mathieu-Daudé

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On 29/12/22 03:31, Jason A. Donenfeld wrote:
> Hi,
>
> Read this message in a fixed width text editor with a lot of columns.
>
> On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> Glad you asked.
>>
>> So the kernel load addresses are parameterized in the kernel image
>> setup header. One of the things that are so parameterized are the size
>> and possible realignment of the kernel image in memory.
>>
>> I'm very confused where you are getting the 64 MB number from. There
>> should not be any such limitation.

[...]

Thanks for the diagrams. Feel free to include them in the commit
description ;)

>> In general, setup_data should be able to go anywhere the initrd can
>> go, and so is subject to the same address cap (896 MB for old kernels,
>> 4 GB on newer ones; this address too is enumerated in the header.)
>
> It would be theoretically possible to attach it to the initrd image
> instead of to the kernel image. As a last resort, I guess I can look
> into doing that. However, that's going to require some serious rework
> and plumbing of a lot of different components. So if I can make it work
> as is, that'd be ideal. However, I need to figure out this weird 62 meg
> limitation.
>
> Any ideas on that?

Could it be a limitation (internal buffer) of the decompressor?

2022-12-29 07:41:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <[email protected]> wrote:
>Hi,
>
>Read this message in a fixed width text editor with a lot of columns.
>
>On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> Glad you asked.
>>
>> So the kernel load addresses are parameterized in the kernel image
>> setup header. One of the things that are so parameterized are the size
>> and possible realignment of the kernel image in memory.
>>
>> I'm very confused where you are getting the 64 MB number from. There
>> should not be any such limitation.
>
>Currently, QEMU appends it to the kernel image, not to the initramfs as
>you suggest below. So, that winds up looking, currently, like:
>
> kernel image setup_data
> |--------------------------||----------------|
>0x100000 0x100000+l1 0x100000+l1+l2
>
>The problem is that this decompresses to 0x1000000 (one more zero). So
>if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
> kernel image setup_data
> |--------------------------||----------------|
>0x100000 0x100000+l1 0x100000+l1+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
>The decompressed kernel seemingly overwriting the compressed kernel
>image isn't a problem, because that gets relocated to a higher address
>early on in the boot process. setup_data, however, stays in the same
>place, since those links are self referential and nothing fixes them up.
>So the decompressed kernel clobbers it.
>
>The solution in this commit adds a bunch of padding between the kernel
>image and setup_data to avoid this. That looks like this:
>
> kernel image padding setup_data
> |--------------------------||---------------------------------------------------||----------------|
>0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
>This way, the decompressed kernel doesn't clobber setup_data.
>
>The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
>then the bootloader crashes when trying to dereference setup_data's
>->len param at the end of initialize_identity_maps() in ident_map_64.c.
>I don't know why it does this. If I could remove the 62 megabyte
>restriction, then I could keep with this technique and all would be
>well.
>
>> In general, setup_data should be able to go anywhere the initrd can
>> go, and so is subject to the same address cap (896 MB for old kernels,
>> 4 GB on newer ones; this address too is enumerated in the header.)
>
>It would be theoretically possible to attach it to the initrd image
>instead of to the kernel image. As a last resort, I guess I can look
>into doing that. However, that's going to require some serious rework
>and plumbing of a lot of different components. So if I can make it work
>as is, that'd be ideal. However, I need to figure out this weird 62 meg
>limitation.
>
>Any ideas on that?
>
>Jason

Ok, the code I sent will figure out the minimum amount of padding that you need (min_initrd_addr) as well.

2022-12-29 13:01:34

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
> As far as a crash... that sounds like a big and a pretty serious one at that.
>
> Could you let me know what kernel you are using and how *exactly* you are booting it?

Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:

early console in extract_kernel
input_data: 0x000000000be073a8
input_len: 0x00000000008cfc43
output: 0x0000000001000000
output_len: 0x000000000b600a98
kernel_total_size: 0x000000000ac26000
needed_size: 0x000000000b800000
trampoline_32bit: 0x000000000009d000

so that's a ~9M kernel which gets decompressed at 0x1000000 and the
output len is, what, ~180M which looks like plenty to me...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-30 16:09:50

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

Hi,

On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
> On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <[email protected]> wrote:
> >Hi,
> >
> >Read this message in a fixed width text editor with a lot of columns.
> >
> >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
> >> Glad you asked.
> >>
> >> So the kernel load addresses are parameterized in the kernel image
> >> setup header. One of the things that are so parameterized are the size
> >> and possible realignment of the kernel image in memory.
> >>
> >> I'm very confused where you are getting the 64 MB number from. There
> >> should not be any such limitation.
> >
> >Currently, QEMU appends it to the kernel image, not to the initramfs as
> >you suggest below. So, that winds up looking, currently, like:
> >
> > kernel image setup_data
> > |--------------------------||----------------|
> >0x100000 0x100000+l1 0x100000+l1+l2
> >
> >The problem is that this decompresses to 0x1000000 (one more zero). So
> >if l1 is > (0x1000000-0x100000), then this winds up looking like:
> >
> > kernel image setup_data
> > |--------------------------||----------------|
> >0x100000 0x100000+l1 0x100000+l1+l2
> >
> > d e c o m p r e s s e d k e r n e l
> > |-------------------------------------------------------------|
> > 0x1000000 0x1000000+l3
> >
> >The decompressed kernel seemingly overwriting the compressed kernel
> >image isn't a problem, because that gets relocated to a higher address
> >early on in the boot process. setup_data, however, stays in the same
> >place, since those links are self referential and nothing fixes them up.
> >So the decompressed kernel clobbers it.
> >
> >The solution in this commit adds a bunch of padding between the kernel
> >image and setup_data to avoid this. That looks like this:
> >
> > kernel image padding setup_data
> > |--------------------------||---------------------------------------------------||----------------|
> >0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2
> >
> > d e c o m p r e s s e d k e r n e l
> > |-------------------------------------------------------------|
> > 0x1000000 0x1000000+l3
> >
> >This way, the decompressed kernel doesn't clobber setup_data.
> >
> >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
> >then the bootloader crashes when trying to dereference setup_data's
> >->len param at the end of initialize_identity_maps() in ident_map_64.c.
> >I don't know why it does this. If I could remove the 62 megabyte
> >restriction, then I could keep with this technique and all would be
> >well.
> >
> >> In general, setup_data should be able to go anywhere the initrd can
> >> go, and so is subject to the same address cap (896 MB for old kernels,
> >> 4 GB on newer ones; this address too is enumerated in the header.)
> >
> >It would be theoretically possible to attach it to the initrd image
> >instead of to the kernel image. As a last resort, I guess I can look
> >into doing that. However, that's going to require some serious rework
> >and plumbing of a lot of different components. So if I can make it work
> >as is, that'd be ideal. However, I need to figure out this weird 62 meg
> >limitation.
> >
> >Any ideas on that?
> >
> >Jason
>
> As far as a crash... that sounds like a big and a pretty serious one at that.
>
> Could you let me know what kernel you are using and how *exactly* you are booting it?

I'll attach a .config file. Apply the patch at the top of this thread to
qemu, except make one modification:

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 628fd2b2e9..a61ee23e13 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1097,7 +1097,7 @@ void x86_load_linux(X86MachineState *x86ms,

/* The early stage can't address past around 64 MB from the original
* mapping, so just give up in that case. */
- if (padded_size < 62 * 1024 * 1024)
+ if (true || padded_size < 62 * 1024 * 1024)
kernel_size = padded_size;
else {
fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");

Then build qemu. Run it with `-kernel bzImage`, based on the kernel
built with the .config I attached.

You'll see that the CPU triple faults when hitting this line:

sd = (struct setup_data *)boot_params->hdr.setup_data;
while (sd) {
unsigned long sd_addr = (unsigned long)sd;

kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); <----
sd = (struct setup_data *)sd->next;
}

, because it dereferences *sd. This does not happen if the decompressed
size of the kernel is < 62 megs.

So that's the "big and pretty serious" bug that might be worthy of
investigation.

Jason

2022-12-30 16:16:55

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Thu, Dec 29, 2022 at 01:47:49PM +0100, Borislav Petkov wrote:
> On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
> > As far as a crash... that sounds like a big and a pretty serious one at that.
> >
> > Could you let me know what kernel you are using and how *exactly* you are booting it?
>
> Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:
>
> early console in extract_kernel
> input_data: 0x000000000be073a8
> input_len: 0x00000000008cfc43
> output: 0x0000000001000000
> output_len: 0x000000000b600a98
> kernel_total_size: 0x000000000ac26000
> needed_size: 0x000000000b800000
> trampoline_32bit: 0x000000000009d000
>
> so that's a ~9M kernel which gets decompressed at 0x1000000 and the
> output len is, what, ~180M which looks like plenty to me...

I think you might have misunderstood the thread. First, to reproduce the
bug that this patch fixes, you need a kernel with a compressed size of
around 16 megs, not 9. Secondly, that crash is well understood and
doesn't need to be reproduced; this patch fixes it. Rather, the question
now is how to improve this patch to remove the 62 meg limit. I'll follow
up with hpa's request for reproduction info.

Jason

2022-12-30 17:23:27

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Fri, Dec 30, 2022 at 6:01 PM Borislav Petkov <[email protected]> wrote:
>
> On Fri, Dec 30, 2022 at 04:54:27PM +0100, Jason A. Donenfeld wrote:
> > > Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:
> > >
> > > early console in extract_kernel
> > > input_data: 0x000000000be073a8
> > > input_len: 0x00000000008cfc43
> > > output: 0x0000000001000000
> > > output_len: 0x000000000b600a98
> > > kernel_total_size: 0x000000000ac26000
> > > needed_size: 0x000000000b800000
> > > trampoline_32bit: 0x000000000009d000
> > >
> > > so that's a ~9M kernel which gets decompressed at 0x1000000 and the
> > > output len is, what, ~180M which looks like plenty to me...
> >
> > I think you might have misunderstood the thread.
>
> Maybe...
>
> I've been trying to make sense of all the decompression dancing we're doing and
> the diagrams you've drawn and there's a comment over extract_kernel() which
> basically says that the compressed image is at the back of the memory buffer
>
> input_data: 0x000000000be073a8
>
> and we decompress to a smaller address
>
> output: 0x0000000001000000
>
> And, it *looks* like setup_data is at an address somewhere >= 0x1000000 so when
> we start decompressing, we overwrite it.
>
> I guess extract_kernel() could also dump the setup_data addresses so that we can
> verify that...
>
> > First, to reproduce the bug that this patch fixes, you need a kernel with a
> > compressed size of around 16 megs, not 9.
>
> Not only that - you need setup_data to be placed somewhere just so that it gets
> overwritten during decompression. Also, see below.
>
> > Secondly, that crash is well understood and doesn't need to be reproduced;
>
> Is it?
>
> Where do you get the 0x100000 as the starting address of the kernel image?
>
> Because input_data above says that the input address is somewhere higher...

Look closer at the boot process. The compressed image is initially at
0x100000, but it gets relocated to a safer area at the end of
startup_64:

/*
* Copy the compressed kernel to the end of our buffer
* where decompression in place becomes safe.
*/
pushq %rsi
leaq (_bss-8)(%rip), %rsi
leaq rva(_bss-8)(%rbx), %rdi
movl $(_bss - startup_32), %ecx
shrl $3, %ecx
std
rep movsq
cld
popq %rsi

/*
* The GDT may get overwritten either during the copy we just did or
* during extract_kernel below. To avoid any issues, repoint the GDTR
* to the new copy of the GDT.
*/
leaq rva(gdt64)(%rbx), %rax
leaq rva(gdt)(%rbx), %rdx
movq %rdx, 2(%rax)
lgdt (%rax)

/*
* Jump to the relocated address.
*/
leaq rva(.Lrelocated)(%rbx), %rax
jmp *%rax

And that address winds up being calculated with a combination of the
image size and the init_size param, so it's safe. Decompression won't
overwrite the compressed kernel.

HOWEVER, qemu currently appends setup_data to the end of the
compressed kernel image, and this part isn't moved, and setup_data
links aren't walked/relocated. So that means the original address
remains, of 0x100000.

Jason

2022-12-30 17:48:23

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Fri, Dec 30, 2022 at 04:54:27PM +0100, Jason A. Donenfeld wrote:
> > Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:
> >
> > early console in extract_kernel
> > input_data: 0x000000000be073a8
> > input_len: 0x00000000008cfc43
> > output: 0x0000000001000000
> > output_len: 0x000000000b600a98
> > kernel_total_size: 0x000000000ac26000
> > needed_size: 0x000000000b800000
> > trampoline_32bit: 0x000000000009d000
> >
> > so that's a ~9M kernel which gets decompressed at 0x1000000 and the
> > output len is, what, ~180M which looks like plenty to me...
>
> I think you might have misunderstood the thread.

Maybe...

I've been trying to make sense of all the decompression dancing we're doing and
the diagrams you've drawn and there's a comment over extract_kernel() which
basically says that the compressed image is at the back of the memory buffer

input_data: 0x000000000be073a8

and we decompress to a smaller address

output: 0x0000000001000000

And, it *looks* like setup_data is at an address somewhere >= 0x1000000 so when
we start decompressing, we overwrite it.

I guess extract_kernel() could also dump the setup_data addresses so that we can
verify that...

> First, to reproduce the bug that this patch fixes, you need a kernel with a
> compressed size of around 16 megs, not 9.

Not only that - you need setup_data to be placed somewhere just so that it gets
overwritten during decompression. Also, see below.

> Secondly, that crash is well understood and doesn't need to be reproduced;

Is it?

Where do you get the 0x100000 as the starting address of the kernel image?

Because input_data above says that the input address is somewhere higher...

Btw, here's an allyesconfig:

early console in setup code
No EFI environment detected.
early console in extract_kernel
input_data: 0x000000001bd003a8
input_len: 0x000000000cde2963
output: 0x0000000001000000
output_len: 0x0000000027849d74
kernel_total_size: 0x0000000025830000
needed_size: 0x0000000027a00000
trampoline_32bit: 0x000000000009d000
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

That kernel has compressed size of ~205M and the output buffer is of size ~632M.

> this patch fixes it. Rather, the question now is how to improve this patch
> to remove the 62 meg limit.

Yeah, I'm wondering what that 62M limit is too, actually.

But maybe I'm misunderstanding...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-30 19:00:02

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Wed, Dec 28, 2022 at 03:38:30PM +0100, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
>
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the decompressed image
> above). This usually is fine for most kernels.
>
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
>
> Fix this by detecting that possibility, and if it occurs, put setup_data
> *after* the end of the decompressed kernel, so that it doesn't get
> clobbered.
>
> One caveat is that this only works for images less than around 64
> megabytes, so just bail out in that case. This is unfortunate, but I
> don't currently have a way of fixing it.

I've got a different solution now that tries to piggy back on cmdline.
I'll send a v2. It avoids the 62MiB crash situation of this one and
seems to work fine.

2022-12-30 19:35:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On December 30, 2022 7:59:30 AM PST, "Jason A. Donenfeld" <[email protected]> wrote:
>Hi,
>
>On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
>> On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <[email protected]> wrote:
>> >Hi,
>> >
>> >Read this message in a fixed width text editor with a lot of columns.
>> >
>> >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> >> Glad you asked.
>> >>
>> >> So the kernel load addresses are parameterized in the kernel image
>> >> setup header. One of the things that are so parameterized are the size
>> >> and possible realignment of the kernel image in memory.
>> >>
>> >> I'm very confused where you are getting the 64 MB number from. There
>> >> should not be any such limitation.
>> >
>> >Currently, QEMU appends it to the kernel image, not to the initramfs as
>> >you suggest below. So, that winds up looking, currently, like:
>> >
>> > kernel image setup_data
>> > |--------------------------||----------------|
>> >0x100000 0x100000+l1 0x100000+l1+l2
>> >
>> >The problem is that this decompresses to 0x1000000 (one more zero). So
>> >if l1 is > (0x1000000-0x100000), then this winds up looking like:
>> >
>> > kernel image setup_data
>> > |--------------------------||----------------|
>> >0x100000 0x100000+l1 0x100000+l1+l2
>> >
>> > d e c o m p r e s s e d k e r n e l
>> > |-------------------------------------------------------------|
>> > 0x1000000 0x1000000+l3
>> >
>> >The decompressed kernel seemingly overwriting the compressed kernel
>> >image isn't a problem, because that gets relocated to a higher address
>> >early on in the boot process. setup_data, however, stays in the same
>> >place, since those links are self referential and nothing fixes them up.
>> >So the decompressed kernel clobbers it.
>> >
>> >The solution in this commit adds a bunch of padding between the kernel
>> >image and setup_data to avoid this. That looks like this:
>> >
>> > kernel image padding setup_data
>> > |--------------------------||---------------------------------------------------||----------------|
>> >0x100000 0x100000+l1 0x1000000+l3 0x1000000+l3+l2
>> >
>> > d e c o m p r e s s e d k e r n e l
>> > |-------------------------------------------------------------|
>> > 0x1000000 0x1000000+l3
>> >
>> >This way, the decompressed kernel doesn't clobber setup_data.
>> >
>> >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
>> >then the bootloader crashes when trying to dereference setup_data's
>> >->len param at the end of initialize_identity_maps() in ident_map_64.c.
>> >I don't know why it does this. If I could remove the 62 megabyte
>> >restriction, then I could keep with this technique and all would be
>> >well.
>> >
>> >> In general, setup_data should be able to go anywhere the initrd can
>> >> go, and so is subject to the same address cap (896 MB for old kernels,
>> >> 4 GB on newer ones; this address too is enumerated in the header.)
>> >
>> >It would be theoretically possible to attach it to the initrd image
>> >instead of to the kernel image. As a last resort, I guess I can look
>> >into doing that. However, that's going to require some serious rework
>> >and plumbing of a lot of different components. So if I can make it work
>> >as is, that'd be ideal. However, I need to figure out this weird 62 meg
>> >limitation.
>> >
>> >Any ideas on that?
>> >
>> >Jason
>>
>> As far as a crash... that sounds like a big and a pretty serious one at that.
>>
>> Could you let me know what kernel you are using and how *exactly* you are booting it?
>
>I'll attach a .config file. Apply the patch at the top of this thread to
>qemu, except make one modification:
>
>diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>index 628fd2b2e9..a61ee23e13 100644
>--- a/hw/i386/x86.c
>+++ b/hw/i386/x86.c
>@@ -1097,7 +1097,7 @@ void x86_load_linux(X86MachineState *x86ms,
>
> /* The early stage can't address past around 64 MB from the original
> * mapping, so just give up in that case. */
>- if (padded_size < 62 * 1024 * 1024)
>+ if (true || padded_size < 62 * 1024 * 1024)
> kernel_size = padded_size;
> else {
> fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");
>
>Then build qemu. Run it with `-kernel bzImage`, based on the kernel
>built with the .config I attached.
>
>You'll see that the CPU triple faults when hitting this line:
>
> sd = (struct setup_data *)boot_params->hdr.setup_data;
> while (sd) {
> unsigned long sd_addr = (unsigned long)sd;
>
> kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); <----
> sd = (struct setup_data *)sd->next;
> }
>
>, because it dereferences *sd. This does not happen if the decompressed
>size of the kernel is < 62 megs.
>
>So that's the "big and pretty serious" bug that might be worthy of
>investigation.
>
>Jason

No kidding. Dereferencing data *before you map it* is generally frowned upon.

This needs to be split into to making calls.

*Facepalm*

2022-12-30 20:15:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote:
> Look closer at the boot process. The compressed image is initially at
> 0x100000, but it gets relocated to a safer area at the end of
> startup_64:

That is the address we're executing here from, rip here looks like 0x100xxx.

> /*
> * Copy the compressed kernel to the end of our buffer
> * where decompression in place becomes safe.
> */
> pushq %rsi
> leaq (_bss-8)(%rip), %rsi
> leaq rva(_bss-8)(%rbx), %rdi

when you get to here, it looks something like this:

leaq (_bss-8)(%rip), %rsi # 0x9e7ff8
leaq rva(_bss-8)(%rbx), %rdi # 0xc6eeff8

so the source address is that _bss thing and we copy...

> movl $(_bss - startup_32), %ecx
> shrl $3, %ecx
> std

... backwards since DF=1.

Up to:

# rsi = 0xffff8
# rdi = 0xbe06ff8

Ok, so the source address is 0x100000. Good.

> HOWEVER, qemu currently appends setup_data to the end of the
> compressed kernel image,

Yeah, you mean the kernel which starts executing at 0x100000, i.e., that part
which is compressed/head_64.S and which does the above and the relocation etc.

> and this part isn't moved, and setup_data links aren't walked/relocated. So
> that means the original address remains, of 0x100000.

See above: when it starts copying the kernel image backwards to a higher
address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data
*after* that address. And that doesn't get copied ofc.

So far, so good.

Now later, we extract the compressed kernel created with the mkpiggy magic:

input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:

by doing

/*
* Do the extraction, and jump to the new kernel..
*/

pushq %rsi /* Save the real mode argument */ 0x13d00
movq %rsi, %rdi /* real mode address */ 0x13d00
leaq boot_heap(%rip), %rsi /* malloc area for uncompression */ 0xc6ef000
leaq input_data(%rip), %rdx /* input_data */ 0xbe073a8
movl input_len(%rip), %ecx /* input_len */ 0x8cfe13
movq %rbp, %r8 /* output target address */ 0x1000000
movl output_len(%rip), %r9d /* decompressed length, end of relocs */
call extract_kernel /* returns kernel location in %rax */
popq %rsi

(actual addresses at the end.)

Now, when you say you triplefault somewhere in initialize_identity_maps() when
trying to access setup_data, then if you look a couple of lines before that call
we do

call load_stage2_idt

which sets up a boottime #PF handler do_boot_page_fault() and it actually does
call kernel_add_identity_map() so *actually* it should map any unmapped
setup_data addresses.

So why doesn't it do that and why do you triplefault?

Hmmm.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-30 22:40:34

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On December 30, 2022 11:54:11 AM PST, Borislav Petkov <[email protected]> wrote:
>On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote:
>> Look closer at the boot process. The compressed image is initially at
>> 0x100000, but it gets relocated to a safer area at the end of
>> startup_64:
>
>That is the address we're executing here from, rip here looks like 0x100xxx.
>
>> /*
>> * Copy the compressed kernel to the end of our buffer
>> * where decompression in place becomes safe.
>> */
>> pushq %rsi
>> leaq (_bss-8)(%rip), %rsi
>> leaq rva(_bss-8)(%rbx), %rdi
>
>when you get to here, it looks something like this:
>
> leaq (_bss-8)(%rip), %rsi # 0x9e7ff8
> leaq rva(_bss-8)(%rbx), %rdi # 0xc6eeff8
>
>so the source address is that _bss thing and we copy...
>
>> movl $(_bss - startup_32), %ecx
>> shrl $3, %ecx
>> std
>
>... backwards since DF=1.
>
>Up to:
>
># rsi = 0xffff8
># rdi = 0xbe06ff8
>
>Ok, so the source address is 0x100000. Good.
>
>> HOWEVER, qemu currently appends setup_data to the end of the
>> compressed kernel image,
>
>Yeah, you mean the kernel which starts executing at 0x100000, i.e., that part
>which is compressed/head_64.S and which does the above and the relocation etc.
>
>> and this part isn't moved, and setup_data links aren't walked/relocated. So
>> that means the original address remains, of 0x100000.
>
>See above: when it starts copying the kernel image backwards to a higher
>address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data
>*after* that address. And that doesn't get copied ofc.
>
>So far, so good.
>
>Now later, we extract the compressed kernel created with the mkpiggy magic:
>
>input_data:
>.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
>input_data_end:
>
>by doing
>
>/*
> * Do the extraction, and jump to the new kernel..
> */
>
> pushq %rsi /* Save the real mode argument */ 0x13d00
> movq %rsi, %rdi /* real mode address */ 0x13d00
> leaq boot_heap(%rip), %rsi /* malloc area for uncompression */ 0xc6ef000
> leaq input_data(%rip), %rdx /* input_data */ 0xbe073a8
> movl input_len(%rip), %ecx /* input_len */ 0x8cfe13
> movq %rbp, %r8 /* output target address */ 0x1000000
> movl output_len(%rip), %r9d /* decompressed length, end of relocs */
> call extract_kernel /* returns kernel location in %rax */
> popq %rsi
>
>(actual addresses at the end.)
>
>Now, when you say you triplefault somewhere in initialize_identity_maps() when
>trying to access setup_data, then if you look a couple of lines before that call
>we do
>
> call load_stage2_idt
>
>which sets up a boottime #PF handler do_boot_page_fault() and it actually does
>call kernel_add_identity_map() so *actually* it should map any unmapped
>setup_data addresses.
>
>So why doesn't it do that and why do you triplefault?
>
>Hmmm.
>

See the other thread fork. They have identified the problem already.

2022-12-30 22:55:07

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote:
> See the other thread fork. They have identified the problem already.

Not sure I follow. Is there another thread where somebody worked out why
this 62meg limit was happening?

Note that I sent v2/v3, to fix the original problem in a different way,
and if that looks good to the QEMU maintainers, then we can all be happy
with that. But I *haven't* addressed and still don't fully understand
why the 62meg limit applied to my v1 in the way it does. Did you find a
bug there to fix? If so, please do CC me.

Jason

2022-12-31 01:43:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data



On 12/30/22 14:10, Jason A. Donenfeld wrote:
> On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote:
>> See the other thread fork. They have identified the problem already.
>
> Not sure I follow. Is there another thread where somebody worked out why
> this 62meg limit was happening?
>
> Note that I sent v2/v3, to fix the original problem in a different way,
> and if that looks good to the QEMU maintainers, then we can all be happy
> with that. But I *haven't* addressed and still don't fully understand
> why the 62meg limit applied to my v1 in the way it does. Did you find a
> bug there to fix? If so, please do CC me.
>

Yes, you yourself posted the problem:

> Then build qemu. Run it with `-kernel bzImage`, based on the kernel
> built with the .config I attached.
>
> You'll see that the CPU triple faults when hitting this line:
>
> sd = (struct setup_data *)boot_params->hdr.setup_data;
> while (sd) {
> unsigned long sd_addr = (unsigned long)sd;
>
> kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); <----
> sd = (struct setup_data *)sd->next;
> }
>
> , because it dereferences *sd. This does not happen if the decompressed
> size of the kernel is < 62 megs.
>
> So that's the "big and pretty serious" bug that might be worthy of
> investigation.

This needs to be something like:

kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
kernel_add_identity_map(sd_addr + sizeof(*sd),
sd_addr + sizeof(*sd) + sd->len);


TThe 62 MB limit mentioned in boot.rst is unrelated, and only applies to
very, very old kernels that used INT 15h, AH=88h to probe memory.

-hpa

2022-12-31 01:51:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On 12/30/22 17:06, H. Peter Anvin wrote
>
> TThe 62 MB limit mentioned in boot.rst is unrelated, and only applies to
> very, very old kernels that used INT 15h, AH=88h to probe memory.
>

I am 88% sure this was fixed long before setup_data was created, as it
was created originally to carry e820 info for more than 128(!) memory
segments. However, as we see here, it is never certain that bugs didn't
creep in in the meantime...

-hpa

2022-12-31 09:50:51

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Fri, Dec 30, 2022 at 04:59:30PM +0100, Jason A. Donenfeld wrote:
> I'll attach a .config file. Apply the patch at the top of this thread to
> qemu,

Hmm, so the patch at the top of this thread is fixing the clobbering of
setup_data.

But I tried latest qemu with your .config and it boots ok here. So how do I
repro the original issue you're trying to fix?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-31 13:01:02

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 10:48:16AM +0100, Borislav Petkov wrote:
> On Fri, Dec 30, 2022 at 04:59:30PM +0100, Jason A. Donenfeld wrote:
> > I'll attach a .config file. Apply the patch at the top of this thread to
> > qemu,
>
> Hmm, so the patch at the top of this thread is fixing the clobbering of
> setup_data.
>
> But I tried latest qemu with your .config and it boots ok here. So how do I
> repro the original issue you're trying to fix?

Nothing special... `-kernel bzImage` should be enough to do it. Eric
reported it, and then I was able to repro trivially. Sure you got the
right version?

Jason

2022-12-31 13:02:26

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote:
>
>
> On 12/30/22 14:10, Jason A. Donenfeld wrote:
> > On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote:
> >> See the other thread fork. They have identified the problem already.
> >
> > Not sure I follow. Is there another thread where somebody worked out why
> > this 62meg limit was happening?
> >
> > Note that I sent v2/v3, to fix the original problem in a different way,
> > and if that looks good to the QEMU maintainers, then we can all be happy
> > with that. But I *haven't* addressed and still don't fully understand
> > why the 62meg limit applied to my v1 in the way it does. Did you find a
> > bug there to fix? If so, please do CC me.
> >
>
> Yes, you yourself posted the problem:
>
> > Then build qemu. Run it with `-kernel bzImage`, based on the kernel
> > built with the .config I attached.
> >
> > You'll see that the CPU triple faults when hitting this line:
> >
> > sd = (struct setup_data *)boot_params->hdr.setup_data;
> > while (sd) {
> > unsigned long sd_addr = (unsigned long)sd;
> >
> > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len); <----
> > sd = (struct setup_data *)sd->next;
> > }
> >
> > , because it dereferences *sd. This does not happen if the decompressed
> > size of the kernel is < 62 megs.
> >
> > So that's the "big and pretty serious" bug that might be worthy of
> > investigation.
>
> This needs to be something like:
>
> kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
> kernel_add_identity_map(sd_addr + sizeof(*sd),
> sd_addr + sizeof(*sd) + sd->len);
>

Oh, right, duh. Thanks for spelling it out.

Jason

2022-12-31 14:17:29

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 01:54:50PM +0100, Jason A. Donenfeld wrote:
> Nothing special... `-kernel bzImage` should be enough to do it. Eric
> reported it, and then I was able to repro trivially. Sure you got the
> right version?

Yeah, qemu executables confusion here - had wrongly something older of the
version 7.1...

Now made sure I'm actually booting with the latest qemu:

QEMU emulator version 7.2.50 (v7.2.0-333-g222059a0fccf)

With that the kernel with your config hangs early during boot and the stack
trace is below.

Seeing how it says trapnr 14, then that looks like something you are seeing.

But lemme poke at it more.

#0 0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57
#1 halt () at ./arch/x86/include/asm/irqflags.h:98
#2 early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340
#3 0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424
#4 0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483
#5 0x2404c74100000cd0 in ?? ()
#6 0xffffffffff20073c in ?? ()
#7 0x0000000000000010 in fixed_percpu_data ()
#8 0xdffffc0000000000 in ?? ()
#9 0xffffffff84007ea8 in init_thread_union ()
#10 0xffffffffff200cd0 in ?? ()
#11 0x0000000000000000 in ?? ()

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-31 14:21:36

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 02:48:12PM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 02:44:08PM +0100, Jason A. Donenfeld wrote:
> > Are you using patch v1 minus the 62 MiB thing?
>
> No, I want to see the original failure - the one that prompted you to send
>
> https://lore.kernel.org/r/[email protected]

That failure is unrelated to the ident mapping issue Peter and
I discussed. The original failure is described in the commit message:
decompression clobbers the data, so sd->next points to garbage.

2022-12-31 14:22:50

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 02:40:59PM +0100, Borislav Petkov wrote:
> On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote:
> > This needs to be something like:
> >
> > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
> > kernel_add_identity_map(sd_addr + sizeof(*sd),
> > sd_addr + sizeof(*sd) + sd->len);
>
> It still #PFs with that:
>
> (gdb) bt
> #0 0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57
> #1 halt () at ./arch/x86/include/asm/irqflags.h:98
> #2 early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340
> #3 0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424
> #4 0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483
> #5 0xc149f9894908788d in ?? ()
> #6 0xffffffffff2003fc in ?? ()
> #7 0x0000000000000010 in fixed_percpu_data ()
> #8 0xdffffc0000000000 in ?? ()
> #9 0xffffffff84007ea8 in init_thread_union ()
> #10 0xffffffffff20088d in ?? ()
> #11 0x0000000000000000 in ?? ()
>
> /me goes to dig more.

Are you using patch v1 minus the 62 MiB thing? If you haven't applied
patch v1 and then removed the 62 MiB limitation in it, then you've
misunderstood the conversation again.

Please see my reproduction steps to Peter:
https://lore.kernel.org/lkml/[email protected]/

Jason

2022-12-31 14:23:58

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote:
> This needs to be something like:
>
> kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
> kernel_add_identity_map(sd_addr + sizeof(*sd),
> sd_addr + sizeof(*sd) + sd->len);

It still #PFs with that:

(gdb) bt
#0 0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57
#1 halt () at ./arch/x86/include/asm/irqflags.h:98
#2 early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340
#3 0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424
#4 0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483
#5 0xc149f9894908788d in ?? ()
#6 0xffffffffff2003fc in ?? ()
#7 0x0000000000000010 in fixed_percpu_data ()
#8 0xdffffc0000000000 in ?? ()
#9 0xffffffff84007ea8 in init_thread_union ()
#10 0xffffffffff20088d in ?? ()
#11 0x0000000000000000 in ?? ()

/me goes to dig more.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-31 14:28:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
> That failure is unrelated to the ident mapping issue Peter and
> I discussed. The original failure is described in the commit message:
> decompression clobbers the data, so sd->next points to garbage.

Right, and the fact that the kernel overwrites it still feels kinda wrong: the
kernel knows where setup_data is - the address is in the setup header so
*actually*, it should take care of not to clobber it.

But hpa would correct me if I'm missing an aspect about whose responsibility it
is to do what here...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-31 14:42:07

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 02:35:45PM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 01:54:50PM +0100, Jason A. Donenfeld wrote:
> > Nothing special... `-kernel bzImage` should be enough to do it. Eric
> > reported it, and then I was able to repro trivially. Sure you got the
> > right version?
>
> Yeah, qemu executables confusion here - had wrongly something older of the
> version 7.1...
>
> Now made sure I'm actually booting with the latest qemu:
>
> QEMU emulator version 7.2.50 (v7.2.0-333-g222059a0fccf)
>
> With that the kernel with your config hangs early during boot and the stack
> trace is below.
>
> Seeing how it says trapnr 14, then that looks like something you are seeing.
>
> But lemme poke at it more.

Yes. The cause is what I've described in the commit message. There are
two proposed fixes, the v1, which has the 62 MiB limitation due to a
kernel bug, and the v3, which seems to work fine, and is simpler, and I
suspect that's the one QEMU upstream should take.

https://lore.kernel.org/lkml/[email protected]/

2022-12-31 14:57:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 02:44:08PM +0100, Jason A. Donenfeld wrote:
> Are you using patch v1 minus the 62 MiB thing?

No, I want to see the original failure - the one that prompted you to send

https://lore.kernel.org/r/[email protected]

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-12-31 18:52:14

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
> > That failure is unrelated to the ident mapping issue Peter and
> > I discussed. The original failure is described in the commit message:
> > decompression clobbers the data, so sd->next points to garbage.
>
> Right

So with that understanding confirmed, I'm confused at your surprise that
hpa's unrelated fix to the different issue didn't fix this issue.

> and the fact that the kernel overwrites it still feels kinda wrong: the
> kernel knows where setup_data is - the address is in the setup header so
> *actually*, it should take care of not to clobber it.

Yea, technically the bootloader could relocate all the setup_data links
by copying them and updating ->next. This wouldn't be so hard to do.
(Special care would have to be taken, though, to zero out
SETUP_RNG_SEED, though, for forward secrecy and such.)

But since the kernel doesn't do this now, and the 62MiB bug also seems
to apply to existing kernels, for the purposes of QEMU for now, I think
the v3 patch is probably best, since it'll handle existing kernels.
Alternatively, setup_data could be relocated, the boot param protocol
could be bumped, and then QEMU could conditionalized it's use of
setup_data based on that protocol version. That'd work, but seems a bit
more involved.

So maybe for now, v3 works? Hopefully that looks like a correct approach
to hpa, anyhow:
https://lore.kernel.org/lkml/[email protected]/
I think it should fit with what he described would work.

Jason

2022-12-31 19:29:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 07:22:47PM +0100, Jason A. Donenfeld wrote:
> So with that understanding confirmed, I'm confused at your surprise that
> hpa's unrelated fix to the different issue didn't fix this issue.

No surprise there - I used a qemu variant without your patch to prevent the
setup_data clobbering so hpa's fix can't help there.

> But since the kernel doesn't do this now, and the 62MiB bug also seems
> to apply to existing kernels, for the purposes of QEMU for now, I think
> the v3 patch is probably best, since it'll handle existing kernels.

Right, we can't fix all those guests which are out there.

> Alternatively, setup_data could be relocated, the boot param protocol
> could be bumped, and then QEMU could conditionalized it's use of
> setup_data based on that protocol version. That'd work, but seems a bit
> more involved.

I think this is at least worth considering because the kernel overwriting
setup_data after having been told where that setup_data is located is not really
nice.

Well not explicitly at least - it gets the pointer to the first element and
something needs to traverse that list to know which addresses are live. But
still, that info is there so perhaps we need to take setup_data into
consideration too before decompressing...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-01 04:19:45

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On 12/31/22 19:21, H. Peter Anvin wrote:
>>
>>> Alternatively, setup_data could be relocated, the boot param protocol
>>> could be bumped, and then QEMU could conditionalized it's use of
>>> setup_data based on that protocol version. That'd work, but seems a bit
>>> more involved.
>>
>> I think this is at least worth considering because the kernel overwriting
>> setup_data after having been told where that setup_data is located is
>> not really
>> nice.
>>
>> Well not explicitly at least - it gets the pointer to the first
>> element and
>> something needs to traverse that list to know which addresses are
>> live. But
>> still, that info is there so perhaps we need to take setup_data into
>> consideration too before decompressing...
>>

It would probably be a good idea to add a "maximum physical address for
initrd/setup_data/cmdline" field to struct kernel_info, though. It
appears right now that those fields are being identity-mapped in the
decompressor, and that means that if 48-bit addressing is used, physical
memory may extend past the addressable range.

-hpa

2023-01-01 04:27:03

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data



On 12/31/22 11:00, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 07:22:47PM +0100, Jason A. Donenfeld wrote:
>> So with that understanding confirmed, I'm confused at your surprise that
>> hpa's unrelated fix to the different issue didn't fix this issue.
>
> No surprise there - I used a qemu variant without your patch to prevent the
> setup_data clobbering so hpa's fix can't help there.
>
>> But since the kernel doesn't do this now, and the 62MiB bug also seems
>> to apply to existing kernels, for the purposes of QEMU for now, I think
>> the v3 patch is probably best, since it'll handle existing kernels.
>
> Right, we can't fix all those guests which are out there.
>
>> Alternatively, setup_data could be relocated, the boot param protocol
>> could be bumped, and then QEMU could conditionalized it's use of
>> setup_data based on that protocol version. That'd work, but seems a bit
>> more involved.
>
> I think this is at least worth considering because the kernel overwriting
> setup_data after having been told where that setup_data is located is not really
> nice.
>
> Well not explicitly at least - it gets the pointer to the first element and
> something needs to traverse that list to know which addresses are live. But
> still, that info is there so perhaps we need to take setup_data into
> consideration too before decompressing...
>

As far as the decompression itself goes, it should only a problem if we
are using physical KASLR since otherwise the kernel has a guaranteed
safe zone already allocated by the boot loader. However, if physical
KASLR is in use, then the decompressor needs to know everything there is
to know about the memory map.

However, there also seems to be some kind of interaction with AMD SEV-SNP.


The bug appears to have been introduced by:

b57feed2cc2622ae14b2fa62f19e973e5e0a60cf
x86/compressed/64: Add identity mappings for setup_data entries
https://lore.kernel.org/r/TYCPR01MB694815CD815E98945F63C99183B49@TYCPR01MB6948.jpnprd01.prod.outlook.com

... which was included in version 5.19, so it is relatively recent.

For a small amount of setup_data, the solution of just putting it next
to the command line makes a lot of sense, and should be safe indefinitely.

-hpa

2023-01-01 05:08:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data



On 12/31/22 10:22, Jason A. Donenfeld wrote:
> On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote:
>> On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
>>> That failure is unrelated to the ident mapping issue Peter and
>>> I discussed. The original failure is described in the commit message:
>>> decompression clobbers the data, so sd->next points to garbage.
>>
>> Right
>
> So with that understanding confirmed, I'm confused at your surprise that
> hpa's unrelated fix to the different issue didn't fix this issue.
>

If decompression does clobber the data, then we *also* need to figure
out why that is. There are basically three possibilities:

1. If physical KASLR is NOT used:

a. The boot loader doesn't honor the kernel safe area properly;
b. Somewhere in the process a bug in the calculation of the
kernel safe area has crept in.

2. If physical KASLR IS used:

The decompressor doesn't correctly keep track of nor relocate
all the keep-out zones before picking a target address.

One is a bootloader bug, two is a kernel bug. My guess is (2) is the
culprit, but (1b) should be checked, too.

-hpa

2023-01-01 05:16:37

by Mika Penttilä

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data



On 1.1.2023 6.33, H. Peter Anvin wrote:
>
>
> On 12/31/22 10:22, Jason A. Donenfeld wrote:
>> On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote:
>>> On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
>>>> That failure is unrelated to the ident mapping issue Peter and
>>>> I discussed. The original failure is described in the commit message:
>>>> decompression clobbers the data, so sd->next points to garbage.
>>>
>>> Right
>>
>> So with that understanding confirmed, I'm confused at your surprise that
>> hpa's unrelated fix to the different issue didn't fix this issue.
>>
>
> If decompression does clobber the data, then we *also* need to figure
> out why that is. There are basically three possibilities:
>
> 1. If physical KASLR is NOT used:
>
>     a. The boot loader doesn't honor the kernel safe area properly;
>     b. Somewhere in the process a bug in the calculation of the
>        kernel safe area has crept in.
>
> 2. If physical KASLR IS used:
>
>     The decompressor doesn't correctly keep track of nor relocate
>     all the keep-out zones before picking a target address.

Seems setup_data is not included in those mem_avoid regions.

>
> One is a bootloader bug, two is a kernel bug. My guess is (2) is the
> culprit, but (1b) should be checked, too.
>
>     -hpa
>


--Mika

2023-01-01 05:27:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On 12/31/22 20:55, Mika Penttilä wrote:
>>
>> If decompression does clobber the data, then we *also* need to figure
>> out why that is. There are basically three possibilities:
>>
>> 1. If physical KASLR is NOT used:
>>
>>      a. The boot loader doesn't honor the kernel safe area properly;
>>      b. Somewhere in the process a bug in the calculation of the
>>         kernel safe area has crept in.
>>
>> 2. If physical KASLR IS used:
>>
>>      The decompressor doesn't correctly keep track of nor relocate
>>      all the keep-out zones before picking a target address.
>
> Seems setup_data is not included in those mem_avoid regions.
>

[facepalm]


>>
>> One is a bootloader bug, two is a kernel bugs. My guess is (2) is the
>> culprit, but (1b) should be checked, too.
>>

Correction: two are kernel bugs, i.e. (1b) and (2) are both kernel bugs.

-hpa

2023-01-02 06:11:34

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 07:21:06PM -0800, H. Peter Anvin wrote:
> As far as the decompression itself goes, it should only a problem if we are
> using physical KASLR since otherwise the kernel has a guaranteed safe zone
> already allocated by the boot loader. However, if physical KASLR is in use,

No KASLR in Jason's config AFAICT:

$ grep RANDOMIZE .config
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set

> then the decompressor needs to know everything there is to know about the
> memory map.

Yeah, we do have that but as you folks establish later in the thread, those
setup_data regions would need to be avoided too. ;-\

> However, there also seems to be some kind of interaction with AMD SEV-SNP.
>
>
> The bug appears to have been introduced by:
>
> b57feed2cc2622ae14b2fa62f19e973e5e0a60cf
> x86/compressed/64: Add identity mappings for setup_data entries
> https://lore.kernel.org/r/TYCPR01MB694815CD815E98945F63C99183B49@TYCPR01MB6948.jpnprd01.prod.outlook.com
>
> ... which was included in version 5.19, so it is relatively recent.

Right. We need that for the CC blob:

b190a043c49a ("x86/sev: Add SEV-SNP feature detection/setup")

> For a small amount of setup_data, the solution of just putting it next to
> the command line makes a lot of sense, and should be safe indefinitely.

Ok.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-02 06:42:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> It would probably be a good idea to add a "maximum physical address for
> initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> right now that those fields are being identity-mapped in the decompressor,
> and that means that if 48-bit addressing is used, physical memory may extend
> past the addressable range.

Yeah, we will probably need that too.

Btw, looka here - it can't get any more obvious than that after dumping
setup_data too:

early console in setup code
early console in extract_kernel
input_data: 0x00000000040f92bf
input_len: 0x0000000000f1c325
output: 0x0000000001000000
output_len: 0x0000000003c5e7d8
kernel_total_size: 0x0000000004428000
needed_size: 0x0000000004600000
boot_params->hdr.setup_data: 0x00000000010203b0
trampoline_32bit: 0x000000000009d000

Decompressing Linux... Parsing ELF... done.
Booting the kernel.
<EOF>

Aligning them vertically:

output: 0x0000000001000000
output_len: 0x0000000003c5e7d8
kernel_total_size: 0x0000000004428000
needed_size: 0x0000000004600000
boot_params->hdr.setup_data: 0x00000000010203b0

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-02 06:52:58

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Mon, Jan 02, 2023 at 07:01:50AM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> > It would probably be a good idea to add a "maximum physical address for
> > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> > right now that those fields are being identity-mapped in the decompressor,
> > and that means that if 48-bit addressing is used, physical memory may extend
> > past the addressable range.
>
> Yeah, we will probably need that too.
>
> Btw, looka here - it can't get any more obvious than that after dumping
> setup_data too:
>
> early console in setup code
> early console in extract_kernel
> input_data: 0x00000000040f92bf
> input_len: 0x0000000000f1c325
> output: 0x0000000001000000
> output_len: 0x0000000003c5e7d8
> kernel_total_size: 0x0000000004428000
> needed_size: 0x0000000004600000
> boot_params->hdr.setup_data: 0x00000000010203b0
> trampoline_32bit: 0x000000000009d000
>
> Decompressing Linux... Parsing ELF... done.
> Booting the kernel.
> <EOF>
>
> Aligning them vertically:
>
> output: 0x0000000001000000
> output_len: 0x0000000003c5e7d8
> kernel_total_size: 0x0000000004428000
> needed_size: 0x0000000004600000
> boot_params->hdr.setup_data: 0x00000000010203b0

Ok, waait a minute:

============ ============
Field name: pref_address
Type: read (reloc)
Offset/size: 0x258/8
Protocol: 2.10+
============ ============

This field, if nonzero, represents a preferred load address for the
kernel. A relocating bootloader should attempt to load at this
address if possible.

A non-relocatable kernel will unconditionally move itself and to run
at this address.

so a kernel loader (qemu in this case) already knows where the kernel goes:

boot_params->hdr.setup_data: 0x0000000001020450
boot_params->hdr.pref_address: 0x0000000001000000
^^^^^^^^^^^^^^^^^

now, considering that same kernel loader (qemu) knows how big that kernel is:

kernel_total_size: 0x0000000004428000

should that loader *not* put anything that the kernel will use in the range

pref_addr + kernel_total_size

?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-02 09:37:56

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Mon, 2 Jan 2023 at 07:17, Borislav Petkov <[email protected]> wrote:
>
> On Mon, Jan 02, 2023 at 07:01:50AM +0100, Borislav Petkov wrote:
> > On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> > > It would probably be a good idea to add a "maximum physical address for
> > > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> > > right now that those fields are being identity-mapped in the decompressor,
> > > and that means that if 48-bit addressing is used, physical memory may extend
> > > past the addressable range.
> >
> > Yeah, we will probably need that too.
> >
> > Btw, looka here - it can't get any more obvious than that after dumping
> > setup_data too:
> >
> > early console in setup code
> > early console in extract_kernel
> > input_data: 0x00000000040f92bf
> > input_len: 0x0000000000f1c325
> > output: 0x0000000001000000
> > output_len: 0x0000000003c5e7d8
> > kernel_total_size: 0x0000000004428000
> > needed_size: 0x0000000004600000
> > boot_params->hdr.setup_data: 0x00000000010203b0
> > trampoline_32bit: 0x000000000009d000
> >
> > Decompressing Linux... Parsing ELF... done.
> > Booting the kernel.
> > <EOF>
> >
> > Aligning them vertically:
> >
> > output: 0x0000000001000000
> > output_len: 0x0000000003c5e7d8
> > kernel_total_size: 0x0000000004428000
> > needed_size: 0x0000000004600000
> > boot_params->hdr.setup_data: 0x00000000010203b0
>
> Ok, waait a minute:
>
> ============ ============
> Field name: pref_address
> Type: read (reloc)
> Offset/size: 0x258/8
> Protocol: 2.10+
> ============ ============
>
> This field, if nonzero, represents a preferred load address for the
> kernel. A relocating bootloader should attempt to load at this
> address if possible.
>
> A non-relocatable kernel will unconditionally move itself and to run
> at this address.
>
> so a kernel loader (qemu in this case) already knows where the kernel goes:
>
> boot_params->hdr.setup_data: 0x0000000001020450
> boot_params->hdr.pref_address: 0x0000000001000000
> ^^^^^^^^^^^^^^^^^
>
> now, considering that same kernel loader (qemu) knows how big that kernel is:
>
> kernel_total_size: 0x0000000004428000
>
> should that loader *not* put anything that the kernel will use in the range
>
> pref_addr + kernel_total_size
>

This seems to be related to another issue that was discussed in the
context of this change, but affecting EFI boot not legacy BIOS boot
[0].

So, in a nutshell, we have the following pieces:
- QEMU, which manages a directory of files and other data blobs, and
exposes them via its fw_cfg interface.
- SeaBIOS, which invokes the fw_cfg interface to load the 'kernel'
blob at its preferred address
- The boot code in the kernel, which interprets the various fields in
the setup header to figure out where the compressed image lives etc

So the problem here, which applies to SETUP_DTB as well as
SETUP_RNG_SEED, is that the internal file representation of the kernel
blob (which does not have an absolute address at this point, it's just
a file in the fw_cfg filesystem) is augmented with:
1) setup_data linked-list entries carrying absolute addresses that are
assumed to be valid once SeaBIOS loads the file to memory
2) DTB and/or RNG seed blobs appended to the compressed 'kernel' blob,
but without updating that file's internal metadata

Issue 1) is what broke EFI boot, given that EFI interprets the kernel
blob as a PE/COFF image and hands it to the Loadimage() boot service,
which has no awareness of boot_params or setup_data and so just
ignores it and loads the image at an arbitrary address, resulting in
setup_data absolute address values pointing to bogus places.

It seems that now, we have another issue 2), where the fw_cfg view of
the file size goes out of sync with the compressed image's own view of
its size.

As a fix for issue 1), we explored another solution, which was to
allocate fixed areas in memory for the RNG seed, so that the absolute
address added to setup_data is guaranteed to be correct regardless of
where the compressed image is loaded, but that was shot down for other
reasons, and we ended up enabling this feature only for legacy BIOS
boot. But apparently, this approach has other issues so perhaps it is
better to revisit that solution again.

So instead of appending data to the compressed image and assuming that
it will stay in place, create or extend a memory reservation
elsewhere, and refer to its absolute address in setup_data.

--
Ard.


[0] https://lore.kernel.org/all/CAMj1kXFr6Bv4_G0-wCTu4fp_iCrG060NHJx_j2dbnyiFJKYYeQ@mail.gmail.com/

2023-01-02 13:41:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Mon, Jan 02, 2023 at 10:32:03AM +0100, Ard Biesheuvel wrote:
> So instead of appending data to the compressed image and assuming that
> it will stay in place, create or extend a memory reservation
> elsewhere, and refer to its absolute address in setup_data.

From my limited experience with all those boot protocols, I'd say hardcoding
stuff is always a bad idea. But, we already more or less hardcode, or rather
codify through the setup header contract how stuff needs to get accessed.

And yeah, maybe specifying an absolute address and size for a blob of data and
putting that address and size in the setup header so that all the parties
involved are where what is, is probably better.

But WTH do I know...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-01-02 15:15:34

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH qemu] x86: don't let decompressed kernel image clobber setup_data

On Mon, 2 Jan 2023 at 14:37, Borislav Petkov <[email protected]> wrote:
>
> On Mon, Jan 02, 2023 at 10:32:03AM +0100, Ard Biesheuvel wrote:
> > So instead of appending data to the compressed image and assuming that
> > it will stay in place, create or extend a memory reservation
> > elsewhere, and refer to its absolute address in setup_data.
>
> From my limited experience with all those boot protocols, I'd say hardcoding
> stuff is always a bad idea. But, we already more or less hardcode, or rather
> codify through the setup header contract how stuff needs to get accessed.
>
> And yeah, maybe specifying an absolute address and size for a blob of data and
> putting that address and size in the setup header so that all the parties
> involved are where what is, is probably better.
>

Exactly. In the EFI case, this was problematic because we would need
to introduce a new way to pass memory reservations between QEMU and
the firmware. But I don't think that issue should affect legacy BIOS
boot, and we could just reserve the memory in the E820 table AFAIK.