The setup_data links are appended to the compressed kernel image. Since
the kernel image is typically loaded at 0x100000, setup_data lives at
`0x100000 + compressed_size`, which does not get relocated during the
kernel's boot process.
The kernel typically decompresses the image starting at address
0x1000000 (note: there's one more zero there than the compressed image
above). This usually is fine for most kernels.
However, if the compressed image is actually quite large, then
setup_data will live at a `0x100000 + compressed_size` that extends into
the decompressed zone at 0x1000000. In other words, if compressed_size
is larger than `0x1000000 - 0x100000`, then the decompression step will
clobber setup_data, resulting in crashes.
Visually, what happens now is that QEMU appends setup_data to the kernel
image:
kernel image setup_data
|--------------------------||----------------|
0x100000 0x100000+l1 0x100000+l1+l2
The problem is that this decompresses to 0x1000000 (one more zero). So
if l1 is > (0x1000000-0x100000), then this winds up looking like:
kernel image setup_data
|--------------------------||----------------|
0x100000 0x100000+l1 0x100000+l1+l2
d e c o m p r e s s e d k e r n e l
|-------------------------------------------------------------|
0x1000000 0x1000000+l3
The decompressed kernel seemingly overwriting the compressed kernel
image isn't a problem, because that gets relocated to a higher address
early on in the boot process, at the end of startup_64. setup_data,
however, stays in the same place, since those links are self referential
and nothing fixes them up. So the decompressed kernel clobbers it.
Fix this by appending setup_data to the cmdline blob rather than the
kernel image blob, which remains at a lower address that won't get
clobbered.
This could have been done by overwriting the initrd blob instead, but
that poses big difficulties, such as no longer being able to use memory
mapped files for initrd, hurting performance, and, more importantly, the
initrd address calculation is hard coded in qboot, and it always grows
down rather than up, which means lots of brittle semantics would have to
be changed around, incurring more complexity. In contrast, using cmdline
is simple and doesn't interfere with anything.
The microvm machine has a gross hack where it fiddles with fw_cfg data
after the fact. So this hack is updated to account for this appending,
by reserving some bytes.
Cc: [email protected]
Cc: Philippe Mathieu-Daudé <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Eric Biggers <[email protected]>
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
Changes v2->v3:
- Fix mistakes in string handling.
Changes v1->v2:
- Append setup_data to cmdline instead of kernel image.
hw/i386/microvm.c | 13 ++++++----
hw/i386/x86.c | 50 +++++++++++++++++++--------------------
hw/nvram/fw_cfg.c | 9 +++++++
include/hw/i386/microvm.h | 5 ++--
include/hw/nvram/fw_cfg.h | 9 +++++++
5 files changed, 54 insertions(+), 32 deletions(-)
diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
index 170a331e3f..1b19d28c02 100644
--- a/hw/i386/microvm.c
+++ b/hw/i386/microvm.c
@@ -378,7 +378,8 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
MicrovmMachineState *mms = MICROVM_MACHINE(machine);
BusState *bus;
BusChild *kid;
- char *cmdline;
+ char *cmdline, *existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA);
+ size_t len;
/*
* Find MMIO transports with attached devices, and add them to the kernel
@@ -387,7 +388,7 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
* Yes, this is a hack, but one that heavily improves the UX without
* introducing any significant issues.
*/
- cmdline = g_strdup(machine->kernel_cmdline);
+ cmdline = g_strdup(existing_cmdline);
bus = sysbus_get_default();
QTAILQ_FOREACH(kid, &bus->children, sibling) {
DeviceState *dev = kid->child;
@@ -411,9 +412,11 @@ static void microvm_fix_kernel_cmdline(MachineState *machine)
}
}
- fw_cfg_modify_i32(x86ms->fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(cmdline) + 1);
- fw_cfg_modify_string(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA, cmdline);
-
+ len = strlen(cmdline);
+ if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline))
+ fprintf(stderr, "qemu: virtio mmio cmdline too large, skipping\n");
+ else
+ memcpy(existing_cmdline, cmdline, len + 1);
g_free(cmdline);
}
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 78cc131926..b57a993596 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -50,6 +50,7 @@
#include "hw/intc/i8259.h"
#include "hw/rtc/mc146818rtc.h"
#include "target/i386/sev.h"
+#include "hw/i386/microvm.h"
#include "hw/acpi/cpu_hotplug.h"
#include "hw/irq.h"
@@ -802,7 +803,7 @@ void x86_load_linux(X86MachineState *x86ms,
bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
uint16_t protocol;
int setup_size, kernel_size, cmdline_size;
- int dtb_size, setup_data_offset;
+ int dtb_size;
uint32_t initrd_max;
uint8_t header[8192], *setup, *kernel;
hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
@@ -813,12 +814,16 @@ void x86_load_linux(X86MachineState *x86ms,
const char *kernel_filename = machine->kernel_filename;
const char *initrd_filename = machine->initrd_filename;
const char *dtb_filename = machine->dtb;
- const char *kernel_cmdline = machine->kernel_cmdline;
+ char *kernel_cmdline;
SevKernelLoaderContext sev_load_ctx = {};
enum { RNG_SEED_LENGTH = 32 };
- /* Align to 16 bytes as a paranoia measure */
- cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
+ /* Add the NUL terminator, some padding for the microvm cmdline fiddling
+ * hack, and then align to 16 bytes as a paranoia measure */
+ cmdline_size = (strlen(machine->kernel_cmdline) + 1 +
+ VIRTIO_CMDLINE_TOTAL_MAX_LEN + 16) & ~15;
+ /* Make a copy, since we might append arbitrary bytes to it later. */
+ kernel_cmdline = g_strndup(machine->kernel_cmdline, cmdline_size);
/* load the kernel header */
f = fopen(kernel_filename, "rb");
@@ -959,12 +964,6 @@ void x86_load_linux(X86MachineState *x86ms,
initrd_max = x86ms->below_4g_mem_size - acpi_data_size - 1;
}
- fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
- fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(kernel_cmdline) + 1);
- fw_cfg_add_string(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline);
- sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
- sev_load_ctx.cmdline_size = strlen(kernel_cmdline) + 1;
-
if (protocol >= 0x202) {
stl_p(header + 0x228, cmdline_addr);
} else {
@@ -1091,27 +1090,22 @@ void x86_load_linux(X86MachineState *x86ms,
exit(1);
}
- setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
- kernel_size = setup_data_offset + sizeof(SetupData) + dtb_size;
- kernel = g_realloc(kernel, kernel_size);
-
-
- setup_data = (SetupData *)(kernel + setup_data_offset);
+ cmdline_size += sizeof(SetupData) + dtb_size;
+ kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
+ setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + dtb_size);
setup_data->next = cpu_to_le64(first_setup_data);
- first_setup_data = prot_addr + setup_data_offset;
+ first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline);
setup_data->type = cpu_to_le32(SETUP_DTB);
setup_data->len = cpu_to_le32(dtb_size);
-
load_image_size(dtb_filename, setup_data->data, dtb_size);
}
- if (!legacy_no_rng_seed) {
- setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
- kernel_size = setup_data_offset + sizeof(SetupData) + RNG_SEED_LENGTH;
- kernel = g_realloc(kernel, kernel_size);
- setup_data = (SetupData *)(kernel + setup_data_offset);
+ if (!legacy_no_rng_seed && protocol >= 0x209) {
+ cmdline_size += sizeof(SetupData) + RNG_SEED_LENGTH;
+ kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
+ setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + RNG_SEED_LENGTH);
setup_data->next = cpu_to_le64(first_setup_data);
- first_setup_data = prot_addr + setup_data_offset;
+ first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline);
setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
@@ -1122,6 +1116,12 @@ void x86_load_linux(X86MachineState *x86ms,
fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
}
+ fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
+ fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, cmdline_size);
+ fw_cfg_add_bytes(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline, cmdline_size);
+ sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
+ sev_load_ctx.cmdline_size = cmdline_size;
+
fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
sev_load_ctx.kernel_data = (char *)kernel;
@@ -1134,7 +1134,7 @@ void x86_load_linux(X86MachineState *x86ms,
* kernel on the other side of the fw_cfg interface matches the hash of the
* file the user passed in.
*/
- if (!sev_enabled()) {
+ if (!sev_enabled() && first_setup_data) {
SetupDataFixup *fixup = g_malloc(sizeof(*fixup));
memcpy(setup, header, MIN(sizeof(header), setup_size));
diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index a00881bc64..432754eda4 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void *data, size_t len)
fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true);
}
+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key)
+{
+ int arch = !!(key & FW_CFG_ARCH_LOCAL);
+
+ key &= FW_CFG_ENTRY_MASK;
+ assert(key < fw_cfg_max_entry(s));
+ return s->entries[arch][key].data;
+}
+
void fw_cfg_add_string(FWCfgState *s, uint16_t key, const char *value)
{
size_t sz = strlen(value) + 1;
diff --git a/include/hw/i386/microvm.h b/include/hw/i386/microvm.h
index fad97a891d..e8af61f194 100644
--- a/include/hw/i386/microvm.h
+++ b/include/hw/i386/microvm.h
@@ -50,8 +50,9 @@
*/
/* Platform virtio definitions */
-#define VIRTIO_MMIO_BASE 0xfeb00000
-#define VIRTIO_CMDLINE_MAXLEN 64
+#define VIRTIO_MMIO_BASE 0xfeb00000
+#define VIRTIO_CMDLINE_MAXLEN 64
+#define VIRTIO_CMDLINE_TOTAL_MAX_LEN ((VIRTIO_CMDLINE_MAXLEN + 1) * 16)
#define GED_MMIO_BASE 0xfea00000
#define GED_MMIO_BASE_MEMHP (GED_MMIO_BASE + 0x100)
diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
index 2e503904dc..990dcdbb2e 100644
--- a/include/hw/nvram/fw_cfg.h
+++ b/include/hw/nvram/fw_cfg.h
@@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t key,
void *data, size_t len,
bool read_only);
+/**
+ * fw_cfg_read_bytes_ptr:
+ * @s: fw_cfg device being modified
+ * @key: selector key value for new fw_cfg item
+ *
+ * Reads an existing fw_cfg data pointer.
+ */
+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key);
+
/**
* fw_cfg_add_string:
* @s: fw_cfg device being modified
--
2.39.0
On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
>
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
>
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
>
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up. So the decompressed kernel clobbers it.
>
> Fix this by appending setup_data to the cmdline blob rather than the
> kernel image blob, which remains at a lower address that won't get
> clobbered.
>
> This could have been done by overwriting the initrd blob instead, but
> that poses big difficulties, such as no longer being able to use memory
> mapped files for initrd, hurting performance, and, more importantly, the
> initrd address calculation is hard coded in qboot, and it always grows
> down rather than up, which means lots of brittle semantics would have to
> be changed around, incurring more complexity. In contrast, using cmdline
> is simple and doesn't interfere with anything.
>
> The microvm machine has a gross hack where it fiddles with fw_cfg data
> after the fact. So this hack is updated to account for this appending,
> by reserving some bytes.
>
> Cc: [email protected]
> Cc: Philippe Mathieu-Daud? <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Eric Biggers <[email protected]>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
For what it's worth:
Tested-by: Eric Biggers <[email protected]>
- Eric
Hi Jason!
Am 30.12.22 um 23:07 schrieb Jason A. Donenfeld:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
>
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
>
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
>
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up. So the decompressed kernel clobbers it.
I just ran into this very issue yesterday when trying to boot a 6.1
kernel. pipacs pointed me to some changes of yours[1] which confirmed,
the issue is related to the additional setup_data entries, as adding,
e.g., '-M pc-i440fx-7.0' to the QEMU command line made the bug vanish
(as QEMU then omits adding the random seed setup_data entries) .
[1] https://github.com/qemu/qemu/commit/67f7e426e538
After digging a while I found this thread and it fixes the issue for me,
thereby:
Tested-by: Mathias Krause <[email protected]>
Thanks,
Mathias
> [snip]
Hi Michael,
Could you queue up this patch and mark it as a fix for 7.2.1? It is a
straight-up bug fix for a 7.2 regression that's now affected several
users.
- It has two Tested-by tags on the thread.
- hpa, the maintainer of the kernel side of this, confirmed on one of
the various tributary threads that this approach is a correct one.
- It doesn't introduce any new functionality.
For your convenience, you can grab this out of lore here:
https://lore.kernel.org/lkml/[email protected]/
Or if you want to yolo it:
curl https://lore.kernel.org/lkml/[email protected]/raw | git am -s
It's now sat silent on the mailing list for a while. So let's please get
this committed and backported so that the bug reports stop coming in.
Thanks,
Jason
On 30/12/22 23:07, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
>
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
>
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
>
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up. So the decompressed kernel clobbers it.
>
> Fix this by appending setup_data to the cmdline blob rather than the
> kernel image blob, which remains at a lower address that won't get
> clobbered.
>
> This could have been done by overwriting the initrd blob instead, but
> that poses big difficulties, such as no longer being able to use memory
> mapped files for initrd, hurting performance, and, more importantly, the
> initrd address calculation is hard coded in qboot, and it always grows
> down rather than up, which means lots of brittle semantics would have to
> be changed around, incurring more complexity. In contrast, using cmdline
> is simple and doesn't interfere with anything.
>
> The microvm machine has a gross hack where it fiddles with fw_cfg data
> after the fact. So this hack is updated to account for this appending,
> by reserving some bytes.
>
> Cc: [email protected]
> Cc: Philippe Mathieu-Daudé <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Eric Biggers <[email protected]>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> Changes v2->v3:
> - Fix mistakes in string handling.
> Changes v1->v2:
> - Append setup_data to cmdline instead of kernel image.
>
> hw/i386/microvm.c | 13 ++++++----
> hw/i386/x86.c | 50 +++++++++++++++++++--------------------
> hw/nvram/fw_cfg.c | 9 +++++++
> include/hw/i386/microvm.h | 5 ++--
> include/hw/nvram/fw_cfg.h | 9 +++++++
> 5 files changed, 54 insertions(+), 32 deletions(-)
> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
> index a00881bc64..432754eda4 100644
> --- a/hw/nvram/fw_cfg.c
> +++ b/hw/nvram/fw_cfg.c
> @@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void *data, size_t len)
> fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true);
> }
>
> +void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key)
> +{
> + int arch = !!(key & FW_CFG_ARCH_LOCAL);
> +
> + key &= FW_CFG_ENTRY_MASK;
> + assert(key < fw_cfg_max_entry(s));
> + return s->entries[arch][key].data;
Shouldn't it be safer to provide a size argument, and return
NULL if s->entries[arch][key].len < size?
Maybe this API should return a (casted) const pointer, so the
only way to update the key is via fw_cfg_add_bytes().
> +}
> +
> diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
> index 2e503904dc..990dcdbb2e 100644
> --- a/include/hw/nvram/fw_cfg.h
> +++ b/include/hw/nvram/fw_cfg.h
> @@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t key,
> void *data, size_t len,
> bool read_only);
>
> +/**
> + * fw_cfg_read_bytes_ptr:
> + * @s: fw_cfg device being modified
> + * @key: selector key value for new fw_cfg item
> + *
> + * Reads an existing fw_cfg data pointer.
> + */
> +void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key);
> +
> /**
> * fw_cfg_add_string:
> * @s: fw_cfg device being modified
Hi Jason,
On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
>
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the compressed image
> above). This usually is fine for most kernels.
>
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
>
> Visually, what happens now is that QEMU appends setup_data to the kernel
> image:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> The problem is that this decompresses to 0x1000000 (one more zero). So
> if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
> kernel image setup_data
> |--------------------------||----------------|
> 0x100000 0x100000+l1 0x100000+l1+l2
>
> d e c o m p r e s s e d k e r n e l
> |-------------------------------------------------------------|
> 0x1000000 0x1000000+l3
>
> The decompressed kernel seemingly overwriting the compressed kernel
> image isn't a problem, because that gets relocated to a higher address
> early on in the boot process, at the end of startup_64. setup_data,
> however, stays in the same place, since those links are self referential
> and nothing fixes them up. So the decompressed kernel clobbers it.
>
> Fix this by appending setup_data to the cmdline blob rather than the
> kernel image blob, which remains at a lower address that won't get
> clobbered.
>
> This could have been done by overwriting the initrd blob instead, but
> that poses big difficulties, such as no longer being able to use memory
> mapped files for initrd, hurting performance, and, more importantly, the
> initrd address calculation is hard coded in qboot, and it always grows
> down rather than up, which means lots of brittle semantics would have to
> be changed around, incurring more complexity. In contrast, using cmdline
> is simple and doesn't interfere with anything.
>
> The microvm machine has a gross hack where it fiddles with fw_cfg data
> after the fact. So this hack is updated to account for this appending,
> by reserving some bytes.
>
> Cc: [email protected]
> Cc: Philippe Mathieu-Daud? <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Eric Biggers <[email protected]>
> Signed-off-by: Jason A. Donenfeld <[email protected]>
I apologize if this has already been reported/fixed already (I did a
brief search on lore.kernel.org) or if my terminology is not as precise
as it could be, I am a little out of my element here :)
After this change as commit eac7a7791b ("x86: don't let decompressed
kernel image clobber setup_data") in QEMU master, I am no longer able to
boot a kernel directly through OVMF using '-append' + '-initrd' +
'-kernel'. Instead, systemd-boot tries to start the distribution's
kernel, which fails with:
Error registering initrd: Already started
This can be reproduced with just a defconfig Linux kernel (I used
6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
boot testing [1], and OVMF. Prior to this change, the kernel would start
up but after, I am dumped into the UEFI shell (as there is no
bootloader).
The QEMU command I used was:
$ qemu-system-x86_64 \
-kernel arch/x86_64/boot/bzImage \
-append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
-drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
-drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
-object rng-random,filename=/dev/urandom,id=rng0 \
-device virtio-rng-pci \
-cpu host \
-enable-kvm \
-smp 8 \
-display none \
-initrd .../boot-utils/images/x86_64/rootfs.cpio \
-m 512m \
-nodefaults \
-no-reboot \
-serial mon:stdio
Not sure if the OVMF version will matter but just in case:
$ pacman -Q edk2-ovmf
edk2-ovmf 202211-3
I did see a few fixes on qemu-devel for this patch such as [2] but none
I found fixed the issue for me.
If there is any additional information I can provide or patches I can
test, please let me know.
[1]: https://github.com/ClangBuiltLinux/boot-utils/blob/fec03ef13519e26ac1f226e404e1620a142d0e40/images/x86_64/rootfs.cpio.zst
[2]: https://lore.kernel.org/[email protected]/
Cheers,
Nathan
# bad: [969d09c3a6186c0a4bc8a41db0c1aba1c76081fc] Merge tag 'pull-aspeed-20230207' of https://github.com/legoater/qemu into staging
# good: [b67b00e6b4c7831a3f5bc684bc0df7a9bfd1bd56] Update VERSION for v7.2.0
git bisect start '969d09c3a6186c0a4bc8a41db0c1aba1c76081fc' 'b67b00e6b4c7831a3f5bc684bc0df7a9bfd1bd56'
# good: [38cb336fe9d47507da5177c3d349532d0af6885e] hw/arm/gumstix: Use the IEC binary prefix definitions
git bisect good 38cb336fe9d47507da5177c3d349532d0af6885e
# bad: [864a3fa439276148b6d7abcf2d43ee8dbe4c4062] monitor: Rename misc.c to hmp-target.c
git bisect bad 864a3fa439276148b6d7abcf2d43ee8dbe4c4062
# good: [e471a8c9850f1af0c1bc5768ca28285348cdd6c5] target/riscv: Trap on writes to stimecmp from VS when hvictl.VTI=1
git bisect good e471a8c9850f1af0c1bc5768ca28285348cdd6c5
# bad: [c1a9ac9bdeea213162a76f7b9e2f876c89a50a94] tests: acpi: cleanup use_uefi argument usage
git bisect bad c1a9ac9bdeea213162a76f7b9e2f876c89a50a94
# good: [dc575b5e0300a7a375b4e4501a17ada21e9a6d10] hw/i2c/bitbang_i2c: Change state calling bitbang_i2c_set_state() helper
git bisect good dc575b5e0300a7a375b4e4501a17ada21e9a6d10
# good: [3b07a936d3bfe97b07ddffcfbb532985a88033dd] target/arm: Look up ARMCPRegInfo at runtime
git bisect good 3b07a936d3bfe97b07ddffcfbb532985a88033dd
# good: [d395b18dce82855d03d934e30a515caf5a10a885] hw/acpi/acpi_dev_interface: Remove unused parameter from AcpiDeviceIfClass::madt_cpu
git bisect good d395b18dce82855d03d934e30a515caf5a10a885
# bad: [eac7a7791bb6d719233deed750034042318ffd56] x86: don't let decompressed kernel image clobber setup_data
git bisect bad eac7a7791bb6d719233deed750034042318ffd56
# good: [8a8c9c3a747f77e664fa2288735b45a9d750be75] hw/pci-host: Use register definitions from PCI standard
git bisect good 8a8c9c3a747f77e664fa2288735b45a9d750be75
# good: [8a7c606016d283a1716290c657f6f45bc7c4d817] intel-iommu: Document iova_tree
git bisect good 8a7c606016d283a1716290c657f6f45bc7c4d817
# first bad commit: [eac7a7791bb6d719233deed750034042318ffd56] x86: don't let decompressed kernel image clobber setup_data
Hi Nathan (and MST),
On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <[email protected]> wrote:
>
> Hi Jason,
>
> On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > The setup_data links are appended to the compressed kernel image. Since
> > the kernel image is typically loaded at 0x100000, setup_data lives at
> > `0x100000 + compressed_size`, which does not get relocated during the
> > kernel's boot process.
> >
> > The kernel typically decompresses the image starting at address
> > 0x1000000 (note: there's one more zero there than the compressed image
> > above). This usually is fine for most kernels.
> >
> > However, if the compressed image is actually quite large, then
> > setup_data will live at a `0x100000 + compressed_size` that extends into
> > the decompressed zone at 0x1000000. In other words, if compressed_size
> > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > clobber setup_data, resulting in crashes.
> >
> > Visually, what happens now is that QEMU appends setup_data to the kernel
> > image:
> >
> > kernel image setup_data
> > |--------------------------||----------------|
> > 0x100000 0x100000+l1 0x100000+l1+l2
> >
> > The problem is that this decompresses to 0x1000000 (one more zero). So
> > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> >
> > kernel image setup_data
> > |--------------------------||----------------|
> > 0x100000 0x100000+l1 0x100000+l1+l2
> >
> > d e c o m p r e s s e d k e r n e l
> > |-------------------------------------------------------------|
> > 0x1000000 0x1000000+l3
> >
> > The decompressed kernel seemingly overwriting the compressed kernel
> > image isn't a problem, because that gets relocated to a higher address
> > early on in the boot process, at the end of startup_64. setup_data,
> > however, stays in the same place, since those links are self referential
> > and nothing fixes them up. So the decompressed kernel clobbers it.
> >
> > Fix this by appending setup_data to the cmdline blob rather than the
> > kernel image blob, which remains at a lower address that won't get
> > clobbered.
> >
> > This could have been done by overwriting the initrd blob instead, but
> > that poses big difficulties, such as no longer being able to use memory
> > mapped files for initrd, hurting performance, and, more importantly, the
> > initrd address calculation is hard coded in qboot, and it always grows
> > down rather than up, which means lots of brittle semantics would have to
> > be changed around, incurring more complexity. In contrast, using cmdline
> > is simple and doesn't interfere with anything.
> >
> > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > after the fact. So this hack is updated to account for this appending,
> > by reserving some bytes.
> >
> > Cc: [email protected]
> > Cc: Philippe Mathieu-Daudé <[email protected]>
> > Cc: H. Peter Anvin <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Eric Biggers <[email protected]>
> > Signed-off-by: Jason A. Donenfeld <[email protected]>
>
> I apologize if this has already been reported/fixed already (I did a
> brief search on lore.kernel.org) or if my terminology is not as precise
> as it could be, I am a little out of my element here :)
>
> After this change as commit eac7a7791b ("x86: don't let decompressed
> kernel image clobber setup_data") in QEMU master, I am no longer able to
> boot a kernel directly through OVMF using '-append' + '-initrd' +
> '-kernel'. Instead, systemd-boot tries to start the distribution's
> kernel, which fails with:
>
> Error registering initrd: Already started
>
> This can be reproduced with just a defconfig Linux kernel (I used
> 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> boot testing [1], and OVMF. Prior to this change, the kernel would start
> up but after, I am dumped into the UEFI shell (as there is no
> bootloader).
>
> The QEMU command I used was:
>
> $ qemu-system-x86_64 \
> -kernel arch/x86_64/boot/bzImage \
> -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
> -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
> -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
Oh no... Without jumping into it, at first glance, I have absolutely
no idea. I suppose I could start debugging it and probably come up
with a solution, but...
@mst - I'm beginning to think that this whole setup_data route is
cursed. This is accumulating hacks within hacks within hacks. What
would you think if I just send a patch *removing* all use of
setup_data (the rng seed and the dtb thing), and then we can gradually
add that back with an actual overarching design. For example, it'd
probably make sense to have a separate fwcfg file for setup_data
rather than trying to mangle and existing one, etc. This way, we
unbreak the tree, and let the new approach be reviewed more
reasonably.
Jason
On Wed, Feb 8, 2023 at 2:54 PM Jason A. Donenfeld <[email protected]> wrote:
>
> Hi Nathan (and MST),
>
> On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <[email protected]> wrote:
> >
> > Hi Jason,
> >
> > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > > The setup_data links are appended to the compressed kernel image. Since
> > > the kernel image is typically loaded at 0x100000, setup_data lives at
> > > `0x100000 + compressed_size`, which does not get relocated during the
> > > kernel's boot process.
> > >
> > > The kernel typically decompresses the image starting at address
> > > 0x1000000 (note: there's one more zero there than the compressed image
> > > above). This usually is fine for most kernels.
> > >
> > > However, if the compressed image is actually quite large, then
> > > setup_data will live at a `0x100000 + compressed_size` that extends into
> > > the decompressed zone at 0x1000000. In other words, if compressed_size
> > > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > > clobber setup_data, resulting in crashes.
> > >
> > > Visually, what happens now is that QEMU appends setup_data to the kernel
> > > image:
> > >
> > > kernel image setup_data
> > > |--------------------------||----------------|
> > > 0x100000 0x100000+l1 0x100000+l1+l2
> > >
> > > The problem is that this decompresses to 0x1000000 (one more zero). So
> > > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> > >
> > > kernel image setup_data
> > > |--------------------------||----------------|
> > > 0x100000 0x100000+l1 0x100000+l1+l2
> > >
> > > d e c o m p r e s s e d k e r n e l
> > > |-------------------------------------------------------------|
> > > 0x1000000 0x1000000+l3
> > >
> > > The decompressed kernel seemingly overwriting the compressed kernel
> > > image isn't a problem, because that gets relocated to a higher address
> > > early on in the boot process, at the end of startup_64. setup_data,
> > > however, stays in the same place, since those links are self referential
> > > and nothing fixes them up. So the decompressed kernel clobbers it.
> > >
> > > Fix this by appending setup_data to the cmdline blob rather than the
> > > kernel image blob, which remains at a lower address that won't get
> > > clobbered.
> > >
> > > This could have been done by overwriting the initrd blob instead, but
> > > that poses big difficulties, such as no longer being able to use memory
> > > mapped files for initrd, hurting performance, and, more importantly, the
> > > initrd address calculation is hard coded in qboot, and it always grows
> > > down rather than up, which means lots of brittle semantics would have to
> > > be changed around, incurring more complexity. In contrast, using cmdline
> > > is simple and doesn't interfere with anything.
> > >
> > > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > > after the fact. So this hack is updated to account for this appending,
> > > by reserving some bytes.
> > >
> > > Cc: [email protected]
> > > Cc: Philippe Mathieu-Daudé <[email protected]>
> > > Cc: H. Peter Anvin <[email protected]>
> > > Cc: Borislav Petkov <[email protected]>
> > > Cc: Eric Biggers <[email protected]>
> > > Signed-off-by: Jason A. Donenfeld <[email protected]>
> >
> > I apologize if this has already been reported/fixed already (I did a
> > brief search on lore.kernel.org) or if my terminology is not as precise
> > as it could be, I am a little out of my element here :)
> >
> > After this change as commit eac7a7791b ("x86: don't let decompressed
> > kernel image clobber setup_data") in QEMU master, I am no longer able to
> > boot a kernel directly through OVMF using '-append' + '-initrd' +
> > '-kernel'. Instead, systemd-boot tries to start the distribution's
> > kernel, which fails with:
> >
> > Error registering initrd: Already started
> >
> > This can be reproduced with just a defconfig Linux kernel (I used
> > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> > boot testing [1], and OVMF. Prior to this change, the kernel would start
> > up but after, I am dumped into the UEFI shell (as there is no
> > bootloader).
> >
> > The QEMU command I used was:
> >
> > $ qemu-system-x86_64 \
> > -kernel arch/x86_64/boot/bzImage \
> > -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
> > -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
> > -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
>
> Oh no... Without jumping into it, at first glance, I have absolutely
> no idea. I suppose I could start debugging it and probably come up
> with a solution, but...
>
> @mst - I'm beginning to think that this whole setup_data route is
> cursed. This is accumulating hacks within hacks within hacks. What
> would you think if I just send a patch *removing* all use of
> setup_data (the rng seed and the dtb thing), and then we can gradually
> add that back with an actual overarching design. For example, it'd
> probably make sense to have a separate fwcfg file for setup_data
> rather than trying to mangle and existing one, etc. This way, we
> unbreak the tree, and let the new approach be reviewed more
> reasonably.
Sent as https://lore.kernel.org/qemu-devel/[email protected]/
On Wed, Feb 08, 2023 at 02:54:41PM -0300, Jason A. Donenfeld wrote:
> Hi Nathan (and MST),
>
> On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <[email protected]> wrote:
> >
> > Hi Jason,
> >
> > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote:
> > > The setup_data links are appended to the compressed kernel image. Since
> > > the kernel image is typically loaded at 0x100000, setup_data lives at
> > > `0x100000 + compressed_size`, which does not get relocated during the
> > > kernel's boot process.
> > >
> > > The kernel typically decompresses the image starting at address
> > > 0x1000000 (note: there's one more zero there than the compressed image
> > > above). This usually is fine for most kernels.
> > >
> > > However, if the compressed image is actually quite large, then
> > > setup_data will live at a `0x100000 + compressed_size` that extends into
> > > the decompressed zone at 0x1000000. In other words, if compressed_size
> > > is larger than `0x1000000 - 0x100000`, then the decompression step will
> > > clobber setup_data, resulting in crashes.
> > >
> > > Visually, what happens now is that QEMU appends setup_data to the kernel
> > > image:
> > >
> > > kernel image setup_data
> > > |--------------------------||----------------|
> > > 0x100000 0x100000+l1 0x100000+l1+l2
> > >
> > > The problem is that this decompresses to 0x1000000 (one more zero). So
> > > if l1 is > (0x1000000-0x100000), then this winds up looking like:
> > >
> > > kernel image setup_data
> > > |--------------------------||----------------|
> > > 0x100000 0x100000+l1 0x100000+l1+l2
> > >
> > > d e c o m p r e s s e d k e r n e l
> > > |-------------------------------------------------------------|
> > > 0x1000000 0x1000000+l3
> > >
> > > The decompressed kernel seemingly overwriting the compressed kernel
> > > image isn't a problem, because that gets relocated to a higher address
> > > early on in the boot process, at the end of startup_64. setup_data,
> > > however, stays in the same place, since those links are self referential
> > > and nothing fixes them up. So the decompressed kernel clobbers it.
> > >
> > > Fix this by appending setup_data to the cmdline blob rather than the
> > > kernel image blob, which remains at a lower address that won't get
> > > clobbered.
> > >
> > > This could have been done by overwriting the initrd blob instead, but
> > > that poses big difficulties, such as no longer being able to use memory
> > > mapped files for initrd, hurting performance, and, more importantly, the
> > > initrd address calculation is hard coded in qboot, and it always grows
> > > down rather than up, which means lots of brittle semantics would have to
> > > be changed around, incurring more complexity. In contrast, using cmdline
> > > is simple and doesn't interfere with anything.
> > >
> > > The microvm machine has a gross hack where it fiddles with fw_cfg data
> > > after the fact. So this hack is updated to account for this appending,
> > > by reserving some bytes.
> > >
> > > Cc: [email protected]
> > > Cc: Philippe Mathieu-Daud? <[email protected]>
> > > Cc: H. Peter Anvin <[email protected]>
> > > Cc: Borislav Petkov <[email protected]>
> > > Cc: Eric Biggers <[email protected]>
> > > Signed-off-by: Jason A. Donenfeld <[email protected]>
> >
> > I apologize if this has already been reported/fixed already (I did a
> > brief search on lore.kernel.org) or if my terminology is not as precise
> > as it could be, I am a little out of my element here :)
> >
> > After this change as commit eac7a7791b ("x86: don't let decompressed
> > kernel image clobber setup_data") in QEMU master, I am no longer able to
> > boot a kernel directly through OVMF using '-append' + '-initrd' +
> > '-kernel'. Instead, systemd-boot tries to start the distribution's
> > kernel, which fails with:
> >
> > Error registering initrd: Already started
> >
> > This can be reproduced with just a defconfig Linux kernel (I used
> > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for
> > boot testing [1], and OVMF. Prior to this change, the kernel would start
> > up but after, I am dumped into the UEFI shell (as there is no
> > bootloader).
> >
> > The QEMU command I used was:
> >
> > $ qemu-system-x86_64 \
> > -kernel arch/x86_64/boot/bzImage \
> > -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \
> > -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on
> > -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \
>
> Oh no... Without jumping into it, at first glance, I have absolutely
> no idea. I suppose I could start debugging it and probably come up
> with a solution, but...
>
> @mst - I'm beginning to think that this whole setup_data route is
> cursed. This is accumulating hacks within hacks within hacks. What
> would you think if I just send a patch *removing* all use of
> setup_data (the rng seed and the dtb thing), and then we can gradually
> add that back with an actual overarching design. For example, it'd
> probably make sense to have a separate fwcfg file for setup_data
> rather than trying to mangle and existing one, etc. This way, we
> unbreak the tree, and let the new approach be reviewed more
> reasonably.
>
> Jason
Yea basically this is close to what I proposed. I can't off-hand figure
out whether just dropping all of this is fine or we need some compat
hacks to enable this in some way for 7.2, and it's pretty late here.
Suggestions welcome.
--
MST