LinuxLists.cc - why the decompressed procedure move kernel from address 0x100000(1M) to 0x1000000(16M) +x

2012-05-17 05:56:10

Subject: why the decompressed procedure move kernel from address 0x100000(1M) to 0x1000000(16M) +x

hi all,
recently, I got some puzzle when I read source code of the system boot.
I need some help.

at the end of src/arch/x86/boot/header.S, kernel jump to 0x100000(where
is the src/arch/x86/boot/compressed/head_32.S).
in __this__ head_32.S, I found the kernel is move to 0x1000000(mostly is
to here) +x. the x distance is used for decompressed buf. must leave
some distance for decompressing without overlap.

after the move, kernel is decompressed at 0x1000000(16m). and jump to it.

so why not decompressed kernel at 0x100000(1M) to 0x1000000(16m)
directly without moving?

is the move necessary?

thanks for reading above.

2012-06-02 23:48:15

by Eric W. Biederman

[permalink] [raw]

Subject: Re: why the decompressed procedure move kernel from address 0x100000(1M) to 0x1000000(16M) +x

hacklu <[email protected]> writes:

> hi all,
> recently, I got some puzzle when I read source code of the system boot. I need
> some help.
>
> at the end of src/arch/x86/boot/header.S, kernel jump to 0x100000(where is the
> src/arch/x86/boot/compressed/head_32.S).
> in __this__ head_32.S, I found the kernel is move to 0x1000000(mostly is to
> here) +x. the x distance is used for decompressed buf. must leave some distance
> for decompressing without overlap.
>
> after the move, kernel is decompressed at 0x1000000(16m). and jump to it.
>
> so why not decompressed kernel at 0x100000(1M) to 0x1000000(16m) directly
> without moving?
>
> is the move necessary?

The move is nececcessary if we are doing the decompression in place.
Without a move it is hard to tell if there are going to be overlapping
address problems. The move is cheap so there is no apparent reason
to optimize it away.

Eric

2012-06-03 03:10:17

by H. Peter Anvin

[permalink] [raw]

Subject: Re: why the decompressed procedure move kernel from address 0x100000(1M) to 0x1000000(16M) +x

On 06/02/2012 04:48 PM, Eric W. Biederman wrote:
On 06/02/2012 04:48 PM, Eric W. Biederman wrote:
> hacklu <[email protected]> writes:
>
>> hi all,
>> recently, I got some puzzle when I read source code of the system boot. I need
>> some help.
>>
>> at the end of src/arch/x86/boot/header.S, kernel jump to 0x100000(where is the
>> src/arch/x86/boot/compressed/head_32.S).
>> in __this__ head_32.S, I found the kernel is move to 0x1000000(mostly is to
>> here) +x. the x distance is used for decompressed buf. must leave some distance
>> for decompressing without overlap.
>>
>> after the move, kernel is decompressed at 0x1000000(16m). and jump to it.
>>
>> so why not decompressed kernel at 0x100000(1M) to 0x1000000(16m) directly
>> without moving?
>>
>> is the move necessary?
>
> The move is nececcessary if we are doing the decompression in place.
> Without a move it is hard to tell if there are going to be overlapping
> address problems. The move is cheap so there is no apparent reason
> to optimize it away.
>

Well, right now we do two copies (one before decompression, and one
after while parsing the ELF payload.) It would be nice to get rid of at
least one but preferably both (when possible.)

Boot time does matter, although this isn't a huge amount of time, it is
something that can be shaved off relatively cheaply.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-06-03 08:41:46

by Eric W. Biederman

[permalink] [raw]

Subject: Re: why the decompressed procedure move kernel from address 0x100000(1M) to 0x1000000(16M) +x

"H. Peter Anvin" <[email protected]> writes:

> On 06/02/2012 04:48 PM, Eric W. Biederman wrote:
> On 06/02/2012 04:48 PM, Eric W. Biederman wrote:
>> hacklu <[email protected]> writes:
>>
>>> hi all,
>>> recently, I got some puzzle when I read source code of the system boot. I need
>>> some help.
>>>
>>> at the end of src/arch/x86/boot/header.S, kernel jump to 0x100000(where is the
>>> src/arch/x86/boot/compressed/head_32.S).
>>> in __this__ head_32.S, I found the kernel is move to 0x1000000(mostly is to
>>> here) +x. the x distance is used for decompressed buf. must leave some distance
>>> for decompressing without overlap.
>>>
>>> after the move, kernel is decompressed at 0x1000000(16m). and jump to it.
>>>
>>> so why not decompressed kernel at 0x100000(1M) to 0x1000000(16m) directly
>>> without moving?
>>>
>>> is the move necessary?
>>
>> The move is nececcessary if we are doing the decompression in place.
>> Without a move it is hard to tell if there are going to be overlapping
>> address problems. The move is cheap so there is no apparent reason
>> to optimize it away.
>>
>
> Well, right now we do two copies (one before decompression, and one
> after while parsing the ELF payload.) It would be nice to get rid of at
> least one but preferably both (when possible.)
>
> Boot time does matter, although this isn't a huge amount of time, it is
> something that can be shaved off relatively cheaply.

Interesting. I have been out of the loop for a bit and had not noticed
the ELF payload change. It is really sad to say that I have missed
a feature like that for 4 years.

Time wise copying the kernel probably takes a millisecond or less on
modern hardware, so I don't think it is a particularly large concern.

Looking at parse_elf I see two misfeatures.
- parse_elf is short some memsets for the ELF sections that are larger
in memory than they are in the file data.
- We don't return the entry point of the elf header and instead hard
code the beginning of the file data.

The oddest thing about parse_elf is what makes the copies parse_elf
performs safe. It just happens to be the case that because of the
way ld lays out the file those copies turn into a single memmove
that just strips off the elf header and program header.

So it should be trivial to and perhaps even safer to decompress the
program segments to their final destination.

Looking at the way the code is evolving, I suspect we should just give
up overlapping compressed data and uncompressed data. The elf header
logic in theory allows the code in a more arbitrary order, and it
doesn't look like anyone has done a worst case space analysis for
anything except gzip. The code works most of the time today but
I do wonder if it is safe.

Additionally at the rate we are adding compression algorithms I don't
believe that all of the compression alorigthms use the gzip footer that
reports the uncompressed length of the file. So I suspect it would be
wise to get z_output_len from simply examining the uncompressed file
that we feed to compression programs, aka vmlinux.bin.all-y

Perhaps I am wrong but I have the strongest feeling we are playing with
fire and getting very lucky right now.

Eric

2012-06-03 13:01:42

by H. Peter Anvin

[permalink] [raw]

Subject: Re: why the decompressed procedure move kernel from address 0x100000(1M) to 0x1000000(16M) +x

On 06/03/2012 01:41 AM, Eric W. Biederman wrote:
>
> Time wise copying the kernel probably takes a millisecond or less on
> modern hardware, so I don't think it is a particularly large concern.
>

If your boot time budget is a second, getting rid of a millisecond helps.

> Looking at parse_elf I see two misfeatures.
> - parse_elf is short some memsets for the ELF sections that are larger
> in memory than they are in the file data.
> - We don't return the entry point of the elf header and instead hard
> code the beginning of the file data.
>
> The oddest thing about parse_elf is what makes the copies parse_elf
> performs safe. It just happens to be the case that because of the
> way ld lays out the file those copies turn into a single memmove
> that just strips off the elf header and program header.
>
> So it should be trivial to and perhaps even safer to decompress the
> program segments to their final destination.
>
> Looking at the way the code is evolving, I suspect we should just give
> up overlapping compressed data and uncompressed data. The elf header
> logic in theory allows the code in a more arbitrary order, and it
> doesn't look like anyone has done a worst case space analysis for
> anything except gzip. The code works most of the time today but
> I do wonder if it is safe.
>
> Additionally at the rate we are adding compression algorithms I don't
> believe that all of the compression alorigthms use the gzip footer that
> reports the uncompressed length of the file. So I suspect it would be
> wise to get z_output_len from simply examining the uncompressed file
> that we feed to compression programs, aka vmlinux.bin.all-y
>
> Perhaps I am wrong but I have the strongest feeling we are playing with
> fire and getting very lucky right now.

I think it might be even worse; on x86-64 we produce 2 MiB of completely
useless zero-padding after the header. I have wanted to get rid of it
but I'm not sure if the ELF code is robust enough. The code is a
hackwork and yes, should be replaced by decompressing into the
designated target areas.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-06-03 20:31:15

by Eric W. Biederman

[permalink] [raw]

Subject: [PATCH 1/2] x86, boot: Don't overlap the compressed and non-compressed image

In practice there is enough room in the _bss that this doesn't noticably
increase the amount of memory we use during boot, and it makes verifying
the code much simpler.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86/boot/compressed/misc.c | 75 ------------------------------------
arch/x86/boot/compressed/mkpiggy.c | 5 +--
2 files changed, 1 insertions(+), 79 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 7116dcb..5b04b66 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -18,81 +18,6 @@
*/

/*
- * Getting to provable safe in place decompression is hard.
- * Worst case behaviours need to be analyzed.
- * Background information:
- *
- * The file layout is:
- * magic[2]
- * method[1]
- * flags[1]
- * timestamp[4]
- * extraflags[1]
- * os[1]
- * compressed data blocks[N]
- * crc[4] orig_len[4]
- *
- * resulting in 18 bytes of non compressed data overhead.
- *
- * Files divided into blocks
- * 1 bit (last block flag)
- * 2 bits (block type)
- *
- * 1 block occurs every 32K -1 bytes or when there 50% compression
- * has been achieved. The smallest block type encoding is always used.
- *
- * stored:
- * 32 bits length in bytes.
- *
- * fixed:
- * magic fixed tree.
- * symbols.
- *
- * dynamic:
- * dynamic tree encoding.
- * symbols.
- *
- *
- * The buffer for decompression in place is the length of the
- * uncompressed data, plus a small amount extra to keep the algorithm safe.
- * The compressed data is placed at the end of the buffer. The output
- * pointer is placed at the start of the buffer and the input pointer
- * is placed where the compressed data starts. Problems will occur
- * when the output pointer overruns the input pointer.
- *
- * The output pointer can only overrun the input pointer if the input
- * pointer is moving faster than the output pointer. A condition only
- * triggered by data whose compressed form is larger than the uncompressed
- * form.
- *
- * The worst case at the block level is a growth of the compressed data
- * of 5 bytes per 32767 bytes.
- *
- * The worst case internal to a compressed block is very hard to figure.
- * The worst case can at least be boundined by having one bit that represents
- * 32764 bytes and then all of the rest of the bytes representing the very
- * very last byte.
- *
- * All of which is enough to compute an amount of extra data that is required
- * to be safe. To avoid problems at the block level allocating 5 extra bytes
- * per 32767 bytes of data is sufficient. To avoind problems internal to a
- * block adding an extra 32767 bytes (the worst case uncompressed block size)
- * is sufficient, to ensure that in the worst case the decompressed data for
- * block will stop the byte before the compressed data for a block begins.
- * To avoid problems with the compressed data's meta information an extra 18
- * bytes are needed. Leading to the formula:
- *
- * extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size.
- *
- * Adding 8 bytes per 32K is a bit excessive but much easier to calculate.
- * Adding 32768 instead of 32767 just makes for round numbers.
- * Adding the decompressor_size is necessary as it musht live after all
- * of the data as well. Last I measured the decompressor is about 14K.
- * 10K of actual data and 4K of bss.
- *
- */
-
-/*
* gzip declarations
*/
#define STATIC static
diff --git a/arch/x86/boot/compressed/mkpiggy.c b/arch/x86/boot/compressed/mkpiggy.c
index 958a641..3f4a68d 100644
--- a/arch/x86/boot/compressed/mkpiggy.c
+++ b/arch/x86/boot/compressed/mkpiggy.c
@@ -70,10 +70,7 @@ int main(int argc, char *argv[])
* sizes, compute the necessary decompression offset...
*/

- offs = (olen > ilen) ? olen - ilen : 0;
- offs += olen >> 12; /* Add 8 bytes for each 32K block */
- offs += 64*1024 + 128; /* Add 64K + 128 bytes slack */
- offs = (offs+4095) & ~4095; /* Round to a 4K boundary */
+ offs = (olen+4096) & ~4095; /* Round to a 4K boundary */

printf(".section \".rodata..compressed\",\"a\",@progbits\n");
printf(".globl z_input_len\n");
--
1.7.5.4

2012-06-03 20:33:13

by Eric W. Biederman

[permalink] [raw]

Subject: [PATCH 2/2] x86, boot: Optimize the elf header handling.

Create a space for the elf headers at the begginng of the
kernels image in memory. Removing the need for an extra
copy of the kernel during boot. Making things faster
and making vmlinux smaller.

- Allow room for the elf headers in the vmlinux.
This removes the need to insert padding between
the elf headers and the start of text. Reducing
the size of vmlinux by 2MB on x86_64 and removing
the need for parse_elf in boot.c to move the code.

- Return the relocated entry point address from parse_elf
as the kernel's entry point is no long at a fixed address.

- Remove the now unnecessary copies in arch/x86/compress/misc.c:parse_elf
The ELF headers are now guaranteed to not conflict with the program
data in the uncompressed image.

- In cleanup_highmap keep all pages starting with __START_KERNEL_map
instead of _text. Those values used to be the same but with the
insertion of the hole for the ELF headers they differ and cause us
to nuke our first 2MB of text ouch! So use the __START_KERNEL_map
which includes the elf headers.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86/boot/compressed/head_32.S | 2 +-
arch/x86/boot/compressed/head_64.S | 2 +-
arch/x86/boot/compressed/misc.c | 60 +++++++++++------------------------
arch/x86/kernel/vmlinux.lds.S | 4 +-
arch/x86/mm/init_64.c | 3 +-
5 files changed, 25 insertions(+), 46 deletions(-)

diff --git a/arch/x86/boot/compressed/head_32.S b/arch/x86/boot/compressed/head_32.S
index c85e3ac..1b15e2c 100644
--- a/arch/x86/boot/compressed/head_32.S
+++ b/arch/x86/boot/compressed/head_32.S
@@ -211,7 +211,7 @@ relocated:
* Jump to the decompressed kernel.
*/
xorl %ebx, %ebx
- jmp *%ebp
+ jmp *%eax

/*
* Stack and heap for uncompression
diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 87e03a1..9b8d782 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -337,7 +337,7 @@ relocated:
/*
* Jump to the decompressed kernel.
*/
- jmp *%rbp
+ jmp *%rax

.data
gdt:
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 5b04b66..fc34e8a 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -198,23 +198,23 @@ static void error(char *x)
asm("hlt");
}

-static void parse_elf(void *output)
+static void *parse_elf(void *output)
{
#ifdef CONFIG_X86_64
- Elf64_Ehdr ehdr;
- Elf64_Phdr *phdrs, *phdr;
+ Elf64_Ehdr *ehdr;
+ Elf64_Phdr *phdrs;
#else
- Elf32_Ehdr ehdr;
- Elf32_Phdr *phdrs, *phdr;
+ Elf32_Ehdr *ehdr;
+ Elf32_Phdr *phdrs;
#endif
void *dest;
int i;

- memcpy(&ehdr, output, sizeof(ehdr));
- if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
- ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
- ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
- ehdr.e_ident[EI_MAG3] != ELFMAG3) {
+ ehdr = output;
+ if (ehdr->e_ident[EI_MAG0] != ELFMAG0 ||
+ ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
+ ehdr->e_ident[EI_MAG2] != ELFMAG2 ||
+ ehdr->e_ident[EI_MAG3] != ELFMAG3) {
error("Kernel is not a valid ELF file");
return;
}
@@ -222,39 +222,17 @@ static void parse_elf(void *output)
if (!quiet)
putstr("Parsing ELF... ");

- phdrs = malloc(sizeof(*phdrs) * ehdr.e_phnum);
- if (!phdrs)
- error("Failed to allocate space for phdrs");
+ phdrs = output + ehdr->e_phoff;

- memcpy(phdrs, output + ehdr.e_phoff, sizeof(*phdrs) * ehdr.e_phnum);
-
- for (i = 0; i < ehdr.e_phnum; i++) {
- phdr = &phdrs[i];
-
- switch (phdr->p_type) {
- case PT_LOAD:
-#ifdef CONFIG_RELOCATABLE
- dest = output;
- dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
-#else
- dest = (void *)(phdr->p_paddr);
-#endif
- memcpy(dest,
- output + phdr->p_offset,
- phdr->p_filesz);
- break;
- default: /* Ignore other PT_* */ break;
- }
- }
-
- free(phdrs);
+ return output + (ehdr->e_entry - LOAD_PHYSICAL_ADDR);
}

-asmlinkage void decompress_kernel(void *rmode, memptr heap,
- unsigned char *input_data,
- unsigned long input_len,
- unsigned char *output)
+asmlinkage void *decompress_kernel(void *rmode, memptr heap,
+ unsigned char *input_data,
+ unsigned long input_len,
+ unsigned char *output)
{
+ void *entry;
real_mode = rmode;

if (cmdline_find_option_bool("quiet"))
@@ -297,8 +275,8 @@ asmlinkage void decompress_kernel(void *rmode, memptr heap,
if (!quiet)
putstr("\nDecompressing Linux... ");
decompress(input_data, input_len, NULL, NULL, output, NULL, error);
- parse_elf(output);
+ entry = parse_elf(output);
if (!quiet)
putstr("done.\nBooting the kernel.\n");
- return;
+ return entry;
}
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 22a1530..af6fb8a 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -82,10 +82,10 @@ PHDRS {
SECTIONS
{
#ifdef CONFIG_X86_32
- . = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
+ . = LOAD_OFFSET + LOAD_PHYSICAL_ADDR + SIZEOF_HEADERS;
phys_startup_32 = startup_32 - LOAD_OFFSET;
#else
- . = __START_KERNEL;
+ . = __START_KERNEL + SIZEOF_HEADERS;
phys_startup_64 = startup_64 - LOAD_OFFSET;
#endif

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..e8599cd 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -303,13 +303,14 @@ void __init cleanup_highmap(void)
{
unsigned long vaddr = __START_KERNEL_map;
unsigned long vaddr_end = __START_KERNEL_map + (max_pfn_mapped << PAGE_SHIFT);
+ unsigned long text = __START_KERNEL_map;
unsigned long end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;

for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
if (pmd_none(*pmd))
continue;
- if (vaddr < (unsigned long) _text || vaddr > end)
+ if (vaddr < text || vaddr > end)
set_pmd(pmd, __pmd(0));
}
}
--
1.7.5.4

2012-07-01 02:23:40

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [PATCH 1/2] x86, boot: Don't overlap the compressed and non-compressed image

Ping.

These patches should still apply, without problems.

Eric

2012-07-01 02:32:58

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [PATCH 1/2] x86, boot: Don't overlap the compressed and non-compressed image

On 06/30/2012 07:23 PM, Eric W. Biederman wrote:
>
> Ping.
>
> These patches should still apply, without problems.
>

Already doing the test build :)

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.