2006-08-01 11:01:25

by Eric W. Biederman

Subject: [RFC] ELF Relocatable x86 and x86_64 bzImages


The problem:

We can't always run the kernel at 1MB or 2MB, and so people who need
different addresses must build multiple kernels. The bzImage format
can't even represent loading a kernel at anything other than its default
address. With kexec on panic now starting to be used by distros, a
kernel not running at the default load address is becoming common.

The goal of this patch series is to build kernels that are relocatable
at run time, and to extend the bzImage format to make it capable of
expressing a relocatable kernel.

In extending the bzImage format I am replacing the existing unused
bootsector with an ELF header. To express what is going on, the ELF
header will have type ET_DYN. Just as when the kernel loads an ET_DYN
executable, bootloaders are not expected to process relocations. But
the executable may be shifted in the address space so long as its
alignment requirements are met.
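
To make this concrete, here is a minimal sketch of the check a
bootloader could perform (an illustration, not code from this series;
it assumes a hosted 64-bit build and, for brevity, only honors the
first PT_LOAD segment's alignment):

	#include <elf.h>
	#include <stdint.h>
	#include <string.h>

	/* Return a usable load address for a relocatable bzImage, or 0
	 * if the image does not carry the new ET_DYN ELF header. */
	static uint64_t choose_load_addr(const void *image, uint64_t preferred)
	{
		const Elf64_Ehdr *eh = image;
		const Elf64_Phdr *ph;
		uint64_t align;

		if (memcmp(eh->e_ident, ELFMAG, SELFMAG) != 0 ||
		    eh->e_type != ET_DYN)
			return 0;

		/* No relocation processing is required; just honor the
		 * alignment constraint when shifting the image. */
		ph = (const Elf64_Phdr *)((const char *)image + eh->e_phoff);
		align = ph->p_align ? ph->p_align : 1;
		return (preferred + align - 1) & ~(align - 1);
	}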

The x86_64 kernel is simply built to live at a fixed virtual address
and the boot page tables are relocated. The i386 kernel is built
to process relocations generated with --emit-relocs (after vmlinux.lds.S
has been fixed up to sort out static and dynamic relocations).
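
Conceptually the i386 fixup amounts to the following (an illustration
only; the real fixup runs in assembly in the decompressor and has to
juggle physical versus virtual addresses, which I gloss over here).
As I understand the format, the build appends a 0-terminated table of
the addresses of 32-bit words holding absolute addresses; moving the
kernel by delta bytes means adding delta to each of them:

	static void apply_relocs(unsigned int *site_table, unsigned int delta)
	{
		while (*site_table) {
			/* Find the word at its run-time location ... */
			unsigned int *site =
				(unsigned int *)(unsigned long)(*site_table + delta);
			/* ... and shift the absolute address it holds. */
			*site += delta;
			site_table++;
		}
	}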

Currently there are 33 patches in my tree to do this.

The weirdest symptom I have had so far is that page faults did not
trigger the early exception handler on x86_64 (instead I got a reboot).

There is one outstanding issue where I am probably requiring too much alignment
on the arch/i386 kernel.

Can anyone find anything else?

Eric


2006-08-01 11:05:23

by Eric W. Biederman

Subject: [PATCH 4/33] i386: CONFIG_PHYSICAL_START cleanup

Defining __PHYSICAL_START and __KERNEL_START in asm-i386/page.h works, but
it triggers a full kernel rebuild for the silliest of reasons: kbuild makes
every file that includes page.h depend on CONFIG_PHYSICAL_START. This
patch modifies the users to include linux/config.h and use
CONFIG_PHYSICAL_START directly, which avoids the full rebuild and makes
the code more maintainer (and hopefully user) friendly.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/i386/boot/compressed/head.S | 8 ++++----
arch/i386/boot/compressed/misc.c | 8 ++++----
arch/i386/kernel/vmlinux.lds.S | 3 ++-
include/asm-i386/page.h | 3 ---
4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/arch/i386/boot/compressed/head.S b/arch/i386/boot/compressed/head.S
index b5893e4..8f28ecd 100644
--- a/arch/i386/boot/compressed/head.S
+++ b/arch/i386/boot/compressed/head.S
@@ -23,9 +23,9 @@
*/
.text

+#include <linux/config.h>
#include <linux/linkage.h>
#include <asm/segment.h>
-#include <asm/page.h>

.globl startup_32

@@ -75,7 +75,7 @@ startup_32:
popl %esi # discard address
popl %esi # real mode pointer
xorl %ebx,%ebx
- ljmp $(__BOOT_CS), $__PHYSICAL_START
+ ljmp $(__BOOT_CS), $CONFIG_PHYSICAL_START

/*
* We come here, if we were loaded high.
@@ -100,7 +100,7 @@ startup_32:
popl %ecx # lcount
popl %edx # high_buffer_start
popl %eax # hcount
- movl $__PHYSICAL_START,%edi
+ movl $CONFIG_PHYSICAL_START,%edi
cli # make sure we don't get interrupted
ljmp $(__BOOT_CS), $0x1000 # and jump to the move routine

@@ -125,5 +125,5 @@ move_routine_start:
movsl
movl %ebx,%esi # Restore setup pointer
xorl %ebx,%ebx
- ljmp $(__BOOT_CS), $__PHYSICAL_START
+ ljmp $(__BOOT_CS), $CONFIG_PHYSICAL_START
move_routine_end:
diff --git a/arch/i386/boot/compressed/misc.c b/arch/i386/boot/compressed/misc.c
index b2ccd54..905c37e 100644
--- a/arch/i386/boot/compressed/misc.c
+++ b/arch/i386/boot/compressed/misc.c
@@ -9,11 +9,11 @@
* High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996
*/

+#include <linux/config.h>
#include <linux/linkage.h>
#include <linux/vmalloc.h>
#include <linux/screen_info.h>
#include <asm/io.h>
-#include <asm/page.h>

/*
* gzip declarations
@@ -303,7 +303,7 @@ #ifdef STANDARD_MEMORY_BIOS_CALL
#else
if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < 1024) error("Less than 2MB of memory");
#endif
- output_data = (unsigned char *)__PHYSICAL_START; /* Normally Points to 1M */
+ output_data = (unsigned char *)CONFIG_PHYSICAL_START; /* Normally Points to 1M */
free_mem_end_ptr = (long)real_mode;
}

@@ -326,8 +326,8 @@ #endif
low_buffer_size = low_buffer_end - LOW_BUFFER_START;
high_loaded = 1;
free_mem_end_ptr = (long)high_buffer_start;
- if ( (__PHYSICAL_START + low_buffer_size) > ((ulg)high_buffer_start)) {
- high_buffer_start = (uch *)(__PHYSICAL_START + low_buffer_size);
+ if ( (CONFIG_PHYSICAL_START + low_buffer_size) > ((ulg)high_buffer_start)) {
+ high_buffer_start = (uch *)(CONFIG_PHYSICAL_START + low_buffer_size);
mv->hcount = 0; /* say: we need not to move high_buffer */
}
else mv->hcount = -1;
diff --git a/arch/i386/kernel/vmlinux.lds.S b/arch/i386/kernel/vmlinux.lds.S
index db0833b..8bcf0e1 100644
--- a/arch/i386/kernel/vmlinux.lds.S
+++ b/arch/i386/kernel/vmlinux.lds.S
@@ -4,6 +4,7 @@

#define LOAD_OFFSET __PAGE_OFFSET

+#include <linux/config.h>
#include <asm-generic/vmlinux.lds.h>
#include <asm/thread_info.h>
#include <asm/page.h>
@@ -15,7 +16,7 @@ ENTRY(phys_startup_32)
jiffies = jiffies_64;
SECTIONS
{
- . = __KERNEL_START;
+ . = LOAD_OFFSET + CONFIG_PHYSICAL_START;
phys_startup_32 = startup_32 - LOAD_OFFSET;
/* read-only */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
diff --git a/include/asm-i386/page.h b/include/asm-i386/page.h
index eceb7f5..1af9f6b 100644
--- a/include/asm-i386/page.h
+++ b/include/asm-i386/page.h
@@ -112,12 +112,9 @@ #endif /* __ASSEMBLY__ */

#ifdef __ASSEMBLY__
#define __PAGE_OFFSET CONFIG_PAGE_OFFSET
-#define __PHYSICAL_START CONFIG_PHYSICAL_START
#else
#define __PAGE_OFFSET ((unsigned long)CONFIG_PAGE_OFFSET)
-#define __PHYSICAL_START ((unsigned long)CONFIG_PHYSICAL_START)
#endif
-#define __KERNEL_START (__PAGE_OFFSET + __PHYSICAL_START)


#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)
--
1.4.2.rc2.g5209e

2006-08-01 11:05:21

by Eric W. Biederman

Subject: [PATCH 6/33] Make linux/elf.h safe to be included in assembly files

The motivation for this is that currently we have 512 bytes
at the beginning of a bzImage that are unused now that we don't
have a bootsector there. I plan on putting an ELF header
there, and generating it by hand with assembly data directives
to be minimally disruptive to the current build process.

To do that I need the ELF magic constants available to my
assembly code.
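
For instance, with the constants visible to the assembler, the start of
such a header can be emitted like this (a sketch, not the final header
from this series):

	#include <linux/elf.h>

	.byte ELFMAG0, ELFMAG1, ELFMAG2, ELFMAG3  /* 0x7f, 'E', 'L', 'F' */
	.byte ELFCLASS32, ELFDATA2LSB, EV_CURRENT
	.org 16				/* pad e_ident out to 16 bytes */
	.word ET_DYN			/* e_type: shiftable at load time */
	.word EM_386			/* e_machine */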

Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/elf.h | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/elf.h b/include/linux/elf.h
index b70d1d2..c5bf043 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -1,9 +1,11 @@
#ifndef _LINUX_ELF_H
#define _LINUX_ELF_H

+#include <linux/elf-em.h>
+
+#ifndef __ASSEMBLY__
#include <linux/types.h>
#include <linux/auxvec.h>
-#include <linux/elf-em.h>
#include <asm/elf.h>

#ifndef elf_read_implies_exec
@@ -30,6 +32,8 @@ typedef __u32 Elf64_Word;
typedef __u64 Elf64_Xword;
typedef __s64 Elf64_Sxword;

+#endif /* __ASSEMBLY__ */
+
/* These constants are for the segment types stored in the image headers */
#define PT_NULL 0
#define PT_LOAD 1
@@ -97,6 +101,8 @@ #define STT_FILE 4
#define STT_COMMON 5
#define STT_TLS 6

+#ifndef __ASSEMBLY__
+
#define ELF_ST_BIND(x) ((x) >> 4)
#define ELF_ST_TYPE(x) (((unsigned int) x) & 0xf)
#define ELF32_ST_BIND(x) ELF_ST_BIND(x)
@@ -204,12 +210,16 @@ typedef struct elf64_hdr {
Elf64_Half e_shstrndx;
} Elf64_Ehdr;

+#endif /* __ASSEMBLY__ */
+
/* These constants define the permissions on sections in the program
header, p_flags. */
#define PF_R 0x4
#define PF_W 0x2
#define PF_X 0x1

+#ifndef __ASSEMBLY__
+
typedef struct elf32_phdr{
Elf32_Word p_type;
Elf32_Off p_offset;
@@ -232,6 +242,8 @@ typedef struct elf64_phdr {
Elf64_Xword p_align; /* Segment alignment, file & memory */
} Elf64_Phdr;

+#endif /* __ASSEMBLY__ */
+
/* sh_type */
#define SHT_NULL 0
#define SHT_PROGBITS 1
@@ -265,6 +277,8 @@ #define SHN_HIPROC 0xff1f
#define SHN_ABS 0xfff1
#define SHN_COMMON 0xfff2
#define SHN_HIRESERVE 0xffff
+
+#ifndef __ASSEMBLY__

typedef struct {
Elf32_Word sh_name;
@@ -292,6 +306,8 @@ typedef struct elf64_shdr {
Elf64_Xword sh_entsize; /* Entry size if section holds table */
} Elf64_Shdr;

+#endif /* __ASSEMBLY__ */
+
#define EI_MAG0 0 /* e_ident[] indexes */
#define EI_MAG1 1
#define EI_MAG2 2
@@ -338,6 +354,8 @@ #define NT_AUXV 6
#define NT_PRXFPREG 0x46e62b7f /* copied from gdb5.1/include/elf/common.h */


+#ifndef __ASSEMBLY__
+
/* Note header in a PT_NOTE section */
typedef struct elf32_note {
Elf32_Word n_namesz; /* Name size */
@@ -368,5 +386,7 @@ #define elf_note elf64_note

#endif

+#endif /* __ASSEMBLY__ */
+

#endif /* _LINUX_ELF_H */
--
1.4.2.rc2.g5209e

2006-08-01 11:05:47

by Eric W. Biederman

Subject: [PATCH 22/33] x86_64: Fix gdt table size in trampoline.S
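
The limit field of a GDT descriptor is the size of the table in bytes
minus one, i.e. GDT_ENTRIES*8 - 1. The old expression __KERNEL32_CS + 7
is a segment selector plus seven; it only covered the GDT up to that
descriptor and apparently worked by accident.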

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/trampoline.S | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86_64/kernel/trampoline.S b/arch/x86_64/kernel/trampoline.S
index 23a03eb..c79b99a 100644
--- a/arch/x86_64/kernel/trampoline.S
+++ b/arch/x86_64/kernel/trampoline.S
@@ -64,7 +64,7 @@ idt_48:
.word 0, 0 # idt base = 0L

gdt_48:
- .short __KERNEL32_CS + 7 # gdt limit
+ .short GDT_ENTRIES*8 - 1 # gdt limit
.long cpu_gdt_table-__START_KERNEL_map

.globl trampoline_end
--
1.4.2.rc2.g5209e

2006-08-01 11:06:28

by Eric W. Biederman

Subject: [PATCH 2/33] i386: define __pa_symbol

On x86_64 we have to be careful when calculating the physical
address of kernel symbols, both because of compiler oddities
and because the symbols live in a different range of the virtual
address space.

Having a definition of __pa_symbol that works on both x86_64 and
i386 simplifies writing code for both architectures that has
these kinds of dependencies.

So this patch adds the trivial i386 __pa_symbol definition.
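
For illustration (a sketch, not code from this patch), a caller can now
be written once for both architectures; the bootmem reservation in the
next patch becomes:

	/* _text is the linker symbol for the start of the kernel image;
	 * kernel_size stands in for the length being reserved. */
	reserve_bootmem(__pa_symbol(_text), kernel_size);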

Signed-off-by: Eric W. Biederman <[email protected]>
---
include/asm-i386/page.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/asm-i386/page.h b/include/asm-i386/page.h
index f5bf544..eceb7f5 100644
--- a/include/asm-i386/page.h
+++ b/include/asm-i386/page.h
@@ -124,6 +124,7 @@ #define PAGE_OFFSET ((unsigned long)__P
#define VMALLOC_RESERVE ((unsigned long)__VMALLOC_RESERVE)
#define MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE)
#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
+#define __pa_symbol(x) __pa(x)
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
#ifdef CONFIG_FLATMEM
--
1.4.2.rc2.g5209e

2006-08-01 11:06:27

by Eric W. Biederman

Subject: [PATCH 3/33] i386 setup.c: Reserve kernel memory starting from _text

Currently, when we reserve the memory the kernel text resides in,
we start at __PHYSICAL_START, which happens to be correct but is
not very obvious. In addition, once we start relocating the kernel,
__PHYSICAL_START is the wrong value, as it is an absolute symbol
that does not get relocated.

By starting the reservation at __pa_symbol(_text)
the code is clearer and will be correct when relocated.
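
A worked example (illustrative numbers): with CONFIG_PHYSICAL_START =
0x100000 and the kernel relocated up by 15MB, __PHYSICAL_START still
evaluates to 0x100000, while __pa_symbol(_text) yields 0x1000000 (16MB),
which is where the kernel text actually sits.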

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/i386/kernel/setup.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
index f168220..f3a451a 100644
--- a/arch/i386/kernel/setup.c
+++ b/arch/i386/kernel/setup.c
@@ -1219,8 +1219,8 @@ void __init setup_bootmem_allocator(void
* the (very unlikely) case of us accidentally initializing the
* bootmem allocator with an invalid RAM area.
*/
- reserve_bootmem(__PHYSICAL_START, (PFN_PHYS(min_low_pfn) +
- bootmap_size + PAGE_SIZE-1) - (__PHYSICAL_START));
+ reserve_bootmem(__pa_symbol(_text), (PFN_PHYS(min_low_pfn) +
+ bootmap_size + PAGE_SIZE-1) - __pa_symbol(_text));

/*
* reserve physical page 0 - it's a special BIOS page on many boxes,
--
1.4.2.rc2.g5209e

2006-08-01 11:05:45

by Eric W. Biederman

Subject: [PATCH 18/33] x86_64: Kill temp_boot_pmds

Early in the boot process we need the ability to set
up temporary mappings, before our normal mechanisms are
initialized. Currently this is used to map pages that
are part of the page tables we are building and pages
during the DMI scan.

The core problem is that we are using the user portion of
the page tables to implement this, which means that while
this mechanism is active we cannot catch NULL pointer dereferences,
and we deviate from the normal ways of handling things.

In this patch I modify early_ioremap to map pages into
the kernel portion of the address space, roughly where
we will later put modules, and I make the discovery of
which addresses we can use dynamic, which removes all
kinds of static limits and removes the dependencies
on implementation details between different parts of the code.

I also modify the early page table initialization code
to use early_ioremap and early_iounmap, instead of the
special-case versions of those functions it previously
called.

The only really silly part left in init_memory_mapping
is that find_early_table_space always finds pages below 1M.
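
The calling convention is unchanged; a user such as the DMI scan looks
roughly like this (a sketch, with error handling trimmed):

	void __init dmi_scan_machine(void)
	{
		/* Map the BIOS area through the kernel portion of the
		 * address space; no identity mapping is involved. */
		char *p = early_ioremap(0xF0000, 0x10000);
		if (p != NULL) {
			/* ... look for the "_DMI_" anchor ... */
			early_iounmap(p, 0x10000);
		}
	}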

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/head.S | 3 -
arch/x86_64/mm/init.c | 104 ++++++++++++++++++---------------------------
2 files changed, 42 insertions(+), 65 deletions(-)

diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S
index 6df05e6..ca57a37 100644
--- a/arch/x86_64/kernel/head.S
+++ b/arch/x86_64/kernel/head.S
@@ -278,9 +278,6 @@ NEXT_PAGE(level2_ident_pgt)
.quad i << 21 | 0x083
i = i + 1
.endr
- /* Temporary mappings for the super early allocator in arch/x86_64/mm/init.c */
- .globl temp_boot_pmds
-temp_boot_pmds:
.fill 492,8,0

NEXT_PAGE(level2_kernel_pgt)
diff --git a/arch/x86_64/mm/init.c b/arch/x86_64/mm/init.c
index 0522c1c..b46566a 100644
--- a/arch/x86_64/mm/init.c
+++ b/arch/x86_64/mm/init.c
@@ -167,76 +167,56 @@ __set_fixmap (enum fixed_addresses idx,

unsigned long __initdata table_start, table_end;

-extern pmd_t temp_boot_pmds[];
-
-static struct temp_map {
- pmd_t *pmd;
- void *address;
- int allocated;
-} temp_mappings[] __initdata = {
- { &temp_boot_pmds[0], (void *)(40UL * 1024 * 1024) },
- { &temp_boot_pmds[1], (void *)(42UL * 1024 * 1024) },
- {}
-};
-
-static __init void *alloc_low_page(int *index, unsigned long *phys)
+static __init unsigned long alloc_low_page(void)
{
- struct temp_map *ti;
- int i;
- unsigned long pfn = table_end++, paddr;
- void *adr;
+ unsigned long pfn = table_end++;

if (pfn >= end_pfn)
panic("alloc_low_page: ran out of memory");
- for (i = 0; temp_mappings[i].allocated; i++) {
- if (!temp_mappings[i].pmd)
- panic("alloc_low_page: ran out of temp mappings");
- }
- ti = &temp_mappings[i];
- paddr = (pfn << PAGE_SHIFT) & PMD_MASK;
- set_pmd(ti->pmd, __pmd(paddr | _KERNPG_TABLE | _PAGE_PSE));
- ti->allocated = 1;
- __flush_tlb();
- adr = ti->address + ((pfn << PAGE_SHIFT) & ~PMD_MASK);
- memset(adr, 0, PAGE_SIZE);
- *index = i;
- *phys = pfn * PAGE_SIZE;
- return adr;
-}
-
-static __init void unmap_low_page(int i)
-{
- struct temp_map *ti;
-
- ti = &temp_mappings[i];
- set_pmd(ti->pmd, __pmd(0));
- ti->allocated = 0;
+ return pfn << PAGE_SHIFT;
}

/* Must run before zap_low_mappings */
__init void *early_ioremap(unsigned long addr, unsigned long size)
{
- unsigned long map = round_down(addr, LARGE_PAGE_SIZE);
-
- /* actually usually some more */
- if (size >= LARGE_PAGE_SIZE) {
- printk("SMBIOS area too long %lu\n", size);
- return NULL;
+ unsigned long vaddr;
+ pmd_t *pmd, *last_pmd;
+ int i, pmds;
+
+ pmds = ((addr & ~PMD_MASK) + size + ~PMD_MASK) / PMD_SIZE;
+ vaddr = __START_KERNEL_map;
+ pmd = level2_kernel_pgt;
+ last_pmd = level2_kernel_pgt + PTRS_PER_PMD - 1;
+ for (; pmd <= last_pmd; pmd++, vaddr += PMD_SIZE) {
+ for (i = 0; i < pmds; i++) {
+ if (pmd_present(pmd[i]))
+ goto next;
+ }
+ vaddr += addr & ~PMD_MASK;
+ addr &= PMD_MASK;
+ for (i = 0; i < pmds; i++, addr += PMD_SIZE)
+ set_pmd(pmd + i,__pmd(addr | __PAGE_KERNEL_LARGE));
+ __flush_tlb();
+ return (void *)vaddr;
+ next:
+ ;
}
- set_pmd(temp_mappings[0].pmd, __pmd(map | _KERNPG_TABLE | _PAGE_PSE));
- map += LARGE_PAGE_SIZE;
- set_pmd(temp_mappings[1].pmd, __pmd(map | _KERNPG_TABLE | _PAGE_PSE));
- __flush_tlb();
- return temp_mappings[0].address + (addr & (LARGE_PAGE_SIZE-1));
+ printk("early_ioremap(0x%lx, %lu) failed\n", addr, size);
+ return NULL;
}

/* To avoid virtual aliases later */
__init void early_iounmap(void *addr, unsigned long size)
{
- if ((void *)round_down((unsigned long)addr, LARGE_PAGE_SIZE) != temp_mappings[0].address)
- printk("early_iounmap: bad address %p\n", addr);
- set_pmd(temp_mappings[0].pmd, __pmd(0));
- set_pmd(temp_mappings[1].pmd, __pmd(0));
+ unsigned long vaddr;
+ pmd_t *pmd;
+ int i, pmds;
+
+ vaddr = (unsigned long)addr;
+ pmds = ((vaddr & ~PMD_MASK) + size + ~PMD_MASK) / PMD_SIZE;
+ pmd = level2_kernel_pgt + pmd_index(vaddr);
+ for (i = 0; i < pmds; i++)
+ pmd_clear(pmd + i);
__flush_tlb();
}

@@ -266,7 +246,6 @@ static void __init phys_pud_init(pud_t *
pud = pud + i;

for (; i < PTRS_PER_PUD; pud++, i++) {
- int map;
unsigned long paddr, pmd_phys;
pmd_t *pmd;

@@ -279,10 +258,11 @@ static void __init phys_pud_init(pud_t *
continue;
}

- pmd = alloc_low_page(&map, &pmd_phys);
+ pmd_phys = alloc_low_page();
+ pmd = early_ioremap(pmd_phys, PAGE_SIZE);
set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
phys_pmd_init(pmd, paddr, end);
- unmap_low_page(map);
+ early_iounmap(pmd, PAGE_SIZE);
}
__flush_tlb();
}
@@ -333,19 +313,19 @@ void __init init_memory_mapping(unsigned
end = (unsigned long)__va(end);

for (; start < end; start = next) {
- int map;
unsigned long pud_phys;
pgd_t *pgd = pgd_offset_k(start);
pud_t *pud;

- pud = alloc_low_page(&map, &pud_phys);
+ pud_phys = alloc_low_page();
+ pud = early_ioremap(pud_phys, PAGE_SIZE);

next = start + PGDIR_SIZE;
if (next > end)
next = end;
phys_pud_init(pud, __pa(start), __pa(next));
- set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
- unmap_low_page(map);
+ set_pgd(pgd, mk_kernel_pgd(pud_phys));
+ early_iounmap(pud, PAGE_SIZE);
}

asm volatile("movq %%cr4,%0" : "=r" (mmu_cr4_features));
--
1.4.2.rc2.g5209e

2006-08-01 11:07:17

by Eric W. Biederman

Subject: [PATCH 27/33] x86_64: Modify discover_ebda to use virtual addresses
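
The EBDA pointer lives at physical address 0x40E; dereferencing that
directly only works while low memory happens to be identity mapped.
Going through __va() uses the kernel's direct mapping instead, which
keeps working once the early identity mappings are gone.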

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/setup.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
index 66816ba..21840ca 100644
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -505,10 +505,10 @@ static void discover_ebda(void)
* there is a real-mode segmented pointer pointing to the
* 4K EBDA area at 0x40E
*/
- ebda_addr = *(unsigned short *)EBDA_ADDR_POINTER;
+ ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER);
ebda_addr <<= 4;

- ebda_size = *(unsigned short *)(unsigned long)ebda_addr;
+ ebda_size = *(unsigned short *)__va(ebda_addr);

/* Round EBDA up to pages */
if (ebda_size == 0)
--
1.4.2.rc2.g5209e

2006-08-01 11:06:59

by Eric W. Biederman

Subject: [PATCH 19/33] x86_64: Cleanup the early boot page table.

- Merge physmem_pgt and ident_pgt, removing physmem_pgt. The merged
mapping is discarded as soon as mm/init.c:init_memory_mapping is run.
- As physmem_pgt is gone, don't export it in pgtable.h.
- Use defines from pgtable.h for page permissions.
- Fix the physical memory identity mapping so it is at the correct
address.
- Remove the physical memory mapping from wakeup_level4_pgt; it
is at the wrong address so we can't possibly be using it.
- Simplify NEXT_PAGE. The work to calculate the phys_ alias
of the labels was very cool. Unfortunately it was a brittle
special-purpose hack that makes maintenance more difficult.
Instead just use label - __START_KERNEL_map like we do
everywhere else in assembly.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/head.S | 61 +++++++++++++++++++-----------------------
include/asm-x86_64/pgtable.h | 1 -
2 files changed, 28 insertions(+), 34 deletions(-)

diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S
index ca57a37..a9e34d9 100644
--- a/arch/x86_64/kernel/head.S
+++ b/arch/x86_64/kernel/head.S
@@ -15,6 +15,7 @@ #include <linux/threads.h>
#include <linux/init.h>
#include <asm/desc.h>
#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
#include <asm/msr.h>
#include <asm/cache.h>
@@ -250,52 +251,48 @@ ljumpvector:
ENTRY(stext)
ENTRY(_stext)

- $page = 0
#define NEXT_PAGE(name) \
- $page = $page + 1; \
- .org $page * 0x1000; \
- phys_/**/name = $page * 0x1000 + __PHYSICAL_START; \
+ .balign PAGE_SIZE; \
ENTRY(name)

+/* Automate the creation of 1 to 1 mapping pmd entries */
+#define PMDS(START, PERM, COUNT) \
+ i = 0 ; \
+ .rept (COUNT) ; \
+ .quad (START) + (i << 21) + (PERM) ; \
+ i = i + 1 ; \
+ .endr
+
NEXT_PAGE(init_level4_pgt)
/* This gets initialized in x86_64_start_kernel */
.fill 512,8,0

NEXT_PAGE(level3_ident_pgt)
- .quad phys_level2_ident_pgt | 0x007
+ .quad level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
.fill 511,8,0

NEXT_PAGE(level3_kernel_pgt)
.fill 510,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
- .quad phys_level2_kernel_pgt | 0x007
+ .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
.fill 1,8,0

NEXT_PAGE(level2_ident_pgt)
- /* 40MB for bootup. */
- i = 0
- .rept 20
- .quad i << 21 | 0x083
- i = i + 1
- .endr
- .fill 492,8,0
+ /* Since I easily can, map the first 1G.
+ * Don't set NX because code runs from these pages.
+ */
+ PMDS(0x0000000000000000, __PAGE_KERNEL_LARGE_EXEC, PTRS_PER_PMD)

NEXT_PAGE(level2_kernel_pgt)
/* 40MB kernel mapping. The kernel code cannot be bigger than that.
When you change this change KERNEL_TEXT_SIZE in page.h too. */
/* (2^48-(2*1024*1024*1024)-((2^39)*511)-((2^30)*510)) = 0 */
- i = 0
- .rept 20
- .quad i << 21 | 0x183
- i = i + 1
- .endr
+ PMDS(0x0000000000000000, __PAGE_KERNEL_LARGE_EXEC|_PAGE_GLOBAL,
+ KERNEL_TEXT_SIZE/PMD_SIZE)
/* Module mapping starts here */
- .fill 492,8,0
-
-NEXT_PAGE(level3_physmem_pgt)
- .quad phys_level2_kernel_pgt | 0x007 /* so that __va works even before pagetable_init */
- .fill 511,8,0
+ .fill (PTRS_PER_PMD - (KERNEL_TEXT_SIZE/PMD_SIZE)),8,0

+#undef PMDS
#undef NEXT_PAGE

.data
@@ -303,12 +300,10 @@ #undef NEXT_PAGE
#ifdef CONFIG_ACPI_SLEEP
.align PAGE_SIZE
ENTRY(wakeup_level4_pgt)
- .quad phys_level3_ident_pgt | 0x007
- .fill 255,8,0
- .quad phys_level3_physmem_pgt | 0x007
- .fill 254,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 510,8,0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad phys_level3_kernel_pgt | 0x007
+ .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
#endif

#ifndef CONFIG_HOTPLUG_CPU
@@ -322,12 +317,12 @@ #endif
*/
.align PAGE_SIZE
ENTRY(boot_level4_pgt)
- .quad phys_level3_ident_pgt | 0x007
- .fill 255,8,0
- .quad phys_level3_physmem_pgt | 0x007
- .fill 254,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 257,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 252,8,0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad phys_level3_kernel_pgt | 0x007
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

.data

diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index 211a2ca..5528e8b 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -15,7 +15,6 @@ #include <linux/threads.h>
#include <asm/pda.h>

extern pud_t level3_kernel_pgt[512];
-extern pud_t level3_physmem_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
extern pgd_t init_level4_pgt[];
--
1.4.2.rc2.g5209e

2006-08-01 11:07:55

by Eric W. Biederman

Subject: [PATCH 32/33] x86_64: Relocatable kernel support

This patch modifies the x86_64 kernel so that it can be loaded and run
at any 2M-aligned address below 512G. The technique used is to
compile the decompressor with -fPIC and modify it so the decompressor
is fully relocatable. For the main kernel the page tables are
modified so the kernel remains at the same virtual address. In
addition, a variable phys_base holds the physical address the kernel
is loaded at, and __pa_symbol is modified to add phys_base in when
we take the address of a kernel symbol.

When loaded with a normal bootloader the decompressor will decompress
the kernel to 2M and it will run there. This both ensures the
relocation code is always exercised, and makes it easier to use 2M
pages for the kernel and the cpu.
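
A sketch of the resulting arithmetic (illustrative numbers, not values
from the patch):

	/* With the kernel linked at __START_KERNEL_map and actually
	 * loaded at physical 16MB, phys_base ends up 0x1000000 and
	 * __pa_symbol() folds it back in. */
	#define __START_KERNEL_map 0xffffffff80000000UL

	unsigned long pa_symbol(unsigned long vaddr, unsigned long phys_base)
	{
		return (vaddr - __START_KERNEL_map) + phys_base;
	}
	/* pa_symbol(0xffffffff80200000UL, 0x1000000UL) == 0x1200000UL */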

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/boot/compressed/Makefile | 13 +
arch/x86_64/boot/compressed/head.S | 301 ++++++++++++++++++++++---------
arch/x86_64/boot/compressed/misc.c | 248 +++++++++++++-------------
arch/x86_64/boot/compressed/vmlinux.lds | 44 +++++
arch/x86_64/boot/compressed/vmlinux.scr | 5 -
arch/x86_64/kernel/head.S | 226 +++++++++++++----------
arch/x86_64/kernel/vmlinux.lds.S | 2
include/asm-x86_64/page.h | 6 -
8 files changed, 527 insertions(+), 318 deletions(-)

diff --git a/arch/x86_64/boot/compressed/Makefile b/arch/x86_64/boot/compressed/Makefile
index 8987c97..3dda50c 100644
--- a/arch/x86_64/boot/compressed/Makefile
+++ b/arch/x86_64/boot/compressed/Makefile
@@ -7,16 +7,15 @@ # Note all the files here are compiled/l
#

targets := vmlinux vmlinux.bin vmlinux.bin.gz head.o misc.o piggy.o
-EXTRA_AFLAGS := -traditional -m32
+EXTRA_AFLAGS := -traditional

# cannot use EXTRA_CFLAGS because base CFLAGS contains -mkernel which conflicts with
-# -m32
-CFLAGS := -m32 -D__KERNEL__ -Iinclude -O2 -fno-strict-aliasing -fno-builtin
-LDFLAGS := -m elf_i386
+CFLAGS := -m64 -D__KERNEL__ -Iinclude -O2 -fno-strict-aliasing -fPIC -mcmodel=small -fno-builtin
+LDFLAGS := -m elf_x86_64

-LDFLAGS_vmlinux := -Ttext $(IMAGE_OFFSET) -e startup_32 -m elf_i386
+LDFLAGS_vmlinux := -T

-$(obj)/vmlinux: $(obj)/head.o $(obj)/misc.o $(obj)/piggy.o FORCE
+$(obj)/vmlinux: $(src)/vmlinux.lds $(obj)/head.o $(obj)/misc.o $(obj)/piggy.o FORCE
$(call if_changed,ld)
@:

@@ -26,7 +25,7 @@ LDFLAGS_vmlinux := -Ttext $(IMAGE_OFFSET
$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin FORCE
$(call if_changed,gzip)

-LDFLAGS_piggy.o := -r --format binary --oformat elf32-i386 -T
+LDFLAGS_piggy.o := -r --format binary --oformat elf64-x86-64 -T

$(obj)/piggy.o: $(obj)/vmlinux.scr $(obj)/vmlinux.bin.gz FORCE
$(call if_changed,ld)
diff --git a/arch/x86_64/boot/compressed/head.S b/arch/x86_64/boot/compressed/head.S
index cf55d09..22c8dc4 100644
--- a/arch/x86_64/boot/compressed/head.S
+++ b/arch/x86_64/boot/compressed/head.S
@@ -26,116 +26,245 @@

#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
+#include <asm/msr.h>

+.section ".text.head"
.code32
.globl startup_32

startup_32:
cld
cli
- movl $(__KERNEL_DS),%eax
- movl %eax,%ds
- movl %eax,%es
- movl %eax,%fs
- movl %eax,%gs
-
- lss stack_start,%esp
- xorl %eax,%eax
-1: incl %eax # check that A20 really IS enabled
- movl %eax,0x000000 # loop forever if it isn't
- cmpl %eax,0x100000
- je 1b
+ movl $(__KERNEL_DS), %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+
+/* Calculate the delta between where we were compiled to run
+ * at and where we were actually loaded at. This can only be done
+ * with a short local call on x86. Nothing else will tell us what
+ * address we are running at. The reserved chunk of the real-mode
+ * data at 0x34-0x3f are used as the stack for this calculation.
+ * Only 4 bytes are needed.
+ */
+ leal 0x40(%esi), %esp
+ call 1f
+1: popl %ebp
+ subl $1b, %ebp
+
+/* Compute the delta between where we were compiled to run at
+ * and where the code will actually run at.
+ */
+ movl %ebp, %ebx
+ addl $(LARGE_PAGE_SIZE -1), %ebx
+ andl $LARGE_PAGE_MASK, %ebx
+
+ /* Replace the compressed data size with the uncompressed size */
+ subl input_len(%ebp), %ebx
+ movl output_len(%ebp), %eax
+ addl %eax, %ebx
+ /* Add 8 bytes for every 32K input block */
+ shrl $12, %eax
+ addl %eax, %ebx
+ /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
+ addl $(32768 + 18 + 4095), %ebx
+ andl $~4095, %ebx

/*
- * Initialize eflags. Some BIOS's leave bits like NT set. This would
- * confuse the debugger if this code is traced.
- * XXX - best to initialize before switching to protected mode.
+ * Prepare for entering 64 bit mode
*/
- pushl $0
- popfl
+
+ /* Load new GDT with the 64bit segments using 32bit descriptor */
+ leal gdt(%ebp), %eax
+ movl %eax, gdt+2(%ebp)
+ lgdt gdt(%ebp)
+
+ /* Enable PAE mode */
+ xorl %eax, %eax
+ orl $(1 << 5), %eax
+ movl %eax, %cr4
+
/*
- * Clear BSS
+ * Build early 4G boot pagetable
*/
- xorl %eax,%eax
- movl $_edata,%edi
- movl $_end,%ecx
- subl %edi,%ecx
- cld
- rep
- stosb
+ /* Initialize Page tables to 0*/
+ leal pgtable(%ebx), %edi
+ xorl %eax, %eax
+ movl $((4096*6)/4), %ecx
+ rep stosl
+
+ /* Build Level 4 */
+ leal pgtable + 0(%ebx), %edi
+ leal 0x1007 (%edi), %eax
+ movl %eax, 0(%edi)
+
+ /* Build Level 3 */
+ leal pgtable + 0x1000(%ebx), %edi
+ leal 0x1007(%edi), %eax
+ movl $4, %ecx
+1: movl %eax, 0x00(%edi)
+ addl $0x00001000, %eax
+ addl $8, %edi
+ decl %ecx
+ jnz 1b
+
+ /* Build Level 2 */
+ leal pgtable + 0x2000(%ebx), %edi
+ movl $0x00000183, %eax
+ movl $2048, %ecx
+1: movl %eax, 0(%edi)
+ addl $0x00200000, %eax
+ addl $8, %edi
+ decl %ecx
+ jnz 1b
+
+ /* Enable the boot page tables */
+ leal pgtable(%ebx), %eax
+ movl %eax, %cr3
+
+ /* Enable Long mode in EFER (Extended Feature Enable Register) */
+ movl $MSR_EFER, %ecx
+ rdmsr
+ btsl $_EFER_LME, %eax
+ wrmsr
+
+ /* Setup for the jump to 64bit mode
+ *
+ * When the jump is performed we will be in long mode but
+ * in 32bit compatibility mode with EFER.LME = 1, CS.L = 0, CS.D = 1
+ * (and in turn EFER.LMA = 1). To jump into 64bit mode we use
+ * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
+ * We place all of the values on our mini stack so lret can
+ * be used to perform that far jump.
+ */
+ pushl $__KERNEL_CS
+ leal startup_64(%ebp), %eax
+ pushl %eax
+
+ /* Enter paged protected Mode, activating Long Mode */
+ movl $0x80000001, %eax /* Enable Paging and Protected mode */
+ movl %eax, %cr0
+
+ /* Jump from 32bit compatibility mode into 64bit mode. */
+ lret
+
+ /* Be careful here startup_64 needs to be at a predictable
+ * address so I can export it in an ELF header. Bootloaders
+ * should look at the ELF header to find this address, as
+ * it may change in the future.
+ */
+ .code64
+ .org 0x100
+ENTRY(startup_64)
+ /* We come here either from startup_32 or directly from a
+ * 64bit bootloader. If we come here from a bootloader we depend on
+ * an identity mapped page table being provided that maps our
+ * entire text+data+bss and hopefully all of memory.
+ */
+
+ /* Setup data segments. */
+ xorl %eax, %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+
+ /* Compute the decompressed kernel start address. It is where
+ * we were loaded at aligned to a 2M boundary.
+ */
+ leaq startup_32(%rip) /* - $startup_32 */, %rbp
+ addq $(LARGE_PAGE_SIZE - 1), %rbp
+ andq $LARGE_PAGE_MASK, %rbp
+
+/* Compute the delta between where we were compiled to run at
+ * and where the code will actually run at.
+ */
+ /* Start with the delta to where the kernel will run at. */
+ movq %rbp, %rbx
+
+ /* Replace the compressed data size with the uncompressed size */
+ movl input_len(%rip), %eax
+ subq %rax, %rbx
+ movl output_len(%rip), %eax
+ addq %rax, %rbx
+ /* Add 8 bytes for every 32K input block */
+ shrq $12, %rax
+ addq %rax, %rbx
+ /* Add 32K + 18 bytes of extra slack and align on a 4K boundary */
+ addq $(32768 + 18 + 4095), %rbx
+ andq $~4095, %rbx
+
+/* Copy the compressed kernel to the end of our buffer
+ * where decompression in place becomes safe.
+ */
+ leaq _end(%rip), %r8
+ leaq _end(%rbx), %r9
+ movq $_end /* - $startup_32 */, %rcx
+1: subq $8, %r8
+ subq $8, %r9
+ movq 0(%r8), %rax
+ movq %rax, 0(%r9)
+ subq $8, %rcx
+ jnz 1b
+
/*
- * Do the decompression, and jump to the new kernel..
+ * Jump to the relocated address.
*/
- subl $16,%esp # place for structure on the stack
- movl %esp,%eax
- pushl %esi # real mode pointer as second arg
- pushl %eax # address of structure as first arg
- call decompress_kernel
- orl %eax,%eax
- jnz 3f
- addl $8,%esp
- xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $0x200000
+ leaq relocated(%rbx), %rax
+ jmp *%rax
+
+.section ".text"
+relocated:

/*
- * We come here, if we were loaded high.
- * We need to move the move-in-place routine down to 0x1000
- * and then start it with the buffer addresses in registers,
- * which we got from the stack.
+ * Clear BSS
*/
-3:
- movl %esi,%ebx
- movl $move_routine_start,%esi
- movl $0x1000,%edi
- movl $move_routine_end,%ecx
- subl %esi,%ecx
- addl $3,%ecx
- shrl $2,%ecx
+ xorq %rax, %rax
+ movq $_edata, %rdi
+ movq $_end, %rcx
+ subq %rdi, %rcx
cld
rep
- movsl
-
- popl %esi # discard the address
- addl $4,%esp # real mode pointer
- popl %esi # low_buffer_start
- popl %ecx # lcount
- popl %edx # high_buffer_start
- popl %eax # hcount
- movl $0x200000,%edi
- cli # make sure we don't get interrupted
- ljmp $(__KERNEL_CS), $0x1000 # and jump to the move routine
+ stosb
+
+ /* Setup the stack */
+ leaq user_stack_end(%rip), %rsp
+
+ /* zero EFLAGS after setting rsp */
+ pushq $0
+ popfq

/*
- * Routine (template) for moving the decompressed kernel in place,
- * if we were high loaded. This _must_ PIC-code !
+ * Do the decompression, and jump to the new kernel..
*/
-move_routine_start:
- movl %ecx,%ebp
- shrl $2,%ecx
- rep
- movsl
- movl %ebp,%ecx
- andl $3,%ecx
- rep
- movsb
- movl %edx,%esi
- movl %eax,%ecx # NOTE: rep movsb won't move if %ecx == 0
- addl $3,%ecx
- shrl $2,%ecx
- rep
- movsl
- movl %ebx,%esi # Restore setup pointer
- xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $0x200000
-move_routine_end:
+ pushq %rsi # Save the real mode argument
+ movq %rsi, %rdi # real mode address
+ leaq _heap(%rip), %rsi # _heap
+ leaq input_data(%rip), %rdx # input_data
+ movl input_len(%rip), %eax
+ movq %rax, %rcx # input_len
+ movq %rbp, %r8 # output
+ call decompress_kernel
+ popq %rsi

+/*
+ * Jump to the decompressed kernel.
+ */
+ jmp *%rbp

-/* Stack for uncompression */
- .align 32
+ .data
+gdt:
+ .word gdt_end - gdt
+ .long gdt
+ .word 0
+ .quad 0x0000000000000000 /* NULL descriptor */
+ .quad 0x00af9a000000ffff /* __KERNEL_CS */
+ .quad 0x00cf92000000ffff /* __KERNEL_DS */
+gdt_end:
+ .bss
+/* Stack for uncompression */
+ .balign 4
user_stack:
.fill 4096,4,0
-stack_start:
- .long user_stack+4096
- .word __KERNEL_DS
-
+user_stack_end:
diff --git a/arch/x86_64/boot/compressed/misc.c b/arch/x86_64/boot/compressed/misc.c
index 2cbd7cb..0e6c4b7 100644
--- a/arch/x86_64/boot/compressed/misc.c
+++ b/arch/x86_64/boot/compressed/misc.c
@@ -9,12 +9,97 @@
* High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996
*/

+#define _LINUX_STRING_H_ 1
+#define __LINUX_BITMAP_H 1
+
+#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/serial_reg.h>
#include <asm/io.h>
#include <asm/page.h>
#include <asm/setup.h>

+/* WARNING!!
+ * This code is compiled with -fPIC and it is relocated dynamically
+ * at run time, but no relocation processing is performed.
+ * This means that it is not safe to place pointers in static structures.
+ */
+
+/*
+ * Getting to provably safe in place decompression is hard.
+ * Worst case behaviours need to be analyzed.
+ * Background information:
+ *
+ * The file layout is:
+ * magic[2]
+ * method[1]
+ * flags[1]
+ * timestamp[4]
+ * extraflags[1]
+ * os[1]
+ * compressed data blocks[N]
+ * crc[4] orig_len[4]
+ *
+ * resulting in 18 bytes of non compressed data overhead.
+ *
+ * Files divided into blocks
+ * 1 bit (last block flag)
+ * 2 bits (block type)
+ *
+ * 1 block occurs every 32K-1 bytes, or when 50% compression has been achieved.
+ * The smallest block type encoding is always used.
+ *
+ * stored:
+ * 32 bits length in bytes.
+ *
+ * fixed:
+ * magic fixed tree.
+ * symbols.
+ *
+ * dynamic:
+ * dynamic tree encoding.
+ * symbols.
+ *
+ *
+ * The buffer for decompression in place is the length of the
+ * uncompressed data, plus a small amount extra to keep the algorithm safe.
+ * The compressed data is placed at the end of the buffer. The output
+ * pointer is placed at the start of the buffer and the input pointer
+ * is placed where the compressed data starts. Problems will occur
+ * when the output pointer overruns the input pointer.
+ *
+ * The output pointer can only overrun the input pointer if the input
+ * pointer is moving faster than the output pointer. A condition only
+ * triggered by data whose compressed form is larger than the uncompressed
+ * form.
+ *
+ * The worst case at the block level is a growth of the compressed data
+ * of 5 bytes per 32767 bytes.
+ *
+ * The worst case internal to a compressed block is very hard to figure.
+ * The worst case can at least be bounded by having one bit that represents
+ * 32764 bytes and then all of the rest of the bytes representing the very
+ * very last byte.
+ *
+ * All of which is enough to compute an amount of extra data that is required
+ * to be safe. To avoid problems at the block level allocating 5 extra bytes
+ * per 32767 bytes of data is sufficient. To avoid problems internal to a block
+ * adding an extra 32767 bytes (the worst case uncompressed block size) is
+ * sufficient, to ensure that in the worst case the decompressed data for
+ * a block will stop the byte before the compressed data for a block begins.
+ * To avoid problems with the compressed data's meta information an extra 18
+ * bytes are needed. Leading to the formula:
+ *
+ * extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size.
+ *
+ * Adding 8 bytes per 32K is a bit excessive but much easier to calculate.
+ * Adding 32768 instead of 32767 just makes for round numbers.
+ * Adding the decompressor_size is necessary as it must live after all
+ * of the data as well. Last I measured the decompressor is about 14K.
+ * 10K of actual data and 4K of bss.
+ *
+ */
+
/*
* gzip declarations
*/
@@ -30,15 +115,20 @@ typedef unsigned char uch;
typedef unsigned short ush;
typedef unsigned long ulg;

-#define WSIZE 0x8000 /* Window size must be at least 32k, */
- /* and a power of two */
+#define WSIZE 0x80000000 /* Window size must be at least 32k,
+ * and a power of two
+ * We don't actually have a window just
+ * a huge output buffer so I report
+ * a 2G windows size, as that should
+ * always be larger than our output buffer.
+ */

-static uch *inbuf; /* input buffer */
-static uch window[WSIZE]; /* Sliding window buffer */
+static uch *inbuf; /* input buffer */
+static uch *window; /* Sliding window buffer, (and final output buffer) */

-static unsigned insize = 0; /* valid bytes in inbuf */
-static unsigned inptr = 0; /* index of next byte to be processed in inbuf */
-static unsigned outcnt = 0; /* bytes in output buffer */
+static unsigned insize; /* valid bytes in inbuf */
+static unsigned inptr; /* index of next byte to be processed in inbuf */
+static unsigned outcnt; /* bytes in output buffer */

/* gzip flag byte */
#define ASCII_FLAG 0x01 /* bit 0 set: file probably ASCII text */
@@ -94,8 +184,6 @@ extern unsigned char input_data[];
extern int input_len;

static long bytes_out = 0;
-static uch *output_data;
-static unsigned long output_ptr = 0;

static void *malloc(int size);
static void free(void *where);
@@ -109,17 +197,10 @@ static char *strstr(const char *haystack
static void putstr(const char *);
static unsigned simple_strtou(const char *cp, char **endp, unsigned base);

-extern int end;
-static long free_mem_ptr = (long)&end;
+static long free_mem_ptr;
static long free_mem_end_ptr;

-#define INPLACE_MOVE_ROUTINE 0x1000
-#define LOW_BUFFER_START 0x2000
-#define LOW_BUFFER_MAX 0x90000
-#define HEAP_SIZE 0x3000
-static unsigned int low_buffer_end, low_buffer_size;
-static int high_loaded =0;
-static uch *high_buffer_start /* = (uch *)(((ulg)&end) + HEAP_SIZE)*/;
+#define HEAP_SIZE 0x6000

static char *vidmem;
static int vidport;
@@ -444,58 +525,31 @@ static char *strstr(const char *haystack
*/
static int fill_inbuf(void)
{
- if (insize != 0) {
- error("ran out of input data");
- }
-
- inbuf = input_data;
- insize = input_len;
- inptr = 1;
- return inbuf[0];
+ error("ran out of input data");
+ return 0;
}

/* ===========================================================================
* Write the output window window[0..outcnt-1] and update crc and bytes_out.
* (Used for the decompressed data only.)
*/
-static void flush_window_low(void)
-{
- ulg c = crc; /* temporary variable */
- unsigned n;
- uch *in, *out, ch;
-
- in = window;
- out = &output_data[output_ptr];
- for (n = 0; n < outcnt; n++) {
- ch = *out++ = *in++;
- c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
- }
- crc = c;
- bytes_out += (ulg)outcnt;
- output_ptr += (ulg)outcnt;
- outcnt = 0;
-}
-
-static void flush_window_high(void)
-{
- ulg c = crc; /* temporary variable */
- unsigned n;
- uch *in, ch;
- in = window;
- for (n = 0; n < outcnt; n++) {
- ch = *output_data++ = *in++;
- if ((ulg)output_data == low_buffer_end) output_data=high_buffer_start;
- c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
- }
- crc = c;
- bytes_out += (ulg)outcnt;
- outcnt = 0;
-}
-
static void flush_window(void)
{
- if (high_loaded) flush_window_high();
- else flush_window_low();
+ /* With my window equal to my output buffer
+ * I only need to compute the crc here.
+ */
+ ulg c = crc; /* temporary variable */
+ unsigned n;
+ uch *in, ch;
+
+ in = window;
+ for (n = 0; n < outcnt; n++) {
+ ch = *in++;
+ c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
+ }
+ crc = c;
+ bytes_out += (ulg)outcnt;
+ outcnt = 0;
}

static void error(char *x)
@@ -507,56 +561,6 @@ static void error(char *x)
while(1); /* Halt */
}

-static void setup_normal_output_buffer(void)
-{
-#ifdef STANDARD_MEMORY_BIOS_CALL
- if (RM_EXT_MEM_K < 1024) error("Less than 2MB of memory");
-#else
- if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < 1024) error("Less than 2MB of memory");
-#endif
- output_data = (unsigned char *)0x200000;
- free_mem_end_ptr = (long)real_mode;
-}
-
-struct moveparams {
- uch *low_buffer_start; int lcount;
- uch *high_buffer_start; int hcount;
-};
-
-static void setup_output_buffer_if_we_run_high(struct moveparams *mv)
-{
- high_buffer_start = (uch *)(((ulg)&end) + HEAP_SIZE);
-#ifdef STANDARD_MEMORY_BIOS_CALL
- if (RM_EXT_MEM_K < (3*1024)) error("Less than 4MB of memory");
-#else
- if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < (3*1024)) error("Less than 4MB of memory");
-#endif
- mv->low_buffer_start = output_data = (unsigned char *)LOW_BUFFER_START;
- low_buffer_end = ((unsigned int)real_mode > LOW_BUFFER_MAX
- ? LOW_BUFFER_MAX : (unsigned int)real_mode) & ~0xfff;
- low_buffer_size = low_buffer_end - LOW_BUFFER_START;
- high_loaded = 1;
- free_mem_end_ptr = (long)high_buffer_start;
- if ( (0x200000 + low_buffer_size) > ((ulg)high_buffer_start)) {
- high_buffer_start = (uch *)(0x200000 + low_buffer_size);
- mv->hcount = 0; /* say: we need not to move high_buffer */
- }
- else mv->hcount = -1;
- mv->high_buffer_start = high_buffer_start;
-}
-
-static void close_output_buffer_if_we_run_high(struct moveparams *mv)
-{
- if (bytes_out > low_buffer_size) {
- mv->lcount = low_buffer_size;
- if (mv->hcount)
- mv->hcount = bytes_out - low_buffer_size;
- } else {
- mv->lcount = bytes_out;
- mv->hcount = 0;
- }
-}
-
static void save_command_line(void)
{
/* Find the command line */
@@ -571,20 +575,28 @@ static void save_command_line(void)
saved_command_line[COMMAND_LINE_SIZE - 1] = '\0';
}

-int decompress_kernel(struct moveparams *mv, void *rmode)
+asmlinkage void decompress_kernel(void *rmode, unsigned long heap,
+ uch *input_data, unsigned long input_len, uch *output)
{
real_mode = rmode;
-
save_command_line();
console_init(saved_command_line);

- if (free_mem_ptr < 0x100000) setup_normal_output_buffer();
- else setup_output_buffer_if_we_run_high(mv);
+ window = output; /* Output buffer (Normally at 1M) */
+ free_mem_ptr = heap; /* Heap */
+ free_mem_end_ptr = heap + HEAP_SIZE;
+ inbuf = input_data; /* Input buffer */
+ insize = input_len;
+ inptr = 0;
+
+ if ((ulg)output & 0x1fffffUL)
+ error("Destination address not 2M aligned");
+ if ((ulg)output >= 0xffffffffffUL)
+ error("Destination address too large");

makecrc();
putstr(".\nDecompressing Linux...");
gunzip();
putstr("done.\nBooting the kernel.\n");
- if (high_loaded) close_output_buffer_if_we_run_high(mv);
- return high_loaded;
+ return;
}
diff --git a/arch/x86_64/boot/compressed/vmlinux.lds b/arch/x86_64/boot/compressed/vmlinux.lds
new file mode 100644
index 0000000..94c13e5
--- /dev/null
+++ b/arch/x86_64/boot/compressed/vmlinux.lds
@@ -0,0 +1,44 @@
+OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
+OUTPUT_ARCH(i386:x86-64)
+ENTRY(startup_64)
+SECTIONS
+{
+ /* Be careful parts of head.S assume startup_32 is at
+ * address 0.
+ */
+ . = 0;
+ .text : {
+ _head = . ;
+ *(.text.head)
+ _ehead = . ;
+ *(.text.compressed)
+ _text = .; /* Text */
+ *(.text)
+ *(.text.*)
+ _etext = . ;
+ }
+ .rodata : {
+ _rodata = . ;
+ *(.rodata) /* read-only data */
+ *(.rodata.*)
+ _erodata = . ;
+ }
+ .data : {
+ _data = . ;
+ *(.data)
+ *(.data.*)
+ _edata = . ;
+ }
+ .bss : {
+ _bss = . ;
+ *(.bss)
+ *(.bss.*)
+ *(COMMON)
+ . = ALIGN(8);
+ _end = . ;
+ . = ALIGN(4096);
+ pgtable = . ;
+ . = . + 4096 * 6;
+ _heap = .;
+ }
+}
diff --git a/arch/x86_64/boot/compressed/vmlinux.scr b/arch/x86_64/boot/compressed/vmlinux.scr
index 1ed9d79..48117cf 100644
--- a/arch/x86_64/boot/compressed/vmlinux.scr
+++ b/arch/x86_64/boot/compressed/vmlinux.scr
@@ -1,9 +1,10 @@
SECTIONS
{
- .data : {
+ .text.compressed : {
input_len = .;
LONG(input_data_end - input_data) input_data = .;
- *(.data)
+ *(.data)
+ output_len = . - 4;
input_data_end = .;
}
}
diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S
index b0b5618..b821d13 100644
--- a/arch/x86_64/kernel/head.S
+++ b/arch/x86_64/kernel/head.S
@@ -5,6 +5,7 @@
* Copyright (C) 2000 Pavel Machek <[email protected]>
* Copyright (C) 2000 Karsten Keil <[email protected]>
* Copyright (C) 2001,2002 Andi Kleen <[email protected]>
+ * Copyright (C) 2005 Eric Biederman <[email protected]>
*
* $Id: head.S,v 1.49 2002/03/19 17:39:25 ak Exp $
*/
@@ -21,94 +22,121 @@ #include <asm/msr.h>
#include <asm/cache.h>

/* we are not able to switch in one step to the final KERNEL ADRESS SPACE
- * because we need identity-mapped pages on setup so define __START_KERNEL to
- * 0x100000 for this stage
+ * because we need identity-mapped pages.
*
*/

.text
.section .bootstrap.text
- .code32
- .globl startup_32
-/* %bx: 1 if coming from smp trampoline on secondary cpu */
-startup_32:
-
+ .code64
+ .globl startup_64
+startup_64:
+
/*
- * At this point the CPU runs in 32bit protected mode (CS.D = 1) with
- * paging disabled and the point of this file is to switch to 64bit
- * long mode with a kernel mapping for kerneland to jump into the
- * kernel virtual addresses.
- * There is no stack until we set one up.
+ * At this point the CPU runs in 64bit mode CS.L = 1 CS.D = 1,
+ * and someone has loaded an identity mapped page table
+ * for us. These identity mapped page tables map all of the
+ * kernel pages and possibly all of memory.
+ *
+ * %esi holds a physical pointer to real_mode_data.
+ *
+ * We come here either directly from a 64bit bootloader, or from
+ * arch/x86_64/boot/compressed/head.S.
+ *
+ * We only come here initially at boot; nothing else comes here.
+ *
+ * Since we may be loaded at an address different from what we were
+ * compiled to run at we first fixup the physical addresses in our page
+ * tables and then reload them.
*/

- /* Initialize the %ds segment register */
- movl $__KERNEL_DS,%eax
- movl %eax,%ds
-
- /* Load new GDT with the 64bit segments using 32bit descriptor */
- lgdt pGDT32 - __START_KERNEL_map
-
- /* If the CPU doesn't support CPUID this will double fault.
- * Unfortunately it is hard to check for CPUID without a stack.
+ /* Compute the delta between the address I am compiled to run at and the
+ * address I am actually running at.
*/
-
- /* Check if extended functions are implemented */
- movl $0x80000000, %eax
- cpuid
- cmpl $0x80000000, %eax
- jbe no_long_mode
- /* Check if long mode is implemented */
- mov $0x80000001, %eax
- cpuid
- btl $29, %edx
- jnc no_long_mode
-
- /*
- * Prepare for entering 64bits mode
+ leaq _text(%rip), %rbp
+ subq $_text - __START_KERNEL_map, %rbp
+
+ /* Is the address not 2M aligned? */
+ movq %rbp, %rax
+ andl $~LARGE_PAGE_MASK, %eax
+ testl %eax, %eax
+ jnz bad_address
+
+ /* Is the address too large? */
+ leaq _text(%rip), %rdx
+ movq $PGDIR_SIZE, %rax
+ cmpq %rax, %rdx
+ jae bad_address
+
+ /* Fixup the physical addresses in the page table
*/
-
- /* Enable PAE mode */
- xorl %eax, %eax
- btsl $5, %eax
- movl %eax, %cr4
-
- /* Setup early boot stage 4 level pagetables */
- movl $(init_level4_pgt - __START_KERNEL_map), %eax
- movl %eax, %cr3
-
- /* Setup EFER (Extended Feature Enable Register) */
- movl $MSR_EFER, %ecx
- rdmsr
-
- /* Enable Long Mode */
- btsl $_EFER_LME, %eax
-
- /* Make changes effective */
- wrmsr
-
- xorl %eax, %eax
- btsl $31, %eax /* Enable paging and in turn activate Long Mode */
- btsl $0, %eax /* Enable protected mode */
- /* Make changes effective */
- movl %eax, %cr0
- /*
- * At this point we're in long mode but in 32bit compatibility mode
- * with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
- * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we use
- * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
+ addq %rbp, init_level4_pgt + 0(%rip)
+ addq %rbp, init_level4_pgt + (258*8)(%rip)
+ addq %rbp, init_level4_pgt + (511*8)(%rip)
+
+ addq %rbp, level3_ident_pgt + 0(%rip)
+ addq %rbp, level3_kernel_pgt + (510*8)(%rip)
+
+ /* Add an Identity mapping if I am above 1G */
+ leaq _text(%rip), %rdi
+ andq $LARGE_PAGE_MASK, %rdi
+
+ movq %rdi, %rax
+ shrq $PUD_SHIFT, %rax
+ andq $(PTRS_PER_PUD - 1), %rax
+ jz ident_complete
+
+ leaq (level2_spare_pgt - __START_KERNEL_map + _KERNPG_TABLE)(%rbp), %rdx
+ leaq level3_ident_pgt(%rip), %rbx
+ movq %rdx, 0(%rbx, %rax, 8)
+
+ movq %rdi, %rax
+ shrq $PMD_SHIFT, %rax
+ andq $(PTRS_PER_PMD - 1), %rax
+ leaq __PAGE_KERNEL_LARGE_EXEC(%rdi), %rdx
+ leaq level2_spare_pgt(%rip), %rbx
+ movq %rdx, 0(%rbx, %rax, 8)
+ident_complete:
+
+ /* Fixup the kernel text+data virtual addresses
*/
- ljmp $__KERNEL_CS, $(startup_64 - __START_KERNEL_map)
+ leaq level2_kernel_pgt(%rip), %rdi
+ leaq 4096(%rdi), %r8
+ /* See if it is a valid page table entry */
+1: testq $1, 0(%rdi)
+ jz 2f
+ addq %rbp, 0(%rdi)
+ /* Go to the next page */
+2: addq $8, %rdi
+ cmp %r8, %rdi
+ jne 1b
+
+ /* Fixup phys_base */
+ addq %rbp, phys_base(%rip)
+
+#ifdef CONFIG_SMP
+ addq %rbp, trampoline_level4_pgt + 0(%rip)
+ addq %rbp, trampoline_level4_pgt + (511*8)(%rip)
+#endif
+#ifdef CONFIG_ACPI_SLEEP
+ addq %rbp, wakeup_level4_pgt + 0(%rip)
+ addq %rbp, wakeup_level4_pgt + (511*8)(%rip)
+#endif

- .code64
- .org 0x100
- .globl startup_64
-startup_64:
ENTRY(secondary_startup_64)
- /* We come here either from startup_32
- * or directly from a 64bit bootloader.
- * Since we may have come directly from a bootloader we
- * reload the page tables here.
- */
+ /*
+ * At this point the CPU runs in 64bit mode CS.L = 1 CS.D = 1,
+ * and someone has loaded a mapped page table.
+ *
+ * %esi holds a physical pointer to real_mode_data.
+ *
+ * We come here either from startup_64 (using physical addresses)
+ * or from trampoline.S (using virtual addresses).
+ *
+ * Using virtual addresses from trampoline.S removes the need
+ * to have any identity mapped pages in the kernel page table
+ * after the boot processor executes this code.
+ */

/* Enable PAE mode and PGE */
xorq %rax, %rax
@@ -118,8 +146,14 @@ ENTRY(secondary_startup_64)

/* Setup early boot stage 4 level pagetables. */
movq $(init_level4_pgt - __START_KERNEL_map), %rax
+ addq phys_base(%rip), %rax
movq %rax, %cr3

+ /* Ensure I am executing from virtual addresses */
+ movq $1f, %rax
+ jmp *%rax
+1:
+
/* Check if nx is implemented */
movl $0x80000001, %eax
cpuid
@@ -128,17 +162,11 @@ ENTRY(secondary_startup_64)
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
rdmsr
-
- /* Enable System Call */
- btsl $_EFER_SCE, %eax
-
- /* No Execute supported? */
- btl $20,%edi
+ btsl $_EFER_SCE, %eax /* Enable System Call */
+ btl $20,%edi /* No Execute supported? */
jnc 1f
btsl $_EFER_NX, %eax
-1:
- /* Make changes effective */
- wrmsr
+1: wrmsr /* Make changes effective */

/* Setup cr0 */
#define CR0_PM 1 /* protected mode */
@@ -165,7 +193,7 @@ #define CR0_PAGING (1<<31)
* addresses where we're currently running on. We have to do that here
* because in 32bit we couldn't load a 64bit linear address.
*/
- lgdt cpu_gdt_descr
+ lgdt cpu_gdt_descr(%rip)

/*
* Setup up a dummy PDA. this is just for some early bootup code
@@ -204,6 +232,9 @@ initial_code:
init_rsp:
.quad init_thread_union+THREAD_SIZE-8

+bad_address:
+ jmp bad_address
+
ENTRY(early_idt_handler)
cmpl $2,early_recursion_flag(%rip)
jz 1f
@@ -232,23 +263,7 @@ early_idt_msg:
early_idt_ripmsg:
.asciz "RIP %s\n"

-.code32
-ENTRY(no_long_mode)
- /* This isn't an x86-64 CPU so hang */
-1:
- jmp 1b
-
-.org 0xf00
- .globl pGDT32
-pGDT32:
- .word gdt_end-cpu_gdt_table-1
- .long cpu_gdt_table-__START_KERNEL_map
-
-.org 0xf10
-ljumpvector:
- .long startup_64-__START_KERNEL_map
- .word __KERNEL_CS
-
+.balign PAGE_SIZE
ENTRY(stext)
ENTRY(_stext)

@@ -303,6 +318,9 @@ NEXT_PAGE(level2_kernel_pgt)
/* Module mapping starts here */
.fill (PTRS_PER_PMD - (KERNEL_TEXT_SIZE/PMD_SIZE)),8,0

+NEXT_PAGE(level2_spare_pgt)
+ .fill 512,8,0
+
#undef PMDS
#undef NEXT_PAGE

@@ -325,6 +343,10 @@ #ifdef CONFIG_SMP
.endr
#endif

+ENTRY(phys_base)
+ /* This must match the first entry in level2_kernel_pgt */
+ .quad 0x0000000000000000
+
/* We need valid kernel segments for data and code in long mode too
* IRET will check the segment types kkeil 2000/10/28
* Also sysret mandates a special GDT layout
diff --git a/arch/x86_64/kernel/vmlinux.lds.S b/arch/x86_64/kernel/vmlinux.lds.S
index 741487b..456fe8e 100644
--- a/arch/x86_64/kernel/vmlinux.lds.S
+++ b/arch/x86_64/kernel/vmlinux.lds.S
@@ -15,7 +15,7 @@ ENTRY(phys_startup_64)
jiffies_64 = jiffies;
SECTIONS
{
- . = __START_KERNEL_map + 0x200000;
+ . = __START_KERNEL_map;
phys_startup_64 = startup_64 - LOAD_OFFSET;
_text = .; /* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
diff --git a/include/asm-x86_64/page.h b/include/asm-x86_64/page.h
index deed41a..d125c09 100644
--- a/include/asm-x86_64/page.h
+++ b/include/asm-x86_64/page.h
@@ -61,6 +61,8 @@ #define PTE_MASK PHYSICAL_PAGE_MASK

typedef struct { unsigned long pgprot; } pgprot_t;

+extern unsigned long phys_base;
+
#define pte_val(x) ((x).pte)
#define pmd_val(x) ((x).pmd)
#define pud_val(x) ((x).pud)
@@ -99,14 +101,14 @@ #endif /* __ASSEMBLY__ */
#define PAGE_OFFSET __PAGE_OFFSET

/* Note: __pa(&symbol_visible_to_c) should be always replaced with __pa_symbol.
- Otherwise you risk miscompilation. */
+ Otherwise you risk miscompilation. */
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
/* __pa_symbol should be used for C visible symbols.
This seems to be the official gcc blessed way to do such arithmetic. */
#define __pa_symbol(x) \
({unsigned long v; \
asm("" : "=r" (v) : "0" (x)); \
- (v - __START_KERNEL_map); })
+ ((v - __START_KERNEL_map) + phys_base); })

#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
#ifdef CONFIG_FLATMEM
--
1.4.2.rc2.g5209e

2006-08-01 11:08:44

by Eric W. Biederman

Subject: [PATCH 10/33] i386: Relocatable kernel support.

This patch modifies the x86 kernel so that if CONFIG_RELOCATABLE is
selected it will be able to be loaded at any 4K-aligned address below
1G. The technique used is to compile the decompressor with -fPIC and
modify it so the decompressor is fully relocatable. For the main
kernel, relocations are generated at link time, resulting in a kernel
that is relocatable with no runtime overhead and no need to modify
the source code.

A reserved 32bit word in the boot parameters has been assigned
to serve as a stack so we can figure out where we are running.
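
The trick that last sentence refers to looks like this (a sketch; the
x86_64 variant appears in patch 32 above, and the i386 head.S hunk
below continues with the same idea):

	leal 0x40(%esi), %esp	# borrow 4 bytes of real-mode data as stack
	call 1f
1:	popl %ebp
	subl $1b, %ebp		# %ebp = run-time address - link-time address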

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/i386/Kconfig | 12 +
arch/i386/Makefile | 2
arch/i386/boot/compressed/Makefile | 22 +
arch/i386/boot/compressed/head.S | 184 +++++++----
arch/i386/boot/compressed/misc.c | 263 ++++++++-------
arch/i386/boot/compressed/relocs.c | 563 +++++++++++++++++++++++++++++++++
arch/i386/boot/compressed/vmlinux.lds | 40 ++
arch/i386/boot/compressed/vmlinux.scr | 3
arch/i386/boot/setup.S | 29 +-
include/linux/screen_info.h | 3
10 files changed, 910 insertions(+), 211 deletions(-)

diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig
index 062fa01..a0707d7 100644
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -761,6 +761,18 @@ config CRASH_DUMP
help
Generate crash dump after being started by kexec.

+config RELOCATABLE
+ bool "Build a relocatable kernel"
+ help
+ This builds a kernel image that retains relocation information
+ so it can be loaded someplace besides the default 1MB.
+ The relocations tend to make the kernel binary about 10% larger,
+ but they are discarded at runtime.
+
+ One use is for the kexec on panic case where the recovery kernel
+ must live at a different physical address than the primary
+ kernel.
+
config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)

diff --git a/arch/i386/Makefile b/arch/i386/Makefile
index 3e4adb1..e9d6eac 100644
--- a/arch/i386/Makefile
+++ b/arch/i386/Makefile
@@ -26,7 +26,7 @@ endif

LDFLAGS := -m elf_i386
OBJCOPYFLAGS := -O binary -R .note -R .comment -S
-LDFLAGS_vmlinux :=
+LDFLAGS_vmlinux := --emit-relocs
CHECKFLAGS += -D__i386__

CFLAGS += -pipe -msoft-float
diff --git a/arch/i386/boot/compressed/Makefile b/arch/i386/boot/compressed/Makefile
index 258ea95..1c486d1 100644
--- a/arch/i386/boot/compressed/Makefile
+++ b/arch/i386/boot/compressed/Makefile
@@ -7,19 +7,33 @@ #
targets := vmlinux vmlinux.bin vmlinux.bin.gz head.o misc.o piggy.o
EXTRA_AFLAGS := -traditional

-LDFLAGS_vmlinux := -Ttext $(IMAGE_OFFSET) -e startup_32
+LDFLAGS_vmlinux := -T
+CFLAGS_misc.o += -fPIC
+hostprogs-y := relocs

-$(obj)/vmlinux: $(obj)/head.o $(obj)/misc.o $(obj)/piggy.o FORCE
+$(obj)/vmlinux: $(src)/vmlinux.lds $(obj)/head.o $(obj)/misc.o $(obj)/piggy.o FORCE
$(call if_changed,ld)
@:

$(obj)/vmlinux.bin: vmlinux FORCE
$(call if_changed,objcopy)

-$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin FORCE
+quiet_cmd_relocs = RELOCS $@
+ cmd_relocs = $(obj)/relocs $< > $@
+$(obj)/vmlinux.relocs: vmlinux $(obj)/relocs FORCE
+ $(call if_changed,relocs)
+
+vmlinux.bin.all-y := $(obj)/vmlinux.bin
+vmlinux.bin.all-$(CONFIG_RELOCATABLE) += $(obj)/vmlinux.relocs
+quiet_cmd_relocbin = BUILD $@
+ cmd_relocbin = cat $(filter-out FORCE,$^) > $@
+$(obj)/vmlinux.bin.all: $(vmlinux.bin.all-y) FORCE
+ $(call if_changed,relocbin)
+
+$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin.all FORCE
$(call if_changed,gzip)

LDFLAGS_piggy.o := -r --format binary --oformat elf32-i386 -T

-$(obj)/piggy.o: $(obj)/vmlinux.scr $(obj)/vmlinux.bin.gz FORCE
+$(obj)/piggy.o: $(src)/vmlinux.scr $(obj)/vmlinux.bin.gz FORCE
$(call if_changed,ld)
diff --git a/arch/i386/boot/compressed/head.S b/arch/i386/boot/compressed/head.S
index 8f28ecd..418e425 100644
--- a/arch/i386/boot/compressed/head.S
+++ b/arch/i386/boot/compressed/head.S
@@ -26,9 +26,11 @@
#include <linux/config.h>
#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/page.h>

+.section ".text.head"
.globl startup_32
-
+
startup_32:
cld
cli
@@ -37,93 +39,141 @@ startup_32:
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
+ movl %eax,%ss

- lss stack_start,%esp
- xorl %eax,%eax
-1: incl %eax # check that A20 really IS enabled
- movl %eax,0x000000 # loop forever if it isn't
- cmpl %eax,0x100000
- je 1b
+/* Calculate the delta between where we were compiled to run
+ * at and where we were actually loaded at. This can only be done
+ * with a short local call on x86. Nothing else will tell us what
+ * address we are running at. The reserved chunk of the real-mode
+ * data at 0x34-0x3f is used as the stack for this calculation.
+ * Only 4 bytes are needed.
+ */
+ leal 0x40(%esi), %esp
+ call 1f
+1: popl %ebp
+ subl $1b, %ebp
+
+/* Compute the delta between where we were compiled to run at
+ * and where the code will actually run at.
+ */
+ /* Start with the delta to where the kernel will run. If we are
+ * a relocatable kernel this is the delta to our load address,
+ * otherwise this is the delta to CONFIG_PHYSICAL_START.
+ */
+#ifdef CONFIG_RELOCATABLE
+ movl %ebp, %ebx
+#else
+ movl $(CONFIG_PHYSICAL_START - startup_32), %ebx
+#endif
+
+ /* Replace the compressed data size with the uncompressed size */
+ subl input_len(%ebp), %ebx
+ movl output_len(%ebp), %eax
+ addl %eax, %ebx
+ /* Add 8 bytes for every 32K input block */
+ shrl $12, %eax
+ addl %eax, %ebx
+ /* Add 32K + 18 bytes of extra slack */
+ addl $(32768 + 18), %ebx
+ /* Align on a 4K boundary */
+ addl $4095, %ebx
+ andl $~4095, %ebx
+
+/* Copy the compressed kernel to the end of our buffer
+ * where decompression in place becomes safe.
+ */
+ pushl %esi
+ leal _end(%ebp), %esi
+ leal _end(%ebx), %edi
+ movl $(_end - startup_32), %ecx
+ std
+ rep
+ movsb
+ cld
+ popl %esi
+
+/* Compute the kernel start address.
+ */
+#ifdef CONFIG_RELOCATABLE
+ leal startup_32(%ebp), %ebp
+#else
+ movl $CONFIG_PHYSICAL_START, %ebp
+#endif

/*
- * Initialize eflags. Some BIOS's leave bits like NT set. This would
- * confuse the debugger if this code is traced.
- * XXX - best to initialize before switching to protected mode.
+ * Jump to the relocated address.
*/
- pushl $0
- popfl
+ leal relocated(%ebx), %eax
+ jmp *%eax
+.section ".text"
+relocated:
+
/*
* Clear BSS
*/
xorl %eax,%eax
- movl $_edata,%edi
- movl $_end,%ecx
+ leal _edata(%ebx),%edi
+ leal _end(%ebx), %ecx
subl %edi,%ecx
cld
rep
stosb
+
+/*
+ * Setup the stack for the decompressor
+ */
+ leal stack_end(%ebx), %esp
+
/*
* Do the decompression, and jump to the new kernel..
*/
- subl $16,%esp # place for structure on the stack
- movl %esp,%eax
+ movl output_len(%ebx), %eax
+ pushl %eax
+ pushl %ebp # output address
+ movl input_len(%ebx), %eax
+ pushl %eax # input_len
+ leal input_data(%ebx), %eax
+ pushl %eax # input_data
+ leal _end(%ebx), %eax
+ pushl %eax # end of the image
pushl %esi # real mode pointer as second arg
- pushl %eax # address of structure as first arg
call decompress_kernel
- orl %eax,%eax
- jnz 3f
- popl %esi # discard address
- popl %esi # real mode pointer
- xorl %ebx,%ebx
- ljmp $(__BOOT_CS), $CONFIG_PHYSICAL_START
+ addl $20, %esp
+ popl %ecx
+
+#ifdef CONFIG_RELOCATABLE
+/* Find the address of the relocations.
+ */
+ movl %ebp, %edi
+ addl %ecx, %edi
+
+/* Calculate the delta between where vmlinux was compiled to run
+ * and where it was actually loaded.
+ */
+ movl %ebp, %ebx
+ subl $CONFIG_PHYSICAL_START, %ebx

/*
- * We come here, if we were loaded high.
- * We need to move the move-in-place routine down to 0x1000
- * and then start it with the buffer addresses in registers,
- * which we got from the stack.
+ * Process relocations.
*/
-3:
- movl $move_routine_start,%esi
- movl $0x1000,%edi
- movl $move_routine_end,%ecx
- subl %esi,%ecx
- addl $3,%ecx
- shrl $2,%ecx
- cld
- rep
- movsl
-
- popl %esi # discard the address
- popl %ebx # real mode pointer
- popl %esi # low_buffer_start
- popl %ecx # lcount
- popl %edx # high_buffer_start
- popl %eax # hcount
- movl $CONFIG_PHYSICAL_START,%edi
- cli # make sure we don't get interrupted
- ljmp $(__BOOT_CS), $0x1000 # and jump to the move routine
+
+1: subl $4, %edi
+ movl 0(%edi), %ecx
+ testl %ecx, %ecx
+ jz 2f
+ addl %ebx, -__PAGE_OFFSET(%ebx, %ecx)
+ jmp 1b
+2:
+#endif

/*
- * Routine (template) for moving the decompressed kernel in place,
- * if we were high loaded. This _must_ PIC-code !
+ * Jump to the decompressed kernel.
*/
-move_routine_start:
- movl %ecx,%ebp
- shrl $2,%ecx
- rep
- movsl
- movl %ebp,%ecx
- andl $3,%ecx
- rep
- movsb
- movl %edx,%esi
- movl %eax,%ecx # NOTE: rep movsb won't move if %ecx == 0
- addl $3,%ecx
- shrl $2,%ecx
- rep
- movsl
- movl %ebx,%esi # Restore setup pointer
xorl %ebx,%ebx
- ljmp $(__BOOT_CS), $CONFIG_PHYSICAL_START
-move_routine_end:
+ jmp *%ebp
+
+.bss
+.balign 4
+stack:
+ .fill 4096, 1, 0
+stack_end:
diff --git a/arch/i386/boot/compressed/misc.c b/arch/i386/boot/compressed/misc.c
index fcaa9f0..809eb93 100644
--- a/arch/i386/boot/compressed/misc.c
+++ b/arch/i386/boot/compressed/misc.c
@@ -17,6 +17,88 @@ #include <linux/serial_reg.h>
#include <linux/screen_info.h>
#include <asm/io.h>
#include <asm/setup.h>
+#include <asm/page.h>
+
+/* WARNING!!
+ * This code is compiled with -fPIC and it is relocated dynamically
+ * at run time, but no relocation processing is performed.
+ * This means that it is not safe to place pointers in static structures.
+ */
+
+/*
+ * Getting to provably safe in-place decompression is hard.
+ * Worst-case behaviours need to be analyzed.
+ * Background information:
+ *
+ * The file layout is:
+ * magic[2]
+ * method[1]
+ * flags[1]
+ * timestamp[4]
+ * extraflags[1]
+ * os[1]
+ * compressed data blocks[N]
+ * crc[4] orig_len[4]
+ *
+ * resulting in 18 bytes of non-compressed data overhead.
+ *
+ * Files are divided into blocks:
+ * 1 bit (last block flag)
+ * 2 bits (block type)
+ *
+ * 1 block occurs every 32K - 1 bytes, or when 50% compression has been achieved.
+ * The smallest block type encoding is always used.
+ *
+ * stored:
+ * 32 bits length in bytes.
+ *
+ * fixed:
+ * magic fixed tree.
+ * symbols.
+ *
+ * dynamic:
+ * dynamic tree encoding.
+ * symbols.
+ *
+ *
+ * The buffer for decompression in place is the length of the
+ * uncompressed data, plus a small amount extra to keep the algorithm safe.
+ * The compressed data is placed at the end of the buffer. The output
+ * pointer is placed at the start of the buffer and the input pointer
+ * is placed where the compressed data starts. Problems will occur
+ * when the output pointer overruns the input pointer.
+ *
+ * The output pointer can only overrun the input pointer if the input
+ * pointer is moving faster than the output pointer, a condition
+ * only triggered by data whose compressed form is larger than the
+ * uncompressed form.
+ *
+ * The worst case at the block level is a growth of the compressed data
+ * of 5 bytes per 32767 bytes.
+ *
+ * The worst case internal to a compressed block is very hard to figure.
+ * The worst case can at least be bounded by having one bit that represents
+ * 32764 bytes and then all of the rest of the bytes representing the very
+ * very last byte.
+ *
+ * All of which is enough to compute an amount of extra data that is required
+ * to be safe. To avoid problems at the block level allocating 5 extra bytes
+ * per 32767 bytes of data is sufficient. To avoid problems internal to a block,
+ * adding an extra 32767 bytes (the worst case uncompressed block size) is
+ * sufficient to ensure that in the worst case the decompressed data for a
+ * block will stop the byte before the compressed data for a block begins.
+ * To avoid problems with the compressed data's meta information an extra 18
+ * bytes are needed. Leading to the formula:
+ *
+ * extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size.
+ *
+ * Adding 8 bytes per 32K is a bit excessive but much easier to calculate.
+ * Adding 32768 instead of 32767 just makes for round numbers.
+ * Adding the decompressor_size is necessary as it must live after all
+ * of the data as well. Last I measured the decompressor is about 14K:
+ * 10K of actual data and 4K of bss.
+ *
+ */

/*
* gzip declarations
@@ -35,15 +117,20 @@ typedef unsigned char uch;
typedef unsigned short ush;
typedef unsigned long ulg;

-#define WSIZE 0x8000 /* Window size must be at least 32k, */
- /* and a power of two */
+#define WSIZE 0x80000000 /* Window size must be at least 32k,
+ * and a power of two.
+ * We don't actually have a window, just
+ * a huge output buffer, so I report
+ * a 2G window size, as that should
+ * always be larger than our output buffer.
+ */

-static uch *inbuf; /* input buffer */
-static uch window[WSIZE]; /* Sliding window buffer */
+static uch *inbuf; /* input buffer */
+static uch *window; /* Sliding window buffer, (and final output buffer) */

-static unsigned insize = 0; /* valid bytes in inbuf */
-static unsigned inptr = 0; /* index of next byte to be processed in inbuf */
-static unsigned outcnt = 0; /* bytes in output buffer */
+static unsigned insize; /* valid bytes in inbuf */
+static unsigned inptr; /* index of next byte to be processed in inbuf */
+static unsigned outcnt; /* bytes in output buffer */

/* gzip flag byte */
#define ASCII_FLAG 0x01 /* bit 0 set: file probably ASCII text */
@@ -99,8 +186,6 @@ extern unsigned char input_data[];
extern int input_len;

static long bytes_out = 0;
-static uch *output_data;
-static unsigned long output_ptr = 0;

static void *malloc(int size);
static void free(void *where);
@@ -112,17 +197,10 @@ static int memcmp(const void *s1, const
static void putstr(const char *);
static unsigned simple_strtou(const char *cp,char **endp,unsigned base);

-extern int end;
-static long free_mem_ptr = (long)&end;
-static long free_mem_end_ptr;
+static unsigned long free_mem_ptr;
+static unsigned long free_mem_end_ptr;

-#define INPLACE_MOVE_ROUTINE 0x1000
-#define LOW_BUFFER_START 0x2000
-#define LOW_BUFFER_MAX 0x90000
#define HEAP_SIZE 0x3000
-static unsigned int low_buffer_end, low_buffer_size;
-static int high_loaded =0;
-static uch *high_buffer_start /* = (uch *)(((ulg)&end) + HEAP_SIZE)*/;

static char *vidmem;
static int vidport;
@@ -174,9 +252,9 @@ static void gzip_mark(void **ptr)

static void gzip_release(void **ptr)
{
- free_mem_ptr = (long) *ptr;
+ free_mem_ptr = (unsigned long) *ptr;
}
-
+
/* The early video console */
static void vid_scroll(void)
{
@@ -203,7 +281,7 @@ static void vid_putstr(const char *s)
y--;
}
} else {
- vidmem [ ( x + cols * y ) * 2 ] = c;
+ vidmem [ ( x + cols * y ) * 2 ] = c;
if ( ++x >= cols ) {
x = 0;
if ( ++y >= lines ) {
@@ -443,58 +521,31 @@ char *strstr(const char *haystack, const
*/
static int fill_inbuf(void)
{
- if (insize != 0) {
- error("ran out of input data");
- }
-
- inbuf = input_data;
- insize = input_len;
- inptr = 1;
- return inbuf[0];
+ error("ran out of input data");
+ return 0;
}

/* ===========================================================================
* Write the output window window[0..outcnt-1] and update crc and bytes_out.
* (Used for the decompressed data only.)
*/
-static void flush_window_low(void)
-{
- ulg c = crc; /* temporary variable */
- unsigned n;
- uch *in, *out, ch;
-
- in = window;
- out = &output_data[output_ptr];
- for (n = 0; n < outcnt; n++) {
- ch = *out++ = *in++;
- c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
- }
- crc = c;
- bytes_out += (ulg)outcnt;
- output_ptr += (ulg)outcnt;
- outcnt = 0;
-}
-
-static void flush_window_high(void)
-{
- ulg c = crc; /* temporary variable */
- unsigned n;
- uch *in, ch;
- in = window;
- for (n = 0; n < outcnt; n++) {
- ch = *output_data++ = *in++;
- if ((ulg)output_data == low_buffer_end) output_data=high_buffer_start;
- c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
- }
- crc = c;
- bytes_out += (ulg)outcnt;
- outcnt = 0;
-}
-
static void flush_window(void)
{
- if (high_loaded) flush_window_high();
- else flush_window_low();
+ /* With my window equal to my output buffer
+ * I only need to compute the crc here.
+ */
+ ulg c = crc; /* temporary variable */
+ unsigned n;
+ uch *in, ch;
+
+ in = window;
+ for (n = 0; n < outcnt; n++) {
+ ch = *in++;
+ c = crc_32_tab[((int)c ^ ch) & 0xff] ^ (c >> 8);
+ }
+ crc = c;
+ bytes_out += (ulg)outcnt;
+ outcnt = 0;
}

static void error(char *x)
@@ -506,65 +557,6 @@ static void error(char *x)
while(1); /* Halt */
}

-#define STACK_SIZE (4096)
-
-long user_stack [STACK_SIZE];
-
-struct {
- long * a;
- short b;
- } stack_start = { & user_stack [STACK_SIZE] , __BOOT_DS };
-
-static void setup_normal_output_buffer(void)
-{
-#ifdef STANDARD_MEMORY_BIOS_CALL
- if (RM_EXT_MEM_K < 1024) error("Less than 2MB of memory");
-#else
- if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < 1024) error("Less than 2MB of memory");
-#endif
- output_data = (unsigned char *)CONFIG_PHYSICAL_START; /* Normally Points to 1M */
- free_mem_end_ptr = (long)real_mode;
-}
-
-struct moveparams {
- uch *low_buffer_start; int lcount;
- uch *high_buffer_start; int hcount;
-};
-
-static void setup_output_buffer_if_we_run_high(struct moveparams *mv)
-{
- high_buffer_start = (uch *)(((ulg)&end) + HEAP_SIZE);
-#ifdef STANDARD_MEMORY_BIOS_CALL
- if (RM_EXT_MEM_K < (3*1024)) error("Less than 4MB of memory");
-#else
- if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < (3*1024)) error("Less than 4MB of memory");
-#endif
- mv->low_buffer_start = output_data = (unsigned char *)LOW_BUFFER_START;
- low_buffer_end = ((unsigned int)real_mode > LOW_BUFFER_MAX
- ? LOW_BUFFER_MAX : (unsigned int)real_mode) & ~0xfff;
- low_buffer_size = low_buffer_end - LOW_BUFFER_START;
- high_loaded = 1;
- free_mem_end_ptr = (long)high_buffer_start;
- if ( (CONFIG_PHYSICAL_START + low_buffer_size) > ((ulg)high_buffer_start)) {
- high_buffer_start = (uch *)(CONFIG_PHYSICAL_START + low_buffer_size);
- mv->hcount = 0; /* say: we need not to move high_buffer */
- }
- else mv->hcount = -1;
- mv->high_buffer_start = high_buffer_start;
-}
-
-static void close_output_buffer_if_we_run_high(struct moveparams *mv)
-{
- if (bytes_out > low_buffer_size) {
- mv->lcount = low_buffer_size;
- if (mv->hcount)
- mv->hcount = bytes_out - low_buffer_size;
- } else {
- mv->lcount = bytes_out;
- mv->hcount = 0;
- }
-}
-
static void save_command_line(void)
{
/* Find the command line */
@@ -579,19 +571,32 @@ static void save_command_line(void)
saved_command_line[COMMAND_LINE_SIZE - 1] = '\0';
}

-asmlinkage int decompress_kernel(struct moveparams *mv, void *rmode)
+asmlinkage void decompress_kernel(void *rmode, unsigned long end,
+ uch *input_data, unsigned long input_len, uch *output)
{
real_mode = rmode;
save_command_line();
console_init(saved_command_line);

- if (free_mem_ptr < 0x100000) setup_normal_output_buffer();
- else setup_output_buffer_if_we_run_high(mv);
+ window = output; /* Output buffer (Normally at 1M) */
+ free_mem_ptr = end; /* Heap */
+ free_mem_end_ptr = end + HEAP_SIZE;
+ inbuf = input_data; /* Input buffer */
+ insize = input_len;
+ inptr = 0;
+
+ if (((u32)output - CONFIG_PHYSICAL_START) & 0x3fffff)
+ error("Destination address not 4M aligned");
+ if (end > ((-__PAGE_OFFSET-(512 <<20)-1) & 0x7fffffff))
+ error("Destination address too large");
+#ifndef CONFIG_RELOCATABLE
+ if ((u32)output != CONFIG_PHYSICAL_START)
+ error("Wrong destination address");
+#endif

makecrc();
putstr("Uncompressing Linux... ");
gunzip();
putstr("Ok, booting the kernel.\n");
- if (high_loaded) close_output_buffer_if_we_run_high(mv);
- return high_loaded;
+ return;
}
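
(Aside, not part of the patch: a worked instance of the bound derived
in the long comment above. For a 4 MiB uncompressed kernel the slack
beyond the uncompressed size, before accounting for the decompressor
itself, works out to (4 MiB >> 12) + 32768 + 18 = 1024 + 32768 + 18 =
33810 bytes, i.e. about 33K. As a sketch, with illustrative names:

    static unsigned long safe_buffer_bytes(unsigned long uncompressed_size,
                                           unsigned long decompressor_size)
    {
        return uncompressed_size
            + (uncompressed_size >> 12)  /* 8 bytes per 32K block */
            + 32768 + 18                 /* one worst-case block + gzip overhead */
            + decompressor_size;         /* it must live past the data too */
    }

This is the same arithmetic the new head.S code performs with
subl/addl/shrl before copying the compressed image to the end of the
buffer.)
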
diff --git a/arch/i386/boot/compressed/relocs.c b/arch/i386/boot/compressed/relocs.c
new file mode 100644
index 0000000..0551ceb
--- /dev/null
+++ b/arch/i386/boot/compressed/relocs.c
@@ -0,0 +1,563 @@
+#include <stdio.h>
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <elf.h>
+#include <byteswap.h>
+#define USE_BSD
+#include <endian.h>
+
+#define MAX_SHDRS 100
+static Elf32_Ehdr ehdr;
+static Elf32_Shdr shdr[MAX_SHDRS];
+static Elf32_Sym *symtab[MAX_SHDRS];
+static Elf32_Rel *reltab[MAX_SHDRS];
+static char *strtab[MAX_SHDRS];
+static unsigned long reloc_count, reloc_idx;
+static unsigned long *relocs;
+
+static void die(char *fmt, ...)
+{
+ va_list ap;
+ va_start(ap, fmt);
+ vfprintf(stderr, fmt, ap);
+ va_end(ap);
+ exit(1);
+}
+
+static const char *sym_type(unsigned type)
+{
+ static const char *type_name[] = {
+#define SYM_TYPE(X) [X] = #X
+ SYM_TYPE(STT_NOTYPE),
+ SYM_TYPE(STT_OBJECT),
+ SYM_TYPE(STT_FUNC),
+ SYM_TYPE(STT_SECTION),
+ SYM_TYPE(STT_FILE),
+ SYM_TYPE(STT_COMMON),
+ SYM_TYPE(STT_TLS),
+#undef SYM_TYPE
+ };
+ const char *name = "unknown sym type name";
+ if (type < sizeof(type_name)/sizeof(type_name[0])) {
+ name = type_name[type];
+ }
+ return name;
+}
+
+static const char *sym_bind(unsigned bind)
+{
+ static const char *bind_name[] = {
+#define SYM_BIND(X) [X] = #X
+ SYM_BIND(STB_LOCAL),
+ SYM_BIND(STB_GLOBAL),
+ SYM_BIND(STB_WEAK),
+#undef SYM_BIND
+ };
+ const char *name = "unknown sym bind name";
+ if (bind < sizeof(bind_name)/sizeof(bind_name[0])) {
+ name = bind_name[bind];
+ }
+ return name;
+}
+
+static const char *sym_visibility(unsigned visibility)
+{
+ static const char *visibility_name[] = {
+#define SYM_VISIBILITY(X) [X] = #X
+ SYM_VISIBILITY(STV_DEFAULT),
+ SYM_VISIBILITY(STV_INTERNAL),
+ SYM_VISIBILITY(STV_HIDDEN),
+ SYM_VISIBILITY(STV_PROTECTED),
+#undef SYM_VISIBILITY
+ };
+ const char *name = "unknown sym visibility name";
+ if (visibility < sizeof(visibility_name)/sizeof(visibility_name[0])) {
+ name = visibility_name[visibility];
+ }
+ return name;
+}
+
+static const char *rel_type(unsigned type)
+{
+ static const char *type_name[] = {
+#define REL_TYPE(X) [X] = #X
+ REL_TYPE(R_386_NONE),
+ REL_TYPE(R_386_32),
+ REL_TYPE(R_386_PC32),
+ REL_TYPE(R_386_GOT32),
+ REL_TYPE(R_386_PLT32),
+ REL_TYPE(R_386_COPY),
+ REL_TYPE(R_386_GLOB_DAT),
+ REL_TYPE(R_386_JMP_SLOT),
+ REL_TYPE(R_386_RELATIVE),
+ REL_TYPE(R_386_GOTOFF),
+ REL_TYPE(R_386_GOTPC),
+#undef REL_TYPE
+ };
+ const char *name = "unknown type rel type name";
+ if (type < sizeof(type_name)/sizeof(type_name[0])) {
+ name = type_name[type];
+ }
+ return name;
+}
+
+static const char *sec_name(unsigned shndx)
+{
+ const char *sec_strtab;
+ const char *name;
+ sec_strtab = strtab[ehdr.e_shstrndx];
+ name = "<noname>";
+ if (shndx < ehdr.e_shnum) {
+ name = sec_strtab + shdr[shndx].sh_name;
+ }
+ else if (shndx == SHN_ABS) {
+ name = "ABSOLUTE";
+ }
+ else if (shndx == SHN_COMMON) {
+ name = "COMMON";
+ }
+ return name;
+}
+
+static const char *sym_name(const char *sym_strtab, Elf32_Sym *sym)
+{
+ const char *name;
+ name = "<noname>";
+ if (sym->st_name) {
+ name = sym_strtab + sym->st_name;
+ }
+ else {
+ name = sec_name(sym->st_shndx);
+ }
+ return name;
+}
+
+
+
+#if BYTE_ORDER == LITTLE_ENDIAN
+#define le16_to_cpu(val) (val)
+#define le32_to_cpu(val) (val)
+#endif
+#if BYTE_ORDER == BIG_ENDIAN
+#define le16_to_cpu(val) bswap_16(val)
+#define le32_to_cpu(val) bswap_32(val)
+#endif
+
+static uint16_t elf16_to_cpu(uint16_t val)
+{
+ return le16_to_cpu(val);
+}
+
+static uint32_t elf32_to_cpu(uint32_t val)
+{
+ return le32_to_cpu(val);
+}
+
+static void read_ehdr(FILE *fp)
+{
+ if (fread(&ehdr, sizeof(ehdr), 1, fp) != 1) {
+ die("Cannot read ELF header: %s\n",
+ strerror(errno));
+ }
+ if (memcmp(ehdr.e_ident, ELFMAG, 4) != 0) {
+ die("No ELF magic\n");
+ }
+ if (ehdr.e_ident[EI_CLASS] != ELFCLASS32) {
+ die("Not a 32 bit executable\n");
+ }
+ if (ehdr.e_ident[EI_DATA] != ELFDATA2LSB) {
+ die("Not a LSB ELF executable\n");
+ }
+ if (ehdr.e_ident[EI_VERSION] != EV_CURRENT) {
+ die("Unknown ELF version\n");
+ }
+ /* Convert the fields to native endian */
+ ehdr.e_type = elf16_to_cpu(ehdr.e_type);
+ ehdr.e_machine = elf16_to_cpu(ehdr.e_machine);
+ ehdr.e_version = elf32_to_cpu(ehdr.e_version);
+ ehdr.e_entry = elf32_to_cpu(ehdr.e_entry);
+ ehdr.e_phoff = elf32_to_cpu(ehdr.e_phoff);
+ ehdr.e_shoff = elf32_to_cpu(ehdr.e_shoff);
+ ehdr.e_flags = elf32_to_cpu(ehdr.e_flags);
+ ehdr.e_ehsize = elf16_to_cpu(ehdr.e_ehsize);
+ ehdr.e_phentsize = elf16_to_cpu(ehdr.e_phentsize);
+ ehdr.e_phnum = elf16_to_cpu(ehdr.e_phnum);
+ ehdr.e_shentsize = elf16_to_cpu(ehdr.e_shentsize);
+ ehdr.e_shnum = elf16_to_cpu(ehdr.e_shnum);
+ ehdr.e_shstrndx = elf16_to_cpu(ehdr.e_shstrndx);
+
+ if ((ehdr.e_type != ET_EXEC) && (ehdr.e_type != ET_DYN)) {
+ die("Unsupported ELF header type\n");
+ }
+ if (ehdr.e_machine != EM_386) {
+ die("Not for x86\n");
+ }
+ if (ehdr.e_version != EV_CURRENT) {
+ die("Unknown ELF version\n");
+ }
+ if (ehdr.e_ehsize != sizeof(Elf32_Ehdr)) {
+ die("Bad Elf header size\n");
+ }
+ if (ehdr.e_phentsize != sizeof(Elf32_Phdr)) {
+ die("Bad program header entry\n");
+ }
+ if (ehdr.e_shentsize != sizeof(Elf32_Shdr)) {
+ die("Bad section header entry\n");
+ }
+ if (ehdr.e_shstrndx >= ehdr.e_shnum) {
+ die("String table index out of bounds\n");
+ }
+}
+
+static void read_shdrs(FILE *fp)
+{
+ int i;
+ if (ehdr.e_shnum > MAX_SHDRS) {
+ die("%d section headers supported: %d\n",
+ ehdr.e_shnum, MAX_SHDRS);
+ }
+ if (fseek(fp, ehdr.e_shoff, SEEK_SET) < 0) {
+ die("Seek to %d failed: %s\n",
+ ehdr.e_shoff, strerror(errno));
+ }
+ if (fread(&shdr, sizeof(shdr[0]), ehdr.e_shnum, fp) != ehdr.e_shnum) {
+ die("Cannot read ELF section headers: %s\n",
+ strerror(errno));
+ }
+ for(i = 0; i < ehdr.e_shnum; i++) {
+ shdr[i].sh_name = elf32_to_cpu(shdr[i].sh_name);
+ shdr[i].sh_type = elf32_to_cpu(shdr[i].sh_type);
+ shdr[i].sh_flags = elf32_to_cpu(shdr[i].sh_flags);
+ shdr[i].sh_addr = elf32_to_cpu(shdr[i].sh_addr);
+ shdr[i].sh_offset = elf32_to_cpu(shdr[i].sh_offset);
+ shdr[i].sh_size = elf32_to_cpu(shdr[i].sh_size);
+ shdr[i].sh_link = elf32_to_cpu(shdr[i].sh_link);
+ shdr[i].sh_info = elf32_to_cpu(shdr[i].sh_info);
+ shdr[i].sh_addralign = elf32_to_cpu(shdr[i].sh_addralign);
+ shdr[i].sh_entsize = elf32_to_cpu(shdr[i].sh_entsize);
+ }
+
+}
+
+static void read_strtabs(FILE *fp)
+{
+ int i;
+ for(i = 0; i < ehdr.e_shnum; i++) {
+ if (shdr[i].sh_type != SHT_STRTAB) {
+ continue;
+ }
+ strtab[i] = malloc(shdr[i].sh_size);
+ if (!strtab[i]) {
+ die("malloc of %d bytes for strtab failed\n",
+ shdr[i].sh_size);
+ }
+ if (fseek(fp, shdr[i].sh_offset, SEEK_SET) < 0) {
+ die("Seek to %d failed: %s\n",
+ shdr[i].sh_offset, strerror(errno));
+ }
+ if (fread(strtab[i], 1, shdr[i].sh_size, fp) != shdr[i].sh_size) {
+ die("Cannot read symbol table: %s\n",
+ strerror(errno));
+ }
+ }
+}
+
+static void read_symtabs(FILE *fp)
+{
+ int i,j;
+ for(i = 0; i < ehdr.e_shnum; i++) {
+ if (shdr[i].sh_type != SHT_SYMTAB) {
+ continue;
+ }
+ symtab[i] = malloc(shdr[i].sh_size);
+ if (!symtab[i]) {
+ die("malloc of %d bytes for symtab failed\n",
+ shdr[i].sh_size);
+ }
+ if (fseek(fp, shdr[i].sh_offset, SEEK_SET) < 0) {
+ die("Seek to %d failed: %s\n",
+ shdr[i].sh_offset, strerror(errno));
+ }
+ if (fread(symtab[i], 1, shdr[i].sh_size, fp) != shdr[i].sh_size) {
+ die("Cannot read symbol table: %s\n",
+ strerror(errno));
+ }
+ for(j = 0; j < shdr[i].sh_size/sizeof(symtab[i][0]); j++) {
+ symtab[i][j].st_name = elf32_to_cpu(symtab[i][j].st_name);
+ symtab[i][j].st_value = elf32_to_cpu(symtab[i][j].st_value);
+ symtab[i][j].st_size = elf32_to_cpu(symtab[i][j].st_size);
+ symtab[i][j].st_shndx = elf16_to_cpu(symtab[i][j].st_shndx);
+ }
+ }
+}
+
+
+static void read_relocs(FILE *fp)
+{
+ int i,j;
+ for(i = 0; i < ehdr.e_shnum; i++) {
+ if (shdr[i].sh_type != SHT_REL) {
+ continue;
+ }
+ reltab[i] = malloc(shdr[i].sh_size);
+ if (!reltab[i]) {
+ die("malloc of %d bytes for relocs failed\n",
+ shdr[i].sh_size);
+ }
+ if (fseek(fp, shdr[i].sh_offset, SEEK_SET) < 0) {
+ die("Seek to %d failed: %s\n",
+ shdr[i].sh_offset, strerror(errno));
+ }
+ if (fread(reltab[i], 1, shdr[i].sh_size, fp) != shdr[i].sh_size) {
+ die("Cannot read symbol table: %s\n",
+ strerror(errno));
+ }
+ for(j = 0; j < shdr[i].sh_size/sizeof(reltab[0][0]); j++) {
+ reltab[i][j].r_offset = elf32_to_cpu(reltab[i][j].r_offset);
+ reltab[i][j].r_info = elf32_to_cpu(reltab[i][j].r_info);
+ }
+ }
+}
+
+
+static void print_absolute_symbols(void)
+{
+ int i;
+ printf("Absolute symbols\n");
+ printf(" Num: Value Size Type Bind Visibility Name\n");
+ for(i = 0; i < ehdr.e_shnum; i++) {
+ char *sym_strtab;
+ Elf32_Sym *sh_symtab;
+ int j;
+ if (shdr[i].sh_type != SHT_SYMTAB) {
+ continue;
+ }
+ sh_symtab = symtab[i];
+ sym_strtab = strtab[shdr[i].sh_link];
+ for(j = 0; j < shdr[i].sh_size/sizeof(symtab[0][0]); j++) {
+ Elf32_Sym *sym;
+ const char *name;
+ sym = &symtab[i][j];
+ name = sym_name(sym_strtab, sym);
+ if (sym->st_shndx != SHN_ABS) {
+ continue;
+ }
+ printf("%5d %08x %5d %10s %10s %12s %s\n",
+ j, sym->st_value, sym->st_size,
+ sym_type(ELF32_ST_TYPE(sym->st_info)),
+ sym_bind(ELF32_ST_BIND(sym->st_info)),
+ sym_visibility(ELF32_ST_VISIBILITY(sym->st_other)),
+ name);
+ }
+ }
+ printf("\n");
+}
+
+static void print_absolute_relocs(void)
+{
+ int i;
+ printf("Absolute relocations\n");
+ printf("Offset Info Type Sym.Value Sym.Name\n");
+ for(i = 0; i < ehdr.e_shnum; i++) {
+ char *sym_strtab;
+ Elf32_Sym *sh_symtab;
+ unsigned sec_applies, sec_symtab;
+ int j;
+ if (shdr[i].sh_type != SHT_REL) {
+ continue;
+ }
+ sec_symtab = shdr[i].sh_link;
+ sec_applies = shdr[i].sh_info;
+ if (!(shdr[sec_applies].sh_flags & SHF_ALLOC)) {
+ continue;
+ }
+ sh_symtab = symtab[sec_symtab];
+ sym_strtab = strtab[shdr[sec_symtab].sh_link];
+ for(j = 0; j < shdr[i].sh_size/sizeof(reltab[0][0]); j++) {
+ Elf32_Rel *rel;
+ Elf32_Sym *sym;
+ const char *name;
+ rel = &reltab[i][j];
+ sym = &sh_symtab[ELF32_R_SYM(rel->r_info)];
+ name = sym_name(sym_strtab, sym);
+ if (sym->st_shndx != SHN_ABS) {
+ continue;
+ }
+ printf("%08x %08x %10s %08x %s\n",
+ rel->r_offset,
+ rel->r_info,
+ rel_type(ELF32_R_TYPE(rel->r_info)),
+ sym->st_value,
+ name);
+ }
+ }
+ printf("\n");
+}
+
+static void walk_relocs(void (*visit)(Elf32_Rel *rel, Elf32_Sym *sym))
+{
+ int i;
+ /* Walk through the relocations */
+ for(i = 0; i < ehdr.e_shnum; i++) {
+ char *sym_strtab;
+ Elf32_Sym *sh_symtab;
+ unsigned sec_applies, sec_symtab;
+ int j;
+ if (shdr[i].sh_type != SHT_REL) {
+ continue;
+ }
+ sec_symtab = shdr[i].sh_link;
+ sec_applies = shdr[i].sh_info;
+ if (!(shdr[sec_applies].sh_flags & SHF_ALLOC)) {
+ continue;
+ }
+ sh_symtab = symtab[sec_symtab];
+ sym_strtab = strtab[shdr[sec_symtab].sh_link];
+ for(j = 0; j < shdr[i].sh_size/sizeof(reltab[0][0]); j++) {
+ Elf32_Rel *rel;
+ Elf32_Sym *sym;
+ unsigned r_type;
+ rel = &reltab[i][j];
+ sym = &sh_symtab[ELF32_R_SYM(rel->r_info)];
+ r_type = ELF32_R_TYPE(rel->r_info);
+ /* Don't visit relocations to absolute symbols */
+ if (sym->st_shndx == SHN_ABS) {
+ continue;
+ }
+ if (r_type == R_386_PC32) {
+ /* PC relative relocations don't need to be adjusted */
+ }
+ else if (r_type == R_386_32) {
+ /* Visit relocations that need to be adjusted */
+ visit(rel, sym);
+ }
+ else {
+ die("Unsupported relocation type: %d\n", r_type);
+ }
+ }
+ }
+}
+
+static void count_reloc(Elf32_Rel *rel, Elf32_Sym *sym)
+{
+ reloc_count += 1;
+}
+
+static void collect_reloc(Elf32_Rel *rel, Elf32_Sym *sym)
+{
+ /* Remember the address that needs to be adjusted. */
+ relocs[reloc_idx++] = rel->r_offset;
+}
+
+static int cmp_relocs(const void *va, const void *vb)
+{
+ const unsigned long *a, *b;
+ a = va; b = vb;
+ return (*a == *b)? 0 : (*a > *b)? 1 : -1;
+}
+
+static void emit_relocs(int as_text)
+{
+ int i;
+ /* Count how many relocations I have and allocate space for them. */
+ reloc_count = 0;
+ walk_relocs(count_reloc);
+ relocs = malloc(reloc_count * sizeof(relocs[0]));
+ if (!relocs) {
+ die("malloc of %d entries for relocs failed\n",
+ reloc_count);
+ }
+ /* Collect up the relocations */
+ reloc_idx = 0;
+ walk_relocs(collect_reloc);
+
+ /* Order the relocations for more efficient processing */
+ qsort(relocs, reloc_count, sizeof(relocs[0]), cmp_relocs);
+
+ /* Print the relocations */
+ if (as_text) {
+ /* Print the relocations in a form that
+ * gas will like.
+ */
+ printf(".section \".data.reloc\",\"a\"\n");
+ printf(".balign 4\n");
+ for(i = 0; i < reloc_count; i++) {
+ printf("\t .long 0x%08lx\n", relocs[i]);
+ }
+ printf("\n");
+ }
+ else {
+ unsigned char buf[4];
+ buf[0] = buf[1] = buf[2] = buf[3] = 0;
+ /* Print a stop */
+ printf("%c%c%c%c", buf[0], buf[1], buf[2], buf[3]);
+ /* Now print each relocation */
+ for(i = 0; i < reloc_count; i++) {
+ buf[0] = (relocs[i] >> 0) & 0xff;
+ buf[1] = (relocs[i] >> 8) & 0xff;
+ buf[2] = (relocs[i] >> 16) & 0xff;
+ buf[3] = (relocs[i] >> 24) & 0xff;
+ printf("%c%c%c%c", buf[0], buf[1], buf[2], buf[3]);
+ }
+ }
+}
+
+static void usage(void)
+{
+ die("i386_reloc [--abs | --text] vmlinux\n");
+}
+
+int main(int argc, char **argv)
+{
+ int show_absolute;
+ int as_text;
+ const char *fname;
+ FILE *fp;
+ int i;
+
+ show_absolute = 0;
+ as_text = 0;
+ fname = NULL;
+ for(i = 1; i < argc; i++) {
+ char *arg = argv[i];
+ if (*arg == '-') {
+ if (strcmp(arg, "--abs") == 0) {
+ show_absolute = 1;
+ continue;
+ }
+ else if (strcmp(arg, "--text") == 0) {
+ as_text = 1;
+ continue;
+ }
+ }
+ else if (!fname) {
+ fname = arg;
+ continue;
+ }
+ usage();
+ }
+ if (!fname) {
+ usage();
+ }
+ fp = fopen(fname, "r");
+ if (!fp) {
+ die("Cannot open %s: %s\n",
+ fname, strerror(errno));
+ }
+ read_ehdr(fp);
+ read_shdrs(fp);
+ read_strtabs(fp);
+ read_symtabs(fp);
+ read_relocs(fp);
+ if (show_absolute) {
+ print_absolute_symbols();
+ print_absolute_relocs();
+ return 0;
+ }
+ emit_relocs(as_text);
+ return 0;
+}
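
(Note, for orientation rather than as part of the patch: the Makefile
change above builds this as a host program and runs it as
"relocs vmlinux > vmlinux.relocs". The sorted offsets it writes are
concatenated after vmlinux.bin and the combined image is gzipped, so
the stop word emitted first here is exactly what the relocation loop
in the decompressor's head.S walks back to.)
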
diff --git a/arch/i386/boot/compressed/vmlinux.lds b/arch/i386/boot/compressed/vmlinux.lds
new file mode 100644
index 0000000..973a23e
--- /dev/null
+++ b/arch/i386/boot/compressed/vmlinux.lds
@@ -0,0 +1,40 @@
+OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
+OUTPUT_ARCH(i386)
+ENTRY(startup_32)
+SECTIONS
+{
+ . = 0 ;
+ .text.head : {
+ _head = . ;
+ *(.text.head)
+ _ehead = . ;
+ }
+ .data.compressed : {
+ *(.data.compressed)
+ }
+ .text : {
+ _text = .; /* Text */
+ *(.text)
+ *(.text.*)
+ _etext = . ;
+ }
+ .rodata : {
+ _rodata = . ;
+ *(.rodata) /* read-only data */
+ *(.rodata.*)
+ _erodata = . ;
+ }
+ .data : {
+ _data = . ;
+ *(.data)
+ *(.data.*)
+ _edata = . ;
+ }
+ .bss : {
+ _bss = . ;
+ *(.bss)
+ *(.bss.*)
+ *(COMMON)
+ _end = . ;
+ }
+}
diff --git a/arch/i386/boot/compressed/vmlinux.scr b/arch/i386/boot/compressed/vmlinux.scr
index 1ed9d79..707a88f 100644
--- a/arch/i386/boot/compressed/vmlinux.scr
+++ b/arch/i386/boot/compressed/vmlinux.scr
@@ -1,9 +1,10 @@
SECTIONS
{
- .data : {
+ .data.compressed : {
input_len = .;
LONG(input_data_end - input_data) input_data = .;
*(.data)
+ output_len = . - 4;
input_data_end = .;
}
}
diff --git a/arch/i386/boot/setup.S b/arch/i386/boot/setup.S
index d2b684c..04b6ea8 100644
--- a/arch/i386/boot/setup.S
+++ b/arch/i386/boot/setup.S
@@ -588,11 +588,6 @@ rmodeswtch_normal:
call default_switch

rmodeswtch_end:
-# we get the code32 start address and modify the below 'jmpi'
-# (loader may have changed it)
- movl %cs:code32_start, %eax
- movl %eax, %cs:code32
-
# Now we move the system to its rightful place ... but we check if we have a
# big-kernel. In that case we *must* not move it ...
testb $LOADED_HIGH, %cs:loadflags
@@ -788,11 +783,12 @@ a20_err_msg:
a20_done:

#endif /* CONFIG_X86_VOYAGER */
-# set up gdt and idt
+# set up gdt and idt and 32bit start address
lidt idt_48 # load idt with 0,0
xorl %eax, %eax # Compute gdt_base
movw %ds, %ax # (Convert %ds:gdt to a linear ptr)
shll $4, %eax
+ addl %eax, code32
addl $gdt, %eax
movl %eax, (gdt_48+2)
lgdt gdt_48 # load gdt with whatever is
@@ -851,9 +847,26 @@ # take our 48 bit far pointer. (INTeL 80
# Manual, Mixing 16-bit and 32-bit code, page 16-6)

.byte 0x66, 0xea # prefix + jmpi-opcode
-code32: .long 0x1000 # will be set to 0x100000
- # for big kernels
+code32: .long startup_32 # will be set to %cs+startup_32
.word __BOOT_CS
+.code32
+startup_32:
+ movl $(__BOOT_DS), %eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %fs
+ movl %eax, %gs
+ movl %eax, %ss
+
+ xorl %eax, %eax
+1: incl %eax # check that A20 really IS enabled
+ movl %eax, 0x00000000 # loop forever if it isn't
+ cmpl %eax, 0x00100000
+ je 1b
+
+ # Jump to the 32bit entry point
+ jmpl *(code32_start - start + (DELTA_INITSEG << 4))(%esi)
+.code16

# Here's a bunch of information about your current kernel..
kernel_version: .ascii UTS_RELEASE
diff --git a/include/linux/screen_info.h b/include/linux/screen_info.h
index 2925e66..b02308e 100644
--- a/include/linux/screen_info.h
+++ b/include/linux/screen_info.h
@@ -42,7 +42,8 @@ struct screen_info {
u16 pages; /* 0x32 */
u16 vesa_attributes; /* 0x34 */
u32 capabilities; /* 0x36 */
- /* 0x3a -- 0x3f reserved for future expansion */
+ /* 0x3a -- 0x3b reserved for future expansion */
+ /* 0x3c -- 0x3f micro stack for relocatable kernels */
};

extern struct screen_info screen_info;
--
1.4.2.rc2.g5209e

2006-08-01 11:08:07

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 30/33] x86_64: Remove CONFIG_PHYSICAL_START

I am about to add relocatable kernel support, which has essentially
no cost, so there is no point in retaining CONFIG_PHYSICAL_START;
keeping it would only make implementing and testing a relocatable
kernel more difficult.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/Kconfig | 19 -------------------
arch/x86_64/boot/compressed/head.S | 6 +++---
arch/x86_64/boot/compressed/misc.c | 6 +++---
arch/x86_64/defconfig | 1 -
arch/x86_64/kernel/vmlinux.lds.S | 2 +-
arch/x86_64/mm/fault.c | 4 ++--
include/asm-x86_64/page.h | 2 --
7 files changed, 9 insertions(+), 31 deletions(-)

diff --git a/arch/x86_64/Kconfig b/arch/x86_64/Kconfig
index 28df7d8..763b25b 100644
--- a/arch/x86_64/Kconfig
+++ b/arch/x86_64/Kconfig
@@ -486,25 +486,6 @@ config CRASH_DUMP
help
Generate crash dump after being started by kexec.

-config PHYSICAL_START
- hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
- default "0x1000000" if CRASH_DUMP
- default "0x200000"
- help
- This gives the physical address where the kernel is loaded. Normally
- for regular kernels this value is 0x200000 (2MB). But in the case
- of kexec on panic the fail safe kernel needs to run at a different
- address than the panic-ed kernel. This option is used to set the load
- address for kernels used to capture crash dump on being kexec'ed
- after panic. The default value for crash dump kernels is
- 0x1000000 (16MB). This can also be set based on the "X" value as
- specified in the "crashkernel=YM@XM" command line boot parameter
- passed to the panic-ed kernel. Typically this parameter is set as
- crashkernel=64M@16M. Please take a look at
- Documentation/kdump/kdump.txt for more details about crash dumps.
-
- Don't change this unless you know what you are doing.
-
config SECCOMP
bool "Enable seccomp to safely compute untrusted bytecode"
depends on PROC_FS
diff --git a/arch/x86_64/boot/compressed/head.S b/arch/x86_64/boot/compressed/head.S
index 6f55565..cf55d09 100644
--- a/arch/x86_64/boot/compressed/head.S
+++ b/arch/x86_64/boot/compressed/head.S
@@ -76,7 +76,7 @@ startup_32:
jnz 3f
addl $8,%esp
xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $__PHYSICAL_START
+ ljmp $(__KERNEL_CS), $0x200000

/*
* We come here, if we were loaded high.
@@ -102,7 +102,7 @@ startup_32:
popl %ecx # lcount
popl %edx # high_buffer_start
popl %eax # hcount
- movl $__PHYSICAL_START,%edi
+ movl $0x200000,%edi
cli # make sure we don't get interrupted
ljmp $(__KERNEL_CS), $0x1000 # and jump to the move routine

@@ -127,7 +127,7 @@ move_routine_start:
movsl
movl %ebx,%esi # Restore setup pointer
xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $__PHYSICAL_START
+ ljmp $(__KERNEL_CS), $0x200000
move_routine_end:


diff --git a/arch/x86_64/boot/compressed/misc.c b/arch/x86_64/boot/compressed/misc.c
index 3755b2e..259bb05 100644
--- a/arch/x86_64/boot/compressed/misc.c
+++ b/arch/x86_64/boot/compressed/misc.c
@@ -288,7 +288,7 @@ #ifdef STANDARD_MEMORY_BIOS_CALL
#else
if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < 1024) error("Less than 2MB of memory");
#endif
- output_data = (unsigned char *)__PHYSICAL_START; /* Normally Points to 1M */
+ output_data = (unsigned char *)0x200000;
free_mem_end_ptr = (long)real_mode;
}

@@ -311,8 +311,8 @@ #endif
low_buffer_size = low_buffer_end - LOW_BUFFER_START;
high_loaded = 1;
free_mem_end_ptr = (long)high_buffer_start;
- if ( (__PHYSICAL_START + low_buffer_size) > ((ulg)high_buffer_start)) {
- high_buffer_start = (uch *)(__PHYSICAL_START + low_buffer_size);
+ if ( (0x200000 + low_buffer_size) > ((ulg)high_buffer_start)) {
+ high_buffer_start = (uch *)(0x200000 + low_buffer_size);
mv->hcount = 0; /* say: we need not to move high_buffer */
}
else mv->hcount = -1;
diff --git a/arch/x86_64/defconfig b/arch/x86_64/defconfig
index 840d5d9..06cf378 100644
--- a/arch/x86_64/defconfig
+++ b/arch/x86_64/defconfig
@@ -158,7 +158,6 @@ CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
-CONFIG_PHYSICAL_START=0x200000
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
diff --git a/arch/x86_64/kernel/vmlinux.lds.S b/arch/x86_64/kernel/vmlinux.lds.S
index 7c4de31..741487b 100644
--- a/arch/x86_64/kernel/vmlinux.lds.S
+++ b/arch/x86_64/kernel/vmlinux.lds.S
@@ -15,7 +15,7 @@ ENTRY(phys_startup_64)
jiffies_64 = jiffies;
SECTIONS
{
- . = __START_KERNEL;
+ . = __START_KERNEL_map + 0x200000;
phys_startup_64 = startup_64 - LOAD_OFFSET;
_text = .; /* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
diff --git a/arch/x86_64/mm/fault.c b/arch/x86_64/mm/fault.c
index ac8ea66..26d315b 100644
--- a/arch/x86_64/mm/fault.c
+++ b/arch/x86_64/mm/fault.c
@@ -650,9 +650,9 @@ void vmalloc_sync_all(void)
start = address + PGDIR_SIZE;
}
/* Check that there is no need to do the same for the modules area. */
- BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
+ BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL_map));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
- (__START_KERNEL & PGDIR_MASK)));
+ (__START_KERNEL_map & PGDIR_MASK)));
}

static int __init enable_pagefaulttrace(char *str)
diff --git a/include/asm-x86_64/page.h b/include/asm-x86_64/page.h
index 37f95ca..deed41a 100644
--- a/include/asm-x86_64/page.h
+++ b/include/asm-x86_64/page.h
@@ -75,8 +75,6 @@ #define __pgprot(x) ((pgprot_t) { (x) }

#endif /* !__ASSEMBLY__ */

-#define __PHYSICAL_START _AC(CONFIG_PHYSICAL_START,UL)
-#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
#define __START_KERNEL_map _AC(0xffffffff80000000,UL)
#define __PAGE_OFFSET _AC(0xffff810000000000,UL)

--
1.4.2.rc2.g5209e

2006-08-01 11:09:15

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 21/33] x86_64: modify copy_bootdata to use virtual addresses

Use virtual addresses instead of physical addresses
in copy_bootdata. In addition, fix the implementation
of the old bootloader convention: everything is always
relative to real_mode_data; it is just that setup.S
sometimes relocates real_mode_data so that it does not
sit at 0x90000.
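
To make the convention concrete, here is roughly what the fixed code
computes (a sketch, not the patch itself; the constants are the ones
defined in head64.c):

    /* Locate the old-style command line relative to real_mode_data,
     * wherever setup.S happened to place it (not necessarily 0x90000).
     */
    static char *old_command_line(char *real_mode_data)
    {
        u16 magic  = *(u16 *)(real_mode_data + 0x20); /* OLD_CL_MAGIC_ADDR */
        u16 offset = *(u16 *)(real_mode_data + 0x22); /* OLD_CL_OFFSET */

        if (magic != 0xA33F)    /* OLD_CL_MAGIC */
            return NULL;        /* bootloader too old to pass one */
        return __va(__pa(real_mode_data) + offset);
    }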

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/head64.c | 17 ++++++++---------
1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86_64/kernel/head64.c b/arch/x86_64/kernel/head64.c
index 454498c..defef4e 100644
--- a/arch/x86_64/kernel/head64.c
+++ b/arch/x86_64/kernel/head64.c
@@ -29,29 +29,28 @@ static void __init clear_bss(void)
}

#define NEW_CL_POINTER 0x228 /* Relative to real mode data */
-#define OLD_CL_MAGIC_ADDR 0x90020
+#define OLD_CL_MAGIC_ADDR 0x20
#define OLD_CL_MAGIC 0xA33F
-#define OLD_CL_BASE_ADDR 0x90000
-#define OLD_CL_OFFSET 0x90022
+#define OLD_CL_OFFSET 0x22

extern char saved_command_line[];

static void __init copy_bootdata(char *real_mode_data)
{
- int new_data;
+ unsigned long new_data;
char * command_line;

memcpy(x86_boot_params, real_mode_data, BOOT_PARAM_SIZE);
- new_data = *(int *) (x86_boot_params + NEW_CL_POINTER);
+ new_data = *(u32 *) (x86_boot_params + NEW_CL_POINTER);
if (!new_data) {
- if (OLD_CL_MAGIC != * (u16 *) OLD_CL_MAGIC_ADDR) {
+ if (OLD_CL_MAGIC != *(u16 *)(real_mode_data + OLD_CL_MAGIC_ADDR)) {
printk("so old bootloader that it does not support commandline?!\n");
return;
}
- new_data = OLD_CL_BASE_ADDR + * (u16 *) OLD_CL_OFFSET;
+ new_data = __pa(real_mode_data) + *(u16 *)(real_mode_data + OLD_CL_OFFSET);
printk("old bootloader convention, maybe loadlin?\n");
}
- command_line = (char *) ((u64)(new_data));
+ command_line = __va(new_data);
memcpy(saved_command_line, command_line, COMMAND_LINE_SIZE);
printk("Bootdata ok (command line is %s)\n", saved_command_line);
}
@@ -99,7 +98,7 @@ void __init x86_64_start_kernel(char * r
cpu_pda(i) = &boot_cpu_pda[i];

pda_init(0);
- copy_bootdata(real_mode_data);
+ copy_bootdata(__va(real_mode_data));
#ifdef CONFIG_SMP
cpu_set(0, cpu_online_map);
#endif
--
1.4.2.rc2.g5209e

2006-08-01 11:08:45

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 28/33] x86_64: Remove the identity mapping as early as possible.

With the rewrite of the SMP trampoline and the early page
allocator, there is nothing that needs identity-mapped pages
once we start executing C code.

So add zap_identity_mappings() to head64.c and remove
zap_low_mappings() from much later in the code. The functions
are subtly different, thus the name change.

This also kills boot_level4_pgt, which was from an earlier
attempt to move the identity mappings as early as possible
and is now no longer needed. Essentially I have replaced
boot_level4_pgt with trampoline_level4_pgt in trampoline.S.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/head.S | 34 ++++++++++++++--------------------
arch/x86_64/kernel/head64.c | 16 ++++++++++------
arch/x86_64/kernel/setup.c | 2 --
arch/x86_64/kernel/setup64.c | 1 -
arch/x86_64/mm/init.c | 24 ------------------------
include/asm-x86_64/pgtable.h | 1 -
include/asm-x86_64/proto.h | 2 --
7 files changed, 24 insertions(+), 56 deletions(-)

diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S
index a624586..b0b5618 100644
--- a/arch/x86_64/kernel/head.S
+++ b/arch/x86_64/kernel/head.S
@@ -73,7 +73,7 @@ startup_32:
movl %eax, %cr4

/* Setup early boot stage 4 level pagetables */
- movl $(boot_level4_pgt - __START_KERNEL_map), %eax
+ movl $(init_level4_pgt - __START_KERNEL_map), %eax
movl %eax, %cr3

/* Setup EFER (Extended Feature Enable Register) */
@@ -117,7 +117,7 @@ ENTRY(secondary_startup_64)
movq %rax, %cr4

/* Setup early boot stage 4 level pagetables. */
- movq $(boot_level4_pgt - __START_KERNEL_map), %rax
+ movq $(init_level4_pgt - __START_KERNEL_map), %rax
movq %rax, %cr3

/* Check if nx is implemented */
@@ -264,9 +264,19 @@ #define PMDS(START, PERM, COUNT) \
i = i + 1 ; \
.endr

+ /*
+ * This default setting generates an ident mapping at address 0x100000
+ * and a mapping for the kernel that precisely maps virtual address
+ * 0xffffffff80000000 to physical address 0x000000. (always using
+ * 2Mbyte large pages provided by PAE mode)
+ */
NEXT_PAGE(init_level4_pgt)
- /* This gets initialized in x86_64_start_kernel */
- .fill 512,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 257,8,0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 252,8,0
+ /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level3_ident_pgt)
.quad level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
@@ -301,22 +311,6 @@ #undef NEXT_PAGE
#ifndef CONFIG_HOTPLUG_CPU
__INITDATA
#endif
- /*
- * This default setting generates an ident mapping at address 0x100000
- * and a mapping for the kernel that precisely maps virtual address
- * 0xffffffff80000000 to physical address 0x000000. (always using
- * 2Mbyte large pages provided by PAE mode)
- */
- .align PAGE_SIZE
-ENTRY(boot_level4_pgt)
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 257,8,0
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 252,8,0
- /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
-
- .data

.align 16
.globl cpu_gdt_descr
diff --git a/arch/x86_64/kernel/head64.c b/arch/x86_64/kernel/head64.c
index defef4e..99d4463 100644
--- a/arch/x86_64/kernel/head64.c
+++ b/arch/x86_64/kernel/head64.c
@@ -18,8 +18,16 @@ #include <asm/bootsetup.h>
#include <asm/setup.h>
#include <asm/desc.h>
#include <asm/pgtable.h>
+#include <asm/tlbflush.h>
#include <asm/sections.h>

+static void __init zap_identity_mappings(void)
+{
+ pgd_t *pgd = pgd_offset_k(0UL);
+ pgd_clear(pgd);
+ __flush_tlb();
+}
+
/* Don't add a printk in there. printk relies on the PDA which is not initialized
yet. */
static void __init clear_bss(void)
@@ -78,6 +86,8 @@ void __init x86_64_start_kernel(char * r
char *s;
int i;

+ /* Make NULL pointers segfault */
+ zap_identity_mappings();
for (i = 0; i < 256; i++)
set_intr_gate(i, early_idt_handler);
asm volatile("lidt %0" :: "m" (idt_descr));
@@ -88,12 +98,6 @@ void __init x86_64_start_kernel(char * r
*/
lockdep_init();

- /*
- * switch to init_level4_pgt from boot_level4_pgt
- */
- memcpy(init_level4_pgt, boot_level4_pgt, PTRS_PER_PGD*sizeof(pgd_t));
- asm volatile("movq %0,%%cr3" :: "r" (__pa_symbol(&init_level4_pgt)));
-
for (i = 0; i < NR_CPUS; i++)
cpu_pda(i) = &boot_cpu_pda[i];

diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
index 21840ca..7225f61 100644
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -568,8 +568,6 @@ #endif

dmi_scan_machine();

- zap_low_mappings(0);
-
#ifdef CONFIG_ACPI
/*
* Initialize the ACPI boot-time table parser (gets the RSDP and SDT).
diff --git a/arch/x86_64/kernel/setup64.c b/arch/x86_64/kernel/setup64.c
index 6fe58a6..a1f3aed 100644
--- a/arch/x86_64/kernel/setup64.c
+++ b/arch/x86_64/kernel/setup64.c
@@ -197,7 +197,6 @@ void __cpuinit cpu_init (void)
/* CPU 0 is initialised in head64.c */
if (cpu != 0) {
pda_init(cpu);
- zap_low_mappings(cpu);
} else
estacks = boot_exception_stacks;

diff --git a/arch/x86_64/mm/init.c b/arch/x86_64/mm/init.c
index b46566a..149f363 100644
--- a/arch/x86_64/mm/init.c
+++ b/arch/x86_64/mm/init.c
@@ -332,21 +332,6 @@ void __init init_memory_mapping(unsigned
__flush_tlb_all();
}

-void __cpuinit zap_low_mappings(int cpu)
-{
- if (cpu == 0) {
- pgd_t *pgd = pgd_offset_k(0UL);
- pgd_clear(pgd);
- } else {
- /*
- * For AP's, zap the low identity mappings by changing the cr3
- * to init_level4_pgt and doing local flush tlb all
- */
- asm volatile("movq %0,%%cr3" :: "r" (__pa_symbol(&init_level4_pgt)));
- }
- __flush_tlb_all();
-}
-
/* Compute zone sizes for the DMA and DMA32 zones in a node. */
__init void
size_zones(unsigned long *z, unsigned long *h,
@@ -667,15 +652,6 @@ #endif
reservedpages << (PAGE_SHIFT-10),
datasize >> 10,
initsize >> 10);
-
-#ifdef CONFIG_SMP
- /*
- * Sync boot_level4_pgt mappings with the init_level4_pgt
- * except for the low identity mappings which are already zapped
- * in init_level4_pgt. This sync-up is essential for AP's bringup
- */
- memcpy(boot_level4_pgt+1, init_level4_pgt+1, (PTRS_PER_PGD-1)*sizeof(pgd_t));
-#endif
}

void free_init_pages(char *what, unsigned long begin, unsigned long end)
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index 5528e8b..d0816fb 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -18,7 +18,6 @@ extern pud_t level3_kernel_pgt[512];
extern pud_t level3_ident_pgt[512];
extern pmd_t level2_kernel_pgt[512];
extern pgd_t init_level4_pgt[];
-extern pgd_t boot_level4_pgt[];
extern unsigned long __supported_pte_mask;

#define swapper_pg_dir init_level4_pgt
diff --git a/include/asm-x86_64/proto.h b/include/asm-x86_64/proto.h
index 038fe1f..978ea43 100644
--- a/include/asm-x86_64/proto.h
+++ b/include/asm-x86_64/proto.h
@@ -11,8 +11,6 @@ struct pt_regs;
extern void start_kernel(void);
extern void pda_init(int);

-extern void zap_low_mappings(int cpu);
-
extern void early_idt_handler(void);

extern void mcheck_init(struct cpuinfo_x86 *c);
--
1.4.2.rc2.g5209e

2006-08-01 11:08:07

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 26/33] x86_64: 64bit PIC ACPI wakeup

- Killed lots of dead code
- Improved the cpu sanity checks to verify that long mode
is enabled when we wake up.
- Removed the need for modifying any existing kernel page table.
- Moved wakeup_level4_pgt into the wakeup routine so we can
run the kernel above 4G.
- Increased the size of the wakeup routine to 8K.
- Renamed the variables to use the 64bit register names.
- Lots of misc cleanups to match trampoline.S

I don't have a configuration on which I can test this, but it compiles
cleanly and it should work; the code is very similar to the SMP
trampoline, which I have tested. At least now the comments about still
running in low memory are actually correct.
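
The trick that lets the wakeup routine run at whatever low address it
was copied to is the same one the SMP trampoline uses: the 48-bit far
jump vectors are stored inside the copied code and patched with the
copy's base address before they are used. In C terms the fixup amounts
to the following sketch (the struct and names are illustrative; the
real work is three addl instructions in wakeup.S):

    struct farptr {
        u32 offset;    /* stored as target - wakeup_code */
        u16 selector;  /* __KERNEL32_CS or __KERNEL_CS */
        u16 pad;
    };

    static void fixup_vector(struct farptr *v, u32 wakeup_base)
    {
        /* turn the code-relative offset into a linear address */
        v->offset += wakeup_base;
    }

where wakeup_base is %esi, the linear address of the copied code,
computed by shifting %ds left by four bits.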

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/acpi/sleep.c | 19 --
arch/x86_64/kernel/acpi/wakeup.S | 325 +++++++++++++++++---------------------
arch/x86_64/kernel/head.S | 9 -
include/asm-x86_64/suspend.h | 12 +
4 files changed, 151 insertions(+), 214 deletions(-)

diff --git a/arch/x86_64/kernel/acpi/sleep.c b/arch/x86_64/kernel/acpi/sleep.c
index 5ebf62c..d9b28f8 100644
--- a/arch/x86_64/kernel/acpi/sleep.c
+++ b/arch/x86_64/kernel/acpi/sleep.c
@@ -60,17 +60,6 @@ extern char wakeup_start, wakeup_end;

extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long));

-static pgd_t low_ptr;
-
-static void init_low_mapping(void)
-{
- pgd_t *slot0 = pgd_offset(current->mm, 0UL);
- low_ptr = *slot0;
- set_pgd(slot0, *pgd_offset(current->mm, PAGE_OFFSET));
- WARN_ON(num_online_cpus() != 1);
- local_flush_tlb();
-}
-
/**
* acpi_save_state_mem - save kernel state
*
@@ -79,8 +68,6 @@ static void init_low_mapping(void)
*/
int acpi_save_state_mem(void)
{
- init_low_mapping();
-
memcpy((void *)acpi_wakeup_address, &wakeup_start,
&wakeup_end - &wakeup_start);
acpi_copy_wakeup_routine(acpi_wakeup_address);
@@ -93,8 +80,6 @@ int acpi_save_state_mem(void)
*/
void acpi_restore_state_mem(void)
{
- set_pgd(pgd_offset(current->mm, 0UL), low_ptr);
- local_flush_tlb();
}

/**
@@ -107,8 +92,8 @@ void acpi_restore_state_mem(void)
*/
void __init acpi_reserve_bootmem(void)
{
- acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE);
- if ((&wakeup_end - &wakeup_start) > PAGE_SIZE)
+ acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE*2);
+ if ((&wakeup_end - &wakeup_start) > (PAGE_SIZE*2))
printk(KERN_CRIT
"ACPI: Wakeup code way too big, will crash on attempt to suspend\n");
}
diff --git a/arch/x86_64/kernel/acpi/wakeup.S b/arch/x86_64/kernel/acpi/wakeup.S
index 185faa9..3eda0b5 100644
--- a/arch/x86_64/kernel/acpi/wakeup.S
+++ b/arch/x86_64/kernel/acpi/wakeup.S
@@ -1,6 +1,7 @@
.text
#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
#include <asm/msr.h>

@@ -15,7 +16,6 @@ # If physical address of wakeup_code is
# cs = 0x1234, eip = 0x05
#

-
ALIGN
.align 16
ENTRY(wakeup_start)
@@ -30,22 +30,25 @@ # Running in *copy* of this code, somewh
cld
# setup data segment
movw %cs, %ax
- movw %ax, %ds # Make ds:0 point to wakeup_start
+ movw %ax, %ds # Make ds:0 point to wakeup_start
movw %ax, %ss
- mov $(wakeup_stack - wakeup_code), %sp # Private stack is needed for ASUS board
+ # Private stack is needed for ASUS board
+ mov $(wakeup_stack - wakeup_code), %sp

- pushl $0 # Kill any dangerous flags
+ pushl $0 # Kill any dangerous flags
popfl

movl real_magic - wakeup_code, %eax
cmpl $0x12345678, %eax
jne bogus_real_magic

+ call verify_cpu # Verify the cpu supports long mode
+
testl $1, video_flags - wakeup_code
jz 1f
lcall $0xc000,$3
movw %cs, %ax
- movw %ax, %ds # Bios might have played with that
+ movw %ax, %ds # Bios might have played with that
movw %ax, %ss
1:

@@ -60,13 +63,17 @@ # Running in *copy* of this code, somewh
movw $0x0e00 + 'L', %fs:(0x10)

movb $0xa2, %al ; outb %al, $0x80
+
+ mov %ds, %ax # Find 32bit wakeup_code address
+ movzx %ax, %esi # (Convert %ds:0 to a linear ptr)
+ shll $4, %esi
+
+ # Fixup the vectors
+ addl %esi, wakeup_32_vector - wakeup_code
+ addl %esi, wakeup_long64_vector - wakeup_code
+ addl %esi, gdt_48a + 2 - wakeup_code # Fixup the gdt pointer

- lidt %ds:idt_48a - wakeup_code
- xorl %eax, %eax
- movw %ds, %ax # (Convert %ds:gdt to a linear ptr)
- shll $4, %eax
- addl $(gdta - wakeup_code), %eax
- movl %eax, gdt_48a +2 - wakeup_code
+ lidtl %ds:idt_48a - wakeup_code
lgdtl %ds:gdt_48a - wakeup_code # load gdt with whatever is
# appropriate

@@ -75,85 +82,47 @@ # Running in *copy* of this code, somewh
jmp 1f
1:

- .byte 0x66, 0xea # prefix + jmpi-opcode
- .long wakeup_32 - __START_KERNEL_map
- .word __KERNEL_CS
+ ljmpl *(wakeup_32_vector - wakeup_code)
+
+ .balign 4
+wakeup_32_vector:
+ .long wakeup_32 - wakeup_code
+ .word __KERNEL32_CS, 0

.code32
wakeup_32:
# Running in this code, but at low address; paging is not yet turned on.
movb $0xa5, %al ; outb %al, $0x80

- /* Check if extended functions are implemented */
- movl $0x80000000, %eax
- cpuid
- cmpl $0x80000000, %eax
- jbe bogus_cpu
- wbinvd
- mov $0x80000001, %eax
- cpuid
- btl $29, %edx
- jnc bogus_cpu
- movl %edx,%edi
-
- movw $__KERNEL_DS, %ax
- movw %ax, %ds
- movw %ax, %es
- movw %ax, %fs
- movw %ax, %gs
-
- movw $__KERNEL_DS, %ax
- movw %ax, %ss
+ /* Initialize segments */
+ movl $__KERNEL_DS, %eax
+ movl %eax, %ds

- mov $(wakeup_stack - __START_KERNEL_map), %esp
- movl saved_magic - __START_KERNEL_map, %eax
- cmpl $0x9abcdef0, %eax
- jne bogus_32_magic
+ movw $0x0e00 + 'i', %ds:(0xb8012)
+ movb $0xa8, %al ; outb %al, $0x80;

/*
* Prepare for entering 64bits mode
*/

- /* Enable PAE mode and PGE */
+ /* Enable PAE */
xorl %eax, %eax
btsl $5, %eax
- btsl $7, %eax
movl %eax, %cr4

/* Setup early boot stage 4 level pagetables */
- movl $(wakeup_level4_pgt - __START_KERNEL_map), %eax
+ leal (wakeup_level4_pgt - wakeup_code)(%esi), %eax
movl %eax, %cr3

- /* Setup EFER (Extended Feature Enable Register) */
- movl $MSR_EFER, %ecx
- rdmsr
- /* Fool rdmsr and reset %eax to avoid dependences */
- xorl %eax, %eax
/* Enable Long Mode */
- btsl $_EFER_LME, %eax
- /* Enable System Call */
- btsl $_EFER_SCE, %eax
-
- /* No Execute supported? */
- btl $20,%edi
- jnc 1f
- btsl $_EFER_NX, %eax
-1:
-
- /* Make changes effective */
+ movl $MSR_EFER, %ecx
+ movl $(1 << _EFER_LME), %eax # Enable Long Mode
+ xorl %edx, %edx
wrmsr
- wbinvd

xorl %eax, %eax
btsl $31, %eax /* Enable paging and in turn activate Long Mode */
btsl $0, %eax /* Enable protected mode */
- btsl $1, %eax /* Enable MP */
- btsl $4, %eax /* Enable ET */
- btsl $5, %eax /* Enable NE */
- btsl $16, %eax /* Enable WP */
- btsl $18, %eax /* Enable AM */
-
- /* Make changes effective */
movl %eax, %cr0
/* At this point:
CR4.PAE must be 1
@@ -162,11 +131,6 @@ # Running in this code, but at low addre
Next instruction must be a branch
This must be on identity-mapped page
*/
- jmp reach_compatibility_mode
-reach_compatibility_mode:
- movw $0x0e00 + 'i', %ds:(0xb8012)
- movb $0xa8, %al ; outb %al, $0x80;
-
/*
* At this point we're in long mode but in 32bit compatibility mode
* with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
@@ -174,20 +138,13 @@ reach_compatibility_mode:
* the new gdt/idt that has __KERNEL_CS with CS.L = 1.
*/

- movw $0x0e00 + 'n', %ds:(0xb8014)
- movb $0xa9, %al ; outb %al, $0x80
-
- /* Load new GDT with the 64bit segment using 32bit descriptor */
- movl $(pGDT32 - __START_KERNEL_map), %eax
- lgdt (%eax)
-
- movl $(wakeup_jumpvector - __START_KERNEL_map), %eax
/* Finally jump in 64bit mode */
- ljmp *(%eax)
+ ljmp *(wakeup_long64_vector - wakeup_code)(%esi)

-wakeup_jumpvector:
- .long wakeup_long64 - __START_KERNEL_map
- .word __KERNEL_CS
+ .balign 4
+wakeup_long64_vector:
+ .long wakeup_long64 - wakeup_code
+ .word __KERNEL_CS, 0

.code64

@@ -199,10 +156,18 @@ wakeup_long64:
* addresses where we're currently running on. We have to do that here
* because in 32bit we couldn't load a 64bit linear address.
*/
- lgdt cpu_gdt_descr - __START_KERNEL_map
+ lgdt cpu_gdt_descr
+
+ movw $0x0e00 + 'n', %ds:(0xb8014)
+ movb $0xa9, %al ; outb %al, $0x80
+
+ movq saved_magic, %rax
+ movq $0x123456789abcdef0, %rdx
+ cmpq %rdx, %rax
+ jne bogus_64_magic

movw $0x0e00 + 'u', %ds:(0xb8016)
-
+
nop
nop
movw $__KERNEL_DS, %ax
@@ -211,16 +176,16 @@ wakeup_long64:
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
- movq saved_esp, %rsp
+ movq saved_rsp, %rsp

movw $0x0e00 + 'x', %ds:(0xb8018)
- movq saved_ebx, %rbx
- movq saved_edi, %rdi
- movq saved_esi, %rsi
- movq saved_ebp, %rbp
+ movq saved_rbx, %rbx
+ movq saved_rdi, %rdi
+ movq saved_rsi, %rsi
+ movq saved_rbp, %rbp

movw $0x0e00 + '!', %ds:(0xb801a)
- movq saved_eip, %rax
+ movq saved_rip, %rax
jmp *%rax

.code32
@@ -228,25 +193,10 @@ wakeup_long64:
.align 64
gdta:
.word 0, 0, 0, 0 # dummy
-
- .word 0, 0, 0, 0 # unused
-
- .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
- .word 0 # base address = 0
- .word 0x9B00 # code read/exec. ??? Why I need 0x9B00 (as opposed to 0x9A00 in order for this to work?)
- .word 0x00CF # granularity = 4096, 386
- # (+5th nibble of limit)
-
- .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
- .word 0 # base address = 0
- .word 0x9200 # data read/write
- .word 0x00CF # granularity = 4096, 386
- # (+5th nibble of limit)
-# this is 64bit descriptor for code
- .word 0xFFFF
- .word 0
- .word 0x9A00 # code read/exec
- .word 0x00AF # as above, but it is long mode and with D=0
+ /* ??? Why I need the accessed bit set in order for this to work? */
+ .quad 0x00cf9b000000ffff # __KERNEL32_CS
+ .quad 0x00af9b000000ffff # __KERNEL_CS
+ .quad 0x00cf93000000ffff # __KERNEL_DS

idt_48a:
.word 0 # idt limit = 0
@@ -255,30 +205,24 @@ idt_48a:
gdt_48a:
.word 0x8000 # gdt limit=2048,
# 256 GDT entries
- .word 0, 0 # gdt base (filled in later)
-
-
+ .long gdta - wakeup_code # gdt base (relocated in later)
+
+
real_save_gdt: .word 0
.quad 0
real_magic: .quad 0
video_mode: .quad 0
video_flags: .quad 0

+.code16
bogus_real_magic:
movb $0xba,%al ; outb %al,$0x80
jmp bogus_real_magic

-bogus_32_magic:
+.code64
+bogus_64_magic:
movb $0xb3,%al ; outb %al,$0x80
- jmp bogus_32_magic
-
-bogus_31_magic:
- movb $0xb1,%al ; outb %al,$0x80
- jmp bogus_31_magic
-
-bogus_cpu:
- movb $0xbc,%al ; outb %al,$0x80
- jmp bogus_cpu
+ jmp bogus_64_magic


/* This code uses an extended set of video mode numbers. These include:
@@ -301,6 +245,7 @@ #define VIDEO_FIRST_VESA 0x0200
#define VIDEO_FIRST_V7 0x0900

# Setting of user mode (AX=mode ID) => CF=success
+.code16
mode_seta:
movw %ax, %bx
#if 0
@@ -346,14 +291,59 @@ check_vesaa:

_setbada: jmp setbada

- .code64
-bogus_magic:
- movw $0x0e00 + 'B', %ds:(0xb8018)
- jmp bogus_magic
+ .code16
+verify_cpu:
+ pushl $0 # Kill any dangerous flags
+ popfl
+
+ /* minimum CPUID flags for x86-64 */
+ /* see http://www.x86-64.org/lists/discuss/msg02971.html */
+#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
+ (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
+#define REQUIRED_MASK2 (1<<29)
+
+ pushfl # check for cpuid
+ popl %eax
+ movl %eax, %ebx
+ xorl $0x200000,%eax
+ pushl %eax
+ popfl
+ pushfl
+ popl %eax
+ pushl %ebx
+ popfl
+ cmpl %eax, %ebx
+ jz no_longmode
+
+ xorl %eax, %eax # See if cpuid 1 is implemented
+ cpuid
+ cmpl $0x1, %eax
+ jb no_longmode
+
+ movl $0x01, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK1, %edx
+ xorl $REQUIRED_MASK1, %edx
+ jnz no_longmode

-bogus_magic2:
- movw $0x0e00 + '2', %ds:(0xb8018)
- jmp bogus_magic2
+ movl $0x80000000, %eax # See if extended cpuid is implemented
+ cpuid
+ cmpl $0x80000001, %eax
+ jb no_longmode
+
+ movl $0x80000001, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK2, %edx
+ xorl $REQUIRED_MASK2, %edx
+ jnz no_longmode
+
+ ret # The cpu supports long mode
+
+no_longmode:
+ movb $0xbc,%al ; outb %al,$0x80
+ jmp no_longmode
+
+ ret


wakeup_stack_begin: # Stack grows down
@@ -361,7 +351,15 @@ wakeup_stack_begin: # Stack grows down
.org 0xff0
wakeup_stack: # Just below end of page

+.org 0x1000
+ENTRY(wakeup_level4_pgt)
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 510,8,0
+ /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+ .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
+
ENTRY(wakeup_end)
+ .code64

##
# acpi_copy_wakeup_routine
@@ -378,23 +376,6 @@ ENTRY(acpi_copy_wakeup_routine)
pushq %rcx
pushq %rdx

- sgdt saved_gdt
- sidt saved_idt
- sldt saved_ldt
- str saved_tss
-
- movq %cr3, %rdx
- movq %rdx, saved_cr3
- movq %cr4, %rdx
- movq %rdx, saved_cr4
- movq %cr0, %rdx
- movq %rdx, saved_cr0
- sgdt real_save_gdt - wakeup_start (,%rdi)
- movl $MSR_EFER, %ecx
- rdmsr
- movl %eax, saved_efer
- movl %edx, saved_efer2
-
movl saved_video_mode, %edx
movl %edx, video_mode - wakeup_start (,%rdi)
movl acpi_video_flags, %edx
@@ -403,18 +384,11 @@ ENTRY(acpi_copy_wakeup_routine)
movq $0x123456789abcdef0, %rdx
movq %rdx, saved_magic

- movl saved_magic - __START_KERNEL_map, %eax
- cmpl $0x9abcdef0, %eax
- jne bogus_32_magic
-
- # make sure %cr4 is set correctly (features, etc)
- movl saved_cr4 - __START_KERNEL_map, %eax
- movq %rax, %cr4
+ movq saved_magic, %rax
+ movq $0x123456789abcdef0, %rdx
+ cmpq %rdx, %rax
+ jne bogus_64_magic

- movl saved_cr0 - __START_KERNEL_map, %eax
- movq %rax, %cr0
- jmp 1f # Flush pipelines
-1:
# restore the regs we used
popq %rdx
popq %rcx
@@ -450,13 +424,13 @@ do_suspend_lowlevel:
movq %r15, saved_context_r15(%rip)
pushfq ; popq saved_context_eflags(%rip)

- movq $.L97, saved_eip(%rip)
+ movq $.L97, saved_rip(%rip)

- movq %rsp,saved_esp
- movq %rbp,saved_ebp
- movq %rbx,saved_ebx
- movq %rdi,saved_edi
- movq %rsi,saved_esi
+ movq %rsp,saved_rsp
+ movq %rbp,saved_rbp
+ movq %rbx,saved_rbx
+ movq %rdi,saved_rdi
+ movq %rsi,saved_rsi

addq $8, %rsp
movl $3, %edi
@@ -503,25 +477,12 @@ do_suspend_lowlevel:

.data
ALIGN
-ENTRY(saved_ebp) .quad 0
-ENTRY(saved_esi) .quad 0
-ENTRY(saved_edi) .quad 0
-ENTRY(saved_ebx) .quad 0
+ENTRY(saved_rbp) .quad 0
+ENTRY(saved_rsi) .quad 0
+ENTRY(saved_rdi) .quad 0
+ENTRY(saved_rbx) .quad 0

-ENTRY(saved_eip) .quad 0
-ENTRY(saved_esp) .quad 0
+ENTRY(saved_rip) .quad 0
+ENTRY(saved_rsp) .quad 0

ENTRY(saved_magic) .quad 0
-
-ALIGN
-# saved registers
-saved_gdt: .quad 0,0
-saved_idt: .quad 0,0
-saved_ldt: .quad 0
-saved_tss: .quad 0
-
-saved_cr0: .quad 0
-saved_cr3: .quad 0
-saved_cr4: .quad 0
-saved_efer: .quad 0
-saved_efer2: .quad 0
diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S
index 8d1b4a7..a624586 100644
--- a/arch/x86_64/kernel/head.S
+++ b/arch/x86_64/kernel/head.S
@@ -298,15 +298,6 @@ #undef NEXT_PAGE

.data

-#ifdef CONFIG_ACPI_SLEEP
- .align PAGE_SIZE
-ENTRY(wakeup_level4_pgt)
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 510,8,0
- /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
-#endif
-
#ifndef CONFIG_HOTPLUG_CPU
__INITDATA
#endif
diff --git a/include/asm-x86_64/suspend.h b/include/asm-x86_64/suspend.h
index a42306c..9c3f8de 100644
--- a/include/asm-x86_64/suspend.h
+++ b/include/asm-x86_64/suspend.h
@@ -45,12 +45,12 @@ #define loaddebug(thread,register) \
extern void fix_processor_context(void);

#ifdef CONFIG_ACPI_SLEEP
-extern unsigned long saved_eip;
-extern unsigned long saved_esp;
-extern unsigned long saved_ebp;
-extern unsigned long saved_ebx;
-extern unsigned long saved_esi;
-extern unsigned long saved_edi;
+extern unsigned long saved_rip;
+extern unsigned long saved_rsp;
+extern unsigned long saved_rbp;
+extern unsigned long saved_rbx;
+extern unsigned long saved_rsi;
+extern unsigned long saved_rdi;

/* routines for saving/restoring kernel state */
extern int acpi_save_state_mem(void);
--
1.4.2.rc2.g5209e

2006-08-01 11:09:54

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 25/33] x86_64: 64bit PIC SMP trampoline

This modifies the SMP trampoline and all of the associated code so
it can jump to a 64bit kernel loaded at an arbitrary address.

The dependencies on having an identity-mapped page in the kernel
page tables for SMP bootup have all been removed.

In addition, the trampoline has been modified to verify
that long mode is supported. Asking whether long mode is implemented
is downright silly, but we have traditionally had some of these checks,
and they can't hurt anything. So if the totally ludicrous ever happens
we just might handle it correctly.
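
The position-independence trick the trampoline relies on is that real
mode hands us our own load address for free: a real-mode segment maps
to a linear address by a shift of four, and once the far-jump vectors
are fixed up with that base they carry absolute pointers no matter
where the trampoline page was copied. A minimal C sketch of the
arithmetic (the names here are illustrative, not symbols from the
patch):

	/* A real-mode segment maps to a linear address by a shift of 4. */
	static unsigned long linear_base(unsigned short cs)
	{
		return (unsigned long)cs << 4;	/* e.g. cs = 0x9400 -> 0x94000 */
	}

	/* The far-jump vectors store r_base-relative offsets; adding the
	 * runtime linear base turns them into the absolute pointers that
	 * the indirect ljmpl needs. */
	static void fixup_vector(unsigned int *vector, unsigned long base)
	{
		*vector += base;
	}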

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/head.S | 1
arch/x86_64/kernel/setup.c | 9 --
arch/x86_64/kernel/trampoline.S | 168 ++++++++++++++++++++++++++++++++++++---
3 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S
index d0e626e..8d1b4a7 100644
--- a/arch/x86_64/kernel/head.S
+++ b/arch/x86_64/kernel/head.S
@@ -103,6 +103,7 @@ startup_32:
.org 0x100
.globl startup_64
startup_64:
+ENTRY(secondary_startup_64)
/* We come here either from startup_32
* or directly from a 64bit bootloader.
* Since we may have come directly from a bootloader we
diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
index 11d31ea..66816ba 100644
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -611,15 +611,8 @@ #endif
reserve_bootmem_generic(ebda_addr, ebda_size);

#ifdef CONFIG_SMP
- /*
- * But first pinch a few for the stack/trampoline stuff
- * FIXME: Don't need the extra page at 4K, but need to fix
- * trampoline before removing it. (see the GDT stuff)
- */
- reserve_bootmem_generic(PAGE_SIZE, PAGE_SIZE);
-
/* Reserve SMP trampoline */
- reserve_bootmem_generic(SMP_TRAMPOLINE_BASE, PAGE_SIZE);
+ reserve_bootmem_generic(SMP_TRAMPOLINE_BASE, 2*PAGE_SIZE);
#endif

#ifdef CONFIG_ACPI_SLEEP
diff --git a/arch/x86_64/kernel/trampoline.S b/arch/x86_64/kernel/trampoline.S
index c79b99a..13eee63 100644
--- a/arch/x86_64/kernel/trampoline.S
+++ b/arch/x86_64/kernel/trampoline.S
@@ -3,6 +3,7 @@
* Trampoline.S Derived from Setup.S by Linus Torvalds
*
* 4 Jan 1997 Michael Chastain: changed to gnu as.
+ * 15 Sept 2005 Eric Biederman: 64bit PIC support
*
* Entry: CS:IP point to the start of our code, we are
* in real mode with no stack, but the rest of the
@@ -17,15 +18,20 @@
* and IP is zero. Thus, data addresses need to be absolute
* (no relocation) and are taken with regard to r_base.
*
+ * With the addition of trampoline_level4_pgt this code can
+ * now enter a 64bit kernel that lives at arbitrary 64bit
+ * physical addresses.
+ *
* If you work on this file, check the object module with objdump
* --full-contents --reloc to make sure there are no relocation
- * entries. For the GDT entry we do hand relocation in smpboot.c
- * because of 64bit linker limitations.
+ * entries.
*/

#include <linux/linkage.h>
-#include <asm/segment.h>
+#include <asm/pgtable.h>
#include <asm/page.h>
+#include <asm/msr.h>
+#include <asm/segment.h>

.data

@@ -33,15 +39,31 @@ #include <asm/page.h>

ENTRY(trampoline_data)
r_base = .
+ cli # We should be safe anyway
wbinvd
mov %cs, %ax # Code and data in the same place
mov %ax, %ds
+ mov %ax, %es
+ mov %ax, %ss

- cli # We should be safe anyway

movl $0xA5A5A5A5, trampoline_data - r_base
# write marker for master knows we're running

+ # Setup stack
+ movw $(trampoline_stack_end - r_base), %sp
+
+ call verify_cpu # Verify the cpu supports long mode
+
+ mov %cs, %ax
+ movzx %ax, %esi # Find the 32bit trampoline location
+ shll $4, %esi
+
+ # Fixup the vectors
+ addl %esi, startup_32_vector - r_base
+ addl %esi, startup_64_vector - r_base
+ addl %esi, tgdt + 2 - r_base # Fixup the gdt pointer
+
/*
* GDT tables in non default location kernel can be beyond 16MB and
* lgdt will not be able to load the address as in real mode default
@@ -49,23 +71,141 @@ r_base = .
* to 32 bit.
*/

- lidtl idt_48 - r_base # load idt with 0, 0
- lgdtl gdt_48 - r_base # load gdt with whatever is appropriate
+ lidtl tidt - r_base # load idt with 0, 0
+ lgdtl tgdt - r_base # load gdt with whatever is appropriate

xor %ax, %ax
inc %ax # protected mode (PE) bit
lmsw %ax # into protected mode
- # flaush prefetch and jump to startup_32 in arch/x86_64/kernel/head.S
- ljmpl $__KERNEL32_CS, $(startup_32-__START_KERNEL_map)
+
+ # flush prefetch and jump to startup_32
+ ljmpl *(startup_32_vector - r_base)
+
+ .code32
+ .balign 4
+startup_32:
+ movl $__KERNEL_DS, %eax # Initialize the %ds segment register
+ movl %eax, %ds
+
+ xorl %eax, %eax
+ btsl $5, %eax # Enable PAE mode
+ movl %eax, %cr4
+
+ # Setup trampoline 4 level pagetables
+ leal (trampoline_level4_pgt - r_base)(%esi), %eax
+ movl %eax, %cr3
+
+ movl $MSR_EFER, %ecx
+ movl $(1 << _EFER_LME), %eax # Enable Long Mode
+ xorl %edx, %edx
+ wrmsr
+
+ xorl %eax, %eax
+ btsl $31, %eax # Enable paging and in turn activate Long Mode
+ btsl $0, %eax # Enable protected mode
+ movl %eax, %cr0
+
+ /*
+ * At this point we're in long mode but in 32bit compatibility mode
+ * with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
+ * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we use
+ * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
+ */
+ ljmp *(startup_64_vector - r_base)(%esi)
+
+ .code64
+ .balign 4
+startup_64:
+ # Now jump into the kernel using virtual addresses
+ movq $secondary_startup_64, %rax
+ jmp *%rax
+
+ .code16
+verify_cpu:
+ pushl $0 # Kill any dangerous flags
+ popfl
+
+ /* minimum CPUID flags for x86-64 */
+ /* see http://www.x86-64.org/lists/discuss/msg02971.html */
+#define REQUIRED_MASK1 ((1<<0)|(1<<3)|(1<<4)|(1<<5)|(1<<6)|(1<<8)|\
+ (1<<13)|(1<<15)|(1<<24)|(1<<25)|(1<<26))
+#define REQUIRED_MASK2 (1<<29)
+
+ pushfl # check for cpuid
+ popl %eax
+ movl %eax, %ebx
+ xorl $0x200000,%eax
+ pushl %eax
+ popfl
+ pushfl
+ popl %eax
+ pushl %ebx
+ popfl
+ cmpl %eax, %ebx
+ jz no_longmode
+
+ xorl %eax, %eax # See if cpuid 1 is implemented
+ cpuid
+ cmpl $0x1, %eax
+ jb no_longmode
+
+ movl $0x01, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK1, %edx
+ xorl $REQUIRED_MASK1, %edx
+ jnz no_longmode
+
+ movl $0x80000000, %eax # See if extended cpuid is implemented
+ cpuid
+ cmpl $0x80000001, %eax
+ jb no_longmode
+
+ movl $0x80000001, %eax # Does the cpu have what it takes?
+ cpuid
+ andl $REQUIRED_MASK2, %edx
+ xorl $REQUIRED_MASK2, %edx
+ jnz no_longmode
+
+ ret # The cpu supports long mode
+
+no_longmode:
+ hlt
+ jmp no_longmode
+

# Careful these need to be in the same 64K segment as the above;
-idt_48:
+tidt:
.word 0 # idt limit = 0
.word 0, 0 # idt base = 0L

-gdt_48:
- .short GDT_ENTRIES*8 - 1 # gdt limit
- .long cpu_gdt_table-__START_KERNEL_map
+ # Duplicate the global descriptor table
+ # so the kernel can live anywhere
+ .balign 4
+tgdt:
+ .short tgdt_end - tgdt # gdt limit
+ .long tgdt - r_base
+ .short 0
+ .quad 0x00cf9b000000ffff # __KERNEL32_CS
+ .quad 0x00af9b000000ffff # __KERNEL_CS
+ .quad 0x00cf93000000ffff # __KERNEL_DS
+tgdt_end:
+
+ .balign 4
+startup_32_vector:
+ .long startup_32 - r_base
+ .word __KERNEL32_CS, 0
+
+ .balign 4
+startup_64_vector:
+ .long startup_64 - r_base
+ .word __KERNEL_CS, 0
+
+trampoline_stack:
+ .org 0x1000
+trampoline_stack_end:
+ENTRY(trampoline_level4_pgt)
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ .fill 510,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE

-.globl trampoline_end
-trampoline_end:
+ENTRY(trampoline_end)
--
1.4.2.rc2.g5209e

2006-08-01 11:09:55

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

This patch does two very simple things.
It adds serial output capability to the decompressor,
and it adds a command-line parser for the earlyprintk
option so we know which output method to use for the decompressor.

This makes debugging the decompressor a little easier, and
keeps us from assuming we always have a VGA console on all
hardware.
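
As a usage sketch (based on the parser added below, which accepts vga,
serial, or a bare ttyS prefix, takes a hex I/O base as an alternative
to a port name, and defaults to 9600 baud on ttyS0), the decompressor
will honor command lines such as:

	earlyprintk=vga
	earlyprintk=serial,ttyS0,9600
	earlyprintk=ttyS1,115200
	earlyprintk=serial,0x3f8,115200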

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/i386/boot/compressed/misc.c | 258 +++++++++++++++++++++++++++++++++++---
1 files changed, 241 insertions(+), 17 deletions(-)

diff --git a/arch/i386/boot/compressed/misc.c b/arch/i386/boot/compressed/misc.c
index 905c37e..fcaa9f0 100644
--- a/arch/i386/boot/compressed/misc.c
+++ b/arch/i386/boot/compressed/misc.c
@@ -9,11 +9,14 @@
* High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996
*/

+#define __init
#include <linux/config.h>
#include <linux/linkage.h>
#include <linux/vmalloc.h>
+#include <linux/serial_reg.h>
#include <linux/screen_info.h>
#include <asm/io.h>
+#include <asm/setup.h>

/*
* gzip declarations
@@ -24,7 +27,9 @@ #define STATIC static

#undef memset
#undef memcpy
+#undef memcmp
#define memzero(s, n) memset ((s), 0, (n))
+char *strstr(const char *haystack, const char *needle);

typedef unsigned char uch;
typedef unsigned short ush;
@@ -78,12 +83,17 @@ static void gzip_release(void **);
* This is set up by the setup-routine at boot-time
*/
static unsigned char *real_mode; /* Pointer to real-mode data */
+static char saved_command_line[COMMAND_LINE_SIZE];

#define RM_EXT_MEM_K (*(unsigned short *)(real_mode + 0x2))
#ifndef STANDARD_MEMORY_BIOS_CALL
#define RM_ALT_MEM_K (*(unsigned long *)(real_mode + 0x1e0))
#endif
#define RM_SCREEN_INFO (*(struct screen_info *)(real_mode+0))
+#define RM_NEW_CL_POINTER ((char *)(unsigned long)(*(unsigned *)(real_mode+0x228)))
+#define RM_OLD_CL_MAGIC (*(unsigned short *)(real_mode + 0x20))
+#define RM_OLD_CL_OFFSET (*(unsigned short *)(real_mode + 0x22))
+#define OLD_CL_MAGIC 0xA33F

extern unsigned char input_data[];
extern int input_len;
@@ -97,8 +107,10 @@ static void free(void *where);

static void *memset(void *s, int c, unsigned n);
static void *memcpy(void *dest, const void *src, unsigned n);
+static int memcmp(const void *s1, const void *s2, unsigned n);

static void putstr(const char *);
+static unsigned simple_strtou(const char *cp,char **endp,unsigned base);

extern int end;
static long free_mem_ptr = (long)&end;
@@ -112,14 +124,25 @@ static unsigned int low_buffer_end, low_
static int high_loaded =0;
static uch *high_buffer_start /* = (uch *)(((ulg)&end) + HEAP_SIZE)*/;

-static char *vidmem = (char *)0xb8000;
+static char *vidmem;
static int vidport;
static int lines, cols;

#ifdef CONFIG_X86_NUMAQ
-static void * xquad_portio = NULL;
+static void * xquad_portio;
#endif

+/* The early serial console */
+
+#define DEFAULT_BAUD 9600
+#define DEFAULT_BASE 0x3f8 /* ttyS0 */
+static unsigned serial_base = DEFAULT_BASE;
+
+#define CONSOLE_NOOP 0
+#define CONSOLE_VID 1
+#define CONSOLE_SERIAL 2
+static int console = CONSOLE_NOOP;
+
#include "../../../../lib/inflate.c"

static void *malloc(int size)
@@ -154,7 +177,8 @@ static void gzip_release(void **ptr)
free_mem_ptr = (long) *ptr;
}

-static void scroll(void)
+/* The early video console */
+static void vid_scroll(void)
{
int i;

@@ -163,7 +187,7 @@ static void scroll(void)
vidmem[i] = ' ';
}

-static void putstr(const char *s)
+static void vid_putstr(const char *s)
{
int x,y,pos;
char c;
@@ -175,7 +199,7 @@ static void putstr(const char *s)
if ( c == '\n' ) {
x = 0;
if ( ++y >= lines ) {
- scroll();
+ vid_scroll();
y--;
}
} else {
@@ -183,7 +207,7 @@ static void putstr(const char *s)
if ( ++x >= cols ) {
x = 0;
if ( ++y >= lines ) {
- scroll();
+ vid_scroll();
y--;
}
}
@@ -200,6 +224,178 @@ static void putstr(const char *s)
outb_p(0xff & (pos >> 1), vidport+1);
}

+static void vid_console_init(void)
+{
+ if (RM_SCREEN_INFO.orig_video_mode == 7) {
+ vidmem = (char *) 0xb0000;
+ vidport = 0x3b4;
+ } else {
+ vidmem = (char *) 0xb8000;
+ vidport = 0x3d4;
+ }
+
+ lines = RM_SCREEN_INFO.orig_video_lines;
+ cols = RM_SCREEN_INFO.orig_video_cols;
+}
+
+/* The early serial console */
+static void serial_putc(int ch)
+{
+ if (ch == '\n') {
+ serial_putc('\r');
+ }
+ /* Wait until I can send a byte */
+ while ((inb(serial_base + UART_LSR) & UART_LSR_THRE) == 0)
+ ;
+
+ /* Send the byte */
+ outb(ch, serial_base + UART_TX);
+
+ /* Wait until the byte is transmitted */
+ while (!(inb(serial_base + UART_LSR) & UART_LSR_TEMT))
+ ;
+}
+
+static void serial_putstr(const char *str)
+{
+ int ch;
+ while((ch = *str++) != '\0') {
+ if (ch == '\n') {
+ serial_putc('\r');
+ }
+ serial_putc(ch);
+ }
+}
+
+static void serial_console_init(char *s)
+{
+ unsigned base = DEFAULT_BASE;
+ unsigned baud = DEFAULT_BAUD;
+ unsigned divisor;
+ char *e;
+
+ if (*s == ',')
+ ++s;
+ if (*s && (*s != ' ')) {
+ if (memcmp(s, "0x", 2) == 0) {
+ base = simple_strtou(s, &e, 16);
+ } else {
+ static const unsigned bases[] = { 0x3f8, 0x2f8 };
+ unsigned port;
+
+ if (memcmp(s, "ttyS", 4) == 0)
+ s += 4;
+ port = simple_strtou(s, &e, 10);
+ if ((port > 1) || (s == e))
+ port = 0;
+ base = bases[port];
+ }
+ s = e;
+ if (*s == ',')
+ ++s;
+ }
+ if (*s && (*s != ' ')) {
+ baud = simple_strtou(s, &e, 0);
+ if ((baud == 0) || (s == e))
+ baud = DEFAULT_BAUD;
+ }
+ divisor = 115200 / baud;
+ serial_base = base;
+
+ outb(0x00, serial_base + UART_IER); /* no interrupt */
+ outb(0x00, serial_base + UART_FCR); /* no fifo */
+ outb(0x03, serial_base + UART_MCR); /* DTR + RTS */
+
+ /* Set Baud Rate divisor */
+ outb(0x83, serial_base + UART_LCR);
+ outb(divisor & 0xff, serial_base + UART_DLL);
+ outb(divisor >> 8, serial_base + UART_DLM);
+ outb(0x03, serial_base + UART_LCR); /* 8n1 */
+
+}
+
+static void putstr(const char *str)
+{
+ if (console == CONSOLE_VID) {
+ vid_putstr(str);
+ } else if (console == CONSOLE_SERIAL) {
+ serial_putstr(str);
+ }
+}
+
+static void console_init(char *cmdline)
+{
+ cmdline = strstr(cmdline, "earlyprintk=");
+ if (!cmdline)
+ return;
+ cmdline += 12;
+ if (memcmp(cmdline, "vga", 3) == 0) {
+ vid_console_init();
+ console = CONSOLE_VID;
+ } else if (memcmp(cmdline, "serial", 6) == 0) {
+ serial_console_init(cmdline + 6);
+ console = CONSOLE_SERIAL;
+ } else if (memcmp(cmdline, "ttyS", 4) == 0) {
+ serial_console_init(cmdline);
+ console = CONSOLE_SERIAL;
+ }
+}
+
+static inline int tolower(int ch)
+{
+ return ch | 0x20;
+}
+
+static inline int isdigit(int ch)
+{
+ return (ch >= '0') && (ch <= '9');
+}
+
+static inline int isxdigit(int ch)
+{
+ ch = tolower(ch);
+ return isdigit(ch) || ((ch >= 'a') && (ch <= 'f'));
+}
+
+
+static inline int digval(int ch)
+{
+ return isdigit(ch)? (ch - '0') : tolower(ch) - 'a' + 10;
+}
+
+/**
+ * simple_strtou - convert a string to an unsigned
+ * @cp: The start of the string
+ * @endp: A pointer to the end of the parsed string will be placed here
+ * @base: The number base to use
+ */
+static unsigned simple_strtou(const char *cp, char **endp, unsigned base)
+{
+ unsigned result = 0,value;
+
+ if (!base) {
+ base = 10;
+ if (*cp == '0') {
+ base = 8;
+ cp++;
+ if ((tolower(*cp) == 'x') && isxdigit(cp[1])) {
+ cp++;
+ base = 16;
+ }
+ }
+ } else if (base == 16) {
+ if (cp[0] == '0' && tolower(cp[1]) == 'x')
+ cp += 2;
+ }
+ while (isxdigit(*cp) && ((value = digval(*cp)) < base)) {
+ result = result*base + value;
+ cp++;
+ }
+ if (endp)
+ *endp = (char *)cp;
+ return result;
+}
+
static void* memset(void* s, int c, unsigned n)
{
int i;
@@ -218,6 +414,29 @@ static void* memcpy(void* dest, const vo
return dest;
}

+static int memcmp(const void *s1, const void *s2, unsigned n)
+{
+ const unsigned char *str1 = s1, *str2 = s2;
+ size_t i;
+ int result = 0;
+ for(i = 0; (result == 0) && (i < n); i++) {
+ result = *str1++ - *str2++;
+ }
+ return result;
+}
+
+char *strstr(const char *haystack, const char *needle)
+{
+ size_t len;
+ len = strlen(needle);
+ while(*haystack) {
+ if (memcmp(haystack, needle, len) == 0)
+ return (char *)haystack;
+ haystack++;
+ }
+ return NULL;
+}
+
/* ===========================================================================
* Fill the input buffer. This is called only when the buffer is empty
* and at least one byte is really needed.
@@ -346,20 +565,25 @@ static void close_output_buffer_if_we_ru
}
}

-asmlinkage int decompress_kernel(struct moveparams *mv, void *rmode)
+static void save_command_line(void)
{
- real_mode = rmode;
-
- if (RM_SCREEN_INFO.orig_video_mode == 7) {
- vidmem = (char *) 0xb0000;
- vidport = 0x3b4;
- } else {
- vidmem = (char *) 0xb8000;
- vidport = 0x3d4;
+ /* Find the command line */
+ char *cmdline;
+ cmdline = saved_command_line;
+ if (RM_NEW_CL_POINTER) {
+ cmdline = RM_NEW_CL_POINTER;
+ } else if (OLD_CL_MAGIC == RM_OLD_CL_MAGIC) {
+ cmdline = real_mode + RM_OLD_CL_OFFSET;
}
+ memcpy(saved_command_line, cmdline, COMMAND_LINE_SIZE);
+ saved_command_line[COMMAND_LINE_SIZE - 1] = '\0';
+}

- lines = RM_SCREEN_INFO.orig_video_lines;
- cols = RM_SCREEN_INFO.orig_video_cols;
+asmlinkage int decompress_kernel(struct moveparams *mv, void *rmode)
+{
+ real_mode = rmode;
+ save_command_line();
+ console_init(saved_command_line);

if (free_mem_ptr < 0x100000) setup_normal_output_buffer();
else setup_output_buffer_if_we_run_high(mv);
--
1.4.2.rc2.g5209e

2006-08-01 11:06:33

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 29/33] x86_64: __pa and __pa_symbol address space separation

Currently __pa_symbol is for use with symbols in the kernel address
map and __pa is for use with pointers into the physical memory map.
But the code is implemented so you can usually interchange the two.

__pa, which is much more common, can be implemented much more cheaply
if it doesn't have to worry about any other kernel address
spaces. This is especially true with a relocatable kernel, as
__pa_symbol needs to perform an extra variable read to resolve
the address.

A third macro, __pa_vsymbol, is added for the vsyscall data,
for finding the physical addresses of the vsyscall pages.
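
The split is easiest to see in the macro arithmetic itself. Condensed
from the page.h hunk below (the _old/_new suffixes are illustrative,
and the asm no-op that launders the symbol address is elided):

	/* Before: __pa had to test which mapping an address came from. */
	#define __pa_old(x) \
		(((unsigned long)(x) >= __START_KERNEL_map) ? \
		 (unsigned long)(x) - __START_KERNEL_map : \
		 (unsigned long)(x) - PAGE_OFFSET)

	/* After: each macro handles exactly one address space. */
	#define __pa_new(x)        ((unsigned long)(x) - PAGE_OFFSET)
	#define __pa_symbol_new(x) ((unsigned long)(x) - __START_KERNEL_map)

So __pa compiles down to a single subtraction, and only __pa_symbol
has to know where the kernel image itself lives.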

Most of this patch is simply sorting through the references to
__pa or __pa_symbol and using the proper one. A little of
it continues to use a physical address we already have
instead of recalculating it several times.

swapper_pg_dir is now NULL. leave_mm now uses init_mm.pgd,
and init_mm.pgd is initialized at boot (instead of at compile time)
to the physmem virtual mapping of init_level4_pgt, since the
physical address can change.

Except for the empty_zero_page case all of the remaining references
to __pa_symbol appear to be during kernel initialization. So this
should reduce the cost of __pa in the common case, even on a relocated
kernel.

As this is technically a semantic change we need to be on the lookout
for anything I missed. But it works for me (tm).

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/i386/kernel/alternative.c | 8 ++++----
arch/i386/mm/init.c | 15 ++++++++-------
arch/x86_64/kernel/setup.c | 9 +++++----
arch/x86_64/kernel/smp.c | 2 +-
arch/x86_64/kernel/vsyscall.c | 10 ++++++++--
arch/x86_64/mm/init.c | 21 +++++++++++----------
arch/x86_64/mm/pageattr.c | 20 +++++++++++---------
include/asm-x86_64/page.h | 6 ++----
include/asm-x86_64/pgtable.h | 4 ++--
9 files changed, 52 insertions(+), 43 deletions(-)

diff --git a/arch/i386/kernel/alternative.c b/arch/i386/kernel/alternative.c
index 28ab806..e573263 100644
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -347,8 +347,8 @@ void __init alternative_instructions(voi
if (no_replacement) {
printk(KERN_INFO "(SMP-)alternatives turned off\n");
free_init_pages("SMP alternatives",
- (unsigned long)__smp_alt_begin,
- (unsigned long)__smp_alt_end);
+ __pa_symbol(&__smp_alt_begin),
+ __pa_symbol(&__smp_alt_end));
return;
}
apply_alternatives(__alt_instructions, __alt_instructions_end);
@@ -375,8 +375,8 @@ #ifdef CONFIG_SMP
_text, _etext);
}
free_init_pages("SMP alternatives",
- (unsigned long)__smp_alt_begin,
- (unsigned long)__smp_alt_end);
+ __pa_symbol(&__smp_alt_begin),
+ __pa_symbol(&__smp_alt_end));
} else {
alternatives_smp_save(__smp_alt_instructions,
__smp_alt_instructions_end);
diff --git a/arch/i386/mm/init.c b/arch/i386/mm/init.c
index 89e8486..8dbbb09 100644
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -750,10 +750,11 @@ void free_init_pages(char *what, unsigne
unsigned long addr;

for (addr = begin; addr < end; addr += PAGE_SIZE) {
- ClearPageReserved(virt_to_page(addr));
- init_page_count(virt_to_page(addr));
- memset((void *)addr, POISON_FREE_INITMEM, PAGE_SIZE);
- free_page(addr);
+ struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
+ ClearPageReserved(page);
+ init_page_count(page);
+ memset(page_address(page), POISON_FREE_INITMEM, PAGE_SIZE);
+ __free_page(page);
totalram_pages++;
}
printk(KERN_INFO "Freeing %s: %ldk freed\n", what, (end - begin) >> 10);
@@ -762,14 +763,14 @@ void free_init_pages(char *what, unsigne
void free_initmem(void)
{
free_init_pages("unused kernel memory",
- (unsigned long)(&__init_begin),
- (unsigned long)(&__init_end));
+ __pa_symbol(&__init_begin),
+ __pa_symbol(&__init_end));
}

#ifdef CONFIG_BLK_DEV_INITRD
void free_initrd_mem(unsigned long start, unsigned long end)
{
- free_init_pages("initrd memory", start, end);
+ free_init_pages("initrd memory", __pa(start), __pa(end));
}
#endif

diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
index 7225f61..44a40e6 100644
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -543,11 +543,12 @@ #endif
init_mm.end_code = (unsigned long) &_etext;
init_mm.end_data = (unsigned long) &_edata;
init_mm.brk = (unsigned long) &_end;
+ init_mm.pgd = __va(__pa_symbol(&init_level4_pgt));

- code_resource.start = virt_to_phys(&_text);
- code_resource.end = virt_to_phys(&_etext)-1;
- data_resource.start = virt_to_phys(&_etext);
- data_resource.end = virt_to_phys(&_edata)-1;
+ code_resource.start = __pa_symbol(&_text);
+ code_resource.end = __pa_symbol(&_etext)-1;
+ data_resource.start = __pa_symbol(&_etext);
+ data_resource.end = __pa_symbol(&_edata)-1;

parse_cmdline_early(cmdline_p);

diff --git a/arch/x86_64/kernel/smp.c b/arch/x86_64/kernel/smp.c
index 5a1c0a3..5a54066 100644
--- a/arch/x86_64/kernel/smp.c
+++ b/arch/x86_64/kernel/smp.c
@@ -76,7 +76,7 @@ static inline void leave_mm(int cpu)
if (read_pda(mmu_state) == TLBSTATE_OK)
BUG();
cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask);
- load_cr3(swapper_pg_dir);
+ load_cr3(init_mm.pgd);
}

/*
diff --git a/arch/x86_64/kernel/vsyscall.c b/arch/x86_64/kernel/vsyscall.c
index f603037..2e48407 100644
--- a/arch/x86_64/kernel/vsyscall.c
+++ b/arch/x86_64/kernel/vsyscall.c
@@ -41,6 +41,12 @@ seqlock_t __xtime_lock __section_xtime_l

#include <asm/unistd.h>

+#define __pa_vsymbol(x) \
+ ({unsigned long v; \
+ extern char __vsyscall_0; \
+ asm("" : "=r" (v) : "0" (x)); \
+ ((v - VSYSCALL_FIRST_PAGE) + __pa_symbol(&__vsyscall_0)); })
+
static __always_inline void timeval_normalize(struct timeval * tv)
{
time_t __sec;
@@ -155,10 +161,10 @@ static int vsyscall_sysctl_change(ctl_ta
return ret;
/* gcc has some trouble with __va(__pa()), so just do it this
way. */
- map1 = ioremap(__pa_symbol(&vsysc1), 2);
+ map1 = ioremap(__pa_vsymbol(&vsysc1), 2);
if (!map1)
return -ENOMEM;
- map2 = ioremap(__pa_symbol(&vsysc2), 2);
+ map2 = ioremap(__pa_vsymbol(&vsysc2), 2);
if (!map2) {
ret = -ENOMEM;
goto out;
diff --git a/arch/x86_64/mm/init.c b/arch/x86_64/mm/init.c
index 149f363..5db93b9 100644
--- a/arch/x86_64/mm/init.c
+++ b/arch/x86_64/mm/init.c
@@ -663,11 +663,11 @@ void free_init_pages(char *what, unsigne

printk(KERN_INFO "Freeing %s: %ldk freed\n", what, (end - begin) >> 10);
for (addr = begin; addr < end; addr += PAGE_SIZE) {
- ClearPageReserved(virt_to_page(addr));
- init_page_count(virt_to_page(addr));
- memset((void *)(addr & ~(PAGE_SIZE-1)),
- POISON_FREE_INITMEM, PAGE_SIZE);
- free_page(addr);
+ struct page *page = pfn_to_page(addr >> PAGE_SHIFT);
+ ClearPageReserved(page);
+ init_page_count(page);
+ memset(page_address(page), POISON_FREE_INITMEM, PAGE_SIZE);
+ __free_page(page);
totalram_pages++;
}
}
@@ -677,17 +677,18 @@ void free_initmem(void)
memset(__initdata_begin, POISON_FREE_INITDATA,
__initdata_end - __initdata_begin);
free_init_pages("unused kernel memory",
- (unsigned long)(&__init_begin),
- (unsigned long)(&__init_end));
+ __pa_symbol(&__init_begin),
+ __pa_symbol(&__init_end));
}

#ifdef CONFIG_DEBUG_RODATA

void mark_rodata_ro(void)
{
- unsigned long addr = (unsigned long)__start_rodata;
+ unsigned long addr = (unsigned long)__va(__pa_symbol(&__start_rodata));
+ unsigned long end = (unsigned long)__va(__pa_symbol(&__end_rodata));

- for (; addr < (unsigned long)__end_rodata; addr += PAGE_SIZE)
+ for (; addr < end; addr += PAGE_SIZE)
change_page_attr_addr(addr, 1, PAGE_KERNEL_RO);

printk ("Write protecting the kernel read-only data: %luk\n",
@@ -706,7 +707,7 @@ #endif
#ifdef CONFIG_BLK_DEV_INITRD
void free_initrd_mem(unsigned long start, unsigned long end)
{
- free_init_pages("initrd memory", start, end);
+ free_init_pages("initrd memory", __pa(start), __pa(end));
}
#endif

diff --git a/arch/x86_64/mm/pageattr.c b/arch/x86_64/mm/pageattr.c
index 2685b1f..9d6196d 100644
--- a/arch/x86_64/mm/pageattr.c
+++ b/arch/x86_64/mm/pageattr.c
@@ -51,7 +51,6 @@ static struct page *split_large_page(uns
SetPagePrivate(base);
page_private(base) = 0;

- address = __pa(address);
addr = address & LARGE_PAGE_MASK;
pbase = (pte_t *)page_address(base);
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
@@ -95,7 +94,7 @@ static inline void save_page(struct page
* No more special protections in this 2/4MB area - revert to a
* large page again.
*/
-static void revert_page(unsigned long address, pgprot_t ref_prot)
+static void revert_page(unsigned long address, unsigned long pfn, pgprot_t ref_prot)
{
pgd_t *pgd;
pud_t *pud;
@@ -109,7 +108,7 @@ static void revert_page(unsigned long ad
pmd = pmd_offset(pud, address);
BUG_ON(pmd_val(*pmd) & _PAGE_PSE);
pgprot_val(ref_prot) |= _PAGE_PSE;
- large_pte = mk_pte_phys(__pa(address) & LARGE_PAGE_MASK, ref_prot);
+ large_pte = mk_pte_phys((pfn << PAGE_SHIFT) & LARGE_PAGE_MASK, ref_prot);
set_pte((pte_t *)pmd, large_pte);
}

@@ -137,7 +136,7 @@ __change_page_attr(unsigned long address
struct page *split;
ref_prot2 = __pgprot(pgprot_val(pte_pgprot(*lookup_address(address))) & ~(1<<_PAGE_BIT_PSE));

- split = split_large_page(address, prot, ref_prot2);
+ split = split_large_page(pfn << PAGE_SHIFT, prot, ref_prot2);
if (!split)
return -ENOMEM;
set_pte(kpte,mk_pte(split, ref_prot2));
@@ -156,7 +155,7 @@ __change_page_attr(unsigned long address

if (page_private(kpte_page) == 0) {
save_page(kpte_page);
- revert_page(address, ref_prot);
+ revert_page(address, pfn, ref_prot);
}
return 0;
}
@@ -176,6 +175,7 @@ __change_page_attr(unsigned long address
*/
int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
{
+ unsigned long phys_base_pfn = __pa_symbol(__START_KERNEL_map) >> PAGE_SHIFT;
int err = 0;
int i;

@@ -188,14 +188,16 @@ int change_page_attr_addr(unsigned long
break;
/* Handle kernel mapping too which aliases part of the
* lowmem */
- if (__pa(address) < KERNEL_TEXT_SIZE) {
+ if ((pfn >= phys_base_pfn) &&
+ ((pfn - phys_base_pfn) < (KERNEL_TEXT_SIZE >> PAGE_SHIFT)))
+ {
unsigned long addr2;
pgprot_t prot2 = prot;
- addr2 = __START_KERNEL_map + __pa(address);
+ addr2 = __START_KERNEL_map + ((pfn - phys_base_pfn) << PAGE_SHIFT);
pgprot_val(prot2) &= ~_PAGE_NX;
err = __change_page_attr(addr2, pfn, prot2, PAGE_KERNEL_EXEC);
- }
- }
+ }
+ }
up_write(&init_mm.mmap_sem);
return err;
}
diff --git a/include/asm-x86_64/page.h b/include/asm-x86_64/page.h
index f030260..37f95ca 100644
--- a/include/asm-x86_64/page.h
+++ b/include/asm-x86_64/page.h
@@ -102,17 +102,15 @@ #define PAGE_OFFSET __PAGE_OFFSET

/* Note: __pa(&symbol_visible_to_c) should be always replaced with __pa_symbol.
Otherwise you risk miscompilation. */
-#define __pa(x) (((unsigned long)(x)>=__START_KERNEL_map)?(unsigned long)(x) - (unsigned long)__START_KERNEL_map:(unsigned long)(x) - PAGE_OFFSET)
+#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)
/* __pa_symbol should be used for C visible symbols.
This seems to be the official gcc blessed way to do such arithmetic. */
#define __pa_symbol(x) \
({unsigned long v; \
asm("" : "=r" (v) : "0" (x)); \
- __pa(v); })
+ (v - __START_KERNEL_map); })

#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
-#define __boot_va(x) __va(x)
-#define __boot_pa(x) __pa(x)
#ifdef CONFIG_FLATMEM
#define pfn_valid(pfn) ((pfn) < end_pfn)
#endif
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index d0816fb..fa43712 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -20,7 +20,7 @@ extern pmd_t level2_kernel_pgt[512];
extern pgd_t init_level4_pgt[];
extern unsigned long __supported_pte_mask;

-#define swapper_pg_dir init_level4_pgt
+#define swapper_pg_dir ((pgd_t *)NULL)

extern int nonx_setup(char *str);
extern void paging_init(void);
@@ -33,7 +33,7 @@ extern unsigned long pgkern_mask;
* for zero-mapped memory areas etc..
*/
extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
-#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))
+#define ZERO_PAGE(vaddr) (pfn_to_page(__pa_symbol(&empty_zero_page) >> PAGE_SHIFT))

#endif /* !__ASSEMBLY__ */

--
1.4.2.rc2.g5209e

2006-08-01 11:12:00

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 8/33] kallsyms.c: Generate relocatable symbols.

Print the addresses of non-absolute symbols relative to _text
so that ld will generate relocations, allowing a relocatable
kernel to relocate them. We can't actually use the symbol names
because kallsyms includes static symbols that are not exported
from their object files.
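
For illustration, the generated assembly goes from absolute constants
to expressions that ld can relocate — roughly like this (the addresses
are made up):

	kallsyms_addresses:
		PTR	_text + 0x0		/* _text itself */
		PTR	_text + 0x12f0		/* an ordinary text symbol */
		PTR	0xffffffff80000000	/* an absolute ('A') symbol, kept as-is */

Only symbols whose nm type is not 'A' get the _text-relative form,
which is what the toupper(table[i].sym[0]) test below selects.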

Signed-off-by: Eric W. Biederman <[email protected]>
---
scripts/kallsyms.c | 20 +++++++++++++++++---
1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 22d281c..4c1ad0a 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -43,7 +43,7 @@ struct sym_entry {

static struct sym_entry *table;
static unsigned int table_size, table_cnt;
-static unsigned long long _stext, _etext, _sinittext, _einittext, _sextratext, _eextratext;
+static unsigned long long _text, _stext, _etext, _sinittext, _einittext, _sextratext, _eextratext;
static int all_symbols = 0;
static char symbol_prefix_char = '\0';

@@ -91,7 +91,9 @@ static int read_symbol(FILE *in, struct
sym++;

/* Ignore most absolute/undefined (?) symbols. */
- if (strcmp(sym, "_stext") == 0)
+ if (strcmp(sym, "_text") == 0)
+ _text = s->addr;
+ else if (strcmp(sym, "_stext") == 0)
_stext = s->addr;
else if (strcmp(sym, "_etext") == 0)
_etext = s->addr;
@@ -265,9 +267,21 @@ static void write_src(void)

printf(".data\n");

+ /* Provide proper symbols relocatability by their '_text'
+ * relativeness. The symbol names cannot be used to construct
+ * normal symbol references as the list of symbols contains
+ * symbols that are declared static and are private to their
+ * .o files. This prevents .tmp_kallsyms.o or any other
+ * object from referencing them.
+ */
output_label("kallsyms_addresses");
for (i = 0; i < table_cnt; i++) {
- printf("\tPTR\t%#llx\n", table[i].addr);
+ if (toupper(table[i].sym[0]) != 'A') {
+ printf("\tPTR\t_text + %#llx\n",
+ table[i].addr - _text);
+ } else {
+ printf("\tPTR\t%#llx\n", table[i].addr);
+ }
}
printf("\n");

--
1.4.2.rc2.g5209e

2006-08-01 11:06:34

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 11/33] i386 boot: Add an ELF header to bzImage

Increasingly, the cobbled-together boot protocol that
is bzImage does not have the flexibility to deal
with booting in new situations.

Now that we no longer support the bootsector loader
we have 512 bytes at the very start of a bzImage that
we can use for other things.

Placing an ELF header there allows us to retain
a single binary for all of x86 while at the same
time describing things that bzImage does not allow
us to describe.

The existing bugger-off code, which warns if we attempt to
boot directly from the bootsector, is kept, but the error message
is made more terse so we have a little more room to play with.
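
Because the first 512 bytes now form a real ELF header, a quick sanity
check is to point readelf at the finished image (a sketch; the output
below is abbreviated and approximate):

	$ readelf -h arch/i386/boot/bzImage
	  Class:                 ELF32
	  Type:                  EXEC (or DYN for a relocatable kernel)
	  Machine:               Intel 80386
	  Entry point address:   0x100000

	$ readelf -n arch/i386/boot/bzImage	# lists the ELFBoot notes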

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/i386/boot/Makefile | 2
arch/i386/boot/bootsect.S | 97 ++++++++++++++++++-
arch/i386/boot/tools/build.c | 214 +++++++++++++++++++++++++++++++++++++-----
include/linux/elf_boot.h | 19 ++++
4 files changed, 303 insertions(+), 29 deletions(-)

diff --git a/arch/i386/boot/Makefile b/arch/i386/boot/Makefile
index e979466..44ef35c 100644
--- a/arch/i386/boot/Makefile
+++ b/arch/i386/boot/Makefile
@@ -43,7 +43,7 @@ # --------------------------------------

quiet_cmd_image = BUILD $@
cmd_image = $(obj)/tools/build $(BUILDFLAGS) $(obj)/bootsect $(obj)/setup \
- $(obj)/vmlinux.bin $(ROOT_DEV) > $@
+ $(obj)/vmlinux.bin $(ROOT_DEV) vmlinux > $@

$(obj)/zImage $(obj)/bzImage: $(obj)/bootsect $(obj)/setup \
$(obj)/vmlinux.bin $(obj)/tools/build FORCE
diff --git a/arch/i386/boot/bootsect.S b/arch/i386/boot/bootsect.S
index 011b7a4..847dc8f 100644
--- a/arch/i386/boot/bootsect.S
+++ b/arch/i386/boot/bootsect.S
@@ -13,6 +13,13 @@
*
*/

+#include <linux/config.h>
+#include <linux/version.h>
+#include <linux/utsrelease.h>
+#include <linux/compile.h>
+#include <linux/elf.h>
+#include <linux/elf_boot.h>
+#include <asm/page.h>
#include <asm/boot.h>

SETUPSECTS = 4 /* default nr of setup-sectors */
@@ -42,10 +49,92 @@ #endif

.global _start
_start:
-
+ehdr:
+ # e_ident is carefully crafted so if this is treated
+ # as an x86 bootsector you will execute through
+ # e_ident and then print the bugger off message.
+ # The 1 stores to bx+di is unfortunate it is
+ # unlikely to affect the ability to print
+ # a message and you aren't supposed to be booting a
+ # bzImage directly from a floppy anyway.
+
+ # e_ident
+ .byte ELFMAG0, ELFMAG1, ELFMAG2, ELFMAG3
+ .byte ELFCLASS32, ELFDATA2LSB, EV_CURRENT, ELFOSABI_STANDALONE
+ .byte 0xeb, 0x3d, 0, 0, 0, 0, 0, 0
+#ifndef CONFIG_RELOCATABLE
+ .word ET_EXEC # e_type
+#else
+ .word ET_DYN # e_type
+#endif
+ .word EM_386 # e_machine
+ .int 1 # e_version
+ .int CONFIG_PHYSICAL_START # e_entry
+ .int phdr - _start # e_phoff
+ .int 0 # e_shoff
+ .int 0 # e_flags
+ .word e_ehdr - ehdr # e_ehsize
+ .word e_phdr1 - phdr # e_phentsize
+ .word (e_phdr - phdr)/(e_phdr1 - phdr) # e_phnum
+ .word 40 # e_shentsize
+ .word 0 # e_shnum
+ .word 0 # e_shstrndx
+e_ehdr:
+
+.org 71
+normalize:
# Normalize the start address
jmpl $BOOTSEG, $start2

+.org 80
+phdr:
+ .int PT_LOAD # p_type
+ .int (SETUPSECTS+1)*512 # p_offset
+ .int __PAGE_OFFSET + CONFIG_PHYSICAL_START # p_vaddr
+ .int CONFIG_PHYSICAL_START # p_paddr
+ .int SYSSIZE*16 # p_filesz
+ .int 0 # p_memsz
+ .int PF_R | PF_W | PF_X # p_flags
+ .int 4*1024*1024 # p_align
+e_phdr1:
+
+ .int PT_NOTE # p_type
+ .int b_note - _start # p_offset
+ .int 0 # p_vaddr
+ .int 0 # p_paddr
+ .int e_note - b_note # p_filesz
+ .int 0 # p_memsz
+ .int 0 # p_flags
+ .int 0 # p_align
+e_phdr:
+
+.macro note name, type
+ .balign 4
+ .int 2f - 1f # n_namesz
+ .int 4f - 3f # n_descsz
+ .int \type # n_type
+ .balign 4
+1: .asciz "\name"
+2: .balign 4
+3:
+.endm
+.macro enote
+4: .balign 4
+.endm
+
+ .balign 4
+b_note:
+ note ELF_NOTE_BOOT, EIN_PROGRAM_NAME
+ .asciz "Linux"
+ enote
+ note ELF_NOTE_BOOT, EIN_PROGRAM_VERSION
+ .asciz UTS_RELEASE
+ enote
+ note ELF_NOTE_BOOT, EIN_ARGUMENT_STYLE
+ .asciz "Linux"
+ enote
+e_note:
+
start2:
movw %cs, %ax
movw %ax, %ds
@@ -78,11 +167,11 @@ die:


bugger_off_msg:
- .ascii "Direct booting from floppy is no longer supported.\r\n"
- .ascii "Please use a boot loader program instead.\r\n"
+ .ascii "Booting linux without a boot loader is no longer supported.\r\n"
.ascii "\n"
- .ascii "Remove disk and press any key to reboot . . .\r\n"
+ .ascii "Press any key to reboot . . .\r\n"
.byte 0
+ebugger_off_msg:


# Kernel attributes; used by setup
diff --git a/arch/i386/boot/tools/build.c b/arch/i386/boot/tools/build.c
index 0579841..2daca93 100644
--- a/arch/i386/boot/tools/build.c
+++ b/arch/i386/boot/tools/build.c
@@ -27,6 +27,11 @@ #include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdarg.h>
+#include <elf.h>
+#include <byteswap.h>
+#define USE_BSD
+#include <endian.h>
+#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
@@ -48,6 +53,10 @@ byte buf[1024];
int fd;
int is_big_kernel;

+#define MAX_PHDRS 100
+static Elf32_Ehdr ehdr;
+static Elf32_Phdr phdr[MAX_PHDRS];
+
void die(const char * str, ...)
{
va_list args;
@@ -57,20 +66,151 @@ void die(const char * str, ...)
exit(1);
}

+#if BYTE_ORDER == LITTLE_ENDIAN
+#define le16_to_cpu(val) (val)
+#define le32_to_cpu(val) (val)
+#endif
+#if BYTE_ORDER == BIG_ENDIAN
+#define le16_to_cpu(val) bswap_16(val)
+#define le32_to_cpu(val) bswap_32(val)
+#endif
+
+static uint16_t elf16_to_cpu(uint16_t val)
+{
+ return le16_to_cpu(val);
+}
+
+static uint32_t elf32_to_cpu(uint32_t val)
+{
+ return le32_to_cpu(val);
+}
+
void file_open(const char *name)
{
if ((fd = open(name, O_RDONLY, 0)) < 0)
die("Unable to open `%s': %m", name);
}

+static void read_ehdr(void)
+{
+ if (read(fd, &ehdr, sizeof(ehdr)) != sizeof(ehdr)) {
+ die("Cannot read ELF header: %s\n",
+ strerror(errno));
+ }
+ if (memcmp(ehdr.e_ident, ELFMAG, 4) != 0) {
+ die("No ELF magic\n");
+ }
+ if (ehdr.e_ident[EI_CLASS] != ELFCLASS32) {
+ die("Not a 32 bit executable\n");
+ }
+ if (ehdr.e_ident[EI_DATA] != ELFDATA2LSB) {
+ die("Not a LSB ELF executable\n");
+ }
+ if (ehdr.e_ident[EI_VERSION] != EV_CURRENT) {
+ die("Unknown ELF version\n");
+ }
+ /* Convert the fields to native endian */
+ ehdr.e_type = elf16_to_cpu(ehdr.e_type);
+ ehdr.e_machine = elf16_to_cpu(ehdr.e_machine);
+ ehdr.e_version = elf32_to_cpu(ehdr.e_version);
+ ehdr.e_entry = elf32_to_cpu(ehdr.e_entry);
+ ehdr.e_phoff = elf32_to_cpu(ehdr.e_phoff);
+ ehdr.e_shoff = elf32_to_cpu(ehdr.e_shoff);
+ ehdr.e_flags = elf32_to_cpu(ehdr.e_flags);
+ ehdr.e_ehsize = elf16_to_cpu(ehdr.e_ehsize);
+ ehdr.e_phentsize = elf16_to_cpu(ehdr.e_phentsize);
+ ehdr.e_phnum = elf16_to_cpu(ehdr.e_phnum);
+ ehdr.e_shentsize = elf16_to_cpu(ehdr.e_shentsize);
+ ehdr.e_shnum = elf16_to_cpu(ehdr.e_shnum);
+ ehdr.e_shstrndx = elf16_to_cpu(ehdr.e_shstrndx);
+
+ if ((ehdr.e_type != ET_EXEC) && (ehdr.e_type != ET_DYN)) {
+ die("Unsupported ELF header type\n");
+ }
+ if (ehdr.e_machine != EM_386) {
+ die("Not for x86\n");
+ }
+ if (ehdr.e_version != EV_CURRENT) {
+ die("Unknown ELF version\n");
+ }
+ if (ehdr.e_ehsize != sizeof(Elf32_Ehdr)) {
+ die("Bad Elf header size\n");
+ }
+ if (ehdr.e_phentsize != sizeof(Elf32_Phdr)) {
+ die("Bad program header entry\n");
+ }
+ if (ehdr.e_shentsize != sizeof(Elf32_Shdr)) {
+ die("Bad section header entry\n");
+ }
+ if (ehdr.e_shstrndx >= ehdr.e_shnum) {
+ die("String table index out of bounds\n");
+ }
+}
+
+static void read_phds(void)
+{
+ int i;
+ size_t size;
+ if (ehdr.e_phnum > MAX_PHDRS) {
+ die("%d program headers supported: %d\n",
+ ehdr.e_phnum, MAX_PHDRS);
+ }
+ if (lseek(fd, ehdr.e_phoff, SEEK_SET) < 0) {
+ die("Seek to %d failed: %s\n",
+ ehdr.e_phoff, strerror(errno));
+ }
+ size = sizeof(phdr[0])*ehdr.e_phnum;
+ if (read(fd, &phdr, size) != size) {
+ die("Cannot read ELF section headers: %s\n",
+ strerror(errno));
+ }
+ for(i = 0; i < ehdr.e_phnum; i++) {
+ phdr[i].p_type = elf32_to_cpu(phdr[i].p_type);
+ phdr[i].p_offset = elf32_to_cpu(phdr[i].p_offset);
+ phdr[i].p_vaddr = elf32_to_cpu(phdr[i].p_vaddr);
+ phdr[i].p_paddr = elf32_to_cpu(phdr[i].p_paddr);
+ phdr[i].p_filesz = elf32_to_cpu(phdr[i].p_filesz);
+ phdr[i].p_memsz = elf32_to_cpu(phdr[i].p_memsz);
+ phdr[i].p_flags = elf32_to_cpu(phdr[i].p_flags);
+ phdr[i].p_align = elf32_to_cpu(phdr[i].p_align);
+ }
+}
+
+unsigned long vmlinux_memsz(void)
+{
+ unsigned long min, max, size;
+ int i;
+ min = 0xffffffff;
+ max = 0;
+ for(i = 0; i < ehdr.e_phnum; i++) {
+ unsigned long start, end;
+ if (phdr[i].p_type != PT_LOAD)
+ continue;
+ start = phdr[i].p_paddr;
+ end = phdr[i].p_paddr + phdr[i].p_memsz;
+ if (start < min)
+ min = start;
+ if (end > max)
+ max = end;
+ }
+ /* Get the reported size by vmlinux */
+ size = max - min;
+ /* Add 128K for the bootmem bitmap */
+ size += 128*1024;
+ /* Add in space for the initial page tables */
+ size = ((size + (((size + 4095) >> 12)*4)) + 4095) & ~4095;
+ return size;
+}
+
void usage(void)
{
- die("Usage: build [-b] bootsect setup system [rootdev] [> image]");
+ die("Usage: build [-b] bootsect setup system rootdev vmlinux [> image]");
}

int main(int argc, char ** argv)
{
unsigned int i, sz, setup_sectors;
+ unsigned kernel_offset, kernel_filesz, kernel_memsz;
int c;
u32 sys_size;
byte major_root, minor_root;
@@ -81,30 +221,25 @@ int main(int argc, char ** argv)
is_big_kernel = 1;
argc--, argv++;
}
- if ((argc < 4) || (argc > 5))
+ if (argc != 6)
usage();
- if (argc > 4) {
- if (!strcmp(argv[4], "CURRENT")) {
- if (stat("/", &sb)) {
- perror("/");
- die("Couldn't stat /");
- }
- major_root = major(sb.st_dev);
- minor_root = minor(sb.st_dev);
- } else if (strcmp(argv[4], "FLOPPY")) {
- if (stat(argv[4], &sb)) {
- perror(argv[4]);
- die("Couldn't stat root device.");
- }
- major_root = major(sb.st_rdev);
- minor_root = minor(sb.st_rdev);
- } else {
- major_root = 0;
- minor_root = 0;
+ if (!strcmp(argv[4], "CURRENT")) {
+ if (stat("/", &sb)) {
+ perror("/");
+ die("Couldn't stat /");
+ }
+ major_root = major(sb.st_dev);
+ minor_root = minor(sb.st_dev);
+ } else if (strcmp(argv[4], "FLOPPY")) {
+ if (stat(argv[4], &sb)) {
+ perror(argv[4]);
+ die("Couldn't stat root device.");
}
+ major_root = major(sb.st_rdev);
+ minor_root = minor(sb.st_rdev);
} else {
- major_root = DEFAULT_MAJOR_ROOT;
- minor_root = DEFAULT_MINOR_ROOT;
+ major_root = 0;
+ minor_root = 0;
}
fprintf(stderr, "Root device is (%d, %d)\n", major_root, minor_root);

@@ -144,10 +279,11 @@ int main(int argc, char ** argv)
i += c;
}

+ kernel_offset = (setup_sectors + 1)*512;
file_open(argv[3]);
if (fstat (fd, &sb))
die("Unable to stat `%s': %m", argv[3]);
- sz = sb.st_size;
+ kernel_filesz = sz = sb.st_size;
fprintf (stderr, "System is %d kB\n", sz/1024);
sys_size = (sz + 15) / 16;
if (!is_big_kernel && sys_size > DEF_SYSSIZE)
@@ -168,7 +304,37 @@ int main(int argc, char ** argv)
}
close(fd);

- if (lseek(1, 497, SEEK_SET) != 497) /* Write sizes to the bootsector */
+ file_open(argv[5]);
+ read_ehdr();
+ read_phds();
+ close(fd);
+ kernel_memsz = vmlinux_memsz();
+
+ if (lseek(1, 84, SEEK_SET) != 84) /* Write sizes to the bootsector */
+ die("Output: seek failed");
+ buf[0] = (kernel_offset >> 0) & 0xff;
+ buf[1] = (kernel_offset >> 8) & 0xff;
+ buf[2] = (kernel_offset >> 16) & 0xff;
+ buf[3] = (kernel_offset >> 24) & 0xff;
+ if (write(1, buf, 4) != 4)
+ die("Write of kernel file offset failed");
+ if (lseek(1, 96, SEEK_SET) != 96)
+ die("Output: seek failed");
+ buf[0] = (kernel_filesz >> 0) & 0xff;
+ buf[1] = (kernel_filesz >> 8) & 0xff;
+ buf[2] = (kernel_filesz >> 16) & 0xff;
+ buf[3] = (kernel_filesz >> 24) & 0xff;
+ if (write(1, buf, 4) != 4)
+ die("Write of kernel file size failed");
+ if (lseek(1, 100, SEEK_SET) != 100)
+ die("Output: seek failed");
+ buf[0] = (kernel_memsz >> 0) & 0xff;
+ buf[1] = (kernel_memsz >> 8) & 0xff;
+ buf[2] = (kernel_memsz >> 16) & 0xff;
+ buf[3] = (kernel_memsz >> 24) & 0xff;
+ if (write(1, buf, 4) != 4)
+ die("Write of kernel memory size failed");
+ if (lseek(1, 497, SEEK_SET) != 497)
die("Output: seek failed");
buf[0] = setup_sectors;
if (write(1, buf, 1) != 1)
diff --git a/include/linux/elf_boot.h b/include/linux/elf_boot.h
new file mode 100644
index 0000000..09301e5
--- /dev/null
+++ b/include/linux/elf_boot.h
@@ -0,0 +1,19 @@
+#ifndef ELF_BOOT_H
+#define ELF_BOOT_H
+
+/* Elf notes to help bootloaders identify what program they are booting.
+ */
+
+/* Standardized Elf image notes for booting... The name for all of these is ELFBoot */
+#define ELF_NOTE_BOOT "ELFBoot"
+
+#define EIN_PROGRAM_NAME 0x00000001
+/* The program in this ELF file */
+#define EIN_PROGRAM_VERSION 0x00000002
+/* The version of the program in this ELF file */
+#define EIN_PROGRAM_CHECKSUM 0x00000003
+/* ip style checksum of the memory image. */
+#define EIN_ARGUMENT_STYLE 0x00000004
+/* String identifying argument passing style */
+
+#endif /* ELF_BOOT_H */
--
1.4.2.rc2.g5209e

2006-08-01 11:12:00

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 5/33] i386 Kconfig: Add a range definition to config PHYSICAL_START

The kernel's physical start address must fall in the first 1GB on i386.
This just adds a small range definition to enforce that.
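
For reference, the upper bound used below works out to 0x37c00000 =
892 MiB; presumably it is chosen to keep the loaded kernel under the
~896 MiB of lowmem that i386's default 3G/1G split can map directly.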

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/i386/Kconfig | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig
index daa75ce..062fa01 100644
--- a/arch/i386/Kconfig
+++ b/arch/i386/Kconfig
@@ -766,6 +766,7 @@ config PHYSICAL_START

default "0x1000000" if CRASH_DUMP
default "0x100000"
+ range 0x100000 0x37c00000
help
This gives the physical address where the kernel is loaded. Normally
for regular kernels this value is 0x100000 (1MB). But in the case
--
1.4.2.rc2.g5209e

2006-08-01 11:12:11

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 17/33] x86_64: Separate normal memory map initialization from the hotplug case

Currently, initialization of the two memory maps is combined into one
set of functions, with if (after_bootmem) tests scattered all over
to handle the semantic differences. Just trying to think about
what is supposed to happen when, and why, makes my head hurt.

In one case we initialize a page but in another we don't because
it has been zeroed by the allocator.

In one case we have to map and unmap pages and in another we
don't because we have a mapping of the pages already.

In one case we care if a page table is partially initialized
and in the other we don't.

It is ugly to reason through and makes maintenance difficult,
because the rules are different in the two cases. So I have
separated these code paths so they can evolve separately. I
think code duplication is the lesser of two evils here.
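
As a schematic of the refactoring (hypothetical condensed code, not
the actual kernel functions), each helper goes from one body serving
both phases to a boot-time version with the test deleted, plus a
late_* copy with the opposite assumptions baked in:

	/* Before: one helper, two semantics. */
	static void *alloc_low_page_mixed(int *index, unsigned long *phys)
	{
		if (after_bootmem)	/* hotplug: real allocator, page */
			return late_alloc(phys);	/* arrives zeroed */
		/* boot: steal a pfn and temporarily map it */
		return boot_alloc(index, phys);
	}

	/* After: alloc_low_page() is __init and unconditionally takes the
	 * boot path; memory hotplug calls a separate late_* family. */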

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/mm/init.c | 147 +++++++++++++++++++++++++++++++++----------------
1 files changed, 98 insertions(+), 49 deletions(-)

diff --git a/arch/x86_64/mm/init.c b/arch/x86_64/mm/init.c
index d14fb2d..0522c1c 100644
--- a/arch/x86_64/mm/init.c
+++ b/arch/x86_64/mm/init.c
@@ -179,19 +179,13 @@ static struct temp_map {
{}
};

-static __meminit void *alloc_low_page(int *index, unsigned long *phys)
+static __init void *alloc_low_page(int *index, unsigned long *phys)
{
struct temp_map *ti;
int i;
unsigned long pfn = table_end++, paddr;
void *adr;

- if (after_bootmem) {
- adr = (void *)get_zeroed_page(GFP_ATOMIC);
- *phys = __pa(adr);
- return adr;
- }
-
if (pfn >= end_pfn)
panic("alloc_low_page: ran out of memory");
for (i = 0; temp_mappings[i].allocated; i++) {
@@ -210,13 +204,10 @@ static __meminit void *alloc_low_page(in
return adr;
}

-static __meminit void unmap_low_page(int i)
+static __init void unmap_low_page(int i)
{
struct temp_map *ti;

- if (after_bootmem)
- return;
-
ti = &temp_mappings[i];
set_pmd(ti->pmd, __pmd(0));
ti->allocated = 0;
@@ -249,7 +240,7 @@ __init void early_iounmap(void *addr, un
__flush_tlb();
}

-static void __meminit
+static void __init
phys_pmd_init(pmd_t *pmd, unsigned long address, unsigned long end)
{
int i;
@@ -258,9 +249,8 @@ phys_pmd_init(pmd_t *pmd, unsigned long
unsigned long entry;

if (address >= end) {
- if (!after_bootmem)
- for (; i < PTRS_PER_PMD; i++, pmd++)
- set_pmd(pmd, __pmd(0));
+ for (; i < PTRS_PER_PMD; i++, pmd++)
+ set_pmd(pmd, __pmd(0));
break;
}
entry = _PAGE_NX|_PAGE_PSE|_KERNPG_TABLE|_PAGE_GLOBAL|address;
@@ -269,30 +259,12 @@ phys_pmd_init(pmd_t *pmd, unsigned long
}
}

-static void __meminit
-phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
-{
- pmd_t *pmd = pmd_offset(pud, (unsigned long)__va(address));
-
- if (pmd_none(*pmd)) {
- spin_lock(&init_mm.page_table_lock);
- phys_pmd_init(pmd, address, end);
- spin_unlock(&init_mm.page_table_lock);
- __flush_tlb_all();
- }
-}
-
-static void __meminit phys_pud_init(pud_t *pud, unsigned long address, unsigned long end)
+static void __init phys_pud_init(pud_t *pud, unsigned long address, unsigned long end)
{
long i = pud_index(address);

pud = pud + i;

- if (after_bootmem && pud_val(*pud)) {
- phys_pmd_update(pud, address, end);
- return;
- }
-
for (; i < PTRS_PER_PUD; pud++, i++) {
int map;
unsigned long paddr, pmd_phys;
@@ -302,16 +274,14 @@ static void __meminit phys_pud_init(pud_
if (paddr >= end)
break;

- if (!after_bootmem && !e820_any_mapped(paddr, paddr+PUD_SIZE, 0)) {
+ if (!e820_any_mapped(paddr, paddr+PUD_SIZE, 0)) {
set_pud(pud, __pud(0));
continue;
}

pmd = alloc_low_page(&map, &pmd_phys);
- spin_lock(&init_mm.page_table_lock);
set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
phys_pmd_init(pmd, paddr, end);
- spin_unlock(&init_mm.page_table_lock);
unmap_low_page(map);
}
__flush_tlb();
@@ -345,7 +315,7 @@ static void __init find_early_table_spac
/* Setup the direct mapping of the physical memory at PAGE_OFFSET.
This runs before bootmem is initialized and gets pages directly from the
physical memory. To access them they are temporarily mapped. */
-void __meminit init_memory_mapping(unsigned long start, unsigned long end)
+void __init init_memory_mapping(unsigned long start, unsigned long end)
{
unsigned long next;

@@ -357,8 +327,7 @@ void __meminit init_memory_mapping(unsig
* mapped. Unfortunately this is done currently before the nodes are
* discovered.
*/
- if (!after_bootmem)
- find_early_table_space(end);
+ find_early_table_space(end);

start = (unsigned long)__va(start);
end = (unsigned long)__va(end);
@@ -369,22 +338,17 @@ void __meminit init_memory_mapping(unsig
pgd_t *pgd = pgd_offset_k(start);
pud_t *pud;

- if (after_bootmem)
- pud = pud_offset(pgd, start & PGDIR_MASK);
- else
- pud = alloc_low_page(&map, &pud_phys);
+ pud = alloc_low_page(&map, &pud_phys);

next = start + PGDIR_SIZE;
if (next > end)
next = end;
phys_pud_init(pud, __pa(start), __pa(next));
- if (!after_bootmem)
- set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
+ set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
unmap_low_page(map);
}

- if (!after_bootmem)
- asm volatile("movq %%cr4,%0" : "=r" (mmu_cr4_features));
+ asm volatile("movq %%cr4,%0" : "=r" (mmu_cr4_features));
__flush_tlb_all();
}

@@ -529,6 +493,91 @@ int memory_add_physaddr_to_nid(u64 start
}
#endif

+static void
+late_phys_pmd_init(pmd_t *pmd, unsigned long address, unsigned long end)
+{
+ int i;
+
+ for (i = 0; i < PTRS_PER_PMD; pmd++, i++, address += PMD_SIZE) {
+ unsigned long entry;
+
+ if (address >= end)
+ break;
+ entry = _PAGE_NX|_PAGE_PSE|_KERNPG_TABLE|_PAGE_GLOBAL|address;
+ entry &= __supported_pte_mask;
+ set_pmd(pmd, __pmd(entry));
+ }
+}
+
+static void
+late_phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
+{
+ pmd_t *pmd = pmd_offset(pud, (unsigned long)__va(address));
+
+ if (pmd_none(*pmd)) {
+ spin_lock(&init_mm.page_table_lock);
+ late_phys_pmd_init(pmd, address, end);
+ spin_unlock(&init_mm.page_table_lock);
+ __flush_tlb_all();
+ }
+}
+
+static void late_phys_pud_init(pud_t *pud, unsigned long address, unsigned long end)
+{
+ long i = pud_index(address);
+
+ pud = pud + i;
+
+ if (pud_val(*pud)) {
+ late_phys_pmd_update(pud, address, end);
+ return;
+ }
+
+ for (; i < PTRS_PER_PUD; pud++, i++) {
+ unsigned long paddr, pmd_phys;
+ pmd_t *pmd;
+
+ paddr = (address & PGDIR_MASK) + i*PUD_SIZE;
+ if (paddr >= end)
+ break;
+
+ pmd = (pmd_t *)get_zeroed_page(GFP_ATOMIC);
+ pmd_phys = __pa(pmd);
+
+ spin_lock(&init_mm.page_table_lock);
+ set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
+ late_phys_pmd_init(pmd, paddr, end);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+}
+
+/* Setup the direct mapping of the physical memory at PAGE_OFFSET.
+ * This runs after bootmem is initialized and gets pages normally.
+ */
+static void late_init_memory_mapping(unsigned long start, unsigned long end)
+{
+ unsigned long next;
+
+ Dprintk("late_init_memory_mapping\n");
+
+ start = (unsigned long)__va(start);
+ end = (unsigned long)__va(end);
+
+ for (; start < end; start = next) {
+ unsigned long pud_phys;
+ pgd_t *pgd = pgd_offset_k(start);
+ pud_t *pud;
+
+ pud = pud_offset(pgd, start & PGDIR_MASK);
+
+ next = start + PGDIR_SIZE;
+ if (next > end)
+ next = end;
+ late_phys_pud_init(pud, __pa(start), __pa(next));
+ }
+ __flush_tlb_all();
+}
+
/*
* Memory is added always to NORMAL zone. This means you will never get
* additional DMA/DMA32 memory.
@@ -545,7 +594,7 @@ int arch_add_memory(int nid, u64 start,
if (ret)
goto error;

- init_memory_mapping(start, (start + size -1));
+ late_init_memory_mapping(start, (start + size -1));

return ret;
error:
--
1.4.2.rc2.g5209e

2006-08-01 11:12:11

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 16/33] x86_64: Assembly safe page.h and pgtable.h

This patch makes pgtable.h and page.h safe to include
in assembly files like head.S, allowing us to use
symbolic constants instead of hard-coded numbers when
referring to the page tables.

This patch copies asm-sparc64/const.h to asm-x86_64 to
get a definition of _AC(), a very convenient macro that
allows us to force the type when we are compiling the
code in C and to drop all of the type information when
we are using the constant in assembly. Previously this
was done with multiple definitions of the same constant.
const.h was modified slightly so that it works when given
CONFIG options as arguments.

This patch adds #ifndef __ASSEMBLY__ ... #endif
and _AC(1,UL) where appropriate so the assembler won't
choke on the header files. Otherwise nothing
should have changed.
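
For example (illustration only), with const.h in place the single definition

	#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)

expands to (1UL << PAGE_SHIFT) when compiled as C and to plain
(1 << PAGE_SHIFT) when the header is preprocessed for assembly.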

Signed-off-by: Eric W. Biederman <[email protected]>
---
include/asm-x86_64/const.h | 20 ++++++++++++++++++++
include/asm-x86_64/page.h | 34 +++++++++++++---------------------
include/asm-x86_64/pgtable.h | 33 +++++++++++++++++++++------------
3 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/include/asm-x86_64/const.h b/include/asm-x86_64/const.h
new file mode 100644
index 0000000..54fb08f
--- /dev/null
+++ b/include/asm-x86_64/const.h
@@ -0,0 +1,20 @@
+/* const.h: Macros for dealing with constants. */
+
+#ifndef _X86_64_CONST_H
+#define _X86_64_CONST_H
+
+/* Some constant macros are used in both assembler and
+ * C code. Therefore we cannot annotate them always with
+ * 'UL' and other type specifiers unilaterally. We
+ * use the following macros to deal with this.
+ */
+
+#ifdef __ASSEMBLY__
+#define _AC(X,Y) X
+#else
+#define __AC(X,Y) (X##Y)
+#define _AC(X,Y) __AC(X,Y)
+#endif
+
+
+#endif /* !(_X86_64_CONST_H) */
diff --git a/include/asm-x86_64/page.h b/include/asm-x86_64/page.h
index 10f3461..f030260 100644
--- a/include/asm-x86_64/page.h
+++ b/include/asm-x86_64/page.h
@@ -1,14 +1,11 @@
#ifndef _X86_64_PAGE_H
#define _X86_64_PAGE_H

+#include <asm/const.h>

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT 12
-#ifdef __ASSEMBLY__
-#define PAGE_SIZE (0x1 << PAGE_SHIFT)
-#else
-#define PAGE_SIZE (1UL << PAGE_SHIFT)
-#endif
+#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
#define PHYSICAL_PAGE_MASK (~(PAGE_SIZE-1) & __PHYSICAL_MASK)

@@ -33,10 +30,10 @@ #define MCE_STACK 5
#define N_EXCEPTION_STACKS 5 /* hw limit: 7 */

#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1))
-#define LARGE_PAGE_SIZE (1UL << PMD_SHIFT)
+#define LARGE_PAGE_SIZE (_AC(1,UL) << PMD_SHIFT)

#define HPAGE_SHIFT PMD_SHIFT
-#define HPAGE_SIZE ((1UL) << HPAGE_SHIFT)
+#define HPAGE_SIZE (_AC(1,UL) << HPAGE_SHIFT)
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)

@@ -76,29 +73,24 @@ #define __pud(x) ((pud_t) { (x) } )
#define __pgd(x) ((pgd_t) { (x) } )
#define __pgprot(x) ((pgprot_t) { (x) } )

-#define __PHYSICAL_START ((unsigned long)CONFIG_PHYSICAL_START)
-#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
-#define __START_KERNEL_map 0xffffffff80000000UL
-#define __PAGE_OFFSET 0xffff810000000000UL
+#endif /* !__ASSEMBLY__ */

-#else
-#define __PHYSICAL_START CONFIG_PHYSICAL_START
+#define __PHYSICAL_START _AC(CONFIG_PHYSICAL_START,UL)
#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
-#define __START_KERNEL_map 0xffffffff80000000
-#define __PAGE_OFFSET 0xffff810000000000
-#endif /* !__ASSEMBLY__ */
+#define __START_KERNEL_map _AC(0xffffffff80000000,UL)
+#define __PAGE_OFFSET _AC(0xffff810000000000,UL)

/* to align the pointer to the (next) page boundary */
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)

/* See Documentation/x86_64/mm.txt for a description of the memory map. */
#define __PHYSICAL_MASK_SHIFT 46
-#define __PHYSICAL_MASK ((1UL << __PHYSICAL_MASK_SHIFT) - 1)
+#define __PHYSICAL_MASK ((_AC(1,UL) << __PHYSICAL_MASK_SHIFT) - 1)
#define __VIRTUAL_MASK_SHIFT 48
-#define __VIRTUAL_MASK ((1UL << __VIRTUAL_MASK_SHIFT) - 1)
+#define __VIRTUAL_MASK ((_AC(1,UL) << __VIRTUAL_MASK_SHIFT) - 1)

-#define KERNEL_TEXT_SIZE (40UL*1024*1024)
-#define KERNEL_TEXT_START 0xffffffff80000000UL
+#define KERNEL_TEXT_SIZE (_AC(40,UL)*1024*1024)
+#define KERNEL_TEXT_START _AC(0xffffffff80000000,UL)

#ifndef __ASSEMBLY__

@@ -106,7 +98,7 @@ #include <asm/bug.h>

#endif /* __ASSEMBLY__ */

-#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)
+#define PAGE_OFFSET __PAGE_OFFSET

/* Note: __pa(&symbol_visible_to_c) should be always replaced with __pa_symbol.
Otherwise you risk miscompilation. */
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index a31ab4e..211a2ca 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -1,6 +1,9 @@
#ifndef _X86_64_PGTABLE_H
#define _X86_64_PGTABLE_H

+#include <asm/const.h>
+#ifndef __ASSEMBLY__
+
/*
* This file contains the functions and defines necessary to modify and use
* the x86-64 page table tree.
@@ -34,6 +37,8 @@ extern unsigned long pgkern_mask;
extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))

+#endif /* !__ASSEMBLY__ */
+
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
@@ -58,6 +63,8 @@ #define PTRS_PER_PMD 512
*/
#define PTRS_PER_PTE 512

+#ifndef __ASSEMBLY__
+
#define pte_ERROR(e) \
printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
#define pmd_ERROR(e) \
@@ -124,22 +131,23 @@ #define pte_same(a, b) ((a).pte == (b).

#define pte_pgprot(a) (__pgprot((a).pte & ~PHYSICAL_PAGE_MASK))

-#define PMD_SIZE (1UL << PMD_SHIFT)
+#endif /* !__ASSEMBLY__ */
+
+#define PMD_SIZE (_AC(1,UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE-1))
-#define PUD_SIZE (1UL << PUD_SHIFT)
+#define PUD_SIZE (_AC(1,UL) << PUD_SHIFT)
#define PUD_MASK (~(PUD_SIZE-1))
-#define PGDIR_SIZE (1UL << PGDIR_SHIFT)
+#define PGDIR_SIZE (_AC(1,UL) << PGDIR_SHIFT)
#define PGDIR_MASK (~(PGDIR_SIZE-1))

#define USER_PTRS_PER_PGD ((TASK_SIZE-1)/PGDIR_SIZE+1)
#define FIRST_USER_ADDRESS 0

-#ifndef __ASSEMBLY__
-#define MAXMEM 0x3fffffffffffUL
-#define VMALLOC_START 0xffffc20000000000UL
-#define VMALLOC_END 0xffffe1ffffffffffUL
-#define MODULES_VADDR 0xffffffff88000000UL
-#define MODULES_END 0xfffffffffff00000UL
+#define MAXMEM _AC(0x3fffffffffff,UL)
+#define VMALLOC_START _AC(0xffffc20000000000,UL)
+#define VMALLOC_END _AC(0xffffe1ffffffffff,UL)
+#define MODULES_VADDR _AC(0xffffffff88000000,UL)
+#define MODULES_END _AC(0xfffffffffff00000,UL)
#define MODULES_LEN (MODULES_END - MODULES_VADDR)

#define _PAGE_BIT_PRESENT 0
@@ -165,7 +173,7 @@ #define _PAGE_FILE 0x040 /* nonlinear fi
#define _PAGE_GLOBAL 0x100 /* Global TLB entry */

#define _PAGE_PROTNONE 0x080 /* If not present */
-#define _PAGE_NX (1UL<<_PAGE_BIT_NX)
+#define _PAGE_NX (_AC(1,UL)<<_PAGE_BIT_NX)

#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
@@ -227,6 +235,8 @@ #define __S101 PAGE_READONLY_EXEC
#define __S110 PAGE_SHARED_EXEC
#define __S111 PAGE_SHARED_EXEC

+#ifndef __ASSEMBLY__
+
static inline unsigned long pgd_bad(pgd_t pgd)
{
unsigned long val = pgd_val(pgd);
@@ -418,8 +428,6 @@ extern spinlock_t pgd_lock;
extern struct page *pgd_list;
void vmalloc_sync_all(void);

-#endif /* !__ASSEMBLY__ */
-
extern int kern_addr_valid(unsigned long addr);

#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
@@ -449,5 +457,6 @@ #define __HAVE_ARCH_PTEP_GET_AND_CLEAR_F
#define __HAVE_ARCH_PTEP_SET_WRPROTECT
#define __HAVE_ARCH_PTE_SAME
#include <asm-generic/pgtable.h>
+#endif /* !__ASSEMBLY__ */

#endif /* _X86_64_PGTABLE_H */
--
1.4.2.rc2.g5209e

2006-08-01 11:13:12

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 7/33] elf: Add ELFOSABI_STANDALONE to elf.h

Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/elf.h | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/elf.h b/include/linux/elf.h
index c5bf043..6fa8d3d 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -338,8 +338,9 @@ #define EV_NONE 0 /* e_version, EI_VER
#define EV_CURRENT 1
#define EV_NUM 2

-#define ELFOSABI_NONE 0
-#define ELFOSABI_LINUX 3
+#define ELFOSABI_NONE 0
+#define ELFOSABI_LINUX 3
+#define ELFOSABI_STANDALONE 255

#ifndef ELF_OSABI
#define ELF_OSABI ELFOSABI_NONE
--
1.4.2.rc2.g5209e

2006-08-01 11:13:12

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 14/33] x86_64: Properly report in /proc/iomem the kernel address

The code assumed that the kernel was always loaded
at 1M in memory. This removes that assumption.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/setup.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
index 8a099ff..11d31ea 100644
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -521,7 +521,7 @@ static void discover_ebda(void)

void __init setup_arch(char **cmdline_p)
{
- unsigned long kernel_end;
+ unsigned long kernel_start, kernel_end;

ROOT_DEV = old_decode_dev(ORIG_ROOT_DEV);
screen_info = SCREEN_INFO;
@@ -596,8 +596,9 @@ #endif
(table_end - table_start) << PAGE_SHIFT);

/* reserve kernel */
+ kernel_start = __pa_symbol(&_text);
kernel_end = round_up(__pa_symbol(&_end),PAGE_SIZE);
- reserve_bootmem_generic(HIGH_MEMORY, kernel_end - HIGH_MEMORY);
+ reserve_bootmem_generic(kernel_start, kernel_end - kernel_start);

/*
* reserve physical page 0 - it's a special BIOS page on many boxes,
--
1.4.2.rc2.g5209e

2006-08-01 11:13:50

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 12/33] x86_64: fixup indentation in e820.c

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/e820.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/e820.c b/arch/x86_64/kernel/e820.c
index e56c2ad..61f029f 100644
--- a/arch/x86_64/kernel/e820.c
+++ b/arch/x86_64/kernel/e820.c
@@ -205,8 +205,8 @@ unsigned long __init e820_end_of_ram(voi
if (start >= end)
continue;
if (ei->type == E820_RAM) {
- if (end > end_pfn<<PAGE_SHIFT)
- end_pfn = end>>PAGE_SHIFT;
+ if (end > end_pfn<<PAGE_SHIFT)
+ end_pfn = end>>PAGE_SHIFT;
} else {
if (end > end_pfn_map<<PAGE_SHIFT)
end_pfn_map = end>>PAGE_SHIFT;
--
1.4.2.rc2.g5209e

2006-08-01 11:14:32

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 20/33] x86_64: fix early_printk to use the standard ISA mapping

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/early_printk.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/early_printk.c b/arch/x86_64/kernel/early_printk.c
index 140051e..d2b4cfb 100644
--- a/arch/x86_64/kernel/early_printk.c
+++ b/arch/x86_64/kernel/early_printk.c
@@ -11,11 +11,10 @@ #include <asm/fcntl.h>

#ifdef __i386__
#include <asm/setup.h>
-#define VGABASE (__ISA_IO_base + 0xb8000)
#else
#include <asm/bootsetup.h>
-#define VGABASE ((void __iomem *)0xffffffff800b8000UL)
#endif
+#define VGABASE (__ISA_IO_base + 0xb8000)

static int max_ypos = 25, max_xpos = 80;
static int current_ypos = 25, current_xpos = 0;
--
1.4.2.rc2.g5209e

2006-08-01 11:13:56

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 33/33] x86_64: Make bzImage a valid 64bit elf executable.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/boot/Makefile | 2
arch/x86_64/boot/bootsect.S | 93 +++++++++++++++-
arch/x86_64/boot/tools/build.c | 232 ++++++++++++++++++++++++++++++++++++----
3 files changed, 297 insertions(+), 30 deletions(-)

diff --git a/arch/x86_64/boot/Makefile b/arch/x86_64/boot/Makefile
index deb063e..80a7492 100644
--- a/arch/x86_64/boot/Makefile
+++ b/arch/x86_64/boot/Makefile
@@ -41,7 +41,7 @@ # --------------------------------------

quiet_cmd_image = BUILD $@
cmd_image = $(obj)/tools/build $(BUILDFLAGS) $(obj)/bootsect $(obj)/setup \
- $(obj)/vmlinux.bin $(ROOT_DEV) > $@
+ $(obj)/vmlinux.bin $(ROOT_DEV) vmlinux > $@

$(obj)/bzImage: $(obj)/bootsect $(obj)/setup \
$(obj)/vmlinux.bin $(obj)/tools/build FORCE
diff --git a/arch/x86_64/boot/bootsect.S b/arch/x86_64/boot/bootsect.S
index 011b7a4..05bd1f3 100644
--- a/arch/x86_64/boot/bootsect.S
+++ b/arch/x86_64/boot/bootsect.S
@@ -13,6 +13,13 @@
*
*/

+#include <linux/version.h>
+#include <linux/utsrelease.h>
+#include <linux/compile.h>
+#include <linux/elf.h>
+#include <linux/elf-em.h>
+#include <linux/elf_boot.h>
+#include <asm/page.h>
#include <asm/boot.h>

SETUPSECTS = 4 /* default nr of setup-sectors */
@@ -42,10 +49,88 @@ #endif

.global _start
_start:
-
+ehdr:
+ # e_ident is carefully crafted so if this is treated
+ # as an x86 bootsector you will execute through
+ # e_ident and then print the bugger off message.
+ # The one store to (bx+di) is unfortunate, but it is
+ # unlikely to affect the ability to print
+ # a message and you aren't supposed to be booting a
+ # bzImage directly from a floppy anyway.
+
+ # e_ident
+ .byte ELFMAG0, ELFMAG1, ELFMAG2, ELFMAG3
+ .byte ELFCLASS64, ELFDATA2LSB, EV_CURRENT, ELFOSABI_STANDALONE
+ .byte 0xeb, 0x3d, 0, 0, 0, 0, 0, 0
+ .word ET_DYN # e_type
+ .word EM_X86_64 # e_machine
+ .int 1 # e_version
+ .quad 0x0000000000000100 # e_entry (startup_64)
+ .quad phdr - _start # e_phoff
+ .quad 0 # e_shoff
+ .int 0 # e_flags
+ .word e_ehdr - ehdr # e_ehsize
+ .word e_phdr1 - phdr # e_phentsize
+ .word (e_phdr - phdr)/(e_phdr1 - phdr) # e_phnum
+ .word 64 # e_shentsize
+ .word 0 # e_shnum
+ .word 0 # e_shstrndx
+e_ehdr:
+
+.org 71
+normalize:
# Normalize the start address
jmpl $BOOTSEG, $start2

+.org 80
+phdr:
+ .int PT_LOAD # p_type
+ .int PF_R | PF_W | PF_X # p_flags
+ .quad (SETUPSECTS+1)*512 # p_offset
+ .quad __START_KERNEL_map # p_vaddr
+ .quad 0x0000000000000000 # p_paddr
+ .quad SYSSIZE*16 # p_filesz
+ .quad 0 # p_memsz
+ .quad 2*1024*1024 # p_align
+e_phdr1:
+
+ .int PT_NOTE # p_type
+ .int 0 # p_flags
+ .quad b_note - _start # p_offset
+ .quad 0 # p_vaddr
+ .quad 0 # p_paddr
+ .quad e_note - b_note # p_filesz
+ .quad 0 # p_memsz
+ .quad 0 # p_align
+e_phdr:
+
+.macro note name, type
+ .balign 4
+ .int 2f - 1f # n_namesz
+ .int 4f - 3f # n_descsz
+ .int \type # n_type
+ .balign 4
+1: .asciz "\name"
+2: .balign 4
+3:
+.endm
+.macro enote
+4: .balign 4
+.endm
+
+ .balign 4
+b_note:
+ note ELF_NOTE_BOOT, EIN_PROGRAM_NAME
+ .asciz "Linux"
+ enote
+ note ELF_NOTE_BOOT, EIN_PROGRAM_VERSION
+ .asciz UTS_RELEASE
+ enote
+ note ELF_NOTE_BOOT, EIN_ARGUMENT_STYLE
+ .asciz "Linux"
+ enote
+e_note:
+
start2:
movw %cs, %ax
movw %ax, %ds
@@ -78,11 +163,11 @@ die:


bugger_off_msg:
- .ascii "Direct booting from floppy is no longer supported.\r\n"
- .ascii "Please use a boot loader program instead.\r\n"
+ .ascii "Booting linux without a boot loader is no longer supported.\r\n"
.ascii "\n"
- .ascii "Remove disk and press any key to reboot . . .\r\n"
+ .ascii "Press any key to reboot . . .\r\n"
.byte 0
+ebugger_off_msg:


# Kernel attributes; used by setup
diff --git a/arch/x86_64/boot/tools/build.c b/arch/x86_64/boot/tools/build.c
index eae8669..fd9bf41 100644
--- a/arch/x86_64/boot/tools/build.c
+++ b/arch/x86_64/boot/tools/build.c
@@ -27,6 +27,11 @@ #include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdarg.h>
+#include <elf.h>
+#include <byteswap.h>
+#define USE_BSD
+#include <endian.h>
+#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
@@ -48,6 +53,10 @@ byte buf[1024];
int fd;
int is_big_kernel;

+#define MAX_PHDRS 100
+static Elf64_Ehdr ehdr;
+static Elf64_Phdr phdr[MAX_PHDRS];
+
void die(const char * str, ...)
{
va_list args;
@@ -57,20 +66,155 @@ void die(const char * str, ...)
exit(1);
}

+#if BYTE_ORDER == LITTLE_ENDIAN
+#define le16_to_cpu(val) (val)
+#define le32_to_cpu(val) (val)
+#define le64_to_cpu(val) (val)
+#endif
+#if BYTE_ORDER == BIG_ENDIAN
+#define le16_to_cpu(val) bswap_16(val)
+#define le32_to_cpu(val) bswap_32(val)
+#define le64_to_cpu(val) bswap_64(val)
+#endif
+
+static uint16_t elf16_to_cpu(uint16_t val)
+{
+ return le16_to_cpu(val);
+}
+
+static uint32_t elf32_to_cpu(uint32_t val)
+{
+ return le32_to_cpu(val);
+}
+
+static uint64_t elf64_to_cpu(uint64_t val)
+{
+ return le64_to_cpu(val);
+}
+
void file_open(const char *name)
{
if ((fd = open(name, O_RDONLY, 0)) < 0)
die("Unable to open `%s': %m", name);
}

+static void read_ehdr(void)
+{
+ if (read(fd, &ehdr, sizeof(ehdr)) != sizeof(ehdr)) {
+ die("Cannot read ELF header: %s\n",
+ strerror(errno));
+ }
+ if (memcmp(ehdr.e_ident, ELFMAG, 4) != 0) {
+ die("No ELF magic\n");
+ }
+ if (ehdr.e_ident[EI_CLASS] != ELFCLASS64) {
+ die("Not a 64 bit executable\n");
+ }
+ if (ehdr.e_ident[EI_DATA] != ELFDATA2LSB) {
+ die("Not a LSB ELF executable\n");
+ }
+ if (ehdr.e_ident[EI_VERSION] != EV_CURRENT) {
+ die("Unknown ELF version\n");
+ }
+ /* Convert the fields to native endian */
+ ehdr.e_type = elf16_to_cpu(ehdr.e_type);
+ ehdr.e_machine = elf16_to_cpu(ehdr.e_machine);
+ ehdr.e_version = elf32_to_cpu(ehdr.e_version);
+ ehdr.e_entry = elf64_to_cpu(ehdr.e_entry);
+ ehdr.e_phoff = elf64_to_cpu(ehdr.e_phoff);
+ ehdr.e_shoff = elf64_to_cpu(ehdr.e_shoff);
+ ehdr.e_flags = elf32_to_cpu(ehdr.e_flags);
+ ehdr.e_ehsize = elf16_to_cpu(ehdr.e_ehsize);
+ ehdr.e_phentsize = elf16_to_cpu(ehdr.e_phentsize);
+ ehdr.e_phnum = elf16_to_cpu(ehdr.e_phnum);
+ ehdr.e_shentsize = elf16_to_cpu(ehdr.e_shentsize);
+ ehdr.e_shnum = elf16_to_cpu(ehdr.e_shnum);
+ ehdr.e_shstrndx = elf16_to_cpu(ehdr.e_shstrndx);
+
+ if ((ehdr.e_type != ET_EXEC) && (ehdr.e_type != ET_DYN)) {
+ die("Unsupported ELF header type\n");
+ }
+ if (ehdr.e_machine != EM_X86_64) {
+ die("Not for x86_64\n");
+ }
+ if (ehdr.e_version != EV_CURRENT) {
+ die("Unknown ELF version\n");
+ }
+ if (ehdr.e_ehsize != sizeof(Elf64_Ehdr)) {
+ die("Bad Elf header size\n");
+ }
+ if (ehdr.e_phentsize != sizeof(Elf64_Phdr)) {
+ die("Bad program header entry\n");
+ }
+ if (ehdr.e_shentsize != sizeof(Elf64_Shdr)) {
+ die("Bad section header entry\n");
+ }
+ if (ehdr.e_shstrndx >= ehdr.e_shnum) {
+ die("String table index out of bounds\n");
+ }
+}
+
+static void read_phds(void)
+{
+ int i;
+ size_t size;
+ if (ehdr.e_phnum > MAX_PHDRS) {
+ die("%d program headers supported: %d\n",
+ ehdr.e_phnum, MAX_PHDRS);
+ }
+ if (lseek(fd, ehdr.e_phoff, SEEK_SET) < 0) {
+ die("Seek to %d failed: %s\n",
+ ehdr.e_phoff, strerror(errno));
+ }
+ size = sizeof(phdr[0])*ehdr.e_phnum;
+ if (read(fd, &phdr, size) != size) {
+ die("Cannot read ELF section headers: %s\n",
+ strerror(errno));
+ }
+ for(i = 0; i < ehdr.e_phnum; i++) {
+ phdr[i].p_type = elf32_to_cpu(phdr[i].p_type);
+ phdr[i].p_flags = elf32_to_cpu(phdr[i].p_flags);
+ phdr[i].p_offset = elf64_to_cpu(phdr[i].p_offset);
+ phdr[i].p_vaddr = elf64_to_cpu(phdr[i].p_vaddr);
+ phdr[i].p_paddr = elf64_to_cpu(phdr[i].p_paddr);
+ phdr[i].p_filesz = elf64_to_cpu(phdr[i].p_filesz);
+ phdr[i].p_memsz = elf64_to_cpu(phdr[i].p_memsz);
+ phdr[i].p_align = elf64_to_cpu(phdr[i].p_align);
+ }
+}
+
+uint64_t vmlinux_memsz(void)
+{
+ uint64_t min, max, size;
+ int i;
+ max = 0;
+ min = ~max;
+ for(i = 0; i < ehdr.e_phnum; i++) {
+ uint64_t start, end;
+ if (phdr[i].p_type != PT_LOAD)
+ continue;
+ start = phdr[i].p_paddr;
+ end = phdr[i].p_paddr + phdr[i].p_memsz;
+ if (start < min)
+ min = start;
+ if (end > max)
+ max = end;
+ }
+ /* Get the reported size by vmlinux */
+ size = max - min;
+ return size;
+}
+
void usage(void)
{
- die("Usage: build [-b] bootsect setup system [rootdev] [> image]");
+ die("Usage: build [-b] bootsect setup system rootdev vmlinux [> image]");
}

int main(int argc, char ** argv)
{
- unsigned int i, c, sz, setup_sectors;
+ unsigned int i, sz, setup_sectors;
+ uint64_t kernel_offset, kernel_filesz, kernel_memsz;
+ int c;
u32 sys_size;
byte major_root, minor_root;
struct stat sb;
@@ -80,30 +224,25 @@ int main(int argc, char ** argv)
is_big_kernel = 1;
argc--, argv++;
}
- if ((argc < 4) || (argc > 5))
+ if (argc != 6)
usage();
- if (argc > 4) {
- if (!strcmp(argv[4], "CURRENT")) {
- if (stat("/", &sb)) {
- perror("/");
- die("Couldn't stat /");
- }
- major_root = major(sb.st_dev);
- minor_root = minor(sb.st_dev);
- } else if (strcmp(argv[4], "FLOPPY")) {
- if (stat(argv[4], &sb)) {
- perror(argv[4]);
- die("Couldn't stat root device.");
- }
- major_root = major(sb.st_rdev);
- minor_root = minor(sb.st_rdev);
- } else {
- major_root = 0;
- minor_root = 0;
+ if (!strcmp(argv[4], "CURRENT")) {
+ if (stat("/", &sb)) {
+ perror("/");
+ die("Couldn't stat /");
+ }
+ major_root = major(sb.st_dev);
+ minor_root = minor(sb.st_dev);
+ } else if (strcmp(argv[4], "FLOPPY")) {
+ if (stat(argv[4], &sb)) {
+ perror(argv[4]);
+ die("Couldn't stat root device.");
}
+ major_root = major(sb.st_rdev);
+ minor_root = minor(sb.st_rdev);
} else {
- major_root = DEFAULT_MAJOR_ROOT;
- minor_root = DEFAULT_MINOR_ROOT;
+ major_root = 0;
+ minor_root = 0;
}
fprintf(stderr, "Root device is (%d, %d)\n", major_root, minor_root);

@@ -143,10 +282,11 @@ int main(int argc, char ** argv)
i += c;
}

+ kernel_offset = (setup_sectors + 1)*512;
file_open(argv[3]);
if (fstat (fd, &sb))
die("Unable to stat `%s': %m", argv[3]);
- sz = sb.st_size;
+ kernel_filesz = sz = sb.st_size;
fprintf (stderr, "System is %d kB\n", sz/1024);
sys_size = (sz + 15) / 16;
if (!is_big_kernel && sys_size > DEF_SYSSIZE)
@@ -167,7 +307,49 @@ int main(int argc, char ** argv)
}
close(fd);

- if (lseek(1, 497, SEEK_SET) != 497) /* Write sizes to the bootsector */
+ file_open(argv[5]);
+ read_ehdr();
+ read_phds();
+ close(fd);
+ kernel_memsz = vmlinux_memsz();
+
+ if (lseek(1, 88, SEEK_SET) != 88) /* Write sizes to the bootsector */
+ die("Output: seek failed");
+ buf[0] = (kernel_offset >> 0) & 0xff;
+ buf[1] = (kernel_offset >> 8) & 0xff;
+ buf[2] = (kernel_offset >> 16) & 0xff;
+ buf[3] = (kernel_offset >> 24) & 0xff;
+ buf[4] = (kernel_offset >> 32) & 0xff;
+ buf[5] = (kernel_offset >> 40) & 0xff;
+ buf[6] = (kernel_offset >> 48) & 0xff;
+ buf[7] = (kernel_offset >> 56) & 0xff;
+ if (write(1, buf, 8) != 8)
+ die("Write of kernel file offset failed");
+ if (lseek(1, 112, SEEK_SET) != 112)
+ die("Output: seek failed");
+ buf[0] = (kernel_filesz >> 0) & 0xff;
+ buf[1] = (kernel_filesz >> 8) & 0xff;
+ buf[2] = (kernel_filesz >> 16) & 0xff;
+ buf[3] = (kernel_filesz >> 24) & 0xff;
+ buf[4] = (kernel_filesz >> 32) & 0xff;
+ buf[5] = (kernel_filesz >> 40) & 0xff;
+ buf[6] = (kernel_filesz >> 48) & 0xff;
+ buf[7] = (kernel_filesz >> 56) & 0xff;
+ if (write(1, buf, 8) != 8)
+ die("Write of kernel file size failed");
+ if (lseek(1, 120, SEEK_SET) != 120)
+ die("Output: seek failed");
+ buf[0] = (kernel_memsz >> 0) & 0xff;
+ buf[1] = (kernel_memsz >> 8) & 0xff;
+ buf[2] = (kernel_memsz >> 16) & 0xff;
+ buf[3] = (kernel_memsz >> 24) & 0xff;
+ buf[4] = (kernel_memsz >> 32) & 0xff;
+ buf[5] = (kernel_memsz >> 40) & 0xff;
+ buf[6] = (kernel_memsz >> 48) & 0xff;
+ buf[7] = (kernel_memsz >> 56) & 0xff;
+ if (write(1, buf, 8) != 8)
+ die("Write of kernel memory size failed");
+ if (lseek(1, 497, SEEK_SET) != 497)
die("Output: seek failed");
buf[0] = setup_sectors;
if (write(1, buf, 1) != 1)
--
1.4.2.rc2.g5209e

2006-08-01 11:14:43

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 31/33] x86_64 boot: Add serial output support to the decompressor

This patch does two very simple things.
It adds a serial output capability to the decompressor.
It adds a command line parser for the earlyprintk
option so we know which output method to use for the decompressor.

This makes debugging the decompressor a little easier, and
keeps us from assuming we always have a vga console on all
hardware.
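
For example (argument values here are only illustrative), the parser below
accepts command lines like:

	earlyprintk=vga
	earlyprintk=serial,ttyS0,9600
	earlyprintk=serial,0x3f8,115200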

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/boot/compressed/Makefile | 2
arch/x86_64/boot/compressed/misc.c | 262 ++++++++++++++++++++++++++++++++--
2 files changed, 248 insertions(+), 16 deletions(-)

diff --git a/arch/x86_64/boot/compressed/Makefile b/arch/x86_64/boot/compressed/Makefile
index f89d96f..8987c97 100644
--- a/arch/x86_64/boot/compressed/Makefile
+++ b/arch/x86_64/boot/compressed/Makefile
@@ -11,7 +11,7 @@ EXTRA_AFLAGS := -traditional -m32

# cannot use EXTRA_CFLAGS because base CFLAGS contains -mkernel which conflicts with
# -m32
-CFLAGS := -m32 -D__KERNEL__ -Iinclude -O2 -fno-strict-aliasing
+CFLAGS := -m32 -D__KERNEL__ -Iinclude -O2 -fno-strict-aliasing -fno-builtin
LDFLAGS := -m elf_i386

LDFLAGS_vmlinux := -Ttext $(IMAGE_OFFSET) -e startup_32 -m elf_i386
diff --git a/arch/x86_64/boot/compressed/misc.c b/arch/x86_64/boot/compressed/misc.c
index 259bb05..2cbd7cb 100644
--- a/arch/x86_64/boot/compressed/misc.c
+++ b/arch/x86_64/boot/compressed/misc.c
@@ -10,8 +10,10 @@
*/

#include <linux/screen_info.h>
+#include <linux/serial_reg.h>
#include <asm/io.h>
#include <asm/page.h>
+#include <asm/setup.h>

/*
* gzip declarations
@@ -76,12 +78,17 @@ static void gzip_release(void **);
* This is set up by the setup-routine at boot-time
*/
static unsigned char *real_mode; /* Pointer to real-mode data */
+static char saved_command_line[COMMAND_LINE_SIZE];

#define RM_EXT_MEM_K (*(unsigned short *)(real_mode + 0x2))
#ifndef STANDARD_MEMORY_BIOS_CALL
#define RM_ALT_MEM_K (*(unsigned long *)(real_mode + 0x1e0))
#endif
#define RM_SCREEN_INFO (*(struct screen_info *)(real_mode+0))
+#define RM_NEW_CL_POINTER ((char *)(unsigned long)(*(unsigned *)(real_mode+0x228)))
+#define RM_OLD_CL_MAGIC (*(unsigned short *)(real_mode + 0x20))
+#define RM_OLD_CL_OFFSET (*(unsigned short *)(real_mode + 0x22))
+#define OLD_CL_MAGIC 0xA33F

extern unsigned char input_data[];
extern int input_len;
@@ -95,8 +102,12 @@ static void free(void *where);

static void *memset(void *s, int c, unsigned n);
static void *memcpy(void *dest, const void *src, unsigned n);
+static int memcmp(const void *s1, const void *s2, unsigned n);
+static size_t strlen(const char *str);
+static char *strstr(const char *haystack, const char *needle);

static void putstr(const char *);
+static unsigned simple_strtou(const char *cp, char **endp, unsigned base);

extern int end;
static long free_mem_ptr = (long)&end;
@@ -110,10 +121,21 @@ static unsigned int low_buffer_end, low_
static int high_loaded =0;
static uch *high_buffer_start /* = (uch *)(((ulg)&end) + HEAP_SIZE)*/;

-static char *vidmem = (char *)0xb8000;
+static char *vidmem;
static int vidport;
static int lines, cols;

+/* The early serial console */
+
+#define DEFAULT_BAUD 9600
+#define DEFAULT_BASE 0x3f8 /* ttyS0 */
+static unsigned serial_base = DEFAULT_BASE;
+
+#define CONSOLE_NOOP 0
+#define CONSOLE_VID 1
+#define CONSOLE_SERIAL 2
+static int console = CONSOLE_NOOP;
+
#include "../../../../lib/inflate.c"

static void *malloc(int size)
@@ -148,7 +170,8 @@ static void gzip_release(void **ptr)
free_mem_ptr = (long) *ptr;
}

-static void scroll(void)
+/* The early video console */
+static void vid_scroll(void)
{
int i;

@@ -157,7 +180,7 @@ static void scroll(void)
vidmem[i] = ' ';
}

-static void putstr(const char *s)
+static void vid_putstr(const char *s)
{
int x,y,pos;
char c;
@@ -169,7 +192,7 @@ static void putstr(const char *s)
if ( c == '\n' ) {
x = 0;
if ( ++y >= lines ) {
- scroll();
+ vid_scroll();
y--;
}
} else {
@@ -177,7 +200,7 @@ static void putstr(const char *s)
if ( ++x >= cols ) {
x = 0;
if ( ++y >= lines ) {
- scroll();
+ vid_scroll();
y--;
}
}
@@ -194,6 +217,178 @@ static void putstr(const char *s)
outb_p(0xff & (pos >> 1), vidport+1);
}

+static void vid_console_init(void)
+{
+ if (RM_SCREEN_INFO.orig_video_mode == 7) {
+ vidmem = (char *) 0xb0000;
+ vidport = 0x3b4;
+ } else {
+ vidmem = (char *) 0xb8000;
+ vidport = 0x3d4;
+ }
+
+ lines = RM_SCREEN_INFO.orig_video_lines;
+ cols = RM_SCREEN_INFO.orig_video_cols;
+}
+
+/* The early serial console */
+static void serial_putc(int ch)
+{
+ if (ch == '\n') {
+ serial_putc('\r');
+ }
+ /* Wait until I can send a byte */
+ while ((inb(serial_base + UART_LSR) & UART_LSR_THRE) == 0)
+ ;
+
+ /* Send the byte */
+ outb(ch, serial_base + UART_TX);
+
+ /* Wait until the byte is transmitted */
+ while (!(inb(serial_base + UART_LSR) & UART_LSR_TEMT))
+ ;
+}
+
+static void serial_putstr(const char *str)
+{
+ int ch;
+ while((ch = *str++) != '\0') {
+ /* no '\n' fixup needed here:
+ * serial_putc() already emits '\r' before '\n'
+ */
+ serial_putc(ch);
+ }
+}
+
+static void serial_console_init(char *s)
+{
+ unsigned base = DEFAULT_BASE;
+ unsigned baud = DEFAULT_BAUD;
+ unsigned divisor;
+ char *e;
+
+ if (*s == ',')
+ ++s;
+ if (*s && (*s != ' ')) {
+ if (memcmp(s, "0x", 2) == 0) {
+ base = simple_strtou(s, &e, 16);
+ } else {
+ static const unsigned bases[] = { 0x3f8, 0x2f8 };
+ unsigned port;
+
+ if (memcmp(s, "ttyS", 4) == 0)
+ s += 4;
+ port = simple_strtou(s, &e, 10);
+ if ((port > 1) || (s == e))
+ port = 0;
+ base = bases[port];
+ }
+ s = e;
+ if (*s == ',')
+ ++s;
+ }
+ if (*s && (*s != ' ')) {
+ baud = simple_strtou(s, &e, 0);
+ if ((baud == 0) || (s == e))
+ baud = DEFAULT_BAUD;
+ }
+ divisor = 115200 / baud;
+ serial_base = base;
+
+ outb(0x00, serial_base + UART_IER); /* no interrupt */
+ outb(0x00, serial_base + UART_FCR); /* no fifo */
+ outb(0x03, serial_base + UART_MCR); /* DTR + RTS */
+
+ /* Set Baud Rate divisor */
+ outb(0x83, serial_base + UART_LCR);
+ outb(divisor & 0xff, serial_base + UART_DLL);
+ outb(divisor >> 8, serial_base + UART_DLM);
+ outb(0x03, serial_base + UART_LCR); /* 8n1 */
+
+}
+
+static void putstr(const char *str)
+{
+ if (console == CONSOLE_VID) {
+ vid_putstr(str);
+ } else if (console == CONSOLE_SERIAL) {
+ serial_putstr(str);
+ }
+}
+
+static void console_init(char *cmdline)
+{
+ cmdline = strstr(cmdline, "earlyprintk=");
+ if (!cmdline)
+ return;
+ cmdline += 12;
+ if (memcmp(cmdline, "vga", 3) == 0) {
+ vid_console_init();
+ console = CONSOLE_VID;
+ } else if (memcmp(cmdline, "serial", 6) == 0) {
+ serial_console_init(cmdline + 6);
+ console = CONSOLE_SERIAL;
+ } else if (memcmp(cmdline, "ttyS", 4) == 0) {
+ serial_console_init(cmdline);
+ console = CONSOLE_SERIAL;
+ }
+}
+
+static inline int tolower(int ch)
+{
+ return ch | 0x20;
+}
+
+static inline int isdigit(int ch)
+{
+ return (ch >= '0') && (ch <= '9');
+}
+
+static inline int isxdigit(int ch)
+{
+ ch = tolower(ch);
+ return isdigit(ch) || ((ch >= 'a') && (ch <= 'f'));
+}
+
+
+static inline int digval(int ch)
+{
+ return isdigit(ch)? (ch - '0') : tolower(ch) - 'a' + 10;
+}
+
+/**
+ * simple_strtou - convert a string to an unsigned
+ * @cp: The start of the string
+ * @endp: A pointer to the end of the parsed string will be placed here
+ * @base: The number base to use
+ */
+static unsigned simple_strtou(const char *cp, char **endp, unsigned base)
+{
+ unsigned result = 0,value;
+
+ if (!base) {
+ base = 10;
+ if (*cp == '0') {
+ base = 8;
+ cp++;
+ if ((tolower(*cp) == 'x') && isxdigit(cp[1])) {
+ cp++;
+ base = 16;
+ }
+ }
+ } else if (base == 16) {
+ if (cp[0] == '0' && tolower(cp[1]) == 'x')
+ cp += 2;
+ }
+ while (isxdigit(*cp) && ((value = digval(*cp)) < base)) {
+ result = result*base + value;
+ cp++;
+ }
+ if (endp)
+ *endp = (char *)cp;
+ return result;
+}
+
static void* memset(void* s, int c, unsigned n)
{
int i;
@@ -212,6 +407,37 @@ static void* memcpy(void* dest, const vo
return dest;
}

+static int memcmp(const void *s1, const void *s2, unsigned n)
+{
+ const unsigned char *str1 = s1, *str2 = s2;
+ size_t i;
+ int result = 0;
+ for(i = 0; (result == 0) && (i < n); i++) {
+ result = *str1++ - *str2++;
+ }
+ return result;
+}
+
+static size_t strlen(const char *str)
+{
+ size_t len = 0;
+ while (*str++)
+ len++;
+ return len;
+}
+
+static char *strstr(const char *haystack, const char *needle)
+{
+ size_t len;
+ len = strlen(needle);
+ while(*haystack) {
+ if (memcmp(haystack, needle, len) == 0)
+ return (char *)haystack;
+ haystack++;
+ }
+ return NULL;
+}
+
/* ===========================================================================
* Fill the input buffer. This is called only when the buffer is empty
* and at least one byte is really needed.
@@ -331,20 +557,26 @@ static void close_output_buffer_if_we_ru
}
}

+static void save_command_line(void)
+{
+ /* Find the command line */
+ char *cmdline;
+ cmdline = saved_command_line;
+ if (RM_NEW_CL_POINTER) {
+ cmdline = RM_NEW_CL_POINTER;
+ } else if (OLD_CL_MAGIC == RM_OLD_CL_MAGIC) {
+ cmdline = real_mode + RM_OLD_CL_OFFSET;
+ }
+ memcpy(saved_command_line, cmdline, COMMAND_LINE_SIZE);
+ saved_command_line[COMMAND_LINE_SIZE - 1] = '\0';
+}
+
int decompress_kernel(struct moveparams *mv, void *rmode)
{
real_mode = rmode;

- if (RM_SCREEN_INFO.orig_video_mode == 7) {
- vidmem = (char *) 0xb0000;
- vidport = 0x3b4;
- } else {
- vidmem = (char *) 0xb8000;
- vidport = 0x3d4;
- }
-
- lines = RM_SCREEN_INFO.orig_video_lines;
- cols = RM_SCREEN_INFO.orig_video_cols;
+ save_command_line();
+ console_init(saved_command_line);

if (free_mem_ptr < 0x100000) setup_normal_output_buffer();
else setup_output_buffer_if_we_run_high(mv);
--
1.4.2.rc2.g5209e

2006-08-01 11:15:36

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 15/33] x86_64: Fix kernel direct mapping size check

Instead of using the physical address of _end, which has nothing to
do with knowing if the kernel fits within its reserved virtual
addresses, verify that the virtual address of _end fits within
the kernel virtual address mapping.
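
Concretely, with the constants used elsewhere in this series (shown for
illustration):

	__START_KERNEL_map = 0xffffffff80000000
	KERNEL_TEXT_SIZE   = 40*1024*1024 = 0x2800000

so the check requires the virtual address of _end to stay below
0xffffffff82800000.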

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/head64.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86_64/kernel/head64.c b/arch/x86_64/kernel/head64.c
index 36647ce..454498c 100644
--- a/arch/x86_64/kernel/head64.c
+++ b/arch/x86_64/kernel/head64.c
@@ -116,7 +116,7 @@ #ifdef CONFIG_X86_IO_APIC
disable_apic = 1;
#endif
/* You need early console to see that */
- if (__pa_symbol(&_end) >= KERNEL_TEXT_SIZE)
+ if (((unsigned long)&_end) >= (__START_KERNEL_map + KERNEL_TEXT_SIZE))
panic("Kernel too big for kernel mapping\n");

setup_boot_cpu_data();
--
1.4.2.rc2.g5209e

2006-08-01 11:15:37

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 13/33] x86_64: Remove assumptions about the kernel start address from e820/bad_addr()

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/e820.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/e820.c b/arch/x86_64/kernel/e820.c
index 61f029f..56dd525 100644
--- a/arch/x86_64/kernel/e820.c
+++ b/arch/x86_64/kernel/e820.c
@@ -69,9 +69,15 @@ #ifdef CONFIG_BLK_DEV_INITRD
return 1;
}
#endif
- /* kernel code + 640k memory hole (later should not be needed, but
+ /* 640k memory hole (later should not be needed, but
be paranoid for now) */
- if (last >= 640*1024 && addr < __pa_symbol(&_end)) {
+ if (last >= 640*1024 && addr < HIGH_MEMORY) {
+ *addrp = HIGH_MEMORY;
+ return 1;
+ }
+
+ /* kernel code */
+ if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) {
*addrp = __pa_symbol(&_end);
return 1;
}
--
1.4.2.rc2.g5209e

2006-08-01 11:16:20

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 24/33] x86_64: Add EFER to the set registers saved by save_processor_state

EFER varies like %cr4 depending on the cpu capabilities, and which cpu
capabilities we want to make use of. So save/restore it to make certain
we have the same EFER value when we are done.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/suspend.c | 3 ++-
include/asm-x86_64/suspend.h | 1 +
2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86_64/kernel/suspend.c b/arch/x86_64/kernel/suspend.c
index 91f7e67..fe865ea 100644
--- a/arch/x86_64/kernel/suspend.c
+++ b/arch/x86_64/kernel/suspend.c
@@ -33,7 +33,6 @@ void __save_processor_state(struct saved
asm volatile ("str %0" : "=m" (ctxt->tr));

/* XMM0..XMM15 should be handled by kernel_fpu_begin(). */
- /* EFER should be constant for kernel version, no need to handle it. */
/*
* segment registers
*/
@@ -50,6 +49,7 @@ void __save_processor_state(struct saved
/*
* control registers
*/
+ rdmsrl(MSR_EFER, ctxt->efer);
asm volatile ("movq %%cr0, %0" : "=r" (ctxt->cr0));
asm volatile ("movq %%cr2, %0" : "=r" (ctxt->cr2));
asm volatile ("movq %%cr3, %0" : "=r" (ctxt->cr3));
@@ -75,6 +75,7 @@ void __restore_processor_state(struct sa
/*
* control registers
*/
+ wrmsrl(MSR_EFER, ctxt->efer);
asm volatile ("movq %0, %%cr8" :: "r" (ctxt->cr8));
asm volatile ("movq %0, %%cr4" :: "r" (ctxt->cr4));
asm volatile ("movq %0, %%cr3" :: "r" (ctxt->cr3));
diff --git a/include/asm-x86_64/suspend.h b/include/asm-x86_64/suspend.h
index bc7f817..a42306c 100644
--- a/include/asm-x86_64/suspend.h
+++ b/include/asm-x86_64/suspend.h
@@ -17,6 +17,7 @@ struct saved_context {
u16 ds, es, fs, gs, ss;
unsigned long gs_base, gs_kernel_base, fs_base;
unsigned long cr0, cr2, cr3, cr4, cr8;
+ unsigned long efer;
u16 gdt_pad;
u16 gdt_limit;
unsigned long gdt_base;
--
1.4.2.rc2.g5209e

2006-08-01 11:17:05

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 1/33] i386: vmlinux.lds.S Distinguish absolute symbols

Ld knows about 2 kinds of symbols, absolute and section
relative. Section relative symbols change value
when a section is moved and absolute symbols do not.

Currently in the linker script we have several labels
marking the beginning and ending of sections that
are outside of sections, making them absolute symbols.
Having a mixture of absolute and section relative
symbols referring to the same data is currently harmless
but it is confusing.
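
A minimal sketch of the difference (not from the patch):

	SECTIONS {
		_abs = .;		/* defined outside any output section: absolute */
		.text : {
			_rel = .;	/* defined inside a section: section relative */
			*(.text)
		}
	}

If .text is moved, _rel moves with it while _abs keeps its old value.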

This must be done carefully as newer revs of ld do not place
symbols that appear in sections without data and instead
ld makes those symbols global :(

My ultimate goal is to build a relocatable kernel. The
safest and least intrusive technique is to generate
relocation entries so the kernel can be relocated at load
time. The only penalty would be an increase in the size
of the kernel binary. The problem is that if absolute and
relocatable symbols are not properly specified absolute symbols
will be relocated or section relative symbols won't be, which
is fatal.

The practical motivation is that when generating kernels that
will run from a reserved area for analyzing what caused
a kernel panic, it is simpler if you don't need to hard code
the physical memory location they will run at, especially
for the distributions.

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/i386/kernel/vmlinux.lds.S | 111 ++++++++++++++++++++-----------------
include/asm-generic/vmlinux.lds.h | 2 -
2 files changed, 60 insertions(+), 53 deletions(-)

diff --git a/arch/i386/kernel/vmlinux.lds.S b/arch/i386/kernel/vmlinux.lds.S
index 2d4f138..db0833b 100644
--- a/arch/i386/kernel/vmlinux.lds.S
+++ b/arch/i386/kernel/vmlinux.lds.S
@@ -18,43 +18,46 @@ SECTIONS
. = __KERNEL_START;
phys_startup_32 = startup_32 - LOAD_OFFSET;
/* read-only */
- _text = .; /* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
+ _text = .; /* Text and read-only data */
*(.text)
SCHED_TEXT
LOCK_TEXT
KPROBES_TEXT
*(.fixup)
*(.gnu.warning)
- } = 0x9090
-
- _etext = .; /* End of text section */
+ _etext = .; /* End of text section */
+ } = 0x9090

. = ALIGN(16); /* Exception table */
- __start___ex_table = .;
- __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { *(__ex_table) }
- __stop___ex_table = .;
+ __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
+ __start___ex_table = .;
+ *(__ex_table)
+ __stop___ex_table = .;
+ }

RODATA

. = ALIGN(4);
- __tracedata_start = .;
.tracedata : AT(ADDR(.tracedata) - LOAD_OFFSET) {
+ __tracedata_start = .;
*(.tracedata)
+ __tracedata_end = .;
}
- __tracedata_end = .;

/* writeable */
.data : AT(ADDR(.data) - LOAD_OFFSET) { /* Data */
*(.data)
CONSTRUCTORS
- }
+ }

. = ALIGN(4096);
- __nosave_begin = .;
- .data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) { *(.data.nosave) }
- . = ALIGN(4096);
- __nosave_end = .;
+ .data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) {
+ __nosave_begin = .;
+ *(.data.nosave)
+ . = ALIGN(4096);
+ __nosave_end = .;
+ }

. = ALIGN(4096);
.data.page_aligned : AT(ADDR(.data.page_aligned) - LOAD_OFFSET) {
@@ -68,8 +71,10 @@ SECTIONS

/* rarely changed data like cpu maps */
. = ALIGN(32);
- .data.read_mostly : AT(ADDR(.data.read_mostly) - LOAD_OFFSET) { *(.data.read_mostly) }
- _edata = .; /* End of data section */
+ .data.read_mostly : AT(ADDR(.data.read_mostly) - LOAD_OFFSET) {
+ *(.data.read_mostly)
+ _edata = .; /* End of data section */
+ }

#ifdef CONFIG_STACK_UNWIND
. = ALIGN(4);
@@ -87,39 +92,41 @@ #endif

/* might get freed after init */
. = ALIGN(4096);
- __smp_alt_begin = .;
- __smp_alt_instructions = .;
.smp_altinstructions : AT(ADDR(.smp_altinstructions) - LOAD_OFFSET) {
+ __smp_alt_begin = .;
+ __smp_alt_instructions = .;
*(.smp_altinstructions)
+ __smp_alt_instructions_end = .;
}
- __smp_alt_instructions_end = .;
. = ALIGN(4);
- __smp_locks = .;
.smp_locks : AT(ADDR(.smp_locks) - LOAD_OFFSET) {
+ __smp_locks = .;
*(.smp_locks)
+ __smp_locks_end = .;
}
- __smp_locks_end = .;
.smp_altinstr_replacement : AT(ADDR(.smp_altinstr_replacement) - LOAD_OFFSET) {
*(.smp_altinstr_replacement)
+ . = ALIGN(4096);
+ __smp_alt_end = .;
}
- . = ALIGN(4096);
- __smp_alt_end = .;

/* will be freed after init */
. = ALIGN(4096); /* Init code and data */
- __init_begin = .;
.init.text : AT(ADDR(.init.text) - LOAD_OFFSET) {
+ __init_begin = .;
_sinittext = .;
*(.init.text)
_einittext = .;
}
.init.data : AT(ADDR(.init.data) - LOAD_OFFSET) { *(.init.data) }
. = ALIGN(16);
- __setup_start = .;
- .init.setup : AT(ADDR(.init.setup) - LOAD_OFFSET) { *(.init.setup) }
- __setup_end = .;
- __initcall_start = .;
+ .init.setup : AT(ADDR(.init.setup) - LOAD_OFFSET) {
+ __setup_start = .;
+ *(.init.setup)
+ __setup_end = .;
+ }
.initcall.init : AT(ADDR(.initcall.init) - LOAD_OFFSET) {
+ __initcall_start = .;
*(.initcall1.init)
*(.initcall2.init)
*(.initcall3.init)
@@ -127,20 +134,20 @@ #endif
*(.initcall5.init)
*(.initcall6.init)
*(.initcall7.init)
+ __initcall_end = .;
}
- __initcall_end = .;
- __con_initcall_start = .;
.con_initcall.init : AT(ADDR(.con_initcall.init) - LOAD_OFFSET) {
+ __con_initcall_start = .;
*(.con_initcall.init)
+ __con_initcall_end = .;
}
- __con_initcall_end = .;
SECURITY_INIT
. = ALIGN(4);
- __alt_instructions = .;
.altinstructions : AT(ADDR(.altinstructions) - LOAD_OFFSET) {
+ __alt_instructions = .;
*(.altinstructions)
+ __alt_instructions_end = .;
}
- __alt_instructions_end = .;
.altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
*(.altinstr_replacement)
}
@@ -149,32 +156,32 @@ #endif
.exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { *(.exit.text) }
.exit.data : AT(ADDR(.exit.data) - LOAD_OFFSET) { *(.exit.data) }
. = ALIGN(4096);
- __initramfs_start = .;
- .init.ramfs : AT(ADDR(.init.ramfs) - LOAD_OFFSET) { *(.init.ramfs) }
- __initramfs_end = .;
+ .init.ramfs : AT(ADDR(.init.ramfs) - LOAD_OFFSET) {
+ __initramfs_start = .;
+ *(.init.ramfs)
+ __initramfs_end = .;
+ }
. = ALIGN(L1_CACHE_BYTES);
- __per_cpu_start = .;
- .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { *(.data.percpu) }
- __per_cpu_end = .;
+ .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) {
+ __per_cpu_start = .;
+ *(.data.percpu)
+ __per_cpu_end = .;
+ }
. = ALIGN(4096);
- __init_end = .;
/* freed after init ends here */

- __bss_start = .; /* BSS */
- .bss.page_aligned : AT(ADDR(.bss.page_aligned) - LOAD_OFFSET) {
- *(.bss.page_aligned)
- }
.bss : AT(ADDR(.bss) - LOAD_OFFSET) {
+ __init_end = .;
+ __bss_start = .; /* BSS */
+ *(.bss.page_aligned)
*(.bss)
+ . = ALIGN(4);
+ __bss_stop = .;
+ _end = . ;
+ /* This is where the kernel creates the early boot page tables */
+ . = ALIGN(4096);
+ pg0 = . ;
}
- . = ALIGN(4);
- __bss_stop = .;
-
- _end = . ;
-
- /* This is where the kernel creates the early boot page tables */
- . = ALIGN(4096);
- pg0 = .;

/* Sections to be discarded */
/DISCARD/ : {
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index db5a373..7cd3c22 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -11,8 +11,8 @@ #define ALIGN_FUNCTION() . = ALIGN(8)

#define RODATA \
. = ALIGN(4096); \
- __start_rodata = .; \
.rodata : AT(ADDR(.rodata) - LOAD_OFFSET) { \
+ VMLINUX_SYMBOL(__start_rodata) = .; \
*(.rodata) *(.rodata.*) \
*(__vermagic) /* Kernel version magic */ \
} \
--
1.4.2.rc2.g5209e

2006-08-01 11:16:51

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 23/33] x86_64: cleanup segments

Move __KERNEL32_CS up into the unused gdt entry. __KERNEL32_CS is
used when entering the kernel so putting it first is useful when
trying to keep boot gdt sizes to a minimum.

Set the accessed bit on all gdt entries. We don't care
so there is no need for the cpu to burn the extra cycles,
and it potentially allows the pages to be immutable. Plus
it is confusing when debugging and your gdt entries mysteriously
change.
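
For illustration (not part of the patch): the accessed flag is bit 0 of the
descriptor type field, which is why the type bytes in the table change from
0x9a/0x92/0xfa/0xf2 to 0x9b/0x93/0xfb/0xf3, e.g.:

	.quad 0x00af9a000000ffff	/* __KERNEL_CS, accessed bit clear */
	.quad 0x00af9b000000ffff	/* __KERNEL_CS, accessed bit set */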

Signed-off-by: Eric W. Biederman <[email protected]>
---
arch/x86_64/kernel/head.S | 14 +++++++-------
include/asm-x86_64/segment.h | 4 ++--
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86_64/kernel/head.S b/arch/x86_64/kernel/head.S
index a9e34d9..d0e626e 100644
--- a/arch/x86_64/kernel/head.S
+++ b/arch/x86_64/kernel/head.S
@@ -352,16 +352,16 @@ #endif

ENTRY(cpu_gdt_table)
.quad 0x0000000000000000 /* NULL descriptor */
+ .quad 0x00cf9b000000ffff /* __KERNEL32_CS */
+ .quad 0x00af9b000000ffff /* __KERNEL_CS */
+ .quad 0x00cf93000000ffff /* __KERNEL_DS */
+ .quad 0x00cffb000000ffff /* __USER32_CS */
+ .quad 0x00cff3000000ffff /* __USER_DS, __USER32_DS */
+ .quad 0x00affb000000ffff /* __USER_CS */
.quad 0x0 /* unused */
- .quad 0x00af9a000000ffff /* __KERNEL_CS */
- .quad 0x00cf92000000ffff /* __KERNEL_DS */
- .quad 0x00cffa000000ffff /* __USER32_CS */
- .quad 0x00cff2000000ffff /* __USER_DS, __USER32_DS */
- .quad 0x00affa000000ffff /* __USER_CS */
- .quad 0x00cf9a000000ffff /* __KERNEL32_CS */
.quad 0,0 /* TSS */
.quad 0,0 /* LDT */
- .quad 0,0,0 /* three TLS descriptors */
+ .quad 0,0,0 /* three TLS descriptors */
.quad 0 /* unused */
gdt_end:
/* asm/segment.h:GDT_ENTRIES must match this */
diff --git a/include/asm-x86_64/segment.h b/include/asm-x86_64/segment.h
index d4bed33..58d6715 100644
--- a/include/asm-x86_64/segment.h
+++ b/include/asm-x86_64/segment.h
@@ -6,7 +6,7 @@ #include <asm/cache.h>
#define __KERNEL_CS 0x10
#define __KERNEL_DS 0x18

-#define __KERNEL32_CS 0x38
+#define __KERNEL32_CS 0x08

/*
* we cannot use the same code segment descriptor for user and kernel
@@ -20,7 +20,7 @@ #define __USER_DS 0x2b /* 5*8+3 */
#define __USER_CS 0x33 /* 6*8+3 */
#define __USER32_DS __USER_DS

-#define GDT_ENTRY_TLS 1
+#define GDT_ENTRY_TLS 7
#define GDT_ENTRY_TSS 8 /* needs two entries */
#define GDT_ENTRY_LDT 10 /* needs two entries */
#define GDT_ENTRY_TLS_MIN 12
--
1.4.2.rc2.g5209e

2006-08-01 11:36:50

by Paulo Marques

[permalink] [raw]
Subject: Re: [PATCH 8/33] kallsyms.c: Generate relocatable symbols.

Eric W. Biederman wrote:
> Print the addresses of non-absolute symbols relative to _text
> so that ld will generate relocations. Allowing a relocatable
> kernel to relocate them. We can't actually use the symbol names
> because kallsyms includes static symbols that are not exported
> from their object files.
>
> [...]
> output_label("kallsyms_addresses");
> for (i = 0; i < table_cnt; i++) {
> - printf("\tPTR\t%#llx\n", table[i].addr);
> + if (toupper(table[i].sym[0]) != 'A') {
> + printf("\tPTR\t_text + %#llx\n",
> + table[i].addr - _text);
> + } else {
> + printf("\tPTR\t%#llx\n", table[i].addr);
> + }

Doesn't this break kallsyms for almost everyone?

kallsyms addresses aren't used just for displaying, but also to find
symbols from their addresses (from the stack trace, etc.).

Am I missing something?

--
Paulo Marques - http://www.grupopie.com

"The face of a child can say it all, especially the
mouth part of the face."

2006-08-01 11:54:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 8/33] kallsyms.c: Generate relocatable symbols.

Paulo Marques <[email protected]> writes:

> Eric W. Biederman wrote:
>> Print the addresses of non-absolute symbols relative to _text
>> so that ld will generate relocations. Allowing a relocatable
>> kernel to relocate them. We can't actually use the symbol names
>> because kallsyms includes static symbols that are not exported
>> from their object files.
>> [...]
>> output_label("kallsyms_addresses");
>> for (i = 0; i < table_cnt; i++) {
>> - printf("\tPTR\t%#llx\n", table[i].addr);
>> + if (toupper(table[i].sym[0]) != 'A') {
>> + printf("\tPTR\t_text + %#llx\n",
>> + table[i].addr - _text);
>> + } else {
>> + printf("\tPTR\t%#llx\n", table[i].addr);
>> + }
>
> Doesn't this break kallsyms for almost everyone?
>
> kallsyms addresses aren't used just for displaying, but also to find symbols
> from their addresses (from the stack trace, etc.).
>
> Am I missing something?

Yes, you are missing something. This fixes the addresses in the table.

All this does is put the same values in kallsyms that we have now,
but it creates relocations for them. So on a kernel where we process
relocations before loading (because we are running the kernel at a
different virtual address), the processing of the relocations will
fix kallsyms to match the running kernel.

If we don't do this we will have the problems you are worried about.
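
To illustrate with a sketch of the generated kallsyms.S (assuming PTR is
.quad on x86_64; the offset here is made up):

	kallsyms_addresses:
		.quad	_text + 0x52a40		/* section relative: ld emits a relocation */
		.quad	0xffffffff80000000	/* absolute symbol: left untouched */

When the relocations are processed the first entry follows _text
automatically, so the table matches the running kernel.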

Of course I would be overjoyed if you could point out a bug like
you are worried about so I could fix it :)

Eric

2006-08-01 13:32:58

by Mika Penttilä

[permalink] [raw]
Subject: Re: [PATCH 10/33] i386: Relocatable kernel support.


> @@ -1,9 +1,10 @@
> SECTIONS
> {
> - .data : {
> + .data.compressed : {
> input_len = .;
> LONG(input_data_end - input_data) input_data = .;
> *(.data)
> + output_len = . - 4;
> input_data_end = .;
> }
> }
>
I don't see how you are getting the uncompressed length from output_len...

--Mika

2006-08-01 18:08:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 10/33] i386: Relocatable kernel support.

Mika Penttilä <[email protected]> writes:

>> @@ -1,9 +1,10 @@
>> SECTIONS
>> {
>> - .data : {
>> + .data.compressed : {
>> input_len = .;
>> LONG(input_data_end - input_data) input_data = .;
>> *(.data)
>> + output_len = . - 4;
>> input_data_end = .;
>> }
>> }
>>
> I don't see how you are getting the uncompressed length from output_len...

It's part of the gzip format. It places the length at the end of
the compressed data. I am just computing the address of where gzip
put the length and putting a variable there called output_len.
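
For illustration only: the gzip trailer ends with ISIZE, the uncompressed
length modulo 2^32 stored little-endian, so the last 4 bytes can be read
back along these lines:

	#include <stdint.h>

	static uint32_t gzip_isize(const unsigned char *blob, long len)
	{
		/* assumes len >= 18 and a single gzip member */
		return blob[len-4] | (blob[len-3] << 8) |
		       (blob[len-2] << 16) | ((uint32_t)blob[len-1] << 24);
	}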

Isn't linker script magic wonderful :)

Eric

2006-08-01 18:11:31

by Sam Ravnborg

[permalink] [raw]
Subject: Re: [PATCH 10/33] i386: Relocatable kernel support.

On Tue, Aug 01, 2006 at 12:07:02PM -0600, Eric W. Biederman wrote:
> Mika Penttilä <[email protected]> writes:
>
> >> @@ -1,9 +1,10 @@
> >> SECTIONS
> >> {
> >> - .data : {
> >> + .data.compressed : {
> >> input_len = .;
> >> LONG(input_data_end - input_data) input_data = .;
> >> *(.data)
> >> + output_len = . - 4;
> >> input_data_end = .;
> >> }
> >> }
> >>
> > I don't see how you are getting the uncompressed length from output_len...
>
> It's part of the gzip format. It places the length at the end of
> the compressed data. I am just computing the address of where gzip
> put the length and putting a variable there called output_len.
>
> Isn't linker script magic wonderful :)
A comment would be appreciated.

Sam

2006-08-01 18:14:46

by Mika Penttilä

[permalink] [raw]
Subject: Re: [PATCH 10/33] i386: Relocatable kernel support.

Eric W. Biederman wrote:
> Mika Penttilä <[email protected]> writes:
>
>
>>> @@ -1,9 +1,10 @@
>>> SECTIONS
>>> {
>>> - .data : {
>>> + .data.compressed : {
>>> input_len = .;
>>> LONG(input_data_end - input_data) input_data = .;
>>> *(.data)
>>> + output_len = . - 4;
>>> input_data_end = .;
>>> }
>>> }
>>>
>>>
>> I don't see how you are getting the uncompressed length from output_len...
>>
>
> It's part of the gzip format. It places the length at the end of
> the compressed data. I am just computing the address of where gzip
> put the length and putting a variable there called output_len.
>
> Isn't linker script magic wonderful :)
>
> Eric
> -
>
Huh, quite a nice trick indeed!

Thanks,
Mika

2006-08-01 18:59:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 22/33] x86_64: Fix gdt table size in trampoline.S


Makes sense, thanks. I queued the patch.

-Andi

2006-08-01 19:02:23

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/33] x86_64: Kill temp_boot_pmds

"Eric W. Biederman" <[email protected]> writes:
>
> I also modify the early page table initialization code
> to use early_ioremap and early_iounmap, instead of the
> special case version of those functions that they are
> now calling.

Ok valuable cleanup. I queued that one too.

> The only really silly part left with init_memory_mapping
> is that find_early_table_space always finds pages below 1M.

I fixed this some time ago - obsolete comment?

-Andi

2006-08-01 19:04:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/33] x86_64: Kill temp_boot_pmds II

"Eric W. Biederman" <[email protected]> writes:
>
> I also modify the early page table initialization code
> to use early_ioremap and early_iounmap, instead of the
> special case version of those functions that they are
> now calling.

Or rather I tried to apply it - it doesn't apply at all
on its own:

patching file arch/x86_64/mm/init.c
Hunk #1 FAILED at 167.
Hunk #2 succeeded at 274 with fuzz 1 (offset 28 lines).
Hunk #3 FAILED at 286.
Hunk #4 FAILED at 341.
3 out of 4 hunks FAILED -- rejects in file arch/x86_64/mm/init.c

-Andi

2006-08-01 19:06:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2/33] i386: define __pa_symbol

"Eric W. Biederman" <[email protected]> writes:

> On x86_64 we have to be careful with calculating the physical
> address of kernel symbols, both because of compiler oddities
> and because the symbols live in a different range of the virtual
> address space.
>
> Having a definition of __pa_symbol that works on both x86_64 and
> i386 simplifies writing code that works for both x86_64 and
> i386 that has these kinds of dependencies.
>
> So this patch adds the trivial i386 __pa_symbol definition.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> include/asm-i386/page.h | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/include/asm-i386/page.h b/include/asm-i386/page.h
> index f5bf544..eceb7f5 100644
> --- a/include/asm-i386/page.h
> +++ b/include/asm-i386/page.h
> @@ -124,6 +124,7 @@ #define PAGE_OFFSET ((unsigned long)__P
> #define VMALLOC_RESERVE ((unsigned long)__VMALLOC_RESERVE)
> #define MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE)
> #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
> +#define __pa_symbol(x) __pa(x)

Actually PAGE_OFFSET arithmetic on symbols is outside ISO C and gcc
> misoptimizes it occasionally. You would need to use HIDE_RELOC
or similar. That is why x86-64 has the magic.

-Andi

2006-08-01 19:07:30

by Sam Ravnborg

[permalink] [raw]
Subject: Re: [PATCH 1/33] i386: vmlinux.lds.S Distinguish absolute symbols

On Tue, Aug 01, 2006 at 05:03:16AM -0600, Eric W. Biederman wrote:
> Ld knows about 2 kinds of symbols, absolute and section
> relative. Section relative symbols change value
> when a section is moved and absolute symbols do not.
>
> Currently in the linker script we have several labels
> marking the beginning and ending of sections that
> are outside of sections, making them absolute symbols.
> Having a mixture of absolute and section relative
> symbols referring to the same data is currently harmless
> but it is confusing.
In the past we have seen problems when there was some padding between
the global symbol and the actual section start. The reason for the
padding was the alignment of the section, which is aligned according to
the longest of the contained symbols. So regardless of the relocatable
kernel work this is an improvement.

Sam

2006-08-01 19:09:15

by Sam Ravnborg

[permalink] [raw]
Subject: Re: [PATCH 4/33] i386: CONFIG_PHYSICAL_START cleanup

On Tue, Aug 01, 2006 at 05:03:19AM -0600, Eric W. Biederman wrote:
> Defining __PHYSICAL_START and __KERNEL_START in asm-i386/page.h works but
> it triggers a full kernel rebuild for the silliest of reasons. This
> modifies the users to directly use CONFIG_PHYSICAL_START and linux/config.h
> which prevents the full rebuild problem, which makes the code much
> more maintainer and hopefully user friendly.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> arch/i386/boot/compressed/head.S | 8 ++++----
> arch/i386/boot/compressed/misc.c | 8 ++++----
> arch/i386/kernel/vmlinux.lds.S | 3 ++-
> include/asm-i386/page.h | 3 ---
> 4 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/arch/i386/boot/compressed/head.S b/arch/i386/boot/compressed/head.S
> index b5893e4..8f28ecd 100644
> --- a/arch/i386/boot/compressed/head.S
> +++ b/arch/i386/boot/compressed/head.S
> @@ -23,9 +23,9 @@
> */
> .text
>
> +#include <linux/config.h>

You already have full access to all CONFIG_* symbols - kbuild includes
it on the commandline. So please kill this include.

Sam

2006-08-01 19:11:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 32/33] x86_64: Relocatable kernel support

"Eric W. Biederman" <[email protected]> writes:
>
> When loaded with a normal bootloader the decompressor will decompress
> the kernel to 2M and it will run there. This both ensures the
> relocation code is always working, and makes it easier to use 2M
> pages for the kernel and the cpu.

It would have been nicer if you had moved the uncompressor to be 64bit
first like it was planned for a long time.

-Andi

2006-08-01 19:10:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 26/33] x86_64: 64bit PIC ACPI wakeup

"Eric W. Biederman" <[email protected]> writes:
>
> I don't have a configuration I can test this on, but it compiles cleanly
> and it should work, the code is very similar to the SMP trampoline,
> which I have tested. At least now the comments about still running in
> low memory are actually correct.

We would need someone to actually test this before it could
be merged. I didn't see anything that tripped me up though.

-Andi

2006-08-01 19:13:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 25/33] x86_64: 64bit PIC SMP trampoline

"Eric W. Biederman" <[email protected]> writes:
> - ljmpl $__KERNEL32_CS, $(startup_32-__START_KERNEL_map)
> +
> + # flush prefetch and jump to startup_32
> + ljmpl *(startup_32_vector - r_base)
> +
> + .code32
> + .balign 4
> +startup_32:

It would be nicer if you could factor out that code into
a common file between head.S and trampoline.S

-Andi

2006-08-01 19:15:05

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 28/33] x86_64: Remove the identity mapping as early as possible.

"Eric W. Biederman" <[email protected]> writes:

> With the rewrite of the SMP trampoline and the early page
> allocator there is nothing that needs identity mapped pages,
> once we start executing C code.

Cool.

Hopefully that can be done for i386 too. People on other architectures
have been complaining that i386 doesn't catch early NULL pointers.

-Andi

2006-08-01 19:19:14

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

"Eric W. Biederman" <[email protected]> writes:
> }
> @@ -200,6 +224,178 @@ static void putstr(const char *s)
> outb_p(0xff & (pos >> 1), vidport+1);
> }
>
> +static void vid_console_init(void)

Please just use early_printk instead of reimplementing this.
I think it should work in this context too.

> +static inline int tolower(int ch)
> +{
> + return ch | 0x20;
> +}
> +
> +static inline int isdigit(int ch)
> +{
> + return (ch >= '0') && (ch <= '9');
> +}
> +
> +static inline int isxdigit(int ch)
> +{
> + ch = tolower(ch);
> + return isdigit(ch) || ((ch >= 'a') && (ch <= 'f'));
> +}

And please reuse the Linux code here.


Actually the best way to reuse would be to first do 64bit uncompressor
and linker directly, but short of that #includes would be fine too.


> +
> +
> +static inline int digval(int ch)
> +{
> + return isdigit(ch)? (ch - '0') : tolower(ch) - 'a' + 10;
> +}
> +
> +/**
> + * simple_strtou - convert a string to an unsigned
> + * @cp: The start of the string
> + * @endp: A pointer to the end of the parsed string will be placed here
> + * @base: The number base to use
> + */
> +static unsigned simple_strtou(const char *cp, char **endp, unsigned base)
> +{
> + unsigned result = 0,value;
> +
> + if (!base) {
> + base = 10;
> + if (*cp == '0') {
> + base = 8;
> + cp++;
> + if ((tolower(*cp) == 'x') && isxdigit(cp[1])) {
> + cp++;
> + base = 16;
> + }
> + }
> + } else if (base == 16) {
> + if (cp[0] == '0' && tolower(cp[1]) == 'x')
> + cp += 2;
> + }
> + while (isxdigit(*cp) && ((value = digval(*cp)) < base)) {
> + result = result*base + value;
> + cp++;
> + }
> + if (endp)
> + *endp = (char *)cp;
> + return result;
> +}

Can you please somehow reuse the Linux one?

> +
> static void* memset(void* s, int c, unsigned n)
> {
> int i;
> @@ -218,6 +414,29 @@ static void* memcpy(void* dest, const vo
> return dest;
> }
>
> +static int memcmp(const void *s1, const void *s2, unsigned n)
> +{
> + const unsigned char *str1 = s1, *str2 = s2;
> + size_t i;
> + int result = 0;
> + for(i = 0; (result == 0) && (i < n); i++) {
> + result = *str1++ - *str2++;
> + }
> + return result;
> +}
> +
> +char *strstr(const char *haystack, const char *needle)
> +{
> + size_t len;
> + len = strlen(needle);
> + while(*haystack) {
> + if (memcmp(haystack, needle, len) == 0)
> + return (char *)haystack;
> + haystack++;
> + }
> + return NULL;


Would be better to just pull in lib/string.c

-Andi

2006-08-01 19:27:11

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Tue, Aug 01, 2006 at 04:58:49AM -0600, Eric W. Biederman wrote:
>
> The problem:
>
> We can't always run the kernel at 1MB or 2MB, and so people who need
> different addresses must build multiple kernels. The bzImage format
> can't even represent loading a kernel at other than it's default address.
> With kexec on panic now starting to be used by distros having a kernel
> not running at the default load address is starting to become common.
>
> The goal of this patch series is to build kernels that are relocatable
> at run time, and to extend the bzImage format to make it capable of
> expressing a relocatable kernel.
>
> In extending the bzImage format I am replacing the existing unused bootsector
> with an ELF header. To express what is going on the ELF header will
> have type ET_DYN. Just like the kernel loading an ET_DYN executable
> bootloaders are not expected to process relocations. But the executable
> may be shifted in the address space so long as it's alignment requirements
> are met.
>
> The x86_64 kernel is simply built to live at a fixed virtual address
> and the boot page tables are relocated. The i386 kernel is built
> to process relocations generated with --embedded-relocs (after vmlinux.lds.S)
> has been fixed up to sort out static and dynamic relocations.

Hi Eric,

Can't we use the x86_64 relocation approach for i386 as well? I mean keep
the virtual address space fixed and update the page tables. This would
help in the sense that you don't have to change gdb if somebody decides to
debug the relocated kernel.

Any such tool that retrieves the symbol virtual address from vmlinux will
be confused.

Thanks
Vivek

2006-08-01 20:14:29

by Jan Kratochvil

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Tue, 01 Aug 2006 21:26:28 +0200, Vivek Goyal wrote:
...
> Can't we use the x86_64 relocation approach for i386 as well? I mean keep
> the virtual address space fixed and update the page tables. This would
> help in the sense that you don't have to change gdb if somebody decides to
> debug the relocated kernel.

This is exactly the approach of mkdump version <=1.0
http://mkdump.sourceforge.net/
As documented by Itsuro Oda:
http://mkdump.cvs.sourceforge.net/mkdump/doc/mini_kernel_tech_note.txt?revision=1.1

There is a problem: all the drivers expect that an allocated buffer
address can be passed directly as a physical address to the hardware.
mkdump-1.0 has a lot of backward-mappings for these drivers but you can
never catch all of them.
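The assumption in question, sketched in C (virt_to_phys as in the
i386 kernel headers; the helper name is made up):

#include <asm/io.h>

/* the classic shortcut: assume virt_to_phys() of a kernel buffer can
 * go straight to the device, which breaks once the kernel's
 * virtual->physical mapping has been shifted under the driver */
static unsigned long buffer_bus_address(void *buf)
{
        return virt_to_phys(buf);
}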


Regards,
Lace

2006-08-01 20:26:16

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

Vivek Goyal wrote:
>
> Hi Eric,
>
> Can't we use the x86_64 relocation approach for i386 as well? I mean keep
> the virtual address space fixed and update the page tables. This would
> help in the sense that you don't have to change gdb if somebody decides to
> debug the relocated kernel.
>
> Any such tool that retrieves the symbol virtual address from vmlinux will
> be confused.
>

I don't think this is practical given the virtual space constraints on
i386 systems.

-hpa

2006-08-01 20:40:59

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Tue, Aug 01, 2006 at 04:58:49AM -0600, Eric W. Biederman wrote:
>
> Currently there are 33 patches in my tree to do this.
>
> The weirdest symptom I have had so far is that page faults did not
> trigger the early exception handler on x86_64 (instead I got a reboot).
>
> There is one outstanding issue where I am probably requiring too much alignment
> on the arch/i386 kernel.
>
> Can anyone find anything else?
>

I am running into a compilation failure on x86_64.


SYSMAP System.map
SYSMAP .tmp_System.map
AS arch/x86_64/boot/bootsect.o
In file included from include/asm/bitops.h:9,
from include/linux/bitops.h:10,
from include/linux/kernel.h:16,
from include/asm/system.h:5,
from include/asm/processor.h:19,
from include/asm/elf.h:11,
from include/linux/elf.h:8,
from arch/x86_64/boot/bootsect.S:20:
include/asm/alternative.h:73: error: syntax error in macro parameter list
include/asm/alternative.h:88: error: syntax error in macro parameter list
include/asm/alternative.h:127: error: syntax error in macro parameter list
In file included from include/asm/system.h:5,
from include/asm/processor.h:19,
from include/asm/elf.h:11,
from include/linux/elf.h:8,
from arch/x86_64/boot/bootsect.S:20:
include/linux/kernel.h:34: warning: "ALIGN" redefined
In file included from include/linux/kernel.h:12,
from include/asm/system.h:5,
from include/asm/processor.h:19,
from include/asm/elf.h:11,
from include/linux/elf.h:8,
from arch/x86_64/boot/bootsect.S:20:
include/linux/linkage.h:27: warning: this is the location of the previous definition
In file included from include/asm/system.h:5,
from include/asm/processor.h:19,
from include/asm/elf.h:11,
from include/linux/elf.h:8,
from arch/x86_64/boot/bootsect.S:20:
include/linux/kernel.h:216: error: syntax error in macro parameter list
include/linux/kernel.h:220: error: syntax error in macro parameter list
In file included from include/linux/sched.h:55,
from include/asm/compat.h:9,
from include/asm/elf.h:12,
from include/linux/elf.h:8,
from arch/x86_64/boot/bootsect.S:20:
include/linux/nodemask.h:229: error: detected recursion whilst expanding macro "find_first_bit"
include/linux/nodemask.h:235: error: detected recursion whilst expanding macro "find_next_bit"
include/linux/nodemask.h:254: error: detected recursion whilst expanding macro "find_first_zero_bit"
arch/x86_64/boot/bootsect.S:21: error: linux/elf_boot.h: No such file or directory
make[1]: *** [arch/x86_64/boot/bootsect.o] Error 1
make: *** [bzImage] Error 2

Thanks
Vivek

2006-08-01 22:10:22

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 11/33] i386 boot: Add an ELF header to bzImage

Eric W. Biederman wrote:
> +.macro note name, type
> + .balign 4
> + .int 2f - 1f # n_namesz
> + .int 4f - 3f # n_descsz
> + .int \type # n_type
> + .balign 4
> +1: .asciz "\name"
> +2: .balign 4
> +3:
> +.endm
> +.macro enote
> +4: .balign 4
> +.endm
>

This is very similar to the macro I introduced in the Paravirt note
segment patch. Do you think they should be made common?

> +/* Elf notes to help bootloaders identify what program they are booting.
> + */
> +
> +/* Standardized Elf image notes for booting... The name for all of these is ELFBoot */
> +#define ELF_NOTE_BOOT "ELFBoot"
>

I wonder if this should be named something to suggest it's Linux-specific? Or
do you see this being used by a wider audience?

J

2006-08-02 02:03:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

"H. Peter Anvin" <[email protected]> writes:

> Vivek Goyal wrote:
>> Hi Eric,
>> Can't we use the x86_64 relocation approach for i386 as well? I mean keep
>> the virtual address space fixed and update the page tables. This would
>> help in the sense that you don't have to change gdb if somebody decides to
>> debug the relocated kernel.
>> Any such tool that retrieves the symbol virtual address from vmlinux will
>> be confused.
>>
>
> I don't think this is practical given the virtual space constraints on i386
> systems.

Exactly.

Plus it is a lot of dangerous work. Processing relocations is a lot
more conservative.

Eric

2006-08-02 02:10:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 18/33] x86_64: Kill temp_boot_pmds

Andi Kleen <[email protected]> writes:

> "Eric W. Biederman" <[email protected]> writes:
>>
>> I also modify the early page table initialization code
>> to use early_ioreamp and early_iounmap, instead of the
>> special case version of those functions that they are
>> now calling.
>
> Ok valuable cleanup. I queued that one too.
>
>> The only really silly part left with init_memory_mapping
>> is that find_early_table_space always finds pages below 1M.
>
> I fixed this some time ago - obsolete comment?

Yes, an obsolete comment. I thought I had rechecked that
but I was skimming too fast. find_e820_memory certainly
isn't limited to pages below 1M.

Eric

2006-08-02 02:13:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 18/33] x86_64: Kill temp_boot_pmds II

Andi Kleen <[email protected]> writes:

> "Eric W. Biederman" <[email protected]> writes:
>>
>> I also modify the early page table initialization code
>> to use early_ioreamp and early_iounmap, instead of the
>> special case version of those functions that they are
>> now calling.
>
> Or rather I tried to apply it - it doesn't apply at all
> on its own:
>
> patching file arch/x86_64/mm/init.c
> Hunk #1 FAILED at 167.
> Hunk #2 succeeded at 274 with fuzz 1 (offset 28 lines).
> Hunk #3 FAILED at 286.
> Hunk #4 FAILED at 341.
> 3 out of 4 hunks FAILED -- rejects in file arch/x86_64/mm/init.c

It is probably patch 17:
"x86_64: Separate normal memory map initialization from the hotplug case"

I don't see any other patches that touch arch/x86_64/mm/init.c
before that. At least not in 2.6.18-rc3, which is the base of
my patchset.

Eric

2006-08-02 02:20:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2/33] i386: define __pa_symbol

Andi Kleen <[email protected]> writes:

> "Eric W. Biederman" <[email protected]> writes:
>
>> On x86_64 we have to be careful with calculating the physical
>> address of kernel symbols, both because of compiler oddities
>> and because the symbols live in a different range of the virtual
>> address space.
>>
>> Having a definition of __pa_symbol that works on both x86_64 and
>> i386 simplifies writing code that works for both x86_64 and
>> i386 that has these kinds of dependencies.
>>
>> So this patch adds the trivial i386 __pa_symbol definition.
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>> ---
>> include/asm-i386/page.h | 1 +
>> 1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/asm-i386/page.h b/include/asm-i386/page.h
>> index f5bf544..eceb7f5 100644
>> --- a/include/asm-i386/page.h
>> +++ b/include/asm-i386/page.h
>> @@ -124,6 +124,7 @@ #define PAGE_OFFSET ((unsigned long)__P
>> #define VMALLOC_RESERVE ((unsigned long)__VMALLOC_RESERVE)
>> #define MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE)
>> #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
>> +#define __pa_symbol(x) __pa(x)
>
> Actually PAGE_OFFSET arithmetic on symbols is outside ISO C and gcc
> misoptimizes it occasionally. You would need to use HIDE_RELOC
> or similar. That is why x86-64 has the magic.

Yes. ISO C only defines pointer arithmetic within arrays.
I believe GNU C makes it a well defined case.

Currently we do not appear to have any problems on i386.
But I have at least one case of code that is shared between
i386 and x86_64 and it is appropriate to use __pa_symbol on
x86_64.

So I added __pa_symbol for that practical reason.

I would have no problems with generalizing this but I wanted to
at least make it possible to use the concept on i386.

I will be happy to add in the assembly magic, if you don't have
any other problems with this.

Eric

2006-08-02 02:24:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 4/33] i386: CONFIG_PHYSICAL_START cleanup

Sam Ravnborg <[email protected]> writes:

>> diff --git a/arch/i386/boot/compressed/head.S
> b/arch/i386/boot/compressed/head.S
>> index b5893e4..8f28ecd 100644
>> --- a/arch/i386/boot/compressed/head.S
>> +++ b/arch/i386/boot/compressed/head.S
>> @@ -23,9 +23,9 @@
>> */
>> .text
>>
>> +#include <linux/config.h>
>
> You already have full access to all CONFIG_* symbols - kbuild includes
> it on the commandline. So please kill this include.

Ok. That must be new. No problem.

Eric

2006-08-02 02:26:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 32/33] x86_64: Relocatable kernel support

Andi Kleen <[email protected]> writes:

> "Eric W. Biederman" <[email protected]> writes:
>>
>> When loaded with a normal bootloader the decompressor will decompress
>> the kernel to 2M and it will run there. This both ensures the
>> relocation code is always working, and makes it easier to use 2M
>> pages for the kernel and the cpu.
>
> It would have been nicer if you had moved the uncompressor to be 64bit
> first like it was planned for a long time.

Sorry. I wasn't really in those discussions.

I guess I could take this in some slightly smaller steps.
But this does wind up with the decompressor being 64bit code.

Eric

2006-08-02 02:31:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

Andi Kleen <[email protected]> writes:

> "Eric W. Biederman" <[email protected]> writes:
>> }
>> @@ -200,6 +224,178 @@ static void putstr(const char *s)
>> outb_p(0xff & (pos >> 1), vidport+1);
>> }
>>
>> +static void vid_console_init(void)
>
> Please just use early_printk instead of reimplementing this.
> I think it should work in this context too.

It doesn't, or at least it didn't. I can look again though.

>> +static inline int tolower(int ch)
>> +{
>> + return ch | 0x20;
>> +}
>> +
>> +static inline int isdigit(int ch)
>> +{
>> + return (ch >= '0') && (ch <= '9');
>> +}
>> +
>> +static inline int isxdigit(int ch)
>> +{
>> + ch = tolower(ch);
>> + return isdigit(ch) || ((ch >= 'a') && (ch <= 'f'));
>> +}
>
> And please reuse the Linux code here.

Reuse is hard because we really are a separate executable,
in a slightly different environment.

> Actually the best way to reuse would be to first do 64bit uncompressor
> and linker directly, but short of that #includes would be fine too.

> Would be better to just pull in lib/string.c

Maybe. Size is fairly important here so I am concerned that I
will pull in more than I need. But I will look and see if I can
pull in just a subset of what is needed.

Eric

2006-08-02 02:39:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 11/33] i386 boot: Add an ELF header to bzImage

Jeremy Fitzhardinge <[email protected]> writes:

> Eric W. Biederman wrote:
>> +.macro note name, type
>> + .balign 4
>> + .int 2f - 1f # n_namesz
>> + .int 4f - 3f # n_descsz
>> + .int \type # n_type
>> + .balign 4
>> +1: .asciz "\name"
>> +2: .balign 4
>> +3:
>> +.endm
>> +.macro enote
>> +4: .balign 4
>> +.endm
>>
>
> This is very similar to the macro I introduced in the Paravirt note segment
> patch. Do you think they should be made common?

Yes. At the point of merging these two approaches I think the notes
should just be put into vmlinux and copied either into the ELF
header or more likely into a segment that build.c tacks onto
the kernel.

It is such a small piece of my overall picture I wasn't really looking
at that.
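For reference, those directives just lay down a standard ELF note
record; in C the layout is roughly this (a sketch: Elf32_Nhdr comes
from elf.h, and the n_type and desc values here are made up):

#include <elf.h>
#include <stdint.h>

static const struct {
        Elf32_Nhdr hdr;
        char name[8];           /* "ELFBoot" plus NUL, already 4-byte aligned */
        uint32_t desc;          /* descriptor payload, 4-byte aligned */
} boot_note = {
        .hdr  = { .n_namesz = 8, .n_descsz = 4, .n_type = 1 },
        .name = "ELFBoot",
        .desc = 0,
};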

>> +/* Elf notes to help bootloaders identify what program they are booting.
>> + */
>> +
>> +/* Standardized Elf image notes for booting... The name for all of these is
> ELFBoot */
>> +#define ELF_NOTE_BOOT "ELFBoot"
>>
>
> I wonder if this should be named something to suggest it's Linux-specific? Or do you
> see this being used by a wider audience?

This note is used in etherboot and LinuxBIOS right now so it isn't terribly
Linux specific. And whatever the virtues of its name it is actually in use.

I had a preliminary RFC for using these in a wider context at one point but I
ran out of energy for pushing it.

I think the information in those note headers is interesting; beyond that I
really don't much care.

Eric

2006-08-02 02:42:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

Vivek Goyal <[email protected]> writes:

> On Tue, Aug 01, 2006 at 04:58:49AM -0600, Eric W. Biederman wrote:
>>
>> Currently there are 33 patches in my tree to do this.
>>
>> The weirdest symptom I have had so far is that page faults did not
>> trigger the early exception handler on x86_64 (instead I got a reboot).
>>
>> There is one outstanding issue where I am probably requiring too much
> alignment
>> on the arch/i386 kernel.
>>
>> Can anyone find anything else?
>>
>
> I am running into a compilation failure on x86_64.

I'm not quite certain what is wrong, except that you haven't
applied all of my patches.

The x86_64 ones do depend on the i386 ones to some extent.
That is why it was one giant patchset.

Eric

2006-08-02 03:07:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

Andi Kleen <[email protected]> writes:

> "Eric W. Biederman" <[email protected]> writes:
>> }
>> @@ -200,6 +224,178 @@ static void putstr(const char *s)
>> outb_p(0xff & (pos >> 1), vidport+1);
>> }
>>
>> +static void vid_console_init(void)
>
> Please just use early_printk instead of reimplementing this.
> I think it should work in this context too.

There is certainly some value in that. To do that I would
need to refactor early_printk to make it usable.

This comment from one of the patches summarizes the worst of the problems.

> /* WARNING!!
> * This code is compiled with -fPIC and it is relocated dynamically
> * at run time, but no relocation processing is performed.
> * This means that it is not safe to place pointers in static structures.
> */

lib/string.c might be useful. The fact that the functions are not
static slightly concerns me. I have a vague memory of non-static
functions generating relocations for no good reason.

Eric

2006-08-02 03:10:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor


> > /* WARNING!!
> > * This code is compiled with -fPIC and it is relocated dynamically
> > * at run time, but no relocation processing is performed.
> > * This means that it is not safe to place pointers in static structures.
> > */

iirc the only static relocation in early_printk is the one to initialize
the console pointers - that could certainly be moved to run

> lib/string.c might be useful. The fact that the functions are not
> static slightly concerns me. I have a vague memory of non-static
> functions generating relocations for no good reason.

Would surprise me.

-Andi

2006-08-02 03:10:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2/33] i386: define __pa_symbol


> Yes. ISO C only defines pointer arithmetic within arrays.
> I believe GNU C makes it a well defined case.

Nope, it doesn't.

There was a miscompilation on PPC some time ago, that is why
HIDE_RELOC() and __pa_symbol() was implemented.

>
> Currently we do not appear to have any problems on i386.
> But I have at least one case of code that is shared between
> i386 and x86_64 and it is appropriate to use __pa_symbol on
> x86_64.
>
> So I added __pa_symbol for that practical reason.
>
> I would have no problems with generalizing this but I wanted to
> at least make it possible to use the concept on i386.

No problem with that, just use HIDE_RELOC
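For reference, the magic amounts to hiding the symbol from the
optimizer behind an empty asm, roughly like this (a sketch; the macro
in the tree is spelled RELOC_HIDE, and x86-64's __pa_symbol uses the
same trick):

#define __pa_symbol(x)                          \
({      unsigned long v;                        \
        asm("" : "=r" (v) : "0" (x));           \
        __pa(v);                                \
})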

-Andi

2006-08-02 03:11:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor


> > Actually the best way to reuse would be to first do 64bit uncompressor
> > and linker directly, but short of that #includes would be fine too.
>
> > Would be better to just pull in lib/string.c
>
> Maybe. Size is fairly important

Why is size important here?

-Andi

2006-08-02 03:11:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 18/33] x86_64: Kill temp_boot_pmds II


> It is probably patch 17:
> "x86_64: Separate normal memory map initialization from the hotplug case"

Ok that messes things up. Actually I think I preferred the previous
code - it was not as bad as you make it. The two variants
are really doing mostly the same thing. So best you drop that.

> I don't see any other patches that touch arch/x86_64/mm/init.c
> before that. At least not in 2.6.18-rc3, which is the base of
> my patchset.

I got three patches that touch mm/init.c in my patchkit
(ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/)

BTW I didn't merge any further patches currently, but might
after the next round when the current comments are addressed.

-Andi

2006-08-02 04:59:39

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

Andi Kleen <[email protected]> writes:

>> > Actually the best way to reuse would be to first do 64bit uncompressor
>> > and linker directly, but short of that #includes would be fine too.
>>
>> > Would be better to just pull in lib/string.c
>>
>> Maybe. Size is fairly important
>
> Why is size important here?

For the same reason that we compress the kernel. ;)

This is the one chunk of code that we don't compress so every extra
byte makes our executable bigger. Now I think the code size is
actually in the 32k - 64k range so as long as it is a minor change
it doesn't really matter.

The big pain with using lib/string.c and
arch/x86_64/kernel/early_printk.c is that it is a significant change
in how the code of misc.c is constructed. Which means some
serious reevaluation of all kinds of things needs to happen,
making it a lot of work :)

One of the practical dangers is that we make it more likely
we can kill the boot by messing up the shared code.

I'm not certain what to think when even including normal
kernel headers causes problems. It certainly makes me leery
of including normal kernel code. But it might simplify some
of the problems too.

Whichever way I go scrutinizing that possibility carefully is
a lot of work.

Eric

2006-08-02 05:25:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

On Wednesday 02 August 2006 06:57, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> >> > Actually the best way to reuse would be to first do 64bit uncompressor
> >> > and linker directly, but short of that #includes would be fine too.
> >>
> >> > Would be better to just pull in lib/string.c
> >>
> >> Maybe. Size is fairly important
> >
> > Why is size important here?
>
> For the same reason that we compress the kernel. ;)
>
> This is the one chunk of code that we don't compress so every extra
> byte makes our executable bigger. Now I think the code size is
> actually in the 32k - 64k range so as long as it is a minor change
> it doesn't really matter.

text data bss dec hex filename
1909 352 12 2273 8e1 arch/x86_64/kernel/early_printk.o
2212 0 0 2212 8a4 lib/string.o

It's minor.

>
> The big pain with using lib/string.c and
> arch/x86_64/kernel/early_printk.c is that it is a significant change
> in how the code of misc.c is constructed.

Not if you use #include

> Which means some
> serious reevaluation of all kinds of things need to be considered.
> Making it a lot of work :)
>
> One of the practical dangers is that we make it more likely
> we can kill the boot by messing up the shared code.

If they're messed up the later boot will fail too. Doesn't make
too much difference.

>
> I'm not certain what to think when even including normal
> kernel headers causes problems. It certainly makes me leery
> of including normal kernel code. But it might simplify some
> of the problems too.

On x86-64 some trouble comes from it being 32bit code.
That is why I suggested making it 64bit first, which would
avoid many of the problems.

> Whichever way I go scrutinizing that possibility carefully is
> a lot of work.

64bit conversion would be some work, the rest isn't I think.

Alternatively if you don't like it we can just drop these compressor patches.
I don't think they were essential.

-Andi

2006-08-02 05:28:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

Andi Kleen <[email protected]> writes:

>> > /* WARNING!!
>> > * This code is compiled with -fPIC and it is relocated dynamically
>> > * at run time, but no relocation processing is performed.
>> > * This means that it is not safe to place pointers in static structures.
>> > */
>
> iirc the only static relocation in early_printk is the one to initialize
> the console pointers - that could certainly be moved to run
> time.

The function pointers in the console structure are also a problem.
static struct console simnow_console = {
.name = "simnow",
.write = simnow_write,
.flags = CON_PRINTBUFFER,
.index = -1,
};

>> lib/string.c might be useful. The fact that the functions are not
>> static slightly concerns me. I have a vague memory of non-static
>> functions generating relocations for no good reason.
>
> Would surprise me.

The context where it bit me was memtest86, if I recall correctly.
The problem there was I did process relocations and I discovered simply
by making functions static or at least non-exported I had many fewer
relocations to process.

Since I am relying on a very clever trick to generate code that
doesn't have relocations at run time I have to be careful.

So if I want to continue not processing relocations,
I need to be careful not to use constructs that will generate
a procedure linkage table, which I think only kicks in with
external functions and multiple files.

I need to be careful not to put pointers in statically allocated
data structures.

Ideally the code would be set up so you can compile out consoles
the user finds uninteresting.

It is annoying to have to call strlen on all of the strings
you want to print.

So there are plenty of mismatches there.
But we may be able to harmonize them, and reuse early_printk.

Eric

2006-08-02 05:37:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 18/33] x86_64: Kill temp_boot_pmds II

Andi Kleen <[email protected]> writes:

>> It is probably patch 17:
>> "x86_64: Separate normal memory map initialization from the hotplug case"
>
> Ok that messes things up. Actually I think i prefered the previous
> code - it was not that bad as you make it. The two variants.
> are really doing mostly the same. So best you drop that.

All of my complaints are real. But yes I do think a reasonable
case can be made for merging them. In several of the worst cases
simply calling memset before initializing the page is probably
sufficient to remove a test later on.

As that code sits right now you need way too much global context
to understand what is going on. It is the kind of code that cause
obviously correct patches to fail, and I'm clever enough to know
clever code is very dangerous. :)

So before I get back to that I will probably look and see if there
is some more heavy lifting I can do to make that code less of a land mine.

>> I don't see any other patches that touch arch/x86_64/mm/init.c
>> before that. At least not in 2.6.18-rc3, which is the base of
>> my patchset.
>
> I got three patches that touch mm/init.c in my patchkit
> (ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/)
>
> BTW I didn't merge any further patches currently, but might
> after the next round when the current comments are addressed.

Ok. I will take a look.

Having any patches merged on a simple request for comments was a bit of a surprise :)

Eric

2006-08-02 05:45:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor


> The function pointers in the console structure are also a problem.
> static struct console simnow_console = {
> .name = "simnow",
> .write = simnow_write,
> .flags = CON_PRINTBUFFER,
> .index = -1,
> };

Yes just patch them at runtime.
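Along the lines of this sketch (assuming the kernel's struct console
and the simnow_write from the quoted code):

#include <linux/console.h>

extern void simnow_write(struct console *con, const char *s, unsigned n);

static struct console simnow_console = {
        .name  = "simnow",
        .flags = CON_PRINTBUFFER,
        .index = -1,
        /* .write deliberately left NULL: a static initializer would
         * need a relocation that never gets processed */
};

static void simnow_console_register(void)
{
        simnow_console.write = simnow_write;    /* patched at run time */
}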


> Ideally the code would be set up so you can compile out consoles
> the user finds uninteresting.

Seems overkill for early_printk

> It is annoying to have to call strlen on all of the strings
> you want to print..

What strlen?

-Andi

2006-08-02 06:34:42

by Magnus Damm

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On 8/1/06, Eric W. Biederman <[email protected]> wrote:
>
> The problem:
>
> We can't always run the kernel at 1MB or 2MB, and so people who need
> different addresses must build multiple kernels. The bzImage format
> can't even represent loading a kernel at other than it's default address.
> With kexec on panic now starting to be used by distros having a kernel
> not running at the default load address is starting to become common.
>
> The goal of this patch series is to build kernels that are relocatable
> at run time, and to extend the bzImage format to make it capable of
> expressing a relocatable kernel.

Nice work. I'd really like to see support for relocatable kernels in
mainline (and kexec-tools!).

Eric, could you please list the advantages of your run-time relocation
code over my incomplete relocate-in-userspace prototype posted to
fastboot a few weeks ago?

One thing I know for sure is that your implementation supports bzImage
while mine only supports relocation of vmlinux files. Are there any
other uses for relocatable bzImage except kdump?

Thanks!

/ magnus

2006-08-02 06:45:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

Andi Kleen <[email protected]> writes:

> On Wednesday 02 August 2006 06:57, Eric W. Biederman wrote:
>
> On x86-64 some trouble comes from it being 32bit code.
> That is why I suggested making it 64bit first, which would
> avoid many of the problems.

:)

>> Whichever way I go scrutinizing that possibility carefully is
>> a lot of work.
>
> 64bit conversion would be some work, the rest isn't I think.

Except for the head.S work the 64bit conversion was practically a noop.

> Alternatively if you don't like it we can just drop these compressor patches.
> I don't think they were essential.

Agreed. The printing portion wasn't essential.

At this point I think dropping the non-essential bits just to get the size
of the patchset down makes sense.

Eric

2006-08-02 07:11:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

"Magnus Damm" <[email protected]> writes:

> On 8/1/06, Eric W. Biederman <[email protected]> wrote:
>>
>> The problem:
>>
>> We can't always run the kernel at 1MB or 2MB, and so people who need
>> different addresses must build multiple kernels. The bzImage format
>> can't even represent loading a kernel at other than it's default address.
>> With kexec on panic now starting to be used by distros having a kernel
>> not running at the default load address is starting to become common.
>>
>> The goal of this patch series is to build kernels that are relocatable
>> at run time, and to extend the bzImage format to make it capable of
>> expressing a relocatable kernel.
>
> Nice work. I'd really like to see support for relocatable kernels in
> mainline (and kexec-tools!).

kexec-tools already have initial support for ELF ET_DYN executables.
Vivek posted a fix a day or two ago so I expect that the support
should be working.

> Eric, could you please list the advantages of your run-time relocation
> code over my incomplete relocate-in-userspace prototype posted to
> fastboot a few weeks ago?

If you watch an architecture evolve one thing you will notice is that
the kinds of relocations keep growing. An ever growing list of things
for the bootloader to do is a pain. Especially when bootloaders
generally need to be as simple and as fixed as possible because bootloaders
are not something you generally want to update.

Beyond that if you look at head.S the code to process the relocations
(after I have finished post processing them at build time) is 9 instructions.
Which is absolutely trivial, at least for now.
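Roughly, in C terms (a sketch; the table layout here is illustrative
rather than the exact format):

#include <stdint.h>

/* walk the table of 32-bit slots gathered at build time and add the
 * delta between where we run and where we were linked to each one */
static void process_relocations(const uint32_t *table, unsigned count,
                                uint32_t delta)
{
        while (count--) {
                uint32_t *slot = (uint32_t *)(unsigned long)*table++;
                *slot += delta;
        }
}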

By keeping the bzImage processing the relocations we have kept the
bootloader/kernel interface simple.

> One thing I know for sure is that your implementation supports bzImage
> while mine only supports relocation of vmlinux files. Are there any
> other uses for relocatable bzImage except kdump?

I can't think of any volume users. A hypervisor that would actually report
the real physical addresses would be a candidate. It's a general purpose
facility so if it is interesting users will show up. Static
relocation has already found another use on x86_64.

There are definitely users of an ELF bzImage beyond the kdump case.
Anything that doesn't have a traditional 16bit BIOS on it. LinuxBIOS,
and Xen, and some others.

Not having to keep track of anything but your bzImage to boot is also
a serious advantage. It's the one binary to rule them all. :)

Eric

2006-08-02 07:15:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 9/33] i386 boot: Add serial output support to the decompressor

Andi Kleen <[email protected]> writes:

>
>> The function pointers in the console structure are also a problem.
>> static struct console simnow_console = {
>> .name = "simnow",
>> .write = simnow_write,
>> .flags = CON_PRINTBUFFER,
>> .index = -1,
>> };
>
> Yes just patch them at runtime.

I guess that can work. It's a bit of a pain though.

>> Ideally the code would be set up so you can compile out consoles
>> the user finds uninteresting.
>
> Seems overkill for early_printk

At least compiling it out completely probably isn't.
I have had too many times where the size of a bzImage was an important
factor on a project.

>> It is annoying to have to call strlen on all of the strings
>> you want to print..
>
> What strlen?

The strlen that is needed to convert putstr(char *s) into the
write method for the early_printk helpers.
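That is, something like this (a sketch; it assumes the kernel's
struct console, and early_console is a made-up name for whichever
console ends up being used):

#include <linux/console.h>
#include <linux/string.h>

extern struct console *early_console;

static void putstr(const char *s)
{
        /* the write method wants (buf, len), so every call pays a strlen */
        early_console->write(early_console, s, strlen(s));
}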

Eric

2006-08-02 08:34:46

by Magnus Damm

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On 8/2/06, Eric W. Biederman <[email protected]> wrote:
> "Magnus Damm" <[email protected]> writes:
> > Eric, could you please list the advantages of your run-time relocation
> > code over my incomplete relocate-in-userspace prototype posted to
> > fastboot a few weeks ago?
>
> If you watch an architecture evolve one thing you will notice is that
> the kinds of relocations keep growing. An ever growing list of things
> for the bootloader to do is a pain. Especially when bootloaders
> generally need to be as simple and as fixed as possible because bootloaders
> are not something you generally want to update.

I agree that updating bootloaders is something you want to avoid. I'm
not however sure that I would call kexec-tools a bootloader...

> Beyond that if you look at head.S the code to process the relocations
> (after I have finished post processing them at build time) is 9 instructions.
> Which is absolutely trivial, at least for now.

Yeah, but the 33 patches are touching more than 9 instructions. =)

> By keeping the bzImage processing the relocations we have kept the
> bootloader/kernel interface simple.

Agreed. I think your patch makes sense.

> > One thing I know for sure is that your implementation supports bzImage
> > while mine only supports relocation of vmlinux files. Are there any
> > other uses for relocatable bzImage except kdump?
>
> I can't think of any volume users. A hypervisor that would actually report
> the real physical addresses would be a candidate. It's a general purpose
> facility so if it is interesting users will show up. Static
> relocation has already found another use on x86_64.
>
> There are definitely users of an ELF bzImage beyond the kdump case.
> Anything that doesn't have a traditional 16bit BIOS on it. LinuxBIOS,
> and Xen, and some others.
>
> Not having to keep track of anything but your bzImage to boot is also
> a serious advantage. It's the one binary to rule them all. :)

One binary to rule them all... If that is true, is there any simple
way then to extract vmlinux from the bzImage?

Thanks!

/ magnus

2006-08-02 10:01:06

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

"Magnus Damm" <[email protected]> writes:

> On 8/2/06, Eric W. Biederman <[email protected]> wrote:
>> "Magnus Damm" <[email protected]> writes:
>> > Eric, could you please list the advantages of your run-time relocation
>> > code over my incomplete relocate-in-userspace prototype posted to
>> > fastboot a few weeks ago?
>>
>> If you watch an architecture evolve one thing you will notice is that
>> the kinds of relocations keep growing. An ever growing list of things
>> for the bootloader to do is a pain. Especially when bootloaders
>> generally need to be as simple and as fixed as possible because bootloaders
>> are not something you generally want to update.
>
> I agree that updating bootloaders is something you want to avoid. I'm
> not however sure that I would call kexec-tools a bootloader...

On the truly insane side of the possibilities: it is actually possible
to load a relocated bzImage, run setup16.S below 1M, and have it jump
to the kernel at any address below 4G.

>> Beyond that if you look at head.S the code to process the relocations
>> (after I have finished post processing them at build time) is 9 instructions.
>> Which is absolutely trivial, at least for now.
>
> Yeah, but the 33 patches are touching more than 9 instructions. =)

True. A lot of that is general cleanups to allow the kernel to be
relocated though, not to actually perform the relocation.

> One binary to rule them all... If that is true, is there any simple
> way then to extract vmlinux from the bzImage?

Unfortunately the process is a little lossy :(

So that means you still need the vmlinux to get the debug
symbols.


Eric

2006-08-02 16:16:03

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 4/33] i386: CONFIG_PHYSICAL_START cleanup

Sam Ravnborg <[email protected]> writes:

> On Tue, Aug 01, 2006 at 05:03:19AM -0600, Eric W. Biederman wrote:
>> Defining __PHYSICAL_START and __KERNEL_START in asm-i386/page.h works but
>> it triggers a full kernel rebuild for the silliest of reasons. This
>> modifies the users to directly use CONFIG_PHYSICAL_START and linux/config.h
>> which prevents the full rebuild problem, which makes the code much
>> more maintainer and hopefully user friendly.
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>> ---
>> arch/i386/boot/compressed/head.S | 8 ++++----
>> arch/i386/boot/compressed/misc.c | 8 ++++----
>> arch/i386/kernel/vmlinux.lds.S | 3 ++-
>> include/asm-i386/page.h | 3 ---
>> 4 files changed, 10 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/i386/boot/compressed/head.S
> b/arch/i386/boot/compressed/head.S
>> index b5893e4..8f28ecd 100644
>> --- a/arch/i386/boot/compressed/head.S
>> +++ b/arch/i386/boot/compressed/head.S
>> @@ -23,9 +23,9 @@
>> */
>> .text
>>
>> +#include <linux/config.h>
>
> You already have full access to all CONFIG_* symbols - kbuild includes
> it on the commandline. So please kill this include.

Stupid questions:
- Why do we still have a linux/config.h if it is totally redundant?
- Why don't we have at least a #warning in linux/config.h that would
tell us not to include it?
- Why do we still have about 200 includes of linux/config.h in the
kernel tree?

I would much rather have a compile error, or at least a compile
warning, rather than needing a code review to notice this error.

We haven't needed this header since October of last year.

Eric

2006-08-02 18:34:29

by Don Zickus

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] ELF Relocatable x86 and x86_64 bzImages

>
> There is one outstanding issue where I am probably requiring too much alignment
> on the arch/i386 kernel.

There were posts a while ago about optimizing the kernel performance by
loading it at a 4MB offset.

http://www.lkml.org/lkml/2006/2/23/189

Your changes break that on i386 (not aligned on a 4MB boundary). But a
5MB offset works. Is that the correct update or does that break the
original idea?

Cheers,
Don

2006-08-03 01:01:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] ELF Relocatable x86 and x86_64 bzImages

Don Zickus <[email protected]> writes:

>>
>> There is one outstanding issue where I am probably requiring too much
> alignment
>> on the arch/i386 kernel.
>
> There was posts awhile ago about optimizing the kernel performance by
> loading it at a 4MB offset.
>
> http://www.lkml.org/lkml/2006/2/23/189
>
> Your changes breaks that on i386 (not aligned on a 4MB boundary). But a
> 5MB offset works. Is that the correct update or does that break the
> original idea?

That patch should still apply and work as described.

Actually, when this stupid cold I have stops slowing me down,
I will fix the alignment to what it really needs to be (~= 8KB).

Then bootloaders should be able to make the decision.

HPA, does that sound at all interesting?

Eric

2006-08-03 04:54:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] ELF Relocatable x86 and x86_64 bzImages

Eric W. Biederman wrote:
> Don Zickus <[email protected]> writes:
>
>>> There is one outstanding issue where I am probably requiring too much
>> alignment
>>> on the arch/i386 kernel.
>> There was posts awhile ago about optimizing the kernel performance by
>> loading it at a 4MB offset.
>>
>> http://www.lkml.org/lkml/2006/2/23/189
>>
>> Your changes breaks that on i386 (not aligned on a 4MB boundary). But a
>> 5MB offset works. Is that the correct update or does that break the
>> original idea?
>
> That patch should still apply and work as described.
>
> Actually, when this stupid cold I have stops slowing me down,
> I will fix the alignment to what it really needs to be (~= 8KB).
>
> Then bootloaders should be able to make the decision.
>
> HPA, does that sound at all interesting?
>

I'm sorry, it's not clear to me what you're asking here.

The bootloaders will load bzImage at the 1 MB point, and it's up to the
decompressor to locate it appropriately. It has (correctly) been
pointed out that it would be faster if the decompressed kernel were
located at the 4 MB point -- large pages don't work below 2/4 MB due to
interference with the fixed MTRRs -- but that doesn't affect the boot
protocol in any way.

I was under the impression that your relocatable patches allow the boot
loader to load the bzImage at a different address than the usual
0x100000; but again, that shouldn't affect the kernel's final resting place.

-hpa

2006-08-03 14:06:00

by Sam Ravnborg

[permalink] [raw]
Subject: Re: [PATCH 4/33] i386: CONFIG_PHYSICAL_START cleanup

>
> Stupid questions:
> - Why do we still have a linux/config.h if it is totally redundant?
> - Why don't we have at least a #warning in linux/config.h that would
> tell us not to include it?
> - Why do we still have about 200 includes of linux/config.h in the
> kernel tree?
>
> I would much rather have a compile error, or at least a compile
> warning, rather than needing a code review to notice this error.
In progress. As part of the ongoing header cleanup all includes of
<config.h> are being removed and a warning is included in config.h.

When the change was done I did not want to spew out thousands of warnings
for a simple thing like this.

Sam

2006-08-04 22:56:47

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Tue, Aug 01, 2006 at 04:58:49AM -0600, Eric W. Biederman wrote:
>
> The problem:
>
> We can't always run the kernel at 1MB or 2MB, and so people who need
> different addresses must build multiple kernels. The bzImage format
> can't even represent loading a kernel at other than it's default address.
> With kexec on panic now starting to be used by distros having a kernel
> not running at the default load address is starting to become common.
>
Hi Eric,

There seems to be a small anomaly in the current set of patches for i386.

For example if one compiles the kernel with CONFIG_RELOCATABLE=y
and CONFIG_PHYSICAL_START=0x400000 (4MB) and uses grub to load
the kernel, then the kernel would run from the 1MB location. I think the
user would expect it to run from the 4MB location.

I think distros might want to keep the above config options enabled:
CONFIG_RELOCATABLE=y so that kexec can load the kdump kernel at a
different address, and CONFIG_PHYSICAL_START=<non-1MB location> to
extract better performance. (As we had discussions on the mailing list
some time back.)

In principle this is a limitation on the boot-loader's part, but as we
can not fix the boot-loaders out there, probably we can try fixing it
at the kernel level.

What I have done here is that the decompressor code will determine the
final resting place of the kernel based on the boot loader type. So
if I have been loaded by kexec, I am supposed to run from the loaded
address; otherwise I am supposed to run from CONFIG_PHYSICAL_START, as
I have been loaded at the 1MB address due to a boot loader limitation
and that's not the intention.

A prototype patch is attached with the mail. I have assumed that I can
assign boot loader type id 9 to kexec (Documentation/i386/boot.txt).
I am also assuming that all boot loaders apart from kexec have the 1MB
limitation. If not, it's trivial to include their boot loader ids also.

I have tested this patch and it works fine. What do you think about
this approach ?

Thanks
Vivek




Signed-off-by: Vivek Goyal <[email protected]>
---

arch/i386/boot/compressed/head.S | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

diff -puN arch/i386/boot/compressed/head.S~debug1-patch arch/i386/boot/compressed/head.S
--- linux-2.6.18-rc3-1M/arch/i386/boot/compressed/head.S~debug1-patch 2006-08-04 18:03:02.000000000 -0400
+++ linux-2.6.18-rc3-1M-root/arch/i386/boot/compressed/head.S 2006-08-04 18:18:26.000000000 -0400
@@ -60,13 +60,32 @@ startup_32:
* a relocatable kernel this is the delta to our load address otherwise
* this is the delta to CONFIG_PHYSICAL_START.
*/
+
#ifdef CONFIG_RELOCATABLE
+ /* If loaded by a non-kexec boot loader, then we will be loaded
+ * at the fixed 1MB address. But probably the intention is to run
+ * from the address for which the kernel has been compiled, which
+ * can be non-1MB.
+ */
+ xorl %eax, %eax
+ movb 0x210(%esi), %al
+
+ /* check boot loader type. Kexec bootloader id 9, version any */
+ shrl $4, %eax
+ subl $0x9, %eax
+ jnz 1f
+
+ /* Run kernel from the location it has been loaded at. */
movl %ebp, %ebx
+ jmp 2f
+
+ /* Run the kernel from compiled destination location. */
+1: movl $(CONFIG_PHYSICAL_START - startup_32), %ebx
#else
movl $(CONFIG_PHYSICAL_START - startup_32), %ebx
#endif

- /* Replace the compressed data size with the uncompressed size */
+2: /* Replace the compressed data size with the uncompressed size */
subl input_len(%ebp), %ebx
movl output_len(%ebp), %eax
addl %eax, %ebx
@@ -95,7 +114,16 @@ startup_32:
/* Compute the kernel start address.
*/
#ifdef CONFIG_RELOCATABLE
+ xorl %eax, %eax
+ movb 0x210(%esi), %al
+ /* check boot loader type. Kexec bootloader id 9, version any */
+ shrl $4, %eax
+ subl $0x9, %eax
+ jnz 3f
leal startup_32(%ebp), %ebp
+ jmp 4f
+3:
+ movl $CONFIG_PHYSICAL_START, %ebp
#else
movl $CONFIG_PHYSICAL_START, %ebp
#endif
@@ -103,7 +131,7 @@ startup_32:
/*
* Jump to the relocated address.
*/
- leal relocated(%ebx), %eax
+4: leal relocated(%ebx), %eax
jmp *%eax
.section ".text"
relocated:
_

2006-08-04 23:16:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

Vivek Goyal <[email protected]> writes:

> On Tue, Aug 01, 2006 at 04:58:49AM -0600, Eric W. Biederman wrote:
>>
>> The problem:
>>
>> We can't always run the kernel at 1MB or 2MB, and so people who need
>> different addresses must build multiple kernels. The bzImage format
>> can't even represent loading a kernel at other than it's default address.
>> With kexec on panic now starting to be used by distros having a kernel
>> not running at the default load address is starting to become common.
>>
> Hi Eric,
>
> There seems to be a small anomaly in the current set of patches for i386.
>
> For example, if one compiles the kernel with CONFIG_RELOCATABLE=y
> and CONFIG_PHYSICAL_START=0x400000 (4MB) and uses grub to load
> the kernel, then the kernel runs from the 1MB location. I think the user
> would expect it to run from the 4MB location.

Agreed. That is non-intuitive, and should probably be fixed.

> I think distros might want to keep the above config options enabled:
> CONFIG_RELOCATABLE=y so that kexec can load the kdump kernel at a
> different address, and CONFIG_PHYSICAL_START set to a non-1MB location
> to extract better performance. (As we discussed on the mailing list
> some time back.)
>
> In principle this is a limitation on the boot-loader's part, but as we
> cannot fix the boot-loaders out there, we can probably try fixing it
> at the kernel level.
>
> What I have done here is that the decompressor code determines the
> final resting place of the kernel based on the boot loader type. So
> if I have been loaded by kexec, I am supposed to run from the loaded
> address; otherwise I am supposed to run from CONFIG_PHYSICAL_START, as
> I have been loaded at the 1MB address only because of a boot loader
> limitation and that's not the intention.
>
> A prototype patch is attached with the mail. I have assumed that I can
> assign boot loader type id 9 to kexec (Documentation/i386/boot.txt).
> I am also assuming that all boot loaders apart from kexec have the 1MB
> limitation. If not, it's trivial to include their boot loader ids as well.
>
> I have tested this patch and it works fine. What do you think about
> this approach?

I think there is some value in it. But I need to digest it.

I have a cold right now and am running pretty weak, so it is going to take me
a little bit to look at this.

I don't like taking action based upon bootloader type, as that assumes
all kinds of things. But having better rules for when we perform relocation
makes sense. There might be a way to detect that we came through setup.S.

I gave this some thought last time; when I worked through it, it didn't
quite work.

I guess the practical question is do people see a real performance benefit
when loading the kernel at 4MB?

Possibly the right solution is to do like I did on x86_64 and simply remove
CONFIG_PHYSICAL_START, and always place the kernel at 4MB, or something like
that.

The practical question is what to do to keep the complexity from spinning
out of control. Removing CONFIG_PHYSICAL_START would seriously help with
that.

Eric

2006-08-04 23:38:40

by Dave Jones

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Fri, Aug 04, 2006 at 05:14:37PM -0600, Eric W. Biederman wrote:

> I guess the practical question is do people see a real performance benefit
> when loading the kernel at 4MB?

Linus claimed lmbench saw some huge wins. Others showed that for eg,
a kernel compile took the same amount of time, so take from that what you will..

> Possibly the right solution is to do like I did on x86_64 and simply remove
> CONFIG_PHYSICAL_START, and always place the kernel at 4MB, or something like
> that.
>
> The practical question is what to do to keep the complexity from spinning
> out of control. Removing CONFIG_PHYSICAL_START would seriously help with
> that.

Given the two primary uses of that option right now are a) the aforementioned
perf win and b) building kexec kernels, I doubt anyone would miss it once
we go relocatable ;-)

Dave

--
http://www.codemonkey.org.uk

2006-08-04 23:47:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

Dave Jones wrote:
> On Fri, Aug 04, 2006 at 05:14:37PM -0600, Eric W. Biederman wrote:
>
> > I guess the practical question is do people see a real performance benefit
> > when loading the kernel at 4MB?
>
> Linus claimed lmbench saw some huge wins. Others showed that for eg,
> a kernel compile took the same amount of time, so take from that what you will..
>
> > Possibly the right solution is to do like I did on x86_64 and simply remove
> > CONFIG_PHYSICAL_START, and always place the kernel at 4MB, or something like
> > that.
> >
> > The practical question is what to do to keep the complexity from spinning
> > out of control. Removing CONFIG_PHYSICAL_START would seriously help with
> > that.
>
> Given the two primary uses of that option right now are a) the aforementioned
> perf win and b) building kexec kernels, I doubt anyone would miss it once
> we go relocatable ;-)
>

We DO want the performance gain with a conventional bootloader. The
perf win is about the location of the uncompressed kernel, not the
compressed kernel.

-hpa

2006-08-05 08:03:12

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

"H. Peter Anvin" <[email protected]> writes:

> Dave Jones wrote:
>> On Fri, Aug 04, 2006 at 05:14:37PM -0600, Eric W. Biederman wrote:
>> > I guess the practical question is do people see a real performance benefit
>> > when loading the kernel at 4MB?
>> Linus claimed lmbench saw some huge wins. Others showed that for eg,
>> a kernel compile took the same amount of time, so take from that what you
> will..

But Linus wasn't comparing the same version of the kernel, so it was
a bit of an unknown. Having someone reproduce those lmbench numbers on the
exact same kernel would be interesting.

>> > Possibly the right solution is to do like I did on x86_64 and simply remove
>> > CONFIG_PHYSICAL_START, and always place the kernel at 4MB, or something like
>> > that.
>> > > The practical question is what to do to keep the complexity from spinning
>> > out of control. Removing CONFIG_PHYSICAL_START would seriously help with
>> > that.
>> Given the two primary uses of that option right now are a) the aforementioned
>> perf win and b) building kexec kernels, I doubt anyone would miss it once
>> we go relocatable ;-)
>>
>
> We DO want the performance gain with a conventional bootloader. The perf win is
> about the location of the uncompressed kernel, not the compressed kernel.

Agreed.

We also need a way to boot a kernel on an old machine with limited memory.
Possibly we would only support this when PSE is not available, as old
machines lack it.

The complication that the decompressor must relocate the image to
support old bootloaders is challenging in the context of a truly relocatable
kernel, where we run at the address the bootloader put us at.

The other basic concern is how flexible we need to be with respect to
relocation.

Eric

2006-08-08 04:18:00

by Simon Horman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Fri, Aug 04, 2006 at 05:14:37PM -0600, Eric W. Biederman wrote:
> Vivek Goyal <[email protected]> writes:
>
> > On Tue, Aug 01, 2006 at 04:58:49AM -0600, Eric W. Biederman wrote:
> >>
> >> The problem:
> >>
> >> We can't always run the kernel at 1MB or 2MB, and so people who need
> >> different addresses must build multiple kernels. The bzImage format
> >> can't even represent loading a kernel at other than it's default address.
> >> With kexec on panic now starting to be used by distros having a kernel
> >> not running at the default load address is starting to become common.
> >>
> > Hi Eric,
> >
> > There seems to be a small anomaly in the current set of patches for i386.
> >
> > For example, if one compiles the kernel with CONFIG_RELOCATABLE=y
> > and CONFIG_PHYSICAL_START=0x400000 (4MB) and uses grub to load
> > the kernel, then the kernel runs from the 1MB location. I think the user
> > would expect it to run from the 4MB location.
>
> Agreed. That is non-intuitive, and should probably be fixed.

I also agree that it is non-intuitive. But I wonder if a cleaner
fix would be to remove CONFIG_PHYSICAL_START altogether. Isn't
it just a workaround for the kernel not being relocatable, or
are there uses for it that relocation can't replace?

--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/

2006-08-08 04:32:52

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

Horms wrote:
>
> I also agree that it is non-intuitive. But I wonder if a cleaner
> fix would be to remove CONFIG_PHYSICAL_START altogether. Isn't
> it just a workaround for the kernel not being relocatable, or
> are there uses for it that relocation can't replace?
>

Yes, booting with the 2^n existing bootloaders.

Relocation, as far as I've understood this patch, refers to loaded
address, not runtime address.

-hpa

2006-08-08 04:57:12

by Magnus Damm

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On 8/8/06, H. Peter Anvin <[email protected]> wrote:
> Horms wrote:
> >
> > I also agree that it is non-intuitive. But I wonder if a cleaner
> > fix would be to remove CONFIG_PHYSICAL_START altogether. Isn't
> > it just a workaround for the kernel not being relocatable, or
> > are there uses for it that relocation can't replace?
> >
>
> Yes, booting with the 2^n existing bootloaders.
>
> Relocation, as far as I've understood this patch, refers to loaded
> address, not runtime address.

I believe Eric's patch implements the following (correct me if I'm wrong):

vmlinux:
vmlinux is extended to contain relocation information. Absolute
symbols are used for non-relocatable symbols, and section-relative
symbols are used for relocatable symbols.

bzImage loader:
The bzImage loader code is no longer required to be loaded at a fixed
address. The bzImage file contains vmlinux relocation information and
the bzImage loader adjusts the relocations in vmlinux before executing
it.

So I would say that the runtime addresses of symbols in vmlinux are
changed by the bzImage loader. Or maybe I'm misunderstanding?
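
A rough sketch of that adjustment step, with names invented here rather
than taken from Eric's actual patch (his table format may differ, so
treat this purely as illustration):

	/* Apply i386-style 32-bit relocations: for each recorded site,
	 * add the delta between run-time and link-time addresses.
	 * site_table holds the link-time addresses of the words to
	 * patch; the table entries themselves have also moved by delta.
	 */
	static void apply_relocs(const unsigned int *site_table,
				 unsigned long nsites, unsigned long delta)
	{
		unsigned long i;

		for (i = 0; i < nsites; i++) {
			unsigned int *site =
				(unsigned int *)(site_table[i] + delta);
			*site += delta;
		}
	}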

/ magnus

2006-08-08 05:06:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

"H. Peter Anvin" <[email protected]> writes:

> Horms wrote:
>> I also agree that it is non-intuitive. But I wonder if a cleaner
>> fix would be to remove CONFIG_PHYSICAL_START altogether. Isn't
>> it just a workaround for the kernel not being relocatable, or
>> are there uses for it that relocation can't replace?
>>
>
> Yes, booting with the 2^n existing bootloaders.

Which is something we certainly have to be careful with.

> Relocation, as far as I've understood this patch, refers to loaded address, not
> runtime address.

Actually it refers to both.

Basically we detect if we were loaded by a clueless bootloader,
aka at 1MB. If so we just move to whatever address the code was built
to run at. Otherwise the code stays put.

Being loaded and run at a different address is what is needed for the
crash dump scenario, where we have to run from some reserved area
of physical memory that the original kernel did not run from.
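
In other words, the policy is roughly the following (a sketch; the
constant and the names are illustrative, not Eric's actual code):

	/* A load address of 1MB is taken to mean a legacy loader that
	 * could not place us anywhere else, so fall back to the
	 * compiled-in address; any other address is honored as-is.
	 */
	static unsigned long choose_run_address(unsigned long loaded_at,
						unsigned long compiled_at)
	{
		if (loaded_at == 0x100000)	/* clueless loader: 1MB */
			return compiled_at;
		return loaded_at;		/* relocatable: stay put */
	}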

Eric

2006-08-08 06:20:30

by Simon Horman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Mon, Aug 07, 2006 at 09:32:23PM -0700, H. Peter Anvin wrote:
> Horms wrote:
> >
> >I also agree that it is non-intuitive. But I wonder if a cleaner
> >fix would be to remove CONFIG_PHYSICAL_START altogether. Isn't
> >it just a workaround for the kernel not being relocatable, or
> >are there uses for it that relocation can't replace?
> >
>
> Yes, booting with the 2^n existing bootloaders.

Ok, I must be confused then. I thought CONFIG_PHYSICAL_START was
introduced in order to allow an alternative address to be provided for
kdump, and that previously it was hard-coded to some
architecture-specific value.

What I was really getting at is whether it needs to be configurable at
compile time or not. Obviously there needs to be some sane default
regardless.

--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/

2006-08-08 07:24:03

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

Horms <[email protected]> writes:

> On Mon, Aug 07, 2006 at 09:32:23PM -0700, H. Peter Anvin wrote:
>> Horms wrote:
>> >
>> >I also agree that it is non-intuitive. But I wonder if a cleaner
>> >fix would be to remove CONFIG_PHYSICAL_START altogether. Isn't
>> >it just a workaround for the kernel not being relocatable, or
>> >are there uses for it that relocation can't replace?
>> >
>>
>> Yes, booting with the 2^n existing bootloaders.
>
> Ok, I must be confused then. I thought CONFIG_PHYSICAL_START was
> introduced in order to allow an alternative address to be provided for
> kdump, and that previously it was hard-coded to some
> architecture-specific value.
>
> What I was really getting at is whether it needs to be configurable at
> compile time or not. Obviously there needs to be some sane default
> regardless.

CONFIG_PHYSICAL_START has had 2 uses.
1) To allow a kernel to run at a completely different address for use
with kexec on panic.
2) To allow the kernel to be better aligned for better performance.

For maintenance reasons I propose we introduce CONFIG_PHYSICAL_ALIGN.
Which will round our load address up to the nearest aligned address
and run the kernel there. That is roughly what I am doing on x86_64
at this point.

s/CONFIG_PHYSICAL_START/CONFIG_PHYSICAL_ALIGN/ gives me well defined
behavior and allows the alignment optimization without getting into
weird semantics.

Before CONFIG_PHYSICAL_START we _always_ ran the arch/i386 kernel
where it was loaded and I assumed we always would. Since people have
realized better aligned kernels can run better this assumption became
false. Going to CONFIG_PHYSICAL_ALIGN allows us to return to the
simple assumption of always running the kernel where it is loaded
modulo a little extra alignment.
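
The proposed rounding is the standard power-of-two align-up. In C (a
sketch, assuming CONFIG_PHYSICAL_ALIGN is a power of two, which the
mask trick requires):

	/* Round addr up to the next multiple of align.
	 * align must be a power of two.
	 */
	static unsigned long align_up(unsigned long addr, unsigned long align)
	{
		return (addr + align - 1) & ~(align - 1);
	}

This is the same addl/andl pair that appears in the head.S hunk of
Vivek's follow-up patch later in the thread.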

Eric

2006-08-08 07:59:09

by Simon Horman

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Tue, Aug 08, 2006 at 01:23:15AM -0600, Eric W. Biederman wrote:
> Horms <[email protected]> writes:
>
> > On Mon, Aug 07, 2006 at 09:32:23PM -0700, H. Peter Anvin wrote:
> >> Horms wrote:
> >> >
> >> >I also agree that it is non-intuitive. But I wonder if a cleaner
> >> >fix would be to remove CONFIG_PHYSICAL_START altogether. Isn't
> >> >it just a workaround for the kernel not being relocatable, or
> >> >are there uses for it that relocation can't replace?
> >> >
> >>
> >> Yes, booting with the 2^n existing bootloaders.
> >
> > Ok, I must be confused then. I though CONFIG_PHYSICAL_START was
> > introduced in order to allow an alternative address to be provided for
> > kdump, and that previously it was hard-coded to some
> > architecture-specific value.
> >
> > What I was really getting as is if it needs to be configurable at
> > compile time or not. Obviously there needs to be some sane default
> > regardless.
>
> CONFIG_PHYSICAL_START has had 2 uses.
> 1) To allow a kernel to run at a completely different address for use
> with kexec on panic.
> 2) To allow the kernel to be better aligned for better performance.

Thanks for making that clear

> For maintenance reasons I propose we introduce CONFIG_PHYSICAL_ALIGN.
> Which will round our load address up to the nearest aligned address
> and run the kernel there. That is roughly what I am doing on x86_64
> at this point.
>
> s/CONFIG_PHYSICAL_START/CONFIG_PHYSICAL_ALIGN/ gives me well defined
> behavior and allows the alignment optimization without getting into
> weird semantics.
>
> Before CONFIG_PHYSICAL_START we _always_ ran the arch/i386 kernel
> where it was loaded and I assumed we always would. Since people have
> realized better aligned kernels can run better this assumption became
> false. Going to CONFIG_PHYSICAL_ALIGN allows us to return to the
> simple assumption of always running the kernel where it is loaded
> modulo a little extra alignment.

That sounds reasonable to me. Though it is a little less flexible,
do you think that could be a problem? Perhaps we could have both,
though that would probably be quite confusing.

--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/

2006-08-09 14:56:43

by Daniel Hazelton

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Tuesday 08 August 2006 03:58, Horms wrote:
> On Tue, Aug 08, 2006 at 01:23:15AM -0600, Eric W. Biederman wrote:
> > Horms <[email protected]> writes:
<snip>
> > For maintenance reasons I propose we introduce CONFIG_PHYSICAL_ALIGN.
> > Which will round our load address up to the nearest aligned address
> > and run the kernel there. That is roughly what I am doing on x86_64
> > at this point.
> >
> > s/CONFIG_PHYSICAL_START/CONFIG_PHYSICAL_ALIGN/ gives me well defined
> > behavior and allows the alignment optimization without getting into
> > weird semantics.
> >
> > Before CONFIG_PHYSICAL_START we _always_ ran the arch/i386 kernel
> > where it was loaded and I assumed we always would. Since people have
> > realized better aligned kernels can run better this assumption became
> > false. Going to CONFIG_PHYSICAL_ALIGN allows us to return to the
> > simple assumption of always running the kernel where it is loaded
> > modulo a little extra alignment.
>
> That sounds reasonable to me. Though it is a little less flexible,
> do you think that could be a problem? Perhaps we could have both,
> though that would probably be quite confusing.

More than reasonable. With this change it seems that the kernel would still
work with older bootloaders, function properly under kexec, and might actually
lead to a way to recover a system from a crash without a reboot
by allowing the kexec'd kernel to reset the system.

Of course the last is only a wish... it will never happen because of the
complexity involved. It would require a lot of work (I'd do this, but I'm
currently arguing with the kernel over my attempts at building a functional
DRM-backed console) in having to switch back to real mode to make the proper
BIOS calls to reset the busses et al. before switching *back* to kernel mode
to run the standard startup.

Still, letting a kernel run where it's loaded, plus some rounding to get it
properly aligned in memory, solves several problems. It removes the need for
CONFIG_PHYSICAL_START and the code involved in handling that. The kexec code
that reserves memory for the new kernel image can be tweaked to always
allocate the memory *aligned*...

Anyway, I'd better get back to the DRMCon code...

DRH

2006-08-11 13:09:44

by Rachita Kothiyal

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] ELF Relocatable x86 and x86_64 bzImages

On Fri, Aug 04, 2006 at 06:56:11PM -0400, Vivek Goyal wrote:
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
>
> arch/i386/boot/compressed/head.S | 32 ++++++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff -puN arch/i386/boot/compressed/head.S~debug1-patch arch/i386/boot/compressed/head.S
> --- linux-2.6.18-rc3-1M/arch/i386/boot/compressed/head.S~debug1-patch 2006-08-04 18:03:02.000000000 -0400
> +++ linux-2.6.18-rc3-1M-root/arch/i386/boot/compressed/head.S 2006-08-04 18:18:26.000000000 -0400
> @@ -60,13 +60,32 @@ startup_32:
> * a relocatable kernel this is the delta to our load address otherwise
> * this is the delta to CONFIG_PHYSICAL start.
> */
> +
> #ifdef CONFIG_RELOCTABLE
^^^^^^^^^
Vivek, did you mean CONFIG_RELOCATABLE ?


Thanks
Rachita

> + /* If loaded by a non-kexec boot loader, then we will be loaded
> + * at the fixed 1MB address. But probably the intention is to run
> + * from the address for which the kernel has been compiled, which
> + * can be non 1MB.
> + */
> + xorl %eax, %eax
> + movb 0x210(%esi), %al
> +
> + /* check boot loader type. Kexec bootloader id 9, version any */
> + shrl $4, %eax
> + subl $0x9, %eax
> + jnz 1f

2006-08-11 13:37:28

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [RFC] ELF Relocatable x86 and x86_64 bzImages

On Fri, Aug 11, 2006 at 06:41:28PM +0530, Rachita Kothiyal wrote:
> On Fri, Aug 04, 2006 at 06:56:11PM -0400, Vivek Goyal wrote:
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> >
> > arch/i386/boot/compressed/head.S | 32 ++++++++++++++++++++++++++++++--
> > 1 file changed, 30 insertions(+), 2 deletions(-)
> >
> > diff -puN arch/i386/boot/compressed/head.S~debug1-patch arch/i386/boot/compressed/head.S
> > --- linux-2.6.18-rc3-1M/arch/i386/boot/compressed/head.S~debug1-patch 2006-08-04 18:03:02.000000000 -0400
> > +++ linux-2.6.18-rc3-1M-root/arch/i386/boot/compressed/head.S 2006-08-04 18:18:26.000000000 -0400
> > @@ -60,13 +60,32 @@ startup_32:
> > * a relocatable kernel this is the delta to our load address otherwise
> > * this is the delta to CONFIG_PHYSICAL start.
> > */
> > +
> > #ifdef CONFIG_RELOCTABLE
> ^^^^^^^^^
> Vivek, did you mean CONFIG_RELOCATABLE ?
>
Hi Rachita,

Thanks for catching this. Yes I meant CONFIG_RELOCATABLE.

Thanks
Vivek

2006-08-17 18:44:37

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC] ELF Relocatable x86 and x86_64 bzImages

On Tue, Aug 08, 2006 at 01:23:15AM -0600, Eric W. Biederman wrote:
> Horms <[email protected]> writes:
>
> > On Mon, Aug 07, 2006 at 09:32:23PM -0700, H. Peter Anvin wrote:
> >> Horms wrote:
> >> >
> >> >I also agree that it is non-intuitive. But I wonder if a cleaner
> >> >fix would be to remove CONFIG_PHYSICAL_START altogether. Isn't
> >> >it just a workaround for the kernel not being relocatable, or
> >> >are there uses for it that relocation can't replace?
> >> >
> >>
> >> Yes, booting with the 2^n existing bootloaders.
> >
> > Ok, I must be confused then. I thought CONFIG_PHYSICAL_START was
> > introduced in order to allow an alternative address to be provided for
> > kdump, and that previously it was hard-coded to some
> > architecture-specific value.
> >
> > What I was really getting at is whether it needs to be configurable at
> > compile time or not. Obviously there needs to be some sane default
> > regardless.
>
> CONFIG_PHYSICAL_START has had 2 uses.
> 1) To allow a kernel to run at a completely different address for use
> with kexec on panic.
> 2) To allow the kernel to be better aligned for better performance.
>
> For maintenance reasons I propose we introduce CONFIG_PHYSICAL_ALIGN.
> Which will round our load address up to the nearest aligned address
> and run the kernel there. That is roughly what I am doing on x86_64
> at this point.
>
> s/CONFIG_PHYSICAL_START/CONFIG_PHYSICAL_ALIGN/ gives me well defined
> behavior and allows the alignment optimization without getting into
> weird semantics.
>
> Before CONFIG_PHYSICAL_START we _always_ ran the arch/i386 kernel
> where it was loaded and I assumed we always would. Since people have
> realized better aligned kernels can run better this assumption became
> false. Going to CONFIG_PHYSICAL_ALIGN allows us to return to the
> simple assumption of always running the kernel where it is loaded
> modulo a little extra alignment.
>

Hi Eric,

I have tried implementing your idea of replacing CONFIG_PHYSICAL_START
with CONFIG_PHYSICAL_ALIGN. Please find attached the patch.

It applies on top of your relocatable kernel patch series.

I guess there should be a way for the running kernel to tell user space
the offset between the address at which the kernel is running and the
address for which it was compiled. This info can be useful for debuggers
in case they happen to debug the core of a kernel which was not running
at its compiled address.
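
The quantity in question is a single subtraction; a hypothetical sketch
(runtime_start stands in for however the kernel would record its actual
load address, LOAD_PHYSICAL_ADDR is as defined in the patch below):

	/* Hypothetical: the offset a debugger would need to translate
	 * symbols of a kernel not running at its compiled-for address.
	 */
	static unsigned long phys_reloc_offset(unsigned long runtime_start)
	{
		return runtime_start - LOAD_PHYSICAL_ADDR;
	}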

Thanks
Vivek




o Get rid of CONFIG_PHYSICAL_START and implement CONFIG_PHYSICAL_ALIGN

Signed-off-by: Vivek Goyal <[email protected]>
---

arch/i386/Kconfig | 34 ++++++++++++++++++----------------
arch/i386/boot/bootsect.S | 8 ++++----
arch/i386/boot/compressed/head.S | 28 ++++++++++++++++------------
arch/i386/boot/compressed/misc.c | 7 ++++---
arch/i386/boot/compressed/vmlinux.lds | 3 +++
arch/i386/kernel/vmlinux.lds.S | 5 +++--
include/asm-i386/boot.h | 6 +++++-
7 files changed, 53 insertions(+), 38 deletions(-)

diff -puN arch/i386/Kconfig~i386-implement-config-physical-align-option arch/i386/Kconfig
--- linux-2.6.18-rc3-1M/arch/i386/Kconfig~i386-implement-config-physical-align-option 2006-08-17 10:56:46.000000000 -0400
+++ linux-2.6.18-rc3-1M-root/arch/i386/Kconfig 2006-08-17 11:28:40.000000000 -0400
@@ -773,24 +773,26 @@ config RELOCATABLE
must live at a different physical address than the primary
kernel.

-config PHYSICAL_START
- hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
-
- default "0x1000000" if CRASH_DUMP
+config PHYSICAL_ALIGN
+ hex "Alignment value to which kernel should be aligned" if (EMBEDDED)
default "0x100000"
- range 0x100000 0x37c00000
+ range 0x2000 0x400000
help
- This gives the physical address where the kernel is loaded. Normally
- for regular kernels this value is 0x100000 (1MB). But in the case
- of kexec on panic the fail safe kernel needs to run at a different
- address than the panic-ed kernel. This option is used to set the load
- address for kernels used to capture crash dump on being kexec'ed
- after panic. The default value for crash dump kernels is
- 0x1000000 (16MB). This can also be set based on the "X" value as
- specified in the "crashkernel=YM@XM" command line boot parameter
- passed to the panic-ed kernel. Typically this parameter is set as
- crashkernel=64M@16M. Please take a look at
- Documentation/kdump/kdump.txt for more details about crash dumps.
+ This value puts alignment restrictions on the physical address
+ where the kernel is loaded and run from. The kernel is compiled
+ for an address which meets the above alignment restriction.
+
+ If the bootloader loads the kernel at a non-aligned address and
+ CONFIG_RELOCATABLE is set, the kernel will move itself to the
+ nearest address aligned to the above value and run from there.
+
+ If the bootloader loads the kernel at a non-aligned address and
+ CONFIG_RELOCATABLE is not set, the kernel will ignore the run time
+ load address, decompress itself to the address it has been
+ compiled for, and run from there. The address for which the kernel
+ is compiled already meets the above alignment restriction. Hence
+ the end result is that the kernel runs from a physical address
+ meeting the above alignment restriction.

Don't change this unless you know what you are doing.

diff -puN include/asm-i386/boot.h~i386-implement-config-physical-align-option include/asm-i386/boot.h
--- linux-2.6.18-rc3-1M/include/asm-i386/boot.h~i386-implement-config-physical-align-option 2006-08-17 10:56:46.000000000 -0400
+++ linux-2.6.18-rc3-1M-root/include/asm-i386/boot.h 2006-08-17 10:56:46.000000000 -0400
@@ -12,4 +12,8 @@
#define EXTENDED_VGA 0xfffe /* 80x50 mode */
#define ASK_VGA 0xfffd /* ask for it at bootup */

-#endif
+/* Physical address where the kernel should be loaded. */
+#define LOAD_PHYSICAL_ADDR ((0x100000 + CONFIG_PHYSICAL_ALIGN - 1) \
+ & ~(CONFIG_PHYSICAL_ALIGN - 1))
+
+#endif /* _LINUX_BOOT_H */
diff -puN arch/i386/kernel/vmlinux.lds.S~i386-implement-config-physical-align-option arch/i386/kernel/vmlinux.lds.S
--- linux-2.6.18-rc3-1M/arch/i386/kernel/vmlinux.lds.S~i386-implement-config-physical-align-option 2006-08-17 10:56:46.000000000 -0400
+++ linux-2.6.18-rc3-1M-root/arch/i386/kernel/vmlinux.lds.S 2006-08-17 10:56:46.000000000 -0400
@@ -2,13 +2,14 @@
* Written by Martin Mares <[email protected]>;
*/

-#define LOAD_OFFSET __PAGE_OFFSET
+#define LOAD_OFFSET __PAGE_OFFSET

#include <linux/config.h>
#include <asm-generic/vmlinux.lds.h>
#include <asm/thread_info.h>
#include <asm/page.h>
#include <asm/cache.h>
+#include <asm/boot.h>

OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
@@ -16,7 +17,7 @@ ENTRY(phys_startup_32)
jiffies = jiffies_64;
SECTIONS
{
- . = LOAD_OFFSET + CONFIG_PHYSICAL_START;
+ . = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
phys_startup_32 = startup_32 - LOAD_OFFSET;
/* read-only */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
diff -puN arch/i386/boot/bootsect.S~i386-implement-config-physical-align-option arch/i386/boot/bootsect.S
--- linux-2.6.18-rc3-1M/arch/i386/boot/bootsect.S~i386-implement-config-physical-align-option 2006-08-17 10:56:46.000000000 -0400
+++ linux-2.6.18-rc3-1M-root/arch/i386/boot/bootsect.S 2006-08-17 12:37:19.000000000 -0400
@@ -69,7 +69,7 @@ ehdr:
#endif
.word EM_386 # e_machine
.int 1 # e_version
- .int CONFIG_PHYSICAL_START # e_entry
+ .int LOAD_PHYSICAL_ADDR # e_entry
.int phdr - _start # e_phoff
.int 0 # e_shoff
.int 0 # e_flags
@@ -90,12 +90,12 @@ normalize:
phdr:
.int PT_LOAD # p_type
.int (SETUPSECTS+1)*512 # p_offset
- .int __PAGE_OFFSET + CONFIG_PHYSICAL_START # p_vaddr
- .int CONFIG_PHYSICAL_START # p_paddr
+ .int LOAD_PHYSICAL_ADDR + __PAGE_OFFSET # p_vaddr
+ .int LOAD_PHYSICAL_ADDR # p_paddr
.int SYSSIZE*16 # p_filesz
.int 0 # p_memsz
.int PF_R | PF_W | PF_X # p_flags
- .int 4*1024*1024 # p_align
+ .int CONFIG_PHYSICAL_ALIGN # p_align
e_phdr1:

.int PT_NOTE # p_type
diff -puN arch/i386/boot/compressed/vmlinux.lds~i386-implement-config-physical-align-option arch/i386/boot/compressed/vmlinux.lds
--- linux-2.6.18-rc3-1M/arch/i386/boot/compressed/vmlinux.lds~i386-implement-config-physical-align-option 2006-08-17 10:56:46.000000000 -0400
+++ linux-2.6.18-rc3-1M-root/arch/i386/boot/compressed/vmlinux.lds 2006-08-17 10:56:46.000000000 -0400
@@ -3,6 +3,9 @@ OUTPUT_ARCH(i386)
ENTRY(startup_32)
SECTIONS
{
+ /* Be careful: parts of head.S assume startup_32 is at
+ * address 0.
+ */
. = 0 ;
.text.head : {
_head = . ;
diff -puN arch/i386/boot/compressed/head.S~i386-implement-config-physical-align-option arch/i386/boot/compressed/head.S
--- linux-2.6.18-rc3-1M/arch/i386/boot/compressed/head.S~i386-implement-config-physical-align-option 2006-08-17 10:56:46.000000000 -0400
+++ linux-2.6.18-rc3-1M-root/arch/i386/boot/compressed/head.S 2006-08-17 11:12:51.000000000 -0400
@@ -27,6 +27,7 @@
#include <linux/linkage.h>
#include <asm/segment.h>
#include <asm/page.h>
+#include <asm/boot.h>

.section ".text.head"
.globl startup_32
@@ -53,17 +54,19 @@ startup_32:
1: popl %ebp
subl $1b, %ebp

-/* Compute the delta between where we were compiled to run at
- * and where the code will actually run at.
+
+/* %ebp contains the address we are loaded at by the boot loader and %ebx
+ * contains the address where we should move the kernel image temporarily
+ * for safe in-place decompression.
*/
- /* Start with the delta to where the kernel will run at. If we are
- * a relocatable kernel this is the delta to our load address otherwise
- * this is the delta to CONFIG_PHYSICAL start.
- */
+
#ifdef CONFIG_RELOCATABLE
- movl %ebp, %ebx
+ movl %ebp, %ebx
+ addl $(CONFIG_PHYSICAL_ALIGN - 1), %ebx
+ andl $(~(CONFIG_PHYSICAL_ALIGN - 1)), %ebx
+
#else
- movl $(CONFIG_PHYSICAL_START - startup_32), %ebx
+ movl $LOAD_PHYSICAL_ADDR, %ebx
#endif

/* Replace the compressed data size with the uncompressed size */
@@ -95,9 +98,10 @@ startup_32:
/* Compute the kernel start address.
*/
#ifdef CONFIG_RELOCATABLE
- leal startup_32(%ebp), %ebp
+ addl $(CONFIG_PHYSICAL_ALIGN - 1), %ebp
+ andl $(~(CONFIG_PHYSICAL_ALIGN - 1)), %ebp
#else
- movl $CONFIG_PHYSICAL_START, %ebp
+ movl $LOAD_PHYSICAL_ADDR, %ebp
#endif

/*
@@ -151,8 +155,8 @@ relocated:
* and where it was actually loaded.
*/
movl %ebp, %ebx
- subl $CONFIG_PHYSICAL_START, %ebx
-
+ subl $LOAD_PHYSICAL_ADDR, %ebx
+ jz 2f /* Nothing to be done if loaded at compiled addr. */
/*
* Process relocations.
*/
diff -puN arch/i386/boot/compressed/misc.c~i386-implement-config-physical-align-option arch/i386/boot/compressed/misc.c
--- linux-2.6.18-rc3-1M/arch/i386/boot/compressed/misc.c~i386-implement-config-physical-align-option 2006-08-17 10:56:46.000000000 -0400
+++ linux-2.6.18-rc3-1M-root/arch/i386/boot/compressed/misc.c 2006-08-17 11:19:05.000000000 -0400
@@ -18,6 +18,7 @@
#include <asm/io.h>
#include <asm/setup.h>
#include <asm/page.h>
+#include <asm/boot.h>

/* WARNING!!
* This code is compiled with -fPIC and it is relocated dynamically
@@ -585,12 +586,12 @@ asmlinkage void decompress_kernel(void *
insize = input_len;
inptr = 0;

- if (((u32)output - CONFIG_PHYSICAL_START) & 0x3fffff)
- error("Destination address not 4M aligned");
+ if ((u32)output & (CONFIG_PHYSICAL_ALIGN -1))
+ error("Destination address not CONFIG_PHYSICAL_ALIGN aligned");
if (end > ((-__PAGE_OFFSET-(512 <<20)-1) & 0x7fffffff))
error("Destination address too large");
#ifndef CONFIG_RELOCATABLE
- if ((u32)output != CONFIG_PHYSICAL_START)
+ if ((u32)output != LOAD_PHYSICAL_ADDR)
error("Wrong destination address");
#endif

_

2006-11-05 06:02:17

by Lu, Yinghai

[permalink] [raw]
Subject: Re: [PATCH 32/33] x86_64: Relocatable kernel support

On 8/1/06, Eric W. Biederman <[email protected]> wrote:
> I guess I could take this in some slightly smaller steps.
> But this does wind up with decompressor being 64bit code.

Sorry to bring up this old mail.

Except for reusing the decompressor in 32bit, is there any reason that you
removed startup_32 for vmlinux but kept startup_32 for bzImage?

That will make vmlinux usable with a 64bit boot loader only.

YH

2006-11-05 06:52:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 32/33] x86_64: Relocatable kernel support

"Yinghai Lu" <[email protected]> writes:

> On 8/1/06, Eric W. Biederman <[email protected]> wrote:
>> I guess I could take this in some slightly smaller steps.
>> But this does wind up with decompressor being 64bit code.
>
> Sorry to bring up this old mail.
>
> Except for reusing the decompressor in 32bit, is there any reason that you
> removed startup_32 for vmlinux but kept startup_32 for bzImage?
>
> That will make vmlinux usable with a 64bit boot loader only.

If you are booting a vmlinux you read the ELF header. The ELF header
only describes the native mode, so a 32bit entry point doesn't make much sense.

bzImage is something else entirely.

Eric

2006-11-05 07:16:00

by Lu, Yinghai

[permalink] [raw]
Subject: Re: [PATCH 32/33] x86_64: Relocatable kernel support

On 11/4/06, Eric W. Biederman <[email protected]> wrote:
> If you are booting a vmlinux you read the ELF header. The ELF header
> only describes the native mode, so a 32bit entry point doesn't make much sense.
>
Yes, but if you keep startup_32 it will be at 0x200000,
startup_64 will be at 0x200100, and the entry point in the ELF header is
0x200100.
By removing startup_32, startup_64 will be at 0x200000, and the entry point
in the ehdr is 0x200000.
So I assume the entry point in the ELF header could be used by a 64bit
bootloader and phdr[1].p_addr could be used by a 32bit boot loader.

YH


ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x200100
Start of program headers: 64 (bytes into file)
Start of section headers: 7192496 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 5
Size of section headers: 64 (bytes)
Number of section headers: 42
Section header string table index: 39

Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000100000 0xffffffff80200000 0x0000000000200000
0x000000000032f508 0x000000000032f508 R E 100000
LOAD 0x0000000000430000 0xffffffff80530000 0x0000000000530000
0x0000000000148ec8 0x0000000000148ec8 RWE 100000
LOAD 0x0000000000600000 0xffffffffff600000 0x0000000000679000
0x0000000000000c08 0x0000000000000c08 RWE 100000
LOAD 0x000000000067a000 0xffffffff8067a000 0x000000000067a000
0x000000000005dd68 0x00000000000e91c8 RWE 100000
NOTE 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 R 8