2006-03-13 18:05:00

by Zachary Amsden

Subject: [RFC, PATCH 7/24] i386 Vmi memory hole

Create a configurable hole in the linear address space at the top
of memory. A more advanced interface is needed to negotiate how
much space the hypervisor is allowed to steal, but in the end, it
seems most likely that a fixed constant size will be chosen for
the compiled kernel, potentially propagated to an information
page used by paravirtual initialization to determine interface
compatibility.

Signed-off-by: Zachary Amsden <[email protected]>

Index: linux-2.6.16-rc3/arch/i386/Kconfig
===================================================================
--- linux-2.6.16-rc3.orig/arch/i386/Kconfig 2006-02-22 16:09:04.000000000 -0800
+++ linux-2.6.16-rc3/arch/i386/Kconfig 2006-02-22 16:33:27.000000000 -0800
@@ -201,6 +201,15 @@ config VMI_DEBUG

endmenu

+config MEMORY_HOLE
+ int "Create hole at top of memory (0-256 MB)"
+ range 0 256
+ default "64" if X86_VMI
+ default "0" if !X86_VMI
+ help
+ Useful for creating a hole in the top of memory when running
+ inside of a virtual machine monitor.
+
config ACPI_SRAT
bool
default y
Index: linux-2.6.16-rc3/include/asm-i386/fixmap.h
===================================================================
--- linux-2.6.16-rc3.orig/include/asm-i386/fixmap.h 2006-02-22 15:48:23.000000000 -0800
+++ linux-2.6.16-rc3/include/asm-i386/fixmap.h 2006-02-22 16:33:27.000000000 -0800
@@ -20,7 +20,7 @@
* Leave one empty page between vmalloc'ed areas and
* the start of the fixmap.
*/
-#define __FIXADDR_TOP 0xfffff000
+#define __FIXADDR_TOP (0xfffff000-(CONFIG_MEMORY_HOLE << 20))

#ifndef __ASSEMBLY__
#include <linux/kernel.h>
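
As a rough illustration of the arithmetic in this patch (userspace C, not kernel code), the sketch below computes where the fixmap top would land for a given hole size. The helper name `fixaddr_top_for_hole` is hypothetical; the constant 0xfffff000, the 0-256 MB range, and the `<< 20` megabyte-to-byte shift are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Default top of the i386 fixmap, as in the patch. */
#define DEFAULT_FIXADDR_TOP 0xfffff000u

/* Hypothetical helper: fixmap top after reserving hole_mb megabytes. */
static uint32_t fixaddr_top_for_hole(uint32_t hole_mb)
{
    assert(hole_mb <= 256);  /* mirrors the Kconfig "range 0 256" */
    return DEFAULT_FIXADDR_TOP - (hole_mb << 20);
}
```

With the VMI default of 64 MB this gives 0xfbfff000, and with a hole of 0 the fixmap top is unchanged.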


2006-03-14 06:36:36

by Chris Wright

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

* Zachary Amsden ([email protected]) wrote:
> Create a configurable hole in the linear address space at the top
> of memory. A more advanced interface is needed to negotiate how
> much space the hypervisor is allowed to steal, but in the end, it
> seems most likely that a fixed constant size will be chosen for
> the compiled kernel, potentially propagated to an information
> page used by paravirtual initialization to determine interface
> compatibility.
>
> Signed-off-by: Zachary Amsden <[email protected]>
>
> Index: linux-2.6.16-rc3/arch/i386/Kconfig
> ===================================================================
> --- linux-2.6.16-rc3.orig/arch/i386/Kconfig 2006-02-22 16:09:04.000000000 -0800
> +++ linux-2.6.16-rc3/arch/i386/Kconfig 2006-02-22 16:33:27.000000000 -0800
> @@ -201,6 +201,15 @@ config VMI_DEBUG
>
> endmenu
>
> +config MEMORY_HOLE
> + int "Create hole at top of memory (0-256 MB)"
> + range 0 256
> + default "64" if X86_VMI
> + default "0" if !X86_VMI

Deja-vu ;-) And still works in context of Xen, but we've just let the
subarch define the __FIXADDR_TOP. Having it be dynamic could be
interesting.

2006-03-14 07:16:53

by Zachary Amsden

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

Allow creation of a compile-time hole at the top of linear address space.

Extended to allow a dynamic hole in linear address space, 7/2005. This
required some serious hacking to get everything perfect, but the end result
appears to function quite nicely. Everyone can now share the appreciation
of pseudo-undocumented ELF OS fields, which means core dumps, debuggers
and even broken or obsolete linkers may continue to work.

Signed-off-by: Zachary Amsden <[email protected]>
Index: linux-2.6.13/arch/i386/Kconfig
===================================================================
--- linux-2.6.13.orig/arch/i386/Kconfig 2005-08-04 14:14:24.000000000 -0700
+++ linux-2.6.13/arch/i386/Kconfig 2005-08-05 15:28:42.000000000 -0700
@@ -127,6 +127,20 @@

endchoice

+config RELOCATABLE_FIXMAP
+ bool "Allow the fixmap to be placed dynamically at runtime"
+ depends on EXPERIMENTAL
+ help
+ Crazy hackers only.
+
+config MEMORY_HOLE
+ int "Create hole at top of memory (0-512 MB)"
+ range 0 512
+ default "0"
+ help
+ Useful for creating a hole in the top of memory when running
+ inside of a virtual machine monitor.
+
config ACPI_SRAT
bool
default y
Index: linux-2.6.13/arch/i386/kernel/sysenter.c
===================================================================
--- linux-2.6.13.orig/arch/i386/kernel/sysenter.c 2005-08-02 17:04:12.000000000 -0700
+++ linux-2.6.13/arch/i386/kernel/sysenter.c 2005-08-05 15:47:53.000000000 -0700
@@ -46,22 +46,90 @@
extern const char vsyscall_int80_start, vsyscall_int80_end;
extern const char vsyscall_sysenter_start, vsyscall_sysenter_end;

+#ifdef CONFIG_RELOCATABLE_FIXMAP
+extern const char SYSENTER_RETURN;
+const char *SYSENTER_RETURN_ADDR;
+
+static void fixup_vsyscall_elf(char *page)
+{
+ Elf32_Ehdr *hdr;
+ Elf32_Shdr *sechdrs;
+ Elf32_Phdr *phdr;
+ char *secstrings;
+ int i, j, n;
+
+ hdr = (Elf32_Ehdr *)page;
+
+ /* Sanity checks against insmoding binaries or wrong arch,
+ weird elf version */
+ if (memcmp(hdr->e_ident, ELFMAG, 4) != 0 ||
+ !elf_check_arch(hdr) ||
+ hdr->e_type != ET_DYN)
+ panic("Bogus ELF in vsyscall DSO\n");
+
+ hdr->e_entry += VSYSCALL_RELOCATION;
+
+ sechdrs = (void *)hdr + hdr->e_shoff;
+ secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
+
+ for (i = 1; i < hdr->e_shnum; i++) {
+ if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+ continue;
+
+ sechdrs[i].sh_addr += VSYSCALL_RELOCATION;
+ if (strcmp(secstrings+sechdrs[i].sh_name, ".dynsym") == 0) {
+ Elf32_Sym *sym = (void *)hdr + sechdrs[i].sh_offset;
+ n = sechdrs[i].sh_size / sizeof(*sym);
+ for (j = 1; j < n; j++) {
+ int ndx = sym[j].st_shndx;
+ if (ndx == SHN_UNDEF || ndx == SHN_ABS)
+ continue;
+ sym[j].st_value += VSYSCALL_RELOCATION;
+ }
+ } else if (strcmp(secstrings+sechdrs[i].sh_name, ".dynamic") == 0) {
+ Elf32_Dyn *dyn = (void *)hdr + sechdrs[i].sh_offset;
+ int tag;
+ while ((tag = (++dyn)->d_tag) != DT_NULL) {
+ if (tag == DT_PLTGOT || tag == DT_HASH ||
+ tag == DT_STRTAB || tag == DT_SYMTAB ||
+ tag == DT_RELA || tag == DT_INIT ||
+ tag == DT_FINI || tag == DT_REL ||
+ tag == DT_JMPREL || tag == DT_VERSYM ||
+ tag == DT_VERDEF || tag == DT_VERNEED)
+ dyn->d_un.d_val += VSYSCALL_RELOCATION;
+ }
+ } else if (strcmp(secstrings+sechdrs[i].sh_name, ".useless") == 0) {
+ uint32_t *got = (void *)hdr + sechdrs[i].sh_offset;
+ *got += VSYSCALL_RELOCATION;
+ }
+ }
+ phdr = (void *)hdr + hdr->e_phoff;
+ for (i = 0; i < hdr->e_phnum; i++) {
+ phdr[i].p_vaddr += VSYSCALL_RELOCATION;
+ phdr[i].p_paddr += VSYSCALL_RELOCATION;
+ }
+ SYSENTER_RETURN_ADDR = (char *)&SYSENTER_RETURN + VSYSCALL_RELOCATION;
+}
+#endif
+
int __init sysenter_setup(void)
{
void *page = (void *)get_zeroed_page(GFP_ATOMIC);

- __set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY_EXEC);
-
- if (!boot_cpu_has(X86_FEATURE_SEP)) {
+ if (!boot_cpu_has(X86_FEATURE_SEP))
memcpy(page,
&vsyscall_int80_start,
&vsyscall_int80_end - &vsyscall_int80_start);
- return 0;
- }
+ else
+ memcpy(page,
+ &vsyscall_sysenter_start,
+ &vsyscall_sysenter_end - &vsyscall_sysenter_start);

- memcpy(page,
- &vsyscall_sysenter_start,
- &vsyscall_sysenter_end - &vsyscall_sysenter_start);
+#ifdef CONFIG_RELOCATABLE_FIXMAP
+ fixup_vsyscall_elf((char *)page);
+#endif
+
+ __set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY_EXEC);

return 0;
}
Index: linux-2.6.13/arch/i386/kernel/asm-offsets.c
===================================================================
--- linux-2.6.13.orig/arch/i386/kernel/asm-offsets.c 2005-08-04 14:28:35.000000000 -0700
+++ linux-2.6.13/arch/i386/kernel/asm-offsets.c 2005-08-05 15:11:45.000000000 -0700
@@ -68,5 +68,9 @@
sizeof(struct tss_struct));

DEFINE(PAGE_SIZE_asm, PAGE_SIZE);
+#ifdef CONFIG_RELOCATABLE_FIXMAP
+ DEFINE(VSYSCALL_BASE, 0);
+#else
DEFINE(VSYSCALL_BASE, __fix_to_virt(FIX_VSYSCALL));
+#endif
}
Index: linux-2.6.13/arch/i386/kernel/signal.c
===================================================================
--- linux-2.6.13.orig/arch/i386/kernel/signal.c 2005-08-03 23:36:46.000000000 -0700
+++ linux-2.6.13/arch/i386/kernel/signal.c 2005-08-05 15:11:33.000000000 -0700
@@ -345,6 +345,8 @@
See vsyscall-sigreturn.S. */
extern void __user __kernel_sigreturn;
extern void __user __kernel_rt_sigreturn;
+#define kernel_sigreturn (VSYSCALL_RELOCATION + (void __user *)&__kernel_sigreturn)
+#define kernel_rt_sigreturn (VSYSCALL_RELOCATION + (void __user *)&__kernel_rt_sigreturn)

static int setup_frame(int sig, struct k_sigaction *ka,
sigset_t *set, struct pt_regs * regs)
@@ -380,7 +382,7 @@
goto give_sigsegv;
}

- restorer = &__kernel_sigreturn;
+ restorer = kernel_sigreturn;
if (ka->sa.sa_flags & SA_RESTORER)
restorer = ka->sa.sa_restorer;

@@ -476,7 +478,7 @@
goto give_sigsegv;

/* Set up to return from userspace. */
- restorer = &__kernel_rt_sigreturn;
+ restorer = kernel_rt_sigreturn;
if (ka->sa.sa_flags & SA_RESTORER)
restorer = ka->sa.sa_restorer;
err |= __put_user(restorer, &frame->pretcode);
Index: linux-2.6.13/arch/i386/kernel/entry.S
===================================================================
--- linux-2.6.13.orig/arch/i386/kernel/entry.S 2005-08-04 14:17:15.000000000 -0700
+++ linux-2.6.13/arch/i386/kernel/entry.S 2005-08-05 14:09:15.000000000 -0700
@@ -200,7 +200,11 @@
pushl %ebp
pushfl
pushl $(__USER_CS)
+#ifdef CONFIG_RELOCATABLE_FIXMAP
+ pushl %ss:SYSENTER_RETURN_ADDR
+#else
pushl $SYSENTER_RETURN
+#endif

/*
* Load the potential sixth argument from user stack.
Index: linux-2.6.13/arch/i386/mm/init.c
===================================================================
--- linux-2.6.13.orig/arch/i386/mm/init.c 2005-08-04 14:39:17.000000000 -0700
+++ linux-2.6.13/arch/i386/mm/init.c 2005-08-05 15:20:04.000000000 -0700
@@ -42,6 +42,10 @@

unsigned int __VMALLOC_RESERVE = 128 << 20;

+#ifdef CONFIG_RELOCATABLE_FIXMAP
+unsigned long __FIXADDR_TOP = 0;
+#endif
+
DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
unsigned long highstart_pfn, highend_pfn;

@@ -478,6 +482,12 @@
printk("NX (Execute Disable) protection: active\n");
#endif

+#ifdef CONFIG_RELOCATABLE_FIXMAP
+ if (!__FIXADDR_TOP)
+ __FIXADDR_TOP = 0xfffff000UL-(CONFIG_MEMORY_HOLE << 20);
+ printk(KERN_INFO "Fixmap top relocated to %lxh\n", __FIXADDR_TOP);
+#endif
+
pagetable_init();

load_cr3(swapper_pg_dir);
Index: linux-2.6.13/include/asm-i386/fixmap.h
===================================================================
--- linux-2.6.13.orig/include/asm-i386/fixmap.h 2005-08-04 14:14:24.000000000 -0700
+++ linux-2.6.13/include/asm-i386/fixmap.h 2005-08-05 15:36:13.000000000 -0700
@@ -20,7 +20,13 @@
* Leave one empty page between vmalloc'ed areas and
* the start of the fixmap.
*/
-#define __FIXADDR_TOP 0xfffff000
+#ifdef CONFIG_RELOCATABLE_FIXMAP
+extern unsigned long __FIXADDR_TOP;
+#define VSYSCALL_RELOCATION __fix_to_virt(FIX_VSYSCALL)
+#else
+#define __FIXADDR_TOP (0xfffff000-(CONFIG_MEMORY_HOLE << 20))
+#define VSYSCALL_RELOCATION 0
+#endif

#ifndef __ASSEMBLY__
#include <linux/kernel.h>
Index: linux-2.6.13/include/asm-i386/elf.h
===================================================================
--- linux-2.6.13.orig/include/asm-i386/elf.h 2005-08-02 17:06:23.000000000 -0700
+++ linux-2.6.13/include/asm-i386/elf.h 2005-08-05 15:31:32.000000000 -0700
@@ -129,7 +129,7 @@

#define VSYSCALL_BASE (__fix_to_virt(FIX_VSYSCALL))
#define VSYSCALL_EHDR ((const struct elfhdr *) VSYSCALL_BASE)
-#define VSYSCALL_ENTRY ((unsigned long) &__kernel_vsyscall)
+#define VSYSCALL_ENTRY ((unsigned long) (VSYSCALL_RELOCATION+&__kernel_vsyscall))
extern void __kernel_vsyscall;

#define ARCH_DLINFO \
Index: linux-2.6.13/include/linux/elf.h
===================================================================
--- linux-2.6.13.orig/include/linux/elf.h 2005-08-02 17:06:24.000000000 -0700
+++ linux-2.6.13/include/linux/elf.h 2005-08-05 12:06:17.000000000 -0700
@@ -138,6 +138,9 @@
#define DT_DEBUG 21
#define DT_TEXTREL 22
#define DT_JMPREL 23
+#define DT_VERSYM 0x6ffffff0
+#define DT_VERDEF 0x6ffffffc
+#define DT_VERNEED 0x6ffffffe
#define DT_LOPROC 0x70000000
#define DT_HIPROC 0x7fffffff



Attachments:
linear-hole (9.04 kB)

2006-03-14 21:51:44

by Chris Wright

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

* Zachary Amsden ([email protected]) wrote:
> Allow creation of an compile time hole at the top of linear address space.
>
> Extended to allow a dynamic hole in linear address space, 7/2005. This
> required some serious hacking to get everything perfect, but the end result
> appears to function quite nicely. Everyone can now share the appreciation
> of pseudo-undocumented ELF OS fields, which means core dumps, debuggers
> and even broken or obsolete linkers may continue to work.

Thanks. Gerd did something similar (although I believe it's simpler,
don't recall the relocation magic) for Xen. Either way, it's useful
from Xen perspective.

thanks,
-chris

2006-03-14 22:36:04

by Zachary Amsden

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

Chris Wright wrote:
> * Zachary Amsden ([email protected]) wrote:
>
>> Allow creation of an compile time hole at the top of linear address space.
>>
>> Extended to allow a dynamic hole in linear address space, 7/2005. This
>> required some serious hacking to get everything perfect, but the end result
>> appears to function quite nicely. Everyone can now share the appreciation
>> of pseudo-undocumented ELF OS fields, which means core dumps, debuggers
>> and even broken or obsolete linkers may continue to work.
>>
>
> Thanks. Gerd did something similar (although I believe it's simpler,
> don't recall the relocation magic) for Xen. Either way, it's useful
> from Xen perspective.
>

I believe Xen disables sysenter. The complications in my patch come
from the fact that the vsyscall page has to be relocated dynamically,
requiring, basically, run-time linking on the page and some tweaks to get
sysenter to work. If you don't use vsyscall (say, non-TLS glibc), then
you don't need that complexity. But I think it might be needed now,
even for Xen.
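
To make the "run time linking" part concrete, here is a minimal userspace sketch of rebasing the address-valued entries of a `.dynamic` array by a fixed delta. It assumes a Linux system that provides `<elf.h>`; the function names are hypothetical, and the tag list simply mirrors the one used in fixup_vsyscall_elf() in the patch. Unlike the loop in the patch, this version also examines the first entry.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Nonzero if this dynamic tag carries an address that must be rebased. */
static int dyn_tag_holds_address(Elf32_Sword tag)
{
    switch (tag) {
    case DT_PLTGOT: case DT_HASH:   case DT_STRTAB: case DT_SYMTAB:
    case DT_RELA:   case DT_INIT:   case DT_FINI:   case DT_REL:
    case DT_JMPREL: case DT_VERSYM: case DT_VERDEF: case DT_VERNEED:
        return 1;
    default:
        return 0;
    }
}

/* Add delta to every address-valued entry before the DT_NULL terminator.
 * Returns the number of entries that were patched. */
static size_t relocate_dynamic(Elf32_Dyn *dyn, Elf32_Addr delta)
{
    size_t patched = 0;

    assert(dyn != NULL);
    for (; dyn->d_tag != DT_NULL; dyn++) {
        if (dyn_tag_holds_address(dyn->d_tag)) {
            dyn->d_un.d_ptr += delta;
            patched++;
        }
    }
    return patched;
}
```

Entries such as DT_SONAME hold string-table offsets rather than addresses, so they are left alone.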

Zach

2006-03-15 04:26:38

by Chris Wright

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

* Zachary Amsden ([email protected]) wrote:
> Chris Wright wrote:
> >* Zachary Amsden ([email protected]) wrote:
> >
> >>Allow creation of an compile time hole at the top of linear address space.
> >>
> >>Extended to allow a dynamic hole in linear address space, 7/2005. This
> >>required some serious hacking to get everything perfect, but the end
> >>result
> >>appears to function quite nicely. Everyone can now share the appreciation
> >>of pseudo-undocumented ELF OS fields, which means core dumps, debuggers
> >>and even broken or obsolete linkers may continue to work.
> >>
> >
> >Thanks. Gerd did something similar (although I believe it's simpler,
> >don't recall the relocation magic) for Xen. Either way, it's useful
> >from Xen perspective.
>
> I believe Xen disables sysenter.

Yes, so vsyscall page has int80 implementation.

> The complications in my patch come
> from the fact that the vsyscall page has to be relocated dynamically,
> requiring, basically run time linking on the page and some tweaks to get
> sysenter to work. If you don't use vsyscall (say, non-TLS glibc), then
> you don't need that complexity. But I think it might be needed now,
> even for Xen.

I believe both Xen and execshield move vsyscall out of fixmap, and then
map into userspace as normal vma.

thanks,
-chris

2006-03-15 08:25:45

by Gerd Hoffmann

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

Index: vanilla-2.6.16-rc3/arch/i386/kernel/setup.c
===================================================================
--- vanilla-2.6.16-rc3.orig/arch/i386/kernel/setup.c 2006-02-13 09:39:33.000000000 +0100
+++ vanilla-2.6.16-rc3/arch/i386/kernel/setup.c 2006-02-13 09:57:36.000000000 +0100
@@ -922,6 +922,12 @@ static void __init parse_cmdline_early (
else if (!memcmp(from, "vmalloc=", 8))
__VMALLOC_RESERVE = memparse(from+8, &from);

+ /*
+ * fixmap=addr
+ */
+ else if (!memcmp(from, "fixmap=", 7))
+ set_fixaddr_top(simple_strtoul(from+7, NULL, 16));
+
next_char:
c = *(from++);
if (!c)
Index: vanilla-2.6.16-rc3/arch/i386/mm/init.c
===================================================================
--- vanilla-2.6.16-rc3.orig/arch/i386/mm/init.c 2006-02-13 09:39:33.000000000 +0100
+++ vanilla-2.6.16-rc3/arch/i386/mm/init.c 2006-02-13 14:33:40.000000000 +0100
@@ -628,6 +628,42 @@ void __init mem_init(void)
(unsigned long) (totalhigh_pages << (PAGE_SHIFT-10))
);

+#if 1 /* double-sanity-check paranoia */
+ printk("virtual kernel memory layout:\n"
+ " fixmap : 0x%08lx - 0x%08lx (%4ld kB)\n"
+ " pkmap : 0x%08lx - 0x%08lx (%4ld kB)\n"
+ " vmalloc : 0x%08lx - 0x%08lx (%4ld MB)\n"
+ " lowmem : 0x%08lx - 0x%08lx (%4ld MB)\n"
+ " .init : 0x%08lx - 0x%08lx (%4ld kB)\n"
+ " .data : 0x%08lx - 0x%08lx (%4ld kB)\n"
+ " .text : 0x%08lx - 0x%08lx (%4ld kB)\n",
+ FIXADDR_START, FIXADDR_TOP,
+ (FIXADDR_TOP - FIXADDR_START) >> 10,
+
+ PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE,
+ (LAST_PKMAP*PAGE_SIZE) >> 10,
+
+ VMALLOC_START, VMALLOC_END,
+ (VMALLOC_END - VMALLOC_START) >> 20,
+
+ (unsigned long)__va(0), (unsigned long)high_memory,
+ ((unsigned long)high_memory - (unsigned long)__va(0)) >> 20,
+
+ (unsigned long)&__init_begin, (unsigned long)&__init_end,
+ ((unsigned long)&__init_end - (unsigned long)&__init_begin) >> 10,
+
+ (unsigned long)&_etext, (unsigned long)&_edata,
+ ((unsigned long)&_edata - (unsigned long)&_etext) >> 10,
+
+ (unsigned long)&_text, (unsigned long)&_etext,
+ ((unsigned long)&_etext - (unsigned long)&_text) >> 10);
+
+ BUG_ON(PKMAP_BASE+LAST_PKMAP*PAGE_SIZE > FIXADDR_START);
+ BUG_ON(VMALLOC_END > PKMAP_BASE);
+ BUG_ON(VMALLOC_START > VMALLOC_END);
+ BUG_ON((unsigned long)high_memory > VMALLOC_START);
+#endif /* double-sanity-check paranoia */
+
#ifdef CONFIG_X86_PAE
if (!cpu_has_pae)
panic("cannot execute a PAE-enabled kernel on a PAE-less CPU!");
Index: vanilla-2.6.16-rc3/arch/i386/mm/pgtable.c
===================================================================
--- vanilla-2.6.16-rc3.orig/arch/i386/mm/pgtable.c 2006-01-03 04:21:10.000000000 +0100
+++ vanilla-2.6.16-rc3/arch/i386/mm/pgtable.c 2006-02-13 09:57:36.000000000 +0100
@@ -13,6 +13,7 @@
#include <linux/slab.h>
#include <linux/pagemap.h>
#include <linux/spinlock.h>
+#include <linux/module.h>

#include <asm/system.h>
#include <asm/pgtable.h>
@@ -138,6 +139,10 @@ void set_pmd_pfn(unsigned long vaddr, un
__flush_tlb_one(vaddr);
}

+static int fixmaps = 0;
+unsigned long __FIXADDR_TOP = 0xfffff000;
+EXPORT_SYMBOL(__FIXADDR_TOP);
+
void __set_fixmap (enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
{
unsigned long address = __fix_to_virt(idx);
@@ -147,6 +152,14 @@ void __set_fixmap (enum fixed_addresses
return;
}
set_pte_pfn(address, phys >> PAGE_SHIFT, flags);
+ fixmaps++;
+}
+
+void set_fixaddr_top(unsigned long top)
+{
+ BUG_ON(fixmaps > 0);
+ printk("%s: addr=0x%lx\n", __FUNCTION__, top);
+ __FIXADDR_TOP = top - PAGE_SIZE;
}

pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address)
Index: vanilla-2.6.16-rc3/include/asm-i386/fixmap.h
===================================================================
--- vanilla-2.6.16-rc3.orig/include/asm-i386/fixmap.h 2006-02-13 09:57:36.000000000 +0100
+++ vanilla-2.6.16-rc3/include/asm-i386/fixmap.h 2006-02-13 09:57:36.000000000 +0100
@@ -20,7 +20,7 @@
* Leave one empty page between vmalloc'ed areas and
* the start of the fixmap.
*/
-#define __FIXADDR_TOP 0xfffff000
+extern unsigned long __FIXADDR_TOP;

#ifndef __ASSEMBLY__
#include <linux/kernel.h>
@@ -93,6 +93,7 @@ enum fixed_addresses {

extern void __set_fixmap (enum fixed_addresses idx,
unsigned long phys, pgprot_t flags);
+extern void set_fixaddr_top(unsigned long top);

#define set_fixmap(idx, phys) \
__set_fixmap(idx, phys, PAGE_KERNEL)
Index: vanilla-2.6.16-rc3/include/asm-i386/page.h
===================================================================
--- vanilla-2.6.16-rc3.orig/include/asm-i386/page.h 2006-02-13 09:57:36.000000000 +0100
+++ vanilla-2.6.16-rc3/include/asm-i386/page.h 2006-02-13 14:21:36.000000000 +0100
@@ -121,7 +121,7 @@ extern int page_is_ram(unsigned long pag

#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)
#define VMALLOC_RESERVE ((unsigned long)__VMALLOC_RESERVE)
-#define MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE)
+#define MAXMEM (__FIXADDR_TOP-__PAGE_OFFSET-__VMALLOC_RESERVE)
#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
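
For reference, the `fixmap=` handling in the patch above can be modelled in plain userspace C roughly as follows. This is only a sketch: `parse_fixmap_option`, `fixaddr_top` and `SKETCH_PAGE_SIZE` are hypothetical stand-ins, `strtoul` replaces the kernel's `simple_strtoul`, and the 4 kB page size is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed i386 page size */

/* Stand-in for __FIXADDR_TOP; the default matches the patch. */
static unsigned long fixaddr_top = 0xfffff000UL;

/* Returns 1 if arg was a "fixmap=" option and was applied, else 0. */
static int parse_fixmap_option(const char *arg)
{
    static const char prefix[] = "fixmap=";
    const size_t len = sizeof(prefix) - 1;

    assert(arg != NULL);
    if (strncmp(arg, prefix, len) != 0)
        return 0;
    /* Base 16, like simple_strtoul(from+7, NULL, 16) in the patch;
     * the top page itself is left out, as in set_fixaddr_top(). */
    fixaddr_top = strtoul(arg + len, NULL, 16) - SKETCH_PAGE_SIZE;
    return 1;
}
```

Passing `fixmap=f0000000` would put the fixmap top at 0xeffff000, leaving everything from 0xf0000000 upward free for a hypervisor.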


Attachments:
move-gate-page.diff (6.79 kB)
unfix-fixmap (5.26 kB)

2006-03-15 08:37:20

by Zachary Amsden

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

Gerd Hoffmann wrote:
>>> The complications in my patch come
>>> from the fact that the vsyscall page has to be relocated dynamically,
>>> requiring, basically run time linking on the page and some tweaks to get
>>> sysenter to work. If you don't use vsyscall (say, non-TLS glibc), then
>>> you don't need that complexity. But I think it might be needed now,
>>> even for Xen.
>>>
>> I believe both Xen and execshield move vsyscall out of fixmap, and then
>> map into userspace as normal vma.
>>
>
> Yep, my patch (attached below for reference) moves the vsyscall page
> into user address space, just below PAGE_OFFSET. Works basically the
> same way the vsyscall page is mapped in the ia32 emulation of the x86_64
> architecture. Address stays fixed, thus the relocation magic isn't needed.
>
> Once the vsyscall page is moved out of fixmap it's easy to make fixmap
> movable and thus have a runtime-resizable address space hole at the top
> of address space. Patch is attached too, although that one is more
> proof-of-concept, it doesn't make much sense as-is. It has a kernel
> command line option to specify the top of address space so you can play
> around with it ...
>
> Both patches are against -rc3 and most likely still apply just fine,
> haven't tested that though.
>

Your patch looks a lot cleaner and less hackish than mine. But I wonder
if it still works with kernels that support the sysenter method of
calling into the kernel. Look at the following code:

ENTRY(sysenter_entry)
movl TSS_sysenter_esp0(%esp),%esp
sysenter_past_esp:
STI
pushl $(__USER_DS)
pushl %ebp
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN

SYSENTER_RETURN is a link time constant that is defined based on the
location of the vsyscall page. If the vsyscall page can move, this can
not be a constant. The reason is, this "fake" exception frame is used
to return back to the EIP of the call site, and sysenter does not record
the EIP of the call site.
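
A small C model of that fake frame may make the dependency clearer. Everything here is a sketch under assumptions: `struct iret_frame` and `make_sysenter_frame` are hypothetical names, the selector values are illustrative, and the return address is expressed as the vsyscall page base plus the offset of the return label inside the page.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative selector values, assumed for this sketch. */
#define SKETCH_USER_CS 0x73u
#define SKETCH_USER_DS 0x7bu

/* The five words the entry code pushes, lowest address first. */
struct iret_frame {
    uint32_t eip;     /* return address inside the vsyscall page */
    uint32_t cs;
    uint32_t eflags;
    uint32_t esp;     /* user stack pointer (passed in %ebp)     */
    uint32_t ss;
};

/* Build the frame that IRET will use to get back to userspace. */
static struct iret_frame make_sysenter_frame(uint32_t user_esp,
                                             uint32_t eflags,
                                             uint32_t vsyscall_base,
                                             uint32_t return_offset)
{
    struct iret_frame f;

    f.ss = SKETCH_USER_DS;
    f.esp = user_esp;
    f.eflags = eflags;
    f.cs = SKETCH_USER_CS;
    /* sysenter did not save the caller's EIP, so it is synthesized. */
    f.eip = vsyscall_base + return_offset;
    return f;
}
```

If `vsyscall_base` is only known at run time, the value stored in `eip` has to be computed or loaded at run time too, which is why a link-time immediate no longer works in that case.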

Zach

2006-03-15 09:04:55

by Chris Wright

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

* Zachary Amsden ([email protected]) wrote:
> ENTRY(sysenter_entry)
> movl TSS_sysenter_esp0(%esp),%esp
> sysenter_past_esp:
> STI
> pushl $(__USER_DS)
> pushl %ebp
> pushfl
> pushl $(__USER_CS)
> pushl $SYSENTER_RETURN
>
> SYSENTER_RETURN is a link time constant that is defined based on the
> location of the vsyscall page. If the vsyscall page can move, this can
> not be a constant. The reason is, this "fake" exception frame is used
> to return back to the EIP of the call site, and sysenter does not record
> the EIP of the call site.

It's only a real issue for something like execshield. For this it's easy
to do the fixed math since it's still at a fixed address.

+ DEFINE(VSYSCALL_BASE, (PAGE_OFFSET - 2*PAGE_SIZE));

But execshield has to make SYSENTER_RETURN context sensitive to current
since the vdso is mapped at a random location.

thanks,
-chris

2006-03-15 09:19:22

by Zachary Amsden

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

Chris Wright wrote:
> * Zachary Amsden ([email protected]) wrote:
>
>> ENTRY(sysenter_entry)
>> movl TSS_sysenter_esp0(%esp),%esp
>> sysenter_past_esp:
>> STI
>> pushl $(__USER_DS)
>> pushl %ebp
>> pushfl
>> pushl $(__USER_CS)
>> pushl $SYSENTER_RETURN
>>
>> SYSENTER_RETURN is a link time constant that is defined based on the
>> location of the vsyscall page. If the vsyscall page can move, this can
>> not be a constant. The reason is, this "fake" exception frame is used
>> to return back to the EIP of the call site, and sysenter does not record
>> the EIP of the call site.
>>
>
> It's only real issue for something like execshield. For this it's easy
> to do the fixed math since it's still at fixed address.
>
> + DEFINE(VSYSCALL_BASE, (PAGE_OFFSET - 2*PAGE_SIZE));
>

Ok, I'm confused. What fixed math? The return EIP that is pushed here
is used when sysenter is active and you have to IRET back to userspace.
If that EIP is dynamically relocatable, you can't do fixed math unless
you patch the pushl site dynamically. Notable reasons for returning via
IRET on this fake exception frame were (until my recent submission) IOPL
changes, but I believe there were more. I will have to inspect the
source to determine if that is still the case.

Zach

2006-03-15 09:25:51

by Gerd Hoffmann

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

> pushl $SYSENTER_RETURN
>
> SYSENTER_RETURN is a link time constant that is defined based on the
> location of the vsyscall page. If the vsyscall page can move, this can
> not be a constant.

The vsyscall page is at PAGE_OFFSET - 2*PAGE_SIZE. It doesn't move. At
least not at runtime. At compile time it can change with the new
VMSPLIT config options, but that isn't a problem ;)
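
Spelled out as a tiny C sketch (the function name and the 4 kB page size are assumptions; the PAGE_OFFSET values in the usage note are just examples of what a VMSPLIT choice might select):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed i386 page size */

/* Vsyscall page address two pages below the kernel/user split. */
static uint32_t vsyscall_base_for(uint32_t page_offset)
{
    assert(page_offset >= 2 * SKETCH_PAGE_SIZE);
    return page_offset - 2 * SKETCH_PAGE_SIZE;
}
```

For a PAGE_OFFSET of 0xC0000000 this yields 0xBFFFE000, and for 0x80000000 it yields 0x7FFFE000. Either way the value is fixed once the kernel is compiled.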

cheers,

Gerd

--
Gerd 'just married' Hoffmann <[email protected]>
I'm the hacker formerly known as Gerd Knorr.
http://www.suse.de/~kraxel/just-married.jpeg

2006-03-15 09:37:07

by Chris Wright

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

* Zachary Amsden ([email protected]) wrote:
> >+ DEFINE(VSYSCALL_BASE, (PAGE_OFFSET - 2*PAGE_SIZE));
>
> Ok, I'm confused. What fixed math?

Sorry, bad choice of words. From above, the VSYSCALL_BASE is known
at compile time (in asm-offsets.h). So the SYSENTER_RETURN is still a
fixed address. For execshield it's truly dynamic, so you get something
like this instead of the constant SYSENTER_RETURN:

- pushl $SYSENTER_RETURN
+ pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
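
The offset in that operand can be checked with a little arithmetic. The sketch below rests on several assumptions that are not stated in this thread: 8 kB kernel stacks with thread_info at the bottom, an initial esp0 that sits 8 bytes below the top of the stack, and four words already pushed when the instruction runs. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_THREAD_SIZE 8192u  /* assumed kernel stack size */

/* Assumed initial kernel stack pointer: 8 bytes below the stack top. */
static uint32_t initial_esp0(uint32_t stack_base)
{
    return stack_base + SKETCH_THREAD_SIZE - 8;
}

/* Address the pushl operand resolves to, given esp after four pushes
 * and the offset of the field inside thread_info. */
static uint32_t sysenter_return_slot(uint32_t esp, uint32_t ti_offset)
{
    return esp + ti_offset - SKETCH_THREAD_SIZE + 8 + 4 * 4;
}
```

Under those assumptions the expression lands exactly on the per-thread field at `stack_base + ti_offset`, so each thread pushes its own return address.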

thanks,
-chris

2006-03-15 09:38:04

by Zachary Amsden

Subject: Re: [RFC, PATCH 7/24] i386 Vmi memory hole

Gerd Hoffmann wrote:
>> pushl $SYSENTER_RETURN
>>
>> SYSENTER_RETURN is a link time constant that is defined based on the
>> location of the vsyscall page. If the vsyscall page can move, this can
>> not be a constant.
>>
>
> The vsyscall page is at PAGE_OFFSET - 2*PAGE_SIZE. It doesn't move. At
> least not at runtime. At compile time it can change with the new
> VMSPLIT config options, but that isn't a problem ;)
>

Okay, I get it now. Thanks for the explanation. This certainly does
simplify the problem.

Zach