Hello,
this patch makes it possible to prevent Linux from using the RAM below
PHYSICAL_START.
The "reserved RAM" can be mapped by virtualization software with /dev/mem
to create a 1:1 mapping between guest physical (bus) addresses and host
physical (bus) addresses. This will allow PCI passthrough with DMA for
the guest on current production hardware that lacks VT-d. The only
detail to take care of is the RAM marked "reserved RAM failed". The
virtualization software must create for the guest an e820 map that
only includes the "reserved RAM" regions, but if the guest touches
memory with a guest physical address in the "reserved RAM failed" ranges
(a Linux guest will do that even if the RAM isn't present in the e820
map), it should provide that as RAM and map it with a non-identity
mapping. This should allow any Linux kernel to run fine with PCI
passthrough, and hopefully any other OS too, on all VT-enabled
hardware.
(The virtualization software should do if (pfn_valid(gfn))
get_page(pfn_to_page(gfn)) instead of get_user_pages, with the equivalent
check in the release path.)
The trampoline page marked as "reserved RAM failed" can easily be
relocated near 640k with an incremental patch, to avoid an e820 hole at
0x6000 in case any bootloader or OS gets confused by it.
The end of the patch is just bugfixes. However, the limit of the
reserved RAM is 1G; this can also be relaxed with an incremental
patch later on if needed (currently 1G is enough). Perhaps this has
other uses.
Let me know if this can be merged, thanks!
svm ~ # cat /proc/iomem |head -n 20
00000000-00000fff : reserved RAM failed
00001000-00005fff : reserved RAM
00006000-00007fff : reserved RAM failed
00008000-0009efff : reserved RAM
0009f000-0009ffff : reserved
000cd600-000cffff : pnp 00:0d
000f0000-000fffff : reserved
00100000-0fffffff : reserved RAM
10000000-3dedffff : System RAM
10000000-10329ab2 : Kernel code
10329ab3-104933e7 : Kernel data
104f5000-10558e67 : Kernel bss
3dee0000-3dee2fff : ACPI Non-volatile Storage
3dee3000-3deeffff : ACPI Tables
3def0000-3defffff : reserved
3dff0000-3ffeffff : pnp 00:0d
e0000000-efffffff : reserved
fa000000-fbffffff : PCI Bus #01
fa000000-fbffffff : 0000:01:05.0
fda00000-fdbfffff : PCI Bus #01
svm ~ # hexdump /dev/mem | grep -C2 'cccc cccc cccc cccc'
00007e0 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 cccc cccc cccc cccc cccc cccc cccc cccc
*
0006000 a5a5 a5a5 8ec8 8ed8 8ec0 66d0 06c7 0000
--
*
0007ff0 0000 0000 0000 0000 3063 1000 0000 0000
0008000 cccc cccc cccc cccc cccc cccc cccc cccc
*
009f000 0002 0000 0000 0000 0000 0000 0000 0000
--
00fffe0 6000 3c03 45e7 0184 0500 0082 01c0 0223
00ffff0 5bea 00e0 31f0 2f32 3931 302f 0037 12fc
0100000 cccc cccc cccc cccc cccc cccc cccc cccc
*
10000000 8d48 f92d ffff 48ff ed81 0000 1000 8948
^C
svm ~ #
Signed-off-by: Andrea Arcangeli <[email protected]>
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1109,8 +1109,36 @@ config CRASH_DUMP
(CONFIG_RELOCATABLE=y).
For more details see Documentation/kdump/kdump.txt
+config RESERVE_PHYSICAL_START
+ bool "Reserve all RAM below PHYSICAL_START (EXPERIMENTAL)"
+ depends on !RELOCATABLE && X86_64
+ help
+ This makes the kernel use only RAM above __PHYSICAL_START.
+ All memory below __PHYSICAL_START will be left unused and
+ marked as "reserved RAM" in /proc/iomem. The few special
+ pages that can't be relocated at addresses above
+ __PHYSICAL_START and that can't be guaranteed to be unused
+ by the running kernel, will be marked "reserved RAM failed"
+ in /proc/iomem. Those may or may be not used by the kernel
+ (for example smp trampoline pages would only be used if
+ CPU hotplug is enabled).
+
+ The "reserved RAM" can be mapped by virtualization software
+ with /dev/mem to create a 1:1 mapping between guest physical
+ (bus) address and host physical (bus) address. This will
+ allow pci passthrough with DMA for the guest using the ram
+ with the 1:1 mapping. The only detail to take care of is the
+ ram marked "reserved RAM failed". The virtualization
+ software must create for the guest an e820 map that only
+ includes the "reserved RAM" regions but if the guest touches
+ memory with guest physical address in the "reserved RAM
+ failed" ranges (linux guest will do that even if the ram
+ isn't present in the e820 map), it should provide that as
+ ram and map it with a non linear mapping. This should allow
+ any linux kernel to run fine and hopefully any other OS too.
+
config PHYSICAL_START
- hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
+ hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP || RESERVE_PHYSICAL_START)
default "0x1000000" if X86_NUMAQ
default "0x200000" if X86_64
default "0x100000"
diff --git a/arch/x86/kernel/e820_64.c b/arch/x86/kernel/e820_64.c
--- a/arch/x86/kernel/e820_64.c
+++ b/arch/x86/kernel/e820_64.c
@@ -91,6 +91,11 @@ void __init early_res_to_bootmem(void)
printk(KERN_INFO "early res: %d [%lx-%lx] %s\n", i,
r->start, r->end - 1, r->name);
reserve_bootmem_generic(r->start, r->end - r->start);
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+ if (r->start < __PHYSICAL_START)
+ add_memory_region(r->start, r->end - r->start,
+ E820_RESERVED_RAM_FAILED);
+#endif
}
}
@@ -231,6 +236,10 @@ void __init e820_reserve_resources(struc
struct resource *data_resource, struct resource *bss_resource)
{
int i;
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+ /* solve E820_RESERVED_RAM vs E820_RESERVED_RAM_FAILED conflicts */
+ update_e820();
+#endif
for (i = 0; i < e820.nr_map; i++) {
struct resource *res;
res = alloc_bootmem_low(sizeof(struct resource));
@@ -238,6 +247,16 @@ void __init e820_reserve_resources(struc
case E820_RAM: res->name = "System RAM"; break;
case E820_ACPI: res->name = "ACPI Tables"; break;
case E820_NVS: res->name = "ACPI Non-volatile Storage"; break;
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+ case E820_RESERVED_RAM_FAILED:
+ res->name = "reserved RAM failed";
+ break;
+ case E820_RESERVED_RAM:
+ memset(__va(e820.map[i].addr),
+ POISON_FREE_INITMEM, e820.map[i].size);
+ res->name = "reserved RAM";
+ break;
+#endif
default: res->name = "reserved";
}
res->start = e820.map[i].addr;
@@ -409,6 +428,12 @@ static void __init e820_print_map(char *
break;
case E820_NVS:
printk(KERN_CONT "(ACPI NVS)\n");
+ break;
+ case E820_RESERVED_RAM:
+ printk(KERN_CONT "(reserved RAM)\n");
+ break;
+ case E820_RESERVED_RAM_FAILED:
+ printk(KERN_CONT "(reserved RAM failed)\n");
break;
default:
printk(KERN_CONT "type %u\n", e820.map[i].type);
@@ -639,9 +664,31 @@ static int __init copy_e820_map(struct e
unsigned long end = start + size;
unsigned long type = biosmap->type;
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+ /* make space for two more low-prio types */
+ type += 2;
+#endif
+
/* Overflow in 64 bits? Ignore the memory map. */
if (start > end)
return -1;
+
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+ if (type == E820_RAM) {
+ if (end <= __PHYSICAL_START) {
+ add_memory_region(start, size,
+ E820_RESERVED_RAM);
+ continue;
+ }
+ if (start < __PHYSICAL_START) {
+ add_memory_region(start,
+ __PHYSICAL_START-start,
+ E820_RESERVED_RAM);
+ size -= __PHYSICAL_START-start;
+ start = __PHYSICAL_START;
+ }
+ }
+#endif
add_memory_region(start, size, type);
} while (biosmap++, --nr_map);
diff --git a/include/asm-x86/e820.h b/include/asm-x86/e820.h
--- a/include/asm-x86/e820.h
+++ b/include/asm-x86/e820.h
@@ -4,10 +4,19 @@
#define E820MAX 128 /* number of entries in E820MAP */
#define E820NR 0x1e8 /* # entries in E820MAP */
+#ifdef CONFIG_RESERVE_PHYSICAL_START
+#define E820_RESERVED_RAM 1
+#define E820_RESERVED_RAM_FAILED 2
+#define E820_RAM 3
+#define E820_RESERVED 4
+#define E820_ACPI 5
+#define E820_NVS 6
+#else
#define E820_RAM 1
#define E820_RESERVED 2
#define E820_ACPI 3
#define E820_NVS 4
+#endif
#ifndef __ASSEMBLY__
struct e820entry {
diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
--- a/include/asm-x86/page_64.h
+++ b/include/asm-x86/page_64.h
@@ -29,6 +29,7 @@
#define __PAGE_OFFSET _AC(0xffff810000000000, UL)
#define __PHYSICAL_START CONFIG_PHYSICAL_START
+#define __PHYSICAL_OFFSET (__PHYSICAL_START-0x200000)
#define __KERNEL_ALIGN 0x200000
/*
@@ -47,7 +48,7 @@
#define __PHYSICAL_MASK_SHIFT 46
#define __VIRTUAL_MASK_SHIFT 48
-#define KERNEL_TEXT_SIZE (40*1024*1024)
+#define KERNEL_TEXT_SIZE (40*1024*1024+__PHYSICAL_OFFSET)
#define KERNEL_TEXT_START _AC(0xffffffff80000000, UL)
#ifndef __ASSEMBLY__
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -140,7 +140,7 @@ static inline void native_pgd_clear(pgd_
#define VMALLOC_START _AC(0xffffc20000000000, UL)
#define VMALLOC_END _AC(0xffffe1ffffffffff, UL)
#define VMEMMAP_START _AC(0xffffe20000000000, UL)
-#define MODULES_VADDR _AC(0xffffffff88000000, UL)
+#define MODULES_VADDR (0xffffffff88000000UL+__PHYSICAL_OFFSET)
#define MODULES_END _AC(0xfffffffffff00000, UL)
#define MODULES_LEN (MODULES_END - MODULES_VADDR)
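As an illustration of the copy_e820_map() hunk above, here is a minimal user-space sketch of how one BIOS RAM range is split at PHYSICAL_START. It is not the kernel code: split_ram_at(), struct region and the REGION_* names are hypothetical, and the boundary is passed in as a parameter instead of coming from __PHYSICAL_START.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative type codes only; they do not match the E820_* values. */
enum region_type { REGION_RESERVED_RAM, REGION_RAM };

struct region {
	uint64_t start;
	uint64_t size;
	enum region_type type;
};

/*
 * Split one RAM range [start, start + size) at phys_start.
 * Everything below phys_start becomes REGION_RESERVED_RAM, the rest
 * stays REGION_RAM. Writes at most two entries to out[] and returns
 * how many were written (0 if the range overflows 64 bits).
 * Hypothetical helper: the patch does this inline in copy_e820_map().
 */
size_t split_ram_at(uint64_t start, uint64_t size,
		    uint64_t phys_start, struct region out[2])
{
	uint64_t end = start + size;
	size_t n = 0;

	assert(out != NULL);
	if (start > end)		/* overflow in 64 bits */
		return 0;

	if (end <= phys_start) {	/* entirely below the boundary */
		out[n++] = (struct region){ start, size, REGION_RESERVED_RAM };
		return n;
	}
	if (start < phys_start) {	/* straddles the boundary */
		out[n++] = (struct region){ start, phys_start - start,
					    REGION_RESERVED_RAM };
		size -= phys_start - start;
		start = phys_start;
	}
	out[n++] = (struct region){ start, size, REGION_RAM };
	return n;
}
```

With the values from the /proc/iomem dump in the mail (a RAM range 0x100000-0x3dedffff and a boundary of 0x10000000), this yields a "reserved RAM" piece up to 0x0fffffff and a "System RAM" piece starting at 0x10000000.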
On Wed, 27 Feb 2008 01:33:25 +0100 Andrea Arcangeli wrote:
> Hello,
>
> this patch allows to prevent linux from using the ram below
> PHYSICAL_START.
>
> The "reserved RAM" can be mapped by virtualization software with to
> create a 1:1 mapping between guest physical (bus) address and host
> physical (bus) address. This will allow pci passthrough with DMA for
> the guest with current production hardware that misses VT-d. The only
> detail to take care of is the ram marked "reserved RAM failed". The
> virtualization software must create for the guest an e820 map that
> only includes the "reserved RAM" regions but if the guest touches
> memory with guest physical address in the "reserved RAM failed" ranges
> (linux guest will do that even if the ram isn't present in the e820
> map), it should provide that as ram and map it with a not-ident
> mapping. This should allow any linux kernel to run fine with pci
> passthrough and hopefully any other OS too with all VT enabled
> hardware.
>
>
> Let me know if this can be merged, thanks!
>
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1109,8 +1109,36 @@ config CRASH_DUMP
> (CONFIG_RELOCATABLE=y).
> For more details see Documentation/kdump/kdump.txt
>
> +config RESERVE_PHYSICAL_START
> + bool "Reserve all RAM below PHYSICAL_START (EXPERIMENTAL)"
> + depends on !RELOCATABLE && X86_64
> + help
> + This makes the kernel use only RAM above __PHYSICAL_START.
> + All memory below __PHYSICAL_START will be left unused and
> + marked as "reserved RAM" in /proc/iomem. The few special
> + pages that can't be relocated at addresses above
> + __PHYSICAL_START and that can't be guaranteed to be unused
> + by the running kernel, will be marked "reserved RAM failed"
No comma.
> + in /proc/iomem. Those may or may be not used by the kernel
> + (for example smp trampoline pages would only be used if
SMP
> + CPU hotplug is enabled).
> +
> + The "reserved RAM" can be mapped by virtualization software
Indent above with tab + 2 spaces, please.
> + with /dev/mem to create a 1:1 mapping between guest physical
> + (bus) address and host physical (bus) address. This will
> + allow pci passthrough with DMA for the guest using the ram
PCI RAM
> + with the 1:1 mapping. The only detail to take care of is the
> + ram marked "reserved RAM failed". The virtualization
RAM
> + software must create for the guest an e820 map that only
> + includes the "reserved RAM" regions but if the guest touches
> + memory with guest physical address in the "reserved RAM
> + failed" ranges (linux guest will do that even if the ram
Linux RAM
> + isn't present in the e820 map), it should provide that as
> + ram and map it with a non linear mapping. This should allow
RAM non-linear
> + any linux kernel to run fine and hopefully any other OS too.
Linux
> +
> config PHYSICAL_START
> - hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
> + hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP || RESERVE_PHYSICAL_START)
> default "0x1000000" if X86_NUMAQ
> default "0x200000" if X86_64
> default "0x100000"
---
~Randy
On Wed, Feb 27, 2008 at 01:33:25AM +0100, Andrea Arcangeli wrote:
> Hello,
>
> this patch allows to prevent linux from using the ram below
> PHYSICAL_START.
>
> The "reserved RAM" can be mapped by virtualization software with to
> create a 1:1 mapping between guest physical (bus) address and host
> physical (bus) address. This will allow pci passthrough with DMA for
> the guest with current production hardware that misses VT-d. The only
> detail to take care of is the ram marked "reserved RAM failed". The
> virtualization software must create for the guest an e820 map that
> only includes the "reserved RAM" regions but if the guest touches
> memory with guest physical address in the "reserved RAM failed" ranges
> (linux guest will do that even if the ram isn't present in the e820
> map), it should provide that as ram and map it with a not-ident
> mapping. This should allow any linux kernel to run fine with pci
> passthrough and hopefully any other OS too with all VT enabled
> hardware.
>
> (the virtualization software should do if (pfn_valid(gfn))
> get_page(pfn_to_page(gfn)) instead of get_user_pages and equivalent
> check in the release path)
>
> The trampoline page marked as "reserved RAM failed" can be easily
> relocated near 640k with an incremental patch to avoid an e820 hole at
> 0x6000 if any bootloader or OS gets confused.
>
> The end of the patch are just bugfixes. However the limit of the
> reserved ram is 1G... this can also be relaxed with an incremental
> patch later on if needed (currently 1G is enough). Perhaps this has
> other usages.
>
> Let me know if this can be merged, thanks!
>
I don't know much about the PCI passthrough thing, but in a nutshell it
looks like you just want a way to reserve memory in the host which is not
used by the host, and then also reserve a virtual range in the host where
you can create another set of mappings for that reserved memory?
Can't you just provide a command line parameter to reserve a section
of memory, the way crashkernel=X@Y parameter does?
[..]
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1109,8 +1109,36 @@ config CRASH_DUMP
> (CONFIG_RELOCATABLE=y).
> For more details see Documentation/kdump/kdump.txt
>
> +config RESERVE_PHYSICAL_START
> + bool "Reserve all RAM below PHYSICAL_START (EXPERIMENTAL)"
> + depends on !RELOCATABLE && X86_64
> + help
What prevents you from doing this for RELOCATABLE kernels?
[..]
> #ifndef __ASSEMBLY__
> struct e820entry {
> diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
> --- a/include/asm-x86/page_64.h
> +++ b/include/asm-x86/page_64.h
> @@ -29,6 +29,7 @@
> #define __PAGE_OFFSET _AC(0xffff810000000000, UL)
>
> #define __PHYSICAL_START CONFIG_PHYSICAL_START
> +#define __PHYSICAL_OFFSET (__PHYSICAL_START-0x200000)
> #define __KERNEL_ALIGN 0x200000
>
> /*
> @@ -47,7 +48,7 @@
> #define __PHYSICAL_MASK_SHIFT 46
> #define __VIRTUAL_MASK_SHIFT 48
>
> -#define KERNEL_TEXT_SIZE (40*1024*1024)
> +#define KERNEL_TEXT_SIZE (40*1024*1024+__PHYSICAL_OFFSET)
Why are you changing this? What is __PHYSICAL_OFFSET? Are you expanding
the kernel text/data region so that you can additionally map this
reserved area?
If yes, I think we should probably have a separate area altogether to
map this reserved area, rather than expanding the existing kernel
text/data region.
Thanks
Vivek
Hi Andrea,
Sorry for the long delay in replying.
I'm trying to grok the use cases for this. In particular, it seems like
a particularly restricted case of wanting to be able to reserve an
arbitrary bit of memory, which seems like it would be more useful (don't
we already have memmap= options for that, anyway?)
In particular, what's the reason for reserving *low* memory? Low memory
(first megabyte) is full of special-use address space which, as the Xen
address space discussion has shown, is nontrivial to tamper with even
if it initially works. If you want a dedicated chunk of mappable PCI
space, it would seem cleaner to have it higher up in the memory map.
Hi Peter,
On Fri, Feb 29, 2008 at 10:21:08AM -0800, H. Peter Anvin wrote:
> Hi Andrea,
>
> Sorry for the long delay in replying.
>
> I'm trying to grok the use cases for this. In particular, it seems like a
> particularly restricted case of wanting to be able to reserve an arbitrary
> bit of memory, which seems like it would be more useful (don't we already
> have memmap= options for that, anyway?)
I'll answer the memmap question in a separate email.
> In particular, what's the reason for reserving *low* memory? Low memory
> (first megabyte) is full of special-use address space which, as the Xen
The only special ones are the zero page and the trampoline (the
trampoline can optionally be moved near 640k later with an independent
patch).
> address space discussion has showed, is nontrivial to tamper with even if
> it initially works. If you want a dedicated chunk of mappable PCI space,
> it would seem cleaner to have it higher up in the memory map.
The whole e820 must be PCI mappable: the bootloader starts in real mode,
and if any DMA happens outside the 1:1 mapping it will crash. This is to
let any guest OS or bootloader run with PCI passthrough. There's no
paravirt here, the guest has fully native drivers (which is the whole
point of this approach I guess).
This will allow things like running random 3D apps on a random 3D card
on a random OS with a random 3D driver, all on top of a Linux host that
leaves the first 1G free for this OS to run with direct access to the
graphics card. Clearly the Linux host isn't safe, but if you trust the
guest OS that talks directly to the 3D card, you can run many more
guests that are fully swapped out or ballooned or rss limited or ksm
shared, etc., plus the host can run apps too. This is to retain the
full KVM/Linux virtualization power and flexibility without leaving 3D
or any other proprietary hardware out of the equation in the
pci-passthrough guest.
Hi Vivek,
On Thu, Feb 28, 2008 at 01:36:04PM -0500, Vivek Goyal wrote:
> I don't know much about pci passthrough thing, but in a nutshell it
> looks like you just want a way to reserve memory in host which is not
> used by host and then also reserve a virtual range in host where you
> can create another set of mapping for that reserved memory?
I described the potential usage in the email to hpa. (I'm currently
implementing the kvm-userland bits: with a -reserved-ram parameter,
qemu will open /proc/iomem and map direct RAM from /dev/mem in
the Linux ptes, and add a special memslot to kvm so that it uses the
pfn number directly if !pfn_valid, or get_page(pfn_to_page) to
refcount the dummy reserved pages in case the mem_map exists for that
pfn. This same logic can also be used to direct map the PCI bus
addresses.)
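To make the userland side a bit more concrete, here is a rough sketch of how a program could pick the "reserved RAM" ranges out of /proc/iomem lines such as the ones in the dump from the first mail. This is not the qemu code; parse_reserved_ram() is a hypothetical helper, and it only handles a single line in the "start-end : name" format.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/iomem line of the form "start-end : name".
 * Returns 1 and fills *start and *end (end is inclusive, as printed by
 * the kernel) only when the name is exactly "reserved RAM".
 * Returns 0 for any other line, including "reserved RAM failed",
 * which must not be handed to the guest as 1:1 mapped RAM.
 */
int parse_reserved_ram(const char *line, uint64_t *start, uint64_t *end)
{
	char name[64];
	uint64_t s, e;

	assert(line != NULL && start != NULL && end != NULL);
	if (sscanf(line, " %" SCNx64 "-%" SCNx64 " : %63[^\n]",
		   &s, &e, name) != 3)
		return 0;
	if (strcmp(name, "reserved RAM") != 0)
		return 0;
	*start = s;
	*end = e;
	return 1;
}
```

A caller would read /proc/iomem line by line with fgets() and collect every range for which this returns 1; those ranges are the candidates for the virtualized e820 map and for the /dev/mem mapping.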
> Can't you just provide a command line parameter to reserve a section
> of memory, the way crashkernel=X@Y parameter does?
My requirement to specify at compile time the RAM that the
qemu-system-x86_64 -reserved-ram guest will take already sounds
complicated enough without having to skip the 640k region and all the
reserved pages like the trampoline that are only known at compile time
anyway (so one couldn't use memmap=x@y without checking all the
kernel source first to see if anybody changed the early_reserved
array...). To be safe I would need to check the kernel source and the
host e820 map by hand first for every different system out there!
> What prevents you from doing this for RELOCATABLE kernels?
I already tried that before falling back to the compile-time
solution. That requires duplicating all the memparse/strtoul C code in
arch/x86/kernel and recompiling it as 32bit so the 32bit part of
head_64.S can call it. Keep in mind this is a solution that is
required because VT-d wasn't shipped in all hardware out there, so I
didn't want to do extra work and an ugly big patch that duplicates
code around just so I can pass reserved-ram=512M on the boot command
line instead of specifying it at compile time.
Furthermore, the issue isn't just parsing the command line
params in 32bit mode from head_64.S: all the other code, like the setup
of the initial pagetables and vmalloc start, would also require
changes to become dynamic (the latter would slow down vmalloc a bit too
at runtime).
> [..]
> > #ifndef __ASSEMBLY__
> > struct e820entry {
> > diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
> > --- a/include/asm-x86/page_64.h
> > +++ b/include/asm-x86/page_64.h
> > @@ -29,6 +29,7 @@
> > #define __PAGE_OFFSET _AC(0xffff810000000000, UL)
> >
> > #define __PHYSICAL_START CONFIG_PHYSICAL_START
> > +#define __PHYSICAL_OFFSET (__PHYSICAL_START-0x200000)
> > #define __KERNEL_ALIGN 0x200000
> >
> > /*
> > @@ -47,7 +48,7 @@
> > #define __PHYSICAL_MASK_SHIFT 46
> > #define __VIRTUAL_MASK_SHIFT 48
> >
> > -#define KERNEL_TEXT_SIZE (40*1024*1024)
> > +#define KERNEL_TEXT_SIZE (40*1024*1024+__PHYSICAL_OFFSET)
>
> Why are you changing this? What is __PHYSICAL_OFFSET? Are you expanding
The __PHYSICAL_OFFSET part is just a bugfix and I can split it out and
submit it separately if this patch isn't merged. If you compile the
kernel with kdump at a physical offset that is just a bit higher than
40M it will simply crash and fail to boot.
> the kernel text/data region so that you can additionally map this
> reserved area?
> If yes, I think probably we should have a separate area altoghether to
> map this reserved area than expanding existing kernel text/data region.
I'm not expanding anything, I'm just relocating the kernel a bit above
40M. I guess kdump is happy enough with lower addresses, or it could
never work. So this is just a mainline bugfix so that the relocation
address can be higher than 40M without crashing at boot. There is zero
overhead even if you use the feature, and it's a no-op for the default
config.
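For reference, the arithmetic behind the two changed macros can be checked with a small user-space sketch. The function names below are made up for the example; the constants are the ones visible in the page_64.h and pgtable_64.h hunks, and the 0x10000000 load address is taken from the /proc/iomem dump in the first mail.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_PHYSICAL_START	0x200000ULL		/* x86_64 default */
#define BASE_KERNEL_TEXT_SIZE	(40ULL * 1024 * 1024)	/* unpatched value */
#define BASE_MODULES_VADDR	0xffffffff88000000ULL	/* unpatched value */

/* Mirrors __PHYSICAL_OFFSET: distance from the default load address. */
uint64_t physical_offset(uint64_t physical_start)
{
	assert(physical_start >= DEFAULT_PHYSICAL_START);
	return physical_start - DEFAULT_PHYSICAL_START;
}

/* Mirrors the patched KERNEL_TEXT_SIZE. */
uint64_t kernel_text_size(uint64_t physical_start)
{
	return BASE_KERNEL_TEXT_SIZE + physical_offset(physical_start);
}

/* Mirrors the patched MODULES_VADDR, shifted by the same offset. */
uint64_t modules_vaddr(uint64_t physical_start)
{
	return BASE_MODULES_VADDR + physical_offset(physical_start);
}
```

With the default PHYSICAL_START of 0x200000 the offset is zero, so both values are unchanged. With PHYSICAL_START at 0x10000000 the offset is 0x0fe00000, the text size becomes 0x12600000 (294M) and the modules area starts at 0xffffffff97e00000.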
Andrea Arcangeli <[email protected]> writes:
> Hello,
>
> this patch allows to prevent linux from using the ram below
> PHYSICAL_START.
>
> The "reserved RAM" can be mapped by virtualization software with to
> create a 1:1 mapping between guest physical (bus) address and host
> physical (bus) address.
Wouldn't it be easier if your virtualization software just marked
that area reserved or unmapped in its e820 map?
Or if you don't want that you can get the same result with mem=...
arguments (e.g. commonly used by crash dumping)
Even if all that was not possible for some reason, having a CONFIG for this
would seem unfortunate to me -- I don't think users really want specially
compiled kernels for specific hypervisors. With paravirt Linux
is trying to get away from that. Some runtime setup method
would be much better.
-Andi
Hi Andi,
On Mon, Mar 03, 2008 at 01:17:46PM +0100, Andi Kleen wrote:
> Andrea Arcangeli <[email protected]> writes:
>
> > Hello,
> >
> > this patch allows to prevent linux from using the ram below
> > PHYSICAL_START.
> >
> > The "reserved RAM" can be mapped by virtualization software with to
> > create a 1:1 mapping between guest physical (bus) address and host
> > physical (bus) address.
>
> Wouldn't it be easier if your virtualization software just marked
> that area reserved or unmapped in its e820 map?
>
> Or if you don't want that you can get the same result with mem=...
> arguments (e.g. commonly used by crash dumping)
Would all bootloaders and OSes be capable of booting with a virtualized
e820 map that marks everything below 256M as reserved (a host needs
at least 256M of ram to avoid swapping if somebody tries to log in to
kde)? How would real mode dma run at all when the host is booted with
mem=256M? I didn't verify it in practice, but before starting this I
assumed that if it really works it would be mostly by luck... not
ideal for a virtualization solution that aims to be generic.
The only bits that won't be generic will be the page at address zero and
the two trampoline pages, but besides those 3 pages, all other ram below 1M
will be marked as available ram in the virtualized e820
map. And hopefully nobody does DMA to those 3 pages marked reserved in
the virtualized e820 map (the two trampoline pages can be moved just
before phys address 640k with a fully orthogonal patch to greatly
decrease the risk of bootloader issues; I'm deferring that patch until
I've tested some bootloader/OS combinations with the ~0x6000 address).
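As a rough illustration of the page accounting, the sketch below takes the low-memory ranges from the /proc/iomem output in the first mail and classifies them the way the virtualized e820 map would: "reserved RAM" becomes guest RAM, "reserved RAM failed" becomes guest reserved. The struct and function names are hypothetical and do not come from the patch or from any KVM code.

```c
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096ULL

enum guest_type { GUEST_RAM, GUEST_RESERVED };

struct low_range {
    uint64_t start;  /* first byte */
    uint64_t end;    /* last byte, inclusive, as printed by /proc/iomem */
    int failed;      /* 1 if the host shows "reserved RAM failed" */
};

/* Host ranges below 640k, copied from the /proc/iomem listing. */
static const struct low_range host_low[] = {
    { 0x00000000, 0x00000fff, 1 },  /* page at address zero */
    { 0x00001000, 0x00005fff, 0 },
    { 0x00006000, 0x00007fff, 1 },  /* two trampoline pages */
    { 0x00008000, 0x0009efff, 0 },
};
#define HOST_LOW_COUNT (sizeof(host_low) / sizeof(host_low[0]))

/* "reserved RAM failed" cannot be identity mapped, so hide it from the guest. */
static enum guest_type guest_type_for(const struct low_range *r)
{
    return r->failed ? GUEST_RESERVED : GUEST_RAM;
}

/* Count the 4k pages that end up with the given type in the guest map. */
static uint64_t pages_of_type(enum guest_type type)
{
    uint64_t pages = 0;

    for (size_t i = 0; i < HOST_LOW_COUNT; i++)
        if (guest_type_for(&host_low[i]) == type)
            pages += (host_low[i].end - host_low[i].start + 1) / GUEST_PAGE_SIZE;
    return pages;
}
```

For the listing above this gives 3 reserved pages and 156 pages of guest RAM below 0x9f000.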
> Even if all that was not possible for some reason, having a CONFIG for this
> would seem unfortunate to me -- I don't think users really want specially
> compiled kernels for specific hypervisors. With paravirt Linux
> is trying to get away from that. Some runtime setup method
> would be much better.
You're right, but the relocatable kernel only works if you relocate it
to very low addresses (see MODULES_VADDR/KERNEL_IMAGE_SIZE). I fixed
that for the compile-time approach I've taken, but fixing that for the
relocatable kernel, so the kernel can relocate itself to address 900M
physical before jumping to long mode, requires many more changes,
including moving all of memparse/strtoul/vsprintf to arch/x86/boot to
compile them in 32bit, so the kernel command line can be parsed in 32bit
non-paging mode to extract the relocation address before jumping to
paging long mode.
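For a sense of what would have to live in the early 32bit code, here is a minimal freestanding sketch of a size parser in the spirit of memparse. It is a hypothetical stand-in, not the kernel's implementation: it only accepts decimal digits with an optional K/M/G suffix, and the name parse_size is invented for this example.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for memparse(): parse a decimal number with an
 * optional K/M/G suffix. It makes no libc calls, so in principle it could
 * be built for an early, non-paging environment.
 */
static uint64_t parse_size(const char *s)
{
    uint64_t val = 0;

    /* Accumulate leading decimal digits. */
    while (*s >= '0' && *s <= '9')
        val = val * 10 + (uint64_t)(*s++ - '0');

    /* Each suffix step multiplies by 1024; G falls through M and K. */
    switch (*s) {
    case 'G':
    case 'g':
        val <<= 10;
        /* fall through */
    case 'M':
    case 'm':
        val <<= 10;
        /* fall through */
    case 'K':
    case 'k':
        val <<= 10;
        break;
    default:
        break;
    }
    return val;
}
```

Something like parse_size("900M") would yield the 900M relocation address used as the example above. The real work is not the parser itself but making it, and everything it depends on, build for 32bit.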
My compile-time approach doesn't slow down kernel module
allocation, and it remains a small and relatively simple change to the
e820 map code. Hopefully KVM pci-passthrough without VT-d is done in
standard setups so the compile-time approach will not be a big
limitation. So from a mainline kernel point of view, given this is
only needed in the short term because currently sold CPUs lack VT-d,
the smaller the change to allow pci-passthrough, the better. The
relocatable approach would be a much bigger change. Also note this
only works up to an address near 1G; we can't reserve more than 1G with
this (extending over 1G requires even more changes). But an 800-900M
guest with pci-passthrough is surely enough right now (extending this to
2G is very easy with an incremental patch, extending over 2G is not
easy).
And if you're right and we later find that everybody needs
pci-passthrough on every new system without recompiling the host
kernel, we can always switch to a relocatable kernel without changing
the userland API at all (/proc/iomem will show "reserved RAM" and
"reserved RAM failed" the same way as today, kvm userland won't notice
the difference). So I wouldn't worry so much about this being a
compile time thing to start with, given this avoids polluting the
kernel for a short-term matter.
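Since /proc/iomem is the whole userland interface here, the consumer side can be sketched in a few lines. This is only an illustration of how a tool might pick out the two labels; the enum and function names are hypothetical and are not taken from kvm userland.

```c
#include <stdio.h>
#include <string.h>

enum iomem_kind {
    IOMEM_OTHER,
    IOMEM_RESERVED_RAM,
    IOMEM_RESERVED_RAM_FAILED,
};

/*
 * Parse one /proc/iomem line of the form "start-end : name" and report
 * whether it is one of the two labels discussed in this thread. The label
 * must match exactly, so plain "reserved" and "System RAM" are "other".
 */
static enum iomem_kind parse_iomem_line(const char *line,
                                        unsigned long long *start,
                                        unsigned long long *end)
{
    char name[64];

    if (sscanf(line, "%llx-%llx : %63[^\n]", start, end, name) != 3)
        return IOMEM_OTHER;
    if (strcmp(name, "reserved RAM failed") == 0)
        return IOMEM_RESERVED_RAM_FAILED;
    if (strcmp(name, "reserved RAM") == 0)
        return IOMEM_RESERVED_RAM;
    return IOMEM_OTHER;
}
```

A caller would read /proc/iomem line by line with fgets() and keep the ranges tagged IOMEM_RESERVED_RAM as candidates for the 1:1 mapping. Because only the label strings matter, this code would not need to change if the kernel side later moved from the compile-time option to a relocatable kernel.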
In fact the only thing I'd worry about _right_now_ is the fact there's
no API in /proc/iomem to mark "reserved RAM" regions as
"busy". However given you also need to be root to map from /dev/mem I
don't think it's a big deal.
Thanks for the comments.