2006-05-24 04:40:32

by Magnus Damm

[permalink] [raw]
Subject: [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

kexec: Avoid overwriting the current pgd (V2)

This patch updates the kexec code for i386 and x86_64 to avoid overwriting
the current pgd. For most people is overwriting the current pgd is not a big
problem. When kexec:ing into a new kernel that reinitializes and makes use of
all memory we don't care about saving state.

But overwriting the current pgd is not a good solution in the case of kdump
(CONFIG_CRASH_DUMP) where we want to preserve as much state as possible when
a crash occurs. This patch solves the overwriting issue.

20060524: V2

- Broke out architecture-specific data structures into asm/kexec.h
- Fixed a i386/PAE page table problem only triggering on real hardware.
- Moved segment handling code into the assembly routines.

20060501: V1

- First release

Signed-off-by: Magnus Damm <[email protected]>


2006-05-24 04:40:38

by Magnus Damm

[permalink] [raw]
Subject: [PATCH 01/03] kexec: Avoid overwriting the current pgd (V2, stubs)

kexec: Avoid overwriting the current pgd (V2, stubs)

This patch adds an architecture specific structure "struct kimage_arch" to
struct kimage. This structure is filled in with members by the architecture
specific patches followed by this one.

Signed-off-by: Magnus Damm <[email protected]>
---

Applies on top of 2.6.16 and 2.6.17-rc4.

asm-i386/kexec.h | 2 ++
asm-powerpc/kexec.h | 2 ++
asm-s390/kexec.h | 2 ++
asm-sh/kexec.h | 2 ++
asm-x86_64/kexec.h | 2 ++
linux/kexec.h | 2 ++
6 files changed, 12 insertions(+)

--- 0001/include/asm-i386/kexec.h
+++ work/include/asm-i386/kexec.h 2006-05-19 11:54:07.000000000 +0900
@@ -29,6 +29,8 @@

#define MAX_NOTE_BYTES 1024

+struct kimage_arch {};
+
/* CPU does not save ss and esp on stack if execution is already
* running in kernel mode at the time of NMI occurrence. This code
* fixes it.
--- 0001/include/asm-powerpc/kexec.h
+++ work/include/asm-powerpc/kexec.h 2006-05-18 11:13:13.000000000 +0900
@@ -108,6 +108,8 @@ static inline void crash_setup_regs(stru

#define MAX_NOTE_BYTES 1024

+struct kimage_arch {};
+
#ifdef __powerpc64__
extern void kexec_smp_wait(void); /* get and clear naca physid, wait for
master to copy new code to 0 */
--- 0001/include/asm-s390/kexec.h
+++ work/include/asm-s390/kexec.h 2006-05-18 11:13:13.000000000 +0900
@@ -36,6 +36,8 @@

#define MAX_NOTE_BYTES 1024

+struct kimage_arch {};
+
/* Provide a dummy definition to avoid build failures. */
static inline void crash_setup_regs(struct pt_regs *newregs,
struct pt_regs *oldregs) { }
--- 0001/include/asm-sh/kexec.h
+++ work/include/asm-sh/kexec.h 2006-05-18 11:13:13.000000000 +0900
@@ -25,6 +25,8 @@

#ifndef __ASSEMBLY__

+struct kimage_arch {};
+
extern void machine_shutdown(void);
extern void *crash_notes;

--- 0001/include/asm-x86_64/kexec.h
+++ work/include/asm-x86_64/kexec.h 2006-05-18 11:13:13.000000000 +0900
@@ -29,6 +29,8 @@

#define MAX_NOTE_BYTES 1024

+struct kimage_arch {};
+
/*
* Saving the registers of the cpu on which panic occured in
* crash_kexec to save a valid sp. The registers of other cpus
--- 0001/include/linux/kexec.h
+++ work/include/linux/kexec.h 2006-05-18 11:13:13.000000000 +0900
@@ -69,6 +69,8 @@ struct kimage {
unsigned long start;
struct page *control_code_page;

+ struct kimage_arch arch_data;
+
unsigned long nr_segments;
struct kexec_segment segment[KEXEC_SEGMENT_MAX];

2006-05-24 04:40:58

by Magnus Damm

[permalink] [raw]
Subject: [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

kexec: Avoid overwriting the current pgd (V2, x86_64)

This patch upgrades the x86_64-specific kexec code to avoid overwriting the
current pgd. Overwriting the current pgd is bad when CONFIG_CRASH_DUMP is used
to start a secondary kernel that dumps the memory of the previous kernel.

The code introduces a new set of page tables called "page_table_a". These
tables are used to provide an executable identity mapping without overwriting
the current pgd. The already existing page table is renamed to "page_table_b".

KEXEC_CONTROL_CODE_SIZE is changed into a single page. This updated version of
the patch also moves the segment handling code into the reloacte_kernel.S.

Signed-off-by: Magnus Damm <[email protected]>
---

The patch has been tested with regular kexec and CONFIG_CRASH_DUMP.
Applies on top of 2.6.16 and 2.6.17-rc4.

arch/x86_64/kernel/machine_kexec.c | 193 +++++++++++++++++-----------------
arch/x86_64/kernel/relocate_kernel.S | 84 +++++++++++++-
include/asm-x86_64/kexec.h | 15 ++
3 files changed, 189 insertions(+), 103 deletions(-)

--- 0001/arch/x86_64/kernel/machine_kexec.c
+++ work/arch/x86_64/kernel/machine_kexec.c 2006-05-19 12:09:39.000000000 +0900
@@ -2,6 +2,10 @@
* machine_kexec.c - handle transition of Linux booting another kernel
* Copyright (C) 2002-2005 Eric Biederman <[email protected]>
*
+ * 2006-05-19 Magnus Damm <[email protected]>:
+ * - rewrote identity map code to avoid overwriting current pgd
+ * - moved segment handling code into relocate_kernel.S
+ *
* This source code is licensed under the GNU General Public License,
* Version 2. See the file COPYING for more details.
*/
@@ -96,81 +100,110 @@ out:
}


-static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
+static int create_page_table_b(struct kimage *image)
{
- pgd_t *level4p;
- level4p = (pgd_t *)__va(start_pgtable);
- return init_level4_page(image, level4p, 0, end_pfn << PAGE_SHIFT);
-}
+ struct kimage_arch *arch = &image->arch_data;

-static void set_idt(void *newidt, u16 limit)
-{
- struct desc_ptr curidt;
+ arch->page_table_b = kimage_alloc_control_pages(image, 0);

- /* x86-64 supports unaliged loads & stores */
- curidt.size = limit;
- curidt.address = (unsigned long)newidt;
+ if (!arch->page_table_b)
+ return -ENOMEM;

- __asm__ __volatile__ (
- "lidtq %0\n"
- : : "m" (curidt)
- );
-};
+ return init_level4_page(image, page_address(arch->page_table_b),
+ 0, end_pfn << PAGE_SHIFT);
+}

+typedef NORET_TYPE void (*relocate_new_kernel_t)(unsigned long indirection_page,
+ unsigned long control_code_buffer,
+ unsigned long start_address,
+ unsigned long page_table_a,
+ unsigned long page_table_b) ATTRIB_NORET;
+
+const extern unsigned char relocate_new_kernel[];
+const extern unsigned long relocate_new_kernel_size;

-static void set_gdt(void *newgdt, u16 limit)
+static int allocate_page_table_a(struct kimage *image)
{
- struct desc_ptr curgdt;
+ struct kimage_arch *arch = &image->arch_data;
+ struct page *page;
+ int k = sizeof(arch->page_table_a) / sizeof(arch->page_table_a[0]);

- /* x86-64 supports unaligned loads & stores */
- curgdt.size = limit;
- curgdt.address = (unsigned long)newgdt;
+ for (; k > 0; k--) {
+ page = kimage_alloc_control_pages(image, 0);
+ if (!page)
+ return -ENOMEM;

- __asm__ __volatile__ (
- "lgdtq %0\n"
- : : "m" (curgdt)
- );
-};
+ clear_page(page_address(page));
+ arch->page_table_a[k - 1] = page;
+ }

-static void load_segments(void)
-{
- __asm__ __volatile__ (
- "\tmovl %0,%%ds\n"
- "\tmovl %0,%%es\n"
- "\tmovl %0,%%ss\n"
- "\tmovl %0,%%fs\n"
- "\tmovl %0,%%gs\n"
- : : "a" (__KERNEL_DS) : "memory"
- );
+ return 0;
}

-typedef NORET_TYPE void (*relocate_new_kernel_t)(unsigned long indirection_page,
- unsigned long control_code_buffer,
- unsigned long start_address,
- unsigned long pgtable) ATTRIB_NORET;
+#define _PAGE_KERNEL_EXEC __PAGE_KERNEL_EXEC
+#define pa_page(page) __pa_symbol(page_address(page)) /* __pa() miscompiles */

-const extern unsigned char relocate_new_kernel[];
-const extern unsigned long relocate_new_kernel_size;
+static int create_mapping(struct page *root, struct page **pages,
+ unsigned long va, unsigned long pa)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ int k = 0;
+
+ pgd = (pgd_t *)page_address(root) + pgd_index(va);
+ if (!pgd_present(*pgd))
+ set_pgd(pgd, __pgd(pa_page(pages[k++]) | _KERNPG_TABLE));
+
+ pud = pud_offset(pgd, va);
+ if (!pud_present(*pud))
+ set_pud(pud, __pud(pa_page(pages[k++]) | _KERNPG_TABLE));
+
+ pmd = pmd_offset(pud, va);
+ if (!pmd_present(*pmd))
+ set_pmd(pmd, __pmd(pa_page(pages[k++]) | _KERNPG_TABLE));
+
+ pte = (pte_t *)page_address(pmd_page(*pmd)) + pte_index(va);
+ set_pte(pte, __pte(pa | _PAGE_KERNEL_EXEC));
+
+ return k;
+}

int machine_kexec_prepare(struct kimage *image)
{
- unsigned long start_pgtable, control_code_buffer;
- int result;
+ void *control_page;
+ unsigned long pa;
+ int k;

- /* Calculate the offsets */
- start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
- control_code_buffer = start_pgtable + PAGE_SIZE;
-
- /* Setup the identity mapped 64bit page table */
- result = init_pgtable(image, start_pgtable);
- if (result)
- return result;
-
- /* Place the code in the reboot code buffer */
- memcpy(__va(control_code_buffer), relocate_new_kernel,
- relocate_new_kernel_size);
+ memset(&image->arch_data, 0, sizeof(image->arch_data));

- return 0;
+ k = allocate_page_table_a(image);
+ if (k)
+ return k;
+
+ /* fill in control_page with assembly code */
+
+ control_page = page_address(image->control_code_page);
+ memcpy(control_page, relocate_new_kernel, relocate_new_kernel_size);
+
+ /* map the control_page at the virtual address of relocate_kernel.S */
+
+ pa = __pa(control_page);
+
+ k = create_mapping(image->arch_data.page_table_a[0],
+ &image->arch_data.page_table_a[1],
+ (unsigned long)relocate_new_kernel, pa);
+
+ /* identity map the control_page */
+
+ create_mapping(image->arch_data.page_table_a[0],
+ &image->arch_data.page_table_a[k + 1],
+ pa, pa);
+
+ /* create identity mapped page table aka page_table_b */
+
+ return create_page_table_b(image);
}

void machine_kexec_cleanup(struct kimage *image)
@@ -185,47 +218,17 @@ void machine_kexec_cleanup(struct kimage
NORET_TYPE void machine_kexec(struct kimage *image)
{
unsigned long page_list;
- unsigned long control_code_buffer;
- unsigned long start_pgtable;
+ unsigned long control_code;
+ unsigned long page_table_a;
+ unsigned long page_table_b;
relocate_new_kernel_t rnk;

- /* Interrupts aren't acceptable while we reboot */
- local_irq_disable();
-
- /* Calculate the offsets */
page_list = image->head;
- start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
- control_code_buffer = start_pgtable + PAGE_SIZE;
+ control_code = __pa(page_address(image->control_code_page));
+ page_table_a = __pa(page_address(image->arch_data.page_table_a[0]));
+ page_table_b = __pa(page_address(image->arch_data.page_table_b));

- /* Set the low half of the page table to my identity mapped
- * page table for kexec. Leave the high half pointing at the
- * kernel pages. Don't bother to flush the global pages
- * as that will happen when I fully switch to my identity mapped
- * page table anyway.
- */
- memcpy(__va(read_cr3()), __va(start_pgtable), PAGE_SIZE/2);
- __flush_tlb();
-
-
- /* The segment registers are funny things, they are
- * automatically loaded from a table, in memory wherever you
- * set them to a specific selector, but this table is never
- * accessed again unless you set the segment to a different selector.
- *
- * The more common model are caches where the behide
- * the scenes work is done, but is also dropped at arbitrary
- * times.
- *
- * I take advantage of this here by force loading the
- * segments, before I zap the gdt with an invalid value.
- */
- load_segments();
- /* The gdt & idt are now invalid.
- * If you want to load them you must set up your own idt & gdt.
- */
- set_gdt(phys_to_virt(0),0);
- set_idt(phys_to_virt(0),0);
/* now call it */
- rnk = (relocate_new_kernel_t) control_code_buffer;
- (*rnk)(page_list, control_code_buffer, image->start, start_pgtable);
+ rnk = (relocate_new_kernel_t) relocate_new_kernel;
+ (*rnk)(page_list, control_code, image->start, page_table_a, page_table_b);
}
--- 0001/arch/x86_64/kernel/relocate_kernel.S
+++ work/arch/x86_64/kernel/relocate_kernel.S 2006-05-19 12:11:13.000000000 +0900
@@ -2,11 +2,18 @@
* relocate_kernel.S - put the kernel image in place to boot
* Copyright (C) 2002-2005 Eric Biederman <[email protected]>
*
+ * 2006-05-19 Magnus Damm <[email protected]>:
+ * - moved segment handling code from machine_kexec.c
+ *
* This source code is licensed under the GNU General Public License,
* Version 2. See the file COPYING for more details.
*/

#include <linux/linkage.h>
+#include <asm/page.h>
+
+.text
+.align (1 << PAGE_SHIFT)

/*
* Must be relocatable PIC code callable as a C function, that once
@@ -18,21 +25,69 @@ relocate_new_kernel:
/* %rdi page_list
* %rsi reboot_code_buffer
* %rdx start address
- * %rcx page_table
- * %r8 arg5
+ * %rcx page_table_a
+ * %r8 page_table_b
* %r9 arg6
*/
-
+
/* zero out flags, and disable interrupts */
pushq $0
popfq

+ /* switch to page_table_a */
+ movq %rcx, %cr3
+
+ /* setup idt */
+
+ movq %rsi, %rax
+ addq $(idt_48 - relocate_new_kernel), %rax
+ lidtq (%rax)
+
+ /* setup gdt */
+
+ movq %rsi, %rax
+ addq $(gdt - relocate_new_kernel), %rax
+ movq %rsi, %r9
+ addq $((gdt_48 - relocate_new_kernel) + 2), %r9
+ movq %rax, (%r9)
+
+ movq %rsi, %rax
+ addq $(gdt_48 - relocate_new_kernel), %rax
+ lgdtq (%rax)
+
+ /* setup data segment registers */
+
+ xorl %eax,%eax
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %fs
+ movl %eax, %gs
+ movl %eax, %ss
+
/* set a new stack at the bottom of our page... */
lea 4096(%rsi), %rsp

+ /* load new code segment */
+
+ movq %rsp, %rcx
+ xorq %rax, %rax
+ pushq %rax /* SS */
+ pushq %rcx /* ESP */
+ pushq %rax /* RFLAGS */
+
+ movq $(gdt_code - gdt), %rax
+ pushq %rax /* CS */
+
+ movq %rsi, %rax
+ addq $(identity_mapped - relocate_new_kernel), %rax
+ pushq %rax /* RIP */
+
+ iretq
+
+identity_mapped:
/* store the parameters back on the stack */
pushq %rdx /* store the start address */
-
+
/* Set cr0 to a known state:
* 31 1 == Paging enabled
* 18 0 == Alignment check disabled
@@ -69,7 +124,7 @@ relocate_new_kernel:
/* Switch to the identity mapped page tables,
* and flush the TLB.
*/
- movq %rcx, %cr3
+ movq %r8, %cr3

/* Do the copies */
movq %rdi, %rcx /* Put the page_list in %rcx */
@@ -136,6 +191,25 @@ relocate_new_kernel:
xorq %r15, %r15

ret
+ .align 16
+gdt:
+ .long 0x00000000 /* NULL descriptor */
+ .long 0x00000000
+gdt_code:
+ .long 0x00000000 /* code descriptor */
+ .long 0x00209800
+
+gdt_end:
+ .align 4
+
+idt_48:
+ .word 0 # idt limit = 0
+ .quad 0, 0 # idt base = 0L
+
+gdt_48:
+ .word gdt_end - gdt - 1 # gdt limit
+ .quad 0, 0 # gdt base (filled in later)
+
relocate_new_kernel_end:

.globl relocate_new_kernel_size
--- 0002/include/asm-x86_64/kexec.h
+++ work/include/asm-x86_64/kexec.h 2006-05-19 12:07:33.000000000 +0900
@@ -21,15 +21,24 @@
/* Maximum address we can use for the control pages */
#define KEXEC_CONTROL_MEMORY_LIMIT (0xFFFFFFFFFFUL)

-/* Allocate one page for the pdp and the second for the code */
-#define KEXEC_CONTROL_CODE_SIZE (4096UL + 4096UL)
+#define KEXEC_CONTROL_CODE_SIZE 4096

/* The native architecture */
#define KEXEC_ARCH KEXEC_ARCH_X86_64

#define MAX_NOTE_BYTES 1024

-struct kimage_arch {};
+struct kimage_arch {
+ /* page_table_a[] holds enough pages to create a new page table
+ * that maps the control page twice..
+ *
+ * page_table_b points to the root page of a page table which is used
+ * to provide identity mapping of all ram.
+ */
+
+ struct page *page_table_a[7]; /* 2 * (pte + pud + pmd) + pgd */
+ struct page *page_table_b;
+};

/*
* Saving the registers of the cpu on which panic occured in

2006-05-24 04:41:02

by Magnus Damm

[permalink] [raw]
Subject: [PATCH 02/03] kexec: Avoid overwriting the current pgd (V2, i386)

kexec: Avoid overwriting the current pgd (V2, i386)

This patch upgrades the i386-specific kexec code to avoid overwriting the
current pgd. Overwriting the current pgd is bad when CONFIG_CRASH_DUMP is used
to start a secondary kernel that dumps the memory of the previous kernel.

The code introduces a new set of page tables called "page_table_a". These
tables are used to provide an executable identity mapping without overwriting
the current pgd. This updated version of the patch fixes a PAE bug and moves
the segment handling code into the reloacte_kernel.S.

Signed-off-by: Magnus Damm <[email protected]>
---

The patch has been tested with regular kexec and CONFIG_CRASH_DUMP.
Both PAE and non-PAE configurations work well.
Applies on top of 2.6.16 and 2.6.17-rc4.

arch/i386/kernel/machine_kexec.c | 230 ++++++++++++++----------------------
arch/i386/kernel/relocate_kernel.S | 92 ++++++++++++++
include/asm-i386/kexec.h | 12 +
3 files changed, 192 insertions(+), 142 deletions(-)

--- 0001/arch/i386/kernel/machine_kexec.c
+++ work/arch/i386/kernel/machine_kexec.c 2006-05-22 15:57:50.000000000 +0900
@@ -2,6 +2,10 @@
* machine_kexec.c - handle transition of Linux booting another kernel
* Copyright (C) 2002-2005 Eric Biederman <[email protected]>
*
+ * 2006-05-19 Magnus Damm <[email protected]>:
+ * - rewrote identity map code to avoid overwriting current pgd
+ * - moved segment handling code into relocate_kernel.S
+ *
* This source code is licensed under the GNU General Public License,
* Version 2. See the file COPYING for more details.
*/
@@ -19,123 +23,73 @@
#include <asm/desc.h>
#include <asm/system.h>

-#define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
-
-#define L0_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
-#define L1_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
-#define L2_ATTR (_PAGE_PRESENT)
-
-#define LEVEL0_SIZE (1UL << 12UL)
+typedef asmlinkage NORET_TYPE void (*relocate_new_kernel_t)(
+ unsigned long indirection_page,
+ unsigned long reboot_code_buffer,
+ unsigned long start_address,
+ unsigned long page_table_a,
+ unsigned long has_pae) ATTRIB_NORET;

-#ifndef CONFIG_X86_PAE
-#define LEVEL1_SIZE (1UL << 22UL)
-static u32 pgtable_level1[1024] PAGE_ALIGNED;
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;

-static void identity_map_page(unsigned long address)
+static int allocate_page_table_a(struct kimage *image)
{
- unsigned long level1_index, level2_index;
- u32 *pgtable_level2;
-
- /* Find the current page table */
- pgtable_level2 = __va(read_cr3());
+ struct kimage_arch *arch = &image->arch_data;
+ struct page *page;
+ int k = sizeof(arch->page_table_a) / sizeof(arch->page_table_a[0]);
+
+ for (; k > 0; k--) {
+ page = kimage_alloc_control_pages(image, 0);
+ if (!page)
+ return -ENOMEM;
+
+ clear_page(page_address(page));
+ arch->page_table_a[k - 1] = page;
+ }

- /* Find the indexes of the physical address to identity map */
- level1_index = (address % LEVEL1_SIZE)/LEVEL0_SIZE;
- level2_index = address / LEVEL1_SIZE;
-
- /* Identity map the page table entry */
- pgtable_level1[level1_index] = address | L0_ATTR;
- pgtable_level2[level2_index] = __pa(pgtable_level1) | L1_ATTR;
-
- /* Flush the tlb so the new mapping takes effect.
- * Global tlb entries are not flushed but that is not an issue.
- */
- load_cr3(pgtable_level2);
+ return 0;
}

-#else
-#define LEVEL1_SIZE (1UL << 21UL)
-#define LEVEL2_SIZE (1UL << 30UL)
-static u64 pgtable_level1[512] PAGE_ALIGNED;
-static u64 pgtable_level2[512] PAGE_ALIGNED;
-
-static void identity_map_page(unsigned long address)
-{
- unsigned long level1_index, level2_index, level3_index;
- u64 *pgtable_level3;
+/* workaround for include/asm-i386/pgtable-3level.h */

- /* Find the current page table */
- pgtable_level3 = __va(read_cr3());
-
- /* Find the indexes of the physical address to identity map */
- level1_index = (address % LEVEL1_SIZE)/LEVEL0_SIZE;
- level2_index = (address % LEVEL2_SIZE)/LEVEL1_SIZE;
- level3_index = address / LEVEL2_SIZE;
-
- /* Identity map the page table entry */
- pgtable_level1[level1_index] = address | L0_ATTR;
- pgtable_level2[level2_index] = __pa(pgtable_level1) | L1_ATTR;
- set_64bit(&pgtable_level3[level3_index],
- __pa(pgtable_level2) | L2_ATTR);
-
- /* Flush the tlb so the new mapping takes effect.
- * Global tlb entries are not flushed but that is not an issue.
- */
- load_cr3(pgtable_level3);
-}
+#ifdef CONFIG_X86_PAE
+#undef pgd_present
+#define pgd_present(pgd) (pgd_val(pgd) & _PAGE_PRESENT)
+#define _PGD_ATTR _PAGE_PRESENT
+#else
+#define _PGD_ATTR _KERNPG_TABLE
#endif

-static void set_idt(void *newidt, __u16 limit)
-{
- struct Xgt_desc_struct curidt;
-
- /* ia32 supports unaliged loads & stores */
- curidt.size = limit;
- curidt.address = (unsigned long)newidt;
-
- load_idt(&curidt);
-};
+#define pa_page(page) __pa(page_address(page))

-
-static void set_gdt(void *newgdt, __u16 limit)
+static int create_mapping(struct page *root, struct page **pages,
+ unsigned long va, unsigned long pa)
{
- struct Xgt_desc_struct curgdt;
-
- /* ia32 supports unaligned loads & stores */
- curgdt.size = limit;
- curgdt.address = (unsigned long)newgdt;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ int k = 0;

- load_gdt(&curgdt);
-};
+ pgd = (pgd_t *)page_address(root) + pgd_index(va);
+ if (!pgd_present(*pgd))
+ set_pgd(pgd, __pgd(pa_page(pages[k++]) | _PGD_ATTR));

-static void load_segments(void)
-{
-#define __STR(X) #X
-#define STR(X) __STR(X)
+ pud = pud_offset(pgd, va);
+ if (!pud_present(*pud))
+ set_pud(pud, __pud(pa_page(pages[k++]) | _KERNPG_TABLE));

- __asm__ __volatile__ (
- "\tljmp $"STR(__KERNEL_CS)",$1f\n"
- "\t1:\n"
- "\tmovl $"STR(__KERNEL_DS)",%%eax\n"
- "\tmovl %%eax,%%ds\n"
- "\tmovl %%eax,%%es\n"
- "\tmovl %%eax,%%fs\n"
- "\tmovl %%eax,%%gs\n"
- "\tmovl %%eax,%%ss\n"
- ::: "eax", "memory");
-#undef STR
-#undef __STR
-}
+ pmd = pmd_offset(pud, va);
+ if (!pmd_present(*pmd))
+ set_pmd(pmd, __pmd(pa_page(pages[k++]) | _KERNPG_TABLE));

-typedef asmlinkage NORET_TYPE void (*relocate_new_kernel_t)(
- unsigned long indirection_page,
- unsigned long reboot_code_buffer,
- unsigned long start_address,
- unsigned int has_pae) ATTRIB_NORET;
+ pte = (pte_t *)page_address(pmd_page(*pmd)) + pte_index(va);
+ set_pte(pte, __pte(pa | _PAGE_KERNEL_EXEC));

-const extern unsigned char relocate_new_kernel[];
-extern void relocate_new_kernel_end(void);
-const extern unsigned int relocate_new_kernel_size;
+ return k;
+}

/*
* A architecture hook called to validate the
@@ -147,11 +101,38 @@ const extern unsigned int relocate_new_k
* Do what every setup is needed on image and the
* reboot code buffer to allow us to avoid allocations
* later.
- *
- * Currently nothing.
*/
int machine_kexec_prepare(struct kimage *image)
{
+ void *control_page;
+ unsigned long pa;
+ int k;
+
+ memset(&image->arch_data, 0, sizeof(image->arch_data));
+
+ k = allocate_page_table_a(image);
+ if (k)
+ return k;
+
+ /* fill in control_page with assembly code */
+
+ control_page = page_address(image->control_code_page);
+ memcpy(control_page, relocate_new_kernel, relocate_new_kernel_size);
+
+ /* map the control_page at the virtual address of relocate_kernel.S */
+
+ pa = __pa(control_page);
+
+ k = create_mapping(image->arch_data.page_table_a[0],
+ &image->arch_data.page_table_a[1],
+ (unsigned long)relocate_new_kernel, pa);
+
+ /* identity map the control_page */
+
+ create_mapping(image->arch_data.page_table_a[0],
+ &image->arch_data.page_table_a[k + 1],
+ pa, pa);
+
return 0;
}

@@ -170,45 +151,16 @@ void machine_kexec_cleanup(struct kimage
NORET_TYPE void machine_kexec(struct kimage *image)
{
unsigned long page_list;
- unsigned long reboot_code_buffer;
-
+ unsigned long control_code;
+ unsigned long page_table_a;
relocate_new_kernel_t rnk;

- /* Interrupts aren't acceptable while we reboot */
- local_irq_disable();
-
- /* Compute some offsets */
- reboot_code_buffer = page_to_pfn(image->control_code_page)
- << PAGE_SHIFT;
page_list = image->head;
-
- /* Set up an identity mapping for the reboot_code_buffer */
- identity_map_page(reboot_code_buffer);
-
- /* copy it out */
- memcpy((void *)reboot_code_buffer, relocate_new_kernel,
- relocate_new_kernel_size);
-
- /* The segment registers are funny things, they are
- * automatically loaded from a table, in memory wherever you
- * set them to a specific selector, but this table is never
- * accessed again you set the segment to a different selector.
- *
- * The more common model is are caches where the behide
- * the scenes work is done, but is also dropped at arbitrary
- * times.
- *
- * I take advantage of this here by force loading the
- * segments, before I zap the gdt with an invalid value.
- */
- load_segments();
- /* The gdt & idt are now invalid.
- * If you want to load them you must set up your own idt & gdt.
- */
- set_gdt(phys_to_virt(0),0);
- set_idt(phys_to_virt(0),0);
+ control_code = __pa(page_address(image->control_code_page));
+ page_table_a = __pa(page_address(image->arch_data.page_table_a[0]));

/* now call it */
- rnk = (relocate_new_kernel_t) reboot_code_buffer;
- (*rnk)(page_list, reboot_code_buffer, image->start, cpu_has_pae);
+ rnk = (relocate_new_kernel_t) relocate_new_kernel;
+ (*rnk)(page_list, control_code, image->start,
+ page_table_a, (unsigned long)cpu_has_pae);
}
--- 0001/arch/i386/kernel/relocate_kernel.S
+++ work/arch/i386/kernel/relocate_kernel.S 2006-05-22 12:55:37.000000000 +0900
@@ -2,12 +2,20 @@
* relocate_kernel.S - put the kernel image in place to boot
* Copyright (C) 2002-2004 Eric Biederman <[email protected]>
*
+ * 2006-05-19 Magnus Damm <[email protected]>:
+ * - moved segment handling code from machine_kexec.c
+ * - gdt tables stolen from arch/i386/boot/setup.S
+ *
* This source code is licensed under the GNU General Public License,
* Version 2. See the file COPYING for more details.
*/

#include <linux/linkage.h>
+#include <asm/page.h>

+.text
+.align (1 << PAGE_SHIFT)
+
/*
* Must be relocatable PIC code callable as a C function, that once
* it starts can not use the previous processes stack.
@@ -18,18 +26,68 @@ relocate_new_kernel:
movl 4(%esp), %ebx /* page_list */
movl 8(%esp), %ebp /* reboot_code_buffer */
movl 12(%esp), %edx /* start address */
- movl 16(%esp), %ecx /* cpu_has_pae */
+ movl 16(%esp), %edi /* page_table_a */
+ movl 20(%esp), %ecx /* cpu_has_pae */

/* zero out flags, and disable interrupts */
pushl $0
popfl

+ /* switch to page_table_a */
+ movl %edi, %eax
+ movl %eax, %cr3
+
+ /* setup idt */
+
+ movl %ebp, %eax
+ addl $(idt_48 - relocate_new_kernel), %eax
+ lidtl (%eax)
+
+ /* setup gdt */
+
+ movl %ebp, %eax
+ addl $(gdt - relocate_new_kernel), %eax
+ movl %ebp, %esi
+ addl $((gdt_48 - relocate_new_kernel) + 2), %esi
+ movl %eax, (%esi)
+
+ movl %ebp, %eax
+ addl $(gdt_48 - relocate_new_kernel), %eax
+ lgdtl (%eax)
+
+ /* setup data segment registers */
+
+ mov $(gdt_ds - gdt), %eax
+ mov %eax, %ds
+ mov %eax, %es
+ mov %eax, %fs
+ mov %eax, %gs
+ mov %eax, %ss
+
/* set a new stack at the bottom of our page... */
lea 4096(%ebp), %esp

+ /* load new code segment */
+
+ movl %ebp, %esi
+ xorl %eax, %eax
+ pushl %eax
+ pushl %esi
+ pushl %eax
+
+ movl $(gdt_cs - gdt), %eax
+ pushl %eax
+
+ movl %ebp, %eax
+ addl $(identity_mapped - relocate_new_kernel),%eax
+ pushl %eax
+ iretl
+
+identity_mapped:
+
/* store the parameters back on the stack */
pushl %edx /* store the start address */
-
+
/* Set cr0 to a known state:
* 31 0 == Paging disabled
* 18 0 == Alignment check disabled
@@ -113,6 +171,36 @@ relocate_new_kernel:
xorl %edi, %edi
xorl %ebp, %ebp
ret
+
+ .align 16
+gdt:
+ .fill 1,8,0
+
+gdt_cs:
+ .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
+ .word 0 # base address = 0
+ .word 0x9A00 # code read/exec
+ .word 0x00CF # granularity = 4096, 386
+ # (+5th nibble of limit)
+gdt_ds:
+ .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
+ .word 0 # base address = 0
+ .word 0x9200 # data read/write
+ .word 0x00CF # granularity = 4096, 386
+ # (+5th nibble of limit)
+gdt_end:
+ .align 4
+
+ .word 0 # alignment byte
+idt_48:
+ .word 0 # idt limit = 0
+ .word 0, 0 # idt base = 0L
+
+ .word 0 # alignment byte
+gdt_48:
+ .word gdt_end - gdt - 1 # gdt limit
+ .word 0, 0 # gdt base (filled in later)
+
relocate_new_kernel_end:

.globl relocate_new_kernel_size
--- 0002/include/asm-i386/kexec.h
+++ work/include/asm-i386/kexec.h 2006-05-22 12:55:38.000000000 +0900
@@ -29,7 +29,17 @@

#define MAX_NOTE_BYTES 1024

-struct kimage_arch {};
+struct kimage_arch {
+ /* page_table_a[] holds enough pages to create a new page table
+ * that maps the control page twice..
+ */
+
+#if defined(CONFIG_X86_PAE)
+ struct page *page_table_a[5]; /* (2 * pte) + (2 * pmd) + pgd */
+#else
+ struct page *page_table_a[3]; /* (2 * pte) + pgd */
+#endif
+};

/* CPU does not save ss and esp on stack if execution is already
* running in kernel mode at the time of NMI occurrence. This code

2006-05-24 22:56:41

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

On Wed, May 24, 2006 at 01:40:31PM +0900, Magnus Damm wrote:
> kexec: Avoid overwriting the current pgd (V2)
>
> This patch updates the kexec code for i386 and x86_64 to avoid overwriting
> the current pgd. For most people is overwriting the current pgd is not a big
> problem. When kexec:ing into a new kernel that reinitializes and makes use of
> all memory we don't care about saving state.
>
> But overwriting the current pgd is not a good solution in the case of kdump
> (CONFIG_CRASH_DUMP) where we want to preserve as much state as possible when
> a crash occurs. This patch solves the overwriting issue.
>
> 20060524: V2
>
> - Broke out architecture-specific data structures into asm/kexec.h
> - Fixed a i386/PAE page table problem only triggering on real hardware.
> - Moved segment handling code into the assembly routines.

What's the advantage of moving segment handling code into assembly
routines? It will only add to the fear of control code page size growing
beyond 4K.

Thanks
Vivek

2006-05-24 22:59:00

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 02/03] kexec: Avoid overwriting the current pgd (V2, i386)

On Wed, May 24, 2006 at 01:40:41PM +0900, Magnus Damm wrote:
>
> @@ -170,45 +151,16 @@ void machine_kexec_cleanup(struct kimage
> NORET_TYPE void machine_kexec(struct kimage *image)
> {
> unsigned long page_list;
> - unsigned long reboot_code_buffer;
> -
> + unsigned long control_code;
> + unsigned long page_table_a;
> relocate_new_kernel_t rnk;
>
> - /* Interrupts aren't acceptable while we reboot */
> - local_irq_disable();

Why are you getting rid of this call? Looks sane to disable interrupts
early.

Thanks
Vivek

2006-05-25 02:07:08

by Magnus Damm

[permalink] [raw]
Subject: Re: [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

On Wed, 2006-05-24 at 18:56 -0400, Vivek Goyal wrote:
> On Wed, May 24, 2006 at 01:40:31PM +0900, Magnus Damm wrote:
> > kexec: Avoid overwriting the current pgd (V2)
> >
> > This patch updates the kexec code for i386 and x86_64 to avoid overwriting
> > the current pgd. For most people is overwriting the current pgd is not a big
> > problem. When kexec:ing into a new kernel that reinitializes and makes use of
> > all memory we don't care about saving state.
> >
> > But overwriting the current pgd is not a good solution in the case of kdump
> > (CONFIG_CRASH_DUMP) where we want to preserve as much state as possible when
> > a crash occurs. This patch solves the overwriting issue.
> >
> > 20060524: V2
> >
> > - Broke out architecture-specific data structures into asm/kexec.h
> > - Fixed a i386/PAE page table problem only triggering on real hardware.
> > - Moved segment handling code into the assembly routines.
>
> What's the advantage of moving segment handling code into assembly
> routines? It will only add to the fear of control code page size growing
> beyond 4K.

I have two main reasons:

- Why wrap assembler instructions in C code if you just can move them
into an already existing assembly file? Much cleaner IMO.

- I'm currently working on making kexec to work under xen/dom0. And by
moving the segment handling code into the assembly file we reduce the
amount of duplicated code.

I was concerned about the size of the assembly code too and I just
re-checked with my patches applied:

i386: 300 / 4096 bytes used
x86_64: 364 / 4096 bytes used

I'd say we have more than enough space!

Thanks,

/ magnus


2006-05-25 02:12:56

by Magnus Damm

[permalink] [raw]
Subject: Re: [PATCH 02/03] kexec: Avoid overwriting the current pgd (V2, i386)

On Wed, 2006-05-24 at 18:58 -0400, Vivek Goyal wrote:
> On Wed, May 24, 2006 at 01:40:41PM +0900, Magnus Damm wrote:
> >
> > @@ -170,45 +151,16 @@ void machine_kexec_cleanup(struct kimage
> > NORET_TYPE void machine_kexec(struct kimage *image)
> > {
> > unsigned long page_list;
> > - unsigned long reboot_code_buffer;
> > -
> > + unsigned long control_code;
> > + unsigned long page_table_a;
> > relocate_new_kernel_t rnk;
> >
> > - /* Interrupts aren't acceptable while we reboot */
> > - local_irq_disable();
>
> Why are you getting rid of this call? Looks sane to disable interrupts
> early.

It made sense to disable interrupts early because we used to setup the
segments and overwrite the page tables from the c code. My patch moves
the segment handling and page table switching code into the assembly
routine, and one of the first things the assembly code does is:

/* zero out flags, and disable interrupts */
pushl $0
popfl

So I think we should be safe. Thanks for checking!

/ magnus

2006-05-25 02:20:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] Re: [PATCH 02/03] kexec: Avoid overwriting the current pgd (V2, i386)

Vivek Goyal <[email protected]> writes:

> On Wed, May 24, 2006 at 01:40:41PM +0900, Magnus Damm wrote:
>>
>> @@ -170,45 +151,16 @@ void machine_kexec_cleanup(struct kimage
>> NORET_TYPE void machine_kexec(struct kimage *image)
>> {
>> unsigned long page_list;
>> - unsigned long reboot_code_buffer;
>> -
>> + unsigned long control_code;
>> + unsigned long page_table_a;
>> relocate_new_kernel_t rnk;
>>
>> - /* Interrupts aren't acceptable while we reboot */
>> - local_irq_disable();
>
> Why are you getting rid of this call? Looks sane to disable interrupts
> early.

Agreed, this and changing the segment and idt handling is gratuitous.
If you are going to make multiple unrelated changes place do them
as separate patches.

Eric

2006-05-25 02:42:34

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 01/03] kexec: Avoid overwriting the current pgd (V2, stubs)

Magnus Damm <[email protected]> writes:

> --===============059810052910035161==
>
> kexec: Avoid overwriting the current pgd (V2, stubs)
>
> This patch adds an architecture specific structure "struct kimage_arch" to
> struct kimage. This structure is filled in with members by the architecture
> specific patches followed by this one.

You should be able to completely remove the need for this by simply
adding a single additional external variable to the control code
page. Given that you abuse this information and store way more
than you need I am not persuaded that it is an interesting case.

Eric

2006-05-25 02:39:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 02/03] kexec: Avoid overwriting the current pgd (V2, i386)

Magnus Damm <[email protected]> writes:

> --===============18843918423041384==
>
> kexec: Avoid overwriting the current pgd (V2, i386)
>
> This patch upgrades the i386-specific kexec code to avoid overwriting the
> current pgd. Overwriting the current pgd is bad when CONFIG_CRASH_DUMP is used
> to start a secondary kernel that dumps the memory of the previous kernel.
>
> The code introduces a new set of page tables called "page_table_a". These
> tables are used to provide an executable identity mapping without overwriting
> the current pgd. This updated version of the patch fixes a PAE bug and moves
> the segment handling code into the reloacte_kernel.S.
>
> Signed-off-by: Magnus Damm <[email protected]>

So the overall approach of just mapping the control code buffer in a page
table both physically and virtually appears sound.

However the implementation can stand some more refinement.

> ---
>
> The patch has been tested with regular kexec and CONFIG_CRASH_DUMP.
> Both PAE and non-PAE configurations work well.
> Applies on top of 2.6.16 and 2.6.17-rc4.
>
> arch/i386/kernel/machine_kexec.c | 230 ++++++++++++++----------------------
> arch/i386/kernel/relocate_kernel.S | 92 ++++++++++++++
> include/asm-i386/kexec.h | 12 +
> 3 files changed, 192 insertions(+), 142 deletions(-)
>
> --- 0001/arch/i386/kernel/machine_kexec.c
> +++ work/arch/i386/kernel/machine_kexec.c 2006-05-22 15:57:50.000000000 +0900
> @@ -2,6 +2,10 @@
> * machine_kexec.c - handle transition of Linux booting another kernel
> * Copyright (C) 2002-2005 Eric Biederman <[email protected]>
> *
> + * 2006-05-19 Magnus Damm <[email protected]>:
> + * - rewrote identity map code to avoid overwriting current pgd
> + * - moved segment handling code into relocate_kernel.S

If you give the right justification for this I could be presuaded to see that
it is sane. But it is a completely different issue and should be treated
as such.

> + *
> * This source code is licensed under the GNU General Public License,
> * Version 2. See the file COPYING for more details.
> */
> @@ -19,123 +23,73 @@
> #include <asm/desc.h>
> #include <asm/system.h>
>
> -#define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
> -
> -#define L0_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
> -#define L1_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
> -#define L2_ATTR (_PAGE_PRESENT)
> -
> -#define LEVEL0_SIZE (1UL << 12UL)
> +typedef asmlinkage NORET_TYPE void (*relocate_new_kernel_t)(
> + unsigned long indirection_page,
> + unsigned long reboot_code_buffer,
> + unsigned long start_address,
> + unsigned long page_table_a,
> + unsigned long has_pae) ATTRIB_NORET;
>
> -#ifndef CONFIG_X86_PAE
> -#define LEVEL1_SIZE (1UL << 22UL)
> -static u32 pgtable_level1[1024] PAGE_ALIGNED;
> +const extern unsigned char relocate_new_kernel[];
> +extern void relocate_new_kernel_end(void);
> +const extern unsigned int relocate_new_kernel_size;
>
> -static void identity_map_page(unsigned long address)
> +static int allocate_page_table_a(struct kimage *image)
> {
> - unsigned long level1_index, level2_index;
> - u32 *pgtable_level2;
> -
> - /* Find the current page table */
> - pgtable_level2 = __va(read_cr3());
> + struct kimage_arch *arch = &image->arch_data;
> + struct page *page;
> + int k = sizeof(arch->page_table_a) / sizeof(arch->page_table_a[0]);
> +
> + for (; k > 0; k--) {
> + page = kimage_alloc_control_pages(image, 0);
> + if (!page)
> + return -ENOMEM;
> +
> + clear_page(page_address(page));
> + arch->page_table_a[k - 1] = page;
> + }
>
> - /* Find the indexes of the physical address to identity map */
> - level1_index = (address % LEVEL1_SIZE)/LEVEL0_SIZE;
> - level2_index = address / LEVEL1_SIZE;
> -
> - /* Identity map the page table entry */
> - pgtable_level1[level1_index] = address | L0_ATTR;
> - pgtable_level2[level2_index] = __pa(pgtable_level1) | L1_ATTR;
> -
> - /* Flush the tlb so the new mapping takes effect.
> - * Global tlb entries are not flushed but that is not an issue.
> - */
> - load_cr3(pgtable_level2);
> + return 0;
> }
>
> -#else
> -#define LEVEL1_SIZE (1UL << 21UL)
> -#define LEVEL2_SIZE (1UL << 30UL)
> -static u64 pgtable_level1[512] PAGE_ALIGNED;
> -static u64 pgtable_level2[512] PAGE_ALIGNED;
> -
> -static void identity_map_page(unsigned long address)
> -{
> - unsigned long level1_index, level2_index, level3_index;
> - u64 *pgtable_level3;
> +/* workaround for include/asm-i386/pgtable-3level.h */
>
> - /* Find the current page table */
> - pgtable_level3 = __va(read_cr3());
> -
> - /* Find the indexes of the physical address to identity map */
> - level1_index = (address % LEVEL1_SIZE)/LEVEL0_SIZE;
> - level2_index = (address % LEVEL2_SIZE)/LEVEL1_SIZE;
> - level3_index = address / LEVEL2_SIZE;
> -
> - /* Identity map the page table entry */
> - pgtable_level1[level1_index] = address | L0_ATTR;
> - pgtable_level2[level2_index] = __pa(pgtable_level1) | L1_ATTR;
> - set_64bit(&pgtable_level3[level3_index],
> - __pa(pgtable_level2) | L2_ATTR);
> -
> - /* Flush the tlb so the new mapping takes effect.
> - * Global tlb entries are not flushed but that is not an issue.
> - */
> - load_cr3(pgtable_level3);
> -}
> +#ifdef CONFIG_X86_PAE
> +#undef pgd_present
> +#define pgd_present(pgd) (pgd_val(pgd) & _PAGE_PRESENT)
> +#define _PGD_ATTR _PAGE_PRESENT
> +#else
> +#define _PGD_ATTR _KERNPG_TABLE
> #endif

Ok. This is just horrible. Please don't use helper functions that are not
appropriate for the task. There is a reason I defined my own helper macros.

> -static void set_idt(void *newidt, __u16 limit)
> -{
> - struct Xgt_desc_struct curidt;
> -
> - /* ia32 supports unaliged loads & stores */
> - curidt.size = limit;
> - curidt.address = (unsigned long)newidt;
> -
> - load_idt(&curidt);
> -};

Gratuitous code removal that makes the code less accessible, to other programmers.
The code is already barely comprehensible.

> +#define pa_page(page) __pa(page_address(page))
>
> -
> -static void set_gdt(void *newgdt, __u16 limit)
> +static int create_mapping(struct page *root, struct page **pages,
> + unsigned long va, unsigned long pa)
> {
> - struct Xgt_desc_struct curgdt;
> -
> - /* ia32 supports unaligned loads & stores */
> - curgdt.size = limit;
> - curgdt.address = (unsigned long)newgdt;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> + int k = 0;
>
> - load_gdt(&curgdt);
> -};
> + pgd = (pgd_t *)page_address(root) + pgd_index(va);
> + if (!pgd_present(*pgd))
> + set_pgd(pgd, __pgd(pa_page(pages[k++]) | _PGD_ATTR));
>
> -static void load_segments(void)
> -{
> -#define __STR(X) #X
> -#define STR(X) __STR(X)
> + pud = pud_offset(pgd, va);
> + if (!pud_present(*pud))
> + set_pud(pud, __pud(pa_page(pages[k++]) | _KERNPG_TABLE));
>
> - __asm__ __volatile__ (
> - "\tljmp $"STR(__KERNEL_CS)",$1f\n"
> - "\t1:\n"
> - "\tmovl $"STR(__KERNEL_DS)",%%eax\n"
> - "\tmovl %%eax,%%ds\n"
> - "\tmovl %%eax,%%es\n"
> - "\tmovl %%eax,%%fs\n"
> - "\tmovl %%eax,%%gs\n"
> - "\tmovl %%eax,%%ss\n"
> - ::: "eax", "memory");
> -#undef STR
> -#undef __STR
> -}
> + pmd = pmd_offset(pud, va);
> + if (!pmd_present(*pmd))
> + set_pmd(pmd, __pmd(pa_page(pages[k++]) | _KERNPG_TABLE));
>
> -typedef asmlinkage NORET_TYPE void (*relocate_new_kernel_t)(
> - unsigned long indirection_page,
> - unsigned long reboot_code_buffer,
> - unsigned long start_address,
> - unsigned int has_pae) ATTRIB_NORET;
> + pte = (pte_t *)page_address(pmd_page(*pmd)) + pte_index(va);
> + set_pte(pte, __pte(pa | _PAGE_KERNEL_EXEC));
>
> -const extern unsigned char relocate_new_kernel[];
> -extern void relocate_new_kernel_end(void);
> -const extern unsigned int relocate_new_kernel_size;
> + return k;
> +}
>
> /*
> * A architecture hook called to validate the
> @@ -147,11 +101,38 @@ const extern unsigned int relocate_new_k
> * Do what every setup is needed on image and the
> * reboot code buffer to allow us to avoid allocations
> * later.
> - *
> - * Currently nothing.
> */
> int machine_kexec_prepare(struct kimage *image)
> {
> + void *control_page;
> + unsigned long pa;
> + int k;
> +
> + memset(&image->arch_data, 0, sizeof(image->arch_data));
> +
> + k = allocate_page_table_a(image);
> + if (k)
> + return k;
> +
> + /* fill in control_page with assembly code */
> +
> + control_page = page_address(image->control_code_page);
> + memcpy(control_page, relocate_new_kernel, relocate_new_kernel_size);
> +
> + /* map the control_page at the virtual address of relocate_kernel.S */
> +
> + pa = __pa(control_page);
> +
> + k = create_mapping(image->arch_data.page_table_a[0],
> + &image->arch_data.page_table_a[1],
> + (unsigned long)relocate_new_kernel, pa);
> +
> + /* identity map the control_page */
> +
> + create_mapping(image->arch_data.page_table_a[0],
> + &image->arch_data.page_table_a[k + 1],
> + pa, pa);
> +
> return 0;
> }
>
> @@ -170,45 +151,16 @@ void machine_kexec_cleanup(struct kimage
> NORET_TYPE void machine_kexec(struct kimage *image)
> {
> unsigned long page_list;
> - unsigned long reboot_code_buffer;
> -
> + unsigned long control_code;
> + unsigned long page_table_a;
> relocate_new_kernel_t rnk;
>
> - /* Interrupts aren't acceptable while we reboot */
> - local_irq_disable();
> -
> - /* Compute some offsets */
> - reboot_code_buffer = page_to_pfn(image->control_code_page)
> - << PAGE_SHIFT;
> page_list = image->head;
> -
> - /* Set up an identity mapping for the reboot_code_buffer */
> - identity_map_page(reboot_code_buffer);
> -
> - /* copy it out */
> - memcpy((void *)reboot_code_buffer, relocate_new_kernel,
> - relocate_new_kernel_size);
> -
> - /* The segment registers are funny things, they are
> - * automatically loaded from a table, in memory wherever you
> - * set them to a specific selector, but this table is never
> - * accessed again you set the segment to a different selector.
> - *
> - * The more common model is are caches where the behide
> - * the scenes work is done, but is also dropped at arbitrary
> - * times.
> - *
> - * I take advantage of this here by force loading the
> - * segments, before I zap the gdt with an invalid value.
> - */
> - load_segments();
> - /* The gdt & idt are now invalid.
> - * If you want to load them you must set up your own idt & gdt.
> - */
> - set_gdt(phys_to_virt(0),0);
> - set_idt(phys_to_virt(0),0);
> + control_code = __pa(page_address(image->control_code_page));
> + page_table_a = __pa(page_address(image->arch_data.page_table_a[0]));

These are the wrong helper functions. You deleted code that properly computes
the address in a portable and future safe fashion.
i.e. page_to_pfn(image->control_coe_page) << PAGE_SHIFT.

Although it is probably best to both of these if we want to be very safe
at kdump time.

> /* now call it */
> - rnk = (relocate_new_kernel_t) reboot_code_buffer;
> - (*rnk)(page_list, reboot_code_buffer, image->start, cpu_has_pae);
> + rnk = (relocate_new_kernel_t) relocate_new_kernel;
> + (*rnk)(page_list, control_code, image->start,
> + page_table_a, (unsigned long)cpu_has_pae);
> }
> --- 0001/arch/i386/kernel/relocate_kernel.S
> +++ work/arch/i386/kernel/relocate_kernel.S 2006-05-22 12:55:37.000000000 +0900
> @@ -2,12 +2,20 @@
> * relocate_kernel.S - put the kernel image in place to boot
> * Copyright (C) 2002-2004 Eric Biederman <[email protected]>
> *
> + * 2006-05-19 Magnus Damm <[email protected]>:
> + * - moved segment handling code from machine_kexec.c
> + * - gdt tables stolen from arch/i386/boot/setup.S
> + *
> * This source code is licensed under the GNU General Public License,
> * Version 2. See the file COPYING for more details.
> */
>
> #include <linux/linkage.h>
> +#include <asm/page.h>
>
> +.text
> +.align (1 << PAGE_SHIFT)
> +

How does this new alignment help.

> /*
> * Must be relocatable PIC code callable as a C function, that once
> * it starts can not use the previous processes stack.
> @@ -18,18 +26,68 @@ relocate_new_kernel:
> movl 4(%esp), %ebx /* page_list */
> movl 8(%esp), %ebp /* reboot_code_buffer */
> movl 12(%esp), %edx /* start address */
> - movl 16(%esp), %ecx /* cpu_has_pae */
> + movl 16(%esp), %edi /* page_table_a */
> + movl 20(%esp), %ecx /* cpu_has_pae */
>
> /* zero out flags, and disable interrupts */
> pushl $0
> popfl
>
> + /* switch to page_table_a */
> + movl %edi, %eax
> + movl %eax, %cr3
> +
> + /* setup idt */
> +
> + movl %ebp, %eax
> + addl $(idt_48 - relocate_new_kernel), %eax
> + lidtl (%eax)
> +
> + /* setup gdt */
> +
> + movl %ebp, %eax
> + addl $(gdt - relocate_new_kernel), %eax
> + movl %ebp, %esi
> + addl $((gdt_48 - relocate_new_kernel) + 2), %esi
> + movl %eax, (%esi)
> +
> + movl %ebp, %eax
> + addl $(gdt_48 - relocate_new_kernel), %eax
> + lgdtl (%eax)
> +
> + /* setup data segment registers */
> +
> + mov $(gdt_ds - gdt), %eax
> + mov %eax, %ds
> + mov %eax, %es
> + mov %eax, %fs
> + mov %eax, %gs
> + mov %eax, %ss
> +
> /* set a new stack at the bottom of our page... */
> lea 4096(%ebp), %esp
>
> + /* load new code segment */
> +
> + movl %ebp, %esi
> + xorl %eax, %eax
> + pushl %eax
> + pushl %esi
> + pushl %eax
> +
> + movl $(gdt_cs - gdt), %eax
> + pushl %eax
> +
> + movl %ebp, %eax
> + addl $(identity_mapped - relocate_new_kernel),%eax
> + pushl %eax
> + iretl
> +
> +identity_mapped:
> +
> /* store the parameters back on the stack */
> pushl %edx /* store the start address */
> -
> +

This is a gratuitous addition of white space.

> /* Set cr0 to a known state:
> * 31 0 == Paging disabled
> * 18 0 == Alignment check disabled
> @@ -113,6 +171,36 @@ relocate_new_kernel:
> xorl %edi, %edi
> xorl %ebp, %ebp
> ret
> +
> + .align 16
> +gdt:
> + .fill 1,8,0
> +
> +gdt_cs:
> + .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
> + .word 0 # base address = 0
> + .word 0x9A00 # code read/exec
> + .word 0x00CF # granularity = 4096, 386
> + # (+5th nibble of limit)
> +gdt_ds:
> + .word 0xFFFF # 4Gb - (0x100000*0x1000 = 4Gb)
> + .word 0 # base address = 0
> + .word 0x9200 # data read/write
> + .word 0x00CF # granularity = 4096, 386
> + # (+5th nibble of limit)
> +gdt_end:
> + .align 4
> +
> + .word 0 # alignment byte
> +idt_48:
> + .word 0 # idt limit = 0
> + .word 0, 0 # idt base = 0L
> +
> + .word 0 # alignment byte
> +gdt_48:
> + .word gdt_end - gdt - 1 # gdt limit
> + .word 0, 0 # gdt base (filled in later)
> +
> relocate_new_kernel_end:
>
> .globl relocate_new_kernel_size
> --- 0002/include/asm-i386/kexec.h
> +++ work/include/asm-i386/kexec.h 2006-05-22 12:55:38.000000000 +0900
> @@ -29,7 +29,17 @@
>
> #define MAX_NOTE_BYTES 1024
>
> -struct kimage_arch {};
> +struct kimage_arch {
> + /* page_table_a[] holds enough pages to create a new page table
> + * that maps the control page twice..
> + */
> +
> +#if defined(CONFIG_X86_PAE)
> + struct page *page_table_a[5]; /* (2 * pte) + (2 * pmd) + pgd */
> +#else
> + struct page *page_table_a[3]; /* (2 * pte) + pgd */
> +#endif

This is ridiculous. You only need one pointer to the top of the page
table. Don't take 5. Plus a struct page is not the way you need the
data so please just store the physical address you need.


> +};
>
> /* CPU does not save ss and esp on stack if execution is already
> * running in kernel mode at the time of NMI occurrence. This code
>
> --===============18843918423041384==
> Content-Type: text/plain; charset="iso-8859-1"
> MIME-Version: 1.0
> Content-Transfer-Encoding: quoted-printable
> Content-Disposition: inline
>
> _______________________________________________
> fastboot mailing list
> [email protected]
> https://lists.osdl.org/mailman/listinfo/fastboot
>
> --===============18843918423041384==--

2006-05-25 02:51:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

Magnus Damm <[email protected]> writes:

> --===============37282618571824511==
>
> kexec: Avoid overwriting the current pgd (V2, x86_64)
>
> This patch upgrades the x86_64-specific kexec code to avoid overwriting the
> current pgd. Overwriting the current pgd is bad when CONFIG_CRASH_DUMP is used
> to start a secondary kernel that dumps the memory of the previous kernel.
>
> The code introduces a new set of page tables called "page_table_a". These
> tables are used to provide an executable identity mapping without overwriting
> the current pgd. The already existing page table is renamed to "page_table_b".

As I mentioned earlier you don't need two page tables, because
it is easy to guarantee that the identity mapping and the virtual mapping
will not intersect on x86_64. Until we have as many physical address bits
as virtual address bits (which requires page table modifications) it
is nonsense to have 2 page tables here.

> KEXEC_CONTROL_CODE_SIZE is changed into a single page. This updated version of
> the patch also moves the segment handling code into the reloacte_kernel.S.

You are doing two things at once here. If it is worth moving the segment
handling and idt handling it should be done as a separate patch.


> Signed-off-by: Magnus Damm <[email protected]>
> ---
>
> The patch has been tested with regular kexec and CONFIG_CRASH_DUMP.
> Applies on top of 2.6.16 and 2.6.17-rc4.
>
> arch/x86_64/kernel/machine_kexec.c | 193 +++++++++++++++++-----------------
> arch/x86_64/kernel/relocate_kernel.S | 84 +++++++++++++-
> include/asm-x86_64/kexec.h | 15 ++
> 3 files changed, 189 insertions(+), 103 deletions(-)
>
> --- 0001/arch/x86_64/kernel/machine_kexec.c
> +++ work/arch/x86_64/kernel/machine_kexec.c 2006-05-19 12:09:39.000000000 +0900
> @@ -2,6 +2,10 @@
> * machine_kexec.c - handle transition of Linux booting another kernel
> * Copyright (C) 2002-2005 Eric Biederman <[email protected]>
> *
> + * 2006-05-19 Magnus Damm <[email protected]>:
> + * - rewrote identity map code to avoid overwriting current pgd
> + * - moved segment handling code into relocate_kernel.S
> + *
> * This source code is licensed under the GNU General Public License,
> * Version 2. See the file COPYING for more details.
> */
> @@ -96,81 +100,110 @@ out:
> }
>
>
> -static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
> +static int create_page_table_b(struct kimage *image)
> {
> - pgd_t *level4p;
> - level4p = (pgd_t *)__va(start_pgtable);
> - return init_level4_page(image, level4p, 0, end_pfn << PAGE_SHIFT);
> -}
> + struct kimage_arch *arch = &image->arch_data;
>
> -static void set_idt(void *newidt, u16 limit)
> -{
> - struct desc_ptr curidt;
> + arch->page_table_b = kimage_alloc_control_pages(image, 0);
>
> - /* x86-64 supports unaliged loads & stores */
> - curidt.size = limit;
> - curidt.address = (unsigned long)newidt;
> + if (!arch->page_table_b)
> + return -ENOMEM;
>
> - __asm__ __volatile__ (
> - "lidtq %0\n"
> - : : "m" (curidt)
> - );
> -};
> + return init_level4_page(image, page_address(arch->page_table_b),
> + 0, end_pfn << PAGE_SHIFT);
> +}
>
> +typedef NORET_TYPE void (*relocate_new_kernel_t)(unsigned long
> indirection_page,
> + unsigned long control_code_buffer,
> + unsigned long start_address,
> + unsigned long page_table_a,
> + unsigned long page_table_b) ATTRIB_NORET;
> +
> +const extern unsigned char relocate_new_kernel[];
> +const extern unsigned long relocate_new_kernel_size;
>
> -static void set_gdt(void *newgdt, u16 limit)
> +static int allocate_page_table_a(struct kimage *image)
> {
> - struct desc_ptr curgdt;
> + struct kimage_arch *arch = &image->arch_data;
> + struct page *page;
> + int k = sizeof(arch->page_table_a) / sizeof(arch->page_table_a[0]);
>
> - /* x86-64 supports unaligned loads & stores */
> - curgdt.size = limit;
> - curgdt.address = (unsigned long)newgdt;
> + for (; k > 0; k--) {
> + page = kimage_alloc_control_pages(image, 0);
> + if (!page)
> + return -ENOMEM;
>
> - __asm__ __volatile__ (
> - "lgdtq %0\n"
> - : : "m" (curgdt)
> - );
> -};
> + clear_page(page_address(page));
> + arch->page_table_a[k - 1] = page;
> + }
>
> -static void load_segments(void)
> -{
> - __asm__ __volatile__ (
> - "\tmovl %0,%%ds\n"
> - "\tmovl %0,%%es\n"
> - "\tmovl %0,%%ss\n"
> - "\tmovl %0,%%fs\n"
> - "\tmovl %0,%%gs\n"
> - : : "a" (__KERNEL_DS) : "memory"
> - );
> + return 0;
> }
>
> -typedef NORET_TYPE void (*relocate_new_kernel_t)(unsigned long
> indirection_page,
> - unsigned long control_code_buffer,
> - unsigned long start_address,
> - unsigned long pgtable) ATTRIB_NORET;
> +#define _PAGE_KERNEL_EXEC __PAGE_KERNEL_EXEC
> +#define pa_page(page) __pa_symbol(page_address(page)) /* __pa() miscompiles */
>
> -const extern unsigned char relocate_new_kernel[];
> -const extern unsigned long relocate_new_kernel_size;
> +static int create_mapping(struct page *root, struct page **pages,
> + unsigned long va, unsigned long pa)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> + int k = 0;
> +
> + pgd = (pgd_t *)page_address(root) + pgd_index(va);
> + if (!pgd_present(*pgd))
> + set_pgd(pgd, __pgd(pa_page(pages[k++]) | _KERNPG_TABLE));
> +
> + pud = pud_offset(pgd, va);
> + if (!pud_present(*pud))
> + set_pud(pud, __pud(pa_page(pages[k++]) | _KERNPG_TABLE));
> +
> + pmd = pmd_offset(pud, va);
> + if (!pmd_present(*pmd))
> + set_pmd(pmd, __pmd(pa_page(pages[k++]) | _KERNPG_TABLE));
> +
> + pte = (pte_t *)page_address(pmd_page(*pmd)) + pte_index(va);
> + set_pte(pte, __pte(pa | _PAGE_KERNEL_EXEC));
> +
> + return k;
> +}
>
> int machine_kexec_prepare(struct kimage *image)
> {
> - unsigned long start_pgtable, control_code_buffer;
> - int result;
> + void *control_page;
> + unsigned long pa;
> + int k;
>
> - /* Calculate the offsets */
> - start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
> - control_code_buffer = start_pgtable + PAGE_SIZE;
> -
> - /* Setup the identity mapped 64bit page table */
> - result = init_pgtable(image, start_pgtable);
> - if (result)
> - return result;
> -
> - /* Place the code in the reboot code buffer */
> - memcpy(__va(control_code_buffer), relocate_new_kernel,
> - relocate_new_kernel_size);
> + memset(&image->arch_data, 0, sizeof(image->arch_data));
>
> - return 0;
> + k = allocate_page_table_a(image);
> + if (k)
> + return k;
> +
> + /* fill in control_page with assembly code */
> +
> + control_page = page_address(image->control_code_page);
> + memcpy(control_page, relocate_new_kernel, relocate_new_kernel_size);
> +
> + /* map the control_page at the virtual address of relocate_kernel.S */
> +
> + pa = __pa(control_page);
> +
> + k = create_mapping(image->arch_data.page_table_a[0],
> + &image->arch_data.page_table_a[1],
> + (unsigned long)relocate_new_kernel, pa);
> +
> + /* identity map the control_page */
> +
> + create_mapping(image->arch_data.page_table_a[0],
> + &image->arch_data.page_table_a[k + 1],
> + pa, pa);
> +
> + /* create identity mapped page table aka page_table_b */
> +
> + return create_page_table_b(image);
> }
>
> void machine_kexec_cleanup(struct kimage *image)
> @@ -185,47 +218,17 @@ void machine_kexec_cleanup(struct kimage
> NORET_TYPE void machine_kexec(struct kimage *image)
> {
> unsigned long page_list;
> - unsigned long control_code_buffer;
> - unsigned long start_pgtable;
> + unsigned long control_code;
> + unsigned long page_table_a;
> + unsigned long page_table_b;
> relocate_new_kernel_t rnk;
>
> - /* Interrupts aren't acceptable while we reboot */
> - local_irq_disable();
> -
> - /* Calculate the offsets */
> page_list = image->head;
> - start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
> - control_code_buffer = start_pgtable + PAGE_SIZE;
> + control_code = __pa(page_address(image->control_code_page));
> + page_table_a = __pa(page_address(image->arch_data.page_table_a[0]));
> + page_table_b = __pa(page_address(image->arch_data.page_table_b));
>
> - /* Set the low half of the page table to my identity mapped
> - * page table for kexec. Leave the high half pointing at the
> - * kernel pages. Don't bother to flush the global pages
> - * as that will happen when I fully switch to my identity mapped
> - * page table anyway.
> - */
> - memcpy(__va(read_cr3()), __va(start_pgtable), PAGE_SIZE/2);
> - __flush_tlb();
> -
> -
> - /* The segment registers are funny things, they are
> - * automatically loaded from a table, in memory wherever you
> - * set them to a specific selector, but this table is never
> - * accessed again unless you set the segment to a different selector.
> - *
> - * The more common model are caches where the behide
> - * the scenes work is done, but is also dropped at arbitrary
> - * times.
> - *
> - * I take advantage of this here by force loading the
> - * segments, before I zap the gdt with an invalid value.
> - */
> - load_segments();
> - /* The gdt & idt are now invalid.
> - * If you want to load them you must set up your own idt & gdt.
> - */
> - set_gdt(phys_to_virt(0),0);
> - set_idt(phys_to_virt(0),0);
> /* now call it */
> - rnk = (relocate_new_kernel_t) control_code_buffer;
> - (*rnk)(page_list, control_code_buffer, image->start, start_pgtable);
> + rnk = (relocate_new_kernel_t) relocate_new_kernel;
> + (*rnk)(page_list, control_code, image->start, page_table_a, page_table_b);
> }
> --- 0001/arch/x86_64/kernel/relocate_kernel.S
> +++ work/arch/x86_64/kernel/relocate_kernel.S 2006-05-19 12:11:13.000000000
> +0900
> @@ -2,11 +2,18 @@
> * relocate_kernel.S - put the kernel image in place to boot
> * Copyright (C) 2002-2005 Eric Biederman <[email protected]>
> *
> + * 2006-05-19 Magnus Damm <[email protected]>:
> + * - moved segment handling code from machine_kexec.c
> + *
> * This source code is licensed under the GNU General Public License,
> * Version 2. See the file COPYING for more details.
> */
>
> #include <linux/linkage.h>
> +#include <asm/page.h>
> +
> +.text
> +.align (1 << PAGE_SHIFT)
>
> /*
> * Must be relocatable PIC code callable as a C function, that once
> @@ -18,21 +25,69 @@ relocate_new_kernel:
> /* %rdi page_list
> * %rsi reboot_code_buffer
> * %rdx start address
> - * %rcx page_table
> - * %r8 arg5
> + * %rcx page_table_a
> + * %r8 page_table_b
> * %r9 arg6
> */
> -
> +
> /* zero out flags, and disable interrupts */
> pushq $0
> popfq
>
> + /* switch to page_table_a */
> + movq %rcx, %cr3
> +
> + /* setup idt */
> +
> + movq %rsi, %rax
> + addq $(idt_48 - relocate_new_kernel), %rax
> + lidtq (%rax)
> +
> + /* setup gdt */
> +
> + movq %rsi, %rax
> + addq $(gdt - relocate_new_kernel), %rax
> + movq %rsi, %r9
> + addq $((gdt_48 - relocate_new_kernel) + 2), %r9
> + movq %rax, (%r9)
> +
> + movq %rsi, %rax
> + addq $(gdt_48 - relocate_new_kernel), %rax
> + lgdtq (%rax)
> +
> + /* setup data segment registers */
> +
> + xorl %eax,%eax
> + movl %eax, %ds
> + movl %eax, %es
> + movl %eax, %fs
> + movl %eax, %gs
> + movl %eax, %ss
> +
> /* set a new stack at the bottom of our page... */
> lea 4096(%rsi), %rsp
>
> + /* load new code segment */
> +
> + movq %rsp, %rcx
> + xorq %rax, %rax
> + pushq %rax /* SS */
> + pushq %rcx /* ESP */
> + pushq %rax /* RFLAGS */
> +
> + movq $(gdt_code - gdt), %rax
> + pushq %rax /* CS */
> +
> + movq %rsi, %rax
> + addq $(identity_mapped - relocate_new_kernel), %rax
> + pushq %rax /* RIP */
> +
> + iretq
> +
> +identity_mapped:
> /* store the parameters back on the stack */
> pushq %rdx /* store the start address */
> -
> +
> /* Set cr0 to a known state:
> * 31 1 == Paging enabled
> * 18 0 == Alignment check disabled
> @@ -69,7 +124,7 @@ relocate_new_kernel:
> /* Switch to the identity mapped page tables,
> * and flush the TLB.
> */
> - movq %rcx, %cr3
> + movq %r8, %cr3
>
> /* Do the copies */
> movq %rdi, %rcx /* Put the page_list in %rcx */
> @@ -136,6 +191,25 @@ relocate_new_kernel:
> xorq %r15, %r15
>
> ret
> + .align 16
> +gdt:
> + .long 0x00000000 /* NULL descriptor */
> + .long 0x00000000
> +gdt_code:
> + .long 0x00000000 /* code descriptor */
> + .long 0x00209800
> +
> +gdt_end:
> + .align 4
> +
> +idt_48:
> + .word 0 # idt limit = 0
> + .quad 0, 0 # idt base = 0L
> +
> +gdt_48:
> + .word gdt_end - gdt - 1 # gdt limit
> + .quad 0, 0 # gdt base (filled in later)
> +
> relocate_new_kernel_end:
>
> .globl relocate_new_kernel_size
> --- 0002/include/asm-x86_64/kexec.h
> +++ work/include/asm-x86_64/kexec.h 2006-05-19 12:07:33.000000000 +0900
> @@ -21,15 +21,24 @@
> /* Maximum address we can use for the control pages */
> #define KEXEC_CONTROL_MEMORY_LIMIT (0xFFFFFFFFFFUL)
>
> -/* Allocate one page for the pdp and the second for the code */
> -#define KEXEC_CONTROL_CODE_SIZE (4096UL + 4096UL)
> +#define KEXEC_CONTROL_CODE_SIZE 4096
>
> /* The native architecture */
> #define KEXEC_ARCH KEXEC_ARCH_X86_64
>
> #define MAX_NOTE_BYTES 1024
>
> -struct kimage_arch {};
> +struct kimage_arch {
> + /* page_table_a[] holds enough pages to create a new page table
> + * that maps the control page twice..
> + *
> + * page_table_b points to the root page of a page table which is used
> + * to provide identity mapping of all ram.
> + */
> +
> + struct page *page_table_a[7]; /* 2 * (pte + pud + pmd) + pgd */
> + struct page *page_table_b;
> +};
>
> /*
> * Saving the registers of the cpu on which panic occured in
>
> --===============37282618571824511==
> Content-Type: text/plain; charset="iso-8859-1"
> MIME-Version: 1.0
> Content-Transfer-Encoding: quoted-printable
> Content-Disposition: inline
>
> _______________________________________________
> fastboot mailing list
> [email protected]
> https://lists.osdl.org/mailman/listinfo/fastboot
>
> --===============37282618571824511==--

2006-05-25 02:57:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

Magnus Damm <[email protected]> writes:

> On Wed, 2006-05-24 at 18:56 -0400, Vivek Goyal wrote:
>> On Wed, May 24, 2006 at 01:40:31PM +0900, Magnus Damm wrote:
>> > kexec: Avoid overwriting the current pgd (V2)
>> >
>> > This patch updates the kexec code for i386 and x86_64 to avoid overwriting
>> > the current pgd. For most people is overwriting the current pgd is not a big
>> > problem. When kexec:ing into a new kernel that reinitializes and makes use
> of
>> > all memory we don't care about saving state.
>> >
>> > But overwriting the current pgd is not a good solution in the case of kdump
>> > (CONFIG_CRASH_DUMP) where we want to preserve as much state as possible when
>> > a crash occurs. This patch solves the overwriting issue.
>> >
>> > 20060524: V2
>> >
>> > - Broke out architecture-specific data structures into asm/kexec.h
>> > - Fixed a i386/PAE page table problem only triggering on real hardware.
>> > - Moved segment handling code into the assembly routines.
>>
>> What's the advantage of moving segment handling code into assembly
>> routines? It will only add to the fear of control code page size growing
>> beyond 4K.
>
> I have two main reasons:
>
> - Why wrap assembler instructions in C code if you just can move them
> into an already existing assembly file? Much cleaner IMO.

C code is much more accessible to other programmers than arch specific
assembly. The code on the control page was almost written in C, and
I'm still not quite convinced that it would be wrong to do that.

> - I'm currently working on making kexec to work under xen/dom0. And by
> moving the segment handling code into the assembly file we reduce the
> amount of duplicated code.

Not the reason I would have expected. So you are only differring the
two implementations by the contents of the control code page?

Eric

2006-05-25 03:28:35

by Magnus Damm

[permalink] [raw]
Subject: Re: [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

Hi Eric,

On Wed, 2006-05-24 at 20:56 -0600, Eric W. Biederman wrote:
> Magnus Damm <[email protected]> writes:
>
> > On Wed, 2006-05-24 at 18:56 -0400, Vivek Goyal wrote:
> >> On Wed, May 24, 2006 at 01:40:31PM +0900, Magnus Damm wrote:
> >> > kexec: Avoid overwriting the current pgd (V2)
> >> >
> >> > This patch updates the kexec code for i386 and x86_64 to avoid overwriting
> >> > the current pgd. For most people is overwriting the current pgd is not a big
> >> > problem. When kexec:ing into a new kernel that reinitializes and makes use
> > of
> >> > all memory we don't care about saving state.
> >> >
> >> > But overwriting the current pgd is not a good solution in the case of kdump
> >> > (CONFIG_CRASH_DUMP) where we want to preserve as much state as possible when
> >> > a crash occurs. This patch solves the overwriting issue.
> >> >
> >> > 20060524: V2
> >> >
> >> > - Broke out architecture-specific data structures into asm/kexec.h
> >> > - Fixed a i386/PAE page table problem only triggering on real hardware.
> >> > - Moved segment handling code into the assembly routines.
> >>
> >> What's the advantage of moving segment handling code into assembly
> >> routines? It will only add to the fear of control code page size growing
> >> beyond 4K.
> >
> > I have two main reasons:
> >
> > - Why wrap assembler instructions in C code if you just can move them
> > into an already existing assembly file? Much cleaner IMO.
>
> C code is much more accessible to other programmers than arch specific
> assembly. The code on the control page was almost written in C, and
> I'm still not quite convinced that it would be wrong to do that.

I agree with you that it is of course better to implement something in C
if possible compared to writing it in architecture-specific assembly.

But I do not agree that wrapping architecture-specific assembly code in
C functions makes the code more understandable. I'd really like to meet
the kernel hacker that is aware of how x86 segmentation works but is
unable to read x86 assembly.

> > - I'm currently working on making kexec to work under xen/dom0. And by
> > moving the segment handling code into the assembly file we reduce the
> > amount of duplicated code.
>
> Not the reason I would have expected. So you are only differring the
> two implementations by the contents of the control code page?

Nah, there's a fairly large framework to pass pages to the hypervisor,
converting pfn:s to mfn:s, building page tables etc. We will resend the
patches later on today to xen-devel if you're interested.

Thanks,

/ magnus

2006-05-25 03:43:41

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 01/03] kexec: Avoid overwriting the current pgd (V2, stubs)

On Wed, 2006-05-24 at 20:41 -0600, Eric W. Biederman wrote:
> Magnus Damm <[email protected]> writes:
>
> > --===============059810052910035161==
> >
> > kexec: Avoid overwriting the current pgd (V2, stubs)
> >
> > This patch adds an architecture specific structure "struct kimage_arch" to
> > struct kimage. This structure is filled in with members by the architecture
> > specific patches followed by this one.
>
> You should be able to completely remove the need for this by simply
> adding a single additional external variable to the control code
> page. Given that you abuse this information and store way more
> than you need I am not persuaded that it is an interesting case.

I'm sorry, but I do not understand. Care to explain a bit more, maybe
with some examples?

Also, I get the impression that you dislike my patches. I'd like to work
with you and the community to merge my patches, but for that to happen
I'd appreciate if we both kept the language on a professional level.

Next time, please try to avoid strong words such as "abuse", "horrible"
and "ridiculous".

Thanks,

/ magnus

2006-05-25 04:38:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 01/03] kexec: Avoid overwriting the current pgd (V2, stubs)

Magnus Damm <[email protected]> writes:

> On Wed, 2006-05-24 at 20:41 -0600, Eric W. Biederman wrote:
>> Magnus Damm <[email protected]> writes:
>>
>> > --===============059810052910035161==
>> >
>> > kexec: Avoid overwriting the current pgd (V2, stubs)
>> >
>> > This patch adds an architecture specific structure "struct kimage_arch" to
>> > struct kimage. This structure is filled in with members by the architecture
>> > specific patches followed by this one.
>>
>> You should be able to completely remove the need for this by simply
>> adding a single additional external variable to the control code
>> page. Given that you abuse this information and store way more
>> than you need I am not persuaded that it is an interesting case.
>
> I'm sorry, but I do not understand. Care to explain a bit more, maybe
> with some examples?

I believe I gave a complete explanation the first round.

By having an extra extern variable you can export the offset of
a variable on the control code page, or what you need to compute
the offset.

> Also, I get the impression that you dislike my patches. I'd like to work
> with you and the community to merge my patches, but for that to happen
> I'd appreciate if we both kept the language on a professional level.

Yes. I dislike your patches, but I don't dislike what you are trying to do.

Part of the reason is you do more than one thing at a time, which makes
review much more difficult than it needs to be.

> Next time, please try to avoid strong words such as "abuse", "horrible"
> and "ridiculous".

I call them as I see them, and probably if I am a little frustrated I
may be a little more extreme. Usually I find that I have too much
implied content and don't explain why I consider something abuse
for example.

In the above quoted section I figure it is abuse to place what only needs
to be an array of local variables in a global structure.

Eric

2006-05-25 04:53:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

Magnus Damm <[email protected]> writes:

> Hi Eric,
>
> On Wed, 2006-05-24 at 20:56 -0600, Eric W. Biederman wrote:
>>
>> C code is much more accessible to other programmers than arch specific
>> assembly. The code on the control page was almost written in C, and
>> I'm still not quite convinced that it would be wrong to do that.
>
> I agree with you that it is of course better to implement something in C
> if possible compared to writing it in architecture-specific assembly.
>
> But I do not agree that wrapping architecture-specific assembly code in
> C functions makes the code more understandable. I'd really like to meet
> the kernel hacker that is aware of how x86 segmentation works but is
> unable to read x86 assembly.

For some young programmers it may be a matter of reading ability.
For older programmers it is more likely to be a matter of reading
speed.

Regardless that is how the code is now, and how it came out of the series
of code reviews I had to go through when I wrote it. I had requests
to do more in C and I never had a request to do more in assembly.
Proving there was no sane way to write the control code page in
C was actually difficult.

If there is a legitimate reason to change the code that is fine. But
as it looked as simply a change without a good reason that is not
fine.

The big problem was you did several things with a single patch,
and that made the review much more difficult than it had to be.

Having to check if you correctly modified the page tables, while also
having to check for segmentation, and the interrupt descriptor
transformations was distracting.

>> > - I'm currently working on making kexec to work under xen/dom0. And by
>> > moving the segment handling code into the assembly file we reduce the
>> > amount of duplicated code.
>>
>> Not the reason I would have expected. So you are only differring the
>> two implementations by the contents of the control code page?
>
> Nah, there's a fairly large framework to pass pages to the hypervisor,
> converting pfn:s to mfn:s, building page tables etc. We will resend the
> patches later on today to xen-devel if you're interested.

Ok. I might have to look.

Eric

2006-05-25 07:11:14

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 01/03] kexec: Avoid overwriting the current pgd (V2, stubs)

On 5/25/06, Eric W. Biederman <[email protected]> wrote:
> Magnus Damm <[email protected]> writes:
> > On Wed, 2006-05-24 at 20:41 -0600, Eric W. Biederman wrote:
> >> Magnus Damm <[email protected]> writes:
> >>
> >> > --===============059810052910035161==
> >> >
> >> > kexec: Avoid overwriting the current pgd (V2, stubs)
> >> >
> >> > This patch adds an architecture specific structure "struct kimage_arch" to
> >> > struct kimage. This structure is filled in with members by the architecture
> >> > specific patches followed by this one.
> >>
> >> You should be able to completely remove the need for this by simply
> >> adding a single additional external variable to the control code
> >> page. Given that you abuse this information and store way more
> >> than you need I am not persuaded that it is an interesting case.
> >
> > I'm sorry, but I do not understand. Care to explain a bit more, maybe
> > with some examples?
>
> I believe I gave a complete explanation the first round.
>
> By having an extra extern variable you can export the offset of
> a variable on the control code page, or what you need to compute
> the offset.

You explained some things last round, but there were still some
questions left open. The main question was regarding "additional
protection".

I'd be happy to change to code into something that you would feel
comfortable with, I just like to understand why. Having a
per-architecture data area in struct kimage is by far the simplest way
IMO.

> > Also, I get the impression that you dislike my patches. I'd like to work
> > with you and the community to merge my patches, but for that to happen
> > I'd appreciate if we both kept the language on a professional level.
>
> Yes. I dislike your patches, but I don't dislike what you are trying to do.

That's one good thing at least I guess. =)

> Part of the reason is you do more than one thing at a time, which makes
> review much more difficult than it needs to be.

Yeah, I know. I'm sorry about that. I took some time to split the
patches in the most logical way I could think of. The reason behind
not separating the segment code and the page_table_a code was that
they both touched more or less the same lines.

Let's figure out _what_ you want to merge, then let us decide in what
order to merge it. If we end up with more than 0 things to do then I'd
be happy to help out implementing them one by one.

> > Next time, please try to avoid strong words such as "abuse", "horrible"
> > and "ridiculous".
>
> I call them as I see them, and probably if I am a little frustrated I
> may be a little more extreme. Usually I find that I have too much
> implied content and don't explain why I consider something abuse
> for example.

Right. And I get offended by the strong works which is bad.
I think we both should try to stay cool, otherwise this will go nowhere.

> In the above quoted section I figure it is abuse to place what only needs
> to be an array of local variables in a global structure.

And by global structure you mean the dynamically allocated struct
kimage? If you find that abusive then I think that _not_ using an
already existing structure is abusive. =)

I just want to keep things as simple as possible.

Thanks for your comments!

/ magnus

2006-05-25 07:29:04

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

On 5/25/06, Eric W. Biederman <[email protected]> wrote:
> Magnus Damm <[email protected]> writes:
>
> > Hi Eric,
> >
> > On Wed, 2006-05-24 at 20:56 -0600, Eric W. Biederman wrote:
> >>
> >> C code is much more accessible to other programmers than arch specific
> >> assembly. The code on the control page was almost written in C, and
> >> I'm still not quite convinced that it would be wrong to do that.
> >
> > I agree with you that it is of course better to implement something in C
> > if possible compared to writing it in architecture-specific assembly.
> >
> > But I do not agree that wrapping architecture-specific assembly code in
> > C functions makes the code more understandable. I'd really like to meet
> > the kernel hacker that is aware of how x86 segmentation works but is
> > unable to read x86 assembly.
>
> For some young programmers it may be a matter of reading ability.
> For older programmers it is more likely to be a matter of reading
> speed.
>
> Regardless that is how the code is now, and how it came out of the series
> of code reviews I had to go through when I wrote it. I had requests
> to do more in C and I never had a request to do more in assembly.
> Proving there was no sane way to write the control code page in
> C was actually difficult.

Consider this a "more assembly" request then. What about these reasons:

- C code requires a stack
You must access one page only and therefore you need to setup the
stackpointer somewhere. Requires assembly.

- We switch to a new page table, twice on x86_64
Requires assembly.

- Need to setup segment registeres and cr*
Requires assembly.

- You need to setup/clear registers before passing control
Requires assembly.

> If there is a legitimate reason to change the code that is fine. But
> as it looked as simply a change without a good reason that is not
> fine.

We would have to duplicate the segment handling code in our xen port
otherwise. It's no biggie, it's just a matter of a few lines of code.

Also, I feel that my approach with a valid idt and gdt is more robust.

> The big problem was you did several things with a single patch,
> and that made the review much more difficult than it had to be.
>
> Having to check if you correctly modified the page tables, while also
> having to check for segmentation, and the interrupt descriptor
> transformations was distracting.

Let me know which parts you think are good and we will implement and
review them bit by bit instead then.

Thanks,

/ magnus

2006-05-25 08:24:54

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On Wed, 2006-05-24 at 20:50 -0600, Eric W. Biederman wrote:
> Magnus Damm <[email protected]> writes:
>
> > --===============37282618571824511==
> >
> > kexec: Avoid overwriting the current pgd (V2, x86_64)
> >
> > This patch upgrades the x86_64-specific kexec code to avoid overwriting the
> > current pgd. Overwriting the current pgd is bad when CONFIG_CRASH_DUMP is used
> > to start a secondary kernel that dumps the memory of the previous kernel.
> >
> > The code introduces a new set of page tables called "page_table_a". These
> > tables are used to provide an executable identity mapping without overwriting
> > the current pgd. The already existing page table is renamed to "page_table_b".
>
> As I mentioned earlier you don't need two page tables, because
> it is easy to guarantee that the identity mapping and the virtual mapping
> will not intersect on x86_64. Until we have as many physical address bits
> as virtual address bits (which requires page table modifications) it
> is nonsense to have 2 page tables here.

Let me explain the theory behind my kexec patch:

During prepare, page_table_a is initialized. For x86_64 page_table_b is
also initialized.

When it is time for machine_kexec() the following sequence takes place:

1. The C-code in machine_kexec() jumps to the assembly routine.
To avoid overwriting the page table and still work with NX bits we jump
to the original location of relocate_new_kernel. This is different from
the unpatched code which jumps to the physical address (ie the identity
mapped location).

2. The assembly code switches to page_table_a.
page_table_a has mapped the control page at two virtual addresses:
- The same virtual address as relocate_new_kernel is located at.
(This explains the extra aligment in the assembly file)
- An identity mapping, ie virtual address == physical address.
After the switch the code runs at the same virtual address, but the
physical page is now the control page.

3. Setup idt, gdt, segment registers and stack pointer.
The stack pointer should point to the identity mapped page.

4. Jump to the identity mapped address.
When the jump is performed we will be running at a virtual address which
is the same as the physical address.

5. Turn off MMU (i386) / switch to page_table_b (x86_64).
We are able to turn off the MMU or switch to page_table_b because we are
already running at the physical address.

6. Proceed with the page copying business as usual...

Ok, so far so good.

The fun part begins when we throw in Xen into the mix. Linux under xen
runs with pseudophysical addresses, ie what Linux thinks are physical
addresses are not physical. Xen use the term machine addresses for
addresses that are called physical address under "regular" Linux. On top
of that is Xen using a different memory map than Linux.

After prepare, all pages in page_table_a are passed to the hypervisor
that overwrites the contents filled in by machine_prepare().
(this explains the "ridiculous" array of struct page *)

A similar two page mapping is used for here too, but in the xen case we
use a different virtual address (the non-identity mapped address)
compared to "regular" Linux. All to fit the address space used by xen.

The xen port which is based on my patches is using a sequence similar to
"regular" Linux:

1a. The C-code in xen_machine_kexec() performs a hypercall.

1b. The hypervisor jumps to the assembly code.
After prepare we've created a NX-safe mapping for the control page. We
jump to that NX-safe address to transfer control to the assembly code.

Goto 2 above.

So, to answer your question regarding two page table copies. You may be
right that it can be made work with just one page table, but I feel my
two table approach is nice because it will always work - regardless of
the memory map used.

Thanks,

/ magnus

2006-05-25 15:21:28

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On Thu, May 25, 2006 at 05:26:56PM +0900, Magnus Damm wrote:
> > >
> > > The code introduces a new set of page tables called "page_table_a". These
> > > tables are used to provide an executable identity mapping without overwriting
> > > the current pgd. The already existing page table is renamed to "page_table_b".
> >
> > As I mentioned earlier you don't need two page tables, because
> > it is easy to guarantee that the identity mapping and the virtual mapping
> > will not intersect on x86_64. Until we have as many physical address bits
> > as virtual address bits (which requires page table modifications) it
> > is nonsense to have 2 page tables here.
>
> Let me explain the theory behind my kexec patch:
>
> During prepare, page_table_a is initialized. For x86_64 page_table_b is
> also initialized.
>
> When it is time for machine_kexec() the following sequence takes place:
>
> 1. The C-code in machine_kexec() jumps to the assembly routine.
> To avoid overwriting the page table and still work with NX bits we jump
> to the original location of relocate_new_kernel. This is different from
> the unpatched code which jumps to the physical address (ie the identity
> mapped location).
>
> 2. The assembly code switches to page_table_a.
> page_table_a has mapped the control page at two virtual addresses:
> - The same virtual address as relocate_new_kernel is located at.
> (This explains the extra aligment in the assembly file)
> - An identity mapping, ie virtual address == physical address.
> After the switch the code runs at the same virtual address, but the
> physical page is now the control page.
>
> 3. Setup idt, gdt, segment registers and stack pointer.
> The stack pointer should point to the identity mapped page.
>
> 4. Jump to the identity mapped address.
> When the jump is performed we will be running at a virtual address which
> is the same as the physical address.
>
> 5. Turn off MMU (i386) / switch to page_table_b (x86_64).
> We are able to turn off the MMU or switch to page_table_b because we are
> already running at the physical address.
>
> 6. Proceed with the page copying business as usual...
>
> Ok, so far so good.
>
> The fun part begins when we throw in Xen into the mix. Linux under xen
> runs with pseudophysical addresses, ie what Linux thinks are physical
> addresses are not physical. Xen use the term machine addresses for
> addresses that are called physical address under "regular" Linux. On top
> of that is Xen using a different memory map than Linux.
>
> After prepare, all pages in page_table_a are passed to the hypervisor
> that overwrites the contents filled in by machine_prepare().
> (this explains the "ridiculous" array of struct page *)
>
> A similar two page mapping is used for here too, but in the xen case we
> use a different virtual address (the non-identity mapped address)
> compared to "regular" Linux. All to fit the address space used by xen.
>
> The xen port which is based on my patches is using a sequence similar to
> "regular" Linux:
>
> 1a. The C-code in xen_machine_kexec() performs a hypercall.
>
> 1b. The hypervisor jumps to the assembly code.
> After prepare we've created a NX-safe mapping for the control page. We
> jump to that NX-safe address to transfer control to the assembly code.
>
> Goto 2 above.
>
> So, to answer your question regarding two page table copies. You may be
> right that it can be made work with just one page table, but I feel my
> two table approach is nice because it will always work - regardless of
> the memory map used.
>

So you seem to be suggesting that code can be made to work (even with Xen)
with single identity mapped page table as used currently but it would fail
in certain circumstances (different memory map used). Can you please explain
a little more how?

This might be a very stupid question given the fact I am blissfully unaware
of all the details of Xen.

Thanks
Vivek

2006-05-25 16:02:49

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

Magnus Damm <[email protected]> writes:

> Let me explain the theory behind my kexec patch:

Please the next time you submit a please try to keep what
you are changing sufficiently clear that a followup email
is not needed to explain the code.

> During prepare, page_table_a is initialized. For x86_64 page_table_b is
> also initialized.
>
> When it is time for machine_kexec() the following sequence takes place:
>
> 1. The C-code in machine_kexec() jumps to the assembly routine.
> To avoid overwriting the page table and still work with NX bits we jump
> to the original location of relocate_new_kernel. This is different from
> the unpatched code which jumps to the physical address (ie the identity
> mapped location).
>
> 2. The assembly code switches to page_table_a.
> page_table_a has mapped the control page at two virtual addresses:
> - The same virtual address as relocate_new_kernel is located at.
> (This explains the extra aligment in the assembly file)
> - An identity mapping, ie virtual address == physical address.
> After the switch the code runs at the same virtual address, but the
> physical page is now the control page.

Sorry this is broken. The code location you are running from when you
perform the switch from one page table to another must be mapped at
the same location in both page tables. Otherwise it is undefined what
the processor will do.

So you need at least one additional entry in your page table.

The fact that this bug did not jump out is a clear sign you were
changing too many things at once, and did not have an adequate
explanation in your change log.

> 3. Setup idt, gdt, segment registers and stack pointer.
> The stack pointer should point to the identity mapped page.
>
> 4. Jump to the identity mapped address.
> When the jump is performed we will be running at a virtual address which
> is the same as the physical address.

After the jump is performed?

> 5. Turn off MMU (i386) / switch to page_table_b (x86_64).
> We are able to turn off the MMU or switch to page_table_b because we are
> already running at the physical address.
>
> 6. Proceed with the page copying business as usual...
>
> Ok, so far so good.
>
> The fun part begins when we throw in Xen into the mix. Linux under xen
> runs with pseudophysical addresses, ie what Linux thinks are physical
> addresses are not physical. Xen use the term machine addresses for
> addresses that are called physical address under "regular" Linux. On top
> of that is Xen using a different memory map than Linux.
>
> After prepare, all pages in page_table_a are passed to the hypervisor
> that overwrites the contents filled in by machine_prepare().
> (this explains the "ridiculous" array of struct page *)

Actually it doesn't really, because I don't have enough information
to infer which Xen call you are using or why it is sane. Certainly
Xen does not need struct page * pointers or Xen is too tightly coupled
with linux to be sane.

> A similar two page mapping is used for here too, but in the xen case we
> use a different virtual address (the non-identity mapped address)
> compared to "regular" Linux. All to fit the address space used by xen.
>
> The xen port which is based on my patches is using a sequence similar to
> "regular" Linux:
>
> 1a. The C-code in xen_machine_kexec() performs a hypercall.
>
> 1b. The hypervisor jumps to the assembly code.
> After prepare we've created a NX-safe mapping for the control page. We
> jump to that NX-safe address to transfer control to the assembly
> code.

I assume this is a Xen call with the semantics:
switch page tables and jump to location X in the new page tables.

I assume Xen is still running at this point?

> Goto 2 above.
>
> So, to answer your question regarding two page table copies. You may be
> right that it can be made work with just one page table, but I feel my
> two table approach is nice because it will always work - regardless of
> the memory map used.

Except that the memory map in linux is fixed. The x86_64 kernel will
run with negative addresses and physical addresses will remain
positive until a decade or two from now when we get 64bit physical
addresses.

Unless linux runs with a different memory map when running under Xen.

Eric

2006-05-25 16:37:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 01/03] kexec: Avoid overwriting the current pgd (V2, stubs)

"Magnus Damm" <[email protected]> writes:

> On 5/25/06, Eric W. Biederman <[email protected]> wrote:
>> Magnus Damm <[email protected]> writes:
>> > On Wed, 2006-05-24 at 20:41 -0600, Eric W. Biederman wrote:
>> >> Magnus Damm <[email protected]> writes:
>>
>> I believe I gave a complete explanation the first round.
>>
>> By having an extra extern variable you can export the offset of
>> a variable on the control code page, or what you need to compute
>> the offset.
>
> You explained some things last round, but there were still some
> questions left open. The main question was regarding "additional
> protection".

Memory that the kernel never sets up for DMA ever.

> I'd be happy to change to code into something that you would feel
> comfortable with, I just like to understand why. Having a
> per-architecture data area in struct kimage is by far the simplest way
> IMO.

But the problem is fundamentally hard. I do not want to encourage
people to make changes without thinking through all of the consequences.

So far my impression is that an additional per-architecture data area
is struct kimage encourages people to be sloppy, and it moves key structures
farther from where they are used. I am coming to suspect it is as bad
as ioctl.

I could probably be convinced with a good use of a per-architecture
area and a well reasoned argument. But at the moment changing the
infrastructure is unnecessary noise, so please drop that for now.

>> Part of the reason is you do more than one thing at a time, which makes
>> review much more difficult than it needs to be.
>
> Yeah, I know. I'm sorry about that. I took some time to split the
> patches in the most logical way I could think of. The reason behind
> not separating the segment code and the page_table_a code was that
> they both touched more or less the same lines.

Dependent patches are not a problem.

> And by global structure you mean the dynamically allocated struct
> kimage? If you find that abusive then I think that _not_ using an
> already existing structure is abusive. =)
>
> I just want to keep things as simple as possible.

Simplicity is good.

Doing unnecessary things is confusing and the issue is not good,
and at least until the Xen support is merged you were doing
unnecessary things.

Please just take it carefully. This is possibly the hardest
to debug code path in the kernel and currently it works. I
don't want to break that.

Eric

2006-05-25 16:41:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

"Magnus Damm" <[email protected]> writes:
>
> Also, I feel that my approach with a valid idt and gdt is more robust.

One of my biggest concerns with the current code is that it is not
sufficiently robust, in the kdump case. So I am all in favor things
that improve that situation. At the same time just moving code from C
to assembly doesn't make it more robust, especially when the comments
explaining what the code does don't come along.

>> The big problem was you did several things with a single patch,
>> and that made the review much more difficult than it had to be.
>>
>> Having to check if you correctly modified the page tables, while also
>> having to check for segmentation, and the interrupt descriptor
>> transformations was distracting.
>
> Let me know which parts you think are good and we will implement and
> review them bit by bit instead then.

Skip the infrastructure changes, and the rest looks like real
possibilities.

Eric

2006-05-26 01:57:16

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 00/03] kexec: Avoid overwriting the current pgd (V2)

On 5/26/06, Eric W. Biederman <[email protected]> wrote:
> "Magnus Damm" <[email protected]> writes:
> >
> > Also, I feel that my approach with a valid idt and gdt is more robust.
>
> One of my biggest concerns with the current code is that it is not
> sufficiently robust, in the kdump case. So I am all in favor things
> that improve that situation. At the same time just moving code from C
> to assembly doesn't make it more robust, especially when the comments
> explaining what the code does don't come along.

I agree that just moving the code does not help. But my code actually
loads a new set of gdts and idts and I'm hoping that it will improve
the robustness.

Regarding more comments I totally agree with you.

> >> The big problem was you did several things with a single patch,
> >> and that made the review much more difficult than it had to be.
> >>
> >> Having to check if you correctly modified the page tables, while also
> >> having to check for segmentation, and the interrupt descriptor
> >> transformations was distracting.
> >
> > Let me know which parts you think are good and we will implement and
> > review them bit by bit instead then.
>
> Skip the infrastructure changes, and the rest looks like real
> possibilities.

But I need to store my page tables somewhere, and there is no good
place to store them now. With good reasoning I can be convinced that
storing things on the control page is a good thing, and I'd like to
agree on something, but without good reasoning I will continue to
think that the control page solution is overly complex.

Thanks,

/ magnus

2006-05-26 02:23:24

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 01/03] kexec: Avoid overwriting the current pgd (V2, stubs)

Hi again Eric,

Thank you for your comments so far.

On 5/26/06, Eric W. Biederman <[email protected]> wrote:
> "Magnus Damm" <[email protected]> writes:
>
> > On 5/25/06, Eric W. Biederman <[email protected]> wrote:
> >> Magnus Damm <[email protected]> writes:
> >> > On Wed, 2006-05-24 at 20:41 -0600, Eric W. Biederman wrote:
> >> >> Magnus Damm <[email protected]> writes:
> >>
> >> I believe I gave a complete explanation the first round.
> >>
> >> By having an extra extern variable you can export the offset of
> >> a variable on the control code page, or what you need to compute
> >> the offset.
> >
> > You explained some things last round, but there were still some
> > questions left open. The main question was regarding "additional
> > protection".
>
> Memory that the kernel never sets up for DMA ever.

I had hoped that I would get a more detailed answer. I asked you the
following follow-up questions during the first round:

---

I suppose you mean that control pages have additional protection that
struct kimage does _not_ have. Protection provided by
kimage_alloc_control_pages(), right?

I agree with you that this protection is good. But I do not see how
that applies to my patch, because the page_table_a[] pages pointed out
by struct kimage are read out by machine_kexec() and passed as
arguments to the assembly code. So the assembly code itself never
tries to access struct kimage. All data accessed by the assembly code
is allocated with kimage_alloc_control_pages(), isn't that good
enough? Or maybe I'm misunderstanding?

---

I'm trying to understand your argument, but it does not make sense to me.

If you are worried that struct kimage will get overwritten by
background DMA, then why the are you storing the pointer to the
control page there? That pointer is _very_ important...

And if your DMA issue it is that important, why do you not allocate
struct kimage from kimage_alloc_control_pages()?

> > I'd be happy to change to code into something that you would feel
> > comfortable with, I just like to understand why. Having a
> > per-architecture data area in struct kimage is by far the simplest way
> > IMO.
>
> But the problem is fundamentally hard. I do not want to encourage
> people to make changes without thinking through all of the consequences.

That makes sense.

> So far my impression is that an additional per-architecture data area
> is struct kimage encourages people to be sloppy, and it moves key structures
> farther from where they are used. I am coming to suspect it is as bad
> as ioctl.

I see it from the other side. By _not_ having per-architecture data
the current kexec code for x86_64 unnecessarily allocates 2 physically
contiguous pages (1-order allocation).

> I could probably be convinced with a good use of a per-architecture
> area and a well reasoned argument. But at the moment changing the
> infrastructure is unnecessary noise, so please drop that for now.

But then I would have to make the code overly complex and duplicate
code that overloads the control page with a structure - both for i386
and x86_64. By adding a per-architecture data structure the amount of
code duplication is reduced IMO.

> >> Part of the reason is you do more than one thing at a time, which makes
> >> review much more difficult than it needs to be.
> >
> > Yeah, I know. I'm sorry about that. I took some time to split the
> > patches in the most logical way I could think of. The reason behind
> > not separating the segment code and the page_table_a code was that
> > they both touched more or less the same lines.
>
> Dependent patches are not a problem.

Good. I'm sorry that the patches are unreadable, they tend to get that
when one is modifying major parts of a file. I'd recommend that
instead of reading the patch directly it may be better to apply it and
read the resulting code.

> > And by global structure you mean the dynamically allocated struct
> > kimage? If you find that abusive then I think that _not_ using an
> > already existing structure is abusive. =)
> >
> > I just want to keep things as simple as possible.
>
> Simplicity is good.
>
> Doing unnecessary things is confusing and the issue is not good,
> and at least until the Xen support is merged you were doing
> unnecessary things.

Yes and no. I agree that modifying kernel code to fit code outside of
the tree is a bad thing, but I think my modification alone has value
for the kernel.

I think we both can agree on that overwriting the current page tables
should be avoided regardless of Xen. It's just a matter of where we
store data and if we are using one or two page tables.

> Please just take it carefully. This is possibly the hardest
> to debug code path in the kernel and currently it works. I
> don't want to break that.

I don't want to break it either, so careful is good. I think kexec
works pretty well, but I would not say it's working correctly in all
cases.

One issue on i386 with the current page table overwriting code is that
it may break with different kernel/userspace splits. I think my
"page_table_a" approach should work for all cases.

Also, I cannot get one x86_64 box here to work with kexec. It does not
work with the vanilla kexec, and not with my page_table_a patch. The
troublesome machine has a cpu reported as "AMD Athlon(tm) 64 Processor
3200+". The same code does however run on hardware with dual "AMD
Opteron(tm) Processor 244".

I'd like to work on solving these issues too, but I'd like to sort out
merging my current patches before starting on something new.

Thanks,

/ magnus

2006-05-26 03:17:16

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On 5/26/06, Eric W. Biederman <[email protected]> wrote:
> Magnus Damm <[email protected]> writes:
>
> > Let me explain the theory behind my kexec patch:
>
> Please the next time you submit a please try to keep what
> you are changing sufficiently clear that a followup email
> is not needed to explain the code.

Yes, I will try to do that. I guess what I'm doing requires a great
amount of explaining so let's do it bit by bit.

> > During prepare, page_table_a is initialized. For x86_64 page_table_b is
> > also initialized.
> >
> > When it is time for machine_kexec() the following sequence takes place:
> >
> > 1. The C-code in machine_kexec() jumps to the assembly routine.
> > To avoid overwriting the page table and still work with NX bits we jump
> > to the original location of relocate_new_kernel. This is different from
> > the unpatched code which jumps to the physical address (ie the identity
> > mapped location).
> >
> > 2. The assembly code switches to page_table_a.
> > page_table_a has mapped the control page at two virtual addresses:
> > - The same virtual address as relocate_new_kernel is located at.
> > (This explains the extra aligment in the assembly file)
> > - An identity mapping, ie virtual address == physical address.
> > After the switch the code runs at the same virtual address, but the
> > physical page is now the control page.
>
> Sorry this is broken. The code location you are running from when you
> perform the switch from one page table to another must be mapped at
> the same location in both page tables. Otherwise it is undefined what
> the processor will do.

The code is not broken. The code does exactly what you are talking
about. Maybe I was a bit unclear.

"page_table_a" contains two mappings. One which is the same as where
the page is mapped in the kernel (the virtual address where
relocate_new_kernel is located), and one identity mapped. So when the
switch occurs the cpu will continue to run on the same virtual
address.

> So you need at least one additional entry in your page table.
>
> The fact that this bug did not jump out is a clear sign you were
> changing too many things at once, and did not have an adequate
> explanation in your change log.

I will ignore that last part for now. I've tested the code on i386,
i386/pae, x86_64 (both opteron and athlon64) and all the combinations
with and without xen, and all configurations except x86_64 with both
kexec and kdump.

> > 3. Setup idt, gdt, segment registers and stack pointer.
> > The stack pointer should point to the identity mapped page.
> >
> > 4. Jump to the identity mapped address.
> > When the jump is performed we will be running at a virtual address which
> > is the same as the physical address.
>
> After the jump is performed?

Sorry for being unclear. Read it like this instead:

Before jumping we will be running from the virtual address provided by
the non-identity mapped page in page_table_a. This is the same page as
relocate_new_kernel.

After the jump we will be running from the identity mapped address
provided by page_table_a. This is the same address as the physical
address of the control page.

> > 5. Turn off MMU (i386) / switch to page_table_b (x86_64).
> > We are able to turn off the MMU or switch to page_table_b because we are
> > already running at the physical address.
> >
> > 6. Proceed with the page copying business as usual...
> >
> > Ok, so far so good.
> >
> > The fun part begins when we throw in Xen into the mix. Linux under xen
> > runs with pseudophysical addresses, ie what Linux thinks are physical
> > addresses are not physical. Xen use the term machine addresses for
> > addresses that are called physical address under "regular" Linux. On top
> > of that is Xen using a different memory map than Linux.
> >
> > After prepare, all pages in page_table_a are passed to the hypervisor
> > that overwrites the contents filled in by machine_prepare().
> > (this explains the "ridiculous" array of struct page *)
>
> Actually it doesn't really, because I don't have enough information
> to infer which Xen call you are using or why it is sane. Certainly
> Xen does not need struct page * pointers or Xen is too tightly coupled
> with linux to be sane.

The interface between dom0 and the hypervisor passes machine
addresses. So no struct pages are passed there. I'm not insane.

I actually went with struct page * over unsigned long because that's
the way the current struct kimage refers to the control page. On top
of that I must say that I prefer types over unsigned longs because I
think it leads to better code. But that's another story.

> > A similar two page mapping is used for here too, but in the xen case we
> > use a different virtual address (the non-identity mapped address)
> > compared to "regular" Linux. All to fit the address space used by xen.
> >
> > The xen port which is based on my patches is using a sequence similar to
> > "regular" Linux:
> >
> > 1a. The C-code in xen_machine_kexec() performs a hypercall.
> >
> > 1b. The hypervisor jumps to the assembly code.
> > After prepare we've created a NX-safe mapping for the control page. We
> > jump to that NX-safe address to transfer control to the assembly
> > code.
>
> I assume this is a Xen call with the semantics:
> switch page tables and jump to location X in the new page tables.

We are not switching any page tables more than whatever Xen likes to
do during a hypercall. We simply jump to a NX-safe mapped location
(fixmap) of the already loaded control page. Then the assembly code in
the control page does the same magic as in the Linux case.

> I assume Xen is still running at this point?

If the hypervisor is crashing the loaded kdump kernel will be started.
I guess Xen must be running for us to be able to go from dom0 to the
hypervisor, but it that's not the case there's not much we can do. The
dom0 kernel is not privileged to switch page tables by itself so we
must do it from within the hypervisor.

If you are referring to if kexec works under Xen, then yes:

http://lists.xensource.com/archives/html/xen-devel/2006-05/msg01272.html

> > Goto 2 above.
> >
> > So, to answer your question regarding two page table copies. You may be
> > right that it can be made work with just one page table, but I feel my
> > two table approach is nice because it will always work - regardless of
> > the memory map used.
>
> Except that the memory map in linux is fixed. The x86_64 kernel will
> run with negative addresses and physical addresses will remain
> positive until a decade or two from now when we get 64bit physical
> addresses.
>
> Unless linux runs with a different memory map when running under Xen.

Xen can be configured in many ways. Different page table modes.

And to be frank, I'm not exactly sure how the memory map differs on
all architectures when various configuration options are varied. And
I'm not that interested in it either because it feels like very
volatile knowledge.

I do however believe that the "page_table_a" approach works regardless
of memory map. Both with and without xen. Regardless of memory split.

Thanks,

/ magnus

2006-05-26 07:40:53

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

> 1a. The C-code in xen_machine_kexec() performs a hypercall.
>
> 1b. The hypervisor jumps to the assembly code.
> After prepare we've created a NX-safe mapping for the control page. We
> jump to that NX-safe address to transfer control to the assembly code.

This is about kexec'ing the physical machine, not the virtual machine,
right?

cheers,

Gerd

--
Gerd Hoffmann <[email protected]>
http://www.suse.de/~kraxel/julika-dora.jpeg

2006-05-26 09:02:20

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On 5/26/06, Gerd Hoffmann <[email protected]> wrote:
> > 1a. The C-code in xen_machine_kexec() performs a hypercall.
> >
> > 1b. The hypervisor jumps to the assembly code.
> > After prepare we've created a NX-safe mapping for the control page. We
> > jump to that NX-safe address to transfer control to the assembly code.
>
> This is about kexec'ing the physical machine, not the virtual machine,
> right?

Correct, kexec:ing from dom0.

/ magnus

2006-05-26 09:19:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

"Magnus Damm" <[email protected]> writes:

> On 5/26/06, Gerd Hoffmann <[email protected]> wrote:
>> > 1a. The C-code in xen_machine_kexec() performs a hypercall.
>> >
>> > 1b. The hypervisor jumps to the assembly code.
>> > After prepare we've created a NX-safe mapping for the control page. We
>> > jump to that NX-safe address to transfer control to the assembly code.
>>
>> This is about kexec'ing the physical machine, not the virtual machine,
>> right?
>
> Correct, kexec:ing from dom0.

And staying in dom0? Or does Xen go away?

Eric

2006-05-26 09:29:17

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On 5/26/06, Eric W. Biederman <[email protected]> wrote:
> "Magnus Damm" <[email protected]> writes:
>
> > On 5/26/06, Gerd Hoffmann <[email protected]> wrote:
> >> > 1a. The C-code in xen_machine_kexec() performs a hypercall.
> >> >
> >> > 1b. The hypervisor jumps to the assembly code.
> >> > After prepare we've created a NX-safe mapping for the control page. We
> >> > jump to that NX-safe address to transfer control to the assembly code.
> >>
> >> This is about kexec'ing the physical machine, not the virtual machine,
> >> right?
> >
> > Correct, kexec:ing from dom0.
>
> And staying in dom0? Or does Xen go away?

You replace what's running on the physical machine.

You can chose to kexec into a new "regular" Linux kernel (Xen goes
away), or kexec into a new Xen hypervisor with a new dom0 kernel (Xen
is replaced). Kexec behaves exactly the same as Linux today - no
patches are needed to kexec-tools.

Kdump is a little different though, we reserve the physical ram range
in the hypervisor and our hypervisor code is currently using a
differen format for the command line options. With Kdump under Xen it
is possible to take a memory snapshot of the entire machine - both the
hypervisor and dom0.

/ magnus

2006-05-26 10:42:55

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On 5/26/06, Vivek Goyal <[email protected]> wrote:
> On Thu, May 25, 2006 at 05:26:56PM +0900, Magnus Damm wrote:
> > So, to answer your question regarding two page table copies. You may be
> > right that it can be made work with just one page table, but I feel my
> > two table approach is nice because it will always work - regardless of
> > the memory map used.
>
> So you seem to be suggesting that code can be made to work (even with Xen)
> with single identity mapped page table as used currently but it would fail
> in certain circumstances (different memory map used). Can you please explain
> a little more how?

Sorry about the delay, your question needed some thinking.

I do not think that vanilla kexec x86_64 has any memory map related
problems, apart for the page table overwriting. And the page table
overwriting is not a bug that will make kdump fail, it just makes the
memory image less accurate. I do however think the accuracy is quite
important given that kdump mainly is used for memory image collection.

The main reason for using two sets of page tables is simplicity. The
page_table_a code for allocation and page table setup is more or less
identical for i386 and x86_64. The code was written to be generic, but
during the testing I realized that the architecture-specific header
files defined things differently so I needed to add some
architecture-specific macros as workarounds. It should be possible to
share the code in a common file.

Simon and I have tried to make the Xen port of kexec as simple as
possible. One design decision that I know Eric dislikes is the array
of pages for page_table_a. The reason behind this array is simplicity,
especially for our Xen port.

Our Xen port makes the dom0 kernel responsible for allocating pages.
These pages are then used by the hypervisor. Some of these pages are
page_table_a pages, and these pages are overwritten by the hypervisor
with mappings that fit the memory map used by Xen. If we would pass
the root pointer instead of an array of pages then the hypervisor
would have to be extended to include code to parse page tables,
extract pages and then fill in the new page table. That would be all
but simple.

Using a single page table with Xen would probably result in a more
complex solution - the hypervisor code must be able to parse both page
tables with both huge and normal pages. I feel that such a solution is
error prone and overly complex. Especially since we already have
something that works.

I must admit that I got a bit scared of all the different page table
modes that Xen can run in. If and how they can affect the memory map
is beyond me. So instead of thinking of _how_ the memory maps are
arranged under x86_64, i386, i386/pae using all config options I
thought it was better to write something generic.
That's how the page_table_a strategy came up. It came up as a way of
jumping to a physical address from any virtual address, regardless of
memory map.

> This might be a very stupid question given the fact I am blissfully unaware
> of all the details of Xen.

No worries. "blissfully unaware of all the details of xen"... you
lucky bastard! =)

Thank you for your comments!

/ magnus

2006-05-26 15:08:38

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On Fri, May 26, 2006 at 07:42:53PM +0900, Magnus Damm wrote:
> On 5/26/06, Vivek Goyal <[email protected]> wrote:
> >On Thu, May 25, 2006 at 05:26:56PM +0900, Magnus Damm wrote:
> >> So, to answer your question regarding two page table copies. You may be
> >> right that it can be made work with just one page table, but I feel my
> >> two table approach is nice because it will always work - regardless of
> >> the memory map used.
> >
> >So you seem to be suggesting that code can be made to work (even with Xen)
> >with single identity mapped page table as used currently but it would fail
> >in certain circumstances (different memory map used). Can you please
> >explain
> >a little more how?
>
> Sorry about the delay, your question needed some thinking.
>
> I do not think that vanilla kexec x86_64 has any memory map related
> problems, apart for the page table overwriting.

So we are trying to solve two problems here.

- Page table overwriting during kexec.
- Creating framework in advance to ensure kexec compatibility with Xen.

Definitely first one need to be solved now. But given the fact Xen code
is not mainline yet, I feel that second one should be solved once
Xen code is mainline.

> And the page table
> overwriting is not a bug that will make kdump fail, it just makes the
> memory image less accurate. I do however think the accuracy is quite
> important given that kdump mainly is used for memory image collection.
>
> The main reason for using two sets of page tables is simplicity. The
> page_table_a code for allocation and page table setup is more or less
> identical for i386 and x86_64. The code was written to be generic, but
> during the testing I realized that the architecture-specific header
> files defined things differently so I needed to add some
> architecture-specific macros as workarounds. It should be possible to
> share the code in a common file.
>
> Simon and I have tried to make the Xen port of kexec as simple as
> possible. One design decision that I know Eric dislikes is the array
> of pages for page_table_a. The reason behind this array is simplicity,
> especially for our Xen port.
>
> Our Xen port makes the dom0 kernel responsible for allocating pages.
> These pages are then used by the hypervisor. Some of these pages are
> page_table_a pages, and these pages are overwritten by the hypervisor
> with mappings that fit the memory map used by Xen. If we would pass
> the root pointer instead of an array of pages then the hypervisor
> would have to be extended to include code to parse page tables,
> extract pages and then fill in the new page table. That would be all
> but simple.
>

Again I am trying to understand the need for two page tables. So your
concern is that if current user space/kernel space memory split changes
in x86_64, it will break down the kexec and that's why the need of a
page_table_a which will help jump to identity mapped address in control
page and then you will switch to page_table_b which will help copying the
pages to destination.

Can't we create an additional entry in identity mapped page table
(page_table_b) to map the control page at the same virtual address
as current page table and then jump to control page (using virtual address)
and then swith to identity mapped page table (page_table_b).

This would make sure that existing page tables are not overwritten as
well there is no dependency on user space/kernel space memory split being
used by the current page tables.

I hope I understood the problem right. :-)

Thanks
Vivek

2006-05-26 15:52:26

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On 5/27/06, Vivek Goyal <[email protected]> wrote:
> On Fri, May 26, 2006 at 07:42:53PM +0900, Magnus Damm wrote:
> > On 5/26/06, Vivek Goyal <[email protected]> wrote:
> > >On Thu, May 25, 2006 at 05:26:56PM +0900, Magnus Damm wrote:
> > >> So, to answer your question regarding two page table copies. You may be
> > >> right that it can be made work with just one page table, but I feel my
> > >> two table approach is nice because it will always work - regardless of
> > >> the memory map used.
> > >
> > >So you seem to be suggesting that code can be made to work (even with Xen)
> > >with single identity mapped page table as used currently but it would fail
> > >in certain circumstances (different memory map used). Can you please
> > >explain
> > >a little more how?
> >
> > Sorry about the delay, your question needed some thinking.
> >
> > I do not think that vanilla kexec x86_64 has any memory map related
> > problems, apart for the page table overwriting.
>
> So we are trying to solve two problems here.
>
> - Page table overwriting during kexec.
> - Creating framework in advance to ensure kexec compatibility with Xen.

Correct. I'm actually supposed to only work on porting kexec to Linux
under Xen, but the page table overwriting issue annoyed me so I
thought that I would try to solve both problems at once.

> Definitely first one need to be solved now. But given the fact Xen code
> is not mainline yet, I feel that second one should be solved once
> Xen code is mainline.

That makes sense. My goal here is trying to write Xen-friendly code
which does not give too much penalty when used without Xen and will
not generate too much many conflicts when merging with Xen.

> > And the page table
> > overwriting is not a bug that will make kdump fail, it just makes the
> > memory image less accurate. I do however think the accuracy is quite
> > important given that kdump mainly is used for memory image collection.
> >
> > The main reason for using two sets of page tables is simplicity. The
> > page_table_a code for allocation and page table setup is more or less
> > identical for i386 and x86_64. The code was written to be generic, but
> > during the testing I realized that the architecture-specific header
> > files defined things differently so I needed to add some
> > architecture-specific macros as workarounds. It should be possible to
> > share the code in a common file.
> >
> > Simon and I have tried to make the Xen port of kexec as simple as
> > possible. One design decision that I know Eric dislikes is the array
> > of pages for page_table_a. The reason behind this array is simplicity,
> > especially for our Xen port.
> >
> > Our Xen port makes the dom0 kernel responsible for allocating pages.
> > These pages are then used by the hypervisor. Some of these pages are
> > page_table_a pages, and these pages are overwritten by the hypervisor
> > with mappings that fit the memory map used by Xen. If we would pass
> > the root pointer instead of an array of pages then the hypervisor
> > would have to be extended to include code to parse page tables,
> > extract pages and then fill in the new page table. That would be all
> > but simple.
> >
>
> Again I am trying to understand the need for two page tables. So your
> concern is that if current user space/kernel space memory split changes
> in x86_64, it will break down the kexec and that's why the need of a
> page_table_a which will help jump to identity mapped address in control
> page and then you will switch to page_table_b which will help copying the
> pages to destination.

If we omit the Xen case I'm pretty sure it can be done with just one
page table. No concerns about memory split on x86_64.

I don't see the problem with two page tables though. The two page
table approach makes the x86_64 port very similar to i386 and it makes
it possible to share code. And on top of that it will make the Xen
porting work much much simpler.

> Can't we create an additional entry in identity mapped page table
> (page_table_b) to map the control page at the same virtual address
> as current page table and then jump to control page (using virtual address)
> and then swith to identity mapped page table (page_table_b).
>
> This would make sure that existing page tables are not overwritten as
> well there is no dependency on user space/kernel space memory split being
> used by the current page tables.
>
> I hope I understood the problem right. :-)

Your reasoning seems right to me. =)
An additional entry in page_table_b will most likely do.

But, I think that the mix of large size and 4k pages (page_table_b is
mapped with large pages, and page_table_a with 4k) might introduce
some corner cases. I'm thinking about the case when the virtual
address for the control page (the only 4k mapping) happens to be in
the identity mapping range. In that case you will have to break up the
large page mapping and add several 4k mappings. Breaking up the large
page will require more pages to be allocated for the page table.

And if we build the Xen port on top of that (without an array of
pages) we will have to parse that mess in the hypervisor. And Xen
might have a different memory map which leads to a page table that has
a different virtual address for the 4k mapping. And this might
introduce a need to allocate extra pages from within the hypervisor,
which is something I definitely want to avoid.

To summarize, the single page table solution will work for most cases,
but it might be complicated if we happen to have the control page
mapped at a low address in the kernel/hypervisor. Mapping the page at
a low address forces us to allocate more pages for the page table. In
other words, the number of pages required to build the page table
varies depending on the virtual address where the control page is
mapped in the kernel/hypervisor.

The problematic case above might be very unlikely to happen with the
current x86_64 memory map, but I still think that the two page table
solution is a small price to pay for something that is guaranteed to
work in all cases.

Thank you for your comments!

/ magnus

2006-05-26 16:33:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

"Magnus Damm" <[email protected]> writes:

> The code is not broken. The code does exactly what you are talking
> about. Maybe I was a bit unclear.

Ack. I got a little confused. I was still thinking of what
the identity mapping was trying to achieve, and you don't achieve
the same thing.

> "page_table_a" contains two mappings. One which is the same as where
> the page is mapped in the kernel (the virtual address where
> relocate_new_kernel is located), and one identity mapped. So when the
> switch occurs the cpu will continue to run on the same virtual
> address.

In the general case which you want to cover page_table_a is still
slightly wrong. In particular you can theoretically get a conflict
between the virtual address of your page table switching function and
the physical address of the control code page.

However while I agree that you need to do this in assembly for
control I disagree that this code should be part of the
relocate_new_kernel function.

Please move the code that uses page_table_a to a separate function,
that when it is done jumps to the control_code page. Then you can
map this page both virtually and physically with a statically
allocated page table built a compile time.

This is a little simpler as you don't need to build this first
page table dynamically and a little clearer as you aren't trying to
get the control code page to serve two different functions.

If this function was more than 3 lines of assembly it could even
be written in C for clarity with a special section so that
the linker would not map it twice. Of course that would still need
to address the stack usage problem.


>> So you need at least one additional entry in your page table.
>>
>> The fact that this bug did not jump out is a clear sign you were
>> changing too many things at once, and did not have an adequate
>> explanation in your change log.
>
> I will ignore that last part for now. I've tested the code on i386,
> i386/pae, x86_64 (both opteron and athlon64) and all the combinations
> with and without xen, and all configurations except x86_64 with both
> kexec and kdump.

Testing is important here. But given that 99% of the bug hunting must
be done via code review and thinking about the problem testing is not
sufficient.

Eric

2006-05-29 08:40:34

by Magnus Damm

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On 5/27/06, Eric W. Biederman <[email protected]> wrote:
> "Magnus Damm" <[email protected]> writes:
>
> > The code is not broken. The code does exactly what you are talking
> > about. Maybe I was a bit unclear.
>
> Ack. I got a little confused. I was still thinking of what
> the identity mapping was trying to achieve, and you don't achieve
> the same thing.

No problem. The remapping code is all but simple.

> > "page_table_a" contains two mappings. One which is the same as where
> > the page is mapped in the kernel (the virtual address where
> > relocate_new_kernel is located), and one identity mapped. So when the
> > switch occurs the cpu will continue to run on the same virtual
> > address.
>
> In the general case which you want to cover page_table_a is still
> slightly wrong. In particular you can theoretically get a conflict
> between the virtual address of your page table switching function and
> the physical address of the control code page.

Are you sure?

The mappings in page_table_a are created by C code in the
kernel/hypervisor, right? One of the mappings is an identity mapping,
and the other one is mapping the control page at the same address as a
"NX-safe" page in the kernel/hypervisor.

When we are about to kexec into the new kernel, the kernel/hypervisor
is accessing the regular page tables as usual. machine_kexec() then
jumps to the assembly code where we disable interrupts and switch page
tables. Could you please clarify what you mean by "page table
switching function"?

The only conflict that I see is when the two mappings in page_table_a
happen to be on the same address. But that is not a problem,
everything should work as expected in that case too. The page_table_a
code should in that case just use one mapping instead of two.

I could of course be wrong and I'd really like to hear all doubts
people have regarding my "page_table_a" solution. So if anyonel thinks
that there's a problem somewhere please give me a detailed
description.

> However while I agree that you need to do this in assembly for
> control I disagree that this code should be part of the
> relocate_new_kernel function.
>
> Please move the code that uses page_table_a to a separate function,
> that when it is done jumps to the control_code page. Then you can
> map this page both virtually and physically with a statically
> allocated page table built a compile time.

This function, you write "uses page_table_a". Do you mean that the
function allocates it? Or fills it in? Or maybe switches to it? Please
clarify!

> This is a little simpler as you don't need to build this first
> page table dynamically and a little clearer as you aren't trying to
> get the control code page to serve two different functions.

But doesn't a static set of pages used for page_table_a just mean that
you are wasting valuable unswappable kernel memory? Also, how can you
be sure that the static pages are in a DMA-safe address range?

> If this function was more than 3 lines of assembly it could even
> be written in C for clarity with a special section so that
> the linker would not map it twice. Of course that would still need
> to address the stack usage problem.

I do not really see the benefit with this static solution actually.
The special section also feels a bit complicated. Maybe it fits your
big picture better and that is of course good.

It does however make the Xen porting more difficult. =(

> >> So you need at least one additional entry in your page table.
> >>
> >> The fact that this bug did not jump out is a clear sign you were
> >> changing too many things at once, and did not have an adequate
> >> explanation in your change log.
> >
> > I will ignore that last part for now. I've tested the code on i386,
> > i386/pae, x86_64 (both opteron and athlon64) and all the combinations
> > with and without xen, and all configurations except x86_64 with both
> > kexec and kdump.
>
> Testing is important here. But given that 99% of the bug hunting must
> be done via code review and thinking about the problem testing is not
> sufficient.

Agreed.

Thanks!

/ magnus

2006-05-31 17:20:08

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH 03/03] kexec: Avoid overwriting the current pgd (V2, x86_64)

On Mon, May 29, 2006 at 05:40:32PM +0900, Magnus Damm wrote:
>
> >However while I agree that you need to do this in assembly for
> >control I disagree that this code should be part of the
> >relocate_new_kernel function.
> >
> >Please move the code that uses page_table_a to a separate function,
> >that when it is done jumps to the control_code page. Then you can
> >map this page both virtually and physically with a statically
> >allocated page table built a compile time.
>
> This function, you write "uses page_table_a". Do you mean that the
> function allocates it? Or fills it in? Or maybe switches to it? Please
> clarify!
>
> >This is a little simpler as you don't need to build this first
> >page table dynamically and a little clearer as you aren't trying to
> >get the control code page to serve two different functions.
>
> But doesn't a static set of pages used for page_table_a just mean that
> you are wasting valuable unswappable kernel memory?

In your implementation, is control page swappable?

>Also, how can you
> be sure that the static pages are in a DMA-safe address range?
>

Why would kernel setup DMA on statically allocated pages?

Thanks
Vivek