2002-12-22 11:00:26

by Eric W. Biederman

Subject: [PATCH][CFT] kexec (rewrite) for 2.5.52


I have recently taken the time to dig through the internals of
kexec to see if I could make my code any simpler and have managed
to trim off about 100 lines, and have made the code much more
obviously correct.

The key realization was that I had a much simpler test to detect
whether a page was a destination page, allowing me to remove the
second pass and put all of the logic for avoiding stepping on my
own toes into the page allocator.

I have also made the small changes necessary to allow using high
memory pages. Since I cannot push the request for memory below 4GB
into alloc_pages, I simply push the unusable pages onto a list and
keep requesting memory. This should be o.k. on a large-memory machine,
but I can also see it becoming pathological. The advantage to using
high memory is that I should be able to use most of the memory below
4GB for a kernel image, and depending on how the zones are set up this
keeps me from artificially limiting myself. I get the feeling what I
really want is my own special zone, maybe later...

With all of the strange logic in kimage_alloc_page the code
is much more obviously correct, and in most cases it should run in
O(N) time, though it can still get pathological and run in O(N^2).

I have also made allocation of the reboot code buffer a little less
clever, removing a very small pathological case that was previously
present.

Anyway, I would love to know in what entertaining ways this code blows
up, or if I get lucky and it doesn't. I probably will not reply back
in a timely manner, as I am off to visit my parents for Christmas and
New Year's.

Eric

MAINTAINERS | 8
arch/i386/Kconfig | 17 +
arch/i386/kernel/Makefile | 1
arch/i386/kernel/entry.S | 1
arch/i386/kernel/machine_kexec.c | 140 +++++++++
arch/i386/kernel/relocate_kernel.S | 107 +++++++
include/asm-i386/kexec.h | 23 +
include/asm-i386/unistd.h | 1
include/linux/kexec.h | 54 +++
include/linux/reboot.h | 2
kernel/Makefile | 1
kernel/kexec.c | 547 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 23 +
13 files changed, 925 insertions

diff -uNr linux-2.5.52/MAINTAINERS linux-2.5.52.x86kexec-2/MAINTAINERS
--- linux-2.5.52/MAINTAINERS Thu Dec 12 07:41:16 2002
+++ linux-2.5.52.x86kexec-2/MAINTAINERS Mon Dec 16 02:24:32 2002
@@ -997,6 +997,14 @@
W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
S: Maintained

+KEXEC
+P: Eric Biederman
+M: [email protected]
+M: [email protected]
+W: http://www.xmission.com/~ebiederm/files/kexec/
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.5.52/arch/i386/Kconfig linux-2.5.52.x86kexec-2/arch/i386/Kconfig
--- linux-2.5.52/arch/i386/Kconfig Mon Dec 16 02:18:32 2002
+++ linux-2.5.52.x86kexec-2/arch/i386/Kconfig Mon Dec 16 02:23:00 2002
@@ -686,6 +686,23 @@
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y

+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down
+ your current kernel and start another kernel. It is like a reboot,
+ but it is independent of the system firmware. And like a reboot,
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
endmenu


diff -uNr linux-2.5.52/arch/i386/kernel/Makefile linux-2.5.52.x86kexec-2/arch/i386/kernel/Makefile
--- linux-2.5.52/arch/i386/kernel/Makefile Mon Dec 16 02:18:32 2002
+++ linux-2.5.52.x86kexec-2/arch/i386/kernel/Makefile Mon Dec 16 02:23:00 2002
@@ -24,6 +24,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o suspend_asm.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.52/arch/i386/kernel/entry.S linux-2.5.52.x86kexec-2/arch/i386/kernel/entry.S
--- linux-2.5.52/arch/i386/kernel/entry.S Thu Dec 12 07:41:17 2002
+++ linux-2.5.52.x86kexec-2/arch/i386/kernel/entry.S Sat Dec 21 23:36:10 2002
@@ -743,6 +743,7 @@
.long sys_epoll_wait
.long sys_remap_file_pages
.long sys_set_tid_address
+ .long sys_kexec_load


.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.52/arch/i386/kernel/machine_kexec.c linux-2.5.52.x86kexec-2/arch/i386/kernel/machine_kexec.c
--- linux-2.5.52/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.52.x86kexec-2/arch/i386/kernel/machine_kexec.c Sat Dec 21 16:07:05 2002
@@ -0,0 +1,140 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : "=m" (curidt)
+ );
+};
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : "=m" (curgdt)
+ );
+};
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+ /* This code is x86 specific...
+ * general purpose code must be more careful
+ * of caches and tlbs...
+ */
+ pgd_t *pgd;
+ pmd_t *pmd;
+ struct mm_struct *mm = current->mm;
+ spin_lock(&mm->page_table_lock);
+
+ pgd = pgd_offset(mm, address);
+ pmd = pmd_alloc(mm, pgd, address);
+
+ if (pmd) {
+ pte_t *pte = pte_alloc_map(mm, pmd, address);
+ if (pte) {
+ set_pte(pte,
+ mk_pte(pfn_to_page(address >> PAGE_SHIFT), PAGE_SHARED));
+ __flush_tlb_one(address);
+ }
+ }
+ spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+ unsigned long indirection_page;
+ unsigned long reboot_code_buffer;
+ void *ptr;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+ reboot_code_buffer = page_to_pfn(image->reboot_code_pages) << PAGE_SHIFT;
+ indirection_page = image->head & PAGE_MASK;
+
+ identity_map_page(reboot_code_buffer);
+
+ /* copy it out */
+ memcpy((void *)reboot_code_buffer, relocate_new_kernel, relocate_new_kernel_size);
+
+ /* The segment registers are funny things: they are
+ * automatically loaded from a table in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again until you set the segment to a different selector.
+ *
+ * The more common model is that they are caches where the
+ * behind-the-scenes work is done, but which are also dropped
+ * at arbitrary times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) reboot_code_buffer;
+ (*rnk)(indirection_page, reboot_code_buffer, image->start);
+}
+
diff -uNr linux-2.5.52/arch/i386/kernel/relocate_kernel.S linux-2.5.52.x86kexec-2/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.52/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.5.52.x86kexec-2/arch/i386/kernel/relocate_kernel.S Mon Dec 16 02:23:00 2002
@@ -0,0 +1,107 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+ /* Must be relocatable PIC code callable as a C function, that once
+ * it starts cannot use the previous process's stack.
+ *
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* indirection_page */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+
+ /* Set cr4 to a known state:
+ * Setting everything to zero seems safe.
+ */
+ movl %cr4, %eax
+ andl $0, %eax
+ movl %eax, %cr4
+
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ cld
+0: /* top, read another word for the indirection page */
+ movl %ebx, %ecx
+ movl (%ebx), %ecx
+ addl $4, %ebx
+ testl $0x1, %ecx /* is it a destination page */
+ jz 1f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+1:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 1f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+1:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 1f
+ jmp 2f
+1:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+2:
+
+ /* To be certain of avoiding problems with self modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.52/include/asm-i386/kexec.h linux-2.5.52.x86kexec-2/include/asm-i386/kexec.h
--- linux-2.5.52/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.52.x86kexec-2/include/asm-i386/kexec.h Sat Dec 21 14:18:31 2002
@@ -0,0 +1,23 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT is the maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.52/include/asm-i386/unistd.h linux-2.5.52.x86kexec-2/include/asm-i386/unistd.h
--- linux-2.5.52/include/asm-i386/unistd.h Thu Dec 12 07:41:35 2002
+++ linux-2.5.52.x86kexec-2/include/asm-i386/unistd.h Sat Dec 21 23:36:55 2002
@@ -264,6 +264,7 @@
#define __NR_epoll_wait 256
#define __NR_remap_file_pages 257
#define __NR_set_tid_address 258
+#define __NR_sys_kexec_load 259


/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.52/include/linux/kexec.h linux-2.5.52.x86kexec-2/include/linux/kexec.h
--- linux-2.5.52/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.52.x86kexec-2/include/linux/kexec.h Sat Dec 21 15:27:17 2002
@@ -0,0 +1,54 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <linux/list.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+#define KEXEC_SEGMENT_MAX 8
+struct kexec_segment {
+ void *buf;
+ size_t bufsz;
+ void *mem;
+ size_t memsz;
+};
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+ unsigned long offset;
+
+ unsigned long start;
+ struct page *reboot_code_pages;
+
+ unsigned long nr_segments;
+ struct kexec_segment segment[KEXEC_SEGMENT_MAX+1];
+
+ struct list_head dest_pages;
+ struct list_head unuseable_pages;
+};
+
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
+ struct kexec_segment *segments);
+extern struct kimage *kexec_image;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.52/include/linux/reboot.h linux-2.5.52.x86kexec-2/include/linux/reboot.h
--- linux-2.5.52/include/linux/reboot.h Thu Dec 12 07:41:37 2002
+++ linux-2.5.52.x86kexec-2/include/linux/reboot.h Mon Dec 16 02:23:00 2002
@@ -21,6 +21,7 @@
* POWER_OFF Stop OS and remove all power from system, if possible.
* RESTART2 Restart system using given command string.
* SW_SUSPEND Suspend system using Software Suspend if compiled in
+ * KEXEC Restart the system using a different kernel.
*/

#define LINUX_REBOOT_CMD_RESTART 0x01234567
@@ -30,6 +31,7 @@
#define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDC
#define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4
#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC 0x45584543


#ifdef __KERNEL__
diff -uNr linux-2.5.52/kernel/Makefile linux-2.5.52.x86kexec-2/kernel/Makefile
--- linux-2.5.52/kernel/Makefile Mon Dec 16 02:19:15 2002
+++ linux-2.5.52.x86kexec-2/kernel/Makefile Mon Dec 16 02:23:00 2002
@@ -21,6 +21,7 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_COMPAT) += compat.o

ifneq ($(CONFIG_IA64),y)
diff -uNr linux-2.5.52/kernel/kexec.c linux-2.5.52.x86kexec-2/kernel/kexec.c
--- linux-2.5.52/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.52.x86kexec-2/kernel/kexec.c Sun Dec 22 02:58:12 2002
@@ -0,0 +1,547 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/highmem.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+#include <asm/system.h>
+
+/* When kexec transitions to the new kernel there is a one to one
+ * mapping between physical and virtual addresses. On processors
+ * where you can disable the MMU this is trivial, and easy. For
+ * others it is still a simple predictable page table to setup.
+ *
+ * In that environment kexec copies the new kernel to its final
+ * resting place. This means I can only support memory whose
+ * physical address can fit in an unsigned long. In particular
+ * addresses where (pfn << PAGE_SHIFT) > ULONG_MAX cannot be handled.
+ * If the assembly stub has more restrictive requirements
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT can be
+ * defined more restrictively in <asm/kexec.h>.
+ *
+ * The code for the transition from the current kernel to the
+ * the new kernel is placed in the reboot_code_buffer, whose size
+ * is given by KEXEC_REBOOT_CODE_SIZE. In the best case only a single
+ * page of memory is necessary, but some architectures require more.
+ * Because this memory must be identity mapped in the transition from
+ * virtual to physical addresses it must live in the range
+ * 0 - TASK_SIZE, as only the user space mappings are arbitrarily
+ * modifiable.
+ *
+ * The assembly stub in the reboot code buffer is passed a linked list
+ * of descriptor pages detailing the source pages of the new kernel,
+ * and the destination addresses of those source pages. As this data
+ * structure is not used in the context of the current OS, it must
+ * be self contained.
+ *
+ * The code has been made to work with highmem pages and will use a
+ * destination page in it's final resting place (if it happens
+ * to allocate it). The end product of this is that most of the
+ * physical address space, and most of ram can be used.
+ *
+ * Future directions include:
+ * - allocating a page table with the reboot code buffer identity
+ * mapped, to simplify machine_kexec and make kexec_on_panic more
+ * reliable.
+ * - allocating the pages for a page table for machines that cannot
+ * disable their MMUs. (Hammer, Alpha...)
+ */
+
+/* KIMAGE_NO_DEST is an impossible destination address, used for
+ * allocating pages whose destination address we do not care about.
+ */
+#define KIMAGE_NO_DEST (-1UL)
+
+static int kimage_is_destination_range(
+ struct kimage *image, unsigned long start, unsigned long end);
+static struct page *kimage_alloc_reboot_code_pages(struct kimage *image);
+static struct page *kimage_alloc_page(struct kimage *image, unsigned int gfp_mask, unsigned long dest);
+
+static int kimage_alloc(struct kimage **rimage,
+ unsigned long nr_segments, struct kexec_segment *segments)
+{
+ int result;
+ struct kimage *image;
+ size_t segment_bytes;
+ struct page *reboot_pages;
+ unsigned long i;
+
+ /* Allocate a controlling structure */
+ result = -ENOMEM;
+ image = kmalloc(sizeof(*image), GFP_KERNEL);
+ if (!image) {
+ goto out;
+ }
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+
+ /* Initialize the list of destination pages */
+ INIT_LIST_HEAD(&image->dest_pages);
+
+ /* Initialize the list of unuseable pages */
+ INIT_LIST_HEAD(&image->unuseable_pages);
+
+ /* Read in the segments */
+ image->nr_segments = nr_segments;
+ segment_bytes = nr_segments * sizeof(*segments);
+ result = copy_from_user(image->segment, segments, segment_bytes);
+ if (result)
+ goto out;
+
+ /* Verify we have good destination addresses. The caller is
+ * responsible for making certain we don't attempt to load
+ * the new image into invalid or reserved areas of RAM. This
+ * just verifies it is an address we can use.
+ */
+ result = -EADDRNOTAVAIL;
+ for(i = 0; i < nr_segments; i++) {
+ unsigned long mend;
+ mend = ((unsigned long)(image->segment[i].mem)) +
+ image->segment[i].memsz;
+ if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
+ goto out;
+ }
+
+ /* Find a location for the reboot code buffer, and add it
+ * to the vector of segments so that its pages will also be
+ * counted as destination pages.
+ */
+ result = -ENOMEM;
+ reboot_pages = kimage_alloc_reboot_code_pages(image);
+ if (!reboot_pages) {
+ printk(KERN_ERR "Could not allocate reboot_code_buffer\n");
+ goto out;
+ }
+ image->reboot_code_pages = reboot_pages;
+ image->segment[nr_segments].buf = 0;
+ image->segment[nr_segments].bufsz = 0;
+ image->segment[nr_segments].mem = (void *)(page_to_pfn(reboot_pages) << PAGE_SHIFT);
+ image->segment[nr_segments].memsz = KEXEC_REBOOT_CODE_SIZE;
+ image->nr_segments++;
+
+ result = 0;
+ out:
+ if (result == 0) {
+ *rimage = image;
+ } else {
+ kfree(image);
+ }
+ return result;
+}
+
+static int kimage_is_destination_range(
+ struct kimage *image, unsigned long start, unsigned long end)
+{
+ unsigned long i;
+ for(i = 0; i < image->nr_segments; i++) {
+ unsigned long mstart, mend;
+ mstart = (unsigned long)image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz;
+ if ((end > mstart) && (start < mend)) {
+ return 1;
+ }
+ }
+ return 0;
+}
+
+struct page *kimage_alloc_reboot_code_pages(struct kimage *image)
+{
+ /* The reboot code buffer is special. It is the only set of
+ * pages that must be allocated in their final resting place,
+ * and the only set of pages whose final resting place we can
+ * pick.
+ *
+ * At worst this runs in O(N) of the image size.
+ */
+ struct list_head extra_pages, *pos, *next;
+ struct page *pages;
+ unsigned long addr;
+ int order;
+ order = get_order(KEXEC_REBOOT_CODE_SIZE);
+ INIT_LIST_HEAD(&extra_pages);
+ do {
+ pages = alloc_pages(GFP_HIGHUSER, order);
+ addr = page_to_pfn(pages) << PAGE_SHIFT;
+ if ((page_to_pfn(pages) >= (TASK_SIZE >> PAGE_SHIFT)) ||
+ kimage_is_destination_range(image, addr, addr + KEXEC_REBOOT_CODE_SIZE)) {
+ list_add(&pages->list, &extra_pages);
+ pages = 0;
+ }
+ } while(!pages);
+ /* If I could convert a multi page allocation into a bunch of
+ * single page allocations I could add these pages to
+ * image->dest_pages. For now it is simpler to just free the
+ * pages again.
+ */
+ list_for_each_safe(pos, next, &extra_pages) {
+ struct page *page;
+ page = list_entry(pos, struct page, list);
+ list_del(&page->list);
+ __free_pages(page, order);
+ }
+ return pages;
+}
+
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (image->offset != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ struct page *page;
+ page = kimage_alloc_page(image, GFP_KERNEL, KIMAGE_NO_DEST);
+ if (!page) {
+ return -ENOMEM;
+ }
+ ind_page = page_address(page);
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ image->offset = 0;
+ return 0;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+ destination &= PAGE_MASK;
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+ page &= PAGE_MASK;
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static void kimage_free_extra_pages(struct kimage *image)
+{
+ /* Walk through and free any extra destination pages I may have */
+ struct list_head *pos, *next;
+ list_for_each_safe(pos, next, &image->dest_pages) {
+ struct page *page;
+ page = list_entry(pos, struct page, list);
+ list_del(&page->list);
+ __free_page(page);
+ }
+ /* Walk through and free any unuseable pages I have cached */
+ list_for_each_safe(pos, next, &image->unuseable_pages) {
+ struct page *page;
+ page = list_entry(pos, struct page, list);
+ list_del(&page->list);
+ __free_page(page);
+ }
+
+}
+static int kimage_terminate(struct kimage *image)
+{
+ int result;
+ result = kimage_add_entry(image, IND_DONE);
+ if (result == 0) {
+ /* Point at the terminating element */
+ image->entry--;
+ kimage_free_extra_pages(image);
+ }
+ return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+ if (!image)
+ return;
+ kimage_free_extra_pages(image);
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+ }
+ }
+ __free_pages(image->reboot_code_pages, get_order(KEXEC_REBOOT_CODE_SIZE));
+ kfree(image);
+}
+
+static kimage_entry_t *kimage_dst_used(struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static struct page *kimage_alloc_page(struct kimage *image, unsigned int gfp_mask, unsigned long destination)
+{
+ /* Here we implement safeguards to ensure that a source page
+ * is not copied to its destination page before the data on
+ * the destination page is no longer useful.
+ *
+ * To do this we maintain the invariant that a source page is
+ * either its own destination page, or it is not a
+ * destination page at all.
+ *
+ * That is slightly stronger than required, but the proof
+ * that no problems will occur is trivial, and the
+ * implementation is simple to verify.
+ *
+ * When allocating all pages normally this algorithm will run
+ * in O(N) time, but in the worst case it will run in O(N^2)
+ * time. If the runtime is a problem the data structures can
+ * be fixed.
+ */
+ struct page *page;
+ unsigned long addr;
+
+ /* Walk through the list of destination pages, and see if I
+ * have a match.
+ */
+ list_for_each_entry(page, &image->dest_pages, list) {
+ addr = page_to_pfn(page) << PAGE_SHIFT;
+ if (addr == destination) {
+ list_del(&page->list);
+ return page;
+ }
+ }
+ page = 0;
+ while(1) {
+ kimage_entry_t *old;
+ /* Allocate a page, if we run out of memory give up */
+ page = alloc_page(gfp_mask);
+ if (!page) {
+ return 0;
+ }
+
+ /* If the page cannot be used, file it away */
+ if (page_to_pfn(page) > (KEXEC_SOURCE_MEMORY_LIMIT >> PAGE_SHIFT)) {
+ list_add(&page->list, &image->unuseable_pages);
+ continue;
+ }
+ addr = page_to_pfn(page) << PAGE_SHIFT;
+
+ /* If it is the destination page we want, use it */
+ if (addr == destination)
+ break;
+
+ /* If the page is not a destination page use it */
+ if (!kimage_is_destination_range(image, addr, addr + PAGE_SIZE))
+ break;
+
+ /* I know that the page is someone's destination page.
+ * See if there is already a source page for this
+ * destination page. And if so swap the source pages.
+ */
+ old = kimage_dst_used(image, addr);
+ if (old) {
+ /* If so move it */
+ unsigned long old_addr;
+ struct page *old_page;
+
+ old_addr = *old & PAGE_MASK;
+ old_page = pfn_to_page(old_addr >> PAGE_SHIFT);
+ copy_highpage(page, old_page);
+ *old = addr | (*old & ~PAGE_MASK);
+
+ /* The old page I have found cannot be a
+ * destination page, so return it.
+ */
+ addr = old_addr;
+ page = old_page;
+ break;
+ }
+ else {
+ /* Place the page on the destination list; I
+ * will use it later.
+ */
+ list_add(&page->list, &image->dest_pages);
+ }
+ }
+ return page;
+}
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long mstart;
+ int result;
+ unsigned long offset;
+ unsigned long offset_end;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ mstart = (unsigned long)segment->mem;
+
+ offset_end = segment->memsz;
+
+ result = kimage_set_destination(image, mstart);
+ if (result < 0) {
+ goto out;
+ }
+ for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) {
+ struct page *page;
+ char *ptr;
+ size_t size, leader;
+ page = kimage_alloc_page(image, GFP_HIGHUSER, mstart + offset);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, page_to_pfn(page) << PAGE_SHIFT);
+ if (result < 0) {
+ goto out;
+ }
+ ptr = kmap(page);
+ if (segment->bufsz < offset) {
+ /* We are past the end; zero the whole page */
+ memset(ptr, 0, PAGE_SIZE);
+ kunmap(page);
+ continue;
+ }
+ size = PAGE_SIZE;
+ leader = 0;
+ if (offset == 0) {
+ leader = mstart & ~PAGE_MASK;
+ }
+ if (leader) {
+ /* We are on the first page; zero the unused portion */
+ memset(ptr, 0, leader);
+ size -= leader;
+ ptr += leader;
+ }
+ if (size > (segment->bufsz - offset)) {
+ size = segment->bufsz - offset;
+ }
+ if (size < (PAGE_SIZE - leader)) {
+ /* zero the trailing part of the page */
+ memset(ptr + size, 0, (PAGE_SIZE - leader) - size);
+ }
+ result = copy_from_user(ptr, buf + offset, size);
+ kunmap(page);
+ if (result) {
+ result = (result < 0) ? result : -EIO;
+ goto out;
+ }
+ }
+ out:
+ return result;
+}
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down. Preventing on-going dmas, and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number,
+ * copies the image to its final destination, and jumps into
+ * the image at entry.
+ *
+ * kexec does not sync or unmount filesystems, so if you need
+ * that to happen you need to do it yourself.
+ */
+struct kimage *kexec_image = 0;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+ struct kexec_segment *segments, unsigned long flags)
+{
+ struct kimage *image;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* In case we need just a little bit of special behavior for
+ * reboot on panic
+ */
+ if (flags != 0)
+ return -EINVAL;
+
+ if (nr_segments > KEXEC_SEGMENT_MAX)
+ return -EINVAL;
+ image = 0;
+
+ result = 0;
+ if (nr_segments > 0) {
+ unsigned long i;
+ result = kimage_alloc(&image, nr_segments, segments);
+ if (result) {
+ goto out;
+ }
+ image->start = entry;
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &segments[i]);
+ if (result) {
+ goto out;
+ }
+ }
+ result = kimage_terminate(image);
+ if (result) {
+ goto out;
+ }
+ }
+
+ image = xchg(&kexec_image, image);
+
+ out:
+ kimage_free(image);
+ return result;
+}
diff -uNr linux-2.5.52/kernel/sys.c linux-2.5.52.x86kexec-2/kernel/sys.c
--- linux-2.5.52/kernel/sys.c Thu Dec 12 07:41:37 2002
+++ linux-2.5.52.x86kexec-2/kernel/sys.c Mon Dec 16 02:23:00 2002
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/times.h>
@@ -207,6 +208,7 @@
cond_syscall(sys_lookup_dcookie)
cond_syscall(sys_swapon)
cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
cond_syscall(sys_init_module)
cond_syscall(sys_delete_module)

@@ -419,6 +421,27 @@
machine_restart(buffer);
break;

+#ifdef CONFIG_KEXEC
+ case LINUX_REBOOT_CMD_KEXEC:
+ {
+ struct kimage *image;
+ if (arg) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ image = xchg(&kexec_image, 0);
+ if (!image) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_running = 0;
+ device_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_kexec(image);
+ break;
+ }
+#endif
#ifdef CONFIG_SOFTWARE_SUSPEND
case LINUX_REBOOT_CMD_SW_SUSPEND:
if (!software_suspend_enabled) {


2002-12-31 14:25:48

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH][CFT] kexec (rewrite) for 2.5.52

On Sun, Dec 22, 2002 at 04:07:52AM -0700, Eric W. Biederman wrote:
>
> I have recently taken the time to dig through the internals of
> kexec to see if I could make my code any simpler and have managed
> to trim off about 100 lines, and have made the code much more
> obviously correct.
>
> Anyway, I would love to know in what entertaining ways this code blows
> up, or if I get lucky and it doesn't. I probably will not reply back
> in a timely manner as I am off to visit my parents, for Christmas and
> New Years.
>

The good news is that it worked for me. Not only that, I have just
managed to get lkcd to save a dump in memory and then write it out
to disk after a kexec soft boot ! I haven't tried real panic cases yet
(which probably won't work rightaway :) ) and have testing and
tuning to do. But kexec seems to be looking good.

Have a wonderful new year.

Regards
Suparna

2003-01-03 10:29:48

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH][CFT] kexec (rewrite) for 2.5.52

Suparna Bhattacharya <[email protected]> writes:

> The good news is that it worked for me. Not only that, I have just
> managed to get lkcd to save a dump in memory and then write it out
> to disk after a kexec soft boot ! I haven't tried real panic cases yet
> (which probably won't work rightaway :) ) and have testing and
> tuning to do. But kexec seems to be looking good.

Nice. Any pointers besides lkcd.sourceforge.net

For the kexec on panic case there is a little code motion yet to be
done so that no memory allocations need to happen. The big one is
setting up a page table with the reboot code buffer identity mapped.

I am tempted to do the identity mapping of the reboot code buffer in
init_mm, but for starters I will look at how complex it will be to
have a spare mm just sitting around for that purpose. When I get
to dealing with the architectures like the hammer, and the alpha where
you always need page tables I will need to develop an architecture
specific hook for building the page tables needed by the
code residing in the reboot code buffer, (because virtual memory
cannot be disabled), but that should be straightforward.
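For readers unfamiliar with what "identity mapped" buys here: the reboot code can keep running across the switch out of paged mode only because its virtual address equals its physical address. A toy model in C may make that concrete (this is purely illustrative; the single-level table below is invented for the sketch and bears no relation to the real i386 page-table layout):

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  (~(uintptr_t)0xfff)
#define NR_ENTRIES 1024

/* Toy single-level "page table": entry i holds the physical base of
 * the page mapped at virtual page i.  Purely illustrative. */
static uintptr_t table[NR_ENTRIES];

static void identity_map(uintptr_t va)
{
	/* the virtual page number maps to the same physical frame */
	table[(va >> PAGE_SHIFT) % NR_ENTRIES] = va & PAGE_MASK;
}

static uintptr_t translate(uintptr_t va)
{
	return table[(va >> PAGE_SHIFT) % NR_ENTRIES] | (va & ~PAGE_MASK);
}
```

The only property that matters is translate(addr) == addr for the reboot code buffer, so a jump to the buffer's physical address still lands in mapped code at the moment paging is turned off.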

My goal is to have no locks on the kexec part of the panic path. And
the current memory allocations are the only really bad part of that.

A dump question. Why doesn't the lkcd stuff use the normal ELF core
dump format? allowing ``gdb vmlinux core'' to work?

Eric

2003-01-03 12:31:36

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH][CFT] kexec (rewrite) for 2.5.52

On Fri, Jan 03, 2003 at 03:37:06AM -0700, Eric W. Biederman wrote:
> Suparna Bhattacharya <[email protected]> writes:
>
> > The good news is that it worked for me. Not only that, I have just
> > managed to get lkcd to save a dump in memory and then write it out
> > to disk after a kexec soft boot ! I haven't tried real panic cases yet
> > (which probably won't work rightaway :) ) and have testing and
> > tuning to do. But kexec seems to be looking good.
>
> Nice. Any pointers besides lkcd.sourceforge.net

I haven't posted this code to lkcd as yet - so far I'd only
checked in the preparatory code reshuffle into lkcd cvs. There are
still some things to improve and think about, but am planning
to upgrade to the latest tree early next week and put things
out, and then work on it incrementally.

>
> For the kexec on panic case there is a little code motion yet to be
> done so that no memory allocations need to happen. The big one is
> setting up a page table with the reboot code buffer identity mapped.

I missed noticing that.
Bootimg avoided the allocation at this stage. It did something like
this:

+static unsigned long get_identity_mapped_page(void)
+{
+ set_pgd(pgd_offset(current->active_mm,
+ virt_to_phys(unity_page)), __pgd((_KERNPG_TABLE +
+ _PAGE_PSE + (virt_to_phys(unity_page)&PGDIR_MASK))));
+ return (unsigned long)unity_page;
+}

where unity page is within directly mapped memory (not highmem).

>
> I am tempted to do the identity mapping of the reboot code buffer in
> init_mm, but for starters I will look at how complex it will be to
> have a spare mm just sitting around for that purpose. When I get
> to dealing with the architectures like the hammer, and the alpha where
> you always need page tables I will need to develop an architecture
> specific hook for building the page tables needed by the
> code residing in the reboot code buffer, (because virtual memory
> cannot be disabled), but that should be straight forward.

A spare mm may be something which I could use for the crash dump
pages mapping possibly simpler than the way it is maintained
right now ... but haven't given enough thought to it yet.

>
> My goal is to have no locks on the kexec part of the panic path. And
> the current memory allocations are the only really bad part of that.

OK.

>
> A dump question. Why doesn't the lkcd stuff use the normal ELF core
> dump format? allowing ``gdb vmlinux core'' to work?

I guess it's ultimately a choice of format: how much processing
to do at dump time vs. afterwards prior to analysis, and whether
it captures all aspects relevant for the kind of analysis
intended. The lkcd dump format appears to be designed in a way that
makes it suitable for crash-dump situations. It takes
the approach of simplifying work at dump time (desirable). It
enables pages to be dumped right away in any order with a header
preceding the page dumped, which makes it easier to support extraction
of information from truncated dumps. This also makes it easier to do
selective dumping and placement of more critical data earlier
in the dump.

Secondly, it retains the notion of pages being dumped by physical
address, with interpretation/conversion from virt-to-phys at
analysis time being taken care of by the analyser or convertor. For example
there has been work on a post processor that generates a core file
from the lkcd dump corresponding to a given task/address space
context for analysis via gdb. Similarly there is a capability
in lcrash that lets one generate a (smaller) selected subset of
dumped state from an existing dump, which can be mailed out from
a remote site for analysis.

The tradeoff is that there is a bit of pre-processing that happens
prior to analysis for generation of an index, or conversion
depending on what analysis tool gets used. But that time is less
crucial than actual dump time.
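To make the described layout concrete, a format of this kind, one small header preceding each page, written out in any order, could be sketched in C like this (the struct and field names are my guesses for illustration, not lkcd's actual on-disk structures):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-page dump record, sketched from the description
 * above; not lkcd's real dump header. */
struct page_hdr {
	uint64_t phys_addr;	/* physical address of the dumped page */
	uint32_t size;		/* bytes of data following this header */
	uint32_t flags;		/* e.g. raw vs. compressed */
};

/* Emit one page record.  Pages can go out in any order, and a
 * truncated dump still yields every complete record before the cut,
 * which is the property the text above highlights. */
static int emit_page(FILE *out, uint64_t phys, const void *data, uint32_t len)
{
	struct page_hdr hdr = { .phys_addr = phys, .size = len, .flags = 0 };
	if (fwrite(&hdr, sizeof(hdr), 1, out) != 1)
		return -1;
	return fwrite(data, len, 1, out) == 1 ? 0 : -1;
}
```

Selective dumping then just means calling emit_page for the critical pages first; extraction walks header-to-header until the file ends.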

Am cc'ing lkcd-devel on this one - there are experts who can
add to this or answer this question better than I can.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-01-04 00:26:02

by Andy Pfiffer

[permalink] [raw]
Subject: 2.5.54: Re: [PATCH][CFT] kexec (rewrite) for 2.5.52

On Tue, 2002-12-31 at 06:35, Suparna Bhattacharya wrote:
> On Sun, Dec 22, 2002 at 04:07:52AM -0700, Eric W. Biederman wrote:
> >
> > I have recently taken the time to dig through the internals of
> > kexec to see if I could make my code any simpler and have managed
> > to trim off about 100 lines, and have made the code much more
> > obviously correct.
> >
> > Anyway, I would love to know in what entertaining ways this code blows
> > up, or if I get lucky and it doesn't. I probably will not reply back
> > in a timely manner as I am off to visit my parents, for Christmas and
> > New Years.
> >

Eric,

The patch applied cleanly to 2.5.54 for me.

The kexec portion works just fine and the reboot discovers all of the
memory on my system using kexec_tools 1.8.

However, something has recently changed in the 2.5.5x series that causes
the reboot to hang while calibrating the delay loop after a kexec
reboot:

setup16_end: 00091b2f
Synchronizing SCSI caches:
Shutting down devices
Starting new kernel
Linux version 2.5.54 (andyp@joe) (gcc version 2.95.3 20010315 (SuSE)) #2
SMP Fri Jan 3 21:36:51 PST 2003
Video mode to be used for restore is ffff
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009dc00 (usable)
BIOS-e820: 000000000009dc00 - 00000000000a0000 (reserved)
BIOS-e820: 0000000000100000 - 0000000027fed140 (usable)
BIOS-e820: 0000000027fed140 - 0000000027ff0000 (ACPI data)
BIOS-e820: 0000000027ff0000 - 0000000028000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
639MB LOWMEM available.
found SMP MP-table at 0009ddd0
hm, page 0009d000 reserved twice.
hm, page 0009e000 reserved twice.
hm, page 0009d000 reserved twice.
hm, page 0009e000 reserved twice.
WARNING: MP table in the EBDA can be UNSAFE, contact
[email protected] if you experience SMP problems!
On node 0 totalpages: 163821
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 159725 pages, LIFO batch:16
HighMem zone: 0 pages, LIFO batch:1
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: IBM ENSW Product ID: xSeries 220 APIC at: 0xFEE00000
Processor #0 6:8 APIC version 17
I/O APIC #14 Version 17 at 0xFEC00000.
I/O APIC #13 Version 17 at 0xFEC01000.
Enabling APIC mode: Flat. Using 2 I/O APICs
Processors: 1
IBM machine detected. Enabling interrupts during APM calls.
IBM machine detected. Disabling SMBus accesses.
Building zonelist for node : 0
Kernel command line: auto BOOT_IMAGE=linux-2.5 ro root=805
console=ttyS0,9600n8
Initializing
CPU#0
Detected 799.578 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop...

<wedged>

This happens with -and- without the separate "hwfixes" chunk of code
(that patch carries forward and continues to apply cleanly).

It would appear that clock interrupts are no longer arriving (ticks always
equal jiffies).

You can download the kexec patches for 2.5.54 from OSDL's PLM service:
(apologies in advance for the long URL):
https://www.osdl.org/cgi-bin/plm?module=search&search_patch=kexec-rewrite&search_created=Anytime&search_format=detailed&action=run_patch_search&sort_field=idDESC

If the URL is mangled, go here:
https://www.osdl.org/cgi-bin/plm?module=search
and then put "kexec-rewrite" into the "Patch Name or ID" box,
and then press "Submit Query".

Key:
kexec-rewrite-2.5.54-1-of-3-1 == your rewrite from 2002-12-22
kexec-rewrite-2.5.54-2-of-3-1 == your "hwfixes" from 2.5.48ish
kexec-rewrite-2.5.54-3-of-3-1 == ignore it (changes CONFIG_KEXEC=y for PLM)

Regards,
Andy




2003-01-04 18:49:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: 2.5.54: Re: [PATCH][CFT] kexec (rewrite) for 2.5.52

Andy Pfiffer <[email protected]> writes:

> Eric,
>
> The patch applied cleanly to 2.5.54 for me.
>
> The kexec portion works just fine and the reboot discovers all of the
> memory on my system using kexec_tools 1.8.
>
> However, something has recently changed in the 2.5.5x series that causes
> the reboot to hang while calibrating the delay loop after a kexec
> reboot:

Thanks, I will take a look. It looks like something is definitely having
interrupt problems...

BTW, Have you tried booting an older kernel?
That would help indicate where the problem is. I am pretty certain
it is from somewhere in the kernel's initialization path.

Eric

2003-01-04 20:27:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH][CFT] kexec (rewrite) for 2.5.52

Suparna Bhattacharya <[email protected]> writes:

> On Fri, Jan 03, 2003 at 03:37:06AM -0700, Eric W. Biederman wrote:
> > Suparna Bhattacharya <[email protected]> writes:
> >
> > > The good news is that it worked for me. Not only that, I have just
> > > managed to get lkcd to save a dump in memory and then write it out
> > > to disk after a kexec soft boot ! I haven't tried real panic cases yet
> > > (which probably won't work rightaway :) ) and have testing and
> > > tuning to do. But kexec seems to be looking good.
> >
> > Nice. Any pointers besides lkcd.sourceforge.net
>
> I haven't posted this code to lkcd as yet - so far I'd only
> checked in the preparatory code reshuffle into lkcd cvs. There are
> still some things to improve and think about, but am planning
> to upgrade to the latest tree early next week and put things
> out, and then work on it incrementally.

O.k.

> > For the kexec on panic case there is a little code motion yet to be
> > done so that no memory allocations need to happen. The big one is
> > setting up a page table with the reboot code buffer identity mapped.
>
> I missed noticing that.
> Bootimg avoided the allocation at this stage. It did something like
> this:
>
> +static unsigned long get_identity_mapped_page(void)
> +{
> + set_pgd(pgd_offset(current->active_mm,
> + virt_to_phys(unity_page)), __pgd((_KERNPG_TABLE +
> + _PAGE_PSE + (virt_to_phys(unity_page)&PGDIR_MASK))));
> + return (unsigned long)unity_page;
> +}
>
> where unity page is within directly mapped memory (not highmem).

With unity_page being allocated ahead of time...
But there is some other trick it is pulling to make certain the
intermediate page table entries are present. Spooky and I don't want
to go there.

> > I am tempted to do the identity mapping of the reboot code buffer in
> > init_mm, but for starters I will look at how complex it will be to
> > have a spare mm just sitting around for that purpose. When I get
> > to dealing with the architectures like the hammer, and the alpha where
> > you always need page tables I will need to develop an architecture
> > specific hook for building the page tables needed by the
> > code residing in the reboot code buffer, (because virtual memory
> > cannot be disabled), but that should be straight forward.
>
> A spare mm may be something which I could use for the crash dump
> pages mapping possibly simpler than the way it is maintained
> right now ... but haven't given enough thought to it yet.

Given that it is likely only to be a temporary thing I doubt it will
help. A very interesting question along those lines is how do
you get at all of the memory you are dumping, especially in PAE mode.
I have not seen the code that handles that part at all...

Eric

2003-01-04 22:34:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH][CFT] kexec (rewrite) for 2.5.52

Suparna Bhattacharya <[email protected]> writes:

> On Fri, Jan 03, 2003 at 03:37:06AM -0700, Eric W. Biederman wrote:
> > A dump question. Why doesn't the lkcd stuff use the normal ELF core
> > dump format? allowing ``gdb vmlinux core'' to work?

Digesting.... Of the pieces, I have no problem if a valid
ELF core dump is written even if gdb does not know what to do with it out
of the box. The piece that disturbed me most is that the file format
seemed to be mutating from release to release.

An ELF core dump consists of:
ELF header
ELF Program header
ELF Note segment
Data Segments...

All of the weird processor specific information can be stored as
various ELF note types.

Compression of pages that are zero can be handled by treating
them as BSS pages and not putting them in the image.

I do admit it would likely take an extra pass to generate the ELF
program header if anything non-trivial like zero removal or
compression was going on. But at the same time that should also
quite dramatically reduce the per page overhead. A pure dump of ram
on x86 should take only 2 or 3 segments.

Using physical addresses is no problem in an ELF core dump. The ELF
program header has both physical and virtual addresses, and you just
fill in the physical addresses.

I keep asking and thinking about ELF images, because they are
simple, clean, extensible, and well documented, with the added bonus
that using them allows a large degree of code reuse with existing
tools.
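As a concrete sketch of the point about physical addresses: a PT_LOAD entry in the ELF program header already carries a p_paddr field, so describing a region of RAM needs nothing non-standard. Using the <elf.h> types (this is an illustration of the format, not the dump code under discussion):

```c
#include <elf.h>
#include <string.h>

/* Build a PT_LOAD program header describing one contiguous region of
 * physical memory in a core file.  p_vaddr is left zero; consumers
 * that care about physical layout read p_paddr instead. */
static Elf32_Phdr ram_segment(Elf32_Off file_off, Elf32_Addr phys,
                              Elf32_Word bytes_in_file, Elf32_Word bytes_in_mem)
{
	Elf32_Phdr ph;
	memset(&ph, 0, sizeof(ph));
	ph.p_type   = PT_LOAD;
	ph.p_offset = file_off;     /* where the data sits in the dump file */
	ph.p_paddr  = phys;         /* physical address of the region */
	ph.p_filesz = bytes_in_file;
	ph.p_memsz  = bytes_in_mem; /* > p_filesz if zero pages are elided */
	ph.p_flags  = PF_R | PF_W;
	return ph;
}
```

Eliding runs of zero pages then simply means p_memsz exceeds p_filesz, exactly the BSS treatment mentioned above.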


[snip a good description of the usefulness of the existing core dump format]

> Am cc'ing lkcd-devel on this one - there are experts who can
> add to this or answer this question better than I can.

Thanks,

Eric

2003-01-05 01:45:32

by Eric W. Biederman

[permalink] [raw]
Subject: Re: 2.5.54: Re: [PATCH][CFT] kexec (rewrite) for 2.5.52

Ed Tomlinson <[email protected]> writes:

> Just so you do not feel that kexec (.52 rewrite) is always failing in
> 2.5.54 - it not. I am using it for most of my boots here. Aside from
> an intermittant hang starting usb, its worked very well. I have not
> installed any of your other patches. (ie. 2.5.54+kexec+myownpatch)

Thanks, I think.

That is good, except that it may mean the problem is harder to reproduce,
which makes debugging harder.

If you are not doing SMP, UP-IOAPIC or > 4GB of ram the other patches
should not be an issue.

Eric

2003-01-06 05:41:11

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] kexec for 2.5.54


O.k. I have switched to using the init_mm and premapping the reboot
code buffer. I would have played with a fully private mm, but all of
the functions to allocate one are private to kernel/fork.c and so too
much of a pain to mess with right now.

The code in machine_kexec now takes no locks and is drop dead simple,
so it should be safe to call from a panic handler.

It is funny that in making identity_map_page generic, so it should work on
all architectures, and in using more prebuilt kernel functions, the code
actually got a little longer...

Linus if you would like to apply it, be my guest.

Suparna this should be a good base to build the kexec on panic code
upon. Until I see it a little more in action this is as much as I can
do to help.

And if this week goes on schedule I can do an Itanium port...

Eric

MAINTAINERS | 8
arch/i386/Kconfig | 17 +
arch/i386/kernel/Makefile | 1
arch/i386/kernel/entry.S | 1
arch/i386/kernel/machine_kexec.c | 115 ++++++
arch/i386/kernel/relocate_kernel.S | 107 ++++++
include/asm-i386/kexec.h | 23 +
include/asm-i386/unistd.h | 1
include/linux/kexec.h | 54 +++
include/linux/reboot.h | 2
kernel/Makefile | 1
kernel/kexec.c | 629 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 23 +
13 files changed, 982 insertions

diff -uNr linux-2.5.54/MAINTAINERS linux-2.5.54.x86kexec/MAINTAINERS
--- linux-2.5.54/MAINTAINERS Sat Jan 4 12:00:56 2003
+++ linux-2.5.54.x86kexec/MAINTAINERS Sat Jan 4 12:02:05 2003
@@ -1006,6 +1006,14 @@
W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
S: Maintained

+KEXEC
+P: Eric Biederman
+M: [email protected]
+M: [email protected]
+W: http://www.xmission.com/~ebiederm/files/kexec/
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.5.54/arch/i386/Kconfig linux-2.5.54.x86kexec/arch/i386/Kconfig
--- linux-2.5.54/arch/i386/Kconfig Sat Jan 4 12:00:56 2003
+++ linux-2.5.54.x86kexec/arch/i386/Kconfig Sat Jan 4 12:02:05 2003
@@ -733,6 +733,23 @@
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y

+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
endmenu


diff -uNr linux-2.5.54/arch/i386/kernel/Makefile linux-2.5.54.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.54/arch/i386/kernel/Makefile Sat Jan 4 12:00:56 2003
+++ linux-2.5.54.x86kexec/arch/i386/kernel/Makefile Sat Jan 4 12:02:05 2003
@@ -25,6 +25,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o suspend_asm.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.54/arch/i386/kernel/entry.S linux-2.5.54.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.54/arch/i386/kernel/entry.S Sat Jan 4 12:00:56 2003
+++ linux-2.5.54.x86kexec/arch/i386/kernel/entry.S Sat Jan 4 12:02:05 2003
@@ -804,6 +804,7 @@
.long sys_epoll_wait
.long sys_remap_file_pages
.long sys_set_tid_address
+ .long sys_kexec_load


.rept NR_syscalls-(.-sys_call_table)/4
diff -uNr linux-2.5.54/arch/i386/kernel/machine_kexec.c linux-2.5.54.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.54/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.54.x86kexec/arch/i386/kernel/machine_kexec.c Sun Jan 5 16:12:28 2003
@@ -0,0 +1,115 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/mmu_context.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : "=m" (curidt)
+ );
+};
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : "=m" (curgdt)
+ );
+};
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+typedef void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+ unsigned long indirection_page;
+ unsigned long reboot_code_buffer;
+ void *ptr;
+ relocate_new_kernel_t rnk;
+
+ /* switch to an mm where the reboot_code_buffer is identity mapped */
+ switch_mm(current->active_mm, &init_mm, current, smp_processor_id());
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+ reboot_code_buffer = page_to_pfn(image->reboot_code_pages) << PAGE_SHIFT;
+ indirection_page = image->head & PAGE_MASK;
+
+ /* copy it out */
+ memcpy((void *)reboot_code_buffer, relocate_new_kernel, relocate_new_kernel_size);
+
+ /* The segment registers are funny things, they are
+ * automatically loaded from a table in memory when you
+ * set them to a specific selector, but this table is never
+ * accessed again after you set the segment to a different selector.
+ *
+ * The more common model is a cache where the behind-the-scenes
+ * work is done, but which is also dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) reboot_code_buffer;
+ (*rnk)(indirection_page, reboot_code_buffer, image->start);
+}
diff -uNr linux-2.5.54/arch/i386/kernel/relocate_kernel.S linux-2.5.54.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.54/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.5.54.x86kexec/arch/i386/kernel/relocate_kernel.S Sat Jan 4 12:02:05 2003
@@ -0,0 +1,107 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
 /* Must be relocatable PIC code, callable as a C function, that once
+ * it starts cannot use the previous process's stack.
+ *
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* indirection_page */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+
+ /* Set cr4 to a known state:
+ * Setting everything to zero seems safe.
+ */
+ movl %cr4, %eax
+ andl $0, %eax
+ movl %eax, %cr4
+
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ cld
+0: /* top, read another word for the indirection page */
+ movl %ebx, %ecx
+ movl (%ebx), %ecx
+ addl $4, %ebx
+ testl $0x1, %ecx /* is it a destination page */
+ jz 1f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+1:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 1f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+1:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 1f
+ jmp 2f
+1:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+2:
+
+ /* To be certain of avoiding problems with self modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.54/include/asm-i386/kexec.h linux-2.5.54.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.54/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.54.x86kexec/include/asm-i386/kexec.h Sat Jan 4 12:02:05 2003
@@ -0,0 +1,23 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT is the maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGEOFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.54/include/asm-i386/unistd.h linux-2.5.54.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.54/include/asm-i386/unistd.h Sat Jan 4 12:01:05 2003
+++ linux-2.5.54.x86kexec/include/asm-i386/unistd.h Sat Jan 4 12:02:05 2003
@@ -262,6 +262,7 @@
#define __NR_epoll_wait 256
#define __NR_remap_file_pages 257
#define __NR_set_tid_address 258
+#define __NR_sys_kexec_load 259


/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.54/include/linux/kexec.h linux-2.5.54.x86kexec/include/linux/kexec.h
--- linux-2.5.54/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.54.x86kexec/include/linux/kexec.h Sat Jan 4 16:17:20 2003
@@ -0,0 +1,54 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <linux/list.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+#define KEXEC_SEGMENT_MAX 8
+struct kexec_segment {
+ void *buf;
+ size_t bufsz;
+ void *mem;
+ size_t memsz;
+};
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+ unsigned long offset;
+
+ unsigned long start;
+ struct page *reboot_code_pages;
+
+ unsigned long nr_segments;
+ struct kexec_segment segment[KEXEC_SEGMENT_MAX+1];
+
+ struct list_head dest_pages;
+ struct list_head unuseable_pages;
+};
+
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+ struct kexec_segment *segments, unsigned long flags);
+extern struct kimage *kexec_image;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.54/include/linux/reboot.h linux-2.5.54.x86kexec/include/linux/reboot.h
--- linux-2.5.54/include/linux/reboot.h Thu Dec 12 07:41:37 2002
+++ linux-2.5.54.x86kexec/include/linux/reboot.h Sat Jan 4 12:02:05 2003
@@ -21,6 +21,7 @@
* POWER_OFF Stop OS and remove all power from system, if possible.
* RESTART2 Restart system using given command string.
* SW_SUSPEND Suspend system using Software Suspend if compiled in
+ * KEXEC Restart the system using a different kernel.
*/

#define LINUX_REBOOT_CMD_RESTART 0x01234567
@@ -30,6 +31,7 @@
#define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDC
#define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4
#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC 0x45584543


#ifdef __KERNEL__
diff -uNr linux-2.5.54/kernel/Makefile linux-2.5.54.x86kexec/kernel/Makefile
--- linux-2.5.54/kernel/Makefile Mon Dec 16 02:19:15 2002
+++ linux-2.5.54.x86kexec/kernel/Makefile Sat Jan 4 12:02:05 2003
@@ -21,6 +21,7 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_COMPAT) += compat.o

ifneq ($(CONFIG_IA64),y)
diff -uNr linux-2.5.54/kernel/kexec.c linux-2.5.54.x86kexec/kernel/kexec.c
--- linux-2.5.54/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.54.x86kexec/kernel/kexec.c Sun Jan 5 21:54:52 2003
@@ -0,0 +1,629 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/highmem.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+#include <asm/system.h>
+
+/* When kexec transitions to the new kernel there is a one to one
+ * mapping between physical and virtual addresses. On processors
+ * where you can disable the MMU this is trivial, and easy. For
+ * others it is still a simple predictable page table to setup.
+ *
+ * In that environment kexec copies the new kernel to its final
+ * resting place. This means I can only support memory whose
+ * physical address can fit in an unsigned long. In particular
+ * addresses where (pfn << PAGE_SHIFT) > ULONG_MAX cannot be handled.
+ * If the assembly stub has more restrictive requirements
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DEST_MEMORY_LIMIT can be
+ * defined more restrictively in <asm/kexec.h>.
+ *
+ * The code for the transition from the current kernel to the
+ * new kernel is placed in the reboot_code_buffer, whose size
+ * is given by KEXEC_REBOOT_CODE_SIZE. In the best case only a single
+ * page of memory is necessary, but some architectures require more.
+ * Because this memory must be identity mapped in the transition from
+ * virtual to physical addresses it must live in the range
+ * 0 - TASK_SIZE, as only the user space mappings are arbitrarily
+ * modifiable.
+ *
+ * The assembly stub in the reboot code buffer is passed a linked list
+ * of descriptor pages detailing the source pages of the new kernel,
+ * and the destination addresses of those source pages. As this data
+ * structure is not used in the context of the current OS, it must
+ * be self contained.
+ *
+ * The code has been made to work with highmem pages and will use a
+ * destination page in its final resting place (if it happens
+ * to allocate it). The end product of this is that most of the
+ * physical address space, and most of ram can be used.
+ *
+ * Future directions include:
+ * - allocating a page table with the reboot code buffer identity
+ * mapped, to simplify machine_kexec and make kexec_on_panic more
+ * reliable.
+ * - allocating the pages for a page table for machines that cannot
+ * disable their MMUs. (Hammer, Alpha...)
+ */
+
+/* KIMAGE_NO_DEST is an impossible destination address, for
+ * allocating pages whose destination address we do not care about.
+ */
+#define KIMAGE_NO_DEST (-1UL)
+
+static int kimage_is_destination_range(
+ struct kimage *image, unsigned long start, unsigned long end);
+static struct page *kimage_alloc_reboot_code_pages(struct kimage *image);
+static struct page *kimage_alloc_page(struct kimage *image, unsigned int gfp_mask, unsigned long dest);
+
+
+static int kimage_alloc(struct kimage **rimage,
+ unsigned long nr_segments, struct kexec_segment *segments)
+{
+ int result;
+ struct kimage *image;
+ size_t segment_bytes;
+ struct page *reboot_pages;
+ unsigned long i;
+
+ /* Allocate a controlling structure */
+ result = -ENOMEM;
+ image = kmalloc(sizeof(*image), GFP_KERNEL);
+ if (!image) {
+ goto out;
+ }
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+
+ /* Initialize the list of destination pages */
+ INIT_LIST_HEAD(&image->dest_pages);
+
+ /* Initialize the list of unuseable pages */
+ INIT_LIST_HEAD(&image->unuseable_pages);
+
+ /* Read in the segments */
+ image->nr_segments = nr_segments;
+ segment_bytes = nr_segments * sizeof(*segments);
+ result = copy_from_user(image->segment, segments, segment_bytes);
+ if (result)
+ goto out;
+
+ /* Verify we have good destination addresses. The caller is
+ * responsible for making certain we don't attempt to load
+ * the new image into invalid or reserved areas of RAM. This
+ * just verifies it is an address we can use.
+ */
+ result = -EADDRNOTAVAIL;
+ for(i = 0; i < nr_segments; i++) {
+ unsigned long mend;
+ mend = ((unsigned long)(image->segment[i].mem)) +
+ image->segment[i].memsz;
+ if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
+ goto out;
+ }
+
+ /* Find a location for the reboot code buffer, and add it to
+ * the vector of segments so that its pages will also be
+ * counted as destination pages.
+ */
+ result = -ENOMEM;
+ reboot_pages = kimage_alloc_reboot_code_pages(image);
+ if (!reboot_pages) {
+ printk(KERN_ERR "Could not allocate reboot_code_buffer\n");
+ goto out;
+ }
+ image->reboot_code_pages = reboot_pages;
+ image->segment[nr_segments].buf = 0;
+ image->segment[nr_segments].bufsz = 0;
+ image->segment[nr_segments].mem = (void *)(page_to_pfn(reboot_pages) << PAGE_SHIFT);
+ image->segment[nr_segments].memsz = KEXEC_REBOOT_CODE_SIZE;
+ image->nr_segments++;
+
+ result = 0;
+ out:
+ if (result == 0) {
+ *rimage = image;
+ } else {
+ kfree(image);
+ }
+ return result;
+}
+
+static int kimage_is_destination_range(
+ struct kimage *image, unsigned long start, unsigned long end)
+{
+ unsigned long i;
+ for(i = 0; i < image->nr_segments; i++) {
+ unsigned long mstart, mend;
+ mstart = (unsigned long)image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz;
+ if ((end > mstart) && (start < mend)) {
+ return 1;
+ }
+ }
+ return 0;
+}
+
+#ifdef CONFIG_MMU
+static int identity_map_pages(struct page *pages, int order)
+{
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int error;
+ mm = &init_mm;
+ vma = 0;
+
+ down_write(&mm->mmap_sem);
+ error = -ENOMEM;
+ vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+ if (!vma) {
+ goto out;
+ }
+
+ memset(vma, 0, sizeof(*vma));
+ vma->vm_mm = mm;
+ vma->vm_start = page_to_pfn(pages) << PAGE_SHIFT;
+ vma->vm_end = vma->vm_start + (1 << (order + PAGE_SHIFT));
+ vma->vm_ops = 0;
+ vma->vm_flags = VM_SHARED \
+ | VM_READ | VM_WRITE | VM_EXEC \
+ | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC \
+ | VM_DONTCOPY | VM_RESERVED;
+ vma->vm_page_prot = protection_map[vma->vm_flags & 0xf];
+ vma->vm_file = NULL;
+ vma->vm_private_data = NULL;
+ INIT_LIST_HEAD(&vma->shared);
+ insert_vm_struct(mm, vma);
+
+ error = remap_page_range(vma, vma->vm_start, vma->vm_start,
+ vma->vm_end - vma->vm_start, vma->vm_page_prot);
+ if (error) {
+ goto out;
+ }
+
+ error = 0;
+ out:
+ if (error && vma) {
+ kmem_cache_free(vm_area_cachep, vma);
+ vma = 0;
+ }
+ up_write(&mm->mmap_sem);
+
+ return error;
+}
+#else
+#define identity_map_pages(pages, order) 0
+#endif
+
+struct page *kimage_alloc_reboot_code_pages(struct kimage *image)
+{
+ /* The reboot code buffer is special. It is the only set of
+ * pages that must be allocated in their final resting place,
+ * and the only set of pages whose final resting place we can
+ * pick.
+ *
+ * At worst this runs in O(N) of the image size.
+ */
+ struct list_head extra_pages, *pos, *next;
+ struct page *pages;
+ unsigned long addr;
+ int order, count;
+ order = get_order(KEXEC_REBOOT_CODE_SIZE);
+ count = 1 << order;
+ INIT_LIST_HEAD(&extra_pages);
+ do {
+ int i;
+ pages = alloc_pages(GFP_HIGHUSER, order);
+ if (!pages)
+ break;
+ for(i = 0; i < count; i++) {
+ SetPageReserved(pages +i);
+ }
+ addr = page_to_pfn(pages) << PAGE_SHIFT;
+ if ((page_to_pfn(pages) >= (TASK_SIZE >> PAGE_SHIFT)) ||
+ kimage_is_destination_range(image, addr, addr + KEXEC_REBOOT_CODE_SIZE)) {
+ list_add(&pages->list, &extra_pages);
+ pages = 0;
+ }
+ } while(!pages);
+ if (pages) {
+ int i, result;
+ result = identity_map_pages(pages, order);
+ if (result < 0) {
+ list_add(&pages->list, &extra_pages);
+ pages = 0;
+ }
+ }
+ /* If I could convert a multi-page allocation into a bunch of
+ * single page allocations I could add these pages to
+ * image->dest_pages. For now it is simpler to just free the
+ * pages again.
+ */
+ list_for_each_safe(pos, next, &extra_pages) {
+ struct page *page;
+ int i;
+ page = list_entry(pos, struct page, list);
+ for(i = 0; i < count; i++) {
+ ClearPageReserved(page + i);
+ }
+ list_del(&page->list);
+ __free_pages(page, order);
+ }
+ return pages;
+}
+
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (image->offset != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ struct page *page;
+ page = kimage_alloc_page(image, GFP_KERNEL, KIMAGE_NO_DEST);
+ if (!page) {
+ return -ENOMEM;
+ }
+ ind_page = page_address(page);
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ image->offset = 0;
+ return 0;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+ destination &= PAGE_MASK;
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+ page &= PAGE_MASK;
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static void kimage_free_extra_pages(struct kimage *image)
+{
+ /* Walk through and free any extra destination pages I may have */
+ struct list_head *pos, *next;
+ list_for_each_safe(pos, next, &image->dest_pages) {
+ struct page *page;
+ page = list_entry(pos, struct page, list);
+ list_del(&page->list);
+ ClearPageReserved(page);
+ __free_page(page);
+ }
+ /* Walk through and free any unuseable pages I have cached */
+ list_for_each_safe(pos, next, &image->unuseable_pages) {
+ struct page *page;
+ page = list_entry(pos, struct page, list);
+ list_del(&page->list);
+ ClearPageReserved(page);
+ __free_page(page);
+ }
+
+}
+static int kimage_terminate(struct kimage *image)
+{
+ int result;
+ result = kimage_add_entry(image, IND_DONE);
+ if (result == 0) {
+ /* Point at the terminating element */
+ image->entry--;
+ kimage_free_extra_pages(image);
+ }
+ return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+ int i, count, order;
+ if (!image)
+ return;
+ kimage_free_extra_pages(image);
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+ }
+ }
+ order = get_order(KEXEC_REBOOT_CODE_SIZE);
+ count = 1 << order;
+ do_munmap(&init_mm,
+ page_to_pfn(image->reboot_code_pages) << PAGE_SHIFT,
+ count << PAGE_SHIFT);
+ for(i = 0; i < count; i++) {
+ ClearPageReserved(image->reboot_code_pages + i);
+ }
+ __free_pages(image->reboot_code_pages, order);
+ kfree(image);
+}
+
+static kimage_entry_t *kimage_dst_used(struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static struct page *kimage_alloc_page(struct kimage *image, unsigned int gfp_mask, unsigned long destination)
+{
+ /* Here we implement safeguards to ensure that a source page
+ * is not copied to its destination page before the data on
+ * the destination page is no longer useful.
+ *
+ * To do this we maintain the invariant that a source page is
+ * either its own destination page, or it is not a
+ * destination page at all.
+ *
+ * That is slightly stronger than required, but the proof
+ * that no problems occur is trivial, and the
+ * implementation is simple to verify.
+ *
+ * When allocating all pages normally this algorithm will run
+ * in O(N) time, but in the worst case it will run in O(N^2)
+ * time. If the runtime is a problem the data structures can
+ * be fixed.
+ */
+ struct page *page;
+ unsigned long addr;
+
+ /* Walk through the list of destination pages, and see if I
+ * have a match.
+ */
+ list_for_each_entry(page, &image->dest_pages, list) {
+ addr = page_to_pfn(page) << PAGE_SHIFT;
+ if (addr == destination) {
+ list_del(&page->list);
+ return page;
+ }
+ }
+ page = 0;
+ while(1) {
+ kimage_entry_t *old;
+ /* Allocate a page, if we run out of memory give up */
+ page = alloc_page(gfp_mask);
+ if (!page) {
+ return 0;
+ }
+ SetPageReserved(page);
+ /* If the page cannot be used, file it away */
+ if (page_to_pfn(page) > (KEXEC_SOURCE_MEMORY_LIMIT >> PAGE_SHIFT)) {
+ list_add(&page->list, &image->unuseable_pages);
+ continue;
+ }
+ addr = page_to_pfn(page) << PAGE_SHIFT;
+
+ /* If it is the destination page we want, use it */
+ if (addr == destination)
+ break;
+
+ /* If the page is not a destination page use it */
+ if (!kimage_is_destination_range(image, addr, addr + PAGE_SIZE))
+ break;
+
+ /* I know that the page is someone's destination page.
+ * See if there is already a source page for this
+ * destination page. And if so swap the source pages.
+ */
+ old = kimage_dst_used(image, addr);
+ if (old) {
+ /* If so move it */
+ unsigned long old_addr;
+ struct page *old_page;
+
+ old_addr = *old & PAGE_MASK;
+ old_page = pfn_to_page(old_addr >> PAGE_SHIFT);
+ copy_highpage(page, old_page);
+ *old = addr | (*old & ~PAGE_MASK);
+
+ /* The old page I have found cannot be a
+ * destination page, so return it.
+ */
+ addr = old_addr;
+ page = old_page;
+ break;
+ }
+ else {
+ /* Place the page on the destination list; I
+ * will use it later.
+ */
+ list_add(&page->list, &image->dest_pages);
+ }
+ }
+ return page;
+}
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long mstart;
+ int result;
+ unsigned long offset;
+ unsigned long offset_end;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ mstart = (unsigned long)segment->mem;
+
+ offset_end = segment->memsz;
+
+ result = kimage_set_destination(image, mstart);
+ if (result < 0) {
+ goto out;
+ }
+ for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) {
+ struct page *page;
+ char *ptr;
+ size_t size, leader;
+ page = kimage_alloc_page(image, GFP_HIGHUSER, mstart + offset);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, page_to_pfn(page) << PAGE_SHIFT);
+ if (result < 0) {
+ goto out;
+ }
+ ptr = kmap(page);
+ if (segment->bufsz < offset) {
+ /* We are past the end; zero the whole page */
+ memset(ptr, 0, PAGE_SIZE);
+ kunmap(page);
+ continue;
+ }
+ size = PAGE_SIZE;
+ leader = 0;
+ if (offset == 0) {
+ leader = mstart & ~PAGE_MASK;
+ }
+ if (leader) {
+ /* We are on the first page; zero the unused portion */
+ memset(ptr, 0, leader);
+ size -= leader;
+ ptr += leader;
+ }
+ if (size > (segment->bufsz - offset)) {
+ size = segment->bufsz - offset;
+ }
+ if (size < (PAGE_SIZE - leader)) {
+ /* zero the trailing part of the page */
+ memset(ptr + size, 0, (PAGE_SIZE - leader) - size);
+ }
+ result = copy_from_user(ptr, buf + offset, size);
+ kunmap(page);
+ if (result) {
+ result = (result < 0)?result : -EIO;
+ goto out;
+ }
+ }
+ out:
+ return result;
+}
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down, preventing on-going DMAs and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number,
+ * copies the image to its final destination, and
+ * jumps into the image at entry.
+ *
+ * kexec does not sync or unmount filesystems, so if you need
+ * that to happen you must do it yourself.
+ */
+struct kimage *kexec_image = 0;
+
+asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
+ struct kexec_segment *segments, unsigned long flags)
+{
+ struct kimage *image;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* In case we need just a little bit of special behavior for
+ * reboot on panic
+ */
+ if (flags != 0)
+ return -EINVAL;
+
+ if (nr_segments > KEXEC_SEGMENT_MAX)
+ return -EINVAL;
+ image = 0;
+
+ result = 0;
+ if (nr_segments > 0) {
+ unsigned long i;
+ result = kimage_alloc(&image, nr_segments, segments);
+ if (result) {
+ goto out;
+ }
+ image->start = entry;
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &segments[i]);
+ if (result) {
+ goto out;
+ }
+ }
+ result = kimage_terminate(image);
+ if (result) {
+ goto out;
+ }
+ }
+
+ image = xchg(&kexec_image, image);
+
+ out:
+ kimage_free(image);
+ return result;
+}
diff -uNr linux-2.5.54/kernel/sys.c linux-2.5.54.x86kexec/kernel/sys.c
--- linux-2.5.54/kernel/sys.c Thu Dec 12 07:41:37 2002
+++ linux-2.5.54.x86kexec/kernel/sys.c Sat Jan 4 12:02:05 2003
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/times.h>
@@ -207,6 +208,7 @@
cond_syscall(sys_lookup_dcookie)
cond_syscall(sys_swapon)
cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
cond_syscall(sys_init_module)
cond_syscall(sys_delete_module)

@@ -419,6 +421,27 @@
machine_restart(buffer);
break;

+#ifdef CONFIG_KEXEC
+ case LINUX_REBOOT_CMD_KEXEC:
+ {
+ struct kimage *image;
+ if (arg) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ image = xchg(&kexec_image, 0);
+ if (!image) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_running = 0;
+ device_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_kexec(image);
+ break;
+ }
+#endif
#ifdef CONFIG_SOFTWARE_SUSPEND
case LINUX_REBOOT_CMD_SW_SUSPEND:
if (!software_suspend_enabled) {






2003-01-07 22:37:18

by Andy Pfiffer

Subject: Re: [PATCH] kexec for 2.5.54

On Sun, 2003-01-05 at 21:48, Eric W. Biederman wrote:
>
> O.k. I have switched to using the init_mm and premapping the reboot
> code buffer.
<snip>
> The code in machine_kexec now takes no locks and is drop dead simple,
> so it should be safe to call from a panic handler.

Eric,

The patch applies cleanly to 2.5.54 for me. Current behavior matches
the version of kexec for 2.5.48 that I carried forward into 2.5.52 and
2.5.54 (and kexec_tools 1.8):

- the kexec-ed kernel starts rebooting and finds all of my system's
memory, so the generic kexec machinery is working as expected.

- the kexec-ed kernel hangs while calibrating the delay loop. The list
of kernels I attempted to reboot includes permutations of 2.5.48 +/-
kexec, 2.5.52 +/- kexec(from 2.5.48), 2.5.54, 2.5.54 + kexec(from
2.5.48), and 2.5.54 + kexec (recent patch from you).

- Whatever it is that SuSE supplies in 8.0 (2.4.x +) panics near/during
frame buffer initialization when rebooted via kexec for 2.5.54:
.
.
.
Initializing CPU#0
Detected 799.665 MHz Processor
Console: colour VGA+ 80x25
invalid operand: 0000
CPU: 0
EIP: 0010[<00000007>] Not tainted
EFLAGS 00010002
.
.
.

Something has definitely changed in the 2.5.5x series, and the symptoms
indicate that at least the clock interrupt is not being received.

kexec for 2.5.48 worked for me (with some limits), so I should be able
to walk the tree forwards and poke at it some more.

For those that have had success w/ recent vintage kernels and kexec (>
2.5.48), could I get a roll-call of your machine's hardware? Uniproc,
SMP, AGP, chipset, BIOS version, that kind of thing. lspci -v,
cat /proc/cpuinfo, and maybe the boot-up messages would all be
appreciated.

Regards,
Andy


2003-01-07 22:53:37

by Dave Hansen

Subject: Re: [PATCH] kexec for 2.5.54

... taking poor Linus off the cc list
Andy Pfiffer wrote:
> For those that have had success w/ recent vintage kernels and kexec (>
> 2.5.48), could I get a roll-call of your machine's hardware? Uniproc,
> SMP, AGP, chipset, BIOS version, that kind of thing. lspci -v,
> cat /proc/cpuinfo, and maybe the boot-up messages would all be
> appreciated.

I've had it work on 2 IBM x86 boxes.
4/8-way SMP
1/4/16 GB RAM
no AGP
Intel Profusion Chipset and some funky IBM one

It failed on the NUMA-Q's I tried it on. I haven't investigated any
more thoroughly.

If you want more details, let me know. But, I've never seen your
"Calibrating delay loop..." problem. The last time I saw problems
there was when I broke the interrupt stack patches. But, since those
aren't in mainline, you shouldn't be seeing it.
--
Dave Hansen
[email protected]

2003-01-07 23:03:23

by Martin J. Bligh

Subject: Re: [PATCH] kexec for 2.5.54

> ... taking poor Linus off the cc list
> Andy Pfiffer wrote:
>> For those that have had success w/ recent vintage kernels and kexec (>
>> 2.5.48), could I get a roll-call of your machine's hardware? Uniproc,
>> SMP, AGP, chipset, BIOS version, that kind of thing. lspci -v,
>> cat /proc/cpuinfo, and maybe the boot-up messages would all be
>> appreciated.
>
> I've had it work on 2 IBM x86 boxes.
> 4/8-way SMP
> 1/4/16 GB RAM
> no AGP
> Intel Profusion Chipset and some funky IBM one
>
> It failed on the NUMA-Q's I tried it on. I haven't investigated any more thoroughly.
>
> If you want more details, let me know. But, I've never seen your "Calibrating delay loop..." problem. The last time I saw problems there was when I broke the interrupt stack patches. But, since those aren't in mainline, you shouldn't be seeing it.


Last time I saw calibrating delay loop problems, it just meant the other CPUs
weren't getting / acting upon IPIs. I might expect that on NUMA-Q, but the
INIT, INIT, STARTUP sequence on normal machines should kick the remote proc
pretty damned hard and reset it. You might want to add more APIC resetting
things (I think there are some in there that only NUMA-Q does right now ..
try turning those on).

M.

2003-01-15 19:34:29

by Andy Pfiffer

Subject: [2.5.58][KEXEC] Success! (using 2.5.54 version + kexec tools 1.8)

Eric,

Success!

I've been carrying your kexec for 2.5.54 patch (and the hwfixes patch)
forward through subsequent kernels, and have had good luck (kexec works
fine for me) with them in 2.5.58 on my troublesome system (UP, P3-800,
640MB, Adaptec AIC7XXX SCSI).

I haven't had to change a thing with kexec. For reference, the code I'm
currently using is downloadable from OSDL's patch-manager:

The "kexec-hwfixes" patch for 2.5.58:
http://www.osdl.org/cgi-bin/plm?module=patch_info&patch_id=1432

kexec patch for 2.5.58 (from the 2.5.54 version):
http://www.osdl.org/cgi-bin/plm?module=patch_info&patch_id=1424

Regards,
Andy