kexec is a system call that allows you to load another kernel from the
currently executing Linux kernel. The current implementation has only
been tested, and had the kinks worked out, on x86, but the generic
code should work on any architecture.
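To give a feel for what the generic code hands to the architecture
code: the image is described by a list of tagged entries
(IND_DESTINATION, IND_INDIRECTION, IND_DONE and IND_SOURCE in
include/linux/kexec.h), and relocate_new_kernel walks that list,
copying source pages to their destinations. What follows is only a
user-space sketch of that walk, not code from the patch. The names
sim_relocate and SIM_PAGE_SIZE are made up for illustration, a byte
array stands in for physical memory, and entries hold offsets into
that array rather than physical addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Tiny stand-in "page" so a demo fits in a few bytes. It must be at
 * least 16 so the four flag bits below stay clear of the address bits. */
#define SIM_PAGE_SIZE 16u
#define SIM_PAGE_MASK (~(uintptr_t)(SIM_PAGE_SIZE - 1))

/* Flag values copied from include/linux/kexec.h in the patch. */
#define IND_DESTINATION 0x1
#define IND_INDIRECTION 0x2
#define IND_DONE        0x4
#define IND_SOURCE      0x8

typedef uintptr_t kimage_entry_t;

/* Hypothetical simulation of the copy loop in relocate_new_kernel.
 * `mem` plays the role of physical memory and `list` points at the
 * first entry. Flags are tested in the same order as the assembly.
 * Returns the number of pages copied. The list must end in IND_DONE. */
unsigned sim_relocate(unsigned char *mem, const void *list)
{
	const unsigned char *p = list;
	uintptr_t dest = 0;
	unsigned copied = 0;

	assert(mem != NULL && list != NULL);
	for (;;) {
		kimage_entry_t entry;
		uintptr_t page;

		/* memcpy avoids alignment assumptions about the list */
		memcpy(&entry, p, sizeof entry);
		p += sizeof entry;
		page = entry & SIM_PAGE_MASK;

		if (entry & IND_DESTINATION) {
			dest = page;            /* where the next copy lands */
		} else if (entry & IND_INDIRECTION) {
			p = mem + page;         /* continue in another list page */
		} else if (entry & IND_DONE) {
			return copied;
		} else if (entry & IND_SOURCE) {
			memcpy(mem + dest, mem + page, SIM_PAGE_SIZE);
			dest += SIM_PAGE_SIZE;  /* destinations are consecutive */
			copied++;
		}
		/* entries with no flag set are skipped, as in the assembly */
	}
}
```

Like the assembly, the sketch has no stopping condition other than an
IND_DONE entry, so a list without one would run off the end.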
Some machines have BIOSes that are either extremely slow to reboot
or cannot reliably perform a reboot. In that case kexec may be the
only way to reboot in a reliable, timely manner.
The patch is archived at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-2.5.43.bk2.x86kexec.diff
A compatible user space is at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.2.tar.gz
This code boots either a static ELF executable or a bzImage.
A kernel reformatter that makes images that seem to boot more reliably is at:
ftp://ftp.lnxi.com/pub/mkelfImage/mkelfImage-1.17.tar.gz
In bug reports please include the serial console output of
kexec kexec_test. kexec_test exercises most of the interesting code
paths that are needed to load a kernel, with lots of debugging print
statements, so hangs can easily be detected.
I have been using this technique for the last several years and the
worst of the kinks have been worked out. But it is still easy
to get on the wrong side of a BIOS. For stability the remaining work
is to ensure that all of the kernel drivers properly shut down
their hardware, and to dig through weird BIOS incompatibilities.
The system call signature should not change in the future.
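As an illustration of the user-visible part of the interface, here is
a sketch of filling in one struct kexec_segment (layout taken from
include/linux/kexec.h in the patch). It assumes that an array of these
is what user space hands to the system call. The helper
describe_segment and its page-rounding policy are hypothetical and are
not taken from kexec-tools, and the per-field comments are a reading of
the field names. The alignment check reflects that
kimage_set_destination masks the destination with PAGE_MASK. The
system call itself is left out on purpose, since it would replace the
running kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field layout copied from include/linux/kexec.h in the patch. */
struct kexec_segment {
	void  *buf;    /* data in the caller's address space (assumed) */
	size_t bufsz;  /* number of bytes at buf (assumed) */
	void  *mem;    /* destination address for the data (assumed) */
	size_t memsz;  /* bytes to reserve at the destination (assumed) */
};

/* Hypothetical helper: describe one segment, rounding the destination
 * size up to whole pages. Returns 0 on success, -1 on a bad argument. */
int describe_segment(struct kexec_segment *seg, void *buf, size_t bufsz,
		     uintptr_t dest, size_t page_size)
{
	assert(seg != NULL);
	if (page_size == 0 || (page_size & (page_size - 1)) != 0)
		return -1;      /* page size must be a power of two */
	if (dest & (page_size - 1))
		return -1;      /* the kernel would silently mask this off */

	seg->buf = buf;
	seg->bufsz = bufsz;
	seg->mem = (void *)dest;
	seg->memsz = (bufsz + page_size - 1) & ~(page_size - 1);
	return 0;
}
```

Rejecting an unaligned destination up front, rather than letting it be
rounded down, is a design choice of this sketch only.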
Please test. Unless a problem with the interface is spotted I will start
sending this to Linus for kernel inclusion.
Eric
MAINTAINERS | 7
arch/i386/Config.help | 9
arch/i386/config.in | 3
arch/i386/kernel/Makefile | 1
arch/i386/kernel/apic.c | 51 +++
arch/i386/kernel/dmi_scan.c | 27 -
arch/i386/kernel/entry.S | 1
arch/i386/kernel/i8259.c | 24 +
arch/i386/kernel/io_apic.c | 2
arch/i386/kernel/machine_kexec.c | 143 ++++++++
arch/i386/kernel/reboot.c | 43 --
arch/i386/kernel/relocate_kernel.S | 99 +++++
arch/i386/kernel/smp.c | 24 +
include/asm-i386/apic.h | 3
include/asm-i386/apicdef.h | 1
include/asm-i386/kexec.h | 25 +
include/asm-i386/unistd.h | 2
include/linux/kexec.h | 49 ++
kernel/Makefile | 3
kernel/kexec.c | 624 +++++++++++++++++++++++++++++++++++++
kernel/sys.c | 61 +++
21 files changed, 1134 insertions, 68 deletions
diff -uNr linux-2.5.43/MAINTAINERS linux-2.5.43.x86kexec/MAINTAINERS
--- linux-2.5.43/MAINTAINERS Fri Oct 18 11:59:13 2002
+++ linux-2.5.43.x86kexec/MAINTAINERS Fri Oct 18 12:08:38 2002
@@ -934,6 +934,13 @@
W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
S: Maintained
+KEXEC
+P: Eric Biederman
+M: [email protected]
+M: [email protected]
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.5.43/arch/i386/Config.help linux-2.5.43.x86kexec/arch/i386/Config.help
--- linux-2.5.43/arch/i386/Config.help Fri Oct 18 11:59:14 2002
+++ linux-2.5.43.x86kexec/arch/i386/Config.help Fri Oct 18 12:08:38 2002
@@ -417,6 +417,20 @@
you have use for it; the module is called binfmt_misc.o. If you
don't know what to answer at this point, say Y.
+CONFIG_KEXEC
+ kexec is a system call that implements kernel level exec support,
+ i.e. the ability to boot linux from linux. The kexec system call
+ allows you to replace your current kernel with another kernel,
+ not necessarily linux.
+
+ A known caveat is that for this to be fully useful all of the
+ devices must be shut down. One way to implement this is to build
+ all devices as modules and remove them before rebooting.
+
+ You will probably want to dig up the elfboottools package as this
+ has the first implementation of a user client that uses this kernel
+ interface.
+
CONFIG_M386
This is the processor type of your CPU. This information is used for
optimizing purposes. In order to compile a kernel that can run on
diff -uNr linux-2.5.43/arch/i386/config.in linux-2.5.43.x86kexec/arch/i386/config.in
--- linux-2.5.43/arch/i386/config.in Fri Oct 18 11:59:14 2002
+++ linux-2.5.43.x86kexec/arch/i386/config.in Fri Oct 18 12:08:38 2002
@@ -243,6 +243,9 @@
fi
fi
+if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
+ bool 'Kernel execing kernel support' CONFIG_KEXEC
+fi
endmenu
mainmenu_option next_comment
diff -uNr linux-2.5.43/arch/i386/kernel/Makefile linux-2.5.43.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.43/arch/i386/kernel/Makefile Fri Oct 18 11:59:14 2002
+++ linux-2.5.43.x86kexec/arch/i386/kernel/Makefile Fri Oct 18 12:08:38 2002
@@ -25,6 +25,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.43/arch/i386/kernel/apic.c linux-2.5.43.x86kexec/arch/i386/kernel/apic.c
--- linux-2.5.43/arch/i386/kernel/apic.c Fri Oct 18 11:59:14 2002
+++ linux-2.5.43.x86kexec/arch/i386/kernel/apic.c Fri Oct 18 12:08:38 2002
@@ -23,6 +23,7 @@
#include <linux/interrupt.h>
#include <linux/mc146818rtc.h>
#include <linux/kernel_stat.h>
+#include <linux/reboot.h>
#include <asm/atomic.h>
#include <asm/smp.h>
@@ -154,6 +155,36 @@
outb(0x70, 0x22);
outb(0x00, 0x23);
}
+ else {
+ /* Go back to Virtual Wire compatibility mode */
+ unsigned long value;
+
+ /* For the spurious interrupt use vector F, and enable it */
+ value = apic_read(APIC_SPIV);
+ value &= ~APIC_VECTOR_MASK;
+ value |= APIC_SPIV_APIC_ENABLED;
+ value |= 0xf;
+ apic_write_around(APIC_SPIV, value);
+
+ /* For LVT0 make it edge triggered, active high, external and enabled */
+ value = apic_read(APIC_LVT0);
+ value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING |
+ APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
+ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED );
+ value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
+ value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXINT);
+ apic_write_around(APIC_LVT0, value);
+
+ /* For LVT1 make it edge triggered, active high, nmi and enabled */
+ value = apic_read(APIC_LVT1);
+ value &= ~(
+ APIC_MODE_MASK | APIC_SEND_PENDING |
+ APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
+ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED);
+ value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
+ value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI);
+ apic_write_around(APIC_LVT1, value);
+ }
}
void disable_local_APIC(void)
@@ -1128,6 +1159,26 @@
printk (KERN_INFO "APIC error on CPU%d: %02lx(%02lx)\n",
smp_processor_id(), v , v1);
irq_exit();
+}
+
+void stop_apics(void)
+{
+ /* By resetting the APIC's we disable the nmi watchdog */
+#if CONFIG_SMP
+ /*
+ * Stop all CPUs and turn off local APICs and the IO-APIC, so
+ * other OSs see a clean IRQ state.
+ */
+ smp_send_stop();
+#else
+ disable_local_APIC();
+#endif
+#if defined(CONFIG_X86_IO_APIC)
+ if (smp_found_config) {
+ disable_IO_APIC();
+ }
+#endif
+ disconnect_bsp_APIC();
}
/*
diff -uNr linux-2.5.43/arch/i386/kernel/dmi_scan.c linux-2.5.43.x86kexec/arch/i386/kernel/dmi_scan.c
--- linux-2.5.43/arch/i386/kernel/dmi_scan.c Fri Oct 18 11:59:14 2002
+++ linux-2.5.43.x86kexec/arch/i386/kernel/dmi_scan.c Fri Oct 18 12:08:38 2002
@@ -214,31 +214,6 @@
return 0;
}
-/*
- * Some machines require the "reboot=s" commandline option, this quirk makes that automatic.
- */
-static __init int set_smp_reboot(struct dmi_blacklist *d)
-{
-#ifdef CONFIG_SMP
- extern int reboot_smp;
- if (reboot_smp == 0)
- {
- reboot_smp = 1;
- printk(KERN_INFO "%s series board detected. Selecting SMP-method for reboots.\n", d->ident);
- }
-#endif
- return 0;
-}
-
-/*
- * Some machines require the "reboot=b,s" commandline option, this quirk makes that automatic.
- */
-static __init int set_smp_bios_reboot(struct dmi_blacklist *d)
-{
- set_smp_reboot(d);
- set_bios_reboot(d);
- return 0;
-}
/*
* Some bioses have a broken protected mode poweroff and need to use realmode
@@ -529,7 +504,7 @@
MATCH(DMI_BIOS_VERSION, "4.60 PGMA"),
MATCH(DMI_BIOS_DATE, "134526184"), NO_MATCH
} },
- { set_smp_bios_reboot, "Dell PowerEdge 1300", { /* Handle problems with rebooting on Dell 1300's */
+ { set_bios_reboot, "Dell PowerEdge 1300", { /* Handle problems with rebooting on Dell 1300's */
MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
MATCH(DMI_PRODUCT_NAME, "PowerEdge 1300/"),
NO_MATCH, NO_MATCH
diff -uNr linux-2.5.43/arch/i386/kernel/entry.S linux-2.5.43.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.43/arch/i386/kernel/entry.S Fri Oct 18 11:59:14 2002
+++ linux-2.5.43.x86kexec/arch/i386/kernel/entry.S Fri Oct 18 12:11:10 2002
@@ -737,6 +737,7 @@
.long sys_free_hugepages
.long sys_exit_group
.long sys_lookup_dcookie
+ .long sys_kexec
.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
diff -uNr linux-2.5.43/arch/i386/kernel/i8259.c linux-2.5.43.x86kexec/arch/i386/kernel/i8259.c
--- linux-2.5.43/arch/i386/kernel/i8259.c Fri Oct 11 22:22:19 2002
+++ linux-2.5.43.x86kexec/arch/i386/kernel/i8259.c Fri Oct 18 12:08:38 2002
@@ -246,10 +246,34 @@
return 0;
}
+static void i8259A_remove(struct device *dev)
+{
+ /* Restore the i8259A to its legacy DOS setup.
+ * The kernel won't be using it any more, and it
+ * just might make reboots, and kexec type applications
+ * more stable.
+ */
+ outb(0xff, 0x21); /* mask all of 8259A-1 */
+ outb(0xff, 0xA1); /* mask all of 8259A-2 */
+
+ outb_p(0x11, 0x20); /* ICW1: select 8259A-1 init */
+ outb_p(0x08, 0x21); /* ICW2: 8259A-1 IR0-7 mapped to 0x8-0xf */
+ outb_p(0x01, 0x21); /* Normal 8086 auto EOI mode */
+
+ outb_p(0x11, 0xA0); /* ICW1: select 8259A-2 init */
+ outb_p(0x70, 0xA1); /* ICW2: 8259A-2 IR0-7 mapped to 0x70-0x77 */
+ outb_p(0x01, 0xA1); /* Normal 8086 auto EOI mode */
+
+ udelay(100); /* wait for 8259A to initialize */
+
+ /* Should I unmask interrupts here? */
+}
+
static struct device_driver i8259A_driver = {
.name = "pic",
.bus = &system_bus_type,
.resume = i8259A_resume,
+ .remove = i8259A_remove,
};
static struct sys_device device_i8259A = {
diff -uNr linux-2.5.43/arch/i386/kernel/io_apic.c linux-2.5.43.x86kexec/arch/i386/kernel/io_apic.c
--- linux-2.5.43/arch/i386/kernel/io_apic.c Fri Oct 18 11:59:14 2002
+++ linux-2.5.43.x86kexec/arch/i386/kernel/io_apic.c Fri Oct 18 12:08:38 2002
@@ -1113,8 +1113,6 @@
* Clear the IO-APIC before rebooting:
*/
clear_IO_APIC();
-
- disconnect_bsp_APIC();
}
/*
diff -uNr linux-2.5.43/arch/i386/kernel/machine_kexec.c linux-2.5.43.x86kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.5.43/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.43.x86kexec/arch/i386/kernel/machine_kexec.c Fri Oct 18 12:08:38 2002
@@ -0,0 +1,143 @@
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/io.h>
+
+
+/*
+ * machine_kexec
+ * =======================
+ */
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : : "m" (curidt)
+ );
+};
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : : "m" (curgdt)
+ );
+};
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+static void identity_map_page(unsigned long address)
+{
+ /* This code is x86 specific...
+ * general purpose code must be more careful
+ * of caches and tlbs...
+ */
+ pgd_t *pgd;
+ pmd_t *pmd;
+ struct mm_struct *mm = current->mm;
+ spin_lock(&mm->page_table_lock);
+
+ pgd = pgd_offset(mm, address);
+ pmd = pmd_alloc(mm, pgd, address);
+
+ if (pmd) {
+ pte_t *pte = pte_alloc_map(mm, pmd, address);
+ if (pte) {
+ set_pte(pte,
+ mk_pte(virt_to_page(phys_to_virt(address)),
+ PAGE_SHARED));
+ __flush_tlb_one(address);
+ }
+ }
+ spin_unlock(&mm->page_table_lock);
+}
+
+
+typedef void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address);
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_kexec(struct kimage *image)
+{
+ unsigned long *indirection_page;
+ void *reboot_code_buffer;
+ relocate_new_kernel_t rnk;
+
+ stop_apics();
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+ reboot_code_buffer = image->reboot_code_buffer;
+ indirection_page = phys_to_virt(image->head & PAGE_MASK);
+
+ identity_map_page(virt_to_phys(reboot_code_buffer));
+
+ /* copy it out */
+ memcpy(reboot_code_buffer, relocate_new_kernel,
+ relocate_new_kernel_size);
+
+ /* The segment registers are funny things, they are
+ * automatically loaded from a table in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again until you set the segment to a different selector.
+ *
+ * The more common model is a cache, where the behind-the-scenes
+ * work is done, but the result is also dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer);
+ (*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer),
+ image->start);
+}
+
diff -uNr linux-2.5.43/arch/i386/kernel/reboot.c linux-2.5.43.x86kexec/arch/i386/kernel/reboot.c
--- linux-2.5.43/arch/i386/kernel/reboot.c Fri Oct 11 22:21:36 2002
+++ linux-2.5.43.x86kexec/arch/i386/kernel/reboot.c Fri Oct 18 12:08:38 2002
@@ -19,8 +19,7 @@
int reboot_thru_bios;
#ifdef CONFIG_SMP
-int reboot_smp = 0;
-static int reboot_cpu = -1;
+int reboot_cpu = -1; /* specifies the internal linux cpu id, not the apicid */
/* shamelessly grabbed from lib/vsprintf.c for readability */
#define is_digit(c) ((c) >= '0' && (c) <= '9')
#endif
@@ -42,7 +41,6 @@
break;
#ifdef CONFIG_SMP
case 's': /* "smp" reboot by executing reset on BSP or other CPU*/
- reboot_smp = 1;
if (is_digit(*(str+1))) {
reboot_cpu = (int) (*(str+1) - '0');
if (is_digit(*(str+2)))
@@ -223,42 +221,7 @@
void machine_restart(char * __unused)
{
-#if CONFIG_SMP
- int cpuid;
-
- cpuid = GET_APIC_ID(apic_read(APIC_ID));
-
- if (reboot_smp) {
-
- /* check to see if reboot_cpu is valid
- if its not, default to the BSP */
- if ((reboot_cpu == -1) ||
- (reboot_cpu > (NR_CPUS -1)) ||
- !(phys_cpu_present_map & (1<<cpuid)))
- reboot_cpu = boot_cpu_physical_apicid;
-
- reboot_smp = 0; /* use this as a flag to only go through this once*/
- /* re-run this function on the other CPUs
- it will fall though this section since we have
- cleared reboot_smp, and do the reboot if it is the
- correct CPU, otherwise it halts. */
- if (reboot_cpu != cpuid)
- smp_call_function((void *)machine_restart , NULL, 1, 0);
- }
-
- /* if reboot_cpu is still -1, then we want a tradional reboot,
- and if we are not running on the reboot_cpu,, halt */
- if ((reboot_cpu != -1) && (cpuid != reboot_cpu)) {
- for (;;)
- __asm__ __volatile__ ("hlt");
- }
- /*
- * Stop all CPUs and turn off local APICs and the IO-APIC, so
- * other OSs see a clean IRQ state.
- */
- smp_send_stop();
- disable_IO_APIC();
-#endif
+ stop_apics();
if(!reboot_thru_bios) {
/* rebooting needs to touch the page at absolute addr 0 */
@@ -282,10 +245,12 @@
void machine_halt(void)
{
+ stop_apics();
}
void machine_power_off(void)
{
+ stop_apics();
if (pm_power_off)
pm_power_off();
}
diff -uNr linux-2.5.43/arch/i386/kernel/relocate_kernel.S linux-2.5.43.x86kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.5.43/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.5.43.x86kexec/arch/i386/kernel/relocate_kernel.S Fri Oct 18 12:08:38 2002
@@ -0,0 +1,99 @@
+#include <linux/config.h>
+#include <linux/linkage.h>
+
+ /* Must be relocatable PIC code callable as a C function, that once
+ * it starts cannot use the previous process's stack.
+ *
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* indirection_page */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ cld
+0: /* top, read another word for the indirection page */
+ movl %ebx, %ecx
+ movl (%ebx), %ecx
+ addl $4, %ebx
+ testl $0x1, %ecx /* is it a destination page */
+ jz 1f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+1:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 1f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+1:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 1f
+ jmp 2f
+1:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+2:
+
+ /* To be certain of avoiding problems with self modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.5.43/arch/i386/kernel/smp.c linux-2.5.43.x86kexec/arch/i386/kernel/smp.c
--- linux-2.5.43/arch/i386/kernel/smp.c Fri Oct 11 22:21:31 2002
+++ linux-2.5.43.x86kexec/arch/i386/kernel/smp.c Fri Oct 18 12:08:38 2002
@@ -611,6 +611,30 @@
void smp_send_stop(void)
{
+ extern int reboot_cpu;
+ int reboot_cpu_id;
+
+ /* The boot cpu is always logical cpu 0 */
+ reboot_cpu_id = 0;
+
+ /* See if a command line override has been given.
+ */
+ if ((reboot_cpu != -1) && !(reboot_cpu >= NR_CPUS) &&
+ test_bit(reboot_cpu, &cpu_online_map)) {
+ reboot_cpu_id = reboot_cpu;
+ }
+
+ /* Make certain the cpu I'm rebooting on is online */
+ if (!test_bit(reboot_cpu_id, &cpu_online_map)) {
+ reboot_cpu_id = smp_processor_id();
+ }
+
+ /* Make certain I only run on the appropriate processor */
+ set_cpus_allowed(current, 1 << reboot_cpu_id);
+
+ /* O.k. Now that I'm on the appropriate processor stop
+ * all of the others.
+ */
smp_call_function(stop_this_cpu, NULL, 1, 0);
local_irq_disable();
diff -uNr linux-2.5.43/include/asm-i386/apic.h linux-2.5.43.x86kexec/include/asm-i386/apic.h
--- linux-2.5.43/include/asm-i386/apic.h Fri Oct 11 22:22:45 2002
+++ linux-2.5.43.x86kexec/include/asm-i386/apic.h Fri Oct 18 12:08:38 2002
@@ -96,6 +96,9 @@
#define NMI_LOCAL_APIC 2
#define NMI_INVALID 3
+extern void stop_apics(void);
+#else
+static inline void stop_apics(void) { }
#endif /* CONFIG_X86_LOCAL_APIC */
#endif /* __ASM_APIC_H */
diff -uNr linux-2.5.43/include/asm-i386/apicdef.h linux-2.5.43.x86kexec/include/asm-i386/apicdef.h
--- linux-2.5.43/include/asm-i386/apicdef.h Fri Oct 18 11:59:27 2002
+++ linux-2.5.43.x86kexec/include/asm-i386/apicdef.h Fri Oct 18 12:08:38 2002
@@ -88,6 +88,7 @@
#define APIC_LVT_REMOTE_IRR (1<<14)
#define APIC_INPUT_POLARITY (1<<13)
#define APIC_SEND_PENDING (1<<12)
+#define APIC_MODE_MASK 0x700
#define GET_APIC_DELIVERY_MODE(x) (((x)>>8)&0x7)
#define SET_APIC_DELIVERY_MODE(x,y) (((x)&~0x700)|((y)<<8))
#define APIC_MODE_FIXED 0x0
diff -uNr linux-2.5.43/include/asm-i386/kexec.h linux-2.5.43.x86kexec/include/asm-i386/kexec.h
--- linux-2.5.43/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.43.x86kexec/include/asm-i386/kexec.h Fri Oct 18 12:08:38 2002
@@ -0,0 +1,25 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGE_OFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.43/include/asm-i386/unistd.h linux-2.5.43.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.43/include/asm-i386/unistd.h Fri Oct 18 11:59:28 2002
+++ linux-2.5.43.x86kexec/include/asm-i386/unistd.h Fri Oct 18 12:09:51 2002
@@ -258,7 +258,7 @@
#define __NR_free_hugepages 251
#define __NR_exit_group 252
#define __NR_lookup_dcookie 253
-
+#define __NR_kexec 254
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -uNr linux-2.5.43/include/linux/kexec.h linux-2.5.43.x86kexec/include/linux/kexec.h
--- linux-2.5.43/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.43.x86kexec/include/linux/kexec.h Fri Oct 18 12:08:38 2002
@@ -0,0 +1,49 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+ unsigned long offset;
+
+ unsigned long start;
+ void *reboot_code_buffer;
+};
+
+/* kexec helper functions */
+void kimage_init(struct kimage *image);
+void kimage_free(struct kimage *image);
+
+struct kexec_segment {
+ void *buf;
+ size_t bufsz;
+ void *mem;
+ size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern int do_kexec(unsigned long entry, long nr_segments,
+ struct kexec_segment *segments, struct kimage *image);
+extern int load_elf_kernel(struct kimage *image, struct file *file);
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.43/kernel/Makefile linux-2.5.43.x86kexec/kernel/Makefile
--- linux-2.5.43/kernel/Makefile Fri Oct 18 11:59:29 2002
+++ linux-2.5.43.x86kexec/kernel/Makefile Fri Oct 18 12:08:38 2002
@@ -21,6 +21,7 @@
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
@@ -30,5 +31,7 @@
# to get a correct value for the wait-channel (WCHAN in ps). --davidm
CFLAGS_sched.o := $(PROFILING) -fno-omit-frame-pointer
endif
+
+obj-$(CONFIG_KEXEC) += kexec.o
include $(TOPDIR)/Rules.make
diff -uNr linux-2.5.43/kernel/kexec.c linux-2.5.43.x86kexec/kernel/kexec.c
--- linux-2.5.43/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.5.43.x86kexec/kernel/kexec.c Fri Oct 18 12:37:53 2002
@@ -0,0 +1,624 @@
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/version.h>
+#include <linux/compile.h>
+#include <linux/kexec.h>
+#include <net/checksum.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+
+#define DEBUG 0
+
+/* As designed kexec can only use the memory that you don't
+ * need to use kmap to access. Memory that you can use virt_to_phys()
+ * on and call get_free_page to allocate.
+ *
+ * In the best case you need one page for the transition from
+ * virtual to physical memory. And this page must be identity
+ * mapped. Which pretty much leaves you with pages < PAGE_OFFSET
+ * as you can only mess with user pages.
+ *
+ * As the only subset of memory that it is easy to restrict allocation
+ * to is the physical memory mapped into the kernel, I do that
+ * with get_free_page and hope it is enough.
+ *
+ * I don't know of a good way to calculate which pages get_free_page
+ * will return independent of architecture so I depend on
+ * <asm/kexec.h> to properly set
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT
+ *
+ */
+
+void kimage_init(struct kimage *image)
+{
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+}
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (image->offset != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ ind_page = (void *)__get_free_page(GFP_KERNEL);
+ if (!ind_page) {
+ return -ENOMEM;
+ }
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ image->offset = 0;
+ return 0;
+}
+
+static int kimage_verify_destination(unsigned long destination)
+{
+ int result;
+
+ /* Assume the page is bad unless we pass the checks */
+ result = -EADDRNOTAVAIL;
+
+ if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) {
+ goto out;
+ }
+
+ /* FIXME:
+ * add checking to ensure the new image doesn't go into
+ * invalid or reserved areas of RAM.
+ */
+ result = 0;
+out:
+ return result;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+ destination &= PAGE_MASK;
+ result = kimage_verify_destination(destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+ page &= PAGE_MASK;
+ result = kimage_verify_destination(image->destination);
+ if (result) {
+ return result;
+ }
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static int kimage_terminate(struct kimage *image)
+{
+ int result;
+ result = kimage_add_entry(image, IND_DONE);
+ if (result == 0) {
+ /* Point at the terminating element */
+ image->entry--;
+ }
+ return result;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ free_page((unsigned long)phys_to_virt(ind & PAGE_MASK));
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ free_page((unsigned long)phys_to_virt(entry & PAGE_MASK));
+ }
+ }
+}
+
+#if DEBUG
+static void kimage_print_image(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ int i;
+ printk(KERN_EMERG "kimage_print_image\n");
+ i = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ printk(KERN_EMERG "%5d DEST\n", i);
+ }
+ else if (entry & IND_INDIRECTION) {
+ printk(KERN_EMERG "%5d IND\n", i);
+ }
+ else if (entry & IND_SOURCE) {
+ printk(KERN_EMERG "%5d SOURCE\n", i);
+ }
+ else if (entry & IND_DONE) {
+ printk(KERN_EMERG "%5d DONE\n", i);
+ }
+ else {
+ printk(KERN_EMERG "%5d ?\n", i);
+ }
+ i++;
+ }
+ printk(KERN_EMERG "kimage_print_image: %5d\n", i);
+}
+#endif
+static int kimage_is_destination_page(
+ struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination;
+ destination = 0;
+ page &= PAGE_MASK;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return 1;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_unused_area(
+ struct kimage *image, unsigned long size, unsigned long align,
+ unsigned long *area)
+{
+ /* Walk through mem_map and find the first chunk of
+ * unused memory that is at least size bytes long.
+ */
+ /* Since the kernel plays with PG_reserved, mem_map is less
+ * than ideal for this purpose, but it will give us a correct
+ * conservative estimate of what we need to do.
+ */
+ /* For now we take advantage of the fact that all kernel pages
+ * are marked with PG_reserved to allocate a large
+ * contiguous area for the reboot code buffer.
+ */
+ unsigned long addr;
+ unsigned long start, end;
+ unsigned long mask;
+ mask = ((1 << align) -1);
+ start = end = PAGE_SIZE;
+ for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) {
+ struct page *page;
+ unsigned long aligned_start;
+ page = virt_to_page(phys_to_virt(addr));
+ if (PageReserved(page) ||
+ kimage_is_destination_page(image, addr)) {
+ /* The current page is reserved so the start &
+ * end of the next area must be at least at the
+ * next page.
+ */
+ start = end = addr + PAGE_SIZE;
+ }
+ else {
+ /* O.k. The current page isn't reserved
+ * so push up the end of the area.
+ */
+ end = addr;
+ }
+ aligned_start = (start + mask) & ~mask;
+ if (aligned_start > start) {
+ continue;
+ }
+ if (aligned_start > end) {
+ continue;
+ }
+ if (end - aligned_start >= size) {
+ *area = aligned_start;
+ return 0;
+ }
+ }
+ *area = 0;
+ return -ENOSPC;
+}
+
+static kimage_entry_t *kimage_dst_conflict(
+ struct kimage *image, unsigned long page, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+ for_each_kimage_entry(image, ptr, entry) {
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static kimage_entry_t *kimage_src_conflict(
+ struct kimage *image, unsigned long destination, kimage_entry_t *limit)
+{
+ kimage_entry_t *ptr, entry;
+ for_each_kimage_entry(image, ptr, entry) {
+ unsigned long page;
+ if (ptr == limit) {
+ return 0;
+ }
+ else if (entry & IND_DESTINATION) {
+ /* nop */
+ }
+ else if (entry & IND_DONE) {
+ /* nop */
+ }
+ else {
+ /* SOURCE & INDIRECTION */
+ page = entry & PAGE_MASK;
+ if (page == destination) {
+ return ptr;
+ }
+ }
+ }
+ return 0;
+}
+
+static int kimage_get_off_destination_pages(struct kimage *image)
+{
+ kimage_entry_t *ptr, *cptr, entry;
+ unsigned long buffer, page;
+ unsigned long destination = 0;
+
+ /* Here we implement safeguards to ensure that
+ * a source page is not copied to its destination
+ * page before the data on the destination page is
+ * no longer useful.
+ *
+ * To make it work we actually wind up with a
+ * stronger condition. For every page considered
+ * it is either its own destination page or it is
+ * not a destination page of any page considered.
+ *
+ * Invariants
+ * 1. buffer is not a destination of a previous page.
+ * 2. page is not a destination of a previous page.
+ * 3. destination is not a previous source page.
+ *
+ * Result: Either a source page and a destination page
+ * are the same or the page is not a destination page.
+ *
+ * These checks could be done when we allocate the pages,
+ * but doing it as a final pass allows us more freedom
+ * on how we allocate pages.
+ *
+ * Also while the checks are necessary, in practice nothing
+ * happens. The destination kernel wants to sit in the
+ * same physical addresses as the current kernel so we never
+ * actually allocate a destination page.
+ *
+ * BUGS: This is an O(N^2) algorithm.
+ */
+
+
+ buffer = __get_free_page(GFP_KERNEL);
+ if (!buffer) {
+ return -ENOMEM;
+ }
+ buffer = virt_to_phys((void *)buffer);
+ for_each_kimage_entry(image, ptr, entry) {
+ /* Here we check to see if an allocated page conflicts with a destination page. */
+ kimage_entry_t *limit;
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_INDIRECTION) {
+ /* Indirection pages must include all of their
+ * contents in limit checking.
+ */
+ limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit));
+ }
+ if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) {
+ continue;
+ }
+ page = entry & PAGE_MASK;
+ limit = ptr;
+
+ /* See if a previous page has the current page as its
+ * destination.
+ * i.e. invariant 2
+ */
+ cptr = kimage_dst_conflict(image, page, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE);
+ *cptr = page | (centry & ~PAGE_MASK);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = cpage;
+ }
+ if (!(entry & IND_SOURCE)) {
+ continue;
+ }
+
+ /* See if a previous page is our destination page.
+ * If so claim it now.
+ * i.e. invariant 3
+ */
+ cptr = kimage_src_conflict(image, destination, limit);
+ if (cptr) {
+ unsigned long cpage;
+ kimage_entry_t centry;
+ centry = *cptr;
+ cpage = centry & PAGE_MASK;
+ memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE);
+ memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE);
+ *cptr = buffer | (centry & ~PAGE_MASK);
+ *ptr = cpage | ( entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ /* If the buffer is my destination page do the copy now
+ * i.e. invariant 3 & 1
+ */
+ if (buffer == destination) {
+ memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE);
+ *ptr = buffer | (entry & ~PAGE_MASK);
+ buffer = page;
+ }
+ }
+ free_page((unsigned long)phys_to_virt(buffer));
+ return 0;
+}
+
+static int kimage_add_empty_pages(struct kimage *image,
+ unsigned long len)
+{
+ unsigned long pos;
+ int result;
+ for(pos = 0; pos < len; pos += PAGE_SIZE) {
+ char *page;
+ result = -ENOMEM;
+ page = (void *)__get_free_page(GFP_KERNEL);
+ if (!page) {
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result) {
+ goto out;
+ }
+ }
+ result = 0;
+ out:
+ return result;
+}
+
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long mstart;
+ int result;
+ unsigned long offset;
+ unsigned long offset_end;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ mstart = (unsigned long)segment->mem;
+
+ offset_end = segment->memsz;
+
+ result = kimage_set_destination(image, mstart);
+ if (result < 0) {
+ goto out;
+ }
+ for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) {
+ char *page;
+ size_t size, leader;
+ page = (char *)__get_free_page(GFP_KERNEL);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, virt_to_phys(page));
+ if (result < 0) {
+ goto out;
+ }
+ if (segment->bufsz < offset) {
+ /* We are past the end zero the whole page */
+ memset(page, 0, PAGE_SIZE);
+ continue;
+ }
+ size = PAGE_SIZE;
+ leader = 0;
+ if ((offset == 0)) {
+ leader = mstart & ~PAGE_MASK;
+ }
+ if (leader) {
+ /* We are on the first page zero the unused portion */
+ memset(page, 0, leader);
+ size -= leader;
+ page += leader;
+ }
+ if (size > (segment->bufsz - offset)) {
+ size = segment->bufsz - offset;
+ }
+ result = copy_from_user(page, buf + offset, size);
+ if (result) {
+ result = (result < 0)?result : -EIO;
+ goto out;
+ }
+ if (size < (PAGE_SIZE - leader)) {
+ /* zero the trailing part of the page */
+ memset(page + size, 0, (PAGE_SIZE - leader) - size);
+ }
+ }
+ out:
+ return result;
+}
+
+
+/* do_kexec executes a new kernel
+ */
+int do_kexec(unsigned long start, long nr_segments,
+ struct kexec_segment *arg_segments, struct kimage *image)
+{
+ struct kexec_segment *segments;
+ size_t segment_bytes;
+ int i;
+
+ int result;
+ unsigned long reboot_code_buffer;
+ kimage_entry_t *end;
+
+ /* Initialize variables */
+ segments = 0;
+
+ /* Reject a nonsensical segment count. */
+ if (nr_segments <= 0) {
+ result = -EINVAL;
+ goto out;
+ }
+ segment_bytes = nr_segments * sizeof(*segments);
+ segments = kmalloc(segment_bytes, GFP_KERNEL);
+ if (segments == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ if (copy_from_user(segments, arg_segments, segment_bytes)) {
+ result = -EFAULT;
+ goto out;
+ }
+#if DEBUG
+ for(i = 0; i < nr_segments; i++) {
+ printk(KERN_EMERG "k_segment[%d].buf = %p\n", i, segments[i].buf);
+ printk(KERN_EMERG "k_segment[%d].bufsz = 0x%x\n", i, segments[i].bufsz);
+ printk(KERN_EMERG "k_segment[%d].mem = %p\n", i, segments[i].mem);
+ printk(KERN_EMERG "k_segment[%d].memsz = 0x%x\n", i, segments[i].memsz);
+ }
+ printk(KERN_EMERG "k_entry = 0x%08lx\n", start);
+ printk(KERN_EMERG "k_nr_segments = %ld\n", nr_segments);
+ printk(KERN_EMERG "k_segments = %p\n", segments);
+#endif
+
+ /* Read in the data from user space */
+ image->start = start;
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &segments[i]);
+ if (result) {
+ goto out;
+ }
+ }
+
+ /* Terminate early so I can get a place holder. */
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+ end = image->entry;
+
+ /* Usage of the reboot code buffer is subtle. We first
+ * find a contiguous area of ram, that is not one
+ * of our destination pages. We do not allocate the ram.
+ *
+ * The algorithm to make certain we do not have address
+ * conflicts requires each destination region to have some
+ * backing store so we allocate arbitrary source pages.
+ *
+ * Later in machine_kexec when we copy data to the
+ * reboot_code_buffer it still may be allocated for other
+ * purposes, but we do know there are no source or destination
+ * pages in that area. And since the rest of the kernel
+ * is already shutdown those pages are free for use,
+ * regardless of their page->count values.
+ */
+ result = kimage_get_unused_area(
+ image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN,
+ &reboot_code_buffer);
+ if (result)
+ goto out;
+
+ /* Allocating pages we should never need is silly but the
+ * code won't work correctly unless we have dummy pages to
+ * work with.
+ */
+ result = kimage_set_destination(image, reboot_code_buffer);
+ if (result)
+ goto out;
+ result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE);
+ if (result)
+ goto out;
+ image->reboot_code_buffer = phys_to_virt(reboot_code_buffer);
+
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+ result = kimage_get_off_destination_pages(image);
+ if (result)
+ goto out;
+
+#if DEBUG
+ kimage_print_image(image);
+#endif
+
+ /* Now hide the extra source pages for the reboot code buffer
+ * What is the logic with the reboot code buffer, should it
+ * be mapped 1-1 by this point FIXME verify this?
+ */
+ image->entry = end;
+ result = kimage_terminate(image);
+ if (result)
+ goto out;
+
+#if DEBUG
+ kimage_print_image(image);
+#endif
+
+ result = 0;
+ out:
+ /* cleanup and exit */
+ if (segments) kfree(segments);
+ return result;
+}
+
diff -uNr linux-2.5.43/kernel/sys.c linux-2.5.43.x86kexec/kernel/sys.c
--- linux-2.5.43/kernel/sys.c Fri Oct 18 11:59:29 2002
+++ linux-2.5.43.x86kexec/kernel/sys.c Fri Oct 18 12:08:38 2002
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/times.h>
@@ -430,6 +431,66 @@
unlock_kernel();
return 0;
}
+
+#ifdef CONFIG_KEXEC
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down. Preventing on-going dmas, and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ * and then copies the image to its final destination, and
+ * jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
+ struct kexec_segment *segments)
+{
+ /* Am I using too much stack space here? */
+ struct kimage image;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_BOOT))
+ return -EPERM;
+
+ lock_kernel();
+ kimage_init(&image);
+ result = do_kexec(entry, nr_segments, segments, &image);
+ if (result) {
+ kimage_free(&image);
+ unlock_kernel();
+ return result;
+ }
+
+ /* The point of no return is here... */
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_running = 0;
+ device_shutdown();
+ printk(KERN_EMERG "kexecing image\n");
+ machine_kexec(&image);
+ /* We never get here but... */
+ kimage_free(&image);
+ unlock_kernel();
+ return -EINVAL;
+}
+#else
+asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
+ struct kexec_segment *segments)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_KEXEC */
static void deferred_cad(void *dummy)
{
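[Editorial sketch, not part of the patch.] The conflict checks in the patch above walk a list of entries in which a destination entry sets the current target address and each following source entry fills the next target page. A minimal user-space sketch of that walk over a flat array is given here. It assumes hypothetical flag values and a 4096-byte page, ignores indirection pages, and uses the invented names KX_PAGE_SIZE, KX_PAGE_MASK and dst_conflict; the patch's own definitions live in linux/kexec.h.

```c
#include <stddef.h>

/* Hypothetical values for illustration only; not taken from the patch. */
#define KX_PAGE_SIZE     4096UL
#define KX_PAGE_MASK     (~(KX_PAGE_SIZE - 1))
#define IND_DESTINATION  0x1UL
#define IND_INDIRECTION  0x2UL
#define IND_DONE         0x4UL
#define IND_SOURCE       0x8UL

typedef unsigned long kimage_entry_t;

/* Return the index of the first source entry before `limit` whose
 * implied destination is `page`, or -1 if there is none.
 */
static long dst_conflict(const kimage_entry_t *entries, size_t limit,
                         unsigned long page)
{
    unsigned long destination = 0;
    size_t i;

    for (i = 0; i < limit; i++) {
        kimage_entry_t entry = entries[i];

        if (entry & IND_DONE)
            break;
        if (entry & IND_DESTINATION) {
            /* Start a new run of destination addresses. */
            destination = entry & KX_PAGE_MASK;
        } else if (entry & IND_SOURCE) {
            if (page == destination)
                return (long)i;
            /* Each source page fills the next destination page. */
            destination += KX_PAGE_SIZE;
        }
    }
    return -1;
}
```

With a list of one destination entry at 0x100000 followed by two source entries, the second source entry is the one that lands on 0x101000.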
Werner Almesberger <[email protected]> writes:
I am CC'ing the kernel list so perhaps I don't have to answer these
questions too many times.
> Eric W. Biederman wrote:
> > This is fixed in the bk2 snapshot of 43, but I guess since 44 is out
> > I should do build another patch against that.
>
> Your .43 patch applies flawlessly to .44 - and kexec even works :-)
> Not surprisingly, it doesn't like rebooting out of X11, though.
Cool. What fails with X11? Fixing it might be as simple as calling
int 0x10 early in the new image.
> Shouldn't ELF images work too ?
Try kexec_test; it is a valid static ELF executable.
Other valid static elf executables I know of:
memtest86-3.0.
etherboot-5.1 (make bin32/*.elf)
> kexec linux/.../bzImage is okay,
> but kexec linux/vmlinux yields
> Invalid memory segment 0xc0100000 - 0xc03f3d7c
> Cannot load linux-2.5.44/vmlinux
Yep. vmlinux wants to load where you don't have memory, and I have
a sanity check in there to prevent that. If you had > 3GB of ram it
might have succeeded. Unfortunately vmlinux on x86 has a number of
barriers to working correctly. It specifies incorrect physical
addresses, and it expects to be passed a whole host of strange values,
in weird places. Loading an arbitrary segment without first setting
the GDT is rude. And vmlinux has no distinguishing marks other than
its name to say it is something special.
My mkelfImage code will fix up vmlinux so it is usable. And I have
some old patches that will correct the kernel build but when I was
submitting them earlier they were not picked up.
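[Editorial sketch.] The kind of sanity check described above, rejecting a segment that asks to be loaded where there is no RAM, might look roughly like the following. This is not the kexec-tools code; struct ram_range and segment_is_valid are hypothetical names, and the RAM ranges are whatever the caller knows about the machine.

```c
#include <stddef.h>
#include <stdint.h>

/* A usable RAM range [start, end); hypothetical helper type. */
struct ram_range {
    uint64_t start;
    uint64_t end;
};

/* Return 1 if [mstart, mend) lies entirely inside one RAM range,
 * 0 otherwise.
 */
static int segment_is_valid(const struct ram_range *ram, size_t nr,
                            uint64_t mstart, uint64_t mend)
{
    size_t i;

    if (mend <= mstart)
        return 0;
    for (i = 0; i < nr; i++) {
        if (mstart >= ram[i].start && mend <= ram[i].end)
            return 1;
    }
    return 0;
}
```

On a machine whose RAM ends well below 3GB, the range 0xc0100000 - 0xc03f3d7c from the report above fails this test, while the same image relocated to 0x00100000 would pass.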
kexec when presented with a static elf executable will:
-1) TODO: query the ELF note segment to see if there is anything
special it needs to do.
0) Sanity check the elf headers to see if their requests are reasonable.
1) Load each segment from the program header to the physical
address it specifies.
2) Setup a stack somewhere in ram outside of the image segments in the
elf program header.
3) Jump to the entry point address in the elf program header
This all happens in 32bit protected mode with paging disabled, and
all of the segments registers set to a flat 32bit segment, with a base
address of zero.
For practical purposes the above is the raw interface to sys_kexec but
presented in a flat file.
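[Editorial sketch.] Steps 0 and 1 above, checking the program headers and turning each loadable one into a (buf, bufsz, mem, memsz) segment, can be sketched as follows. The struct names phdr32 and seg, the constant PT_LOAD_TYPE and the function build_segments are stand-ins invented for this example; only the four segment field names come from the patch, and this is not the kexec-tools implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define PT_LOAD_TYPE 1  /* value of PT_LOAD in the ELF specification */

/* Minimal stand-in for the 32-bit ELF program header fields used here. */
struct phdr32 {
    uint32_t p_type;
    uint32_t p_offset;
    uint32_t p_paddr;
    uint32_t p_filesz;
    uint32_t p_memsz;
};

/* Mirrors the four fields the patch prints for each kexec_segment. */
struct seg {
    const void *buf;
    size_t bufsz;
    void *mem;
    size_t memsz;
};

/* Fill `out` with one segment per loadable program header.
 * Returns the number of segments, or -1 if a header is unreasonable.
 */
static long build_segments(const unsigned char *file, size_t file_len,
                           const struct phdr32 *ph, size_t nr_ph,
                           struct seg *out, size_t max_out)
{
    size_t i, n = 0;

    for (i = 0; i < nr_ph; i++) {
        if (ph[i].p_type != PT_LOAD_TYPE)
            continue;
        /* The file data must fit in the memory image ... */
        if (ph[i].p_filesz > ph[i].p_memsz)
            return -1;
        /* ... and must lie inside the file itself. */
        if (ph[i].p_offset > file_len ||
            ph[i].p_filesz > file_len - ph[i].p_offset)
            return -1;
        if (n == max_out)
            return -1;
        out[n].buf = file + ph[i].p_offset;
        out[n].bufsz = ph[i].p_filesz;
        out[n].mem = (void *)(uintptr_t)ph[i].p_paddr;
        out[n].memsz = ph[i].p_memsz;
        n++;
    }
    return (long)n;
}
```

The bytes between bufsz and memsz are the part the kernel side zero-fills when it loads the segment.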
References:
memtest86: http://www.memtest86.com/
etherboot: http://www.etherboot.org
mkelfImage: ftp://ftp.lnxi.com/pub/src/mkelfImage/
Eric
Eric W. Biederman wrote:
> Cool. What fails with X11. Fixing it might be as simple as calling
> int 0x10 early in the new image.
The graphic engine (i810) simply doesn't switch back to text mode.
Yes, 0x10 helps. I've attached a little patch that does this in a
relatively safe way. (Alternatively, one could also use set_80x25,
but I think always forcing mode 3 is slightly more reliable,
except for MGA users, of course :-)
If you get a new boot loader type code from Peter Anvin, this
should even be good enough for inclusion into the mainstream
kernel. Alternatively, we could also pick a new loader flag to
indicate that the firmware didn't initialize the system.
> [vmlinux] specifies incorrect physical
> addresses, and it expects to be passed a whole host of strange values,
> in weird places.
I see. Perhaps you could say then that mkelfImage fixes flaws in
the vmlinux ELF image meta-data, or such:
| A kernel reformater is makes images that seem to boot more reliably is at:
| ftp://ftp.lnxi.com/pub/mkelfImage/mkelfImage-1.17.tar.gz
This sounds more like "if I kick it here, it usually works,
but I have no idea why" :-)
And yes, if it's not too intrusive, fixing the ELF meta-data
along with the addition of kexec might be a good idea.
- Werner
---------------------------------- cut here -----------------------------------
--- linux-2.5.44/arch/i386/boot/video.S.orig Sat Oct 19 12:55:14 2002
+++ linux-2.5.44/arch/i386/boot/video.S Sat Oct 19 13:51:19 2002
@@ -148,6 +148,13 @@
cmpb $0x10, %bl # No, it's a CGA/MDA/HGA card.
je basret
+ cmpb $0xff,type_of_loader # are we using kexec ?
+ jne novgareset
+
+ movw $0x3, %ax # reset EGA/VGA to 80x25 text
+ int $0x10
+
+novgareset:
incb adapter
movw $0x1a00, %ax # Check EGA or VGA?
int $0x10
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Werner Almesberger <[email protected]> writes:
> Eric W. Biederman wrote:
> > Cool. What fails with X11. Fixing it might be as simple as calling
> > int 0x10 early in the new image.
>
> The graphic engine (i810) simply doesn't switch back to text mode.
> Yes, 0x10 helps. I've attached a little patch that does this in a
> relatively safe way. (Alternative, one could also use set_80x25,
> but I think always forcing mode 3 is slightly more reliable.
> Except for MGA users, of course :-)
>
> If you get a new boot loader type code from Peter Anvin, this
> should even be good enough for inclusion into the mainstream
> kernel. Alternatively, we could also pick a new loader flag to
> indicate that the firmware didn't initialize the system.
For the most part my preference is to put the system into a sane state
when we reboot/shutdown. The device_shutdown call in the kernel can handle
this, on reboot. The problem with video is mostly that the drivers
are not well integrated.
> > [vmlinux] specifies incorrect physical
> > addresses, and it expects to be passed a whole host of strange values,
> > in weird places.
>
> I see. Perhaps you could say then that mkelfImage fixes flaws in
> the vmlinux ELF image meta-data, or such:
The primary thing it does is give the kernel a working 32bit entry
point.
> | A kernel reformater is makes images that seem to boot more reliably is at:
> | ftp://ftp.lnxi.com/pub/mkelfImage/mkelfImage-1.17.tar.gz
>
> This sounds more like "if I kick it here, it usually works,
> but I have no idea why" :-)
The code was built so I could put a kernel, a ramdisk, and a command line
all in a single ELF executable. With the addition of entering the kernel
at its unsupported 32bit entry point. So I perform a different set of BIOS
calls than setup.S does. Though they are very similar.
But why mkelfImage works occasionally when a bzImage doesn't and you
have a pcbios is a mystery to me. I know why only mkelfImage works
under LinuxBIOS...
> And yes, if it's not too intrusive, fixing the ELF meta-data
> along with the addition of kexec might be a good idea.
The meta-data is easy, about one line in vmlinux.lds. Fixing the
actual entry point is more interesting. I've done it and the result
is maintainable but the patch met some definitions of intrusive.
Eric
On Fri, 2002-10-18 at 13:02, Eric W. Biederman wrote:
> In bug reports please include the serial console output of
> kexec kexec_test. kexec_test exercises most of the interesting code
> paths that are needed to load a kernel with lots of debugging print
> statements, so hangs can easily be detected.
Hi again, Eric.
Thanks for sending out pointers to a fresh batch of code.
I applied your patch cleanly to 2.5.44, and tried it on one of the two
"troublesome" systems that I have used in the past. I had to make one
unrelated change to the SCSI subsystem to fix a problem with a hang
during an ordinary reboot.
Symptoms:
kexec bzImage makes it all the way down to the indirect call at the very
bottom of machine_kexec() before the system appears to hang.
Debugging output from "kexec kexec_test":
# ./kexec ./kexec_test
kexecing image
kexec_test 1.1 starting...
eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 00000000 C0000000
Switching descriptors.
Descriptors changed.
In real mode.
Interrupts enabled.
Base memory size: 0277
--> [ edited: long pause here ] <--
Can not A20 line.
E820 Memory Map.
000000000009DC00 @ 0000000000000000 type: 00000001
0000000000002400 @ 000000000009DC00 type: 00000002
0000000000020000 @ 00000000000E0000 type: 00000002
0000000027EED140 @ 0000000000100000 type: 00000001
0000000000010000 @ 0000000027FF0000 type: 00000002
0000000000002EC0 @ 0000000027FED140 type: 00000003
0000000001400000 @ 00000000FEC00000 type: 00000002
E801 Memory size: 0009F800
Mem88 Memory size: FFFF
Testing for APM.
APM test done.
A20 enabled
Interrupts disabled.
In protected mode.
Halting.
System info:
IBM eServer xSeries 220 Type 8645-2AX
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 10
cpu MHz : 799.962
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1576.96
% lspci
00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
01:03.0 SCSI storage controller: Adaptec AIC-7892P U160/m (rev 02)
What other information can I provide?
Regards,
Andy
ps: if it helps, the OSDL CGL tree has bootimg and linux-2.4.18, and
that combination works (unless X is still running) on this system.
Andy Pfiffer <[email protected]> writes:
> On Fri, 2002-10-18 at 13:02, Eric W. Biederman wrote:
>
> > In bug reports please include the serial console output of
> > kexec kexec_test. kexec_test exercises most of the interesting code
> > paths that are needed to load a kernel with lots of debugging print
> > statements, so hangs can easily be detected.
>
> Hi again, Eric.
>
> Thanks for sending out pointers to a fresh batch of code.
>
> I applied your patch cleanly to 2.5.44, and tried it on one of the two
> "troublesome" systems that I have used in the past. I had to make one
> unrelated change to the SCSI subsystem to fix a problem with a hang
> during an ordinary reboot.
>
> Symptoms:
> kexec bzImage makes it all the way down to the indirect call at the very
> bottom of machine_kexec() before the system appears to hang.
Somehow I suspect it is getting farther and you just are not getting any feedback.
> Debugging output from "kexec kexec_test":
> # ./kexec ./kexec_test
> kexecing image
> kexec_test 1.1 starting...
> eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
> esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
> idt: 00000000 C0000000
> gdt: 00000000 C0000000
> Switching descriptors.
> Descriptors changed.
> In real mode.
> Interrupts enabled.
> Base memory size: 0277
> --> [ edited: long pause here ] <--
> Can not A20 line.
> E820 Memory Map.
> 000000000009DC00 @ 0000000000000000 type: 00000001
> 0000000000002400 @ 000000000009DC00 type: 00000002
> 0000000000020000 @ 00000000000E0000 type: 00000002
> 0000000027EED140 @ 0000000000100000 type: 00000001
> 0000000000010000 @ 0000000027FF0000 type: 00000002
> 0000000000002EC0 @ 0000000027FED140 type: 00000003
> 0000000001400000 @ 00000000FEC00000 type: 00000002
> E801 Memory size: 0009F800
> Mem88 Memory size: FFFF
> Testing for APM.
> APM test done.
> A20 enabled
> Interrupts disabled.
> In protected mode.
> Halting.
First it is not a problem that it took a long time trying to disable
the a20 line. That just means it tried, and you have a system with
the a20 line permanently enabled. Which by some measures is a bug,
and by others a feature. But it does seem to be a feature of the
systems that are giving the most trouble at the moment.
The fact that kexec_test worked shows that the kexec code basically
worked, and that under the right circumstances most BIOS calls work.
Why it is hanging is currently a mystery to me. Your system
also does not have APM support which is an interesting data point
and rules out a number of problems.
> What other information can I provide?
I think I need to see if kexec loads setup.S at a bad location.
Well it loads setup.S at 0x90000 which is the traditional location,
but it should not be a problem...
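[Editorial sketch.] One way to double check a load address such as 0x90000 against the E820 map printed by kexec_test is shown here. The table is copied from the output quoted above; the struct layout, the function names and the idea of checking a caller-supplied length are assumptions made for this example, not something kexec-tools is known to do.

```c
#include <stddef.h>
#include <stdint.h>

#define E820_TYPE_RAM 1  /* type 1 marks usable RAM in an E820 map */

struct e820_entry {
    uint64_t size;
    uint64_t base;
    uint32_t type;
};

/* Entries copied from the kexec_test output quoted above. */
static const struct e820_entry reported_map[] = {
    { 0x000000000009DC00ULL, 0x0000000000000000ULL, 1 },
    { 0x0000000000002400ULL, 0x000000000009DC00ULL, 2 },
    { 0x0000000000020000ULL, 0x00000000000E0000ULL, 2 },
    { 0x0000000027EED140ULL, 0x0000000000100000ULL, 1 },
    { 0x0000000000010000ULL, 0x0000000027FF0000ULL, 2 },
    { 0x0000000000002EC0ULL, 0x0000000027FED140ULL, 3 },
    { 0x0000000001400000ULL, 0x00000000FEC00000ULL, 2 },
};

/* Return 1 if [start, start + len) sits inside a single usable entry. */
static int e820_usable_covers(const struct e820_entry *map, size_t nr,
                              uint64_t start, uint64_t len)
{
    size_t i;

    for (i = 0; i < nr; i++) {
        uint64_t end = map[i].base + map[i].size;

        if (map[i].type != E820_TYPE_RAM)
            continue;
        if (start >= map[i].base && start <= end && len <= end - start)
            return 1;
    }
    return 0;
}

/* Sum the sizes of all usable entries. */
static uint64_t e820_usable_total(const struct e820_entry *map, size_t nr)
{
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < nr; i++) {
        if (map[i].type == E820_TYPE_RAM)
            total += map[i].size;
    }
    return total;
}
```

On this map the usable low memory ends at 0x9DC00, so anything placed at 0x90000 has 0xDC00 bytes available before it runs into the reserved area.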
> Regards,
> Andy
>
> ps: if it helps, the OSDL CGL tree has bootimg and linux-2.4.18, and
> that combination works (unless X is still running) on this system.
Have you tried booting that 2.4.18 kernel with kexec?
Oh, wait as I recall bootimg simply copies the BIOS results
from the current kernel to the freshly booted kernel, so it skips
the BIOS calls altogether. Which is both very handy, and troublesome.
It may also be a device shutdown problem. memtest86
is the only candidate I can think of for testing that.
But my other question is: does running mkelfImage-1.17 on your kernel
before you boot it help?
ftp://ftp.lnxi.com/pub/src/mkelfImage/mkelfImage-1.17.tar.gz
It does the BIOS calls in a different order and that seems to help,
though I am still trying to track down why.
Eric
Digging further into the one failure I can reproduce, I have
found a very weird failure case. The kernel code dies after switching
into 32bit mode. I found this by setting the setup.S hooks and printing
a character to the serial port whenever they were encountered.
I will release another version of kexec-tools shortly with a -debug switch
to enable this debugging, and anything else I can think of. For the most
part I have avoided printing messages out the serial port because not
everyone has one, or has it setup as a serial console.
But if I enable it just on a debugging switch it should be o.k. and help
quite a bit with figuring out why some machines fail, and others do not.
Eric
Eric W. Biederman wrote:
> Oh, wait as I recall bootimg simply copies the BIOS results
> from the current kernel to the freshly booted kernel, so it skips
> the BIOS calls altogether.
Yes, I don't trust the BIOS very much under normal conditions,
so I wouldn't even dream of running it with a largely undefined
system state. I'm actually quite surprised that kexec has so
few problems doing that :-)
In any case, since the kexec kernel code is more or less just a
generic loader, this is something you can always decide to
change in user space. The only thing bootimg did that kexec
doesn't do is to explicitly mark BIOS-provided data tables
(mainly SMP stuff) as reserved so that they won't be
overwritten. But it seems that mpparse.c now reserves that
already, so kexec should be fine.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Ok as promised kexec-tools-1.3.tar.gz is released.
The new test case it provides is
kexec -debug bzImage
The serial console must be initialized before using this.
[root@p4dp8-0 root]# kexec -debug bzImage-2.4.17.eb-amd768-eepro100-kexec-apic-lb-mtd2 ip=dhcp root=/dev/nfs console=tty0 console=ttyS0,9600 reboot=hard panic=5 ide0=ata66 verbose
setup16_end: 00091ac4
Shutting down devices
kexecing image
a
b
c
d
e
f
g
h
< All above are various points in x86-setup-16.S >
i < Printed from the first callback in setup.S, before protected mode is entered >
j < Printed from the second callback in setup.S, just before the kernel decompresser is run >
I have a very strange node that makes it all of the way to 'j' before rebooting.
The concept that something is dying in protected mode with all of the interrupts
disabled is so novel that I really don't know what to make of it, yet.
But I would be very interested if other people had similar experiences.
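[Editorial sketch.] When reading a captured serial log from kexec -debug, the last marker letter tells you how far the boot got. A small helper for that, using only the marker meanings listed above, could look like this. The function names are invented for the example, and it assumes each marker arrives alone on its own line as in the log shown.

```c
#include <stddef.h>
#include <string.h>

/* Map a marker letter to the stage it was printed from, or NULL. */
static const char *stage_for_marker(char marker)
{
    if (marker >= 'a' && marker <= 'h')
        return "x86-setup-16.S";
    if (marker == 'i')
        return "setup.S first callback (before protected mode)";
    if (marker == 'j')
        return "setup.S second callback (before the decompressor)";
    return NULL;
}

/* Return the last marker found in a captured console log where each
 * marker sits alone on its line, or 0 if there is none.
 */
static char last_marker(const char *log)
{
    char last = 0;
    const char *line = log;

    while (*line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (len == 1 && line[0] >= 'a' && line[0] <= 'j')
            last = line[0];
        line += len;
        if (*line == '\n')
            line++;
    }
    return last;
}
```

For the log above, last_marker returns 'j', which stage_for_marker places just before the decompressor.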
Eric
On Tuesday 22 October 2002 03:33, Eric W. Biederman wrote:
> j < Printed from the second callback in setup.S, just before the
> kernel decompresser is run >
>
>
> I have a very strange node that makes it all of the way to 'j' before
> rebooting. The concept that something is dying in protected mode will all
> of the interrupts disabled is so novel that I really don't know what to
> make of it, yet.
It would almost have to be the MMU. Any way to dump the page tables?
Rob
--
http://penguicon.sf.net - Terry Pratchett, Eric Raymond, Pete Abrams, Illiad,
CmdrTaco, liquid nitrogen ice cream, and caffienated jello. Well why not?
Werner Almesberger <[email protected]> writes:
> Eric W. Biederman wrote:
> > Oh, wait as I recall bootimg simply copies the BIOS results
> > from the current kernel to the freshly booted kernel, so it skips
> > the BIOS calls altogether.
>
> Yes, I don't trust the BIOS very much under normal conditions,
> so I wouldn't even dream of running it with a largely undefined
> system state. I'm actually quite surprised that kexec has so
> few problems doing that :-)
Me too. There are two advantages to tracking things down so the BIOS
calls work. 1) BIOS calls are always the best source for what the
capabilities of the machine are. 2) If enough glitches are shaken out
of the system I can put in a boot sector loader, and people can boot
windows. And that is the really killer feature, because then you can
replace GRUB and lilo.
In my optimized setup I put on LinuxBIOS, which simply provides a
table of values to the kernel. Which I can requery all day long
without problems. And I suppose that has me spoiled :)
> In any case, since the kexec kernel code is more or less just a
> generic loader, this is something you can always decide to
> change in user space.
Definitely that is where the policy is.
> The only thing bootimg did that kexec
> doesn't do is to explicitly mark BIOS-provided data tables
> (mainly SMP stuff) as reserved so that they won't be
> overwritten. But it seems that mpparse.c now reserves that
> already, so kexec should be fine.
Yep. I checked up on that one, a few bug hunts ago...
Also the BIOS normally reserves that memory, and linux
honors the BIOS reservations as well.
What has me currently baffled is that I have an old 2.4.17 kernel,
that is dying in the decompressor. But if I bypass the 32bit startup
code and substitute in my own, the kernel works.
I don't know yet if this is common or not.
I guess the next step is to work on a reliable means of skipping the
16bit kernel startup code. I do that all of the time with mkelfImage,
so it should not be too bad.
But getting good parameter values can be a challenge if the original
boot sector values are not saved. At the same time all I really need
to preserve reliably is the memory size. The kernel seems to work
o.k. without dummy values for everything else.
I guess those will be my next two steps, a debugging checksum, and
the option of entering the kernel in 32bit mode. I already have a
rough memory map that I check the kernel against to make certain it is
being loaded to a valid memory location anyway...
Eric
Rob Landley <[email protected]> writes:
> On Tuesday 22 October 2002 03:33, Eric W. Biederman wrote:
>
> > j < Printed from the second callback in setup.S, just before the
> > kernel decompresser is run >
> >
> >
> > I have a very strange node that makes it all of the way to 'j' before
> > rebooting. The concept that something is dying in protected mode will all
> > of the interrupts disabled is so novel that I really don't know what to
> > make of it, yet.
>
> It would almost have to be the MMU. Any way to dump the page tables?
I don't know yet. I need to find a way to install some additional hooks
at run time so I can narrow down where the failure is occurring. I
will have to look, but I should be able to set up an interrupt
descriptor table and single step through the code.
What has me very puzzled is that I can boot that same kernel if I run
a substitute for setup.S that makes a similar set of BIOS calls.
The kernel I am having problems with is an old 2.4.17 kernel. But
more than anything my goal is to make the boot process debuggable
without requiring a recompile. So I can ask users to throw a switch
and I can find out where things are failing.
I may be able to recompile that 2.4.17 kernel and then edit the build
so I can get more debug information. But it is my preference to
attempt to track down what is happening without doing that.
Especially as it is more useful if that does not happen.
The decompressor in misc.c runs without a page table enabled, so it may
be something before the page tables are enabled.
Eric
[email protected] (Eric W. Biederman) writes:
> Rob Landley <[email protected]> writes:
>
> > On Tuesday 22 October 2002 03:33, Eric W. Biederman wrote:
> >
> > > j < Printed from the second callback in setup.S, just before the
> > > kernel decompresser is run >
> > >
> > >
> > > I have a very strange node that makes it all of the way to 'j' before
> > > rebooting. The concept that something is dying in protected mode will all
> > > of the interrupts disabled is so novel that I really don't know what to
> > > make of it, yet.
> >
> > It would almost have to be the MMU. Any way to dump the page tables?
>
> I don't know yet. I need to find a way to install some additional hooks
> at run time so I can narrow down where the failure is occuring. I
> will have to look, but I should be able to set up an interrupt
> descriptor table and single step through the code.
In the process of setting up hooks, I have run across a very interesting
data point. If I load %ds, %es, %ss in my hook the problem goes away.
But I must load all 3.
Given that the code sequence that is executed if my hook is not run is:
cld
cli
movl $(__KERNEL_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
lss stack_start,%esp
I am rather confused. I am not changing the gdt or anything like that so it
appears I may have found a way to tickle a processor errata.
Anyway Andy if you have a second please try kexec-tools 1.3 and see what
happens when you pass it the debug option. I am really curious if your lockup
is anywhere near mine. I doubt it as I am running on a P4. But it appears
you never know what the problems will look like until you test them.
Eric
[email protected] (Eric W. Biederman) wrote:
> In the process of setting up hooks, I have run across a very interesting
> data point. If I load %ds, %es, %ss in my hook the problem goes away.
> But I must load all 3.
>
> Given that the code sequence that is executed if my hook is not run is:
>
> cld
> cli
> movl $(__KERNEL_DS),%eax
> movl %eax,%ds
> movl %eax,%es
> movl %eax,%fs
> movl %eax,%gs
>
> lss stack_start,%esp
>
> I am rather confused. I am not changing the gdt or anything like that so it
> appears I may have found a way to tickle a processor errata.
I kind of doubt you found an errata... the mode switch combinations in most
of the modern x86-variants have been tested pretty exhaustively because
people use so many variations on it.
Let's see:
%ds and %es are implicit operands for the source and destination of a
MOVS operation, so if you or the Linux kernel performs a MOVS copy
before that point, that is likely the problem there.
The requirement of %ss is a bit more puzzling, but are you 100% sure
you don't reference the stack anywhere? Else it may blow up.
For example, the start sequence calls "cli", but do you have interrupts
disabled before that point? Maybe you have a stray interrupt catching
you there...
I had to deal with these problems, and had exactly something like the
last case, in my early work on the GRUB bootloader.
--
Erich Stefan Boleyn <[email protected]> http://www.uruk.org/
"Reality is truly stranger than fiction; Probably why fiction is so popular"
Well, of course, %ds is the implicit source/dest of all but a few memory
referencing ops, so not loading that is bound to lead to trouble in most
cases...
--
Erich Stefan Boleyn <[email protected]> http://www.uruk.org/
"Reality is truly stranger than fiction; Probably why fiction is so popular"
On Mon, 2002-10-21 at 21:20, Eric W. Biederman wrote:
> Have you tried booting that 2.4.18 kernel with kexec.
> My other question is does running mkelfImage-1.17 on your kernel
> before you boot it help?
> ftp://ftp.lnxi.com/pub/src/mkelfImage/mkelfImage-1.17.tar.gz
Summary report of 12 combinations:
                kexec 1.2   kexec 1.2         kexec 1.3   kexec 1.3
Test            raw         +mkelf            raw         +mkelf
--------------  ----------  ----------------  ----------  ----------------
kexec_test      runs        reset             runs        reset
linux-2.5.44    hang        boot,scsi panic   hang        boot,scsi panic
linux-2.4.18    hang        boot,bad eep100   hang        boot,bad eep100
Key:
linux-2.4.18 == 2.4.18 + CGL
linux-2.5.44 == 2.5.44 + kexec
kexec 1.2 == kexec-tools-1.2
kexec 1.3 == kexec-tools-1.3
kexec_test 1.2 was used for the 1.2 column,
kexec_test 1.3 was used for the 1.3 column.
Notes:
1) kexec_test 1.[23] "runs to completion" when not
massaged by mkelfImage.
2) kexec_test 1.[23] "resets" the system (firmware
starts like cold boot) when massaged by mkelfImage.
3) all linux kernels tested on this system (same IBM
eServer as yesterday) "hang" when started with
kexec 1.[23] *unless* massaged by mkelfImage.
4) linux-2.5.44 + mkelfImage panic in the SCSI driver
during reboot with kexec 1.[23]. <see below>
5) linux-2.4.18 + CGL + mkelfImage + "root=805" came -->this<--
close to booting all the way, but it looks like the
ethernet driver can't probe the card in whatever state
it is left in... <see below> ...and there's that little
blurb about "Verify that the card is a bus-master capable slot."
Hmmm... It's got a ServerWorks chipset...
6) no visible differences in success/failure between
kexec-tools-1.2 and kexec-tools-1.3.
Wild Speculation:
Aside from driver issues, it would appear that whatever it is
that mkelfImage does, it is better to use it for this specific
system. It would also appear that there may be several
lingering device shutdown/reprobe problems.
SCSI panic from a kexec'ed linux-2.5.44:
<snip>
ide: Assuming33MHz system bus speed for PIO modes; override with idebus=xx
hda: probing with STATUS(0x50) instead of ALTSTATUS(0x80)
hda: probing with STATUS(0x58) instead of ALTSTATUS(0x80)
SCSI subsystem driver Revision: 1.00
PCI: Enabling device 01:03.0 (0156 -> 0157)
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSIHBA DRIVER, Rev 6.2.4
<Adaptec aic7892 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
scsi0:0:0:0: Attempting to queue an ABORT message
scsi0: Dumping Card State in Message-in phase, at SEQADDR 0x168
ACCUM = 0x1, SINDEX = 0x61, DINDEX = 0x65, ARG_2 = 0xff
HCNT = 0x0
SCSISEQ = 0x12, SBLKCTL = 0xa
DFCNTRL = 0x0, DFSTATUS = 0x89
LASTPHASE = 0xe0, SCSISIGI = 0xe6, SXFRCTL0 = 0x88
SSTAT0 = 0x2, SSTAT1 = 0x1
SCSIPHASE = 0x8
STACK == 0x175, 0x0, 0x0, 0xe7
SCB count = 4
Kernel NEXTQSCB = 2
Card NEXTQSCB = 2
QINFIFO entries:
Waiting Queue entries:
Disconnected Queue entries:
QOUTFIFO entries:
Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Pending list: 3
Kernel Free SCB list: 1 0
Untagged Q(0): 3
DevQ(0:0:0): 0 waiting
scsi0:0:0:0: Device is active, asserting ATN
Recovery code sleeping
Kernel panic: Attempted to kill the idle task!
In idle task - not syncing
ethernet driver diagnostics from kexec'ed linux-2.4.18:
<snip>
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100l
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@ss
PCI: Enabling device 00:09.0 (0000 -> 0003)
PCI: Setting latency timer of device 00:09.0 to 64
eth0: Invalid EEPROM checksum 0x7f00, check settings before activating this dev!
eth0: OEM i82557/i82558 10/100 Ethernet, FF:FF:FF:FF:FF:FF, IRQ 16.
Board assembly ffffff-255, Physical connectors present: RJ45 BNC AUI MII
Primary interface chip unknown-15 PHY #31.
Secondary interface chip i82555.
Self test failed, status ffffffff:
Failure to initialize the i82557.
Verify that the card is a bus-master capable slot.
<snip>
Setting up network interfaces:
lo done
eth0 eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
(DHCP) eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
. . . eepro100: wait_for_cmd_done timeout!
eepro100: wait_for_cmd_done timeout!
invalid operand: 0000
CPU: 0
EIP: 0010:[<c0116014>] Not tainted
EFLAGS: 00010213
eax: 00000000 ebx: e75c6000 ecx: e8806000 edx: ffffffff
esi: e666c7e0 edi: c1a6b800 ebp: e75c7e14 esp: e75c7df8
ds: 0018 es: 0018 ss: 0018
Process dhcpcd (pid: 359, stackpage=e75c7000)
Stack: e75c6000 e666c7e0 c1a6b800 c1a6b800 00000100 e8806002 00000282 e75c7e20
c0116406 e75c6000 e75c7e38 c02708b9 e75c6000 c1a6b800 00000000 e763ca00
e75c7e58 c02689fd c1a6b800 e763ca00 c1a6b800 0000024e e75c7e80 c02a5403
Call Trace: [<c0116406>] [<c02708b9>] [<c02689fd>] [<c02a5403>] [<c02a5439>]
[<c02620fd>] [<c0263054>] [<c013a817>] [<c013a839>] [<c015112a>] [<c0151450> [<c0151490>] [<c01518f4>] [<c026386f>] [<c01070db>]
Code: 0f 0b b8 00 e0 ff ff 21 e0 ff 40 04 89 45 fc 69 40 20 a0 09
Entering kdb (current=0xe75c6000, pid 359) on processor 0 Oops: invalid operand
due to oops @ 0xc0116014
eax = 0x00000000 ebx = 0xe75c6000 ecx = 0xe8806000 edx = 0xffffffff
esi = 0xe666c7e0 edi = 0xc1a6b800 esp = 0xe75c7df8 eip = 0xc0116014
ebp = 0xe75c7e14 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010213
xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xe75c7dc4
[0]kdb> reboot
lspci output for the system:
00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
01:03.0 SCSI storage controller: Adaptec AIC-7892P U160/m (rev 02)
Regards,
Andy
On Tue, 2002-10-22 at 01:33, Eric W. Biederman wrote:
> Ok as promised kexec-tools-1.3.tar.gz is released.
>
> The new test case it provides is
> kexec -debug bzImage
>
> The serial console must be initialized before using this.
>
> [root@p4dp8-0 root]# kexec -debug bzImage-2.4.17.eb-amd768-eepro100-kexec-apic-lb-mtd2 ip=dhcp root=/dev/nfs console=tty0 console=ttyS0,9600 reboot=hard panic=5 ide0=ata66 verbose
> setup16_end: 00091ac4
> Shutting down devices
> kexecing image
> a
> b
> c
> d
> e
> f
> g
> h
> < All above are various points in x86-setup-16.S >
> i < Printed from the first callback in setup.S, before protected mode is entered >
> j < Printed from the second callback in setup.S, just before the kernel decompresser is run >
joe:/boot # ./kexec-1.3 -debug linux-2.5 console=ttyS0,9600 reboot=hard
verbose
setup16_end: 00091a94
kexecing image
a
b
c
d
e
f
g
h
Wedged.
Andy
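The progress letters printed by "kexec -debug" can be read mechanically. The following is a minimal sketch in C, assuming only the annotations quoted above: 'a' through 'h' come from x86-setup-16.S, 'i' from the first callback in setup.S, and 'j' from the second. The function names and the stage label strings are hypothetical and are not part of kexec-tools.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map a progress letter to the stage that printed it.
 * Labels are paraphrased from the annotations in the thread. */
static const char *stage_for_marker(char marker)
{
    if (marker >= 'a' && marker <= 'h')
        return "x86-setup-16.S";
    if (marker == 'i')
        return "setup.S first callback (before protected mode)";
    if (marker == 'j')
        return "setup.S second callback (before decompressor)";
    return "unknown";
}

/* Return the last progress letter found in captured console output,
 * or '\0' if there is none. Only lines holding a single letter in
 * the range 'a'..'j' are treated as markers. */
static char last_marker(const char *console)
{
    char last = '\0';
    const char *line = console;

    assert(console != NULL);
    while (*line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        if (len == 1 && line[0] >= 'a' && line[0] <= 'j')
            last = line[0];
        if (!end)
            break;
        line = end + 1;
    }
    return last;
}
```

Applied to the capture above, the last marker is 'h', so the hang falls after the last checkpoint in x86-setup-16.S and before the first callback in setup.S.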
On Tue, 2002-10-22 at 16:27, Andy Pfiffer wrote:
> On Tue, 2002-10-22 at 01:33, Eric W. Biederman wrote:
> > Ok as promised kexec-tools-1.3.tar.gz is released.
> >
> > The new test case it provides is
> > kexec -debug bzImage
> >
> > The serial console must be initialized before using this.
> >
> > [root@p4dp8-0 root]# kexec -debug bzImage-2.4.17.eb-amd768-eepro100-kexec-apic-lb-mtd2 ip=dhcp root=/dev/nfs console=tty0 console=ttyS0,9600 reboot=hard panic=5 ide0=ata66 verbose
> > setup16_end: 00091ac4
> > Shutting down devices
> > kexecing image
> > a
> > b
> > c
> > d
> > e
> > f
> > g
> > h
> > < All above are various points in x86-setup-16.S >
> > i < Printed from the first callback in setup.S, before protected mode is entered >
> > j < Printed from the second callback in setup.S, just before the kernel decompresser is run >
>
>
> joe:/boot # ./kexec-1.3 -debug linux-2.5 console=ttyS0,9600 reboot=hard
> verbose
> setup16_end: 00091a94
> kexecing image
> a
> b
> c
> d
> e
> f
> g
> h
>
> Wedged.
Same results for the 2.4.18-based kernel:
joe:/boot # ./kexec-1.3 -debug linux-cgle console=ttyS0,9600 reboot=hard
setup16_end: 00091084
kexecing image
a
b
c
d
e
f
g
h
Wedged.
Andy
[email protected] writes:
> [email protected] (Eric W. Biederman) wrote:
>
> > In the process of setting up hooks, I have run across a very interesting
> > data point. If I load %ds, %es, %ss in my hook the problem goes away.
> > But I must load all 3.
> >
> > Given that the code sequence that is executed if my hook is not run is:
> >
> > cld
> > cli
> > movl $(__KERNEL_DS),%eax
> > movl %eax,%ds
> > movl %eax,%es
> > movl %eax,%fs
> > movl %eax,%gs
> >
> > lss stack_start,%esp
> >
> > I am rather confused. I am not changing the gdt or anything like that so it
> > appears I may have found a way to tickle a processor errata.
>
> I kind of doubt you found an errata...
Me too but the number of remaining possibilities is quite small.
> the mode switch combinations in most
> of the modern x86-variants has been tested pretty exhaustively because
> people use so many variations on it.
>
> Let's see:
>
> %ds and %es are implicit operands for the source and destination of a
> MOVS operation, so if you or the Linux kernel performs a MOVS copy
> before that point, that is likely the problem there.
Nope. In fact on another 2.4.17 kernel built with slightly different
options the code works.
> The requirement of %ss is a bit more puzzling, but are you 100% sure
> you don't reference the stack anywhere? Else it may blow up.
Absolutely.
> For example, the start sequence calls "cli", but do you have interrupts
> disabled before that point? Maybe you have a stray interrupt catching
> you there...
Yep. In fact last I checked I had interrupts disabled at the interrupt
controller as well, but that may not be a certainty. But it doesn't matter
as I also have nmi disabled at that point.
> I had to deal with these problems, and had exactly something like the
> last case, in my early work on the GRUB bootloader.
I will certainly take any help people can give. But I am tickling some
very weird things in there.
Eric
Andy Pfiffer <[email protected]> writes:
> On Mon, 2002-10-21 at 21:20, Eric W. Biederman wrote:
>
> > Have you tried booting that 2.4.18 kernel with kexec.
>
> > My other question is does running mkelfImage-1.17 on your kernel
> > before you boot it help?
> > ftp://ftp.lnxi.com/pub/src/mkelfImage/mkelfImage-1.17.tar.gz
Ok thanks, these were good data points. My tools are getting solid
enough that I can actually debug these problems. Wow.
> Notes:
> 1) kexec_test 1.[23] "runs to completion" when not
> massaged by mkelfImage.
>
> 2) kexec_test 1.[23] "resets" the system (firmware
> starts like cold boot) when massaged by mkelfImage.
Not a problem. mkelfImage should really be called mkelfImage-linux.
It is not useful for anything else. When it was fed kexec_test it
assumed it was being fed vmlinux and did the wrong thing, and
attempted to load it at 1MB.
> 3) all linux kernels tested on this system (same IBM
> eServer as yesterday) "hang" when started with
> kexec 1.[23] *unless* massaged by mkelfImage.
>
> 4) linux-2.5.44 + mkelfImage panic in the SCSI driver
> during reboot with kexec 1.[23]. <see below>
>
> 5) linux-2.4.18 + CGL + mkelfImage + "root=805" came -->this<--
> close to booting all the way, but it looks like the
> ethernet driver can't probe the card in whatever state
> it is left in... <see below> ...and there's that little
> blurb about "Verify that the card is a bus-master capable slot."
> Hmmm... It's got a ServerWorks chipset...
It is a PCI slot; it had better be bus-master capable...
> 6) no visible differences in success/failure between
> kexec-tools-1.2 and kexec-tools-1.3.
The only substantive difference was the addition of a debug mode
that tells you it is making a little progress before the machine hangs.
>
> Wild Speculation:
> Aside from driver issues, it would appear that whatever it is
> that mkelfImage does, it is better to use it for this specific
> system. It would also appear that there may be several
> lingering device shutdown/reprobe problems.
Ok. I came to suggest mkelfImage for some invalid reasons: my weird
2.4.17 problem kernel has a weird development patch, and the bzImage is
just plain unbootable.
What mkelfImage does is make fewer BIOS calls, roughly the
same set of BIOS calls as are in kexec_test of 1.3. The kernel's
setup.S makes more BIOS calls, and it looks like one of those extra
BIOS calls is hanging your system.
Can you try kexec_test-1.4 in:
http://www.xmission.com/~ebiederm/files/kexec/kexec-utils-1.4.tar.gz
I have added a bunch more BIOS calls and unless I have missed the
important one we should be able to nail down which BIOS call is
crashing/hanging your system.
I need to nail this down but it appears with a little care I can
count on making BIOS calls to get the kernel parameters after a kexec.
That is the ideal case. Now if I could only figure out what is needed
to make all of the BIOS calls actually work....
I will be skimming the driver issues, but mostly I think they should have
an appropriate ->shutdown/reboot_notifier method implemented and the driver
should work.
> SCSI panic from a kexec'ed linux-2.5.44:
> <snip>
> ide: Assuming33MHz system bus speed for PIO modes; override with idebus=xx
> hda: probing with STATUS(0x50) instead of ALTSTATUS(0x80)
> hda: probing with STATUS(0x58) instead of ALTSTATUS(0x80)
> SCSI subsystem driver Revision: 1.00
> PCI: Enabling device 01:03.0 (0156 -> 0157)
> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSIHBA DRIVER, Rev 6.2.4
> <Adaptec aic7892 Ultra160 SCSI adapter>
> aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
>
> scsi0:0:0:0: Attempting to queue an ABORT message
> scsi0: Dumping Card State in Message-in phase, at SEQADDR 0x168
> ACCUM = 0x1, SINDEX = 0x61, DINDEX = 0x65, ARG_2 = 0xff
> HCNT = 0x0
> SCSISEQ = 0x12, SBLKCTL = 0xa
> DFCNTRL = 0x0, DFSTATUS = 0x89
> LASTPHASE = 0xe0, SCSISIGI = 0xe6, SXFRCTL0 = 0x88
> SSTAT0 = 0x2, SSTAT1 = 0x1
> SCSIPHASE = 0x8
> STACK == 0x175, 0x0, 0x0, 0xe7
> SCB count = 4
> Kernel NEXTQSCB = 2
> Card NEXTQSCB = 2
> QINFIFO entries:
> Waiting Queue entries:
> Disconnected Queue entries:
> QOUTFIFO entries:
> Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
> Pending list: 3
>
> Kernel Free SCB list: 1 0
> Untagged Q(0): 3
> DevQ(0:0:0): 0 waiting
> scsi0:0:0:0: Device is active, asserting ATN
> Recovery code sleeping
> Kernel panic: Attempted to kill the idle task!
> In idle task - not syncing
>
There is nothing in the SCSI crash that I recognize at all :(
Given that there have been recent scsi problems I don't know.
> ethernet driver diagnostics from kexec'ed linux-2.4.18:
> <snip>
> eepro100.c:v1.09j-t 9/29/99 Donald Becker
> http://www.scyld.com/network/eepro100leepro100.c: $Revision: 1.36 $ 2000/11/17
> Modified by Andrey V. Savochkin <saw@ssPCI: Enabling device 00:09.0 (0000 ->
> 0003)
>
> PCI: Setting latency timer of device 00:09.0 to 64
> eth0: Invalid EEPROM checksum 0x7f00, check settings before activating this
^^^^^ Here we go
> dev!eth0: OEM i82557/i82558 10/100 Ethernet, FF:FF:FF:FF:FF:FF, IRQ 16.
>
> Board assembly ffffff-255, Physical connectors present: RJ45 BNC AUI MII
> Primary interface chip unknown-15 PHY #31.
> Secondary interface chip i82555.
> Self test failed, status ffffffff:
^^^^^ Or possibly here.
Either the nic eeprom was left in a bad state or something weird happened
with the eeprom.
> Failure to initialize the i82557.
> Verify that the card is a bus-master capable slot.
> <snip>
> Setting up network interfaces:
> lo done
> eth0 eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
^^^^ If you had only gotten these I would assume your eepro100 had just chosen
this reboot to freak out.
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> (DHCP) eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> . . . eepro100: wait_for_cmd_done timeout!
> eepro100: wait_for_cmd_done timeout!
> invalid operand: 0000
> CPU: 0
> EIP: 0010:[<c0116014>] Not tainted
> EFLAGS: 00010213
> eax: 00000000 ebx: e75c6000 ecx: e8806000 edx: ffffffff
> esi: e666c7e0 edi: c1a6b800 ebp: e75c7e14 esp: e75c7df8
> ds: 0018 es: 0018 ss: 0018
> Process dhcpcd (pid: 359, stackpage=e75c7000)
> Stack: e75c6000 e666c7e0 c1a6b800 c1a6b800 00000100 e8806002 00000282 e75c7e20
> c0116406 e75c6000 e75c7e38 c02708b9 e75c6000 c1a6b800 00000000 e763ca00
> e75c7e58 c02689fd c1a6b800 e763ca00 c1a6b800 0000024e e75c7e80 c02a5403
> Call Trace: [<c0116406>] [<c02708b9>] [<c02689fd>] [<c02a5403>] [<c02a5439>]
> [<c02620fd>] [<c0263054>] [<c013a817>] [<c013a839>] [<c015112a>] [<c0151450>
> [<c0151490>] [<c01518f4>] [<c026386f>] [<c01070db>]
>
>
> Code: 0f 0b b8 00 e0 ff ff 21 e0 ff 40 04 89 45 fc 69 40 20 a0 09
>
> Entering kdb (current=0xe75c6000, pid 359) on processor 0 Oops: invalid operand
> due to oops @ 0xc0116014
> eax = 0x00000000 ebx = 0xe75c6000 ecx = 0xe8806000 edx = 0xffffffff
> esi = 0xe666c7e0 edi = 0xc1a6b800 esp = 0xe75c7df8 eip = 0xc0116014
> ebp = 0xe75c7e14 xss = 0x00000018 xcs = 0x00000010 eflags = 0x00010213
> xds = 0x00000018 xes = 0x00000018 origeax = 0xffffffff &regs = 0xe75c7dc4
> [0]kdb> reboot
>
> lspci output for the system:
> 00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
> 00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
> 00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
> 00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
> 00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
> 00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
> 00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
> 01:03.0 SCSI storage controller: Adaptec AIC-7892P U160/m (rev 02)
I would love for the drivers to have reasonable shutdown routines
so this code could just be used. But as that does not currently exist
in 2.5.44 I have some possibly productive suggestions.
1) Down all network interfaces before kexec.
2) Build everything as modules and remove the modules before kexec.
3) Look at the remove methods and see if you can adapt them into shutdown methods.
And please tell me what kexec_test-1.4 reports. I would love to find out which
BIOS calls are hanging your system.
In kexec_test/kexec_test16.S the calls to all of the BIOS tests are listed as below.
If one of those keeps kexec from completing, could you please comment
out print calls until kexec_test runs to completion? This will give a list
of the bad BIOS calls on your system.
/* Here we test various BIOS calls to determine how much of the system is working */
call get_meme820
call print_meme820
call print_meme801
call print_mem88
call disable_apm
call print_dasd_type
call print_equipment_list
call print_sysdesc
call print_edd
call print_video
call print_cursor
call print_video_mode
call set_auto_repeat_rate
The next step is to see if I can integrate a safer set of BIOS calls into the kernel.
Or at the very least integrate the mkelfImage functionality into kexec...
Eric
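The elimination procedure requested above (disable the call that hangs, rerun, repeat until kexec_test completes) can be sketched as a small loop. This is only an illustration in C under stated assumptions: the hangs() callback is a hypothetical stand-in for an actual boot attempt on the machine, and the simulated bad call in demo_hangs() is chosen purely for the demonstration. The call names are the ones listed from kexec_test16.S.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NCALLS 13

/* Test routines in the order listed from kexec_test16.S. */
static const char *const bios_calls[NCALLS] = {
    "get_meme820", "print_meme820", "print_meme801", "print_mem88",
    "disable_apm", "print_dasd_type", "print_equipment_list",
    "print_sysdesc", "print_edd", "print_video", "print_cursor",
    "print_video_mode", "set_auto_repeat_rate",
};

/* Hypothetical stand-in for a real boot attempt: nonzero if the
 * named call hangs the machine. */
typedef int (*hang_fn)(const char *name);

/* One run of the test image with some calls commented out.
 * Returns the index of the first enabled call that hangs,
 * or -1 if the run completes. */
static int run_once(const int *disabled, hang_fn hangs)
{
    for (int i = 0; i < NCALLS; i++)
        if (!disabled[i] && hangs(bios_calls[i]))
            return i;
    return -1;
}

/* Disable whichever call hung and rerun until a run completes.
 * On return disabled[i] is 1 for every bad call; the return value
 * is the number of bad calls found. */
static int find_bad_calls(int *disabled, hang_fn hangs)
{
    int bad = 0;
    int i;

    memset(disabled, 0, NCALLS * sizeof(*disabled));
    while ((i = run_once(disabled, hangs)) >= 0) {
        disabled[i] = 1;
        bad++;
    }
    return bad;
}

/* Simulated machine on which exactly one call hangs (illustrative). */
static int demo_hangs(const char *name)
{
    return strcmp(name, "print_dasd_type") == 0;
}
```

Each iteration corresponds to one reboot on real hardware, so the number of reboots needed is one more than the number of bad calls.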
> > lspci output for the system:
> > 00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
> > 00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
> > 00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
> > 00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08)
> > 00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
> > 00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
> > 00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
> > 01:03.0 SCSI storage controller: Adaptec AIC-7892P U160/m (rev 02)
> And please tell me what kexec_test-1.4 reports. I would love to find out which
> BIOS calls are hanging your system.
It's this one: call print_dasd_type
If I comment it out, kexec_test-1.4 runs to completion.
FYI: My installation is on a scsi disk. I'm beginning to wonder if
there is something funky with the BIOS not being able to talk to
the SCSI controller after the kernel has used it... Hmmm.
Full output:
Run 1:
# ./kexec-1.4 -debug kexec_test-1.4
kexecing image
kexec_test 1.4 starting...
eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 00000000 C0000000
Switching descriptors.
Descriptors changed.
In real mode.
Interrupts enabled.
Base memory size: 0277
Can not A20 line.
E820 Memory Map.
000000000009DC00 @ 0000000000000000 type: 00000001
0000000000002400 @ 000000000009DC00 type: 00000002
0000000000020000 @ 00000000000E0000 type: 00000002
0000000027EED140 @ 0000000000100000 type: 00000001
0000000000010000 @ 0000000027FF0000 type: 00000002
0000000000002EC0 @ 0000000027FED140 type: 00000003
0000000001400000 @ 00000000FEC00000 type: 00000002
E801 Memory size: 0009F800
Mem88 Memory size: FFFF
Testing for APM.
APM test done.
DASD type:
<Wedged>
Run 2:
# ./kexec-1.4 -debug kexec_test-1.4
kexecing image
kexec_test 1.4 starting...
eax: 0E1FB007 ebx: 00001078 ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 00000000 C0000000
Switching descriptors.
Descriptors changed.
In real mode.
Interrupts enabled.
Base memory size: 0277
Can not A20 line.
E820 Memory Map.
000000000009DC00 @ 0000000000000000 type: 00000001
0000000000002400 @ 000000000009DC00 type: 00000002
0000000000020000 @ 00000000000E0000 type: 00000002
0000000027EED140 @ 0000000000100000 type: 00000001
0000000000010000 @ 0000000027FF0000 type: 00000002
0000000000002EC0 @ 0000000027FED140 type: 00000003
0000000001400000 @ 00000000FEC00000 type: 00000002
E801 Memory size: 0009F800
Mem88 Memory size: FFFF
Testing for APM.
APM test done.
Equiptment list: 4427
Sysdesc: F000:E6F5
EDD: ok
Video type: VGA
Cursor Position(Row,Column): 0012 0000
Video Mode: 0003
Setting auto repeat rate done
A20 enabled
Interrupts disabled.
In protected mode.
Halting.
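For readers checking the E820 output above, here is a minimal sketch in C that totals the map by region type. It assumes each line reads "size @ base type", which is consistent with adjacent regions abutting (0x9DC00 + 0x2400 = 0xA0000), and it uses the conventional E820 type codes (1 usable RAM, 2 reserved, 3 ACPI reclaimable). The struct and function names are hypothetical and are not taken from kexec_test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Conventional E820 region type codes. */
enum { E820_RAM = 1, E820_RESERVED = 2, E820_ACPI = 3 };

struct e820_entry {
    uint64_t size;
    uint64_t base;
    uint32_t type;
};

/* Entries copied from the kexec_test output ("size @ base type"). */
static const struct e820_entry e820_map[] = {
    { 0x000000000009DC00ULL, 0x0000000000000000ULL, E820_RAM      },
    { 0x0000000000002400ULL, 0x000000000009DC00ULL, E820_RESERVED },
    { 0x0000000000020000ULL, 0x00000000000E0000ULL, E820_RESERVED },
    { 0x0000000027EED140ULL, 0x0000000000100000ULL, E820_RAM      },
    { 0x0000000000010000ULL, 0x0000000027FF0000ULL, E820_RESERVED },
    { 0x0000000000002EC0ULL, 0x0000000027FED140ULL, E820_ACPI     },
    { 0x0000000001400000ULL, 0x00000000FEC00000ULL, E820_RESERVED },
};

#define E820_MAP_LEN (sizeof(e820_map) / sizeof(e820_map[0]))

/* Sum the sizes of all regions of the given type. */
static uint64_t e820_total(const struct e820_entry *map, size_t n,
                           uint32_t type)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        if (map[i].type == type)
            total += map[i].size;
    return total;
}
```

With these entries the usable RAM total is 0x27F8AD40 bytes, roughly 639.5 MiB.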
Andy Pfiffer <[email protected]> writes:
> > > lspci output for the system:
> > > 00:00.0 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
> > > 00:00.1 Host bridge: ServerWorks CNB20LE Host Bridge (rev 06)
> > > 00:01.0 VGA compatible controller: S3 Inc. Savage 4 (rev 04)
> > > 00:09.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev
> 08)
>
> > > 00:0f.0 ISA bridge: ServerWorks OSB4 South Bridge (rev 50)
> > > 00:0f.1 IDE interface: ServerWorks OSB4 IDE Controller
> > > 00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 04)
> > > 01:03.0 SCSI storage controller: Adaptec AIC-7892P U160/m (rev 02)
>
>
> > And please tell me what kexec_test-1.4 reports. I would love to find out which
>
> > BIOS calls are hanging your system.
>
> It's this one: call print_dasd_type
Cool, thanks.
This is both good, and bad.
The Good: This BIOS call is only used to populate disk_info to get the
disk geometry, which is only used in hd.c. Further, this call
rests on a broken assumption that both the first and the
second BIOS disks are IDE, so it should be safe to remove
the call from setup.S. And there is another filter in
that makes it work.
The Bad: That would take patching the kernel, so on some machines
older kernels will not work...
I have not set up the x86 pic, so it may be worth setting that up and
testing to see if that helps.
> If I comment it out, kexec_test-1.4 runs to completion.
>
> FYI: My installation is on a scsi disk. I'm beginning to wonder if
> there is something funky with the BIOS not being able to talk to
> the SCSI controller after the kernel has used it... Hmmm.
I would not be surprised, if that is part of the problem.
Did you have to edit the aic7xxx reboot notifier to get scsi reboots
to work? The reboot notifier should be what sets the controller up
so it can be reinitialized by Linux later.
I am attempting to figure out where to go with the user space side to
get a useable, and useful ability to boot new kernels.
Pieces:
1) If loaded from loadlin or a sufficiently buggy BIOS is present
we need to skip the 16bit BIOS calls already.
2) It is ideal to requery the system BIOS in case there are enhancements
like the EDD work that make the new kernel more useable.
So my strategy will be:
1) As much as is reasonable fix setup.S to work in strange hostile environments,
there is a lot to be gained and currently it usually works as is.
2) Query the existing kernel infrastructure and as much as possible
fill in the table of data the new kernel wants. And I will skip to the 32bit entry point,
with that information.
The second will take a bit more work, but having it as an option looks like
a very healthy thing to have.
Eric
"Eric W. Biederman" wrote:
>
> +static void i8259A_remove(struct device *dev)
> +{
> + /* Restore the i8259A to it's legacy dos setup.
> + * The kernel won't be using it any more, and it
> + * just might make reboots, and kexec type applications
> + * more stable.
> + */
> + outb(0xff, 0x21); /* mask all of 8259A-1 */
> + outb(0xff, 0xA1); /* mask all of 8259A-1 */
> +
> + outb_p(0x11, 0x20); /* ICW1: select 8259A-1 init */
> + outb_p(0x08, 0x21); /* ICW2: 8259A-1 IR0-7 mappend to 0x8-0xf */
> + outb_p(0x01, 0x21); /* Normal 8086 auto EOI mode */
> +
> + outb_p(0x11, 0xA0); /* ICW1: select 8259A-2 init */
> + outb_p(0x08, 0xA1); /* ICW2: 8259A-2 IR0-7 mappend to 0x70-0x77 */
^^^^ ^^^^
This looks wrong to me.
--
Kasper Dupont -- der bruger for meget tid på usenet.
For sending spam use mailto:[email protected]
Don't do this at home kids: touch -- -rf
Kasper Dupont <[email protected]> writes:
> "Eric W. Biederman" wrote:
> >
> > +static void i8259A_remove(struct device *dev)
> > +{
> > + /* Restore the i8259A to it's legacy dos setup.
> > + * The kernel won't be using it any more, and it
> > + * just might make reboots, and kexec type applications
> > + * more stable.
> > + */
> > + outb(0xff, 0x21); /* mask all of 8259A-1 */
> > + outb(0xff, 0xA1); /* mask all of 8259A-1 */
> > +
> > + outb_p(0x11, 0x20); /* ICW1: select 8259A-1 init */
> > + outb_p(0x08, 0x21); /* ICW2: 8259A-1 IR0-7 mappend to 0x8-0xf */
> > + outb_p(0x01, 0x21); /* Normal 8086 auto EOI mode */
> > +
> > + outb_p(0x11, 0xA0); /* ICW1: select 8259A-2 init */
> > + outb_p(0x08, 0xA1); /* ICW2: 8259A-2 IR0-7 mappend to 0x70-0x77 */
>
> ^^^^ ^^^^
>
> This looks wrong to me.
Thanks that was a clear cut and paste bug.
I am in the process of moving the i8259A setup code into my kexec user
space program. I believe it is inappropriate to assume the interrupt
controller is going to be used by dos when it is shut down. So my
latest version (just published) simply masks all interrupts through
the pic.
The pic setup code I am in the process of moving into kexec-tools, and
since the pic is well known I will do this piece of setup work
there.
Another bug worth noting in this code, is that as of 2.5.44 only the
->shutdown methods are called on reboot and not the ->remove methods
so I actually hooked the wrong routine. Just in case someone else is
trying to hook that moving target.
Eric
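To make the cut and paste bug concrete, here is a sketch in C of the conventional PC/AT initialization sequence for the two 8259A controllers, written as a table of port writes so it can be checked without touching hardware. This is not the patch itself: the ICW3 and ICW4 entries are filled in from the standard PC/AT convention as an assumption, and the struct and function names are hypothetical. The point is only that the second controller's ICW2 carries the vector base 0x70, not 0x08.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct port_write {
    uint16_t port;
    uint8_t value;
    const char *what;
};

/* Conventional PC/AT (BIOS-style) programming of both 8259A PICs. */
static const struct port_write pic_legacy_init[] = {
    { 0x20, 0x11, "ICW1: 8259A-1 init, cascade mode, ICW4 needed" },
    { 0x21, 0x08, "ICW2: 8259A-1 IR0-7 mapped to 0x08-0x0f" },
    { 0x21, 0x04, "ICW3: 8259A-1 has a slave on IR2" },
    { 0x21, 0x01, "ICW4: 8259A-1 8086 mode, normal EOI" },
    { 0xA0, 0x11, "ICW1: 8259A-2 init, cascade mode, ICW4 needed" },
    { 0xA1, 0x70, "ICW2: 8259A-2 IR0-7 mapped to 0x70-0x77" },
    { 0xA1, 0x02, "ICW3: 8259A-2 is the slave on the master's IR2" },
    { 0xA1, 0x01, "ICW4: 8259A-2 8086 mode, normal EOI" },
};

#define PIC_INIT_LEN (sizeof(pic_legacy_init) / sizeof(pic_legacy_init[0]))

/* Return the vector base (ICW2) programmed for the PIC whose command
 * port is cmd_port: the first write to its data port (cmd_port + 1)
 * after an ICW1 (bit 4 set) on the command port. Returns -1 if the
 * sequence never initializes that PIC. */
static int vector_base(const struct port_write *seq, size_t n,
                       uint16_t cmd_port)
{
    for (size_t i = 0; i < n; i++) {
        if (seq[i].port != cmd_port || !(seq[i].value & 0x10))
            continue;
        for (size_t j = i + 1; j < n; j++)
            if (seq[j].port == cmd_port + 1)
                return seq[j].value;
    }
    return -1;
}
```

Running the same check against the sequence in the quoted patch would report 0x08 for both controllers, which is the overlap Kasper pointed out.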
"Eric W. Biederman" wrote:
>
> I believe it is inappropriate to assume the interrupt
> controller is going to be used by dos when it is shut down.
If Linux was booted by LILO or SYSLINUX, there will be no DOS
in memory. But the BIOS interrupt vector table and other data
structures are in the first physical memory page which is not
touched by Linux, so I'd expect the BIOS to be usable if we
can just leave the hardware in a usable state. Booting to DOS
from Linux might actually be possible.
OTOH if Linux was booted by LOADLIN, there will have been a
DOS in memory. DOS has changed interrupt vectors to point to
DOS's own code in segment 0x70, but that code will be outside
the first physical page and will thus have been overwritten
by Linux. In this case neither DOS nor BIOS routines can be
used reliably. Any INT instruction can potentially crash.
This led to problems with kmonte; does kexec have the same
problem?
--
Kasper Dupont -- der bruger for meget tid på usenet.
For sending spam use mailto:[email protected]
Don't do this at home kids: touch -- -rf
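The reasoning above depends on where real-mode structures sit in physical memory. As a rough sketch in C: the interrupt vector table has 256 four-byte entries starting at physical address 0, so the table itself lies inside the first 4 KiB page, while the handlers it points to are segment:offset addresses that may resolve anywhere in the first megabyte. The helper names below are hypothetical; the example pointer F000:E6F5 is the Sysdesc value printed by kexec_test earlier in the thread, used here only as a sample far pointer.

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_PAGE_SIZE 0x1000u  /* one 4 KiB page */
#define IVT_ENTRY_SIZE  4u       /* 16-bit offset, then 16-bit segment */

/* Real-mode address translation: linear = segment * 16 + offset. */
static uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* Physical address of the IVT slot for a given interrupt vector. */
static uint32_t ivt_entry_addr(uint8_t vector)
{
    return (uint32_t)vector * IVT_ENTRY_SIZE;
}

/* Nonzero if a linear address falls inside the first physical page. */
static int in_first_page(uint32_t linear)
{
    return linear < FIRST_PAGE_SIZE;
}
```

A vector whose handler resolves outside the preserved region is only safe to invoke if whatever lived at that address has survived, which is the concern raised for the LOADLIN case.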
Kasper Dupont <[email protected]> writes:
> "Eric W. Biederman" wrote:
> >
> > I believe it is inappropriate to assume the interrupt
> > controller is going to be used by dos when it is shut down.
>
> If Linux was booted by LILO or SYSLINUX, there will be no DOS
> in memory. But the BIOS interrupt vector table and other data
> structures are in the first physical memory page which is not
> touched by Linux, so I'd expect the BIOS to be usable if we
> can just leave the hardware in a usable state. Booting to DOS
> from Linux might actually be possible.
I agree. To rephrase I believe it is inappropriate to assume a
classic x86 PCBIOS is present on the machine.
All the shutdown routines should do is to place the hardware in a
quiescent state, that the linux driver init code can recover from.
Since arbitrary code can be loaded leaving this policy to the users of
the kexec system call is reasonable.
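A small userspace model in C may make the policy concrete. It is not kernel code: the model_device struct, its fields, and the device names in the usage note are all hypothetical. It only illustrates two points from this thread: a shutdown hook should leave the hardware quiescent, and a driver that implements only a remove hook is left active if the reboot path calls ->shutdown alone, as reported for 2.5.44.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of a device with optional driver hooks. */
struct model_device {
    const char *name;
    int active;                                  /* 1 while it may DMA or interrupt */
    void (*shutdown)(struct model_device *dev);  /* may be NULL */
    void (*remove)(struct model_device *dev);    /* may be NULL; never called below */
};

/* A hook that puts the modelled hardware into a quiescent state. */
static void quiesce(struct model_device *dev)
{
    dev->active = 0;
}

/* Model of a reboot path that calls only ->shutdown.
 * Returns the number of devices still active afterwards. */
static size_t shutdown_all(struct model_device *devs, size_t n)
{
    size_t still_active = 0;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].shutdown)
            devs[i].shutdown(&devs[i]);
        if (devs[i].active)
            still_active++;
    }
    return still_active;
}
```

For example, given one device with a shutdown hook and one with only a remove hook, shutdown_all() reports one device still active, which is the kind of leftover state a freshly loaded kernel then has to recover from.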
> OTOH if Linux was booted by LOADLIN, there will have been a
> DOS in memory. DOS has changed interrupt vectors to point to
> DOS own code in segment 0x70, but that code will be outside
> the first physical page and will thus have been overwritten
> by Linux. In this case neither DOS nor BIOS routines can be
> used reliable. Any INT instruction can potentially crash,
> this lead to problems with kmonte, does kexec have the same
> problem?
The system call does not. My user space component needs a bit more
work to bypass setup.S and do a reasonable job at it. I have all of
the code in a tested state except the code that queries the current
kernel for the BIOS setup information.
Getting the last pieces together so I can reliably boot a linux kernel
is the next step. My user space started out as a proof of concept and
is evolving into something useful from there. I refuse to use the
hack of copying the empty_zero_page from the old kernel to the new
kernel.
Eric