2005-01-19 08:06:56

by Eric W. Biederman

Subject: [PATCH 0/29] overview


Andrew, the following patchset is against 2.6.11-rc1-mm1 with
all of the kexec patches removed. The list of removed patches
is included below.

This patchset is a major refresh of the kexec on panic
functionality in the kernel. Its primary aim is to take the
requirements captured by the kernel crashdump patches and
start integrating that functionality cleanly into the kexec
patches.

Major accomplishments:
- Compat syscall support has been added.
- The crashdump capture code has been separated from the kexec on panic code.
- The kernel to jump to on panic is now loaded in place.
- A long-standing bug that allowed two source pages to copy data
to a single destination page has been caught and fixed.
- Support for loading an x86_64 kernel in a reserved region of memory has been completed.
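
The duplicate-destination bug in the list above is the kind of error a simple consistency check can catch. The following is only a minimal sketch, not the code from this patchset: struct page_move and has_duplicate_destination() are hypothetical names, and pages are identified by page frame number purely for illustration.

```c
#include <stddef.h>

/* Hypothetical record: copy one page from src_pfn to dst_pfn. */
struct page_move {
	unsigned long src_pfn;
	unsigned long dst_pfn;
};

/*
 * Return 1 if any two entries share a destination page, 0 otherwise.
 * Quadratic, which is fine for a sketch; a real check would more
 * likely sort the list or use a bitmap.
 */
static int has_duplicate_destination(const struct page_move *moves, size_t n)
{
	size_t i, j;

	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (moves[i].dst_pfn == moves[j].dst_pfn)
				return 1;
	return 0;
}
```

A loader could run such a check over the relocation list before accepting an image, rejecting any list in which two sources target the same page.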

The crashdump code is currently slightly broken. I have attempted to
minimize the breakage so things can quickly be made to work again.

With respect to a final design discussion there are two remaining
open issues. The first is how little hardware shutdown we can get away
with in the kernel that is panicking. I believe we can reduce this
to a simple NMI to the other cpus telling them to stop. This has
been raised as a major concern in previous conversations.

The second issue is the most significant with respect to the
design of a kernel based crash dump capture implementation. How does
the crashdump capture process discover relevant information about the
kernel that just crashed? There are two options.

1) As represented by the current crashdump patches, the crashdump
kernel and the kernel from which it is loaded are kept in sync, so that
it has up-to-date versions of all of the crashed kernel's data structures
because it is built from the same source. So it only needs to
find the address of the data structures it would like to look at.

2) The relevant information, if it is available when sys_kexec_load
is called, is exported to user space, or the machine_crash_shutdown
method marshals what little information must be captured when the
machine dies into a well-known standard format (most likely ELF
notes). This allows the crashdump capture process to simply pass
on the information or utilize it as appropriate.

If the second method can successfully represent all of the
interesting information then we can allow kernel version
skew between the two kernels, and potentially implement
the entire crash dump capture process in user space.
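
To make the ELF note idea in option 2 concrete, here is a rough sketch of packing one note into a buffer. It is an illustration under stated assumptions, not code from these patches: struct note_hdr, note_size() and store_note() are hypothetical names, the header is the standard three 32-bit word ELF note header, and both the name and the descriptor are padded to 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ELF note header: three 32-bit words, followed by name and descriptor. */
struct note_hdr {
	uint32_t n_namesz;	/* length of name, including the NUL */
	uint32_t n_descsz;	/* length of the descriptor payload */
	uint32_t n_type;	/* note type, e.g. NT_PRSTATUS */
};

#define SKETCH_NT_PRSTATUS 1	/* value of NT_PRSTATUS in <elf.h> */
#define NOTE_ALIGN(x) (((size_t)(x) + 3u) & ~(size_t)3u)

/* Total bytes one note occupies, padding included. */
static size_t note_size(const char *name, size_t descsz)
{
	return sizeof(struct note_hdr) +
	       NOTE_ALIGN(strlen(name) + 1) + NOTE_ALIGN(descsz);
}

/* Write one note at buf and return the number of bytes used. */
static size_t store_note(unsigned char *buf, const char *name, uint32_t type,
			 const void *desc, size_t descsz)
{
	struct note_hdr hdr;
	size_t namesz = strlen(name) + 1;
	size_t total = note_size(name, descsz);

	memset(buf, 0, total);		/* zero the padding bytes */
	hdr.n_namesz = (uint32_t)namesz;
	hdr.n_descsz = (uint32_t)descsz;
	hdr.n_type = type;
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy(buf + sizeof(hdr), name, namesz);
	memcpy(buf + sizeof(hdr) + NOTE_ALIGN(namesz), desc, descsz);
	return total;
}
```

A capture tool that understands only this framing can pass notes along untouched, which is what would let the two kernels differ in version.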

As best as I have been able to discover, the interesting information
includes the cpu state (registers) at the time of the crash/panic,
the list of memory regions the kernel that has crashed was using,
and potentially the list of pages dedicated to kernel data as opposed
to user space, so that people with insane amounts of memory (1TB+)
don't require unmanageably large core files.


Andrew, earlier when asked about the possibility of merging the kexec
on panic code you said:

> I don't want us to be in a position of merging all that code and then
> finding out that it cannot be made to work "sufficiently well",
> forcing us to revert it and find a new crashdump solution. You guys
> know far better than I when we will reach that threshold. If the
> kexec/dump developers can say "yup, this is going to work (because X)"
> then I'm happy.

So here is my subjective view.
- This code needs to sit in a development tree for a little while
to shake out whatever bugs still linger from my massive refactoring.
- Through the kexec patches the code and design appear to be sound.
Given that machine_kexec is little more than a jump there are few
possible implementations that will not be able to use it. The only
exception I can see is running special dump drivers from the kernel
that crashed, and I believe no one thinks that will work well.
- Once we finish sorting out the best way to get information out of
the kernel that crashed I think we will have a complete architecture
that is largely portable to any architecture.

In the interests of full disclosure, my main interest is using the
kernel as a bootloader for other kernels, and that has been working
fairly well for years now :)

Eric



# Patches to remove from 2.6.11-rc1-mm1 before applying this patchset:
#
assign_irq_vector-section-fix.patch
kexec-i8259-shutdowni386.patch
kexec-i8259-shutdown-x86_64.patch
kexec-apic-virtwire-on-shutdowni386patch.patch
kexec-apic-virtwire-on-shutdownx86_64.patch
kexec-ioapic-virtwire-on-shutdowni386.patch
kexec-apic-virt-wire-fix.patch
kexec-ioapic-virtwire-on-shutdownx86_64.patch
kexec-e820-64bit.patch
kexec-kexec-generic.patch
kexec-ide-spindown-fix.patch
kexec-ifdef-cleanup.patch
kexec-machine_shutdownx86_64.patch
kexec-kexecx86_64.patch
kexec-kexecx86_64-4level-fix.patch
kexec-kexecx86_64-4level-fix-unfix.patch
kexec-machine_shutdowni386.patch
kexec-kexeci386.patch
kexec-use_mm.patch
kexec-loading-kernel-from-non-default-offset.patch
kexec-loading-kernel-from-non-default-offset-fix.patch
kexec-enabling-co-existence-of-normal-kexec-kernel-and-panic-kernel.patch
kexec-ppc-support.patch
#kexec-kexecppc.patch
#
crashdump-documentation.patch
crashdump-memory-preserving-reboot-using-kexec.patch
crashdump-memory-preserving-reboot-using-kexec-fix.patch
kdump-config_discontigmem-fix.patch
crashdump-routines-for-copying-dump-pages.patch
crashdump-routines-for-copying-dump-pages-kmap-fiddle.patch
crashdump-kmap-build-fix.patch
crashdump-register-snapshotting-before-kexec-boot.patch
crashdump-elf-format-dump-file-access.patch
crashdump-linear-raw-format-dump-file-access.patch
crashdump-minor-bug-fixes-to-kexec-crashdump-code.patch
crashdump-cleanups-to-the-kexec-based-crashdump-code.patch
#
x86-rename-apic_mode_exint.patch
x86-local-apic-fix.patch
#


2005-01-19 07:33:54

by Eric W. Biederman

Subject: [PATCH 6/29] x86-apic-virtwire-on-shutdown


When coming out of apic mode attempt to set the appropriate
apic back into virtual wire mode. This improves on previous versions
of this patch by never setting both the local apic and the ioapic
into virtual wire mode.

This code looks at data from the mptable to see if an ioapic has
an ExtInt input to make this decision. A future improvement
is to figure out which apic or ioapic was in virtual wire mode
at boot time and to remember it. That is potentially a more accurate
method of selecting which apic to place in virtual wire mode.
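
As a rough illustration of the decision described above, the LVT0 value chosen at shutdown can be written as a pure function. This is a sketch mirroring the logic in the diff below, not a replacement for it: lvt0_for_shutdown() is a hypothetical name, extint_pin stands for the result of find_isa_irq_pin(0, mp_ExtINT), and the bit definitions are assumed to match include/asm-i386/apicdef.h.

```c
#include <stdint.h>

/* Bit definitions, assumed to match include/asm-i386/apicdef.h. */
#define APIC_LVT_MASKED		(1u << 16)
#define APIC_LVT_LEVEL_TRIGGER	(1u << 15)
#define APIC_LVT_REMOTE_IRR	(1u << 14)
#define APIC_INPUT_POLARITY	(1u << 13)
#define APIC_SEND_PENDING	(1u << 12)
#define APIC_MODE_MASK		0x700u
#define APIC_MODE_EXTINT	0x7u

/*
 * Compute the LVT0 value to write at shutdown.
 * extint_pin == -1 means no IOAPIC pin carries ExtINT, so the local
 * apic itself is put into virtual wire mode. Otherwise the IOAPIC
 * handles it and LVT0 is simply masked, so that the two are never
 * both in virtual wire mode.
 */
static uint32_t lvt0_for_shutdown(uint32_t old, int extint_pin)
{
	uint32_t value = old;

	if (extint_pin != -1)
		return APIC_LVT_MASKED;

	value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING |
		   APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
		   APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED);
	/* The patch sets these two bits again after clearing them. */
	value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
	value = (value & ~APIC_MODE_MASK) | (APIC_MODE_EXTINT << 8);
	return value;
}
```

Keeping the choice in one place like this makes the "never both" property easy to see: exactly one branch is taken for any given mptable.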

Signed-off-by: Eric Biederman <[email protected]>
---

arch/i386/kernel/apic.c | 38 +++++++++++++++++++++++++++++++++++++-
arch/i386/kernel/io_apic.c | 33 ++++++++++++++++++++++++++++++++-
include/asm-i386/apic.h | 2 +-
include/asm-i386/apicdef.h | 1 +
4 files changed, 71 insertions(+), 3 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/arch/i386/kernel/apic.c linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/arch/i386/kernel/apic.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/arch/i386/kernel/apic.c Tue Jan 18 22:43:54 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/arch/i386/kernel/apic.c Tue Jan 18 22:45:00 2005
@@ -211,7 +211,7 @@
enable_apic_mode();
}

-void disconnect_bsp_APIC(void)
+void disconnect_bsp_APIC(int virt_wire_setup)
{
if (pic_mode) {
/*
@@ -224,6 +224,42 @@
"entering PIC mode.\n");
outb(0x70, 0x22);
outb(0x00, 0x23);
+ }
+ else {
+ /* Go back to Virtual Wire compatibility mode */
+ unsigned long value;
+
+ /* For the spurious interrupt use vector F, and enable it */
+ value = apic_read(APIC_SPIV);
+ value &= ~APIC_VECTOR_MASK;
+ value |= APIC_SPIV_APIC_ENABLED;
+ value |= 0xf;
+ apic_write_around(APIC_SPIV, value);
+
+ if (!virt_wire_setup) {
+ /* For LVT0 make it edge triggered, active high, external and enabled */
+ value = apic_read(APIC_LVT0);
+ value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING |
+ APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
+ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED );
+ value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
+ value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT);
+ apic_write_around(APIC_LVT0, value);
+ }
+ else {
+ /* Disable LVT0 */
+ apic_write_around(APIC_LVT0, APIC_LVT_MASKED);
+ }
+
+ /* For LVT1 make it edge triggered, active high, nmi and enabled */
+ value = apic_read(APIC_LVT1);
+ value &= ~(
+ APIC_MODE_MASK | APIC_SEND_PENDING |
+ APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
+ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED);
+ value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
+ value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI);
+ apic_write_around(APIC_LVT1, value);
}
}

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/arch/i386/kernel/io_apic.c linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/arch/i386/kernel/io_apic.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/arch/i386/kernel/io_apic.c Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/arch/i386/kernel/io_apic.c Tue Jan 18 22:45:00 2005
@@ -1631,12 +1631,43 @@
*/
void disable_IO_APIC(void)
{
+ int pin;
/*
* Clear the IO-APIC before rebooting:
*/
clear_IO_APIC();

- disconnect_bsp_APIC();
+ /*
+ * If the i8259 is routed through an IOAPIC
+ * Put that IOAPIC in virtual wire mode
+ * so legacy interrupts can be delivered.
+ */
+ pin = find_isa_irq_pin(0, mp_ExtINT);
+ if (pin != -1) {
+ struct IO_APIC_route_entry entry;
+ unsigned long flags;
+
+ memset(&entry, 0, sizeof(entry));
+ entry.mask = 0; /* Enabled */
+ entry.trigger = 0; /* Edge */
+ entry.irr = 0;
+ entry.polarity = 0; /* High */
+ entry.delivery_status = 0;
+ entry.dest_mode = 0; /* Physical */
+ entry.delivery_mode = 7; /* ExtInt */
+ entry.vector = 0;
+ entry.dest.physical.physical_dest = 0;
+
+
+ /*
+ * Add it to the IO-APIC irq-routing table:
+ */
+ spin_lock_irqsave(&ioapic_lock, flags);
+ io_apic_write(0, 0x11+2*pin, *(((int *)&entry)+1));
+ io_apic_write(0, 0x10+2*pin, *(((int *)&entry)+0));
+ spin_unlock_irqrestore(&ioapic_lock, flags);
+ }
+ disconnect_bsp_APIC(pin != -1);
}

/*
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/include/asm-i386/apic.h linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/include/asm-i386/apic.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/include/asm-i386/apic.h Tue Jan 18 22:43:55 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/include/asm-i386/apic.h Tue Jan 18 22:45:00 2005
@@ -100,7 +100,7 @@
extern int get_maxlvt(void);
extern void clear_local_APIC(void);
extern void connect_bsp_APIC (void);
-extern void disconnect_bsp_APIC (void);
+extern void disconnect_bsp_APIC (int virt_wire_setup);
extern void disable_local_APIC (void);
extern void lapic_shutdown (void);
extern int verify_local_APIC (void);
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/include/asm-i386/apicdef.h linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/include/asm-i386/apicdef.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/include/asm-i386/apicdef.h Tue Jan 18 22:43:44 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/include/asm-i386/apicdef.h Tue Jan 18 22:45:00 2005
@@ -86,6 +86,7 @@
#define APIC_LVT_REMOTE_IRR (1<<14)
#define APIC_INPUT_POLARITY (1<<13)
#define APIC_SEND_PENDING (1<<12)
+#define APIC_MODE_MASK 0x700
#define GET_APIC_DELIVERY_MODE(x) (((x)>>8)&0x7)
#define SET_APIC_DELIVERY_MODE(x,y) (((x)&~0x700)|((y)<<8))
#define APIC_MODE_FIXED 0x0

2005-01-19 07:35:03

by Eric W. Biederman

Subject: [PATCH 28/29] crashdump-elf-format-dump-file-access


This patch has been refactored to more closely match the prevailing
style in the affected files, and to clearly indicate the dependency
between /proc/kcore and fs/proc/vmcore.c

From: Hariprasad Nellitheertha <[email protected]>

This patch contains the code that provides an ELF format interface to the
previous kernel's memory post kexec reboot.
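
The address translation that read_vmcore performs in the diff below can be summarized in a small stand-alone function. This is a sketch only: map_dump_offset() is a hypothetical name, the 16MB base and 32MB size are assumed to be the Kconfig defaults from the memory-preserving-reboot patch, and offsets are measured from the end of the ELF header.

```c
#include <stdint.h>

#define MB			0x100000ull
#define CRASH_BACKUP_BASE	(16 * MB)	/* assumed CONFIG_BACKUP_BASE default */
#define CRASH_BACKUP_SIZE	(32 * MB)	/* assumed CONFIG_BACKUP_SIZE default */
#define CRASH_RELOCATE_SIZE	0xa0000ull	/* the first 640k */
#define BACKUP_START		CRASH_BACKUP_BASE
#define BACKUP_END		(CRASH_BACKUP_BASE + CRASH_BACKUP_SIZE)

/*
 * Translate an offset into the dump (past the ELF header) to the
 * physical address to read from the old kernel's memory.
 * Returns 0 if the range belongs to the second kernel and should be
 * reported as zeroes, 1 if *phys holds the address to read.
 */
static int map_dump_offset(uint64_t start, uint64_t *phys)
{
	if (start >= BACKUP_START && start < BACKUP_END)
		return 0;		/* second kernel's own memory */
	if (start < CRASH_RELOCATE_SIZE)
		start += BACKUP_END;	/* low 640k was copied just past the backup region */
	*phys = start;
	return 1;
}
```

With those defaults, an offset of 0x1000 is read from 0x3001000, while anything between 16MB and 48MB reads back as zeroes.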

Signed off by Hariprasad Nellitheertha <[email protected]>

Signed-off-by: Eric Biederman <[email protected]>
---

fs/proc/Makefile | 3
fs/proc/kcore.c | 10 -
fs/proc/proc_misc.c | 8 +
fs/proc/vmcore.c | 239 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/crash_dump.h | 13 ++
5 files changed, 267 insertions(+), 6 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/fs/proc/Makefile linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/fs/proc/Makefile
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/fs/proc/Makefile Fri Jan 14 04:28:46 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/fs/proc/Makefile Tue Jan 18 23:16:57 2005
@@ -10,5 +10,6 @@
proc-y += inode.o root.o base.o generic.o array.o \
kmsg.o proc_tty.o proc_misc.o

-proc-$(CONFIG_PROC_KCORE) += kcore.o
+kcore-$(CONFIG_CRASH_DUMP) += vmcore.o
+proc-$(CONFIG_PROC_KCORE) += kcore.o $(kcore-y)
proc-$(CONFIG_PROC_DEVICETREE) += proc_devtree.o
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/fs/proc/kcore.c linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/fs/proc/kcore.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/fs/proc/kcore.c Fri Jan 14 04:32:26 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/fs/proc/kcore.c Tue Jan 18 23:16:57 2005
@@ -97,7 +97,7 @@
/*
* determine size of ELF note
*/
-static int notesize(struct memelfnote *en)
+int notesize(struct memelfnote *en)
{
int sz;

@@ -112,7 +112,7 @@
/*
* store a note in the header buffer
*/
-static char *storenote(struct memelfnote *men, char *bufp)
+char *storenote(struct memelfnote *men, char *bufp)
{
struct elf_note en;

@@ -139,7 +139,7 @@
* store an ELF coredump header in the supplied buffer
* nphdr is the number of elf_phdr to insert
*/
-static void elf_kcore_store_hdr(char *bufp, int nphdr, int dataoff)
+void elf_kcore_store_hdr(char *bufp, int nphdr, int dataoff, struct kcore_list *clist)
{
struct elf_prstatus prstatus; /* NT_PRSTATUS */
struct elf_prpsinfo prpsinfo; /* NT_PRPSINFO */
@@ -191,7 +191,7 @@
nhdr->p_align = 0;

/* setup ELF PT_LOAD program header for every area */
- for (m=kclist; m; m=m->next) {
+ for (m=clist; m; m=m->next) {
phdr = (struct elf_phdr *) bufp;
bufp += sizeof(struct elf_phdr);
offset += sizeof(struct elf_phdr);
@@ -287,7 +287,7 @@
return -ENOMEM;
}
memset(elf_buf, 0, elf_buflen);
- elf_kcore_store_hdr(elf_buf, nphdr, elf_buflen);
+ elf_kcore_store_hdr(elf_buf, nphdr, elf_buflen, kclist);
read_unlock(&kclist_lock);
if (copy_to_user(buffer, elf_buf + *fpos, tsz)) {
kfree(elf_buf);
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/fs/proc/proc_misc.c linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/fs/proc/proc_misc.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/fs/proc/proc_misc.c Fri Jan 14 04:28:46 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/fs/proc/proc_misc.c Tue Jan 18 23:16:57 2005
@@ -44,6 +44,7 @@
#include <linux/jiffies.h>
#include <linux/sysrq.h>
#include <linux/vmalloc.h>
+#include <linux/crash_dump.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/io.h>
@@ -598,6 +599,13 @@
proc_root_kcore->size =
(size_t)high_memory - PAGE_OFFSET + PAGE_SIZE;
}
+# ifdef CONFIG_CRASH_DUMP
+ entry = create_proc_entry("vmcore", S_IRUSR, NULL);
+ if (entry) {
+ entry->proc_fops = &proc_vmcore_operations;
+ entry->size = (size_t)(saved_max_pfn << PAGE_SHIFT);
+ }
+# endif
#endif
#ifdef CONFIG_MAGIC_SYSRQ
entry = create_proc_entry("sysrq-trigger", S_IWUSR, NULL);
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/fs/proc/vmcore.c linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/fs/proc/vmcore.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/fs/proc/vmcore.c Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/fs/proc/vmcore.c Tue Jan 18 23:16:57 2005
@@ -0,0 +1,239 @@
+/*
+ * fs/proc/vmcore.c Interface for accessing the crash
+ * dump from the system's previous life.
+ * Heavily borrowed from fs/proc/kcore.c
+ * Created by: Hariprasad Nellitheertha ([email protected])
+ * Copyright (C) IBM Corporation, 2004. All rights reserved
+ */
+
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/proc_fs.h>
+#include <linux/user.h>
+#include <linux/a.out.h>
+#include <linux/elf.h>
+#include <linux/elfcore.h>
+#include <linux/vmalloc.h>
+#include <linux/proc_fs.h>
+#include <linux/highmem.h>
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <linux/crash_dump.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+
+/* This is to re-use the kcore header creation code */
+static struct kcore_list vmcore_mem;
+
+static int open_vmcore(struct inode * inode, struct file * filp)
+{
+ return 0;
+}
+
+static ssize_t read_vmcore(struct file *,char __user *,size_t, loff_t *);
+
+#define BACKUP_START CRASH_BACKUP_BASE
+#define BACKUP_END (CRASH_BACKUP_BASE + CRASH_BACKUP_SIZE)
+#define REG_SIZE sizeof(elf_gregset_t)
+
+struct file_operations proc_vmcore_operations = {
+ .read = read_vmcore,
+ .open = open_vmcore,
+};
+
+struct proc_dir_entry *proc_vmcore;
+
+struct memelfnote
+{
+ const char *name;
+ int type;
+ unsigned int datasz;
+ void *data;
+};
+
+static size_t get_vmcore_size(int *nphdr, size_t *elf_buflen)
+{
+ size_t size;
+
+ /* We need 1 PT_LOAD segment headers
+ * In addition, we need one PT_NOTE header
+ */
+ *nphdr = 2;
+ size = (size_t)(saved_max_pfn << PAGE_SHIFT);
+
+ *elf_buflen = sizeof(struct elfhdr) +
+ (*nphdr + 2)*sizeof(struct elf_phdr) +
+ 3 * sizeof(struct memelfnote) +
+ sizeof(struct elf_prstatus) +
+ sizeof(struct elf_prpsinfo) +
+ sizeof(struct task_struct);
+ *elf_buflen = PAGE_ALIGN(*elf_buflen);
+ return size + *elf_buflen;
+}
+
+/*
+ * Reads a page from the oldmem device from given offset.
+ */
+static ssize_t read_from_oldmem(char *buf, size_t count,
+ loff_t *ppos, int userbuf)
+{
+ unsigned long pfn;
+ ssize_t read = 0;
+
+ pfn = (unsigned long)(*ppos / PAGE_SIZE);
+
+ if (pfn > saved_max_pfn) {
+ read = -EINVAL;
+ goto done;
+ }
+
+ count = (count > PAGE_SIZE) ? PAGE_SIZE : count;
+
+ if (copy_oldmem_page(pfn, buf, count, userbuf)) {
+ read = -EFAULT;
+ goto done;
+ }
+
+ *ppos += count;
+done:
+ return read;
+}
+
+/*
+ * store an ELF crash dump header in the supplied buffer
+ * nphdr is the number of elf_phdr to insert
+ */
+static void elf_vmcore_store_hdr(char *bufp, int nphdr, int dataoff)
+{
+ struct elf_prstatus prstatus; /* NT_PRSTATUS */
+ struct memelfnote notes[1];
+ char reg_buf[REG_SIZE];
+ loff_t reg_ppos;
+ char *buf = bufp;
+
+ vmcore_mem.addr = (unsigned long)__va(0);
+ vmcore_mem.size = saved_max_pfn << PAGE_SHIFT;
+ vmcore_mem.next = NULL;
+
+ /* Re-use the kcore code */
+ elf_kcore_store_hdr(bufp, nphdr, dataoff, &vmcore_mem);
+ buf += sizeof(struct elfhdr) + 2*sizeof(struct elf_phdr);
+
+ /* set up the process status */
+ notes[0].name = "CORE";
+ notes[0].type = NT_PRSTATUS;
+ notes[0].datasz = sizeof(struct elf_prstatus);
+ notes[0].data = &prstatus;
+
+ memset(&prstatus, 0, sizeof(struct elf_prstatus));
+
+ /* 1 - Get the registers from the reserved memory area */
+ reg_ppos = BACKUP_END + CRASH_RELOCATE_SIZE;
+ read_from_oldmem(reg_buf, REG_SIZE, &reg_ppos, 0);
+ elf_core_copy_regs(&prstatus.pr_reg, (struct pt_regs *)reg_buf);
+ buf = storenote(&notes[0], buf);
+}
+
+/*
+ * read from the ELF header and then the crash dump
+ */
+static ssize_t read_vmcore(
+struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
+{
+ ssize_t acc = 0;
+ size_t size, tsz;
+ size_t elf_buflen;
+ int nphdr;
+ unsigned long start;
+
+ tsz = get_vmcore_size(&nphdr, &elf_buflen);
+ proc_vmcore->size = size = tsz + elf_buflen;
+ if (buflen == 0 || *fpos >= size) {
+ goto done;
+ }
+
+ /* trim buflen to not go beyond EOF */
+ if (buflen > size - *fpos)
+ buflen = size - *fpos;
+
+ /* construct an ELF core header if we'll need some of it */
+ if (*fpos < elf_buflen) {
+ char * elf_buf;
+
+ tsz = elf_buflen - *fpos;
+ if (buflen < tsz)
+ tsz = buflen;
+ elf_buf = kmalloc(elf_buflen, GFP_ATOMIC);
+ if (!elf_buf) {
+ acc = -ENOMEM;
+ goto done;
+ }
+ memset(elf_buf, 0, elf_buflen);
+ elf_vmcore_store_hdr(elf_buf, nphdr, elf_buflen);
+ if (copy_to_user(buffer, elf_buf + *fpos, tsz)) {
+ kfree(elf_buf);
+ acc = -EFAULT;
+ goto done;
+ }
+ kfree(elf_buf);
+ buflen -= tsz;
+ *fpos += tsz;
+ buffer += tsz;
+ acc += tsz;
+
+ /* leave now if filled buffer already */
+ if (buflen == 0) {
+ goto done;
+ }
+ }
+
+ start = *fpos - elf_buflen;
+ if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
+ tsz = buflen;
+
+ while (buflen) {
+ unsigned long p;
+ loff_t pdup;
+
+ if ((start < 0) || (start >= size))
+ if (clear_user(buffer, tsz)) {
+ acc = -EFAULT;
+ goto done;
+ }
+
+ /* tsz contains actual len of dump to be read.
+ * buflen is the total len that was requested.
+ * This may contain part of ELF header. start
+ * is the fpos for the oldmem region
+ * If the file position corresponds to the second
+ * kernel's memory, we just return zeroes
+ */
+ p = start;
+ if ((p >= BACKUP_START) && (p < BACKUP_END)) {
+ if (clear_user(buffer, tsz)) {
+ acc = -EFAULT;
+ goto done;
+ }
+
+ goto read_done;
+ } else if (p < CRASH_RELOCATE_SIZE)
+ p += BACKUP_END;
+
+ pdup = p;
+ if (read_from_oldmem(buffer, tsz, &pdup, 1)) {
+ acc = -EINVAL;
+ goto done;
+ }
+
+read_done:
+ buflen -= tsz;
+ *fpos += tsz;
+ buffer += tsz;
+ acc += tsz;
+ start += tsz;
+ tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);
+ }
+
+done:
+ return acc;
+}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/include/linux/crash_dump.h linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/include/linux/crash_dump.h
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/include/linux/crash_dump.h Tue Jan 18 23:16:24 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/include/linux/crash_dump.h Tue Jan 18 23:16:57 2005
@@ -1,10 +1,23 @@
#include <linux/kexec.h>
#include <linux/smp_lock.h>
#include <linux/device.h>
+#include <linux/proc_fs.h>
#ifdef CONFIG_CRASH_DUMP
#include <asm/crash_dump.h>
#endif

+extern unsigned long saved_max_pfn;
+extern struct memelfnote memelfnote;
+extern int notesize(struct memelfnote *);
+extern char *storenote(struct memelfnote *, char *);
+extern void elf_kcore_store_hdr(char *, int, int, struct kcore_list *);
+
#ifdef CONFIG_CRASH_DUMP
+extern struct file_operations proc_vmcore_operations;
+extern struct proc_dir_entry *proc_vmcore;
+
+extern ssize_t copy_oldmem_page(unsigned long, char *, size_t, int);
+extern void crash_create_proc_entry(void);
#else
+#define crash_create_proc_entry() do { } while(0)
#endif

2005-01-19 07:39:18

by Eric W. Biederman

Subject: [PATCH 26/29] crashdump-memory-preserving-reboot-using-kexec


With the recent refactoring of the kexec code this patch is a shadow
of its former self. The user space code in /sbin/kexec has been
enhanced so it can copy the first 640k. And the strong tying
between the crashdump capture code paths and the kexec on panic code
paths has been removed.

From: Hariprasad Nellitheertha <[email protected]>

This patch contains the code that does the memory preserving reboot. It
copies over the first 640k into a backup region before handing over to kexec.
The second kernel will boot using only the backup region.

Signed off by Hariprasad Nellitheertha <[email protected]>
Signed off by Adam Litke <[email protected]>

Signed-off-by: Eric Biederman <[email protected]>
---

arch/i386/Kconfig | 21 +++++++++++++++++++++
arch/i386/kernel/setup.c | 8 ++++++++
include/asm-i386/crash_dump.h | 20 ++++++++++++++++++++
include/linux/bootmem.h | 3 +++
include/linux/crash_dump.h | 10 ++++++++++
kernel/Makefile | 1 +
kernel/crash_dump.c | 13 +++++++++++++
mm/bootmem.c | 7 +++++++
8 files changed, 83 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/arch/i386/Kconfig linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/arch/i386/Kconfig
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/arch/i386/Kconfig Tue Jan 18 22:58:15 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/arch/i386/Kconfig Tue Jan 18 23:16:24 2005
@@ -918,6 +918,27 @@
support. As of this writing the exact hardware interface is
strongly in flux, so no good recommendation can be made.

+config CRASH_DUMP
+ bool "kernel crash dumps (EXPERIMENTAL)"
+ depends on EMBEDDED
+ depends on EXPERIMENTAL
+ help
+ Generate crash dump after being started by kexec.
+
+config BACKUP_BASE
+ int "location from where the crash dumping kernel will boot (MB)"
+ depends on CRASH_DUMP
+ default 16
+ help
+ This is the location where the second kernel will boot from.
+
+config BACKUP_SIZE
+ int "Size of memory used by the crash dumping kernel (MB)"
+ depends on CRASH_DUMP
+ range 16 64
+ default 32
+ help
+ The size of the second kernel's memory.
endmenu


diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/arch/i386/kernel/setup.c linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/arch/i386/kernel/setup.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/arch/i386/kernel/setup.c Tue Jan 18 22:58:33 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/arch/i386/kernel/setup.c Tue Jan 18 23:16:24 2005
@@ -51,6 +51,7 @@
#include <asm/io_apic.h>
#include <asm/ist.h>
#include <asm/io.h>
+#include <asm/crash_dump.h>
#include "setup_arch_pre.h"
#include <bios_ebda.h>

@@ -713,6 +714,13 @@
if (to != command_line)
to--;
if (!memcmp(from+7, "exactmap", 8)) {
+#ifdef CONFIG_CRASH_DUMP
+ /* If we are doing a crash dump, we
+ * still need to know the real mem
+ * size.
+ */
+ set_saved_max_pfn();
+#endif
from += 8+7;
e820.nr_map = 0;
userdef = 1;
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/include/asm-i386/crash_dump.h linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/asm-i386/crash_dump.h
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/include/asm-i386/crash_dump.h Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/asm-i386/crash_dump.h Tue Jan 18 23:16:24 2005
@@ -0,0 +1,20 @@
+/* asm-i386/crash_dump.h */
+#include <linux/bootmem.h>
+
+#ifdef CONFIG_CRASH_DUMP
+extern unsigned long __init find_max_low_pfn(void);
+extern void __init find_max_pfn(void);
+
+#define CRASH_BACKUP_BASE ((unsigned long)CONFIG_BACKUP_BASE * 0x100000)
+#define CRASH_BACKUP_SIZE ((unsigned long)CONFIG_BACKUP_SIZE * 0x100000)
+#define CRASH_RELOCATE_SIZE 0xa0000
+
+static inline void set_saved_max_pfn(void)
+{
+ find_max_pfn();
+ saved_max_pfn = find_max_low_pfn();
+}
+
+#else
+#define set_saved_max_pfn() do { } while(0)
+#endif
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/include/linux/bootmem.h linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/linux/bootmem.h
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/include/linux/bootmem.h Fri Jan 14 04:28:48 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/linux/bootmem.h Tue Jan 18 23:16:24 2005
@@ -21,6 +21,9 @@
* highest page
*/
extern unsigned long max_pfn;
+#ifdef CONFIG_CRASH_DUMP
+extern unsigned long saved_max_pfn;
+#endif

/*
* node_bootmem_map is a map pointer - the bits represent all physical
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/include/linux/crash_dump.h linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/linux/crash_dump.h
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/include/linux/crash_dump.h Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/linux/crash_dump.h Tue Jan 18 23:16:24 2005
@@ -0,0 +1,10 @@
+#include <linux/kexec.h>
+#include <linux/smp_lock.h>
+#include <linux/device.h>
+#ifdef CONFIG_CRASH_DUMP
+#include <asm/crash_dump.h>
+#endif
+
+#ifdef CONFIG_CRASH_DUMP
+#else
+#endif
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/kernel/Makefile linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/kernel/Makefile
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/kernel/Makefile Tue Jan 18 22:47:13 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/kernel/Makefile Tue Jan 18 23:16:24 2005
@@ -29,6 +29,7 @@
obj-$(CONFIG_KPROBES) += kprobes.o
obj-$(CONFIG_SYSFS) += ksysfs.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
+obj-$(CONFIG_CRASH_DUMP) += crash_dump.o

ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/kernel/crash_dump.c linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/kernel/crash_dump.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/kernel/crash_dump.c Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/kernel/crash_dump.c Tue Jan 18 23:16:24 2005
@@ -0,0 +1,13 @@
+/*
+ * kernel/crash_dump.c - Memory preserving reboot related code.
+ *
+ * Created by: Hariprasad Nellitheertha ([email protected])
+ * Copyright (C) IBM Corporation, 2004. All rights reserved
+ */
+
+#include <linux/smp_lock.h>
+#include <linux/errno.h>
+#include <linux/proc_fs.h>
+#include <asm/io.h>
+#include <asm/uaccess.h>
+
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/mm/bootmem.c linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/mm/bootmem.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/mm/bootmem.c Fri Jan 14 04:28:50 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/mm/bootmem.c Tue Jan 18 23:16:24 2005
@@ -28,6 +28,13 @@
unsigned long max_low_pfn;
unsigned long min_low_pfn;
unsigned long max_pfn;
+#ifdef CONFIG_CRASH_DUMP
+/*
+ * If we have booted due to a crash, max_pfn will be a very low value. We need
+ * to know the amount of memory that the previous kernel used.
+ */
+unsigned long saved_max_pfn;
+#endif

EXPORT_SYMBOL(max_pfn); /* This is exported so
* dma_get_required_mask(), which uses

2005-01-19 07:37:59

by Eric W. Biederman

Subject: [PATCH 21/29] kexec-ppc-support


I have tweaked this patch slightly to handle an empty list
of pages to relocate passed to relocate_new_kernel, and
I have added ppc_md.machine_crash_shutdown to keep up with
the changes in the generic kexec infrastructure.

From: Albert Herranz <[email protected]>

The following patch adds support for kexec on the ppc32 platform.

Non-OpenFirmware based platforms are likely to work directly without
additional changes on the kernel side. The kexec-tools userland package
may need to be slightly updated, though.

For OpenFirmware based machines, additional work is still needed on the
kernel side before kexec support is ready. Benjamin Herrenschmidt is
kindly working on that part.

In order for a ppc platform to use the kexec kernel services it must
implement some ppc_md hooks. Otherwise, kexec will be explicitly disabled,
as suggested by benh.

There are 3+1 new ppc_md hooks that a platform supporting kexec may
implement. Two of them are mandatory for kexec to work. See
include/asm-ppc/machdep.h for details.

- machine_kexec_prepare(image)

This function is called to make any arrangements to the image before it
is loaded.

This hook _MUST_ be provided by a platform in order to activate kexec
support for that platform. Otherwise, the platform is considered to not
support kexec and the kexec_load system call will fail (that makes all
existing platforms by default non-kexec'able).

- machine_kexec_cleanup(image)

This function is called to perform any cleanup on the image after the
loaded image data is freed. This hook is optional. A platform may or may
not provide this hook.

- machine_kexec(image)

This function is called to perform the _actual_ kexec. This hook
_MUST_ be provided by a platform in order to activate kexec support for
that platform.

If a platform provides machine_kexec_prepare but forgets to provide
machine_kexec, a kexec will fall back to a reboot.

A ready-to-use machine_kexec_simple() generic function is provided to,
hopefully, simplify kexec adoption for embedded platforms. A platform
may call this function from its specific machine_kexec hook, like this:

void myplatform_kexec(struct kimage *image)
{
	machine_kexec_simple(image);
}

- machine_shutdown()

This function is called to perform any machine specific shutdowns, not
already done by drivers. This hook is optional. A platform may or may
not provide this hook.

An example (trimmed) platform specific module for a platform supporting
kexec through the existing machine_kexec_simple follows:

/* ... */

#ifdef CONFIG_KEXEC
int myplatform_kexec_prepare(struct kimage *image)
{
	/* here, we can place additional preparations */
	return 0; /* yes, we support kexec */
}

void myplatform_kexec(struct kimage *image)
{
	machine_kexec_simple(image);
}
#endif /* CONFIG_KEXEC */

/* ... */

void __init
platform_init(unsigned long r3, unsigned long r4,
		unsigned long r5,
		unsigned long r6, unsigned long r7)
{

	/* ... */

#ifdef CONFIG_KEXEC
	ppc_md.machine_kexec_prepare =
		myplatform_kexec_prepare;
	ppc_md.machine_kexec =
		myplatform_kexec;
#endif /* CONFIG_KEXEC */

	/* ... */

}
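
The generic side of the hook scheme described above can be modelled in user space roughly as follows. This is only a sketch: struct sim_machdep_calls, sim_md and the counters are hypothetical stand-ins for ppc_md and machine_restart(), chosen so the fallback behaviour (-ENOSYS without a prepare hook, restart without a machine_kexec hook) can be exercised outside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct kimage;	/* opaque here; the real definition is in <linux/kexec.h> */

/* Hypothetical stand-in for the kexec part of struct machdep_calls. */
struct sim_machdep_calls {
	int  (*machine_kexec_prepare)(struct kimage *image);
	void (*machine_kexec)(struct kimage *image);
};

static struct sim_machdep_calls sim_md;	/* all hooks start out NULL */
static int sim_restarts;	/* counts fallbacks to a normal restart */
static int sim_kexecs;		/* counts calls into the platform hook */

static int sim_kexec_prepare(struct kimage *image)
{
	if (sim_md.machine_kexec_prepare)
		return sim_md.machine_kexec_prepare(image);
	return -ENOSYS;	/* no hook: the platform does not support kexec */
}

static void sim_kexec(struct kimage *image)
{
	if (sim_md.machine_kexec)
		sim_md.machine_kexec(image);
	else
		sim_restarts++;	/* stands in for machine_restart(NULL) */
}

/* A platform opting in, mirroring the myplatform_* example above. */
static int myplatform_kexec_prepare(struct kimage *image)
{
	(void)image;
	return 0;	/* yes, we support kexec */
}

static void myplatform_kexec(struct kimage *image)
{
	(void)image;
	sim_kexecs++;	/* the real hook would not return */
}
```

With both hooks left NULL, sim_kexec_prepare() fails with -ENOSYS; with only the prepare hook set, sim_kexec() takes the restart path.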

The kexec ppc kernel support has been heavily tested on the GameCube Linux
port, and, as reported in the fastboot mailing list, it has been tested too
on a Moto 82xx ppc by Rick Richardson.

Signed-off-by: Albert Herranz <[email protected]>

Signed-off-by: Eric Biederman <[email protected]>
---

arch/ppc/Kconfig | 20 ++++++
arch/ppc/kernel/Makefile | 1
arch/ppc/kernel/machine_kexec.c | 121 +++++++++++++++++++++++++++++++++++++
arch/ppc/kernel/misc.S | 2
arch/ppc/kernel/relocate_kernel.S | 123 ++++++++++++++++++++++++++++++++++++++
include/asm-ppc/kexec.h | 38 +++++++++++
include/asm-ppc/machdep.h | 31 +++++++++
7 files changed, 335 insertions(+), 1 deletion(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/Kconfig linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/Kconfig
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/Kconfig Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/Kconfig Tue Jan 18 23:15:00 2005
@@ -198,6 +198,26 @@
here. Saying Y here will not hurt performance (on any machine) but
will increase the size of the kernel.

+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shutdown your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shutdown, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
+ In the GameCube implementation, kexec allows you to load and
+ run DOL files, including kernel and homebrew DOLs.
+
source "drivers/cpufreq/Kconfig"

config CPU_FREQ_PMAC
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/kernel/Makefile linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/kernel/Makefile
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/kernel/Makefile Fri Jan 14 04:28:31 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/kernel/Makefile Tue Jan 18 23:15:00 2005
@@ -25,6 +25,7 @@
obj-$(CONFIG_TAU) += temp.o
obj-$(CONFIG_ALTIVEC) += vecemu.o vector.o
obj-$(CONFIG_FSL_BOOKE) += perfmon_fsl_booke.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o

ifndef CONFIG_MATH_EMULATION
obj-$(CONFIG_8xx) += softemu8xx.o
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/kernel/machine_kexec.c linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/kernel/machine_kexec.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/kernel/machine_kexec.c Tue Jan 18 23:15:00 2005
@@ -0,0 +1,121 @@
+/*
+ * machine_kexec.c - handle transition of Linux booting another kernel
+ * Copyright (C) 2002-2003 Eric Biederman <[email protected]>
+ *
+ * GameCube/ppc32 port Copyright (C) 2004 Albert Herranz
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <linux/reboot.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/mmu_context.h>
+#include <asm/io.h>
+#include <asm/hw_irq.h>
+#include <asm/cacheflush.h>
+#include <asm/machdep.h>
+
+typedef NORET_TYPE void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address) ATTRIB_NORET;
+
+const extern unsigned char relocate_new_kernel[];
+const extern unsigned int relocate_new_kernel_size;
+
+void machine_shutdown(void)
+{
+ if (ppc_md.machine_shutdown) {
+ ppc_md.machine_shutdown();
+ }
+}
+
+void machine_crash_shutdown(void)
+{
+ if (ppc_md.machine_crash_shutdown) {
+ ppc_md.machine_crash_shutdown();
+ }
+}
+
+/*
+ * Do whatever setup is needed on image and the
+ * reboot code buffer to allow us to avoid allocations
+ * later.
+ */
+int machine_kexec_prepare(struct kimage *image)
+{
+ if (ppc_md.machine_kexec_prepare) {
+ return ppc_md.machine_kexec_prepare(image);
+ }
+ /*
+ * Fail if platform doesn't provide its own machine_kexec_prepare
+ * implementation.
+ */
+ return -ENOSYS;
+}
+
+void machine_kexec_cleanup(struct kimage *image)
+{
+ if (ppc_md.machine_kexec_cleanup) {
+ ppc_md.machine_kexec_cleanup(image);
+ }
+}
+
+/*
+ * Do not allocate memory (or fail in any way) in machine_kexec().
+ * We are past the point of no return, committed to rebooting now.
+ */
+void machine_kexec(struct kimage *image)
+{
+ if (ppc_md.machine_kexec) {
+ ppc_md.machine_kexec(image);
+ } else {
+ /*
+ * Fall back to normal restart if platform doesn't provide
+ * its own kexec function, and the user insists on kexec...
+ */
+ machine_restart(NULL);
+ }
+}
+
+
+/*
+ * This is a generic machine_kexec function suitable at least for
+ * non-OpenFirmware embedded platforms.
+ * It merely copies the image relocation code to the control page and
+ * jumps to it.
+ * A platform specific function may just call this one.
+ */
+NORET_TYPE void machine_kexec_simple(struct kimage *image)
+{
+ unsigned long page_list;
+ unsigned long reboot_code_buffer, reboot_code_buffer_phys;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+
+ page_list = image->head;
+
+ /* we need both effective and real address here */
+ reboot_code_buffer =
+ (unsigned long)page_address(image->control_code_page);
+ reboot_code_buffer_phys = virt_to_phys((void *)reboot_code_buffer);
+
+ /* copy our kernel relocation code to the control code page */
+ memcpy((void *)reboot_code_buffer,
+ relocate_new_kernel, relocate_new_kernel_size);
+
+ flush_icache_range(reboot_code_buffer,
+ reboot_code_buffer + KEXEC_CONTROL_CODE_SIZE);
+ printk(KERN_INFO "Bye!\n");
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) reboot_code_buffer;
+ (*rnk)(page_list, reboot_code_buffer_phys, image->start);
+}
+
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/kernel/misc.S linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/kernel/misc.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/kernel/misc.S Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/kernel/misc.S Tue Jan 18 23:15:00 2005
@@ -1450,7 +1450,7 @@
.long sys_mq_timedreceive /* 265 */
.long sys_mq_notify
.long sys_mq_getsetattr
- .long sys_ni_syscall /* 268 reserved for sys_kexec_load */
+ .long sys_kexec_load
.long sys_add_key
.long sys_request_key /* 270 */
.long sys_keyctl
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/kernel/relocate_kernel.S linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/kernel/relocate_kernel.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/ppc/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/ppc/kernel/relocate_kernel.S Tue Jan 18 23:15:00 2005
@@ -0,0 +1,123 @@
+/*
+ * relocate_kernel.S - put the kernel image in place to boot
+ * Copyright (C) 2002-2003 Eric Biederman <[email protected]>
+ *
+ * GameCube/ppc32 port Copyright (C) 2004 Albert Herranz
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <asm/reg.h>
+#include <asm/ppc_asm.h>
+#include <asm/processor.h>
+
+#include <asm/kexec.h>
+
+#define PAGE_SIZE 4096 /* must be same value as in <asm/page.h> */
+
+ /*
+ * Must be relocatable PIC code callable as a C function.
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* r3 = page_list */
+ /* r4 = reboot_code_buffer */
+ /* r5 = start_address */
+
+ li r0, 0
+
+ /*
+ * Set Machine Status Register to a known status,
+ * switch the MMU off and jump to 1: in a single step.
+ */
+
+ mr r8, r0
+ ori r8, r8, MSR_RI|MSR_ME
+ mtspr SRR1, r8
+ addi r8, r4, 1f - relocate_new_kernel
+ mtspr SRR0, r8
+ sync
+ rfi
+
+1:
+ /* from this point address translation is turned off */
+ /* and interrupts are disabled */
+
+ /* set a new stack at the bottom of our page... */
+ /* (not really needed now) */
+ addi r1, r4, KEXEC_CONTROL_CODE_SIZE - 8 /* for LR Save+Back Chain */
+ stw r0, 0(r1)
+
+ /* Do the copies */
+ li r6, 0 /* checksum */
+ mr r0, r3
+ b 1f
+
+0: /* top, read another word for the indirection page */
+ lwzu r0, 4(r3)
+
+1:
+ /* is it a destination page? (r8) */
+ rlwinm. r7, r0, 0, 31, 31 /* IND_DESTINATION (1<<0) */
+ beq 2f
+
+ rlwinm r8, r0, 0, 0, 19 /* clear kexec flags, page align */
+ b 0b
+
+2: /* is it an indirection page? (r3) */
+ rlwinm. r7, r0, 0, 30, 30 /* IND_INDIRECTION (1<<1) */
+ beq 2f
+
+ rlwinm r3, r0, 0, 0, 19 /* clear kexec flags, page align */
+ subi r3, r3, 4
+ b 0b
+
+2: /* are we done? */
+ rlwinm. r7, r0, 0, 29, 29 /* IND_DONE (1<<2) */
+ beq 2f
+ b 3f
+
+2: /* is it a source page? (r9) */
+ rlwinm. r7, r0, 0, 28, 28 /* IND_SOURCE (1<<3) */
+ beq 0b
+
+ rlwinm r9, r0, 0, 0, 19 /* clear kexec flags, page align */
+
+ li r7, PAGE_SIZE / 4
+ mtctr r7
+ subi r9, r9, 4
+ subi r8, r8, 4
+9:
+ lwzu r0, 4(r9) /* do the copy */
+ xor r6, r6, r0
+ stwu r0, 4(r8)
+ dcbst 0, r8
+ sync
+ icbi 0, r8
+ bdnz 9b
+
+ addi r9, r9, 4
+ addi r8, r8, 4
+ b 0b
+
+3:
+
+ /* To be certain of avoiding problems with self-modifying code
+ * execute a serializing instruction here.
+ */
+ isync
+ sync
+
+ /* jump to the entry point, usually the setup routine */
+ mtlr r5
+ blrl
+
+1: b 1b
+
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
+
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/include/asm-ppc/kexec.h linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/include/asm-ppc/kexec.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/include/asm-ppc/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/include/asm-ppc/kexec.h Tue Jan 18 23:15:00 2005
@@ -0,0 +1,38 @@
+#ifndef _PPC_KEXEC_H
+#define _PPC_KEXEC_H
+
+#ifdef CONFIG_KEXEC
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGEOFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+/* Maximum address we can use for the control code buffer */
+#define KEXEC_CONTROL_MEMORY_LIMIT TASK_SIZE
+
+#define KEXEC_CONTROL_CODE_SIZE 4096
+
+/* The native architecture */
+#define KEXEC_ARCH KEXEC_ARCH_PPC
+
+#ifndef __ASSEMBLY__
+
+struct kimage;
+
+extern void machine_kexec_simple(struct kimage *image);
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* CONFIG_KEXEC */
+
+#endif /* _PPC_KEXEC_H */
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/include/asm-ppc/machdep.h linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/include/asm-ppc/machdep.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/include/asm-ppc/machdep.h Fri Jan 7 12:54:15 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/include/asm-ppc/machdep.h Tue Jan 18 23:15:00 2005
@@ -4,6 +4,7 @@

#include <linux/config.h>
#include <linux/init.h>
+#include <linux/kexec.h>

#include <asm/setup.h>

@@ -106,6 +107,36 @@
/* functions for dealing with other cpus */
struct smp_ops_t *smp_ops;
#endif /* CONFIG_SMP */
+
+#ifdef CONFIG_KEXEC
+ /* Called to shutdown machine specific hardware not already controlled
+ * by other drivers.
+ * XXX Should we move this one out of kexec scope?
+ */
+ void (*machine_shutdown)(void);
+
+ /* Called to do the minimal shutdown needed for a kexec'd kernel
+ * to run successfully.
+ * XXX Should we move this one out of kexec scope?
+ */
+ void (*machine_crash_shutdown)(void);
+
+ /* Called to do whatever setup is needed on image and the
+ * reboot code buffer. Returns 0 on success.
+ * Provide your own (maybe dummy) implementation if your platform
+ * claims to support kexec.
+ */
+ int (*machine_kexec_prepare)(struct kimage *image);
+
+ /* Called to handle any machine specific cleanup on image */
+ void (*machine_kexec_cleanup)(struct kimage *image);
+
+ /* Called to perform the _real_ kexec.
+ * Do NOT allocate memory or fail here. We are past the point of
+ * no return.
+ */
+ void (*machine_kexec)(struct kimage *image);
+#endif /* CONFIG_KEXEC */
};

extern struct machdep_calls ppc_md;

2005-01-19 07:42:04

by Eric W. Biederman

Subject: [PATCH 9/29] x86-vmlinux-fix-physical-addrs


The vmlinux on i386 does not report the correct physical address of
the kernel. Instead in the physical address field it currently
reports the virtual address of the kernel.

This patch is a bug fix that corrects vmlinux to report the
proper physical addresses.

This is potentially a help for crash dump analysis tools.

This definitely allows bootloaders to load vmlinux as a standard
ELF executable. Bootloaders directly loading vmlinux become of
practical importance when we consider the kexec on panic case.
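
As a quick illustration of the arithmetic behind the AT(ADDR(section) - LOAD_OFFSET) clauses in the linker script, here is a minimal sketch. It assumes the default i386 3G/1G split, where __PAGE_OFFSET is 0xC0000000; SIM_LOAD_OFFSET and sim_load_addr() are illustrative names only.

```c
#include <assert.h>

/* Assumption: default i386 3G/1G split, __PAGE_OFFSET == 0xC0000000. */
#define SIM_LOAD_OFFSET 0xC0000000UL

/* Load (physical) address recorded for a section linked at virt_addr. */
static unsigned long sim_load_addr(unsigned long virt_addr)
{
	return virt_addr - SIM_LOAD_OFFSET;
}
```

Under that assumption the kernel text, linked at LOAD_OFFSET + 0x100000, gets a load address of 0x100000 (1MB), which is where an ELF-aware bootloader would place it.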

Signed-off-by: Eric Biederman <[email protected]>
---

vmlinux.lds.S | 59 +++++++++++++++++++++++++++++++++++++---------------------
1 files changed, 38 insertions(+), 21 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-vmlinux-fix-physical-addrs/arch/i386/kernel/vmlinux.lds.S linux-2.6.11-rc1-mm1-nokexec-x86-vmlinux-fix-physical-addrs/arch/i386/kernel/vmlinux.lds.S
--- linux-2.6.11-rc1-mm1-nokexec-vmlinux-fix-physical-addrs/arch/i386/kernel/vmlinux.lds.S Mon Oct 18 15:53:43 2004
+++ linux-2.6.11-rc1-mm1-nokexec-x86-vmlinux-fix-physical-addrs/arch/i386/kernel/vmlinux.lds.S Tue Jan 18 22:45:51 2005
@@ -2,20 +2,23 @@
* Written by Martin Mares <[email protected]>;
*/

+#define LOAD_OFFSET __PAGE_OFFSET
+
#include <asm-generic/vmlinux.lds.h>
#include <asm/thread_info.h>
#include <asm/page.h>

OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
-ENTRY(startup_32)
+ENTRY(phys_startup_32)
jiffies = jiffies_64;
SECTIONS
{
- . = __PAGE_OFFSET + 0x100000;
+ . = LOAD_OFFSET + 0x100000;
+ phys_startup_32 = startup_32 - LOAD_OFFSET;
/* read-only */
_text = .; /* Text and read-only data */
- .text : {
+ .text : AT(ADDR(.text) - LOAD_OFFSET) {
*(.text)
SCHED_TEXT
LOCK_TEXT
@@ -27,49 +30,55 @@

. = ALIGN(16); /* Exception table */
__start___ex_table = .;
- __ex_table : { *(__ex_table) }
+ __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { *(__ex_table) }
__stop___ex_table = .;

RODATA

/* writeable */
- .data : { /* Data */
+ .data : AT(ADDR(.data) - LOAD_OFFSET) { /* Data */
*(.data)
CONSTRUCTORS
}

. = ALIGN(4096);
__nosave_begin = .;
- .data_nosave : { *(.data.nosave) }
+ .data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) { *(.data.nosave) }
. = ALIGN(4096);
__nosave_end = .;

. = ALIGN(4096);
- .data.page_aligned : { *(.data.idt) }
+ .data.page_aligned : AT(ADDR(.data.page_aligned) - LOAD_OFFSET) {
+ *(.data.idt)
+ }

. = ALIGN(32);
- .data.cacheline_aligned : { *(.data.cacheline_aligned) }
+ .data.cacheline_aligned : AT(ADDR(.data.cacheline_aligned) - LOAD_OFFSET) {
+ *(.data.cacheline_aligned)
+ }

_edata = .; /* End of data section */

. = ALIGN(THREAD_SIZE); /* init_task */
- .data.init_task : { *(.data.init_task) }
+ .data.init_task : AT(ADDR(.data.init_task) - LOAD_OFFSET) {
+ *(.data.init_task)
+ }

/* will be freed after init */
. = ALIGN(4096); /* Init code and data */
__init_begin = .;
- .init.text : {
+ .init.text : AT(ADDR(.init.text) - LOAD_OFFSET) {
_sinittext = .;
*(.init.text)
_einittext = .;
}
- .init.data : { *(.init.data) }
+ .init.data : AT(ADDR(.init.data) - LOAD_OFFSET) { *(.init.data) }
. = ALIGN(16);
__setup_start = .;
- .init.setup : { *(.init.setup) }
+ .init.setup : AT(ADDR(.init.setup) - LOAD_OFFSET) { *(.init.setup) }
__setup_end = .;
__initcall_start = .;
- .initcall.init : {
+ .initcall.init : AT(ADDR(.initcall.init) - LOAD_OFFSET) {
*(.initcall1.init)
*(.initcall2.init)
*(.initcall3.init)
@@ -80,33 +89,41 @@
}
__initcall_end = .;
__con_initcall_start = .;
- .con_initcall.init : { *(.con_initcall.init) }
+ .con_initcall.init : AT(ADDR(.con_initcall.init) - LOAD_OFFSET) {
+ *(.con_initcall.init)
+ }
__con_initcall_end = .;
SECURITY_INIT
. = ALIGN(4);
__alt_instructions = .;
- .altinstructions : { *(.altinstructions) }
+ .altinstructions : AT(ADDR(.altinstructions) - LOAD_OFFSET) {
+ *(.altinstructions)
+ }
__alt_instructions_end = .;
- .altinstr_replacement : { *(.altinstr_replacement) }
+ .altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
+ *(.altinstr_replacement)
+ }
/* .exit.text is discard at runtime, not link time, to deal with references
from .altinstructions and .eh_frame */
- .exit.text : { *(.exit.text) }
- .exit.data : { *(.exit.data) }
+ .exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { *(.exit.text) }
+ .exit.data : AT(ADDR(.exit.data) - LOAD_OFFSET) { *(.exit.data) }
. = ALIGN(4096);
__initramfs_start = .;
- .init.ramfs : { *(.init.ramfs) }
+ .init.ramfs : AT(ADDR(.init.ramfs) - LOAD_OFFSET) { *(.init.ramfs) }
__initramfs_end = .;
. = ALIGN(32);
__per_cpu_start = .;
- .data.percpu : { *(.data.percpu) }
+ .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { *(.data.percpu) }
__per_cpu_end = .;
. = ALIGN(4096);
__init_end = .;
/* freed after init ends here */

__bss_start = .; /* BSS */
- .bss : {
+ .bss.page_aligned : AT(ADDR(.bss.page_aligned) - LOAD_OFFSET) {
*(.bss.page_aligned)
+ }
+ .bss : AT(ADDR(.bss) - LOAD_OFFSET) {
*(.bss)
}
. = ALIGN(4);

2005-01-19 07:43:30

by Eric W. Biederman

Subject: [PATCH 5/29] x86_64-i8259-shutdown


From: Eric W. Biederman <[email protected]>

The following patch simply adds a shutdown method to the x86_64 i8259 code.
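
To show what the two port writes in the shutdown method amount to, here is a small user-space model. It is a sketch only: sim_outb() and sim_imr[] are hypothetical replacements for real port I/O, and the model covers nothing but the interrupt mask registers behind ports 0x21 (8259A-1) and 0xA1 (8259A-2).

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_IMR 0x21	/* data port of 8259A-1 */
#define PIC_SLAVE_IMR  0xA1	/* data port of 8259A-2 */

/* Hypothetical model of the two mask registers: [0] master, [1] slave. */
static uint8_t sim_imr[2];

static void sim_outb(uint8_t value, uint16_t port)
{
	if (port == PIC_MASTER_IMR)
		sim_imr[0] = value;
	else if (port == PIC_SLAVE_IMR)
		sim_imr[1] = value;
}

/* IRQ n (0..15) can be delivered only if its mask bit is clear. */
static int sim_irq_enabled(unsigned int irq)
{
	return irq < 16 && !(sim_imr[irq >> 3] & (1u << (irq & 7)));
}

/* Same two writes as the shutdown method: mask all sixteen lines. */
static void sim_i8259A_shutdown(void)
{
	sim_outb(0xff, PIC_MASTER_IMR);
	sim_outb(0xff, PIC_SLAVE_IMR);
}
```

After sim_i8259A_shutdown() no IRQ line is enabled in the model, and a later initialization can unmask lines again by writing new mask values.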

Signed-off-by: Eric Biederman <[email protected]>
---

i8259.c | 12 ++++++++++++
1 files changed, 12 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-i8259-shutdown/arch/x86_64/kernel/i8259.c linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/arch/x86_64/kernel/i8259.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-i8259-shutdown/arch/x86_64/kernel/i8259.c Fri Jan 14 04:32:23 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-i8259-shutdown/arch/x86_64/kernel/i8259.c Tue Jan 18 22:44:43 2005
@@ -416,10 +416,22 @@
return 0;
}

+static int i8259A_shutdown(struct sys_device *dev)
+{
+ /* Put the i8259A into a quiescent state that
+ * the kernel initialization code can get it
+ * out of.
+ */
+ outb(0xff, 0x21); /* mask all of 8259A-1 */
+ outb(0xff, 0xA1); /* mask all of 8259A-2 */
+ return 0;
+}
+
static struct sysdev_class i8259_sysdev_class = {
set_kset_name("i8259"),
.suspend = i8259A_suspend,
.resume = i8259A_resume,
+ .shutdown = i8259A_shutdown,
};

static struct sys_device device_i8259A = {

2005-01-19 07:49:37

by Eric W. Biederman

Subject: [PATCH 14/29] kexec-kexec-generic


This patch introduces the architecture independent implementation
of the sys_kexec_load and compat_sys_kexec_load system calls.
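
The compat path essentially has to widen each 32-bit segment descriptor into the native layout. A rough user-space sketch of that conversion follows; the sim_* structures only mirror struct compat_kexec_segment and struct kexec_segment from this patch, and the pointer handling (plain zero-extension) is an assumption about the architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout a 32-bit caller passes in; mirrors struct compat_kexec_segment. */
struct sim_compat_segment {
	uint32_t buf;	/* compat_uptr_t */
	uint32_t bufsz;	/* compat_size_t */
	uint32_t mem;	/* compat_ulong_t */
	uint32_t memsz;	/* compat_size_t */
};

/* Native layout; mirrors struct kexec_segment. */
struct sim_segment {
	void *buf;
	size_t bufsz;
	unsigned long mem;
	size_t memsz;
};

/* Widen one descriptor field by field. */
static struct sim_segment sim_widen(const struct sim_compat_segment *in)
{
	struct sim_segment out;

	/* Assumption: the 32-bit user pointer is simply zero-extended. */
	out.buf = (void *)(uintptr_t)in->buf;
	out.bufsz = in->bufsz;
	out.mem = in->mem;
	out.memsz = in->memsz;
	return out;
}
```

The field-by-field copy is needed because the two layouts differ in size on a 64-bit kernel, so the user array cannot be copied in one piece.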

Kexec on panic support has been integrated into the core patch and
is relatively clean.

In addition the hopefully architecture independent option
crashkernel=size@location has been documented. Its purpose
is to reserve space for the panic kernel to live in, a region
that no DMA transfer will ever be set up to access.
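
For illustration, the nn[KMG]@ss[KMG] syntax can be parsed along these lines. This is a user-space sketch, not the code from the architecture patches: sim_memparse() is a hypothetical stand-in for the kernel's memparse() helper, and whether the arch code uses that helper is not shown in this message.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a memparse()-style helper: number plus optional K/M/G. */
static unsigned long long sim_memparse(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;	/* consume the suffix */
		break;
	default:
		break;
	}
	return v;
}

/* Parse "size@base"; returns 0 on success, -1 on malformed input. */
static int sim_parse_crashkernel(const char *arg,
				 unsigned long long *size,
				 unsigned long long *base)
{
	char *p;

	*size = sim_memparse(arg, &p);
	if (p == arg || *p != '@')
		return -1;
	*base = sim_memparse(p + 1, &p);
	return 0;
}
```

For example, "64M@16M" yields a size of 64MB and a base of 16MB, while "64M" alone is rejected because the location part is missing.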

Signed-off-by: Eric Biederman <[email protected]>
---

Documentation/kernel-parameters.txt | 4
MAINTAINERS | 11
include/linux/kexec.h | 128 ++++
include/linux/reboot.h | 3
include/linux/syscalls.h | 5
kernel/Makefile | 1
kernel/kexec.c | 1036 ++++++++++++++++++++++++++++++++++++
kernel/panic.c | 11
kernel/sys.c | 20
kernel/sys_ni.c | 2
10 files changed, 1219 insertions(+), 2 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/Documentation/kernel-parameters.txt linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/Documentation/kernel-parameters.txt
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/Documentation/kernel-parameters.txt Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/Documentation/kernel-parameters.txt Tue Jan 18 22:47:13 2005
@@ -341,6 +341,10 @@
cpia_pp= [HW,PPT]
Format: { parport<nr> | auto | none }

+ crashkernel=nn[KMG]@ss[KMG]
+ [KNL] Reserve a chunk of physical memory to
+ hold a kernel to switch to with kexec on panic.
+
cs4232= [HW,OSS]
Format: <io>,<irq>,<dma>,<dma2>,<mpuio>,<mpuirq>

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/MAINTAINERS linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/MAINTAINERS
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/MAINTAINERS Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/MAINTAINERS Tue Jan 18 22:47:13 2005
@@ -1318,6 +1318,17 @@
L: [email protected]
S: Maintained

+KEXEC
+P: Eric Biederman
+P: Randy Dunlap
+M: [email protected]
+M: [email protected]
+W: http://www.xmission.com/~ebiederm/files/kexec/
+W: http://developer.osdl.org/rddunlap/kexec/
+L: [email protected]
+L: [email protected]
+S: Maintained
+
LANMEDIA WAN CARD DRIVER
P: Andrew Stanley-Jones
M: [email protected]
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/include/linux/kexec.h linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/include/linux/kexec.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/include/linux/kexec.h Tue Jan 18 22:55:53 2005
@@ -0,0 +1,128 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#ifdef CONFIG_KEXEC
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/linkage.h>
+#include <linux/compat.h>
+#include <asm/kexec.h>
+
+/* Verify architecture specific macros are defined */
+
+#ifndef KEXEC_SOURCE_MEMORY_LIMIT
+#error KEXEC_SOURCE_MEMORY_LIMIT not defined
+#endif
+
+#ifndef KEXEC_DESTINATION_MEMORY_LIMIT
+#error KEXEC_DESTINATION_MEMORY_LIMIT not defined
+#endif
+
+#ifndef KEXEC_CONTROL_MEMORY_LIMIT
+#error KEXEC_CONTROL_MEMORY_LIMIT not defined
+#endif
+
+#ifndef KEXEC_CONTROL_CODE_SIZE
+#error KEXEC_CONTROL_CODE_SIZE not defined
+#endif
+
+#ifndef KEXEC_ARCH
+#error KEXEC_ARCH not defined
+#endif
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+#define KEXEC_SEGMENT_MAX 8
+struct kexec_segment {
+ void __user *buf;
+ size_t bufsz;
+ unsigned long mem; /* User space sees this as a (void *) ... */
+ size_t memsz;
+};
+
+#ifdef CONFIG_COMPAT
+struct compat_kexec_segment {
+ compat_uptr_t buf;
+ compat_size_t bufsz;
+ compat_ulong_t mem; /* User space sees this as a (void *) ... */
+ compat_size_t memsz;
+};
+#endif
+
+struct kimage {
+ kimage_entry_t head;
+ kimage_entry_t *entry;
+ kimage_entry_t *last_entry;
+
+ unsigned long destination;
+
+ unsigned long start;
+ struct page *control_code_page;
+
+ unsigned long nr_segments;
+ struct kexec_segment segment[KEXEC_SEGMENT_MAX];
+
+ struct list_head control_pages;
+ struct list_head dest_pages;
+ struct list_head unuseable_pages;
+
+ /* Address of next control page to allocate for crash kernels. */
+ unsigned long control_page;
+
+ /* Flags to indicate special processing */
+ unsigned int type : 1;
+#define KEXEC_TYPE_DEFAULT 0
+#define KEXEC_TYPE_CRASH 1
+};
+
+
+
+/* kexec interface functions */
+extern NORET_TYPE void machine_kexec(struct kimage *image) ATTRIB_NORET;
+extern int machine_kexec_prepare(struct kimage *image);
+extern void machine_kexec_cleanup(struct kimage *image);
+extern asmlinkage long sys_kexec_load(unsigned long entry,
+ unsigned long nr_segments, struct kexec_segment __user *segments,
+ unsigned long flags);
+#ifdef CONFIG_COMPAT
+extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
+ unsigned long nr_segments, struct compat_kexec_segment __user *segments,
+ unsigned long flags);
+#endif
+extern struct page *kimage_alloc_control_pages(struct kimage *image, unsigned int order);
+extern void crash_kexec(void);
+extern struct kimage *kexec_image;
+extern struct kimage *kexec_crash_image;
+
+#define KEXEC_ON_CRASH 0x00000001
+#define KEXEC_ARCH_MASK 0xffff0000
+
+/* These values match the ELF architecture values.
+ * Unless there is a good reason that should continue to be the case.
+ */
+#define KEXEC_ARCH_DEFAULT ( 0 << 16)
+#define KEXEC_ARCH_386 ( 3 << 16)
+#define KEXEC_ARCH_X86_64 (62 << 16)
+#define KEXEC_ARCH_PPC (20 << 16)
+#define KEXEC_ARCH_PPC64 (21 << 16)
+#define KEXEC_ARCH_IA_64 (50 << 16)
+
+#define KEXEC_FLAGS (KEXEC_ON_CRASH) /* List of defined/legal kexec flags */
+
+/* Location of a reserved region to hold the crash kernel.
+ */
+extern struct resource crashk_res;
+
+#else /* !CONFIG_KEXEC */
+static inline void crash_kexec(void) { }
+#endif /* CONFIG_KEXEC */
+#endif /* LINUX_KEXEC_H */
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/include/linux/reboot.h linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/include/linux/reboot.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/include/linux/reboot.h Mon Oct 18 15:55:36 2004
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/include/linux/reboot.h Tue Jan 18 22:47:13 2005
@@ -51,6 +51,9 @@
extern void machine_halt(void);
extern void machine_power_off(void);

+extern void machine_shutdown(void);
+extern void machine_crash_shutdown(void);
+
#endif

#endif /* _LINUX_REBOOT_H */
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/include/linux/syscalls.h linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/include/linux/syscalls.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/include/linux/syscalls.h Fri Jan 14 04:28:49 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/include/linux/syscalls.h Tue Jan 18 22:47:13 2005
@@ -159,8 +159,9 @@
asmlinkage long sys_reboot(int magic1, int magic2, unsigned int cmd,
void __user *arg);
asmlinkage long sys_restart_syscall(void);
-asmlinkage long sys_kexec_load(void *entry, unsigned long nr_segments,
- struct kexec_segment *segments, unsigned long flags);
+asmlinkage long sys_kexec_load(unsigned long entry,
+ unsigned long nr_segments, struct kexec_segment __user *segments,
+ unsigned long flags);

asmlinkage long sys_exit(int error_code);
asmlinkage void sys_exit_group(int error_code);
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/Makefile linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/Makefile
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/Makefile Fri Jan 14 04:32:28 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/Makefile Tue Jan 18 22:47:13 2005
@@ -17,6 +17,7 @@
obj-$(CONFIG_KALLSYMS) += kallsyms.o
obj-$(CONFIG_PM) += power/
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
+obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_LTT) += ltt-core.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CPUSETS) += cpuset.o
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/kexec.c linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/kexec.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/kexec.c Tue Jan 18 22:47:13 2005
@@ -0,0 +1,1036 @@
+/*
+ * kexec.c - kexec system call
+ * Copyright (C) 2002-2004 Eric Biederman <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/mm.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/kexec.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/highmem.h>
+#include <linux/syscalls.h>
+#include <linux/reboot.h>
+#include <linux/syscalls.h>
+#include <linux/ioport.h>
+#include <asm/page.h>
+#include <asm/uaccess.h>
+#include <asm/io.h>
+#include <asm/system.h>
+#include <asm/semaphore.h>
+
+/* Location of the reserved area for the crash kernel */
+struct resource crashk_res = {
+ .name = "Crash kernel",
+ .start = 0,
+ .end = 0,
+ .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+};
+
+/*
+ * When kexec transitions to the new kernel there is a one-to-one
+ * mapping between physical and virtual addresses. On processors
+ * where you can disable the MMU this is trivial, and easy. For
+ * others it is still a simple predictable page table to setup.
+ *
+ * In that environment kexec copies the new kernel to its final
+ * resting place. This means I can only support memory whose
+ * physical address can fit in an unsigned long. In particular
+ * addresses where (pfn << PAGE_SHIFT) > ULONG_MAX cannot be handled.
+ * If the assembly stub has more restrictive requirements
+ * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT can be
+ * defined more restrictively in <asm/kexec.h>.
+ *
+ * The code for the transition from the current kernel to the
+ * new kernel is placed in the control_code_buffer, whose size
+ * is given by KEXEC_CONTROL_CODE_SIZE. In the best case only a single
+ * page of memory is necessary, but some architectures require more.
+ * Because this memory must be identity mapped in the transition from
+ * virtual to physical addresses it must live in the range
+ * 0 - TASK_SIZE, as only the user space mappings are arbitrarily
+ * modifiable.
+ *
+ * The assembly stub in the control code buffer is passed a linked list
+ * of descriptor pages detailing the source pages of the new kernel,
+ * and the destination addresses of those source pages. As this data
+ * structure is not used in the context of the current OS, it must
+ * be self-contained.
+ *
+ * The code has been made to work with highmem pages and will use a
+ * destination page in its final resting place (if it happens
+ * to allocate it). The end product of this is that most of the
+ * physical address space, and most of RAM can be used.
+ *
+ * Future directions include:
+ * - allocating a page table with the control code buffer identity
+ * mapped, to simplify machine_kexec and make kexec_on_panic more
+ * reliable.
+ */
+
+/*
+ * KIMAGE_NO_DEST is an impossible destination address..., for
+ * allocating pages whose destination address we do not care about.
+ */
+#define KIMAGE_NO_DEST (-1UL)
+
+static int kimage_is_destination_range(
+ struct kimage *image, unsigned long start, unsigned long end);
+static struct page *kimage_alloc_page(struct kimage *image, unsigned int gfp_mask, unsigned long dest);
+
+static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
+ unsigned long nr_segments, struct kexec_segment __user *segments)
+{
+ size_t segment_bytes;
+ struct kimage *image;
+ unsigned long i;
+ int result;
+
+ /* Allocate a controlling structure */
+ result = -ENOMEM;
+ image = kmalloc(sizeof(*image), GFP_KERNEL);
+ if (!image) {
+ goto out;
+ }
+ memset(image, 0, sizeof(*image));
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+ image->control_page = ~0; /* By default this does not apply */
+ image->start = entry;
+ image->type = KEXEC_TYPE_DEFAULT;
+
+ /* Initialize the list of control pages */
+ INIT_LIST_HEAD(&image->control_pages);
+
+ /* Initialize the list of destination pages */
+ INIT_LIST_HEAD(&image->dest_pages);
+
+ /* Initialize the list of unuseable pages */
+ INIT_LIST_HEAD(&image->unuseable_pages);
+
+ /* Read in the segments */
+ image->nr_segments = nr_segments;
+ segment_bytes = nr_segments * sizeof(*segments);
+ result = -EFAULT;
+ if (copy_from_user(image->segment, segments, segment_bytes))
+ goto out;
+
+ /*
+ * Verify we have good destination addresses. The caller is
+ * responsible for making certain we don't attempt to load
+ * the new image into invalid or reserved areas of RAM. This
+ * just verifies it is an address we can use.
+ *
+ * Since the kernel does everything in page size chunks ensure
+ * the destination addresses are page aligned. Too many
+ * special cases crop up when we don't do this. The most
+ * insidious is getting overlapping destination addresses
+ * simply because addresses are changed to page size
+ * granularity.
+ */
+ result = -EADDRNOTAVAIL;
+ for (i = 0; i < nr_segments; i++) {
+ unsigned long mstart, mend;
+ mstart = image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz;
+ if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK))
+ goto out;
+ if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
+ goto out;
+ }
+
+ /* Verify our destination addresses do not overlap.
+ * If we allowed overlapping destination addresses
+ * through, very weird things can happen with no
+ * easy explanation as one segment stomps on another.
+ */
+ result = -EINVAL;
+ for(i = 0; i < nr_segments; i++) {
+ unsigned long mstart, mend;
+ unsigned long j;
+ mstart = image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz;
+ for(j = 0; j < i; j++) {
+ unsigned long pstart, pend;
+ pstart = image->segment[j].mem;
+ pend = pstart + image->segment[j].memsz;
+ /* Do the segments overlap ? */
+ if ((mend > pstart) && (mstart < pend))
+ goto out;
+ }
+ }
+
+ /* Ensure our buffer sizes are no larger than
+ * our memory sizes. This should always be the case,
+ * and it is easier to check up front than to be surprised
+ * later on.
+ */
+ result = -EINVAL;
+ for(i = 0; i < nr_segments; i++) {
+ if (image->segment[i].bufsz > image->segment[i].memsz)
+ goto out;
+ }
+
+
+ result = 0;
+ out:
+ if (result == 0) {
+ *rimage = image;
+ } else {
+ kfree(image);
+ }
+ return result;
+
+}
+
+static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
+ unsigned long nr_segments, struct kexec_segment __user *segments)
+{
+ int result;
+ struct kimage *image;
+
+ /* Allocate and initialize a controlling structure */
+ image = NULL;
+ result = do_kimage_alloc(&image, entry, nr_segments, segments);
+ if (result) {
+ goto out;
+ }
+ *rimage = image;
+
+ /*
+ * Find a location for the control code buffer, and add it
+ * to the vector of segments so that its pages will also be
+ * counted as destination pages.
+ */
+ result = -ENOMEM;
+ image->control_code_page = kimage_alloc_control_pages(image,
+ get_order(KEXEC_CONTROL_CODE_SIZE));
+ if (!image->control_code_page) {
+ printk(KERN_ERR "Could not allocate control_code_buffer\n");
+ goto out;
+ }
+
+ result = 0;
+ out:
+ if (result == 0) {
+ *rimage = image;
+ } else {
+ kfree(image);
+ }
+ return result;
+}
+
+static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
+ unsigned long nr_segments, struct kexec_segment __user *segments)
+{
+ int result;
+ struct kimage *image;
+ unsigned long i;
+
+ image = NULL;
+ /* Verify we have a valid entry point */
+ if ((entry < crashk_res.start) || (entry > crashk_res.end)) {
+ result = -EADDRNOTAVAIL;
+ goto out;
+ }
+
+ /* Allocate and initialize a controlling structure */
+ result = do_kimage_alloc(&image, entry, nr_segments, segments);
+ if (result) {
+ goto out;
+ }
+
+ /* Enable the special crash kernel control page
+ * allocation policy.
+ */
+ image->control_page = crashk_res.start;
+ image->type = KEXEC_TYPE_CRASH;
+
+ /*
+ * Verify we have good destination addresses. Normally
+ * the caller is responsible for making certain we don't
+ * attempt to load the new image into invalid or reserved
+ * areas of RAM. But crash kernels are preloaded into a
+ * reserved area of RAM. We must ensure the addresses
+ * are in the reserved area otherwise preloading the
+ * kernel could corrupt things.
+ */
+ result = -EADDRNOTAVAIL;
+ for (i = 0; i < nr_segments; i++) {
+ unsigned long mstart, mend;
+ mstart = image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz;
+ /* Ensure we are within the crash kernel limits */
+ if ((mstart < crashk_res.start) || (mend > crashk_res.end))
+ goto out;
+ }
+
+
+ /*
+ * Find a location for the control code buffer, and add it to
+ * the vector of segments so that its pages will also be
+ * counted as destination pages.
+ */
+ result = -ENOMEM;
+ image->control_code_page = kimage_alloc_control_pages(image,
+ get_order(KEXEC_CONTROL_CODE_SIZE));
+ if (!image->control_code_page) {
+ printk(KERN_ERR "Could not allocate control_code_buffer\n");
+ goto out;
+ }
+
+ result = 0;
+ out:
+ if (result == 0) {
+ *rimage = image;
+ } else {
+ kfree(image);
+ }
+ return result;
+}
+
+static int kimage_is_destination_range(
+ struct kimage *image, unsigned long start, unsigned long end)
+{
+ unsigned long i;
+
+ for (i = 0; i < image->nr_segments; i++) {
+ unsigned long mstart, mend;
+ mstart = image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz;
+ if ((end > mstart) && (start < mend)) {
+ return 1;
+ }
+ }
+ return 0;
+}
+
+static struct page *kimage_alloc_pages(unsigned int gfp_mask, unsigned int order)
+{
+ struct page *pages;
+ pages = alloc_pages(gfp_mask, order);
+ if (pages) {
+ unsigned int count, i;
+ pages->mapping = NULL;
+ pages->private = order;
+ count = 1 << order;
+ for(i = 0; i < count; i++) {
+ SetPageReserved(pages + i);
+ }
+ }
+ return pages;
+}
+
+static void kimage_free_pages(struct page *page)
+{
+ unsigned int order, count, i;
+ order = page->private;
+ count = 1 << order;
+ for(i = 0; i < count; i++) {
+ ClearPageReserved(page + i);
+ }
+ __free_pages(page, order);
+}
+
+static void kimage_free_page_list(struct list_head *list)
+{
+ struct list_head *pos, *next;
+ list_for_each_safe(pos, next, list) {
+ struct page *page;
+
+ page = list_entry(pos, struct page, lru);
+ list_del(&page->lru);
+
+ kimage_free_pages(page);
+ }
+}
+
+static struct page *kimage_alloc_normal_control_pages(
+ struct kimage *image, unsigned int order)
+{
+ /* Control pages are special, they are the intermediaries
+ * that are needed while we copy the rest of the pages
+ * to their final resting place. As such they must
+ * not conflict with either the destination addresses
+ * or memory the kernel is already using.
+ *
+ * The only case where we really need more than one of
+ * these is for architectures where we cannot disable
+ * the MMU and must instead generate an identity mapped
+ * page table for all of the memory.
+ *
+ * At worst this runs in O(N) of the image size.
+ */
+ struct list_head extra_pages;
+ struct page *pages;
+ unsigned int count;
+
+ count = 1 << order;
+ INIT_LIST_HEAD(&extra_pages);
+
+ /* Loop while I can allocate a page and the page allocated
+ * is a destination page.
+ */
+ do {
+ unsigned long pfn, epfn, addr, eaddr;
+ pages = kimage_alloc_pages(GFP_KERNEL, order);
+ if (!pages)
+ break;
+ pfn = page_to_pfn(pages);
+ epfn = pfn + count;
+ addr = pfn << PAGE_SHIFT;
+ eaddr = epfn << PAGE_SHIFT;
+ if ((epfn >= (KEXEC_CONTROL_MEMORY_LIMIT >> PAGE_SHIFT)) ||
+ kimage_is_destination_range(image, addr, eaddr))
+ {
+ list_add(&pages->lru, &extra_pages);
+ pages = NULL;
+ }
+ } while(!pages);
+ if (pages) {
+ /* Remember the allocated page... */
+ list_add(&pages->lru, &image->control_pages);
+
+ /* Because the page is already in its destination
+ * location we will never allocate another page at
+ * that address. Therefore kimage_alloc_pages
+ * will not return it (again) and we don't need
+ * to give it an entry in image->segment[].
+ */
+ }
+ /* Deal with the destination pages I have inadvertently allocated.
+ *
+ * Ideally I would convert multi-page allocations into single
+ * page allocations, and add everything to image->dest_pages.
+ *
+ * For now it is simpler to just free the pages.
+ */
+ kimage_free_page_list(&extra_pages);
+ return pages;
+
+}
+
+static struct page *kimage_alloc_crash_control_pages(
+ struct kimage *image, unsigned int order)
+{
+ /* Control pages are special, they are the intermediaries
+ * that are needed while we copy the rest of the pages
+ * to their final resting place. As such they must
+ * not conflict with either the destination addresses
+ * or memory the kernel is already using.
+ *
+ * Control pages are also the only pages we must allocate
+ * when loading a crash kernel. All of the other pages
+ * are specified by the segments and we just memcpy
+ * into them directly.
+ *
+ * The only case where we really need more than one of
+ * these is for architectures where we cannot disable
+ * the MMU and must instead generate an identity mapped
+ * page table for all of the memory.
+ *
+ * Given the low demand this implements a very simple
+ * allocator that finds the first hole of the appropriate
+ * size in the reserved memory region, and allocates all
+ * of the memory up to and including the hole.
+ */
+ unsigned long hole_start, hole_end, size;
+ struct page *pages;
+ pages = NULL;
+ size = (1 << order) << PAGE_SHIFT;
+ hole_start = (image->control_page + (size - 1)) & ~(size - 1);
+ hole_end = hole_start + size - 1;
+ while(hole_end <= crashk_res.end) {
+ unsigned long i;
+ if (hole_end > KEXEC_CONTROL_MEMORY_LIMIT) {
+ break;
+ }
+ if (hole_end > crashk_res.end) {
+ break;
+ }
+ /* See if I overlap any of the segments */
+ for(i = 0; i < image->nr_segments; i++) {
+ unsigned long mstart, mend;
+ mstart = image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz - 1;
+ if ((hole_end >= mstart) && (hole_start <= mend)) {
+ /* Advance the hole to the end of the segment */
+ hole_start = (mend + (size - 1)) & ~(size - 1);
+ hole_end = hole_start + size - 1;
+ break;
+ }
+ }
+ /* If I don't overlap any segments I have found my hole! */
+ if (i == image->nr_segments) {
+ pages = pfn_to_page(hole_start >> PAGE_SHIFT);
+ break;
+ }
+ }
+ if (pages) {
+ image->control_page = hole_end;
+ }
+ return pages;
+}
+
+
+struct page *kimage_alloc_control_pages(
+ struct kimage *image, unsigned int order)
+{
+ struct page *pages = NULL;
+ switch(image->type) {
+ case KEXEC_TYPE_DEFAULT:
+ pages = kimage_alloc_normal_control_pages(image, order);
+ break;
+ case KEXEC_TYPE_CRASH:
+ pages = kimage_alloc_crash_control_pages(image, order);
+ break;
+ }
+ return pages;
+}
+
+static int kimage_add_entry(struct kimage *image, kimage_entry_t entry)
+{
+ if (*image->entry != 0) {
+ image->entry++;
+ }
+ if (image->entry == image->last_entry) {
+ kimage_entry_t *ind_page;
+ struct page *page;
+ page = kimage_alloc_page(image, GFP_KERNEL, KIMAGE_NO_DEST);
+ if (!page) {
+ return -ENOMEM;
+ }
+ ind_page = page_address(page);
+ *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION;
+ image->entry = ind_page;
+ image->last_entry =
+ ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1);
+ }
+ *image->entry = entry;
+ image->entry++;
+ *image->entry = 0;
+ return 0;
+}
+
+static int kimage_set_destination(
+ struct kimage *image, unsigned long destination)
+{
+ int result;
+
+ destination &= PAGE_MASK;
+ result = kimage_add_entry(image, destination | IND_DESTINATION);
+ if (result == 0) {
+ image->destination = destination;
+ }
+ return result;
+}
+
+
+static int kimage_add_page(struct kimage *image, unsigned long page)
+{
+ int result;
+
+ page &= PAGE_MASK;
+ result = kimage_add_entry(image, page | IND_SOURCE);
+ if (result == 0) {
+ image->destination += PAGE_SIZE;
+ }
+ return result;
+}
+
+
+static void kimage_free_extra_pages(struct kimage *image)
+{
+ /* Walk through and free any extra destination pages I may have */
+ kimage_free_page_list(&image->dest_pages);
+
+ /* Walk through and free any unuseable pages I have cached */
+ kimage_free_page_list(&image->unuseable_pages);
+
+}
+static int kimage_terminate(struct kimage *image)
+{
+ if (*image->entry != 0) {
+ image->entry++;
+ }
+ *image->entry = IND_DONE;
+ return 0;
+}
+
+#define for_each_kimage_entry(image, ptr, entry) \
+ for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \
+ ptr = (entry & IND_INDIRECTION)? \
+ phys_to_virt((entry & PAGE_MASK)): ptr +1)
+
+static void kimage_free_entry(kimage_entry_t entry)
+{
+ struct page *page;
+
+ page = pfn_to_page(entry >> PAGE_SHIFT);
+ kimage_free_pages(page);
+}
+
+static void kimage_free(struct kimage *image)
+{
+ kimage_entry_t *ptr, entry;
+ kimage_entry_t ind = 0;
+
+ if (!image)
+ return;
+ kimage_free_extra_pages(image);
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_INDIRECTION) {
+ /* Free the previous indirection page */
+ if (ind & IND_INDIRECTION) {
+ kimage_free_entry(ind);
+ }
+ /* Save this indirection page until we are
+ * done with it.
+ */
+ ind = entry;
+ }
+ else if (entry & IND_SOURCE) {
+ kimage_free_entry(entry);
+ }
+ }
+ /* Free the final indirection page */
+ if (ind & IND_INDIRECTION) {
+ kimage_free_entry(ind);
+ }
+
+ /* Handle any machine specific cleanup */
+ machine_kexec_cleanup(image);
+
+ /* Free the kexec control pages... */
+ kimage_free_page_list(&image->control_pages);
+ kfree(image);
+}
+
+static kimage_entry_t *kimage_dst_used(struct kimage *image, unsigned long page)
+{
+ kimage_entry_t *ptr, entry;
+ unsigned long destination = 0;
+
+ for_each_kimage_entry(image, ptr, entry) {
+ if (entry & IND_DESTINATION) {
+ destination = entry & PAGE_MASK;
+ }
+ else if (entry & IND_SOURCE) {
+ if (page == destination) {
+ return ptr;
+ }
+ destination += PAGE_SIZE;
+ }
+ }
+ return 0;
+}
+
+static struct page *kimage_alloc_page(struct kimage *image, unsigned int gfp_mask, unsigned long destination)
+{
+ /*
+ * Here we implement safeguards to ensure that a source page
+ * is not copied to its destination page before the data on
+ * the destination page is no longer useful.
+ *
+ * To do this we maintain the invariant that a source page is
+ * either its own destination page, or it is not a
+ * destination page at all.
+ *
+ * That is slightly stronger than required, but the proof
+ * that no problems will occur is trivial, and the
+ * implementation is simple to verify.
+ *
+ * When allocating all pages normally this algorithm will run
+ * in O(N) time, but in the worst case it will run in O(N^2)
+ * time. If the runtime is a problem the data structures can
+ * be fixed.
+ */
+ struct page *page;
+ unsigned long addr;
+
+ /*
+ * Walk through the list of destination pages, and see if I
+ * have a match.
+ */
+ list_for_each_entry(page, &image->dest_pages, lru) {
+ addr = page_to_pfn(page) << PAGE_SHIFT;
+ if (addr == destination) {
+ list_del(&page->lru);
+ return page;
+ }
+ }
+ page = NULL;
+ while (1) {
+ kimage_entry_t *old;
+
+ /* Allocate a page, if we run out of memory give up */
+ page = kimage_alloc_pages(gfp_mask, 0);
+ if (!page) {
+ return 0;
+ }
+ /* If the page cannot be used file it away */
+ if (page_to_pfn(page) > (KEXEC_SOURCE_MEMORY_LIMIT >> PAGE_SHIFT)) {
+ list_add(&page->lru, &image->unuseable_pages);
+ continue;
+ }
+ addr = page_to_pfn(page) << PAGE_SHIFT;
+
+ /* If it is the destination page we want use it */
+ if (addr == destination)
+ break;
+
+ /* If the page is not a destination page use it */
+ if (!kimage_is_destination_range(image, addr, addr + PAGE_SIZE))
+ break;
+
+ /*
+ * I know that the page is someone's destination page.
+ * See if there is already a source page for this
+ * destination page. And if so swap the source pages.
+ */
+ old = kimage_dst_used(image, addr);
+ if (old) {
+ /* If so move it */
+ unsigned long old_addr;
+ struct page *old_page;
+
+ old_addr = *old & PAGE_MASK;
+ old_page = pfn_to_page(old_addr >> PAGE_SHIFT);
+ copy_highpage(page, old_page);
+ *old = addr | (*old & ~PAGE_MASK);
+
+ /* The old page I have found cannot be a
+ * destination page, so return it.
+ */
+ addr = old_addr;
+ page = old_page;
+ break;
+ }
+ else {
+ /* Place the page on the destination list; I
+ * will use it later.
+ */
+ list_add(&page->lru, &image->dest_pages);
+ }
+ }
+ return page;
+}
+
+static int kimage_load_normal_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ unsigned long maddr;
+ unsigned long ubytes, mbytes;
+ int result;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ ubytes = segment->bufsz;
+ mbytes = segment->memsz;
+ maddr = segment->mem;
+
+ result = kimage_set_destination(image, maddr);
+ if (result < 0) {
+ goto out;
+ }
+ while(mbytes) {
+ struct page *page;
+ char *ptr;
+ size_t uchunk, mchunk;
+ page = kimage_alloc_page(image, GFP_HIGHUSER, maddr);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ result = kimage_add_page(image, page_to_pfn(page) << PAGE_SHIFT);
+ if (result < 0) {
+ goto out;
+ }
+ ptr = kmap(page);
+ /* Start with a clear page */
+ memset(ptr, 0, PAGE_SIZE);
+ ptr += maddr & ~PAGE_MASK;
+ mchunk = PAGE_SIZE - (maddr & ~PAGE_MASK);
+ if (mchunk > mbytes) {
+ mchunk = mbytes;
+ }
+ uchunk = mchunk;
+ if (uchunk > ubytes) {
+ uchunk = ubytes;
+ }
+ result = copy_from_user(ptr, buf, uchunk);
+ kunmap(page);
+ if (result) {
+ result = (result < 0) ? result : -EIO;
+ goto out;
+ }
+ ubytes -= uchunk;
+ maddr += mchunk;
+ buf += mchunk;
+ mbytes -= mchunk;
+ }
+ out:
+ return result;
+}
+
+static int kimage_load_crash_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ /* For crash dump kernels we simply copy the data from
+ * user space to its destination.
+ * We do things a page at a time for the sake of kmap.
+ */
+ unsigned long maddr;
+ unsigned long ubytes, mbytes;
+ int result;
+ unsigned char *buf;
+
+ result = 0;
+ buf = segment->buf;
+ ubytes = segment->bufsz;
+ mbytes = segment->memsz;
+ maddr = segment->mem;
+ while(mbytes) {
+ struct page *page;
+ char *ptr;
+ size_t uchunk, mchunk;
+ page = pfn_to_page(maddr >> PAGE_SHIFT);
+ if (page == 0) {
+ result = -ENOMEM;
+ goto out;
+ }
+ ptr = kmap(page);
+ ptr += maddr & ~PAGE_MASK;
+ mchunk = PAGE_SIZE - (maddr & ~PAGE_MASK);
+ if (mchunk > mbytes) {
+ mchunk = mbytes;
+ }
+ uchunk = mchunk;
+ if (uchunk > ubytes) {
+ uchunk = ubytes;
+ /* Zero the trailing part of the page */
+ memset(ptr + uchunk, 0, mchunk - uchunk);
+ }
+ result = copy_from_user(ptr, buf, uchunk);
+ kunmap(page);
+ if (result) {
+ result = (result < 0) ? result : -EIO;
+ goto out;
+ }
+ ubytes -= uchunk;
+ maddr += mchunk;
+ buf += mchunk;
+ mbytes -= mchunk;
+ }
+ out:
+ return result;
+}
+
+static int kimage_load_segment(struct kimage *image,
+ struct kexec_segment *segment)
+{
+ int result = -ENOMEM;
+ switch(image->type) {
+ case KEXEC_TYPE_DEFAULT:
+ result = kimage_load_normal_segment(image, segment);
+ break;
+ case KEXEC_TYPE_CRASH:
+ result = kimage_load_crash_segment(image, segment);
+ break;
+ }
+ return result;
+}
+
+/*
+ * Exec Kernel system call: for obvious reasons only root may call it.
+ *
+ * This call breaks up into three pieces.
+ * - A generic part which loads the new kernel from the current
+ * address space, and very carefully places the data in the
+ * allocated pages.
+ *
+ * - A generic part that interacts with the kernel and tells all of
+ * the devices to shut down. Preventing on-going dmas, and placing
+ * the devices in a consistent state so a later kernel can
+ * reinitialize them.
+ *
+ * - A machine specific part that includes the syscall number
+ * and then copies the image to its final destination, and
+ * jumps into the image at entry.
+ *
+ * kexec does not sync, or unmount filesystems so if you need
+ * that to happen you need to do that yourself.
+ */
+struct kimage *kexec_image = NULL;
+struct kimage *kexec_crash_image = NULL;
+/*
+ * A home grown binary mutex.
+ * Nothing can wait so this mutex is safe to use
+ * in interrupt context :)
+ */
+static int kexec_lock = 0;
+
+asmlinkage long sys_kexec_load(unsigned long entry,
+ unsigned long nr_segments, struct kexec_segment __user *segments,
+ unsigned long flags)
+{
+ struct kimage **dest_image, *image;
+ int locked;
+ int result;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_BOOT))
+ return -EPERM;
+
+ /*
+ * Verify we have a legal set of flags
+ * This leaves us room for future extensions.
+ */
+ if ((flags & KEXEC_FLAGS) != (flags & ~KEXEC_ARCH_MASK))
+ return -EINVAL;
+
+ /* Verify we are on the appropriate architecture */
+ if (((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH) &&
+ ((flags & KEXEC_ARCH_MASK) != KEXEC_ARCH_DEFAULT))
+ {
+ return -EINVAL;
+ }
+
+ /* Put an artificial cap on the number
+ * of segments passed to kexec_load.
+ */
+ if (nr_segments > KEXEC_SEGMENT_MAX)
+ return -EINVAL;
+
+ image = NULL;
+ result = 0;
+
+ /* Because we write directly to the reserved memory
+ * region when loading crash kernels we need a mutex here to
+ * prevent multiple crash kernels from attempting to load
+ * simultaneously, and to prevent a crash kernel from loading
+ * over the top of an in-use crash kernel.
+ *
+ * KISS: always take the mutex.
+ */
+ locked = xchg(&kexec_lock, 1);
+ if (locked) {
+ return -EBUSY;
+ }
+ dest_image = &kexec_image;
+ if (flags & KEXEC_ON_CRASH) {
+ dest_image = &kexec_crash_image;
+ }
+ if (nr_segments > 0) {
+ unsigned long i;
+ /* Loading another kernel to reboot into */
+ if ((flags & KEXEC_ON_CRASH) == 0) {
+ result = kimage_normal_alloc(&image, entry, nr_segments, segments);
+ }
+ /* Loading another kernel to switch to if this one crashes */
+ else if (flags & KEXEC_ON_CRASH) {
+ /* Free any current crash dump kernel before
+ * we corrupt it.
+ */
+ kimage_free(xchg(&kexec_crash_image, NULL));
+ result = kimage_crash_alloc(&image, entry, nr_segments, segments);
+ }
+ if (result) {
+ goto out;
+ }
+ result = machine_kexec_prepare(image);
+ if (result) {
+ goto out;
+ }
+ for(i = 0; i < nr_segments; i++) {
+ result = kimage_load_segment(image, &image->segment[i]);
+ if (result) {
+ goto out;
+ }
+ }
+ result = kimage_terminate(image);
+ if (result) {
+ goto out;
+ }
+ }
+ /* Install the new kernel, and Uninstall the old */
+ image = xchg(dest_image, image);
+
+ out:
+ xchg(&kexec_lock, 0); /* Release the mutex */
+ kimage_free(image);
+ return result;
+}
+
+#ifdef CONFIG_COMPAT
+asmlinkage long compat_sys_kexec_load(unsigned long entry,
+ unsigned long nr_segments, struct compat_kexec_segment __user *segments,
+ unsigned long flags)
+{
+ struct compat_kexec_segment in;
+ struct kexec_segment out, __user *ksegments;
+ unsigned long i, result;
+
+ /* Don't allow clients that don't understand the native
+ * architecture to do anything.
+ */
+ if ((flags & KEXEC_ARCH_MASK) == KEXEC_ARCH_DEFAULT) {
+ return -EINVAL;
+ }
+
+ if (nr_segments > KEXEC_SEGMENT_MAX) {
+ return -EINVAL;
+ }
+
+ ksegments = compat_alloc_user_space(nr_segments * sizeof(out));
+ for (i=0; i < nr_segments; i++) {
+ result = copy_from_user(&in, &segments[i], sizeof(in));
+ if (result) {
+ return -EFAULT;
+ }
+
+ out.buf = compat_ptr(in.buf);
+ out.bufsz = in.bufsz;
+ out.mem = in.mem;
+ out.memsz = in.memsz;
+
+ result = copy_to_user(&ksegments[i], &out, sizeof(out));
+ if (result) {
+ return -EFAULT;
+ }
+ }
+
+ return sys_kexec_load(entry, nr_segments, ksegments, flags);
+}
+#endif
+
+void crash_kexec(void)
+{
+ struct kimage *image;
+ int locked;
+
+
+ /* Take the kexec_lock here to prevent sys_kexec_load
+ * running on one cpu from replacing the crash kernel
+ * we are using after a panic on a different cpu.
+ *
+ * If the crash kernel was not located in a fixed area
+ * of memory the xchg(&kexec_crash_image) would be
+ * sufficient. But since I reuse the memory...
+ */
+ locked = xchg(&kexec_lock, 1);
+ if (!locked) {
+ image = xchg(&kexec_crash_image, NULL);
+ if (image) {
+ machine_crash_shutdown();
+ machine_kexec(image);
+ }
+ xchg(&kexec_lock, 0);
+ }
+}
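
The destination checks in do_kimage_alloc() above can be tried out in user space. What follows is a minimal sketch of the two checks (page alignment, then pairwise overlap of half-open ranges), not the kernel code itself. The names sketch_segment, sketch_segments_aligned, sketch_segments_overlap and the 4096-byte SKETCH_PAGE_SIZE are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, illustration only */

/* Hypothetical stand-in for the mem/memsz half of struct kexec_segment. */
struct sketch_segment {
	unsigned long mem;	/* destination physical address */
	unsigned long memsz;	/* size of the destination range in bytes */
};

/* Return 1 if both ends of every segment are page aligned, else 0. */
static int sketch_segments_aligned(const struct sketch_segment *seg, size_t n)
{
	size_t i;

	assert(seg != NULL || n == 0);
	for (i = 0; i < n; i++) {
		unsigned long mstart = seg[i].mem;
		unsigned long mend = mstart + seg[i].memsz;

		if ((mstart | mend) & (SKETCH_PAGE_SIZE - 1))
			return 0;
	}
	return 1;
}

/* Return 1 if any two half-open ranges [mem, mem + memsz) intersect. */
static int sketch_segments_overlap(const struct sketch_segment *seg, size_t n)
{
	size_t i, j;

	assert(seg != NULL || n == 0);
	for (i = 0; i < n; i++) {
		unsigned long mstart = seg[i].mem;
		unsigned long mend = mstart + seg[i].memsz;

		for (j = 0; j < i; j++) {
			unsigned long pstart = seg[j].mem;
			unsigned long pend = pstart + seg[j].memsz;

			/* Same test as the patch: do the segments overlap? */
			if ((mend > pstart) && (mstart < pend))
				return 1;
		}
	}
	return 0;
}
```

Two segments that merely touch, such as [0x100000, 0x102000) and [0x102000, 0x103000), do not count as overlapping because the ranges are half-open.
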
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/panic.c linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/panic.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/panic.c Fri Jan 7 12:54:17 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/panic.c Tue Jan 18 22:47:13 2005
@@ -18,6 +18,7 @@
#include <linux/sysrq.h>
#include <linux/interrupt.h>
#include <linux/nmi.h>
+#include <linux/kexec.h>

int panic_timeout;
int panic_on_oops;
@@ -71,7 +72,17 @@
printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
bust_spinlocks(0);

+ /* If we have crashed and we have a crash kernel loaded
+ * let it handle everything else.
+ * Do we want to call this before we try to display a message?
+ */
+ crash_kexec();
+
#ifdef CONFIG_SMP
+ /* Note smp_send_stop is the usual smp shutdown function, which
+ * unfortunately means it may not be hardened to work in a panic
+ * situation.
+ */
smp_send_stop();
#endif

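
crash_kexec() relies on kexec_lock being a lock that nothing ever waits on: the caller either gets it or backs off. A user-space sketch of that try-lock pattern follows, assuming C11 atomic_exchange() as a stand-in for the kernel's xchg(). The names sketch_lock, sketch_trylock and sketch_unlock are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* 0 = free, 1 = held; plays the role of kexec_lock in the patch. */
static atomic_int sketch_lock;

/* Try to take the lock without ever waiting; returns 1 on success. */
static int sketch_trylock(void)
{
	/* atomic_exchange returns the previous value, as xchg() does. */
	return atomic_exchange(&sketch_lock, 1) == 0;
}

/* Release the lock; only the holder may call this. */
static void sketch_unlock(void)
{
	int was_held = atomic_exchange(&sketch_lock, 0);

	assert(was_held == 1);	/* unlocking a free lock is a caller bug */
	(void)was_held;		/* keep NDEBUG builds warning free */
}
```

Because a failed attempt returns immediately, the pattern is usable from contexts that must not sleep. The cost is that a contended caller gets an error (-EBUSY in sys_kexec_load) or skips the work (crash_kexec) instead of waiting.
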
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/sys.c linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/sys.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/sys.c Fri Jan 14 04:28:49 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/sys.c Tue Jan 18 22:47:13 2005
@@ -16,6 +16,8 @@
#include <linux/init.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/kexec.h>
#include <linux/workqueue.h>
#include <linux/device.h>
#include <linux/key.h>
@@ -433,6 +435,24 @@
machine_restart(buffer);
break;

+#ifdef CONFIG_KEXEC
+ case LINUX_REBOOT_CMD_KEXEC:
+ {
+ struct kimage *image;
+ image = xchg(&kexec_image, 0);
+ if (!image) {
+ unlock_kernel();
+ return -EINVAL;
+ }
+ notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+ system_state = SYSTEM_RESTART;
+ device_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_shutdown();
+ machine_kexec(image);
+ break;
+ }
+#endif
#ifdef CONFIG_SOFTWARE_SUSPEND
case LINUX_REBOOT_CMD_SW_SUSPEND:
{
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/sys_ni.c linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/sys_ni.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/kernel/sys_ni.c Fri Jan 14 04:32:28 2005
+++ linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/kernel/sys_ni.c Tue Jan 18 22:47:13 2005
@@ -18,6 +18,8 @@
cond_syscall(sys_lookup_dcookie)
cond_syscall(sys_swapon)
cond_syscall(sys_swapoff)
+cond_syscall(sys_kexec_load)
+cond_syscall(compat_sys_kexec_load)
cond_syscall(sys_init_module)
cond_syscall(sys_delete_module)
cond_syscall(sys_socketpair)

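The entry list built by kimage_set_destination() and kimage_add_page() in kernel/kexec.c is easier to follow with a small walk over it. The sketch below mirrors the logic of kimage_dst_used() on a flat array: a destination entry resets the running destination address, and each source entry consumes one page of it. Indirection pages are left out, and the SK_* flag values and the 4096-byte page size are assumptions for this example, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE	4096UL
#define SK_PAGE_MASK	(~(SK_PAGE_SIZE - 1))
/* Flag values are assumptions for this sketch. */
#define SK_DESTINATION	0x1UL
#define SK_DONE		0x4UL
#define SK_SOURCE	0x8UL

/*
 * Walk a flat entry list and return the address of the source page that
 * will be copied to dest, or 0 if no source page targets it.
 */
static unsigned long sk_source_for(const unsigned long *list,
	unsigned long dest)
{
	unsigned long destination = 0;
	const unsigned long *ptr;

	assert(list != NULL);
	for (ptr = list; *ptr && !(*ptr & SK_DONE); ptr++) {
		unsigned long entry = *ptr;

		if (entry & SK_DESTINATION) {
			/* Start of a new run of destination pages */
			destination = entry & SK_PAGE_MASK;
		} else if (entry & SK_SOURCE) {
			if (destination == dest)
				return entry & SK_PAGE_MASK;
			/* Each source page fills one destination page */
			destination += SK_PAGE_SIZE;
		}
	}
	return 0;
}
```

The list only stores a destination address at the start of each run, so the destination of any given source page is implied by its position. That is why kimage_alloc_page() has to walk the list to find out whether a page is already spoken for.
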
2005-01-19 07:51:21

by Eric W. Biederman

Subject: [PATCH 19/29] x86_64-kexec


This is the x86_64 implementation of machine kexec.
32-bit compatibility support has been implemented, and machine_kexec
has been enhanced to not care about the changing internal kernel page
table structures.

Signed-off-by: Eric Biederman <[email protected]>
---

arch/x86_64/Kconfig | 17 ++
arch/x86_64/ia32/ia32entry.S | 2
arch/x86_64/kernel/Makefile | 1
arch/x86_64/kernel/crash.c | 40 +++++
arch/x86_64/kernel/machine_kexec.c | 245 +++++++++++++++++++++++++++++++++++
arch/x86_64/kernel/relocate_kernel.S | 143 ++++++++++++++++++++
include/asm-x86_64/kexec.h | 28 ++++
include/asm-x86_64/unistd.h | 2
8 files changed, 476 insertions(+), 2 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/Kconfig linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/Kconfig
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/Kconfig Tue Jan 18 22:46:57 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/Kconfig Tue Jan 18 23:14:06 2005
@@ -370,6 +370,23 @@
the panic-ed kernel.

Don't change this unless you know what you are doing.
+
+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
endmenu

#
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/ia32/ia32entry.S linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/ia32/ia32entry.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/ia32/ia32entry.S Fri Jan 14 04:32:23 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/ia32/ia32entry.S Tue Jan 18 23:14:06 2005
@@ -589,7 +589,7 @@
.quad compat_sys_mq_timedreceive /* 280 */
.quad compat_sys_mq_notify
.quad compat_sys_mq_getsetattr
- .quad quiet_ni_syscall /* reserved for kexec */
+ .quad compat_sys_kexec_load
.quad sys32_waitid
.quad quiet_ni_syscall /* sys_altroot */
.quad sys_add_key
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/Makefile linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/Makefile
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/Makefile Fri Jan 14 04:32:23 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/Makefile Tue Jan 18 23:14:06 2005
@@ -20,6 +20,7 @@
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o mpparse.o \
genapic.o genapic_cluster.o genapic_flat.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o crash.o
obj-$(CONFIG_PM) += suspend.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend_asm.o
obj-$(CONFIG_CPU_FREQ) += cpufreq/
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/crash.c linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/crash.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/crash.c Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/crash.c Tue Jan 18 23:14:06 2005
@@ -0,0 +1,40 @@
+/*
+ * Architecture specific (x86_64) functions for kexec based crash dumps.
+ *
+ * Created by: Hariprasad Nellitheertha ([email protected])
+ *
+ * Copyright (C) IBM Corporation, 2004. All rights reserved.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/smp.h>
+#include <linux/irq.h>
+#include <linux/reboot.h>
+#include <linux/kexec.h>
+#include <linux/elf.h>
+#include <linux/elfcore.h>
+
+#include <asm/processor.h>
+#include <asm/hardirq.h>
+#include <asm/nmi.h>
+#include <asm/hw_irq.h>
+
+#define MAX_NOTE_BYTES 1024
+typedef u32 note_buf_t[MAX_NOTE_BYTES/4];
+
+note_buf_t crash_notes[NR_CPUS];
+
+void machine_crash_shutdown(void)
+{
+ /* This function is only called after the system
+ * has panicked or is otherwise in a critical state.
+ * The minimum amount of code to allow a kexec'd kernel
+ * to run successfully needs to happen here.
+ *
+ * In practice this means shooting down the other cpus in
+ * an SMP system.
+ */
+}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/machine_kexec.c linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/machine_kexec.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/machine_kexec.c Tue Jan 18 23:14:06 2005
@@ -0,0 +1,245 @@
+/*
+ * machine_kexec.c - handle transition of Linux booting another kernel
+ * Copyright (C) 2002-2005 Eric Biederman <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <linux/string.h>
+#include <linux/reboot.h>
+#include <asm/pda.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/mmu_context.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+#include <asm/cpufeature.h>
+#include <asm/hw_irq.h>
+
+#define LEVEL0_SIZE (1UL << 12UL)
+#define LEVEL1_SIZE (1UL << 21UL)
+#define LEVEL2_SIZE (1UL << 30UL)
+#define LEVEL3_SIZE (1UL << 39UL)
+#define LEVEL4_SIZE (1UL << 48UL)
+
+#define L0_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define L1_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE)
+#define L2_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define L3_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+
+static void init_level2_page(
+ u64 *level2p, unsigned long addr)
+{
+ unsigned long end_addr;
+ addr &= PAGE_MASK;
+ end_addr = addr + LEVEL2_SIZE;
+ while(addr < end_addr) {
+ *(level2p++) = addr | L1_ATTR;
+ addr += LEVEL1_SIZE;
+ }
+}
+
+static int init_level3_page(struct kimage *image,
+ u64 *level3p, unsigned long addr, unsigned long last_addr)
+{
+ unsigned long end_addr;
+ int result;
+ result = 0;
+ addr &= PAGE_MASK;
+ end_addr = addr + LEVEL3_SIZE;
+ while((addr < last_addr) && (addr < end_addr)) {
+ struct page *page;
+ u64 *level2p;
+ page = kimage_alloc_control_pages(image, 0);
+ if (!page) {
+ result = -ENOMEM;
+ goto out;
+ }
+ level2p = (u64 *)page_address(page);
+ init_level2_page(level2p, addr);
+ *(level3p++) = __pa(level2p) | L2_ATTR;
+ addr += LEVEL2_SIZE;
+ }
+ /* clear the unused entries */
+ while(addr < end_addr) {
+ *(level3p++) = 0;
+ addr += LEVEL2_SIZE;
+ }
+out:
+ return result;
+}
+
+
+static int init_level4_page(struct kimage *image,
+ u64 *level4p, unsigned long addr, unsigned long last_addr)
+{
+ unsigned long end_addr;
+ int result;
+ result = 0;
+ addr &= PAGE_MASK;
+ end_addr = addr + LEVEL4_SIZE;
+ while((addr < last_addr) && (addr < end_addr)) {
+ struct page *page;
+ u64 *level3p;
+ page = kimage_alloc_control_pages(image, 0);
+ if (!page) {
+ result = -ENOMEM;
+ goto out;
+ }
+ level3p = (u64 *)page_address(page);
+ result = init_level3_page(image, level3p, addr, last_addr);
+ if (result) {
+ goto out;
+ }
+ *(level4p++) = __pa(level3p) | L3_ATTR;
+ addr += LEVEL3_SIZE;
+ }
+ /* clear the unused entries */
+ while(addr < end_addr) {
+ *(level4p++) = 0;
+ addr += LEVEL3_SIZE;
+ }
+ out:
+ return result;
+}
+
+
+static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
+{
+ u64 *level4p;
+ level4p = (u64 *)__va(start_pgtable);
+ return init_level4_page(image, level4p, 0, end_pfn << PAGE_SHIFT);
+}
+
+static void set_idt(void *newidt, u16 limit)
+{
+ unsigned char curidt[10];
+
+ /* x86-64 supports unaligned loads & stores */
+ (*(u16 *)(curidt)) = limit;
+ (*(u64 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : : "m" (curidt)
+ );
+}
+
+
+static void set_gdt(void *newgdt, u16 limit)
+{
+ unsigned char curgdt[10];
+
+ /* x86-64 supports unaligned loads & stores */
+ (*(u16 *)(curgdt)) = limit;
+ (*(u64 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : : "m" (curgdt)
+ );
+}
+
+static void load_segments(void)
+{
+ __asm__ __volatile__ (
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%ss\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ );
+#undef STR
+#undef __STR
+}
+
+typedef NORET_TYPE void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long control_code_buffer,
+ unsigned long start_address, unsigned long pgtable) ATTRIB_NORET;
+
+const extern unsigned char relocate_new_kernel[];
+const extern unsigned long relocate_new_kernel_size;
+
+int machine_kexec_prepare(struct kimage *image)
+{
+ unsigned long start_pgtable, control_code_buffer;
+ int result;
+
+ /* Calculate the offsets */
+ start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
+ control_code_buffer = start_pgtable + 4096UL;
+
+ /* Setup the identity mapped 64bit page table */
+ result = init_pgtable(image, start_pgtable);
+ if (result) {
+ return result;
+ }
+
+ /* Place the code in the reboot code buffer */
+ memcpy(__va(control_code_buffer), relocate_new_kernel, relocate_new_kernel_size);
+
+ return 0;
+}
+
+void machine_kexec_cleanup(struct kimage *image)
+{
+ return;
+}
+
+/*
+ * Do not allocate memory (or fail in any way) in machine_kexec().
+ * We are past the point of no return, committed to rebooting now.
+ */
+NORET_TYPE void machine_kexec(struct kimage *image)
+{
+ unsigned long page_list;
+ unsigned long control_code_buffer;
+ unsigned long start_pgtable;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+
+ /* Calculate the offsets */
+ page_list = image->head;
+ start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
+ control_code_buffer = start_pgtable + 4096UL;
+
+ /* Set the low half of the page table to my identity mapped
+ * page table for kexec. Leave the high half pointing at the
+ * kernel pages. Don't bother to flush the global pages
+ * as that will happen when I fully switch to my identity mapped
+ * page table anyway.
+ */
+ memcpy(__va(read_cr3()), __va(start_pgtable), PAGE_SIZE/2);
+ __flush_tlb();
+
+
+ /* The segment registers are funny things, they are
+ * automatically loaded from a table, in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again unless you set the segment to a different selector.
+ *
+ * The more common model is a cache, where the behind
+ * the scenes work is done, but is also dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+ /* now call it */
+ rnk = (relocate_new_kernel_t) control_code_buffer;
+ (*rnk)(page_list, control_code_buffer, image->start, start_pgtable);
+}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/relocate_kernel.S linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/relocate_kernel.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/relocate_kernel.S Tue Jan 18 23:14:06 2005
@@ -0,0 +1,143 @@
+/*
+ * relocate_kernel.S - put the kernel image in place to boot
+ * Copyright (C) 2002-2005 Eric Biederman <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/linkage.h>
+
+ /*
+ * Must be relocatable PIC code callable as a C function, that once
+ * it starts cannot use the previous process's stack.
+ */
+ .globl relocate_new_kernel
+ .code64
+relocate_new_kernel:
+ /* %rdi page_list
+ * %rsi reboot_code_buffer
+ * %rdx start address
+ * %rcx page_table
+ * %r8 arg5
+ * %r9 arg6
+ */
+
+ /* zero out flags, and disable interrupts */
+ pushq $0
+ popfq
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%rsi), %rsp
+
+ /* store the parameters back on the stack */
+ pushq %rdx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 1 == Paging enabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movq %cr0, %rax
+ andq $~((1<<18)|(1<<16)|(1<<3)|(1<<2)), %rax
+ orl $((1<<31)|(1<<0)), %eax
+ movq %rax, %cr0
+
+ /* Set cr4 to a known state:
+ * 10 0 == xmm exceptions disabled
+ * 9 0 == xmm registers instructions disabled
+ * 8 0 == performance monitoring counter disabled
+ * 7 0 == page global disabled
+ * 6 0 == machine check exceptions disabled
+ * 5 1 == physical address extension enabled
+ * 4 0 == page size extensions disabled
+ * 3 0 == Debug extensions disabled
+ * 2 0 == Time stamp disable (disabled)
+ * 1 0 == Protected mode virtual interrupts disabled
+ * 0 0 == VME disabled
+ */
+
+ movq $((1<<5)), %rax
+ movq %rax, %cr4
+
+ jmp 1f
+1:
+
+ /* Switch to the identity mapped page tables,
+ * and flush the TLB.
+ */
+ movq %rcx, %cr3
+
+ /* Do the copies */
+ movq %rdi, %rcx /* Put the page_list in %rcx */
+ xorq %rdi, %rdi
+ xorq %rsi, %rsi
+ jmp 1f
+
+0: /* top, read another word for the indirection page */
+
+ movq (%rbx), %rcx
+ addq $8, %rbx
+1:
+ testq $0x1, %rcx /* is it a destination page? */
+ jz 2f
+ movq %rcx, %rdi
+ andq $0xfffffffffffff000, %rdi
+ jmp 0b
+2:
+ testq $0x2, %rcx /* is it an indirection page? */
+ jz 2f
+ movq %rcx, %rbx
+ andq $0xfffffffffffff000, %rbx
+ jmp 0b
+2:
+ testq $0x4, %rcx /* is it the done indicator? */
+ jz 2f
+ jmp 3f
+2:
+ testq $0x8, %rcx /* is it the source indicator? */
+ jz 0b /* Ignore it otherwise */
+ movq %rcx, %rsi /* For every source page do a copy */
+ andq $0xfffffffffffff000, %rsi
+
+ movq $512, %rcx
+ rep ; movsq
+ jmp 0b
+3:
+
+ /* To be certain of avoiding problems with self-modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB by reloading %cr3 here, it's handy,
+ * and not processor dependent.
+ */
+ movq %cr3, %rax
+ movq %rax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %rsp alone */
+
+ xorq %rax, %rax
+ xorq %rbx, %rbx
+ xorq %rcx, %rcx
+ xorq %rdx, %rdx
+ xorq %rsi, %rsi
+ xorq %rdi, %rdi
+ xorq %rbp, %rbp
+ xorq %r8, %r8
+ xorq %r9, %r9
+ xorq %r10, %r10
+ xorq %r11, %r11
+ xorq %r12, %r12
+ xorq %r13, %r13
+ xorq %r14, %r14
+ xorq %r15, %r15
+
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .quad relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/include/asm-x86_64/kexec.h linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/include/asm-x86_64/kexec.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/include/asm-x86_64/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/include/asm-x86_64/kexec.h Tue Jan 18 23:14:06 2005
@@ -0,0 +1,28 @@
+#ifndef _X86_64_KEXEC_H
+#define _X86_64_KEXEC_H
+
+#include <asm/page.h>
+#include <asm/proto.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * So far x86_64 is limited to 40 physical address bits.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (0xFFFFFFFFFFUL)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (0xFFFFFFFFFFUL)
+/* Maximum address we can use for the control pages */
+#define KEXEC_CONTROL_MEMORY_LIMIT (0xFFFFFFFFFFUL)
+
+/* Allocate one page for the pdp and the second for the code */
+#define KEXEC_CONTROL_CODE_SIZE (4096UL + 4096UL)
+
+/* The native architecture */
+#define KEXEC_ARCH KEXEC_ARCH_X86_64
+
+#endif /* _X86_64_KEXEC_H */
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/include/asm-x86_64/unistd.h linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/include/asm-x86_64/unistd.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/include/asm-x86_64/unistd.h Fri Jan 14 04:32:27 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/include/asm-x86_64/unistd.h Tue Jan 18 23:14:06 2005
@@ -554,7 +554,7 @@
#define __NR_mq_getsetattr 245
__SYSCALL(__NR_mq_getsetattr, sys_mq_getsetattr)
#define __NR_kexec_load 246
-__SYSCALL(__NR_kexec_load, sys_ni_syscall)
+__SYSCALL(__NR_kexec_load, sys_kexec_load)
#define __NR_waitid 247
__SYSCALL(__NR_waitid, sys_waitid)
#define __NR_add_key 248
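
For readers following the copy loop in relocate_kernel.S above, here is a minimal userspace sketch of the same list walk in C. It assumes the head entry is an indirection entry (otherwise the list pointer is never set, just as %rbx would be uninitialized in the assembly). The flag values 0x1/0x2/0x4/0x8 are taken from the assembly; the names SKETCH_* and sketch_walk() are made up for illustration and are not part of the patch.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE   4096UL
#define SKETCH_DESTINATION 0x1UL  /* entry names the next destination page */
#define SKETCH_INDIRECTION 0x2UL  /* entry names the next page of entries */
#define SKETCH_DONE        0x4UL  /* end of the list */
#define SKETCH_SOURCE      0x8UL  /* entry names a page to copy */

/* Walk the tagged entry list starting at 'head' and copy every source
 * page to the current destination.  Returns the number of pages copied.
 * Pages must be 4096-byte aligned so the low bits are free for flags.
 */
static unsigned long sketch_walk(uintptr_t head)
{
	uintptr_t entry = head;
	uintptr_t *next = NULL;      /* plays the role of %rbx */
	unsigned char *dest = NULL;  /* plays the role of %rdi */
	unsigned long copied = 0;

	for (;;) {
		uintptr_t addr = entry & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);

		if (entry & SKETCH_DESTINATION) {
			dest = (unsigned char *)addr;
		} else if (entry & SKETCH_INDIRECTION) {
			next = (uintptr_t *)addr;
		} else if (entry & SKETCH_DONE) {
			break;
		} else if (entry & SKETCH_SOURCE) {
			memcpy(dest, (const void *)addr, SKETCH_PAGE_SIZE);
			/* rep movsq leaves %rdi one page further along */
			dest += SKETCH_PAGE_SIZE;
			copied++;
		}
		/* entries with no known flag are ignored, as in the assembly */
		entry = *next++;
	}
	return copied;
}
```

Because the destination pointer advances after each copy, consecutive source entries fill consecutive destination pages without needing a new destination entry each time.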

2005-01-19 07:53:17

by Eric W. Biederman

Subject: [PATCH 29/29] crashdump-linear-raw-format-dump-file-access


From: Hariprasad Nellitheertha <[email protected]>

This patch contains the code that enables us to access the previous kernel's
memory as /dev/oldmem.
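
As a rough illustration, the page frame translation that read_oldmem() in this patch performs can be sketched in userspace C as below. The branch structure follows the patch; the SK_* constant values are placeholders chosen only for the example (the real CRASH_BACKUP_BASE, CRASH_BACKUP_SIZE and CRASH_RELOCATE_SIZE come from other patches in the series), and sketch_translate() is a hypothetical name.

```c
#define SK_PAGE_SIZE     4096UL
#define SK_BACKUP_BASE   (16UL << 20)   /* placeholder value */
#define SK_BACKUP_SIZE   (32UL << 20)   /* placeholder value */
#define SK_RELOCATE_SIZE (640UL << 10)  /* placeholder value */

enum sk_action { SK_ZERO_FILL, SK_COPY };

/* Decide what a read of page frame *pfn should do.  For SK_COPY the
 * frame number may have been redirected, as in the patch.
 */
static enum sk_action sketch_translate(unsigned long *pfn)
{
	unsigned long backup_start = SK_BACKUP_BASE / SK_PAGE_SIZE;
	unsigned long backup_end = backup_start + SK_BACKUP_SIZE / SK_PAGE_SIZE;
	unsigned long relocate_start =
		(SK_BACKUP_BASE + SK_BACKUP_SIZE) / SK_PAGE_SIZE;

	/* frames inside the reserved window read back as zeroes */
	if (*pfn >= backup_start && *pfn < backup_end)
		return SK_ZERO_FILL;

	/* low frames are redirected to the area past the window */
	if (*pfn < SK_RELOCATE_SIZE / SK_PAGE_SIZE)
		*pfn += relocate_start;

	return SK_COPY;
}
```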


Signed-off-by: Eric Biederman <[email protected]>
---

Documentation/devices.txt | 1
drivers/char/mem.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/Documentation/devices.txt linux-2.6.11-rc1-mm1-nokexec-crashdump-linear-raw-format-dump-file-access/Documentation/devices.txt
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/Documentation/devices.txt Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-linear-raw-format-dump-file-access/Documentation/devices.txt Tue Jan 18 23:17:15 2005
@@ -100,6 +100,7 @@
9 = /dev/urandom Faster, less secure random number gen.
10 = /dev/aio Asyncronous I/O notification interface
11 = /dev/kmsg Writes to this come out as printk's
+ 12 = /dev/oldmem Access to crash dump from kexec kernel
1 block RAM disk
0 = /dev/ram0 First RAM disk
1 = /dev/ram1 Second RAM disk
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/drivers/char/mem.c linux-2.6.11-rc1-mm1-nokexec-crashdump-linear-raw-format-dump-file-access/drivers/char/mem.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-elf-format-dump-file-access/drivers/char/mem.c Fri Jan 14 04:32:23 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-linear-raw-format-dump-file-access/drivers/char/mem.c Tue Jan 18 23:17:15 2005
@@ -23,6 +23,8 @@
#include <linux/devfs_fs_kernel.h>
#include <linux/ptrace.h>
#include <linux/device.h>
+#include <linux/highmem.h>
+#include <linux/crash_dump.h>

#include <asm/uaccess.h>
#include <asm/io.h>
@@ -226,6 +228,62 @@
return 0;
}

+#ifdef CONFIG_CRASH_DUMP
+/*
+ * Read memory corresponding to the old kernel.
+ * If we are reading from the reserved section, which is
+ * actually used by the current kernel, we just return zeroes.
+ * Or if we are reading from the first 640k, we return from the
+ * backed up area.
+ */
+static ssize_t read_oldmem(struct file * file, char * buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned long pfn;
+ unsigned backup_start, backup_end, relocate_start;
+ size_t read=0, csize;
+
+ backup_start = CRASH_BACKUP_BASE / PAGE_SIZE;
+ backup_end = backup_start + (CRASH_BACKUP_SIZE / PAGE_SIZE);
+ relocate_start = (CRASH_BACKUP_BASE + CRASH_BACKUP_SIZE) / PAGE_SIZE;
+
+ while(count) {
+ pfn = *ppos / PAGE_SIZE;
+
+ csize = (count > PAGE_SIZE) ? PAGE_SIZE : count;
+
+ /* Perform translation (see comment above) */
+ if ((pfn >= backup_start) && (pfn < backup_end)) {
+ if (clear_user(buf, csize)) {
+ read = -EFAULT;
+ goto done;
+ }
+
+ goto copy_done;
+ } else if (pfn < (CRASH_RELOCATE_SIZE / PAGE_SIZE))
+ pfn += relocate_start;
+
+ if (pfn > saved_max_pfn) {
+ read = 0;
+ goto done;
+ }
+
+ if (copy_oldmem_page(pfn, buf, csize, 1)) {
+ read = -EFAULT;
+ goto done;
+ }
+
+copy_done:
+ buf += csize;
+ *ppos += csize;
+ read += csize;
+ count -= csize;
+ }
+done:
+ return read;
+}
+#endif
+
extern long vread(char *buf, char *addr, unsigned long count);
extern long vwrite(char *buf, char *addr, unsigned long count);

@@ -530,6 +588,7 @@
#define read_full read_zero
#define open_mem open_port
#define open_kmem open_mem
+#define open_oldmem open_mem

#ifndef ARCH_HAS_DEV_MEM
static struct file_operations mem_fops = {
@@ -578,6 +637,13 @@
.write = write_full,
};

+#ifdef CONFIG_CRASH_DUMP
+static struct file_operations oldmem_fops = {
+ .read = read_oldmem,
+ .open = open_oldmem,
+};
+#endif
+
static ssize_t kmsg_write(struct file * file, const char __user * buf,
size_t count, loff_t *ppos)
{
@@ -632,6 +698,11 @@
case 11:
filp->f_op = &kmsg_fops;
break;
+#ifdef CONFIG_CRASH_DUMP
+ case 12:
+ filp->f_op = &oldmem_fops;
+ break;
+#endif
default:
return -ENXIO;
}
@@ -661,6 +732,9 @@
{8, "random", S_IRUGO | S_IWUSR, &random_fops},
{9, "urandom", S_IRUGO | S_IWUSR, &urandom_fops},
{11,"kmsg", S_IRUGO | S_IWUSR, &kmsg_fops},
+#ifdef CONFIG_CRASH_DUMP
+ {12,"oldmem", S_IRUSR | S_IWUSR | S_IRGRP, &oldmem_fops},
+#endif
};

static struct class_simple *mem_class;

2005-01-19 07:53:16

by Eric W. Biederman

Subject: [PATCH 12/29] x86-config-kernel-start


For one kernel to report a crash that another kernel has created, we need
to have 2 kernels loaded simultaneously in memory. To accomplish this
the two kernels need to be built to run at different physical addresses.

This patch adds the CONFIG_PHYSICAL_START option to the x86 kernel
so we can do just that. You need to know what you are doing and
what the ramifications are before changing this value, and most users
won't care, so I have made it depend on CONFIG_EMBEDDED.

bzImage kernels will work and run at a different address when compiled
with this option but they will still load at 1MB. If you need a kernel
loaded at a different address as well you need to boot a vmlinux.
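
As a small worked example of the address arithmetic this option feeds into, the sketch below mirrors the __KERNEL_START definition added to page.h (__PAGE_OFFSET + __PHYSICAL_START). The 0xC0000000 page offset is the i386 value from the patch; the function name and the 16MB address used in the note afterwards are illustrative assumptions only.

```c
#include <stdint.h>

#define SKETCH_PAGE_OFFSET 0xC0000000u  /* i386 __PAGE_OFFSET from the patch */

/* Virtual address the kernel is linked at for a given physical start,
 * i.e. what __KERNEL_START evaluates to.
 */
static uint32_t sketch_kernel_start(uint32_t physical_start)
{
	return SKETCH_PAGE_OFFSET + physical_start;
}
```

With the default of 0x100000 this gives the familiar 0xC0100000; a second kernel built with, say, 0x1000000 would be linked at 0xC1000000 instead.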

Signed-off-by: Eric Biederman <[email protected]>
---

arch/i386/Kconfig | 11 +++++++++++
arch/i386/boot/compressed/head.S | 7 ++++---
arch/i386/boot/compressed/misc.c | 7 ++++---
arch/i386/kernel/vmlinux.lds.S | 2 +-
include/asm-i386/page.h | 3 +++
5 files changed, 23 insertions(+), 7 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/i386/Kconfig linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/i386/Kconfig
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/i386/Kconfig Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/i386/Kconfig Tue Jan 18 22:46:40 2005
@@ -890,6 +890,17 @@

source "drivers/perfctr/Kconfig"

+config PHYSICAL_START
+ hex "Physical address where the kernel is loaded" if EMBEDDED
+ default "0x100000"
+ help
+ This gives the physical address where the kernel is loaded.
+ Primarily used in the case of kexec on panic where the
+ fail safe kernel needs to run at a different address than
+ the panic-ed kernel.
+
+ Don't change this unless you know what you are doing.
+
endmenu


diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/i386/boot/compressed/head.S linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/i386/boot/compressed/head.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/i386/boot/compressed/head.S Mon Oct 18 15:55:27 2004
+++ linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/i386/boot/compressed/head.S Tue Jan 18 22:46:40 2005
@@ -25,6 +25,7 @@

#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/page.h>

.globl startup_32

@@ -74,7 +75,7 @@
popl %esi # discard address
popl %esi # real mode pointer
xorl %ebx,%ebx
- ljmp $(__BOOT_CS), $0x100000
+ ljmp $(__BOOT_CS), $__PHYSICAL_START

/*
* We come here, if we were loaded high.
@@ -99,7 +100,7 @@
popl %ecx # lcount
popl %edx # high_buffer_start
popl %eax # hcount
- movl $0x100000,%edi
+ movl $__PHYSICAL_START,%edi
cli # make sure we don't get interrupted
ljmp $(__BOOT_CS), $0x1000 # and jump to the move routine

@@ -124,5 +125,5 @@
movsl
movl %ebx,%esi # Restore setup pointer
xorl %ebx,%ebx
- ljmp $(__BOOT_CS), $0x100000
+ ljmp $(__BOOT_CS), $__PHYSICAL_START
move_routine_end:
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/i386/boot/compressed/misc.c linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/i386/boot/compressed/misc.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/i386/boot/compressed/misc.c Mon Oct 18 15:54:32 2004
+++ linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/i386/boot/compressed/misc.c Tue Jan 18 22:46:40 2005
@@ -14,6 +14,7 @@
#include <linux/tty.h>
#include <video/edid.h>
#include <asm/io.h>
+#include <asm/page.h>

/*
* gzip declarations
@@ -309,7 +310,7 @@
#else
if ((RM_ALT_MEM_K > RM_EXT_MEM_K ? RM_ALT_MEM_K : RM_EXT_MEM_K) < 1024) error("Less than 2MB of memory");
#endif
- output_data = (char *)0x100000; /* Points to 1M */
+ output_data = (char *)__PHYSICAL_START; /* Normally Points to 1M */
free_mem_end_ptr = (long)real_mode;
}

@@ -334,8 +335,8 @@
low_buffer_size = low_buffer_end - LOW_BUFFER_START;
high_loaded = 1;
free_mem_end_ptr = (long)high_buffer_start;
- if ( (0x100000 + low_buffer_size) > ((ulg)high_buffer_start)) {
- high_buffer_start = (uch *)(0x100000 + low_buffer_size);
+ if ( (__PHYSICAL_START + low_buffer_size) > ((ulg)high_buffer_start)) {
+ high_buffer_start = (uch *)(__PHYSICAL_START + low_buffer_size);
mv->hcount = 0; /* say: we need not to move high_buffer */
}
else mv->hcount = -1;
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/i386/kernel/vmlinux.lds.S linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/i386/kernel/vmlinux.lds.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/i386/kernel/vmlinux.lds.S Tue Jan 18 22:45:51 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/i386/kernel/vmlinux.lds.S Tue Jan 18 22:46:40 2005
@@ -14,7 +14,7 @@
jiffies = jiffies_64;
SECTIONS
{
- . = LOAD_OFFSET + 0x100000;
+ . = __KERNEL_START;
phys_startup_32 = startup_32 - LOAD_OFFSET;
/* read-only */
_text = .; /* Text and read-only data */
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/include/asm-i386/page.h linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/include/asm-i386/page.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/include/asm-i386/page.h Fri Jan 14 04:32:27 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/include/asm-i386/page.h Tue Jan 18 22:46:40 2005
@@ -122,9 +122,12 @@

#ifdef __ASSEMBLY__
#define __PAGE_OFFSET (0xC0000000)
+#define __PHYSICAL_START CONFIG_PHYSICAL_START
#else
#define __PAGE_OFFSET (0xC0000000UL)
+#define __PHYSICAL_START ((unsigned long)CONFIG_PHYSICAL_START)
#endif
+#define __KERNEL_START (__PAGE_OFFSET + __PHYSICAL_START)


#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)

2005-01-19 07:54:56

by Eric W. Biederman

Subject: [PATCH 20/29] x86_64-crashkernel


This is the x86_64 implementation of the crashkernel option. It reserves
a window of memory very early in the bootup process, so we never use
it for anything but the kernel to switch to when the running
kernel panics.

In addition to reserving this memory, a resource structure is registered
so that /proc/iomem makes it clear what happened to that memory.

ISSUES:
Is it possible to implement this in an architecture-generic way?
What should be done with architectures that always use an iommu and
thus don't report their RAM memory resources in /proc/iomem?
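
For illustration, the crashkernel=size@addr parsing that setup.c does in this patch can be approximated in userspace as follows. sketch_memparse() is a stand-in written for this example that assumes the kernel's memparse() accepts a number with an optional K/M/G suffix; the struct and function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

struct sketch_range {
	unsigned long long start;
	unsigned long long end;   /* inclusive, like struct resource */
};

/* Parse a number with an optional K, M or G suffix. */
static unsigned long long sketch_memparse(const char *s, char **retptr)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10; /* fall through */
	case 'M': case 'm':
		v <<= 10; /* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	*retptr = end;
	return v;
}

/* Returns 0 and fills *res on success, -1 if arg is not size@addr. */
static int sketch_parse_crashkernel(const char *arg, struct sketch_range *res)
{
	unsigned long long size, base;
	char *p;

	if (strncmp(arg, "crashkernel=", 12) != 0)
		return -1;
	size = sketch_memparse(arg + 12, &p);
	if (*p != '@')
		return -1;
	base = sketch_memparse(p + 1, &p);
	res->start = base;
	res->end = base + size - 1;
	return 0;
}
```

Like the patch, this sketch does no sanity checking of the resulting range; the FIXME in setup.c about validating it applies here too.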


Signed-off-by: Eric Biederman <[email protected]>
---

e820.c | 4 ++++
setup.c | 27 +++++++++++++++++++++++++++
2 files changed, 31 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/e820.c linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/x86_64/kernel/e820.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/e820.c Tue Jan 18 22:44:10 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/x86_64/kernel/e820.c Tue Jan 18 23:14:34 2005
@@ -10,6 +10,7 @@
#include <linux/bootmem.h>
#include <linux/ioport.h>
#include <linux/string.h>
+#include <linux/kexec.h>
#include <asm/page.h>
#include <asm/e820.h>
#include <asm/proto.h>
@@ -204,6 +205,9 @@
*/
request_resource(res, &code_resource);
request_resource(res, &data_resource);
+#ifdef CONFIG_KEXEC
+ request_resource(res, &crashk_res);
+#endif
}
}
}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/setup.c linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/x86_64/kernel/setup.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/setup.c Fri Jan 14 04:32:23 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-crashkernel/arch/x86_64/kernel/setup.c Tue Jan 18 23:14:34 2005
@@ -40,6 +40,7 @@
#include <linux/acpi.h>
#include <linux/kallsyms.h>
#include <linux/edd.h>
+#include <linux/kexec.h>
#include <asm/mtrr.h>
#include <asm/uaccess.h>
#include <asm/system.h>
@@ -321,6 +322,27 @@
if (!memcmp(from, "noexec=", 7))
nonx_setup(from + 7);

+#ifdef CONFIG_KEXEC
+ /* crashkernel=size@addr specifies the location to reserve for
+ * a crash kernel. By reserving this memory we guarantee
+ * that linux never sets it up as a DMA target.
+ * Useful for holding code to do something appropriate
+ * after a kernel panic.
+ */
+ else if (!memcmp(from, "crashkernel=", 12)) {
+ unsigned long size, base;
+ size = memparse(from+12, &from);
+ if (*from == '@') {
+ base = memparse(from+1, &from);
+ /* FIXME: Do I want a sanity check
+ * to validate the memory range?
+ */
+ crashk_res.start = base;
+ crashk_res.end = base + size - 1;
+ }
+ }
+#endif
+
next_char:
c = *(from++);
if (!c)
@@ -574,6 +596,11 @@
(unsigned long)(end_pfn << PAGE_SHIFT));
initrd_start = 0;
}
+ }
+#endif
+#ifdef CONFIG_KEXEC
+ if (crashk_res.start != crashk_res.end) {
+ reserve_bootmem(crashk_res.start, crashk_res.end - crashk_res.start + 1);
}
#endif
paging_init();

2005-01-19 07:57:34

by Eric W. Biederman

Subject: [PATCH 27/29] crashdump-routines-for-copying-dump-pages


kernel/crash.c has been renamed kernel/crash_dump.c to clarify its purpose.

From: Hariprasad Nellitheertha <[email protected]>

This patch provides the interfaces necessary to read the dump contents,
treating it as a high memory device.

Signed-off-by: Hariprasad Nellitheertha <[email protected]>

Signed-off-by: Eric Biederman <[email protected]>
---

arch/i386/mm/highmem.c | 18 ++++++++++++++++++
include/asm-i386/highmem.h | 1 +
include/linux/highmem.h | 1 +
kernel/crash_dump.c | 33 +++++++++++++++++++++++++++++++++
4 files changed, 53 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/arch/i386/mm/highmem.c linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/arch/i386/mm/highmem.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/arch/i386/mm/highmem.c Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/arch/i386/mm/highmem.c Tue Jan 18 23:16:41 2005
@@ -74,6 +74,24 @@
preempt_check_resched();
}

+/* This is the same as kmap_atomic() but can map memory that doesn't
+ * have a struct page associated with it.
+ */
+void *kmap_atomic_pfn(unsigned long pfn, enum km_type type)
+{
+ enum fixed_addresses idx;
+ unsigned long vaddr;
+
+ inc_preempt_count();
+
+ idx = type + KM_TYPE_NR*smp_processor_id();
+ vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
+ set_pte(kmap_pte-idx, pfn_pte(pfn, kmap_prot));
+ __flush_tlb_one(vaddr);
+
+ return (void*) vaddr;
+}
+
struct page *kmap_atomic_to_page(char *ptr)
{
unsigned long idx, vaddr = (unsigned long)ptr;
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/asm-i386/highmem.h linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/include/asm-i386/highmem.h
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/asm-i386/highmem.h Fri Jan 14 04:32:27 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/include/asm-i386/highmem.h Tue Jan 18 23:16:41 2005
@@ -72,6 +72,7 @@
void kunmap(struct page *page);
char *kmap_atomic(struct page *page, enum km_type type);
void kunmap_atomic(char *kvaddr, enum km_type type);
+void *kmap_atomic_pfn(unsigned long pfn, enum km_type type);
struct page *kmap_atomic_to_page(char *ptr);

#define flush_cache_kmaps() do { } while (0)
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/linux/highmem.h linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/include/linux/highmem.h
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/include/linux/highmem.h Fri Jan 14 04:32:27 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/include/linux/highmem.h Tue Jan 18 23:16:41 2005
@@ -28,6 +28,7 @@

#define kmap_atomic(page, idx) ((char *)page_address(page))
#define kunmap_atomic(addr, idx) do { char *p = addr; (void)p; } while (0)
+#define kmap_atomic_pfn(pfn, idx) page_address(pfn_to_page(pfn))
#define kmap_atomic_to_page(ptr) virt_to_page(ptr)

#endif /* CONFIG_HIGHMEM */
diff -uNr linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/kernel/crash_dump.c linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/kernel/crash_dump.c
--- linux-2.6.11-rc1-mm1-nokexec-crashdump-memory-preserving-reboot-using-kexec/kernel/crash_dump.c Tue Jan 18 23:16:24 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-routines-for-copying-dump-pages/kernel/crash_dump.c Tue Jan 18 23:16:41 2005
@@ -8,6 +8,39 @@
#include <linux/smp_lock.h>
#include <linux/errno.h>
#include <linux/proc_fs.h>
+#include <linux/bootmem.h>
+#include <linux/highmem.h>
+#include <linux/crash_dump.h>
+
#include <asm/io.h>
#include <asm/uaccess.h>

+/*
+ * Copy a page from "oldmem". For this page, there is no pte mapped
+ * in the current kernel. We stitch up a pte, similar to kmap_atomic.
+ */
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
+ size_t csize, int userbuf)
+{
+ void *page, *vaddr;
+
+ if (!csize)
+ return 0;
+
+ page = kmalloc(PAGE_SIZE, GFP_KERNEL);
+
+ vaddr = kmap_atomic_pfn(pfn, KM_PTE0);
+ copy_page(page, vaddr);
+ kunmap_atomic(vaddr, KM_PTE0);
+
+ if (userbuf) {
+ if (copy_to_user(buf, page, csize)) {
+ kfree(page);
+ return -EFAULT;
+ }
+ } else
+ memcpy(buf, page, csize);
+ kfree(page);
+
+ return 0;
+}

2005-01-19 08:02:10

by Eric W. Biederman

Subject: [PATCH 1/29] x86-rename-apic_mode_exint


From: "Maciej W. Rozycki" <[email protected]>

Rename APIC_MODE_EXINT to APIC_MODE_EXTINT - I think it should be named
after what the mode is called in documentation.

From: "Eric W. Biederman" <[email protected]>

I have reduced this patch to just the name change in the header and
integrated the remaining changes into the patches that add those
lines. Otherwise I ran into some ugly dependencies.

Signed-off-by: Maciej W. Rozycki <[email protected]>
Signed-off-by: Eric Biederman <[email protected]>
---

asm-i386/apicdef.h | 2 +-
asm-x86_64/apicdef.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec/include/asm-i386/apicdef.h linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/include/asm-i386/apicdef.h
--- linux-2.6.11-rc1-mm1-nokexec/include/asm-i386/apicdef.h Mon Oct 18 15:54:31 2004
+++ linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/include/asm-i386/apicdef.h Tue Jan 18 22:43:44 2005
@@ -90,7 +90,7 @@
#define SET_APIC_DELIVERY_MODE(x,y) (((x)&~0x700)|((y)<<8))
#define APIC_MODE_FIXED 0x0
#define APIC_MODE_NMI 0x4
-#define APIC_MODE_EXINT 0x7
+#define APIC_MODE_EXTINT 0x7
#define APIC_LVT1 0x360
#define APIC_LVTERR 0x370
#define APIC_TMICT 0x380
diff -uNr linux-2.6.11-rc1-mm1-nokexec/include/asm-x86_64/apicdef.h linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/include/asm-x86_64/apicdef.h
--- linux-2.6.11-rc1-mm1-nokexec/include/asm-x86_64/apicdef.h Fri Jan 7 12:54:16 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/include/asm-x86_64/apicdef.h Tue Jan 18 22:43:44 2005
@@ -94,7 +94,7 @@
#define SET_APIC_DELIVERY_MODE(x,y) (((x)&~0x700)|((y)<<8))
#define APIC_MODE_FIXED 0x0
#define APIC_MODE_NMI 0x4
-#define APIC_MODE_EXINT 0x7
+#define APIC_MODE_EXTINT 0x7
#define APIC_LVT1 0x360
#define APIC_LVTERR 0x370
#define APIC_TMICT 0x380

2005-01-19 08:04:28

by Eric W. Biederman

Subject: [PATCH 2/29] x86-local-apic-fix


From: "Maciej W. Rozycki" <[email protected]>

Fix a kexec problem which causes local APIC detection failure.

The problem is that detect_init_APIC() is called early, before the command
line has been processed. Therefore "lapic" (and "nolapic") have not been
seen yet.
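
For illustration only (not part of the patch): a userspace sketch of the kind of early prefix check the fix moves into setup.c. The function name parse_lapic_option is hypothetical, the sketch uses strncmp rather than memcmp so that short words are safe to pass in, and it compares all seven characters of "nolapic". It does not model the clearing of X86_FEATURE_APIC that lapic_disable() also performs.

```c
#include <assert.h>
#include <string.h>

/* -1 = force-disable, +1 = force-enable, 0 = default (as in the patch). */
static int enable_local_apic;

/*
 * Hypothetical stand-in for the early option check.
 * "from" points at the start of one command line word.
 * Returns 1 if the word was recognised, 0 otherwise.
 */
static int parse_lapic_option(const char *from)
{
	assert(from != NULL);

	if (!strncmp(from, "nolapic", 7)) {
		enable_local_apic = -1;
		return 1;
	}
	if (!strncmp(from, "lapic", 5)) {
		enable_local_apic = 1;
		return 1;
	}
	return 0;
}
```

Because the knob is set while the command line is being scanned, a later call to the detection code sees the user's choice instead of the default of 0.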

Signed-off-by: Maciej W. Rozycki <[email protected]>
Signed-off-by: Eric Biederman <[email protected]>
---

arch/i386/kernel/apic.c | 25 +++++--------------------
arch/i386/kernel/setup.c | 11 +++++++++++
include/asm-i386/apic.h | 13 +++++++++++++
3 files changed, 29 insertions(+), 20 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/arch/i386/kernel/apic.c linux-2.6.11-rc1-mm1-nokexec-x86-local-apic-fix/arch/i386/kernel/apic.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/arch/i386/kernel/apic.c Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-local-apic-fix/arch/i386/kernel/apic.c Tue Jan 18 22:43:54 2005
@@ -41,6 +41,11 @@
#include "io_ports.h"

/*
+ * Knob to control our willingness to enable the local APIC.
+ */
+int enable_local_apic __initdata = 0; /* -1=force-disable, +1=force-enable */
+
+/*
* Debug level
*/
int apic_verbosity;
@@ -666,26 +671,6 @@
* Detect and enable local APICs on non-SMP boards.
* Original code written by Keir Fraser.
*/
-
-/*
- * Knob to control our willingness to enable the local APIC.
- */
-int enable_local_apic __initdata = 0; /* -1=force-disable, +1=force-enable */
-
-static int __init lapic_disable(char *str)
-{
- enable_local_apic = -1;
- clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
- return 0;
-}
-__setup("nolapic", lapic_disable);
-
-static int __init lapic_enable(char *str)
-{
- enable_local_apic = 1;
- return 0;
-}
-__setup("lapic", lapic_enable);

static int __init apic_set_verbosity(char *str)
{
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/arch/i386/kernel/setup.c linux-2.6.11-rc1-mm1-nokexec-x86-local-apic-fix/arch/i386/kernel/setup.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/arch/i386/kernel/setup.c Fri Jan 14 04:28:30 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-local-apic-fix/arch/i386/kernel/setup.c Tue Jan 18 22:43:55 2005
@@ -41,6 +41,7 @@
#include <linux/init.h>
#include <linux/edd.h>
#include <video/edid.h>
+#include <asm/apic.h>
#include <asm/e820.h>
#include <asm/mpspec.h>
#include <asm/setup.h>
@@ -813,6 +814,16 @@
disable_ioapic_setup();
#endif /* CONFIG_X86_LOCAL_APIC */
#endif /* CONFIG_ACPI_BOOT */
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ /* enable local APIC */
+ else if (!memcmp(from, "lapic", 5))
+ lapic_enable();
+
+ /* disable local APIC */
+ else if (!memcmp(from, "nolapic", 7))
+ lapic_disable();
+#endif /* CONFIG_X86_LOCAL_APIC */

/*
* highmem=size forces highmem to be exactly 'size' bytes.
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/include/asm-i386/apic.h linux-2.6.11-rc1-mm1-nokexec-x86-local-apic-fix/include/asm-i386/apic.h
--- linux-2.6.11-rc1-mm1-nokexec-x86-rename-apic_mode_exint/include/asm-i386/apic.h Fri Jan 7 12:54:13 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-local-apic-fix/include/asm-i386/apic.h Tue Jan 18 22:43:55 2005
@@ -5,6 +5,7 @@
#include <linux/pm.h>
#include <asm/fixmap.h>
#include <asm/apicdef.h>
+#include <asm/processor.h>
#include <asm/system.h>

#define Dprintk(x...)
@@ -16,7 +17,19 @@
#define APIC_VERBOSE 1
#define APIC_DEBUG 2

+extern int enable_local_apic;
extern int apic_verbosity;
+
+static inline void lapic_disable(void)
+{
+ enable_local_apic = -1;
+ clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
+}
+
+static inline void lapic_enable(void)
+{
+ enable_local_apic = 1;
+}

/*
* Define the default level of output to be very little

2005-01-19 08:06:54

by Eric W. Biederman

Subject: [PATCH 4/29] x86-i8259-shutdown


From: Eric W. Biederman <[email protected]>

This patch disables interrupt generation from the legacy pic on reboot. Now
that there is a sys_device class it should not be called while drivers are
still using interrupts.
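
For illustration only (not part of the patch): a sketch of the masking step against a simulated port space. fake_ports, fake_outb and pic_mask_all are hypothetical names; the kernel code writes to the real ports with outb().

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_IMR 0x21 /* interrupt mask register, first 8259A */
#define PIC_SLAVE_IMR  0xA1 /* interrupt mask register, second 8259A */

/* Simulated I/O port space, standing in for real hardware. */
static uint8_t fake_ports[0x100];

/* Same argument order as the kernel's outb(value, port). */
static void fake_outb(uint8_t value, uint16_t port)
{
	assert(port < sizeof(fake_ports));
	fake_ports[port] = value;
}

/* Mask all eight interrupt lines on each controller. */
static void pic_mask_all(void)
{
	fake_outb(0xff, PIC_MASTER_IMR);
	fake_outb(0xff, PIC_SLAVE_IMR);
}
```

Writing 0xff sets every mask bit, so neither controller raises an interrupt until the next kernel reprograms it.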

There is a report about this breaking ACPI power off on some systems:
http://bugme.osdl.org/show_bug.cgi?id=4041
However, the final comment seems to exonerate this code. So until
I get more information I believe that was a false positive.

Signed-off-by: Eric Biederman <[email protected]>
---

i8259.c | 12 ++++++++++++
1 files changed, 12 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-e820-64bit/arch/i386/kernel/i8259.c linux-2.6.11-rc1-mm1-nokexec-x86-i8259-shutdown/arch/i386/kernel/i8259.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-e820-64bit/arch/i386/kernel/i8259.c Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-i8259-shutdown/arch/i386/kernel/i8259.c Tue Jan 18 22:44:27 2005
@@ -270,10 +270,22 @@
return 0;
}

+static int i8259A_shutdown(struct sys_device *dev)
+{
+ /* Put the i8259A into a quiescent state that
+ * the kernel initialization code can get it
+ * out of.
+ */
+ outb(0xff, 0x21); /* mask all of 8259A-1 */
+ outb(0xff, 0xA1); /* mask all of 8259A-2 */
+ return 0;
+}
+
static struct sysdev_class i8259_sysdev_class = {
set_kset_name("i8259"),
.suspend = i8259A_suspend,
.resume = i8259A_resume,
+ .shutdown = i8259A_shutdown,
};

static struct sys_device device_i8259A = {

2005-01-19 08:09:10

by Eric W. Biederman

Subject: [PATCH 3/29] x86_64-e820-64bit


From: Eric W. Biederman <[email protected]>

It is ok to reserve resources > 4G on x86_64; struct resource is 64bit now :)

Signed-off-by: Eric Biederman <[email protected]>
---

e820.c | 2 --
1 files changed, 2 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-local-apic-fix/arch/x86_64/kernel/e820.c linux-2.6.11-rc1-mm1-nokexec-x86_64-e820-64bit/arch/x86_64/kernel/e820.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-local-apic-fix/arch/x86_64/kernel/e820.c Mon Oct 18 15:53:45 2004
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-e820-64bit/arch/x86_64/kernel/e820.c Tue Jan 18 22:44:10 2005
@@ -185,8 +185,6 @@
int i;
for (i = 0; i < e820.nr_map; i++) {
struct resource *res;
- if (e820.map[i].addr + e820.map[i].size > 0x100000000ULL)
- continue;
res = alloc_bootmem_low(sizeof(struct resource));
switch (e820.map[i].type) {
case E820_RAM: res->name = "System RAM"; break;

2005-01-19 08:09:09

by Eric W. Biederman

Subject: [PATCH 18/29] x86_64-machine_shutdown


Factor out the apic and smp shutdown code from machine_restart so it can be
called in the kexec reboot path as well.
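
For illustration only (not part of the patch): a small model of how machine_shutdown() picks the cpu to run on. The function pick_reboot_cpu and the plain 64-bit mask are hypothetical simplifications of cpu_online_map.

```c
#include <assert.h>
#include <stdint.h>

/*
 * online_mask has bit N set when logical cpu N is online;
 * current_cpu is the cpu the caller is running on.
 */
static unsigned int pick_reboot_cpu(uint64_t online_mask,
				    unsigned int current_cpu)
{
	unsigned int reboot_cpu_id = 0; /* the boot cpu is logical cpu 0 */

	assert(current_cpu < 64);
	/* The caller must itself be running on an online cpu. */
	assert(online_mask & (UINT64_C(1) << current_cpu));

	/* Fall back to the current cpu when cpu 0 is offline. */
	if (!(online_mask & UINT64_C(1)))
		reboot_cpu_id = current_cpu;

	return reboot_cpu_id;
}
```

In the patch the result is then passed to set_cpus_allowed() so that the remaining shutdown steps run on that one cpu before the others are stopped.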

Signed-off-by: Eric Biederman <[email protected]>
---

reboot.c | 62 +++++++++++++++++++++++++++++++++-----------------------------
1 files changed, 33 insertions(+), 29 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-crashkernel/arch/x86_64/kernel/reboot.c linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/reboot.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-crashkernel/arch/x86_64/kernel/reboot.c Fri Jan 14 04:28:33 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/reboot.c Tue Jan 18 23:13:50 2005
@@ -66,41 +66,47 @@

__setup("reboot=", reboot_setup);

-#ifdef CONFIG_SMP
-static void smp_halt(void)
+static inline void kb_wait(void)
{
- int cpuid = safe_smp_processor_id();
- static int first_entry = 1;
+ int i;
+
+ for (i=0; i<0x10000; i++)
+ if ((inb_p(0x64) & 0x02) == 0)
+ break;
+}

- if (reboot_force)
- return;
+void machine_shutdown(void)
+{
+ /* Stop the cpus and apics */
+#ifdef CONFIG_SMP
+ int reboot_cpu_id;

- if (first_entry) {
- first_entry = 0;
- smp_call_function((void *)machine_restart, NULL, 1, 0);
- }
-
- smp_stop_cpu();
+ /* The boot cpu is always logical cpu 0 */
+ reboot_cpu_id = 0;

- /* AP calling this. Just halt */
- if (cpuid != boot_cpu_id) {
- for (;;)
- asm("hlt");
+ /* Make certain the cpu I'm about to reboot on is online */
+ if (!cpu_isset(reboot_cpu_id, cpu_online_map)) {
+ reboot_cpu_id = smp_processor_id();
}

- /* Wait for all other CPUs to have run smp_stop_cpu */
- while (!cpus_empty(cpu_online_map))
- rep_nop();
-}
+ /* Make certain I only run on the appropriate processor */
+ set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+
+ /* O.K Now that I'm on the appropriate processor,
+ * stop all of the others.
+ */
+ smp_send_stop();
#endif

-static inline void kb_wait(void)
-{
- int i;
+ local_irq_disable();

- for (i=0; i<0x10000; i++)
- if ((inb_p(0x64) & 0x02) == 0)
- break;
+#ifndef CONFIG_SMP
+ disable_local_APIC();
+#endif
+
+ disable_IO_APIC();
+
+ local_irq_enable();
}

void machine_restart(char * __unused)
@@ -109,9 +115,7 @@

printk("machine restart\n");

-#ifdef CONFIG_SMP
- smp_halt();
-#endif
+ machine_shutdown();

if (!reboot_force) {
local_irq_disable();

2005-01-19 08:12:12

by Eric W. Biederman

Subject: [PATCH 13/29] x86_64-config-kernel-start



For one kernel to report a crash another kernel has created we need
to have 2 kernels loaded simultaneously in memory. To accomplish this
the two kernels need to be built to run at different physical addresses.

This patch adds the CONFIG_PHYSICAL_START option to the x86_64 kernel
so we can do just that. You need to know what you are doing and what
the ramifications are before changing this value, and most users
won't care, so I have made it depend on CONFIG_EMBEDDED.

bzImage kernels will work and run at a different address when compiled
with this option but they will still load at 1MB. If you need a kernel
loaded at a different address as well you need to boot a vmlinux.
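
For illustration only (not part of the patch): a sketch of the address arithmetic the new option introduces. start_kernel and early_pgt_entry are hypothetical helpers, and the page-alignment checks are an assumption of this sketch rather than something the patch enforces.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP UINT64_C(0xffffffff80000000) /* __START_KERNEL_map */

/* Virtual start address for a given CONFIG_PHYSICAL_START value. */
static uint64_t start_kernel(uint64_t physical_start)
{
	assert((physical_start & 0xfff) == 0); /* assumed page aligned */
	return START_KERNEL_MAP + physical_start;
}

/*
 * Boot page table entry for a table that sits table_offset bytes
 * into the kernel image; 0x007 is the low flag value used in head.S.
 */
static uint64_t early_pgt_entry(uint64_t physical_start,
				uint64_t table_offset)
{
	assert((table_offset & 0xfff) == 0);
	return physical_start + table_offset + 0x007;
}
```

With the default of 0x100000 these reproduce the constants that were previously hard coded, such as 0xffffffff80100000 and 0x102007.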

Signed-off-by: Eric Biederman <[email protected]>
---

arch/x86_64/Kconfig | 11 +++++++++++
arch/x86_64/boot/compressed/head.S | 7 ++++---
arch/x86_64/boot/compressed/misc.c | 7 ++++---
arch/x86_64/kernel/head.S | 18 +++++++++---------
include/asm-x86_64/page.h | 6 ++++--
5 files changed, 32 insertions(+), 17 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/x86_64/Kconfig linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/arch/x86_64/Kconfig
--- linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/x86_64/Kconfig Fri Jan 14 04:32:23 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/arch/x86_64/Kconfig Tue Jan 18 22:46:57 2005
@@ -359,6 +359,17 @@
help
Additional support for intel specific MCE features such as
the thermal monitor.
+
+config PHYSICAL_START
+ hex "Physical address where the kernel is loaded" if EMBEDDED
+ default "0x100000"
+ help
+ This gives the physical address where the kernel is loaded.
+ Primarily used in the case of kexec on panic where the
+ fail safe kernel needs to run at a different address than
+ the panic-ed kernel.
+
+ Don't change this unless you know what you are doing.
endmenu

#
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/x86_64/boot/compressed/head.S linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/arch/x86_64/boot/compressed/head.S
--- linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/x86_64/boot/compressed/head.S Mon Oct 18 15:55:28 2004
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/arch/x86_64/boot/compressed/head.S Tue Jan 18 22:46:57 2005
@@ -28,6 +28,7 @@

#include <linux/linkage.h>
#include <asm/segment.h>
+#include <asm/page.h>

.code32
.globl startup_32
@@ -77,7 +78,7 @@
jnz 3f
addl $8,%esp
xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $0x100000
+ ljmp $(__KERNEL_CS), $__PHYSICAL_START

/*
* We come here, if we were loaded high.
@@ -103,7 +104,7 @@
popl %ecx # lcount
popl %edx # high_buffer_start
popl %eax # hcount
- movl $0x100000,%edi
+ movl $__PHYSICAL_START,%edi
cli # make sure we don't get interrupted
ljmp $(__KERNEL_CS), $0x1000 # and jump to the move routine

@@ -128,7 +129,7 @@
movsl
movl %ebx,%esi # Restore setup pointer
xorl %ebx,%ebx
- ljmp $(__KERNEL_CS), $0x100000
+ ljmp $(__KERNEL_CS), $__PHYSICAL_START
move_routine_end:


diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/x86_64/boot/compressed/misc.c linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/arch/x86_64/boot/compressed/misc.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/x86_64/boot/compressed/misc.c Mon Oct 18 15:53:51 2004
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/arch/x86_64/boot/compressed/misc.c Tue Jan 18 22:46:57 2005
@@ -11,6 +11,7 @@

#include "miscsetup.h"
#include <asm/io.h>
+#include <asm/page.h>

/*
* gzip declarations
@@ -284,7 +285,7 @@
#else
if ((ALT_MEM_K > EXT_MEM_K ? ALT_MEM_K : EXT_MEM_K) < 1024) error("Less than 2MB of memory");
#endif
- output_data = (char *)0x100000; /* Points to 1M */
+ output_data = (char *)__PHYSICAL_START; /* Normally Points to 1M */
free_mem_end_ptr = (long)real_mode;
}

@@ -307,8 +308,8 @@
low_buffer_size = low_buffer_end - LOW_BUFFER_START;
high_loaded = 1;
free_mem_end_ptr = (long)high_buffer_start;
- if ( (0x100000 + low_buffer_size) > ((ulg)high_buffer_start)) {
- high_buffer_start = (uch *)(0x100000 + low_buffer_size);
+ if ( (__PHYSICAL_START + low_buffer_size) > ((ulg)high_buffer_start)) {
+ high_buffer_start = (uch *)(__PHYSICAL_START + low_buffer_size);
mv->hcount = 0; /* say: we need not to move high_buffer */
}
else mv->hcount = -1;
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/x86_64/kernel/head.S linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/arch/x86_64/kernel/head.S
--- linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/arch/x86_64/kernel/head.S Tue Jan 18 22:46:24 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/arch/x86_64/kernel/head.S Tue Jan 18 22:46:57 2005
@@ -235,23 +235,23 @@
*/
.org 0x1000
ENTRY(init_level4_pgt)
- .quad 0x0000000000102007 /* -> level3_ident_pgt */
+ .quad 0x0000000000002007 + __PHYSICAL_START /* -> level3_ident_pgt */
.fill 255,8,0
- .quad 0x000000000010a007
+ .quad 0x000000000000a007 + __PHYSICAL_START
.fill 254,8,0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad 0x0000000000103007 /* -> level3_kernel_pgt */
+ .quad 0x0000000000003007 + __PHYSICAL_START /* -> level3_kernel_pgt */

.org 0x2000
ENTRY(level3_ident_pgt)
- .quad 0x0000000000104007
+ .quad 0x0000000000004007 + __PHYSICAL_START
.fill 511,8,0

.org 0x3000
ENTRY(level3_kernel_pgt)
.fill 510,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
- .quad 0x0000000000105007 /* -> level2_kernel_pgt */
+ .quad 0x0000000000005007 + __PHYSICAL_START /* -> level2_kernel_pgt */
.fill 1,8,0

.org 0x4000
@@ -324,17 +324,17 @@

.org 0xa000
ENTRY(level3_physmem_pgt)
- .quad 0x0000000000105007 /* -> level2_kernel_pgt (so that __va works even before pagetable_init) */
+ .quad 0x0000000000005007 + __PHYSICAL_START /* -> level2_kernel_pgt (so that __va works even before pagetable_init) */

.org 0xb000
#ifdef CONFIG_ACPI_SLEEP
ENTRY(wakeup_level4_pgt)
- .quad 0x0000000000102007 /* -> level3_ident_pgt */
+ .quad 0x0000000000002007 + __PHYSICAL_START /* -> level3_ident_pgt */
.fill 255,8,0
- .quad 0x000000000010a007
+ .quad 0x000000000000a007 + __PHYSICAL_START
.fill 254,8,0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad 0x0000000000103007 /* -> level3_kernel_pgt */
+ .quad 0x0000000000003007 + __PHYSICAL_START /* -> level3_kernel_pgt */
#endif

.data
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/include/asm-x86_64/page.h linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/include/asm-x86_64/page.h
--- linux-2.6.11-rc1-mm1-nokexec-x86-config-kernel-start/include/asm-x86_64/page.h Fri Jan 14 04:32:27 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-config-kernel-start/include/asm-x86_64/page.h Tue Jan 18 22:46:57 2005
@@ -65,12 +65,14 @@
extern unsigned long vm_data_default_flags, vm_data_default_flags32;
extern unsigned long vm_force_exec32;

-#define __START_KERNEL 0xffffffff80100000UL
+#define __PHYSICAL_START ((unsigned long)CONFIG_PHYSICAL_START)
+#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
#define __START_KERNEL_map 0xffffffff80000000UL
#define __PAGE_OFFSET 0xffff810000000000UL

#else
-#define __START_KERNEL 0xffffffff80100000
+#define __PHYSICAL_START CONFIG_PHYSICAL_START
+#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
#define __START_KERNEL_map 0xffffffff80000000
#define __PAGE_OFFSET 0xffff810000000000
#endif /* !__ASSEMBLY__ */

2005-01-19 08:14:23

by Eric W. Biederman

Subject: [PATCH 11/29] x86_64-entry64


To enable bootloaders to directly load the x86_64 vmlinux
and to enable the x86_64 kernel to switch into 64bit mode earlier
this patch refactors the x86_64 entry code so there is a native
64bit entry point to the kernel.
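
For illustration only (not part of the patch): a sketch that computes the control register values startup_64 builds with its bts instructions. The function names are hypothetical; the bit numbers are the ones used in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set one bit, like the bts instructions in head.S. */
static uint64_t bts(uint64_t value, unsigned int bit)
{
	assert(bit < 64);
	return value | (UINT64_C(1) << bit);
}

/* CR4 as built in startup_64: PAE (bit 5) and PGE (bit 7). */
static uint64_t startup_64_cr4(void)
{
	uint64_t v = 0;

	v = bts(v, 5);
	v = bts(v, 7);
	return v;
}

/* CR0 as built in startup_64. */
static uint64_t startup_64_cr0(void)
{
	/* PG, PE, MP, ET, NE, WP, AM */
	static const unsigned int bits[] = { 31, 0, 1, 4, 5, 16, 18 };
	uint64_t v = 0;
	size_t i;

	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
		v = bts(v, bits[i]);
	return v;
}
```

The 32bit entry path now sets only the minimum needed to reach long mode (PAE, LME, paging and protected mode) and leaves the remaining bits to startup_64.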

I ran this by Andi Kleen and he agreed it looks fairly sane.

Signed-off-by: Eric Biederman <[email protected]>
---

head.S | 112 ++++++++++++++++++++++++++++++++--------------------------
smpboot.c | 2 -
trampoline.S | 24 +++---------
vmlinux.lds.S | 3 +
4 files changed, 70 insertions(+), 71 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/head.S linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/x86_64/kernel/head.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/head.S Fri Jan 14 04:28:33 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/x86_64/kernel/head.S Tue Jan 18 22:46:24 2005
@@ -26,6 +26,7 @@

.text
.code32
+ .globl startup_32
/* %bx: 1 if coming from smp trampoline on secondary cpu */
startup_32:

@@ -37,11 +38,13 @@
* There is no stack until we set one up.
*/

- movl %ebx,%ebp /* Save trampoline flag */
-
+ /* Initialize the %ds segment register */
movl $__KERNEL_DS,%eax
movl %eax,%ds
-
+
+ /* Load new GDT with the 64bit segments using 32bit descriptor */
+ lgdt pGDT32 - __START_KERNEL_map
+
/* If the CPU doesn't support CPUID this will double fault.
* Unfortunately it is hard to check for CPUID without a stack.
*/
@@ -57,16 +60,13 @@
btl $29, %edx
jnc no_long_mode

- movl %edx,%edi
-
/*
* Prepare for entering 64bits mode
*/

- /* Enable PAE mode and PGE */
+ /* Enable PAE mode */
xorl %eax, %eax
btsl $5, %eax
- btsl $7, %eax
movl %eax, %cr4

/* Setup early boot stage 4 level pagetables */
@@ -79,14 +79,6 @@

/* Enable Long Mode */
btsl $_EFER_LME, %eax
- /* Enable System Call */
- btsl $_EFER_SCE, %eax
-
- /* No Execute supported? */
- btl $20,%edi
- jnc 1f
- btsl $_EFER_NX, %eax
-1:

/* Make changes effective */
wrmsr
@@ -94,38 +86,69 @@
xorl %eax, %eax
btsl $31, %eax /* Enable paging and in turn activate Long Mode */
btsl $0, %eax /* Enable protected mode */
- btsl $1, %eax /* Enable MP */
- btsl $4, %eax /* Enable ET */
- btsl $5, %eax /* Enable NE */
- btsl $16, %eax /* Enable WP */
- btsl $18, %eax /* Enable AM */
/* Make changes effective */
movl %eax, %cr0
- jmp reach_compatibility_mode
-reach_compatibility_mode:
-
/*
* At this point we're in long mode but in 32bit compatibility mode
* with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
- * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we load
+ * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we use
* the new gdt/idt that has __KERNEL_CS with CS.L = 1.
*/
-
- testw %bp,%bp /* secondary CPU? */
- jnz second
-
- /* Load new GDT with the 64bit segment using 32bit descriptor */
- movl $(pGDT32 - __START_KERNEL_map), %eax
- lgdt (%eax)
-
-second:
- movl $(ljumpvector - __START_KERNEL_map), %eax
- /* Finally jump in 64bit mode */
- ljmp *(%eax)
+ ljmp $__KERNEL_CS, $(startup_64 - __START_KERNEL_map)

.code64
.org 0x100
-reach_long64:
+ .globl startup_64
+startup_64:
+ /* We come here either from startup_32
+ * or directly from a 64bit bootloader.
+ * Since we may have come directly from a bootloader we
+ * reload the page tables here.
+ */
+
+ /* Enable PAE mode and PGE */
+ xorq %rax, %rax
+ btsq $5, %rax
+ btsq $7, %rax
+ movq %rax, %cr4
+
+ /* Setup early boot stage 4 level pagetables. */
+ movq $(init_level4_pgt - __START_KERNEL_map), %rax
+ movq %rax, %cr3
+
+ /* Check if nx is implemented */
+ movl $0x80000001, %eax
+ cpuid
+ movl %edx,%edi
+
+ /* Setup EFER (Extended Feature Enable Register) */
+ movl $MSR_EFER, %ecx
+ rdmsr
+
+ /* Enable System Call */
+ btsl $_EFER_SCE, %eax
+
+ /* No Execute supported? */
+ btl $20,%edi
+ jnc 1f
+ btsl $_EFER_NX, %eax
+1:
+ /* Make changes effective */
+ wrmsr
+
+ /* Setup cr0 */
+ xorq %rax, %rax
+ btsq $31, %rax /* Enable paging */
+ btsq $0, %rax /* Enable protected mode */
+ btsq $1, %rax /* Enable MP */
+ btsq $4, %rax /* Enable ET */
+ btsq $5, %rax /* Enable NE */
+ btsq $16, %rax /* Enable WP */
+ btsq $18, %rax /* Enable AM */
+ /* Make changes effective */
+ movq %rax, %cr0
+
+ /* Setup a boot time stack */
movq init_rsp(%rip),%rsp

/* zero EFLAGS after setting rsp */
@@ -198,13 +221,8 @@
.org 0xf00
.globl pGDT32
pGDT32:
- .word gdt32_end-gdt_table32
- .long gdt_table32-__START_KERNEL_map
-
-.org 0xf10
-ljumpvector:
- .long reach_long64-__START_KERNEL_map
- .word __KERNEL_CS
+ .word gdt_end-cpu_gdt_table
+ .long cpu_gdt_table-__START_KERNEL_map

ENTRY(stext)
ENTRY(_stext)
@@ -334,12 +352,6 @@
.endr
#endif

-ENTRY(gdt_table32)
- .quad 0x0000000000000000 /* This one is magic */
- .quad 0x0000000000000000 /* unused */
- .quad 0x00af9a000000ffff /* __KERNEL_CS */
-gdt32_end:
-
/* We need valid kernel segments for data and code in long mode too
* IRET will check the segment types kkeil 2000/10/28
* Also sysret mandates a special GDT layout
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/smpboot.c linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/x86_64/kernel/smpboot.c
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/smpboot.c Fri Jan 14 04:28:33 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/x86_64/kernel/smpboot.c Tue Jan 18 22:46:24 2005
@@ -91,8 +91,6 @@
static unsigned long __init setup_trampoline(void)
{
void *tramp = __va(SMP_TRAMPOLINE_BASE);
- extern volatile __u32 tramp_gdt_ptr;
- tramp_gdt_ptr = __pa_symbol(&cpu_gdt_table);
memcpy(tramp, trampoline_data, trampoline_end - trampoline_data);
return virt_to_phys(tramp);
}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/trampoline.S linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/x86_64/kernel/trampoline.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/trampoline.S Mon Oct 18 15:55:06 2004
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/x86_64/kernel/trampoline.S Tue Jan 18 22:46:24 2005
@@ -37,7 +37,6 @@
mov %cs, %ax # Code and data in the same place
mov %ax, %ds

- mov $1, %bx # Flag an SMP trampoline
cli # We should be safe anyway

movl $0xA5A5A5A5, trampoline_data - r_base
@@ -46,31 +45,20 @@
lidt idt_48 - r_base # load idt with 0, 0
lgdt gdt_48 - r_base # load gdt with whatever is appropriate

- movw $__KERNEL_DS,%ax
- movw %ax,%ds
- movw %ax,%es
-
xor %ax, %ax
inc %ax # protected mode (PE) bit
lmsw %ax # into protected mode
- jmp flush_instr
-flush_instr:
- ljmpl $__KERNEL32_CS, $0x00100000
- # jump to startup_32 in arch/x86_64/kernel/head.S
-
+ # flush prefetch and jump to startup_32 in arch/x86_64/kernel/head.S
+ ljmpl $__KERNEL32_CS, $(startup_32-__START_KERNEL_map)
+
+ # Careful these need to be in the same 64K segment as the above;
idt_48:
.word 0 # idt limit = 0
.word 0, 0 # idt base = 0L

gdt_48:
- .short 0x0800 # gdt limit = 2048, 256 GDT entries
- .globl tramp_gdt_ptr
-tramp_gdt_ptr:
- .long 0 # gdt base = gdt (first SMP CPU)
- # this is filled in by C because the 64bit
- # linker doesn't support absolute 32bit
- # relocations.
-
+ .short __KERNEL32_CS + 7 # gdt limit
+ .long cpu_gdt_table-__START_KERNEL_map

.globl trampoline_end
trampoline_end:
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/vmlinux.lds.S linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/x86_64/kernel/vmlinux.lds.S
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/vmlinux.lds.S Tue Jan 18 22:46:07 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-entry64/arch/x86_64/kernel/vmlinux.lds.S Tue Jan 18 22:46:24 2005
@@ -10,11 +10,12 @@

OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
-ENTRY(stext)
+ENTRY(phys_startup_64)
jiffies_64 = jiffies;
SECTIONS
{
. = __START_KERNEL;
+ phys_startup_64 = startup_64 - LOAD_OFFSET;
_text = .; /* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
*(.text)

2005-01-19 08:12:12

by Eric W. Biederman

Subject: [PATCH 10/29] x86_64-vmlinux-fix-physical-addrs


The vmlinux on x86_64 does not report the correct physical address of
the kernel. Instead in the physical address field it currently
reports the virtual address of the kernel.

This patch is a bug fix that corrects vmlinux to report the
proper physical addresses.

This is potentially a help for crash dump analysis tools.

This definitely allows bootloaders to load vmlinux as a standard
ELF executable. Bootloaders directly loading vmlinux become of
practical importance when we consider the kexec on panic case.
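
For illustration only (not part of the patch): a sketch of the arithmetic behind the AT(ADDR(section) - LOAD_OFFSET) expressions and the 4095 round-up used for the vsyscall page. load_addr and page_align_up are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_OFFSET UINT64_C(0xffffffff80000000) /* __START_KERNEL_map */

/* Physical load address for a section linked at virt_addr. */
static uint64_t load_addr(uint64_t virt_addr)
{
	assert(virt_addr >= LOAD_OFFSET);
	return virt_addr - LOAD_OFFSET;
}

/* Round up to the next 4096 byte boundary: (x + 4095) & ~4095. */
static uint64_t page_align_up(uint64_t x)
{
	assert(x <= UINT64_MAX - 4095); /* avoid wrap-around */
	return (x + 4095) & ~UINT64_C(4095);
}
```

A section linked at 0xffffffff80100000 therefore gets a physical address of 0x100000 in its program header, which is what an ELF loader needs.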

Signed-off-by: Eric Biederman <[email protected]>
---

Makefile | 2
kernel/vmlinux.lds.S | 130 ++++++++++++++++++++++++++++++++++-----------------
2 files changed, 88 insertions(+), 44 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-vmlinux-fix-physical-addrs/arch/x86_64/Makefile linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/Makefile
--- linux-2.6.11-rc1-mm1-nokexec-x86-vmlinux-fix-physical-addrs/arch/x86_64/Makefile Fri Jan 14 04:28:33 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/Makefile Tue Jan 18 22:46:07 2005
@@ -35,7 +35,7 @@

LDFLAGS := -m elf_x86_64
OBJCOPYFLAGS := -O binary -R .note -R .comment -S
-LDFLAGS_vmlinux := -e stext
+LDFLAGS_vmlinux :=

CHECKFLAGS += -D__x86_64__ -m64

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-vmlinux-fix-physical-addrs/arch/x86_64/kernel/vmlinux.lds.S linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/vmlinux.lds.S
--- linux-2.6.11-rc1-mm1-nokexec-x86-vmlinux-fix-physical-addrs/arch/x86_64/kernel/vmlinux.lds.S Fri Jan 7 12:53:50 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-vmlinux-fix-physical-addrs/arch/x86_64/kernel/vmlinux.lds.S Tue Jan 18 22:46:07 2005
@@ -2,36 +2,41 @@
* Written by Martin Mares <[email protected]>;
*/

+#define LOAD_OFFSET __START_KERNEL_map
+
#include <asm-generic/vmlinux.lds.h>
+#include <asm/page.h>
#include <linux/config.h>

OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
-ENTRY(_start)
+ENTRY(stext)
jiffies_64 = jiffies;
SECTIONS
{
- . = 0xffffffff80100000;
+ . = __START_KERNEL;
_text = .; /* Text and read-only data */
- .text : {
+ .text : AT(ADDR(.text) - LOAD_OFFSET) {
*(.text)
SCHED_TEXT
LOCK_TEXT
*(.fixup)
*(.gnu.warning)
} = 0x9090
- .text.lock : { *(.text.lock) } /* out-of-line lock text */
+ /* out-of-line lock text */
+ .text.lock : AT(ADDR(.text.lock) - LOAD_OFFSET) { *(.text.lock) }

_etext = .; /* End of text section */

. = ALIGN(16); /* Exception table */
__start___ex_table = .;
- __ex_table : { *(__ex_table) }
+ __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) { *(__ex_table) }
__stop___ex_table = .;

RODATA

- .data : { /* Data */
+ /* Data */
+ .data : AT(ADDR(.data) - LOAD_OFFSET) {
*(.data)
CONSTRUCTORS
}
@@ -39,62 +44,95 @@
_edata = .; /* End of data section */

__bss_start = .; /* BSS */
- .bss : {
+ .bss : AT(ADDR(.bss) - LOAD_OFFSET) {
*(.bss.page_aligned)
*(.bss)
}
__bss_end = .;

+ . = ALIGN(PAGE_SIZE);
. = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
- .data.cacheline_aligned : { *(.data.cacheline_aligned) }
+ .data.cacheline_aligned : AT(ADDR(.data.cacheline_aligned) - LOAD_OFFSET) {
+ *(.data.cacheline_aligned)
+ }

-#define AFTER(x) BINALIGN(LOADADDR(x) + SIZEOF(x), 16)
-#define BINALIGN(x,y) (((x) + (y) - 1) & ~((y) - 1))
-#define CACHE_ALIGN(x) BINALIGN(x, CONFIG_X86_L1_CACHE_BYTES)
+#define VSYSCALL_ADDR (-10*1024*1024)
+#define VSYSCALL_PHYS_ADDR ((LOADADDR(.data.cacheline_aligned) + SIZEOF(.data.cacheline_aligned) + 4095) & ~(4095))
+#define VSYSCALL_VIRT_ADDR ((ADDR(.data.cacheline_aligned) + SIZEOF(.data.cacheline_aligned) + 4095) & ~(4095))
+
+#define VLOAD_OFFSET (VSYSCALL_ADDR - VSYSCALL_PHYS_ADDR)
+#define VLOAD(x) (ADDR(x) - VLOAD_OFFSET)
+
+#define VVIRT_OFFSET (VSYSCALL_ADDR - VSYSCALL_VIRT_ADDR)
+#define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
+
+ . = VSYSCALL_ADDR;
+ .vsyscall_0 : AT(VSYSCALL_PHYS_ADDR) { *(.vsyscall_0) }
+ __vsyscall_0 = VSYSCALL_VIRT_ADDR;

- .vsyscall_0 -10*1024*1024: AT ((LOADADDR(.data.cacheline_aligned) + SIZEOF(.data.cacheline_aligned) + 4095) & ~(4095)) { *(.vsyscall_0) }
- __vsyscall_0 = LOADADDR(.vsyscall_0);
. = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
- .xtime_lock : AT CACHE_ALIGN(AFTER(.vsyscall_0)) { *(.xtime_lock) }
- xtime_lock = LOADADDR(.xtime_lock);
- .vxtime : AT AFTER(.xtime_lock) { *(.vxtime) }
- vxtime = LOADADDR(.vxtime);
- .wall_jiffies : AT AFTER(.vxtime) { *(.wall_jiffies) }
- wall_jiffies = LOADADDR(.wall_jiffies);
- .sys_tz : AT AFTER(.wall_jiffies) { *(.sys_tz) }
- sys_tz = LOADADDR(.sys_tz);
- .sysctl_vsyscall : AT AFTER(.sys_tz) { *(.sysctl_vsyscall) }
- sysctl_vsyscall = LOADADDR(.sysctl_vsyscall);
- .xtime : AT AFTER(.sysctl_vsyscall) { *(.xtime) }
- xtime = LOADADDR(.xtime);
+ .xtime_lock : AT(VLOAD(.xtime_lock)) { *(.xtime_lock) }
+ xtime_lock = VVIRT(.xtime_lock);
+
+ .vxtime : AT(VLOAD(.vxtime)) { *(.vxtime) }
+ vxtime = VVIRT(.vxtime);
+
+ .wall_jiffies : AT(VLOAD(.wall_jiffies)) { *(.wall_jiffies) }
+ wall_jiffies = VVIRT(.wall_jiffies);
+
+ .sys_tz : AT(VLOAD(.sys_tz)) { *(.sys_tz) }
+ sys_tz = VVIRT(.sys_tz);
+
+ .sysctl_vsyscall : AT(VLOAD(.sysctl_vsyscall)) { *(.sysctl_vsyscall) }
+ sysctl_vsyscall = VVIRT(.sysctl_vsyscall);
+
+ .xtime : AT(VLOAD(.xtime)) { *(.xtime) }
+ xtime = VVIRT(.xtime);
+
. = ALIGN(CONFIG_X86_L1_CACHE_BYTES);
- .jiffies : AT CACHE_ALIGN(AFTER(.xtime)) { *(.jiffies) }
- jiffies = LOADADDR(.jiffies);
- .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT (LOADADDR(.vsyscall_0) + 1024) { *(.vsyscall_1) }
- . = LOADADDR(.vsyscall_0) + 4096;
+ .jiffies : AT(VLOAD(.jiffies)) { *(.jiffies) }
+ jiffies = VVIRT(.jiffies);
+
+ .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) { *(.vsyscall_1) }
+ .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) { *(.vsyscall_2) }
+ .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) { *(.vsyscall_3) }
+
+ . = VSYSCALL_VIRT_ADDR + 4096;
+
+#undef VSYSCALL_ADDR
+#undef VSYSCALL_PHYS_ADDR
+#undef VSYSCALL_VIRT_ADDR
+#undef VLOAD_OFFSET
+#undef VLOAD
+#undef VVIRT_OFFSET
+#undef VVIRT

. = ALIGN(8192); /* init_task */
- .data.init_task : { *(.data.init_task) }
+ .data.init_task : AT(ADDR(.data.init_task) - LOAD_OFFSET) {
+ *(.data.init_task)
+ }

. = ALIGN(4096);
- .data.page_aligned : { *(.data.page_aligned) }
+ .data.page_aligned : AT(ADDR(.data.page_aligned) - LOAD_OFFSET) {
+ *(.data.page_aligned)
+ }

. = ALIGN(4096); /* Init code and data */
__init_begin = .;
- .init.text : {
+ .init.text : AT(ADDR(.init.text) - LOAD_OFFSET) {
_sinittext = .;
*(.init.text)
_einittext = .;
}
__initdata_begin = .;
- .init.data : { *(.init.data) }
+ .init.data : AT(ADDR(.init.data) - LOAD_OFFSET) { *(.init.data) }
__initdata_end = .;
. = ALIGN(16);
__setup_start = .;
- .init.setup : { *(.init.setup) }
+ .init.setup : AT(ADDR(.init.setup) - LOAD_OFFSET) { *(.init.setup) }
__setup_end = .;
__initcall_start = .;
- .initcall.init : {
+ .initcall.init : AT(ADDR(.initcall.init) - LOAD_OFFSET) {
*(.initcall1.init)
*(.initcall2.init)
*(.initcall3.init)
@@ -105,32 +143,38 @@
}
__initcall_end = .;
__con_initcall_start = .;
- .con_initcall.init : { *(.con_initcall.init) }
+ .con_initcall.init : AT(ADDR(.con_initcall.init) - LOAD_OFFSET) {
+ *(.con_initcall.init)
+ }
__con_initcall_end = .;
SECURITY_INIT
. = ALIGN(8);
__alt_instructions = .;
- .altinstructions : { *(.altinstructions) }
+ .altinstructions : AT(ADDR(.altinstructions) - LOAD_OFFSET) {
+ *(.altinstructions)
+ }
__alt_instructions_end = .;
- .altinstr_replacement : { *(.altinstr_replacement) }
+ .altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
+ *(.altinstr_replacement)
+ }
/* .exit.text is discard at runtime, not link time, to deal with references
from .altinstructions and .eh_frame */
- .exit.text : { *(.exit.text) }
- .exit.data : { *(.exit.data) }
+ .exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { *(.exit.text) }
+ .exit.data : AT(ADDR(.exit.data) - LOAD_OFFSET) { *(.exit.data) }
. = ALIGN(4096);
__initramfs_start = .;
- .init.ramfs : { *(.init.ramfs) }
+ .init.ramfs : AT(ADDR(.init.ramfs) - LOAD_OFFSET) { *(.init.ramfs) }
__initramfs_end = .;
. = ALIGN(32);
__per_cpu_start = .;
- .data.percpu : { *(.data.percpu) }
+ .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { *(.data.percpu) }
__per_cpu_end = .;
. = ALIGN(4096);
__init_end = .;

. = ALIGN(4096);
__nosave_begin = .;
- .data_nosave : { *(.data.nosave) }
+ .data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) { *(.data.nosave) }
. = ALIGN(4096);
__nosave_end = .;

2005-01-19 08:06:55

by Eric W. Biederman

Subject: [PATCH 15/29] x86-machine_shutdown


Factor out the apic and smp shutdown code from machine_restart so it can be
called in the kexec reboot path as well.

By switching to the bootstrap cpu by default on reboot I can delete/simplify
some motherboard fixups as well.
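
The cpu selection that machine_shutdown performs can be sketched in
user space roughly as follows. This is only an illustration: the name
pick_reboot_cpu, the SKETCH_NR_CPUS constant and the plain bool array
standing in for cpu_online_map are assumptions made for the example and
are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's NR_CPUS. */
#define SKETCH_NR_CPUS 8

/*
 * Pick the logical cpu to reboot on.
 *   requested: cpu from the reboot= command line, or -1 if none was given
 *   online:    online[i] is true when logical cpu i is online
 *   current:   the cpu this code is running on (assumed to be online)
 */
static int pick_reboot_cpu(int requested, const bool online[SKETCH_NR_CPUS],
			   int current)
{
	int cpu = 0;	/* the boot cpu is always logical cpu 0 */

	assert(current >= 0 && current < SKETCH_NR_CPUS);

	/* Honor a command line override only if it names an online cpu. */
	if (requested >= 0 && requested < SKETCH_NR_CPUS && online[requested])
		cpu = requested;

	/* Fall back to the current cpu if the chosen one is offline. */
	if (!online[cpu])
		cpu = current;

	return cpu;
}
```

With no override the sketch returns cpu 0, which is why the DMI quirks
that forced an SMP-style reboot are no longer needed.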

Signed-off-by: Eric Biederman <[email protected]>
---

reboot.c | 82 +++++++++++++++++++--------------------------------------------
1 files changed, 26 insertions(+), 56 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/arch/i386/kernel/reboot.c linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/reboot.c
--- linux-2.6.11-rc1-mm1-nokexec-kexec-kexec-generic/arch/i386/kernel/reboot.c Fri Jan 7 12:53:43 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/reboot.c Tue Jan 18 22:58:00 2005
@@ -23,7 +23,6 @@
int reboot_thru_bios;

#ifdef CONFIG_SMP
-int reboot_smp = 0;
static int reboot_cpu = -1;
/* shamelessly grabbed from lib/vsprintf.c for readability */
#define is_digit(c) ((c) >= '0' && (c) <= '9')
@@ -46,7 +45,6 @@
break;
#ifdef CONFIG_SMP
case 's': /* "smp" reboot by executing reset on BSP or other CPU*/
- reboot_smp = 1;
if (is_digit(*(str+1))) {
reboot_cpu = (int) (*(str+1) - '0');
if (is_digit(*(str+2)))
@@ -85,33 +83,9 @@
return 0;
}

-/*
- * Some machines require the "reboot=s" commandline option, this quirk makes that automatic.
- */
-static int __init set_smp_reboot(struct dmi_system_id *d)
-{
-#ifdef CONFIG_SMP
- if (!reboot_smp) {
- reboot_smp = 1;
- printk(KERN_INFO "%s series board detected. Selecting SMP-method for reboots.\n", d->ident);
- }
-#endif
- return 0;
-}
-
-/*
- * Some machines require the "reboot=b,s" commandline option, this quirk makes that automatic.
- */
-static int __init set_smp_bios_reboot(struct dmi_system_id *d)
-{
- set_smp_reboot(d);
- set_bios_reboot(d);
- return 0;
-}
-
static struct dmi_system_id __initdata reboot_dmi_table[] = {
{ /* Handle problems with rebooting on Dell 1300's */
- .callback = set_smp_bios_reboot,
+ .callback = set_bios_reboot,
.ident = "Dell PowerEdge 1300",
.matches = {
DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
@@ -295,41 +269,32 @@
: "i" ((void *) (0x1000 - sizeof (real_mode_switch) - 100)));
}

-void machine_restart(char * __unused)
+void machine_shutdown(void)
{
#ifdef CONFIG_SMP
- int cpuid;
-
- cpuid = GET_APIC_ID(apic_read(APIC_ID));
-
- if (reboot_smp) {
-
- /* check to see if reboot_cpu is valid
- if its not, default to the BSP */
- if ((reboot_cpu == -1) ||
- (reboot_cpu > (NR_CPUS -1)) ||
- !physid_isset(cpuid, phys_cpu_present_map))
- reboot_cpu = boot_cpu_physical_apicid;
-
- reboot_smp = 0; /* use this as a flag to only go through this once*/
- /* re-run this function on the other CPUs
- it will fall though this section since we have
- cleared reboot_smp, and do the reboot if it is the
- correct CPU, otherwise it halts. */
- if (reboot_cpu != cpuid)
- smp_call_function((void *)machine_restart , NULL, 1, 0);
+ int reboot_cpu_id;
+
+ /* The boot cpu is always logical cpu 0 */
+ reboot_cpu_id = 0;
+
+ /* See if there has been given a command line override */
+ if ((reboot_cpu != -1) && (reboot_cpu < NR_CPUS) &&
+ cpu_isset(reboot_cpu, cpu_online_map)) {
+ reboot_cpu_id = reboot_cpu;
}

- /* if reboot_cpu is still -1, then we want a tradional reboot,
- and if we are not running on the reboot_cpu,, halt */
- if ((reboot_cpu != -1) && (cpuid != reboot_cpu)) {
- for (;;)
- __asm__ __volatile__ ("hlt");
+ /* Make certain the cpu I'm rebooting on is online */
+ if (!cpu_isset(reboot_cpu_id, cpu_online_map)) {
+ reboot_cpu_id = smp_processor_id();
}
- /*
- * Stop all CPUs and turn off local APICs and the IO-APIC, so
- * other OSs see a clean IRQ state.
+
+ /* Make certain I only run on the appropriate processor */
+ set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+
+ /* O.K. Now that I'm on the appropriate processor, stop
+ * all of the others, and disable their local APICs.
*/
+
smp_send_stop();
#endif /* CONFIG_SMP */

@@ -338,6 +303,11 @@
#ifdef CONFIG_X86_IO_APIC
disable_IO_APIC();
#endif
+}
+
+void machine_restart(char * __unused)
+{
+ machine_shutdown();

if (!reboot_thru_bios) {
if (efi_enabled) {

2005-01-19 09:08:32

by Eric W. Biederman

Subject: [PATCH 7/29] x86_64-apic-virtwire-on-shutdown


When coming out of apic mode attempt to set the appropriate
apic back into virtual wire mode. This improves on previous versions
of this patch by never setting both the local apic and the ioapic
into virtual wire mode.

This code looks at data from the mptable to see if an ioapic has
an ExtInt input to make this decision. A future improvement
is to figure out which apic or ioapic was in virtual wire mode
at boot time and to remember it. That is potentially a more accurate
method of selecting which apic to place in virtual wire mode.
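
As a rough illustration of the decision described above, the sketch
below computes the value the local apic LVT0 register ends up with.
The bit positions follow the documented local APIC LVT layout; the
macro and function names are made up for this example, and the sketch
deliberately leaves the read-only status bits alone, unlike the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local APIC LVT LINT0/LINT1 fields (names local to this sketch). */
#define LVT_DELIVERY_SHIFT 8
#define LVT_DELIVERY_MASK  (7u << LVT_DELIVERY_SHIFT)
#define LVT_POLARITY_LOW   (1u << 13)
#define LVT_LEVEL_TRIGGER  (1u << 15)
#define LVT_MASKED         (1u << 16)
#define DELIVERY_NMI       4u
#define DELIVERY_EXTINT    7u

/*
 * Return `value` reprogrammed as edge triggered, active high and
 * unmasked, using the given delivery mode. The vector bits are kept.
 */
static uint32_t lvt_set_mode(uint32_t value, uint32_t mode)
{
	assert(mode <= 7u);
	value &= ~(LVT_DELIVERY_MASK | LVT_POLARITY_LOW |
		   LVT_LEVEL_TRIGGER | LVT_MASKED);
	return value | (mode << LVT_DELIVERY_SHIFT);
}

/*
 * LVT0 on shutdown: if an ioapic already provides the ExtInt path,
 * mask LVT0 so only one of the two is in virtual wire mode.
 */
static uint32_t shutdown_lvt0(uint32_t old, bool ioapic_has_extint)
{
	if (ioapic_has_extint)
		return LVT_MASKED;
	return lvt_set_mode(old, DELIVERY_EXTINT);
}
```

LVT1 would be handled with lvt_set_mode(old, DELIVERY_NMI) in either
case.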


Signed-off-by: Eric Biederman <[email protected]>
---

arch/x86_64/kernel/apic.c | 38 +++++++++++++++++++++++++++++++++++++-
arch/x86_64/kernel/io_apic.c | 36 ++++++++++++++++++++++++++++++++++--
include/asm-x86_64/apic.h | 2 +-
3 files changed, 72 insertions(+), 4 deletions(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/arch/x86_64/kernel/apic.c linux-2.6.11-rc1-mm1-nokexec-x86_64-apic-virtwire-on-shutdown/arch/x86_64/kernel/apic.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/arch/x86_64/kernel/apic.c Fri Jan 14 04:28:33 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-apic-virtwire-on-shutdown/arch/x86_64/kernel/apic.c Tue Jan 18 22:45:16 2005
@@ -132,7 +132,7 @@
}
}

-void disconnect_bsp_APIC(void)
+void disconnect_bsp_APIC(int virt_wire_setup)
{
if (pic_mode) {
/*
@@ -144,6 +144,42 @@
apic_printk(APIC_QUIET, "disabling APIC mode, entering PIC mode.\n");
outb(0x70, 0x22);
outb(0x00, 0x23);
+ }
+ else {
+ /* Go back to Virtual Wire compatibility mode */
+ unsigned long value;
+
+ /* For the spurious interrupt use vector F, and enable it */
+ value = apic_read(APIC_SPIV);
+ value &= ~APIC_VECTOR_MASK;
+ value |= APIC_SPIV_APIC_ENABLED;
+ value |= 0xf;
+ apic_write_around(APIC_SPIV, value);
+
+ if (!virt_wire_setup) {
+ /* For LVT0 make it edge triggered, active high, external and enabled */
+ value = apic_read(APIC_LVT0);
+ value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING |
+ APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
+ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED );
+ value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
+ value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT);
+ apic_write_around(APIC_LVT0, value);
+ }
+ else {
+ /* Disable LVT0 */
+ apic_write_around(APIC_LVT0, APIC_LVT_MASKED);
+ }
+
+ /* For LVT1 make it edge triggered, active high, nmi and enabled */
+ value = apic_read(APIC_LVT1);
+ value &= ~(
+ APIC_MODE_MASK | APIC_SEND_PENDING |
+ APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR |
+ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED);
+ value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING;
+ value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI);
+ apic_write_around(APIC_LVT1, value);
}
}

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/arch/x86_64/kernel/io_apic.c linux-2.6.11-rc1-mm1-nokexec-x86_64-apic-virtwire-on-shutdown/arch/x86_64/kernel/io_apic.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/arch/x86_64/kernel/io_apic.c Fri Jan 14 04:32:23 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-apic-virtwire-on-shutdown/arch/x86_64/kernel/io_apic.c Tue Jan 18 22:45:17 2005
@@ -327,7 +327,7 @@
/*
* Find the pin to which IRQ[irq] (ISA) is connected
*/
-static int __init find_isa_irq_pin(int irq, int type)
+static int find_isa_irq_pin(int irq, int type)
{
int i;

@@ -1125,12 +1125,44 @@
*/
void disable_IO_APIC(void)
{
+ int pin;
/*
* Clear the IO-APIC before rebooting:
*/
clear_IO_APIC();

- disconnect_bsp_APIC();
+ /*
+ * If the i8259 is routed through an IOAPIC
+ * Put that IOAPIC in virtual wire mode
+ * so legacy interrupts can be delivered.
+ */
+ pin = find_isa_irq_pin(0, mp_ExtINT);
+ if (pin != -1) {
+ struct IO_APIC_route_entry entry;
+ unsigned long flags;
+
+ memset(&entry, 0, sizeof(entry));
+ entry.mask = 0; /* Enabled */
+ entry.trigger = 0; /* Edge */
+ entry.irr = 0;
+ entry.polarity = 0; /* High */
+ entry.delivery_status = 0;
+ entry.dest_mode = 0; /* Physical */
+ entry.delivery_mode = 7; /* ExtInt */
+ entry.vector = 0;
+ entry.dest.physical.physical_dest = 0;
+
+
+ /*
+ * Add it to the IO-APIC irq-routing table:
+ */
+ spin_lock_irqsave(&ioapic_lock, flags);
+ io_apic_write(0, 0x11+2*pin, *(((int *)&entry)+1));
+ io_apic_write(0, 0x10+2*pin, *(((int *)&entry)+0));
+ spin_unlock_irqrestore(&ioapic_lock, flags);
+ }
+
+ disconnect_bsp_APIC(pin != -1);
}

/*
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/include/asm-x86_64/apic.h linux-2.6.11-rc1-mm1-nokexec-x86_64-apic-virtwire-on-shutdown/include/asm-x86_64/apic.h
--- linux-2.6.11-rc1-mm1-nokexec-x86-apic-virtwire-on-shutdown/include/asm-x86_64/apic.h Fri Jan 7 12:54:16 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86_64-apic-virtwire-on-shutdown/include/asm-x86_64/apic.h Tue Jan 18 22:45:17 2005
@@ -77,7 +77,7 @@
extern int get_maxlvt (void);
extern void clear_local_APIC (void);
extern void connect_bsp_APIC (void);
-extern void disconnect_bsp_APIC (void);
+extern void disconnect_bsp_APIC (int virt_wire_setup);
extern void disable_local_APIC (void);
extern int verify_local_APIC (void);
extern void cache_APIC_registers (void);

2005-01-19 08:02:11

by Eric W. Biederman

Subject: [PATCH 16/29] x86-kexec


This is the i386 implementation of kexec.
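
The core of the implementation is the copy loop in relocate_kernel.S,
which walks a list of tagged page addresses. A user-space model of that
walk might look like the sketch below. The flag values 0x1, 0x2, 0x4
and 0x8 are taken from the assembly; the SK_* names, the use of
uintptr_t entries and the function name are assumptions made for the
example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SIZE   4096u
#define SK_DESTINATION 0x1u	/* entry sets the next destination page */
#define SK_INDIRECTION 0x2u	/* entry points at the next list page */
#define SK_DONE        0x4u	/* entry terminates the list */
#define SK_SOURCE      0x8u	/* entry names a page to copy */
#define SK_FLAGS       (SK_PAGE_SIZE - 1)

/*
 * Walk the list starting at `entry`, copying each source page to the
 * current destination, which advances by one page after every copy.
 */
static void relocate_pages(uintptr_t entry)
{
	const uintptr_t *list = NULL;
	unsigned char *dest = NULL;

	for (;;) {
		uintptr_t page = entry & ~(uintptr_t)SK_FLAGS;

		if (entry & SK_DESTINATION) {
			dest = (unsigned char *)page;
		} else if (entry & SK_INDIRECTION) {
			list = (const uintptr_t *)page;
		} else if (entry & SK_DONE) {
			return;
		} else if (entry & SK_SOURCE) {
			assert(dest != NULL);
			memcpy(dest, (const void *)page, SK_PAGE_SIZE);
			dest += SK_PAGE_SIZE;
		}
		/* Entries with no known flag are skipped. */
		assert(list != NULL);
		entry = *list++;
	}
}
```

The first entry passed in plays the role of image->head, so it is
expected to carry the indirection flag.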

Signed-off-by: Eric Biederman <[email protected]>
---

arch/i386/Kconfig | 17 ++
arch/i386/kernel/Makefile | 1
arch/i386/kernel/crash.c | 42 +++++++
arch/i386/kernel/entry.S | 2
arch/i386/kernel/machine_kexec.c | 220 +++++++++++++++++++++++++++++++++++++
arch/i386/kernel/relocate_kernel.S | 120 ++++++++++++++++++++
include/asm-i386/kexec.h | 28 ++++
7 files changed, 429 insertions(+), 1 deletion(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/Kconfig linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/Kconfig
--- linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/Kconfig Tue Jan 18 22:46:40 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/Kconfig Tue Jan 18 22:58:15 2005
@@ -901,6 +901,23 @@

Don't change this unless you know what you are doing.

+config KEXEC
+ bool "kexec system call (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ kexec is a system call that implements the ability to shut down your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shut down, so do not be surprised if this code does not
+ initially work for you. It may help to enable device hotplugging
+ support. As of this writing the exact hardware interface is
+ strongly in flux, so no good recommendation can be made.
+
endmenu


diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/Makefile linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/Makefile
--- linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/Makefile Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/Makefile Tue Jan 18 22:58:15 2005
@@ -24,6 +24,7 @@
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o crash.o
obj-$(CONFIG_X86_NUMAQ) += numaq.o
obj-$(CONFIG_X86_SUMMIT_NUMA) += summit.o
obj-$(CONFIG_KPROBES) += kprobes.o
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/crash.c linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/crash.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/crash.c Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/crash.c Tue Jan 18 22:58:15 2005
@@ -0,0 +1,42 @@
+/*
+ * Architecture specific (i386) functions for kexec based crash dumps.
+ *
+ * Created by: Hariprasad Nellitheertha ([email protected])
+ *
+ * Copyright (C) IBM Corporation, 2004. All rights reserved.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/smp.h>
+#include <linux/irq.h>
+#include <linux/reboot.h>
+#include <linux/kexec.h>
+#include <linux/irq.h>
+#include <linux/delay.h>
+#include <linux/elf.h>
+#include <linux/elfcore.h>
+
+#include <asm/processor.h>
+#include <asm/hardirq.h>
+#include <asm/nmi.h>
+#include <asm/hw_irq.h>
+
+#define MAX_NOTE_BYTES 1024
+typedef u32 note_buf_t[MAX_NOTE_BYTES/4];
+
+note_buf_t crash_notes[NR_CPUS];
+
+void machine_crash_shutdown(void)
+{
+ /* This function is only called after the system
+ * has panicked or is otherwise in a critical state.
+ * The minimum amount of code to allow a kexec'd kernel
+ * to run successfully needs to happen here.
+ *
+ * In practice this means shooting down the other cpus in
+ * an SMP system.
+ */
+}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/entry.S linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/entry.S
--- linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/entry.S Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/entry.S Tue Jan 18 22:58:15 2005
@@ -911,7 +911,7 @@
.long sys_mq_timedreceive /* 280 */
.long sys_mq_notify
.long sys_mq_getsetattr
- .long sys_ni_syscall /* reserved for kexec */
+ .long sys_kexec_load
.long sys_waitid
.long sys_ni_syscall /* 285 */ /* available */
.long sys_add_key
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/machine_kexec.c linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/machine_kexec.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/machine_kexec.c Tue Jan 18 22:58:15 2005
@@ -0,0 +1,220 @@
+/*
+ * machine_kexec.c - handle transition of Linux booting another kernel
+ * Copyright (C) 2002-2005 Eric Biederman <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/mm.h>
+#include <linux/kexec.h>
+#include <linux/delay.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/mmu_context.h>
+#include <asm/io.h>
+#include <asm/apic.h>
+#include <asm/cpufeature.h>
+
+static inline unsigned long read_cr3(void)
+{
+ unsigned long cr3;
+ asm volatile("movl %%cr3,%0": "=r"(cr3));
+ return cr3;
+}
+
+#define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
+
+#define L0_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define L1_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define L2_ATTR (_PAGE_PRESENT)
+
+#define LEVEL0_SIZE (1UL << 12UL)
+
+#ifndef CONFIG_X86_PAE
+#define LEVEL1_SIZE (1UL << 22UL)
+static u32 pgtable_level1[1024] PAGE_ALIGNED;
+
+static void identity_map_page(unsigned long address)
+{
+ unsigned long level1_index, level2_index;
+ u32 *pgtable_level2;
+
+ /* Find the current page table */
+ pgtable_level2 = __va(read_cr3());
+
+ /* Find the indexes of the physical address to identity map */
+ level1_index = (address % LEVEL1_SIZE)/LEVEL0_SIZE;
+ level2_index = address / LEVEL1_SIZE;
+
+ /* Identity map the page table entry */
+ pgtable_level1[level1_index] = address | L0_ATTR;
+ pgtable_level2[level2_index] = __pa(pgtable_level1) | L1_ATTR;
+
+ /* Flush the tlb so the new mapping takes effect.
+ * Global tlb entries are not flushed but that is not an issue.
+ */
+ load_cr3(pgtable_level2);
+}
+
+#else
+#define LEVEL1_SIZE (1UL << 21UL)
+#define LEVEL2_SIZE (1UL << 30UL)
+static u64 pgtable_level1[512] PAGE_ALIGNED;
+static u64 pgtable_level2[512] PAGE_ALIGNED;
+
+static void identity_map_page(unsigned long address)
+{
+ unsigned long level1_index, level2_index, level3_index;
+ u64 *pgtable_level3;
+
+ /* Find the current page table */
+ pgtable_level3 = __va(read_cr3());
+
+ /* Find the indexes of the physical address to identity map */
+ level1_index = (address % LEVEL1_SIZE)/LEVEL0_SIZE;
+ level2_index = (address % LEVEL2_SIZE)/LEVEL1_SIZE;
+ level3_index = address / LEVEL2_SIZE;
+
+ /* Identity map the page table entry */
+ pgtable_level1[level1_index] = address | L0_ATTR;
+ pgtable_level2[level2_index] = __pa(pgtable_level1) | L1_ATTR;
+ set_64bit(&pgtable_level3[level3_index], __pa(pgtable_level2) | L2_ATTR);
+
+ /* Flush the tlb so the new mapping takes effect.
+ * Global tlb entries are not flushed but that is not an issue.
+ */
+ load_cr3(pgtable_level3);
+}
+#endif
+
+
+static void set_idt(void *newidt, __u16 limit)
+{
+ unsigned char curidt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curidt)) = limit;
+ (*(__u32 *)(curidt +2)) = (unsigned long)(newidt);
+
+ __asm__ __volatile__ (
+ "lidt %0\n"
+ : : "m" (curidt)
+ );
+};
+
+
+static void set_gdt(void *newgdt, __u16 limit)
+{
+ unsigned char curgdt[6];
+
+ /* ia32 supports unaligned loads & stores */
+ (*(__u16 *)(curgdt)) = limit;
+ (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt);
+
+ __asm__ __volatile__ (
+ "lgdt %0\n"
+ : : "m" (curgdt)
+ );
+};
+
+static void load_segments(void)
+{
+#define __STR(X) #X
+#define STR(X) __STR(X)
+
+ __asm__ __volatile__ (
+ "\tljmp $"STR(__KERNEL_CS)",$1f\n"
+ "\t1:\n"
+ "\tmovl $"STR(__KERNEL_DS)",%eax\n"
+ "\tmovl %eax,%ds\n"
+ "\tmovl %eax,%es\n"
+ "\tmovl %eax,%fs\n"
+ "\tmovl %eax,%gs\n"
+ "\tmovl %eax,%ss\n"
+ );
+#undef STR
+#undef __STR
+}
+
+typedef asmlinkage NORET_TYPE void (*relocate_new_kernel_t)(
+ unsigned long indirection_page, unsigned long reboot_code_buffer,
+ unsigned long start_address, unsigned int has_pae) ATTRIB_NORET;
+
+const extern unsigned char relocate_new_kernel[];
+extern void relocate_new_kernel_end(void);
+const extern unsigned int relocate_new_kernel_size;
+
+/*
+ * An architecture hook called to validate the
+ * proposed image and prepare the control pages
+ * as needed. The pages for KEXEC_CONTROL_CODE_SIZE
+ * have been allocated, but the segments have not yet
+ * been copied into the kernel.
+ *
+ * Do whatever setup is needed on image and the
+ * reboot code buffer to allow us to avoid allocations
+ * later.
+ *
+ * Currently nothing.
+ */
+int machine_kexec_prepare(struct kimage *image)
+{
+ return 0;
+}
+
+/*
+ * Undo anything leftover by machine_kexec_prepare
+ * when an image is freed.
+ */
+void machine_kexec_cleanup(struct kimage *image)
+{
+}
+
+/*
+ * Do not allocate memory (or fail in any way) in machine_kexec().
+ * We are past the point of no return, committed to rebooting now.
+ */
+NORET_TYPE void machine_kexec(struct kimage *image)
+{
+ unsigned long page_list;
+ unsigned long reboot_code_buffer;
+ relocate_new_kernel_t rnk;
+
+ /* Interrupts aren't acceptable while we reboot */
+ local_irq_disable();
+
+ /* Compute some offsets */
+ reboot_code_buffer = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
+ page_list = image->head;
+
+ /* Set up an identity mapping for the reboot_code_buffer */
+ identity_map_page(reboot_code_buffer);
+
+ /* copy it out */
+ memcpy((void *)reboot_code_buffer, relocate_new_kernel, relocate_new_kernel_size);
+
+ /* The segment registers are funny things, they are
+ * automatically loaded from a table, in memory whenever you
+ * set them to a specific selector, but this table is never
+ * accessed again until you set the segment to a different selector.
+ *
+ * The more common model is a cache where the behind
+ * the scenes work is done, but is also dropped at arbitrary
+ * times.
+ *
+ * I take advantage of this here by force loading the
+ * segments, before I zap the gdt with an invalid value.
+ */
+ load_segments();
+ /* The gdt & idt are now invalid.
+ * If you want to load them you must set up your own idt & gdt.
+ */
+ set_gdt(phys_to_virt(0),0);
+ set_idt(phys_to_virt(0),0);
+
+ /* now call it */
+ rnk = (relocate_new_kernel_t) reboot_code_buffer;
+ (*rnk)(page_list, reboot_code_buffer, image->start, cpu_has_pae);
+}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/relocate_kernel.S linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/relocate_kernel.S
--- linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/relocate_kernel.S Tue Jan 18 22:58:15 2005
@@ -0,0 +1,120 @@
+/*
+ * relocate_kernel.S - put the kernel image in place to boot
+ * Copyright (C) 2002-2004 Eric Biederman <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/linkage.h>
+
+ /*
+ * Must be relocatable PIC code callable as a C function, that once
+ * it starts can not use the previous process's stack.
+ */
+ .globl relocate_new_kernel
+relocate_new_kernel:
+ /* read the arguments and say goodbye to the stack */
+ movl 4(%esp), %ebx /* page_list */
+ movl 8(%esp), %ebp /* reboot_code_buffer */
+ movl 12(%esp), %edx /* start address */
+ movl 16(%esp), %ecx /* cpu_has_pae */
+
+ /* zero out flags, and disable interrupts */
+ pushl $0
+ popfl
+
+ /* set a new stack at the bottom of our page... */
+ lea 4096(%ebp), %esp
+
+ /* store the parameters back on the stack */
+ pushl %edx /* store the start address */
+
+ /* Set cr0 to a known state:
+ * 31 0 == Paging disabled
+ * 18 0 == Alignment check disabled
+ * 16 0 == Write protect disabled
+ * 3 0 == No task switch
+ * 2 0 == Don't do FP software emulation.
+ * 0 1 == Protected mode enabled
+ */
+ movl %cr0, %eax
+ andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax
+ orl $(1<<0), %eax
+ movl %eax, %cr0
+
+ /* clear cr4 if applicable */
+ testl %ecx, %ecx
+ jz 1f
+ /* Set cr4 to a known state:
+ * Setting everything to zero seems safe.
+ */
+ movl %cr4, %eax
+ andl $0, %eax
+ movl %eax, %cr4
+
+ jmp 1f
+1:
+
+ /* Flush the TLB (needed?) */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* Do the copies */
+ movl %ebx, %ecx
+ jmp 1f
+
+0: /* top, read another word from the indirection page */
+ movl (%ebx), %ecx
+ addl $4, %ebx
+1:
+ testl $0x1, %ecx /* is it a destination page */
+ jz 2f
+ movl %ecx, %edi
+ andl $0xfffff000, %edi
+ jmp 0b
+2:
+ testl $0x2, %ecx /* is it an indirection page */
+ jz 2f
+ movl %ecx, %ebx
+ andl $0xfffff000, %ebx
+ jmp 0b
+2:
+ testl $0x4, %ecx /* is it the done indicator */
+ jz 2f
+ jmp 3f
+2:
+ testl $0x8, %ecx /* is it the source indicator */
+ jz 0b /* Ignore it otherwise */
+ movl %ecx, %esi /* For every source page do a copy */
+ andl $0xfffff000, %esi
+
+ movl $1024, %ecx
+ rep ; movsl
+ jmp 0b
+
+3:
+
+ /* To be certain of avoiding problems with self-modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ ret
+relocate_new_kernel_end:
+
+ .globl relocate_new_kernel_size
+relocate_new_kernel_size:
+ .long relocate_new_kernel_end - relocate_new_kernel
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/include/asm-i386/kexec.h linux-2.6.11-rc1-mm1-nokexec-x86-kexec/include/asm-i386/kexec.h
--- linux-2.6.11-rc1-mm1-nokexec-x86-machine_shutdown/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-x86-kexec/include/asm-i386/kexec.h Tue Jan 18 22:58:15 2005
@@ -0,0 +1,28 @@
+#ifndef _I386_KEXEC_H
+#define _I386_KEXEC_H
+
+#include <asm/fixmap.h>
+
+/*
+ * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
+ * I.e. Maximum page that is mapped directly into kernel memory,
+ * and kmap is not required.
+ *
+ * Someone correct me if FIXADDR_START - PAGEOFFSET is not the correct
+ * calculation for the amount of memory directly mappable into the
+ * kernel memory space.
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+/* Maximum address we can use for the control code buffer */
+#define KEXEC_CONTROL_MEMORY_LIMIT TASK_SIZE
+
+#define KEXEC_CONTROL_CODE_SIZE 4096
+
+/* The native architecture */
+#define KEXEC_ARCH KEXEC_ARCH_386
+
+#endif /* _I386_KEXEC_H */

2005-01-19 09:12:11

by Eric W. Biederman

Subject: [PATCH 17/29] x86-crashkernel


This is the x86 implementation of the crashkernel option. It reserves
a window of memory very early in the bootup process, so we never use
it for anything but the kernel to switch to when the running
kernel panics.

In addition to reserving this memory a resource structure is registered
so looking at /proc/iomem it is clear what happened to that memory.
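
For illustration, the crashkernel=size@addr parsing can be modelled in
user space along these lines. The helper sk_memparse is only a stand-in
for the kernel's memparse, and struct sk_range stands in for the start
and end fields of crashk_res; both names are assumptions made for the
example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sk_range {
	unsigned long long start;
	unsigned long long end;		/* inclusive, like a struct resource */
};

/* Stand-in for memparse(): a number with an optional K, M or G suffix. */
static unsigned long long sk_memparse(const char *s, char **retptr)
{
	unsigned long long v = strtoull(s, retptr, 0);

	switch (**retptr) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*retptr)++;
	}
	return v;
}

/* Parse "crashkernel=size@addr"; return 1 and fill *res, else return 0. */
static int parse_crashkernel(const char *arg, struct sk_range *res)
{
	char *p;
	unsigned long long size, base;

	assert(arg != NULL && res != NULL);
	if (strncmp(arg, "crashkernel=", 12) != 0)
		return 0;
	size = sk_memparse(arg + 12, &p);
	if (*p != '@')
		return 0;	/* no base address given: nothing is reserved */
	base = sk_memparse(p + 1, &p);
	res->start = base;
	res->end = base + size - 1;
	return 1;
}
```

Like the patch, the sketch does not validate that the range lies in
usable RAM.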

ISSUES:
Is it possible to implement this in an architecture generic way?
What should be done with architectures that always use an iommu and
thus don't report their RAM memory resources in /proc/iomem?

Signed-off-by: Eric Biederman <[email protected]>
---

kernel/efi.c | 4 ++++
kernel/setup.c | 30 ++++++++++++++++++++++++++++++
mm/discontig.c | 6 ++++++
3 files changed, 40 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/efi.c linux-2.6.11-rc1-mm1-nokexec-x86-crashkernel/arch/i386/kernel/efi.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/efi.c Fri Jan 14 04:32:22 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-crashkernel/arch/i386/kernel/efi.c Tue Jan 18 22:58:33 2005
@@ -30,6 +30,7 @@
#include <linux/ioport.h>
#include <linux/module.h>
#include <linux/efi.h>
+#include <linux/kexec.h>

#include <asm/setup.h>
#include <asm/io.h>
@@ -596,6 +597,9 @@
if (md->type == EFI_CONVENTIONAL_MEMORY) {
request_resource(res, code_resource);
request_resource(res, data_resource);
+#ifdef CONFIG_KEXEC
+ request_resource(res, &crashk_res);
+#endif
}
}
}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/setup.c linux-2.6.11-rc1-mm1-nokexec-x86-crashkernel/arch/i386/kernel/setup.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/kernel/setup.c Tue Jan 18 23:11:43 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-crashkernel/arch/i386/kernel/setup.c Tue Jan 18 22:58:33 2005
@@ -40,6 +40,7 @@
#include <linux/efi.h>
#include <linux/init.h>
#include <linux/edd.h>
+#include <linux/kexec.h>
#include <video/edid.h>
#include <asm/apic.h>
#include <asm/e820.h>
@@ -825,6 +826,27 @@
lapic_disable();
#endif /* CONFIG_X86_LOCAL_APIC */

+#ifdef CONFIG_KEXEC
+ /* crashkernel=size@addr specifies the location to reserve for
+ * a crash kernel. By reserving this memory we guarantee
+ * that linux never sets it up as a DMA target.
+ * Useful for holding code to do something appropriate
+ * after a kernel panic.
+ */
+ else if (!memcmp(from, "crashkernel=", 12)) {
+ unsigned long size, base;
+ size = memparse(from+12, &from);
+ if (*from == '@') {
+ base = memparse(from+1, &from);
+ /* FIXME: Do I want a sanity check
+ * to validate the memory range?
+ */
+ crashk_res.start = base;
+ crashk_res.end = base + size - 1;
+ }
+ }
+#endif
+
/*
* highmem=size forces highmem to be exactly 'size' bytes.
* This works even on boxes that have no highmem otherwise.
@@ -1128,6 +1150,11 @@
}
}
#endif
+#ifdef CONFIG_KEXEC
+ if (crashk_res.start != crashk_res.end) {
+ reserve_bootmem(crashk_res.start, crashk_res.end - crashk_res.start + 1);
+ }
+#endif
return max_low_pfn;
}
#else
@@ -1167,6 +1194,9 @@
*/
request_resource(res, code_resource);
request_resource(res, data_resource);
+#ifdef CONFIG_KEXEC
+ request_resource(res, &crashk_res);
+#endif
}
}
}
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/mm/discontig.c linux-2.6.11-rc1-mm1-nokexec-x86-crashkernel/arch/i386/mm/discontig.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-kexec/arch/i386/mm/discontig.c Fri Jan 14 04:28:30 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-crashkernel/arch/i386/mm/discontig.c Tue Jan 18 23:03:02 2005
@@ -29,6 +29,7 @@
#include <linux/highmem.h>
#include <linux/initrd.h>
#include <linux/nodemask.h>
+#include <linux/kexec.h>
#include <asm/e820.h>
#include <asm/setup.h>
#include <asm/mmzone.h>
@@ -365,6 +366,11 @@
system_max_low_pfn << PAGE_SHIFT);
initrd_start = 0;
}
+ }
+#endif
+#ifdef CONFIG_KEXEC
+ if (crashk_res.start != crashk_res.end) {
+ reserve_bootmem(crashk_res.start, crashk_res.end - crashk_res.start + 1);
}
#endif
return system_max_low_pfn;

2005-01-19 08:02:10

by Eric W. Biederman

Subject: [PATCH 8/29] vmlinux-fix-physical-addrs


In vmlinux.lds.h the code is careful to define every section so vmlinux
properly reports the correct physical load address of code, as well as
its virtual address.

The new SECURITY_INIT definition fails to follow that convention and
causes an incorrect physical address to appear in the vmlinux if
there are any security initcalls.

This patch updates the SECURITY_INIT to follow the convention in the rest of the
file.
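
As a small worked example of the convention, the arithmetic behind
AT(ADDR(section) - LOAD_OFFSET) is sketched below. The 0xC0000000 value
is only an assumed example for a kernel linked at 3 GiB, and the
function name is made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed example value: a kernel linked at 3 GiB. */
#define SK_LOAD_OFFSET 0xC0000000u

/* Physical load address recorded for a section linked at `vaddr`. */
static uint32_t section_load_addr(uint32_t vaddr)
{
	assert(vaddr >= SK_LOAD_OFFSET);
	return vaddr - SK_LOAD_OFFSET;
}
```

Under that assumption a section linked at 0xC0100000 is reported with a
physical load address of 0x00100000.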

Signed-off-by: Eric Biederman <[email protected]>
---

vmlinux.lds.h | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86_64-apic-virtwire-on-shutdown/include/asm-generic/vmlinux.lds.h linux-2.6.11-rc1-mm1-nokexec-vmlinux-fix-physical-addrs/include/asm-generic/vmlinux.lds.h
--- linux-2.6.11-rc1-mm1-nokexec-x86_64-apic-virtwire-on-shutdown/include/asm-generic/vmlinux.lds.h Fri Jan 7 12:54:13 2005
+++ linux-2.6.11-rc1-mm1-nokexec-vmlinux-fix-physical-addrs/include/asm-generic/vmlinux.lds.h Tue Jan 18 22:45:34 2005
@@ -73,7 +73,7 @@
}

#define SECURITY_INIT \
- .security_initcall.init : { \
+ .security_initcall.init : AT(ADDR(.security_initcall.init) - LOAD_OFFSET) { \
VMLINUX_SYMBOL(__security_initcall_start) = .; \
*(.security_initcall.init) \
VMLINUX_SYMBOL(__security_initcall_end) = .; \

2005-01-19 07:57:33

by Eric W. Biederman

Subject: [PATCH 22/29] x86-crash_shutdown-nmi-shootdown


One of the dangers when switching from one kernel to another
is what happens to all of the other cpus that were running
in the crashed kernel. In an attempt to avoid that problem
this patch adds an NMI handler and attempts to shoot down
the other cpus by sending them non-maskable interrupts.

The code then waits for 1 second or until all known cpus
have stopped running and then jumps from the running kernel
that has crashed to the kernel in reserved memory.

The kernel spin loop is used for the delay as that should
continue to be safe even after a crash.

Signed-off-by: Eric Biederman <[email protected]>
---

crash.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 56 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/i386/kernel/crash.c linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-nmi-shootdown/arch/i386/kernel/crash.c
--- linux-2.6.11-rc1-mm1-nokexec-kexec-ppc-support/arch/i386/kernel/crash.c Tue Jan 18 22:58:15 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-nmi-shootdown/arch/i386/kernel/crash.c Tue Jan 18 23:15:17 2005
@@ -23,12 +23,65 @@
#include <asm/hardirq.h>
#include <asm/nmi.h>
#include <asm/hw_irq.h>
+#include <mach_ipi.h>

#define MAX_NOTE_BYTES 1024
typedef u32 note_buf_t[MAX_NOTE_BYTES/4];

note_buf_t crash_notes[NR_CPUS];

+#ifdef CONFIG_SMP
+static atomic_t waiting_for_crash_ipi;
+
+static int crash_nmi_callback(struct pt_regs *regs, int cpu)
+{
+ local_irq_disable();
+ atomic_dec(&waiting_for_crash_ipi);
+ /* Assume hlt works */
+ __asm__("hlt");
+ for(;;);
+ return 1;
+}
+
+/*
+ * By using the NMI code instead of a vector we just sneak thru the
+ * word generator coming out with just what we want. AND it does
+ * not matter if clustered_apic_mode is set or not.
+ */
+static void smp_send_nmi_allbutself(void)
+{
+ send_IPI_allbutself(APIC_DM_NMI);
+}
+
+static void nmi_shootdown_cpus(void)
+{
+ unsigned long msecs;
+ atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
+
+ /* Would it be better to replace the trap vector here? */
+ set_nmi_callback(crash_nmi_callback);
+ /* Ensure the new callback function is set before sending
+ * out the NMI
+ */
+ wmb();
+
+ smp_send_nmi_allbutself();
+
+ msecs = 1000; /* Wait at most a second for the other cpus to stop */
+ while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
+ mdelay(1);
+ msecs--;
+ }
+
+ /* Leave the nmi callback set */
+}
+#else
+static void nmi_shootdown_cpus(void)
+{
+ /* There are no cpus to shootdown */
+}
+#endif
+
void machine_crash_shutdown(void)
{
/* This function is only called after the system
@@ -39,4 +92,7 @@
* In practice this means shooting down the other cpus in
* an SMP system.
*/
+ /* The kernel is broken so disable interrupts */
+ local_irq_disable();
+ nmi_shootdown_cpus();
}
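
The wait logic in nmi_shootdown_cpus() can be tried out in user space. The sketch below is a simulation, not kernel code: C11 atomics stand in for atomic_t, and the delay callbacks are hypothetical replacements for mdelay(1) that also decide whether a simulated CPU checks in during that millisecond.

```c
#include <assert.h>
#include <stdatomic.h>

/* Number of other CPUs that have not yet acknowledged the NMI. */
static atomic_int waiting_for_crash_ipi;

/* Hypothetical stand-in for mdelay(1): one simulated CPU checks in per tick,
 * doing what crash_nmi_callback() does with atomic_dec(). */
static void tick_one_cpu_acks(void)
{
	if (atomic_load(&waiting_for_crash_ipi) > 0)
		atomic_fetch_sub(&waiting_for_crash_ipi, 1);
}

/* Hypothetical stand-in for mdelay(1) when every other CPU is wedged. */
static void tick_nobody_acks(void)
{
}

/* Wait until all other CPUs have checked in or the budget runs out.
 * Returns the milliseconds left; 0 means the budget was exhausted. */
static unsigned long wait_for_other_cpus(int others, unsigned long msecs,
					 void (*delay_1ms)(void))
{
	assert(others >= 0 && delay_1ms != 0);
	atomic_store(&waiting_for_crash_ipi, others);
	while (atomic_load(&waiting_for_crash_ipi) > 0 && msecs) {
		delay_1ms();
		msecs--;
	}
	return msecs;
}
```

With three other CPUs that each check in within a millisecond, wait_for_other_cpus(3, 1000, tick_one_cpu_acks) returns 997. With tick_nobody_acks it returns 0 and the counter stays at 3, which corresponds to the case where the crashing kernel gives up after a second and jumps anyway.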

2005-01-19 07:57:32

by Eric W. Biederman

Subject: [PATCH 25/29] crashdump-documentation


I have addressed the worst of the documentation changes
that come about from the current refactoring.

From: Hariprasad Nellitheertha <[email protected]>

This patch contains the documentation for the kexec based crash dump tool.

Signed off by Hariprasad Nellitheertha <[email protected]>

Signed-off-by: Eric Biederman <[email protected]>
---

00-INDEX | 2 +
kdump.txt | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-apic-shutdown/Documentation/00-INDEX linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/Documentation/00-INDEX
--- linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-apic-shutdown/Documentation/00-INDEX Fri Jan 14 04:28:28 2005
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/Documentation/00-INDEX Tue Jan 18 23:16:08 2005
@@ -140,6 +140,8 @@
- info on the in-kernel binary support for Java(tm).
kbuild/
- directory with info about the kernel build process.
+kdump.txt
+ - mini HowTo on getting the crash dump code to work.
kernel-doc-nano-HOWTO.txt
- mini HowTo on generation and location of kernel documentation files.
kernel-docs.txt
diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-apic-shutdown/Documentation/kdump.txt linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/Documentation/kdump.txt
--- linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-apic-shutdown/Documentation/kdump.txt Wed Dec 31 17:00:00 1969
+++ linux-2.6.11-rc1-mm1-nokexec-crashdump-documentation/Documentation/kdump.txt Tue Jan 18 23:16:08 2005
@@ -0,0 +1,98 @@
+Documentation for kdump - the kexec based crash dumping solution
+================================================================
+
+DESIGN
+======
+
+We use kexec to reboot to a second kernel whenever a dump needs to be taken.
+This second kernel is booted with very little memory (configurable
+at compile time). The first kernel reserves the section of memory that the
+second kernel uses. This ensures that on-going DMA from the first kernel
+does not corrupt the second kernel. The first 640k of physical memory is
+needed irrespective of where the kernel loads at. Hence, this region is
+backed up before reboot.
+
+In the second kernel, "old memory" can be accessed in two ways. The
+first one is through a device interface. We can create a /dev/oldmem or
+whatever and write out the memory in raw format. The second interface is
+through /proc/vmcore. This exports the dump as an ELF format file which
+can be written out using any file copy command (cp, scp, etc). Further, gdb
+can be used to perform some minimal debugging on the dump file. Both these
+methods ensure that there is correct ordering of the dump pages (corresponding
+to the first 640k that has been relocated).
+
+SETUP
+=====
+
+1) Obtain the appropriate -mm tree patch and apply it on to the vanilla
+ kernel tree.
+
+2) Two kernels need to be built in order to get this feature working.
+
+ For the first kernel, choose the default values for the following options.
+
+ a) Physical address where the kernel is loaded
+ b) kexec system call
+ c) kernel crash dumps
+
+ All the options are under "Processor type and features"
+
+ For the second kernel, change (a) to 16MB. If you want to choose another
+ value here, ensure "location from where the crash dumping kernel will boot
+ (MB)" under (c) reflects the same value.
+
+ Also ensure you have CONFIG_HIGHMEM on.
+
+3) Boot into the first kernel. You are now ready to try out kexec based crash
+ dumps.
+
+4) Load the second kernel to be booted using
+
+ kexec -p <second-kernel> --args-linux --append="root=<root-dev> dump
+ init 1 memmap=exactmap memmap=640k@0 memmap=32M@16M"
+
+ Note that <second-kernel> has to be a vmlinux image. bzImage will not
+ work, as of now.
+
+5) System reboots into the second kernel when a panic occurs.
+ You could write a module to call panic, for testing purposes.
+
+6) Write out the dump file using
+
+ cp /proc/vmcore <dump-file>
+
+You can also access the dump as a device for a linear/raw view. To do this,
+you will need the kd-oldmem-<version>.patch built into the kernel. To create
+the device, type
+
+ mknod /dev/oldmem c 1 12
+
+Use "dd" with suitable options for count, bs and skip to access specific
+portions of the dump.
+
+ANALYSIS
+========
+
+You can run gdb on the dump file copied out of /proc/vmcore. Use vmlinux built
+with -g and run
+
+ gdb vmlinux <dump-file>
+
+Stack trace for the task on processor 0, register display, memory display
+work fine.
+
+TODO
+====
+
+1) Provide a kernel-pages only view for the dump. This could possibly turn up
+ as /proc/vmcore-kern.
+2) Provide register contents of all processors (similar to what multi-threaded
+ core dumps do).
+3) Modify "crash" to make it recognize this dump.
+4) Make the i386 kernel boot from any location so we can run the second kernel
+ from the reserved location instead of the current approach.
+
+CONTACT
+=======
+
+Hariprasad Nellitheertha - hari at in dot ibm dot com

2005-01-19 07:54:56

by Eric W. Biederman

Subject: [PATCH 24/29] x86-crash_shutdown-apic-shutdown



This patch should not make it to the stable kernel without being
reviewed a lot more. It is unclear how much a hardened kernel can
take when it comes to misconfigured apics. So since a normal kernel
has problems this patch does a clean shutdown.

It is my expectation this patch will be dropped from future
generations of the kexec work. But for the moment it is a
crutch to keep from breaking everything.


Signed-off-by: Eric Biederman <[email protected]>
---

crash.c | 7 +++++++
1 files changed, 7 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-snapshot-registers/arch/i386/kernel/crash.c linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-apic-shutdown/arch/i386/kernel/crash.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-snapshot-registers/arch/i386/kernel/crash.c Tue Jan 18 23:15:34 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-apic-shutdown/arch/i386/kernel/crash.c Tue Jan 18 23:15:50 2005
@@ -23,6 +23,7 @@
#include <asm/hardirq.h>
#include <asm/nmi.h>
#include <asm/hw_irq.h>
+#include <asm/apic.h>
#include <mach_ipi.h>

#define MAX_NOTE_BYTES 1024
@@ -115,6 +116,7 @@
{
local_irq_disable();
crash_save_this_cpu(regs, cpu);
+ disable_local_APIC();
atomic_dec(&waiting_for_crash_ipi);
/* Assume hlt works */
__asm__("hlt");
@@ -153,6 +155,7 @@
}

/* Leave the nmi callback set */
+ disable_local_APIC();
}
#else
static void nmi_shootdown_cpus(void)
@@ -174,5 +177,9 @@
/* The kernel is broken so disable interrupts */
local_irq_disable();
nmi_shootdown_cpus();
+ lapic_shutdown();
+#if defined(CONFIG_X86_IO_APIC)
+ disable_IO_APIC();
+#endif
crash_save_self();
}

2005-01-19 07:54:55

by Eric W. Biederman

Subject: [PATCH 23/29] x86-crash_shutdown-snapshot-registers


After the kernel panics if we wish to generate an entire machine core
file it is very nice to know the register state at the time the
machine crashed.

After long discussion it was realized that if you are going to be
saving the information anyway it is reasonable to store the
information in a format in which it will be used and recognized,
so the register state is stored in the standard ELF note format.

Signed-off-by: Eric Biederman <[email protected]>
---

crash.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 80 insertions(+)

diff -uNr linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-nmi-shootdown/arch/i386/kernel/crash.c linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-snapshot-registers/arch/i386/kernel/crash.c
--- linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-nmi-shootdown/arch/i386/kernel/crash.c Tue Jan 18 23:15:17 2005
+++ linux-2.6.11-rc1-mm1-nokexec-x86-crash_shutdown-snapshot-registers/arch/i386/kernel/crash.c Tue Jan 18 23:15:34 2005
@@ -30,12 +30,91 @@

note_buf_t crash_notes[NR_CPUS];

+static u32 *append_elf_note(u32 *buf,
+ char *name, unsigned type, void *data, size_t data_len)
+{
+ struct elf_note note;
+ note.n_namesz = strlen(name) + 1;
+ note.n_descsz = data_len;
+ note.n_type = type;
+ memcpy(buf, &note, sizeof(note));
+ buf += (sizeof(note) +3)/4;
+ memcpy(buf, name, note.n_namesz);
+ buf += (note.n_namesz + 3)/4;
+ memcpy(buf, data, note.n_descsz);
+ buf += (note.n_descsz + 3)/4;
+ return buf;
+}
+
+static void final_note(u32 *buf)
+{
+ struct elf_note note;
+ note.n_namesz = 0;
+ note.n_descsz = 0;
+ note.n_type = 0;
+ memcpy(buf, &note, sizeof(note));
+}
+
+
+static void crash_save_this_cpu(struct pt_regs *regs, int cpu)
+{
+ struct elf_prstatus prstatus;
+ u32 *buf;
+ if ((cpu < 0) || (cpu >= NR_CPUS)) {
+ return;
+ }
+ /* Using ELF notes here is opportunistic.
+ * I need a well defined structure format
+ * for the data I pass, and I need tags
+ * on the data to indicate what information I have
+ * squirrelled away. ELF notes happen to provide
+ * all of that, so there is no need to invent something new.
+ */
+ buf = &crash_notes[cpu][0];
+ memset(&prstatus, 0, sizeof(prstatus));
+ prstatus.pr_pid = current->pid;
+ elf_core_copy_regs(&prstatus.pr_reg, regs);
+ buf = append_elf_note(buf, "CORE", NT_PRSTATUS,
+ &prstatus, sizeof(prstatus));
+
+ final_note(buf);
+}
+
+static void crash_get_current_regs(struct pt_regs *regs)
+{
+ __asm__ __volatile__("movl %%ebx,%0" : "=m"(regs->ebx));
+ __asm__ __volatile__("movl %%ecx,%0" : "=m"(regs->ecx));
+ __asm__ __volatile__("movl %%edx,%0" : "=m"(regs->edx));
+ __asm__ __volatile__("movl %%esi,%0" : "=m"(regs->esi));
+ __asm__ __volatile__("movl %%edi,%0" : "=m"(regs->edi));
+ __asm__ __volatile__("movl %%ebp,%0" : "=m"(regs->ebp));
+ __asm__ __volatile__("movl %%eax,%0" : "=m"(regs->eax));
+ __asm__ __volatile__("movl %%esp,%0" : "=m"(regs->esp));
+ __asm__ __volatile__("movw %%ss, %%ax;" :"=a"(regs->xss));
+ __asm__ __volatile__("movw %%cs, %%ax;" :"=a"(regs->xcs));
+ __asm__ __volatile__("movw %%ds, %%ax;" :"=a"(regs->xds));
+ __asm__ __volatile__("movw %%es, %%ax;" :"=a"(regs->xes));
+ __asm__ __volatile__("pushfl; popl %0" :"=m"(regs->eflags));
+
+ regs->eip = (unsigned long)current_text_addr();
+}
+
+static void crash_save_self(void)
+{
+ struct pt_regs regs;
+ int cpu;
+ cpu = smp_processor_id();
+ crash_get_current_regs(&regs);
+ crash_save_this_cpu(&regs, cpu);
+}
+
#ifdef CONFIG_SMP
static atomic_t waiting_for_crash_ipi;

static int crash_nmi_callback(struct pt_regs *regs, int cpu)
{
local_irq_disable();
+ crash_save_this_cpu(regs, cpu);
atomic_dec(&waiting_for_crash_ipi);
/* Assume hlt works */
__asm__("hlt");
@@ -95,4 +174,5 @@
/* The kernel is broken so disable interrupts */
local_irq_disable();
nmi_shootdown_cpus();
+ crash_save_self();
}
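
The layout that append_elf_note() produces is easy to check outside the kernel. The following is a user-space sketch of the same packing: a three-word header, then the name, then the payload, each rounded up to a multiple of four bytes. struct note_hdr mirrors the three 32-bit fields of the ELF note header, and append_note() and words() are illustrative names, not the patch's code. Unlike the patch, this version zeroes the padding bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same three 32-bit fields as the ELF note header. */
struct note_hdr {
	uint32_t n_namesz;	/* name length, including the trailing NUL */
	uint32_t n_descsz;	/* payload length in bytes */
	uint32_t n_type;	/* note type, e.g. NT_PRSTATUS (1) */
};

/* Round a byte count up to whole 32-bit words. */
static size_t words(size_t bytes)
{
	return (bytes + 3) / 4;
}

/* Append one note at buf and return the next free word.
 * The caller must make sure the buffer is large enough. */
static uint32_t *append_note(uint32_t *buf, const char *name, uint32_t type,
			     const void *data, size_t data_len)
{
	struct note_hdr hdr;

	assert(buf && name && (data || data_len == 0));
	hdr.n_namesz = (uint32_t)strlen(name) + 1;
	hdr.n_descsz = (uint32_t)data_len;
	hdr.n_type = type;

	memcpy(buf, &hdr, sizeof(hdr));
	buf += words(sizeof(hdr));

	memset(buf, 0, words(hdr.n_namesz) * 4);	/* zero the name padding */
	memcpy(buf, name, hdr.n_namesz);
	buf += words(hdr.n_namesz);

	if (data_len) {
		memset(buf, 0, words(data_len) * 4);	/* zero the payload padding */
		memcpy(buf, data, data_len);
	}
	buf += words(data_len);
	return buf;
}
```

For the name "CORE" (5 bytes with the NUL, padded to 8) and a 6-byte payload (padded to 8), the note occupies 12 + 8 + 8 = 28 bytes, so the returned pointer is 7 words past the start.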

2005-01-19 12:10:35

by Hariprasad Nellitheertha

Subject: Re: [PATCH 16/29] x86-kexec

Hello Eric,

Eric W. Biederman wrote:
> This is the i386 implementation of kexec.

I tried these patches on an i386 box with kexec-tools-1.99.
kexec-ing with vmlinux works fine but bzImage still doesn't
go through. Is there a newer kexec-tools package that we
need to use this with (to take care of the "purgatory code
getting overwritten" problem you had identified)?

Regards, Hari

2005-01-19 12:25:35

by Andi Kleen

Subject: Re: [PATCH 19/29] x86_64-kexec

"Eric W. Biederman" <[email protected]> writes:

[not an extensive review of all this code, but from a quick read...]

> +
> +static void load_segments(void)
> +{
> + __asm__ __volatile__ (
> + "\tmovl $"STR(__KERNEL_DS)",%eax\n"
> + "\tmovl %eax,%ds\n"
> + "\tmovl %eax,%es\n"
> + "\tmovl %eax,%ss\n"
> + "\tmovl %eax,%fs\n"
> + "\tmovl %eax,%gs\n"
> + );

This misses a clobber for "eax".


-Andi
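
For readers unfamiliar with the point about clobbers: an asm statement that uses a register as scratch has to say so, or the compiler may keep a live value in that register across it. Declaring a clobber needs GCC's extended asm syntax (with %%eax rather than %eax), so the quoted block would also have to change form. Below is a small user-space sketch of the pattern. It does not load segment registers, which cannot be done meaningfully outside the kernel; add_one_via_eax() is a made-up example with a plain C fallback on other architectures.

```c
#include <assert.h>

/* Adds one to v, deliberately using %eax as a scratch register inside the
 * asm. The "eax" clobber tells the compiler not to keep anything live in
 * that register across the statement; "cc" covers the flags changed by addl. */
static int add_one_via_eax(int v)
{
	int out;

	assert(v < 2147483647);		/* avoid signed overflow */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	__asm__ __volatile__ (
		"movl %1, %%eax\n\t"
		"addl $1, %%eax\n\t"
		"movl %%eax, %0\n\t"
		: "=r" (out)		/* output: any general register */
		: "r" (v)		/* input: any general register */
		: "eax", "cc");		/* scratch register and flags */
#else
	out = v + 1;			/* portable fallback */
#endif
	return out;
}
```

Because of the clobber, the compiler will not place either operand in %eax, so the scratch use inside the asm cannot collide with them.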

2005-01-19 18:19:10

by Eric W. Biederman

Subject: Re: [Fastboot] Re: [PATCH 16/29] x86-kexec

Hariprasad Nellitheertha <[email protected]> writes:

> Hello Eric,
>
> Eric W. Biederman wrote:
> > This is the i386 implementation of kexec.
>
> I tried these patches on an i386 box with kexec-tools-1.99. kexec-ing with
> vmlinux works fine but bzImage still doesnt go through. Is there a newer
> kexec-tools package that we need to use this with (to take care of the
> "purgatory code getting overwritten" problem you had identified).

Yes. I will release the 2.0 version shortly. I need to give it a code
review before I put it out. So sometime later today.

Eric

2005-01-20 15:55:06

by Adrian Bunk

Subject: Re: [PATCH 19/29] x86_64-kexec

On Wed, Jan 19, 2005 at 12:31:37AM -0700, Eric W. Biederman wrote:
>...
> --- linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/crash.c Wed Dec 31 17:00:00 1969
> +++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/crash.c Tue Jan 18 23:14:06 2005
>...
> +note_buf_t crash_notes[NR_CPUS];
>...

After your patches, this global variable stays completely unused
on x86_64 (for the i386 version, you added a usage).

cu
Adrian

BTW: Is external usage for crash_notes planned, or can it become static
on both architectures?

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2005-01-20 18:13:35

by Eric W. Biederman

Subject: Re: [Fastboot] Re: [PATCH 19/29] x86_64-kexec

Adrian Bunk <[email protected]> writes:

> On Wed, Jan 19, 2005 at 12:31:37AM -0700, Eric W. Biederman wrote:
> >...
> > ---
> linux-2.6.11-rc1-mm1-nokexec-x86_64-machine_shutdown/arch/x86_64/kernel/crash.c
> Wed Dec 31 17:00:00 1969
>
> > +++ linux-2.6.11-rc1-mm1-nokexec-x86_64-kexec/arch/x86_64/kernel/crash.c Tue
> Jan 18 23:14:06 2005
>
> >...
> > +note_buf_t crash_notes[NR_CPUS];
> >...
>
> After your patches, this global variable stays completely unused
> on x86_64 (for the i386 version, you added a usage).
>
> cu
> Adrian
>
> BTW: Is external usage for crash_notes planned, or can it become static
> on both architectures?

A sharp eye. That array is a key part of an ongoing conversation.

To analyze why a kernel crashed you need some information, beyond
simply the contents of memory at the time of the crash.

If that information is not static and obtainable at the time of the crash
machine_crash_shutdown() needs to capture that information.

For the format of the captured information we can either use some
random structure, which requires reading kernel debug information
to interpret, or we can use a standard format, reducing the need
for magic in the interpretation process. The introduction of
crash_notes is the first step in switching to a standard format
for the data, to remove unnecessary dependencies between a kernel
and the tools that analyze it after it has crashed.

crash_notes is designed to be a set of per cpu buffers that hold
information captured just after a kernel has crashed. So the usage
is expected to be very external. How we communicate the address of
these per cpu buffers to analysis tools still needs to be addressed.
/proc/kallsyms?

As for internal users those will come when machine_crash_shutdown
becomes more than a noop on x86_64.

Eric
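
To illustrate why a standard format reduces the magic on the tool side: given only the address and size of one per-cpu buffer, an analysis tool can walk the notes without knowing any kernel structure layout. A rough sketch, assuming the buffer is laid out as in the i386 patch (notes packed back to back, terminated by an all-zero header); count_notes() and struct note_hdr are illustrative names, not an existing interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same three 32-bit fields as the ELF note header. */
struct note_hdr {
	uint32_t n_namesz;
	uint32_t n_descsz;
	uint32_t n_type;
};

/* Round a 32-bit byte count up to whole words without overflowing. */
static size_t words32(uint32_t bytes)
{
	return bytes / 4 + (bytes % 4 != 0);
}

/* Count the notes in buf, stopping at the all-zero terminator, at the end
 * of the buffer, or at a note that would run past the end. */
static unsigned count_notes(const uint32_t *buf, size_t max_words)
{
	size_t pos = 0;
	unsigned count = 0;

	assert(buf != NULL);
	while (pos + 3 <= max_words) {
		struct note_hdr hdr;
		size_t step;

		memcpy(&hdr, buf + pos, sizeof(hdr));
		if (hdr.n_namesz == 0 && hdr.n_descsz == 0 && hdr.n_type == 0)
			break;			/* final_note() terminator */
		step = 3 + words32(hdr.n_namesz) + words32(hdr.n_descsz);
		if (step > max_words - pos)
			break;			/* truncated or corrupt note */
		pos += step;
		count++;
	}
	return count;
}
```

The bounds checks matter here because the buffer comes from a kernel that has just crashed and may itself be damaged.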

2005-01-21 07:04:58

by Vivek Goyal

Subject: [PATCH] Reserving backup region for kexec based crashdumps.

Hi Andrew,

Following patch is against 2.6.11-rc1-mm2.

As mentioned by following note from Eric, crashdump code is currently
broken.
>
> The crashdump code is currently slightly broken. I have attempted to
> minimize the breakage so things can quick be made to work again.

We have started doing changes to make crashdump up and running again.
Following are few identified items to be done.

1. Reserve the backup region (640k) during kernel bootup.
2. Copy the data to backup region during crash.(moved to kexec user
space code, patch posted in separate mail)
3. Prepare elf headers while loading kexec panic kernel and store in
reserved memory area.
4. Pass required information to crashdump kernel, which parses it and
exports through /proc/vmcore. (may be user space utility, open to
discussion)

Following patch implements item 1) in the list. Soon we shall be rolling
out the patches for rest.

Thanks
Vivek



Attachments:
crashdump-x86-reserve-640k-memory.patch (3.03 kB)

2005-01-21 08:02:18

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Vivek Goyal <[email protected]> writes:

> Hi Andrew,
>
> Following patch is against 2.6.11-rc1-mm2.
>
> As mentioned by following note from Eric, crashdump code is currently
> broken.
> >
> > The crashdump code is currently slightly broken. I have attempted to
> > minimize the breakage so things can quick be made to work again.
>
> We have started doing changes to make crashdump up and running again.
> Following are few identified items to be done.
>
> 1. Reserve the backup region (640k) during kernel bootup.

Why do we need a separate region for this?

It should be simple enough to take 640k out of the area kexec reserves
for the crash dump kernel. That is what the previous code implemented.

> 2. Copy the data to backup region during crash.(moved to kexec user
> space code, patch posted in separate mail)

Thanks. By and large it looks sane. It won't work yet, but it is
moving in the right direction.

> +++ linux-2.6.11-rc1-mm2-kexec-eric-root/include/linux/kexec.h 2005-01-20
> 13:55:33.000000000 +0530
>
> @@ -79,7 +79,7 @@ struct kimage {
> unsigned long control_page;
>
> /* Flags to indicate special processing */
> - int type : 1;
> + unsigned int type : 1;

That looks like a sane bug fix. Having values of 0 and -1 is not quite what
I was expecting...

Eric
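
The effect behind the int/unsigned int change can be reproduced in a few lines. A one-bit signed bitfield has room for only the sign bit, so storing 1 does not read back as 1. The sketch below uses "signed int" explicitly because whether a plain "int" bitfield is signed is implementation-defined; the struct and function names are illustrative only.

```c
#include <assert.h>

struct flags_signed {
	signed int type : 1;	/* representable values: 0 and -1 */
};

struct flags_unsigned {
	unsigned int type : 1;	/* representable values: 0 and 1 */
};

/* Store v (0 or 1) in a signed one-bit field and read it back. */
static int read_back_signed(int v)
{
	struct flags_signed f;

	assert(v == 0 || v == 1);
	f.type = v;	/* 1 is out of range; the result is implementation-defined */
	return f.type;
}

/* Store v (0 or 1) in an unsigned one-bit field and read it back. */
static int read_back_unsigned(int v)
{
	struct flags_unsigned f;

	assert(v == 0 || v == 1);
	f.type = (unsigned int)v;
	return (int)f.type;
}
```

With GCC and Clang, read_back_signed(1) yields -1, so a test such as "image->type == 1" would never be true, while read_back_unsigned(1) yields 1.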



2005-01-21 10:08:31

by Vivek Goyal

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

On Fri, 2005-01-21 at 13:24, Eric W. Biederman wrote:
> Vivek Goyal <[email protected]> writes:
>
> > Hi Andrew,
> >
> > Following patch is against 2.6.11-rc1-mm2.
> >
> > As mentioned by following note from Eric, crashdump code is currently
> > broken.
> > >
> > > The crashdump code is currently slightly broken. I have attempted to
> > > minimize the breakage so things can quick be made to work again.
> >
> > We have started doing changes to make crashdump up and running again.
> > Following are few identified items to be done.
> >
> > 1. Reserve the backup region (640k) during kernel bootup.
>
> Why do we need a separate region for this?
>
> It should be simple enough to take 640 out of the area kexec reserves
> for the crash dump kernel. That is what the previous code implemented.

Previous code also reserved the backup memory region after the crash kernel
region. It is just a matter of interpretation. What I understand is that
the crash kernel reserved region represents something where one can load the
panic kernel directly and the new kernel can use this memory region for
memory allocation.

I don't want to steal the backup region from the crash kernel region;
otherwise, I shall have to boot the crash kernel with some strange
values like memmap=(32M-640k)@16M (symbolically) to prevent the crash kernel
overwriting the backup region. Why make the user aware of the location of
the backup region?

Alternatively, this can be managed by reserving this backup region again
in the crash kernel to avoid any stomping. Maybe pass the backup region
location to the new kernel through the parameter segment or through the
command line, but I don't see a strong reason for doing that.


Thanks
Vivek

2005-01-21 11:15:29

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.


On deeper review your patch as it stands is incomplete. In particular
you don't provide a way to either hardcode or dynamically set
the area you are attempting to reserve to hold the backup region.

Vivek Goyal <[email protected]> writes:

> On Fri, 2005-01-21 at 13:24, Eric W. Biederman wrote:
> > Why do we need a separate region for this?
> >
> > It should be simple enough to take 640 out of the area kexec reserves
> > for the crash dump kernel. That is what the previous code implemented.
>
> Previous code also reserved the backup memory region after crash kernel
> region. It is just a matter of interpretation. What I understand that
> crash kernel reserved region represents something where one can load the
> panic kernel directly and new kernel can use this memory region for
> memory allocation.

Yes the reservation is a hunk of memory reserved for use by the crashdump
process, or whatever happens after panic. It is up to the loaded code
to define how that memory is used. purgatory.ro is a legitimate part
of that loaded code.

> I don't want to steal the backup region from crash kernel region
> otherwise, I shall have to boot the crash kernel with some strange
> values like memmap=(32M-640k)@16M (symbolically) to prevent crash kernel
> overwriting backup region. Why to make user aware of location of backup
> region.

Making the user aware of the region makes it one more thing for the user
to be aware of and to manually manage. Based on what was passed as
crashkernel=... we should be able to automate all of the rest of it,
so a weird memmap= line should not be hard.

I will have to wait and see but it would not surprise me if we settled
on a fixed address per architecture for the reservation to make it
easier for various users.

On that note we probably want to move the magic that we are doing
for crashdumps into the linux loader (i.e. x86-linux-setup.c ) in
kexec-tools, as most of these pieces are specific to taking a
crashdump with linux. Not that I expect we will be doing it with
anything else but...

> Alternatively, this can be managed by reserving this backup region again
> in crash kernel to avoid any stomping. May be pass backup region
> location to new kernel through parameter segment or through command line
> but don't see a strong reason for doing that.

Probably the biggest reason for doing it in one reservation is that
it happens to be an implementation detail of the crashdump capture
kernel. If that kernel is not SMP I believe you can safely leave the
first 640k alone. I know at least one other effort has had success in
that area.

In general it is not good to make unnecessary implementation details
between two pieces of software be part of their interface.

Eric

2005-01-23 09:24:48

by Vivek Goyal

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

On Fri, 2005-01-21 at 16:43, Eric W. Biederman wrote:
> On deeper review your patch as it stands is incomplete. In particular
> you don't provide a way to either hardcode or dynamically set
> the area you are attempt to reserve to hold the backup region.

Well. Here is the new patch. This one steals the 640k from the top of the
memory region reserved for the crash kernel.

A new command line parameter (crashbackup=) has been introduced for
crash dump kernels. This parameter specifies the location of backup
region from where to retrieve the backup data.

Thanks
Vivek



Attachments:
crashdump-x86-reserve-640k-memory.patch (7.54 kB)
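
The arithmetic discussed in this subthread (a 640k backup area taken from the top of the crash kernel reservation, and the resulting "weird" memmap= value) is small enough to automate. The sketch below is not taken from the attached patch; split_reservation(), format_memmap() and struct crash_layout are hypothetical names, and the 16M base and 32M size are simply the values used in the examples earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>

#define BACKUP_KB 640UL		/* size of the backup region, from the thread */

struct crash_layout {
	unsigned long usable_start_kb;	/* memory the crash kernel may use */
	unsigned long usable_size_kb;
	unsigned long backup_start_kb;	/* backup area at the top of the reservation */
};

/* Split a reservation of size_kb at base_kb into usable memory plus a
 * backup area taken from the top. Hypothetical helper. */
static struct crash_layout split_reservation(unsigned long base_kb,
					     unsigned long size_kb)
{
	struct crash_layout l;

	assert(size_kb > BACKUP_KB);
	l.usable_start_kb = base_kb;
	l.usable_size_kb = size_kb - BACKUP_KB;
	l.backup_start_kb = base_kb + l.usable_size_kb;
	return l;
}

/* Write the memmap= option for the usable part, in kilobytes. Returns the
 * length snprintf() reports. Hypothetical helper. */
static int format_memmap(char *out, size_t out_len, const struct crash_layout *l)
{
	return snprintf(out, out_len, "memmap=%luK@%luK",
			l->usable_size_kb, l->usable_start_kb);
}
```

For a 32M reservation at 16M this gives a usable size of 32128K, a backup region starting at 48512K, and the option string memmap=32128K@16384K, which is the symbolic memmap=(32M-640k)@16M written out.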

2005-01-25 03:33:00

by Brown, Len

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Wed, 2005-01-19 at 02:31, Eric W. Biederman wrote:
> From: Eric W. Biederman <[email protected]>
>
> This patch disables interrupt generation from the legacy pic on
> reboot. Now that there is a sys_device class it should not be called
> while drivers are still using interrupts.
>
> There is a report about this breaking ACPI power off on some systems.
> http://bugme.osdl.org/show_bug.cgi?id=4041
> However the final comment seems to exhonorate this code. So until
> I get more information I believe that was a false positive.

No, the last comment in the bug report
(davej says that there were poweroff problems in FC)
does not exonerate this patch.
All it says is that there are additional poweroff bugs out there.

-Len


2005-01-25 03:54:31

by Brown, Len

Subject: Re: [PATCH 6/29] x86-apic-virtwire-on-shutdown

On Wed, 2005-01-19 at 02:31, Eric W. Biederman wrote:
> When coming out of apic mode attempt to set the appropriate
> apic back into virtual wire mode. This improves on previous versions
> of this patch by by never setting bot the local apic and the ioapic
> into veritual wire mode.
>
> This code looks at data from the mptable to see if an ioapic has
> an ExtInt input to make this decision. A future improvement
> is to figure out which apic or ioapic was in virtual wire mode
> at boot time and to remember it. That is potentially a more accurate
> method, of selecting which apic to place in virutal wire mode.
>

The call to find_isa_irq_pin() will always fail on ACPI-enabled systems,
so this patch is a NO-OP unless the system is booted in MPS mode.

Do we really want to be adding this complexity for obsolete systems?
Are there systems that fail without this patch?

-Len


2005-01-25 03:59:38

by Dave Jones

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Mon, Jan 24, 2005 at 10:32:50PM -0500, Len Brown wrote:
> On Wed, 2005-01-19 at 02:31, Eric W. Biederman wrote:
> > From: Eric W. Biederman <[email protected]>
> >
> > This patch disables interrupt generation from the legacy pic on
> > reboot. Now that there is a sys_device class it should not be called
> > while drivers are still using interrupts.
> >
> > There is a report about this breaking ACPI power off on some systems.
> > http://bugme.osdl.org/show_bug.cgi?id=4041
> > However the final comment seems to exhonorate this code. So until
> > I get more information I believe that was a false positive.
>
> No, the last comment in the bug report
> (davej says that there were poweroff problems in FC)
> does not exhonerate this patch.
> All it says is that there are additional poweroff bugs out there.

Indeed. Since dropping the kexec bits from the Fedora kernel,
the 'hangs at poweroff' bug went away for a lot of folks,
but there still remain some people affected by some other regression.
https://bugzilla.redhat.com/beta/show_bug.cgi?id=acpi_power_off
has the gory details.

Dave

2005-01-25 06:33:04

by Eric W. Biederman

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

Dave Jones <[email protected]> writes:

> On Mon, Jan 24, 2005 at 10:32:50PM -0500, Len Brown wrote:
> > On Wed, 2005-01-19 at 02:31, Eric W. Biederman wrote:
> > > From: Eric W. Biederman <[email protected]>
> > >
> > > This patch disables interrupt generation from the legacy pic on
> > > reboot. Now that there is a sys_device class it should not be called
> > > while drivers are still using interrupts.
> > >
> > > There is a report about this breaking ACPI power off on some systems.
> > > http://bugme.osdl.org/show_bug.cgi?id=4041
> > > However the final comment seems to exhonorate this code. So until
> > > I get more information I believe that was a false positive.
> >
> > No, the last comment in the bug report
> > (davej says that there were poweroff problems in FC)
> > does not exhonerate this patch.
> > All it says is that there are additional poweroff bugs out there.
>
> Indeed. Since dropping the kexec bits from the Fedora kernel,
> the 'hangs at poweroff' bug went away for a lot of folks,
> but there still remain some people affected by some other regression.
> https://bugzilla.redhat.com/beta/show_bug.cgi?id=acpi_power_off
> has the gory details.

Ok. I misunderstood that one then. I thought a separate fix
had cured the bug, with the kexec bits remaining.

Eric

2005-01-25 06:41:24

by Eric W. Biederman

Subject: Re: [PATCH 6/29] x86-apic-virtwire-on-shutdown

Len Brown <[email protected]> writes:

> On Wed, 2005-01-19 at 02:31, Eric W. Biederman wrote:
> > When coming out of apic mode attempt to set the appropriate
> > apic back into virtual wire mode. This improves on previous versions
> > of this patch by by never setting bot the local apic and the ioapic
> > into veritual wire mode.
> >
> > This code looks at data from the mptable to see if an ioapic has
> > an ExtInt input to make this decision. A future improvement
> > is to figure out which apic or ioapic was in virtual wire mode
> > at boot time and to remember it. That is potentially a more accurate
> > method, of selecting which apic to place in virutal wire mode.
> >
>
> The call to find_isa_irq_pin() will always fail on ACPI-enabled systems,
> so this patch is a NO-OP unless the system is booted in MPS mode.
>
> Do we really want to be adding this complexity for obsolete systems?
> Are there systems that fail without this patch?

Yes there are bleeding edge systems that fail without this patch.
And I have them. That is why I wrote the code.

I do agree that find_isa_irq_pin is a suboptimal way to get this
information, looking at the ioapics at boot time would be better.
However it works for me, the code is not wrong, and as you said
usually the code becomes a noop.

If I can find the appropriate place in the boot path to examine
the ioapics before they get stomped I am more than willing to write
code that will handle this even in the presence of acpi data.

In addition this code is not a complete noop because when
find_isa_irq_pin fails it does put the local apic in virtual wire mode.
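
A minimal user-space sketch of that fallback, for illustration only. The
enum and function names are hypothetical and are not taken from the patch;
the mptable lookup (find_isa_irq_pin() in the patch) is modeled as a plain
integer where -1 means "no ExtInt pin found".

```c
#include <assert.h>

/* Hypothetical model: which interrupt controller gets put back into
 * virtual wire mode when coming out of apic mode. */
enum virtwire_target {
	VIRTWIRE_IOAPIC,	/* pass the i8259 through an ioapic ExtInt pin */
	VIRTWIRE_LOCAL_APIC,	/* pass the i8259 through the local apic */
};

/* extint_pin stands in for the result of the mptable lookup: the ioapic
 * pin wired as ExtInt, or -1 when the lookup fails (assumed here to be
 * what happens on an ACPI boot, as discussed above). */
static enum virtwire_target choose_virtwire_target(int extint_pin)
{
	assert(extint_pin >= -1);

	if (extint_pin != -1)
		return VIRTWIRE_IOAPIC;

	/* Lookup failed: still not a noop, fall back to the local apic. */
	return VIRTWIRE_LOCAL_APIC;
}
```

Only one of the two targets is ever returned, which is the "never both"
property the patch description asks for.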


Eric

2005-01-25 07:37:48

by Brown, Len

Subject: Re: [PATCH 6/29] x86-apic-virtwire-on-shutdown

On Tue, 2005-01-25 at 01:39, Eric W. Biederman wrote:
> Len Brown <[email protected]> writes:
>
> > On Wed, 2005-01-19 at 02:31, Eric W. Biederman wrote:
> > > When coming out of apic mode attempt to set the appropriate
> > > apic back into virtual wire mode. This improves on previous
> > > versions of this patch by never setting both the local apic and
> > > the ioapic into virtual wire mode.
> > >
> > > This code looks at data from the mptable to see if an ioapic has
> > > an ExtInt input to make this decision. A future improvement
> > > is to figure out which apic or ioapic was in virtual wire mode
> > > at boot time and to remember it. That is potentially a more
> > > accurate method of selecting which apic to place in virtual wire
> > > mode.
> > >
> >
> > The call to find_isa_irq_pin() will always fail on ACPI-enabled
> > systems, so this patch is a NO-OP unless the system is booted in
> > MPS mode.
> >
> > Do we really want to be adding this complexity for obsolete systems?
> > Are there systems that fail without this patch?
>
> Yes there are bleeding edge systems that fail without this patch.
> And I have them. That is why I wrote the code.

What bleeding edge system supports MPS and does not support ACPI?

> I do agree that find_isa_irq_pin is a suboptimal way to get this
> information, looking at the ioapics at boot time would be better.
> However it works for me, the code is not wrong, and as you said
> usually the code becomes a noop.
>
> If I can find the appropriate place in the boot path to examine
> the ioapics before they get stomped I am more than willing to write
> code that will handle this even in the presence of acpi data.

I believe we don't touch the IO_APICs in either MPS or ACPI mode before
setup_IO_APIC.

> In addition this code is not a complete noop because when
> find_isa_irq_pin fails it does put the local apic in virtual wire
> mode.

If the goal of this patch is to restore the hardware to the state
it was in before Linux scribbled on it, then it might be a better
idea to save/restore the actual register values the BIOS gave us rather
than writing hard-coded values, no?

-Len


2005-01-25 08:37:03

by Eric W. Biederman

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

Dave Jones <[email protected]> writes:

> On Mon, Jan 24, 2005 at 10:32:50PM -0500, Len Brown wrote:
> > On Wed, 2005-01-19 at 02:31, Eric W. Biederman wrote:
> > > From: Eric W. Biederman <[email protected]>
> > >
> > > This patch disables interrupt generation from the legacy pic on
> > > reboot. Now that there is a sys_device class it should not be called
> > > while drivers are still using interrupts.
> > >
> > > There is a report about this breaking ACPI power off on some systems.
> > > http://bugme.osdl.org/show_bug.cgi?id=4041
> > > However the final comment seems to exonerate this code. So until
> > > I get more information I believe that was a false positive.
> >
> > No, the last comment in the bug report
> > (davej says that there were poweroff problems in FC)
> > does not exonerate this patch.
> > All it says is that there are additional poweroff bugs out there.

So I will ask again, as I did when Andrew first pointed this in my
direction: what code path in the kernel could possibly care if we
disable the i8259 after we have disabled all of the other hardware in
the system?

The i8259 is a system device, so everything else is shut off first
(by design). I have one vague reference that this is a/the
problem, but I don't have anything to go on with respect to fixing it.

I don't have a system that reproduces this.

Eric

2005-01-25 09:13:38

by Eric W. Biederman

Subject: Re: [PATCH 6/29] x86-apic-virtwire-on-shutdown

Len Brown <[email protected]> writes:

> On Tue, 2005-01-25 at 01:39, Eric W. Biederman wrote:
> > Yes there are bleeding edge systems that fail without this patch.
> > And I have them. That is why I wrote the code.
>
> What bleeding edge system supports MPS and does not support ACPI?

All I was talking about is the hardware here. Last I looked, the
errata for the E7520, E7525 and E7530 chipsets listed using the IOAPIC
in virtual wire mode as the only way to get the system
to work stably if you did not have an SMP kernel. If there is an
updated errata workaround I'd love to hear it.

The fact that ACPI is quite common is one of the reasons this code
path needs more work.

I have not seen the problems with ACPI as I have yet to see a
compelling reason to turn it on. And my few experiences with
it have led me to put acpi=off on my kernel command-line by default.

> I believe we don't touch the IO_APICs in either MPS or ACPI mode before
> setup_IO_APIC.

Thanks, I will have a look. That sounds like a likely place.

> If the goal of this patch is to restore the hardware to the state
> it was in before Linux scribbled on it, then it might be a better
> idea to save/restore the actual register values the BIOS gave us rather
> than writing hard-coded values, no?

Nope, that is not the goal. The goal is to place system devices in
a state close enough to pc compatibility mode that an unpatched kernel
will start. A secondary goal is to place the system in a state where
the firmware is likely not to have problems.

As the apics are architectural hardware and we only touch the bits we
understand there is no reason for us to be modest and pretend we don't
know what we are doing. The architecture defines a very narrow
set of states that are valid when you are not using the apics. The
question is only which of those states will work: pic_mode,
virt_wire mode in the local apic, or virt_wire mode in the ioapic.
Looking at the hardware is probably the most consistently reliable
method of determining that, as the kernel will not boot if the firmware
sets it up wrong.
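
A rough sketch of what "looking at the hardware" could mean, written as a
user-space model rather than kernel code. It assumes the two register
values have already been read at boot, before anything reprograms them.
The enum, the helper names, and the idea of classifying from just these
two words are hypothetical. The bit layout used (delivery mode in bits
8-10 with 111b meaning ExtINT, mask in bit 16) is the architectural layout
shared by ioapic redirection entries and the local apic LVT entries.

```c
#include <stdint.h>

#define DELIVERY_MODE_MASK	(0x7u << 8)	/* bits 8-10 */
#define DELIVERY_MODE_EXTINT	(0x7u << 8)	/* 111b = ExtINT */
#define ENTRY_MASKED		(1u << 16)	/* set = delivery blocked */

enum legacy_irq_state {
	PIC_MODE,
	VIRT_WIRE_LOCAL_APIC,
	VIRT_WIRE_IOAPIC,
};

/* An entry passes i8259 interrupts through only if it is ExtINT and
 * not masked. */
static int is_live_extint(uint32_t entry)
{
	return (entry & DELIVERY_MODE_MASK) == DELIVERY_MODE_EXTINT &&
	       !(entry & ENTRY_MASKED);
}

/* ioapic_entry_low: low word of the ioapic redirection entry under
 * inspection; lvt_lint0: the local apic LINT0 entry. Falling back to
 * PIC_MODE when neither is live is a simplification of this sketch. */
static enum legacy_irq_state classify_boot_state(uint32_t ioapic_entry_low,
						 uint32_t lvt_lint0)
{
	if (is_live_extint(ioapic_entry_low))
		return VIRT_WIRE_IOAPIC;
	if (is_live_extint(lvt_lint0))
		return VIRT_WIRE_LOCAL_APIC;
	return PIC_MODE;
}
```

Remembering the result at boot would let the shutdown path pick the same
state again instead of inferring it from the mptable.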

Eric

2005-01-25 09:43:59

by Barry K. Nathan

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Tue, Jan 25, 2005 at 01:35:00AM -0700, Eric W. Biederman wrote:
> So I will ask again, as I did when Andrew first pointed this in my
> direction. What code path in the kernel could possibly care if we
> disable the i8259 after we have disabled all of the other hardware in
> the system.

This may be a foolish question, but, are there possibly any code paths
in the *BIOS* that could care?

-Barry K. Nathan <[email protected]>

2005-01-25 10:16:00

by Eric W. Biederman

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

"Barry K. Nathan" <[email protected]> writes:

> On Tue, Jan 25, 2005 at 01:35:00AM -0700, Eric W. Biederman wrote:
> > So I will ask again, as I did when Andrew first pointed this in my
> > direction. What code path in the kernel could possibly care if we
> > disable the i8259 after we have disabled all of the other hardware in
> > the system.
>
> This may be a foolish question, but, are there possibly any code paths
> in the *BIOS* that could care?

Fairly unlikely at this point, as the state we have traditionally
reprogrammed the i8259 to delivers interrupts to different vectors
than the firmware uses. So I don't see how telling it not
to deliver interrupts where the firmware won't expect them
is likely to change things.

It could be that ACPI AML code is trying something at an inappropriate
time. But I cannot even find the ACPI soft power code path. pm_power_off
never seems to get hooked.

Or it could be one of the other kexec related patches for all I know.

Until I get a good data point or a reproducer I can't do anything.
It doesn't even make sense to drop the patch because then
I won't get a good data point. And I won't know if similar symptoms
crop up or if I need to do something else.

Eric

2005-01-25 10:49:21

by Barry K. Nathan

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Tue, Jan 25, 2005 at 03:14:06AM -0700, Eric W. Biederman wrote:
> "Barry K. Nathan" <[email protected]> writes:
>
> > On Tue, Jan 25, 2005 at 01:35:00AM -0700, Eric W. Biederman wrote:
> > > So I will ask again, as I did when Andrew first pointed this in my
> > > direction. What code path in the kernel could possibly care if we
> > > disable the i8259 after we have disabled all of the other hardware in
> > > the system.
> >
> > This may be a foolish question, but, are there possibly any code paths
> > in the *BIOS* that could care?
>
> Fairly unlikely at this point, as the state we have traditionally
> reprogrammed the i8259 to, delivers interrupts to different vectors
> than the firmware uses. So I don't see how telling it not
> to deliver interrupts where the firmware won't expect them
> is likely to change things.

FWIW I've noticed something quirky. I guess this is only likely to show
up in a contrived scenario, but I have actually reproduced this (by
accident, even), so maybe it's worth mentioning.

1. System is booted with both APM and ACPI disabled. (To be exact, ACPI
is not compiled into the kernel I tested in this case, and APM is, but
the BIOS lacks APM support.)
2. Ultra-minimal /sbin/init does shutdown syscall.
3. "Shutdown: hda
Power down."

You (or I, at least) would expect things to stop here, but we enter the
Twilight Zone instead:

4. "Kernel panic - not syncing: Attempted to kill init!" (Nothing this
weird ever happens if I have ACPI enabled. In that case it either
freezes or properly shuts down.)

5. The admin (that's me) then tries to reboot box without resorting to the
power or reset buttons. (Actually the chassis in this case doesn't have
a reset button, and I want to avoid unnecessarily power-cycling the
thing.)

Without the i8259 shutdown patch applied, I can reboot with Alt-SysRq-B.
With the patch, the computer doesn't respond to Alt-SysRq-B and I have
to use the power button.

If anyone's interested, the source code to my minimal /sbin/init is
here:
http://bugme.osdl.org/attachment.cgi?id=4398&action=view

> It could be that ACPI AML code is trying something at an inappropriate
> time. But I can not even find the ACPI soft power code path. pm_power_off
> never seems to get hooked.
>
> Or it could be one of the other kexec related patches for all I know.

The problem occurs even with no other kexec patches applied. And
applying all kexec patches except the i8259 shutdown patch fails to
reproduce the problem.

> Until I get a good data point or a reproducer I can't do anything.
> It doesn't even make sense to drop the patch because then
> I won't get a good data point. And I won't know if similar symptoms
> crop up or if I need to do something else.

It's only happening on 1 of 4 tested boxes here, too (the other three
are not affected, but they don't have the same motherboard either). :(

-Barry K. Nathan <[email protected]>

2005-01-25 11:49:28

by Eric W. Biederman

Subject: Re: [PATCH 4/29] x86-i8259-shutdown


OK, looking at the code I have finally tracked down where acpi_power_off()
lives: drivers/acpi/sleep/poweroff.c. And it has companions in
drivers/acpi/hardware/hwsleep.c.

Why I did not find this when I was looking for things that
set pm_power_off earlier I haven't a clue, because it showed up this
time.

One of the functions called by acpi_power_off(),
acpi_enter_sleep_state_prep(), has this to say:
/******************************************************************************
 *
 * FUNCTION:    acpi_enter_sleep_state_prep
 *
 * PARAMETERS:  sleep_state - Which sleep state to enter
 *
 * RETURN:      Status
 *
 * DESCRIPTION: Prepare to enter a system sleep state (see ACPI 2.0 spec p 231)
 *              This function must execute with interrupts enabled.
 *              We break sleeping into 2 stages so that OSPM can handle
 *              various OS-specific tasks between the two steps.
 *
 ******************************************************************************/

Since I clearly have interrupts disabled at this point, we clearly have
an issue. Note that if we were using apics we would never have had
interrupts enabled when we got here, so this code as coded is also broken
in an SMP context.

Cool I wonder how many more bugs work on kexec can flush out?

So the question becomes where is a good place to call
acpi_enter_sleep_state_prep()?
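
To pin down the constraint any answer has to satisfy, here is a toy
user-space model of the ordering. Everything in it is a stand-in: the
model_* functions are hypothetical and only mimic the one property that
matters here, namely that the prepare step needs interrupts to still be
deliverable and that the sysdev shutdown (with the i8259 patch) is where
they stop being deliverable.

```c
#include <stdbool.h>

static bool irqs_deliverable;
static bool sleep_prepared;

/* Stand-in for sysdev_shutdown(): legacy interrupt delivery ends here. */
static void model_sysdev_shutdown(void)
{
	irqs_deliverable = false;
}

/* Stand-in for acpi_enter_sleep_state_prep(): its header comment says it
 * must run with interrupts enabled. Returns 0 on success, -1 otherwise. */
static int model_sleep_state_prep(void)
{
	if (!irqs_deliverable)
		return -1;
	sleep_prepared = true;
	return 0;
}

static void model_reset(void)
{
	irqs_deliverable = true;
	sleep_prepared = false;
}

/* Current order: prep is only reached from the power-off hook, which
 * runs after the system devices have been shut down. */
static int poweroff_prep_after_sysdevs(void)
{
	model_reset();
	model_sysdev_shutdown();
	return model_sleep_state_prep();
}

/* Required order: prep runs while interrupts still work. */
static int poweroff_prep_before_sysdevs(void)
{
	int ret;

	model_reset();
	ret = model_sleep_state_prep();
	model_sysdev_shutdown();
	return ret;
}
```

In this model the first ordering fails and the second succeeds, so the
call has to land somewhere before the system devices are shut down.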


"Barry K. Nathan" <[email protected]> writes:

> On Tue, Jan 25, 2005 at 03:14:06AM -0700, Eric W. Biederman wrote:
> > "Barry K. Nathan" <[email protected]> writes:
> >
> > > On Tue, Jan 25, 2005 at 01:35:00AM -0700, Eric W. Biederman wrote:
> > > > So I will ask again, as I did when Andrew first pointed this in my
> > > > direction. What code path in the kernel could possibly care if we
> > > > disable the i8259 after we have disabled all of the other hardware in
> > > > the system.
> > >
> > > This may be a foolish question, but, are there possibly any code paths
> > > in the *BIOS* that could care?
> >
> > Fairly unlikely at this point, as the state we have traditionally
> > reprogrammed the i8259 to, delivers interrupts to different vectors
> > than the firmware uses. So I don't see how telling it not
> > to deliver interrupts where the firmware won't expect them
> > is likely to change things.
>
> FWIW I've noticed something quirky. I guess this is only likely to show
> up in a contrived scenario, but I have actually reproduced this (by
> accident, even), so maybe it's worth mentioning.
>
> 1. System is booted with both APM and ACPI disabled. (To be exact, ACPI
> is not compiled into the kernel I tested in this case, and APM is, but
> the BIOS lacks APM support.)
> 2. Ultra-minimal /sbin/init does shutdown syscall.
> 3. "Shutdown: hda
> Power down."
>
> You (or I, at least) would expect things to stop here, but we enter the
> Twilight Zone instead:

So would I.

> 4. "Kernel panic - not syncing: Attempted to kill init!" (Nothing this
> weird ever happens if I have ACPI enabled. In that case it either
> freezes or properly shuts down.)

What lines are printed just before that? It would be nice
to know which part of the kernel panicked, if we can.

At least if you can reproduce this, that would be nice.

I wonder if APM is really disabled at this point. I guess since
I have found the ACPI bug we can save the APM bug for later.

> 5. The admin (that's me) then tries to reboot box without resorting to the
> power or reset buttons. (Actually the chassis in this case doesn't have
> a reset button, and I want to avoid unnecessarily power-cycling the
> thing.)
>
> Without the i8259 shutdown patch applied, I can reboot with Alt-SysRq-B.
> With the patch, the computer doesn't respond to Alt-SysRq-B and I have
> to use the power button.

That is probably the one legitimate glitch with this patch. It disables
interrupts so the keyboard becomes unresponsive. Of course if the
system is really powered down that is not an issue but...
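
To spell out why the keyboard goes dead, here is a small user-space model
of masking both i8259 chips. The port numbers 0x21 and 0xA1 are the
standard PC interrupt mask registers for the master and slave i8259;
fake_outb() and the fake_ports array are hypothetical stand-ins so the
sketch can run outside the kernel, and it is not a copy of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PIC_MASTER_IMR	0x21	/* interrupt mask register, master i8259 */
#define PIC_SLAVE_IMR	0xA1	/* interrupt mask register, slave i8259 */

/* Simulated I/O port space; all zero means every IRQ line is unmasked. */
static uint8_t fake_ports[0x100];

static void fake_outb(uint8_t value, uint16_t port)
{
	assert(port < sizeof(fake_ports));
	fake_ports[port] = value;
}

/* A set bit in the mask register blocks that IRQ line, so 0xff on both
 * chips stops the pair from raising any interrupt at all. */
static void pic_mask_all(void)
{
	fake_outb(0xff, PIC_MASTER_IMR);
	fake_outb(0xff, PIC_SLAVE_IMR);
}

/* Returns nonzero if the given legacy IRQ (0-15) can still be delivered. */
static int pic_irq_enabled(unsigned int irq)
{
	uint8_t imr;

	assert(irq < 16);
	imr = fake_ports[irq < 8 ? PIC_MASTER_IMR : PIC_SLAVE_IMR];
	return !(imr & (1u << (irq & 7)));
}
```

The keyboard sits on IRQ 1, so once pic_mask_all() has run,
pic_irq_enabled(1) is false and a key combination like Alt-SysRq-B has no
way to reach the kernel.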

> If anyone's interested, the source code to my minimal /sbin/init is
> here:
> http://bugme.osdl.org/attachment.cgi?id=4398&action=view

Interesting.

I am wondering if you see the same symptoms if you make this
your /init in an initcpio.gz. I am wondering if some of the
issues might be related to something not being shut down properly.

> > It could be that ACPI AML code is trying something at an inappropriate
> > time. But I can not even find the ACPI soft power code path. pm_power_off
> > never seems to get hooked.
> >
> > Or it could be one of the other kexec related patches for all I know.
>
> The problem occurs even with no other kexec patches applied. And
> applying all kexec patches except the i8259 shutdown patch fails to
> reproduce the problem.

Very odd. I was wondering how many test cases had run, before I found
the acpi bug above. If the problem was not 100% reproducible, the
correlation with this patch might have been a fluke. I just want to be
certain.

> > Until I get a good data point or a reproducer I can't do anything.
> > It doesn't even make sense to drop the patch because then
> > I won't get a good data point. And I won't know if similar symptoms
> > crop up or if I need to do something else.
>
> It's only happening on 1 of 4 tested boxes here, too (the other three
> are not affected, but they don't have the same motherboard either). :(

Well if you can help me track this down I would appreciate it.

Do you think you would have time to try some test patches?

Eric

2005-01-25 12:14:40

by Eric W. Biederman

Subject: Re: [PATCH 4/29] x86-i8259-shutdown


Could you try this patch on your system with acpi that
is having problems?

The patch needs some work before it goes into a mainline kernel
as I have hacked the call to acpi_power_off_prepare into roughly
the proper position in the call chain instead of using a proper
hook. But I can't quickly find an existing hook in the proper
location.

Eric

diff -uNr linux-2.6.11-rc1-mm1-nokexec/drivers/acpi/sleep/poweroff.c linux-2.6.11-rc1-mm1-acpi-power-off-shuffle/drivers/acpi/sleep/poweroff.c
--- linux-2.6.11-rc1-mm1-nokexec/drivers/acpi/sleep/poweroff.c Fri Jan 7 12:53:50 2005
+++ linux-2.6.11-rc1-mm1-acpi-power-off-shuffle/drivers/acpi/sleep/poweroff.c Tue Jan 25 05:05:06 2005
@@ -7,18 +7,34 @@
 
 #include <linux/pm.h>
 #include <linux/init.h>
+#include <linux/kernel.h>
 #include <acpi/acpi_bus.h>
 #include <linux/sched.h>
 #include "sleep.h"
 
+static void acpi_power_off_prepare(void)
+{
+	if (system_state == SYSTEM_POWER_OFF) {
+		acpi_wakeup_gpe_poweroff_prepare();
+		acpi_enter_sleep_state_prep(ACPI_STATE_S5);
+	}
+}
+
+void do_acpi_power_off_prepare(void)
+{
+	if (!acpi_disabled) {
+		apci_power_offf_prepare();
+	}
+}
+
 static void
 acpi_power_off (void)
 {
 	printk("%s called\n",__FUNCTION__);
+#if 0 /* This should be made redundant by other patches.. */
 	/* Some SMP machines only can poweroff in boot CPU */
 	set_cpus_allowed(current, cpumask_of_cpu(0));
-	acpi_wakeup_gpe_poweroff_prepare();
-	acpi_enter_sleep_state_prep(ACPI_STATE_S5);
+#endif
 	ACPI_DISABLE_IRQS();
 	acpi_enter_sleep_state(ACPI_STATE_S5);
 }
diff -uNr linux-2.6.11-rc1-mm1-nokexec/drivers/base/power/shutdown.c linux-2.6.11-rc1-mm1-acpi-power-off-shuffle/drivers/base/power/shutdown.c
--- linux-2.6.11-rc1-mm1-nokexec/drivers/base/power/shutdown.c Mon Oct 18 15:54:37 2004
+++ linux-2.6.11-rc1-mm1-acpi-power-off-shuffle/drivers/base/power/shutdown.c Tue Jan 25 05:06:09 2005
@@ -62,6 +62,12 @@
 	}
 	up_write(&devices_subsys.rwsem);
 
+#if 1
+	{
+		extern void do_acpi_power_off_prepare(void);
+		do_acpi_power_off_prepare();
+	}
+#endif
 	sysdev_shutdown();
 }

2005-01-25 21:02:20

by Barry K. Nathan

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Tue, Jan 25, 2005 at 04:40:14AM -0700, Eric W. Biederman wrote:
> Cool I wonder how many more bugs work on kexec can flush out?

BTW, I got your other e-mail with the patch. I'll see if I can try the
patch sometime in the next few hours.

> > 3. "Shutdown: hda
> > Power down."
> >
> > You (or I, at least) would expect things to stop here, but we enter the
> > Twilight Zone instead:
>
> So would I.
>
> > 4. "Kernel panic - not syncing: Attempted to kill init!" (Nothing this
> > weird ever happens if I have ACPI enabled. In that case it either
> > freezes or properly shuts down.)
>
> What lines are printed just before that? It would be nice
> to know which part of the kernel panicked, if we can.
>
> At least if you can reproduce this that would be nice.

100% reproducible for me:

---begin quote---
Shutdown: hda
Power down.
Kernel panic - not syncing: Attempted to kill init!
---end quote---

> I wonder if APM is really disabled at this point. I guess since
> I have found the ACPI bug we can save the APM bug for later.

Earlier in boot it says "apm: bios not found" or something like that. I
can try again without ACPI or APM compiled into the kernel at all and
see what happens. (IIRC, on all of my systems with working APM, the patch
doesn't introduce any problems -- shutdown just works.)

> > If anyone's interested, the source code to my minimal /sbin/init is
> > here:
> > http://bugme.osdl.org/attachment.cgi?id=4398&action=view
>
> Interesting.
>
> I am wondering if you see the same symptoms if you make this
> your /init in a initcpio.gz. I am wondering if some of the
> issues might be related to something not being shutdown properly.

Yeah, I need to give that a try, but it's possible that I won't be able
to do that for a couple of days.

> > The problem occurs even with no other kexec patches applied. And
> > applying all kexec patches except the i8259 shutdown patch fails to
> > reproduce the problem.
>
> Very odd. I was wondering how many test cases had run, before I found
> the acpi bug above. If the problem was not 100% the correlation with
> this patch might have been a fluke. I just want to be certain.

For me, it's absolute 100% correlation over 20-30 test runs or so
(unfortunately it didn't occur to me to keep an exact count).

> Well if you can help me track this down I would appreciate it.
>
> Do you think you would have time to try some test patches?

Yes, I should have time to test patches.

-Barry K. Nathan <[email protected]>

2005-01-25 22:10:08

by Barry K. Nathan

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Tue, Jan 25, 2005 at 05:12:43AM -0700, Eric W. Biederman wrote:
> Could you try this patch on your system with acpi that
> is having problems.
>
> The patch needs some work before it goes into a mainline kernel
> as I have hacked the call to acpi_power_off_prepare into roughly
> the proper position in the call chain instead of using a proper
> hook. But I can't quickly find an existing hook in the proper
> location.

I had to fix a couple of typos ("apci" and "offf") to get it to compile.
Once I did that, the patch made shutdown work again.

-Barry K. Nathan <[email protected]>

2005-01-25 22:19:36

by Eric W. Biederman

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

"Barry K. Nathan" <[email protected]> writes:

> On Tue, Jan 25, 2005 at 05:12:43AM -0700, Eric W. Biederman wrote:
> > Could you try this patch on your system with acpi that
> > is having problems.
> >
> > The patch needs some work before it goes into a mainline kernel
> > as I have hacked the call to acpi_power_off_prepare into roughly
> > the proper position in the call chain instead of using a proper
> > hook. But I can't quickly find an existing hook in the proper
> > location.
>
> I had to fix a couple of typos ("apci" and "offf") to get it to compile.
> Once I did that, the patch made shutdown work again.

Thanks. Now I just need to come up with the good version unless one of
the acpi guys wants to volunteer.

Eric

2005-01-26 13:27:50

by Sytse Wielinga

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Tue, Jan 25, 2005 at 03:12:00PM -0700, Eric W. Biederman wrote:
> "Barry K. Nathan" <[email protected]> writes:
>
> > On Tue, Jan 25, 2005 at 05:12:43AM -0700, Eric W. Biederman wrote:
> > > Could you try this patch on your system with acpi that
> > > is having problems.
> > >
> > > The patch needs some work before it goes into a mainline kernel
> > > as I have hacked the call to acpi_power_off_prepare into roughly
> > > the proper position in the call chain instead of using a proper
> > > hook. But I can't quickly find an existing hook in the proper
> > > location.
> >
> > I had to fix a couple of typos ("apci" and "offf") to get it to compile.
> > Once I did that, the patch made shutdown work again.
>
> Thanks. Now I just need to come up with the good version unless one of
> the acpi guys wants to volunteer.

On my box this patch breaks shutdown instead, while it was working without it
on -rc2-mm1.

I have an Asus A7V8X motherboard with a VIA VT8377 (KT400) north bridge and a
VT8235 south bridge (according to lspci). The IO-APIC is used for interrupt
routing.

Sytse

2005-01-26 14:09:18

by Eric W. Biederman

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

Sytse Wielinga <[email protected]> writes:

> On my box this patch breaks shutdown instead, while it was working without it
> on -rc2-mm1.
>
> I have an Asus A7V8X motherboard with a VIA VT8377 (KT400) north bridge and a
> VT8235 south bridge (according to lspci). The IO-APIC is used for interrupt
> routing.

Hmm. The patch had a couple of hard coded assumptions about the
configuration (using ACPI etc), but I don't think they were significant
enough to break anything. You have a UP board and a K7 processor,
so my removal of set_cpus_allowed should not affect anything.

But you are using an SMP kernel or at least the apic support.

Are you using ACPI poweroff?

How does the kernel shutdown fail?

Eric

2005-01-26 14:44:22

by Sytse Wielinga

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Wed, Jan 26, 2005 at 07:06:50AM -0700, Eric W. Biederman wrote:
> Sytse Wielinga <[email protected]> writes:
>
> > On my box this patch breaks shutdown instead, while it was working without it
> > on -rc2-mm1.
> >
> > I have an Asus A7V8X motherboard with a VIA VT8377 (KT400) north bridge and a
> > VT8235 south bridge (according to lspci). The IO-APIC is used for interrupt
> > routing.
>
> Hmm. The patch had a couple of hard coded assumptions about the
> configuration (using ACPI etc), but I don't think it was significant
> enough to break anything. You have a UP board and a K7 processor
> so my removal of set_cpus_allowed should not affect anything.
>
> But you are using an SMP kernel or at least the apic support.
Yes, I have only one processor but I am using the IO-APIC.

> Are you using ACPI poweroff?
Yes.

> How does the kernel shutdown fail?
It halts after saying 'acpi_power_off called'. Strangely, it only breaks when
using the Alt-SysRq-O poweroff function. Shutting down normally still powers
off the system (and does print 'acpi_power_off called'). I think it must have
something to do with the IDE devices not having powered down before
acpi_power_off is called, but I haven't seen the code so I have no idea what
really causes it to break.

Sytse

2005-01-26 15:14:13

by Eric W. Biederman

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

Sytse Wielinga <[email protected]> writes:

> On Wed, Jan 26, 2005 at 07:06:50AM -0700, Eric W. Biederman wrote:
> > How does the kernel shutdown fail?
> It halts after saying 'acpi_power_off called'. Strangely, it only breaks when
> using the Alt-SysRq-O poweroff function. Shutting down normally still powers
> off the system (and does print 'acpi_power_off called'). I think it must have
> something to do with the IDE devices not having powered down before
> acpi_power_off is called, but I haven't seen the code so I have no idea what
> really causes it to break.


I am starting to hate the poor factoring of all of this stuff
in the kernel.

kernel/power/poweroff.c reinvents the wheel when it comes to
powering off a system. Instead of doing a graceful power off it
skips calling the powerdown notifier and calling device_shutdown.

Since I moved the acpi prepare for powerdown into device_shutdown
it makes sense that code path would now fail.

Do you know if there is any deliberate reason Alt-SysRq-O skips
doing the normal device shutdown work?

If not I think I will just extract a common factor from
kernel/sys.c/sys_reboot(CMD_POWER_OFF);
and have both code paths call it.
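
Roughly the shape I have in mind, as a user-space toy rather than a real
patch. All of the names below are hypothetical; the three recorded steps
just mirror the order described above (notifier, device shutdown, then
the actual power off), and the two model_* entry points stand in for the
sys_reboot power-off path and the Alt-SysRq-O path.

```c
#include <assert.h>

enum step {
	STEP_NOTIFY,		/* run the powerdown notifier chain */
	STEP_DEVICE_SHUTDOWN,	/* device_shutdown(), system devices last */
	STEP_MACHINE_POWER_OFF,	/* finally cut the power */
};

#define TRACE_MAX 8
static enum step trace[TRACE_MAX];
static int trace_len;

static void record(enum step s)
{
	assert(trace_len < TRACE_MAX);
	trace[trace_len++] = s;
}

/* The common factor: one place that knows the full power-off sequence. */
static void common_power_off(void)
{
	record(STEP_NOTIFY);
	record(STEP_DEVICE_SHUTDOWN);
	record(STEP_MACHINE_POWER_OFF);
}

/* Stand-in for the sys_reboot() power-off case. */
static void model_sys_reboot_power_off(void)
{
	trace_len = 0;
	common_power_off();
}

/* Stand-in for the Alt-SysRq-O handler: no private shortcut any more. */
static void model_sysrq_power_off(void)
{
	trace_len = 0;
	common_power_off();
}
```

With both entry points funneled through one helper, anything hooked into
the device shutdown step, such as the acpi prepare call, runs on both
paths.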


Eric

2005-01-27 00:21:25

by Barry K. Nathan

Subject: Re: [PATCH 4/29] x86-i8259-shutdown

On Wed, Jan 26, 2005 at 08:12:05AM -0700, Eric W. Biederman wrote:
> Do you know if there is any deliberate reason Alt-SysRq-O skips
> doing a normal device shutdown work?

I would guess that it's intended for use when things are so messed up
that all you want to do is cut power ASAP. But, that's just a guess.

-Barry K. Nathan <[email protected]>

2005-01-27 02:01:42

by Andrew Morton

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

[email protected] (Eric W. Biederman) wrote:
>
> There is evil intermingling and false dependency sharing between
> the dying kernel and the crash capture kernel in this patch,

yikes! I'll drop it from -mm while we have a rethink.

2005-01-27 03:08:29

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.


Right now I am very frustrated with reviewing any of the crashdump
patches. When I make comments usually things change just enough that
what I said is addressed but things are addressed very much at
a surface level. Which means that if I think any kind of substantial
change is needed the only way I seem to be able to communicate
that is by actually implementing it myself.

Code that works today is great; it does manage the job of requirements
capture. But just throwing code together when you are dealing
with fundamental interface boundaries is not a good way to build
a sustainable design. And with the crashdump code I want an
interface that is at least as simple and as stable as the syscall
interface.

At the very least if a patch is just a snapshot of your development
process up for comment and you are going to continue on making
headway please say as much. If I know the code is quite possibly
going to change in some pretty fundamental ways I can stop worrying
about it. This patch is certainly nothing I would want for more
than a couple-of-days hack in my personal development tree.

I will try once again...

There is evil intermingling and false dependency sharing between
the dying kernel and the crash capture kernel in this patch, and
virtually all of the code is unnecessary. I have already addressed
why.

Vivek Goyal <[email protected]> writes:

> On Fri, 2005-01-21 at 16:43, Eric W. Biederman wrote:
> > On deeper review your patch as it stands is incomplete. In particular
> > you don't provide a way to either hardcode or dynamically set
> > the area you are attempt to reserve to hold the backup region.
>
> Well. Here is the new patch. This one steals the 640k from top of memory
> region reserved for crash kernel.
>
> A new command line parameter (crashbackup=) has been introduced for
> crash dump kernels. This parameter specifies the location of backup
> region from where to retrieve the backup data.

What is wrong with user space doing all of the extra space
reservation?

Could you send this fairly obvious kexec fix, as a separate patch?

> diff -puN include/linux/kexec.h~crashdump-x86-reserve-640k-memory
> include/linux/kexec.h
>
> --- linux-2.6.11-rc1/include/linux/kexec.h~crashdump-x86-reserve-640k-memory
> 2005-01-22 14:16:27.000000000 +0530
>
> +++ linux-2.6.11-rc1-root/include/linux/kexec.h 2005-01-22 14:16:27.000000000
> +0530
>
> @@ -79,7 +79,7 @@ struct kimage {
> unsigned long control_page;
>
> /* Flags to indicate special processing */
> - int type : 1;
> + unsigned int type : 1;
> #define KEXEC_TYPE_DEFAULT 0
> #define KEXEC_TYPE_CRASH 1
> };
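
For anyone wondering why the one-line change above matters, a small
stand-alone illustration follows. The two struct names and the helper
functions are made up for the example. Whether a plain int bit-field is
signed is implementation-defined; GCC treats it as signed by default, so
the sketch spells out signed int to show that case. A signed 1-bit field
has room for 0 and -1 only, and with GCC storing 1 into it reads back as
-1, so a comparison against KEXEC_TYPE_CRASH never matches.

```c
#define KEXEC_TYPE_DEFAULT 0
#define KEXEC_TYPE_CRASH   1

/* What the old declaration amounts to when plain int is signed. */
struct flags_signed {
	signed int type : 1;
};

/* The fixed declaration: the field can hold 0 and 1. */
struct flags_unsigned {
	unsigned int type : 1;
};

/* Store value in the 1-bit field, then test it the way kernel code
 * would test image->type. Returns nonzero if it reads as CRASH. */
static int signed_reads_as_crash(int value)
{
	struct flags_signed f;

	f.type = value;
	return f.type == KEXEC_TYPE_CRASH;
}

static int unsigned_reads_as_crash(int value)
{
	struct flags_unsigned f;

	f.type = value;
	return f.type == KEXEC_TYPE_CRASH;
}
```

With GCC, signed_reads_as_crash(KEXEC_TYPE_CRASH) comes back 0 while
unsigned_reads_as_crash(KEXEC_TYPE_CRASH) comes back 1, which is the
whole bug in miniature.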

Eric

2005-01-27 12:55:50

by Vivek Goyal

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi Eric,

It looks like we are looking at things a little differently. I
see a portion of the picture in your mind, but obviously not
entirely.

Perhaps, we need to step back and iron out in specific terms what
the interface between the two kernels should be in the crash dump
case, and the distribution of responsibility between kernel, user space
and the user.

[BTW, the patch was intended as a step in development up for
comment early enough to be able to get agreement on the interface
and think issues through to more completeness before going
too far. Sorry, if that wasn't apparent.]

When you say "evil intermingling", I'm guessing you mean the
"crashbackup=" boot parameter ? If so, then yes, I agree it'd
be nice to find a way around it that doesn't push hardcoding
elsewhere.

Let me explain the interface/approach I was looking at.

1. First kernel reserves some area of memory for the crash/capture kernel
as specified by the crashkernel=X@Y boot time parameter.

2. First kernel marks the top 640K of this area as the backup area. (If
the architecture needs it.) This is sort of a hardcoding, and probably
this space reservation can be managed from user space as well, as
mentioned by you in this mail below.

3. Location of backup region is exported through /proc/iomem which can
be read by user space utility to pass this information to purgatory code
to determine where to copy the first 640K.

Note that we do not make any additional reservation for the
backup region. We carve this out from the top of the already
reserved region and export it through /proc/iomem so that
the user space code and the capture kernel code need not
make any assumptions about where this region is located.

4. Once the capture kernel boots, it needs to know the location of
backup region for two purposes.

a. It should not overwrite the backup region.

b. There needs to be a way for the capture tool to access the original
contents of the backed up region

Boot time parameter crashbackup=A@B has been provided to pass this
information to capture kernel. This parameter is valid only for capture
kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.
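
The arithmetic behind steps 1, 2 and 4 can be sketched in user-space C as below. This is only an illustration under stated assumptions: `parse_size_at_base()` and `carve_backup()` are hypothetical helper names, the K/M/G suffix handling is an assumption about how such values are usually written, and none of this is the code from the patch.

```c
#include <stdint.h>
#include <stdlib.h>

#define BACKUP_SIZE (640ULL * 1024)  /* the 640K backup area discussed above */

struct region { uint64_t start, size; };

/* Parse a number with an optional K/M/G suffix and advance *s past it. */
static uint64_t parse_scaled(const char **s)
{
    char *end;
    uint64_t v = strtoull(*s, &end, 0);

    switch (*end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; end++; break;
    default: break;
    }
    *s = end;
    return v;
}

/* Parse "size@base", the value part of a crashkernel=X@Y or
 * crashbackup=A@B style option. Returns 0 on success, -1 if the
 * '@' separator is missing. */
static int parse_size_at_base(const char *val, struct region *out)
{
    out->size = parse_scaled(&val);
    if (*val != '@')
        return -1;
    val++;
    out->start = parse_scaled(&val);
    return 0;
}

/* Carve the backup area out of the top of the reserved region.
 * Returns -1 if the reservation is too small to hold it. */
static int carve_backup(const struct region *reserved, struct region *backup)
{
    if (reserved->size < BACKUP_SIZE)
        return -1;
    backup->size = BACKUP_SIZE;
    backup->start = reserved->start + reserved->size - BACKUP_SIZE;
    return 0;
}
```

For a reservation written as 64M@16M, the backup area would start at 16M + 64M - 640K, and that start address and size are what a crashbackup=A@B value would need to carry to the capture kernel.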


> What is wrong with user space doing all of the extra space
> reservation?

Just for clarity, are you suggesting that kexec-tools create an additional
segment for the backup region and pass the information to the kernel?

There is no problem in doing reservation from user space except
one. How do the user and, in turn, the capture kernel come to know the
location of the backup region, assuming that the user is going to provide
the exactmap for the capture kernel to boot into?

Just a thought: is it a good idea for kexec-tools to create and
pass the memmap parameters, doing the appropriate adjustment for the backup
region?

I had another question. How is the starting location of elf headers
communicated to capture tool? Is parameter segment a good idea? or
some hardcoding?

Another approach can be that backup area information is encoded in elf
headers and capture kernel is booted with modified memmap (User gets
backup region information from /proc/iomem) and capture tool can
extract backup area information from elf headers as stored by first
kernel.

Could you please elaborate a little more on what aspect of your view
differs from the above?

Thanks
Vivek

2005-01-27 20:54:22

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.


For the guys on ppc, and other architectures that have all of their
cpu memory behind an iommu, I propose we create a /proc/cpumem
which is the subset of /proc/iomem that deals with RAM. In any event,
as something like that is straightforward to implement, I will
assume the existence of the functionality and we can attack the
details when we merge the first of those architectures
into the kernel.

Vivek Goyal <[email protected]> writes:

> Hi Eric,
>
> It looks like we are looking at things a little differently. I
> see a portion of the picture in your mind, but obviously not
> entirely.
>
> Perhaps, we need to step back and iron out in specific terms what
> the interface between the two kernels should be in the crash dump
> case, and the distribution of responsibility between kernel, user space
> and the user.
>
> [BTW, the patch was intended as a step in development up for
> comment early enough to be able to get agreement on the interface
> and think issues through to more completeness before going
> too far. Sorry, if that wasn't apparent.]

It wasn't quite, and the fact that Andrew picked it up added
to the confusion.

> When you say "evil intermingling", I'm guessing you mean the
> "crashbackup=" boot parameter ? If so, then yes, I agree it'd
> be nice to find a way around it that doesn't push hardcoding
> elsewhere.

I believe there are some alternatives to crashbackup= in the
crashdump capture kernel. But as long as that code is running
in the kernel we can't do a lot better.

However the primary kernel has no need to know that we
even have a backup region, nor does it need to know the
size of the backup region. That can all be handled with the single
reservation we have now.

/sbin/kexec which makes the backup needs to know about it and it needs
to pass that information on. But the primary kernel does not.

The largest reason I am sensitive to this issue is that if you are not
booting an SMP kernel I don't believe we need a backup region on x86
at all. If we can remove that dependency I want the freedom to do
that without having to modify the primary kernel. Or if we discover
we need to preserve other things like the ACPI, mp and pirq tables
I don't want to require patching the kernel just so I can copy those
and preserve them.

> Let me explain the interface/approach I was looking at.
>
> 1.First kernel reserves some area of memory for crash/capture kernel as
> specified by crashkernel=X@Y boot time parameter.
>
> 2.First kernel marks the top 640K of this area as backup area. (If
> architecture needs it.) This is sort of a hardcoding and probably this
> space reservation can be managed from user space as well as mentioned by
> you in this mail below.
>
> 3. Location of backup region is exported through /proc/iomem which can
> be read by user space utility to pass this information to purgatory code
> to determine where to copy the first 640K.

And 1-3 can be done in /sbin/kexec. And if it is done there we can
increase our freedom of implementation in the crashdump capture process
quite a bit.

> Note that we do not make any additional reservation for the
> backup region. We carve this out from the top of the already
> reserved region and export it through /proc/iomem so that
> the user space code and the capture kernel code need not
> make any assumptions about where this region is located.
>
> 4. Once the capture kernel boots, it needs to know the location of
> backup region for two purposes.
>
> a. It should not overwrite the backup region.
>
> b. There needs to be a way for the capture tool to access the original
> contents of the backed up region
>
> Boot time parameter crashbackup=A@B has been provided to pass this
> information to capture kernel. This parameter is valid only for capture
> kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.

But that is not what you implemented. crashbackup= was an alternative
to the carving out of 640K in parts 1-3.

> > What is wrong with user space doing all of the extra space
> > reservation?
>
> Just for clarity, are you suggesting kexec-tools creating an additional
> segment for the backup region and pass the information to kernel.

Yes, having kexec create a bss segment for the backup region would
be a good idea. It will keep us from stomping on the kernel trampoline
(think the identity mapped x86_64 page tables here) by accident.

> There is no problem in doing reservation from user space except
> one. How does the user and in-turn capture kernel come to know the
> location of backup region, assuming that the user is going to provide
> the exactmap for capture kernel to boot into.
>
> Just a thought, is it a good idea for kexec-tools to be creating and
> passing memmap parameters doing appropriate adjustment for backup
> region.

Exactly. Having /sbin/kexec do this instead of the user doing this
manually is a much simpler solution than we have now.

> I had another question. How is the starting location of elf headers
> communicated to capture tool? Is parameter segment a good idea? or
> some hardcoding?

I recognize the need for that information. But I do not recognize
the need for it to be an ELF header (we do need something
conceptually close). If we don't have regions of the memory map
appearing and disappearing dynamically we can get this information
from /proc/iomem, before the crash and store it in one of the data
segments that we checksum.
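
As an illustration of collecting that information from /proc/iomem at load time, here is a minimal sketch that picks out the "System RAM" ranges. The function and struct names are hypothetical, and the line format ("start-end : name", hexadecimal, possibly indented) is the one /proc/iomem uses on x86; this is not code from kexec-tools.

```c
#include <stdio.h>
#include <string.h>

struct mem_range { unsigned long long start, end; };

/* Parse one /proc/iomem line such as "00100000-1fffffff : System RAM".
 * Returns 1 and fills *out if the line describes System RAM, else 0. */
static int parse_iomem_ram_line(const char *line, struct mem_range *out)
{
    unsigned long long start, end;
    char name[64];

    if (sscanf(line, " %llx-%llx : %63[^\n]", &start, &end, name) != 3)
        return 0;
    if (strcmp(name, "System RAM") != 0)
        return 0;
    out->start = start;
    out->end = end;
    return 1;
}

/* Collect up to `max` RAM ranges from a file in /proc/iomem format.
 * Returns the number of ranges found, or -1 if the file cannot be opened. */
static int read_ram_ranges(const char *path, struct mem_range *out, int max)
{
    char line[256];
    int n = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (n < max && fgets(line, sizeof line, f))
        n += parse_iomem_ram_line(line, &out[n]);
    fclose(f);
    return n;
}
```

The resulting table is the kind of data that could be stored in a checksummed data segment before the crash, so that nothing has to be asked of the crashed kernel afterwards.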

> Another approach can be that backup area information is encoded in elf
> headers and capture kernel is booted with modified memmap (User gets
> backup region information from /proc/iomem) and capture tool can
> extract backup area information from elf headers as stored by first
> kernel.
>
> Could you please elaborate a little more on what aspect of your view
> differs from the above.

See above.

The direction I would take if I were implementing
the crashdump functionality is something different.

Instead of patching crashdump functionality into the kernel,
I would create a subdirectory in kexec-tools called crashdump
and put in the source for a user-space program that could run as
init. In addition I would put in the code to generate an
initramfs cpio.gz archive of that program. And I would build
the program against uclibc, klibc or one of the other libc variants
that actually allows building static binaries. Unless something
has changed recently glibc does not allow for truly static binaries,
as it dynamically opens /lib/libnss*.

Given the pain of building against an external library that is not
widely distributed I would probably take a snapshot of the code and
place it in crashdump/libc in the kexec-tools source. Taking a
snapshot of frequently used libraries is commonly done with the
gnu toolchain and is wonderfully effective in resolving painful
dependencies.

The crashdump /init would just mmap /dev/mem to read the raw memory.
From there it would generate the core file.
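
A rough sketch of what "mmap /dev/mem to read the raw memory" could look like in such an /init program follows. The helper name `read_phys_range()` is hypothetical; the path is a parameter so the same routine can be exercised on an ordinary file, and whether /dev/mem permits the mapping depends on the kernel configuration and privileges.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy `len` bytes starting at byte offset `phys` of `path`
 * (for example "/dev/mem") into `out`.
 * Returns 0 on success, -1 on any error. */
static int read_phys_range(const char *path, off_t phys, size_t len, void *out)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t base = phys - (phys % page);   /* mmap offsets must be page aligned */
    size_t skew = (size_t)(phys - base); /* distance from aligned base to phys */
    int fd = open(path, O_RDONLY);
    void *map;

    if (fd < 0)
        return -1;
    map = mmap(NULL, len + skew, PROT_READ, MAP_SHARED, fd, base);
    close(fd);                           /* the mapping survives the close */
    if (map == MAP_FAILED)
        return -1;
    memcpy(out, (const char *)map + skew, len);
    munmap(map, len + skew);
    return 0;
}
```

A core file generator would call this once per memory range it has been told about and write each chunk out behind the corresponding program header.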

When kexec'ing a panic kernel I would simply have /sbin/kexec
unconditionally load that cpio.gz as the initrd and things
would work.

The large advantage of doing all of this in user space
is that it moves all of the crashdump policy into user space
and into one source tree, for simplified maintenance.

However as long as we gracefully handle the interface
between the primary kernel and the capture kernel we can
switch mechanisms for actually taking the crash dump,
kernel based or user space as seems most sane.

Eric

2005-01-28 12:14:55

by Vivek Goyal

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi Eric,


> However for the primary kernel it has no need to know that we
> even have a backup region, nor does it need to know about the
> size of the backup region. That can all be handled with the single
> reservation, we have now.
>
> /sbin/kexec which makes the backup needs to know about it and it needs
> to pass that information on. But the primary kernel does not.


Agreed. The primary kernel need not be aware of the backup region, and
reservation of this region can be better managed from user space.


> > Boot time parameter crashbackup=A@B has been provided to pass this
> > information to capture kernel. This parameter is valid only for capture
> > kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.
>
> But that is not what you implemented. crashbackup= was an alternative
> to the carving out of 640K in parts 1-3.


Not really. crashbackup= is not being used for carving out backup
region. It is just used for passing the address of this region to second
kernel. That's why it has been put under CONFIG_CRASH_DUMP.


> > > What is wrong with user space doing all of the extra space
> > > reservation?
> >
> > Just for clarity, are you suggesting kexec-tools creating an additional
> > segment for the backup region and pass the information to kernel.
>
> Yes, having kexec create a bss segment for the backup region would
> be a good idea. It will keep us from stomping on the kernel trampoline
> (think the identity mapped x86_64 page tables here) by accident.
>
> > There is no problem in doing reservation from user space except
> > one. How does the user and in-turn capture kernel come to know the
> > location of backup region, assuming that the user is going to provide
> > the exactmap for capture kernel to boot into.
> >
> > Just a thought, is it a good idea for kexec-tools to be creating and
> > passing memmap parameters doing appropriate adjustment for backup
> > region.
>
> Exactly. Having /sbin/kexec do this instead of the user doing this
> manually is a much simpler solution than we have now.
>
> > I had another question. How is the starting location of elf headers
> > communicated to capture tool? Is parameter segment a good idea? or
> > some hardcoding?
>
> I recognize the need for that information. But I do not recognize
> the need for it to be an ELF header (we do need something
> conceptually close). If we don't have regions of the memory map
> appearing and disappearing dynamically we can get this information
> from /proc/iomem, before the crash and store it in one of the data
> segments that we checksum.
>


This looks good. So memory regions are parsed from /proc/iomem and this
information is put in one data segment and stored in reserved region
during panic kernel load time.

But I am unable to work out how the capture tool (even if it's all
in user space) gets to know the address of this segment (or for that
matter, the bss segment created for backup region). Am I missing
something obvious?


> The direction I would take if I was to take to implementing this
> the crashdump functionality is something different.
>
> Instead of patching crashdump functionality into the kernel,
> I would create a subdirectory in kexec-tools called crashdump
> and put in the source for a user-space program that could run as
> init. In addition I would but in the code to generate and
> initramfs cpio.gz archive of that program. And I would build
> the program against uclibc, klibc or one of the other libc variants
> that actually allows building static binaries. Unless something
> has changed recently glibc does not all for truly static binaries
> as it dynamically open /lib/libnss*
>
> Given the pain of building against an external library that is not
> widely distributed I would probably take a snapshot of the code and
> place it in crashdump/libc in the kexec-tools source. Taking a
> snapshot of frequently used libraries is commonly done with the
> gnu toolchain and is wonderfully effective in resolving painful
> dependencies.
>
> The crashdump /init would just mmap /dev/mem to read the raw memory.
> From there it would generate the core file.
>
> When kexec'ing a panic kernel I would simply have /sbin/kexec
> unconditionally load that cpio.gz as the initrd and things
> would work.
>
> The large advantage of doing all of this in user space
> is that it moves all of the crashdump policy into user space
> and into one source tree, for simplified maintenance.
>
> However as long as we gracefully handle the interface
> between the primary kernel and the capture kernel we can
> switch mechanisms for actually taking the crash dump,
> kernel based or user space as seems most sane.


This seems to be a good direction.


Thanks
Vivek


2005-01-28 20:35:12

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Vivek Goyal <[email protected]> writes:

> Hi Eric,
>
>
> > However for the primary kernel it has no need to know that we
> > even have a backup region, nor does it need to know about the
> > size of the backup region. That can all be handled with the single
> > reservation, we have now.
> >
> > /sbin/kexec which makes the backup needs to know about it and it needs
> > to pass that information on. But the primary kernel does not.
>
>
> Agreed. Primary kernel need not to be aware of backup region and
> reservation of this region can be better managed from user space.

Good. It sounds like we are pretty much back on the same page then.

> > > Boot time parameter crashbackup=A@B has been provided to pass this
> > > information to capture kernel. This parameter is valid only for capture
> > > kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.
> >
> > But that is not what you implemented. crashbackup= was an alternative
> > to the carving out of 640K in parts 1-3.
>
>
> Not really. crashbackup= is not being used for carving out backup
> region. It is just used for passing the address of this region to second
> kernel. That's why it has been put under CONFIG_CRASH_DUMP.

Ok I missed a piece in your patch. You have crashdumpk_res, and
crashbackup_start, crashbackup_end. And I missed the fact that
they were different variables as they dealt with the same concept.

So that patch actually should have been three patches. The
one line bug fix. The crashdumpk_res bit (which I strongly
object to) and the crashbackup_start/_end bit. The fact
that all three were in the same patch is a reviewing and maintenance
pain.

Please in the future do not include code that runs in the primary
kernel and crashdump specific code that runs in the capture kernel in
the same patch.

> This looks good. So memory regions are parsed from /proc/iomem and this
> information is put in one data segment and stored in reserved region
> during panic kernel load time.
>
> But I am unable to co-relate as to how the capture tool (even if its all
> in user space) gets to know the address of this segment (or for that
> matter, the bss segment created for backup region). Am I missing
> something obvious.

There are a lot of choices at that point.
- Place the data on the kernel command line, and pick
  it up from /proc/cmdline.
- Place the data in a file on the initramfs.
- Place the data in a user space data segment.
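
To make the first of those choices concrete, here is a small sketch of pulling a numeric "key=value" item out of a command line string such as the contents of /proc/cmdline. The helper name `cmdline_get_u64()` is hypothetical and the parameter name is left to the caller, since no name had been settled at this point in the thread.

```c
#include <stdlib.h>
#include <string.h>

/* Look up "key=value" in a kernel command line string and parse the
 * value as a number (hex with a 0x prefix, or decimal).
 * Returns 0 and stores the value on success, -1 if the key is absent
 * or has no numeric value. */
static int cmdline_get_u64(const char *cmdline, const char *key,
                           unsigned long long *value)
{
    size_t klen = strlen(key);
    const char *p = cmdline;

    if (klen == 0)
        return -1;
    while ((p = strstr(p, key)) != NULL) {
        /* Accept the match only at the start of a word, followed by '='. */
        if ((p == cmdline || p[-1] == ' ') && p[klen] == '=') {
            char *end;
            unsigned long long v = strtoull(p + klen + 1, &end, 0);

            if (end != p + klen + 1) {
                *value = v;
                return 0;
            }
        }
        p += klen;
    }
    return -1;
}
```

A user-space capture tool would read /proc/cmdline into a buffer and call this once; the address it gets back is where the load-time data segment sits in physical memory.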


> > However as long as we gracefully handle the interface
> > between the primary kernel and the capture kernel we can
> > switch mechanisms for actually taking the crash dump,
> > kernel based or user space as seems most sane.
>
>
> This seems to be a good direction.

Cool.

One of the ideas worth exploring is to see about stabilizing the
other side of this interface as well. That is we should explore
providing a fixed interface coming out of purgatory.ro to the new
kernel and its user space (i.e. the ELF header like thing). I think
we are quite close to that point already. And this goes back to your
question of how do we let the capture kernel/user space know where to
look.

Eric

2005-02-01 08:07:04

by Koichi Suzuki

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

A hook in the panic code is a very good idea and is useful in various
scenarios. It could be used to kick off RAM dump code, obviously, and
also to kick off code to initiate failover, etc. Various uses are
possible, so I believe that this hook should be prepared for wider use.

--
Koichi Suzuki
NTT DATA Intellilink Corp.

[email protected] wrote:
> For the guys on ppc, and other architectures that have all of their
> cpu memory behind an iommu. I propose we create a /proc/cpumem
> which is the subset of /proc/iomem that deals with RAM. In any event
> as something like that is straight forward to implement I will
> assume the existence of the functionality and we can attack the
> details when we do the merge the first of those architectures
> into the kernel.
>
> Vivek Goyal <[email protected]> writes:
>
>
>>Hi Eric,
>>
>>It looks like we are looking at things a little differently. I
>>see a portion of the picture in your mind, but obviously not
>>entirely.
>>
>>Perhaps, we need to step back and iron out in specific terms what
>>the interface between the two kernels should be in the crash dump
>>case, and the distribution of responsibility between kernel, user space
>>and the user.
>>
>>[BTW, the patch was intended as a step in development up for
>>comment early enough to be able to get agreement on the interface
>>and think issues through to more completeness before going
>>too far. Sorry, if that wasn't apparent.]
>
>
> It wasn't quite, and the fact that Andrew picked it up added
> to the confusion.
>
>
>>When you say "evil intermingling", I'm guessing you mean the
>>"crashbackup=" boot parameter ? If so, then yes, I agree it'd
>>be nice to find a way around it that doesn't push hardcoding
>>elsewhere.
>
>
> I believe there are some alternatives to crashbackup= in the
> crashdump capture kernel. But as long as that code is running
> in the kernel we can't do a lot better.
>
> However for the primary kernel it has no need to know that we
> even have a backup region, nor does it need to know about the
> size of the backup region. That can all be handled with the single
> reservation, we have now.
>
> /sbin/kexec which makes the backup needs to know about it and it needs
> to pass that information on. But the primary kernel does not.
>
> The largest reason I am sensitive to this issue is that if you are not
> booting an SMP kernel I don't believe we need a backup region on x86
> at all. If we can remove that dependency I want the freedom to do
> that without having to modify the primary kernel. Or if we discover
> we need to preserve other things like the ACPI, mp and pirq tables
> I don't want to require patching the kernel just so I can copy those
> and preserve them.
>
>
>>Let me explain the interface/approach I was looking at.
>>
>>1.First kernel reserves some area of memory for crash/capture kernel as
>>specified by crashkernel=X@Y boot time parameter.
>>
>>2.First kernel marks the top 640K of this area as backup area. (If
>>architecture needs it.) This is sort of a hardcoding and probably this
>>space reservation can be managed from user space as well as mentioned by
>>you in this mail below.
>>
>>3. Location of backup region is exported through /proc/iomem which can
>>be read by user space utility to pass this information to purgatory code
>>to determine where to copy the first 640K.
>
>
> And 1-3 can be done in /sbin/kexec. And if it is done there we can
> increase our freedom of implementation in the crashdump capture process
> quite a bit.
>
>
>>Note that we do not make any additional reservation for the
>>backup region. We carve this out from the top of the already
>>reserved region and export it through /proc/iomem so that
>>the user space code and the capture kernel code need not
>>make any assumptions about where this region is located.
>>
>>4. Once the capture kernel boots, it needs to know the location of
>>backup region for two purposes.
>>
>>a. It should not overwrite the backup region.
>>
>>b. There needs to be a way for the capture tool to access the original
>> contents of the backed up region
>>
>>Boot time parameter crashbackup=A@B has been provided to pass this
>>information to capture kernel. This parameter is valid only for capture
>>kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.
>
>
> But that is not what you implemented. crashbackup= was an alternative
> to the carving out of 640K in parts 1-3.
>
>
>>>What is wrong with user space doing all of the extra space
>>>reservation?
>>
>>Just for clarity, are you suggesting kexec-tools creating an additional
>>segment for the backup region and pass the information to kernel.
>
>
> Yes, having kexec create a bss segment for the backup region would
> be a good idea. It will keep us from stomping on the kernel trampoline
> (think the identity mapped x86_64 page tables here) by accident.
>
>
>>There is no problem in doing reservation from user space except
>>one. How does the user and in-turn capture kernel come to know the
>>location of backup region, assuming that the user is going to provide
>>the exactmap for capture kernel to boot into.
>>
>>Just a thought, is it a good idea for kexec-tools to be creating and
>>passing memmap parameters doing appropriate adjustment for backup
>>region.
>
>
> Exactly. Having /sbin/kexec do this instead of the user doing this
> manually is a much simpler solution than we have now.
>
>
>>I had another question. How is the starting location of elf headers
>>communicated to capture tool? Is parameter segment a good idea? or
>>some hardcoding?
>
>
> I recognize the need for that information. But I do not recognize
> the need for it to be an ELF header (we do need something
> conceptually close). If we don't have regions of the memory map
> appearing and disappearing dynamically we can get this information
> from /proc/iomem, before the crash and store it in one of the data
> segments that we checksum.
>
>
>>Another approach can be that backup area information is encoded in elf
>>headers and capture kernel is booted with modified memmap (User gets
>>backup region information from /proc/iomem) and capture tool can
>>extract backup area information from elf headers as stored by first
>>kernel.
>>
>>Could you please elaborate a little more on what aspect of your view
>>differs from the above.
>
>
> See above.
>
> The direction I would take if I was to take to implementing this
> the crashdump functionality is something different.
>
> Instead of patching crashdump functionality into the kernel,
> I would create a subdirectory in kexec-tools called crashdump
> and put in the source for a user-space program that could run as
> init. In addition I would but in the code to generate and
> initramfs cpio.gz archive of that program. And I would build
> the program against uclibc, klibc or one of the other libc variants
> that actually allows building static binaries. Unless something
> has changed recently glibc does not all for truly static binaries
> as it dynamically open /lib/libnss*
>
> Given the pain of building against an external library that is not
> widely distributed I would probably take a snapshot of the code and
> place it in crashdump/libc in the kexec-tools source. Taking a
> snapshot of frequently used libraries is commonly done with the
> gnu toolchain and is wonderfully effective in resolving painful
> dependencies.
>
> The crashdump /init would just mmap /dev/mem to read the raw memory.
> From there it would generate the core file.
>
> When kexec'ing a panic kernel I would simply have /sbin/kexec
> unconditionally load that cpio.gz as the initrd and things
> would work.
>
> The large advantage of doing all of this in user space
> is that it moves all of the crashdump policy into user space
> and into one source tree, for simplified maintenance.
>
> However as long as we gracefully handle the interface
> between the primary kernel and the capture kernel we can
> switch mechanisms for actually taking the crash dump,
> kernel based or user space as seems most sane.
>
> Eric

2005-02-01 09:09:40

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Koichi Suzuki <[email protected]> writes:

> Hook in panic code is very good idea and is useful in various scenes. It could
> be used to kick RAM dump code, obviously, and also kick the code to initiate
> failover, etc. Various use could be possible so I believe that this hook
> should be prepared for wider use.

It is. Basically it is the normal kexec interface that allows you to
boot another kernel, with a few restrictions that should keep it as
reliable as possible when the kernel has not shut itself down cleanly.

The hardest case is to do a useful system core dump, as that requires
looking at what has gone before. For the rest, if you can do it
with a kernel and an initramfs you are in good shape.

There seems to be a significant amount of interest in the full
system core dump case so that is what the work is concentrating
on.

Eric

2005-02-01 14:26:02

by Vivek Goyal

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.


Well, trying to put the already discussed ideas together, I was
planning to work on the following design. Please comment.

Crashed Kernel <-->Capture Kernel(or User Space) Interface:
----------------------------------------------------------

The whole idea is that the crash image is represented in ELF core format.
These ELF headers are prepared by kexec-tools in user space and put in one
segment. The address of the start of the image is passed to the capture
kernel (or user space) using one command line parameter (e.g. crashimage=).
Now either kernel space or user space can parse the ELF headers, extract
the required information and export the final kernel ELF core image.


> [email protected] wrote:
> If we were using an ELF header I would include one PT_NOTE program
> header per cpu (Giving each cpu its own area to mess around in).
> And I would use one PT_LOAD segment per possible memory zone.
> So in the worst case (current sgi altix) (MAX_NUMNODES=256,
> MAX_NR_ZONES=3, MAX_NR_CPUS=1024) 256*3+1024 = 1792 program
> headers. At 56 bytes per 64bit program header that is 100352 bytes
> or 98KiB. A little bit expensive. A tuned data structure with
> 64bit base and size would only consume 1792*16 = 28672 or 28KiB.

If I prepare one elf header for each physically contiguous memory area (as
obtained from /proc/iomem) instead of per zone, then the number of elf
headers will come down significantly. I don't have any idea of the number of
actual physically contiguous regions present per machine, but roughly
assuming it to be 1 per node, it will lead to 256 + 1024 = 1280 program
headers. At 56 bytes per 64-bit program header this will amount to 70KB.
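
The size estimates in this exchange can be checked with a few lines of C. The struct below mirrors the field layout of a 64-bit ELF program header (Elf64_Phdr) so that the 56-byte figure falls out of `sizeof`; the names `phdr64`, `phdr_count()` and `phdr_table_bytes()` are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as a 64-bit ELF program header (Elf64_Phdr). */
struct phdr64 {
    uint32_t p_type;
    uint32_t p_flags;
    uint64_t p_offset;
    uint64_t p_vaddr;
    uint64_t p_paddr;
    uint64_t p_filesz;
    uint64_t p_memsz;
    uint64_t p_align;
};

/* One PT_LOAD header per memory range plus one PT_NOTE header per cpu. */
static size_t phdr_count(size_t mem_ranges, size_t cpus)
{
    return mem_ranges + cpus;
}

/* Bytes needed for a table of `count` 64-bit program headers. */
static size_t phdr_table_bytes(size_t count)
{
    return count * sizeof(struct phdr64);
}
```

With one range per node, `phdr_table_bytes(phdr_count(256, 1024))` is 1280 * 56 = 71680 bytes, i.e. 70 KiB. With one range per zone it is `phdr_table_bytes(phdr_count(256 * 3, 1024))` = 1792 * 56 = 100352 bytes, i.e. 98 KiB, matching the figures quoted above.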

This is a worst case estimate, and on lower end machines this will require
much less space. On machines as big as 1024 cpus, this should not be a
concern, as big machines come with big RAMs.

Eric, do you still think that ELF headers are inappropriate to be passed
across the interface boundary?

ELF headers can be prepared by kexec-tools in advance and put into one
of the data segments. This requires following information to be
available to user space.

- Starting address of space reserved by kernel for notes section
(crash_notes[]). Probably can be obtained from /proc/kallsyms?

- NR_CPUS. Maybe sysconf(_SC_NPROCESSORS_CONF) should be sufficient.

- Size of memory reserved per cpu. No clue how to get that. Any
suggestions?
Maybe hard-coding something like a 1K area per cpu should be enough to
address future needs?
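
As a sketch of the second and third items, the following sizes a per-cpu notes area from the configured processor count. The 1K per cpu figure is only the guess floated above, the function names are hypothetical, and the count returned by sysconf() is what the running system reports, which is not necessarily the kernel's compile-time NR_CPUS.

```c
#include <stddef.h>
#include <unistd.h>

#define NOTE_BYTES_PER_CPU 1024  /* assumption: the 1K per cpu floated above */

/* Number of configured processors, or -1 if it cannot be determined. */
static long configured_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}

/* Total bytes to describe for the per-cpu notes area; 0 for a bad count. */
static size_t notes_area_bytes(long cpus)
{
    return cpus > 0 ? (size_t)cpus * NOTE_BYTES_PER_CPU : 0;
}
```

A loader would call `notes_area_bytes(configured_cpus())` and emit one PT_NOTE program header per cpu, each covering one NOTE_BYTES_PER_CPU slice.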


Regarding Backup Region
-----------------------

- Kexec user space does the reservation for backup region segment.
- Purgatory copies the backup data to backup region. (Already
implemented)
- A separate elf header is prepared to represent backed up memory
region. And "offset" field of this program header can contain the actual
physical address where backup contents are stored.
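
The third point can be sketched as below: a PT_LOAD style program header whose physical address names where the data originally lived and whose "offset" names where the backup copy now sits. The struct mirrors the Elf64_Phdr layout; `fill_backup_phdr()` is a hypothetical helper, and leaving p_vaddr, p_flags and p_align at zero is an assumption of this sketch, not something decided in the thread.

```c
#include <stdint.h>
#include <string.h>

#define PT_LOAD_TYPE 1u  /* numeric value of PT_LOAD in the ELF specification */

/* Same field layout as a 64-bit ELF program header (Elf64_Phdr). */
struct phdr64 {
    uint32_t p_type;
    uint32_t p_flags;
    uint64_t p_offset;
    uint64_t p_vaddr;
    uint64_t p_paddr;
    uint64_t p_filesz;
    uint64_t p_memsz;
    uint64_t p_align;
};

/* Describe memory that originally lived at `orig_paddr` but whose
 * contents were copied to `backup_paddr` before the capture kernel ran. */
static void fill_backup_phdr(struct phdr64 *ph, uint64_t orig_paddr,
                             uint64_t backup_paddr, uint64_t size)
{
    memset(ph, 0, sizeof *ph);
    ph->p_type = PT_LOAD_TYPE;
    ph->p_paddr = orig_paddr;     /* where the crashed kernel had the data */
    ph->p_offset = backup_paddr;  /* where the saved copy can be read from */
    ph->p_filesz = size;
    ph->p_memsz = size;
}
```

For the x86 case discussed here the call would be along the lines of `fill_backup_phdr(&ph, 0, backup_start, 640 * 1024)`, with `backup_start` being wherever kexec user space placed the backup segment.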


Thanks
Vivek



2005-02-01 15:29:01

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Vivek Goyal <[email protected]> writes:

> Well, trying to put the already discussed ideas together. I was
> planning to work on following design. Please comment.
>
> Crashed Kernel <-->Capture Kernel(or User Space) Interface:
> ----------------------------------------------------------
>
> The whole idea is that Crash image is represented in ELF Core format.
> These ELF Headers are prepared by kexec-tools user space and put in one
> segment. Address of start of image is passed to the capture kernel(or
> user space) using one command line (eg. crashimage=). Now either kernel
> space or user space can parse the elf headers and extract required
> information and export final kernel elf core image.

Sounds sane. We need to make certain there is a checksum of that
region but putting it in a separate segment should ensure that.
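
To illustrate the idea of checksumming the segment at load time and verifying it before it is trusted after a crash, here is a deliberately simple sketch. It uses a plain additive 32-bit sum as a stand-in; a real implementation would want a stronger digest, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Additive 32-bit checksum over a buffer (stand-in for a real digest). */
static uint32_t segment_sum(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t sum = 0;

    while (len--)
        sum += *p++;
    return sum;
}

/* Returns 1 if the segment still matches the checksum recorded at
 * load time, 0 if it has been modified since. */
static int segment_intact(const void *buf, size_t len, uint32_t recorded)
{
    return segment_sum(buf, len) == recorded;
}
```

The checksum would be recorded when the headers are loaded and checked before the capture side parses them, so a segment scribbled on by the crashing kernel is rejected rather than trusted.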

I also think we need to look at the name crashimage= and see if we
can find something more descriptive. But that is minor. Possibly
elfcorehdr=. We have a little while to think about that one before we
are stuck.

> > [email protected] wrote:
> > If we were using an ELF header I would include one PT_NOTE program
> > header per cpu (Giving each cpu its own area to mess around in).
> > And I would use one PT_LOAD segment per possible memory zone.
> > So in the worst case (current sgi altix) (MAX_NUMNODES=256,
> > MAX_NR_ZONES=3, MAX_NR_CPUS=1024) 256*3+1024 = 1792 program
> > headers. At 56 bytes per 64bit program header that is 100352 bytes
> > or 98KiB. A little bit expensive. A tuned data structure with
> > 64bit base and size would only consume 1792*16 = 28672 or 28KiB.
>
> If I prepare One elf header for each physical contiguous memory area (as
> obtained from /proc/iomem) instead of per zone, then number of elf
> headers will come down significantly.

A clarification on terminology: we are talking about struct Elf64_Phdr
here. There is only one Elf header. That seems to be clear farther
down.

> I don't have any idea on number of
> actual physically contiguous regions present per machine, but roughly
> assuming it to be 1 per node, it will lead to 256 + 1024 = 1280 program
> headers.At 56 bytes per 64 bit program header this will amount to 70KB.
>
> This is worst case estimate and on lower end machines this will require
> much less a space. On machines as big as 1024 cpus, this should not be a
> concern, as big machines come with big RAMs.

Agreed. Size is not the primary issue. There is some clear waste
but that is a secondary concern. Not performing a 1-1 mapping
to the kernel data structures also seems to be a win, as the concepts
are noticeably different.
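
As a quick check of the figures quoted above, here is a minimal sketch
in C. The constants are the worst-case values mentioned in the thread;
phdr_table_bytes() is a hypothetical helper introduced only for
illustration and is not part of kexec-tools.

```c
#include <elf.h>
#include <stddef.h>

/* Worst-case figures quoted in the thread (current SGI Altix). */
#define MAX_NUMNODES 256
#define MAX_NR_ZONES 3
#define MAX_NR_CPUS  1024

/* Hypothetical helper: bytes needed for nr_phdrs 64-bit program headers. */
static size_t phdr_table_bytes(size_t nr_phdrs)
{
    return nr_phdrs * sizeof(Elf64_Phdr);
}

/* One PT_LOAD per zone per node, plus one PT_NOTE per cpu. */
static size_t phdrs_per_zone(void)
{
    return (size_t)MAX_NUMNODES * MAX_NR_ZONES + MAX_NR_CPUS;
}

/* One PT_LOAD per node (assumed one contiguous region each), plus notes. */
static size_t phdrs_per_node(void)
{
    return (size_t)MAX_NUMNODES + MAX_NR_CPUS;
}
```

With these values phdr_table_bytes(phdrs_per_zone()) is 100352 bytes
(98KiB) and phdr_table_bytes(phdrs_per_node()) is 71680 bytes (70KiB),
which agrees with the estimates in the thread.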

> Eric, do you still think that ELF headers are inappropriate to be passed
> across interface boundary.

I have serious concerns about the kernel generating the ELF headers
and only delivering them after the kernel has crashed. Because
then we run into questions of what information can be trusted. If we
avoid that issue I am not too concerned.

> ELF headers can be prepared by kexec-tools in advance and put into one
> of the data segments. This requires following information to be
> available to user space.

For a first round doing it in user space sounds sane. Obtaining
the information at the time of load is much more robust.

> - Starting address of space reserved by kernel for notes section
> (crash_notes[]). Probably can be obtained from /proc/kallsyms?

At least for a start.

> - NR_CPUS. May be sysconf(_SC_NPROCESSORS_CONF) should be
> sufficient.

Either that or /proc/cpuinfo. But the sysconf approach looks more
robust at this point.
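
A minimal sketch of the sysconf approach, assuming a glibc-style libc
where _SC_NPROCESSORS_CONF is available. configured_cpus() is a
hypothetical wrapper name. Note that this reports the processors the
running system has configured, which may be fewer than the NR_CPUS the
kernel was built with.

```c
#include <unistd.h>

/*
 * Hypothetical wrapper: number of processors configured on the
 * running system, or -1 if sysconf() cannot report it.
 */
static long configured_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}
```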

> - Size of memory reserved per cpu. No clue how to get that? Any
> suggestions?
> Maybe hard-coding something like a 1K area per cpu should be enough to
> address future needs?

The nice thing about doing this in user space is that we can hack
something together and get each side of the interface sorted
out independently. i.e. We can hard code it for now. Sort out
the users and then come back and make certain we have the information
exported cleanly. 1K per cpu currently matches the kernel code so
it is a good place to start :)

It does look like getting the size of each array element is a problem,
so the current kernel code certainly needs to be revisited. And
there are quite a few other pieces of how we are obtaining
the information that can be fixed as well.

> Regarding Backup Region
> -----------------------
>
> - Kexec user space does the reservation for backup region segment.
> - Purgatory copies the backup data to backup region. (Already
> implemented)
> - A separate elf header is prepared to represent backed up memory
> region. And "offset" field of this program header can contain the actual
> physical address where backup contents are stored.

I like that. I was thinking a virtual versus physical address
separation. But using the offset field is much more appropriate,
and it leaves us the potential of doing something nice like specifying
the kernel's virtual address later on. Looking exclusively at the
offset field to know which memory addresses to dump sounds good.
For now we should have virtual==physical==offset except for the
backup region.

This sounds like a good place to start.
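
To make the offset idea concrete, here is a rough sketch of how user
space might fill in the program header for the backup region. The
helper name backup_region_phdr() and the choice of PT_LOAD with
read-only flags are assumptions for illustration, not the kexec-tools
implementation.

```c
#include <elf.h>
#include <string.h>

/*
 * Hypothetical helper: describe a memory range whose contents were
 * copied to backup_addr before the capture kernel reused the range.
 * p_paddr/p_vaddr keep the original location; p_offset says where
 * the saved bytes can actually be read.
 */
static Elf64_Phdr backup_region_phdr(Elf64_Addr orig_start,
                                     Elf64_Xword size,
                                     Elf64_Off backup_addr)
{
    Elf64_Phdr ph;

    memset(&ph, 0, sizeof(ph));
    ph.p_type   = PT_LOAD;
    ph.p_flags  = PF_R;
    ph.p_paddr  = orig_start;
    ph.p_vaddr  = orig_start;   /* virtual == physical for now */
    ph.p_offset = backup_addr;  /* differs only for the backup region */
    ph.p_filesz = size;
    ph.p_memsz  = size;
    return ph;
}
```

For an ordinary memory region the same helper would be called with
backup_addr equal to orig_start, giving virtual==physical==offset.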

Eric

2005-02-02 07:10:21

by Itsuro Oda

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi,

I can't understand why ELF format is necessary.

I think the only necessary information is "what physical address
regions are valid to read". This information is necessary for any
sort of dump tools. (and must get it while the system is normal.)
The Eric's /proc/cpumem idea sounds nice to me.

--
Itsuro ODA <[email protected]>

2005-02-02 07:15:05

by Koichi Suzuki

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

[email protected] wrote:
> Koichi Suzuki <[email protected]> writes:
>
>
>>Hook in panic code is very good idea and is useful in various scenes. It could
>>be used to kick RAM dump code, obviously, and also kick the code to initiate
>>failover, etc. Various use could be possible so I believe that this hook
>>should be prepared for wider use.
>
>
> It is. Basically it is the normal kexec interface that allows you to
> boot another kernel. With a few restrictions that should keep it as
> reliable as possible when the kernel has not shut itself down cleanly.
>
> The hardest case is to do a useful system core dump. As that requires
> looking at what has gone before. For the rest if you can do it
> with a kernel and a initramfs you are in good shape.
>
> There seems to be a significant amount of interest in the full
> system core dump case so that is what the work is concentrating
> on.
>
> Eric
>

I meant that with kexec and the dump hook, many more things can be
done in addition to a full core dump. Initiating failover to another node
would be one example. Starting with this hook, there must be many good
ideas. So my idea is to make this hook general purpose, not tied to a
specific core dump tool.

Koichi Suzuki

2005-02-02 07:42:35

by Itsuro Oda

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi,

I don't like calling crash_kexec() directly in (ex.) panic().
It should be call_dump_hook() (or something like this).

I think the only necessary modifications to the kernel are:
- insert hooks that call a dump function when a crash occurs
- a binding interface that binds a dump function to the hook
(like register_dump_hook())
- supply the information about valid physical address regions
(- maybe some existing functions and variables need to be exported ?)

I think this allows any sort of dump function to be implemented
as a kernel module. I don't think building the "kexec based
crashdump" into the kernel is the best way.
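
The proposal above could be modelled roughly as follows. This is a
user-space sketch only: register_dump_hook() and call_dump_hook() are
the names suggested in this message, but their signatures here are
assumptions, and a real kernel version would also need locking and an
unregister path.

```c
#include <stddef.h>

typedef void (*dump_hook_t)(void);

static dump_hook_t dump_hook;      /* NULL until something registers */
static int example_hook_calls;     /* only used by the example hook */

/* Bind a dump function; returns 0 on success, -1 if refused. */
static int register_dump_hook(dump_hook_t fn)
{
    if (fn == NULL || dump_hook != NULL)
        return -1;
    dump_hook = fn;
    return 0;
}

/* What panic() would call instead of crash_kexec() directly. */
static void call_dump_hook(void)
{
    if (dump_hook != NULL)
        dump_hook();
}

/* Example hook standing in for a dump module's entry point. */
static void example_hook(void)
{
    example_hook_calls++;
}
```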

Thanks.

On 01 Feb 2005 02:06:42 -0700
[email protected] (Eric W. Biederman) wrote:

> Koichi Suzuki <[email protected]> writes:
>
> > Hook in panic code is very good idea and is useful in various scenes. It could
> > be used to kick RAM dump code, obviously, and also kick the code to initiate
> > failover, etc. Various use could be possible so I believe that this hook
> > should be prepared for wider use.
>
> It is. Basically it is the normal kexec interface that allows you to
> boot another kernel. With a few restrictions that should keep it as
> reliable as possible when the kernel has not shut itself down cleanly.
>
> The hardest case is to do a useful system core dump. As that requires
> looking at what has gone before. For the rest if you can do it
> with a kernel and a initramfs you are in good shape.
>
> There seems to be a significant amount of interest in the full
> system core dump case so that is what the work is concentrating
> on.
>
> Eric

--
Itsuro ODA <[email protected]>

2005-02-02 07:50:27

by Koichi Suzuki

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Itsuro Oda wrote:
> Hi,
>
> I can't understand why ELF format is necessary.
>
> I think the only necessary information is "what physical address
> regions are valid to read". This information is necessary for any
> sort of dump tools. (and must get it while the system is normal.)
> The Eric's /proc/cpumem idea sounds nice to me.
>

I agree. Format conversion should be done separately on a healthy system,
and we should do as little as possible while taking the dump. Conversion
from a plain memory image to crash/lcrash format will be very useful for
reusing existing tools and experience. I already have such a tool and
(if my administration allows) I can make it open.
Let me do some paperwork.

Koichi Suzuki
NTT DATA Intellilink

2005-02-02 09:16:57

by Vivek Goyal

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

On Tue, 2005-02-01 at 20:56, Eric W. Biederman wrote:
> Vivek Goyal <[email protected]> writes:
>
> > Well, trying to put the already discussed ideas together. I was
> > planning to work on following design. Please comment.
> >
> > Crashed Kernel <-->Capture Kernel(or User Space) Interface:
> > ----------------------------------------------------------
> >
> > The whole idea is that Crash image is represented in ELF Core format.
> > These ELF Headers are prepared by kexec-tools user space and put in one
> > segment. Address of start of image is passed to the capture kernel(or
> > user space) using one command line (eg. crashimage=). Now either kernel
> > space or user space can parse the elf headers and extract required
> > information and export final kernel elf core image.
>
> Sounds sane. We need to make certain there is a checksum of that
> region but putting it in a separate segment should ensure that.
>
> I also think we need to look at the name crashimage= and see if we
> can find something more descriptive. But that is minor. Possibly
> elfcorehdr= We have a little while to think about that one before we
> are stuck.

"elfcorehdr=" also looks good.

>
> > > [email protected] wrote:
> > > If we were using an ELF header I would include one PT_NOTE program
> > > header per cpu (Giving each cpu it's own area to mess around in).
> > > And I would use one PT_LOAD segment per possible memory zone.
> > > So in the worst case (current sgi altix) (MAX_NUMNODES=256,
> > > MAX_NR_ZONES=3, MAX_NR_CPUS=1024) 256*3+1024 = 1792 program
> > > headers. At 56 bytes per 64bit program header that is 100352 bytes
> > > or 98KiB. A little bit expensive. A tuned data structure with
> > > 64bit base and size would only consume 1792*16 = 28672 or 28KiB.
> >
> > If I prepare One elf header for each physical contiguous memory area (as
> > obtained from /proc/iomem) instead of per zone, then number of elf
> > headers will come down significantly.
>
> A clarification on terminology we are talking about struct Elf64_Phdr
> here. There is only one Elf header. That seems to be clear farther
> down.
>


Exactly. There shall be one Elf header for the whole image. In
addition there will be one struct Elf64_Phdr per contiguous physical
memory area, one Elf64_Phdr of PT_NOTE type for the notes section, and
one Elf64_Phdr for the backup region.


> > I don't have any idea on number of
> > actual physically contiguous regions present per machine, but roughly
> > assuming it to be 1 per node, it will lead to 256 + 1024 = 1280 program
> > headers.At 56 bytes per 64 bit program header this will amount to 70KB.
> >
> > This is worst case estimate and on lower end machines this will require
> > much less a space. On machines as big as 1024 cpus, this should not be a
> > concern, as big machines come with big RAMs.
>
> Agreed. Size is not the primary issue. There is some clear waste
> but that is a secondary concern. Not performing a 1-1 mapping
> to the kernel data structures also seems to be a win, as the concepts
> are noticeably different.
>
> > Eric, do you still think that ELF headers are inappropriate to be passed
> > across interface boundary.
>
> I have serious concerns about the kernel generating the ELF headers
> and only delivering them after the kernel has crashed. Because
> then we run into questions of what information can be trusted. If we
> avoid that issue I am not too concerned.


I hope that all elf headers, once prepared by kexec-tools, need not change
later (I cannot think of any piece of information which would change
later). These shall be put in a separate segment. And SHA-256 shall take
care of the authenticity of the information after a crash.


>
> > Regarding Backup Region
> > -----------------------
> >
> > - Kexec user space does the reservation for backup region segment.
> > - Purgatory copies the backup data to backup region. (Already
> > implemented)
> > - A separate elf header is prepared to represent backed up memory
> > region. And "offset" field of this program header can contain the actual
> > physical address where backup contents are stored.
>
> I like that. I was thinking a virtual versus physical address
> separation. But using the offset field is much more appropriate,
> and it leaves us the potential of doing something nice like specifying
> the kernels virtual address later on. Looking exclusively at the
> offset field to know which memory addresses to dump sounds good.
> For now we should have virtual==physical==offset except for the
> backup region.


For notes section program header, virtual = physical = 0 and "offset"
shall point to crash_notes[], so that notes can directly be read by the
capture kernel (or user space).

Thanks
Vivek

2005-02-02 14:28:32

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Itsuro Oda <[email protected]> writes:

> Hi,
>
> I can't understand why ELF format is necessary.

ELF format is not. However essentially the information an ELF
header provides is. So using an ELF header to convey that information
is a sane choice of data structure.

> I think the only necessary information is "what physical address
> regions are valid to read". This information is necessary for any
> sort of dump tools. (and must get it while the system is normal.)
> The Eric's /proc/cpumem idea sounds nice to me.

Patches welcome.

Eric

2005-02-02 14:34:01

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Koichi Suzuki <[email protected]> writes:

> I meant with kexec and dump hook, there could be many more things can be done in
> addition to full core dump. Initiating failover to other node will be one
> example. Starting with this hook, there must be many good ideas. So my idea
> is to make this hook general purpose, not for specific core dump tool.

Again that is what has been implemented. A fully stand alone executable
that lives in an independent and reserved address in memory is jumped
to.

The goal in the generic kernel is to keep the code path to do that
as small and as simple as possible to reduce the chances of it being
mis-implemented, or the chances of attempting to use corrupted kernel
functionality.

Eric

2005-02-02 14:47:18

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.


And the feedback begins :)

Itsuro Oda <[email protected]> writes:

> Hi,
>
> I don't like calling crash_kexec() directly in (ex.) panic().
> It should be call_dump_hook() (or something like this).
>
> I think the necessary modifications of the kernel is only:
> - insert the hooks that calls a dump function when crash occur
crash_kexec()
> - binding interface that binds a dump function to the hook
> (like register_dump_hook())
sys_kexec_load(...);
> - supply the information of valid physical address regions
/proc/iomem or possibly /proc/cpumem. At least until someone
actually implements hot plug memory support.

> (- maybe some existent functions and variables need to be exported ?)
>
> I think this makes any sort of dump functions can be implemented
> as a kernel module. I don't think it is best way that the "kexec based
> crashdump" is built in the kernel.

For people developing code outside of the kernel I can see where
this is a problem. Given the insane auditing requirements necessary
to get a reliable code path I don't see how not putting the implementation
in the kernel is sane. Anything that needs to be touched at that point
is core kernel functionality GPL_ONLY if it is exported at all.
Touching anything from a module at that point is not sane.

Basically the code path setup with crash_kexec is little more
than a jump instruction. And it should be audited and reduced
as much as possible. I don't see how you get simpler or what
piece of functionality could possibly improve by having multiple
implementations in kernel modules.

Eric


2005-02-02 15:26:20

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Koichi Suzuki <[email protected]> writes:

> Itsuro Oda wrote:
> > Hi,
> > I can't understand why ELF format is necessary.
> > I think the only necessary information is "what physical address regions are
> > valid to read". This information is necessary for any
> > sort of dump tools. (and must get it while the system is normal.)
> > The Eric's /proc/cpumem idea sounds nice to me.
>
> I agree. Format conversion should be done in healthy system separately and we
> should restrict what to do while taking the dump as few as possible. Conversion
> from just memory image to crash/lcrash format will be very useful to use
> existing tools and experiences. I already have such tool and (if my
> administration allows) I can make such tool open. Let me do some paperwork.

The big part of the conversation that is happening right now is how
do we uncouple dependencies between the various parts as much as
possible. There is nothing here about format conversions except
as to convert weird kernel formats into a stable interface.

There are 3 pieces of code interacting.
1) The primary kernel that will call panic.
2) The kernel+initrd that takes over.
3) The user space that sets it all up (/sbin/kexec) while the primary
kernel is still in a sane state.

The goal is to make those 3 pieces as independent of each other as
reasonably possible.

So the kernel+initrd that captures a crash dump will live and execute
in a reserved area of memory. It needs to know which memory regions
are valid, and it needs to know small things like the final register
state of each cpu. For the set of valid memory regions it is the
intention to encode that as an array of ELF program headers. The
information of what the final register contents were will be encoded
as ELF notes. There will be one PT_NOTE segment per cpu that holds
the notes needed to encode a given cpu's final state. It really
does not matter to the implementation that captures each cpu's final
register state which format we record the data in so using a format
designed not to change is not a problem. So all that needs
to be communicated to the kernel+initrd that captures a crash
dump is the location of an ELF header and it can figure out all of
the rest.
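
A minimal sketch of that last step, assuming the capture side has
already mapped the header area into a buffer. count_phdrs() and struct
phdr_counts are hypothetical names; the sketch only validates an ELF64
header and tallies the PT_LOAD and PT_NOTE entries it finds.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

struct phdr_counts {
    size_t loads;   /* PT_LOAD entries: valid memory regions */
    size_t notes;   /* PT_NOTE entries: per-cpu register state */
};

/* Returns 0 on success, -1 if buf does not hold a usable ELF64 header. */
static int count_phdrs(const unsigned char *buf, size_t len,
                       struct phdr_counts *out)
{
    Elf64_Ehdr eh;
    size_t i;

    if (len < sizeof(eh))
        return -1;
    memcpy(&eh, buf, sizeof(eh));
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof(Elf64_Phdr))
        return -1;
    /* The whole program header table must fit inside the buffer. */
    if (eh.e_phoff > len ||
        (len - eh.e_phoff) / sizeof(Elf64_Phdr) < eh.e_phnum)
        return -1;

    out->loads = 0;
    out->notes = 0;
    for (i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;

        memcpy(&ph, buf + eh.e_phoff + i * sizeof(ph), sizeof(ph));
        if (ph.p_type == PT_LOAD)
            out->loads++;
        else if (ph.p_type == PT_NOTE)
            out->notes++;
    }
    return 0;
}
```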

The primary kernel, except for remembering its final cpu
register state as it dies, does nothing except jump to the
crash recovery kernel. All of the interesting information will
be exported to user space.

/sbin/kexec is the glue that fills in the cracks. While
the primary kernel is in a sane state it sets everything up including
finding out which memory areas need to be looked at. And it stashes
it all in a reserved area of memory, that has never been the target
of DMA transfers.
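
As an illustration of the "finding out which memory areas need to be
looked at" step, here is a small sketch that picks RAM ranges out of
/proc/iomem style lines. It assumes the usual "start-end : name" line
layout with hexadecimal addresses; parse_system_ram() is a hypothetical
helper, not a kexec-tools function.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of the form "00100000-1fffffff : System RAM".
 * Returns 1 if the line describes System RAM, 0 otherwise.
 * *start and *end are only meaningful when 1 is returned.
 */
static int parse_system_ram(const char *line,
                            unsigned long long *start,
                            unsigned long long *end)
{
    char name[64];

    if (sscanf(line, "%llx-%llx : %63[^\n]", start, end, name) != 3)
        return 0;
    return strcmp(name, "System RAM") == 0;
}
```

A caller would read /proc/iomem line by line with fgets() and emit one
program header for each range this accepts.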

The goal is to reduce the dependencies as much as possible. So
an old stable kernel can take a crash dump of a new buggy kernel.
And so that you don't have to be running the latest and greatest
user space simply to set everything up. Although it is still
better to require a user-space upgrade to cope with new
kernels than to require the crash capture kernel+initrd to
be upgraded.

Eric

2005-02-02 15:45:09

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Vivek Goyal <[email protected]> writes:

> On Tue, 2005-02-01 at 20:56, Eric W. Biederman wrote:
> > Vivek Goyal <[email protected]> writes:
>
> "elfcorehdr=" also looks good.

Then let's go with that for now. It is not perfect but it seems
a little more self explanatory at first glance.

> > A clarification on terminology we are talking about struct Elf64_Phdr
> > here. There is only one Elf header. That seems to be clear farther
> > down.
> >
>
>
> Exactly. There shall be one Elf header for whole of the image. In
> addition there will be one struct Elf64_Phdr, per contiguous physical
> memory area. One Elf64_Phdr of PT_NOTE type for notes section and one
> Elf64_Phdr for backup region.

Actually if we are just pointing at kernel data structures we will
need multiple Elf64_Phdr of PT_NOTE. Each cpu has its own
notes section and until the smoke clears we can't be confident
about what is going to wind up there or how densely those will
be packed. So collapsing everything into a single notes segment
needs to happen after we have switched to the crash capture kernel.

> > I have serious concerns about the kernel generating the ELF headers
> > and only delivering them after the kernel has crashed. Because
> > then we run into questions of what information can be trusted. If we
> > avoid that issue I am not too concerned.
>
>
> I hope, all elf headers once prepared by kexec-tools need not to change
> later (Cannot think of any piece of information which shall change
> later). These shall be put in separate segment. And SHA-256 shall take
> care of authenticity of information after crash.

That should work fine. We need to consider throwing in an
extra note section with information like kernel version that
we can capture while the system is running.

> For notes section program header, virtual = physical = 0 and "offset"
> shall point to crash_notes[], so that notes can directly be read by the
> capture kernel (or user space).

I agree. But see my caveat. I think we should have one PT_NOTE
segment point at each element of the crash_notes[] array. I know
it is technically a violation of the ELF spec. But in this case
it makes sense. Since we can't guarantee that crash_notes will
be packed properly I don't know that we could reliably see more
than one cpu if we pointed a PT_NOTE header at the whole thing.

If it turns out that we can reliably point a single PT_NOTE header
at crash_notes so much the better but things are likely to be
more robust if we don't start with that assumption. That
at least allows us the freedom to capture some notes (like NT_UTSNAME)
before the kernel crashes.
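
A rough sketch of the one-PT_NOTE-per-cpu layout described above. The
1K element size is the hard-coded figure discussed earlier in the
thread; fill_note_phdrs() and CRASH_NOTE_BYTES are hypothetical names
used only for this example.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Assumption: each crash_notes[] element is 1K, as discussed above. */
#define CRASH_NOTE_BYTES 1024

/*
 * Fill phdrs[0..nr_cpus-1] with one PT_NOTE header per cpu, each
 * pointing at that cpu's element of crash_notes[].
 */
static void fill_note_phdrs(Elf64_Phdr *phdrs, size_t nr_cpus,
                            Elf64_Off crash_notes_paddr)
{
    size_t cpu;

    for (cpu = 0; cpu < nr_cpus; cpu++) {
        Elf64_Phdr *ph = &phdrs[cpu];

        memset(ph, 0, sizeof(*ph));
        ph->p_type   = PT_NOTE;
        ph->p_vaddr  = 0;   /* virtual = physical = 0 for notes */
        ph->p_paddr  = 0;
        ph->p_offset = crash_notes_paddr + cpu * CRASH_NOTE_BYTES;
        ph->p_filesz = CRASH_NOTE_BYTES;
        ph->p_memsz  = CRASH_NOTE_BYTES;
    }
}
```

Keeping one header per cpu means the capture side never has to guess
how densely the notes are packed inside each element.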

Eric

2005-02-03 07:10:57

by Hirokazu Takahashi

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi Vivek and Eric,

IMHO, why don't we swap not only the contents of the top 640K
but also kernel working memory for kdump kernel?

I guess this approach has some good points.

1.Preallocating reserved area is not mandatory at boot time.
And the reserved area can be distributed in small pieces
like original kexec does.

2.Special linking is not required for kdump kernel.
Each kdump kernel can be linked in the same way,
where the original kernel exists.

Am I missing something?


 physical memory
+-------+
| 640K  ------------+
|.......|           |
|       |         copy
+-------+           |
|       |           |
|original<-----+    |
|kernel |      |    |
|       |      |    |
|.......|      |    |
|       |      |    |
|       |      |    |
|       |    swap   |
|       |      |    |
+-------+      |    |
|reserved<----------+
|area   |      |
|       |      |
|kdump  |<-----+
|kernel |
+-------+
|       |
|       |
|       |
+-------+



> Hi Eric,
>
> It looks like we are looking at things a little differently. I
> see a portion of the picture in your mind, but obviously not
> entirely.
>
> Perhaps, we need to step back and iron out in specific terms what
> the interface between the two kernels should be in the crash dump
> case, and the distribution of responsibility between kernel, user space
> and the user.
>
> [BTW, the patch was intended as a step in development up for
> comment early enough to be able to get agreement on the interface
> and think issues through to more completeness before going
> too far. Sorry, if that wasn't apparent.]
>
> When you say "evil intermingling", I'm guessing you mean the
> "crashbackup=" boot parameter ? If so, then yes, I agree it'd
> be nice to find a way around it that doesn't push hardcoding
> elsewhere.
>
> Let me explain the interface/approach I was looking at.
>
> 1.First kernel reserves some area of memory for crash/capture kernel as
> specified by crashkernel=X@Y boot time parameter.
>
> 2.First kernel marks the top 640K of this area as backup area. (If
> architecture needs it.) This is sort of a hardcoding and probably this
> space reservation can be managed from user space as well as mentioned by
> you in this mail below.
>
> 3. Location of backup region is exported through /proc/iomem which can
> be read by user space utility to pass this information to purgatory code
> to determine where to copy the first 640K.
>
> Note that we do not make any additional reservation for the
> backup region. We carve this out from the top of the already
> reserved region and export it through /proc/iomem so that
> the user space code and the capture kernel code need not
> make any assumptions about where this region is located.
>
> 4. Once the capture kernel boots, it needs to know the location of
> backup region for two purposes.
>
> a. It should not overwrite the backup region.
>
> b. There needs to be a way for the capture tool to access the original
> contents of the backed up region
>
> Boot time parameter crashbackup=A@B has been provided to pass this
> information to capture kernel. This parameter is valid only for capture
> kernel and becomes effective only if CONFIG_CRASH_DUMP is enabled.
>
>
> > What is wrong with user space doing all of the extra space
> > reservation?
>
> Just for clarity, are you suggesting kexec-tools creating an additional
> segment for the backup region and pass the information to kernel.
>
> There is no problem in doing reservation from user space except
> one. How does the user and in-turn capture kernel come to know the
> location of backup region, assuming that the user is going to provide
> the exactmap for capture kernel to boot into.
>
> Just a thought, is it a good idea for kexec-tools to be creating and
> passing memmap parameters doing appropriate adjustment for backup
> region.
>
> I had another question. How is the starting location of elf headers
> communicated to capture tool? Is parameter segment a good idea? or
> some hardcoding?
>
> Another approach can be that backup area information is encoded in elf
> headers and capture kernel is booted with modified memmap (User gets
> backup region information from /proc/iomem) and capture tool can
> extract backup area information from elf headers as stored by first
> kernel.
>
> Could you please elaborate a little more on what aspect of your view
> differs from the above.
>
> Thanks
> Vivek

Thanks,
Hirokazu Takahashi.

2005-02-03 07:28:36

by Itsuro Oda

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi,

On 02 Feb 2005 08:24:03 -0700
[email protected] (Eric W. Biederman) wrote:
>
> So the kernel+initrd that captures a crash dump will live and execute
> in a reserved area of memory. It needs to know which memory regions
> are valid, and it needs to know small things like the final register
> state of each cpu.

Exactly.

Please let me clarify what you are going to do.
1) standard kernel: reserve a small contiguous area for a dump kernel
(this is not changed from the current code)
2) standard kernel: export the information of valid physical memory
regions. (/proc/iomem or /proc/cpumem etc.)
3) kexec (system call?): store the information of valid physical memory
regions as ELF program headers in the reserved area (mentioned in 1)).
4) standard kernel: when a panic occurs, append (ex.) the register
information as ELF notes after the memory information (if necessary),
and jump to the new kernel
5) dump kernel: export all valid physical memory (and saved register
information) to the user. (as /dev/oldmem /proc/vmcore ?)

Is this correct ? One question: how does the dump kernel know where
the ELF headers are saved ?

one more question: I don't understand what the 640K backup area is.
Please let me know why it is necessary.

Thanks.
--
Itsuro ODA <[email protected]>

2005-02-03 08:10:28

by Vivek Goyal

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi,

On Thu, 2005-02-03 at 12:32, Hirokazu Takahashi wrote:
> Hi Vivek and Eric,
>
> IMHO, why don't we swap not only the contents of the top 640K
> but also kernel working memory for kdump kernel?


The initial kdump patches adopted the same approach, but given the
fact that devices are not stopped during the transition to the new kernel
after a panic, it carried the inherent risk of some ongoing DMA corrupting
the new kernel or its data structures. Hence the idea of running the
kernel from a reserved location came up. This should be DMA safe as long
as DMA is not misdirected.

Thanks
Vivek

>
> I guess this approach has some good points.
>
> 1.Preallocating reserved area is not mandatory at boot time.
> And the reserved area can be distributed in small pieces
> like original kexec does.
>
> 2.Special linking is not required for kdump kernel.
> Each kdump kernel can be linked in the same way,
> where the original kernel exists.
>
> Am I missing something?
>  physical memory
> +-------+
> | 640K  ------------+
> |.......|           |
> |       |         copy
> +-------+           |
> |       |           |
> |original<-----+    |
> |kernel |      |    |
> |       |      |    |
> |.......|      |    |
> |       |      |    |
> |       |      |    |
> |       |    swap   |
> |       |      |    |
> +-------+      |    |
> |reserved<----------+
> |area   |      |
> |       |      |
> |kdump  |<-----+
> |kernel |
> +-------+
> |       |
> |       |
> |       |
> +-------+
>
>


2005-02-03 09:03:25

by Eric W. Biederman

Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Itsuro Oda <[email protected]> writes:

> Hi,
>
> On 02 Feb 2005 08:24:03 -0700
> [email protected] (Eric W. Biederman) wrote:
> >
> > So the kernel+initrd that captures a crash dump will live and execute
> > in a reserved area of memory. It needs to know which memory regions
> > are valid, and it needs to know small things like the final register
> > state of each cpu.
>
> Exactly.
>
> Please let me clarify what you are going to.
> 1) standard kernel: reserve a small contiguous area for a dump kernel
> (this is not changed as the current code)
> 2) standard kernel: export the information of valid physical memory
> regions. (/proc/iomem or /proc/cpumem etc.)
> 3) kexec (system call?): store the information of valid physical memory
> regions as ELF program header to the reserved area (mentioned 1)).

A better description is probably: make a list of memory regions
using an ELF header data structure in user space.
Use sys_kexec_load to put that list, the dump kernel, and a little
bit of glue code in the reserved area. The glue code includes
a hash of everything so it can all be validated before
use.
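
The validate-before-use idea can be sketched as below. Elsewhere in
the thread SHA-256 is named as the hash; to keep the example short
this sketch swaps in 64-bit FNV-1a as a stand-in, which is NOT
suitable for the real thing. struct segment, digest_segments() and
segments_intact() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct segment {
    const unsigned char *buf;
    size_t len;
};

/* 64-bit FNV-1a step; a stand-in for SHA-256, for illustration only. */
static uint64_t fnv1a(uint64_t h, const unsigned char *p, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Digest over all loaded segments, computed at load time. */
static uint64_t digest_segments(const struct segment *segs, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    size_t i;

    for (i = 0; i < n; i++)
        h = fnv1a(h, segs[i].buf, segs[i].len);
    return h;
}

/* Glue-code check after the crash: nonzero if nothing was modified. */
static int segments_intact(const struct segment *segs, size_t n,
                           uint64_t expected)
{
    return digest_segments(segs, n) == expected;
}
```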

> 4) standard kernel: when a panic occur, append (ex.) the register
> information as ELF note after the memory information (if necessary).
> and jump new kernel

Record the register information as ELF notes in a per cpu data
area. The per cpu data areas are known and enumerated in
the list of memory regions. The kernel knows nothing about
the ELF header etc.

> 5) dump kernel: export all valid physical memory (and saved register
> information) to the user. (as /dev/oldmem /proc/vmcore ?)

Or in user space, by just mmaping /dev/mem. That is part of the
current conversation. The only real point for putting that code in
the kernel (besides momentum) is it is a cheap way to get the exact
data structures of the kernel you are using. But since:
(a) it does not look like any primary kernel data structures need to
be examined.
(b) even simple compile options like SMP/NOSMP are enough to change
the layout of the data structures.
I think there is a pretty good case for moving all of the work to
user space. But you still need a kernel that loads and
runs in the reserved area.

> Is this correct ? one question: how the dump kernel know the saved
> area of ELF headers ?

A command line parameter will be passed. Probably
elfcorehdr=xxx
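
A minimal sketch of how the capture side might pick that parameter out
of a command line string. parse_elfcorehdr() is a hypothetical helper,
and the value format (a single number in any base strtoull accepts) is
an assumption, since only "xxx" is given above.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Look for "elfcorehdr=<address>" in a command line and store the
 * address in *addr. Returns 1 if found and parsed, 0 otherwise.
 */
static int parse_elfcorehdr(const char *cmdline, unsigned long long *addr)
{
    static const char key[] = "elfcorehdr=";
    const char *p = cmdline;
    char *end;

    while ((p = strstr(p, key)) != NULL) {
        /* Only accept the key at the start of a word. */
        if (p == cmdline || p[-1] == ' ') {
            p += sizeof(key) - 1;
            *addr = strtoull(p, &end, 0);
            return end != p;    /* fail if no digits followed the '=' */
        }
        p += sizeof(key) - 1;
    }
    return 0;
}
```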

> one more question: I don't understand what the 640K backup area is.
> Please let me know why it is necessary.

In practice I think we can kill it on x86. It is necessary (at least
a subset of it is) if we want to boot an SMP kernel, as a cpu must
start running code in the first 1M of the address space. In addition,
some architectures have exception vectors and/or other data
structures at fixed locations in memory, so in the general case a
backup area is required. So building the infrastructure to handle
backup areas is needed, even if we later stop using it on
x86.

The other reason for the 640K backup area is the IBM guys were having
problems without it. The fact that you don't need it is a good
indication that it is unnecessary.

Eric

2005-02-03 09:15:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hirokazu Takahashi <[email protected]> writes:

> Hi Vivek and Eric,
>
> IMHO, why don't we swap not only the contents of the top 640K
> but also kernel working memory for kdump kernel?
>
> I guess this approach has some good points.
>
> 1.Preallocating reserved area is not mandatory at boot time.
> And the reserved area can be distributed in small pieces
> like original kexec does.
>
> 2.Special linking is not required for kdump kernel.
> Each kdump kernel can be linked in the same way,
> where the original kernel exists.
>
> Am I missing something?

Preallocating the reserved area is largely to keep it from
being the target of DMA accesses. Since we are not able
to shut down any of the drivers in the primary kernel, running
in a normal swath of memory sounds like a good way to get
yourself stomped at the worst possible time.

In addition we get to avoid running a lot of code in the
panic path if we are jumping to a contiguous region of memory
with everything already setup.

To some extent this is a contest over who has the better imagination
for things that can go wrong: real life on dying hardware and
kernels, or the programmers writing the diagnostic code.

But if it is a gamble you are willing to take it is quite
feasible to use the reserved region for what you are
proposing and you could run a standard kernel.

The other reason for running out of the reserved region is that
it actually requires less reserved memory. Every byte you back up
needs to have a reserved area of memory to hold it. And if you are
also going to fill that with meaningful content you need another
byte to hold the data. So using a stock kernel probably requires
2/3 more memory.

Eric

2005-02-03 09:45:18

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi Vivek,

> > Hi Vivek and Eric,
> >
> > IMHO, why don't we swap not only the contents of the top 640K
> > but also kernel working memory for kdump kernel?
>
>
> Initial patches of kdump had adopted the same approach but given the
> fact devices are not stopped during transition to new kernel after a
> panic, it carried inherent risk of some DMA going on and corrupting the
> new kernel/data structures. Hence the idea of running the kernel from a
> reserved location came up. This should be DMA safe as long as DMA is not
> misdirected.

I see, that makes sense.
But I'm not sure yet that it's safe to access the top of 640MB.
I wonder how kmalloc(GFP_DMA) works in a kdump kernel.

Thanks,
Hirokazu Takahashi.

2005-02-03 10:10:37

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hirokazu Takahashi <[email protected]> writes:

> Hi Vivek,
>
> > > Hi Vivek and Eric,
> > >
> > > IMHO, why don't we swap not only the contents of the top 640K
> > > but also kernel working memory for kdump kernel?
> >
> >
> > Initial patches of kdump had adopted the same approach but given the
> > fact devices are not stopped during transition to new kernel after a
> > panic, it carried inherent risk of some DMA going on and corrupting the
> > new kernel/data structures. Hence the idea of running the kernel from a
> > reserved location came up. This should be DMA safe as long as DMA is not
> > misdirected.
>
> I see, that makes sense.
> But I'm not sure yet that it's safe to access the top of 640MB.
640K?

> I wonder how kmalloc(GFP_DMA) works in a kdump kernel.

All that happens there is a one line change to vmlinux.lds.S that
causes the kernel to live at a different physical and virtual
address. So everything works as normal.

I do agree that it is risky to use the first 640K for normal work.
But on the list of things to fix it is a minor wart, and even if we
back up that region of memory we don't need to use it.

There still remain a lot of code reviews to ensure the code is
generally safe.

Eric

2005-02-03 10:18:19

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi Eric,

> > Hi Vivek and Eric,
> >
> > IMHO, why don't we swap not only the contents of the top 640K
> > but also kernel working memory for kdump kernel?
> >
> > I guess this approach has some good points.
> >
> > 1.Preallocating reserved area is not mandatory at boot time.
> > And the reserved area can be distributed in small pieces
> > like original kexec does.
> >
> > 2.Special linking is not required for kdump kernel.
> > Each kdump kernel can be linked in the same way,
> > where the original kernel exists.
> >
> > Am I missing something?
>
> Preallocating the reserved area is largely to keep it from
> being the target of DMA accesses. Since we are not able
> to shutdown any of the drivers in the primary kernel running
> in a normal swath of memory sounds like a good way to get
> yourself stomped at the worst possible time.

So what do you think of my other idea?

I think we can always make a kdump kernel mapped to the same virtual
address. So we will be free from caring about the physical address
where the kdump kernel is loaded.

I believe the memsection functionality which LHMS project is working
on would help this.

                         +
                         |
                         |
                    (user space)
                         |
                         |
          physical       |  virtual
          memory         |  space
          + ------------ +
          |              |
          |              |
          |              |
          + ------------.+
 original |           .  |  map kdump kernel here
 kernel   |         .    |
          |       .      |
          |     .       .+
          +   .       .  |
          | .       .    |
          +       .      |
 kdump    |     .        |
 kernel   |   .          |
          | .            |
          +              |
          |              |
          |              |
          |              |



Thanks,
Hirokazu Takahashi.

2005-02-03 10:48:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hirokazu Takahashi <[email protected]> writes:

> Hi Eric,
>
> > > Hi Vivek and Eric,
> > >
> > > IMHO, why don't we swap not only the contents of the top 640K
> > > but also kernel working memory for kdump kernel?
> > >
> > > I guess this approach has some good points.
> > >
> > > 1.Preallocating reserved area is not mandatory at boot time.
> > > And the reserved area can be distributed in small pieces
> > > like original kexec does.
> > >
> > > 2.Special linking is not required for kdump kernel.
> > > Each kdump kernel can be linked in the same way,
> > > where the original kernel exists.
> > >
> > > Am I missing something?
> >
> > Preallocating the reserved area is largely to keep it from
> > being the target of DMA accesses. Since we are not able
> > to shutdown any of the drivers in the primary kernel running
> > in a normal swath of memory sounds like a good way to get
> > yourself stomped at the worst possible time.
>
> So what do you think my another idea?

I have proposed it. I think ia64 already does that.
It has been pointed out that the PowerPC kernel occasionally runs
with the mmu turned off. So it is not a technique that is 100%
portable.

> I think we can always make a kdump kernel mapped to the same virtual
> address. So we will be free from caring about the physical address
> where the kdump kernel is loaded.
>
> I believe the memsection functionality which LHMS project is working
> on would help this.

You don't need anything fancy except to build the page tables
during bootup. However there are a few potential gotchas
with respect to using large pages, that can give 4MiB or
greater alignment restrictions on the kernel. Code wise
the gotcha is moving the kernel's .text section into what
is essentially the vmalloc portion of the address space.
For x86_64 the kernel's virtual address is already decoupled from the
physical addresses, so it is probably easier.
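
For the large page gotcha, a tiny sketch of the arithmetic involved:
rounding a candidate load address up to a 4MiB boundary. align_up() and
LARGE_PAGE_SIZE are hypothetical names, and 4MiB is just the smallest
restriction mentioned above; other configurations may need more.

```c
#include <stdint.h>

/* Assumed large page size: 4MiB, the minimum restriction mentioned. */
#define LARGE_PAGE_SIZE (4ULL << 20)

/* Round addr up to the next multiple of align (align must be a power of 2). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}
```

So a kernel that would otherwise sit at 1MiB has to move up to 4MiB
under this restriction.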

Most of this just results in easier management between the pieces.
Which is a good thing. However at the moment I don't think it
simplifies any of the core problems. I still need to reserve
a large hunk of physical address space early on before any
DMA transactions are setup to hold the new kernel.

So while I am happy to see patches that improve this I don't
actually care right now.

Eric

2005-02-03 13:56:32

by Vivek Goyal

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

On Wed, 2005-02-02 at 21:12, Eric W. Biederman wrote:
> Vivek Goyal <[email protected]> writes:
>
> > On Tue, 2005-02-01 at 20:56, Eric W. Biederman wrote:
> > > Vivek Goyal <[email protected]> writes:
> >
> > "elfcorehdr=" also looks good.
>
> Then let's go with that for now. It is not perfect but it seems
> a little more self explanatory at first glance.
> > > A clarification on terminology we are talking about struct Elf64_Phdr
> > > here. There is only one Elf header. That seems to be clear farther
> > > down.
> > >
> >
> >
> > Exactly. There shall be one Elf header for whole of the image. In
> > addition there will be one struct Elf64_Phdr, per contiguous physical
> > memory area. One Elf64_Phdr of PT_NOTE type for notes section and one
> > Elf64_Phdr for backup region.
>
> Actually if we are just pointing a kernel data structures we will
> need multiple Elf64_Phdr of PT_NOTE. Each cpu has it's own
> notes section and until the smoke clears we can't be confident
> about what is going to wind up there or how densely those will
> be packed. So collapsing everything into a single notes segment
> needs to happen after we have switched to the crash capture kernel.


Sounds good. So there shall be a PT_NOTE type program header per cpu.
And these headers can be collapsed into one PT_NOTE type header later.


>
> > > I have serious concerns about the kernel generating the ELF headers
> > > and only delivering them after the kernel has crashed. Because
> > > then we run into questions of what information can be trusted. If we
> > > avoid that issue I am not too concerned.
> >
> >
> > I hope, all elf headers once prepared by kexec-tools need not to change
> > later (Cannot think of any piece of information which shall change
> > later). These shall be put in separate segment. And SHA-256 shall take
> > care of authenticity of information after crash.
>
> That should work fine. We need to consider through throwing in an
> extra note section with information like kernel version that
> we can capture while the system is running.
>
> > For notes section program header, virtual = physical = 0 and "offset"
> > shall point to crash_notes[], so that notes can directly be read by the
> > capture kernel (or user space).
>
> I agree. But see my caveat. I think we should have one PT_NOTE
> segment point at each element of the crash_notes[] array. I know
> it is technically a violation of the ELF spec. But in this case
> it makes sense. Since we can't guarantee that crash_notes will
> be packed properly I don't know that we could reliably see more
> than one cpu if we pointed a PT_NOTE header at the whole thing.
>
> If it turns out that we can reliably point a single PT_NOTE header
> at crash_notes so much the better but things are likely to be
> more robust if we don't start with that assumption. That
> at least allows us the freedom to capture some notes (like NT_UTSNAME)
> before the kernel crashes.
>
> Eric
>

2005-02-03 23:19:28

by Itsuro Oda

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi,

On 03 Feb 2005 02:00:51 -0700
[email protected] (Eric W. Biederman) wrote:

> A better description is probably make a list of memory regions
> using an ELF header data structure in user space.
> Use sys_kexec_load to put that list the dump kernel and a little
> big of glue code in the reserved area. The glue code includes
> a hash of all of everything so it can all be validated before
> use.

I see. The data structure is placed as part of the loaded kernel's data.

> Record the register information as ELF notes in a per cpu data
> area. The per cpu data areas are known and enumerated in
> the list of memory regions. The kernel knows nothing about
> the ELF header etc.
>

I see.

> > 5) dump kernel: export all valid physical memory (and saved register
> > information) to the user. (as /dev/oldmem /proc/vmcore ?)
>
> Or in user space, by just mmaping /dev/mem. That is part of the
> current conversation. The only real point for putting that code in
> the kernel (besides momentum) is it is a cheap way to get the exact
> data structures of the kernel you are using. But since:
> (a) it does not look like any primary kernel data structures need to
> be examined.
> (b) even simple compile options like SMP/NOSMP are enough to change
> the layout of the data structures.
> I think there is a pretty good case for moving all of the work to
> user space. But you still need a kernel that loads and
> runs in the reserved area.
>
I don't make sense. what do you mean ?

What we want to do when the system has crashed is store the whole
physical memory (and saved register information for x86 arch) to
some place (ex. a disk partition) for later analysis.
So the basic requirements for the dump kernel are:
* supply a method to access whole (valid) physical memory.
* supply a method to access the saved register information.

Does kdump meet these requirements ?

(I am not interested in /proc/vmcore. Constructing the vmcore
image is the job of analysis tools, not the kernel's task.)

Thanks.
--
Itsuro ODA <[email protected]>

2005-02-04 00:23:52

by Itsuro Oda

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi,

On 02 Feb 2005 07:45:11 -0700
[email protected] (Eric W. Biederman) wrote:

>
> And the feedback begins :)
>
> Itsuro Oda <[email protected]> writes:
>
> > Hi,
> >
> > I don't like calling crash_kexec() directly in (ex.) panic().
> > It should be call_dump_hook() (or something like this).
> >
> > I think the necessary modifications of the kernel is only:
> > - insert the hooks that calls a dump function when crash occur
> crash_kexec()
> > - binding interface that binds a dump function to the hook
> > (like register_dump_hook())
> sys_kexec_load(...);

For example there are people who want to execute a built in kernel
debugger when the system has crashed, or there are people who
believe the diskdump is the best dump tool :-)

So I think a sort of hook is better than calling crash_kexec
directly. (May I make a patch ?)

Thanks.
--
Itsuro ODA <[email protected]>

2005-02-04 00:46:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Itsuro Oda <[email protected]> writes:

> Hi,
>
> On 03 Feb 2005 02:00:51 -0700
> [email protected] (Eric W. Biederman) wrote:
>
> > > 5) dump kernel: export all valid physical memory (and saved register
> > > information) to the user. (as /dev/oldmem /proc/vmcore ?)
> >
> > Or in user space, by just mmaping /dev/mem. That is part of the
> > current conversation. The only real point for putting that code in
> > the kernel (besides momentum) is it is a cheap way to get the exact
> > data structures of the kernel you are using. But since:
> > (a) it does not look like any primary kernel data structures need to
> > be examined.
> > (b) even simple compile options like SMP/NOSMP are enough to change
> > the layout of the data structures.
> > I think there is a pretty good case for moving all of the work to
> > user space. But you still need a kernel that loads and
> > runs in the reserved area.
> >
> I don't make sense. what do you mean ?
>
> What we want to do when the system is crashed is storing the whole
> physical memory (and saved register information for x86 arch) to
> some place (ex. a disk partition) for later analysis.
> So the basic requirments to the dump kernel is that:
> * supply a method to access whole (valid) physical memory.
> * supply a method to access the saved register information.
>
> Does the kdump meet this requirment ?

Yes, the discussion in this area is what is the best way to implement
this requirement. How much should be in the kernel and how much
should be in user space.

At the moment things are broken but should be fixed shortly.
What has been implemented so far is /dev/oldmem, which provides access
to the old memory, and /proc/vmcore, which provides both the old
memory and the register information.

> (I am not interesting to /proc/vmcore. Constructing the vmcore
> image is area of analysis tools. not kernel's task.)

There is a fine line there, as a simple ELF core dump has just enough
information to describe discontiguous memory, and to have an out of
band channel for register information. Adding anything extra like
virtual addresses that match the kernel should be left for the
crash dump analysis tools.

In code that is currently in the mainstream kernel, /dev/mem can
mmap any area of memory that is not used by the kernel as ram.
So what I believe we will end up with is that /sbin/kexec (user space)
will prepare an ELF header (data) that describes the memory regions
and details where to find the kernel's register information. The
address of that ELF header will be passed to the crash dump
capture kernel and user space combination. Then something
(probably a user space program reading /dev/mem) will look
at the ELF header and save the already prepared ELF core
dump to disk, possibly doing little things like merging
the MAX_NR_CPUS note segments into one so it actually conforms
to the ELF spec.
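
The merging step could look roughly like the sketch below, which
concatenates the per cpu note areas that were actually filled in.
struct note_seg and merge_notes() are hypothetical names. It assumes the
valid length of each area is already known and is a multiple of 4, so
the notes stay aligned after being packed together.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one per cpu note area; len == 0 means unused. */
struct note_seg {
    const unsigned char *data;
    size_t len;
};

/*
 * Pack all non-empty note areas back to back into out.
 * Returns the merged length, or (size_t)-1 if out is too small.
 */
static size_t merge_notes(unsigned char *out, size_t outlen,
                          const struct note_seg *segs, size_t nsegs)
{
    size_t off = 0;
    size_t i;

    for (i = 0; i < nsegs; i++) {
        if (segs[i].len == 0)
            continue;               /* this cpu recorded nothing */
        if (segs[i].len > outlen - off)
            return (size_t)-1;
        memcpy(out + off, segs[i].data, segs[i].len);
        off += segs[i].len;
    }
    return off;
}
```

The result can then be described by a single PT_NOTE program header in
the file written to disk.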

This thread started as the design discussion before finishing
that part of the implementation. The proof of concept
implementations have happened. We have all seen this kind
of functionality implemented. Now is the time to come up
with a good solid design that can be maintained and merged
into the mainline kernel and distros.

So thank you for asking questions, it means we have a better chance
of getting a solid design and a design that those people who
care about this functionality can use. And with a little luck
we can all wind up on agreeing on the general principles. You came in
a little late to this conversation so a lot of details have been
settled, but if you have a good argument for doing something another
way we can certainly look at that.

Eric

2005-02-04 02:10:58

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Itsuro Oda <[email protected]> writes:

> Hi,
>
> On 02 Feb 2005 07:45:11 -0700
> [email protected] (Eric W. Biederman) wrote:
>
> >
> > And the feedback begins :)
> >
> > Itsuro Oda <[email protected]> writes:
> >
> > > Hi,
> > >
> > > I don't like calling crash_kexec() directly in (ex.) panic().
> > > It should be call_dump_hook() (or something like this).
> > >
> > > I think the necessary modifications of the kernel is only:
> > > - insert the hooks that calls a dump function when crash occur
> > crash_kexec()
> > > - binding interface that binds a dump function to the hook
> > > (like register_dump_hook())
> > sys_kexec_load(...);
>
> For example there are pepole who want to execute a built in kernel
> debugger when the system is crashed. or there are pepole who
> believe the diskdump is the best dump tool :-)
>
> So I think a sort of hook is better than calling crash_kexec
> directly. (May I make a patch ?)

The prevalent feeling I have heard from kernel developers, and
my personal feeling as well, is that after a kernel has called
panic you can't trust it. Which means anything running in the kernel
itself is suspect.

The crash_kexec() hook enables everything that does not get linked into
the kernel. So I don't feel a hook in the panic path is necessary,
nor do I feel that it is wise, especially with no in-kernel users.

Plus the worst part about a hook in the panic path is that it is
inherently racy. Keeping the crash_kexec() code from blocking or
being racy has been a challenge. And I still think that entire code
path needs a review and some more code tweaks to remove races.

If someone else wants a hook in the panic path they can add their own
hook, and make their own case for why it is needed.

Eric

2005-02-04 01:24:07

by Itsuro Oda

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi,

On Fri, 04 Feb 2005 08:18:56 +0900
Itsuro Oda <[email protected]> wrote:

>
> > > 5) dump kernel: export all valid physical memory (and saved register
> > > information) to the user. (as /dev/oldmem /proc/vmcore ?)
> >
> > Or in user space, by just mmaping /dev/mem. That is part of the
> > current conversation. The only real point for putting that code in
> > the kernel (besides momentum) is it is a cheap way to get the exact
> > data structures of the kernel you are using. But since:
> > (a) it does not look like any primary kernel data structures need to
> > be examined.
> > (b) even simple compile options like SMP/NOSMP are enough to change
> > the layout of the data structures.
> > I think there is a pretty good case for moving all of the work to
> > user space. But you still need a kernel that loads and
> > runs in the reserved area.
> >
> I don't make sense. what do you mean ?
>

"I don't make sense." should be "It does not make sense."
sorry. I'm not familiar with English.

--
Itsuro ODA <[email protected]>

2005-02-04 10:13:07

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hi,

> > Hi Eric,
> >
> > > > Hi Vivek and Eric,
> > > >
> > > > IMHO, why don't we swap not only the contents of the top 640K
> > > > but also kernel working memory for kdump kernel?
> > > >
> > > > I guess this approach has some good points.
> > > >
> > > > 1.Preallocating reserved area is not mandatory at boot time.
> > > > And the reserved area can be distributed in small pieces
> > > > like original kexec does.
> > > >
> > > > 2.Special linking is not required for kdump kernel.
> > > > Each kdump kernel can be linked in the same way,
> > > > where the original kernel exists.
> > > >
> > > > Am I missing something?
> > >
> > > Preallocating the reserved area is largely to keep it from
> > > being the target of DMA accesses. Since we are not able
> > > to shutdown any of the drivers in the primary kernel running
> > > in a normal swath of memory sounds like a good way to get
> > > yourself stomped at the worst possible time.
> >
> > So what do you think my another idea?
>
> I have proposed it. I think ia64 already does that.
> It has been pointed that the PowerPC kernel occasionally runs
> with the mmu turned off. So it is not a technique the is 100%
> portable.

I see you have.
And MIPS CPUs don't allow kernel pages to be remapped either.

> > I think we can always make a kdump kernel mapped to the same virtual
> > address. So we will be free from caring about the physical address
> > where the kdump kernel is loaded.
> >
> > I believe the memsection functionality which LHMS project is working
> > on would help this.
>
> You don't need anything fancy except to build the page tables
> during bootup. However there are a few potential gotchas
> with respect to using large pages, that can give 4MiB or
> greater alignment restrictions on the kernel. Code wise
> the gotcha is moving the kernel's .text section into what
> is essentially the vmalloc portion of the address space.
> For x86_64 the kernels virtual address is already decoupled from the
> physical addresses, so it is probably easier.

I know we can place the kernel in any address though there
exist some exceptions.

I know mapping kernel pages to the same virtual address only helps
to avoid caring about physical addresses or vmalloc'ed addresses
when linking the kernel. I think it wouldn't be a bad idea on many
architectures. I prefer it to linking the kernel for each
system.

> Most of this just results in easier management between the pieces.
> Which is a good thing. However at the moment I don't think it
> simplifies any of the core problems. I still need to reserve
> a large hunk of physical address space early on before any
> DMA transactions are setup to hold the new kernel.

I agree that my idea is not essential at the moment.

> So while I am happy to see patches that improve this I don't
> actually care right now.

ok.

> Eric
>

Thanks,
Hirokazu Takahashi.

2005-02-04 11:19:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

Hirokazu Takahashi <[email protected]> writes:

> Hi,
>
> > > Hi Eric,
> > >

> I see you have.
> And MIPS CPUs doesn't allow kernel pages to be remapped either.

I guess I should add that being relocatable in the general case most
likely requires running a PIC dynamic linker at kernel startup.
If none of the rest of the kernel is built PIC and the relocation
table is not too big, we might be able to convince people to implement
it generally.

At least that is one technique for generating a PIC kernel that I
have not explored fully.

> > You don't need anything fancy except to build the page tables
> > during bootup. However there are a few potential gotchas
> > with respect to using large pages, that can give 4MiB or
> > greater alignment restrictions on the kernel. Code wise
> > the gotcha is moving the kernel's .text section into what
> > is essentially the vmalloc portion of the address space.
> > For x86_64 the kernels virtual address is already decoupled from the
> > physical addresses, so it is probably easier.
>
> I know we can place the kernel in any address though there
> exist some exceptions.
>
> I know mapping kernel pages to the same virtual address only helps
> to avoid caring about physical addresses or vmalloc'ed addresses
> when linking the kernel. I think it wouldn't be bad idea in many
> architectures. I prefer it rather than linking the kernel for each
> system.

Agreed. Although I suspect most architectures will have a region
that will work for most users.

> > Most of this just results in easier management between the pieces.
> > Which is a good thing. However at the moment I don't think it
> > simplifies any of the core problems. I still need to reserve
> > a large hunk of physical address space early on before any
> > DMA transactions are setup to hold the new kernel.
>
> I agree that my idea is not essential at the moment.
>
> > So while I am happy to see patches that improve this I don't
> > actually care right now.
>
> ok.

The one part I do request is that if you build such a kernel,
you figure out a way to make its ELF header of type ET_DYN, so it
does not require a magic loader to load it.

I have recently patched both etherboot and /sbin/kexec to accept
that kind of binary :)

Eric

2005-02-04 12:05:17

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Fastboot] [PATCH] Reserving backup region for kexec based crashdumps.

[email protected] (Eric W. Biederman) writes:

> Hirokazu Takahashi <[email protected]> writes:
> > > Most of this just results in easier management between the pieces.
> > > Which is a good thing. However at the moment I don't think it
> > > simplifies any of the core problems. I still need to reserve
> > > a large hunk of physical address space early on before any
> > > DMA transactions are setup to hold the new kernel.
> >
> > I agree that my idea is not essential at the moment.
> >
> > > So while I am happy to see patches that improve this I don't
> > > actually care right now.
> >
> > ok.

Thinking about this some more, this does have a significant effect
on the design. For architectures that support this, on the
primary kernel the command line option becomes:
crashkernel=size instead of crashkernel=size@location.
Which means the kernel needs to call alloc_bootmem instead
of reserve_bootmem. So it results in a primary kernel implementation
difference.
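
A minimal user-space sketch of telling the two forms apart.
parse_size() and parse_crashkernel() are hypothetical helpers, and the
K/M/G suffix handling is an assumption about the option syntax. The
return value says whether a location was given, which is what would
decide between allocating anywhere and reserving a fixed range.

```c
#include <stdlib.h>

/* Parse a number with an optional K/M/G suffix (assumed syntax). */
static unsigned long long parse_size(const char *s, const char **endp)
{
    char *end;
    unsigned long long v = strtoull(s, &end, 0);

    switch (*end) {
    case 'G': case 'g':
        v <<= 10;
        /* fall through */
    case 'M': case 'm':
        v <<= 10;
        /* fall through */
    case 'K': case 'k':
        v <<= 10;
        end++;
        break;
    default:
        break;
    }
    *endp = end;
    return v;
}

/*
 * Parse the value of crashkernel=, either "size" or "size@location".
 * Returns 0 on error, 1 for size only, 2 for size and location.
 */
static int parse_crashkernel(const char *val, unsigned long long *size,
                             unsigned long long *base)
{
    const char *p;
    const char *q;

    *size = parse_size(val, &p);
    if (p == val)
        return 0;
    if (*p != '@') {
        *base = 0;      /* no location: the kernel may pick one */
        return 1;
    }
    q = p + 1;
    *base = parse_size(q, &p);
    return p == q ? 0 : 2;
}
```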

In addition if we really can push all of the dump specific
functionality into user space as it appears we can, this allows a
generic kernel to be used for the crash dump process. It will
probably still be a special hardened build where reliability is
more important than performance. So that any micro hit we take in
performance by modifying __pa() and __va() will be irrelevant.

I like it.

I have already demonstrated that there is a general technique that
any architecture can use to build a kernel that runs at a non-default
address. So for the architectures that cannot build a PIC kernel
there is still a proven solution available, it simply will not
be as nice to manage.

x86_64 should be pretty straightforward. i386 will be a little more
difficult but doable.

Patches are still welcome.

Eric

2005-02-16 08:50:14

by Itsuro Oda

[permalink] [raw]
Subject: [PATCH] /proc/cpumem

Hi, Eric and all

Attached is an implementation of /proc/cpumem.
/proc/cpumem shows the valid physical memory ranges.

* i386 and x86_64
* implement valid_phys_addr_range() and use it.
(the first argument of the i386 version is a little uncomfortable.)
* /dev/mem of the i386 version should be modified, but is not yet.

example: amd64 8GB Mem
# cat /proc/cpumem
0000000000000000 000000000009b800
0000000000100000 00000000fbe70000
0000000100000000 0000000100000000
#
start address and size. hex digit.
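
A user-space reader for this format can be very small. The sketch below
parses one line into a start and size; struct cpumem_range and
parse_cpumem_line() are hypothetical names, and it assumes every line
has exactly the two hex fields shown in the example.

```c
#include <stdio.h>

/* One line of /proc/cpumem: start address and size, both in hex. */
struct cpumem_range {
    unsigned long long start;
    unsigned long long size;
};

/* Parse "start size" from one line. Returns 1 on success, 0 otherwise. */
static int parse_cpumem_line(const char *line, struct cpumem_range *r)
{
    return sscanf(line, "%llx %llx", &r->start, &r->size) == 2;
}
```

For the second line of the example, start + size gives an end address
of 0xfbf70000.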

Any comments, recommendations and suggestions are welcome.

BTW, does kexec/kdump not run on 2.6.11-rc3-mm2 ?
How do I get and try out the latest kexec/kdump ?

Thanks.
--
Itsuro ODA <[email protected]>

---
--- linux-2.6.11-rc3-mm2/drivers/char/mem.c 2005-02-16 15:36:31.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/drivers/char/mem.c 2005-02-16 23:32:15.244876816 +0900
@@ -25,6 +25,9 @@
#include <linux/device.h>
#include <linux/highmem.h>
#include <linux/crash_dump.h>
+#include <linux/bootmem.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>

#include <asm/uaccess.h>
#include <asm/io.h>
@@ -759,3 +762,125 @@
}

fs_initcall(chr_dev_init);
+
+#ifdef CONFIG_PROC_FS
+/*
+ * /proc/cpumem: show valid physical address range
+ */
+struct cpumem_info {
+	unsigned long long addr;
+	unsigned long long size;
+};
+
+static void *cpumem_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct cpumem_info *p = m->private;
+	unsigned long long end = (unsigned long long)max_pfn << PAGE_SHIFT;
+	unsigned long long addr;
+	size_t size;
+	int found = 0;
+
+	(*pos)++;
+
+	if (p->addr >= end) {
+		return NULL;
+	}
+
+	/* always start page boundary */
+	addr = ((p->addr + p->size + PAGE_SIZE - 1) >> PAGE_SHIFT) << PAGE_SHIFT;
+	size = 0xf0000000;
+
+	while (addr < end) {
+		if (valid_phys_addr_range(addr, &size)) {
+			if (!found) {
+				found = 1;
+				p->addr = addr;
+				p->size = size;
+			} else {
+				p->size += size;
+			}
+			addr += size;
+			size = 0xf0000000;
+		} else {
+			if (found) {
+				return p;
+			}
+			addr += PAGE_SIZE;
+		}
+	}
+
+	return found ? p : NULL;
+}
+
+static void *cpumem_start(struct seq_file *m, loff_t *pos)
+{
+	struct cpumem_info *p = m->private;
+	loff_t n = 0;
+
+	p->addr = 0;
+	p->size = 0;
+
+	while (n <= *pos) {
+		if (!cpumem_next(m, NULL, &n)) {
+			return NULL;
+		}
+	}
+
+	return p;
+}
+
+static void cpumem_stop(struct seq_file *m, void *v)
+{
+}
+
+static int cpumem_show(struct seq_file *m, void *v)
+{
+	struct cpumem_info *p = m->private;
+	unsigned long long end = (unsigned long long)max_pfn << PAGE_SHIFT;
+
+	if (p->addr < end) {
+		seq_printf(m, "%016llx %016llx\n", p->addr, p->size);
+	}
+	return 0;
+}
+
+struct seq_operations cpumem_op = {
+	.start = cpumem_start,
+	.next = cpumem_next,
+	.stop = cpumem_stop,
+	.show = cpumem_show
+};
+
+static int cpumem_open(struct inode *inode, struct file *file)
+{
+	int res = seq_open(file, &cpumem_op);
+	if (!res) {
+		struct seq_file *m = file->private_data;
+		m->private = kmalloc(sizeof(struct cpumem_info), GFP_KERNEL);
+		if (!m->private) {
+			seq_release(inode, file);
+			return -ENOMEM;
+		}
+	}
+	return res;
+}
+
+static struct file_operations proc_cpumem_operations = {
+	.open = cpumem_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release_private
+};
+
+static int __init cpumem_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	entry = create_proc_entry("cpumem", 0, NULL);
+	if (entry) {
+		entry->proc_fops = &proc_cpumem_operations;
+	}
+	return 0;
+}
+__initcall(cpumem_init);
+#endif /* CONFIG_PROC_FS */

---
--- linux-2.6.11-rc3-mm2/arch/i386/mm/init.c 2005-02-16 15:36:29.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/arch/i386/mm/init.c 2005-02-16 23:32:29.499709752 +0900
@@ -248,6 +248,47 @@
return 0;
}

+int valid_phys_addr_range(unsigned long long phys_addr, size_t *size)
+{
+ int i;
+ unsigned long long addr, end;
+ efi_memory_desc_t *md;
+
+ if (efi_enabled) {
+ for (i = 0; i < memmap.nr_map; i++) {
+ md = &memmap.map[i];
+ if (!is_available_memory(md)) {
+ continue;
+ }
+ addr = md->phys_addr;
+ end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT);
+ if ((phys_addr >= addr) && (phys_addr < end)) {
+ if (*size > end - phys_addr) {
+ *size = end - phys_addr;
+ }
+ return 1;
+ }
+ }
+ return 0;
+ }
+
+ for (i = 0; i < e820.nr_map; i++) {
+ if (e820.map[i].type != E820_RAM) {
+ continue;
+ }
+ addr = e820.map[i].addr;
+ end = e820.map[i].addr + e820.map[i].size;
+ if ((phys_addr >= addr) && (phys_addr < end)) {
+ if (*size > end - phys_addr) {
+ *size = end - phys_addr;
+ }
+ return 1;
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL(valid_phys_addr_range);
+
#ifdef CONFIG_HIGHMEM
pte_t *kmap_pte;
pgprot_t kmap_prot;

---
--- linux-2.6.11-rc3-mm2/include/asm-i386/io.h 2004-12-25 06:35:40.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/include/asm-i386/io.h 2005-02-16 23:36:24.454991120 +0900
@@ -90,6 +90,12 @@
*/
#define page_to_phys(page) ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)

+/*
+ * for /dev/mem
+ */
+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
+extern int valid_phys_addr_range(unsigned long long, size_t *);
+
extern void __iomem * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);

/**

---
--- linux-2.6.11-rc3-mm2/arch/x86_64/mm/init.c 2005-02-16 15:36:30.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/arch/x86_64/mm/init.c 2005-02-16 16:23:08.000000000 +0900
@@ -22,6 +22,7 @@
#include <linux/pagemap.h>
#include <linux/bootmem.h>
#include <linux/proc_fs.h>
+#include <linux/module.h>

#include <asm/processor.h>
#include <asm/system.h>
@@ -395,6 +396,27 @@
return 0;
}

+int valid_phys_addr_range(unsigned long phys_addr, size_t *size)
+{
+ int i;
+ unsigned long end;
+
+ for (i = 0; i < e820.nr_map; i++) {
+ if (e820.map[i].type != E820_RAM) {
+ continue;
+ }
+ end = e820.map[i].addr + e820.map[i].size;
+ if (phys_addr >= e820.map[i].addr && phys_addr < end) {
+ if (*size > end - phys_addr) {
+ *size = end - phys_addr;
+ }
+ return 1;
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL(valid_phys_addr_range);
+
extern int swiotlb_force;

/*

---
--- linux-2.6.11-rc3-mm2/include/asm-x86_64/io.h 2005-02-16 15:36:12.000000000 +0900
+++ linux-2.6.11-rc3-mm2-test/include/asm-x86_64/io.h 2005-02-16 16:23:59.000000000 +0900
@@ -123,6 +123,9 @@
{
return __va(address);
}
+
+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
+extern int valid_phys_addr_range(unsigned long, size_t *);
#endif

/*

---

2005-02-16 14:01:11

by Eric W. Biederman

Subject: Re: [PATCH] /proc/cpumem

Itsuro Oda <[email protected]> writes:

> Hi, Eric and all
>
> Attached is an implementation of /proc/cpumem.
> /proc/cpumem shows the valid physical memory ranges.

Interesting. What I had in mind when I proposed this
was something based on struct resource that works
like /proc/iomem on x86 but can meaningfully
be used on systems where RAM lives in a separate
address space from I/O device memory.

> example: amd64 8GB Mem
> # cat /proc/cpumem
> 0000000000000000 000000000009b800
> 0000000000100000 00000000fbe70000
> 0000000100000000 0000000100000000
> #
> start address and size. hex digit.

The lack of a type field loses a fair amount of functionality compared
to /proc/iomem. In particular you can't see where the ACPI data is.

The other direction something like this can go is to dump
the data structures in linux/mmzone.h.

> Any comments, recommendations and suggestions are welcome.
>
> BTW, does kexec/kdump not run on 2.6.11-rc3-mm2?
> How can I get and try out the latest kexec/kdump?

I'm not quite certain what is happening.

I have been playing with kexec user space a little bit and a new
development release is at:
http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz

I have written a first pass at a user space core dump generator,
using /dev/mem. /sbin/kexec still needs some work to prepare
the ELF headers before a crash.

Eric

2005-02-17 00:18:48

by YAMAMOTO Takashi

Subject: Re: [PATCH] /proc/cpumem

hi,

> + while (addr < end) {
> + if (valid_phys_addr_range(addr, &size)) {
> + if (!found) {
> + found = 1;
> + p->addr = addr;
> + p->size = size;
> + } else {
> + p->size += size;
> + }
> + addr += size;
> + size = 0xf0000000;
> + } else {
> + if (found) {
> + return p;
> + }
> + addr += PAGE_SIZE;
> + }
> + }

doesn't this loop take a very long time if you have a large hole?

i'd suggest changing valid_phys_addr_range to fill &size even when
it returns false, so that the caller can skip the hole efficiently.
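
A minimal user-space sketch of that suggestion follows. It is a model under stated assumptions, not the kernel code: struct ram_range is a hypothetical stand-in for the E820_RAM entries, lookup_range() and count_lookups() are made-up names, and unlike the posted valid_phys_addr_range() the size argument is output only (the 0xf0000000 input cap is dropped for brevity).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one E820_RAM entry; not the kernel's struct. */
struct ram_range {
	unsigned long long addr;
	unsigned long long size;
};

/*
 * Model of the suggested interface. If addr lies inside a RAM range,
 * return 1 and set *size to the bytes left in that range. Otherwise
 * return 0 and set *size to the distance from addr to the next RAM
 * range above it (0 if there is none), so the caller can skip the hole.
 */
static int lookup_range(const struct ram_range *map, size_t n,
			unsigned long long addr, unsigned long long *size)
{
	unsigned long long next = 0;	/* lowest range start above addr */
	size_t i;

	assert(map != NULL && size != NULL);
	for (i = 0; i < n; i++) {
		unsigned long long end = map[i].addr + map[i].size;

		if (addr >= map[i].addr && addr < end) {
			*size = end - addr;
			return 1;
		}
		if (map[i].addr > addr && (next == 0 || map[i].addr < next))
			next = map[i].addr;
	}
	*size = next ? next - addr : 0;
	return 0;
}

/* Count the lookups needed to walk [0, end); each hole costs one call. */
static unsigned int count_lookups(const struct ram_range *map, size_t n,
				  unsigned long long end)
{
	unsigned long long addr = 0, size;
	unsigned int calls = 0;

	while (addr < end) {
		calls++;
		if (!lookup_range(map, n, addr, &size) && size == 0)
			break;	/* no RAM above addr */
		addr += size;
	}
	return calls;
}
```

With the three ranges from the /proc/cpumem example, the whole walk up to 0x200000000 takes five lookups. Stepping one page at a time, and assuming 4 KiB pages, the hole between 0xfbf70000 and 0x100000000 alone would take 0x4090 (16528) probes.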

YAMAMOTO Takashi

2005-02-17 00:43:32

by Itsuro Oda

Subject: Re: [PATCH] /proc/cpumem

Hi Eric,

> The lack of a type field loses a fair amount of functionality compared
> to /proc/iomem. In particular you can't see where the ACPI data is.

Hmm, restricting this to System RAM only may be too pessimistic.
(One of the motivations for this work is to use /dev/mem safely.
"dd if=/dev/mem of=xxx" causes a panic on my amd64 (8GB mem) machine,
since reading from addresses around 0xfe000000 causes a machine
check. Hmm, this area is marked as "reserved", not as an ACPI area.
The ACPI area can be read.)

Ok, I will add a type field.

> The other direction something like this can go is to dump
> the data structures in linux/mmzone.h

Do you mean defining a data structure in linux/mmzone.h?

I used to think a dedicated struct was not necessary for this work,
but now I think it is better to define a struct for this.
Let me think about it.

> I have written a first pass at a user space core dump generator,
> using /dev/mem. /sbin/kexec still needs some work to prepare
> the ELF headers before a crash.

I am looking forward to this :-)

And, you mentioned a couple of weeks ago:
> Anyway one thing I want to do is actually drop the apic shutdown
> code altogether in this code path. I threw it in there to
> ease the transition from the old code base to the new, but
> if that code is causing issues.... So this is probably a good time
> to start testing that.

How about this ?

Thanks.
--
Itsuro ODA <[email protected]>

2005-02-17 05:07:04

by Vivek Goyal

Subject: Re: [Fastboot] [PATCH] /proc/cpumem

Hi,

On Wed, 2005-02-16 at 14:19, Itsuro Oda wrote:

>
> BTW, does not kexec/kdump run on 2.6.11-rc3-mm2 ?
> How do I get and examine the latest kexec/kdump ?

Currently kdump is broken. I am working on the ELF header generation part in
kexec-tools. Next week I should be able to post the initial patches.

---
> --- linux-2.6.11-rc3-mm2/arch/i386/mm/init.c 2005-02-16 15:36:29.000000000 +0900
> +++ linux-2.6.11-rc3-mm2-test/arch/i386/mm/init.c 2005-02-16 23:32:29.499709752 +0900
> @@ -248,6 +248,47 @@
> return 0;
> }
>
> +int valid_phys_addr_range(unsigned long long phys_addr, size_t *size)
> +{
> + int i;
> + unsigned long long addr, end;
> + efi_memory_desc_t *md;
> +
> + if (efi_enabled) {
> + for (i = 0; i < memmap.nr_map; i++) {
> + md = &memmap.map[i];
> + if (!is_available_memory(md)) {
> + continue;
> + }
> + addr = md->phys_addr;
> + end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT);
> + if ((phys_addr >= addr) && (phys_addr < end)) {
> + if (*size > end - phys_addr) {
> + *size = end - phys_addr;
> + }
> + return 1;
> + }
> + }
> + return 0;
> + }


I thought the EFI-related data structures are marked __initdata and will be gone after initialization (see efi.c).


Thanks
Vivek


2005-02-17 06:18:28

by Itsuro Oda

Subject: Re: [Fastboot] [PATCH] /proc/cpumem

Hi,

> I thought the EFI-related data structures are marked __initdata and will be gone after initialization (see efi.c).

Oops, you are right.
And devmem_is_allowed makes the same mistake :-)
(I don't know who wrote it.)

Thanks.
--
Itsuro ODA <[email protected]>

2005-02-17 09:58:08

by Eric W. Biederman

Subject: Re: [Fastboot] Re: [PATCH] /proc/cpumem

Itsuro Oda <[email protected]> writes:

> Hi Eric,
>
> > The lack of a type field loses a fair amount of functionality compared
> > to /proc/iomem. In particular you can't see where the ACPI data is.
>
> Hmm, restricting System RAM only may be too pessimistic.
> (One of motivations of this work is for using /dev/mem safely.
> "dd if=/dev/mem of=xxx" causes panic on my amd64(8GB mem) machine
> since reading from address around 0xfe000000 causes a machine
> check. hmm, this area is marked as "reserved". not ACPI area.
> ACPI area can be read.)
>
> Ok, I will add a type field.

To be very clear: I do not believe it is necessary for x86. There
is already sufficient information elsewhere to handle this.

> > The other direction something like this can go is to dump
> > the data structures in linux/mmzone.h
>
> Do you mean defining a data structure in linux/mmzone.h ?
>
> I used to think a particular struct is not necessary for this work,
> but now I think it is better to define a struct for this.
> Let me consider.

To be clear there are two pieces of information that are needed.
1) The list of physical memory areas and what they are.
/proc/iomem does a good job of this.
2) The list of which memory areas the kernel is using.
It is the pgdat_t and related structures that define this.
For the purposes of a core dump we want to capture this
information before the kernel crashes and use it afterward.

> > I have written a first pass at a user space core dump generator,
> > using /dev/mem. /sbin/kexec still needs some work to prepare
> > the ELF headers before a crash.
>
> I am looking forward this :-)
>
> And, you mentioned a couple of weeks ago:
> > Anyway one thing I want to do is actually drop the apic shutdown
> > code altogether in this code path. I threw it in there to
> > ease the transition from the old code base to the new, but
> > if that code is causing issues.... So this is probably a good time
> > to start testing that.
>
> How about this ?

My role in this is that of maintainer and architect. On a practical
level I gain nothing from a working crash-dump/kexec-on-panic
implementation except it stops being a gating factor for the rest
of the kexec code. So while many times I can see what needs to be
done, it is hard for me to justify doing it. So most of the time,
I will weigh in with code only when I see a particular blind spot
on the part of the implementors.

The parties I see actively working on the crash dump implementation
are currently a group from IBM and you guys from valinux.co.jp.
One of the primaries at IBM has been on vacation which is likely
why we have not seen anything out of them for the last couple of
weeks.

But also, this is open source software; it will be done when it
is done.

Eric

2005-02-17 18:19:10

by Dave Jones

Subject: Re: [PATCH] /proc/cpumem

On Wed, Feb 16, 2005 at 05:49:51PM +0900, Itsuro Oda wrote:
> Hi, Eric and all
>
> Attached is an implementation of /proc/cpumem.
> /proc/cpumem shows the valid physical memory ranges.
>
> * i386 and x86_64
> * implement valid_phys_addr_range() and use it.
> (the first argument of the i386 version is a little awkward.)
> * /dev/mem of the i386 version should be modified, but that is not done yet.
>
> example: amd64 8GB Mem
> # cat /proc/cpumem
> 0000000000000000 000000000009b800
> 0000000000100000 00000000fbe70000
> 0000000100000000 0000000100000000
> #
> start address and size. hex digit.
>
> Any comments, recommendations and suggestions are welcome.

It may make more sense to export the entire e820 (or similar)
BIOS memory tables. It is probably better off in sysfs than adding
more cruft to procfs, too.

Dave

2005-02-17 19:49:35

by Eric W. Biederman

Subject: Re: [Fastboot] Re: [PATCH] /proc/cpumem

Dave Jones <[email protected]> writes:

> On Wed, Feb 16, 2005 at 05:49:51PM +0900, Itsuro Oda wrote:
> > Hi, Eric and all
> >
> > Attached is an implementation of /proc/cpumem.
> > /proc/cpumem shows the valid physical memory ranges.
> >
> > * i386 and x86_64
> > * implement valid_phys_addr_range() and use it.
> > (the first argument of the i386 version is a little awkward.)
> > * /dev/mem of the i386 version should be modified, but that is not done yet.
> >
> > example: amd64 8GB Mem
> > # cat /proc/cpumem
> > 0000000000000000 000000000009b800
> > 0000000000100000 00000000fbe70000
> > 0000000100000000 0000000100000000
> > #
> > start address and size. hex digit.
> >
> > Any comments, recommendations and suggestions are welcome.
>
> It may make more sense to export the entire e820 (or similar)
> bios memory tables. Probably better off in sysfs than adding
> more cruft to procfs too.

Agreed. In practice we actually do this already with /proc/iomem.
Except that we truncate everything above 4GB, and we allow the
map to get mangled with mem=xxx options.
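
To make the comparison with /proc/iomem concrete, here is a rough user-space sketch that parses one line of its "start-end : name" format and converts the inclusive end into a size. The names struct iomem_entry, parse_iomem_line(), is_system_ram() and iomem_size() are hypothetical, and the sample line in the note below is illustrative, derived from the range in the /proc/cpumem example rather than captured from a real machine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed /proc/iomem line (hypothetical user-space type). */
struct iomem_entry {
	unsigned long long start;
	unsigned long long end;		/* inclusive, as printed by /proc/iomem */
	char name[64];
};

/*
 * Parse "<start>-<end> : <name>" with hexadecimal addresses. Leading
 * whitespace (used for nested resources) is skipped.
 * Returns 1 on success, 0 if the line does not match.
 */
static int parse_iomem_line(const char *line, struct iomem_entry *out)
{
	assert(line != NULL && out != NULL);
	return sscanf(line, " %llx-%llx : %63[^\n]",
		      &out->start, &out->end, out->name) == 3;
}

/* True if the entry is ordinary RAM rather than a device or ACPI region. */
static int is_system_ram(const struct iomem_entry *e)
{
	return strcmp(e->name, "System RAM") == 0;
}

/* Size in bytes; the end address is inclusive, hence the +1. */
static unsigned long long iomem_size(const struct iomem_entry *e)
{
	return e->end - e->start + 1;
}
```

A line such as "00100000-fbf6ffff : System RAM" would come back with size 0xfbe70000, matching the second line of the /proc/cpumem example, while the name field carries the type information that /proc/cpumem lacks.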

I brought up the idea of a /proc/cpumem by analogy, because on platforms
that have an IOMMU and where memory is in a distinct address space
there have been complaints that /proc/iomem just won't work. But it
is simple enough to do something that is just for the CPU's memory.

As for how to do this cleanly, this looks like the start of that discussion.

Eric

2005-02-18 06:17:21

by Itsuro Oda

Subject: Re: [Fastboot] Re: [PATCH] /proc/cpumem

Hi,

On 17 Feb 2005 02:55:31 -0700
[email protected] (Eric W. Biederman) wrote:

> My role in this is that of maintainer and architect. On a practical
> level I gain nothing from a working crash-dump/kexec-on-panic
> implementation except it stops being a gating factor for the rest
> of the kexec code. So while many times I can see what needs to be
> done it is hard for me to justify doing it. So a lot of times
> where I will weigh in with code is when I see a particular blind spot
> on the part of the implementors.

I see. I would like to contribute as much as I can.

Thanks.
--
Itsuro ODA <[email protected]>

2005-02-18 07:24:30

by Eric W. Biederman

Subject: Re: [Fastboot] Re: [PATCH] /proc/cpumem

Itsuro Oda <[email protected]> writes:

> I see. I would like to contribute as much as I can.

Pick some piece that you have an affinity for and work on it.
Problems are best solved by those who see them and by those who care :)

I believe Vivek Goyal is currently working on the remaining user space
piece, and expects to have something in a week or so.

Eric