2008-07-07 03:21:20

by Huang, Ying

Subject: [PATCH -mm 1/2] kexec jump -v12: kexec jump

This patch provides an enhancement to kexec/kdump. It implements
the following features:

- Backup/restore memory used by the original kernel before/after
kexec.

- Save/restore CPU state before/after kexec.

The features of this patch can be used as a general method to call a
program in physical mode (with paging turned off). This can be used to
call BIOS code under Linux.


kexec-tools needs to be patched to support kexec jump. The patches and
the precompiled kexec can be downloaded from the following URLs:

source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2
patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2
binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10


Usage example of calling some physical mode code and returning:

1. Compile and install the patched kernel with the following options selected:

CONFIG_X86_32=y
CONFIG_KEXEC=y
CONFIG_PM=y
CONFIG_KEXEC_JUMP=y

2. Build the patched kexec-tools or download the pre-built binary.

3. Build a physical mode executable, named for example "phy_mode".

4. Boot the kernel compiled in step 1.

5. Load the physical mode executable with /sbin/kexec. The shell command
line can be as follows:

/sbin/kexec --load-preserve-context --args-none phy_mode

6. Call the physical mode executable with the following shell command line:

/sbin/kexec -e


Implementation point:

To support jumping without reserving memory, one shadow backup page
(source page) is allocated for each page used by the kexeced code image
(destination page). During kexec_load, the image of the kexeced code is
loaded into the source pages, and before executing, the destination pages
and the source pages are swapped, so the contents of the destination
pages are backed up. The destination pages and the source pages are
swapped both before jumping to the kexeced code image and after jumping
back to the original kernel.
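
The swap of one destination page with its source page can be sketched in
user space as below. This is only an illustration under stated
assumptions: the names swap_one_page and SKETCH_PAGE_SIZE are made up
for the example, and plain memcpy stands in for the "rep ; movsl" copies
in swap_pages. The scratch page plays the role of image->swap_page.

```c
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size (i386) */

/*
 * Exchange the contents of a destination page and its source (shadow)
 * page through a third scratch page, using three copies:
 * source -> scratch, dest -> source, scratch -> dest.
 */
void swap_one_page(unsigned char *dest, unsigned char *source,
		   unsigned char *scratch)
{
	memcpy(scratch, source, SKETCH_PAGE_SIZE);
	memcpy(source, dest, SKETCH_PAGE_SIZE);
	memcpy(dest, scratch, SKETCH_PAGE_SIZE);
}
```

Because the operation is its own inverse, calling it a second time on
the same pair of pages puts the original contents back, which is why
the same routine can serve both before the jump and after jumping back.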

The C ABI (calling convention) is used as the communication protocol
between the kernel and the called code.

A flag named KEXEC_PRESERVE_CONTEXT is added to sys_kexec_load to
indicate that the loaded kernel image is used for jumping back.
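
How the new flag interacts with the set of legal flags can be sketched
as below. The flag values are copied from the include/linux/kexec.h hunk
of this patch; the helper check_kexec_flags and its "legal" parameter
are hypothetical, and the check itself is an assumption about how the
loader rejects unknown bits (it is not part of this patch).

```c
#include <errno.h>

/* Values copied from the include/linux/kexec.h hunk of this patch. */
#define KEXEC_ON_CRASH		0x00000001UL
#define KEXEC_PRESERVE_CONTEXT	0x00000002UL
#define KEXEC_ARCH_MASK		0xffff0000UL

/* Legal flag sets without and with CONFIG_KEXEC_JUMP. */
#define SKETCH_FLAGS_NO_JUMP	(KEXEC_ON_CRASH)
#define SKETCH_FLAGS_JUMP	(KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)

/*
 * Hypothetical helper: accept flags only if every bit outside the
 * architecture field belongs to the legal set.
 */
int check_kexec_flags(unsigned long flags, unsigned long legal)
{
	if ((flags & legal) != (flags & ~KEXEC_ARCH_MASK))
		return -EINVAL;
	return 0;
}
```

Under this assumption, a load request carrying KEXEC_PRESERVE_CONTEXT is
refused with -EINVAL on a kernel built without CONFIG_KEXEC_JUMP, so
image->preserve_context can only be set when the option is enabled.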


ChangeLog:

v12:

- Add a Kconfig option KEXEC_JUMP to resolve the dependency problem.

- Merge kexec_jump() into kernel_kexec().

v10:

- Device state save/restore related code is split into another patch
because it depends on the device hibernation/restore callbacks and is
likely to change.

- C ABI (calling convention) is used as communication protocol between
kernel and called code.

- Code cleanup: CPU state save/restore code goes in relocate_kernel().

v9:

- pm_mutex is locked during kexec jump to avoid potential conflict
between kexec jump and suspend/resume/hibernation.

- Split /dev/oldmem writing and kimagecore patch out, keep only the
core function.

v8:

- Split kexec jump patchset from kexec based hibernation patchset.

- Merge various KEXEC_PRESERVE_* flags into one KEXEC_PRESERVE_CONTEXT
because there is no need for such subtle control.

- Delete variable argument based "kernel to kernel" communication
mechanism from basic kexec jump patchset.

v7:

- Refactor kexec jump to be a command driven programming model.

- Use kexec_lock to do synchronization.

v6:

- Refactor kexec jump to be a general facility to call real mode code.

v5:

- A flag (KEXEC_JUMP_BACK) is added to indicate the loaded kernel
image is used for jumping back. The reboot command for jumping back
is removed. This interface is more stable (proposed by Eric
Biederman).

- NX bit handling support for kexec is added.

- Merge machine_kexec and machine_kexec_jump, remove NO_RET attribute
from machine_kexec.

- Pass the jump back entry to the kexeced kernel via the kernel command
line (parsed by a user space tool via /proc/cmdline instead of by the
kernel). The original corresponding boot parameter and sysfs code are
removed.

v4:

- Two reboot commands are merged back into one because the underlying
implementation is the same.

- Jumping without reserving memory is implemented. As a side effect,
two direction jumping is implemented.

- A jump back protocol is defined and documented. The original kernel
and kexeced kernel are more independent from each other.

- The CPU state save/restore code is merged into relocate_kernel.S.

v3:

- The reboot command LINUX_REBOOT_CMD_KJUMP is split into two
reboot commands to reflect the different functions.

- Documentation is added for the new kernel parameters.

- /sys/kernel/kexec_jump_buf_pfn is made writable; it is used for
memory image restoring.

- Console restoring after jumping back is implemented.

v2:

- The kexec jump implementation is put into the kexec/kdump framework
instead of the software suspend framework. The device and CPU state
save/restore code of software suspend is called when needed.

- The same code path is used for both kexecing a new kernel and jumping
back to the original kernel.


Currently, only the i386 architecture is supported. The patchset is based
on Linux kernel 2.6.26-rc8-mm1, and has been tested on an IBM T42.


Signed-off-by: Huang Ying <[email protected]>

---
arch/powerpc/kernel/machine_kexec.c | 2
arch/sh/kernel/machine_kexec.c | 2
arch/x86/Kconfig | 7 +
arch/x86/kernel/machine_kexec_32.c | 27 ++++-
arch/x86/kernel/machine_kexec_64.c | 2
arch/x86/kernel/relocate_kernel_32.S | 174 ++++++++++++++++++++++++++++++-----
include/asm-x86/kexec.h | 18 ++-
include/linux/kexec.h | 17 ++-
kernel/kexec.c | 57 +++++++++++
kernel/sys.c | 31 +-----
10 files changed, 269 insertions(+), 68 deletions(-)

--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -22,6 +22,7 @@
#include <asm/cpufeature.h>
#include <asm/desc.h>
#include <asm/system.h>
+#include <asm/cacheflush.h>

#define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
static u32 kexec_pgd[1024] PAGE_ALIGNED;
@@ -85,10 +86,12 @@ static void load_segments(void)
* reboot code buffer to allow us to avoid allocations
* later.
*
- * Currently nothing.
+ * Make control page executable.
*/
int machine_kexec_prepare(struct kimage *image)
{
+ if (nx_enabled)
+ set_pages_x(image->control_code_page, 1);
return 0;
}

@@ -98,16 +101,24 @@ int machine_kexec_prepare(struct kimage
*/
void machine_kexec_cleanup(struct kimage *image)
{
+ if (nx_enabled)
+ set_pages_nx(image->control_code_page, 1);
}

/*
* Do not allocate memory (or fail in any way) in machine_kexec().
* We are past the point of no return, committed to rebooting now.
*/
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
{
unsigned long page_list[PAGES_NR];
void *control_page;
+ asmlinkage unsigned long
+ (*relocate_kernel_ptr)(unsigned long indirection_page,
+ unsigned long control_page,
+ unsigned long start_address,
+ unsigned int has_pae,
+ unsigned int preserve_context);

tracer_disable();

@@ -115,10 +126,11 @@ NORET_TYPE void machine_kexec(struct kim
local_irq_disable();

control_page = page_address(image->control_code_page);
- memcpy(control_page, relocate_kernel, PAGE_SIZE);
+ memcpy(control_page, relocate_kernel, PAGE_SIZE/2);

+ relocate_kernel_ptr = control_page;
page_list[PA_CONTROL_PAGE] = __pa(control_page);
- page_list[VA_CONTROL_PAGE] = (unsigned long)relocate_kernel;
+ page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
page_list[PA_PGD] = __pa(kexec_pgd);
page_list[VA_PGD] = (unsigned long)kexec_pgd;
#ifdef CONFIG_X86_PAE
@@ -131,6 +143,7 @@ NORET_TYPE void machine_kexec(struct kim
page_list[VA_PTE_0] = (unsigned long)kexec_pte0;
page_list[PA_PTE_1] = __pa(kexec_pte1);
page_list[VA_PTE_1] = (unsigned long)kexec_pte1;
+ page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page) << PAGE_SHIFT);

/* The segment registers are funny things, they have both a
* visible and an invisible part. Whenever the visible part is
@@ -149,8 +162,10 @@ NORET_TYPE void machine_kexec(struct kim
set_idt(phys_to_virt(0),0);

/* now call it */
- relocate_kernel((unsigned long)image->head, (unsigned long)page_list,
- image->start, cpu_has_pae);
+ image->start = relocate_kernel_ptr((unsigned long)image->head,
+ (unsigned long)page_list,
+ image->start, cpu_has_pae,
+ image->preserve_context);
}

void arch_crash_save_vmcoreinfo(void)
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -83,6 +83,7 @@ struct kimage {

unsigned long start;
struct page *control_code_page;
+ struct page *swap_page;

unsigned long nr_segments;
struct kexec_segment segment[KEXEC_SEGMENT_MAX];
@@ -98,18 +99,20 @@ struct kimage {
unsigned int type : 1;
#define KEXEC_TYPE_DEFAULT 0
#define KEXEC_TYPE_CRASH 1
+ unsigned int preserve_context : 1;
};



/* kexec interface functions */
-extern NORET_TYPE void machine_kexec(struct kimage *image) ATTRIB_NORET;
+extern void machine_kexec(struct kimage *image);
extern int machine_kexec_prepare(struct kimage *image);
extern void machine_kexec_cleanup(struct kimage *image);
extern asmlinkage long sys_kexec_load(unsigned long entry,
unsigned long nr_segments,
struct kexec_segment __user *segments,
unsigned long flags);
+extern int kernel_kexec(void);
#ifdef CONFIG_COMPAT
extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
unsigned long nr_segments,
@@ -156,8 +159,9 @@ extern struct kimage *kexec_crash_image;
#define kexec_flush_icache_page(page)
#endif

-#define KEXEC_ON_CRASH 0x00000001
-#define KEXEC_ARCH_MASK 0xffff0000
+#define KEXEC_ON_CRASH 0x00000001
+#define KEXEC_PRESERVE_CONTEXT 0x00000002
+#define KEXEC_ARCH_MASK 0xffff0000

/* These values match the ELF architecture values.
* Unless there is a good reason that should continue to be the case.
@@ -174,7 +178,12 @@ extern struct kimage *kexec_crash_image;
#define KEXEC_ARCH_MIPS_LE (10 << 16)
#define KEXEC_ARCH_MIPS ( 8 << 16)

-#define KEXEC_FLAGS (KEXEC_ON_CRASH) /* List of defined/legal kexec flags */
+/* List of defined/legal kexec flags */
+#ifndef CONFIG_KEXEC_JUMP
+#define KEXEC_FLAGS KEXEC_ON_CRASH
+#else
+#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
+#endif

#define VMCOREINFO_BYTES (4096)
#define VMCOREINFO_NOTE_NAME "VMCOREINFO"
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -24,6 +24,8 @@
#include <linux/utsrelease.h>
#include <linux/utsname.h>
#include <linux/numa.h>
+#include <linux/suspend.h>
+#include <linux/device.h>

#include <asm/page.h>
#include <asm/uaccess.h>
@@ -242,6 +244,12 @@ static int kimage_normal_alloc(struct ki
goto out;
}

+ image->swap_page = kimage_alloc_control_pages(image, 0);
+ if (!image->swap_page) {
+ printk(KERN_ERR "Could not allocate swap buffer\n");
+ goto out;
+ }
+
result = 0;
out:
if (result == 0)
@@ -986,6 +994,8 @@ asmlinkage long sys_kexec_load(unsigned
if (result)
goto out;

+ if (flags & KEXEC_PRESERVE_CONTEXT)
+ image->preserve_context = 1;
result = machine_kexec_prepare(image);
if (result)
goto out;
@@ -1411,3 +1421,50 @@ static int __init crash_save_vmcoreinfo_
}

module_init(crash_save_vmcoreinfo_init)
+
+/**
+ * kernel_kexec - reboot the system
+ *
+ * Move into place and start executing a preloaded standalone
+ * executable. If nothing was preloaded return an error.
+ */
+int kernel_kexec(void)
+{
+ int error = 0;
+
+ if (xchg(&kexec_lock, 1))
+ return -EBUSY;
+ if (!kexec_image) {
+ error = -EINVAL;
+ goto Unlock;
+ }
+
+ if (kexec_image->preserve_context) {
+#ifdef CONFIG_KEXEC_JUMP
+ local_irq_disable();
+ save_processor_state();
+#endif
+ } else {
+ blocking_notifier_call_chain(&reboot_notifier_list,
+ SYS_RESTART, NULL);
+ system_state = SYSTEM_RESTART;
+ device_shutdown();
+ sysdev_shutdown();
+ printk(KERN_EMERG "Starting new kernel\n");
+ machine_shutdown();
+ }
+
+ machine_kexec(kexec_image);
+
+ if (kexec_image->preserve_context) {
+#ifdef CONFIG_KEXEC_JUMP
+ restore_processor_state();
+ local_irq_enable();
+#endif
+ }
+
+ Unlock:
+ xchg(&kexec_lock, 0);
+
+ return error;
+}
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -301,26 +301,6 @@ void kernel_restart(char *cmd)
}
EXPORT_SYMBOL_GPL(kernel_restart);

-/**
- * kernel_kexec - reboot the system
- *
- * Move into place and start executing a preloaded standalone
- * executable. If nothing was preloaded return an error.
- */
-static void kernel_kexec(void)
-{
-#ifdef CONFIG_KEXEC
- struct kimage *image;
- image = xchg(&kexec_image, NULL);
- if (!image)
- return;
- kernel_restart_prepare(NULL);
- printk(KERN_EMERG "Starting new kernel\n");
- machine_shutdown();
- machine_kexec(image);
-#endif
-}
-
static void kernel_shutdown_prepare(enum system_states state)
{
blocking_notifier_call_chain(&reboot_notifier_list,
@@ -425,10 +405,15 @@ asmlinkage long sys_reboot(int magic1, i
kernel_restart(buffer);
break;

+#ifdef CONFIG_KEXEC
case LINUX_REBOOT_CMD_KEXEC:
- kernel_kexec();
- unlock_kernel();
- return -EINVAL;
+ {
+ int ret;
+ ret = kernel_kexec();
+ unlock_kernel();
+ return ret;
+ }
+#endif

#ifdef CONFIG_HIBERNATION
case LINUX_REBOOT_CMD_SW_SUSPEND:
--- a/arch/x86/kernel/relocate_kernel_32.S
+++ b/arch/x86/kernel/relocate_kernel_32.S
@@ -20,11 +20,44 @@
#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
#define PAE_PGD_ATTR (_PAGE_PRESENT)

+/* control_page + PAGE_SIZE/2 ~ control_page + PAGE_SIZE * 3/4 are
+ * used to save some data for jumping back
+ */
+#define DATA(offset) (PAGE_SIZE/2+(offset))
+
+/* Minimal CPU state */
+#define ESP DATA(0x0)
+#define CR0 DATA(0x4)
+#define CR3 DATA(0x8)
+#define CR4 DATA(0xc)
+
+/* other data */
+#define CP_VA_CONTROL_PAGE DATA(0x10)
+#define CP_PA_PGD DATA(0x14)
+#define CP_PA_SWAP_PAGE DATA(0x18)
+#define CP_PA_BACKUP_PAGES_MAP DATA(0x1c)
+
.text
.align PAGE_SIZE
.globl relocate_kernel
relocate_kernel:
- movl 8(%esp), %ebp /* list of pages */
+ /* Save the CPU context, used for jumping back */
+
+ pushl %ebx
+ pushl %esi
+ pushl %edi
+ pushl %ebp
+ pushf
+
+ movl 20+8(%esp), %ebp /* list of pages */
+ movl PTR(VA_CONTROL_PAGE)(%ebp), %edi
+ movl %esp, ESP(%edi)
+ movl %cr0, %eax
+ movl %eax, CR0(%edi)
+ movl %cr3, %eax
+ movl %eax, CR3(%edi)
+ movl %cr4, %eax
+ movl %eax, CR4(%edi)

#ifdef CONFIG_X86_PAE
/* map the control page at its virtual address */
@@ -138,15 +171,25 @@ relocate_kernel:

relocate_new_kernel:
/* read the arguments and say goodbye to the stack */
- movl 4(%esp), %ebx /* page_list */
- movl 8(%esp), %ebp /* list of pages */
- movl 12(%esp), %edx /* start address */
- movl 16(%esp), %ecx /* cpu_has_pae */
+ movl 20+4(%esp), %ebx /* page_list */
+ movl 20+8(%esp), %ebp /* list of pages */
+ movl 20+12(%esp), %edx /* start address */
+ movl 20+16(%esp), %ecx /* cpu_has_pae */
+ movl 20+20(%esp), %esi /* preserve_context */

/* zero out flags, and disable interrupts */
pushl $0
popfl

+ /* save some information for jumping back */
+ movl PTR(VA_CONTROL_PAGE)(%ebp), %edi
+ movl %edi, CP_VA_CONTROL_PAGE(%edi)
+ movl PTR(PA_PGD)(%ebp), %eax
+ movl %eax, CP_PA_PGD(%edi)
+ movl PTR(PA_SWAP_PAGE)(%ebp), %eax
+ movl %eax, CP_PA_SWAP_PAGE(%edi)
+ movl %ebx, CP_PA_BACKUP_PAGES_MAP(%edi)
+
/* get physical address of control page now */
/* this is impossible after page table switch */
movl PTR(PA_CONTROL_PAGE)(%ebp), %edi
@@ -197,8 +240,90 @@ identity_mapped:
xorl %eax, %eax
movl %eax, %cr3

+ movl CP_PA_SWAP_PAGE(%edi), %eax
+ pushl %eax
+ pushl %ebx
+ call swap_pages
+ addl $8, %esp
+
+ /* To be certain of avoiding problems with self-modifying code
+ * I need to execute a serializing instruction here.
+ * So I flush the TLB, it's handy, and not processor dependent.
+ */
+ xorl %eax, %eax
+ movl %eax, %cr3
+
+ /* set all of the registers to known values */
+ /* leave %esp alone */
+
+ testl %esi, %esi
+ jnz 1f
+ xorl %edi, %edi
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %ebp, %ebp
+ ret
+1:
+ popl %edx
+ movl CP_PA_SWAP_PAGE(%edi), %esp
+ addl $PAGE_SIZE, %esp
+2:
+ call *%edx
+
+ /* get the re-entry point of the peer system */
+ movl 0(%esp), %ebp
+ call 1f
+1:
+ popl %ebx
+ subl $(1b - relocate_kernel), %ebx
+ movl CP_VA_CONTROL_PAGE(%ebx), %edi
+ lea PAGE_SIZE(%ebx), %esp
+ movl CP_PA_SWAP_PAGE(%ebx), %eax
+ movl CP_PA_BACKUP_PAGES_MAP(%ebx), %edx
+ pushl %eax
+ pushl %edx
+ call swap_pages
+ addl $8, %esp
+ movl CP_PA_PGD(%ebx), %eax
+ movl %eax, %cr3
+ movl %cr0, %eax
+ orl $(1<<31), %eax
+ movl %eax, %cr0
+ lea PAGE_SIZE(%edi), %esp
+ movl %edi, %eax
+ addl $(virtual_mapped - relocate_kernel), %eax
+ pushl %eax
+ ret
+
+virtual_mapped:
+ movl CR4(%edi), %eax
+ movl %eax, %cr4
+ movl CR3(%edi), %eax
+ movl %eax, %cr3
+ movl CR0(%edi), %eax
+ movl %eax, %cr0
+ movl ESP(%edi), %esp
+ movl %ebp, %eax
+
+ popf
+ popl %ebp
+ popl %edi
+ popl %esi
+ popl %ebx
+ ret
+
/* Do the copies */
- movl %ebx, %ecx
+swap_pages:
+ movl 8(%esp), %edx
+ movl 4(%esp), %ecx
+ pushl %ebp
+ pushl %ebx
+ pushl %edi
+ pushl %esi
+ movl %ecx, %ebx
jmp 1f

0: /* top, read another word from the indirection page */
@@ -226,27 +351,28 @@ identity_mapped:
movl %ecx, %esi /* For every source page do a copy */
andl $0xfffff000, %esi

+ movl %edi, %eax
+ movl %esi, %ebp
+
+ movl %edx, %edi
movl $1024, %ecx
rep ; movsl
- jmp 0b
-
-3:

- /* To be certain of avoiding problems with self-modifying code
- * I need to execute a serializing instruction here.
- * So I flush the TLB, it's handy, and not processor dependent.
- */
- xorl %eax, %eax
- movl %eax, %cr3
+ movl %ebp, %edi
+ movl %eax, %esi
+ movl $1024, %ecx
+ rep ; movsl

- /* set all of the registers to known values */
- /* leave %esp alone */
+ movl %eax, %edi
+ movl %edx, %esi
+ movl $1024, %ecx
+ rep ; movsl

- xorl %eax, %eax
- xorl %ebx, %ebx
- xorl %ecx, %ecx
- xorl %edx, %edx
- xorl %esi, %esi
- xorl %edi, %edi
- xorl %ebp, %ebp
+ lea PAGE_SIZE(%ebp), %esi
+ jmp 0b
+3:
+ popl %esi
+ popl %edi
+ popl %ebx
+ popl %ebp
ret
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -181,7 +181,7 @@ void machine_kexec_cleanup(struct kimage
* Do not allocate memory (or fail in any way) in machine_kexec().
* We are past the point of no return, committed to rebooting now.
*/
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
{
unsigned long page_list[PAGES_NR];
void *control_page;
--- a/arch/sh/kernel/machine_kexec.c
+++ b/arch/sh/kernel/machine_kexec.c
@@ -70,7 +70,7 @@ static void kexec_info(struct kimage *im
* Do not allocate memory (or fail in any way) in machine_kexec().
* We are past the point of no return, committed to rebooting now.
*/
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
{

unsigned long page_list;
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -48,7 +48,7 @@ void machine_kexec_cleanup(struct kimage
* Do not allocate memory (or fail in any way) in machine_kexec().
* We are past the point of no return, committed to rebooting now.
*/
-NORET_TYPE void machine_kexec(struct kimage *image)
+void machine_kexec(struct kimage *image)
{
if (ppc_md.machine_kexec)
ppc_md.machine_kexec(image);
--- a/include/asm-x86/kexec.h
+++ b/include/asm-x86/kexec.h
@@ -10,14 +10,15 @@
# define VA_PTE_0 5
# define PA_PTE_1 6
# define VA_PTE_1 7
+# define PA_SWAP_PAGE 8
# ifdef CONFIG_X86_PAE
-# define PA_PMD_0 8
-# define VA_PMD_0 9
-# define PA_PMD_1 10
-# define VA_PMD_1 11
-# define PAGES_NR 12
+# define PA_PMD_0 9
+# define VA_PMD_0 10
+# define PA_PMD_1 11
+# define VA_PMD_1 12
+# define PAGES_NR 13
# else
-# define PAGES_NR 8
+# define PAGES_NR 9
# endif
#else
# define PA_CONTROL_PAGE 0
@@ -152,11 +153,12 @@ static inline void crash_setup_regs(stru
}

#ifdef CONFIG_X86_32
-asmlinkage NORET_TYPE void
+asmlinkage unsigned long
relocate_kernel(unsigned long indirection_page,
unsigned long control_page,
unsigned long start_address,
- unsigned int has_pae) ATTRIB_NORET;
+ unsigned int has_pae,
+ unsigned int preserve_context);
#else
NORET_TYPE void
relocate_kernel(unsigned long indirection_page,
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1273,6 +1273,13 @@ config CRASH_DUMP
(CONFIG_RELOCATABLE=y).
For more details see Documentation/kdump/kdump.txt

+config KEXEC_JUMP
+ bool "kexec jump (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ depends on KEXEC && PM_SLEEP && X86_32
+ help
+ Invoke code in physical address mode via KEXEC
+
config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EMBEDDED || CRASH_DUMP)
default "0x1000000" if X86_NUMAQ


2008-07-07 17:00:59

by Pavel Machek

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

Hi!

The patch looks mostly ok to me. (Perhaps there's time to split it
into smaller chunks?)

You can add Acked-by: Pavel Machek <[email protected]> to it, I guess.


> This patch provides an enhancement to kexec/kdump. It implements
> the following features:
>
> - Backup/restore memory used by the original kernel before/after
> kexec.
>
> - Save/restore CPU state before/after kexec.
>
> The features of this patch can be used as a general method to call
> program in physical mode (paging turning off). This can be used to
> call BIOS code under Linux.
>
>
> kexec-tools needs to be patched to support kexec jump. The patches and
> the precompiled kexec can be download from the following URL:
>
> source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2
> patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2
> binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10
>
>
> Usage example of calling some physical mode code and return:

> @@ -98,16 +101,24 @@ int machine_kexec_prepare(struct kimage
> */
> void machine_kexec_cleanup(struct kimage *image)
> {
> + if (nx_enabled)
> + set_pages_nx(image->control_code_page, 1);
> }

, 0 ? (setup and cleanup were same, which is strange).



> @@ -1411,3 +1421,50 @@ static int __init crash_save_vmcoreinfo_
> }
>
> module_init(crash_save_vmcoreinfo_init)
> +
> +/**
> + * kernel_kexec - reboot the system

Really?

> + * Move into place and start executing a preloaded standalone
> + * executable. If nothing was preloaded return an error.
> + */
> +int kernel_kexec(void)
> +{
> + int error = 0;
> +
> + if (xchg(&kexec_lock, 1))
> + return -EBUSY;

That's quite a strange way to provide a lock. mutex_trylock?


> + if (!kexec_image) {
> + error = -EINVAL;
> + goto Unlock;
> + }
> +
> + if (kexec_image->preserve_context) {
> +#ifdef CONFIG_KEXEC_JUMP
> + local_irq_disable();
> + save_processor_state();

#else
BUG()

...because otherwise you silently do nothing?

> +#endif
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-08 09:05:34

by Huang, Ying

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

Hi, Pavel,

On Mon, 2008-07-07 at 20:50 +0800, Pavel Machek wrote:
> Hi!
>
> The patch looks mostly ok to me. (Perhaps there's time to split it
> into smaller chunks?)
>
> You can add Acked-by: Pavel Machek <[email protected]> to it, I guess.

Thank you very much!

[...]
> > @@ -98,16 +101,24 @@ int machine_kexec_prepare(struct kimage
> > */
> > void machine_kexec_cleanup(struct kimage *image)
> > {
> > + if (nx_enabled)
> > + set_pages_nx(image->control_code_page, 1);
> > }
>
> , 0 ? (setup and cleanup were same, which is strange).

Oh, Yes. That should be 0, I will change it.
>
> > @@ -1411,3 +1421,50 @@ static int __init crash_save_vmcoreinfo_
> > }
> >
> > module_init(crash_save_vmcoreinfo_init)
> > +
> > +/**
> > + * kernel_kexec - reboot the system

> Really?

I will change the comments to reflect the changes to kernel_kexec.

> > + * Move into place and start executing a preloaded standalone
> > + * executable. If nothing was preloaded return an error.
> > + */
> > +int kernel_kexec(void)
> > +{
> > + int error = 0;
> > +
> > + if (xchg(&kexec_lock, 1))
> > + return -EBUSY;
>
> That's quite a strange way to provide a lock. mutex_trylock?

I think this is because kexec_lock is used by crash_kexec() too, which
may be called in some extreme environment, such as during panic().
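
The locking pattern under discussion can be sketched in user space with
C11 atomics as below. This is an illustration only: sketch_lock,
sketch_trylock and sketch_unlock are hypothetical names, and
atomic_exchange stands in for the kernel's xchg(). The point it shows is
that taking the lock is a single atomic exchange that never sleeps.

```c
#include <errno.h>
#include <stdatomic.h>

/* 0 = free, 1 = held; stands in for kexec_lock in this sketch. */
atomic_int sketch_lock;

/* Try to take the lock; return -EBUSY if somebody already holds it. */
int sketch_trylock(void)
{
	if (atomic_exchange(&sketch_lock, 1))
		return -EBUSY;
	return 0;
}

/* Release the lock. */
void sketch_unlock(void)
{
	atomic_exchange(&sketch_lock, 0);
}
```

A caller that loses the race gets -EBUSY back immediately instead of
waiting, which matches the "return -EBUSY" path in kernel_kexec().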

> > + if (!kexec_image) {
> > + error = -EINVAL;
> > + goto Unlock;
> > + }
> > +
> > + if (kexec_image->preserve_context) {
> > +#ifdef CONFIG_KEXEC_JUMP
> > + local_irq_disable();
> > + save_processor_state();
>
> #else
> BUG()
>
> ...because otherwise you silently do nothing?
>
> > +#endif
>
> Pavel

If CONFIG_KEXEC_JUMP is not defined, kexec_image->preserve_context will
always be 0, so the current code is safe. Here, #ifdef is used to resolve
the dependency issue. For example, save_processor_state() may be
undefined if CONFIG_KEXEC_JUMP is not defined.

Best Regards,
Huang Ying

2008-07-08 14:52:48

by Vivek Goyal

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> This patch provides an enhancement to kexec/kdump. It implements
> the following features:
>
> - Backup/restore memory used by the original kernel before/after
> kexec.
>
> - Save/restore CPU state before/after kexec.
>

Hi Huang,

In general this patch set looks good enough to live in -mm and
get some testing going.

To me, adding capability to return back to original kernel looks
like a logical extension to kexec functionality.

Acked-by: Vivek Goyal <[email protected]>

Few minor comments inline.

[..]
> --- a/arch/x86/kernel/machine_kexec_32.c
> +++ b/arch/x86/kernel/machine_kexec_32.c
> @@ -22,6 +22,7 @@
> #include <asm/cpufeature.h>
> #include <asm/desc.h>
> #include <asm/system.h>
> +#include <asm/cacheflush.h>
>
> #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
> static u32 kexec_pgd[1024] PAGE_ALIGNED;
> @@ -85,10 +86,12 @@ static void load_segments(void)
> * reboot code buffer to allow us to avoid allocations
> * later.
> *
> - * Currently nothing.
> + * Make control page executable.
> */
> int machine_kexec_prepare(struct kimage *image)
> {
> + if (nx_enabled)
> + set_pages_x(image->control_code_page, 1);
> return 0;
> }
>
> @@ -98,16 +101,24 @@ int machine_kexec_prepare(struct kimage
> */
> void machine_kexec_cleanup(struct kimage *image)
> {
> + if (nx_enabled)
> + set_pages_nx(image->control_code_page, 1);
> }
>
> /*
> * Do not allocate memory (or fail in any way) in machine_kexec().
> * We are past the point of no return, committed to rebooting now.
> */
> -NORET_TYPE void machine_kexec(struct kimage *image)
> +void machine_kexec(struct kimage *image)
> {
> unsigned long page_list[PAGES_NR];
> void *control_page;
> + asmlinkage unsigned long
> + (*relocate_kernel_ptr)(unsigned long indirection_page,
> + unsigned long control_page,
> + unsigned long start_address,
> + unsigned int has_pae,
> + unsigned int preserve_context);
>
> tracer_disable();
>
> @@ -115,10 +126,11 @@ NORET_TYPE void machine_kexec(struct kim
> local_irq_disable();
>
> control_page = page_address(image->control_code_page);
> - memcpy(control_page, relocate_kernel, PAGE_SIZE);
> + memcpy(control_page, relocate_kernel, PAGE_SIZE/2);
>

Is it possible to add either a compile time or run time check
somewhere to make sure the code in relocate_kernel.S does not exceed
PAGE_SIZE/2?
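
Both kinds of check can be sketched in plain C as below. This is a
user-space illustration under stated assumptions, not kernel code: the
names SKETCH_PAGE_SIZE, SKETCH_CODE_AREA, sketch_code and
code_fits_control_page are made up, and the byte array merely stands in
for the assembled relocate_kernel code, whose real size would have to
come from assembler or linker symbols.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096			/* assumed page size (i386) */
#define SKETCH_CODE_AREA (SKETCH_PAGE_SIZE / 2)	/* first half holds the code */

/* Stand-in for the relocation code blob; the size is arbitrary here. */
static const unsigned char sketch_code[1024] = { 0 };

/* Compile time variant: the build fails if the blob grows too large. */
_Static_assert(sizeof(sketch_code) <= SKETCH_CODE_AREA,
	       "relocation code does not fit in half a page");

/* Run time variant: returns 1 if code_size bytes fit, 0 otherwise. */
int code_fits_control_page(size_t code_size)
{
	return code_size <= SKETCH_CODE_AREA;
}
```

The run time variant could be called before the memcpy() into the
control page; the compile time variant catches the problem earlier but
needs the size to be a constant expression.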

[..]
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -24,6 +24,8 @@
> #include <linux/utsrelease.h>
> #include <linux/utsname.h>
> #include <linux/numa.h>
> +#include <linux/suspend.h>
> +#include <linux/device.h>
>
> #include <asm/page.h>
> #include <asm/uaccess.h>
> @@ -242,6 +244,12 @@ static int kimage_normal_alloc(struct ki
> goto out;
> }
>
> + image->swap_page = kimage_alloc_control_pages(image, 0);
> + if (!image->swap_page) {
> + printk(KERN_ERR "Could not allocate swap buffer\n");
> + goto out;
> + }
> +
> result = 0;
> out:
> if (result == 0)
> @@ -986,6 +994,8 @@ asmlinkage long sys_kexec_load(unsigned
> if (result)
> goto out;
>
> + if (flags & KEXEC_PRESERVE_CONTEXT)
> + image->preserve_context = 1;
> result = machine_kexec_prepare(image);
> if (result)
> goto out;
> @@ -1411,3 +1421,50 @@ static int __init crash_save_vmcoreinfo_
> }
>
> module_init(crash_save_vmcoreinfo_init)
> +
> +/**
> + * kernel_kexec - reboot the system
> + *
> + * Move into place and start executing a preloaded standalone
> + * executable. If nothing was preloaded return an error.
> + */
> +int kernel_kexec(void)
> +{
> + int error = 0;
> +
> + if (xchg(&kexec_lock, 1))
> + return -EBUSY;
> + if (!kexec_image) {
> + error = -EINVAL;
> + goto Unlock;
> + }
> +
> + if (kexec_image->preserve_context) {
> +#ifdef CONFIG_KEXEC_JUMP
> + local_irq_disable();
> + save_processor_state();
> +#endif
> + } else {
> + blocking_notifier_call_chain(&reboot_notifier_list,
> + SYS_RESTART, NULL);
> + system_state = SYSTEM_RESTART;
> + device_shutdown();
> + sysdev_shutdown();
> + printk(KERN_EMERG "Starting new kernel\n");
> + machine_shutdown();

All the above code was part of kernel_restart_prepare(), can't we just
make that function non-static and use that?

[..]
> --- a/arch/x86/kernel/relocate_kernel_32.S
> +++ b/arch/x86/kernel/relocate_kernel_32.S
> @@ -20,11 +20,44 @@
> #define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
> #define PAE_PGD_ATTR (_PAGE_PRESENT)
>
> +/* control_page + PAGE_SIZE/2 ~ control_page + PAGE_SIZE * 3/4 are
> + * used to save some data for jumping back
> + */
> +#define DATA(offset) (PAGE_SIZE/2+(offset))
> +
> +/* Minimal CPU state */
> +#define ESP DATA(0x0)
> +#define CR0 DATA(0x4)
> +#define CR3 DATA(0x8)
> +#define CR4 DATA(0xc)
> +
> +/* other data */
> +#define CP_VA_CONTROL_PAGE DATA(0x10)
> +#define CP_PA_PGD DATA(0x14)
> +#define CP_PA_SWAP_PAGE DATA(0x18)
> +#define CP_PA_BACKUP_PAGES_MAP DATA(0x1c)
> +

In general, this piece of assembly code is getting bigger and it's
difficult to read now. I think we should at least pull the page
table setup code out into C. Somebody had posted a patch to do that. Don't
know what happened to that. Anyway, this is a separate issue and is on
the wish list.

Thanks
Vivek

2008-07-08 20:32:14

by Pavel Machek

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

Hi!

> > > @@ -1411,3 +1421,50 @@ static int __init crash_save_vmcoreinfo_
> > > }
> > >
> > > module_init(crash_save_vmcoreinfo_init)
> > > +
> > > +/**
> > > + * kernel_kexec - reboot the system
>
> > Really?
>
> I will change the comments to reflect the changes to kernel_kexec.
>
> > > + * Move into place and start executing a preloaded standalone
> > > + * executable. If nothing was preloaded return an error.
> > > + */
> > > +int kernel_kexec(void)
> > > +{
> > > + int error = 0;
> > > +
> > > + if (xchg(&kexec_lock, 1))
> > > + return -EBUSY;
> >
> > That's quite a strange way to provide a lock. mutex_trylock?
>
> I think this is because kexec_lock is used by crash_kexec() too, which
> may be called in some extreme environment, such as during panic().
>
> > > + if (!kexec_image) {
> > > + error = -EINVAL;
> > > + goto Unlock;
> > > + }
> > > +
> > > + if (kexec_image->preserve_context) {
> > > +#ifdef CONFIG_KEXEC_JUMP
> > > + local_irq_disable();
> > > + save_processor_state();
> >
> > #else
> > BUG()
> >
> > ...because otherwise you silently do nothing?
> >
> > > +#endif
>
> If CONFIG_KEXEC_JUMP is not defined, kexec_image->preserve_context will
> always be 0, so the current code is safe. Here, #ifdef is used to resolve
> the dependency issue. For example, save_processor_state() may be
> undefined if CONFIG_KEXEC_JUMP is not defined.

Move the #ifdef outside the if (), then, so this is clear?

Actually, if preserve_context is always zero in !KEXEC_JUMP case, it
might make sense to remove whole variable...
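
Something like the following shape, perhaps. This is only a user-space
sketch with hypothetical stand-ins (struct sketch_image, the sketch_*
helpers and the counters are all made up so the control flow can be
compiled and exercised); it is not the kernel function.

```c
/* Hypothetical stand-ins for kernel state and helpers; not kernel code. */
struct sketch_image {
	unsigned int preserve_context : 1;
};

int sketch_saves, sketch_restores, sketch_shutdowns, sketch_jumps;

void sketch_save_processor_state(void)    { sketch_saves++; }
void sketch_restore_processor_state(void) { sketch_restores++; }
void sketch_shutdown(void)                { sketch_shutdowns++; }

void sketch_machine_kexec(struct sketch_image *image)
{
	(void)image;
	sketch_jumps++;
}

int sketch_kernel_kexec(struct sketch_image *image)
{
#ifdef CONFIG_KEXEC_JUMP
	if (image->preserve_context) {
		sketch_save_processor_state();
	} else
#endif
	{
		/* Normal kexec: shut everything down first. */
		sketch_shutdown();
	}

	sketch_machine_kexec(image);

#ifdef CONFIG_KEXEC_JUMP
	if (image->preserve_context)
		sketch_restore_processor_state();
#endif
	return 0;
}
```

With the #ifdef outside the if (), a build without CONFIG_KEXEC_JUMP
always takes the shutdown path, so there is no branch that silently
does nothing.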
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-09 01:04:45

by Huang, Ying

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Tue, 2008-07-08 at 10:50 -0400, Vivek Goyal wrote:
> On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > This patch provides an enhancement to kexec/kdump. It implements
> > the following features:
> >
> > - Backup/restore memory used by the original kernel before/after
> > kexec.
> >
> > - Save/restore CPU state before/after kexec.
> >
>
> Hi Huang,
>
> In general this patch set looks good enough to live in -mm and
> get some testing going.
>
> To me, adding capability to return back to original kernel looks
> like a logical extension to kexec functionality.
>
> Acked-by: Vivek Goyal <[email protected]>
>
> Few minor comments inline.

Thank you very much!

> [..]
> > --- a/arch/x86/kernel/machine_kexec_32.c
> > +++ b/arch/x86/kernel/machine_kexec_32.c
> > @@ -22,6 +22,7 @@
> > #include <asm/cpufeature.h>
> > #include <asm/desc.h>
> > #include <asm/system.h>
> > +#include <asm/cacheflush.h>
> >
> > #define PAGE_ALIGNED __attribute__ ((__aligned__(PAGE_SIZE)))
> > static u32 kexec_pgd[1024] PAGE_ALIGNED;
> > @@ -85,10 +86,12 @@ static void load_segments(void)
> > * reboot code buffer to allow us to avoid allocations
> > * later.
> > *
> > - * Currently nothing.
> > + * Make control page executable.
> > */
> > int machine_kexec_prepare(struct kimage *image)
> > {
> > + if (nx_enabled)
> > + set_pages_x(image->control_code_page, 1);
> > return 0;
> > }
> >
> > @@ -98,16 +101,24 @@ int machine_kexec_prepare(struct kimage
> > */
> > void machine_kexec_cleanup(struct kimage *image)
> > {
> > + if (nx_enabled)
> > + set_pages_nx(image->control_code_page, 1);
> > }
> >
> > /*
> > * Do not allocate memory (or fail in any way) in machine_kexec().
> > * We are past the point of no return, committed to rebooting now.
> > */
> > -NORET_TYPE void machine_kexec(struct kimage *image)
> > +void machine_kexec(struct kimage *image)
> > {
> > unsigned long page_list[PAGES_NR];
> > void *control_page;
> > + asmlinkage unsigned long
> > + (*relocate_kernel_ptr)(unsigned long indirection_page,
> > + unsigned long control_page,
> > + unsigned long start_address,
> > + unsigned int has_pae,
> > + unsigned int preserve_context);
> >
> > tracer_disable();
> >
> > @@ -115,10 +126,11 @@ NORET_TYPE void machine_kexec(struct kim
> > local_irq_disable();
> >
> > control_page = page_address(image->control_code_page);
> > - memcpy(control_page, relocate_kernel, PAGE_SIZE);
> > + memcpy(control_page, relocate_kernel, PAGE_SIZE/2);
> >
>
> Is it possible to add either a compile time or run time check
> somewhere to make sure code in relocate_kernel.S does not exceed
> PAGE_SIZE/2?

OK, I will add it.
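
One way such a check might look, as a hedged sketch only: in plain C11 a _Static_assert can reject a build in which the relocation code outgrows the first half of the control page, and a small helper can do the same test at run time. The byte array, the 4096-byte page size, and all _sketch names are assumptions for illustration. The real relocate_kernel code is assembly, so an actual check may well need to live in the assembly file or the linker script instead.

```c
#include <stddef.h>

#define PAGE_SIZE_SKETCH  4096u                  /* assumed x86-32 page size */
#define CODE_LIMIT_SKETCH (PAGE_SIZE_SKETCH / 2) /* bytes copied by memcpy() */

/* Hypothetical stand-in for the assembled relocate_kernel code. */
static const unsigned char relocate_code_sketch[1024] = { 0 };

/* Compile-time variant: the build fails if the code is too large. */
_Static_assert(sizeof(relocate_code_sketch) <= CODE_LIMIT_SKETCH,
	       "relocation code must fit in PAGE_SIZE/2");

/* Run-time variant for a size that is only known after linking. */
static int code_fits_sketch(size_t code_size)
{
	return code_size <= CODE_LIMIT_SKETCH;
}
```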

> [..]
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -24,6 +24,8 @@
> > #include <linux/utsrelease.h>
> > #include <linux/utsname.h>
> > #include <linux/numa.h>
> > +#include <linux/suspend.h>
> > +#include <linux/device.h>
> >
> > #include <asm/page.h>
> > #include <asm/uaccess.h>
> > @@ -242,6 +244,12 @@ static int kimage_normal_alloc(struct ki
> > goto out;
> > }
> >
> > + image->swap_page = kimage_alloc_control_pages(image, 0);
> > + if (!image->swap_page) {
> > + printk(KERN_ERR "Could not allocate swap buffer\n");
> > + goto out;
> > + }
> > +
> > result = 0;
> > out:
> > if (result == 0)
> > @@ -986,6 +994,8 @@ asmlinkage long sys_kexec_load(unsigned
> > if (result)
> > goto out;
> >
> > + if (flags & KEXEC_PRESERVE_CONTEXT)
> > + image->preserve_context = 1;
> > result = machine_kexec_prepare(image);
> > if (result)
> > goto out;
> > @@ -1411,3 +1421,50 @@ static int __init crash_save_vmcoreinfo_
> > }
> >
> > module_init(crash_save_vmcoreinfo_init)
> > +
> > +/**
> > + * kernel_kexec - reboot the system
> > + *
> > + * Move into place and start executing a preloaded standalone
> > + * executable. If nothing was preloaded return an error.
> > + */
> > +int kernel_kexec(void)
> > +{
> > + int error = 0;
> > +
> > + if (xchg(&kexec_lock, 1))
> > + return -EBUSY;
> > + if (!kexec_image) {
> > + error = -EINVAL;
> > + goto Unlock;
> > + }
> > +
> > + if (kexec_image->preserve_context) {
> > +#ifdef CONFIG_KEXEC_JUMP
> > + local_irq_disable();
> > + save_processor_state();
> > +#endif
> > + } else {
> > + blocking_notifier_call_chain(&reboot_notifier_list,
> > + SYS_RESTART, NULL);
> > + system_state = SYSTEM_RESTART;
> > + device_shutdown();
> > + sysdev_shutdown();
> > + printk(KERN_EMERG "Starting new kernel\n");
> > + machine_shutdown();
>
> All the above code was part of kernel_restart_prepare(), can't we just
> make that function non-static and use that?

OK, I will do that.

> [..]
> > --- a/arch/x86/kernel/relocate_kernel_32.S
> > +++ b/arch/x86/kernel/relocate_kernel_32.S
> > @@ -20,11 +20,44 @@
> > #define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
> > #define PAE_PGD_ATTR (_PAGE_PRESENT)
> >
> > +/* control_page + PAGE_SIZE/2 ~ control_page + PAGE_SIZE * 3/4 are
> > + * used to save some data for jumping back
> > + */
> > +#define DATA(offset) (PAGE_SIZE/2+(offset))
> > +
> > +/* Minimal CPU state */
> > +#define ESP DATA(0x0)
> > +#define CR0 DATA(0x4)
> > +#define CR3 DATA(0x8)
> > +#define CR4 DATA(0xc)
> > +
> > +/* other data */
> > +#define CP_VA_CONTROL_PAGE DATA(0x10)
> > +#define CP_PA_PGD DATA(0x14)
> > +#define CP_PA_SWAP_PAGE DATA(0x18)
> > +#define CP_PA_BACKUP_PAGES_MAP DATA(0x1c)
> > +
>
> In general, this assembly piece of code is getting bigger and it's
> difficult to read it now. I think we should at least pull out the page
> table setup code into C. Somebody had posted a patch to do that. Don't
> know what happened to that. Anyway, this is a separate issue and is on
> the wish list.

In fact, that patch was posted by me. I will re-post that patch.

Best Regards,
Huang Ying
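
The DATA() offsets in the relocate_kernel_32.S hunk quoted above describe a small record stored in the control page starting at PAGE_SIZE/2. As a reading aid, the same layout can be written as a C struct with compile-time offset checks. This is a sketch under assumptions: the struct, its field names, and the 4096-byte page size are hypothetical, and only the offsets themselves come from the quoted patch.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u /* assumed x86-32 page size */
#define DATA(offset) (PAGE_SIZE_SKETCH / 2 + (offset))

/* C view of the area at control_page + PAGE_SIZE/2 used for jumping back. */
struct kexec_jump_data_sketch {
	uint32_t esp;                 /* ESP, DATA(0x0) */
	uint32_t cr0;                 /* CR0, DATA(0x4) */
	uint32_t cr3;                 /* CR3, DATA(0x8) */
	uint32_t cr4;                 /* CR4, DATA(0xc) */
	uint32_t va_control_page;     /* CP_VA_CONTROL_PAGE, DATA(0x10) */
	uint32_t pa_pgd;              /* CP_PA_PGD, DATA(0x14) */
	uint32_t pa_swap_page;        /* CP_PA_SWAP_PAGE, DATA(0x18) */
	uint32_t pa_backup_pages_map; /* CP_PA_BACKUP_PAGES_MAP, DATA(0x1c) */
};

_Static_assert(offsetof(struct kexec_jump_data_sketch, cr4) == 0xc,
	       "CR4 slot must sit at offset 0xc");
_Static_assert(offsetof(struct kexec_jump_data_sketch, pa_backup_pages_map) == 0x1c,
	       "backup pages map slot must sit at offset 0x1c");
/* The patch comment reserves PAGE_SIZE/2 up to PAGE_SIZE*3/4 for this data. */
_Static_assert(sizeof(struct kexec_jump_data_sketch) <= PAGE_SIZE_SKETCH / 4,
	       "saved data must fit in a quarter page");

/* Byte offset of a field, measured from the start of the control page. */
static unsigned int control_page_offset_sketch(size_t field_offset)
{
	return DATA((unsigned int)field_offset);
}
```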

2008-07-09 01:08:08

by Huang, Ying

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

Hi, Pavel,

On Tue, 2008-07-08 at 12:40 +0200, Pavel Machek wrote:
> Hi!
>
> > > > @@ -1411,3 +1421,50 @@ static int __init crash_save_vmcoreinfo_
> > > > }
> > > >
> > > > module_init(crash_save_vmcoreinfo_init)
> > > > +
> > > > +/**
> > > > + * kernel_kexec - reboot the system
> >
> > > Really?
> >
> > I will change the comments to reflect the changes to kernel_kexec.
> >
> > > > + * Move into place and start executing a preloaded standalone
> > > > + * executable. If nothing was preloaded return an error.
> > > > + */
> > > > +int kernel_kexec(void)
> > > > +{
> > > > + int error = 0;
> > > > +
> > > > + if (xchg(&kexec_lock, 1))
> > > > + return -EBUSY;
> > >
> > > That's quite a strange way to provide a lock. mutex_trylock?
> >
> > I think this is because kexec_lock is used by crash_kexec() too, which
> > may be called in some extreme environment, such as during panic().
> >
> > > > + if (!kexec_image) {
> > > > + error = -EINVAL;
> > > > + goto Unlock;
> > > > + }
> > > > +
> > > > + if (kexec_image->preserve_context) {
> > > > +#ifdef CONFIG_KEXEC_JUMP
> > > > + local_irq_disable();
> > > > + save_processor_state();
> > >
> > > #else
> > > BUG()
> > >
> > > ...because otherwise you silently do nothing?
> > >
> > > > +#endif
> >
> > If CONFIG_KEXEC_JUMP is not defined, kexec_image->preserve_context will
> > always be 0. So current code is safe. Here, #ifdef is used to resolve
> > the dependency issue. For example, save_processor_state() may be
> > undefined if CONFIG_KEXEC_JUMP is not defined.
>
> Move the #ifdef outside the if (), then, so this is clear?

I think this is reasonable, I will do it.

> Actually, if preserve_context is always zero in !KEXEC_JUMP case, it
> might make sense to remove whole variable...

I think this would require too many #ifndef CONFIG_KEXEC_JUMP ... #endif
blocks. The memory and performance gain is too small to compensate for
the loss in code readability.

Best Regards,
Huang Ying
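
To make the agreed change concrete, here is a small userspace sketch of wrapping the whole preserve_context branch in the #ifdef while keeping the field itself unconditional. It is an illustration under assumptions, not the patch: CONFIG_KEXEC_JUMP_SKETCH, struct kimage_sketch, and the stub functions are hypothetical stand-ins for the kernel's config option, struct kimage, save_processor_state(), and the shutdown sequence.

```c
#include <errno.h>
#include <stddef.h>

/* Defined here to mimic a kernel built with CONFIG_KEXEC_JUMP=y. */
#define CONFIG_KEXEC_JUMP_SKETCH 1

struct kimage_sketch {
	int preserve_context; /* stays in the struct in both configurations */
};

static int state_saved_sketch; /* set when the jump path runs */
static int shut_down_sketch;   /* set when the plain kexec path runs */

#ifdef CONFIG_KEXEC_JUMP_SKETCH
/* Only built when the jump feature is configured in. */
static void save_state_sketch(void) { state_saved_sketch = 1; }
#endif

static void shutdown_sketch(void) { shut_down_sketch = 1; }

static int kernel_kexec_sketch(const struct kimage_sketch *image)
{
	if (!image)
		return -EINVAL;

#ifdef CONFIG_KEXEC_JUMP_SKETCH
	if (image->preserve_context) {
		/* Jump path: keep the running kernel resumable. */
		save_state_sketch();
	} else
#endif
	{
		/* Plain kexec path: the old kernel is shut down. */
		shutdown_sketch();
	}
	return 0;
}
```

With the option undefined, the conditional disappears entirely and only the shutdown block is compiled, so there is no empty if body left to do nothing silently.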

2008-07-11 19:25:03

by Andrew Morton

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal <[email protected]> wrote:

> On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > This patch provides an enhancement to kexec/kdump. It implements
> > the following features:
> >
> > - Backup/restore memory used by the original kernel before/after
> > kexec.
> >
> > - Save/restore CPU state before/after kexec.
> >
>
> Hi Huang,
>
> In general this patch set looks good enough to live in -mm and
> get some testing going.
>
> To me, adding capability to return back to original kernel looks
> like a logical extension to kexec functionality.

Exciting ;) It's much less code than I expected.

I don't think I understand the feature any more. Once upon a time we
thought that this might become a new and better (or at least
better-code-sharing) way of doing suspend-to-disk. How far are we from
that?

What are the prospects of supporting other architectures?

Who maintains kexec-tools, and are they OK with merging up the
corresponding changes?

Thanks.

2008-07-11 20:12:38

by Vivek Goyal

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Fri, Jul 11, 2008 at 12:21:31PM -0700, Andrew Morton wrote:
> On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal <[email protected]> wrote:
>
> > On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > > This patch provides an enhancement to kexec/kdump. It implements
> > > the following features:
> > >
> > > - Backup/restore memory used by the original kernel before/after
> > > kexec.
> > >
> > > - Save/restore CPU state before/after kexec.
> > >
> >
> > Hi Huang,
> >
> > In general this patch set looks good enough to live in -mm and
> > get some testing going.
> >
> > To me, adding capability to return back to original kernel looks
> > like a logical extension to kexec functionality.
>
> Exciting ;) It's much less code than I expected.
>
> I don't think I understand the feature any more. Once upon a time we
> thought that this might become a new and better (or at least
> better-code-sharing) way of doing suspend-to-disk. How far are we from
> that?
>

Hi Andrew,

We can use this patchset for hibernation, but whether it can be a better way
of doing things than what we already have, I don't know. Last time I raised
this question, the power people had various views. In the end, Pavel wanted
this patchset to be in. Pavel can tell more here...

To me this patchset looks interesting for a couple of reasons.

- Looks like an interesting feature where one can have a separate kernel
in memory and one can switch between the kernels on the fly. It can
be modified to have more than one kernel in memory at a time.

- So far kexec was one-directional. One could only kexec to a new kernel and
the old kernel was gone. Now this patchset makes kexec functionality kind
of bidirectional, which looks like a logical extension and can lead
to interesting use cases in the future.

Huang also talks of using this feature for snapshotting the kernel and
invoking some BIOS code in protected mode. I am not very sure how exactly
they are planning to use it. Huang, do you have more details on this?

> What are the prospects of supporting other architectures?
>

I think it should be doable on other architectures as well where kexec
is supported. Can't think of a reason why it can't be. Huang, what do
you think?

> Who maintains kexec-tools, and are they OK with merging up the
> corresponding changes?
>

I think Eric still has the ownership of kexec-tools. But it has been
long since kexec-tools has been updated. Now Simon Horman is maintaining
a separate tree, kexec-tools-testing, and all the active development
is taking place there.

Huang has not exactly posted kexec-tools patches but has given a link
to kexec-tools patches and nobody has objected so far. I am CCing
Simon Horman in case he sees any issues.

Thanks
Vivek

2008-07-11 20:24:23

by Pavel Machek

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Fri 2008-07-11 12:21:31, Andrew Morton wrote:
> On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal <[email protected]> wrote:
>
> > On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > > This patch provides an enhancement to kexec/kdump. It implements
> > > the following features:
> > >
> > > - Backup/restore memory used by the original kernel before/after
> > > kexec.
> > >
> > > - Save/restore CPU state before/after kexec.
> > >
> >
> > Hi Huang,
> >
> > In general this patch set looks good enough to live in -mm and
> > get some testing going.
> >
> > To me, adding capability to return back to original kernel looks
> > like a logical extension to kexec functionality.
>
> Exciting ;) It's much less code than I expected.
>
> I don't think I understand the feature any more. Once upon a time we
> thought that this might become a new and better (or at least
> better-code-sharing) way of doing suspend-to-disk. How far are we from
> that?

Well, it will be tricky to get kjump-hibernation right with respect to
ACPI, but we should be fairly close to basic hibernation working with
this. It has the major advantage of not needing the refrigerator (and a few
disadvantages -- like doing an additional boot during suspend).

But the main reason I'd like kjump to be in is different -- it should be
useful for stuff like "dump but continue running", etc...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-11 20:39:20

by Rafael J. Wysocki

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Friday, 11 of July 2008, Pavel Machek wrote:
> On Fri 2008-07-11 12:21:31, Andrew Morton wrote:
> > On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal <[email protected]> wrote:
> >
> > > On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > > > This patch provides an enhancement to kexec/kdump. It implements
> > > > the following features:
> > > >
> > > > - Backup/restore memory used by the original kernel before/after
> > > > kexec.
> > > >
> > > > - Save/restore CPU state before/after kexec.
> > > >
> > >
> > > Hi Huang,
> > >
> > > In general this patch set looks good enough to live in -mm and
> > > get some testing going.
> > >
> > > To me, adding capability to return back to original kernel looks
> > > like a logical extension to kexec functionality.
> >
> > Exciting ;) It's much less code than I expected.
> >
> > I don't think I understand the feature any more. Once upon a time we
> > thought that this might become a new and better (or at least
> > better-code-sharing) way of doing suspend-to-disk. How far are we from
> > that?
>
> Well, it will be tricky to get kjump-hibernation right with respect to
> ACPI, but we should be fairly close to basic hibernation working with
> this. It has major advantage of not needing refrigerator (and few
> disadvantages -- like doing additional boot during suspend).

Please, stop that. This has always been a bogus argument.

The truth is we could do hibernation without the freezer if
(a) some drivers were fixed not to rely on it (kexec doesn't help here),
(b) we had support at the block layer or filesystems level (kexec is a big
workaround here).

> But main reason I'd like kjump to be in is different -- it should be
> useful to stuff like "dump but continue running", etc...

That's a different thing.

Thanks,
Rafael

2008-07-12 01:10:05

by Nigel Cunningham

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

Hi.

On Fri, 2008-07-11 at 16:11 -0400, Vivek Goyal wrote:
> Hi Andrew,
>
> We can use this patchset for hibernation, but can it be a better way of doing
> things than what we already have, I don't know. Last time I had raised
> this question and power people had various views. In the end, Pavel wanted
> this patchset to be in. Pavel, can tell more here...
>
> To me this patchset looks interesting for couple of reasons.
>
> - Looks like an interesting feature where one can have a separate kernel
> in memory and one can switch between the kernels on the fly. It can
> be modified to have more than one kernel in memory at a time.

I'm not sure how useful that would be, though. I already have
functionality in TuxOnIce which allows you to resume a different image
instead of powering off (roughly the same thing when combined with not
removing the image after resuming). It was neat when testing to be able
to switch back and forth, and I developed the code because I imagined
that it could form part of the foundation for switching between a login
screen and users' stored sessions. Is this what you're imagining?

> - So far kexec was one directional. One can only kexec to new kernel and
> old kernel was gone. Now this patchset makes kexec functionality kind
> of bidirectional and this looks like logical extension and can lead
> to interesting use cases in future.

Ah. You mean keeping both kernels in memory at the same time? In the
above, I was replacing one image with another.

> Huang also talks of using this feature for snapshotting kernel and
> invoking some BIOS code in protected mode. I am not very sure how exactly
> are they planning to use it. Huang, do you have more details on this?

As Rafael wrote, snapshotting is a completely different beast.

Regards,

Nigel

2008-07-12 02:32:17

by Eric W. Biederman

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

"Rafael J. Wysocki" <[email protected]> writes:

> The truth is we could do hibernation without the freezer if
> (a) some drivers were fixed not to rely on it (kexec doesn't help here),
> (b) we had support at the block layer or filesystems level (kexec is a big
> workaround here).

I just realized with a little care the block layer does have support for this,
or something very close.

You set up a software RAID mirror with one disk device. The physical
device can come in and out while the filesystems depend on the real device.

I expect a hardware pass through device configured to do exactly the
above would be about 100 lines of code, so getting past the filesystem
hurdle should be very doable. Arguably we should be able to do this
up a level, but it is easy enough to do that you can do a proof of
concept without that.

Now I'm curious to see how far you can go with just the device hotplug support.

Eric

2008-07-12 03:05:12

by Alan Stern

Subject: Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Fri, 11 Jul 2008, Eric W. Biederman wrote:

> I just realized with a little care the block layer does have support for this,
> or something very close.
>
> You setup a software raid mirror with one disk device. The physical
> device can come in and out while the filesystems depend on the real device.

Do you mean "the filesystems depend on the logical RAID device"?

What's to prevent userspace from accessing the physical device
directly?

What this amounts to, in the end, is having a way to distinguish the
set of I/O requests coming from the hibernation code (reading or
writing the memory image) from the set of all other I/O requests. The
driver or the block layer has to be set up to allow the first set
through while blocking the second set. (And don't forget about the
complications caused by error-recovery I/O during the hibernation
activity!)

Forcing the second set of requests to filter through an extra software
layer is a clumsy way of accomplishing this. There ought to be a
better approach.

Alan Stern

2008-07-12 03:52:31

by Eric W. Biederman

Subject: Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump

Alan Stern <[email protected]> writes:

> On Fri, 11 Jul 2008, Eric W. Biederman wrote:
>
>> I just realized with a little care the block layer does have support for this,
>> or something very close.
>>
>> You setup a software raid mirror with one disk device. The physical
>> device can come in and out while the filesystems depend on the real device.
>
> Do you mean "the filesystems depend on the logical RAID device"?

Oh yes. Thinko.

> What's to prevent userspace from accessing the physical device
> directly?

Nothing.

> What this amounts to, in the end, is having a way to distinguish the
> set of I/O requests coming from the hibernation code (reading or
> writing the memory image) from the set of all other I/O requests. The
> driver or the block layer has to be set up to allow the first set
> through while blocking the second set. (And don't forget about the
> complications caused by error-recovery I/O during the hibernation
> activity!)

I guess this problem exists but it is not at all the problem I was
thinking of.

> Forcing the second set of requests to filter through an extra software
> layer is a clumsy way of accomplishing this. There ought to be a
> better approach.

The point was something different. The reason we cannot store the
state of the system with the hardware devices logically hot unplugged
(and thus reuse all of the fine device hotplug methods) is that
things like the filesystem layer don't know how to cope with their
block devices going away and coming back.

That is the problem inserting a virtual software device in the middle
can solve. If that works, should there be a better way? Certainly, but
to prove it out, starting with a block device wrapper is a trivial way to
go.

Eric

2008-07-12 18:50:33

by Rafael J. Wysocki

Subject: Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Saturday, 12 of July 2008, Eric W. Biederman wrote:
> Alan Stern <[email protected]> writes:
>
> > On Fri, 11 Jul 2008, Eric W. Biederman wrote:
> >
> >> I just realized with a little care the block layer does have support for this,
> >> or something very close.
> >>
> >> You setup a software raid mirror with one disk device. The physical
> >> device can come in and out while the filesystems depend on the real device.
> >
> > Do you mean "the filesystems depend on the logical RAID device"?
>
> Oh yes. Thinko.
>
> > What's to prevent userspace from accessing the physical device
> > directly?
>
> Nothing.
>
> > What this amounts to, in the end, is having a way to distinguish the
> > set of I/O requests coming from the hibernation code (reading or
> > writing the memory image) from the set of all other I/O requests. The
> > driver or the block layer has to be set up to allow the first set
> > through while blocking the second set. (And don't forget about the
> > complications caused by error-recovery I/O during the hibernation
> > activity!)
>
> I guess this problem exists but it is not at all the problem I was
> thinking of.
>
> > Forcing the second set of requests to filter through an extra software
> > layer is a clumsy way of accomplishing this. There ought to be a
> > better approach.
>
> The point was something different. The reason we cannot store the
> state of the system with the hardware devices logically hot unplugged
> (and thus reuse all of the fine device hotplug methods) is that
> things like the filesystem layer don't know how to cope with their
> block devices going away and coming back.
>
> That is the problem inserting a virtual software device in the middle
> can solve. If that works, should there be a better way? Certainly, but
> to prove it out, starting with a block device wrapper is a trivial way to
> go.

I have discussed that with Jens a bit and it seems we can use a special I/O
scheduler that will separate the image saving I/O from any other I/O, allowing
only the former to reach lower layers. Since you can switch I/O schedulers on
the fly already, quite a bit of the necessary functionality is in place.

Of course, we also need character device drivers to block user space while
suspended and we need ioctls to be handled correctly at that time etc.

That said, even if devices are accessed while we're saving the image, there
will be no damage as long as those accesses do not result in any data being
actually written to non-volatile storage, such as disks.

Thanks,
Rafael
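
The filtering idea discussed above, letting image-saving I/O through while holding back everything else, can be pictured with a toy gate. This is only a sketch under stated assumptions: it does not use or resemble the real block layer or I/O scheduler interfaces, and the request tag, the fixed-size hold queue, and all _sketch names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct io_request_sketch {
	bool from_hibernation; /* hypothetical tag set by the image-saving code */
	int sector;
};

#define HELD_MAX_SKETCH 16

struct io_gate_sketch {
	bool hibernating;                               /* filtering active? */
	struct io_request_sketch held[HELD_MAX_SKETCH]; /* deferred requests */
	size_t nr_held;
};

/* Returns 1 if the request may go to the device now, 0 if it was held
 * back, and -1 if the hold queue is full. */
static int io_gate_submit_sketch(struct io_gate_sketch *gate,
				 struct io_request_sketch req)
{
	if (!gate->hibernating || req.from_hibernation)
		return 1;
	if (gate->nr_held == HELD_MAX_SKETCH)
		return -1;
	gate->held[gate->nr_held++] = req;
	return 0;
}

/* Stop filtering and report how many held requests can now be issued. */
static size_t io_gate_release_sketch(struct io_gate_sketch *gate)
{
	size_t n = gate->nr_held;

	gate->hibernating = false;
	gate->nr_held = 0;
	return n;
}
```

The sketch deliberately ignores the harder cases raised in the thread, such as error-recovery I/O, character devices, and ioctls.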

2008-07-12 19:55:15

by Alan Stern

Subject: Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Fri, 11 Jul 2008, Eric W. Biederman wrote:

> > Forcing the second set of requests to filter through an extra software
> > layer is a clumsy way of accomplishing this. There ought to be a
> > better approach.
>
> The point was something different. The reason we cannot store the
> state of the system with the hardware devices logically hot unplugged
> (and thus reuse all of the fine device hotplug methods) is that
> things like the filesystem layer don't know how to cope with their
> block devices going away and coming back.

This is not how the procedure works. During hibernation, block devices
are not logically hot-unplugged. (If they were then they couldn't be
used for writing the memory image.) Instead, they are quiesced or
suspended and their input queues are plugged.

> That is the problem inserting a virtual software device in the middle
> can solve. If that works, should there be a better way? Certainly, but
> to prove it out, starting with a block device wrapper is a trivial way to
> go.

This sounds like a solution to a non-existent problem.

Alan Stern

2008-07-14 05:46:17

by Pavel Machek

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Sat 2008-07-12 11:02:17, Nigel Cunningham wrote:
> Hi.
>
> On Fri, 2008-07-11 at 16:11 -0400, Vivek Goyal wrote:
> > Hi Andrew,
> >
> > We can use this patchset for hibernation, but can it be a better way of doing
> > things than what we already have, I don't know. Last time I had raised
> > this question and power people had various views. In the end, Pavel wanted
> > this patchset to be in. Pavel, can tell more here...
> >
> > To me this patchset looks interesting for couple of reasons.
> >
> > - Looks like an interesting feature where one can have a separate kernel
> > in memory and one can switch between the kernels on the fly. It can
> > be modified to have more than one kernel in memory at a time.
>
> I'm not sure how useful that would be, though. I already have
> functionality in TuxOnIce which allows you to resume a different image
> instead of powering off (roughly the same thing when combined with not
> removing the image after resuming). It was neat when testing to be
> able

The beauty of kjump is that it is supposed to be used on a half-broken
system, so it is useful for debugging.

> > - So far kexec was one directional. One can only kexec to new kernel and
> > old kernel was gone. Now this patchset makes kexec functionality kind
> > of bidirectional and this looks like logical extension and can lead
> > to interesting use cases in future.
>
> Ah. You mean keeping both kernels in memory at the same time? In the
> above, I was replacing one image with another.

Yep, kjump keeps both kernels loaded at the same time.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-07-14 13:10:46

by Vivek Goyal

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Sat, Jul 12, 2008 at 11:02:17AM +1000, Nigel Cunningham wrote:
> Hi.
>
> On Fri, 2008-07-11 at 16:11 -0400, Vivek Goyal wrote:
> > Hi Andrew,
> >
> > We can use this patchset for hibernation, but can it be a better way of doing
> > things than what we already have, I don't know. Last time I had raised
> > this question and power people had various views. In the end, Pavel wanted
> > this patchset to be in. Pavel, can tell more here...
> >
> > To me this patchset looks interesting for couple of reasons.
> >
> > - Looks like an interesting feature where one can have a separate kernel
> > in memory and one can switch between the kernels on the fly. It can
> > be modified to have more than one kernel in memory at a time.
>
> I'm not sure how useful that would be, though. I already have
> functionality in TuxOnIce which allows you to resume a different image
> instead of powering off (roughly the same thing when combined with not
> removing the image after resuming). It was neat when testing to be able
> to switch back and forth, and I developed the code because I imagined
> that it could form part of the foundation for switching between a login
> screen and users' stored sessions. Is this what you're imagining?
>

I did not think of that. I thought of two things.

- This can possibly be used for non-disruptive kernel crash dumping, where
a user wants to capture a snapshot of the kernel and then continue
to work. Though it might be a somewhat heavyweight solution, and the kernel
state has also changed a bit by the time the dump is captured (because of
all the suspend code).

- One can have two distributions installed on a single system and switch
between two booted kernels in a few seconds.

> > - So far kexec was one directional. One can only kexec to new kernel and
> > old kernel was gone. Now this patchset makes kexec functionality kind
> > of bidirectional and this looks like logical extension and can lead
> to interesting use cases in future.
>
> Ah. You mean keeping both kernels in memory at the same time? In the
> above, I was replacing one image with another.

Yes, here both the kernels will remain in the RAM. In fact it should be
easily possible to keep more than 2 kernels in memory and switch between
these.

Thanks
Vivek

2008-07-14 13:30:46

by huang ying

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Sat, Jul 12, 2008 at 3:21 AM, Andrew Morton
<[email protected]> wrote:
> On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal <[email protected]> wrote:
>
>> On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
>> > This patch provides an enhancement to kexec/kdump. It implements
>> > the following features:
>> >
>> > - Backup/restore memory used by the original kernel before/after
>> > kexec.
>> >
>> > - Save/restore CPU state before/after kexec.
>> >
>>
>> Hi Huang,
>>
>> In general this patch set looks good enough to live in -mm and
>> get some testing going.
>>
>> To me, adding capability to return back to original kernel looks
>> like a logical extension to kexec functionality.
>
> Exciting ;) It's much less code than I expected.
>
> I don't think I understand the feature any more. Once upon a time we
> thought that this might become a new and better (or at least
> better-code-sharing) way of doing suspend-to-disk. How far are we from
> that?

At least the following issues remain:

- We need a mechanism to pass some information (such as the backup pages
map) from the hibernated kernel to the hibernating kernel. Maybe via the
C calling convention.
- To load hibernation image via /sbin/kexec, the segment number
constraint of sys_kexec_load needs to be extended (maybe via
multi-stage loading).
- Make kexec based hibernation compatible with ACPI S4.
- Extend makedumpfile utility for kexec based hibernation.

> What are the prospects of supporting other architectures?

I will work on x86_64 support.

> Who maintains kexec-tools, and are they OK with merging up the
> corresponding changes?

I will work with the kexec-tools mailing list on the corresponding kexec-tools patches.

Best Regards,
Huang Ying

2008-07-14 13:34:14

by Vivek Goyal

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

On Mon, Jul 14, 2008 at 07:46:44AM +0200, Pavel Machek wrote:
> On Sat 2008-07-12 11:02:17, Nigel Cunningham wrote:
> > Hi.
> >
> > On Fri, 2008-07-11 at 16:11 -0400, Vivek Goyal wrote:
> > > Hi Andrew,
> > >
> > > We can use this patchset for hibernation, but can it be a better way of doing
> > > things than what we already have, I don't know. Last time I had raised
> > > this question and power people had various views. In the end, Pavel wanted
> > > this patchset to be in. Pavel, can tell more here...
> > >
> > > To me this patchset looks interesting for couple of reasons.
> > >
> > > - Looks like an interesting feature where one can have a separate kernel
> > > in memory and one can switch between the kernels on the fly. It can
> > > be modified to have more than one kernel in memory at a time.
> >
> > I'm not sure how useful that would be, though. I already have
> > functionality in TuxOnIce which allows you to resume a different image
> > instead of powering off (roughly the same thing when combined with not
> > removing the image after resuming). It was neat when testing to be
> > able
>
> Beauty of kjump is that it is supposed to be used on a half-broken
> system, so it is useful for debugging.
>

What do you mean by "supposed to be used on a half-broken system"?

Thanks
Vivek

2008-08-04 11:01:17

by Pavel Machek

Subject: Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

> On Mon, Jul 14, 2008 at 07:46:44AM +0200, Pavel Machek wrote:
> > On Sat 2008-07-12 11:02:17, Nigel Cunningham wrote:
> > > Hi.
> > >
> > > On Fri, 2008-07-11 at 16:11 -0400, Vivek Goyal wrote:
> > > > Hi Andrew,
> > > >
> > > > We can use this patchset for hibernation, but can it be a better way of doing
> > > > things than what we already have, I don't know. Last time I had raised
> > > > this question and power people had various views. In the end, Pavel wanted
> > > > this patchset to be in. Pavel, can tell more here...
> > > >
> > > > To me this patchset looks interesting for couple of reasons.
> > > >
> > > > - Looks like an interesting feature where one can have a separate kernel
> > > > in memory and one can switch between the kernels on the fly. It can
> > > > be modified to have more than one kernel in memory at a time.
> > >
> > > I'm not sure how useful that would be, though. I already have
> > > functionality in TuxOnIce which allows you to resume a different image
> > > instead of powering off (roughly the same thing when combined with not
> > > removing the image after resuming). It was neat when testing to be
> > > able
> >
> > Beauty of kjump is that it is supposed to be used on a half-broken
> > system, so it is useful for debugging.
> >
>
> What do you mean by "supposed to be used on a half-broken system"?

Maybe some network driver went wrong (but the rest of the machine keeps
working), so you want to do a kdump but do not want to kill the machine
(as the remaining network drivers still make the server useful).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html