2014-01-27 18:58:57

by Vivek Goyal

Subject: [RFC PATCH 00/11][V2] kexec: A new system call to allow in kernel loading

Hi

This is V2 of new system call patches. Previous version was posted here.

https://lkml.org/lkml/2013/11/20/540

V2 primarily makes the following changes:

- Creates a binary object (called purgatory) which runs between the two
kernels. This is a stand-alone relocatable object (it is not linked into the
kernel) and is loaded and relocated by the kexec syscall.

- Provides kexec support for loading ELF images of type ET_EXEC. This only
works for the kexec case, not the kexec-on-panic case. More about this in the
patch changelog.

- Addresses feedback received during the first round.

The primary goal of this patchset is to prepare the groundwork so that the
kernel image can be signed and the signature verified during kexec load. This
should help with two things.

- It should allow kexec/kdump on secureboot enabled machines.

- In general it can help even without secureboot. Being able to verify the
kernel image signature in kexec should help prevent kexec from being used
to bypass module signing restrictions. Matthew Garrett showed how to boot
into a custom kernel, modify the first kernel's memory, then jump back to
the old kernel and bypass any policy one wants.

I have not taken care of the signing part yet. First I want to get to a stage
where all the required pieces of kexec are re-implemented in the kernel, and
then I will look into the signing part. Also, only the 64-bit bzImage entry is
supported: no EFI/UEFI support, no x86_32 support. I am trying to first come
up with the minimum functionality which matters most.

Posting patches for early review. Your feedback and comments are welcome.

Thanks
Vivek


Vivek Goyal (11):
kexec: Move segment verification code in a separate function
resource: Provide new functions to walk through resources
bin2c: Move bin2c in scripts/basic
kernel: Build bin2c based on config option CONFIG_BUILD_BIN2C
kexec: Make kexec_segment user buffer pointer a union
kexec: A new system call, kexec_file_load, for in kernel kexec
kexec: Create a relocatable object called purgatory
kexec-bzImage: Support for loading bzImage using 64bit entry
kexec: Provide a function to add a segment at fixed address
kexec: Support for loading ELF x86_64 images
kexec: Support for Kexec on panic using new system call

arch/x86/Kbuild | 1 +
arch/x86/Kconfig | 2 +
arch/x86/Makefile | 6 +
arch/x86/include/asm/crash.h | 9 +
arch/x86/include/asm/kexec-bzimage.h | 11 +
arch/x86/include/asm/kexec-elf.h | 11 +
arch/x86/include/asm/kexec.h | 51 ++
arch/x86/kernel/Makefile | 3 +
arch/x86/kernel/crash.c | 574 ++++++++++++++
arch/x86/kernel/kexec-bzimage.c | 255 +++++++
arch/x86/kernel/kexec-elf.c | 231 ++++++
arch/x86/kernel/machine_kexec.c | 149 ++++
arch/x86/kernel/machine_kexec_64.c | 173 +++++
arch/x86/purgatory/Makefile | 35 +
arch/x86/purgatory/entry64.S | 111 +++
arch/x86/purgatory/purgatory.c | 103 +++
arch/x86/purgatory/setup-x86_32.S | 29 +
arch/x86/purgatory/setup-x86_64.S | 68 ++
arch/x86/purgatory/sha256.c | 315 ++++++++
arch/x86/purgatory/sha256.h | 33 +
arch/x86/purgatory/stack.S | 29 +
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/ioport.h | 6 +
include/linux/kexec.h | 102 ++-
include/linux/syscalls.h | 3 +
include/uapi/linux/kexec.h | 4 +
init/Kconfig | 5 +
kernel/Makefile | 2 +-
kernel/kexec.c | 1356 +++++++++++++++++++++++++++++++---
kernel/resource.c | 108 ++-
kernel/sys_ni.c | 1 +
scripts/Makefile | 1 -
scripts/basic/Makefile | 1 +
scripts/basic/bin2c.c | 36 +
scripts/bin2c.c | 36 -
35 files changed, 3701 insertions(+), 160 deletions(-)
create mode 100644 arch/x86/include/asm/crash.h
create mode 100644 arch/x86/include/asm/kexec-bzimage.h
create mode 100644 arch/x86/include/asm/kexec-elf.h
create mode 100644 arch/x86/kernel/kexec-bzimage.c
create mode 100644 arch/x86/kernel/kexec-elf.c
create mode 100644 arch/x86/kernel/machine_kexec.c
create mode 100644 arch/x86/purgatory/Makefile
create mode 100644 arch/x86/purgatory/entry64.S
create mode 100644 arch/x86/purgatory/purgatory.c
create mode 100644 arch/x86/purgatory/setup-x86_32.S
create mode 100644 arch/x86/purgatory/setup-x86_64.S
create mode 100644 arch/x86/purgatory/sha256.c
create mode 100644 arch/x86/purgatory/sha256.h
create mode 100644 arch/x86/purgatory/stack.S
create mode 100644 scripts/basic/bin2c.c
delete mode 100644 scripts/bin2c.c

--
1.8.4.2


2014-01-27 18:58:58

by Vivek Goyal

Subject: [PATCH 02/11] resource: Provide new functions to walk through resources

I have added two more functions for walking through resources.

The current walk_system_ram_range() deals in pfns, and /proc/iomem can
contain partial pages. By dealing in pfns, the callback function loses the
information that the last page of a memory range is a partial page and not a
full page. So I implemented walk_system_ram_res(), which passes u64 values
to the callback function and now properly reports start and end addresses.

walk_system_ram_range() uses find_next_system_ram() to find the next RAM
resource. This in turn only travels through the siblings of the top-level
child and does not traverse all the nodes of the resource tree. I also need
another function where I can walk through all the resources, for example to
figure out where the "GART" aperture is, or where ACPI memory is.

So I wrote another function, walk_ram_res(), which walks through all
/proc/iomem resources and returns matches as requested by the caller. The
caller can specify the "name" of the resource, and the start and end.
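The walk helpers are easiest to understand from the caller's side. Below is a small user-space sketch of the same walk-and-clip pattern over a toy, flattened resource table (all names and numbers here are invented for illustration; this is not the kernel implementation, which walks the real iomem tree under resource_lock):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A toy stand-in for a flattened iomem resource tree. */
struct toy_res {
	const char *name;
	uint64_t start, end;		/* inclusive, like struct resource */
};

static struct toy_res toy_iomem[] = {
	{ "System RAM",  0x1000,   0x9fbff   },	/* note: ends mid-page */
	{ "ACPI Tables", 0x9fc00,  0x9ffff   },
	{ "System RAM",  0x100000, 0x1ffffff },
};

/*
 * Mimics walk_ram_res(): invoke func on every resource whose name matches
 * (NULL matches all), clipped to [start, end]. Stops early if the callback
 * returns non-zero and propagates that value.
 */
static int toy_walk_ram_res(const char *name, uint64_t start, uint64_t end,
			    void *arg, int (*func)(uint64_t, uint64_t, void *))
{
	size_t i;
	int ret = -1;

	for (i = 0; i < sizeof(toy_iomem) / sizeof(toy_iomem[0]); i++) {
		struct toy_res *r = &toy_iomem[i];
		uint64_t s, e;

		if (name && strcmp(r->name, name))
			continue;
		if (r->end < start || r->start > end)
			continue;
		/* clip the resource to the caller's window */
		s = r->start > start ? r->start : start;
		e = r->end < end ? r->end : end;
		ret = func(s, e, arg);
		if (ret)
			break;
	}
	return ret;
}

/* Example callback: accumulate the total bytes of matching ranges. */
static int count_bytes(uint64_t s, uint64_t e, void *arg)
{
	*(uint64_t *)arg += e - s + 1;
	return 0;	/* 0 means "keep walking" */
}
```

Because the callback receives u64 addresses rather than pfns, the partial last page of the first "System RAM" entry above is reported exactly, which is the point of walk_system_ram_res() over walk_system_ram_range().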

Signed-off-by: Vivek Goyal <[email protected]>
Cc: Yinghai Lu <[email protected]>
---
include/linux/ioport.h | 6 +++
kernel/resource.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 110 insertions(+), 4 deletions(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 89b7c24..0ebf8b0 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -227,6 +227,12 @@ extern int iomem_is_exclusive(u64 addr);
extern int
walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
void *arg, int (*func)(unsigned long, unsigned long, void *));
+extern int
+walk_system_ram_res(u64 start, u64 end, void *arg,
+ int (*func)(u64, u64, void *));
+extern int
+walk_ram_res(char *name, unsigned long flags, u64 start, u64 end, void *arg,
+ int (*func)(u64, u64, void *));

/* True if any part of r1 overlaps r2 */
static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
diff --git a/kernel/resource.c b/kernel/resource.c
index 3f285dc..5e575e8 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -59,10 +59,8 @@ static DEFINE_RWLOCK(resource_lock);
static struct resource *bootmem_resource_free;
static DEFINE_SPINLOCK(bootmem_resource_lock);

-static void *r_next(struct seq_file *m, void *v, loff_t *pos)
+static struct resource *next_resource(struct resource *p)
{
- struct resource *p = v;
- (*pos)++;
if (p->child)
return p->child;
while (!p->sibling && p->parent)
@@ -70,6 +68,13 @@ static void *r_next(struct seq_file *m, void *v, loff_t *pos)
return p->sibling;
}

+static void *r_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct resource *p = v;
+ (*pos)++;
+ return (void *)next_resource(p);
+}
+
#ifdef CONFIG_PROC_FS

enum { MAX_IORES_LEVEL = 5 };
@@ -322,7 +327,71 @@ int release_resource(struct resource *old)

EXPORT_SYMBOL(release_resource);

-#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
+/*
+ * Finds the lowest iomem resource that exists within [res->start, res->end).
+ * The caller must specify res->start, res->end, res->flags and "name".
+ * If found, returns 0 and res is overwritten; if not found, returns -1.
+ * This walks through the whole tree, not just the first-level children.
+ */
+static int find_next_iomem_res(struct resource *res, char *name)
+{
+ resource_size_t start, end;
+ struct resource *p;
+
+ BUG_ON(!res);
+
+ start = res->start;
+ end = res->end;
+ BUG_ON(start >= end);
+
+ read_lock(&resource_lock);
+ p = &iomem_resource;
+ while ((p = next_resource(p))) {
+ if (p->flags != res->flags)
+ continue;
+ if (name && strcmp(p->name, name))
+ continue;
+ if (p->start > end) {
+ p = NULL;
+ break;
+ }
+ if ((p->end >= start) && (p->start < end))
+ break;
+ }
+
+ read_unlock(&resource_lock);
+ if (!p)
+ return -1;
+ /* copy data */
+ if (res->start < p->start)
+ res->start = p->start;
+ if (res->end > p->end)
+ res->end = p->end;
+ return 0;
+}
+
+int walk_ram_res(char *name, unsigned long flags, u64 start, u64 end,
+ void *arg, int (*func)(u64, u64, void *))
+{
+ struct resource res;
+ u64 orig_end;
+ int ret = -1;
+
+ res.start = start;
+ res.end = end;
+ res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+ orig_end = res.end;
+ while ((res.start < res.end) &&
+ (find_next_iomem_res(&res, name) >= 0)) {
+ ret = (*func)(res.start, res.end, arg);
+ if (ret)
+ break;
+ res.start = res.end + 1;
+ res.end = orig_end;
+ }
+ return ret;
+}
+
/*
* Finds the lowest memory reosurce exists within [res->start.res->end)
* the caller must specify res->start, res->end, res->flags and "name".
@@ -367,6 +436,37 @@ static int find_next_system_ram(struct resource *res, char *name)
/*
* This function calls callback against all memory range of "System RAM"
* which are marked as IORESOURCE_MEM and IORESOUCE_BUSY.
+ * Now, this function is only for "System RAM". It deals with full
+ * ranges rather than pfns. If resources are not page aligned, dealing
+ * in pfns can truncate ranges.
+ */
+int walk_system_ram_res(u64 start, u64 end, void *arg,
+ int (*func)(u64, u64, void *))
+{
+ struct resource res;
+ u64 orig_end;
+ int ret = -1;
+
+ res.start = start;
+ res.end = end;
+ res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+ orig_end = res.end;
+ while ((res.start < res.end) &&
+ (find_next_system_ram(&res, "System RAM") >= 0)) {
+ ret = (*func)(res.start, res.end, arg);
+ if (ret)
+ break;
+ res.start = res.end + 1;
+ res.end = orig_end;
+ }
+ return ret;
+}
+
+#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
+
+/*
+ * This function calls callback against all memory range of "System RAM"
+ * which are marked as IORESOURCE_MEM and IORESOUCE_BUSY.
* Now, this function is only for "System RAM".
*/
int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
--
1.8.4.2

2014-01-27 18:59:12

by Vivek Goyal

Subject: [PATCH 06/11] kexec: A new system call, kexec_file_load, for in kernel kexec

This patch implements the in-kernel kexec functionality. It implements a
new system call, kexec_file_load. I think the parameter list of this system
call will change, as I have not done the kernel image signature handling
yet. I have been told that I might have to pass the detached signature
and its size as part of the system call.

Previously the segment list was prepared in user space. Now user space just
passes a kernel fd, an initrd fd and a command line, and the kernel creates
the segment list internally.
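From user space the call then reduces to a single syscall taking two fds, the command line, and flags. A hedged sketch of a wrapper follows: the syscall number 316 is what this RFC assigns in syscall_64.tbl and may well change before merge (so do not call this on a kernel without these patches), and the flag-legality helper simply mirrors the KEXEC_FILE_FLAGS check the patch applies first thing:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Syscall number as assigned by this series (x86_64 only, may change). */
#ifndef __NR_kexec_file_load
#define __NR_kexec_file_load 316
#endif

/* Flags from include/uapi/linux/kexec.h in this series. */
#define KEXEC_FILE_UNLOAD	0x00000001
#define KEXEC_FILE_ON_CRASH	0x00000002
#define KEXEC_FILE_FLAGS	(KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)

/* Same legality test the kernel performs before doing anything else. */
static int kexec_file_flags_valid(unsigned long flags)
{
	return flags == (flags & KEXEC_FILE_FLAGS);
}

/*
 * Thin wrapper: the kernel reads the kernel and initrd images from the
 * fds itself; cmdline must be NUL-terminated and cmdline_len must count
 * the trailing NUL (the kernel checks the last byte is '\0').
 * Requires CAP_SYS_BOOT; returns -1 with errno set on failure.
 */
static long kexec_file_load(int kernel_fd, int initrd_fd,
			    const char *cmdline, unsigned long flags)
{
	return syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
		       cmdline, strlen(cmdline) + 1, flags);
}
```

The wrapper is illustrative only; the cover letter already notes the parameter list is expected to change once signature handling lands.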

This patch contains the generic part of the code. Actual segment preparation
and loading is done by the arch- and image-specific loaders, which come in
the next patch.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/kernel/machine_kexec_64.c | 50 ++++
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/kexec.h | 55 +++++
include/linux/syscalls.h | 3 +
include/uapi/linux/kexec.h | 4 +
kernel/kexec.c | 495 ++++++++++++++++++++++++++++++++++++-
kernel/sys_ni.c | 1 +
7 files changed, 605 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 4eabc16..c91d72a 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -22,6 +22,13 @@
#include <asm/mmu_context.h>
#include <asm/debugreg.h>

+/* arch dependent functionality related to kexec file based syscall */
+static struct kexec_file_type kexec_file_type[]={
+ {"", NULL, NULL, NULL},
+};
+
+static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
+
static void free_transition_pgtable(struct kimage *image)
{
free_page((unsigned long)image->arch.pud);
@@ -281,3 +288,46 @@ void arch_crash_save_vmcoreinfo(void)
#endif
}

+/* arch dependent functionality related to kexec file based syscall */
+
+int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+ unsigned long buf_len)
+{
+ int i, ret = -ENOEXEC;
+
+ for (i = 0; i < nr_file_types; i++) {
+ if (!kexec_file_type[i].probe)
+ continue;
+
+ ret = kexec_file_type[i].probe(buf, buf_len);
+ if (!ret) {
+ image->file_handler_idx = i;
+ return ret;
+ }
+ }
+
+ return ret;
+}
+
+void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
+ unsigned long kernel_len, char *initrd,
+ unsigned long initrd_len, char *cmdline,
+ unsigned long cmdline_len)
+{
+ int idx = image->file_handler_idx;
+
+ if (idx < 0)
+ return ERR_PTR(-ENOEXEC);
+
+ return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
+ initrd_len, cmdline, cmdline_len);
+}
+
+int arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+ int idx = image->file_handler_idx;
+
+ if (kexec_file_type[idx].cleanup)
+ return kexec_file_type[idx].cleanup(image);
+ return 0;
+}
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..3eec4d4 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
313 common finit_module sys_finit_module
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
+316 common kexec_file_load sys_kexec_file_load

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d8188b3..51b56cd 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -121,13 +121,58 @@ struct kimage {
#define KEXEC_TYPE_DEFAULT 0
#define KEXEC_TYPE_CRASH 1
unsigned int preserve_context : 1;
+ /* If set, we are using file mode kexec syscall */
+ unsigned int file_mode : 1;

#ifdef ARCH_HAS_KIMAGE_ARCH
struct kimage_arch arch;
#endif
+
+ /* Additional Fields for file based kexec syscall */
+ void *kernel_buf;
+ unsigned long kernel_buf_len;
+
+ void *initrd_buf;
+ unsigned long initrd_buf_len;
+
+ char *cmdline_buf;
+ unsigned long cmdline_buf_len;
+
+ /* index of file handler in array */
+ int file_handler_idx;
+
+ /* Image loader handling the kernel can store a pointer here */
+ void * image_loader_data;
};

+/*
+ * Keeps a track of buffer parameters as provided by caller for requesting
+ * memory placement of buffer.
+ */
+struct kexec_buf {
+ struct kimage *image;
+ char *buffer;
+ unsigned long bufsz;
+ unsigned long memsz;
+ unsigned long buf_align;
+ unsigned long buf_min;
+ unsigned long buf_max;
+ int top_down; /* allocate from top of memory hole */
+};

+typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);
+typedef void *(kexec_load_t)(struct kimage *image, char *kernel_buf,
+ unsigned long kernel_len, char *initrd,
+ unsigned long initrd_len, char *cmdline,
+ unsigned long cmdline_len);
+typedef int (kexec_cleanup_t)(struct kimage *image);
+
+struct kexec_file_type {
+ const char *name;
+ kexec_probe_t *probe;
+ kexec_load_t *load;
+ kexec_cleanup_t *cleanup;
+};

/* kexec interface functions */
extern void machine_kexec(struct kimage *image);
@@ -138,6 +183,11 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
struct kexec_segment __user *segments,
unsigned long flags);
extern int kernel_kexec(void);
+extern int kexec_add_buffer(struct kimage *image, char *buffer,
+ unsigned long bufsz, unsigned long memsz,
+ unsigned long buf_align, unsigned long buf_min,
+ unsigned long buf_max, int buf_end,
+ unsigned long *load_addr);
#ifdef CONFIG_COMPAT
extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
unsigned long nr_segments,
@@ -146,6 +196,8 @@ extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
#endif
extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
+extern void kimage_set_start_addr(struct kimage *image, unsigned long start);
+
extern void crash_kexec(struct pt_regs *);
int kexec_should_crash(struct task_struct *);
void crash_save_cpu(struct pt_regs *regs, int cpu);
@@ -194,6 +246,9 @@ extern int kexec_load_disabled;
#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
#endif

+/* List of defined/legal kexec file flags */
+#define KEXEC_FILE_FLAGS (KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH)
+
#define VMCOREINFO_BYTES (4096)
#define VMCOREINFO_NOTE_NAME "VMCOREINFO"
#define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 40ed9e9..db55884 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -307,6 +307,9 @@ asmlinkage long sys_restart_syscall(void);
asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
struct kexec_segment __user *segments,
unsigned long flags);
+asmlinkage long sys_kexec_file_load(int kernel_fd, int initrd_fd,
+ const char __user * cmdline_ptr,
+ unsigned long cmdline_len, unsigned long flags);

asmlinkage long sys_exit(int error_code);
asmlinkage long sys_exit_group(int error_code);
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index d6629d4..5fddb1b 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -13,6 +13,10 @@
#define KEXEC_PRESERVE_CONTEXT 0x00000002
#define KEXEC_ARCH_MASK 0xffff0000

+/* Kexec file load interface flags */
+#define KEXEC_FILE_UNLOAD 0x00000001
+#define KEXEC_FILE_ON_CRASH 0x00000002
+
/* These values match the ELF architecture values.
* Unless there is a good reason that should continue to be the case.
*/
diff --git a/kernel/kexec.c b/kernel/kexec.c
index c0944b2..b28578a 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -123,6 +123,11 @@ static struct page *kimage_alloc_page(struct kimage *image,
gfp_t gfp_mask,
unsigned long dest);

+void kimage_set_start_addr(struct kimage *image, unsigned long start)
+{
+ image->start = start;
+}
+
static int copy_user_segment_list(struct kimage *image,
unsigned long nr_segments,
struct kexec_segment __user *segments)
@@ -259,6 +264,219 @@ static struct kimage *do_kimage_alloc_init(void)

static void kimage_free_page_list(struct list_head *list);

+static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
+{
+ struct fd f = fdget(fd);
+ int ret = 0;
+ struct kstat stat;
+ loff_t pos;
+ ssize_t bytes = 0;
+
+ if (!f.file)
+ return -EBADF;
+
+ ret = vfs_getattr(&f.file->f_path, &stat);
+ if (ret)
+ goto out;
+
+ if (stat.size > INT_MAX) {
+ ret = -EFBIG;
+ goto out;
+ }
+
+ /* Don't hand 0 to vmalloc, it whines. */
+ if (stat.size == 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ *buf = vmalloc(stat.size);
+ if (!*buf) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ pos = 0;
+ while (pos < stat.size) {
+ bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
+ stat.size - pos);
+ if (bytes < 0) {
+ vfree(*buf);
+ ret = bytes;
+ goto out;
+ }
+
+ if (bytes == 0)
+ break;
+ pos += bytes;
+ }
+
+ *buf_len = pos;
+
+out:
+ fdput(f);
+ return ret;
+}
+
+/* Architectures can provide this probe function */
+int __attribute__ ((weak))
+arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+ unsigned long buf_len)
+{
+ return -ENOEXEC;
+}
+
+void * __attribute__ ((weak))
+arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
+ unsigned long kernel_len, char *initrd,
+ unsigned long initrd_len, char *cmdline,
+ unsigned long cmdline_len)
+{
+ return ERR_PTR(-ENOEXEC);
+}
+
+int __attribute__ ((weak))
+arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+	return 0;
+}
+
+/*
+ * Free up temporary buffers which are not needed after the image has
+ * been loaded.
+ *
+ * Free up memory used by the kernel, initrd, and command line. These are
+ * temporary allocations which are not needed any more after the buffers
+ * have been loaded into separate segments and copied elsewhere.
+ */
+static void kimage_file_post_load_cleanup(struct kimage *image)
+{
+ vfree(image->kernel_buf);
+ image->kernel_buf = NULL;
+
+ vfree(image->initrd_buf);
+ image->initrd_buf = NULL;
+
+ vfree(image->cmdline_buf);
+ image->cmdline_buf = NULL;
+
+ /* See if the architecture has anything to clean up post load */
+ arch_kimage_file_post_load_cleanup(image);
+}
+
+/*
+ * In file mode, the list of segments is prepared by the kernel. Copy the
+ * relevant data from user space, do error checking, prepare the segment list
+ */
+static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
+ int initrd_fd, const char __user *cmdline_ptr,
+ unsigned long cmdline_len)
+{
+ int ret = 0;
+ void *ldata;
+
+ ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
+ &image->kernel_buf_len);
+ if (ret)
+ goto out;
+
+ /* Call arch image probe handlers */
+ ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
+ image->kernel_buf_len);
+
+ if (ret)
+ goto out;
+
+ ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
+ &image->initrd_buf_len);
+ if (ret)
+ goto out;
+
+ image->cmdline_buf = vzalloc(cmdline_len);
+ if (!image->cmdline_buf)
+ goto out;
+
+ ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
+ if (ret) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ image->cmdline_buf_len = cmdline_len;
+
+ /* command line should be a string with last byte null */
+ if (image->cmdline_buf[cmdline_len - 1] != '\0') {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Call arch image load handlers */
+ ldata = arch_kexec_kernel_image_load(image,
+ image->kernel_buf, image->kernel_buf_len,
+ image->initrd_buf, image->initrd_buf_len,
+ image->cmdline_buf, image->cmdline_buf_len);
+
+ if (IS_ERR(ldata)) {
+ ret = PTR_ERR(ldata);
+ goto out;
+ }
+
+ image->image_loader_data = ldata;
+out:
+ return ret;
+}
+
+static int kimage_file_normal_alloc(struct kimage **rimage, int kernel_fd,
+ int initrd_fd, const char __user *cmdline_ptr,
+ unsigned long cmdline_len)
+{
+ int result;
+ struct kimage *image;
+
+ /* Allocate and initialize a controlling structure */
+ image = do_kimage_alloc_init();
+ if (!image)
+ return -ENOMEM;
+
+ image->file_mode = 1;
+ image->file_handler_idx = -1;
+
+ result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+ cmdline_ptr, cmdline_len);
+ if (result)
+ goto out_free_image;
+
+ result = sanity_check_segment_list(image);
+ if (result)
+ goto out_free_post_load_bufs;
+
+ result = -ENOMEM;
+ image->control_code_page = kimage_alloc_control_pages(image,
+ get_order(KEXEC_CONTROL_PAGE_SIZE));
+ if (!image->control_code_page) {
+ printk(KERN_ERR "Could not allocate control_code_buffer\n");
+ goto out_free_post_load_bufs;
+ }
+
+ image->swap_page = kimage_alloc_control_pages(image, 0);
+ if (!image->swap_page) {
+ printk(KERN_ERR "Could not allocate swap buffer\n");
+ goto out_free_control_pages;
+ }
+
+ *rimage = image;
+ return 0;
+
+out_free_control_pages:
+ kimage_free_page_list(&image->control_pages);
+out_free_post_load_bufs:
+ kimage_file_post_load_cleanup(image);
+ kfree(image->image_loader_data);
+out_free_image:
+ kfree(image);
+ return result;
+}
+
static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
unsigned long nr_segments,
struct kexec_segment __user *segments)
@@ -682,6 +900,16 @@ static void kimage_free(struct kimage *image)

/* Free the kexec control pages... */
kimage_free_page_list(&image->control_pages);
+
+ kfree(image->image_loader_data);
+
+ /*
+ * Free up any temporary buffers allocated. This path might be hit if
+ * an error occurred long after buffer allocation.
+ */
+ if (image->file_mode)
+ kimage_file_post_load_cleanup(image);
+
kfree(image);
}

@@ -811,10 +1039,14 @@ static int kimage_load_normal_segment(struct kimage *image,
unsigned long maddr;
size_t ubytes, mbytes;
int result;
- unsigned char __user *buf;
+ unsigned char __user *buf = NULL;
+ unsigned char *kbuf = NULL;

result = 0;
- buf = segment->buf;
+ if (image->file_mode)
+ kbuf = segment->kbuf;
+ else
+ buf = segment->buf;
ubytes = segment->bufsz;
mbytes = segment->memsz;
maddr = segment->mem;
@@ -846,7 +1078,11 @@ static int kimage_load_normal_segment(struct kimage *image,
PAGE_SIZE - (maddr & ~PAGE_MASK));
uchunk = min(ubytes, mchunk);

- result = copy_from_user(ptr, buf, uchunk);
+ /* For file based kexec, source pages are in kernel memory */
+ if (image->file_mode)
+ memcpy(ptr, kbuf, uchunk);
+ else
+ result = copy_from_user(ptr, buf, uchunk);
kunmap(page);
if (result) {
result = -EFAULT;
@@ -854,7 +1090,10 @@ static int kimage_load_normal_segment(struct kimage *image,
}
ubytes -= uchunk;
maddr += mchunk;
- buf += mchunk;
+ if (image->file_mode)
+ kbuf += mchunk;
+ else
+ buf += mchunk;
mbytes -= mchunk;
}
out:
@@ -1097,6 +1336,72 @@ asmlinkage long compat_sys_kexec_load(unsigned long entry,
}
#endif

+SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, const char __user *, cmdline_ptr, unsigned long, cmdline_len, unsigned long, flags)
+{
+ int ret = 0, i;
+ struct kimage **dest_image, *image;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_BOOT))
+ return -EPERM;
+
+ pr_debug("kexec_file_load: kernel_fd=%d initrd_fd=%d cmdline=0x%p"
+ " cmdline_len=%lu flags=0x%lx\n", kernel_fd, initrd_fd,
+ cmdline_ptr, cmdline_len, flags);
+
+ /* Make sure we have a legal set of flags */
+ if (flags != (flags & KEXEC_FILE_FLAGS))
+ return -EINVAL;
+
+ image = NULL;
+
+ if (!mutex_trylock(&kexec_mutex))
+ return -EBUSY;
+
+ dest_image = &kexec_image;
+ if (flags & KEXEC_FILE_ON_CRASH)
+ dest_image = &kexec_crash_image;
+
+ if (flags & KEXEC_FILE_UNLOAD)
+ goto exchange;
+
+ ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
+ cmdline_ptr, cmdline_len);
+ if (ret)
+ goto out;
+
+ ret = machine_kexec_prepare(image);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < image->nr_segments; i++) {
+ struct kexec_segment *ksegment;
+
+ ksegment = &image->segment[i];
+ pr_debug("Loading segment %d: buf=0x%p bufsz=0x%zx mem=0x%lx"
+ " memsz=0x%zx\n", i, ksegment->buf, ksegment->bufsz,
+ ksegment->mem, ksegment->memsz);
+ ret = kimage_load_segment(image, &image->segment[i]);
+ if (ret)
+ goto out;
+ pr_debug("Done loading segment %d\n", i);
+ }
+
+ kimage_terminate(image);
+
+ /*
+ * Free up any temporary buffers allocated which are not needed
+ * after image has been loaded
+ */
+ kimage_file_post_load_cleanup(image);
+exchange:
+ image = xchg(dest_image, image);
+out:
+ mutex_unlock(&kexec_mutex);
+ kimage_free(image);
+ return ret;
+}
+
void crash_kexec(struct pt_regs *regs)
{
/* Take the kexec_mutex here to prevent sys_kexec_load
@@ -1651,6 +1956,188 @@ static int __init crash_save_vmcoreinfo_init(void)

module_init(crash_save_vmcoreinfo_init)

+static int __kexec_add_segment(struct kimage *image, char *buf,
+ unsigned long bufsz, unsigned long mem, unsigned long memsz)
+{
+ struct kexec_segment *ksegment;
+
+ ksegment = &image->segment[image->nr_segments];
+ ksegment->kbuf = buf;
+ ksegment->bufsz = bufsz;
+ ksegment->mem = mem;
+ ksegment->memsz = memsz;
+ image->nr_segments++;
+
+ return 0;
+}
+
+static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
+ struct kexec_buf *kbuf)
+{
+ struct kimage *image = kbuf->image;
+ unsigned long temp_start, temp_end;
+
+ temp_end = min(end, kbuf->buf_max);
+ temp_start = temp_end - kbuf->memsz;
+
+ do {
+ /* align down start */
+ temp_start = temp_start & (~ (kbuf->buf_align - 1));
+
+ if (temp_start < start || temp_start < kbuf->buf_min)
+ return 0;
+
+ temp_end = temp_start + kbuf->memsz - 1;
+
+ /*
+ * Make sure this does not conflict with any of existing
+ * segments
+ */
+ if (kimage_is_destination_range(image, temp_start, temp_end)) {
+ temp_start = temp_start - PAGE_SIZE;
+ continue;
+ }
+
+ /* We found a suitable memory range */
+ break;
+ } while(1);
+
+ /* If we are here, we found a suitable memory range */
+ __kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+ kbuf->memsz);
+
+ /* Stop navigating through remaining System RAM ranges */
+ return 1;
+}
+
+static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
+ struct kexec_buf *kbuf)
+{
+ struct kimage *image = kbuf->image;
+ unsigned long temp_start, temp_end;
+
+ temp_start = max(start, kbuf->buf_min);
+
+ do {
+ temp_start = ALIGN(temp_start, kbuf->buf_align);
+ temp_end = temp_start + kbuf->memsz - 1;
+
+ if (temp_end > end || temp_end > kbuf->buf_max)
+ return 0;
+ /*
+ * Make sure this does not conflict with any of existing
+ * segments
+ */
+ if (kimage_is_destination_range(image, temp_start, temp_end)) {
+ temp_start = temp_start + PAGE_SIZE;
+ continue;
+ }
+
+ /* We found a suitable memory range */
+ break;
+ } while(1);
+
+ /* If we are here, we found a suitable memory range */
+ __kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+ kbuf->memsz);
+
+ /* Stop navigating through remaining System RAM ranges */
+ return 1;
+}
+
+static int walk_ram_range_callback(u64 start, u64 end, void *arg)
+{
+ struct kexec_buf *kbuf = (struct kexec_buf *)arg;
+ unsigned long sz = end - start + 1;
+
+ /* Returning 0 moves on to the next memory range */
+ if (sz < kbuf->memsz)
+ return 0;
+
+ if (end < kbuf->buf_min || start > kbuf->buf_max)
+ return 0;
+
+ /*
+ * Allocate memory top down within the RAM range; otherwise allocate
+ * bottom up.
+ */
+ if (kbuf->top_down)
+ return locate_mem_hole_top_down(start, end, kbuf);
+ else
+ return locate_mem_hole_bottom_up(start, end, kbuf);
+}
+
+/*
+ * Helper function for placing a buffer in a kexec segment. This assumes
+ * that kexec_mutex is held.
+ */
+int kexec_add_buffer(struct kimage *image, char *buffer,
+ unsigned long bufsz, unsigned long memsz,
+ unsigned long buf_align, unsigned long buf_min,
+ unsigned long buf_max, int top_down, unsigned long *load_addr)
+{
+
+ unsigned long nr_segments = image->nr_segments, new_nr_segments;
+ struct kexec_segment *ksegment;
+ struct kexec_buf *kbuf;
+
+ /* Currently adding segment this way is allowed only in file mode */
+ if (!image->file_mode)
+ return -EINVAL;
+
+ if (nr_segments >= KEXEC_SEGMENT_MAX)
+ return -EINVAL;
+
+ /*
+ * Make sure we are not trying to add a buffer after allocating
+ * control pages. All segments need to be placed before any control
+ * pages are allocated, since the control page allocation logic goes
+ * through the list of segments to make sure there are no destination
+ * overlaps.
+ */
+ WARN_ONCE(!list_empty(&image->control_pages), "Adding kexec buffer"
+ " after allocating control pages\n");
+
+ kbuf = kzalloc(sizeof(struct kexec_buf), GFP_KERNEL);
+ if (!kbuf)
+ return -ENOMEM;
+
+ kbuf->image = image;
+ kbuf->buffer = buffer;
+ kbuf->bufsz = bufsz;
+ /* Align memsz to next page boundary */
+ kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
+
+ /* Align to at least a page size boundary */
+ kbuf->buf_align = max(buf_align, PAGE_SIZE);
+ kbuf->buf_min = buf_min;
+ kbuf->buf_max = buf_max;
+ kbuf->top_down = top_down;
+
+ /* Walk the RAM ranges and allocate a suitable range for the buffer */
+ walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+
+ kbuf->image = NULL;
+ kfree(kbuf);
+
+ /*
+ * If a range was found successfully, it will have incremented
+ * the nr_segments value.
+ */
+ new_nr_segments = image->nr_segments;
+
+ /* A suitable memory range could not be found for buffer */
+ if (new_nr_segments == nr_segments)
+ return -EADDRNOTAVAIL;
+
+ /* Found a suitable memory range */
+
+ ksegment = &image->segment[new_nr_segments - 1];
+ *load_addr = ksegment->mem;
+ return 0;
+}
+
+
/*
* Move into place and start executing a preloaded standalone
* executable. If nothing was preloaded return an error.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052..7e1e13d 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -25,6 +25,7 @@ cond_syscall(sys_swapon);
cond_syscall(sys_swapoff);
cond_syscall(sys_kexec_load);
cond_syscall(compat_sys_kexec_load);
+cond_syscall(sys_kexec_file_load);
cond_syscall(sys_init_module);
cond_syscall(sys_finit_module);
cond_syscall(sys_delete_module);
--
1.8.4.2

2014-01-27 18:59:09

by Vivek Goyal

Subject: [PATCH 08/11] kexec-bzImage: Support for loading bzImage using 64bit entry

This is loader-specific code that can load a bzImage and set it up for
64-bit entry. It does not take care of 32-bit entry or real-mode entry
yet.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/include/asm/kexec-bzimage.h | 11 ++
arch/x86/include/asm/kexec.h | 30 +++++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/kexec-bzimage.c | 234 +++++++++++++++++++++++++++++++++++
arch/x86/kernel/machine_kexec.c | 136 ++++++++++++++++++++
arch/x86/kernel/machine_kexec_64.c | 3 +-
6 files changed, 415 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/kexec-bzimage.h
create mode 100644 arch/x86/kernel/kexec-bzimage.c
create mode 100644 arch/x86/kernel/machine_kexec.c

diff --git a/arch/x86/include/asm/kexec-bzimage.h b/arch/x86/include/asm/kexec-bzimage.h
new file mode 100644
index 0000000..9e83961
--- /dev/null
+++ b/arch/x86/include/asm/kexec-bzimage.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_BZIMAGE_H
+#define _ASM_BZIMAGE_H
+
+extern int bzImage64_probe(const char *buf, unsigned long len);
+extern void *bzImage64_load(struct kimage *image, char *kernel,
+ unsigned long kernel_len, char *initrd,
+ unsigned long initrd_len, char *cmdline,
+ unsigned long cmdline_len);
+extern int bzImage64_cleanup(struct kimage *image);
+
+#endif /* _ASM_BZIMAGE_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 17483a4..9bd6fec 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -23,6 +23,7 @@

#include <asm/page.h>
#include <asm/ptrace.h>
+#include <asm/bootparam.h>

/*
* KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
@@ -161,11 +162,40 @@ struct kimage_arch {
pmd_t *pmd;
pte_t *pte;
};
+
+struct kexec_entry64_regs {
+ uint64_t rax;
+ uint64_t rbx;
+ uint64_t rcx;
+ uint64_t rdx;
+ uint64_t rsi;
+ uint64_t rdi;
+ uint64_t rsp;
+ uint64_t rbp;
+ uint64_t r8;
+ uint64_t r9;
+ uint64_t r10;
+ uint64_t r11;
+ uint64_t r12;
+ uint64_t r13;
+ uint64_t r14;
+ uint64_t r15;
+ uint64_t rip;
+};
#endif

typedef void crash_vmclear_fn(void);
extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;

+extern int kexec_setup_initrd(struct boot_params *boot_params,
+ unsigned long initrd_load_addr, unsigned long initrd_len);
+extern int kexec_setup_cmdline(struct boot_params *boot_params,
+ unsigned long bootparams_load_addr,
+ unsigned long cmdline_offset, char *cmdline,
+ unsigned long cmdline_len);
+extern int kexec_setup_boot_parameters(struct boot_params *params);
+
+
#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index cb648c8..fa9981d 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -67,8 +67,10 @@ obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o
obj-$(CONFIG_X86_TSC) += trace_clock.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o
obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o
obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o
+obj-$(CONFIG_KEXEC) += kexec-bzimage.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o
obj-y += kprobes/
obj-$(CONFIG_MODULES) += module.o
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
new file mode 100644
index 0000000..cbfcd00
--- /dev/null
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -0,0 +1,234 @@
+#include <linux/string.h>
+#include <linux/printk.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kexec.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+#ifdef CONFIG_X86_64
+
+struct bzimage64_data {
+ /*
+ * Temporary buffer to hold the bootparams. This should be
+ * freed once the bootparam segment has been loaded.
+ */
+ void *bootparams_buf;
+};
+
+int bzImage64_probe(const char *buf, unsigned long len)
+{
+ int ret = -ENOEXEC;
+ struct setup_header *header;
+
+ if (len < 2 * 512) {
+ pr_debug("File is too short to be a bzImage\n");
+ return ret;
+ }
+
+ header = (struct setup_header *)(buf + 0x1F1);
+ if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
+ pr_debug("Not a bzImage\n");
+ return ret;
+ }
+
+ if (header->boot_flag != 0xAA55) {
+ /* No x86 boot sector present */
+ pr_debug("No x86 boot sector present\n");
+ return ret;
+ }
+
+ if (header->version < 0x020C) {
+ /* Must be at least protocol version 2.12 */
+ pr_debug("Must be at least protocol version 2.12\n");
+ return ret;
+ }
+
+ if ((header->loadflags & 1) == 0) {
+ /* Not a bzImage */
+ pr_debug("zImage not a bzImage\n");
+ return ret;
+ }
+
+ if ((header->xloadflags & 3) != 3) {
+ /* XLF_KERNEL_64 and XLF_CAN_BE_LOADED_ABOVE_4G should be set */
+ pr_debug("Not a relocatable bzImage64\n");
+ return ret;
+ }
+
+ /* I've got a bzImage */
+ pr_debug("It's a relocatable bzImage64\n");
+ ret = 0;
+
+ return ret;
+}
+
+void *bzImage64_load(struct kimage *image, char *kernel,
+ unsigned long kernel_len,
+ char *initrd, unsigned long initrd_len,
+ char *cmdline, unsigned long cmdline_len)
+{
+
+ struct setup_header *header;
+ int setup_sects, kern16_size, ret = 0;
+ unsigned long setup_header_size, params_cmdline_sz;
+ struct boot_params *params;
+ unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
+ unsigned long purgatory_load_addr;
+ unsigned long kernel_bufsz, kernel_memsz, kernel_align;
+ char *kernel_buf;
+ struct bzimage64_data *ldata;
+ struct kexec_entry64_regs regs64;
+ void *stack;
+
+ header = (struct setup_header *)(kernel + 0x1F1);
+ setup_sects = header->setup_sects;
+ if (setup_sects == 0)
+ setup_sects = 4;
+
+ kern16_size = (setup_sects + 1) * 512;
+ if (kernel_len < kern16_size) {
+ pr_debug("bzImage truncated\n");
+ return ERR_PTR(-ENOEXEC);
+ }
+
+ if (cmdline_len > header->cmdline_size) {
+ pr_debug("Kernel command line too long\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* Allocate loader specific data */
+ ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
+ if (!ldata)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * Load purgatory. For 64bit entry point, purgatory code can be
+ * anywhere.
+ */
+ ret = kexec_load_purgatory(image, 0x3000, -1, 1, &purgatory_load_addr);
+ if (ret) {
+ pr_debug("Loading purgatory failed\n");
+ goto out_free_loader_data;
+ }
+
+ pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
+
+ /* Load Bootparams and cmdline */
+ params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+ params = kzalloc(params_cmdline_sz, GFP_KERNEL);
+ if (!params) {
+ ret = -ENOMEM;
+ goto out_free_loader_data;
+ }
+
+ /* Copy setup header onto bootparams. */
+ setup_header_size = 0x0202 + kernel[0x0201] - 0x1F1;
+
+ /* Is there a limit on setup header size? */
+ memcpy(&params->hdr, (kernel + 0x1F1), setup_header_size);
+ ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
+ params_cmdline_sz, 16, 0x3000, -1, 1,
+ &bootparam_load_addr);
+ if (ret)
+ goto out_free_params;
+ pr_debug("Loaded boot_param and command line at 0x%lx sz=0x%lx\n",
+ bootparam_load_addr, params_cmdline_sz);
+
+ /* Load kernel */
+ kernel_buf = kernel + kern16_size;
+ kernel_bufsz = kernel_len - kern16_size;
+ kernel_memsz = ALIGN(header->init_size, 4096);
+ kernel_align = header->kernel_alignment;
+
+ ret = kexec_add_buffer(image, kernel_buf,
+ kernel_bufsz, kernel_memsz, kernel_align, 0x100000,
+ -1, 1, &kernel_load_addr);
+ if (ret)
+ goto out_free_params;
+
+ pr_debug("Loaded 64bit kernel at 0x%lx sz = 0x%lx\n", kernel_load_addr,
+ kernel_memsz);
+
+ /* Load initrd high */
+ if (initrd) {
+ ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
+ 4096, 0x1000000, ULONG_MAX, 1, &initrd_load_addr);
+ if (ret)
+ goto out_free_params;
+
+ pr_debug("Loaded initrd at 0x%lx sz = 0x%lx\n",
+ initrd_load_addr, initrd_len);
+ ret = kexec_setup_initrd(params, initrd_load_addr, initrd_len);
+ if (ret)
+ goto out_free_params;
+ }
+
+ ret = kexec_setup_cmdline(params, bootparam_load_addr,
+ sizeof(struct boot_params), cmdline, cmdline_len);
+ if (ret)
+ goto out_free_params;
+
+ /* bootloader info. Do we need a separate ID for kexec kernel loader? */
+ params->hdr.type_of_loader = 0x0D << 4;
+ params->hdr.loadflags = 0;
+
+ /* Setup purgatory regs for entry */
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 1);
+ if (ret)
+ goto out_free_params;
+
+ regs64.rbx = 0; /* Bootstrap Processor */
+ regs64.rsi = bootparam_load_addr;
+ regs64.rip = kernel_load_addr + 0x200;
+ stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
+ if (IS_ERR(stack)) {
+ pr_debug("Could not find address of symbol stack_end\n");
+ ret = -EINVAL;
+ goto out_free_params;
+ }
+
+ regs64.rsp = (unsigned long)stack;
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 0);
+ if (ret)
+ goto out_free_params;
+
+ ret = kexec_setup_boot_parameters(params);
+ if (ret)
+ goto out_free_params;
+
+ /*
+ * Store pointer to params so that it could be freed after loading
+ * params segment has been loaded and contents have been copied
+ * somewhere else.
+ */
+ ldata->bootparams_buf = params;
+ return ldata;
+
+out_free_params:
+ kfree(params);
+out_free_loader_data:
+ kfree(ldata);
+ return ERR_PTR(ret);
+}
+
+/* This cleanup function is called after various segments have been loaded */
+int bzImage64_cleanup(struct kimage *image)
+{
+ struct bzimage64_data *ldata = image->image_loader_data;
+
+ if (!ldata)
+ return 0;
+
+ kfree(ldata->bootparams_buf);
+ ldata->bootparams_buf = NULL;
+
+ return 0;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
new file mode 100644
index 0000000..ac55890
--- /dev/null
+++ b/arch/x86/kernel/machine_kexec.c
@@ -0,0 +1,136 @@
+/*
+ * handle transition of Linux booting another kernel
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ * Authors:
+ * Vivek Goyal <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+/*
+ * Common code for x86 and x86_64 used for kexec.
+ *
+ * For the time being it compiles only for x86_64 as there are no image
+ * loaders implemented for x86_32. This #ifdef can be removed once
+ * somebody decides to write an image loader for CONFIG_X86_32.
+ */
+
+#ifdef CONFIG_X86_64
+
+int kexec_setup_initrd(struct boot_params *boot_params,
+ unsigned long initrd_load_addr, unsigned long initrd_len)
+{
+ boot_params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
+ boot_params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
+
+ boot_params->ext_ramdisk_image = initrd_load_addr >> 32;
+ boot_params->ext_ramdisk_size = initrd_len >> 32;
+
+ return 0;
+}
+
+int kexec_setup_cmdline(struct boot_params *boot_params,
+ unsigned long bootparams_load_addr,
+ unsigned long cmdline_offset, char *cmdline,
+ unsigned long cmdline_len)
+{
+ char *cmdline_ptr = ((char *)boot_params) + cmdline_offset;
+ unsigned long cmdline_ptr_phys;
+ uint32_t cmdline_low_32, cmdline_ext_32;
+
+ memcpy(cmdline_ptr, cmdline, cmdline_len);
+ cmdline_ptr[cmdline_len - 1] = '\0';
+
+ cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
+ cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
+ cmdline_ext_32 = cmdline_ptr_phys >> 32;
+
+ boot_params->hdr.cmd_line_ptr = cmdline_low_32;
+ if (cmdline_ext_32)
+ boot_params->ext_cmd_line_ptr = cmdline_ext_32;
+
+ return 0;
+}
+
+static int setup_memory_map_entries(struct boot_params *params)
+{
+ unsigned int nr_e820_entries;
+
+ /* TODO: What about EFI */
+ nr_e820_entries = e820_saved.nr_map;
+ if (nr_e820_entries > E820MAX)
+ nr_e820_entries = E820MAX;
+
+ params->e820_entries = nr_e820_entries;
+ memcpy(&params->e820_map, &e820_saved.map,
+ nr_e820_entries * sizeof(struct e820entry));
+
+ return 0;
+}
+
+int kexec_setup_boot_parameters(struct boot_params *params)
+{
+ unsigned int nr_e820_entries;
+ unsigned long long mem_k, start, end;
+ int i;
+
+ /* Get subarch from existing bootparams */
+ params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
+
+ /* Copying screen_info will do? */
+ memcpy(&params->screen_info, &boot_params.screen_info,
+ sizeof(struct screen_info));
+
+ /* Fill in memsize later */
+ params->screen_info.ext_mem_k = 0;
+ params->alt_mem_k = 0;
+
+ /* Default APM info */
+ memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
+
+ /* Default drive info */
+ memset(&params->hd0_info, 0, sizeof(params->hd0_info));
+ memset(&params->hd1_info, 0, sizeof(params->hd1_info));
+
+ /* Default sysdesc table */
+ params->sys_desc_table.length = 0;
+
+ setup_memory_map_entries(params);
+ nr_e820_entries = params->e820_entries;
+
+ for(i = 0; i < nr_e820_entries; i++) {
+ if (params->e820_map[i].type != E820_RAM)
+ continue;
+ start = params->e820_map[i].addr;
+ end = params->e820_map[i].addr + params->e820_map[i].size - 1;
+
+ if ((start <= 0x100000) && end > 0x100000) {
+ mem_k = (end >> 10) - (0x100000 >> 10);
+ params->screen_info.ext_mem_k = mem_k;
+ params->alt_mem_k = mem_k;
+ if (mem_k > 0xfc00)
+ params->screen_info.ext_mem_k = 0xfc00; /* 64M*/
+ if (mem_k > 0xffffffff)
+ params->alt_mem_k = 0xffffffff;
+ }
+ }
+
+ /* Setup EDD info */
+ memcpy(params->eddbuf, boot_params.eddbuf,
+ EDDMAXNR * sizeof(struct edd_info));
+ params->eddbuf_entries = boot_params.eddbuf_entries;
+
+ memcpy(params->edd_mbr_sig_buffer, boot_params.edd_mbr_sig_buffer,
+ EDD_MBR_SIG_MAX * sizeof(unsigned int));
+
+ return 0;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 8866c5e..37df7d3 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -21,10 +21,11 @@
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/debugreg.h>
+#include <asm/kexec-bzimage.h>

/* arch dependent functionality related to kexec file based syscall */
static struct kexec_file_type kexec_file_type[]={
- {"", NULL, NULL, NULL},
+ {"bzImage64", bzImage64_probe, bzImage64_load, bzImage64_cleanup},
};

static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
--
1.8.4.2

2014-01-27 18:59:46

by Vivek Goyal

Subject: [PATCH 07/11] kexec: Create a relocatable object called purgatory

Create a stand-alone relocatable object, purgatory, which runs between two
kernels. The name, concept and some code have been taken from kexec-tools.
The idea is that this code runs after a crash, in a minimal environment, so
keep it separate from the rest of the kernel; in the long term we should
have to do practically no maintenance of this code.

This code also contains the logic to verify sha256 hashes of the various
segments which have been loaded into memory. So first we verify that
the kernel we are jumping to is fine and has not been corrupted, and we
make progress only if the checksums verify.

This code also takes care of copying some memory contents to the backup
region.

sha256 hash related code has been taken from crypto/sha256_generic.c. I
could not call into the functions exported by sha256_generic.c directly as
we don't link against the kernel; purgatory is a stand-alone object.

Also, sha256_generic.c is supposed to work with the higher-level crypto
abstractions and APIs, and there was no point in importing all of that
into purgatory. So instead of doing #include on sha256_generic.c I just
copied the relevant portions of code into arch/x86/purgatory/sha256.c. Now
we shouldn't have to touch this code at all. Do let me know if there are
better ways to handle it.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/Kbuild | 1 +
arch/x86/Kconfig | 1 +
arch/x86/Makefile | 6 +
arch/x86/kernel/machine_kexec_64.c | 82 +++++++
arch/x86/purgatory/Makefile | 35 +++
arch/x86/purgatory/entry64.S | 111 +++++++++
arch/x86/purgatory/purgatory.c | 103 ++++++++
arch/x86/purgatory/setup-x86_32.S | 29 +++
arch/x86/purgatory/setup-x86_64.S | 68 ++++++
arch/x86/purgatory/sha256.c | 315 ++++++++++++++++++++++++
arch/x86/purgatory/sha256.h | 33 +++
arch/x86/purgatory/stack.S | 29 +++
include/linux/kexec.h | 31 +++
kernel/kexec.c | 481 +++++++++++++++++++++++++++++++++++++
14 files changed, 1325 insertions(+)
create mode 100644 arch/x86/purgatory/Makefile
create mode 100644 arch/x86/purgatory/entry64.S
create mode 100644 arch/x86/purgatory/purgatory.c
create mode 100644 arch/x86/purgatory/setup-x86_32.S
create mode 100644 arch/x86/purgatory/setup-x86_64.S
create mode 100644 arch/x86/purgatory/sha256.c
create mode 100644 arch/x86/purgatory/sha256.h
create mode 100644 arch/x86/purgatory/stack.S

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index e5287d8..faaeee7 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -16,3 +16,4 @@ obj-$(CONFIG_IA32_EMULATION) += ia32/

obj-y += platform/
obj-y += net/
+obj-$(CONFIG_KEXEC) += purgatory/
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index aa5aeed..532cc0b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1605,6 +1605,7 @@ source kernel/Kconfig.hz
config KEXEC
bool "kexec system call"
select BUILD_BIN2C
+ select CRYPTO_SHA256
---help---
kexec is a system call that implements the ability to shutdown your
current kernel, and to start another kernel. It is like a reboot
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 13b22e0..fedcd16 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -160,6 +160,11 @@ archscripts: scripts_basic
archheaders:
$(Q)$(MAKE) $(build)=arch/x86/syscalls all

+archprepare:
+ifeq ($(CONFIG_KEXEC),y)
+ $(Q)$(MAKE) $(build)=arch/x86/purgatory arch/x86/purgatory/kexec-purgatory.c
+endif
+
###
# Kernel objects

@@ -223,6 +228,7 @@ archclean:
$(Q)rm -rf $(objtree)/arch/x86_64
$(Q)$(MAKE) $(clean)=$(boot)
$(Q)$(MAKE) $(clean)=arch/x86/tools
+ $(Q)$(MAKE) $(clean)=arch/x86/purgatory

PHONY += kvmconfig
kvmconfig:
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index c91d72a..8866c5e 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -331,3 +331,85 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
return kexec_file_type[idx].cleanup(image);
return 0;
}
+
+/* Apply purgatory relocations */
+int arch_kexec_apply_relocations_add(Elf64_Shdr *sechdrs,
+ unsigned int nr_sections, unsigned int relsec)
+{
+ unsigned int i;
+ Elf64_Rela *rel = (void *)sechdrs[relsec].sh_offset;
+ Elf64_Sym *sym;
+ void *location;
+ Elf64_Shdr *section, *symtab;
+ unsigned long address, sec_base, value;
+
+ /* Section to which relocations apply */
+ section = &sechdrs[sechdrs[relsec].sh_info];
+
+ /* Associated symbol table */
+ symtab = &sechdrs[sechdrs[relsec].sh_link];
+
+ for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
+
+ /*
+ * This is location (->sh_offset) to update. This is temporary
+ * buffer where section is currently loaded. This will finally
+ * be loaded to a different address later (pointed to
+ * by ->sh_addr. kexec takes care of moving it
+ * (kexec_load_segment()).
+ */
+ location = (void *)(section->sh_offset + rel[i].r_offset);
+
+ /* Final address of the location */
+ address = section->sh_addr + rel[i].r_offset;
+
+ sym = (Elf64_Sym *)symtab->sh_offset +
+ ELF64_R_SYM(rel[i].r_info);
+
+ if (sym->st_shndx == SHN_UNDEF || sym->st_shndx == SHN_COMMON)
+ return -ENOEXEC;
+
+ if (sym->st_shndx == SHN_ABS)
+ sec_base = 0;
+ else if (sym->st_shndx >= nr_sections)
+ return -ENOEXEC;
+ else
+ sec_base = sechdrs[sym->st_shndx].sh_addr;
+
+ value = sym->st_value;
+ value += sec_base;
+ value += rel[i].r_addend;
+
+ switch(ELF64_R_TYPE(rel[i].r_info)) {
+ case R_X86_64_NONE:
+ break;
+ case R_X86_64_64:
+ *(u64 *)location = value;
+ break;
+ case R_X86_64_32:
+ *(u32 *)location = value;
+ if (value != *(u32 *)location)
+ goto overflow;
+ break;
+ case R_X86_64_32S:
+ *(s32 *)location = value;
+ if ((s64)value != *(s32 *)location)
+ goto overflow;
+ break;
+ case R_X86_64_PC32:
+ value -= (u64)address;
+ *(u32 *)location = value;
+ break;
+ default:
+ pr_err("kexec: Unknown rela relocation: %llu\n",
+ ELF64_R_TYPE(rel[i].r_info));
+ return -ENOEXEC;
+ }
+ }
+ return 0;
+
+overflow:
+ pr_err("kexec: overflow in relocation type %d value 0x%lx\n",
+ (int)ELF64_R_TYPE(rel[i].r_info), value);
+ return -ENOEXEC;
+}
diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
new file mode 100644
index 0000000..c83c557
--- /dev/null
+++ b/arch/x86/purgatory/Makefile
@@ -0,0 +1,35 @@
+ifeq ($(CONFIG_X86_64),y)
+ purgatory-y := purgatory.o entry64.o stack.o setup-x86_64.o sha256.o
+else
+ purgatory-y := purgatory.o stack.o sha256.o setup-x86_32.o
+endif
+
+targets += $(purgatory-y)
+PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
+
+LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined -nostdlib -z nodefaultlib
+targets += purgatory.ro
+
# Default KBUILD_CFLAGS can have the -pg option set when FTRACE is enabled.
# That in turn leaves some undefined symbols like __fentry__ in purgatory,
# and it is not clear how to relocate those. So, like kexec-tools, use
# custom flags.
+
+ifeq ($(CONFIG_X86_64),y)
+KBUILD_CFLAGS := -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -mcmodel=large -Os -fno-builtin -ffreestanding -c -MD
+else
+KBUILD_CFLAGS := -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -Os -fno-builtin -ffreestanding -c -MD -m32
+endif
+
+$(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
+ $(call if_changed,ld)
+
+targets += kexec-purgatory.c
+
+quiet_cmd_bin2c = BIN2C $@
+ cmd_bin2c = cat $(obj)/purgatory.ro | $(srctree)/scripts/basic/bin2c kexec_purgatory > $(obj)/kexec-purgatory.c
+
+$(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
+ $(call if_changed,bin2c)
+
+
+obj-$(CONFIG_KEXEC) += kexec-purgatory.o
diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
new file mode 100644
index 0000000..e405c0f
--- /dev/null
+++ b/arch/x86/purgatory/entry64.S
@@ -0,0 +1,111 @@
+/*
+ * Copyright (C) 2003,2004 Eric Biederman ([email protected])
+ * Copyright (C) 2014 Red Hat Inc.
+
+ * Author(s): Vivek Goyal <[email protected]>
+ *
+ * This code has been taken from kexec-tools.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+ .text
+ .balign 16
+ .code64
+ .globl entry64, entry64_regs
+
+
+entry64:
+ /* Setup a gdt that should be preserved */
+ lgdt gdt(%rip)
+
+ /* load the data segments */
+ movl $0x18, %eax /* data segment */
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+ movl %eax, %fs
+ movl %eax, %gs
+
+ /* Setup new stack */
+ leaq stack_init(%rip), %rsp
+ pushq $0x10 /* CS */
+ leaq new_cs_exit(%rip), %rax
+ pushq %rax
+ lretq
+new_cs_exit:
+
+ /* Load the registers */
+ movq rax(%rip), %rax
+ movq rbx(%rip), %rbx
+ movq rcx(%rip), %rcx
+ movq rdx(%rip), %rdx
+ movq rsi(%rip), %rsi
+ movq rdi(%rip), %rdi
+ movq rsp(%rip), %rsp
+ movq rbp(%rip), %rbp
+ movq r8(%rip), %r8
+ movq r9(%rip), %r9
+ movq r10(%rip), %r10
+ movq r11(%rip), %r11
+ movq r12(%rip), %r12
+ movq r13(%rip), %r13
+ movq r14(%rip), %r14
+ movq r15(%rip), %r15
+
+ /* Jump to the new code... */
+ jmpq *rip(%rip)
+
+ .section ".rodata"
+ .balign 4
+entry64_regs:
+rax: .quad 0x00000000
+rbx: .quad 0x00000000
+rcx: .quad 0x00000000
+rdx: .quad 0x00000000
+rsi: .quad 0x00000000
+rdi: .quad 0x00000000
+rsp: .quad 0x00000000
+rbp: .quad 0x00000000
+r8: .quad 0x00000000
+r9: .quad 0x00000000
+r10: .quad 0x00000000
+r11: .quad 0x00000000
+r12: .quad 0x00000000
+r13: .quad 0x00000000
+r14: .quad 0x00000000
+r15: .quad 0x00000000
+rip: .quad 0x00000000
+ .size entry64_regs, . - entry64_regs
+
+ /* GDT */
+ .section ".rodata"
+ .balign 16
+gdt:
+ /* 0x00 unusable segment
+ * 0x08 unused
+ * so use them as gdt ptr
+ */
+ .word gdt_end - gdt - 1
+ .quad gdt
+ .word 0, 0, 0
+
+ /* 0x10 4GB flat code segment */
+ .word 0xFFFF, 0x0000, 0x9A00, 0x00AF
+
+ /* 0x18 4GB flat data segment */
+ .word 0xFFFF, 0x0000, 0x9200, 0x00CF
+gdt_end:
+stack: .quad 0, 0
+stack_init:
diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
new file mode 100644
index 0000000..375cfb7
--- /dev/null
+++ b/arch/x86/purgatory/purgatory.c
@@ -0,0 +1,103 @@
+/*
+ * purgatory: Runs between two kernels
+ *
+ * Copyright (C) 2013 Red Hat Inc.
+ *
+ * Author:
+ *
+ * Vivek Goyal <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include "sha256.h"
+
+struct sha_region {
+ unsigned long start;
+ unsigned long len;
+};
+
+unsigned long backup_dest = 0;
+unsigned long backup_src = 0;
+unsigned long backup_sz = 0;
+
+u8 sha256_digest[SHA256_DIGEST_SIZE] = { 0 };
+
+struct sha_region sha_regions[16] = {};
+
+/**
+ * memcpy - Copy one area of memory to another
+ * @dest: Where to copy to
+ * @src: Where to copy from
+ * @count: The size of the area.
+ */
+static void *memcpy(void *dest, const void *src, unsigned long count)
+{
+ char *tmp = dest;
+ const char *s = src;
+
+ while (count--)
+ *tmp++ = *s++;
+ return dest;
+}
+
+static int memcmp(const void *cs, const void *ct, size_t count)
+{
+ const unsigned char *su1, *su2;
+ int res = 0;
+
+ for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
+ if ((res = *su1 - *su2) != 0)
+ break;
+ return res;
+}
+
+static int copy_backup_region(void)
+{
+ if (backup_dest)
+ memcpy((void *)backup_dest, (void *)backup_src, backup_sz);
+
+ return 0;
+}
+
+int verify_sha256_digest(void)
+{
+ struct sha_region *ptr, *end;
+ u8 digest[SHA256_DIGEST_SIZE];
+ struct sha256_state sctx;
+
+ sha256_init(&sctx);
+ end = &sha_regions[sizeof(sha_regions)/sizeof(sha_regions[0])];
+ for (ptr = sha_regions; ptr < end; ptr++)
+ sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len);
+
+ sha256_final(&sctx, digest);
+
+ if (memcmp(digest, sha256_digest, sizeof(digest)) != 0)
+ return 1;
+
+ return 0;
+}
+
+void purgatory(void)
+{
+ int ret;
+
+ ret = verify_sha256_digest();
+ if (ret) {
+ /* loop forever */
+ for(;;);
+ }
+ copy_backup_region();
+}
diff --git a/arch/x86/purgatory/setup-x86_32.S b/arch/x86/purgatory/setup-x86_32.S
new file mode 100644
index 0000000..a9d5aa5
--- /dev/null
+++ b/arch/x86/purgatory/setup-x86_32.S
@@ -0,0 +1,29 @@
+/*
+ * purgatory: setup code
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * This code has been taken from kexec-tools.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+ .text
+ .globl purgatory_start
+ .balign 16
+purgatory_start:
+ .code32
+
+ /* This is just a stub. Write code when 32bit support comes along */
+ call purgatory
diff --git a/arch/x86/purgatory/setup-x86_64.S b/arch/x86/purgatory/setup-x86_64.S
new file mode 100644
index 0000000..d23bc54
--- /dev/null
+++ b/arch/x86/purgatory/setup-x86_64.S
@@ -0,0 +1,68 @@
+/*
+ * purgatory: setup code
+ *
+ * Copyright (C) 2003,2004 Eric Biederman ([email protected])
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * This code has been taken from kexec-tools.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+ .text
+ .globl purgatory_start
+ .balign 16
+purgatory_start:
+ .code64
+
+ /* Load a gdt so I know what the segment registers are */
+ lgdt gdt(%rip)
+
+ /* load the data segments */
+ movl $0x18, %eax /* data segment */
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+ movl %eax, %fs
+ movl %eax, %gs
+
+ /* Setup a stack */
+ leaq lstack_end(%rip), %rsp
+
+ /* Call the C code */
+ call purgatory
+ jmp entry64
+
+ .section ".rodata"
+ .balign 16
+gdt: /* 0x00 unusable segment
+ * 0x08 unused
+ * so use them as the gdt ptr
+ */
+ .word gdt_end - gdt - 1
+ .quad gdt
+ .word 0, 0, 0
+
+ /* 0x10 4GB flat code segment */
+ .word 0xFFFF, 0x0000, 0x9A00, 0x00AF
+
+ /* 0x18 4GB flat data segment */
+ .word 0xFFFF, 0x0000, 0x9200, 0x00CF
+gdt_end:
+
+ .bss
+ .balign 4096
+lstack:
+ .skip 4096
+lstack_end:
diff --git a/arch/x86/purgatory/sha256.c b/arch/x86/purgatory/sha256.c
new file mode 100644
index 0000000..990d17f
--- /dev/null
+++ b/arch/x86/purgatory/sha256.c
@@ -0,0 +1,315 @@
+/*
+ * SHA-256, as specified in
+ * http://csrc.nist.gov/groups/STM/cavp/documents/shs/sha256-384-512.pdf
+ *
+ * SHA-256 code by Jean-Luc Cooke <[email protected]>.
+ *
+ * Copyright (c) Jean-Luc Cooke <[email protected]>
+ * Copyright (c) Andrew McDonald <[email protected]>
+ * Copyright (c) 2002 James Morris <[email protected]>
+ * Copyright (c) 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/bitops.h>
+#include <asm/byteorder.h>
+#include "sha256.h"
+
+
+/**
+ * memset - Fill a region of memory with the given value
+ * @s: Pointer to the start of the area.
+ * @c: The byte to fill the area with
+ * @count: The size of the area.
+ */
+static void *memset(void *s, int c, size_t count)
+{
+ char *xs = s;
+
+ while (count--)
+ *xs++ = c;
+ return s;
+}
+
+/**
+ * memcpy - Copy one area of memory to another
+ * @dest: Where to copy to
+ * @src: Where to copy from
+ * @count: The size of the area.
+ */
+static void *memcpy(void *dest, const void *src, size_t count)
+{
+ char *tmp = dest;
+ const char *s = src;
+
+ while (count--)
+ *tmp++ = *s++;
+ return dest;
+}
+
+static inline u32 Ch(u32 x, u32 y, u32 z)
+{
+ return z ^ (x & (y ^ z));
+}
+
+static inline u32 Maj(u32 x, u32 y, u32 z)
+{
+ return (x & y) | (z & (x | y));
+}
+
+#define e0(x) (ror32(x, 2) ^ ror32(x,13) ^ ror32(x,22))
+#define e1(x) (ror32(x, 6) ^ ror32(x,11) ^ ror32(x,25))
+#define s0(x) (ror32(x, 7) ^ ror32(x,18) ^ (x >> 3))
+#define s1(x) (ror32(x,17) ^ ror32(x,19) ^ (x >> 10))
+
+static inline void LOAD_OP(int I, u32 *W, const u8 *input)
+{
+ W[I] = __be32_to_cpu(((__be32 *)(input))[I]);
+}
+
+static inline void BLEND_OP(int I, u32 *W)
+{
+ W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+}
+
+static void sha256_transform(u32 *state, const u8 *input)
+{
+ u32 a, b, c, d, e, f, g, h, t1, t2;
+ u32 W[64];
+ int i;
+
+ /* load the input */
+ for (i = 0; i < 16; i++)
+ LOAD_OP(i, W, input);
+
+ /* now blend */
+ for (i = 16; i < 64; i++)
+ BLEND_OP(i, W);
+
+ /* load the state into our registers */
+ a=state[0]; b=state[1]; c=state[2]; d=state[3];
+ e=state[4]; f=state[5]; g=state[6]; h=state[7];
+
+ /* now iterate */
+ t1 = h + e1(e) + Ch(e,f,g) + 0x428a2f98 + W[ 0];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + 0x71374491 + W[ 1];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + 0xb5c0fbcf + W[ 2];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + 0xe9b5dba5 + W[ 3];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + 0x3956c25b + W[ 4];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + 0x59f111f1 + W[ 5];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + 0x923f82a4 + W[ 6];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + 0xab1c5ed5 + W[ 7];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+
+ t1 = h + e1(e) + Ch(e,f,g) + 0xd807aa98 + W[ 8];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + 0x12835b01 + W[ 9];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + 0x243185be + W[10];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + 0x550c7dc3 + W[11];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + 0x72be5d74 + W[12];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + 0x80deb1fe + W[13];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + 0x9bdc06a7 + W[14];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + 0xc19bf174 + W[15];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+
+ t1 = h + e1(e) + Ch(e,f,g) + 0xe49b69c1 + W[16];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + 0xefbe4786 + W[17];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + 0x0fc19dc6 + W[18];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + 0x240ca1cc + W[19];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + 0x2de92c6f + W[20];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + 0x4a7484aa + W[21];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + 0x5cb0a9dc + W[22];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + 0x76f988da + W[23];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+
+ t1 = h + e1(e) + Ch(e,f,g) + 0x983e5152 + W[24];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + 0xa831c66d + W[25];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + 0xb00327c8 + W[26];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + 0xbf597fc7 + W[27];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + 0xc6e00bf3 + W[28];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + 0xd5a79147 + W[29];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + 0x06ca6351 + W[30];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + 0x14292967 + W[31];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+
+ t1 = h + e1(e) + Ch(e,f,g) + 0x27b70a85 + W[32];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + 0x2e1b2138 + W[33];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + 0x4d2c6dfc + W[34];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + 0x53380d13 + W[35];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + 0x650a7354 + W[36];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + 0x766a0abb + W[37];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + 0x81c2c92e + W[38];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + 0x92722c85 + W[39];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+
+ t1 = h + e1(e) + Ch(e,f,g) + 0xa2bfe8a1 + W[40];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + 0xa81a664b + W[41];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + 0xc24b8b70 + W[42];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + 0xc76c51a3 + W[43];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + 0xd192e819 + W[44];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + 0xd6990624 + W[45];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + 0xf40e3585 + W[46];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + 0x106aa070 + W[47];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+
+ t1 = h + e1(e) + Ch(e,f,g) + 0x19a4c116 + W[48];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + 0x1e376c08 + W[49];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + 0x2748774c + W[50];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + 0x34b0bcb5 + W[51];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + 0x391c0cb3 + W[52];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + 0x4ed8aa4a + W[53];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + 0x5b9cca4f + W[54];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + 0x682e6ff3 + W[55];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+
+ t1 = h + e1(e) + Ch(e,f,g) + 0x748f82ee + W[56];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + 0x78a5636f + W[57];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + 0x84c87814 + W[58];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + 0x8cc70208 + W[59];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + 0x90befffa + W[60];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + 0xa4506ceb + W[61];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + 0xbef9a3f7 + W[62];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + 0xc67178f2 + W[63];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+
+ state[0] += a; state[1] += b; state[2] += c; state[3] += d;
+ state[4] += e; state[5] += f; state[6] += g; state[7] += h;
+
+ /* clear any sensitive info... */
+ a = b = c = d = e = f = g = h = t1 = t2 = 0;
+ memset(W, 0, 64 * sizeof(u32));
+}
+
+int sha256_init(struct sha256_state *sctx)
+{
+ sctx->state[0] = SHA256_H0;
+ sctx->state[1] = SHA256_H1;
+ sctx->state[2] = SHA256_H2;
+ sctx->state[3] = SHA256_H3;
+ sctx->state[4] = SHA256_H4;
+ sctx->state[5] = SHA256_H5;
+ sctx->state[6] = SHA256_H6;
+ sctx->state[7] = SHA256_H7;
+ sctx->count = 0;
+
+ return 0;
+}
+
+int sha256_update(struct sha256_state *sctx, const u8 *data,
+ unsigned int len)
+{
+ unsigned int partial, done;
+ const u8 *src;
+
+ partial = sctx->count & 0x3f;
+ sctx->count += len;
+ done = 0;
+ src = data;
+
+ if ((partial + len) > 63) {
+ if (partial) {
+ done = -partial;
+ memcpy(sctx->buf + partial, data, done + 64);
+ src = sctx->buf;
+ }
+
+ do {
+ sha256_transform(sctx->state, src);
+ done += 64;
+ src = data + done;
+ } while (done + 63 < len);
+
+ partial = 0;
+ }
+ memcpy(sctx->buf + partial, src, len - done);
+
+ return 0;
+}
+
+int sha256_final(struct sha256_state *sctx, u8 *out)
+{
+ __be32 *dst = (__be32 *)out;
+ __be64 bits;
+ unsigned int index, pad_len;
+ int i;
+ static const u8 padding[64] = { 0x80, };
+
+ /* Save number of bits */
+ bits = cpu_to_be64(sctx->count << 3);
+
+ /* Pad out to 56 mod 64. */
+ index = sctx->count & 0x3f;
+ pad_len = (index < 56) ? (56 - index) : ((64+56) - index);
+ sha256_update(sctx, padding, pad_len);
+
+ /* Append length (before padding) */
+ sha256_update(sctx, (const u8 *)&bits, sizeof(bits));
+
+ /* Store state in digest */
+ for (i = 0; i < 8; i++)
+ dst[i] = cpu_to_be32(sctx->state[i]);
+
+ /* Zeroize sensitive information. */
+ memset(sctx, 0, sizeof(*sctx));
+
+ return 0;
+}
diff --git a/arch/x86/purgatory/sha256.h b/arch/x86/purgatory/sha256.h
new file mode 100644
index 0000000..74df8f4
--- /dev/null
+++ b/arch/x86/purgatory/sha256.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * Author: Vivek Goyal <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef SHA256_H
+#define SHA256_H
+
+
+#include <linux/types.h>
+#include <crypto/sha.h>
+
+extern int sha256_init(struct sha256_state *sctx);
+extern int sha256_update(struct sha256_state *sctx, const u8 *input,
+ unsigned int length);
+extern int sha256_final(struct sha256_state *sctx, u8 *hash);
+
+#endif /* SHA256_H */
diff --git a/arch/x86/purgatory/stack.S b/arch/x86/purgatory/stack.S
new file mode 100644
index 0000000..aff1fa9
--- /dev/null
+++ b/arch/x86/purgatory/stack.S
@@ -0,0 +1,29 @@
+/*
+ * purgatory: stack
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+ /* A stack for the loaded kernel.
+ * Separate and in the data section so it can be prepopulated.
+ */
+ .data
+ .balign 4096
+ .globl stack, stack_end
+
+stack:
+ .skip 4096
+stack_end:
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 51b56cd..d391ed7 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -10,6 +10,7 @@
#include <linux/ioport.h>
#include <linux/elfcore.h>
#include <linux/elf.h>
+#include <linux/module.h>
#include <asm/kexec.h>

/* Verify architecture specific macros are defined */
@@ -95,6 +96,27 @@ struct compat_kexec_segment {
};
#endif

+struct kexec_sha_region {
+ unsigned long start;
+ unsigned long len;
+};
+
+struct purgatory_info {
+ /* Pointer to elf header of read only purgatory */
+ Elf_Ehdr *ehdr;
+
+ /* Pointer to purgatory sechdrs which are modifiable */
+ Elf_Shdr *sechdrs;
+ /*
+ * Temporary buffer location where purgatory is loaded and relocated
+ * This memory can be freed post image load
+ */
+ void *purgatory_buf;
+
+ /* Address where purgatory is finally loaded and is executed from */
+ unsigned long purgatory_load_addr;
+};
+
struct kimage {
kimage_entry_t head;
kimage_entry_t *entry;
@@ -143,6 +165,9 @@ struct kimage {

/* Image loader handling the kernel can store a pointer here */
void * image_loader_data;
+
+ /* Information for loading purgatory */
+ struct purgatory_info purgatory_info;
};

/*
@@ -197,6 +222,12 @@ extern asmlinkage long compat_sys_kexec_load(unsigned long entry,
extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
extern void kimage_set_start_addr(struct kimage *image, unsigned long start);
+extern int kexec_load_purgatory(struct kimage *image, unsigned long min,
+ unsigned long max, int top_down, unsigned long *load_addr);
+extern int kexec_purgatory_get_set_symbol(struct kimage *image,
+ const char *name, void *buf, unsigned int size, bool get_value);
+extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
+ const char *name);

extern void crash_kexec(struct pt_regs *);
int kexec_should_crash(struct task_struct *);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index b28578a..20169a4 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -38,6 +38,9 @@
#include <asm/io.h>
#include <asm/sections.h>

+#include <crypto/hash.h>
+#include <crypto/sha.h>
+
/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t __percpu *crash_notes;

@@ -50,6 +53,15 @@ size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
/* Flag to indicate we are going to kexec a new kernel */
bool kexec_in_progress = false;

+/*
+ * Declare these symbols weak so that if architecture provides a purgatory,
+ * these will be overridden.
+ */
+char __weak kexec_purgatory[0];
+size_t __weak kexec_purgatory_size = 0;
+
+static int kexec_calculate_store_digests(struct kimage *image);
+
/* Location of the reserved area for the crash kernel */
struct resource crashk_res = {
.name = "Crash kernel",
@@ -341,6 +353,15 @@ arch_kimage_file_post_load_cleanup(struct kimage *image)
return;
}

+/* Apply relocations for rela section */
+int __attribute__ ((weak))
+arch_kexec_apply_relocations_add(Elf_Shdr *sechdrs, unsigned int nr_sections,
+ unsigned int relsec)
+{
+ printk(KERN_ERR "kexec: RELA relocation unsupported\n");
+ return -ENOEXEC;
+}
+
/*
* Free up temporary buffers allocated which are not needed after image has
* been loaded.
@@ -360,6 +381,12 @@ static void kimage_file_post_load_cleanup(struct kimage *image)
vfree(image->cmdline_buf);
image->cmdline_buf = NULL;

+ vfree(image->purgatory_info.purgatory_buf);
+ image->purgatory_info.purgatory_buf = NULL;
+
+ vfree(image->purgatory_info.sechdrs);
+ image->purgatory_info.sechdrs = NULL;
+
+ /* See if architecture has anything to cleanup post load */
arch_kimage_file_post_load_cleanup(image);
}
@@ -1374,6 +1401,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, const char __us
if (ret)
goto out;

+ ret = kexec_calculate_store_digests(image);
+ if (ret)
+ goto out;
+
for (i = 0; i < image->nr_segments; i++) {
struct kexec_segment *ksegment;

@@ -2137,6 +2168,456 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
return 0;
}

+/* Calculate and store the digest of segments */
+static int kexec_calculate_store_digests(struct kimage *image)
+{
+ struct crypto_shash *tfm;
+ struct shash_desc *desc;
+ int ret = 0, i, j, zero_buf_sz = 256, sha_region_sz;
+ size_t desc_size, nullsz;
+ char *digest = NULL;
+ void *zero_buf;
+ struct kexec_sha_region *sha_regions;
+
+ tfm = crypto_alloc_shash("sha256", 0, 0);
+ if (IS_ERR(tfm)) {
+ ret = PTR_ERR(tfm);
+ goto out;
+ }
+
+ desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
+ desc = kzalloc(desc_size, GFP_KERNEL);
+ if (!desc) {
+ ret = -ENOMEM;
+ goto out_free_tfm;
+ }
+
+ zero_buf = kzalloc(zero_buf_sz, GFP_KERNEL);
+ if (!zero_buf) {
+ ret = -ENOMEM;
+ goto out_free_desc;
+ }
+
+ sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
+ sha_regions = vzalloc(sha_region_sz);
+ if (!sha_regions) {
+ ret = -ENOMEM;
+ goto out_free_zero_buf;
+ }
+
+ desc->tfm = tfm;
+ desc->flags = 0;
+
+ ret = crypto_shash_init(desc);
+ if (ret < 0)
+ goto out_free_sha_regions;
+
+ digest = kzalloc(SHA256_DIGEST_SIZE, GFP_KERNEL);
+ if (!digest) {
+ ret = -ENOMEM;
+ goto out_free_sha_regions;
+ }
+
+ /* Traverse through all segments */
+ for (j = i = 0; i < image->nr_segments; i++) {
+ struct kexec_segment *ksegment;
+ ksegment = &image->segment[i];
+
+ /*
+ * Skip purgatory as it will be modified once we put digest
+ * info in purgatory
+ */
+ if (ksegment->kbuf == image->purgatory_info.purgatory_buf)
+ continue;
+
+ ret = crypto_shash_update(desc, ksegment->kbuf,
+ ksegment->bufsz);
+ if (ret)
+ break;
+
+ nullsz = ksegment->memsz - ksegment->bufsz;
+ while (nullsz) {
+ unsigned long bytes = nullsz;
+ if (bytes > zero_buf_sz)
+ bytes = zero_buf_sz;
+ ret = crypto_shash_update(desc, zero_buf, bytes);
+ if (ret)
+ break;
+ nullsz -= bytes;
+ }
+
+ if (ret)
+ break;
+
+ sha_regions[j].start = ksegment->mem;
+ sha_regions[j].len = ksegment->memsz;
+ j++;
+ }
+
+ if (!ret) {
+ ret = crypto_shash_final(desc, digest);
+ if (ret)
+ goto out_free_sha_regions;
+ ret = kexec_purgatory_get_set_symbol(image, "sha_regions",
+ sha_regions, sha_region_sz, 0);
+ if (ret)
+ goto out_free_sha_regions;
+
+ ret = kexec_purgatory_get_set_symbol(image, "sha256_digest",
+ digest, SHA256_DIGEST_SIZE, 0);
+ if (ret)
+ goto out_free_sha_regions;
+ }
+
+out_free_sha_regions:
+ vfree(sha_regions);
+out_free_zero_buf:
+ kfree(zero_buf);
+out_free_desc:
+ kfree(desc);
+out_free_tfm:
+ crypto_free_shash(tfm);
+out:
+ kfree(digest);
+ return ret;
+}
+
+/* Actually load and relocate purgatory. A lot of code taken from kexec-tools */
+static int elf_rel_load_relocate(struct kimage *image, unsigned long min,
+ unsigned long max, int top_down)
+{
+ struct purgatory_info *pi = &image->purgatory_info;
+ unsigned long align, buf_align, bss_align, buf_sz, bss_sz, bss_pad;
+ unsigned long memsz, entry, load_addr, data_addr, bss_addr, off;
+ unsigned char *buf_addr, *src;
+ int i, ret = 0, entry_sidx = -1;
+ Elf_Shdr *sechdrs, *sechdrs_c;
+
+ /*
+ * sechdrs_c points to section headers in purgatory and are read
+ * only. No modifications allowed.
+ */
+ sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
+
+ /*
+ * We can not modify sechdrs_c[] and its fields. It is read only.
+ * Copy it over to a local copy where one can store some temporary
+ * data and free it at the end. We need to modify ->sh_addr and
+ * ->sh_offset fields to keep track permanent and temporary locations
+ * of sections.
+ */
+ sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+ if (!sechdrs)
+ return -ENOMEM;
+
+ memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+
+ /*
+ * We have multiple copies of sections. The first copy is the one
+ * embedded in the kernel in a read only section. Some of these
+ * sections will be copied to a temporary buffer and relocated. And
+ * these sections will finally be copied to their final destination
+ * at segment load time.
+ *
+ * Use ->sh_offset to reflect section address in memory. It will
+ * point to original read only copy if section is not allocatable.
+ * Otherwise it will point to temporary copy which will be relocated.
+ *
+ * Use ->sh_addr to contain final address of the section where it
+ * will go during execution time.
+ */
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ if (sechdrs[i].sh_type == SHT_NOBITS)
+ continue;
+
+ sechdrs[i].sh_offset = (unsigned long)pi->ehdr +
+ sechdrs[i].sh_offset;
+ }
+
+ entry = pi->ehdr->e_entry;
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+ continue;
+
+ if (!(sechdrs[i].sh_flags & SHF_EXECINSTR))
+ continue;
+
+ /* Make entry section relative */
+ if (sechdrs[i].sh_addr <= pi->ehdr->e_entry &&
+ ((sechdrs[i].sh_addr + sechdrs[i].sh_size) >
+ pi->ehdr->e_entry)) {
+ entry_sidx = i;
+ entry -= sechdrs[i].sh_addr;
+ break;
+ }
+ }
+
+ /* Find the RAM size requirements of relocatable object */
+ buf_align = 1;
+ bss_align = 1;
+ buf_sz = 0;
+ bss_sz = 0;
+
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+ continue;
+
+ align = sechdrs[i].sh_addralign;
+ if (sechdrs[i].sh_type != SHT_NOBITS) {
+ if (buf_align < align)
+ buf_align = align;
+ buf_sz = ALIGN(buf_sz, align);
+ buf_sz += sechdrs[i].sh_size;
+ } else {
+ if (bss_align < align)
+ bss_align = align;
+ bss_sz = ALIGN(bss_sz, align);
+ bss_sz += sechdrs[i].sh_size;
+ }
+ }
+
+ if (buf_align < bss_align)
+ buf_align = bss_align;
+ bss_pad = 0;
+ if (buf_sz & (bss_align - 1))
+ bss_pad = bss_align - (buf_sz & (bss_align - 1));
+
+ memsz = buf_sz + bss_pad + bss_sz;
+
+ /* Allocate buffer for purgatory */
+ pi->purgatory_buf = vzalloc(buf_sz);
+ if (!pi->purgatory_buf) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* Add buffer to segment list */
+ ret = kexec_add_buffer(image, pi->purgatory_buf, buf_sz, memsz,
+ buf_align, min, max, top_down,
+ &pi->purgatory_load_addr);
+ if (ret)
+ goto out;
+
+ /* Load SHF_ALLOC sections */
+ buf_addr = pi->purgatory_buf;
+ load_addr = pi->purgatory_load_addr;
+ data_addr = load_addr;
+ bss_addr = load_addr + buf_sz + bss_pad;
+
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+ continue;
+
+ align = sechdrs[i].sh_addralign;
+ if (sechdrs[i].sh_type != SHT_NOBITS) {
+ data_addr = ALIGN(data_addr, align);
+ off = data_addr - load_addr;
+ /* We have already modified ->sh_offset to keep addr */
+ src = (char *) sechdrs[i].sh_offset;
+ memcpy(buf_addr + off, src, sechdrs[i].sh_size);
+
+ /* Store load address and source address of section */
+ sechdrs[i].sh_addr = data_addr;
+
+ /*
+ * This section got copied to temporary buffer. Update
+ * ->sh_offset accordingly.
+ */
+ sechdrs[i].sh_offset = (unsigned long)(buf_addr + off);
+
+ /* Advance to the next address */
+ data_addr += sechdrs[i].sh_size;
+ } else {
+ bss_addr = ALIGN(bss_addr, align);
+ sechdrs[i].sh_addr = bss_addr;
+ bss_addr += sechdrs[i].sh_size;
+ }
+ }
+
+ /* update entry based on entry section position */
+ if (entry_sidx >= 0)
+ entry += sechdrs[entry_sidx].sh_addr;
+
+ /* Set the entry point of purgatory */
+ kimage_set_start_addr(image, entry);
+
+ /* Apply relocations */
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ Elf_Shdr *section, *symtab;
+
+ if (sechdrs[i].sh_type != SHT_RELA &&
+ sechdrs[i].sh_type != SHT_REL)
+ continue;
+
+ if (sechdrs[i].sh_info >= pi->ehdr->e_shnum ||
+ sechdrs[i].sh_link >= pi->ehdr->e_shnum) {
+ ret = -ENOEXEC;
+ goto out;
+ }
+
+ section = &sechdrs[sechdrs[i].sh_info];
+ symtab = &sechdrs[sechdrs[i].sh_link];
+
+ if (!(section->sh_flags & SHF_ALLOC))
+ continue;
+
+ if (symtab->sh_link >= pi->ehdr->e_shnum)
+ /* Invalid section number? */
+ continue;
+
+ ret = -EOPNOTSUPP;
+ if (sechdrs[i].sh_type == SHT_RELA)
+ ret = arch_kexec_apply_relocations_add(sechdrs, pi->ehdr->e_shnum, i);
+ if (ret)
+ goto out;
+ }
+
+ pi->sechdrs = sechdrs;
+ return ret;
+out:
+ vfree(sechdrs);
+ vfree(pi->purgatory_buf);
+ return ret;
+}
+
+/* Load relocatable purgatory object and relocate it appropriately */
+int kexec_load_purgatory(struct kimage *image, unsigned long min,
+ unsigned long max, int top_down, unsigned long *load_addr)
+{
+ struct purgatory_info *pi = &image->purgatory_info;
+ int ret;
+
+ if (kexec_purgatory_size <= 0)
+ return -EINVAL;
+
+ if (kexec_purgatory_size < sizeof(Elf_Ehdr))
+ return -ENOEXEC;
+
+ pi->ehdr = (Elf_Ehdr *)kexec_purgatory;
+
+ if (memcmp(pi->ehdr->e_ident, ELFMAG, SELFMAG) != 0
+ || pi->ehdr->e_type != ET_REL
+ || !elf_check_arch(pi->ehdr)
+ || pi->ehdr->e_shentsize != sizeof(Elf_Shdr))
+ return -ENOEXEC;
+
+ if (pi->ehdr->e_shoff >= kexec_purgatory_size
+ || (pi->ehdr->e_shnum * sizeof(Elf_Shdr) >
+ kexec_purgatory_size - pi->ehdr->e_shoff))
+ return -ENOEXEC;
+
+ ret = elf_rel_load_relocate(image, min, max, top_down);
+ if (ret)
+ return ret;
+
+ *load_addr = image->purgatory_info.purgatory_load_addr;
+ return 0;
+}
+
+static Elf_Sym *kexec_purgatory_find_symbol(struct purgatory_info *pi,
+ const char *name)
+{
+ Elf_Sym *syms;
+ Elf_Shdr *sechdrs;
+ Elf_Ehdr *ehdr;
+ int i, k;
+ const char *strtab;
+
+ if (!pi->sechdrs || !pi->ehdr)
+ return NULL;
+
+ sechdrs = pi->sechdrs;
+ ehdr = pi->ehdr;
+
+ for (i = 0; i < ehdr->e_shnum; i++) {
+ if (sechdrs[i].sh_type != SHT_SYMTAB)
+ continue;
+
+ if (sechdrs[i].sh_link >= ehdr->e_shnum)
+ /* Invalid strtab section number */
+ continue;
+ strtab = (char *)sechdrs[sechdrs[i].sh_link].sh_offset;
+ syms = (Elf_Sym*)sechdrs[i].sh_offset;
+
+ /* Go through symbols for a match */
+ for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) {
+ if (ELF_ST_BIND(syms[k].st_info) != STB_GLOBAL)
+ continue;
+
+ if (strcmp(strtab + syms[k].st_name, name) != 0)
+ continue;
+
+ if (syms[k].st_shndx == SHN_UNDEF ||
+ syms[k].st_shndx >= ehdr->e_shnum) {
+ pr_debug("Symbol: %s has bad section index %d\n",
+ name, syms[k].st_shndx);
+ return NULL;
+ }
+
+ /* Found the symbol we are looking for */
+ return &syms[k];
+ }
+ }
+
+ return NULL;
+}
+
+void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name)
+{
+ struct purgatory_info *pi = &image->purgatory_info;
+ Elf_Sym *sym;
+ Elf_Shdr *sechdr;
+
+ sym = kexec_purgatory_find_symbol(pi, name);
+ if (!sym)
+ return ERR_PTR(-EINVAL);
+
+ sechdr = &pi->sechdrs[sym->st_shndx];
+
+ /*
+ * Returns the address where symbol will finally be loaded after
+ * kexec_load_segment()
+ */
+ return (void *)(sechdr->sh_addr + sym->st_value);
+}
+
+/*
+ * Get or set value of a symbol. If "get_value" is true, symbol value is
+ * returned in buf otherwise symbol value is set based on value in buf.
+ */
+int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
+ void *buf, unsigned int size, bool get_value)
+{
+ Elf_Sym *sym;
+ Elf_Shdr *sechdrs;
+ struct purgatory_info *pi = &image->purgatory_info;
+ char *sym_buf;
+
+ sym = kexec_purgatory_find_symbol(pi, name);
+ if (!sym)
+ return -EINVAL;
+
+ if (sym->st_size != size) {
+ pr_debug("Symbol: %s size is not right\n", name);
+ return -EINVAL;
+ }
+
+ sechdrs = pi->sechdrs;
+
+ if (sechdrs[sym->st_shndx].sh_type == SHT_NOBITS) {
+ pr_debug("Symbol: %s is in a bss section. Cannot get/set\n",
+ name);
+ return -EINVAL;
+ }
+
+ sym_buf = (char *)sechdrs[sym->st_shndx].sh_offset +
+ sym->st_value;
+
+ if (get_value)
+ memcpy((void*)buf, sym_buf, size);
+ else
+ memcpy((void*)sym_buf, buf, size);
+
+ return 0;
+}

/*
* Move into place and start executing a preloaded standalone
--
1.8.4.2

2014-01-27 18:59:45

by Vivek Goyal


Subject: [PATCH 11/11] kexec: Support for Kexec on panic using new system call

This patch adds support for loading a kexec on panic (kdump) kernel using
the new system call. Right now this primarily works with the bzImage loader
only. But changes to the ELF loader should be minimal as all the core
infrastructure is there.

The only thing preventing the ELF loader from loading into crash reserved
memory is that the kernel vmlinux is of type ET_EXEC and expects to be
loaded at the address it was compiled for. At that location the current
kernel is already running. One first needs to make vmlinux fully
relocatable and export it as type ET_DYN, and then modify this ELF loader
to support images of type ET_DYN.

I am leaving it as a future TODO item.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/include/asm/crash.h | 9 +
arch/x86/include/asm/kexec.h | 25 +-
arch/x86/kernel/crash.c | 574 +++++++++++++++++++++++++++++++++++++
arch/x86/kernel/kexec-bzimage.c | 27 +-
arch/x86/kernel/kexec-elf.c | 4 +-
arch/x86/kernel/machine_kexec.c | 21 +-
arch/x86/kernel/machine_kexec_64.c | 38 +++
kernel/kexec.c | 83 +++++-
8 files changed, 763 insertions(+), 18 deletions(-)
create mode 100644 arch/x86/include/asm/crash.h

diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
new file mode 100644
index 0000000..2dd2eb8
--- /dev/null
+++ b/arch/x86/include/asm/crash.h
@@ -0,0 +1,9 @@
+#ifndef _ASM_X86_CRASH_H
+#define _ASM_X86_CRASH_H
+
+int load_crashdump_segments(struct kimage *image);
+int crash_copy_backup_region(struct kimage *image);
+int crash_setup_memmap_entries(struct kimage *image,
+ struct boot_params *params);
+
+#endif /* _ASM_X86_CRASH_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 9bd6fec..a330d85 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -25,6 +25,8 @@
#include <asm/ptrace.h>
#include <asm/bootparam.h>

+struct kimage;
+
/*
* KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
* I.e. Maximum page that is mapped directly into kernel memory,
@@ -62,6 +64,10 @@
# define KEXEC_ARCH KEXEC_ARCH_X86_64
#endif

+/* Memory to backup during crash kdump */
+#define KEXEC_BACKUP_SRC_START (0UL)
+#define KEXEC_BACKUP_SRC_END (655360UL) /* 640K */
+
/*
* CPU does not save ss and sp on stack if execution is already
* running in kernel mode at the time of NMI occurrence. This code
@@ -161,8 +167,21 @@ struct kimage_arch {
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ /* Details of backup region */
+ unsigned long backup_src_start;
+ unsigned long backup_src_sz;
+
+ /* Physical address of backup segment */
+ unsigned long backup_load_addr;
+
+ /* Core ELF header buffer */
+ unsigned long elf_headers;
+ unsigned long elf_headers_sz;
+ unsigned long elf_load_addr;
};
+#endif /* CONFIG_X86_32 */

+#ifdef CONFIG_X86_64
struct kexec_entry64_regs {
uint64_t rax;
uint64_t rbx;
@@ -189,11 +208,13 @@ extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;

extern int kexec_setup_initrd(struct boot_params *boot_params,
unsigned long initrd_load_addr, unsigned long initrd_len);
-extern int kexec_setup_cmdline(struct boot_params *boot_params,
+extern int kexec_setup_cmdline(struct kimage *image,
+ struct boot_params *boot_params,
unsigned long bootparams_load_addr,
unsigned long cmdline_offset, char *cmdline,
unsigned long cmdline_len);
-extern int kexec_setup_boot_parameters(struct boot_params *params);
+extern int kexec_setup_boot_parameters(struct kimage *image,
+ struct boot_params *params);


#endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index a57902e..8eabde4 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -4,6 +4,9 @@
* Created by: Hariprasad Nellitheertha ([email protected])
*
* Copyright (C) IBM Corporation, 2004. All rights reserved.
+ * Copyright (C) Red Hat Inc., 2014. All rights reserved.
+ * Authors:
+ * Vivek Goyal <[email protected]>
*
*/

@@ -16,6 +19,7 @@
#include <linux/elf.h>
#include <linux/elfcore.h>
#include <linux/module.h>
+#include <linux/slab.h>

#include <asm/processor.h>
#include <asm/hardirq.h>
@@ -28,6 +32,45 @@
#include <asm/reboot.h>
#include <asm/virtext.h>

+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN 4096
+
+/* This primarily represents number of split ranges due to exclusion */
+#define CRASH_MAX_RANGES 16
+
+struct crash_mem_range {
+ unsigned long long start, end;
+};
+
+struct crash_mem {
+ unsigned int nr_ranges;
+ struct crash_mem_range ranges[CRASH_MAX_RANGES];
+};
+
+/* Misc data about ram ranges needed to prepare elf headers */
+struct crash_elf_data {
+ struct kimage *image;
+ /*
+ * Total number of ram ranges we have after various adjustments for
+ * GART, crash reserved region etc.
+ */
+ unsigned int max_nr_ranges;
+ unsigned long gart_start, gart_end;
+
+ /* Pointer to elf header */
+ void *ehdr;
+ /* Pointer to next phdr */
+ void *bufp;
+ struct crash_mem mem;
+};
+
+/* Used while preparing memory map entries for second kernel */
+struct crash_memmap_data {
+ struct boot_params *params;
+ /* Type of memory */
+ unsigned int type;
+};
+
int in_crash_kexec;

/*
@@ -137,3 +180,534 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
#endif
crash_save_cpu(regs, safe_smp_processor_id());
}
+
+#ifdef CONFIG_X86_64
+
+static int get_nr_ram_ranges_callback(unsigned long start_pfn,
+ unsigned long nr_pfn, void *arg)
+{
+ int *nr_ranges = arg;
+
+ (*nr_ranges)++;
+ return 0;
+}
+
+static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
+{
+ struct crash_elf_data *ced = arg;
+
+ ced->gart_start = start;
+ ced->gart_end = end;
+
+ /* Not expecting more than 1 gart aperture */
+ return 1;
+}
+
+
+/* Gather all the required information to prepare elf headers for ram regions */
+static int fill_up_ced(struct crash_elf_data *ced, struct kimage *image)
+{
+ unsigned int nr_ranges = 0;
+
+ ced->image = image;
+
+ walk_system_ram_range(0, -1, &nr_ranges,
+ get_nr_ram_ranges_callback);
+
+ ced->max_nr_ranges = nr_ranges;
+
+ /*
+ * We don't create ELF headers for the GART aperture, as an attempt
+ * to dump this memory in the second kernel leads to a hang/crash.
+ * If a GART aperture is present, that region needs to be excluded,
+ * which could lead to the need for an extra phdr.
+ */
+
+ walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
+ ced, get_gart_ranges_callback);
+
+ /*
+ * If we have gart region, excluding that could potentially split
+ * a memory range, resulting in extra header. Account for that.
+ */
+ if (ced->gart_end)
+ ced->max_nr_ranges++;
+
+ /* Exclusion of crash region could split memory ranges */
+ ced->max_nr_ranges++;
+
+ /* If crashk_low_res is there, another range split possible */
+ if (crashk_low_res.end != 0)
+ ced->max_nr_ranges++;
+
+ return 0;
+}
+
+static int exclude_mem_range(struct crash_mem *mem,
+ unsigned long long mstart, unsigned long long mend)
+{
+ int i, j;
+ unsigned long long start, end;
+ struct crash_mem_range temp_range = {0, 0};
+
+ for (i = 0; i < mem->nr_ranges; i++) {
+ start = mem->ranges[i].start;
+ end = mem->ranges[i].end;
+
+ if (mstart > end || mend < start)
+ continue;
+
+ /* Truncate any area outside of range */
+ if (mstart < start)
+ mstart = start;
+ if (mend > end)
+ mend = end;
+
+ /* Found completely overlapping range */
+ if (mstart == start && mend == end) {
+ mem->ranges[i].start = 0;
+ mem->ranges[i].end = 0;
+ if (i < mem->nr_ranges - 1) {
+ /* Shift rest of the ranges to left */
+ for(j = i; j < mem->nr_ranges - 1; j++) {
+ mem->ranges[j].start =
+ mem->ranges[j+1].start;
+ mem->ranges[j].end =
+ mem->ranges[j+1].end;
+ }
+ }
+ mem->nr_ranges--;
+ return 0;
+ }
+
+ if (mstart > start && mend < end) {
+ /* Split original range */
+ mem->ranges[i].end = mstart - 1;
+ temp_range.start = mend + 1;
+ temp_range.end = end;
+ } else if (mstart != start)
+ mem->ranges[i].end = mstart - 1;
+ else
+ mem->ranges[i].start = mend + 1;
+ break;
+ }
+
+ /* If a split happened, add it to the array */
+ if (!temp_range.end)
+ return 0;
+
+ /* Split happened */
+ if (i == CRASH_MAX_RANGES - 1) {
+ printk("Too many crash ranges after split\n");
+ return -ENOMEM;
+ }
+
+ /* Location where new range should go */
+ j = i + 1;
+ if (j < mem->nr_ranges) {
+ /* Move over all ranges one place */
+ for (i = mem->nr_ranges - 1; i >= j; i--)
+ mem->ranges[i + 1] = mem->ranges[i];
+ }
+
+ mem->ranges[j].start = temp_range.start;
+ mem->ranges[j].end = temp_range.end;
+ mem->nr_ranges++;
+ return 0;
+}
+
+/*
+ * Look for any unwanted ranges between mstart, mend and remove them. This
+ * might lead to split and split ranges are put in ced->mem.ranges[] array
+ */
+static int elf_header_exclude_ranges(struct crash_elf_data *ced,
+ unsigned long long mstart, unsigned long long mend)
+{
+ struct crash_mem *cmem = &ced->mem;
+ int ret = 0;
+
+ memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+ cmem->ranges[0].start = mstart;
+ cmem->ranges[0].end = mend;
+ cmem->nr_ranges = 1;
+
+ /* Exclude crashkernel region */
+ ret = exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
+ if (ret)
+ return ret;
+
+ ret = exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end);
+ if (ret)
+ return ret;
+
+ /* Exclude GART region */
+ if (ced->gart_end) {
+ ret = exclude_mem_range(cmem, ced->gart_start, ced->gart_end);
+ if (ret)
+ return ret;
+ }
+
+ return ret;
+}
+
+static int prepare_elf64_ram_headers_callback(u64 start, u64 end, void *arg)
+{
+ struct crash_elf_data *ced = arg;
+ Elf64_Ehdr *ehdr;
+ Elf64_Phdr *phdr;
+ unsigned long mstart, mend;
+ struct kimage *image = ced->image;
+ struct crash_mem *cmem;
+ int ret, i;
+
+ ehdr = ced->ehdr;
+
+ /* Exclude unwanted mem ranges */
+ ret = elf_header_exclude_ranges(ced, start, end);
+ if (ret)
+ return ret;
+
+ /* Go through all the ranges in ced->mem.ranges[] and prepare phdr */
+ cmem = &ced->mem;
+
+ for (i = 0; i < cmem->nr_ranges; i++) {
+ mstart = cmem->ranges[i].start;
+ mend = cmem->ranges[i].end;
+
+ phdr = ced->bufp;
+ ced->bufp += sizeof(Elf64_Phdr);
+
+ phdr->p_type = PT_LOAD;
+ phdr->p_flags = PF_R|PF_W|PF_X;
+ phdr->p_offset = mstart;
+
+ /*
+ * If a range matches backup region, adjust offset to backup
+ * segment.
+ */
+ if (mstart == image->arch.backup_src_start &&
+ (mend - mstart + 1) == image->arch.backup_src_sz)
+ phdr->p_offset = image->arch.backup_load_addr;
+
+ phdr->p_paddr = mstart;
+ phdr->p_vaddr = (unsigned long long) __va(mstart);
+ phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
+ phdr->p_align = 0;
+ ehdr->e_phnum++;
+ pr_debug("Crash PT_LOAD elf header. phdr=%p"
+ " vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d"
+ " p_offset=0x%llx\n", phdr, phdr->p_vaddr,
+ phdr->p_paddr, phdr->p_filesz, ehdr->e_phnum,
+ phdr->p_offset);
+ }
+
+ return ret;
+}
+
+static int prepare_elf64_headers(struct crash_elf_data *ced,
+ unsigned long *addr, unsigned long *sz)
+{
+ Elf64_Ehdr *ehdr;
+ Elf64_Phdr *phdr;
+ unsigned long nr_cpus = NR_CPUS, nr_phdr, elf_sz;
+ unsigned char *buf, *bufp;
+ unsigned int cpu;
+ unsigned long long notes_addr;
+ int ret;
+
+ /* extra phdr for vmcoreinfo elf note */
+ nr_phdr = nr_cpus + 1;
+ nr_phdr += ced->max_nr_ranges;
+
+ /*
+ * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
+ * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
+ * I think this is required by tools like gdb. So the same physical
+ * memory will be mapped in two ELF headers. One will contain kernel
+ * text virtual addresses and the other will have __va(physical) addresses.
+ */
+
+ nr_phdr++;
+ elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
+ elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
+
+ buf = vzalloc(elf_sz);
+ if (!buf)
+ return -ENOMEM;
+
+ bufp = buf;
+ ehdr = (Elf64_Ehdr *)bufp;
+ bufp += sizeof(Elf64_Ehdr);
+ memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
+ ehdr->e_ident[EI_CLASS] = ELFCLASS64;
+ ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
+ ehdr->e_ident[EI_VERSION] = EV_CURRENT;
+ ehdr->e_ident[EI_OSABI] = ELF_OSABI;
+ memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
+ ehdr->e_type = ET_CORE;
+ ehdr->e_machine = ELF_ARCH;
+ ehdr->e_version = EV_CURRENT;
+ ehdr->e_entry = 0;
+ ehdr->e_phoff = sizeof(Elf64_Ehdr);
+ ehdr->e_shoff = 0;
+ ehdr->e_flags = 0;
+ ehdr->e_ehsize = sizeof(Elf64_Ehdr);
+ ehdr->e_phentsize = sizeof(Elf64_Phdr);
+ ehdr->e_phnum = 0;
+ ehdr->e_shentsize = 0;
+ ehdr->e_shnum = 0;
+ ehdr->e_shstrndx = 0;
+
+ /* Prepare one phdr of type PT_NOTE for each present cpu */
+ for_each_present_cpu(cpu) {
+ phdr = (Elf64_Phdr *)bufp;
+ bufp += sizeof(Elf64_Phdr);
+ phdr->p_type = PT_NOTE;
+ phdr->p_flags = 0;
+ notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
+ phdr->p_offset = phdr->p_paddr = notes_addr;
+ phdr->p_vaddr = 0;
+ phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
+ phdr->p_align = 0;
+ (ehdr->e_phnum)++;
+ }
+
+ /* Prepare one PT_NOTE header for vmcoreinfo */
+ phdr = (Elf64_Phdr *)bufp;
+ bufp += sizeof(Elf64_Phdr);
+ phdr->p_type = PT_NOTE;
+ phdr->p_flags = 0;
+ phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
+ phdr->p_vaddr = 0;
+ phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
+ phdr->p_align = 0;
+ (ehdr->e_phnum)++;
+
+#ifdef CONFIG_X86_64
+ /* Prepare PT_LOAD type program header for kernel text region */
+ phdr = (Elf64_Phdr *)bufp;
+ bufp += sizeof(Elf64_Phdr);
+ phdr->p_type = PT_LOAD;
+ phdr->p_flags = PF_R|PF_W|PF_X;
+ phdr->p_vaddr = (Elf64_Addr)_text;
+ phdr->p_filesz = phdr->p_memsz = _end - _text;
+ phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
+ phdr->p_align = 0;
+ (ehdr->e_phnum)++;
+#endif
+
+ /* Prepare PT_LOAD headers for system ram chunks. */
+ ced->ehdr = ehdr;
+ ced->bufp = bufp;
+ ret = walk_system_ram_res(0, -1, ced,
+ prepare_elf64_ram_headers_callback);
+ if (ret < 0)
+ return ret;
+
+ *addr = (unsigned long)buf;
+ *sz = elf_sz;
+ return 0;
+}
+
+/* Prepare elf headers. Return addr and size */
+static int prepare_elf_headers(struct kimage *image, unsigned long *addr,
+ unsigned long *sz)
+{
+ struct crash_elf_data *ced;
+ int ret;
+
+ ced = kzalloc(sizeof(*ced), GFP_KERNEL);
+ if (!ced)
+ return -ENOMEM;
+
+ ret = fill_up_ced(ced, image);
+ if (ret)
+ goto out;
+
+ /* By default prepare 64bit headers */
+ ret = prepare_elf64_headers(ced, addr, sz);
+out:
+ kfree(ced);
+ return ret;
+}
+
+static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
+{
+ unsigned int nr_e820_entries;
+
+ nr_e820_entries = params->e820_entries;
+ if (nr_e820_entries >= E820MAX)
+ return 1;
+
+ memcpy(&params->e820_map[nr_e820_entries], entry,
+ sizeof(struct e820entry));
+ params->e820_entries++;
+
+ pr_debug("Add e820 entry to bootparams. addr=0x%llx size=0x%llx"
+ " type=%d\n", entry->addr, entry->size, entry->type);
+ return 0;
+}
+
+static int memmap_entry_callback(u64 start, u64 end, void *arg)
+{
+ struct crash_memmap_data *cmd = arg;
+ struct boot_params *params = cmd->params;
+ struct e820entry ei;
+
+ ei.addr = start;
+ ei.size = end - start + 1;
+ ei.type = cmd->type;
+ add_e820_entry(params, &ei);
+
+ return 0;
+}
+
+static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
+ unsigned long long mstart, unsigned long long mend)
+{
+ unsigned long start, end;
+ int ret = 0;
+
+ memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+ cmem->ranges[0].start = mstart;
+ cmem->ranges[0].end = mend;
+ cmem->nr_ranges = 1;
+
+ /* Exclude Backup region */
+ start = image->arch.backup_load_addr;
+ end = start + image->arch.backup_src_sz - 1;
+ ret = exclude_mem_range(cmem, start, end);
+ if (ret)
+ return ret;
+
+ /* Exclude elf header region */
+ start = image->arch.elf_load_addr;
+ end = start + image->arch.elf_headers_sz - 1;
+ ret = exclude_mem_range(cmem, start, end);
+ return ret;
+}
+
+/* Prepare memory map for crash dump kernel */
+int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
+{
+ int i, ret = 0;
+ unsigned long flags;
+ struct e820entry ei;
+ struct crash_memmap_data cmd;
+ struct crash_mem *cmem;
+
+ cmem = vzalloc(sizeof(struct crash_mem));
+ if (!cmem)
+ return -ENOMEM;
+
+ memset(&cmd, 0, sizeof(struct crash_memmap_data));
+ cmd.params = params;
+
+ /* Add first 640K segment */
+ ei.addr = image->arch.backup_src_start;
+ ei.size = image->arch.backup_src_sz;
+ ei.type = E820_RAM;
+ add_e820_entry(params, &ei);
+
+ /* Add ACPI tables */
+ cmd.type = E820_ACPI;
+ flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+ walk_ram_res("ACPI Tables", flags, 0, -1, &cmd, memmap_entry_callback);
+
+ /* Add ACPI Non-volatile Storage */
+ cmd.type = E820_NVS;
+ walk_ram_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
+ memmap_entry_callback);
+
+ /* Add crashk_low_res region */
+ if (crashk_low_res.end) {
+ ei.addr = crashk_low_res.start;
+ ei.size = crashk_low_res.end - crashk_low_res.start + 1;
+ ei.type = E820_RAM;
+ add_e820_entry(params, &ei);
+ }
+
+ /* Exclude some ranges from crashk_res and add rest to memmap */
+ ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
+ crashk_res.end);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < cmem->nr_ranges; i++) {
+ ei.addr = cmem->ranges[i].start;
+ ei.size = cmem->ranges[i].end - ei.addr + 1;
+ ei.type = E820_RAM;
+
+ /* If entry is less than a page, skip it */
+ if (ei.size < PAGE_SIZE)
+ continue;
+ add_e820_entry(params, &ei);
+ }
+
+out:
+ vfree(cmem);
+ return ret;
+}
+
+static int determine_backup_region(u64 start, u64 end, void *arg)
+{
+ struct kimage *image = arg;
+
+ image->arch.backup_src_start = start;
+ image->arch.backup_src_sz = end - start + 1;
+
+ /* Expecting only one range for backup region */
+ return 1;
+}
+
+int load_crashdump_segments(struct kimage *image)
+{
+ unsigned long src_start, src_sz;
+ unsigned long elf_addr, elf_sz;
+ int ret;
+
+ /*
+ * Determine and load a segment for backup area. First 640K RAM
+ * region is backup source
+ */
+
+ ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
+ image, determine_backup_region);
+
+ /* Zero or positive return values are ok */
+ if (ret < 0)
+ return ret;
+
+ src_start = image->arch.backup_src_start;
+ src_sz = image->arch.backup_src_sz;
+
+ /* Add backup segment. */
+ if (src_sz) {
+ ret = kexec_add_buffer(image, __va(src_start), src_sz, src_sz,
+ PAGE_SIZE, 0, -1, 0,
+ &image->arch.backup_load_addr);
+ if (ret)
+ return ret;
+ }
+
+ /* Prepare elf headers and add a segment */
+ ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
+ if (ret)
+ return ret;
+
+ image->arch.elf_headers = elf_addr;
+ image->arch.elf_headers_sz = elf_sz;
+
+ ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
+ ELF_CORE_HEADER_ALIGN, 0, -1, 0,
+ &image->arch.elf_load_addr);
+ if (ret)
+ kfree((void *)image->arch.elf_headers);
+
+ return ret;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
index cbfcd00..f66188d 100644
--- a/arch/x86/kernel/kexec-bzimage.c
+++ b/arch/x86/kernel/kexec-bzimage.c
@@ -8,6 +8,9 @@

#include <asm/bootparam.h>
#include <asm/setup.h>
+#include <asm/crash.h>
+
+#define MAX_ELFCOREHDR_STR_LEN 30 /* elfcorehdr=0x<64bit-value> */

#ifdef CONFIG_X86_64

@@ -100,11 +103,28 @@ void *bzImage64_load(struct kimage *image, char *kernel,
return ERR_PTR(-EINVAL);
}

+ /*
+ * In case of crash dump, we will append elfcorehdr=<addr> to
+ * command line. Make sure it does not overflow
+ */
+ if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
+ ret = -EINVAL;
+ pr_debug("Kernel command line too long\n");
+ return ERR_PTR(-EINVAL);
+ }
+
/* Allocate loader specific data */
ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
if (!ldata)
return ERR_PTR(-ENOMEM);

+ /* Allocate and load backup region */
+ if (image->type == KEXEC_TYPE_CRASH) {
+ ret = load_crashdump_segments(image);
+ if (ret)
+ goto out_free_loader_data;
+ }
+
/*
* Load purgatory. For 64bit entry point, purgatory code can be
* anywhere.
@@ -118,7 +138,8 @@ void *bzImage64_load(struct kimage *image, char *kernel,
pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);

/* Load Bootparams and cmdline */
- params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+ params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
+ MAX_ELFCOREHDR_STR_LEN;
params = kzalloc(params_cmdline_sz, GFP_KERNEL);
if (!params) {
ret = -ENOMEM;
@@ -167,7 +188,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
goto out_free_params;
}

- ret = kexec_setup_cmdline(params, bootparam_load_addr,
+ ret = kexec_setup_cmdline(image, params, bootparam_load_addr,
sizeof(struct boot_params), cmdline, cmdline_len);
if (ret)
goto out_free_params;
@@ -198,7 +219,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
if (ret)
goto out_free_params;

- ret = kexec_setup_boot_parameters(params);
+ ret = kexec_setup_boot_parameters(image, params);
if (ret)
goto out_free_params;

diff --git a/arch/x86/kernel/kexec-elf.c b/arch/x86/kernel/kexec-elf.c
index ff1017c..e7ce2d0 100644
--- a/arch/x86/kernel/kexec-elf.c
+++ b/arch/x86/kernel/kexec-elf.c
@@ -164,7 +164,7 @@ void *elf_x86_64_load(struct kimage *image, char *kernel,
goto out_free_params;
}

- ret = kexec_setup_cmdline(params, bootparam_load_addr,
+ ret = kexec_setup_cmdline(image, params, bootparam_load_addr,
sizeof(struct boot_params), cmdline, cmdline_len);
if (ret)
goto out_free_params;
@@ -195,7 +195,7 @@ void *elf_x86_64_load(struct kimage *image, char *kernel,
if (ret)
goto out_free_params;

- ret = kexec_setup_boot_parameters(params);
+ ret = kexec_setup_boot_parameters(image, params);
if (ret)
goto out_free_params;

diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
index ac55890..26d474b 100644
--- a/arch/x86/kernel/machine_kexec.c
+++ b/arch/x86/kernel/machine_kexec.c
@@ -10,9 +10,11 @@
*/

#include <linux/kernel.h>
+#include <linux/kexec.h>
#include <linux/string.h>
#include <asm/bootparam.h>
#include <asm/setup.h>
+#include <asm/crash.h>

/*
* Common code for x86 and x86_64 used for kexec.
@@ -36,18 +38,24 @@ int kexec_setup_initrd(struct boot_params *boot_params,
return 0;
}

-int kexec_setup_cmdline(struct boot_params *boot_params,
+int kexec_setup_cmdline(struct kimage *image, struct boot_params *boot_params,
unsigned long bootparams_load_addr,
unsigned long cmdline_offset, char *cmdline,
unsigned long cmdline_len)
{
char *cmdline_ptr = ((char *)boot_params) + cmdline_offset;
- unsigned long cmdline_ptr_phys;
+ unsigned long cmdline_ptr_phys, len;
uint32_t cmdline_low_32, cmdline_ext_32;

memcpy(cmdline_ptr, cmdline, cmdline_len);
+ if (image->type == KEXEC_TYPE_CRASH) {
+ len = sprintf(cmdline_ptr + cmdline_len - 1,
+ " elfcorehdr=0x%lx", image->arch.elf_load_addr);
+ cmdline_len += len;
+ }
cmdline_ptr[cmdline_len - 1] = '\0';

+ pr_debug("Final command line is:%s\n", cmdline_ptr);
cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
cmdline_ext_32 = cmdline_ptr_phys >> 32;
@@ -75,7 +83,8 @@ static int setup_memory_map_entries(struct boot_params *params)
return 0;
}

-int kexec_setup_boot_parameters(struct boot_params *params)
+int kexec_setup_boot_parameters(struct kimage *image,
+ struct boot_params *params)
{
unsigned int nr_e820_entries;
unsigned long long mem_k, start, end;
@@ -102,7 +111,11 @@ int kexec_setup_boot_parameters(struct boot_params *params)
/* Default sysdesc table */
params->sys_desc_table.length = 0;

- setup_memory_map_entries(params);
+ if (image->type == KEXEC_TYPE_CRASH)
+ crash_setup_memmap_entries(image, params);
+ else
+ setup_memory_map_entries(params);
+
nr_e820_entries = params->e820_entries;

for(i = 0; i < nr_e820_entries; i++) {
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index e35bcaf..e852a51 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -181,6 +181,38 @@ static void load_segments(void)
);
}

+/* Update purgatory as needed after various image segments have been prepared */
+static int arch_update_purgatory(struct kimage *image)
+{
+ int ret = 0;
+
+ if (!image->file_mode)
+ return 0;
+
+ /* Setup copying of backup region */
+ if (image->type == KEXEC_TYPE_CRASH) {
+ ret = kexec_purgatory_get_set_symbol(image, "backup_dest",
+ &image->arch.backup_load_addr,
+ sizeof(image->arch.backup_load_addr), 0);
+ if (ret)
+ return ret;
+
+ ret = kexec_purgatory_get_set_symbol(image, "backup_src",
+ &image->arch.backup_src_start,
+ sizeof(image->arch.backup_src_start), 0);
+ if (ret)
+ return ret;
+
+ ret = kexec_purgatory_get_set_symbol(image, "backup_sz",
+ &image->arch.backup_src_sz,
+ sizeof(image->arch.backup_src_sz), 0);
+ if (ret)
+ return ret;
+ }
+
+ return ret;
+}
+
int machine_kexec_prepare(struct kimage *image)
{
unsigned long start_pgtable;
@@ -194,6 +226,11 @@ int machine_kexec_prepare(struct kimage *image)
if (result)
return result;

+ /* update purgatory as needed */
+ result = arch_update_purgatory(image);
+ if (result)
+ return result;
+
return 0;
}

@@ -330,6 +367,7 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
{
int idx = image->file_handler_idx;

+ vfree((void *)image->arch.elf_headers);
if (kexec_file_type[idx].cleanup)
return kexec_file_type[idx].cleanup(image);
return 0;
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 9e4718b..3ea1d41 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -548,7 +548,6 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
*rimage = image;
return 0;

-
out_free_control_pages:
kimage_free_page_list(&image->control_pages);
out_free_image:
@@ -556,6 +555,54 @@ out_free_image:
return result;
}

+static int kimage_file_crash_alloc(struct kimage **rimage, int kernel_fd,
+ int initrd_fd, const char __user *cmdline_ptr,
+ unsigned long cmdline_len)
+{
+ int result;
+ struct kimage *image;
+
+ /* Allocate and initialize a controlling structure */
+ image = do_kimage_alloc_init();
+ if (!image)
+ return -ENOMEM;
+
+ image->file_mode = 1;
+ image->file_handler_idx = -1;
+
+ /* Enable the special crash kernel control page allocation policy. */
+ image->control_page = crashk_res.start;
+ image->type = KEXEC_TYPE_CRASH;
+
+ result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+ cmdline_ptr, cmdline_len);
+ if (result)
+ goto out_free_image;
+
+ result = sanity_check_segment_list(image);
+ if (result)
+ goto out_free_post_load_bufs;
+
+ result = -ENOMEM;
+ image->control_code_page = kimage_alloc_control_pages(image,
+ get_order(KEXEC_CONTROL_PAGE_SIZE));
+ if (!image->control_code_page) {
+ printk(KERN_ERR "Could not allocate control_code_buffer\n");
+ goto out_free_post_load_bufs;
+ }
+
+ *rimage = image;
+ return 0;
+
+out_free_post_load_bufs:
+ kimage_file_post_load_cleanup(image);
+ kfree(image->image_loader_data);
+out_free_image:
+ kfree(image);
+ return result;
+}
+
+
static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
unsigned long nr_segments,
struct kexec_segment __user *segments)
@@ -1137,10 +1184,15 @@ static int kimage_load_crash_segment(struct kimage *image,
unsigned long maddr;
size_t ubytes, mbytes;
int result;
- unsigned char __user *buf;
+ unsigned char __user *buf = NULL;
+ unsigned char *kbuf = NULL;

result = 0;
- buf = segment->buf;
+ if (image->file_mode)
+ kbuf = segment->kbuf;
+ else
+ buf = segment->buf;
+
ubytes = segment->bufsz;
mbytes = segment->memsz;
maddr = segment->mem;
@@ -1163,7 +1215,12 @@ static int kimage_load_crash_segment(struct kimage *image,
/* Zero the trailing part of the page */
memset(ptr + uchunk, 0, mchunk - uchunk);
}
- result = copy_from_user(ptr, buf, uchunk);
+
+ /* For file based kexec, source pages are in kernel memory */
+ if (image->file_mode)
+ memcpy(ptr, kbuf, uchunk);
+ else
+ result = copy_from_user(ptr, buf, uchunk);
kexec_flush_icache_page(page);
kunmap(page);
if (result) {
@@ -1172,7 +1229,10 @@ static int kimage_load_crash_segment(struct kimage *image,
}
ubytes -= uchunk;
maddr += mchunk;
- buf += mchunk;
+ if (image->file_mode)
+ kbuf += mchunk;
+ else
+ buf += mchunk;
mbytes -= mchunk;
}
out:
@@ -1392,7 +1452,11 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, const char __us
if (flags & KEXEC_FILE_UNLOAD)
goto exchange;

- ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
+ if (flags & KEXEC_FILE_ON_CRASH)
+ ret = kimage_file_crash_alloc(&image, kernel_fd, initrd_fd,
+ cmdline_ptr, cmdline_len);
+ else
+ ret = kimage_file_normal_alloc(&image, kernel_fd, initrd_fd,
cmdline_ptr, cmdline_len);
if (ret)
goto out;
@@ -2211,7 +2275,12 @@ int kexec_add_buffer(struct kimage *image, char *buffer,
kbuf->top_down = top_down;

/* Walk the RAM ranges and allocate a suitable range for the buffer */
- walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
+ if (image->type == KEXEC_TYPE_CRASH)
+ walk_ram_res("Crash kernel", IORESOURCE_MEM | IORESOURCE_BUSY,
+ crashk_res.start, crashk_res.end, kbuf,
+ walk_ram_range_callback);
+ else
+ walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);

kbuf->image = NULL;
kfree(kbuf);
--
1.8.4.2

2014-01-27 18:58:56

by Vivek Goyal

Subject: [PATCH 09/11] kexec: Provide a function to add a segment at fixed address

kexec_add_buffer() can find a suitable range of memory for a user buffer and
add it to the list of segments. But the ELF loader requires that a buffer be
loaded at the address it was compiled for (ET_EXEC type executables). So we
need a helper function which checks that the requested memory is valid and
available and adds a segment accordingly. This patch provides that helper
function. It will be used by the ELF loader in a later patch.

Signed-off-by: Vivek Goyal <[email protected]>
---
include/linux/kexec.h | 3 +++
kernel/kexec.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 68 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d391ed7..2fb052c 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -208,6 +208,9 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
struct kexec_segment __user *segments,
unsigned long flags);
extern int kernel_kexec(void);
+extern int kexec_add_segment(struct kimage *image, char *buffer,
+ unsigned long bufsz, unsigned long memsz,
+ unsigned long base);
extern int kexec_add_buffer(struct kimage *image, char *buffer,
unsigned long bufsz, unsigned long memsz,
unsigned long buf_align, unsigned long buf_min,
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 20169a4..9e4718b 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -2002,6 +2002,71 @@ static int __kexec_add_segment(struct kimage *image, char *buf,
return 0;
}

+static int validate_ram_range_callback(u64 start, u64 end, void *arg)
+{
+ struct kexec_segment *ksegment = arg;
+ u64 mstart = ksegment->mem;
+ u64 mend = ksegment->mem + ksegment->memsz - 1;
+
+ /* Found a valid range. Stop going through more ranges */
+ if (mstart >= start && mend <= end)
+ return 1;
+
+ /* Range did not match. Go to next one */
+ return 0;
+}
+
+/* Add a kexec segment at fixed address provided by caller */
+int kexec_add_segment(struct kimage *image, char *buffer, unsigned long bufsz,
+ unsigned long memsz, unsigned long base)
+{
+ struct kexec_segment ksegment;
+ int ret;
+
+ /* Currently adding segment this way is allowed only in file mode */
+ if (!image->file_mode)
+ return -EINVAL;
+
+ if (image->nr_segments >= KEXEC_SEGMENT_MAX)
+ return -EINVAL;
+
+ /*
+ * Make sure we are not trying to add a segment after allocating
+ * control pages. All segments need to be placed before any control
+ * pages are allocated, as the control page allocation logic goes
+ * through the list of segments to make sure there are no
+ * destination overlaps.
+ */
+ WARN_ONCE(!list_empty(&image->control_pages), "Adding kexec segment"
+ " after allocating control pages\n");
+
+ if (bufsz > memsz)
+ return -EINVAL;
+ if (memsz == 0)
+ return -EINVAL;
+
+ /* Align memsz to next page boundary */
+ memsz = ALIGN(memsz, PAGE_SIZE);
+
+ /* Make sure base is at least page aligned */
+ if (base & (PAGE_SIZE - 1))
+ return -EINVAL;
+
+ memset(&ksegment, 0, sizeof(struct kexec_segment));
+ ksegment.mem = base;
+ ksegment.memsz = memsz;
+
+ /* Validate memory range */
+ ret = walk_system_ram_res(base, base + memsz - 1, &ksegment,
+ validate_ram_range_callback);
+
+ /* If a valid range is found, 1 is returned */
+ if (ret != 1)
+ return -EINVAL;
+
+ return __kexec_add_segment(image, buffer, bufsz, base, memsz);
+}
+
static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
struct kexec_buf *kbuf)
{
--
1.8.4.2

2014-01-27 18:58:55

by Vivek Goyal

Subject: [PATCH 05/11] kexec: Make kexec_segment user buffer pointer a union

So far kexec_segment->buf was always a user space pointer, as user space
passed in the array of kexec_segment structures and the kernel copied it.

But with the new system call, the list of kexec segments is prepared by the
kernel, and kexec_segment->buf will point to kernel memory.

So when I added code which assumed that ->buf points to kernel memory,
sparse started giving warnings.

Make ->buf a union. Where a user space pointer is expected, access it using
->buf; where a kernel space pointer is expected, access it using ->kbuf.
That takes care of the sparse warnings.

Signed-off-by: Vivek Goyal <[email protected]>
---
include/linux/kexec.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 6d4066c..d8188b3 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -69,7 +69,18 @@ typedef unsigned long kimage_entry_t;
#define IND_SOURCE 0x8

struct kexec_segment {
- void __user *buf;
+ /*
+ * This pointer can point to user memory if kexec_load() system
+ * call is used or will point to kernel memory if
+ * kexec_file_load() system call is used.
+ *
+ * Use ->buf when expecting to deal with user memory and use ->kbuf
+ * when expecting to deal with kernel memory.
+ */
+ union {
+ void __user *buf;
+ void *kbuf;
+ };
size_t bufsz;
unsigned long mem;
size_t memsz;
--
1.8.4.2

2014-01-27 18:58:54

by Vivek Goyal

Subject: [PATCH 03/11] bin2c: Move bin2c in scripts/basic

Kexec wants to use bin2c, and it wants to use it really early in the build
process. See the arch/x86/purgatory/ code in later patches.

So move bin2c into scripts/basic so that it can be built very early and be
usable by arch/x86/purgatory/.

Signed-off-by: Vivek Goyal <[email protected]>
---
kernel/Makefile | 2 +-
scripts/Makefile | 1 -
scripts/basic/Makefile | 1 +
scripts/basic/bin2c.c | 36 ++++++++++++++++++++++++++++++++++++
scripts/bin2c.c | 36 ------------------------------------
5 files changed, 38 insertions(+), 38 deletions(-)
create mode 100644 scripts/basic/bin2c.c
delete mode 100644 scripts/bin2c.c

diff --git a/kernel/Makefile b/kernel/Makefile
index bc010ee..dff602a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -102,7 +102,7 @@ targets += config_data.gz
$(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE
$(call if_changed,gzip)

- filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/bin2c; echo "MAGIC_END;")
+ filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/basic/bin2c; echo "MAGIC_END;")
targets += config_data.h
$(obj)/config_data.h: $(obj)/config_data.gz FORCE
$(call filechk,ikconfiggz)
diff --git a/scripts/Makefile b/scripts/Makefile
index 01e7adb..62e6cc2 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -13,7 +13,6 @@ HOST_EXTRACFLAGS += -I$(srctree)/tools/include
hostprogs-$(CONFIG_KALLSYMS) += kallsyms
hostprogs-$(CONFIG_LOGO) += pnmtologo
hostprogs-$(CONFIG_VT) += conmakehash
-hostprogs-$(CONFIG_IKCONFIG) += bin2c
hostprogs-$(BUILD_C_RECORDMCOUNT) += recordmcount
hostprogs-$(CONFIG_BUILDTIME_EXTABLE_SORT) += sortextable
hostprogs-$(CONFIG_ASN1) += asn1_compiler
diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
index 4fcef87..afbc1cd 100644
--- a/scripts/basic/Makefile
+++ b/scripts/basic/Makefile
@@ -9,6 +9,7 @@
# fixdep: Used to generate dependency information during build process

hostprogs-y := fixdep
+hostprogs-$(CONFIG_IKCONFIG) += bin2c
always := $(hostprogs-y)

# fixdep is needed to compile other host programs
diff --git a/scripts/basic/bin2c.c b/scripts/basic/bin2c.c
new file mode 100644
index 0000000..96dd2bc
--- /dev/null
+++ b/scripts/basic/bin2c.c
@@ -0,0 +1,36 @@
+/*
+ * Unloved program to convert a binary on stdin to a C include on stdout
+ *
+ * Jan 1999 Matt Mackall <[email protected]>
+ *
+ * This software may be used and distributed according to the terms
+ * of the GNU General Public License, incorporated herein by reference.
+ */
+
+#include <stdio.h>
+
+int main(int argc, char *argv[])
+{
+ int ch, total=0;
+
+ if (argc > 1)
+ printf("const char %s[] %s=\n",
+ argv[1], argc > 2 ? argv[2] : "");
+
+ do {
+ printf("\t\"");
+ while ((ch = getchar()) != EOF)
+ {
+ total++;
+ printf("\\x%02x",ch);
+ if (total % 16 == 0)
+ break;
+ }
+ printf("\"\n");
+ } while (ch != EOF);
+
+ if (argc > 1)
+ printf("\t;\n\nconst int %s_size = %d;\n", argv[1], total);
+
+ return 0;
+}
diff --git a/scripts/bin2c.c b/scripts/bin2c.c
deleted file mode 100644
index 96dd2bc..0000000
--- a/scripts/bin2c.c
+++ /dev/null
@@ -1,36 +0,0 @@
-/*
- * Unloved program to convert a binary on stdin to a C include on stdout
- *
- * Jan 1999 Matt Mackall <[email protected]>
- *
- * This software may be used and distributed according to the terms
- * of the GNU General Public License, incorporated herein by reference.
- */
-
-#include <stdio.h>
-
-int main(int argc, char *argv[])
-{
- int ch, total=0;
-
- if (argc > 1)
- printf("const char %s[] %s=\n",
- argv[1], argc > 2 ? argv[2] : "");
-
- do {
- printf("\t\"");
- while ((ch = getchar()) != EOF)
- {
- total++;
- printf("\\x%02x",ch);
- if (total % 16 == 0)
- break;
- }
- printf("\"\n");
- } while (ch != EOF);
-
- if (argc > 1)
- printf("\t;\n\nconst int %s_size = %d;\n", argv[1], total);
-
- return 0;
-}
--
1.8.4.2

2014-01-27 19:01:27

by Vivek Goyal

Subject: [PATCH 10/11] kexec: Support for loading ELF x86_64 images

This patch adds kexec support for loading ELF x86_64 images. I have
tested it by loading vmlinux, and it worked.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/include/asm/kexec-elf.h | 11 ++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/kexec-elf.c | 231 +++++++++++++++++++++++++++++++++++++
arch/x86/kernel/machine_kexec_64.c | 2 +
4 files changed, 245 insertions(+)
create mode 100644 arch/x86/include/asm/kexec-elf.h
create mode 100644 arch/x86/kernel/kexec-elf.c

diff --git a/arch/x86/include/asm/kexec-elf.h b/arch/x86/include/asm/kexec-elf.h
new file mode 100644
index 0000000..afef382
--- /dev/null
+++ b/arch/x86/include/asm/kexec-elf.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_KEXEC_ELF_H
+#define _ASM_KEXEC_ELF_H
+
+extern int elf_x86_64_probe(const char *buf, unsigned long len);
+extern void *elf_x86_64_load(struct kimage *image, char *kernel,
+ unsigned long kernel_len, char *initrd,
+ unsigned long initrd_len, char *cmdline,
+ unsigned long cmdline_len);
+extern int elf_x86_64_cleanup(struct kimage *image);
+
+#endif /* _ASM_KEXEC_ELF_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index fa9981d..2d77de7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_KEXEC) += machine_kexec.o
obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o
obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o
obj-$(CONFIG_KEXEC) += kexec-bzimage.o
+obj-$(CONFIG_KEXEC) += kexec-elf.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o
obj-y += kprobes/
obj-$(CONFIG_MODULES) += module.o
diff --git a/arch/x86/kernel/kexec-elf.c b/arch/x86/kernel/kexec-elf.c
new file mode 100644
index 0000000..ff1017c
--- /dev/null
+++ b/arch/x86/kernel/kexec-elf.c
@@ -0,0 +1,231 @@
+#include <linux/string.h>
+#include <linux/printk.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kexec.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+#ifdef CONFIG_X86_64
+
+struct elf_x86_64_data {
+ /*
+ * Temporary buffer holding the boot params. This should be
+ * freed once the bootparam segment has been loaded.
+ */
+ void *bootparams_buf;
+};
+
+int elf_x86_64_probe(const char *buf, unsigned long len)
+{
+ int ret = -ENOEXEC;
+ Elf_Ehdr *ehdr;
+
+ if (len < sizeof(Elf_Ehdr)) {
+ pr_debug("File is too short to be an ELF executable.\n");
+ return ret;
+ }
+
+ ehdr = (Elf_Ehdr *)buf;
+
+ if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0
+ || ehdr->e_type != ET_EXEC || !elf_check_arch(ehdr)
+ || ehdr->e_phentsize != sizeof(Elf_Phdr))
+ return -ENOEXEC;
+
+ if (ehdr->e_phoff >= len
+ || (ehdr->e_phnum * sizeof(Elf_Phdr) > len - ehdr->e_phoff))
+ return -ENOEXEC;
+
+ /* I've got an ELF image */
+ pr_debug("It's an elf_x86_64 image.\n");
+ ret = 0;
+
+ return ret;
+}
+
+static int elf_exec_load(struct kimage *image, char *kernel)
+{
+ Elf_Ehdr *ehdr;
+ Elf_Phdr *phdrs;
+ int i, ret = 0;
+ size_t filesz;
+ char *buffer;
+
+ ehdr = (Elf_Ehdr *)kernel;
+ phdrs = (void *)ehdr + ehdr->e_phoff;
+
+ for (i = 0; i < ehdr->e_phnum; i++) {
+ if (phdrs[i].p_type != PT_LOAD)
+ continue;
+ filesz = phdrs[i].p_filesz;
+ if (filesz > phdrs[i].p_memsz)
+ filesz = phdrs[i].p_memsz;
+
+ buffer = (char *)ehdr + phdrs[i].p_offset;
+ ret = kexec_add_segment(image, buffer, filesz, phdrs[i].p_memsz,
+ phdrs[i].p_paddr);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
+/* Fill in fields which are usually present in bzImage */
+static int init_linux_parameters(struct boot_params *params)
+{
+ /*
+ * FIXME: It is odd that the information which comes from kernel
+ * has to be faked by loading kernel. I guess it is limitation of
+ * ELF format. Right now keeping it same as kexec-tools
+ * implementation. But this most likely needs fixing.
+ */
+ memcpy(&params->hdr.header, "HdrS", 4);
+ params->hdr.version = 0x0206;
+ params->hdr.initrd_addr_max = 0x37FFFFFF;
+ params->hdr.cmdline_size = 2048;
+ return 0;
+}
+
+void *elf_x86_64_load(struct kimage *image, char *kernel,
+ unsigned long kernel_len,
+ char *initrd, unsigned long initrd_len,
+ char *cmdline, unsigned long cmdline_len)
+{
+
+ int ret = 0;
+ unsigned long params_cmdline_sz;
+ struct boot_params *params;
+ unsigned long bootparam_load_addr, initrd_load_addr;
+ unsigned long purgatory_load_addr;
+ struct elf_x86_64_data *ldata;
+ struct kexec_entry64_regs regs64;
+ void *stack;
+ Elf_Ehdr *ehdr;
+
+ ehdr = (Elf_Ehdr *)kernel;
+
+ /* Allocate loader specific data */
+ ldata = kzalloc(sizeof(struct elf_x86_64_data), GFP_KERNEL);
+ if (!ldata)
+ return ERR_PTR(-ENOMEM);
+
+ /* Load the ELF executable */
+ ret = elf_exec_load(image, kernel);
+ if (ret) {
+ pr_debug("Loading ELF executable failed\n");
+ goto out_free_loader_data;
+ }
+
+ /*
+ * Load purgatory. For 64bit entry point, purgatory code can be
+ * anywhere.
+ */
+ ret = kexec_load_purgatory(image, 0, ULONG_MAX, 0,
+ &purgatory_load_addr);
+ if (ret) {
+ pr_debug("Loading purgatory failed\n");
+ goto out_free_loader_data;
+ }
+
+ pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
+
+ /* Argument/parameter segment */
+ params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+ params = kzalloc(params_cmdline_sz, GFP_KERNEL);
+ if (!params) {
+ ret = -ENOMEM;
+ goto out_free_loader_data;
+ }
+
+ init_linux_parameters(params);
+ ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
+ params_cmdline_sz, 16, 0, -1, 0, &bootparam_load_addr);
+ if (ret)
+ goto out_free_params;
+ pr_debug("Loaded boot_param and command line at 0x%lx sz=0x%lx\n",
+ bootparam_load_addr, params_cmdline_sz);
+
+ /* Load initrd high */
+ if (initrd) {
+ ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
+ 4096, 0x1000000, ULONG_MAX, 1, &initrd_load_addr);
+ if (ret)
+ goto out_free_params;
+
+ pr_debug("Loaded initrd at 0x%lx sz = 0x%lx\n",
+ initrd_load_addr, initrd_len);
+ ret = kexec_setup_initrd(params, initrd_load_addr, initrd_len);
+ if (ret)
+ goto out_free_params;
+ }
+
+ ret = kexec_setup_cmdline(params, bootparam_load_addr,
+ sizeof(struct boot_params), cmdline, cmdline_len);
+ if (ret)
+ goto out_free_params;
+
+ /* bootloader info. Do we need a separate ID for kexec kernel loader? */
+ params->hdr.type_of_loader = 0x0D << 4;
+ params->hdr.loadflags = 0;
+
+ /* Setup purgatory regs for entry */
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 1);
+ if (ret)
+ goto out_free_params;
+
+ regs64.rbx = 0; /* Bootstrap Processor */
+ regs64.rsi = bootparam_load_addr;
+ regs64.rip = ehdr->e_entry;
+ stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
+ if (IS_ERR(stack)) {
+ pr_debug("Could not find address of symbol stack_end\n");
+ ret = -EINVAL;
+ goto out_free_params;
+ }
+
+ regs64.rsp = (unsigned long)stack;
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 0);
+ if (ret)
+ goto out_free_params;
+
+ ret = kexec_setup_boot_parameters(params);
+ if (ret)
+ goto out_free_params;
+
+ /*
+ * Store a pointer to params so that it can be freed after the params
+ * segment has been loaded and its contents have been copied
+ * somewhere else.
+ */
+ ldata->bootparams_buf = params;
+ return ldata;
+
+out_free_params:
+ kfree(params);
+out_free_loader_data:
+ kfree(ldata);
+ return ERR_PTR(ret);
+}
+
+/* This cleanup function is called after various segments have been loaded */
+int elf_x86_64_cleanup(struct kimage *image)
+{
+ struct elf_x86_64_data *ldata = image->image_loader_data;
+
+ if (!ldata)
+ return 0;
+
+ kfree(ldata->bootparams_buf);
+ ldata->bootparams_buf = NULL;
+
+ return 0;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 37df7d3..e35bcaf 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -22,10 +22,12 @@
#include <asm/mmu_context.h>
#include <asm/debugreg.h>
#include <asm/kexec-bzimage.h>
+#include <asm/kexec-elf.h>

/* arch dependent functionality related to kexec file based syscall */
static struct kexec_file_type kexec_file_type[]={
{"bzImage64", bzImage64_probe, bzImage64_load, bzImage64_cleanup},
+ {"elf-x86_64", elf_x86_64_probe, elf_x86_64_load, elf_x86_64_cleanup},
};

static int nr_file_types = sizeof(kexec_file_type)/sizeof(kexec_file_type[0]);
--
1.8.4.2

2014-01-27 18:58:52

by Vivek Goyal

Subject: [PATCH 04/11] kernel: Build bin2c based on config option CONFIG_BUILD_BIN2C

Currently bin2c is built only if CONFIG_IKCONFIG=y, but bin2c will now be
used by kexec too. So make its compilation dependent on CONFIG_BUILD_BIN2C,
a config option which can be selected by both CONFIG_KEXEC and CONFIG_IKCONFIG.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/Kconfig | 1 +
init/Kconfig | 5 +++++
scripts/basic/Makefile | 2 +-
3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 940e50e..aa5aeed 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1604,6 +1604,7 @@ source kernel/Kconfig.hz

config KEXEC
bool "kexec system call"
+ select BUILD_BIN2C
---help---
kexec is a system call that implements the ability to shutdown your
current kernel, and to start another kernel. It is like a reboot
diff --git a/init/Kconfig b/init/Kconfig
index 34a0a3b..32a1adb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -760,8 +760,13 @@ endchoice

endmenu # "RCU Subsystem"

+config BUILD_BIN2C
+ bool
+ default n
+
config IKCONFIG
tristate "Kernel .config support"
+ select BUILD_BIN2C
---help---
This option enables the complete Linux kernel ".config" file
contents to be saved in the kernel. It provides documentation
diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
index afbc1cd..ec10d93 100644
--- a/scripts/basic/Makefile
+++ b/scripts/basic/Makefile
@@ -9,7 +9,7 @@
# fixdep: Used to generate dependency information during build process

hostprogs-y := fixdep
-hostprogs-$(CONFIG_IKCONFIG) += bin2c
+hostprogs-$(CONFIG_BUILD_BIN2C) += bin2c
always := $(hostprogs-y)

# fixdep is needed to compile other host programs
--
1.8.4.2

2014-01-27 19:01:53

by Vivek Goyal

Subject: [PATCH 01/11] kexec: Move segment verification code in a separate function

Previously do_kimage_alloc() would allocate a kimage structure, copy the
segment list from user space and then sanity-check the segment list.

Break this function down into three parts: do_kimage_alloc_init() to do
the actual allocation and basic initialization of the kimage structure,
copy_user_segment_list() to copy the segment list from user space, and
sanity_check_segment_list() to verify the sanity of the segment list as
passed by user space.

In later patches, I need to allocate a kimage without copying a segment
list from user space. Breaking the code down into smaller functions
enables its re-use elsewhere.

Signed-off-by: Vivek Goyal <[email protected]>
---
kernel/kexec.c | 182 ++++++++++++++++++++++++++++++++-------------------------
1 file changed, 101 insertions(+), 81 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index ac73878..c0944b2 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -123,45 +123,27 @@ static struct page *kimage_alloc_page(struct kimage *image,
gfp_t gfp_mask,
unsigned long dest);

-static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
- unsigned long nr_segments,
- struct kexec_segment __user *segments)
+static int copy_user_segment_list(struct kimage *image,
+ unsigned long nr_segments,
+ struct kexec_segment __user *segments)
{
+ int ret;
size_t segment_bytes;
- struct kimage *image;
- unsigned long i;
- int result;
-
- /* Allocate a controlling structure */
- result = -ENOMEM;
- image = kzalloc(sizeof(*image), GFP_KERNEL);
- if (!image)
- goto out;
-
- image->head = 0;
- image->entry = &image->head;
- image->last_entry = &image->head;
- image->control_page = ~0; /* By default this does not apply */
- image->start = entry;
- image->type = KEXEC_TYPE_DEFAULT;
-
- /* Initialize the list of control pages */
- INIT_LIST_HEAD(&image->control_pages);
-
- /* Initialize the list of destination pages */
- INIT_LIST_HEAD(&image->dest_pages);
-
- /* Initialize the list of unusable pages */
- INIT_LIST_HEAD(&image->unuseable_pages);

/* Read in the segments */
image->nr_segments = nr_segments;
segment_bytes = nr_segments * sizeof(*segments);
- result = copy_from_user(image->segment, segments, segment_bytes);
- if (result) {
- result = -EFAULT;
- goto out;
- }
+ ret = copy_from_user(image->segment, segments, segment_bytes);
+ if (ret)
+ ret = -EFAULT;
+
+ return ret;
+}
+
+static int sanity_check_segment_list(struct kimage *image)
+{
+ int result, i;
+ unsigned long nr_segments = image->nr_segments;

/*
* Verify we have good destination addresses. The caller is
@@ -183,9 +165,9 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
mstart = image->segment[i].mem;
mend = mstart + image->segment[i].memsz;
if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK))
- goto out;
+ return result;
if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
- goto out;
+ return result;
}

/* Verify our destination addresses do not overlap.
@@ -206,7 +188,7 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
pend = pstart + image->segment[j].memsz;
/* Do the segments overlap ? */
if ((mend > pstart) && (mstart < pend))
- goto out;
+ return result;
}
}

@@ -218,18 +200,61 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
result = -EINVAL;
for (i = 0; i < nr_segments; i++) {
if (image->segment[i].bufsz > image->segment[i].memsz)
- goto out;
+ return result;
}

- result = 0;
-out:
- if (result == 0)
- *rimage = image;
- else
- kfree(image);
+ /*
+ * Verify we have good destination addresses. Normally
+ * the caller is responsible for making certain we don't
+ * attempt to load the new image into invalid or reserved
+ * areas of RAM. But crash kernels are preloaded into a
+ * reserved area of ram. We must ensure the addresses
+ * are in the reserved area otherwise preloading the
+ * kernel could corrupt things.
+ */

- return result;
+ if (image->type == KEXEC_TYPE_CRASH) {
+ result = -EADDRNOTAVAIL;
+ for (i = 0; i < nr_segments; i++) {
+ unsigned long mstart, mend;

+ mstart = image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz - 1;
+ /* Ensure we are within the crash kernel limits */
+ if ((mstart < crashk_res.start) ||
+ (mend > crashk_res.end))
+ return result;
+ }
+ }
+
+ return 0;
+}
+
+static struct kimage *do_kimage_alloc_init(void)
+{
+ struct kimage *image;
+
+ /* Allocate a controlling structure */
+ image = kzalloc(sizeof(*image), GFP_KERNEL);
+ if (!image)
+ return NULL;
+
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+ image->control_page = ~0; /* By default this does not apply */
+ image->type = KEXEC_TYPE_DEFAULT;
+
+ /* Initialize the list of control pages */
+ INIT_LIST_HEAD(&image->control_pages);
+
+ /* Initialize the list of destination pages */
+ INIT_LIST_HEAD(&image->dest_pages);
+
+ /* Initialize the list of unusable pages */
+ INIT_LIST_HEAD(&image->unuseable_pages);
+
+ return image;
}

static void kimage_free_page_list(struct list_head *list);
@@ -242,10 +267,19 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
struct kimage *image;

/* Allocate and initialize a controlling structure */
- image = NULL;
- result = do_kimage_alloc(&image, entry, nr_segments, segments);
+ image = do_kimage_alloc_init();
+ if (!image)
+ return -ENOMEM;
+
+ image->start = entry;
+
+ result = copy_user_segment_list(image, nr_segments, segments);
if (result)
- goto out;
+ goto out_free_image;
+
+ result = sanity_check_segment_list(image);
+ if (result)
+ goto out_free_image;

/*
* Find a location for the control code buffer, and add it
@@ -257,22 +291,23 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
get_order(KEXEC_CONTROL_PAGE_SIZE));
if (!image->control_code_page) {
printk(KERN_ERR "Could not allocate control_code_buffer\n");
- goto out_free;
+ goto out_free_image;
}

image->swap_page = kimage_alloc_control_pages(image, 0);
if (!image->swap_page) {
printk(KERN_ERR "Could not allocate swap buffer\n");
- goto out_free;
+ goto out_free_control_pages;
}

*rimage = image;
return 0;

-out_free:
+
+out_free_control_pages:
kimage_free_page_list(&image->control_pages);
+out_free_image:
kfree(image);
-out:
return result;
}

@@ -282,19 +317,17 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
{
int result;
struct kimage *image;
- unsigned long i;

- image = NULL;
/* Verify we have a valid entry point */
- if ((entry < crashk_res.start) || (entry > crashk_res.end)) {
- result = -EADDRNOTAVAIL;
- goto out;
- }
+ if ((entry < crashk_res.start) || (entry > crashk_res.end))
+ return -EADDRNOTAVAIL;

/* Allocate and initialize a controlling structure */
- result = do_kimage_alloc(&image, entry, nr_segments, segments);
- if (result)
- goto out;
+ image = do_kimage_alloc_init();
+ if (!image)
+ return -ENOMEM;
+
+ image->start = entry;

/* Enable the special crash kernel control page
* allocation policy.
@@ -302,25 +335,13 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
image->control_page = crashk_res.start;
image->type = KEXEC_TYPE_CRASH;

- /*
- * Verify we have good destination addresses. Normally
- * the caller is responsible for making certain we don't
- * attempt to load the new image into invalid or reserved
- * areas of RAM. But crash kernels are preloaded into a
- * reserved area of ram. We must ensure the addresses
- * are in the reserved area otherwise preloading the
- * kernel could corrupt things.
- */
- result = -EADDRNOTAVAIL;
- for (i = 0; i < nr_segments; i++) {
- unsigned long mstart, mend;
+ result = copy_user_segment_list(image, nr_segments, segments);
+ if (result)
+ goto out_free_image;

- mstart = image->segment[i].mem;
- mend = mstart + image->segment[i].memsz - 1;
- /* Ensure we are within the crash kernel limits */
- if ((mstart < crashk_res.start) || (mend > crashk_res.end))
- goto out_free;
- }
+ result = sanity_check_segment_list(image);
+ if (result)
+ goto out_free_image;

/*
* Find a location for the control code buffer, and add
@@ -332,15 +353,14 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
get_order(KEXEC_CONTROL_PAGE_SIZE));
if (!image->control_code_page) {
printk(KERN_ERR "Could not allocate control_code_buffer\n");
- goto out_free;
+ goto out_free_image;
}

*rimage = image;
return 0;

-out_free:
+out_free_image:
kfree(image);
-out:
return result;
}

--
1.8.4.2

2014-01-27 21:12:17

by Michal Marek

Subject: Re: [PATCH 03/11] bin2c: Move bin2c in scripts/basic

Dne 27.1.2014 19:57, Vivek Goyal napsal(a):
> Kexec wants to use bin2c and it wants to use it really early in the build
> process. See arch/x86/purgatory/ code in later patches.
>
> So move bin2c in scripts/basic so that it can be built very early and
> be usable by arch/x86/purgatory/
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
[...]
> diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
> index 4fcef87..afbc1cd 100644
> --- a/scripts/basic/Makefile
> +++ b/scripts/basic/Makefile
> @@ -9,6 +9,7 @@
> # fixdep: Used to generate dependency information during build process
>
> hostprogs-y := fixdep
> +hostprogs-$(CONFIG_IKCONFIG) += bin2c

Is the CONFIG_IKCONFIG dependency still correct, now that you found
another use of bin2c?

Thanks,
Michal

2014-01-27 21:19:16

by Vivek Goyal

Subject: Re: [PATCH 03/11] bin2c: Move bin2c in scripts/basic

On Mon, Jan 27, 2014 at 10:12:10PM +0100, Michal Marek wrote:
> Dne 27.1.2014 19:57, Vivek Goyal napsal(a):
> > Kexec wants to use bin2c and it wants to use it really early in the build
> > process. See arch/x86/purgatory/ code in later patches.
> >
> > So move bin2c in scripts/basic so that it can be built very early and
> > be usable by arch/x86/purgatory/
> >
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> [...]
> > diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
> > index 4fcef87..afbc1cd 100644
> > --- a/scripts/basic/Makefile
> > +++ b/scripts/basic/Makefile
> > @@ -9,6 +9,7 @@
> > # fixdep: Used to generate dependency information during build process
> >
> > hostprogs-y := fixdep
> > +hostprogs-$(CONFIG_IKCONFIG) += bin2c
>
> Is the CONFIG_IKCONFIG dependency still correct, now that you found
> another use of bin2c?

In the next patch I changed the dependency too (to CONFIG_BUILD_BIN2C). Now
both CONFIG_IKCONFIG and CONFIG_KEXEC select CONFIG_BUILD_BIN2C.

Thanks
Vivek

2014-01-27 21:54:15

by Michal Marek

Subject: Re: [PATCH 03/11] bin2c: Move bin2c in scripts/basic

Dne 27.1.2014 22:18, Vivek Goyal napsal(a):
> On Mon, Jan 27, 2014 at 10:12:10PM +0100, Michal Marek wrote:
>> Dne 27.1.2014 19:57, Vivek Goyal napsal(a):
>>> Kexec wants to use bin2c and it wants to use it really early in the build
>>> process. See arch/x86/purgatory/ code in later patches.
>>>
>>> So move bin2c in scripts/basic so that it can be built very early and
>>> be usable by arch/x86/purgatory/
>>>
>>> Signed-off-by: Vivek Goyal <[email protected]>
>>> ---
>> [...]
>>> diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
>>> index 4fcef87..afbc1cd 100644
>>> --- a/scripts/basic/Makefile
>>> +++ b/scripts/basic/Makefile
>>> @@ -9,6 +9,7 @@
>>> # fixdep: Used to generate dependency information during build process
>>>
>>> hostprogs-y := fixdep
>>> +hostprogs-$(CONFIG_IKCONFIG) += bin2c
>>
>> Is the CONFIG_IKCONFIG dependency still correct, now that you found
>> another use of bin2c?
>
> In the next patch I changed depedency too. (CONFIG_BUILD_BIN2C). And
> now CONFIG_IKCONFIG and CONFIG_KEXEC will select CONFIG_BUILD_BIN2C.

OK, then it's all fine. I only saw patch 03/11.

Thanks,
Michal

2014-02-21 14:59:18

by Borislav Petkov

Subject: Re: [PATCH 06/11] kexec: A new system call, kexec_file_load, for in kernel kexec

On Mon, Jan 27, 2014 at 01:57:46PM -0500, Vivek Goyal wrote:
> This patch implements the in kernel kexec functionality. It implements a
> new system call kexec_file_load. I think parameter list of this system
> call will change as I have not done the kernel image signature handling
> yet. I have been told that I might have to pass the detached signature
> and size as part of system call.
>
> Previously segment list was prepared in user space. Now user space just
> passes kernel fd, initrd fd and command line and kernel will create a
> segment list internally.
>
> This patch contains generic part of the code. Actual segment preparation
> and loading is done by arch and image specific loader. Which comes in
> next patch.
>
> Signed-off-by: Vivek Goyal <[email protected]>

You might want to run it through checkpatch - some of them are actually,
to my surprise, legitimate :)

Just some minor nitpicks below...

> ---
> arch/x86/kernel/machine_kexec_64.c | 50 ++++
> arch/x86/syscalls/syscall_64.tbl | 1 +
> include/linux/kexec.h | 55 +++++
> include/linux/syscalls.h | 3 +
> include/uapi/linux/kexec.h | 4 +
> kernel/kexec.c | 495 ++++++++++++++++++++++++++++++++++++-
> kernel/sys_ni.c | 1 +
> 7 files changed, 605 insertions(+), 4 deletions(-)

...

> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index d8188b3..51b56cd 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -121,13 +121,58 @@ struct kimage {
> #define KEXEC_TYPE_DEFAULT 0
> #define KEXEC_TYPE_CRASH 1
> unsigned int preserve_context : 1;
> + /* If set, we are using file mode kexec syscall */
> + unsigned int file_mode : 1;
>
> #ifdef ARCH_HAS_KIMAGE_ARCH
> struct kimage_arch arch;
> #endif
> +
> + /* Additional Fields for file based kexec syscall */
> + void *kernel_buf;
> + unsigned long kernel_buf_len;
> +
> + void *initrd_buf;
> + unsigned long initrd_buf_len;
> +
> + char *cmdline_buf;
> + unsigned long cmdline_buf_len;
> +
> + /* index of file handler in array */
> + int file_handler_idx;
> +
> + /* Image loader handling the kernel can store a pointer here */
> + void * image_loader_data;
> };
>
> +/*
> + * Keeps a track of buffer parameters as provided by caller for requesting
> + * memory placement of buffer.
> + */
> +struct kexec_buf {
> + struct kimage *image;
> + char *buffer;
> + unsigned long bufsz;
> + unsigned long memsz;
> + unsigned long buf_align;
> + unsigned long buf_min;
> + unsigned long buf_max;
> + int top_down; /* allocate from top of memory hole */

Looks like this wants to be a bool.

...

> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index d6629d4..5fddb1b 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
> @@ -13,6 +13,10 @@
> #define KEXEC_PRESERVE_CONTEXT 0x00000002
> #define KEXEC_ARCH_MASK 0xffff0000
>
> +/* Kexec file load interface flags */
> +#define KEXEC_FILE_UNLOAD 0x00000001
> +#define KEXEC_FILE_ON_CRASH 0x00000002

BIT()

> +
> /* These values match the ELF architecture values.
> * Unless there is a good reason that should continue to be the case.
> */
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index c0944b2..b28578a 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -123,6 +123,11 @@ static struct page *kimage_alloc_page(struct kimage *image,
> gfp_t gfp_mask,
> unsigned long dest);
>
> +void kimage_set_start_addr(struct kimage *image, unsigned long start)
> +{
> + image->start = start;
> +}

Why a separate function? It is used only once in the next patch.

...

> +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> + int initrd_fd, const char __user *cmdline_ptr,
> + unsigned long cmdline_len)
> +{
> + int ret = 0;
> + void *ldata;
> +
> + ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> + &image->kernel_buf_len);
> + if (ret)
> + goto out;
> +
> + /* Call arch image probe handlers */
> + ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> + image->kernel_buf_len);
> +
> + if (ret)
> + goto out;
> +
> + ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> + &image->initrd_buf_len);
> + if (ret)
> + goto out;
> +
> + image->cmdline_buf = vzalloc(cmdline_len);
> + if (!image->cmdline_buf)
> + goto out;
> +
> + ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
> + if (ret) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + image->cmdline_buf_len = cmdline_len;
> +
> + /* command line should be a string with last byte null */
> + if (image->cmdline_buf[cmdline_len - 1] != '\0') {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + /* Call arch image load handlers */
> + ldata = arch_kexec_kernel_image_load(image,
> + image->kernel_buf, image->kernel_buf_len,
> + image->initrd_buf, image->initrd_buf_len,
> + image->cmdline_buf, image->cmdline_buf_len);
> +
> + if (IS_ERR(ldata)) {
> + ret = PTR_ERR(ldata);
> + goto out;
> + }
> +
> + image->image_loader_data = ldata;
> +out:
> + return ret;

You probably want to drop this "out:" label and simply return the error
value directly in each error path above for simplicity.

> +static int kimage_file_normal_alloc(struct kimage **rimage, int kernel_fd,
> + int initrd_fd, const char __user *cmdline_ptr,
> + unsigned long cmdline_len)
> +{
> + int result;
> + struct kimage *image;
> +
> + /* Allocate and initialize a controlling structure */
> + image = do_kimage_alloc_init();
> + if (!image)
> + return -ENOMEM;
> +
> + image->file_mode = 1;
> + image->file_handler_idx = -1;
> +
> + result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
> + cmdline_ptr, cmdline_len);
> + if (result)
> + goto out_free_image;
> +
> + result = sanity_check_segment_list(image);
> + if (result)
> + goto out_free_post_load_bufs;

Dunno, it could probably be a larger restructuring effort but if you
do load a segment and sanity-check it right after loading, instead of
loading all of them first and then iterating over them, you could save
yourself some work in the error case when a segment fails the check.

...

> +int kexec_add_buffer(struct kimage *image, char *buffer,
> + unsigned long bufsz, unsigned long memsz,
> + unsigned long buf_align, unsigned long buf_min,
> + unsigned long buf_max, int top_down, unsigned long *load_addr)
> +{
> +
> + unsigned long nr_segments = image->nr_segments, new_nr_segments;
> + struct kexec_segment *ksegment;
> + struct kexec_buf *kbuf;
> +
> + /* Currently adding segment this way is allowed only in file mode */
> + if (!image->file_mode)
> + return -EINVAL;
> +
> + if (nr_segments >= KEXEC_SEGMENT_MAX)
> + return -EINVAL;
> +
> + /*
> + * Make sure we are not trying to add buffer after allocating
> + * control pages. All segments need to be placed first before
> + * any control pages are allocated. As control page allocation
> + * logic goes through list of segments to make sure there are
> + * no destination overlaps.
> + */
> + WARN_ONCE(!list_empty(&image->control_pages), "Adding kexec buffer"
> + " after allocating control pages\n");
> +
> + kbuf = kzalloc(sizeof(struct kexec_buf), GFP_KERNEL);
> + if (!kbuf)
> + return -ENOMEM;
> +
> + kbuf->image = image;
> + kbuf->buffer = buffer;
> + kbuf->bufsz = bufsz;
> + /* Align memsz to next page boundary */
> + kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
> +
> + /* Align to atleast page size boundary */
> + kbuf->buf_align = max(buf_align, PAGE_SIZE);
> + kbuf->buf_min = buf_min;
> + kbuf->buf_max = buf_max;
> + kbuf->top_down = top_down;
> +
> + /* Walk the RAM ranges and allocate a suitable range for the buffer */
> + walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> +
> + kbuf->image = NULL;
> + kfree(kbuf);

This is freed after kzalloc'ing it a bit earlier, why not make it a
stack variable for simplicity? struct kexec_buf doesn't seem that
large...

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-24 18:00:32

by Vivek Goyal

Subject: Re: [PATCH 06/11] kexec: A new system call, kexec_file_load, for in kernel kexec

On Fri, Feb 21, 2014 at 03:59:10PM +0100, Borislav Petkov wrote:

[..]
> You might want to run it through checkpatch - some of them are actually,
> to my surprise, legitimate :)

Thanks for having a look at the patches. I will run patches through
checkpatch before next posting.

[..]
> > +struct kexec_buf {
> > + struct kimage *image;
> > + char *buffer;
> > + unsigned long bufsz;
> > + unsigned long memsz;
> > + unsigned long buf_align;
> > + unsigned long buf_min;
> > + unsigned long buf_max;
> > + int top_down; /* allocate from top of memory hole */
>
> Looks like this wants to be a bool.

Will change.

>
> ...
>
> > diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> > index d6629d4..5fddb1b 100644
> > --- a/include/uapi/linux/kexec.h
> > +++ b/include/uapi/linux/kexec.h
> > @@ -13,6 +13,10 @@
> > #define KEXEC_PRESERVE_CONTEXT 0x00000002
> > #define KEXEC_ARCH_MASK 0xffff0000
> >
> > +/* Kexec file load interface flags */
> > +#define KEXEC_FILE_UNLOAD 0x00000001
> > +#define KEXEC_FILE_ON_CRASH 0x00000002
>
> BIT()

What's that?

>
> > +
> > /* These values match the ELF architecture values.
> > * Unless there is a good reason that should continue to be the case.
> > */
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index c0944b2..b28578a 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -123,6 +123,11 @@ static struct page *kimage_alloc_page(struct kimage *image,
> > gfp_t gfp_mask,
> > unsigned long dest);
> >
> > +void kimage_set_start_addr(struct kimage *image, unsigned long start)
> > +{
> > + image->start = start;
> > +}
>
> Why a separate function? It is used only once in the next patch.

Right now there is only one user. But once other image loader support
comes along or other arches support file-based kexec, they can make
use of the same function.

This is a pretty important modification as it decides the starting
point of the next kernel image. I wanted to make it a function callable by
users who want to modify it, instead of letting them directly
modify image->start.

[..]
> > + /* Call arch image load handlers */
> > + ldata = arch_kexec_kernel_image_load(image,
> > + image->kernel_buf, image->kernel_buf_len,
> > + image->initrd_buf, image->initrd_buf_len,
> > + image->cmdline_buf, image->cmdline_buf_len);
> > +
> > + if (IS_ERR(ldata)) {
> > + ret = PTR_ERR(ldata);
> > + goto out;
> > + }
> > +
> > + image->image_loader_data = ldata;
> > +out:
> > + return ret;
>
> You probably want to drop this "out:" label and simply return the error
> value directly in each error path above for simplicity.
>

Ok, will do. I don't seem to be doing anything at "out" except returning ret,
so I will get rid of this label and return early in the error path.

> > +static int kimage_file_normal_alloc(struct kimage **rimage, int kernel_fd,
> > + int initrd_fd, const char __user *cmdline_ptr,
> > + unsigned long cmdline_len)
> > +{
> > + int result;
> > + struct kimage *image;
> > +
> > + /* Allocate and initialize a controlling structure */
> > + image = do_kimage_alloc_init();
> > + if (!image)
> > + return -ENOMEM;
> > +
> > + image->file_mode = 1;
> > + image->file_handler_idx = -1;
> > +
> > + result = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
> > + cmdline_ptr, cmdline_len);
> > + if (result)
> > + goto out_free_image;
> > +
> > + result = sanity_check_segment_list(image);
> > + if (result)
> > + goto out_free_post_load_bufs;
>
> Dunno, it could probably be a larger restructuring effort but if you
> do load a segment and sanity-check it right after loading, instead of
> loading all of them first and then iterating over them, you could save
> yourself some work in the error case when a segment fails the check.
>

Could be. I am trying to reuse existing code. In the current code, user space
passes a list of segments and then the kernel calls this function to make sure
all the segments passed in make sense.

Here the arch-dependent part of the kernel is preparing a list of segments
it wants, and then generic code reuses the existing function to make
sure the segment list is sane. I like the code reuse part of it.

Also, segment loading has not taken place yet. Segments are loaded later
in kimage_load_segment().

[..]
> > + /* Walk the RAM ranges and allocate a suitable range for the buffer */
> > + walk_system_ram_res(0, -1, kbuf, walk_ram_range_callback);
> > +
> > + kbuf->image = NULL;
> > + kfree(kbuf);
>
> This is freed after kzalloc'ing it a bit earlier, why not make it a
> stack variable for simplicity? struct kexec_buf doesn't seem that
> large...

Ya, I was not sure whether to make it a stack variable or allocate it
dynamically. It seems to be roughly 60 bytes in size on 64bit. I guess
I will convert it into a stack variable.
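The conversion being discussed can be sketched roughly like this, using a hypothetical subset of struct kexec_buf and plain C stand-ins (the real code uses kzalloc()/kfree() and walk_system_ram_res()):

```c
#include <string.h>

/* Hypothetical subset of struct kexec_buf, for illustration only. */
struct kexec_buf {
	void *image;
	char *buffer;
	unsigned long bufsz;
	unsigned long memsz;
	unsigned long buf_align;
	unsigned long buf_min;
	unsigned long buf_max;
	int top_down;
};

/* Instead of allocating the structure dynamically and freeing it at the
 * end, use a zero-initialized stack variable; at ~60 bytes it is small
 * enough for the stack. */
unsigned long locate_hole(unsigned long memsz, int top_down)
{
	struct kexec_buf kbuf;

	memset(&kbuf, 0, sizeof(kbuf));	/* matches kzalloc's zeroing */
	kbuf.memsz = memsz;
	kbuf.top_down = top_down;
	/* ... walk the RAM ranges using &kbuf as the callback argument ... */
	return kbuf.buf_min;	/* placeholder result */
}
```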

Thanks
Vivek

2014-02-24 19:09:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On 01/27/2014 10:57 AM, Vivek Goyal wrote:
> +
> +/**
> + * memcpy - Copy one area of memory to another
> + * @dest: Where to copy to
> + * @src: Where to copy from
> + * @count: The size of the area.
> + */
> +static void *memcpy(void *dest, const void *src, unsigned long count)
> +{
> + char *tmp = dest;
> + const char *s = src;
> +
> + while (count--)
> + *tmp++ = *s++;
> + return dest;
> +}
> +
> +static int memcmp(const void *cs, const void *ct, size_t count)
> +{
> + const unsigned char *su1, *su2;
> + int res = 0;
> +
> + for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
> + if ((res = *su1 - *su2) != 0)
> + break;
> + return res;
> +}
> +

<bikeshed>

There are multiple implementations of memcpy(), memcmp() and memset() in
this patchset, and they make my eyes want to bleed (especially
memcmp()). Can we centralize these, and perhaps even share code with
the stuff in arch/x86/boot already?

</bikeshed>

-hpa

2014-02-25 16:44:14

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On Mon, Feb 24, 2014 at 11:08:11AM -0800, H. Peter Anvin wrote:
> On 01/27/2014 10:57 AM, Vivek Goyal wrote:
> > +
> > +/**
> > + * memcpy - Copy one area of memory to another
> > + * @dest: Where to copy to
> > + * @src: Where to copy from
> > + * @count: The size of the area.
> > + */
> > +static void *memcpy(void *dest, const void *src, unsigned long count)
> > +{
> > + char *tmp = dest;
> > + const char *s = src;
> > +
> > + while (count--)
> > + *tmp++ = *s++;
> > + return dest;
> > +}
> > +
> > +static int memcmp(const void *cs, const void *ct, size_t count)
> > +{
> > + const unsigned char *su1, *su2;
> > + int res = 0;
> > +
> > + for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
> > + if ((res = *su1 - *su2) != 0)
> > + break;
> > + return res;
> > +}
> > +
>
> <bikeshed>
>
> There are multiple implementations of memcpy(), memcmp() and memset() in
> this patchset, and they make my eyes want to bleed (especially
> memcmp()). Can we centralize these, and perhaps even share code with
> the stuff in arch/x86/boot already?
>
> </bikeshed>

Hi hpa,

There are multiple implementations of memcpy() only (sha256.c and
purgatory.c). I will merge the two and make them use a single definition of
memcpy().

I can't see multiple implementations of memset() or memcmp() in the
purgatory code.

W.r.t. sharing the code with arch/x86/boot/, I am not sure how to do it.

I see two implementations of memcpy() under arch/x86/boot.

One is in copy.S. This is assembly code and looks like it is supposed to
run in 16bit mode (.code16).

The other one is in compressed/misc.c, and there are two definitions, one
for 32bit and one for 64bit.

I am not sure why there is a need to write memcpy() in assembly when
C will do just fine for my case. That way I don't have to write two versions
of memcpy(); one version works for both 32bit and 64bit.

So I can just make all the purgatory code share the same version of memcpy(),
memcmp() and memset(); is that fine? I have taken the implementations of
these functions from lib/string.c.
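For reference, the lib/string.c-style routines being discussed look roughly like this (a freestanding sketch, not the exact kernel code; the `my_` prefixes are only to avoid clashing with libc here):

```c
#include <stddef.h>

/* Simple byte-at-a-time routines in the style of lib/string.c:
 * correctness over speed, suitable for a minimal environment with
 * no libc and no kernel to link against. */
static void *my_memcpy(void *dest, const void *src, size_t count)
{
	char *d = dest;
	const char *s = src;

	while (count--)
		*d++ = *s++;
	return dest;
}

static void *my_memset(void *s, int c, size_t count)
{
	char *p = s;

	while (count--)
		*p++ = (char)c;
	return s;
}

static int my_memcmp(const void *cs, const void *ct, size_t count)
{
	const unsigned char *su1 = cs, *su2 = ct;

	for (; count > 0; ++su1, ++su2, count--) {
		int res = *su1 - *su2;

		if (res != 0)
			return res;
	}
	return 0;
}
```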

Thanks
Vivek

>
> -hpa

2014-02-25 16:56:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On 02/25/2014 08:43 AM, Vivek Goyal wrote:
>
> W.r.t sharing the code with arch/x86/boot/, I am not sure how to do it.
>

Pretty much we have been doing #includes (a bit sad, I know)... there
are already a lot of them between arch/x86/boot,
arch/x86/boot/compressed, and arch/x86/realmode. In that sense
collecting these "limited environments" together and have these kinds of
stuff together in one place seems like a good idea.

Does purgatory move large amounts of data around? If so, we probably
*do* want to use rep movsl, but otherwise you're definitely right, using
C code makes more sense.

> I see two implementations of memcpy() under arch/x86/boot.
>
> One is in copy.S. This is assembly code and looks like is supposed to
> run in 16bit mode. (code16).
>
> Other one is in compressed/misc.c and there are two definitions, one
> for 32bit and one fore 64bit.

They are basically the 16-, 32-, and 64-bit variants of the same code.

> I am not sure why there is a need to write memcpy() in assembly when
> C will do just fine for my case. I don't have to write two versions of
> memcpy() and use it both for 32bit and 64bit.

The point would be to use the ones we already have.

> So I can just make all the purgatory code share same version of memcpy(),
> memcmp() and memset(), is that fine. I have taken implementations of
> these functions from lib/string.c

It depends on if you care about performance. For memcpy() and memset()
in particular, the CPU has internally optimized versions of these that
beats C at least on any newer silicon.

-hpa

2014-02-25 18:39:14

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 08/11] kexec-bzImage: Support for loading bzImage using 64bit entry

On 01/27/2014 10:57 AM, Vivek Goyal wrote:
> This is loader specific code which can load bzImage and set it up for
> 64bit entry. This does not take care of 32bit entry or real mode entry
> yet.

Is there any use in that? Real mode entry especially is more than a bit
scary when coming from another kernel already...

-hpa

2014-02-25 18:44:14

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 08/11] kexec-bzImage: Support for loading bzImage using 64bit entry

On Tue, Feb 25, 2014 at 10:38:24AM -0800, H. Peter Anvin wrote:
> On 01/27/2014 10:57 AM, Vivek Goyal wrote:
> > This is loader specific code which can load bzImage and set it up for
> > 64bit entry. This does not take care of 32bit entry or real mode entry
> > yet.
>
> Is there any use in that? Real mode entry especially is more than a bit
> scary when coming from another kernel already...

I think 64bit entry should be good for x86_64. When we start supporting
this new system call on x86(32bit), then we can implement 32bit entry
support (which should not be hard).

I have no plans to implement 16bit entry support. I just wanted to mention
explicitly what is currently supported.

Thanks
Vivek

2014-02-25 19:24:25

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On Tue, Feb 25, 2014 at 08:55:42AM -0800, H. Peter Anvin wrote:
> On 02/25/2014 08:43 AM, Vivek Goyal wrote:
> >
> > W.r.t sharing the code with arch/x86/boot/, I am not sure how to do it.
> >
>
> Pretty much we have been doing #includes (a bit sad, I know)... there
> are already a lot of them between arch/x86/boot,
> arch/x86/boot/compressed, and arch/x86/realmode. In that sense
> collecting these "limited environments" together and have these kinds of
> stuff together in one place seems like a good idea.
>
> Does purgatory move large amounts of data around? If so, we probably
> *do* want to use rep movsl, but otherwise you're definitely right, using
> C code makes more sense.

No, I don't move lots of data around in purgatory. We just copy the backup
region, and that's 640KB of data.

So we don't copy a lot, and it is not a performance-sensitive path.

Hence I would like to keep it simple. That is, have a C version of memcpy()
in purgatory which works both for 32bit and 64bit, and reuse that across
the purgatory code.

Please let me know if you don't like the idea and you still think there
should be a shared implementation between arch/x86/boot/ and
arch/x86/purgatory/.

Thanks
Vivek

2014-02-25 19:35:25

by Petr Tesařík

[permalink] [raw]
Subject: Re: [PATCH 06/11] kexec: A new system call, kexec_file_load, for in kernel kexec

On Mon, 24 Feb 2014 11:41:31 -0500
Vivek Goyal <[email protected]> wrote:

> On Fri, Feb 21, 2014 at 03:59:10PM +0100, Borislav Petkov wrote:
>
>[...]
> >
> > ...
> >
> > > diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> > > index d6629d4..5fddb1b 100644
> > > --- a/include/uapi/linux/kexec.h
> > > +++ b/include/uapi/linux/kexec.h
> > > @@ -13,6 +13,10 @@
> > > #define KEXEC_PRESERVE_CONTEXT 0x00000002
> > > #define KEXEC_ARCH_MASK 0xffff0000
> > >
> > > +/* Kexec file load interface flags */
> > > +#define KEXEC_FILE_UNLOAD 0x00000001
> > > +#define KEXEC_FILE_ON_CRASH 0x00000002
> >
> > BIT()
>
> What's that?

#define BIT(nr) (1UL << (nr))

For my part I'm not convinced it's a better way to do it, unless
Borislav also wanted to suggest adding an enum for the bit number
values...

Petr T

2014-02-25 21:10:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On 02/25/2014 10:20 AM, Vivek Goyal wrote:
>
> Please let me know if you don't like the idea and you still think there
> should be a shared implementation between arch/x86/boot/ and
> arch/x86/purgatory/.
>

That is what I would *prefer*. There are some other string functions in
arch/x86/boot/string.c which also ought to be sharable (and are in
places already.)

-hpa

2014-02-25 21:47:54

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 06/11] kexec: A new system call, kexec_file_load, for in kernel kexec

On Tue, Feb 25, 2014 at 08:35:19PM +0100, Petr Tesarik wrote:
> #define BIT(nr) (1UL << (nr))
>
> For my part I'm not convinced it's a better way to do it, unless
> Borislav also wanted to suggest adding an enum for the bit number
> values...

Well,

#define KEXEC_FILE_UNLOAD BIT(1)

is much more readable.

The enum thing is also nice to have but it's not my personal preference.
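A userspace sketch of the suggestion (in the kernel, BIT() comes from the bitops headers); note that the existing values 0x1 and 0x2 correspond to BIT(0) and BIT(1):

```c
/* Userspace stand-in for the kernel's BIT() helper. */
#define BIT(nr)			(1UL << (nr))

/* Kexec file load interface flags, expressed with BIT(). */
#define KEXEC_FILE_UNLOAD	BIT(0)	/* was 0x00000001 */
#define KEXEC_FILE_ON_CRASH	BIT(1)	/* was 0x00000002 */
```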

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-26 14:53:17

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On Tue, Feb 25, 2014 at 01:09:34PM -0800, H. Peter Anvin wrote:
> On 02/25/2014 10:20 AM, Vivek Goyal wrote:
> >
> > Please let me know if you don't like the idea and you still think there
> > should be a shared implementation between arch/x86/boot/ and
> > arch/x86/purgatory/.
> >
>
> That is what I would *prefer*. There are some other string functions in
> arch/x86/boot/string.c which also ought to be sharable (and are in
> places already.)
>

Ok, I will look into it. Possibly move the optimized 32bit and 64bit versions
of memcpy into arch/x86/boot/string.c and include them in the purgatory files.

Or we could define new C versions of the routines in boot/string.c and leave
the optimized versions where they are.

Thanks
Vivek

2014-02-26 15:37:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 06/11] kexec: A new system call, kexec_file_load, for in kernel kexec

On Mon, Feb 24, 2014 at 11:41:31AM -0500, Vivek Goyal wrote:
> Right now there is only one user. But once other image loader support
> comes along or other arches support file based kexec, they can make
> use of same function.
>
> This is a pretty important modification as it decides what's the starting
> point of next kernel image. I wanted to make it a function callable by
> users who wanted to modify it instead of of letting them directly
> modify image->start.

But are you expecting any other way to set image->start by the other
arches/image loaders?

> Could be. I am trying to reuse exisitng code. In current code, user space
> passes a list of segments and then kernel calls this function to make sure
> all the segments passed in make sense.
>
> Here also arch dependent part of kernel is preparing a list of segments
> wanted and then generic code is reusing the existing function to make
> sure segment list is sane. I like code reuse part of it.

Right, ok, this was just an idea. Maybe not worth it.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-26 16:00:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On Mon, Jan 27, 2014 at 01:57:47PM -0500, Vivek Goyal wrote:
> Create a stand alone relocatable object purgatory which runs between two
> kernels. This name, concept and some code has been taken from kexec-tools.

... and the concept probably originates from Dante's "Divine Comedy" :-P

> Idea is that this code runs after a crash and it runs in minimal environment.
> So keep it separate from rest of the kernel and in long term we will have
> to practically do no maintenance of this code.
>
> This code also has the logic to verify sha256 hashes of various
> segments which have been loaded into memory. So first we verify that
> the kernel we are jumping to is fine and has not been corrupted, and
> make progress only if the checksums are verified.
>
> This code also takes care of copying some memory contents to backup region.
>
> sha256 hash related code has been taken from crypto/sha256_generic.c. I
> could not call into functions exported by sha256_generic.c directly as
> we don't link against the kernel. Purgatory is a stand alone object.
>
> Also sha256_generic.c is supposed to work with higher level crypto
> abstractions and APIs and there was no point in importing all that in
> purgatory. So instead of doing #include on sha256_generic.c I just
> copied relevant portions of code into arch/x86/purgatory/sha256.c. Now
> we shouldn't have to touch this code at all. Do let me know if there are
> better ways to handle it.

Ok, but what about configurable hash algorithms? Maybe there are
people who don't want to use sha-2 and prefer something else instead.

> Signed-off-by: Vivek Goyal <[email protected]>

Also checkpatch:

...
total: 429 errors, 5 warnings, 1409 lines checked

> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 13b22e0..fedcd16 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -160,6 +160,11 @@ archscripts: scripts_basic
> archheaders:
> $(Q)$(MAKE) $(build)=arch/x86/syscalls all
>
> +archprepare:
> +ifeq ($(CONFIG_KEXEC),y)
> + $(Q)$(MAKE) $(build)=arch/x86/purgatory arch/x86/purgatory/kexec-purgatory.c
> +endif

I wonder if this could be put into arch/x86/boot/ and built there too...
But hpa said that already.

...

> diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
> new file mode 100644
> index 0000000..e405c0f
> --- /dev/null
> +++ b/arch/x86/purgatory/entry64.S
> @@ -0,0 +1,111 @@
> +/*
> + * Copyright (C) 2003,2004 Eric Biederman ([email protected])
> + * Copyright (C) 2014 Red Hat Inc.
> +
> + * Author(s): Vivek Goyal <[email protected]>
> + *
> + * This code has been taken from kexec-tools.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation (version 2 of the License).
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Yeah, can we drop this boilerplate gunk and refer to COPYING instead.

> + */
> +
> + .text
> + .balign 16
> + .code64
> + .globl entry64, entry64_regs
> +
> +
> +entry64:
> + /* Setup a gdt that should be preserved */
> + lgdt gdt(%rip)
> +
> + /* load the data segments */
> + movl $0x18, %eax /* data segment */
> + movl %eax, %ds
> + movl %eax, %es
> + movl %eax, %ss
> + movl %eax, %fs
> + movl %eax, %gs
> +
> + /* Setup new stack */
> + leaq stack_init(%rip), %rsp
> + pushq $0x10 /* CS */
> + leaq new_cs_exit(%rip), %rax
> + pushq %rax
> + lretq
> +new_cs_exit:
> +
> + /* Load the registers */
> + movq rax(%rip), %rax
> + movq rbx(%rip), %rbx
> + movq rcx(%rip), %rcx
> + movq rdx(%rip), %rdx
> + movq rsi(%rip), %rsi
> + movq rdi(%rip), %rdi
> + movq rsp(%rip), %rsp
> + movq rbp(%rip), %rbp
> + movq r8(%rip), %r8
> + movq r9(%rip), %r9
> + movq r10(%rip), %r10
> + movq r11(%rip), %r11
> + movq r12(%rip), %r12
> + movq r13(%rip), %r13
> + movq r14(%rip), %r14
> + movq r15(%rip), %r15

Huh, is the purpose to simply clear the arch registers here?

xor %reg,%reg

?

If so, you don't need the entry64_regs and below definitions at all.
From the looks of it, though, something's populating those regs before
we jump to rip below...

> +
> + /* Jump to the new code... */
> + jmpq *rip(%rip)
> +
> + .section ".rodata"
> + .balign 4
> +entry64_regs:
> +rax: .quad 0x00000000
> +rbx: .quad 0x00000000
> +rcx: .quad 0x00000000
> +rdx: .quad 0x00000000
> +rsi: .quad 0x00000000
> +rdi: .quad 0x00000000
> +rsp: .quad 0x00000000
> +rbp: .quad 0x00000000
> +r8: .quad 0x00000000
> +r9: .quad 0x00000000
> +r10: .quad 0x00000000
> +r11: .quad 0x00000000
> +r12: .quad 0x00000000
> +r13: .quad 0x00000000
> +r14: .quad 0x00000000
> +r15: .quad 0x00000000
> +rip: .quad 0x00000000
> + .size entry64_regs, . - entry64_regs
> +
> + /* GDT */
> + .section ".rodata"
> + .balign 16
> +gdt:
> + /* 0x00 unusable segment
> + * 0x08 unused
> + * so use them as gdt ptr
> + */
> + .word gdt_end - gdt - 1
> + .quad gdt
> + .word 0, 0, 0
> +
> + /* 0x10 4GB flat code segment */
> + .word 0xFFFF, 0x0000, 0x9A00, 0x00AF
> +
> + /* 0x18 4GB flat data segment */
> + .word 0xFFFF, 0x0000, 0x9200, 0x00CF
> +gdt_end:
> +stack: .quad 0, 0
> +stack_init:
> diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
> new file mode 100644
> index 0000000..375cfb7
> --- /dev/null
> +++ b/arch/x86/purgatory/purgatory.c
> @@ -0,0 +1,103 @@
> +/*
> + * purgatory: Runs between two kernels
> + *
> + * Copyright (C) 2013 Red Hat Inc.
> + *
> + * Author:
> + *
> + * Vivek Goyal <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation (version 2 of the License).
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Ditto for the boilerplate gunk.

> + */
> +
> +#include "sha256.h"
> +
> +struct sha_region {
> + unsigned long start;
> + unsigned long len;
> +};
> +
> +unsigned long backup_dest = 0;
> +unsigned long backup_src = 0;
> +unsigned long backup_sz = 0;
> +
> +u8 sha256_digest[SHA256_DIGEST_SIZE] = { 0 };
> +
> +struct sha_region sha_regions[16] = {};
> +
> +/**
> + * memcpy - Copy one area of memory to another
> + * @dest: Where to copy to
> + * @src: Where to copy from
> + * @count: The size of the area.
> + */
> +static void *memcpy(void *dest, const void *src, unsigned long count)
> +{
> + char *tmp = dest;
> + const char *s = src;
> +
> + while (count--)
> + *tmp++ = *s++;
> + return dest;
> +}
> +
> +static int memcmp(const void *cs, const void *ct, size_t count)
> +{
> + const unsigned char *su1, *su2;
> + int res = 0;
> +
> + for (su1 = cs, su2 = ct; 0 < count; ++su1, ++su2, count--)
> + if ((res = *su1 - *su2) != 0)
> + break;
> + return res;
> +}
> +
> +static int copy_backup_region(void)
> +{
> + if (backup_dest)
> + memcpy((void *)backup_dest, (void *)backup_src, backup_sz);
> +
> + return 0;
> +}
> +
> +int verify_sha256_digest(void)
> +{
> + struct sha_region *ptr, *end;
> + u8 digest[SHA256_DIGEST_SIZE];
> + struct sha256_state sctx;
> +
> + sha256_init(&sctx);
> + end = &sha_regions[sizeof(sha_regions)/sizeof(sha_regions[0])];
> + for (ptr = sha_regions; ptr < end; ptr++)
> + sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len);
> +
> + sha256_final(&sctx, digest);
> +
> + if (memcmp(digest, sha256_digest, sizeof(digest)) != 0)
> + return 1;
> +
> + return 0;
> +}
> +
> +void purgatory(void)
> +{
> + int ret;
> +
> + ret = verify_sha256_digest();

Yeah, again, hardcoding sha256 is kinda yucky. We probably should link in
the needed crypto API stuff and support multiple hash algos.

> + if (ret) {
> + /* loop forever */
> + for(;;);
> + }
> + copy_backup_region();

What is this thing supposed to do? I see in patch 11/11
arch_update_purgatory() does some preparations for KEXEC_TYPE_CRASH.

> +}
> diff --git a/arch/x86/purgatory/setup-x86_32.S b/arch/x86/purgatory/setup-x86_32.S
> new file mode 100644
> index 0000000..a9d5aa5
> --- /dev/null
> +++ b/arch/x86/purgatory/setup-x86_32.S
> @@ -0,0 +1,29 @@
> +/*
> + * purgatory: setup code
> + *
> + * Copyright (C) 2014 Red Hat Inc.
> + *
> + * This code has been taken from kexec-tools.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation (version 2 of the License).
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.


Ditto on the boilerplate.

> + */
> +
> + .text
> + .globl purgatory_start
> + .balign 16
> +purgatory_start:
> + .code32
> +
> + /* This is just a stub. Write code when 32bit support comes along */

I'm guessing we want to support 32-bit secure boot with kexec at some
point...

> + call purgatory
> diff --git a/arch/x86/purgatory/setup-x86_64.S b/arch/x86/purgatory/setup-x86_64.S
> new file mode 100644
> index 0000000..d23bc54
> --- /dev/null
> +++ b/arch/x86/purgatory/setup-x86_64.S
> @@ -0,0 +1,68 @@
> +/*
> + * purgatory: setup code
> + *
> + * Copyright (C) 2003,2004 Eric Biederman ([email protected])
> + * Copyright (C) 2014 Red Hat Inc.
> + *
> + * This code has been taken from kexec-tools.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation (version 2 of the License).
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Boilerplate gunk.

...

Bah, that's a huge patch - it could use some splitting.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-26 16:33:43

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On Wed, Feb 26, 2014 at 05:00:08PM +0100, Borislav Petkov wrote:

[..]
> > Also sha256_generic.c is supposed to work with higher level crypto
> > abstrations and API and there was no point in importing all that in
> > purgatory. So instead of doing #include on sha256_generic.c I just
> > copied relevant portions of code into arch/x86/purgatory/sha256.c. Now
> > we shouldn't have to touch this code at all. Do let me know if there are
> > better ways to handle it.
>
> Ok, but what about configurable hash algorithms? Maybe there are
> people who don't want to use sha-2 and prefer something else instead.

We have been using sha256 for the last 7-8 years in kexec-tools and nobody
has asked to change the hash algorithm so far.

So yes, somebody wanting to use a different algorithm is a possibility,
but I don't think it is likely in the near future.

This patchset is already very big. I would rather make it work with
sha256, and down the line one can make it more generic if they feel
the need. This is not a user space API/ABI, so one should be able to
change it without breaking any user space applications.

So I really don't feel the need to make it more complicated and pull
all the crypto API into purgatory to support other kinds of hash
algorithms.

>
> > Signed-off-by: Vivek Goyal <[email protected]>
>
> Also checkpatch:
>
> ...
> total: 429 errors, 5 warnings, 1409 lines checked

Will run checkpatch.

>
> > diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> > index 13b22e0..fedcd16 100644
> > --- a/arch/x86/Makefile
> > +++ b/arch/x86/Makefile
> > @@ -160,6 +160,11 @@ archscripts: scripts_basic
> > archheaders:
> > $(Q)$(MAKE) $(build)=arch/x86/syscalls all
> >
> > +archprepare:
> > +ifeq ($(CONFIG_KEXEC),y)
> > + $(Q)$(MAKE) $(build)=arch/x86/purgatory arch/x86/purgatory/kexec-purgatory.c
> > +endif
>
> I wonder if this could be put into arch/x86/boot/ and built there too...
> But hpa said that already.

I thought hpa wanted to share the memcpy(), memset() and memcmp() functions
from arch/x86/. He did not ask to move arch/x86/purgatory/ into
arch/x86/boot.

I personally felt that arch/x86/boot/ has all the code to build the kernel,
and the purgatory code does not deal with that. So I felt it is better to
create arch/x86/purgatory.

hpa, what do you think? Is there any strong reason that the purgatory/ dir
should be under arch/x86/boot/ and not under arch/x86/?

[..]
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
>
> Yeah, can we drop this boilerplate gunk and refer to COPYING instead.

Ok. Will do.

[..]
> > + /* Load the registers */
> > + movq rax(%rip), %rax
> > + movq rbx(%rip), %rbx
> > + movq rcx(%rip), %rcx
> > + movq rdx(%rip), %rdx
> > + movq rsi(%rip), %rsi
> > + movq rdi(%rip), %rdi
> > + movq rsp(%rip), %rsp
> > + movq rbp(%rip), %rbp
> > + movq r8(%rip), %r8
> > + movq r9(%rip), %r9
> > + movq r10(%rip), %r10
> > + movq r11(%rip), %r11
> > + movq r12(%rip), %r12
> > + movq r13(%rip), %r13
> > + movq r14(%rip), %r14
> > + movq r15(%rip), %r15
>
> Huh, is the purpose to simply clear the arch registers here?
>
> xor %reg,%reg

No. One can modify the purgatory object's entry64_regs symbol dynamically
from outside, and the purpose of this code is to load those values into
registers. At compile time the value of entry64_regs is 0, so it kind of
gives the impression that we are just trying to zero the registers.

At kernel load time, we set the values of some of those registers. The stack
and the kernel entry point are among them. Look at
arch/x86/kernel/kexec-bzimage.c:

regs64.rbx = 0; /* Bootstrap Processor */
regs64.rsi = bootparam_load_addr;
regs64.rip = kernel_load_addr + 0x200;
regs64.rsp = (unsigned long)stack;
ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
sizeof(regs64), 0);
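The mechanism above can be pictured as patching a register-file structure over the zero-initialized entry64_regs data inside the loaded purgatory blob. A simplified userspace sketch (the struct layout mirrors entry64.S; the function name and offset resolution are illustrative, where the real code resolves the symbol offset from the purgatory ELF symbol table):

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the entry64_regs layout in entry64.S
 * (rax..r15 followed by rip). */
struct regs64 {
	uint64_t rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp;
	uint64_t r8, r9, r10, r11, r12, r13, r14, r15;
	uint64_t rip;
};

/* Copy the register values into the loaded purgatory image at the
 * symbol's resolved offset; the 64bit entry stub then loads these
 * values into the CPU registers and jumps to rip. */
static void set_entry64_regs(uint8_t *purgatory_buf, size_t sym_offset,
			     const struct regs64 *regs)
{
	memcpy(purgatory_buf + sym_offset, regs, sizeof(*regs));
}
```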

[..]
> > +int verify_sha256_digest(void)
> > +{
> > + struct sha_region *ptr, *end;
> > + u8 digest[SHA256_DIGEST_SIZE];
> > + struct sha256_state sctx;
> > +
> > + sha256_init(&sctx);
> > + end = &sha_regions[sizeof(sha_regions)/sizeof(sha_regions[0])];
> > + for (ptr = sha_regions; ptr < end; ptr++)
> > + sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len);
> > +
> > + sha256_final(&sctx, digest);
> > +
> > + if (memcmp(digest, sha256_digest, sizeof(digest)) != 0)
> > + return 1;
> > +
> > + return 0;
> > +}
> > +
> > +void purgatory(void)
> > +{
> > + int ret;
> > +
> > + ret = verify_sha256_digest();
>
> Yeah, again, hardcoding sha256 is kinda yucky. We probably should link in
> the needed crypto API stuff and support multiple hash algos.

There is no easy way to link to the kernel crypto API as the purgatory object
does not link against the kernel. Most likely it would have to be either a
common library against which both the kernel and purgatory link, or use
all the #include magic. I really don't want to make things complicated
here.

The purpose of this isolated code in its own directory is to write it once
and almost never touch it again.

>
> > + if (ret) {
> > + /* loop forever */
> > + for(;;);
> > + }
> > + copy_backup_region();
>
> What is this thing supposed to do? I see in patch 11/11
> arch_update_purgatory() does some preparations for KEXEC_TYPE_CRASH.

The kdump kernel runs from a reserved region of memory. But we also found on
x86 that it still needed the first 640KB of memory to boot. So before jumping
to the second kernel, we copy the first 640KB into the reserved region and
let the second kernel use the first 640KB. We call this concept the backup
region, as we are backing up an area of memory so that the second kernel
does not overwrite it.

For more details on backup region have a look at this paper.

http://lse.sourceforge.net/kdump/documentation/ols2oo5-kdump-paper.pdf
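A minimal userspace sketch of the idea follows. The buffer size and function name here are illustrative, not the kernel's; the real region is the first 640KB (0xA0000 bytes) of physical memory.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative size; the real backup region covers the first 640KB. */
#define LOW_MEM_SIZE 640

/*
 * Conceptual equivalent of purgatory's copy_backup_region(): preserve
 * low memory in the reserved region before the second (kdump) kernel
 * reuses it, so the crash image of the first kernel stays readable.
 */
static void backup_low_mem(uint8_t *reserved, const uint8_t *low_mem)
{
	memcpy(reserved, low_mem, LOW_MEM_SIZE);
}
```

After the copy, tooling that reads the dump looks at the reserved copy instead of the (now overwritten) low addresses.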

[..]
> > + */
> > +
> > + .text
> > + .globl purgatory_start
> > + .balign 16
> > +purgatory_start:
> > + .code32
> > +
> > + /* This is just a stub. Write code when 32bit support comes along */
>
> I'm guessing we want to support 32-bit secure boot with kexec at some
> point...

Yep. Right now these patches support 64bit kernels only, and I put in some
code for 32bit so that there are no compilation failures.

Though 32bit is becoming less relevant with every passing day, supporting
it is still a good idea. It can happen down the line though.

[..]
> Bah, that's a huge patch - it could use some splitting.

I will see if I can find a way to split this patch. Thanks for reviewing
it.

Thanks
Vivek

2014-02-26 16:39:04

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 06/11] kexec: A new system call, kexec_file_load, for in kernel kexec

On Wed, Feb 26, 2014 at 04:37:09PM +0100, Borislav Petkov wrote:
> On Mon, Feb 24, 2014 at 11:41:31AM -0500, Vivek Goyal wrote:
> > Right now there is only one user. But once other image loader support
> > comes along or other arches support file based kexec, they can make
> > use of same function.
> >
> > This is a pretty important modification as it decides what's the starting
> > point of next kernel image. I wanted to make it a function callable by
> > users who wanted to modify it instead of letting them directly
> > modify image->start.
>
> But are you expecting any other way to set image->start by the other
> arches/image loaders?

No. I am expecting other loaders to just call this function to set the
start of the kernel image.

Thanks
Vivek

2014-02-27 15:44:38

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 07/11] kexec: Create a relocatable object called purgatory

On Wed, Feb 26, 2014 at 11:32:56AM -0500, Vivek Goyal wrote:
> We have been using sha256 for last 7-8 years in kexec-tools and nobody
> asked for changing hash algorithm so far.
>
> So yes, somebody wanting to use a different algorithm is a possibility
> but I don't think it is likely in near future.
>
> This patchset is already very big. I would rather make it work with
> sha256 and down the line one can make it more generic if they feel
> the need. This is not a user space API/ABI and one should be able to
> change it without breaking any user space applications.
>
> So I really don't feel the need to make it more complicated and pull in
> all the crypto API to support other kinds of hash algorithms
> in purgatory.

Ok, makes sense. Right, if someone needs it, someone could add that
support fairly easily.

> No. One can modify the purgatory object symbol entry64_regs dynamically from
> outside, and the purpose of this code is to load values into registers. At
> compile time the value of entry64_regs is 0, so it kind of gives the impression
> that we are just trying to zero registers.
>
> At kernel load time, we set the values of some of those registers. The stack
> and the kernel entry point are among them. Look at
> arch/x86/kernel/kexec-bzimage.c
>
> regs64.rbx = 0; /* Bootstrap Processor */
> regs64.rsi = bootparam_load_addr;
> regs64.rip = kernel_load_addr + 0x200;
> regs64.rsp = (unsigned long)stack;
> ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
> sizeof(regs64), 0);

Ok, thanks for explaining.

> The kdump kernel runs from a reserved region of memory. But we also found
> on x86 that it still needed the first 640KB of memory to boot. So before
> jumping to the second kernel, we copy the first 640KB into the reserved
> region and let the second kernel use the first 640KB. We call this concept
> the backup region, as we are backing up an area of memory so that the
> second kernel does not overwrite it.
>
> For more details on backup region have a look at this paper.
>
> http://lse.sourceforge.net/kdump/documentation/ols2oo5-kdump-paper.pdf

That's valuable info, can we put some of it in a comment in the code
somewhere?

> Yep. Right now these patches support 64bit kernels only, and I put in some
> code for 32bit so that there are no compilation failures.
>
> Though 32bit is becoming less relevant with every passing day, supporting
> it is still a good idea. It can happen down the line though.

Ok.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-27 21:36:41

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 08/11] kexec-bzImage: Support for loading bzImage using 64bit entry

On Mon, Jan 27, 2014 at 01:57:48PM -0500, Vivek Goyal wrote:
> This is loader specific code which can load bzImage and set it up for
> 64bit entry. This does not take care of 32bit entry or real mode entry
> yet.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---

checkpatch: total: 4 errors, 2 warnings, 450 lines checked

...

> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index cb648c8..fa9981d 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -67,8 +67,10 @@ obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
> obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
> obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o
> obj-$(CONFIG_X86_TSC) += trace_clock.o
> +obj-$(CONFIG_KEXEC) += machine_kexec.o
> obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o
> obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o
> +obj-$(CONFIG_KEXEC) += kexec-bzimage.o

Maybe use less obj-$(CONFIG_KEXEC) lines here.

> obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o
> obj-y += kprobes/
> obj-$(CONFIG_MODULES) += module.o
> diff --git a/arch/x86/kernel/kexec-bzimage.c b/arch/x86/kernel/kexec-bzimage.c
> new file mode 100644
> index 0000000..cbfcd00
> --- /dev/null
> +++ b/arch/x86/kernel/kexec-bzimage.c
> @@ -0,0 +1,234 @@
> +#include <linux/string.h>
> +#include <linux/printk.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/kexec.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +
> +#include <asm/bootparam.h>
> +#include <asm/setup.h>
> +
> +#ifdef CONFIG_X86_64
> +
> +struct bzimage64_data {
> + /*
> + * Temporary buffer to hold bootparams buffer. This should be
> + * freed once the bootparam segment has been loaded.
> + */
> + void *bootparams_buf;
> +};

Why a struct if it is going to have only one member?

> +
> +int bzImage64_probe(const char *buf, unsigned long len)
> +{
> + int ret = -ENOEXEC;
> + struct setup_header *header;
> +
> + if (len < 2 * 512) {

What's 2*512? Two sectors?

> + pr_debug("File is too short to be a bzImage\n");
> + return ret;
> + }
> +
> + header = (struct setup_header *)(buf + 0x1F1);

0x1F1 should need at least a comment or "offsetof(struct boot_params, hdr)"
or both, which would be best. :-)

> + if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
> + pr_debug("Not a bzImage\n");

Actually, I think that means that there is no real mode kernel header
there, or we're using an old boot protocol version:

Documentation/x86/boot.txt

> + return ret;
> + }
> +
> + if (header->boot_flag != 0xAA55) {
> + /* No x86 boot sector present */

Comment is kinda redundant here :)

> + pr_debug("No x86 boot sector present\n");
> + return ret;
> + }
> +
> + if (header->version < 0x020C) {
> + /* Must be at least protocol version 2.12 */

Ditto.

> + pr_debug("Must be at least protocol version 2.12\n");
> + return ret;
> + }
> +
> + if ((header->loadflags & 1) == 0) {

That must be LOADED_HIGH bit. Why does this bit mean it is a bzImage?

Ok, I see it in boot.txt:

"...
When loading a zImage kernel ((loadflags & 0x01) == 0).
"

> + /* Not a bzImage */
> + pr_debug("zImage not a bzImage\n");
> + return ret;
> + }
> +
> + if ((header->xloadflags & 3) != 3) {
> + /* XLF_KERNEL_64 and XLF_CAN_BE_LOADED_ABOVE_4G should be set */

Use those defines in the code please instead of naked numbers.

> + pr_debug("Not a relocatable bzImage64\n");
> + return ret;
> + }
> +
> + /* I've got a bzImage */
> + pr_debug("It's a relocatable bzImage64\n");
> + ret = 0;
> +
> + return ret;
> +}
> +
> +void *bzImage64_load(struct kimage *image, char *kernel,
> + unsigned long kernel_len,
> + char *initrd, unsigned long initrd_len,
> + char *cmdline, unsigned long cmdline_len)
> +{
> +
> + struct setup_header *header;
> + int setup_sects, kern16_size, ret = 0;
> + unsigned long setup_header_size, params_cmdline_sz;
> + struct boot_params *params;
> + unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
> + unsigned long purgatory_load_addr;
> + unsigned long kernel_bufsz, kernel_memsz, kernel_align;
> + char *kernel_buf;
> + struct bzimage64_data *ldata;
> + struct kexec_entry64_regs regs64;
> + void *stack;
> +
> + header = (struct setup_header *)(kernel + 0x1F1);

See above.

> + setup_sects = header->setup_sects;
> + if (setup_sects == 0)
> + setup_sects = 4;
> +
> + kern16_size = (setup_sects + 1) * 512;
> + if (kernel_len < kern16_size) {
> + pr_debug("bzImage truncated\n");
> + return ERR_PTR(-ENOEXEC);
> + }
> +
> + if (cmdline_len > header->cmdline_size) {
> + pr_debug("Kernel command line too long\n");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + /* Allocate loader specific data */
> + ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
> + if (!ldata)
> + return ERR_PTR(-ENOMEM);
> +
> + /*
> + * Load purgatory. For 64bit entry point, purgatory code can be
> + * anywhere.
> + */
> + ret = kexec_load_purgatory(image, 0x3000, -1, 1, &purgatory_load_addr);

Some defines like MIN_<something> and MAX_<something> could be more
readable here.

> + if (ret) {
> + pr_debug("Loading purgatory failed\n");
> + goto out_free_loader_data;
> + }
> +
> + pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
> +
> + /* Load Bootparams and cmdline */
> + params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
> + params = kzalloc(params_cmdline_sz, GFP_KERNEL);
> + if (!params) {
> + ret = -ENOMEM;
> + goto out_free_loader_data;
> + }
> +
> + /* Copy setup header onto bootparams. */
> + setup_header_size = 0x0202 + kernel[0x0201] - 0x1F1;

More magic numbers :-\ Ok, I'm not going to comment on the rest of them
below but you get the idea - it would be much better to have descriptive
defines here instead of naked numbers.

> +
> + /* Is there a limit on setup header size? */
> + memcpy(&params->hdr, (kernel + 0x1F1), setup_header_size);
> + ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
> + params_cmdline_sz, 16, 0x3000, -1, 1,
> + &bootparam_load_addr);

Normally we do arg alignment below the opening brace of the function.
Ditto for a bunch of call sites below.


...

> diff --git a/arch/x86/kernel/machine_kexec.c b/arch/x86/kernel/machine_kexec.c
> new file mode 100644
> index 0000000..ac55890
> --- /dev/null
> +++ b/arch/x86/kernel/machine_kexec.c
> @@ -0,0 +1,136 @@
> +/*
> + * handle transition of Linux booting another kernel
> + *
> + * Copyright (C) 2014 Red Hat Inc.
> + * Authors:
> + * Vivek Goyal <[email protected]>
> + *
> + * This source code is licensed under the GNU General Public License,
> + * Version 2. See the file COPYING for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +#include <asm/bootparam.h>
> +#include <asm/setup.h>
> +
> +/*
> + * Common code for x86 and x86_64 used for kexec.

I think you mean i386 by x86, right?

> + *
> + * For the time being it compiles only for x86_64 as there are no image
> + * loaders implemented * for x86. This #ifdef can be removed once somebody
> + * decides to write an image loader on CONFIG_X86_32.
> + */
> +
> +#ifdef CONFIG_X86_64
> +
> +int kexec_setup_initrd(struct boot_params *boot_params,
> + unsigned long initrd_load_addr, unsigned long initrd_len)
> +{
> + boot_params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
> + boot_params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
> +
> + boot_params->ext_ramdisk_image = initrd_load_addr >> 32;
> + boot_params->ext_ramdisk_size = initrd_len >> 32;
> +
> + return 0;
> +}
> +
> +int kexec_setup_cmdline(struct boot_params *boot_params,
> + unsigned long bootparams_load_addr,
> + unsigned long cmdline_offset, char *cmdline,
> + unsigned long cmdline_len)
> +{
> + char *cmdline_ptr = ((char *)boot_params) + cmdline_offset;
> + unsigned long cmdline_ptr_phys;
> + uint32_t cmdline_low_32, cmdline_ext_32;
> +
> + memcpy(cmdline_ptr, cmdline, cmdline_len);
> + cmdline_ptr[cmdline_len - 1] = '\0';
> +
> + cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
> + cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
> + cmdline_ext_32 = cmdline_ptr_phys >> 32;
> +
> + boot_params->hdr.cmd_line_ptr = cmdline_low_32;
> + if (cmdline_ext_32)
> + boot_params->ext_cmd_line_ptr = cmdline_ext_32;
> +
> + return 0;
> +}
> +
> +static int setup_memory_map_entries(struct boot_params *params)
> +{
> + unsigned int nr_e820_entries;
> +
> + /* TODO: What about EFI */

Do you mean by that what do_add_efi_memmap() does? We add the efi
entries only when add_efi_memmap is supplied on the cmdline, see
200001eb140ea.

> + nr_e820_entries = e820_saved.nr_map;
> + if (nr_e820_entries > E820MAX)
> + nr_e820_entries = E820MAX;
> +
> + params->e820_entries = nr_e820_entries;
> + memcpy(&params->e820_map, &e820_saved.map,
> + nr_e820_entries * sizeof(struct e820entry));
> +
> + return 0;
> +}

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-27 21:52:42

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 09/11] kexec: Provide a function to add a segment at fixed address

On Mon, Jan 27, 2014 at 01:57:49PM -0500, Vivek Goyal wrote:
> kexec_add_buffer() can find a suitable range of memory for user buffer and
> add it to list of segments. But ELF loader will require that a buffer can
> be loaded at the address it has been compiled for (ET_EXEC type executables).
> So we need a helper function which can see if requested memory is valid and
> available and add a segment accordingly. This patch provides that helper
> function. It will be used by elf loader in later patch.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> include/linux/kexec.h | 3 +++
> kernel/kexec.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 68 insertions(+)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index d391ed7..2fb052c 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -208,6 +208,9 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
> struct kexec_segment __user *segments,
> unsigned long flags);
> extern int kernel_kexec(void);
> +extern int kexec_add_segment(struct kimage *image, char *buffer,
> + unsigned long bufsz, unsigned long memsz,
> + unsigned long base);
> extern int kexec_add_buffer(struct kimage *image, char *buffer,
> unsigned long bufsz, unsigned long memsz,
> unsigned long buf_align, unsigned long buf_min,
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 20169a4..9e4718b 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -2002,6 +2002,71 @@ static int __kexec_add_segment(struct kimage *image, char *buf,
> return 0;
> }
>
> +static int validate_ram_range_callback(u64 start, u64 end, void *arg)
> +{
> + struct kexec_segment *ksegment = arg;
> + u64 mstart = ksegment->mem;
> + u64 mend = ksegment->mem + ksegment->memsz - 1;
> +
> + /* Found a valid range. Stop going through more ranges */
> + if (mstart >= start && mend <= end)
> + return 1;
> +
> + /* Range did not match. Go to next one */
> + return 0;
> +}
> +
> +/* Add a kexec segment at fixed address provided by caller */
> +int kexec_add_segment(struct kimage *image, char *buffer, unsigned long bufsz,
> + unsigned long memsz, unsigned long base)
> +{
> + struct kexec_segment ksegment;
> + int ret;
> +
> + /* Currently adding segment this way is allowed only in file mode */
> + if (!image->file_mode)
> + return -EINVAL;

Why the guard? On a quick scan, I don't see this function called by
something else except on the kexec_file_load path...

> +
> + if (image->nr_segments >= KEXEC_SEGMENT_MAX)
> + return -EINVAL;
> +
> + /*
> + * Make sure we are not trying to add segment after allocating
> + * control pages. All segments need to be placed first before
> + * any control pages are allocated. As control page allocation
> + * logic goes through list of segments to make sure there are
> + * no destination overlaps.
> + */
> + WARN_ONCE(!list_empty(&image->control_pages), "Adding kexec segment"

Maybe say at which address here:

... "Adding a kexec segment at address 0x%lx.."

for a bit more helpful info.

> + " after allocating control pages\n");
> +
> + if (bufsz > memsz)
> + return -EINVAL;
> + if (memsz == 0)
> + return -EINVAL;
> +
> + /* Align memsz to next page boundary */
> + memsz = ALIGN(memsz, PAGE_SIZE);

We even have PAGE_ALIGN for that.

> +
> + /* Make sure base is atleast page size aligned */
> + if (base & (PAGE_SIZE - 1))

PAGE_ALIGNED even :)

> + return -EINVAL;
> +
> + memset(&ksegment, 0, sizeof(struct kexec_segment));
> + ksegment.mem = base;
> + ksegment.memsz = memsz;
> +
> + /* Validate memory range */
> + ret = walk_system_ram_res(base, base + memsz - 1, &ksegment,
> + validate_ram_range_callback);
> +
> + /* If a valid range is found, 1 is returned */
> + if (ret != 1)

That's the retval of validate_ram_range_callback, right? So

if (!ret)

And shouldn't the convention be the opposite? 0 on success, !0 on error?

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-28 14:58:37

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 10/11] kexec: Support for loading ELF x86_64 images

On Mon, Jan 27, 2014 at 01:57:50PM -0500, Vivek Goyal wrote:
> This patch provides support for kexec for loading ELF x86_64 images. I have
> tested it with loading vmlinux and it worked.

Can you please enlighten me what the use case for ELF kernel images is? bzImage
I understand but what produces ELF images?

I see that kexec_file_load() can receive ELF segments too but why are we
doing that?

> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> arch/x86/include/asm/kexec-elf.h | 11 ++
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/kexec-elf.c | 231 +++++++++++++++++++++++++++++++++++++
> arch/x86/kernel/machine_kexec_64.c | 2 +
> 4 files changed, 245 insertions(+)
> create mode 100644 arch/x86/include/asm/kexec-elf.h
> create mode 100644 arch/x86/kernel/kexec-elf.c
>
> diff --git a/arch/x86/include/asm/kexec-elf.h b/arch/x86/include/asm/kexec-elf.h
> new file mode 100644
> index 0000000..afef382
> --- /dev/null
> +++ b/arch/x86/include/asm/kexec-elf.h
> @@ -0,0 +1,11 @@
> +#ifndef _ASM_KEXEC_ELF_H
> +#define _ASM_KEXEC_ELF_H
> +
> +extern int elf_x86_64_probe(const char *buf, unsigned long len);
> +extern void *elf_x86_64_load(struct kimage *image, char *kernel,
> + unsigned long kernel_len, char *initrd,
> + unsigned long initrd_len, char *cmdline,
> + unsigned long cmdline_len);
> +extern int elf_x86_64_cleanup(struct kimage *image);
> +
> +#endif /* _ASM_KEXEC_ELF_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index fa9981d..2d77de7 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -71,6 +71,7 @@ obj-$(CONFIG_KEXEC) += machine_kexec.o
> obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o
> obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o
> obj-$(CONFIG_KEXEC) += kexec-bzimage.o
> +obj-$(CONFIG_KEXEC) += kexec-elf.o

It looks like kexec could slowly grow its own dir now:

arch/x86/kexec/

or so.

> obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o
> obj-y += kprobes/
> obj-$(CONFIG_MODULES) += module.o
> diff --git a/arch/x86/kernel/kexec-elf.c b/arch/x86/kernel/kexec-elf.c
> new file mode 100644
> index 0000000..ff1017c
> --- /dev/null
> +++ b/arch/x86/kernel/kexec-elf.c
> @@ -0,0 +1,231 @@
> +#include <linux/string.h>
> +#include <linux/printk.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/kexec.h>
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +
> +#include <asm/bootparam.h>
> +#include <asm/setup.h>
> +
> +#ifdef CONFIG_X86_64
> +
> +struct elf_x86_64_data {
> + /*
> + * Temporary buffer to hold bootparams buffer. This should be
> + * freed once the bootparam segment has been loaded.
> + */
> + void *bootparams_buf;
> +};
> +
> +int elf_x86_64_probe(const char *buf, unsigned long len)
> +{
> + int ret = -ENOEXEC;
> + Elf_Ehdr *ehdr;
> +
> + if (len < sizeof(Elf_Ehdr)) {
> + pr_debug("File is too short to be an ELF executable.\n");
> + return ret;
> + }
> +
> + ehdr = (Elf_Ehdr *)buf;
> +
> + if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0
> + || ehdr->e_type != ET_EXEC || !elf_check_arch(ehdr)
> + || ehdr->e_phentsize != sizeof(Elf_Phdr))
> + return -ENOEXEC;
> +
> + if (ehdr->e_phoff >= len
> + || (ehdr->e_phnum * sizeof(Elf_Phdr) > len - ehdr->e_phoff))
> + return -ENOEXEC;
> +
> + /* I've got a bzImage */
> + pr_debug("It's an elf_x86_64 image.\n");
> + ret = 0;
> +
> + return ret;

I think you can drop 'ret' here and return the error vals directly.

> +}
> +
> +static int elf_exec_load(struct kimage *image, char *kernel)
> +{
> + Elf_Ehdr *ehdr;
> + Elf_Phdr *phdrs;
> + int i, ret;
> + size_t filesz;
> + char *buffer;
> +
> + ehdr = (Elf_Ehdr *)kernel;
> + phdrs = (void *)ehdr + ehdr->e_phoff;
> +
> + for (i = 0; i < ehdr->e_phnum; i++) {
> + if (phdrs[i].p_type != PT_LOAD)
> + continue;

newline

> + filesz = phdrs[i].p_filesz;
> + if (filesz > phdrs[i].p_memsz)
> + filesz = phdrs[i].p_memsz;
> +
> + buffer = (char *)ehdr + phdrs[i].p_offset;
> + ret = kexec_add_segment(image, buffer, filesz, phdrs[i].p_memsz,
> + phdrs[i].p_paddr);
> + if (ret)
> + break;
> + }
> +
> + return ret;
> +}
> +
> +/* Fill in fields which are usually present in bzImage */
> +static int init_linux_parameters(struct boot_params *params)
> +{
> + /*
> + * FIXME: It is odd that the information which comes from kernel
> + * has to be faked by loading kernel. I guess it is limitation of
> + * ELF format. Right now keeping it same as kexec-tools
> + * implementation. But this most likely needs fixing.
> + */
> + memcpy(&params->hdr.header, "HdrS", 4);
> + params->hdr.version = 0x0206;
> + params->hdr.initrd_addr_max = 0x37FFFFFF;
> + params->hdr.cmdline_size = 2048;
> + return 0;
> +}

Why a separate function? Its body is small enough to be merged into
elf_x86_64_load.

> +
> +void *elf_x86_64_load(struct kimage *image, char *kernel,
> + unsigned long kernel_len,
> + char *initrd, unsigned long initrd_len,
> + char *cmdline, unsigned long cmdline_len)
> +{

Btw, this functionality below looks very similar to the one in
bzImage64_load(). Can we share some of it?

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-28 16:32:27

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 08/11] kexec-bzImage: Support for loading bzImage using 64bit entry

On Thu, Feb 27, 2014 at 10:36:29PM +0100, Borislav Petkov wrote:

Hi Boris,

Thanks for taking time to review this large patchset. Please find
my comments inline.

[..]
> > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> > index cb648c8..fa9981d 100644
> > --- a/arch/x86/kernel/Makefile
> > +++ b/arch/x86/kernel/Makefile
> > @@ -67,8 +67,10 @@ obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
> > obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
> > obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o
> > obj-$(CONFIG_X86_TSC) += trace_clock.o
> > +obj-$(CONFIG_KEXEC) += machine_kexec.o
> > obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o
> > obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o
> > +obj-$(CONFIG_KEXEC) += kexec-bzimage.o
>
> Maybe use less obj-$(CONFIG_KEXEC) lines here.

Ok, will put some of them on a single line.

[..]
> > +struct bzimage64_data {
> > + /*
> > + * Temporary buffer to hold bootparams buffer. This should be
> > + * freed once the bootparam segment has been loaded.
> > + */
> > + void *bootparams_buf;
> > +};
>
> Why a struct if it is going to have only one member?

Well, I had started with a generic idea of the bootloader being able to store
some data in the image and retrieve it later. In the end it turned out to be
only one field in the current implementation.

But I still like it, as it allows storing more data down the line. There
is no other place where we can store bootloader-specific data, so this
is the mechanism I created.

>
> > +
> > +int bzImage64_probe(const char *buf, unsigned long len)
> > +{
> > + int ret = -ENOEXEC;
> > + struct setup_header *header;
> > +
> > + if (len < 2 * 512) {
>
> What's 2*512? Two sectors?

Yep, two sectors.

>
> > + pr_debug("File is too short to be a bzImage\n");
> > + return ret;
> > + }
> > +
> > + header = (struct setup_header *)(buf + 0x1F1);
>
> 0x1F1 should need at least a comment or "offsetof(struct boot_params, hdr)"
> or both, which would be best. :-)

Ok, I will put a comment.

>
> > + if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
> > + pr_debug("Not a bzImage\n");
>
> Actually, I think that means that there is no real mode kernel header
> there, or we're using an old boot protocol version:

So Documentation/x86/boot.txt says the following.


If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
the boot protocol version is "old". Loading an old kernel, the
following parameters should be assumed:

Image type = zImage
initrd not supported
Real-mode kernel must be located at 0x90000.

So if it is the old version then it is a zImage (and not a bzImage). The
pr_debug() message therefore seems correct: the image one is trying to load
is not a bzImage, so this loader will not handle it.
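The check under discussion can be sketched in isolation. This is a hedged userspace rendering, with the offsets taken from boot.txt rather than from the patch itself:

```c
#include <string.h>

/*
 * Offsets per Documentation/x86/boot.txt: struct setup_header starts
 * at file offset 0x1F1, and the "HdrS" magic sits at offset 0x202.
 */
#define SETUP_HDR_OFF  0x1F1
#define HDRS_MAGIC_OFF 0x202

/*
 * Returns 1 if the buffer carries the "HdrS" magic (new-style boot
 * protocol); 0 means the old protocol, i.e. a zImage this loader
 * will not handle.
 */
static int has_hdrs_magic(const unsigned char *buf, unsigned long len)
{
	if (len < HDRS_MAGIC_OFF + 4)
		return 0;
	return memcmp(buf + HDRS_MAGIC_OFF, "HdrS", 4) == 0;
}
```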

>
> Documentation/x86/boot.txt
>
> > + return ret;
> > + }
> > +
> > + if (header->boot_flag != 0xAA55) {
> > + /* No x86 boot sector present */
>
> Comment is kinda redundant here :)

Ok, will get rid of it.

>
> > + pr_debug("No x86 boot sector present\n");
> > + return ret;
> > + }
> > +
> > + if (header->version < 0x020C) {
> > + /* Must be at least protocol version 2.12 */
>
> Ditto.

Will get rid of it.

>
> > + pr_debug("Must be at least protocol version 2.12\n");
> > + return ret;
> > + }
> > +
> > + if ((header->loadflags & 1) == 0) {
>
> That must be LOADED_HIGH bit. Why does this bit mean it is a bzImage?

Yep, that's the LOADED_HIGH check. I think if LOADED_HIGH is not set, then
the protected mode code needs to be loaded at 0x10000, and I think that
also means it is a zImage and not a bzImage.

At least this loader does not handle anything that old, where the
protected mode code needs to be loaded at 0x10000.
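In code form the distinction reduces to one bit. A sketch, with the macro name mirroring the LOADED_HIGH flag documented in boot.txt:

```c
#include <stdint.h>

/*
 * Bit 0 of loadflags is LOADED_HIGH (boot.txt): set means the
 * protected-mode code is loaded high at 0x100000 (bzImage); clear
 * means it loads at 0x10000, i.e. a zImage this loader rejects.
 */
#define LOADED_HIGH 0x01

static int is_bzimage(uint8_t loadflags)
{
	return (loadflags & LOADED_HIGH) != 0;
}
```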

>
> Ok, I see it in boot.txt:
>
> "...
> When loading a zImage kernel ((loadflags & 0x01) == 0).
> "
>
> > + /* Not a bzImage */
> > + pr_debug("zImage not a bzImage\n");
> > + return ret;
> > + }
> > +
> > + if ((header->xloadflags & 3) != 3) {
> > + /* XLF_KERNEL_64 and XLF_CAN_BE_LOADED_ABOVE_4G should be set */
>
> Use those defines in the code please instead of naked numbers.

Agreed. Using these defines is more readable. Will change it.


[..]
> > +void *bzImage64_load(struct kimage *image, char *kernel,
> > + unsigned long kernel_len,
> > + char *initrd, unsigned long initrd_len,
> > + char *cmdline, unsigned long cmdline_len)
> > +{
> > +
> > + struct setup_header *header;
> > + int setup_sects, kern16_size, ret = 0;
> > + unsigned long setup_header_size, params_cmdline_sz;
> > + struct boot_params *params;
> > + unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
> > + unsigned long purgatory_load_addr;
> > + unsigned long kernel_bufsz, kernel_memsz, kernel_align;
> > + char *kernel_buf;
> > + struct bzimage64_data *ldata;
> > + struct kexec_entry64_regs regs64;
> > + void *stack;
> > +
> > + header = (struct setup_header *)(kernel + 0x1F1);
>
> See above.

Will comment or use offsetof().

[..]
> > + /* Allocate loader specific data */
> > + ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
> > + if (!ldata)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + /*
> > + * Load purgatory. For 64bit entry point, purgatory code can be
> > + * anywhere.
> > + */
> > + ret = kexec_load_purgatory(image, 0x3000, -1, 1, &purgatory_load_addr);
>
> Some defines like MIN_<something> and MAX_<something> could be more
> readable here.

Ok, at least I will use MAX_<something> for "-1". I might define one
internally to this file for representing the minimum address of 0x3000.

>
> > + if (ret) {
> > + pr_debug("Loading purgatory failed\n");
> > + goto out_free_loader_data;
> > + }
> > +
> > + pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
> > +
> > + /* Load Bootparams and cmdline */
> > + params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
> > + params = kzalloc(params_cmdline_sz, GFP_KERNEL);
> > + if (!params) {
> > + ret = -ENOMEM;
> > + goto out_free_loader_data;
> > + }
> > +
> > + /* Copy setup header onto bootparams. */
> > + setup_header_size = 0x0202 + kernel[0x0201] - 0x1F1;
>
> More magic numbers :-\ Ok, I'm not going to comment on the rest of them
> below but you get the idea - it would be much better to have descriptive
> defines here instead of naked numbers.

That's how boot.txt defines it. Look at 64-bit BOOT PROTOCOL.

0x0202 + byte value at offset 0x0201

Now one can argue for creating some new defines to represent those magic
numbers and including that file in the kexec-bzimage loader. I will see
what I can do.
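The formula can be written out directly. A sketch following boot.txt, with the magic offsets kept as-is since that is how the protocol defines them:

```c
/*
 * Per the 64-bit boot protocol in boot.txt: the setup header starts at
 * file offset 0x1F1 and ends at 0x0202 plus the byte value stored at
 * offset 0x0201 (part of the jump instruction at 0x200).
 */
static unsigned long setup_header_size(const unsigned char *kernel)
{
	return 0x0202 + kernel[0x0201] - 0x1F1;
}
```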

>
> > +
> > + /* Is there a limit on setup header size? */
> > + memcpy(&params->hdr, (kernel + 0x1F1), setup_header_size);
> > + ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
> > + params_cmdline_sz, 16, 0x3000, -1, 1,
> > + &bootparam_load_addr);
>
> Normally we do arg alignment below the opening brace of the function.
> Ditto for a bunch of call sites below.

Hmm, I have seen all kinds of alignments. But anyway, if you prefer
that, I will change at least some of them.

[..]
> > +#include <linux/kernel.h>
> > +#include <linux/string.h>
> > +#include <asm/bootparam.h>
> > +#include <asm/setup.h>
> > +
> > +/*
> > + * Common code for x86 and x86_64 used for kexec.
>
> I think you mean i386 by x86, right?

Yes.

[..]
> > +static int setup_memory_map_entries(struct boot_params *params)
> > +{
> > + unsigned int nr_e820_entries;
> > +
> > + /* TODO: What about EFI */
>
> Do you mean by that what do_add_efi_memmap() does? We add the efi
> entries only when add_efi_memmap is supplied on the cmdline, see
> 200001eb140ea.

I meant that in general I have not done any EFI handling. I am not even
sure what needs to be done for EFI. I am just going through the memory
map and filling up the E820 map in bootparams so that the second kernel
knows what memory areas are available.

At some point we need to start passing the EFI memory map to the
second kernel (all the new code you and Dave Young have added to
make kexec work on EFI systems).

I wanted to do that in a separate patch. This patch series is already
very big. bzImage signing and verification I am also planning to post
in a separate series.

Thanks
Vivek

2014-02-28 16:57:39

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 09/11] kexec: Provide a function to add a segment at fixed address

On Thu, Feb 27, 2014 at 10:52:32PM +0100, Borislav Petkov wrote:

[..]
> > +/* Add a kexec segment at fixed address provided by caller */
> > +int kexec_add_segment(struct kimage *image, char *buffer, unsigned long bufsz,
> > + unsigned long memsz, unsigned long base)
> > +{
> > + struct kexec_segment ksegment;
> > + int ret;
> > +
> > + /* Currently adding segment this way is allowed only in file mode */
> > + if (!image->file_mode)
> > + return -EINVAL;
>
> Why the guard? On a quick scan, I don't see this function called by
> something else except on the kexec_file_load path...

This is more of a future-proofing measure. I have been putting this check
in to catch any accidental errors if somebody ends up calling this
function from the old mode.

But I am not very particular about it. If you don't like it, I can get
rid of it.

>
> > +
> > + if (image->nr_segments >= KEXEC_SEGMENT_MAX)
> > + return -EINVAL;
> > +
> > + /*
> > + * Make sure we are not trying to add segment after allocating
> > + * control pages. All segments need to be placed first before
> > + * any control pages are allocated. As control page allocation
> > + * logic goes through list of segments to make sure there are
> > + * no destination overlaps.
> > + */
> > + WARN_ONCE(!list_empty(&image->control_pages), "Adding kexec segment"
>
> Maybe say at which address here:
>
> ... "Adding a kexec segment at address 0x%lx.."
>
> for a bit more helpful info.

I think the address does not matter here. You can't add a segment after you
have allocated a control page. So I am not sure how printing the address
would help.

>
> > + " after allocating control pages\n");
> > +
> > + if (bufsz > memsz)
> > + return -EINVAL;
> > + if (memsz == 0)
> > + return -EINVAL;
> > +
> > + /* Align memsz to next page boundary */
> > + memsz = ALIGN(memsz, PAGE_SIZE);
>
> We even have PAGE_ALIGN for that.

Ok, there is not much difference between the two, but I can use PAGE_ALIGN().

>
> > +
> > + /* Make sure base is atleast page size aligned */
> > + if (base & (PAGE_SIZE - 1))
>
> PAGE_ALIGNED even :)

Will use it. :-)

>
> > + return -EINVAL;
> > +
> > + memset(&ksegment, 0, sizeof(struct kexec_segment));
> > + ksegment.mem = base;
> > + ksegment.memsz = memsz;
> > +
> > + /* Validate memory range */
> > + ret = walk_system_ram_res(base, base + memsz - 1, &ksegment,
> > + validate_ram_range_callback);
> > +
> > + /* If a valid range is found, 1 is returned */
> > + if (ret != 1)
>
> That's the retval of validate_ram_range_callback, right? So
>
> if (!ret)
>
> And shouldn't the convention be the opposite? 0 on success, !0 on error?

Ok, this one is a little twisted.

walk_system_ram_res() stops calling the callback function once the
callback returns a non-zero code.

So in this case, once we have found the range to be valid, we don't want
to continue looping and look at any more ranges. So we return "1". If
we returned "0" for success, the outer loop of walk_system_ram_res() would
continue with the next ranges.

Given the fact that "0" is interpreted as success by walk_system_ram_res()
and it continues with the next set of ranges, I could not use 0 as the final
measure of success. Negative returns are errors. So I thought of using
positive values of ret to indicate whether the range was found or not.

If there are better ways to handle it, I am all for it.

Thanks
Vivek

2014-02-28 17:12:29

by Vivek Goyal

Subject: Re: [PATCH 10/11] kexec: Support for loading ELF x86_64 images

On Fri, Feb 28, 2014 at 03:58:32PM +0100, Borislav Petkov wrote:
> On Mon, Jan 27, 2014 at 01:57:50PM -0500, Vivek Goyal wrote:
> > This patch provides support for kexec for loading ELF x86_64 images. I have
> > tested it with loading vmlinux and it worked.
>
> Can you please enlighten me what the use case for ELF kernel images is? bzImage
> I understand but what produces ELF images?

Before bzImage is generated, we produce an ELF vmlinux (linux-2.6/vmlinux).
And then this vmlinux is processed further to produce bzImage.

In theory you can load this vmlinux and boot into it using the kexec system
call.

In general, Eric wanted to support ELF images, hence I put in this patch.

>
> I see that kexec_file_load() can receive ELF segments too but why are we
> doing that?

So kexec_file_load() does not know what kind of file it is, whether an
ELF file or a bzImage file. It will just call all the image loaders and
see if some loader recognizes the file and claims it.

So the question is what kinds of images one should be able to load using
this new system call and kexec into.

My primary use case is bzImage. Others think that it should be generic
enough to be able to handle ELF too.

>
> > Signed-off-by: Vivek Goyal <[email protected]>
> > ---
> > arch/x86/include/asm/kexec-elf.h | 11 ++
> > arch/x86/kernel/Makefile | 1 +
> > arch/x86/kernel/kexec-elf.c | 231 +++++++++++++++++++++++++++++++++++++
> > arch/x86/kernel/machine_kexec_64.c | 2 +
> > 4 files changed, 245 insertions(+)
> > create mode 100644 arch/x86/include/asm/kexec-elf.h
> > create mode 100644 arch/x86/kernel/kexec-elf.c
> >
> > diff --git a/arch/x86/include/asm/kexec-elf.h b/arch/x86/include/asm/kexec-elf.h
> > new file mode 100644
> > index 0000000..afef382
> > --- /dev/null
> > +++ b/arch/x86/include/asm/kexec-elf.h
> > @@ -0,0 +1,11 @@
> > +#ifndef _ASM_KEXEC_ELF_H
> > +#define _ASM_KEXEC_ELF_H
> > +
> > +extern int elf_x86_64_probe(const char *buf, unsigned long len);
> > +extern void *elf_x86_64_load(struct kimage *image, char *kernel,
> > + unsigned long kernel_len, char *initrd,
> > + unsigned long initrd_len, char *cmdline,
> > + unsigned long cmdline_len);
> > +extern int elf_x86_64_cleanup(struct kimage *image);
> > +
> > +#endif /* _ASM_KEXEC_ELF_H */
> > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> > index fa9981d..2d77de7 100644
> > --- a/arch/x86/kernel/Makefile
> > +++ b/arch/x86/kernel/Makefile
> > @@ -71,6 +71,7 @@ obj-$(CONFIG_KEXEC) += machine_kexec.o
> > obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o
> > obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o
> > obj-$(CONFIG_KEXEC) += kexec-bzimage.o
> > +obj-$(CONFIG_KEXEC) += kexec-elf.o
>
> It looks like kexec could slowly grow its own dir now:
>
> arch/x86/kexec/

I was rather thinking of arch/x86/kernel/kexec. But that's for some other
day; not part of this patchset. This is already too big and I don't want
to make nice-to-have changes that bloat the patch size.

>
> or so.
>
> > obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o
> > obj-y += kprobes/
> > obj-$(CONFIG_MODULES) += module.o
> > diff --git a/arch/x86/kernel/kexec-elf.c b/arch/x86/kernel/kexec-elf.c
> > new file mode 100644
> > index 0000000..ff1017c
> > --- /dev/null
> > +++ b/arch/x86/kernel/kexec-elf.c
> > @@ -0,0 +1,231 @@
> > +#include <linux/string.h>
> > +#include <linux/printk.h>
> > +#include <linux/errno.h>
> > +#include <linux/slab.h>
> > +#include <linux/kexec.h>
> > +#include <linux/kernel.h>
> > +#include <linux/mm.h>
> > +
> > +#include <asm/bootparam.h>
> > +#include <asm/setup.h>
> > +
> > +#ifdef CONFIG_X86_64
> > +
> > +struct elf_x86_64_data {
> > + /*
> > + * Temporary buffer to hold bootparams buffer. This should be
> > + * freed once the bootparam segment has been loaded.
> > + */
> > + void *bootparams_buf;
> > +};
> > +
> > +int elf_x86_64_probe(const char *buf, unsigned long len)
> > +{
> > + int ret = -ENOEXEC;
> > + Elf_Ehdr *ehdr;
> > +
> > + if (len < sizeof(Elf_Ehdr)) {
> > + pr_debug("File is too short to be an ELF executable.\n");
> > + return ret;
> > + }
> > +
> > + ehdr = (Elf_Ehdr *)buf;
> > +
> > + if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0
> > + || ehdr->e_type != ET_EXEC || !elf_check_arch(ehdr)
> > + || ehdr->e_phentsize != sizeof(Elf_Phdr))
> > + return -ENOEXEC;
> > +
> > + if (ehdr->e_phoff >= len
> > + || (ehdr->e_phnum * sizeof(Elf_Phdr) > len - ehdr->e_phoff))
> > + return -ENOEXEC;
> > +
> > + /* I've got a bzImage */
> > + pr_debug("It's an elf_x86_64 image.\n");
> > + ret = 0;
> > +
> > + return ret;
>
> I think you can drop 'ret' here and return the error vals directly.

Yep, will simplify this one.

>
> > +}
> > +
> > +static int elf_exec_load(struct kimage *image, char *kernel)
> > +{
> > + Elf_Ehdr *ehdr;
> > + Elf_Phdr *phdrs;
> > + int i, ret;
> > + size_t filesz;
> > + char *buffer;
> > +
> > + ehdr = (Elf_Ehdr *)kernel;
> > + phdrs = (void *)ehdr + ehdr->e_phoff;
> > +
> > + for (i = 0; i < ehdr->e_phnum; i++) {
> > + if (phdrs[i].p_type != PT_LOAD)
> > + continue;
>
> newline

What's that?

>
> > + filesz = phdrs[i].p_filesz;
> > + if (filesz > phdrs[i].p_memsz)
> > + filesz = phdrs[i].p_memsz;
> > +
> > + buffer = (char *)ehdr + phdrs[i].p_offset;
> > + ret = kexec_add_segment(image, buffer, filesz, phdrs[i].p_memsz,
> > + phdrs[i].p_paddr);
> > + if (ret)
> > + break;
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +/* Fill in fields which are usually present in bzImage */
> > +static int init_linux_parameters(struct boot_params *params)
> > +{
> > + /*
> > + * FIXME: It is odd that the information which comes from kernel
> > + * has to be faked by loading kernel. I guess it is limitation of
> > + * ELF format. Right now keeping it same as kexec-tools
> > + * implementation. But this most likely needs fixing.
> > + */
> > + memcpy(&params->hdr.header, "HdrS", 4);
> > + params->hdr.version = 0x0206;
> > + params->hdr.initrd_addr_max = 0x37FFFFFF;
> > + params->hdr.cmdline_size = 2048;
> > + return 0;
> > +}
>
> Why a separate function? Its body is small enough to be merged into
> elf_x86_64_load.

Actually this logic shows the limitation of the ELF format kernel image.
This information should be exported by the image so that the loader can
do some verifications. But instead the loader hardcodes this info and
fakes things.

For example, it should be the kernel which tells what maximum command
line size it can handle, and then the loader can return an error if the
user-specified command line is longer than what the new kernel can handle.

Similarly, the maximum address the initrd can be loaded at.

Actually I have copied this code from kexec-tools, and I am wondering
if some of it is even needed.

I am not sure why we set hdr.version and hdr.header. Are there any
assumptions the kernel is making in the boot path? Maybe some other code
down the line is parsing it, or it is completely redundant. I think
I will play with removing the setting of hdr.version and hdr.header and
see how it goes.

So I put it in a separate function because the user space code had it
that way. Also, because I did not like this part of the code and it
looks like a limitation of the ELF format, I wanted to isolate it in
a separate function so that it is easy to spot.

>
> > +
> > +void *elf_x86_64_load(struct kimage *image, char *kernel,
> > + unsigned long kernel_len,
> > + char *initrd, unsigned long initrd_len,
> > + char *cmdline, unsigned long cmdline_len)
> > +{
>
> Btw, this functionality below looks very similar to the one in
> bzImage64_load(). Can we share some of it?

It looks similar, but the values with which some of the functions are called
are different. I will see if there are some obvious candidates for sharing
the code. It is not a lot of code to begin with.

Thanks
Vivek

2014-02-28 17:29:05

by Borislav Petkov

Subject: Re: [PATCH 11/11] kexec: Support for Kexec on panic using new system call

On Mon, Jan 27, 2014 at 01:57:51PM -0500, Vivek Goyal wrote:
> This patch adds support for loading a kexec on panic (kdump) kernel using
> the new system call. Right now this primarily works with the bzImage loader
> only. But changes to the ELF loader should be minimal as all the core
> infrastructure is there.
>
> The only thing preventing the ELF load into crash reserved memory is
> that the kernel vmlinux is of type ET_EXEC and it expects to be loaded at
> the address it has been compiled for. At that location the current kernel is
> already running. One first needs to make vmlinux fully relocatable
> and export it as type ET_DYN, and then modify this ELF loader to support
> images of type ET_DYN.
>
> I am leaving it as a future TODO item.
>
> Signed-off-by: Vivek Goyal <[email protected]>

checkpatch: total: 2 errors, 10 warnings, 977 lines checked

> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 9bd6fec..a330d85 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -25,6 +25,8 @@
> #include <asm/ptrace.h>
> #include <asm/bootparam.h>
>
> +struct kimage;
> +
> /*
> * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
> * I.e. Maximum page that is mapped directly into kernel memory,
> @@ -62,6 +64,10 @@
> # define KEXEC_ARCH KEXEC_ARCH_X86_64
> #endif
>
> +/* Memory to backup during crash kdump */
> +#define KEXEC_BACKUP_SRC_START (0UL)
> +#define KEXEC_BACKUP_SRC_END (655360UL) /* 640K */

I guess

#define KEXEC_BACKUP_SRC_END (640 * 1024UL)

should be more clear.

> /*
> * CPU does not save ss and sp on stack if execution is already
> * running in kernel mode at the time of NMI occurrence. This code
> @@ -161,8 +167,21 @@ struct kimage_arch {
> pud_t *pud;
> pmd_t *pmd;
> pte_t *pte;
> + /* Details of backup region */
> + unsigned long backup_src_start;
> + unsigned long backup_src_sz;
> +
> + /* Physical address of backup segment */
> + unsigned long backup_load_addr;
> +
> + /* Core ELF header buffer */
> + unsigned long elf_headers;
> + unsigned long elf_headers_sz;
> + unsigned long elf_load_addr;
> };
> +#endif /* CONFIG_X86_32 */
>
> +#ifdef CONFIG_X86_64
> struct kexec_entry64_regs {
> uint64_t rax;
> uint64_t rbx;
> @@ -189,11 +208,13 @@ extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
>
> extern int kexec_setup_initrd(struct boot_params *boot_params,
> unsigned long initrd_load_addr, unsigned long initrd_len);
> -extern int kexec_setup_cmdline(struct boot_params *boot_params,
> +extern int kexec_setup_cmdline(struct kimage *image,
> + struct boot_params *boot_params,
> unsigned long bootparams_load_addr,
> unsigned long cmdline_offset, char *cmdline,
> unsigned long cmdline_len);
> -extern int kexec_setup_boot_parameters(struct boot_params *params);
> +extern int kexec_setup_boot_parameters(struct kimage *image,
> + struct boot_params *params);
>
>
> #endif /* __ASSEMBLY__ */
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index a57902e..8eabde4 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -4,6 +4,9 @@
> * Created by: Hariprasad Nellitheertha ([email protected])
> *
> * Copyright (C) IBM Corporation, 2004. All rights reserved.
> + * Copyright (C) Red Hat Inc., 2014. All rights reserved.
> + * Authors:
> + * Vivek Goyal <[email protected]>
> *
> */
>
> @@ -16,6 +19,7 @@
> #include <linux/elf.h>
> #include <linux/elfcore.h>
> #include <linux/module.h>
> +#include <linux/slab.h>
>
> #include <asm/processor.h>
> #include <asm/hardirq.h>
> @@ -28,6 +32,45 @@
> #include <asm/reboot.h>
> #include <asm/virtext.h>
>
> +/* Alignment required for elf header segment */
> +#define ELF_CORE_HEADER_ALIGN 4096
> +
> +/* This primarily reprsents number of split ranges due to exclusion */
> +#define CRASH_MAX_RANGES 16
> +
> +struct crash_mem_range {
> + unsigned long long start, end;

u64?

> +};
> +
> +struct crash_mem {
> + unsigned int nr_ranges;
> + struct crash_mem_range ranges[CRASH_MAX_RANGES];
> +};
> +
> +/* Misc data about ram ranges needed to prepare elf headers */
> +struct crash_elf_data {
> + struct kimage *image;
> + /*
> + * Total number of ram ranges we have after various ajustments for
> + * GART, crash reserved region etc.
> + */
> + unsigned int max_nr_ranges;
> + unsigned long gart_start, gart_end;
> +
> + /* Pointer to elf header */
> + void *ehdr;
> + /* Pointer to next phdr */
> + void *bufp;
> + struct crash_mem mem;
> +};
> +
> +/* Used while prepareing memory map entries for second kernel */

s/prepareing/preparing/

> +struct crash_memmap_data {
> + struct boot_params *params;
> + /* Type of memory */
> + unsigned int type;
> +};
> +
> int in_crash_kexec;
>
> /*
> @@ -137,3 +180,534 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> #endif
> crash_save_cpu(regs, safe_smp_processor_id());
> }
> +
> +#ifdef CONFIG_X86_64
> +
> +static int get_nr_ram_ranges_callback(unsigned long start_pfn,
> + unsigned long nr_pfn, void *arg)
> +{
> + int *nr_ranges = arg;
> +
> + (*nr_ranges)++;
> + return 0;
> +}
> +
> +static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
> +{
> + struct crash_elf_data *ced = arg;
> +
> + ced->gart_start = start;
> + ced->gart_end = end;
> +
> + /* Not expecting more than 1 gart aperture */
> + return 1;
> +}
> +
> +
> +/* Gather all the required information to prepare elf headers for ram regions */
> +static int fill_up_ced(struct crash_elf_data *ced, struct kimage *image)

All other functions have nice, spelled out names but not this one :)

Why not fill_up_crash_elf_data()?

> +{
> + unsigned int nr_ranges = 0;
> +
> + ced->image = image;
> +
> + walk_system_ram_range(0, -1, &nr_ranges,
> + get_nr_ram_ranges_callback);
> +
> + ced->max_nr_ranges = nr_ranges;
> +
> + /*
> + * We don't create ELF headers for GART aperture as an attempt
> + * to dump this memory in second kernel leads to hang/crash.
> + * If gart aperture is present, one needs to exclude that region
> + * and that could lead to need of extra phdr.
> + */
> +

superfluous newline.

> + walk_ram_res("GART", IORESOURCE_MEM, 0, -1,
> + ced, get_gart_ranges_callback);
> +
> + /*
> + * If we have gart region, excluding that could potentially split
> + * a memory range, resulting in extra header. Account for that.
> + */
> + if (ced->gart_end)
> + ced->max_nr_ranges++;
> +
> + /* Exclusion of crash region could split memory ranges */
> + ced->max_nr_ranges++;
> +
> + /* If crashk_low_res is there, another range split possible */
> + if (crashk_low_res.end != 0)
> + ced->max_nr_ranges++;
> +
> + return 0;
> +}

...

> +int load_crashdump_segments(struct kimage *image)
> +{
> + unsigned long src_start, src_sz;
> + unsigned long elf_addr, elf_sz;
> + int ret;
> +
> + /*
> + * Determine and load a segment for backup area. First 640K RAM
> + * region is backup source
> + */
> +
> + ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
> + image, determine_backup_region);
> +
> + /* Zero or postive return values are ok */
> + if (ret < 0)
> + return ret;
> +
> + src_start = image->arch.backup_src_start;
> + src_sz = image->arch.backup_src_sz;
> +
> + /* Add backup segment. */
> + if (src_sz) {
> + ret = kexec_add_buffer(image, __va(src_start), src_sz, src_sz,
> + PAGE_SIZE, 0, -1, 0,
> + &image->arch.backup_load_addr);
> + if (ret)
> + return ret;
> + }
> +
> + /* Prepare elf headers and add a segment */
> + ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
> + if (ret)
> + return ret;
> +
> + image->arch.elf_headers = elf_addr;
> + image->arch.elf_headers_sz = elf_sz;
> +
> + ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,

For some reason, my compiler complains here:

arch/x86/kernel/crash.c: In function ‘load_crashdump_segments’:
arch/x86/kernel/crash.c:704:6: warning: ‘elf_sz’ may be used uninitialized in this function [-Wuninitialized]
arch/x86/kernel/crash.c:704:24: warning: ‘elf_addr’ may be used uninitialized in this function [-Wuninitialized]

It is likely bogus, though.

...

> -int kexec_setup_cmdline(struct boot_params *boot_params,
> +int kexec_setup_cmdline(struct kimage *image, struct boot_params *boot_params,
> unsigned long bootparams_load_addr,
> unsigned long cmdline_offset, char *cmdline,
> unsigned long cmdline_len)
> {
> char *cmdline_ptr = ((char *)boot_params) + cmdline_offset;
> - unsigned long cmdline_ptr_phys;
> + unsigned long cmdline_ptr_phys, len;
> uint32_t cmdline_low_32, cmdline_ext_32;
>
> memcpy(cmdline_ptr, cmdline, cmdline_len);
> + if (image->type == KEXEC_TYPE_CRASH) {
> + len = sprintf(cmdline_ptr + cmdline_len - 1,
> + " elfcorehdr=0x%lx", image->arch.elf_load_addr);
> + cmdline_len += len;
> + }
> cmdline_ptr[cmdline_len - 1] = '\0';
>
> + pr_debug("Final command line is:%s\n", cmdline_ptr);

one space after ":"

The rest looks ok to me, but that doesn't mean a whole lot considering
my very limited kexec knowledge.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-02-28 21:06:55

by Vivek Goyal

Subject: Re: [PATCH 11/11] kexec: Support for Kexec on panic using new system call

On Fri, Feb 28, 2014 at 06:28:57PM +0100, Borislav Petkov wrote:

[..]
> > +/* Memory to backup during crash kdump */
> > +#define KEXEC_BACKUP_SRC_START (0UL)
> > +#define KEXEC_BACKUP_SRC_END (655360UL) /* 640K */
>
> I guess
>
> #define KEXEC_BACKUP_SRC_END (640 * 1024UL)
>
> should be more clear.

Will change.

[..]
> > +/* Alignment required for elf header segment */
> > +#define ELF_CORE_HEADER_ALIGN 4096
> > +
> > +/* This primarily reprsents number of split ranges due to exclusion */
> > +#define CRASH_MAX_RANGES 16
> > +
> > +struct crash_mem_range {
> > + unsigned long long start, end;
>
> u64?

Ok, that's shorter. Can use that.

[..]
> > +
> > +/* Used while prepareing memory map entries for second kernel */
>
> s/prepareing/preparing/

Yep typo. Will fix.

[..]
> > +static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
> > +{
> > + struct crash_elf_data *ced = arg;
> > +
> > + ced->gart_start = start;
> > + ced->gart_end = end;
> > +
> > + /* Not expecting more than 1 gart aperture */
> > + return 1;
> > +}
> > +
> > +
> > +/* Gather all the required information to prepare elf headers for ram regions */
> > +static int fill_up_ced(struct crash_elf_data *ced, struct kimage *image)
>
> All other functions have nice, spelled out names but not this one :)
>
> Why not fill_up_crash_elf_data()?

Will change it.

>
> > +{
> > + unsigned int nr_ranges = 0;
> > +
> > + ced->image = image;
> > +
> > + walk_system_ram_range(0, -1, &nr_ranges,
> > + get_nr_ram_ranges_callback);
> > +
> > + ced->max_nr_ranges = nr_ranges;
> > +
> > + /*
> > + * We don't create ELF headers for GART aperture as an attempt
> > + * to dump this memory in second kernel leads to hang/crash.
> > + * If gart aperture is present, one needs to exclude that region
> > + * and that could lead to need of extra phdr.
> > + */
> > +
>
> superfluous newline.

Will remove.

[..]
> > +int load_crashdump_segments(struct kimage *image)
> > +{
> > + unsigned long src_start, src_sz;
> > + unsigned long elf_addr, elf_sz;
> > + int ret;
> > +
> > + /*
> > + * Determine and load a segment for backup area. First 640K RAM
> > + * region is backup source
> > + */
> > +
> > + ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
> > + image, determine_backup_region);
> > +
> > + /* Zero or postive return values are ok */
> > + if (ret < 0)
> > + return ret;
> > +
> > + src_start = image->arch.backup_src_start;
> > + src_sz = image->arch.backup_src_sz;
> > +
> > + /* Add backup segment. */
> > + if (src_sz) {
> > + ret = kexec_add_buffer(image, __va(src_start), src_sz, src_sz,
> > + PAGE_SIZE, 0, -1, 0,
> > + &image->arch.backup_load_addr);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + /* Prepare elf headers and add a segment */
> > + ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
> > + if (ret)
> > + return ret;
> > +
> > + image->arch.elf_headers = elf_addr;
> > + image->arch.elf_headers_sz = elf_sz;
> > +
> > + ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
>
> For some reason, my compiler complains here:
>
> arch/x86/kernel/crash.c: In function ‘load_crashdump_segments’:
> arch/x86/kernel/crash.c:704:6: warning: ‘elf_sz’ may be used uninitialized in this function [-Wuninitialized]
> arch/x86/kernel/crash.c:704:24: warning: ‘elf_addr’ may be used uninitialized in this function [-Wuninitialized]
>
> It is likely bogus, though.

Hmm..., I did not see these warnings with my setup. elf_addr and elf_sz
will be initialized by prepare_elf_headers(). Not sure why the compiler is
complaining.

[..]
> > + pr_debug("Final command line is:%s\n", cmdline_ptr);
>
> one space after ":"

Ok. will do.

>
> The rest looks ok to me, but that doesn't mean a whole lot considering
> my very limited kexec knowledge.

Thanks for the review. We need many eyes on this patch set. I will make
the changes and post another version for review.

Thanks
Vivek