2014-06-26 20:35:11

by Vivek Goyal

Subject: [PATCH 00/15][V4] kexec: A new system call to allow in kernel loading

Hi,

This is V4 of the patchset. Previous versions were posted here:

V1: https://lkml.org/lkml/2013/11/20/540
V2: https://lkml.org/lkml/2014/1/27/331
V3: https://lkml.org/lkml/2014/6/3/432

Changes since v3:

- Took care of most of the review comments from V3.
- Stopped building purgatory for 32bit.
- If 64bit EFI is not enabled (EFI_64BIT), kernel loading now returns an
error.
- If EFI OLD_MEMMAP is being used, do not do EFI setup. User space is
expected to pass the acpi_rsdp=<addr> param and boot the second kernel in
non-EFI mode.
- Moved machine_kexec.c code into kexec-bzimage64.c.
- Renamed kexec-bzimage.c to kexec-bzimage64.c to reflect the fact that
it is only a 64bit bzImage loader.

This patch series is generated on top of 3.16.0-rc2.

This patch series does not do kernel signature verification yet. I plan
to post another patch series for that. Since distributions are already
signing PE/COFF bzImage with PKCS7 signatures, I plan to parse and verify
those signatures.

The primary goal of this patchset is to prepare the groundwork so that a
kernel image can be signed and its signature verified during kexec load.
This should help with two things.

- It should allow kexec/kdump on secureboot enabled machines.

- In general it can help even without secureboot. Being able to verify
the kernel image signature in kexec should help prevent kexec from being
used to get around module signing restrictions. Matthew Garrett showed
how to boot into a custom kernel, modify the first kernel's memory, and
then jump back to the old kernel and bypass any policy one wants to.

I hope these patches can be queued up for 3.17. Even without signature
verification support, they provide new syscall functionality. But I
will leave it to the maintainers to decide whether they want signature
verification support to be ready as well before they merge this patchset.

Any feedback is welcome.

Thanks
Vivek

Vivek Goyal (15):
bin2c: Move bin2c in scripts/basic
kernel: Build bin2c based on config option CONFIG_BUILD_BIN2C
kexec: rename unuseable_pages to unusable_pages
kexec: Move segment verification code in a separate function
kexec: Use common function for kimage_normal_alloc() and
kimage_crash_alloc()
resource: Provide new functions to walk through resources
kexec: Make kexec_segment user buffer pointer a union
kexec: New syscall kexec_file_load() declaration
kexec: Implementation of new syscall kexec_file_load
purgatory/sha256: Provide implementation of sha256 in purgatory
context
purgatory: Core purgatory functionality
kexec: Load and Relocate purgatory at kernel load time
kexec-bzImage64: Support for loading bzImage using 64bit entry
kexec: Support for kexec on panic using new system call
kexec: Support kexec/kdump on EFI systems

arch/x86/Kbuild | 4 +
arch/x86/Kconfig | 3 +
arch/x86/Makefile | 8 +
arch/x86/include/asm/crash.h | 9 +
arch/x86/include/asm/kexec-bzimage64.h | 6 +
arch/x86/include/asm/kexec.h | 40 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/crash.c | 563 ++++++++++++++
arch/x86/kernel/kexec-bzimage64.c | 532 +++++++++++++
arch/x86/kernel/machine_kexec_64.c | 228 ++++++
arch/x86/purgatory/Makefile | 30 +
arch/x86/purgatory/entry64.S | 101 +++
arch/x86/purgatory/purgatory.c | 72 ++
arch/x86/purgatory/setup-x86_64.S | 58 ++
arch/x86/purgatory/sha256.c | 283 +++++++
arch/x86/purgatory/sha256.h | 22 +
arch/x86/purgatory/stack.S | 19 +
arch/x86/purgatory/string.c | 13 +
arch/x86/syscalls/syscall_64.tbl | 1 +
drivers/firmware/efi/runtime-map.c | 21 +
include/linux/efi.h | 19 +
include/linux/ioport.h | 6 +
include/linux/kexec.h | 101 ++-
include/linux/syscalls.h | 4 +
include/uapi/linux/kexec.h | 11 +
init/Kconfig | 5 +
kernel/Makefile | 2 +-
kernel/kexec.c | 1322 ++++++++++++++++++++++++++++----
kernel/resource.c | 101 ++-
kernel/sys_ni.c | 1 +
scripts/.gitignore | 1 -
scripts/Makefile | 1 -
scripts/basic/.gitignore | 1 +
scripts/basic/Makefile | 1 +
scripts/basic/bin2c.c | 35 +
scripts/bin2c.c | 36 -
36 files changed, 3463 insertions(+), 198 deletions(-)
create mode 100644 arch/x86/include/asm/crash.h
create mode 100644 arch/x86/include/asm/kexec-bzimage64.h
create mode 100644 arch/x86/kernel/kexec-bzimage64.c
create mode 100644 arch/x86/purgatory/Makefile
create mode 100644 arch/x86/purgatory/entry64.S
create mode 100644 arch/x86/purgatory/purgatory.c
create mode 100644 arch/x86/purgatory/setup-x86_64.S
create mode 100644 arch/x86/purgatory/sha256.c
create mode 100644 arch/x86/purgatory/sha256.h
create mode 100644 arch/x86/purgatory/stack.S
create mode 100644 arch/x86/purgatory/string.c
create mode 100644 scripts/basic/bin2c.c
delete mode 100644 scripts/bin2c.c

--
1.9.0


2014-06-26 20:34:39

by Vivek Goyal

Subject: [PATCH 03/15] kexec: rename unuseable_pages to unusable_pages

Let's use the more common "unusable".

This patch was originally written and posted by Boris. I am including it
in this patch series.

Signed-off-by: Borislav Petkov <[email protected]>
---
include/linux/kexec.h | 2 +-
kernel/kexec.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index a756419..d9bb0a5 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -100,7 +100,7 @@ struct kimage {

struct list_head control_pages;
struct list_head dest_pages;
- struct list_head unuseable_pages;
+ struct list_head unusable_pages;

/* Address of next control page to allocate for crash kernels. */
unsigned long control_page;
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 369f41a..3aad6dc 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -153,7 +153,7 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
INIT_LIST_HEAD(&image->dest_pages);

/* Initialize the list of unusable pages */
- INIT_LIST_HEAD(&image->unuseable_pages);
+ INIT_LIST_HEAD(&image->unusable_pages);

/* Read in the segments */
image->nr_segments = nr_segments;
@@ -608,7 +608,7 @@ static void kimage_free_extra_pages(struct kimage *image)
kimage_free_page_list(&image->dest_pages);

/* Walk through and free any unusable pages I have cached */
- kimage_free_page_list(&image->unuseable_pages);
+ kimage_free_page_list(&image->unusable_pages);

}
static void kimage_terminate(struct kimage *image)
@@ -731,7 +731,7 @@ static struct page *kimage_alloc_page(struct kimage *image,
/* If the page cannot be used file it away */
if (page_to_pfn(page) >
(KEXEC_SOURCE_MEMORY_LIMIT >> PAGE_SHIFT)) {
- list_add(&page->lru, &image->unuseable_pages);
+ list_add(&page->lru, &image->unusable_pages);
continue;
}
addr = page_to_pfn(page) << PAGE_SHIFT;
--
1.9.0

2014-06-26 20:34:45

by Vivek Goyal

Subject: [PATCH 07/15] kexec: Make kexec_segment user buffer pointer a union

So far kexec_segment->buf has always been a user space pointer, because
user space passed the array of kexec_segment structures and the kernel
copied it.

But with the new system call, the list of kexec segments will be prepared
by the kernel and kexec_segment->buf will point to kernel memory.

So when I added code that assumes ->buf points to kernel memory, sparse
started giving warnings.

Make ->buf a union. Where a user space pointer is expected, access it
using ->buf, and where a kernel space pointer is expected, access it
using ->kbuf. That takes care of the sparse warnings.
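
As a rough illustration of the idea (not part of the patch), the union can
be sketched in plain user-space C. Here __user is defined away as a no-op,
and segment_set_kbuf() is a hypothetical helper invented for this example:

```c
#include <stddef.h>

/* Stand-in for the kernel's sparse annotation; a no-op outside the kernel. */
#define __user

/* Simplified copy of the structure as changed by this patch. */
struct kexec_segment {
	union {
		void __user *buf;	/* kexec_load(): user memory */
		void *kbuf;		/* kexec_file_load(): kernel memory */
	};
	size_t bufsz;
	unsigned long mem;
	size_t memsz;
};

/* Hypothetical helper: describe a segment backed by a kernel-side buffer. */
static void segment_set_kbuf(struct kexec_segment *seg, void *data, size_t len)
{
	seg->kbuf = data;	/* same storage as ->buf, different annotation */
	seg->bufsz = len;
	seg->memsz = len;
}
```

Both members share the same storage, so the structure size does not change.
Only the annotation seen by sparse differs between the two access paths.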

Signed-off-by: Vivek Goyal <[email protected]>
---
include/linux/kexec.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d9bb0a5..66d56ac 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -69,7 +69,18 @@ typedef unsigned long kimage_entry_t;
#define IND_SOURCE 0x8

struct kexec_segment {
- void __user *buf;
+ /*
+ * This pointer can point to user memory if kexec_load() system
+ * call is used or will point to kernel memory if
+ * kexec_file_load() system call is used.
+ *
+ * Use ->buf when expecting to deal with user memory and use ->kbuf
+ * when expecting to deal with kernel memory.
+ */
+ union {
+ void __user *buf;
+ void *kbuf;
+ };
size_t bufsz;
unsigned long mem;
size_t memsz;
--
1.9.0

2014-06-26 20:34:48

by Vivek Goyal

Subject: [PATCH 01/15] bin2c: Move bin2c in scripts/basic

Kexec wants to use bin2c, and it wants to use it really early in the build
process. See the arch/x86/purgatory/ code in later patches.

So move bin2c into scripts/basic so that it can be built very early and
be usable by arch/x86/purgatory/.
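
For readers unfamiliar with bin2c, here is a small sketch of what it emits
and how the result might be consumed. The first part is what bin2c prints
when given the array name "blob" and the four bytes 0x7f 'E' 'L' 'F' on
stdin. The name "blob" and the blob_is_elf() helper are hypothetical and
only serve this example:

```c
#include <stddef.h>

/* Output of bin2c for a 4-byte input and the array name "blob". */
const char blob[] =
	"\x7f\x45\x4c\x46"
	;

const int blob_size = 4;

/* Hypothetical consumer: check for the ELF magic at the start of a blob. */
static int blob_is_elf(const char *p, int size)
{
	return size >= 4 && p[0] == 0x7f && p[1] == 'E' &&
	       p[2] == 'L' && p[3] == 'F';
}
```

Because the data is emitted as a string literal, the array carries one extra
trailing NUL byte. Consumers should therefore rely on the generated _size
variable rather than on sizeof().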

Signed-off-by: Vivek Goyal <[email protected]>
---
kernel/Makefile | 2 +-
scripts/.gitignore | 1 -
scripts/Makefile | 1 -
scripts/basic/.gitignore | 1 +
scripts/basic/Makefile | 1 +
scripts/basic/bin2c.c | 35 +++++++++++++++++++++++++++++++++++
scripts/bin2c.c | 36 ------------------------------------
7 files changed, 38 insertions(+), 39 deletions(-)
create mode 100644 scripts/basic/bin2c.c
delete mode 100644 scripts/bin2c.c

diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b62..9b07bb7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -105,7 +105,7 @@ targets += config_data.gz
$(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE
$(call if_changed,gzip)

- filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/bin2c; echo "MAGIC_END;")
+ filechk_ikconfiggz = (echo "static const char kernel_config_data[] __used = MAGIC_START"; cat $< | scripts/basic/bin2c; echo "MAGIC_END;")
targets += config_data.h
$(obj)/config_data.h: $(obj)/config_data.gz FORCE
$(call filechk,ikconfiggz)
diff --git a/scripts/.gitignore b/scripts/.gitignore
index fb070fa..5ecfe93 100644
--- a/scripts/.gitignore
+++ b/scripts/.gitignore
@@ -4,7 +4,6 @@
conmakehash
kallsyms
pnmtologo
-bin2c
unifdef
ihex2fw
recordmcount
diff --git a/scripts/Makefile b/scripts/Makefile
index 890df5c..72902b5 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -13,7 +13,6 @@ HOST_EXTRACFLAGS += -I$(srctree)/tools/include
hostprogs-$(CONFIG_KALLSYMS) += kallsyms
hostprogs-$(CONFIG_LOGO) += pnmtologo
hostprogs-$(CONFIG_VT) += conmakehash
-hostprogs-$(CONFIG_IKCONFIG) += bin2c
hostprogs-$(BUILD_C_RECORDMCOUNT) += recordmcount
hostprogs-$(CONFIG_BUILDTIME_EXTABLE_SORT) += sortextable
hostprogs-$(CONFIG_ASN1) += asn1_compiler
diff --git a/scripts/basic/.gitignore b/scripts/basic/.gitignore
index a776371..9528ec9 100644
--- a/scripts/basic/.gitignore
+++ b/scripts/basic/.gitignore
@@ -1 +1,2 @@
fixdep
+bin2c
diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
index 4fcef87..afbc1cd 100644
--- a/scripts/basic/Makefile
+++ b/scripts/basic/Makefile
@@ -9,6 +9,7 @@
# fixdep: Used to generate dependency information during build process

hostprogs-y := fixdep
+hostprogs-$(CONFIG_IKCONFIG) += bin2c
always := $(hostprogs-y)

# fixdep is needed to compile other host programs
diff --git a/scripts/basic/bin2c.c b/scripts/basic/bin2c.c
new file mode 100644
index 0000000..af187e6
--- /dev/null
+++ b/scripts/basic/bin2c.c
@@ -0,0 +1,35 @@
+/*
+ * Unloved program to convert a binary on stdin to a C include on stdout
+ *
+ * Jan 1999 Matt Mackall <[email protected]>
+ *
+ * This software may be used and distributed according to the terms
+ * of the GNU General Public License, incorporated herein by reference.
+ */
+
+#include <stdio.h>
+
+int main(int argc, char *argv[])
+{
+ int ch, total = 0;
+
+ if (argc > 1)
+ printf("const char %s[] %s=\n",
+ argv[1], argc > 2 ? argv[2] : "");
+
+ do {
+ printf("\t\"");
+ while ((ch = getchar()) != EOF) {
+ total++;
+ printf("\\x%02x", ch);
+ if (total % 16 == 0)
+ break;
+ }
+ printf("\"\n");
+ } while (ch != EOF);
+
+ if (argc > 1)
+ printf("\t;\n\nconst int %s_size = %d;\n", argv[1], total);
+
+ return 0;
+}
diff --git a/scripts/bin2c.c b/scripts/bin2c.c
deleted file mode 100644
index 96dd2bc..0000000
--- a/scripts/bin2c.c
+++ /dev/null
@@ -1,36 +0,0 @@
-/*
- * Unloved program to convert a binary on stdin to a C include on stdout
- *
- * Jan 1999 Matt Mackall <[email protected]>
- *
- * This software may be used and distributed according to the terms
- * of the GNU General Public License, incorporated herein by reference.
- */
-
-#include <stdio.h>
-
-int main(int argc, char *argv[])
-{
- int ch, total=0;
-
- if (argc > 1)
- printf("const char %s[] %s=\n",
- argv[1], argc > 2 ? argv[2] : "");
-
- do {
- printf("\t\"");
- while ((ch = getchar()) != EOF)
- {
- total++;
- printf("\\x%02x",ch);
- if (total % 16 == 0)
- break;
- }
- printf("\"\n");
- } while (ch != EOF);
-
- if (argc > 1)
- printf("\t;\n\nconst int %s_size = %d;\n", argv[1], total);
-
- return 0;
-}
--
1.9.0

2014-06-26 20:35:00

by Vivek Goyal

Subject: [PATCH 15/15] kexec: Support kexec/kdump on EFI systems

This patch does two things. It passes the EFI run time mappings to the
second kernel in bootparams efi_info. The second kernel parses this info
and creates new mappings. That means the mappings in the first and second
kernel will be the same. This paves the way to enable EFI in the kexec
kernel.

This patch also prepares and passes EFI setup data through bootparams.
This contains a bunch of information about various tables and their
addresses.

This information gathering and passing has been written along the lines
of what current kexec-tools does to make kexec work with UEFI.
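
The buffer layout and address handling used in the diff below can be
sketched in stand-alone C. This is only an illustration: struct
params_layout, compute_layout() and split_addr() are hypothetical names,
and the sizes passed in are up to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Hypothetical summary of the single buffer that holds everything. */
struct params_layout {
	size_t efi_map_offset;		/* EFI map follows params + cmdline */
	size_t efi_setup_data_offset;	/* setup data follows the EFI map */
	size_t total_sz;
};

static struct params_layout
compute_layout(size_t params_cmdline_sz, size_t efi_map_sz,
	       size_t setup_data_sz)
{
	struct params_layout l;

	l.efi_map_offset = ALIGN_UP(params_cmdline_sz, 16);
	l.efi_setup_data_offset = l.efi_map_offset + ALIGN_UP(efi_map_sz, 16);
	l.total_sz = l.efi_setup_data_offset + setup_data_sz;
	return l;
}

/* Split a 64-bit physical address into the low/high 32-bit fields. */
static void split_addr(uint64_t phys, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)(phys & 0xffffffffu);
	*hi = (uint32_t)(phys >> 32);
}
```

Keeping all pieces in one allocation means only one kexec segment has to be
added. The physical address of each piece is then the segment load address
plus its offset.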

Signed-off-by: Vivek Goyal <[email protected]>
CC: [email protected]
---
arch/x86/kernel/kexec-bzimage64.c | 146 ++++++++++++++++++++++++++++++++++---
drivers/firmware/efi/runtime-map.c | 21 ++++++
include/linux/efi.h | 19 +++++
3 files changed, 174 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 61e4306..9487845 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -18,10 +18,12 @@
#include <linux/kexec.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/efi.h>

#include <asm/bootparam.h>
#include <asm/setup.h>
#include <asm/crash.h>
+#include <asm/efi.h>

#define MAX_ELFCOREHDR_STR_LEN 30 /* elfcorehdr=0x<64bit-value> */

@@ -90,7 +92,7 @@ static int setup_cmdline(struct kimage *image, struct boot_params *params,
return 0;
}

-static int setup_memory_map_entries(struct boot_params *params)
+static int setup_e820_entries(struct boot_params *params)
{
unsigned int nr_e820_entries;

@@ -107,8 +109,93 @@ static int setup_memory_map_entries(struct boot_params *params)
return 0;
}

-static int setup_boot_parameters(struct kimage *image,
- struct boot_params *params)
+#ifdef CONFIG_EFI
+static int setup_efi_info_memmap(struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int efi_map_offset,
+ unsigned int efi_map_sz)
+{
+ void *efi_map = (void *)params + efi_map_offset;
+ unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
+ struct efi_info *ei = &params->efi_info;
+
+ if (!efi_map_sz)
+ return 0;
+
+ efi_runtime_map_copy(efi_map, efi_map_sz);
+
+ ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
+ ei->efi_memmap_hi = efi_map_phys_addr >> 32;
+ ei->efi_memmap_size = efi_map_sz;
+
+ return 0;
+}
+
+static int
+prepare_add_efi_setup_data(struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int efi_setup_data_offset)
+{
+ unsigned long setup_data_phys;
+ struct setup_data *sd = (void *)params + efi_setup_data_offset;
+ struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data);
+
+ esd->fw_vendor = efi.fw_vendor;
+ esd->runtime = efi.runtime;
+ esd->tables = efi.config_table;
+ esd->smbios = efi.smbios;
+
+ sd->type = SETUP_EFI;
+ sd->len = sizeof(struct efi_setup_data);
+
+ /* Add setup data */
+ setup_data_phys = params_load_addr + efi_setup_data_offset;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = setup_data_phys;
+
+ return 0;
+}
+
+static int
+setup_efi_state(struct boot_params *params, unsigned long params_load_addr,
+ unsigned int efi_map_offset, unsigned int efi_map_sz,
+ unsigned int efi_setup_data_offset)
+{
+ struct efi_info *current_ei = &boot_params.efi_info;
+ struct efi_info *ei = &params->efi_info;
+
+ if (!current_ei->efi_memmap_size)
+ return 0;
+
+ /*
+ * If 1:1 mapping is not enabled, second kernel can not setup EFI
+ * and use EFI run time services. User space will have to pass
+ * acpi_rsdp=<addr> on kernel command line to make second kernel boot
+ * without efi.
+ */
+ if (efi_enabled(EFI_OLD_MEMMAP))
+ return 0;
+
+ ei->efi_loader_signature = current_ei->efi_loader_signature;
+ ei->efi_systab = current_ei->efi_systab;
+ ei->efi_systab_hi = current_ei->efi_systab_hi;
+
+ ei->efi_memdesc_version = current_ei->efi_memdesc_version;
+ ei->efi_memdesc_size = get_efi_runtime_map_desc_size();
+
+ setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
+ efi_map_sz);
+ prepare_add_efi_setup_data(params, params_load_addr,
+ efi_setup_data_offset);
+ return 0;
+}
+#endif /* CONFIG_EFI */
+
+static int
+setup_boot_parameters(struct kimage *image, struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int efi_map_offset, unsigned int efi_map_sz,
+ unsigned int efi_setup_data_offset)
{
unsigned int nr_e820_entries;
unsigned long long mem_k, start, end;
@@ -140,7 +227,7 @@ static int setup_boot_parameters(struct kimage *image,
if (ret)
return ret;
} else
- setup_memory_map_entries(params);
+ setup_e820_entries(params);

nr_e820_entries = params->e820_entries;

@@ -161,6 +248,12 @@ static int setup_boot_parameters(struct kimage *image,
}
}

+#ifdef CONFIG_EFI
+ /* Setup EFI state */
+ setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
+ efi_setup_data_offset);
+#endif
+
/* Setup EDD info */
memcpy(params->eddbuf, boot_params.eddbuf,
EDDMAXNR * sizeof(struct edd_info));
@@ -214,6 +307,15 @@ int bzImage64_probe(const char *buf, unsigned long len)
return ret;
}

+ /*
+ * Can't handle 32bit EFI as it does not allow loading kernel
+ * above 4G. This should be handled by 32bit bzImage loader
+ */
+ if (efi_enabled(EFI_RUNTIME_SERVICES) && !efi_enabled(EFI_64BIT)) {
+ pr_debug("EFI is 32 bit. Can't load kernel above 4G.\n");
+ return ret;
+ }
+
/* I've got a bzImage */
pr_debug("It's a relocatable bzImage64\n");
ret = 0;
@@ -229,7 +331,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,

struct setup_header *header;
int setup_sects, kern16_size, ret = 0;
- unsigned long setup_header_size, params_cmdline_sz;
+ unsigned long setup_header_size, params_cmdline_sz, params_misc_sz;
struct boot_params *params;
unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
unsigned long purgatory_load_addr;
@@ -239,6 +341,7 @@ void *bzImage64_load(struct kimage *image, char *kernel,
struct kexec_entry64_regs regs64;
void *stack;
unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
+ unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;

header = (struct setup_header *)(kernel + setup_hdr_offset);
setup_sects = header->setup_sects;
@@ -285,12 +388,29 @@ void *bzImage64_load(struct kimage *image, char *kernel,

pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);

- /* Load Bootparams and cmdline */
+
+ /*
+ * Load Bootparams and cmdline and space for efi stuff.
+ *
+ * Allocate memory together for multiple data structures so
+ * that they all can go in single area/segment and we don't
+ * have to create separate segment for each. Keeps things
+ * little bit simple
+ */
+ efi_map_sz = get_efi_runtime_map_size();
+ efi_map_sz = ALIGN(efi_map_sz, 16);
params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
MAX_ELFCOREHDR_STR_LEN;
- params = kzalloc(params_cmdline_sz, GFP_KERNEL);
+ params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
+ params_misc_sz = params_cmdline_sz + efi_map_sz +
+ sizeof(struct setup_data) +
+ sizeof(struct efi_setup_data);
+
+ params = kzalloc(params_misc_sz, GFP_KERNEL);
if (!params)
return ERR_PTR(-ENOMEM);
+ efi_map_offset = params_cmdline_sz;
+ efi_setup_data_offset = efi_map_offset + efi_map_sz;

/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
@@ -298,13 +418,13 @@ void *bzImage64_load(struct kimage *image, char *kernel,
/* Is there a limit on setup header size? */
memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);

- ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
- params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
+ ret = kexec_add_buffer(image, (char *)params, params_misc_sz,
+ params_misc_sz, 16, MIN_BOOTPARAM_ADDR,
ULONG_MAX, 1, &bootparam_load_addr);
if (ret)
goto out_free_params;
- pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
- bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
+ pr_debug("Loaded boot_param, command line and misc at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ bootparam_load_addr, params_misc_sz, params_misc_sz);

/* Load kernel */
kernel_buf = kernel + kern16_size;
@@ -365,7 +485,9 @@ void *bzImage64_load(struct kimage *image, char *kernel,
if (ret)
goto out_free_params;

- ret = setup_boot_parameters(image, params);
+ ret = setup_boot_parameters(image, params, bootparam_load_addr,
+ efi_map_offset, efi_map_sz,
+ efi_setup_data_offset);
if (ret)
goto out_free_params;

diff --git a/drivers/firmware/efi/runtime-map.c b/drivers/firmware/efi/runtime-map.c
index 97cdd16..40f2213 100644
--- a/drivers/firmware/efi/runtime-map.c
+++ b/drivers/firmware/efi/runtime-map.c
@@ -138,6 +138,27 @@ add_sysfs_runtime_map_entry(struct kobject *kobj, int nr)
return entry;
}

+int get_efi_runtime_map_size(void)
+{
+ return nr_efi_runtime_map * efi_memdesc_size;
+}
+
+int get_efi_runtime_map_desc_size(void)
+{
+ return efi_memdesc_size;
+}
+
+int efi_runtime_map_copy(void *buf, size_t bufsz)
+{
+ size_t sz = get_efi_runtime_map_size();
+
+ if (sz > bufsz)
+ sz = bufsz;
+
+ memcpy(buf, efi_runtime_map, sz);
+ return 0;
+}
+
void efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size)
{
efi_runtime_map = map;
diff --git a/include/linux/efi.h b/include/linux/efi.h
index 41bbf8b..dce9f31 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -1151,6 +1151,9 @@ int efivars_sysfs_init(void);
#ifdef CONFIG_EFI_RUNTIME_MAP
int efi_runtime_map_init(struct kobject *);
void efi_runtime_map_setup(void *, int, u32);
+int get_efi_runtime_map_size(void);
+int get_efi_runtime_map_desc_size(void);
+int efi_runtime_map_copy(void *buf, size_t bufsz);
#else
static inline int efi_runtime_map_init(struct kobject *kobj)
{
@@ -1159,6 +1162,22 @@ static inline int efi_runtime_map_init(struct kobject *kobj)

static inline void
efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size) {}
+
+static inline int get_efi_runtime_map_size(void)
+{
+ return 0;
+}
+
+static inline int get_efi_runtime_map_desc_size(void)
+{
+ return 0;
+}
+
+static inline int efi_runtime_map_copy(void *buf, size_t bufsz)
+{
+ return 0;
+}
+
#endif

#endif /* _LINUX_EFI_H */
--
1.9.0

2014-06-26 20:35:19

by Vivek Goyal

Subject: [PATCH 05/15] kexec: Use common function for kimage_normal_alloc() and kimage_crash_alloc()

kimage_normal_alloc() and kimage_crash_alloc() do a lot of similar things
and differ only a little. So instead of having two separate functions,
create a common function kimage_alloc_init() and pass it the "flags"
argument, which tells whether it is a normal kexec or kexec_on_panic. This
function should be able to deal with both cases.

This consolidation also helps later, where we can use a common function
kimage_file_alloc_init() to handle the normal and crash cases for the new
file based kexec syscall.
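
The shape of the consolidation can be sketched outside the kernel as
follows. The types and the image_init() function are simplified,
hypothetical stand-ins. Only the control flow (entry check, crash control
page policy, swap page for the normal case only) mirrors the diff below.

```c
#include <errno.h>
#include <stdbool.h>

#define KEXEC_ON_CRASH 0x00000001	/* flag bit, as in the uapi header */

enum image_type { TYPE_DEFAULT, TYPE_CRASH };

struct region { unsigned long start, end; };	/* stand-in for crashk_res */

struct image {				/* simplified stand-in for kimage */
	unsigned long start;
	unsigned long control_page;
	enum image_type type;
	bool has_swap_page;
};

static int image_init(struct image *img, unsigned long entry,
		      const struct region *crash, unsigned long flags)
{
	bool on_panic = flags & KEXEC_ON_CRASH;

	/* A crash kernel must start inside the reserved region. */
	if (on_panic && (entry < crash->start || entry > crash->end))
		return -EADDRNOTAVAIL;

	img->start = entry;
	img->type = TYPE_DEFAULT;
	img->control_page = 0;
	img->has_swap_page = false;

	if (on_panic) {
		/* Crash images allocate control pages from the region. */
		img->control_page = crash->start;
		img->type = TYPE_CRASH;
	} else {
		/* Only a normal kexec needs a swap page. */
		img->has_swap_page = true;
	}
	return 0;
}
```

A single flag test replaces the two near-identical functions, so any later
change to the common path only has to be made once.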

Signed-off-by: Vivek Goyal <[email protected]>
---
kernel/kexec.c | 105 +++++++++++++++++++--------------------------------------
1 file changed, 34 insertions(+), 71 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 44e823e..c69ce00 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -260,12 +260,20 @@ static struct kimage *do_kimage_alloc_init(void)

static void kimage_free_page_list(struct list_head *list);

-static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
- unsigned long nr_segments,
- struct kexec_segment __user *segments)
+static int kimage_alloc_init(struct kimage **rimage, unsigned long entry,
+ unsigned long nr_segments,
+ struct kexec_segment __user *segments,
+ unsigned long flags)
{
- int result;
+ int ret;
struct kimage *image;
+ bool kexec_on_panic = flags & KEXEC_ON_CRASH;
+
+ if (kexec_on_panic) {
+ /* Verify we have a valid entry point */
+ if ((entry < crashk_res.start) || (entry > crashk_res.end))
+ return -EADDRNOTAVAIL;
+ }

/* Allocate and initialize a controlling structure */
image = do_kimage_alloc_init();
@@ -274,20 +282,26 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,

image->start = entry;

- result = copy_user_segment_list(image, nr_segments, segments);
- if (result)
+ ret = copy_user_segment_list(image, nr_segments, segments);
+ if (ret)
goto out_free_image;

- result = sanity_check_segment_list(image);
- if (result)
+ ret = sanity_check_segment_list(image);
+ if (ret)
goto out_free_image;

+ /* Enable the special crash kernel control page allocation policy. */
+ if (kexec_on_panic) {
+ image->control_page = crashk_res.start;
+ image->type = KEXEC_TYPE_CRASH;
+ }
+
/*
* Find a location for the control code buffer, and add it
* the vector of segments so that it's pages will also be
* counted as destination pages.
*/
- result = -ENOMEM;
+ ret = -ENOMEM;
image->control_code_page = kimage_alloc_control_pages(image,
get_order(KEXEC_CONTROL_PAGE_SIZE));
if (!image->control_code_page) {
@@ -295,10 +309,12 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
goto out_free_image;
}

- image->swap_page = kimage_alloc_control_pages(image, 0);
- if (!image->swap_page) {
- pr_err("Could not allocate swap buffer\n");
- goto out_free_control_pages;
+ if (!kexec_on_panic) {
+ image->swap_page = kimage_alloc_control_pages(image, 0);
+ if (!image->swap_page) {
+ pr_err("Could not allocate swap buffer\n");
+ goto out_free_control_pages;
+ }
}

*rimage = image;
@@ -307,60 +323,7 @@ out_free_control_pages:
kimage_free_page_list(&image->control_pages);
out_free_image:
kfree(image);
- return result;
-}
-
-static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
- unsigned long nr_segments,
- struct kexec_segment __user *segments)
-{
- int result;
- struct kimage *image;
-
- /* Verify we have a valid entry point */
- if ((entry < crashk_res.start) || (entry > crashk_res.end))
- return -EADDRNOTAVAIL;
-
- /* Allocate and initialize a controlling structure */
- image = do_kimage_alloc_init();
- if (!image)
- return -ENOMEM;
-
- image->start = entry;
-
- /* Enable the special crash kernel control page
- * allocation policy.
- */
- image->control_page = crashk_res.start;
- image->type = KEXEC_TYPE_CRASH;
-
- result = copy_user_segment_list(image, nr_segments, segments);
- if (result)
- goto out_free_image;
-
- result = sanity_check_segment_list(image);
- if (result)
- goto out_free_image;
-
- /*
- * Find a location for the control code buffer, and add
- * the vector of segments so that it's pages will also be
- * counted as destination pages.
- */
- result = -ENOMEM;
- image->control_code_page = kimage_alloc_control_pages(image,
- get_order(KEXEC_CONTROL_PAGE_SIZE));
- if (!image->control_code_page) {
- pr_err("Could not allocate control_code_buffer\n");
- goto out_free_image;
- }
-
- *rimage = image;
- return 0;
-
-out_free_image:
- kfree(image);
- return result;
+ return ret;
}

static int kimage_is_destination_range(struct kimage *image,
@@ -1003,16 +966,16 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,

/* Loading another kernel to reboot into */
if ((flags & KEXEC_ON_CRASH) == 0)
- result = kimage_normal_alloc(&image, entry,
- nr_segments, segments);
+ result = kimage_alloc_init(&image, entry, nr_segments,
+ segments, flags);
/* Loading another kernel to switch to if this one crashes */
else if (flags & KEXEC_ON_CRASH) {
/* Free any current crash dump kernel before
* we corrupt it.
*/
kimage_free(xchg(&kexec_crash_image, NULL));
- result = kimage_crash_alloc(&image, entry,
- nr_segments, segments);
+ result = kimage_alloc_init(&image, entry, nr_segments,
+ segments, flags);
crash_map_reserved_pages();
}
if (result)
--
1.9.0

2014-06-26 20:35:29

by Vivek Goyal

Subject: [PATCH 11/15] purgatory: Core purgatory functionality

Create a stand alone relocatable object, purgatory, which runs between two
kernels. The name, concept and some code have been taken from kexec-tools.
The idea is that this code runs after a crash and it runs in a minimal
environment. So keep it separate from the rest of the kernel, and in the
long term we should have to do practically no maintenance of this code.

This code also has the logic to verify sha256 hashes of the various
segments which have been loaded into memory. So first we verify that
the kernel we are jumping to is fine and has not been corrupted, and
make progress only if the checksums are verified.

This code also takes care of copying some memory contents to the backup
region.
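
The "verify before jumping" step can be illustrated with a short sketch.
Note the assumptions: toy_digest() is a small FNV-1a hash used purely as a
stand-in for sha256 to keep the example short, and struct region_digest
with one expected digest per region is a hypothetical layout, not the data
structure used by this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: a loaded region and its load-time digest. */
struct region_digest {
	const unsigned char *start;
	size_t len;
	uint32_t expected;
};

/* Toy FNV-1a hash. NOT sha256; a stand-in for illustration only. */
static uint32_t toy_digest(const unsigned char *p, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Return 0 if every region still matches its digest, -1 otherwise. */
static int verify_regions(const struct region_digest *r, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (toy_digest(r[i].start, r[i].len) != r[i].expected)
			return -1;
	return 0;
}
```

A caller would continue to the next kernel only when verify_regions()
returns 0, and stop otherwise.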

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/Kbuild | 4 ++
arch/x86/Makefile | 8 +++
arch/x86/purgatory/Makefile | 30 +++++++++++
arch/x86/purgatory/entry64.S | 101 ++++++++++++++++++++++++++++++++++++++
arch/x86/purgatory/purgatory.c | 72 +++++++++++++++++++++++++++
arch/x86/purgatory/setup-x86_64.S | 58 ++++++++++++++++++++++
arch/x86/purgatory/stack.S | 19 +++++++
arch/x86/purgatory/string.c | 13 +++++
8 files changed, 305 insertions(+)
create mode 100644 arch/x86/purgatory/Makefile
create mode 100644 arch/x86/purgatory/entry64.S
create mode 100644 arch/x86/purgatory/purgatory.c
create mode 100644 arch/x86/purgatory/setup-x86_64.S
create mode 100644 arch/x86/purgatory/stack.S
create mode 100644 arch/x86/purgatory/string.c

diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index e5287d8..61b6d51 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -16,3 +16,7 @@ obj-$(CONFIG_IA32_EMULATION) += ia32/

obj-y += platform/
obj-y += net/
+
+ifeq ($(CONFIG_X86_64),y)
+obj-$(CONFIG_KEXEC) += purgatory/
+endif
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 33f71b0..dc302a7 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -186,6 +186,14 @@ archscripts: scripts_basic
archheaders:
$(Q)$(MAKE) $(build)=arch/x86/syscalls all

+archprepare:
+ifeq ($(CONFIG_KEXEC),y)
+# Build only for 64bit. No loaders for 32bit yet.
+ ifeq ($(CONFIG_X86_64),y)
+ $(Q)$(MAKE) $(build)=arch/x86/purgatory arch/x86/purgatory/kexec-purgatory.c
+ endif
+endif
+
###
# Kernel objects

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
new file mode 100644
index 0000000..e5829dd
--- /dev/null
+++ b/arch/x86/purgatory/Makefile
@@ -0,0 +1,30 @@
+purgatory-y := purgatory.o stack.o setup-x86_$(BITS).o sha256.o entry64.o string.o
+
+targets += $(purgatory-y)
+PURGATORY_OBJS = $(addprefix $(obj)/,$(purgatory-y))
+
+LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined -nostdlib -z nodefaultlib
+targets += purgatory.ro
+
+# Default KBUILD_CFLAGS can have -pg option set when FTRACE is enabled. That
+# in turn leaves some undefined symbols like __fentry__ in purgatory and not
+# sure how to relocate those. Like kexec-tools, use custom flags.
+
+KBUILD_CFLAGS := -fno-strict-aliasing -Wall -Wstrict-prototypes -fno-zero-initialized-in-bss -fno-builtin -ffreestanding -c -MD -Os -mcmodel=large
+
+$(obj)/purgatory.ro: $(PURGATORY_OBJS) FORCE
+ $(call if_changed,ld)
+
+targets += kexec-purgatory.c
+
+quiet_cmd_bin2c = BIN2C $@
+ cmd_bin2c = cat $(obj)/purgatory.ro | $(srctree)/scripts/basic/bin2c kexec_purgatory > $(obj)/kexec-purgatory.c
+
+$(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
+ $(call if_changed,bin2c)
+
+
+# No loaders for 32bits yet.
+ifeq ($(CONFIG_X86_64),y)
+ obj-$(CONFIG_KEXEC) += kexec-purgatory.o
+endif
diff --git a/arch/x86/purgatory/entry64.S b/arch/x86/purgatory/entry64.S
new file mode 100644
index 0000000..be3249d
--- /dev/null
+++ b/arch/x86/purgatory/entry64.S
@@ -0,0 +1,101 @@
+/*
+ * Copyright (C) 2003,2004 Eric Biederman ([email protected])
+ * Copyright (C) 2014 Red Hat Inc.
+
+ * Author(s): Vivek Goyal <[email protected]>
+ *
+ * This code has been taken from kexec-tools.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+ .text
+ .balign 16
+ .code64
+ .globl entry64, entry64_regs
+
+
+entry64:
+ /* Setup a gdt that should be preserved */
+ lgdt gdt(%rip)
+
+ /* load the data segments */
+ movl $0x18, %eax /* data segment */
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+ movl %eax, %fs
+ movl %eax, %gs
+
+ /* Setup new stack */
+ leaq stack_init(%rip), %rsp
+ pushq $0x10 /* CS */
+ leaq new_cs_exit(%rip), %rax
+ pushq %rax
+ lretq
+new_cs_exit:
+
+ /* Load the registers */
+ movq rax(%rip), %rax
+ movq rbx(%rip), %rbx
+ movq rcx(%rip), %rcx
+ movq rdx(%rip), %rdx
+ movq rsi(%rip), %rsi
+ movq rdi(%rip), %rdi
+ movq rsp(%rip), %rsp
+ movq rbp(%rip), %rbp
+ movq r8(%rip), %r8
+ movq r9(%rip), %r9
+ movq r10(%rip), %r10
+ movq r11(%rip), %r11
+ movq r12(%rip), %r12
+ movq r13(%rip), %r13
+ movq r14(%rip), %r14
+ movq r15(%rip), %r15
+
+ /* Jump to the new code... */
+ jmpq *rip(%rip)
+
+ .section ".rodata"
+ .balign 4
+entry64_regs:
+rax: .quad 0x0
+rbx: .quad 0x0
+rcx: .quad 0x0
+rdx: .quad 0x0
+rsi: .quad 0x0
+rdi: .quad 0x0
+rsp: .quad 0x0
+rbp: .quad 0x0
+r8: .quad 0x0
+r9: .quad 0x0
+r10: .quad 0x0
+r11: .quad 0x0
+r12: .quad 0x0
+r13: .quad 0x0
+r14: .quad 0x0
+r15: .quad 0x0
+rip: .quad 0x0
+ .size entry64_regs, . - entry64_regs
+
+ /* GDT */
+ .section ".rodata"
+ .balign 16
+gdt:
+ /* 0x00 unusable segment
+ * 0x08 unused
+ * so use them as gdt ptr
+ */
+ .word gdt_end - gdt - 1
+ .quad gdt
+ .word 0, 0, 0
+
+ /* 0x10 4GB flat code segment */
+ .word 0xFFFF, 0x0000, 0x9A00, 0x00AF
+
+ /* 0x18 4GB flat data segment */
+ .word 0xFFFF, 0x0000, 0x9200, 0x00CF
+gdt_end:
+stack: .quad 0, 0
+stack_init:
diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
new file mode 100644
index 0000000..25e068b
--- /dev/null
+++ b/arch/x86/purgatory/purgatory.c
@@ -0,0 +1,72 @@
+/*
+ * purgatory: Runs between two kernels
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * Author:
+ * Vivek Goyal <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include "sha256.h"
+#include "../boot/string.h"
+
+struct sha_region {
+ unsigned long start;
+ unsigned long len;
+};
+
+unsigned long backup_dest = 0;
+unsigned long backup_src = 0;
+unsigned long backup_sz = 0;
+
+u8 sha256_digest[SHA256_DIGEST_SIZE] = { 0 };
+
+struct sha_region sha_regions[16] = {};
+
+/*
+ * On x86, second kernel requires first 640K of memory to boot. Copy
+ * first 640K to a backup region in reserved memory range so that second
+ * kernel can use first 640K.
+ */
+static int copy_backup_region(void)
+{
+ if (backup_dest)
+ memcpy((void *)backup_dest, (void *)backup_src, backup_sz);
+
+ return 0;
+}
+
+int verify_sha256_digest(void)
+{
+ struct sha_region *ptr, *end;
+ u8 digest[SHA256_DIGEST_SIZE];
+ struct sha256_state sctx;
+
+ sha256_init(&sctx);
+ end = &sha_regions[sizeof(sha_regions)/sizeof(sha_regions[0])];
+ for (ptr = sha_regions; ptr < end; ptr++)
+ sha256_update(&sctx, (uint8_t *)(ptr->start), ptr->len);
+
+ sha256_final(&sctx, digest);
+
+ if (memcmp(digest, sha256_digest, sizeof(digest)))
+ return 1;
+
+ return 0;
+}
+
+void purgatory(void)
+{
+ int ret;
+
+ ret = verify_sha256_digest();
+ if (ret) {
+ /* loop forever */
+ for (;;)
+ ;
+ }
+ copy_backup_region();
+}
diff --git a/arch/x86/purgatory/setup-x86_64.S b/arch/x86/purgatory/setup-x86_64.S
new file mode 100644
index 0000000..fe3c91b
--- /dev/null
+++ b/arch/x86/purgatory/setup-x86_64.S
@@ -0,0 +1,58 @@
+/*
+ * purgatory: setup code
+ *
+ * Copyright (C) 2003,2004 Eric Biederman ([email protected])
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * This code has been taken from kexec-tools.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+ .text
+ .globl purgatory_start
+ .balign 16
+purgatory_start:
+ .code64
+
+ /* Load a gdt so I know what the segment registers are */
+ lgdt gdt(%rip)
+
+ /* load the data segments */
+ movl $0x18, %eax /* data segment */
+ movl %eax, %ds
+ movl %eax, %es
+ movl %eax, %ss
+ movl %eax, %fs
+ movl %eax, %gs
+
+ /* Setup a stack */
+ leaq lstack_end(%rip), %rsp
+
+ /* Call the C code */
+ call purgatory
+ jmp entry64
+
+ .section ".rodata"
+ .balign 16
+gdt: /* 0x00 unusable segment
+ * 0x08 unused
+ * so use them as the gdt ptr
+ */
+ .word gdt_end - gdt - 1
+ .quad gdt
+ .word 0, 0, 0
+
+ /* 0x10 4GB flat code segment */
+ .word 0xFFFF, 0x0000, 0x9A00, 0x00AF
+
+ /* 0x18 4GB flat data segment */
+ .word 0xFFFF, 0x0000, 0x9200, 0x00CF
+gdt_end:
+
+ .bss
+ .balign 4096
+lstack:
+ .skip 4096
+lstack_end:
diff --git a/arch/x86/purgatory/stack.S b/arch/x86/purgatory/stack.S
new file mode 100644
index 0000000..3cefba1
--- /dev/null
+++ b/arch/x86/purgatory/stack.S
@@ -0,0 +1,19 @@
+/*
+ * purgatory: stack
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+ /* A stack for the loaded kernel.
+ * Separate and in the data section so it can be prepopulated.
+ */
+ .data
+ .balign 4096
+ .globl stack, stack_end
+
+stack:
+ .skip 4096
+stack_end:
diff --git a/arch/x86/purgatory/string.c b/arch/x86/purgatory/string.c
new file mode 100644
index 0000000..d886b1f
--- /dev/null
+++ b/arch/x86/purgatory/string.c
@@ -0,0 +1,13 @@
+/*
+ * Simple string functions.
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * Author:
+ * Vivek Goyal <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include "../boot/string.c"
--
1.9.0

2014-06-26 20:35:24

by Vivek Goyal

Subject: [PATCH 02/15] kernel: Build bin2c based on config option CONFIG_BUILD_BIN2C

Currently bin2c is built only if CONFIG_IKCONFIG=y, but bin2c will now be
used by kexec too. So make its compilation depend on CONFIG_BUILD_BIN2C,
a config option that is selected by CONFIG_KEXEC and CONFIG_IKCONFIG.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/Kconfig | 1 +
init/Kconfig | 5 +++++
scripts/basic/Makefile | 2 +-
3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a8f749e..eaa00ae 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1576,6 +1576,7 @@ source kernel/Kconfig.hz

config KEXEC
bool "kexec system call"
+ select BUILD_BIN2C
---help---
kexec is a system call that implements the ability to shutdown your
current kernel, and to start another kernel. It is like a reboot
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99..866fc50 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -783,8 +783,13 @@ endchoice

endmenu # "RCU Subsystem"

+config BUILD_BIN2C
+ bool
+ default n
+
config IKCONFIG
tristate "Kernel .config support"
+ select BUILD_BIN2C
---help---
This option enables the complete Linux kernel ".config" file
contents to be saved in the kernel. It provides documentation
diff --git a/scripts/basic/Makefile b/scripts/basic/Makefile
index afbc1cd..ec10d93 100644
--- a/scripts/basic/Makefile
+++ b/scripts/basic/Makefile
@@ -9,7 +9,7 @@
# fixdep: Used to generate dependency information during build process

hostprogs-y := fixdep
-hostprogs-$(CONFIG_IKCONFIG) += bin2c
+hostprogs-$(CONFIG_BUILD_BIN2C) += bin2c
always := $(hostprogs-y)

# fixdep is needed to compile other host programs
--
1.9.0

2014-06-26 20:35:35

by Vivek Goyal

Subject: [PATCH 14/15] kexec: Support for kexec on panic using new system call

This patch adds support for loading a kexec on panic (kdump) kernel using
the new system call.

It prepares ELF headers for the memory areas to be dumped and for the saved
cpu registers. It also prepares the memory map for the second kernel and
limits its boot to reserved areas only.
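
For illustration only, the range exclusion that both the ELF header
preparation and the second kernel's memory map rely on can be sketched in
user space roughly as below. This is not the code from the patch: the names
exclude_range, mem_list, mem_range and MAX_RANGES are hypothetical, bounds are
assumed to be inclusive, and the overflow check is a simplified one of my own.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RANGES 16	/* hypothetical; mirrors CRASH_MAX_RANGES */

struct mem_range {
	uint64_t start, end;	/* inclusive bounds */
};

struct mem_list {
	unsigned int nr_ranges;
	struct mem_range ranges[MAX_RANGES];
};

/*
 * Remove [mstart, mend] from the first range in 'mem' that it overlaps.
 * Returns 0 on success, -1 if a split would overflow the array.
 */
static int exclude_range(struct mem_list *mem, uint64_t mstart, uint64_t mend)
{
	unsigned int i, j;

	for (i = 0; i < mem->nr_ranges; i++) {
		uint64_t start = mem->ranges[i].start;
		uint64_t end = mem->ranges[i].end;

		if (mstart > end || mend < start)
			continue;	/* no overlap with this range */

		/* Clamp the excluded area to the current range */
		if (mstart < start)
			mstart = start;
		if (mend > end)
			mend = end;

		if (mstart == start && mend == end) {
			/* Whole range is excluded: shift the rest left */
			for (j = i; j + 1 < mem->nr_ranges; j++)
				mem->ranges[j] = mem->ranges[j + 1];
			mem->nr_ranges--;
		} else if (mstart > start && mend < end) {
			/* Hole in the middle: split into two ranges */
			if (mem->nr_ranges == MAX_RANGES)
				return -1;
			for (j = mem->nr_ranges; j > i + 1; j--)
				mem->ranges[j] = mem->ranges[j - 1];
			mem->ranges[i].end = mstart - 1;
			mem->ranges[i + 1].start = mend + 1;
			mem->ranges[i + 1].end = end;
			mem->nr_ranges++;
		} else if (mstart != start) {
			mem->ranges[i].end = mstart - 1;	/* trim tail */
		} else {
			mem->ranges[i].start = mend + 1;	/* trim head */
		}
		assert(mem->nr_ranges <= MAX_RANGES);
		return 0;
	}
	return 0;	/* nothing overlapped */
}
```

Starting from a single RAM range and excluding the crashkernel region from
its middle leaves two ranges, which is why the header count has to allow for
one extra phdr per possible split.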

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/include/asm/crash.h | 9 +
arch/x86/include/asm/kexec.h | 19 ++
arch/x86/kernel/crash.c | 563 +++++++++++++++++++++++++++++++++++++
arch/x86/kernel/kexec-bzimage64.c | 55 +++-
arch/x86/kernel/machine_kexec_64.c | 40 +++
kernel/kexec.c | 46 ++-
6 files changed, 713 insertions(+), 19 deletions(-)
create mode 100644 arch/x86/include/asm/crash.h

diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
new file mode 100644
index 0000000..f498411
--- /dev/null
+++ b/arch/x86/include/asm/crash.h
@@ -0,0 +1,9 @@
+#ifndef _ASM_X86_CRASH_H
+#define _ASM_X86_CRASH_H
+
+int crash_load_segments(struct kimage *image);
+int crash_copy_backup_region(struct kimage *image);
+int crash_setup_memmap_entries(struct kimage *image,
+ struct boot_params *params);
+
+#endif /* _ASM_X86_CRASH_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 0dfccce..45aa864 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -25,6 +25,8 @@
#include <asm/ptrace.h>
#include <asm/bootparam.h>

+struct kimage;
+
/*
* KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
* I.e. Maximum page that is mapped directly into kernel memory,
@@ -62,6 +64,10 @@
# define KEXEC_ARCH KEXEC_ARCH_X86_64
#endif

+/* Memory to backup during crash kdump */
+#define KEXEC_BACKUP_SRC_START (0UL)
+#define KEXEC_BACKUP_SRC_END (640 * 1024UL) /* 640K */
+
/*
* CPU does not save ss and sp on stack if execution is already
* running in kernel mode at the time of NMI occurrence. This code
@@ -161,8 +167,21 @@ struct kimage_arch {
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ /* Details of backup region */
+ unsigned long backup_src_start;
+ unsigned long backup_src_sz;
+
+ /* Physical address of backup segment */
+ unsigned long backup_load_addr;
+
+ /* Core ELF header buffer */
+ void *elf_headers;
+ unsigned long elf_headers_sz;
+ unsigned long elf_load_addr;
};
+#endif /* CONFIG_X86_32 */

+#ifdef CONFIG_X86_64
struct kexec_entry64_regs {
uint64_t rax;
uint64_t rbx;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 507de80..0553a34 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -4,9 +4,14 @@
* Created by: Hariprasad Nellitheertha ([email protected])
*
* Copyright (C) IBM Corporation, 2004. All rights reserved.
+ * Copyright (C) Red Hat Inc., 2014. All rights reserved.
+ * Authors:
+ * Vivek Goyal <[email protected]>
*
*/

+#define pr_fmt(fmt) "kexec: " fmt
+
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/smp.h>
@@ -16,6 +21,7 @@
#include <linux/elf.h>
#include <linux/elfcore.h>
#include <linux/module.h>
+#include <linux/slab.h>

#include <asm/processor.h>
#include <asm/hardirq.h>
@@ -28,6 +34,45 @@
#include <asm/reboot.h>
#include <asm/virtext.h>

+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN 4096
+
+/* This primarily represents number of split ranges due to exclusion */
+#define CRASH_MAX_RANGES 16
+
+struct crash_mem_range {
+ u64 start, end;
+};
+
+struct crash_mem {
+ unsigned int nr_ranges;
+ struct crash_mem_range ranges[CRASH_MAX_RANGES];
+};
+
+/* Misc data about ram ranges needed to prepare elf headers */
+struct crash_elf_data {
+ struct kimage *image;
+ /*
+ * Total number of ram ranges we have after various adjustments for
+ * GART, crash reserved region etc.
+ */
+ unsigned int max_nr_ranges;
+ unsigned long gart_start, gart_end;
+
+ /* Pointer to elf header */
+ void *ehdr;
+ /* Pointer to next phdr */
+ void *bufp;
+ struct crash_mem mem;
+};
+
+/* Used while preparing memory map entries for second kernel */
+struct crash_memmap_data {
+ struct boot_params *params;
+ /* Type of memory */
+ unsigned int type;
+};
+
int in_crash_kexec;

/*
@@ -39,6 +84,7 @@ int in_crash_kexec;
*/
crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss = NULL;
EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
+unsigned long crash_zero_bytes;

static inline void cpu_crash_vmclear_loaded_vmcss(void)
{
@@ -135,3 +181,520 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
#endif
crash_save_cpu(regs, safe_smp_processor_id());
}
+
+#ifdef CONFIG_X86_64
+
+static int get_nr_ram_ranges_callback(unsigned long start_pfn,
+ unsigned long nr_pfn, void *arg)
+{
+ int *nr_ranges = arg;
+
+ (*nr_ranges)++;
+ return 0;
+}
+
+static int get_gart_ranges_callback(u64 start, u64 end, void *arg)
+{
+ struct crash_elf_data *ced = arg;
+
+ ced->gart_start = start;
+ ced->gart_end = end;
+
+ /* Not expecting more than 1 gart aperture */
+ return 1;
+}
+
+
+/* Gather all the required information to prepare elf headers for ram regions */
+static void fill_up_crash_elf_data(struct crash_elf_data *ced,
+ struct kimage *image)
+{
+ unsigned int nr_ranges = 0;
+
+ ced->image = image;
+
+ walk_system_ram_range(0, -1, &nr_ranges,
+ get_nr_ram_ranges_callback);
+
+ ced->max_nr_ranges = nr_ranges;
+
+ /*
+ * We don't create ELF headers for GART aperture as an attempt
+ * to dump this memory in second kernel leads to hang/crash.
+ * If gart aperture is present, one needs to exclude that region
+ * and that could lead to need of extra phdr.
+ */
+ walk_iomem_res("GART", IORESOURCE_MEM, 0, -1,
+ ced, get_gart_ranges_callback);
+
+ /*
+ * If we have gart region, excluding that could potentially split
+ * a memory range, resulting in extra header. Account for that.
+ */
+ if (ced->gart_end)
+ ced->max_nr_ranges++;
+
+ /* Exclusion of crash region could split memory ranges */
+ ced->max_nr_ranges++;
+
+ /* If crashk_low_res is not 0, another range split possible */
+ if (crashk_low_res.end != 0)
+ ced->max_nr_ranges++;
+}
+
+static int exclude_mem_range(struct crash_mem *mem,
+ unsigned long long mstart, unsigned long long mend)
+{
+ int i, j;
+ unsigned long long start, end;
+ struct crash_mem_range temp_range = {0, 0};
+
+ for (i = 0; i < mem->nr_ranges; i++) {
+ start = mem->ranges[i].start;
+ end = mem->ranges[i].end;
+
+ if (mstart > end || mend < start)
+ continue;
+
+ /* Truncate any area outside of range */
+ if (mstart < start)
+ mstart = start;
+ if (mend > end)
+ mend = end;
+
+ /* Found completely overlapping range */
+ if (mstart == start && mend == end) {
+ mem->ranges[i].start = 0;
+ mem->ranges[i].end = 0;
+ if (i < mem->nr_ranges - 1) {
+ /* Shift rest of the ranges to left */
+ for (j = i; j < mem->nr_ranges - 1; j++) {
+ mem->ranges[j].start =
+ mem->ranges[j+1].start;
+ mem->ranges[j].end =
+ mem->ranges[j+1].end;
+ }
+ }
+ mem->nr_ranges--;
+ return 0;
+ }
+
+ if (mstart > start && mend < end) {
+ /* Split original range */
+ mem->ranges[i].end = mstart - 1;
+ temp_range.start = mend + 1;
+ temp_range.end = end;
+ } else if (mstart != start)
+ mem->ranges[i].end = mstart - 1;
+ else
+ mem->ranges[i].start = mend + 1;
+ break;
+ }
+
+ /* If a split happened, add the split to array */
+ if (!temp_range.end)
+ return 0;
+
+ /* Split happened */
+ if (i == CRASH_MAX_RANGES - 1) {
+ pr_err("Too many crash ranges after split\n");
+ return -ENOMEM;
+ }
+
+ /* Location where new range should go */
+ j = i + 1;
+ if (j < mem->nr_ranges) {
+ /* Move over all ranges one slot towards the end */
+ for (i = mem->nr_ranges - 1; i >= j; i--)
+ mem->ranges[i + 1] = mem->ranges[i];
+ }
+
+ mem->ranges[j].start = temp_range.start;
+ mem->ranges[j].end = temp_range.end;
+ mem->nr_ranges++;
+ return 0;
+}
+
+/*
+ * Look for any unwanted ranges between mstart, mend and remove them. This
+ * might lead to split and split ranges are put in ced->mem.ranges[] array
+ */
+static int elf_header_exclude_ranges(struct crash_elf_data *ced,
+ unsigned long long mstart, unsigned long long mend)
+{
+ struct crash_mem *cmem = &ced->mem;
+ int ret = 0;
+
+ memset(cmem->ranges, 0, sizeof(cmem->ranges));
+
+ cmem->ranges[0].start = mstart;
+ cmem->ranges[0].end = mend;
+ cmem->nr_ranges = 1;
+
+ /* Exclude crashkernel region */
+ ret = exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
+ if (ret)
+ return ret;
+
+ ret = exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end);
+ if (ret)
+ return ret;
+
+ /* Exclude GART region */
+ if (ced->gart_end) {
+ ret = exclude_mem_range(cmem, ced->gart_start, ced->gart_end);
+ if (ret)
+ return ret;
+ }
+
+ return ret;
+}
+
+static int prepare_elf64_ram_headers_callback(u64 start, u64 end, void *arg)
+{
+ struct crash_elf_data *ced = arg;
+ Elf64_Ehdr *ehdr;
+ Elf64_Phdr *phdr;
+ unsigned long mstart, mend;
+ struct kimage *image = ced->image;
+ struct crash_mem *cmem;
+ int ret, i;
+
+ ehdr = ced->ehdr;
+
+ /* Exclude unwanted mem ranges */
+ ret = elf_header_exclude_ranges(ced, start, end);
+ if (ret)
+ return ret;
+
+ /* Go through all the ranges in ced->mem.ranges[] and prepare phdr */
+ cmem = &ced->mem;
+
+ for (i = 0; i < cmem->nr_ranges; i++) {
+ mstart = cmem->ranges[i].start;
+ mend = cmem->ranges[i].end;
+
+ phdr = ced->bufp;
+ ced->bufp += sizeof(Elf64_Phdr);
+
+ phdr->p_type = PT_LOAD;
+ phdr->p_flags = PF_R|PF_W|PF_X;
+ phdr->p_offset = mstart;
+
+ /*
+ * If a range matches backup region, adjust offset to backup
+ * segment.
+ */
+ if (mstart == image->arch.backup_src_start &&
+ (mend - mstart + 1) == image->arch.backup_src_sz)
+ phdr->p_offset = image->arch.backup_load_addr;
+
+ phdr->p_paddr = mstart;
+ phdr->p_vaddr = (unsigned long long) __va(mstart);
+ phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
+ phdr->p_align = 0;
+ ehdr->e_phnum++;
+ pr_debug("Crash PT_LOAD elf header. phdr=%p vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
+ phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
+ ehdr->e_phnum, phdr->p_offset);
+ }
+
+ return ret;
+}
+
+static int prepare_elf64_headers(struct crash_elf_data *ced,
+ void **addr, unsigned long *sz)
+{
+ Elf64_Ehdr *ehdr;
+ Elf64_Phdr *phdr;
+ unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
+ unsigned char *buf, *bufp;
+ unsigned int cpu;
+ unsigned long long notes_addr;
+ int ret;
+
+ /* extra phdr for vmcoreinfo elf note */
+ nr_phdr = nr_cpus + 1;
+ nr_phdr += ced->max_nr_ranges;
+
+ /*
+ * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
+ * area on x86_64 (ffffffff80000000 - ffffffffa0000000).
+ * I think this is required by tools like gdb. So same physical
+ * memory will be mapped in two elf headers. One will contain kernel
+ * text virtual addresses and other will have __va(physical) addresses.
+ */
+
+ nr_phdr++;
+ elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
+ elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
+
+ buf = vzalloc(elf_sz);
+ if (!buf)
+ return -ENOMEM;
+
+ bufp = buf;
+ ehdr = (Elf64_Ehdr *)bufp;
+ bufp += sizeof(Elf64_Ehdr);
+ memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
+ ehdr->e_ident[EI_CLASS] = ELFCLASS64;
+ ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
+ ehdr->e_ident[EI_VERSION] = EV_CURRENT;
+ ehdr->e_ident[EI_OSABI] = ELF_OSABI;
+ memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
+ ehdr->e_type = ET_CORE;
+ ehdr->e_machine = ELF_ARCH;
+ ehdr->e_version = EV_CURRENT;
+ ehdr->e_phoff = sizeof(Elf64_Ehdr);
+ ehdr->e_ehsize = sizeof(Elf64_Ehdr);
+ ehdr->e_phentsize = sizeof(Elf64_Phdr);
+
+ /* Prepare one phdr of type PT_NOTE for each present cpu */
+ for_each_present_cpu(cpu) {
+ phdr = (Elf64_Phdr *)bufp;
+ bufp += sizeof(Elf64_Phdr);
+ phdr->p_type = PT_NOTE;
+ notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
+ phdr->p_offset = phdr->p_paddr = notes_addr;
+ phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
+ (ehdr->e_phnum)++;
+ }
+
+ /* Prepare one PT_NOTE header for vmcoreinfo */
+ phdr = (Elf64_Phdr *)bufp;
+ bufp += sizeof(Elf64_Phdr);
+ phdr->p_type = PT_NOTE;
+ phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
+ phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
+ (ehdr->e_phnum)++;
+
+#ifdef CONFIG_X86_64
+ /* Prepare PT_LOAD type program header for kernel text region */
+ phdr = (Elf64_Phdr *)bufp;
+ bufp += sizeof(Elf64_Phdr);
+ phdr->p_type = PT_LOAD;
+ phdr->p_flags = PF_R|PF_W|PF_X;
+ phdr->p_vaddr = (Elf64_Addr)_text;
+ phdr->p_filesz = phdr->p_memsz = _end - _text;
+ phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
+ (ehdr->e_phnum)++;
+#endif
+
+ /* Prepare PT_LOAD headers for system ram chunks. */
+ ced->ehdr = ehdr;
+ ced->bufp = bufp;
+ ret = walk_system_ram_res(0, -1, ced,
+ prepare_elf64_ram_headers_callback);
+ if (ret < 0)
+ return ret;
+
+ *addr = buf;
+ *sz = elf_sz;
+ return 0;
+}
+
+/* Prepare elf headers. Return addr and size */
+static int prepare_elf_headers(struct kimage *image, void **addr,
+ unsigned long *sz)
+{
+ struct crash_elf_data *ced;
+ int ret;
+
+ ced = kzalloc(sizeof(*ced), GFP_KERNEL);
+ if (!ced)
+ return -ENOMEM;
+
+ fill_up_crash_elf_data(ced, image);
+
+ /* By default prepare 64bit headers */
+ ret = prepare_elf64_headers(ced, addr, sz);
+ kfree(ced);
+ return ret;
+}
+
+static int add_e820_entry(struct boot_params *params, struct e820entry *entry)
+{
+ unsigned int nr_e820_entries;
+
+ nr_e820_entries = params->e820_entries;
+ if (nr_e820_entries >= E820MAX)
+ return 1;
+
+ memcpy(&params->e820_map[nr_e820_entries], entry,
+ sizeof(struct e820entry));
+ params->e820_entries++;
+ return 0;
+}
+
+static int memmap_entry_callback(u64 start, u64 end, void *arg)
+{
+ struct crash_memmap_data *cmd = arg;
+ struct boot_params *params = cmd->params;
+ struct e820entry ei;
+
+ ei.addr = start;
+ ei.size = end - start + 1;
+ ei.type = cmd->type;
+ add_e820_entry(params, &ei);
+
+ return 0;
+}
+
+static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
+ unsigned long long mstart,
+ unsigned long long mend)
+{
+ unsigned long start, end;
+ int ret = 0;
+
+ cmem->ranges[0].start = mstart;
+ cmem->ranges[0].end = mend;
+ cmem->nr_ranges = 1;
+
+ /* Exclude Backup region */
+ start = image->arch.backup_load_addr;
+ end = start + image->arch.backup_src_sz - 1;
+ ret = exclude_mem_range(cmem, start, end);
+ if (ret)
+ return ret;
+
+ /* Exclude elf header region */
+ start = image->arch.elf_load_addr;
+ end = start + image->arch.elf_headers_sz - 1;
+ return exclude_mem_range(cmem, start, end);
+}
+
+/* Prepare memory map for crash dump kernel */
+int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
+{
+ int i, ret = 0;
+ unsigned long flags;
+ struct e820entry ei;
+ struct crash_memmap_data cmd;
+ struct crash_mem *cmem;
+
+ cmem = vzalloc(sizeof(struct crash_mem));
+ if (!cmem)
+ return -ENOMEM;
+
+ memset(&cmd, 0, sizeof(struct crash_memmap_data));
+ cmd.params = params;
+
+ /* Add first 640K segment */
+ ei.addr = image->arch.backup_src_start;
+ ei.size = image->arch.backup_src_sz;
+ ei.type = E820_RAM;
+ add_e820_entry(params, &ei);
+
+ /* Add ACPI tables */
+ cmd.type = E820_ACPI;
+ flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+ walk_iomem_res("ACPI Tables", flags, 0, -1, &cmd,
+ memmap_entry_callback);
+
+ /* Add ACPI Non-volatile Storage */
+ cmd.type = E820_NVS;
+ walk_iomem_res("ACPI Non-volatile Storage", flags, 0, -1, &cmd,
+ memmap_entry_callback);
+
+ /* Add crashk_low_res region */
+ if (crashk_low_res.end) {
+ ei.addr = crashk_low_res.start;
+ ei.size = crashk_low_res.end - crashk_low_res.start + 1;
+ ei.type = E820_RAM;
+ add_e820_entry(params, &ei);
+ }
+
+ /* Exclude some ranges from crashk_res and add rest to memmap */
+ ret = memmap_exclude_ranges(image, cmem, crashk_res.start,
+ crashk_res.end);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < cmem->nr_ranges; i++) {
+ ei.size = cmem->ranges[i].end - cmem->ranges[i].start + 1;
+
+ /* If entry is less than a page, skip it */
+ if (ei.size < PAGE_SIZE)
+ continue;
+ ei.addr = cmem->ranges[i].start;
+ ei.type = E820_RAM;
+ add_e820_entry(params, &ei);
+ }
+
+out:
+ vfree(cmem);
+ return ret;
+}
+
+static int determine_backup_region(u64 start, u64 end, void *arg)
+{
+ struct kimage *image = arg;
+
+ image->arch.backup_src_start = start;
+ image->arch.backup_src_sz = end - start + 1;
+
+ /* Expecting only one range for backup region */
+ return 1;
+}
+
+int crash_load_segments(struct kimage *image)
+{
+ unsigned long src_start, src_sz, elf_sz;
+ void *elf_addr;
+ int ret;
+
+ /*
+ * Determine and load a segment for backup area. First 640K RAM
+ * region is backup source
+ */
+
+ ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
+ image, determine_backup_region);
+
+ /* Zero or positive return values are ok */
+ if (ret < 0)
+ return ret;
+
+ src_start = image->arch.backup_src_start;
+ src_sz = image->arch.backup_src_sz;
+
+ /* Add backup segment. */
+ if (src_sz) {
+ /*
+ * Ideally there is no source for backup segment. This is
+ * copied in purgatory after crash. Just add a zero filled
+ * segment for now to make sure checksum logic works fine.
+ */
+ ret = kexec_add_buffer(image, (char *)&crash_zero_bytes,
+ sizeof(crash_zero_bytes), src_sz,
+ PAGE_SIZE, 0, -1, 0,
+ &image->arch.backup_load_addr);
+ if (ret)
+ return ret;
+ pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n",
+ image->arch.backup_load_addr, src_start, src_sz);
+ }
+
+ /* Prepare elf headers and add a segment */
+ ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
+ if (ret)
+ return ret;
+
+ image->arch.elf_headers = elf_addr;
+ image->arch.elf_headers_sz = elf_sz;
+
+ ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
+ ELF_CORE_HEADER_ALIGN, 0, -1, 0,
+ &image->arch.elf_load_addr);
+ if (ret) {
+ vfree((void *)image->arch.elf_headers);
+ return ret;
+ }
+ pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ image->arch.elf_load_addr, elf_sz, elf_sz);
+
+ return ret;
+}
+
+#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 990bf27..61e4306 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -21,6 +21,9 @@

#include <asm/bootparam.h>
#include <asm/setup.h>
+#include <asm/crash.h>
+
+#define MAX_ELFCOREHDR_STR_LEN 30 /* elfcorehdr=0x<64bit-value> */

/*
* Defines lowest physical address for various segments. Not sure where
@@ -58,18 +61,24 @@ static int setup_initrd(struct boot_params *params,
return 0;
}

-static int setup_cmdline(struct boot_params *params,
+static int setup_cmdline(struct kimage *image, struct boot_params *params,
unsigned long bootparams_load_addr,
unsigned long cmdline_offset, char *cmdline,
unsigned long cmdline_len)
{
char *cmdline_ptr = ((char *)params) + cmdline_offset;
- unsigned long cmdline_ptr_phys;
+ unsigned long cmdline_ptr_phys, len;
uint32_t cmdline_low_32, cmdline_ext_32;

memcpy(cmdline_ptr, cmdline, cmdline_len);
+ if (image->type == KEXEC_TYPE_CRASH) {
+ len = sprintf(cmdline_ptr + cmdline_len - 1,
+ " elfcorehdr=0x%lx", image->arch.elf_load_addr);
+ cmdline_len += len;
+ }
cmdline_ptr[cmdline_len - 1] = '\0';

+ pr_debug("Final command line is: %s\n", cmdline_ptr);
cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
cmdline_ext_32 = cmdline_ptr_phys >> 32;
@@ -98,11 +107,12 @@ static int setup_memory_map_entries(struct boot_params *params)
return 0;
}

-static int setup_boot_parameters(struct boot_params *params)
+static int setup_boot_parameters(struct kimage *image,
+ struct boot_params *params)
{
unsigned int nr_e820_entries;
unsigned long long mem_k, start, end;
- int i;
+ int i, ret = 0;

/* Get subarch from existing bootparams */
params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
@@ -125,7 +135,13 @@ static int setup_boot_parameters(struct boot_params *params)
/* Default sysdesc table */
params->sys_desc_table.length = 0;

- setup_memory_map_entries(params);
+ if (image->type == KEXEC_TYPE_CRASH) {
+ ret = crash_setup_memmap_entries(image, params);
+ if (ret)
+ return ret;
+ } else
+ setup_memory_map_entries(params);
+
nr_e820_entries = params->e820_entries;

for (i = 0; i < nr_e820_entries; i++) {
@@ -153,7 +169,7 @@ static int setup_boot_parameters(struct boot_params *params)
memcpy(params->edd_mbr_sig_buffer, boot_params.edd_mbr_sig_buffer,
EDD_MBR_SIG_MAX * sizeof(unsigned int));

- return 0;
+ return ret;
}

int bzImage64_probe(const char *buf, unsigned long len)
@@ -241,6 +257,22 @@ void *bzImage64_load(struct kimage *image, char *kernel,
}

/*
+ * In case of crash dump, we will append elfcorehdr=<addr> to
+ * command line. Make sure it does not overflow
+ */
+ if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
+ pr_debug("Appending elfcorehdr=<addr> to command line exceeds maximum allowed length\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* Allocate and load backup region */
+ if (image->type == KEXEC_TYPE_CRASH) {
+ ret = crash_load_segments(image);
+ if (ret)
+ return ERR_PTR(ret);
+ }
+
+ /*
* Load purgatory. For 64bit entry point, purgatory code can be
* anywhere.
*/
@@ -254,7 +286,8 @@ void *bzImage64_load(struct kimage *image, char *kernel,
pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);

/* Load Bootparams and cmdline */
- params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+ params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
+ MAX_ELFCOREHDR_STR_LEN;
params = kzalloc(params_cmdline_sz, GFP_KERNEL);
if (!params)
return ERR_PTR(-ENOMEM);
@@ -303,8 +336,8 @@ void *bzImage64_load(struct kimage *image, char *kernel,
setup_initrd(params, initrd_load_addr, initrd_len);
}

- setup_cmdline(params, bootparam_load_addr, sizeof(struct boot_params),
- cmdline, cmdline_len);
+ setup_cmdline(image, params, bootparam_load_addr,
+ sizeof(struct boot_params), cmdline, cmdline_len);

/* bootloader info. Do we need a separate ID for kexec kernel loader? */
params->hdr.type_of_loader = 0x0D << 4;
@@ -332,7 +365,9 @@ void *bzImage64_load(struct kimage *image, char *kernel,
if (ret)
goto out_free_params;

- setup_boot_parameters(params);
+ ret = setup_boot_parameters(image, params);
+ if (ret)
+ goto out_free_params;

/* Allocate loader specific data */
ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 87c6a99..b6cf40c 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -178,6 +178,38 @@ static void load_segments(void)
);
}

+/* Update purgatory as needed after various image segments have been prepared */
+static int arch_update_purgatory(struct kimage *image)
+{
+ int ret = 0;
+
+ if (!image->file_mode)
+ return 0;
+
+ /* Setup copying of backup region */
+ if (image->type == KEXEC_TYPE_CRASH) {
+ ret = kexec_purgatory_get_set_symbol(image, "backup_dest",
+ &image->arch.backup_load_addr,
+ sizeof(image->arch.backup_load_addr), 0);
+ if (ret)
+ return ret;
+
+ ret = kexec_purgatory_get_set_symbol(image, "backup_src",
+ &image->arch.backup_src_start,
+ sizeof(image->arch.backup_src_start), 0);
+ if (ret)
+ return ret;
+
+ ret = kexec_purgatory_get_set_symbol(image, "backup_sz",
+ &image->arch.backup_src_sz,
+ sizeof(image->arch.backup_src_sz), 0);
+ if (ret)
+ return ret;
+ }
+
+ return ret;
+}
+
int machine_kexec_prepare(struct kimage *image)
{
unsigned long start_pgtable;
@@ -191,6 +223,11 @@ int machine_kexec_prepare(struct kimage *image)
if (result)
return result;

+ /* update purgatory as needed */
+ result = arch_update_purgatory(image);
+ if (result)
+ return result;
+
return 0;
}

@@ -315,6 +352,9 @@ int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,

void *arch_kexec_kernel_image_load(struct kimage *image)
{
+ vfree(image->arch.elf_headers);
+ image->arch.elf_headers = NULL;
+
if (!image->fops || !image->fops->load)
return ERR_PTR(-ENOEXEC);

diff --git a/kernel/kexec.c b/kernel/kexec.c
index f7ca4ce..18a962f 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -533,6 +533,7 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
{
int ret;
struct kimage *image;
+ bool kexec_on_panic = flags & KEXEC_FILE_ON_CRASH;

image = do_kimage_alloc_init();
if (!image)
@@ -540,6 +541,12 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,

image->file_mode = 1;

+ if (kexec_on_panic) {
+ /* Enable special crash kernel control page alloc policy. */
+ image->control_page = crashk_res.start;
+ image->type = KEXEC_TYPE_CRASH;
+ }
+
ret = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
cmdline_ptr, cmdline_len, flags);
if (ret)
@@ -557,10 +564,12 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
goto out_free_post_load_bufs;
}

- image->swap_page = kimage_alloc_control_pages(image, 0);
- if (!image->swap_page) {
- pr_err(KERN_ERR "Could not allocate swap buffer\n");
- goto out_free_control_pages;
+ if (!kexec_on_panic) {
+ image->swap_page = kimage_alloc_control_pages(image, 0);
+ if (!image->swap_page) {
+ pr_err(KERN_ERR "Could not allocate swap buffer\n");
+ goto out_free_control_pages;
+ }
}

*rimage = image;
@@ -1101,10 +1110,14 @@ static int kimage_load_crash_segment(struct kimage *image,
unsigned long maddr;
size_t ubytes, mbytes;
int result;
- unsigned char __user *buf;
+ unsigned char __user *buf = NULL;
+ unsigned char *kbuf = NULL;

result = 0;
- buf = segment->buf;
+ if (image->file_mode)
+ kbuf = segment->kbuf;
+ else
+ buf = segment->buf;
ubytes = segment->bufsz;
mbytes = segment->memsz;
maddr = segment->mem;
@@ -1127,7 +1140,12 @@ static int kimage_load_crash_segment(struct kimage *image,
/* Zero the trailing part of the page */
memset(ptr + uchunk, 0, mchunk - uchunk);
}
- result = copy_from_user(ptr, buf, uchunk);
+
+ /* For file based kexec, source pages are in kernel memory */
+ if (image->file_mode)
+ memcpy(ptr, kbuf, uchunk);
+ else
+ result = copy_from_user(ptr, buf, uchunk);
kexec_flush_icache_page(page);
kunmap(page);
if (result) {
@@ -1136,7 +1154,10 @@ static int kimage_load_crash_segment(struct kimage *image,
}
ubytes -= uchunk;
maddr += mchunk;
- buf += mchunk;
+ if (image->file_mode)
+ kbuf += mchunk;
+ else
+ buf += mchunk;
mbytes -= mchunk;
}
out:
@@ -2113,7 +2134,14 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
kbuf->top_down = top_down;

/* Walk the RAM ranges and allocate a suitable range for the buffer */
- ret = walk_system_ram_res(0, -1, kbuf, locate_mem_hole_callback);
+ if (image->type == KEXEC_TYPE_CRASH)
+ ret = walk_iomem_res("Crash kernel",
+ IORESOURCE_MEM | IORESOURCE_BUSY,
+ crashk_res.start, crashk_res.end, kbuf,
+ locate_mem_hole_callback);
+ else
+ ret = walk_system_ram_res(0, -1, kbuf,
+ locate_mem_hole_callback);
if (ret != 1) {
/* A suitable memory range could not be found for buffer */
return -EADDRNOTAVAIL;
--
1.9.0

2014-06-26 20:35:33

by Vivek Goyal

Subject: [PATCH 08/15] kexec: New syscall kexec_file_load() declaration

This patch adds the declaration/interface of the new syscall
kexec_file_load(). I have reserved a syscall number only for x86_64 so far.
Other architectures (including i386) can reserve a syscall number when they
enable support for this new syscall.
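
For illustration only (not part of this patch): a minimal sketch of how user
space might invoke the syscall through syscall(2). The wrapper name
kexec_file_load_sketch() is made up here; the argument order follows the
declaration added to include/linux/syscalls.h below. It assumes the C library
headers export SYS_kexec_file_load and falls back to ENOSYS when they do not,
rather than hard-coding the number from this posting.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical user-space wrapper; the name is an assumption made for
 * this example. Returns -1 and sets errno on failure, like syscall(2).
 */
static long kexec_file_load_sketch(int kernel_fd, int initrd_fd,
				   unsigned long cmdline_len,
				   const char *cmdline_ptr,
				   unsigned long flags)
{
#ifdef SYS_kexec_file_load
	return syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
		       cmdline_len, cmdline_ptr, flags);
#else
	/* Headers do not know the syscall: behave like an unwired one */
	errno = ENOSYS;
	return -1;
#endif
}
```

With only this patch applied, a call would be expected to fail with ENOSYS,
since the syscall body is still a stub.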

Signed-off-by: Vivek Goyal <[email protected]>
CC: [email protected]
---
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 4 ++++
kernel/kexec.c | 7 +++++++
kernel/sys_ni.c | 1 +
4 files changed, 13 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..6d35459 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common kexec_file_load sys_kexec_file_load

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0..9e98193 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -317,6 +317,10 @@ asmlinkage long sys_restart_syscall(void);
asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments,
struct kexec_segment __user *segments,
unsigned long flags);
+asmlinkage long sys_kexec_file_load(int kernel_fd, int initrd_fd,
+ unsigned long cmdline_len,
+ const char __user *cmdline_ptr,
+ unsigned long flags);

asmlinkage long sys_exit(int error_code);
asmlinkage long sys_exit_group(int error_code);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index c69ce00..bdda717 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1057,6 +1057,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, entry,
}
#endif

+SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
+ unsigned long, cmdline_len, const char __user *, cmdline_ptr,
+ unsigned long, flags)
+{
+ return -ENOSYS;
+}
+
void crash_kexec(struct pt_regs *regs)
{
/* Take the kexec_mutex here to prevent sys_kexec_load
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b5..51ea89f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -25,6 +25,7 @@ cond_syscall(sys_swapon);
cond_syscall(sys_swapoff);
cond_syscall(sys_kexec_load);
cond_syscall(compat_sys_kexec_load);
+cond_syscall(sys_kexec_file_load);
cond_syscall(sys_init_module);
cond_syscall(sys_finit_module);
cond_syscall(sys_delete_module);
--
1.9.0

2014-06-26 20:36:12

by Vivek Goyal

Subject: [PATCH 06/15] resource: Provide new functions to walk through resources

I have added two more functions to walk through resources.

Currently walk_system_ram_range() deals with pfns, while /proc/iomem can
contain partial pages. By dealing in pfns, the callback function loses the
information that the last page of a memory range is a partial page and not
a full page. So I implemented walk_system_ram_res(), which passes u64 values
to the callback function, so that it now gets the proper start and end
addresses.
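
A small worked example of the truncation, as a user-space sketch. The rounding
is the same as in walk_system_ram_range() (start rounded up, end + 1 rounded
down); the helper names and the 4K page size are assumptions made for this
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Bytes a pfn based callback can see for the inclusive range [start, end] */
static uint64_t bytes_seen_as_pfns(uint64_t start, uint64_t end)
{
	uint64_t pfn = (start + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	uint64_t end_pfn = (end + 1) >> SKETCH_PAGE_SHIFT;

	return end_pfn > pfn ? (end_pfn - pfn) << SKETCH_PAGE_SHIFT : 0;
}

/* Bytes a u64 start/end based callback can see for the same range */
static uint64_t bytes_seen_as_res(uint64_t start, uint64_t end)
{
	return end - start + 1;
}
```

For a range such as 0x1000-0x9fbff, the pfn view covers 0x9e000 bytes while
the resource view covers 0x9ec00 bytes; the trailing 0xc00 bytes of the
partial page are invisible to a pfn based callback.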

walk_system_ram_range() uses find_next_system_ram() to find the next
RAM resource. This in turn only travels through the siblings of the top level
child and does not traverse all the nodes of the resource tree. I
also need another function with which I can walk through all the resources,
for example to figure out where the "GART" aperture is or where
ACPI memory is.

So I wrote another function walk_iomem_res() which walks through all
/proc/iomem resources and returns matches as asked by caller. Caller
can specify "name" of resource, start and end and flags.

Got rid of find_next_system_ram() and instead implemented the more
generic find_next_iomem_res(), which can be restricted to traversing top
level children only, based on an argument.
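
To make the two traversal modes concrete, here is a self-contained user-space
sketch. struct res_node is a simplified stand-in for struct resource (only the
tree links), and count_nodes() is a hypothetical helper; next_node() uses the
same stepping rule as next_resource() in the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct resource: a name and the tree links */
struct res_node {
	const char *name;
	struct res_node *parent, *sibling, *child;
};

/* Same stepping rule as next_resource() in the patch */
static struct res_node *next_node(struct res_node *p, bool sibling_only)
{
	/* Caller wants to traverse through siblings only */
	if (sibling_only)
		return p->sibling;

	if (p->child)
		return p->child;
	while (!p->sibling && p->parent)
		p = p->parent;
	return p->sibling;
}

/* Count nodes below root: first level children only, or the whole tree */
static int count_nodes(struct res_node *root, bool first_level_only)
{
	struct res_node *p;
	int n = 0;

	if (first_level_only) {
		for (p = root->child; p; p = next_node(p, true))
			n++;
	} else {
		p = root;
		while ((p = next_node(p, false)))
			n++;
	}
	return n;
}
```

On a tree with two top level children, one of which has nested children, the
sibling-only walk sees two nodes while the full walk also visits the nested
ones, which is what a lookup for something like "GART" needs.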

Signed-off-by: Vivek Goyal <[email protected]>
Cc: Yinghai Lu <[email protected]>
---
include/linux/ioport.h | 6 +++
kernel/resource.c | 101 ++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 5e3a906..142ec54 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -237,6 +237,12 @@ extern int iomem_is_exclusive(u64 addr);
extern int
walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
void *arg, int (*func)(unsigned long, unsigned long, void *));
+extern int
+walk_system_ram_res(u64 start, u64 end, void *arg,
+ int (*func)(u64, u64, void *));
+extern int
+walk_iomem_res(char *name, unsigned long flags, u64 start, u64 end, void *arg,
+ int (*func)(u64, u64, void *));

/* True if any part of r1 overlaps r2 */
static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
diff --git a/kernel/resource.c b/kernel/resource.c
index 3c2237a..da14b8d 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -59,10 +59,12 @@ static DEFINE_RWLOCK(resource_lock);
static struct resource *bootmem_resource_free;
static DEFINE_SPINLOCK(bootmem_resource_lock);

-static void *r_next(struct seq_file *m, void *v, loff_t *pos)
+static struct resource *next_resource(struct resource *p, bool sibling_only)
{
- struct resource *p = v;
- (*pos)++;
+ /* Caller wants to traverse through siblings only */
+ if (sibling_only)
+ return p->sibling;
+
if (p->child)
return p->child;
while (!p->sibling && p->parent)
@@ -70,6 +72,13 @@ static void *r_next(struct seq_file *m, void *v, loff_t *pos)
return p->sibling;
}

+static void *r_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct resource *p = v;
+ (*pos)++;
+ return (void *)next_resource(p, false);
+}
+
#ifdef CONFIG_PROC_FS

enum { MAX_IORES_LEVEL = 5 };
@@ -322,16 +331,19 @@ int release_resource(struct resource *old)

EXPORT_SYMBOL(release_resource);

-#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
/*
- * Finds the lowest memory reosurce exists within [res->start.res->end)
+ * Finds the lowest iomem reosurce exists with-in [res->start.res->end)
* the caller must specify res->start, res->end, res->flags and "name".
* If found, returns 0, res is overwritten, if not found, returns -1.
+ * This walks through whole tree and not just first level children
+ * until and unless first_level_children_only is true.
*/
-static int find_next_system_ram(struct resource *res, char *name)
+static int find_next_iomem_res(struct resource *res, char *name,
+ bool first_level_children_only)
{
resource_size_t start, end;
struct resource *p;
+ bool sibling_only = false;

BUG_ON(!res);

@@ -340,8 +352,14 @@ static int find_next_system_ram(struct resource *res, char *name)
BUG_ON(start >= end);

read_lock(&resource_lock);
- for (p = iomem_resource.child; p ; p = p->sibling) {
- /* system ram is just marked as IORESOURCE_MEM */
+
+ if (first_level_children_only) {
+ p = iomem_resource.child;
+ sibling_only = true;
+ } else
+ p = &iomem_resource;
+
+ while ((p = next_resource(p, sibling_only))) {
if (p->flags != res->flags)
continue;
if (name && strcmp(p->name, name))
@@ -353,6 +371,7 @@ static int find_next_system_ram(struct resource *res, char *name)
if ((p->end >= start) && (p->start < end))
break;
}
+
read_unlock(&resource_lock);
if (!p)
return -1;
@@ -365,6 +384,70 @@ static int find_next_system_ram(struct resource *res, char *name)
}

/*
+ * Walks through iomem resources and calls func() with matching resource
+ * ranges. This walks through whole tree and not just first level children.
+ * All the memory ranges which overlap start,end and also match flags and
+ * name are valid candidates.
+ *
+ * @name: name of resource
+ * @flags: resource flags
+ * @start: start addr
+ * @end: end addr
+ */
+int walk_iomem_res(char *name, unsigned long flags, u64 start, u64 end,
+ void *arg, int (*func)(u64, u64, void *))
+{
+ struct resource res;
+ u64 orig_end;
+ int ret = -1;
+
+ res.start = start;
+ res.end = end;
+ res.flags = flags;
+ orig_end = res.end;
+ while ((res.start < res.end) &&
+ (!find_next_iomem_res(&res, name, false))) {
+ ret = (*func)(res.start, res.end, arg);
+ if (ret)
+ break;
+ res.start = res.end + 1;
+ res.end = orig_end;
+ }
+ return ret;
+}
+
+/*
+ * This function calls callback against all memory range of "System RAM"
+ * which are marked as IORESOURCE_MEM and IORESOUCE_BUSY.
+ * Now, this function is only for "System RAM". This function deals with
+ * full ranges and not pfn. If resources are not pfn aligned, dealing
+ * with pfn can truncate ranges.
+ */
+int walk_system_ram_res(u64 start, u64 end, void *arg,
+ int (*func)(u64, u64, void *))
+{
+ struct resource res;
+ u64 orig_end;
+ int ret = -1;
+
+ res.start = start;
+ res.end = end;
+ res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+ orig_end = res.end;
+ while ((res.start < res.end) &&
+ (!find_next_iomem_res(&res, "System RAM", true))) {
+ ret = (*func)(res.start, res.end, arg);
+ if (ret)
+ break;
+ res.start = res.end + 1;
+ res.end = orig_end;
+ }
+ return ret;
+}
+
+#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
+
+/*
* This function calls callback against all memory range of "System RAM"
* which are marked as IORESOURCE_MEM and IORESOUCE_BUSY.
* Now, this function is only for "System RAM".
@@ -382,7 +465,7 @@ int walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
orig_end = res.end;
while ((res.start < res.end) &&
- (find_next_system_ram(&res, "System RAM") >= 0)) {
+ (find_next_iomem_res(&res, "System RAM", true) >= 0)) {
pfn = (res.start + PAGE_SIZE - 1) >> PAGE_SHIFT;
end_pfn = (res.end + 1) >> PAGE_SHIFT;
if (end_pfn > pfn)
--
1.9.0

2014-06-26 20:36:33

by Vivek Goyal

Subject: [PATCH 09/15] kexec: Implementation of new syscall kexec_file_load

The previous patch provided the interface definition; this patch provides
the implementation of the new syscall.

Previously the segment list was prepared in user space. Now user space just
passes a kernel fd, an initrd fd and a command line, and the kernel creates
the segment list internally.
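
One detail of the command line argument, sketched in user-space C: the diff
below rejects a non-empty command line whose last byte is not NUL, so
cmdline_len has to count the terminator. check_cmdline_sketch() is a
hypothetical name; the check itself mirrors the one in
kimage_file_prepare_segments().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Returns 0 if buf[0..len) is acceptable as a command line, else -EINVAL */
static int check_cmdline_sketch(const char *buf, unsigned long len)
{
	if (!len)
		return 0;	/* no command line passed at all */

	/* command line should be a string with last byte null */
	if (buf[len - 1] != '\0')
		return -EINVAL;

	return 0;
}
```

So a caller passing "quiet" would use sizeof("quiet") (6 bytes) as the length,
not strlen("quiet").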

This patch contains the generic part of the code. Actual segment preparation
and loading is done by the arch and image specific loader, which comes in
the next patch.
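
The split between generic code and image loaders can be pictured with the
following user-space sketch of the probe dispatch. It follows the loop in
arch_kexec_kernel_image_probe() below (skip empty slots, first probe that
returns 0 wins). The ELF magic probe and all names ending in _sketch are
assumptions made for illustration; no loader is registered by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct loader_ops_sketch {
	int (*probe)(const char *buf, unsigned long len);
	const char *name;
};

/* Hypothetical probe: accept anything that starts with the ELF magic */
static int probe_elf_sketch(const char *buf, unsigned long len)
{
	return (len >= 4 && !memcmp(buf, "\177ELF", 4)) ? 0 : -ENOEXEC;
}

static const struct loader_ops_sketch elf_loader_sketch = {
	probe_elf_sketch, "elf",
};

/* A NULL slot is tolerated, as in the loader table in the patch */
static const struct loader_ops_sketch *loaders_sketch[] = {
	NULL,
	&elf_loader_sketch,
};

/* Pick the first loader whose probe accepts the image */
static int pick_loader_sketch(const char *buf, unsigned long len,
			      const struct loader_ops_sketch **out)
{
	int ret = -ENOEXEC;
	size_t i;

	for (i = 0; i < sizeof(loaders_sketch) / sizeof(loaders_sketch[0]); i++) {
		const struct loader_ops_sketch *ops = loaders_sketch[i];

		if (!ops || !ops->probe)
			continue;

		ret = ops->probe(buf, len);
		if (!ret) {
			*out = ops;
			return 0;
		}
	}

	return ret;
}
```

An image that no loader recognizes ends up with -ENOEXEC, which is also what
the weak default implementation in kernel/kexec.c returns.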

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/kernel/machine_kexec_64.c | 45 ++++
include/linux/kexec.h | 53 ++++
include/uapi/linux/kexec.h | 11 +
kernel/kexec.c | 478 ++++++++++++++++++++++++++++++++++++-
4 files changed, 582 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 679cef0..c8875b5 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -22,6 +22,10 @@
#include <asm/mmu_context.h>
#include <asm/debugreg.h>

+static struct kexec_file_ops *kexec_file_loaders[] = {
+ NULL,
+};
+
static void free_transition_pgtable(struct kimage *image)
{
free_page((unsigned long)image->arch.pud);
@@ -283,3 +287,44 @@ void arch_crash_save_vmcoreinfo(void)
(unsigned long)&_text - __START_KERNEL);
}

+/* arch-dependent functionality related to kexec file-based syscall */
+
+int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+ unsigned long buf_len)
+{
+ int i, ret = -ENOEXEC;
+ struct kexec_file_ops *fops;
+
+ for (i = 0; i < ARRAY_SIZE(kexec_file_loaders); i++) {
+ fops = kexec_file_loaders[i];
+ if (!fops || !fops->probe)
+ continue;
+
+ ret = fops->probe(buf, buf_len);
+ if (!ret) {
+ image->fops = fops;
+ return ret;
+ }
+ }
+
+ return ret;
+}
+
+void *arch_kexec_kernel_image_load(struct kimage *image)
+{
+ if (!image->fops || !image->fops->load)
+ return ERR_PTR(-ENOEXEC);
+
+ return image->fops->load(image, image->kernel_buf,
+ image->kernel_buf_len, image->initrd_buf,
+ image->initrd_buf_len, image->cmdline_buf,
+ image->cmdline_buf_len);
+}
+
+int arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+ if (!image->fops || !image->fops->cleanup)
+ return 0;
+
+ return image->fops->cleanup(image);
+}
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 66d56ac..8e80901 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -121,13 +121,57 @@ struct kimage {
#define KEXEC_TYPE_DEFAULT 0
#define KEXEC_TYPE_CRASH 1
unsigned int preserve_context : 1;
+ /* If set, we are using file mode kexec syscall */
+ unsigned int file_mode:1;

#ifdef ARCH_HAS_KIMAGE_ARCH
struct kimage_arch arch;
#endif
+
+ /* Additional fields for file based kexec syscall */
+ void *kernel_buf;
+ unsigned long kernel_buf_len;
+
+ void *initrd_buf;
+ unsigned long initrd_buf_len;
+
+ char *cmdline_buf;
+ unsigned long cmdline_buf_len;
+
+ /* File operations provided by image loader */
+ struct kexec_file_ops *fops;
+
+ /* Image loader handling the kernel can store a pointer here */
+ void *image_loader_data;
};

+/*
+ * Keeps track of buffer parameters as provided by caller for requesting
+ * memory placement of buffer.
+ */
+struct kexec_buf {
+ struct kimage *image;
+ char *buffer;
+ unsigned long bufsz;
+ unsigned long memsz;
+ unsigned long buf_align;
+ unsigned long buf_min;
+ unsigned long buf_max;
+ bool top_down; /* allocate from top of memory hole */
+};

+typedef int (kexec_probe_t)(const char *kernel_buf, unsigned long kernel_size);
+typedef void *(kexec_load_t)(struct kimage *image, char *kernel_buf,
+ unsigned long kernel_len, char *initrd,
+ unsigned long initrd_len, char *cmdline,
+ unsigned long cmdline_len);
+typedef int (kexec_cleanup_t)(struct kimage *image);
+
+struct kexec_file_ops {
+ kexec_probe_t *probe;
+ kexec_load_t *load;
+ kexec_cleanup_t *cleanup;
+};

/* kexec interface functions */
extern void machine_kexec(struct kimage *image);
@@ -138,6 +182,11 @@ extern asmlinkage long sys_kexec_load(unsigned long entry,
struct kexec_segment __user *segments,
unsigned long flags);
extern int kernel_kexec(void);
+extern int kexec_add_buffer(struct kimage *image, char *buffer,
+ unsigned long bufsz, unsigned long memsz,
+ unsigned long buf_align, unsigned long buf_min,
+ unsigned long buf_max, bool top_down,
+ unsigned long *load_addr);
extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
extern void crash_kexec(struct pt_regs *);
@@ -188,6 +237,10 @@ extern int kexec_load_disabled;
#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT)
#endif

+/* List of defined/legal kexec file flags */
+#define KEXEC_FILE_FLAGS (KEXEC_FILE_UNLOAD | KEXEC_FILE_ON_CRASH | \
+ KEXEC_FILE_NO_INITRAMFS)
+
#define VMCOREINFO_BYTES (4096)
#define VMCOREINFO_NOTE_NAME "VMCOREINFO"
#define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
index d6629d4..6925f5b 100644
--- a/include/uapi/linux/kexec.h
+++ b/include/uapi/linux/kexec.h
@@ -13,6 +13,17 @@
#define KEXEC_PRESERVE_CONTEXT 0x00000002
#define KEXEC_ARCH_MASK 0xffff0000

+/*
+ * Kexec file load interface flags.
+ * KEXEC_FILE_UNLOAD : Unload already loaded kexec/kdump image.
+ * KEXEC_FILE_ON_CRASH : Load/unload operation belongs to kdump image.
+ * KEXEC_FILE_NO_INITRAMFS : No initramfs is being loaded. Ignore the initrd
+ * fd field.
+ */
+#define KEXEC_FILE_UNLOAD 0x00000001
+#define KEXEC_FILE_ON_CRASH 0x00000002
+#define KEXEC_FILE_NO_INITRAMFS 0x00000004
+
/* These values match the ELF architecture values.
* Unless there is a good reason that should continue to be the case.
*/
diff --git a/kernel/kexec.c b/kernel/kexec.c
index bdda717..e5e0f6a 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -6,6 +6,8 @@
* Version 2. See the file COPYING for more details.
*/

+#define pr_fmt(fmt) "kexec: " fmt
+
#include <linux/capability.h>
#include <linux/mm.h>
#include <linux/file.h>
@@ -326,6 +328,215 @@ out_free_image:
return ret;
}

+static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
+{
+ struct fd f = fdget(fd);
+ int ret = 0;
+ struct kstat stat;
+ loff_t pos;
+ ssize_t bytes = 0;
+
+ if (!f.file)
+ return -EBADF;
+
+ ret = vfs_getattr(&f.file->f_path, &stat);
+ if (ret)
+ goto out;
+
+ if (stat.size > INT_MAX) {
+ ret = -EFBIG;
+ goto out;
+ }
+
+ /* Don't hand 0 to vmalloc, it whines. */
+ if (stat.size == 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ *buf = vmalloc(stat.size);
+ if (!*buf) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ pos = 0;
+ while (pos < stat.size) {
+ bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
+ stat.size - pos);
+ if (bytes < 0) {
+ vfree(*buf);
+ ret = bytes;
+ goto out;
+ }
+
+ if (bytes == 0)
+ break;
+ pos += bytes;
+ }
+
+ *buf_len = pos;
+out:
+ fdput(f);
+ return ret;
+}
+
+/* Architectures can provide this probe function */
+int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+ unsigned long buf_len)
+{
+ return -ENOEXEC;
+}
+
+void * __weak arch_kexec_kernel_image_load(struct kimage *image)
+{
+ return ERR_PTR(-ENOEXEC);
+}
+
+void __weak arch_kimage_file_post_load_cleanup(struct kimage *image)
+{
+}
+
+/*
+ * Free up memory used by kernel, initrd, and comand line. This is temporary
+ * memory allocation which is not needed any more after these buffers have
+ * been loaded into separate segments and have been copied elsewhere.
+ */
+static void kimage_file_post_load_cleanup(struct kimage *image)
+{
+ vfree(image->kernel_buf);
+ image->kernel_buf = NULL;
+
+ vfree(image->initrd_buf);
+ image->initrd_buf = NULL;
+
+ kfree(image->cmdline_buf);
+ image->cmdline_buf = NULL;
+
+ /* See if architecture has anything to cleanup post load */
+ arch_kimage_file_post_load_cleanup(image);
+}
+
+/*
+ * In file mode list of segments is prepared by kernel. Copy relevant
+ * data from user space, do error checking, prepare segment list
+ */
+static int
+kimage_file_prepare_segments(struct kimage *image, int kernel_fd, int initrd_fd,
+ const char __user *cmdline_ptr,
+ unsigned long cmdline_len, unsigned flags)
+{
+ int ret = 0;
+ void *ldata;
+
+ ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
+ &image->kernel_buf_len);
+ if (ret)
+ return ret;
+
+ /* Call arch image probe handlers */
+ ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
+ image->kernel_buf_len);
+
+ if (ret)
+ goto out;
+
+ /* It is possible that there no initramfs is being loaded */
+ if (!(flags & KEXEC_FILE_NO_INITRAMFS)) {
+ ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
+ &image->initrd_buf_len);
+ if (ret)
+ goto out;
+ }
+
+ if (cmdline_len) {
+ image->cmdline_buf = kzalloc(cmdline_len, GFP_KERNEL);
+ if (!image->cmdline_buf) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = copy_from_user(image->cmdline_buf, cmdline_ptr,
+ cmdline_len);
+ if (ret) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ image->cmdline_buf_len = cmdline_len;
+
+ /* command line should be a string with last byte null */
+ if (image->cmdline_buf[cmdline_len - 1] != '\0') {
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ /* Call arch image load handlers */
+ ldata = arch_kexec_kernel_image_load(image);
+
+ if (IS_ERR(ldata)) {
+ ret = PTR_ERR(ldata);
+ goto out;
+ }
+
+ image->image_loader_data = ldata;
+out:
+ /* In case of error, free up all allocated memory in this function */
+ if (ret)
+ kimage_file_post_load_cleanup(image);
+ return ret;
+}
+
+static int
+kimage_file_alloc_init(struct kimage **rimage, int kernel_fd,
+ int initrd_fd, const char __user *cmdline_ptr,
+ unsigned long cmdline_len, unsigned long flags)
+{
+ int ret;
+ struct kimage *image;
+
+ image = do_kimage_alloc_init();
+ if (!image)
+ return -ENOMEM;
+
+ image->file_mode = 1;
+
+ ret = kimage_file_prepare_segments(image, kernel_fd, initrd_fd,
+ cmdline_ptr, cmdline_len, flags);
+ if (ret)
+ goto out_free_image;
+
+ ret = sanity_check_segment_list(image);
+ if (ret)
+ goto out_free_post_load_bufs;
+
+ ret = -ENOMEM;
+ image->control_code_page = kimage_alloc_control_pages(image,
+ get_order(KEXEC_CONTROL_PAGE_SIZE));
+ if (!image->control_code_page) {
+ pr_err("Could not allocate control_code_buffer\n");
+ goto out_free_post_load_bufs;
+ }
+
+ image->swap_page = kimage_alloc_control_pages(image, 0);
+ if (!image->swap_page) {
+ pr_err(KERN_ERR "Could not allocate swap buffer\n");
+ goto out_free_control_pages;
+ }
+
+ *rimage = image;
+ return 0;
+out_free_control_pages:
+ kimage_free_page_list(&image->control_pages);
+out_free_post_load_bufs:
+ kimage_file_post_load_cleanup(image);
+ kfree(image->image_loader_data);
+out_free_image:
+ kfree(image);
+ return ret;
+}
+
static int kimage_is_destination_range(struct kimage *image,
unsigned long start,
unsigned long end)
@@ -643,6 +854,16 @@ static void kimage_free(struct kimage *image)

/* Free the kexec control pages... */
kimage_free_page_list(&image->control_pages);
+
+ kfree(image->image_loader_data);
+
+ /*
+ * Free up any temporary buffers allocated. This might hit if
+ * error occurred much later after buffer allocation.
+ */
+ if (image->file_mode)
+ kimage_file_post_load_cleanup(image);
+
kfree(image);
}

@@ -771,10 +992,14 @@ static int kimage_load_normal_segment(struct kimage *image,
unsigned long maddr;
size_t ubytes, mbytes;
int result;
- unsigned char __user *buf;
+ unsigned char __user *buf = NULL;
+ unsigned char *kbuf = NULL;

result = 0;
- buf = segment->buf;
+ if (image->file_mode)
+ kbuf = segment->kbuf;
+ else
+ buf = segment->buf;
ubytes = segment->bufsz;
mbytes = segment->memsz;
maddr = segment->mem;
@@ -806,7 +1031,11 @@ static int kimage_load_normal_segment(struct kimage *image,
PAGE_SIZE - (maddr & ~PAGE_MASK));
uchunk = min(ubytes, mchunk);

- result = copy_from_user(ptr, buf, uchunk);
+ /* For file based kexec, source pages are in kernel memory */
+ if (image->file_mode)
+ memcpy(ptr, kbuf, uchunk);
+ else
+ result = copy_from_user(ptr, buf, uchunk);
kunmap(page);
if (result) {
result = -EFAULT;
@@ -814,7 +1043,10 @@ static int kimage_load_normal_segment(struct kimage *image,
}
ubytes -= uchunk;
maddr += mchunk;
- buf += mchunk;
+ if (image->file_mode)
+ kbuf += mchunk;
+ else
+ buf += mchunk;
mbytes -= mchunk;
}
out:
@@ -1061,7 +1293,72 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
unsigned long, cmdline_len, const char __user *, cmdline_ptr,
unsigned long, flags)
{
- return -ENOSYS;
+ int ret = 0, i;
+ struct kimage **dest_image, *image;
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
+ return -EPERM;
+
+ /* Make sure we have a legal set of flags */
+ if (flags != (flags & KEXEC_FILE_FLAGS))
+ return -EINVAL;
+
+ image = NULL;
+
+ if (!mutex_trylock(&kexec_mutex))
+ return -EBUSY;
+
+ dest_image = &kexec_image;
+ if (flags & KEXEC_FILE_ON_CRASH)
+ dest_image = &kexec_crash_image;
+
+ if (flags & KEXEC_FILE_UNLOAD)
+ goto exchange;
+
+ /*
+ * In case of crash, new kernel gets loaded in reserved region. It is
+ * same memory where old crash kernel might be loaded. Free any
+ * current crash dump kernel before we corrupt it.
+ */
+ if (flags & KEXEC_FILE_ON_CRASH)
+ kimage_free(xchg(&kexec_crash_image, NULL));
+
+ ret = kimage_file_alloc_init(&image, kernel_fd, initrd_fd, cmdline_ptr,
+ cmdline_len, flags);
+ if (ret)
+ goto out;
+
+ ret = machine_kexec_prepare(image);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < image->nr_segments; i++) {
+ struct kexec_segment *ksegment;
+
+ ksegment = &image->segment[i];
+ pr_debug("Loading segment %d: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n",
+ i, ksegment->buf, ksegment->bufsz, ksegment->mem,
+ ksegment->memsz);
+
+ ret = kimage_load_segment(image, &image->segment[i]);
+ if (ret)
+ goto out;
+ }
+
+ kimage_terminate(image);
+
+ /*
+ * Free up any temporary buffers allocated which are not needed
+ * after image has been loaded
+ */
+ kimage_file_post_load_cleanup(image);
+exchange:
+ image = xchg(dest_image, image);
+out:
+ mutex_unlock(&kexec_mutex);
+ kimage_free(image);
+ return ret;
}

void crash_kexec(struct pt_regs *regs)
@@ -1616,6 +1913,177 @@ static int __init crash_save_vmcoreinfo_init(void)

subsys_initcall(crash_save_vmcoreinfo_init);

+static int __kexec_add_segment(struct kimage *image, char *buf,
+ unsigned long bufsz, unsigned long mem,
+ unsigned long memsz)
+{
+ struct kexec_segment *ksegment;
+
+ ksegment = &image->segment[image->nr_segments];
+ ksegment->kbuf = buf;
+ ksegment->bufsz = bufsz;
+ ksegment->mem = mem;
+ ksegment->memsz = memsz;
+ image->nr_segments++;
+
+ return 0;
+}
+
+static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
+ struct kexec_buf *kbuf)
+{
+ struct kimage *image = kbuf->image;
+ unsigned long temp_start, temp_end;
+
+ temp_end = min(end, kbuf->buf_max);
+ temp_start = temp_end - kbuf->memsz;
+
+ do {
+ /* align down start */
+ temp_start = temp_start & (~(kbuf->buf_align - 1));
+
+ if (temp_start < start || temp_start < kbuf->buf_min)
+ return 0;
+
+ temp_end = temp_start + kbuf->memsz - 1;
+
+ /*
+ * Make sure this does not conflict with any of existing
+ * segments
+ */
+ if (kimage_is_destination_range(image, temp_start, temp_end)) {
+ temp_start = temp_start - PAGE_SIZE;
+ continue;
+ }
+
+ /* We found a suitable memory range */
+ break;
+ } while (1);
+
+ /* If we are here, we found a suitable memory range */
+ __kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+ kbuf->memsz);
+
+ /* Success, stop navigating through remaining System RAM ranges */
+ return 1;
+}
+
+static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
+ struct kexec_buf *kbuf)
+{
+ struct kimage *image = kbuf->image;
+ unsigned long temp_start, temp_end;
+
+ temp_start = max(start, kbuf->buf_min);
+
+ do {
+ temp_start = ALIGN(temp_start, kbuf->buf_align);
+ temp_end = temp_start + kbuf->memsz - 1;
+
+ if (temp_end > end || temp_end > kbuf->buf_max)
+ return 0;
+ /*
+ * Make sure this does not conflict with any of existing
+ * segments
+ */
+ if (kimage_is_destination_range(image, temp_start, temp_end)) {
+ temp_start = temp_start + PAGE_SIZE;
+ continue;
+ }
+
+ /* We found a suitable memory range */
+ break;
+ } while (1);
+
+ /* If we are here, we found a suitable memory range */
+ __kexec_add_segment(image, kbuf->buffer, kbuf->bufsz, temp_start,
+ kbuf->memsz);
+
+ /* Success, stop navigating through remaining System RAM ranges */
+ return 1;
+}
+
+static int locate_mem_hole_callback(u64 start, u64 end, void *arg)
+{
+ struct kexec_buf *kbuf = (struct kexec_buf *)arg;
+ unsigned long sz = end - start + 1;
+
+ /* Returning 0 will take to next memory range */
+ if (sz < kbuf->memsz)
+ return 0;
+
+ if (end < kbuf->buf_min || start > kbuf->buf_max)
+ return 0;
+
+ /*
+ * Allocate memory top down with-in ram range. Otherwise bottom up
+ * allocation.
+ */
+ if (kbuf->top_down)
+ return locate_mem_hole_top_down(start, end, kbuf);
+ else
+ return locate_mem_hole_bottom_up(start, end, kbuf);
+}
+
+/*
+ * Helper function for placing a buffer in a kexec segment. This assumes
+ * that kexec_mutex is held.
+ */
+int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
+ unsigned long memsz, unsigned long buf_align,
+ unsigned long buf_min, unsigned long buf_max,
+ bool top_down, unsigned long *load_addr)
+{
+
+ struct kexec_segment *ksegment;
+ struct kexec_buf buf, *kbuf;
+ int ret;
+
+ /* Currently adding segment this way is allowed only in file mode */
+ if (!image->file_mode)
+ return -EINVAL;
+
+ if (image->nr_segments >= KEXEC_SEGMENT_MAX)
+ return -EINVAL;
+
+ /*
+ * Make sure we are not trying to add buffer after allocating
+ * control pages. All segments need to be placed first before
+ * any control pages are allocated. As control page allocation
+ * logic goes through list of segments to make sure there are
+ * no destination overlaps.
+ */
+ if (!list_empty(&image->control_pages)) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
+ memset(&buf, 0, sizeof(struct kexec_buf));
+ kbuf = &buf;
+ kbuf->image = image;
+ kbuf->buffer = buffer;
+ kbuf->bufsz = bufsz;
+
+ kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
+ kbuf->buf_align = max(buf_align, PAGE_SIZE);
+ kbuf->buf_min = buf_min;
+ kbuf->buf_max = buf_max;
+ kbuf->top_down = top_down;
+
+ /* Walk the RAM ranges and allocate a suitable range for the buffer */
+ ret = walk_system_ram_res(0, -1, kbuf, locate_mem_hole_callback);
+ if (ret != 1) {
+ /* A suitable memory range could not be found for buffer */
+ return -EADDRNOTAVAIL;
+ }
+
+ /* Found a suitable memory range */
+ ksegment = &image->segment[image->nr_segments - 1];
+ *load_addr = ksegment->mem;
+ return 0;
+}
+
+
/*
* Move into place and start executing a preloaded standalone
* executable. If nothing was preloaded return an error.
--
1.9.0

2014-06-26 20:36:51

by Vivek Goyal

Subject: [PATCH 10/15] purgatory/sha256: Provide implementation of sha256 in purgatory context

The next two patches provide code for purgatory. This is code which does
not link against the kernel and runs standalone. It runs between
two kernels. One of its primary purposes is to verify the
digest of the newly loaded kernel and make sure it matches the digest
computed at kernel load time.
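
As a rough user-space sketch of that check: recompute a digest over the loaded
regions and compare it with the one saved at load time. The structure and
function names are hypothetical, and toy_digest() is only a stand-in so the
example is runnable; it is NOT SHA-256 and not the code added by this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32	/* SHA-256 digest size in bytes */

/* One loaded segment: where it lives and how long it is */
struct region_sketch {
	const uint8_t *start;
	size_t len;
};

/* Stand-in for the hash: NOT SHA-256, just enough to drive the example */
static void toy_digest(const struct region_sketch *r, size_t n,
		       uint8_t out[DIGEST_LEN])
{
	size_t i, j, k = 0;

	memset(out, 0, DIGEST_LEN);
	for (i = 0; i < n; i++)
		for (j = 0; j < r[i].len; j++, k++)
			out[k % DIGEST_LEN] ^= (uint8_t)(r[i].start[j] + k);
}

/* Returns 0 when the regions still hash to the digest saved at load time */
static int verify_regions_sketch(const struct region_sketch *r, size_t n,
				 const uint8_t expected[DIGEST_LEN])
{
	uint8_t now[DIGEST_LEN];

	toy_digest(r, n, now);
	return memcmp(now, expected, DIGEST_LEN) ? 1 : 0;
}
```

In this sketch a non-zero result means a segment changed between load time
and the jump to the new kernel.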

We use sha256 for calculating the digest of kexec segments. Purgatory can't
use the standard crypto API, as that API is not available in purgatory context.

Hence, I have copied code from crypto/sha256_generic.c and compiled it
with the purgatory code so that it can be used. I could not
#include the sha256_generic.c file here, as some of the function signatures
required a little tweaking. The original functions work with the crypto API,
but these ones don't.

So instead of doing #include on sha256_generic.c I just copied relevant
portions of code into arch/x86/purgatory/sha256.c. Now we shouldn't have to
touch this code at all. Do let me know if there are better ways to handle it.
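
For readers who want to see the init/update/final shape without the
crypto API, here is a compact standalone sketch. It is not the code from
this patch: it uses a loop instead of the unrolled rounds, feeds bytes
one at a time, and all sketch_* names are hypothetical. Only the round
constants and the Ch/Maj/e0/e1/s0/s1 formulas are the same as in the
patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sketch_sha256 {
	uint32_t state[8];
	uint64_t count;		/* bytes hashed so far */
	uint8_t buf[64];	/* pending partial block */
};

static const uint32_t K[64] = {
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
};

static uint32_t ror32(uint32_t x, unsigned int n)
{
	return (x >> n) | (x << (32 - n));
}

static void sketch_transform(uint32_t *state, const uint8_t *in)
{
	uint32_t W[64], s[8], t1, t2;
	int i;

	for (i = 0; i < 16; i++)	/* big-endian load of the block */
		W[i] = ((uint32_t)in[4 * i] << 24) | ((uint32_t)in[4 * i + 1] << 16) |
		       ((uint32_t)in[4 * i + 2] << 8) | in[4 * i + 3];
	for (i = 16; i < 64; i++)	/* message schedule: s1 + W[i-7] + s0 + W[i-16] */
		W[i] = (ror32(W[i - 2], 17) ^ ror32(W[i - 2], 19) ^ (W[i - 2] >> 10)) +
		       W[i - 7] + W[i - 16] +
		       (ror32(W[i - 15], 7) ^ ror32(W[i - 15], 18) ^ (W[i - 15] >> 3));

	memcpy(s, state, sizeof(s));	/* s[0..7] play the role of a..h */
	for (i = 0; i < 64; i++) {
		t1 = s[7] + (ror32(s[4], 6) ^ ror32(s[4], 11) ^ ror32(s[4], 25)) +
		     (s[6] ^ (s[4] & (s[5] ^ s[6]))) + K[i] + W[i];
		t2 = (ror32(s[0], 2) ^ ror32(s[0], 13) ^ ror32(s[0], 22)) +
		     ((s[0] & s[1]) | (s[2] & (s[0] | s[1])));
		memmove(&s[1], &s[0], 7 * sizeof(s[0]));	/* h=g, ..., b=a */
		s[4] += t1;		/* e = d + t1 */
		s[0] = t1 + t2;		/* a = t1 + t2 */
	}
	for (i = 0; i < 8; i++)
		state[i] += s[i];
}

static void sketch_sha256_init(struct sketch_sha256 *c)
{
	static const uint32_t h0[8] = {
		0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
		0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
	};

	memcpy(c->state, h0, sizeof(h0));
	c->count = 0;
}

static void sketch_sha256_update(struct sketch_sha256 *c, const void *data,
				 size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		c->buf[c->count & 63] = *p++;
		if ((++c->count & 63) == 0)	/* a full 64-byte block is ready */
			sketch_transform(c->state, c->buf);
	}
}

static void sketch_sha256_final(struct sketch_sha256 *c, uint8_t out[32])
{
	uint64_t bits = c->count << 3;	/* message length in bits, before padding */
	uint8_t pad = 0x80, len_be[8];
	int i;

	for (i = 0; i < 8; i++)
		len_be[i] = (uint8_t)(bits >> (56 - 8 * i));
	sketch_sha256_update(c, &pad, 1);
	pad = 0;
	while ((c->count & 63) != 56)	/* pad out to 56 mod 64 */
		sketch_sha256_update(c, &pad, 1);
	sketch_sha256_update(c, len_be, 8);
	for (i = 0; i < 32; i++)	/* store state big-endian */
		out[i] = (uint8_t)(c->state[i / 4] >> (24 - 8 * (i % 4)));
}
```

A caller runs init once, update for each chunk, and final to get the
32-byte digest. Splitting the input across several update calls gives
the same result as a single call.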

This patch does not enable compiling of this code. That happens in next
patch. I wanted to highlight this change in a separate patch for easy
review.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/purgatory/sha256.c | 283 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/purgatory/sha256.h | 22 ++++
2 files changed, 305 insertions(+)
create mode 100644 arch/x86/purgatory/sha256.c
create mode 100644 arch/x86/purgatory/sha256.h

diff --git a/arch/x86/purgatory/sha256.c b/arch/x86/purgatory/sha256.c
new file mode 100644
index 0000000..548ca67
--- /dev/null
+++ b/arch/x86/purgatory/sha256.c
@@ -0,0 +1,283 @@
+/*
+ * SHA-256, as specified in
+ * http://csrc.nist.gov/groups/STM/cavp/documents/shs/sha256-384-512.pdf
+ *
+ * SHA-256 code by Jean-Luc Cooke <[email protected]>.
+ *
+ * Copyright (c) Jean-Luc Cooke <[email protected]>
+ * Copyright (c) Andrew McDonald <[email protected]>
+ * Copyright (c) 2002 James Morris <[email protected]>
+ * Copyright (c) 2014 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/bitops.h>
+#include <asm/byteorder.h>
+#include "sha256.h"
+#include "../boot/string.h"
+
+static inline u32 Ch(u32 x, u32 y, u32 z)
+{
+ return z ^ (x & (y ^ z));
+}
+
+static inline u32 Maj(u32 x, u32 y, u32 z)
+{
+ return (x & y) | (z & (x | y));
+}
+
+#define e0(x) (ror32(x, 2) ^ ror32(x, 13) ^ ror32(x, 22))
+#define e1(x) (ror32(x, 6) ^ ror32(x, 11) ^ ror32(x, 25))
+#define s0(x) (ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3))
+#define s1(x) (ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10))
+
+static inline void LOAD_OP(int I, u32 *W, const u8 *input)
+{
+ W[I] = __be32_to_cpu(((__be32 *)(input))[I]);
+}
+
+static inline void BLEND_OP(int I, u32 *W)
+{
+ W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+}
+
+static void sha256_transform(u32 *state, const u8 *input)
+{
+ u32 a, b, c, d, e, f, g, h, t1, t2;
+ u32 W[64];
+ int i;
+
+ /* load the input */
+ for (i = 0; i < 16; i++)
+ LOAD_OP(i, W, input);
+
+ /* now blend */
+ for (i = 16; i < 64; i++)
+ BLEND_OP(i, W);
+
+ /* load the state into our registers */
+ a = state[0]; b = state[1]; c = state[2]; d = state[3];
+ e = state[4]; f = state[5]; g = state[6]; h = state[7];
+
+ /* now iterate */
+ t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
+ t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
+ t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
+ t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
+ t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
+ t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
+ t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
+ t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
+ t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
+ t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
+ t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
+ t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
+ t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
+ t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
+ t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
+ t2 = e0(b) + Maj(b, c, d); e += t1; a = t1 + t2;
+
+ t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
+ t2 = e0(a) + Maj(a, b, c); d += t1; h = t1 + t2;
+ t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
+ t2 = e0(h) + Maj(h, a, b); c += t1; g = t1 + t2;
+ t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
+ t2 = e0(g) + Maj(g, h, a); b += t1; f = t1 + t2;
+ t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
+ t2 = e0(f) + Maj(f, g, h); a += t1; e = t1 + t2;
+ t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
+ t2 = e0(e) + Maj(e, f, g); h += t1; d = t1 + t2;
+ t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
+ t2 = e0(d) + Maj(d, e, f); g += t1; c = t1 + t2;
+ t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
+ t2 = e0(c) + Maj(c, d, e); f += t1; b = t1 + t2;
+ t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
+ t2 = e0(b) + Maj(b, c, d); e += t1; a = t1+t2;
+
+ t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
+ t2 = e0(a) + Maj(a, b, c); d += t1; h = t1+t2;
+ t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
+ t2 = e0(h) + Maj(h, a, b); c += t1; g = t1+t2;
+ t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
+ t2 = e0(g) + Maj(g, h, a); b += t1; f = t1+t2;
+ t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
+ t2 = e0(f) + Maj(f, g, h); a += t1; e = t1+t2;
+ t1 = d + e1(a) + Ch(a, b, c) + 0x2de92c6f + W[20];
+ t2 = e0(e) + Maj(e, f, g); h += t1; d = t1+t2;
+ t1 = c + e1(h) + Ch(h, a, b) + 0x4a7484aa + W[21];
+ t2 = e0(d) + Maj(d, e, f); g += t1; c = t1+t2;
+ t1 = b + e1(g) + Ch(g, h, a) + 0x5cb0a9dc + W[22];
+ t2 = e0(c) + Maj(c, d, e); f += t1; b = t1+t2;
+ t1 = a + e1(f) + Ch(f, g, h) + 0x76f988da + W[23];
+ t2 = e0(b) + Maj(b, c, d); e += t1; a = t1+t2;
+
+ t1 = h + e1(e) + Ch(e, f, g) + 0x983e5152 + W[24];
+ t2 = e0(a) + Maj(a, b, c); d += t1; h = t1+t2;
+ t1 = g + e1(d) + Ch(d, e, f) + 0xa831c66d + W[25];
+ t2 = e0(h) + Maj(h, a, b); c += t1; g = t1+t2;
+ t1 = f + e1(c) + Ch(c, d, e) + 0xb00327c8 + W[26];
+ t2 = e0(g) + Maj(g, h, a); b += t1; f = t1+t2;
+ t1 = e + e1(b) + Ch(b, c, d) + 0xbf597fc7 + W[27];
+ t2 = e0(f) + Maj(f, g, h); a += t1; e = t1+t2;
+ t1 = d + e1(a) + Ch(a, b, c) + 0xc6e00bf3 + W[28];
+ t2 = e0(e) + Maj(e, f, g); h += t1; d = t1+t2;
+ t1 = c + e1(h) + Ch(h, a, b) + 0xd5a79147 + W[29];
+ t2 = e0(d) + Maj(d, e, f); g += t1; c = t1+t2;
+ t1 = b + e1(g) + Ch(g, h, a) + 0x06ca6351 + W[30];
+ t2 = e0(c) + Maj(c, d, e); f += t1; b = t1+t2;
+ t1 = a + e1(f) + Ch(f, g, h) + 0x14292967 + W[31];
+ t2 = e0(b) + Maj(b, c, d); e += t1; a = t1+t2;
+
+ t1 = h + e1(e) + Ch(e, f, g) + 0x27b70a85 + W[32];
+ t2 = e0(a) + Maj(a, b, c); d += t1; h = t1+t2;
+ t1 = g + e1(d) + Ch(d, e, f) + 0x2e1b2138 + W[33];
+ t2 = e0(h) + Maj(h, a, b); c += t1; g = t1+t2;
+ t1 = f + e1(c) + Ch(c, d, e) + 0x4d2c6dfc + W[34];
+ t2 = e0(g) + Maj(g, h, a); b += t1; f = t1+t2;
+ t1 = e + e1(b) + Ch(b, c, d) + 0x53380d13 + W[35];
+ t2 = e0(f) + Maj(f, g, h); a += t1; e = t1+t2;
+ t1 = d + e1(a) + Ch(a, b, c) + 0x650a7354 + W[36];
+ t2 = e0(e) + Maj(e, f, g); h += t1; d = t1+t2;
+ t1 = c + e1(h) + Ch(h, a, b) + 0x766a0abb + W[37];
+ t2 = e0(d) + Maj(d, e, f); g += t1; c = t1+t2;
+ t1 = b + e1(g) + Ch(g, h, a) + 0x81c2c92e + W[38];
+ t2 = e0(c) + Maj(c, d, e); f += t1; b = t1+t2;
+ t1 = a + e1(f) + Ch(f, g, h) + 0x92722c85 + W[39];
+ t2 = e0(b) + Maj(b, c, d); e += t1; a = t1+t2;
+
+ t1 = h + e1(e) + Ch(e, f, g) + 0xa2bfe8a1 + W[40];
+ t2 = e0(a) + Maj(a, b, c); d += t1; h = t1+t2;
+ t1 = g + e1(d) + Ch(d, e, f) + 0xa81a664b + W[41];
+ t2 = e0(h) + Maj(h, a, b); c += t1; g = t1+t2;
+ t1 = f + e1(c) + Ch(c, d, e) + 0xc24b8b70 + W[42];
+ t2 = e0(g) + Maj(g, h, a); b += t1; f = t1+t2;
+ t1 = e + e1(b) + Ch(b, c, d) + 0xc76c51a3 + W[43];
+ t2 = e0(f) + Maj(f, g, h); a += t1; e = t1+t2;
+ t1 = d + e1(a) + Ch(a, b, c) + 0xd192e819 + W[44];
+ t2 = e0(e) + Maj(e, f, g); h += t1; d = t1+t2;
+ t1 = c + e1(h) + Ch(h, a, b) + 0xd6990624 + W[45];
+ t2 = e0(d) + Maj(d, e, f); g += t1; c = t1+t2;
+ t1 = b + e1(g) + Ch(g, h, a) + 0xf40e3585 + W[46];
+ t2 = e0(c) + Maj(c, d, e); f += t1; b = t1+t2;
+ t1 = a + e1(f) + Ch(f, g, h) + 0x106aa070 + W[47];
+ t2 = e0(b) + Maj(b, c, d); e += t1; a = t1+t2;
+
+ t1 = h + e1(e) + Ch(e, f, g) + 0x19a4c116 + W[48];
+ t2 = e0(a) + Maj(a, b, c); d += t1; h = t1+t2;
+ t1 = g + e1(d) + Ch(d, e, f) + 0x1e376c08 + W[49];
+ t2 = e0(h) + Maj(h, a, b); c += t1; g = t1+t2;
+ t1 = f + e1(c) + Ch(c, d, e) + 0x2748774c + W[50];
+ t2 = e0(g) + Maj(g, h, a); b += t1; f = t1+t2;
+ t1 = e + e1(b) + Ch(b, c, d) + 0x34b0bcb5 + W[51];
+ t2 = e0(f) + Maj(f, g, h); a += t1; e = t1+t2;
+ t1 = d + e1(a) + Ch(a, b, c) + 0x391c0cb3 + W[52];
+ t2 = e0(e) + Maj(e, f, g); h += t1; d = t1+t2;
+ t1 = c + e1(h) + Ch(h, a, b) + 0x4ed8aa4a + W[53];
+ t2 = e0(d) + Maj(d, e, f); g += t1; c = t1+t2;
+ t1 = b + e1(g) + Ch(g, h, a) + 0x5b9cca4f + W[54];
+ t2 = e0(c) + Maj(c, d, e); f += t1; b = t1+t2;
+ t1 = a + e1(f) + Ch(f, g, h) + 0x682e6ff3 + W[55];
+ t2 = e0(b) + Maj(b, c, d); e += t1; a = t1+t2;
+
+ t1 = h + e1(e) + Ch(e, f, g) + 0x748f82ee + W[56];
+ t2 = e0(a) + Maj(a, b, c); d += t1; h = t1+t2;
+ t1 = g + e1(d) + Ch(d, e, f) + 0x78a5636f + W[57];
+ t2 = e0(h) + Maj(h, a, b); c += t1; g = t1+t2;
+ t1 = f + e1(c) + Ch(c, d, e) + 0x84c87814 + W[58];
+ t2 = e0(g) + Maj(g, h, a); b += t1; f = t1+t2;
+ t1 = e + e1(b) + Ch(b, c, d) + 0x8cc70208 + W[59];
+ t2 = e0(f) + Maj(f, g, h); a += t1; e = t1+t2;
+ t1 = d + e1(a) + Ch(a, b, c) + 0x90befffa + W[60];
+ t2 = e0(e) + Maj(e, f, g); h += t1; d = t1+t2;
+ t1 = c + e1(h) + Ch(h, a, b) + 0xa4506ceb + W[61];
+ t2 = e0(d) + Maj(d, e, f); g += t1; c = t1+t2;
+ t1 = b + e1(g) + Ch(g, h, a) + 0xbef9a3f7 + W[62];
+ t2 = e0(c) + Maj(c, d, e); f += t1; b = t1+t2;
+ t1 = a + e1(f) + Ch(f, g, h) + 0xc67178f2 + W[63];
+ t2 = e0(b) + Maj(b, c, d); e += t1; a = t1+t2;
+
+ state[0] += a; state[1] += b; state[2] += c; state[3] += d;
+ state[4] += e; state[5] += f; state[6] += g; state[7] += h;
+
+ /* clear any sensitive info... */
+ a = b = c = d = e = f = g = h = t1 = t2 = 0;
+ memset(W, 0, 64 * sizeof(u32));
+}
+
+int sha256_init(struct sha256_state *sctx)
+{
+ sctx->state[0] = SHA256_H0;
+ sctx->state[1] = SHA256_H1;
+ sctx->state[2] = SHA256_H2;
+ sctx->state[3] = SHA256_H3;
+ sctx->state[4] = SHA256_H4;
+ sctx->state[5] = SHA256_H5;
+ sctx->state[6] = SHA256_H6;
+ sctx->state[7] = SHA256_H7;
+ sctx->count = 0;
+
+ return 0;
+}
+
+int sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
+{
+ unsigned int partial, done;
+ const u8 *src;
+
+ partial = sctx->count & 0x3f;
+ sctx->count += len;
+ done = 0;
+ src = data;
+
+ if ((partial + len) > 63) {
+ if (partial) {
+ done = -partial;
+ memcpy(sctx->buf + partial, data, done + 64);
+ src = sctx->buf;
+ }
+
+ do {
+ sha256_transform(sctx->state, src);
+ done += 64;
+ src = data + done;
+ } while (done + 63 < len);
+
+ partial = 0;
+ }
+ memcpy(sctx->buf + partial, src, len - done);
+
+ return 0;
+}
+
+int sha256_final(struct sha256_state *sctx, u8 *out)
+{
+ __be32 *dst = (__be32 *)out;
+ __be64 bits;
+ unsigned int index, pad_len;
+ int i;
+ static const u8 padding[64] = { 0x80, };
+
+ /* Save number of bits */
+ bits = cpu_to_be64(sctx->count << 3);
+
+ /* Pad out to 56 mod 64. */
+ index = sctx->count & 0x3f;
+ pad_len = (index < 56) ? (56 - index) : ((64+56) - index);
+ sha256_update(sctx, padding, pad_len);
+
+ /* Append length (before padding) */
+ sha256_update(sctx, (const u8 *)&bits, sizeof(bits));
+
+ /* Store state in digest */
+ for (i = 0; i < 8; i++)
+ dst[i] = cpu_to_be32(sctx->state[i]);
+
+ /* Zeroize sensitive information. */
+ memset(sctx, 0, sizeof(*sctx));
+
+ return 0;
+}
diff --git a/arch/x86/purgatory/sha256.h b/arch/x86/purgatory/sha256.h
new file mode 100644
index 0000000..bd15a41
--- /dev/null
+++ b/arch/x86/purgatory/sha256.h
@@ -0,0 +1,22 @@
+/*
+ * Copyright (C) 2014 Red Hat Inc.
+ *
+ * Author: Vivek Goyal <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#ifndef SHA256_H
+#define SHA256_H
+
+
+#include <linux/types.h>
+#include <crypto/sha.h>
+
+extern int sha256_init(struct sha256_state *sctx);
+extern int sha256_update(struct sha256_state *sctx, const u8 *input,
+ unsigned int length);
+extern int sha256_final(struct sha256_state *sctx, u8 *hash);
+
+#endif /* SHA256_H */
--
1.9.0

2014-06-26 20:37:38

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 13/15] kexec-bzImage64: Support for loading bzImage using 64bit entry

This is loader specific code which can load bzImage and set it up for
64bit entry. This does not take care of 32bit entry or real mode entry.
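
To make the "can this image be entered in 64bit mode" decision concrete,
here is a userspace sketch of the header checks, done on raw bytes. The
SK_OFF_* offsets are my reading of Documentation/x86/boot.txt and the
function name is hypothetical; the patch itself goes through struct
setup_header rather than byte offsets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field offsets in the boot sector, per the x86 boot protocol (assumed). */
#define SK_OFF_BOOT_FLAG	0x1fe
#define SK_OFF_HEADER		0x202
#define SK_OFF_VERSION		0x206
#define SK_OFF_LOADFLAGS	0x211
#define SK_OFF_XLOADFLAGS	0x236

#define LOADED_HIGH			(1 << 0)
#define XLF_KERNEL_64			(1 << 0)
#define XLF_CAN_BE_LOADED_ABOVE_4G	(1 << 1)

/* Read a little-endian 16-bit field regardless of host alignment. */
static uint16_t sk_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns 0 if buf looks like a bzImage with a 64-bit entry, -1 otherwise. */
static int sketch_bzimage64_probe(const uint8_t *buf, size_t len)
{
	uint16_t xlf;

	if (len < 2 * 512)				/* at least two sectors */
		return -1;
	if (memcmp(buf + SK_OFF_HEADER, "HdrS", 4) != 0)	/* header magic */
		return -1;
	if (sk_le16(buf + SK_OFF_BOOT_FLAG) != 0xAA55)	/* boot sector flag */
		return -1;
	if (sk_le16(buf + SK_OFF_VERSION) < 0x020C)	/* protocol >= 2.12 */
		return -1;
	if (!(buf[SK_OFF_LOADFLAGS] & LOADED_HIGH))	/* zImage, not bzImage */
		return -1;

	xlf = sk_le16(buf + SK_OFF_XLOADFLAGS);
	if (!(xlf & XLF_KERNEL_64) || !(xlf & XLF_CAN_BE_LOADED_ABOVE_4G))
		return -1;

	return 0;
}
```

The order of the checks follows the probe function in this patch; any
single failed check rejects the image.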

32bit mode entry can be implemented if somebody needs it.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/include/asm/kexec-bzimage64.h | 6 +
arch/x86/include/asm/kexec.h | 21 ++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/kexec-bzimage64.c | 375 +++++++++++++++++++++++++++++++++
arch/x86/kernel/machine_kexec_64.c | 3 +-
5 files changed, 405 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/include/asm/kexec-bzimage64.h
create mode 100644 arch/x86/kernel/kexec-bzimage64.c

diff --git a/arch/x86/include/asm/kexec-bzimage64.h b/arch/x86/include/asm/kexec-bzimage64.h
new file mode 100644
index 0000000..d1b5d19
--- /dev/null
+++ b/arch/x86/include/asm/kexec-bzimage64.h
@@ -0,0 +1,6 @@
+#ifndef _ASM_KEXEC_BZIMAGE64_H
+#define _ASM_KEXEC_BZIMAGE64_H
+
+extern struct kexec_file_ops kexec_bzImage64_ops;
+
+#endif /* _ASM_KEXEC_BZIMAGE64_H */
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 17483a4..0dfccce 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -23,6 +23,7 @@

#include <asm/page.h>
#include <asm/ptrace.h>
+#include <asm/bootparam.h>

/*
* KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return.
@@ -161,6 +162,26 @@ struct kimage_arch {
pmd_t *pmd;
pte_t *pte;
};
+
+struct kexec_entry64_regs {
+ uint64_t rax;
+ uint64_t rbx;
+ uint64_t rcx;
+ uint64_t rdx;
+ uint64_t rsi;
+ uint64_t rdi;
+ uint64_t rsp;
+ uint64_t rbp;
+ uint64_t r8;
+ uint64_t r9;
+ uint64_t r10;
+ uint64_t r11;
+ uint64_t r12;
+ uint64_t r13;
+ uint64_t r14;
+ uint64_t r15;
+ uint64_t rip;
+};
#endif

typedef void crash_vmclear_fn(void);
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 047f9ff..ece67cb 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -117,4 +117,5 @@ ifeq ($(CONFIG_X86_64),y)

obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o
obj-y += vsmp_64.o
+ obj-$(CONFIG_KEXEC) += kexec-bzimage64.o
endif
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
new file mode 100644
index 0000000..990bf27
--- /dev/null
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -0,0 +1,375 @@
+/*
+ * Kexec bzImage loader
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ * Authors:
+ * Vivek Goyal <[email protected]>
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#define pr_fmt(fmt) "kexec-bzImage64: " fmt
+
+#include <linux/string.h>
+#include <linux/printk.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kexec.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+
+/*
+ * Defines lowest physical address for various segments. Not sure where
+ * exactly these limits came from. Current bzimage64 loader in kexec-tools
+ * uses these so I am retaining it. It can be changed over time as we gain
+ * more insight.
+ */
+#define MIN_PURGATORY_ADDR 0x3000
+#define MIN_BOOTPARAM_ADDR 0x3000
+#define MIN_KERNEL_LOAD_ADDR 0x100000
+#define MIN_INITRD_LOAD_ADDR 0x1000000
+
+/*
+ * This is a placeholder for all boot loader specific data structures which
+ * get allocated in one call but get freed much later during cleanup
+ * time. Right now there is only one field but it can grow as need be.
+ */
+struct bzimage64_data {
+ /*
+ * Temporary buffer to hold bootparams buffer. This should be
+ * freed once the bootparam segment has been loaded.
+ */
+ void *bootparams_buf;
+};
+
+static int setup_initrd(struct boot_params *params,
+ unsigned long initrd_load_addr, unsigned long initrd_len)
+{
+ params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
+ params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
+
+ params->ext_ramdisk_image = initrd_load_addr >> 32;
+ params->ext_ramdisk_size = initrd_len >> 32;
+
+ return 0;
+}
+
+static int setup_cmdline(struct boot_params *params,
+ unsigned long bootparams_load_addr,
+ unsigned long cmdline_offset, char *cmdline,
+ unsigned long cmdline_len)
+{
+ char *cmdline_ptr = ((char *)params) + cmdline_offset;
+ unsigned long cmdline_ptr_phys;
+ uint32_t cmdline_low_32, cmdline_ext_32;
+
+ memcpy(cmdline_ptr, cmdline, cmdline_len);
+ cmdline_ptr[cmdline_len - 1] = '\0';
+
+ cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
+ cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
+ cmdline_ext_32 = cmdline_ptr_phys >> 32;
+
+ params->hdr.cmd_line_ptr = cmdline_low_32;
+ if (cmdline_ext_32)
+ params->ext_cmd_line_ptr = cmdline_ext_32;
+
+ return 0;
+}
+
+static int setup_memory_map_entries(struct boot_params *params)
+{
+ unsigned int nr_e820_entries;
+
+ nr_e820_entries = e820_saved.nr_map;
+
+ /* TODO: Pass entries more than E820MAX in bootparams setup data */
+ if (nr_e820_entries > E820MAX)
+ nr_e820_entries = E820MAX;
+
+ params->e820_entries = nr_e820_entries;
+ memcpy(&params->e820_map, &e820_saved.map,
+ nr_e820_entries * sizeof(struct e820entry));
+
+ return 0;
+}
+
+static int setup_boot_parameters(struct boot_params *params)
+{
+ unsigned int nr_e820_entries;
+ unsigned long long mem_k, start, end;
+ int i;
+
+ /* Get subarch from existing bootparams */
+ params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
+
+ /* Copying screen_info will do? */
+ memcpy(&params->screen_info, &boot_params.screen_info,
+ sizeof(struct screen_info));
+
+ /* Fill in memsize later */
+ params->screen_info.ext_mem_k = 0;
+ params->alt_mem_k = 0;
+
+ /* Default APM info */
+ memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
+
+ /* Default drive info */
+ memset(&params->hd0_info, 0, sizeof(params->hd0_info));
+ memset(&params->hd1_info, 0, sizeof(params->hd1_info));
+
+ /* Default sysdesc table */
+ params->sys_desc_table.length = 0;
+
+ setup_memory_map_entries(params);
+ nr_e820_entries = params->e820_entries;
+
+ for (i = 0; i < nr_e820_entries; i++) {
+ if (params->e820_map[i].type != E820_RAM)
+ continue;
+ start = params->e820_map[i].addr;
+ end = params->e820_map[i].addr + params->e820_map[i].size - 1;
+
+ if ((start <= 0x100000) && end > 0x100000) {
+ mem_k = (end >> 10) - (0x100000 >> 10);
+ params->screen_info.ext_mem_k = mem_k;
+ params->alt_mem_k = mem_k;
+ if (mem_k > 0xfc00)
+ params->screen_info.ext_mem_k = 0xfc00; /* 64M*/
+ if (mem_k > 0xffffffff)
+ params->alt_mem_k = 0xffffffff;
+ }
+ }
+
+ /* Setup EDD info */
+ memcpy(params->eddbuf, boot_params.eddbuf,
+ EDDMAXNR * sizeof(struct edd_info));
+ params->eddbuf_entries = boot_params.eddbuf_entries;
+
+ memcpy(params->edd_mbr_sig_buffer, boot_params.edd_mbr_sig_buffer,
+ EDD_MBR_SIG_MAX * sizeof(unsigned int));
+
+ return 0;
+}
+
+int bzImage64_probe(const char *buf, unsigned long len)
+{
+ int ret = -ENOEXEC;
+ struct setup_header *header;
+
+ /* kernel should be at least two sectors long */
+ if (len < 2 * 512) {
+ pr_err("File is too short to be a bzImage\n");
+ return ret;
+ }
+
+ header = (struct setup_header *)(buf + offsetof(struct boot_params, hdr));
+ if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
+ pr_err("Not a bzImage\n");
+ return ret;
+ }
+
+ if (header->boot_flag != 0xAA55) {
+ pr_err("No x86 boot sector present\n");
+ return ret;
+ }
+
+ if (header->version < 0x020C) {
+ pr_err("Must be at least protocol version 2.12\n");
+ return ret;
+ }
+
+ if (!(header->loadflags & LOADED_HIGH)) {
+ pr_err("zImage not a bzImage\n");
+ return ret;
+ }
+
+ if (!(header->xloadflags & XLF_KERNEL_64)) {
+ pr_err("Not a bzImage64. XLF_KERNEL_64 is not set.\n");
+ return ret;
+ }
+
+ if (!(header->xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) {
+ pr_err("XLF_CAN_BE_LOADED_ABOVE_4G is not set.\n");
+ return ret;
+ }
+
+ /* I've got a bzImage */
+ pr_debug("It's a relocatable bzImage64\n");
+ ret = 0;
+
+ return ret;
+}
+
+void *bzImage64_load(struct kimage *image, char *kernel,
+ unsigned long kernel_len, char *initrd,
+ unsigned long initrd_len, char *cmdline,
+ unsigned long cmdline_len)
+{
+
+ struct setup_header *header;
+ int setup_sects, kern16_size, ret = 0;
+ unsigned long setup_header_size, params_cmdline_sz;
+ struct boot_params *params;
+ unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
+ unsigned long purgatory_load_addr;
+ unsigned long kernel_bufsz, kernel_memsz, kernel_align;
+ char *kernel_buf;
+ struct bzimage64_data *ldata;
+ struct kexec_entry64_regs regs64;
+ void *stack;
+ unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
+
+ header = (struct setup_header *)(kernel + setup_hdr_offset);
+ setup_sects = header->setup_sects;
+ if (setup_sects == 0)
+ setup_sects = 4;
+
+ kern16_size = (setup_sects + 1) * 512;
+ if (kernel_len < kern16_size) {
+ pr_err("bzImage truncated\n");
+ return ERR_PTR(-ENOEXEC);
+ }
+
+ if (cmdline_len > header->cmdline_size) {
+ pr_err("Kernel command line too long\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /*
+ * Load purgatory. For 64bit entry point, purgatory code can be
+ * anywhere.
+ */
+ ret = kexec_load_purgatory(image, MIN_PURGATORY_ADDR, ULONG_MAX, 1,
+ &purgatory_load_addr);
+ if (ret) {
+ pr_err("Loading purgatory failed\n");
+ return ERR_PTR(ret);
+ }
+
+ pr_debug("Loaded purgatory at 0x%lx\n", purgatory_load_addr);
+
+ /* Load Bootparams and cmdline */
+ params_cmdline_sz = sizeof(struct boot_params) + cmdline_len;
+ params = kzalloc(params_cmdline_sz, GFP_KERNEL);
+ if (!params)
+ return ERR_PTR(-ENOMEM);
+
+ /* Copy setup header onto bootparams. Documentation/x86/boot.txt */
+ setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
+
+ /* Is there a limit on setup header size? */
+ memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
+
+ ret = kexec_add_buffer(image, (char *)params, params_cmdline_sz,
+ params_cmdline_sz, 16, MIN_BOOTPARAM_ADDR,
+ ULONG_MAX, 1, &bootparam_load_addr);
+ if (ret)
+ goto out_free_params;
+ pr_debug("Loaded boot_param and command line at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ bootparam_load_addr, params_cmdline_sz, params_cmdline_sz);
+
+ /* Load kernel */
+ kernel_buf = kernel + kern16_size;
+ kernel_bufsz = kernel_len - kern16_size;
+ kernel_memsz = PAGE_ALIGN(header->init_size);
+ kernel_align = header->kernel_alignment;
+
+ ret = kexec_add_buffer(image, kernel_buf,
+ kernel_bufsz, kernel_memsz, kernel_align,
+ MIN_KERNEL_LOAD_ADDR, ULONG_MAX, 1,
+ &kernel_load_addr);
+ if (ret)
+ goto out_free_params;
+
+ pr_debug("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ kernel_load_addr, kernel_bufsz, kernel_memsz);
+
+ /* Load initrd high */
+ if (initrd) {
+ ret = kexec_add_buffer(image, initrd, initrd_len, initrd_len,
+ PAGE_SIZE, MIN_INITRD_LOAD_ADDR,
+ ULONG_MAX, 1, &initrd_load_addr);
+ if (ret)
+ goto out_free_params;
+
+ pr_debug("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ initrd_load_addr, initrd_len, initrd_len);
+
+ setup_initrd(params, initrd_load_addr, initrd_len);
+ }
+
+ setup_cmdline(params, bootparam_load_addr, sizeof(struct boot_params),
+ cmdline, cmdline_len);
+
+ /* bootloader info. Do we need a separate ID for kexec kernel loader? */
+ params->hdr.type_of_loader = 0x0D << 4;
+ params->hdr.loadflags = 0;
+
+ /* Setup purgatory regs for entry */
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 1);
+ if (ret)
+ goto out_free_params;
+
+ regs64.rbx = 0; /* Bootstrap Processor */
+ regs64.rsi = bootparam_load_addr;
+ regs64.rip = kernel_load_addr + 0x200;
+ stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
+ if (IS_ERR(stack)) {
+ pr_err("Could not find address of symbol stack_end\n");
+ ret = -EINVAL;
+ goto out_free_params;
+ }
+
+ regs64.rsp = (unsigned long)stack;
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 0);
+ if (ret)
+ goto out_free_params;
+
+ setup_boot_parameters(params);
+
+ /* Allocate loader specific data */
+ ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
+ if (!ldata) {
+ ret = -ENOMEM;
+ goto out_free_params;
+ }
+
+ /*
+ * Store pointer to params so that it can be freed after the
+ * params segment has been loaded and its contents have been copied
+ * somewhere else.
+ */
+ ldata->bootparams_buf = params;
+ return ldata;
+
+out_free_params:
+ kfree(params);
+ return ERR_PTR(ret);
+}
+
+/* This cleanup function is called after various segments have been loaded */
+int bzImage64_cleanup(struct kimage *image)
+{
+ struct bzimage64_data *ldata = image->image_loader_data;
+
+ if (!ldata)
+ return 0;
+
+ kfree(ldata->bootparams_buf);
+ ldata->bootparams_buf = NULL;
+
+ return 0;
+}
+
+struct kexec_file_ops kexec_bzImage64_ops = {
+ .probe = bzImage64_probe,
+ .load = bzImage64_load,
+ .cleanup = bzImage64_cleanup,
+};
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 88404c4..87c6a99 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -23,9 +23,10 @@
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/debugreg.h>
+#include <asm/kexec-bzimage64.h>

static struct kexec_file_ops *kexec_file_loaders[] = {
- NULL,
+ &kexec_bzImage64_ops,
};

static void free_transition_pgtable(struct kimage *image)
--
1.9.0

2014-06-26 20:38:39

by Vivek Goyal

[permalink] [raw]
Subject: [PATCH 12/15] kexec: Load and Relocate purgatory at kernel load time

Load purgatory code in RAM and relocate it based on the location. Relocation
code has been inspired by module relocation code and purgatory relocation
code in kexec-tools.
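
The core of the relocation step can be sketched in plain C as below. This
is an illustration, not the function from this patch: the SK_R_* names
and sketch_apply_rela() are hypothetical, the relocation numbers are the
x86-64 psABI values as I remember them, and unlike the patch the sketch
checks for overflow before writing instead of after.

```c
#include <stdint.h>
#include <string.h>

/* x86-64 relocation type numbers (assumed from the psABI). */
enum {
	SK_R_X86_64_64   = 1,
	SK_R_X86_64_PC32 = 2,
	SK_R_X86_64_32   = 10,
	SK_R_X86_64_32S  = 11,
};

/*
 * Patch one relocation into a staging buffer.
 *   loc:   where the bytes currently sit (temporary copy of the section)
 *   addr:  where those bytes will live once the section is finally loaded
 *   value: symbol address plus addend (S + A)
 * Returns 0 on success, -1 on overflow or an unknown relocation type.
 */
static int sketch_apply_rela(unsigned int type, void *loc, uint64_t addr,
			     uint64_t value)
{
	uint32_t v32;
	int32_t s32;
	int64_t sv;

	switch (type) {
	case SK_R_X86_64_64:		/* full 64-bit absolute address */
		memcpy(loc, &value, sizeof(value));
		return 0;
	case SK_R_X86_64_32:		/* must fit when zero-extended */
		v32 = (uint32_t)value;
		if (v32 != value)
			return -1;
		memcpy(loc, &v32, sizeof(v32));
		return 0;
	case SK_R_X86_64_32S:		/* must fit when sign-extended */
		sv = (int64_t)value;
		if (sv < INT32_MIN || sv > INT32_MAX)
			return -1;
		s32 = (int32_t)sv;
		memcpy(loc, &s32, sizeof(s32));
		return 0;
	case SK_R_X86_64_PC32:		/* relative to the final address */
		v32 = (uint32_t)(value - addr);
		memcpy(loc, &v32, sizeof(v32));
		return 0;
	default:
		return -1;
	}
}
```

Note the two addresses: the write goes to the temporary buffer, while
the PC-relative arithmetic uses the address the section will have after
kexec moves it into place.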

Also compute the checksums of loaded kexec segments and store them in
purgatory.

Arch independent code provides this functionality so that arch dependent
bootloaders can make use of it.

Helper functions are provided to get/set symbol values in purgatory which
are used by bootloaders later to set things like stack and entry point
of second kernel etc.
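
As a rough picture of how such a get/set helper behaves, here is a
userspace sketch. The flat symbol table, struct sketch_sym and
sketch_get_set_symbol() are hypothetical simplifications; the real
helper looks symbols up in the purgatory ELF symbol table. Only the
get_value flag mirrors the prototype added to kexec.h in this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat symbol table entry: name, offset into image, size. */
struct sketch_sym {
	const char *name;
	size_t offset;
	size_t size;
};

static const struct sketch_sym *sketch_find_sym(const struct sketch_sym *syms,
						size_t nr, const char *name)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (strcmp(syms[i].name, name) == 0)
			return &syms[i];
	return NULL;
}

/*
 * Copy a symbol's value out of the image (get_value == true) or
 * overwrite it from buf (get_value == false).
 * Returns 0 on success, -1 if the symbol is missing or the size differs.
 */
static int sketch_get_set_symbol(unsigned char *image,
				 const struct sketch_sym *syms, size_t nr,
				 const char *name, void *buf, size_t size,
				 bool get_value)
{
	const struct sketch_sym *sym = sketch_find_sym(syms, nr, name);

	if (!sym || sym->size != size)
		return -1;

	if (get_value)
		memcpy(buf, image + sym->offset, size);
	else
		memcpy(image + sym->offset, buf, size);
	return 0;
}
```

A loader would typically read a structure with get_value set, change a
few fields, and write it back with get_value clear, which is how the
bzImage64 loader fills in entry64_regs.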

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/Kconfig | 2 +
arch/x86/kernel/machine_kexec_64.c | 142 ++++++++++
include/linux/kexec.h | 33 +++
kernel/kexec.c | 544 ++++++++++++++++++++++++++++++++++++-
4 files changed, 720 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eaa00ae..2cee2a6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1577,6 +1577,8 @@ source kernel/Kconfig.hz
config KEXEC
bool "kexec system call"
select BUILD_BIN2C
+ select CRYPTO
+ select CRYPTO_SHA256
---help---
kexec is a system call that implements the ability to shutdown your
current kernel, and to start another kernel. It is like a reboot
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index c8875b5..88404c4 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -6,6 +6,8 @@
* Version 2. See the file COPYING for more details.
*/

+#define pr_fmt(fmt) "kexec: " fmt
+
#include <linux/mm.h>
#include <linux/kexec.h>
#include <linux/string.h>
@@ -328,3 +330,143 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)

return image->fops->cleanup(image);
}
+
+/*
+ * Apply purgatory relocations.
+ *
+ * ehdr: Pointer to elf headers
+ * sechdrs: Pointer to section headers.
+ * relsec: section index of SHT_RELA section.
+ *
+ * TODO: Some of the code belongs to generic code. Move that in kexec.c.
+ */
+int arch_kexec_apply_relocations_add(const Elf64_Ehdr *ehdr,
+ Elf64_Shdr *sechdrs, unsigned int relsec)
+{
+ unsigned int i;
+ Elf64_Rela *rel;
+ Elf64_Sym *sym;
+ void *location;
+ Elf64_Shdr *section, *symtabsec;
+ unsigned long address, sec_base, value;
+ const char *strtab, *name, *shstrtab;
+
+ /*
+ * ->sh_offset has been modified to keep the pointer to section
+ * contents in memory
+ */
+ rel = (void *)sechdrs[relsec].sh_offset;
+
+ /* Section to which relocations apply */
+ section = &sechdrs[sechdrs[relsec].sh_info];
+
+ pr_debug("Applying relocate section %u to %u\n", relsec,
+ sechdrs[relsec].sh_info);
+
+ /* Associated symbol table */
+ symtabsec = &sechdrs[sechdrs[relsec].sh_link];
+
+ /* String table */
+ if (symtabsec->sh_link >= ehdr->e_shnum) {
+ /* Invalid strtab section number */
+ pr_err("Invalid string table section index %d\n",
+ symtabsec->sh_link);
+ return -ENOEXEC;
+ }
+
+ strtab = (char *)sechdrs[symtabsec->sh_link].sh_offset;
+
+ /* section header string table */
+ shstrtab = (char *)sechdrs[ehdr->e_shstrndx].sh_offset;
+
+ for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
+
+ /*
+ * rel[i].r_offset contains byte offset from beginning
+ * of section to the storage unit affected.
+ *
+ * This is the location to update (->sh_offset). It is a temporary
+ * buffer where the section is currently loaded. The section will
+ * finally be loaded to a different address later, pointed to by
+ * ->sh_addr. kexec takes care of moving it
+ * (kexec_load_segment()).
+ */
+ location = (void *)(section->sh_offset + rel[i].r_offset);
+
+ /* Final address of the location */
+ address = section->sh_addr + rel[i].r_offset;
+
+ /*
+ * rel[i].r_info contains information about symbol table index
+ * w.r.t which relocation must be made and type of relocation
+ * to apply. ELF64_R_SYM() and ELF64_R_TYPE() macros get
+ * these respectively.
+ */
+ sym = (Elf64_Sym *)symtabsec->sh_offset +
+ ELF64_R_SYM(rel[i].r_info);
+
+ if (sym->st_name)
+ name = strtab + sym->st_name;
+ else
+ name = shstrtab + sechdrs[sym->st_shndx].sh_name;
+
+ pr_debug("Symbol: %s info: %02x shndx: %02x value=%llx size: %llx\n",
+ name, sym->st_info, sym->st_shndx, sym->st_value,
+ sym->st_size);
+
+ if (sym->st_shndx == SHN_UNDEF) {
+ pr_err("Undefined symbol: %s\n", name);
+ return -ENOEXEC;
+ }
+
+ if (sym->st_shndx == SHN_COMMON) {
+ pr_err("symbol '%s' in common section\n", name);
+ return -ENOEXEC;
+ }
+
+ if (sym->st_shndx == SHN_ABS)
+ sec_base = 0;
+ else if (sym->st_shndx >= ehdr->e_shnum) {
+ pr_err("Invalid section %d for symbol %s\n",
+ sym->st_shndx, name);
+ return -ENOEXEC;
+ } else
+ sec_base = sechdrs[sym->st_shndx].sh_addr;
+
+ value = sym->st_value;
+ value += sec_base;
+ value += rel[i].r_addend;
+
+ switch (ELF64_R_TYPE(rel[i].r_info)) {
+ case R_X86_64_NONE:
+ break;
+ case R_X86_64_64:
+ *(u64 *)location = value;
+ break;
+ case R_X86_64_32:
+ *(u32 *)location = value;
+ if (value != *(u32 *)location)
+ goto overflow;
+ break;
+ case R_X86_64_32S:
+ *(s32 *)location = value;
+ if ((s64)value != *(s32 *)location)
+ goto overflow;
+ break;
+ case R_X86_64_PC32:
+ value -= (u64)address;
+ *(u32 *)location = value;
+ break;
+ default:
+ pr_err("Unknown rela relocation: %llu\n",
+ ELF64_R_TYPE(rel[i].r_info));
+ return -ENOEXEC;
+ }
+ }
+ return 0;
+
+overflow:
+ pr_err("Overflow in relocation type %d value 0x%lx\n",
+ (int)ELF64_R_TYPE(rel[i].r_info), value);
+ return -ENOEXEC;
+}
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 8e80901..84f09e9 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -10,6 +10,7 @@
#include <linux/ioport.h>
#include <linux/elfcore.h>
#include <linux/elf.h>
+#include <linux/module.h>
#include <asm/kexec.h>

/* Verify architecture specific macros are defined */
@@ -95,6 +96,27 @@ struct compat_kexec_segment {
};
#endif

+struct kexec_sha_region {
+ unsigned long start;
+ unsigned long len;
+};
+
+struct purgatory_info {
+ /* Pointer to elf header of read only purgatory */
+ Elf_Ehdr *ehdr;
+
+ /* Pointer to purgatory sechdrs which are modifiable */
+ Elf_Shdr *sechdrs;
+ /*
+ * Temporary buffer location where purgatory is loaded and relocated
+ * This memory can be freed post image load
+ */
+ void *purgatory_buf;
+
+ /* Address where purgatory is finally loaded and is executed from */
+ unsigned long purgatory_load_addr;
+};
+
struct kimage {
kimage_entry_t head;
kimage_entry_t *entry;
@@ -143,6 +165,9 @@ struct kimage {

/* Image loader handling the kernel can store a pointer here */
void *image_loader_data;
+
+ /* Information for loading purgatory */
+ struct purgatory_info purgatory_info;
};

/*
@@ -189,6 +214,14 @@ extern int kexec_add_buffer(struct kimage *image, char *buffer,
unsigned long *load_addr);
extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
+extern int kexec_load_purgatory(struct kimage *image, unsigned long min,
+ unsigned long max, int top_down,
+ unsigned long *load_addr);
+extern int kexec_purgatory_get_set_symbol(struct kimage *image,
+ const char *name, void *buf,
+ unsigned int size, bool get_value);
+extern void *kexec_purgatory_get_symbol_addr(struct kimage *image,
+ const char *name);
extern void crash_kexec(struct pt_regs *);
int kexec_should_crash(struct task_struct *);
void crash_save_cpu(struct pt_regs *regs, int cpu);
diff --git a/kernel/kexec.c b/kernel/kexec.c
index e5e0f6a..f7ca4ce 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -41,6 +41,9 @@
#include <asm/io.h>
#include <asm/sections.h>

+#include <crypto/hash.h>
+#include <crypto/sha.h>
+
/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t __percpu *crash_notes;

@@ -53,6 +56,15 @@ size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
/* Flag to indicate we are going to kexec a new kernel */
bool kexec_in_progress = false;

+/*
+ * Declare these symbols weak so that if architecture provides a purgatory,
+ * these will be overridden.
+ */
+char __weak kexec_purgatory[0];
+size_t __weak kexec_purgatory_size = 0;
+
+static int kexec_calculate_store_digests(struct kimage *image);
+
/* Location of the reserved area for the crash kernel */
struct resource crashk_res = {
.name = "Crash kernel",
@@ -397,6 +409,24 @@ void __weak arch_kimage_file_post_load_cleanup(struct kimage *image)
{
}

+/* Apply relocations of type RELA */
+int __weak
+arch_kexec_apply_relocations_add(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
+ unsigned int relsec)
+{
+ pr_err("RELA relocation unsupported.\n");
+ return -ENOEXEC;
+}
+
+/* Apply relocations of type REL */
+int __weak
+arch_kexec_apply_relocations(const Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
+ unsigned int relsec)
+{
+ pr_err("REL relocation unsupported.\n");
+ return -ENOEXEC;
+}
+
/*
* Free up memory used by kernel, initrd, and comand line. This is temporary
* memory allocation which is not needed any more after these buffers have
@@ -404,6 +434,8 @@ void __weak arch_kimage_file_post_load_cleanup(struct kimage *image)
*/
static void kimage_file_post_load_cleanup(struct kimage *image)
{
+ struct purgatory_info *pi = &image->purgatory_info;
+
vfree(image->kernel_buf);
image->kernel_buf = NULL;

@@ -413,6 +445,12 @@ static void kimage_file_post_load_cleanup(struct kimage *image)
kfree(image->cmdline_buf);
image->cmdline_buf = NULL;

+ vfree(pi->purgatory_buf);
+ pi->purgatory_buf = NULL;
+
+ vfree(pi->sechdrs);
+ pi->sechdrs = NULL;
+
/* See if architecture has anything to cleanup post load */
arch_kimage_file_post_load_cleanup(image);
}
@@ -1098,7 +1136,7 @@ static int kimage_load_crash_segment(struct kimage *image,
}
ubytes -= uchunk;
maddr += mchunk;
- buf += mchunk;
+ buf += mchunk;
mbytes -= mchunk;
}
out:
@@ -1333,6 +1371,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
if (ret)
goto out;

+ ret = kexec_calculate_store_digests(image);
+ if (ret)
+ goto out;
+
for (i = 0; i < image->nr_segments; i++) {
struct kexec_segment *ksegment;

@@ -2083,6 +2125,506 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
return 0;
}

+/* Calculate and store the digest of segments */
+static int kexec_calculate_store_digests(struct kimage *image)
+{
+ struct crypto_shash *tfm;
+ struct shash_desc *desc;
+ int ret = 0, i, j, zero_buf_sz, sha_region_sz;
+ size_t desc_size, nullsz;
+ char *digest;
+ void *zero_buf;
+ struct kexec_sha_region *sha_regions;
+ struct purgatory_info *pi = &image->purgatory_info;
+
+ zero_buf = __va(page_to_pfn(ZERO_PAGE(0)) << PAGE_SHIFT);
+ zero_buf_sz = PAGE_SIZE;
+
+ tfm = crypto_alloc_shash("sha256", 0, 0);
+ if (IS_ERR(tfm)) {
+ ret = PTR_ERR(tfm);
+ goto out;
+ }
+
+ desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
+ desc = kzalloc(desc_size, GFP_KERNEL);
+ if (!desc) {
+ ret = -ENOMEM;
+ goto out_free_tfm;
+ }
+
+ sha_region_sz = KEXEC_SEGMENT_MAX * sizeof(struct kexec_sha_region);
+ sha_regions = vzalloc(sha_region_sz);
+ if (!sha_regions) {
+ ret = -ENOMEM;
+ goto out_free_desc;
+ }
+
+ desc->tfm = tfm;
+ desc->flags = 0;
+
+ ret = crypto_shash_init(desc);
+ if (ret < 0)
+ goto out_free_sha_regions;
+
+ digest = kzalloc(SHA256_DIGEST_SIZE, GFP_KERNEL);
+ if (!digest) {
+ ret = -ENOMEM;
+ goto out_free_sha_regions;
+ }
+
+ for (j = i = 0; i < image->nr_segments; i++) {
+ struct kexec_segment *ksegment;
+
+ ksegment = &image->segment[i];
+ /*
+ * Skip purgatory as it will be modified once we put digest
+ * info in purgatory.
+ */
+ if (ksegment->kbuf == pi->purgatory_buf)
+ continue;
+
+ ret = crypto_shash_update(desc, ksegment->kbuf,
+ ksegment->bufsz);
+ if (ret)
+ break;
+
+ /*
+ * Assume rest of the buffer is filled with zero and
+ * update digest accordingly.
+ */
+ nullsz = ksegment->memsz - ksegment->bufsz;
+ while (nullsz) {
+ unsigned long bytes = nullsz;
+
+ if (bytes > zero_buf_sz)
+ bytes = zero_buf_sz;
+ ret = crypto_shash_update(desc, zero_buf, bytes);
+ if (ret)
+ break;
+ nullsz -= bytes;
+ }
+
+ if (ret)
+ break;
+
+ sha_regions[j].start = ksegment->mem;
+ sha_regions[j].len = ksegment->memsz;
+ j++;
+ }
+
+ if (!ret) {
+ ret = crypto_shash_final(desc, digest);
+ if (ret)
+ goto out_free_digest;
+ ret = kexec_purgatory_get_set_symbol(image, "sha_regions",
+ sha_regions, sha_region_sz, 0);
+ if (ret)
+ goto out_free_digest;
+
+ ret = kexec_purgatory_get_set_symbol(image, "sha256_digest",
+ digest, SHA256_DIGEST_SIZE, 0);
+ if (ret)
+ goto out_free_digest;
+ }
+
+out_free_digest:
+ kfree(digest);
+out_free_sha_regions:
+ vfree(sha_regions);
+out_free_desc:
+ kfree(desc);
+out_free_tfm:
+ crypto_free_shash(tfm);
+out:
+ return ret;
+}
+
+/* Actually load purgatory. A lot of this code is taken from kexec-tools. */
+static int __kexec_load_purgatory(struct kimage *image, unsigned long min,
+ unsigned long max, int top_down)
+{
+ struct purgatory_info *pi = &image->purgatory_info;
+ unsigned long align, buf_align, bss_align, buf_sz, bss_sz, bss_pad;
+ unsigned long memsz, entry, load_addr, curr_load_addr, bss_addr, offset;
+ unsigned char *buf_addr, *src;
+ int i, ret = 0, entry_sidx = -1;
+ const Elf_Shdr *sechdrs_c;
+ Elf_Shdr *sechdrs = NULL;
+ void *purgatory_buf = NULL;
+
+ /*
+ * sechdrs_c points to section headers in purgatory and are read
+ * only. No modifications allowed.
+ */
+ sechdrs_c = (void *)pi->ehdr + pi->ehdr->e_shoff;
+
+ /*
+ * We can not modify sechdrs_c[] and its fields. It is read only.
+ * Copy it over to a local copy where one can store some temporary
+ * data and free it at the end. We need to modify ->sh_addr and
+ * ->sh_offset fields to keep track of permanent and temporary
+ * locations of sections.
+ */
+ sechdrs = vzalloc(pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+ if (!sechdrs)
+ return -ENOMEM;
+
+ memcpy(sechdrs, sechdrs_c, pi->ehdr->e_shnum * sizeof(Elf_Shdr));
+
+ /*
+ * We seem to have multiple copies of sections. First copy is which
+ * is embedded in kernel in read only section. Some of these sections
+ * will be copied to a temporary buffer and relocated. And these
+ * sections will finally be copied to their final destination at
+ * segment load time.
+ *
+ * Use ->sh_offset to reflect section address in memory. It will
+ * point to original read only copy if section is not allocatable.
+ * Otherwise it will point to temporary copy which will be relocated.
+ *
+ * Use ->sh_addr to contain final address of the section where it
+ * will go during execution time.
+ */
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ if (sechdrs[i].sh_type == SHT_NOBITS)
+ continue;
+
+ sechdrs[i].sh_offset = (unsigned long)pi->ehdr +
+ sechdrs[i].sh_offset;
+ }
+
+ /*
+ * Identify entry point section and make entry relative to section
+ * start.
+ */
+ entry = pi->ehdr->e_entry;
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+ continue;
+
+ if (!(sechdrs[i].sh_flags & SHF_EXECINSTR))
+ continue;
+
+ /* Make entry section relative */
+ if (sechdrs[i].sh_addr <= pi->ehdr->e_entry &&
+ ((sechdrs[i].sh_addr + sechdrs[i].sh_size) >
+ pi->ehdr->e_entry)) {
+ entry_sidx = i;
+ entry -= sechdrs[i].sh_addr;
+ break;
+ }
+ }
+
+ /* Determine how much memory is needed to load relocatable object. */
+ buf_align = 1;
+ bss_align = 1;
+ buf_sz = 0;
+ bss_sz = 0;
+
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+ continue;
+
+ align = sechdrs[i].sh_addralign;
+ if (sechdrs[i].sh_type != SHT_NOBITS) {
+ if (buf_align < align)
+ buf_align = align;
+ buf_sz = ALIGN(buf_sz, align);
+ buf_sz += sechdrs[i].sh_size;
+ } else {
+ /* bss section */
+ if (bss_align < align)
+ bss_align = align;
+ bss_sz = ALIGN(bss_sz, align);
+ bss_sz += sechdrs[i].sh_size;
+ }
+ }
+
+ /* Determine the bss padding required to align bss properly */
+ bss_pad = 0;
+ if (buf_sz & (bss_align - 1))
+ bss_pad = bss_align - (buf_sz & (bss_align - 1));
+
+ memsz = buf_sz + bss_pad + bss_sz;
+
+ /* Allocate buffer for purgatory */
+ purgatory_buf = vzalloc(buf_sz);
+ if (!purgatory_buf) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (buf_align < bss_align)
+ buf_align = bss_align;
+
+ /* Add buffer to segment list */
+ ret = kexec_add_buffer(image, purgatory_buf, buf_sz, memsz,
+ buf_align, min, max, top_down,
+ &pi->purgatory_load_addr);
+ if (ret)
+ goto out;
+
+ /* Load SHF_ALLOC sections */
+ buf_addr = purgatory_buf;
+ load_addr = curr_load_addr = pi->purgatory_load_addr;
+ bss_addr = load_addr + buf_sz + bss_pad;
+
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ if (!(sechdrs[i].sh_flags & SHF_ALLOC))
+ continue;
+
+ align = sechdrs[i].sh_addralign;
+ if (sechdrs[i].sh_type != SHT_NOBITS) {
+ curr_load_addr = ALIGN(curr_load_addr, align);
+ offset = curr_load_addr - load_addr;
+ /* We already modified ->sh_offset to keep src addr */
+ src = (unsigned char *)sechdrs[i].sh_offset;
+ memcpy(buf_addr + offset, src, sechdrs[i].sh_size);
+
+ /* Store load address and source address of section */
+ sechdrs[i].sh_addr = curr_load_addr;
+
+ /*
+ * This section got copied to temporary buffer. Update
+ * ->sh_offset accordingly.
+ */
+ sechdrs[i].sh_offset = (unsigned long)(buf_addr + offset);
+
+ /* Advance to the next address */
+ curr_load_addr += sechdrs[i].sh_size;
+ } else {
+ bss_addr = ALIGN(bss_addr, align);
+ sechdrs[i].sh_addr = bss_addr;
+ bss_addr += sechdrs[i].sh_size;
+ }
+ }
+
+ /* Update entry point based on load address of text section */
+ if (entry_sidx >= 0)
+ entry += sechdrs[entry_sidx].sh_addr;
+
+ /* Make kernel jump to purgatory after shutdown */
+ image->start = entry;
+
+ /* Used later to get/set symbol values */
+ pi->sechdrs = sechdrs;
+
+ /*
+ * Used later to identify which section is purgatory and skip it
+ * from checksumming.
+ */
+ pi->purgatory_buf = purgatory_buf;
+ return ret;
+out:
+ vfree(sechdrs);
+ vfree(purgatory_buf);
+ return ret;
+}
+
+static int kexec_apply_relocations(struct kimage *image)
+{
+ int i, ret;
+ struct purgatory_info *pi = &image->purgatory_info;
+ Elf_Shdr *sechdrs = pi->sechdrs;
+
+ /* Apply relocations */
+ for (i = 0; i < pi->ehdr->e_shnum; i++) {
+ Elf_Shdr *section, *symtab;
+
+ if (sechdrs[i].sh_type != SHT_RELA &&
+ sechdrs[i].sh_type != SHT_REL)
+ continue;
+
+ /*
+ * For section of type SHT_RELA/SHT_REL,
+ * ->sh_link contains section header index of associated
+ * symbol table. And ->sh_info contains section header
+ * index of section to which relocations apply.
+ */
+ if (sechdrs[i].sh_info >= pi->ehdr->e_shnum ||
+ sechdrs[i].sh_link >= pi->ehdr->e_shnum)
+ return -ENOEXEC;
+
+ section = &sechdrs[sechdrs[i].sh_info];
+ symtab = &sechdrs[sechdrs[i].sh_link];
+
+ if (!(section->sh_flags & SHF_ALLOC))
+ continue;
+
+ /*
+ * symtab->sh_link contains section header index of associated
+ * string table.
+ */
+ if (symtab->sh_link >= pi->ehdr->e_shnum)
+ /* Invalid section number? */
+ continue;
+
+ /*
+ * Respective architecture needs to provide support for applying
+ * relocations of type SHT_RELA/SHT_REL.
+ */
+ if (sechdrs[i].sh_type == SHT_RELA)
+ ret = arch_kexec_apply_relocations_add(pi->ehdr,
+ sechdrs, i);
+ else if (sechdrs[i].sh_type == SHT_REL)
+ ret = arch_kexec_apply_relocations(pi->ehdr,
+ sechdrs, i);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+/* Load relocatable purgatory object and relocate it appropriately */
+int kexec_load_purgatory(struct kimage *image, unsigned long min,
+ unsigned long max, int top_down,
+ unsigned long *load_addr)
+{
+ struct purgatory_info *pi = &image->purgatory_info;
+ int ret;
+
+ if (kexec_purgatory_size <= 0)
+ return -EINVAL;
+
+ if (kexec_purgatory_size < sizeof(Elf_Ehdr))
+ return -ENOEXEC;
+
+ pi->ehdr = (Elf_Ehdr *)kexec_purgatory;
+
+ if (memcmp(pi->ehdr->e_ident, ELFMAG, SELFMAG) != 0
+ || pi->ehdr->e_type != ET_REL
+ || !elf_check_arch(pi->ehdr)
+ || pi->ehdr->e_shentsize != sizeof(Elf_Shdr))
+ return -ENOEXEC;
+
+ if (pi->ehdr->e_shoff >= kexec_purgatory_size
+ || (pi->ehdr->e_shnum * sizeof(Elf_Shdr) >
+ kexec_purgatory_size - pi->ehdr->e_shoff))
+ return -ENOEXEC;
+
+ ret = __kexec_load_purgatory(image, min, max, top_down);
+ if (ret)
+ return ret;
+
+ ret = kexec_apply_relocations(image);
+ if (ret)
+ goto out;
+
+ *load_addr = pi->purgatory_load_addr;
+ return 0;
+out:
+ vfree(pi->sechdrs);
+ pi->sechdrs = NULL;
+
+ vfree(pi->purgatory_buf);
+ pi->purgatory_buf = NULL;
+ return ret;
+}
+
+static Elf_Sym *kexec_purgatory_find_symbol(struct purgatory_info *pi,
+ const char *name)
+{
+ Elf_Sym *syms;
+ Elf_Shdr *sechdrs;
+ Elf_Ehdr *ehdr;
+ int i, k;
+ const char *strtab;
+
+ if (!pi->sechdrs || !pi->ehdr)
+ return NULL;
+
+ sechdrs = pi->sechdrs;
+ ehdr = pi->ehdr;
+
+ for (i = 0; i < ehdr->e_shnum; i++) {
+ if (sechdrs[i].sh_type != SHT_SYMTAB)
+ continue;
+
+ if (sechdrs[i].sh_link >= ehdr->e_shnum)
+ /* Invalid strtab section number */
+ continue;
+ strtab = (char *)sechdrs[sechdrs[i].sh_link].sh_offset;
+ syms = (Elf_Sym *)sechdrs[i].sh_offset;
+
+ /* Go through symbols for a match */
+ for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) {
+ if (ELF_ST_BIND(syms[k].st_info) != STB_GLOBAL)
+ continue;
+
+ if (strcmp(strtab + syms[k].st_name, name) != 0)
+ continue;
+
+ if (syms[k].st_shndx == SHN_UNDEF ||
+ syms[k].st_shndx >= ehdr->e_shnum) {
+ pr_debug("Symbol: %s has bad section index %d.\n",
+ name, syms[k].st_shndx);
+ return NULL;
+ }
+
+ /* Found the symbol we are looking for */
+ return &syms[k];
+ }
+ }
+
+ return NULL;
+}
+
+void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name)
+{
+ struct purgatory_info *pi = &image->purgatory_info;
+ Elf_Sym *sym;
+ Elf_Shdr *sechdr;
+
+ sym = kexec_purgatory_find_symbol(pi, name);
+ if (!sym)
+ return ERR_PTR(-EINVAL);
+
+ sechdr = &pi->sechdrs[sym->st_shndx];
+
+ /*
+ * Returns the address where symbol will finally be loaded after
+ * kexec_load_segment()
+ */
+ return (void *)(sechdr->sh_addr + sym->st_value);
+}
+
+/*
+ * Get or set value of a symbol. If "get_value" is true, symbol value is
+ * returned in buf otherwise symbol value is set based on value in buf.
+ */
+int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
+ void *buf, unsigned int size, bool get_value)
+{
+ Elf_Sym *sym;
+ Elf_Shdr *sechdrs;
+ struct purgatory_info *pi = &image->purgatory_info;
+ char *sym_buf;
+
+ sym = kexec_purgatory_find_symbol(pi, name);
+ if (!sym)
+ return -EINVAL;
+
+ if (sym->st_size != size) {
+ pr_err("symbol %s size mismatch: expected %lu actual %u\n",
+ name, (unsigned long)sym->st_size, size);
+ return -EINVAL;
+ }
+
+ sechdrs = pi->sechdrs;
+
+ if (sechdrs[sym->st_shndx].sh_type == SHT_NOBITS) {
+ pr_err("symbol %s is in a bss section. Cannot %s\n", name,
+ get_value ? "get" : "set");
+ return -EINVAL;
+ }
+
+ sym_buf = (unsigned char *)sechdrs[sym->st_shndx].sh_offset +
+ sym->st_value;
+
+ if (get_value)
+ memcpy((void *)buf, sym_buf, size);
+ else
+ memcpy((void *)sym_buf, buf, size);
+
+ return 0;
+}

/*
* Move into place and start executing a preloaded standalone
--
1.9.0
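
For anyone following the size calculation in __kexec_load_purgatory(),
here is a minimal userspace sketch of the same arithmetic. It is not part
of the patch. struct sec, struct layout, align_up() and compute_layout()
are made-up names for illustration, and treating an alignment of 0 as 1 is
an assumption of the sketch, not something the kernel code does.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two. */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Hypothetical stand-in for the Elf_Shdr fields used by the calculation. */
struct sec {
	unsigned long size;   /* sh_size */
	unsigned long align;  /* sh_addralign */
	int nobits;           /* non-zero for SHT_NOBITS (bss-like) sections */
};

struct layout {
	unsigned long buf_sz;   /* bytes that carry section data */
	unsigned long bss_pad;  /* padding so that bss starts aligned */
	unsigned long bss_sz;   /* bytes of bss */
	unsigned long memsz;    /* total footprint: buf_sz + bss_pad + bss_sz */
};

/* All sections passed in are assumed to be SHF_ALLOC. */
static struct layout compute_layout(const struct sec *secs, size_t n)
{
	struct layout l = { 0, 0, 0, 0 };
	unsigned long bss_align = 1;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Sketch-only choice: treat alignment 0 as "no alignment". */
		unsigned long a = secs[i].align ? secs[i].align : 1;

		if (!secs[i].nobits) {
			l.buf_sz = align_up(l.buf_sz, a) + secs[i].size;
		} else {
			if (bss_align < a)
				bss_align = a;
			l.bss_sz = align_up(l.bss_sz, a) + secs[i].size;
		}
	}

	/* Pad the data part so that bss begins on a bss_align boundary. */
	if (l.buf_sz & (bss_align - 1))
		l.bss_pad = bss_align - (l.buf_sz & (bss_align - 1));

	l.memsz = l.buf_sz + l.bss_pad + l.bss_sz;
	return l;
}
```

With a 10-byte section aligned to 16, a 5-byte section aligned to 4 and an
8-byte bss section aligned to 8, the sketch gives buf_sz 17, bss_pad 7 and
memsz 32. Only buf_sz bytes need a backing buffer; the rest is accounted
for in memsz.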

2014-06-26 20:39:13

by Vivek Goyal

Subject: [PATCH 04/15] kexec: Move segment verification code in a separate function

Previously do_kimage_alloc() would allocate a kimage structure, copy the
segment list from user space and then sanity-check the segment list.

Break this function into three parts: do_kimage_alloc_init() does the
actual allocation and basic initialization of the kimage structure,
copy_user_segment_list() copies the segment list from user space, and
sanity_check_segment_list() verifies the sanity of the segment list as
passed by user space.

In later patches, I need to allocate a kimage without copying a segment
list from user space. Breaking the function into smaller ones enables
re-use of the code in other places.

Signed-off-by: Vivek Goyal <[email protected]>
---
kernel/kexec.c | 182 +++++++++++++++++++++++++++++++--------------------------
1 file changed, 100 insertions(+), 82 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 3aad6dc..44e823e 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -124,45 +124,27 @@ static struct page *kimage_alloc_page(struct kimage *image,
gfp_t gfp_mask,
unsigned long dest);

-static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
- unsigned long nr_segments,
- struct kexec_segment __user *segments)
+static int copy_user_segment_list(struct kimage *image,
+ unsigned long nr_segments,
+ struct kexec_segment __user *segments)
{
+ int ret;
size_t segment_bytes;
- struct kimage *image;
- unsigned long i;
- int result;
-
- /* Allocate a controlling structure */
- result = -ENOMEM;
- image = kzalloc(sizeof(*image), GFP_KERNEL);
- if (!image)
- goto out;
-
- image->head = 0;
- image->entry = &image->head;
- image->last_entry = &image->head;
- image->control_page = ~0; /* By default this does not apply */
- image->start = entry;
- image->type = KEXEC_TYPE_DEFAULT;
-
- /* Initialize the list of control pages */
- INIT_LIST_HEAD(&image->control_pages);
-
- /* Initialize the list of destination pages */
- INIT_LIST_HEAD(&image->dest_pages);
-
- /* Initialize the list of unusable pages */
- INIT_LIST_HEAD(&image->unusable_pages);

/* Read in the segments */
image->nr_segments = nr_segments;
segment_bytes = nr_segments * sizeof(*segments);
- result = copy_from_user(image->segment, segments, segment_bytes);
- if (result) {
- result = -EFAULT;
- goto out;
- }
+ ret = copy_from_user(image->segment, segments, segment_bytes);
+ if (ret)
+ ret = -EFAULT;
+
+ return ret;
+}
+
+static int sanity_check_segment_list(struct kimage *image)
+{
+ int result, i;
+ unsigned long nr_segments = image->nr_segments;

/*
* Verify we have good destination addresses. The caller is
@@ -184,9 +166,9 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
mstart = image->segment[i].mem;
mend = mstart + image->segment[i].memsz;
if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK))
- goto out;
+ return result;
if (mend >= KEXEC_DESTINATION_MEMORY_LIMIT)
- goto out;
+ return result;
}

/* Verify our destination addresses do not overlap.
@@ -207,7 +189,7 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
pend = pstart + image->segment[j].memsz;
/* Do the segments overlap ? */
if ((mend > pstart) && (mstart < pend))
- goto out;
+ return result;
}
}

@@ -219,18 +201,61 @@ static int do_kimage_alloc(struct kimage **rimage, unsigned long entry,
result = -EINVAL;
for (i = 0; i < nr_segments; i++) {
if (image->segment[i].bufsz > image->segment[i].memsz)
- goto out;
+ return result;
}

- result = 0;
-out:
- if (result == 0)
- *rimage = image;
- else
- kfree(image);
+ /*
+ * Verify we have good destination addresses. Normally
+ * the caller is responsible for making certain we don't
+ * attempt to load the new image into invalid or reserved
+ * areas of RAM. But crash kernels are preloaded into a
+ * reserved area of ram. We must ensure the addresses
+ * are in the reserved area otherwise preloading the
+ * kernel could corrupt things.
+ */

- return result;
+ if (image->type == KEXEC_TYPE_CRASH) {
+ result = -EADDRNOTAVAIL;
+ for (i = 0; i < nr_segments; i++) {
+ unsigned long mstart, mend;

+ mstart = image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz - 1;
+ /* Ensure we are within the crash kernel limits */
+ if ((mstart < crashk_res.start) ||
+ (mend > crashk_res.end))
+ return result;
+ }
+ }
+
+ return 0;
+}
+
+static struct kimage *do_kimage_alloc_init(void)
+{
+ struct kimage *image;
+
+ /* Allocate a controlling structure */
+ image = kzalloc(sizeof(*image), GFP_KERNEL);
+ if (!image)
+ return NULL;
+
+ image->head = 0;
+ image->entry = &image->head;
+ image->last_entry = &image->head;
+ image->control_page = ~0; /* By default this does not apply */
+ image->type = KEXEC_TYPE_DEFAULT;
+
+ /* Initialize the list of control pages */
+ INIT_LIST_HEAD(&image->control_pages);
+
+ /* Initialize the list of destination pages */
+ INIT_LIST_HEAD(&image->dest_pages);
+
+ /* Initialize the list of unusable pages */
+ INIT_LIST_HEAD(&image->unusable_pages);
+
+ return image;
}

static void kimage_free_page_list(struct list_head *list);
@@ -243,10 +268,19 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
struct kimage *image;

/* Allocate and initialize a controlling structure */
- image = NULL;
- result = do_kimage_alloc(&image, entry, nr_segments, segments);
+ image = do_kimage_alloc_init();
+ if (!image)
+ return -ENOMEM;
+
+ image->start = entry;
+
+ result = copy_user_segment_list(image, nr_segments, segments);
if (result)
- goto out;
+ goto out_free_image;
+
+ result = sanity_check_segment_list(image);
+ if (result)
+ goto out_free_image;

/*
* Find a location for the control code buffer, and add it
@@ -258,22 +292,21 @@ static int kimage_normal_alloc(struct kimage **rimage, unsigned long entry,
get_order(KEXEC_CONTROL_PAGE_SIZE));
if (!image->control_code_page) {
pr_err("Could not allocate control_code_buffer\n");
- goto out_free;
+ goto out_free_image;
}

image->swap_page = kimage_alloc_control_pages(image, 0);
if (!image->swap_page) {
pr_err("Could not allocate swap buffer\n");
- goto out_free;
+ goto out_free_control_pages;
}

*rimage = image;
return 0;
-
-out_free:
+out_free_control_pages:
kimage_free_page_list(&image->control_pages);
+out_free_image:
kfree(image);
-out:
return result;
}

@@ -283,19 +316,17 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
{
int result;
struct kimage *image;
- unsigned long i;

- image = NULL;
/* Verify we have a valid entry point */
- if ((entry < crashk_res.start) || (entry > crashk_res.end)) {
- result = -EADDRNOTAVAIL;
- goto out;
- }
+ if ((entry < crashk_res.start) || (entry > crashk_res.end))
+ return -EADDRNOTAVAIL;

/* Allocate and initialize a controlling structure */
- result = do_kimage_alloc(&image, entry, nr_segments, segments);
- if (result)
- goto out;
+ image = do_kimage_alloc_init();
+ if (!image)
+ return -ENOMEM;
+
+ image->start = entry;

/* Enable the special crash kernel control page
* allocation policy.
@@ -303,25 +334,13 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
image->control_page = crashk_res.start;
image->type = KEXEC_TYPE_CRASH;

- /*
- * Verify we have good destination addresses. Normally
- * the caller is responsible for making certain we don't
- * attempt to load the new image into invalid or reserved
- * areas of RAM. But crash kernels are preloaded into a
- * reserved area of ram. We must ensure the addresses
- * are in the reserved area otherwise preloading the
- * kernel could corrupt things.
- */
- result = -EADDRNOTAVAIL;
- for (i = 0; i < nr_segments; i++) {
- unsigned long mstart, mend;
+ result = copy_user_segment_list(image, nr_segments, segments);
+ if (result)
+ goto out_free_image;

- mstart = image->segment[i].mem;
- mend = mstart + image->segment[i].memsz - 1;
- /* Ensure we are within the crash kernel limits */
- if ((mstart < crashk_res.start) || (mend > crashk_res.end))
- goto out_free;
- }
+ result = sanity_check_segment_list(image);
+ if (result)
+ goto out_free_image;

/*
* Find a location for the control code buffer, and add
@@ -333,15 +352,14 @@ static int kimage_crash_alloc(struct kimage **rimage, unsigned long entry,
get_order(KEXEC_CONTROL_PAGE_SIZE));
if (!image->control_code_page) {
pr_err("Could not allocate control_code_buffer\n");
- goto out_free;
+ goto out_free_image;
}

*rimage = image;
return 0;

-out_free:
+out_free_image:
kfree(image);
-out:
return result;
}

--
1.9.0
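
As a rough illustration of what sanity_check_segment_list() verifies, here
is a small userspace sketch of three of its checks: page alignment of the
destination, no overlap between destinations, and bufsz not exceeding
memsz. It is not part of the patch. struct seg, check_segments() and
SKETCH_PAGE_SIZE are made-up names, the 4 KiB page size is an assumption,
and the sketch returns -1 where the kernel code returns specific errno
values. The KEXEC_DESTINATION_MEMORY_LIMIT and crash-region checks are
left out.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the kexec_segment fields that get checked. */
struct seg {
	unsigned long bufsz;  /* bytes supplied by the caller */
	unsigned long mem;    /* destination start address */
	unsigned long memsz;  /* bytes reserved at the destination */
};

/* Returns 0 if the list passes all three checks, -1 otherwise. */
static int check_segments(const struct seg *s, size_t n)
{
	size_t i, j;

	assert(s != NULL || n == 0);

	for (i = 0; i < n; i++) {
		unsigned long mstart = s[i].mem;
		unsigned long mend = mstart + s[i].memsz;

		/* Destination must start and end on a page boundary. */
		if ((mstart | mend) & (SKETCH_PAGE_SIZE - 1))
			return -1;

		/* Destination must not overlap any earlier segment. */
		for (j = 0; j < i; j++) {
			unsigned long pstart = s[j].mem;
			unsigned long pend = pstart + s[j].memsz;

			if (mend > pstart && mstart < pend)
				return -1;
		}

		/* The source buffer must fit in the destination. */
		if (s[i].bufsz > s[i].memsz)
			return -1;
	}
	return 0;
}
```

Two segments that touch but do not overlap, such as [0x1000, 0x3000) and
[0x3000, 0x4000), pass the overlap test because the comparison uses strict
inequalities on half-open ranges.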

2014-06-26 20:40:44

by Vivek Goyal

Subject: Re: [PATCH 00/15][V4] kexec: A new system call to allow in kernel loading

On Thu, Jun 26, 2014 at 04:33:29PM -0400, Vivek Goyal wrote:
> Hi,
>
> This is V4 of the patchset. Previous versions were posted here.
>
> V1: https://lkml.org/lkml/2013/11/20/540
> V2: https://lkml.org/lkml/2014/1/27/331
> V3: https://lkml.org/lkml/2014/6/3/432
>

I used following kexec-tools patch to test my changes.

Thanks
Vivek

kexec-tools: Provide an option to make use of new system call

This patch provides an option, --kexec-file-syscall, to force use of the
new system call for kexec. The default is to continue to use the old syscall.
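
The option is looked up in a separate first pass over the arguments, so
that options such as -u can be interpreted correctly even when they appear
before -s on the command line. A minimal standalone sketch of that
two-pass pattern is below. It is not the kexec-tools code: struct parsed
and parse_args() are made-up names, and only two options are modeled.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; kexec-tools uses globals. */
struct parsed {
	int use_file_syscall;  /* -s / --kexec-file-syscall was given */
	int unload_via_file;   /* -u should use the file based syscall */
};

static struct parsed parse_args(int argc, char **argv)
{
	static const struct option options[] = {
		{ "unload", 0, 0, 'u' },
		{ "kexec-file-syscall", 0, 0, 's' },
		{ 0, 0, 0, 0 },
	};
	static const char short_options[] = "us";
	struct parsed p = { 0, 0 };
	int opt;

	assert(argc >= 1 && argv != NULL);

	/* Pass 1: only look for -s, wherever it appears. */
	optind = 1;
	while ((opt = getopt_long(argc, argv, short_options,
				  options, NULL)) != -1) {
		if (opt == 's')
			p.use_file_syscall = 1;
	}

	/* Reset getopt so that pass 2 starts from the first argument. */
	optind = 1;

	/* Pass 2: handle the other options with -s already known. */
	while ((opt = getopt_long(argc, argv, short_options,
				  options, NULL)) != -1) {
		switch (opt) {
		case 'u':
			p.unload_via_file = p.use_file_syscall;
			break;
		default:
			/* -s was handled in pass 1; nothing to do. */
			break;
		}
	}
	return p;
}
```

With a single pass, "kexec -u -s" would process -u before the tool knows
that -s was given. The pre-pass removes that ordering dependency.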

Signed-off-by: Vivek Goyal <[email protected]>
---
kexec/arch/x86_64/kexec-bzImage64.c | 86 +++++++++++++++++++++++
kexec/kexec-syscall.h | 32 ++++++++
kexec/kexec.c | 132 +++++++++++++++++++++++++++++++++++-
kexec/kexec.h | 11 ++-
4 files changed, 257 insertions(+), 4 deletions(-)

Index: kexec-tools/kexec/kexec.c
===================================================================
--- kexec-tools.orig/kexec/kexec.c 2014-06-17 13:15:37.723825990 -0400
+++ kexec-tools/kexec/kexec.c 2014-06-26 15:19:59.064940065 -0400
@@ -51,6 +51,8 @@
unsigned long long mem_min = 0;
unsigned long long mem_max = ULONG_MAX;
static unsigned long kexec_flags = 0;
+/* Flags for kexec file (fd) based syscall */
+static unsigned long kexec_file_flags = 0;
int kexec_debug = 0;

void dbgprint_mem_range(const char *prefix, struct memory_range *mr, int nr_mr)
@@ -787,6 +789,19 @@ static int my_load(const char *type, int
return result;
}

+static int kexec_file_unload(unsigned long kexec_file_flags)
+{
+ int ret = 0;
+
+ ret = kexec_file_load(-1, -1, 0, NULL, kexec_file_flags);
+ if (ret != 0) {
+ /* The unload failed, print some debugging information */
+ fprintf(stderr, "kexec_file_load(unload) failed: %s\n",
+ strerror(errno));
+ }
+ return ret;
+}
+
static int k_unload (unsigned long kexec_flags)
{
int result;
@@ -925,6 +940,7 @@ void usage(void)
" (0 means it's not jump back or\n"
" preserve context)\n"
" to original kernel.\n"
+ " -s, --kexec-file-syscall Use file based syscall for kexec operation\n"
" -d, --debug Enable debugging to help spot a failure.\n"
"\n"
"Supported kernel file types and options: \n");
@@ -1072,6 +1088,82 @@ char *concat_cmdline(const char *base, c
return cmdline;
}

+/* New file based kexec system call related code */
+static int do_kexec_file_load(int fileind, int argc, char **argv,
+ unsigned long flags) {
+
+ char *kernel;
+ int kernel_fd, i;
+ struct kexec_info info;
+ int ret = 0;
+ char *kernel_buf;
+ off_t kernel_size;
+
+ memset(&info, 0, sizeof(info));
+ info.segment = NULL;
+ info.nr_segments = 0;
+ info.entry = NULL;
+ info.backup_start = 0;
+ info.kexec_flags = flags;
+
+ info.file_mode = 1;
+ info.initrd_fd = -1;
+
+ if (argc - fileind <= 0) {
+ fprintf(stderr, "No kernel specified\n");
+ usage();
+ return -1;
+ }
+
+ kernel = argv[fileind];
+
+ kernel_fd = open(kernel, O_RDONLY);
+ if (kernel_fd == -1) {
+ fprintf(stderr, "Failed to open file %s:%s\n", kernel,
+ strerror(errno));
+ return -1;
+ }
+
+ /* slurp in the input kernel */
+ kernel_buf = slurp_decompress_file(kernel, &kernel_size);
+
+ for (i = 0; i < file_types; i++) {
+ if (file_type[i].probe(kernel_buf, kernel_size) >= 0)
+ break;
+ }
+
+ if (i == file_types) {
+ fprintf(stderr, "Cannot determine the file type " "of %s\n",
+ kernel);
+ return -1;
+ }
+
+ ret = file_type[i].load(argc, argv, kernel_buf, kernel_size, &info);
+ if (ret < 0) {
+ fprintf(stderr, "Cannot load %s\n", kernel);
+ return ret;
+ }
+
+ if (!is_kexec_file_load_implemented()) {
+ fprintf(stderr, "syscall kexec_file_load not available.\n");
+ return -1;
+ }
+
+ /*
+ * If there is no initramfs, set KEXEC_FILE_NO_INITRAMFS flag so that
+ * kernel does not return error with negative initrd_fd.
+ */
+ if (info.initrd_fd == -1)
+ info.kexec_flags |= KEXEC_FILE_NO_INITRAMFS;
+
+ ret = kexec_file_load(kernel_fd, info.initrd_fd, info.command_line_len,
+ info.command_line, info.kexec_flags);
+ if (ret != 0)
+ fprintf(stderr, "kexec_file_load failed: %s\n",
+ strerror(errno));
+ return ret;
+}
+

int main(int argc, char *argv[])
{
@@ -1083,6 +1175,7 @@ int main(int argc, char *argv[])
int do_ifdown = 0;
int do_unload = 0;
int do_reuse_initrd = 0;
+ int do_kexec_file_syscall = 0;
void *entry = 0;
char *type = 0;
char *endptr;
@@ -1095,6 +1188,23 @@ int main(int argc, char *argv[])
};
static const char short_options[] = KEXEC_ALL_OPT_STR;

+ /*
+ * First check if --kexec-file-syscall is set. That changes a lot of
+ * things
+ */
+ while ((opt = getopt_long(argc, argv, short_options,
+ options, 0)) != -1) {
+ switch(opt) {
+ case OPT_KEXEC_FILE_SYSCALL:
+ do_kexec_file_syscall = 1;
+ break;
+ }
+ }
+
+ /* Reset getopt for the next pass. */
+ opterr = 1;
+ optind = 1;
+
while ((opt = getopt_long(argc, argv, short_options,
options, 0)) != -1) {
switch(opt) {
@@ -1127,6 +1237,8 @@ int main(int argc, char *argv[])
do_shutdown = 0;
do_sync = 0;
do_unload = 1;
+ if (do_kexec_file_syscall)
+ kexec_file_flags |= KEXEC_FILE_UNLOAD;
break;
case OPT_EXEC:
do_load = 0;
@@ -1169,7 +1281,10 @@ int main(int argc, char *argv[])
do_exec = 0;
do_shutdown = 0;
do_sync = 0;
- kexec_flags = KEXEC_ON_CRASH;
+ if (do_kexec_file_syscall)
+ kexec_file_flags |= KEXEC_FILE_ON_CRASH;
+ else
+ kexec_flags = KEXEC_ON_CRASH;
break;
case OPT_MEM_MIN:
mem_min = strtoul(optarg, &endptr, 0);
@@ -1194,6 +1309,9 @@ int main(int argc, char *argv[])
case OPT_REUSE_INITRD:
do_reuse_initrd = 1;
break;
+ case OPT_KEXEC_FILE_SYSCALL:
+ /* We already parsed it. Nothing to do. */
+ break;
default:
break;
}
@@ -1238,10 +1356,18 @@ int main(int argc, char *argv[])
}

if (do_unload) {
- result = k_unload(kexec_flags);
+ if (do_kexec_file_syscall)
+ result = kexec_file_unload(kexec_file_flags);
+ else
+ result = k_unload(kexec_flags);
}
if (do_load && (result == 0)) {
- result = my_load(type, fileind, argc, argv, kexec_flags, entry);
+ if (do_kexec_file_syscall)
+ result = do_kexec_file_load(fileind, argc, argv,
+ kexec_file_flags);
+ else
+ result = my_load(type, fileind, argc, argv,
+ kexec_flags, entry);
}
/* Don't shutdown unless there is something to reboot to! */
if ((result == 0) && (do_shutdown || do_exec) && !kexec_loaded()) {
Index: kexec-tools/kexec/kexec.h
===================================================================
--- kexec-tools.orig/kexec/kexec.h 2014-06-17 13:15:37.723825990 -0400
+++ kexec-tools/kexec/kexec.h 2014-06-17 13:44:14.634927130 -0400
@@ -156,6 +156,13 @@ struct kexec_info {
unsigned long kexec_flags;
unsigned long backup_src_start;
unsigned long backup_src_size;
+ /* Set to 1 if we are using kexec file syscall */
+ unsigned long file_mode :1;
+
+ /* Filled by kernel image processing code */
+ int initrd_fd;
+ char *command_line;
+ int command_line_len;
};

struct arch_map_entry {
@@ -207,6 +214,7 @@ extern int file_types;
#define OPT_UNLOAD 'u'
#define OPT_TYPE 't'
#define OPT_PANIC 'p'
+#define OPT_KEXEC_FILE_SYSCALL 's'
#define OPT_MEM_MIN 256
#define OPT_MEM_MAX 257
#define OPT_REUSE_INITRD 258
@@ -230,9 +238,10 @@ extern int file_types;
{ "mem-min", 1, 0, OPT_MEM_MIN }, \
{ "mem-max", 1, 0, OPT_MEM_MAX }, \
{ "reuseinitrd", 0, 0, OPT_REUSE_INITRD }, \
+ { "kexec-file-syscall", 0, 0, OPT_KEXEC_FILE_SYSCALL }, \
{ "debug", 0, 0, OPT_DEBUG }, \

-#define KEXEC_OPT_STR "h?vdfxluet:p"
+#define KEXEC_OPT_STR "h?vdfxluet:ps"

extern void dbgprint_mem_range(const char *prefix, struct memory_range *mr, int nr_mr);
extern void die(const char *fmt, ...)
Index: kexec-tools/kexec/arch/x86_64/kexec-bzImage64.c
===================================================================
--- kexec-tools.orig/kexec/arch/x86_64/kexec-bzImage64.c 2014-06-17 13:15:37.723825990 -0400
+++ kexec-tools/kexec/arch/x86_64/kexec-bzImage64.c 2014-06-17 13:17:39.916833188 -0400
@@ -235,6 +235,89 @@ static int do_bzImage64_load(struct kexe
return 0;
}

+/* This assumes file is being loaded using file based kexec2 syscall */
+int bzImage64_load_file(int argc, char **argv, struct kexec_info *info)
+{
+ int ret = 0;
+ char *command_line = NULL, *tmp_cmdline = NULL;
+ const char *ramdisk = NULL, *append = NULL;
+ int entry_16bit = 0, entry_32bit = 0;
+ int opt;
+ int command_line_len;
+
+ /* See options.h -- add any more there, too. */
+ static const struct option options[] = {
+ KEXEC_ARCH_OPTIONS
+ { "command-line", 1, 0, OPT_APPEND },
+ { "append", 1, 0, OPT_APPEND },
+ { "reuse-cmdline", 0, 0, OPT_REUSE_CMDLINE },
+ { "initrd", 1, 0, OPT_RAMDISK },
+ { "ramdisk", 1, 0, OPT_RAMDISK },
+ { "real-mode", 0, 0, OPT_REAL_MODE },
+ { "entry-32bit", 0, 0, OPT_ENTRY_32BIT },
+ { 0, 0, 0, 0 },
+ };
+ static const char short_options[] = KEXEC_ARCH_OPT_STR "d";
+
+ while ((opt = getopt_long(argc, argv, short_options, options, 0)) != -1) {
+ switch (opt) {
+ default:
+ /* Ignore core options */
+ if (opt < OPT_ARCH_MAX)
+ break;
+ case OPT_APPEND:
+ append = optarg;
+ break;
+ case OPT_REUSE_CMDLINE:
+ tmp_cmdline = get_command_line();
+ break;
+ case OPT_RAMDISK:
+ ramdisk = optarg;
+ break;
+ case OPT_REAL_MODE:
+ entry_16bit = 1;
+ break;
+ case OPT_ENTRY_32BIT:
+ entry_32bit = 1;
+ break;
+ }
+ }
+ command_line = concat_cmdline(tmp_cmdline, append);
+ if (tmp_cmdline)
+ free(tmp_cmdline);
+ command_line_len = 0;
+ if (command_line) {
+ command_line_len = strlen(command_line) + 1;
+ } else {
+ command_line = strdup("\0");
+ command_line_len = 1;
+ }
+
+ if (entry_16bit || entry_32bit) {
+ fprintf(stderr, "Kexec2 syscall does not support 16bit"
+ " or 32bit entry yet\n");
+ ret = -1;
+ goto out;
+ }
+
+ if (ramdisk) {
+ info->initrd_fd = open(ramdisk, O_RDONLY);
+ if (info->initrd_fd == -1) {
+ fprintf(stderr, "Could not open initrd file %s:%s\n",
+ ramdisk, strerror(errno));
+ ret = -1;
+ goto out;
+ }
+ }
+
+ info->command_line = command_line;
+ info->command_line_len = command_line_len;
+ return ret;
+out:
+ free(command_line);
+ return ret;
+}
+
int bzImage64_load(int argc, char **argv, const char *buf, off_t len,
struct kexec_info *info)
{
@@ -247,6 +330,9 @@ int bzImage64_load(int argc, char **argv
int opt;
int result;

+ if (info->file_mode)
+ return bzImage64_load_file(argc, argv, info);
+
/* See options.h -- add any more there, too. */
static const struct option options[] = {
KEXEC_ARCH_OPTIONS
Index: kexec-tools/kexec/kexec-syscall.h
===================================================================
--- kexec-tools.orig/kexec/kexec-syscall.h 2014-06-17 13:15:37.723825990 -0400
+++ kexec-tools/kexec/kexec-syscall.h 2014-06-26 15:19:59.063940065 -0400
@@ -53,6 +53,19 @@
#endif
#endif /*ifndef __NR_kexec_load*/

+#ifndef __NR_kexec_file_load
+
+#ifdef __x86_64__
+#define __NR_kexec_file_load 317
+#endif
+
+#ifndef __NR_kexec_file_load
+/* system call not available for the arch */
+#define __NR_kexec_file_load 0xffffffff /* system call not available */
+#endif
+
+#endif /*ifndef __NR_kexec_file_load*/
+
struct kexec_segment;

static inline long kexec_load(void *entry, unsigned long nr_segments,
@@ -61,10 +74,29 @@ static inline long kexec_load(void *entr
return (long) syscall(__NR_kexec_load, entry, nr_segments, segments, flags);
}

+static inline int is_kexec_file_load_implemented(void) {
+ if (__NR_kexec_file_load != 0xffffffff)
+ return 1;
+ return 0;
+}
+
+static inline long kexec_file_load(int kernel_fd, int initrd_fd,
+ unsigned long cmdline_len, const char *cmdline_ptr,
+ unsigned long flags)
+{
+ return (long) syscall(__NR_kexec_file_load, kernel_fd, initrd_fd,
+ cmdline_len, cmdline_ptr, flags);
+}
+
#define KEXEC_ON_CRASH 0x00000001
#define KEXEC_PRESERVE_CONTEXT 0x00000002
#define KEXEC_ARCH_MASK 0xffff0000

+/* Flags for kexec file based system call */
+#define KEXEC_FILE_UNLOAD 0x00000001
+#define KEXEC_FILE_ON_CRASH 0x00000002
+#define KEXEC_FILE_NO_INITRAMFS 0x00000004
+
/* These values match the ELF architecture values.
* Unless there is a good reason that should continue to be the case.
*/

2014-06-26 20:43:46

by Vivek Goyal

Subject: Re: [PATCH 08/15] kexec: New syscall kexec_file_load() declaration

On Thu, Jun 26, 2014 at 04:33:37PM -0400, Vivek Goyal wrote:
> This is the new syscall kexec_file_load() declaration/interface. I have
> reserved the syscall number only for x86_64 so far. Other architectures
> (including i386) can reserve syscall number when they enable the support
> for this new syscall.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> CC: [email protected]

Hi Michael,

As per feedback last time, I enhanced the existing man page to include
details of this new syscall. Here is the patch.

Thanks
Vivek


Subject: kexec_file_load() syscall man page

We already have a man page for the kexec_load() syscall. This patch adds
details of kexec_file_load() to the same man page.

Signed-off-by: Vivek Goyal <[email protected]>
---
man2/kexec_file_load.2 | 1
man2/kexec_load.2 | 55 +++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 50 insertions(+), 6 deletions(-)

Index: man-pages/man2/kexec_file_load.2
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ man-pages/man2/kexec_file_load.2 2014-06-25 17:39:12.056441803 -0400
@@ -0,0 +1 @@
+.so man2/kexec_load.2
Index: man-pages/man2/kexec_load.2
===================================================================
--- man-pages.orig/man2/kexec_load.2 2014-06-25 17:36:09.237453355 -0400
+++ man-pages/man2/kexec_load.2 2014-06-26 11:11:49.599810213 -0400
@@ -25,17 +25,26 @@
.\"
.TH KEXEC_LOAD 2 2012-07-13 "Linux" "Linux Programmer's Manual"
.SH NAME
-kexec_load \- load a new kernel for later execution
+kexec_load, kexec_file_load \- load a new kernel for later execution
.SH SYNOPSIS
.B #include <linux/kexec.h>
.br
+
.BI "long kexec_load(unsigned long " entry ", unsigned long " nr_segments ","
.br
.BI " struct kexec_segment *" segments \
", unsigned long " flags ");"
+.br
+
+.BI "int kexec_file_load(int " kernel_fd ", int " initrd_fd ","
+.br
+.BI " unsigned long " cmdline_len \
+", const char *" cmdline ","
+.br
+.BI " unsigned long " flags ");"

.IR Note :
-There is no glibc wrapper for this system call; see NOTES.
+There are no glibc wrappers for these system calls; see NOTES.
.SH DESCRIPTION
The
.BR kexec_load ()
@@ -111,11 +120,42 @@ struct kexec_segment {
The kernel image defined by
.I segments
is copied from the calling process into previously reserved memory.
+.SS kexec_file_load()
+The
+.BR kexec_file_load ()
+system call is similar to
+.BR kexec_load (),
+but it takes a different set of arguments. It reads kernel to be loaded from
+file descriptor
+.IR kernel_fd
+and initrd to be loaded from file descriptor
+.IR initrd_fd .
+It also takes length of kernel command line in
+.IR cmdline_len
+and pointer to command line in
+.IR cmdline .
+
+The
+.IR flags
+argument is a mask which allows control over system call operation. The
+following values can be specified in
+.IR flags :
+
+.TP
+.BR KEXEC_FILE_UNLOAD
+Unload currently loaded kernel.
+.TP
+.BR KEXEC_FILE_ON_CRASH
+Load kernel in memory region reserved for crash kernel. This kernel is
+booted into if currently running kernel crashes.
+.TP
+.BR KEXEC_FILE_NO_INITRAMFS
+Loading initrd/initramfs is optional. Specify this flag if no initramfs
+is being loaded. If this flag is set, kernel will ignore the value passed
+in
+.IR initrd_fd .
.SH RETURN VALUE
-On success,
-.BR kexec_load ()
-returns 0.
-On error, \-1 is returned and
+On success, these system calls return 0. On error, \-1 is returned and
.I errno
is set to indicate the error.
.SH ERRORS
@@ -135,6 +175,9 @@ is too large
The caller does not have the
.BR CAP_SYS_BOOT
capability.
+.TP
+.B ENOEXEC
+kernel_fd does not refer to an open file. Or kernel can't load this file.
.SH VERSIONS
The
.BR kexec_load ()
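
The interface described in the man page patch above could be exercised from user space roughly as follows. This is a minimal sketch, not the kexec-tools implementation: the helper names `cmdline_len_with_nul()` and `load_kernel_by_fd()` are hypothetical, the flag value is copied from the kexec-syscall.h hunk posted in this thread, and `SYS_kexec_file_load` is assumed to come from the libc headers when they know about the syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flag value as posted in this series (kexec-syscall.h hunk). */
#ifndef KEXEC_FILE_NO_INITRAMFS
#define KEXEC_FILE_NO_INITRAMFS 0x00000004
#endif

/* cmdline_len counts the terminating NUL, as the kexec-tools patch does. */
unsigned long cmdline_len_with_nul(const char *cmdline)
{
	assert(cmdline != NULL);
	return (unsigned long)strlen(cmdline) + 1;
}

/*
 * Hypothetical helper: hand a kernel fd, an initrd fd and a command line
 * to kexec_file_load(). The caller decides the flags explicitly, e.g.
 * KEXEC_FILE_NO_INITRAMFS when no initrd is wanted.
 * Returns 0 on success, -1 with errno set on failure.
 */
long load_kernel_by_fd(int kernel_fd, int initrd_fd,
		       const char *cmdline, unsigned long flags)
{
#ifdef SYS_kexec_file_load
	return syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
		       cmdline_len_with_nul(cmdline), cmdline, flags);
#else
	/* Headers do not know the syscall on this architecture. */
	(void)kernel_fd;
	(void)initrd_fd;
	(void)cmdline;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller needs CAP_SYS_BOOT for the call to succeed; without it the sketch simply reports the failure through errno.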

2014-06-26 20:58:13

by Andrew Morton

Subject: Re: [PATCH 00/15][V4] kexec: A new system call to allow in kernel loading

On Thu, 26 Jun 2014 16:33:29 -0400 Vivek Goyal <[email protected]> wrote:

> This patch series does not do kernel signature verification yet. I plan
> to post another patch series for that. Now distributions are already signing
> PE/COFF bzImage with PKCS7 signature I plan to parse and verify those
> signatures.
>
> Primary goal of this patchset is to prepare groundwork so that kernel
> image can be signed and signatures be verified during kexec load. This
> should help with two things.
>
> - It should allow kexec/kdump on secureboot enabled machines.
>
> - In general it can help even without secureboot. By being able to verify
> kernel image signature in kexec, it should help with avoiding module
> signing restrictions. Matthew Garret showed how to boot into a custom
> kernel, modify first kernel's memory and then jump back to old kernel and
> bypass any policy one wants to.
>
> I hope these patches can be queued up for 3.17. Even without signature
> verification support, they provide new syscall functionality. But I
> wil leave it to maintainers to decide if they want signature verification
> support also be ready to merge before they merge this patchset.

Well, this is an absolute ton of new code, much of it pretty complex.
And I believe the entire point of this work is to enable image
signature checking, but that hasn't been implemented yet?

In which case I'm thinking it would be unwise to merge these parts into
mainline - if signature checking doesn't work or fails review or if you
get hit by a bus then we'd be left with a large lump of rather useless
code?

In which case I'm inclined to put this series into -next and keep it
there pending completion of the signature checking part.

2014-06-26 20:58:29

by Andrew Morton

Subject: Re: [PATCH 09/15] kexec: Implementation of new syscall kexec_file_load

On Thu, 26 Jun 2014 16:33:38 -0400 Vivek Goyal <[email protected]> wrote:

> Previous patch provided the interface definition and this patch prvides
> implementation of new syscall.
>
> Previously segment list was prepared in user space. Now user space just
> passes kernel fd, initrd fd and command line and kernel will create a
> segment list internally.
>
> This patch contains generic part of the code. Actual segment preparation
> and loading is done by arch and image specific loader. Which comes in
> next patch.
>
> ...
>
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -6,6 +6,8 @@
> * Version 2. See the file COPYING for more details.
> */
>
> +#define pr_fmt(fmt) "kexec: " fmt
> +
> #include <linux/capability.h>
> #include <linux/mm.h>
> #include <linux/file.h>
> @@ -326,6 +328,215 @@ out_free_image:
> return ret;
> }
>
> +static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
> +{
> + struct fd f = fdget(fd);
> + int ret = 0;

unneeded initialisation.

> + struct kstat stat;
> + loff_t pos;
> + ssize_t bytes = 0;
> +
> + if (!f.file)
> + return -EBADF;
> +
> + ret = vfs_getattr(&f.file->f_path, &stat);
> + if (ret)
> + goto out;
> +
> + if (stat.size > INT_MAX) {
> + ret = -EFBIG;
> + goto out;
> + }
> +
> + /* Don't hand 0 to vmalloc, it whines. */
> + if (stat.size == 0) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + *buf = vmalloc(stat.size);
> + if (!*buf) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + pos = 0;
> + while (pos < stat.size) {
> + bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
> + stat.size - pos);
> + if (bytes < 0) {
> + vfree(*buf);
> + ret = bytes;
> + goto out;
> + }
> +
> + if (bytes == 0)
> + break;

Here we can get a short read: (pos < stat.size). Seems to me that it
is risky to return this result to the caller as if all is well.

> + pos += bytes;
> + }
> +
> + *buf_len = pos;
> +out:
> + fdput(f);
> + return ret;
> +}
>
> ...
>

2014-06-26 21:03:41

by Andy Lutomirski

Subject: Re: [PATCH 08/15] kexec: New syscall kexec_file_load() declaration

On Thu, Jun 26, 2014 at 1:43 PM, Vivek Goyal <[email protected]> wrote:
> On Thu, Jun 26, 2014 at 04:33:37PM -0400, Vivek Goyal wrote:
>> This is the new syscall kexec_file_load() declaration/interface. I have
>> reserved the syscall number only for x86_64 so far. Other architectures
>> (including i386) can reserve syscall number when they enable the support
>> for this new syscall.
>>
>> Signed-off-by: Vivek Goyal <[email protected]>
>> CC: [email protected]
>
> +.BR KEXEC_FILE_NO_INITRAMFS
> +Loading initrd/initramfs is optional. Specify this flag if no initramfs
> +is being loaded. If this flag is set, kernel will ignore the value passed
> +in

This seems pointless. Why not just pass -1 for initrd_fd to indicate
that no initrd is needed?

--Andy

2014-06-26 21:21:30

by Borislav Petkov

Subject: Re: [PATCH 00/15][V4] kexec: A new system call to allow in kernel loading

On Thu, Jun 26, 2014 at 01:58:11PM -0700, Andrew Morton wrote:
> Well, this is an absolute ton of new code, much of it pretty complex.
> And I believe the entire point of this work is to enable image
> signature checking, but that hasn't been implemented yet?
>
> In which case I'm thinking it would be unwise to merge these parts
> into mainline - if signature checking doesn't work or fails review or
> if you get hit by a bus then we'd be left with a large lump of rather
> useless code?
>
> In which case I'm inclined to put this series into -next and keep it
> there pending completion of the signature checking part.

Before we rush these in, it'd be nice if someone more experienced would
take a look at the general approach of the whole handling with the
purgatory and such. I certainly tried to give my best while reviewing
but I'm too inexperienced in kexec and the whole booting of another
kernel and all the intricacies of the process. It'll be optimal if Eric
Biederman would find some free time for those.

AFAIK, hpa wanted to take a look too so can we please slow down a bit
here first?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-06-27 11:34:00

by Vivek Goyal

Subject: Re: [PATCH 00/15][V4] kexec: A new system call to allow in kernel loading

On Thu, Jun 26, 2014 at 01:58:11PM -0700, Andrew Morton wrote:
> On Thu, 26 Jun 2014 16:33:29 -0400 Vivek Goyal <[email protected]> wrote:
>
> > This patch series does not do kernel signature verification yet. I plan
> > to post another patch series for that. Now distributions are already signing
> > PE/COFF bzImage with PKCS7 signature I plan to parse and verify those
> > signatures.
> >
> > Primary goal of this patchset is to prepare groundwork so that kernel
> > image can be signed and signatures be verified during kexec load. This
> > should help with two things.
> >
> > - It should allow kexec/kdump on secureboot enabled machines.
> >
> > - In general it can help even without secureboot. By being able to verify
> > kernel image signature in kexec, it should help with avoiding module
> > signing restrictions. Matthew Garret showed how to boot into a custom
> > kernel, modify first kernel's memory and then jump back to old kernel and
> > bypass any policy one wants to.
> >
> > I hope these patches can be queued up for 3.17. Even without signature
> > verification support, they provide new syscall functionality. But I
> > wil leave it to maintainers to decide if they want signature verification
> > support also be ready to merge before they merge this patchset.
>
> Well, this is an absolute ton of new code, much of it pretty complex.
> And I believe the entire point of this work is to enable image
> signature checking, but that hasn't been implemented yet?

I have a patchset which works. But it requires more work. I will do
remaining work and clean it up and post for review.

>
> In which case I'm thinking it would be unwise to merge these parts into
> mainline - if signature checking doesn't work or fails review or if you
> get hit by a bus then we'd be left with a large lump of rather useless
> code?
>
> In which case I'm inclined to put this series into -next and keep it
> there pending completion of the signature checking part.

Agreed. Primary purpose of this patch series is to be able to do signature
verification of kernel during kexec. So it will make sense to first have
some sort of consensus on signature verification patches. Otherwise we
might be stuck with this 3.5K lines of code if things go south w.r.t
signature verification.

Thanks
Vivek

2014-06-27 11:51:44

by Vivek Goyal

Subject: Re: [PATCH 08/15] kexec: New syscall kexec_file_load() declaration

On Thu, Jun 26, 2014 at 02:03:17PM -0700, Andy Lutomirski wrote:
> On Thu, Jun 26, 2014 at 1:43 PM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Jun 26, 2014 at 04:33:37PM -0400, Vivek Goyal wrote:
> >> This is the new syscall kexec_file_load() declaration/interface. I have
> >> reserved the syscall number only for x86_64 so far. Other architectures
> >> (including i386) can reserve syscall number when they enable the support
> >> for this new syscall.
> >>
> >> Signed-off-by: Vivek Goyal <[email protected]>
> >> CC: [email protected]
> >
> > +.BR KEXEC_FILE_NO_INITRAMFS
> > +Loading initrd/initramfs is optional. Specify this flag if no initramfs
> > +is being loaded. If this flag is set, kernel will ignore the value passed
> > +in
>
> This seems pointless. Why not just pass -1 for initrd_fd to indicate
> that no initrd is needed?

I was not sure whether a negative fd should be treated as an error, so
that the system call fails, or as a sign that the user does not want to
load an initrd, so that the system call succeeds.

I was concerned about the cases where an application does fd = open(),
the operation fails and fd contains -1. The caller does not check fd and
passes it to the kexec system call.

I thought that in such cases we should error out saying the initrd fd is
not valid, instead of continuing and loading the kernel without an
initrd. A user might be surprised.

This is a little defensive programming. But I am open to changing it if
the perception is that the above is not a valid concern.
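
The concern above can be illustrated with a small user-space sketch. It assumes the flag value posted in this series; the helper name `prepare_initrd()` is hypothetical and not part of kexec-tools. Only an explicit "no initrd" request sets KEXEC_FILE_NO_INITRAMFS, while a failed open() is reported as an error rather than turning into a boot without an initrd.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Flag value as posted in this series (kexec-syscall.h hunk). */
#ifndef KEXEC_FILE_NO_INITRAMFS
#define KEXEC_FILE_NO_INITRAMFS 0x00000004
#endif

/*
 * Hypothetical helper: choose initrd_fd and flags for kexec_file_load().
 * path == NULL means "no initrd on purpose": the flag is set and the fd
 * value is irrelevant. A path that cannot be opened is an error; the
 * flag is left untouched so the caller cannot proceed by accident.
 * Returns 0 on success, -1 on failure (errno set by open()).
 */
int prepare_initrd(const char *path, int *initrd_fd, unsigned long *flags)
{
	assert(initrd_fd != NULL && flags != NULL);

	if (path == NULL) {
		*initrd_fd = -1;
		*flags |= KEXEC_FILE_NO_INITRAMFS;
		return 0;
	}

	*initrd_fd = open(path, O_RDONLY);
	if (*initrd_fd < 0)
		return -1;
	return 0;
}
```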

Thanks
Vivek

Subject: Re: [PATCH 08/15] kexec: New syscall kexec_file_load() declaration

On Fri, Jun 27, 2014 at 1:50 PM, Vivek Goyal <[email protected]> wrote:
> On Thu, Jun 26, 2014 at 02:03:17PM -0700, Andy Lutomirski wrote:
>> On Thu, Jun 26, 2014 at 1:43 PM, Vivek Goyal <[email protected]> wrote:
>> > On Thu, Jun 26, 2014 at 04:33:37PM -0400, Vivek Goyal wrote:
>> >> This is the new syscall kexec_file_load() declaration/interface. I have
>> >> reserved the syscall number only for x86_64 so far. Other architectures
>> >> (including i386) can reserve syscall number when they enable the support
>> >> for this new syscall.
>> >>
>> >> Signed-off-by: Vivek Goyal <[email protected]>
>> >> CC: [email protected]
>> >
>> > +.BR KEXEC_FILE_NO_INITRAMFS
>> > +Loading initrd/initramfs is optional. Specify this flag if no initramfs
>> > +is being loaded. If this flag is set, kernel will ignore the value passed
>> > +in
>>
>> This seems pointless. Why not just pass -1 for initrd_fd to indicate
>> that no initrd is needed?
>
> I was not sure whether negative fd should be treated as error and system
> call should fail or it should be treated as user does not want to load
> initrd and system call succeeds.
>
> I was concerned about the cases where application does an fd = open(),
> operation fails and fd contains -1. Caller does not check fd and
> passed it to kexec system call.
>
> I thought that in such cases we should error out saying initrd fd is
> not valid. Instead of continuing and loading kernel without initrd. A
> user might be surprised.
>
> This is little defensive programming. But I am open to change it if
> the perception is that above is not a valid concern.

Your logic for using a flag rather than -1 sounds reasonable to me.
The nearest precedent I can think of offhand is mmap(), which also
takes a file descriptor argument for some use cases. However, if
MAP_ANONYMOUS is specified, no file descriptor is required. The
treatment of the 'fd' argument in that case depends on the system. On
Linux, the fd argument is just ignored. However, many other systems
require 'fd' to be negative when MAP_ANONYMOUS is specified; one
presumes as a kind of safety check.
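
For reference, the mmap() precedent looks roughly like this in portable code. It is a minimal sketch with a hypothetical helper name, `map_anonymous()`; passing -1 as the fd satisfies both Linux, which ignores it, and systems that insist on a negative value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of zero-filled anonymous memory.
 * Returns NULL on failure; release the mapping with munmap().
 */
void *map_anonymous(size_t len)
{
	void *p;

	assert(len > 0);
	/* fd = -1: ignored on Linux, expected to be negative elsewhere. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}
```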

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2014-06-27 16:32:27

by Vivek Goyal

Subject: Re: [PATCH 09/15] kexec: Implementation of new syscall kexec_file_load

On Thu, Jun 26, 2014 at 01:58:26PM -0700, Andrew Morton wrote:

[..]
> > + while (pos < stat.size) {
> > + bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
> > + stat.size - pos);
> > + if (bytes < 0) {
> > + vfree(*buf);
> > + ret = bytes;
> > + goto out;
> > + }
> > +
> > + if (bytes == 0)
> > + break;
>
> Here we can get a short read: (pos < stat.size). Seems to me that it
> is risky to return this result to the caller as if all is well.

Hi Andrew,

That's a good point. Please find attached the patch which fixes both
the issues.

Thanks
Vivek



Subject: kexec: Return error if file bytes are less than file size

If the number of bytes read from the file is not the same as the file
size, return an error.

Signed-off-by: Vivek Goyal <[email protected]>
---
kernel/kexec.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c 2014-06-27 09:55:41.826755422 -0400
+++ linux-2.6/kernel/kexec.c 2014-06-27 10:04:23.409024171 -0400
@@ -343,7 +343,7 @@ out_free_image:
static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
{
struct fd f = fdget(fd);
- int ret = 0;
+ int ret;
struct kstat stat;
loff_t pos;
ssize_t bytes = 0;
@@ -387,6 +387,12 @@ static int copy_file_from_fd(int fd, voi
pos += bytes;
}

+ if (pos != stat.size) {
+ ret = -EBADF;
+ vfree(*buf);
+ goto out;
+ }
+
*buf_len = pos;
out:
fdput(f);
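
The same short-read rule can be shown in user space. This is only an analogue of the kernel loop, using read(2) instead of kernel_read(); the function name `read_exact()` and the choice of EIO for a premature end of file are assumptions made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read exactly size bytes from fd into buf.
 * Returns 0 on success, -1 on a read error or when end of file is
 * reached before size bytes have arrived (errno is set to EIO then).
 */
int read_exact(int fd, void *buf, size_t size)
{
	size_t pos = 0;

	assert(buf != NULL);

	while (pos < size) {
		ssize_t n = read(fd, (char *)buf + pos, size - pos);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			break;			/* end of file */
		pos += (size_t)n;
	}

	/* A short read must not be reported as success. */
	if (pos != size) {
		errno = EIO;
		return -1;
	}
	return 0;
}
```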

2014-06-27 18:01:52

by Vivek Goyal

Subject: [PATCH 16/15] kexec: Fix freeing up for image loader data loading


During testing I noticed a crash, which in turn showed that there are
problems with how I am freeing image->image_loader_data.

In one case I am freeing kimage->image_loader_data first and then
calling into the arch code to free data which might have been contained
in that structure. That's wrong.

I have done a little cleanup and this should fix the issues around
freeing of loader data.

Signed-off-by: Vivek Goyal <[email protected]>
---
arch/x86/kernel/kexec-bzimage64.c | 4 ++--
arch/x86/kernel/machine_kexec_64.c | 2 +-
include/linux/kexec.h | 2 +-
kernel/kexec.c | 11 ++++++++---
4 files changed, 12 insertions(+), 7 deletions(-)

Index: linux-2.6/arch/x86/kernel/machine_kexec_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/machine_kexec_64.c 2014-06-27 09:55:41.824755401 -0400
+++ linux-2.6/arch/x86/kernel/machine_kexec_64.c 2014-06-27 11:02:02.607548946 -0400
@@ -369,7 +369,7 @@ int arch_kimage_file_post_load_cleanup(s
if (!image->fops || !image->fops->cleanup)
return 0;

- return image->fops->cleanup(image);
+ return image->fops->cleanup(image->image_loader_data);
}

/*
Index: linux-2.6/include/linux/kexec.h
===================================================================
--- linux-2.6.orig/include/linux/kexec.h 2014-06-27 09:55:41.695754029 -0400
+++ linux-2.6/include/linux/kexec.h 2014-06-27 11:04:28.467151813 -0400
@@ -190,7 +190,7 @@ typedef void *(kexec_load_t)(struct kima
unsigned long kernel_len, char *initrd,
unsigned long initrd_len, char *cmdline,
unsigned long cmdline_len);
-typedef int (kexec_cleanup_t)(struct kimage *image);
+typedef int (kexec_cleanup_t)(void *loader_data);

struct kexec_file_ops {
kexec_probe_t *probe;
Index: linux-2.6/arch/x86/kernel/kexec-bzimage64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/kexec-bzimage64.c 2014-06-27 09:55:41.872755912 -0400
+++ linux-2.6/arch/x86/kernel/kexec-bzimage64.c 2014-06-27 11:05:42.151963710 -0400
@@ -512,9 +512,9 @@ out_free_params:
}

/* This cleanup function is called after various segments have been loaded */
-int bzImage64_cleanup(struct kimage *image)
+int bzImage64_cleanup(void *loader_data)
{
- struct bzimage64_data *ldata = image->image_loader_data;
+ struct bzimage64_data *ldata = loader_data;

if (!ldata)
return 0;
Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c 2014-06-27 10:04:23.409024171 -0400
+++ linux-2.6/kernel/kexec.c 2014-06-27 11:45:14.684874978 -0400
@@ -459,6 +459,14 @@ static void kimage_file_post_load_cleanu

/* See if architecture has anything to cleanup post load */
arch_kimage_file_post_load_cleanup(image);
+
+ /*
+ * Above call should have called into bootloader to free up
+ * any data stored in kimage->image_loader_data. It should
+ * be ok now to free it up.
+ */
+ kfree(image->image_loader_data);
+ image->image_loader_data = NULL;
}

/*
@@ -584,7 +592,6 @@ out_free_control_pages:
kimage_free_page_list(&image->control_pages);
out_free_post_load_bufs:
kimage_file_post_load_cleanup(image);
- kfree(image->image_loader_data);
out_free_image:
kfree(image);
return ret;
@@ -908,8 +915,6 @@ static void kimage_free(struct kimage *i
/* Free the kexec control pages... */
kimage_free_page_list(&image->control_pages);

- kfree(image->image_loader_data);
-
/*
* Free up any temporary buffers allocated. This might hit if
* error occurred much later after buffer allocation.
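
The ordering this patch establishes can be sketched in plain C. The structures below are simplified stand-ins for kimage and bzimage64_data, and all names are hypothetical; the point is only that the loader callback releases what the loader data refers to before the container itself is freed and the pointer cleared.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for bzimage64_data and kimage. */
struct loader_data {
	void *params;
};

struct image {
	void *image_loader_data;
};

/* Loader callback: frees what loader_data refers to, not loader_data. */
int loader_cleanup(void *loader_data)
{
	struct loader_data *ldata = loader_data;

	if (!ldata)
		return 0;

	free(ldata->params);
	ldata->params = NULL;
	return 0;
}

/* Generic code: call the loader first, then free the container. */
void image_post_load_cleanup(struct image *image)
{
	assert(image != NULL);

	loader_cleanup(image->image_loader_data);
	free(image->image_loader_data);
	image->image_loader_data = NULL;	/* makes a second call harmless */
}
```

Clearing the pointer after the free is what allows the cleanup to run from more than one error path without a double free.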

2014-07-01 19:46:10

by Matt Fleming

Subject: Re: [PATCH 15/15] kexec: Support kexec/kdump on EFI systems

On Thu, 26 Jun, at 04:33:44PM, Vivek Goyal wrote:
> This patch does two thigns. It passes EFI run time mappings to second
> kernel in bootparams efi_info. Second kernel parse this info and create
> new mappings in second kernel. That means mappings in first and second
> kernel will be same. This paves the way to enable EFI in kexec kernel.
>
> This patch also prepares and passes EFI setup data through bootparams.
> This contains bunch of information about various tables and their
> addresses.
>
> These information gathering and passing has been written along the lines
> of what current kexec-tools is doing to make kexec work with UEFI.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> CC: [email protected]
> ---
> arch/x86/kernel/kexec-bzimage64.c | 146 ++++++++++++++++++++++++++++++++++---
> drivers/firmware/efi/runtime-map.c | 21 ++++++
> include/linux/efi.h | 19 +++++
> 3 files changed, 174 insertions(+), 12 deletions(-)

[...]

> diff --git a/drivers/firmware/efi/runtime-map.c b/drivers/firmware/efi/runtime-map.c
> index 97cdd16..40f2213 100644
> --- a/drivers/firmware/efi/runtime-map.c
> +++ b/drivers/firmware/efi/runtime-map.c
> @@ -138,6 +138,27 @@ add_sysfs_runtime_map_entry(struct kobject *kobj, int nr)
> return entry;
> }
>
> +int get_efi_runtime_map_size(void)
> +{
> + return nr_efi_runtime_map * efi_memdesc_size;
> +}
> +
> +int get_efi_runtime_map_desc_size(void)
> +{
> + return efi_memdesc_size;
> +}
> +
> +int efi_runtime_map_copy(void *buf, size_t bufsz)
> +{
> + size_t sz = get_efi_runtime_map_size();
> +
> + if (sz > bufsz)
> + sz = bufsz;
> +
> + memcpy(buf, efi_runtime_map, sz);
> + return 0;
> +}

Could we prefix these with efi_, e.g. efi_get_runtime_map_size() ?

--
Matt Fleming, Intel Open Source Technology Center

2014-07-01 20:10:35

by Vivek Goyal

Subject: [PATCH 17/15] kexec-bzimage: Change EFI helper function names

Matt suggested renaming the newly introduced helper functions so that
they are prefixed with efi_.

Signed-off-by: Vivek Goyal <[email protected]>
CC: Matt Fleming <[email protected]>
CC: [email protected]
---
arch/x86/kernel/kexec-bzimage64.c | 4 ++--
drivers/firmware/efi/runtime-map.c | 6 +++---
include/linux/efi.h | 4 ++--
3 files changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/include/linux/efi.h
===================================================================
--- linux-2.6.orig/include/linux/efi.h 2014-07-01 14:05:54.197071710 -0400
+++ linux-2.6/include/linux/efi.h 2014-07-01 15:54:21.019754754 -0400
@@ -1151,8 +1151,8 @@ int efivars_sysfs_init(void);
#ifdef CONFIG_EFI_RUNTIME_MAP
int efi_runtime_map_init(struct kobject *);
void efi_runtime_map_setup(void *, int, u32);
-int get_efi_runtime_map_size(void);
-int get_efi_runtime_map_desc_size(void);
+int efi_get_runtime_map_size(void);
+int efi_get_runtime_map_desc_size(void);
int efi_runtime_map_copy(void *buf, size_t bufsz);
#else
static inline int efi_runtime_map_init(struct kobject *kobj)
Index: linux-2.6/drivers/firmware/efi/runtime-map.c
===================================================================
--- linux-2.6.orig/drivers/firmware/efi/runtime-map.c 2014-07-01 14:05:54.196071711 -0400
+++ linux-2.6/drivers/firmware/efi/runtime-map.c 2014-07-01 15:55:47.990759859 -0400
@@ -138,19 +138,19 @@ add_sysfs_runtime_map_entry(struct kobje
return entry;
}

-int get_efi_runtime_map_size(void)
+int efi_get_runtime_map_size(void)
{
return nr_efi_runtime_map * efi_memdesc_size;
}

-int get_efi_runtime_map_desc_size(void)
+int efi_get_runtime_map_desc_size(void)
{
return efi_memdesc_size;
}

int efi_runtime_map_copy(void *buf, size_t bufsz)
{
- size_t sz = get_efi_runtime_map_size();
+ size_t sz = efi_get_runtime_map_size();

if (sz > bufsz)
sz = bufsz;
Index: linux-2.6/arch/x86/kernel/kexec-bzimage64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/kexec-bzimage64.c 2014-07-01 15:52:22.285747785 -0400
+++ linux-2.6/arch/x86/kernel/kexec-bzimage64.c 2014-07-01 15:56:31.071762387 -0400
@@ -181,7 +181,7 @@ setup_efi_state(struct boot_params *para
ei->efi_systab_hi = current_ei->efi_systab_hi;

ei->efi_memdesc_version = current_ei->efi_memdesc_version;
- ei->efi_memdesc_size = get_efi_runtime_map_desc_size();
+ ei->efi_memdesc_size = efi_get_runtime_map_desc_size();

setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
efi_map_sz);
@@ -397,7 +397,7 @@ void *bzImage64_load(struct kimage *imag
* have to create separate segment for each. Keeps things
* little bit simple
*/
- efi_map_sz = get_efi_runtime_map_size();
+ efi_map_sz = efi_get_runtime_map_size();
efi_map_sz = ALIGN(efi_map_sz, 16);
params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
MAX_ELFCOREHDR_STR_LEN;

2014-07-01 20:14:23

by Andrew Morton

Subject: Re: [PATCH 15/15] kexec: Support kexec/kdump on EFI systems

On Tue, 1 Jul 2014 20:46:05 +0100 Matt Fleming <[email protected]> wrote:

> > +int get_efi_runtime_map_size(void)
> > +{
> > + return nr_efi_runtime_map * efi_memdesc_size;
> > +}
> > +
> > +int get_efi_runtime_map_desc_size(void)
> > +{
> > + return efi_memdesc_size;
> > +}
> > +
> > +int efi_runtime_map_copy(void *buf, size_t bufsz)
> > +{
> > + size_t sz = get_efi_runtime_map_size();
> > +
> > + if (sz > bufsz)
> > + sz = bufsz;
> > +
> > + memcpy(buf, efi_runtime_map, sz);
> > + return 0;
> > +}
>
> Could we prefix these with efi_, e.g. efi_get_runtime_map_size() ?

This?

From: Andrew Morton <[email protected]>
Subject: kexec-support-kexec-kdump-on-efi-systems-fix

s/get_efi/efi_get/g, per Matt

Cc: Vivek Goyal <[email protected]>
Cc: Matt Fleming <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/x86/kernel/kexec-bzimage64.c | 4 ++--
drivers/firmware/efi/runtime-map.c | 6 +++---
include/linux/efi.h | 8 ++++----
3 files changed, 9 insertions(+), 9 deletions(-)

diff -puN arch/x86/kernel/kexec-bzimage64.c~kexec-support-kexec-kdump-on-efi-systems-fix arch/x86/kernel/kexec-bzimage64.c
--- a/arch/x86/kernel/kexec-bzimage64.c~kexec-support-kexec-kdump-on-efi-systems-fix
+++ a/arch/x86/kernel/kexec-bzimage64.c
@@ -181,7 +181,7 @@ setup_efi_state(struct boot_params *para
ei->efi_systab_hi = current_ei->efi_systab_hi;

ei->efi_memdesc_version = current_ei->efi_memdesc_version;
- ei->efi_memdesc_size = get_efi_runtime_map_desc_size();
+ ei->efi_memdesc_size = efi_get_runtime_map_desc_size();

setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
efi_map_sz);
@@ -397,7 +397,7 @@ void *bzImage64_load(struct kimage *imag
* have to create separate segment for each. Keeps things
* little bit simple
*/
- efi_map_sz = get_efi_runtime_map_size();
+ efi_map_sz = efi_get_runtime_map_size();
efi_map_sz = ALIGN(efi_map_sz, 16);
params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
MAX_ELFCOREHDR_STR_LEN;
diff -puN drivers/firmware/efi/runtime-map.c~kexec-support-kexec-kdump-on-efi-systems-fix drivers/firmware/efi/runtime-map.c
--- a/drivers/firmware/efi/runtime-map.c~kexec-support-kexec-kdump-on-efi-systems-fix
+++ a/drivers/firmware/efi/runtime-map.c
@@ -138,19 +138,19 @@ add_sysfs_runtime_map_entry(struct kobje
return entry;
}

-int get_efi_runtime_map_size(void)
+int efi_get_runtime_map_size(void)
{
return nr_efi_runtime_map * efi_memdesc_size;
}

-int get_efi_runtime_map_desc_size(void)
+int efi_get_runtime_map_desc_size(void)
{
return efi_memdesc_size;
}

int efi_runtime_map_copy(void *buf, size_t bufsz)
{
- size_t sz = get_efi_runtime_map_size();
+ size_t sz = efi_get_runtime_map_size();

if (sz > bufsz)
sz = bufsz;
diff -puN include/linux/efi.h~kexec-support-kexec-kdump-on-efi-systems-fix include/linux/efi.h
--- a/include/linux/efi.h~kexec-support-kexec-kdump-on-efi-systems-fix
+++ a/include/linux/efi.h
@@ -1151,8 +1151,8 @@ int efivars_sysfs_init(void);
#ifdef CONFIG_EFI_RUNTIME_MAP
int efi_runtime_map_init(struct kobject *);
void efi_runtime_map_setup(void *, int, u32);
-int get_efi_runtime_map_size(void);
-int get_efi_runtime_map_desc_size(void);
+int efi_get_runtime_map_size(void);
+int efi_get_runtime_map_desc_size(void);
int efi_runtime_map_copy(void *buf, size_t bufsz);
#else
static inline int efi_runtime_map_init(struct kobject *kobj)
@@ -1163,12 +1163,12 @@ static inline int efi_runtime_map_init(s
static inline void
efi_runtime_map_setup(void *map, int nr_entries, u32 desc_size) {}

-static inline int get_efi_runtime_map_size(void)
+static inline int efi_get_runtime_map_size(void)
{
return 0;
}

-static inline int get_efi_runtime_map_desc_size(void)
+static inline int efi_get_runtime_map_desc_size(void)
{
return 0;
}
_

2014-07-01 20:22:36

by Vivek Goyal

Subject: Re: [PATCH 15/15] kexec: Support kexec/kdump on EFI systems

On Tue, Jul 01, 2014 at 01:14:19PM -0700, Andrew Morton wrote:
> On Tue, 1 Jul 2014 20:46:05 +0100 Matt Fleming <[email protected]> wrote:
>
> > > +int get_efi_runtime_map_size(void)
> > > +{
> > > + return nr_efi_runtime_map * efi_memdesc_size;
> > > +}
> > > +
> > > +int get_efi_runtime_map_desc_size(void)
> > > +{
> > > + return efi_memdesc_size;
> > > +}
> > > +
> > > +int efi_runtime_map_copy(void *buf, size_t bufsz)
> > > +{
> > > + size_t sz = get_efi_runtime_map_size();
> > > +
> > > + if (sz > bufsz)
> > > + sz = bufsz;
> > > +
> > > + memcpy(buf, efi_runtime_map, sz);
> > > + return 0;
> > > +}
> >
> > Could we prefix these with efi_, e.g. efi_get_runtime_map_size() ?
>
> This?
>
> From: Andrew Morton <[email protected]>
> Subject: kexec-support-kexec-kdump-on-efi-systems-fix
>
> s/get_efi/efi_get/g, per Matt
>
> Cc: Vivek Goyal <[email protected]>
> Cc: Matt Fleming <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Looks good to me. Thanks Andrew.

Vivek

2014-07-01 20:26:21

by Vivek Goyal

Subject: Re: [PATCH 09/15] kexec: Implementation of new syscall kexec_file_load

On Fri, Jun 27, 2014 at 12:31:41PM -0400, Vivek Goyal wrote:
> On Thu, Jun 26, 2014 at 01:58:26PM -0700, Andrew Morton wrote:
>
> [..]
> > > + while (pos < stat.size) {
> > > + bytes = kernel_read(f.file, pos, (char *)(*buf) + pos,
> > > + stat.size - pos);
> > > + if (bytes < 0) {
> > > + vfree(*buf);
> > > + ret = bytes;
> > > + goto out;
> > > + }
> > > +
> > > + if (bytes == 0)
> > > + break;
> >
> > Here we can get a short read: (pos < stat.size). Seems to me that it
> > is risky to return this result to the caller as if all is well.
>
> Hi Andrew,
>
> That's a good point. Please find attached the patch which fixes both
> the issues.
>
> Thanks
> Vivek
>
>
>

Hi Andrew,

Based on your feedback, I wrote the following patch. Does it look good to
you? If so, could you please include this one too? Let me know if you
want me to post it separately.

Thanks
Vivek

> Subject: kexec: Return error if file bytes are less than file size
>
> If the number of bytes read from the file is not the same as the file size,
> return an error.
>
> Signed-off-by: Vivek Goyal <[email protected]>
> ---
> kernel/kexec.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/kernel/kexec.c
> ===================================================================
> --- linux-2.6.orig/kernel/kexec.c 2014-06-27 09:55:41.826755422 -0400
> +++ linux-2.6/kernel/kexec.c 2014-06-27 10:04:23.409024171 -0400
> @@ -343,7 +343,7 @@ out_free_image:
> static int copy_file_from_fd(int fd, void **buf, unsigned long *buf_len)
> {
> struct fd f = fdget(fd);
> - int ret = 0;
> + int ret;
> struct kstat stat;
> loff_t pos;
> ssize_t bytes = 0;
> @@ -387,6 +387,12 @@ static int copy_file_from_fd(int fd, voi
> pos += bytes;
> }
>
> + if (pos != stat.size) {
> + ret = -EBADF;
> + vfree(*buf);
> + goto out;
> + }
> +
> *buf_len = pos;
> out:
> fdput(f);
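
For illustration only, the short-read check discussed above can be sketched
in userspace C. This is not the kernel code: read_whole_fd() is a
hypothetical name, pread() and malloc() stand in for kernel_read() and
vmalloc(), and the empty-file and EINTR handling are assumptions of this
sketch. The -EBADF return on a size mismatch mirrors the patch above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Userspace analogue of copy_file_from_fd(): read the whole file into a
 * freshly allocated buffer and fail if fewer bytes arrive than fstat()
 * reported. read_whole_fd() is a hypothetical helper, not a kernel API.
 */
static int read_whole_fd(int fd, void **buf, size_t *buf_len)
{
	struct stat st;
	off_t pos = 0;
	char *p;

	if (fstat(fd, &st) < 0)
		return -errno;
	if (st.st_size <= 0)		/* assumption: reject empty files */
		return -EINVAL;

	p = malloc((size_t)st.st_size);
	if (!p)
		return -ENOMEM;

	while (pos < st.st_size) {
		ssize_t bytes = pread(fd, p + pos,
				      (size_t)(st.st_size - pos), pos);

		if (bytes < 0) {
			int err = errno;

			if (err == EINTR)	/* assumption: retry on signal */
				continue;
			free(p);
			return -err;
		}
		if (bytes == 0)		/* EOF before st_size was reached */
			break;
		pos += bytes;
	}

	/* A short read must not be reported to the caller as success. */
	if (pos != st.st_size) {
		free(p);
		return -EBADF;
	}

	*buf = p;
	*buf_len = (size_t)pos;
	return 0;
}
```

On success the caller owns *buf and releases it with free(). On any error,
including the short-read case, nothing is handed back and *buf is left
untouched.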

2014-07-01 21:23:56

by Matt Fleming

Subject: Re: [PATCH 15/15] kexec: Support kexec/kdump on EFI systems

On Tue, 01 Jul, at 01:14:19PM, Andrew Morton wrote:
>
> This?
>
> From: Andrew Morton <[email protected]>
> Subject: kexec-support-kexec-kdump-on-efi-systems-fix
>
> s/get_efi/efi_get/g, per Matt
>
> Cc: Vivek Goyal <[email protected]>
> Cc: Matt Fleming <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Yep, looks spot on. Thanks Andrew.

--
Matt Fleming, Intel Open Source Technology Center