2013-03-07 14:58:27

by Jingbai Ma

Subject: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speed up kernel dump process

This patch series intends to speed up the memory page scanning process
in selective dump mode.

Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
v1.5.3):

Total scan time:
Original kernel
+ makedumpfile v1.5.3 cyclic mode        1958.05 seconds
Original kernel
+ makedumpfile v1.5.3 non-cyclic mode    1151.50 seconds
Patched kernel
+ patched makedumpfile v1.5.3              17.50 seconds

Traditionally, to reduce the size of the dump file, the dumper scans all
memory pages to exclude unnecessary pages after the capture kernel has
booted; the scan is done in userspace code (makedumpfile).

This introduces several problems:

1. It requires more memory to store the memory bitmap on systems with a
large amount of memory installed. Since only a little free memory is
available in the capture kernel, this can cause an out-of-memory error
and fail. (Non-cyclic mode)

2. Scanning all memory pages in makedumpfile is a very slow process. On
systems with 1TB or more of memory installed, the scan takes a long
time; on an idle 1TB system, it takes about 19 minutes. On systems with
4TB or more of memory installed, it doesn't work at all. To address the
out-of-memory issue on systems with big memory (4TB or more installed),
makedumpfile v1.5.1 introduced a new cyclic mode. It scans only a chunk
of memory pages at a time and cycles through until all memory pages are
covered, but it runs even more slowly: on a 1TB system it takes about
33 minutes.

3. The memory page scanning code in makedumpfile is very complicated.
Without the kernel's memory management data structures, makedumpfile has
to build up its own data structures, cannot use some macros that are
only available in the kernel (e.g. page_to_pfn), and has to use slower
lookup algorithms instead.

This patch introduces a new way to scan memory pages. It reserves a
piece of memory (1 bit for each page, 32MB per TB of memory on x86
systems) in the first kernel. During the kernel crash process, it scans
all memory pages and clears the bit for each excluded memory page in the
reserved memory.

This new approach has several benefits:

1. It is extremely fast: on a 1TB system, it takes only about 17.5
seconds to scan all memory pages!

2. It reduces the memory requirement of makedumpfile by putting the
reserved memory in the first kernel's memory space.

3. It simplifies the existing memory page scanning code in userspace.

To do:
1. It has only been verified on the x86 64-bit platform and needs to be
modified for other platforms (ARM, Xen, PPC, etc.).

---

Jingbai Ma (5):
crash dump bitmap: add a kernel config and help document
crash dump bitmap: init crash dump bitmap in kernel booting process
crash dump bitmap: scan memory pages in kernel crash process
crash dump bitmap: add a proc interface for crash dump bitmap
crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue


Documentation/kdump/crash_dump_bitmap.txt | 378 +++++++++++++++++++++++++++++
arch/x86/Kconfig | 16 +
arch/x86/kernel/setup.c | 62 +++++
fs/proc/Makefile | 1
fs/proc/crash_dump_bitmap.c | 221 +++++++++++++++++
include/linux/crash_dump_bitmap.h | 59 +++++
kernel/Makefile | 1
kernel/crash_dump_bitmap.c | 201 +++++++++++++++
kernel/kexec.c | 5
9 files changed, 943 insertions(+), 1 deletions(-)
create mode 100644 Documentation/kdump/crash_dump_bitmap.txt
create mode 100644 fs/proc/crash_dump_bitmap.c
create mode 100644 include/linux/crash_dump_bitmap.h
create mode 100644 kernel/crash_dump_bitmap.c

--
Jingbai Ma <[email protected]>


2013-03-07 14:58:40

by Jingbai Ma

Subject: [RFC PATCH 1/5] crash dump bitmap: add a kernel config and help document

Add a kernel config and help document for CRASH_DUMP_BITMAP.

Signed-off-by: Jingbai Ma <[email protected]>
---
Documentation/kdump/crash_dump_bitmap.txt | 378 +++++++++++++++++++++++++++++
arch/x86/Kconfig | 16 +
2 files changed, 394 insertions(+), 0 deletions(-)
create mode 100644 Documentation/kdump/crash_dump_bitmap.txt

diff --git a/Documentation/kdump/crash_dump_bitmap.txt b/Documentation/kdump/crash_dump_bitmap.txt
new file mode 100644
index 0000000..468cdf2
--- /dev/null
+++ b/Documentation/kdump/crash_dump_bitmap.txt
@@ -0,0 +1,378 @@
+================================================================
+Documentation for Crash Dump Bitmap
+================================================================
+
+This document includes overview, setup and installation, and analysis
+information.
+
+Overview
+========
+
+Traditionally, to reduce the size of the dump file, the dumper scans all
+memory pages to exclude unnecessary pages after the capture kernel has
+booted; the scan is done in userspace code (makedumpfile).
+
+This introduces several problems:
+
+1. It requires more memory to store the memory bitmap on systems with a
+large amount of memory installed. Since only a little free memory is
+available in the capture kernel, this can cause an out-of-memory error
+and fail. (Non-cyclic mode)
+
+2. Scanning all memory pages in makedumpfile is a very slow process. On
+systems with 1TB or more of memory installed, the scan takes a long
+time; on an idle 1TB system, it takes about 19 minutes. On systems with
+4TB or more of memory installed, it doesn't work at all. To address the
+out-of-memory issue on systems with big memory (4TB or more installed),
+makedumpfile v1.5.1 introduced a new cyclic mode. It scans only a chunk
+of memory pages at a time and cycles through until all memory pages are
+covered, but it runs even more slowly: on a 1TB system it takes about
+33 minutes.
+
+3. The memory page scanning code in makedumpfile is very complicated.
+Without the kernel's memory management data structures, makedumpfile has
+to build up its own data structures, cannot use some macros that are
+only available in the kernel (e.g. page_to_pfn), and has to use slower
+lookup algorithms instead.
+
+This patch introduces a new way to scan memory pages. It reserves a piece of
+memory (1 bit for each page, 32MB per TB of memory on x86 systems) in the
+first kernel. During the kernel panic process, it scans all memory pages and
+clears the bit for each excluded memory page in the reserved memory.
+
+This new approach has several benefits:
+
+1. It is extremely fast: on a 1TB system, it takes only about 17.5
+seconds to scan all memory pages!
+
+2. It reduces the memory requirement of makedumpfile by putting the
+reserved memory in the first kernel's memory space.
+
+3. It simplifies the existing memory page scanning code in
+userspace.
+
+
+Usage
+=====
+
+1) Enable "kernel crash dump bitmap" in "Processor type and features", under
+"kernel crash dumps".
+
+CONFIG_CRASH_DUMP_BITMAP=y
+
+It depends on "kexec system call" and "kernel crash dumps", so these
+features must also be enabled.
+
+CONFIG_KEXEC=y
+CONFIG_CRASH_DUMP=y
+
+2) Enable "sysfs file system support" in "File systems" -> "Pseudo filesystems".
+
+ CONFIG_SYSFS=y
+
+3) Compile and install the new kernel.
+
+4) Check the new kernel.
+Once the new kernel has booted, there will be a new folder,
+/proc/crash_dump_bitmap.
+Check the current dump level:
+cat /proc/crash_dump_bitmap/dump_level
+
+Set dump level:
+echo "dump level" > /proc/crash_dump_bitmap/dump_level
+
+The dump level is the same as the dump_level parameter of makedumpfile -d.
+
+Run page scan and check page status:
+cat /proc/crash_dump_bitmap/page_status
+
+5) Download makedumpfile v1.5.3 or later from sourceforge:
+http://sourceforge.net/projects/makedumpfile/
+
+6) Patch it with the patch at the end of this file.
+
+7) Compile it and copy the patched makedumpfile into the right folder
+(/sbin or /usr/sbin).
+
+8) Change /etc/kdump.conf and add a "-q" to the makedumpfile parameter
+line. It tells makedumpfile to use the in-kernel crash dump bitmap:
+core_collector makedumpfile --non-cyclic -q -c -d 31 --message-level 23
+
+9) Regenerate the initramfs to make sure the patched makedumpfile and
+config have been included in it.
+
+
+To Do
+=====
+
+It currently only supports the x86-64 architecture; support for other
+architectures needs to be added.
+
+
+Contact
+=======
+
+Jingbai Ma ([email protected])
+
+
+Patch (for makedumpfile v1.5.3)
+
+Please forgive me: due to some format issues with the makedumpfile
+source, I had to wrap this patch with '#'. Please use this sed command
+to extract the patch for makedumpfile:
+
+sed -n -e "s/^#\(.*\)#$/\1/p" crash_dump_bitmap.txt > makedumpfile.patch
+
+=====
+#diff --git a/makedumpfile.c b/makedumpfile.c#
+#index acb1b21..f29b6a5 100644#
+#--- a/makedumpfile.c#
+#+++ b/makedumpfile.c#
+#@@ -34,6 +34,10 @@ struct srcfile_table srcfile_table;#
+# struct vm_table vt = { 0 };#
+# struct DumpInfo *info = NULL;#
+# #
+#+struct crash_dump_bitmap_info crash_dump_bitmap_info;#
+#+#
+#+const unsigned int CURRENT_BITMAP_INFO_VERSION = 1;#
+#+#
+# char filename_stdout[] = FILENAME_STDOUT;#
+# #
+# /*#
+#@@ -892,6 +896,7 @@ get_symbol_info(void)#
+# SYMBOL_INIT(node_remap_start_vaddr, "node_remap_start_vaddr");#
+# SYMBOL_INIT(node_remap_end_vaddr, "node_remap_end_vaddr");#
+# SYMBOL_INIT(node_remap_start_pfn, "node_remap_start_pfn");#
+#+ SYMBOL_INIT(crash_dump_bitmap_info, "crash_dump_bitmap_info");#
+# #
+# if (SYMBOL(node_data) != NOT_FOUND_SYMBOL)#
+# SYMBOL_ARRAY_TYPE_INIT(node_data, "node_data");#
+#@@ -1704,6 +1709,8 @@ read_vmcoreinfo(void)#
+# READ_SYMBOL("node_remap_end_vaddr", node_remap_end_vaddr);#
+# READ_SYMBOL("node_remap_start_pfn", node_remap_start_pfn);#
+# #
+#+ READ_SYMBOL("crash_dump_bitmap_info", crash_dump_bitmap_info);#
+#+#
+# READ_STRUCTURE_SIZE("page", page);#
+# READ_STRUCTURE_SIZE("mem_section", mem_section);#
+# READ_STRUCTURE_SIZE("pglist_data", pglist_data);#
+#@@ -4423,6 +4430,74 @@ copy_bitmap(void)#
+# int#
+# create_2nd_bitmap(void)#
+# {#
+#+ off_t offset_page;#
+#+ char buf1[info->page_size], buf2[info->page_size];#
+#+ int i;#
+#+#
+#+ if (info->flag_crash_dump_bitmap) {#
+#+ offset_page = 0;#
+#+ while (offset_page < (info->len_bitmap / 2)) {#
+#+ if (lseek(info->bitmap1->fd, info->bitmap1->offset#
+#+ + offset_page, SEEK_SET) < 0) {#
+#+ ERRMSG("Can't seek the bitmap(%s). %s\n",#
+#+ info->bitmap1->file_name, strerror(errno));#
+#+ return FALSE;#
+#+ }#
+#+#
+#+ if (read(info->bitmap1->fd, buf1, info->page_size)#
+#+ != info->page_size) {#
+#+ ERRMSG("Can't read bitmap(%s). %s\n",#
+#+ info->bitmap1->file_name,#
+#+ strerror(errno));#
+#+ return FALSE;#
+#+ }#
+#+#
+#+ if (readmem(PADDR, crash_dump_bitmap_info.bitmap#
+#+ + offset_page, buf2, info->page_size)#
+#+ != info->page_size) {#
+#+ ERRMSG("Can't read bitmap1! addr=%llx\n",#
+#+ crash_dump_bitmap_info.bitmap#
+#+ + offset_page);#
+#+ return FALSE;#
+#+ }#
+#+#
+#+ if (crash_dump_bitmap_info.version#
+#+ != CURRENT_BITMAP_INFO_VERSION) {#
+#+ ERRMSG("bitmap version! expected=%d, got=%d\n",#
+#+ CURRENT_BITMAP_INFO_VERSION,#
+#+ crash_dump_bitmap_info.version);#
+#+ return FALSE;#
+#+ }#
+#+#
+#+ for (i = 0; i < info->page_size; i++)#
+#+ buf2[i] = buf1[i] & buf2[i];#
+#+#
+#+ if (lseek(info->bitmap2->fd, info->bitmap2->offset#
+#+ + offset_page, SEEK_SET) < 0) {#
+#+ ERRMSG("Can't seek the bitmap(%s). %s\n",#
+#+ info->bitmap2->file_name, strerror(errno));#
+#+ return FALSE;#
+#+ }#
+#+#
+#+ if (write(info->bitmap2->fd, buf2, info->page_size)#
+#+ != info->page_size) {#
+#+ ERRMSG("Can't write the bitmap(%s). %s\n",#
+#+ info->bitmap2->file_name, strerror(errno));#
+#+ return FALSE;#
+#+ }#
+#+#
+#+ offset_page += info->page_size;#
+#+ }#
+#+#
+#+ pfn_cache = crash_dump_bitmap_info.cache_pages;#
+#+ pfn_cache_private = crash_dump_bitmap_info.cache_private_pages;#
+#+ pfn_user = crash_dump_bitmap_info.user_pages;#
+#+ pfn_free = crash_dump_bitmap_info.free_pages;#
+#+ pfn_hwpoison = crash_dump_bitmap_info.hwpoison_pages;#
+#+#
+#+ return TRUE;#
+#+ }#
+#+#
+# /*#
+# * Copy 1st-bitmap to 2nd-bitmap.#
+# */#
+#@@ -4587,6 +4662,46 @@ create_dump_bitmap(void)#
+# if (!prepare_bitmap_buffer())#
+# goto out;#
+# #
+#+ if (info->flag_crash_dump_bitmap#
+#+ && (SYMBOL(crash_dump_bitmap_info)#
+#+ != NOT_FOUND_SYMBOL)) {#
+#+ /* Read crash_dump_bitmap_info from old kernel */#
+#+ readmem(VADDR, SYMBOL(crash_dump_bitmap_info),#
+#+ &crash_dump_bitmap_info,#
+#+ sizeof(struct crash_dump_bitmap_info));#
+#+#
+#+ if (!crash_dump_bitmap_info.bitmap_size#
+#+ || !crash_dump_bitmap_info.bitmap) {#
+#+ ERRMSG("Can't get crash_dump bitmap info! ");#
+#+ ERRMSG("Failback to legacy mode.\n");#
+#+ ERRMSG("crash_dump_bitmap_info=0x%llx, ",#
+#+ SYMBOL(crash_dump_bitmap_info));#
+#+ ERRMSG("bitmap=0x%llx, ",#
+#+ crash_dump_bitmap_info.bitmap);#
+#+ ERRMSG("bitmap_size=%lld\n",#
+#+ crash_dump_bitmap_info.bitmap_size);#
+#+#
+#+ info->flag_crash_dump_bitmap = FALSE;#
+#+ } else {#
+#+ MSG("crash_dump_bitmap: ");#
+#+ MSG("crash_dump_bitmap_info=0x%llx, ",#
+#+ SYMBOL(crash_dump_bitmap_info));#
+#+ MSG("bitmap=0x%llx, ",#
+#+ crash_dump_bitmap_info.bitmap);#
+#+ MSG("bitmap_size=%lld, ",#
+#+ crash_dump_bitmap_info.bitmap_size);#
+#+ MSG("cache_pages=0x%lx, ",#
+#+ crash_dump_bitmap_info.cache_pages);#
+#+ MSG("cache_private_pages=0x%lx, ",#
+#+ crash_dump_bitmap_info#
+#+ .cache_private_pages);#
+#+ MSG("user_pages=0x%lx, ",#
+#+ crash_dump_bitmap_info.user_pages);#
+#+ MSG("free_pages=0x%lx\n",#
+#+ crash_dump_bitmap_info.free_pages);#
+#+ }#
+#+ }#
+#+#
+# if (!create_1st_bitmap())#
+# goto out;#
+# #
+#@@ -8454,7 +8569,8 @@ main(int argc, char *argv[])#
+# #
+# info->block_order = DEFAULT_ORDER;#
+# message_level = DEFAULT_MSG_LEVEL;#
+#- while ((opt = getopt_long(argc, argv, "b:cDd:EFfg:hi:lMpRrsvXx:", longopts,#
+#+ while ((opt = getopt_long(argc, argv, "b:cDd:EFfg:hi:lMpqRrsvXx:",#
+#+ longopts,#
+# NULL)) != -1) {#
+# switch (opt) {#
+# case 'b':#
+#@@ -8518,6 +8634,10 @@ main(int argc, char *argv[])#
+# case 'P':#
+# info->xen_phys_start = strtoul(optarg, NULL, 0);#
+# break;#
+#+ case 'q':#
+#+ info->flag_crash_dump_bitmap = TRUE;#
+#+ info->flag_cyclic = FALSE;#
+#+ break;#
+# case 'R':#
+# info->flag_rearrange = 1;#
+# break;#
+#diff --git a/makedumpfile.h b/makedumpfile.h#
+#index 272273e..6404b16 100644#
+#--- a/makedumpfile.h#
+#+++ b/makedumpfile.h#
+#@@ -41,6 +41,8 @@#
+# #include "dwarf_info.h"#
+# #include "diskdump_mod.h"#
+# #include "sadump_mod.h"#
+#+#include "print_info.h"#
+#+#
+# #
+# /*#
+# * Result of command#
+#@@ -889,6 +891,7 @@ struct DumpInfo {#
+# int flag_refiltering; /* refilter from kdump-compressed file */#
+# int flag_force; /* overwrite existing stuff */#
+# int flag_exclude_xen_dom;/* exclude Domain-U from xen-kdump */#
+#+ int flag_crash_dump_bitmap;/* crash dump bitmap */#
+# int flag_dmesg; /* dump the dmesg log out of the vmcore file */#
+# int flag_nospace; /* the flag of "No space on device" error */#
+# unsigned long vaddr_for_vtop; /* virtual address for debugging */#
+#@@ -1153,6 +1156,11 @@ struct symbol_table {#
+# unsigned long long __per_cpu_load;#
+# unsigned long long cpu_online_mask;#
+# unsigned long long kexec_crash_image;#
+#+#
+#+ /*#
+#+ * for crash_dump_bitmap#
+#+ */#
+#+ unsigned long long crash_dump_bitmap_info;#
+# };#
+# #
+# struct size_table {#
+#@@ -1381,6 +1389,20 @@ struct srcfile_table {#
+# char pud_t[LEN_SRCFILE];#
+# };#
+# #
+#+/*#
+#+ * for crash_dump_bitmap#
+#+ */#
+#+struct crash_dump_bitmap_info {#
+#+ unsigned int version;#
+#+ unsigned long long bitmap;#
+#+ unsigned long long bitmap_size;#
+#+ unsigned long cache_pages;#
+#+ unsigned long cache_private_pages;#
+#+ unsigned long user_pages;#
+#+ unsigned long free_pages;#
+#+ unsigned long hwpoison_pages;#
+#+};#
+#+#
+# extern struct symbol_table symbol_table;#
+# extern struct size_table size_table;#
+# extern struct offset_table offset_table;#
+#@@ -1541,8 +1563,20 @@ is_dumpable(struct dump_bitmap *bitmap, unsigned long long pfn)#
+# off_t offset;#
+# if (pfn == 0 || bitmap->no_block != pfn/PFN_BUFBITMAP) {#
+# offset = bitmap->offset + BUFSIZE_BITMAP*(pfn/PFN_BUFBITMAP);#
+#- lseek(bitmap->fd, offset, SEEK_SET);#
+#- read(bitmap->fd, bitmap->buf, BUFSIZE_BITMAP);#
+#+ if (lseek(bitmap->fd, offset, SEEK_SET) < 0) {#
+#+ ERRMSG("Can't seek bitmap file %s:(%d), ",#
+#+ bitmap->file_name, bitmap->fd);#
+#+ ERRMSG("offset=%ld, error: %s\n",#
+#+ offset, strerror(errno));#
+#+ }#
+#+#
+#+ if (read(bitmap->fd, bitmap->buf, BUFSIZE_BITMAP) < 0) {#
+#+ ERRMSG("Can't read bitmap file %s:(%d), ",#
+#+ bitmap->file_name, bitmap->fd);#
+#+ ERRMSG("offset=%ld, error: %s\n",#
+#+ offset, strerror(errno));#
+#+ }#
+#+#
+# if (pfn == 0)#
+# bitmap->no_block = 0;#
+# else#
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a4f24f5..7b6232e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1633,6 +1633,22 @@ config CRASH_DUMP
(CONFIG_RELOCATABLE=y).
For more details see Documentation/kdump/kdump.txt

+config CRASH_DUMP_BITMAP
+ bool "kernel crash dump bitmap"
+ def_bool y
+ depends on CRASH_DUMP && X86_64
+ ---help---
+ This option enables kernel crash dump bitmap support.
+ It reserves a block of memory to store the crash dump bitmap
+ (1 bit for each page, 32MB per TB of memory on x86 systems).
+ It scans all memory pages during crash processing and marks the
+ excluded memory page bits in the reserved memory. This is very
+ fast compared to scanning them later in the capture kernel.
+ Users can control which types of pages are excluded through procfs:
+ /proc/crash_dump_bitmap/dump_level
+ The default dump level is 31 (exclude all unnecessary pages).
+ For more details see Documentation/kdump/crash_dump_bitmap.txt
+
config KEXEC_JUMP
bool "kexec jump"
depends on KEXEC && HIBERNATION

2013-03-07 14:58:51

by Jingbai Ma

Subject: [RFC PATCH 2/5] crash dump bitmap: init crash dump bitmap in kernel booting process

Reserve a memory block for crash_dump_bitmap during the kernel boot process.

Signed-off-by: Jingbai Ma <[email protected]>
---
arch/x86/kernel/setup.c | 59 +++++++++++++++++++++++++++++++++++++
include/linux/crash_dump_bitmap.h | 59 +++++++++++++++++++++++++++++++++++++
kernel/Makefile | 1 +
kernel/crash_dump_bitmap.c | 45 ++++++++++++++++++++++++++++
4 files changed, 164 insertions(+), 0 deletions(-)
create mode 100644 include/linux/crash_dump_bitmap.h
create mode 100644 kernel/crash_dump_bitmap.c

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 84d3285..165c831 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -67,6 +67,7 @@

#include <linux/percpu.h>
#include <linux/crash_dump.h>
+#include <linux/crash_dump_bitmap.h>
#include <linux/tboot.h>
#include <linux/jiffies.h>

@@ -601,6 +602,62 @@ static void __init reserve_crashkernel(void)
}
#endif

+#ifdef CONFIG_CRASH_DUMP_BITMAP
+static void __init crash_dump_bitmap_init(void)
+{
+ static unsigned long BITSPERBYTE = 8;
+
+ unsigned long long mem_start;
+ unsigned long long mem_size;
+
+ if (is_kdump_kernel())
+ return;
+
+ mem_start = (1ULL << 24); /* 16MB */
+ mem_size = roundup((roundup(max_pfn, BITSPERBYTE) / BITSPERBYTE),
+ PAGE_SIZE);
+
+ crash_dump_bitmap_mem = memblock_find_in_range(mem_start,
+ MEMBLOCK_ALLOC_ACCESSIBLE, mem_size, PAGE_SIZE);
+
+ if (!crash_dump_bitmap_mem) {
+ pr_err(
+ "crash_dump_bitmap: allocate error! size=%lldkB, from=%lldMB\n",
+ mem_size >> 10, mem_start >> 20);
+
+ return;
+ }
+
+ crash_dump_bitmap_mem_size = mem_size;
+ memblock_reserve(crash_dump_bitmap_mem, crash_dump_bitmap_mem_size);
+ pr_info("crash_dump_bitmap: bitmap_mem=%lldMB. size=%lldkB\n",
+ (unsigned long long)crash_dump_bitmap_mem >> 20,
+ mem_size >> 10);
+
+ crash_dump_bitmap_res.start = crash_dump_bitmap_mem;
+ crash_dump_bitmap_res.end = crash_dump_bitmap_mem + mem_size - 1;
+ insert_resource(&iomem_resource, &crash_dump_bitmap_res);
+
+ crash_dump_bitmap_info.version = CRASH_DUMP_BITMAP_VERSION;
+
+ crash_dump_bitmap_info.bitmap = crash_dump_bitmap_mem;
+ crash_dump_bitmap_info.bitmap_size = crash_dump_bitmap_mem_size;
+
+ crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages = 1;
+ crash_dump_bitmap_ctrl.exclude_zero_pages = 1;
+ crash_dump_bitmap_ctrl.exclude_cache_pages = 1;
+ crash_dump_bitmap_ctrl.exclude_cache_private_pages = 1;
+ crash_dump_bitmap_ctrl.exclude_user_pages = 1;
+ crash_dump_bitmap_ctrl.exclude_free_pages = 1;
+
+ pr_info("crash_dump_bitmap: Initialized!\n");
+}
+#else
+static void __init crash_dump_bitmap_init(void)
+{
+}
+#endif
+
static struct resource standard_io_resources[] = {
{ .name = "dma1", .start = 0x00, .end = 0x1f,
.flags = IORESOURCE_BUSY | IORESOURCE_IO },
@@ -1094,6 +1151,8 @@ void __init setup_arch(char **cmdline_p)

reserve_crashkernel();

+ crash_dump_bitmap_init();
+
vsmp_init();

io_delay_init();
diff --git a/include/linux/crash_dump_bitmap.h b/include/linux/crash_dump_bitmap.h
new file mode 100644
index 0000000..63b1264
--- /dev/null
+++ b/include/linux/crash_dump_bitmap.h
@@ -0,0 +1,59 @@
+/*
+ * include/linux/crash_dump_bitmap.h
+ * Declaration of crash dump bitmap functions and data structures.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ * Author: Jingbai Ma <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef _LINUX_CRASH_DUMP_BITMAP_H
+#define _LINUX_CRASH_DUMP_BITMAP_H
+
+#define CRASH_DUMP_BITMAP_VERSION 1;
+
+enum {
+ CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES = 1,
+ CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES = 2,
+ CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES = 4,
+ CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES = 8,
+ CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES = 16
+};
+
+struct crash_dump_bitmap_ctrl {
+ char exclude_crash_dump_bitmap_pages;
+ char exclude_zero_pages; /* only for tracking dump level */
+ char exclude_cache_pages;
+ char exclude_cache_private_pages;
+ char exclude_user_pages;
+ char exclude_free_pages;
+};
+
+struct crash_dump_bitmap_info {
+ unsigned int version;
+ phys_addr_t bitmap;
+ phys_addr_t bitmap_size;
+ unsigned long cache_pages;
+ unsigned long cache_private_pages;
+ unsigned long user_pages;
+ unsigned long free_pages;
+ unsigned long hwpoison_pages;
+};
+
+void generate_crash_dump_bitmap(void);
+
+extern phys_addr_t crash_dump_bitmap_mem;
+extern phys_addr_t crash_dump_bitmap_mem_size;
+extern struct crash_dump_bitmap_ctrl crash_dump_bitmap_ctrl;
+extern struct crash_dump_bitmap_info crash_dump_bitmap_info;
+extern struct resource crash_dump_bitmap_res;
+
+#endif /* _LINUX_CRASH_DUMP_BITMAP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index bbde5f1..3e85003 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -107,6 +107,7 @@ obj-$(CONFIG_PERF_EVENTS) += events/
obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_PADATA) += padata.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
+obj-$(CONFIG_KEXEC) += crash_dump_bitmap.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o

diff --git a/kernel/crash_dump_bitmap.c b/kernel/crash_dump_bitmap.c
new file mode 100644
index 0000000..e743cdd
--- /dev/null
+++ b/kernel/crash_dump_bitmap.c
@@ -0,0 +1,45 @@
+/*
+ * kernel/crash_dump_bitmap.c
+ * Crash dump bitmap implementation.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ * Author: Jingbai Ma <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/memblock.h>
+#include <linux/crash_dump.h>
+#include <linux/crash_dump_bitmap.h>
+
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+
+phys_addr_t crash_dump_bitmap_mem;
+EXPORT_SYMBOL(crash_dump_bitmap_mem);
+
+phys_addr_t crash_dump_bitmap_mem_size;
+EXPORT_SYMBOL(crash_dump_bitmap_mem_size);
+
+struct crash_dump_bitmap_ctrl crash_dump_bitmap_ctrl;
+EXPORT_SYMBOL(crash_dump_bitmap_ctrl);
+
+struct crash_dump_bitmap_info crash_dump_bitmap_info;
+EXPORT_SYMBOL(crash_dump_bitmap_info);
+
+/* Location of the reserved area for the crash_dump_bitmap */
+struct resource crash_dump_bitmap_res = {
+ .name = "Crash dump bitmap",
+ .start = 0,
+ .end = 0,
+ .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+};
+#endif /* CONFIG_CRASH_DUMP_BITMAP */

2013-03-07 14:59:08

by Jingbai Ma

Subject: [RFC PATCH 3/5] crash dump bitmap: scan memory pages in kernel crash process

In the kernel crash process, call generate_crash_dump_bitmap() to scan
all memory pages, clearing the bit for each excluded memory page in the
reserved memory.

Signed-off-by: Jingbai Ma <[email protected]>
---
kernel/crash_dump_bitmap.c | 156 ++++++++++++++++++++++++++++++++++++++++++++
kernel/kexec.c | 5 +
2 files changed, 161 insertions(+), 0 deletions(-)

diff --git a/kernel/crash_dump_bitmap.c b/kernel/crash_dump_bitmap.c
index e743cdd..eed13ca 100644
--- a/kernel/crash_dump_bitmap.c
+++ b/kernel/crash_dump_bitmap.c
@@ -23,6 +23,8 @@

#ifdef CONFIG_CRASH_DUMP_BITMAP

+#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
+
phys_addr_t crash_dump_bitmap_mem;
EXPORT_SYMBOL(crash_dump_bitmap_mem);

@@ -35,6 +37,7 @@ EXPORT_SYMBOL(crash_dump_bitmap_ctrl);
struct crash_dump_bitmap_info crash_dump_bitmap_info;
EXPORT_SYMBOL(crash_dump_bitmap_info);

+
/* Location of the reserved area for the crash_dump_bitmap */
struct resource crash_dump_bitmap_res = {
.name = "Crash dump bitmap",
@@ -42,4 +45,157 @@ struct resource crash_dump_bitmap_res = {
.end = 0,
.flags = IORESOURCE_BUSY | IORESOURCE_MEM
};
+
+inline void set_crash_dump_bitmap(unsigned long pfn, int val)
+{
+ phys_addr_t paddr = crash_dump_bitmap_info.bitmap + (pfn >> 3);
+ unsigned char *vaddr;
+ unsigned char bit = (pfn & 7);
+
+ if (unlikely(paddr > (crash_dump_bitmap_mem
+ + crash_dump_bitmap_mem_size))) {
+ pr_err(
+ "crash_dump_bitmap: pfn exceeds limit. pfn=%ld, addr=0x%llX\n",
+ pfn, paddr);
+ return;
+ }
+
+ vaddr = (unsigned char *)__va(paddr);
+
+ if (val)
+ *vaddr |= (1U << bit);
+ else
+ *vaddr &= (~(1U << bit));
+}
+
+void generate_crash_dump_bitmap(void)
+{
+ pg_data_t *pgdat;
+ struct zone *zone;
+ unsigned long flags;
+ int order, t;
+ struct list_head *curr;
+ unsigned long zone_free_pages;
+ phys_addr_t addr;
+
+ if (!crash_dump_bitmap_mem) {
+ pr_info("crash_dump_bitmap: no crash_dump_bitmap memory.\n");
+ return;
+ }
+
+ pr_info(
+ "Excluding pages: bitmap=%d, cache=%d, private=%d, user=%d, free=%d\n",
+ crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages,
+ crash_dump_bitmap_ctrl.exclude_cache_pages,
+ crash_dump_bitmap_ctrl.exclude_cache_private_pages,
+ crash_dump_bitmap_ctrl.exclude_user_pages,
+ crash_dump_bitmap_ctrl.exclude_free_pages);
+
+ crash_dump_bitmap_info.free_pages = 0;
+ crash_dump_bitmap_info.cache_pages = 0;
+ crash_dump_bitmap_info.cache_private_pages = 0;
+ crash_dump_bitmap_info.user_pages = 0;
+ crash_dump_bitmap_info.hwpoison_pages = 0;
+
+ /* Set all bits on bitmap */
+ memset(__va(crash_dump_bitmap_info.bitmap), 0xff,
+ crash_dump_bitmap_info.bitmap_size);
+
+ /* Exclude all crash_dump_bitmap pages */
+ if (crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages) {
+ for (addr = crash_dump_bitmap_mem; addr <
+ crash_dump_bitmap_mem + crash_dump_bitmap_mem_size;
+ addr += PAGE_SIZE)
+ set_crash_dump_bitmap(
+ virt_to_pfn(__va(addr)), 0);
+ }
+
+ /* Exclude unnecessary pages */
+ for_each_online_pgdat(pgdat) {
+ unsigned long i;
+ unsigned long flags;
+
+ pgdat_resize_lock(pgdat, &flags);
+ for (i = 0; i < pgdat->node_spanned_pages; i++) {
+ struct page *page;
+ unsigned long pfn = pgdat->node_start_pfn + i;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+
+ /* Exclude the cache pages without the private page */
+ if (crash_dump_bitmap_ctrl.exclude_cache_pages
+ && (PageLRU(page) || PageSwapCache(page))
+ && !page_has_private(page) && !PageAnon(page)) {
+ set_crash_dump_bitmap(pfn, 0);
+ crash_dump_bitmap_info.cache_pages++;
+ }
+ /* Exclude the cache pages with private page */
+ else if (
+ crash_dump_bitmap_ctrl.exclude_cache_private_pages
+ && (PageLRU(page) || PageSwapCache(page))
+ && !PageAnon(page)) {
+ set_crash_dump_bitmap(pfn, 0);
+ crash_dump_bitmap_info.cache_private_pages++;
+ }
+ /* Exclude the pages used by user process */
+ else if (crash_dump_bitmap_ctrl.exclude_user_pages
+ && PageAnon(page)) {
+ set_crash_dump_bitmap(pfn, 0);
+ crash_dump_bitmap_info.user_pages++;
+ }
+#ifdef CONFIG_MEMORY_FAILURE
+ /* Exclude the hwpoison pages */
+ else if (PageHWPoison(page)) {
+ set_crash_dump_bitmap(pfn, 0);
+ crash_dump_bitmap_info.hwpoison_pages++;
+ }
+#endif
+ }
+ pgdat_resize_unlock(pgdat, &flags);
+ }
+
+ /* Exclude the free pages managed by a buddy system */
+ if (crash_dump_bitmap_ctrl.exclude_free_pages) {
+ for_each_populated_zone(zone) {
+ if (!zone->spanned_pages)
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ zone_free_pages = 0;
+ for_each_migratetype_order(order, t) {
+ list_for_each(
+ curr, &zone->free_area[order].free_list[t]) {
+ unsigned long i;
+ struct page *page = list_entry(curr,
+ struct page, lru);
+ for (i = 0; i < (1 << order); i++) {
+ set_crash_dump_bitmap(
+ page_to_pfn(page + i), 0);
+ zone_free_pages++;
+ crash_dump_bitmap_info.free_pages++;
+ }
+ }
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+ }
+
+ pr_info("crash_dump_bitmap: excluded pages: cache=%ld, private=%ld\n",
+ crash_dump_bitmap_info.cache_pages,
+ crash_dump_bitmap_info.cache_private_pages);
+ pr_info("crash_dump_bitmap: excluded pages: user=%ld, free=%ld\n",
+ crash_dump_bitmap_info.user_pages,
+ crash_dump_bitmap_info.free_pages);
+ pr_info("crash_dump_bitmap: excluded pages: hwpoison=%ld\n",
+ crash_dump_bitmap_info.hwpoison_pages);
+}
+EXPORT_SYMBOL(generate_crash_dump_bitmap);
+#else
+void generate_crash_dump_bitmap(void)
+{
+}
#endif /* CONFIG_CRASH_DUMP_BITMAP */
diff --git a/kernel/kexec.c b/kernel/kexec.c
index bddd3d7..ce00f0f 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -32,6 +32,7 @@
#include <linux/vmalloc.h>
#include <linux/swap.h>
#include <linux/syscore_ops.h>
+#include <linux/crash_dump_bitmap.h>

#include <asm/page.h>
#include <asm/uaccess.h>
@@ -1097,6 +1098,7 @@ void crash_kexec(struct pt_regs *regs)
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
machine_crash_shutdown(&fixed_regs);
+ generate_crash_dump_bitmap();
machine_kexec(kexec_crash_image);
}
mutex_unlock(&kexec_mutex);
@@ -1495,6 +1497,9 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_SYMBOL(mem_map);
VMCOREINFO_SYMBOL(contig_page_data);
#endif
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+ VMCOREINFO_SYMBOL(crash_dump_bitmap_info);
+#endif
#ifdef CONFIG_SPARSEMEM
VMCOREINFO_SYMBOL(mem_section);
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);

2013-03-07 14:59:17

by Jingbai Ma

Subject: [RFC PATCH 4/5] crash dump bitmap: add a proc interface for crash dump bitmap

Add a procfs driver for selecting excluded pages from userspace.
/proc/crash_dump_bitmap/

Signed-off-by: Jingbai Ma <[email protected]>
---
fs/proc/Makefile | 1
fs/proc/crash_dump_bitmap.c | 221 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 222 insertions(+), 0 deletions(-)
create mode 100644 fs/proc/crash_dump_bitmap.c

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 712f24d..2dfcff1 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -27,6 +27,7 @@ proc-$(CONFIG_PROC_SYSCTL) += proc_sysctl.o
proc-$(CONFIG_NET) += proc_net.o
proc-$(CONFIG_PROC_KCORE) += kcore.o
proc-$(CONFIG_PROC_VMCORE) += vmcore.o
+proc-$(CONFIG_CRASH_DUMP_BITMAP) += crash_dump_bitmap.o
proc-$(CONFIG_PROC_DEVICETREE) += proc_devtree.o
proc-$(CONFIG_PRINTK) += kmsg.o
proc-$(CONFIG_PROC_PAGE_MONITOR) += page.o
diff --git a/fs/proc/crash_dump_bitmap.c b/fs/proc/crash_dump_bitmap.c
new file mode 100644
index 0000000..77ecaae
--- /dev/null
+++ b/fs/proc/crash_dump_bitmap.c
@@ -0,0 +1,221 @@
+/*
+ * fs/proc/crash_dump_bitmap.c
+ * Interface for controlling the crash dump bitmap from user space.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ * Author: Jingbai Ma <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <linux/jiffies.h>
+#include <linux/crash_dump.h>
+#include <linux/crash_dump_bitmap.h>
+
+#ifdef CONFIG_CRASH_DUMP_BITMAP
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jingbai Ma <[email protected]>");
+MODULE_DESCRIPTION("Crash dump bitmap support driver");
+
+static const char *proc_dir_name = "crash_dump_bitmap";
+static const char *proc_page_status_name = "page_status";
+static const char *proc_dump_level_name = "dump_level";
+
+static struct proc_dir_entry *proc_dir, *proc_page_status, *proc_dump_level;
+
+static unsigned int get_dump_level(void)
+{
+ unsigned int dump_level;
+
+ dump_level = crash_dump_bitmap_ctrl.exclude_zero_pages
+ ? CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES : 0;
+ dump_level |= crash_dump_bitmap_ctrl.exclude_cache_pages
+ ? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES : 0;
+ dump_level |= crash_dump_bitmap_ctrl.exclude_cache_private_pages
+ ? CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES : 0;
+ dump_level |= crash_dump_bitmap_ctrl.exclude_user_pages
+ ? CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES : 0;
+ dump_level |= crash_dump_bitmap_ctrl.exclude_free_pages
+ ? CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES : 0;
+
+ return dump_level;
+}
+
+static void set_dump_level(unsigned int dump_level)
+{
+ crash_dump_bitmap_ctrl.exclude_zero_pages =
+ (dump_level & CRASH_DUMP_LEVEL_EXCLUDE_ZERO_PAGES) ? 1 : 0;
+ crash_dump_bitmap_ctrl.exclude_cache_pages =
+ (dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PAGES) ? 1 : 0;
+ crash_dump_bitmap_ctrl.exclude_cache_private_pages =
+ (dump_level & CRASH_DUMP_LEVEL_EXCLUDE_CACHE_PRIVATE_PAGES)
+ ? 1 : 0;
+ crash_dump_bitmap_ctrl.exclude_user_pages =
+ (dump_level & CRASH_DUMP_LEVEL_EXCLUDE_USER_PAGES) ? 1 : 0;
+ crash_dump_bitmap_ctrl.exclude_free_pages =
+ (dump_level & CRASH_DUMP_LEVEL_EXCLUDE_FREE_PAGES) ? 1 : 0;
+}
+
+static int proc_page_status_show(struct seq_file *m, void *v)
+{
+ u64 start, duration;
+
+ if (!crash_dump_bitmap_mem) {
+ seq_printf(m,
+ "crash_dump_bitmap: crash_dump_bitmap_mem not found!\n");
+
+ return -EINVAL;
+ }
+
+ seq_printf(m, "Exclude page flag status:\n");
+ seq_printf(m, " exclude_dump_bitmap_pages=%d\n",
+ crash_dump_bitmap_ctrl.exclude_crash_dump_bitmap_pages);
+ seq_printf(m, " exclude_zero_pages=%d\n",
+ crash_dump_bitmap_ctrl.exclude_zero_pages);
+ seq_printf(m, " exclude_cache_pages=%d\n",
+ crash_dump_bitmap_ctrl.exclude_cache_pages);
+ seq_printf(m, " exclude_cache_private_pages=%d\n",
+ crash_dump_bitmap_ctrl.exclude_cache_private_pages);
+ seq_printf(m, " exclude_user_pages=%d\n",
+ crash_dump_bitmap_ctrl.exclude_user_pages);
+ seq_printf(m, " exclude_free_pages=%d\n",
+ crash_dump_bitmap_ctrl.exclude_free_pages);
+
+ seq_printf(m, "Scanning all memory pages:\n");
+ start = get_jiffies_64();
+ generate_crash_dump_bitmap();
+ duration = get_jiffies_64() - start;
+ seq_printf(m, " Done. Duration=%dms\n", jiffies_to_msecs(duration));
+
+ seq_printf(m, "Excluded memory page status:\n");
+ seq_printf(m, " cache_pages=%ld\n",
+ crash_dump_bitmap_info.cache_pages);
+ seq_printf(m, " cache_private_pages=%ld\n",
+ crash_dump_bitmap_info.cache_private_pages);
+ seq_printf(m, " user_pages=%ld\n",
+ crash_dump_bitmap_info.user_pages);
+ seq_printf(m, " free_pages=%ld\n",
+ crash_dump_bitmap_info.free_pages);
+ seq_printf(m, " hwpoison_pages=%ld\n",
+ crash_dump_bitmap_info.hwpoison_pages);
+
+ return 0;
+}
+
+static int proc_page_status_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, proc_page_status_show, NULL);
+}
+
+static const struct file_operations proc_page_status_fops = {
+ .open = proc_page_status_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int proc_dump_level_show(struct seq_file *m, void *v)
+{
+ if (!crash_dump_bitmap_mem) {
+ seq_printf(m,
+ "crash_dump_bitmap: crash_dump_bitmap_mem not found!\n");
+
+ return -EINVAL;
+ }
+
+ seq_printf(m, "%d\n", get_dump_level());
+
+ return 0;
+}
+
+static ssize_t proc_dump_level_write(struct file *file,
+ const char __user *buffer, size_t count, loff_t *ppos)
+{
+ int ret;
+ unsigned int dump_level;
+
+ ret = kstrtouint_from_user(buffer, count, 10, &dump_level);
+ if (ret)
+ return ret;
+
+ set_dump_level(dump_level);
+
+ pr_info("crash_dump_bitmap: new dump_level=%d\n", dump_level);
+
+ return count;
+}
+
+static int proc_dump_level_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, proc_dump_level_show, NULL);
+}
+
+static const struct file_operations proc_dump_level_fops = {
+ .open = proc_dump_level_open,
+ .read = seq_read,
+ .write = proc_dump_level_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+int __init init_proc_crash_dump_bitmap(void)
+{
+ if (is_kdump_kernel() || (crash_dump_bitmap_mem == 0)
+ || (crash_dump_bitmap_mem_size == 0))
+ return 0;
+
+ proc_dir = proc_mkdir(proc_dir_name, NULL);
+ if (proc_dir == NULL) {
+ pr_err("crash_dump_bitmap: proc_mkdir failed!\n");
+ return -EINVAL;
+ }
+
+ proc_page_status = proc_create(proc_page_status_name, 0444,
+ proc_dir, &proc_page_status_fops);
+ if (proc_page_status == NULL) {
+ pr_err("crash_dump_bitmap: create procfs %s failed!\n",
+ proc_page_status_name);
+ remove_proc_entry(proc_dir_name, NULL);
+ return -EINVAL;
+ }
+
+ proc_dump_level = proc_create(proc_dump_level_name, 0644,
+ proc_dir, &proc_dump_level_fops);
+ if (proc_dump_level == NULL) {
+ pr_err("crash_dump_bitmap: create procfs %s failed!\n",
+ proc_dump_level_name);
+ remove_proc_entry(proc_page_status_name, proc_dir);
+ remove_proc_entry(proc_dir_name, NULL);
+ return -EINVAL;
+ }
+
+ pr_info("crash_dump_bitmap: procfs driver initialized successfully!\n");
+
+ return 0;
+}
+
+void __exit cleanup_proc_crash_dump_bitmap(void)
+{
+ remove_proc_entry(proc_dump_level_name, proc_dir);
+ remove_proc_entry(proc_page_status_name, proc_dir);
+ remove_proc_entry(proc_dir_name, NULL);
+
+ pr_info("crash_dump_bitmap: procfs driver unloaded!\n");
+}
+
+module_init(init_proc_crash_dump_bitmap);
+module_exit(cleanup_proc_crash_dump_bitmap);
+
+#endif /* CONFIG_CRASH_DUMP_BITMAP */

2013-03-07 14:59:26

by Jingbai Ma

[permalink] [raw]
Subject: [RFC PATCH 5/5] crash dump bitmap: workaround for kernel 3.9-rc1 kdump issue

Linux kernel 3.9-rc1 allows crashkernel above 4GB, but current
kexec-tools doesn't support it yet.
This patch is only a workaround to make kdump work again.
This patch should be removed after kexec-tools 2.0.4 release.

Signed-off-by: Jingbai Ma <[email protected]>
---
arch/x86/kernel/setup.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 165c831..15321d6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -506,7 +506,8 @@ static void __init memblock_x86_reserve_range_setup_data(void)
#ifdef CONFIG_X86_32
# define CRASH_KERNEL_ADDR_MAX (512 << 20)
#else
-# define CRASH_KERNEL_ADDR_MAX MAXMEM
+/* # define CRASH_KERNEL_ADDR_MAX MAXMEM */
+# define CRASH_KERNEL_ADDR_MAX (896 << 20)
#endif

static void __init reserve_crashkernel_low(void)

2013-03-07 15:21:25

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
> This patch intend to speedup the memory pages scanning process in
> selective dump mode.
>
> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
> v1.5.3):
>
> Total scan Time
> Original kernel
> + makedumpfile v1.5.3 cyclic mode 1958.05 seconds
> Original kernel
> + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds
> Patched kernel
> + patched makedumpfile v1.5.3 17.50 seconds
>
> Traditionally, to reduce the size of dump file, dumper scans all memory
> pages to exclude the unnecessary memory pages after capture kernel
> booted, and scan it in userspace code (makedumpfile).

I think this is not a good idea. It has several issues.

- First of all, it is doing more stuff in the first kernel. And that runs
contrary to the kdump design, where we want to do stuff in the second kernel.
After a kernel crash, you can't trust the running kernel's data structures.
So to improve reliability, just do minimal stuff in the crashed kernel and
get out quickly.

- Secondly, it moves filtering policy into the kernel. I think keeping it
in user space gives us the extra flexibility.
>
> It introduces several problems:
>
> 1. Requires more memory to store memory bitmap on systems with large
> amount of memory installed. And in capture kernel there is only a few
> free memory available, it will cause an out of memory error and fail.
> (Non-cyclic mode)

makedumpfile requires 2 bits per 4K page. That is 64MB per TB. In your
patches you are also reserving 1 bit per page, and that is 32MB per TB
in the first kernel.

So memory is anyway being reserved, just that makedumpfile seems to be
needing this extra bit. Not sure if that can be optimized or not.

First of all 64MB per TB should not be a huge deal. And makedumpfile
also has this cyclic mode where you process a map, discard it and then
move on to next section. So memory usage remains constant at the expense
of processing time.

Looks like now hpa and yinghai have done the work to be able to load
kdump kernel above 4GB. I am assuming this also removes the restriction
that we can only reserve 512MB or 896MB in second kernel. If that's
the case, then I don't see why people can't get away with reserving
64MB per TB.

>
> 2. Scans all memory pages in makedumpfile is a very slow process. On
> system with 1TB or more memory installed, the scanning process is very
> long. Typically on 1TB idle system, it takes about 19 minutes. On system
> with 4TB or more memory installed, it even doesn't work. To address the
> out of memory issue on system with big memory (4TB or more memory
> installed), makedumpfile v1.5.1 introduces a new cyclic mode. It only
> scans a piece of memory pages each time, and do it cyclically to scan
> all memory pages. But it runs more slowly, on 1TB system, takes about 33
> minutes.

One of the reasons it is slow is that we don't support the mmap() interface.
That means for every read, we map a 4K page, flush the TLB, read it, unmap it
and flush the TLB again. This is a lot of processing overhead per 4K page.

Hatayama is now working on an mmap() interface to allow user space to map
bigger chunks of memory in one go, so that in one mmap() call we can
map a bigger range instead of just 4K. And his numbers show that it
has helped a lot.

So instead of trying to move the filtering logic into the kernel, I think it
might be better if we try to optimize things in makedumpfile or the second
kernel.

>
> 3. Scans memory pages code in makedumpfile is very complicated, without
> kernel memory management related data structure, makedumpfile has to
> build up its own data structure, and will not able to use some macros
> that only be available in kernel (e.g. page_to_pfn), and has to use some
> slow lookup algorithm instead.
>
> This patch introduces a new way to scan memory pages. It reserves a
> piece of memory (1 bit for each page, 32MB per TB memory on x86 systems)
> in the first kernel. During the kernel crash process, it scans all
> memory pages, clear the bit for all excluded memory pages in the
> reserved memory.

I think this is not a good idea. It has several issues.

- First of all, it is doing more stuff in the first kernel. And that runs
contrary to the kdump design, where we want to do stuff in the second kernel.
After a kernel crash, you can't trust the running kernel's data structures.
So to improve reliability, just do minimal stuff in the crashed kernel and
get out quickly.

- Secondly, it moves filtering policy into the kernel. I think keeping it
in user space gives us the extra flexibility.

>

> We have several benefits by this new approach:
>
> 1. It's extremely fast, on 1TB system only takes about 17.5 seconds to
> scan all memory pages!
>
> 2. Reduces the memory requirement of makedumpfile by putting the
> reserved memory in the first kernel memory space.

Even the second kernel's memory comes from the first kernel. So that really
does not help.

Thanks
Vivek

2013-03-07 21:38:09

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On Thu, Mar 7, 2013 at 7:21 AM, Vivek Goyal <[email protected]> wrote:
> Looks like now hpa and yinghai have done the work to be able to load
> kdump kernel above 4GB. I am assuming this also removes the restriction
> that we can only reserve 512MB or 896MB in second kernel.

Yes, from v3.9 and kexec-tools 2.0.4 on x86 64-bit.

Thanks

Yinghai

2013-03-07 21:54:58

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

Vivek Goyal <[email protected]> writes:

> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>> This patch intend to speedup the memory pages scanning process in
>> selective dump mode.
>>
>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
>> v1.5.3):
>>
>> Total scan Time
>> Original kernel
>> + makedumpfile v1.5.3 cyclic mode 1958.05 seconds
>> Original kernel
>> + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds
>> Patched kernel
>> + patched makedumpfile v1.5.3 17.50 seconds
>>
>> Traditionally, to reduce the size of dump file, dumper scans all memory
>> pages to exclude the unnecessary memory pages after capture kernel
>> booted, and scan it in userspace code (makedumpfile).
>
> I think this is not a good idea. It has several issues.

Actually it does not appear to be doing any work in the first kernel.

> - First of all it is doing more stuff in first kernel. And that runs
> contrary to kdump design where we want to do stuff in second kernel.
> After a kernel crash, you can't trust running kernel's data structures.
> So to improve reliability just do minial stuff in crashed kernel and
> get out quickly.
>
> - Secondly, it moves filetering policy in kernel. I think keeping it
> in user space gives us the extra flexibility.

Agreed.

>> It introduces several problems:
>>
>> 1. Requires more memory to store memory bitmap on systems with large
>> amount of memory installed. And in capture kernel there is only a few
>> free memory available, it will cause an out of memory error and fail.
>> (Non-cyclic mode)
>
> makedumpfile requires 2bits per 4K page. That is 64MB per TB. In your
> patches also you are reserving 1bit per page and that is 32MB per TB
> in first kernel.
>
> So memory is anyway being reserved, just that makedumpfile seems to be
> needing this extra bit. Not sure if that can be optimized or not.
>
> First of all 64MB per TB should not be a huge deal. And makedumpfile
> also has this cyclic mode where you process a map, discard it and then
> move on to next section. So memory usage remains constant at the expense
> of processing time.
>
> Looks like now hpa and yinghai have done the work to be able to load
> kdump kernel above 4GB. I am assuming this also removes the restriction
> that we can only reserve 512MB or 896MB in second kernel. If that's
> the case, then I don't see why people can't get away with reserving
> 64MB per TB.
>
>>
>> 2. Scans all memory pages in makedumpfile is a very slow process. On
>> system with 1TB or more memory installed, the scanning process is very
>> long. Typically on 1TB idle system, it takes about 19 minutes. On system
>> with 4TB or more memory installed, it even doesn't work. To address the
>> out of memory issue on system with big memory (4TB or more memory
>> installed), makedumpfile v1.5.1 introduces a new cyclic mode. It only
>> scans a piece of memory pages each time, and do it cyclically to scan
>> all memory pages. But it runs more slowly, on 1TB system, takes about 33
>> minutes.
>
> One of the reasons it is slow because we don't support mmpa() interface.
> That means for every read, we map 4K page, flush TLB, read it, unmap it
> and flush TLB again. This is lot of processing overhead per 4K page.
>
> Hatayama is now working on making mmap() interface and allow user space
> to bigger chunks of memory in one so. So that in one mmap() call we can
> map a bigger range instead of just 4K. And his numbers show that it
> has helped a lot.
>
> So instead of trying to move filtering logic in kernel, I think it
> might be better if we try to optimize things in makedumpfile or second
> kernel.

Yes. I think optimizing the intermediate forms we have is better, as
that should also increase the speed of writing the dumps not just the
speed of calculating what to dump.

>> 3. Scans memory pages code in makedumpfile is very complicated, without
>> kernel memory management related data structure, makedumpfile has to
>> build up its own data structure, and will not able to use some macros
>> that only be available in kernel (e.g. page_to_pfn), and has to use some
>> slow lookup algorithm instead.
>>
>> This patch introduces a new way to scan memory pages. It reserves a
>> piece of memory (1 bit for each page, 32MB per TB memory on x86 systems)
>> in the first kernel. During the kernel crash process, it scans all
>> memory pages, clear the bit for all excluded memory pages in the
>> reserved memory.
>
> I think this is not a good idea. It has several issues.
>
> - First of all it is doing more stuff in first kernel. And that runs
> contrary to kdump design where we want to do stuff in second kernel.
> After a kernel crash, you can't trust running kernel's data structures.
> So to improve reliability just do minial stuff in crashed kernel and
> get out quickly.
>
> - Secondly, it moves filetering policy in kernel. I think keeping it
> in user space gives us the extra flexibility.

And it also runs into the deep problem of whether the two kernels match.

If the kernels don't match using the second kernels macros for the first
kernels data structures is a recipe for major disaster.

Eric

2013-03-08 01:31:31

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

From: Vivek Goyal <[email protected]>
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
Date: Thu, 7 Mar 2013 10:21:08 -0500

> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
<cut>
>>
>> 2. Scans all memory pages in makedumpfile is a very slow process. On
>> system with 1TB or more memory installed, the scanning process is very
>> long. Typically on 1TB idle system, it takes about 19 minutes. On system
>> with 4TB or more memory installed, it even doesn't work. To address the
>> out of memory issue on system with big memory (4TB or more memory
>> installed), makedumpfile v1.5.1 introduces a new cyclic mode. It only
>> scans a piece of memory pages each time, and do it cyclically to scan
>> all memory pages. But it runs more slowly, on 1TB system, takes about 33
>> minutes.
>
> One of the reasons it is slow because we don't support mmpa() interface.
> That means for every read, we map 4K page, flush TLB, read it, unmap it
> and flush TLB again. This is lot of processing overhead per 4K page.
>
> Hatayama is now working on making mmap() interface and allow user space
> to bigger chunks of memory in one so. So that in one mmap() call we can
> map a bigger range instead of just 4K. And his numbers show that it
> has helped a lot.

Yes, so when are you able to get down to reviewing the patch set?

Thanks.
HATAYAMA, Daisuke

2013-03-08 10:06:39

by Jingbai Ma

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 03/07/2013 11:21 PM, Vivek Goyal wrote:
> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>> This patch intend to speedup the memory pages scanning process in
>> selective dump mode.
>>
>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
>> v1.5.3):
>>
>> Total scan Time
>> Original kernel
>> + makedumpfile v1.5.3 cyclic mode 1958.05 seconds
>> Original kernel
>> + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds
>> Patched kernel
>> + patched makedumpfile v1.5.3 17.50 seconds
>>
>> Traditionally, to reduce the size of dump file, dumper scans all memory
>> pages to exclude the unnecessary memory pages after capture kernel
>> booted, and scan it in userspace code (makedumpfile).
>
> I think this is not a good idea. It has several issues.
>
> - First of all it is doing more stuff in first kernel. And that runs
> contrary to kdump design where we want to do stuff in second kernel.
> After a kernel crash, you can't trust running kernel's data structures.
> So to improve reliability just do minial stuff in crashed kernel and
> get out quickly.

I agree with you that the first kernel should do as little as possible.
Intuitively, filtering memory pages in the first kernel will harm the
reliability of kernel dump, but let's think it through:

1. It only relies on the memory management data structures that
makedumpfile also relies on, so there is no reliability degradation on this
point.

2. The filtering code itself is very simple and straightforward, and doesn't
depend on kernel functions too much. The current code calls
pgdat_resize_lock() and spin_lock_irqsave() for testing purposes in the
non-crash situation, and these can be removed safely in crash processing. It
may affect reliability, but only to a very limited extent.

3. Before the filtering code is called, machine_crash_shutdown() has been
executed, so all IRQs have been disabled and all other CPUs have been
halted. We only need to make sure the NMI from the watchdog has been disabled.
At that point we are on a separate stack, with no potential interrupts,
executing only a little piece of code with a very limited set of system
functions. Compared to the complicated functions executed previously, the
risks from the filtering code should be acceptable.

>
> - Secondly, it moves filetering policy in kernel. I think keeping it
> in user space gives us the extra flexibility.

It doesn't take any flexibility away from the user; it just adds another
possibility. I have added a flag to makedumpfile, so the user can decide to
filter memory pages with makedumpfile itself or just use the bitmap that came
from the first kernel.

>>
>> It introduces several problems:
>>
>> 1. Requires more memory to store memory bitmap on systems with large
>> amount of memory installed. And in capture kernel there is only a few
>> free memory available, it will cause an out of memory error and fail.
>> (Non-cyclic mode)
>
> makedumpfile requires 2bits per 4K page. That is 64MB per TB. In your
> patches also you are reserving 1bit per page and that is 32MB per TB
> in first kernel.
>
> So memory is anyway being reserved, just that makedumpfile seems to be
> needing this extra bit. Not sure if that can be optimized or not.

Yes, you are right. It's only a POC (proof of concept) implementation
currently. I can add an mmap interface to allow makedumpfile to access
the bitmap memory directly without reserving memory for it again.

>
> First of all 64MB per TB should not be a huge deal. And makedumpfile
> also has this cyclic mode where you process a map, discard it and then
> move on to next section. So memory usage remains constant at the expense
> of processing time.

Yes, that's true. But in cyclic mode, makedumpfile has to write/read the
bitmap to/from storage, which also impacts performance.
I have measured the penalty for cyclic mode at about a 70% slowdown. Maybe it
could be faster after mmap is implemented.

>
> Looks like now hpa and yinghai have done the work to be able to load
> kdump kernel above 4GB. I am assuming this also removes the restriction
> that we can only reserve 512MB or 896MB in second kernel. If that's
> the case, then I don't see why people can't get away with reserving
> 64MB per TB.

That's true. With kernel 3.9-rc1 and kexec-tools 2.0.4, the capture kernel
will have enough memory to run, and makedumpfile could always run in
non-cyclic mode. But we are still concerned about kernel dump performance
on systems with huge memory (above 4TB).

>>
>> 2. Scans all memory pages in makedumpfile is a very slow process. On
>> system with 1TB or more memory installed, the scanning process is very
>> long. Typically on 1TB idle system, it takes about 19 minutes. On system
>> with 4TB or more memory installed, it even doesn't work. To address the
>> out of memory issue on system with big memory (4TB or more memory
>> installed), makedumpfile v1.5.1 introduces a new cyclic mode. It only
>> scans a piece of memory pages each time, and do it cyclically to scan
>> all memory pages. But it runs more slowly, on 1TB system, takes about 33
>> minutes.
>
> One of the reasons it is slow because we don't support mmpa() interface.
> That means for every read, we map 4K page, flush TLB, read it, unmap it
> and flush TLB again. This is lot of processing overhead per 4K page.

Agreed. I have read the code, and even had a plan to implement mmap() for
the capture kernel, but I saw Hatayama had been working on it for a while,
so I decided to improve the page filtering part instead.

>
> Hatayama is now working on making mmap() interface and allow user space
> to bigger chunks of memory in one so. So that in one mmap() call we can
> map a bigger range instead of just 4K. And his numbers show that it
> has helped a lot.
>
> So instead of trying to move filtering logic in kernel, I think it
> might be better if we try to optimize things in makedumpfile or second
> kernel.

The kernel does have some abilities that user space doesn't. It's possible to
map the whole memory space of the first kernel into user space in the second
kernel, but the user space code has to re-implement parts of the kernel
memory management system. Worse, it's architecture dependent: the more
architectures are supported, the more code has to be implemented, and every
user space implementation must be kept in sync with the kernel
implementation. It may be called "flexibility", but it's painful to
maintain such code.

But if we scan memory in the first kernel, none of these problems exist
anymore. We just use the same logic for all architectures.

Users are still able to decide whether they want to filter memory in their
own way. We can treat this as an option to accelerate the kernel dump process.

Downtime is usually not a big deal for a personal user, but for some
mission critical systems, time really is money.

Again, here is a summary of the pros and cons of filtering memory pages in
the first kernel:
Pros:
1. Extremely fast.
2. Simple logic and code.
3. Moves architecture dependent code into the kernel, making user space code
simpler and easier to maintain.
Cons:
1. Reduces the reliability of kernel dump very slightly.
2. Occupies a little more memory in the current version, which can be
improved.

Thanks for your comments!

>
>>
>> 3. Scans memory pages code in makedumpfile is very complicated, without
>> kernel memory management related data structure, makedumpfile has to
>> build up its own data structure, and will not able to use some macros
>> that only be available in kernel (e.g. page_to_pfn), and has to use some
>> slow lookup algorithm instead.
>>
>> This patch introduces a new way to scan memory pages. It reserves a
>> piece of memory (1 bit for each page, 32MB per TB memory on x86 systems)
>> in the first kernel. During the kernel crash process, it scans all
>> memory pages, clear the bit for all excluded memory pages in the
>> reserved memory.
>
> I think this is not a good idea. It has several issues.
>
> - First of all it is doing more stuff in first kernel. And that runs
> contrary to kdump design where we want to do stuff in second kernel.
> After a kernel crash, you can't trust running kernel's data structures.
> So to improve reliability just do minial stuff in crashed kernel and
> get out quickly.
>
> - Secondly, it moves filetering policy in kernel. I think keeping it
> in user space gives us the extra flexibility.
>
>>
>
>> We have several benefits by this new approach:
>>
>> 1. It's extremely fast, on 1TB system only takes about 17.5 seconds to
>> scan all memory pages!
>>
>> 2. Reduces the memory requirement of makedumpfile by putting the
>> reserved memory in the first kernel memory space.
>
> Even the second kernel's memory comes from first kernel. So that really
> does not help.
>
> Thanks
> Vivek


--
Jingbai Ma ([email protected])

2013-03-08 10:16:55

by Jingbai Ma

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 03/07/2013 11:21 PM, Vivek Goyal wrote:
> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>> This patch intend to speedup the memory pages scanning process in
>> selective dump mode.
>>
>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
>> v1.5.3):
>>
>> Total scan Time
>> Original kernel
>> + makedumpfile v1.5.3 cyclic mode 1958.05 seconds
>> Original kernel
>> + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds
>> Patched kernel
>> + patched makedumpfile v1.5.3 17.50 seconds
>>
>> Traditionally, to reduce the size of dump file, dumper scans all memory
>> pages to exclude the unnecessary memory pages after capture kernel
>> booted, and scan it in userspace code (makedumpfile).
>
> I think this is not a good idea. It has several issues.
>
> - First of all it is doing more stuff in first kernel. And that runs
> contrary to kdump design where we want to do stuff in second kernel.
> After a kernel crash, you can't trust running kernel's data structures.
> So to improve reliability just do minial stuff in crashed kernel and
> get out quickly.

I agreed with you, the first kernel should do as less as possible.
Intuitively, filter memory pages in the first kernel will harm the
reliability of kernel dump, but let's think it thoroughly:

1. It only relies on the memory management data structure that
makedumpfile also relies on, so no any reliability degradation at this
point.

2. Filtering code itself is very simple and straightforward, doesn't
depend on kernel functions too much. Current code calls
pgdat_resize_lock() and spin_lock_irqsave() for testing purpose in
non-crash situation, and can be removed safely in crash processing. It
may affects reliability but very limit.

3. Before calling filtering code, the machine_crash_shutdown() has been
executed, so all IRQs have been disabled, all other CPUs have been
halted. We only need to make sure NMI from watchdog has been disabled here.
So far, we stay on a separate stack, no any potential interrupts here,
only executes a little piece of code with very limit system functions.
Compares to the complicated functions been executed previously, the
risks from the filtering code should be acceptable.

>
> - Secondly, it moves filetering policy in kernel. I think keeping it
> in user space gives us the extra flexibility.

It doesn't take any flexibility away from the user; it just adds another
possibility. I have added a flag to makedumpfile, so the user can decide
whether to filter memory pages in makedumpfile itself or just use the
bitmap that came from the first kernel.

>>
>> It introduces several problems:
>>
>> 1. Requires more memory to store memory bitmap on systems with large
>> amount of memory installed. And in capture kernel there is only a few
>> free memory available, it will cause an out of memory error and fail.
>> (Non-cyclic mode)
>
> makedumpfile requires 2bits per 4K page. That is 64MB per TB. In your
> patches also you are reserving 1bit per page and that is 32MB per TB
> in first kernel.
>
> So memory is anyway being reserved, just that makedumpfile seems to be
> needing this extra bit. Not sure if that can be optimized or not.

Yes, you are right. It's only a POC (proof of concept) implementation
currently. I can add an mmap interface to allow makedumpfile to access
the bitmap memory directly, without reserving memory for it again.

>
> First of all 64MB per TB should not be a huge deal. And makedumpfile
> also has this cyclic mode where you process a map, discard it and then
> move on to next section. So memory usage remains constant at the expense
> of processing time.

Yes, that's true. But in cyclic mode makedumpfile has to write/read the
bitmap to/from storage, which also hurts performance. I have measured
the penalty for cyclic mode at about a 70% slowdown. It may become
faster once mmap is implemented.

>
> Looks like now hpa and yinghai have done the work to be able to load
> kdump kernel above 4GB. I am assuming this also removes the restriction
> that we can only reserve 512MB or 896MB in second kernel. If that's
> the case, then I don't see why people can't get away with reserving
> 64MB per TB.

That's true. With kernel 3.9-rc1 and kexec-tools 2.0.4, the capture
kernel will have enough memory to run, and makedumpfile could always run
in non-cyclic mode. But we are still concerned about kernel dump
performance on systems with huge memory (above 4TB).

>>
>> 2. Scans all memory pages in makedumpfile is a very slow process. On
>> system with 1TB or more memory installed, the scanning process is very
>> long. Typically on 1TB idle system, it takes about 19 minutes. On system
>> with 4TB or more memory installed, it even doesn't work. To address the
>> out of memory issue on system with big memory (4TB or more memory
>> installed), makedumpfile v1.5.1 introduces a new cyclic mode. It only
>> scans a piece of memory pages each time, and do it cyclically to scan
>> all memory pages. But it runs more slowly, on 1TB system, takes about 33
>> minutes.
>
> One of the reasons it is slow because we don't support mmpa() interface.
> That means for every read, we map 4K page, flush TLB, read it, unmap it
> and flush TLB again. This is lot of processing overhead per 4K page.

Agreed. I have read the code, and I even had a plan to implement mmap()
for the capture kernel, but I saw Hatayama has been working on it for a
while. So I decided to improve the page filtering part instead.

>
> Hatayama is now working on making mmap() interface and allow user space
> to bigger chunks of memory in one so. So that in one mmap() call we can
> map a bigger range instead of just 4K. And his numbers show that it
> has helped a lot.
>
> So instead of trying to move filtering logic in kernel, I think it
> might be better if we try to optimize things in makedumpfile or second
> kernel.

The kernel does have some abilities that user space doesn't. It's
possible to map the whole memory space of the first kernel into user
space on the second kernel, but the user space code then has to
re-implement parts of the kernel memory management system. Worse, it's
architecture dependent: the more architectures are supported, the more
code has to be implemented, and every user space implementation must be
kept in sync with the kernel implementation. It may be called
"flexibility", but it's painful to maintain that code.

But if we scan memory in the first kernel, none of these problems exist.
We use the same logic for all architectures.

Users are still able to filter memory their own way if they want. We can
treat this as an option to accelerate the kernel dump process.

The downtime usually is not a big deal for personal users, but for some
mission-critical systems, time really is money.

Again, to summarize the pros and cons of filtering memory pages in the
first kernel:
Pros:
1. Extremely fast.
2. Simple logic and code.
3. Moves architecture dependent code into the kernel, making the user
space code simpler and easier to maintain.
Cons:
1. Reduces the reliability of the kernel dump very slightly.
2. Occupies a bit more memory in the current version; this can be
improved.

Thanks for your comments!

>
>>
>> 3. Scans memory pages code in makedumpfile is very complicated, without
>> kernel memory management related data structure, makedumpfile has to
>> build up its own data structure, and will not able to use some macros
>> that only be available in kernel (e.g. page_to_pfn), and has to use some
>> slow lookup algorithm instead.
>>
>> This patch introduces a new way to scan memory pages. It reserves a
>> piece of memory (1 bit for each page, 32MB per TB memory on x86 systems)
>> in the first kernel. During the kernel crash process, it scans all
>> memory pages, clear the bit for all excluded memory pages in the
>> reserved memory.
>
> I think this is not a good idea. It has several issues.
>
> - First of all it is doing more stuff in first kernel. And that runs
> contrary to kdump design where we want to do stuff in second kernel.
> After a kernel crash, you can't trust running kernel's data structures.
> So to improve reliability just do minial stuff in crashed kernel and
> get out quickly.
>
> - Secondly, it moves filetering policy in kernel. I think keeping it
> in user space gives us the extra flexibility.
>
>>
>
>> We have several benefits by this new approach:
>>
>> 1. It's extremely fast, on 1TB system only takes about 17.5 seconds to
>> scan all memory pages!
>>
>> 2. Reduces the memory requirement of makedumpfile by putting the
>> reserved memory in the first kernel memory space.
>
> Even the second kernel's memory comes from first kernel. So that really
> does not help.
>
> Thanks
> Vivek


--
Jingbai Ma ([email protected])

2013-03-08 10:34:15

by H. Peter Anvin

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 03/08/2013 02:06 AM, Jingbai Ma wrote:
>
> Kernel do have some abilities that user space haven't. It's possible to
> map whole memory space of the first kernel into user space on the second
> kernel. But the user space code has to re-implement some parts of the
> kernel memory management system again. And worse, it's architecture
> dependent, more architectures supported, more codes have to be
> implemented. All implementation in user space must be sync to kernel
> implementation. It's may called "flexibility", but it's painful to
> maintain the codes.
>

What? You are basically talking about /dev/mem... there is nothing
particularly magic about it at all.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2013-03-08 13:33:00

by Ma, Jingbai (Kingboard)

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 3/8/13 6:33 PM, "H. Peter Anvin" <[email protected]> wrote:


>On 03/08/2013 02:06 AM, Jingbai Ma wrote:
>>
>> Kernel do have some abilities that user space haven't. It's possible to
>> map whole memory space of the first kernel into user space on the second
>> kernel. But the user space code has to re-implement some parts of the
>> kernel memory management system again. And worse, it's architecture
>> dependent, more architectures supported, more codes have to be
>> implemented. All implementation in user space must be sync to kernel
>> implementation. It's may called "flexibility", but it's painful to
>> maintain the codes.
>>
>
>What? You are basically talking about /dev/mem... there is nothing
>particularly magic about it at all.

What we are talking about is filtering memory pages (AKA memory page
classification).
The makedumpfile (or any other dumper in user space) has to know the
exact memory layout of the memory management data structures; it is not
only architecture dependent, but may also vary between kernel releases.
At this point, /dev/mem doesn't give any help.
So IMHO, I would like to do it in the kernel, rather than keep tracking
those changes in user space code.

>
> -hpa
>
>--
>H. Peter Anvin, Intel Open Source Technology Center
>I work for Intel. I don't speak on their behalf.
>

2013-03-08 15:52:31

by Vivek Goyal

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On Thu, Mar 07, 2013 at 01:54:45PM -0800, Eric W. Biederman wrote:
> Vivek Goyal <[email protected]> writes:
>
> > On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
> >> This patch intend to speedup the memory pages scanning process in
> >> selective dump mode.
> >>
> >> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
> >> v1.5.3):
> >>
> >> Total scan Time
> >> Original kernel
> >> + makedumpfile v1.5.3 cyclic mode 1958.05 seconds
> >> Original kernel
> >> + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds
> >> Patched kernel
> >> + patched makedumpfile v1.5.3 17.50 seconds
> >>
> >> Traditionally, to reduce the size of dump file, dumper scans all memory
> >> pages to exclude the unnecessary memory pages after capture kernel
> >> booted, and scan it in userspace code (makedumpfile).
> >
> > I think this is not a good idea. It has several issues.
>
> Actually it does not appear to be doing any work in the first kernel.

Looks like patch3 in series is doing that.

machine_crash_shutdown(&fixed_regs);
+ generate_crash_dump_bitmap();
machine_kexec(kexec_crash_image);

So this bitmap seems to be being set just before transitioning into
second kernel.

I am sure you would not like this extra code in this path. :-)

Thanks
Vivek

2013-03-08 16:13:47

by Eric W. Biederman

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

"Ma, Jingbai (Kingboard)" <[email protected]> writes:

> On 3/8/13 6:33 PM, "H. Peter Anvin" <[email protected]> wrote:
>
>
>>On 03/08/2013 02:06 AM, Jingbai Ma wrote:
>>>
>>> Kernel do have some abilities that user space haven't. It's possible to
>>> map whole memory space of the first kernel into user space on the second
>>> kernel. But the user space code has to re-implement some parts of the
>>> kernel memory management system again. And worse, it's architecture
>>> dependent, more architectures supported, more codes have to be
>>> implemented. All implementation in user space must be sync to kernel
>>> implementation. It's may called "flexibility", but it's painful to
>>> maintain the codes.
>>>
>>
>>What? You are basically talking about /dev/mem... there is nothing
>>particularly magic about it at all.
>
> What we are talking about is filtering memory pages (AKA memory pages
> classification)
> The makedumpfile (or any other dumper in user space) has to know the
> exactly
> memory layout of the memory management data structures, it not only
> architecture dependent, but also may varies in different kernel release.
> At this point, /dev/mem doesn't give any help.
> So IMHO, I would like to do it in kernel, rather than So keep tracking
> changes in user space code.

But the fact is there is no requirement that the crash dump capture
kernel is the same version as the kernel that crashed. In fact it has
been common at some points in time to use slightly different build
options, or slightly different kernels. Say a 32bit PAE kernel to
capture a 64bit x86_64 kernel.

So in fact performing this work in the kernel is actively harmful to
reliability and maintenance, because it adds an incorrect assumption.

If you do want the benefit of shared maintenance with the kernel one
solution that has been suggested several times is to put code into
tools/makedumpfile (probably a library) that encapsulates the kernel
specific knowledge that can be loaded into the ramdisk when the
crashdump kernel is being loaded.

That would allow shared maintenance without breaking the possibility of
supporting mismatched kernel versions.

Eric

2013-03-08 16:19:21

by Eric W. Biederman

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

Vivek Goyal <[email protected]> writes:

> On Thu, Mar 07, 2013 at 01:54:45PM -0800, Eric W. Biederman wrote:
>> Vivek Goyal <[email protected]> writes:
>> > I think this is not a good idea. It has several issues.
>>
>> Actually it does not appear to be doing any work in the first kernel.
>
> Looks like patch3 in series is doing that.
>
> machine_crash_shutdown(&fixed_regs);
> + generate_crash_dump_bitmap();
> machine_kexec(kexec_crash_image);
>
> So this bitmap seems to be being set just before transitioning into
> second kernel.
>
> I am sure you would not like this extra code in this path. :-)

Ouch! I had totally missed that. No that is not at all acceptable.

I was blind that day. The only call I saw was the one in patch 4 that
generated the bitmap in the new proc file.

Eric

2013-03-08 16:19:27

by Vivek Goyal

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On Fri, Mar 08, 2013 at 06:06:31PM +0800, Jingbai Ma wrote:

[..]
> >- First of all it is doing more stuff in first kernel. And that runs
> > contrary to kdump design where we want to do stuff in second kernel.
> > After a kernel crash, you can't trust running kernel's data structures.
> > So to improve reliability just do minial stuff in crashed kernel and
> > get out quickly.
>
> I agreed with you, the first kernel should do as less as possible.
> Intuitively, filter memory pages in the first kernel will harm the
> reliability of kernel dump, but let's think it thoroughly:
>
> 1. It only relies on the memory management data structure that
> makedumpfile also relies on, so no any reliability degradation at
> this point.

It's not the same. If there is something wrong with the memory
management data structures, you can panic() again, deadlock yourself,
and never even transition to the second kernel.

With makedumpfile, if something is wrong, either we will save wrong
bits or get segmentation fault. But one can still try to be careful
or save whole dump and try to get specific pieces out.

So it is not an apples-to-apples comparison.

[..]
> >Looks like now hpa and yinghai have done the work to be able to load
> >kdump kernel above 4GB. I am assuming this also removes the restriction
> >that we can only reserve 512MB or 896MB in second kernel. If that's
> >the case, then I don't see why people can't get away with reserving
> >64MB per TB.
>
> That's true. With kernel 3.9-rc1 with kexec-tools 2.0.4, capture
> kernel will have enough memory to run. And makedumpfile could be
> always run at non-cyclic mode, but we still concern about the kernel
> dump performance on systems with huge memory (above 4TB).

I would think let's first try to make mmap() on /proc/vmcore work and
optimize makedumpfile to make use of it, then see whether performance is
acceptable on large machines, and take it from there.

Thanks
Vivek

2013-03-09 04:31:26

by Hatayama, Daisuke

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

From: Jingbai Ma <[email protected]>
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
Date: Fri, 8 Mar 2013 18:06:31 +0800

> On 03/07/2013 11:21 PM, Vivek Goyal wrote:
>> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
...
>> First of all 64MB per TB should not be a huge deal. And makedumpfile
>> also has this cyclic mode where you process a map, discard it and then
>> move on to next section. So memory usage remains constant at the
>> expense
>> of processing time.
>
> Yes, that's true. But in cyclic mode, makedumpfile will have to
> write/read bitmap from storage, it will also impact the performance.
> I have measured the penalty for cyclic mode is about 70%
> slowdown. Maybe could be faster after mmap implemented.

I guess the slowdown comes from the fact that enough VMCOREINFO was not
provided by the kernel, so unnecessary filtering processing for free
pages is done multiple times.

For example, confirm how filtering is done in your environment like
this:

$ makedumpfile --message-level 16 # 16 is report message
makedumpfile: map_size = 4
sadump: does not have partition header
...
pfn_end : 880000
Can't select page_is_buddy handler; follow free lists instead of mem_map array.
STEP [Excluding free pages ] : 0.431724 seconds
STEP [Excluding unnecessary pages] : 1.052160 seconds

Here each STEP [..] line occurs once per cycle in cyclic mode. If the
STEP [Excluding free pages] line occurs multiple times in the log, that
is what causes the slowdown in your environment. (free_list doesn't keep
its elements sorted in pfn order, so we have to iterate over the whole
free_list in each cycle...; in the worst case, just after system boot,
that can amount to close to the whole memory size.)

To use the mem_map array logic, VMCOREINFO needs to have the corresponding
information to refer to related data structures. The patch is

commit 8d67091ec6ae98ca67f77990ef9e9ec21337f077
Author: Atsushi Kumagai <[email protected]>
Date: Wed Feb 27 17:03:25 2013 -0800

kexec: add the values related to buddy system for filtering free pages.

and it has been merged in 3.9-rc1.

$ git describe 8d67091ec6ae98ca67f77990ef9e9ec21337f077
v3.8-9443-g8d67091

Or you can edit VMCOREINFO manually and specify it to makedumpfile as:

1. generate vmcoreinfo from vmlinux

makedumpfile -x vmlinux -g vmcoreinfo.txt

2. Add the following values in the generated vmcoreinfo.txt

- 3.1, 3.4, 3.8.x
NUMBER(PG_slab)=7
SIZE(pageflags)=4
OFFSET(page._mapcount)=24
OFFSET(page.private)=48
NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)=-128

- 2.6.38
SIZE(pageflags)=4
OFFSET(page._mapcount)=12
OFFSET(page.private)=16
NUMBER(PG_slab)=7
NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE)=-2

- 2.6.32
NUMBER(PG_slab)=7
NUMBER(PG_buddy)=19
OFFSET(page._mapcount)=12
OFFSET(page.private)=16
SIZE(pageflags)=4

- 2.6.18
NUMBER(PG_slab)=7
NUMBER(PG_buddy)=19
OFFSET(page._mapcount)=12
OFFSET(page.private)=16

3. Specify the vmcoreinfo.txt to makedumpfile via -i option

makedumpfile -i vmcoreinfo.txt [-c|-l|-p] -d 31 /proc/vmcore dumpfile

Anyway, please help with the benchmarking. I'll CC you too.

Thanks.
HATAYAMA, Daisuke

2013-03-11 08:18:07

by Jingbai Ma

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 03/08/2013 11:52 PM, Vivek Goyal wrote:
> On Thu, Mar 07, 2013 at 01:54:45PM -0800, Eric W. Biederman wrote:
>> Vivek Goyal<[email protected]> writes:
>>
>>> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>>>> This patch intend to speedup the memory pages scanning process in
>>>> selective dump mode.
>>>>
>>>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
>>>> v1.5.3):
>>>>
>>>> Total scan Time
>>>> Original kernel
>>>> + makedumpfile v1.5.3 cyclic mode 1958.05 seconds
>>>> Original kernel
>>>> + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds
>>>> Patched kernel
>>>> + patched makedumpfile v1.5.3 17.50 seconds
>>>>
>>>> Traditionally, to reduce the size of dump file, dumper scans all memory
>>>> pages to exclude the unnecessary memory pages after capture kernel
>>>> booted, and scan it in userspace code (makedumpfile).
>>>
>>> I think this is not a good idea. It has several issues.
>>
>> Actually it does not appear to be doing any work in the first kernel.
>
> Looks like patch3 in series is doing that.
>
> machine_crash_shutdown(&fixed_regs);
> + generate_crash_dump_bitmap();
> machine_kexec(kexec_crash_image);
>
> So this bitmap seems to be being set just before transitioning into
> second kernel.
>
> I am sure you would not like this extra code in this path. :-)

I thought this function's code was pretty simple and could be called
here safely.
If it's not proper here, how about before the call to
machine_crash_shutdown(&fixed_regs)?
Furthermore, could you explain the real risks of executing more code here?

Thanks!

>
> Thanks
> Vivek


--
Jingbai Ma ([email protected])

2013-03-11 08:31:29

by Jingbai Ma

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 03/09/2013 12:13 AM, Eric W. Biederman wrote:
> "Ma, Jingbai (Kingboard)"<[email protected]> writes:
>
>> On 3/8/13 6:33 PM, "H. Peter Anvin"<[email protected]> wrote:
>>
>>
>>> On 03/08/2013 02:06 AM, Jingbai Ma wrote:
>>>>
>>>> Kernel do have some abilities that user space haven't. It's possible to
>>>> map whole memory space of the first kernel into user space on the second
>>>> kernel. But the user space code has to re-implement some parts of the
>>>> kernel memory management system again. And worse, it's architecture
>>>> dependent, more architectures supported, more codes have to be
>>>> implemented. All implementation in user space must be sync to kernel
>>>> implementation. It's may called "flexibility", but it's painful to
>>>> maintain the codes.
>>>>
>>>
>>> What? You are basically talking about /dev/mem... there is nothing
>>> particularly magic about it at all.
>>
>> What we are talking about is filtering memory pages (AKA memory pages
>> classification)
>> The makedumpfile (or any other dumper in user space) has to know the
>> exactly
>> memory layout of the memory management data structures, it not only
>> architecture dependent, but also may varies in different kernel release.
>> At this point, /dev/mem doesn't give any help.
>> So IMHO, I would like to do it in kernel, rather than So keep tracking
>> changes in user space code.
>
> But the fact is there is no requirment that the crash dump capture
> kernel is the same version as the kernel that crashed. In fact it has
> been common at some points in time to use slightly different build
> options, or slightly different kernels. Say a 32bit PAE kernel to
> capture a 64bit x86_64 kernel.

The filtering code will be executed in the first kernel, so this problem
will not exist.

>
> So in fact performing this work in the kernel and is actively harmful to
> reliability and maintenance because it adds an incorrect assumption.
>
> If you do want the benefit of shared maintenance with the kernel one
> solution that has been suggested several times is to put code into
> tools/makedumpfile (probably a library) that encapsulates the kernel
> specific knowledge that can be loaded into the ramdisk when the
> crahsdump kernel is being loaded.
>
> That would allow shared maintenance along without breaking the
> possibility of supporting kernel versions.

Yes, you are right. But it requires significant changes to makedumpfile,
and if we also want to share the code with the kernel memory management
subsystem, I believe that's not an easy job (at least to my limited
kernel knowledge).

>
> Eric


--
Jingbai Ma ([email protected])

2013-03-11 08:53:56

by Jingbai Ma

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 03/09/2013 12:19 AM, Vivek Goyal wrote:
> On Fri, Mar 08, 2013 at 06:06:31PM +0800, Jingbai Ma wrote:
>
> [..]
>>> - First of all it is doing more stuff in first kernel. And that runs
>>> contrary to kdump design where we want to do stuff in second kernel.
>>> After a kernel crash, you can't trust running kernel's data structures.
>>> So to improve reliability just do minial stuff in crashed kernel and
>>> get out quickly.
>>
>> I agreed with you, the first kernel should do as less as possible.
>> Intuitively, filter memory pages in the first kernel will harm the
>> reliability of kernel dump, but let's think it thoroughly:
>>
>> 1. It only relies on the memory management data structure that
>> makedumpfile also relies on, so no any reliability degradation at
>> this point.
>
> Its not same. If there is something wrong with memory management
> data structures, you can panic() again and self lock yourself and
> never even transition to the second kernel.
>
> With makedumpfile, if something is wrong, either we will save wrong
> bits or get segmentation fault. But one can still try to be careful
> or save whole dump and try to get specific pieces out.
>
> So it it is not apples to apples comparison.
>

Understood, a double panic() does harm reliability. But considering the
chance of panicking inside the memory filtering code, it shouldn't
increase the risk very much.
If the filtering code panicked, I doubt the second kernel could have
booted up normally even without it.

> [..]
>>> Looks like now hpa and yinghai have done the work to be able to load
>>> kdump kernel above 4GB. I am assuming this also removes the restriction
>>> that we can only reserve 512MB or 896MB in second kernel. If that's
>>> the case, then I don't see why people can't get away with reserving
>>> 64MB per TB.
>>
>> That's true. With kernel 3.9-rc1 with kexec-tools 2.0.4, capture
>> kernel will have enough memory to run. And makedumpfile could be
>> always run at non-cyclic mode, but we still concern about the kernel
>> dump performance on systems with huge memory (above 4TB).
>
> I would think that lets first try to make mmap() on /proc/vmcore work and
> optimize makefumpfile to make use of it and then see if performance is
> acceptable or not on large machines. And then take it from there.

Sure, you are right. I'm going to test the mmap() solution first; if it
doesn't meet the performance requirement on large machines, we will
still need a solution here.

Thanks!

>
> Thanks
> Vivek


--
Jingbai Ma ([email protected])

2013-03-11 09:02:47

by Jingbai Ma

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 03/09/2013 12:31 PM, HATAYAMA Daisuke wrote:
> From: Jingbai Ma<[email protected]>
> Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process
> Date: Fri, 8 Mar 2013 18:06:31 +0800
>
>> On 03/07/2013 11:21 PM, Vivek Goyal wrote:
>>> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
> ...
>>> First of all 64MB per TB should not be a huge deal. And makedumpfile
>>> also has this cyclic mode where you process a map, discard it and then
>>> move on to next section. So memory usage remains constant at the
>>> expense
>>> of processing time.
>>
>> Yes, that's true. But in cyclic mode, makedumpfile will have to
>> write/read bitmap from storage, it will also impact the performance.
>> I have measured the penalty for cyclic mode is about 70%
>> slowdown. Maybe could be faster after mmap implemented.
>
> I guess the slowdown came from the issue that enough VMCOREINFO was
> not provided from the kernel, and unnecessary filtering processing for
> free pages is done multiple times.

Thanks for your comments! It would be very helpful.
I will test it on the machine again.

--
Jingbai Ma ([email protected])

2013-03-11 09:43:11

by Eric W. Biederman

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

Jingbai Ma <[email protected]> writes:

> On 03/08/2013 11:52 PM, Vivek Goyal wrote:
>> On Thu, Mar 07, 2013 at 01:54:45PM -0800, Eric W. Biederman wrote:
>>> Vivek Goyal<[email protected]> writes:
>>>
>>>> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>>>>> This patch intend to speedup the memory pages scanning process in
>>>>> selective dump mode.
>>>>>
>>>>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
>>>>> v1.5.3):
>>>>>
>>>>> Total scan Time
>>>>> Original kernel
>>>>> + makedumpfile v1.5.3 cyclic mode 1958.05 seconds
>>>>> Original kernel
>>>>> + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds
>>>>> Patched kernel
>>>>> + patched makedumpfile v1.5.3 17.50 seconds
>>>>>
>>>>> Traditionally, to reduce the size of dump file, dumper scans all memory
>>>>> pages to exclude the unnecessary memory pages after capture kernel
>>>>> booted, and scan it in userspace code (makedumpfile).
>>>>
>>>> I think this is not a good idea. It has several issues.
>>>
>>> Actually it does not appear to be doing any work in the first kernel.
>>
>> Looks like patch3 in series is doing that.
>>
>> machine_crash_shutdown(&fixed_regs);
>> + generate_crash_dump_bitmap();
>> machine_kexec(kexec_crash_image);
>>
>> So this bitmap seems to be being set just before transitioning into
>> second kernel.
>>
>> I am sure you would not like this extra code in this path. :-)
>
> I was thought this function code is pretty simple, could be called
> here safely.
> If it's not proper for here, how about before the function
> machine_crash_shutdown(&fixed_regs)?
> Furthermore, could you explain the real risks to execute more codes here?

The kernel is known bad. What is bad is unclear.
Executing any extra code is a bad idea.

The history here is that before kexec-on-panic there were lots of dump
routines that did all of the crashdump logic in the kernel before they
shutdown. They all worked beautifully during development, and on
developers test machines and were absolutely worthless in real world
situations.

A piece of code that walks all of the page tables is most definitely
opening itself up to all kinds of failure situations I can't even
imagine.

The only way that it would be ok to do this would be to maintain the
bitmap in real time with the existing page table maintenance code,
and that would only be ok if it did not add a performance penalty.

Every once in a great while there is a new cpu architecture feature
we need to deal with, but otherwise the only thing that is ok to
do on that code path is to reduce it until it much more closely
resembles the glorified jump instruction that it really is.

Speaking of which, have you given this code any coverage testing with lkdtm?

Eric

2013-03-12 10:05:17

by Jingbai Ma

Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

On 03/11/2013 05:42 PM, Eric W. Biederman wrote:
> Jingbai Ma<[email protected]> writes:
>
>> On 03/08/2013 11:52 PM, Vivek Goyal wrote:
>>> On Thu, Mar 07, 2013 at 01:54:45PM -0800, Eric W. Biederman wrote:
>>>> Vivek Goyal<[email protected]> writes:
>>>>
>>>>> On Thu, Mar 07, 2013 at 10:58:18PM +0800, Jingbai Ma wrote:
>>>>>> This patch intend to speedup the memory pages scanning process in
>>>>>> selective dump mode.
>>>>>>
>>>>>> Test result (On HP ProLiant DL980 G7 with 1TB RAM, makedumpfile
>>>>>> v1.5.3):
>>>>>>
>>>>>> Total scan Time
>>>>>> Original kernel
>>>>>> + makedumpfile v1.5.3 cyclic mode 1958.05 seconds
>>>>>> Original kernel
>>>>>> + makedumpfile v1.5.3 non-cyclic mode 1151.50 seconds
>>>>>> Patched kernel
>>>>>> + patched makedumpfile v1.5.3 17.50 seconds
>>>>>>
>>>>>> Traditionally, to reduce the size of dump file, dumper scans all memory
>>>>>> pages to exclude the unnecessary memory pages after capture kernel
>>>>>> booted, and scan it in userspace code (makedumpfile).
>>>>>
>>>>> I think this is not a good idea. It has several issues.
>>>>
>>>> Actually it does not appear to be doing any work in the first kernel.
>>>
>>> Looks like patch3 in series is doing that.
>>>
>>> machine_crash_shutdown(&fixed_regs);
>>> + generate_crash_dump_bitmap();
>>> machine_kexec(kexec_crash_image);
>>>
>>> So this bitmap seems to be being set just before transitioning into
>>> second kernel.
>>>
>>> I am sure you would not like this extra code in this path. :-)
>>
>> I thought this function's code was simple enough to be called here
>> safely.
>> If this is not the proper place, how about calling it before
>> machine_crash_shutdown(&fixed_regs)?
>> Furthermore, could you explain the real risks of executing more code here?
>
> The kernel is known bad. What is bad is unclear.
> Executing any extra code is a bad idea.
>
> The history here is that before kexec-on-panic there were lots of dump
> routines that did all of the crashdump logic in the kernel before they
> shutdown. They all worked beautifully during development, and on
> developers test machines and were absolutely worthless in real world
> situations.

I have also learned from the old-style kernel dump routines. Yes, they do
have problems in real-world situations. The primary problems come from I/O
operations (disk writes/network sends) and invalid page tables.

>
> A piece of code that walks all of the page tables is most definitely
> opening itself up to all kinds of failure situations I can't even
> imagine.

Agreed, an invalid page table would cause disaster.
But even in the capture kernel with a userspace program, it would likely
only cause a core dump; the user still has a chance to dump the crashed
system with some special tools. That is possible, but should be very rare
in the real world, and I doubt many users would be able to handle such a
situation.
So in most cases, if the page tables are corrupted and a dump cannot be
taken normally, the user would rather reboot the system directly.

>
> The only way that it would be ok to do this would be to maintain the
> bitmap in real time with the existing page table maintenance code,
> and that would only be ok if it did not add a performance penalty.

I also have a prototype that traces page table changes in real time, but I
have not measured the performance penalty yet. I will test it again when I
have time.

>
> Every once in a great while there is a new cpu architecture feature
> we need to deal with, but otherwise the only thing that is ok to
> do on that code path is to reduce it until it much more closely
> resembles the glorified jump instruction that it really is.

Agreed. But if we can find a solution that can be proven as robust as a
jump, that may be acceptable.

>
> Speaking of which, have you given this code any coverage testing with lkdtm?

Not yet, but I will test it with lkdtm.
Before that, I would like to try the mmap() solution first.
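For reference, a typical lkdtm coverage run uses the debugfs
provoke-crash interface to trigger crash points and verify the whole
kdump path end to end; crash point names come from the lkdtm
documentation, and each one panics the machine, so run them one reboot
at a time:

```shell
# Mount debugfs if it is not already mounted.
mount -t debugfs none /sys/kernel/debug

# Trigger one crash point; after the panic, verify that the capture
# kernel boots and the dump (and the in-kernel bitmap) is correct.
echo PANIC > /sys/kernel/debug/provoke-crash/DIRECT

# On subsequent runs, exercise other crash points, e.g.:
#   echo BUG           > /sys/kernel/debug/provoke-crash/DIRECT
#   echo EXCEPTION     > /sys/kernel/debug/provoke-crash/DIRECT
#   echo CORRUPT_STACK > /sys/kernel/debug/provoke-crash/DIRECT
```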

Thanks for your very valuable comments; they helped me a lot!

>
> Eric
>


--
Jingbai Ma ([email protected])

2013-03-12 19:48:25

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH 0/5] crash dump bitmap: scan memory pages in kernel to speedup kernel dump process

Jingbai Ma <[email protected]> writes:

> On 03/11/2013 05:42 PM, Eric W. Biederman wrote:

>> The kernel is known bad. What is bad is unclear.
>> Executing any extra code is a bad idea.
>>
>> The history here is that before kexec-on-panic there were lots of dump
>> routines that did all of the crashdump logic in the kernel before they
>> shutdown. They all worked beautifully during development, and on
>> developers test machines and were absolutely worthless in real world
>> situations.
>
> I have also learned from the old-style kernel dump routines. Yes, they do
> have problems in real-world situations. The primary problems come from I/O
> operations (disk writes/network sends) and invalid page tables.

That and they provide no guarantee that you won't write corrupt data.

The kdump based solutions fail safe. The worst that will happen is that
you don't take a crash dump. There is no chance of corrupting your
system.


>> A piece of code that walks all of the page tables is most definitely
>> opening itself up to all kinds of failure situations I can't even
>> imagine.
>
> Agreed, an invalid page table would cause disaster.
> But even in the capture kernel with a userspace program, it would likely
> only cause a core dump; the user still has a chance to dump the crashed
> system with some special tools. That is possible, but should be very rare
> in the real world, and I doubt many users would be able to handle such a
> situation.
> So in most cases, if the page tables are corrupted and a dump cannot be
> taken normally, the user would rather reboot the system directly.

For whatever it is worth, corrupted page tables, or at least page tables
that the cpu thinks are invalid, are one of the crash scenarios that I
have seen in the wild on enough occasions to remember them.

Eric