2012-05-25 09:30:11

by YOSHIDA Masanori

Subject: [RFC PATCH 0/4 V2] introduce: livedump

Changes in V2:
- A few more comments have been added.
- Operation using tools/livedump/livedump is simplified.
- The previous 5 patches are rearranged into 4 patches.
([3/5] and [4/5] are merged)
- The patchset is rebased onto v3.4.
- crash-6.0.6 is required (which was 6.0.1 previously).


The following series introduces a new memory dumping mechanism, Live Dump,
which lets users obtain a consistent memory dump without stopping a running
system.

Such a mechanism is useful especially in the case where very important
systems are consolidated onto a single machine via virtualization.
Assuming a KVM host runs multiple important VMs on it and one of them
fails, the other VMs have to keep running. However, at the same time, an
administrator may want to obtain a memory dump of not only the failed guest
but also the host, because the cause of the failure may lie not in the
guest but in the host or the hardware under it.

Live Dump is based on the copy-on-write technique. Processing is basically
performed in the following order.
(1) Suspends processing of all CPUs.
(2) Makes pages (which you want to dump) read-only.
(3) Resumes all CPUs
(4) On page fault, dumps a page including a fault address.
(5) Finally, dumps the rest of pages that are not updated.
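
As a rough illustration of the order above, here is a user-space toy model
in Python. It is not kernel code: "pages" are plain list elements, a "page
fault" is an ordinary method call, and the class and method names are made
up for this sketch only.

```python
class LiveDumpModel:
    """Toy copy-on-write snapshot of a list of 'pages' (plain Python values)."""

    def __init__(self, memory):
        self.memory = memory      # live memory, keeps changing
        self.protected = set()    # page numbers that are still read-only
        self.dump = {}            # page number -> content at snapshot time

    def start(self):
        # Steps (1)-(3): with all CPUs stopped, mark every page read-only.
        self.protected = set(range(len(self.memory)))

    def write(self, pfn, value):
        # Step (4): a write to a protected page "faults". Save the old
        # content first, then drop the protection and let the write proceed.
        if pfn in self.protected:
            self.dump[pfn] = self.memory[pfn]
            self.protected.discard(pfn)
        self.memory[pfn] = value

    def sweep(self):
        # Step (5): dump the pages that were never written to.
        for pfn in sorted(self.protected):
            self.dump[pfn] = self.memory[pfn]
        self.protected.clear()
        return [self.dump[pfn] for pfn in range(len(self.memory))]


memory = ["a", "b", "c", "d"]
model = LiveDumpModel(memory)
model.start()
model.write(1, "B")       # both writes happen after the snapshot point
model.write(3, "D")
snapshot = model.sweep()
print(snapshot)           # content as of start(), not as of sweep()
print(memory)             # live memory, including the later writes
```

The point of the model is that the dump reflects memory as of start() even
though writes continue until sweep() finishes.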

Currently, Live Dump is just a simple prototype and it has many
limitations. I list the important ones below.
(1) It write-protects only the kernel's straight mapping (direct mapping)
area. Therefore, memory updates made through vmap areas or from user space
do not cause page faults, and such updates are not dumped consistently.
(2) It supports only x86-64 architecture.
(3) It can only handle 4K pages. As we know, most pages in kernel space are
mapped via 2M or 1G large page mapping. Therefore, the current
implementation of Live Dump splits all large pages into 4K pages before
setting up write protection.
(4) It allocates about 50% of physical RAM to store dumped pages. Currently
Live Dump first saves all dumped data in memory, and only after that can a
user access the dumped data. Live Dump itself has no feature to save dumped
data onto a disk or any other storage device.

This series consists of 4 patches.

The 1st patch adds a notifier call chain to do_page_fault. This is the only
modification to an existing code path of the upstream kernel.

The 2nd patch introduces "livedump" misc device.

The 3rd patch introduces write protection management. This enables users
to turn on write protection for kernel space and to install a hook function
that is called every time a page fault occurs on a protected page.

The last patch introduces the memory dumping feature. This patch installs
the function that dumps the content of a protected page on page fault. It
also lets users access the dumped data via the misc device interface.


***How to test***
To test this patchset, you have to apply the attached patch to the source
code of crash[1]. The patch applies to version 6.0.6 of crash. In addition,
you have to configure your kernel with CONFIG_DEBUG_INFO turned on.

[1]crash, http://people.redhat.com/anderson/crash-6.0.6.tar.gz

First, run the script tools/livedump/livedump as follows.
# livedump dump

At this point, the whole memory image has been saved (in memory as well).
Then you can analyze the image by running the patched crash as follows.
# crash /dev/livedump /boot/System.map /boot/vmlinux.o

The following command releases all resources held by livedump.
# livedump release
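
For reference, the two subcommands above map onto ioctl calls as in the
tools/livedump/livedump script of this series. The Python sketch below only
computes the command numbers and does not open the device. It assumes the
usual x86 encoding of _IO() (direction and size fields are 0) and that the
kernel side numbers the commands 1, 2, 100 and 101 through LIVEDUMP_IOC(x),
which matches the constants in the script. The plan() helper is
hypothetical.

```python
def _IO(ioc_type, nr):
    # For _IO(), the direction (_IOC_NONE) and size fields are 0 on x86,
    # so only the type byte and the command number remain.
    return (ioc_type << 8) | nr


def LIVEDUMP_IOC(x):
    # Mirrors "#define LIVEDUMP_IOC(x) _IO(0xff, x)" from kernel/livedump.c.
    return _IO(0xFF, x)


# Command numbers as used by tools/livedump/livedump.
IOCTLS = {
    "start": LIVEDUMP_IOC(1),
    "sweep": LIVEDUMP_IOC(2),
    "init": LIVEDUMP_IOC(100),
    "uninit": LIVEDUMP_IOC(101),
}

# "dump" and "release" are shorthands for sequences of raw commands.
SUBCOMMANDS = {
    "dump": ["init", "start", "sweep"],
    "release": ["uninit"],
}


def plan(subcommand):
    """Return the (name, ioctl number) pairs issued for a subcommand."""
    names = SUBCOMMANDS.get(subcommand, [subcommand])
    return [(name, IOCTLS[name]) for name in names]


for name, cmd in plan("dump"):
    print(name, hex(cmd))
```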

---

YOSHIDA Masanori (4):
livedump: Add memory dumping functionality
livedump: Add write protection management
livedump: Add the new misc device "livedump"
livedump: Add notifier-call-chain into do_page_fault


arch/x86/Kconfig | 29 ++
arch/x86/include/asm/traps.h | 2
arch/x86/include/asm/wrprotect.h | 47 +++
arch/x86/mm/Makefile | 2
arch/x86/mm/fault.c | 7
arch/x86/mm/wrprotect.c | 618 ++++++++++++++++++++++++++++++++++++++
kernel/Makefile | 1
kernel/livedump-memdump.c | 237 +++++++++++++++
kernel/livedump-memdump.h | 45 +++
kernel/livedump.c | 129 ++++++++
tools/livedump/livedump | 28 ++
11 files changed, 1145 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/wrprotect.h
create mode 100644 arch/x86/mm/wrprotect.c
create mode 100644 kernel/livedump-memdump.c
create mode 100644 kernel/livedump-memdump.h
create mode 100644 kernel/livedump.c
create mode 100755 tools/livedump/livedump



Attachments:
(No filename) (4.53 kB)
crash-6.0.6-livedump.patch (1.67 kB)

2012-05-25 09:20:15

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/4 V2] livedump: Add notifier-call-chain into do_page_fault

On Fri, 2012-05-25 at 18:12 +0900, YOSHIDA Masanori wrote:
>
> This patch adds notifier-call-chain that is called in do_page_fault.
> Livedump uses this to check if page fault is caused by livedump, and if so,
> the fault is handled by livedump's handler function. Otherwise, it is
> handled by the original page fault handler.

No, please no notifiers..


2012-05-25 09:25:57

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/4 V2] introduce: livedump

On Fri, 2012-05-25 at 18:12 +0900, YOSHIDA Masanori wrote:
> Live Dump is based on Copy-on-write technique. Basically processing is
> performed in the following order.
> (1) Suspends processing of all CPUs.
> (2) Makes pages (which you want to dump) read-only.
> (3) Resumes all CPUs
> (4) On page fault, dumps a page including a fault address.

Suppose a PF is in progress when all this happens, you mark all RO, then
an NMI happens, from the NMI context we'll generate another PF to update
a vmap area, this will again PF because you mucked about and marked
things RO.

You're now at 3 PFs, which is instant reboot.

I don't think this is going to work.

2012-05-25 09:30:08

by YOSHIDA Masanori

Subject: [RFC PATCH 1/4 V2] livedump: Add notifier-call-chain into do_page_fault

This patch adds notifier-call-chain that is called in do_page_fault.
Livedump uses this to check if page fault is caused by livedump, and if so,
the fault is handled by livedump's handler function. Otherwise, it is
handled by the original page fault handler.
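
The dispatch described above can be modelled in user space as follows. This
is only a sketch in Python, not the kernel notifier API: the NotifierChain
class, the dictionary used for "regs" and the set of protected page numbers
are inventions of the example. The NOTIFY_* values follow
include/linux/notifier.h.

```python
NOTIFY_DONE = 0x0000
NOTIFY_OK = 0x0001
NOTIFY_STOP_MASK = 0x8000
NOTIFY_STOP = NOTIFY_OK | NOTIFY_STOP_MASK


class NotifierChain:
    """Minimal stand-in for an atomic notifier chain."""

    def __init__(self):
        self.callbacks = []

    def register(self, callback):
        self.callbacks.append(callback)

    def call_chain(self, error_code, regs):
        ret = NOTIFY_DONE
        for callback in self.callbacks:
            ret = callback(error_code, regs)
            if ret & NOTIFY_STOP_MASK:
                break             # a callback claimed the event
        return ret


protected_pfns = {5, 6}           # pages write-protected by "livedump"


def livedump_fault_notifier(error_code, regs):
    # Claim the fault only if it hit a page that livedump protected.
    if regs["pfn"] in protected_pfns:
        protected_pfns.discard(regs["pfn"])   # unprotect after handling
        return NOTIFY_STOP
    return NOTIFY_DONE


def do_page_fault(chain, error_code, regs):
    if chain.call_chain(error_code, regs) == NOTIFY_STOP:
        return "handled by livedump"
    return "handled by the original handler"


chain = NotifierChain()
chain.register(livedump_fault_notifier)
print(do_page_fault(chain, 0x3, {"pfn": 5}))
print(do_page_fault(chain, 0x3, {"pfn": 9}))
```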

Signed-off-by: YOSHIDA Masanori <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Andy Lutomirski <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: [email protected]
---

arch/x86/include/asm/traps.h | 2 ++
arch/x86/mm/fault.c | 7 +++++++
2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 88eae2a..dcd5318 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -114,4 +114,6 @@ enum {
X86_TRAP_IRET = 32, /* 32, IRET Exception */
};

+extern struct atomic_notifier_head page_fault_notifier_list;
+
#endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 3ecfd1a..f7460e2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -995,6 +995,8 @@ static int fault_in_kernel_space(unsigned long address)
return address >= TASK_SIZE_MAX;
}

+ATOMIC_NOTIFIER_HEAD(page_fault_notifier_list);
+
/*
* This routine handles page faults. It determines the address,
* and the problem, and then passes it off to one of the appropriate
@@ -1018,6 +1020,11 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
/* Get the faulting address: */
address = read_cr2();

+ if (atomic_notifier_call_chain(
+ &page_fault_notifier_list, error_code, regs)
+ == NOTIFY_STOP)
+ return;
+
/*
* Detect and handle instructions that would cause a page fault for
* both a tracked kernel page and a userspace page.

2012-05-25 09:31:31

by YOSHIDA Masanori

Subject: [RFC PATCH 2/4 V2] livedump: Add the new misc device "livedump"

This patch introduces the new misc device "livedump".
This device will be used as the interface between livedump and user space.
Right now, the device has only an empty ioctl operation.

***ATTENTION PLEASE***
I think debugfs is more suitable for this feature, but currently livedump
uses the misc device for simplicity. This will be fixed in the future.

Signed-off-by: YOSHIDA Masanori <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Kevin Hilman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---

arch/x86/Kconfig | 15 ++++++++++
kernel/Makefile | 1 +
kernel/livedump.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 99 insertions(+), 0 deletions(-)
create mode 100644 kernel/livedump.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c9866b0..4c97583 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1729,6 +1729,21 @@ config CMDLINE_OVERRIDE
This is used to work around broken boot loaders. This should
be set to 'N' under normal conditions.

+config LIVEDUMP
+ bool "Live Dump support"
+ depends on X86_64
+ ---help---
+ Set this option to 'Y' to allow the kernel support to acquire
+ a consistent snapshot of kernel space without stopping system.
+
+ This feature regularly causes small overhead on kernel.
+
+ Once this feature is initialized by its special ioctl, it
+ allocates huge memory for itself and causes much more overhead
+ on kernel.
+
+ If in doubt, say N.
+
endmenu

config ARCH_ENABLE_MEMORY_HOTPLUG
diff --git a/kernel/Makefile b/kernel/Makefile
index cb41b95..f095e7a 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_PADATA) += padata.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_LIVEDUMP) += livedump.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/livedump.c b/kernel/livedump.c
new file mode 100644
index 0000000..3103292
--- /dev/null
+++ b/kernel/livedump.c
@@ -0,0 +1,83 @@
+/* livedump.c - Live Dump's main
+ * Copyright (C) 2012 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA 02110-1301, USA.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+
+#define DEVICE_NAME "livedump"
+
+#define LIVEDUMP_IOC(x) _IO(0xff, x)
+
+static long livedump_ioctl(
+ struct file *file, unsigned int cmd, unsigned long arg)
+{
+ switch (cmd) {
+ default:
+ return -ENOIOCTLCMD;
+ }
+}
+
+static int livedump_open(struct inode *inode, struct file *file)
+{
+ if (!try_module_get(THIS_MODULE))
+ return -ENOENT;
+ return 0;
+}
+
+static int livedump_release(struct inode *inode, struct file *file)
+{
+ module_put(THIS_MODULE);
+ return 0;
+}
+
+static const struct file_operations livedump_fops = {
+ .unlocked_ioctl = livedump_ioctl,
+ .open = livedump_open,
+ .release = livedump_release,
+};
+static struct miscdevice livedump_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = DEVICE_NAME,
+ .fops = &livedump_fops,
+};
+
+static int livedump_module_init(void)
+{
+ int ret;
+
+ ret = misc_register(&livedump_misc);
+ if (WARN(ret,
+ "livedump: Failed to register livedump on misc device.\n"
+ ))
+ return ret;
+
+ return 0;
+}
+module_init(livedump_module_init);
+
+static void livedump_module_exit(void)
+{
+ misc_deregister(&livedump_misc);
+}
+module_exit(livedump_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Livedump kernel module");

2012-05-25 09:31:29

by YOSHIDA Masanori

Subject: [RFC PATCH 4/4 V2] livedump: Add memory dumping functionality

This patch implements memory dumping of kernel space. The whole dumped image
is first saved in memory. To do so, this patch allocates about 50% of RAM at
the initialization.
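
To give a feel for the cost, the following Python sketch reproduces the
sizing arithmetic of livedump_memdump_init() in this patch (num_physpages /
2 + 1 dump pages, plus two pointer arrays). The 4 GiB example machine and
the helper name are assumptions of the sketch, and num_physpages on a real
system need not equal RAM size divided by the page size.

```python
PAGE_SIZE = 4096      # 4K pages, as handled by livedump
POINTER_SIZE = 8      # sizeof(void *) on x86-64


def memdump_footprint(num_physpages):
    """Memory reserved at init time, following livedump_memdump_init()."""
    alloced = num_physpages // 2 + 1
    return {
        "dump_pages": alloced,
        "dump_bytes": alloced * PAGE_SIZE,
        "pages_array_bytes": alloced * POINTER_SIZE,      # 'pages'
        "pagemap_bytes": num_physpages * POINTER_SIZE,    # 'pagemap'
    }


# Example: 4 GiB of RAM, i.e. 1048576 pages of 4 KiB.
info = memdump_footprint(4 * 2**30 // PAGE_SIZE)
for key, value in info.items():
    print(key, value)
```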

This patch also adds read/lseek operations to the "livedump" misc device to
provide user land with means to read the dumped data. The standard dump
analysis tool "crash" can analyze the dumped data via these operations.
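
The read path can be pictured with the following Python model. It mirrors
the logic of livedump_memdump_sys_read() in this patch (reads are clamped
to the end of physical memory, and pages that were never dumped read back
as zeroes), but it works on a dictionary instead of the kernel's pagemap
array, and the function name and return convention are made up for the
sketch.

```python
PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)


def memdump_read(pagemap, num_physpages, pos, count):
    """Return (data, new_pos) for a read of `count` bytes at offset `pos`."""
    end = num_physpages * PAGE_SIZE
    if pos >= end:
        return b"", pos                      # reading past the end: EOF
    count = min(count, end - pos)            # clamp to the end of memory
    out = bytearray()
    while count:
        # Pages without a dumped copy read back as zeroes.
        page = pagemap.get(pos // PAGE_SIZE, ZERO_PAGE)
        off = pos % PAGE_SIZE
        length = min(count, PAGE_SIZE - off) # never cross a page boundary
        out += page[off:off + length]
        pos += length
        count -= length
    return bytes(out), pos


# Only page 1 has been dumped; the read straddles pages 0 and 1.
pagemap = {1: b"\xaa" * PAGE_SIZE}
data, pos = memdump_read(pagemap, 3, PAGE_SIZE - 2, 4)
print(data, pos)
```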

The previous patch made it possible to define hook functions that specify
which pages to write-protect and how to handle them. This patch defines
the hook functions as follows.

- fn_select_pages:
Selects all normal RAM pages, which are marked as E820_RAM.

Also selects pages of physical memory address from 0 to
CONFIG_X86_RESERVE_LOW. This range is usually used by BIOS,
but crash also uses this range of memory.

Pages that contain this patch's own data (e.g. pages allocated to store
the dumped image) are not selected because they are not needed for
memory dump analysis.
However, this patch's own data is not necessarily aligned to 4K.
Therefore, the first and last pages can also contain data that does not
belong to this patch. I call such pages "edge pages".
Edge pages are selected here, but all of them are handled during the
stop-machine state because they are "sensitive pages".

- fn_handle_page:
Saves a faulting page onto the above allocated area.

- fn_handle_sensitive_pages:
Handles edge pages as described above.
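
The distinction between fully covered pages and edge pages can be sketched
as below. This is a hypothetical Python helper working on byte offsets, not
the implementation of wrprotect_unselect_pages_but_edges(), which operates
on kernel addresses and a PFN bitmap. It assumes a non-empty range.

```python
PAGE_SIZE = 4096


def split_range(start, length):
    """Return (inner_pfns, edge_pfns) for the byte range [start, start+length)."""
    end = start + length
    first = start // PAGE_SIZE              # first page touched by the range
    last = (end - 1) // PAGE_SIZE           # last page touched by the range
    inner_first = -(-start // PAGE_SIZE)    # round start up to a page boundary
    inner_last = end // PAGE_SIZE           # round end down (exclusive bound)
    # Pages lying entirely inside the range: safe to unselect.
    inner = list(range(inner_first, inner_last))
    # Pages only partly covered: they may hold unrelated data as well.
    edges = [pfn for pfn in range(first, last + 1) if pfn not in inner]
    return inner, edges


# An unaligned range of three pages' worth of bytes starting inside page 2.
inner, edges = split_range(2 * PAGE_SIZE + 100, 3 * PAGE_SIZE)
print(inner, edges)
```

In this model the inner pages would be cleared from the selection bitmap,
while the edge pages stay selected and are dumped during the stop-machine
state.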

Signed-off-by: YOSHIDA Masanori <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---

kernel/Makefile | 2
kernel/livedump-memdump.c | 237 +++++++++++++++++++++++++++++++++++++++++++++
kernel/livedump-memdump.h | 45 +++++++++
kernel/livedump.c | 13 ++
tools/livedump/livedump | 28 ++++-
5 files changed, 315 insertions(+), 10 deletions(-)
create mode 100644 kernel/livedump-memdump.c
create mode 100644 kernel/livedump-memdump.h

diff --git a/kernel/Makefile b/kernel/Makefile
index f095e7a..13dce48 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -106,7 +106,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
obj-$(CONFIG_PADATA) += padata.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
-obj-$(CONFIG_LIVEDUMP) += livedump.o
+obj-$(CONFIG_LIVEDUMP) += livedump.o livedump-memdump.o

$(obj)/configs.o: $(obj)/config_data.h

diff --git a/kernel/livedump-memdump.c b/kernel/livedump-memdump.c
new file mode 100644
index 0000000..7280d10
--- /dev/null
+++ b/kernel/livedump-memdump.c
@@ -0,0 +1,237 @@
+/* livedump-memdump.c - Live Dump's memory dumping management
+ * Copyright (C) 2012 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA 02110-1301, USA.
+ */
+
+#include "livedump-memdump.h"
+#include <asm/wrprotect.h>
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+/* memdump's stuffs */
+static struct memdump {
+ spinlock_t lock;
+ unsigned long alloced;
+ unsigned long used;
+ int state;
+#define STATE_UNINIT 0
+#define STATE_INITED 1
+} memdump;
+
+static void **pages; /* allocated pages */
+static void **pagemap; /* mapping from PFN to page */
+
+int livedump_memdump_init(void)
+{
+ int ret;
+ unsigned long i;
+
+ if (WARN(STATE_UNINIT != memdump.state,
+ "livedump: memdump is already initialized.\n"))
+ return 0;
+
+ spin_lock_init(&memdump.lock);
+ memdump.alloced = num_physpages / 2 + 1;
+
+ ret = -ENOMEM;
+ pages = vmalloc(sizeof(void *) * memdump.alloced);
+ if (!pages)
+ goto err;
+ for (i = 0; i < memdump.alloced; i++) {
+ pages[i] = (void *)__get_free_page(GFP_KERNEL);
+ if (!pages[i])
+ goto err;
+ }
+
+ ret = -ENOMEM;
+ pagemap = vmalloc(sizeof(void *) * num_physpages);
+ if (!pagemap)
+ goto err;
+ memset(pagemap, 0, sizeof(void *) * num_physpages);
+
+ memdump.state = STATE_INITED;
+ return 0;
+err:
+ livedump_memdump_uninit();
+ return ret;
+}
+
+void livedump_memdump_uninit(void)
+{
+ if (pagemap) {
+ vfree(pagemap);
+ pagemap = NULL;
+ }
+ if (pages) {
+ unsigned long i;
+ for (i = 0; i < memdump.alloced; i++)
+ if (pages[i])
+ free_page((unsigned long)pages[i]);
+ else
+ break;
+ vfree(pages);
+ pages = NULL;
+ }
+ memdump.used = 0;
+ memdump.alloced = 0;
+ spin_lock_init(&memdump.lock);
+
+ memdump.state = STATE_UNINIT;
+}
+
+/* livedump_memdump_select_pages
+ *
+ * Selects pages to protect.
+ *
+ * The following pages are selected.
+ * - Pages marked as RAM by E820
+ * - Pages of low memory used by BIOS (needed for crash to work normally)
+ *
+ * Pages that contain memdump's stuffs are unselected (eliminated from
+ * selection).
+ *
+ * On the other hand, because vmap areas are not write-protected,
+ * we don't have to unselect pagemap.
+ */
+int livedump_memdump_select_pages(unsigned long *pgbmp)
+{
+ unsigned long pfn, i;
+
+ /* Select all RAM pages */
+ for (pfn = 0; pfn < num_physpages; pfn++) {
+ if (e820_any_mapped(pfn << PAGE_SHIFT,
+ (pfn + 1) << PAGE_SHIFT,
+ E820_RAM))
+ set_bit(pfn, pgbmp);
+ cond_resched();
+ }
+
+ /* Essential area for executing crash with livedump */
+ bitmap_set(pgbmp, 0, (CONFIG_X86_RESERVE_LOW << 10) >> PAGE_SHIFT);
+
+ /* Unselect memdump stuffs (not needed against vmap areas) */
+ wrprotect_unselect_pages_but_edges(pgbmp,
+ (unsigned long)&memdump, sizeof(memdump));
+ for (i = 0; i < memdump.alloced; i++) {
+ clear_bit(__pa(pages[i]) >> PAGE_SHIFT, pgbmp);
+ cond_resched();
+ }
+
+ return 0;
+}
+
+/* livedump_memdump_handle_sensitive_pages
+ *
+ * Edge pages possibly contain both memdump's stuffs and something else.
+ * Such pages must not be unselected in advance.
+ * In fact, they should be handled during the stop-machine state.
+ *
+ * memdump_handle_sensitive_pages hook function is called to do this.
+ */
+void livedump_memdump_handle_sensitive_pages(unsigned long *pgbmp)
+{
+ wrprotect_handle_only_edges(pgbmp, livedump_memdump_handle_page,
+ (unsigned long)&memdump, sizeof(memdump));
+}
+
+void livedump_memdump_handle_page(unsigned long pfn)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&memdump.lock, flags);
+ if (WARN(memdump.used >= memdump.alloced,
+ "livedump: Out of memory of memdump.\n"))
+ goto out;
+ pagemap[pfn] = pages[memdump.used++];
+ memcpy(pagemap[pfn], pfn_to_kaddr(pfn), PAGE_SIZE);
+out:
+ spin_unlock_irqrestore(&memdump.lock, flags);
+}
+
+static void *memdump_page(unsigned long pfn)
+{
+ void *p = pagemap[pfn];
+ if (p)
+ return p;
+ return empty_zero_page;
+}
+
+loff_t livedump_memdump_sys_llseek(struct file *file, loff_t offset, int origin)
+{
+ loff_t retval;
+
+ switch (origin) {
+ case SEEK_SET:
+ break;
+ case SEEK_END:
+ offset += PFN_PHYS(num_physpages);
+ break;
+ case SEEK_CUR:
+ if (offset == 0) {
+ retval = file->f_pos;
+ goto out;
+ }
+ offset += file->f_pos;
+ break;
+ case SEEK_DATA:
+ case SEEK_HOLE:
+ retval = -ENOSYS;
+ goto out;
+ default:
+ retval = -EINVAL;
+ goto out;
+ }
+ retval = -EINVAL;
+ if (offset >= 0) {
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+ retval = offset;
+ }
+out:
+ return retval;
+}
+
+ssize_t livedump_memdump_sys_read(
+ struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ loff_t pos = *ppos;
+
+ if (pos >= PFN_PHYS(num_physpages))
+ return 0;
+ if (count > PFN_PHYS(num_physpages) - pos)
+ count = PFN_PHYS(num_physpages) - pos;
+
+ while (count) {
+ void *p = memdump_page(pos >> PAGE_SHIFT);
+ unsigned long off = pos & ~PAGE_MASK;
+ unsigned long len = min(count, PAGE_SIZE - off);
+ if (copy_to_user(buf, p + off, len))
+ return -EFAULT;
+ buf += len;
+ pos += len;
+ count -= len;
+ }
+
+ pos -= *ppos;
+ *ppos += pos;
+ return pos;
+}
diff --git a/kernel/livedump-memdump.h b/kernel/livedump-memdump.h
new file mode 100644
index 0000000..e3c3a5c
--- /dev/null
+++ b/kernel/livedump-memdump.h
@@ -0,0 +1,45 @@
+/* livedump-memdump.h - Live Dump's memory dumping management
+ * Copyright (C) 2012 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA 02110-1301, USA.
+ */
+
+#ifndef _LIVEDUMP_MEMDUMP_H
+#define _LIVEDUMP_MEMDUMP_H
+
+#include <linux/fs.h>
+
+extern int livedump_memdump_init(void);
+
+extern void livedump_memdump_uninit(void);
+
+extern int livedump_memdump_select_pages(unsigned long *pgbmp);
+
+extern void livedump_memdump_handle_sensitive_pages(unsigned long *pgbmp);
+
+extern void livedump_memdump_handle_page(unsigned long pfn);
+
+extern loff_t livedump_memdump_sys_llseek(
+ struct file *file, loff_t offset, int origin);
+
+extern ssize_t livedump_memdump_sys_read(
+ struct file *file,
+ char __user *buf,
+ size_t len,
+ loff_t *ppos);
+
+#endif /* _LIVEDUMP_MEMDUMP_H */
diff --git a/kernel/livedump.c b/kernel/livedump.c
index 7be84e2..f3b6a7b 100644
--- a/kernel/livedump.c
+++ b/kernel/livedump.c
@@ -18,6 +18,7 @@
* MA 02110-1301, USA.
*/

+#include "livedump-memdump.h"
#include <asm/wrprotect.h>

#include <linux/module.h>
@@ -35,13 +36,21 @@
static void do_uninit(void)
{
wrprotect_uninit();
+ livedump_memdump_uninit();
}

static int do_init(void)
{
int ret;

- ret = wrprotect_init(NULL, NULL, NULL);
+ ret = livedump_memdump_init();
+ if (WARN(ret, "livedump: Failed to initialize Dump manager.\n"))
+ goto err;
+
+ ret = wrprotect_init(
+ livedump_memdump_select_pages,
+ livedump_memdump_handle_sensitive_pages,
+ livedump_memdump_handle_page);
if (WARN(ret, "livedump: Failed to initialize Protection manager.\n"))
goto err;

@@ -86,6 +95,8 @@ static const struct file_operations livedump_fops = {
.unlocked_ioctl = livedump_ioctl,
.open = livedump_open,
.release = livedump_release,
+ .read = livedump_memdump_sys_read,
+ .llseek = livedump_memdump_sys_llseek,
};
static struct miscdevice livedump_misc = {
.minor = MISC_DYNAMIC_MINOR,
diff --git a/tools/livedump/livedump b/tools/livedump/livedump
index b873b39..2520bd0 100755
--- a/tools/livedump/livedump
+++ b/tools/livedump/livedump
@@ -3,14 +3,26 @@
import sys
import fcntl

-cmds = {
- 'start':0xff01,
- 'sweep':0xff02,
- 'init':0xff64,
- 'uninit':0xff65
- }
-cmd = cmds[sys.argv[1]]
+def livedump_ioctl(f, scmd):
+ cmds = {
+ 'start':0xff01,
+ 'sweep':0xff02,
+ 'init':0xff64,
+ 'uninit':0xff65
+ }
+ cmd = cmds[scmd]
+ fcntl.ioctl(f, cmd)
+ print('done: ' + scmd)

f = open('/dev/livedump')
-fcntl.ioctl(f, cmd)
+
+if 'dump' == sys.argv[1]:
+ livedump_ioctl(f, 'init')
+ livedump_ioctl(f, 'start')
+ livedump_ioctl(f, 'sweep')
+elif 'release' == sys.argv[1]:
+ livedump_ioctl(f, 'uninit')
+else:
+ livedump_ioctl(f, sys.argv[1])
+
f.close

2012-05-25 09:31:28

by YOSHIDA Masanori

Subject: [RFC PATCH 3/4 V2] livedump: Add write protection management

This patch makes it possible to write-protect pages in kernel space and to
install a handler function that is called every time a page fault occurs
on a protected page. The write protection is set up in the stop-machine
state so that all pages are protected consistently.

Write protection and fault handling are processed in the following
order:

(1) Initialization phase
- Sets up data structure for write protection management.
- Splits all large pages in kernel space into 4K pages since currently
livedump can handle only 4K pages. In the future, this step (page
splitting) should be eliminated.
(2) Write protection phase
- Stops machine.
- Handles sensitive pages.
(sensitive pages are described below)
- Sets up write protection.
- Resumes machine.
(3) Page fault exception handling
- Calls the handler function before unprotecting the faulted page.
(4) Sweep phase
- Calls the handler function against the rest of pages.

This patch exports the following 4 ioctl operations.
- Ioctl to activate this feature of write protection
- Ioctl to deactivate this feature
- Ioctl to kick stop-machine and to set up write protection
- Ioctl to sweep all the rest of pages

The processing states are as follows. Transitions occur only in this order.
- STATE_UNINIT
- STATE_INITED
- STATE_STARTED (= write protection already set up)
- STATE_SWEPT

However, this order is enforced only through a plain integer variable, so,
to be exact, this code is not safe against concurrent operation.
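
One way to make the ordering safe against concurrent callers is to check
and update the state under a lock. The Python sketch below illustrates the
idea only. The patch itself does not do this, and the class, its methods
and the choice of a mutex are assumptions of the example.

```python
import threading

STATE_UNINIT, STATE_INITED, STATE_STARTED, STATE_SWEPT = range(4)


class WrprotectState:
    """State variable whose transitions are checked and applied atomically."""

    def __init__(self):
        self._state = STATE_UNINIT
        self._lock = threading.Lock()

    def _advance(self, expected, new):
        # The compare and the update happen under one lock, so two callers
        # cannot both observe `expected` and both proceed.
        with self._lock:
            if self._state != expected:
                return False
            self._state = new
            return True

    def init(self):
        return self._advance(STATE_UNINIT, STATE_INITED)

    def start(self):
        return self._advance(STATE_INITED, STATE_STARTED)

    def sweep(self):
        return self._advance(STATE_STARTED, STATE_SWEPT)


state = WrprotectState()
print(state.start())   # False: start before init is rejected
print(state.init())    # True
print(state.start())   # True
print(state.sweep())   # True
```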

The livedump module has to acquire a consistent memory image of kernel
space. Therefore, write protection is set up while updates of memory state
are suspended. To do so, livedump currently uses stop_machine.

Causing a page fault during page fault handling results in a kernel panic,
so any page that can be updated during page fault handling must not be
write-protected. For the same reason, any page that can be updated during
NMI handling must not be write-protected. I call such pages "sensitive
pages". The handler function is called for the sensitive pages during the
stop-machine state, as if they had caused a page fault at that point.

The sensitive pages are as follows:

- Kernel/Exception/Interrupt stacks
- Page table structure
- All task_struct
- ".data" section of kernel
- per_cpu areas

The handler function is never called for pages that are not updated,
unless something else calls it explicitly. To handle these pages, the
livedump module finally calls the handler function for each of the
remaining pages. I call this phase "sweep"; it is triggered by an ioctl
operation.

To specify which pages are to be write-protected and how to handle them,
the following 3 types of hook functions need to be defined.

- int fn_select_pages(unsigned long *bmp)
This function selects pages to be protected. The selection is returned in
the form of a bitmap in which each bit corresponds to a PFN (page frame
number). This function is called outside the stop-machine state, so its
processing does not make the stop-machine time longer.

- void fn_handle_page(unsigned long pfn)
This function handles faulting pages. The argument pfn specifies which
page caused page fault. How to handle the page can be defined
arbitrarily.
This function is called when page fault occurs on the pages protected
by this module. It's also called during the stop-machine state to
handle the above sensitive pages.

- void fn_handle_sensitive_pages(unsigned long *bmp)
Whoever defines these hook functions may have additional sensitive
pages, that is, pages that must not be write-protected. This function
handles such pages during the stop-machine state. Bits in the bitmap
corresponding to pages that are handled by this function must be
cleared.
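
The contract between the three hooks can be illustrated with a small
user-space model. The Python sketch below is not the wrprotect.c
implementation: the Wrprotect class, the list used as a bitmap and the
point at which fn_select_pages is called are assumptions. It follows the
phase order described earlier (select, handle sensitive pages, fault
handling, sweep).

```python
class Wrprotect:
    """Toy driver that calls the three hooks in the documented order."""

    def __init__(self, num_pages, fn_select_pages,
                 fn_handle_sensitive_pages, fn_handle_page):
        self.bitmap = [False] * num_pages   # True = selected / protected
        self.fn_handle_sensitive_pages = fn_handle_sensitive_pages
        self.fn_handle_page = fn_handle_page
        fn_select_pages(self.bitmap)        # runs outside stop-machine

    def start(self):
        # Inside stop-machine: sensitive pages are handled right away and
        # their bits cleared; the remaining set bits become write-protected.
        self.fn_handle_sensitive_pages(self.bitmap)

    def fault(self, pfn):
        if not self.bitmap[pfn]:
            return False                    # not ours, normal fault path
        self.fn_handle_page(pfn)            # handle before unprotecting
        self.bitmap[pfn] = False
        return True

    def sweep(self):
        for pfn, selected in enumerate(self.bitmap):
            if selected:
                self.fn_handle_page(pfn)
                self.bitmap[pfn] = False


dumped = []


def select_pages(bitmap):
    for pfn in range(len(bitmap)):
        bitmap[pfn] = True
    bitmap[7] = False                       # e.g. a page that is not RAM
    return 0


def handle_page(pfn):
    dumped.append(pfn)


def handle_sensitive_pages(bitmap):
    for pfn in (0, 1):                      # pretend pages 0 and 1 are sensitive
        handle_page(pfn)
        bitmap[pfn] = False                 # handled bits must be cleared


wp = Wrprotect(8, select_pages, handle_sensitive_pages, handle_page)
wp.start()
wp.fault(4)
wp.sweep()
print(dumped)
```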

Strictly speaking, if set_memory_rw is called between the STATE_STARTED
and STATE_SWEPT states, the consistency of the dumped memory image may
break. To solve this problem, I plan to add a hook into set_memory_rw in
the next version of the patch series.

Signed-off-by: YOSHIDA Masanori <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
---

arch/x86/Kconfig | 16 +
arch/x86/include/asm/wrprotect.h | 47 +++
arch/x86/mm/Makefile | 2
arch/x86/mm/wrprotect.c | 618 ++++++++++++++++++++++++++++++++++++++
kernel/livedump.c | 35 ++
tools/livedump/livedump | 16 +
6 files changed, 733 insertions(+), 1 deletions(-)
create mode 100644 arch/x86/include/asm/wrprotect.h
create mode 100644 arch/x86/mm/wrprotect.c
create mode 100755 tools/livedump/livedump

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4c97583..12fe7a6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1729,9 +1729,23 @@ config CMDLINE_OVERRIDE
This is used to work around broken boot loaders. This should
be set to 'N' under normal conditions.

+config WRPROTECT
+ bool "Write protection on kernel space"
+ depends on X86_64
+ ---help---
+ Set this option to 'Y' to allow the kernel to write protect
+ its own memory space and to handle page fault caused by the
+ write protection.
+
+ This feature regularly causes small overhead on kernel.
+ Once this feature is activated, it causes much more overhead
+ on kernel.
+
+ If in doubt, say N.
+
config LIVEDUMP
bool "Live Dump support"
- depends on X86_64
+ depends on WRPROTECT
---help---
Set this option to 'Y' to allow the kernel support to acquire
a consistent snapshot of kernel space without stopping system.
diff --git a/arch/x86/include/asm/wrprotect.h b/arch/x86/include/asm/wrprotect.h
new file mode 100644
index 0000000..92edab4
--- /dev/null
+++ b/arch/x86/include/asm/wrprotect.h
@@ -0,0 +1,47 @@
+/* wrprotect.h - Kernel space write protection support
+ * Copyright (C) 2012 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA 02110-1301, USA.
+ */
+
+#ifndef _WRPROTECT_H
+#define _WRPROTECT_H
+
+typedef int (*fn_select_pages_t)(unsigned long *pfn_bmp);
+typedef void (*fn_handle_sensitive_pages_t)(unsigned long *pgbmp);
+typedef void (*fn_handle_page_t)(unsigned long pfn);
+
+extern int wrprotect_init(
+ fn_select_pages_t fn_select_pages,
+ fn_handle_sensitive_pages_t fn_handle_sensitive_pages,
+ fn_handle_page_t fn_handle_page);
+extern void wrprotect_uninit(void);
+
+extern int wrprotect_start(void);
+extern int wrprotect_sweep(void);
+
+extern void wrprotect_unselect_pages_but_edges(
+ unsigned long *pgbmp,
+ unsigned long start,
+ unsigned long len);
+extern void wrprotect_handle_only_edges(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page,
+ unsigned long start,
+ unsigned long len);
+
+#endif /* _WRPROTECT_H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 23d8e5f..58f1428 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -28,3 +28,5 @@ obj-$(CONFIG_ACPI_NUMA) += srat.o
obj-$(CONFIG_NUMA_EMU) += numa_emulation.o

obj-$(CONFIG_MEMTEST) += memtest.o
+
+obj-$(CONFIG_WRPROTECT) += wrprotect.o
diff --git a/arch/x86/mm/wrprotect.c b/arch/x86/mm/wrprotect.c
new file mode 100644
index 0000000..aef7646
--- /dev/null
+++ b/arch/x86/mm/wrprotect.c
@@ -0,0 +1,618 @@
+/* wrprotect.c - Kernel space write protection support
+ * Copyright (C) 2012 Hitachi, Ltd.
+ * Author: YOSHIDA Masanori <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+ * MA 02110-1301, USA.
+ */
+
+#include <asm/wrprotect.h>
+#include <linux/mm.h> /* num_physpages, __get_free_page, etc. */
+#include <linux/bitmap.h> /* bit operations */
+#include <linux/slab.h> /* kmalloc, kfree */
+#include <linux/hugetlb.h> /* __flush_tlb_all */
+#include <linux/stop_machine.h> /* stop_machine */
+#include <asm/traps.h> /* page_fault_notifier_list */
+#include <asm/sections.h> /* __per_cpu_* */
+
+/* wrprotect's state */
+static struct wrprotect {
+ int state;
+#define STATE_UNINIT 0
+#define STATE_INITED 1
+#define STATE_STARTED 2
+#define STATE_SWEPT 3
+} wrprotect;
+
+/* Bitmap specifying pages being write-protected */
+static unsigned long *pgbmp;
+#define PGBMP_LEN (sizeof(long) * BITS_TO_LONGS(num_physpages))
+
+/* wrprotect's hook functions, which define which pages are handled and how */
+static struct {
+ fn_select_pages_t select_pages;
+ fn_handle_sensitive_pages_t handle_sensitive_pages;
+ fn_handle_page_t handle_page;
+} ops;
+
+static int split_large_pages(void)
+{
+ unsigned long pfn;
+ for (pfn = 0; pfn < num_physpages; pfn++) {
+ int ret = set_memory_4k((unsigned long)pfn_to_kaddr(pfn), 1);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+struct sm_context {
+ int leader_cpu;
+ int leader_done;
+ int (*fn_leader)(void *arg);
+ int (*fn_follower)(void *arg);
+ void *arg;
+};
+
+static int call_leader_follower(void *data)
+{
+ int ret;
+ struct sm_context *ctx = data;
+
+ if (smp_processor_id() == ctx->leader_cpu) {
+ ret = ctx->fn_leader(ctx->arg);
+ ctx->leader_done = 1;
+ } else {
+ while (!ctx->leader_done)
+ cpu_relax();
+ ret = ctx->fn_follower(ctx->arg);
+ }
+
+ return ret;
+}
+
+/* stop_machine_leader_follower
+ *
+ * Calls stop_machine with a leader CPU and follower CPUs
+ * executing different code.
+ * The CPU on which the caller happens to run becomes the leader and
+ * executes its code first. After that, the follower CPUs execute theirs.
+ */
+static int stop_machine_leader_follower(
+ int (*fn_leader)(void *),
+ int (*fn_follower)(void *),
+ void *arg)
+{
+ int cpu;
+ struct sm_context ctx;
+
+ preempt_disable();
+ cpu = smp_processor_id();
+ preempt_enable();
+
+ memset(&ctx, 0, sizeof(ctx));
+ ctx.leader_cpu = cpu;
+ ctx.leader_done = 0;
+ ctx.fn_leader = fn_leader;
+ ctx.fn_follower = fn_follower;
+ ctx.arg = arg;
+
+ return stop_machine(call_leader_follower, &ctx, cpu_online_mask);
+}
+
+/* wrprotect_unselect_pages_but_edges
+ *
+ * Clear bits corresponding to pages that cover a range
+ * from start to start+len-1.
+ * However, if edges (start and/or start+len) are not aligned to PAGE_SIZE,
+ * the first and the last bits are not cleared.
+ */
+void wrprotect_unselect_pages_but_edges(
+ unsigned long *pgbmp,
+ unsigned long start,
+ unsigned long len)
+{
+ unsigned long end = (start + len) & PAGE_MASK;
+
+ start = (start + PAGE_SIZE - 1) & PAGE_MASK;
+ while (start < end) {
+ unsigned long pfn = __pa(start) >> PAGE_SHIFT;
+ clear_bit(pfn, pgbmp);
+ start += PAGE_SIZE;
+ }
+}
+
+/* wrprotect_handle_only_edges
+ *
+ * Call fn_handle_page against the first and the last pages
+ * if the corresponding bits are set.
+ * When fn_handle_page is called, the corresponding bit is cleared.
+ */
+void wrprotect_handle_only_edges(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page,
+ unsigned long start,
+ unsigned long len)
+{
+ unsigned long pfn_begin = __pa(start) >> PAGE_SHIFT;
+ unsigned long pfn_last = __pa(start + len - 1) >> PAGE_SHIFT;
+
+ if (test_bit(pfn_begin, pgbmp)) {
+ fn_handle_page(pfn_begin);
+ clear_bit(pfn_begin, pgbmp);
+ }
+ if (test_bit(pfn_last, pgbmp)) {
+ fn_handle_page(pfn_last);
+ clear_bit(pfn_last, pgbmp);
+ }
+}
+
+/* handle_addr_range
+ *
+ * Call fn_handle_page in turn against pages that cover a range
+ * from start to start+len-1.
+ * At the same time, bits corresponding to the pages are cleared.
+ */
+static void handle_addr_range(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page,
+ unsigned long start,
+ unsigned long len)
+{
+ unsigned long end = start + len;
+
+ while (start < end) {
+ unsigned long pfn = __pa(start) >> PAGE_SHIFT;
+ if (test_bit(pfn, pgbmp)) {
+ fn_handle_page(pfn);
+ clear_bit(pfn, pgbmp);
+ }
+ start += PAGE_SIZE;
+ }
+}
+
+/* handle_task
+ *
+ * Call handle_addr_range against a given task_struct & thread_info
+ */
+static void handle_task(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page,
+ struct task_struct *t)
+{
+ BUG_ON(!t);
+ BUG_ON(!t->stack);
+ BUG_ON((unsigned long)t->stack & ~PAGE_MASK);
+ handle_addr_range(pgbmp, fn_handle_page,
+ (unsigned long)t, sizeof(*t));
+ handle_addr_range(pgbmp, fn_handle_page,
+ (unsigned long)t->stack, THREAD_SIZE);
+}
+
+/* handle_tasks
+ *
+ * Call handle_task against all tasks (including idle_task's).
+ */
+static void handle_tasks(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page)
+{
+ struct task_struct *p, *t;
+ unsigned int cpu;
+
+ do_each_thread(p, t) {
+ handle_task(pgbmp, fn_handle_page, t);
+ } while_each_thread(p, t);
+
+ for_each_online_cpu(cpu)
+ handle_task(pgbmp, fn_handle_page, idle_task(cpu));
+}
+
+static void handle_pmd(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page,
+ pmd_t *pmd)
+{
+ unsigned long i;
+
+ handle_addr_range(pgbmp, fn_handle_page,
+ (unsigned long)pmd, PAGE_SIZE);
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ if (pmd_present(pmd[i]) && !pmd_large(pmd[i]))
+ handle_addr_range(pgbmp, fn_handle_page,
+ pmd_page_vaddr(pmd[i]), PAGE_SIZE);
+ }
+}
+
+static void handle_pud(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page,
+ pud_t *pud)
+{
+ unsigned long i;
+
+ handle_addr_range(pgbmp, fn_handle_page,
+ (unsigned long)pud, PAGE_SIZE);
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ if (pud_present(pud[i]) && !pud_large(pud[i]))
+ handle_pmd(pgbmp, fn_handle_page,
+ (pmd_t *)pud_page_vaddr(pud[i]));
+ }
+}
+
+/* handle_page_table
+ *
+ * Call fn_handle_page against all pages of page table structure
+ * and clear all bits corresponding to the pages.
+ */
+static void handle_page_table(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page)
+{
+ pgd_t *pgd;
+ unsigned long i;
+
+ pgd = __va(read_cr3() & PAGE_MASK);
+ handle_addr_range(pgbmp, fn_handle_page,
+ (unsigned long)pgd, PAGE_SIZE);
+ for (i = pgd_index(PAGE_OFFSET); i < PTRS_PER_PGD; i++) {
+ if (pgd_present(pgd[i]))
+ handle_pud(pgbmp, fn_handle_page,
+ (pud_t *)pgd_page_vaddr(pgd[i]));
+ }
+}
+
+/* handle_sensitive_pages
+ *
+ * Call fn_handle_page against task structs/stacks, page tables, the per-cpu
+ * area and the kernel data/bss, and clear the bits corresponding to them.
+ */
+static void handle_sensitive_pages(
+ unsigned long *pgbmp,
+ fn_handle_page_t fn_handle_page)
+{
+ handle_tasks(pgbmp, fn_handle_page);
+ handle_page_table(pgbmp, fn_handle_page);
+ handle_addr_range(pgbmp, fn_handle_page,
+ (unsigned long)__per_cpu_offset[0], PMD_PAGE_SIZE);
+ handle_addr_range(pgbmp, fn_handle_page,
+ (unsigned long)_sdata, _end - _sdata);
+}
+
+/* protect_page
+ *
+ * Changes a specified page's _PAGE_RW flag and _PAGE_UNUSED1 flag.
+ * If the argument protect is non-zero:
+ * - _PAGE_RW flag is cleared
+ * - _PAGE_UNUSED1 flag is set
+ * If the argument protect is zero:
+ * - _PAGE_RW flag is set
+ * - _PAGE_UNUSED1 flag is cleared
+ *
+ * The change is executed only when all the following are true.
+ * - The page is mapped by the straight mapping area.
+ * - The page is mapped as 4K page.
+ * - The page is originally writable.
+ *
+ * Returns 1 if the change is actually executed, otherwise returns 0.
+ */
+static int protect_page(unsigned long pfn, int protect)
+{
+ unsigned long addr = (unsigned long)pfn_to_kaddr(pfn);
+ pte_t *ptep, pte;
+ unsigned int level;
+
+ ptep = lookup_address(addr, &level);
+ if (WARN(!ptep, "livedump: Page=%016lx isn't mapped.\n", addr) ||
+ WARN(!pte_present(*ptep),
+ "livedump: Page=%016lx isn't mapped.\n", addr) ||
+ WARN(PG_LEVEL_NONE == level,
+ "livedump: Page=%016lx isn't mapped.\n", addr) ||
+ WARN(PG_LEVEL_2M == level,
+ "livedump: Page=%016lx is mapped by a 2M page.\n", addr) ||
+ WARN(PG_LEVEL_1G == level,
+ "livedump: Page=%016lx is mapped by a 1G page.\n", addr)) {
+ return 0;
+ }
+
+ pte = *ptep;
+ if (protect) {
+ if (pte_write(pte)) {
+ pte = pte_wrprotect(pte);
+ pte = pte_set_flags(pte, _PAGE_UNUSED1);
+ }
+ } else {
+ pte = pte_mkwrite(pte);
+ pte = pte_clear_flags(pte, _PAGE_UNUSED1);
+ }
+ *ptep = pte;
+
+ return 1;
+}
+
+/*
+ * Page fault error code bits:
+ *
+ * bit 0 == 0: no page found 1: protection fault
+ * bit 1 == 0: read access 1: write access
+ * bit 2 == 0: kernel-mode access 1: user-mode access
+ * bit 3 == 1: use of reserved bit detected
+ * bit 4 == 1: fault was an instruction fetch
+ */
+enum x86_pf_error_code {
+ PF_PROT = 1 << 0,
+ PF_WRITE = 1 << 1,
+ PF_USER = 1 << 2,
+ PF_RSVD = 1 << 3,
+ PF_INSTR = 1 << 4,
+};
+
+static int wrprotect_page_fault_notifier(
+ struct notifier_block *n, unsigned long val, void *v)
+{
+ unsigned long error_code = val;
+ pte_t *ptep, pte;
+ unsigned int level;
+ unsigned long pfn;
+
+ /*
+ * Handle only kernel-mode write access
+ *
+ * error_code must be:
+ * (1) PF_PROT
+ * (2) PF_WRITE
+ * (3) not PF_USER
+ * (4) not PF_RSVD
+ * (5) not PF_INSTR
+ */
+ if (!(PF_PROT & error_code) ||
+ !(PF_WRITE & error_code) ||
+ (PF_USER & error_code) ||
+ (PF_RSVD & error_code) ||
+ (PF_INSTR & error_code))
+ goto not_processed;
+
+ ptep = lookup_address(read_cr2(), &level);
+ if (!ptep)
+ goto not_processed;
+ pte = *ptep;
+ if (!pte_present(pte) || PG_LEVEL_4K != level)
+ goto not_processed;
+ if (!(pte_flags(pte) & _PAGE_UNUSED1))
+ goto not_processed;
+
+ pfn = pte_pfn(pte);
+ if (test_and_clear_bit(pfn, pgbmp)) {
+ ops.handle_page(pfn);
+ protect_page(pfn, 0);
+ }
+
+ return NOTIFY_STOP;
+
+not_processed:
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block wrprotect_page_fault_notifier_block = {
+ .notifier_call = wrprotect_page_fault_notifier,
+ .priority = 0,
+};
+
+/* sm_leader
+ *
+ * Is executed by a leader CPU during stop-machine.
+ *
+ * Does the following:
+ * (1)Handle sensitive pages, which must not be write-protected.
+ * (2)Register notifier-call-chain into the kernel's page fault handler.
+ * (3)Write-protect pages which are specified by the bitmap.
+ * (4)Flush TLB cache of the leader CPU.
+ */
+static int sm_leader(void *arg)
+{
+ int ret;
+ unsigned long pfn;
+
+ handle_sensitive_pages(pgbmp, ops.handle_page);
+ wrprotect_handle_only_edges(pgbmp, ops.handle_page,
+ (unsigned long)pgbmp, PGBMP_LEN);
+ wrprotect_handle_only_edges(pgbmp, ops.handle_page,
+ (unsigned long)&wrprotect, sizeof(wrprotect));
+ ops.handle_sensitive_pages(pgbmp);
+
+ ret = atomic_notifier_chain_register(
+ &page_fault_notifier_list,
+ &wrprotect_page_fault_notifier_block);
+ if (WARN(ret, "livedump: Failed to register notifier.\n"))
+ return ret;
+
+ for_each_set_bit(pfn, pgbmp, num_physpages)
+ if (!protect_page(pfn, 1))
+ clear_bit(pfn, pgbmp);
+
+ __flush_tlb_all();
+
+ return 0;
+}
+
+/* sm_follower
+ *
+ * Is executed by follower CPUs during stop-machine.
+ * Flushes TLB cache of each CPU.
+ */
+static int sm_follower(void *arg)
+{
+ __flush_tlb_all();
+ return 0;
+}
+
+/* wrprotect_start
+ *
+ * Set up write protection on the kernel space in the stop-machine state.
+ */
+int wrprotect_start(void)
+{
+ int ret;
+
+ if (WARN(STATE_INITED != wrprotect.state,
+ "livedump: wrprotect isn't initialized yet.\n"))
+ return 0;
+
+ ret = stop_machine_leader_follower(sm_leader, sm_follower, NULL);
+ if (WARN(ret, "livedump: Failed to protect pages w/errno=%d.\n", ret))
+ return ret;
+
+ wrprotect.state = STATE_STARTED;
+ return 0;
+}
+
+/* wrprotect_sweep
+ *
+ * On every page specified by the bitmap, the following is executed.
+ * - Handle the page as defined by ops.handle_page.
+ * - Change the page's flags by calling protect_page.
+ *
+ * The above work can be executed on the same page at the same time
+ * by the notifier-call-chain.
+ * test_and_clear_bit is used for exclusion control.
+ */
+int wrprotect_sweep(void)
+{
+ unsigned long pfn;
+
+ if (WARN(STATE_STARTED != wrprotect.state,
+ "livedump: Pages aren't protected yet.\n"))
+ return 0;
+ for_each_set_bit(pfn, pgbmp, num_physpages) {
+ if (!test_and_clear_bit(pfn, pgbmp))
+ continue;
+ ops.handle_page(pfn);
+ protect_page(pfn, 0);
+ if (!(pfn & 0xffUL))
+ cond_resched();
+ }
+ wrprotect.state = STATE_SWEPT;
+ return 0;
+}
+
+static int default_select_pages(unsigned long *pgbmp)
+{
+ unsigned long pfn;
+
+ for (pfn = 0; pfn < num_physpages; pfn++) {
+ if (e820_any_mapped(pfn << PAGE_SHIFT,
+ (pfn + 1) << PAGE_SHIFT,
+ E820_RAM))
+ bitmap_set(pgbmp, pfn, 1);
+ if (!(pfn & 0xffUL))
+ cond_resched();
+ }
+ return 0;
+}
+
+static void default_handle_sensitive_pages(unsigned long *pgbmp)
+{
+}
+
+static void default_handle_page(unsigned long pfn)
+{
+}
+
+int wrprotect_init(
+ fn_select_pages_t fn_select_pages,
+ fn_handle_sensitive_pages_t fn_handle_sensitive_pages,
+ fn_handle_page_t fn_handle_page)
+{
+ int ret;
+
+ if (WARN(STATE_UNINIT != wrprotect.state,
+ "livedump: wrprotect is already initialized.\n"))
+ return 0;
+
+ ret = split_large_pages();
+ if (ret)
+ goto err;
+
+ if (fn_select_pages && fn_handle_sensitive_pages && fn_handle_page) {
+ ops.select_pages = fn_select_pages;
+ ops.handle_sensitive_pages = fn_handle_sensitive_pages;
+ ops.handle_page = fn_handle_page;
+ } else {
+ ops.select_pages = default_select_pages;
+ ops.handle_sensitive_pages = default_handle_sensitive_pages;
+ ops.handle_page = default_handle_page;
+ }
+
+ ret = -ENOMEM;
+ pgbmp = kzalloc(PGBMP_LEN, GFP_KERNEL);
+ if (!pgbmp)
+ goto err;
+
+ ret = ops.select_pages(pgbmp);
+ if (ret)
+ goto err;
+
+ wrprotect_unselect_pages_but_edges(
+ pgbmp, (unsigned long)pgbmp, PGBMP_LEN);
+ wrprotect_unselect_pages_but_edges(
+ pgbmp, (unsigned long)&wrprotect, sizeof(wrprotect));
+
+ wrprotect.state = STATE_INITED;
+ return 0;
+
+err:
+ kfree(pgbmp);
+ pgbmp = NULL;
+
+ return ret;
+}
+
+void wrprotect_uninit(void)
+{
+ int ret;
+ unsigned long pfn;
+
+ if (STATE_UNINIT == wrprotect.state)
+ return;
+
+ if (STATE_STARTED == wrprotect.state) {
+ for_each_set_bit(pfn, pgbmp, num_physpages) {
+ if (!test_and_clear_bit(pfn, pgbmp))
+ continue;
+ protect_page(pfn, 0);
+ cond_resched();
+ }
+
+ flush_tlb_all();
+ }
+
+ if (STATE_STARTED <= wrprotect.state) {
+ ret = atomic_notifier_chain_unregister(
+ &page_fault_notifier_list,
+ &wrprotect_page_fault_notifier_block);
+ WARN(ret,
+ "livedump: Failed to unregister notifier w/errno=%d.\n",
+ -ret);
+ }
+
+ ops.select_pages = NULL;
+ ops.handle_sensitive_pages = NULL;
+ ops.handle_page = NULL;
+
+ kfree(pgbmp);
+ pgbmp = NULL;
+
+ wrprotect.state = STATE_UNINIT;
+}
diff --git a/kernel/livedump.c b/kernel/livedump.c
index 3103292..7be84e2 100644
--- a/kernel/livedump.c
+++ b/kernel/livedump.c
@@ -18,6 +18,8 @@
* MA 02110-1301, USA.
*/

+#include <asm/wrprotect.h>
+
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>
@@ -25,11 +27,43 @@
#define DEVICE_NAME "livedump"

#define LIVEDUMP_IOC(x) _IO(0xff, x)
+#define LIVEDUMP_IOC_START LIVEDUMP_IOC(1)
+#define LIVEDUMP_IOC_SWEEP LIVEDUMP_IOC(2)
+#define LIVEDUMP_IOC_INIT LIVEDUMP_IOC(100)
+#define LIVEDUMP_IOC_UNINIT LIVEDUMP_IOC(101)
+
+static void do_uninit(void)
+{
+ wrprotect_uninit();
+}
+
+static int do_init(void)
+{
+ int ret;
+
+ ret = wrprotect_init(NULL, NULL, NULL);
+ if (WARN(ret, "livedump: Failed to initialize Protection manager.\n"))
+ goto err;
+
+ return 0;
+err:
+ do_uninit();
+ return ret;
+}

static long livedump_ioctl(
struct file *file, unsigned int cmd, unsigned long arg)
{
switch (cmd) {
+ case LIVEDUMP_IOC_START:
+ return wrprotect_start();
+ case LIVEDUMP_IOC_SWEEP:
+ return wrprotect_sweep();
+ case LIVEDUMP_IOC_INIT:
+ return do_init();
+ case LIVEDUMP_IOC_UNINIT:
+ do_uninit();
+ return 0;
default:
return -ENOIOCTLCMD;
}
@@ -76,6 +110,7 @@ module_init(livedump_module_init);
static void livedump_module_exit(void)
{
misc_deregister(&livedump_misc);
+ do_uninit();
}
module_exit(livedump_module_exit);

diff --git a/tools/livedump/livedump b/tools/livedump/livedump
new file mode 100755
index 0000000..b873b39
--- /dev/null
+++ b/tools/livedump/livedump
@@ -0,0 +1,16 @@
+#!/usr/bin/python
+
+import sys
+import fcntl
+
+cmds = {
+ 'start':0xff01,
+ 'sweep':0xff02,
+ 'init':0xff64,
+ 'uninit':0xff65
+ }
+cmd = cmds[sys.argv[1]]
+
+f = open('/dev/livedump')
+fcntl.ioctl(f, cmd)
+f.close()

2012-05-25 11:12:51

by YOSHIDA Masanori

Subject: Re: [RFC PATCH 0/4 V2] introduce: livedump

Hi, Peter

Thank you for the quick reply.

Yes, I know that a PF during NMI handling is dangerous, so livedump doesn't
write-protect pages that can be updated during NMI handling.
Such pages are listed in [3/4] as "sensitive pages".

Currently, I regard the following pages as sensitive pages in [3/4].
- Kernel/Exception/Interrupt stacks
- Page table structure
- All task_structs
- ".data" section of kernel
- All per_cpu areas
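
As a rough illustration of how such a range is kept out of write protection,
here is a minimal user-space sketch. It is not code from the patch: the
bitmap helpers and exclude_sensitive_range() are hypothetical names, and
addresses are treated as physical addresses (the patch converts with __pa()).
The first and last page frames are computed from the range ends, so a range
that is not page-aligned still covers its final page.

```c
#include <limits.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-ins for the kernel's set_bit()/test_bit()/clear_bit(). */
static void bmp_set(unsigned long *bmp, unsigned long n)
{
	bmp[n / BITS_PER_LONG] |= 1UL << (n % BITS_PER_LONG);
}

static int bmp_test(const unsigned long *bmp, unsigned long n)
{
	return (bmp[n / BITS_PER_LONG] >> (n % BITS_PER_LONG)) & 1UL;
}

static void bmp_clear(unsigned long *bmp, unsigned long n)
{
	bmp[n / BITS_PER_LONG] &= ~(1UL << (n % BITS_PER_LONG));
}

/*
 * Handle (dump) every still-selected page that overlaps
 * [start, start + len) and drop it from the bitmap, so that it is
 * never write-protected afterwards.
 * Returns the number of pages handled.
 */
static unsigned long exclude_sensitive_range(
	unsigned long *bmp,
	unsigned long start,
	unsigned long len,
	void (*handle_page)(unsigned long pfn))
{
	unsigned long first, last, pfn, handled = 0;

	if (!len)
		return 0;

	first = start >> PAGE_SHIFT;
	last = (start + len - 1) >> PAGE_SHIFT;

	for (pfn = first; pfn <= last; pfn++) {
		if (!bmp_test(bmp, pfn))
			continue;	/* not selected, or already handled */
		if (handle_page)
			handle_page(pfn);
		bmp_clear(bmp, pfn);
		handled++;
	}
	return handled;
}
```

Calling exclude_sensitive_range() once per sensitive object (a stack, a page
table page, and so on) before protection is set up leaves only the
non-sensitive pages in the bitmap.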

However, I can't guarantee that this list is sufficient to avoid a PF during
NMI handling.
Do you have any idea how to enumerate sensitive pages correctly?

Thank you.


On 2012/05/25 18:25, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 18:12 +0900, YOSHIDA Masanori wrote:
>> Live Dump is based on Copy-on-write technique. Basically processing is
>> performed in the following order.
>> (1) Suspends processing of all CPUs.
>> (2) Makes pages (which you want to dump) read-only.
>> (3) Resumes all CPUs
>> (4) On page fault, dumps a page including a fault address.
>
> Suppose a PF is in progress when all this happens, you mark all RO, then
> an NMI happens, from the NMI context we'll generate another PF to update
> a vmap area, this will again PF because you mucked about and marked
> things RO.
>
> You're now at 3 PFs, which is instant reboot.
>
> I don't think this is going to work.
>

2012-05-25 12:15:17

by YOSHIDA Masanori

Subject: Re: [RFC PATCH 1/4 V2] livedump: Add notifier-call-chain into do_page_fault

Hi, Peter,

I agree that a notifier isn't suitable for this purpose.

In the next version, I plan to replace this with a simple
callback as follows.

if (very_unlikely(under_live_dump))
...

In addition to this, since the PF handler is a very hot path,
I will guard the check with "#ifdef CONFIG_WRPROTECT".
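
To make the idea concrete, here is a minimal user-space sketch of such a
guarded callback. All names except CONFIG_WRPROTECT and under_live_dump are
hypothetical, page_fault_entry() merely stands in for do_page_fault(), and
unlikely() is defined locally with the GCC/Clang builtin because the sketch
is not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Normally set by Kconfig; defined here so that the sketch compiles. */
#define CONFIG_WRPROTECT 1

#define unlikely(x) __builtin_expect(!!(x), 0)	/* GCC/Clang builtin */

/* Hypothetical callback type: returns true if live dump handled the fault. */
typedef bool (*livedump_fault_fn)(unsigned long addr,
				  unsigned long error_code);

static bool under_live_dump;
static livedump_fault_fn livedump_fault_handler;

static void livedump_hook_enable(livedump_fault_fn fn)
{
	livedump_fault_handler = fn;
	under_live_dump = true;
}

static void livedump_hook_disable(void)
{
	under_live_dump = false;
	livedump_fault_handler = NULL;
}

/*
 * Stand-in for the page fault entry point.
 * Returns true if the fault was consumed by live dump, false if the
 * normal page fault handling would continue.
 */
static bool page_fault_entry(unsigned long addr, unsigned long error_code)
{
#ifdef CONFIG_WRPROTECT
	/* One flag test on the hot path; the callback runs only during a dump. */
	if (unlikely(under_live_dump) &&
	    livedump_fault_handler(addr, error_code))
		return true;
#endif
	(void)addr;
	(void)error_code;
	return false;	/* the normal handler would run from here */
}
```

With the option disabled the check compiles away entirely, and with it
enabled the cost outside of a dump is a single flag test.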

Thank you.


On 2012/05/25 18:19, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 18:12 +0900, YOSHIDA Masanori wrote:
>>
>> This patch adds notifier-call-chain that is called in do_page_fault.
>> Livedump uses this to check if page fault is caused by livedump, and if so,
>> the fault is handled by livedump's handler function. Otherwise, it is
>> handled by the original page fault handler.
>
> No, please no notifiers..
>
>
>

2012-06-04 21:09:47

by Vivek Goyal

Subject: Re: [RFC PATCH 0/4 V2] introduce: livedump

On Fri, May 25, 2012 at 06:12:07PM +0900, YOSHIDA Masanori wrote:

[..]
> (4) It allocates about 50% of physical RAM to store dumped pages. Currently
> Live Dump saves all dumped data on memory once, and after that a user
> becomes able to use the dumped data. Live Dump itself has no feature to
> save dumped data onto a disk or any other storage device.

People complain when kdump reserves 128M of memory when system crashes.
I am skeptical that reserving 50% of memory for livedumps is going to fly.

Thanks
Vivek

2012-06-04 21:30:25

by H. Peter Anvin

Subject: Re: [RFC PATCH 0/4 V2] introduce: livedump

On 05/25/2012 02:12 AM, YOSHIDA Masanori wrote:
>
> Such a mechanism is useful especially in the case where very important
> systems are consolidated onto a single machine via virtualization.
> Assuming a KVM host runs multiple important VMs on it and one of them
> fails, the other VMs have to keep running. However, at the same time, an
> administrator may want to obtain memory dump of not only the failed guest
> but also the host because possibly the cause of failure is not in the
> guest but in the host or the hardware under it.
>
> Live Dump is based on Copy-on-write technique. Basically processing is
> performed in the following order.
> (1) Suspends processing of all CPUs.
> (2) Makes pages (which you want to dump) read-only.
> (3) Resumes all CPUs
> (4) On page fault, dumps a page including a fault address.
> (5) Finally, dumps the rest of pages that are not updated.
>
> Currently, Live Dump is just a simple prototype and it has many
> limitations. I list the important ones below.
> (1) It write-protects only kernel's straight mapping areas. Therefore
> memory updates from vmap areas and user space don't cause page fault.
> Pages corresponding to these areas are not consistently dumped.
> (2) It supports only x86-64 architecture.
> (3) It can only handle 4K pages. As we know, most pages in kernel space are
> mapped via 2M or 1G large page mapping. Therefore, the current
> implementation of Live Dump splits all large pages into 4K pages before
> setting up write protection.
> (4) It allocates about 50% of physical RAM to store dumped pages. Currently
> Live Dump saves all dumped data on memory once, and after that a user
> becomes able to use the dumped data. Live Dump itself has no feature to
> save dumped data onto a disk or any other storage device.
>

I am very concerned about the impact of this patch versus its value...
losing half the RAM means the value is extremely limited and the other
limitations above indicate that the cost is very very high.

At the same time, the guest can be dumped without any special tricks.

-hpa

2012-06-05 09:50:49

by YOSHIDA Masanori

Subject: Re: [RFC PATCH 0/4 V2] introduce: livedump

Hi, Vivek and Peter

Thank you for your replies.

Yes, I agree that it is terrible to consume half of memory only for
this purpose.

I think the buffer size can be reduced by dumping memory to disk
on the fly through a buffer of limited size.
However, this means that page fault (PF) handling may sleep if the
buffer is temporarily full.
Sleeping is never acceptable when a PF happens in a preempt_disable()d path,
but I think the number of pages updated in such paths is small.

To reduce the buffer size, we can introduce two types of buffers as follows:
(1) A buffer dedicated to non-preemptible paths
(2) A buffer for everything else

If the first buffer fills up, some updated pages can no longer be tracked.
To avoid this, users need to allocate enough memory for this buffer.

We don't need to care about the second buffer being full,
because PF handling can wait for space by sleeping.
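
A minimal user-space sketch of this two-buffer idea follows. Every name in
it is hypothetical and nothing here is taken from the patch; sleeping is
only modelled by a return code that tells the caller it would have to wait
for the writer to drain the buffer.

```c
#include <stddef.h>

enum buf_result {
	BUF_OK,			/* page queued for dumping */
	BUF_WOULD_SLEEP,	/* general buffer full: caller waits, then retries */
	BUF_LOST		/* atomic buffer full: this page can't be tracked */
};

struct dump_buf {
	unsigned long *pfns;	/* storage for queued page frame numbers */
	size_t cap;		/* number of slots in pfns */
	size_t used;		/* number of slots currently filled */
};

/* Returns 1 if pfn was stored, 0 if the buffer is full. */
static int buf_push(struct dump_buf *b, unsigned long pfn)
{
	if (b->used == b->cap)
		return 0;
	b->pfns[b->used++] = pfn;
	return 1;
}

/* Writer side: pretend that all queued pages were flushed to disk. */
static void buf_drain(struct dump_buf *b)
{
	b->used = 0;
}

/*
 * Queue a faulting page.
 * - can_sleep == 0: non-preemptible path, only the dedicated buffer may
 *   be used, and a full buffer means the page is lost.
 * - can_sleep != 0: the general buffer is used, and a full buffer means
 *   the fault handler may wait for space.
 */
static enum buf_result queue_page(
	struct dump_buf *atomic_buf,
	struct dump_buf *general_buf,
	unsigned long pfn,
	int can_sleep)
{
	if (!can_sleep)
		return buf_push(atomic_buf, pfn) ? BUF_OK : BUF_LOST;
	return buf_push(general_buf, pfn) ? BUF_OK : BUF_WOULD_SLEEP;
}
```

Only BUF_LOST harms the consistency of the dump, which is why the dedicated
buffer is the one that has to be sized generously.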

Regards,
Masanori

Yokohama Research Laboratory,
Hitachi, Ltd.


On 2012/06/05 6:09, Vivek Goyal wrote:
> On Fri, May 25, 2012 at 06:12:07PM +0900, YOSHIDA Masanori wrote:
>
> [..]
>> (4) It allocates about 50% of physical RAM to store dumped pages. Currently
>> Live Dump saves all dumped data on memory once, and after that a user
>> becomes able to use the dumped data. Live Dump itself has no feature to
>> save dumped data onto a disk or any other storage device.
>
> People complain when kdump reserves 128M of memory when system crashes.
> I am skeptical that reserving 50% of memory for livedumps is going to fly.
>
> Thanks
> Vivek