2008-10-20 05:42:27

by Oren Laadan

Subject: [RFC v7][PATCH 0/9] Kernel based checkpoint/restart

I'm re-posting this with a couple of fixes: one fixes the FPU
save/restore, and the other fixes the argument passed to kunmap_atomic().

There is now a git tree at:
git://gorgona.ncl.cs.columbia.edu/pub/git/linux-cr.git

And another tree to track development and older versions:
git://gorgona.ncl.cs.columbia.edu/pub/git/linux-cr-dev.git

I could not have told it better than Dave Hansen:

--

I'd like to see these merged into -mm and on the way to mainline. The
entire freakin' world is cc'd. So sue me. :)

Why do we want it? It allows containers to be moved between physical
machines' kernels in the same way that VMWare can move VMs between
physical machines' hypervisors. There are currently at least two
out-of-tree implementations of this in the commercial world (IBM's
Metacluster and Parallels' OpenVZ/Virtuozzo) and several in the academic
world like Zap.

Why do we need it in mainline now? Because we already have plenty of
out-of-tree ones, and want to know what an in-tree one will be like. :)
What *I* want right now is the extra review and scrutiny that comes with
a mainline submission to make sure we're not going in a direction
contrary to the community.

This only supports pretty simple apps. But, I trust Ingo when he says:
> > Generally, if something works for simple apps already (in a robust,
> > compatible and supportable way) and users find it "very cool", then
> > support for more complex apps is not far in the future. but if you
> > want to support more complex apps straight away, it takes forever and
> > gets ugly.

We're *certainly* going to be changing the ABI (which is the format of
the checkpoint). I'd like to follow the model that we used for
ext4-dev, which is to make it very clear that this is a development-only
feature for now. Perhaps we do that by making the interface only
available through debugfs or something similar for now. Or by reserving
the syscall numbers but requiring some runtime switch to be thrown before
they can be used. I'm open to suggestions here.

These patches are Oren Laadan's baby. Virtually all this code is his,
but he's a bit busy at the moment finishing up his PhD.

There's a plethora of old history and some userspace tools below if you
want some more detail, but please ignore them and look at the kernel
code. :)

--

These patches implement basic checkpoint-restart [CR]. This version
(v7) supports basic tasks with simple private memory, and open files
(regular files and directories only). See original announcements below.

Oren.

--
Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Handle multiple namespaces in a container (e.g. save the filesystem
namespaces state with the file descriptors)
- Security (without CAP_SYS_ADMIN, file restore may fail)

Changelog:

[2008-Oct-17] v7:
- Fix save/restore state of FPU
- Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
- Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
(even though it's not really needed)
- Add assumptions and what's-missing to documentation
- Misc fixes and cleanups

[2008-Sep-11] v5:
- Config is now 'def_bool n' by default
- Improve memory dump/restore code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Fix use of follow_page() to avoid faulting in non-present pages
- Memory restore now maps user pages explicitly to copy data into them,
instead of reading directly to user space; got rid of mprotect_fixup()
- Remove preempt_disable() when restoring debug registers
- Rename header files s/ckpt/checkpoint/
- Fix misc bugs in files dump/restore
- Fixes and cleanups on some error paths
- Fix misc coding style

[2008-Sep-09] v4:
- Various fixes and clean-ups
- Fix calculation of hash table size
- Fix header structure alignment
- Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
- Various fixes and clean-ups
- Use standard hlist_... for hash table
- Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
- Added dump and restore of open files (regular and directories)
- Added basic handling of shared objects, and improve handling of
'parent tag' concept
- Added documentation
- Improved ABI, 64bit padding for image data
- Improved locking when saving/restoring memory
- Added UTS information to header (release, version, machine)
- Cleanup extraction of filename from a file pointer
- Refactor to allow easier reviewing
- Remove requirement for CAP_SYS_ADMIN until we come up with a
security policy (this means that file restore may fail)
- Other cleanup and response to comments for v1

[2008-Jul-29] v1:
- Initial version: support a single task with address space of only
private anonymous or file-mapped VMAs; syscalls ignore pid/crid
argument and act on current process.

--
(Dave Hansen's announcement)

At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach. With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data. We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing. This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs. It will also contain
copies of the actual memory that the process uses. Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor. The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.

--
(Original announcement)

In the recent mini-summit at OLS 2008 and the following days it was
agreed to tackle the checkpoint/restart (CR) by beginning with a very
simple case: save and restore a single task, with simple memory
layout, disregarding other task state such as files, signals etc.

Following these discussions I coded a prototype that can do exactly
that, as a starter. This code adds two system calls - sys_checkpoint
and sys_restart - that a task can call to save and restore its state
respectively. It also demonstrates how the checkpoint image file can
be formatted, and shows its nested nature (e.g. cr_write_mm()
-> cr_write_vma() nesting).
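
To make the image layout a bit more concrete, here is a rough user-space
sketch of the record framing: every record is a small header followed by
its payload, and the header's parent field ties a nested record back to
its owner. The field names (type, len, parent) come from the patches; the
fixed 32-bit widths, the REC_* type codes and the rec_write()/rec_read()
helpers are assumptions made for illustration only, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: names follow the patch, widths are assumed. */
struct rec_hdr {
	int32_t type;
	int32_t len;
	int32_t parent;
};

enum { REC_MM = 1, REC_VMA = 2 };	/* illustrative type codes only */

/* Append header + payload at *pos. Returns 0, or -1 if out of room. */
static int rec_write(uint8_t *buf, size_t cap, size_t *pos,
		     int32_t type, int32_t parent,
		     const void *payload, int32_t len)
{
	struct rec_hdr h = { type, len, parent };

	if (len < 0 || *pos > cap || cap - *pos < sizeof(h) + (size_t)len)
		return -1;
	memcpy(buf + *pos, &h, sizeof(h));
	*pos += sizeof(h);
	memcpy(buf + *pos, payload, (size_t)len);
	*pos += (size_t)len;
	return 0;
}

/*
 * Read one record into out (capacity n). A payload larger than n is
 * rejected, in the spirit of the len check in cr_read_obj().
 * Returns the payload length, or -1 on error.
 */
static int rec_read(const uint8_t *buf, size_t cap, size_t *pos,
		    struct rec_hdr *h, void *out, int32_t n)
{
	if (*pos > cap || cap - *pos < sizeof(*h))
		return -1;
	memcpy(h, buf + *pos, sizeof(*h));
	if (h->len < 0 || h->len > n ||
	    cap - *pos - sizeof(*h) < (size_t)h->len)
		return -1;
	*pos += sizeof(*h);
	memcpy(out, buf + *pos, (size_t)h->len);
	*pos += (size_t)h->len;
	return h->len;
}
```

Writing an "mm" record and then a "vma" record whose parent tag names the
mm gives the flat-but-nested stream described above; the reader walks the
records in order and uses the type and parent fields to rebuild the nesting.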

The state that is saved/restored is the following:
* some of the task_struct
* some of the thread_struct and thread_info
* the cpu state (including FPU)
* the memory address space

In the current code, sys_checkpoint will checkpoint the current task,
although the logic exists to checkpoint other tasks (not in the
checkpointee's execution context). A simple loop will extend this to
handle multiple processes. sys_restart restarts the current task, and
with multiple tasks each task will call the syscall independently.
(Actually, to checkpoint outside the context of a task, it is also
necessary to handle restart-block logic when saving/restoring the
thread data.)

It takes longer to describe what isn't implemented or supported by
this prototype ... basically everything that isn't as simple as the
above.

As for containers - since we still don't have a representation for a
container, this patch has no notion of a container. The tests for
consistent namespaces (and isolation) are also omitted.


2008-10-20 05:42:07

by Oren Laadan

Subject: [RFC v7][PATCH 3/9] x86 support for checkpoint/restart

(Following Dave Hansen's refactoring of the original post)

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.
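
As a rough illustration of the thread-specific part, the sketch below
mimics in user space how the TLS descriptors are counted at checkpoint and
how the saved header is sanity-checked at restart. struct tls_desc,
TLS_ENTRIES, count_tls() and tls_hdr_valid() are stand-ins invented for
this example (the patch itself uses struct desc_struct and
GDT_ENTRY_TLS_ENTRIES); treat it as a sketch of the idea, not the code.

```c
#include <assert.h>
#include <stdint.h>

#define TLS_ENTRIES 3	/* stands in for GDT_ENTRY_TLS_ENTRIES; value assumed */

/* Simplified stand-in for struct desc_struct: two 32-bit words. */
struct tls_desc {
	uint32_t a, b;
};

/* Checkpoint side: count descriptors in use (either word non-zero). */
static int count_tls(const struct tls_desc *desc)
{
	int n, ntls = 0;

	for (n = 0; n < TLS_ENTRIES; n++)
		if (desc[n].a || desc[n].b)
			ntls++;
	return ntls;
}

/* Restart side: the image must agree with this build's TLS layout. */
static int tls_hdr_valid(int entries, int array_size, int ntls)
{
	return entries == TLS_ENTRIES &&
	       array_size == (int)(sizeof(struct tls_desc) * TLS_ENTRIES) &&
	       ntls >= 0 && ntls <= TLS_ENTRIES;
}
```

Saving the array geometry next to the count lets restart refuse an image
taken on a kernel with a different TLS layout instead of copying
descriptors blindly.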

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
arch/x86/mm/Makefile | 2 +
arch/x86/mm/checkpoint.c | 198 ++++++++++++++++++++++++++++++++++++++
arch/x86/mm/restart.c | 194 +++++++++++++++++++++++++++++++++++++
checkpoint/checkpoint.c | 13 +++-
checkpoint/checkpoint_arch.h | 7 ++
checkpoint/restart.c | 13 +++-
include/asm-x86/checkpoint_hdr.h | 72 ++++++++++++++
include/linux/checkpoint_hdr.h | 1 +
8 files changed, 498 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/mm/checkpoint.c
create mode 100644 arch/x86/mm/restart.c
create mode 100644 checkpoint/checkpoint_arch.h
create mode 100644 include/asm-x86/checkpoint_hdr.h

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index dfb932d..58fe072 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -22,3 +22,5 @@ endif
obj-$(CONFIG_ACPI_NUMA) += srat_$(BITS).o

obj-$(CONFIG_MEMTEST) += memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..43dadac
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,198 @@
+/*
+ * Checkpoint/restart - architecture specific support for x86
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct thread_struct *thread;
+ struct desc_struct *desc;
+ int ntls = 0;
+ int n, ret;
+
+ h.type = CR_HDR_THREAD;
+ h.len = sizeof(*hh);
+ h.parent = task_pid_vnr(t);
+
+ thread = &t->thread;
+
+ /* calculate no. of TLS entries that follow */
+ desc = thread->tls_array;
+ for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+ if (desc->a || desc->b)
+ ntls++;
+ }
+
+ hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+ hh->sizeof_tls_array = sizeof(thread->tls_array);
+ hh->ntls = ntls;
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ return ret;
+
+ /* for simplicity dump the entire array, cherry-pick upon restart */
+ ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+ cr_debug("ntls %d\n", ntls);
+
+ /* IGNORE RESTART BLOCKS FOR NOW ... */
+
+ return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else /* !CONFIG_X86_64 */
+
+void cr_write_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ struct pt_regs *regs = task_pt_regs(t);
+
+ hh->bp = regs->bp;
+ hh->bx = regs->bx;
+ hh->ax = regs->ax;
+ hh->cx = regs->cx;
+ hh->dx = regs->dx;
+ hh->si = regs->si;
+ hh->di = regs->di;
+ hh->orig_ax = regs->orig_ax;
+ hh->ip = regs->ip;
+ hh->cs = regs->cs;
+ hh->flags = regs->flags;
+ hh->sp = regs->sp;
+ hh->ss = regs->ss;
+
+ hh->ds = regs->ds;
+ hh->es = regs->es;
+
+ /*
+ * for checkpoint in process context (from within a container)
+ * the GS and FS registers should be saved from the hardware;
+ * otherwise they are already saved on the thread structure
+ */
+ if (t == current) {
+ savesegment(gs, hh->gs);
+ savesegment(fs, hh->fs);
+ } else {
+ hh->gs = thread->gs;
+ hh->fs = thread->fs;
+ }
+
+ /*
+ * for checkpoint in process context (from within a container),
+ * the actual syscall is taking place at this very moment; so
+ * we (optimistically) substitute the future return value (0) of
+ * this syscall into ax, so that upon restart it will
+ * succeed (or it will endlessly retry checkpoint...)
+ */
+ if (t == current) {
+ BUG_ON(hh->orig_ax < 0);
+ hh->ax = 0;
+ }
+}
+
+void cr_write_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+
+ /* debug regs */
+
+ preempt_disable();
+
+ /*
+ * for checkpoint in process context (from within a container),
+ * get the actual registers; otherwise get the saved values.
+ */
+
+ if (t == current) {
+ get_debugreg(hh->debugreg0, 0);
+ get_debugreg(hh->debugreg1, 1);
+ get_debugreg(hh->debugreg2, 2);
+ get_debugreg(hh->debugreg3, 3);
+ get_debugreg(hh->debugreg6, 6);
+ get_debugreg(hh->debugreg7, 7);
+ } else {
+ hh->debugreg0 = thread->debugreg0;
+ hh->debugreg1 = thread->debugreg1;
+ hh->debugreg2 = thread->debugreg2;
+ hh->debugreg3 = thread->debugreg3;
+ hh->debugreg6 = thread->debugreg6;
+ hh->debugreg7 = thread->debugreg7;
+ }
+
+ hh->debugreg4 = 0;
+ hh->debugreg5 = 0;
+
+ hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+
+ preempt_enable();
+}
+
+void cr_write_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ struct thread_info *thread_info = task_thread_info(t);
+
+ /* i387 + MMX + SSE logic */
+
+ preempt_disable();
+
+ hh->used_math = tsk_used_math(t) ? 1 : 0;
+ if (hh->used_math) {
+ /*
+ * normally, no need to unlazy_fpu(), since the TS_USEDFPU flag
+ * has been cleared when the task was context-switched out...
+ * except if we are in process context, in which case we do
+ */
+ if (thread_info->status & TS_USEDFPU)
+ unlazy_fpu(current);
+
+ hh->has_fxsr = cpu_has_fxsr;
+ memcpy(&hh->xstate, thread->xstate, sizeof(*thread->xstate));
+ }
+
+ preempt_enable();
+}
+
+#endif /* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ h.type = CR_HDR_CPU;
+ h.len = sizeof(*hh);
+ h.parent = task_pid_vnr(t);
+
+ cr_write_cpu_regs(hh, t);
+ cr_write_cpu_debug(hh, t);
+ cr_write_cpu_fpu(hh, t);
+
+ cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
new file mode 100644
index 0000000..2bff5eb
--- /dev/null
+++ b/arch/x86/mm/restart.c
@@ -0,0 +1,194 @@
+/*
+ * Checkpoint/restart - architecture specific support for x86
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+ struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ struct thread_struct *thread = &t->thread;
+ int parent, ret;
+
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+ if (parent < 0) {
+ ret = parent;
+ goto out;
+ }
+
+ ret = -EINVAL;
+
+#if 0 /* activate when containers are used */
+ if (parent != task_pid_vnr(t))
+ goto out;
+#endif
+ cr_debug("ntls %d\n", hh->ntls);
+
+ if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+ hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+ hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+ goto out;
+
+ if (hh->ntls > 0) {
+ struct desc_struct *desc;
+ int size, cpu;
+
+ /*
+ * restore TLS by hand: why convert to struct user_desc if
+ * sys_set_thread_area() will convert it back ?
+ */
+
+ size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+ desc = kmalloc(size, GFP_KERNEL);
+ if (!desc)
+ return -ENOMEM;
+
+ ret = cr_kread(ctx, desc, size);
+ if (ret >= 0) {
+ /*
+ * FIX: add sanity checks (e.g. that values make
+ * sense, that we don't overwrite old values, etc.)
+ */
+ cpu = get_cpu();
+ memcpy(thread->tls_array, desc, size);
+ load_TLS(thread, cpu);
+ put_cpu();
+ }
+ kfree(desc);
+ }
+
+ ret = 0;
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else /* !CONFIG_X86_64 */
+
+int cr_read_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ struct pt_regs *regs = task_pt_regs(t);
+
+ regs->bx = hh->bx;
+ regs->cx = hh->cx;
+ regs->dx = hh->dx;
+ regs->si = hh->si;
+ regs->di = hh->di;
+ regs->bp = hh->bp;
+ regs->ax = hh->ax;
+ regs->ds = hh->ds;
+ regs->es = hh->es;
+ regs->orig_ax = hh->orig_ax;
+ regs->ip = hh->ip;
+ regs->cs = hh->cs;
+ regs->flags = hh->flags;
+ regs->sp = hh->sp;
+ regs->ss = hh->ss;
+
+ thread->gs = hh->gs;
+ thread->fs = hh->fs;
+ loadsegment(gs, hh->gs);
+ loadsegment(fs, hh->fs);
+
+ return 0;
+}
+
+int cr_read_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ /* debug regs */
+
+ if (hh->uses_debug) {
+ set_debugreg(hh->debugreg0, 0);
+ set_debugreg(hh->debugreg1, 1);
+ /* ignore 4, 5 */
+ set_debugreg(hh->debugreg2, 2);
+ set_debugreg(hh->debugreg3, 3);
+ set_debugreg(hh->debugreg6, 6);
+ set_debugreg(hh->debugreg7, 7);
+ }
+
+ return 0;
+}
+
+int cr_read_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ int ret;
+
+ /* i387 + MMX + SSE */
+
+ preempt_disable();
+
+ __clear_fpu(t); /* in case we used FPU in user mode */
+
+ if (!hh->used_math)
+ clear_used_math();
+ else {
+ if (hh->has_fxsr != cpu_has_fxsr) {
+ force_sig(SIGFPE, t);
+ return -EINVAL;
+ }
+ /* init_fpu() also calls set_used_math() */
+ ret = init_fpu(current);
+ if (ret < 0)
+ return ret;
+ memcpy(thread->xstate, &hh->xstate, sizeof(*thread->xstate));
+ }
+
+ preempt_enable();
+ return 0;
+}
+
+#endif /* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+ struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ int parent, ret;
+
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+ if (parent < 0) {
+ ret = parent;
+ goto out;
+ }
+
+ ret = -EINVAL;
+
+#if 0 /* activate when containers are used */
+ if (parent != task_pid_vnr(t))
+ goto out;
+#endif
+ /* FIX: sanity check for sensitive registers (eg. eflags) */
+
+ ret = cr_read_cpu_regs(hh, t);
+ if (ret < 0)
+ goto out;
+ ret = cr_read_cpu_debug(hh, t);
+ if (ret < 0)
+ goto out;
+ ret = cr_read_cpu_fpu(hh, t);
+
+ cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e5e188f..6ca26d0 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,8 @@
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

+#include "checkpoint_arch.h"
+
/**
* cr_write_obj - write a record described by a cr_hdr
* @ctx: checkpoint context
@@ -145,8 +147,17 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
}

ret = cr_write_task_struct(ctx, t);
- cr_debug("ret %d\n", ret);
+ cr_debug("task_struct: ret %d\n", ret);
+ if (ret < 0)
+ goto out;
+ ret = cr_write_thread(ctx, t);
+ cr_debug("thread: ret %d\n", ret);
+ if (ret < 0)
+ goto out;
+ ret = cr_write_cpu(ctx, t);
+ cr_debug("cpu: ret %d\n", ret);

+ out:
return ret;
}

diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
new file mode 100644
index 0000000..bf2d21e
--- /dev/null
+++ b/checkpoint/checkpoint_arch.h
@@ -0,0 +1,7 @@
+#include <linux/checkpoint.h>
+
+extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+extern int cr_read_thread(struct cr_ctx *ctx);
+extern int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 69befa7..766e381 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -15,6 +15,8 @@
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

+#include "checkpoint_arch.h"
+
/**
* cr_read_obj - read a whole record (cr_hdr followed by payload)
* @ctx: checkpoint context
@@ -172,8 +174,17 @@ static int cr_read_task(struct cr_ctx *ctx)
int ret;

ret = cr_read_task_struct(ctx);
- cr_debug("ret %d\n", ret);
+ cr_debug("task_struct: ret %d\n", ret);
+ if (ret < 0)
+ goto out;
+ ret = cr_read_thread(ctx);
+ cr_debug("thread: ret %d\n", ret);
+ if (ret < 0)
+ goto out;
+ ret = cr_read_cpu(ctx);
+ cr_debug("cpu: ret %d\n", ret);

+ out:
return ret;
}

diff --git a/include/asm-x86/checkpoint_hdr.h b/include/asm-x86/checkpoint_hdr.h
new file mode 100644
index 0000000..44a903c
--- /dev/null
+++ b/include/asm-x86/checkpoint_hdr.h
@@ -0,0 +1,72 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ * Checkpoint/restart - architecture specific headers x86
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <asm/processor.h>
+
+struct cr_hdr_thread {
+ /* NEED: restart blocks */
+
+ __s16 gdt_entry_tls_entries;
+ __s16 sizeof_tls_array;
+ __s16 ntls; /* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_cpu {
+ /* see struct pt_regs (x86-64) */
+ __u64 r15;
+ __u64 r14;
+ __u64 r13;
+ __u64 r12;
+ __u64 bp;
+ __u64 bx;
+ __u64 r11;
+ __u64 r10;
+ __u64 r9;
+ __u64 r8;
+ __u64 ax;
+ __u64 cx;
+ __u64 dx;
+ __u64 si;
+ __u64 di;
+ __u64 orig_ax;
+ __u64 ip;
+ __u64 cs;
+ __u64 flags;
+ __u64 sp;
+ __u64 ss;
+
+ /* segment registers */
+ __u64 ds;
+ __u64 es;
+ __u64 fs;
+ __u64 gs;
+
+ /* debug registers */
+ __u64 debugreg0;
+ __u64 debugreg1;
+ __u64 debugreg2;
+ __u64 debugreg3;
+ __u64 debugreg4;
+ __u64 debugreg5;
+ __u64 debugreg6;
+ __u64 debugreg7;
+
+ __u16 uses_debug;
+ __u16 used_math;
+ __u16 has_fxsr;
+ __u16 _padding;
+
+ union thread_xstate xstate; /* i387 */
+
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 79e4df2..03ec72e 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,7 @@

#include <linux/types.h>
#include <linux/utsname.h>
+#include <asm/checkpoint_hdr.h>

/*
* To maintain compatibility between 32-bit and 64-bit architecture flavors,
--
1.5.4.3

2008-10-20 05:42:45

by Oren Laadan

Subject: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

Add the checkpoint/restart interfaces, as well as the helpers needed to
easily manage the file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
checkpoint/restart context (a per-checkpoint data structure for
housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.
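
For reviewers skimming the header handling: the image header records the
kernel version as three bytes taken from LINUX_VERSION_CODE, and restart
rejects an image unless all three match the running kernel. A minimal
user-space sketch of that arithmetic follows; VERSION_CODE() uses the same
packing as the kernel's KERNEL_VERSION() macro, while struct img_version,
split_version() and version_matches() are names made up for this example.

```c
#include <assert.h>

/* Same packing as KERNEL_VERSION(a, b, c). */
#define VERSION_CODE(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical holder for the three version bytes stored in the header. */
struct img_version {
	unsigned major, minor, patch;
};

/* Split a packed version code the way the checkpoint header does. */
static struct img_version split_version(unsigned code)
{
	struct img_version v = {
		.major = (code >> 16) & 0xff,
		.minor = (code >> 8) & 0xff,
		.patch = code & 0xff,
	};
	return v;
}

/* Restart-side check: all three components must match exactly. */
static int version_matches(struct img_version img, unsigned running_code)
{
	struct img_version cur = split_version(running_code);

	return img.major == cur.major &&
	       img.minor == cur.minor &&
	       img.patch == cur.patch;
}
```

For example, split_version(VERSION_CODE(2, 6, 27)) yields major 2, minor 6,
patch 27, and an image carrying those values would be refused by a kernel
whose code is VERSION_CODE(2, 6, 28). Such an exact match is strict, which
fits the stated expectation that the image format will keep changing.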

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
Makefile | 2 +-
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.c | 174 +++++++++++++++++++++++++++++++
checkpoint/restart.c | 197 ++++++++++++++++++++++++++++++++++++
checkpoint/sys.c | 219 +++++++++++++++++++++++++++++++++++++++-
fs/read_write.c | 4 +-
include/linux/checkpoint.h | 60 +++++++++++
include/linux/checkpoint_hdr.h | 75 ++++++++++++++
include/linux/magic.h | 3 +
9 files changed, 728 insertions(+), 8 deletions(-)
create mode 100644 checkpoint/checkpoint.c
create mode 100644 checkpoint/restart.c
create mode 100644 include/linux/checkpoint.h
create mode 100644 include/linux/checkpoint_hdr.h

diff --git a/Makefile b/Makefile
index ce9eceb..cb99128 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd


ifeq ($(KBUILD_EXTMOD),)
-core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/

vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
# Makefile for linux checkpoint/restart.
#

-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..e5e188f
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,174 @@
+/*
+ * Checkpoint logic and helpers
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+ int ret;
+
+ ret = cr_kwrite(ctx, h, sizeof(*h));
+ if (ret < 0)
+ return ret;
+ return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+ struct cr_hdr h;
+
+ h.type = CR_HDR_STRING;
+ h.len = len;
+ h.parent = 0;
+
+ return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+ struct cr_hdr h;
+ struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct new_utsname *uts;
+ struct timeval ktv;
+ int ret;
+
+ h.type = CR_HDR_HEAD;
+ h.len = sizeof(*hh);
+ h.parent = 0;
+
+ do_gettimeofday(&ktv);
+
+ hh->magic = CHECKPOINT_MAGIC_HEAD;
+ hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+ hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+ hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+ hh->rev = CR_VERSION;
+
+ hh->flags = ctx->flags;
+ hh->time = ktv.tv_sec;
+
+ uts = utsname();
+ memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+ memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+ memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+ struct cr_hdr h;
+ struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ h.type = CR_HDR_TAIL;
+ h.len = sizeof(*hh);
+ h.parent = 0;
+
+ hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ h.type = CR_HDR_TASK;
+ h.len = sizeof(*hh);
+ h.parent = 0;
+
+ hh->state = t->state;
+ hh->exit_state = t->exit_state;
+ hh->exit_code = t->exit_code;
+ hh->exit_signal = t->exit_signal;
+
+ hh->task_comm_len = TASK_COMM_LEN;
+
+ /* FIXME: save remaining relevant task_struct fields */
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ return ret;
+
+ return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+ int ret;
+
+ if (t->state == TASK_DEAD) {
+ pr_warning("CR: task may not be in state TASK_DEAD\n");
+ return -EAGAIN;
+ }
+
+ ret = cr_write_task_struct(ctx, t);
+ cr_debug("ret %d\n", ret);
+
+ return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx)
+{
+ int ret;
+
+ /* FIX: need to test whether container is checkpointable */
+
+ ret = cr_write_head(ctx);
+ if (ret < 0)
+ goto out;
+ ret = cr_write_task(ctx, current);
+ if (ret < 0)
+ goto out;
+ ret = cr_write_tail(ctx);
+ if (ret < 0)
+ goto out;
+
+ /* on success, return (unique) checkpoint identifier */
+ ret = ctx->crid;
+
+ out:
+ return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..69befa7
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,197 @@
+/*
+ * Restart logic and helpers
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @n: available buffer size
+ *
+ * Returns size of payload
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
+{
+ int ret;
+
+ ret = cr_kread(ctx, h, sizeof(*h));
+ if (ret < 0)
+ return ret;
+
+ cr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
+
+ if (h->len < 0 || h->len > n)
+ return -EINVAL;
+
+ return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @n: available buffer size
+ * @type: expected record type
+ *
+ * Returns object reference of the parent object
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
+{
+ struct cr_hdr h;
+ int ret;
+
+ ret = cr_read_obj(ctx, &h, buf, n);
+ if (ret < 0)
+ return ret;
+
+ ret = -EINVAL;
+ if (h.type == type)
+ ret = h.parent;
+
+ return ret;
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string buffer length
+ */
+int cr_read_string(struct cr_ctx *ctx, void *str, int len)
+{
+ return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+ struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int parent, ret = -EINVAL;
+
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+ if (parent < 0) {
+ ret = parent;
+ goto out;
+ } else if (parent != 0)
+ goto out;
+
+ if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+ hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+ hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+ hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+ goto out;
+
+ if (hh->flags & ~CR_CTX_CKPT)
+ goto out;
+
+ ctx->oflags = hh->flags;
+
+ /* FIX: verify compatibility of release, version and machine */
+
+ ret = 0;
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+ struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int parent, ret = -EINVAL;
+
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+ if (parent < 0) {
+ ret = parent;
+ goto out;
+ } else if (parent != 0)
+ goto out;
+
+ if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+ goto out;
+
+ ret = 0;
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+ struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ char *buf;
+ int parent, ret = -EINVAL;
+
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+ if (parent < 0) {
+ ret = parent;
+ goto out;
+ } else if (parent != 0)
+ goto out;
+
+ /* upper limit for task_comm_len to prevent DoS */
+ if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
+ goto out;
+ ret = -ENOMEM;
+ buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+ if (!buf)
+ goto out;
+ ret = cr_read_string(ctx, buf, hh->task_comm_len);
+ if (!ret) {
+ /* if t->comm is too long, silently truncate */
+ memset(t->comm, 0, TASK_COMM_LEN);
+ memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
+ }
+ kfree(buf);
+
+ /* FIXME: restore remaining relevant task_struct fields */
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+ int ret;
+
+ ret = cr_read_task_struct(ctx);
+ cr_debug("ret %d\n", ret);
+
+ return ret;
+}
+
+int do_restart(struct cr_ctx *ctx)
+{
+ int ret;
+
+ ret = cr_read_head(ctx);
+ if (ret < 0)
+ goto out;
+ ret = cr_read_task(ctx);
+ if (ret < 0)
+ goto out;
+ ret = cr_read_tail(ctx);
+ if (ret < 0)
+ goto out;
+
+ /* on success, adjust the return value if needed [TODO] */
+ out:
+ return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 375129c..6c8ba56 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -10,6 +10,187 @@

#include <linux/sched.h>
#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * helpers to write/read to/from the image file descriptor
+ *
+ * cr_uwrite() - write a user-space buffer to the checkpoint image
+ * cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ * cr_uread() - read from the checkpoint image to a user-space buffer
+ * cr_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+ struct file *file = ctx->file;
+ ssize_t nwrite;
+ int nleft;
+
+ for (nleft = count; nleft; nleft -= nwrite) {
+ loff_t pos = file_pos_read(file);
+ nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
+ file_pos_write(file, pos);
+ if (nwrite <= 0) {
+ if (nwrite == -EAGAIN)
+ nwrite = 0;
+ else
+ return nwrite;
+ }
+ buf += nwrite;
+ }
+
+ ctx->total += count;
+ return 0;
+}
+
+int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+ mm_segment_t oldfs;
+ int ret;
+
+ oldfs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = cr_uwrite(ctx, buf, count);
+ set_fs(oldfs);
+
+ return ret;
+}
+
+int cr_uread(struct cr_ctx *ctx, void *buf, int count)
+{
+ struct file *file = ctx->file;
+ ssize_t nread;
+ int nleft;
+
+ for (nleft = count; nleft; nleft -= nread) {
+ loff_t pos = file_pos_read(file);
+ nread = vfs_read(file, (char __user *) buf, nleft, &pos);
+ file_pos_write(file, pos);
+ if (nread <= 0) {
+ if (nread == -EAGAIN)
+ nread = 0;
+ else
+ return nread;
+ }
+ buf += nread;
+ }
+
+ ctx->total += count;
+ return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *buf, int count)
+{
+ mm_segment_t oldfs;
+ int ret;
+
+ oldfs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = cr_uread(ctx, buf, count);
+ set_fs(oldfs);
+
+ return ret;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, one should call cr_hbuf_get() to
+ * reserve space in the buffer, and then cr_hbuf_put() when that space
+ * is no longer needed.
+ */
+
+/*
+ * ctx->hbuf is used to hold headers and data of known (or bound),
+ * static sizes. In some cases, multiple headers may be allocated in
+ * a nested manner. The size should accommodate all headers, nested
+ * or not, on all archs.
+ */
+#define CR_HBUF_TOTAL (8 * 4096)
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * Returns pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+ void *ptr;
+
+ /*
+ * Since requests depend on logic and static header sizes (not on
+ * user data), space should always suffice, unless someone either
+ * made a structure bigger or call path deeper than expected.
+ */
+ BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+ ptr = ctx->hbuf + ctx->hpos;
+ ctx->hpos += n;
+ return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+ BUG_ON(ctx->hpos < n);
+ ctx->hpos -= n;
+}
+
+/*
+ * helpers to manage CR contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+/* unique checkpoint identifier (FIXME: should be per-container) */
+static atomic_t cr_ctx_count;
+
+void cr_ctx_free(struct cr_ctx *ctx)
+{
+ if (ctx->file)
+ fput(ctx->file);
+
+ kfree(ctx->hbuf);
+
+ kfree(ctx);
+}
+
+struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
+{
+ struct cr_ctx *ctx;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return ERR_PTR(-ENOMEM);
+
+ ctx->file = fget(fd);
+ if (!ctx->file) {
+ cr_ctx_free(ctx);
+ return ERR_PTR(-EBADF);
+ }
+
+ ctx->hbuf = kmalloc(CR_HBUF_TOTAL, GFP_KERNEL);
+ if (!ctx->hbuf) {
+ cr_ctx_free(ctx);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ ctx->pid = pid;
+ ctx->flags = flags;
+
+ ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+ return ctx;
+}

/**
* sys_checkpoint - checkpoint a container
@@ -22,9 +203,26 @@
*/
asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
{
- pr_debug("sys_checkpoint not implemented yet\n");
- return -ENOSYS;
+ struct cr_ctx *ctx;
+ int ret;
+
+ /* no flags for now */
+ if (flags)
+ return -EINVAL;
+
+ ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = do_checkpoint(ctx);
+
+ if (!ret)
+ ret = ctx->crid;
+
+ cr_ctx_free(ctx);
+ return ret;
}
+
/**
* sys_restart - restart a container
* @crid: checkpoint image identifier
@@ -36,6 +234,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
*/
asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
{
- pr_debug("sys_restart not implemented yet\n");
- return -ENOSYS;
+ struct cr_ctx *ctx;
+ int ret;
+
+ /* no flags for now */
+ if (flags)
+ return -EINVAL;
+
+ ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = do_restart(ctx);
+
+ cr_ctx_free(ctx);
+ return ret;
}
diff --git a/fs/read_write.c b/fs/read_write.c
index 9ba495d..e2deded 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -324,12 +324,12 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_

EXPORT_SYMBOL(vfs_write);

-static inline loff_t file_pos_read(struct file *file)
+inline loff_t file_pos_read(struct file *file)
{
return file->f_pos;
}

-static inline void file_pos_write(struct file *file, loff_t pos)
+inline void file_pos_write(struct file *file, loff_t pos)
{
file->f_pos = pos;
}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..93ff0ce
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,60 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#define CR_VERSION 1
+
+struct cr_ctx {
+ pid_t pid; /* container identifier */
+ int crid; /* unique checkpoint id */
+
+ unsigned long flags;
+ unsigned long oflags; /* restart: old flags */
+
+ struct file *file;
+ int total; /* total read/written */
+
+ void *hbuf; /* temporary buffer for headers */
+ int hpos; /* position in headers buffer */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT 0x1
+#define CR_CTX_RSTR 0x2
+
+extern int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_uread(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
+extern int cr_read_string(struct cr_ctx *ctx, void *str, int len);
+
+extern int do_checkpoint(struct cr_ctx *ctx);
+extern int do_restart(struct cr_ctx *ctx);
+
+/* these are from fs/read_write.c, not exported otherwise in a header */
+extern loff_t file_pos_read(struct file *file);
+extern void file_pos_write(struct file *file, loff_t pos);
+
+#define cr_debug(fmt, args...) \
+ pr_debug("[CR:%s] " fmt, __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..79e4df2
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,75 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+ __s16 type;
+ __s16 len;
+ __u32 parent;
+};
+
+/* header types */
+enum {
+ CR_HDR_HEAD = 1,
+ CR_HDR_STRING,
+
+ CR_HDR_TASK = 101,
+ CR_HDR_THREAD,
+ CR_HDR_CPU,
+
+ CR_HDR_MM = 201,
+ CR_HDR_VMA,
+ CR_HDR_MM_CONTEXT,
+
+ CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+ __u64 magic;
+
+ __u16 major;
+ __u16 minor;
+ __u16 patch;
+ __u16 rev;
+
+ __u64 time; /* when checkpoint taken */
+ __u64 flags; /* checkpoint options */
+
+ char release[__NEW_UTS_LEN];
+ char version[__NEW_UTS_LEN];
+ char machine[__NEW_UTS_LEN];
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+ __u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+ __u32 state;
+ __u32 exit_state;
+ __u32 exit_code;
+ __u32 exit_signal;
+
+ __s32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 1fa0c2c..c2b811c 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -42,4 +42,7 @@
#define FUTEXFS_SUPER_MAGIC 0xBAD1DEA
#define INOTIFYFS_SUPER_MAGIC 0x2BAD1DEA

+#define CHECKPOINT_MAGIC_HEAD 0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL 0x002d2a0cc0deef00LL
+
#endif /* __LINUX_MAGIC_H__ */
--
1.5.4.3
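
As a reading aid for the cr_hbuf_get()/cr_hbuf_put() helpers in this
patch: ctx->hbuf is used like a small stack, so nested readers reserve
on top of their callers and must release in reverse order. Below is a
minimal user-space sketch of that discipline, not the kernel code.
struct hbuf_ctx, hbuf_get() and hbuf_put() are hypothetical stand-ins,
and the sketch returns NULL/-1 where the patch uses BUG_ON().

```c
#include <stddef.h>

#define HBUF_TOTAL (8 * 4096)	/* same size as CR_HBUF_TOTAL in the patch */

/* hypothetical user-space stand-in for the hbuf fields of struct cr_ctx */
struct hbuf_ctx {
	unsigned char hbuf[HBUF_TOTAL];
	int hpos;		/* current top of the reservation stack */
};

/* reserve n bytes on top; returns NULL where the patch would BUG_ON() */
static inline void *hbuf_get(struct hbuf_ctx *ctx, int n)
{
	void *ptr;

	if (n < 0 || n > HBUF_TOTAL - ctx->hpos)
		return NULL;
	ptr = ctx->hbuf + ctx->hpos;
	ctx->hpos += n;
	return ptr;
}

/* release the n most recently reserved bytes; returns -1 on underflow */
static inline int hbuf_put(struct hbuf_ctx *ctx, int n)
{
	if (n < 0 || ctx->hpos < n)
		return -1;
	ctx->hpos -= n;
	return 0;
}
```

A caller that reserves 64 bytes and then calls a helper that reserves
16 more sees hpos go 0 -> 64 -> 80 and back to 0 once both have called
hbuf_put() with the sizes they reserved.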

2008-10-20 05:42:58

by Oren Laadan

[permalink] [raw]
Subject: [RFC v7][PATCH 1/9] Create syscalls: sys_checkpoint, sys_restart

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
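
For readers who want to try the interface from user space, here is a
minimal sketch, assuming a 32-bit x86 kernel with this series applied.
The numbers 333 and 334 are the ones this patch adds to unistd_32.h; on
any other kernel they refer to unrelated syscalls, so the wrappers must
not be called there. cr_checkpoint(), cr_restart() and cr_classify()
are hypothetical names; there is no libc wrapper.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/* from this patch's unistd_32.h; meaningful only on a patched x86_32 kernel */
#define CR_NR_CHECKPOINT 333
#define CR_NR_RESTART    334

/* hypothetical wrapper; syscall() reports errors as -1 with errno set */
static inline long cr_checkpoint(pid_t pid, int fd, unsigned long flags)
{
	return syscall(CR_NR_CHECKPOINT, pid, fd, flags);
}

/* hypothetical wrapper; on success execution resumes in the restored image */
static inline long cr_restart(int crid, int fd, unsigned long flags)
{
	return syscall(CR_NR_RESTART, crid, fd, flags);
}

enum cr_outcome { CR_ERROR, CR_RESTARTED, CR_CHECKPOINTED };

/* interpret a sys_checkpoint return value as described in the kerneldoc */
static inline enum cr_outcome cr_classify(long ret)
{
	if (ret < 0)
		return CR_ERROR;	/* error */
	if (ret == 0)
		return CR_RESTARTED;	/* returning from a restart */
	return CR_CHECKPOINTED;		/* ret is the checkpoint identifier */
}
```

With this version of the patch both calls fail with ENOSYS; flags
should be passed as 0.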
arch/x86/kernel/syscall_table_32.S | 2 +
checkpoint/Kconfig | 11 +++++++++
checkpoint/Makefile | 5 ++++
checkpoint/sys.c | 41 ++++++++++++++++++++++++++++++++++++
include/asm-x86/unistd_32.h | 2 +
include/linux/syscalls.h | 2 +
init/Kconfig | 2 +
kernel/sys_ni.c | 4 +++
8 files changed, 69 insertions(+), 0 deletions(-)
create mode 100644 checkpoint/Kconfig
create mode 100644 checkpoint/Makefile
create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
.long sys_dup3 /* 330 */
.long sys_pipe2
.long sys_inotify_init1
+ .long sys_checkpoint
+ .long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..ffaa635
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,11 @@
+config CHECKPOINT_RESTART
+ prompt "Enable checkpoint/restart (EXPERIMENTAL)"
+ def_bool n
+ depends on X86_32 && EXPERIMENTAL
+ help
+ Application checkpoint/restart is the ability to save the
+ state of a running application so that it can later resume
+ its execution from the time at which it was checkpointed.
+
+ Turning this option on will enable checkpoint and restart
+ functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..07d018b
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..375129c
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+ pr_debug("sys_checkpoint not implemented yet\n");
+ return -ENOSYS;
+}
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+ pr_debug("sys_restart not implemented yet\n");
+ return -ENOSYS;
+}
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index d739467..88bdec4 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -338,6 +338,8 @@
#define __NR_dup3 330
#define __NR_pipe2 331
#define __NR_inotify_init1 332
+#define __NR_checkpoint 333
+#define __NR_restart 334

#ifdef __KERNEL__

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d6ff145..edc218b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
asmlinkage long sys_eventfd(unsigned int count);
asmlinkage long sys_eventfd2(unsigned int count, int flags);
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

diff --git a/init/Kconfig b/init/Kconfig
index c11da38..fd5f7bf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -779,6 +779,8 @@ config MARKERS

source "arch/Kconfig"

+source "checkpoint/Kconfig"
+
config PROC_PAGE_MONITOR
default y
depends on PROC_FS && MMU
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 08d6e1b..ca95c25 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -168,3 +168,7 @@ cond_syscall(compat_sys_timerfd_settime);
cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
--
1.5.4.3

2008-10-20 05:43:42

by Oren Laadan

[permalink] [raw]
Subject: [RFC v7][PATCH 7/9] Infrastructure for shared objects

Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e.g. open files, memory address space,
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
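
The two indexing modes described above can be sketched in user space
as follows. This is only an illustration under simplifying
assumptions: struct objhash, struct objent, bucket_of() and the
objhash_*() functions are hypothetical stand-ins for the code in
checkpoint/objhash.c, the hash function is not the kernel's
hash_ptr(), and object types, reference counting and teardown are
left out.

```c
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 1024	/* 1 << CR_OBJHASH_NBITS in the patch */

struct objent {
	int objref;
	void *ptr;
	struct objent *next;
};

struct objhash {
	struct objent *bucket[NBUCKETS];
	int next_free_objref;	/* initialize to 1, as the patch does */
};

/* hypothetical hash; the patch uses the kernel's hash_ptr() */
static inline unsigned int bucket_of(uintptr_t key)
{
	return (unsigned int)(((key >> 4) ^ key) % NBUCKETS);
}

static inline struct objent *objhash_insert(struct objhash *h, uintptr_t key,
					    void *ptr, int objref)
{
	struct objent *e = malloc(sizeof(*e));
	unsigned int i = bucket_of(key);

	if (!e)
		return NULL;
	e->ptr = ptr;
	e->objref = objref;
	e->next = h->bucket[i];
	h->bucket[i] = e;
	return e;
}

/* checkpoint: key is the pointer; 0 if found, 1 if added, -1 on error */
static inline int objhash_add_ptr(struct objhash *h, void *ptr, int *objref)
{
	struct objent *e;

	for (e = h->bucket[bucket_of((uintptr_t)ptr)]; e; e = e->next) {
		if (e->ptr == ptr) {
			*objref = e->objref;
			return 0;
		}
	}
	e = objhash_insert(h, (uintptr_t)ptr, ptr, h->next_free_objref);
	if (!e)
		return -1;
	h->next_free_objref++;
	*objref = e->objref;
	return 1;
}

/* restart: key is the objref read from the image */
static inline int objhash_add_ref(struct objhash *h, void *ptr, int objref)
{
	return objhash_insert(h, (uintptr_t)objref, ptr, objref) ? 0 : -1;
}

static inline void *objhash_get_by_ref(struct objhash *h, int objref)
{
	struct objent *e;

	for (e = h->bucket[bucket_of((uintptr_t)objref)]; e; e = e->next)
		if (e->objref == objref)
			return e->ptr;
	return NULL;
}
```

At checkpoint, a return of 1 from objhash_add_ptr() means "dump the
full state now" and 0 means "write only the objref". At restart, a
NULL from objhash_get_by_ref() means the object has not been created
yet, so its state is read and it is registered with objhash_add_ref().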
Documentation/checkpoint.txt | 46 +++++++
checkpoint/Makefile | 2 +-
checkpoint/objhash.c | 268 ++++++++++++++++++++++++++++++++++++++++++
checkpoint/sys.c | 6 +
include/linux/checkpoint.h | 20 +++
5 files changed, 341 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/objhash.c

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
index a73a4f3..a9ea79c 100644
--- a/Documentation/checkpoint.txt
+++ b/Documentation/checkpoint.txt
@@ -189,6 +189,52 @@ cr_hdr + cr_hdr_task
cr_hdr + cr_hdr_tail


+=== Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects and whether they were already saved. Shared
+objects are stored in a hash table as they appear, indexed by their
+kernel address. (The hash table itself is not saved as part of the
+checkpoint image: it is constructed dynamically during both checkpoint
+and restart, and discarded at the end of the operation).
+
+Each shared object that is found is first looked up in the hash table.
+On the first encounter, the object will not be found, so its state is
+dumped, and the object is assigned a unique identifier and also stored
+in the hash table. Subsequent lookups of that object in the hash table
+will yield that entry, and then only the unique identifier is saved,
+as opposed to the entire state of the object.
+
+During restart, shared objects are seen by their unique identifiers as
+assigned during the checkpoint. Each shared object that is read in is
+first looked up in the hash table. On the first encounter it will not
+be found, meaning that the object needs to be created and its state
+read in and restored. Then the object is added to the hash table, this
+time indexed by its unique identifier. Subsequent lookups of the same
+unique identifier in the hash table will yield that entry, and then
+the existing object instance is reused instead of creating another one.
+
+The interface for the hash table is the following:
+
+cr_obj_get_by_ptr() - find the unique object reference (objref)
+ of the object that is pointed to by ptr [checkpoint]
+
+cr_obj_add_ptr() - add the object pointed to by ptr to the hash table
+ if not already there, and fill its unique object reference (objref)
+
+cr_obj_get_by_ref() - return the pointer to the object whose unique
+ object reference is equal to objref [restart]
+
+cr_obj_add_ref() - add the object with given unique object reference
+ (objref), pointed to by ptr to the hash table. [restart]
+
+
=== Current Implementation

[2008-Oct-07]
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index ac35033..9843fb9 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,5 @@
# Makefile for linux checkpoint/restart.
#

-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..05b1a1b
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,268 @@
+/*
+ * Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+
+struct cr_objref {
+ int objref;
+ void *ptr;
+ unsigned short type;
+ unsigned short flags;
+ struct hlist_node hash;
+};
+
+struct cr_objhash {
+ struct hlist_head *head;
+ int next_free_objref;
+};
+
+#define CR_OBJHASH_NBITS 10
+#define CR_OBJHASH_TOTAL (1UL << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_objref *obj)
+{
+ switch (obj->type) {
+ case CR_OBJ_FILE:
+ fput((struct file *) obj->ptr);
+ break;
+ default:
+ BUG();
+ }
+}
+
+static void cr_obj_ref_grab(struct cr_objref *obj)
+{
+ switch (obj->type) {
+ case CR_OBJ_FILE:
+ get_file((struct file *) obj->ptr);
+ break;
+ default:
+ BUG();
+ }
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+ struct hlist_head *h = objhash->head;
+ struct hlist_node *n, *t;
+ struct cr_objref *obj;
+ int i;
+
+ for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
+ hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+ cr_obj_ref_drop(obj);
+ kfree(obj);
+ }
+ }
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+ struct cr_objhash *objhash = ctx->objhash;
+
+ if (objhash) {
+ cr_objhash_clear(objhash);
+ kfree(objhash->head);
+ kfree(ctx->objhash);
+ ctx->objhash = NULL;
+ }
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+ struct cr_objhash *objhash;
+ struct hlist_head *head;
+
+ objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+ if (!objhash)
+ return -ENOMEM;
+ head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
+ if (!head) {
+ kfree(objhash);
+ return -ENOMEM;
+ }
+
+ objhash->head = head;
+ objhash->next_free_objref = 1;
+
+ ctx->objhash = objhash;
+ return 0;
+}
+
+static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+ struct hlist_head *h;
+ struct hlist_node *n;
+ struct cr_objref *obj;
+
+ h = &ctx->objhash->head[hash_ptr(ptr, CR_OBJHASH_NBITS)];
+ hlist_for_each_entry(obj, n, h, hash)
+ if (obj->ptr == ptr)
+ return obj;
+ return NULL;
+}
+
+static struct cr_objref *cr_obj_find_by_objref(struct cr_ctx *ctx, int objref)
+{
+ struct hlist_head *h;
+ struct hlist_node *n;
+ struct cr_objref *obj;
+
+ h = &ctx->objhash->head[hash_ptr((void *) objref, CR_OBJHASH_NBITS)];
+ hlist_for_each_entry(obj, n, h, hash)
+ if (obj->objref == objref)
+ return obj;
+ return NULL;
+}
+
+/**
+ * cr_obj_new - allocate an object and add to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Allocate an object referring to @ptr and add to the hash table.
+ * If @objref is zero, assign a unique object reference and use @ptr
+ * as a hash key [checkpoint]. Else use @objref as a key [restart].
+ */
+static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
+ unsigned short type, unsigned short flags)
+{
+ struct cr_objref *obj;
+ int i;
+
+ obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+ if (!obj)
+ return NULL;
+
+ obj->ptr = ptr;
+ obj->type = type;
+ obj->flags = flags;
+
+ if (objref) {
+ /* use @objref to index (restart) */
+ obj->objref = objref;
+ i = hash_ptr((void *) objref, CR_OBJHASH_NBITS);
+ } else {
+ /* use @ptr to index, assign objref (checkpoint) */
+ obj->objref = ctx->objhash->next_free_objref++;
+ i = hash_ptr(ptr, CR_OBJHASH_NBITS);
+ }
+
+ hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
+ cr_obj_ref_grab(obj);
+ return obj;
+}
+
+/**
+ * cr_obj_add_ptr - add an object to the hash table if not already there
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference [output]
+ * @type: object type
+ * @flags: object flags
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, then add the object to the table, and allocate a
+ * fresh unique object reference (objref). Fills the unique objref of
+ * the object into @objref.
+ * [This is used during checkpoint].
+ *
+ * Returns 0 if found, 1 if added, < 0 on error
+ */
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+ unsigned short type, unsigned short flags)
+{
+ struct cr_objref *obj;
+ int ret = 0;
+
+ obj = cr_obj_find_by_ptr(ctx, ptr);
+ if (!obj) {
+ obj = cr_obj_new(ctx, ptr, 0, type, flags);
+ if (!obj)
+ return -ENOMEM;
+ else
+ ret = 1;
+ } else if (obj->type != type) /* sanity check */
+ return -EINVAL;
+ *objref = obj->objref;
+ return ret;
+}
+
+/**
+ * cr_obj_add_ref - add an object with unique objref to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique identifier - object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Add the object pointed to by @ptr and identified by unique object
+ * reference given by @objref to the hash table (indexed by @objref).
+ * [This is used during restart].
+ */
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+ unsigned short type, unsigned short flags)
+{
+ struct cr_objref *obj;
+
+ obj = cr_obj_new(ctx, ptr, objref, type, flags);
+ return obj ? 0 : -ENOMEM;
+}
+
+/**
+ * cr_obj_get_by_ptr - find the unique object reference of an object
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Look up the unique object reference (objref) of the object pointed
+ * to by @ptr, and return that number, or -ESRCH if not found (-EINVAL
+ * if found with a different @type). [This is used during checkpoint].
+ */
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+ struct cr_objref *obj;
+
+ obj = cr_obj_find_by_ptr(ctx, ptr);
+ if (!obj)
+ return -ESRCH;
+ if (obj->type != type)
+ return -EINVAL;
+ return obj->objref;
+}
+
+/**
+ * cr_obj_get_by_ref - find an object given its unique object reference
+ * @ctx: checkpoint context
+ * @objref: unique identifier - object reference
+ * @type: object type
+ *
+ * Look up the object that is identified by the unique object reference
+ * specified by @objref, and return a pointer to the matching object,
+ * NULL if not found, or ERR_PTR(-EINVAL) if its type is not @type.
+ * [This is used during restart].
+ */
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type)
+{
+ struct cr_objref *obj;
+
+ obj = cr_obj_find_by_objref(ctx, objref);
+ if (!obj)
+ return NULL;
+ if (obj->type != type)
+ return ERR_PTR(-EINVAL);
+ return obj->ptr;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 6a18966..c1f2c8f 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -167,6 +167,7 @@ void cr_ctx_free(struct cr_ctx *ctx)
path_put(ctx->vfsroot);

cr_pgarr_free(ctx);
+ cr_objhash_free(ctx);

kfree(ctx);
}
@@ -191,6 +192,11 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
return ERR_PTR(-ENOMEM);
}

+ if (cr_objhash_alloc(ctx) < 0) {
+ cr_ctx_free(ctx);
+ return ERR_PTR(-ENOMEM);
+ }
+
/*
* assume checkpointer is in container's root vfs
* FIXME: this works for now, but will change with real containers
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3c6d1d1..2da3a9f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -28,6 +28,8 @@ struct cr_ctx {
void *hbuf; /* temporary buffer for headers */
int hpos; /* position in headers buffer */

+ struct cr_objhash *objhash; /* hash for shared objects */
+
struct list_head pgarr_list; /* page array to dump VMA contents */

struct path *vfsroot; /* container root (FIXME) */
@@ -45,6 +47,24 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
extern void cr_hbuf_put(struct cr_ctx *ctx, int n);

+/* shared objects handling */
+
+enum {
+ CR_OBJ_FILE = 1,
+ CR_OBJ_MAX
+};
+
+extern void cr_objhash_free(struct cr_ctx *ctx);
+extern int cr_objhash_alloc(struct cr_ctx *ctx);
+extern void *cr_obj_get_by_ref(struct cr_ctx *ctx,
+ int objref, unsigned short type);
+extern int cr_obj_get_by_ptr(struct cr_ctx *ctx,
+ void *ptr, unsigned short type);
+extern int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+ unsigned short type, unsigned short flags);
+extern int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+ unsigned short type, unsigned short flags);
+
struct cr_hdr;

extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
--
1.5.4.3

2008-10-20 05:43:23

by Oren Laadan

[permalink] [raw]
Subject: [RFC v7][PATCH 6/9] Checkpoint/restart: initial documentation

Covers application checkpoint/restart, overall design, interfaces
and checkpoint image format.

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
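
The image format covered by this documentation is a stream of records,
each a struct cr_hdr followed by len bytes of payload (this is what
cr_read_obj() in the restart patch consumes). The sketch below parses
one such record from a memory buffer in user space. struct rec_hdr
mirrors the cr_hdr layout; rec_read() is a hypothetical helper, and the
sketch assumes the image was written on a machine with the same byte
order, since the patch does no conversion.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* same layout as struct cr_hdr in include/linux/checkpoint_hdr.h */
struct rec_hdr {
	int16_t type;
	int16_t len;
	uint32_t parent;
};

/*
 * Hypothetical helper: parse the record at img[*pos] into @h and copy its
 * payload into @buf (capacity @n). Returns 0 and advances *pos on success,
 * -1 if the record is truncated or its payload does not fit in @buf.
 */
static inline int rec_read(const unsigned char *img, size_t img_len,
			   size_t *pos, struct rec_hdr *h, void *buf, int n)
{
	size_t left;

	if (*pos > img_len || img_len - *pos < sizeof(*h))
		return -1;
	memcpy(h, img + *pos, sizeof(*h));

	/* same bound check as cr_read_obj(): reject bad or oversized len */
	if (h->len < 0 || h->len > n)
		return -1;

	left = img_len - *pos - sizeof(*h);
	if (left < (size_t)h->len)
		return -1;

	memcpy(buf, img + *pos + sizeof(*h), (size_t)h->len);
	*pos += sizeof(*h) + (size_t)h->len;
	return 0;
}
```

A restart-side reader would call this in a loop and compare h->type
against the expected record type, as cr_read_obj_type() does.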
Documentation/checkpoint.txt | 374 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 374 insertions(+), 0 deletions(-)
create mode 100644 Documentation/checkpoint.txt

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
new file mode 100644
index 0000000..a73a4f3
--- /dev/null
+++ b/Documentation/checkpoint.txt
@@ -0,0 +1,374 @@
+
+ === Checkpoint-Restart support in the Linux kernel ===
+
+Copyright (C) 2008 Oren Laadan
+
+Author: Oren Laadan <[email protected]>
+
+License: The GNU Free Documentation License, Version 1.2
+ (dual licensed under the GPL v2)
+Reviewers:
+
+Application checkpoint/restart [CR] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. CR can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+ instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+ intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off of faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+ hosts.
+
+* Improved service availability and administration: by migrating
+ applications before host maintenance so that they continue to run
+ with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+ any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time. The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial CR products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide CR: sys_checkpoint and
+sys_restart. The checkpoint code basically serializes internal kernel
+state and writes it out to a file descriptor, and the resulting image
+is stream-able. More specifically, it consists of 5 steps:
+ 1. Pre-dump
+ 2. Freeze the container
+ 3. Dump
+ 4. Thaw (or kill) the container
+ 5. Post-dump
+Steps 1 and 5 are an optimization to reduce application downtime:
+"pre-dump" works before freezing the container, e.g. the pre-copy for
+live migration, and "post-dump" works after the container resumes
+execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a
+file descriptor, and re-creates the tasks and the resources they need
+to resume execution. The restart code is executed by each task that
+is restored in a new container to reconstruct its own state.
+
+
+=== Interfaces
+
+int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+ Checkpoint a container whose init task is identified by pid, to the
+ file designated by fd. Flags will have future meaning (should be 0
+ for now).
+ Returns: a positive integer that identifies the checkpoint image
+ (for future reference in case it is kept in memory) upon success,
+ 0 if it returns from a restart, and -1 if an error occurs.
+
+int sys_restart(int crid, int fd, unsigned long flags);
+ Restart a container from a checkpoint image identified by crid, or
+ from the blob stored in the file designated by fd. Flags will have
+ future meaning (should be 0 for now).
+ Returns: 0 on success and -1 if an error occurs.
+
+Thus, if checkpoint is initiated by a process in the container, one
+can use logic similar to fork():
+ ...
+ crid = checkpoint(...);
+ switch (crid) {
+ case -1:
+ perror("checkpoint failed");
+ break;
+ default:
+ fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+ /* proceed with execution after checkpoint */
+ ...
+ break;
+ case 0:
+ fprintf(stderr, "returned after restart\n");
+ /* proceed with action required following a restart */
+ ...
+ break;
+ }
+ ...
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+ ...
+ if (restart(crid, ...) < 0)
+ perror("restart failed");
+ /* only get here if restart failed */
+ ...
+
+See below for a complete example in C.
+
+
+=== Order of state dump
+
+The order of operations, both save and restore, is as follows:
+
+* Header section: header, container information, etc.
+* Global section: [TBD] global resources such as IPC, UTS, etc.
+* Process forest: [TBD] tasks and their relationships
+* Per task data (for each task):
+ -> task state: elements of task_struct
+ -> thread state: elements of thread_struct and thread_info
+ -> CPU state: registers etc, including FPU
+ -> memory state: memory address space layout and contents
+ -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
+ -> files state: open file descriptors and their state
+ -> signals state: [TBD] pending signals and signal handling state
+ -> credentials state: [TBD] user and group state, statistics
+
+
+=== Checkpoint image format
+
+The checkpoint image format is composed of records consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+ __s16 type;
+ __s16 len;
+ __u32 parent;
+};
+
+Here, 'type' field identifies the type of the payload, 'len' tells its
+length in bytes. The 'parent' identifies the owner object instance. The
+meaning of the 'parent' field varies depending on the type. For example,
+for type CR_HDR_MM, the 'parent' identifies the task to which this MM
+belongs. The payload also varies depending on the type, for instance,
+the data describing a task_struct is given by a 'struct cr_hdr_task'
+(type CR_HDR_TASK) and so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
+name. Following comes the actual contents, in one or more chunks: each
+chunk begins with a header that specifies how many pages it holds,
+then the virtual addresses of all the dumped pages in that chunk,
+followed by the actual contents of all the dumped pages. A header with
+zero number of pages marks the end of the contents for a particular
+VMA. Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+ cr_hdr + cr_hdr_mm
+ cr_hdr + cr_hdr_vma + cr_hdr + string
+ cr_hdr_pgarr (nr_pages = 2)
+ addr1, addr2
+ page1, page2
+ cr_hdr_pgarr (nr_pages = 0)
+ cr_hdr + cr_hdr_vma
+ cr_hdr_pgarr (nr_pages = 3)
+ addr3, addr4, addr5
+ page3, page4, page5
+ cr_hdr_pgarr (nr_pages = 0)
+ cr_hdr + cr_mm_context
+ cr_hdr + cr_hdr_thread
+ cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+=== Current Implementation
+
+[2008-Oct-07]
+There are several assumptions in the current implementation; they will
+be gradually relaxed in future versions. The main ones are:
+* A task can only checkpoint itself (missing "restart-block" logic).
+* Namespaces are not saved or restored; they will be treated as a type
+ of shared object.
+* In particular, it is assumed that the task's file system namespace
+ is the "root" for the entire container.
+* It is assumed that the same file system view is available for the
+ restart task(s). Otherwise, a file system snapshot is required.
+
+
+=== Sample code
+
+Two example programs: one uses checkpoint (called ckpt) to checkpoint
+itself, and another uses restart (called rstr) to restart from that
+checkpoint. Note the use of "dup2" to create a copy of an open file
+and show how shared objects are treated. Execute like this:
+
+orenl:~/test$ ./ckpt > out.1
+ <-- ctrl-c
+orenl:~/test$ cat /tmp/cr-test.out
+hello, world!
+world, hello!
+(ret = 1)
+
+orenl:~/test$ ./ckpt > out.1
+ <-- ctrl-c
+orenl:~/test$ cat /tmp/cr-test.out
+hello, world!
+world, hello!
+(ret = 2)
+
+ <-- now change the contents of the file
+orenl:~/test$ sed -i 's/world, hello!/xxxx/' /tmp/cr-test.out
+orenl:~/test$ cat /tmp/cr-test.out
+hello, world!
+xxxx
+(ret = 2)
+
+ <-- and do the restart
+orenl:~/test$ ./rstr < out.1
+ <-- ctrl-c
+orenl:~/test$ cat /tmp/cr-test.out
+hello, world!
+world, hello!
+(ret = 0)
+
+(if you check the output of ps, you'll see that "rstr" changed its
+name to "ckpt", as expected).
+
+============================== ckpt.c ================================
+
+#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <asm/unistd.h>
+#include <sys/syscall.h>
+
+#define OUTFILE "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+ pid_t pid = getpid();
+ FILE *file;
+ int ret;
+
+ close(0);
+ close(2);
+
+ unlink(OUTFILE);
+ file = fopen(OUTFILE, "w+");
+ if (!file) {
+ perror("open");
+ exit(1);
+ }
+
+ if (dup2(0,2) < 0) {
+ perror("dup2");
+ exit(1);
+ }
+
+ fprintf(file, "hello, world!\n");
+ fflush(file);
+
+ ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+ if (ret < 0) {
+ perror("checkpoint");
+ exit(2);
+ }
+
+ fprintf(file, "world, hello!\n");
+ fprintf(file, "(ret = %d)\n", ret);
+ fflush(file);
+
+ while (1)
+ ;
+
+ return 0;
+}
+======================================================================
+
+============================== rstr.c ================================
+
+#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <asm/unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+ pid_t pid = getpid();
+ int ret;
+
+ ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+ if (ret < 0)
+ perror("restart");
+
+ printf("should not reach here !\n");
+
+ return 0;
+}
+======================================================================
+
+
+=== Changelog
+
+[2008-Oct-17] v7:
+ - Fix save/restore state of FPU
+ - Fix argument given to kunmap_atomic() in memory dump/restore
+
+[2008-Oct-07] v6:
+ - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
+ (even though it's not really needed)
+ - Add 'current implementation' to docs to describe assumptions
+ - Misc fixes and cleanups
+
+[2008-Sep-11] v5:
+ - Config is 'def_bool n' by default
+ - Improve memory dump/restore code (following Dave Hansen's comments)
+ - Change dump format (and code) to allow chunks of <vaddrs, pages>
+ instead of one long list of each
+ - Fix use of follow_page() to avoid faulting in non-present pages
+ - Memory restore now maps user pages explicitly to copy data into them,
+ instead of reading directly to user space; got rid of mprotect_fixup()
+ - Remove preempt_disable() when restoring debug registers
+ - Rename header files s/ckpt/checkpoint/
+ - Fix misc bugs in files dump/restore
+ - Fix cleanup on some error paths
+ - Fix misc coding style
+
+[2008-Sep-04] v4:
+ - Fix calculation of hash table size
+ - Fix header structure alignment
+ - Use standard list_... for cr_pgarr
+
+[2008-Aug-20] v3:
+ - Various fixes and clean-ups
+ - Use standard hlist_... for hash table
+ - Better use of standard kmalloc/kfree
+
+[2008-Aug-09] v2:
+ - Added utsname->{release,version,machine} to checkpoint header
+ - Pad header structures to 64 bits to ensure compatibility
+ - Address comments from LKML and linux-containers mailing list
+
+[2008-Jul-29] v1:
+In this incarnation, CR only works on a single task. The address space
+may consist of only private, simple VMAs - anonymous or file-mapped.
+Both checkpoint and restart will ignore the first argument (pid/crid)
+and instead act on themselves.
--
1.5.4.3

2008-10-20 05:44:04

by Oren Laadan

[permalink] [raw]
Subject: [RFC v7][PATCH 5/9] Restore memory address space

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoff() for each VMA and then read in the data.

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
arch/x86/mm/restart.c | 64 ++++++-
checkpoint/Makefile | 2 +-
checkpoint/checkpoint_arch.h | 2 +
checkpoint/checkpoint_mem.h | 5 +
checkpoint/restart.c | 42 ++++
checkpoint/rstr_mem.c | 384 ++++++++++++++++++++++++++++++++++++++
include/asm-x86/checkpoint_hdr.h | 4 +
include/linux/checkpoint.h | 3 +
8 files changed, 503 insertions(+), 3 deletions(-)
create mode 100644 checkpoint/rstr_mem.c
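
To make the restore flow described above easier to follow (map each VMA
at its original address, then fill in the saved pages), here is a
user-space analogue for the anonymous private case. It is only a sketch
under stated assumptions: sketch_restore_anon_vma() and its parameters
are hypothetical names, not code from this series. The kernel code fills
pages through get_user_pages() with forced write access; user space
cannot do that, so the sketch maps the region writable first and drops
PROT_WRITE afterwards with mprotect().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for MAP_ANONYMOUS on strict compilers */
#endif

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map an anonymous private region at exactly
 * 'start' and copy 'nr_pages' saved pages into it. vaddrs[i] is the
 * page-aligned target address of the i-th page stored in 'pages'.
 * Returns 0 on success, -1 on failure.
 */
static int sketch_restore_anon_vma(unsigned char *start, size_t size, int prot,
				   unsigned char *const *vaddrs,
				   const unsigned char *pages, size_t nr_pages)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *addr;
	size_t i;

	/* callers are expected to pass a whole number of pages */
	assert(psz > 0 && size % psz == 0);

	/* MAP_FIXED replaces whatever was mapped at 'start' before */
	addr = mmap(start, size, prot | PROT_WRITE,
		    MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

	for (i = 0; i < nr_pages; i++) {
		unsigned char *dst = vaddrs[i];

		/* refuse addresses outside the region just mapped */
		if (dst < addr || dst > addr + size - psz)
			return -1;
		memcpy(dst, pages + i * psz, psz);
	}

	/* restore the original protection if it was not writable */
	if (!(prot & PROT_WRITE) && mprotect(addr, size, prot) < 0)
		return -1;
	return 0;
}
```

Pages that were not dumped are left untouched and read back as zeroes,
which matches what a fresh anonymous mapping provides.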

diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index bc2a502..aeae29f 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -53,8 +53,10 @@ int cr_read_thread(struct cr_ctx *ctx)

size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
desc = kmalloc(size, GFP_KERNEL);
- if (!desc)
- return -ENOMEM;
+ if (!desc) {
+ ret = -ENOMEM;
+ goto out;
+ }

ret = cr_kread(ctx, desc, size);
if (ret >= 0) {
@@ -193,3 +195,61 @@ int cr_read_cpu(struct cr_ctx *ctx)
cr_hbuf_put(ctx, sizeof(*hh));
return ret;
}
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+ struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int n, rparent, ret = -EINVAL;
+
+ rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+ cr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+ if (rparent < 0) {
+ ret = rparent;
+ goto out;
+ }
+ if (rparent != parent)
+ goto out;
+
+ if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+ goto out;
+
+ /*
+ * to utilize the syscall modify_ldt() we first convert the data
+ * in the checkpoint image from 'struct desc_struct' to 'struct
+ * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+ */
+
+ for (n = 0; n < hh->nldt; n++) {
+ struct user_desc info;
+ struct desc_struct desc;
+ mm_segment_t old_fs;
+
+ ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+ if (ret < 0)
+ goto out;
+
+ info.entry_number = n;
+ info.base_addr = desc.base0 | (desc.base1 << 16) | (desc.base2 << 24);
+ info.limit = desc.limit0 | (desc.limit << 16);
+ info.seg_32bit = desc.d;
+ info.contents = desc.type >> 2;
+ info.read_exec_only = (desc.type >> 1) ^ 1;
+ info.limit_in_pages = desc.g;
+ info.seg_not_present = desc.p ^ 1;
+ info.useable = desc.avl;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+ sizeof(info));
+ set_fs(old_fs);
+
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = 0;
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 3a0df6d..ac35033 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
#

obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
- ckpt_mem.o
+ ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index 7da4ad0..018a72e 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -7,3 +7,5 @@ extern int cr_write_mm_context(struct cr_ctx *ctx,

extern int cr_read_thread(struct cr_ctx *ctx);
extern int cr_read_cpu(struct cr_ctx *ctx);
+extern int cr_read_mm_context(struct cr_ctx *ctx,
+ struct mm_struct *mm, int parent);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
index 85546f4..85a5cf3 100644
--- a/checkpoint/checkpoint_mem.h
+++ b/checkpoint/checkpoint_mem.h
@@ -38,4 +38,9 @@ static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
return (pgarr->nr_used == CR_PGARR_TOTAL);
}

+static inline int cr_pgarr_nr_free(struct cr_pgarr *pgarr)
+{
+ return CR_PGARR_TOTAL - pgarr->nr_used;
+}
+
#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 766e381..f4d87ba 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -78,6 +78,44 @@ int cr_read_string(struct cr_ctx *ctx, void *str, int len)
return cr_read_obj_type(ctx, str, len, CR_HDR_STRING);
}

+/**
+ * cr_read_fname - read a file name
+ * @ctx: checkpoint context
+ * @fname: buffer
+ * @flen: buffer length
+ */
+int cr_read_fname(struct cr_ctx *ctx, void *fname, int flen)
+{
+ return cr_read_obj_type(ctx, fname, flen, CR_HDR_FNAME);
+}
+
+/**
+ * cr_read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
+{
+ struct file *file;
+ char *fname;
+ int ret;
+
+ fname = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (!fname)
+ return ERR_PTR(-ENOMEM);
+
+ ret = cr_read_fname(ctx, fname, PATH_MAX);
+ cr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+ if (ret >= 0)
+ file = filp_open(fname, flags, mode);
+ else
+ file = ERR_PTR(ret);
+
+ kfree(fname);
+ return file;
+}
+
/* read the checkpoint header */
static int cr_read_head(struct cr_ctx *ctx)
{
@@ -177,6 +215,10 @@ static int cr_read_task(struct cr_ctx *ctx)
cr_debug("task_struct: ret %d\n", ret);
if (ret < 0)
goto out;
+ ret = cr_read_mm(ctx);
+ cr_debug("memory: ret %d\n", ret);
+ if (ret < 0)
+ goto out;
ret = cr_read_thread(ctx);
cr_debug("thread: ret %d\n", ret);
if (ret < 0)
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..ed7cdd4
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,384 @@
+/*
+ * Restart memory contents
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+
+/**
+ * cr_read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int cr_read_pages_vaddrs(struct cr_ctx *ctx, unsigned long nr_pages)
+{
+ struct cr_pgarr *pgarr;
+ unsigned long *vaddrp;
+ int nr, ret;
+
+ while (nr_pages) {
+ pgarr = cr_pgarr_current(ctx);
+ if (!pgarr)
+ return -ENOMEM;
+ nr = cr_pgarr_nr_free(pgarr);
+ if (nr > nr_pages)
+ nr = nr_pages;
+ vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+ ret = cr_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+ if (ret < 0)
+ return ret;
+ pgarr->nr_used += nr;
+ nr_pages -= nr;
+ }
+ return 0;
+}
+
+static int cr_page_read(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+ void *ptr;
+ int ret;
+
+ ret = cr_kread(ctx, buf, PAGE_SIZE);
+ if (ret < 0)
+ return ret;
+
+ ptr = kmap_atomic(page, KM_USER1);
+ memcpy(ptr, buf, PAGE_SIZE);
+ kunmap_atomic(ptr, KM_USER1);
+
+ return 0;
+}
+
+/**
+ * cr_read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int cr_read_pages_contents(struct cr_ctx *ctx)
+{
+ struct mm_struct *mm = current->mm;
+ struct cr_pgarr *pgarr;
+ unsigned long *vaddrs;
+ char *buf;
+ int i, ret = 0;
+
+ buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ down_read(&mm->mmap_sem);
+ list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+ vaddrs = pgarr->vaddrs;
+ for (i = 0; i < pgarr->nr_used; i++) {
+ struct page *page;
+
+ ret = get_user_pages(current, mm, vaddrs[i],
+ 1, 1, 1, &page, NULL);
+ if (ret < 0)
+ goto out;
+
+ ret = cr_page_read(ctx, page, buf);
+ page_cache_release(page);
+
+ if (ret < 0)
+ goto out;
+ }
+ }
+
+ out:
+ up_read(&mm->mmap_sem);
+ kfree(buf);
+ return ret;
+}
+
+/**
+ * cr_read_private_vma_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int cr_read_private_vma_contents(struct cr_ctx *ctx)
+{
+ struct cr_hdr_pgarr *hh;
+ unsigned long nr_pages;
+ int parent, ret = 0;
+
+ while (1) {
+ hh = cr_hbuf_get(ctx, sizeof(*hh));
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_PGARR);
+ if (parent != 0) {
+ if (parent < 0)
+ ret = parent;
+ else
+ ret = -EINVAL;
+ cr_hbuf_put(ctx, sizeof(*hh));
+ break;
+ }
+
+ cr_debug("nr_pages %ld\n", (unsigned long) hh->nr_pages);
+
+ nr_pages = hh->nr_pages;
+ cr_hbuf_put(ctx, sizeof(*hh));
+
+ if (!nr_pages)
+ break;
+
+ ret = cr_read_pages_vaddrs(ctx, nr_pages);
+ if (ret < 0)
+ break;
+ ret = cr_read_pages_contents(ctx);
+ if (ret < 0)
+ break;
+ cr_pgarr_reset_all(ctx);
+ }
+
+ return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+ unsigned long vm_prot = 0;
+
+ if (orig_vm_flags & VM_READ)
+ vm_prot |= PROT_READ;
+ if (orig_vm_flags & VM_WRITE)
+ vm_prot |= PROT_WRITE;
+ if (orig_vm_flags & VM_EXEC)
+ vm_prot |= PROT_EXEC;
+ if (orig_vm_flags & PROT_SEM) /* only (?) with IPC-SHM */
+ vm_prot |= PROT_SEM;
+
+ return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+ unsigned long vm_flags = 0;
+
+ vm_flags = MAP_FIXED;
+ if (orig_vm_flags & VM_GROWSDOWN)
+ vm_flags |= MAP_GROWSDOWN;
+ if (orig_vm_flags & VM_DENYWRITE)
+ vm_flags |= MAP_DENYWRITE;
+ if (orig_vm_flags & VM_EXECUTABLE)
+ vm_flags |= MAP_EXECUTABLE;
+ if (orig_vm_flags & VM_MAYSHARE)
+ vm_flags |= MAP_SHARED;
+ else
+ vm_flags |= MAP_PRIVATE;
+
+ return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+ struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+ unsigned long addr;
+ struct file *file = NULL;
+ int parent, ret = -EINVAL;
+
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+ if (parent < 0) {
+ ret = parent;
+ goto err;
+ } else if (parent != 0)
+ goto err;
+
+ cr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
+ (unsigned long) hh->vm_end, (int) hh->vma_type);
+
+ if (hh->vm_end < hh->vm_start)
+ goto err;
+
+ vm_start = hh->vm_start;
+ vm_pgoff = hh->vm_pgoff;
+ vm_size = hh->vm_end - hh->vm_start;
+ vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+ vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+
+ switch (hh->vma_type) {
+
+ case CR_VMA_ANON: /* anonymous private mapping */
+ if (vm_flags & MAP_SHARED)
+ goto err;
+ /*
+ * vm_pgoff for anonymous mapping is the "global" page
+ * offset (namely from addr 0x0), so we force a zero
+ */
+ vm_pgoff = 0;
+ break;
+
+ case CR_VMA_FILE: /* private mapping from a file */
+ if (vm_flags & MAP_SHARED)
+ goto err;
+ /*
+ * for private mapping using 'read-only' is sufficient
+ */
+ file = cr_read_open_fname(ctx, O_RDONLY, 0);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto err;
+ }
+ break;
+
+ default:
+ goto err;
+
+ }
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+
+ down_write(&mm->mmap_sem);
+ addr = do_mmap_pgoff(file, vm_start, vm_size,
+ vm_prot, vm_flags, vm_pgoff);
+ up_write(&mm->mmap_sem);
+ cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+ vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+ /* the file (if opened) is now referenced by the vma */
+ if (file)
+ filp_close(file, NULL);
+
+ if (IS_ERR((void *) addr))
+ return PTR_ERR((void *) addr);
+
+ /*
+ * CR_VMA_ANON: read in memory as is
+ * CR_VMA_FILE: read in memory as is
+ * (more to follow ...)
+ */
+
+ switch (hh->vma_type) {
+ case CR_VMA_ANON:
+ case CR_VMA_FILE:
+ /* standard case: read the data into the memory */
+ ret = cr_read_private_vma_contents(ctx);
+ break;
+ }
+
+ if (ret < 0)
+ return ret;
+
+ cr_debug("vma retval %d\n", ret);
+ return 0;
+
+ err:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+ struct vm_area_struct *vmnext = mm->mmap;
+ struct vm_area_struct *vma;
+ int ret;
+
+ while (vmnext) {
+ vma = vmnext;
+ vmnext = vmnext->vm_next;
+ ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+ if (ret < 0) {
+ pr_debug("CR: restart failed do_munmap (%d)\n", ret);
+ return ret;
+ }
+ }
+ return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+ struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct mm_struct *mm;
+ int nr, parent, ret;
+
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+ if (parent < 0) {
+ ret = parent;
+ goto out;
+ }
+
+ ret = -EINVAL;
+#if 0 /* activate when containers are used */
+ if (parent != task_pid_vnr(current))
+ goto out;
+#endif
+ cr_debug("map_count %d\n", hh->map_count);
+
+ /* XXX need more sanity checks */
+ if (hh->start_code > hh->end_code ||
+ hh->start_data > hh->end_data || hh->map_count < 0)
+ goto out;
+
+ mm = current->mm;
+
+ /* point of no return -- destruct current mm */
+ down_write(&mm->mmap_sem);
+ ret = cr_destroy_mm(mm);
+ if (ret < 0) {
+ up_write(&mm->mmap_sem);
+ goto out;
+ }
+ mm->start_code = hh->start_code;
+ mm->end_code = hh->end_code;
+ mm->start_data = hh->start_data;
+ mm->end_data = hh->end_data;
+ mm->start_brk = hh->start_brk;
+ mm->brk = hh->brk;
+ mm->start_stack = hh->start_stack;
+ mm->arg_start = hh->arg_start;
+ mm->arg_end = hh->arg_end;
+ mm->env_start = hh->env_start;
+ mm->env_end = hh->env_end;
+ up_write(&mm->mmap_sem);
+
+ /* FIX: need also mm->flags */
+
+ for (nr = hh->map_count; nr; nr--) {
+ ret = cr_read_vma(ctx, mm);
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = cr_read_mm_context(ctx, mm, hh->objref);
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
diff --git a/include/asm-x86/checkpoint_hdr.h b/include/asm-x86/checkpoint_hdr.h
index 6bc61ac..f8eee6a 100644
--- a/include/asm-x86/checkpoint_hdr.h
+++ b/include/asm-x86/checkpoint_hdr.h
@@ -74,4 +74,8 @@ struct cr_hdr_mm_context {
__s16 nldt;
} __attribute__((aligned(8)));

+
+/* misc prototypes from kernel (not defined elsewhere) */
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+
#endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3f018df..3c6d1d1 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -55,6 +55,9 @@ extern int cr_write_fname(struct cr_ctx *ctx,
extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
extern int cr_read_string(struct cr_ctx *ctx, void *str, int len);
+extern int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
+extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
+ int flags, int mode);

extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
extern int cr_read_mm(struct cr_ctx *ctx);
--
1.5.4.3

2008-10-20 05:44:44

by Oren Laadan

[permalink] [raw]
Subject: [RFC v7][PATCH 4/9] Dump memory address space

For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it will be followed by the file name. Then comes the actual contents,
in one or more chunks: each chunk begins with a header that specifies
how many pages it holds, then the virtual addresses of all the dumped
pages in that chunk, followed by the actual contents of all dumped
pages. A header with zero number of pages marks the end of the contents.
Then comes the next VMA and so on.

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
arch/x86/mm/checkpoint.c | 31 +++
arch/x86/mm/restart.c | 1 +
checkpoint/Makefile | 3 +-
checkpoint/checkpoint.c | 53 ++++
checkpoint/checkpoint_arch.h | 2 +
checkpoint/checkpoint_mem.h | 41 +++
checkpoint/ckpt_mem.c | 500 ++++++++++++++++++++++++++++++++++++++
checkpoint/sys.c | 16 ++
include/asm-x86/checkpoint_hdr.h | 5 +
include/linux/checkpoint.h | 12 +
include/linux/checkpoint_hdr.h | 32 +++
11 files changed, 695 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/checkpoint_mem.h
create mode 100644 checkpoint/ckpt_mem.c
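
The chunk layout described above (a header with the page count, then all
virtual addresses, then all page contents, and finally a header with a
zero count) can be sketched in user space as follows. This is a
simplified illustration, not the code in this patch: struct
sketch_pgarr_hdr is a hypothetical stand-in for cr_hdr + cr_hdr_pgarr,
sketch_write_chunk() is a hypothetical name, and a fixed 4 KiB page
size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Hypothetical, simplified stand-in for cr_hdr + cr_hdr_pgarr. */
struct sketch_pgarr_hdr {
	uint64_t nr_pages;
};

/*
 * Hypothetical helper: write one chunk to 'out'. With nr_pages == 0
 * only the header is written, which marks the end of a VMA's contents.
 * Returns 0 on success, -1 on a short write.
 */
static int sketch_write_chunk(FILE *out, const uint64_t *vaddrs,
			      const unsigned char *pages, size_t nr_pages)
{
	struct sketch_pgarr_hdr hdr = { .nr_pages = nr_pages };

	assert(out != NULL);

	if (fwrite(&hdr, sizeof(hdr), 1, out) != 1)
		return -1;
	if (nr_pages == 0)
		return 0;	/* terminator: no addresses, no contents */

	/* all addresses first, so they can be read back in one go */
	if (fwrite(vaddrs, sizeof(*vaddrs), nr_pages, out) != nr_pages)
		return -1;
	/* then the contents, in the same order as the addresses */
	if (fwrite(pages, SKETCH_PAGE_SIZE, nr_pages, out) != nr_pages)
		return -1;
	return 0;
}
```

Keeping the addresses and the contents in two separate runs is what lets
the restart side read every address of a chunk with a single call before
it touches any page data.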

diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 43dadac..554dbff 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -196,3 +196,34 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
cr_hbuf_put(ctx, sizeof(*hh));
return ret;
}
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+ struct cr_hdr h;
+ struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ h.type = CR_HDR_MM_CONTEXT;
+ h.len = sizeof(*hh);
+ h.parent = parent;
+
+ mutex_lock(&mm->context.lock);
+
+ hh->ldt_entry_size = LDT_ENTRY_SIZE;
+ hh->nldt = mm->context.size;
+
+ cr_debug("nldt %d\n", hh->nldt);
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ goto out;
+
+ ret = cr_kwrite(ctx, mm->context.ldt,
+ mm->context.size * LDT_ENTRY_SIZE);
+
+ out:
+ mutex_unlock(&mm->context.lock);
+ return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index 2bff5eb..bc2a502 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -8,6 +8,7 @@
* distribution for more details.
*/

+#include <linux/unistd.h>
#include <asm/desc.h>
#include <asm/i387.h>

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d2df68c..3a0df6d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
# Makefile for linux checkpoint/restart.
#

-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+ ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 6ca26d0..d4c1b31 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -55,6 +55,55 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
return cr_write_obj(ctx, &h, str);
}

+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+ char *fname;
+
+ BUG_ON(!buf);
+ fname = __d_path(path, root, buf, *n);
+ if (!IS_ERR(fname))
+ *n = (buf + (*n) - fname);
+ return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+ struct cr_hdr h;
+ char *buf, *fname;
+ int ret, flen;
+
+ flen = PATH_MAX;
+ buf = kmalloc(flen, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ fname = cr_fill_fname(path, root, buf, &flen);
+ if (!IS_ERR(fname)) {
+ h.type = CR_HDR_FNAME;
+ h.len = flen;
+ h.parent = 0;
+ ret = cr_write_obj(ctx, &h, fname);
+ } else
+ ret = PTR_ERR(fname);
+
+ kfree(buf);
+ return ret;
+}
+
/* write the checkpoint header */
static int cr_write_head(struct cr_ctx *ctx)
{
@@ -150,6 +199,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
cr_debug("task_struct: ret %d\n", ret);
if (ret < 0)
goto out;
+ ret = cr_write_mm(ctx, t);
+ cr_debug("memory: ret %d\n", ret);
+ if (ret < 0)
+ goto out;
ret = cr_write_thread(ctx, t);
cr_debug("thread: ret %d\n", ret);
if (ret < 0)
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index bf2d21e..7da4ad0 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -2,6 +2,8 @@

extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_mm_context(struct cr_ctx *ctx,
+ struct mm_struct *mm, int parent);

extern int cr_read_thread(struct cr_ctx *ctx);
extern int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
new file mode 100644
index 0000000..85546f4
--- /dev/null
+++ b/checkpoint/checkpoint_mem.h
@@ -0,0 +1,41 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/*
+ * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct cr_pgarr {
+ unsigned long *vaddrs;
+ struct page **pages;
+ unsigned int nr_used;
+ struct list_head list;
+};
+
+#define CR_PGARR_TOTAL (PAGE_SIZE / sizeof(void *))
+#define CR_PGARR_CHUNK (4 * CR_PGARR_TOTAL)
+
+extern void cr_pgarr_free(struct cr_ctx *ctx);
+extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
+extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
+
+static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
+{
+ return (pgarr->nr_used == CR_PGARR_TOTAL);
+}
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..d9025e5
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,500 @@
+/*
+ * Checkpoint memory contents
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has one member for page-arrays:
+ * ctx->pgarr_list: list head of the page-array chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * This "current" page-array advances as necessary, and new page-array
+ * descriptors are allocated on-demand. Before the next chunk of pages,
+ * the chain is reset but not freed (that is, page references are dropped).
+ */
+
+/* return first page-array in the chain */
+static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
+{
+ if (list_empty(&ctx->pgarr_list))
+ return NULL;
+ return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
+}
+
+/* release pages referenced by a page-array */
+static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
+{
+ int i;
+
+ cr_debug("nr_used %d\n", pgarr->nr_used);
+ /*
+ * although both checkpoint and restart use 'nr_used', we only
+ * collect pages during checkpoint; in restart we simply return
+ */
+ if (!pgarr->pages)
+ return;
+ for (i = pgarr->nr_used; i--; /**/)
+ page_cache_release(pgarr->pages[i]);
+}
+
+/* free a single page-array object */
+static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
+{
+ cr_pgarr_release_pages(pgarr);
+ kfree(pgarr->pages);
+ kfree(pgarr->vaddrs);
+ kfree(pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr, *tmp;
+
+ list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+ list_del(&pgarr->list);
+ cr_pgarr_free_one(pgarr);
+ }
+}
+
+/* allocate a single page-array object */
+static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
+{
+ struct cr_pgarr *pgarr;
+
+ pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+ if (!pgarr)
+ return NULL;
+
+ pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
+ GFP_KERNEL);
+ if (!pgarr->vaddrs)
+ goto nomem;
+
+ /* pgarr->pages is needed only for checkpoint */
+ if (flags & CR_CTX_CKPT) {
+ pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
+ GFP_KERNEL);
+ if (!pgarr->pages)
+ goto nomem;
+ }
+
+ return pgarr;
+
+ nomem:
+ cr_pgarr_free_one(pgarr);
+ return NULL;
+}
+
+/* cr_pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the list that has space. Extends the
+ * list if none has space.
+ */
+struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr;
+
+ pgarr = cr_pgarr_first(ctx);
+ if (pgarr && !cr_pgarr_is_full(pgarr))
+ goto out;
+ pgarr = cr_pgarr_alloc_one(ctx->flags);
+ if (!pgarr)
+ goto out;
+ list_add(&pgarr->list, &ctx->pgarr_list);
+ out:
+ return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+void cr_pgarr_reset_all(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr;
+
+ list_for_each_entry(pgarr, &ctx->pgarr_list, list) {
+ cr_pgarr_release_pages(pgarr);
+ pgarr->nr_used = 0;
+ }
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * cr_private_follow_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ *
+ * This function should _only_ be called for private vma's.
+ */
+static struct page *
+cr_private_follow_page(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct page *page;
+
+ BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+ /*
+ * simplified version of get_user_pages(): already have vma,
+ * only need FOLL_ANON, and (for now) ignore fault stats.
+ *
+ * follow_page() will return NULL if the page is not present
+ * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+ * the actual page pointer otherwise.
+ *
+ * FIXME: consolidate with get_user_pages()
+ */
+
+ cond_resched();
+ while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+ int ret;
+
+ /* the page is swapped out - bring it in (optimize ?) */
+ ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+ if (ret & VM_FAULT_ERROR) {
+ if (ret & VM_FAULT_OOM)
+ return ERR_PTR(-ENOMEM);
+ else if (ret & VM_FAULT_SIGBUS)
+ return ERR_PTR(-EFAULT);
+ else
+ BUG();
+ break;
+ }
+ cond_resched();
+ }
+
+ if (IS_ERR(page))
+ return page;
+
+ /*
+ * We only care about dirty pages: either a non-zero page, or a
+ * file-backed (copy-on-write) page that was touched. For the latter,
+ * page_mapping() will be unset because the page is no longer
+ * mapped to the original file after having been modified.
+ */
+ if (page == ZERO_PAGE(0)) {
+ /* this is the zero page: ignore */
+ page_cache_release(page);
+ page = NULL;
+ } else if (vma->vm_file && (page_mapping(page) != NULL)) {
+ /* file backed clean cow: ignore */
+ page_cache_release(page);
+ page = NULL;
+ }
+
+ return page;
+}
+
+/**
+ * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int
+cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+ struct vm_area_struct *vma, unsigned long *start)
+{
+ unsigned long end = vma->vm_end;
+ unsigned long addr = *start;
+ int orig_used = pgarr->nr_used;
+
+ /* this function is only for private memory (anon or file-mapped) */
+ BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+ while (addr < end) {
+ struct page *page;
+
+ page = cr_private_follow_page(vma, addr);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ if (page) {
+ pgarr->pages[pgarr->nr_used] = page;
+ pgarr->vaddrs[pgarr->nr_used] = addr;
+ pgarr->nr_used++;
+ }
+
+ addr += PAGE_SIZE;
+
+ if (cr_pgarr_is_full(pgarr))
+ break;
+ }
+
+ *start = addr;
+ return pgarr->nr_used - orig_used;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+ void *ptr;
+
+ ptr = kmap_atomic(page, KM_USER1);
+ memcpy(buf, ptr, PAGE_SIZE);
+ kunmap_atomic(ptr, KM_USER1);
+
+ return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+ struct cr_pgarr *pgarr;
+ char *buf;
+ int i, ret = 0;
+
+ if (!total)
+ return 0;
+
+ list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+ ret = cr_kwrite(ctx, pgarr->vaddrs,
+ pgarr->nr_used * sizeof(*pgarr->vaddrs));
+ if (ret < 0)
+ return ret;
+ }
+
+ buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+ for (i = 0; i < pgarr->nr_used; i++) {
+ ret = cr_page_write(ctx, pgarr->pages[i], buf);
+ if (ret < 0)
+ goto out;
+ }
+ }
+
+ out:
+ kfree(buf);
+ return ret;
+}
+
+/**
+ * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int
+cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+ struct cr_hdr h;
+ struct cr_hdr_pgarr *hh;
+ unsigned long addr = vma->vm_start;
+ struct cr_pgarr *pgarr;
+ unsigned long cnt = 0;
+ int ret;
+
+ /*
+ * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
+ * in each round. Each iteration is divided into two steps:
+ *
+ * (1) scan: scan through the PTEs of the vma to collect the pages
+ * to dump (later we'll also make them COW), while keeping a list
+ * of pages and their corresponding addresses on ctx->pgarr_list.
+ *
+ * (2) dump: write out a header specifying how many pages, followed
+ * by the addresses of all pages in ctx->pgarr_list, followed by
+ * the actual contents of all pages. (Then, release the references
+ * to the pages and reset the page-array chain).
+ *
+ * (This split makes the logic simpler by first counting the pages
+ * that need saving. More importantly, it allows for a future
+ * optimization that will reduce application downtime by deferring
+ * the actual write-out of the data to after the application is
+ * allowed to resume execution).
+ *
+ * After dumping the entire contents, conclude with a header that
+ * specifies 0 pages to mark the end of the contents.
+ */
+
+ h.type = CR_HDR_PGARR;
+ h.len = sizeof(*hh);
+ h.parent = 0;
+
+ while (addr < vma->vm_end) {
+ pgarr = cr_pgarr_current(ctx);
+ if (!pgarr)
+ return -ENOMEM;
+ ret = cr_private_vma_fill_pgarr(ctx, pgarr, vma, &addr);
+ if (ret < 0)
+ return ret;
+ cnt += ret;
+
+ /* did we complete a chunk, or is this the last chunk ? */
+ if (cnt >= CR_PGARR_CHUNK || (cnt && addr == vma->vm_end)) {
+ hh = cr_hbuf_get(ctx, sizeof(*hh));
+ hh->nr_pages = cnt;
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ return ret;
+
+ ret = cr_vma_dump_pages(ctx, cnt);
+ if (ret < 0)
+ return ret;
+
+ cr_pgarr_reset_all(ctx);
+ }
+ }
+
+ /* mark end of contents with header saying "0" pages */
+ hh = cr_hbuf_get(ctx, sizeof(*hh));
+ hh->nr_pages = 0;
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+
+ return ret;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+ struct cr_hdr h;
+ struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int vma_type, ret;
+
+ h.type = CR_HDR_VMA;
+ h.len = sizeof(*hh);
+ h.parent = 0;
+
+ hh->vm_start = vma->vm_start;
+ hh->vm_end = vma->vm_end;
+ hh->vm_page_prot = vma->vm_page_prot.pgprot;
+ hh->vm_flags = vma->vm_flags;
+ hh->vm_pgoff = vma->vm_pgoff;
+
+ if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
+ pr_warning("CR: unsupported VMA %#lx\n", vma->vm_flags);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return -ENOSYS;
+ }
+
+ /* by default assume anon memory */
+ vma_type = CR_VMA_ANON;
+
+ /*
+ * if there is a backing file, assume private-mapped
+ * (FIXME: check if the file is unlinked)
+ */
+ if (vma->vm_file)
+ vma_type = CR_VMA_FILE;
+
+ hh->vma_type = vma_type;
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ return ret;
+
+ /* save the file name, if relevant */
+ if (vma->vm_file) {
+ ret = cr_write_fname(ctx, &vma->vm_file->f_path, ctx->vfsroot);
+ if (ret < 0)
+ return ret;
+ }
+
+ return cr_write_private_vma_contents(ctx, vma);
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int objref, ret;
+
+ h.type = CR_HDR_MM;
+ h.len = sizeof(*hh);
+ h.parent = task_pid_vnr(t);
+
+ mm = get_task_mm(t);
+
+ objref = 0; /* will be meaningful with multiple processes */
+ hh->objref = objref;
+
+ down_read(&mm->mmap_sem);
+
+ hh->start_code = mm->start_code;
+ hh->end_code = mm->end_code;
+ hh->start_data = mm->start_data;
+ hh->end_data = mm->end_data;
+ hh->start_brk = mm->start_brk;
+ hh->brk = mm->brk;
+ hh->start_stack = mm->start_stack;
+ hh->arg_start = mm->arg_start;
+ hh->arg_end = mm->arg_end;
+ hh->env_start = mm->env_start;
+ hh->env_end = mm->env_end;
+
+ hh->map_count = mm->map_count;
+
+ /* FIX: need also mm->flags */
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ goto out;
+
+ /* write the vma's */
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ ret = cr_write_vma(ctx, vma);
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = cr_write_mm_context(ctx, mm, objref);
+
+ out:
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 6c8ba56..6a18966 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
#include <linux/capability.h>
#include <linux/checkpoint.h>

+#include "checkpoint_mem.h"
+
/*
* helpers to write/read to/from the image file descriptor
*
@@ -161,6 +163,11 @@ void cr_ctx_free(struct cr_ctx *ctx)

kfree(ctx->hbuf);

+ if (ctx->vfsroot)
+ path_put(ctx->vfsroot);
+
+ cr_pgarr_free(ctx);
+
kfree(ctx);
}

@@ -184,6 +191,15 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, int fd, unsigned long flags)
return ERR_PTR(-ENOMEM);
}

+ /*
+ * assume checkpointer is in container's root vfs
+ * FIXME: this works for now, but will change with real containers
+ */
+ ctx->vfsroot = &current->fs->root;
+ path_get(ctx->vfsroot);
+
+ INIT_LIST_HEAD(&ctx->pgarr_list);
+
ctx->pid = pid;
ctx->flags = flags;

diff --git a/include/asm-x86/checkpoint_hdr.h b/include/asm-x86/checkpoint_hdr.h
index 44a903c..6bc61ac 100644
--- a/include/asm-x86/checkpoint_hdr.h
+++ b/include/asm-x86/checkpoint_hdr.h
@@ -69,4 +69,9 @@ struct cr_hdr_cpu {

} __attribute__((aligned(8)));

+struct cr_hdr_mm_context {
+ __s16 ldt_entry_size;
+ __s16 nldt;
+} __attribute__((aligned(8)));
+
#endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 93ff0ce..3f018df 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,9 @@
* distribution for more details.
*/

+#include <linux/path.h>
+#include <linux/fs.h>
+
#define CR_VERSION 1

struct cr_ctx {
@@ -24,6 +27,10 @@ struct cr_ctx {

void *hbuf; /* temporary buffer for headers */
int hpos; /* position in headers buffer */
+
+ struct list_head pgarr_list; /* page array to dump VMA contents */
+
+ struct path *vfsroot; /* container root (FIXME) */
};

/* cr_ctx: flags */
@@ -42,11 +49,16 @@ struct cr_hdr;

extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+extern int cr_write_fname(struct cr_ctx *ctx,
+ struct path *path, struct path *root);

extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
extern int cr_read_string(struct cr_ctx *ctx, void *str, int len);

+extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_read_mm(struct cr_ctx *ctx);
+
extern int do_checkpoint(struct cr_ctx *ctx);
extern int do_restart(struct cr_ctx *ctx);

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 03ec72e..2b110f1 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -32,6 +32,7 @@ struct cr_hdr {
enum {
CR_HDR_HEAD = 1,
CR_HDR_STRING,
+ CR_HDR_FNAME,

CR_HDR_TASK = 101,
CR_HDR_THREAD,
@@ -39,6 +40,7 @@ enum {

CR_HDR_MM = 201,
CR_HDR_VMA,
+ CR_HDR_PGARR,
CR_HDR_MM_CONTEXT,

CR_HDR_TAIL = 5001
@@ -73,4 +75,34 @@ struct cr_hdr_task {
__s32 task_comm_len;
} __attribute__((aligned(8)));

+struct cr_hdr_mm {
+ __u32 objref; /* identifier for shared objects */
+ __u32 map_count;
+
+ __u64 start_code, end_code, start_data, end_data;
+ __u64 start_brk, brk, start_stack;
+ __u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vm_type {
+ CR_VMA_ANON = 1,
+ CR_VMA_FILE
+};
+
+struct cr_hdr_vma {
+ __u32 vma_type;
+ __u32 _padding;
+
+ __u64 vm_start;
+ __u64 vm_end;
+ __u64 vm_page_prot;
+ __u64 vm_flags;
+ __u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pgarr {
+ __u64 nr_pages; /* number of pages saved */
+} __attribute__((aligned(8)));
+
#endif /* _CHECKPOINT_CKPT_HDR_H_ */
--
1.5.4.3

2008-10-20 05:44:25

by Oren Laadan

Subject: [RFC v7][PATCH 8/9] Dump open file descriptors

Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Since FDs can be shared, they are assigned an
objref and registered in the object hash.

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its objref
and its close-on-exec property. If the file is being saved for the first
time, this is followed by a 'struct cr_hdr_fd_data' with the file state.
The next FD follows, and so on.

This patch only handles basic FDs - regular files, directories and also
symbolic links.
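
As a rough illustration of the record ordering described above, here is a
userspace sketch (not the kernel code). The object hash is replaced by a
hypothetical linear table 'struct objtab', and obj_add_ptr(), dump_fds(),
'struct rec' and 'struct open_fd' are names made up for this example. The
point it shows: two descriptors that share one open file produce two
FD_ENT records but only one FD_DATA record.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 64

/* Hypothetical stand-in for the kernel object hash: pointer -> objref. */
struct objtab {
	const void *ptr[MAX_OBJS];
	int n;
};

/*
 * Look up ptr and register it if unknown. Sets *objref (1-based) and
 * returns 1 if newly added, 0 if already present, -1 if the table is full.
 */
static int obj_add_ptr(struct objtab *t, const void *ptr, int *objref)
{
	int i;

	for (i = 0; i < t->n; i++) {
		if (t->ptr[i] == ptr) {
			*objref = i + 1;
			return 0;
		}
	}
	if (t->n == MAX_OBJS)
		return -1;
	t->ptr[t->n++] = ptr;
	*objref = t->n;
	return 1;
}

enum rec_type { REC_FILES, REC_FD_ENT, REC_FD_DATA };

struct rec {
	enum rec_type type;
	int fd;		/* REC_FD_ENT: descriptor number */
	int objref;	/* REC_FD_ENT, REC_FD_DATA: shared-object id */
	int nfds;	/* REC_FILES: number of descriptors that follow */
};

struct open_fd {
	int fd;
	const void *file;	/* identity of the underlying open file */
};

/* Emit the record stream into out[]; returns record count, -1 on overflow. */
static int dump_fds(struct objtab *t, const struct open_fd *fds, int nfds,
		    struct rec *out, int max)
{
	int i, n = 0;

	assert(t && fds && out);
	if (max < 1)
		return -1;
	out[n++] = (struct rec){ .type = REC_FILES, .nfds = nfds };

	for (i = 0; i < nfds; i++) {
		int objref;
		int is_new = obj_add_ptr(t, fds[i].file, &objref);

		if (is_new < 0 || n + 1 + is_new > max)
			return -1;
		out[n++] = (struct rec){ .type = REC_FD_ENT,
					 .fd = fds[i].fd, .objref = objref };
		/* the file state is written only the first time it is seen */
		if (is_new)
			out[n++] = (struct rec){ .type = REC_FD_DATA,
						 .objref = objref };
	}
	return n;
}
```

In the patch itself the same decision is made from the return value of
cr_obj_add_ptr() in cr_write_fd_ent().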

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.c | 4 +
checkpoint/checkpoint_file.h | 17 +++
checkpoint/ckpt_file.c | 231 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint.h | 7 +-
include/linux/checkpoint_hdr.h | 32 ++++++-
6 files changed, 288 insertions(+), 5 deletions(-)
create mode 100644 checkpoint/checkpoint_file.h
create mode 100644 checkpoint/ckpt_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
#

obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
- ckpt_mem.o rstr_mem.o
+ ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index d4c1b31..87420dc 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -203,6 +203,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
cr_debug("memory: ret %d\n", ret);
if (ret < 0)
goto out;
+ ret = cr_write_files(ctx, t);
+ cr_debug("files: ret %d\n", ret);
+ if (ret < 0)
+ goto out;
ret = cr_write_thread(ctx, t);
cr_debug("thread: ret %d\n", ret);
if (ret < 0)
diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/checkpoint_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ * Checkpoint file descriptors
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..767fc01
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,231 @@
+/*
+ * Checkpoint file descriptors
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+#define CR_DEFAULT_FDTABLE 256 /* an initial guess */
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+ struct fdtable *fdt;
+ int *fds;
+ int i, n = 0;
+ int tot = CR_DEFAULT_FDTABLE;
+
+ fds = kmalloc(tot * sizeof(*fds), GFP_KERNEL);
+ if (!fds)
+ return -ENOMEM;
+
+ /*
+ * We assume that the target task is frozen (or that we checkpoint
+ * ourselves), so we can safely proceed after krealloc() from where
+ * we left off; in the worst case restart will fail.
+ */
+
+ spin_lock(&files->file_lock);
+ rcu_read_lock();
+ fdt = files_fdtable(files);
+ for (i = 0; i < fdt->max_fds; i++) {
+ if (!fcheck_files(files, i))
+ continue;
+ if (n == tot) {
+ /*
+ * fcheck_files() is safe with drop/re-acquire
+ * of the lock, because it tests: fd < max_fds
+ */
+ spin_unlock(&files->file_lock);
+ rcu_read_unlock();
+ tot *= 2; /* won't overflow: kmalloc will fail */
+ fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+ if (!fds) {
+ kfree(fds);
+ return -ENOMEM;
+ }
+ rcu_read_lock();
+ spin_lock(&files->file_lock);
+ }
+ fds[n++] = i;
+ }
+ rcu_read_unlock();
+ spin_unlock(&files->file_lock);
+
+ *fdtable = fds;
+ return n;
+}
+
+/* cr_write_fd_data - dump the state of a given file pointer */
+static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
+{
+ struct cr_hdr h;
+ struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct dentry *dent = file->f_dentry;
+ struct inode *inode = dent->d_inode;
+ enum fd_type fd_type;
+ int ret;
+
+ h.type = CR_HDR_FD_DATA;
+ h.len = sizeof(*hh);
+ h.parent = parent;
+
+ hh->f_flags = file->f_flags;
+ hh->f_mode = file->f_mode;
+ hh->f_pos = file->f_pos;
+ hh->f_version = file->f_version;
+ /* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+ switch (inode->i_mode & S_IFMT) {
+ case S_IFREG:
+ fd_type = CR_FD_FILE;
+ break;
+ case S_IFDIR:
+ fd_type = CR_FD_DIR;
+ break;
+ case S_IFLNK:
+ fd_type = CR_FD_LINK;
+ break;
+ default:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return -EBADF;
+ }
+
+ /* FIX: check if the file/dir/link is unlinked */
+ hh->fd_type = fd_type;
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ return ret;
+
+ return cr_write_fname(ctx, &file->f_path, ctx->vfsroot);
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls cr_write_fd_data to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+ struct cr_hdr h;
+ struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct file *file = NULL;
+ struct fdtable *fdt;
+ int coe, objref, new, ret;
+
+ rcu_read_lock();
+ fdt = files_fdtable(files);
+ file = fcheck_files(files, fd);
+ if (file) {
+ coe = FD_ISSET(fd, fdt->close_on_exec);
+ get_file(file);
+ }
+ rcu_read_unlock();
+
+ /* sanity check (although this shouldn't happen) */
+ if (!file) {
+ ret = -EBADF;
+ goto out;
+ }
+
+ new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
+ cr_debug("fd %d objref %d file %p c-o-e %d\n", fd, objref, file, coe);
+
+ if (new < 0) {
+ ret = new;
+ goto out;
+ }
+
+ h.type = CR_HDR_FD_ENT;
+ h.len = sizeof(*hh);
+ h.parent = 0;
+
+ hh->objref = objref;
+ hh->fd = fd;
+ hh->close_on_exec = coe;
+
+ ret = cr_write_obj(ctx, &h, hh);
+ if (ret < 0)
+ goto out;
+
+ /* new==1 if-and-only-if file was newly added to hash */
+ if (new)
+ ret = cr_write_fd_data(ctx, file, objref);
+
+out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ fput(file);
+ return ret;
+}
+
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct files_struct *files;
+ int *fdtable;
+ int nfds, n, ret;
+
+ h.type = CR_HDR_FILES;
+ h.len = sizeof(*hh);
+ h.parent = task_pid_vnr(t);
+
+ files = get_files_struct(t);
+
+ nfds = cr_scan_fds(files, &fdtable);
+ if (nfds < 0) {
+ put_files_struct(files);
+ return nfds;
+ }
+
+ hh->objref = 0; /* will be meaningful with multiple processes */
+ hh->nfds = nfds;
+
+ ret = cr_write_obj(ctx, &h, hh);
+ cr_hbuf_put(ctx, sizeof(*hh));
+ if (ret < 0)
+ goto clean;
+
+ cr_debug("nfds %d\n", nfds);
+ for (n = 0; n < nfds; n++) {
+ ret = cr_write_fd_ent(ctx, files, fdtable[n]);
+ if (ret < 0)
+ break;
+ }
+
+ clean:
+ kfree(fdtable);
+ put_files_struct(files);
+ return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2da3a9f..d6bf6dc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,7 +13,7 @@
#include <linux/path.h>
#include <linux/fs.h>

-#define CR_VERSION 1
+#define CR_VERSION 2

struct cr_ctx {
pid_t pid; /* container identifier */
@@ -79,11 +79,12 @@ extern int cr_read_fname(struct cr_ctx *ctx, void *fname, int n);
extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
int flags, int mode);

+extern int do_checkpoint(struct cr_ctx *ctx);
extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
-extern int cr_read_mm(struct cr_ctx *ctx);
+extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);

-extern int do_checkpoint(struct cr_ctx *ctx);
extern int do_restart(struct cr_ctx *ctx);
+extern int cr_read_mm(struct cr_ctx *ctx);

/* there are from fs/read_write.c, not exported otherwise in a header */
extern loff_t file_pos_read(struct file *file);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 2b110f1..cbb920f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -17,7 +17,7 @@
/*
* To maintain compatibility between 32-bit and 64-bit architecture flavors,
* keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned(8))) for the entire structure.
*/

/* records: generic header */
@@ -43,6 +43,10 @@ enum {
CR_HDR_PGARR,
CR_HDR_MM_CONTEXT,

+ CR_HDR_FILES = 301,
+ CR_HDR_FD_ENT,
+ CR_HDR_FD_DATA,
+
CR_HDR_TAIL = 5001
};

@@ -105,4 +109,30 @@ struct cr_hdr_pgarr {
__u64 nr_pages; /* number of pages saved */
} __attribute__((aligned(8)));

+struct cr_hdr_files {
+ __u32 objref; /* identifier for shared objects */
+ __u32 nfds;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_fd_ent {
+ __u32 objref; /* identifier for shared objects */
+ __s32 fd;
+ __u32 close_on_exec;
+} __attribute__((aligned(8)));
+
+/* fd types */
+enum fd_type {
+ CR_FD_FILE = 1,
+ CR_FD_DIR,
+ CR_FD_LINK
+};
+
+struct cr_hdr_fd_data {
+ __u16 fd_type;
+ __u16 f_mode;
+ __u32 f_flags;
+ __u64 f_pos;
+ __u64 f_version;
+} __attribute__((aligned(8)));
+
#endif /* _CHECKPOINT_CKPT_HDR_H_ */
--
1.5.4.3

2008-10-20 05:44:59

by Oren Laadan

Subject: [RFC v7][PATCH 9/9] Restore open file descriptors

Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and look up its objref in the hash table; if not found (first occurrence),
read in 'struct cr_hdr_fd_data', create a new FD and register the file in
the hash. Otherwise attach the file pointer from the hash as an FD.

This patch only handles basic FDs - regular files, directories and also
symbolic links.
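
The lookup-or-create step described above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: the hash
table is a plain array indexed by objref, "opening" a file just hands out
a fresh integer id, and restore_fds(), 'struct rec' and
'struct restore_state' are hypothetical names that do not appear in the
patch. The sketch also places each file directly at the recorded FD
number, which the kernel code may well do differently.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 64
#define MAX_FDS  64

enum rec_type { REC_FD_ENT, REC_FD_DATA };

struct rec {
	enum rec_type type;
	int fd;		/* REC_FD_ENT: descriptor number */
	int objref;	/* shared-object id */
};

struct restore_state {
	int file_of_ref[MAX_OBJS];	/* objref -> file id, 0 if unseen */
	int file_of_fd[MAX_FDS];	/* fd -> file id, 0 if closed */
	int next_file;			/* last file id handed out */
};

/* Replay a record stream; returns 0 on success, -1 if it is malformed. */
static int restore_fds(struct restore_state *s, const struct rec *in, int n)
{
	int i = 0;

	assert(s && in);
	while (i < n) {
		const struct rec *ent = &in[i++];
		int file;

		if (ent->type != REC_FD_ENT)
			return -1;
		if (ent->objref <= 0 || ent->objref >= MAX_OBJS)
			return -1;
		if (ent->fd < 0 || ent->fd >= MAX_FDS)
			return -1;

		file = s->file_of_ref[ent->objref];
		if (!file) {
			/* first occurrence: the file state must follow */
			if (i >= n || in[i].type != REC_FD_DATA ||
			    in[i].objref != ent->objref)
				return -1;
			i++;
			file = ++s->next_file;	/* stands in for open() */
			s->file_of_ref[ent->objref] = file;
		}
		/* known objref: share the file already in the table */
		s->file_of_fd[ent->fd] = file;
	}
	return 0;
}
```

A stream in which an unknown objref is not followed by its FD_DATA record
is rejected, which mirrors the parent/objref check in cr_read_fd_data().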

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/restart.c | 4 +
checkpoint/rstr_file.c | 246 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint.h | 1 +
4 files changed, 252 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
#

obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
- ckpt_mem.o rstr_mem.o ckpt_file.o
+ ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f4d87ba..9ff9f66 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -219,6 +219,10 @@ static int cr_read_task(struct cr_ctx *ctx)
cr_debug("memory: ret %d\n", ret);
if (ret < 0)
goto out;
+ ret = cr_read_files(ctx);
+ cr_debug("files: ret %d\n", ret);
+ if (ret < 0)
+ goto out;
ret = cr_read_thread(ctx);
cr_debug("thread: ret %d\n", ret);
if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..08bb049
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,246 @@
+/*
+ * Restore file descriptors
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+ int *fdtable;
+ int nfds;
+
+ nfds = cr_scan_fds(files, &fdtable);
+ if (nfds < 0)
+ return nfds;
+ while (nfds--)
+ sys_close(fdtable[nfds]);
+ kfree(fdtable);
+ return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+ int fd = get_unused_fd_flags(0);
+
+ if (fd >= 0) {
+ fsnotify_open(file->f_path.dentry);
+ fd_install(fd, file);
+ }
+ return fd;
+}
+
+/**
+ * cr_attach_get_file - attach (and get) lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_get_file(struct file *file)
+{
+ int fd = get_unused_fd_flags(0);
+
+ if (fd >= 0) {
+ fsnotify_open(file->f_path.dentry);
+ fd_install(fd, file);
+ get_file(file);
+ }
+ return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_fd_data - restore the state of a given file pointer */
+static int
+cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int parent)
+{
+ struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct file *file;
+ int rparent, ret;
+ int fd = 0; /* pacify gcc warning */
+
+ rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
+ cr_debug("rparent %d parent %d flags %#x mode %#x how %d\n",
+ rparent, parent, hh->f_flags, hh->f_mode, hh->fd_type);
+ if (rparent < 0) {
+ ret = rparent;
+ goto out;
+ }
+
+ ret = -EINVAL;
+
+ if (rparent != parent)
+ goto out;
+
+ /* FIX: more sanity checks on f_flags, f_mode etc */
+
+ switch (hh->fd_type) {
+ case CR_FD_FILE:
+ case CR_FD_DIR:
+ case CR_FD_LINK:
+ file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+ break;
+ default:
+ goto out;
+ }
+
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto out;
+ }
+
+ /* FIX: need to restore uid, gid, owner etc */
+
+ fd = cr_attach_file(file); /* no need to cleanup 'file' below */
+ if (fd < 0) {
+ filp_close(file, NULL);
+ ret = fd;
+ goto out;
+ }
+
+ /* register new <objref, file> tuple in hash table */
+ ret = cr_obj_add_ref(ctx, (void *) file, parent, CR_OBJ_FILE, 0);
+ if (ret < 0)
+ goto out;
+ ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+ if (ret < 0)
+ goto out;
+ ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+ if (ret == -ESPIPE) /* ignore error on non-seekable files */
+ ret = 0;
+
+ ret = 0;
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret < 0 ? ret : fd;
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @parent: parent objref
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * uses it; otherwise calls cr_read_fd_data to restore the file too.
+ */
+static int
+cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int parent)
+{
+ struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct file *file;
+ int newfd, rparent, ret;
+
+ rparent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+ cr_debug("rparent %d parent %d ref %d fd %d c.o.e %d\n",
+ rparent, parent, hh->objref, hh->fd, hh->close_on_exec);
+ if (rparent < 0) {
+ ret = rparent;
+ goto out;
+ }
+
+ ret = -EINVAL;
+
+ if (rparent != parent)
+ goto out;
+ if (hh->objref <= 0)
+ goto out;
+
+ file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto out;
+ }
+
+ if (file) {
+ /* reuse file descriptor found in the hash table */
+ newfd = cr_attach_get_file(file);
+ } else {
+ /* create new file pointer (and register in hash table) */
+ newfd = cr_read_fd_data(ctx, files, hh->objref);
+ }
+
+ if (newfd < 0) {
+ ret = newfd;
+ goto out;
+ }
+
+ cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+ /* if newfd isn't desired fd then reposition it */
+ if (newfd != hh->fd) {
+ ret = sys_dup2(newfd, hh->fd);
+ if (ret < 0)
+ goto out;
+ sys_close(newfd);
+ }
+
+ if (hh->close_on_exec)
+ set_close_on_exec(hh->fd, 1);
+
+ ret = 0;
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
+
+int cr_read_files(struct cr_ctx *ctx)
+{
+ struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct files_struct *files = current->files;
+ int i, parent, ret;
+
+ parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
+ if (parent < 0) {
+ ret = parent;
+ goto out;
+ }
+
+ ret = -EINVAL;
+#if 0 /* activate when containers are used */
+ if (parent != task_pid_vnr(current))
+ goto out;
+#endif
+ cr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
+ if (hh->objref < 0 || hh->nfds < 0)
+ goto out;
+
+ if (hh->nfds > sysctl_nr_open) {
+ ret = -EMFILE;
+ goto out;
+ }
+
+ /* point of no return -- close all file descriptors */
+ ret = cr_close_all_fds(files);
+ if (ret < 0)
+ goto out;
+
+ for (i = 0; i < hh->nfds; i++) {
+ ret = cr_read_fd_ent(ctx, files, hh->objref);
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = 0;
+ out:
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index d6bf6dc..b102346 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -85,6 +85,7 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);

extern int do_restart(struct cr_ctx *ctx);
extern int cr_read_mm(struct cr_ctx *ctx);
+extern int cr_read_files(struct cr_ctx *ctx);

/* there are from fs/read_write.c, not exported otherwise in a header */
extern loff_t file_pos_read(struct file *file);
--
1.5.4.3
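
For readers following along: the repositioning done at the end of cr_read_fd_ent() has a close userspace analogue. The sketch below is illustrative only. place_fd() is a hypothetical helper that does not appear in the patch; it uses plain POSIX calls where the patch uses sys_dup2(), sys_close(), set_close_on_exec() and vfs_llseek(), and it assumes the caller has already opened the file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * place_fd() is a hypothetical userspace helper, not part of the patch.
 * It moves an open descriptor to the number recorded in the image, then
 * reapplies the saved close-on-exec flag and file offset.
 * Returns want_fd on success or a negative errno value on failure.
 */
int place_fd(int newfd, int want_fd, off_t pos, int cloexec)
{
	assert(want_fd >= 0);

	if (newfd != want_fd) {
		/* dup2() silently replaces whatever was open at want_fd */
		if (dup2(newfd, want_fd) < 0)
			return -errno;
		close(newfd);
	}

	/* the copy made by dup2() has FD_CLOEXEC cleared, so set it here */
	if (cloexec && fcntl(want_fd, F_SETFD, FD_CLOEXEC) < 0)
		return -errno;

	/* pipes and sockets cannot seek; only that one error is ignored */
	if (lseek(want_fd, pos, SEEK_SET) == (off_t)-1 && errno != ESPIPE)
		return -errno;

	return want_fd;
}
```

Unlike the quoted kernel code, this version reports seek failures other than ESPIPE to the caller instead of discarding them.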

2008-10-21 19:22:54

by Andrew Morton

Subject: Re: [RFC v7][PATCH 0/9] Kernel based checkpoint/restart

On Mon, 20 Oct 2008 01:40:28 -0400
Oren Laadan <[email protected]> wrote:

> These patches implement basic checkpoint-restart [CR]. This version
> (v7) supports basic tasks with simple private memory, and open files
> (regular files and directories only).

This is a problem. I wouldn't want to be in a position where we merge
this code in mainline, but it's just a not-very-useful toy. Then, as
we turn it into a useful non-toy it all turns into an utter mess.

IOW, merging this code as-is will commit us to merging more code which
hasn't even been written yet. It might even commit us to solving
thus-far-unknown problems which we don't know how to solve!

It's a big blank cheque.

So.

- how useful is this code as it stands in real-world usage?

- what additional work needs to be done to it? (important!)

- how far are we down the design and implementation path with that new
work? Are we yet at least in a position where we can say "yes, this
feature can be completed and no, it won't be a horrid mess"?

2008-10-21 19:42:29

by Andrew Morton

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

On Mon, 20 Oct 2008 01:40:30 -0400
Oren Laadan <[email protected]> wrote:

> Add those interfaces, as well as helpers needed to easily manage the
> file format. The code is roughly broken out as follows:
>
> checkpoint/sys.c - user/kernel data transfer, as well as setup of the
> checkpoint/restart context (a per-checkpoint data structure for
> housekeeping)
>
> checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
>
> checkpoint/restart.c - input wrappers and basic restart handling
>
> Patches to add the per-architecture support as well as the actual
> work to do the memory checkpoint follow in subsequent patches.
>
>
> ...
>
> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
> +{
> + mm_segment_t oldfs;
> + int ret;
> +
> + oldfs = get_fs();
> + set_fs(KERNEL_DS);
> + ret = cr_uwrite(ctx, buf, count);
> + set_fs(oldfs);
> +
> + return ret;
> +}

The decision to write files direct from within the kernel is a bit
unusual and needs discussion and justification in the changelog,
please.

Other schemes would be to make the data available to userspace via a
pseudo-fs file, netlink, a pipe, blah, blah.

>
> ...
>
> +/*
> + * During checkpoint and restart the code writes outs/reads in data
> + * to/from the chekcpoint image from/to a temporary buffer (ctx->hbuf).

Yuo cnat tpye.

> + * Because operations can be nested, one should call cr_hbuf_get() to
> + * reserve space in the buffer, and then cr_hbuf_put() when no longer
> + * needs that space.

Mangled grammar.

> + */
> +
> +/*
> + * ctx->hbuf is used to hold headers and data of known (or bound),
> + * static sizes. In some cases, multiple headers may be allocated in
> + * a nested manner. The size should accommodate all headers, nested
> + * or not, on all archs.
> + */
> +#define CR_HBUF_TOTAL (8 * 4096)
> +
>
> ...
>
> +/*
> + * helpers to manage CR contexts: allocated for each checkpoint and/or
> + * restart operation, and persists until the operation is completed.
> + */
> +
> +/* unique checkpoint identifier (FIXME: should be per-container) */
> +static atomic_t cr_ctx_count;

This never gets initialised. Use ATOMIC_INIT() here. (It doesn't
matter, but one day it might!)

>
> ...
>
> asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> {
> - pr_debug("sys_checkpoint not implemented yet\n");
> - return -ENOSYS;
> + struct cr_ctx *ctx;
> + int ret;
> +
> + /* no flags for now */
> + if (flags)
> + return -EINVAL;
> +
> + ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
> + if (IS_ERR(ctx))
> + return PTR_ERR(ctx);
> +
> + ret = do_checkpoint(ctx);
> +
> + if (!ret)
> + ret = ctx->crid;
> +
> + cr_ctx_free(ctx);
> + return ret;
> }

Is it appropriate that this be an unprivileged operation?

What happens if I pass it a pid which isn't system-wide unique?

What happens if I pass it a pid of a process which I don't own? This
is super security-sensitive and we need to go over the permission
checking with a toothcomb. It needs to be exhaustively described in
the changelog. It might have security/selinux implications - I don't
know, I didn't look, but lights are flashing and bells are ringing over
here.

What happens if I pass it a pid of a process which I _do_ own, but it
does not refer to a container's init process?

If `pid' must refer to a container's init process, isn't it always
equal to 1??

> /**
> * sys_restart - restart a container
> * @crid: checkpoint image identifier
> @@ -36,6 +234,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> */
> asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
> {
> - pr_debug("sys_restart not implemented yet\n");
> - return -ENOSYS;
> + struct cr_ctx *ctx;
> + int ret;
> +
> + /* no flags for now */
> + if (flags)
> + return -EINVAL;
> +
> + ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
> + if (IS_ERR(ctx))
> + return PTR_ERR(ctx);
> +
> + ret = do_restart(ctx);
> +
> + cr_ctx_free(ctx);
> + return ret;
> }

Again, this is scary stuff. We're allowing unprivileged userspace to
feed random numbers into kernel data structures.

I'd like to see the security guys take a real close look at all of
this, and for them to do that effectively they should be provided with
a full description of the security design of this feature.

> diff --git a/fs/read_write.c b/fs/read_write.c
> index 9ba495d..e2deded 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -324,12 +324,12 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
>
> EXPORT_SYMBOL(vfs_write);
>
> -static inline loff_t file_pos_read(struct file *file)
> +inline loff_t file_pos_read(struct file *file)
> {
> return file->f_pos;
> }
>
> -static inline void file_pos_write(struct file *file, loff_t pos)
> +inline void file_pos_write(struct file *file, loff_t pos)
> {
> file->f_pos = pos;
> }

Might as well move these to a header and inline them everywhere.
That'd be a separate leadin patch.
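
The nested reservation scheme that the quoted cr_hbuf_get()/cr_hbuf_put() comment describes can be pictured as a stack growing inside one fixed buffer. What follows is a minimal userspace sketch under stated assumptions: struct toy_ctx, its hpos field and every toy_ name are inventions for illustration, and the bump-pointer layout is a guess, since the patch's actual implementation is not quoted in this message.

```c
#include <assert.h>
#include <stddef.h>

#define HBUF_TOTAL (8 * 4096)	/* same size as CR_HBUF_TOTAL in the quoted patch */

/* toy stand-in for the relevant part of struct cr_ctx (assumed layout) */
struct toy_ctx {
	unsigned char hbuf[HBUF_TOTAL];
	size_t hpos;		/* offset of the first free byte in hbuf */
};

/* round sizes up so that successive reservations start on 8-byte offsets */
static size_t toy_round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* reserve n bytes on top of the buffer; NULL if they do not fit */
void *toy_hbuf_get(struct toy_ctx *ctx, size_t n)
{
	void *p;

	if (n > HBUF_TOTAL)
		return NULL;
	n = toy_round_up8(n);
	if (n > HBUF_TOTAL - ctx->hpos)
		return NULL;
	p = ctx->hbuf + ctx->hpos;
	ctx->hpos += n;
	return p;
}

/* release the most recent reservation; calls must pair up in LIFO order */
void toy_hbuf_put(struct toy_ctx *ctx, size_t n)
{
	n = toy_round_up8(n);
	assert(n <= ctx->hpos);
	ctx->hpos -= n;
}
```

Because a put only moves the offset back, a caller that returns early without its matching put leaves every later reservation shifted, which is presumably why the restore functions in this series route all exits through a single out: label.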

2008-10-21 20:24:29

by Serge E. Hallyn

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

Quoting Andrew Morton ([email protected]):
> On Mon, 20 Oct 2008 01:40:30 -0400
> Oren Laadan <[email protected]> wrote:
> > asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> > {
> > - pr_debug("sys_checkpoint not implemented yet\n");
> > - return -ENOSYS;
> > + struct cr_ctx *ctx;
> > + int ret;
> > +
> > + /* no flags for now */
> > + if (flags)
> > + return -EINVAL;
> > +
> > + ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
> > + if (IS_ERR(ctx))
> > + return PTR_ERR(ctx);
> > +
> > + ret = do_checkpoint(ctx);
> > +
> > + if (!ret)
> > + ret = ctx->crid;
> > +
> > + cr_ctx_free(ctx);
> > + return ret;
> > }
>
> Is it appropriate that this be an unprivileged operation?

Early versions checked capable(CAP_SYS_ADMIN), and we reasoned that we
would later attempt to remove the need for privilege so that all users
could safely use it.

Arnd Bergmann called us on that nonsense, pointing out that it'd make
more sense to let unprivileged users use them now, so that we'll be
more careful about the security as patches roll in.

So, Oren's patchset right now only checkpoints current, despite pid
being part of the API. So the task can access its own data. When
the patch supports checkpointing another task (which Oren says he's
doing right now), then our intent is to check for ptrace access to
the target task. (Right, Oren?)

> What happens if I pass it a pid which isn't system-wide unique?

pid must be checked in the caller's pid namespace. So if I've created a
container which I want to checkpoint, pid 1 in that pidns will be, say,
3497 in my pid_ns, and so 3497 is the pid I must use. If I try to pass
1, I'll try to checkpoint my own container. And, if I'm not privileged
and init is owned by root, the ptrace() check I mentioned above will
return -EPERM.

> What happens if I pass it a pid of a process which I don't own? This
> is super security-sensitive and we need to go over the permission
> checking with a toothcomb. It needs to be exhaustively described in
> the changelog. It might have security/selinux implications - I don't
> know, I didn't look, but lights are flashing and bells are ringing over
> here.
>
> What happens if I pass it a pid of a process which I _do_ own, but it
> does not refer to a container's init process?

I would assume that do_checkpoint() would return -EINVAL, but it's a
great question: Oren, did you have another plan?

> If `pid' must refer to a container's init process, isn't it always
> equal to 1??

Not in the caller's pid_namespace.

>
> > /**
> > * sys_restart - restart a container
> > * @crid: checkpoint image identifier
> > @@ -36,6 +234,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> > */
> > asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
> > {
> > - pr_debug("sys_restart not implemented yet\n");
> > - return -ENOSYS;
> > + struct cr_ctx *ctx;
> > + int ret;
> > +
> > + /* no flags for now */
> > + if (flags)
> > + return -EINVAL;
> > +
> > + ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
> > + if (IS_ERR(ctx))
> > + return PTR_ERR(ctx);
> > +
> > + ret = do_restart(ctx);
> > +
> > + cr_ctx_free(ctx);
> > + return ret;
> > }
>
> Again, this is scary stuff. We're allowing unprivileged userspace to
> feed random numbers into kernel data structures.

Yes, all of the file opens and mmaps must not skip the usual security
checks. The task credentials are currently unsupported, meaning that
euid, etc, come from the caller, not the checkpoint image. When the
restoration of credentials becomes supported, then definitely the
caller (of sys_restart())'s ability to setresuid/setresgid to those
values must be checked.

So that's why we don't want CAP_SYS_ADMIN required up-front. That way
we will be forced to more carefully review each of those features.

> I'd like to see the security guys take a real close look at all of
> this, and for them to do that effectively they should be provided with
> a full description of the security design of this feature.

Right, some of the above should be spelled out somewhere. Should it be
in the patch description, in the Documentation/checkpoint.txt file,
or someplace else? Oren, do you want to filter the above information
into the right place, or do you want me to do it and send you a patch?

> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 9ba495d..e2deded 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -324,12 +324,12 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
> >
> > EXPORT_SYMBOL(vfs_write);
> >
> > -static inline loff_t file_pos_read(struct file *file)
> > +inline loff_t file_pos_read(struct file *file)
> > {
> > return file->f_pos;
> > }
> >
> > -static inline void file_pos_write(struct file *file, loff_t pos)
> > +inline void file_pos_write(struct file *file, loff_t pos)
> > {
> > file->f_pos = pos;
> > }
>
> Might as well move these to a header and inline them everywhere.
> That'd be a separate leadin patch.
>

thanks,
-serge
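
The point about pid 1 versus 3497 is easier to see with a toy model in which a task carries one pid number per nested pid namespace. The sketch below is an assumption-laden illustration, not kernel code: struct toy_pid and both functions are hypothetical, and namespaces are identified only by their depth, whereas the kernel also has to confirm that the namespace in question really is an ancestor of the task's own.

```c
#include <assert.h>

#define TOY_MAX_LEVEL 4

/*
 * Toy model of a task's pid: one number per nested pid namespace,
 * indexed by namespace depth (0 is the initial namespace).
 */
struct toy_pid {
	int level;		/* depth of the namespace the task lives in */
	int nr[TOY_MAX_LEVEL];	/* nr[i] is the pid as seen from depth i */
};

/* pid of the task as seen from depth ns_level, or 0 if it is not visible */
int toy_pid_nr_ns(const struct toy_pid *pid, int ns_level)
{
	assert(pid->level >= 0 && pid->level < TOY_MAX_LEVEL);
	if (ns_level < 0 || ns_level > pid->level)
		return 0;
	return pid->nr[ns_level];
}

/* does 'wanted', interpreted at the caller's depth, name this task? */
int toy_pid_matches(const struct toy_pid *pid, int caller_level, int wanted)
{
	int nr = toy_pid_nr_ns(pid, caller_level);

	return nr != 0 && nr == wanted;
}
```

With a container init that is pid 1 at depth 1 and pid 3497 at depth 0, a caller at depth 0 reaches it only by passing 3497; passing 1 from depth 0 names the caller's own init instead.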

2008-10-21 20:41:23

by Dave Hansen

Subject: Re: [RFC v7][PATCH 0/9] Kernel based checkpoint/restart

On Tue, 2008-10-21 at 12:21 -0700, Andrew Morton wrote:
> On Mon, 20 Oct 2008 01:40:28 -0400
> Oren Laadan <[email protected]> wrote:
> > These patches implement basic checkpoint-restart [CR]. This version
> > (v7) supports basic tasks with simple private memory, and open files
> > (regular files and directories only).
>
> - how useful is this code as it stands in real-world usage?

Right now, an application must be specifically written to use these new
system calls. It must be a single process and not share any resources
with other processes. The only file descriptors that may be open are
simple files and may not include sockets or pipes.

What this means in practice is that it is useful for a simple app doing
computational work.

> - what additional work needs to be done to it? (important!)
>
> - how far are we down the design and implementation path with that new
> work?

We know this design can work. We have two commercial products and a
horde of academic projects doing it today using this basic design.
We're early in this particular implementation because we're trying to
release early and often.

I think we're at the point where we need a yes or no from the rest of
the community on it. Reading the patches, I'd hope a reviewer can get
an idea how this will extend to other subsystems. Do you think the
current patches aren't enough from which to extrapolate how this will be
extended?

> Are we yet at least in a position where we can say "yes, this
> feature can be completed and no, it won't be a horrid mess"?

It will be complete a few months after the rest of the kernel is
complete. :)

From these patches, I think you can see that this will largely be
something that can live off in its own corner of the tree. We will, of
course, need to do plenty of refactoring of existing code (like the pid
namespaces for instance) to make some of it more accessible from the
outside. We're also going to look for every opportunity to share code
with other users like the freezer.

-- Dave

2008-10-21 20:42:21

by Andrew Morton

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

On Tue, 21 Oct 2008 15:24:10 -0500
"Serge E. Hallyn" <[email protected]> wrote:

> > I'd like to see the security guys take a real close look at all of
> > this, and for them to do that effectively they should be provided with
> > a full description of the security design of this feature.
>
> Right, some of the above should be spelled out somewhere. Should it be
> in the patch description, in the Documentation/checkpoint.txt file,
> or someplace else?

Duplication is usually bad. Documentation/checkpoint.txt would be good
(although these things tend to go out of date fast).

If you go that way, please ensure that the documentation patch is early
in the series and that the changelog says "look in here before whining,
dummy".

2008-10-22 01:34:34

by Oren Laadan

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart



Serge E. Hallyn wrote:
> Quoting Andrew Morton ([email protected]):
>> On Mon, 20 Oct 2008 01:40:30 -0400
>> Oren Laadan <[email protected]> wrote:
>>> asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>>> {
>>> - pr_debug("sys_checkpoint not implemented yet\n");
>>> - return -ENOSYS;
>>> + struct cr_ctx *ctx;
>>> + int ret;
>>> +
>>> + /* no flags for now */
>>> + if (flags)
>>> + return -EINVAL;
>>> +
>>> + ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
>>> + if (IS_ERR(ctx))
>>> + return PTR_ERR(ctx);
>>> +
>>> + ret = do_checkpoint(ctx);
>>> +
>>> + if (!ret)
>>> + ret = ctx->crid;
>>> +
>>> + cr_ctx_free(ctx);
>>> + return ret;
>>> }
>> Is it appropriate that this be an unprivileged operation?
>
> Early versions checked capable(CAP_SYS_ADMIN), and we reasoned that we
> would later attempt to remove the need for privilege so that all users
> could safely use it.
>
> Arnd Bergmann called us on that nonsense, pointing out that it'd make
> more sense to let unprivileged users use them now, so that we'll be
> more careful about the security as patches roll in.
>
> So, Oren's patchset right now only checkpoints current, despite pid
> being part of the API. So the task can access its own data. When
> the patch supports checkpointing another task (which Oren says he's
> doing right now), then our intent is to check for ptrace access to
> the target task. (Right, Oren?)

Correct. That's already in the additional patch in the git tree - first
I locate the task and if found, I check ptrace_may_access() (read mode).

see the top patch in:
http://git.ncl.cs.columbia.edu/?p=linux-cr-dev.git;a=shortlog;h=refs/heads/ckpt-v7

>
>> What happens if I pass it a pid which isn't system-wide unique?
>
> pid must be checked in the caller's pid namespace. So if I've created a
> container which I want to checkpoint, pid 1 in that pidns will be, say,
> 3497 in my pid_ns, and so 3497 is the pid I must use. If I try to pass
> 1, I'll try to checkpoint my own container. And, if I'm not privileged
> and init is owned by root, the ptrace() check I mentioned above will
> return -EPERM.

Yup.

>
>> What happens if I pass it a pid of a process which I don't own? This
>> is super security-sensitive and we need to go over the permission
>> checking with a toothcomb. It needs to be exhaustively described in
>> the changelog. It might have security/selinux implications - I don't
>> know, I didn't look, but lights are flashing and bells are ringing over
>> here.

This should be covered by ptrace_may_access() test.

In the longer run, I suppose SELinux people would want a security hook
there to approve or disapprove the operation.

>>
>> What happens if I pass it a pid of a process which I _do_ own, but it
>> does not refer to a container's init process?
>
> I would assume that do_checkpoint() would return -EINVAL, but it's a
> great question: Oren, did you have another plan?

Since we intentionally provide minimal functionality to keep the patchset
simple and allow easy review - we only checkpoint one task; it doesn't
really matter because we don't deal with the entire container.

With the ability to checkpoint multiple processes we will have to ensure
that we checkpoint an entire container. I planned to return -EINVAL if
the target task isn't a container init(1). Another option, if people
prefer, is to use any task in a container to "represent" the entire
container.

>
>> If `pid' must refer to a container's init process, isn't it always
>> equal to 1??
>
> Not in the caller's pid_namespace.
>
>>> /**
>>> * sys_restart - restart a container
>>> * @crid: checkpoint image identifier
>>> @@ -36,6 +234,19 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>>> */
>>> asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
>>> {
>>> - pr_debug("sys_restart not implemented yet\n");
>>> - return -ENOSYS;
>>> + struct cr_ctx *ctx;
>>> + int ret;
>>> +
>>> + /* no flags for now */
>>> + if (flags)
>>> + return -EINVAL;
>>> +
>>> + ctx = cr_ctx_alloc(crid, fd, flags | CR_CTX_RSTR);
>>> + if (IS_ERR(ctx))
>>> + return PTR_ERR(ctx);
>>> +
>>> + ret = do_restart(ctx);
>>> +
>>> + cr_ctx_free(ctx);
>>> + return ret;
>>> }
>> Again, this is scary stuff. We're allowing unprivileged userspace to
>> feed random numbers into kernel data structures.
>
> Yes, all of the file opens and mmaps must not skip the usual security
> checks. The task credentials are currently unsupported, meaning that
> euid, etc, come from the caller, not the checkpoint image. When the

Actually, the fact that task credentials are not restored makes it
more secure, because the user can't do anything beyond her current
capabilities.

For the same reason, however, unless we agree on a secure way to
elevate credentials, there are various things that we cannot restore,
even though it may be something we would want to permit.

> restoration of credentials becomes supported, then definitely the
> caller (of sys_restart())'s ability to setresuid/setresgid to those
> values must be checked.
>
> So that's why we don't want CAP_SYS_ADMIN required up-front. That way
> we will be forced to more carefully review each of those features.
>
>> I'd like to see the security guys take a real close look at all of
>> this, and for them to do that effectively they should be provided with
>> a full description of the security design of this feature.
>
> Right, some of the above should be spelled out somewhere. Should it be
> in the patch description, in the Documentation/checkpoint.txt file,
> or someplace else? Oren, do you want to filter the above information
> into the right place, or do you want me to do it and send you a patch?

I'll add something to the Documentation/checkpoint.txt.

Thanks,

Oren.

2008-10-22 03:02:57

by Dave Hansen

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

On Tue, 2008-10-21 at 22:55 -0400, Daniel Jacobowitz wrote:
> I haven't been following - but why this whole container restriction?
> Checkpoint/restart of individual processes is very useful too.
> There are issues with e.g. IPC, but I'm not convinced they're
> substantially different than the issues already present for a
> container.

Containers provide isolation. Once you have isolation, you have a
discrete set of resources which you can checkpoint/restart.

Let's say you have a process you want to checkpoint. If it uses a
completely discrete IPC namespace, you *know* that nothing else depends
on those IPC ids. We don't even have to worry about who might have been
using them and when.

Also think about pids. Without containers, how can you guarantee a
restarted process that it can regain the same pid?

-- Dave

2008-10-22 03:16:19

by Daniel Jacobowitz

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

On Tue, Oct 21, 2008 at 09:33:19PM -0400, Oren Laadan wrote:
> >> What happens if I pass it a pid of a process which I _do_ own, but it
> >> does not refer to a container's init process?
> >
> > I would assume that do_checkpoint() would return -EINVAL, but it's a
> > great question: Oren, did you have another plan?
>
> Since we intentionally provide minimal functionality to keep the patchset
> simple and allow easy review - we only checkpoint one task; it doesn't
> really matter because we don't deal with the entire container.
>
> With the ability to checkpoint multiple processes we will have to ensure
> that we checkpoint an entire container. I planned to return -EINVAL if
> the target task isn't a container init(1). Another option, if people
> prefer, is to use any task in a container to "represent" the entire
> container.

I haven't been following - but why this whole container restriction?
Checkpoint/restart of individual processes is very useful too.
There are issues with e.g. IPC, but I'm not convinced they're
substantially different than the issues already present for a
container.

--
Daniel Jacobowitz
CodeSourcery

2008-10-22 09:20:59

by Ingo Molnar

Subject: Re: [RFC v7][PATCH 0/9] Kernel based checkpoint/restart


* Dave Hansen <[email protected]> wrote:

> On Tue, 2008-10-21 at 12:21 -0700, Andrew Morton wrote:
> > On Mon, 20 Oct 2008 01:40:28 -0400
> > Oren Laadan <[email protected]> wrote:
> > > These patches implement basic checkpoint-restart [CR]. This version
> > > (v7) supports basic tasks with simple private memory, and open files
> > > (regular files and directories only).
> >
> > - how useful is this code as it stands in real-world usage?
>
> Right now, an application must be specifically written to use these
> new system calls. It must be a single process and not share any
> resources with other processes. The only file descriptors that may be
> open are simple files and may not include sockets or pipes.
>
> What this means in practice is that it is useful for a simple app
> doing computational work.

say a chemistry application doing calculations. Or a raytracer with a
large job. Both can take many hours (days!) even on very fast machines
and the restrictions on rebootability can hurt in such cases.

You should reach a minimal level of initial practical utility: say some
helper tool that allows testers to checkpoint and restore a real PovRay
session - without any modification to a stock distro PovRay.

Ingo

2008-10-22 11:51:46

by Daniel Lezcano

Subject: Re: [RFC v7][PATCH 0/9] Kernel based checkpoint/restart

Ingo Molnar wrote:
> * Dave Hansen <[email protected]> wrote:
>
>> On Tue, 2008-10-21 at 12:21 -0700, Andrew Morton wrote:
>>> On Mon, 20 Oct 2008 01:40:28 -0400
>>> Oren Laadan <[email protected]> wrote:
>>>> These patches implement basic checkpoint-restart [CR]. This version
>>>> (v7) supports basic tasks with simple private memory, and open files
>>>> (regular files and directories only).
>>> - how useful is this code as it stands in real-world usage?
>> Right now, an application must be specifically written to use these
>> new system calls. It must be a single process and not share any
>> resources with other processes. The only file descriptors that may be
>> open are simple files and may not include sockets or pipes.
>>
>> What this means in practice is that it is useful for a simple app
>> doing computational work.
>
> say a chemistry application doing calculations. Or a raytracer with a
> large job. Both can take many hours (days!) even on very fast machines
> and the restrictions on rebootability can hurt in such cases.
>
> You should reach a minimal level of initial practical utility: say some
> helper tool that allows testers to checkpoint and restore a real PovRay
> session - without any modification to a stock distro PovRay.

The liblxc userspace tools do that.

http://sourceforge.net/projects/lxc/

The lxc-checkpoint and lxc-restart commands can be used to test Oren's
patches with the external checkpoint Cedric did. These commands are
experimental and under development, so a hack may be necessary for
checkpoint/restart.

I haven't tried them with Oren's external checkpoint yet, but I think the
commands should work. These commands rely on the freezer, so the
checkpoint command does freeze, checkpoint, unfreeze (and kill, if
specified).

lxc-create -n foo

lxc-start -n foo mypovray

lxc-checkpoint -s -n foo > myckptfile

lxc-restart -n foo < myckptfile

Thanks
-- Daniel
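
The freeze, checkpoint, unfreeze ordering mentioned above can be sketched as follows. This is not how lxc-checkpoint is implemented (its source is not shown in this thread); it assumes a cgroup-freezer-style control file that accepts the keywords FROZEN and THAWED, and every toy_ name, including the stand-in callback, is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* write one state keyword ("FROZEN" or "THAWED") to a freezer state file */
int toy_set_freezer_state(const char *state_path, const char *state)
{
	FILE *f = fopen(state_path, "w");
	int err = 0;

	if (!f)
		return -errno;
	if (fputs(state, f) == EOF)
		err = -EIO;
	if (fclose(f) == EOF && !err)
		err = -EIO;
	return err;
}

/* stand-in for the checkpoint step: it only counts how often it ran */
int toy_count_ckpt(void *arg)
{
	++*(int *)arg;
	return 0;
}

/* freeze, run the checkpoint callback, then thaw whether or not it failed */
int toy_frozen_checkpoint(const char *state_path,
			  int (*do_ckpt)(void *arg), void *arg)
{
	int ret, thaw;

	assert(do_ckpt != NULL);
	ret = toy_set_freezer_state(state_path, "FROZEN");
	if (ret < 0)
		return ret;
	ret = do_ckpt(arg);
	thaw = toy_set_freezer_state(state_path, "THAWED");
	return ret < 0 ? ret : thaw;
}
```

Thawing on the error path matters: a failed checkpoint should not leave the container frozen. A real tool would presumably also re-read the state file after requesting FROZEN, since the freeze may not complete immediately.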

2008-10-22 14:29:21

by Daniel Jacobowitz

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

On Tue, Oct 21, 2008 at 08:02:43PM -0700, Dave Hansen wrote:
> Let's say you have a process you want to checkpoint. If it uses a
> completely discrete IPC namespace, you *know* that nothing else depends
> on those IPC ids. We don't even have to worry about who might have been
> using them and when.
>
> Also think about pids. Without containers, how can you guarantee a
> restarted process that it can regain the same pid?

OK, that makes sense. In a lot of simple cases you can get by without
regaining the same pid; there's an implementation of checkpointing in
GDB that works by injecting fork calls into the child, and it is
useful for a reasonable selection of single-threaded programs.

--
Daniel Jacobowitz
CodeSourcery
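
The fork-based approach mentioned above works because a forked child keeps a copy-on-write image of the parent's memory as it was at the moment of the fork. The following is a small userspace sketch of that effect only; it is not GDB's implementation, and toy_fork_snapshot() is a hypothetical name. The child gets a new pid, which is exactly the limitation Dave raises.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that holds a copy-on-write snapshot of *state, let the
 * parent modify *state, then ask the child what value it still sees.
 * Returns the snapshotted value, or -1 on error.
 */
int toy_fork_snapshot(int *state)
{
	int go[2], back[2], seen = -1;
	char c = 'g';
	pid_t child;

	assert(state != NULL);
	if (pipe(go) < 0)
		return -1;
	if (pipe(back) < 0) {
		close(go[0]);
		close(go[1]);
		return -1;
	}

	child = fork();
	if (child < 0) {
		close(go[0]);
		close(go[1]);
		close(back[0]);
		close(back[1]);
		return -1;
	}
	if (child == 0) {
		/* the "checkpoint": parked until the parent says go */
		close(go[1]);
		close(back[0]);
		if (read(go[0], &c, 1) != 1)
			_exit(1);
		if (write(back[1], state, sizeof(*state)) < 0)
			_exit(1);
		_exit(0);
	}

	close(go[0]);
	close(back[1]);
	*state += 1;		/* the parent keeps running and changes state */
	if (write(go[1], &c, 1) != 1 ||
	    read(back[0], &seen, sizeof(seen)) != (ssize_t)sizeof(seen))
		seen = -1;
	close(go[1]);
	close(back[0]);
	waitpid(child, NULL, 0);
	return seen;
}
```

Starting from a value of 7, the parent ends up with 8 while the child still reports 7, so resuming the child instead of the parent amounts to rolling back to the checkpoint.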

2008-10-22 15:28:27

by Serge E. Hallyn

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

Quoting Oren Laadan ([email protected]):
>
>
> Serge E. Hallyn wrote:
> > Quoting Andrew Morton ([email protected]):
> >> On Mon, 20 Oct 2008 01:40:30 -0400
> >> Oren Laadan <[email protected]> wrote:
> >>> asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> >>> {
> >>> - pr_debug("sys_checkpoint not implemented yet\n");
> >>> - return -ENOSYS;
> >>> + struct cr_ctx *ctx;
> >>> + int ret;
> >>> +
> >>> + /* no flags for now */
> >>> + if (flags)
> >>> + return -EINVAL;
> >>> +
> >>> + ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
> >>> + if (IS_ERR(ctx))
> >>> + return PTR_ERR(ctx);
> >>> +
> >>> + ret = do_checkpoint(ctx);
> >>> +
> >>> + if (!ret)
> >>> + ret = ctx->crid;
> >>> +
> >>> + cr_ctx_free(ctx);
> >>> + return ret;
> >>> }
> >> Is it appropriate that this be an unprivileged operation?
> >
> > Early versions checked capable(CAP_SYS_ADMIN), and we reasoned that we
> > would later attempt to remove the need for privilege so that all users
> > could safely use it.
> >
> > Arnd Bergmann called us on that nonsense, pointing out that it'd make
> > more sense to let unprivileged users use them now, so that we'll be
> > more careful about the security as patches roll in.
> >
> > So, Oren's patchset right now only checkpoints current, despite pid
> > being part of the API. So the task can access its own data. When
> > the patch supports checkpointing another task (which Oren says he's
> > doing right now), then our intent is to check for ptrace access to
> > the target task. (Right, Oren?)
>
> Correct. That's already in the additional patch in the git tree - first
> I locate the task and if found, I check ptrace_may_access() (read mode).

Just thinking aloud...

Is read mode appropriate? The user can edit the statefile and restart
it. Admittedly the restart code should then do all the appropriate
checks for recreating resources, but I'm having a hard time thinking
through this straight.

Let's say hallyn is running passwd. ruid=500,euid=0. He quickly
checkpoints. Then he restarts. Will restart say "ok, the /bin/passwd
binary is setuid 0 so let hallyn take euid=0 for this?" I guess not.
But are there other resources for which this is harder to get right?

...

> This should be covered by ptrace_may_access() test.
>
> In the longer run, I suppose SElinux people would want a security hook
> there to approve or disapprove the operation.

I think we'll find the ptrace() checks to be so like what we're doing
that no new check will be needed. But we should definitely ask them.

Now may be too early to ask, though. The answer will be clearer once
more resources are supported.

>
> >>
> >> What happens if I pass it a pid of a process which I _do_ own, but it
> >> does not refer to a container's init process?
> >
> > I would assume that do_checkpoint() would return -EINVAL, but it's a
> > great question: Oren, did you have another plan?
>
> Since we intentionally provide minimal functionality to keep the patchset
> simple and allow easy review - we only checkpoint one task; it doesn't
> really matter because we don't deal with the entire container.
>
> With the ability to checkpoint multiple processes we will have to ensure
> that we checkpoint an entire container. I planned to return -EINVAL if
> the target task isn't a container init(1). Another option, if people
> prefer, is to use any task in a container to "represent" the entire
> container.

Except we support nested containers, so unless we only support
checkpoint of the deepest container, that doesn't work.

...

> >> Again, this is scary stuff. We're allowing unprivileged userspace to
> >> feed random numbers into kernel data structures.
> >
> > Yes, all of the file opens and mmaps must not skip the usual security
> > checks. The task credentials are currently unsupported, meaning that
> > euid, etc, come from the caller, not the checkpoint image. When the
>
> Actually, the fact that task credentials are not restored makes it
> more secure, because the user can't do anything beyond her current
> capabilities.

Hmm, so do you think we just always use the caller's credentials?

If we were to use some sort of tpm-signing of statefiles, then
hallyn restarting a checkpointed /bin/passwd may become doable.

> For the same reason, however, unless we agree on a secure way to
> elevate credentials, there are various things that we cannot restore,
> even though it may be something we would want to permit.
>
> > restoration of credentials becomes supported, then definitely the
> > caller (of sys_restore())'s ability to setresuid/setresgid to those
> > values must be checked.
> >
> > So that's why we don't want CAP_SYS_ADMIN required up-front. That way
> > we will be forced to more carefully review each of those features.
> >
> >> I'd like to see the security guys take a real close look at all of
> >> this, and for them to do that effectively they should be provided with
> >> a full description of the security design of this feature.
> >
> > Right, some of the above should be spelled out somewhere. Should it be
> > in the patch description, in the Documentation/checkpoint.txt file,
> > or someplace else? Oren, do you want to filter the above information
> > into the right place, or do you want me to do it and send you a patch?
>
> I'll add something to the Documentation/checkpoint.txt.

Cool, thanks Oren.

-serge

2008-10-22 16:04:52

by Oren Laadan

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart



Serge E. Hallyn wrote:
> Quoting Oren Laadan ([email protected]):
>>
>> Serge E. Hallyn wrote:
>>> Quoting Andrew Morton ([email protected]):
>>>> On Mon, 20 Oct 2008 01:40:30 -0400
>>>> Oren Laadan <[email protected]> wrote:
>>>>> asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
>>>>> {
>>>>> - pr_debug("sys_checkpoint not implemented yet\n");
>>>>> - return -ENOSYS;
>>>>> + struct cr_ctx *ctx;
>>>>> + int ret;
>>>>> +
>>>>> + /* no flags for now */
>>>>> + if (flags)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + ctx = cr_ctx_alloc(pid, fd, flags | CR_CTX_CKPT);
>>>>> + if (IS_ERR(ctx))
>>>>> + return PTR_ERR(ctx);
>>>>> +
>>>>> + ret = do_checkpoint(ctx);
>>>>> +
>>>>> + if (!ret)
>>>>> + ret = ctx->crid;
>>>>> +
>>>>> + cr_ctx_free(ctx);
>>>>> + return ret;
>>>>> }
>>>> Is it appropriate that this be an unprivileged operation?
>>> Early versions checked capable(CAP_SYS_ADMIN), and we reasoned that we
>>> would later attempt to remove the need for privilege so that all users
>>> could safely use it.
>>>
>>> Arnd Bergmann called us on that nonsense, pointing out that it'd make
>>> more sense to let unprivileged users use them now, so that we'll be
>>> more careful about the security as patches roll in.
>>>
>>> So, Oren's patchset right now only checkpoints current, despite pid
>>> being part of the API. So the task can access its own data. When
>>> the patch supports checkpointing another task (which Oren says he's
>>> doing right now), then our intent is to check for ptrace access to
>>> the target task. (Right, Oren?)
>> Correct. That's already in the additional patch in the git tree - first
>> I locate the task and if found, I check ptrace_may_access() (read mode).
>
> Just thinking aloud...
>
> Is read mode appropriate? The user can edit the statefile and restart
> it. Admittedly the restart code should then do all the appropriate
> checks for recreating resources, but I'm having a hard time thinking
> through this straight.
>
> Let's say hallyn is running passwd. ruid=500,euid=0. He quickly
> checkpoints. Then he restarts. Will restart say "ok, the /bin/passwd
> binary is setuid 0 so let hallyn take euid=0 for this?" I guess not.
> But are there other resources for which this is harder to get right?

I'd say that checkpoint and restart are separate.

In checkpoint, you read the state and save it somewhere; you don't
modify anything in the target task (container). This is equivalent to
ptrace read-mode. If you could do ptrace, you could save all that
state. In fact, you could save it in a format that is suitable for
a future restart ... (or just forge one !)

In restart, we either don't trust the user and keep everything to
be done with her credentials, or we trust the root user and allow
all operations (like loading a kernel module).

We can actually have both modes of operations. How to decide that
we trust the user is a separate question: one option is to have
both checkpoint and restart executables setuid - checkpoint will
sign (in user space) the output image, and restart (in user space)
will validate the signature, before passing it to the kernel. Surely
there are other ways...

>
> ...
>
>> This should be covered by ptrace_may_access() test.
>>
>> In the longer run, I suppose SElinux people would want a security hook
>> there to approve or disapprove the operation.
>
> I think we'll find the ptrace() checks to be so like what we're doing
> that no new check will be needed. But we should definitely ask them.
>
> Now may be too early to ask, though. The answer will be clearer once
> more resources are supported.
>
>>>> What happens if I pass it a pid of a process which I _do_ own, but it
>>>> does not refer to a container's init process?
>>> I would assume that do_checkpoint() would return -EINVAL, but it's a
>>> great question: Oren, did you have another plan?
>> Since we intentionally provide minimal functionality to keep the patchset
>> simple and allow easy review - we only checkpoint one task; it doesn't
>> really matter because we don't deal with the entire container.
>>
>> With the ability to checkpoint multiple processes we will have to ensure
>> that we checkpoint an entire container. I planned to return -EINVAL if
>> the target task isn't a container init(1). Another option, if people
>> prefer, is to use any task in a container to "represent" the entire
>> container.
>
> Except we support nested containers, so unless we only support
> checkpoint of the deepest container, that doesn't work.
>
> ...

yeah.. I just thought about it this morning ;)

>
>>>> Again, this is scary stuff. We're allowing unprivileged userspace to
>>>> feed random numbers into kernel data structures.
>>> Yes, all of the file opens and mmaps must not skip the usual security
>>> checks. The task credentials are currently unsupported, meaning that
>>> euid, etc, come from the caller, not the checkpoint image. When the
>> Actually, the fact that task credentials are not restored makes it
>> more secure, because the user can't do anything beyond her current
>> capabilities.
>
> Hmm, so do you think we just always use the caller's credentials?

Nope, since we will fail to restart in many cases. We will need a way
to move from caller's credentials to saved credentials, and even from
caller's credentials to privileged credentials (e.g. to reopen a file
that was created by a setuid program prior to dropping privileges).

To do that, we will need to agree on a way to escalate/change the
credentials. This however belongs to user-space (and then the binaries
for checkpoint/restart will be setuid themselves).

There will also be the issue of mapping credentials: a user A may have
one UID/GID on one system and another UID/GID on another system, and
we may want to do the conversion. This, too, can be done in user space
prior to restart by using an appropriate filter through the checkpoint
stream.
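
A rough sketch of such a user-space filter step, in C. The checkpoint image
format is not specified in this discussion, so this only shows the id
translation itself; struct id_map, map_uid() and remap_uids() are
hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* One translation rule: uid on the source host -> uid on the target host. */
struct id_map {
    uid_t from;
    uid_t to;
};

/* Look up src in the table. Returns 0 and stores the result, or -1. */
static int map_uid(const struct id_map *map, size_t n, uid_t src, uid_t *dst)
{
    assert(map != NULL || n == 0);
    assert(dst != NULL);

    for (size_t i = 0; i < n; i++) {
        if (map[i].from == src) {
            *dst = map[i].to;
            return 0;
        }
    }
    return -1;  /* unmapped id: let the caller refuse the restart */
}

/*
 * Translate every id in ids[] in place. All ids are checked first, so a
 * failure leaves the array untouched instead of half-converted.
 */
static int remap_uids(const struct id_map *map, size_t n,
                      uid_t *ids, size_t count)
{
    uid_t tmp;

    for (size_t i = 0; i < count; i++) {
        if (map_uid(map, n, ids[i], &tmp) < 0)
            return -1;
    }
    for (size_t i = 0; i < count; i++)
        (void)map_uid(map, n, ids[i], &ids[i]);
    return 0;
}
```

Failing on an unmapped id, rather than passing it through unchanged, is a
deliberate choice in this sketch: a stale uid silently surviving into the
restarted task seems worse than a refused restart.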

Oren.

>
> If we were to use some sort of tpm-signing of statefiles, then
> hallyn restarting a checkpointed /bin/passwd may become doable.
>
>> For the same reason, however, unless we agree on a secure way to
>> elevate credentials, there are various things that we cannot restore,
>> even though it may be something we would want to permit.
>>
>>> restoration of credentials becomes supported, then definitely the
>>> caller (of sys_restore())'s ability to setresuid/setresgid to those
>>> values must be checked.
>>>
>>> So that's why we don't want CAP_SYS_ADMIN required up-front. That way
>>> we will be forced to more carefully review each of those features.
>>>
>>>> I'd like to see the security guys take a real close look at all of
>>>> this, and for them to do that effectively they should be provided with
>>>> a full description of the security design of this feature.
>>> Right, some of the above should be spelled out somewhere. Should it be
>>> in the patch description, in the Documentation/checkpoint.txt file,
>>> or someplace else? Oren, do you want to filter the above information
>>> into the right place, or do you want me to do it and send you a patch?
>> I'll add something to the Documentation/checkpoint.txt.
>
> Cool, thanks Oren.
>
> -serge

2008-10-22 17:04:23

by Serge E. Hallyn

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

Quoting Oren Laadan ([email protected]):
>
>
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan ([email protected]):
> > Just thinking aloud...
> >
> > Is read mode appropriate? The user can edit the statefile and restart
> > it. Admittedly the restart code should then do all the appropriate
> > checks for recreating resources, but I'm having a hard time thinking
> > through this straight.
> >
> > Let's say hallyn is running passwd. ruid=500,euid=0. He quickly
> > checkpoints. Then he restarts. Will restart say "ok, the /bin/passwd
> > binary is setuid 0 so let hallyn take euid=0 for this?" I guess not.
> > But are there other resources for which this is harder to get right?
>
> I'd say that checkpoint and restart are separate.
>
> In checkpoint, you read the state and save it somewhere; you don't
> modify anything in the target task (container). This is equivalent to
> ptrace read-mode. If you could do ptrace, you could save all that
> state. In fact, you could save it in a format that is suitable for
> a future restart ... (or just forge one !)

Yeah, that's convincing.

> In restart, we either don't trust the user and keep everything to
> be done with her credentials, or we trust the root user and allow
> all operations (like loading a kernel module).
>
> We can actually have both modes of operations. How to decide that
> we trust the user is a separate question: one option is to have
> both checkpoint and restart executables setuid - checkpoint will
> sign (in user space) the output image, and restart (in user space)
> will validate the signature, before passing it to the kernel. Surely
> there are other ways...

Makes sense.

...

> > Hmm, so do you think we just always use the caller's credentials?
>
> Nope, since we will fail to restart in many cases. We will need a way
> to move from caller's credentials to saved credentials, and even from
> caller's credentials to privileged credentials (e.g. to reopen a file
> that was created by a setuid program prior to dropping privileges).

Can we agree to worry about that much much later? :) Would you agree
that for the majority of use-cases, restarting with caller's credentials
will work? Or am I wrong about that?

> To do that, we will need to agree on a way to escalate/change the
> credentials. This however belongs to user-space (and then the binaries
> for checkpoint/restart will be setuid themselves).

Ok those are less scary, and I have no problem with those.

> There will also be the issue of mapping credentials: a user A may have
> one UID/GID on one system and another UID/GID on another system, and
> we may want to do the conversion. This, too, can be done in user space
> prior to restart by using an appropriate filter through the checkpoint
> stream.

User namespaces may help here too. So user A can create a new user
namespace and restart as user B in that namespace. But right now that
sounds like overkill.

-serge

2008-10-22 18:33:24

by Oren Laadan

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart



Serge E. Hallyn wrote:
> Quoting Oren Laadan ([email protected]):
>>
>> Serge E. Hallyn wrote:
>>> Quoting Oren Laadan ([email protected]):
>>> Just thinking aloud...
>>>
>>> Is read mode appropriate? The user can edit the statefile and restart
>>> it. Admittedly the restart code should then do all the appropriate
>>> checks for recreating resources, but I'm having a hard time thinking
>>> through this straight.
>>>
>>> Let's say hallyn is running passwd. ruid=500,euid=0. He quickly
>>> checkpoints. Then he restarts. Will restart say "ok, the /bin/passwd
>>> binary is setuid 0 so let hallyn take euid=0 for this?" I guess not.
>>> But are there other resources for which this is harder to get right?
>> I'd say that checkpoint and restart are separate.
>>
>> In checkpoint, you read the state and save it somewhere; you don't
>> modify anything in the target task (container). This is equivalent to
>> ptrace read-mode. If you could do ptrace, you could save all that
>> state. In fact, you could save it in a format that is suitable for
>> a future restart ... (or just forge one !)
>
> Yeah, that's convincing.
>
>> In restart, we either don't trust the user and keep everything to
>> be done with her credentials, or we trust the root user and allow
>> all operations (like loading a kernel module).
>>
>> We can actually have both modes of operations. How to decide that
>> we trust the user is a separate question: one option is to have
>> both checkpoint and restart executables setuid - checkpoint will
>> sign (in user space) the output image, and restart (in user space)
>> will validate the signature, before passing it to the kernel. Surely
>> there are other ways...
>
> Makes sense.
>
> ...
>
>>> Hmm, so do you think we just always use the caller's credentials?
>> Nope, since we will fail to restart in many cases. We will need a way
>> to move from caller's credentials to saved credentials, and even from
>> caller's credentials to privileged credentials (e.g. to reopen a file
>> that was created by a setuid program prior to dropping privileges).
>
> Can we agree to worry about that much much later? :) Would you agree

Definitely. Even more so - I believe that's a user-space issue :)

> that for the majority of use-cases, restarting with caller's credentials
> will work? Or am I wrong about that?

That depends on your target audience. For HPC you're probably right.
For server applications this may not be the case (e.g. apache needs
a privileged port, and then it drops privileges).

I agree that we may safely (...) defer this discussion until the
implementation gets much beefier.

>
>> To do that, we will need to agree on a way to escalate/change the
>> credentials. This however belongs to user-space (and then the binaries
>> for checkpoint/restart will be setuid themselves).
>
> Ok those are less scary, and I have no problem with those.
>
>> There will also be the issue of mapping credentials: a user A may have
>> one UID/GID on one system and another UID/GID on another system, and
>> we may want to do the conversion. This, too, can be done in user space
>> prior to restart by using an appropriate filter through the checkpoint
>> stream.
>
> User namespaces may help here too. So user A can create a new user
> namespace and restart as user B in that namespace. But right now that
> sounds like overkill.

Indeed, virtualization is probably the solution. Here, too, I think
it's safe to defer the discussion.

Oren.

2008-10-27 08:28:49

by Peter Chubb

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

>>>>> "Oren" == Oren Laadan <[email protected]> writes:


Oren> Nope, since we will fail to restart in many cases. We will need
Oren> a way to move from caller's credentials to saved credentials,
Oren> and even from caller's credentials to privileged credentials
Oren> (e.g. to reopen a file that was created by a setuid program
Oren> prior to dropping privileges).

You can't necessarily tell the difference between this and revocation
of privilege. For most security models, it must be possible to change
the permissions on the file, and then the restart should fail.

In our implementation, we simply refused to checkpoint setid programs.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia

2008-10-27 11:04:12

by Oren Laadan

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart



Peter Chubb wrote:
>>>>>> "Oren" == Oren Laadan <[email protected]> writes:
>
>
> Oren> Nope, since we will fail to restart in many cases. We will need
> Oren> a way to move from caller's credentials to saved credentials,
> Oren> and even from caller's credentials to privileged credentials
> Oren> (e.g. to reopen a file that was created by a setuid program
> Oren> prior to dropping privileges).
>
> You can't necessarily tell the difference between this and revocation
> of privilege. For most security models, it must be possible to change
> the permissions on the file, and then the restart should fail.
>
> In our implementation, we simply refused to checkpoint setid programs.

True. And this works very well for HPC applications.

However, it doesn't work so well for server applications, for instance.

Also, you could use file system snapshotting to ensure that the file
system view does not change, and still face the same issue.

So I'm perfectly ok with deferring this discussion to a later time :)

Oren.

> --
> Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
> http://www.ertos.nicta.com.au ERTOS within National ICT Australia
>

2008-10-27 16:43:18

by Dave Hansen

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

On Mon, 2008-10-27 at 07:03 -0400, Oren Laadan wrote:
> > In our implementation, we simply refused to checkpoint setid programs.
>
> True. And this works very well for HPC applications.
>
> However, it doesn't work so well for server applications, for
> instance.
>
> Also, you could use file system snapshotting to ensure that the file
> system view does not change, and still face the same issue.
>
> So I'm perfectly ok with deferring this discussion to a later time :)

Oren, is this a good place to stick a process_deny_checkpoint()? Both
so we refuse to checkpoint, and document this as something that has to
be addressed later?

-- Dave

2008-10-27 17:12:31

by Oren Laadan

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart


Dave Hansen wrote:
> On Mon, 2008-10-27 at 07:03 -0400, Oren Laadan wrote:
>>> In our implementation, we simply refused to checkpoint setid programs.
>>
>> True. And this works very well for HPC applications.
>>
>> However, it doesn't work so well for server applications, for
>> instance.
>>
>> Also, you could use file system snapshotting to ensure that the file
>> system view does not change, and still face the same issue.
>>
>> So I'm perfectly ok with deferring this discussion to a later time :)
>
> Oren, is this a good place to stick a process_deny_checkpoint()? Both
> so we refuse to checkpoint, and document this as something that has to
> be addressed later?

why refuse to checkpoint ?

if I'm root, and I want to checkpoint, and later restart, my sshd server
(assuming we support listening sockets) - then why not ?

we can just let it be, and have the restart fail (if it isn't root that
does the restart); perhaps add something like warn_checkpoint() (similar
to deny, but only warns) ?

Oren.

2008-10-27 20:52:08

by Matt Helsley

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

On Mon, 2008-10-27 at 13:11 -0400, Oren Laadan wrote:
> Dave Hansen wrote:
> > On Mon, 2008-10-27 at 07:03 -0400, Oren Laadan wrote:
> >>> In our implementation, we simply refused to checkpoint setid programs.
> >>
> >> True. And this works very well for HPC applications.
> >>
> >> However, it doesn't work so well for server applications, for
> >> instance.
> >>
> >> Also, you could use file system snapshotting to ensure that the file
> >> system view does not change, and still face the same issue.
> >>
> >> So I'm perfectly ok with deferring this discussion to a later time :)
> >
> > Oren, is this a good place to stick a process_deny_checkpoint()? Both
> > so we refuse to checkpoint, and document this as something that has to
> > be addressed later?
>
> why refuse to checkpoint ?

If most setuid programs hold privileged resources for extended periods
of time after dropping privileges then it seems like a good idea to
refuse to checkpoint. Restart of those programs would be quite
unreliable unless/until we find a nice solution.

> if I'm root, and I want to checkpoint, and later restart, my sshd server
> (assuming we support listening sockets) - then why not ?
> we can just let it be, and have the restart fail (if it isn't root that
> does the restart); perhaps add something like warn_checkpoint() (similar
> to deny, but only warns) ?

How will folks not specializing in checkpoint/restart know when to use
this as opposed to deny?

Instead, how about a flag to sys_checkpoint() -- DO_RISKY_CHECKPOINT --
which checkpoints despite !may_checkpoint?

Cheers,
-Matt Helsley

2008-10-27 21:20:49

by Serge E. Hallyn

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

Quoting Matt Helsley ([email protected]):
> On Mon, 2008-10-27 at 13:11 -0400, Oren Laadan wrote:
> > Dave Hansen wrote:
> > > On Mon, 2008-10-27 at 07:03 -0400, Oren Laadan wrote:
> > >>> In our implementation, we simply refused to checkpoint setid programs.
> > >>
> > >> True. And this works very well for HPC applications.
> > >>
> > >> However, it doesn't work so well for server applications, for
> > >> instance.
> > >>
> > >> Also, you could use file system snapshotting to ensure that the file
> > >> system view does not change, and still face the same issue.
> > >>
> > >> So I'm perfectly ok with deferring this discussion to a later time :)
> > >
> > > Oren, is this a good place to stick a process_deny_checkpoint()? Both
> > > so we refuse to checkpoint, and document this as something that has to
> > > be addressed later?
> >
> > why refuse to checkpoint ?
>
> If most setuid programs hold privileged resources for extended periods
> of time after dropping privileges then it seems like a good idea to
> refuse to checkpoint. Restart of those programs would be quite
> unreliable unless/until we find a nice solution.

I agree with Dave and Matt. Let's assume that we have a setuid root
program which creates some resources then drops to username kooky. If
you now checkpoint and restart that program, then a stupid restart will
either

1. be done as user kooky and not be able to recreate the
resources, fail.
2. be done as user root and not drop uid back to kooky, unsafe.

For the earliest prototypes of c/r, I think saying that setuid in the
life of a container makes checkpoint impossible is the right thing to
do.

-serge

2008-10-27 21:51:56

by Oren Laadan

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart



Matt Helsley wrote:
> On Mon, 2008-10-27 at 13:11 -0400, Oren Laadan wrote:
>> Dave Hansen wrote:
>>> On Mon, 2008-10-27 at 07:03 -0400, Oren Laadan wrote:
>>>>> In our implementation, we simply refused to checkpoint setid programs.
>>>>
>>>> True. And this works very well for HPC applications.
>>>>
>>>> However, it doesn't work so well for server applications, for
>>>> instance.
>>>>
>>>> Also, you could use file system snapshotting to ensure that the file
>>>> system view does not change, and still face the same issue.
>>>>
>>>> So I'm perfectly ok with deferring this discussion to a later time :)
>>> Oren, is this a good place to stick a process_deny_checkpoint()? Both
>>> so we refuse to checkpoint, and document this as something that has to
>>> be addressed later?
>> why refuse to checkpoint ?
>
> If most setuid programs hold privileged resources for extended periods
> of time after dropping privileges then it seems like a good idea to
> refuse to checkpoint. Restart of those programs would be quite
> unreliable unless/until we find a nice solution.
>
>> if I'm root, and I want to checkpoint, and later restart, my sshd server
>> (assuming we support listening sockets) - then why not ?
>> we can just let it be, and have the restart fail (if it isn't root that
>> does the restart); perhaps add something like warn_checkpoint() (similar
>> to deny, but only warns) ?
>
> How will folks not specializing in checkpoint/restart know when to use
> this as opposed to deny?
>
> Instead, how about a flag to sys_checkpoint() -- DO_RISKY_CHECKPOINT --
> which checkpoints despite !may_checkpoint?

I also agree with Matt - so we have a quorum :)

so just to clarify: sys_checkpoint() is to fail (with what error ?) if the
deny-checkpoint test fails.

however, if the user is risky, she can specify CR_CHECKPOINT_RISKY to force
an attempt to checkpoint as is.

does this sound right ?
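
Roughly, as a user-space model in C (not the kernel code). The flag name
CR_CHECKPOINT_RISKY is the one proposed above; its value, the helper name
checkpoint_allowed(), and the use of -EPERM are placeholders, since the
error code is still an open question.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define CR_CHECKPOINT_RISKY        0x1UL   /* value is an assumption */
#define CR_CHECKPOINT_VALID_FLAGS  CR_CHECKPOINT_RISKY

/*
 * Model of the proposed gate. may_checkpoint stands for the result of
 * the deny-checkpoint test on the target. Returns 0 if the checkpoint
 * may proceed, or a negative errno value.
 */
static int checkpoint_allowed(unsigned long flags, bool may_checkpoint)
{
    /* Unknown flag bits are rejected, as the current patch does for any flag. */
    if (flags & ~CR_CHECKPOINT_VALID_FLAGS)
        return -EINVAL;

    /* Denied target: only proceed if the caller explicitly asked for it. */
    if (!may_checkpoint && !(flags & CR_CHECKPOINT_RISKY))
        return -EPERM;      /* placeholder error code */

    return 0;
}
```

With this shape, a plain checkpoint of a denied task fails, the same call
with CR_CHECKPOINT_RISKY goes ahead, and any undefined flag bit still gets
-EINVAL.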

Oren.

2008-10-27 22:10:08

by Dave Hansen

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

On Mon, 2008-10-27 at 17:51 -0400, Oren Laadan wrote:
> > Instead, how about a flag to sys_checkpoint() -- DO_RISKY_CHECKPOINT --
> > which checkpoints despite !may_checkpoint?
>
> I also agree with Matt - so we have a quorum :)
>
> so just to clarify: sys_checkpoint() is to fail (with what error ?) if the
> deny-checkpoint test fails.
>
> however, if the user is risky, she can specify CR_CHECKPOINT_RISKY to force
> an attempt to checkpoint as is.

This sounds like an awful lot of policy to determine *inside* the
kernel. Everybody is going to have a different definition of risky, so
this scheme will work for approximately 5 minutes until it gets
patched. :)

Is it possible to enhance our interface such that users might have some
kind of choice on these matters?

-- Dave

2008-10-28 16:48:44

by Michael Kerrisk

Subject: Re: [RFC v7][PATCH 6/9] Checkpoint/restart: initial documentation

Oren,

Some comments/suggested fixes below.

On Mon, Oct 20, 2008 at 12:40 AM, Oren Laadan <[email protected]> wrote:
> Covers application checkpoint/restart, overall design, interfaces
> and checkpoint image format.
>
> Signed-off-by: Oren Laadan <[email protected]>
> Acked-by: Serge Hallyn <[email protected]>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
> Documentation/checkpoint.txt | 374 ++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 374 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/checkpoint.txt
>
> diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
> new file mode 100644
> index 0000000..a73a4f3
> --- /dev/null
> +++ b/Documentation/checkpoint.txt
> @@ -0,0 +1,374 @@
> +
> + === Checkpoint-Restart support in the Linux kernel ===
> +
> +Copyright (C) 2008 Oren Laadan
> +
> +Author: Oren Laadan <[email protected]>
> +
> +License: The GNU Free Documentation License, Version 1.2
> + (dual licensed under the GPL v2)
> +Reviewers:
> +
> +Application checkpoint/restart [CR] is the ability to save the state
> +of a running application so that it can later resume its execution
> +from the time at which it was checkpointed. An application can be
> +migrated by checkpointing it on one machine and restarting it on
> +another. CR can provide many potential benefits:
> +
> +* Failure recovery: by rolling back an to a previous checkpoint

Extraneous word "an"?

> +
> +* Improved response time: by restarting applications from checkpoints
> + instead of from scratch.
> +
> +* Improved system utilization: by suspending long running CPU
> + intensive jobs and resuming them when load decreases.
> +
> +* Fault resilience: by migrating applications off of faulty hosts.

s/off of/off/

> +
> +* Dynamic load balancing: by migrating applications to less loaded
> + hosts.
> +
> +* Improved service availability and administration: by migrating
> + applications before host maintenance so that they continue to run
> + with minimal downtime
> +
> +* Time-travel: by taking periodic checkpoints and restarting from
> + any previous checkpoint.
> +
> +
> +=== Overall design
> +
> +Checkpoint and restart is done in the kernel as much as possible. The
> +kernel exports a relative opaque 'blob' of data to userspace which can

s/relative/relatively/

> +then be handed to the new kernel at restore time. The 'blob' contains
> +data and state of select portions of kernel structures such as VMAs
> +and mm_structs, as well as copies of the actual memory that the tasks
> +use. Any changes in this blob's format between kernel revisions can be
> +handled by an in-userspace conversion program. The approach is similar
> +to virtually all of the commercial CR products out there, as well as
> +the research project Zap.
> +
> +Two new system calls are introduced to provide CR: sys_checkpoint and
> +sys_restart. The checkpoint code basically serializes internal kernel
> +state and writes it out to a file descriptor, and the resulting image
> +is stream-able. More specifically, it consists of 5 steps:
> + 1. Pre-dump
> + 2. Freeze the container
> + 3. Dump
> + 4. Thaw (or kill) the container
> + 5. Post-dump
> +Steps 1 and 5 are an optimization to reduce application downtime:
> +"pre-dump" works before freezing the container, e.g. the pre-copy for
> +live migration, and "post-dump" works after the container resumes
> +execution, e.g. write-back the data to secondary storage.
> +
> +The restart code basically reads the saved kernel state and from a

Extraneous word "and"

> +file descriptor, and re-creates the tasks and the resources they need
> +to resume execution. The restart code is executed by each task that
> +is restored in a new container to reconstruct its own state.
> +
> +
> +=== Interfaces
> +
> +int sys_checkpoint(pid_t pid, int fd, unsigned long flag);
> + Checkpoint a container whose init task is identified by pid, to the

I seem to recall Andrew M. mentioning something about this. Could you
add a bit of text here to explain why "pid" is not always "1"?

> + file designated by fd. Flags will have future meaning (should be 0
> + for now).

Should be 0, or must be 0? IMO, the text should be the latter. And
the code should check that -- does it?
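
To illustrate what "must be 0" would mean in practice, here is a minimal
userspace-compilable sketch of such a check. The names cr_check_flags and
CR_VALID_FLAGS are hypothetical and do not come from the patch; the point
is only that unknown bits are rejected now so they can gain meaning later.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mask of flag bits the call understands; none exist yet. */
#define CR_VALID_FLAGS 0UL

/*
 * Reject any flag bit outside the known set.
 * Returns 0 if the flags are acceptable, -EINVAL otherwise.
 */
static int cr_check_flags(unsigned long flags)
{
	if (flags & ~CR_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

With a check like this, a caller passing a nonzero value gets an error
today instead of silently depending on behaviour that may change.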

> + Returns: a positive integer that identifies the checkpoint image

Can you add some more text here to describe what this "positive
integer" is? E.g., how is it generated, and what does it refer to
(an address, a file descriptor, something else)?

Looking further down, it seems that this is the "crid", but you could
make that clearer already here.

> + (for future reference in case it is kept in memory) upon success,
> + 0 if it returns from a restart, and -1 if an error occurs.
> +
> +int sys_restart(int crid, int fd, unsigned long flags);
> + Restart a container from a checkpoint image identified by crid, or

See above -- that "crid" is the thing returned by checkpoint(), right?
Make that clearer here.

> + from the blob stored in the file designated by fd. Flags will have
> + future meaning (should be 0 for now).

Again... Should be 0, or must be 0? IMO, the text should be the
latter. And the code should check that -- does it?

> + Returns: 0 on success and -1 if an error occurs.
> +
> +Thus, if checkpoint is initiated by a process in the container, one
> +can use logic similar to fork():
> + ...
> + crid = checkpoint(...);
> + switch (crid) {
> + case -1:
> + perror("checkpoint failed");
> + break;
> + default:
> + fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);
> + /* proceed with execution after checkpoint */
> + ...
> + break;
> + case 0:
> + fprintf(stderr, "returned after restart\n");
> + /* proceed with action required following a restart */
> + ...
> + break;
> + }
> + ...
> +And to initiate a restart, the process in an empty container can use
> +logic similar to execve():
> + ...
> + if (restart(crid, ...) < 0)
> + perror("restart failed");
> + /* only get here if restart failed */
> + ...
> +
> +See below a complete example in C.
> +
> +
> +=== Order of state dump
> +
> +The order of operations, both save and restore, is as following:

s/is as following/is as follows/

> +
> +* Header section: header, container information, etc.
> +* Global section: [TBD] global resources such as IPC, UTS, etc.
> +* Process forest: [TBD] tasks and their relationships
> +* Per task data (for each task):
> + -> task state: elements of task_struct
> + -> thread state: elements of thread_struct and thread_info
> + -> CPU state: registers etc, including FPU
> + -> memory state: memory address space layout and contents
> + -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
> + -> files state: open file descriptors and their state
> + -> signals state: [TBD] pending signals and signal handling state
> + -> credentials state: [TBD] user and group state, statistics
> +
> +
> +=== Checkpoint image format
> +
> +The checkpoint image format is composed of records consistings of a

consisting

> +pre-header that identifies its contents, followed by a payload. (The
> +idea here is to enable parallel checkpointing in the future in which
> +multiple threads interleave data from multiple processes into a single
> +stream).
> +
> +The pre-header is defined by "struct cr_hdr" as follows:
> +
> +struct cr_hdr {
> + __s16 type;
> + __s16 len;
> + __u32 parent;
> +};
> +
> +Here, 'type' field identifies the type of the payload, 'len' tells its

s/'type' field/'type'/
or
s/'type' field/the 'type' field/

> +length in bytes. The 'parent' identifies the owner object instance. The

add "field" after 'parent'

> +meaning of the 'parent field varies depending on the type. For example,
> +for type CR_HDR_MM, the 'parent identifies the task to which this MM

Missing ' (single quote)

> +belongs. The payload also varies depending on the type, for instance,
> +the data describing a task_struct is given by a 'struct cr_hdr_task'
> +(type CR_HDR_TASK) and so on.
> +
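
As a reading aid for the record layout described above, here is a rough
sketch of how a userspace tool might step through records in an in-memory
image. It makes several assumptions not stated in the text: records are
tightly packed, fields are in native byte order, and 'len' counts payload
bytes only. The fixed-width types stand in for __s16/__u32, and
cr_next_record is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the quoted struct cr_hdr. */
struct cr_hdr {
	int16_t type;
	int16_t len;
	uint32_t parent;
};

/*
 * Read one record at offset 'off' in an image of 'size' bytes.
 * On success, fills *hdr, points *payload at the payload bytes and
 * returns the offset of the next record. Returns -1 if the record is
 * truncated or its length is negative.
 */
static long cr_next_record(const unsigned char *buf, size_t size, size_t off,
			   struct cr_hdr *hdr, const unsigned char **payload)
{
	assert(buf && hdr && payload);

	if (off > size || size - off < sizeof(*hdr))
		return -1;
	memcpy(hdr, buf + off, sizeof(*hdr));
	off += sizeof(*hdr);

	if (hdr->len < 0 || size - off < (size_t)hdr->len)
		return -1;
	*payload = buf + off;
	return (long)(off + (size_t)hdr->len);
}
```

A caller would loop, feeding the returned offset back in, and dispatch on
hdr.type for each record.
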
> +The format of the memory dump is as follows: for each VMA, there is a
> +'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
> +name. Following comes the actual contents, in one or more chunk: each

s/Following comes/Following that are/

s/chunk/chunks/

> +chunk begins with a header that specifies how many pages it holds,
> +then a the virtual addresses of all the dumped pages in that chunk,

s/a the/the/

> +followed by the actual contents of all the dumped pages. A header with
> +zero number of pages marks the end of the contents for a particular
> +VMA. Then comes the next VMA and so on.
> +
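
The chunk layout described above (header, then addresses, then page
contents, ending with a zero-page header) could be walked roughly as
follows. This is only a sketch: struct pgarr_hdr is a simplified,
hypothetical stand-in for the real chunk header, addresses are assumed to
be 64-bit, and the page size is passed in as a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified chunk header: only the page count. */
struct pgarr_hdr {
	uint64_t nr_pages;
};

/*
 * Walk the chunks of one VMA starting at buf and return the total
 * number of dumped pages, or -1 if the data ends before the zero-page
 * terminator. Each chunk is assumed to be: header, nr_pages addresses,
 * then nr_pages pages of page_size bytes each.
 */
static long count_vma_pages(const unsigned char *buf, size_t size,
			    size_t page_size)
{
	size_t off = 0;
	long total = 0;

	for (;;) {
		struct pgarr_hdr h;
		size_t per_page = sizeof(uint64_t) + page_size;

		if (size - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);

		if (h.nr_pages == 0)
			return total;	/* terminator: end of this VMA */

		/* Dividing avoids overflow when nr_pages is bogus. */
		if (h.nr_pages > (size - off) / per_page)
			return -1;
		off += (size_t)h.nr_pages * per_page;
		total += (long)h.nr_pages;
	}
}
```
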
> +To illustrate this, consider a single simple task with two VMAs: one
> +is file mapped with two dumped pages, and the other is anonymous with
> +three dumped pages. The checkpoint image will look like this:
> +
> +cr_hdr + cr_hdr_head
> +cr_hdr + cr_hdr_task
> + cr_hdr + cr_hdr_mm
> + cr_hdr + cr_hdr_vma + cr_hdr + string
> + cr_hdr_pgarr (nr_pages = 2)
> + addr1, addr2
> + page1, page2
> + cr_hdr_pgarr (nr_pages = 0)
> + cr_hdr + cr_hdr_vma
> + cr_hdr_pgarr (nr_pages = 3)
> + addr3, addr4, addr5
> + page3, page4, page5
> + cr_hdr_pgarr (nr_pages = 0)
> + cr_hdr + cr_mm_context
> + cr_hdr + cr_hdr_thread
> + cr_hdr + cr_hdr_cpu
> +cr_hdr + cr_hdr_tail
> +
> +
> +=== Current Implementation
> +
> +[2008-Oct-07]
> +There are several assumptions in the current implementation; they will
> +be gradually relaxed in future versions. The main ones are:
> +* A task can only checkpoint itself (missing "restart-block" logic).
> +* Namespaces are not saved or restored; They will be treated as a type
> + of shared object.
> +* In particular, it is assumed that the task's file system namespace
> + is the "root" for the entire container.
> +* It is assumed that the same file system view is available for the
> + restart task(s). Otherwise, a file system snapshot is required.
> +
> +
> +=== Sample code
> +
> +Two example programs: one uses checkpoint (called ckpt) to checkpoint
> +itself, and another uses restart (called rstr) to restart from that
> +checkpoint. Note the use of "dup2" to create a copy of an open file
> +and show how shared objects are treated. Execute like this:
> +
> +orenl:~/test$ ./ckpt > out.1
> + <-- ctrl-c

It really would be more readable if you simplified the shell prompt to
just "$ " or "sh$ " or "bash$". Future generations do not need to
know where you tested this code, or your username ;-).

[...]

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
git://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
man-pages online: http://www.kernel.org/doc/man-pages/online_pages.html
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html

2008-10-28 18:43:44

by Serge E. Hallyn

Subject: Re: [RFC v7][PATCH 2/9] General infrastructure for checkpoint restart

Quoting Dave Hansen ([email protected]):
> On Mon, 2008-10-27 at 17:51 -0400, Oren Laadan wrote:
> > > Instead, how about a flag to sys_checkpoint() -- DO_RISKY_CHECKPOINT --
> > > which checkpoints despite !may_checkpoint?
> >
> > I also agree with Matt - so we have a quorum :)
> >
> > so just to clarify: sys_checkpoint() is to fail (with what error ?) if the
> > deny-checkpoint test fails.
> >
> > however, if the user is risky, she can specify CR_CHECKPOINT_RISKY to force
> > an attempt to checkpoint as is.
>
> This sounds like an awful lot of policy to determine *inside* the
> kernel. Everybody is going to have a different definition of risky, so
> this scheme will work for approximately 5 minutes until it gets
> patched. :)
>
> Is it possible to enhance our interface such that users might have some
> kind of choice on these matters?

Well we could always just add a field to /proc/self/status, and let
userspace check that field (after freezing the task) for the
presence of CR_CHECKPOINT_RISKY and make up its own mind.
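
For illustration only, the userspace side of that could look something
like the sketch below. The "Checkpoint" field name and its values are
made up here; nothing like it exists in /proc/<pid>/status today. The
helper just looks up a "Key:<whitespace>value" line in text already read
from the status file.

```c
#include <assert.h>
#include <string.h>

/*
 * Look up "key:" in the text of a status file and copy its value
 * (without leading whitespace or the newline) into out, truncating
 * if needed. Returns 0 if the key was found, -1 otherwise.
 */
static int status_lookup(const char *text, const char *key,
			 char *out, size_t outlen)
{
	size_t klen = strlen(key);
	const char *line = text;

	assert(outlen > 0);

	while (line && *line) {
		const char *eol = strchr(line, '\n');

		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			const char *v = line + klen + 1;
			size_t n;

			while (*v == ' ' || *v == '\t')
				v++;
			n = eol ? (size_t)(eol - v) : strlen(v);
			if (n >= outlen)
				n = outlen - 1;
			memcpy(out, v, n);
			out[n] = '\0';
			return 0;
		}
		line = eol ? eol + 1 : NULL;
	}
	return -1;
}
```

A checkpoint tool would freeze the task, read its status file, and then
decide for itself whether a value such as "risky" is acceptable.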

Though my preference is for simplicity - just refuse the checkpoint.
That way people might scream loudly enough for us to support the
features they want. If we let them just bypass the check and hope for
the best, that starts to dilute some of the intended effect of all this.

-serge