2009-07-22 10:23:26

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 00/60] Kernel based checkpoint/restart

Application checkpoint/restart (c/r) is the ability to save the state
of a running application so that it can later resume its execution
from the time at which it was checkpointed, on the same or a different
machine.

This version introduces 'clone_with_pids()' syscall to preset pid(s)
for a child process. It is used by restart(2) to recreate process
hierarchy with the same pids as at checkpoint time.

It also adds a freezer state CHECKPOINTING to safeguard processes
during a checkpoint. Other important changes include support for
threads and zombies, credentials, signal handling, and improved
restart logic. See below for a more detailed changelog.

Compiled and tested against v2.6.31-rc3.

For more information, check out Documentation/checkpoint/*.txt

Q: How useful is this code as it stands in real-world usage?
A: Right now, the application can be single- or multi-processes and
threads. Supports open files - regular files and directories on
ext[234], pipes, and /dev/{null,zero,random,urandom}. All sort of
shared memory work. sysv IPC also works (except for semaphore
undo). It's already suitable for many types of batch jobs. (Note:
it is assumed that the fs view is available at restart).

Q: What can it checkpoint and restart ?
A: A (single threaded) process can checkpoint itself, aka "self"
checkpoint, if it calls the new system calls. Otherise, for an
"external" checkpoint, the caller must first freeze the target
processes. One can either checkpoint an entire container (and
we make best effort to ensure that the result is self-contained),
or merely a subtree of a process hierarchy.

Q: What about namespaces ?
A: Currrently, UTS and IPC namespaces are restored. They demonstrate
how namespaces are handled. More to come.

Q: What additional work needs to be done to it?
A: Fill in the gory details following the examples so far. WIP
includes (pending) signals, fifos, pseudo terminals, unix-
domain sockets, event-poll, architectures: powerpc and x86_64,
and more file systems types.

Q: How can I try it ?
A: This one can actually be used for simple batch jobs (pipes, too),
a whole container or just a subtree of tasks. Try it:

create the freezer cgroup:
$ mount -t cgroup -ofreezer freezer /cgroup
$ mkdir /cgroup/0

run the test, freeze it:
$ test/multitask &
[1] 2754
$ for i in `pidof multitask`; do echo $i > /cgroup/0/tasks; done
$ echo FROZEN > /cgruop/0/freezer.state

checkpoint:
$ ./ckpt 2754 > ckpt.out

restart:
$ ./mktree < ckpt.out

voila :)

To do all this, you'll need:

The git tree tracking v17, branch 'ckpt-v17' (and past versions):
git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git

Restarting multiple processes requires 'mktree' userspace tool with
the matching branch (v17):
git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

Oren.


Changelog:

[2009-Jul-21] v17
- Introduce syscall clone_with_pids() to restore original pids
- Support threads and zombies
- Save/restore task->files
- Save/restore task->sighand
- Save/restore futex
- Save/restore credentials
- Introduce PF_RESTARTING to skip notifications on task exit
- restart(2) allow caller to ask to freeze tasks after restart
- restart(2) isn't idempotent: return -EINTR if interrupted
- Improve debugging output handling
- Make multi-process restart logic more robust and complete
- Correctly select return value for restarting tasks on success
- Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
- Use CHECKPOINTING state for frozen checkpointed tasks
- Fix compilation without CONFIG_CHECKPOINT
- Fix compilation with CONFIG_COMPAT
- Fix headers includes and exports
- Leak detection performed in two steps
- Detect "inverse" leaks of objects (dis)appearing unexpectedly
- Memory: save/restore mm->{flags,def_flags,saved_auxv}
- Memory: only collect sub-objects of mm once (leak detection)
- Files: validate f_mode after restore
- Namespaces: leak detection for nsproxy sub-components
- Namespaces: proper restart from namespace(s) without namespace(s)
- Save global constants in header instead of per-object
- IPC: replace sys_unshare() with create_ipc_ns()
- IPC: restore objects in suitable namespace
- IPC: correct behavior under !CONFIG_IPC_NS
- UTS: save/restore all fields
- UTS: replace sys_unshare() with create_uts_ns()
- X86_32: sanitize cpu, debug, and segment registers on restart
- cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
- cgroup_freezer: add interface to freeze a cgroup (given a task)

[2009-May-27] v16
- Privilege checks for IPC checkpoint
- Fix error string generation during checkpoint
- Use kzalloc for header allocation
- Restart blocks are arch-independent
- Redo pipe c/r using splice
- Fixes to s390 arch
- Remove powerpc arch (temporary)
- Explicitly restore ->nsproxy
- All objects in image are precedeed by 'struct ckpt_hdr'
- Fix leaks detection (and leaks)
- Reorder of patchset
- Misc bugs and compilation fixes

[2009-Apr-12] v15
- Minor fixes

[2009-Apr-28] v14
- Tested against kernel v2.6.30-rc3 on x86_32.
- Refactor files chekpoint to use f_ops (file operations)
- Refactor mm/vma to use vma_ops
- Explicitly handle VDSO vma (and require compat mode)
- Added code to c/r restat-blocks (restart timeout related syscalls)
- Added code to c/r namespaces: uts, ipc (with Dan Smith)
- Added code to c/r sysvipc (shm, msg, sem)
- Support for VM_CLONE shared memory
- Added resource leak detection for whole-container checkpoint
- Added sysctl gauge to allow unprivileged restart/checkpoint
- Improve and simplify the code and logic of shared objects
- Rework image format: shared objects appear prior to their use
- Merge checkpoint and restart functionality into same files
- Massive renaming of functions: prefix "ckpt_" for generics,
"checkpoint_" for checkpoint, and "restore_" for restart.
- Report checkpoint errors as a valid (string record) in the output
- Merged PPC architecture (by Nathan Lunch),
- Requires updates to userspace tools too.
- Misc nits and bug fixes

[2009-Mar-31] v14-rc2
- Change along Dave's suggestion to use f_ops->checkpoint() for files
- Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
- Merge support for PPC arch (Nathan Lynch)
- Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
- The 'h.parent' field of 'struct cr_hdr' isn't used - discard
- Check whether calls to cr_hbuf_get() succeed or fail.
- Fixed of pipe c/r code
- Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
- Refuse non-self checkpoint if a task isn't frozen
- Use unsigned fields in checkpoint headers unless otherwise required
- Rename functions in files c/r to better reflect their role
- Add support for anonymous shared memory
- Merge support for s390 arch (Dan Smith, Serge Hallyn)

[2008-Dec-03] v13:
- Cleanups of 'struct cr_ctx' - remove unused fields
- Misc fixes for comments

[2008-Dec-17] v12:
- Fix re-alloc/reset of pgarr chain to correctly reuse buffers
(empty pgarr are saves in a separate pool chain)
- Add a couple of missed calls to cr_hbuf_put()
- cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
- Split cr_write/cr_read() to two parts: _cr_write/read() helper
- Befriend with sparse: explicit conversion to 'void __user *'
- Redrefine 'pr_fmt' ind replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
- Use contents of 'init->fs->root' instead of pointing to it
- Ignore symlinks (there is no such thing as an open symlink)
- cr_scan_fds() retries from scratch if it hits size limits
- Add missing test for VM_MAYSHARE when dumping memory
- Improve documentation about: behavior when tasks aren't fronen,
life span of the object hash, references to objects in the hash

[2008-Nov-26] v10:
- Grab vfs root of container init, rather than current process
- Acquire dcache_lock around call to __d_path() in cr_fill_name()
- Force end-of-string in cr_read_string() (fix possible DoS)
- Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
- Support multiple processes c/r
- Extend checkpoint header with archtiecture dependent header
- Misc bug fixes (see individual changelogs)
- Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
- Support "external" checkpoint
- Include Dave Hansen's 'deny-checkpoint' patch
- Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
- Fix save/restore state of FPU
- Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
- Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
(even though it's not really needed)
- Add assumptions and what's-missing to documentation
- Misc fixes and cleanups

[2008-Sep-11] v5:
- Config is now 'def_bool n' by default
- Improve memory dump/restore code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Fix use of follow_page() to avoid faulting in non-present pages
- Memory restore now maps user pages explicitly to copy data into them,
instead of reading directly to user space; got rid of mprotect_fixup()
- Remove preempt_disable() when restoring debug registers
- Rename headers files s/ckpt/checkpoint/
- Fix misc bugs in files dump/restore
- Fixes and cleanups on some error paths
- Fix misc coding style

[2008-Sep-09] v4:
- Various fixes and clean-ups
- Fix calculation of hash table size
- Fix header structure alignment
- Use stand list_... for cr_pgarr

[2008-Aug-29] v3:
- Various fixes and clean-ups
- Use standard hlist_... for hash table
- Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
- Added Dump and restore of open files (regular and directories)
- Added basic handling of shared objects, and improve handling of
'parent tag' concept
- Added documentation
- Improved ABI, 64bit padding for image data
- Improved locking when saving/restoring memory
- Added UTS information to header (release, version, machine)
- Cleanup extraction of filename from a file pointer
- Refactor to allow easier reviewing
- Remove requirement for CAPS_SYS_ADMIN until we come up with a
security policy (this means that file restore may fail)
- Other cleanup and response to comments for v1

[2008-Jul-29] v1:
- Initial version: support a single task with address space of only
private anonymous or file-mapped VMAs; syscalls ignore pid/crid
argument and act on current process.

--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach. With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data. We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing. This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs. It will also contain
copies of the actual memory that the process uses. Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internel kernel state and write
it out to a file descriptor. The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only work checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--


2009-07-22 10:07:34

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 09/60] Namespaces submenu

From: Dave Hansen <[email protected]>

Let's not steal too much space in the 'General Setup' menu.
Take a cue from the cgroups code and create a submenu.

This can go upstream now.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Oren Laadan <[email protected]>
---
init/Kconfig | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 1ce05a4..7503957 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -664,7 +664,7 @@ config RELAY

If unsure, say N.

-config NAMESPACES
+menuconfig NAMESPACES
bool "Namespaces support" if EMBEDDED
default !EMBEDDED
help
--
1.6.0.4

2009-07-22 10:07:17

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 08/60] cgroup freezer: interface to freeze a cgroup from within the kernel

Add public interface to freeze a cgroup freezer given a task that
belongs to that cgroup: cgroup_freezer_make_frozen(task)

Freezing the root cgroup is not permitted. Freezing the cgroup to
which current process belong is also not permitted.

This will be used for restart(2) to be able to leave the restarted
processes in a frozen state, instead of resuming execution.

This is useful for debugging, if the user would like to attach a
debugger to the restarted task(s).

It is also useful if the restart procedure would like to perform
additional setup once the tasks are restored but before they are
allowed to proceed execution.

Signed-off-by: Oren Laadan <[email protected]>
CC: Matt Helsley <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Cedric Le Goater <[email protected]>
---
include/linux/freezer.h | 1 +
kernel/cgroup_freezer.c | 27 +++++++++++++++++++++++++++
2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 3d32641..0cb22cb 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -68,6 +68,7 @@ extern int cgroup_freezing_or_frozen(struct task_struct *task);
extern int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q);
extern int cgroup_freezer_begin_checkpoint(struct task_struct *task);
extern void cgroup_freezer_end_checkpoint(struct task_struct *task);
+extern int cgroup_freezer_make_frozen(struct task_struct *task);
#else /* !CONFIG_CGROUP_FREEZER */
static inline int cgroup_freezing_or_frozen(struct task_struct *task)
{
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 87dfbfb..7925850 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -466,4 +466,31 @@ void cgroup_freezer_end_checkpoint(struct task_struct *task)
*/
WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING);
}
+
+int cgroup_freezer_make_frozen(struct task_struct *task)
+{
+ struct freezer *freezer;
+ struct cgroup_subsys_state *css;
+ int ret = -ENODEV;
+
+ task_lock(task);
+ css = task_subsys_state(task, freezer_subsys_id);
+ css_get(css); /* make sure freezer doesn't go away */
+ freezer = container_of(css, struct freezer, css);
+ task_unlock(task);
+
+ /* Never freeze the root cgroup */
+ if (!test_bit(CSS_ROOT, &css->flags) &&
+ cgroup_lock_live_group(css->cgroup)) {
+ /* do not freeze outselves, ei ?! */
+ if (css != task_subsys_state(current, freezer_subsys_id))
+ ret = freezer_change_state(css->cgroup, CGROUP_FROZEN);
+ else
+ ret = -EPERM;
+ cgroup_unlock();
+ }
+
+ css_put(css);
+ return ret;
+}
#endif /* CONFIG_CHECKPOINT */
--
1.6.0.4

2009-07-22 10:10:37

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 16/60] pids 6/7: Define do_fork_with_pids()

From: Sukadev Bhattiprolu <[email protected]>

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v3]:
- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
- To facilitate moving architecture-inpdendent code to kernel/fork.c
pass in 'struct target_pid_set __user *' to do_fork_with_pids()
rather than 'pid_t *' (next patch moves the arch-independent
code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
include/linux/sched.h | 3 +++
include/linux/types.h | 5 +++++
kernel/fork.c | 16 ++++++++++++++--
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16a982e..e2ebb41 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2052,6 +2052,9 @@ extern int disallow_signal(int);

extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+ unsigned long, int __user *, int __user *,
+ struct target_pid_set __user *pid_set);
struct task_struct *fork_idle(int);

extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..d9efefe 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,11 @@ struct ustat {
char f_fpack[6];
};

+struct target_pid_set {
+ int num_pids;
+ pid_t *target_pids;
+};
+
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
#endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 6f90cf4..64d53d9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1341,12 +1341,13 @@ struct task_struct * __cpuinit fork_idle(int cpu)
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*/
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
unsigned long stack_start,
struct pt_regs *regs,
unsigned long stack_size,
int __user *parent_tidptr,
- int __user *child_tidptr)
+ int __user *child_tidptr,
+ struct target_pid_set __user *pid_setp)
{
struct task_struct *p;
int trace = 0;
@@ -1455,6 +1456,17 @@ long do_fork(unsigned long clone_flags,
return nr;
}

+long do_fork(unsigned long clone_flags,
+ unsigned long stack_start,
+ struct pt_regs *regs,
+ unsigned long stack_size,
+ int __user *parent_tidptr,
+ int __user *child_tidptr)
+{
+ return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+ parent_tidptr, child_tidptr, NULL);
+}
+
#ifndef ARCH_MIN_MMSTRUCT_ALIGN
#define ARCH_MIN_MMSTRUCT_ALIGN 0
#endif
--
1.6.0.4

2009-07-22 10:11:01

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 52/60] c/r: support semaphore sysv-ipc

Checkpoint of sysvipc semaphores is performed by iterating through all
sem objects and dumping the contents of each one. The semaphore array
of each sem is dumped with that object.

The semaphore array (sem->sem_base) holds an array of 'struct sem',
which is a {int, int}. Because this translates into the same format
on 32- and 64-bit architectures, the checkpoint format is simply the
dump of this array as is.

TODO: this patch does not handle semaphore-undo -- this data should be
saved per-task while iterating through the tasks.

Changelog[v17]:
- Restore objects in the right namespace
- Forward declare struct msg_msg (instead of include linux/msg.h)
- Fix typo in comment
- Don't unlock ipc before calling freeary in error path

Signed-off-by: Oren Laadan <[email protected]>
---
include/linux/checkpoint_hdr.h | 8 ++
ipc/Makefile | 2 +-
ipc/checkpoint.c | 4 -
ipc/checkpoint_sem.c | 219 ++++++++++++++++++++++++++++++++++++++++
ipc/sem.c | 11 +--
ipc/util.h | 8 ++
6 files changed, 240 insertions(+), 12 deletions(-)
create mode 100644 ipc/checkpoint_sem.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index e33bb58..2364a6f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -406,6 +406,14 @@ struct ckpt_hdr_ipc_msg_msg {
__u32 m_ts;
} __attribute__((aligned(8)));

+struct ckpt_hdr_ipc_sem {
+ struct ckpt_hdr h;
+ struct ckpt_hdr_ipc_perms perms;
+ __u64 sem_otime;
+ __u64 sem_ctime;
+ __u32 sem_nsems;
+} __attribute__((aligned(8)));
+

#define CKPT_TST_OVERFLOW_16(a, b) \
((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index 71a257f..3ecba9e 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -10,4 +10,4 @@ obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
obj-$(CONFIG_IPC_NS) += namespace.o
obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o \
- checkpoint_shm.o checkpoint_msg.o
+ checkpoint_shm.o checkpoint_msg.o checkpoint_sem.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 11941d7..7834051 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -119,12 +119,10 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
return ret;
ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
-#if 0 /* NEXT FEW PATCHES */
if (ret < 0)
return ret;
ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
-#endif
return ret;
}

@@ -288,7 +286,6 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)

ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
CKPT_HDR_IPC_SHM, restore_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
if (ret < 0)
goto out;
ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
@@ -297,7 +294,6 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
goto out;
ret = restore_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
CKPT_HDR_IPC_SEM, restore_ipc_sem);
-#endif
if (ret < 0)
goto out;

diff --git a/ipc/checkpoint_sem.c b/ipc/checkpoint_sem.c
new file mode 100644
index 0000000..746ad63
--- /dev/null
+++ b/ipc/checkpoint_sem.c
@@ -0,0 +1,219 @@
+/*
+ * Checkpoint/restart - dump state of sysvipc sem
+ *
+ * Copyright (C) 2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/sem.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+
+struct msg_msg;
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_sem_hdr(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_ipc_sem *h,
+ struct sem_array *sem)
+{
+ int ret = 0;
+
+ ipc_lock_by_ptr(&sem->sem_perm);
+
+ ret = checkpoint_fill_ipc_perms(&h->perms, &sem->sem_perm);
+ if (ret < 0)
+ goto unlock;
+
+ h->sem_otime = sem->sem_otime;
+ h->sem_ctime = sem->sem_ctime;
+ h->sem_nsems = sem->sem_nsems;
+
+ unlock:
+ ipc_unlock(&sem->sem_perm);
+ ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+
+ return ret;
+}
+
+/**
+ * ckpt_write_sem_array - dump the state of a semaphore array
+ * @ctx: checkpoint context
+ * @sem: semphore array
+ *
+ * The state of a sempahore is an array of 'struct sem'. This structure
+ * is {int, int}, which translates to the same format {32 bits, 32 bits}
+ * on both 32- and 64-bit architectures. So we simply dump the array.
+ *
+ * The sem-undo information is not saved per ipc_ns, but rather per task.
+ */
+static int checkpoint_sem_array(struct ckpt_ctx *ctx, struct sem_array *sem)
+{
+ /* this is a "best-effort" test, so lock not needed */
+ if (!list_empty(&sem->sem_pending))
+ return -EBUSY;
+
+ /* our caller holds the mutex, so this is safe */
+ return ckpt_write_buffer(ctx, sem->sem_base,
+ sem->sem_nsems * sizeof(*sem->sem_base));
+}
+
+int checkpoint_ipc_sem(int id, void *p, void *data)
+{
+ struct ckpt_hdr_ipc_sem *h;
+ struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+ struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+ struct sem_array *sem;
+ int ret;
+
+ sem = container_of(perm, struct sem_array, sem_perm);
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SEM);
+ if (!h)
+ return -ENOMEM;
+
+ ret = fill_ipc_sem_hdr(ctx, h, sem);
+ if (ret < 0)
+ goto out;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ if (ret < 0)
+ goto out;
+
+ if (h->sem_nsems)
+ ret = checkpoint_sem_array(ctx, sem);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_sem_hdr(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_ipc_sem *h,
+ struct sem_array *sem)
+{
+ int ret = 0;
+
+ ret = restore_load_ipc_perms(&h->perms, &sem->sem_perm);
+ if (ret < 0)
+ return ret;
+
+ ckpt_debug("sem: nsems %u\n", h->sem_nsems);
+
+ sem->sem_otime = h->sem_otime;
+ sem->sem_ctime = h->sem_ctime;
+ sem->sem_nsems = h->sem_nsems;
+
+ return 0;
+}
+
+/**
+ * ckpt_read_sem_array - read the state of a semaphore array
+ * @ctx: checkpoint context
+ * @sem: semphore array
+ *
+ * Expect the data in an array of 'struct sem': {32 bit, 32 bit}.
+ * See comment in ckpt_write_sem_array().
+ *
+ * The sem-undo information is not restored per ipc_ns, but rather per task.
+ */
+static struct sem *restore_sem_array(struct ckpt_ctx *ctx, int nsems)
+{
+ struct sem *sma;
+ int i, ret;
+
+ sma = kmalloc(nsems * sizeof(*sma), GFP_KERNEL);
+ ret = _ckpt_read_buffer(ctx, sma, nsems * sizeof(*sma));
+ if (ret < 0)
+ goto out;
+
+ /* validate sem array contents */
+ for (i = 0; i < nsems; i++) {
+ if (sma[i].semval < 0 || sma[i].sempid < 0) {
+ ret = -EINVAL;
+ break;
+ }
+ }
+ out:
+ if (ret < 0) {
+ kfree(sma);
+ sma = ERR_PTR(ret);
+ }
+ return sma;
+}
+
+int restore_ipc_sem(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+ struct ckpt_hdr_ipc_sem *h;
+ struct kern_ipc_perm *perms;
+ struct sem_array *sem;
+ struct sem *sma = NULL;
+ struct ipc_ids *sem_ids = &ns->ids[IPC_SEM_IDS];
+ int semflag, ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SEM);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ret = -EINVAL;
+ if (h->perms.id < 0)
+ goto out;
+ if (h->sem_nsems < 0)
+ goto out;
+
+ /* read sempahore array state */
+ sma = restore_sem_array(ctx, h->sem_nsems);
+ if (IS_ERR(sma)) {
+ ret = PTR_ERR(sma);
+ goto out;
+ }
+
+ /* restore the message queue now */
+ semflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+ ckpt_debug("sem: do_semget key %d flag %#x id %d\n",
+ h->perms.key, semflag, h->perms.id);
+ ret = do_semget(ns, h->perms.key, h->sem_nsems, semflag, h->perms.id);
+ ckpt_debug("sem: do_semget ret %d\n", ret);
+ if (ret < 0)
+ goto out;
+
+ down_write(&sem_ids->rw_mutex);
+
+ /* we are the sole owners/users of this ipc_ns, it can't go away */
+ perms = ipc_lock(sem_ids, h->perms.id);
+ BUG_ON(IS_ERR(perms)); /* ipc_ns is private to us */
+
+ sem = container_of(perms, struct sem_array, sem_perm);
+ memcpy(sem->sem_base, sma, sem->sem_nsems * sizeof(*sma));
+
+ ret = load_ipc_sem_hdr(ctx, h, sem);
+ if (ret < 0) {
+ ckpt_debug("sem: need to remove (%d)\n", ret);
+ freeary(ns, perms);
+ } else
+ ipc_unlock(perms);
+ up_write(&sem_ids->rw_mutex);
+ out:
+ kfree(sma);
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
diff --git a/ipc/sem.c b/ipc/sem.c
index a2b2135..7361041 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -93,7 +93,6 @@
#define sem_checkid(sma, semid) ipc_checkid(&sma->sem_perm, semid)

static int newary(struct ipc_namespace *, struct ipc_params *, int);
-static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
#ifdef CONFIG_PROC_FS
static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
#endif
@@ -310,14 +309,12 @@ static inline int sem_more_checks(struct kern_ipc_perm *ipcp,
return 0;
}

-int do_semget(key_t key, int nsems, int semflg, int req_id)
+int do_semget(struct ipc_namespace *ns, key_t key, int nsems,
+ int semflg, int req_id)
{
- struct ipc_namespace *ns;
struct ipc_ops sem_ops;
struct ipc_params sem_params;

- ns = current->nsproxy->ipc_ns;
-
if (nsems < 0 || nsems > ns->sc_semmsl)
return -EINVAL;

@@ -334,7 +331,7 @@ int do_semget(key_t key, int nsems, int semflg, int req_id)

SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
{
- return do_semget(key, nsems, semflg, -1);
+ return do_semget(current->nsproxy->ipc_ns, key, nsems, semflg, -1);
}

/*
@@ -521,7 +518,7 @@ static void free_un(struct rcu_head *head)
* as a writer and the spinlock for this semaphore set hold. sem_ids.rw_mutex
* remains locked on exit.
*/
-static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
{
struct sem_undo *un, *tu;
struct sem_queue *q, *tq;
diff --git a/ipc/util.h b/ipc/util.h
index a06a98d..315831f 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -193,6 +193,11 @@ void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id);
void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);

+int do_semget(struct ipc_namespace *ns, key_t key, int nsems, int semflg,
+ int req_id);
+void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+
+
#ifdef CONFIG_CHECKPOINT
extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
struct kern_ipc_perm *perm);
@@ -204,6 +209,9 @@ extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);

extern int checkpoint_ipc_msg(int id, void *p, void *data);
extern int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
+
+extern int checkpoint_ipc_sem(int id, void *p, void *data);
+extern int restore_ipc_sem(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
#endif

#endif
--
1.6.0.4

2009-07-22 10:10:54

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 33/60] c/r: dump open file descriptors

Dump the file table with 'struct ckpt_hdr_file_table, followed by all
open file descriptors. Because the 'struct file' corresponding to an
fd can be shared, they are assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it
lives in the hash (the hash is only cleaned up at the end of the
checkpoint).

Also provide generic_checkpoint_file() and generic_restore_file()
which is good for normal files and directories. It does not support
yet unlinked files or directories.

Changelog[v17]:
- Only collect sub-objects of files_struct once
- Better file error debugging
- Use (new) d_unlinked()
Changelog[v16]:
- Fix compile warning in checkpoint_bad()
Changelog[v16]:
- Reorder patch (move earlier in series)
- Handle shared files_struct objects
Changelog[v14]:
- File objects are dumped/restored prior to the first reference
- Introduce a per file-type restore() callback
- Use struct file_operations->checkpoint()
- Put code for generic file descriptors in a separate function
- Use one CKPT_FILE_GENERIC for both regular files and dirs
- Revert change to pr_debug(), back to ckpt_debug()
- Use only unsigned fields in checkpoint headers
- Rename: ckpt_write_files() => checkpoint_fd_table()
- Rename: ckpt_write_fd_data() => checkpoint_file()
- Discard field 'h->parent'
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Discard handling of opened symlinks (there is no such thing)
- ckpt_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
- Fix a couple of leaks in ckpt_write_files()
- Drop useless kfree from ckpt_scan_fds()
Changelog[v8]:
- initialize 'coe' to workaround gcc false warning
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Makefile | 3 +-
checkpoint/checkpoint.c | 11 +
checkpoint/files.c | 382 ++++++++++++++++++++++++++++++++++++++
checkpoint/objhash.c | 53 ++++++
checkpoint/process.c | 34 ++++-
checkpoint/sys.c | 1 +
include/linux/checkpoint.h | 15 ++
include/linux/checkpoint_hdr.h | 49 +++++
include/linux/checkpoint_types.h | 6 +
include/linux/fs.h | 4 +
10 files changed, 556 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/files.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 5aa6a75..1d0c058 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -7,4 +7,5 @@ obj-$(CONFIG_CHECKPOINT) += \
objhash.o \
checkpoint.o \
restart.o \
- process.o
+ process.o \
+ files.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e126626..59b86d8 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -18,6 +18,7 @@
#include <linux/time.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/fs_struct.h>
#include <linux/dcache.h>
#include <linux/mount.h>
#include <linux/utsname.h>
@@ -573,6 +574,7 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
{
struct task_struct *task;
struct nsproxy *nsproxy;
+ struct fs_struct *fs;

/*
* No need for explicit cleanup here, because if an error
@@ -612,6 +614,15 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
if (!(ctx->uflags & CHECKPOINT_SUBTREE) && !ctx->root_init)
return -EINVAL; /* cleanup by ckpt_ctx_free() */

+ /* root vfs (FIX: WILL CHANGE with mnt-ns etc */
+ task_lock(ctx->root_task);
+ fs = ctx->root_task->fs;
+ read_lock(&fs->lock);
+ ctx->fs_mnt = fs->root;
+ path_get(&ctx->fs_mnt);
+ read_unlock(&fs->lock);
+ task_unlock(ctx->root_task);
+
return 0;
}

diff --git a/checkpoint/files.c b/checkpoint/files.c
new file mode 100644
index 0000000..5ff9925
--- /dev/null
+++ b/checkpoint/files.c
@@ -0,0 +1,382 @@
+/*
+ * Checkpoint file descriptors
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DFILE
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @len: buffer length (in) and pathname length (out)
+ */
+static char *fill_fname(struct path *path, struct path *root,
+ char *buf, int *len)
+{
+ struct path tmp = *root;
+ char *fname;
+
+ BUG_ON(!buf);
+ spin_lock(&dcache_lock);
+ fname = __d_path(path, &tmp, buf, *len);
+ spin_unlock(&dcache_lock);
+ if (IS_ERR(fname))
+ return fname;
+ *len = (buf + (*len) - fname);
+ /*
+ * FIX: if __d_path() changed these, it must have stepped out of
+ * init's namespace. Since currently we require a unified namespace
+ * within the container: simply fail.
+ */
+ if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+ fname = ERR_PTR(-EBADF);
+
+ return fname;
+}
+
+/**
+ * checkpoint_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int checkpoint_fname(struct ckpt_ctx *ctx, struct path *path, struct path *root)
+{
+ char *buf, *fname;
+ int ret, flen;
+
+ /*
+ * FIXME: we can optimize and save memory (and storage) if we
+ * share strings (through objhash) and reference them instead
+ */
+
+ flen = PATH_MAX;
+ buf = kmalloc(flen, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ fname = fill_fname(path, root, buf, &flen);
+ if (!IS_ERR(fname))
+ ret = ckpt_write_obj_type(ctx, fname, flen,
+ CKPT_HDR_FILE_NAME);
+ else
+ ret = PTR_ERR(fname);
+
+ kfree(buf);
+ return ret;
+}
+
+#define CKPT_DEFAULT_FDTABLE 256 /* an initial guess */
+
+/**
+ * scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+static int scan_fds(struct files_struct *files, int **fdtable)
+{
+ struct fdtable *fdt;
+ int *fds = NULL;
+ int i = 0, n = 0;
+ int tot = CKPT_DEFAULT_FDTABLE;
+
+ /*
+ * We assume that all tasks possibly sharing the file table are
+ * frozen (or we are a single process and we checkpoint ourselves).
+ * Therefore, we can safely proceed after krealloc() from where we
+ * left off. Otherwise the file table may be modified by another
+ * task after we scan it. The behavior is this case is undefined,
+ * and either checkpoint or restart will likely fail.
+ */
+ retry:
+ fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+ if (!fds)
+ return -ENOMEM;
+
+ rcu_read_lock();
+ fdt = files_fdtable(files);
+ for (/**/; i < fdt->max_fds; i++) {
+ if (!fcheck_files(files, i))
+ continue;
+ if (n == tot) {
+ rcu_read_unlock();
+ tot *= 2; /* won't overflow: kmalloc will fail */
+ goto retry;
+ }
+ fds[n++] = i;
+ }
+ rcu_read_unlock();
+
+ *fdtable = fds;
+ return n;
+}
+
+int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+ struct ckpt_hdr_file *h)
+{
+ h->f_flags = file->f_flags;
+ h->f_mode = file->f_mode;
+ h->f_pos = file->f_pos;
+ h->f_version = file->f_version;
+
+ /* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+ return 0;
+}
+
+int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+ struct ckpt_hdr_file_generic *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+ if (!h)
+ return -ENOMEM;
+
+ /*
+ * FIXME: when we'll add support for unlinked files/dirs, we'll
+ * need to distinguish between unlinked filed and unlinked dirs.
+ */
+ h->common.f_type = CKPT_FILE_GENERIC;
+
+ ret = checkpoint_file_common(ctx, file, &h->common);
+ if (ret < 0)
+ goto out;
+ ret = ckpt_write_obj(ctx, &h->common.h);
+ if (ret < 0)
+ goto out;
+ ret = checkpoint_fname(ctx, &file->f_path, &ctx->fs_mnt);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+EXPORT_SYMBOL(generic_file_checkpoint);
+
+/* checkpoint callback for file pointer */
+int checkpoint_file(struct ckpt_ctx *ctx, void *ptr)
+{
+ struct file *file = (struct file *) ptr;
+
+ if (!file->f_op || !file->f_op->checkpoint) {
+ ckpt_debug("f_op lacks checkpoint handler: %pS\n", file->f_op);
+ return -EBADF;
+ }
+ if (d_unlinked(file->f_dentry)) {
+ ckpt_debug("unlinked files are unsupported\n");
+ return -EBADF;
+ }
+ return file->f_op->checkpoint(ctx, file);
+}
+
+/**
+ * ckpt_write_file_desc - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls ckpt_write_file to dump the file pointer too.
+ */
+static int checkpoint_file_desc(struct ckpt_ctx *ctx,
+ struct files_struct *files, int fd)
+{
+ struct ckpt_hdr_file_desc *h;
+ struct file *file = NULL;
+ struct fdtable *fdt;
+ int objref, ret;
+ int coe = 0; /* avoid gcc warning */
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+ if (!h)
+ return -ENOMEM;
+
+ rcu_read_lock();
+ fdt = files_fdtable(files);
+ file = fcheck_files(files, fd);
+ if (file) {
+ coe = FD_ISSET(fd, fdt->close_on_exec);
+ get_file(file);
+ }
+ rcu_read_unlock();
+
+ /* sanity check (although this shouldn't happen) */
+ ret = -EBADF;
+ if (!file)
+ goto out;
+
+ /*
+ * if seen first time, this will add 'file' to the objhash, keep
+ * a reference to it, dump its state while at it.
+ */
+ objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+ ckpt_debug("fd %d objref %d file %p coe %d)\n", fd, objref, file, coe);
+ if (objref < 0) {
+ ret = objref;
+ goto out;
+ }
+
+ h->fd_objref = objref;
+ h->fd_descriptor = fd;
+ h->fd_close_on_exec = coe;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+out:
+ ckpt_hdr_put(ctx, h);
+ if (file)
+ fput(file);
+ return ret;
+}
+
+static int do_checkpoint_file_table(struct ckpt_ctx *ctx,
+ struct files_struct *files)
+{
+ struct ckpt_hdr_file_table *h;
+ int *fdtable = NULL;
+ int nfds, n, ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+ if (!h)
+ return -ENOMEM;
+
+ nfds = scan_fds(files, &fdtable);
+ if (nfds < 0) {
+ ret = nfds;
+ goto out;
+ }
+
+ h->fdt_nfds = nfds;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ goto out;
+
+ ckpt_debug("nfds %d\n", nfds);
+ for (n = 0; n < nfds; n++) {
+ ret = checkpoint_file_desc(ctx, files, fdtable[n]);
+ if (ret < 0)
+ break;
+ }
+ out:
+ kfree(fdtable);
+ return ret;
+}
+
+/* checkpoint callback for file table */
+int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_file_table(ctx, (struct files_struct *) ptr);
+}
+
+/* checkpoint wrapper for file table */
+int checkpoint_obj_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct files_struct *files;
+ int objref;
+
+ files = get_files_struct(t);
+ if (!files)
+ return -EBUSY;
+ objref = checkpoint_obj(ctx, files, CKPT_OBJ_FILE_TABLE);
+ put_files_struct(files);
+
+ return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+static int collect_file_desc(struct ckpt_ctx *ctx,
+ struct files_struct *files, int fd)
+{
+ struct fdtable *fdt;
+ struct file *file;
+ int ret;
+
+ rcu_read_lock();
+ fdt = files_fdtable(files);
+ file = fcheck_files(files, fd);
+ if (file)
+ get_file(file);
+ rcu_read_unlock();
+
+ if (!file)
+ return -EAGAIN;
+
+ ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+ fput(file);
+
+ return ret;
+}
+
+static int collect_file_table(struct ckpt_ctx *ctx, struct files_struct *files)
+{
+ int *fdtable;
+ int exists;
+ int nfds, n;
+ int ret;
+
+ /* if already exists, don't proceed inside the struct */
+ exists = ckpt_obj_lookup(ctx, files, CKPT_OBJ_FILE_TABLE);
+
+ ret = ckpt_obj_collect(ctx, files, CKPT_OBJ_FILE_TABLE);
+ if (ret < 0 || exists)
+ return ret;
+
+ nfds = scan_fds(files, &fdtable);
+ if (nfds < 0)
+ return nfds;
+
+ for (n = 0; n < nfds; n++) {
+ ret = collect_file_desc(ctx, files, fdtable[n]);
+ if (ret < 0)
+ break;
+ }
+
+ kfree(fdtable);
+ return ret;
+}
+
+int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct files_struct *files;
+ int ret;
+
+ files = get_files_struct(t);
+ if (!files)
+ return -EBUSY;
+ ret = collect_file_table(ctx, files);
+ put_files_struct(files);
+
+ return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 3f23910..d77e8c4 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -13,6 +13,8 @@

#include <linux/kernel.h>
#include <linux/hash.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

@@ -52,6 +54,7 @@ struct ckpt_obj_hash {
int checkpoint_bad(struct ckpt_ctx *ctx, void *ptr)
{
BUG();
+ return 0;
}

void *restore_bad(struct ckpt_ctx *ctx)
@@ -71,6 +74,38 @@ static int obj_no_grab(void *ptr)
return 0;
}

+static int obj_file_table_grab(void *ptr)
+{
+ atomic_inc(&((struct files_struct *) ptr)->count);
+ return 0;
+}
+
+static void obj_file_table_drop(void *ptr)
+{
+ put_files_struct((struct files_struct *) ptr);
+}
+
+static int obj_file_table_users(void *ptr)
+{
+ return atomic_read(&((struct files_struct *) ptr)->count);
+}
+
+static int obj_file_grab(void *ptr)
+{
+ get_file((struct file *) ptr);
+ return 0;
+}
+
+static void obj_file_drop(void *ptr)
+{
+ fput((struct file *) ptr);
+}
+
+static int obj_file_users(void *ptr)
+{
+ return atomic_long_read(&((struct file *) ptr)->f_count);
+}
+
static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -79,6 +114,24 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.ref_drop = obj_no_drop,
.ref_grab = obj_no_grab,
},
+ /* files_struct object */
+ {
+ .obj_name = "FILE_TABLE",
+ .obj_type = CKPT_OBJ_FILE_TABLE,
+ .ref_drop = obj_file_table_drop,
+ .ref_grab = obj_file_table_grab,
+ .ref_users = obj_file_table_users,
+ .checkpoint = checkpoint_file_table,
+ },
+ /* file object */
+ {
+ .obj_name = "FILE",
+ .obj_type = CKPT_OBJ_FILE,
+ .ref_drop = obj_file_drop,
+ .ref_grab = obj_file_grab,
+ .ref_users = obj_file_users,
+ .checkpoint = checkpoint_file,
+ },
};


diff --git a/checkpoint/process.c b/checkpoint/process.c
index 4da4e4a..61caa01 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -103,6 +103,30 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
}

+static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr_task_objs *h;
+ int files_objref;
+ int ret;
+
+ files_objref = checkpoint_obj_file_table(ctx, t);
+ ckpt_debug("files: objref %d\n", files_objref);
+ if (files_objref < 0) {
+ ckpt_write_err(ctx, "task %d (%s), files_struct: %d",
+ task_pid_vnr(t), t->comm, files_objref);
+ return files_objref;
+ }
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+ if (!h)
+ return -ENOMEM;
+ h->files_objref = files_objref;
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
/* dump the task_struct of a given task */
int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
{
@@ -227,6 +251,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
if (t->exit_state)
return 0;

+ ret = checkpoint_task_objs(ctx, t);
+ ckpt_debug("objs %d\n", ret);
+ if (ret < 0)
+ goto out;
ret = checkpoint_thread(ctx, t);
ckpt_debug("thread %d\n", ret);
if (ret < 0)
@@ -243,7 +271,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)

int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
{
- return 0;
+ int ret;
+
+ ret = ckpt_collect_file_table(ctx, t);
+
+ return ret;
}

/***********************************************************************
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index d16d48f..bc5620f 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -195,6 +195,7 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
fput(ctx->file);

ckpt_obj_hash_free(ctx);
+ path_put(&ctx->fs_mnt);

if (ctx->tasks_arr)
task_arr_free(ctx);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index efd05cc..67845dc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -124,12 +124,27 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
struct task_struct *t);
extern int restore_restart_block(struct ckpt_ctx *ctx);

+/* file table */
+extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
+ struct task_struct *t);
+extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+
+/* files */
+extern int checkpoint_fname(struct ckpt_ctx *ctx,
+ struct path *path, struct path *root);
+extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+
+extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
+ struct ckpt_hdr_file *h);
+

/* debugging flags */
#define CKPT_DBASE 0x1 /* anything */
#define CKPT_DSYS 0x2 /* generic (system) */
#define CKPT_DRW 0x4 /* image read/write */
#define CKPT_DOBJ 0x8 /* shared objects */
+#define CKPT_DFILE 0x10 /* files and filesystem */

#define CKPT_DDEFAULT 0xffff /* default debug level */

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 7c46638..3f8483e 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -52,12 +52,18 @@ enum {

CKPT_HDR_TREE = 101,
CKPT_HDR_TASK,
+ CKPT_HDR_TASK_OBJS,
CKPT_HDR_RESTART_BLOCK,
CKPT_HDR_THREAD,
CKPT_HDR_CPU,

/* 201-299: reserved for arch-dependent */

+ CKPT_HDR_FILE_TABLE = 301,
+ CKPT_HDR_FILE_DESC,
+ CKPT_HDR_FILE_NAME,
+ CKPT_HDR_FILE,
+
CKPT_HDR_TAIL = 9001,

CKPT_HDR_ERROR = 9999,
@@ -78,6 +84,8 @@ struct ckpt_hdr_objref {
/* shared objects types */
enum obj_type {
CKPT_OBJ_IGNORE = 0,
+ CKPT_OBJ_FILE_TABLE,
+ CKPT_OBJ_FILE,
CKPT_OBJ_MAX
};

@@ -155,6 +163,12 @@ struct ckpt_hdr_task {
__u64 robust_futex_list; /* a __user ptr */
} __attribute__((aligned(8)));

+/* task's shared resources */
+struct ckpt_hdr_task_objs {
+ struct ckpt_hdr h;
+ __s32 files_objref;
+} __attribute__((aligned(8)));
+
/* restart blocks */
struct ckpt_hdr_restart_block {
struct ckpt_hdr h;
@@ -176,4 +190,39 @@ enum restart_block_type {
CKPT_RESTART_BLOCK_FUTEX
};

+/* file system */
+struct ckpt_hdr_file_table {
+ struct ckpt_hdr h;
+ __s32 fdt_nfds;
+} __attribute__((aligned(8)));
+
+/* file descriptors */
+struct ckpt_hdr_file_desc {
+ struct ckpt_hdr h;
+ __s32 fd_objref;
+ __s32 fd_descriptor;
+ __u32 fd_close_on_exec;
+} __attribute__((aligned(8)));
+
+enum file_type {
+ CKPT_FILE_IGNORE = 0,
+ CKPT_FILE_GENERIC,
+ CKPT_FILE_MAX
+};
+
+/* file objects */
+struct ckpt_hdr_file {
+ struct ckpt_hdr h;
+ __u32 f_type;
+ __u32 f_mode;
+ __u32 f_flags;
+ __u32 _padding;
+ __u64 f_pos;
+ __u64 f_version;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_generic {
+ struct ckpt_hdr_file common;
+} __attribute__((aligned(8)));
+
#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index bd78d19..c446510 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -12,6 +12,10 @@

#ifdef __KERNEL__

+#include <linux/list.h>
+#include <linux/path.h>
+#include <linux/fs.h>
+
#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/fs.h>
@@ -40,6 +44,8 @@ struct ckpt_ctx {

struct ckpt_obj_hash *obj_hash; /* repository for shared objects */

+ struct path fs_mnt; /* container root (FIXME) */
+
char err_string[256]; /* checkpoint: error string */

/* [multi-process checkpoint] */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 05d4745..2174957 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2313,7 +2313,11 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
loff_t inode_get_bytes(struct inode *inode);
void inode_set_bytes(struct inode *inode, loff_t bytes);

+#ifdef CONFIG_CHECKPOINT
+extern int generic_file_checkpoint(struct ckpt_ctx *ctx, struct file *file);
+#else
#define generic_file_checkpoint NULL
+#endif

extern int vfs_readdir(struct file *, filldir_t, void *);

--
1.6.0.4

2009-07-22 10:10:58

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 53/60] c/r: (s390): expose a constant for the number of words (CRs)

We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.

Changelog:
Mar 30:
. Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Mar 03:
. Picked up additional use of magic '3' in ptrace.h

Signed-off-by: Dan Smith <[email protected]>
---
arch/s390/Kconfig | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2ae5d72..6f143ab 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
config GENERIC_CLOCKEVENTS
def_bool y

+config CHECKPOINT_SUPPORT
+ bool
+ default y if 64BIT
+
config GENERIC_BUG
bool
depends on BUG
--
1.6.0.4

2009-07-22 10:10:56

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 40/60] c/r: export shmem_getpage() to support shared memory

Export functionality to retrieve specific pages from shared memory
given an inode in shmem-fs; this will be used in the next two patches
to provide support for c/r of shared memory.

mm/shmem.c:
- shmem_getpage() and 'enum sgp_type' moved to linux/mm.h

Signed-off-by: Oren Laadan <[email protected]>
---
include/linux/mm.h | 11 +++++++++++
mm/shmem.c | 15 ++-------------
2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98e1fdf..6c2c3dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -329,6 +329,17 @@ void put_pages_list(struct list_head *pages);

void split_page(struct page *page, unsigned int order);

+/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
+enum sgp_type {
+ SGP_READ, /* don't exceed i_size, don't allocate page */
+ SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
+ SGP_WRITE, /* may exceed i_size, may allocate page */
+};
+
+extern int shmem_getpage(struct inode *inode, unsigned long idx,
+ struct page **pagep, enum sgp_type sgp, int *type);
+
/*
* Compound pages have a destructor function. Provide a
* prototype for that function and accessor functions.
diff --git a/mm/shmem.c b/mm/shmem.c
index d713239..d80532b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -99,14 +99,6 @@ static struct vfsmount *shm_mnt;
/* Pretend that each entry is of this size in directory's i_size */
#define BOGO_DIRENT_SIZE 20

-/* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
-enum sgp_type {
- SGP_READ, /* don't exceed i_size, don't allocate page */
- SGP_CACHE, /* don't exceed i_size, may allocate page */
- SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
- SGP_WRITE, /* may exceed i_size, may allocate page */
-};
-
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
@@ -119,9 +111,6 @@ static unsigned long shmem_default_max_inodes(void)
}
#endif

-static int shmem_getpage(struct inode *inode, unsigned long idx,
- struct page **pagep, enum sgp_type sgp, int *type);
-
static inline struct page *shmem_dir_alloc(gfp_t gfp_mask)
{
/*
@@ -1202,8 +1191,8 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
* vm. If we swap it in we mark it dirty since we also free the swap
* entry since a page cannot live in both the swap and page cache
*/
-static int shmem_getpage(struct inode *inode, unsigned long idx,
- struct page **pagep, enum sgp_type sgp, int *type)
+int shmem_getpage(struct inode *inode, unsigned long idx,
+ struct page **pagep, enum sgp_type sgp, int *type)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
--
1.6.0.4

2009-07-22 10:12:01

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 54/60] c/r: add CKPT_COPY() macro

From: Dan Smith <[email protected]>

As suggested by Dave[1], this provides us a way to make the copy-in and
copy-out processes symmetric. CKPT_COPY_ARRAY() provides us a way to do
the same thing but for arrays. It's not critical, but it helps us unify
the checkpoint and restart paths for some things.

Changelog:
Mar 04:
. Removed semicolons
. Added build-time check for __must_be_array in CKPT_COPY_ARRAY
Feb 27:
. Changed CKPT_COPY() to use assignment, eliminating the need
for the CKPT_COPY_BIT() macro
. Add CKPT_COPY_ARRAY() macro to help copying register arrays,
etc
. Move the macro definitions inside the CR #ifdef
Feb 25:
. Changed WARN_ON() to BUILD_BUG_ON()

Signed-off-by: Dan Smith <[email protected]>
Signed-off-by: Oren Laadan <[email protected]>

1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
---
include/linux/checkpoint.h | 29 +++++++++++++++++++++++++++++
1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index aeae2fa..2ee1bc2 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -211,6 +211,34 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)


+/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
+#define CKPT_CPT 1
+#define CKPT_RST 2
+
+#define CKPT_COPY(op, SAVE, LIVE) \
+ do { \
+ if (op == CKPT_CPT) \
+ SAVE = LIVE; \
+ else \
+ LIVE = SAVE; \
+ } while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CKPT_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count) \
+ do { \
+ (void)__must_be_array(SAVE); \
+ (void)__must_be_array(LIVE); \
+ BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE)); \
+ if (op == CKPT_CPT) \
+ memcpy(SAVE, LIVE, count * sizeof(*SAVE)); \
+ else \
+ memcpy(LIVE, SAVE, count * sizeof(*SAVE)); \
+ } while (0)
+
+
/* debugging flags */
#define CKPT_DBASE 0x1 /* anything */
#define CKPT_DSYS 0x2 /* generic (system) */
@@ -243,6 +271,7 @@ extern unsigned long ckpt_debug_level;
* CKPT_DBASE is the base flags, doesn't change
* CKPT_DFLAG is to be redfined in each source file
*/
+
#define ckpt_debug(fmt, args...) \
_ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)

--
1.6.0.4

2009-07-22 10:12:50

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 45/60] c/r: make ckpt_may_checkpoint_task() check each namespace individually

For a given namespace type, say XXX, if a checkpoint was taken on a
CONFIG_XXX_NS system, is restarted on a !CONFIG_XXX_NS, then ensure
that:

1) The global settings of the global (init) namespace do not get
overwritten. Creating new objects in that namespace is ok, as long as
the request identifier is available.

2) All restarting tasks use a single namespace - because it is
impossible to create additional namespaces to accommodate for what had
been checkpointed.

Original patch introducing nsproxy c/r by Dan Smith <[email protected]>

Chagnelog[v17]:
- Only collect sub-objects of struct_nsproxy once.
- Restore namespace pieces directly instead of using sys_unshare()
- Proper handling of restart from namespace(s) without namespace(s)

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/checkpoint.c | 20 ++++++++--
checkpoint/objhash.c | 28 ++++++++++++++
checkpoint/process.c | 81 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint.h | 5 ++
include/linux/checkpoint_hdr.h | 13 ++++++
kernel/nsproxy.c | 76 +++++++++++++++++++++++++++++++++++++
6 files changed, 219 insertions(+), 4 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c68e443..af6b58b 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -281,6 +281,8 @@ static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
{
struct task_struct *root = ctx->root_task;
+ struct nsproxy *nsproxy;
+ int ret = 0;

ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));

@@ -324,11 +326,21 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
return -EINVAL;
}

- /* FIX: change this when namespaces are added */
- if (task_nsproxy(t) != ctx->root_nsproxy)
- return -EPERM;
+ rcu_read_lock();
+ nsproxy = task_nsproxy(t);
+ if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
+ ret = -EPERM;
+ if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
+ ret = -EPERM;
+ if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns)
+ ret = -EPERM;
+ if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns)
+ ret = -EPERM;
+ if (nsproxy->net_ns != ctx->root_nsproxy->net_ns)
+ ret = -EPERM;
+ rcu_read_unlock();

- return 0;
+ return ret;
}

#define CKPT_HDR_PIDS_CHUNK 256
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 02b42a0..18ede6f 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -132,6 +132,22 @@ static int obj_mm_users(void *ptr)
return atomic_read(&((struct mm_struct *) ptr)->mm_users);
}

+static int obj_ns_grab(void *ptr)
+{
+ get_nsproxy((struct nsproxy *) ptr);
+ return 0;
+}
+
+static void obj_ns_drop(void *ptr)
+{
+ put_nsproxy((struct nsproxy *) ptr);
+}
+
+static int obj_ns_users(void *ptr)
+{
+ return atomic_read(&((struct nsproxy *) ptr)->count);
+}
+
static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -179,6 +195,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.checkpoint = checkpoint_mm,
.restore = restore_mm,
},
+ /* ns object */
+ {
+ .obj_name = "NSPROXY",
+ .obj_type = CKPT_OBJ_NS,
+ .ref_drop = obj_ns_drop,
+ .ref_grab = obj_ns_grab,
+ .ref_users = obj_ns_users,
+ .checkpoint = checkpoint_ns,
+ .restore = restore_ns,
+ },
};


@@ -520,6 +546,8 @@ int ckpt_obj_contained(struct ckpt_ctx *ctx)

/* account for ctx->file reference (if in the table already) */
ckpt_obj_users_inc(ctx, ctx->file, 1);
+ /* account for ctx->root_nsproxy reference (if in the table already) */
+ ckpt_obj_users_inc(ctx, ctx->root_nsproxy, 1);

hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
if (!obj->ops->ref_users)
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 5d71016..40e83c9 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,7 @@
#define CKPT_DFLAG CKPT_DSYS

#include <linux/sched.h>
+#include <linux/nsproxy.h>
#include <linux/posix-timers.h>
#include <linux/futex.h>
#include <linux/poll.h>
@@ -103,6 +104,35 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
}

+static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr_task_ns *h;
+ struct nsproxy *nsproxy;
+ int ns_objref;
+ int ret;
+
+ rcu_read_lock();
+ nsproxy = task_nsproxy(t);
+ get_nsproxy(nsproxy);
+ rcu_read_unlock();
+
+ ns_objref = checkpoint_obj(ctx, nsproxy, CKPT_OBJ_NS);
+ put_nsproxy(nsproxy);
+
+ ckpt_debug("nsproxy: objref %d\n", ns_objref);
+ if (ns_objref < 0)
+ return ns_objref;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_NS);
+ if (!h)
+ return -ENOMEM;
+ h->ns_objref = ns_objref;
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
{
struct ckpt_hdr_task_objs *h;
@@ -110,6 +140,19 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
int mm_objref;
int ret;

+ /*
+ * Shared objects may have dependencies among them: task->mm
+ * depends on task->nsproxy (by ipc_ns). Therefore first save
+ * the namespaces, and then the remaining shared objects.
+ * During restart a task will already have its namespaces
+ * restored when it gets to restore, e.g. its memory.
+ */
+
+ ret = checkpoint_task_ns(ctx, t);
+ ckpt_debug("ns: objref %d\n", ret);
+ if (ret < 0)
+ return ret;
+
files_objref = checkpoint_obj_file_table(ctx, t);
ckpt_debug("files: objref %d\n", files_objref);
if (files_objref < 0) {
@@ -283,6 +326,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
{
int ret;

+ ret = ckpt_collect_ns(ctx, t);
+ if (ret < 0)
+ return ret;
ret = ckpt_collect_file_table(ctx, t);
if (ret < 0)
return ret;
@@ -356,11 +402,46 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
return ret;
}

+static int restore_task_ns(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_task_ns *h;
+ struct nsproxy *nsproxy;
+ int ret = 0;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_NS);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ nsproxy = ckpt_obj_fetch(ctx, h->ns_objref, CKPT_OBJ_NS);
+ if (IS_ERR(nsproxy)) {
+ ret = PTR_ERR(nsproxy);
+ goto out;
+ }
+
+ if (nsproxy != task_nsproxy(current)) {
+ get_nsproxy(nsproxy);
+ switch_task_namespaces(current, nsproxy);
+ }
+ out:
+ ckpt_debug("nsproxy: ret %d (%p)\n", ret, task_nsproxy(current));
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
static int restore_task_objs(struct ckpt_ctx *ctx)
{
struct ckpt_hdr_task_objs *h;
int ret;

+ /*
+ * Namespaces come first, because ->mm depends on ->nsproxy,
+ * and because shared objects are restored before they are
+ * referenced. See comment in checkpoint_task_objs.
+ */
+ ret = restore_task_ns(ctx);
+ if (ret < 0)
+ return ret;
+
h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
if (IS_ERR(h))
return PTR_ERR(h);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 5920453..e433b5c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -126,6 +126,11 @@ extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
struct task_struct *t);
extern int restore_restart_block(struct ckpt_ctx *ctx);

+/* namespaces */
+extern int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_ns(struct ckpt_ctx *ctx);
+
/* file table */
extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index b187719..af18332 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -52,10 +52,12 @@ enum {

CKPT_HDR_TREE = 101,
CKPT_HDR_TASK,
+ CKPT_HDR_TASK_NS,
CKPT_HDR_TASK_OBJS,
CKPT_HDR_RESTART_BLOCK,
CKPT_HDR_THREAD,
CKPT_HDR_CPU,
+ CKPT_HDR_NS,

/* 201-299: reserved for arch-dependent */

@@ -94,6 +96,7 @@ enum obj_type {
CKPT_OBJ_FILE_TABLE,
CKPT_OBJ_FILE,
CKPT_OBJ_MM,
+ CKPT_OBJ_NS,
CKPT_OBJ_MAX
};

@@ -173,6 +176,16 @@ struct ckpt_hdr_task {
__u64 robust_futex_list; /* a __user ptr */
} __attribute__((aligned(8)));

+/* namespaces */
+struct ckpt_hdr_task_ns {
+ struct ckpt_hdr h;
+ __s32 ns_objref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ns {
+ struct ckpt_hdr h;
+} __attribute__((aligned(8)));
+
/* task's shared resources */
struct ckpt_hdr_task_objs {
struct ckpt_hdr h;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 09b4ff9..54cb987 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -21,6 +21,7 @@
#include <linux/pid_namespace.h>
#include <net/net_namespace.h>
#include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>

static struct kmem_cache *nsproxy_cachep;

@@ -221,6 +222,81 @@ void exit_task_namespaces(struct task_struct *p)
switch_task_namespaces(p, NULL);
}

+#ifdef CONFIG_CHECKPOINT
+int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct nsproxy *nsproxy;
+ int exists;
+ int ret;
+
+ rcu_read_lock();
+ nsproxy = task_nsproxy(t);
+ if (nsproxy)
+ get_nsproxy(nsproxy);
+ rcu_read_unlock();
+
+ if (!nsproxy)
+ return 0;
+
+ /* if already exists, don't proceed inside the struct */
+ exists = ckpt_obj_lookup(ctx, nsproxy, CKPT_OBJ_NS);
+
+ ret = ckpt_obj_collect(ctx, nsproxy, CKPT_OBJ_NS);
+ if (ret < 0 || exists)
+ goto out;
+
+ /* TODO: collect other namespaces here */
+ out:
+ put_nsproxy(nsproxy);
+ return ret;
+}
+
+static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
+{
+ struct ckpt_hdr_ns *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_NS);
+ if (!h)
+ return -ENOMEM;
+
+ /* TODO: Write other namespaces here */
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+
+int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_ns(ctx, (struct nsproxy *) ptr);
+}
+
+static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_ns *h;
+ struct nsproxy *nsproxy = NULL;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
+ if (IS_ERR(h))
+ return (struct nsproxy *) h;
+
+ nsproxy = current->nsproxy;
+ get_nsproxy(nsproxy);
+
+ /* TODO: add more namespaces here */
+
+ ckpt_hdr_put(ctx, h);
+ return nsproxy;
+}
+
+void *restore_ns(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_ns(ctx);
+}
+#endif /* CONFIG_CHECKPOINT */
+
static int __init nsproxy_cache_init(void)
{
nsproxy_cachep = KMEM_CACHE(nsproxy, SLAB_PANIC);
--
1.6.0.4

2009-07-22 10:12:19

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 35/60] c/r: add generic '->checkpoint' f_op to ext fses

From: Dave Hansen <[email protected]>

This marks ext[234] as being checkpointable. There will be many
more to do this to, but this is a start.

Signed-off-by: Dave Hansen <[email protected]>
---
fs/ext2/dir.c | 1 +
fs/ext2/file.c | 2 ++
fs/ext3/dir.c | 1 +
fs/ext3/file.c | 1 +
fs/ext4/dir.c | 1 +
fs/ext4/file.c | 1 +
6 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 6cde970..78e9157 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -722,4 +722,5 @@ const struct file_operations ext2_dir_operations = {
.compat_ioctl = ext2_compat_ioctl,
#endif
.fsync = simple_fsync,
+ .checkpoint = generic_file_checkpoint,
};
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 2b9e47d..edbc3dc 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -58,6 +58,7 @@ const struct file_operations ext2_file_operations = {
.fsync = simple_fsync,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
+ .checkpoint = generic_file_checkpoint,
};

#ifdef CONFIG_EXT2_FS_XIP
@@ -73,6 +74,7 @@ const struct file_operations ext2_xip_file_operations = {
.open = generic_file_open,
.release = ext2_release_file,
.fsync = simple_fsync,
+ .checkpoint = generic_file_checkpoint,
};
#endif

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 3d724a9..54b05d2 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_operations = {
#endif
.fsync = ext3_sync_file, /* BKL held */
.release = ext3_release_dir,
+ .checkpoint = generic_file_checkpoint,
};


diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index 5b49704..a421e07 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -126,6 +126,7 @@ const struct file_operations ext3_file_operations = {
.fsync = ext3_sync_file,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
+ .checkpoint = generic_file_checkpoint,
};

const struct inode_operations ext3_file_inode_operations = {
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 9dc9316..f69404c 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_operations = {
#endif
.fsync = ext4_sync_file,
.release = ext4_release_dir,
+ .checkpoint = generic_file_checkpoint,
};


diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3f1873f..a99bcc3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -195,6 +195,7 @@ const struct file_operations ext4_file_operations = {
.fsync = ext4_sync_file,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
+ .checkpoint = generic_file_checkpoint,
};

const struct inode_operations ext4_file_inode_operations = {
--
1.6.0.4

2009-07-22 10:10:59

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 10/60] c/r: make file_pos_read/write() public

These two are used in the next patch when calling vfs_read/write()

Signed-off-by: Oren Laadan <[email protected]>
---
fs/read_write.c | 10 ----------
include/linux/fs.h | 10 ++++++++++
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 6c8c55d..d331975 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -359,16 +359,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_

EXPORT_SYMBOL(vfs_write);

-static inline loff_t file_pos_read(struct file *file)
-{
- return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
- file->f_pos = pos;
-}
-
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0872372..d88d4fc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1548,6 +1548,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
struct iovec *fast_pointer,
struct iovec **ret_pointer);

+static inline loff_t file_pos_read(struct file *file)
+{
+ return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+ file->f_pos = pos;
+}
+
extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
--
1.6.0.4

2009-07-22 10:11:25

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 15/60] pids 5/7: Add target_pids parameter to copy_process()

From: Sukadev Bhattiprolu <[email protected]>

The new parameter will be used in a follow-on patch when clone_with_pids()
is implemented.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/fork.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 8c9ca1c..6f90cf4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -948,12 +948,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
+ pid_t *target_pids,
int trace)
{
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
- pid_t *target_pids = NULL;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1328,7 +1328,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
struct pt_regs regs;

task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
- &init_struct_pid, 0);
+ &init_struct_pid, NULL, 0);
if (!IS_ERR(task))
init_idle(task, cpu);

@@ -1351,6 +1351,7 @@ long do_fork(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
+ pid_t *target_pids = NULL;

/*
* Do some preliminary argument and permissions checking before we
@@ -1391,7 +1392,7 @@ long do_fork(unsigned long clone_flags,
trace = tracehook_prepare_clone(clone_flags);

p = copy_process(clone_flags, stack_start, regs, stack_size,
- child_tidptr, NULL, trace);
+ child_tidptr, NULL, target_pids, trace);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
--
1.6.0.4

2009-07-22 10:10:47

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 28/60] c/r: support for zombie processes

During checkpoint, a zombie processes need only save p->comm,
p->state, p->exit_state, and p->exit_code.

During restart, zombie processes are created like all other
processes. They validate the saved exit_code restore p->comm
and p->exit_code. Then they call do_exit() instead of waking
up the next task in line.

But before, they place the @ctx in p->checkpoint_ctx, so that
only at exit time they will wake up the next task in line,
and drop the reference to the @ctx.

This provides the guarantee that when the coordinator's wait
completes, all normal tasks completed their restart, and all
zombie tasks are already zombified (as opposed to perhap only
becoming a zombie).

Changelog[v17]:
- Validate t->exit_signal for both threads and leader
- Skip zombies in most of may_checkpoint_task()
- Save/restore t->pdeath_signal
- Validate ->exit_signal and ->pdeath_signal

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/checkpoint.c | 12 +++++--
checkpoint/process.c | 67 +++++++++++++++++++++++++++++++++++-----
checkpoint/restart.c | 40 +++++++++++++++++++++---
include/linux/checkpoint.h | 1 +
include/linux/checkpoint_hdr.h | 1 +
5 files changed, 104 insertions(+), 17 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 57f59de..fb14585 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -280,8 +280,8 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)

ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));

- if (t->state == TASK_DEAD) {
- pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
+ if (t->exit_state == EXIT_DEAD) {
+ pr_warning("c/r: task %d is EXIT_DEAD\n", task_pid_vnr(t));
return -EAGAIN;
}

@@ -291,6 +291,10 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
return -EPERM;
}

+ /* zombies are cool (and also don't have nsproxy, below...) */
+ if (t->exit_state)
+ return 0;
+
/* verify that all tasks belongs to same freezer cgroup */
if (t != current && !in_same_cgroup_freezer(t, ctx->root_freezer)) {
__ckpt_write_err(ctx, "task %d (%s) not frozen (wrong cgroup)",
@@ -309,8 +313,8 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
* FIX: for now, disallow siblings of container init created
* via CLONE_PARENT (unclear if they will remain possible)
*/
- if (ctx->root_init && t != root && t->tgid != root->tgid &&
- t->real_parent == root->real_parent) {
+ if (ctx->root_init && t != root &&
+ t->real_parent == root->real_parent && t->tgid != root->tgid) {
__ckpt_write_err(ctx, "task %d (%s) is sibling of root",
task_pid_vnr(t), t->comm);
return -EINVAL;
diff --git a/checkpoint/process.c b/checkpoint/process.c
index a0bf344..a67c389 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -35,12 +35,18 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
h->state = t->state;
h->exit_state = t->exit_state;
h->exit_code = t->exit_code;
- h->exit_signal = t->exit_signal;

- h->set_child_tid = t->set_child_tid;
- h->clear_child_tid = t->clear_child_tid;
+ if (t->exit_state) {
+ /* zombie - skip remaining state */
+ BUG_ON(t->exit_state != EXIT_ZOMBIE);
+ } else {
+ /* FIXME: save remaining relevant task_struct fields */
+ h->exit_signal = t->exit_signal;
+ h->pdeath_signal = t->pdeath_signal;

- /* FIXME: save remaining relevant task_struct fields */
+ h->set_child_tid = t->set_child_tid;
+ h->clear_child_tid = t->clear_child_tid;
+ }

ret = ckpt_write_obj(ctx, &h->h);
ckpt_hdr_put(ctx, h);
@@ -169,6 +175,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
ckpt_debug("task %d\n", ret);
if (ret < 0)
goto out;
+
+ /* zombie - we're done here */
+ if (t->exit_state)
+ return 0;
+
ret = checkpoint_thread(ctx, t);
ckpt_debug("thread %d\n", ret);
if (ret < 0)
@@ -187,6 +198,19 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
* Restart
*/

+static inline int valid_exit_code(int exit_code)
+{
+ if (exit_code >= 0x10000)
+ return 0;
+ if (exit_code & 0xff) {
+ if (exit_code & ~0xff)
+ return 0;
+ if (!valid_signal(exit_code & 0xff))
+ return 0;
+ }
+ return 1;
+}
+
/* read the task_struct into the current task */
static int restore_task_struct(struct ckpt_ctx *ctx)
{
@@ -198,15 +222,37 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
if (IS_ERR(h))
return PTR_ERR(h);

+ ret = -EINVAL;
+ if (h->state == TASK_DEAD) {
+ if (h->exit_state != EXIT_ZOMBIE)
+ goto out;
+ if (!valid_exit_code(h->exit_code))
+ goto out;
+ t->exit_code = h->exit_code;
+ } else {
+ if (h->exit_code)
+ goto out;
+ if ((thread_group_leader(t) && !valid_signal(h->exit_signal)) ||
+ (!thread_group_leader(t) && h->exit_signal != -1))
+ goto out;
+ if (!valid_signal(h->pdeath_signal))
+ goto out;
+
+ /* FIXME: restore remaining relevant task_struct fields */
+ t->exit_signal = h->exit_signal;
+ t->pdeath_signal = h->pdeath_signal;
+
+ t->set_child_tid = h->set_child_tid;
+ t->clear_child_tid = h->clear_child_tid;
+ }
+
memset(t->comm, 0, TASK_COMM_LEN);
ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN);
if (ret < 0)
goto out;

- t->set_child_tid = h->set_child_tid;
- t->clear_child_tid = h->clear_child_tid;
-
- /* FIXME: restore remaining relevant task_struct fields */
+ /* return 1 for zombie, 0 otherwise */
+ ret = (h->state == TASK_DEAD ? 1 : 0);
out:
ckpt_hdr_put(ctx, h);
return ret;
@@ -326,6 +372,11 @@ int restore_task(struct ckpt_ctx *ctx)
ckpt_debug("task %d\n", ret);
if (ret < 0)
goto out;
+
+ /* zombie - we're done here */
+ if (ret)
+ goto out;
+
ret = restore_thread(ctx);
ckpt_debug("thread %d\n", ret);
if (ret < 0)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 65422e2..1b1f639 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -375,20 +375,17 @@ static inline void ckpt_notify_error(struct ckpt_ctx *ctx)
complete(&ctx->complete);
}

-static int ckpt_activate_next(struct ckpt_ctx *ctx)
+int ckpt_activate_next(struct ckpt_ctx *ctx)
{
struct task_struct *task;
- int active;
pid_t pid;

- active = ++ctx->active_pid;
- if (active >= ctx->nr_pids) {
+ if (++ctx->active_pid >= ctx->nr_pids) {
complete(&ctx->complete);
return 0;
}

pid = get_active_pid(ctx);
- ckpt_debug("active pid %d (%d < %d)\n", pid, active, ctx->nr_pids);

rcu_read_lock();
task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
@@ -413,6 +410,8 @@ static int wait_task_active(struct ckpt_ctx *ctx)
ret = wait_event_interruptible(ctx->waitq,
is_task_active(ctx, pid) ||
ckpt_test_ctx_error(ctx));
+ ckpt_debug("active %d < %d (ret %d)\n",
+ ctx->active_pid, ctx->nr_pids, ret);
if (!ret && ckpt_test_ctx_error(ctx)) {
force_sig(SIGKILL, current);
ret = -EBUSY;
@@ -468,6 +467,8 @@ static int do_restore_task(void)
return -EAGAIN;
}

+ current->flags |= PF_RESTARTING;
+
/* wait for our turn, do the restore, and tell next task in line */
ret = wait_task_active(ctx);
if (ret < 0)
@@ -477,6 +478,13 @@ static int do_restore_task(void)
if (ret < 0)
goto out;

+ /*
+ * zombie: we're done here; Save @ctx on task_struct, to be
+ * used to ckpt_activate_next(), and released, from do_exit().
+ */
+ if (ret)
+ do_exit(current->exit_code);
+
ret = ckpt_activate_next(ctx);
if (ret < 0)
goto out;
@@ -493,6 +501,7 @@ static int do_restore_task(void)
wake_up_all(&ctx->waitq);
}

+ current->flags &= ~PF_RESTARTING;
ckpt_ctx_put(ctx);
return ret;
}
@@ -593,6 +602,7 @@ static int wait_all_tasks_finish(struct ckpt_ctx *ctx)

ret = wait_for_completion_interruptible(&ctx->complete);

+ ckpt_debug("final sync kflags %#lx\n", ctx->kflags);
if (ckpt_test_ctx_error(ctx))
ret = -EBUSY;
return ret;
@@ -820,3 +830,23 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)

return ret;
}
+
+/**
+ * exit_checkpoint - callback from do_exit to cleanup checkpoint state
+ * @tsk: terminating task
+ */
+void exit_checkpoint(struct task_struct *tsk)
+{
+ struct ckpt_ctx *ctx;
+
+ ctx = tsk->checkpoint_ctx;
+ tsk->checkpoint_ctx = NULL;
+
+ /* restarting zombies will acrivate next task in restart */
+ if (tsk->flags & PF_RESTARTING) {
+ if (ckpt_activate_next(ctx) < 0)
+ pr_warning("c/r: [%d] failed zombie exit\n", tsk->pid);
+ }
+
+ ckpt_ctx_put(ctx);
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 44b692d..b6af5b9 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -85,6 +85,7 @@ extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);

/* task */
+extern int ckpt_activate_next(struct ckpt_ctx *ctx);
extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
extern int restore_task(struct ckpt_ctx *ctx);

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index c9a80dc..3f2db22 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -130,6 +130,7 @@ struct ckpt_hdr_task {
__u32 exit_state;
__u32 exit_code;
__u32 exit_signal;
+ __u32 pdeath_signal;

__u64 set_child_tid;
__u64 clear_child_tid;
--
1.6.0.4

2009-07-22 10:13:34

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 44/60] c/r: support for open pipes

A pipe is a double-headed inode with a buffer attached to it. We
checkpoint the pipe buffer only once, as soon as we hit one side of
the pipe, regardless whether it is read- or write- end.

To checkpoint a file descriptor that refers to a pipe (either end), we
first lookup the inode in the hash table: If not found, it is the
first encounter of this pipe. Besides the file descriptor, we also (a)
save the pipe data, and (b) register the pipe inode in the hash. If
found, it is the second encounter of this pipe, namely, as we hit the
other end of the same pipe. In both cases we write the pipe-objref of
the inode.

To restore, create a new pipe and thus have two file pointers (read-
and write- ends). We only use one of them, depending on which side was
checkpointed first. We register the file pointer of the other end in
the hash table, with the pipe_objref given for this pipe from the
checkpoint, to be used later when the other arrives. At this point we
also restore the contents of the pipe buffers.

To save the pipe buffer, given a source pipe, use do_tee() to clone
its contents into a temporary 'struct pipe_inode_info', and then use
do_splice_from() to transfer it directly to the checkpoint image file.

To restore the pipe buffer, with a fresh newly allocated target pipe,
use do_splice_to() to splice the data directly between the checkpoint
image file and the pipe.

Changelog[v17]:
- Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/files.c | 7 ++
fs/pipe.c | 170 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint_hdr.h | 12 +++
include/linux/pipe_fs_i.h | 8 ++
4 files changed, 197 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 88d7adf..c247d44 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -17,6 +17,7 @@
#include <linux/file.h>
#include <linux/fdtable.h>
#include <linux/fsnotify.h>
+#include <linux/pipe_fs_i.h>
#include <linux/syscalls.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
@@ -529,6 +530,12 @@ static struct restore_file_ops restore_file_ops[] = {
.file_type = CKPT_FILE_GENERIC,
.restore = generic_file_restore,
},
+ /* pipes */
+ {
+ .file_name = "PIPE",
+ .file_type = CKPT_FILE_PIPE,
+ .restore = pipe_file_restore,
+ },
};

static struct file *do_restore_file(struct ckpt_ctx *ctx)
diff --git a/fs/pipe.c b/fs/pipe.c
index f7dd21a..facb36b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -13,11 +13,13 @@
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
#include <linux/uio.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
#include <linux/audit.h>
#include <linux/syscalls.h>
+#include <linux/checkpoint.h>

#include <asm/uaccess.h>
#include <asm/ioctls.h>
@@ -809,6 +811,171 @@ pipe_rdwr_open(struct inode *inode, struct file *filp)
return 0;
}

+#ifdef CONFIG_CHECKPOINT
+static int checkpoint_pipe(struct ckpt_ctx *ctx, struct inode *inode)
+{
+ struct ckpt_hdr_file_pipe_state *h;
+ struct pipe_inode_info *pipe;
+ int len, ret = -ENOMEM;
+
+ pipe = alloc_pipe_info(NULL);
+ if (!pipe)
+ return ret;
+
+ pipe->readers = 1; /* bluff link_pipe() below */
+ len = link_pipe(inode->i_pipe, pipe, INT_MAX, SPLICE_F_NONBLOCK);
+ if (len == -EAGAIN)
+ len = 0;
+ if (len < 0) {
+ ret = len;
+ goto out;
+ }
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_PIPE);
+ if (!h)
+ goto out;
+ h->pipe_len = len;
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ goto out;
+
+ ret = do_splice_from(pipe, ctx->file, &ctx->file->f_pos, len, 0);
+ if (ret < 0)
+ goto out;
+ if (ret != len)
+ ret = -EPIPE; /* can occur due to an error in target file */
+ out:
+ __free_pipe_info(pipe);
+ return ret;
+}
+
+static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
+{
+ struct ckpt_hdr_file_pipe *h;
+ struct inode *inode = file->f_dentry->d_inode;
+ int objref, first, ret;
+
+ objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+ if (objref < 0)
+ return objref;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE);
+ if (!h)
+ return -ENOMEM;
+
+ h->common.f_type = CKPT_FILE_PIPE;
+ h->pipe_objref = objref;
+
+ ret = checkpoint_file_common(ctx, file, &h->common);
+ if (ret < 0)
+ goto out;
+ ret = ckpt_write_obj(ctx, &h->common.h);
+ if (ret < 0)
+ goto out;
+
+ if (first)
+ ret = checkpoint_pipe(ctx, inode);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+static int restore_pipe(struct ckpt_ctx *ctx, struct file *file)
+{
+ struct ckpt_hdr_file_pipe_state *h;
+ struct pipe_inode_info *pipe;
+ int len, ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_PIPE);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ len = h->pipe_len;
+ ckpt_hdr_put(ctx, h);
+
+ if (len < 0)
+ return -EINVAL;
+
+ pipe = file->f_dentry->d_inode->i_pipe;
+ ret = do_splice_to(ctx->file, &ctx->file->f_pos, pipe, len, 0);
+
+ if (ret >= 0 && ret != len)
+ ret = -EPIPE; /* can occur due to an error in source file */
+
+ return ret;
+}
+
+struct file *pipe_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
+{
+ struct ckpt_hdr_file_pipe *h = (struct ckpt_hdr_file_pipe *) ptr;
+ struct file *file;
+ int fds[2], which, ret;
+
+ if (ptr->h.type != CKPT_HDR_FILE ||
+ ptr->h.len != sizeof(*h) || ptr->f_type != CKPT_FILE_PIPE)
+ return ERR_PTR(-EINVAL);
+
+ if (h->pipe_objref <= 0)
+ return ERR_PTR(-EINVAL);
+
+ file = ckpt_obj_fetch(ctx, h->pipe_objref, CKPT_OBJ_FILE);
+ /*
+ * If ckpt_obj_fetch() returned ERR_PTR(-EINVAL), then this is
+ * the first time we see this pipe so need to restore the
+ * contents. Otherwise, use the file pointer skip forward.
+ */
+ if (!IS_ERR(file)) {
+ get_file(file);
+ } else if (PTR_ERR(file) == -EINVAL) {
+ /* first encounter of this pipe: create it */
+ ret = do_pipe_flags(fds, 0);
+ if (ret < 0)
+ return file;
+
+ which = (ptr->f_flags & O_WRONLY ? 1 : 0);
+ /*
+ * Below we return the file corersponding to one side
+ * of the pipe for our caller to use. Now insert the
+ * other side of the pipe to the hash, to be picked up
+ * when that side is restored.
+ */
+ file = fget(fds[1-which]); /* the 'other' side */
+ if (!file) /* this should _never_ happen ! */
+ return ERR_PTR(-EBADF);
+ ret = ckpt_obj_insert(ctx, file, h->pipe_objref, CKPT_OBJ_FILE);
+ if (ret < 0)
+ goto out;
+
+ ret = restore_pipe(ctx, file);
+ fput(file);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ file = fget(fds[which]); /* 'this' side */
+ if (!file) /* this should _never_ happen ! */
+ return ERR_PTR(-EBADF);
+
+ /* get rid of the file descriptors (caller sets that) */
+ sys_close(fds[which]);
+ sys_close(fds[1-which]);
+ } else {
+ return file;
+ }
+
+ ret = restore_file_common(ctx, file, ptr);
+ out:
+ if (ret < 0) {
+ fput(file);
+ file = ERR_PTR(ret);
+ }
+
+ return file;
+}
+#else
+#define pipe_file_checkpoint NULL
+#endif /* CONFIG_CHECKPOINT */
+
/*
* The file_operations structs are not static because they
* are also used in linux/fs/fifo.c to do operations on FIFOs.
@@ -825,6 +992,7 @@ const struct file_operations read_pipefifo_fops = {
.open = pipe_read_open,
.release = pipe_read_release,
.fasync = pipe_read_fasync,
+ .checkpoint = pipe_file_checkpoint,
};

const struct file_operations write_pipefifo_fops = {
@@ -837,6 +1005,7 @@ const struct file_operations write_pipefifo_fops = {
.open = pipe_write_open,
.release = pipe_write_release,
.fasync = pipe_write_fasync,
+ .checkpoint = pipe_file_checkpoint,
};

const struct file_operations rdwr_pipefifo_fops = {
@@ -850,6 +1019,7 @@ const struct file_operations rdwr_pipefifo_fops = {
.open = pipe_rdwr_open,
.release = pipe_rdwr_release,
.fasync = pipe_rdwr_fasync,
+ .checkpoint = pipe_file_checkpoint,
};

struct pipe_inode_info * alloc_pipe_info(struct inode *inode)
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index d95c9fb..b187719 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -63,6 +63,7 @@ enum {
CKPT_HDR_FILE_DESC,
CKPT_HDR_FILE_NAME,
CKPT_HDR_FILE,
+ CKPT_HDR_FILE_PIPE,

CKPT_HDR_MM = 401,
CKPT_HDR_VMA,
@@ -218,6 +219,7 @@ struct ckpt_hdr_file_desc {
enum file_type {
CKPT_FILE_IGNORE = 0,
CKPT_FILE_GENERIC,
+ CKPT_FILE_PIPE,
CKPT_FILE_MAX
};

@@ -236,6 +238,16 @@ struct ckpt_hdr_file_generic {
struct ckpt_hdr_file common;
} __attribute__((aligned(8)));

+struct ckpt_hdr_file_pipe {
+ struct ckpt_hdr_file common;
+ __s32 pipe_objref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_file_pipe_state {
+ struct ckpt_hdr h;
+ __s32 pipe_len;
+} __attribute__((aligned(8)));
+
/* memory layout */
struct ckpt_hdr_mm {
struct ckpt_hdr h;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index b43a9e0..e526a12 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -154,4 +154,12 @@ int generic_pipe_buf_confirm(struct pipe_inode_info *, struct pipe_buffer *);
int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *);
void generic_pipe_buf_release(struct pipe_inode_info *, struct pipe_buffer *);

+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_ctx;
+struct ckpt_hdr_file;
+extern struct file *pipe_file_restore(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_file *ptr);
+#endif
+
#endif
--
1.6.0.4

2009-07-22 10:10:51

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 38/60] c/r: dump memory address space (private memory)

For each vma, there is a 'struct ckpt_vma'; Then comes the actual
contents, in one or more chunk: each chunk begins with a header that
specifies how many pages it holds, then the virtual addresses of all
the dumped pages in that chunk, followed by the actual contents of all
dumped pages. A header with zero number of pages marks the end of the
contents. Then comes the next vma and so on.

To checkpoint a vma, call the ops->checkpoint() method of that vma.
Normally the per-vma function will invoke generic_vma_checkpoint()
which first writes the vma description, followed by the specific
logic to dump the contents of the pages.

Currently for private mapped memory we save the pathname of the file
that is mapped (restart will use it to re-open it and then map it).
Later we change that to reference a file object.

Changelog[v17]:
- Only collect sub-objects of mm_struct once
- Save mm->{flags,def_flags,saved_auxv}
Changelog[v16]:
- Precede vaddrs/pages with a buffer header
- Checkpoint mm->exe_file
- Handle shared task->mm
Changelog[v14]:
- Modify the ops->checkpoint method to be much more powerful
- Improve support for VDSO (with special_mapping checkpoint callback)
- Save new field 'vdso' in mm_context
- Revert change to pr_debug(), back to ckpt_debug()
- Check whether calls to ckpt_hbuf_get() fail
- Discard field 'h->parent'
Changelog[v13]:
- pgprot_t is an abstract type; use the proper accessor (fix for
64-bit powerpc (Nathan Lynch <[email protected]>)
Changelog[v12]:
- Hide pgarr management inside ckpt_private_vma_fill_pgarr()
- Fix management of pgarr chain reset and alloc/expand: keep empty
pgarr in a pool chain
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Copy contents of 'init->fs->root' instead of pointing to them.
- Add missing test for VM_MAYSHARE when dumping memory
Changelog[v10]:
- Acquire dcache_lock around call to __d_path() in ckpt_fill_name()
Changelog[v9]:
- Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
- Test if __d_path() changes mnt/dentry (when crossing filesystem
namespace boundary). for now ckpt_fill_fname() fails the checkpoint.
Changelog[v7]:
- Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)
Changelog[v5]:
- Improve memory dump code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Fix use of follow_page() to avoid faulting in non-present pages
Changelog[v4]:
- Use standard list_... for ckpt_pgarr

Signed-off-by: Oren Laadan <[email protected]>
---
arch/x86/include/asm/checkpoint_hdr.h | 8 +
arch/x86/mm/checkpoint.c | 31 ++
checkpoint/Makefile | 3 +-
checkpoint/checkpoint.c | 3 +
checkpoint/memory.c | 688 +++++++++++++++++++++++++++++++++
checkpoint/objhash.c | 25 ++
checkpoint/process.c | 13 +
checkpoint/sys.c | 3 +
include/linux/checkpoint.h | 26 ++
include/linux/checkpoint_hdr.h | 52 +++
include/linux/checkpoint_types.h | 7 +-
mm/filemap.c | 25 ++
mm/mmap.c | 28 ++
13 files changed, 908 insertions(+), 4 deletions(-)
create mode 100644 checkpoint/memory.c

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index f4d1e14..0e756b0 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -45,6 +45,7 @@
/* arch dependent header types */
enum {
CKPT_HDR_CPU_FPU = 201,
+ CKPT_HDR_MM_CONTEXT_LDT,
};

struct ckpt_hdr_header_arch {
@@ -118,4 +119,11 @@ struct ckpt_hdr_cpu {
#define CKPT_X86_SEG_TLS 0x4000 /* 0100 0000 0000 00xx */
#define CKPT_X86_SEG_LDT 0x8000 /* 100x xxxx xxxx xxxx */

+struct ckpt_hdr_mm_context {
+ struct ckpt_hdr h;
+ __u64 vdso;
+ __u32 ldt_entry_size;
+ __u32 nldt;
+} __attribute__((aligned(8)));
+
#endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index f085e14..fa26d60 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -327,6 +327,37 @@ int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
return ret;
}

+/* dump the mm->context state */
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+ struct ckpt_hdr_mm_context *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+ if (!h)
+ return -ENOMEM;
+
+ mutex_lock(&mm->context.lock);
+
+ h->vdso = (unsigned long) mm->context.vdso;
+ h->ldt_entry_size = LDT_ENTRY_SIZE;
+ h->nldt = mm->context.size;
+
+ ckpt_debug("nldt %d vdso %#llx\n", h->nldt, h->vdso);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ goto out;
+
+ ret = ckpt_write_obj_type(ctx, mm->context.ldt,
+ mm->context.size * LDT_ENTRY_SIZE,
+ CKPT_HDR_MM_CONTEXT_LDT);
+ out:
+ mutex_unlock(&mm->context.lock);
+ return ret;
+}
+
/**************************************************************************
* Restart
*/
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 1d0c058..f56a7d6 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -8,4 +8,5 @@ obj-$(CONFIG_CHECKPOINT) += \
checkpoint.o \
restart.o \
process.o \
- files.o
+ files.o \
+ memory.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 59b86d8..c68e443 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -183,10 +183,13 @@ int ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...)
static void fill_kernel_const(struct ckpt_hdr_const *h)
{
struct task_struct *tsk;
+ struct mm_struct *mm;
struct new_utsname *uts;

/* task */
h->task_comm_len = sizeof(tsk->comm);
+ /* mm */
+ h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
/* uts */
h->uts_release_len = sizeof(uts->release);
h->uts_version_len = sizeof(uts->version);
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
new file mode 100644
index 0000000..68c31b6
--- /dev/null
+++ b/checkpoint/memory.c
@@ -0,0 +1,688 @@
+/*
+ * Checkpoint/restart memory contents
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DMEM
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/proc_fs.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * page-array chains: each ckpt_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct ckpt_pgarr {
+ unsigned long *vaddrs;
+ struct page **pages;
+ unsigned int nr_used;
+ struct list_head list;
+};
+
+#define CKPT_PGARR_TOTAL (PAGE_SIZE / sizeof(void *))
+#define CKPT_PGARR_BATCH (16 * CKPT_PGARR_TOTAL)
+
+static inline int pgarr_is_full(struct ckpt_pgarr *pgarr)
+{
+ return (pgarr->nr_used == CKPT_PGARR_TOTAL);
+}
+
+static inline int pgarr_nr_free(struct ckpt_pgarr *pgarr)
+{
+ return CKPT_PGARR_TOTAL - pgarr->nr_used;
+}
+
+/*
+ * utilities to alloc, free, and handle 'struct ckpt_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has two members for page-arrays:
+ * ctx->pgarr_list: list head of populated page-array chain
+ * ctx->pgarr_pool: list head of empty page-array pool chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * Before the next chunk of pages, the chain is reset (by dereferencing
+ * all pages) but not freed; instead, empty descsriptors are kept in pool.
+ *
+ * The head of the chain page-array ("current") advances as necessary. When
+ * it gets full, a new page-array descriptor is pushed in front of it. The
+ * new descriptor is taken from first empty descriptor (if one exists, for
+ * instance, after a chain reset), or allocated on-demand.
+ *
+ * When dumping the data, the chain is traversed in reverse order.
+ */
+
+/* return first page-array in the chain */
+static inline struct ckpt_pgarr *pgarr_first(struct ckpt_ctx *ctx)
+{
+ if (list_empty(&ctx->pgarr_list))
+ return NULL;
+ return list_first_entry(&ctx->pgarr_list, struct ckpt_pgarr, list);
+}
+
+/* return (and detach) first empty page-array in the pool, if exists */
+static inline struct ckpt_pgarr *pgarr_from_pool(struct ckpt_ctx *ctx)
+{
+ struct ckpt_pgarr *pgarr;
+
+ if (list_empty(&ctx->pgarr_pool))
+ return NULL;
+ pgarr = list_first_entry(&ctx->pgarr_pool, struct ckpt_pgarr, list);
+ list_del(&pgarr->list);
+ return pgarr;
+}
+
+/* release pages referenced by a page-array */
+static void pgarr_release_pages(struct ckpt_pgarr *pgarr)
+{
+ ckpt_debug("total pages %d\n", pgarr->nr_used);
+ /*
+ * both checkpoint and restart use 'nr_used', however we only
+ * collect pages during checkpoint; in restart we simply return
+ * because pgarr->pages remains NULL.
+ */
+ if (pgarr->pages) {
+ struct page **pages = pgarr->pages;
+ int nr = pgarr->nr_used;
+
+ while (nr--)
+ page_cache_release(pages[nr]);
+ }
+
+ pgarr->nr_used = 0;
+}
+
+/* free a single page-array object */
+static void pgarr_free_one(struct ckpt_pgarr *pgarr)
+{
+ pgarr_release_pages(pgarr);
+ kfree(pgarr->pages);
+ kfree(pgarr->vaddrs);
+ kfree(pgarr);
+}
+
+/* free the chains of page-arrays (populated and empty pool) */
+void ckpt_pgarr_free(struct ckpt_ctx *ctx)
+{
+ struct ckpt_pgarr *pgarr, *tmp;
+
+ list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+ list_del(&pgarr->list);
+ pgarr_free_one(pgarr);
+ }
+
+ list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_pool, list) {
+ list_del(&pgarr->list);
+ pgarr_free_one(pgarr);
+ }
+}
+
+/* allocate a single page-array object */
+static struct ckpt_pgarr *pgarr_alloc_one(unsigned long flags)
+{
+ struct ckpt_pgarr *pgarr;
+
+ pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+ if (!pgarr)
+ return NULL;
+ pgarr->vaddrs = kmalloc(CKPT_PGARR_TOTAL * sizeof(unsigned long),
+ GFP_KERNEL);
+ if (!pgarr->vaddrs)
+ goto nomem;
+
+ /* pgarr->pages is needed only for checkpoint */
+ if (flags & CKPT_CTX_CHECKPOINT) {
+ pgarr->pages = kmalloc(CKPT_PGARR_TOTAL *
+ sizeof(struct page *), GFP_KERNEL);
+ if (!pgarr->pages)
+ goto nomem;
+ }
+
+ return pgarr;
+ nomem:
+ pgarr_free_one(pgarr);
+ return NULL;
+}
+
+/* pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the list that has space. Otherwise,
+ * try the next page-array after the last non-empty one, and move it to
+ * the front of the chain. Extends the list if none has space.
+ */
+static struct ckpt_pgarr *pgarr_current(struct ckpt_ctx *ctx)
+{
+ struct ckpt_pgarr *pgarr;
+
+ pgarr = pgarr_first(ctx);
+ if (pgarr && !pgarr_is_full(pgarr))
+ return pgarr;
+
+ pgarr = pgarr_from_pool(ctx);
+ if (!pgarr)
+ pgarr = pgarr_alloc_one(ctx->kflags);
+ if (!pgarr)
+ return NULL;
+
+ list_add(&pgarr->list, &ctx->pgarr_list);
+ return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+static void pgarr_reset_all(struct ckpt_ctx *ctx)
+{
+ struct ckpt_pgarr *pgarr;
+
+ list_for_each_entry(pgarr, &ctx->pgarr_list, list)
+ pgarr_release_pages(pgarr);
+ list_splice_init(&ctx->pgarr_list, &ctx->pgarr_pool);
+}
+
+/**************************************************************************
+ * Checkpoint
+ *
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * private_follow_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that correspond to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_private_page(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ struct page *page;
+
+ /*
+ * simplified version of get_user_pages(): already have vma,
+ * only need FOLL_ANON, and (for now) ignore fault stats.
+ *
+ * follow_page() will return NULL if the page is not present
+ * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+ * the actual page pointer otherwise.
+ *
+ * FIXME: consolidate with get_user_pages()
+ */
+
+ cond_resched();
+ while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+ int ret;
+
+ /* the page is swapped out - bring it in (optimize ?) */
+ ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+ if (ret & VM_FAULT_ERROR) {
+ if (ret & VM_FAULT_OOM)
+ return ERR_PTR(-ENOMEM);
+ else if (ret & VM_FAULT_SIGBUS)
+ return ERR_PTR(-EFAULT);
+ else
+ BUG();
+ break;
+ }
+ cond_resched();
+ }
+
+ if (IS_ERR(page))
+ return page;
+
+ /*
+ * Only care about dirty pages: either anonymous non-zero pages,
+ * or file-backed COW (copy-on-write) pages that were modified.
+ * A clean COW page is not interesting because its contents are
+ * identical to the backing file; ignore such pages.
+ * A file-backed broken COW is identified by its page_mapping()
+ * being unset (NULL) because the page will no longer be mapped
+ * to the original file after having been modified.
+ */
+ if (page == ZERO_PAGE(0)) {
+ /* this is the zero page: ignore */
+ page_cache_release(page);
+ page = NULL;
+ } else if (vma->vm_file && (page_mapping(page) != NULL)) {
+ /* file backed clean cow: ignore */
+ page_cache_release(page);
+ page = NULL;
+ }
+
+ return page;
+}
+
+/**
+ * vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int vma_fill_pgarr(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma,
+ unsigned long *start)
+{
+ unsigned long end = vma->vm_end;
+ unsigned long addr = *start;
+ struct ckpt_pgarr *pgarr;
+ int nr_used;
+ int cnt = 0;
+
+ /* this function is only for private memory (anon or file-mapped) */
+ BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+ do {
+ pgarr = pgarr_current(ctx);
+ if (!pgarr)
+ return -ENOMEM;
+
+ nr_used = pgarr->nr_used;
+
+ while (addr < end) {
+ struct page *page;
+
+ page = consider_private_page(vma, addr);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ if (page) {
+ _ckpt_debug(CKPT_DPAGE,
+ "got page %#lx\n", addr);
+ pgarr->pages[pgarr->nr_used] = page;
+ pgarr->vaddrs[pgarr->nr_used] = addr;
+ pgarr->nr_used++;
+ }
+
+ addr += PAGE_SIZE;
+
+ if (pgarr_is_full(pgarr))
+ break;
+ }
+
+ cnt += pgarr->nr_used - nr_used;
+
+ } while ((cnt < CKPT_PGARR_BATCH) && (addr < end));
+
+ *start = addr;
+ return cnt;
+}
+
+/* dump contents of a pages: use kmap_atomic() to avoid TLB flush */
+static int checkpoint_dump_page(struct ckpt_ctx *ctx,
+ struct page *page, char *buf)
+{
+ void *ptr;
+
+ ptr = kmap_atomic(page, KM_USER1);
+ memcpy(buf, ptr, PAGE_SIZE);
+ kunmap_atomic(ptr, KM_USER1);
+
+ return ckpt_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
+{
+ struct ckpt_pgarr *pgarr;
+ void *buf;
+ int i, ret = 0;
+
+ if (!total)
+ return 0;
+
+ i = total * (sizeof(unsigned long) + PAGE_SIZE);
+ ret = ckpt_write_obj_type(ctx, NULL, i, CKPT_HDR_BUFFER);
+ if (ret < 0)
+ return ret;
+
+ list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+ ret = ckpt_kwrite(ctx, pgarr->vaddrs,
+ pgarr->nr_used * sizeof(unsigned long));
+ if (ret < 0)
+ return ret;
+ }
+
+ buf = (void *) __get_free_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+ for (i = 0; i < pgarr->nr_used; i++) {
+ ret = checkpoint_dump_page(ctx, pgarr->pages[i], buf);
+ if (ret < 0)
+ goto out;
+ }
+ }
+ out:
+ free_page((unsigned long) buf);
+ return ret;
+}
+
+/**
+ * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that needs to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma)
+{
+ struct ckpt_hdr_pgarr *h;
+ unsigned long addr, end;
+ int cnt, ret;
+
+ addr = vma->vm_start;
+ end = vma->vm_end;
+
+ /*
+ * Work iteratively, collecting and dumping at most CKPT_PGARR_BATCH
+ * in each round. Each iterations is divided into two steps:
+ *
+ * (1) scan: scan through the PTEs of the vma to collect the pages
+ * to dump (later we'll also make them COW), while keeping a list
+ * of pages and their corresponding addresses on ctx->pgarr_list.
+ *
+ * (2) dump: write out a header specifying how many pages, followed
+ * by the addresses of all pages in ctx->pgarr_list, followed by
+ * the actual contents of all pages. (Then, release the references
+ * to the pages and reset the page-array chain).
+ *
+ * (This split makes the logic simpler by first counting the pages
+ * that need saving. More importantly, it allows for a future
+ * optimization that will reduce application downtime by deferring
+ * the actual write-out of the data to after the application is
+ * allowed to resume execution).
+ *
+ * After dumping the entire contents, conclude with a header that
+ * specifies 0 pages to mark the end of the contents.
+ */
+
+ while (addr < end) {
+ cnt = vma_fill_pgarr(ctx, vma, &addr);
+ if (cnt == 0)
+ break;
+ else if (cnt < 0)
+ return cnt;
+
+ ckpt_debug("collected %d pages\n", cnt);
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+ if (!h)
+ return -ENOMEM;
+
+ h->nr_pages = cnt;
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ return ret;
+
+ ret = vma_dump_pages(ctx, cnt);
+ if (ret < 0)
+ return ret;
+
+ pgarr_reset_all(ctx);
+ }
+
+ /* mark end of contents with header saying "0" pages */
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+ if (!h)
+ return -ENOMEM;
+ h->nr_pages = 0;
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+/**
+ * generic_vma_checkpoint - dump metadata of vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+ enum vma_type type, int vma_objref)
+{
+ struct ckpt_hdr_vma *h;
+ int ret;
+
+ ckpt_debug("vma %#lx-%#lx flags %#lx type %d\n",
+ vma->vm_start, vma->vm_end, vma->vm_flags, type);
+
+ if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+ pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+ return -ENOSYS;
+ }
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+ if (!h)
+ return -ENOMEM;
+
+ h->vma_type = type;
+ h->vma_objref = vma_objref;
+ h->vm_start = vma->vm_start;
+ h->vm_end = vma->vm_end;
+ h->vm_page_prot = pgprot_val(vma->vm_page_prot);
+ h->vm_flags = vma->vm_flags;
+ h->vm_pgoff = vma->vm_pgoff;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+/**
+ * private_vma_checkpoint - dump contents of private (anon, file) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @vma_objref: vma objref
+ */
+int private_vma_checkpoint(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma,
+ enum vma_type type, int vma_objref)
+{
+ int ret;
+
+ BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+ ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+ if (ret < 0)
+ goto out;
+ ret = checkpoint_memory_contents(ctx, vma);
+ out:
+ return ret;
+}
+
+/**
+ * anonymous_checkpoint - dump contents of private-anonymous vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ */
+static int anonymous_checkpoint(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma)
+{
+ /* should be private anonymous ... verify that this is the case */
+ if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+ pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+ return -ENOSYS;
+ }
+
+ BUG_ON(vma->vm_file);
+
+ return private_vma_checkpoint(ctx, vma, CKPT_VMA_ANON, 0);
+}
+
+static int do_checkpoint_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+ struct ckpt_hdr_mm *h;
+ struct vm_area_struct *vma;
+ int exe_objref = 0;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM);
+ if (!h)
+ return -ENOMEM;
+
+ down_read(&mm->mmap_sem);
+
+ h->flags = mm->flags;
+ h->def_flags = mm->def_flags;
+
+ h->start_code = mm->start_code;
+ h->end_code = mm->end_code;
+ h->start_data = mm->start_data;
+ h->end_data = mm->end_data;
+ h->start_brk = mm->start_brk;
+ h->brk = mm->brk;
+ h->start_stack = mm->start_stack;
+ h->arg_start = mm->arg_start;
+ h->arg_end = mm->arg_end;
+ h->env_start = mm->env_start;
+ h->env_end = mm->env_end;
+
+ h->map_count = mm->map_count;
+
+ /* checkpoint the ->exe_file */
+ if (mm->exe_file) {
+ exe_objref = checkpoint_obj(ctx, mm->exe_file, CKPT_OBJ_FILE);
+ if (exe_objref < 0) {
+ ret = exe_objref;
+ goto out;
+ }
+ h->exe_objref = exe_objref;
+ }
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ if (ret < 0)
+ goto out;
+
+ ret = ckpt_write_buffer(ctx, mm->saved_auxv, sizeof(mm->saved_auxv));
+ if (ret < 0)
+ return ret;
+
+ /* write the vma's */
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ ckpt_debug("vma %#lx-%#lx flags %#lx\n",
+ vma->vm_start, vma->vm_end, vma->vm_flags);
+ if (!vma->vm_ops)
+ ret = anonymous_checkpoint(ctx, vma);
+ else if (vma->vm_ops->checkpoint)
+ ret = (*vma->vm_ops->checkpoint)(ctx, vma);
+ else
+ ret = -ENOSYS;
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = checkpoint_mm_context(ctx, mm);
+ out:
+ ckpt_hdr_put(ctx, h);
+ up_read(&mm->mmap_sem);
+ return ret;
+}
+
+int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_mm(ctx, (struct mm_struct *) ptr);
+}
+
+int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct mm_struct *mm;
+ int objref;
+
+ mm = get_task_mm(t);
+ objref = checkpoint_obj(ctx, mm, CKPT_OBJ_MM);
+ mmput(mm);
+
+ return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+static int collect_mm(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+ struct vm_area_struct *vma;
+ struct file *file;
+ int exists;
+ int ret;
+
+ /* if already exists, don't proceed inside the struct */
+ exists = ckpt_obj_lookup(ctx, mm, CKPT_OBJ_MM);
+
+ ret = ckpt_obj_collect(ctx, mm, CKPT_OBJ_MM);
+ if (ret < 0 || exists)
+ return ret;
+
+ down_read(&mm->mmap_sem);
+ if (mm->exe_file) {
+ ret = ckpt_obj_collect(ctx, mm->exe_file, CKPT_OBJ_FILE);
+ if (ret < 0)
+ goto out;
+ }
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ file = vma->vm_file;
+ if (file) {
+ ret = ckpt_obj_collect(ctx, file, CKPT_OBJ_FILE);
+ if (ret < 0)
+ break;
+ }
+ }
+ out:
+ up_read(&mm->mmap_sem);
+ return ret;
+
+}
+
+int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct mm_struct *mm;
+ int ret;
+
+ mm = get_task_mm(t);
+ ret = collect_mm(ctx, mm);
+ mmput(mm);
+
+ return ret;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index fae6bfc..479e8eb 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -106,6 +106,22 @@ static int obj_file_users(void *ptr)
return atomic_long_read(&((struct file *) ptr)->f_count);
}

+static int obj_mm_grab(void *ptr)
+{
+ atomic_inc(&((struct mm_struct *) ptr)->mm_users);
+ return 0;
+}
+
+static void obj_mm_drop(void *ptr)
+{
+ mmput((struct mm_struct *) ptr);
+}
+
+static int obj_mm_users(void *ptr)
+{
+ return atomic_read(&((struct mm_struct *) ptr)->mm_users);
+}
+
static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -134,6 +150,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.checkpoint = checkpoint_file,
.restore = restore_file,
},
+ /* mm object */
+ {
+ .obj_name = "MM",
+ .obj_type = CKPT_OBJ_MM,
+ .ref_drop = obj_mm_drop,
+ .ref_grab = obj_mm_grab,
+ .ref_users = obj_mm_users,
+ .checkpoint = checkpoint_mm,
+ },
};


diff --git a/checkpoint/process.c b/checkpoint/process.c
index 8cbbace..397ab08 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -107,6 +107,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
{
struct ckpt_hdr_task_objs *h;
int files_objref;
+ int mm_objref;
int ret;

files_objref = checkpoint_obj_file_table(ctx, t);
@@ -117,10 +118,19 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
return files_objref;
}

+ mm_objref = checkpoint_obj_mm(ctx, t);
+ ckpt_debug("mm: objref %d\n", mm_objref);
+ if (mm_objref < 0) {
+ ckpt_write_err(ctx, "task %d (%s), mm_struct: %d",
+ task_pid_vnr(t), t->comm, mm_objref);
+ return mm_objref;
+ }
+
h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
if (!h)
return -ENOMEM;
h->files_objref = files_objref;
+ h->mm_objref = mm_objref;
ret = ckpt_write_obj(ctx, &h->h);
ckpt_hdr_put(ctx, h);

@@ -274,6 +284,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
int ret;

ret = ckpt_collect_file_table(ctx, t);
+ if (ret < 0)
+ return ret;
+ ret = ckpt_collect_mm(ctx, t);

return ret;
}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index bc5620f..4351c28 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -196,6 +196,7 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)

ckpt_obj_hash_free(ctx);
path_put(&ctx->fs_mnt);
+ ckpt_pgarr_free(ctx);

if (ctx->tasks_arr)
task_arr_free(ctx);
@@ -227,6 +228,8 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
ctx->ktime_begin = ktime_get();

atomic_set(&ctx->refcount, 0);
+ INIT_LIST_HEAD(&ctx->pgarr_list);
+ INIT_LIST_HEAD(&ctx->pgarr_pool);
init_waitqueue_head(&ctx->waitq);

err = -EBADF;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 3f28a06..452007c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -115,6 +115,7 @@ extern int restore_task(struct ckpt_ctx *ctx);
extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);

extern int restore_read_header_arch(struct ckpt_ctx *ctx);
extern int restore_thread(struct ckpt_ctx *ctx);
@@ -145,6 +146,29 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
struct ckpt_hdr_file *h);

+/* memory */
+extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
+
+extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma,
+ enum vma_type type,
+ int vma_objref);
+extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma,
+ enum vma_type type,
+ int vma_objref);
+
+extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+
+#define CKPT_VMA_NOT_SUPPORTED \
+ (VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | \
+ VM_NONLINEAR | VM_PFNMAP | VM_RESERVED | VM_NORESERVE \
+ | VM_HUGETLB | VM_NONLINEAR | VM_MAPPED_COPY | \
+ VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
+

/* debugging flags */
#define CKPT_DBASE 0x1 /* anything */
@@ -152,6 +176,8 @@ extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
#define CKPT_DRW 0x4 /* image read/write */
#define CKPT_DOBJ 0x8 /* shared objects */
#define CKPT_DFILE 0x10 /* files and filesystem */
+#define CKPT_DMEM 0x20 /* memory state */
+#define CKPT_DPAGE 0x40 /* memory pages */

#define CKPT_DDEFAULT 0xffff /* default debug level */

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3f8483e..10c54b2 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -64,6 +64,11 @@ enum {
CKPT_HDR_FILE_NAME,
CKPT_HDR_FILE,

+ CKPT_HDR_MM = 401,
+ CKPT_HDR_VMA,
+ CKPT_HDR_PGARR,
+ CKPT_HDR_MM_CONTEXT,
+
CKPT_HDR_TAIL = 9001,

CKPT_HDR_ERROR = 9999,
@@ -86,6 +91,7 @@ enum obj_type {
CKPT_OBJ_IGNORE = 0,
CKPT_OBJ_FILE_TABLE,
CKPT_OBJ_FILE,
+ CKPT_OBJ_MM,
CKPT_OBJ_MAX
};

@@ -93,6 +99,8 @@ enum obj_type {
struct ckpt_hdr_const {
/* task */
__u16 task_comm_len;
+ /* mm */
+ __u16 mm_saved_auxv_len;
/* uts */
__u16 uts_release_len;
__u16 uts_version_len;
@@ -167,6 +175,7 @@ struct ckpt_hdr_task {
struct ckpt_hdr_task_objs {
struct ckpt_hdr h;
__s32 files_objref;
+ __s32 mm_objref;
} __attribute__((aligned(8)));

/* restart blocks */
@@ -225,4 +234,47 @@ struct ckpt_hdr_file_generic {
struct ckpt_hdr_file common;
} __attribute__((aligned(8)));

+/* memory layout */
+struct ckpt_hdr_mm {
+ struct ckpt_hdr h;
+ __u32 map_count;
+ __s32 exe_objref;
+
+ __u64 def_flags;
+ __u64 flags;
+
+ __u64 start_code, end_code, start_data, end_data;
+ __u64 start_brk, brk, start_stack;
+ __u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vma_type {
+ CKPT_VMA_IGNORE = 0,
+ CKPT_VMA_VDSO, /* special vdso vma */
+ CKPT_VMA_ANON, /* private anonymous */
+ CKPT_VMA_FILE, /* private mapped file */
+ CKPT_VMA_MAX
+};
+
+/* vma descriptor */
+struct ckpt_hdr_vma {
+ struct ckpt_hdr h;
+ __u32 vma_type;
+ __s32 vma_objref; /* objref of backing file */
+
+ __u64 vm_start;
+ __u64 vm_end;
+ __u64 vm_page_prot;
+ __u64 vm_flags;
+ __u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+/* page array */
+struct ckpt_hdr_pgarr {
+ struct ckpt_hdr h;
+ __u64 nr_pages; /* number of pages to saved */
+} __attribute__((aligned(8)));
+
+
#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index c446510..57cbc96 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -13,11 +13,9 @@
#ifdef __KERNEL__

#include <linux/list.h>
-#include <linux/path.h>
-#include <linux/fs.h>
-
#include <linux/sched.h>
#include <linux/nsproxy.h>
+#include <linux/path.h>
#include <linux/fs.h>
#include <linux/ktime.h>
#include <linux/wait.h>
@@ -48,6 +46,9 @@ struct ckpt_ctx {

char err_string[256]; /* checkpoint: error string */

+ struct list_head pgarr_list; /* page array to dump VMA contents */
+ struct list_head pgarr_pool; /* pool of empty page arrays chain */
+
/* [multi-process checkpoint] */
struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
int nr_tasks; /* size of tasks array */
diff --git a/mm/filemap.c b/mm/filemap.c
index ccea3b6..d866bbd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -34,6 +34,7 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/mm_inline.h> /* for page_is_file_cache() */
+#include <linux/checkpoint.h>
#include "internal.h"

/*
@@ -1648,8 +1649,32 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

+#ifdef CONFIG_CHECKPOINT
+static int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+ struct file *file = vma->vm_file;
+ int vma_objref;
+
+ if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+ pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+ return -ENOSYS;
+ }
+
+ BUG_ON(!file);
+
+ vma_objref = checkpoint_obj(ctx, file, CKPT_OBJ_FILE);
+ if (vma_objref < 0)
+ return vma_objref;
+
+ return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+}
+#endif /* CONFIG_CHECKPOINT */
+
struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+#ifdef CONFIG_CHECKPOINT
+ .checkpoint = filemap_checkpoint,
+#endif
};

/* This is used for a general mmap of a disk file */
diff --git a/mm/mmap.c b/mm/mmap.c
index 34579b2..939a17c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -29,6 +29,7 @@
#include <linux/rmap.h>
#include <linux/mmu_notifier.h>
#include <linux/perf_counter.h>
+#include <linux/checkpoint.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -2270,9 +2271,36 @@ static void special_mapping_close(struct vm_area_struct *vma)
{
}

+#ifdef CONFIG_CHECKPOINT
+static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma)
+{
+ const char *name;
+
+ /*
+ * FIX:
+ * Currently, we only handle VDSO/vsyscall special handling.
+ * Even that, is very basic - we just skip the contents and
+ * hope for the best in terms of compatilibity upon restart.
+ */
+
+ if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+ return -ENOSYS;
+
+ name = arch_vma_name(vma);
+ if (!name || strcmp(name, "[vdso]"))
+ return -ENOSYS;
+
+ return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+}
+#endif /* CONFIG_CHECKPOINT */
+
static struct vm_operations_struct special_mapping_vmops = {
.close = special_mapping_close,
.fault = special_mapping_fault,
+#ifdef CONFIG_CHECKPOINT
+ .checkpoint = special_mapping_checkpoint,
+#endif
};

/*
--
1.6.0.4

2009-07-22 10:15:33

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 04/60] c/r: split core function out of some set*{u,g}id functions

From: Serge E. Hallyn <[email protected]>

When restarting tasks, we want to be able to change xuid and
xgid in a struct cred, and do so with security checks. Break
the core functionality of set{fs,res}{u,g}id into cred_setX
which performs the access checks based on current_cred(),
but performs the requested change on a passed-in cred.

This will allow us to securely construct struct creds based
on a checkpoint image, constrained by the caller's permissions,
and apply them to the caller at the end of sys_restart().

Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/cred.h | 8 +++
kernel/cred.c | 114 ++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 134 ++++++++------------------------------------------
3 files changed, 143 insertions(+), 113 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4fa9996..2ffffbe 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -21,6 +21,9 @@ struct user_struct;
struct cred;
struct inode;

+/* defined in sys.c, used in cred_setresuid */
+extern int set_user(struct cred *new);
+
/*
* COW Supplementary groups list
*/
@@ -344,4 +347,9 @@ do { \
*(_fsgid) = __cred->fsgid; \
} while(0)

+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid);
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid, gid_t sgid);
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid);
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid);
+
#endif /* _LINUX_CRED_H */
diff --git a/kernel/cred.c b/kernel/cred.c
index 1bb4d7e..5c8db56 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -589,3 +589,117 @@ int set_create_files_as(struct cred *new, struct inode *inode)
return security_kernel_create_files_as(new, inode);
}
EXPORT_SYMBOL(set_create_files_as);
+
+int cred_setresuid(struct cred *new, uid_t ruid, uid_t euid, uid_t suid)
+{
+ int retval;
+ const struct cred *old;
+
+ retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
+ if (retval)
+ return retval;
+ old = current_cred();
+
+ if (!capable(CAP_SETUID)) {
+ if (ruid != (uid_t) -1 && ruid != old->uid &&
+ ruid != old->euid && ruid != old->suid)
+ return -EPERM;
+ if (euid != (uid_t) -1 && euid != old->uid &&
+ euid != old->euid && euid != old->suid)
+ return -EPERM;
+ if (suid != (uid_t) -1 && suid != old->uid &&
+ suid != old->euid && suid != old->suid)
+ return -EPERM;
+ }
+
+ if (ruid != (uid_t) -1) {
+ new->uid = ruid;
+ if (ruid != old->uid) {
+ retval = set_user(new);
+ if (retval < 0)
+ return retval;
+ }
+ }
+ if (euid != (uid_t) -1)
+ new->euid = euid;
+ if (suid != (uid_t) -1)
+ new->suid = suid;
+ new->fsuid = new->euid;
+
+ return security_task_fix_setuid(new, old, LSM_SETID_RES);
+}
+
+int cred_setresgid(struct cred *new, gid_t rgid, gid_t egid,
+ gid_t sgid)
+{
+ const struct cred *old = current_cred();
+ int retval;
+
+ retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
+ if (retval)
+ return retval;
+
+ if (!capable(CAP_SETGID)) {
+ if (rgid != (gid_t) -1 && rgid != old->gid &&
+ rgid != old->egid && rgid != old->sgid)
+ return -EPERM;
+ if (egid != (gid_t) -1 && egid != old->gid &&
+ egid != old->egid && egid != old->sgid)
+ return -EPERM;
+ if (sgid != (gid_t) -1 && sgid != old->gid &&
+ sgid != old->egid && sgid != old->sgid)
+ return -EPERM;
+ }
+
+ if (rgid != (gid_t) -1)
+ new->gid = rgid;
+ if (egid != (gid_t) -1)
+ new->egid = egid;
+ if (sgid != (gid_t) -1)
+ new->sgid = sgid;
+ new->fsgid = new->egid;
+ return 0;
+}
+
+int cred_setfsuid(struct cred *new, uid_t uid, uid_t *old_fsuid)
+{
+ const struct cred *old;
+
+ old = current_cred();
+ *old_fsuid = old->fsuid;
+
+ if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) < 0)
+ return -EPERM;
+
+ if (uid == old->uid || uid == old->euid ||
+ uid == old->suid || uid == old->fsuid ||
+ capable(CAP_SETUID)) {
+ if (uid != *old_fsuid) {
+ new->fsuid = uid;
+ if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0)
+ return 0;
+ }
+ }
+ return -EPERM;
+}
+
+int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid)
+{
+ const struct cred *old;
+
+ old = current_cred();
+ *old_fsgid = old->fsgid;
+
+ if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
+ return -EPERM;
+
+ if (gid == old->gid || gid == old->egid ||
+ gid == old->sgid || gid == old->fsgid ||
+ capable(CAP_SETGID)) {
+ if (gid != *old_fsgid) {
+ new->fsgid = gid;
+ return 0;
+ }
+ }
+ return -EPERM;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index b3f1097..da4f9e0 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -559,11 +559,12 @@ error:
/*
* change the user struct in a credentials set to match the new UID
*/
-static int set_user(struct cred *new)
+int set_user(struct cred *new)
{
struct user_struct *new_user;

- new_user = alloc_uid(current_user_ns(), new->uid);
+ /* is this ok? */
+ new_user = alloc_uid(new->user->user_ns, new->uid);
if (!new_user)
return -EAGAIN;

@@ -704,14 +705,12 @@ error:
return retval;
}

-
/*
* This function implements a generic ability to update ruid, euid,
* and suid. This allows you to implement the 4.4 compatible seteuid().
*/
SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
{
- const struct cred *old;
struct cred *new;
int retval;

@@ -719,45 +718,10 @@ SYSCALL_DEFINE3(setresuid, uid_t, ruid, uid_t, euid, uid_t, suid)
if (!new)
return -ENOMEM;

- retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
- if (retval)
- goto error;
- old = current_cred();
-
- retval = -EPERM;
- if (!capable(CAP_SETUID)) {
- if (ruid != (uid_t) -1 && ruid != old->uid &&
- ruid != old->euid && ruid != old->suid)
- goto error;
- if (euid != (uid_t) -1 && euid != old->uid &&
- euid != old->euid && euid != old->suid)
- goto error;
- if (suid != (uid_t) -1 && suid != old->uid &&
- suid != old->euid && suid != old->suid)
- goto error;
- }
-
- if (ruid != (uid_t) -1) {
- new->uid = ruid;
- if (ruid != old->uid) {
- retval = set_user(new);
- if (retval < 0)
- goto error;
- }
- }
- if (euid != (uid_t) -1)
- new->euid = euid;
- if (suid != (uid_t) -1)
- new->suid = suid;
- new->fsuid = new->euid;
-
- retval = security_task_fix_setuid(new, old, LSM_SETID_RES);
- if (retval < 0)
- goto error;
-
- return commit_creds(new);
+ retval = cred_setresuid(new, ruid, euid, suid);
+ if (retval == 0)
+ return commit_creds(new);

-error:
abort_creds(new);
return retval;
}
@@ -779,43 +743,17 @@ SYSCALL_DEFINE3(getresuid, uid_t __user *, ruid, uid_t __user *, euid, uid_t __u
*/
SYSCALL_DEFINE3(setresgid, gid_t, rgid, gid_t, egid, gid_t, sgid)
{
- const struct cred *old;
struct cred *new;
int retval;

new = prepare_creds();
if (!new)
return -ENOMEM;
- old = current_cred();

- retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
- if (retval)
- goto error;
+ retval = cred_setresgid(new, rgid, egid, sgid);
+ if (retval == 0)
+ return commit_creds(new);

- retval = -EPERM;
- if (!capable(CAP_SETGID)) {
- if (rgid != (gid_t) -1 && rgid != old->gid &&
- rgid != old->egid && rgid != old->sgid)
- goto error;
- if (egid != (gid_t) -1 && egid != old->gid &&
- egid != old->egid && egid != old->sgid)
- goto error;
- if (sgid != (gid_t) -1 && sgid != old->gid &&
- sgid != old->egid && sgid != old->sgid)
- goto error;
- }
-
- if (rgid != (gid_t) -1)
- new->gid = rgid;
- if (egid != (gid_t) -1)
- new->egid = egid;
- if (sgid != (gid_t) -1)
- new->sgid = sgid;
- new->fsgid = new->egid;
-
- return commit_creds(new);
-
-error:
abort_creds(new);
return retval;
}
@@ -832,7 +770,6 @@ SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __u
return retval;
}

-
/*
* "setfsuid()" sets the fsuid - the uid used for filesystem checks. This
* is used for "access()" and for the NFS daemon (letting nfsd stay at
@@ -841,35 +778,20 @@ SYSCALL_DEFINE3(getresgid, gid_t __user *, rgid, gid_t __user *, egid, gid_t __u
*/
SYSCALL_DEFINE1(setfsuid, uid_t, uid)
{
- const struct cred *old;
struct cred *new;
uid_t old_fsuid;
+ int retval;

new = prepare_creds();
if (!new)
return current_fsuid();
- old = current_cred();
- old_fsuid = old->fsuid;
-
- if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS) < 0)
- goto error;
-
- if (uid == old->uid || uid == old->euid ||
- uid == old->suid || uid == old->fsuid ||
- capable(CAP_SETUID)) {
- if (uid != old_fsuid) {
- new->fsuid = uid;
- if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0)
- goto change_okay;
- }
- }

-error:
- abort_creds(new);
- return old_fsuid;
+ retval = cred_setfsuid(new, uid, &old_fsuid);
+ if (retval == 0)
+ commit_creds(new);
+ else
+ abort_creds(new);

-change_okay:
- commit_creds(new);
return old_fsuid;
}

@@ -878,34 +800,20 @@ change_okay:
*/
SYSCALL_DEFINE1(setfsgid, gid_t, gid)
{
- const struct cred *old;
struct cred *new;
gid_t old_fsgid;
+ int retval;

new = prepare_creds();
if (!new)
return current_fsgid();
- old = current_cred();
- old_fsgid = old->fsgid;
-
- if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
- goto error;
-
- if (gid == old->gid || gid == old->egid ||
- gid == old->sgid || gid == old->fsgid ||
- capable(CAP_SETGID)) {
- if (gid != old_fsgid) {
- new->fsgid = gid;
- goto change_okay;
- }
- }

-error:
- abort_creds(new);
- return old_fsgid;
+ retval = cred_setfsgid(new, gid, &old_fsgid);
+ if (retval == 0)
+ commit_creds(new);
+ else
+ abort_creds(new);

-change_okay:
- commit_creds(new);
return old_fsgid;
}

--
1.6.0.4

2009-07-22 10:15:00

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 36/60] c/r: add generic '->checkpoint()' f_op to simple devices

* /dev/null
* /dev/zero
* /dev/random
* /dev/urandom

Signed-off-by: Oren Laadan <[email protected]>
---
drivers/char/mem.c | 2 ++
drivers/char/random.c | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index afa8813..828ba7f 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -799,6 +799,7 @@ static const struct file_operations null_fops = {
.read = read_null,
.write = write_null,
.splice_write = splice_write_null,
+ .checkpoint = generic_file_checkpoint,
};

#ifdef CONFIG_DEVPORT
@@ -815,6 +816,7 @@ static const struct file_operations zero_fops = {
.read = read_zero,
.write = write_zero,
.mmap = mmap_zero,
+ .checkpoint = generic_file_checkpoint,
};

/*
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 8c74448..211ca70 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1164,6 +1164,7 @@ const struct file_operations random_fops = {
.poll = random_poll,
.unlocked_ioctl = random_ioctl,
.fasync = random_fasync,
+ .checkpoint = generic_file_checkpoint,
};

const struct file_operations urandom_fops = {
@@ -1171,6 +1172,7 @@ const struct file_operations urandom_fops = {
.write = random_write,
.unlocked_ioctl = random_ioctl,
.fasync = random_fasync,
+ .checkpoint = generic_file_checkpoint,
};

/***************************************************************
--
1.6.0.4

2009-07-22 10:13:38

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 11/60] pids 1/7: Factor out code to allocate pidmap page

From: Sukadev Bhattiprolu <[email protected]>

To implement support for clone_with_pids() system call we would
need to allocate pidmap page in more than one place. Move this
code to a new function alloc_pidmap_page().

Changelog[v2]:
- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
-ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/pid.c | 46 ++++++++++++++++++++++++++++++----------------
1 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 31310b5..f618096 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,34 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
}

+static int alloc_pidmap_page(struct pidmap *map)
+{
+ void *page;
+
+ if (likely(map->page))
+ return 0;
+
+ page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+ /*
+ * Free the page if someone raced with us installing it:
+ */
+ spin_lock_irq(&pidmap_lock);
+ if (map->page)
+ kfree(page);
+ else
+ map->page = page;
+ spin_unlock_irq(&pidmap_lock);
+
+ if (unlikely(!map->page))
+ return -ENOMEM;
+
+ return 0;
+}
+
static int alloc_pidmap(struct pid_namespace *pid_ns)
{
- int i, offset, max_scan, pid, last = pid_ns->last_pid;
+ int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;

pid = last + 1;
@@ -134,21 +159,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
- if (unlikely(!map->page)) {
- void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
- /*
- * Free the page if someone raced with us
- * installing it:
- */
- spin_lock_irq(&pidmap_lock);
- if (map->page)
- kfree(page);
- else
- map->page = page;
- spin_unlock_irq(&pidmap_lock);
- if (unlikely(!map->page))
- break;
- }
+ rc = alloc_pidmap_page(map);
+ if (rc)
+ break;
+
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
--
1.6.0.4

2009-07-22 10:13:38

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 12/60] pids 2/7: Have alloc_pidmap() return actual error code

From: Sukadev Bhattiprolu <[email protected]>

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed. With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/fork.c | 5 +++--
kernel/pid.c | 9 ++++++---
2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index bd29592..e90cee5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1123,10 +1123,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;

if (pid != &init_struct_pid) {
- retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns);
- if (!pid)
+ if (IS_ERR(pid)) {
+ retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
+ }

if (clone_flags & CLONE_NEWPID) {
retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index f618096..9c678ce 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -158,6 +158,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
offset = pid & BITS_PER_PAGE_MASK;
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+ rc = -EAGAIN;
for (i = 0; i <= max_scan; ++i) {
rc = alloc_pidmap_page(map);
if (rc)
@@ -188,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
} else {
map = &pid_ns->pidmap[0];
offset = RESERVED_PIDS;
- if (unlikely(last == offset))
+ if (unlikely(last == offset)) {
+ rc = -EAGAIN;
break;
+ }
}
pid = mk_pid(pid_ns, map, offset);
}
- return -1;
+ return rc;
}

int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -298,7 +301,7 @@ out_free:
free_pidmap(pid->numbers + i);

kmem_cache_free(ns->pid_cachep, pid);
- pid = NULL;
+ pid = ERR_PTR(nr);
goto out;
}

--
1.6.0.4

2009-07-22 10:14:43

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 37/60] c/r: introduce method '->checkpoint()' in struct vm_operations_struct

Changelog[v17]
- Forward-declare 'ckpt_ctx et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <[email protected]>
---
include/linux/mm.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ba3a7cb..0e46e95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,6 +19,7 @@ struct file_ra_state;
struct user_struct;
struct writeback_control;
struct rlimit;
+struct ckpt_ctx;

#ifndef CONFIG_DISCONTIGMEM /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
@@ -220,6 +221,9 @@ struct vm_operations_struct {
int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
const nodemask_t *to, unsigned long flags);
#endif
+#ifdef CONFIG_CHECKPOINT
+ int (*checkpoint)(struct ckpt_ctx *ctx, struct vm_area_struct *vma);
+#endif
};

struct mmu_gather;
--
1.6.0.4

2009-07-22 10:15:34

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 03/60] c/r: break out new_user_ns()

From: Serge E. Hallyn <[email protected]>

Break out the core function which checks privilege and (if
allowed) creates a new user namespace, with the passed-in
creating user_struct. Note that a user_namespace, unlike
other namespace pointers, is not stored in the nsproxy.
Rather it is purely a property of user_structs.

This will let us keep the task restore code simpler.

Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/user_namespace.h | 8 ++++++
kernel/user_namespace.c | 53 ++++++++++++++++++++++++++++------------
2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index cc4f453..f6ea75d 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -20,6 +20,8 @@ extern struct user_namespace init_user_ns;

#ifdef CONFIG_USER_NS

+struct user_namespace *new_user_ns(struct user_struct *creator,
+ struct user_struct **newroot);
static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
{
if (ns)
@@ -38,6 +40,12 @@ static inline void put_user_ns(struct user_namespace *ns)

#else

+static inline struct user_namespace *new_user_ns(struct user_struct *creator,
+ struct user_struct **newroot)
+{
+ return ERR_PTR(-EINVAL);
+}
+
static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
{
return &init_user_ns;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 076c7c8..e624b0f 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -11,15 +11,8 @@
#include <linux/user_namespace.h>
#include <linux/cred.h>

-/*
- * Create a new user namespace, deriving the creator from the user in the
- * passed credentials, and replacing that user with the new root user for the
- * new namespace.
- *
- * This is called by copy_creds(), which will finish setting the target task's
- * credentials.
- */
-int create_user_ns(struct cred *new)
+static struct user_namespace *_new_user_ns(struct user_struct *creator,
+ struct user_struct **newroot)
{
struct user_namespace *ns;
struct user_struct *root_user;
@@ -27,7 +20,7 @@ int create_user_ns(struct cred *new)

ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
if (!ns)
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);

kref_init(&ns->kref);

@@ -38,12 +31,43 @@ int create_user_ns(struct cred *new)
root_user = alloc_uid(ns, 0);
if (!root_user) {
kfree(ns);
- return -ENOMEM;
+ return ERR_PTR(-ENOMEM);
}

/* set the new root user in the credentials under preparation */
- ns->creator = new->user;
- new->user = root_user;
+ ns->creator = creator;
+
+ /* alloc_uid() incremented the userns refcount. Just set it to 1 */
+ kref_set(&ns->kref, 1);
+
+ *newroot = root_user;
+ return ns;
+}
+
+struct user_namespace *new_user_ns(struct user_struct *creator,
+ struct user_struct **newroot)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ return _new_user_ns(creator, newroot);
+}
+
+/*
+ * Create a new user namespace, deriving the creator from the user in the
+ * passed credentials, and replacing that user with the new root user for the
+ * new namespace.
+ *
+ * This is called by copy_creds(), which will finish setting the target task's
+ * credentials.
+ */
+int create_user_ns(struct cred *new)
+{
+ struct user_namespace *ns;
+
+ ns = new_user_ns(new->user, &new->user);
+ if (IS_ERR(ns))
+ return PTR_ERR(ns);
+
new->uid = new->euid = new->suid = new->fsuid = 0;
new->gid = new->egid = new->sgid = new->fsgid = 0;
put_group_info(new->group_info);
@@ -54,9 +78,6 @@ int create_user_ns(struct cred *new)
#endif
/* tgcred will be cleared in our caller bc CLONE_THREAD won't be set */

- /* alloc_uid() incremented the userns refcount. Just set it to 1 */
- kref_set(&ns->kref, 1);
-
return 0;
}

--
1.6.0.4

2009-07-22 10:14:58

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 25/60] c/r: checkpoint multiple processes

Checkpointing of multiple processes works by recording the tasks tree
structure below a given "root" task. The root task is expected to be a
container init, and then an entire container is checkpointed. However,
passing CHECKPOINT_SUBTREE to checkpoint(2) relaxes this requirement
and allows to checkpoint a subtree of processes from the root task.

For a given root task, do a DFS scan of the tasks tree and collect them
into an array (keeping a reference to each task). Using DFS simplifies
the recreation of tasks either in user space or kernel space. For each
task collected, test if it can be checkpointed, and save its pid, tgid,
and ppid.

The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.

Whether checkpoints and restarts require CAP_SYS_ADMIN is determined
by sysctl 'ckpt_unpriv_allowed': if 1, then regular permission checks
are intended to prevent privilege escalation, however if 0 it prevents
unprivileged users from exploiting any privilege escalation bugs.

The logic is suitable for creation of processes during restart either
in userspace or by the kernel.

Currently we ignore threads and zombies.

Changelog[v16]:
- CHECKPOINT_SUBTREE flags allows subtree (not whole container)
- sysctl variable 'ckpt_unpriv_allowed' controls needed privileges
Changelog[v14]:
- Refuse non-self checkpoint if target task isn't frozen
- Refuse checkpoint (for now) if task is ptraced
- Revert change to pr_debug(), back to ckpt_debug()
- Use only unsigned fields in checkpoint headers
- Check retval of ckpt_tree_count_tasks() in ckpt_build_tree()
- Discard 'h.parent' field
- Check whether calls to ckpt_hbuf_get() fail
- Disallow threads or siblings to container init
Changelog[v13]:
- Release tasklist_lock in error path in ckpt_tree_count_tasks()
- Use separate index for 'tasks_arr' and 'hh' in ckpt_write_pids()
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/checkpoint.c | 287 ++++++++++++++++++++++++++++++++++++--
checkpoint/restart.c | 2 +-
checkpoint/sys.c | 33 ++++-
include/linux/checkpoint.h | 14 ++
include/linux/checkpoint_hdr.h | 16 ++-
include/linux/checkpoint_types.h | 4 +
kernel/sysctl.c | 17 +++
7 files changed, 357 insertions(+), 16 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 8facd9a..57f59de 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -259,8 +259,27 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
return ret;
}

+/* dump all tasks in ctx->tasks_arr[] */
+static int checkpoint_all_tasks(struct ckpt_ctx *ctx)
+{
+ int n, ret = 0;
+
+ for (n = 0; n < ctx->nr_tasks; n++) {
+ ckpt_debug("dumping task #%d\n", n);
+ ret = checkpoint_task(ctx, ctx->tasks_arr[n]);
+ if (ret < 0)
+ break;
+ }
+
+ return ret;
+}
+
static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
{
+ struct task_struct *root = ctx->root_task;
+
+ ckpt_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
if (t->state == TASK_DEAD) {
pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
return -EAGAIN;
@@ -286,15 +305,256 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
return -EBUSY;
}

+ /*
+ * FIX: for now, disallow siblings of container init created
+ * via CLONE_PARENT (unclear if they will remain possible)
+ */
+ if (ctx->root_init && t != root && t->tgid != root->tgid &&
+ t->real_parent == root->real_parent) {
+ __ckpt_write_err(ctx, "task %d (%s) is sibling of root",
+ task_pid_vnr(t), t->comm);
+ return -EINVAL;
+ }
+
+ /* FIX: change this when namespaces are added */
+ if (task_nsproxy(t) != ctx->root_nsproxy)
+ return -EPERM;
+
return 0;
}

+#define CKPT_HDR_PIDS_CHUNK 256
+
+static int checkpoint_pids(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_pids *h;
+ struct pid_namespace *ns;
+ struct task_struct *task;
+ struct task_struct **tasks_arr;
+ int nr_tasks, n, pos = 0, ret = 0;
+
+ ns = ctx->root_nsproxy->pid_ns;
+ tasks_arr = ctx->tasks_arr;
+ nr_tasks = ctx->nr_tasks;
+ BUG_ON(nr_tasks <= 0);
+
+ ret = ckpt_write_obj_type(ctx, NULL,
+ sizeof(*h) * nr_tasks,
+ CKPT_HDR_BUFFER);
+ if (ret < 0)
+ return ret;
+
+ h = ckpt_hdr_get(ctx, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+ if (!h)
+ return -ENOMEM;
+
+ do {
+ rcu_read_lock();
+ for (n = 0; n < min(nr_tasks, CKPT_HDR_PIDS_CHUNK); n++) {
+ task = tasks_arr[pos];
+
+ h[n].vpid = task_pid_nr_ns(task, ns);
+ h[n].vtgid = task_tgid_nr_ns(task, ns);
+ h[n].vpgid = task_pgrp_nr_ns(task, ns);
+ h[n].vsid = task_session_nr_ns(task, ns);
+ h[n].vppid = task_tgid_nr_ns(task->real_parent, ns);
+ ckpt_debug("task[%d]: vpid %d vtgid %d parent %d\n",
+ pos, h[n].vpid, h[n].vtgid, h[n].vppid);
+ pos++;
+ }
+ rcu_read_unlock();
+
+ n = min(nr_tasks, CKPT_HDR_PIDS_CHUNK);
+ ret = ckpt_kwrite(ctx, h, n * sizeof(*h));
+ if (ret < 0)
+ break;
+
+ nr_tasks -= n;
+ } while (nr_tasks > 0);
+
+ _ckpt_hdr_put(ctx, h, sizeof(*h) * CKPT_HDR_PIDS_CHUNK);
+ return ret;
+}
+
+/* count number of tasks in tree (and optionally fill pid's in array) */
+static int tree_count_tasks(struct ckpt_ctx *ctx)
+{
+ struct task_struct *root;
+ struct task_struct *task;
+ struct task_struct *parent;
+ struct task_struct **tasks_arr = ctx->tasks_arr;
+ int nr_tasks = ctx->nr_tasks;
+ int nr = 0;
+ int ret;
+
+ read_lock(&tasklist_lock);
+
+ /* we hold the lock, so root_task->real_parent can't change */
+ task = ctx->root_task;
+ if (ctx->root_init) {
+ /* container-init: start from container parent */
+ parent = task->real_parent;
+ root = parent;
+ } else {
+ /* non-container-init: start from root task and down */
+ parent = NULL;
+ root = task;
+ }
+
+ /* count tasks via DFS scan of the tree */
+ while (1) {
+ /* is this task cool ? */
+ ret = may_checkpoint_task(ctx, task);
+ if (ret < 0) {
+ nr = ret;
+ break;
+ }
+ if (tasks_arr) {
+ /* unlikely... but if so then try again later */
+ if (nr == nr_tasks) {
+ nr = -EAGAIN; /* cleanup in ckpt_ctx_free() */
+ break;
+ }
+ tasks_arr[nr] = task;
+ get_task_struct(task);
+ }
+ nr++;
+ /* if has children - proceed with child */
+ if (!list_empty(&task->children)) {
+ parent = task;
+ task = list_entry(task->children.next,
+ struct task_struct, sibling);
+ continue;
+ }
+ while (task != root) {
+ /* if has sibling - proceed with sibling */
+ if (!list_is_last(&task->sibling, &parent->children)) {
+ task = list_entry(task->sibling.next,
+ struct task_struct, sibling);
+ break;
+ }
+
+ /* else, trace back to parent and proceed */
+ task = parent;
+ parent = parent->real_parent;
+ }
+ if (task == root)
+ break;
+ }
+
+ read_unlock(&tasklist_lock);
+
+ if (nr < 0)
+ ckpt_write_err(ctx, NULL);
+ return nr;
+}
+
+/*
+ * build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->nr_tasks will hold the total count.
+ * The array is cleaned up by ckpt_ctx_free().
+ */
+static int build_tree(struct ckpt_ctx *ctx)
+{
+ int n, m;
+
+ /* count tasks (no side effects) */
+ n = tree_count_tasks(ctx);
+ if (n < 0)
+ return n;
+
+ ctx->nr_tasks = n;
+ ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+ if (!ctx->tasks_arr)
+ return -ENOMEM;
+
+ /* count again (now will fill array) */
+ m = tree_count_tasks(ctx);
+
+ /* unlikely, but ... (cleanup in ckpt_ctx_free) */
+ if (m < 0)
+ return m;
+ else if (m != n)
+ return -EBUSY;
+
+ return 0;
+}
+
+/* dump the array that describes the tasks tree */
+static int checkpoint_tree(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_tree *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+ if (!h)
+ return -ENOMEM;
+
+ h->nr_tasks = ctx->nr_tasks;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ return ret;
+
+ ret = checkpoint_pids(ctx);
+ return ret;
+}
+
+static struct task_struct *get_freezer_task(struct task_struct *root_task)
+{
+ struct task_struct *p;
+
+ /*
+ * For the duration of checkpoint we deep-freeze all tasks.
+ * Normally do it through the root task's freezer cgroup.
+ * However, if the root task is also the current task (doing
+ * self-checkpoint) we can't freeze ourselves. In this case,
+ * choose the next available (non-dead) task instead. We'll
+ * use its freezer cgroup to verify that all tasks belong to
+ * the same cgroup.
+ */
+
+ if (root_task != current) {
+ get_task_struct(root_task);
+ return root_task;
+ }
+
+ /* search among threads, then children */
+ read_lock(&tasklist_lock);
+
+ for (p = next_thread(root_task); p != root_task; p = next_thread(p)) {
+ if (p->state == TASK_DEAD)
+ continue;
+ if (!in_same_cgroup_freezer(p, root_task))
+ goto out;
+ }
+
+ list_for_each_entry(p, &root_task->children, sibling) {
+ if (p->state == TASK_DEAD)
+ continue;
+ if (!in_same_cgroup_freezer(p, root_task))
+ goto out;
+ }
+
+ p = NULL;
+ out:
+ read_unlock(&tasklist_lock);
+ if (p)
+ get_task_struct(p);
+ return p;
+}
+
/* setup checkpoint-specific parts of ctx */
static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
{
struct task_struct *task;
struct nsproxy *nsproxy;
- int ret;

/*
* No need for explicit cleanup here, because if an error
@@ -326,17 +586,13 @@ static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
ctx->root_nsproxy = nsproxy;

/* root freezer */
- ctx->root_freezer = task;
- geT_task_struct(task);
+ ctx->root_freezer = get_freezer_task(task);

- ret = may_checkpoint_task(ctx, task);
- if (ret) {
- ckpt_write_err(ctx, NULL);
- put_task_struct(task);
- put_task_struct(task);
- put_nsproxy(nsproxy);
- return ret;
- }
+ /* container init ? */
+ ctx->root_init = is_container_init(task);
+
+ if (!(ctx->uflags & CHECKPOINT_SUBTREE) && !ctx->root_init)
+ return -EINVAL; /* cleanup by ckpt_ctx_free() */

return 0;
}
@@ -355,10 +611,17 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
return ret;
}

+ ret = build_tree(ctx);
+ if (ret < 0)
+ goto out;
+
ret = checkpoint_write_header(ctx);
if (ret < 0)
goto out;
- ret = checkpoint_task(ctx, ctx->root_task);
+ ret = checkpoint_tree(ctx);
+ if (ret < 0)
+ goto out;
+ ret = checkpoint_all_tasks(ctx);
if (ret < 0)
goto out;
ret = checkpoint_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 582d6b4..4d1ff31 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -275,7 +275,7 @@ static int restore_read_header(struct ckpt_ctx *ctx)
h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
h->patch != ((LINUX_VERSION_CODE) & 0xff))
goto out;
- if (h->uflags)
+ if (h->uflags & ~CHECKPOINT_USER_FLAGS)
goto out;

ret = check_kernel_const(&h->constants);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index b37bc8c..cc94775 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -23,6 +23,14 @@
#include <linux/checkpoint.h>

/*
+ * ckpt_unpriv_allowed - sysctl controlled, do not allow checkpoints or
+ * restarts unless caller has CAP_SYS_ADMIN, if 0 (prevent unprivileged
+ * useres from expoitling any privilege escalation bugs). If it is 1,
+ * then regular permissions checks are intended to do the job.
+ */
+int ckpt_unpriv_allowed = 1; /* default: allow */
+
+/*
* Helpers to write(read) from(to) kernel space to(from) the checkpoint
* image file descriptor (similar to how a core-dump is performed).
*
@@ -166,11 +174,27 @@ void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
* restart operation, and persists until the operation is completed.
*/

+static void task_arr_free(struct ckpt_ctx *ctx)
+{
+ int n;
+
+ for (n = 0; n < ctx->nr_tasks; n++) {
+ if (ctx->tasks_arr[n]) {
+ put_task_struct(ctx->tasks_arr[n]);
+ ctx->tasks_arr[n] = NULL;
+ }
+ }
+ kfree(ctx->tasks_arr);
+}
+
static void ckpt_ctx_free(struct ckpt_ctx *ctx)
{
if (ctx->file)
fput(ctx->file);

+ if (ctx->tasks_arr)
+ task_arr_free(ctx);
+
if (ctx->root_nsproxy)
put_nsproxy(ctx->root_nsproxy);
if (ctx->root_task)
@@ -220,10 +244,12 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
struct ckpt_ctx *ctx;
long ret;

- /* no flags for now */
- if (flags)
+ if (flags & ~CHECKPOINT_USER_FLAGS)
return -EINVAL;

+ if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
if (pid == 0)
pid = task_pid_vnr(current);
ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT);
@@ -257,6 +283,9 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
if (flags)
return -EINVAL;

+ if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
if (IS_ERR(ctx))
return PTR_ERR(ctx);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 01541b8..df2938f 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -12,6 +12,9 @@

#define CHECKPOINT_VERSION 1

+/* checkpoint user flags */
+#define CHECKPOINT_SUBTREE 0x1
+
#ifdef __KERNEL__
#ifdef CONFIG_CHECKPOINT

@@ -27,6 +30,17 @@
#define CKPT_CTX_RESTART (1 << CKPT_CTX_RESTART_BIT)


+/* ckpt_ctx: kflags */
+#define CKPT_CTX_CHECKPOINT_BIT 1
+#define CKPT_CTX_RESTART_BIT 2
+
+#define CKPT_CTX_CHECKPOINT (1 << CKPT_CTX_CHECKPOINT_BIT)
+#define CKPT_CTX_RESTART (1 << CKPT_CTX_RESTART_BIT)
+
+/* ckpt ctx: uflags */
+#define CHECKPOINT_USER_FLAGS CHECKPOINT_SUBTREE
+
+
extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index fa23629..c9a80dc 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -49,7 +49,8 @@ enum {
CKPT_HDR_BUFFER,
CKPT_HDR_STRING,

- CKPT_HDR_TASK = 101,
+ CKPT_HDR_TREE = 101,
+ CKPT_HDR_TASK,
CKPT_HDR_RESTART_BLOCK,
CKPT_HDR_THREAD,
CKPT_HDR_CPU,
@@ -108,6 +109,19 @@ struct ckpt_hdr_tail {
__u64 magic;
} __attribute__((aligned(8)));

+/* task tree */
+struct ckpt_hdr_tree {
+ struct ckpt_hdr h;
+ __s32 nr_tasks;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_pids {
+ __s32 vpid;
+ __s32 vppid;
+ __s32 vtgid;
+ __s32 vpgid;
+ __s32 vsid;
+} __attribute__((aligned(8)));

/* task data */
struct ckpt_hdr_task {
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 220c209..5dca34f 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,7 @@ struct ckpt_ctx {

ktime_t ktime_begin; /* checkpoint start time */

+ int root_init; /* [container] root init ? */
pid_t root_pid; /* [container] root pid */
struct task_struct *root_task; /* [container] root task */
struct nsproxy *root_nsproxy; /* [container] root nsproxy */
@@ -34,6 +35,9 @@ struct ckpt_ctx {
struct file *file; /* input/output file */
int total; /* total read/written */

+ struct task_struct **tasks_arr; /* array of all tasks in container */
+ int nr_tasks; /* size of tasks array */
+
char err_string[256]; /* checkpoint: error string */
};

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 98e0232..9f4de60 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -197,6 +197,10 @@ int sysctl_legacy_va_layout;
extern int prove_locking;
extern int lock_stat;

+#ifdef CONFIG_CHECKPOINT
+extern int ckpt_unpriv_allowed;
+#endif
+
/* The default sysctl tables: */

static struct ctl_table root_table[] = {
@@ -989,6 +993,19 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_CHECKPOINT
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "ckpt_unpriv_allowed",
+ .data = &ckpt_unpriv_allowed,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif

/*
* NOTE: do not add new entries to this table unless you have read
--
1.6.0.4

2009-07-22 10:14:01

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 51/60] c/r: support message-queues sysv-ipc

Checkpoint of sysvipc message-queues is performed by iterating through
all 'msq' objects and dumping the contents of each one. The message
queued on each 'msq' are dumped with that object.

Message of a specific queue get written one by one. The queue lock
cannot be held while dumping them, but the loop must be protected from
someone (who ?) writing or reading. To do that we grab the lock, then
hijack the entire chain of messages from the queue, drop the lock,
and then safely dump them in a loop. Finally, with the lock held, we
re-attach the chain while verifying that there isn't other (new) data
on that queue.

Writing the message contents themselves is straight forward. The code
is similar to that in ipc/msgutil.c, the main difference being that
we deal with kernel memory and not user memory.

Changelog[v17]:
- Allocate security context for msg_msg
- Restore objects in the right namespace
- Don't unlock ipc before freeing

Signed-off-by: Oren Laadan <[email protected]>
---
include/linux/checkpoint_hdr.h | 20 +++
ipc/Makefile | 3 +-
ipc/checkpoint.c | 2 +-
ipc/checkpoint_msg.c | 364 ++++++++++++++++++++++++++++++++++++++++
ipc/msg.c | 10 +-
ipc/msgutil.c | 8 -
ipc/util.h | 13 ++
7 files changed, 403 insertions(+), 17 deletions(-)
create mode 100644 ipc/checkpoint_msg.c

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f4c3f7b..e33bb58 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -77,6 +77,7 @@ enum {
CKPT_HDR_IPC = 501,
CKPT_HDR_IPC_SHM,
CKPT_HDR_IPC_MSG,
+ CKPT_HDR_IPC_MSG_MSG,
CKPT_HDR_IPC_SEM,

CKPT_HDR_TAIL = 9001,
@@ -386,6 +387,25 @@ struct ckpt_hdr_ipc_shm {
__u32 objref;
} __attribute__((aligned(8)));

+struct ckpt_hdr_ipc_msg {
+ struct ckpt_hdr h;
+ struct ckpt_hdr_ipc_perms perms;
+ __u64 q_stime;
+ __u64 q_rtime;
+ __u64 q_ctime;
+ __u64 q_cbytes;
+ __u64 q_qnum;
+ __u64 q_qbytes;
+ __s32 q_lspid;
+ __s32 q_lrpid;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_msg_msg {
+ struct ckpt_hdr h;
+ __s32 m_type;
+ __u32 m_ts;
+} __attribute__((aligned(8)));
+

#define CKPT_TST_OVERFLOW_16(a, b) \
((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/ipc/Makefile b/ipc/Makefile
index db4b076..71a257f 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,5 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
obj-$(CONFIG_IPC_NS) += namespace.o
obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o checkpoint_shm.o
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o \
+ checkpoint_shm.o checkpoint_msg.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 9062dc6..11941d7 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -115,11 +115,11 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,

ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
-#if 0 /* NEXT FEW PATCHES */
if (ret < 0)
return ret;
ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+#if 0 /* NEXT FEW PATCHES */
if (ret < 0)
return ret;
ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
diff --git a/ipc/checkpoint_msg.c b/ipc/checkpoint_msg.c
new file mode 100644
index 0000000..b933c19
--- /dev/null
+++ b/ipc/checkpoint_msg.c
@@ -0,0 +1,364 @@
+/*
+ * Checkpoint/restart - dump state of sysvipc msg
+ *
+ * Copyright (C) 2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/msg.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/security.h>
+#include <linux/ipc_namespace.h>
+
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_msg_hdr(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_ipc_msg *h,
+ struct msg_queue *msq)
+{
+ int ret = 0;
+
+ ipc_lock_by_ptr(&msq->q_perm);
+
+ ret = checkpoint_fill_ipc_perms(&h->perms, &msq->q_perm);
+ if (ret < 0)
+ goto unlock;
+
+ h->q_stime = msq->q_stime;
+ h->q_rtime = msq->q_rtime;
+ h->q_ctime = msq->q_ctime;
+ h->q_cbytes = msq->q_cbytes;
+ h->q_qnum = msq->q_qnum;
+ h->q_qbytes = msq->q_qbytes;
+ h->q_lspid = msq->q_lspid;
+ h->q_lrpid = msq->q_lrpid;
+
+ unlock:
+ ipc_unlock(&msq->q_perm);
+ ckpt_debug("msg: lspid %d rspid %d qnum %lld qbytes %lld\n",
+ h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+ return ret;
+}
+
+static int checkpoint_msg_contents(struct ckpt_ctx *ctx, struct msg_msg *msg)
+{
+ struct ckpt_hdr_ipc_msg_msg *h;
+ struct msg_msgseg *seg;
+ int total, len;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+ if (!h)
+ return -ENOMEM;
+
+ h->m_type = msg->m_type;
+ h->m_ts = msg->m_ts;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ return ret;
+
+ total = msg->m_ts;
+ len = min(total, (int) DATALEN_MSG);
+ ret = ckpt_write_buffer(ctx, (msg + 1), len);
+ if (ret < 0)
+ return ret;
+
+ seg = msg->next;
+ total -= len;
+
+ while (total) {
+ len = min(total, (int) DATALEN_SEG);
+ ret = ckpt_write_buffer(ctx, (seg + 1), len);
+ if (ret < 0)
+ break;
+ seg = seg->next;
+ total -= len;
+ }
+
+ return ret;
+}
+
+static int checkpoint_msg_queue(struct ckpt_ctx *ctx, struct msg_queue *msq)
+{
+ struct list_head messages;
+ struct msg_msg *msg;
+ int ret = -EBUSY;
+
+ /*
+ * Scanning the msq requires the lock, but then we can't write
+ * data out from inside. Instead, we grab the lock, remove all
+ * messages to our own list, drop the lock, write the messages,
+ * and finally re-attach the them to the msq with the lock taken.
+ */
+ ipc_lock_by_ptr(&msq->q_perm);
+ if (!list_empty(&msq->q_receivers))
+ goto unlock;
+ if (!list_empty(&msq->q_senders))
+ goto unlock;
+ if (list_empty(&msq->q_messages))
+ goto unlock;
+ /* temporarily take out all messages */
+ INIT_LIST_HEAD(&messages);
+ list_splice_init(&msq->q_messages, &messages);
+ unlock:
+ ipc_unlock(&msq->q_perm);
+
+ list_for_each_entry(msg, &messages, m_list) {
+ ret = checkpoint_msg_contents(ctx, msg);
+ if (ret < 0)
+ break;
+ }
+
+ /* put all the messages back in */
+ ipc_lock_by_ptr(&msq->q_perm);
+ list_splice(&messages, &msq->q_messages);
+ ipc_unlock(&msq->q_perm);
+
+ return ret;
+}
+
+int checkpoint_ipc_msg(int id, void *p, void *data)
+{
+ struct ckpt_hdr_ipc_msg *h;
+ struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+ struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+ struct msg_queue *msq;
+ int ret;
+
+ msq = container_of(perm, struct msg_queue, q_perm);
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+ if (!h)
+ return -ENOMEM;
+
+ ret = fill_ipc_msg_hdr(ctx, h, msq);
+ if (ret < 0)
+ goto out;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ if (ret < 0)
+ goto out;
+
+ if (h->q_qnum)
+ ret = checkpoint_msg_queue(ctx, msq);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+
+/************************************************************************
+ * ipc restart
+ */
+
+static int load_ipc_msg_hdr(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_ipc_msg *h,
+ struct msg_queue *msq)
+{
+ int ret = 0;
+
+ ret = restore_load_ipc_perms(&h->perms, &msq->q_perm);
+ if (ret < 0)
+ return ret;
+
+ ckpt_debug("msq: lspid %d lrpid %d qnum %lld qbytes %lld\n",
+ h->q_lspid, h->q_lrpid, h->q_qnum, h->q_qbytes);
+
+ if (h->q_lspid < 0 || h->q_lrpid < 0)
+ return -EINVAL;
+
+ msq->q_stime = h->q_stime;
+ msq->q_rtime = h->q_rtime;
+ msq->q_ctime = h->q_ctime;
+ msq->q_lspid = h->q_lspid;
+ msq->q_lrpid = h->q_lrpid;
+
+ return 0;
+}
+
+static struct msg_msg *restore_msg_contents_one(struct ckpt_ctx *ctx, int *clen)
+{
+ struct ckpt_hdr_ipc_msg_msg *h;
+ struct msg_msg *msg = NULL;
+ struct msg_msgseg *seg, **pseg;
+ int total, len;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG_MSG);
+ if (IS_ERR(h))
+ return (struct msg_msg *) h;
+
+ ret = -EINVAL;
+ if (h->m_type < 1)
+ goto out;
+ if (h->m_ts > current->nsproxy->ipc_ns->msg_ctlmax)
+ goto out;
+
+ total = h->m_ts;
+ len = min(total, (int) DATALEN_MSG);
+ msg = kmalloc(sizeof(*msg) + len, GFP_KERNEL);
+ if (!msg) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ msg->next = NULL;
+ pseg = &msg->next;
+
+ ret = _ckpt_read_buffer(ctx, (msg + 1), len);
+ if (ret < 0)
+ goto out;
+
+ total -= len;
+ while (total) {
+ len = min(total, (int) DATALEN_SEG);
+ seg = kmalloc(sizeof(*seg) + len, GFP_KERNEL);
+ if (!seg) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ seg->next = NULL;
+ *pseg = seg;
+ pseg = &seg->next;
+
+ ret = _ckpt_read_buffer(ctx, (seg + 1), len);
+ if (ret < 0)
+ goto out;
+ total -= len;
+ }
+
+ msg->m_type = h->m_type;
+ msg->m_ts = h->m_ts;
+ *clen = h->m_ts;
+ ret = security_msg_msg_alloc(msg);
+ out:
+ if (ret < 0 && msg) {
+ free_msg(msg);
+ msg = ERR_PTR(ret);
+ }
+ ckpt_hdr_put(ctx, h);
+ return msg;
+}
+
+static inline void free_msg_list(struct list_head *queue)
+{
+ struct msg_msg *msg, *tmp;
+
+ list_for_each_entry_safe(msg, tmp, queue, m_list)
+ free_msg(msg);
+}
+
+static int restore_msg_contents(struct ckpt_ctx *ctx, struct list_head *queue,
+ unsigned long qnum, unsigned long *cbytes)
+{
+ struct msg_msg *msg;
+ int clen = 0;
+ int ret = 0;
+
+ INIT_LIST_HEAD(queue);
+
+ *cbytes = 0;
+ while (qnum--) {
+ msg = restore_msg_contents_one(ctx, &clen);
+ if (IS_ERR(msg))
+ goto fail;
+ list_add_tail(&msg->m_list, queue);
+ *cbytes += clen;
+ }
+ return 0;
+ fail:
+ ret = PTR_ERR(msg);
+ free_msg_list(queue);
+ return ret;
+}
+
+int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+ struct ckpt_hdr_ipc_msg *h;
+ struct kern_ipc_perm *perms;
+ struct msg_queue *msq;
+ struct ipc_ids *msg_ids = &ns->ids[IPC_MSG_IDS];
+ struct list_head messages;
+ unsigned long cbytes;
+ int msgflag;
+ int ret;
+
+ INIT_LIST_HEAD(&messages);
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_MSG);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ret = -EINVAL;
+ if (h->perms.id < 0)
+ goto out;
+
+ /* read queued messages into temporary queue */
+ ret = restore_msg_contents(ctx, &messages, h->q_qnum, &cbytes);
+ if (ret < 0)
+ goto out;
+
+ ret = -EINVAL;
+ if (h->q_cbytes != cbytes)
+ goto out;
+
+ /* restore the message queue */
+ msgflag = h->perms.mode | IPC_CREAT | IPC_EXCL;
+ ckpt_debug("msg: do_msgget key %d flag %#x id %d\n",
+ h->perms.key, msgflag, h->perms.id);
+ ret = do_msgget(ns, h->perms.key, msgflag, h->perms.id);
+ ckpt_debug("msg: do_msgget ret %d\n", ret);
+ if (ret < 0)
+ goto out;
+
+ down_write(&msg_ids->rw_mutex);
+
+ /* we are the sole owners/users of this ipc_ns, it can't go away */
+ perms = ipc_lock(msg_ids, h->perms.id);
+ BUG_ON(IS_ERR(perms)); /* ipc_ns is private to us */
+
+ msq = container_of(perms, struct msg_queue, q_perm);
+ BUG_ON(!list_empty(&msq->q_messages)); /* ipc_ns is private to us */
+
+ /* attach queued messages we read before */
+ list_splice_init(&messages, &msq->q_messages);
+
+ /* adjust msq and namespace statistics */
+ atomic_add(h->q_cbytes, &ns->msg_bytes);
+ atomic_add(h->q_qnum, &ns->msg_hdrs);
+ msq->q_cbytes = h->q_cbytes;
+ msq->q_qbytes = h->q_qbytes;
+ msq->q_qnum = h->q_qnum;
+
+ ret = load_ipc_msg_hdr(ctx, h, msq);
+
+ if (ret < 0) {
+ ckpt_debug("msq: need to remove (%d)\n", ret);
+ freeque(ns, perms);
+ } else
+ ipc_unlock(perms);
+ up_write(&msg_ids->rw_mutex);
+ out:
+ free_msg_list(&messages); /* no-op if all ok, else cleanup msgs */
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
diff --git a/ipc/msg.c b/ipc/msg.c
index 1db7c45..3559d53 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -72,7 +72,6 @@ struct msg_sender {

#define msg_unlock(msq) ipc_unlock(&(msq)->q_perm)

-static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
static int newque(struct ipc_namespace *, struct ipc_params *, int);
#ifdef CONFIG_PROC_FS
static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
@@ -278,7 +277,7 @@ static void expunge_all(struct msg_queue *msq, int res)
* msg_ids.rw_mutex (writer) and the spinlock for this message queue are held
* before freeque() is called. msg_ids.rw_mutex remains locked on exit.
*/
-static void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
{
struct list_head *tmp;
struct msg_queue *msq = container_of(ipcp, struct msg_queue, q_perm);
@@ -311,14 +310,11 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
return security_msg_queue_associate(msq, msgflg);
}

-int do_msgget(key_t key, int msgflg, int req_id)
+int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id)
{
- struct ipc_namespace *ns;
struct ipc_ops msg_ops;
struct ipc_params msg_params;

- ns = current->nsproxy->ipc_ns;
-
msg_ops.getnew = newque;
msg_ops.associate = msg_security;
msg_ops.more_checks = NULL;
@@ -331,7 +327,7 @@ int do_msgget(key_t key, int msgflg, int req_id)

SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
{
- return do_msgget(key, msgflg, -1);
+ return do_msgget(current->nsproxy->ipc_ns, key, msgflg, -1);
}

static inline unsigned long
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index f095ee2..e119243 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -36,14 +36,6 @@ struct ipc_namespace init_ipc_ns = {

atomic_t nr_ipc_ns = ATOMIC_INIT(1);

-struct msg_msgseg {
- struct msg_msgseg* next;
- /* the next part of the message follows immediately */
-};
-
-#define DATALEN_MSG (PAGE_SIZE-sizeof(struct msg_msg))
-#define DATALEN_SEG (PAGE_SIZE-sizeof(struct msg_msgseg))
-
struct msg_msg *load_msg(const void __user *src, int len)
{
struct msg_msg *msg;
diff --git a/ipc/util.h b/ipc/util.h
index 5f47593..a06a98d 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -141,6 +141,14 @@ extern void free_msg(struct msg_msg *msg);
extern struct msg_msg *load_msg(const void __user *src, int len);
extern int store_msg(void __user *dest, struct msg_msg *msg, int len);

+struct msg_msgseg {
+ struct msg_msgseg *next;
+ /* the next part of the message follows immediately */
+};
+
+#define DATALEN_MSG (PAGE_SIZE-sizeof(struct msg_msg))
+#define DATALEN_SEG (PAGE_SIZE-sizeof(struct msg_msgseg))
+
extern void recompute_msgmni(struct ipc_namespace *);

static inline int ipc_buildid(int id, int seq)
@@ -182,6 +190,8 @@ int do_shmget(struct ipc_namespace *ns, key_t key, size_t size, int shmflg,
int req_id);
void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);

+int do_msgget(struct ipc_namespace *ns, key_t key, int msgflg, int req_id);
+void freeque(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);

#ifdef CONFIG_CHECKPOINT
extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
@@ -191,6 +201,9 @@ extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,

extern int checkpoint_ipc_shm(int id, void *p, void *data);
extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
+
+extern int checkpoint_ipc_msg(int id, void *p, void *data);
+extern int restore_ipc_msg(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
#endif

#endif
--
1.6.0.4

2009-07-22 10:15:32

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 05/60] cgroup freezer: Fix buggy resume test for tasks frozen with cgroup freezer

From: Matt Helsley <[email protected]>

When the cgroup freezer is used to freeze tasks we do not want to thaw
those tasks during resume. Currently we test the cgroup freezer
state of the resuming tasks to see if the cgroup is FROZEN. If so
then we don't thaw the task. However, the FREEZING state also indicates
that the task should remain frozen.

This also avoids a problem pointed out by Oren Ladaan: the freezer state
transition from FREEZING to FROZEN is updated lazily when userspace reads
or writes the freezer.state file in the cgroup filesystem. This means that
resume will thaw tasks in cgroups which should be in the FROZEN state if
there is no read/write of the freezer.state file to trigger this
transition before suspend.

NOTE: Another "simple" solution would be to always update the cgroup
freezer state during resume. However it's a bad choice for several reasons:
Updating the cgroup freezer state is somewhat expensive because it requires
walking all the tasks in the cgroup and checking if they are each frozen.
Worse, this could easily make resume run in N^2 time where N is the number
of tasks in the cgroup. Finally, updating the freezer state from this code
path requires trickier locking because of the way locks must be ordered.

Instead of updating the freezer state we rely on the fact that lazy
updates only manage the transition from FREEZING to FROZEN. We know that
a cgroup with the FREEZING state may actually be FROZEN so test for that
state too. This makes sense in the resume path even for partially-frozen
cgroups -- those that really are FREEZING but not FROZEN.

Reported-by: Oren Ladaan <[email protected]>
Signed-off-by: Matt Helsley <[email protected]>
Cc: Cedric Le Goater <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: [email protected]

Seems like a candidate for -stable.
---
include/linux/freezer.h | 7 +++++--
kernel/cgroup_freezer.c | 9 ++++++---
kernel/power/process.c | 2 +-
3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 5a361f8..da7e52b 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -64,9 +64,12 @@ extern bool freeze_task(struct task_struct *p, bool sig_only);
extern void cancel_freezing(struct task_struct *p);

#ifdef CONFIG_CGROUP_FREEZER
-extern int cgroup_frozen(struct task_struct *task);
+extern int cgroup_freezing_or_frozen(struct task_struct *task);
#else /* !CONFIG_CGROUP_FREEZER */
-static inline int cgroup_frozen(struct task_struct *task) { return 0; }
+static inline int cgroup_freezing_or_frozen(struct task_struct *task)
+{
+ return 0;
+}
#endif /* !CONFIG_CGROUP_FREEZER */

/*
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index fb249e2..765e2c1 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -47,17 +47,20 @@ static inline struct freezer *task_freezer(struct task_struct *task)
struct freezer, css);
}

-int cgroup_frozen(struct task_struct *task)
+int cgroup_freezing_or_frozen(struct task_struct *task)
{
struct freezer *freezer;
enum freezer_state state;

task_lock(task);
freezer = task_freezer(task);
- state = freezer->state;
+ if (!freezer->css.cgroup->parent)
+ state = CGROUP_THAWED; /* root cgroup can't be frozen */
+ else
+ state = freezer->state;
task_unlock(task);

- return state == CGROUP_FROZEN;
+ return (state == CGROUP_FREEZING) || (state == CGROUP_FROZEN);
}

/*
diff --git a/kernel/power/process.c b/kernel/power/process.c
index da2072d..3728d4c 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -138,7 +138,7 @@ static void thaw_tasks(bool nosig_only)
if (nosig_only && should_send_signal(p))
continue;

- if (cgroup_frozen(p))
+ if (cgroup_freezing_or_frozen(p))
continue;

thaw_process(p);
--
1.6.0.4

2009-07-22 10:14:56

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 42/60] c/r: restore anonymous- and file-mapped- shared memory

The bulk of the work is in ckpt_read_vma(), which has been refactored:
the part that create the suitable 'struct file *' for the mapping is
now larger and moved to a separate function. What's left is to read
the VMA description, get the file pointer, create the mapping, and
proceed to read the contents in.

Both anonymous shared VMAs that have been read earlier (as indicated
by a look up to objhash) and file-mapped shared VMAs are skipped.
Anonymous shared VMAs seen for the first time have their contents
read in directly to the backing inode, as indexed by the page numbers
(as opposed to virtual addresses).

Changelog[v14]:
- Introduce patch

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/memory.c | 66 ++++++++++++++++++++++++++++++++-----------
include/linux/checkpoint.h | 6 ++++
include/linux/mm.h | 2 +
mm/filemap.c | 13 ++++++++-
mm/shmem.c | 49 ++++++++++++++++++++++++++++++++
5 files changed, 118 insertions(+), 18 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index a1d1eca..77234cd 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -840,13 +840,36 @@ static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
return 0;
}

+static struct page *bring_private_page(unsigned long addr)
+{
+ struct page *page;
+ int ret;
+
+ ret = get_user_pages(current, current->mm, addr, 1, 1, 1, &page, NULL);
+ if (ret < 0)
+ page = ERR_PTR(ret);
+ return page;
+}
+
+static struct page *bring_shared_page(unsigned long idx, struct inode *ino)
+{
+ struct page *page = NULL;
+ int ret;
+
+ ret = shmem_getpage(ino, idx, &page, SGP_WRITE, NULL);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ if (page)
+ unlock_page(page);
+ return page;
+}
+
/**
* read_pages_contents - read in data of pages in page-array chain
* @ctx - restart context
*/
-static int read_pages_contents(struct ckpt_ctx *ctx)
+static int read_pages_contents(struct ckpt_ctx *ctx, struct inode *inode)
{
- struct mm_struct *mm = current->mm;
struct ckpt_pgarr *pgarr;
unsigned long *vaddrs;
char *buf;
@@ -856,17 +879,22 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
if (!buf)
return -ENOMEM;

- down_read(&mm->mmap_sem);
+ down_read(&current->mm->mmap_sem);
list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
vaddrs = pgarr->vaddrs;
for (i = 0; i < pgarr->nr_used; i++) {
struct page *page;

_ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
- ret = get_user_pages(current, mm, vaddrs[i],
- 1, 1, 1, &page, NULL);
- if (ret < 0)
+ if (inode)
+ page = bring_shared_page(vaddrs[i], inode);
+ else
+ page = bring_private_page(vaddrs[i]);
+
+ if (IS_ERR(page)) {
+ ret = PTR_ERR(page);
goto out;
+ }

ret = restore_read_page(ctx, page, buf);
page_cache_release(page);
@@ -877,14 +905,15 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
}

out:
- up_read(&mm->mmap_sem);
+ up_read(&current->mm->mmap_sem);
kfree(buf);
return 0;
}

/**
- * restore_memory_contents - restore contents of a VMA with private memory
+ * restore_memory_contents - restore contents of a memory region
* @ctx - restart context
+ * @inode - backing inode
*
* Reads a header that specifies how many pages will follow, then reads
* a list of virtual addresses into ctx->pgarr_list page-array chain,
@@ -892,7 +921,7 @@ static int read_pages_contents(struct ckpt_ctx *ctx)
* these steps until reaching a header specifying "0" pages, which marks
* the end of the contents.
*/
-static int restore_memory_contents(struct ckpt_ctx *ctx)
+int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode)
{
struct ckpt_hdr_pgarr *h;
unsigned long nr_pages;
@@ -919,7 +948,7 @@ static int restore_memory_contents(struct ckpt_ctx *ctx)
ret = read_pages_vaddrs(ctx, nr_pages);
if (ret < 0)
break;
- ret = read_pages_contents(ctx);
+ ret = read_pages_contents(ctx, inode);
if (ret < 0)
break;
pgarr_reset_all(ctx);
@@ -977,9 +1006,9 @@ static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
* @file - file to map (NULL for anonymous)
* @h - vma header data
*/
-static unsigned long generic_vma_restore(struct mm_struct *mm,
- struct file *file,
- struct ckpt_hdr_vma *h)
+unsigned long generic_vma_restore(struct mm_struct *mm,
+ struct file *file,
+ struct ckpt_hdr_vma *h)
{
unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
unsigned long addr;
@@ -1026,7 +1055,7 @@ int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
if (IS_ERR((void *) addr))
return PTR_ERR((void *) addr);

- return restore_memory_contents(ctx);
+ return restore_memory_contents(ctx, NULL);
}

/**
@@ -1086,16 +1115,19 @@ static struct restore_vma_ops restore_vma_ops[] = {
{
.vma_name = "ANON SHARED",
.vma_type = CKPT_VMA_SHM_ANON,
+ .restore = shmem_restore,
},
/* anonymous shared (skipped) */
{
.vma_name = "ANON SHARED (skip)",
.vma_type = CKPT_VMA_SHM_ANON_SKIP,
+ .restore = shmem_restore,
},
/* file-mapped shared */
{
.vma_name = "FILE SHARED",
.vma_type = CKPT_VMA_SHM_FILE,
+ .restore = filemap_restore,
},
};

@@ -1114,15 +1146,15 @@ static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
if (IS_ERR(h))
return PTR_ERR(h);

- ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+ ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d inoref %d\n",
(unsigned long) h->vm_start, (unsigned long) h->vm_end,
(unsigned long) h->vm_flags, (int) h->vma_type,
- (int) h->vma_objref);
+ (int) h->vma_objref, (int) h->ino_objref);

ret = -EINVAL;
if (h->vm_end < h->vm_start)
goto out;
- if (h->vma_objref < 0)
+ if (h->vma_objref < 0 || h->ino_objref < 0)
goto out;
if (h->vma_type >= CKPT_VMA_MAX)
goto out;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 54cc4b0..5920453 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -170,9 +170,15 @@ extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
extern void *restore_mm(struct ckpt_ctx *ctx);

+extern unsigned long generic_vma_restore(struct mm_struct *mm,
+ struct file *file,
+ struct ckpt_hdr_vma *h);
+
extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
struct file *file, struct ckpt_hdr_vma *h);

+extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
+

#define CKPT_VMA_NOT_SUPPORTED \
(VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP | \
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6c2c3dd..5f341ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1198,6 +1198,8 @@ extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
struct ckpt_hdr_vma *hh);
extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
struct ckpt_hdr_vma *hh);
+extern int shmem_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+ struct ckpt_hdr_vma *hh);
#endif

/* readahead.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index a07bb3d..0c4906f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1711,17 +1711,28 @@ int filemap_restore(struct ckpt_ctx *ctx,
struct ckpt_hdr_vma *h)
{
struct file *file;
+ unsigned long addr;
int ret;

if (h->vma_type == CKPT_VMA_FILE &&
(h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
return -EINVAL;
+ if (h->vma_type == CKPT_VMA_SHM_FILE &&
+ !(h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
+ return -EINVAL;

file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
if (IS_ERR(file))
return PTR_ERR(file);

- ret = private_vma_restore(ctx, mm, file, h);
+ if (h->vma_type == CKPT_VMA_FILE) {
+ /* private mapped file */
+ ret = private_vma_restore(ctx, mm, file, h);
+ } else {
+ /* shared mapped file */
+ addr = generic_vma_restore(mm, file, h);
+ ret = (IS_ERR((void *) addr) ? PTR_ERR((void *) addr) : 0);
+ }
return ret;
}
#endif /* CONFIG_CHECKPOINT */
diff --git a/mm/shmem.c b/mm/shmem.c
index 808e14a..9334810 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2406,6 +2406,55 @@ static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)

return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
}
+
+int shmem_restore(struct ckpt_ctx *ctx,
+ struct mm_struct *mm, struct ckpt_hdr_vma *h)
+{
+ unsigned long addr;
+ struct file *file;
+ int ret = 0;
+
+ file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+ if (PTR_ERR(file) == -EINVAL)
+ file = NULL;
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ /* if file is NULL, this is the premiere - create and insert */
+ if (!file) {
+ if (h->vma_type != CKPT_VMA_SHM_ANON)
+ return -EINVAL;
+ /*
+ * in theory could pass NULL to mmap and let it create
+ * the file. But, if 'shm_size != vm_end - vm_start',
+ * or if 'vm_pgoff != 0', then the vma reflects only a
+ * portion of the shm object and we need to "manually"
+ * create the full shm object.
+ */
+ file = shmem_file_setup("/dev/zero", h->ino_size, h->vm_flags);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ ret = ckpt_obj_insert(ctx, file, h->ino_objref, CKPT_OBJ_FILE);
+ if (ret < 0)
+ goto out;
+ } else {
+ if (h->vma_type != CKPT_VMA_SHM_ANON_SKIP)
+ return -EINVAL;
+ /* Already need fput() for the file above; keep path simple */
+ get_file(file);
+ }
+
+ addr = generic_vma_restore(mm, file, h);
+ if (IS_ERR((void *) addr))
+ return PTR_ERR((void *) addr);
+
+ if (h->vma_type == CKPT_VMA_SHM_ANON)
+ ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ out:
+ fput(file);
+ return ret;
+}
+
#endif /* CONFIG_CHECKPOINT */

static void init_once(void *foo)
--
1.6.0.4

2009-07-22 10:14:32

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 34/60] c/r: restore open file descriptors

For each fd read 'struct ckpt_hdr_file_desc' and lookup objref in the
hash table; If not found in the hash table, (first occurence), read in
'struct ckpt_hdr_file', create a new file and register in the hash.
Otherwise attach the file pointer from the hash as an FD.

Changelog[v17]:
- Validate f_mode after restore against saved f_mode
- Fail if f_flags have O_CREAT|O_EXCL|O_NOCTTY|O_TRUN
- Reorder patch (move earlier in series)
- Handle shared files_struct objects
Changelog[v14]:
- Introduce a per file-type restore() callback
- Revert change to pr_debug(), back to ckpt_debug()
- Rename: restore_files() => restore_fd_table()
- Rename: ckpt_read_fd_data() => restore_file()
- Check whether calls to ckpt_hbuf_get() fail
- Discard field 'hh->parent'
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/files.c | 310 ++++++++++++++++++++++++++++++++++++++++++++
checkpoint/objhash.c | 2 +
checkpoint/process.c | 20 +++
include/linux/checkpoint.h | 7 +
4 files changed, 339 insertions(+), 0 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index 5ff9925..88d7adf 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -16,6 +16,8 @@
#include <linux/sched.h>
#include <linux/file.h>
#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

@@ -380,3 +382,311 @@ int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t)

return ret;
}
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ */
+struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags)
+{
+ struct ckpt_hdr *h;
+ struct file *file;
+ char *fname;
+
+ /* prevent bad input from doing bad things */
+ if (flags & (O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC))
+ return ERR_PTR(-EINVAL);
+
+ h = ckpt_read_buf_type(ctx, PATH_MAX, CKPT_HDR_FILE_NAME);
+ if (IS_ERR(h))
+ return (struct file *) h;
+ fname = (char *) (h + 1);
+ ckpt_debug("fname '%s' flags %#x\n", fname, flags);
+
+ file = filp_open(fname, flags, 0);
+ ckpt_hdr_put(ctx, h);
+
+ return file;
+}
+
+static int close_all_fds(struct files_struct *files)
+{
+ int *fdtable;
+ int nfds;
+
+ nfds = scan_fds(files, &fdtable);
+ if (nfds < 0)
+ return nfds;
+ while (nfds--)
+ sys_close(fdtable[nfds]);
+ kfree(fdtable);
+ return 0;
+}
+
+/**
+ * attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int attach_file(struct file *file)
+{
+ int fd = get_unused_fd_flags(0);
+
+ if (fd >= 0) {
+ get_file(file);
+ fsnotify_open(file->f_path.dentry);
+ fd_install(fd, file);
+ }
+ return fd;
+}
+
+#define CKPT_SETFL_MASK \
+ (O_APPEND | O_NONBLOCK | O_NDELAY | FASYNC | O_DIRECT | O_NOATIME)
+
+int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+ struct ckpt_hdr_file *h)
+{
+ fmode_t new_mode = (__force fmode_t) file->f_mode;
+ fmode_t saved_mode = (__force fmode_t) h->f_mode;
+ int ret;
+
+ /* FIX: need to restore uid, gid, owner etc */
+
+ /* safe to set 1st arg (fd) to 0, as command is F_SETFL */
+ ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * Normally f_mode is set by open, and modified only via
+ * fcntl(), so its value now should match that at checkpoint.
+ * However, a file may be downgraded from (read-)write to
+ * read-only, e.g:
+ * - mark_files_ro() unsets FMODE_WRITE
+ * - nfs4_file_downgrade() too, and also sert FMODE_READ
+ * Validate the new f_mode against saved f_mode, allowing:
+ * - new with FMODE_WRITE, saved without FMODE_WRITE
+ * - new without FMODE_READ, saved with FMODE_READ
+ */
+ if ((new_mode & FMODE_WRITE) && !(saved_mode & FMODE_WRITE)) {
+ new_mode &= ~FMODE_WRITE;
+ if (!(new_mode & FMODE_READ) && (saved_mode & FMODE_READ))
+ new_mode |= FMODE_READ;
+ }
+ /* finally, at this point new mode should match saved mode */
+ if (new_mode ^ saved_mode)
+ return -EINVAL;
+
+ if (file->f_mode & FMODE_LSEEK)
+ ret = vfs_llseek(file, h->f_pos, SEEK_SET);
+
+ return ret;
+}
+
+static struct file *generic_file_restore(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_file *ptr)
+{
+ struct file *file;
+ int ret;
+
+ if (ptr->h.type != CKPT_HDR_FILE ||
+ ptr->h.len != sizeof(*ptr) || ptr->f_type != CKPT_FILE_GENERIC)
+ return ERR_PTR(-EINVAL);
+
+ file = restore_open_fname(ctx, ptr->f_flags);
+ if (IS_ERR(file))
+ return file;
+
+ ret = restore_file_common(ctx, file, ptr);
+ if (ret < 0) {
+ fput(file);
+ file = ERR_PTR(ret);
+ }
+ return file;
+}
+
+struct restore_file_ops {
+ char *file_name;
+ enum file_type file_type;
+ struct file * (*restore) (struct ckpt_ctx *ctx,
+ struct ckpt_hdr_file *ptr);
+};
+
+static struct restore_file_ops restore_file_ops[] = {
+ /* ignored file */
+ {
+ .file_name = "IGNORE",
+ .file_type = CKPT_FILE_IGNORE,
+ .restore = NULL,
+ },
+ /* regular file/directory */
+ {
+ .file_name = "GENERIC",
+ .file_type = CKPT_FILE_GENERIC,
+ .restore = generic_file_restore,
+ },
+};
+
+static struct file *do_restore_file(struct ckpt_ctx *ctx)
+{
+ struct restore_file_ops *ops;
+ struct ckpt_hdr_file *h;
+ struct file *file = ERR_PTR(-EINVAL);
+
+ /*
+ * All 'struct ckpt_hdr_file_...' begin with ckpt_hdr_file,
+ * but the actual object depends on the file type. The length
+ * should never be more than page.
+ */
+ h = ckpt_read_buf_type(ctx, PAGE_SIZE, CKPT_HDR_FILE);
+ if (IS_ERR(h))
+ return (struct file *) h;
+ ckpt_debug("flags %#x mode %#x type %d\n",
+ h->f_flags, h->f_mode, h->f_type);
+
+ if (h->f_type >= CKPT_FILE_MAX)
+ goto out;
+
+ ops = &restore_file_ops[h->f_type];
+ BUG_ON(ops->file_type != h->f_type);
+
+ if (ops->restore)
+ file = ops->restore(ctx, h);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return file;
+}
+
+/* restore callback for file pointer */
+void *restore_file(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_file(ctx);
+}
+
+/**
+ * ckpt_read_file_desc - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * use it; otherwise calls restore_file to restore the file too.
+ */
+static int restore_file_desc(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_file_desc *h;
+ struct file *file;
+ int newfd, ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_DESC);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+ ckpt_debug("ref %d fd %d c.o.e %d\n",
+ h->fd_objref, h->fd_descriptor, h->fd_close_on_exec);
+
+ ret = -EINVAL;
+ if (h->fd_objref <= 0 || h->fd_descriptor < 0)
+ goto out;
+
+ file = ckpt_obj_fetch(ctx, h->fd_objref, CKPT_OBJ_FILE);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ goto out;
+ }
+
+ newfd = attach_file(file);
+ if (newfd < 0) {
+ ret = newfd;
+ goto out;
+ }
+
+ ckpt_debug("newfd got %d wanted %d\n", newfd, h->fd_descriptor);
+
+ /* reposition if newfd isn't desired fd */
+ if (newfd != h->fd_descriptor) {
+ ret = sys_dup2(newfd, h->fd_descriptor);
+ if (ret < 0)
+ goto out;
+ sys_close(newfd);
+ }
+
+ if (h->fd_close_on_exec)
+ set_close_on_exec(h->fd_descriptor, 1);
+
+ ret = 0;
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+/* restore callback for file table */
+static struct files_struct *do_restore_file_table(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_file_table *h;
+ struct files_struct *files;
+ int i, ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_FILE_TABLE);
+ if (IS_ERR(h))
+ return (struct files_struct *) h;
+
+ ckpt_debug("nfds %d\n", h->fdt_nfds);
+
+ ret = -EMFILE;
+ if (h->fdt_nfds < 0 || h->fdt_nfds > sysctl_nr_open)
+ goto out;
+
+ /*
+ * We assume that restarting tasks, as created in user-space,
+ * have distinct files_struct objects each. If not, we need to
+ * call dup_fd() to make sure we don't overwrite an already
+ * restored one.
+ */
+
+ /* point of no return -- close all file descriptors */
+ ret = close_all_fds(current->files);
+ if (ret < 0)
+ goto out;
+
+ for (i = 0; i < h->fdt_nfds; i++) {
+ ret = restore_file_desc(ctx);
+ if (ret < 0)
+ break;
+ }
+ out:
+ ckpt_hdr_put(ctx, h);
+ if (!ret) {
+ files = current->files;
+ atomic_inc(&files->count);
+ } else {
+ files = ERR_PTR(ret);
+ }
+ return files;
+}
+
+void *restore_file_table(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_file_table(ctx);
+}
+
+int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref)
+{
+ struct files_struct *files;
+
+ files = ckpt_obj_fetch(ctx, files_objref, CKPT_OBJ_FILE_TABLE);
+ if (IS_ERR(files))
+ return PTR_ERR(files);
+
+ if (files != current->files) {
+ task_lock(current);
+ put_files_struct(current->files);
+ current->files = files;
+ task_unlock(current);
+ atomic_inc(&files->count);
+ }
+
+ return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index d77e8c4..fae6bfc 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -122,6 +122,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.ref_grab = obj_file_table_grab,
.ref_users = obj_file_table_users,
.checkpoint = checkpoint_file_table,
+ .restore = restore_file_table,
},
/* file object */
{
@@ -131,6 +132,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.ref_grab = obj_file_grab,
.ref_users = obj_file_users,
.checkpoint = checkpoint_file,
+ .restore = restore_file,
},
};

diff --git a/checkpoint/process.c b/checkpoint/process.c
index 61caa01..8cbbace 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -343,6 +343,22 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
return ret;
}

+static int restore_task_objs(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_task_objs *h;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ret = restore_obj_file_table(ctx, h->files_objref);
+ ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
int restore_restart_block(struct ckpt_ctx *ctx)
{
struct ckpt_hdr_restart_block *h;
@@ -462,6 +478,10 @@ int restore_task(struct ckpt_ctx *ctx)
if (ret)
goto out;

+ ret = restore_task_objs(ctx);
+ ckpt_debug("objs %d\n", ret);
+ if (ret < 0)
+ goto out;
ret = restore_thread(ctx);
ckpt_debug("thread %d\n", ret);
if (ret < 0)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 67845dc..3f28a06 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -128,15 +128,22 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
struct task_struct *t);
+extern int restore_obj_file_table(struct ckpt_ctx *ctx, int files_objref);
extern int checkpoint_file_table(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file_table(struct ckpt_ctx *ctx);

/* files */
extern int checkpoint_fname(struct ckpt_ctx *ctx,
struct path *path, struct path *root);
+extern struct file *restore_open_fname(struct ckpt_ctx *ctx, int flags);
+
extern int checkpoint_file(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_file(struct ckpt_ctx *ctx);

extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
struct ckpt_hdr_file *h);
+extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
+ struct ckpt_hdr_file *h);


/* debugging flags */
--
1.6.0.4

2009-07-22 10:16:28

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 13/60] pids 3/7: Add target_pid parameter to alloc_pidmap()

From: Sukadev Bhattiprolu <[email protected]>

With support for setting a specific pid number for a process,
alloc_pidmap() will need a paramter a 'target_pid' parameter.

Changelog[v2]:
- (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
kernel/pid.c | 28 ++++++++++++++++++++++++++--
1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 9c678ce..29cf119 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,11 +147,35 @@ static int alloc_pidmap_page(struct pidmap *map)
return 0;
}

-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int set_pidmap(struct pid_namespace *pid_ns, int pid)
+{
+ int offset;
+ struct pidmap *map;
+
+ if (pid <= 0 || pid >= pid_max)
+ return -EINVAL;
+
+ offset = pid & BITS_PER_PAGE_MASK;
+ map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+
+ if (alloc_pidmap_page(map))
+ return -ENOMEM;
+
+ if (test_and_set_bit(offset, map->page))
+ return -EBUSY;
+
+ atomic_dec(&map->nr_free);
+ return pid;
+}
+
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
{
int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;

+ if (target_pid)
+ return set_pidmap(pid_ns, target_pid);
+
pid = last + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
@@ -270,7 +294,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)

tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp);
+ nr = alloc_pidmap(tmp, 0);
if (nr < 0)
goto out_free;

--
1.6.0.4

2009-07-22 10:13:30

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 59/60] c/r: restore file->f_cred

From: Serge E. Hallyn <[email protected]>

Restore a file's f_cred. This is set to the cred of the task doing
the open, so often it will be the same as that of the restarted task.

Signed-off-by: Serge E. Hallyn <[email protected]>
---
checkpoint/files.c | 16 ++++++++++++++--
include/linux/checkpoint_hdr.h | 2 +-
2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/checkpoint/files.c b/checkpoint/files.c
index c247d44..bcdc774 100644
--- a/checkpoint/files.c
+++ b/checkpoint/files.c
@@ -150,7 +150,11 @@ int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
h->f_pos = file->f_pos;
h->f_version = file->f_version;

- /* FIX: need also file->uid, file->gid, file->f_owner, etc */
+ h->f_credref = checkpoint_obj(ctx, file->f_cred, CKPT_OBJ_CRED);
+ if (h->f_credref < 0)
+ return h->f_credref;
+
+ /* FIX: need also file->f_owner, etc */

return 0;
}
@@ -454,8 +458,16 @@ int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
fmode_t new_mode = (__force fmode_t) file->f_mode;
fmode_t saved_mode = (__force fmode_t) h->f_mode;
int ret;
+ struct cred *cred;
+
+ /* FIX: need to restore owner etc */

- /* FIX: need to restore uid, gid, owner etc */
+ /* restore the cred */
+ cred = ckpt_obj_fetch(ctx, h->f_credref, CKPT_OBJ_CRED);
+ if (IS_ERR(cred))
+ return PTR_ERR(cred);
+ put_cred(file->f_cred);
+ file->f_cred = get_cred(cred);

/* safe to set 1st arg (fd) to 0, as command is F_SETFL */
ret = vfs_fcntl(0, F_SETFL, h->f_flags & CKPT_SETFL_MASK, file);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ca02d9d..0863a07 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -329,7 +329,7 @@ struct ckpt_hdr_file {
__u32 f_type;
__u32 f_mode;
__u32 f_flags;
- __u32 _padding;
+ __s32 f_credref;
__u64 f_pos;
__u64 f_version;
} __attribute__((aligned(8)));
--
1.6.0.4

2009-07-22 10:10:32

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 19/60] c/r: documentation

Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and and checkpoint image format.

Changelog[v16]:
- Update documentation
- Unify into readme.txt and usage.txt
Changelog[v14]:
- Discard the 'h.parent' field
- New image format (shared objects appear before they are referenced
unless they are compound)
Changelog[v8]:
- Split into multiple files in Documentation/checkpoint/...
- Extend documentation, fix typos and comments from feedback

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
Documentation/checkpoint/ckpt.c | 32 ++++
Documentation/checkpoint/readme.txt | 347 +++++++++++++++++++++++++++++++++++
Documentation/checkpoint/rstr.c | 20 ++
Documentation/checkpoint/self.c | 57 ++++++
Documentation/checkpoint/test.c | 48 +++++
Documentation/checkpoint/usage.txt | 193 +++++++++++++++++++
checkpoint/sys.c | 2 +-
7 files changed, 698 insertions(+), 1 deletions(-)
create mode 100644 Documentation/checkpoint/ckpt.c
create mode 100644 Documentation/checkpoint/readme.txt
create mode 100644 Documentation/checkpoint/rstr.c
create mode 100644 Documentation/checkpoint/self.c
create mode 100644 Documentation/checkpoint/test.c
create mode 100644 Documentation/checkpoint/usage.txt

diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/ckpt.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+ pid_t pid;
+ int ret;
+
+ if (argc != 2) {
+ printf("usage: ckpt PID\n");
+ exit(1);
+ }
+
+ pid = atoi(argv[1]);
+ if (pid <= 0) {
+ printf("invalid pid\n");
+ exit(1);
+ }
+
+ ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+ if (ret < 0)
+ perror("checkpoint");
+ else
+ printf("checkpoint id %d\n", ret);
+
+ return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..e84dc39
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,347 @@
+
+ Checkpoint-Restart support in the Linux kernel
+ ==========================================================
+
+Copyright (C) 2008 Oren Laadan
+
+Author: Oren Laadan <[email protected]>
+
+License: The GNU Free Documentation License, Version 1.2
+ (dual licensed under the GPL v2)
+
+Reviewers: Serge Hallyn <[email protected]>
+ Dave Hansen <[email protected]>
+
+
+Introduction
+============
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+ instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+ intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+ hosts.
+
+* Improved service availability and administration: by migrating
+ applications before host maintenance so that they continue to run
+ with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+ any previous checkpoint.
+
+Compared to hypervisor approaches, application C/R is more lightweight
+since it need only save the state associated with applications, while
+operating system data structures (e.g. buffer cache, drivers state
+and the like) are uninteresting.
+
+
+Overall design
+==============
+
+Checkpoint and restart are done in the kernel as much as possible.
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). They both operate on a process tree (hierarchy),
+either a whole container or a subtree of a container.
+
+Checkpointing entire containers ensures that there are no dependencies
+on anything outside the container, which guarantees that a matching
+restart will succeed (assuming that the file system state remains
+consistent). However, it requires that users will always run the tasks
+that they wish to checkpoint inside containers. This is ideal for,
+e.g., private virtual servers and the like.
+
+In contrast, when checkpointing a subtree of a container it is up to
+the user to ensure that dependencies either don't exist or can be
+safely ignored. This is useful, for instance, for HPC scenarios or
+even a user that would like to periodically checkpoint a long-running
+batch job.
+
+An additional system call, a la madvise(), is planned, so that tasks
+can advise the kernel how to handle specific resources. For instance,
+a task could ask to skip a memory area at checkpoint to save space,
+or to use a preset file descriptor at restart instead of restoring it
+from the checkpoint image. It will provide the flexibility that is
+particularly useful to address the needs of a diverse crowd of users
+and use-cases.
+
+Syscall sys_checkpoint() is given a pid that indicates the top of the
+hierarchy, a file descriptor to store the image, and flags. The code
+serializes internal user- and kernel-state and writes it out to the
+file descriptor. The resulting image is stream-able. The processes are
+expected to be frozen for the duration of the checkpoint.
+
+In general, a checkpoint consists of 5 steps:
+1. Pre-dump
+2. Freeze the container/subtree
+3. Save tasks' and kernel state <-- sys_checkpoint()
+4. Thaw (or kill) the container/subtree
+5. Post-dump
+
+Step 3 is done by calling sys_checkpoint(). Steps 1 and 5 are an
+optimization to reduce application downtime. In particular, "pre-dump"
+works before freezing the container, e.g. the pre-copy for live
+migration, and "post-dump" works after the container resumes
+execution, e.g. write-back the data to secondary storage.
+
+The kernel exports a relatively opaque 'blob' of data to userspace
+which can then be handed to the new kernel at restart time. The
+'blob' contains data and state of select portions of kernel structures
+such as VMAs and mm_structs, as well as copies of the actual memory
+that the tasks use. Any changes in this blob's format between kernel
+revisions can be handled by an in-userspace conversion program.
+
+To restart, userspace first create a process hierarchy that matches
+that of the checkpoint, and each task calls sys_restart(). The syscall
+reads the saved kernel state from a file descriptor, and re-creates
+the resources that the tasks need to resume execution. The restart
+code is executed by each task that is restored in the new hierarchy to
+reconstruct its own state.
+
+In general, a restart consists of 3 steps:
+1. Create hierarchy
+2. Restore tasks' and kernel state <-- sys_restart()
+3. Resume userspace (or freeze tasks)
+
+Because the process hierarchy, during restart in created in userspace,
+the restarting tasks have the flexibility to prepare before calling
+sys_restart().
+
+
+Checkpoint image format
+=======================
+
+The checkpoint image format is built of records that consist of a
+pre-header identifying its contents, followed by a payload. This
+format allow userspace tools to easily parse and skip through the
+image without requiring intimate knowledge of the data. It will also
+be handy to enable parallel checkpointing in the future where multiple
+threads interleave data from multiple processes into a single stream.
+
+The pre-header is defined by 'struct ckpt_hdr' as follows: @type
+identifies the type of the payload, @len tells its length in bytes
+including the pre-header.
+
+struct ckpt_hdr {
+ __s32 type;
+ __s32 len;
+};
+
+The pre-header must be the first component in all other headers. For
+instance, the task data is saved in 'struct ckpt_hdr_task', which
+looks something like this:
+
+struct ckpt_hdr_task {
+ struct ckpt_hdr h;
+ __u32 pid;
+ ...
+};
+
+THE IMAGE FORMAT IS EXPECTED TO CHANGE over time as more features are
+supported, or as existing features change in the kernel and require to
+adjust their representation. Any such changes will be be handled by
+in-userspace conversion tools.
+
+The general format of the checkpoint image is as follows:
+1. Image header
+2. Task hierarchy
+3. Tasks' state
+4. Image trailer
+
+The image always begins with a general header that holds a magic
+number, an architecture identifier (little endian format), a format
+version number (@rev), followed by information about the kernel
+(currently version and UTS data). It also holds the time of the
+checkpoint and the flags given to sys_checkpoint(). This header is
+followed by an arch-specific header.
+
+The task hierarchy comes next so that userspace tools can read it
+early (even from a stream) and re-create the restarting tasks. This is
+basically an array of all checkpointed tasks, and their relationships
+(parent, siblings, threads, etc).
+
+Then the state of all tasks is saved, in the order that they appear in
+the tasks array above. For each state, we save data like task_struct,
+namespaces, open files, memory layout, memory contents, cpu state,
+signals and signal handlers, etc. For resources that are shared among
+multiple processes, we first checkpoint said resource (and only once),
+and in the task data we give a reference to it. More about shared
+resources below.
+
+Finally, the image always ends with a trailer that holds a (different)
+magic number, serving for sanity check.
+
+
+Shared objects
+==============
+
+Many resources may be shared by multiple tasks (e.g. file descriptors,
+memory address space, etc), or even have multiple references from
+other resources (e.g. a single inode that represents two ends of a
+pipe).
+
+Shared objects are tracked using a hash table (objhash) to ensure that
+they are only checkpointed or restored once. To handle a shared
+object, it is first looked up in the hash table, to determine if is
+the first encounter or a recurring appearance. The hash table itself
+is not saved as part of the checkpoint image: it is constructed
+dynamically during both checkpoint and restart, and discarded at the
+end of the operation.
+
+During checkpoint, when a shared object is encountered for the first
+time, it is inserted to the hash table, indexed by its kernel address.
+It is assigned an identifier (@objref) in order of appearance, and
+then its state if saved. Subsequent lookups of that object in the hash
+will yield that entry, in which case only the @objref is saved, as
+opposed the entire state of the object.
+
+During restart, shared objects are indexed by their @objref as given
+during the checkpoint. On the first appearance of each shared object,
+a new resource will be created and its state restored from the image.
+Then the object is added to the hash table. Subsequent lookups of the
+same unique identifier in the hash table will yield that entry, and
+then the existing object instance is reused instead of creating
+a new one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+Shared objects are thus saved when they are first seen, and _before_
+the parent object that uses them. Therefore by the time the parent
+objects needs them, they should already be in the objhash. The one
+exception is when more than a single shared resource will be restarted
+at once (e.g. like the two ends of a pipe, or all the namespaces in an
+nsproxy). In this case the parent object is dumped first followed by
+the individual sub-resources).
+
+The checkpoint image is stream-able, meaning that restarting from it
+may not require lseek(). This is enforced at checkpoint time, by
+carefully selecting the order of shared objects, to respect the rule
+that an object is always saved before the objects that refers to it.
+
+
+Memory contents format
+======================
+
+The memory contents of a given memory address space (->mm) is dumped
+as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
+This header details the vma properties, and a reference to a file
+(if file backed) or an inode (or shared memory) object.
+
+The vma header is followed by the actual contents - but only those
+pages that need to be saved, i.e. dirty pages. They are written in
+chunks of data, where each chunks contains a header that indicates
+that number of pages in the chunk, followed by an array of virtual
+addresses and then an array of actual page contents. The last chunk
+holds zero pages.
+
+To illustrate this, consider a single simple task with two vmas: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The memory dump will look like this:
+
+ ckpt_hdr + ckpt_hdr_vma
+ ckpt_hdr_pgarr (nr_pages = 2)
+ addr1, addr2
+ page1, page2
+ ckpt_hdr_pgarr (nr_pages = 0)
+ ckpt_hdr + ckpt_hdr_vma
+ ckpt_hdr_pgarr (nr_pages = 3)
+ addr3, addr4, addr5
+ page3, page4, page5
+ ckpt_hdr_pgarr (nr_pages = 0)
+
+
+Error handling
+==============
+
+Both checkpoint and restart operations may fail due to a variety of
+reasons. Using a simple, single return value from the system call is
+insufficient to report the reason of a failure.
+
+Checkpoint - to provide informative status report upon failure, the
+checkpoint image may contain one (or more) error objects, 'struct
+ckpt_hdr_err'. An error objects consists of a mandatory pre-header
+followed by a null character ('\0'), and then a string that describes
+the error. By default, if an error occurs, this will be the last
+object written to the checkpoint image.
+
+Upon failure, the caller can examine the image (e.g. with 'ckptinfo')
+and extract the detailed error message. The leading '\0' is useful if
+one wants to seek back from the end of the checkpoint image, instead
+of parsing the entire image separately.
+
+Restart - to be defined.
+
+
+Security
+========
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+read mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image. When restoration of credentials
+becomes supported, then definitely the ability of the task that calls
+sys_restore() to setresuid/setresgid to those values must be checked.
+
+Keeping the restart procedure to operate within the limits of the
+caller's credentials means that there various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+However, this can be controlled with a sysctl-variable.
+
+
+Kernel interfaces
+=================
+
+* To checkpoint a vma, the 'struct vm_operations_struct' needs to
+provide a method ->checkpoint:
+ int checkpoint(struct ckpt_ctx *, struct vma_struct *)
+Restart requires a matching (exported) restore:
+ int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
+
+* To checkpoint a file, the 'struct file_operations' needs to provide
+a method ->checkpoint:
+ int checkpoint(struct ckpt_ctx *, struct file *)
+Restart requires a matching (exported) restore:
+ int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
+For most file systems, generic_file_{checkpoint,restore}() can be
+used.
diff --git a/Documentation/checkpoint/rstr.c b/Documentation/checkpoint/rstr.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/rstr.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+ pid_t pid = getpid();
+ int ret;
+
+ ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+ if (ret < 0)
+ perror("restart");
+
+ printf("should not reach here !\n");
+
+ return 0;
+}
+
diff --git a/Documentation/checkpoint/self.c b/Documentation/checkpoint/self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+ pid_t pid = getpid();
+ FILE *file;
+ int i, ret;
+ float a;
+
+ close(0);
+ close(2);
+
+ unlink(OUTFILE);
+ file = fopen(OUTFILE, "w+");
+ if (!file) {
+ perror("open");
+ exit(1);
+ }
+ if (dup2(0, 2) < 0) {
+ perror("dup2");
+ exit(1);
+ }
+
+ a = sqrt(2.53 * (getpid() / 1.21));
+
+ fprintf(file, "hello, world (%.2f)!\n", a);
+ fflush(file);
+
+ for (i = 0; i < 1000; i++) {
+ sleep(1);
+ /* make the fpu work -> a = a + i/10 */
+ a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+ fprintf(file, "count %d (%.2f)!\n", i, a);
+ fflush(file);
+
+ if (i == 2) {
+ ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+ if (ret < 0) {
+ fprintf(file, "ckpt: %s\n", strerror(errno));
+ exit(2);
+ }
+ fprintf(file, "checkpoint ret: %d\n", ret);
+ fflush(file);
+ }
+ }
+
+ return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+ FILE *file;
+ float a;
+ int i;
+
+ close(0);
+ close(1);
+ close(2);
+
+ unlink(OUTFILE);
+ file = fopen(OUTFILE, "w+");
+ if (!file) {
+ perror("open");
+ exit(1);
+ }
+ if (dup2(0, 2) < 0) {
+ perror("dup2");
+ exit(1);
+ }
+
+ a = sqrt(2.53 * (getpid() / 1.21));
+
+ fprintf(file, "hello, world (%.2f)!\n", a);
+ fflush(file);
+
+ for (i = 0; i < 1000; i++) {
+ sleep(1);
+ /* make the fpu work -> a = a + i/10 */
+ a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+ fprintf(file, "count %d (%.2f)!\n", i, a);
+ fflush(file);
+ }
+
+ fprintf(file, "world, hello (%.2f) !\n", a);
+ fflush(file);
+
+ return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..ed34765
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,193 @@
+
+ How to use Checkpoint-Restart
+ =========================================
+
+
+API
+===
+
+The API consists of two new system calls:
+
+* int checkpoint(pid_t pid, int fd, unsigned long flag);
+
+ Checkpoint a (sub-)container whose root task is identified by @pid,
+ to the open file indicated by @fd. @flags may be on or more of:
+ - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
+ (other value are not allowed).
+
+ Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
+ it returns from a restart, and -1 if an error occurs. The ckptid will
+ uniquely identify a checkpoint image, for as long as the checkpoint
+ is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
+ partial checkpoint, residing in kernel memory).
+
+* int sys_restart(pid_t pid, int fd, unsigned long flags);
+
+ Restart a process hierarchy from a checkpoint image that is read from
+ the blob stored in the file indicated by @fd. The @flags' will have
+ future meaning (must be 0 for now). @pid indicates the root of the
+ hierarchy as seen in the coordinator's pid-namespace, and is expected
+ to be a child of the coordinator. (Note that this argument may mean
+ 'ckptid' to identify an in-kernel checkpoint image, with some @flags
+ in the future).
+
+ Returns: -1 if an error occurs, 0 on success when restarting from a
+ "self" checkpoint, and return value of system call at the time of the
+ checkpoint when restarting from an "external" checkpoint.
+
+ TODO: upon successful "external" restart, the container will end up
+ in a frozen state.
+
+
+Sysctl/proc
+===========
+
+/proc/sys/kernel/ckpt_unpriv_allowed [default = 1]
+ controls whether c/r operation is allowed for unprivileged users
+
+
+Operation
+=========
+
+The granularity of a checkpoint usually is a process hierarchy. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
+which does not refer to a container's init task, then sys_checkpoint()
+would return -EINVAL.
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases, if
+there are other tasks possible sharing state with the container, they
+must not modify it during the operation. It is the responsibility of
+the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+User tools
+==========
+
+* ckpt: a tool to perform a checkpoint of a container/subtree
+* mktree: a tool to restart a container/subtree
+* ckptinfo: a tool to examine a checkpoint image
+
+It is best to use the dedicated user tools for checkpoint and restart.
+
+If you insist, then here is a code snippet that illustrates how a
+checkpoint is initiated by a process inside a container - the logic is
+similar to fork():
+ ...
+ ckptid = checkpoint(1, ...);
+ switch (crid) {
+ case -1:
+ perror("checkpoint failed");
+ break;
+ default:
+ fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);
+ /* proceed with execution after checkpoint */
+ ...
+ break;
+ case 0:
+ fprintf(stderr, "returned after restart\n");
+ /* proceed with action required following a restart */
+ ...
+ break;
+ }
+ ...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+ ...
+ if (restart(pid, ...) < 0)
+ perror("restart failed");
+ /* only get here if restart failed */
+ ...
+
+Note, that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships of
+the task with other tasks, or any shared resources. It is useful for
+application that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+
+You may find the following sample programs useful:
+
+* ckpt.c: accepts a 'pid' argument and checkpoint that task to stdout
+* rstr.c: restarts a checkpoint image from stdin
+* self.c: a simple test program doing self-checkpoint
+* test.c: a simple test program to checkpoint
+
+
+"External" checkpoint
+=====================
+
+To do "external" checkpoint, you need to first freeze that other task
+either using the freezer cgroup.
+
+Restart does not preserve the original PID yet, (because we haven't
+solved yet the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new names space, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+ $ ./test &
+ [1] 3493
+
+ $ kill -STOP 3493
+ $ ./ckpt 3493 > ckpt.image
+
+ $ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+ $ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+ $ kill -CONT 3493
+
+ $ ./rstr < ckpt.image
+Now compare the output of the two output files.
+
+
+"Self checkpoint
+================
+
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+ $ ./self > self.image &
+ [1] 3512
+
+ $ sleep 3
+ $ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+ $ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+ $ cat /tmp/cr-rest.out
+ hello, world (85.46)!
+ count 0 (85.46)!
+ count 1 (85.56)!
+ count 2 (85.76)!
+ count 3 (86.46)!
+
+ $ sed -i 's/count/xxxx/g' /tmp/cr-rest.out
+
+ $ ./rstr < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 50c3cd8..79936cc 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -1,7 +1,7 @@
/*
* Generic container checkpoint-restart
*
- * Copyright (C) 2008 Oren Laadan
+ * Copyright (C) 2008-2009 Oren Laadan
*
* This file is subject to the terms and conditions of the GNU General Public
* License. See the file COPYING in the main directory of the Linux
--
1.6.0.4

2009-07-22 10:17:51

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 49/60] c/r: save and restore sysvipc namespace basics

Add the helpers to checkpoint and restore the contents of 'struct
kern_ipc_perm'. Add header structures for ipc state. Put place-holders
to save and restore ipc state.

Save and restores the common state (parameters) of ipc namespace.

Generic code to iterate through the objects of sysvipc shared memory,
message queues and semaphores. The logic to save and restore the state
of these objects will be added in the next few patches.

Right now, we return -EPERM if the user calling sys_restart() isn't
allowed to create an object with the checkpointed uid. We may prefer
to simply use the caller's uid in that case - but that could lead to
subtle userspace bugs? Unsure, so going for the stricter behavior.

TODO: restore kern_ipc_perms->security.

Changelog[v17]:
- Fix include: use checkpoint.h not checkpoint_hdr.h
- Collect nsproxy->ipc_ns
- Restore objects in the right namespace
- If !CONFIG_IPC_NS only restore objects, not global settings
- Don't overwrite global ipc-ns if !CONFIG_IPC_NS
- Reset the checkpointed uid and gid info on ipc objects
- Fix compilation with CONFIG_SYSVIPC=n
Changelog [Dan Smith <[email protected]>]
- Fix compilation with CONFIG_SYSVIPC=n
- Update to match UTS changes

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/checkpoint.c | 2 -
checkpoint/objhash.c | 28 ++++
include/linux/checkpoint.h | 13 ++
include/linux/checkpoint_hdr.h | 54 +++++++
include/linux/checkpoint_types.h | 1 +
init/Kconfig | 6 +
ipc/Makefile | 2 +-
ipc/checkpoint.c | 317 ++++++++++++++++++++++++++++++++++++++
ipc/namespace.c | 2 +-
ipc/util.h | 10 ++
kernel/nsproxy.c | 22 ++-
11 files changed, 449 insertions(+), 8 deletions(-)
create mode 100644 ipc/checkpoint.c

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 39ee917..e4f971e 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -331,8 +331,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)

rcu_read_lock();
nsproxy = task_nsproxy(t);
- if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
- ret = -EPERM;
if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns)
ret = -EPERM;
if (nsproxy->pid_ns != ctx->root_nsproxy->pid_ns)
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index caa856c..29c7a04 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -15,6 +15,8 @@
#include <linux/hash.h>
#include <linux/file.h>
#include <linux/fdtable.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

@@ -164,6 +166,22 @@ static int obj_uts_ns_users(void *ptr)
return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
}

+static int obj_ipc_ns_grab(void *ptr)
+{
+ get_ipc_ns((struct ipc_namespace *) ptr);
+ return 0;
+}
+
+static void obj_ipc_ns_drop(void *ptr)
+{
+ put_ipc_ns((struct ipc_namespace *) ptr);
+}
+
+static int obj_ipc_ns_users(void *ptr)
+{
+ return atomic_read(&((struct ipc_namespace *) ptr)->count);
+}
+
static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -231,6 +249,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.checkpoint = checkpoint_uts_ns,
.restore = restore_uts_ns,
},
+ /* ipc_ns object */
+ {
+ .obj_name = "IPC_NS",
+ .obj_type = CKPT_OBJ_IPC_NS,
+ .ref_drop = obj_ipc_ns_drop,
+ .ref_grab = obj_ipc_ns_grab,
+ .ref_users = obj_ipc_ns_users,
+ .checkpoint = checkpoint_ipc_ns,
+ .restore = restore_ipc_ns,
+ },
};


diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 0085ea8..9d6b0cc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -22,6 +22,9 @@
#ifdef __KERNEL__
#ifdef CONFIG_CHECKPOINT

+#include <linux/sched.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
#include <linux/checkpoint_types.h>
#include <linux/checkpoint_hdr.h>

@@ -136,6 +139,15 @@ extern void *restore_ns(struct ckpt_ctx *ctx);
extern int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr);
extern void *restore_uts_ns(struct ckpt_ctx *ctx);

+/* ipc-ns */
+#ifdef CONFIG_SYSVIPC
+extern int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_ipc_ns(struct ckpt_ctx *ctx);
+#else
+#define checkpoint_ipc_ns checkpoint_bad
+#define restore_ipc_ns restore_bad
+#endif /* CONFIG_SYSVIPC */
+
/* file table */
extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
@@ -204,6 +216,7 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
#define CKPT_DFILE 0x10 /* files and filesystem */
#define CKPT_DMEM 0x20 /* memory state */
#define CKPT_DPAGE 0x40 /* memory pages */
+#define CKPT_DIPC 0x80 /* sysvipc */

#define CKPT_DDEFAULT 0xffff /* default debug level */

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 18ab78f..3159750 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -59,6 +59,7 @@ enum {
CKPT_HDR_CPU,
CKPT_HDR_NS,
CKPT_HDR_UTS_NS,
+ CKPT_HDR_IPC_NS,

/* 201-299: reserved for arch-dependent */

@@ -73,6 +74,11 @@ enum {
CKPT_HDR_PGARR,
CKPT_HDR_MM_CONTEXT,

+ CKPT_HDR_IPC = 501,
+ CKPT_HDR_IPC_SHM,
+ CKPT_HDR_IPC_MSG,
+ CKPT_HDR_IPC_SEM,
+
CKPT_HDR_TAIL = 9001,

CKPT_HDR_ERROR = 9999,
@@ -99,6 +105,7 @@ enum obj_type {
CKPT_OBJ_MM,
CKPT_OBJ_NS,
CKPT_OBJ_UTS_NS,
+ CKPT_OBJ_IPC_NS,
CKPT_OBJ_MAX
};

@@ -190,6 +197,7 @@ struct ckpt_hdr_task_ns {
struct ckpt_hdr_ns {
struct ckpt_hdr h;
__s32 uts_objref;
+ __u32 ipc_objref;
} __attribute__((aligned(8)));

/* task's shared resources */
@@ -326,4 +334,50 @@ struct ckpt_hdr_pgarr {
} __attribute__((aligned(8)));


+/* ipc commons */
+struct ckpt_hdr_ipcns {
+ struct ckpt_hdr h;
+ __u64 shm_ctlmax;
+ __u64 shm_ctlall;
+ __s32 shm_ctlmni;
+
+ __s32 msg_ctlmax;
+ __s32 msg_ctlmnb;
+ __s32 msg_ctlmni;
+
+ __s32 sem_ctl_msl;
+ __s32 sem_ctl_mns;
+ __s32 sem_ctl_opm;
+ __s32 sem_ctl_mni;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc {
+ struct ckpt_hdr h;
+ __u32 ipc_type;
+ __u32 ipc_count;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_ipc_perms {
+ __s32 id;
+ __u32 key;
+ __u32 uid;
+ __u32 gid;
+ __u32 cuid;
+ __u32 cgid;
+ __u32 mode;
+ __u32 _padding;
+ __u64 seq;
+} __attribute__((aligned(8)));
+
+
+#define CKPT_TST_OVERFLOW_16(a, b) \
+ ((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
+
+#define CKPT_TST_OVERFLOW_32(a, b) \
+ ((sizeof(a) > sizeof(b)) && ((a) > INT_MAX))
+
+#define CKPT_TST_OVERFLOW_64(a, b) \
+ ((sizeof(a) > sizeof(b)) && ((a) > LONG_MAX))
+
+
#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 0a9c58b..9ffa492 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -22,6 +22,7 @@

struct ckpt_stats {
int uts_ns;
+ int ipc_ns;
};

struct ckpt_ctx {
diff --git a/init/Kconfig b/init/Kconfig
index a083161..21a7ca2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -191,6 +191,12 @@ config SYSVIPC
section 6.4 of the Linux Programmer's Guide, available from
<http://www.tldp.org/guides.html>.

+config SYSVIPC_CHECKPOINT
+ bool
+ depends on SYSVIPC
+ depends on CHECKPOINT
+ default y
+
config SYSVIPC_SYSCTL
bool
depends on SYSVIPC
diff --git a/ipc/Makefile b/ipc/Makefile
index 4e1955e..b747127 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
obj-$(CONFIG_IPC_NS) += namespace.o
obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
new file mode 100644
index 0000000..4eb1a97
--- /dev/null
+++ b/ipc/checkpoint.c
@@ -0,0 +1,317 @@
+/*
+ * Checkpoint logic and helpers
+ *
+ * Copyright (C) 2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DIPC
+
+#include <linux/ipc.h>
+#include <linux/msg.h>
+#include <linux/sched.h>
+#include <linux/ipc_namespace.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "util.h"
+
+/* for ckpt_debug */
+static char *ipc_ind_to_str[] = { "sem", "msg", "shm" };
+
+#define shm_ids(ns) ((ns)->ids[IPC_SHM_IDS])
+#define msg_ids(ns) ((ns)->ids[IPC_MSG_IDS])
+#define sem_ids(ns) ((ns)->ids[IPC_SEM_IDS])
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+ struct kern_ipc_perm *perm)
+{
+ if (ipcperms(perm, S_IROTH))
+ return -EACCES;
+
+ h->id = perm->id;
+ h->key = perm->key;
+ h->uid = perm->uid;
+ h->gid = perm->gid;
+ h->cuid = perm->cuid;
+ h->cgid = perm->cgid;
+ h->mode = perm->mode & S_IRWXUGO;
+ h->seq = perm->seq;
+
+ return 0;
+}
+
+static int checkpoint_ipc_any(struct ckpt_ctx *ctx,
+ struct ipc_namespace *ipc_ns,
+ int ipc_ind, int ipc_type,
+ int (*func)(int id, void *p, void *data))
+{
+ struct ckpt_hdr_ipc *h;
+ struct ipc_ids *ipc_ids = &ipc_ns->ids[ipc_ind];
+ int ret = -ENOMEM;
+
+ down_read(&ipc_ids->rw_mutex);
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+ if (!h)
+ goto out;
+
+ h->ipc_type = ipc_type;
+ h->ipc_count = ipc_ids->in_use;
+ ckpt_debug("ipc-%s count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ goto out;
+
+ ret = idr_for_each(&ipc_ids->ipcs_idr, func, ctx);
+ ckpt_debug("ipc-%s ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+ out:
+ up_read(&ipc_ids->rw_mutex);
+ return ret;
+}
+
+static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
+ struct ipc_namespace *ipc_ns)
+{
+ struct ckpt_hdr_ipcns *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+ if (!h)
+ return -ENOMEM;
+
+ down_read(&shm_ids(ipc_ns).rw_mutex);
+ h->shm_ctlmax = ipc_ns->shm_ctlmax;
+ h->shm_ctlall = ipc_ns->shm_ctlall;
+ h->shm_ctlmni = ipc_ns->shm_ctlmni;
+ up_read(&shm_ids(ipc_ns).rw_mutex);
+
+ down_read(&msg_ids(ipc_ns).rw_mutex);
+ h->msg_ctlmax = ipc_ns->msg_ctlmax;
+ h->msg_ctlmnb = ipc_ns->msg_ctlmnb;
+ h->msg_ctlmni = ipc_ns->msg_ctlmni;
+ up_read(&msg_ids(ipc_ns).rw_mutex);
+
+ down_read(&sem_ids(ipc_ns).rw_mutex);
+ h->sem_ctl_msl = ipc_ns->sem_ctls[0];
+ h->sem_ctl_mns = ipc_ns->sem_ctls[1];
+ h->sem_ctl_opm = ipc_ns->sem_ctls[2];
+ h->sem_ctl_mni = ipc_ns->sem_ctls[3];
+ up_read(&sem_ids(ipc_ns).rw_mutex);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ return ret;
+
+#if 0 /* NEXT FEW PATCHES */
+ ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+ CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+ if (ret < 0)
+ return ret;
+ ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+ CKPT_HDR_IPC_MSG, checkpoint_ipc_msg);
+ if (ret < 0)
+ return ret;
+ ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+ CKPT_HDR_IPC_SEM, checkpoint_ipc_sem);
+#endif
+ return ret;
+}
+
+int checkpoint_ipc_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_ipc_ns(ctx, (struct ipc_namespace *) ptr);
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/*
+ * check whether current task may create ipc object with
+ * checkpointed uids and gids.
+ * Return 1 if ok, 0 if not.
+ */
+static int validate_created_perms(struct ckpt_hdr_ipc_perms *h)
+{
+ const struct cred *cred = current_cred();
+ uid_t uid = cred->uid, euid = cred->euid;
+
+ /* actually I don't know - is CAP_IPC_OWNER the right one? */
+ if (((h->uid != uid && h->uid == euid) ||
+ (h->cuid != uid && h->cuid != euid) ||
+ !in_group_p(h->cgid) ||
+ !in_group_p(h->gid)) &&
+ !capable(CAP_IPC_OWNER))
+ return 0;
+ return 1;
+}
+
+int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+ struct kern_ipc_perm *perm)
+{
+ if (h->id < 0)
+ return -EINVAL;
+ if (CKPT_TST_OVERFLOW_16(h->uid, perm->uid) ||
+ CKPT_TST_OVERFLOW_16(h->gid, perm->gid) ||
+ CKPT_TST_OVERFLOW_16(h->cuid, perm->cuid) ||
+ CKPT_TST_OVERFLOW_16(h->cgid, perm->cgid) ||
+ CKPT_TST_OVERFLOW_16(h->mode, perm->mode))
+ return -EINVAL;
+ if (h->seq >= USHORT_MAX)
+ return -EINVAL;
+ if (h->mode & ~S_IRWXUGO)
+ return -EINVAL;
+
+ /* FIX: verify the ->mode field makes sense */
+
+ perm->id = h->id;
+ perm->key = h->key;
+
+ if (!validate_created_perms(h))
+ return -EPERM;
+ perm->uid = h->uid;
+ perm->gid = h->gid;
+ perm->cuid = h->cuid;
+ perm->cgid = h->cgid;
+ perm->mode = h->mode;
+ perm->seq = h->seq;
+ /*
+ * Todo: restore perm->security.
+ * At the moment it gets set by security_x_alloc() called through
+ * ipcget()->ipcget_public()->ops-.getnew (->nequeue for instance)
+ * We will want to ask the LSM to consider resetting the
+ * checkpointed ->security, based on current_security(),
+ * the checkpointed ->security, and the checkpoint file context.
+ */
+
+ return 0;
+}
+
+static int restore_ipc_any(struct ckpt_ctx *ctx, struct ipc_namespace *ipc_ns,
+ int ipc_ind, int ipc_type,
+ int (*func)(struct ckpt_ctx *ctx,
+ struct ipc_namespace *ns))
+{
+ struct ckpt_hdr_ipc *h;
+ int n, ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ckpt_debug("ipc-%s: count %d\n", ipc_ind_to_str[ipc_ind], h->ipc_count);
+
+ ret = -EINVAL;
+ if (h->ipc_type != ipc_type)
+ goto out;
+
+ ret = 0;
+ for (n = 0; n < h->ipc_count; n++) {
+ ret = (*func)(ctx, ipc_ns);
+ if (ret < 0)
+ goto out;
+ }
+ out:
+ ckpt_debug("ipc-%s: ret %d\n", ipc_ind_to_str[ipc_ind], ret);
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+ struct ipc_namespace *ipc_ns = NULL;
+ struct ckpt_hdr_ipcns *h;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_NS);
+ if (IS_ERR(h))
+ return ERR_PTR(PTR_ERR(h));
+
+ ret = -EINVAL;
+ if (h->shm_ctlmax < 0 || h->shm_ctlall < 0 || h->shm_ctlmni < 0)
+ goto out;
+ if (h->msg_ctlmax < 0 || h->msg_ctlmnb < 0 || h->msg_ctlmni < 0)
+ goto out;
+ if (h->sem_ctl_msl < 0 || h->sem_ctl_mns < 0 ||
+ h->sem_ctl_opm < 0 || h->sem_ctl_mni < 0)
+ goto out;
+
+ /*
+ * If !CONFIG_IPC_NS, do not restore the global IPC state, as
+ * it is used by other processes. It is ok to try to restore
+ * the {shm,msg,sem} objects: in the worst case the requested
+ * identifiers will be in use.
+ */
+#ifdef CONFIG_IPC_NS
+ ret = -ENOMEM;
+ ipc_ns = create_ipc_ns();
+ if (!ipc_ns)
+ goto out;
+
+ down_read(&shm_ids(ipc_ns).rw_mutex);
+ ipc_ns->shm_ctlmax = h->shm_ctlmax;
+ ipc_ns->shm_ctlall = h->shm_ctlall;
+ ipc_ns->shm_ctlmni = h->shm_ctlmni;
+ up_read(&shm_ids(ipc_ns).rw_mutex);
+
+ down_read(&msg_ids(ipc_ns).rw_mutex);
+ ipc_ns->msg_ctlmax = h->msg_ctlmax;
+ ipc_ns->msg_ctlmnb = h->msg_ctlmnb;
+ ipc_ns->msg_ctlmni = h->msg_ctlmni;
+ up_read(&msg_ids(ipc_ns).rw_mutex);
+
+ down_read(&sem_ids(ipc_ns).rw_mutex);
+ ipc_ns->sem_ctls[0] = h->sem_ctl_msl;
+ ipc_ns->sem_ctls[1] = h->sem_ctl_mns;
+ ipc_ns->sem_ctls[2] = h->sem_ctl_opm;
+ ipc_ns->sem_ctls[3] = h->sem_ctl_mni;
+ up_read(&sem_ids(ipc_ns).rw_mutex);
+#else
+ ret = -EEXIST;
+ /* complain if image contains multiple namespaces */
+ if (ctx->stats.ipc_ns)
+ goto out;
+ ipc_ns = current->nsproxy->ipc_ns;
+ get_ipc_ns(ipc_ns);
+#endif
+
+#if 0 /* NEXT FEW PATCHES */
+ ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
+ CKPT_HDR_IPC_SHM, restore_ipc_shm);
+ if (ret < 0)
+ goto out;
+ ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
+ CKPT_HDR_IPC_MSG, restore_ipc_msg);
+ if (ret < 0)
+ goto out;
+ ret = restore_ipc_any(ctx, ipc_ns, IPC_SEM_IDS,
+ CKPT_HDR_IPC_SEM, restore_ipc_sem);
+#endif
+ if (ret < 0)
+ goto out;
+
+ ctx->stats.ipc_ns++;
+ out:
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0) {
+ put_ipc_ns(ipc_ns);
+ ipc_ns = ERR_PTR(ret);
+ }
+ return ipc_ns;
+}
+
+void *restore_ipc_ns(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_ipc_ns(ctx);
+}
diff --git a/ipc/namespace.c b/ipc/namespace.c
index a1094ff..8e5ea32 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -14,7 +14,7 @@

#include "util.h"

-static struct ipc_namespace *create_ipc_ns(void)
+struct ipc_namespace *create_ipc_ns(void)
{
struct ipc_namespace *ns;
int err;
diff --git a/ipc/util.h b/ipc/util.h
index 159a73c..8ae1f8e 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -12,6 +12,7 @@

#include <linux/unistd.h>
#include <linux/err.h>
+#include <linux/checkpoint.h>

#define SEQ_MULTIPLIER (IPCMNI)

@@ -175,4 +176,13 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));

+struct ipc_namespace *create_ipc_ns(void);
+
+#ifdef CONFIG_CHECKPOINT
+extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+ struct kern_ipc_perm *perm);
+extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
+ struct kern_ipc_perm *perm);
+#endif
+
#endif
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 4f48a68..fddc724 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -248,6 +248,7 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
if (ret < 0)
goto out;
+ ret = ckpt_obj_collect(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);

/* TODO: collect other namespaces here */
out:
@@ -268,6 +269,11 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
if (ret <= 0)
goto out;
h->uts_objref = ret;
+ ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS);
+ if (ret < 0)
+ goto out;
+ h->ipc_objref = ret;
+
/* TODO: Write other namespaces here */

ret = ckpt_write_obj(ctx, &h->h);
@@ -287,6 +293,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
struct ckpt_hdr_ns *h;
struct nsproxy *nsproxy = NULL;
struct uts_namespace *uts_ns;
+ struct ipc_namespace *ipc_ns;
int ret;

h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
@@ -294,7 +301,8 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
return (struct nsproxy *) h;

ret = -EINVAL;
- if (h->uts_objref <= 0)
+ if (h->uts_objref <= 0 ||
+ h->ipc_objref <= 0)
goto out;

uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
@@ -302,8 +310,13 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
ret = PTR_ERR(uts_ns);
goto out;
}
+ ipc_ns = ckpt_obj_fetch(ctx, h->ipc_objref, CKPT_OBJ_IPC_NS);
+ if (IS_ERR(ipc_ns)) {
+ ret = PTR_ERR(ipc_ns);
+ goto out;
+ }

-#if defined(COFNIG_UTS_NS)
+#if defined(COFNIG_UTS_NS) || defined(CONFIG_IPC_NS)
ret = -ENOMEM;
nsproxy = create_nsproxy();
if (!nsproxy)
@@ -311,9 +324,9 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)

get_uts_ns(uts_ns);
nsproxy->uts_ns = uts_ns;
-
- get_ipc_ns(current->nsproxy->ipc_ns);
+ get_ipc_ns(ipc_ns);
nsproxy->ipc_ns = ipc_ns;
+
get_pid_ns(current->nsproxy->pid_ns);
nsproxy->pid_ns = current->nsproxy->pid_ns;
get_mnt_ns(current->nsproxy->mnt_ns);
@@ -325,6 +338,7 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
get_nsproxy(nsproxy);

BUG_ON(nsproxy->uts_ns != uts_ns);
+ BUG_ON(nsproxy->ipc_ns != ipc_ns);
#endif

/* TODO: add more namespaces here */
--
1.6.0.4

2009-07-22 10:17:31

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 39/60] c/r: restore memory address space (private memory)

Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the vma state and contents.
Call do_mmap_pgoffset() for each vma and then read in the data.

Changelog[v17]:
- Restore mm->{flags,def_flags,saved_auxv}
- Fix bogus warning in do_restore_mm()
Changelog[v16]:
- Restore mm->exe_file
Changelog[v14]:
- Introduce per vma-type restore() function
- Merge restart code into same file as checkpoint (memory.c)
- Compare saved 'vdso' field of mm_context with current value
- Check whether calls to ckpt_hbuf_get() fail
- Discard field 'h->parent'
- Revert change to pr_debug(), back to ckpt_debug()
Changelog[v13]:
- Avoid access to hh->vma_type after the header is freed
- Test for no vma's in exit_mmap() before calling unmap_vma() (or it
may crash if restart fails after having removed all vma's)
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
- Introduce ckpt_ctx_checkpoint() for checkpoint-specific ctx setup
Changelog[v7]:
- Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(even though it's not really needed)
Changelog[v5]:
- Improve memory restore code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Memory restore now maps user pages explicitly to copy data into them,
instead of reading directly to user space; got rid of mprotect_fixup()
Changelog[v4]:
- Use standard list_... for ckpt_pgarr


Signed-off-by: Oren Laadan <[email protected]>
---
arch/x86/include/asm/ldt.h | 7 +
arch/x86/mm/checkpoint.c | 64 ++++++
checkpoint/memory.c | 472 ++++++++++++++++++++++++++++++++++++++++
checkpoint/objhash.c | 1 +
checkpoint/process.c | 3 +
checkpoint/restart.c | 4 +
fs/exec.c | 2 +-
include/linux/checkpoint.h | 7 +
include/linux/checkpoint_hdr.h | 2 +-
include/linux/mm.h | 13 +
mm/filemap.c | 19 ++
mm/mmap.c | 23 ++-
12 files changed, 614 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h
index 46727eb..f2845f9 100644
--- a/arch/x86/include/asm/ldt.h
+++ b/arch/x86/include/asm/ldt.h
@@ -37,4 +37,11 @@ struct user_desc {
#define MODIFY_LDT_CONTENTS_CODE 2

#endif /* !__ASSEMBLY__ */
+
+#ifdef __KERNEL__
+#include <linux/linkage.h>
+asmlinkage int sys_modify_ldt(int func, void __user *ptr,
+ unsigned long bytecount);
+#endif
+
#endif /* _ASM_X86_LDT_H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index fa26d60..68432c8 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -13,6 +13,7 @@

#include <asm/desc.h>
#include <asm/i387.h>
+#include <asm/elf.h>

#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
@@ -563,3 +564,66 @@ int restore_read_header_arch(struct ckpt_ctx *ctx)
ckpt_hdr_put(ctx, h);
return ret;
}
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+ struct ckpt_hdr_mm_context *h;
+ unsigned int n;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ckpt_debug("nldt %d vdso %#lx (%p)\n",
+ h->nldt, (unsigned long) h->vdso, mm->context.vdso);
+
+ ret = -EINVAL;
+ if (h->vdso != (unsigned long) mm->context.vdso)
+ goto out;
+ if (h->ldt_entry_size != LDT_ENTRY_SIZE)
+ goto out;
+
+ ret = _ckpt_read_obj_type(ctx, NULL,
+ h->nldt * LDT_ENTRY_SIZE,
+ CKPT_HDR_MM_CONTEXT_LDT);
+ if (ret < 0)
+ goto out;
+
+ /*
+ * to utilize the syscall modify_ldt() we first convert the data
+ * in the checkpoint image from 'struct desc_struct' to 'struct
+ * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+ */
+ for (n = 0; n < h->nldt; n++) {
+ struct user_desc info;
+ struct desc_struct desc;
+ mm_segment_t old_fs;
+
+ ret = ckpt_kread(ctx, &desc, LDT_ENTRY_SIZE);
+ if (ret < 0)
+ break;
+
+ info.entry_number = n;
+ info.base_addr = desc.base0 | (desc.base1 << 16);
+ info.limit = desc.limit0;
+ info.seg_32bit = desc.d;
+ info.contents = desc.type >> 2;
+ info.read_exec_only = (desc.type >> 1) ^ 1;
+ info.limit_in_pages = desc.g;
+ info.seg_not_present = desc.p ^ 1;
+ info.useable = desc.avl;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+ sizeof(info));
+ set_fs(old_fs);
+
+ if (ret < 0)
+ break;
+ }
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 68c31b6..e11784e 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -15,6 +15,9 @@
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/file.h>
+#include <linux/err.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/mm_types.h>
#include <linux/proc_fs.h>
@@ -686,3 +689,472 @@ int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t)

return ret;
}
+
+/***********************************************************************
+ * Restart
+ *
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+/**
+ * read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of address to read
+ */
+static int read_pages_vaddrs(struct ckpt_ctx *ctx, unsigned long nr_pages)
+{
+ struct ckpt_pgarr *pgarr;
+ unsigned long *vaddrp;
+ int nr, ret;
+
+ while (nr_pages) {
+ pgarr = pgarr_current(ctx);
+ if (!pgarr)
+ return -ENOMEM;
+ nr = pgarr_nr_free(pgarr);
+ if (nr > nr_pages)
+ nr = nr_pages;
+ vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+ ret = ckpt_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+ if (ret < 0)
+ return ret;
+ pgarr->nr_used += nr;
+ nr_pages -= nr;
+ }
+ return 0;
+}
+
+static int restore_read_page(struct ckpt_ctx *ctx, struct page *page, void *p)
+{
+ void *ptr;
+ int ret;
+
+ ret = ckpt_kread(ctx, p, PAGE_SIZE);
+ if (ret < 0)
+ return ret;
+
+ ptr = kmap_atomic(page, KM_USER1);
+ memcpy(ptr, p, PAGE_SIZE);
+ kunmap_atomic(ptr, KM_USER1);
+
+ return 0;
+}
+
+/**
+ * read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int read_pages_contents(struct ckpt_ctx *ctx)
+{
+ struct mm_struct *mm = current->mm;
+ struct ckpt_pgarr *pgarr;
+ unsigned long *vaddrs;
+ char *buf;
+ int i, ret = 0;
+
+ buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ down_read(&mm->mmap_sem);
+ list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+ vaddrs = pgarr->vaddrs;
+ for (i = 0; i < pgarr->nr_used; i++) {
+ struct page *page;
+
+ _ckpt_debug(CKPT_DPAGE, "got page %#lx\n", vaddrs[i]);
+ ret = get_user_pages(current, mm, vaddrs[i],
+ 1, 1, 1, &page, NULL);
+ if (ret < 0)
+ goto out;
+
+ ret = restore_read_page(ctx, page, buf);
+ page_cache_release(page);
+
+ if (ret < 0)
+ goto out;
+ }
+ }
+
+ out:
+ up_read(&mm->mmap_sem);
+ kfree(buf);
+ return 0;
+}
+
+/**
+ * restore_memory_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int restore_memory_contents(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_pgarr *h;
+ unsigned long nr_pages;
+ int len, ret = 0;
+
+ while (1) {
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_PGARR);
+ if (IS_ERR(h))
+ break;
+
+ ckpt_debug("total pages %ld\n", (unsigned long) h->nr_pages);
+
+ nr_pages = h->nr_pages;
+ ckpt_hdr_put(ctx, h);
+
+ if (!nr_pages)
+ break;
+
+ len = nr_pages * (sizeof(unsigned long) + PAGE_SIZE);
+ ret = _ckpt_read_buffer(ctx, NULL, len);
+ if (ret < 0)
+ break;
+
+ ret = read_pages_vaddrs(ctx, nr_pages);
+ if (ret < 0)
+ break;
+ ret = read_pages_contents(ctx);
+ if (ret < 0)
+ break;
+ pgarr_reset_all(ctx);
+ }
+
+ return ret;
+}
+
+/**
+ * calc_map_prot_bits - convert vm_flags to mmap protection
+ * orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+ unsigned long vm_prot = 0;
+
+ if (orig_vm_flags & VM_READ)
+ vm_prot |= PROT_READ;
+ if (orig_vm_flags & VM_WRITE)
+ vm_prot |= PROT_WRITE;
+ if (orig_vm_flags & VM_EXEC)
+ vm_prot |= PROT_EXEC;
+ if (orig_vm_flags & PROT_SEM) /* only (?) with IPC-SHM */
+ vm_prot |= PROT_SEM;
+
+ return vm_prot;
+}
+
+/**
+ * calc_map_flags_bits - convert vm_flags to mmap flags
+ * orig_vm_flags: source vm_flags
+ */
+static unsigned long calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+ unsigned long vm_flags = 0;
+
+ vm_flags = MAP_FIXED;
+ if (orig_vm_flags & VM_GROWSDOWN)
+ vm_flags |= MAP_GROWSDOWN;
+ if (orig_vm_flags & VM_DENYWRITE)
+ vm_flags |= MAP_DENYWRITE;
+ if (orig_vm_flags & VM_EXECUTABLE)
+ vm_flags |= MAP_EXECUTABLE;
+ if (orig_vm_flags & VM_MAYSHARE)
+ vm_flags |= MAP_SHARED;
+ else
+ vm_flags |= MAP_PRIVATE;
+
+ return vm_flags;
+}
+
+/**
+ * generic_vma_restore - restore a vma
+ * @mm - address space
+ * @file - file to map (NULL for anonymous)
+ * @h - vma header data
+ */
+static unsigned long generic_vma_restore(struct mm_struct *mm,
+ struct file *file,
+ struct ckpt_hdr_vma *h)
+{
+ unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+ unsigned long addr;
+
+ if (h->vm_end < h->vm_start)
+ return -EINVAL;
+ if (h->vma_objref < 0)
+ return -EINVAL;
+ if (h->vm_flags & CKPT_VMA_NOT_SUPPORTED)
+ return -ENOSYS;
+
+ vm_start = h->vm_start;
+ vm_pgoff = h->vm_pgoff;
+ vm_size = h->vm_end - h->vm_start;
+ vm_prot = calc_map_prot_bits(h->vm_flags);
+ vm_flags = calc_map_flags_bits(h->vm_flags);
+
+ down_write(&mm->mmap_sem);
+ addr = do_mmap_pgoff(file, vm_start, vm_size,
+ vm_prot, vm_flags, vm_pgoff);
+ up_write(&mm->mmap_sem);
+ ckpt_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+ vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+ return addr;
+}
+
+/**
+ * private_vma_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @file: file to use for mapping
+ * @h - vma header data
+ */
+int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+ struct file *file, struct ckpt_hdr_vma *h)
+{
+ unsigned long addr;
+
+ if (h->vm_flags & VM_SHARED)
+ return -EINVAL;
+
+ addr = generic_vma_restore(mm, file, h);
+ if (IS_ERR((void *) addr))
+ return PTR_ERR((void *) addr);
+
+ return restore_memory_contents(ctx);
+}
+
+/**
+ * anon_private_restore - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ * @h - vma header data
+ */
+static int anon_private_restore(struct ckpt_ctx *ctx,
+ struct mm_struct *mm,
+ struct ckpt_hdr_vma *h)
+{
+ /*
+ * vm_pgoff for anonymous mapping is the "global" page
+ * offset (namely from addr 0x0), so we force a zero
+ */
+ h->vm_pgoff = 0;
+
+ return private_vma_restore(ctx, mm, NULL, h);
+}
+
+/* callbacks to restore vma per its type: */
+struct restore_vma_ops {
+ char *vma_name;
+ enum vma_type vma_type;
+ int (*restore) (struct ckpt_ctx *ctx,
+ struct mm_struct *mm,
+ struct ckpt_hdr_vma *ptr);
+};
+
+static struct restore_vma_ops restore_vma_ops[] = {
+ /* ignored vma */
+ {
+ .vma_name = "IGNORE",
+ .vma_type = CKPT_VMA_IGNORE,
+ .restore = NULL,
+ },
+ /* special mapping (vdso) */
+ {
+ .vma_name = "VDSO",
+ .vma_type = CKPT_VMA_VDSO,
+ .restore = special_mapping_restore,
+ },
+ /* anonymous private */
+ {
+ .vma_name = "ANON PRIVATE",
+ .vma_type = CKPT_VMA_ANON,
+ .restore = anon_private_restore,
+ },
+ /* file-mapped private */
+ {
+ .vma_name = "FILE PRIVATE",
+ .vma_type = CKPT_VMA_FILE,
+ .restore = filemap_restore,
+ },
+};
+
+/**
+ * restore_vma - read vma data, recreate it and read contents
+ * @ctx: checkpoint context
+ * @mm: memory address space
+ */
+static int restore_vma(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+ struct ckpt_hdr_vma *h;
+ struct restore_vma_ops *ops;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_VMA);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ckpt_debug("vma %#lx-%#lx flags %#lx type %d vmaref %d\n",
+ (unsigned long) h->vm_start, (unsigned long) h->vm_end,
+ (unsigned long) h->vm_flags, (int) h->vma_type,
+ (int) h->vma_objref);
+
+ ret = -EINVAL;
+ if (h->vm_end < h->vm_start)
+ goto out;
+ if (h->vma_objref < 0)
+ goto out;
+ if (h->vma_type >= CKPT_VMA_MAX)
+ goto out;
+
+ ops = &restore_vma_ops[h->vma_type];
+
+ /* make sure we don't change this accidentally */
+ BUG_ON(ops->vma_type != h->vma_type);
+
+ if (ops->restore) {
+ ckpt_debug("vma type %s\n", ops->vma_name);
+ ret = ops->restore(ctx, mm, h);
+ } else {
+ ckpt_debug("vma ignored\n");
+ ret = 0;
+ }
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+static int destroy_mm(struct mm_struct *mm)
+{
+ struct vm_area_struct *vmnext = mm->mmap;
+ struct vm_area_struct *vma;
+ int ret;
+
+ while (vmnext) {
+ vma = vmnext;
+ vmnext = vmnext->vm_next;
+ ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+ if (ret < 0) {
+ pr_warning("c/r: failed do_munmap (%d)\n", ret);
+ return ret;
+ }
+ }
+ return 0;
+}
+
+static struct mm_struct *do_restore_mm(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_mm *h;
+ struct mm_struct *mm = NULL;
+ struct file *file;
+ unsigned int nr;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM);
+ if (IS_ERR(h))
+ return (struct mm_struct *) h;
+
+ ckpt_debug("map_count %d\n", h->map_count);
+
+ /* XXX need more sanity checks */
+
+ ret = -EINVAL;
+ if ((h->start_code > h->end_code) ||
+ (h->start_data > h->end_data))
+ goto out;
+ if (h->exe_objref < 0)
+ goto out;
+ if (h->def_flags & ~VM_LOCKED)
+ goto out;
+ if (h->flags & ~(MMF_DUMP_FILTER_MASK |
+ ((1 << MMF_DUMP_FILTER_BITS) - 1)))
+ goto out;
+
+ mm = current->mm;
+
+ /* point of no return -- destruct current mm */
+ down_write(&mm->mmap_sem);
+ ret = destroy_mm(mm);
+ if (ret < 0) {
+ up_write(&mm->mmap_sem);
+ goto out;
+ }
+
+ mm->flags = h->flags;
+ mm->def_flags = h->def_flags;
+
+ mm->start_code = h->start_code;
+ mm->end_code = h->end_code;
+ mm->start_data = h->start_data;
+ mm->end_data = h->end_data;
+ mm->start_brk = h->start_brk;
+ mm->brk = h->brk;
+ mm->start_stack = h->start_stack;
+ mm->arg_start = h->arg_start;
+ mm->arg_end = h->arg_end;
+ mm->env_start = h->env_start;
+ mm->env_end = h->env_end;
+
+ /* restore the ->exe_file */
+ if (h->exe_objref) {
+ file = ckpt_obj_fetch(ctx, h->exe_objref, CKPT_OBJ_FILE);
+ if (IS_ERR(file)) {
+ up_write(&mm->mmap_sem);
+ ret = PTR_ERR(file);
+ goto out;
+ }
+ set_mm_exe_file(mm, file);
+ }
+
+ ret = _ckpt_read_buffer(ctx, mm->saved_auxv, sizeof(mm->saved_auxv));
+ up_write(&mm->mmap_sem);
+ if (ret < 0)
+ goto out;
+
+ for (nr = h->map_count; nr; nr--) {
+ ret = restore_vma(ctx, mm);
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = restore_mm_context(ctx, mm);
+ out:
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ /* restore_obj() expect an extra reference */
+ atomic_inc(&mm->mm_users);
+ return mm;
+}
+
+void *restore_mm(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_mm(ctx);
+}
+
+int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref)
+{
+ struct mm_struct *mm;
+ int ret;
+
+ mm = ckpt_obj_fetch(ctx, mm_objref, CKPT_OBJ_MM);
+ if (IS_ERR(mm))
+ return PTR_ERR(mm);
+
+ if (mm == current->mm)
+ return 0;
+
+ ret = exec_mmap(mm);
+ if (ret < 0)
+ return ret;
+
+ atomic_inc(&mm->mm_users);
+ return 0;
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 479e8eb..354b200 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -158,6 +158,7 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.ref_grab = obj_mm_grab,
.ref_users = obj_mm_users,
.checkpoint = checkpoint_mm,
+ .restore = restore_mm,
},
};

diff --git a/checkpoint/process.c b/checkpoint/process.c
index 397ab08..5d71016 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -368,6 +368,9 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
ret = restore_obj_file_table(ctx, h->files_objref);
ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);

+ ret = restore_obj_mm(ctx, h->mm_objref);
+ ckpt_debug("mm: ret %d (%p)\n", ret, current->mm);
+
ckpt_hdr_put(ctx, h);
return ret;
}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 81790fe..972bee6 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -288,11 +288,15 @@ void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type)
static int check_kernel_const(struct ckpt_hdr_const *h)
{
struct task_struct *tsk;
+ struct mm_struct *mm;
struct new_utsname *uts;

/* task */
if (h->task_comm_len != sizeof(tsk->comm))
return -EINVAL;
+ /* mm */
+ if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
+ return -EINVAL;
/* uts */
if (h->uts_release_len != sizeof(uts->release))
return -EINVAL;
diff --git a/fs/exec.c b/fs/exec.c
index 4a8849e..08cda1e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -695,7 +695,7 @@ int kernel_read(struct file *file, unsigned long offset,

EXPORT_SYMBOL(kernel_read);

-static int exec_mmap(struct mm_struct *mm)
+int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
struct mm_struct * old_mm, *active_mm;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 452007c..f7f6967 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -120,6 +120,7 @@ extern int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);
extern int restore_read_header_arch(struct ckpt_ctx *ctx);
extern int restore_thread(struct ckpt_ctx *ctx);
extern int restore_cpu(struct ckpt_ctx *ctx);
+extern int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm);

extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
struct task_struct *t);
@@ -159,9 +160,15 @@ extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
int vma_objref);

extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);

extern int ckpt_collect_mm(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_mm(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_mm(struct ckpt_ctx *ctx);
+
+extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+ struct file *file, struct ckpt_hdr_vma *h);
+

#define CKPT_VMA_NOT_SUPPORTED \
(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | \
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 10c54b2..8bd2f11 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -248,7 +248,7 @@ struct ckpt_hdr_mm {
__u64 arg_start, arg_end, env_start, env_end;
} __attribute__((aligned(8)));

-/* vma subtypes */
+/* vma subtypes - index into restore_vma_dispatch[] */
enum vma_type {
CKPT_VMA_IGNORE = 0,
CKPT_VMA_VDSO, /* special vdso vma */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e46e95..98e1fdf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1163,6 +1163,9 @@ extern int do_munmap(struct mm_struct *, unsigned long, size_t);

extern unsigned long do_brk(unsigned long, unsigned long);

+/* fs/exec.c */
+extern int exec_mmap(struct mm_struct *mm);
+
/* filemap.c */
extern unsigned long page_unuse(struct page *);
extern void truncate_inode_pages(struct address_space *, loff_t);
@@ -1176,6 +1179,16 @@ extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
int write_one_page(struct page *page, int wait);
void task_dirty_inc(struct task_struct *tsk);

+
+/* checkpoint/restart */
+#ifdef CONFIG_CHECKPOINT
+struct ckpt_hdr_vma;
+extern int filemap_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+ struct ckpt_hdr_vma *hh);
+extern int special_mapping_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+ struct ckpt_hdr_vma *hh);
+#endif
+
/* readahead.c */
#define VM_MAX_READAHEAD 128 /* kbytes */
#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
diff --git a/mm/filemap.c b/mm/filemap.c
index d866bbd..843d88b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1668,6 +1668,25 @@ static int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)

return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
}
+
+int filemap_restore(struct ckpt_ctx *ctx,
+ struct mm_struct *mm,
+ struct ckpt_hdr_vma *h)
+{
+ struct file *file;
+ int ret;
+
+ if (h->vma_type == CKPT_VMA_FILE &&
+ (h->vm_flags & (VM_SHARED | VM_MAYSHARE)))
+ return -EINVAL;
+
+ file = ckpt_obj_fetch(ctx, h->vma_objref, CKPT_OBJ_FILE);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ ret = private_vma_restore(ctx, mm, file, h);
+ return ret;
+}
#endif /* CONFIG_CHECKPOINT */

struct vm_operations_struct generic_file_vm_ops = {
diff --git a/mm/mmap.c b/mm/mmap.c
index 939a17c..52d203e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2113,7 +2113,7 @@ void exit_mmap(struct mm_struct *mm)
tlb = tlb_gather_mmu(mm, 1);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use -1 here to ensure all VMAs in the mm are unmapped */
- end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+ end = vma ? unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL) : 0;
vm_unacct_memory(nr_accounted);
free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
@@ -2272,6 +2272,14 @@ static void special_mapping_close(struct vm_area_struct *vma)
}

#ifdef CONFIG_CHECKPOINT
+/*
+ * FIX:
+ * - checkpoint vdso pages (once per distinct vdso is enough)
+ * - check for compatilibility between saved and current vdso
+ * - accommodate for dynamic kernel data in vdso page
+ *
+ * Current, we require COMPAT_VDSO which somewhat mitigates the issue
+ */
static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
struct vm_area_struct *vma)
{
@@ -2293,6 +2301,19 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,

return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
}
+
+int special_mapping_restore(struct ckpt_ctx *ctx,
+ struct mm_struct *mm,
+ struct ckpt_hdr_vma *h)
+{
+ /*
+ * FIX:
+ * Currently, we only handle VDSO/vsyscall special handling.
+ * Even that, is very basic - call arch_setup_additional_pages
+ * requiring the same mapping (start address) as before.
+ */
+ return arch_setup_additional_pages(NULL, h->vm_start, 0);
+}
#endif /* CONFIG_CHECKPOINT */

static struct vm_operations_struct special_mapping_vmops = {
--
1.6.0.4

2009-07-22 10:17:07

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 14/60] pids 4/7: Add target_pids parameter to alloc_pid()

From: Sukadev Bhattiprolu <[email protected]>

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Reviewed-by: Oren Laadan <[email protected]>
---
include/linux/pid.h | 2 +-
kernel/fork.c | 3 ++-
kernel/pid.c | 9 +++++++--
3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
int next_pidmap(struct pid_namespace *pid_ns, int last);

-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
extern void free_pid(struct pid *pid);

/*
diff --git a/kernel/fork.c b/kernel/fork.c
index e90cee5..8c9ca1c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -953,6 +953,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
int retval;
struct task_struct *p;
int cgroup_callbacks_done = 0;
+ pid_t *target_pids = NULL;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1123,7 +1124,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;

if (pid != &init_struct_pid) {
- pid = alloc_pid(p->nsproxy->pid_ns);
+ pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 29cf119..6ee1a9e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -280,13 +280,14 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}

-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
{
struct pid *pid;
enum pid_type type;
int i, nr;
struct pid_namespace *tmp;
struct upid *upid;
+ int tpid;

pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
@@ -294,7 +295,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)

tmp = ns;
for (i = ns->level; i >= 0; i--) {
- nr = alloc_pidmap(tmp, 0);
+ tpid = 0;
+ if (target_pids)
+ tpid = target_pids[i];
+
+ nr = alloc_pidmap(tmp, tpid);
if (nr < 0)
goto out_free;

--
1.6.0.4

2009-07-22 10:19:50

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 57/60] c/r: capabilities: define checkpoint and restore fns

From: Serge E. Hallyn <[email protected]>

[ Andrew: I am punting on dealing with the subsystem cooperation
issues in this version, in favor of trying to get LSM issues
straightened out ]

An application checkpoint image will store capability sets
(and the bounding set) as __u64s. Define checkpoint and
restart functions to translate between those and kernel_cap_t's.

Define a common function do_capset_tocred() which applies capability
set changes to a passed-in struct cred.

The restore function uses do_capset_tocred() to apply the restored
capabilities to the struct cred being crafted, subject to the
current task's (task executing sys_restart()) permissions.

Changelog:
Jun 09: Can't choose securebits or drop bounding set if
file capabilities aren't compiled into the kernel.
Also just store caps in __u32s (looks cleaner).
Jun 01: Made the checkpoint and restore functions and the
ckpt_hdr_capabilities struct more opaque to the
rest of the c/r code, as suggested by Andrew Morgan,
and using naming suggested by Oren.
Jun 01: Add commented BUILD_BUG_ON() to point out that the
current implementation depends on 64-bit capabilities.
(Andrew Morgan and Alexey Dobriyan).
May 28: add helpers to c/r securebits

Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/capability.h | 6 ++
include/linux/checkpoint_hdr.h | 11 +++
kernel/capability.c | 164 +++++++++++++++++++++++++++++++++++++---
security/commoncap.c | 19 +----
4 files changed, 172 insertions(+), 28 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index c302110..3a74655 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -568,6 +568,12 @@ extern int capable(int cap);
struct dentry;
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);

+struct cred;
+int apply_securebits(unsigned securebits, struct cred *new);
+struct ckpt_capabilities;
+int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
+void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred);
+
#endif /* __KERNEL__ */

#endif /* !_LINUX_CAPABILITY_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3671e72..1f6a33d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -60,6 +60,7 @@ enum {
CKPT_HDR_NS,
CKPT_HDR_UTS_NS,
CKPT_HDR_IPC_NS,
+ CKPT_HDR_CAPABILITIES,

/* 201-299: reserved for arch-dependent */

@@ -191,6 +192,16 @@ struct ckpt_hdr_task {
__u64 robust_futex_list; /* a __user ptr */
} __attribute__((aligned(8)));

+/* Posix capabilities */
+struct ckpt_capabilities {
+ __u32 cap_i_0, cap_i_1; /* inheritable set */
+ __u32 cap_p_0, cap_p_1; /* permitted set */
+ __u32 cap_e_0, cap_e_1; /* effective set */
+ __u32 cap_b_0, cap_b_1; /* bounding set */
+ __u32 securebits;
+ __u32 padding;
+} __attribute__((aligned(8)));
+
/* namespaces */
struct ckpt_hdr_task_ns {
struct ckpt_hdr h;
diff --git a/kernel/capability.c b/kernel/capability.c
index 4e17041..4f58454 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -14,6 +14,8 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/pid_namespace.h>
+#include <linux/securebits.h>
+#include <linux/checkpoint.h>
#include <asm/uaccess.h>
#include "cred-internals.h"

@@ -217,6 +219,45 @@ SYSCALL_DEFINE2(capget, cap_user_header_t, header, cap_user_data_t, dataptr)
return ret;
}

+static int do_capset_tocred(kernel_cap_t *effective, kernel_cap_t *inheritable,
+ kernel_cap_t *permitted, struct cred *new)
+{
+ int ret;
+
+ ret = security_capset(new, current_cred(),
+ effective, inheritable, permitted);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * for checkpoint-restart, do we want to wait until end of restart?
+ * not sure we care */
+ audit_log_capset(current->pid, new, current_cred());
+
+ return 0;
+}
+
+static int do_capset(kernel_cap_t *effective, kernel_cap_t *inheritable,
+ kernel_cap_t *permitted)
+{
+ struct cred *new;
+ int ret;
+
+ new = prepare_creds();
+ if (!new)
+ return -ENOMEM;
+
+ ret = do_capset_tocred(effective, inheritable, permitted, new);
+ if (ret < 0)
+ goto error;
+
+ return commit_creds(new);
+
+error:
+ abort_creds(new);
+ return ret;
+}
+
/**
* sys_capset - set capabilities for a process or (*) a group of processes
* @header: pointer to struct that contains capability version and
@@ -240,7 +281,6 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data)
struct __user_cap_data_struct kdata[_KERNEL_CAPABILITY_U32S];
unsigned i, tocopy;
kernel_cap_t inheritable, permitted, effective;
- struct cred *new;
int ret;
pid_t pid;

@@ -271,23 +311,125 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data)
i++;
}

- new = prepare_creds();
- if (!new)
- return -ENOMEM;
+ return do_capset(&effective, &inheritable, &permitted);

- ret = security_capset(new, current_cred(),
- &effective, &inheritable, &permitted);
+}
+
+#ifdef CONFIG_SECURITY_FILE_CAPABILITIES
+int apply_securebits(unsigned securebits, struct cred *new)
+{
+ if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
+ & (new->securebits ^ securebits)) /*[1]*/
+ || ((new->securebits & SECURE_ALL_LOCKS & ~securebits)) /*[2]*/
+ || (securebits & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS)) /*[3]*/
+ || (cap_capable(current, current_cred(), CAP_SETPCAP,
+ SECURITY_CAP_AUDIT) != 0) /*[4]*/
+ /*
+ * [1] no changing of bits that are locked
+ * [2] no unlocking of locks
+ * [3] no setting of unsupported bits
+ * [4] doing anything requires privilege (go read about
+ * the "sendmail capabilities bug")
+ */
+ )
+ /* cannot change a locked bit */
+ return -EPERM;
+ new->securebits = securebits;
+ return 0;
+}
+
+static void do_capbset_drop(struct cred *cred, int cap)
+{
+ cap_lower(cred->cap_bset, cap);
+}
+
+static inline int restore_cap_bset(kernel_cap_t bset, struct cred *cred)
+{
+ int i, may_dropbcap = capable(CAP_SETPCAP);
+
+ for (i = 0; i < CAP_LAST_CAP; i++) {
+ if (cap_raised(bset, i))
+ continue;
+ if (!cap_raised(current_cred()->cap_bset, i))
+ continue;
+ if (!may_dropbcap)
+ return -EPERM;
+ do_capbset_drop(cred, i);
+ }
+
+ return 0;
+}
+
+#else /* CONFIG_SECURITY_FILE_CAPABILITIES */
+
+int apply_securebits(unsigned securebits, struct cred *new)
+{
+ /* settable securebits not supported */
+ return 0;
+}
+
+static inline int restore_cap_bset(kernel_cap_t bset, struct cred *cred)
+{
+ /* bounding sets not supported */
+ return 0;
+}
+#endif /* CONFIG_SECURITY_FILE_CAPABILITIES */
+
+#ifdef CONFIG_CHECKPOINT
+static int do_restore_caps(struct ckpt_capabilities *h, struct cred *cred)
+{
+ kernel_cap_t effective, inheritable, permitted, bset;
+ int ret;
+
+ effective.cap[0] = h->cap_e_0;
+ effective.cap[1] = h->cap_e_1;
+ inheritable.cap[0] = h->cap_i_0;
+ inheritable.cap[1] = h->cap_i_1;
+ permitted.cap[0] = h->cap_p_0;
+ permitted.cap[1] = h->cap_p_1;
+ bset.cap[0] = h->cap_b_0;
+ bset.cap[1] = h->cap_b_1;
+
+ ret = do_capset_tocred(&effective, &inheritable, &permitted, cred);
if (ret < 0)
- goto error;
+ return ret;
+
+ ret = restore_cap_bset(bset, cred);
+ return ret;
+}

- audit_log_capset(pid, new, current_cred());
+void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred)
+{
+ BUILD_BUG_ON(CAP_LAST_CAP >= 64);
+ h->securebits = cred->securebits;
+ h->cap_i_0 = cred->cap_inheritable.cap[0];
+ h->cap_i_1 = cred->cap_inheritable.cap[1];
+ h->cap_p_0 = cred->cap_permitted.cap[0];
+ h->cap_p_1 = cred->cap_permitted.cap[1];
+ h->cap_e_0 = cred->cap_effective.cap[0];
+ h->cap_e_1 = cred->cap_effective.cap[1];
+ h->cap_b_0 = cred->cap_bset.cap[0];
+ h->cap_b_1 = cred->cap_bset.cap[1];
+}

- return commit_creds(new);
+/*
+ * restore_capabilities: called by restore_creds() to set the
+ * restored capabilities (if permitted) in a new struct cred which
+ * will be attached at the end of the sys_restart().
+ * struct cred *new is prepared by caller (using prepare_creds())
+ * (and aborted by caller on error)
+ * return 0 on success, < 0 on error
+ */
+int restore_capabilities(struct ckpt_capabilities *h, struct cred *new)
+{
+ int ret = do_restore_caps(h, new);
+
+ if (!ret)
+ ret = apply_securebits(h->securebits, new);

-error:
- abort_creds(new);
return ret;
}
+#endif /* CONFIG_CHECKPOINT */

/**
* capable - Determine if the current task has a superior capability in effect
diff --git a/security/commoncap.c b/security/commoncap.c
index 48b7e02..2456b46 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -893,24 +893,9 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
* capability-based-privilege environment.
*/
case PR_SET_SECUREBITS:
- error = -EPERM;
- if ((((new->securebits & SECURE_ALL_LOCKS) >> 1)
- & (new->securebits ^ arg2)) /*[1]*/
- || ((new->securebits & SECURE_ALL_LOCKS & ~arg2)) /*[2]*/
- || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS)) /*[3]*/
- || (cap_capable(current, current_cred(), CAP_SETPCAP,
- SECURITY_CAP_AUDIT) != 0) /*[4]*/
- /*
- * [1] no changing of bits that are locked
- * [2] no unlocking of locks
- * [3] no setting of unsupported bits
- * [4] doing anything requires privilege (go read about
- * the "sendmail capabilities bug")
- */
- )
- /* cannot change a locked bit */
+ error = apply_securebits(arg2, new);
+ if (error)
goto error;
- new->securebits = arg2;
goto changed;

case PR_GET_SECUREBITS:
--
1.6.0.4

2009-07-22 10:18:57

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 26/60] c/r: restart multiple processes

Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself at the same order that they were saved. The internals of the
syscall will take care of in-kernel synchronization bewteen tasks.

This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.

There is one special task - the coordinator - that is not part of the
restarted hierarchy. The coordinator task allocates the restart
context (ctx) and orchestrates the restart. Thus even if a restart
fails after, or during the restore of the root task, the user
perceives a clean exit and an error message.

The coordinator task will:
1) read header and tree, create @ctx (wake up restarting tasks)
2) set the ->checkpoint_ctx field of itself and all descendants
3) wait for all restarting tasks to reach sync point #1
4) activate first restarting task (root task)
5) wait for all other tasks to complete and reach sync point #3
6) wake up everybody

(Note that in step #2 the coordinator assumes that the entire task
hierarchy exists by the time it enters sys_restart; this is arranged
in user space by 'mktree')

Task that are restarting has three sync points:
1) wait for its ->checkpoint_ctx to be set (by the coordinator)
2) wait for the task's turn to restore (be active)
[...now the task restores its state...]
3) wait for all other tasks to complete

The third sync point ensures that a task may only resume execution
after all tasks have successfully restored their state (or fail if an
error has occured). This prevents tasks from returning to user space
prematurely, before the entire restart completes.

If a single task wishes to restart, it can set the "RESTART_TASKSELF"
flag to restart(2) to skip the logic of the coordinator.

The root-task is a child of the coordinator, identified by the @pid
given to sys_restart() in the pid-ns of the coordinator. Restarting
tasks that aren't the coordinator, should set the @pid argument of
restart(2) syscall to zero.

All tasks explicitly test for an error flag on the checkpoint context
when they wakeup from sync points. If an error occurs during the
restart of some task, it will mark the @ctx with an error flag, and
wakeup the other tasks.

An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (@ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.

Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.

In terms of security, restart runs as the user the invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.

Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes

(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.

Changelog[v17]:
- Add uflag RESTART_FROZEN to freeze tasks after restart
- Fix restore_retval() and use only for restarting tasks
- Coordinator converts -ERSTART... to -EINTR
- Coordinator marks and sets descendants' ->checkpoint_ctx
- Coordinator properly detects errors when woken up from wait
- Fix race where root_task could kick start too early
- Add a sync point for restarting tasks
- Multiple fixes to restart logic
Changelog[v14]:
- Revert change to pr_debug(), back to ckpt_debug()
- Discard field 'h.parent'
- Check whether calls to ckpt_hbuf_get() fail
Changelog[v13]:
- Clear root_task->checkpoint_ctx regardless of error condition
- Remove unused argument 'ctx' from do_restore_task() prototype
- Remove unused member 'pids_err' from 'struct ckpt_ctx'
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/restart.c | 461 ++++++++++++++++++++++++++++++++++++--
checkpoint/sys.c | 33 ++-
include/linux/checkpoint.h | 39 +++-
include/linux/checkpoint_types.h | 15 +-
include/linux/sched.h | 4 +
kernel/exit.c | 5 +
kernel/fork.c | 3 +
7 files changed, 519 insertions(+), 41 deletions(-)

diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 4d1ff31..65422e2 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -13,7 +13,10 @@

#include <linux/version.h>
#include <linux/sched.h>
+#include <linux/wait.h>
#include <linux/file.h>
+#include <linux/ptrace.h>
+#include <linux/freezer.h>
#include <linux/magic.h>
#include <linux/utsname.h>
#include <asm/syscall.h>
@@ -324,6 +327,414 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
return ret;
}

+/* restore_read_tree - read the tasks tree into the checkpoint context */
+static int restore_read_tree(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_tree *h;
+ int size, ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TREE);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ret = -EINVAL;
+ if (h->nr_tasks < 0)
+ goto out;
+
+ ctx->nr_pids = h->nr_tasks;
+ size = sizeof(*ctx->pids_arr) * ctx->nr_pids;
+ if (size < 0) /* overflow ? */
+ goto out;
+
+ ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+ if (!ctx->pids_arr) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ ret = _ckpt_read_buffer(ctx, ctx->pids_arr, size);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+static inline pid_t get_active_pid(struct ckpt_ctx *ctx)
+{
+ int active = ctx->active_pid;
+ return active >= 0 ? ctx->pids_arr[active].vpid : 0;
+}
+
+static inline int is_task_active(struct ckpt_ctx *ctx, pid_t pid)
+{
+ return get_active_pid(ctx) == pid;
+}
+
+static inline void ckpt_notify_error(struct ckpt_ctx *ctx)
+{
+ ckpt_debug("ctx with root pid %d (%p)", ctx->root_pid, ctx);
+ ckpt_set_ctx_error(ctx);
+ complete(&ctx->complete);
+}
+
+static int ckpt_activate_next(struct ckpt_ctx *ctx)
+{
+ struct task_struct *task;
+ int active;
+ pid_t pid;
+
+ active = ++ctx->active_pid;
+ if (active >= ctx->nr_pids) {
+ complete(&ctx->complete);
+ return 0;
+ }
+
+ pid = get_active_pid(ctx);
+ ckpt_debug("active pid %d (%d < %d)\n", pid, active, ctx->nr_pids);
+
+ rcu_read_lock();
+ task = find_task_by_pid_ns(pid, ctx->root_nsproxy->pid_ns);
+ if (task)
+ wake_up_process(task);
+ rcu_read_unlock();
+
+ if (!task) {
+ ckpt_notify_error(ctx);
+ return -ESRCH;
+ }
+
+ return 0;
+}
+
+static int wait_task_active(struct ckpt_ctx *ctx)
+{
+ pid_t pid = task_pid_vnr(current);
+ int ret;
+
+ ckpt_debug("pid %d waiting\n", pid);
+ ret = wait_event_interruptible(ctx->waitq,
+ is_task_active(ctx, pid) ||
+ ckpt_test_ctx_error(ctx));
+ if (!ret && ckpt_test_ctx_error(ctx)) {
+ force_sig(SIGKILL, current);
+ ret = -EBUSY;
+ }
+ return ret;
+}
+
+static int wait_task_sync(struct ckpt_ctx *ctx)
+{
+ ckpt_debug("pid %d syncing\n", task_pid_vnr(current));
+ wait_event_interruptible(ctx->waitq, ckpt_test_ctx_complete(ctx));
+ if (ckpt_test_ctx_error(ctx)) {
+ force_sig(SIGKILL, current);
+ return -EBUSY;
+ }
+ return 0;
+}
+
+static int do_restore_task(void)
+{
+ DECLARE_WAIT_QUEUE_HEAD(waitq);
+ struct ckpt_ctx *ctx, *old_ctx;
+ int ret;
+
+ /*
+ * Wait for coordinator to make become visible, then grab a
+ * reference to its restart context. If we're the last task to
+ * do it, notify the coordinator.
+ */
+ ret = wait_event_interruptible(waitq, current->checkpoint_ctx);
+ if (ret < 0)
+ return ret;
+
+ ctx = xchg(&current->checkpoint_ctx, NULL);
+ if (!ctx)
+ return -EAGAIN;
+ ckpt_ctx_get(ctx);
+
+ /*
+ * Put the @ctx back on our task_struct. If an ancestor tried
+ * to prepare_descendants() on us (although extremly unlikely)
+ * we will encounter the ctx that he xchg()ed there and bail.
+ */
+ old_ctx = xchg(&current->checkpoint_ctx, ctx);
+ if (old_ctx) {
+ ckpt_debug("self-set of checkpoint_ctx failed\n");
+ /* alert coordinator of unexpected ctx */
+ ckpt_notify_error(old_ctx);
+ ckpt_ctx_put(old_ctx);
+ /* alert our coordinator that we bail */
+ ckpt_notify_error(ctx);
+ ckpt_ctx_put(ctx);
+ return -EAGAIN;
+ }
+
+ /* wait for our turn, do the restore, and tell next task in line */
+ ret = wait_task_active(ctx);
+ if (ret < 0)
+ goto out;
+
+ ret = restore_task(ctx);
+ if (ret < 0)
+ goto out;
+
+ ret = ckpt_activate_next(ctx);
+ if (ret < 0)
+ goto out;
+
+ ret = wait_task_sync(ctx);
+ out:
+ old_ctx = xchg(&current->checkpoint_ctx, NULL);
+ if (old_ctx)
+ ckpt_ctx_put(old_ctx);
+
+ /* if we're first to fail - notify others */
+ if (ret < 0 && !ckpt_test_ctx_error(ctx)) {
+ ckpt_notify_error(ctx);
+ wake_up_all(&ctx->waitq);
+ }
+
+ ckpt_ctx_put(ctx);
+ return ret;
+}
+
+/**
+ * prepare_descendants - set ->restart_tsk of all descendants
+ * @ctx: checkpoint context
+ * @root: root process for restart
+ *
+ * Called by the coodinator to set the ->restart_tsk pointer of the
+ * root task and all its descendants.
+ */
+static int prepare_descendants(struct ckpt_ctx *ctx, struct task_struct *root)
+{
+ struct task_struct *leader = root;
+ struct task_struct *parent = NULL;
+ struct task_struct *task = root;
+ struct ckpt_ctx *old_ctx;
+ int nr_pids = ctx->nr_pids;
+ int ret = 0;
+
+ read_lock(&tasklist_lock);
+ while (nr_pids) {
+ ckpt_debug("consider task %d\n", task_pid_vnr(task));
+ if (task_ptrace(task) & PT_PTRACED) {
+ ret = -EBUSY;
+ break;
+ }
+ /*
+ * Set task->restart_tsk of all non-zombie descendants.
+ * If a descendant already has a ->checkpoint_ctx, it
+ * must be a coordinator (for a different restart ?) so
+ * we fail.
+ *
+ * Note that own ancestors cannot interfere since they
+ * won't descend past us, as own ->checkpoint_ctx must
+ * already be set.
+ */
+ if (!task->exit_state) {
+ ckpt_ctx_get(ctx);
+ old_ctx = xchg(&task->checkpoint_ctx, ctx);
+ if (old_ctx) {
+ ckpt_debug("bad task %d\n",task_pid_vnr(task));
+ ckpt_ctx_put(old_ctx);
+ ret = -EAGAIN;
+ break;
+ }
+ ckpt_debug("prepare task %d\n", task_pid_vnr(task));
+ wake_up_process(task);
+ nr_pids--;
+ }
+
+ /* if has children - proceed with child */
+ if (!list_empty(&task->children)) {
+ parent = task;
+ task = list_entry(task->children.next,
+ struct task_struct, sibling);
+ continue;
+ }
+ while (task != root) {
+ /* if has sibling - proceed with sibling */
+ if (!list_is_last(&task->sibling, &parent->children)) {
+ task = list_entry(task->sibling.next,
+ struct task_struct, sibling);
+ break;
+ }
+
+ /* else, trace back to parent and proceed */
+ task = parent;
+ parent = parent->real_parent;
+ }
+ if (task == root) {
+ /* in case root task in multi-threaded */
+ root = task = next_thread(task);
+ if (root == leader)
+ break;
+ }
+ }
+ read_unlock(&tasklist_lock);
+ ckpt_debug("left %d ret %d root/task %d\n", nr_pids, ret, task == root);
+
+ /* fail unless number of processes matches */
+ if (!ret && (nr_pids || task != root))
+ ret = -ESRCH;
+
+ return ret;
+}
+
+static int wait_all_tasks_finish(struct ckpt_ctx *ctx)
+{
+ int ret;
+
+ init_completion(&ctx->complete);
+
+ ret = ckpt_activate_next(ctx);
+ if (ret < 0)
+ return ret;
+
+ ret = wait_for_completion_interruptible(&ctx->complete);
+
+ if (ckpt_test_ctx_error(ctx))
+ ret = -EBUSY;
+ return ret;
+}
+
+static struct task_struct *choose_root_task(struct ckpt_ctx *ctx, pid_t pid)
+{
+ struct task_struct *task;
+
+ if (ctx->uflags & RESTART_TASKSELF) {
+ ctx->root_pid = pid;
+ ctx->root_task = current;
+ get_task_struct(current);
+ return current;
+ }
+
+ read_lock(&tasklist_lock);
+ list_for_each_entry(task, &current->children, sibling) {
+ if (task_pid_vnr(task) == pid) {
+ get_task_struct(task);
+ ctx->root_task = task;
+ ctx->root_pid = pid;
+ break;
+ }
+ }
+ read_unlock(&tasklist_lock);
+
+ return task;
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+ struct nsproxy *nsproxy;
+
+ /*
+ * No need for explicit cleanup here, because if an error
+ * occurs then ckpt_ctx_free() is eventually called.
+ */
+
+ ctx->root_task = choose_root_task(ctx, pid);
+ if (!ctx->root_task)
+ return -ESRCH;
+
+ rcu_read_lock();
+ nsproxy = task_nsproxy(ctx->root_task);
+ if (nsproxy) {
+ get_nsproxy(nsproxy);
+ ctx->root_nsproxy = nsproxy;
+ }
+ rcu_read_unlock();
+ if (!nsproxy)
+ return -ESRCH;
+
+ ctx->active_pid = -1; /* see ckpt_activate_next, get_active_pid */
+
+ return 0;
+}
+
+static int do_restore_coord(struct ckpt_ctx *ctx, pid_t pid)
+{
+ struct ckpt_ctx *old_ctx;
+ int ret;
+
+ ret = restore_read_header(ctx);
+ if (ret < 0)
+ return ret;
+ ret = restore_read_tree(ctx);
+ if (ret < 0)
+ return ret;
+
+ if ((ctx->uflags & RESTART_TASKSELF) && ctx->nr_pids != 1)
+ return -EINVAL;
+
+ ret = init_restart_ctx(ctx, pid);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * Populate own ->checkpoint_ctx: if an ancestor attempts to
+ * prepare_descendants() on us, it will fail. Furthermore,
+ * that ancestor won't proceed deeper to interfere with our
+ * descendants that are restarting (e.g. by xchg()ing their
+ * ->checkpoint_ctx pointer temporarily).
+ */
+ ckpt_ctx_get(ctx);
+ old_ctx = xchg(&current->checkpoint_ctx, ctx);
+ if (old_ctx) {
+ /*
+ * We are a bad-behaving descendant: an ancestor must
+ * have done prepare_descendants() on us as part of a
+ * restart. Oh, well ... alert ancestor (coordinator)
+ * with an error on @old_ctx.
+ */
+ ckpt_debug("bad bavhing checkpoint_ctx\n");
+ ckpt_notify_error(old_ctx);
+ ckpt_ctx_put(old_ctx);
+ return -EBUSY;
+ }
+
+ if (ctx->uflags & RESTART_TASKSELF) {
+ ret = restore_task(ctx);
+ if (ret < 0)
+ goto out;
+ } else {
+ /* prepare descendants' t->restart_tsk point to coord */
+ ret = prepare_descendants(ctx, ctx->root_task);
+ if (ret < 0)
+ goto out;
+ /* wait for all other tasks to complete do_restore_task() */
+ ret = wait_all_tasks_finish(ctx);
+ if (ret < 0)
+ goto out;
+ }
+
+ ret = restore_read_tail(ctx);
+ if (ret < 0)
+ goto out;
+
+ if (ctx->uflags & RESTART_FROZEN) {
+ ret = cgroup_freezer_make_frozen(ctx->root_task);
+ ckpt_debug("freezing restart tasks ... %d\n", ret);
+ }
+ out:
+ if (ret < 0)
+ ckpt_set_ctx_error(ctx);
+ else
+ ckpt_set_ctx_success(ctx);
+
+ if (!(ctx->uflags & RESTART_TASKSELF))
+ wake_up_all(&ctx->waitq);
+ /*
+ * If an ancestor attempts to prepare_descendants() on us, it
+ * xchg()s our ->checkpoint_ctx, and free it. Our @ctx will,
+ * instead, point to the ctx that said ancestor placed.
+ */
+ ctx = xchg(&current->checkpoint_ctx, NULL);
+ ckpt_ctx_put(ctx);
+
+ return ret;
+}
+
static long restore_retval(void)
{
struct pt_regs *regs = task_pt_regs(current);
@@ -372,28 +783,40 @@ static long restore_retval(void)
return ret;
}

-/* setup restart-specific parts of ctx */
-static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
-{
- return 0;
-}
-
long do_restart(struct ckpt_ctx *ctx, pid_t pid)
{
long ret;

- ret = init_restart_ctx(ctx, pid);
- if (ret < 0)
- return ret;
- ret = restore_read_header(ctx);
- if (ret < 0)
- return ret;
- ret = restore_task(ctx);
- if (ret < 0)
- return ret;
- ret = restore_read_tail(ctx);
- if (ret < 0)
- return ret;
+ if (ctx)
+ ret = do_restore_coord(ctx, pid);
+ else
+ ret = do_restore_task();

- return restore_retval();
+ /* restart(2) isn't idempotent: should not be auto-restarted */
+ if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+ ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
+ ret = -EINTR;
+
+ /*
+ * The retval from what we return to the caller when all goes
+ * well: this is either the retval from the original syscall
+ * that was interrupted during checkpoint, or the contents of
+ * (saved) eax if the task was in userspace.
+ *
+ * The coordinator (ctx!=NULL) is exempt: don't adjust its retval.
+ * But in self-restart (where RESTART_TASKSELF), the coordinator
+ * _itself_ is a restarting task.
+ */
+
+ if (!ctx || (ctx->uflags & RESTART_TASKSELF)) {
+ if (ret < 0) {
+ /* partial restore is undefined: terminate */
+ ckpt_debug("restart err %d, exiting\n", ret);
+ force_sig(SIGKILL, current);
+ } else {
+ ret = restore_retval();
+ }
+ }
+
+ return ret;
}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index cc94775..c8921f0 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -189,6 +189,8 @@ static void task_arr_free(struct ckpt_ctx *ctx)

static void ckpt_ctx_free(struct ckpt_ctx *ctx)
{
+ BUG_ON(atomic_read(&ctx->refcount));
+
if (ctx->file)
fput(ctx->file);

@@ -202,6 +204,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
if (ctx->root_freezer)
put_task_struct(ctx->root_freezer);

+ kfree(ctx->pids_arr);
+
kfree(ctx);
}

@@ -219,17 +223,32 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
ctx->kflags = kflags;
ctx->ktime_begin = ktime_get();

+ atomic_set(&ctx->refcount, 0);
+ init_waitqueue_head(&ctx->waitq);
+
err = -EBADF;
ctx->file = fget(fd);
if (!ctx->file)
goto err;

+ atomic_inc(&ctx->refcount);
return ctx;
err:
ckpt_ctx_free(ctx);
return ERR_PTR(err);
}

+void ckpt_ctx_get(struct ckpt_ctx *ctx)
+{
+ atomic_inc(&ctx->refcount);
+}
+
+void ckpt_ctx_put(struct ckpt_ctx *ctx)
+{
+ if (ctx && atomic_dec_and_test(&ctx->refcount))
+ ckpt_ctx_free(ctx);
+}
+
/**
* sys_checkpoint - checkpoint a container
* @pid: pid of the container init(1) process
@@ -261,7 +280,7 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
if (!ret)
ret = ctx->crid;

- ckpt_ctx_free(ctx);
+ ckpt_ctx_put(ctx);
return ret;
}

@@ -280,24 +299,20 @@ SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
long ret;

/* no flags for now */
- if (flags)
+ if (flags & ~RESTART_USER_FLAGS)
return -EINVAL;

if (!ckpt_unpriv_allowed && !capable(CAP_SYS_ADMIN))
return -EPERM;

- ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
+ if (pid)
+ ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
if (IS_ERR(ctx))
return PTR_ERR(ctx);

ret = do_restart(ctx, pid);

- /* restart(2) isn't idempotent: can't restart syscall */
- if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
- ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
- ret = -EINTR;
-
- ckpt_ctx_free(ctx);
+ ckpt_ctx_put(ctx);
return ret;
}

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index df2938f..44b692d 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -15,6 +15,10 @@
/* checkpoint user flags */
#define CHECKPOINT_SUBTREE 0x1

+/* restart user flags */
+#define RESTART_TASKSELF 0x1
+#define RESTART_FROZEN 0x2
+
#ifdef __KERNEL__
#ifdef CONFIG_CHECKPOINT

@@ -23,23 +27,21 @@


/* ckpt_ctx: kflags */
-#define CKPT_CTX_CHECKPOINT_BIT 1
-#define CKPT_CTX_RESTART_BIT 2
+#define CKPT_CTX_CHECKPOINT_BIT 0
+#define CKPT_CTX_RESTART_BIT 1
+#define CKPT_CTX_SUCCESS_BIT 2
+#define CKPT_CTX_ERROR_BIT 3

#define CKPT_CTX_CHECKPOINT (1 << CKPT_CTX_CHECKPOINT_BIT)
#define CKPT_CTX_RESTART (1 << CKPT_CTX_RESTART_BIT)
+#define CKPT_CTX_SUCCESS (1 << CKPT_CTX_SUCCESS_BIT)
+#define CKPT_CTX_ERROR (1 << CKPT_CTX_ERROR_BIT)

-
-/* ckpt_ctx: kflags */
-#define CKPT_CTX_CHECKPOINT_BIT 1
-#define CKPT_CTX_RESTART_BIT 2
-
-#define CKPT_CTX_CHECKPOINT (1 << CKPT_CTX_CHECKPOINT_BIT)
-#define CKPT_CTX_RESTART (1 << CKPT_CTX_RESTART_BIT)
-
-/* ckpt ctx: uflags */
+/* ckpt_ctx: uflags */
#define CHECKPOINT_USER_FLAGS CHECKPOINT_SUBTREE
+#define RESTART_USER_FLAGS (RESTART_TASKSELF | RESTART_FROZEN)

+extern void exit_checkpoint(struct task_struct *tsk);

extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
@@ -64,6 +66,21 @@ extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);

+/* ckpt kflags */
+#define ckpt_set_ctx_kflag(__ctx, __kflag) \
+ set_bit(__kflag##_BIT, &(__ctx)->kflags)
+
+#define ckpt_set_ctx_success(ctx) ckpt_set_ctx_kflag(ctx, CKPT_CTX_SUCCESS)
+#define ckpt_set_ctx_error(ctx) ckpt_set_ctx_kflag(ctx, CKPT_CTX_ERROR)
+
+#define ckpt_test_ctx_error(ctx) \
+ ((ctx)->kflags & CKPT_CTX_ERROR)
+#define ckpt_test_ctx_complete(ctx) \
+ ((ctx)->kflags & (CKPT_CTX_SUCCESS | CKPT_CTX_ERROR))
+
+extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
+extern void ckpt_ctx_put(struct ckpt_ctx *ctx);
+
extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);

diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 5dca34f..4785df6 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -16,6 +16,7 @@
#include <linux/nsproxy.h>
#include <linux/fs.h>
#include <linux/ktime.h>
+#include <linux/wait.h>

struct ckpt_ctx {
int crid; /* unique checkpoint id */
@@ -35,10 +36,20 @@ struct ckpt_ctx {
struct file *file; /* input/output file */
int total; /* total read/written */

- struct task_struct **tasks_arr; /* array of all tasks in container */
- int nr_tasks; /* size of tasks array */
+ atomic_t refcount;

char err_string[256]; /* checkpoint: error string */
+
+ /* [multi-process checkpoint] */
+ struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+ int nr_tasks; /* size of tasks array */
+
+ /* [multi-process restart] */
+ struct ckpt_hdr_pids *pids_arr; /* array of all pids [restart] */
+ int nr_pids; /* size of pids array */
+ int active_pid; /* (next) position in pids array */
+ struct completion complete; /* container root and other tasks on */
+ wait_queue_head_t waitq; /* start, end, and restart ordering */
};

#endif /* __KERNEL__ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e2ebb41..0e67de7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1479,6 +1479,9 @@ struct task_struct {
/* bitmask of trace recursion */
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
+#ifdef CONFIG_CHECKPOINT
+ struct ckpt_ctx *checkpoint_ctx;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -1692,6 +1695,7 @@ extern cputime_t task_gtime(struct task_struct *p);
#define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
+#define PF_RESTARTING 0x08000000 /* Process is restarting (c/r) */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */
diff --git a/kernel/exit.c b/kernel/exit.c
index 869dc22..912b1fa 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -49,6 +49,7 @@
#include <linux/init_task.h>
#include <linux/perf_counter.h>
#include <trace/events/sched.h>
+#include <linux/checkpoint.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -992,6 +993,10 @@ NORET_TYPE void do_exit(long code)
if (unlikely(current->pi_state_cache))
kfree(current->pi_state_cache);
#endif
+#ifdef CONFIG_CHECKPOINT
+ if (unlikely(tsk->checkpoint_ctx))
+ exit_checkpoint(tsk);
+#endif
/*
* Make sure we are holding no locks:
*/
diff --git a/kernel/fork.c b/kernel/fork.c
index 29c66f0..68412b5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1161,6 +1161,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
INIT_LIST_HEAD(&p->pi_state_list);
p->pi_state_cache = NULL;
#endif
+#ifdef CONFIG_CHECKPOINT
+ p->checkpoint_ctx = NULL;
+#endif
/*
* sigaltstack should be cleared when sharing the same VM
*/
--
1.6.0.4

2009-07-22 10:16:53

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 27/60] c/r: introduce PF_RESTARTING, and skip notification on exit

To restore zombie's we will create the a task, that, on its turn to
run, calls do_exit(). Unlike normal tasks that exit, we need to
prevent notification side effects that send signals to other
processes, e.g. parent (SIGCHLD) or child tasks (per child's request).

There are three main cases for such notifications:

1) do_notify_parent(): parent of a process is notified about a change
in status (e.g. become zombie, reparent, etc). If parent ignores,
then mark child for immediate release (skip zombie).

2) kill_orphan_pgrp(): a process group that becomes orphaned will
signal stopped jobs (HUP then CONT).

3) reparent_thread(): children of a process are signaled (per request)
with p->pdeath_signal

Remember that restoring signal state (for any restarting task) must
complete _before_ it is allowed to resume execution, and not during
the resume. Otherwise, a running task may send a signal to another
task that hasn't restored yet, so the new signal will be lost
soon-after.

I considered two possible way to address this:

1. Add another sync point to restart: all tasks will first restore
their state without signals (all signals blocked), and zombies call
do_exit(). A sync point then will ensure that all zombies are gone and
their effects done. Then all tasks restore their signal state (and
mask), and sync (new point) again. Only then they may resume
execution.
The main disadvantage is the added complexity and inefficiency,
for no good reason.

2. Introduce PF_RESTARTING: mark all restarting tasks with a new flag,
and teach the above three notifications to skip sending the signal if
theis flag is set.
The main advantage is simplicity and completeness. Also, such a flag
may to be useful later on. This the method implemented.

Signed-off-by: Oren Laadan <[email protected]>
---
kernel/exit.c | 7 ++++++-
kernel/signal.c | 4 ++++
2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 912b1fa..41ac4cf 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -299,6 +299,10 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
struct pid *pgrp = task_pgrp(tsk);
struct task_struct *ignored_task = tsk;

+ /* restarting zombie doesn't trigger signals */
+ if (tsk->flags & PF_RESTARTING)
+ return;
+
if (!parent)
/* exit: our father is in a different pgrp than
* we are and we were the only connection outside.
@@ -739,7 +743,8 @@ static struct task_struct *find_new_reaper(struct task_struct *father)
static void reparent_thread(struct task_struct *father, struct task_struct *p,
struct list_head *dead)
{
- if (p->pdeath_signal)
+ /* restarting zombie doesn't trigger signals */
+ if (p->pdeath_signal && !(p->flags & PF_RESTARTING))
group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);

list_move_tail(&p->sibling, &p->real_parent->children);
diff --git a/kernel/signal.c b/kernel/signal.c
index ccf1cee..697f700 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1413,6 +1413,10 @@ int do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!task_ptrace(tsk) &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));

+ /* restarting zombie doesn't notify parent */
+ if (tsk->flags & PF_RESTARTING)
+ return ret;
+
info.si_signo = sig;
info.si_errno = 0;
/*
--
1.6.0.4

2009-07-22 10:18:06

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 17/60] pids 7/7: Define clone_with_pids syscall

From: Sukadev Bhattiprolu <[email protected]>

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone_with_pids(), intended for use during restart, is the same as clone(),
except that it takes a 'target_pid_set' paramter. This parameter lets caller
choose specific pid numbers for the child process, in the process's active
and ancestor pid namespaces. (Descendant pid namespaces in general don't
matter since processes don't have pids in them anyway, but see comments
in copy_target_pids() regarding CLONE_NEWPID).

Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
prevent unprivileged processes from misusing this interface.

Call clone_with_pids as follows:

pid_t pids[] = { 0, 77, 99 };
struct target_pid_set pid_set;

pid_set.num_pids = sizeof(pids) / sizeof(int);
pid_set.target_pids = &pids;

syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);

If a target-pid is 0, the kernel continues to assign a pid for the process in
that namespace. In the above example, pids[0] is 0, meaning the kernel will
assign next available pid to the process in init_pid_ns. But kernel will assign
pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
77 or 99 are taken, the system call fails with -EBUSY.

If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
the system call fails with -EINVAL.

Its mostly an exploratory patch seeking feedback on the interface.

NOTE:
Compared to clone(), clone_with_pids() needs to pass in two more
pieces of information:

- number of pids in the set
- user buffer containing the list of pids.

But since clone() already takes 5 parameters, use a 'struct
target_pid_set'.

TODO:
- Gently tested.
- May need additional sanity checks in do_fork_with_pids().

Changelog[v3]:
- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
in the target_pids[] list and setting it 0. See copy_target_pids()).
- (Oren Laadan) Specified target pids should apply only to youngest
pid-namespaces (see copy_target_pids())
- (Matt Helsley) Update patch description.

Changelog[v2]:
- Remove unnecessary printk and add a note to callers of
copy_target_pids() to free target_pids.
- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
'num_pids == 0' (fall back to normal clone()).
- Move arch-independent code (sanity checks and copy-in of target-pids)
into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
- Fixed some compile errors (had fixed these errors earlier in my
git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <[email protected]>
---
arch/x86/include/asm/syscalls.h | 2 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/kernel/entry_32.S | 1 +
arch/x86/kernel/process_32.c | 21 +++++++
arch/x86/kernel/syscall_table_32.S | 1 +
kernel/fork.c | 108 +++++++++++++++++++++++++++++++++++-
6 files changed, 133 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 372b76e..df3c4a8 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -40,6 +40,8 @@ long sys_iopl(struct pt_regs *);

/* kernel/process_32.c */
int sys_clone(struct pt_regs *);
+int sys_clone_with_pids(struct pt_regs *);
+int sys_vfork(struct pt_regs *);
int sys_execve(struct pt_regs *);

/* kernel/signal.c */
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 732a307..f65b750 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_counter_open 336
+#define __NR_clone_with_pids 337

#ifdef __KERNEL__

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..c7bd1f6 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -718,6 +718,7 @@ ptregs_##name: \
PTREGSCALL(iopl)
PTREGSCALL(fork)
PTREGSCALL(clone)
+PTREGSCALL(clone_with_pids)
PTREGSCALL(vfork)
PTREGSCALL(execve)
PTREGSCALL(sigaltstack)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 59f4524..9965c06 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -443,6 +443,27 @@ int sys_clone(struct pt_regs *regs)
return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
}

+int sys_clone_with_pids(struct pt_regs *regs)
+{
+ unsigned long clone_flags;
+ unsigned long newsp;
+ int __user *parent_tidptr;
+ int __user *child_tidptr;
+ void __user *upid_setp;
+
+ clone_flags = regs->bx;
+ newsp = regs->cx;
+ parent_tidptr = (int __user *)regs->dx;
+ child_tidptr = (int __user *)regs->di;
+ upid_setp = (void __user *)regs->bp;
+
+ if (!newsp)
+ newsp = regs->sp;
+
+ return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+ child_tidptr, upid_setp);
+}
+
/*
* sys_execve() executes a new program.
*/
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d51321d..879e5ec 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_counter_open
+ .long ptregs_clone_with_pids
diff --git a/kernel/fork.c b/kernel/fork.c
index 64d53d9..29c66f0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1336,6 +1336,97 @@ struct task_struct * __cpuinit fork_idle(int cpu)
}

/*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(void __user *upid_setp)
+{
+ int j;
+ int rc;
+ int size;
+ int unum_pids; /* # of pids specified by user */
+ int knum_pids; /* # of pids needed in kernel */
+ pid_t *target_pids;
+ struct target_pid_set pid_set;
+
+ if (!upid_setp)
+ return NULL;
+
+ rc = copy_from_user(&pid_set, upid_setp, sizeof(pid_set));
+ if (rc)
+ return ERR_PTR(-EFAULT);
+
+ unum_pids = pid_set.num_pids;
+ knum_pids = task_pid(current)->level + 1;
+
+ if (!unum_pids)
+ return NULL;
+
+ if (unum_pids < 0 || unum_pids > knum_pids)
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+ * and set it to 0. This last entry in target_pids[] corresponds to the
+ * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+ * specified. If CLONE_NEWPID was not specified, this last entry will
+ * simply be ignored.
+ */
+ target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+ if (!target_pids)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * A process running in a level 2 pid namespace has three pid namespaces
+ * and hence three pid numbers. If this process is checkpointed,
+ * information about these three namespaces are saved. We refer to these
+ * namespaces as 'known namespaces'.
+ *
+ * If this checkpointed process is however restarted in a level 3 pid
+ * namespace, the restarted process has an extra ancestor pid namespace
+ * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+ *
+ * During restart, the process requests specific pids for its 'known
+ * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+ *
+ * Since the requested-pids correspond to 'known namespaces' and since
+ * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+ * namespaces', copy requested pids to the back-end of target_pids[]
+ * (i.e before the last entry for CLONE_NEWPID mentioned above).
+ * Any entries in target_pids[] not corresponding to a requested pid
+ * will be set to zero and kernel assigns a pid in those namespaces.
+ *
+ * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+ * youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+ * the order is:
+ *
+ * - pids for 'unknown-namespaces' (if any)
+ * - pids for 'known-namespaces' (requested pids)
+ * - 0 in the last entry (for CLONE_NEWPID).
+ */
+ j = knum_pids - unum_pids;
+ size = unum_pids * sizeof(pid_t);
+
+ rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
+ if (rc) {
+ rc = -EFAULT;
+ goto out_free;
+ }
+
+ return target_pids;
+
+out_free:
+ kfree(target_pids);
+ return ERR_PTR(rc);
+}
+
+/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
@@ -1352,7 +1443,7 @@ long do_fork_with_pids(unsigned long clone_flags,
struct task_struct *p;
int trace = 0;
long nr;
- pid_t *target_pids = NULL;
+ pid_t *target_pids;

/*
* Do some preliminary argument and permissions checking before we
@@ -1386,6 +1477,17 @@ long do_fork_with_pids(unsigned long clone_flags,
}
}

+ target_pids = copy_target_pids(pid_setp);
+
+ if (target_pids) {
+ if (IS_ERR(target_pids))
+ return PTR_ERR(target_pids);
+
+ nr = -EPERM;
+ if (!capable(CAP_SYS_ADMIN))
+ goto out_free;
+ }
+
/*
* When called from kernel_thread, don't do user tracing stuff.
*/
@@ -1453,6 +1555,10 @@ long do_fork_with_pids(unsigned long clone_flags,
} else {
nr = PTR_ERR(p);
}
+
+out_free:
+ kfree(target_pids);
+
return nr;
}

--
1.6.0.4

2009-07-22 10:17:30

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 29/60] c/r: Save and restore the [compat_]robust_list member of the task struct

From: Matt Helsley <[email protected]>

These lists record which futexes the task holds. To keep the overhead of
robust futexes low the list is kept in userspace. When the task exits the
kernel carefully walks these lists to recover held futexes that
other tasks may be attempting to acquire with FUTEX_WAIT.

Because they point to userspace memory that is saved/restored by
checkpoint/restart saving the list pointers themselves is safe.

While saving the pointers is safe during checkpoint, restart is tricky
because the robust futex ABI contains provisions for changes based on
checking the size of the list head. So we need to save the length of
the list head too in order to make sure that the kernel used during
restart is capable of handling that ABI. Since there is only one ABI
supported at the moment taking the list head's size is simple. Should
the ABI change we will need to use the same size as specified during
sys_set_robust_list() and hence some new means of determining the length
of this userspace structure in sys_checkpoint would be required.

Rather than rewrite the logic that checks and handles the ABI we reuse
sys_set_robust_list() by factoring out the body of the function and
calling it during restart.

Signed-off-by: Matt Helsley <[email protected]>
[[email protected]: move save/restore code to checkpoint/process.c]
---
checkpoint/process.c | 48 ++++++++++++++++++++++++++++++++++++++++
include/linux/checkpoint_hdr.h | 5 ++++
include/linux/compat.h | 3 +-
include/linux/futex.h | 1 +
kernel/futex.c | 19 ++++++++++-----
kernel/futex_compat.c | 13 ++++++++--
6 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/checkpoint/process.c b/checkpoint/process.c
index a67c389..9e459c6 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -18,6 +18,52 @@
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

+
+#ifdef CONFIG_FUTEX
+static void save_task_robust_futex_list(struct ckpt_hdr_task *h,
+ struct task_struct *t)
+{
+ /*
+ * These are __user pointers and thus can be saved without
+ * the objhash.
+ */
+ h->robust_futex_list = (unsigned long)t->robust_list;
+ h->robust_futex_head_len = sizeof(*t->robust_list);
+#ifdef CONFIG_COMPAT
+ h->compat_robust_futex_list = ptr_to_compat(t->compat_robust_list);
+ h->compat_robust_futex_head_len = sizeof(*t->compat_robust_list);
+#endif
+}
+
+static void restore_task_robust_futex_list(struct ckpt_hdr_task *h)
+{
+ /* Since we restore the memory map the address remains the same and
+ * this is safe. This is the same as [compat_]sys_set_robust_list() */
+ if (h->robust_futex_list) {
+ struct robust_list_head __user *rfl;
+ rfl = (void __user *)(unsigned long) h->robust_futex_list;
+ do_set_robust_list(rfl, h->robust_futex_head_len);
+ }
+#ifdef CONFIG_COMPAT
+ if (h->compat_robust_futex_list) {
+ struct compat_robust_list_head __user *crfl;
+ crfl = compat_ptr(h->compat_robust_futex_list);
+ do_compat_set_robust_list(crfl, h->compat_robust_futex_head_len);
+ }
+#endif
+}
+#else /* !CONFIG_FUTEX */
+static inline void save_task_robust_futex_list(struct ckpt_hdr_task *h,
+ struct task_struct *t)
+{
+}
+
+static inline void restore_task_robust_futex_list(struct ckpt_hdr_task *h)
+{
+}
+#endif /* CONFIG_FUTEX */
+
+
/***********************************************************************
* Checkpoint
*/
@@ -46,6 +92,7 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)

h->set_child_tid = t->set_child_tid;
h->clear_child_tid = t->clear_child_tid;
+ save_task_robust_futex_list(h, t);
}

ret = ckpt_write_obj(ctx, &h->h);
@@ -244,6 +291,7 @@ static int restore_task_struct(struct ckpt_ctx *ctx)

t->set_child_tid = h->set_child_tid;
t->clear_child_tid = h->clear_child_tid;
+ restore_task_robust_futex_list(h);
}

memset(t->comm, 0, TASK_COMM_LEN);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3f2db22..ad5851d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -134,6 +134,11 @@ struct ckpt_hdr_task {

__u64 set_child_tid;
__u64 clear_child_tid;
+
+ __u32 compat_robust_futex_head_len;
+ __u32 compat_robust_futex_list; /* a compat __user ptr */
+ __u32 robust_futex_head_len;
+ __u64 robust_futex_list; /* a __user ptr */
} __attribute__((aligned(8)));

/* restart blocks */
diff --git a/include/linux/compat.h b/include/linux/compat.h
index af931ee..f444cf0 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -165,7 +165,8 @@ struct compat_robust_list_head {
};

extern void compat_exit_robust_list(struct task_struct *curr);
-
+extern long do_compat_set_robust_list(struct compat_robust_list_head __user *head,
+ compat_size_t len);
asmlinkage long
compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
compat_size_t len);
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 4326f81..2e126a9 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -185,6 +185,7 @@ union futex_key {
#define FUTEX_KEY_INIT (union futex_key) { .both = { .ptr = NULL } }

#ifdef CONFIG_FUTEX
+extern long do_set_robust_list(struct robust_list_head __user *head, size_t len);
extern void exit_robust_list(struct task_struct *curr);
extern void exit_pi_state_list(struct task_struct *curr);
extern int futex_cmpxchg_enabled;
diff --git a/kernel/futex.c b/kernel/futex.c
index dfe246f..57a46c9 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2261,13 +2261,7 @@ out:
* the list. There can only be one such pending lock.
*/

-/**
- * sys_set_robust_list - set the robust-futex list head of a task
- * @head: pointer to the list-head
- * @len: length of the list-head, as userspace expects
- */
-SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
- size_t, len)
+long do_set_robust_list(struct robust_list_head __user *head, size_t len)
{
if (!futex_cmpxchg_enabled)
return -ENOSYS;
@@ -2283,6 +2277,17 @@ SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
}

/**
+ * sys_set_robust_list - set the robust-futex list head of a task
+ * @head: pointer to the list-head
+ * @len: length of the list-head, as userspace expects
+ */
+SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
+ size_t, len)
+{
+ return do_set_robust_list(head, len);
+}
+
+/**
* sys_get_robust_list - get the robust-futex list head of a task
* @pid: pid of the process [zero for current task]
* @head_ptr: pointer to a list-head pointer, the kernel fills it in
diff --git a/kernel/futex_compat.c b/kernel/futex_compat.c
index d607a5b..eac734c 100644
--- a/kernel/futex_compat.c
+++ b/kernel/futex_compat.c
@@ -114,9 +114,9 @@ void compat_exit_robust_list(struct task_struct *curr)
}
}

-asmlinkage long
-compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
- compat_size_t len)
+long
+do_compat_set_robust_list(struct compat_robust_list_head __user *head,
+ compat_size_t len)
{
if (!futex_cmpxchg_enabled)
return -ENOSYS;
@@ -130,6 +130,13 @@ compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
}

asmlinkage long
+compat_sys_set_robust_list(struct compat_robust_list_head __user *head,
+ compat_size_t len)
+{
+ return do_compat_set_robust_list(head, len);
+}
+
+asmlinkage long
compat_sys_get_robust_list(int pid, compat_uptr_t __user *head_ptr,
compat_size_t __user *len_ptr)
{
--
1.6.0.4

2009-07-22 10:21:38

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 48/60] c/r (ipc): allow allocation of a desired ipc identifier

During restart, we need to allocate ipc objects that with the same
identifiers as recorded during checkpoint. Modify the allocation
code allow an in-kernel caller to request a specific ipc identifier.
The system call interface remains unchanged.

Signed-off-by: Oren Laadan <[email protected]>
---
ipc/msg.c | 17 ++++++++++++-----
ipc/sem.c | 17 ++++++++++++-----
ipc/shm.c | 19 +++++++++++++------
ipc/util.c | 42 +++++++++++++++++++++++++++++-------------
ipc/util.h | 9 +++++----
5 files changed, 71 insertions(+), 33 deletions(-)

diff --git a/ipc/msg.c b/ipc/msg.c
index 2ceab7f..1db7c45 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -73,7 +73,7 @@ struct msg_sender {
#define msg_unlock(msq) ipc_unlock(&(msq)->q_perm)

static void freeque(struct ipc_namespace *, struct kern_ipc_perm *);
-static int newque(struct ipc_namespace *, struct ipc_params *);
+static int newque(struct ipc_namespace *, struct ipc_params *, int);
#ifdef CONFIG_PROC_FS
static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
#endif
@@ -174,10 +174,12 @@ static inline void msg_rmid(struct ipc_namespace *ns, struct msg_queue *s)
* newque - Create a new msg queue
* @ns: namespace
* @params: ptr to the structure that contains the key and msgflg
+ * @req_id: request desired id if available (-1 if don't care)
*
* Called with msg_ids.rw_mutex held (writer)
*/
-static int newque(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newque(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
{
struct msg_queue *msq;
int id, retval;
@@ -201,7 +203,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params)
/*
* ipc_addid() locks msq
*/
- id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni);
+ id = ipc_addid(&msg_ids(ns), &msq->q_perm, ns->msg_ctlmni, req_id);
if (id < 0) {
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
@@ -309,7 +311,7 @@ static inline int msg_security(struct kern_ipc_perm *ipcp, int msgflg)
return security_msg_queue_associate(msq, msgflg);
}

-SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+int do_msgget(key_t key, int msgflg, int req_id)
{
struct ipc_namespace *ns;
struct ipc_ops msg_ops;
@@ -324,7 +326,12 @@ SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
msg_params.key = key;
msg_params.flg = msgflg;

- return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params);
+ return ipcget(ns, &msg_ids(ns), &msg_ops, &msg_params, req_id);
+}
+
+SYSCALL_DEFINE2(msgget, key_t, key, int, msgflg)
+{
+ return do_msgget(key, msgflg, -1);
}

static inline unsigned long
diff --git a/ipc/sem.c b/ipc/sem.c
index 87c2b64..a2b2135 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -92,7 +92,7 @@
#define sem_unlock(sma) ipc_unlock(&(sma)->sem_perm)
#define sem_checkid(sma, semid) ipc_checkid(&sma->sem_perm, semid)

-static int newary(struct ipc_namespace *, struct ipc_params *);
+static int newary(struct ipc_namespace *, struct ipc_params *, int);
static void freeary(struct ipc_namespace *, struct kern_ipc_perm *);
#ifdef CONFIG_PROC_FS
static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
@@ -227,11 +227,13 @@ static inline void sem_rmid(struct ipc_namespace *ns, struct sem_array *s)
* newary - Create a new semaphore set
* @ns: namespace
* @params: ptr to the structure that contains key, semflg and nsems
+ * @req_id: request desired id if available (-1 if don't care)
*
* Called with sem_ids.rw_mutex held (as a writer)
*/

-static int newary(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newary(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
{
int id;
int retval;
@@ -263,7 +265,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
return retval;
}

- id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni);
+ id = ipc_addid(&sem_ids(ns), &sma->sem_perm, ns->sc_semmni, req_id);
if (id < 0) {
security_sem_free(sma);
ipc_rcu_putref(sma);
@@ -308,7 +310,7 @@ static inline int sem_more_checks(struct kern_ipc_perm *ipcp,
return 0;
}

-SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+int do_semget(key_t key, int nsems, int semflg, int req_id)
{
struct ipc_namespace *ns;
struct ipc_ops sem_ops;
@@ -327,7 +329,12 @@ SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
sem_params.flg = semflg;
sem_params.u.nsems = nsems;

- return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params);
+ return ipcget(ns, &sem_ids(ns), &sem_ops, &sem_params, req_id);
+}
+
+SYSCALL_DEFINE3(semget, key_t, key, int, nsems, int, semflg)
+{
+ return do_semget(key, nsems, semflg, -1);
}

/*
diff --git a/ipc/shm.c b/ipc/shm.c
index 15dd238..0ee2c35 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -62,7 +62,7 @@ static struct vm_operations_struct shm_vm_ops;
#define shm_unlock(shp) \
ipc_unlock(&(shp)->shm_perm)

-static int newseg(struct ipc_namespace *, struct ipc_params *);
+static int newseg(struct ipc_namespace *, struct ipc_params *, int);
static void shm_open(struct vm_area_struct *vma);
static void shm_close(struct vm_area_struct *vma);
static void shm_destroy (struct ipc_namespace *ns, struct shmid_kernel *shp);
@@ -83,7 +83,7 @@ void shm_init_ns(struct ipc_namespace *ns)
* Called with shm_ids.rw_mutex (writer) and the shp structure locked.
* Only shm_ids.rw_mutex remains locked on exit.
*/
-static void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
+void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp)
{
struct shmid_kernel *shp;
shp = container_of(ipcp, struct shmid_kernel, shm_perm);
@@ -326,11 +326,13 @@ static struct vm_operations_struct shm_vm_ops = {
* newseg - Create a new shared memory segment
* @ns: namespace
* @params: ptr to the structure that contains key, size and shmflg
+ * @req_id: request desired id if available (-1 if don't care)
*
* Called with shm_ids.rw_mutex held as a writer.
*/

-static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
+static int
+newseg(struct ipc_namespace *ns, struct ipc_params *params, int req_id)
{
key_t key = params->key;
int shmflg = params->flg;
@@ -385,7 +387,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (IS_ERR(file))
goto no_file;

- id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
+ id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni, req_id);
if (id < 0) {
error = id;
goto no_id;
@@ -443,7 +445,7 @@ static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
return 0;
}

-SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
+int do_shmget(key_t key, size_t size, int shmflg, int req_id)
{
struct ipc_namespace *ns;
struct ipc_ops shm_ops;
@@ -459,7 +461,12 @@ SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
shm_params.flg = shmflg;
shm_params.u.size = size;

- return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
+ return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params, req_id);
+}
+
+SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
+{
+ return do_shmget(key, size, shmflg, -1);
}

static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)
diff --git a/ipc/util.c b/ipc/util.c
index b8e4ba9..ca248ec 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -247,10 +247,12 @@ int ipc_get_maxid(struct ipc_ids *ids)
* Called with ipc_ids.rw_mutex held as a writer.
*/

-int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
+int
+ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm *new, int size, int req_id)
{
uid_t euid;
gid_t egid;
+ int lid = 0;
int id, err;

if (size > IPCMNI)
@@ -259,28 +261,41 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
if (ids->in_use >= size)
return -ENOSPC;

+ if (req_id >= 0)
+ lid = ipcid_to_idx(req_id);
+
spin_lock_init(&new->lock);
new->deleted = 0;
rcu_read_lock();
spin_lock(&new->lock);

- err = idr_get_new(&ids->ipcs_idr, new, &id);
+ err = idr_get_new_above(&ids->ipcs_idr, new, lid, &id);
if (err) {
spin_unlock(&new->lock);
rcu_read_unlock();
return err;
}

+ if (req_id >= 0) {
+ if (id != lid) {
+ idr_remove(&ids->ipcs_idr, id);
+ spin_unlock(&new->lock);
+ rcu_read_unlock();
+ return -EBUSY;
+ }
+ new->seq = req_id / SEQ_MULTIPLIER;
+ } else {
+ new->seq = ids->seq++;
+ if (ids->seq > ids->seq_max)
+ ids->seq = 0;
+ }
+
ids->in_use++;

current_euid_egid(&euid, &egid);
new->cuid = new->uid = euid;
new->gid = new->cgid = egid;

- new->seq = ids->seq++;
- if(ids->seq > ids->seq_max)
- ids->seq = 0;
-
new->id = ipc_buildid(id, new->seq);
return id;
}
@@ -296,7 +311,7 @@ int ipc_addid(struct ipc_ids* ids, struct kern_ipc_perm* new, int size)
* when the key is IPC_PRIVATE.
*/
static int ipcget_new(struct ipc_namespace *ns, struct ipc_ids *ids,
- struct ipc_ops *ops, struct ipc_params *params)
+ struct ipc_ops *ops, struct ipc_params *params, int req_id)
{
int err;
retry:
@@ -306,7 +321,7 @@ retry:
return -ENOMEM;

down_write(&ids->rw_mutex);
- err = ops->getnew(ns, params);
+ err = ops->getnew(ns, params, req_id);
up_write(&ids->rw_mutex);

if (err == -EAGAIN)
@@ -351,6 +366,7 @@ static int ipc_check_perms(struct kern_ipc_perm *ipcp, struct ipc_ops *ops,
* @ids: IPC identifer set
* @ops: the actual creation routine to call
* @params: its parameters
+ * @req_id: request desired id if available (-1 if don't care)
*
* This routine is called by sys_msgget, sys_semget() and sys_shmget()
* when the key is not IPC_PRIVATE.
@@ -360,7 +376,7 @@ static int ipc_check_perms(struct kern_ipc_perm *ipcp, struct ipc_ops *ops,
* On success, the ipc id is returned.
*/
static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
- struct ipc_ops *ops, struct ipc_params *params)
+ struct ipc_ops *ops, struct ipc_params *params, int req_id)
{
struct kern_ipc_perm *ipcp;
int flg = params->flg;
@@ -381,7 +397,7 @@ retry:
else if (!err)
err = -ENOMEM;
else
- err = ops->getnew(ns, params);
+ err = ops->getnew(ns, params, req_id);
} else {
/* ipc object has been locked by ipc_findkey() */

@@ -742,12 +758,12 @@ struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id)
* Common routine called by sys_msgget(), sys_semget() and sys_shmget().
*/
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
- struct ipc_ops *ops, struct ipc_params *params)
+ struct ipc_ops *ops, struct ipc_params *params, int req_id)
{
if (params->key == IPC_PRIVATE)
- return ipcget_new(ns, ids, ops, params);
+ return ipcget_new(ns, ids, ops, params, req_id);
else
- return ipcget_public(ns, ids, ops, params);
+ return ipcget_public(ns, ids, ops, params, req_id);
}

/**
diff --git a/ipc/util.h b/ipc/util.h
index 764b51a..159a73c 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -71,7 +71,7 @@ struct ipc_params {
* . routine to call for an extra check if needed
*/
struct ipc_ops {
- int (*getnew) (struct ipc_namespace *, struct ipc_params *);
+ int (*getnew) (struct ipc_namespace *, struct ipc_params *, int);
int (*associate) (struct kern_ipc_perm *, int);
int (*more_checks) (struct kern_ipc_perm *, struct ipc_params *);
};
@@ -94,7 +94,7 @@ void __init ipc_init_proc_interface(const char *path, const char *header,
#define ipcid_to_idx(id) ((id) % SEQ_MULTIPLIER)

/* must be called with ids->rw_mutex acquired for writing */
-int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int);
+int ipc_addid(struct ipc_ids *, struct kern_ipc_perm *, int, int);

/* must be called with ids->rw_mutex acquired for reading */
int ipc_get_maxid(struct ipc_ids *);
@@ -171,7 +171,8 @@ static inline void ipc_unlock(struct kern_ipc_perm *perm)

struct kern_ipc_perm *ipc_lock_check(struct ipc_ids *ids, int id);
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
- struct ipc_ops *ops, struct ipc_params *params);
+ struct ipc_ops *ops, struct ipc_params *params, int req_id);
void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
- void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
+ void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
+
#endif
--
1.6.0.4

2009-07-22 10:10:46

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 24/60] c/r: restart-blocks

(Paraphrasing what's said this message:
http://lists.openwall.net/linux-kernel/2007/12/05/64)

Restart blocks are callbacks used cause a system call to be restarted
with the arguments specified in the system call restart block. It is
useful for system call that are not idempotent, i.e. the argument(s)
might be a relative timeout, where some adjustments are required when
restarting the system call. It relies on the system call itself to set
up its restart point and the argument save area. They are rare: an
actual signal would turn that it an EINTR. The only case that should
ever trigger this is some kernel action that interrupts the system
call, but does not actually result in any user-visible state changes -
like freeze and thaw.

So restart blocks are about time remaining for the system call to
sleep/wait. Generally in c/r, there are two possible time models that
we can follow: absolute, relative. Here, I chose to save the relative
timeout, measured from the beginning of the checkpoint. The time when
the checkpoint (and restart) begin is also saved. This information is
sufficient to restart in either model (absolute or negative).

Which model to use should eventually be a per application choice (and
possible configurable via cradvise() or some sort). For now, we adopt
the relative model, namely, at restart the timeout is set relative to
the beginning of the restart.

To checkpoint, we check if a task has a valid restart block, and if so
we save the *remaining* time that is has to wait/sleep, and the type
of the restart block.

To restart, we fill in the data required at the proper place in the
thread information. If the system call return an error (which is
possibly an -ERESTARTSYS eg), we not only use that error as our own
return value, but also arrange for the task to execute the signal
handler (by faking a signal). The handler, in turn, already has the
code to handle these restart request gracefully.

Signed-off-by: Oren Laadan <[email protected]>
---
arch/x86/include/asm/checkpoint_hdr.h | 1 -
checkpoint/checkpoint.c | 1 +
checkpoint/process.c | 226 +++++++++++++++++++++++++++++++++
checkpoint/restart.c | 5 +-
checkpoint/sys.c | 1 +
include/linux/checkpoint.h | 4 +
include/linux/checkpoint_hdr.h | 22 +++
include/linux/checkpoint_types.h | 3 +
8 files changed, 260 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index c5762fb..f4d1e14 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -58,7 +58,6 @@ struct ckpt_hdr_header_arch {

struct ckpt_hdr_thread {
struct ckpt_hdr h;
- /* FIXME: restart blocks */
__u32 thread_info_flags;
__u16 gdt_entry_tls_entries;
__u16 sizeof_tls_array;
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 226735c..8facd9a 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -22,6 +22,7 @@
#include <linux/mount.h>
#include <linux/utsname.h>
#include <linux/magic.h>
+#include <linux/hrtimer.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

diff --git a/checkpoint/process.c b/checkpoint/process.c
index d2c59d2..a0bf344 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -12,6 +12,9 @@
#define CKPT_DFLAG CKPT_DSYS

#include <linux/sched.h>
+#include <linux/posix-timers.h>
+#include <linux/futex.h>
+#include <linux/poll.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

@@ -47,6 +50,116 @@ static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
}

+/* dump the task_struct of a given task */
+int checkpoint_restart_block(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr_restart_block *h;
+ struct restart_block *restart_block;
+ long (*fn)(struct restart_block *);
+ s64 base, expire = 0;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK);
+ if (!h)
+ return -ENOMEM;
+
+ base = ktime_to_ns(ctx->ktime_begin);
+ restart_block = &task_thread_info(t)->restart_block;
+ fn = restart_block->fn;
+
+ /* FIX: enumerate clockid_t so we're immune to changes */
+
+ if (fn == do_no_restart_syscall) {
+
+ h->function_type = CKPT_RESTART_BLOCK_NONE;
+ ckpt_debug("restart_block: non\n");
+
+ } else if (fn == hrtimer_nanosleep_restart) {
+
+ h->function_type = CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP;
+ h->arg_0 = restart_block->nanosleep.index;
+ h->arg_1 = (unsigned long) restart_block->nanosleep.rmtp;
+ expire = restart_block->nanosleep.expires;
+ ckpt_debug("restart_block: hrtimer expire %lld now %lld\n",
+ expire, base);
+
+ } else if (fn == posix_cpu_nsleep_restart) {
+ struct timespec ts;
+
+ h->function_type = CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP;
+ h->arg_0 = restart_block->arg0;
+ h->arg_1 = restart_block->arg1;
+ ts.tv_sec = restart_block->arg2;
+ ts.tv_nsec = restart_block->arg3;
+ expire = timespec_to_ns(&ts);
+ ckpt_debug("restart_block: posix_cpu expire %lld now %lld\n",
+ expire, base);
+
+#ifdef CONFIG_COMPAT
+ } else if (fn == compat_nanosleep_restart) {
+
+ h->function_type = CKPT_RESTART_BLOCK_NANOSLEEP;
+ h->arg_0 = restart_block->nanosleep.index;
+ h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp;
+ h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp;
+ expire = restart_block->nanosleep.expires;
+ ckpt_debug("restart_block: compat expire %lld now %lld\n",
+ expire, base);
+
+ } else if (fn == compat_clock_nanosleep_restart) {
+
+ h->function_type = CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP;
+ h->arg_0 = restart_block->nanosleep.index;
+ h->arg_1 = (unsigned long)restart_block->nanosleep.rmtp;
+ h->arg_2 = (unsigned long)restart_block->nanosleep.compat_rmtp;
+ expire = restart_block->nanosleep.expires;
+ ckpt_debug("restart_block: compat_clock expire %lld now %lld\n",
+ expire, base);
+
+#endif
+ } else if (fn == futex_wait_restart) {
+
+ h->function_type = CKPT_RESTART_BLOCK_FUTEX;
+ h->arg_0 = (unsigned long) restart_block->futex.uaddr;
+ h->arg_1 = restart_block->futex.val;
+ h->arg_2 = restart_block->futex.flags;
+ h->arg_3 = restart_block->futex.bitset;
+ expire = restart_block->futex.time;
+ ckpt_debug("restart_block: futex expire %lld now %lld\n",
+ expire, base);
+
+ } else if (fn == do_restart_poll) {
+ struct timespec ts;
+
+ h->function_type = CKPT_RESTART_BLOCK_POLL;
+ h->arg_0 = (unsigned long) restart_block->poll.ufds;
+ h->arg_1 = restart_block->poll.nfds;
+ h->arg_2 = restart_block->poll.has_timeout;
+ ts.tv_sec = restart_block->poll.tv_sec;
+ ts.tv_nsec = restart_block->poll.tv_nsec;
+ expire = timespec_to_ns(&ts);
+ ckpt_debug("restart_block: poll expire %lld now %lld\n",
+ expire, base);
+
+ } else {
+
+ BUG();
+
+ }
+
+ /* common to all restart blocks: */
+ h->arg_4 = (base < expire ? expire - base : 0);
+
+ ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n",
+ h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+
+ ckpt_debug("restart_block ret %d\n", ret);
+ return ret;
+}
+
/* dump the entire state of a given task */
int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
{
@@ -60,6 +173,10 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
ckpt_debug("thread %d\n", ret);
if (ret < 0)
goto out;
+ ret = checkpoint_restart_block(ctx, t);
+ ckpt_debug("restart-blocks %d\n", ret);
+ if (ret < 0)
+ goto out;
ret = checkpoint_cpu(ctx, t);
ckpt_debug("cpu %d\n", ret);
out:
@@ -95,6 +212,111 @@ static int restore_task_struct(struct ckpt_ctx *ctx)
return ret;
}

+int restore_restart_block(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_restart_block *h;
+ struct restart_block restart_block;
+ struct timespec ts;
+ clockid_t clockid;
+ s64 expire;
+ int ret = 0;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_RESTART_BLOCK);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ expire = ktime_to_ns(ctx->ktime_begin) + h->arg_4;
+ restart_block.fn = NULL;
+
+ ckpt_debug("restart_block: expire %lld begin %lld\n",
+ expire, ktime_to_ns(ctx->ktime_begin));
+ ckpt_debug("restart_block: args %#llx %#llx %#llx %#llx %#llx\n",
+ h->arg_0, h->arg_1, h->arg_2, h->arg_3, h->arg_4);
+
+ switch (h->function_type) {
+ case CKPT_RESTART_BLOCK_NONE:
+ restart_block.fn = do_no_restart_syscall;
+ break;
+ case CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP:
+ clockid = h->arg_0;
+ if (clockid < 0 || invalid_clockid(clockid))
+ break;
+ restart_block.fn = hrtimer_nanosleep_restart;
+ restart_block.nanosleep.index = clockid;
+ restart_block.nanosleep.rmtp =
+ (struct timespec __user *) (unsigned long) h->arg_1;
+ restart_block.nanosleep.expires = expire;
+ break;
+ case CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP:
+ clockid = h->arg_0;
+ if (clockid < 0 || invalid_clockid(clockid))
+ break;
+ restart_block.fn = posix_cpu_nsleep_restart;
+ restart_block.arg0 = clockid;
+ restart_block.arg1 = h->arg_1;
+ ts = ns_to_timespec(expire);
+ restart_block.arg2 = ts.tv_sec;
+ restart_block.arg3 = ts.tv_nsec;
+ break;
+#ifdef CONFIG_COMPAT
+ case CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP:
+ clockid = h->arg_0;
+ if (clockid < 0 || invalid_clockid(clockid))
+ break;
+ restart_block.fn = compat_nanosleep_restart;
+ restart_block.nanosleep.index = clockid;
+ restart_block.nanosleep.rmtp =
+ (struct timespec __user *) (unsigned long) h->arg_1;
+ restart_block.nanosleep.compat_rmtp =
+ (struct compat_timespec __user *)
+ (unsigned long) h->arg_2;
+ resatrt_block.nanosleep.expires = expire;
+ break;
+ case CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP:
+ clockid = h->arg_0;
+ if (clockid < 0 || invalid_clockid(clockid))
+ break;
+ restart_block.fn = compat_clock_nanosleep_restart;
+ restart_block.nanosleep.index = clockid;
+ restart_block.nanosleep.rmtp =
+ (struct timespec __user *) (unsigned long) h->arg_1;
+ restart_block.nanosleep.compat_rmtp =
+ (struct compat_timespec __user *)
+ (unsigned long) h->arg_2;
+ resatrt_block.nanosleep.expires = expire;
+ break;
+#endif
+ case CKPT_RESTART_BLOCK_FUTEX:
+ restart_block.fn = futex_wait_restart;
+ restart_block.futex.uaddr = (u32 *) (unsigned long) h->arg_0;
+ restart_block.futex.val = h->arg_1;
+ restart_block.futex.flags = h->arg_2;
+ restart_block.futex.bitset = h->arg_3;
+ restart_block.futex.time = expire;
+ break;
+ case CKPT_RESTART_BLOCK_POLL:
+ restart_block.fn = do_restart_poll;
+ restart_block.poll.ufds =
+ (struct pollfd __user *) (unsigned long) h->arg_0;
+ restart_block.poll.nfds = h->arg_1;
+ restart_block.poll.has_timeout = h->arg_2;
+ ts = ns_to_timespec(expire);
+ restart_block.poll.tv_sec = ts.tv_sec;
+ restart_block.poll.tv_nsec = ts.tv_nsec;
+ break;
+ default:
+ break;
+ }
+
+ if (restart_block.fn)
+ task_thread_info(current)->restart_block = restart_block;
+ else
+ ret = -EINVAL;
+
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
/* read the entire state of the current task */
int restore_task(struct ckpt_ctx *ctx)
{
@@ -108,6 +330,10 @@ int restore_task(struct ckpt_ctx *ctx)
ckpt_debug("thread %d\n", ret);
if (ret < 0)
goto out;
+ ret = restore_restart_block(ctx);
+ ckpt_debug("restart-blocks %d\n", ret);
+ if (ret < 0)
+ goto out;
ret = restore_cpu(ctx);
ckpt_debug("cpu %d\n", ret);
out:
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 62e19b4..582d6b4 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -16,6 +16,8 @@
#include <linux/file.h>
#include <linux/magic.h>
#include <linux/utsname.h>
+#include <asm/syscall.h>
+#include <linux/elf.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

@@ -393,6 +395,5 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)
if (ret < 0)
return ret;

- /* on success, adjust the return value if needed [TODO] */
- return restore_retval(ctx);
+ return restore_retval();
}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index dda2c21..b37bc8c 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -193,6 +193,7 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,

ctx->uflags = uflags;
ctx->kflags = kflags;
+ ctx->ktime_begin = ktime_get();

err = -EBADF;
ctx->file = fget(fd);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index f7e2cb8..01541b8 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -66,6 +66,10 @@ extern int restore_read_header_arch(struct ckpt_ctx *ctx);
extern int restore_thread(struct ckpt_ctx *ctx);
extern int restore_cpu(struct ckpt_ctx *ctx);

+extern int checkpoint_restart_block(struct ckpt_ctx *ctx,
+ struct task_struct *t);
+extern int restore_restart_block(struct ckpt_ctx *ctx);
+

/* debugging flags */
#define CKPT_DBASE 0x1 /* anything */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ce43aa9..fa23629 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -50,6 +50,7 @@ enum {
CKPT_HDR_STRING,

CKPT_HDR_TASK = 101,
+ CKPT_HDR_RESTART_BLOCK,
CKPT_HDR_THREAD,
CKPT_HDR_CPU,

@@ -120,4 +121,25 @@ struct ckpt_hdr_task {
__u64 clear_child_tid;
} __attribute__((aligned(8)));

+/* restart blocks */
+struct ckpt_hdr_restart_block {
+ struct ckpt_hdr h;
+ __u64 function_type;
+ __u64 arg_0;
+ __u64 arg_1;
+ __u64 arg_2;
+ __u64 arg_3;
+ __u64 arg_4;
+} __attribute__((aligned(8)));
+
+enum restart_block_type {
+ CKPT_RESTART_BLOCK_NONE = 1,
+ CKPT_RESTART_BLOCK_HRTIMER_NANOSLEEP,
+ CKPT_RESTART_BLOCK_POSIX_CPU_NANOSLEEP,
+ CKPT_RESTART_BLOCK_COMPAT_NANOSLEEP,
+ CKPT_RESTART_BLOCK_COMPAT_CLOCK_NANOSLEEP,
+ CKPT_RESTART_BLOCK_POLL,
+ CKPT_RESTART_BLOCK_FUTEX
+};
+
#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 21b5965..220c209 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -15,10 +15,13 @@
#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/fs.h>
+#include <linux/ktime.h>

struct ckpt_ctx {
int crid; /* unique checkpoint id */

+ ktime_t ktime_begin; /* checkpoint start time */
+
pid_t root_pid; /* [container] root pid */
struct task_struct *root_task; /* [container] root task */
struct nsproxy *root_nsproxy; /* [container] root nsproxy */
--
1.6.0.4

2009-07-22 10:20:24

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 07/60] cgroup freezer: Add CHECKPOINTING state to safeguard container checkpoint

From: Matt Helsley <[email protected]>

The CHECKPOINTING state prevents userspace from unfreezing tasks until
sys_checkpoint() is finished. When doing container checkpoint userspace
will do:

echo FROZEN > /cgroups/my_container/freezer.state
...
rc = sys_checkpoint( <pid of container root> );

To ensure a consistent checkpoint image userspace should not be allowed
to thaw the cgroup (echo THAWED > /cgroups/my_container/freezer.state)
during checkpoint.

"CHECKPOINTING" can only be set on a "FROZEN" cgroup using the checkpoint
system call. Once in the "CHECKPOINTING" state, the cgroup may not leave until
the checkpoint system call is finished and ready to return. Then the
freezer state returns to "FROZEN". Writing any new state to freezer.state while
checkpointing will return EBUSY. These semantics ensure that userspace cannot
unfreeze the cgroup midway through the checkpoint system call.

The cgroup_freezer_begin_checkpoint() and cgroup_freezer_end_checkpoint()
make relatively few assumptions about the task that is passed in. However the
way they are called in do_checkpoint() assumes that the root of the container
is in the same freezer cgroup as all the other tasks that will be
checkpointed.

Notes:
As a side-effect this prevents the multiple tasks from entering the
CHECKPOINTING state simultaneously. All but one will get -EBUSY.

Signed-off-by: Oren Laadan <[email protected]>
Signed-off-by: Matt Helsley <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Cedric Le Goater <[email protected]>
---
Documentation/cgroups/freezer-subsystem.txt | 10 ++
include/linux/freezer.h | 8 ++
kernel/cgroup_freezer.c | 166 ++++++++++++++++++++-------
3 files changed, 142 insertions(+), 42 deletions(-)

diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
index 41f37fe..92b68e6 100644
--- a/Documentation/cgroups/freezer-subsystem.txt
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -100,3 +100,13 @@ things happens:
and returns EINVAL)
3) The tasks that blocked the cgroup from entering the "FROZEN"
state disappear from the cgroup's set of tasks.
+
+When the cgroup freezer is used to guard container checkpoint operations the
+freezer.state may be "CHECKPOINTING". "CHECKPOINTING" can only be set on a
+"FROZEN" cgroup using the checkpoint system call. Once in the "CHECKPOINTING"
+state, the cgroup may not leave until the checkpoint system call returns the
+freezer state to "FROZEN". Writing any new state to freezer.state while
+checkpointing will return EBUSY. These semantics ensure that userspace cannot
+unfreeze the cgroup midway through the checkpoint system call. Note that,
+unlike "FROZEN" and "FREEZING", there is no corresponding "CHECKPOINTED"
+state.
diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index da7e52b..3d32641 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -65,11 +65,19 @@ extern void cancel_freezing(struct task_struct *p);

#ifdef CONFIG_CGROUP_FREEZER
extern int cgroup_freezing_or_frozen(struct task_struct *task);
+extern int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q);
+extern int cgroup_freezer_begin_checkpoint(struct task_struct *task);
+extern void cgroup_freezer_end_checkpoint(struct task_struct *task);
#else /* !CONFIG_CGROUP_FREEZER */
static inline int cgroup_freezing_or_frozen(struct task_struct *task)
{
return 0;
}
+static inline int in_same_cgroup_freezer(struct task_struct *p,
+ struct task_struct *q)
+{
+ return 0;
+}
#endif /* !CONFIG_CGROUP_FREEZER */

/*
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 22fce5d..87dfbfb 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -25,6 +25,7 @@ enum freezer_state {
CGROUP_THAWED = 0,
CGROUP_FREEZING,
CGROUP_FROZEN,
+ CGROUP_CHECKPOINTING,
};

struct freezer {
@@ -63,6 +64,44 @@ int cgroup_freezing_or_frozen(struct task_struct *task)
return (state == CGROUP_FREEZING) || (state == CGROUP_FROZEN);
}

+/* Task is frozen or will freeze immediately when next it gets woken */
+static bool is_task_frozen_enough(struct task_struct *task)
+{
+ return frozen(task) ||
+ (task_is_stopped_or_traced(task) && freezing(task));
+}
+
+/*
+ * caller must hold freezer->lock
+ */
+static void update_freezer_state(struct cgroup *cgroup,
+ struct freezer *freezer)
+{
+ struct cgroup_iter it;
+ struct task_struct *task;
+ unsigned int nfrozen = 0, ntotal = 0;
+
+ cgroup_iter_start(cgroup, &it);
+ while ((task = cgroup_iter_next(cgroup, &it))) {
+ ntotal++;
+ if (is_task_frozen_enough(task))
+ nfrozen++;
+ }
+
+ /*
+ * Transition to FROZEN when no new tasks can be added ensures
+ * that we never exist in the FROZEN state while there are unfrozen
+ * tasks.
+ */
+ if (nfrozen == ntotal)
+ freezer->state = CGROUP_FROZEN;
+ else if (nfrozen > 0)
+ freezer->state = CGROUP_FREEZING;
+ else
+ freezer->state = CGROUP_THAWED;
+ cgroup_iter_end(cgroup, &it);
+}
+
/*
* cgroups_write_string() limits the size of freezer state strings to
* CGROUP_LOCAL_BUFFER_SIZE
@@ -71,6 +110,7 @@ static const char *freezer_state_strs[] = {
"THAWED",
"FREEZING",
"FROZEN",
+ "CHECKPOINTING",
};

/*
@@ -78,9 +118,9 @@ static const char *freezer_state_strs[] = {
* Transitions are caused by userspace writes to the freezer.state file.
* The values in parenthesis are state labels. The rest are edge labels.
*
- * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN)
- * ^ ^ | |
- * | \_______THAWED_______/ |
+ * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN) --> (CHECKPOINTING)
+ * ^ ^ | | ^ |
+ * | \_______THAWED_______/ | \_____________/
* \__________________________THAWED____________/
*/

@@ -153,13 +193,6 @@ static void freezer_destroy(struct cgroup_subsys *ss,
kfree(cgroup_freezer(cgroup));
}

-/* Task is frozen or will freeze immediately when next it gets woken */
-static bool is_task_frozen_enough(struct task_struct *task)
-{
- return frozen(task) ||
- (task_is_stopped_or_traced(task) && freezing(task));
-}
-
/*
* The call to cgroup_lock() in the freezer.state write method prevents
* a write to that file racing against an attach, and hence the
@@ -216,37 +249,6 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
spin_unlock_irq(&freezer->lock);
}

-/*
- * caller must hold freezer->lock
- */
-static void update_freezer_state(struct cgroup *cgroup,
- struct freezer *freezer)
-{
- struct cgroup_iter it;
- struct task_struct *task;
- unsigned int nfrozen = 0, ntotal = 0;
-
- cgroup_iter_start(cgroup, &it);
- while ((task = cgroup_iter_next(cgroup, &it))) {
- ntotal++;
- if (is_task_frozen_enough(task))
- nfrozen++;
- }
-
- /*
- * Transition to FROZEN when no new tasks can be added ensures
- * that we never exist in the FROZEN state while there are unfrozen
- * tasks.
- */
- if (nfrozen == ntotal)
- freezer->state = CGROUP_FROZEN;
- else if (nfrozen > 0)
- freezer->state = CGROUP_FREEZING;
- else
- freezer->state = CGROUP_THAWED;
- cgroup_iter_end(cgroup, &it);
-}
-
static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
struct seq_file *m)
{
@@ -317,7 +319,10 @@ static int freezer_change_state(struct cgroup *cgroup,
freezer = cgroup_freezer(cgroup);

spin_lock_irq(&freezer->lock);
-
+ if (freezer->state == CGROUP_CHECKPOINTING) {
+ retval = -EBUSY;
+ goto out;
+ }
update_freezer_state(cgroup, freezer);
if (goal_state == freezer->state)
goto out;
@@ -385,3 +390,80 @@ struct cgroup_subsys freezer_subsys = {
.fork = freezer_fork,
.exit = NULL,
};
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * Caller is expected to ensure that neither @p nor @q may change its
+ * freezer cgroup during this test in a way that may affect the result.
+ * E.g., when called form c/r, @p must be in CHECKPOINTING cgroup, so
+ * may not change cgroup, and either @q is also there, or is not there
+ * and may not join.
+ */
+int in_same_cgroup_freezer(struct task_struct *p, struct task_struct *q)
+{
+ struct cgroup_subsys_state *p_css, *q_css;
+
+ task_lock(p);
+ p_css = task_subsys_state(p, freezer_subsys_id);
+ task_unlock(p);
+
+ task_lock(q);
+ q_css = task_subsys_state(q, freezer_subsys_id);
+ task_unlock(q);
+
+ return (p_css == q_css);
+}
+
+/*
+ * cgroup freezer state changes made without the aid of the cgroup filesystem
+ * must go through this function to ensure proper locking is observed.
+ */
+static int freezer_checkpointing(struct task_struct *task,
+ enum freezer_state next_state)
+{
+ struct freezer *freezer;
+ struct cgroup_subsys_state *css;
+ enum freezer_state state;
+
+ task_lock(task);
+ css = task_subsys_state(task, freezer_subsys_id);
+ css_get(css); /* make sure freezer doesn't go away */
+ freezer = container_of(css, struct freezer, css);
+ task_unlock(task);
+
+ if (freezer->state == CGROUP_FREEZING) {
+ /* May be in middle of a lazy FREEZING -> FROZEN transition */
+ if (cgroup_lock_live_group(css->cgroup)) {
+ spin_lock_irq(&freezer->lock);
+ update_freezer_state(css->cgroup, freezer);
+ spin_unlock_irq(&freezer->lock);
+ cgroup_unlock();
+ }
+ }
+
+ spin_lock_irq(&freezer->lock);
+ state = freezer->state;
+ if ((state == CGROUP_FROZEN && next_state == CGROUP_CHECKPOINTING) ||
+ (state == CGROUP_CHECKPOINTING && next_state == CGROUP_FROZEN))
+ freezer->state = next_state;
+ spin_unlock_irq(&freezer->lock);
+ css_put(css);
+ return state;
+}
+
+int cgroup_freezer_begin_checkpoint(struct task_struct *task)
+{
+ if (freezer_checkpointing(task, CGROUP_CHECKPOINTING) != CGROUP_FROZEN)
+ return -EBUSY;
+ return 0;
+}
+
+void cgroup_freezer_end_checkpoint(struct task_struct *task)
+{
+ /*
+ * If we weren't in CHECKPOINTING state then userspace could have
+ * unfrozen a task and given us an inconsistent checkpoint image
+ */
+ WARN_ON(freezer_checkpointing(task, CGROUP_FROZEN) != CGROUP_CHECKPOINTING);
+}
+#endif /* CONFIG_CHECKPOINT */
--
1.6.0.4

2009-07-22 10:10:35

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 22/60] c/r: external checkpoint of a task other than ourself

Now we can do "external" checkpoint, i.e. act on another task.

sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints that corresponding task. That task should be the root of
a container, unless CHECKPOINT_SUBTREE flag is given.

Set state of freezer cgroup of checkpointed task hierarchy to
"CHECKPOINTING" during a checkpoint, to ensure that task(s) cannot be
thawed while at it.

Ensure that all tasks belong to root task's freezer cgroup (the root
task is also tested, to detect it if changes its freezer cgroups
before it moves to "CHECKPOINTING").

sys_restart() remains nearly the same, as the restart is always done
in the context of the restarting task. However, the original task may
have been frozen from user space, or interrupted from a syscall for
the checkpoint. This is accounted for by restoring a suitable retval
for the restarting task, according to how it was checkpointed.

Changelog[v17]:
- Move restore_retval() to this patch
- Tighten ptrace ceckpoint for checkpoint to PTRACE_MODE_ATTACH
- Use CHECKPOINTING state for hierarchy's freezer for checkpoint
Changelog[v16]:
- Use CHECKPOINT_SUBTREE to allow subtree (partial container)
Changelog[v14]:
- Refuse non-self checkpoint if target task isn't frozen
Changelog[v12]:
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v11]:
- Copy contents of 'init->fs->root' instead of pointing to them
Changelog[v10]:
- Grab vfs root of container init, rather than current process

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Kconfig | 1 +
checkpoint/checkpoint.c | 99 +++++++++++++++++++++++++++++++++++++-
checkpoint/restart.c | 61 +++++++++++++++++++++++-
checkpoint/sys.c | 10 ++++
include/linux/checkpoint_types.h | 7 ++-
5 files changed, 175 insertions(+), 3 deletions(-)

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index ef7d406..21fc86b 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -5,6 +5,7 @@
config CHECKPOINT
bool "Checkpoint/restart (EXPERIMENTAL)"
depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+ depends on CGROUP_FREEZER
help
Application checkpoint/restart is the ability to save the
state of a running application so that it can later resume
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index a465fb6..226735c 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -12,6 +12,9 @@
#define CKPT_DFLAG CKPT_DSYS

#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/freezer.h>
+#include <linux/ptrace.h>
#include <linux/time.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -255,14 +258,106 @@ static int checkpoint_write_tail(struct ckpt_ctx *ctx)
return ret;
}

+static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ if (t->state == TASK_DEAD) {
+ pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
+ return -EAGAIN;
+ }
+
+ if (!ptrace_may_access(t, PTRACE_MODE_ATTACH)) {
+ __ckpt_write_err(ctx, "access to task %d (%s) denied",
+ task_pid_vnr(t), t->comm);
+ return -EPERM;
+ }
+
+ /* verify that all tasks belongs to same freezer cgroup */
+ if (t != current && !in_same_cgroup_freezer(t, ctx->root_freezer)) {
+ __ckpt_write_err(ctx, "task %d (%s) not frozen (wrong cgroup)",
+ task_pid_vnr(t), t->comm);
+ return -EBUSY;
+ }
+
+ /* FIX: add support for ptraced tasks */
+ if (task_ptrace(t)) {
+ __ckpt_write_err(ctx, "task %d (%s) is ptraced",
+ task_pid_vnr(t), t->comm);
+ return -EBUSY;
+ }
+
+ return 0;
+}
+
+/* setup checkpoint-specific parts of ctx */
+static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+ struct task_struct *task;
+ struct nsproxy *nsproxy;
+ int ret;
+
+ /*
+ * No need for explicit cleanup here, because if an error
+ * occurs then ckpt_ctx_free() is eventually called.
+ */
+
+ ctx->root_pid = pid;
+
+ /* root task */
+ read_lock(&tasklist_lock);
+ task = find_task_by_vpid(pid);
+ if (task)
+ get_task_struct(task);
+ read_unlock(&tasklist_lock);
+ if (!task)
+ return -ESRCH;
+ else
+ ctx->root_task = task;
+
+ /* root nsproxy */
+ rcu_read_lock();
+ nsproxy = task_nsproxy(task);
+ if (nsproxy)
+ get_nsproxy(nsproxy);
+ rcu_read_unlock();
+ if (!nsproxy)
+ return -ESRCH;
+ else
+ ctx->root_nsproxy = nsproxy;
+
+ /* root freezer */
+ ctx->root_freezer = task;
+ geT_task_struct(task);
+
+ ret = may_checkpoint_task(ctx, task);
+ if (ret) {
+ ckpt_write_err(ctx, NULL);
+ put_task_struct(task);
+ put_task_struct(task);
+ put_nsproxy(nsproxy);
+ return ret;
+ }
+
+ return 0;
+}
+
long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
{
long ret;

+ ret = init_checkpoint_ctx(ctx, pid);
+ if (ret < 0)
+ return ret;
+
+ if (ctx->root_freezer) {
+ ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
+ if (ret < 0)
+ return ret;
+ }
+
ret = checkpoint_write_header(ctx);
if (ret < 0)
goto out;
- ret = checkpoint_task(ctx, current);
+ ret = checkpoint_task(ctx, ctx->root_task);
if (ret < 0)
goto out;
ret = checkpoint_write_tail(ctx);
@@ -273,5 +368,7 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
ctx->crid = atomic_inc_return(&ctx_count);
ret = ctx->crid;
out:
+ if (ctx->root_freezer)
+ cgroup_freezer_end_checkpoint(ctx->root_freezer);
return ret;
}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 17135fe..62e19b4 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -322,10 +322,67 @@ static int restore_read_tail(struct ckpt_ctx *ctx)
return ret;
}

+static long restore_retval(void)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ long ret;
+
+ /*
+ * For the restart, we entered the kernel via sys_restart(),
+ * so our return path is via the syscall exit. In particular,
+ * the code in entry.S will put the value that we will return
+ * into a register (e.g. regs->eax in x86), thus passing it to
+ * the caller task.
+ *
+ * What we do now depends on what happened to the checkpointed
+ * task right before the checkpoint - there are three cases:
+ *
+ * 1) It was carrying out a syscall when became frozen, or
+ * 2) It was running in userspace, or
+ * 3) It was doing a self-checkpoint
+ *
+ * In case #1, if the syscall succeeded, perhaps partially,
+ * then the retval is non-negative. If it failed, the error
+ * may be one of -ERESTART..., which is interpreted in the
+ * signal handling code. If that is the case, we force the
+ * signal handler to kick in by faking a signal to ourselves
+ * (a la freeze/thaw) when ret < 0.
+ *
+ * In case #2, our return value will overwrite the original
+ * value in the affected register. Workaround by simply using
+ * that saved value of that register as our retval.
+ *
+ * In case #3, then the state was recorded while the task was
+ * in checkpoint(2) syscall. The syscall is execpted to return
+ * 0 when returning from a restart. Fortunately, this already
+ * has been arranged for at checkpoint time (the register that
+ * holds the retval, e.g. regs->eax in x86, was set to
+ * zero).
+ */
+
+ /* needed for all 3 cases: get old value/error/retval */
+ ret = syscall_get_return_value(current, regs);
+
+ /* if from a syscall and returning error, kick in signal handlig */
+ if (syscall_get_nr(current, regs) >= 0 && ret < 0)
+ set_tsk_thread_flag(current, TIF_SIGPENDING);
+
+ return ret;
+}
+
+/* setup restart-specific parts of ctx */
+static int init_restart_ctx(struct ckpt_ctx *ctx, pid_t pid)
+{
+ return 0;
+}
+
long do_restart(struct ckpt_ctx *ctx, pid_t pid)
{
long ret;

+ ret = init_restart_ctx(ctx, pid);
+ if (ret < 0)
+ return ret;
ret = restore_read_header(ctx);
if (ret < 0)
return ret;
@@ -333,7 +390,9 @@ long do_restart(struct ckpt_ctx *ctx, pid_t pid)
if (ret < 0)
return ret;
ret = restore_read_tail(ctx);
+ if (ret < 0)
+ return ret;

/* on success, adjust the return value if needed [TODO] */
- return ret;
+ return restore_retval(ctx);
}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7f6f71e..dda2c21 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -12,7 +12,9 @@
#define CKPT_DFLAG CKPT_DSYS

#include <linux/sched.h>
+#include <linux/nsproxy.h>
#include <linux/kernel.h>
+#include <linux/cgroup.h>
#include <linux/syscalls.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -168,6 +170,14 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
{
if (ctx->file)
fput(ctx->file);
+
+ if (ctx->root_nsproxy)
+ put_nsproxy(ctx->root_nsproxy);
+ if (ctx->root_task)
+ put_task_struct(ctx->root_task);
+ if (ctx->root_freezer)
+ put_task_struct(ctx->root_freezer);
+
kfree(ctx);
}

diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 203ecac..21b5965 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -12,12 +12,17 @@

#ifdef __KERNEL__

+#include <linux/sched.h>
+#include <linux/nsproxy.h>
#include <linux/fs.h>

struct ckpt_ctx {
int crid; /* unique checkpoint id */

- pid_t root_pid; /* container identifier */
+ pid_t root_pid; /* [container] root pid */
+ struct task_struct *root_task; /* [container] root task */
+ struct nsproxy *root_nsproxy; /* [container] root nsproxy */
+ struct task_struct *root_freezer; /* [container] root task */

unsigned long kflags; /* kerenl flags */
unsigned long uflags; /* user flags */
--
1.6.0.4

2009-07-22 10:18:51

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 46/60] c/r: support for UTS namespace

From: Dan Smith <[email protected]>

This patch adds a "phase" of checkpoint that saves out information about any
namespaces the task(s) may have. Do this by tracking the namespace objects
of the tasks and making sure that tasks with the same namespace that follow
get properly referenced in the checkpoint stream.

Changes[v17]:
- Collect nsproxy->uts_ns
- Save uts string lengths once in ckpt_hdr_const
- Save and restore all fields of uts-ns
- Don't overwrite global uts-ns if !CONFIG_UTS_NS
- Replace sys_unshare() with create_uts_ns()
- Take uts_sem around access to uts data
Changes:
- Remove the kernel restore path
- Punt on nested namespaces
- Use __NEW_UTS_LEN in nodename and domainname buffers
- Add a note to Documentation/checkpoint/internals.txt to indicate where
in the save/restore process the UTS information is kept
- Store (and track) the objref of the namespace itself instead of the
nsproxy (based on comments from Dave on IRC)
- Remove explicit check for non-root nsproxy
- Store the nodename and domainname lengths and use ckpt_write_string()
to store the actual name strings
- Catch failure of ckpt_obj_add_ptr() in ckpt_write_namespaces()
- Remove "types" bitfield and use the "is this new" flag to determine
whether or not we should write out a new ns descriptor
- Replace kernel restore path
- Move the namespace information to be directly after the task
information record
- Update Documentation to reflect new location of namespace info
- Support checkpoint and restart of nested UTS namespaces

Signed-off-by: Dan Smith <[email protected]>
Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Makefile | 1 +
checkpoint/checkpoint.c | 5 +-
checkpoint/namespace.c | 100 ++++++++++++++++++++++++++++++++++++++
checkpoint/objhash.c | 26 ++++++++++
checkpoint/process.c | 2 +
checkpoint/restart.c | 32 ++++++++++++
include/linux/checkpoint.h | 5 ++
include/linux/checkpoint_hdr.h | 16 ++++++
include/linux/checkpoint_types.h | 6 ++
include/linux/utsname.h | 1 +
kernel/nsproxy.c | 47 +++++++++++++++++-
kernel/utsname.c | 3 +-
12 files changed, 240 insertions(+), 4 deletions(-)
create mode 100644 checkpoint/namespace.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index f56a7d6..bb2c0ca 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -8,5 +8,6 @@ obj-$(CONFIG_CHECKPOINT) += \
checkpoint.o \
restart.o \
process.o \
+ namespace.o \
files.o \
memory.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index af6b58b..39ee917 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -191,9 +191,12 @@ static void fill_kernel_const(struct ckpt_hdr_const *h)
/* mm */
h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
/* uts */
+ h->uts_sysname_len = sizeof(uts->sysname);
+ h->uts_nodename_len = sizeof(uts->nodename);
h->uts_release_len = sizeof(uts->release);
h->uts_version_len = sizeof(uts->version);
h->uts_machine_len = sizeof(uts->machine);
+ h->uts_domainname_len = sizeof(uts->domainname);
}

/* write the checkpoint header */
@@ -328,8 +331,6 @@ static int may_checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)

rcu_read_lock();
nsproxy = task_nsproxy(t);
- if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns)
- ret = -EPERM;
if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns)
ret = -EPERM;
if (nsproxy->mnt_ns != ctx->root_nsproxy->mnt_ns)
diff --git a/checkpoint/namespace.c b/checkpoint/namespace.c
new file mode 100644
index 0000000..49b8f0a
--- /dev/null
+++ b/checkpoint/namespace.c
@@ -0,0 +1,100 @@
+/*
+ * Checkpoint namespaces
+ *
+ * Copyright (C) 2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
+#include <linux/nsproxy.h>
+#include <linux/user_namespace.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * uts_ns - this needs to compile even for !CONFIG_USER_NS, so
+ * the code may not reside in kernel/utsname.c (which wouldn't
+ * compile then).
+ */
+static int do_checkpoint_uts_ns(struct ckpt_ctx *ctx,
+ struct uts_namespace *uts_ns)
+{
+ struct ckpt_hdr_utsns *h;
+ struct new_utsname *name;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+ if (!h)
+ return -ENOMEM;
+
+ down_read(&uts_sem);
+ name = &uts_ns->name;
+ memcpy(h->sysname, name->sysname, sizeof(name->sysname));
+ memcpy(h->nodename, name->nodename, sizeof(name->nodename));
+ memcpy(h->release, name->release, sizeof(name->release));
+ memcpy(h->version, name->version, sizeof(name->version));
+ memcpy(h->machine, name->machine, sizeof(name->machine));
+ memcpy(h->domainname, name->domainname, sizeof(name->domainname));
+ up_read(&uts_sem);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_uts_ns(ctx, (struct uts_namespace *) ptr);
+}
+
+static struct uts_namespace *do_restore_uts_ns(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_utsns *h;
+ struct uts_namespace *uts_ns = NULL;
+ struct new_utsname *name;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_UTS_NS);
+ if (IS_ERR(h))
+ return (struct uts_namespace *) h;
+
+#ifdef CONFIG_UTS_NS
+ uts_ns = create_uts_ns();
+ if (!uts_ns) {
+ uts_ns = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+ down_read(&uts_sem);
+ name = &uts_ns->name;
+ memcpy(name->sysname, h->sysname, sizeof(name->sysname));
+ memcpy(name->nodename, h->nodename, sizeof(name->nodename));
+ memcpy(name->release, h->release, sizeof(name->release));
+ memcpy(name->version, h->version, sizeof(name->version));
+ memcpy(name->machine, h->machine, sizeof(name->machine));
+ memcpy(name->domainname, h->domainname, sizeof(name->domainname));
+ up_read(&uts_sem);
+#else
+ /* complain if image contains multiple namespaces */
+ if (ctx->stats.uts_ns) {
+ uts_ns = ERR_PTR(-EEXIST);
+ goto out;
+ }
+ uts_ns = current->nsproxy->uts_ns;
+ get_uts_ns(uts_ns);
+#endif
+
+ ctx->stats.uts_ns++;
+ out:
+ ckpt_hdr_put(ctx, h);
+ return uts_ns;
+}
+
+void *restore_uts_ns(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_uts_ns(ctx);
+}
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 18ede6f..caa856c 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -148,6 +148,22 @@ static int obj_ns_users(void *ptr)
return atomic_read(&((struct nsproxy *) ptr)->count);
}

+static int obj_uts_ns_grab(void *ptr)
+{
+ get_uts_ns((struct uts_namespace *) ptr);
+ return 0;
+}
+
+static void obj_uts_ns_drop(void *ptr)
+{
+ put_uts_ns((struct uts_namespace *) ptr);
+}
+
+static int obj_uts_ns_users(void *ptr)
+{
+ return atomic_read(&((struct uts_namespace *) ptr)->kref.refcount);
+}
+
static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -205,6 +221,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.checkpoint = checkpoint_ns,
.restore = restore_ns,
},
+ /* uts_ns object */
+ {
+ .obj_name = "UTS_NS",
+ .obj_type = CKPT_OBJ_UTS_NS,
+ .ref_drop = obj_uts_ns_drop,
+ .ref_grab = obj_uts_ns_grab,
+ .ref_users = obj_uts_ns_users,
+ .checkpoint = checkpoint_uts_ns,
+ .restore = restore_uts_ns,
+ },
};


diff --git a/checkpoint/process.c b/checkpoint/process.c
index 40e83c9..245607b 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -16,8 +16,10 @@
#include <linux/posix-timers.h>
#include <linux/futex.h>
#include <linux/poll.h>
+#include <linux/utsname.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
+#include <linux/syscalls.h>


#ifdef CONFIG_FUTEX
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 972bee6..935caf6 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -281,6 +281,32 @@ void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type)
return h;
}

+/**
+ * ckpt_read_consume - consume the next object of expected type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * This can be used to skip an object in the input stream when the
+ * data is unnecessary for the restart. @len indicates the length of
+ * the object); if @len is zero the length is unconstrained.
+ */
+int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type)
+{
+ struct ckpt_hdr *h;
+ int ret = 0;
+
+ h = ckpt_read_obj(ctx, len, 0);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ if (h->type != type)
+ ret = -EINVAL;
+
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
/***********************************************************************
* Restart
*/
@@ -298,12 +324,18 @@ static int check_kernel_const(struct ckpt_hdr_const *h)
if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
return -EINVAL;
/* uts */
+ if (h->uts_sysname_len != sizeof(uts->sysname))
+ return -EINVAL;
+ if (h->uts_nodename_len != sizeof(uts->nodename))
+ return -EINVAL;
if (h->uts_release_len != sizeof(uts->release))
return -EINVAL;
if (h->uts_version_len != sizeof(uts->version))
return -EINVAL;
if (h->uts_machine_len != sizeof(uts->machine))
return -EINVAL;
+ if (h->uts_domainname_len != sizeof(uts->domainname))
+ return -EINVAL;

return 0;
}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index e433b5c..0085ea8 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -66,6 +66,7 @@ extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);
+extern int ckpt_read_consume(struct ckpt_ctx *ctx, int len, int type);

/* ckpt kflags */
#define ckpt_set_ctx_kflag(__ctx, __kflag) \
@@ -131,6 +132,10 @@ extern int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_ns(struct ckpt_ctx *ctx, void *ptr);
extern void *restore_ns(struct ckpt_ctx *ctx);

+/* uts-ns */
+extern int checkpoint_uts_ns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_uts_ns(struct ckpt_ctx *ctx);
+
/* file table */
extern int ckpt_collect_file_table(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_obj_file_table(struct ckpt_ctx *ctx,
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index af18332..18ab78f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -58,6 +58,7 @@ enum {
CKPT_HDR_THREAD,
CKPT_HDR_CPU,
CKPT_HDR_NS,
+ CKPT_HDR_UTS_NS,

/* 201-299: reserved for arch-dependent */

@@ -97,6 +98,7 @@ enum obj_type {
CKPT_OBJ_FILE,
CKPT_OBJ_MM,
CKPT_OBJ_NS,
+ CKPT_OBJ_UTS_NS,
CKPT_OBJ_MAX
};

@@ -107,9 +109,12 @@ struct ckpt_hdr_const {
/* mm */
__u16 mm_saved_auxv_len;
/* uts */
+ __u16 uts_sysname_len;
+ __u16 uts_nodename_len;
__u16 uts_release_len;
__u16 uts_version_len;
__u16 uts_machine_len;
+ __u16 uts_domainname_len;
} __attribute__((aligned(8)));

/* checkpoint image header */
@@ -184,6 +189,7 @@ struct ckpt_hdr_task_ns {

struct ckpt_hdr_ns {
struct ckpt_hdr h;
+ __s32 uts_objref;
} __attribute__((aligned(8)));

/* task's shared resources */
@@ -261,6 +267,16 @@ struct ckpt_hdr_file_pipe_state {
__s32 pipe_len;
} __attribute__((aligned(8)));

+struct ckpt_hdr_utsns {
+ struct ckpt_hdr h;
+ char sysname[__NEW_UTS_LEN + 1];
+ char nodename[__NEW_UTS_LEN + 1];
+ char release[__NEW_UTS_LEN + 1];
+ char version[__NEW_UTS_LEN + 1];
+ char machine[__NEW_UTS_LEN + 1];
+ char domainname[__NEW_UTS_LEN + 1];
+} __attribute__((aligned(8)));
+
/* memory layout */
struct ckpt_hdr_mm {
struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 57cbc96..0a9c58b 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -20,6 +20,10 @@
#include <linux/ktime.h>
#include <linux/wait.h>

+struct ckpt_stats {
+ int uts_ns;
+};
+
struct ckpt_ctx {
int crid; /* unique checkpoint id */

@@ -59,6 +63,8 @@ struct ckpt_ctx {
int active_pid; /* (next) position in pids array */
struct completion complete; /* container root and other tasks on */
wait_queue_head_t waitq; /* start, end, and restart ordering */
+
+ struct ckpt_stats stats; /* statistics */
};

#endif /* __KERNEL__ */
diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index 3656b30..d6f24a9 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -50,6 +50,7 @@ static inline void get_uts_ns(struct uts_namespace *ns)
kref_get(&ns->kref);
}

+extern struct uts_namespace *create_uts_ns(void);
extern struct uts_namespace *copy_utsname(unsigned long flags,
struct uts_namespace *ns);
extern void free_uts_ns(struct kref *kref);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 54cb987..4f48a68 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -245,6 +245,10 @@ int ckpt_collect_ns(struct ckpt_ctx *ctx, struct task_struct *t)
if (ret < 0 || exists)
goto out;

+ ret = ckpt_obj_collect(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
+ if (ret < 0)
+ goto out;
+
/* TODO: collect other namespaces here */
out:
put_nsproxy(nsproxy);
@@ -260,9 +264,14 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy)
if (!h)
return -ENOMEM;

+ ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS);
+ if (ret <= 0)
+ goto out;
+ h->uts_objref = ret;
/* TODO: Write other namespaces here */

ret = ckpt_write_obj(ctx, &h->h);
+ out:
ckpt_hdr_put(ctx, h);
return ret;
}
@@ -277,16 +286,52 @@ static struct nsproxy *do_restore_ns(struct ckpt_ctx *ctx)
{
struct ckpt_hdr_ns *h;
struct nsproxy *nsproxy = NULL;
+ struct uts_namespace *uts_ns;
+ int ret;

h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_NS);
if (IS_ERR(h))
return (struct nsproxy *) h;

+ ret = -EINVAL;
+ if (h->uts_objref <= 0)
+ goto out;
+
+ uts_ns = ckpt_obj_fetch(ctx, h->uts_objref, CKPT_OBJ_UTS_NS);
+ if (IS_ERR(uts_ns)) {
+ ret = PTR_ERR(uts_ns);
+ goto out;
+ }
+
+#if defined(COFNIG_UTS_NS)
+ ret = -ENOMEM;
+ nsproxy = create_nsproxy();
+ if (!nsproxy)
+ goto out;
+
+ get_uts_ns(uts_ns);
+ nsproxy->uts_ns = uts_ns;
+
+ get_ipc_ns(current->nsproxy->ipc_ns);
+ nsproxy->ipc_ns = ipc_ns;
+ get_pid_ns(current->nsproxy->pid_ns);
+ nsproxy->pid_ns = current->nsproxy->pid_ns;
+ get_mnt_ns(current->nsproxy->mnt_ns);
+ nsproxy->mnt_ns = current->nsproxy->mnt_ns;
+ get_net(current->nsproxy->net_ns);
+ nsproxy->net_ns = current->nsproxy->net_ns;
+#else
nsproxy = current->nsproxy;
get_nsproxy(nsproxy);

- /* TODO: add more namespaces here */
+ BUG_ON(nsproxy->uts_ns != uts_ns);
+#endif

+ /* TODO: add more namespaces here */
+ ret = 0;
+ out:
+ if (ret < 0)
+ nsproxy = ERR_PTR(ret);
ckpt_hdr_put(ctx, h);
return nsproxy;
}
diff --git a/kernel/utsname.c b/kernel/utsname.c
index 8a82b4b..c82ed83 100644
--- a/kernel/utsname.c
+++ b/kernel/utsname.c
@@ -14,8 +14,9 @@
#include <linux/utsname.h>
#include <linux/err.h>
#include <linux/slab.h>
+#include <linux/checkpoint.h>

-static struct uts_namespace *create_uts_ns(void)
+struct uts_namespace *create_uts_ns(void)
{
struct uts_namespace *uts_ns;

--
1.6.0.4

2009-07-22 10:10:29

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 50/60] c/r: support share-memory sysv-ipc

Checkpoint of sysvipc shared memory is performed in two steps: first,
the entire ipc namespace is dumped as a whole by iterating through all
shm objects and dumping the contents of each one. The shmem inode is
registered in the objhash. Second, for each vma that refers to ipc
shared memory we find the inode in the objhash, and save the objref.

(If we find a new inode, that indicates that the ipc namespace is not
entirely frozen and someone must have manipulated it since step 1).

Handling of shm objects that have been deleted (via IPC_RMID) is left
to a later patch in this series.

Changelog[v17]:
- Restore objects in the right namespace
- Properly initialize ctx->deferqueue
- Fix compilation with CONFIG_CHECKPOINT=n

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/memory.c | 28 ++++-
checkpoint/sys.c | 13 ++
include/linux/checkpoint.h | 3 +
include/linux/checkpoint_hdr.h | 19 +++-
include/linux/checkpoint_types.h | 1 +
include/linux/shm.h | 15 ++
ipc/Makefile | 2 +-
ipc/checkpoint.c | 4 +-
ipc/checkpoint_shm.c | 261 ++++++++++++++++++++++++++++++++++++++
ipc/shm.c | 84 +++++++++++-
ipc/util.h | 8 +
11 files changed, 424 insertions(+), 14 deletions(-)
create mode 100644 ipc/checkpoint_shm.c

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index 77234cd..73709b8 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -20,6 +20,7 @@
#include <linux/mman.h>
#include <linux/pagemap.h>
#include <linux/mm_types.h>
+#include <linux/shm.h>
#include <linux/proc_fs.h>
#include <linux/swap.h>
#include <linux/checkpoint.h>
@@ -459,9 +460,9 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
* virtual addresses into ctx->pgarr_list page-array chain. Then dump
* the addresses, followed by the page contents.
*/
-static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
- struct vm_area_struct *vma,
- struct inode *inode)
+int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma,
+ struct inode *inode)
{
struct ckpt_hdr_pgarr *h;
unsigned long addr, end;
@@ -1077,6 +1078,13 @@ static int anon_private_restore(struct ckpt_ctx *ctx,
return private_vma_restore(ctx, mm, NULL, h);
}

+static int bad_vma_restore(struct ckpt_ctx *ctx,
+ struct mm_struct *mm,
+ struct ckpt_hdr_vma *h)
+{
+ return -EINVAL;
+}
+
/* callbacks to restore vma per its type: */
struct restore_vma_ops {
char *vma_name;
@@ -1129,6 +1137,20 @@ static struct restore_vma_ops restore_vma_ops[] = {
.vma_type = CKPT_VMA_SHM_FILE,
.restore = filemap_restore,
},
+ /* sysvipc shared */
+ {
+ .vma_name = "IPC SHARED",
+ .vma_type = CKPT_VMA_SHM_IPC,
+ /* ipc inode itself is restore by restore_ipc_ns()... */
+ .restore = bad_vma_restore,
+
+ },
+ /* sysvipc shared (skip) */
+ {
+ .vma_name = "IPC SHARED (skip)",
+ .vma_type = CKPT_VMA_SHM_IPC_SKIP,
+ .restore = ipcshm_restore,
+ },
};

/**
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 4351c28..525182a 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -21,6 +21,7 @@
#include <linux/uaccess.h>
#include <linux/capability.h>
#include <linux/checkpoint.h>
+#include <linux/deferqueue.h>

/*
* ckpt_unpriv_allowed - sysctl controlled, do not allow checkpoints or
@@ -189,8 +190,17 @@ static void task_arr_free(struct ckpt_ctx *ctx)

static void ckpt_ctx_free(struct ckpt_ctx *ctx)
{
+ int ret;
+
BUG_ON(atomic_read(&ctx->refcount));

+ if (ctx->deferqueue) {
+ ret = deferqueue_run(ctx->deferqueue);
+ if (ret != 0)
+ pr_warning("c/r: deferqueue had %d entries\n", ret);
+ deferqueue_destroy(ctx->deferqueue);
+ }
+
if (ctx->file)
fput(ctx->file);

@@ -240,6 +250,9 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
err = -ENOMEM;
if (ckpt_obj_hash_alloc(ctx) < 0)
goto err;
+ ctx->deferqueue = deferqueue_create();
+ if (!ctx->deferqueue)
+ goto err;

atomic_inc(&ctx->refcount);
return ctx;
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 9d6b0cc..aeae2fa 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -199,6 +199,9 @@ extern unsigned long generic_vma_restore(struct mm_struct *mm,
extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
struct file *file, struct ckpt_hdr_vma *h);

+extern int checkpoint_memory_contents(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma,
+ struct inode *inode);
extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);


diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3159750..f4c3f7b 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -308,7 +308,9 @@ enum vma_type {
CKPT_VMA_SHM_ANON, /* shared anonymous */
CKPT_VMA_SHM_ANON_SKIP, /* shared anonymous (skip contents) */
CKPT_VMA_SHM_FILE, /* shared mapped file, only msync */
- CKPT_VMA_MAX
+ CKPT_VMA_SHM_IPC, /* shared sysvipc */
+ CKPT_VMA_SHM_IPC_SKIP, /* shared sysvipc (skip contents) */
+ CKPT_VMA_MAX,
};

/* vma descriptor */
@@ -358,6 +360,7 @@ struct ckpt_hdr_ipc {
} __attribute__((aligned(8)));

struct ckpt_hdr_ipc_perms {
+ struct ckpt_hdr h;
__s32 id;
__u32 key;
__u32 uid;
@@ -369,6 +372,20 @@ struct ckpt_hdr_ipc_perms {
__u64 seq;
} __attribute__((aligned(8)));

+struct ckpt_hdr_ipc_shm {
+ struct ckpt_hdr h;
+ struct ckpt_hdr_ipc_perms perms;
+ __u64 shm_segsz;
+ __u64 shm_atim;
+ __u64 shm_dtim;
+ __u64 shm_ctim;
+ __s32 shm_cprid;
+ __s32 shm_lprid;
+ __u32 mlock_uid;
+ __u32 flags;
+ __u32 objref;
+} __attribute__((aligned(8)));
+

#define CKPT_TST_OVERFLOW_16(a, b) \
((sizeof(a) > sizeof(b)) && ((a) > SHORT_MAX))
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 9ffa492..fb9b5b2 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -46,6 +46,7 @@ struct ckpt_ctx {
atomic_t refcount;

struct ckpt_obj_hash *obj_hash; /* repository for shared objects */
+ struct deferqueue_head *deferqueue; /* queue of deferred work */

struct path fs_mnt; /* container root (FIXME) */

diff --git a/include/linux/shm.h b/include/linux/shm.h
index eca6235..94ac1a7 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -118,6 +118,21 @@ static inline int is_file_shm_hugepages(struct file *file)
}
#endif

+struct ipc_namespace;
+extern int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+ struct shmid_ds __user *buf, int version);
+
+#ifdef CONFIG_CHECKPOINT
+#ifdef CONFIG_SYSVIPC
+struct ckpt_ctx;
+struct ckpt_hdr_vma;
+extern int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+ struct ckpt_hdr_vma *h);
+#else
+#define ipcshm_restore NULL
+#endif
+#endif
+
#endif /* __KERNEL__ */

#endif /* _LINUX_SHM_H_ */
diff --git a/ipc/Makefile b/ipc/Makefile
index b747127..db4b076 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj_mq-$(CONFIG_COMPAT) += compat_mq.o
obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o $(obj_mq-y)
obj-$(CONFIG_IPC_NS) += namespace.o
obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o
+obj-$(CONFIG_SYSVIPC_CHECKPOINT) += checkpoint.o checkpoint_shm.o
diff --git a/ipc/checkpoint.c b/ipc/checkpoint.c
index 4eb1a97..9062dc6 100644
--- a/ipc/checkpoint.c
+++ b/ipc/checkpoint.c
@@ -113,9 +113,9 @@ static int do_checkpoint_ipc_ns(struct ckpt_ctx *ctx,
if (ret < 0)
return ret;

-#if 0 /* NEXT FEW PATCHES */
ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
CKPT_HDR_IPC_SHM, checkpoint_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
if (ret < 0)
return ret;
ret = checkpoint_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
@@ -286,9 +286,9 @@ static struct ipc_namespace *do_restore_ipc_ns(struct ckpt_ctx *ctx)
get_ipc_ns(ipc_ns);
#endif

-#if 0 /* NEXT FEW PATCHES */
ret = restore_ipc_any(ctx, ipc_ns, IPC_SHM_IDS,
CKPT_HDR_IPC_SHM, restore_ipc_shm);
+#if 0 /* NEXT FEW PATCHES */
if (ret < 0)
goto out;
ret = restore_ipc_any(ctx, ipc_ns, IPC_MSG_IDS,
diff --git a/ipc/checkpoint_shm.c b/ipc/checkpoint_shm.c
new file mode 100644
index 0000000..7f0bdd7
--- /dev/null
+++ b/ipc/checkpoint_shm.c
@@ -0,0 +1,261 @@
+/*
+ * Checkpoint/restart - dump state of sysvipc shm
+ *
+ * Copyright (C) 2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DIPC
+
+#include <linux/mm.h>
+#include <linux/shm.h>
+#include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/syscalls.h>
+#include <linux/nsproxy.h>
+#include <linux/ipc_namespace.h>
+#include <linux/deferqueue.h>
+
+#include <linux/msg.h> /* needed for util.h that uses 'struct msg_msg' */
+#include "util.h"
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/************************************************************************
+ * ipc checkpoint
+ */
+
+static int fill_ipc_shm_hdr(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_ipc_shm *h,
+ struct shmid_kernel *shp)
+{
+ int ret = 0;
+
+ ipc_lock_by_ptr(&shp->shm_perm);
+
+ ret = checkpoint_fill_ipc_perms(&h->perms, &shp->shm_perm);
+ if (ret < 0)
+ goto unlock;
+
+ h->shm_segsz = shp->shm_segsz;
+ h->shm_atim = shp->shm_atim;
+ h->shm_dtim = shp->shm_dtim;
+ h->shm_ctim = shp->shm_ctim;
+ h->shm_cprid = shp->shm_cprid;
+ h->shm_lprid = shp->shm_lprid;
+
+ if (shp->mlock_user)
+ h->mlock_uid = shp->mlock_user->uid;
+ else
+ h->mlock_uid = (unsigned int) -1;
+
+ h->flags = 0;
+ /* check if shm was setup with SHM_NORESERVE */
+ if (SHMEM_I(shp->shm_file->f_dentry->d_inode)->flags & VM_NORESERVE)
+ h->flags |= SHM_NORESERVE;
+ /* check if shm was setup with SHM_HUGETLB (unsupported yet) */
+ if (is_file_hugepages(shp->shm_file)) {
+ pr_warning("c/r: unsupported SHM_HUGETLB\n");
+ ret = -ENOSYS;
+ }
+
+ unlock:
+ ipc_unlock(&shp->shm_perm);
+ ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+ h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+ return ret;
+}
+
+int checkpoint_ipc_shm(int id, void *p, void *data)
+{
+ struct ckpt_hdr_ipc_shm *h;
+ struct ckpt_ctx *ctx = (struct ckpt_ctx *) data;
+ struct kern_ipc_perm *perm = (struct kern_ipc_perm *) p;
+ struct shmid_kernel *shp;
+ struct inode *inode;
+ int first, objref;
+ int ret;
+
+ shp = container_of(perm, struct shmid_kernel, shm_perm);
+ inode = shp->shm_file->f_dentry->d_inode;
+
+ objref = ckpt_obj_lookup_add(ctx, inode, CKPT_OBJ_INODE, &first);
+ if (objref < 0)
+ return objref;
+ /* this must be the first time we see this region */
+ BUG_ON(!first);
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+ if (!h)
+ return -ENOMEM;
+
+ ret = fill_ipc_shm_hdr(ctx, h, shp);
+ if (ret < 0)
+ goto out;
+
+ h->objref = objref;
+ ckpt_debug("shm: objref %d\n", h->objref);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ if (ret < 0)
+ goto out;
+
+ ret = checkpoint_memory_contents(ctx, NULL, inode);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+/************************************************************************
+ * ipc restart
+ */
+
+struct dq_ipcshm_del {
+ /*
+ * XXX: always keep ->ipcns first so that put_ipc_ns() can
+ * be safely provided as the dtor for this deferqueue object
+ */
+ struct ipc_namespace *ipcns;
+ int id;
+};
+
+static int ipc_shm_delete(void *data)
+{
+ struct dq_ipcshm_del *dq = (struct dq_ipcshm_del *) data;
+ mm_segment_t old_fs;
+ int ret;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ ret = shmctl_down(dq->ipcns, dq->id, IPC_RMID, NULL, 0);
+ set_fs(old_fs);
+
+ put_ipc_ns(dq->ipcns);
+ return ret;
+}
+
+static int load_ipc_shm_hdr(struct ckpt_ctx *ctx,
+ struct ckpt_hdr_ipc_shm *h,
+ struct shmid_kernel *shp)
+{
+ int ret;
+
+ ret = restore_load_ipc_perms(&h->perms, &shp->shm_perm);
+ if (ret < 0)
+ return ret;
+
+ ckpt_debug("shm: cprid %d lprid %d segsz %lld mlock %d\n",
+ h->shm_cprid, h->shm_lprid, h->shm_segsz, h->mlock_uid);
+
+ if (h->shm_cprid < 0 || h->shm_lprid < 0)
+ return -EINVAL;
+
+ shp->shm_segsz = h->shm_segsz;
+ shp->shm_atim = h->shm_atim;
+ shp->shm_dtim = h->shm_dtim;
+ shp->shm_ctim = h->shm_ctim;
+ shp->shm_cprid = h->shm_cprid;
+ shp->shm_lprid = h->shm_lprid;
+
+ return 0;
+}
+
+int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns)
+{
+ struct ckpt_hdr_ipc_shm *h;
+ struct kern_ipc_perm *perms;
+ struct shmid_kernel *shp;
+ struct ipc_ids *shm_ids = &ns->ids[IPC_SHM_IDS];
+ struct file *file;
+ int shmflag;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_IPC_SHM);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ret = -EINVAL;
+ if (h->perms.id < 0)
+ goto out;
+
+#define CKPT_SHMFL_MASK (SHM_NORESERVE | SHM_HUGETLB)
+ if (h->flags & ~CKPT_SHMFL_MASK)
+ goto out;
+
+ ret = -ENOSYS;
+ if (h->mlock_uid != (unsigned int) -1) /* FIXME: support SHM_LOCK */
+ goto out;
+ if (h->flags & SHM_HUGETLB) /* FIXME: support SHM_HUGETLB */
+ goto out;
+
+ /*
+ * SHM_DEST means that the shm is to be deleted after creation.
+ * However, deleting before it's actually attached is quite silly.
+ * Instead, we defer this task to until restart has succeeded.
+ */
+ if (h->perms.mode & SHM_DEST) {
+ struct dq_ipcshm_del dq;
+
+ /* to not confuse the rest of the code */
+ h->perms.mode &= ~SHM_DEST;
+
+ dq.id = h->perms.id;
+ dq.ipcns = ns;
+ get_ipc_ns(dq.ipcns);
+
+ /* XXX can safely use put_ipc_ns() as dtor, see above */
+ ret = deferqueue_add(ctx->deferqueue, &dq, sizeof(dq),
+ (deferqueue_func_t) ipc_shm_delete,
+ (deferqueue_func_t) put_ipc_ns);
+ if (ret < 0)
+ goto out;
+ }
+
+ shmflag = h->flags | h->perms.mode | IPC_CREAT | IPC_EXCL;
+ ckpt_debug("shm: do_shmget size %lld flag %#x id %d\n",
+ h->shm_segsz, shmflag, h->perms.id);
+ ret = do_shmget(ns, h->perms.key, h->shm_segsz, shmflag, h->perms.id);
+ ckpt_debug("shm: do_shmget ret %d\n", ret);
+ if (ret < 0)
+ goto out;
+
+ down_write(&shm_ids->rw_mutex);
+
+ /* we are the sole owners/users of this ipc_ns, it can't go away */
+ perms = ipc_lock(shm_ids, h->perms.id);
+ BUG_ON(IS_ERR(perms)); /* ipc_ns is private to us */
+
+ shp = container_of(perms, struct shmid_kernel, shm_perm);
+ file = shp->shm_file;
+ get_file(file);
+
+ ret = load_ipc_shm_hdr(ctx, h, shp);
+ if (ret < 0)
+ goto mutex;
+
+ /* deposit in objhash and read contents in */
+ ret = ckpt_obj_insert(ctx, file, h->objref, CKPT_OBJ_FILE);
+ if (ret < 0)
+ goto mutex;
+ ret = restore_memory_contents(ctx, file->f_dentry->d_inode);
+ mutex:
+ fput(file);
+ if (ret < 0) {
+ ckpt_debug("shm: need to remove (%d)\n", ret);
+ do_shm_rmid(ns, perms);
+ } else
+ ipc_unlock(perms);
+ up_write(&shm_ids->rw_mutex);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
diff --git a/ipc/shm.c b/ipc/shm.c
index 0ee2c35..516b179 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -40,6 +40,7 @@
#include <linux/mount.h>
#include <linux/ipc_namespace.h>
#include <linux/ima.h>
+#include <linux/checkpoint.h>

#include <asm/uaccess.h>

@@ -305,6 +306,74 @@ int is_file_shm_hugepages(struct file *file)
return ret;
}

+#ifdef CONFIG_CHECKPOINT
+static int ipcshm_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+ int ino_objref;
+ int first;
+
+ ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+ CKPT_OBJ_INODE, &first);
+ if (ino_objref < 0)
+ return ino_objref;
+
+ /*
+ * This shouldn't happen, because all IPC regions should have
+ * been already dumped by now via ipc namespaces; It means
+ * the ipc_ns has been modified recently during checkpoint.
+ */
+ if (first)
+ return -EBUSY;
+
+ return generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_IPC_SKIP,
+ 0, ino_objref);
+}
+
+int ipcshm_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
+ struct ckpt_hdr_vma *h)
+{
+ struct file *file;
+ int shmid, shmflg = 0;
+ mm_segment_t old_fs;
+ unsigned long start;
+ unsigned long addr;
+ int ret;
+
+ if (!h->ino_objref)
+ return -EINVAL;
+ /* FIX: verify the vm_flags too */
+
+ file = ckpt_obj_fetch(ctx, h->ino_objref, CKPT_OBJ_FILE);
+ if (IS_ERR(file))
+ PTR_ERR(file);
+
+ shmid = file->f_dentry->d_inode->i_ino;
+
+ if (!(h->vm_flags & VM_WRITE))
+ shmflg |= SHM_RDONLY;
+
+ /*
+ * FIX: do_shmat() has limited interface: all-or-nothing
+ * mapping. If the vma, however, reflects a partial mapping
+ * then we need to modify that function to accomplish the
+ * desired outcome. Partial mapping can exist due to the user
+ * call shmat() and then unmapping part of the region.
+ * Currently, we at least detect this and call it a foul play.
+ */
+ if (((h->vm_end - h->vm_start) != h->ino_size) || h->vm_pgoff)
+ return -ENOSYS;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ start = h->vm_start;
+ ret = do_shmat(shmid, (char __user *) start, shmflg, &addr);
+ set_fs(old_fs);
+
+ BUG_ON(ret >= 0 && addr != h->vm_start);
+ return ret;
+}
+#endif
+
static const struct file_operations shm_file_operations = {
.mmap = shm_mmap,
.fsync = shm_fsync,
@@ -320,6 +389,9 @@ static struct vm_operations_struct shm_vm_ops = {
.set_policy = shm_set_policy,
.get_policy = shm_get_policy,
#endif
+#if defined(CONFIG_CHECKPOINT)
+ .checkpoint = ipcshm_checkpoint,
+#endif
};

/**
@@ -445,14 +517,12 @@ static inline int shm_more_checks(struct kern_ipc_perm *ipcp,
return 0;
}

-int do_shmget(key_t key, size_t size, int shmflg, int req_id)
+int do_shmget(struct ipc_namespace *ns, key_t key, size_t size,
+ int shmflg, int req_id)
{
- struct ipc_namespace *ns;
struct ipc_ops shm_ops;
struct ipc_params shm_params;

- ns = current->nsproxy->ipc_ns;
-
shm_ops.getnew = newseg;
shm_ops.associate = shm_security;
shm_ops.more_checks = shm_more_checks;
@@ -466,7 +536,7 @@ int do_shmget(key_t key, size_t size, int shmflg, int req_id)

SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
{
- return do_shmget(key, size, shmflg, -1);
+ return do_shmget(current->nsproxy->ipc_ns, key, size, shmflg, -1);
}

static inline unsigned long copy_shmid_to_user(void __user *buf, struct shmid64_ds *in, int version)
@@ -597,8 +667,8 @@ static void shm_get_stat(struct ipc_namespace *ns, unsigned long *rss,
* to be held in write mode.
* NOTE: no locks must be held, the rw_mutex is taken inside this function.
*/
-static int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
- struct shmid_ds __user *buf, int version)
+int shmctl_down(struct ipc_namespace *ns, int shmid, int cmd,
+ struct shmid_ds __user *buf, int version)
{
struct kern_ipc_perm *ipcp;
struct shmid64_ds shmid64;
diff --git a/ipc/util.h b/ipc/util.h
index 8ae1f8e..5f47593 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -178,11 +178,19 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,

struct ipc_namespace *create_ipc_ns(void);

+int do_shmget(struct ipc_namespace *ns, key_t key, size_t size, int shmflg,
+ int req_id);
+void do_shm_rmid(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp);
+
+
#ifdef CONFIG_CHECKPOINT
extern int checkpoint_fill_ipc_perms(struct ckpt_hdr_ipc_perms *h,
struct kern_ipc_perm *perm);
extern int restore_load_ipc_perms(struct ckpt_hdr_ipc_perms *h,
struct kern_ipc_perm *perm);
+
+extern int checkpoint_ipc_shm(int id, void *p, void *data);
+extern int restore_ipc_shm(struct ckpt_ctx *ctx, struct ipc_namespace *ns);
#endif

#endif
--
1.6.0.4

2009-07-22 10:20:25

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 06/60] cgroup freezer: Update stale locking comments

From: Matt Helsley <[email protected]>

Update stale comments regarding locking order and add a little more detail
so it's easier to follow the locking between the cgroup freezer and the
power management freezer code.

Signed-off-by: Matt Helsley <[email protected]>
Cc: Oren Laadan <[email protected]>
Cc: Cedric Le Goater <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
---
kernel/cgroup_freezer.c | 21 +++++++++++++--------
1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 765e2c1..22fce5d 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -88,10 +88,10 @@ struct cgroup_subsys freezer_subsys;

/* Locks taken and their ordering
* ------------------------------
- * css_set_lock
* cgroup_mutex (AKA cgroup_lock)
- * task->alloc_lock (AKA task_lock)
* freezer->lock
+ * css_set_lock
+ * task->alloc_lock (AKA task_lock)
* task->sighand->siglock
*
* cgroup code forces css_set_lock to be taken before task->alloc_lock
@@ -99,33 +99,38 @@ struct cgroup_subsys freezer_subsys;
* freezer_create(), freezer_destroy():
* cgroup_mutex [ by cgroup core ]
*
- * can_attach():
- * cgroup_mutex
+ * freezer_can_attach():
+ * cgroup_mutex (held by caller of can_attach)
*
- * cgroup_frozen():
+ * cgroup_freezing_or_frozen():
* task->alloc_lock (to get task's cgroup)
*
* freezer_fork() (preserving fork() performance means can't take cgroup_mutex):
- * task->alloc_lock (to get task's cgroup)
* freezer->lock
* sighand->siglock (if the cgroup is freezing)
*
* freezer_read():
* cgroup_mutex
* freezer->lock
+ * write_lock css_set_lock (cgroup iterator start)
+ * task->alloc_lock
* read_lock css_set_lock (cgroup iterator start)
*
* freezer_write() (freeze):
* cgroup_mutex
* freezer->lock
+ * write_lock css_set_lock (cgroup iterator start)
+ * task->alloc_lock
* read_lock css_set_lock (cgroup iterator start)
- * sighand->siglock
+ * sighand->siglock (fake signal delivery inside freeze_task())
*
* freezer_write() (unfreeze):
* cgroup_mutex
* freezer->lock
+ * write_lock css_set_lock (cgroup iterator start)
+ * task->alloc_lock
* read_lock css_set_lock (cgroup iterator start)
- * task->alloc_lock (to prevent races with freeze_task())
+ * task->alloc_lock (inside thaw_process(), prevents race with refrigerator())
* sighand->siglock
*/
static struct cgroup_subsys_state *freezer_create(struct cgroup_subsys *ss,
--
1.6.0.4

2009-07-22 10:20:11

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 56/60] c/r: clone_with_pids: define the s390 syscall

From: Serge E. Hallyn <[email protected]>

Hook up the clone_with_pids system call for s390x. clone_with_pids()
takes an additional argument over clone(), which we pass in through
register 7. Stub code for using the syscall looks like:

struct target_pid_set {
int num_pids;
pid_t *target_pids;
unsigned long flags;
};

register unsigned long int __r2 asm ("2") = (unsigned long int)(stack);
register unsigned long int __r3 asm ("3") = (unsigned long int)(flags);
register unsigned long int __r4 asm ("4") = (unsigned long int)(NULL);
register unsigned long int __r5 asm ("5") = (unsigned long int)(NULL);
register unsigned long int __r6 asm ("6") = (unsigned long int)(NULL);
register unsigned long int __r7 asm ("7") = (unsigned long int)(setp);
register unsigned long int __result asm ("2");
__asm__ __volatile__(
" lghi %%r1,332\n"
" svc 0\n"
: "=d" (__result)
: "0" (__r2), "d" (__r3),
"d" (__r4), "d" (__r5), "d" (__r6), "d" (__r7)
: "1", "cc", "memory"
);
__result;
})

struct target_pid_set pid_set;
int pids[1] = { 19799 };
pid_set.num_pids = 1;
pid_set.target_pids = &pids[0];
pid_set.flags = 0;

rc = do_clone_with_pids(topstack, clone_flags, setp);
if (rc == 0)
printf("Child\n");
else if (rc > 0)
printf("Parent: child pid %d\n", rc);
else
printf("Error %d\n", rc);

Signed-off-by: Serge E. Hallyn <[email protected]>
---
arch/s390/include/asm/unistd.h | 3 ++-
arch/s390/kernel/compat_linux.c | 19 +++++++++++++++++++
arch/s390/kernel/process.c | 19 +++++++++++++++++++
arch/s390/kernel/syscalls.S | 1 +
4 files changed, 41 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index 5d1678a..2a84f9c 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -271,7 +271,8 @@
#define __NR_perf_counter_open 331
#define __NR_checkpoint 332
#define __NR_restart 333
-#define NR_syscalls 334
+#define __NR_clone_with_pids 334
+#define NR_syscalls 335

/*
* There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 9ab188d..c6dc681 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -818,6 +818,25 @@ asmlinkage long sys32_clone(void)
parent_tidptr, child_tidptr);
}

+asmlinkage long sys32_clone_with_pids(void)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ unsigned long clone_flags;
+ unsigned long newsp;
+ int __user *parent_tidptr, *child_tidptr;
+ void __user *upid_setp;
+
+ clone_flags = regs->gprs[3] & 0xffffffffUL;
+ newsp = regs->orig_gpr2 & 0x7fffffffUL;
+ parent_tidptr = compat_ptr(regs->gprs[4]);
+ child_tidptr = compat_ptr(regs->gprs[5]);
+ upid_setp = compat_ptr(regs->gprs[7]);
+ if (!newsp)
+ newsp = regs->gprs[15];
+ return do_fork_with_pids(clone_flags, newsp, regs, 0,
+ parent_tidptr, child_tidptr, upid_setp);
+}
+
/*
* 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
* These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 5a43f27..263d3ab 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -247,6 +247,25 @@ SYSCALL_DEFINE0(clone)
parent_tidptr, child_tidptr);
}

+SYSCALL_DEFINE0(clone_with_pids)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ unsigned long clone_flags;
+ unsigned long newsp;
+ int __user *parent_tidptr, *child_tidptr;
+ void __user *upid_setp;
+
+ clone_flags = regs->gprs[3];
+ newsp = regs->orig_gpr2;
+ parent_tidptr = (int __user *) regs->gprs[4];
+ child_tidptr = (int __user *) regs->gprs[5];
+ upid_setp = (void __user *) regs->gprs[7];
+ if (!newsp)
+ newsp = regs->gprs[15];
+ return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
+ child_tidptr, upid_setp);
+}
+
/*
* This is trivial, and on the face of it looks like it
* could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 67518e2..db850e7 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -342,3 +342,4 @@ SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo
SYSCALL(sys_perf_counter_open,sys_perf_counter_open,sys_perf_counter_open_wrapper)
SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
SYSCALL(sys_restart,sys_restart,sys_restore_wrapper)
+SYSCALL(sys_clone_with_pids,sys_clone_with_pids,sys_clone_with_pids_wrapper)
--
1.6.0.4

2009-07-22 10:18:30

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 18/60] c/r: create syscalls: sys_checkpoint, sys_restart

Create trivial sys_checkpoint and sys_restore system calls. They will
enable to checkpoint and restart an entire container, to and from a
checkpoint image file descriptor.

The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it will identify the pid of restarting root task.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.

We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart. Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic. They also would significantly complicate
checkpoint that includes self.

Changelog[v17]:
- Move checkpoint closer to namespaces (kconfig)
- Kill "Enable" in c/r config option
Changelog[v16]:
- Change sys_restart() first argument to be 'pid_t pid'
Changelog[v14]:
- Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
- Remove line 'def_bool n' (default is already 'n')
- Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Changelog[v5]:
- Config is 'def_bool n' by default

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
---
arch/x86/Kconfig | 4 +++
arch/x86/include/asm/unistd_32.h | 2 +
arch/x86/kernel/syscall_table_32.S | 2 +
checkpoint/Kconfig | 14 ++++++++++++
checkpoint/Makefile | 5 ++++
checkpoint/sys.c | 41 ++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 2 +
init/Kconfig | 2 +
kernel/sys_ni.c | 4 +++
9 files changed, 76 insertions(+), 0 deletions(-)
create mode 100644 checkpoint/Kconfig
create mode 100644 checkpoint/Makefile
create mode 100644 checkpoint/sys.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 738bdc6..97ec17c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,10 @@ config STACKTRACE_SUPPORT
config HAVE_LATENCYTOP_SUPPORT
def_bool y

+config CHECKPOINT_SUPPORT
+ bool
+ default y if X86_32
+
config FAST_CMPXCHG_LOCAL
bool
default y
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f65b750..c25971b 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,6 +343,8 @@
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_counter_open 336
#define __NR_clone_with_pids 337
+#define __NR_checkpoint 338
+#define __NR_restart 339

#ifdef __KERNEL__

diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 879e5ec..4741554 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,5 @@ ENTRY(sys_call_table)
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_counter_open
.long ptregs_clone_with_pids
+ .long sys_checkpoint
+ .long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..ef7d406
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,14 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+ bool "Checkpoint/restart (EXPERIMENTAL)"
+ depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+ help
+ Application checkpoint/restart is the ability to save the
+ state of a running application so that it can later resume
+ its execution from the time at which it was checkpointed.
+
+ Turning this option on will enable checkpoint and restart
+ functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..8a32c6f
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..50c3cd8
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which dump the checkpoint image
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
+{
+ return -ENOSYS;
+}
+
+/**
+ * sys_restart - restart a container
+ * @pid: pid of task root (in coordinator's namespace), or 0
+ * @fd: file from which read the checkpoint image
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
+{
+ return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 80de700..33bce6e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -754,6 +754,8 @@ asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
struct timespec __user *, const sigset_t __user *,
size_t);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

diff --git a/init/Kconfig b/init/Kconfig
index 7503957..a083161 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -715,6 +715,8 @@ config NET_NS
Allow user space to create what appear to be multiple instances
of the network stack.

+source "checkpoint/Kconfig"
+
config BLK_DEV_INITRD
bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support"
depends on BROKEN || !FRV
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 68320f6..32f3f26 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -178,3 +178,7 @@ cond_syscall(sys_eventfd2);

/* performance counters: */
cond_syscall(sys_perf_counter_open);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
--
1.6.0.4

2009-07-22 10:21:09

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 23/60] c/r: export functionality used in next patch for restart-blocks

To support c/r of restart-blocks (system call that need to be
restarted because they were interrupted but there was no userspace
visible side-effect), export restart-block callbacks for poll()
and futex() syscalls.

More details on c/r of restart-blocks and how it works in the
following patch.

Signed-off-by: Oren Laadan <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
---
fs/select.c | 2 +-
include/linux/futex.h | 11 +++++++++++
include/linux/poll.h | 3 +++
include/linux/posix-timers.h | 6 ++++++
kernel/compat.c | 4 ++--
kernel/futex.c | 12 +-----------
kernel/posix-timers.c | 2 +-
7 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index d870237..08d1d35 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -865,7 +865,7 @@ out_fds:
return err;
}

-static long do_restart_poll(struct restart_block *restart_block)
+long do_restart_poll(struct restart_block *restart_block)
{
struct pollfd __user *ufds = restart_block->poll.ufds;
int nfds = restart_block->poll.nfds;
diff --git a/include/linux/futex.h b/include/linux/futex.h
index 34956c8..4326f81 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -136,6 +136,17 @@ extern int
handle_futex_death(u32 __user *uaddr, struct task_struct *curr, int pi);

/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'flags' shared capability
+ */
+#define FLAGS_SHARED 0x01
+#define FLAGS_CLOCKRT 0x02
+#define FLAGS_HAS_TIMEOUT 0x04
+
+/* for c/r */
+extern long futex_wait_restart(struct restart_block *restart);
+
+/*
* Futexes are matched on equal values of this key.
* The key type depends on whether it's a shared or private mapping.
* Don't rearrange members without looking at hash_futex().
diff --git a/include/linux/poll.h b/include/linux/poll.h
index fa287f2..0841c51 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -134,6 +134,9 @@ extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,

extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec);

+/* used by checkpoint/restart */
+extern long do_restart_poll(struct restart_block *restart_block);
+
#endif /* KERNEL */

#endif /* _LINUX_POLL_H */
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 4f71bf4..d0d6a66 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -101,6 +101,10 @@ int posix_cpu_timer_create(struct k_itimer *timer);
int posix_cpu_nsleep(const clockid_t which_clock, int flags,
struct timespec *rqtp, struct timespec __user *rmtp);
long posix_cpu_nsleep_restart(struct restart_block *restart_block);
+#ifdef CONFIG_COMPAT
+long compat_nanosleep_restart(struct restart_block *restart);
+long compat_clock_nanosleep_restart(struct restart_block *restart);
+#endif
int posix_cpu_timer_set(struct k_itimer *timer, int flags,
struct itimerspec *new, struct itimerspec *old);
int posix_cpu_timer_del(struct k_itimer *timer);
@@ -119,4 +123,6 @@ long clock_nanosleep_restart(struct restart_block *restart_block);

void update_rlimit_cpu(unsigned long rlim_new);

+int invalid_clockid(const clockid_t which_clock);
+
#endif
diff --git a/kernel/compat.c b/kernel/compat.c
index f6c204f..20afdba 100644
--- a/kernel/compat.c
+++ b/kernel/compat.c
@@ -100,7 +100,7 @@ int put_compat_timespec(const struct timespec *ts, struct compat_timespec __user
__put_user(ts->tv_nsec, &cts->tv_nsec)) ? -EFAULT : 0;
}

-static long compat_nanosleep_restart(struct restart_block *restart)
+long compat_nanosleep_restart(struct restart_block *restart)
{
struct compat_timespec __user *rmtp;
struct timespec rmt;
@@ -647,7 +647,7 @@ long compat_sys_clock_getres(clockid_t which_clock,
return err;
}

-static long compat_clock_nanosleep_restart(struct restart_block *restart)
+long compat_clock_nanosleep_restart(struct restart_block *restart)
{
long err;
mm_segment_t oldfs;
diff --git a/kernel/futex.c b/kernel/futex.c
index 794c862..dfe246f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1516,16 +1516,6 @@ handle_fault:
goto retry;
}

-/*
- * In case we must use restart_block to restart a futex_wait,
- * we encode in the 'flags' shared capability
- */
-#define FLAGS_SHARED 0x01
-#define FLAGS_CLOCKRT 0x02
-#define FLAGS_HAS_TIMEOUT 0x04
-
-static long futex_wait_restart(struct restart_block *restart);
-
/**
* fixup_owner() - Post lock pi_state and corner case management
* @uaddr: user address of the futex
@@ -1795,7 +1785,7 @@ out:
}


-static long futex_wait_restart(struct restart_block *restart)
+long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
int fshared = 0;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 052ec4d..589aed2 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -205,7 +205,7 @@ static int no_timer_create(struct k_itimer *new_timer)
/*
* Return nonzero if we know a priori this clockid_t value is bogus.
*/
-static inline int invalid_clockid(const clockid_t which_clock)
+int invalid_clockid(const clockid_t which_clock)
{
if (which_clock < 0) /* CPU clock, posix_cpu_* will check it */
return 0;
--
1.6.0.4

2009-07-22 10:21:39

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 55/60] c/r: define s390-specific checkpoint-restart code

From: Dan Smith <[email protected]>

Implement the s390 arch-specific checkpoint/restart helpers. This
is on top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro. While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to. That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.

Changelog:
Jun 15:
. Fix checkpoint and restart compat wrappers
May 28:
. Export asm/checkpoint_hdr.h to userspace
. Define CKPT_ARCH_ID for S390
Apr 11:
. Introduce ckpt_arch_vdso()
Feb 27:
. Add checkpoint_s390.h
. Fixed up save and restore of PSW, with the non-address bits
properly masked out
Feb 25:
. Make checkpoint_hdr.h safe for inclusion in userspace
. Replace comment about vsdo code
. Add comment about restoring access registers
. Write and read an empty ckpt_hdr_head_arch record to appease
code (mktree) that expects it to be there
. Utilize NUM_CKPT_WORDS in checkpoint_hdr.h
Feb 24:
. Use CKPT_COPY() to unify the un/loading of cpu and mm state
. Fix fprs definition in ckpt_hdr_cpu
. Remove debug WARN_ON() from checkpoint.c
Feb 23:
. Macro-ize the un/packing of trace flags
. Fix the crash when externally-linked
. Break out the restart functions into restart.c
. Remove unneeded s390_enable_sie() call
Jan 30:
. Switched types in ckpt_hdr_cpu to __u64 etc.
(Per Oren suggestion)
. Replaced direct inclusion of structs in
ckpt_hdr_cpu with the struct members.
(Per Oren suggestion)
. Also ended up adding a bunch of new things
into restart (mm_segment, ksp, etc) in vain
attempt to get code using fpu to not segfault
after restart.

Signed-off-by: Serge E. Hallyn <[email protected]>
Signed-off-by: Dan Smith <[email protected]>
---
arch/s390/include/asm/Kbuild | 1 +
arch/s390/include/asm/checkpoint_hdr.h | 89 +++++++++++++++
arch/s390/include/asm/unistd.h | 4 +-
arch/s390/kernel/compat_wrapper.S | 14 +++
arch/s390/kernel/syscalls.S | 2 +
arch/s390/mm/Makefile | 1 +
arch/s390/mm/checkpoint.c | 183 ++++++++++++++++++++++++++++++++
arch/s390/mm/checkpoint_s390.h | 23 ++++
include/linux/checkpoint_hdr.h | 2 +
9 files changed, 318 insertions(+), 1 deletions(-)
create mode 100644 arch/s390/include/asm/checkpoint_hdr.h
create mode 100644 arch/s390/mm/checkpoint.c
create mode 100644 arch/s390/mm/checkpoint_s390.h

diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 63a2341..3282a6e 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -8,6 +8,7 @@ header-y += ucontext.h
header-y += vtoc.h
header-y += zcrypt.h
header-y += chsc.h
+header-y += checkpoint_hdr.h

unifdef-y += cmb.h
unifdef-y += debug.h
diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..ad9449e
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,89 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ * Checkpoint/restart - architecture specific headers s/390
+ *
+ * Copyright IBM Corp. 2009
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef CONFIG_64BIT
+#define CKPT_ARCH_ID CKPT_ARCH_S390X
+/* else - if we ever support 32bit - CKPT_ARCH_S390 */
+#endif
+
+/*
+ * Notes
+ * NUM_GPRS defined in <asm/ptrace.h> to be 16
+ * NUM_FPRS defined in <asm/ptrace.h> to be 16
+ * NUM_APRS defined in <asm/ptrace.h> to be 16
+ * NUM_CR_WORDS defined in <asm/ptrace.h> to be 3
+ */
+struct ckpt_hdr_cpu {
+ struct ckpt_hdr h;
+ __u64 args[1];
+ __u64 gprs[NUM_GPRS];
+ __u64 orig_gpr2;
+ __u16 svcnr;
+ __u16 ilc;
+ __u32 acrs[NUM_ACRS];
+ __u64 ieee_instruction_pointer;
+
+ /* psw_t */
+ __u64 psw_t_mask;
+ __u64 psw_t_addr;
+
+ /* s390_fp_regs_t */
+ __u32 fpc;
+ union {
+ float f;
+ double d;
+ __u64 ui;
+ struct {
+ __u32 fp_hi;
+ __u32 fp_lo;
+ } fp;
+ } fprs[NUM_FPRS];
+
+ /* per_struct */
+ __u64 per_control_regs[NUM_CR_WORDS];
+ __u64 starting_addr;
+ __u64 ending_addr;
+ __u64 address;
+ __u16 perc_atmid;
+ __u8 access_id;
+ __u8 single_step;
+ __u8 instruction_fetch;
+};
+
+struct ckpt_hdr_mm_context {
+ struct ckpt_hdr h;
+ unsigned long vdso_base;
+ int noexec;
+ int has_pgste;
+ int alloc_pgste;
+ unsigned long asce_bits;
+ unsigned long asce_limit;
+};
+
+struct ckpt_hdr_header_arch {
+ struct ckpt_hdr h;
+};
+
+#endif /* __ASM_S390_CKPT_HDR__H */
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index c80602d..5d1678a 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,9 @@
#define __NR_pwritev 329
#define __NR_rt_tgsigqueueinfo 330
#define __NR_perf_counter_open 331
-#define NR_syscalls 332
+#define __NR_checkpoint 332
+#define __NR_restart 333
+#define NR_syscalls 334

/*
* There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index 88a8336..e882f99 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1840,3 +1840,17 @@ sys_perf_counter_open_wrapper:
lgfr %r5,%r5 # int
llgfr %r6,%r6 # unsigned long
jg sys_perf_counter_open # branch to system call
+
+ .globl sys_checkpoint_wrapper
+sys_checkpoint_wrapper:
+ lgfr %r2,%r2 # pid_t
+ lgfr %r3,%r3 # int
+ llgfr %r4,%r4 # unsigned long
+ jg compat_sys_checkpoint
+
+ .globl sys_restore_wrapper
+sys_restore_wrapper:
+ lgfr %r2,%r2 # int
+ lgfr %r3,%r3 # int
+ llgfr %r4,%r4 # unsigned long
+ jg compat_sys_restore
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index ad1acd2..67518e2 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,5 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
SYSCALL(sys_perf_counter_open,sys_perf_counter_open,sys_perf_counter_open_wrapper)
+SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
+SYSCALL(sys_restart,sys_restart,sys_restore_wrapper)
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index db05661..e3d356d 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -6,3 +6,4 @@ obj-y := init.o fault.o extmem.o mmap.o vmem.o pgtable.o maccess.o
obj-$(CONFIG_CMM) += cmm.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_PAGE_STATES) += page-states.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/arch/s390/mm/checkpoint.c b/arch/s390/mm/checkpoint.c
new file mode 100644
index 0000000..a4a5da9
--- /dev/null
+++ b/arch/s390/mm/checkpoint.c
@@ -0,0 +1,183 @@
+/*
+ * Checkpoint/restart - architecture specific support for s390
+ *
+ * Copyright IBM Corp. 2009
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/elf.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void s390_copy_regs(int op, struct ckpt_hdr_cpu *h,
+ struct task_struct *t)
+{
+ struct pt_regs *regs = task_pt_regs(t);
+ struct thread_struct *thr = &t->thread;
+
+ /* Save the whole PSW to facilitate forensic debugging, but only
+ * restore the address portion to avoid letting userspace do
+ * bad things by manipulating its value.
+ */
+ if (op == CKPT_CPT) {
+ CKPT_COPY(op, h->psw_t_addr, regs->psw.addr);
+ } else {
+ regs->psw.addr &= ~PSW_ADDR_INSN;
+ regs->psw.addr |= h->psw_t_addr;
+ }
+
+ CKPT_COPY(op, h->args[0], regs->args[0]);
+ CKPT_COPY(op, h->orig_gpr2, regs->orig_gpr2);
+ CKPT_COPY(op, h->svcnr, regs->svcnr);
+ CKPT_COPY(op, h->ilc, regs->ilc);
+ CKPT_COPY(op, h->ieee_instruction_pointer,
+ thr->ieee_instruction_pointer);
+ CKPT_COPY(op, h->psw_t_mask, regs->psw.mask);
+ CKPT_COPY(op, h->fpc, thr->fp_regs.fpc);
+ CKPT_COPY(op, h->starting_addr, thr->per_info.starting_addr);
+ CKPT_COPY(op, h->ending_addr, thr->per_info.ending_addr);
+ CKPT_COPY(op, h->address, thr->per_info.lowcore.words.address);
+ CKPT_COPY(op, h->perc_atmid, thr->per_info.lowcore.words.perc_atmid);
+ CKPT_COPY(op, h->access_id, thr->per_info.lowcore.words.access_id);
+ CKPT_COPY(op, h->single_step, thr->per_info.single_step);
+ CKPT_COPY(op, h->instruction_fetch, thr->per_info.instruction_fetch);
+
+ CKPT_COPY_ARRAY(op, h->gprs, regs->gprs, NUM_GPRS);
+ CKPT_COPY_ARRAY(op, h->fprs, thr->fp_regs.fprs, NUM_FPRS);
+ CKPT_COPY_ARRAY(op, h->acrs, thr->acrs, NUM_ACRS);
+ CKPT_COPY_ARRAY(op, h->per_control_regs,
+ thr->per_info.control_regs.words.cr, NUM_CR_WORDS);
+}
+
+static void s390_mm(int op, struct ckpt_hdr_mm_context *h,
+ struct mm_struct *mm)
+{
+ CKPT_COPY(op, h->noexec, mm->context.noexec);
+ CKPT_COPY(op, h->has_pgste, mm->context.has_pgste);
+ CKPT_COPY(op, h->alloc_pgste, mm->context.alloc_pgste);
+ CKPT_COPY(op, h->asce_bits, mm->context.asce_bits);
+ CKPT_COPY(op, h->asce_limit, mm->context.asce_limit);
+}
+
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ return 0;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr_cpu *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+ if (!h)
+ return -ENOMEM;
+
+ s390_copy_regs(CKPT_CPT, h, t);
+
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+/* Write an empty header since it is assumed to be there */
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_header_arch *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+ if (!h)
+ return -ENOMEM;
+
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+ struct ckpt_hdr_mm_context *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+ if (!h)
+ return -ENOMEM;
+
+ s390_mm(CKPT_CPT, h, mm);
+
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+int restore_thread(struct ckpt_ctx *ctx)
+{
+ return 0;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_cpu *h;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ s390_copy_regs(CKPT_RST, h, current);
+
+ /* s390 does not restore the access registers after a syscall,
+ * but does on a task switch. Since we're switching tasks (in
+ * a way), we need to replicate that behavior here.
+ */
+ restore_access_regs(h->acrs);
+
+ ckpt_hdr_put(ctx, h);
+ return 0;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_header_arch *h;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ckpt_hdr_put(ctx, h);
+ return 0;
+}
+
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+ struct ckpt_hdr_mm_context *h;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ s390_mm(CKPT_RST, h, mm);
+
+ ckpt_hdr_put(ctx, h);
+ return 0;
+}
diff --git a/arch/s390/mm/checkpoint_s390.h b/arch/s390/mm/checkpoint_s390.h
new file mode 100644
index 0000000..c3bf24d
--- /dev/null
+++ b/arch/s390/mm/checkpoint_s390.h
@@ -0,0 +1,23 @@
+/*
+ * Checkpoint/restart - architecture specific support for s390
+ *
+ * Copyright IBM Corp. 2009
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#ifndef _S390_CHECKPOINT_H
+#define _S390_CHECKPOINT_H
+
+#include <linux/checkpoint_hdr.h>
+#include <linux/sched.h>
+#include <linux/mm_types.h>
+
+extern void checkpoint_s390_regs(int op, struct ckpt_hdr_cpu *h,
+ struct task_struct *t);
+extern void checkpoint_s390_mm(int op, struct ckpt_hdr_mm_context *h,
+ struct mm_struct *mm);
+
+#endif /* _S390_CHECKPOINT_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 2364a6f..3671e72 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -87,7 +87,9 @@ enum {

/* architecture */
enum {
+ /* do not change order (will break ABI) */
CKPT_ARCH_X86_32 = 1,
+ CKPT_ARCH_S390X,
};

/* shared objrects (objref) */
--
1.6.0.4

2009-07-22 10:19:11

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 30/60] c/r: infrastructure for shared objects

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
>From then on the object will be found in the hash and only its identifier
is saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

The hash is "one-way": objects added to it are never deleted until the
hash it discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.

The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.

Changelog[v17]:
- Add ckpt_obj->flags with CKPT_OBJ_CHECKPOINTED flag
- Add prototype of ckpt_obj_lookup
- Complain on attempt to add NULL ptr to objhash
- Prepare for 'leaks detection'
Changelog[v16]:
- Introduce ckpt_obj_lookup() to find an object by its ptr
Changelog[v14]:
- Introduce 'struct ckpt_obj_ops' to better modularize shared objs.
- Replace long 'switch' statements with table lookups and callbacks.
- Introduce checkpoint_obj() and restart_obj() helpers
- Shared objects now dumped/saved right before they are referenced
- Cleanup interface of shared objects
Changelog[v13]:
- Use hash_long() with 'unsigned long' cast to support 64bit archs
(Nathan Lynch <[email protected]>)
Changelog[v11]:
- Doc: be explicit about grabbing a reference and object lifetime
Changelog[v4]:
- Fix calculation of hash table size
Changelog[v3]:
- Use standard hlist_... for hash table

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Makefile | 1 +
checkpoint/objhash.c | 419 ++++++++++++++++++++++++++++++++++++++
checkpoint/restart.c | 50 +++++-
checkpoint/sys.c | 6 +
include/linux/checkpoint.h | 18 ++
include/linux/checkpoint_hdr.h | 14 ++
include/linux/checkpoint_types.h | 2 +
7 files changed, 508 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/objhash.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 99364cc..5aa6a75 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -4,6 +4,7 @@

obj-$(CONFIG_CHECKPOINT) += \
sys.o \
+ objhash.o \
checkpoint.o \
restart.o \
process.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..eb2bb55
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,419 @@
+/*
+ * Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DOBJ
+
+#include <linux/kernel.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+struct ckpt_obj;
+struct ckpt_obj_ops;
+
+/* object operations */
+struct ckpt_obj_ops {
+ char *obj_name;
+ enum obj_type obj_type;
+ void (*ref_drop)(void *ptr);
+ int (*ref_grab)(void *ptr);
+ int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
+ void *(*restore)(struct ckpt_ctx *ctx);
+};
+
+struct ckpt_obj {
+ int objref;
+ int flags;
+ void *ptr;
+ struct ckpt_obj_ops *ops;
+ struct hlist_node hash;
+};
+
+/* object internal flags */
+#define CKPT_OBJ_CHECKPOINTED 0x1 /* object already checkpointed */
+
+struct ckpt_obj_hash {
+ struct hlist_head *head;
+ int next_free_objref;
+};
+
+/* helper grab/drop functions: */
+
+static void obj_no_drop(void *ptr)
+{
+ return;
+}
+
+static int obj_no_grab(void *ptr)
+{
+ return 0;
+}
+
+static struct ckpt_obj_ops ckpt_obj_ops[] = {
+ /* ignored object */
+ {
+ .obj_name = "IGNORED",
+ .obj_type = CKPT_OBJ_IGNORE,
+ .ref_drop = obj_no_drop,
+ .ref_grab = obj_no_grab,
+ },
+};
+
+
+#define CKPT_OBJ_HASH_NBITS 10
+#define CKPT_OBJ_HASH_TOTAL (1UL << CKPT_OBJ_HASH_NBITS)
+
+static void obj_hash_clear(struct ckpt_obj_hash *obj_hash)
+{
+ struct hlist_head *h = obj_hash->head;
+ struct hlist_node *n, *t;
+ struct ckpt_obj *obj;
+ int i;
+
+ for (i = 0; i < CKPT_OBJ_HASH_TOTAL; i++) {
+ hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+ obj->ops->ref_drop(obj->ptr);
+ kfree(obj);
+ }
+ }
+}
+
+void ckpt_obj_hash_free(struct ckpt_ctx *ctx)
+{
+ struct ckpt_obj_hash *obj_hash = ctx->obj_hash;
+
+ if (obj_hash) {
+ obj_hash_clear(obj_hash);
+ kfree(obj_hash->head);
+ kfree(ctx->obj_hash);
+ ctx->obj_hash = NULL;
+ }
+}
+
+int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)
+{
+ struct ckpt_obj_hash *obj_hash;
+ struct hlist_head *head;
+
+ obj_hash = kzalloc(sizeof(*obj_hash), GFP_KERNEL);
+ if (!obj_hash)
+ return -ENOMEM;
+ head = kzalloc(CKPT_OBJ_HASH_TOTAL * sizeof(*head), GFP_KERNEL);
+ if (!head) {
+ kfree(obj_hash);
+ return -ENOMEM;
+ }
+
+ obj_hash->head = head;
+ obj_hash->next_free_objref = 1;
+
+ ctx->obj_hash = obj_hash;
+ return 0;
+}
+
+static struct ckpt_obj *obj_find_by_ptr(struct ckpt_ctx *ctx, void *ptr)
+{
+ struct hlist_head *h;
+ struct hlist_node *n;
+ struct ckpt_obj *obj;
+
+ h = &ctx->obj_hash->head[hash_long((unsigned long) ptr,
+ CKPT_OBJ_HASH_NBITS)];
+ hlist_for_each_entry(obj, n, h, hash)
+ if (obj->ptr == ptr)
+ return obj;
+ return NULL;
+}
+
+static struct ckpt_obj *obj_find_by_objref(struct ckpt_ctx *ctx, int objref)
+{
+ struct hlist_head *h;
+ struct hlist_node *n;
+ struct ckpt_obj *obj;
+
+ h = &ctx->obj_hash->head[hash_long((unsigned long) objref,
+ CKPT_OBJ_HASH_NBITS)];
+ hlist_for_each_entry(obj, n, h, hash)
+ if (obj->objref == objref)
+ return obj;
+ return NULL;
+}
+
+/**
+ * ckpt_obj_new - add an object to the obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: object unique id
+ * @ops: object operations
+ *
+ * Add the object to the obj_hash. If @objref is zero, assign a unique
+ * object id and use @ptr as a hash key [checkpoint]. Else use @objref
+ * as a key [restart].
+ */
+static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,
+ int objref, enum obj_type type)
+{
+ struct ckpt_obj_ops *ops = &ckpt_obj_ops[type];
+ struct ckpt_obj *obj;
+ int i, ret;
+
+ /* explicitly disallow null pointers */
+ BUG_ON(!ptr);
+ /* make sure we don't change this accidentally */
+ BUG_ON(ops->obj_type != type);
+
+ obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+ if (!obj)
+ return ERR_PTR(-ENOMEM);
+
+ obj->ptr = ptr;
+ obj->ops = ops;
+
+ if (!objref) {
+ /* use @obj->ptr to index, assign objref (checkpoint) */
+ obj->objref = ctx->obj_hash->next_free_objref++;;
+ i = hash_long((unsigned long) ptr, CKPT_OBJ_HASH_NBITS);
+ } else {
+ /* use @obj->objref to index (restart) */
+ obj->objref = objref;
+ i = hash_long((unsigned long) objref, CKPT_OBJ_HASH_NBITS);
+ }
+
+ ret = ops->ref_grab(obj->ptr);
+ if (ret < 0) {
+ kfree(obj);
+ obj = ERR_PTR(ret);
+ } else {
+ hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
+ }
+
+ return obj;
+}
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+/**
+ * obj_lookup_add - lookup object and add if not in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encounter (added to table)
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, add the object, and allocate a unique object
+ * id. Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is freed.
+ */
+static struct ckpt_obj *obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+ enum obj_type type, int *first)
+{
+ struct ckpt_obj *obj;
+
+ obj = obj_find_by_ptr(ctx, ptr);
+ if (!obj) {
+ obj = obj_new(ctx, ptr, 0, type);
+ *first = 1;
+ } else {
+ BUG_ON(obj->ops->obj_type != type);
+ *first = 0;
+ }
+ return obj;
+}
+
+/**
+ * ckpt_obj_lookup - lookup object (by pointer) in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * [used during checkpoint].
+ * Return: objref (or zero if not found)
+ */
+int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+ struct ckpt_obj *obj;
+
+ obj = obj_find_by_ptr(ctx, ptr);
+ BUG_ON(obj && obj->ops->obj_type != type);
+ if (obj)
+ ckpt_debug("%s objref %d\n", obj->ops->obj_name, obj->objref);
+ return obj ? obj->objref : 0;
+}
+
+/**
+ * ckpt_obj_lookup_add - lookup object and add if not in objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encoutner (added to table)
+ *
+ * [used during checkpoint].
+ * Return: objref
+ */
+int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+ enum obj_type type, int *first)
+{
+ struct ckpt_obj *obj;
+
+ obj = obj_lookup_add(ctx, ptr, type, first);
+ if (IS_ERR(obj))
+ return PTR_ERR(obj);
+ ckpt_debug("%s objref %d first %d\n",
+ obj->ops->obj_name, obj->objref, *first);
+ obj->flags |= CKPT_OBJ_CHECKPOINTED;
+ return obj->objref;
+}
+
+/**
+ * checkpoint_obj - if not already in hash, add object and checkpoint
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Use obj_lookup_add() to lookup (and possibly add) the object to the
+ * hash table. If the CKPT_OBJ_CHECKPOINTED flag isn't set, then also
+ * save the object's state using its ops->checkpoint().
+ *
+ * [This is used during checkpoint].
+ * Returns: objref
+ */
+int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+ struct ckpt_hdr_objref *h;
+ struct ckpt_obj *obj;
+ int first, ret = 0;
+
+ obj = obj_lookup_add(ctx, ptr, type, &first);
+ if (IS_ERR(obj))
+ return PTR_ERR(obj);
+
+ if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) {
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
+ if (!h)
+ return -ENOMEM;
+
+ h->objtype = type;
+ h->objref = obj->objref;
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+
+ if (ret < 0)
+ return ret;
+
+ /* invoke callback to actually dump the state */
+ if (obj->ops->checkpoint)
+ ret = obj->ops->checkpoint(ctx, ptr);
+
+ obj->flags |= CKPT_OBJ_CHECKPOINTED;
+ }
+ return (ret < 0 ? ret : obj->objref);
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/**
+ * restore_obj - read in and restore a (first seen) shared object
+ * @ctx: checkpoint context
+ * @h: ckpt_hdr of shared object
+ *
+ * Read in the header payload (struct ckpt_hdr_objref). Lookup the
+ * object to verify it isn't there. Then restore the object's state
+ * and add it to the objash. No need to explicitly grab a reference -
+ * we hold the initial instance of this object. (Object maintained
+ * until the entire hash is free).
+ *
+ * [This is used during restart].
+ */
+int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h)
+{
+ struct ckpt_obj_ops *ops;
+ struct ckpt_obj *obj;
+ void *ptr = NULL;
+
+ ckpt_debug("len %d ref %d type %d\n", h->h.len, h->objref, h->objtype);
+ if (obj_find_by_objref(ctx, h->objref))
+ return -EINVAL;
+
+ if (h->objtype >= CKPT_OBJ_MAX)
+ return -EINVAL;
+
+ ops = &ckpt_obj_ops[h->objtype];
+ BUG_ON(ops->obj_type != h->objtype);
+
+ if (ops->restore)
+ ptr = ops->restore(ctx);
+ if (IS_ERR(ptr))
+ return PTR_ERR(ptr);
+
+ obj = obj_new(ctx, ptr, h->objref, h->objtype);
+ /*
+ * Drop an extra reference to the object returned by ops->restore:
+ * On success, this clears the extra reference taken by obj_new(),
+ * and on failure, this cleans up the object itself.
+ */
+ ops->ref_drop(ptr);
+ if (IS_ERR(obj)) {
+ ops->ref_drop(ptr);
+ return PTR_ERR(obj);
+ }
+ return obj->objref;
+}
+
+/**
+ * ckpt_obj_insert - add an object with a given objref to obj_hash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object id
+ * @type: object type
+ *
+ * Add the object pointer to by @ptr and identified by unique object id
+ * @objref to the hash table (indexed by @objref). Grab a reference to
+ * every object added, and maintain it until the entire hash is freed.
+ *
+ * [This is used during restart].
+ */
+int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr,
+ int objref, enum obj_type type)
+{
+ struct ckpt_obj *obj;
+
+ obj = obj_new(ctx, ptr, objref, type);
+ if (IS_ERR(obj))
+ return PTR_ERR(obj);
+ ckpt_debug("%s objref %d\n", obj->ops->obj_name, objref);
+ return obj->objref;
+}
+
+/**
+ * ckpt_obj_fetch - fetch an object by its identifier
+ * @ctx: checkpoint context
+ * @objref: object id
+ * @type: object type
+ *
+ * Lookup the objref identifier by @objref in the hash table. Return
+ * an error not found.
+ *
+ * [This is used during restart].
+ */
+void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref, enum obj_type type)
+{
+ struct ckpt_obj *obj;
+
+ obj = obj_find_by_objref(ctx, objref);
+ if (!obj)
+ return ERR_PTR(-EINVAL);
+ ckpt_debug("%s ref %d\n", obj->ops->obj_name, obj->objref);
+ return (obj->ops->obj_type == type ? obj->ptr : ERR_PTR(-ENOMSG));
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 1b1f639..81790fe 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -25,6 +25,34 @@
#include <linux/checkpoint_hdr.h>

/**
+ * _ckpt_read_objref - dispatch handling of a shared object
+ * @ctx: checkpoint context
+ * @hh: objrect descriptor
+ */
+static int _ckpt_read_objref(struct ckpt_ctx *ctx, struct ckpt_hdr *hh)
+{
+ struct ckpt_hdr *h;
+ int ret;
+
+ h = ckpt_hdr_get(ctx, hh->len);
+ if (!h)
+ return -ENOMEM;
+
+ *h = *hh; /* yay ! */
+
+ _ckpt_debug(CKPT_DOBJ, "shared len %d type %d\n", h->len, h->type);
+ ret = ckpt_kread(ctx, (h + 1), hh->len - sizeof(struct ckpt_hdr));
+ if (ret < 0)
+ goto out;
+
+ ret = restore_obj(ctx, (struct ckpt_hdr_objref *) h);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+
+/**
* _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
* @ctx: checkpoint context
* @h: desired ckpt_hdr
@@ -39,6 +67,7 @@ static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
{
int ret;

+ again:
ret = ckpt_kread(ctx, h, sizeof(*h));
if (ret < 0)
return ret;
@@ -46,7 +75,15 @@ static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
h->type, h->len, len, max);
if (h->len < sizeof(*h))
return -EINVAL;
+
/* if len specified, enforce, else if maximum specified, enforce */
+ if (h->type == CKPT_HDR_OBJREF) {
+ ret = _ckpt_read_objref(ctx, h);
+ if (ret < 0)
+ return ret;
+ goto again;
+ }
+
if ((len && h->len != len) || (!len && max && h->len > max))
return -EINVAL;

@@ -155,6 +192,7 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
struct ckpt_hdr *h;
int ret;

+ again:
ret = ckpt_kread(ctx, &hh, sizeof(hh));
if (ret < 0)
return ERR_PTR(ret);
@@ -162,6 +200,14 @@ static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
hh.type, hh.len, len, max);
if (hh.len < sizeof(*h))
return ERR_PTR(-EINVAL);
+
+ if (hh.type == CKPT_HDR_OBJREF) {
+ ret = _ckpt_read_objref(ctx, &hh);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ goto again;
+ }
+
/* if len specified, enforce, else if maximum specified, enforce */
if ((len && hh.len != len) || (!len && max && hh.len > max))
return ERR_PTR(-EINVAL);
@@ -214,8 +260,8 @@ void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type)
* @type: desired object type
*
* This differs from ckpt_read_obj_type() in that the length of the
- * incoming object is flexible (up to the maximum specified by @len),
- * as determined by the ckpt_hdr data.
+ * incoming object is flexible (up to the maximum specified by @len;
+ * unlimited if @len is 0), as determined by the ckpt_hdr data.
*
* Return: new buffer allocated on success, error pointer otherwise
*/
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index c8921f0..d16d48f 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -194,6 +194,8 @@ static void ckpt_ctx_free(struct ckpt_ctx *ctx)
if (ctx->file)
fput(ctx->file);

+ ckpt_obj_hash_free(ctx);
+
if (ctx->tasks_arr)
task_arr_free(ctx);

@@ -231,6 +233,10 @@ static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
if (!ctx->file)
goto err;

+ err = -ENOMEM;
+ if (ckpt_obj_hash_alloc(ctx) < 0)
+ goto err;
+
atomic_inc(&ctx->refcount);
return ctx;
err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index b6af5b9..8eb5434 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -61,6 +61,7 @@ extern int ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...);

extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
void *ptr, int len, int type);
+extern int _ckpt_read_nbuffer(struct ckpt_ctx *ctx, void *ptr, int len);
extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
@@ -78,6 +79,22 @@ extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);
#define ckpt_test_ctx_complete(ctx) \
((ctx)->kflags & (CKPT_CTX_SUCCESS | CKPT_CTX_ERROR))

+/* obj_hash */
+extern void ckpt_obj_hash_free(struct ckpt_ctx *ctx);
+extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx);
+
+extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
+extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr,
+ enum obj_type type);
+extern int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr,
+ enum obj_type type);
+extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
+ enum obj_type type, int *first);
+extern void *ckpt_obj_fetch(struct ckpt_ctx *ctx, int objref,
+ enum obj_type type);
+extern int ckpt_obj_insert(struct ckpt_ctx *ctx, void *ptr, int objref,
+ enum obj_type type);
+
extern void ckpt_ctx_get(struct ckpt_ctx *ctx);
extern void ckpt_ctx_put(struct ckpt_ctx *ctx);

@@ -107,6 +124,7 @@ extern int restore_restart_block(struct ckpt_ctx *ctx);
#define CKPT_DBASE 0x1 /* anything */
#define CKPT_DSYS 0x2 /* generic (system) */
#define CKPT_DRW 0x4 /* image read/write */
+#define CKPT_DOBJ 0x8 /* shared objects */

#define CKPT_DDEFAULT 0xffff /* default debug level */

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index ad5851d..7c46638 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -48,6 +48,7 @@ enum {
CKPT_HDR_HEADER_ARCH,
CKPT_HDR_BUFFER,
CKPT_HDR_STRING,
+ CKPT_HDR_OBJREF,

CKPT_HDR_TREE = 101,
CKPT_HDR_TASK,
@@ -67,6 +68,19 @@ enum {
CKPT_ARCH_X86_32 = 1,
};

+/* shared objrects (objref) */
+struct ckpt_hdr_objref {
+ struct ckpt_hdr h;
+ __u32 objtype;
+ __s32 objref;
+} __attribute__((aligned(8)));
+
+/* shared objects types */
+enum obj_type {
+ CKPT_OBJ_IGNORE = 0,
+ CKPT_OBJ_MAX
+};
+
/* kernel constants */
struct ckpt_hdr_const {
/* task */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index 4785df6..bd78d19 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -38,6 +38,8 @@ struct ckpt_ctx {

atomic_t refcount;

+ struct ckpt_obj_hash *obj_hash; /* repository for shared objects */
+
char err_string[256]; /* checkpoint: error string */

/* [multi-process checkpoint] */
--
1.6.0.4

2009-07-22 10:21:09

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 47/60] deferqueue: generic queue to defer work

Add a interface to postpone an action until the end of the entire
checkpoint or restart operation. This is useful when during the
scan of tasks an operation cannot be performed in place, to avoid
the need for a second scan.

One use case is when restoring an ipc shared memory region that has
been deleted (but is still attached), during restart it needs to be
create, attached and then deleted. However, creation and attachment
are performed in distinct locations, so deletion can not be performed
on the spot. Instead, this work (delete) is deferred until later.
(This example is in one of the following patches).

This interface allows chronic procrastination in the kernel:

deferqueue_create(void):
Allocates and returns a new deferqueue.

deferqueue_run(deferqueue):
Executes all the pending works in the queue. Returns the number
of works executed, or an error upon the first error reported by
a deferred work.

deferqueue_add(deferqueue, data, size, func, dtor):
Enqueue a deferred work. @function is the callback function to
do the work, which will be called with @data as an argument.
@size tells the size of data. @dtor is a destructor callback
that is invoked for deferred works remaining in the queue when
the queue is destroyed. NOTE: for a given deferred work, @dtor
is _not_ called if @func was already called (regardless of the
return value of the latter).

deferqueue_destroy(deferqueue):
Free the deferqueue and any queued items while invoking the
@dtor callback for each queued item.

Why aren't we using the existing kernel workqueue mechanism? We need
to defer to work until the end of the operation: not earlier, since we
need other things to be in place; not later, to not block waiting for
it. However, the workqueue schedules the work for 'some time later'.
Also, the kernel workqueue may run in any task context, but we require
many times that an operation be run in the context of some specific
restarting task (e.g., restoring IPC state of a certain ipc_ns).

Instead, this mechanism is a simple way for the c/r operation as a
whole, and later a task in particular, to defer some action until
later (but not arbitrarily later) _in the restore_ operation.

Changelog[v17]
- Fix deferqueue_add() function

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Kconfig | 5 ++
include/linux/deferqueue.h | 58 +++++++++++++++++++++++
kernel/Makefile | 1 +
kernel/deferqueue.c | 109 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 173 insertions(+), 0 deletions(-)
create mode 100644 include/linux/deferqueue.h
create mode 100644 kernel/deferqueue.c

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
index 21fc86b..4a2c845 100644
--- a/checkpoint/Kconfig
+++ b/checkpoint/Kconfig
@@ -2,10 +2,15 @@
# implemented the hooks for processor state etc. needed by the
# core checkpoint/restart code.

+config DEFERQUEUE
+ bool
+ default n
+
config CHECKPOINT
bool "Checkpoint/restart (EXPERIMENTAL)"
depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
depends on CGROUP_FREEZER
+ select DEFERQUEUE
help
Application checkpoint/restart is the ability to save the
state of a running application so that it can later resume
diff --git a/include/linux/deferqueue.h b/include/linux/deferqueue.h
new file mode 100644
index 0000000..2eb58cf
--- /dev/null
+++ b/include/linux/deferqueue.h
@@ -0,0 +1,58 @@
+/*
+ * deferqueue.h --- deferred work queue handling for Linux.
+ */
+
+#ifndef _LINUX_DEFERQUEUE_H
+#define _LINUX_DEFERQUEUE_H
+
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+/*
+ * This interface allows chronic procrastination in the kernel:
+ *
+ * deferqueue_create(void):
+ * Allocates and returns a new deferqueue.
+ *
+ * deferqueue_run(deferqueue):
+ * Executes all the pending works in the queue. Returns the number
+ * of works executed, or an error upon the first error reported by
+ * a deferred work.
+ *
+ * deferqueue_add(deferqueue, data, size, func, dtor):
+ * Enqueue a deferred work. @function is the callback function to
+ * do the work, which will be called with @data as an argument.
+ * @size tells the size of data. @dtor is a destructor callback
+ * that is invoked for deferred works remaining in the queue when
+ * the queue is destroyed. NOTE: for a given deferred work, @dtor
+ * is _not_ called if @func was already called (regardless of the
+ * return value of the latter).
+ *
+ * deferqueue_destroy(deferqueue):
+ * Free the deferqueue and any queued items while invoking the
+ * @dtor callback for each queued item.
+ */
+
+
+typedef int (*deferqueue_func_t)(void *);
+
+struct deferqueue_entry {
+ deferqueue_func_t function;
+ deferqueue_func_t destructor;
+ struct list_head list;
+ char data[0];
+};
+
+struct deferqueue_head {
+ spinlock_t lock;
+ struct list_head list;
+};
+
+struct deferqueue_head *deferqueue_create(void);
+void deferqueue_destroy(struct deferqueue_head *head);
+int deferqueue_add(struct deferqueue_head *head, void *data, int size,
+ deferqueue_func_t func, deferqueue_func_t dtor);
+int deferqueue_run(struct deferqueue_head *head);
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 2093a69..ef229da 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -23,6 +23,7 @@ CFLAGS_REMOVE_cgroup-debug.o = -pg
CFLAGS_REMOVE_sched_clock.o = -pg
endif

+obj-$(CONFIG_DEFERQUEUE) += deferqueue.o
obj-$(CONFIG_FREEZER) += freezer.o
obj-$(CONFIG_PROFILING) += profile.o
obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
diff --git a/kernel/deferqueue.c b/kernel/deferqueue.c
new file mode 100644
index 0000000..3fb388b
--- /dev/null
+++ b/kernel/deferqueue.c
@@ -0,0 +1,109 @@
+/*
+ * Infrastructure to manage deferred work
+ *
+ * This differs from a workqueue in that the work must be deferred
+ * until specifically run by the caller.
+ *
+ * As the only user currently is checkpoint/restart, which has
+ * very simple usage, the locking is kept simple. Adding rules
+ * is protected by the head->lock. But deferqueue_run() is only
+ * called once, after all entries have been added. So it is not
+ * protected. Similarly, _destroy is only called once when the
+ * ckpt_ctx is releeased, so it is not locked or refcounted. These
+ * can of course be added if needed by other users.
+ *
+ * Why not use workqueue ? We need to defer work until the end of an
+ * operation: not earlier, since we need other things to be in place;
+ * not later, to not block waiting for it. However, the workqueue
+ * schedules the work for 'some time later'. Also, workqueue may run
+ * in any task context, but we require many times that an operation
+ * be run in the context of some specific restarting task (e.g.,
+ * restoring IPC state of a certain ipc_ns).
+ *
+ * Instead, this mechanism is a simple way for the c/r operation as a
+ * whole, and later a task in particular, to defer some action until
+ * later (but not arbitrarily later) _in the restore_ operation.
+ *
+ * Copyright (C) 2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/deferqueue.h>
+
+struct deferqueue_head *deferqueue_create(void)
+{
+ struct deferqueue_head *h = kmalloc(sizeof(*h), GFP_KERNEL);
+ if (h) {
+ spin_lock_init(&h->lock);
+ INIT_LIST_HEAD(&h->list);
+ }
+ return h;
+}
+
+void deferqueue_destroy(struct deferqueue_head *h)
+{
+ if (!list_empty(&h->list)) {
+ struct deferqueue_entry *dq, *n;
+
+ pr_debug("%s: freeing non-empty queue\n", __func__);
+ list_for_each_entry_safe(dq, n, &h->list, list) {
+ dq->destructor(dq->data);
+ list_del(&dq->list);
+ kfree(dq);
+ }
+ }
+ kfree(h);
+}
+
+int deferqueue_add(struct deferqueue_head *head, void *data, int size,
+ deferqueue_func_t func, deferqueue_func_t dtor)
+{
+ struct deferqueue_entry *dq;
+
+ dq = kmalloc(sizeof(*dq) + size, GFP_KERNEL);
+ if (!dq)
+ return -ENOMEM;
+
+ dq->function = func;
+ dq->destructor = dtor;
+ memcpy(dq->data, data, size);
+
+ pr_debug("%s: adding work %p func %p dtor %p\n",
+ __func__, dq, func, dtor);
+ spin_lock(&head->lock);
+ list_add_tail(&dq->list, &head->list);
+ spin_unlock(&head->lock);
+ return 0;
+}
+
+/*
+ * deferqueue_run - perform all work in the work queue
+ * @head: deferqueue_head from which to run
+ *
+ * returns: number of works performed, or < 0 on error
+ */
+int deferqueue_run(struct deferqueue_head *head)
+{
+ struct deferqueue_entry *dq, *n;
+ int nr = 0;
+ int ret;
+
+ list_for_each_entry_safe(dq, n, &head->list, list) {
+ pr_debug("doing work %p function %p\n", dq, dq->function);
+ /* don't call destructor - function callback should do it */
+ ret = dq->function(dq->data);
+ if (ret < 0)
+ pr_debug("wq function failed %d\n", ret);
+ list_del(&dq->list);
+ kfree(dq);
+ nr++;
+ }
+
+ return nr;
+}
--
1.6.0.4

2009-07-22 10:19:56

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 58/60] c/r: checkpoint and restore task credentials

From: Serge E. Hallyn <[email protected]>

This patch adds the checkpointing and restart of credentials
(uids, gids, and capabilities) to Oren's c/r patchset (on top
of v14). It goes to great pains to re-use (and define when
needed) common helpers, in order to make sure that as security
code is modified, the cr code will be updated. Some of the
helpers should still be moved (i.e. _creds() functions should
be in kernel/cred.c).

When building the credentials for the restarted process, I
1. create a new struct cred as a copy of the running task's
cred (using prepare_cred())
2. always authorize any changes to the new struct cred
based on the permissions of current_cred() (not the current
transient state of the new cred).

While this may mean that certain transient_cred1->transient_cred2
states are allowed which otherwise wouldn't be allowed, the
fact remains that current_cred() is allowed to transition to
transient_cred2.

The reconstructed creds are applied to the task at the very
end of the sys_restart call. This ensures that any objects which
need to be re-created (file, socket, etc) are re-created using
the creds of the task calling sys_restart - preventing an unpriv
user from creating a privileged object, and ensuring that a
root task can restart a process which had started out privileged,
created some privileged objects, then dropped its privilege.

With these patches, the root user can restart checkpoint images
(created by either hallyn or root) of user hallyn's tasks,
resulting in a program owned by hallyn.

Changelog:
Jun 15: Fix user_ns handling when !CONFIG_USER_N
Set creator_ref=0 for root_ns (discard @flags)
Don't overwrite global user-ns if CONFIG_USER_NS
Jun 10: Merge with ckpt-v16-dev (Oren Laadan)
Jun 01: Don't check ordering of groups in group_info, bc
set_groups() will sort it for us.
May 28: 1. Restore securebits
2. Address Alexey's comments: move prototypes out of
sched.h, validate ngroups < NGROUPS_MAX, validate
groups are sorted, and get rid of ckpt_hdr_cred->version.
3. remove bogus unused flag RESTORE_CREATE_USERNS
May 26: Move group, user, userns, creds c/r functions out
of checkpoint/process.c and into the appropriate files.
May 26: Define struct ckpt_hdr_task_creds and move task cred
objref c/r into {checkpoint_restore}_task_shared().
May 26: Take cred refs around checkpoint_write_creds()
May 20: Remove the limit on number of groups in groupinfo
at checkpoint time
May 20: Remove the depth limit on empty user namespaces
May 20: Better document checkpoint_user
May 18: fix more refcounting: if (userns 5, uid 0) had
no active tasks or child user_namespaces, then
it shouldn't exist at restart or it, its namespace,
and its whole chain of creators will be leaked.
May 14: fix some refcounting:
1. a new user_ns needs a ref to remain pinned
by its root user
2. current_user_ns needs an extra ref bc objhash
drops two on restart
3. cred needs a ref for the real credentials bc
commit_creds eats one ref.
May 13: folded in fix to userns refcounting.

Signed-off-by: Serge E. Hallyn <[email protected]>
[[email protected]: merge with ckpt-v16-dev]
---
checkpoint/namespace.c | 41 ++++++++++
checkpoint/objhash.c | 82 ++++++++++++++++++++
checkpoint/process.c | 111 ++++++++++++++++++++++++++-
include/linux/capability.h | 6 +-
include/linux/checkpoint.h | 12 +++
include/linux/checkpoint_hdr.h | 59 ++++++++++++++
include/linux/checkpoint_types.h | 2 +
kernel/cred.c | 123 +++++++++++++++++++++++++++++
kernel/groups.c | 69 +++++++++++++++++
kernel/user.c | 158 ++++++++++++++++++++++++++++++++++++++
kernel/user_namespace.c | 89 +++++++++++++++++++++
11 files changed, 746 insertions(+), 6 deletions(-)

diff --git a/checkpoint/namespace.c b/checkpoint/namespace.c
index 49b8f0a..89af2c0 100644
--- a/checkpoint/namespace.c
+++ b/checkpoint/namespace.c
@@ -98,3 +98,44 @@ void *restore_uts_ns(struct ckpt_ctx *ctx)
{
return (void *) do_restore_uts_ns(ctx);
}
+
+/*
+ * user_ns - trivial checkpoint/restore for !CONFIG_USER_NS case
+ */
+#ifndef CONFIG_USER_NS
+int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr)
+{
+ struct ckpt_hdr_user_ns *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+ if (!h)
+ return -ENOMEM;
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+void *restore_userns(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_user_ns *h;
+ struct user_namespace *ns;
+
+ /* complain if image contains multiple namespaces */
+ if (ctx->stats.user_ns)
+ return ERR_PTR(-EEXIST);
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+ if (IS_ERR(h))
+ return ERR_PTR(PTR_ERR(h));
+
+ if (h->creator_ref)
+ ns = ERR_PTR(-EINVAL);
+ else
+ ns = get_user_ns(current_user_ns());
+
+ ctx->stats.user_ns++;
+ ckpt_hdr_put(ctx, h);
+ return ns;
+}
+#endif
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 29c7a04..15b9d66 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -17,6 +17,7 @@
#include <linux/fdtable.h>
#include <linux/sched.h>
#include <linux/ipc_namespace.h>
+#include <linux/user_namespace.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

@@ -182,6 +183,51 @@ static int obj_ipc_ns_users(void *ptr)
return atomic_read(&((struct ipc_namespace *) ptr)->count);
}

+static int obj_cred_grab(void *ptr)
+{
+ get_cred((struct cred *) ptr);
+ return 0;
+}
+
+static void obj_cred_drop(void *ptr)
+{
+ put_cred((struct cred *) ptr);
+}
+
+static int obj_user_grab(void *ptr)
+{
+ struct user_struct *u = ptr;
+ (void) get_uid(u);
+ return 0;
+}
+
+static void obj_user_drop(void *ptr)
+{
+ free_uid((struct user_struct *) ptr);
+}
+
+static int obj_userns_grab(void *ptr)
+{
+ get_user_ns((struct user_namespace *) ptr);
+ return 0;
+}
+
+static void obj_userns_drop(void *ptr)
+{
+ put_user_ns((struct user_namespace *) ptr);
+}
+
+static int obj_groupinfo_grab(void *ptr)
+{
+ get_group_info((struct group_info *) ptr);
+ return 0;
+}
+
+static void obj_groupinfo_drop(void *ptr)
+{
+ put_group_info((struct group_info *) ptr);
+}
+
static struct ckpt_obj_ops ckpt_obj_ops[] = {
/* ignored object */
{
@@ -259,6 +305,42 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.checkpoint = checkpoint_ipc_ns,
.restore = restore_ipc_ns,
},
+ /* user_ns object */
+ {
+ .obj_name = "USER_NS",
+ .obj_type = CKPT_OBJ_USER_NS,
+ .ref_drop = obj_userns_drop,
+ .ref_grab = obj_userns_grab,
+ .checkpoint = checkpoint_userns,
+ .restore = restore_userns,
+ },
+ /* struct cred */
+ {
+ .obj_name = "CRED",
+ .obj_type = CKPT_OBJ_CRED,
+ .ref_drop = obj_cred_drop,
+ .ref_grab = obj_cred_grab,
+ .checkpoint = checkpoint_cred,
+ .restore = restore_cred,
+ },
+ /* user object */
+ {
+ .obj_name = "USER",
+ .obj_type = CKPT_OBJ_USER,
+ .ref_drop = obj_user_drop,
+ .ref_grab = obj_user_grab,
+ .checkpoint = checkpoint_user,
+ .restore = restore_user,
+ },
+ /* struct groupinfo */
+ {
+ .obj_name = "GROUPINFO",
+ .obj_type = CKPT_OBJ_GROUPINFO,
+ .ref_drop = obj_groupinfo_drop,
+ .ref_grab = obj_groupinfo_grab,
+ .checkpoint = checkpoint_groupinfo,
+ .restore = restore_groupinfo,
+ },
};


diff --git a/checkpoint/process.c b/checkpoint/process.c
index 245607b..f028822 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -17,6 +17,7 @@
#include <linux/futex.h>
#include <linux/poll.h>
#include <linux/utsname.h>
+#include <linux/user_namespace.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>
#include <linux/syscalls.h>
@@ -135,6 +136,45 @@ static int checkpoint_task_ns(struct ckpt_ctx *ctx, struct task_struct *t)
return ret;
}

+static int checkpoint_task_creds(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ int realcred_ref, ecred_ref;
+ struct cred *rcred, *ecred;
+ struct ckpt_hdr_task_creds *h;
+ int ret;
+
+ rcred = get_cred(t->real_cred);
+ ecred = get_cred(t->cred);
+
+ realcred_ref = checkpoint_obj(ctx, rcred, CKPT_OBJ_CRED);
+ if (realcred_ref < 0) {
+ ret = realcred_ref;
+ goto error;
+ }
+
+ ecred_ref = checkpoint_obj(ctx, ecred, CKPT_OBJ_CRED);
+ if (ecred_ref < 0) {
+ ret = ecred_ref;
+ goto error;
+ }
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_CREDS);
+ if (!h) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ h->cred_ref = realcred_ref;
+ h->ecred_ref = ecred_ref;
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+
+error:
+ put_cred(rcred);
+ put_cred(ecred);
+ return ret;
+}
+
static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
{
struct ckpt_hdr_task_objs *h;
@@ -150,8 +190,12 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
* restored when it gets to restore, e.g. its memory.
*/

- ret = checkpoint_task_ns(ctx, t);
- ckpt_debug("ns: objref %d\n", ret);
+ ret = checkpoint_task_creds(ctx, t);
+ ckpt_debug("cred: objref %d\n", ret);
+ if (!ret) {
+ ret = checkpoint_task_ns(ctx, t);
+ ckpt_debug("ns: objref %d\n", ret);
+ }
if (ret < 0)
return ret;

@@ -430,6 +474,34 @@ static int restore_task_ns(struct ckpt_ctx *ctx)
return ret;
}

+static int restore_task_creds(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_task_creds *h;
+ struct cred *realcred, *ecred;
+ int ret = 0;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK_CREDS);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ realcred = ckpt_obj_fetch(ctx, h->cred_ref, CKPT_OBJ_CRED);
+ if (IS_ERR(realcred)) {
+ ret = PTR_ERR(realcred);
+ goto out;
+ }
+ ecred = ckpt_obj_fetch(ctx, h->ecred_ref, CKPT_OBJ_CRED);
+ if (IS_ERR(ecred)) {
+ ret = PTR_ERR(ecred);
+ goto out;
+ }
+ ctx->realcred = realcred;
+ ctx->ecred = ecred;
+
+out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
static int restore_task_objs(struct ckpt_ctx *ctx)
{
struct ckpt_hdr_task_objs *h;
@@ -440,7 +512,9 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
* and because shared objects are restored before they are
* referenced. See comment in checkpoint_task_objs.
*/
- ret = restore_task_ns(ctx);
+ ret = restore_task_creds(ctx);
+ if (!ret)
+ ret = restore_task_ns(ctx);
if (ret < 0)
return ret;

@@ -458,6 +532,33 @@ static int restore_task_objs(struct ckpt_ctx *ctx)
return ret;
}

+static int restore_creds(struct ckpt_ctx *ctx)
+{
+ int ret;
+ const struct cred *old;
+ struct cred *rcred, *ecred;
+
+ rcred = ctx->realcred;
+ ecred = ctx->ecred;
+
+ /* commit_creds will take one ref for the eff creds, but
+ * expects us to hold a ref for the obj creds, so take a
+ * ref here */
+ get_cred(rcred);
+ ret = commit_creds(rcred);
+ if (ret)
+ return ret;
+
+ if (ecred == rcred)
+ return 0;
+
+ old = override_creds(ecred); /* override_creds otoh takes new ref */
+ put_cred(old);
+
+ ctx->realcred = ctx->ecred = NULL;
+ return 0;
+}
+
int restore_restart_block(struct ckpt_ctx *ctx)
{
struct ckpt_hdr_restart_block *h;
@@ -591,6 +692,10 @@ int restore_task(struct ckpt_ctx *ctx)
goto out;
ret = restore_cpu(ctx);
ckpt_debug("cpu %d\n", ret);
+ if (ret < 0)
+ goto out;
+ ret = restore_creds(ctx);
+ ckpt_debug("creds: ret %d\n", ret);
out:
return ret;
}
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 3a74655..2f726f7 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -569,10 +569,10 @@ struct dentry;
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);

struct cred;
-int apply_securebits(unsigned securebits, struct cred *new);
+extern int apply_securebits(unsigned securebits, struct cred *new);
struct ckpt_capabilities;
-int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
-void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred * cred);
+extern int restore_capabilities(struct ckpt_capabilities *h, struct cred *new);
+extern void checkpoint_capabilities(struct ckpt_capabilities *h, struct cred *cred);

#endif /* __KERNEL__ */

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2ee1bc2..0a8bfc7 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -25,6 +25,7 @@
#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/ipc_namespace.h>
+#include <linux/user_namespace.h>
#include <linux/checkpoint_types.h>
#include <linux/checkpoint_hdr.h>

@@ -169,6 +170,17 @@ extern int checkpoint_file_common(struct ckpt_ctx *ctx, struct file *file,
extern int restore_file_common(struct ckpt_ctx *ctx, struct file *file,
struct ckpt_hdr_file *h);

+/* credentials */
+extern int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr);
+extern int checkpoint_user(struct ckpt_ctx *ctx, void *ptr);
+extern int checkpoint_cred(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_groupinfo(struct ckpt_ctx *ctx);
+extern void *restore_user(struct ckpt_ctx *ctx);
+extern void *restore_cred(struct ckpt_ctx *ctx);
+
+extern int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_userns(struct ckpt_ctx *ctx);
+
/* memory */
extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);

diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 1f6a33d..ca02d9d 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -61,6 +61,11 @@ enum {
CKPT_HDR_UTS_NS,
CKPT_HDR_IPC_NS,
CKPT_HDR_CAPABILITIES,
+ CKPT_HDR_USER_NS,
+ CKPT_HDR_CRED,
+ CKPT_HDR_USER,
+ CKPT_HDR_GROUPINFO,
+ CKPT_HDR_TASK_CREDS,

/* 201-299: reserved for arch-dependent */

@@ -110,6 +115,10 @@ enum obj_type {
CKPT_OBJ_NS,
CKPT_OBJ_UTS_NS,
CKPT_OBJ_IPC_NS,
+ CKPT_OBJ_USER_NS,
+ CKPT_OBJ_CRED,
+ CKPT_OBJ_USER,
+ CKPT_OBJ_GROUPINFO,
CKPT_OBJ_MAX
};

@@ -183,6 +192,11 @@ struct ckpt_hdr_task {
__u32 exit_signal;
__u32 pdeath_signal;

+#ifdef CONFIG_AUDITSYSCALL
+ /* would audit want to track the checkpointed ids,
+ or (more likely) who actually restarted? */
+#endif
+
__u64 set_child_tid;
__u64 clear_child_tid;

@@ -190,6 +204,7 @@ struct ckpt_hdr_task {
__u32 compat_robust_futex_list; /* a compat __user ptr */
__u32 robust_futex_head_len;
__u64 robust_futex_list; /* a __user ptr */
+
} __attribute__((aligned(8)));

/* Posix capabilities */
@@ -202,6 +217,50 @@ struct ckpt_capabilities {
__u32 padding;
} __attribute__((aligned(8)));

+struct ckpt_hdr_task_creds {
+ struct ckpt_hdr h;
+ __s32 cred_ref;
+ __s32 ecred_ref;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_cred {
+ struct ckpt_hdr h;
+ __u32 uid, suid, euid, fsuid;
+ __u32 gid, sgid, egid, fsgid;
+ __s32 user_ref;
+ __s32 groupinfo_ref;
+ struct ckpt_capabilities cap_s;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_groupinfo {
+ struct ckpt_hdr h;
+ __u32 ngroups;
+ /*
+ * This is followed by ngroups __u32s
+ */
+ __u32 groups[0];
+} __attribute__((aligned(8)));
+
+/*
+ * todo - keyrings and LSM
+ * These may be better done with userspace help though
+ */
+struct ckpt_hdr_user_struct {
+ struct ckpt_hdr h;
+ __u32 uid;
+ __s32 userns_ref;
+} __attribute__((aligned(8)));
+
+/*
+ * The user-struct mostly tracks system resource usage.
+ * Most of it's contents therefore will simply be set
+ * correctly as restart opens resources
+ */
+struct ckpt_hdr_user_ns {
+ struct ckpt_hdr h;
+ __s32 creator_ref;
+} __attribute__((aligned(8)));
+
/* namespaces */
struct ckpt_hdr_task_ns {
struct ckpt_hdr h;
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
index fb9b5b2..e98251b 100644
--- a/include/linux/checkpoint_types.h
+++ b/include/linux/checkpoint_types.h
@@ -23,6 +23,7 @@
struct ckpt_stats {
int uts_ns;
int ipc_ns;
+ int user_ns;
};

struct ckpt_ctx {
@@ -65,6 +66,7 @@ struct ckpt_ctx {
int active_pid; /* (next) position in pids array */
struct completion complete; /* container root and other tasks on */
wait_queue_head_t waitq; /* start, end, and restart ordering */
+ struct cred *realcred, *ecred; /* tmp storage for cred at restart */

struct ckpt_stats stats; /* statistics */
};
diff --git a/kernel/cred.c b/kernel/cred.c
index 5c8db56..27e02ca 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -16,6 +16,7 @@
#include <linux/init_task.h>
#include <linux/security.h>
#include <linux/cn_proc.h>
+#include <linux/checkpoint.h>
#include "cred-internals.h"

static struct kmem_cache *cred_jar;
@@ -703,3 +704,125 @@ int cred_setfsgid(struct cred *new, gid_t gid, gid_t *old_fsgid)
}
return -EPERM;
}
+
+#ifdef CONFIG_CHECKPOINT
+static int do_checkpoint_cred(struct ckpt_ctx *ctx, const struct cred *cred)
+{
+ int ret;
+ int groupinfo_ref, user_ref;
+ struct ckpt_hdr_cred *h;
+
+ groupinfo_ref = checkpoint_obj(ctx, cred->group_info,
+ CKPT_OBJ_GROUPINFO);
+ if (groupinfo_ref < 0)
+ return groupinfo_ref;
+ user_ref = checkpoint_obj(ctx, cred->user, CKPT_OBJ_USER);
+ if (user_ref < 0)
+ return user_ref;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CRED);
+ if (!h)
+ return -ENOMEM;
+
+ h->uid = cred->uid;
+ h->suid = cred->suid;
+ h->euid = cred->euid;
+ h->fsuid = cred->fsuid;
+
+ h->gid = cred->gid;
+ h->sgid = cred->sgid;
+ h->egid = cred->egid;
+ h->fsgid = cred->fsgid;
+
+ checkpoint_capabilities(&h->cap_s, cred);
+
+ h->user_ref = user_ref;
+ h->groupinfo_ref = groupinfo_ref;
+
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+int checkpoint_cred(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_cred(ctx, (struct cred *) ptr);
+}
+
+static struct cred *do_restore_cred(struct ckpt_ctx *ctx)
+{
+ struct cred *cred;
+ struct ckpt_hdr_cred *h;
+ struct user_struct *user;
+ struct group_info *groupinfo;
+ int ret = -EINVAL;
+ uid_t olduid;
+ gid_t oldgid;
+ int i;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CRED);
+ if (IS_ERR(h))
+ return ERR_PTR(PTR_ERR(h));
+
+ cred = prepare_creds();
+ if (!cred)
+ goto error;
+
+
+ /* Do we care if the target user and target group were compatible?
+ * Probably. But then, we can't do any setuid without CAP_SETUID,
+ * so we must have been privileged to abuse it... */
+ groupinfo = ckpt_obj_fetch(ctx, h->groupinfo_ref, CKPT_OBJ_GROUPINFO);
+ if (IS_ERR(groupinfo))
+ goto err_putcred;
+ user = ckpt_obj_fetch(ctx, h->user_ref, CKPT_OBJ_USER);
+ if (IS_ERR(user))
+ goto err_putcred;
+
+ /*
+ * TODO: this check should go into the common helper in
+ * kernel/sys.c, and should account for user namespaces
+ */
+ if (!capable(CAP_SETGID))
+ for (i = 0; i < groupinfo->ngroups; i++) {
+ if (!in_egroup_p(GROUP_AT(groupinfo, i)))
+ goto err_putcred;
+ }
+ ret = set_groups(cred, groupinfo);
+ if (ret < 0)
+ goto err_putcred;
+ free_uid(cred->user);
+ cred->user = get_uid(user);
+ ret = cred_setresuid(cred, h->uid, h->euid, h->suid);
+ if (ret < 0)
+ goto err_putcred;
+ ret = cred_setfsuid(cred, h->fsuid, &olduid);
+ if (olduid != h->fsuid && ret < 0)
+ goto err_putcred;
+ ret = cred_setresgid(cred, h->gid, h->egid, h->sgid);
+ if (ret < 0)
+ goto err_putcred;
+ ret = cred_setfsgid(cred, h->fsgid, &oldgid);
+ if (oldgid != h->fsgid && ret < 0)
+ goto err_putcred;
+ ret = restore_capabilities(&h->cap_s, cred);
+ if (ret)
+ goto err_putcred;
+
+ ckpt_hdr_put(ctx, h);
+ return cred;
+
+err_putcred:
+ abort_creds(cred);
+error:
+ ckpt_hdr_put(ctx, h);
+ return ERR_PTR(ret);
+}
+
+void *restore_cred(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_cred(ctx);
+}
+
+#endif
diff --git a/kernel/groups.c b/kernel/groups.c
index 2b45b2e..3612c3e 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -6,6 +6,7 @@
#include <linux/slab.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/checkpoint.h>
#include <asm/uaccess.h>

/* init to 2 - one for init_task, one to ensure it is never freed */
@@ -286,3 +287,71 @@ int in_egroup_p(gid_t grp)
}

EXPORT_SYMBOL(in_egroup_p);
+
+#ifdef CONFIG_CHECKPOINT
+static int do_checkpoint_groupinfo(struct ckpt_ctx *ctx, struct group_info *g)
+{
+ int ret, i, size;
+ struct ckpt_hdr_groupinfo *h;
+
+ size = sizeof(*h) + g->ngroups * sizeof(__u32);
+ h = ckpt_hdr_get_type(ctx, size, CKPT_HDR_GROUPINFO);
+ if (!h)
+ return -ENOMEM;
+
+ h->ngroups = g->ngroups;
+ for (i = 0; i < g->ngroups; i++)
+ h->groups[i] = GROUP_AT(g, i);
+
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+int checkpoint_groupinfo(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_groupinfo(ctx, (struct group_info *)ptr);
+}
+
+/*
+ * TODO - switch to reading in smaller blocks?
+ */
+#define MAX_GROUPINFO_SIZE (sizeof(*h)+NGROUPS_MAX*sizeof(gid_t))
+static struct group_info *do_restore_groupinfo(struct ckpt_ctx *ctx)
+{
+ struct group_info *g;
+ struct ckpt_hdr_groupinfo *h;
+ int i;
+
+ h = ckpt_read_buf_type(ctx, MAX_GROUPINFO_SIZE, CKPT_HDR_GROUPINFO);
+ if (IS_ERR(h))
+ return ERR_PTR(PTR_ERR(h));
+
+ g = ERR_PTR(-EINVAL);
+ if (h->ngroups > NGROUPS_MAX)
+ goto out;
+
+ for (i = 1; i < h->ngroups; i++)
+ if (h->groups[i-1] >= h->groups[i])
+ goto out;
+
+ g = groups_alloc(h->ngroups);
+ if (!g) {
+ g = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+ for (i = 0; i < h->ngroups; i++)
+ GROUP_AT(g, i) = h->groups[i];
+
+out:
+ ckpt_hdr_put(ctx, h);
+ return g;
+}
+
+void *restore_groupinfo(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_groupinfo(ctx);
+}
+
+#endif
diff --git a/kernel/user.c b/kernel/user.c
index 2c000e7..a535ed6 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -16,6 +16,7 @@
#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/user_namespace.h>
+#include <linux/checkpoint.h>
#include "cred-internals.h"

struct user_namespace init_user_ns = {
@@ -508,3 +509,160 @@ static int __init uid_cache_init(void)
}

module_init(uid_cache_init);
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * write the user struct
+ * TODO keyring will need to be dumped
+ *
+ * Here is what we're doing. Remember a task can do clone(CLONE_NEWUSER)
+ * resulting in a cloned task in a new user namespace, with uid 0 in that
+ * new user_ns. In that case, the parent's user (uid+user_ns) is the
+ * 'creator' of the new user_ns.
+ * Here, we call the user_ns of the ctx->root_task the 'root_ns'. When we
+ * checkpoint a user-struct, we must store the chain of creators. We
+ * must not do so recursively, this being the kernel. In
+ * checkpoint_write_user() we walk and record in memory the list of creators up
+ * to either the latest user_struct which has already been saved, or the
+ * root_ns. Then we walk that chain backward, writing out the user_ns and
+ * user_struct to the checkpoint image.
+ */
+#define UNSAVED_STRIDE 50
+static int do_checkpoint_user(struct ckpt_ctx *ctx, struct user_struct *u)
+{
+ struct user_namespace *ns, *root_ns;
+ struct ckpt_hdr_user_struct *h;
+ int ns_objref;
+ int ret, i, unsaved_ns_nr = 0;
+ struct user_struct *save_u;
+ struct user_struct **unsaved_creators;
+ int step = 1, size;
+
+ /* if we've already saved the userns, then life is good */
+ ns_objref = ckpt_obj_lookup(ctx, u->user_ns, CKPT_OBJ_USER_NS);
+ if (ns_objref)
+ goto write_user;
+
+ root_ns = task_cred_xxx(ctx->root_task, user)->user_ns;
+
+ if (u->user_ns == root_ns)
+ goto save_last_ns;
+
+ size = UNSAVED_STRIDE*sizeof(struct user_struct *);
+ unsaved_creators = kmalloc(size, GFP_KERNEL);
+ if (!unsaved_creators)
+ return -ENOMEM;
+ save_u = u;
+ do {
+ ns = save_u->user_ns;
+ save_u = ns->creator;
+ if (ckpt_obj_lookup(ctx, save_u, CKPT_OBJ_USER))
+ goto found;
+ unsaved_creators[unsaved_ns_nr++] = save_u;
+ if (unsaved_ns_nr == step * UNSAVED_STRIDE) {
+ step++;
+ size = step*UNSAVED_STRIDE*sizeof(struct user_struct *);
+ unsaved_creators = krealloc(unsaved_creators, size,
+ GFP_KERNEL);
+ if (!unsaved_creators)
+ return -ENOMEM;
+ }
+ } while (ns != root_ns);
+
+found:
+ for (i = unsaved_ns_nr-1; i >= 0; i--) {
+ ret = checkpoint_obj(ctx, unsaved_creators[i], CKPT_OBJ_USER);
+ if (ret < 0) {
+ kfree(unsaved_creators);
+ return ret;
+ }
+ }
+ kfree(unsaved_creators);
+
+save_last_ns:
+ ns_objref = checkpoint_obj(ctx, u->user_ns, CKPT_OBJ_USER_NS);
+ if (ns_objref < 0)
+ return ns_objref;
+
+write_user:
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER);
+ if (!h)
+ return -ENOMEM;
+
+ h->uid = u->uid;
+ h->userns_ref = ns_objref;
+
+ /* write out the user_struct */
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+int checkpoint_user(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_user(ctx, (struct user_struct *) ptr);
+}
+
+static int may_setuid(struct user_namespace *ns, uid_t uid)
+{
+ /*
+ * this next check will one day become
+ * if capable(CAP_SETUID, ns) return 1;
+ * followed by uid_equiv(current_userns, current_uid, ns, uid)
+ * instead of just uids.
+ */
+ if (capable(CAP_SETUID))
+ return 1;
+
+ /*
+ * this may be overly strict, but since we might end up
+ * restarting a privileged program here, we do not want
+ * someone with only CAP_SYS_ADMIN but no CAP_SETUID to
+ * be able to create random userids even in a userns he
+ * created.
+ */
+ if (current_user()->user_ns != ns)
+ return 0;
+ if (current_uid() == uid ||
+ current_euid() == uid ||
+ current_suid() == uid)
+ return 1;
+ return 0;
+}
+
+static struct user_struct *do_restore_user(struct ckpt_ctx *ctx)
+{
+ struct user_struct *u;
+ struct user_namespace *ns;
+ struct ckpt_hdr_user_struct *h;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER);
+ if (IS_ERR(h))
+ return ERR_PTR(PTR_ERR(h));
+
+ ns = ckpt_obj_fetch(ctx, h->userns_ref, CKPT_OBJ_USER_NS);
+ if (IS_ERR(ns)) {
+ u = ERR_PTR(PTR_ERR(ns));
+ goto out;
+ }
+
+ if (!may_setuid(ns, h->uid)) {
+ u = ERR_PTR(-EPERM);
+ goto out;
+ }
+ u = alloc_uid(ns, h->uid);
+ if (!u)
+ u = ERR_PTR(-EINVAL);
+
+out:
+ ckpt_hdr_put(ctx, h);
+ return u;
+}
+
+void *restore_user(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_user(ctx);
+}
+
+#endif
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e624b0f..3a35b50 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -9,6 +9,7 @@
#include <linux/nsproxy.h>
#include <linux/slab.h>
#include <linux/user_namespace.h>
+#include <linux/checkpoint.h>
#include <linux/cred.h>

static struct user_namespace *_new_user_ns(struct user_struct *creator,
@@ -103,3 +104,91 @@ void free_user_ns(struct kref *kref)
schedule_work(&ns->destroyer);
}
EXPORT_SYMBOL(free_user_ns);
+
+#ifdef CONFIG_CHECKPOINT
+/*
+ * do_checkpoint_userns() is only called from do_checkpoint_user().
+ * When called, we always know that either:
+ * 1. This is the root_ns (user_ns of the ctx->root_task),
+ * in which case we set h->creator_ref = 0.
+ * or
+ * 2. The creator has already been written out to the
+ * checkpoint image (and saved in the objhash)
+ */
+static int do_checkpoint_userns(struct ckpt_ctx *ctx, struct user_namespace *ns)
+{
+ struct ckpt_hdr_user_ns *h;
+ struct user_namespace *root_ns;
+ int creator_ref = 0;
+ int ret;
+
+ root_ns = task_cred_xxx(ctx->root_task, user)->user_ns;
+ if (ns != root_ns) {
+ creator_ref = ckpt_obj_lookup(ctx, ns->creator, CKPT_OBJ_USER);
+ if (!creator_ref)
+ return -EINVAL;
+ }
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+ if (!h)
+ return -ENOMEM;
+ h->creator_ref = creator_ref;
+ ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+int checkpoint_userns(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_userns(ctx, (struct user_namespace *) ptr);
+}
+
+static struct user_namespace *do_restore_userns(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_user_ns *h;
+ struct user_namespace *ns;
+ struct user_struct *new_root, *creator;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_USER_NS);
+ if (IS_ERR(h))
+ return ERR_PTR(PTR_ERR(h));
+
+ if (!h->creator_ref) {
+ ns = get_user_ns(current_user_ns());
+ goto out;
+ }
+
+ creator = ckpt_obj_fetch(ctx, h->creator_ref, CKPT_OBJ_USER);
+ if (IS_ERR(creator)) {
+ ns = ERR_PTR(-EINVAL);
+ goto out;
+ }
+
+ ns = new_user_ns(creator, &new_root);
+ if (IS_ERR(ns))
+ goto out;
+
+ /* ns only referenced from new_root, which we discard below */
+ get_user_ns(ns);
+
+ /* new_user_ns() doesn't bump creator's refcount */
+ get_uid(creator);
+
+ /*
+ * Free the new root user. If we actually needed it,
+ * then it will show up later in the checkpoint image
+ * The objhash will keep the userns pinned until then.
+ */
+ free_uid(new_root);
+ out:
+ ctx->stats.user_ns++;
+ ckpt_hdr_put(ctx, h);
+ return ns;
+}
+
+void *restore_userns(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_userns(ctx);
+}
+#endif
--
1.6.0.4

2009-07-22 10:18:40

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 32/60] c/r: introduce '->checkpoint()' method in 'struct file_operations'

While we assume all normal files and directories can be checkpointed,
there are, as usual in the VFS, specialized places that will always
need an ability to override these defaults. Although we could do this
completely in the checkpoint code, that would bitrot quickly.

This adds a new 'file_operations' function for checkpointing a file.
It is assumed that there should be a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] patches down the road, all that we have
to do to make something simple be supported is add a single "generic"
f_op entry.

Also introduce vfs_fcntl() so that it can be called from restart (see
patch adding restart of files).

Changelog[v17]
- Forward-declare 'ckpt_ctx' et-al, don't use checkpoint_types.h

Signed-off-by: Oren Laadan <[email protected]>
---
fs/fcntl.c | 21 +++++++++++++--------
include/linux/fs.h | 6 ++++++
2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ae41308..78d3116 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -339,6 +339,18 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
return err;
}

+int vfs_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp)
+{
+ int err;
+
+ err = security_file_fcntl(filp, cmd, arg);
+ if (err)
+ goto out;
+ err = do_fcntl(fd, cmd, arg, filp);
+ out:
+ return err;
+}
+
SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{
struct file *filp;
@@ -348,14 +360,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
if (!filp)
goto out;

- err = security_file_fcntl(filp, cmd, arg);
- if (err) {
- fput(filp);
- return err;
- }
-
- err = do_fcntl(fd, cmd, arg, filp);
-
+ err = vfs_fcntl(fd, cmd, arg, filp);
fput(filp);
out:
return err;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d88d4fc..05d4745 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -388,6 +388,7 @@ struct kstatfs;
struct vm_area_struct;
struct vfsmount;
struct cred;
+struct ckpt_ctx;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -1088,6 +1089,8 @@ struct file_lock {

#include <linux/fcntl.h>

+extern int vfs_fcntl(int fd, unsigned cmd, unsigned long arg, struct file *fp);
+
extern void send_sigio(struct fown_struct *fown, int fd, int band);

/* fs/sync.c */
@@ -1510,6 +1513,7 @@ struct file_operations {
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
+ int (*checkpoint)(struct ckpt_ctx *, struct file *);
};

struct inode_operations {
@@ -2309,6 +2313,8 @@ void inode_sub_bytes(struct inode *inode, loff_t bytes);
loff_t inode_get_bytes(struct inode *inode);
void inode_set_bytes(struct inode *inode, loff_t bytes);

+#define generic_file_checkpoint NULL
+
extern int vfs_readdir(struct file *, filldir_t, void *);

extern int vfs_stat(char __user *, struct kstat *);
--
1.6.0.4

2009-07-22 10:10:21

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 01/60] c/r: extend arch_setup_additional_pages()

From: Alexey Dobriyan <[email protected]>

Add "start" argument, to request to map vDSO to a specific place,
and fail the operation if not.

This is useful for restart(2) to ensure that memory layout is restore
exactly as needed.

Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Oren Laadan <[email protected]>
---
arch/powerpc/include/asm/elf.h | 1 +
arch/powerpc/kernel/vdso.c | 13 ++++++++++++-
arch/s390/include/asm/elf.h | 2 +-
arch/s390/kernel/vdso.c | 13 ++++++++++++-
arch/sh/include/asm/elf.h | 1 +
arch/sh/kernel/vsyscall/vsyscall.c | 2 +-
arch/x86/include/asm/elf.h | 3 ++-
arch/x86/vdso/vdso32-setup.c | 9 +++++++--
arch/x86/vdso/vma.c | 9 +++++++--
fs/binfmt_elf.c | 2 +-
10 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 014a624..3cef9cf 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -271,6 +271,7 @@ extern int ucache_bsize;
#define ARCH_HAS_SETUP_ADDITIONAL_PAGES
struct linux_binprm;
extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+ unsigned long start,
int uses_interp);
#define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index ad06d5c..c25213b 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -184,7 +184,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
* This is called from binfmt_elf, we create the special vma for the
* vDSO and insert it into the mm struct tree
*/
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+ unsigned long start, int uses_interp)
{
struct mm_struct *mm = current->mm;
struct page **vdso_pagelist;
@@ -211,6 +212,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
vdso_base = VDSO32_MBASE;
#endif

+ /* in case restart(2) mandates a specific location */
+ if (start)
+ vdso_base = start;
+
current->mm->context.vdso_base = 0;

/* vDSO has a problem and was disabled, just don't "enable" it for the
@@ -234,6 +239,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
goto fail_mmapsem;
}

+ /* for restart(2), double check that we got we asked for */
+ if (start && vdso_base != start) {
+ ret = -EBUSY;
+ goto fail_mmapsem;
+ }
+
/*
* our vma flags don't have VM_WRITE so by default, the process isn't
* allowed to write those pages.
diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h
index 74d0bbb..54235bc 100644
--- a/arch/s390/include/asm/elf.h
+++ b/arch/s390/include/asm/elf.h
@@ -205,6 +205,6 @@ do { \
struct linux_binprm;

#define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
-int arch_setup_additional_pages(struct linux_binprm *, int);
+int arch_setup_additional_pages(struct linux_binprm *, unsigned long, int);

#endif
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 45e1708..c2ee689 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -193,7 +193,8 @@ static void vdso_init_cr5(void)
* This is called from binfmt_elf, we create the special vma for the
* vDSO and insert it into the mm struct tree
*/
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+ unsigned long start, int uses_interp)
{
struct mm_struct *mm = current->mm;
struct page **vdso_pagelist;
@@ -224,6 +225,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
vdso_pages = vdso32_pages;
#endif

+ /* in case restart(2) mandates a specific location */
+ if (start)
+ vdso_base = start;
+
/*
* vDSO has a problem and was disabled, just don't "enable" it for
* the process
@@ -246,6 +251,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
goto out_up;
}

+ /* for restart(2), double check that we got we asked for */
+ if (start && vdso_base != start) {
+ rc = -EINVAL;
+ goto out_up;
+ }
+
/*
* our vma flags don't have VM_WRITE so by default, the process
* isn't allowed to write those pages.
diff --git a/arch/sh/include/asm/elf.h b/arch/sh/include/asm/elf.h
index ccb1d93..6c27b1f 100644
--- a/arch/sh/include/asm/elf.h
+++ b/arch/sh/include/asm/elf.h
@@ -202,6 +202,7 @@ do { \
#define ARCH_HAS_SETUP_ADDITIONAL_PAGES
struct linux_binprm;
extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+ unsigned long start,
int uses_interp);

extern unsigned int vdso_enabled;
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 3f7e415..64c70e5 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -59,7 +59,7 @@ int __init vsyscall_init(void)
}

/* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm, unsigned long start, int uses_interp)
{
struct mm_struct *mm = current->mm;
unsigned long addr;
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 83c1bc8..a4398c8 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -336,9 +336,10 @@ struct linux_binprm;

#define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+ unsigned long start,
int uses_interp);

-extern int syscall32_setup_pages(struct linux_binprm *, int exstack);
+extern int syscall32_setup_pages(struct linux_binprm *, unsigned long start, int exstack);
#define compat_arch_setup_additional_pages syscall32_setup_pages

extern unsigned long arch_randomize_brk(struct mm_struct *mm);
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 58bc00f..5c914b0 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -310,7 +310,8 @@ int __init sysenter_setup(void)
}

/* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+ unsigned long start, int uses_interp)
{
struct mm_struct *mm = current->mm;
unsigned long addr;
@@ -331,13 +332,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
if (compat)
addr = VDSO_HIGH_BASE;
else {
- addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
+ addr = get_unmapped_area(NULL, start, PAGE_SIZE, 0, 0);
if (IS_ERR_VALUE(addr)) {
ret = addr;
goto up_fail;
}
}

+ /* for restart(2), double check that we got we asked for */
+ if (start && addr != start)
+ goto up_fail;
+
current->mm->context.vdso = (void *)addr;

if (compat_uses_vma || !compat) {
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 21e1aeb..393b22a 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -99,7 +99,8 @@ static unsigned long vdso_addr(unsigned long start, unsigned len)

/* Setup a VMA at program startup for the vsyscall page.
Not called for compat tasks */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+ unsigned long start, int uses_interp)
{
struct mm_struct *mm = current->mm;
unsigned long addr;
@@ -109,13 +110,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
return 0;

down_write(&mm->mmap_sem);
- addr = vdso_addr(mm->start_stack, vdso_size);
+ addr = start ? : vdso_addr(mm->start_stack, vdso_size);
addr = get_unmapped_area(NULL, addr, vdso_size, 0, 0);
if (IS_ERR_VALUE(addr)) {
ret = addr;
goto up_fail;
}

+ /* for restart(2), double check that we got we asked for */
+ if (start && addr != start)
+ goto up_fail;
+
current->mm->context.vdso = (void *)addr;

ret = install_special_mapping(mm, addr, vdso_size,
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index b7c1603..14a1b3c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -945,7 +945,7 @@ static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
set_binfmt(&elf_format);

#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
- retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
+ retval = arch_setup_additional_pages(bprm, 0, !!elf_interpreter);
if (retval < 0) {
send_sig(SIGKILL, current, 0);
goto out;
--
1.6.0.4

2009-07-22 10:22:50

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 02/60] x86: ptrace debugreg checks rewrite

From: Alexey Dobriyan <[email protected]>

This is a mess.

Pre unified-x86 code did check for breakpoint addr
to be "< TASK_SIZE - 3 (or 7)". This was fine from security POV,
but banned valid breakpoint usage when address is close to TASK_SIZE.
E. g. 1-byte breakpoint at TASK_SIZE - 1 should be allowed, but it wasn't.

Then came commit 84929801e14d968caeb84795bfbb88f04283fbd9
("[PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes")
which for some reason touched ptrace as well and made effective
TASK_SIZE of 32-bit process depending on IA32_PAGE_OFFSET
which is not a constant!:

#define IA32_PAGE_OFFSET ((current->personality & ADDR_LIMIT_3GB) ? 0xc0000000 : 0xFFFFe000)
^^^^^^^
Maximum addr for breakpoint became dependent on personality of ptracer.

Commit also relaxed danger zone for 32-bit processes from 8 bytes to 4
not taking into account that 8-byte wide breakpoints are possible even
for 32-bit processes. This was fine, however, because 64-bit kernel
addresses are too far from 32-bit ones.

Then came utrace with commit 2047b08be67b70875d8765fc81d34ce28041bec3
("x86: x86 ptrace getreg/putreg merge") which copy-pasted and ifdeffed 32-bit
part of TASK_SIZE_OF() leaving 8-byte issue as-is.

So, what patch fixes?
1) Too strict logic near TASK_SIZE boundary -- as long as we don't cross
TASK_SIZE_MAX, we're fine.
2) Too smart logic of using breakpoints over non-existent kernel
boundary -- we should only protect against setting up after
TASK_SIZE_MAX, the rest is none of kernel business. This fixes
IA32_PAGE_OFFSET beartrap as well.

As a bonus, remove uberhack and big comment determining DR7 validness,
rewrite with clear algorithm when it's obvious what's going on.

Make DR validness checker suitable for C/R. On restart DR registers
must be checked the same way they are checked on PTRACE_POKEUSR.

Question 1: TIF_DEBUG can set even if none of breakpoints is turned on,
should this be optimized?

Question 2: Breakpoints are allowed to be globally enabled, is this a
security risk?

Signed-off-by: Alexey Dobriyan <[email protected]>
---
arch/x86/kernel/ptrace.c | 175 +++++++++++++++++++++++++++-------------------
1 files changed, 103 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 09ecbde..9b4cacf 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -136,11 +136,6 @@ static int set_segment_reg(struct task_struct *task,
return 0;
}

-static unsigned long debugreg_addr_limit(struct task_struct *task)
-{
- return TASK_SIZE - 3;
-}
-
#else /* CONFIG_X86_64 */

#define FLAG_MASK (FLAG_MASK_32 | X86_EFLAGS_NT)
@@ -264,16 +259,6 @@ static int set_segment_reg(struct task_struct *task,

return 0;
}
-
-static unsigned long debugreg_addr_limit(struct task_struct *task)
-{
-#ifdef CONFIG_IA32_EMULATION
- if (test_tsk_thread_flag(task, TIF_IA32))
- return IA32_PAGE_OFFSET - 3;
-#endif
- return TASK_SIZE_MAX - 7;
-}
-
#endif /* CONFIG_X86_32 */

static unsigned long get_flags(struct task_struct *task)
@@ -481,77 +466,123 @@ static unsigned long ptrace_get_debugreg(struct task_struct *child, int n)
return 0;
}

+static int ptrace_check_debugreg(int _32bit,
+ unsigned long dr0, unsigned long dr1,
+ unsigned long dr2, unsigned long dr3,
+ unsigned long dr6, unsigned long dr7)
+{
+ /* Breakpoint type: 00: --x, 01: -w-, 10: undefined, 11: rw- */
+ unsigned int rw[4];
+ /* Breakpoint length: 00: 1 byte, 01: 2 bytes, 10: 8 bytes, 11: 4 bytes */
+ unsigned int len[4];
+ int n;
+
+ if (dr0 >= TASK_SIZE_MAX)
+ return -EINVAL;
+ if (dr1 >= TASK_SIZE_MAX)
+ return -EINVAL;
+ if (dr2 >= TASK_SIZE_MAX)
+ return -EINVAL;
+ if (dr3 >= TASK_SIZE_MAX)
+ return -EINVAL;
+
+ for (n = 0; n < 4; n++) {
+ rw[n] = (dr7 >> (16 + n * 4)) & 0x3;
+ len[n] = (dr7 >> (16 + n * 4 + 2)) & 0x3;
+
+ if (rw[n] == 0x2)
+ return -EINVAL;
+ if (rw[n] == 0x0 && len[n] != 0x0)
+ return -EINVAL;
+ if (_32bit && len[n] == 0x2)
+ return -EINVAL;
+
+ if (len[n] == 0x0)
+ len[n] = 1;
+ else if (len[n] == 0x1)
+ len[n] = 2;
+ else if (len[n] == 0x2)
+ len[n] = 8;
+ else if (len[n] == 0x3)
+ len[n] = 4;
+ /* From now breakpoint length is in bytes. */
+ }
+
+ if (dr6 & ~0xFFFFFFFFUL)
+ return -EINVAL;
+ if (dr7 & ~0xFFFFFFFFUL)
+ return -EINVAL;
+
+ if (dr7 == 0)
+ return 0;
+
+ if (dr0 + len[0] > TASK_SIZE_MAX)
+ return -EINVAL;
+ if (dr1 + len[1] > TASK_SIZE_MAX)
+ return -EINVAL;
+ if (dr2 + len[2] > TASK_SIZE_MAX)
+ return -EINVAL;
+ if (dr3 + len[3] > TASK_SIZE_MAX)
+ return -EINVAL;
+
+ return 0;
+}
+
static int ptrace_set_debugreg(struct task_struct *child,
int n, unsigned long data)
{
- int i;
+ unsigned long dr0, dr1, dr2, dr3, dr6, dr7;
+ int _32bit;

if (unlikely(n == 4 || n == 5))
return -EIO;

- if (n < 4 && unlikely(data >= debugreg_addr_limit(child)))
- return -EIO;
-
+ dr0 = child->thread.debugreg0;
+ dr1 = child->thread.debugreg1;
+ dr2 = child->thread.debugreg2;
+ dr3 = child->thread.debugreg3;
+ dr6 = child->thread.debugreg6;
+ dr7 = child->thread.debugreg7;
switch (n) {
- case 0: child->thread.debugreg0 = data; break;
- case 1: child->thread.debugreg1 = data; break;
- case 2: child->thread.debugreg2 = data; break;
- case 3: child->thread.debugreg3 = data; break;
-
+ case 0:
+ dr0 = data;
+ break;
+ case 1:
+ dr1 = data;
+ break;
+ case 2:
+ dr2 = data;
+ break;
+ case 3:
+ dr3 = data;
+ break;
case 6:
- if ((data & ~0xffffffffUL) != 0)
- return -EIO;
- child->thread.debugreg6 = data;
+ dr6 = data;
break;
-
case 7:
- /*
- * Sanity-check data. Take one half-byte at once with
- * check = (val >> (16 + 4*i)) & 0xf. It contains the
- * R/Wi and LENi bits; bits 0 and 1 are R/Wi, and bits
- * 2 and 3 are LENi. Given a list of invalid values,
- * we do mask |= 1 << invalid_value, so that
- * (mask >> check) & 1 is a correct test for invalid
- * values.
- *
- * R/Wi contains the type of the breakpoint /
- * watchpoint, LENi contains the length of the watched
- * data in the watchpoint case.
- *
- * The invalid values are:
- * - LENi == 0x10 (undefined), so mask |= 0x0f00. [32-bit]
- * - R/Wi == 0x10 (break on I/O reads or writes), so
- * mask |= 0x4444.
- * - R/Wi == 0x00 && LENi != 0x00, so we have mask |=
- * 0x1110.
- *
- * Finally, mask = 0x0f00 | 0x4444 | 0x1110 == 0x5f54.
- *
- * See the Intel Manual "System Programming Guide",
- * 15.2.4
- *
- * Note that LENi == 0x10 is defined on x86_64 in long
- * mode (i.e. even for 32-bit userspace software, but
- * 64-bit kernel), so the x86_64 mask value is 0x5454.
- * See the AMD manual no. 24593 (AMD64 System Programming)
- */
-#ifdef CONFIG_X86_32
-#define DR7_MASK 0x5f54
-#else
-#define DR7_MASK 0x5554
-#endif
- data &= ~DR_CONTROL_RESERVED;
- for (i = 0; i < 4; i++)
- if ((DR7_MASK >> ((data >> (16 + 4*i)) & 0xf)) & 1)
- return -EIO;
- child->thread.debugreg7 = data;
- if (data)
- set_tsk_thread_flag(child, TIF_DEBUG);
- else
- clear_tsk_thread_flag(child, TIF_DEBUG);
+ dr7 = data & ~DR_CONTROL_RESERVED;
break;
}

+ _32bit = (sizeof(unsigned long) == 4);
+#ifdef CONFIG_COMPAT
+ if (test_tsk_thread_flag(child, TIF_IA32))
+ _32bit = 1;
+#endif
+ if (ptrace_check_debugreg(_32bit, dr0, dr1, dr2, dr3, dr6, dr7))
+ return -EIO;
+
+ child->thread.debugreg0 = dr0;
+ child->thread.debugreg1 = dr1;
+ child->thread.debugreg2 = dr2;
+ child->thread.debugreg3 = dr3;
+ child->thread.debugreg6 = dr6;
+ child->thread.debugreg7 = dr7;
+ if (dr7)
+ set_tsk_thread_flag(child, TIF_DEBUG);
+ else
+ clear_tsk_thread_flag(child, TIF_DEBUG);
+
return 0;
}

--
1.6.0.4

2009-07-22 10:22:53

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 21/60] c/r: x86_32 support for checkpoint/restart

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

In addition, architecture capabilities are saved in an architecure
specific extension of the header (ckpt_hdr_head_arch); Currently this
includes only FPU capabilities.

Currently only x86-32 is supported.

Changelog[v17]:
- Fix compilation for architectures that don't support checkpoint
- Validate cpu registers and TLS descriptors on restart
- Validate debug registers on restart
- Export asm/checkpoint_hdr.h to userspace
Changelog[v16]:
- All objects are preceded by ckpt_hdr (TLS and xstate_buf)
- Add architecture identifier to main header
Changelog[v14]:
- Use new interface ckpt_hdr_get/put()
- Embed struct ckpt_hdr in struct ckpt_hdr...
- Remove preempt_disable/enable() around init_fpu() and fix leak
- Revert change to pr_debug(), back to ckpt_debug()
- Move code related to task_struct to checkpoint/process.c
Changelog[v12]:
- A couple of missed calls to ckpt_hbuf_put()
- Replace obsolete ckpt_debug() with pr_debug()
Changelog[v9]:
- Add arch-specific header that details architecture capabilities;
split FPU restore to send capabilities only once.
- Test for zero TLS entries in ckpt_write_thread()
- Fix asm/checkpoint_hdr.h so it can be included from user-space
Changelog[v7]:
- Fix save/restore state of FPU
Changelog[v5]:
- Remove preempt_disable() when restoring debug registers
Changelog[v4]:
- Fix header structure alignment
Changelog[v2]:
- Pad header structures to 64 bits to ensure compatibility
- Follow Dave Hansen's refactoring of the original post

Signed-off-by: Oren Laadan <[email protected]>
---
arch/x86/include/asm/Kbuild | 1 +
arch/x86/include/asm/checkpoint_hdr.h | 122 ++++++++
arch/x86/include/asm/ptrace.h | 5 +
arch/x86/kernel/ptrace.c | 8 +-
arch/x86/mm/Makefile | 2 +
arch/x86/mm/checkpoint.c | 534 +++++++++++++++++++++++++++++++++
checkpoint/checkpoint.c | 7 +-
checkpoint/process.c | 20 ++-
checkpoint/restart.c | 6 +
include/linux/checkpoint.h | 9 +
include/linux/checkpoint_hdr.h | 16 +-
11 files changed, 722 insertions(+), 8 deletions(-)
create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
create mode 100644 arch/x86/mm/checkpoint.c

diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 4a8e80c..f76cb6e 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -2,6 +2,7 @@ include include/asm-generic/Kbuild.asm

header-y += boot.h
header-y += bootparam.h
+header-y += checkpoint_hdr.h
header-y += debugreg.h
header-y += ldt.h
header-y += msr-index.h
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..c5762fb
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,122 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ * Checkpoint/restart - architecture specific headers x86
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ * "This structure has an odd multiple of 32-bit members, which means
+ * that if you put it into a larger structure that also contains 64-bit
+ * members, the larger structure may get different alignment on x86-32
+ * and x86-64, which you might want to avoid. I can't tell if this is
+ * an actual problem here. ... In this case, I'm pretty sure that
+ * sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ * will be 32-bit aligned on x86-32."
+ */
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+#ifdef CONFIG_X86_32
+#define CKPT_ARCH_ID CKPT_ARCH_X86_32
+#endif
+
+/* arch dependent header types */
+enum {
+ CKPT_HDR_CPU_FPU = 201,
+};
+
+struct ckpt_hdr_header_arch {
+ struct ckpt_hdr h;
+ /* FIXME: add HAVE_HWFP */
+ __u16 has_fxsr;
+ __u16 has_xsave;
+ __u16 xstate_size;
+ __u16 _pading;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_thread {
+ struct ckpt_hdr h;
+ /* FIXME: restart blocks */
+ __u32 thread_info_flags;
+ __u16 gdt_entry_tls_entries;
+ __u16 sizeof_tls_array;
+} __attribute__((aligned(8)));
+
+/* designed to work for both x86_32 and x86_64 */
+struct ckpt_hdr_cpu {
+ struct ckpt_hdr h;
+ /* see struct pt_regs (x86_64) */
+ __u64 r15;
+ __u64 r14;
+ __u64 r13;
+ __u64 r12;
+ __u64 bp;
+ __u64 bx;
+ __u64 r11;
+ __u64 r10;
+ __u64 r9;
+ __u64 r8;
+ __u64 ax;
+ __u64 cx;
+ __u64 dx;
+ __u64 si;
+ __u64 di;
+ __u64 orig_ax;
+ __u64 ip;
+ __u64 sp;
+
+ __u64 flags;
+
+ /* segment registers */
+ __u64 fs;
+ __u64 gs;
+
+ __u16 fsindex;
+ __u16 gsindex;
+ __u16 cs;
+ __u16 ss;
+ __u16 ds;
+ __u16 es;
+
+ __u32 used_math;
+
+ /* debug registers */
+ __u64 debugreg0;
+ __u64 debugreg1;
+ __u64 debugreg2;
+ __u64 debugreg3;
+ __u64 debugreg6;
+ __u64 debugreg7;
+
+ /* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#define CKPT_X86_SEG_NULL 0
+#define CKPT_X86_SEG_USER32_CS 1
+#define CKPT_X86_SEG_USER32_DS 2
+#define CKPT_X86_SEG_TLS 0x4000 /* 0100 0000 0000 00xx */
+#define CKPT_X86_SEG_LDT 0x8000 /* 100x xxxx xxxx xxxx */
+
+#endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 0f0d908..66b507b 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -242,6 +242,11 @@ extern void ptrace_bts_untrace(struct task_struct *tsk);
#define arch_ptrace_untrace(tsk) ptrace_bts_untrace(tsk)
#endif /* CONFIG_X86_PTRACE_BTS */

+extern int ptrace_check_debugreg(int _32bit,
+ unsigned long dr0, unsigned long dr1,
+ unsigned long dr2, unsigned long dr3,
+ unsigned long dr6, unsigned long dr7);
+
#endif /* __KERNEL__ */

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 9b4cacf..3b434bd 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -466,10 +466,10 @@ static unsigned long ptrace_get_debugreg(struct task_struct *child, int n)
return 0;
}

-static int ptrace_check_debugreg(int _32bit,
- unsigned long dr0, unsigned long dr1,
- unsigned long dr2, unsigned long dr3,
- unsigned long dr6, unsigned long dr7)
+int ptrace_check_debugreg(int _32bit,
+ unsigned long dr0, unsigned long dr1,
+ unsigned long dr2, unsigned long dr3,
+ unsigned long dr6, unsigned long dr7)
{
/* Breakpoint type: 00: --x, 01: -w-, 10: undefined, 11: rw- */
unsigned int rw[4];
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index eefdeee..ddd5abb 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -21,3 +21,5 @@ obj-$(CONFIG_K8_NUMA) += k8topology_64.o
obj-$(CONFIG_ACPI_NUMA) += srat_$(BITS).o

obj-$(CONFIG_MEMTEST) += memtest.o
+
+obj-$(CONFIG_CHECKPOINT) += checkpoint.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..f085e14
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,534 @@
+/*
+ * Checkpoint/restart - architecture specific support for x86
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/*
+ * helpers to encode/decode/validate registers/segments/eflags
+ */
+
+static int check_eflags(__u32 eflags)
+{
+#define X86_EFLAGS_CKPT_MASK \
+ (X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | \
+ X86_EFLAGS_SF | X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_OF | \
+ X86_EFLAGS_NT | X86_EFLAGS_AC | X86_EFLAGS_ID)
+
+ if ((eflags & ~X86_EFLAGS_CKPT_MASK) != (X86_EFLAGS_IF | 0x2))
+ return 0;
+ return 1;
+}
+
+static int check_tls(struct desc_struct *desc)
+{
+ if (!desc->a && !desc->b)
+ return 1;
+ if (desc->l != 0 || desc->s != 1 || desc->dpl != 3)
+ return 0;
+ return 1;
+}
+
+static int check_segment(__u16 seg)
+{
+ int ret = 0;
+
+ switch (seg) {
+ case CKPT_X86_SEG_NULL:
+ case CKPT_X86_SEG_USER32_CS:
+ case CKPT_X86_SEG_USER32_DS:
+ return 1;
+ }
+ if (seg & CKPT_X86_SEG_TLS) {
+ seg &= ~CKPT_X86_SEG_TLS;
+ if (seg <= GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN)
+ ret = 1;
+ } else if (seg & CKPT_X86_SEG_LDT) {
+ seg &= ~CKPT_X86_SEG_LDT;
+ if (seg <= 0x1fff)
+ ret = 1;
+ }
+ return ret;
+}
+
+static __u16 encode_segment(unsigned short seg)
+{
+ if (seg == 0)
+ return CKPT_X86_SEG_NULL;
+ BUG_ON((seg & 3) != 3);
+
+ if (seg == __USER_CS)
+ return CKPT_X86_SEG_USER32_CS;
+ if (seg == __USER_DS)
+ return CKPT_X86_SEG_USER32_DS;
+
+ if (seg & 4)
+ return CKPT_X86_SEG_LDT | (seg >> 3);
+
+ seg >>= 3;
+ if (GDT_ENTRY_TLS_MIN <= seg && seg <= GDT_ENTRY_TLS_MAX)
+ return CKPT_X86_SEG_TLS | (seg - GDT_ENTRY_TLS_MIN);
+
+ printk(KERN_ERR "c/r: (decode) bad segment %#hx\n", seg);
+ BUG();
+}
+
+static unsigned short decode_segment(__u16 seg)
+{
+ if (seg == CKPT_X86_SEG_NULL)
+ return 0;
+ if (seg == CKPT_X86_SEG_USER32_CS)
+ return __USER_CS;
+ if (seg == CKPT_X86_SEG_USER32_DS)
+ return __USER_DS;
+
+ if (seg & CKPT_X86_SEG_TLS) {
+ seg &= ~CKPT_X86_SEG_TLS;
+ return ((GDT_ENTRY_TLS_MIN + seg) << 3) | 3;
+ }
+ if (seg & CKPT_X86_SEG_LDT) {
+ seg &= ~CKPT_X86_SEG_LDT;
+ return (seg << 3) | 7;
+ }
+ BUG();
+}
+
+#define CKPT_X86_TIF_UNSUPPORTED (_TIF_SECCOMP | _TIF_IO_BITMAP)
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static int may_checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ if (t->thread.vm86_info) {
+ ckpt_write_err(ctx, "task %d (%s) in VM86 mode",
+ task_pid_vnr(t), t->comm);
+ return -EBUSY;
+ }
+ if (task_thread_info(t)->flags & CKPT_X86_TIF_UNSUPPORTED) {
+ ckpt_write_err(ctx, "task %d (%s) uncool thread flags %#lx",
+ task_pid_vnr(t), t->comm,
+ task_thread_info(t)->flags);
+ return -EBUSY;
+ }
+ return 0;
+}
+
+/* dump the thread_struct of a given task */
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr_thread *h;
+ int tls_size;
+ int ret;
+
+ ret = may_checkpoint_thread(ctx, t);
+ if (ret < 0)
+ return ret;
+
+ tls_size = sizeof(t->thread.tls_array);
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+ if (!h)
+ return -ENOMEM;
+
+ h->thread_info_flags =
+ task_thread_info(t)->flags & ~CKPT_X86_TIF_UNSUPPORTED;
+ h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+ h->sizeof_tls_array = tls_size;
+
+ /* For simplicity dump the entire array */
+ memcpy(h + 1, t->thread.tls_array, tls_size);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+#ifdef CONFIG_X86_32
+
+static void save_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ struct pt_regs *regs = task_pt_regs(t);
+ unsigned long _gs;
+
+ h->bp = regs->bp;
+ h->bx = regs->bx;
+ h->ax = regs->ax;
+ h->cx = regs->cx;
+ h->dx = regs->dx;
+ h->si = regs->si;
+ h->di = regs->di;
+ h->orig_ax = regs->orig_ax;
+ h->ip = regs->ip;
+
+ h->flags = regs->flags;
+ h->sp = regs->sp;
+
+ h->cs = encode_segment(regs->cs);
+ h->ss = encode_segment(regs->ss);
+ h->ds = encode_segment(regs->ds);
+ h->es = encode_segment(regs->es);
+
+ /*
+ * for checkpoint in process context (from within a container)
+ * the GS segment register should be saved from the hardware;
+ * otherwise it is already saved on the thread structure
+ */
+ if (t == current)
+ _gs = get_user_gs(regs);
+ else
+ _gs = thread->gs;
+
+ h->fsindex = encode_segment(regs->fs);
+ h->gsindex = encode_segment(_gs);
+
+ /*
+ * for checkpoint in process context (from within a container),
+ * the actual syscall is taking place at this very moment; so
+ * we (optimistically) subtitute the future return value (0) of
+ * this syscall into the orig_eax, so that upon restart it will
+ * succeed (or it will endlessly retry checkpoint...)
+ */
+ if (t == current) {
+ BUG_ON(h->orig_ax < 0);
+ h->ax = 0;
+ }
+}
+
+static void save_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+
+ /* debug regs */
+
+ /*
+ * for checkpoint in process context (from within a container),
+ * get the actual registers; otherwise get the saved values.
+ */
+
+ if (t == current) {
+ get_debugreg(h->debugreg0, 0);
+ get_debugreg(h->debugreg1, 1);
+ get_debugreg(h->debugreg2, 2);
+ get_debugreg(h->debugreg3, 3);
+ get_debugreg(h->debugreg6, 6);
+ get_debugreg(h->debugreg7, 7);
+ } else {
+ h->debugreg0 = thread->debugreg0;
+ h->debugreg1 = thread->debugreg1;
+ h->debugreg2 = thread->debugreg2;
+ h->debugreg3 = thread->debugreg3;
+ h->debugreg6 = thread->debugreg6;
+ h->debugreg7 = thread->debugreg7;
+ }
+}
+
+static void save_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+ h->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int checkpoint_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, xstate_size + sizeof(*h),
+ CKPT_HDR_CPU_FPU);
+ if (!h)
+ return -ENOMEM;
+
+ /* i387 + MMU + SSE logic */
+ preempt_disable(); /* needed it (t == current) */
+
+ /*
+ * normally, no need to unlazy_fpu(), since TS_USEDFPU flag
+ * was cleared when task was context-switched out...
+ * except if we are in process context, in which case we do
+ */
+ if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+ unlazy_fpu(current);
+
+ /*
+ * For simplicity dump the entire structure.
+ * FIX: need to be deliberate about what registers we are
+ * dumping for traceability and compatibility.
+ */
+ memcpy(h + 1, t->thread.xstate, xstate_size);
+ preempt_enable(); /* needed if (t == current) */
+
+ ret = ckpt_write_obj(ctx, h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+#endif /* CONFIG_X86_32 */
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr_cpu *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+ if (!h)
+ return -ENOMEM;
+
+ save_cpu_regs(h, t);
+ save_cpu_debug(h, t);
+ save_cpu_fpu(h, t);
+
+ ckpt_debug("math %d debug %d\n", h->used_math, !!h->debugreg7);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ if (ret < 0)
+ goto out;
+
+ if (h->used_math)
+ ret = checkpoint_cpu_fpu(ctx, t);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_header_arch *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+ if (!h)
+ return -ENOMEM;
+
+ /* FPU capabilities */
+ h->has_fxsr = cpu_has_fxsr;
+ h->has_xsave = cpu_has_xsave;
+ h->xstate_size = xstate_size;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+/* read the thread_struct into the current task */
+int restore_thread(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_thread *h;
+ struct thread_struct *thread = &current->thread;
+ struct desc_struct *desc;
+ int tls_size;
+ int i, cpu, ret;
+
+ tls_size = sizeof(thread->tls_array);
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h) + tls_size, CKPT_HDR_THREAD);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ret = -EINVAL;
+ if (h->thread_info_flags & CKPT_X86_TIF_UNSUPPORTED)
+ goto out;
+ if (h->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES)
+ goto out;
+ if (h->sizeof_tls_array != tls_size)
+ goto out;
+
+ /*
+ * restore TLS by hand: why convert to struct user_desc if
+ * sys_set_thread_entry() will convert it back ?
+ */
+ desc = (struct desc_struct *) (h + 1);
+
+ for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+ if (!check_tls(&desc[i]))
+ goto out;
+ }
+
+ cpu = get_cpu();
+ memcpy(thread->tls_array, desc, tls_size);
+ load_TLS(thread, cpu);
+ put_cpu();
+
+ /* TODO: restore TIF flags as necessary (e.g. TIF_NOTSC) */
+
+ ret = 0;
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+#ifdef CONFIG_X86_32
+
+static int load_cpu_regs(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ struct pt_regs *regs = task_pt_regs(t);
+
+ if (!check_eflags(h->flags))
+ return -EINVAL;
+ if (h->cs == CKPT_X86_SEG_NULL)
+ return -EINVAL;
+ if (!check_segment(h->cs) || !check_segment(h->ds) ||
+ !check_segment(h->es) || !check_segment(h->ss) ||
+ !check_segment(h->fsindex) || !check_segment(h->gsindex))
+ return -EINVAL;
+
+ regs->bp = h->bp;
+ regs->bx = h->bx;
+ regs->ax = h->ax;
+ regs->cx = h->cx;
+ regs->dx = h->dx;
+ regs->si = h->si;
+ regs->di = h->di;
+ regs->orig_ax = h->orig_ax;
+ regs->ip = h->ip;
+
+ regs->flags = h->flags;
+ regs->sp = h->sp;
+
+ regs->ds = decode_segment(h->ds);
+ regs->es = decode_segment(h->es);
+ regs->cs = decode_segment(h->cs);
+ regs->ss = decode_segment(h->ss);
+
+ regs->fs = decode_segment(h->fsindex);
+ regs->gs = decode_segment(h->gsindex);
+
+ thread->gs = regs->gs;
+ lazy_load_gs(regs->gs);
+
+ return 0;
+}
+
+static int load_cpu_debug(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+ int ret;
+
+ ret = ptrace_check_debugreg(1, h->debugreg0, h->debugreg1, h->debugreg2,
+ h->debugreg3, h->debugreg6, h->debugreg7);
+ if (ret < 0)
+ return ret;
+
+ set_debugreg(h->debugreg0, 0);
+ set_debugreg(h->debugreg1, 1);
+ /* ignore 4, 5 */
+ set_debugreg(h->debugreg2, 2);
+ set_debugreg(h->debugreg3, 3);
+ set_debugreg(h->debugreg6, 6);
+ set_debugreg(h->debugreg7, 7);
+
+ if (h->debugreg7)
+ set_tsk_thread_flag(t, TIF_DEBUG);
+ else
+ clear_tsk_thread_flag(t, TIF_DEBUG);
+
+ return 0;
+}
+
+static int load_cpu_fpu(struct ckpt_hdr_cpu *h, struct task_struct *t)
+{
+ preempt_disable();
+
+ __clear_fpu(t); /* in case we used FPU in user mode */
+
+ if (!h->used_math)
+ clear_used_math();
+
+ preempt_enable();
+ return 0;
+}
+
+static int restore_cpu_fpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr *h;
+ int ret;
+
+ /* init_fpu() eventually also calls set_used_math() */
+ ret = init_fpu(current);
+ if (ret < 0)
+ return ret;
+
+ h = ckpt_read_obj_type(ctx, xstate_size + sizeof(*h),
+ CKPT_HDR_CPU_FPU);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ memcpy(t->thread.xstate, h + 1, xstate_size);
+
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+#endif /* CONFIG_X86_32 */
+
+/* read the cpu state and registers for the current task */
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_cpu *h;
+ struct task_struct *t = current;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ckpt_debug("math %d debug %d\n", h->used_math, !!h->debugreg7);
+
+ ret = load_cpu_regs(h, t);
+ if (ret < 0)
+ goto out;
+ ret = load_cpu_debug(h, t);
+ if (ret < 0)
+ goto out;
+ ret = load_cpu_fpu(h, t);
+ if (ret < 0)
+ goto out;
+
+ if (h->used_math)
+ ret = restore_cpu_fpu(ctx, t);
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_header_arch *h;
+ int ret = 0;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ /* FIX: verify compatibility of architecture features */
+
+ /* verify FPU capabilities */
+ if (h->has_fxsr != cpu_has_fxsr ||
+ h->has_xsave != cpu_has_xsave ||
+ h->xstate_size != xstate_size)
+ ret = -EINVAL;
+
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 7563a9f..a465fb6 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -203,6 +203,8 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
do_gettimeofday(&ktv);
uts = utsname();

+ h->arch_id = cpu_to_le16(CKPT_ARCH_ID); /* see asm/checkpoitn.h */
+
h->magic = CHECKPOINT_MAGIC_HEAD;
h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
@@ -230,7 +232,10 @@ static int checkpoint_write_header(struct ckpt_ctx *ctx)
ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
up:
up_read(&uts_sem);
- return ret;
+ if (ret < 0)
+ return ret;
+
+ return checkpoint_write_header_arch(ctx);
}

/* write the checkpoint trailer */
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 9e1b861..d2c59d2 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -54,7 +54,15 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)

ret = checkpoint_task_struct(ctx, t);
ckpt_debug("task %d\n", ret);
-
+ if (ret < 0)
+ goto out;
+ ret = checkpoint_thread(ctx, t);
+ ckpt_debug("thread %d\n", ret);
+ if (ret < 0)
+ goto out;
+ ret = checkpoint_cpu(ctx, t);
+ ckpt_debug("cpu %d\n", ret);
+ out:
return ret;
}

@@ -94,6 +102,14 @@ int restore_task(struct ckpt_ctx *ctx)

ret = restore_task_struct(ctx);
ckpt_debug("task %d\n", ret);
-
+ if (ret < 0)
+ goto out;
+ ret = restore_thread(ctx);
+ ckpt_debug("thread %d\n", ret);
+ if (ret < 0)
+ goto out;
+ ret = restore_cpu(ctx);
+ ckpt_debug("cpu %d\n", ret);
+ out:
return ret;
}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 562ce8f..17135fe 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -265,6 +265,8 @@ static int restore_read_header(struct ckpt_ctx *ctx)
return PTR_ERR(h);

ret = -EINVAL;
+ if (le16_to_cpu(h->arch_id) != CKPT_ARCH_ID)
+ goto out;
if (h->magic != CHECKPOINT_MAGIC_HEAD ||
h->rev != CHECKPOINT_VERSION ||
h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
@@ -293,6 +295,10 @@ static int restore_read_header(struct ckpt_ctx *ctx)
if (ret < 0)
goto out;
ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+ if (ret < 0)
+ goto out;
+
+ ret = restore_read_header_arch(ctx);
out:
kfree(uts);
ckpt_hdr_put(ctx, h);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index b2cb91f..f7e2cb8 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -57,6 +57,15 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
extern int restore_task(struct ckpt_ctx *ctx);

+/* arch hooks */
+extern int checkpoint_write_header_arch(struct ckpt_ctx *ctx);
+extern int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t);
+
+extern int restore_read_header_arch(struct ckpt_ctx *ctx);
+extern int restore_thread(struct ckpt_ctx *ctx);
+extern int restore_cpu(struct ckpt_ctx *ctx);
+

/* debugging flags */
#define CKPT_DBASE 0x1 /* anything */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 827a6bb..ce43aa9 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -38,19 +38,33 @@ struct ckpt_hdr {
__u32 len;
} __attribute__((aligned(8)));

+
+#include <asm/checkpoint_hdr.h>
+
+
/* header types */
enum {
CKPT_HDR_HEADER = 1,
+ CKPT_HDR_HEADER_ARCH,
CKPT_HDR_BUFFER,
CKPT_HDR_STRING,

CKPT_HDR_TASK = 101,
+ CKPT_HDR_THREAD,
+ CKPT_HDR_CPU,
+
+ /* 201-299: reserved for arch-dependent */

CKPT_HDR_TAIL = 9001,

CKPT_HDR_ERROR = 9999,
};

+/* architecture */
+enum {
+ CKPT_ARCH_X86_32 = 1,
+};
+
/* kernel constants */
struct ckpt_hdr_const {
/* task */
@@ -66,7 +80,7 @@ struct ckpt_hdr_header {
struct ckpt_hdr h;
__u64 magic;

- __u16 _padding;
+ __u16 arch_id;

__u16 major;
__u16 minor;
--
1.6.0.4

2009-07-22 10:10:25

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 41/60] c/r: dump anonymous- and file-mapped- shared memory

We now handle anonymous and file-mapped shared memory. Support for IPC
shared memory requires support for IPC first. We extend ckpt_write_vma()
to detect shared memory VMAs and handle it separately than private
memory.

There is not much to do for file-mapped shared memory, except to force
msync() on the region to ensure that the file system is consistent
with the checkpoint image. Use our internal type CKPT_VMA_SHM_FILE.

Anonymous shared memory is always backed by inode in shmem filesystem.
We use that inode to look up the VMA in the objhash and register it if
not found (on first encounter). In this case, the type of the VMA is
CKPT_VMA_SHM_ANON, and we dump the contents. On the other hand, if it is
found there, we must have already saved it before, so we change the
type to CKPT_VMA_SHM_ANON_SKIP and skip it.

To dump the contents of a shmem VMA, we loop through the pages of the
inode in the shmem filesystem, and dump the contents of each dirty
(allocated) page - unallocated pages must be clean.

Note that we save the original size of a shmem VMA because it may have
been re-mapped partially. The format itself remains like with private
VMAs, except that instead of addresses we record _indices_ (page nr)
into the backing inode.

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/memory.c | 143 +++++++++++++++++++++++++++++++++++----
checkpoint/objhash.c | 19 +++++
include/linux/checkpoint.h | 15 +++--
include/linux/checkpoint_hdr.h | 8 ++
mm/filemap.c | 39 +++++++++++-
mm/mmap.c | 2 +-
mm/shmem.c | 30 ++++++++
7 files changed, 233 insertions(+), 23 deletions(-)

diff --git a/checkpoint/memory.c b/checkpoint/memory.c
index e11784e..a1d1eca 100644
--- a/checkpoint/memory.c
+++ b/checkpoint/memory.c
@@ -21,6 +21,7 @@
#include <linux/pagemap.h>
#include <linux/mm_types.h>
#include <linux/proc_fs.h>
+#include <linux/swap.h>
#include <linux/checkpoint.h>
#include <linux/checkpoint_hdr.h>

@@ -281,6 +282,54 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
}

/**
+ * consider_shared_page - return page pointer for dirty pages
+ * @ino - inode of shmem object
+ * @idx - page index in shmem object
+ *
+ * Looks up the page that corresponds to the index in the shmem object,
+ * and returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ */
+static struct page *consider_shared_page(struct inode *ino, unsigned long idx)
+{
+ struct page *page = NULL;
+ int ret;
+
+ /*
+ * Inspired by do_shmem_file_read(): very simplified version.
+ *
+ * FIXME: consolidate with do_shmem_file_read()
+ */
+
+ ret = shmem_getpage(ino, idx, &page, SGP_READ, NULL);
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ /*
+ * Only care about dirty pages; shmem_getpage() only returns
+ * pages that have been allocated, so they must be dirty. The
+ * pages returned are locked and referenced.
+ */
+
+ if (page) {
+ unlock_page(page);
+ /*
+ * If users can be writing to this page using arbitrary
+ * virtual addresses, take care about potential aliasing
+ * before reading the page on the kernel side.
+ */
+ if (mapping_writably_mapped(ino->i_mapping))
+ flush_dcache_page(page);
+ /*
+ * Mark the page accessed if we read the beginning.
+ */
+ mark_page_accessed(page);
+ }
+
+ return page;
+}
+
+/**
* vma_fill_pgarr - fill a page-array with addr/page tuples
* @ctx - checkpoint context
* @vma - vma to scan
@@ -289,17 +338,16 @@ static struct page *consider_private_page(struct vm_area_struct *vma,
* Returns the number of pages collected
*/
static int vma_fill_pgarr(struct ckpt_ctx *ctx,
- struct vm_area_struct *vma,
- unsigned long *start)
+ struct vm_area_struct *vma, struct inode *inode,
+ unsigned long *start, unsigned long end)
{
- unsigned long end = vma->vm_end;
unsigned long addr = *start;
struct ckpt_pgarr *pgarr;
int nr_used;
int cnt = 0;

/* this function is only for private memory (anon or file-mapped) */
- BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+ BUG_ON(inode && vma);

do {
pgarr = pgarr_current(ctx);
@@ -311,7 +359,11 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
while (addr < end) {
struct page *page;

- page = consider_private_page(vma, addr);
+ if (vma)
+ page = consider_private_page(vma, addr);
+ else
+ page = consider_shared_page(inode, addr);
+
if (IS_ERR(page))
return PTR_ERR(page);

@@ -323,7 +375,10 @@ static int vma_fill_pgarr(struct ckpt_ctx *ctx,
pgarr->nr_used++;
}

- addr += PAGE_SIZE;
+ if (vma)
+ addr += PAGE_SIZE;
+ else
+ addr++;

if (pgarr_is_full(pgarr))
break;
@@ -395,23 +450,32 @@ static int vma_dump_pages(struct ckpt_ctx *ctx, int total)
}

/**
- * checkpoint_memory_contents - dump contents of a VMA with private memory
+ * checkpoint_memory_contents - dump contents of a memory region
* @ctx - checkpoint context
- * @vma - vma to scan
+ * @vma - vma to scan (--or--)
+ * @inode - inode to scan
*
* Collect lists of pages that needs to be dumped, and corresponding
* virtual addresses into ctx->pgarr_list page-array chain. Then dump
* the addresses, followed by the page contents.
*/
static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma,
+ struct inode *inode)
{
struct ckpt_hdr_pgarr *h;
unsigned long addr, end;
int cnt, ret;

- addr = vma->vm_start;
- end = vma->vm_end;
+ BUG_ON(vma && inode);
+
+ if (vma) {
+ addr = vma->vm_start;
+ end = vma->vm_end;
+ } else {
+ addr = 0;
+ end = PAGE_ALIGN(i_size_read(inode)) >> PAGE_CACHE_SHIFT;
+ }

/*
* Work iteratively, collecting and dumping at most CKPT_PGARR_BATCH
@@ -437,7 +501,7 @@ static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
*/

while (addr < end) {
- cnt = vma_fill_pgarr(ctx, vma, &addr);
+ cnt = vma_fill_pgarr(ctx, vma, inode, &addr, end);
if (cnt == 0)
break;
else if (cnt < 0)
@@ -481,7 +545,7 @@ static int checkpoint_memory_contents(struct ckpt_ctx *ctx,
* @vma_objref: vma objref
*/
int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
- enum vma_type type, int vma_objref)
+ enum vma_type type, int vma_objref, int ino_objref)
{
struct ckpt_hdr_vma *h;
int ret;
@@ -500,6 +564,13 @@ int generic_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,

h->vma_type = type;
h->vma_objref = vma_objref;
+ h->ino_objref = ino_objref;
+
+ if (vma->vm_file)
+ h->ino_size = i_size_read(vma->vm_file->f_dentry->d_inode);
+ else
+ h->ino_size = 0;
+
h->vm_start = vma->vm_start;
h->vm_end = vma->vm_end;
h->vm_page_prot = pgprot_val(vma->vm_page_prot);
@@ -527,10 +598,37 @@ int private_vma_checkpoint(struct ckpt_ctx *ctx,

BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));

- ret = generic_vma_checkpoint(ctx, vma, type, vma_objref);
+ ret = generic_vma_checkpoint(ctx, vma, type, vma_objref, 0);
+ if (ret < 0)
+ goto out;
+ ret = checkpoint_memory_contents(ctx, vma, NULL);
+ out:
+ return ret;
+}
+
+/**
+ * shmem_vma_checkpoint - dump contents of private (anon, file) vma
+ * @ctx: checkpoint context
+ * @vma: vma object
+ * @type: vma type
+ * @objref: vma object id
+ */
+int shmem_vma_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma,
+ enum vma_type type, int ino_objref)
+{
+ struct file *file = vma->vm_file;
+ int ret;
+
+ ckpt_debug("type %d, ino_ref %d\n", type, ino_objref);
+ BUG_ON(!(vma->vm_flags & (VM_SHARED | VM_MAYSHARE)));
+ BUG_ON(!file);
+
+ ret = generic_vma_checkpoint(ctx, vma, type, 0, ino_objref);
if (ret < 0)
goto out;
- ret = checkpoint_memory_contents(ctx, vma);
+ if (type == CKPT_VMA_SHM_ANON_SKIP)
+ goto out;
+ ret = checkpoint_memory_contents(ctx, NULL, file->f_dentry->d_inode);
out:
return ret;
}
@@ -984,6 +1082,21 @@ static struct restore_vma_ops restore_vma_ops[] = {
.vma_type = CKPT_VMA_FILE,
.restore = filemap_restore,
},
+ /* anonymous shared */
+ {
+ .vma_name = "ANON SHARED",
+ .vma_type = CKPT_VMA_SHM_ANON,
+ },
+ /* anonymous shared (skipped) */
+ {
+ .vma_name = "ANON SHARED (skip)",
+ .vma_type = CKPT_VMA_SHM_ANON_SKIP,
+ },
+ /* file-mapped shared */
+ {
+ .vma_name = "FILE SHARED",
+ .vma_type = CKPT_VMA_SHM_FILE,
+ },
};

/**
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 354b200..02b42a0 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -74,6 +74,16 @@ static int obj_no_grab(void *ptr)
return 0;
}

+static int obj_inode_grab(void *ptr)
+{
+ return igrab((struct inode *) ptr) ? 0 : -EBADF;
+}
+
+static void obj_inode_drop(void *ptr)
+{
+ iput((struct inode *) ptr);
+}
+
static int obj_file_table_grab(void *ptr)
{
atomic_inc(&((struct files_struct *) ptr)->count);
@@ -130,6 +140,15 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.ref_drop = obj_no_drop,
.ref_grab = obj_no_grab,
},
+ /* inode object */
+ {
+ .obj_name = "INODE",
+ .obj_type = CKPT_OBJ_INODE,
+ .ref_drop = obj_inode_drop,
+ .ref_grab = obj_inode_grab,
+ .checkpoint = checkpoint_bad, /* no c/r at inode level */
+ .restore = restore_bad, /* no c/r at inode level */
+ },
/* files_struct object */
{
.obj_name = "FILE_TABLE",
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index f7f6967..54cc4b0 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -153,11 +153,15 @@ extern void ckpt_pgarr_free(struct ckpt_ctx *ctx);
extern int generic_vma_checkpoint(struct ckpt_ctx *ctx,
struct vm_area_struct *vma,
enum vma_type type,
- int vma_objref);
+ int vma_objref, int ino_objref);
extern int private_vma_checkpoint(struct ckpt_ctx *ctx,
struct vm_area_struct *vma,
enum vma_type type,
int vma_objref);
+extern int shmem_vma_checkpoint(struct ckpt_ctx *ctx,
+ struct vm_area_struct *vma,
+ enum vma_type type,
+ int ino_objref);

extern int checkpoint_obj_mm(struct ckpt_ctx *ctx, struct task_struct *t);
extern int restore_obj_mm(struct ckpt_ctx *ctx, int mm_objref);
@@ -170,11 +174,10 @@ extern int private_vma_restore(struct ckpt_ctx *ctx, struct mm_struct *mm,
struct file *file, struct ckpt_hdr_vma *h);


-#define CKPT_VMA_NOT_SUPPORTED \
- (VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | \
- VM_NONLINEAR | VM_PFNMAP | VM_RESERVED | VM_NORESERVE \
- | VM_HUGETLB | VM_NONLINEAR | VM_MAPPED_COPY | \
- VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)
+#define CKPT_VMA_NOT_SUPPORTED \
+ (VM_IO | VM_HUGETLB | VM_NONLINEAR | VM_PFNMAP | \
+ VM_RESERVED | VM_NORESERVE | VM_HUGETLB | VM_NONLINEAR | \
+ VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)


/* debugging flags */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 8bd2f11..d95c9fb 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -89,6 +89,7 @@ struct ckpt_hdr_objref {
/* shared objects types */
enum obj_type {
CKPT_OBJ_IGNORE = 0,
+ CKPT_OBJ_INODE,
CKPT_OBJ_FILE_TABLE,
CKPT_OBJ_FILE,
CKPT_OBJ_MM,
@@ -174,6 +175,7 @@ struct ckpt_hdr_task {
/* task's shared resources */
struct ckpt_hdr_task_objs {
struct ckpt_hdr h;
+
__s32 files_objref;
__s32 mm_objref;
} __attribute__((aligned(8)));
@@ -254,6 +256,9 @@ enum vma_type {
CKPT_VMA_VDSO, /* special vdso vma */
CKPT_VMA_ANON, /* private anonymous */
CKPT_VMA_FILE, /* private mapped file */
+ CKPT_VMA_SHM_ANON, /* shared anonymous */
+ CKPT_VMA_SHM_ANON_SKIP, /* shared anonymous (skip contents) */
+ CKPT_VMA_SHM_FILE, /* shared mapped file, only msync */
CKPT_VMA_MAX
};

@@ -262,6 +267,9 @@ struct ckpt_hdr_vma {
struct ckpt_hdr h;
__u32 vma_type;
__s32 vma_objref; /* objref of backing file */
+ __s32 ino_objref; /* objref of shared segment */
+ __u32 _padding;
+ __u64 ino_size; /* size of shared segment */

__u64 vm_start;
__u64 vm_end;
diff --git a/mm/filemap.c b/mm/filemap.c
index 843d88b..a07bb3d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1654,6 +1654,8 @@ static int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
{
struct file *file = vma->vm_file;
int vma_objref;
+ int ino_objref;
+ int first, ret;

if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
@@ -1666,7 +1668,42 @@ static int filemap_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
if (vma_objref < 0)
return vma_objref;

- return private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE, vma_objref);
+ if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
+ /*
+ * Citing mmap(2): "Updates to the mapping are visible
+ * to other processes that map this file, and are
+ * carried through to the underlying file. The file
+ * may not actually be updated until msync(2) or
+ * munmap(2) is called"
+ *
+ * Citing msync(2): "Without use of this call there is
+ * no guarantee that changes are written back before
+ * munmap(2) is called."
+ *
+ * Force msync for region of shared mapped files, to
+ * ensure that that the file system is consistent with
+ * the checkpoint image. (inspired by sys_msync).
+ */
+
+ ino_objref = ckpt_obj_lookup_add(ctx, file->f_dentry->d_inode,
+ CKPT_OBJ_INODE, &first);
+ if (ino_objref < 0)
+ return ino_objref;
+
+ if (first) {
+ ret = vfs_fsync(file, file->f_path.dentry, 0);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = generic_vma_checkpoint(ctx, vma, CKPT_VMA_SHM_FILE,
+ vma_objref, ino_objref);
+ } else {
+ ret = private_vma_checkpoint(ctx, vma, CKPT_VMA_FILE,
+ vma_objref);
+ }
+
+ return ret;
}

int filemap_restore(struct ckpt_ctx *ctx,
diff --git a/mm/mmap.c b/mm/mmap.c
index 52d203e..4c01a90 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2299,7 +2299,7 @@ static int special_mapping_checkpoint(struct ckpt_ctx *ctx,
if (!name || strcmp(name, "[vdso]"))
return -ENOSYS;

- return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0);
+ return generic_vma_checkpoint(ctx, vma, CKPT_VMA_VDSO, 0, 0);
}

int special_mapping_restore(struct ckpt_ctx *ctx,
diff --git a/mm/shmem.c b/mm/shmem.c
index d80532b..808e14a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -30,6 +30,7 @@
#include <linux/module.h>
#include <linux/swap.h>
#include <linux/ima.h>
+#include <linux/checkpoint.h>

static struct vfsmount *shm_mnt;

@@ -2381,6 +2382,32 @@ static void shmem_destroy_inode(struct inode *inode)
kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
}

+#ifdef CONFIG_CHECKPOINT
+static int shmem_checkpoint(struct ckpt_ctx *ctx, struct vm_area_struct *vma)
+{
+ enum vma_type vma_type;
+ int ino_objref;
+ int first;
+
+ /* should be private anonymous ... verify that this is the case */
+ if (vma->vm_flags & CKPT_VMA_NOT_SUPPORTED) {
+ pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+ return -ENOSYS;
+ }
+
+ BUG_ON(!vma->vm_file);
+
+ ino_objref = ckpt_obj_lookup_add(ctx, vma->vm_file->f_dentry->d_inode,
+ CKPT_OBJ_INODE, &first);
+ if (ino_objref < 0)
+ return ino_objref;
+
+ vma_type = (first ? CKPT_VMA_SHM_ANON : CKPT_VMA_SHM_ANON_SKIP);
+
+ return shmem_vma_checkpoint(ctx, vma, vma_type, ino_objref);
+}
+#endif /* CONFIG_CHECKPOINT */
+
static void init_once(void *foo)
{
struct shmem_inode_info *p = (struct shmem_inode_info *) foo;
@@ -2492,6 +2519,9 @@ static struct vm_operations_struct shmem_vm_ops = {
.set_policy = shmem_set_policy,
.get_policy = shmem_get_policy,
#endif
+#ifdef CONFIG_CHECKPOINT
+ .checkpoint = shmem_checkpoint,
+#endif
};


--
1.6.0.4

2009-07-22 10:23:32

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 20/60] c/r: basic infrastructure for checkpoint/restart

Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

checkpoint/sys.c - user/kernel data transfer, as well as setup of the
c/r context (a per-checkpoint data structure for housekeeping)

checkpoint/checkpoint.c - output wrappers and basic checkpoint handling

checkpoint/restart.c - input wrappers and basic restart handling

checkpoint/process.c - c/r of task data

For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Changelog[v17]:
- Fix compilation for architectures that don't support checkpoint
- Save/restore t->{set,clear}_child_tid
- Restart(2) isn't idempotent: must return -EINTR if interrupted
- ckpt_debug does not depend on DYNAMIC_DEBUG, on by default
- Export generic checkpoint headers to userespace
- Fix comment for prototype of sys_restart
- Have ckpt_debug() print global-pid and __LINE__
- Only save and test kernel constants once (in header)
Changelog[v16]:
- Split ctx->flags to ->uflags (user flags) and ->kflags (kernel flags)
- Introduce __ckpt_write_err() and ckpt_write_err() to report errors
- Allow @ptr == NULL to write (or read) header only without payload
- Introduce _ckpt_read_obj_type()
Changelog[v15]:
- Replace header buffer in ckpt_ctx (hbuf,hpos) with kmalloc/kfree()
Changelog[v14]:
- Cleanup interface to get/put hdr buffers
- Merge checkpoint and restart code into a single file (per subsystem)
- Take uts_sem around access to uts->{release,version,machine}
- Embed ckpt_hdr in all ckpt_hdr_...., cleanup read/write helpers
- Define sys_checkpoint(0,...) as asking for a self-checkpoint (Serge)
- Revert use of 'pr_fmt' to avoid tainting whom includes us (Nathan Lynch)
- Explicitly indicate length of UTS fields in header
- Discard field 'h->parent' from ckpt_hdr
Changelog[v12]:
- ckpt_kwrite/ckpt_kread() again use vfs_read(), vfs_write() (safer)
- Split ckpt_write/ckpt_read() to two parts: _ckpt_write/read() helper
- Befriend with sparse : explicit conversion to 'void __user *'
- Redfine 'pr_fmt' instead of using special ckpt_debug()
Changelog[v10]:
- add ckpt_write_buffer(), ckpt_read_buffer() and ckpt_read_buf_type()
- force end-of-string in ckpt_read_string() (fix possible DoS)
Changelog[v9]:
- ckpt_kwrite/ckpt_kread() use file->f_op->write() directly
- Drop ckpt_uwrite/ckpt_uread() since they aren't used anywhere
Changelog[v6]:
- Balance all calls to ckpt_hbuf_get() with matching ckpt_hbuf_put()
(although it's not really needed)
Changelog[v5]:
- Rename headers files s/ckpt/checkpoint/
Changelog[v2]:
- Added utsname->{release,version,machine} to checkpoint header
- Pad header structures to 64 bits to ensure compatibility

Signed-off-by: Oren Laadan <[email protected]>
---
Makefile | 2 +-
checkpoint/Makefile | 6 +-
checkpoint/checkpoint.c | 272 +++++++++++++++++++++++++++++++
checkpoint/process.c | 99 +++++++++++
checkpoint/restart.c | 333 ++++++++++++++++++++++++++++++++++++++
checkpoint/sys.c | 247 ++++++++++++++++++++++++++++-
include/linux/Kbuild | 3 +
include/linux/checkpoint.h | 101 ++++++++++++
include/linux/checkpoint_hdr.h | 109 +++++++++++++
include/linux/checkpoint_types.h | 34 ++++
include/linux/magic.h | 4 +
lib/Kconfig.debug | 13 ++
12 files changed, 1219 insertions(+), 4 deletions(-)
create mode 100644 checkpoint/checkpoint.c
create mode 100644 checkpoint/process.c
create mode 100644 checkpoint/restart.c
create mode 100644 include/linux/checkpoint.h
create mode 100644 include/linux/checkpoint_hdr.h
create mode 100644 include/linux/checkpoint_types.h

diff --git a/Makefile b/Makefile
index be0abac..cf631d7 100644
--- a/Makefile
+++ b/Makefile
@@ -638,7 +638,7 @@ export mod_strip_cmd


ifeq ($(KBUILD_EXTMOD),)
-core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/

vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 8a32c6f..99364cc 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,8 @@
# Makefile for linux checkpoint/restart.
#

-obj-$(CONFIG_CHECKPOINT) += sys.o
+obj-$(CONFIG_CHECKPOINT) += \
+ sys.o \
+ checkpoint.o \
+ restart.o \
+ process.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..7563a9f
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,272 @@
+/*
+ * Checkpoint logic and helpers
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t ctx_count = ATOMIC_INIT(0);
+
+/**
+ * ckpt_write_obj - write an object
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ */
+int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h)
+{
+ _ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+ return ckpt_kwrite(ctx, h, h->len);
+}
+
+/**
+ * ckpt_write_obj_type - write an object (from a pointer)
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ * @type: desired type
+ *
+ * If @ptr is NULL, then write only the header (payload to follow)
+ */
+int ckpt_write_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+ struct ckpt_hdr *h;
+ int ret;
+
+ h = ckpt_hdr_get(ctx, sizeof(*h));
+ if (!h)
+ return -ENOMEM;
+
+ h->type = type;
+ h->len = len + sizeof(*h);
+
+ _ckpt_debug(CKPT_DRW, "type %d len %d\n", h->type, h->len);
+ ret = ckpt_kwrite(ctx, h, sizeof(*h));
+ if (ret < 0)
+ goto out;
+ if (ptr)
+ ret = ckpt_kwrite(ctx, ptr, len);
+ out:
+ _ckpt_hdr_put(ctx, h, sizeof(*h));
+ return ret;
+}
+
+/**
+ * ckpt_write_buffer - write an object of type buffer
+ * @ctx: checkpoint context
+ * @ptr: buffer pointer
+ * @len: buffer size
+ */
+int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+ return ckpt_write_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * ckpt_write_string - write an object of type string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len)
+{
+ return ckpt_write_obj_type(ctx, str, len, CKPT_HDR_STRING);
+}
+
+static void __ckpt_generate_err(struct ckpt_ctx *ctx, char *fmt, va_list ap)
+{
+ va_list aq;
+ char *str;
+ int len;
+
+ va_copy(aq, ap);
+
+ /*
+ * prefix the error string with a '\0' to facilitate easy
+ * backtrace to the beginning of the error message without
+ * needing to parse the entire checkpoint image.
+ */
+ ctx->err_string[0] = '\0';
+ str = &ctx->err_string[1];
+ len = vsnprintf(str, 255, fmt, ap) + 2;
+
+ if (len > 256) {
+ printk(KERN_NOTICE "c/r: error string truncated: ");
+ vprintk(fmt, aq);
+ }
+
+ va_end(aq);
+
+ ckpt_debug("c/r: checkpoint error: %s\n", str);
+}
+
+/**
+ * __ckpt_write_err - save an error string on the ctx->err_string
+ * @ctx: checkpoint context
+ * @fmt: error string format
+ * @...: error string arguments
+ *
+ * Use this during checkpoint to report while holding a spinlock
+ */
+void __ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...)
+{
+ va_list ap;
+
+ va_start(ap, fmt);
+ __ckpt_generate_err(ctx, fmt, ap);
+ va_end(ap);
+}
+
+/**
+ * ckpt_write_err - write an object describing an error
+ * @ctx: checkpoint context
+ * @fmt: error string format
+ * @...: error string arguments
+ *
+ * If @fmt is null, the string in the ctx->err_string will be used (and freed)
+ */
+int ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...)
+{
+ va_list ap;
+ char *str;
+ int len, ret = 0;
+
+ if (fmt) {
+ va_start(ap, fmt);
+ __ckpt_generate_err(ctx, fmt, ap);
+ va_end(ap);
+ }
+
+ str = ctx->err_string;
+ len = strlen(str + 1) + 2; /* leading and trailing '\0' */
+
+ if (len == 0) /* empty error string */
+ return 0;
+
+ ret = ckpt_write_obj_type(ctx, NULL, 0, CKPT_HDR_ERROR);
+ if (!ret)
+ ret = ckpt_write_string(ctx, str, len);
+ if (ret < 0)
+ printk(KERN_NOTICE "c/r: error string unsaved (%d): %s\n",
+ ret, str + 1);
+
+ str[1] = '\0';
+ return ret;
+}
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+static void fill_kernel_const(struct ckpt_hdr_const *h)
+{
+ struct task_struct *tsk;
+ struct new_utsname *uts;
+
+ /* task */
+ h->task_comm_len = sizeof(tsk->comm);
+ /* uts */
+ h->uts_release_len = sizeof(uts->release);
+ h->uts_version_len = sizeof(uts->version);
+ h->uts_machine_len = sizeof(uts->machine);
+}
+
+/* write the checkpoint header */
+static int checkpoint_write_header(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_header *h;
+ struct new_utsname *uts;
+ struct timeval ktv;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+ if (!h)
+ return -ENOMEM;
+
+ do_gettimeofday(&ktv);
+ uts = utsname();
+
+ h->magic = CHECKPOINT_MAGIC_HEAD;
+ h->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+ h->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+ h->patch = (LINUX_VERSION_CODE) & 0xff;
+
+ h->rev = CHECKPOINT_VERSION;
+
+ h->uflags = ctx->uflags;
+ h->time = ktv.tv_sec;
+
+ fill_kernel_const(&h->constants);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ return ret;
+
+ down_read(&uts_sem);
+ ret = ckpt_write_buffer(ctx, uts->release, sizeof(uts->release));
+ if (ret < 0)
+ goto up;
+ ret = ckpt_write_buffer(ctx, uts->version, sizeof(uts->version));
+ if (ret < 0)
+ goto up;
+ ret = ckpt_write_buffer(ctx, uts->machine, sizeof(uts->machine));
+ up:
+ up_read(&uts_sem);
+ return ret;
+}
+
+/* write the checkpoint trailer */
+static int checkpoint_write_tail(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_tail *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+ if (!h)
+ return -ENOMEM;
+
+ h->magic = CHECKPOINT_MAGIC_TAIL;
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
+{
+ long ret;
+
+ ret = checkpoint_write_header(ctx);
+ if (ret < 0)
+ goto out;
+ ret = checkpoint_task(ctx, current);
+ if (ret < 0)
+ goto out;
+ ret = checkpoint_write_tail(ctx);
+ if (ret < 0)
+ goto out;
+
+ /* on success, return (unique) checkpoint identifier */
+ ctx->crid = atomic_inc_return(&ctx_count);
+ ret = ctx->crid;
+ out:
+ return ret;
+}
diff --git a/checkpoint/process.c b/checkpoint/process.c
new file mode 100644
index 0000000..9e1b861
--- /dev/null
+++ b/checkpoint/process.c
@@ -0,0 +1,99 @@
+/*
+ * Checkpoint task structure
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
+#include <linux/sched.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+/* dump the task_struct of a given task */
+static int checkpoint_task_struct(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct ckpt_hdr_task *h;
+ int ret;
+
+ h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+ if (!h)
+ return -ENOMEM;
+
+ h->state = t->state;
+ h->exit_state = t->exit_state;
+ h->exit_code = t->exit_code;
+ h->exit_signal = t->exit_signal;
+
+ h->set_child_tid = t->set_child_tid;
+ h->clear_child_tid = t->clear_child_tid;
+
+ /* FIXME: save remaining relevant task_struct fields */
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+ if (ret < 0)
+ return ret;
+
+ return ckpt_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ int ret;
+
+ ret = checkpoint_task_struct(ctx, t);
+ ckpt_debug("task %d\n", ret);
+
+ return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+/* read the task_struct into the current task */
+static int restore_task_struct(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_task *h;
+ struct task_struct *t = current;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TASK);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ memset(t->comm, 0, TASK_COMM_LEN);
+ ret = _ckpt_read_string(ctx, t->comm, TASK_COMM_LEN);
+ if (ret < 0)
+ goto out;
+
+ t->set_child_tid = h->set_child_tid;
+ t->clear_child_tid = h->clear_child_tid;
+
+ /* FIXME: restore remaining relevant task_struct fields */
+ out:
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+/* read the entire state of the current task */
+int restore_task(struct ckpt_ctx *ctx)
+{
+ int ret;
+
+ ret = restore_task_struct(ctx);
+ ckpt_debug("task %d\n", ret);
+
+ return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..562ce8f
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,333 @@
+/*
+ * Restart logic and helpers
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/utsname.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * _ckpt_read_obj - read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: desired ckpt_hdr
+ * @ptr: desired buffer
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+static int _ckpt_read_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h,
+ void *ptr, int len, int max)
+{
+ int ret;
+
+ ret = ckpt_kread(ctx, h, sizeof(*h));
+ if (ret < 0)
+ return ret;
+ _ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+ h->type, h->len, len, max);
+ if (h->len < sizeof(*h))
+ return -EINVAL;
+ /* if len specified, enforce, else if maximum specified, enforce */
+ if ((len && h->len != len) || (!len && max && h->len > max))
+ return -EINVAL;
+
+ if (ptr)
+ ret = ckpt_kread(ctx, ptr, h->len - sizeof(struct ckpt_hdr));
+ return ret;
+}
+
+/**
+ * _ckpt_read_nbuffer - read an object of type buffer (variable length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ * Returns: actual buffer length (bounded by @len)
+ */
+int _ckpt_read_nbuffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+ struct ckpt_hdr h;
+ int ret;
+
+ BUG_ON(!len);
+
+ len += sizeof(struct ckpt_hdr);
+ ret = _ckpt_read_obj(ctx, &h, ptr, 0, len);
+ if (ret < 0)
+ return ret;
+ _ckpt_debug(CKPT_DRW, "type %d len %d\n", h.type, h.len);
+ if (h.type != CKPT_HDR_BUFFER)
+ return -EINVAL;
+ return h.len;
+}
+
+/**
+ * _ckpt_read_obj_type - read an object of some type (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ * @type: buffer type
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+int _ckpt_read_obj_type(struct ckpt_ctx *ctx, void *ptr, int len, int type)
+{
+ struct ckpt_hdr h;
+ int ret;
+
+ len += sizeof(struct ckpt_hdr);
+ ret = _ckpt_read_obj(ctx, &h, ptr, len, len);
+ if (ret < 0)
+ return ret;
+ if (h.type != type)
+ return -EINVAL;
+ return 0;
+}
+
+/**
+ * _ckpt_read_buffer - read an object of type buffer (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: buffer length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+ BUG_ON(!len);
+ return _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_BUFFER);
+}
+
+/**
+ * _ckpt_read_string - read an object of type string (set length)
+ * @ctx: checkpoint context
+ * @ptr: provided buffer
+ * @len: string length
+ *
+ * If @ptr is NULL, then read only the header (payload to follow)
+ */
+int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+ int ret;
+
+ BUG_ON(!len);
+
+ ret = _ckpt_read_obj_type(ctx, ptr, len, CKPT_HDR_STRING);
+ if (ret < 0)
+ return ret;
+ if (ptr)
+ ((char *) ptr)[len - 1] = '\0'; /* always play it safe */
+ return 0;
+}
+
+/**
+ * ckpt_read_obj - allocate and read an object (ckpt_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: object descriptor
+ * @len: desired payload length (if 0, flexible)
+ * @max: maximum payload length
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+static void *ckpt_read_obj(struct ckpt_ctx *ctx, int len, int max)
+{
+ struct ckpt_hdr hh;
+ struct ckpt_hdr *h;
+ int ret;
+
+ ret = ckpt_kread(ctx, &hh, sizeof(hh));
+ if (ret < 0)
+ return ERR_PTR(ret);
+ _ckpt_debug(CKPT_DRW, "type %d len %d(%d,%d)\n",
+ hh.type, hh.len, len, max);
+ if (hh.len < sizeof(*h))
+ return ERR_PTR(-EINVAL);
+ /* if len specified, enforce, else if maximum specified, enforce */
+ if ((len && hh.len != len) || (!len && max && hh.len > max))
+ return ERR_PTR(-EINVAL);
+
+ h = ckpt_hdr_get(ctx, hh.len);
+ if (!h)
+ return ERR_PTR(-ENOMEM);
+
+ *h = hh; /* yay ! */
+
+ ret = ckpt_kread(ctx, (h + 1), hh.len - sizeof(struct ckpt_hdr));
+ if (ret < 0) {
+ ckpt_hdr_put(ctx, h);
+ h = ERR_PTR(ret);
+ }
+
+ return h;
+}
+
+/**
+ * ckpt_read_obj_type - allocate and read an object of some type
+ * @ctx: checkpoint context
+ * @len: desired object length
+ * @type: desired object type
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type)
+{
+ struct ckpt_hdr *h;
+
+ BUG_ON(!len);
+
+ h = ckpt_read_obj(ctx, len, len);
+ if (IS_ERR(h))
+ return h;
+
+ if (h->type != type) {
+ ckpt_hdr_put(ctx, h);
+ h = ERR_PTR(-EINVAL);
+ }
+
+ return h;
+}
+
+/**
+ * ckpt_read_buf_type - allocate and read an object of some type (flxible)
+ * @ctx: checkpoint context
+ * @len: maximum object length
+ * @type: desired object type
+ *
+ * This differs from ckpt_read_obj_type() in that the length of the
+ * incoming object is flexible (up to the maximum specified by @len),
+ * as determined by the ckpt_hdr data.
+ *
+ * Return: new buffer allocated on success, error pointer otherwise
+ */
+void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type)
+{
+ struct ckpt_hdr *h;
+
+ h = ckpt_read_obj(ctx, 0, len);
+ if (IS_ERR(h))
+ return h;
+
+ if (h->type != type) {
+ ckpt_hdr_put(ctx, h);
+ h = ERR_PTR(-EINVAL);
+ }
+
+ return h;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+static int check_kernel_const(struct ckpt_hdr_const *h)
+{
+ struct task_struct *tsk;
+ struct new_utsname *uts;
+
+ /* task */
+ if (h->task_comm_len != sizeof(tsk->comm))
+ return -EINVAL;
+ /* uts */
+ if (h->uts_release_len != sizeof(uts->release))
+ return -EINVAL;
+ if (h->uts_version_len != sizeof(uts->version))
+ return -EINVAL;
+ if (h->uts_machine_len != sizeof(uts->machine))
+ return -EINVAL;
+
+ return 0;
+}
+
+/* read the checkpoint header */
+static int restore_read_header(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_header *h;
+ struct new_utsname *uts = NULL;
+ int ret;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ ret = -EINVAL;
+ if (h->magic != CHECKPOINT_MAGIC_HEAD ||
+ h->rev != CHECKPOINT_VERSION ||
+ h->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+ h->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+ h->patch != ((LINUX_VERSION_CODE) & 0xff))
+ goto out;
+ if (h->uflags)
+ goto out;
+
+ ret = check_kernel_const(&h->constants);
+ if (ret < 0)
+ goto out;
+
+ ret = -ENOMEM;
+ uts = kmalloc(sizeof(*uts), GFP_KERNEL);
+ if (!uts)
+ goto out;
+
+ ctx->oflags = h->uflags;
+
+ /* FIX: verify compatibility of release, version and machine */
+ ret = _ckpt_read_buffer(ctx, uts->release, sizeof(uts->release));
+ if (ret < 0)
+ goto out;
+ ret = _ckpt_read_buffer(ctx, uts->version, sizeof(uts->version));
+ if (ret < 0)
+ goto out;
+ ret = _ckpt_read_buffer(ctx, uts->machine, sizeof(uts->machine));
+ out:
+ kfree(uts);
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+/* read the checkpoint trailer */
+static int restore_read_tail(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_tail *h;
+ int ret = 0;
+
+ h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_TAIL);
+ if (IS_ERR(h))
+ return PTR_ERR(h);
+
+ if (h->magic != CHECKPOINT_MAGIC_TAIL)
+ ret = -EINVAL;
+
+ ckpt_hdr_put(ctx, h);
+ return ret;
+}
+
+long do_restart(struct ckpt_ctx *ctx, pid_t pid)
+{
+ long ret;
+
+ ret = restore_read_header(ctx);
+ if (ret < 0)
+ return ret;
+ ret = restore_task(ctx);
+ if (ret < 0)
+ return ret;
+ ret = restore_read_tail(ctx);
+
+ /* on success, adjust the return value if needed [TODO] */
+ return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 79936cc..7f6f71e 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -8,9 +8,192 @@
* distribution for more details.
*/

+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ * ckpt_kwrite() - write a kernel-space buffer to the checkpoint image
+ * ckpt_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _ckpt_kwrite(struct file *file, void *addr, int count)
+{
+ void __user *uaddr = (__force void __user *) addr;
+ ssize_t nwrite;
+ int nleft;
+
+ for (nleft = count; nleft; nleft -= nwrite) {
+ loff_t pos = file_pos_read(file);
+ nwrite = vfs_write(file, uaddr, nleft, &pos);
+ file_pos_write(file, pos);
+ if (nwrite < 0) {
+ if (nwrite == -EAGAIN)
+ nwrite = 0;
+ else
+ return nwrite;
+ }
+ uaddr += nwrite;
+ }
+ return 0;
+}
+
+int ckpt_kwrite(struct ckpt_ctx *ctx, void *addr, int count)
+{
+ mm_segment_t fs;
+ int ret;
+
+ fs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = _ckpt_kwrite(ctx->file, addr, count);
+ set_fs(fs);
+
+ ctx->total += count;
+ return ret;
+}
+
+static inline int _ckpt_kread(struct file *file, void *addr, int count)
+{
+ void __user *uaddr = (__force void __user *) addr;
+ ssize_t nread;
+ int nleft;
+
+ for (nleft = count; nleft; nleft -= nread) {
+ loff_t pos = file_pos_read(file);
+ nread = vfs_read(file, uaddr, nleft, &pos);
+ file_pos_write(file, pos);
+ if (nread <= 0) {
+ if (nread == -EAGAIN) {
+ nread = 0;
+ continue;
+ } else if (nread == 0)
+ nread = -EPIPE; /* unexecpted EOF */
+ return nread;
+ }
+ uaddr += nread;
+ }
+ return 0;
+}
+
+int ckpt_kread(struct ckpt_ctx *ctx, void *addr, int count)
+{
+ mm_segment_t fs;
+ int ret;
+
+ fs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = _ckpt_kread(ctx->file , addr, count);
+ set_fs(fs);
+
+ ctx->total += count;
+ return ret;
+}
+
+/**
+ * ckpt_hdr_get - get a hdr of certain size
+ * @ctx: checkpoint context
+ * @len: desired length
+ *
+ * Returns pointer to header
+ */
+void *ckpt_hdr_get(struct ckpt_ctx *ctx, int len)
+{
+ return kzalloc(len, GFP_KERNEL);
+}
+
+/**
+ * _ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ * @len: header length
+ *
+ * (requiring 'ptr' makes it easily interchangable with kmalloc/kfree
+ */
+void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int len)
+{
+ kfree(ptr);
+}
+
+/**
+ * ckpt_hdr_put - free a hdr allocated with ckpt_hdr_get
+ * @ctx: checkpoint context
+ * @ptr: header to free
+ *
+ * It is assumed that @ptr begins with a 'struct ckpt_hdr'.
+ */
+void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr)
+{
+ struct ckpt_hdr *h = (struct ckpt_hdr *) ptr;
+ _ckpt_hdr_put(ctx, ptr, h->len);
+}
+
+/**
+ * ckpt_hdr_get_type - get a hdr of certain size
+ * @ctx: checkpoint context
+ * @len: number of bytes to reserve
+ *
+ * Returns pointer to reserved space on hbuf
+ */
+void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int len, int type)
+{
+ struct ckpt_hdr *h;
+
+ h = ckpt_hdr_get(ctx, len);
+ if (!h)
+ return NULL;
+
+ h->type = type;
+ h->len = len;
+ return h;
+}
+
+
+/*
+ * Helpers to manage c/r contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+static void ckpt_ctx_free(struct ckpt_ctx *ctx)
+{
+ if (ctx->file)
+ fput(ctx->file);
+ kfree(ctx);
+}
+
+static struct ckpt_ctx *ckpt_ctx_alloc(int fd, unsigned long uflags,
+ unsigned long kflags)
+{
+ struct ckpt_ctx *ctx;
+ int err;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return ERR_PTR(-ENOMEM);
+
+ ctx->uflags = uflags;
+ ctx->kflags = kflags;
+
+ err = -EBADF;
+ ctx->file = fget(fd);
+ if (!ctx->file)
+ goto err;
+
+ return ctx;
+ err:
+ ckpt_ctx_free(ctx);
+ return ERR_PTR(err);
+}

/**
* sys_checkpoint - checkpoint a container
@@ -23,7 +206,26 @@
*/
SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
{
- return -ENOSYS;
+ struct ckpt_ctx *ctx;
+ long ret;
+
+ /* no flags for now */
+ if (flags)
+ return -EINVAL;
+
+ if (pid == 0)
+ pid = task_pid_vnr(current);
+ ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_CHECKPOINT);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = do_checkpoint(ctx, pid);
+
+ if (!ret)
+ ret = ctx->crid;
+
+ ckpt_ctx_free(ctx);
+ return ret;
}

/**
@@ -37,5 +239,46 @@ SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
*/
SYSCALL_DEFINE3(restart, pid_t, pid, int, fd, unsigned long, flags)
{
- return -ENOSYS;
+ struct ckpt_ctx *ctx = NULL;
+ long ret;
+
+ /* no flags for now */
+ if (flags)
+ return -EINVAL;
+
+ ctx = ckpt_ctx_alloc(fd, flags, CKPT_CTX_RESTART);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = do_restart(ctx, pid);
+
+ /* restart(2) isn't idempotent: can't restart syscall */
+ if (ret == -ERESTARTSYS || ret == -ERESTARTNOINTR ||
+ ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK)
+ ret = -EINTR;
+
+ ckpt_ctx_free(ctx);
+ return ret;
+}
+
+
+/* 'ckpt_debug_level' controls the verbosity level of c/r code */
+#ifdef CONFIG_CHECKPOINT_DEBUG
+
+/* FIX: allow to change during runtime */
+unsigned long __read_mostly ckpt_debug_level = CKPT_DDEFAULT;
+
+static __init int ckpt_debug_setup(char *s)
+{
+ long val, ret;
+
+ ret = strict_strtoul(s, 10, &val);
+ if (ret < 0)
+ return ret;
+ ckpt_debug_level = val;
+ return 0;
}
+
+__setup("ckpt_debug=", ckpt_debug_setup);
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index 334a359..3e8bd18 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -44,6 +44,9 @@ header-y += bpqether.h
header-y += bsg.h
header-y += can.h
header-y += cdk.h
+header-y += checkpoint.h
+header-y += checkpoint_hdr.h
+header-y += checkpoint_types.h
header-y += chio.h
header-y += coda_psdev.h
header-y += coff.h
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..b2cb91f
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,101 @@
+#ifndef _LINUX_CHECKPOINT_H_
+#define _LINUX_CHECKPOINT_H_
+/*
+ * Generic checkpoint-restart
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#define CHECKPOINT_VERSION 1
+
+#ifdef __KERNEL__
+#ifdef CONFIG_CHECKPOINT
+
+#include <linux/checkpoint_types.h>
+#include <linux/checkpoint_hdr.h>
+
+
+/* ckpt_ctx: kflags */
+#define CKPT_CTX_CHECKPOINT_BIT 1
+#define CKPT_CTX_RESTART_BIT 2
+
+#define CKPT_CTX_CHECKPOINT (1 << CKPT_CTX_CHECKPOINT_BIT)
+#define CKPT_CTX_RESTART (1 << CKPT_CTX_RESTART_BIT)
+
+
+extern int ckpt_kwrite(struct ckpt_ctx *ctx, void *buf, int count);
+extern int ckpt_kread(struct ckpt_ctx *ctx, void *buf, int count);
+
+extern void _ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr, int n);
+extern void ckpt_hdr_put(struct ckpt_ctx *ctx, void *ptr);
+extern void *ckpt_hdr_get(struct ckpt_ctx *ctx, int n);
+extern void *ckpt_hdr_get_type(struct ckpt_ctx *ctx, int n, int type);
+
+extern int ckpt_write_obj(struct ckpt_ctx *ctx, struct ckpt_hdr *h);
+extern int ckpt_write_obj_type(struct ckpt_ctx *ctx,
+ void *ptr, int len, int type);
+extern int ckpt_write_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int ckpt_write_string(struct ckpt_ctx *ctx, char *str, int len);
+extern void __ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...);
+extern int ckpt_write_err(struct ckpt_ctx *ctx, char *fmt, ...);
+
+extern int _ckpt_read_obj_type(struct ckpt_ctx *ctx,
+ void *ptr, int len, int type);
+extern int _ckpt_read_buffer(struct ckpt_ctx *ctx, void *ptr, int len);
+extern int _ckpt_read_string(struct ckpt_ctx *ctx, void *ptr, int len);
+extern void *ckpt_read_obj_type(struct ckpt_ctx *ctx, int len, int type);
+extern void *ckpt_read_buf_type(struct ckpt_ctx *ctx, int len, int type);
+
+extern long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid);
+extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);
+
+/* task */
+extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_task(struct ckpt_ctx *ctx);
+
+
+/* debugging flags */
+#define CKPT_DBASE 0x1 /* anything */
+#define CKPT_DSYS 0x2 /* generic (system) */
+#define CKPT_DRW 0x4 /* image read/write */
+
+#define CKPT_DDEFAULT 0xffff /* default debug level */
+
+#ifndef CKPT_DFLAG
+#define CKPT_DFLAG 0xffff /* everything */
+#endif
+
+#ifdef CONFIG_CHECKPOINT_DEBUG
+extern unsigned long ckpt_debug_level;
+
+/* use this to select a specific debug level */
+#define _ckpt_debug(level, fmt, args...) \
+ do { \
+ if (ckpt_debug_level & (level)) \
+ printk(KERN_DEBUG "[%d:%d:c/r:%s:%d] " fmt, \
+ current->pid, task_pid_vnr(current), \
+ __func__, __LINE__, ## args); \
+ } while (0)
+
+/*
+ * CKPT_DBASE is the base flags, doesn't change
+ * CKPT_DFLAG is to be redfined in each source file
+ */
+#define ckpt_debug(fmt, args...) \
+ _ckpt_debug(CKPT_DBASE | CKPT_DFLAG, fmt, ## args)
+
+#else
+
+#define _ckpt_debug(level, fmt, args...) do { } while (0)
+#define ckpt_debug(fmt, args...) do { } while (0)
+
+#endif /* CONFIG_CHECKPOINT_DEBUG */
+
+#endif /* CONFIG_CHECKPOINT */
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_CHECKPOINT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..827a6bb
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,109 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ * "This structure has an odd multiple of 32-bit members, which means
+ * that if you put it into a larger structure that also contains 64-bit
+ * members, the larger structure may get different alignment on x86-32
+ * and x86-64, which you might want to avoid. I can't tell if this is
+ * an actual problem here. ... In this case, I'm pretty sure that
+ * sizeof(ckpt_hdr_task) on x86-32 is different from x86-64, since it
+ * will be 32-bit aligned on x86-32."
+ */
+
+/*
+ * header format: 'struct ckpt_hdr' must prefix all other headers. Therfore
+ * when a header is passed around, the information about it (type, size)
+ * is readily available.
+ */
+struct ckpt_hdr {
+ __u32 type;
+ __u32 len;
+} __attribute__((aligned(8)));
+
+/* header types */
+enum {
+ CKPT_HDR_HEADER = 1,
+ CKPT_HDR_BUFFER,
+ CKPT_HDR_STRING,
+
+ CKPT_HDR_TASK = 101,
+
+ CKPT_HDR_TAIL = 9001,
+
+ CKPT_HDR_ERROR = 9999,
+};
+
+/* kernel constants */
+struct ckpt_hdr_const {
+ /* task */
+ __u16 task_comm_len;
+ /* uts */
+ __u16 uts_release_len;
+ __u16 uts_version_len;
+ __u16 uts_machine_len;
+} __attribute__((aligned(8)));
+
+/* checkpoint image header */
+struct ckpt_hdr_header {
+ struct ckpt_hdr h;
+ __u64 magic;
+
+ __u16 _padding;
+
+ __u16 major;
+ __u16 minor;
+ __u16 patch;
+ __u16 rev;
+
+ struct ckpt_hdr_const constants;
+
+ __u64 time; /* when checkpoint taken */
+ __u64 uflags; /* uflags from checkpoint */
+
+ /*
+ * the header is followed by three strings:
+ * char release[const.uts_release_len];
+ * char version[const.uts_version_len];
+ * char machine[const.uts_machine_len];
+ */
+} __attribute__((aligned(8)));
+
+
+/* checkpoint image trailer */
+struct ckpt_hdr_tail {
+ struct ckpt_hdr h;
+ __u64 magic;
+} __attribute__((aligned(8)));
+
+
+/* task data */
+struct ckpt_hdr_task {
+ struct ckpt_hdr h;
+ __u32 state;
+ __u32 exit_state;
+ __u32 exit_code;
+ __u32 exit_signal;
+
+ __u64 set_child_tid;
+ __u64 clear_child_tid;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/checkpoint_types.h b/include/linux/checkpoint_types.h
new file mode 100644
index 0000000..203ecac
--- /dev/null
+++ b/include/linux/checkpoint_types.h
@@ -0,0 +1,34 @@
+#ifndef _LINUX_CHECKPOINT_TYPES_H_
+#define _LINUX_CHECKPOINT_TYPES_H_
+/*
+ * Generic checkpoint-restart
+ *
+ * Copyright (C) 2008-2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+
+struct ckpt_ctx {
+ int crid; /* unique checkpoint id */
+
+ pid_t root_pid; /* container identifier */
+
+ unsigned long kflags; /* kerenl flags */
+ unsigned long uflags; /* user flags */
+ unsigned long oflags; /* restart: uflags from checkpoint */
+
+ struct file *file; /* input/output file */
+ int total; /* total read/written */
+
+ char err_string[256]; /* checkpoint: error string */
+};
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_CHECKPOINT_TYPES_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index 1923327..ff17a59 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -53,4 +53,8 @@
#define INOTIFYFS_SUPER_MAGIC 0x2BAD1DEA

#define STACK_END_MAGIC 0x57AC6E9D
+
+#define CHECKPOINT_MAGIC_HEAD 0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL 0x002d2a0cc0deef00LL
+
#endif /* __LINUX_MAGIC_H__ */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 12327b2..e1ae6e6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1006,6 +1006,19 @@ config DMA_API_DEBUG
This option causes a performance degredation. Use only if you want
to debug device drivers. If unsure, say N.

+config CHECKPOINT_DEBUG
+ bool "Checkpoint/restart debugging (EXPERIMENTAL)"
+ depends on CHECKPOINT
+ default y
+ help
+ This options turns on the debugging output of checkpoint/restart.
+ The level of verbosity is controlled by 'ckpt_debug_level' and can
+ be set at boot time with "ckpt_debug=" option.
+
+ Turning this option off will reduce the size of the c/r code. If
+ turned on, it is unlikely to incur visible overhead if the debug
+ level is set to zero.
+
source "samples/Kconfig"

source "lib/Kconfig.kgdb"
--
1.6.0.4

2009-07-22 10:21:59

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 43/60] splice: export pipe/file-to-pipe/file functionality

During pipes c/r pipes we need to save and restore pipe buffers. But
do_splice() requires two file descriptors, therefore we can't use it,
as we always have one file descriptor (checkpoint image) and one
pipe_inode_info.

This patch exports interfaces that work at the pipe_inode_info level,
namely link_pipe(), do_splice_to() and do_splice_from(). They are used
in the following patch to to save and restore pipe buffers without
unnecessary data copy.

It slightly modifies both do_splice_to() and do_splice_from() to
detect the case of pipe-to-pipe transfer, in which case they invoke
splice_pipe_to_pipe() directly.

Signed-off-by: Oren Laadan <[email protected]>
---
fs/splice.c | 61 ++++++++++++++++++++++++++++++++---------------
include/linux/splice.h | 9 +++++++
2 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 73766d2..f251b4c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1055,18 +1055,43 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
EXPORT_SYMBOL(generic_splice_sendpage);

/*
+ * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
+ * location, so checking ->i_pipe is not enough to verify that this is a
+ * pipe.
+ */
+static inline struct pipe_inode_info *pipe_info(struct inode *inode)
+{
+ if (S_ISFIFO(inode->i_mode))
+ return inode->i_pipe;
+
+ return NULL;
+}
+
+static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
+ struct pipe_inode_info *opipe,
+ size_t len, unsigned int flags);
+
+/*
* Attempt to initiate a splice from pipe to file.
*/
-static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
- loff_t *ppos, size_t len, unsigned int flags)
+long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+ loff_t *ppos, size_t len, unsigned int flags)
{
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *,
loff_t *, size_t, unsigned int);
+ struct pipe_inode_info *opipe;
int ret;

if (unlikely(!(out->f_mode & FMODE_WRITE)))
return -EBADF;

+ /* When called directly (e.g. from c/r) output may be a pipe */
+ opipe = pipe_info(out->f_path.dentry->d_inode);
+ if (opipe) {
+ BUG_ON(opipe == pipe);
+ return splice_pipe_to_pipe(pipe, opipe, len, flags);
+ }
+
if (unlikely(out->f_flags & O_APPEND))
return -EINVAL;

@@ -1084,17 +1109,25 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
/*
* Attempt to initiate a splice from a file to a pipe.
*/
-static long do_splice_to(struct file *in, loff_t *ppos,
- struct pipe_inode_info *pipe, size_t len,
- unsigned int flags)
+long do_splice_to(struct file *in, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags)
{
ssize_t (*splice_read)(struct file *, loff_t *,
struct pipe_inode_info *, size_t, unsigned int);
+ struct pipe_inode_info *ipipe;
int ret;

if (unlikely(!(in->f_mode & FMODE_READ)))
return -EBADF;

+ /* When called firectly (e.g. from c/r) input may be a pipe */
+ ipipe = pipe_info(in->f_path.dentry->d_inode);
+ if (ipipe) {
+ BUG_ON(ipipe == pipe);
+ return splice_pipe_to_pipe(ipipe, pipe, len, flags);
+ }
+
ret = rw_verify_area(READ, in, ppos, len);
if (unlikely(ret < 0))
return ret;
@@ -1273,18 +1306,6 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
struct pipe_inode_info *opipe,
size_t len, unsigned int flags);
-/*
- * After the inode slimming patch, i_pipe/i_bdev/i_cdev share the same
- * location, so checking ->i_pipe is not enough to verify that this is a
- * pipe.
- */
-static inline struct pipe_inode_info *pipe_info(struct inode *inode)
-{
- if (S_ISFIFO(inode->i_mode))
- return inode->i_pipe;
-
- return NULL;
-}

/*
* Determine where to splice to/from.
@@ -1887,9 +1908,9 @@ retry:
/*
* Link contents of ipipe to opipe.
*/
-static int link_pipe(struct pipe_inode_info *ipipe,
- struct pipe_inode_info *opipe,
- size_t len, unsigned int flags)
+int link_pipe(struct pipe_inode_info *ipipe,
+ struct pipe_inode_info *opipe,
+ size_t len, unsigned int flags)
{
struct pipe_buffer *ibuf, *obuf;
int ret = 0, i = 0, nbuf;
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 18e7c7c..431662c 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -82,4 +82,13 @@ extern ssize_t splice_to_pipe(struct pipe_inode_info *,
extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
splice_direct_actor *);

+extern int link_pipe(struct pipe_inode_info *ipipe,
+ struct pipe_inode_info *opipe,
+ size_t len, unsigned int flags);
+extern long do_splice_to(struct file *in, loff_t *ppos,
+ struct pipe_inode_info *pipe, size_t len,
+ unsigned int flags);
+extern long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
+ loff_t *ppos, size_t len, unsigned int flags);
+
#endif
--
1.6.0.4

2009-07-22 10:22:22

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 31/60] c/r: detect resource leaks for whole-container checkpoint

Add a 'users' count to objhash items, and, for a !CHECKPOINT_SUBTREE
checkpoint, return an error code if the actual objects' counts are
higher, indicating leaks (references to the objects from a task not
being checkpointed). Of course, by this time most of the checkpoint
image has been written out to disk, so this is purely advisory. But
then, it's probably naive to argue that anything more than an advisory
'this went wrong' error code is useful.

The comparison of the objhash user counts to object refcounts as a
basis for checking for leaks comes from Alexey's OpenVZ-based c/r
patchset.

"Leak detection" occurs _before_ any real state is saved, as a
pre-step. This prevents races due to sharing with outside world where
the sharing ceases before the leak test takes place, thus protecting
the checkpoint image from inconsistencies.

Once leak testing concludes, checkpoint will proceed. Because objects
are already in the objhash, checkpoint_obj() cannot distinguish
between the first and subsequent encounters. This is solved with a
flag (CKPT_OBJ_CHECKPOINTED) per object.

Two additional checks take place during checkpoint: for objects that
were created during, and objects destroyed, while the leak-detection
pre-step took place.

Changelog[v17]:
- Leak detection is performed in two-steps
- Detect reverse-leaks (objects disappearing unexpectedly)
- Skip reverse-leak detection if ops->ref_users isn't defined

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/checkpoint.c | 36 ++++++++++
checkpoint/objhash.c | 153 +++++++++++++++++++++++++++++++++++++++++++-
checkpoint/process.c | 5 ++
include/linux/checkpoint.h | 5 ++
4 files changed, 196 insertions(+), 3 deletions(-)

diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index fb14585..e126626 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -380,6 +380,20 @@ static int checkpoint_pids(struct ckpt_ctx *ctx)
return ret;
}

+static int collect_objects(struct ckpt_ctx *ctx)
+{
+ int n, ret = 0;
+
+ for (n = 0; n < ctx->nr_tasks; n++) {
+ ckpt_debug("dumping task #%d\n", n);
+ ret = ckpt_collect_task(ctx, ctx->tasks_arr[n]);
+ if (ret < 0)
+ break;
+ }
+
+ return ret;
+}
+
/* count number of tasks in tree (and optionally fill pid's in array) */
static int tree_count_tasks(struct ckpt_ctx *ctx)
{
@@ -619,6 +633,21 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
if (ret < 0)
goto out;

+ if (!(ctx->uflags & CHECKPOINT_SUBTREE)) {
+ /*
+ * Verify that all objects are contained (no leaks):
+ * First collect them all into the while counting users
+ * and then compare to the objects' real user counts.
+ */
+ ret = collect_objects(ctx);
+ if (ret < 0)
+ goto out;
+ if (!ckpt_obj_contained(ctx)) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ }
+
ret = checkpoint_write_header(ctx);
if (ret < 0)
goto out;
@@ -628,6 +657,13 @@ long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
ret = checkpoint_all_tasks(ctx);
if (ret < 0)
goto out;
+
+ /* verify that all objects were indeed checkpointed */
+ if (!ckpt_obj_checkpointed(ctx)) {
+ ret = -EAGAIN;
+ goto out;
+ }
+
ret = checkpoint_write_tail(ctx);
if (ret < 0)
goto out;
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index eb2bb55..3f23910 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -25,16 +25,19 @@ struct ckpt_obj_ops {
enum obj_type obj_type;
void (*ref_drop)(void *ptr);
int (*ref_grab)(void *ptr);
+ int (*ref_users)(void *ptr);
int (*checkpoint)(struct ckpt_ctx *ctx, void *ptr);
void *(*restore)(struct ckpt_ctx *ctx);
};

struct ckpt_obj {
+ int users;
int objref;
int flags;
void *ptr;
struct ckpt_obj_ops *ops;
struct hlist_node hash;
+ struct hlist_node next;
};

/* object internal flags */
@@ -42,10 +45,21 @@ struct ckpt_obj {

struct ckpt_obj_hash {
struct hlist_head *head;
+ struct hlist_head list;
int next_free_objref;
};

-/* helper grab/drop functions: */
+int checkpoint_bad(struct ckpt_ctx *ctx, void *ptr)
+{
+ BUG();
+}
+
+void *restore_bad(struct ckpt_ctx *ctx)
+{
+ return ERR_PTR(-EINVAL);
+}
+
+/* helper grab/drop/users functions */

static void obj_no_drop(void *ptr)
{
@@ -114,6 +128,7 @@ int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx)

obj_hash->head = head;
obj_hash->next_free_objref = 1;
+ INIT_HLIST_HEAD(&obj_hash->list);

ctx->obj_hash = obj_hash;
return 0;
@@ -176,6 +191,7 @@ static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,

obj->ptr = ptr;
obj->ops = ops;
+ obj->users = 2; /* extra reference that objhash itself takes */

if (!objref) {
/* use @obj->ptr to index, assign objref (checkpoint) */
@@ -193,6 +209,7 @@ static struct ckpt_obj *obj_new(struct ckpt_ctx *ctx, void *ptr,
obj = ERR_PTR(ret);
} else {
hlist_add_head(&obj->hash, &ctx->obj_hash->head[i]);
+ hlist_add_head(&obj->next, &ctx->obj_hash->list);
}

return obj;
@@ -225,12 +242,36 @@ static struct ckpt_obj *obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
*first = 1;
} else {
BUG_ON(obj->ops->obj_type != type);
+ obj->users++;
*first = 0;
}
return obj;
}

/**
+ * ckpt_obj_collect - collect object into objhash
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ * @first: [output] first encoutner (added to table)
+ *
+ * [used during checkpoint].
+ * Return: objref
+ */
+int ckpt_obj_collect(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
+{
+ struct ckpt_obj *obj;
+ int first;
+
+ obj = obj_lookup_add(ctx, ptr, type, &first);
+ if (IS_ERR(obj))
+ return PTR_ERR(obj);
+ ckpt_debug("%s objref %d first %d\n",
+ obj->ops->obj_name, obj->objref, first);
+ return obj->objref;
+}
+
+/**
* ckpt_obj_lookup - lookup object (by pointer) in objhash
* @ctx: checkpoint context
* @ptr: pointer to object
@@ -291,12 +332,20 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)
{
struct ckpt_hdr_objref *h;
struct ckpt_obj *obj;
- int first, ret = 0;
+ int new, ret = 0;

- obj = obj_lookup_add(ctx, ptr, type, &first);
+ obj = obj_lookup_add(ctx, ptr, type, &new);
if (IS_ERR(obj))
return PTR_ERR(obj);

+ /*
+ * A "reverse" leak ? All objects should already be in the
+ * objhash by now. But an outside task may have created an
+ * object while we were collecting, which we didn't catch.
+ */
+ if (new && obj->ops->ref_users && !(ctx->uflags & CHECKPOINT_SUBTREE))
+ return -EAGAIN;
+
if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) {
h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_OBJREF);
if (!h)
@@ -316,9 +365,107 @@ int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr, enum obj_type type)

obj->flags |= CKPT_OBJ_CHECKPOINTED;
}
+
return (ret < 0 ? ret : obj->objref);
}

+/* increment the 'users' count of an object */
+static void ckpt_obj_users_inc(struct ckpt_ctx *ctx, void *ptr, int increment)
+{
+ struct ckpt_obj *obj;
+
+ obj = obj_find_by_ptr(ctx, ptr);
+ if (obj)
+ obj->users += increment;
+}
+
+/*
+ * "Leak detection" - to guarantee a consistent checkpoint of a full
+ * container we verify that all resources are confined and isolated in
+ * that container:
+ *
+ * c/r code first walks through all tasks and collects all shared
+ * resources into the objhash, while counting the references to them;
+ * then, it compares this count to the object's real reference count,
+ * and if they don't match it means that an object has "leaked" to the
+ * outside.
+ *
+ * Otherwise, it is guaranteed that there are no references outside
+ * (of container). c/r code now proceeds to walk through all tasks,
+ * again, and checkpoints the resources. It ensures that all resources
+ * are already in the objhash, and that all of them are checkpointed.
+ * Otherwise it means that due to a race, an object was created or
+ * destroyed during the first walk but not accounted for.
+ *
+ * For instance, consider an outside task A that shared files_struct
+ * with inside task B. Then, after B's files where collected, A opens
+ * or closes a file, and immediately exits - before the first leak
+ * test is performed, such that the test passes.
+ */
+
+/**
+ * ckpt_obj_contained - test if shared objects are "contained" in checkpoint
+ * @ctx: checkpoint context
+ *
+ * Loops through all objects in the table and compares the number of
+ * references accumulated during checkpoint, with the reference count
+ * reported by the kernel.
+ *
+ * Return 1 if respective counts match for all objects, 0 otherwise.
+ */
+int ckpt_obj_contained(struct ckpt_ctx *ctx)
+{
+ struct ckpt_obj *obj;
+ struct hlist_node *node;
+
+ /* account for ctx->file reference (if in the table already) */
+ ckpt_obj_users_inc(ctx, ctx->file, 1);
+
+ hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
+ if (!obj->ops->ref_users)
+ continue;
+ if (obj->ops->ref_users(obj->ptr) != obj->users) {
+ ckpt_debug("usage leak: %s\n", obj->ops->obj_name);
+ ckpt_write_err(ctx, "%s leak: users %d != c/r %d\n",
+ obj->ops->obj_name,
+ obj->ops->ref_users(obj->ptr),
+ obj->users);
+ printk(KERN_NOTICE "c/r: %s users %d != count %d\n",
+ obj->ops->obj_name,
+ obj->ops->ref_users(obj->ptr),
+ obj->users);
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
+/**
+ * ckpt_obj_checkpointed - test that all shared objects were checkpointed
+ * @ctx: checkpoint context
+ *
+ * Return 1 if all objects where checkpointed, 0 otherwise.
+ */
+int ckpt_obj_checkpointed(struct ckpt_ctx *ctx)
+{
+ struct ckpt_obj *obj;
+ struct hlist_node *node;
+
+ hlist_for_each_entry(obj, node, &ctx->obj_hash->list, next) {
+ if (!(obj->flags & CKPT_OBJ_CHECKPOINTED)) {
+ ckpt_debug("reverse leak: %s\n", obj->ops->obj_name);
+ ckpt_write_err(ctx, "%s leak: not checkpointed\n",
+ obj->ops->obj_name);
+ printk(KERN_NOTICE "c/r: %s object not checkpointed\n",
+ obj->ops->obj_name);
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
/**************************************************************************
* Restart
*/
diff --git a/checkpoint/process.c b/checkpoint/process.c
index 9e459c6..4da4e4a 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -241,6 +241,11 @@ int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t)
return ret;
}

+int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ return 0;
+}
+
/***********************************************************************
* Restart
*/
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 8eb5434..efd05cc 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -86,6 +86,10 @@ extern int ckpt_obj_hash_alloc(struct ckpt_ctx *ctx);
extern int restore_obj(struct ckpt_ctx *ctx, struct ckpt_hdr_objref *h);
extern int checkpoint_obj(struct ckpt_ctx *ctx, void *ptr,
enum obj_type type);
+extern int ckpt_obj_collect(struct ckpt_ctx *ctx, void *ptr,
+ enum obj_type type);
+extern int ckpt_obj_contained(struct ckpt_ctx *ctx);
+extern int ckpt_obj_checkpointed(struct ckpt_ctx *ctx);
extern int ckpt_obj_lookup(struct ckpt_ctx *ctx, void *ptr,
enum obj_type type);
extern int ckpt_obj_lookup_add(struct ckpt_ctx *ctx, void *ptr,
@@ -103,6 +107,7 @@ extern long do_restart(struct ckpt_ctx *ctx, pid_t pid);

/* task */
extern int ckpt_activate_next(struct ckpt_ctx *ctx);
+extern int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t);
extern int checkpoint_task(struct ckpt_ctx *ctx, struct task_struct *t);
extern int restore_task(struct ckpt_ctx *ctx);

--
1.6.0.4

2009-07-22 10:22:23

by Oren Laadan

[permalink] [raw]
Subject: [RFC v17][PATCH 60/60] c/r: checkpoint and restore (shared) task's sighand_struct

This patch adds the checkpointing and restart of signal handling
state - 'struct sighand_struct'. Since the contents of this state
only affect userspace, no input validation is required.

Add _NSIG to kernel constants saved/tested with image header.

Number of signals (_NSIG) is arch-dependent, but is within __KERNEL__
and not visibile to userspace compile. Therefore, define per arch
CKPT_ARCH_NSIG in <asm/checkpoint_hdr.h>.

Signed-off-by: Oren Laadan <[email protected]>
---
arch/s390/include/asm/checkpoint_hdr.h | 8 ++
arch/x86/include/asm/checkpoint_hdr.h | 8 ++
checkpoint/Makefile | 3 +-
checkpoint/checkpoint.c | 2 +
checkpoint/objhash.c | 26 +++++
checkpoint/process.c | 20 ++++
checkpoint/restart.c | 3 +
checkpoint/signal.c | 170 ++++++++++++++++++++++++++++++++
include/linux/checkpoint.h | 7 ++
include/linux/checkpoint_hdr.h | 22 ++++
10 files changed, 268 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/signal.c

diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
index ad9449e..1976355 100644
--- a/arch/s390/include/asm/checkpoint_hdr.h
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -82,6 +82,14 @@ struct ckpt_hdr_mm_context {
unsigned long asce_limit;
};

+#define CKPT_ARCH_NSIG 64
+#ifdef __KERNEL__
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _SIGCONTEXT_NSIG
+#error CKPT_ARCH_NSIG size is wrong (asm/sigcontext.h and asm/checkpoint_hdr.h)
+#endif
+#endif
+
struct ckpt_hdr_header_arch {
struct ckpt_hdr h;
};
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 0e756b0..1228d1b 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -48,6 +48,14 @@ enum {
CKPT_HDR_MM_CONTEXT_LDT,
};

+#define CKPT_ARCH_NSIG 64
+#ifdef __KERNEL__
+#include <asm/signal.h>
+#if CKPT_ARCH_NSIG != _NSIG
+#error CKPT_ARCH_NSIG size is wrong per asm/signal.h and asm/checkpoint_hdr.h
+#endif
+#endif
+
struct ckpt_hdr_header_arch {
struct ckpt_hdr h;
/* FIXME: add HAVE_HWFP */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index bb2c0ca..f8a55df 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -10,4 +10,5 @@ obj-$(CONFIG_CHECKPOINT) += \
process.o \
namespace.o \
files.o \
- memory.o
+ memory.o \
+ signal.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e4f971e..1522e6f 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -190,6 +190,8 @@ static void fill_kernel_const(struct ckpt_hdr_const *h)
h->task_comm_len = sizeof(tsk->comm);
/* mm */
h->mm_saved_auxv_len = sizeof(mm->saved_auxv);
+ /* signal */
+ h->signal_nsig = _NSIG;
/* uts */
h->uts_sysname_len = sizeof(uts->sysname);
h->uts_nodename_len = sizeof(uts->nodename);
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
index 15b9d66..da43bf4 100644
--- a/checkpoint/objhash.c
+++ b/checkpoint/objhash.c
@@ -135,6 +135,22 @@ static int obj_mm_users(void *ptr)
return atomic_read(&((struct mm_struct *) ptr)->mm_users);
}

+static int obj_sighand_grab(void *ptr)
+{
+ atomic_inc(&((struct sighand_struct *) ptr)->count);
+ return 0;
+}
+
+static void obj_sighand_drop(void *ptr)
+{
+ __cleanup_sighand((struct sighand_struct *) ptr);
+}
+
+static int obj_sighand_users(void *ptr)
+{
+ return atomic_read(&((struct sighand_struct *) ptr)->count);
+}
+
static int obj_ns_grab(void *ptr)
{
get_nsproxy((struct nsproxy *) ptr);
@@ -275,6 +291,16 @@ static struct ckpt_obj_ops ckpt_obj_ops[] = {
.checkpoint = checkpoint_mm,
.restore = restore_mm,
},
+ /* sighand object */
+ {
+ .obj_name = "SIGHAND",
+ .obj_type = CKPT_OBJ_SIGHAND,
+ .ref_drop = obj_sighand_drop,
+ .ref_grab = obj_sighand_grab,
+ .ref_users = obj_sighand_users,
+ .checkpoint = checkpoint_sighand,
+ .restore = restore_sighand,
+ },
/* ns object */
{
.obj_name = "NSPROXY",
diff --git a/checkpoint/process.c b/checkpoint/process.c
index f028822..d76ab2c 100644
--- a/checkpoint/process.c
+++ b/checkpoint/process.c
@@ -180,6 +180,7 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
struct ckpt_hdr_task_objs *h;
int files_objref;
int mm_objref;
+ int sighand_objref;
int ret;

/*
@@ -215,11 +216,20 @@ static int checkpoint_task_objs(struct ckpt_ctx *ctx, struct task_struct *t)
return mm_objref;
}

+ sighand_objref = checkpoint_obj_sighand(ctx, t);
+ ckpt_debug("sighand: objref %d\n", sighand_objref);
+ if (sighand_objref < 0) {
+ ckpt_write_err(ctx, "task %d (%s), sighand_struct: %d",
+ task_pid_vnr(t), t->comm, sighand_objref);
+ return sighand_objref;
+ }
+
h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_TASK_OBJS);
if (!h)
return -ENOMEM;
h->files_objref = files_objref;
h->mm_objref = mm_objref;
+ h->sighand_objref = sighand_objref;
ret = ckpt_write_obj(ctx, &h->h);
ckpt_hdr_put(ctx, h);

@@ -379,6 +389,9 @@ int ckpt_collect_task(struct ckpt_ctx *ctx, struct task_struct *t)
if (ret < 0)
return ret;
ret = ckpt_collect_mm(ctx, t);
+ if (ret < 0)
+ return ret;
+ ret = ckpt_collect_sighand(ctx, t);

return ret;
}
@@ -524,10 +537,17 @@ static int restore_task_objs(struct ckpt_ctx *ctx)

ret = restore_obj_file_table(ctx, h->files_objref);
ckpt_debug("file_table: ret %d (%p)\n", ret, current->files);
+ if (ret < 0)
+ goto out;

ret = restore_obj_mm(ctx, h->mm_objref);
ckpt_debug("mm: ret %d (%p)\n", ret, current->mm);
+ if (ret < 0)
+ goto out;

+ ret = restore_obj_sighand(ctx, h->sighand_objref);
+ ckpt_debug("sighand: ret %d (%p)\n", ret, current->sighand);
+ out:
ckpt_hdr_put(ctx, h);
return ret;
}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 935caf6..677f030 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -323,6 +323,9 @@ static int check_kernel_const(struct ckpt_hdr_const *h)
/* mm */
if (h->mm_saved_auxv_len != sizeof(mm->saved_auxv))
return -EINVAL;
+ /* signal */
+ if (h->signal_nsig != _NSIG)
+ return -EINVAL;
/* uts */
if (h->uts_sysname_len != sizeof(uts->sysname))
return -EINVAL;
diff --git a/checkpoint/signal.c b/checkpoint/signal.c
new file mode 100644
index 0000000..506476b
--- /dev/null
+++ b/checkpoint/signal.c
@@ -0,0 +1,170 @@
+/*
+ * Checkpoint task signals
+ *
+ * Copyright (C) 2009 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/* default debug level for output */
+#define CKPT_DFLAG CKPT_DSYS
+
+#include <linux/sched.h>
+#include <linux/signal.h>
+#include <linux/errno.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+static inline void fill_sigset(struct ckpt_hdr_sigset *h, sigset_t *sigset)
+{
+ memcpy(&h->sigset, sigset, _NSIG / 8);
+}
+
+static inline void load_sigset(sigset_t *sigset, struct ckpt_hdr_sigset *h)
+{
+ memcpy(sigset, &h->sigset, _NSIG / 8);
+}
+
+/***********************************************************************
+ * Checkpoint
+ */
+
+int do_checkpoint_sighand(struct ckpt_ctx *ctx, struct sighand_struct *sighand)
+{
+ struct ckpt_hdr_sighand *h;
+ struct ckpt_hdr_sigaction *hh;
+ struct sigaction *sa;
+ int i, ret;
+
+ h = ckpt_hdr_get_type(ctx, _NSIG * sizeof(*hh) + sizeof(*h),
+ CKPT_HDR_SIGHAND);
+ if (!h)
+ return -ENOMEM;
+
+ hh = h->action;
+ spin_lock_irq(&sighand->siglock);
+ for (i = 0; i < _NSIG; i++) {
+ sa = &sighand->action[i].sa;
+ hh[i]._sa_handler = (unsigned long) sa->sa_handler;
+ hh[i].sa_flags = sa->sa_flags;
+ hh[i].sa_restorer = (unsigned long) sa->sa_restorer;
+ fill_sigset(&hh[i].sa_mask, &sa->sa_mask);
+ }
+ spin_unlock_irq(&sighand->siglock);
+
+ ret = ckpt_write_obj(ctx, &h->h);
+ ckpt_hdr_put(ctx, h);
+
+ return ret;
+}
+
+int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr)
+{
+ return do_checkpoint_sighand(ctx, (struct sighand_struct *) ptr);
+}
+
+int checkpoint_obj_sighand(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct sighand_struct *sighand;
+ int objref;
+
+ read_lock(&tasklist_lock);
+ sighand = rcu_dereference(t->sighand);
+ atomic_inc(&sighand->count);
+ read_unlock(&tasklist_lock);
+
+ objref = checkpoint_obj(ctx, sighand, CKPT_OBJ_SIGHAND);
+ __cleanup_sighand(sighand);
+
+ return objref;
+}
+
+/***********************************************************************
+ * Collect
+ */
+
+int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+ struct sighand_struct *sighand;
+ int ret;
+
+ read_lock(&tasklist_lock);
+ sighand = rcu_dereference(t->sighand);
+ atomic_inc(&sighand->count);
+ read_unlock(&tasklist_lock);
+
+ ret = ckpt_obj_collect(ctx, sighand, CKPT_OBJ_SIGHAND);
+ __cleanup_sighand(sighand);
+
+ return ret;
+}
+
+/***********************************************************************
+ * Restart
+ */
+
+struct sighand_struct *do_restore_sighand(struct ckpt_ctx *ctx)
+{
+ struct ckpt_hdr_sighand *h;
+ struct ckpt_hdr_sigaction *hh;
+ struct sighand_struct *sighand;
+ struct sigaction *sa;
+ int i;
+
+ h = ckpt_read_obj_type(ctx, _NSIG * sizeof(*hh) + sizeof(*h),
+ CKPT_HDR_SIGHAND);
+ if (IS_ERR(h))
+ return ERR_PTR(PTR_ERR(h));
+
+ sighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
+ if (!sighand) {
+ sighand = ERR_PTR(-ENOMEM);
+ goto out;
+ }
+ atomic_set(&sighand->count, 1);
+
+ hh = h->action;
+ for (i = 0; i < _NSIG; i++) {
+ sa = &sighand->action[i].sa;
+ sa->sa_handler = (void *) (unsigned long) hh[i]._sa_handler;
+ sa->sa_flags = hh[i].sa_flags;
+ sa->sa_restorer = (void *) (unsigned long) hh[i].sa_restorer;
+ load_sigset(&sa->sa_mask, &hh[i].sa_mask);
+ }
+ out:
+ ckpt_hdr_put(ctx, h);
+ return sighand;
+}
+
+void *restore_sighand(struct ckpt_ctx *ctx)
+{
+ return (void *) do_restore_sighand(ctx);
+}
+
+int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref)
+{
+ struct sighand_struct *sighand;
+ struct sighand_struct *old_sighand;
+
+ sighand = ckpt_obj_fetch(ctx, sighand_objref, CKPT_OBJ_SIGHAND);
+ if (IS_ERR(sighand))
+ return PTR_ERR(sighand);
+
+ if (sighand == current->sighand)
+ return 0;
+
+ atomic_inc(&sighand->count);
+
+ /* manipulate tsk->sighand with tasklist lock write-held */
+ write_lock_irq(&tasklist_lock);
+ old_sighand = rcu_dereference(current->sighand);
+ spin_lock(&old_sighand->siglock);
+ rcu_assign_pointer(current->sighand, sighand);
+ spin_unlock(&old_sighand->siglock);
+ write_unlock_irq(&tasklist_lock);
+ __cleanup_sighand(old_sighand);
+
+ return 0;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 0a8bfc7..60d1116 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -222,6 +222,13 @@ extern int restore_memory_contents(struct ckpt_ctx *ctx, struct inode *inode);
VM_RESERVED | VM_NORESERVE | VM_HUGETLB | VM_NONLINEAR | \
VM_MAPPED_COPY | VM_INSERTPAGE | VM_MIXEDMAP | VM_SAO)

+/* signals */
+extern int checkpoint_obj_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int restore_obj_sighand(struct ckpt_ctx *ctx, int sighand_objref);
+
+extern int ckpt_collect_sighand(struct ckpt_ctx *ctx, struct task_struct *t);
+extern int checkpoint_sighand(struct ckpt_ctx *ctx, void *ptr);
+extern void *restore_sighand(struct ckpt_ctx *ctx);

/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
#define CKPT_CPT 1
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 0863a07..8b7ca46 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -86,6 +86,8 @@ enum {
CKPT_HDR_IPC_MSG_MSG,
CKPT_HDR_IPC_SEM,

+ CKPT_HDR_SIGHAND = 601,
+
CKPT_HDR_TAIL = 9001,

CKPT_HDR_ERROR = 9999,
@@ -112,6 +114,7 @@ enum obj_type {
CKPT_OBJ_FILE_TABLE,
CKPT_OBJ_FILE,
CKPT_OBJ_MM,
+ CKPT_OBJ_SIGHAND,
CKPT_OBJ_NS,
CKPT_OBJ_UTS_NS,
CKPT_OBJ_IPC_NS,
@@ -128,6 +131,8 @@ struct ckpt_hdr_const {
__u16 task_comm_len;
/* mm */
__u16 mm_saved_auxv_len;
+ /* signal */
+ __u16 signal_nsig;
/* uts */
__u16 uts_sysname_len;
__u16 uts_nodename_len;
@@ -279,6 +284,7 @@ struct ckpt_hdr_task_objs {

__s32 files_objref;
__s32 mm_objref;
+ __s32 sighand_objref;
} __attribute__((aligned(8)));

/* restart blocks */
@@ -408,6 +414,22 @@ struct ckpt_hdr_pgarr {
__u64 nr_pages; /* number of pages to saved */
} __attribute__((aligned(8)));

+/* signals */
+struct ckpt_hdr_sigset {
+ __u8 sigset[CKPT_ARCH_NSIG / 8];
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_sigaction {
+ __u64 _sa_handler;
+ __u64 sa_flags;
+ __u64 sa_restorer;
+ struct ckpt_hdr_sigset sa_mask;
+} __attribute__((aligned(8)));
+
+struct ckpt_hdr_sighand {
+ struct ckpt_hdr h;
+ struct ckpt_hdr_sigaction action[0];
+} __attribute__((aligned(8)));

/* ipc commons */
struct ckpt_hdr_ipcns {
--
1.6.0.4

2009-07-22 17:25:09

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 52/60] c/r: support semaphore sysv-ipc

[Oren Laadan - Wed, Jul 22, 2009 at 06:00:14AM -0400]
...
| +static struct sem *restore_sem_array(struct ckpt_ctx *ctx, int nsems)
| +{
| + struct sem *sma;
| + int i, ret;
| +
| + sma = kmalloc(nsems * sizeof(*sma), GFP_KERNEL);

Forgot to

if (!sma)
return -ENOMEM;

right?

| + ret = _ckpt_read_buffer(ctx, sma, nsems * sizeof(*sma));
| + if (ret < 0)
| + goto out;
| +
| + /* validate sem array contents */
| + for (i = 0; i < nsems; i++) {
| + if (sma[i].semval < 0 || sma[i].sempid < 0) {
| + ret = -EINVAL;
| + break;
| + }
| + }
| + out:
| + if (ret < 0) {
| + kfree(sma);
| + sma = ERR_PTR(ret);
| + }
| + return sma;
| +}
...

-- Cyrill

2009-07-22 17:57:44

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 22/60] c/r: external checkpoint of a task other than ourself

Quoting Oren Laadan ([email protected]):
> Now we can do "external" checkpoint, i.e. act on another task.

...

> long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
> {
> long ret;
>
> + ret = init_checkpoint_ctx(ctx, pid);
> + if (ret < 0)
> + return ret;
> +
> + if (ctx->root_freezer) {
> + ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
> + if (ret < 0)
> + return ret;
> + }

Self-checkpoint of a task in root freezer is now denied, though.

Was that intentional?

-serge

2009-07-23 02:35:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 10/60] c/r: make file_pos_read/write() public

a nitpick.

On Wed, 22 Jul 2009 05:59:32 -0400
Oren Laadan <[email protected]> wrote:

> These two are used in the next patch when calling vfs_read/write()

> +static inline loff_t file_pos_read(struct file *file)
> +{
> + return file->f_pos;
> +}
> +
> +static inline void file_pos_write(struct file *file, loff_t pos)
> +{
> + file->f_pos = pos;
> +}
> +

I'm not sure but how about renaming this to
file_pos()
set_file_pos()
at moving this to global include file ?

Thanks,
-Kame

2009-07-23 04:21:39

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 52/60] c/r: support semaphore sysv-ipc



Cyrill Gorcunov wrote:
> [Oren Laadan - Wed, Jul 22, 2009 at 06:00:14AM -0400]
> ...
> | +static struct sem *restore_sem_array(struct ckpt_ctx *ctx, int nsems)
> | +{
> | + struct sem *sma;
> | + int i, ret;
> | +
> | + sma = kmalloc(nsems * sizeof(*sma), GFP_KERNEL);
>
> Forgot to
>
> if (!sma)
> return -ENOMEM;
>
> right?

Yep ! thanks... (fixed commit to branch ckpt-v17-dev)

Oren.

2009-07-23 04:32:35

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 22/60] c/r: external checkpoint of a task other than ourself



Serge E. Hallyn wrote:
> Quoting Oren Laadan ([email protected]):
>> Now we can do "external" checkpoint, i.e. act on another task.
>
> ...
>
>> long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
>> {
>> long ret;
>>
>> + ret = init_checkpoint_ctx(ctx, pid);
>> + if (ret < 0)
>> + return ret;
>> +
>> + if (ctx->root_freezer) {
>> + ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
>> + if (ret < 0)
>> + return ret;
>> + }
>
> Self-checkpoint of a task in root freezer is now denied, though.
>
> Was that intentional?

Yes.

"root freezer" is an arbitrary task in the checkpoint subtree or
container. It is used to verify that all checkpointed tasks - except
for current, if doing self-checkpoint - belong to the same freezer
group.

Since current is busy calling checkpoint(2), and since we only permit
checkpoint of (cgroup-) frozen tasks, then - by definition - it cannot
possibly belong to the same group. If it did, it would itself be frozen
like its fellows and unable to call checkpoint(2).

Oren.

2009-07-23 13:16:23

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 22/60] c/r: external checkpoint of a task other than ourself

Quoting Oren Laadan ([email protected]):
>
>
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan ([email protected]):
> >> Now we can do "external" checkpoint, i.e. act on another task.
> >
> > ...
> >
> >> long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
> >> {
> >> long ret;
> >>
> >> + ret = init_checkpoint_ctx(ctx, pid);
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> + if (ctx->root_freezer) {
> >> + ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
> >> + if (ret < 0)
> >> + return ret;
> >> + }
> >
> > Self-checkpoint of a task in root freezer is now denied, though.
> >
> > Was that intentional?
>
> Yes.
>
> "root freezer" is an arbitrary task in the checkpoint subtree or
> container. It is used to verify that all checkpointed tasks - except
> for current, if doing self-checkpoint - belong to the same freezer
> group.
>
> Since current is busy calling checkpoint(2), and since we only permit
> checkpoint of (cgroup-) frozen tasks, then - by definition - it cannot
> possibly belong to the same group. If it did, it would itself be frozen
> like its fellows and unable to call checkpoint(2).

So then you're saying that regular self-checkpoint no longer works,
but the documentation still shows self.c and claims it should just
work.

Mind you I prefer this as it is more consistent, but I thought it
was something you wanted to support.

-serge

2009-07-23 14:14:29

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 22/60] c/r: external checkpoint of a task other than ourself



Serge E. Hallyn wrote:
> Quoting Oren Laadan ([email protected]):
>>
>> Serge E. Hallyn wrote:
>>> Quoting Oren Laadan ([email protected]):
>>>> Now we can do "external" checkpoint, i.e. act on another task.
>>> ...
>>>
>>>> long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
>>>> {
>>>> long ret;
>>>>
>>>> + ret = init_checkpoint_ctx(ctx, pid);
>>>> + if (ret < 0)
>>>> + return ret;
>>>> +
>>>> + if (ctx->root_freezer) {
>>>> + ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
>>>> + if (ret < 0)
>>>> + return ret;
>>>> + }
>>> Self-checkpoint of a task in root freezer is now denied, though.
>>>
>>> Was that intentional?
>> Yes.
>>
>> "root freezer" is an arbitrary task in the checkpoint subtree or
>> container. It is used to verify that all checkpointed tasks - except
>> for current, if doing self-checkpoint - belong to the same freezer
>> group.
>>
>> Since current is busy calling checkpoint(2), and since we only permit
>> checkpoint of (cgroup-) frozen tasks, then - by definition - it cannot
>> possibly belong to the same group. If it did, it would itself be frozen
>> like its fellows and unable to call checkpoint(2).
>
> So then you're saying that regular self-checkpoint no longer works,
> but the documentation still shows self.c and claims it should just
> work.

I'm unsure why you say that self-checkpoint no longer works ?
In fact, I just double checked that it does.

Self-checkpoint has two immediate use-cases:

1) Single process that checkpoints itself - ctx->root_freezer remains
NULL, which causes cgroup_freezer_begin_checkpoint() to be skipped.

2) Process P that belongs to a hierarchy (subtree or container), and
P calls checkpoint(2) to checkpoint the hierarchy.
For this to work, all other processes in the hierarchy must be frozen.
Therefore, they also belong to a freezer cgroup (perhaps more than one -
but that is not permitted).
In this case, ctx->root will point to a process from the freezer cgroup,
and the code tests all other processes (excluding P, which is current)
to confirm that they belong to the same freezer cgroup.
P itself can not possibly belong to it, otherwise it would have been
frozen and not executing the checkpoint(2) syscall.

IOW, for case 2 to work, one must arrange for all tasks in the target
hierarchy, except for P (- current, the checkpointer), to belong to
a single freezer cgroup, and for that cgroup to be frozen.

>>> Self-checkpoint of a task in root freezer is now denied, though.

Maybe I didn't really understand what you meant by that, and by
"root freezer" ?

>
> Mind you I prefer this as it is more consistent, but I thought it
> was something you wanted to support.

Self-checkpoint simply allows a process to checkpoint itself (and
perhaps additional processes too). I never quite understood why you
view it as a source of inconsistency ...

Nevertheless, it still works.

Oren.

2009-07-23 14:25:24

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 19/60] c/r: documentation

Quoting Oren Laadan ([email protected]):
> +Security
> +========
> +
> +The main question is whether sys_checkpoint() and sys_restart()
> +require privileged or unprivileged operation.
> +
> +Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
> +attempt to remove the need for privilege, so that all users could
> +safely use it. Arnd Bergmann pointed out that it'd make more sense to
> +let unprivileged users use them now, so that we'll be more careful
> +about the security as patches roll in.
> +
> +Checkpoint: the main concern is whether a task that performs the
> +checkpoint of another task has sufficient privileges to access its
> +state. We address this by requiring that the checkpointer task will be
> +able to ptrace the target task, by means of ptrace_may_access() with
> +read mode.

with access mode now, actually.

> +Restart: the main concern is that we may allow an unprivileged user to
> +feed the kernel with random data. To this end, the restart works in a
> +way that does not skip the usual security checks. Task credentials,
> +i.e. euid, reuid, and LSM security contexts currently come from the
> +caller, not the checkpoint image. When restoration of credentials
> +becomes supported, then definitely the ability of the task that calls
> +sys_restore() to setresuid/setresgid to those values must be checked.

That is now possible, and this is done.

> +Keeping the restart procedure to operate within the limits of the
> +caller's credentials means that there various scenarios that cannot
> +be supported. For instance, a setuid program that opened a protected
> +log file and then dropped privileges will fail the restart, because
> +the user won't have enough credentials to reopen the file. In these
> +cases, we should probably treat restarting like inserting a kernel
> +module: surely the user can cause havoc by providing incorrect data,
> +but then again we must trust the root account.
> +
> +So that's why we don't want CAP_SYS_ADMIN required up-front. That way
> +we will be forced to more carefully review each of those features.
> +However, this can be controlled with a sysctl-variable.
> +
> +
> diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
> new file mode 100644
> index 0000000..ed34765
> --- /dev/null
> +++ b/Documentation/checkpoint/usage.txt
> @@ -0,0 +1,193 @@
> +
> + How to use Checkpoint-Restart
> + =========================================
> +
> +
> +API
> +===
> +
> +The API consists of two new system calls:
> +
> +* int checkpoint(pid_t pid, int fd, unsigned long flag);
> +
> + Checkpoint a (sub-)container whose root task is identified by @pid,
> + to the open file indicated by @fd. @flags may be on or more of:
> + - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
> + (other value are not allowed).
> +
> + Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
> + it returns from a restart, and -1 if an error occurs. The ckptid will
> + uniquely identify a checkpoint image, for as long as the checkpoint
> + is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
> + partial checkpoint, residing in kernel memory).
> +
> +* int sys_restart(pid_t pid, int fd, unsigned long flags);
> +
> + Restart a process hierarchy from a checkpoint image that is read from
> + the blob stored in the file indicated by @fd. The @flags' will have
> + future meaning (must be 0 for now). @pid indicates the root of the
> + hierarchy as seen in the coordinator's pid-namespace, and is expected
> + to be a child of the coordinator. (Note that this argument may mean
> + 'ckptid' to identify an in-kernel checkpoint image, with some @flags
> + in the future).
> +
> + Returns: -1 if an error occurs, 0 on success when restarting from a
> + "self" checkpoint, and return value of system call at the time of the
> + checkpoint when restarting from an "external" checkpoint.

Return value of the checkpointed (init) task's syscall at the time of
external checkpoint? If so, what's the use for this, as opposed to
returning 0 as in the case of self-checkpoint?

> + TODO: upon successful "external" restart, the container will end up
> + in a frozen state.

Should clone_with_pids() be mentioned here?

thanks,
-serge

2009-07-23 14:54:55

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 22/60] c/r: external checkpoint of a task other than ourself

Quoting Oren Laadan ([email protected]):
>
>
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan ([email protected]):
> >> Now we can do "external" checkpoint, i.e. act on another task.
> >
> > ...
> >
> >> long do_checkpoint(struct ckpt_ctx *ctx, pid_t pid)
> >> {
> >> long ret;
> >>
> >> + ret = init_checkpoint_ctx(ctx, pid);
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> + if (ctx->root_freezer) {
> >> + ret = cgroup_freezer_begin_checkpoint(ctx->root_freezer);
> >> + if (ret < 0)
> >> + return ret;
> >> + }
> >
> > Self-checkpoint of a task in root freezer is now denied, though.
> >
> > Was that intentional?
>
> Yes.
>
> "root freezer" is an arbitrary task in the checkpoint subtree or
> container. It is used to verify that all checkpointed tasks - except
> for current, if doing self-checkpoint - belong to the same freezer
> group.
>
> Since current is busy calling checkpoint(2), and since we only permit
> checkpoint of (cgroup-) frozen tasks, then - by definition - it cannot
> possibly belong to the same group. If it did, it would itself be frozen
> like its fellows and unable to call checkpoint(2).
>
> Oren.

Ok, well I don't know what was happening yesterday. Today it's
restart that is failing, and as you pointed out on irc that's
on s390 only. I'll send out a patch this afternoon to fix that.

Yesterday I must not have read the output right I guess...

thanks,
-serge

2009-07-23 15:23:46

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 22/60] c/r: external checkpoint of a task other than ourself

Quoting Oren Laadan ([email protected]):
> +/* setup checkpoint-specific parts of ctx */
> +static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
> +{
> + struct task_struct *task;
> + struct nsproxy *nsproxy;
> + int ret;
> +
> + /*
> + * No need for explicit cleanup here, because if an error
> + * occurs then ckpt_ctx_free() is eventually called.
> + */
> +
> + ctx->root_pid = pid;
> +
> + /* root task */
> + read_lock(&tasklist_lock);
> + task = find_task_by_vpid(pid);
> + if (task)
> + get_task_struct(task);
> + read_unlock(&tasklist_lock);
> + if (!task)
> + return -ESRCH;
> + else
> + ctx->root_task = task;
> +
> + /* root nsproxy */
> + rcu_read_lock();
> + nsproxy = task_nsproxy(task);
> + if (nsproxy)
> + get_nsproxy(nsproxy);
> + rcu_read_unlock();
> + if (!nsproxy)
> + return -ESRCH;
> + else
> + ctx->root_nsproxy = nsproxy;
> +
> + /* root freezer */
> + ctx->root_freezer = task;
> + geT_task_struct(task);
> +
> + ret = may_checkpoint_task(ctx, task);
> + if (ret) {
> + ckpt_write_err(ctx, NULL);
> + put_task_struct(task);
> + put_task_struct(task);
> + put_nsproxy(nsproxy);

I don't think this is safe - the ckpt_ctx_free() will
free them a second time because you're not setting them
to NULL, right?

> + return ret;
> + }
> +
> + return 0;
> +}
> +

-serge

2009-07-23 15:24:49

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 19/60] c/r: documentation



Serge E. Hallyn wrote:
> Quoting Oren Laadan ([email protected]):
>> +Security
>> +========
>> +
>> +The main question is whether sys_checkpoint() and sys_restart()
>> +require privileged or unprivileged operation.
>> +
>> +Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
>> +attempt to remove the need for privilege, so that all users could
>> +safely use it. Arnd Bergmann pointed out that it'd make more sense to
>> +let unprivileged users use them now, so that we'll be more careful
>> +about the security as patches roll in.
>> +
>> +Checkpoint: the main concern is whether a task that performs the
>> +checkpoint of another task has sufficient privileges to access its
>> +state. We address this by requiring that the checkpointer task will be
>> +able to ptrace the target task, by means of ptrace_may_access() with
>> +read mode.
>
> with access mode now, actually.

Yes...

>
>> +Restart: the main concern is that we may allow an unprivileged user to
>> +feed the kernel with random data. To this end, the restart works in a
>> +way that does not skip the usual security checks. Task credentials,
>> +i.e. euid, reuid, and LSM security contexts currently come from the
>> +caller, not the checkpoint image. When restoration of credentials
>> +becomes supported, then definitely the ability of the task that calls
>> +sys_restore() to setresuid/setresgid to those values must be checked.
>
> That is now possible, and this is done.

Yes again.

>
>> +Keeping the restart procedure to operate within the limits of the
>> +caller's credentials means that there various scenarios that cannot
>> +be supported. For instance, a setuid program that opened a protected
>> +log file and then dropped privileges will fail the restart, because
>> +the user won't have enough credentials to reopen the file. In these
>> +cases, we should probably treat restarting like inserting a kernel
>> +module: surely the user can cause havoc by providing incorrect data,
>> +but then again we must trust the root account.
>> +
>> +So that's why we don't want CAP_SYS_ADMIN required up-front. That way
>> +we will be forced to more carefully review each of those features.
>> +However, this can be controlled with a sysctl-variable.
>> +
>> +
>> diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
>> new file mode 100644
>> index 0000000..ed34765
>> --- /dev/null
>> +++ b/Documentation/checkpoint/usage.txt
>> @@ -0,0 +1,193 @@
>> +
>> + How to use Checkpoint-Restart
>> + =========================================
>> +
>> +
>> +API
>> +===
>> +
>> +The API consists of two new system calls:
>> +
>> +* int checkpoint(pid_t pid, int fd, unsigned long flag);
>> +
>> + Checkpoint a (sub-)container whose root task is identified by @pid,
>> + to the open file indicated by @fd. @flags may be on or more of:
>> + - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
>> + (other value are not allowed).
>> +
>> + Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
>> + it returns from a restart, and -1 if an error occurs. The ckptid will
>> + uniquely identify a checkpoint image, for as long as the checkpoint
>> + is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
>> + partial checkpoint, residing in kernel memory).
>> +
>> +* int sys_restart(pid_t pid, int fd, unsigned long flags);
>> +
>> + Restart a process hierarchy from a checkpoint image that is read from
>> + the blob stored in the file indicated by @fd. The @flags' will have
>> + future meaning (must be 0 for now). @pid indicates the root of the
>> + hierarchy as seen in the coordinator's pid-namespace, and is expected
>> + to be a child of the coordinator. (Note that this argument may mean
>> + 'ckptid' to identify an in-kernel checkpoint image, with some @flags
>> + in the future).
>> +
>> + Returns: -1 if an error occurs, 0 on success when restarting from a
>> + "self" checkpoint, and return value of system call at the time of the
>> + checkpoint when restarting from an "external" checkpoint.
>
> Return value of the checkpointed (init) task's syscall at the time of
> external checkpoint? If so, what's the use for this, as opposed to
> returning 0 as in the case of self-checkpoint?

When you restart from a regular ("external") syscall, the checkpointed
process was doing _something_:

If it was frozen for checkpoint while running in userspace, then it will
resume running in userspace exactly where it was interrupted.

If it was frozen while in kernel doing a syscall, it will return what
that syscall returned when it was interrupted - or completed - for the
freeze. It will proceed from there as if it had only been frozen and
then thawed.

In the special case that the process original self-checkpointed, then
once restart completes successfully, it will resume execution at the
first instruction after the original call to checkpoint(2), and the
return value from that syscall will be set to 0. (The caller uses
this retval to learn that it was restarted, and not just completed
a checkpoint).

>
>> + TODO: upon successful "external" restart, the container will end up
>> + in a frozen state.

Heh .. this is also done :)

>
> Should clone_with_pids() be mentioned here?

It's not a c/r interface per-se, but you're probably right that a few
words there won't hurt.

Thanks.

Oren.

2009-07-23 15:42:36

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 22/60] c/r: external checkpoint of a task other than ourself



Serge E. Hallyn wrote:
> Quoting Oren Laadan ([email protected]):
>> +/* setup checkpoint-specific parts of ctx */
>> +static int init_checkpoint_ctx(struct ckpt_ctx *ctx, pid_t pid)
>> +{
>> + struct task_struct *task;
>> + struct nsproxy *nsproxy;
>> + int ret;
>> +
>> + /*
>> + * No need for explicit cleanup here, because if an error
>> + * occurs then ckpt_ctx_free() is eventually called.
>> + */
>> +
>> + ctx->root_pid = pid;
>> +
>> + /* root task */
>> + read_lock(&tasklist_lock);
>> + task = find_task_by_vpid(pid);
>> + if (task)
>> + get_task_struct(task);
>> + read_unlock(&tasklist_lock);
>> + if (!task)
>> + return -ESRCH;
>> + else
>> + ctx->root_task = task;
>> +
>> + /* root nsproxy */
>> + rcu_read_lock();
>> + nsproxy = task_nsproxy(task);
>> + if (nsproxy)
>> + get_nsproxy(nsproxy);
>> + rcu_read_unlock();
>> + if (!nsproxy)
>> + return -ESRCH;
>> + else
>> + ctx->root_nsproxy = nsproxy;
>> +
>> + /* root freezer */
>> + ctx->root_freezer = task;
>> + geT_task_struct(task);
>> +
>> + ret = may_checkpoint_task(ctx, task);
>> + if (ret) {
>> + ckpt_write_err(ctx, NULL);
>> + put_task_struct(task);
>> + put_task_struct(task);
>> + put_nsproxy(nsproxy);
>
> I don't think this is safe - the ckpt_ctx_free() will
> free them a second time because you're not setting them
> to NULL, right?

Yes. Fortunately this hole chunk is removed by the 3rd-next patch.
I'll make sure it's correct here too.

Thanks,

Oren.

2009-07-24 19:10:03

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 00/60] Kernel based checkpoint/restart

Quoting Oren Laadan ([email protected]):
> Application checkpoint/restart (c/r) is the ability to save the state
> of a running application so that it can later resume its execution
> from the time at which it was checkpointed, on the same or a different
> machine.
>
> This version introduces 'clone_with_pids()' syscall to preset pid(s)
> for a child process. It is used by restart(2) to recreate process
> hierarchy with the same pids as at checkpoint time.
>
> It also adds a freezer state CHECKPOINTING to safeguard processes
> during a checkpoint. Other important changes include support for
> threads and zombies, credentials, signal handling, and improved
> restart logic. See below for a more detailed changelog.
>
> Compiled and tested against v2.6.31-rc3.

With the s390 patch I recently sent on top of this set, all of my
c/r tests pass, and ltp behaves the same as on plain v2.6.31-rc3
(up to and including hanging on mallocstress).

-serge

2009-07-29 00:42:18

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [RFC v17][PATCH 17/60] pids 7/7: Define clone_with_pids syscall

Ccing Oleg Nesterov, Eric Biederman, Mike Waychinson, Ying Han

Note that this is a variant of an earlier clone_with_pids() interface, sent
in Mar 2009 (http://lkml.org/lkml/2009/3/13/359). Linus' major objection
was about security, as unprivileged tasks could read /var/run data and try
to run with cached pids.

This variant addresses that concern by requiring CAP_SYS_ADMIN to specify
pids. This makes sense since CAP_SYS_ADMIN is required to create a new
pid namespace anyway.

Sukadev

Oren Laadan [[email protected]] wrote:
| From: Sukadev Bhattiprolu <[email protected]>
|
| Container restart requires that a task have the same pid it had when it was
| checkpointed. When containers are nested the tasks within the containers
| exist in multiple pid namespaces and hence have multiple pids to specify
| during restart.
|
| clone_with_pids(), intended for use during restart, is the same as clone(),
| except that it takes a 'target_pid_set' paramter. This parameter lets caller
| choose specific pid numbers for the child process, in the process's active
| and ancestor pid namespaces. (Descendant pid namespaces in general don't
| matter since processes don't have pids in them anyway, but see comments
| in copy_target_pids() regarding CLONE_NEWPID).
|
| Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
| prevent unprivileged processes from misusing this interface.
|
| Call clone_with_pids as follows:
|
| pid_t pids[] = { 0, 77, 99 };
| struct target_pid_set pid_set;
|
| pid_set.num_pids = sizeof(pids) / sizeof(int);
| pid_set.target_pids = &pids;
|
| syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);
|
| If a target-pid is 0, the kernel continues to assign a pid for the process in
| that namespace. In the above example, pids[0] is 0, meaning the kernel will
| assign next available pid to the process in init_pid_ns. But kernel will assign
| pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
| 77 or 99 are taken, the system call fails with -EBUSY.
|
| If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
| the system call fails with -EINVAL.
|
| Its mostly an exploratory patch seeking feedback on the interface.
|
| NOTE:
| Compared to clone(), clone_with_pids() needs to pass in two more
| pieces of information:
|
| - number of pids in the set
| - user buffer containing the list of pids.
|
| But since clone() already takes 5 parameters, use a 'struct
| target_pid_set'.
|
| TODO:
| - Gently tested.
| - May need additional sanity checks in do_fork_with_pids().
|
| Changelog[v3]:
| - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
| in the target_pids[] list and setting it 0. See copy_target_pids()).
| - (Oren Laadan) Specified target pids should apply only to youngest
| pid-namespaces (see copy_target_pids())
| - (Matt Helsley) Update patch description.
|
| Changelog[v2]:
| - Remove unnecessary printk and add a note to callers of
| copy_target_pids() to free target_pids.
| - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
| - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
| 'num_pids == 0' (fall back to normal clone()).
| - Move arch-independent code (sanity checks and copy-in of target-pids)
| into kernel/fork.c and simplify sys_clone_with_pids()
|
| Changelog[v1]:
| - Fixed some compile errors (had fixed these errors earlier in my
| git tree but had not refreshed patches before emailing them)
|
| Signed-off-by: Sukadev Bhattiprolu <[email protected]>
| ---
| arch/x86/include/asm/syscalls.h | 2 +
| arch/x86/include/asm/unistd_32.h | 1 +
| arch/x86/kernel/entry_32.S | 1 +
| arch/x86/kernel/process_32.c | 21 +++++++
| arch/x86/kernel/syscall_table_32.S | 1 +
| kernel/fork.c | 108 +++++++++++++++++++++++++++++++++++-
| 6 files changed, 133 insertions(+), 1 deletions(-)
|
| diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
| index 372b76e..df3c4a8 100644
| --- a/arch/x86/include/asm/syscalls.h
| +++ b/arch/x86/include/asm/syscalls.h
| @@ -40,6 +40,8 @@ long sys_iopl(struct pt_regs *);
|
| /* kernel/process_32.c */
| int sys_clone(struct pt_regs *);
| +int sys_clone_with_pids(struct pt_regs *);
| +int sys_vfork(struct pt_regs *);
| int sys_execve(struct pt_regs *);
|
| /* kernel/signal.c */
| diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
| index 732a307..f65b750 100644
| --- a/arch/x86/include/asm/unistd_32.h
| +++ b/arch/x86/include/asm/unistd_32.h
| @@ -342,6 +342,7 @@
| #define __NR_pwritev 334
| #define __NR_rt_tgsigqueueinfo 335
| #define __NR_perf_counter_open 336
| +#define __NR_clone_with_pids 337
|
| #ifdef __KERNEL__
|
| diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
| index c097e7d..c7bd1f6 100644
| --- a/arch/x86/kernel/entry_32.S
| +++ b/arch/x86/kernel/entry_32.S
| @@ -718,6 +718,7 @@ ptregs_##name: \
| PTREGSCALL(iopl)
| PTREGSCALL(fork)
| PTREGSCALL(clone)
| +PTREGSCALL(clone_with_pids)
| PTREGSCALL(vfork)
| PTREGSCALL(execve)
| PTREGSCALL(sigaltstack)
| diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
| index 59f4524..9965c06 100644
| --- a/arch/x86/kernel/process_32.c
| +++ b/arch/x86/kernel/process_32.c
| @@ -443,6 +443,27 @@ int sys_clone(struct pt_regs *regs)
| return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
| }
|
| +int sys_clone_with_pids(struct pt_regs *regs)
| +{
| + unsigned long clone_flags;
| + unsigned long newsp;
| + int __user *parent_tidptr;
| + int __user *child_tidptr;
| + void __user *upid_setp;
| +
| + clone_flags = regs->bx;
| + newsp = regs->cx;
| + parent_tidptr = (int __user *)regs->dx;
| + child_tidptr = (int __user *)regs->di;
| + upid_setp = (void __user *)regs->bp;
| +
| + if (!newsp)
| + newsp = regs->sp;
| +
| + return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
| + child_tidptr, upid_setp);
| +}
| +
| /*
| * sys_execve() executes a new program.
| */
| diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
| index d51321d..879e5ec 100644
| --- a/arch/x86/kernel/syscall_table_32.S
| +++ b/arch/x86/kernel/syscall_table_32.S
| @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
| .long sys_pwritev
| .long sys_rt_tgsigqueueinfo /* 335 */
| .long sys_perf_counter_open
| + .long ptregs_clone_with_pids
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 64d53d9..29c66f0 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -1336,6 +1336,97 @@ struct task_struct * __cpuinit fork_idle(int cpu)
| }
|
| /*
| + * If user specified any 'target-pids' in @upid_setp, copy them from
| + * user and return a pointer to a local copy of the list of pids. The
| + * caller must free the list, when they are done using it.
| + *
| + * If user did not specify any target pids, return NULL (caller should
| + * treat this like normal clone).
| + *
| + * On any errors, return the error code
| + */
| +static pid_t *copy_target_pids(void __user *upid_setp)
| +{
| + int j;
| + int rc;
| + int size;
| + int unum_pids; /* # of pids specified by user */
| + int knum_pids; /* # of pids needed in kernel */
| + pid_t *target_pids;
| + struct target_pid_set pid_set;
| +
| + if (!upid_setp)
| + return NULL;
| +
| + rc = copy_from_user(&pid_set, upid_setp, sizeof(pid_set));
| + if (rc)
| + return ERR_PTR(-EFAULT);
| +
| + unum_pids = pid_set.num_pids;
| + knum_pids = task_pid(current)->level + 1;
| +
| + if (!unum_pids)
| + return NULL;
| +
| + if (unum_pids < 0 || unum_pids > knum_pids)
| + return ERR_PTR(-EINVAL);
| +
| + /*
| + * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
| + * and set it to 0. This last entry in target_pids[] corresponds to the
| + * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
| + * specified. If CLONE_NEWPID was not specified, this last entry will
| + * simply be ignored.
| + */
| + target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
| + if (!target_pids)
| + return ERR_PTR(-ENOMEM);
| +
| + /*
| + * A process running in a level 2 pid namespace has three pid namespaces
| + * and hence three pid numbers. If this process is checkpointed,
| + * information about these three namespaces are saved. We refer to these
| + * namespaces as 'known namespaces'.
| + *
| + * If this checkpointed process is however restarted in a level 3 pid
| + * namespace, the restarted process has an extra ancestor pid namespace
| + * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
| + *
| + * During restart, the process requests specific pids for its 'known
| + * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
| + *
| + * Since the requested-pids correspond to 'known namespaces' and since
| + * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
| + * namespaces', copy requested pids to the back-end of target_pids[]
| + * (i.e before the last entry for CLONE_NEWPID mentioned above).
| + * Any entries in target_pids[] not corresponding to a requested pid
| + * will be set to zero and kernel assigns a pid in those namespaces.
| + *
| + * NOTE: The order of pids in target_pids[] is oldest pid namespace to
| + * youngest (target_pids[0] corresponds to init_pid_ns). i.e.
| + * the order is:
| + *
| + * - pids for 'unknown-namespaces' (if any)
| + * - pids for 'known-namespaces' (requested pids)
| + * - 0 in the last entry (for CLONE_NEWPID).
| + */
| + j = knum_pids - unum_pids;
| + size = unum_pids * sizeof(pid_t);
| +
| + rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
| + if (rc) {
| + rc = -EFAULT;
| + goto out_free;
| + }
| +
| + return target_pids;
| +
| +out_free:
| + kfree(target_pids);
| + return ERR_PTR(rc);
| +}
| +
| +/*
| * Ok, this is the main fork-routine.
| *
| * It copies the process, and if successful kick-starts
| @@ -1352,7 +1443,7 @@ long do_fork_with_pids(unsigned long clone_flags,
| struct task_struct *p;
| int trace = 0;
| long nr;
| - pid_t *target_pids = NULL;
| + pid_t *target_pids;
|
| /*
| * Do some preliminary argument and permissions checking before we
| @@ -1386,6 +1477,17 @@ long do_fork_with_pids(unsigned long clone_flags,
| }
| }
|
| + target_pids = copy_target_pids(pid_setp);
| +
| + if (target_pids) {
| + if (IS_ERR(target_pids))
| + return PTR_ERR(target_pids);
| +
| + nr = -EPERM;
| + if (!capable(CAP_SYS_ADMIN))
| + goto out_free;
| + }
| +
| /*
| * When called from kernel_thread, don't do user tracing stuff.
| */
| @@ -1453,6 +1555,10 @@ long do_fork_with_pids(unsigned long clone_flags,
| } else {
| nr = PTR_ERR(p);
| }
| +
| +out_free:
| + kfree(target_pids);
| +
| return nr;
| }
|
| --
| 1.6.0.4