These patches are from Oren Laaden. I've refactored them
a bit to make them a wee bit more reviewable. I think this
separates out the per-arch bits pretty well. It should also
be at least build-bisetable.
If there are no objections to this general approach, then we
plan to start submitting these bits to -mm.
--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach. With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.
This is different that what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data. We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing. This unites those into a single, grand interface.
The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs. It will also contain
copies of the actual memory that the process uses. Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.
This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.
These patches basically serialize internel kernel state and write
it out to a file descriptor. The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.
In this incarnation, they can only work checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped.
--
Oren's original announcement
In the recent mini-summit at OLS 2008 and the following days it was
agreed to tackle the checkpoint/restart (CR) by beginning with a very
simple case: save and restore a single task, with simple memory
layout, disregarding other task state such as files, signals etc.
Following these discussions I coded a prototype that can do exactly
that, as a starter. This code adds two system calls - sys_checkpoint
and sys_restart - that a task can call to save and restore its state
respectively. It also demonstrates how the checkpoint image file can
be formatted, as well as show its nested nature (e.g. cr_write_mm()
-> cr_write_vma() nesting).
The state that is saved/restored is the following:
* some of the task_struct
* some of the thread_struct and thread_info
* the cpu state (including FPU)
* the memory address space
[The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
of Linus's tree (uhhh.. don't ask why), but against tonight's head too].
In the current code, sys_checkpoint will checkpoint the current task,
although the logic exists to checkpoint other tasks (not in the
checkpointee's execution context). A simple loop will extend this to
handle multiple processes. sys_restart restarts the current tasks, and
with multiple tasks each task will call the syscall independently.
(Actually, to checkpoint outside the context of a task, it is also
necessary to also handle restart-block logic when saving/restoring the
thread data).
It takes longer to describe what isn't implemented or supported by
this prototype ... basically everything that isn't as simple as the
above.
As for containers - since we still don't have a representation for a
container, this patch has no notion of a container. The tests for
consistent namespaces (and isolation) are also omitted.
Below are two example programs: one uses checkpoint (called ckpt) and
one uses restart (called rstr). Execute like this (as a superuser):
orenl:~/test$ ./ckpt > out.1
hello, world! (ret=1) <-- sys_checkpoint returns positive id
<-- ctrl-c
orenl:~/test$ ./ckpt > out.2
hello, world! (ret=2)
<-- ctrl-c
orenl:~/test$ ./rstr < out.1
hello, world! (ret=0) <-- sys_restart return 0
(if you check the output of ps, you'll see that "rstr" changed its
name to "ckpt", as expected).
Hoping this will accelerate the discussion. Comments are welcome.
Let the fun begin :)
Oren.
============================== ckpt.c ================================
#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd_32.h>
#include <sys/syscall.h>
int main(int argc, char *argv[])
{
pid_t pid = getpid();
int ret;
ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
if (ret < 0)
perror("checkpoint");
fprintf(stderr, "hello, world! (ret=%d)\n", ret);
while (1)
;
return 0;
}
============================== rstr.c ================================
#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd_32.h>
#include <sys/syscall.h>
int main(int argc, char *argv[])
{
pid_t pid = getpid();
int ret;
ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
if (ret < 0)
perror("restart");
printf("should not reach here !\n");
return 0;
}
From: Oren Laadan <[email protected]>
This patch adds those interfaces, as well as all of the helpers
needed to easily manage the file format.
The code is roughly broken out as follows:
ckpt/sys.c - user/kernel data transfer, as well as setting up of the
checkpoint/restart context (a per-checkpoint data
structure for housekeeping)
ckpt/checkpoint.c - output wrappers and basic checkpoint handling
ckpt/restart.c - input wrappers and basic restart handling
Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.
Signed-off-by: Oren Laadan <[email protected]>
---
linux-2.6.git-dave/Makefile | 2
linux-2.6.git-dave/ckpt/Makefile | 1
linux-2.6.git-dave/ckpt/checkpoint.c | 207 +++++++++++++++++++++++++++++++
linux-2.6.git-dave/ckpt/ckpt.h | 82 ++++++++++++
linux-2.6.git-dave/ckpt/ckpt_hdr.h | 69 ++++++++++
linux-2.6.git-dave/ckpt/restart.c | 189 ++++++++++++++++++++++++++++
linux-2.6.git-dave/ckpt/sys.c | 233 +++++++++++++++++++++++++++++++++++
7 files changed, 782 insertions(+), 1 deletion(-)
diff -puN /dev/null ckpt/checkpoint.c
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/checkpoint.c 2008-08-07 15:37:22.000000000 -0700
@@ -0,0 +1,207 @@
+/*
+ * Checkpoint logic and helpers
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <asm/ptrace.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_get_fname - return pathname of a given file
+ * @file: file pointer
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ *
+ * if the buffer provivded by the caller is too small, allocate a new
+ * buffer; caller should call cr_put_pathname() for cleanup
+ */
+char *cr_get_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+ char *fname;
+
+ fname = __d_path(path, root, buf, *n);
+
+ if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) {
+ if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
+ return ERR_PTR(-ENOMEM);
+ fname = __d_path(path, root, buf, PAGE_SIZE);
+ if (IS_ERR(fname))
+ free_pages((unsigned long) buf, 0);
+ }
+ if (!IS_ERR(fname))
+ *n = (buf + *n - fname);
+
+ return fname;
+}
+
+/**
+ * cr_put_fname - (possibly) cleanup pathname buffer
+ * @buf: original buffer that was given to cr_get_pathname()
+ * @fname: resulting pathname from cr_get_pathname()
+ * @n: length of original buffer
+ */
+void cr_put_fname(char *buf, char *fname, int n)
+{
+ if (fname && (fname < buf || fname >= buf + n))
+ free_pages((unsigned long) buf, 0);
+}
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+ int ret;
+
+ if ((ret = cr_kwrite(ctx, h, sizeof(*h))) < 0)
+ return ret;
+ return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_str - write a string record
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @n: string length
+ */
+int cr_write_str(struct cr_ctx *ctx, char *str, int n)
+{
+ struct cr_hdr h;
+
+ h.type = CR_HDR_STR;
+ h.len = n;
+ h.id = 0;
+
+ return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_hdr(struct cr_ctx *ctx)
+{
+ struct cr_hdr h;
+ struct cr_hdr_head *hh = ctx->tbuf;
+ struct timeval ktv;
+
+ h.type = CR_HDR_HEAD;
+ h.len = sizeof(*hh);
+ h.id = 0;
+
+ do_gettimeofday(&ktv);
+
+ hh->magic = 0x00a2d200;
+ hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+ hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+ hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+ hh->version = 1;
+
+ hh->flags = ctx->flags;
+ hh->time = ktv.tv_sec;
+
+ return cr_write_obj(ctx, &h, hh);
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+ struct cr_hdr h;
+ struct cr_hdr_tail *hh = ctx->tbuf;
+
+ h.type = CR_HDR_TAIL;
+ h.len = sizeof(*hh);
+ h.id = 0;
+
+ hh->magic = 0x002d2a00;
+ hh->cksum[0] = hh->cksum[1] = 1; /* TBD ... */
+
+ return cr_write_obj(ctx, &h, hh);
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_task *hh = ctx->tbuf;
+
+ h.type = CR_HDR_TASK;
+ h.len = sizeof(*hh);
+ h.id = ctx->pid;
+
+ hh->state = t->state;
+ hh->exit_state = t->exit_state;
+ hh->exit_code = t->exit_code;
+ hh->exit_signal = t->exit_signal;
+
+ hh->pid = t->pid;
+ hh->tgid = t->tgid;
+
+ hh->utime = t->utime;
+ hh->stime = t->stime;
+ hh->utimescaled = t->utimescaled;
+ hh->stimescaled = t->stimescaled;
+ hh->gtime = t->gtime;
+ hh->prev_utime = t->prev_utime;
+ hh->prev_stime = t->prev_stime;
+ hh->nvcsw = t->nvcsw;
+ hh->nivcsw = t->nivcsw;
+ hh->start_time_sec = t->start_time.tv_sec;
+ hh->start_time_nsec = t->start_time.tv_nsec;
+ hh->real_start_time_sec = t->real_start_time.tv_sec;
+ hh->real_start_time_nsec = t->real_start_time.tv_nsec;
+ hh->min_flt = t->min_flt;
+ hh->maj_flt = t->maj_flt;
+
+ hh->task_comm_len = TASK_COMM_LEN;
+ memcpy(hh->comm, t->comm, TASK_COMM_LEN);
+
+ return cr_write_obj(ctx, &h, hh);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+ int ret ;
+
+ BUG_ON(t->state == TASK_DEAD);
+
+ ret = cr_write_task_struct(ctx, t);
+ CR_PRINTK("ret (task_struct) %d\n", ret);
+
+ return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx)
+{
+ int ret;
+
+ /* FIX: need to test whether container is checkpointable */
+
+ ret = cr_write_hdr(ctx);
+ if (!ret)
+ ret = cr_write_task(ctx, current);
+ if (!ret)
+ ret = cr_write_tail(ctx);
+
+ /* on success, return (unique) checkpoint identifier */
+ if (!ret)
+ ret = ctx->crid;
+
+ return ret;
+}
diff -puN /dev/null ckpt/ckpt.h
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt.h 2008-08-05 09:04:27.000000000 -0700
@@ -0,0 +1,82 @@
+#ifndef _CKPT_CKPT_H_
+#define _CKPT_CKPT_H_
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/path.h>
+#include <linux/fs.h>
+
+struct cr_pgarr;
+
+struct cr_ctx {
+ pid_t pid; /* container identifier */
+ int crid; /* unique checkpoint id */
+
+ unsigned long flags;
+ unsigned long oflags; /* restart: old flags */
+
+ struct file *file;
+ int total; /* total read/written */
+
+ void *tbuf; /* temp: to avoid many alloc/dealloc */
+ void *hbuf; /* header: to avoid many alloc/dealloc */
+ int hpos;
+
+ struct cr_pgarr *pgarr;
+ struct cr_pgarr *pgcur;
+
+ struct path *vfsroot; /* container root */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT 0x1
+#define CR_CTX_RSTR 0x2
+
+/* allocation defaults */
+#define CR_ORDER_TBUF 1
+#define CR_ORDER_HBUF 1
+
+#define CR_TBUF_TOTAL ((PAGE_SIZE << CR_ORDER_TBUF) / sizeof(void *))
+#define CR_HBUF_TOTAL ((PAGE_SIZE << CR_ORDER_HBUF) / sizeof(void *))
+
+extern void cr_put_fname(char *buf, char *fname, int n);
+extern char *cr_get_fname(struct path *path, struct path *root, char *buf, int *n);
+
+extern int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_uread(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+extern int cr_write_str(struct cr_ctx *ctx, char *str, int n);
+extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+
+extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
+extern int cr_read_str(struct cr_ctx *ctx, void *str, int n);
+extern int cr_read_mm(struct cr_ctx *ctx);
+
+extern int do_checkpoint(struct cr_ctx *ctx);
+extern int do_restart(struct cr_ctx *ctx);
+
+/* debugging */
+#if 0
+#define CR_PRINTK(str, args...) \
+ printk(KERN_ERR "cr@%s#%d: " str, __func__, __LINE__, ##args)
+#else
+#define CR_PRINTK(...) do {} while (0)
+#endif
+
+#endif /* _CKPT_CKPT_H_ */
diff -puN /dev/null ckpt/ckpt_hdr.h
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-07 15:37:22.000000000 -0700
@@ -0,0 +1,69 @@
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/types.h>
+
+struct cr_hdr {
+ __s16 type;
+ __s16 len;
+ __u32 id;
+};
+
+enum {
+ CR_HDR_HEAD = 1,
+ CR_HDR_STR,
+
+ CR_HDR_TASK = 101,
+ CR_HDR_THREAD,
+ CR_HDR_CPU,
+
+ CR_HDR_MM = 201,
+ CR_HDR_VMA,
+ CR_HDR_MM_CONTEXT,
+
+ CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+ __u32 magic;
+ __u16 major;
+ __u16 minor;
+ __u16 patch;
+ __u16 version;
+ __u32 flags; /* checkpoint options */
+ __u64 time; /* when checkpoint taken */
+};
+
+struct cr_hdr_tail {
+ __u32 magic;
+ __u32 cksum[2];
+};
+
+struct cr_hdr_task {
+ __u64 state;
+ __u32 exit_state;
+ __u32 exit_code, exit_signal;
+
+ __u16 pid;
+ __u16 tgid;
+
+ __u64 utime, stime, utimescaled, stimescaled;
+ __u64 gtime;
+ __u64 prev_utime, prev_stime;
+ __u64 nvcsw, nivcsw;
+ __u64 start_time_sec, start_time_nsec;
+ __u64 real_start_time_sec, real_start_time_nsec;
+ __u64 min_flt, maj_flt;
+
+ __s16 task_comm_len;
+ char comm[TASK_COMM_LEN];
+};
+
+
diff -puN /dev/null ckpt/Makefile
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/Makefile 2008-08-07 15:37:22.000000000 -0700
@@ -0,0 +1 @@
+obj-y += sys.o checkpoint.o restart.o
diff -puN /dev/null ckpt/restart.c
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/restart.c 2008-08-07 15:37:22.000000000 -0700
@@ -0,0 +1,189 @@
+/*
+ * Restart logic and helpers
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/*
+ * During restart the code reads in data from the chekcpoint image into a
+ * temporary buffer (ctx->hbuf). Because operations can be nested, one
+ * should call cr_hbuf_get() to reserve space in the buffer, and then
+ * cr_hbuf_put() when it no longer needs that space
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+ void *ptr;
+
+ BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+ ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
+ ctx->hpos += n;
+ return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+ BUG_ON(ctx->hpos < n);
+ ctx->hpos -= n;
+}
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @n: available buffer size
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
+{
+ int ret;
+
+ ret = cr_kread(ctx, h, sizeof(*h));
+ if (ret < 0)
+ return ret;
+
+ CR_PRINTK("type %d len %d id %d (%d)\n", h->type, h->len, h->id, n);
+ if (h->len < 0 || h->len > n)
+ return -EINVAL;
+
+ return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @n: available buffer size
+ * @type: expected record type
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
+{
+ struct cr_hdr h;
+ int ret;
+
+ ret = cr_read_obj(ctx, &h, buf, n);
+ if (!ret)
+ ret = (h.type == type ? h.id : -EINVAL);
+ return ret;
+}
+
+/**
+ * cr_read_str - read a string record
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @n: string length
+ */
+int cr_read_str(struct cr_ctx *ctx, void *str, int n)
+{
+ return cr_read_obj_type(ctx, str, n, CR_HDR_STR);
+}
+
+/* read the checkpoint header */
+static int cr_read_hdr(struct cr_ctx *ctx)
+{
+ struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+ if (ret < 0)
+ return ret;
+
+ if (hh->magic != 0x00a2d200 || hh->version != 1 ||
+ hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+ hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+ hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+ return -EINVAL;
+
+ if (hh->flags & ~CR_CTX_CKPT)
+ return -EINVAL;
+
+ ctx->oflags = hh->flags;
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return 0;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+ struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+ if (ret < 0)
+ return ret;
+
+ if (hh->magic != 0x002d2a00 ||
+ hh->cksum[0] != 1 || hh->cksum[1] != 1)
+ return -EINVAL;
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return 0;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+ struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+ if (ret < 0)
+ return ret;
+
+ /* for now, only restore t->comm */
+ if (hh->task_comm_len < 0 || hh->task_comm_len > TASK_COMM_LEN)
+ return -EINVAL;
+
+ memset(t->comm, 0, TASK_COMM_LEN);
+ memcpy(t->comm, hh->comm, hh->task_comm_len);
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return 0;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+ int ret;
+
+ ret = cr_read_task_struct(ctx);
+ CR_PRINTK("ret (task_struct) %d\n", ret);
+
+ return ret;
+}
+
+int do_restart(struct cr_ctx *ctx)
+{
+ int ret;
+
+ ret = cr_read_hdr(ctx);
+ if (!ret)
+ ret = cr_read_task(ctx);
+ if (!ret)
+ ret = cr_read_tail(ctx);
+
+ return ret;
+}
diff -puN /dev/null ckpt/sys.c
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/sys.c 2008-08-07 15:37:22.000000000 -0700
@@ -0,0 +1,233 @@
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+
+#include "ckpt.h"
+
+/*
+ * helpers to write/read to/from the image file descriptor
+ *
+ * cr_uwrite() - write a user-space buffer to the checkpoint image
+ * cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ * cr_uread() - read from the checkpoint image to a user-space buffer
+ * cr_kread() - read from the checkpoint image to a kernel-space buffer
+ *
+ */
+
+/* (temporarily added file_pos_read() and file_pos_write() because they
+ * are static in fs/read_write.c... should cleanup and remove later) */
+static inline loff_t file_pos_read(struct file *file)
+{
+ return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+ file->f_pos = pos;
+}
+
+int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+ struct file *file = ctx->file;
+ ssize_t nwrite;
+ int nleft;
+
+ for (nleft = count; nleft; nleft -= nwrite) {
+ loff_t pos = file_pos_read(file);
+ nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
+ file_pos_write(file, pos);
+ if (unlikely(nwrite <= 0)) /* zero tolerance */
+ return (nwrite ? : -EIO);
+ buf += nwrite;
+ }
+
+ ctx->total += count;
+ return 0;
+}
+
+int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+ mm_segment_t oldfs;
+ int ret;
+
+ oldfs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = cr_uwrite(ctx, buf, count);
+ set_fs(oldfs);
+
+ return ret;
+}
+
+int cr_uread(struct cr_ctx *ctx, void *buf, int count)
+{
+ struct file *file = ctx->file;
+ ssize_t nread;
+ int nleft;
+
+ for (nleft = count; nleft; nleft -= nread) {
+ loff_t pos = file_pos_read(file);
+ nread = vfs_read(file, (char __user *) buf, nleft, &pos);
+ file_pos_write(file, pos);
+ if (unlikely(nread <= 0)) /* zero tolerance */
+ return (nread ? : -EIO);
+ buf += nread;
+ }
+
+ ctx->total += count;
+ return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *buf, int count)
+{
+ mm_segment_t oldfs;
+ int ret;
+
+ oldfs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = cr_uread(ctx, buf, count);
+ set_fs(oldfs);
+
+ return ret;
+}
+
+
+/*
+ * helpers to manage CR contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+static atomic_t cr_ctx_count; /* unique checkpoint identifier */
+
+void cr_ctx_free(struct cr_ctx *ctx)
+{
+
+ if (ctx->file)
+ fput(ctx->file);
+ if (ctx->vfsroot)
+ path_put(ctx->vfsroot);
+
+ free_pages((unsigned long) ctx->tbuf, CR_ORDER_TBUF);
+ free_pages((unsigned long) ctx->hbuf, CR_ORDER_HBUF);
+
+ kfree(ctx);
+}
+
+struct cr_ctx *cr_ctx_alloc(pid_t pid, struct file *file, unsigned long flags)
+{
+ struct cr_ctx *ctx;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return NULL;
+
+ ctx->tbuf = (void *) __get_free_pages(GFP_KERNEL, CR_ORDER_TBUF);
+ ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_ORDER_HBUF);
+ if (!ctx->tbuf || !ctx->hbuf)
+ goto nomem;
+
+ ctx->pid = pid;
+ ctx->flags = flags;
+
+ ctx->file = file;
+ get_file(file);
+
+ /* assume checkpointer is in container's root vfs */
+ ctx->vfsroot = ¤t->fs->root;
+ path_get(ctx->vfsroot);
+
+ ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+ return ctx;
+
+ nomem:
+ cr_ctx_free(ctx);
+ return NULL;
+}
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which dump the checkpoint image
+ * @flags: checkpoint operation flags
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+ struct cr_ctx *ctx;
+ struct file *file;
+ int fput_needed;
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ file = fget_light(fd, &fput_needed);
+ if (!file)
+ return -EBADF;
+
+ /* no flags for now */
+ if (flags)
+ return -EINVAL;
+
+ ctx = cr_ctx_alloc(pid, file, flags | CR_CTX_CKPT);
+ if (!ctx) {
+ fput_light(file, fput_needed);
+ return -ENOMEM;
+ }
+
+ ret = do_checkpoint(ctx);
+
+ cr_ctx_free(ctx);
+ fput_light(file, fput_needed);
+ CR_PRINTK("ckpt retval = %d\n", ret);
+ return ret;
+}
+
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which read the checkpoint image
+ * @flags: restart operation flags
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+ struct cr_ctx *ctx;
+ struct file *file;
+ int fput_needed;
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ file = fget_light(fd, &fput_needed);
+ if (!file)
+ return -EBADF;
+
+ /* no flags for now */
+ if (flags)
+ return -EINVAL;
+
+ ctx = cr_ctx_alloc(crid, file, flags | CR_CTX_RSTR);
+ if (!ctx) {
+ fput_light(file, fput_needed);
+ return -ENOMEM;
+ }
+
+ ret = do_restart(ctx);
+
+ cr_ctx_free(ctx);
+ fput_light(file, fput_needed);
+ CR_PRINTK("restart retval = %d\n", ret);
+ return ret;
+}
diff -puN Makefile~handle_a_single_task_with_private_memory_maps Makefile
--- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps 2008-08-05 09:04:27.000000000 -0700
+++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.000000000 -0700
@@ -611,7 +611,7 @@ export mod_strip_cmd
ifeq ($(KBUILD_EXTMOD),)
-core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/
vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
_
The original version of Oren's patch contained a good hunk
of #ifdefs. I've extracted all of those and created a bit
of an API for new architectures to follow.
Leaving Oren's sign-off because this is all still his code,
even though he hasn't seen it mangled like this before.
Signed-off-by: Oren Laadan <[email protected]>
---
linux-2.6.git-dave/ckpt/Makefile | 1
linux-2.6.git-dave/ckpt/checkpoint.c | 7
linux-2.6.git-dave/ckpt/ckpt_arch.h | 6
linux-2.6.git-dave/ckpt/restart.c | 7
linux-2.6.git-dave/ckpt/x86.c | 269 ++++++++++++++++++++++++++++++
linux-2.6.git-dave/include/asm-x86/ckpt.h | 46 +++++
6 files changed, 336 insertions(+)
diff -puN ckpt/checkpoint.c~x86_part ckpt/checkpoint.c
--- linux-2.6.git/ckpt/checkpoint.c~x86_part 2008-08-04 13:29:59.000000000 -0700
+++ linux-2.6.git-dave/ckpt/checkpoint.c 2008-08-04 13:29:59.000000000 -0700
@@ -19,6 +19,7 @@
#include "ckpt.h"
#include "ckpt_hdr.h"
+#include "ckpt_arch.h"
/**
* cr_get_fname - return pathname of a given file
@@ -183,6 +184,12 @@ static int cr_write_task(struct cr_ctx *
ret = cr_write_task_struct(ctx, t);
CR_PRINTK("ret (task_struct) %d\n", ret);
+ if (!ret)
+ ret = cr_write_thread(ctx, t);
+ CR_PRINTK("ret (thread) %d\n", ret);
+ if (!ret)
+ ret = cr_write_cpu(ctx, t);
+ CR_PRINTK("ret (cpu) %d\n", ret);
return ret;
}
diff -puN /dev/null ckpt/ckpt_arch.h
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-04 13:29:59.000000000 -0700
@@ -0,0 +1,6 @@
+#include "ckpt.h"
+
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+int cr_read_thread(struct cr_ctx *ctx);
+int cr_read_cpu(struct cr_ctx *ctx);
diff -puN ckpt/Makefile~x86_part ckpt/Makefile
--- linux-2.6.git/ckpt/Makefile~x86_part 2008-08-04 13:29:59.000000000 -0700
+++ linux-2.6.git-dave/ckpt/Makefile 2008-08-04 13:29:59.000000000 -0700
@@ -1 +1,2 @@
obj-y += sys.o checkpoint.o restart.o
+obj-$(CONFIG_X86) += x86.o
diff -puN ckpt/restart.c~x86_part ckpt/restart.c
--- linux-2.6.git/ckpt/restart.c~x86_part 2008-08-04 13:29:59.000000000 -0700
+++ linux-2.6.git-dave/ckpt/restart.c 2008-08-04 13:29:59.000000000 -0700
@@ -21,6 +21,7 @@
#include "ckpt.h"
#include "ckpt_hdr.h"
+#include "ckpt_arch.h"
/**
* cr_hbuf_get - reserve space on the hbuf
@@ -171,6 +172,12 @@ static int cr_read_task(struct cr_ctx *c
ret = cr_read_task_struct(ctx);
CR_PRINTK("ret (task_struct) %d\n", ret);
+ if (!ret)
+ ret = cr_read_thread(ctx);
+ CR_PRINTK("ret (thread) %d\n", ret);
+ if (!ret)
+ ret = cr_read_cpu(ctx);
+ CR_PRINTK("ret (cpu) %d\n", ret);
return ret;
}
diff -puN /dev/null ckpt/x86.c
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/x86.c 2008-08-04 13:29:59.000000000 -0700
@@ -0,0 +1,269 @@
+#include <asm/ckpt.h>
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_thread *hh = ctx->tbuf;
+ struct thread_struct *thread;
+ struct desc_struct *desc;
+ int ntls = 0;
+ int n, ret;
+
+ h.type = CR_HDR_THREAD;
+ h.len = sizeof(*hh);
+ h.id = ctx->pid;
+
+ thread = &t->thread;
+
+ /* calculate no. of TLS entries that follow */
+ desc = thread->tls_array;
+ for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+ if (desc->a || desc->b)
+ ntls++;
+ }
+
+ hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+ hh->sizeof_tls_array = sizeof(thread->tls_array);
+ hh->ntls = ntls;
+
+ if ((ret = cr_write_obj(ctx, &h, hh)) < 0)
+ return ret;
+
+ /* for simplicity dump the entire array, cherry-pick upon restart */
+ ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+ CR_PRINTK("ntls %d\n", ntls);
+
+ /* IGNORE RESTART BLOCKS FOR NOW ... */
+
+ return ret;
+}
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_cpu *hh = ctx->tbuf;
+ struct thread_struct *thread;
+ struct thread_info *thread_info;
+ struct pt_regs *regs;
+
+ h.type = CR_HDR_CPU;
+ h.len = sizeof(*hh);
+ h.id = ctx->pid;
+
+ thread = &t->thread;
+ thread_info = task_thread_info(t);
+ regs = task_pt_regs(t);
+
+ hh->bx = regs->bx;
+ hh->cx = regs->cx;
+ hh->dx = regs->dx;
+ hh->si = regs->si;
+ hh->di = regs->di;
+ hh->bp = regs->bp;
+ hh->ax = regs->ax;
+ hh->ds = regs->ds;
+ hh->es = regs->es;
+ hh->orig_ax = regs->orig_ax;
+ hh->ip = regs->ip;
+ hh->cs = regs->cs;
+ hh->flags = regs->flags;
+ hh->sp = regs->sp;
+ hh->ss = regs->ss;
+
+ /* for checkpoint in process context (from within a container)
+ the GS and FS registers should be saved from the hardware;
+ otherwise they are already sabed on the thread structure */
+ if (t == current) {
+ savesegment(gs, hh->gs);
+ savesegment(fs, hh->fs);
+ } else {
+ hh->gs = thread->gs;
+ hh->fs = thread->fs;
+ }
+
+ /*
+ * for checkpoint in process context (from within a container),
+ * the actual syscall is taking place at this very moment; so
+ * we (optimistically) subtitute the future return value (0) of
+ * this syscall into the orig_eax, so that upon restart it will
+ * succeed (or it will endlessly retry checkpoint...)
+ */
+ if (t == current) {
+ BUG_ON(hh->orig_ax < 0);
+ hh->ax = 0;
+ }
+
+ preempt_disable();
+
+ /* i387 + MMU + SSE logic */
+ hh->used_math = tsk_used_math(t) ? 1 : 0;
+ if (hh->used_math) {
+ /* normally, no need to unlazy_fpu(), since TS_USEDFPU flag
+ * have been cleared when task was conexted-switched out...
+ * except if we are in process context, in which case we do */
+ if (thread_info->status & TS_USEDFPU)
+ unlazy_fpu(current);
+
+ hh->has_fxsr = cpu_has_fxsr;
+ memcpy(&hh->xstate, &thread->xstate, sizeof(thread->xstate));
+ }
+
+ /* debug regs */
+
+ /*
+ * for checkpoint in process context (from within a container),
+ * get the actual registers; otherwise get the saved values.
+ */
+ if (t == current) {
+ get_debugreg(hh->debugreg0, 0);
+ get_debugreg(hh->debugreg1, 1);
+ get_debugreg(hh->debugreg2, 2);
+ get_debugreg(hh->debugreg3, 3);
+ get_debugreg(hh->debugreg6, 6);
+ get_debugreg(hh->debugreg7, 7);
+ } else {
+ hh->debugreg0 = thread->debugreg0;
+ hh->debugreg1 = thread->debugreg1;
+ hh->debugreg2 = thread->debugreg2;
+ hh->debugreg3 = thread->debugreg3;
+ hh->debugreg6 = thread->debugreg6;
+ hh->debugreg7 = thread->debugreg7;
+ }
+
+ hh->uses_debug = !!(thread_info->flags & TIF_DEBUG);
+
+ preempt_enable();
+
+ CR_PRINTK("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+ return cr_write_obj(ctx, &h, hh);
+}
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+ struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ struct thread_struct *thread = &t->thread;
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+ if (ret < 0)
+ return ret;
+
+ CR_PRINTK("ntls %d\n", hh->ntls);
+
+ if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+ hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+ hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+ return -EINVAL;
+
+ if (hh->ntls > 0) {
+
+ /* restore TLS by hand: why convert to struct user_desc if
+ * sys_set_thread_entry() will convert it back ? */
+
+ struct desc_struct *buf = ctx->tbuf;
+ int size = sizeof(*buf) * GDT_ENTRY_TLS_ENTRIES;
+ int cpu;
+
+ BUG_ON(size > CR_TBUF_TOTAL);
+
+ ret = cr_kread(ctx, buf, size);
+ if (ret < 0)
+ return ret;
+
+ /* FIX: add sanity checks (eg. that values makes sense, that
+ * that we don't overwrite old values, etc */
+
+ cpu = get_cpu();
+ memcpy(thread->tls_array, buf, size);
+ load_TLS(thread, cpu);
+ put_cpu();
+ }
+
+ return 0;
+}
+
+/* read the cpu state nad registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+ struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ struct thread_struct *thread;
+ struct thread_info *thread_info;
+ struct pt_regs *regs;
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+ if (ret < 0)
+ return ret;
+
+ /* FIX: sanity check for sensitive registers (eg. eflags) */
+
+ thread = &t->thread;
+ thread_info = task_thread_info(t);
+ regs = task_pt_regs(t);
+
+ regs->bx = hh->bx;
+ regs->cx = hh->cx;
+ regs->dx = hh->dx;
+ regs->si = hh->si;
+ regs->di = hh->di;
+ regs->bp = hh->bp;
+ regs->ax = hh->ax;
+ regs->ds = hh->ds;
+ regs->es = hh->es;
+ regs->orig_ax = hh->orig_ax;
+ regs->ip = hh->ip;
+ regs->cs = hh->cs;
+ regs->flags = hh->flags;
+ regs->sp = hh->sp;
+ regs->ss = hh->ss;
+
+ thread->gs = hh->gs;
+ thread->fs = hh->fs;
+ loadsegment(gs, hh->gs);
+ loadsegment(fs, hh->fs);
+
+ CR_PRINTK("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+ /* FIX: this should work ... (someone double check !) */
+
+ preempt_disable();
+
+ /* i387 + MMU + SSE */
+ __clear_fpu(t); /* in case we used FPU in user mode */
+ if (!hh->used_math)
+ clear_used_math();
+ else {
+ if (hh->has_fxsr != cpu_has_fxsr) {
+ force_sig(SIGFPE, t);
+ return -EINVAL;
+ }
+ memcpy(&thread->xstate, &hh->xstate, sizeof(thread->xstate));
+ set_used_math();
+ }
+
+ /* debug regs */
+ if (hh->uses_debug) {
+ set_debugreg(hh->debugreg0, 0);
+ set_debugreg(hh->debugreg1, 1);
+ set_debugreg(hh->debugreg2, 2);
+ set_debugreg(hh->debugreg3, 3);
+ set_debugreg(hh->debugreg6, 6);
+ set_debugreg(hh->debugreg7, 7);
+ }
+
+ preempt_enable();
+
+ return 0;
+}
diff -puN /dev/null include/asm-x86/ckpt.h
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/include/asm-x86/ckpt.h 2008-08-04 13:29:59.000000000 -0700
@@ -0,0 +1,46 @@
+#ifndef __ASM_X86_CKPT_H
+#define __ASM_X86_CKPT_H
+
+#include <asm/processor.h>
+
+struct cr_hdr_thread {
+ /* NEED: restart blocks */
+ __s16 gdt_entry_tls_entries;
+ __s16 sizeof_tls_array;
+ __s16 ntls; /* number of TLS entries to follow */
+};
+
+struct cr_hdr_cpu {
+ __u64 bx;
+ __u64 cx;
+ __u64 dx;
+ __u64 si;
+ __u64 di;
+ __u64 bp;
+ __u64 ax;
+ __u64 ds;
+ __u64 es;
+ __u64 orig_ax;
+ __u64 ip;
+ __u64 cs;
+ __u64 flags;
+ __u64 sp;
+ __u64 ss;
+ __u64 fs;
+ __u64 gs;
+
+ __u64 debugreg0;
+ __u64 debugreg1;
+ __u64 debugreg2;
+ __u64 debugreg3;
+ __u64 debugreg6;
+ __u64 debugreg7;
+
+ __u8 uses_debug;
+
+ __u8 used_math;
+ __u8 has_fxsr;
+ union thread_xstate xstate; /* i387 */
+};
+
+#endif /* __ASM_X86_CKPT_H */
_
For each vma, there is a 'struct cr_vma'; if the vma is file-mapped,
it will be followed by the file name. The cr_vma->npages will tell
how many pages were dumped for this vma. Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next vma and so on.
I guess I could also separate out the x86-specific bits here, but
they're pretty small, comparatively.
Signed-off-by: Oren Laadan <[email protected]>
---
linux-2.6.git-dave/arch/x86/kernel/ldt.c | 2
linux-2.6.git-dave/ckpt/Makefile | 2
linux-2.6.git-dave/ckpt/ckpt_arch.h | 2
linux-2.6.git-dave/ckpt/ckpt_hdr.h | 21 +
linux-2.6.git-dave/ckpt/ckpt_mem.c | 388 ++++++++++++++++++++++++++++++
linux-2.6.git-dave/ckpt/ckpt_mem.h | 32 ++
linux-2.6.git-dave/ckpt/rstr_mem.c | 354 +++++++++++++++++++++++++++
linux-2.6.git-dave/ckpt/sys.c | 3
linux-2.6.git-dave/ckpt/x86.c | 83 ++++++
linux-2.6.git-dave/include/asm-x86/ckpt.h | 5
linux-2.6.git-dave/include/asm-x86/desc.h | 3
11 files changed, 892 insertions(+), 3 deletions(-)
diff -puN arch/x86/kernel/ldt.c~memory_part arch/x86/kernel/ldt.c
--- linux-2.6.git/arch/x86/kernel/ldt.c~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kernel/ldt.c 2008-08-05 08:38:00.000000000 -0700
@@ -183,7 +183,7 @@ static int read_default_ldt(void __user
return bytecount;
}
-static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
+int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
{
struct mm_struct *mm = current->mm;
struct desc_struct ldt;
diff -puN ckpt/ckpt_arch.h~memory_part ckpt/ckpt_arch.h
--- linux-2.6.git/ckpt/ckpt_arch.h~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-05 08:37:29.000000000 -0700
@@ -4,3 +4,5 @@ int cr_write_thread(struct cr_ctx *ctx,
int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
int cr_read_thread(struct cr_ctx *ctx);
int cr_read_cpu(struct cr_ctx *ctx);
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm);
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm);
diff -puN ckpt/ckpt_hdr.h~memory_part ckpt/ckpt_hdr.h
--- linux-2.6.git/ckpt/ckpt_hdr.h~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-05 08:37:29.000000000 -0700
@@ -67,3 +67,24 @@ struct cr_hdr_task {
};
+
+struct cr_hdr_mm {
+ __u32 tag; /* sharing identifier */
+ __u64 start_code, end_code, start_data, end_data;
+ __u64 start_brk, brk, start_stack;
+ __u64 arg_start, arg_end, env_start, env_end;
+ __s16 map_count;
+};
+
+struct cr_hdr_vma {
+ __u32 how;
+
+ __u64 vm_start;
+ __u64 vm_end;
+ __u64 vm_page_prot;
+ __u64 vm_flags;
+ __u64 vm_pgoff;
+
+ __s16 npages;
+ __s16 namelen;
+};
diff -puN /dev/null ckpt/ckpt_mem.c
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_mem.c 2008-08-05 08:37:29.000000000 -0700
@@ -0,0 +1,388 @@
+/*
+ * Checkpoint memory contents
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+#include "ckpt_arch.h"
+#include "ckpt_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr'
+ * (common to ckpt_mem.c and rstr_mem.c)
+ */
+
+#define CR_ORDER_PGARR 0
+#define CR_PGARR_TOTAL ((PAGE_SIZE << CR_ORDER_PGARR) / sizeof(void *))
+
+/* release pages referenced by a page-array */
+void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr)
+{
+ int n;
+
+ /* only checkpoint keeps references to pages */
+ if (ctx->flags & CR_CTX_CKPT) {
+ CR_PRINTK("release pages (nused %d)\n", pgarr->nused);
+ for (n = pgarr->nused; n--; )
+ page_cache_release(pgarr->pages[n]);
+ }
+ pgarr->nused = 0;
+ pgarr->nleft = CR_PGARR_TOTAL;
+}
+
+/* release pages referenced by chain of page-arrays */
+void cr_pgarr_release(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr;
+
+ for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next)
+ _cr_pgarr_release(ctx, pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr, *pgnxt;
+
+ for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) {
+ _cr_pgarr_release(ctx, pgarr);
+ free_pages((unsigned long) ctx->pgarr->addrs, CR_ORDER_PGARR);
+ free_pages((unsigned long) ctx->pgarr->pages, CR_ORDER_PGARR);
+ pgnxt = pgarr->next;
+ kfree(pgarr);
+ }
+}
+
+/* allocate and add a new page-array to chain */
+struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew)
+{
+ struct cr_pgarr *pgarr = ctx->pgcur;
+
+ if (pgarr && pgarr->next) {
+ ctx->pgcur = pgarr->next;
+ return pgarr->next;
+ }
+
+ if ((pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL))) {
+ pgarr->nused = 0;
+ pgarr->nleft = CR_PGARR_TOTAL;
+ pgarr->addrs = (unsigned long *)
+ __get_free_pages(GFP_KERNEL, CR_ORDER_PGARR);
+ pgarr->pages = (struct page **)
+ __get_free_pages(GFP_KERNEL, CR_ORDER_PGARR);
+ if (likely(pgarr->addrs && pgarr->pages)) {
+ *pgnew = pgarr;
+ ctx->pgcur = pgarr;
+ return pgarr;
+ } else if (pgarr->addrs)
+ free_pages((unsigned long) pgarr->addrs,
+ CR_ORDER_PGARR);
+ kfree(pgarr);
+ }
+
+ return NULL;
+}
+
+/* return current page-array (and allocate if needed) */
+struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr = ctx->pgcur;
+
+ if (unlikely(!pgarr->nleft))
+ pgarr = cr_pgarr_alloc(ctx, &pgarr->next);
+ return pgarr;
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+/**
+ * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ */
+static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+ struct vm_area_struct *vma, unsigned long *start)
+{
+ unsigned long end = vma->vm_end;
+ unsigned long addr = *start;
+ struct page **pagep;
+ unsigned long *addrp;
+ int cow, nr, ret = 0;
+
+ nr = pgarr->nleft;
+ pagep = &pgarr->pages[pgarr->nused];
+ addrp = &pgarr->addrs[pgarr->nused];
+ cow = !!vma->vm_file;
+
+ while (addr < end) {
+ struct page *page;
+
+ /* simplified version of get_user_pages(): already have vma,
+ * only need FOLL_TOUCH, and (for now) ignore fault stats */
+
+ cond_resched();
+ while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
+ ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+ if (ret & VM_FAULT_ERROR) {
+ if (ret & VM_FAULT_OOM)
+ ret = -ENOMEM;
+ else if (ret & VM_FAULT_SIGBUS)
+ ret = -EFAULT;
+ else
+ BUG();
+ break;
+ }
+ cond_resched();
+ }
+
+ if (IS_ERR(page)) {
+ ret = PTR_ERR(page);
+ break;
+ }
+
+ if (page == ZERO_PAGE(0))
+ page = NULL; /* zero page: ignore */
+ else if (cow && page_mapping(page) != NULL)
+ page = NULL; /* clean cow: ignore */
+ else {
+ get_page(page);
+ *(addrp++) = addr;
+ *(pagep++) = page;
+ if (--nr == 0) {
+ addr += PAGE_SIZE;
+ break;
+ }
+ }
+
+ addr += PAGE_SIZE;
+ }
+
+ if (unlikely(ret < 0)) {
+ nr = pgarr->nleft - nr;
+ while (nr--)
+ page_cache_release(*(--pagep));
+ return ret;
+ }
+
+ *start = addr;
+ return (pgarr->nleft - nr);
+}
+
+/**
+ * cr_vma_scan_pages - scan vma for pages that will need to be dumped
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * a list of addr/page tuples is kept in ctx->pgarr page-array chain
+ */
+static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+ unsigned long addr = vma->vm_start;
+ unsigned long end = vma->vm_end;
+ struct cr_pgarr *pgarr;
+ int nr, total = 0;
+
+ while (addr < end) {
+ if (!(pgarr = cr_pgarr_prep(ctx)))
+ return -ENOMEM;
+ if ((nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr)) < 0)
+ return nr;
+ pgarr->nleft -= nr;
+ pgarr->nused += nr;
+ total += nr;
+ }
+
+ CR_PRINTK("total %d\n", total);
+ return total;
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+ struct cr_pgarr *pgarr;
+ int ret;
+
+ if (!total)
+ return 0;
+
+ for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
+ ret = cr_kwrite(ctx, pgarr->addrs,
+ pgarr->nused * sizeof(*pgarr->addrs));
+ if (ret < 0)
+ return ret;
+ }
+
+ for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
+ struct page **pages = pgarr->pages;
+ int nr = pgarr->nused;
+ void *ptr;
+
+ while (nr--) {
+ ptr = kmap(*pages);
+ ret = cr_kwrite(ctx, ptr, PAGE_SIZE);
+ kunmap(*pages);
+ if (ret < 0)
+ return ret;
+ pages++;
+ }
+ }
+
+ return total;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+ struct cr_hdr h;
+ struct cr_hdr_vma *hh = ctx->tbuf;
+ char *fname = NULL;
+ int how, nr, ret;
+
+ h.type = CR_HDR_VMA;
+ h.len = sizeof(*hh);
+ h.id = ctx->pid;
+
+ hh->vm_start = vma->vm_start;
+ hh->vm_end = vma->vm_end;
+ hh->vm_page_prot = vma->vm_page_prot.pgprot;
+ hh->vm_flags = vma->vm_flags;
+ hh->vm_pgoff = vma->vm_pgoff;
+
+ if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
+ printk(KERN_WARNING "CR: unknown VMA %#lx\n", vma->vm_flags);
+ return -ETXTBSY;
+ }
+
+ /* by default assume anon memory */
+ how = CR_VMA_ANON;
+
+ /* if there is a backing file, assume private-mapped */
+ /* (NEED: check if the file is unlinked) */
+ if (vma->vm_file) {
+ nr = PAGE_SIZE;
+ fname = cr_get_fname(&vma->vm_file->f_path,
+ ctx->vfsroot, ctx->tbuf, &nr);
+ if (IS_ERR(fname))
+ return PTR_ERR(fname);
+ hh->namelen = nr;
+ how = CR_VMA_FILE;
+ } else
+ hh->namelen = 0;
+
+ hh->how = how;
+
+ /*
+ * it seems redundant now, but we do it in 3 steps for because:
+ * first, the logic is simpler when we how many pages before
+ * dumping them; second, a future optimization will defer the
+ * writeout (dump, and free) to a later step; in which case all
+ * the pages to be dumped will be aggregated on the checkpoint ctx
+ */
+
+ /* (1) scan: scan through the PTEs of the vma, both to count the
+ * pages to dump, and make those pages COW. keep the list of pages
+ * (and a reference to each page) on the checkpoint ctx */
+ nr = cr_vma_scan_pages(ctx, vma);
+ if (nr < 0) {
+ cr_put_fname(ctx->tbuf, fname, PAGE_SIZE);
+ return nr;
+ }
+
+ hh->npages = nr;
+ ret = cr_write_obj(ctx, &h, hh);
+
+ if (!ret && hh->namelen)
+ ret = cr_write_str(ctx, fname, hh->namelen);
+
+ cr_put_fname(ctx->tbuf, fname, PAGE_SIZE);
+
+ if (ret < 0)
+ return ret;
+
+ /* (2) dump: write out the addresses of all pages in the list (on
+ * the checkpoint ctx) followed by the contents of all pages */
+ ret = cr_vma_dump_pages(ctx, nr);
+
+ /* (3) free: free the extra references to the pages in the list */
+ cr_pgarr_release(ctx);
+
+ return ret;
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_mm *hh = ctx->tbuf;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int ret;
+
+ h.type = CR_HDR_MM;
+ h.len = sizeof(*hh);
+ h.id = ctx->pid;
+
+ mm = get_task_mm(t);
+
+ hh->tag = 1; /* non-zero will mean first time encounter */
+
+ hh->start_code = mm->start_code;
+ hh->end_code = mm->end_code;
+ hh->start_data = mm->start_data;
+ hh->end_data = mm->end_data;
+ hh->start_brk = mm->start_brk;
+ hh->brk = mm->brk;
+ hh->start_stack = mm->start_stack;
+ hh->arg_start = mm->arg_start;
+ hh->arg_end = mm->arg_end;
+ hh->env_start = mm->env_start;
+ hh->env_end = mm->env_end;
+
+ hh->map_count = mm->map_count;
+
+ /* FIX: need also mm->flags */
+
+ ret = cr_write_obj(ctx, &h, hh);
+ if (ret < 0)
+ goto out;
+
+ /* write the vma's */
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if ((ret = cr_write_vma(ctx, vma)) < 0)
+ break;
+ }
+ up_read(&mm->mmap_sem);
+
+ if (ret < 0)
+ goto out;
+
+ ret = cr_write_mm_context(ctx, mm);
+
+ out:
+ mmput(mm);
+ return ret;
+}
diff -puN /dev/null ckpt/ckpt_mem.h
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_mem.h 2008-08-05 08:37:29.000000000 -0700
@@ -0,0 +1,32 @@
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/* page-array chains: each pgarr hols a list of <addr,page> tuples */
+struct cr_pgarr {
+ unsigned long *addrs;
+ struct page **pages;
+ struct cr_pgarr *next;
+ unsigned short nleft;
+ unsigned short nused;
+};
+
+/* vma subtypes */
+enum {
+ CR_VMA_ANON = 1,
+ CR_VMA_FILE
+};
+
+extern void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr);
+extern void cr_pgarr_release(struct cr_ctx *ctx);
+extern void cr_pgarr_free(struct cr_ctx *ctx);
+extern struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew);
+extern struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx);
diff -puN ckpt/Makefile~memory_part ckpt/Makefile
--- linux-2.6.git/ckpt/Makefile~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/ckpt/Makefile 2008-08-05 08:37:29.000000000 -0700
@@ -1,2 +1,2 @@
-obj-y += sys.o checkpoint.o restart.o
+obj-y += sys.o checkpoint.o restart.o ckpt_mem.o rstr_mem.o
obj-$(CONFIG_X86) += x86.o
diff -puN /dev/null ckpt/rstr_mem.c
--- /dev/null 2007-04-11 11:48:27.000000000 -0700
+++ linux-2.6.git-dave/ckpt/rstr_mem.c 2008-08-05 08:37:29.000000000 -0700
@@ -0,0 +1,354 @@
+/*
+ * Restart memory contents
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <asm/unistd.h>
+
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <asm/cacheflush.h>
+
+#include "ckpt.h"
+#include "ckpt_arch.h"
+#include "ckpt_hdr.h"
+#include "ckpt_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read in directly to the address space of the current process
+ */
+
+/**
+ * cr_vma_read_pages_addr - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @npages - number of pages
+ */
+static int cr_vma_read_pages_addr(struct cr_ctx *ctx, int npages)
+{
+ struct cr_pgarr *pgarr;
+ int nr, ret;
+
+ while (npages) {
+ if (!(pgarr = cr_pgarr_prep(ctx)))
+ return -ENOMEM;
+ nr = min(npages, (int) pgarr->nleft);
+ ret = cr_kread(ctx, pgarr->addrs, nr * sizeof(unsigned long));
+ if (ret < 0)
+ return ret;
+ pgarr->nleft -= nr;
+ pgarr->nused += nr;
+ npages -= nr;
+ }
+ return 0;
+}
+
+/**
+ * cr_vma_read_pages_data - read in data of pages in page-array chain
+ * @ctx - restart context
+ * @npages - number of pages
+ */
+static int cr_vma_read_pages_data(struct cr_ctx *ctx, int npages)
+{
+ struct cr_pgarr *pgarr;
+ unsigned long *addrs;
+ int nr, ret;
+
+ for (pgarr = ctx->pgarr; npages; pgarr = pgarr->next) {
+ addrs = pgarr->addrs;
+ nr = pgarr->nused;
+ npages -= nr;
+ while (nr--) {
+ ret = cr_uread(ctx, (void *) *(addrs++), PAGE_SIZE);
+ if (ret < 0)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+/* change the protection of an address range to be writable/non-writable.
+ * this is useful when restoring the memory of a read-only vma */
+static int cr_vma_writable(struct mm_struct *mm, unsigned long start,
+ unsigned long end, int writable)
+{
+ struct vm_area_struct *vma, *prev;
+ unsigned long flags = 0;
+ int ret = -EINVAL;
+
+ CR_PRINTK("vma %#lx-%#lx writable %d\n", start, end, writable);
+
+ down_write(&mm->mmap_sem);
+ vma = find_vma_prev(mm, start, &prev);
+ if (unlikely(!vma || vma->vm_start > end || vma->vm_end < start))
+ goto out;
+ if (writable && !(vma->vm_flags & VM_WRITE))
+ flags = vma->vm_flags | VM_WRITE;
+ else if (!writable && (vma->vm_flags & VM_WRITE))
+ flags = vma->vm_flags & ~VM_WRITE;
+ CR_PRINTK("flags %#lx\n", flags);
+ if (flags)
+ ret = mprotect_fixup(vma, &prev, vma->vm_start,
+ vma->vm_end, flags);
+ out:
+ up_write(&mm->mmap_sem);
+ return ret;
+}
+
+/**
+ * cr_vma_read_pages - read in pages for to restore a vma
+ * @ctx - restart context
+ * @cr_vma - vma descriptor from restart
+ */
+static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *cr_vma)
+{
+ struct mm_struct *mm = current->mm;
+ int ret = 0;
+
+ if (!cr_vma->npages)
+ return 0;
+
+ /* in the unlikely case that this vma is read-only */
+ if (!(cr_vma->vm_flags & VM_WRITE))
+ ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 1);
+
+ if (!ret)
+ ret = cr_vma_read_pages_addr(ctx, cr_vma->npages);
+ if (!ret)
+ ret = cr_vma_read_pages_data(ctx, cr_vma->npages);
+ if (ret < 0)
+ return ret;
+
+ cr_pgarr_release(ctx); /* reset page-array chain */
+
+ /* restore original protection for this vma */
+ if (!(cr_vma->vm_flags & VM_WRITE))
+ ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
+
+ return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+ unsigned long vm_prot = 0;
+
+ if (orig_vm_flags & VM_READ)
+ vm_prot |= PROT_READ;
+ if (orig_vm_flags & VM_WRITE)
+ vm_prot |= PROT_WRITE;
+ if (orig_vm_flags & VM_EXEC)
+ vm_prot |= PROT_EXEC;
+ if (orig_vm_flags & PROT_SEM) /* only (?) with IPC-SHM */
+ vm_prot |= PROT_SEM;
+
+ return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+ unsigned long vm_flags = 0;
+
+ vm_flags = MAP_FIXED;
+ if (orig_vm_flags & VM_GROWSDOWN)
+ vm_flags |= MAP_GROWSDOWN;
+ if (orig_vm_flags & VM_DENYWRITE)
+ vm_flags |= MAP_DENYWRITE;
+ if (orig_vm_flags & VM_EXECUTABLE)
+ vm_flags |= MAP_EXECUTABLE;
+ if (orig_vm_flags & VM_MAYSHARE)
+ vm_flags |= MAP_SHARED;
+ else
+ vm_flags |= MAP_PRIVATE;
+
+ return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+ struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ unsigned long vm_size, vm_flags, vm_prot, vm_pgoff;
+ unsigned long addr;
+ unsigned long flags;
+ struct file *file = NULL;
+ char *fname = NULL;
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+ if (ret < 0)
+ return ret;
+
+ CR_PRINTK("vma %#lx-%#lx npages %d namelen %d\n",
+ (unsigned long) hh->vm_start, (unsigned long) hh->vm_end,
+ (int) hh->npages, (int) hh->namelen);
+
+ if (hh->vm_end < hh->vm_start)
+ return -EINVAL;
+ if (hh->npages < 0 || hh->namelen < 0)
+ return -EINVAL;
+
+ vm_size = hh->vm_end - hh->vm_start;
+ vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+ vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+ vm_pgoff = hh->vm_pgoff;
+
+ if (hh->namelen) {
+ fname = ctx->tbuf;
+ ret = cr_read_str(ctx, fname, PAGE_SIZE);
+ if (ret < 0)
+ return ret;
+ }
+
+ CR_PRINTK("vma fname '%s' how %d\n", fname, hh->how);
+
+ switch (hh->how) {
+
+ case CR_VMA_ANON: /* anonymous private mapping */
+ if (hh->namelen)
+ return -EINVAL;
+ /* vm_pgoff for anonymous mapping is the "global" page
+ offset (namely from addr 0x0), so we force a zero */
+ vm_pgoff = 0;
+ break;
+
+ case CR_VMA_FILE: /* private mapping from a file */
+ if (!hh->namelen)
+ return -EINVAL;
+ /* O_RDWR only needed if both (VM_WRITE|VM_SHARED) are set */
+ flags = hh->vm_flags & (VM_WRITE | VM_SHARED);
+ flags = (flags == (VM_WRITE | VM_SHARED) ? O_RDWR : O_RDONLY);
+ file = filp_open(fname, flags, 0);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ break;
+
+ default:
+ return -EINVAL;
+
+ }
+
+ addr = do_mmap_pgoff(file, (unsigned long) hh->vm_start,
+ vm_size, vm_prot, vm_flags, vm_pgoff);
+ CR_PRINTK("vma size %#lx prot %#lx flags %#lx pgoff %#lx => %#lx\n",
+ vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+ /* the file (if opened) is now referenced by the vma */
+ if (file)
+ filp_close(file, NULL);
+
+ if (IS_ERR((void*) addr))
+ return (PTR_ERR((void *) addr));
+
+ /*
+ * CR_VMA_ANON: read in memory as is
+ * CR_VMA_FILE: read in memory as is
+ * (more to follow ...)
+ */
+
+ switch (hh->how) {
+ case CR_VMA_ANON:
+ case CR_VMA_FILE:
+ /* standard case: read the data into the memory */
+ ret = cr_vma_read_pages(ctx, hh);
+ break;
+ }
+
+ if (ret < 0)
+ return ret;
+
+ if (vm_prot & PROT_EXEC)
+ flush_icache_range(hh->vm_start, hh->vm_end);
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ CR_PRINTK("vma retval %d\n", ret);
+ return 0;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+ struct vm_area_struct *vmnext = mm->mmap;
+ struct vm_area_struct *vma;
+ int ret;
+
+ while (vmnext) {
+ vma = vmnext;
+ vmnext = vmnext->vm_next;
+ ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+ struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct mm_struct *mm;
+ int nr, ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+ if (ret < 0)
+ return ret;
+
+ CR_PRINTK("map_count %d\n", hh->map_count);
+
+ /* XXX need more sanity checks */
+ if (hh->start_code > hh->end_code ||
+ hh->start_data > hh->end_data || hh->map_count < 0)
+ return -EINVAL;
+
+ mm = current->mm;
+
+ /* point of no return -- destruct current mm */
+ down_write(&mm->mmap_sem);
+ ret = cr_destroy_mm(mm);
+ up_write(&mm->mmap_sem);
+
+ if (ret < 0)
+ return ret;
+
+ mm->start_code = hh->start_code;
+ mm->end_code = hh->end_code;
+ mm->start_data = hh->start_data;
+ mm->end_data = hh->end_data;
+ mm->start_brk = hh->start_brk;
+ mm->brk = hh->brk;
+ mm->start_stack = hh->start_stack;
+ mm->arg_start = hh->arg_start;
+ mm->arg_end = hh->arg_end;
+ mm->env_start = hh->env_start;
+ mm->env_end = hh->env_end;
+
+ /* FIX: need also mm->flags */
+
+ for (nr = hh->map_count; nr; nr--) {
+ ret = cr_read_vma(ctx, mm);
+ if (ret < 0)
+ return ret;
+ }
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+
+ return cr_read_mm_context(ctx, mm);
+}
diff -puN ckpt/sys.c~memory_part ckpt/sys.c
--- linux-2.6.git/ckpt/sys.c~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/ckpt/sys.c 2008-08-05 08:37:29.000000000 -0700
@@ -15,6 +15,7 @@
#include <linux/capability.h>
#include "ckpt.h"
+#include "ckpt_mem.h"
/*
* helpers to write/read to/from the image file descriptor
@@ -118,6 +119,8 @@ void cr_ctx_free(struct cr_ctx *ctx)
if (ctx->vfsroot)
path_put(ctx->vfsroot);
+ cr_pgarr_free(ctx);
+
free_pages((unsigned long) ctx->tbuf, CR_ORDER_TBUF);
free_pages((unsigned long) ctx->hbuf, CR_ORDER_HBUF);
diff -puN ckpt/x86.c~memory_part ckpt/x86.c
--- linux-2.6.git/ckpt/x86.c~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/ckpt/x86.c 2008-08-05 08:37:29.000000000 -0700
@@ -1,5 +1,6 @@
#include <asm/ckpt.h>
#include <asm/desc.h>
+#include <asm/ldt.h>
#include <asm/i387.h>
#include "ckpt.h"
@@ -267,3 +268,85 @@ int cr_read_cpu(struct cr_ctx *ctx)
return 0;
}
+
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+ struct cr_hdr h;
+ struct cr_hdr_mm_context *hh = ctx->tbuf;
+ int ret;
+
+ h.type = CR_HDR_MM_CONTEXT;
+ h.len = sizeof(*hh);
+ h.id = ctx->pid;
+
+ mutex_lock(&mm->context.lock);
+
+ hh->ldt_entry_size = LDT_ENTRY_SIZE;
+ hh->nldt = mm->context.size;
+
+ CR_PRINTK("nldt %d\n", hh->nldt);
+
+ ret = cr_write_obj(ctx, &h, hh);
+ if (ret < 0)
+ return ret;
+
+ ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
+
+ mutex_unlock(&mm->context.lock);
+
+ return ret;
+}
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+ struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int n, ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+ if (ret < 0)
+ return ret;
+
+ CR_PRINTK("nldt %d\n", hh->nldt);
+
+ if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+ return -EINVAL;
+
+ /* to utilize the syscall modify_ldt() we first convert the data
+ * in the checkpoint image from 'struct desc_struct' to 'struct
+ * user_desc' with reverse logic of inclue/asm/desc.h:fill_ldt() */
+
+ for (n = 0; n < hh->nldt; n++) {
+ struct user_desc info;
+ struct desc_struct desc;
+ mm_segment_t old_fs;
+
+ ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+ if (ret < 0)
+ return ret;
+
+ info.entry_number = n;
+ info.base_addr = desc.base0 | (desc.base1 << 16);
+ info.limit = desc.limit0;
+ info.seg_32bit = desc.d;
+ info.contents = desc.type >> 2;
+ info.read_exec_only = (desc.type >> 1) ^ 1;
+ info.limit_in_pages = desc.g;
+ info.seg_not_present = desc.p ^ 1;
+ info.useable = desc.avl;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ /* ret = sys_modify_ldt(1, &info, sizeof(info)); */
+ /* modified by daveh */
+ ret = write_ldt(&info, sizeof(info), 1);
+ set_fs(old_fs);
+
+ if (ret < 0)
+ return ret;
+ }
+
+ load_LDT(&mm->context);
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return 0;
+}
diff -puN include/asm-x86/ckpt.h~memory_part include/asm-x86/ckpt.h
--- linux-2.6.git/include/asm-x86/ckpt.h~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/include/asm-x86/ckpt.h 2008-08-05 08:37:29.000000000 -0700
@@ -43,4 +43,9 @@ struct cr_hdr_cpu {
union thread_xstate xstate; /* i387 */
};
+struct cr_hdr_mm_context {
+ __s16 ldt_entry_size;
+ __s16 nldt;
+};
+
#endif /* __ASM_X86_CKPT_H */
diff -puN include/asm-x86/desc.h~memory_part include/asm-x86/desc.h
--- linux-2.6.git/include/asm-x86/desc.h~memory_part 2008-08-05 08:37:29.000000000 -0700
+++ linux-2.6.git-dave/include/asm-x86/desc.h 2008-08-05 08:40:11.000000000 -0700
@@ -111,6 +111,8 @@ static inline void native_write_ldt_entr
memcpy(&ldt[entry], desc, 8);
}
+int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode);
+
static inline void native_write_gdt_entry(struct desc_struct *gdt, int entry,
const void *desc, int type)
{
@@ -394,7 +396,6 @@ static inline void set_system_gate_ist(i
shll $16, base; \
movw idx * 8 + 2(gdt), lo_w;
-
#endif /* __ASSEMBLY__ */
#endif
_
From: Oren Laadan <[email protected]>
Create trivial sys_checkpoint and sys_restore system calls. They will
enable to checkpoint and restart an entire container, to and from a
checkpoint image file.
First create a template for both syscalls: they take a file descriptor
(for the image file) and flags as arguments. For sys_checkpoint the
first argument identifies the target container; for sys_restart it will
identify the checkpoint image.
Signed-off-by: Oren Laadan <[email protected]>
---
linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S | 2 ++
linux-2.6.git-dave/include/asm-x86/unistd_32.h | 2 ++
2 files changed, 4 insertions(+)
diff -puN arch/x86/kernel/syscall_table_32.S~introduce_sys_checkpoint_and_sys_restore arch/x86/kernel/syscall_table_32.S
--- linux-2.6.git/arch/x86/kernel/syscall_table_32.S~introduce_sys_checkpoint_and_sys_restore 2008-08-07 15:38:04.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S 2008-08-07 15:38:04.000000000 -0700
@@ -326,3 +326,5 @@ ENTRY(sys_call_table)
.long sys_fallocate
.long sys_timerfd_settime /* 325 */
.long sys_timerfd_gettime
+ .long sys_checkpoint
+ .long sys_restart
diff -puN include/asm-x86/unistd_32.h~introduce_sys_checkpoint_and_sys_restore include/asm-x86/unistd_32.h
--- linux-2.6.git/include/asm-x86/unistd_32.h~introduce_sys_checkpoint_and_sys_restore 2008-08-07 15:38:04.000000000 -0700
+++ linux-2.6.git-dave/include/asm-x86/unistd_32.h 2008-08-07 15:38:04.000000000 -0700
@@ -332,6 +332,8 @@
#define __NR_fallocate 324
#define __NR_timerfd_settime 325
#define __NR_timerfd_gettime 326
+#define __NR_checkpoint 327
+#define __NR_restart 328
#ifdef __KERNEL__
diff -puN Makefile~introduce_sys_checkpoint_and_sys_restore Makefile
_
On Friday 08 August 2008, Dave Hansen wrote:
> These patches are from Oren Laaden. I've refactored them
> a bit to make them a wee bit more reviewable. I think this
> separates out the per-arch bits pretty well. It should also
> be at least build-bisetable.
Cool stuff
> ============================== ckpt.c ================================
>
> #define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <asm/unistd_32.h>
> #include <sys/syscall.h>
Note that asm/unistd_32.h is not portable, you should use asm/unistd.h
in the example.
> pid_t pid = getpid();
> int ret;
>
> ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
Interface-wise, I would consider checkpointing yourself signficantly
different from checkpointing some other thread. If checkpointing
yourself is the common case, it probably makes sense to allow passing
of pid=0 for this.
Arnd <><
On Friday 08 August 2008, Dave Hansen wrote:
> +/* write the checkpoint header */
> +static int cr_write_hdr(struct cr_ctx *ctx)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_head *hh = ctx->tbuf;
> + struct timeval ktv;
> +
> + h.type = CR_HDR_HEAD;
> + h.len = sizeof(*hh);
> + h.id = 0;
> +
> + do_gettimeofday(&ktv);
> +
> + hh->magic = 0x00a2d200;
> + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
> + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
> + hh->patch = (LINUX_VERSION_CODE) & 0xff;
> +
> + hh->version = 1;
> +
> + hh->flags = ctx->flags;
> + hh->time = ktv.tv_sec;
> +
> + return cr_write_obj(ctx, &h, hh);
> +}
Do you rely on the kernel version in order to determine the format
of the binary data, or is it just informational?
If you think the format can change in incompatible ways, you
probably need something more specific than the version number
these days, because there are just so many different trees with
the same numbers.
> +
> +/* debugging */
> +#if 0
> +#define CR_PRINTK(str, args...) \
> + printk(KERN_ERR "cr@%s#%d: " str, __func__, __LINE__, ##args)
> +#else
> +#define CR_PRINTK(...) do {} while (0)
> +#endif
> +
Please use the existing pr_debug and dev_debug here, instead of creating
yet another version.
> +struct cr_hdr_tail {
> + __u32 magic;
> + __u32 cksum[2];
> +};
This structure has an odd multiple of 32-bit members, which means
that if you put it into a larger structure that also contains
64-bit members, the larger structure may get different alignment
on x86-32 and x86-64, which you might want to avoid.
I can't tell if this is an actual problem here.
> +
> +struct cr_hdr_task {
> + __u64 state;
> + __u32 exit_state;
> + __u32 exit_code, exit_signal;
> +
> + __u16 pid;
> + __u16 tgid;
> +
> + __u64 utime, stime, utimescaled, stimescaled;
> + __u64 gtime;
> + __u64 prev_utime, prev_stime;
> + __u64 nvcsw, nivcsw;
> + __u64 start_time_sec, start_time_nsec;
> + __u64 real_start_time_sec, real_start_time_nsec;
> + __u64 min_flt, maj_flt;
> +
> + __s16 task_comm_len;
> + char comm[TASK_COMM_LEN];
> +};
In this case, I'm pretty sure that sizeof(cr_hdr_task) on x86-32 is
different from x86-64, since it will be 32-bit aligned on x86-32.
> +
> +/*
> + * During restart the code reads in data from the chekcpoint image into a
> + * temporary buffer (ctx->hbuf). Because operations can be nested, one
> + * should call cr_hbuf_get() to reserve space in the buffer, and then
> + * cr_hbuf_put() when it no longer needs that space
> + */
> +
> +#include <linux/version.h>
> +#include <linux/sched.h>
> +#include <linux/file.h>
> +
> +#include "ckpt.h"
> +#include "ckpt_hdr.h"
> +
> +/**
> + * cr_hbuf_get - reserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to reserve
> + */
> +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
> +{
> + void *ptr;
> +
> + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
> + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
> + ctx->hpos += n;
> + return ptr;
> +}
Can (ctx->hpos + n > CR_HBUF_TOTAL) be controlled by the input
data? If so, this is a denial-of-service attack.
> +
> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
> +{
> + mm_segment_t oldfs;
> + int ret;
> +
> + oldfs = get_fs();
> + set_fs(KERNEL_DS);
> + ret = cr_uwrite(ctx, buf, count);
> + set_fs(oldfs);
> +
> + return ret;
> +}
get_fs()/set_fs() always feels a bit ouch, and this way you have
to use __force to avoid the warnings about __user pointer casts
in sparse.
I wonder if you can use splice_read/splice_write to get around
this problem.
> + struct cr_ctx *ctx;
> + struct file *file;
> + int fput_needed;
> + int ret;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> +
Why do you need CAP_SYS_ADMIN for this? Can't regular users
be allowed to checkpoint/restart their own tasks?
> --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps 2008-08-05 09:04:27.000000000 -0700
> +++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.000000000 -0700
> @@ -611,7 +611,7 @@ export mod_strip_cmd
>
>
> ifeq ($(KBUILD_EXTMOD),)
> -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
> +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/
>
The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead?
Arnd <><
> diff -puN /dev/null include/asm-x86/ckpt.h
> --- /dev/null 2007-04-11 11:48:27.000000000 -0700
> +++ linux-2.6.git-dave/include/asm-x86/ckpt.h 2008-08-04 13:29:59.000000000 -0700
> @@ -0,0 +1,46 @@
> +
> +struct cr_hdr_cpu {
> + __u64 bx;
> + __u64 cx;
> + __u64 dx;
> + __u64 si;
> + __u64 di;
> + __u64 bp;
> + __u64 ax;
> + __u64 ds;
> + __u64 es;
> + __u64 orig_ax;
> + __u64 ip;
> + __u64 cs;
> + __u64 flags;
> + __u64 sp;
> + __u64 ss;
> + __u64 fs;
> + __u64 gs;
> +
> + __u64 debugreg0;
> + __u64 debugreg1;
> + __u64 debugreg2;
> + __u64 debugreg3;
> + __u64 debugreg6;
> + __u64 debugreg7;
> +
> + __u8 uses_debug;
> +
> + __u8 used_math;
> + __u8 has_fxsr;
> + union thread_xstate xstate; /* i387 */
> +};
It seems weird that you use __u64 members for the registers, but don't
include r8..r15 in the list. As a consequence, this structure does not
seem well suited for either x86-32 or x86-64.
I would suggest either using struct pt_regs by reference, or defining
it so that you can use the same structure for both 32 and 64 bit x86.
Arnd <><
On Friday 08 August 2008, Dave Hansen wrote:
>
> diff -puN ckpt/ckpt_hdr.h~memory_part ckpt/ckpt_hdr.h
> --- linux-2.6.git/ckpt/ckpt_hdr.h~memory_part 2008-08-05 08:37:29.000000000 -0700
> +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-05 08:37:29.000000000 -0700
> @@ -67,3 +67,24 @@ struct cr_hdr_task {
> };
>
>
> +
> +struct cr_hdr_mm {
> + __u32 tag; /* sharing identifier */
> + __u64 start_code, end_code, start_data, end_data;
> + __u64 start_brk, brk, start_stack;
> + __u64 arg_start, arg_end, env_start, env_end;
> + __s16 map_count;
> +};
Another structure that is not 32/64 bit ABI safe on x86. It would be safe
if you reorder the members as
struct cr_hdr_mm {
__u32 tag; /* sharing identifier */
__s16 map_count;
__u16 pad; /* not actually needed, but better to make it explicit */
__u64 start_code, end_code, start_data, end_data;
__u64 start_brk, brk, start_stack;
__u64 arg_start, arg_end, env_start, env_end;
};
> +struct cr_hdr_vma {
> + __u32 how;
> +
> + __u64 vm_start;
> + __u64 vm_end;
> + __u64 vm_page_prot;
> + __u64 vm_flags;
> + __u64 vm_pgoff;
> +
> + __s16 npages;
> + __s16 namelen;
> +};
same here.
Arnd <><
On Friday 08 August 2008, Dave Hansen wrote:
>
> linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S | 2 ++
> linux-2.6.git-dave/include/asm-x86/unistd_32.h | 2 ++
> 2 files changed, 4 insertions(+)
System calls should also be declared in include/linux/syscalls.h.
I guess you are aware that this implementation is not enough to
support 32 bit tasks on x86_64. In addition to the native 64-bit
code, you would also need the 32-bit compat code here.
Arnd <><
On Fri, 2008-08-08 at 11:25 +0200, Arnd Bergmann wrote:
> > pid_t pid = getpid();
> > int ret;
> >
> > ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
>
> Interface-wise, I would consider checkpointing yourself signficantly
> different from checkpointing some other thread. If checkpointing
> yourself is the common case, it probably makes sense to allow passing
> of pid=0 for this.
I don't think it is the common case. Probably now when we're screwing
around with it, but not in the future. Do you think it is worth adding
the pid=0 handling?
-- Dave
On Friday 08 August 2008, Dave Hansen wrote:
> I don't think it is the common case. ?Probably now when we're screwing
> around with it, but not in the future. ?Do you think it is worth adding
> the pid=0 handling?
If it's the exception, probably not. Otherwise it would be a nice shortcut
to avoid having to do two system calls every time you write code using it.
Then again, there are probably not many programs calling it anyway, if it
get encapsulated in some user space tool.
Arnd <><
On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
> On Friday 08 August 2008, Dave Hansen wrote:
> > + hh->magic = 0x00a2d200;
> > + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
> > + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
> > + hh->patch = (LINUX_VERSION_CODE) & 0xff;
...
> > +}
>
> Do you rely on the kernel version in order to determine the format
> of the binary data, or is it just informational?
>
> If you think the format can change in incompatible ways, you
> probably need something more specific than the version number
> these days, because there are just so many different trees with
> the same numbers.
Yeah, this is very true. My guess is that we'll need something like
what we do with modversions.
> > +/* debugging */
> > +#if 0
> > +#define CR_PRINTK(str, args...) \
> > + printk(KERN_ERR "cr@%s#%d: " str, __func__, __LINE__, ##args)
> > +#else
> > +#define CR_PRINTK(...) do {} while (0)
> > +#endif
> > +
>
> Please use the existing pr_debug and dev_debug here, instead of creating
> yet another version.
Sure thing. Will do.
> > +struct cr_hdr_tail {
> > + __u32 magic;
> > + __u32 cksum[2];
> > +};
>
> This structure has an odd multiple of 32-bit members, which means
> that if you put it into a larger structure that also contains
> 64-bit members, the larger structure may get different alignment
> on x86-32 and x86-64, which you might want to avoid.
> I can't tell if this is an actual problem here.
Can't we just declare all these things __packed__ and stop worrying
about aligning them all manually?
> > +
> > +/*
> > + * During restart the code reads in data from the chekcpoint image into a
> > + * temporary buffer (ctx->hbuf). Because operations can be nested, one
> > + * should call cr_hbuf_get() to reserve space in the buffer, and then
> > + * cr_hbuf_put() when it no longer needs that space
> > + */
> > +
> > +#include <linux/version.h>
> > +#include <linux/sched.h>
> > +#include <linux/file.h>
> > +
> > +#include "ckpt.h"
> > +#include "ckpt_hdr.h"
> > +
> > +/**
> > + * cr_hbuf_get - reserve space on the hbuf
> > + * @ctx: checkpoint context
> > + * @n: number of bytes to reserve
> > + */
> > +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
> > +{
> > + void *ptr;
> > +
> > + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
> > + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
> > + ctx->hpos += n;
> > + return ptr;
> > +}
>
> Can (ctx->hpos + n > CR_HBUF_TOTAL) be controlled by the input
> data? If so, this is a denial-of-service attack.
Ugh, this is crappy code anyway. It needs to return an error and have
someone else handle it.
> > +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
> > +{
> > + mm_segment_t oldfs;
> > + int ret;
> > +
> > + oldfs = get_fs();
> > + set_fs(KERNEL_DS);
> > + ret = cr_uwrite(ctx, buf, count);
> > + set_fs(oldfs);
> > +
> > + return ret;
> > +}
>
> get_fs()/set_fs() always feels a bit ouch, and this way you have
> to use __force to avoid the warnings about __user pointer casts
> in sparse.
> I wonder if you can use splice_read/splice_write to get around
> this problem.
I have to wonder if this is just a symptom of us trying to do this the
wrong way. We're trying to talk the kernel into writing internal gunk
into a FD. You're right, it is like a splice where one end of the pipe
is in the kernel.
Any thoughts on a better way to do this?
> > + struct cr_ctx *ctx;
> > + struct file *file;
> > + int fput_needed;
> > + int ret;
> > +
> > + if (!capable(CAP_SYS_ADMIN))
> > + return -EPERM;
> > +
>
> Why do you need CAP_SYS_ADMIN for this? Can't regular users
> be allowed to checkpoint/restart their own tasks?
Yes, eventually. I think one good point is that we should probably
remove this now so that we *have* to think about security implications
as we add each individual patch. For instance, what kind of checking do
we do when we restore an mlock()'d VMA?
I'll pull this check out so it causes pain. (the good kind)
> > --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps 2008-08-05 09:04:27.000000000 -0700
> > +++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.000000000 -0700
> > @@ -611,7 +611,7 @@ export mod_strip_cmd
> >
> >
> > ifeq ($(KBUILD_EXTMOD),)
> > -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
> > +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/
> >
>
> The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead?
Fine with me. Renamed in new patches, hopefully.
I'll send new patches out later today.
-- Dave
Hi,
Arnd Bergmann wrote:
> On Friday 08 August 2008, Dave Hansen wrote:
>> These patches are from Oren Laaden. I've refactored them
>> a bit to make them a wee bit more reviewable. I think this
>> separates out the per-arch bits pretty well. It should also
>> be at least build-bisetable.
>
> Cool stuff
>
Thanks. This is a proof of concept so all sorts of feedback are
definitely welcome. Some of the ideas and discussions are found
around:
http://wiki.openvz.org/Containers/Mini-summit_2008
and the notes:
http://wiki.openvz.org/Containers/Mini-summit_2008_notes
and the archives of the linux containers mailing list:
https://lists.linux-foundation.org/pipermail/containers/
(August and July).
Several aspects of the implementation are still experimental and
I expect them to evolve with the feedback. In particular, expect
the specific user interface (syscalls) and the checkpoint image
format to be moving targets.
>> ============================== ckpt.c ================================
>>
>> #define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <errno.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <asm/unistd_32.h>
>> #include <sys/syscall.h>
>
> Note that asm/unistd_32.h is not portable, you should use asm/unistd.h
> in the example.
>
>> pid_t pid = getpid();
>> int ret;
>>
>> ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
>
> Interface-wise, I would consider checkpointing yourself signficantly
> different from checkpointing some other thread. If checkpointing
> yourself is the common case, it probably makes sense to allow passing
> of pid=0 for this.
>
The checkpoint/restart code is meant to checkpoint a whole container,
that is be able to save the state of multiple other tasks. The same
code can also be used to checkpoint yourself fairly easily with minimal
changes (see comments in the code about "in context" checkpoint/restart
that take care of this).
I suggest to keep the interface as is in the sense that the pid will
identify the target container (e.g. the pid of the init process of that
container).
Then, pid=0 would mean "the container to which I belong" if
you are inside a container (and therefore don't know the pid of the
init process there).
Finally, to checkpoint yourself, you would set the a bit in the flags
argument to something like CR_CKPT_MYSELF. Such a flag will be needed
internally anyway to special-case self checkpoint where appropriate.
Comments are welcome.
Oren.
Arnd Bergmann wrote:
> On Friday 08 August 2008, Dave Hansen wrote:
>> linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S | 2 ++
>> linux-2.6.git-dave/include/asm-x86/unistd_32.h | 2 ++
>> 2 files changed, 4 insertions(+)
>
> System calls should also be declared in include/linux/syscalls.h.
>
> I guess you are aware that this implementation is not enough to
> support 32 bit tasks on x86_64. In addition to the native 64-bit
> code, you would also need the 32-bit compat code here.
Yes, of course. The current code does not attempt to do that yet.
Oren.
Hi,
Thanks for the feedback.
The proof-of-concept is written for x86 32 bits, keeping in mind that
we'll need support for 64 bits support. My goal is to leverage feedback
and contributions to have support for 64 bits and other architectures
as well.
Arnd Bergmann wrote:
>
>> diff -puN /dev/null include/asm-x86/ckpt.h
>> --- /dev/null 2007-04-11 11:48:27.000000000 -0700
>> +++ linux-2.6.git-dave/include/asm-x86/ckpt.h 2008-08-04 13:29:59.000000000 -0700
>> @@ -0,0 +1,46 @@
>
>> +
>> +struct cr_hdr_cpu {
>> + __u64 bx;
>> + __u64 cx;
>> + __u64 dx;
>> + __u64 si;
>> + __u64 di;
>> + __u64 bp;
>> + __u64 ax;
>> + __u64 ds;
>> + __u64 es;
>> + __u64 orig_ax;
>> + __u64 ip;
>> + __u64 cs;
>> + __u64 flags;
>> + __u64 sp;
>> + __u64 ss;
>> + __u64 fs;
>> + __u64 gs;
>> +
>> + __u64 debugreg0;
>> + __u64 debugreg1;
>> + __u64 debugreg2;
>> + __u64 debugreg3;
>> + __u64 debugreg6;
>> + __u64 debugreg7;
>> +
>> + __u8 uses_debug;
>> +
>> + __u8 used_math;
>> + __u8 has_fxsr;
>> + union thread_xstate xstate; /* i387 */
>> +};
>
> It seems weird that you use __u64 members for the registers, but don't
> include r8..r15 in the list. As a consequence, this structure does not
> seem well suited for either x86-32 or x86-64.
In the context of CR, x86-32 and x86-64 are distinct architectures because
you cannot always migrate from one to the other (though 32->64 is sometimes
possible). Therefore, each architecture can have a separate checkpoint file
format (eg r8..r15 only for x86-64).
The information about the kernel configuration, version and cpu settings
will appear on the header; so the restart code will know the architecture
on which the checkpoint had been taken.
So if we want to restart a task checkpointed on x86-32 on a x86-64 machine
(in 32 bit mode), the code will know to not expect that data (r8..r15).
Except for this special case (32 bit running 64 bit), simple conversion can
be done in the kernel if needed, but most conversion between kernel the
format for different kernel versions (should it change) can be done in
user space (eg. with a filter).
>
> I would suggest either using struct pt_regs by reference, or defining
> it so that you can use the same structure for both 32 and 64 bit x86.
We prefer not to use the kernel structure directly, but an intermediate
structure that can help mitigate subtle incompatibilities issues (between
kernel configurations, versions, and even compiler versions).
Anyway, either a single structure for both 32 and 64 bit x86, or separate
"struct cr_hdr_cpu{_32,_64}", one for each architecture.
Oren.
Dave Hansen wrote:
> On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
>> On Friday 08 August 2008, Dave Hansen wrote:
>>> + hh->magic = 0x00a2d200;
>>> + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
>>> + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
>>> + hh->patch = (LINUX_VERSION_CODE) & 0xff;
> ...
>>> +}
>> Do you rely on the kernel version in order to determine the format
>> of the binary data, or is it just informational?
>>
>> If you think the format can change in incompatible ways, you
>> probably need something more specific than the version number
>> these days, because there are just so many different trees with
>> the same numbers.
>
> Yeah, this is very true. My guess is that we'll need something like
> what we do with modversions.
Exactly. The header should eventually contain sufficient information
to describe the kernel version, configuration, compiler, cpu (arch and
capabilities), and checkpoint code version.
How would you suggest to identify the origin tree with an identifier
(or a text field) in the header ?
>
>>> +/* debugging */
>>> +#if 0
>>> +#define CR_PRINTK(str, args...) \
>>> + printk(KERN_ERR "cr@%s#%d: " str, __func__, __LINE__, ##args)
>>> +#else
>>> +#define CR_PRINTK(...) do {} while (0)
>>> +#endif
>>> +
>> Please use the existing pr_debug and dev_debug here, instead of creating
>> yet another version.
>
> Sure thing. Will do.
>
>>> +struct cr_hdr_tail {
>>> + __u32 magic;
>>> + __u32 cksum[2];
>>> +};
>> This structure has an odd multiple of 32-bit members, which means
>> that if you put it into a larger structure that also contains
>> 64-bit members, the larger structure may get different alignment
>> on x86-32 and x86-64, which you might want to avoid.
>> I can't tell if this is an actual problem here.
>
> Can't we just declare all these things __packed__ and stop worrying
> about aligning them all manually?
>
>>> +
>>> +/*
>>> + * During restart the code reads in data from the chekcpoint image into a
>>> + * temporary buffer (ctx->hbuf). Because operations can be nested, one
>>> + * should call cr_hbuf_get() to reserve space in the buffer, and then
>>> + * cr_hbuf_put() when it no longer needs that space
>>> + */
>>> +
>>> +#include <linux/version.h>
>>> +#include <linux/sched.h>
>>> +#include <linux/file.h>
>>> +
>>> +#include "ckpt.h"
>>> +#include "ckpt_hdr.h"
>>> +
>>> +/**
>>> + * cr_hbuf_get - reserve space on the hbuf
>>> + * @ctx: checkpoint context
>>> + * @n: number of bytes to reserve
>>> + */
>>> +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
>>> +{
>>> + void *ptr;
>>> +
>>> + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
>>> + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
>>> + ctx->hpos += n;
>>> + return ptr;
>>> +}
>> Can (ctx->hpos + n > CR_HBUF_TOTAL) be controlled by the input
>> data? If so, this is a denial-of-service attack.
>
> Ugh, this is crappy code anyway. It needs to return an error and have
> someone else handle it.
Actually not quite. 'n' is _not_ controlled by the input data, and at
the same time ctx->hpos should always carry enough room by design. If
that is not the case, then it's a logical bug, not DoS attack.
To avoid repetitive malloc/free, ctx->hbuf is a buffer to host headers
as they are read; since headers can be read in a nested manner, ctx->hpos
points to the next free position in that buffer. So 'n' is the size of
the header that we are about to read - decided at compile time, not the
user input. The BUG_ON() statement asserts that by design we have enough
buffer (like you'd check that you didn't run out of kernel stack...)
If it is preferred, we can change this to write a kernel message and
return a special error telling that a logical error has occurred.
>
>>> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
>>> +{
>>> + mm_segment_t oldfs;
>>> + int ret;
>>> +
>>> + oldfs = get_fs();
>>> + set_fs(KERNEL_DS);
>>> + ret = cr_uwrite(ctx, buf, count);
>>> + set_fs(oldfs);
>>> +
>>> + return ret;
>>> +}
>> get_fs()/set_fs() always feels a bit ouch, and this way you have
>> to use __force to avoid the warnings about __user pointer casts
>> in sparse.
>> I wonder if you can use splice_read/splice_write to get around
>> this problem.
>
> I have to wonder if this is just a symptom of us trying to do this the
> wrong way. We're trying to talk the kernel into writing internal gunk
> into a FD. You're right, it is like a splice where one end of the pipe
> is in the kernel.
>
> Any thoughts on a better way to do this?
>
>>> + struct cr_ctx *ctx;
>>> + struct file *file;
>>> + int fput_needed;
>>> + int ret;
>>> +
>>> + if (!capable(CAP_SYS_ADMIN))
>>> + return -EPERM;
>>> +
>> Why do you need CAP_SYS_ADMIN for this? Can't regular users
>> be allowed to checkpoint/restart their own tasks?
>
> Yes, eventually. I think one good point is that we should probably
> remove this now so that we *have* to think about security implications
> as we add each individual patch. For instance, what kind of checking do
> we do when we restore an mlock()'d VMA?
>
> I'll pull this check out so it causes pain. (the good kind)
Hmmm... even if not strictly now, we *will* need admin privileges for
the CR operations, for the following reasons:
checkpoint: we save the entire state of a set of processes to a file - so
we must have privileges to do so, at least within (or with respect to) the
said container. Even if we are the user who owns the container, we'll need
root access within that container.
restart: we restore the entire set of a set of processes, which may require
some privileged operations (again, at least within or with respect to the
said container). Otherwise any user could inject any restart data into the
kernel and create any set of processes with arbitrary permissions.
In a sense, we need something similar to granting ptrace access.
>
>>> --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps 2008-08-05 09:04:27.000000000 -0700
>>> +++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.000000000 -0700
>>> @@ -611,7 +611,7 @@ export mod_strip_cmd
>>>
>>>
>>> ifeq ($(KBUILD_EXTMOD),)
>>> -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
>>> +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/
>>>
>> The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead?
>
> Fine with me. Renamed in new patches, hopefully.
>
> I'll send new patches out later today.
>
> -- Dave
>
On Friday 08 August 2008, Dave Hansen wrote:
> On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
> > > +struct cr_hdr_tail {
> > > + __u32 magic;
> > > + __u32 cksum[2];
> > > +};
> >
> > This structure has an odd multiple of 32-bit members, which means
> > that if you put it into a larger structure that also contains
> > 64-bit members, the larger structure may get different alignment
> > on x86-32 and x86-64, which you might want to avoid.
> > I can't tell if this is an actual problem here.
>
> Can't we just declare all these things __packed__ and stop worrying
> about aligning them all manually?
I personally dislike __packed__ because it makes it very easy to get
suboptimal object code. If you either pad every structure to a multiple
of 64 bits or avoid __u64 members, you don't have a problem. Also,
I think avoiding implicit padding inside of data structures is very
helpful for user interfaces, if necessary you can always add explicit
padding.
> > get_fs()/set_fs() always feels a bit ouch, and this way you have
> > to use __force to avoid the warnings about __user pointer casts
> > in sparse.
> > I wonder if you can use splice_read/splice_write to get around
> > this problem.
>
> I have to wonder if this is just a symptom of us trying to do this the
> wrong way. We're trying to talk the kernel into writing internal gunk
> into a FD. You're right, it is like a splice where one end of the pipe
> is in the kernel.
>
> Any thoughts on a better way to do this?
Maybe you can invert the logic and let the new syscalls create a file
descriptor, and then have user space read or splice the checkpoint
data from it, and restore it by writing to the file descriptor.
It's probably easy to do using anon_inode_getfd() and would solve this
problem, but at the same time make checkpointing the current thread
hard if not impossible.
> Yes, eventually. I think one good point is that we should probably
> remove this now so that we *have* to think about security implications
> as we add each individual patch. For instance, what kind of checking do
> we do when we restore an mlock()'d VMA?
I think the question can be generalized further: How do you deal with
saved tasks that have more priviledges than the task doing the restore?
There are probably more, but what I can think of right now includes:
* anything you can set using ulimit
* capabilities
* threads running as another user/group
* open files that have had their permissions changed after the open
Arnd <><
On Fri, 2008-08-08 at 16:59 -0400, Oren Laadan wrote:
> To avoid repetitive malloc/free, ctx->hbuf is a buffer to host headers
> as they are read;
kmalloc/kfree() area really, really fast. I wonder if the code gets
easier or harder to read if we just alloc/free as we need to.
How large are these allocations, usually? Will stack allocation work in
most cases?
-- Dave
On Friday 08 August 2008, Oren Laadan wrote:
> > Yeah, this is very true. My guess is that we'll need something like
> > what we do with modversions.
>
> Exactly. The header should eventually contain sufficient information
> to describe the kernel version, configuration, compiler, cpu (arch and
> capabilities), and checkpoint code version.
>
> How would you suggest to identify the origin tree with an identifier
> (or a text field) in the header ?
Including struct utsname in the header covers most of this. I supposed
you can't do it entirely safe, and you always need to be prepared for
malicious input data, so there probably isn't much point in getting
too fancy beyond what you need to do to prevent user errors.
> If it is preferred, we can change this to write a kernel message and
> return a special error telling that a logical error has occurred.
My recommendation in general is to make kernel code crash loudly if
there is a bug in the kernel itself. Returning error codes makes most
sense if they get sent back to the user, which then can make sense of
it, as documented in the man page of the syscall.
> > Yes, eventually. I think one good point is that we should probably
> > remove this now so that we *have* to think about security implications
> > as we add each individual patch. For instance, what kind of checking do
> > we do when we restore an mlock()'d VMA?
> >
> > I'll pull this check out so it causes pain. (the good kind)
>
> Hmmm... even if not strictly now, we *will* need admin privileges for
> the CR operations, for the following reasons:
>
> checkpoint: we save the entire state of a set of processes to a file - so
> we must have privileges to do so, at least within (or with respect to) the
> said container. Even if we are the user who owns the container, we'll need
> root access within that container.
>
> restart: we restore the entire set of a set of processes, which may require
> some privileged operations (again, at least within or with respect to the
> said container). Otherwise any user could inject any restart data into the
> kernel and create any set of processes with arbitrary permissions.
>
> In a sense, we need something similar to granting ptrace access.
Exactly. There was a project that implemented checkpoint/restart through
ptrace (don't remember what it was called), so with certain limitations
it should also be possible to implement the syscalls so that any user that
can ptrace the tasks can also checkpoint them.
Arnd <><
On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote:
> > I have to wonder if this is just a symptom of us trying to do this the
> > wrong way. We're trying to talk the kernel into writing internal gunk
> > into a FD. You're right, it is like a splice where one end of the pipe
> > is in the kernel.
> >
> > Any thoughts on a better way to do this?
>
> Maybe you can invert the logic and let the new syscalls create a file
> descriptor, and then have user space read or splice the checkpoint
> data from it, and restore it by writing to the file descriptor.
> It's probably easy to do using anon_inode_getfd() and would solve this
> problem, but at the same time make checkpointing the current thread
> hard if not impossible.
Yeah, it does seem kinda backwards. But, instead of even having to
worry about the anon_inode stuff, why don't we just put it in a fs like
everything else? checkpointfs!
I'm also really not convinced that putting the entire checkpoint in one
glob is really the solution, either. I mean, is system call overhead
really a problem here?
-- Dave
On Friday 08 August 2008, Oren Laadan wrote:
> > It seems weird that you use __u64 members for the registers, but don't
> > include r8..r15 in the list. As a consequence, this structure does not
> > seem well suited for either x86-32 or x86-64.
>
> In the context of CR, x86-32 and x86-64 are distinct architectures because
> you cannot always migrate from one to the other (though 32->64 is sometimes
> possible). Therefore, each architecture can have a separate checkpoint file
> format (eg r8..r15 only for x86-64).
So why do you use __u64 members for your 32 bit registers?
> Except for this special case (32 bit running 64 bit), simple conversion can
> be done in the kernel if needed, but most conversion between kernel the
> format for different kernel versions (should it change) can be done in
> user space (eg. with a filter).
The 32bit on 64bit case is quite common on non-x86 architectures, e.g.
powerpc or sparc, where 64 bit kernels typically run 32 bit user space.
A particularly interesting case is mixing 32 and 64 bit tasks in a container
that you are checkpointing. This is a very realistic scenario, so there
may be good arguments for keeping the format identical between the variations
of each architecture.
> > I would suggest either using struct pt_regs by reference, or defining
> > it so that you can use the same structure for both 32 and 64 bit x86.
>
> We prefer not to use the kernel structure directly, but an intermediate
> structure that can help mitigate subtle incompatibilities issues (between
> kernel configurations, versions, and even compiler versions).
>
> Anyway, either a single structure for both 32 and 64 bit x86, or separate
> "struct cr_hdr_cpu{_32,_64}", one for each architecture.
struct pt_regs is part of the kernel ABI, it will not change.
Arnd <><
On Saturday 09 August 2008, Dave Hansen wrote:
> On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote:
> > Maybe you can invert the logic and let the new syscalls create a file
> > descriptor, and then have user space read or splice the checkpoint
> > data from it, and restore it by writing to the file descriptor.
> > It's probably easy to do using anon_inode_getfd() and would solve this
> > problem, but at the same time make checkpointing the current thread
> > hard if not impossible.
>
> Yeah, it does seem kinda backwards. ?But, instead of even having to
> worry about the anon_inode stuff, why don't we just put it in a fs like
> everything else? ?checkpointfs!
Well, anon_inodes are really easy to use and have replaced some of the
simple non-mountable file systems in the kernel.
checkpointfs sounds interesting and I guess in a plan9 world of fairies
and fantasy, you should be able to create a checkpoint of your system using
'tar czf - /proc/', but I'm not sure it helps here.
The main problem I see with that would be atomicity: If you want multiple
processes to keep interacting with each other, you need to save them at
the same point in time, which gets harder as you split your interface into
more than a single file descriptor.
Arnd <><
Arnd Bergmann wrote:
> On Friday 08 August 2008, Oren Laadan wrote:
>
>>> It seems weird that you use __u64 members for the registers, but don't
>>> include r8..r15 in the list. As a consequence, this structure does not
>>> seem well suited for either x86-32 or x86-64.
>> In the context of CR, x86-32 and x86-64 are distinct architectures because
>> you cannot always migrate from one to the other (though 32->64 is sometimes
>> possible). Therefore, each architecture can have a separate checkpoint file
>> format (eg r8..r15 only for x86-64).
>
> So why do you use __u64 members for your 32 bit registers?
The idea was that x86-32 checkpoints can be restarted on a x86-64 node in
32 bit mode. Clearly it needed more thought...
>
>> Except for this special case (32 bit running 64 bit), simple conversion can
>> be done in the kernel if needed, but most conversion between kernel the
>> format for different kernel versions (should it change) can be done in
>> user space (eg. with a filter).
>
> The 32bit on 64bit case is quite common on non-x86 architectures, e.g.
> powerpc or sparc, where 64 bit kernels typically run 32 bit user space.
>
> A particularly interesting case is mixing 32 and 64 bit tasks in a container
> that you are checkpointing. This is a very realistic scenario, so there
> may be good arguments for keeping the format identical between the variations
> of each architecture.
>
>>> I would suggest either using struct pt_regs by reference, or defining
>>> it so that you can use the same structure for both 32 and 64 bit x86.
>> We prefer not to use the kernel structure directly, but an intermediate
>> structure that can help mitigate subtle incompatibilities issues (between
>> kernel configurations, versions, and even compiler versions).
>>
>> Anyway, either a single structure for both 32 and 64 bit x86, or separate
>> "struct cr_hdr_cpu{_32,_64}", one for each architecture.
>
> struct pt_regs is part of the kernel ABI, it will not change.
I'm in favor about keeping the format identical between the variations of
each architecture. Note, however, that "struct pt_regs" won't do because it
may change with these variations.
So we'll take care of the padding and add r8..r15 in the next version.
Oren.
Dave Hansen wrote:
> On Fri, 2008-08-08 at 16:59 -0400, Oren Laadan wrote:
>> To avoid repetitive malloc/free, ctx->hbuf is a buffer to host headers
>> as they are read;
>
> kmalloc/kfree() area really, really fast. I wonder if the code gets
> easier or harder to read if we just alloc/free as we need to.
The ctx->hbuf interface is a pair of cr_hbuf_get(ctx, length) and a
matching cr_hbuf_put(ctx, length), almost like using kmalloc/kfree().
The main difference is that cleanup in error paths is implicit (the
whole buffer is freed when the ctx is deallocated).
> How large are these allocations, usually? Will stack allocation work in
> most cases?
That depends on how we construct the headers. In Zap there are some
headers that use relatively long structures to be put on the stack,
and it wouldn't make much sense to divide them into smaller headers
artificially.
However, I forgot to mention earlier that an important reason to use
this construct is actually in anticipation for a future optimization:
during application downtime the checkpoint state will be aggregated
into an in-memory buffer, and only after the application is allowed
to continue execution (unfrozen) the buffer will be written-back to
the FD. In that scenario, we will allocate a larger buffer in the ctx
(eg based on some heuristics) and cr_hbuf_get() will return the next
location in that buffer, while cr_hbuf_put() will do nothing.
Oren.
On Fri, 2008-08-08 at 19:04 -0400, Oren Laadan wrote:
> > struct pt_regs is part of the kernel ABI, it will not change.
>
> I'm in favor about keeping the format identical between the variations of
> each architecture. Note, however, that "struct pt_regs" won't do because it
> may change with these variations.
"Part of the kernel ABI" makes it sound to me like it won't change.
Who's right here? :)
-- Dave
On Sat, 2008-08-09 at 00:39 +0200, Arnd Bergmann wrote:
> The main problem I see with that would be atomicity: If you want multiple
> processes to keep interacting with each other, you need to save them at
> the same point in time, which gets harder as you split your interface into
> more than a single file descriptor.
It could take ages to write out a checkpoint even to a single fd, so I
suspect we'd have the exact same kinds of issues either way.
-- Dave
Dave Hansen wrote:
> On Fri, 2008-08-08 at 19:04 -0400, Oren Laadan wrote:
>>> struct pt_regs is part of the kernel ABI, it will not change.
>> I'm in favor about keeping the format identical between the variations of
>> each architecture. Note, however, that "struct pt_regs" won't do because it
>> may change with these variations.
>
> "Part of the kernel ABI" makes it sound to me like it won't change.
> Who's right here? :)
>
> -- Dave
>
hehehe .. both; I meant that while it doesn't change per architecture, it
varies between architectures. So "struct pt_regs" compiled for x86-32 is
different than that compiled for x86-64. Therefore we can't just dump the
structure as is and expect that 64 bit would be able to parse the 32 bit.
In other words, we need an intermediate representation.
Oren.
On Fri, 2008-08-08 at 21:20 -0400, Oren Laadan wrote:
> hehehe .. both; I meant that while it doesn't change per architecture, it
> varies between architectures. So "struct pt_regs" compiled for x86-32 is
> different than that compiled for x86-64. Therefore we can't just dump the
> structure as is and expect that 64 bit would be able to parse the 32 bit.
> In other words, we need an intermediate representation.
Surely we already handle this, though. Don't we allow a 32-bit app
running on a 64-bit kernel to PTRACE_GETREGS and get the 32-bit version?
A 64-bit app will get the 64-bit version making the same syscall. It's
all handled in the syscall compatibility code.
-- Dave
Dave Hansen wrote:
> On Fri, 2008-08-08 at 21:20 -0400, Oren Laadan wrote:
>> hehehe .. both; I meant that while it doesn't change per architecture, it
>> varies between architectures. So "struct pt_regs" compiled for x86-32 is
>> different than that compiled for x86-64. Therefore we can't just dump the
>> structure as is and expect that 64 bit would be able to parse the 32 bit.
>> In other words, we need an intermediate representation.
>
> Surely we already handle this, though. Don't we allow a 32-bit app
> running on a 64-bit kernel to PTRACE_GETREGS and get the 32-bit version?
> A 64-bit app will get the 64-bit version making the same syscall. It's
> all handled in the syscall compatibility code.
Sure, that's a compatibility layer around ptrace() in the 64-bit kernel.
Recall that Arnd suggested "keeping the format identical between the
variations of each architecture", and I fully agree. If we want to keep
the format identical, we can't simply define:
struct cr_hdr_cpu {
struct pt_regs regs;
...
};
because that will compile differently on x86-32 and x86-64. So either we
add r8..r15 to the structure as it appears in the patch now (and keep the
format identical), or allow the format to vary, and explicitly test for
this case and add a compatibility layer. Personally I prefer the former.
Oren.
On Saturday 09 August 2008, Dave Hansen wrote:
> On Sat, 2008-08-09 at 00:39 +0200, Arnd Bergmann wrote:
> > The main problem I see with that would be atomicity: If you want multiple
> > processes to keep interacting with each other, you need to save them at
> > the same point in time, which gets harder as you split your interface into
> > more than a single file descriptor.
>
> It could take ages to write out a checkpoint even to a single fd, so I
> suspect we'd have the exact same kinds of issues either way.
I guess either way, you have to SIGSTOP (or similar) all the tasks you want
to checkpoint atomically before you start saving the contents.
If you use a single fd, you can do that under the covers, when using a
more complex file system, it seems more logical to require an explicit
interface for this.
Arnd <><
On Saturday 09 August 2008, Oren Laadan wrote:
> >> Anyway, either a single structure for both 32 and 64 bit x86, or separate
> >> "struct cr_hdr_cpu{_32,_64}", one for each architecture.
> >
> > struct pt_regs is part of the kernel ABI, it will not change.
>
> I'm in favor about keeping the format identical between the variations of
> each architecture. Note, however, that "struct pt_regs" won't do because it
> may change with these variations.
>
> So we'll take care of the padding and add r8..r15 in the next version.
>
Fair enough. How about making the layout in that structure identical to
the 64-bit pt_regs though? I don't know if we need that at any time,
but my feeling is that it is nicer than a slightly different random
layout, e.g. if someone wants to extend gdb to look at checkpointed
process dumps.
Arnd <><
On Sat, 2008-08-09 at 08:37 +0200, Arnd Bergmann wrote:
> On Saturday 09 August 2008, Dave Hansen wrote:
> > On Sat, 2008-08-09 at 00:39 +0200, Arnd Bergmann wrote:
> > > The main problem I see with that would be atomicity: If you want multiple
> > > processes to keep interacting with each other, you need to save them at
> > > the same point in time, which gets harder as you split your interface into
> > > more than a single file descriptor.
> >
> > It could take ages to write out a checkpoint even to a single fd, so I
> > suspect we'd have the exact same kinds of issues either way.
>
> I guess either way, you have to SIGSTOP (or similar) all the tasks you want
> to checkpoint atomically before you start saving the contents.
> If you use a single fd, you can do that under the covers, when using a
> more complex file system, it seems more logical to require an explicit
> interface for this.
Oh, we're already working on patches to the freezer code to do this for
us. There's a branch in here from Matt H. that's doing just that:
http://git.kernel.org/?p=linux/kernel/git/daveh/linux-2.6-next-lxc.git;a=shortlog
-- Dave
Dave Hansen wrote:
> On Fri, 2008-08-08 at 19:04 -0400, Oren Laadan wrote:
>
>>> struct pt_regs is part of the kernel ABI, it will not change.
>>>
>> I'm in favor about keeping the format identical between the variations of
>> each architecture. Note, however, that "struct pt_regs" won't do because it
>> may change with these variations.
>>
>
> "Part of the kernel ABI" makes it sound to me like it won't change.
> Who's right here? :)
Struct pt_regs is not ABI, and can (and has) changed on x86. It's not
suitable for a checkpoint structure because it only contains the
registers that the kernel trashes, not all usermode registers (on i386,
it leaves out %gs, for example). asm-x86/ptrace-abi.h does define stuff
that's fixed in stone; it expresses it in terms of a register array,
with constants defining what element is which register.
J
Quoting Dave Hansen ([email protected]):
> On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote:
> > > I have to wonder if this is just a symptom of us trying to do this the
> > > wrong way. We're trying to talk the kernel into writing internal gunk
> > > into a FD. You're right, it is like a splice where one end of the pipe
> > > is in the kernel.
> > >
> > > Any thoughts on a better way to do this?
> >
> > Maybe you can invert the logic and let the new syscalls create a file
> > descriptor, and then have user space read or splice the checkpoint
> > data from it, and restore it by writing to the file descriptor.
> > It's probably easy to do using anon_inode_getfd() and would solve this
> > problem, but at the same time make checkpointing the current thread
> > hard if not impossible.
>
> Yeah, it does seem kinda backwards. But, instead of even having to
> worry about the anon_inode stuff, why don't we just put it in a fs like
> everything else? checkpointfs!
One reason is that I suspect that stops us from being able to send that
data straight to a pipe to compress and/or send on the network, without
hitting local disk. Though if the checkpointfs was ram-based maybe not?
As Oren has pointed out before, passing in an fd means we can pass a
socket into the syscall.
Using the anon_inodes would also prevent that, but if it makes for a
cleaner overall solution then I'm not against considering either one
of course.
> I'm also really not convinced that putting the entire checkpoint in one
> glob is really the solution, either. I mean, is system call overhead
> really a problem here?
>
> -- Dave
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers
Quoting Arnd Bergmann ([email protected]):
> On Friday 08 August 2008, Dave Hansen wrote:
> > On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
>
> > > > +struct cr_hdr_tail {
> > > > + __u32 magic;
> > > > + __u32 cksum[2];
> > > > +};
> > >
> > > This structure has an odd multiple of 32-bit members, which means
> > > that if you put it into a larger structure that also contains
> > > 64-bit members, the larger structure may get different alignment
> > > on x86-32 and x86-64, which you might want to avoid.
> > > I can't tell if this is an actual problem here.
> >
> > Can't we just declare all these things __packed__ and stop worrying
> > about aligning them all manually?
>
> I personally dislike __packed__ because it makes it very easy to get
> suboptimal object code. If you either pad every structure to a multiple
> of 64 bits or avoid __u64 members, you don't have a problem. Also,
> I think avoiding implicit padding inside of data structures is very
> helpful for user interfaces, if necessary you can always add explicit
> padding.
>
> > > get_fs()/set_fs() always feels a bit ouch, and this way you have
> > > to use __force to avoid the warnings about __user pointer casts
> > > in sparse.
> > > I wonder if you can use splice_read/splice_write to get around
> > > this problem.
> >
> > I have to wonder if this is just a symptom of us trying to do this the
> > wrong way. We're trying to talk the kernel into writing internal gunk
> > into a FD. You're right, it is like a splice where one end of the pipe
> > is in the kernel.
> >
> > Any thoughts on a better way to do this?
>
> Maybe you can invert the logic and let the new syscalls create a file
> descriptor, and then have user space read or splice the checkpoint
> data from it, and restore it by writing to the file descriptor.
> It's probably easy to do using anon_inode_getfd() and would solve this
> problem, but at the same time make checkpointing the current thread
> hard if not impossible.
>
> > Yes, eventually. I think one good point is that we should probably
> > remove this now so that we *have* to think about security implications
> > as we add each individual patch. For instance, what kind of checking do
> > we do when we restore an mlock()'d VMA?
>
> I think the question can be generalized further: How do you deal with
> saved tasks that have more priviledges than the task doing the restore?
>
> There are probably more, but what I can think of right now includes:
> * anything you can set using ulimit
> * capabilities
> * threads running as another user/group
> * open files that have had their permissions changed after the open
At the checkpoint end, the ptrace checks seem apporpriate: If you're
allowed to stop and manipulate the process, then you may as well be
allowed to checkpoint and see/tweak its memory that way.
At the restart end, every resource which was checkpointed will have to
be re-created, and permissions checked against the privilege of the
task which did the restart. We may end up having to make use of the new
credentials for this.
This could become unpleasant: if an unprivileged task asked a privileged
helper to create something for the unprivileged task to use (i.e. a
raw socket), then the user needs to be privileged to re-created the
resource. But it's necessary.
-serge
On Monday 11 August 2008, Serge E. Hallyn wrote:
> One reason is that I suspect that stops us from being able to send that
> data straight to a pipe to compress and/or send on the network, without
> hitting local disk. ?Though if the checkpointfs was ram-based maybe not?
>
> As Oren has pointed out before, passing in an fd means we can pass a
> socket into the syscall.
>
> Using the anon_inodes would also prevent that, but if it makes for a
> cleaner overall solution then I'm not against considering either one
> of course.
>
With anon_inodes, you can still implement splice_read/splice_write,
so you can splice it into a socket.
Arnd <><
On Sun, 2008-08-10 at 07:55 -0700, Jeremy Fitzhardinge wrote:
> Struct pt_regs is not ABI, and can (and has) changed on x86. It's not
> suitable for a checkpoint structure because it only contains the
> registers that the kernel trashes, not all usermode registers (on i386,
> it leaves out %gs, for example). asm-x86/ptrace-abi.h does define stuff
> that's fixed in stone; it expresses it in terms of a register array,
> with constants defining what element is which register.
Thanks for the explanation.
I just want to reduce the coding and maintenance burden here. Xen must
do this for partition mobility, right? Does it define all its own
stuff?
-- Dave
Dave Hansen wrote:
> On Sun, 2008-08-10 at 07:55 -0700, Jeremy Fitzhardinge wrote:
>
>> Struct pt_regs is not ABI, and can (and has) changed on x86. It's not
>> suitable for a checkpoint structure because it only contains the
>> registers that the kernel trashes, not all usermode registers (on i386,
>> it leaves out %gs, for example). asm-x86/ptrace-abi.h does define stuff
>> that's fixed in stone; it expresses it in terms of a register array,
>> with constants defining what element is which register.
>>
>
> Thanks for the explanation.
>
> I just want to reduce the coding and maintenance burden here. Xen must
> do this for partition mobility, right? Does it define all its own
> stuff?
>
You mean save/restore/migrate? Yes, it defines all its own stuff.
Checkpoint-resume on a whole VM is a rather simpler operation than a
subset of processes.
J
On Monday 11 August 2008, Serge E. Hallyn wrote:
> At the restart end, every resource which was checkpointed will have to
> be re-created, and permissions checked against the privilege of the
> task which did the restart. ?We may end up having to make use of the new
> credentials for this.
>
> This could become unpleasant: if an unprivileged task asked a privileged
> helper to create something for the unprivileged task to use (i.e. a
> raw socket), then the user needs to be privileged to re-created the
> resource. ?But it's necessary.
Right. Of course, the hard part here will be to make it obvious to
be safe. Having to check all sorts of permissions means there will
be many opportunities for exploitable bugs.
The best way I can think of for this would be to use existing syscalls
(e.g. sched_setscheduler, setfsuid, ...) from user space whereever
possible and do only the bare minimum for the restart part in the kernel.
Arnd <><
On Mon, 2008-08-11 at 18:53 +0200, Arnd Bergmann wrote:
> The best way I can think of for this would be to use existing syscalls
> (e.g. sched_setscheduler, setfsuid, ...) from user space whereever
> possible and do only the bare minimum for the restart part in the
> kernel.
Well, the current direction is about as far away from that as you can
get, unless we basically call those system calls from inside our new
sys_restart() one. As of now, we're as much work in the kernel as
possible, and doing the bare minimum in userspace. That's what both
Oren and our OpenVZ colleagues have advocated.
-- Dave
I'm trying to figure out this patch set...here's a few things which
have caught my eye in passing.
> +/**
> + * cr_get_fname - return pathname of a given file
> + * @file: file pointer
> + * @buf: buffer for pathname
> + * @n: buffer length (in) and pathname length (out)
> + *
> + * if the buffer provivded by the caller is too small, allocate a new
> + * buffer; caller should call cr_put_pathname() for cleanup
> + */
> +char *cr_get_fname(struct path *path, struct path *root, char *buf,
> int *n) +{
> + char *fname;
> +
> + fname = __d_path(path, root, buf, *n);
> +
> + if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) {
> + if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
This seems like a clunky and error-prone interface - why not just have it
allocate the memory always? But, in this case, cr_get_fname() always seems
to be called with ctx->tbuf, which, in turn, is an order-1 allocation.
Here you're saying that if it's too small, you'll try replacing it with an
order-0 allocation instead. I rather suspect that's not going to help.
> +/* write the checkpoint header */
> +static int cr_write_hdr(struct cr_ctx *ctx)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_head *hh = ctx->tbuf;
> + struct timeval ktv;
> +
> + h.type = CR_HDR_HEAD;
> + h.len = sizeof(*hh);
> + h.id = 0;
> +
> + do_gettimeofday(&ktv);
> +
> + hh->magic = 0x00a2d200;
This magic number is hard-coded in a number of places. Could it maybe
benefit from a macro, which, in turn, could maybe end up in linux/magic.h?
> +/* dump the task_struct of a given task */
> +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
This function is going to break every time somebody changes struct
task_struct. I'm not quite sure how to prevent that. I wonder if the
modversions stuff could somehow be employed to detect changes and make the
build fail?
> +/**
> + * sys_checkpoint - checkpoint a container
> + * @pid: pid of the container init(1) process
> + * @fd: file to which dump the checkpoint image
> + * @flags: checkpoint operation flags
> + */
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long
> flags) +{
> + struct cr_ctx *ctx;
> + struct file *file;
> + int fput_needed;
> + int ret;
> +
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
Like others, I wondered why CAP_SYS_ADMIN was required here. I *still*
wonder, though, how you'll ever be able to do restart without a privilege
check. There must be a thousand ways to compromise a system by messing
with the checkpoint file.
> + file = fget_light(fd, &fput_needed);
> + if (!file)
> + return -EBADF;
Should you maybe check for write access? An attempt to overwrite a
read-only file won't succeed, but you could save a lot of work by just
failing it with a clear code here.
What about the file position? Perhaps there could be a good reason to
checkpoint a process into the middle of a file, don't know.
In general, I don't see a whole lot of locking going on. Is it really
possible to save and restore memory without ever holding mmap_sem?
jon
On Mon, 2008-08-11 at 12:03 -0600, Jonathan Corbet wrote:
> I'm trying to figure out this patch set...here's a few things which
> have caught my eye in passing.
>
> > +/**
> > + * cr_get_fname - return pathname of a given file
> > + * @file: file pointer
> > + * @buf: buffer for pathname
> > + * @n: buffer length (in) and pathname length (out)
> > + *
> > + * if the buffer provivded by the caller is too small, allocate a new
> > + * buffer; caller should call cr_put_pathname() for cleanup
> > + */
> > +char *cr_get_fname(struct path *path, struct path *root, char *buf,
> > int *n) +{
> > + char *fname;
> > +
> > + fname = __d_path(path, root, buf, *n);
> > +
> > + if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) {
> > + if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
>
> This seems like a clunky and error-prone interface - why not just have it
> allocate the memory always? But, in this case, cr_get_fname() always seems
> to be called with ctx->tbuf, which, in turn, is an order-1 allocation.
> Here you're saying that if it's too small, you'll try replacing it with an
> order-0 allocation instead. I rather suspect that's not going to help.
Yeah, it doesn't make much sense on the surface. I would imagine that
this has some use for when we're stacking things up in the ctx->hbuf
rather than just using it as a completely temporary buffer. But, in any
case, it doesn't make sense as it stands now, so I think it needs to be
reworked.
> > +/* write the checkpoint header */
> > +static int cr_write_hdr(struct cr_ctx *ctx)
> > +{
> > + struct cr_hdr h;
> > + struct cr_hdr_head *hh = ctx->tbuf;
> > + struct timeval ktv;
> > +
> > + h.type = CR_HDR_HEAD;
> > + h.len = sizeof(*hh);
> > + h.id = 0;
> > +
> > + do_gettimeofday(&ktv);
> > +
> > + hh->magic = 0x00a2d200;
>
> This magic number is hard-coded in a number of places. Could it maybe
> benefit from a macro, which, in turn, could maybe end up in linux/magic.h?
>
> > +/* dump the task_struct of a given task */
> > +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
>
> This function is going to break every time somebody changes struct
> task_struct. I'm not quite sure how to prevent that. I wonder if the
> modversions stuff could somehow be employed to detect changes and make the
> build fail?
In general, I think any time that we are checkpointing $THING and $THING
changes, the checkpoint will break. It just so happens that all we're
checkpointing here is the task_struct, so $THING == task_struct for
now. :)
The things that *really* worry me are things like when flags change
semantics subtly. Or, let's say a flag is used for two different things
in 2.6.26.4 vs 2.6.27. I'm not sure we're ever going to be in a
position to find and fix up stuff like that.
That's one reason I have been advocating doing checkpoint/restart in
much tinier bits so that we can understand each of them as we go along.
I also think that the *less* we expose to userspace, the better.
> > +/**
> > + * sys_checkpoint - checkpoint a container
> > + * @pid: pid of the container init(1) process
> > + * @fd: file to which dump the checkpoint image
> > + * @flags: checkpoint operation flags
> > + */
> > +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long
> > flags) +{
> > + struct cr_ctx *ctx;
> > + struct file *file;
> > + int fput_needed;
> > + int ret;
> > +
> > + if (!capable(CAP_SYS_ADMIN))
> > + return -EPERM;
>
> Like others, I wondered why CAP_SYS_ADMIN was required here. I *still*
> wonder, though, how you'll ever be able to do restart without a privilege
> check. There must be a thousand ways to compromise a system by messing
> with the checkpoint file.
As with everything else coming from userspace, the checkpoint file
should be completely untrusted. I do think, though, that the ptrace
analogy that Serge?? made is a good one.
> > + file = fget_light(fd, &fput_needed);
> > + if (!file)
> > + return -EBADF;
>
> Should you maybe check for write access? An attempt to overwrite a
> read-only file won't succeed, but you could save a lot of work by just
> failing it with a clear code here.
That's true. I'll take a look and see.
This patch does reach down and use vfs_write() at some point. There
really aren't any other in-kernel users that do this (short of ecryptfs
and plan9fs). That makes me doubt that we're even using a good approach
here.
> What about the file position? Perhaps there could be a good reason to
> checkpoint a process into the middle of a file, don't know.
I think this is a good example of a place where the kernel can let
userspace shoot itself in its foot if it wants. We might also want to
allow things to be sent over fds that don't necessarily have positions,
like sockets or pipes.
> In general, I don't see a whole lot of locking going on. Is it really
> possible to save and restore memory without ever holding mmap_sem?
I personally haven't audited the locking, yet. It is going to be fun!
But, take a look in patch 3/4:
+ /* write the vma's */
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if ((ret = cr_write_vma(ctx, vma)) < 0)
+ break;
+ }
+ up_read(&mm->mmap_sem);
Thanks for the review, Jonathan!
-- Dave
Arnd, Jeremy and Oren,
Thanks for all of the very interesting comments about the ABI.
Considering that we're still *really* early in getting this concept
merged up into mainline, what do you all think we should do now?
My main goal here is just to get everyone to understand the approach
that we're proposing rather than to really fix the interfaces in stone.
I bet we're going to be changing them a lot before these patches
actually get in.
-- Dave
On Monday 11 August 2008, Dave Hansen wrote:
> Thanks for all of the very interesting comments about the ABI. ?
>
> Considering that we're still *really* early in getting this concept
> merged up into mainline, what do you all think we should do now?
I think the two most important aspects here need to be security and
simplicity. If you have to choose between the two, it probably makes
sense to put security first, because loading untrusted data into
the kernel puts you at a significant risk to start with. If you
can show a restart interface that lets regular users restart their
tasks in a way anyone can verify to be secure, that will be a
good indication that you're on the right track.
The other problem that you really need to solve is interface
stability. What you are creating is a binary representation
of many kernel internal data structures, so in our common
rules, you have to make sure that you remain forward and
backward compatible. Simply saying that you need to run
an identical kernel when restarting from a checkpoint is not
enough IMHO.
Some more words on specific interfaces that we have discussed:
The single-file-descriptor approach has the big advantage of
keeping the complexity in one place (the kernel). To be consistent
with other kernel interfaces, I would make the kernel hand out a
file descriptor, not let the user open a file and pass that into
the kernel as you do now.
A new file system is a good idea for many complex interfaces that
make their way into the kernel, but I don't think it will help
in this case.
For checkpointing a single task, or even a task with its children,
a different interface I could imagine would be to have a new
file in procfs per pid that you can read as a pipe giving our
the same data that you currently save in the checkpoint file
descriptor. It does mean that you won't be able to pass flags
down easily (you could write to the pipe before you start reading,
but that's not too nice).
On the restart side, I think the most consistent interface would
be a new binfmt_chkpt implementation that you can use to execve
a checkpoint, just like you execute an ELF file today. The binfmt
can be a module (unlike a syscall), so an administrator that is
afraid of the security implications can just disable it by not
loading the module. In an execve model, the parent process can
set up anything related to credentials as good as it's allowed
to and then let the kernel do the rest.
Arnd <><
Quoting Dave Hansen <[email protected]>:
> Arnd, Jeremy and Oren,
>
> Thanks for all of the very interesting comments about the ABI.
I second that !
>
> Considering that we're still *really* early in getting this concept
> merged up into mainline, what do you all think we should do now?
>
> My main goal here is just to get everyone to understand the approach
> that we're proposing rather than to really fix the interfaces in stone.
> I bet we're going to be changing them a lot before these patches
> actually get in.
I closely follow the valuable feedback and fix the code accordingly.
I propose to extend the proof of concept, to also be able to save and
restore "simple" open files (regular files, directories). My motivation
is twofold:
(1) Providing eough functionality for people to meaningfully play
with the proposed patches and try them.
(2) Demonstrate how I propose to handle shared resources (open files).
The first point is very important because it makes the concept actually
useful to a much broader set of programs than otherwise. I hope this
will attract a larger audience :)
To this end, I have already extended the patchset, and I should be
able to send something working in a day or two.
Oren.
On Mon, 11 Aug 2008 23:47:49 +0200
Arnd Bergmann <[email protected]> wrote:
> The other problem that you really need to solve is interface
> stability. What you are creating is a binary representation
> of many kernel internal data structures, so in our common
> rules, you have to make sure that you remain forward and
> backward compatible. Simply saying that you need to run
> an identical kernel when restarting from a checkpoint is not
> enough IMHO.
OTOH, making one of these checkpoint files go into any 2.6.x kernel
seems like a very high bar, to the point, perhaps, of killing this
feature entirely.
There could be a case for viewing sys_restore() as being a lot like
sys_init_module() - a view into kernel internals that goes beyond the
normal user-space ABI, and beyond the stability guarantee. It might be
possible to create a certain amount of version portability with a
modversions-like mechanism, but it sure seems hard to do better than
that.
jon
On Mon, 2008-08-11 at 17:14 -0600, Jonathan Corbet wrote:
> On Mon, 11 Aug 2008 23:47:49 +0200
> Arnd Bergmann <[email protected]> wrote:
>
> > The other problem that you really need to solve is interface
> > stability. What you are creating is a binary representation
> > of many kernel internal data structures, so in our common
> > rules, you have to make sure that you remain forward and
> > backward compatible. Simply saying that you need to run
> > an identical kernel when restarting from a checkpoint is not
> > enough IMHO.
>
> OTOH, making one of these checkpoint files go into any 2.6.x kernel
> seems like a very high bar, to the point, perhaps, of killing this
> feature entirely.
The OpenVZ dudes like refer to something that Andrew Morton said about
this (paraphrasing...): if we need cross-version restore support, we
can count on userspace to do the conversion.
You can almost think of it like the crashdump processing utility that we
have. Instead of worrying about having the kernel *always* produce the
same crashdump with the same gunk in it, we make userspace do all the
parsing and interpretation.
It also makes it quite possible for a distribution to make a change (say
because of a security fix) in the kernel that changes the checkpoint
format, then to quickly code up the necessary bits for the conversion
program.
-- Dave
Dave Hansen wrote:
> Arnd, Jeremy and Oren,
>
> Thanks for all of the very interesting comments about the ABI.
> Considering that we're still *really* early in getting this concept
> merged up into mainline, what do you all think we should do now?
>
> My main goal here is just to get everyone to understand the approach
> that we're proposing rather than to really fix the interfaces in stone.
> I bet we're going to be changing them a lot before these patches
> actually get in.
>
Yes.
It seems to me that worrying about ABI at this point is a bit premature.
This feature, as it currently stands, is essentially useless for any
practical purpose. Self-checkpointing a single process with no handling
of non-file file descriptors and no proper handling of file
file-descriptors is not very useful.
My understanding that this is basically a prototype for a more useful
multi-process or container-wide checkpoint facility.
While you could try to come up with an extensible file format that would
be able to handle any future extensions, the chances are you'd get it
wrong and need to break file format compatibility anyway.
I'm more interested in seeing a description of how you're doing to
handle things like:
* multiple processes
* pipes
* UNIX domain sockets
* INET sockets (both inter and intra machine)
* unlinked open files
* checkpointing file content
* closed files (ie, files which aren't currently open, but will be
soon, esp tmp files)
* shared memory
* (Peter, what have I forgotten?)
Having gone through this before, I don't think an all-kernel solution
can work except for the most simple cases.
Which, come to think of it, is an important point. What are the
expected use-cases for this feature? Do you really mean
checkpoint/restart? Do you expect to be able to checkpoint a process,
leave it running, then "rewind" by restoring the image? Or does
checkpoint always atomically kill the source process(es)? Are you
expecting to be able to resume on another machine?
Lightweight filesystem checkpointing, such as btrfs provides, would seem
like a powerful mechanism for handling a lot of the filesystem state
problems. It would have been useful when we did this...
J
>>>>> "Jeremy" == Jeremy Fitzhardinge <[email protected]> writes:
Jeremy> Dave Hansen wrote:
>> Arnd, Jeremy and Oren,
>>
Jeremy> * multiple processes * pipes * UNIX domain sockets * INET
Jeremy> sockets (both inter and intra machine) * unlinked open files *
Jeremy> checkpointing file content * closed files (ie, files which
Jeremy> aren't currently open, but will be soon, esp tmp files) *
Jeremy> shared memory * (Peter, what have I forgotten?)
File sharing; multiple threads with wierd sharing arrangements (think:
clone with various parameters, followed by exec in some of the threads
but not others); MERT/system-V shared memory, semaphores and message
queues; devices (audio, framebuffer, etc), HugeTLBFS, numa issues
(pinning, memory layout), processes being debugged (so,
checkpoint.restart a gdb/target pair), futexes, etc., etc. Linux
process state keeps expanding.
Jeremy> Having gone through this before, I don't think an all-kernel
Jeremy> solution can work except for the most simple cases.
I agree ... it's better to put mechanisms into the kernel that can
then be used by a user-space programme to actually do the
checkpointing and restarting.
Beefing up ptrace or fixing /proc to be a real debugging interface
would be a start ... when you can get at *all* the info you need,
quickly and easily, the userspace checkpoint falls out fairly
naturally. You still have to work out an extensible file format to
store stuff, and how to restore all that state you've so lovingly
collected.
Jeremy> Lightweight filesystem checkpointing, such as btrfs provides,
Jeremy> would seem like a powerful mechanism for handling a lot of the
Jeremy> filesystem state problems. It would have been useful when we
Jeremy> did this...
And how! saving bits of files was very timeconsuming.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
Dave Hansen wrote:
> On Mon, 2008-08-11 at 12:03 -0600, Jonathan Corbet wrote:
>> I'm trying to figure out this patch set...here's a few things which
>> have caught my eye in passing.
>>
>>> +/**
>>> + * cr_get_fname - return pathname of a given file
>>> + * @file: file pointer
>>> + * @buf: buffer for pathname
>>> + * @n: buffer length (in) and pathname length (out)
>>> + *
>>> + * if the buffer provivded by the caller is too small, allocate a new
>>> + * buffer; caller should call cr_put_pathname() for cleanup
>>> + */
>>> +char *cr_get_fname(struct path *path, struct path *root, char *buf,
>>> int *n) +{
>>> + char *fname;
>>> +
>>> + fname = __d_path(path, root, buf, *n);
>>> +
>>> + if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) {
>>> + if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
>> This seems like a clunky and error-prone interface - why not just have it
>> allocate the memory always? But, in this case, cr_get_fname() always seems
>> to be called with ctx->tbuf, which, in turn, is an order-1 allocation.
>> Here you're saying that if it's too small, you'll try replacing it with an
>> order-0 allocation instead. I rather suspect that's not going to help.
>
> Yeah, it doesn't make much sense on the surface. I would imagine that
> this has some use for when we're stacking things up in the ctx->hbuf
> rather than just using it as a completely temporary buffer. But, in any
> case, it doesn't make sense as it stands now, so I think it needs to be
> reworked.
Dave is right on the money: in Zap (the equivalent of) cr_get_fname()
may be called with a buffer smaller than PATH_MAX (one page) and hence
the need to allocate ad-hoc. Indeed in the current code this is not the
case (yet?) so I'll go ahead and simplify it.
>
>>> +/* write the checkpoint header */
>>> +static int cr_write_hdr(struct cr_ctx *ctx)
>>> +{
>>> + struct cr_hdr h;
>>> + struct cr_hdr_head *hh = ctx->tbuf;
>>> + struct timeval ktv;
>>> +
>>> + h.type = CR_HDR_HEAD;
>>> + h.len = sizeof(*hh);
>>> + h.id = 0;
>>> +
>>> + do_gettimeofday(&ktv);
>>> +
>>> + hh->magic = 0x00a2d200;
>> This magic number is hard-coded in a number of places. Could it maybe
>> benefit from a macro, which, in turn, could maybe end up in linux/magic.h?
>>
>>> +/* dump the task_struct of a given task */
>>> +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
>> This function is going to break every time somebody changes struct
>> task_struct. I'm not quite sure how to prevent that. I wonder if the
>> modversions stuff could somehow be employed to detect changes and make the
>> build fail?
>
> In general, I think any time that we are checkpointing $THING and $THING
> changes, the checkpoint will break. It just so happens that all we're
> checkpointing here is the task_struct, so $THING == task_struct for
> now. :)
>
> The things that *really* worry me are things like when flags change
> semantics subtly. Or, let's say a flag is used for two different things
> in 2.6.26.4 vs 2.6.27. I'm not sure we're ever going to be in a
> position to find and fix up stuff like that.
One way to reduce the risk is to use an intermediate representation to
kernel native data and properties (e.g. classify VMAs during checkpoint
instead of relying blindly on the flags).
The problem is not so much in restarting a checkpoint image from old
kernel on a new kernel - that can be handled by conversion in user space.
Tracking changes affecting the checkpoint/restart logic - well, if
eventually checkpoint/restart gets to becomes main-stream enough that
developers will be aware of it.
>
> That's one reason I have been advocating doing checkpoint/restart in
> much tinier bits so that we can understand each of them as we go along.
> I also think that the *less* we expose to userspace, the better.
>
>>> +/**
>>> + * sys_checkpoint - checkpoint a container
>>> + * @pid: pid of the container init(1) process
>>> + * @fd: file to which dump the checkpoint image
>>> + * @flags: checkpoint operation flags
>>> + */
>>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long
>>> flags) +{
>>> + struct cr_ctx *ctx;
>>> + struct file *file;
>>> + int fput_needed;
>>> + int ret;
>>> +
>>> + if (!capable(CAP_SYS_ADMIN))
>>> + return -EPERM;
>> Like others, I wondered why CAP_SYS_ADMIN was required here. I *still*
>> wonder, though, how you'll ever be able to do restart without a privilege
>> check. There must be a thousand ways to compromise a system by messing
>> with the checkpoint file.
>
> As with everything else coming from userspace, the checkpoint file
> should be completely untrusted. I do think, though, that the ptrace
> analogy that Serge?? made is a good one.
The only reason I made the analogy without actually implementing it is lack
of time and familiarity with the ptrace permission world.
>
>>> + file = fget_light(fd, &fput_needed);
>>> + if (!file)
>>> + return -EBADF;
>> Should you maybe check for write access? An attempt to overwrite a
>> read-only file won't succeed, but you could save a lot of work by just
>> failing it with a clear code here.
>
> That's true. I'll take a look and see.
>
> This patch does reach down and use vfs_write() at some point. There
> really aren't any other in-kernel users that do this (short of ecryptfs
> and plan9fs). That makes me doubt that we're even using a good approach
> here.
>
>> What about the file position? Perhaps there could be a good reason to
>> checkpoint a process into the middle of a file, don't know.
>
> I think this is a good example of a place where the kernel can let
> userspace shoot itself in its foot if it wants. We might also want to
> allow things to be sent over fds that don't necessarily have positions,
> like sockets or pipes.
>
>> In general, I don't see a whole lot of locking going on. Is it really
>> possible to save and restore memory without ever holding mmap_sem?
>
> I personally haven't audited the locking, yet. It is going to be fun!
There is some optimistic locking (mmap_sem), improved in the next version.
Thanks,
Oren.
Quoting Peter Chubb ([email protected]):
> >>>>> "Jeremy" == Jeremy Fitzhardinge <[email protected]> writes:
>
> Jeremy> Dave Hansen wrote:
> >> Arnd, Jeremy and Oren,
> >>
>
>
> Jeremy> * multiple processes * pipes * UNIX domain sockets * INET
> Jeremy> sockets (both inter and intra machine) * unlinked open files *
> Jeremy> checkpointing file content * closed files (ie, files which
> Jeremy> aren't currently open, but will be soon, esp tmp files) *
> Jeremy> shared memory * (Peter, what have I forgotten?)
>
> File sharing; multiple threads with wierd sharing arrangements (think:
> clone with various parameters, followed by exec in some of the threads
> but not others); MERT/system-V shared memory, semaphores and message
> queues; devices (audio, framebuffer, etc), HugeTLBFS, numa issues
> (pinning, memory layout), processes being debugged (so,
> checkpoint.restart a gdb/target pair), futexes, etc., etc. Linux
> process state keeps expanding.
>
> Jeremy> Having gone through this before, I don't think an all-kernel
> Jeremy> solution can work except for the most simple cases.
>
> I agree ... it's better to put mechanisms into the kernel that can
> then be used by a user-space programme to actually do the
> checkpointing and restarting.
>
> Beefing up ptrace or fixing /proc to be a real debugging interface
> would be a start ... when you can get at *all* the info you need,
Except we don't really want to export all the info you need for a
complete restartable checkpoint. And especially not make it
generally writable.
We have also started down that path using ptrace (see cryo, at
git://git.sr71.net/~hallyn/cryodev.git).
Right before the containers mini-summit, where the general agreement was
that a complete in-kernel solution ought to be pursued, I had tried
a restart using a binary format that read a checkpoint file and used
cryo (userspace using ptrace) for the rest of the restart, only
because there was no other reasonable way to set tsk->did_exec on
restart.
> quickly and easily, the userspace checkpoint falls out fairly
> naturally. You still have to work out an extensible file format to
> store stuff, and how to restore all that state you've so lovingly
> collected.
>
> Jeremy> Lightweight filesystem checkpointing, such as btrfs provides,
> Jeremy> would seem like a powerful mechanism for handling a lot of the
> Jeremy> filesystem state problems. It would have been useful when we
> Jeremy> did this...
>
> And how! saving bits of files was very timeconsuming.
Yes, we're looking forward to using btrfs' snapshots :)
-serge
On Mon, 2008-08-11 at 16:38 -0700, Jeremy Fitzhardinge wrote:
> This feature, as it currently stands, is essentially useless for any
> practical purpose. Self-checkpointing a single process with no handling
> of non-file file descriptors and no proper handling of file
> file-descriptors is not very useful.
>
> My understanding that this is basically a prototype for a more useful
> multi-process or container-wide checkpoint facility.
Yes, that's exactly it. We're diverging from discussing the important
bits as it is, and I think we'd do that more and more with extra
code. :)
> While you could try to come up with an extensible file format that would
> be able to handle any future extensions, the chances are you'd get it
> wrong and need to break file format compatibility anyway.
Amen to that. I won't speak for the rest of the whackos interested in
this stuff, but I *KNOW* I'm not clever enough to pull it off.
> I'm more interested in seeing a description of how you're doing to
> handle things like:
>
> * multiple processes
> * pipes
> * UNIX domain sockets
> * INET sockets (both inter and intra machine)
> * unlinked open files
> * checkpointing file content
> * closed files (ie, files which aren't currently open, but will be
> soon, esp tmp files)
> * shared memory
> * (Peter, what have I forgotten?)
>
> Having gone through this before, I don't think an all-kernel solution
> can work except for the most simple cases.
So, there's a lot of stuff there. The networking stuff is way out of my
league, so I'll cc Daniel and make him answer. :)
All of the other stuff has been done in various in-kernel
implementations. OpenVZ, IBM's Metacluster, Zap (Oren's work at
Columbia). Most of it *can* be done from userspace, but some of it is
very painful. There are some good OLS papers describing most of these
things. Zap might have had one or two academic papers written about it.
Maybe. ;)
Unlinked files, for instance, are actually available in /proc. You can
freeze the app, write a helper that opens /proc/1234/fd, then copies its
contents to a linked file (ooooh, with splice!) Anyway, if we can do it
in userspace, we can surely do it in the kernel.
I'm not sure what you mean by "closed files". Either the app has a fd,
it doesn't, or it is in sys_open() somewhere. We have to get the app
into a quiescent state before we can checkpoint, so we basically just
say that we won't checkpoint things that are *in* the kernel.
Is there anything specific you are thinking of that particularly worries
you? I could write pages on the list you have there.
> Which, come to think of it, is an important point. What are the
> expected use-cases for this feature? Do you really mean
> checkpoint/restart? Do you expect to be able to checkpoint a process,
> leave it running, then "rewind" by restoring the image? Or does
> checkpoint always atomically kill the source process(es)? Are you
> expecting to be able to resume on another machine?
Yes.
We all want different things, and there are a lot of people interested
in this stuff. So, I think all of what you've mentioned above are
goals, at least long term. Some, *really* long term.
I don't want to get into a full virtualization vs. containers debate,
but we also want it for all the same reasons that you migrate Xen
partitions.
> Lightweight filesystem checkpointing, such as btrfs provides, would seem
> like a powerful mechanism for handling a lot of the filesystem state
> problems. It would have been useful when we did this...
Yup. We were just chatting about that with some filesystem folks last
week. But, as the OpenVZ dudes like to mention, the poor man's way of
moving filesystem snapshots around is always rsync.
-- Dave
On Tue, 2008-08-12 at 09:54 +1000, Peter Chubb wrote:
> Jeremy> * multiple processes * pipes * UNIX domain sockets * INET
> Jeremy> sockets (both inter and intra machine) * unlinked open files *
> Jeremy> checkpointing file content * closed files (ie, files which
> Jeremy> aren't currently open, but will be soon, esp tmp files) *
> Jeremy> shared memory * (Peter, what have I forgotten?)
>
> File sharing; multiple threads with wierd sharing arrangements (think:
> clone with various parameters, followed by exec in some of the threads
> but not others); MERT/system-V shared memory, semaphores and message
> queues; devices (audio, framebuffer, etc), HugeTLBFS, numa issues
> (pinning, memory layout), processes being debugged (so,
> checkpoint.restart a gdb/target pair), futexes, etc., etc. Linux
> process state keeps expanding.
Yep, these are all challenges. If you have some really specific
questions, or things you truly think can't be done, please speak up.
But, I really don't see any show stoppers in your list.
We also plan to do this incrementally. The first consumers are likely
to be dumb, simple HPC apps that don't have real hardware like audio or
video. Eventually, we'll get to real hardware like infiniband (ugh) or
audio. Eventually.
(Actually futexes aren't that bad because they don't keep state
in-kernel)
-- Dave
Dave Hansen wrote:
>> I'm more interested in seeing a description of how you're doing to
>> handle things like:
>>
>> * multiple processes
>> * pipes
>> * UNIX domain sockets
>> * INET sockets (both inter and intra machine)
>> * unlinked open files
>> * checkpointing file content
>> * closed files (ie, files which aren't currently open, but will be
>> soon, esp tmp files)
>> * shared memory
>> * (Peter, what have I forgotten?)
>>
>> Having gone through this before, I don't think an all-kernel solution
>> can work except for the most simple cases.
>>
>
> So, there's a lot of stuff there. The networking stuff is way out of my
> league, so I'll cc Daniel and make him answer. :)
>
Inter-machine networking stuff is hard because its outside the
checkpointed set, so the checkpoint is observable. Migration is easier,
in principle, because you might be able to shift the connection endpoint
without bringing it down. Dealing with networking within your
checkpointed set is just fiddly, particularly remembering and restoring
all the details of things like urgent messages, on-the-fly file
descriptors, packet boundaries, etc.
> Unlinked files, for instance, are actually available in /proc. You can
> freeze the app, write a helper that opens /proc/1234/fd, then copies its
> contents to a linked file (ooooh, with splice!) Anyway, if we can do it
> in userspace, we can surely do it in the kernel.
>
Sure, there's no inherent problem. But do you imagine including the
file contents within your checkpoint image, or would they be saved
separately?
> I'm not sure what you mean by "closed files". Either the app has a fd,
> it doesn't, or it is in sys_open() somewhere. We have to get the app
> into a quiescent state before we can checkpoint, so we basically just
> say that we won't checkpoint things that are *in* the kernel.
>
It's common for an app to write a tmp file, close it, and then open it a
bit later expecting to find the content it just wrote. If you
checkpoint-kill it in the interim, reboot (clearing out /tmp) and then
resume, then it will lose its tmp file. There's no explicit connection
between the process and its potential working set of files. We had to
deal with it by setting a bunch of policy files to tell the
checkpoint/restart system what filename patterns it had to look out
for. But if you just checkpoint the whole filesystem state along with
the process(es), then perhaps it isn't an issue.
> Is there anything specific you are thinking of that particularly worries
> you? I could write pages on the list you have there.
>
No, that's the problem; it all worries me. It's a big problem space.
>> Which, come to think of it, is an important point. What are the
>> expected use-cases for this feature? Do you really mean
>> checkpoint/restart? Do you expect to be able to checkpoint a process,
>> leave it running, then "rewind" by restoring the image? Or does
>> checkpoint always atomically kill the source process(es)? Are you
>> expecting to be able to resume on another machine?
>>
>
> Yes.
>
> We all want different things, and there are a lot of people interested
> in this stuff. So, I think all of what you've mentioned above are
> goals, at least long term. Some, *really* long term.
>
So, in other words: whoever wants to work on it gets to define (their)
goals. Fair enough.
> I don't want to get into a full virtualization vs. containers debate,
> but we also want it for all the same reasons that you migrate Xen
> partitions.
>
No, I don't have any real opinion about containers vs virtualization. I
think they're quite distinct solutions for distinct problems.
But I was involved in the design and implementation of a
checkpoint-restart system (along with Peter Chubb), and have the scars
to prove it. We implemented it for IRIX; we called it Hibernator, and
licensed it to SGI for a while (I don't remember what name they marketed
it under). The list of problems that Peter and I mentioned are ones we
had to solve (or, in some cases, failed to solve) to get a workable system.
J
On Tue, 2008-08-12 at 09:32 -0700, Jeremy Fitzhardinge wrote:
> Inter-machine networking stuff is hard because its outside the
> checkpointed set, so the checkpoint is observable. Migration is easier,
> in principle, because you might be able to shift the connection endpoint
> without bringing it down. Dealing with networking within your
> checkpointed set is just fiddly, particularly remembering and restoring
> all the details of things like urgent messages, on-the-fly file
> descriptors, packet boundaries, etc.
All true. Hard stuff.
The IBM product works partly by limiting migrations to occurring on a
single physical ethernet network. Each container gets its own IP and
MAC address. The socket state is checkpointed quite fully and moved
along with the IP.
> > Unlinked files, for instance, are actually available in /proc. You can
> > freeze the app, write a helper that opens /proc/1234/fd, then copies its
> > contents to a linked file (ooooh, with splice!) Anyway, if we can do it
> > in userspace, we can surely do it in the kernel.
>
> Sure, there's no inherent problem. But do you imagine including the
> file contents within your checkpoint image, or would they be saved
> separately?
Me, personally, I think I'd probably "re-link" the thing, mark it as
such, ship it across like a normal file, then unlink it after the
restore. I don't know what we'd choose when actually implementing it.
> > I'm not sure what you mean by "closed files". Either the app has a fd,
> > it doesn't, or it is in sys_open() somewhere. We have to get the app
> > into a quiescent state before we can checkpoint, so we basically just
> > say that we won't checkpoint things that are *in* the kernel.
>
> It's common for an app to write a tmp file, close it, and then open it a
> bit later expecting to find the content it just wrote. If you
> checkpoint-kill it in the interim, reboot (clearing out /tmp) and then
> resume, then it will lose its tmp file. There's no explicit connection
> between the process and its potential working set of files.
I respectfully disagree. The number one prerequisite for
checkpoint/restart is isolation. Xen just happens to get this for free.
So, instead of saying that there's no explicit connection between the
process and its working set, ask yourself how we make a connection.
In this case, we can do it with a filesystem (mount) namespace. Each
container that we might want to checkpoint must have its writable
filesystems contained to a private set that are not shared with other
containers. Things like union mounts would help here, but aren't
necessarily required. They just make it more efficient.
> We had to
> deal with it by setting a bunch of policy files to tell the
> checkpoint/restart system what filename patterns it had to look out
> for. But if you just checkpoint the whole filesystem state along with
> the process(es), then perhaps it isn't an issue.
Right. We just start with "everybody has their own disk" which is slow
and crappy and optimize it from there.
> > Is there anything specific you are thinking of that particularly worries
> > you? I could write pages on the list you have there.
>
> No, that's the problem; it all worries me. It's a big problem space.
It's almost as big of a problem as trying to virtualize entire machines
and expecting them to run as fast as native. :)
> > I don't want to get into a full virtualization vs. containers debate,
> > but we also want it for all the same reasons that you migrate Xen
> > partitions.
> >
> No, I don't have any real opinion about containers vs virtualization. I
> think they're quite distinct solutions for distinct problems.
>
> But I was involved in the design and implementation of a
> checkpoint-restart system (along with Peter Chubb), and have the scars
> to prove it. We implemented it for IRIX; we called it Hibernator, and
> licensed it to SGI for a while (I don't remember what name they marketed
> it under). The list of problems that Peter and I mentioned are ones we
> had to solve (or, in some cases, failed to solve) to get a workable system.
Cool! I didn't know you guys did the IRIX implementation. I'm sure you
guys got a lot farther than any of us are. Did you guys ever write any
papers or anything on it? I'd be interested in more information.
-- Dave
Dave Hansen wrote:
>>> I'm not sure what you mean by "closed files". Either the app has a fd,
>>> it doesn't, or it is in sys_open() somewhere. We have to get the app
>>> into a quiescent state before we can checkpoint, so we basically just
>>> say that we won't checkpoint things that are *in* the kernel.
>>>
>> It's common for an app to write a tmp file, close it, and then open it a
>> bit later expecting to find the content it just wrote. If you
>> checkpoint-kill it in the interim, reboot (clearing out /tmp) and then
>> resume, then it will lose its tmp file. There's no explicit connection
>> between the process and its potential working set of files.
>>
>
> I respectfully disagree. The number one prerequisite for
> checkpoint/restart is isolation. Xen just happens to get this for free.
>
(I don't have my Xen hat on at all for this thread.)
> So, instead of saying that there's no explicit connection between the
> process and its working set, ask yourself how we make a connection.
>
> In this case, we can do it with a filesystem (mount) namespace. Each
> container that we might want to checkpoint must have its writable
> filesystems contained to a private set that are not shared with other
> containers. Things like union mounts would help here, but aren't
> necessarily required. They just make it more efficient.
>
We were dealing with checkpointing random sets of processes, and that
posed all sorts of problems. Filesystem namespace was one, the pid
namespace was another. Doing checkpointing at the container-level
granularity definitely solves a lot of problems.
>>> Is there anything specific you are thinking of that particularly worries
>>> you? I could write pages on the list you have there.
>>>
>> No, that's the problem; it all worries me. It's a big problem space.
>>
>
> It's almost as big of a problem as trying to virtualize entire machines
> and expecting them to run as fast as native. :)
>
No, it's much harder. Hardware is relatively simple and immutable
compared to kernel and process state ;)
> Cool! I didn't know you guys did the IRIX implementation. I'm sure you
> guys got a lot farther than any of us are. Did you guys ever write any
> papers or anything on it? I'd be interested in more information.
>
Yeah, there was a paper, but it looks like the internet has lost it. It
was at
http://www.csu.edu.au/special/conference/apwww95/.papers95/cmaltby/cmaltby.ps
http://www.csu.edu.au/special/conference/apwww95/sept-all.html has
mention of the paper.
J
Oren Laadan wrote:
>
> Dave Hansen wrote:
>> On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote:
>>> On Friday 08 August 2008, Dave Hansen wrote:
>>>> + hh->magic = 0x00a2d200;
>>>> + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
>>>> + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
>>>> + hh->patch = (LINUX_VERSION_CODE) & 0xff;
>> ...
>>>> +}
>>> Do you rely on the kernel version in order to determine the format
>>> of the binary data, or is it just informational?
>>>
>>> If you think the format can change in incompatible ways, you
>>> probably need something more specific than the version number
>>> these days, because there are just so many different trees with
>>> the same numbers.
>> Yeah, this is very true. My guess is that we'll need something like
>> what we do with modversions.
>
> Exactly. The header should eventually contain sufficient information
> to describe the kernel version, configuration, compiler, cpu (arch and
> capabilities), and checkpoint code version.
Sorry for late response - I was on vacation till Wednesday.
I am opposed against having *explicit* information about the kernel
configuration inside the image.
I see this like we store object images in a file, and these images do
*not* change depending on the .config. But instead of this, at the time
of restore we may *only* detect whether we can restore this type of
object or not.
E.g. consider we are saving a container image on ipv6 node and trying
to restore from it on the one without the ipv6. In that case we *may*
have some object of for example CKPT_IPV6_IFA type of CLPT_IPV6_SOCK_INFO
and fail the restoration process when finding such in an input file. But
what we should *not* do is to write any information about whether we had
the CONFIG_IPV6 turned on on the dumping side and check for this on the
restoring side.
(Sorry, if this question is already settled, but the discussion thread
built by my mailer is a bit messy, so I suspect I could miss some part
of the threads)
> How would you suggest to identify the origin tree with an identifier
> (or a text field) in the header ?
Hi!
> > > > I have to wonder if this is just a symptom of us trying to do this the
> > > > wrong way. We're trying to talk the kernel into writing internal gunk
> > > > into a FD. You're right, it is like a splice where one end of the pipe
> > > > is in the kernel.
> > > >
> > > > Any thoughts on a better way to do this?
> > >
> > > Maybe you can invert the logic and let the new syscalls create a file
> > > descriptor, and then have user space read or splice the checkpoint
> > > data from it, and restore it by writing to the file descriptor.
> > > It's probably easy to do using anon_inode_getfd() and would solve this
> > > problem, but at the same time make checkpointing the current thread
> > > hard if not impossible.
> >
> > Yeah, it does seem kinda backwards. But, instead of even having to
> > worry about the anon_inode stuff, why don't we just put it in a fs like
> > everything else? checkpointfs!
>
> One reason is that I suspect that stops us from being able to send that
> data straight to a pipe to compress and/or send on the network, without
> hitting local disk. Though if the checkpointfs was ram-based maybe not?
>
> As Oren has pointed out before, passing in an fd means we can pass a
> socket into the syscall.
If you do pass a socket, will it handle blocking correctly? Getting
deadlocked task would be bad. What happens if I try to snapshot into
/proc/self/fd/0 ? Or maybe restore from /proc/cmdline?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Thu, 2008-08-14 at 07:53 +0200, Pavel Machek wrote:
> > As Oren has pointed out before, passing in an fd means we can pass a
> > socket into the syscall.
>
> If you do pass a socket, will it handle blocking correctly? Getting
> deadlocked task would be bad. What happens if I try to snapshot into
> /proc/self/fd/0 ? Or maybe restore from /proc/cmdline?
Heh, that's a good point. What was the other code where we kept coming
up with deadlocks like that? Anyone remember?
-- Dave
On Thu, 2008-08-14 at 12:09 +0400, Pavel Emelyanov wrote:
> E.g. consider we are saving a container image on ipv6 node and trying
> to restore from it on the one without the ipv6. In that case we *may*
> have some object of for example CKPT_IPV6_IFA type of CLPT_IPV6_SOCK_INFO
> and fail the restoration process when finding such in an input file. But
> what we should *not* do is to write any information about whether we had
> the CONFIG_IPV6 turned on on the dumping side and check for this on the
> restoring side.
The only problem I can see with this is that you lose efficiency,
especially when you have to build your checkpoint image with lots of
things that are config-specific.
The approach sounds like a good one in theory, but I'm a bit skeptical
that we could stick to it in practice, in a mainline kernel where there
are billions of config options. It is definitely something to strive
for, though. Good point!
-- Dave
> diff -puN /dev/null ckpt/ckpt_hdr.h
> --- /dev/null 2007-04-11 11:48:27.000000000 -0700
> +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-07 15:37:22.000000000 -0700
> @@ -0,0 +1,69 @@
> +/*
> + * Generic container checkpoint-restart
> + *
> + * Copyright (C) 2008 Oren Laadan
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file COPYING in the main directory of the Linux
> + * distribution for more details.
> + */
> +
> +#include <linux/types.h>
> +
> +struct cr_hdr {
> + __s16 type;
> + __s16 len;
> + __u32 id;
> +};
Sorry for probably being out-of-date again, but isn't it better to
put these headers in the include/linux and export them to the user
space?
Why? Because we'll need some image-dumping tool (let alone the
image converting one for compatibility purposes) and these tools
would require to know how the image looks like.
Thanks,
Pavel
On Mon, 2008-08-18 at 13:26 +0400, Pavel Emelyanov wrote:
> > diff -puN /dev/null ckpt/ckpt_hdr.h
> > --- /dev/null 2007-04-11 11:48:27.000000000 -0700
> > +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-07 15:37:22.000000000 -0700
> > @@ -0,0 +1,69 @@
> > +/*
> > + * Generic container checkpoint-restart
> > + *
> > + * Copyright (C) 2008 Oren Laadan
> > + *
> > + * This file is subject to the terms and conditions of the GNU General Public
> > + * License. See the file COPYING in the main directory of the Linux
> > + * distribution for more details.
> > + */
> > +
> > +#include <linux/types.h>
> > +
> > +struct cr_hdr {
> > + __s16 type;
> > + __s16 len;
> > + __u32 id;
> > +};
>
> Sorry for probably being out-of-date again, but isn't it better to
> put these headers in the include/linux and export them to the user
> space?
>
> Why? Because we'll need some image-dumping tool (let alone the
> image converting one for compatibility purposes) and these tools
> would require to know how the image looks like.
What's the deal with headers being exported these days? Don't we always
have to sanitize them before we ship them over to userspace anyway?
-- Dave
Pavel Machek wrote:
> Hi!
>
>>>>> I have to wonder if this is just a symptom of us trying to do this the
>>>>> wrong way. We're trying to talk the kernel into writing internal gunk
>>>>> into a FD. You're right, it is like a splice where one end of the pipe
>>>>> is in the kernel.
>>>>>
>>>>> Any thoughts on a better way to do this?
>>>> Maybe you can invert the logic and let the new syscalls create a file
>>>> descriptor, and then have user space read or splice the checkpoint
>>>> data from it, and restore it by writing to the file descriptor.
>>>> It's probably easy to do using anon_inode_getfd() and would solve this
>>>> problem, but at the same time make checkpointing the current thread
>>>> hard if not impossible.
>>> Yeah, it does seem kinda backwards. But, instead of even having to
>>> worry about the anon_inode stuff, why don't we just put it in a fs like
>>> everything else? checkpointfs!
>> One reason is that I suspect that stops us from being able to send that
>> data straight to a pipe to compress and/or send on the network, without
>> hitting local disk. Though if the checkpointfs was ram-based maybe not?
>>
>> As Oren has pointed out before, passing in an fd means we can pass a
>> socket into the syscall.
>
> If you do pass a socket, will it handle blocking correctly? Getting
> deadlocked task would be bad. What happens if I try to snapshot into
> /proc/self/fd/0 ? Or maybe restore from /proc/cmdline?
Hmmm... these are good points.
Keep in mind that our principal goal is to checkpoint a whole container,
rather then a task to checkpoint itself (which is a by-product). Of course
your comments apply to a whole container as well.
In both cases, I don't think that blocking on a socket is a problem; the
checkpointer will enter a TASK_INTERRUPTIBLE state. Where is the deadlock ?
Writing or reading to/from /proc/self/... likewise - the programmer must
understand the implications, or the program won't work as expected. I don't
see a possible deadlock here, though.
For example - writing to /proc/self/fd/0 is ok; the state of fd[0] of that
task will be captured at some point in the middle of the checkpoint, so
after restart one cannot assume anything about the file position; the rest
should work.
Oren.
Jeremy Fitzhardinge wrote:
> Dave Hansen wrote:
>>>> I'm not sure what you mean by "closed files". Either the app has a fd,
>>>> it doesn't, or it is in sys_open() somewhere. We have to get the app
>>>> into a quiescent state before we can checkpoint, so we basically just
>>>> say that we won't checkpoint things that are *in* the kernel.
>>>>
>>> It's common for an app to write a tmp file, close it, and then open it a
>>> bit later expecting to find the content it just wrote. If you
>>> checkpoint-kill it in the interim, reboot (clearing out /tmp) and then
>>> resume, then it will lose its tmp file. There's no explicit connection
>>> between the process and its potential working set of files.
>>>
>> I respectfully disagree. The number one prerequisite for
>> checkpoint/restart is isolation. Xen just happens to get this for free.
>>
>
> (I don't have my Xen hat on at all for this thread.)
>
>> So, instead of saying that there's no explicit connection between the
>> process and its working set, ask yourself how we make a connection.
>>
>> In this case, we can do it with a filesystem (mount) namespace. Each
>> container that we might want to checkpoint must have its writable
>> filesystems contained to a private set that are not shared with other
>> containers. Things like union mounts would help here, but aren't
>> necessarily required. They just make it more efficient.
>>
>
> We were dealing with checkpointing random sets of processes, and that
> posed all sorts of problems. Filesystem namespace was one, the pid
> namespace was another. Doing checkpointing at the container-level
> granularity definitely solves a lot of problems.
>
>>>> Is there anything specific you are thinking of that particularly worries
>>>> you? I could write pages on the list you have there.
>>>>
>>> No, that's the problem; it all worries me. It's a big problem space.
>>>
>> It's almost as big of a problem as trying to virtualize entire machines
>> and expecting them to run as fast as native. :)
>>
>
> No, it's much harder. Hardware is relatively simple and immutable
> compared to kernel and process state ;)
>
>> Cool! I didn't know you guys did the IRIX implementation. I'm sure you
>> guys got a lot farther than any of us are. Did you guys ever write any
>> papers or anything on it? I'd be interested in more information.
>>
>
> Yeah, there was a paper, but it looks like the internet has lost it. It
> was at
> http://www.csu.edu.au/special/conference/apwww95/.papers95/cmaltby/cmaltby.ps
> http://www.csu.edu.au/special/conference/apwww95/sept-all.html has
> mention of the paper.
>
you can find it here:
http://ertos.nicta.com.au/publications/papers/Maltby_Chubb_95.pdf
Oren.
Dave Hansen wrote:
> On Tue, 2008-08-12 at 09:32 -0700, Jeremy Fitzhardinge wrote:
>> Inter-machine networking stuff is hard because its outside the
>> checkpointed set, so the checkpoint is observable. Migration is easier,
>> in principle, because you might be able to shift the connection endpoint
>> without bringing it down. Dealing with networking within your
>> checkpointed set is just fiddly, particularly remembering and restoring
>> all the details of things like urgent messages, on-the-fly file
>> descriptors, packet boundaries, etc.
>
> All true. Hard stuff.
>
> The IBM product works partly by limiting migrations to occurring on a
> single physical ethernet network. Each container gets its own IP and
> MAC address. The socket state is checkpointed quite fully and moved
> along with the IP.
>
>>> Unlinked files, for instance, are actually available in /proc. You can
>>> freeze the app, write a helper that opens /proc/1234/fd, then copies its
>>> contents to a linked file (ooooh, with splice!) Anyway, if we can do it
>>> in userspace, we can surely do it in the kernel.
>> Sure, there's no inherent problem. But do you imagine including the
>> file contents within your checkpoint image, or would they be saved
>> separately?
>
> Me, personally, I think I'd probably "re-link" the thing, mark it as
> such, ship it across like a normal file, then unlink it after the
> restore. I don't know what we'd choose when actually implementing it.
Re-linking works well when the file system supports that - some do not
allow this, in which case you need to silently rename instead of really
un-linking (even with NFS), or copy the entire contents.
Of course, you also need a snapshot of the file system in case it changes
after the checkpoint is taken, or take other measures. We can safely
defer addressing this for later.
>
>>> I'm not sure what you mean by "closed files". Either the app has a fd,
>>> it doesn't, or it is in sys_open() somewhere. We have to get the app
>>> into a quiescent state before we can checkpoint, so we basically just
>>> say that we won't checkpoint things that are *in* the kernel.
>> It's common for an app to write a tmp file, close it, and then open it a
>> bit later expecting to find the content it just wrote. If you
>> checkpoint-kill it in the interim, reboot (clearing out /tmp) and then
>> resume, then it will lose its tmp file. There's no explicit connection
>> between the process and its potential working set of files.
>
> I respectfully disagree. The number one prerequisite for
> checkpoint/restart is isolation. Xen just happens to get this for free.
> So, instead of saying that there's no explicit connection between the
> process and its working set, ask yourself how we make a connection.
>
> In this case, we can do it with a filesystem (mount) namespace. Each
> container that we might want to checkpoint must have its writable
> filesystems contained to a private set that are not shared with other
> containers. Things like union mounts would help here, but aren't
> necessarily required. They just make it more efficient.
>
>> We had to
>> deal with it by setting a bunch of policy files to tell the
>> checkpoint/restart system what filename patterns it had to look out
>> for. But if you just checkpoint the whole filesystem state along with
>> the process(es), then perhaps it isn't an issue.
>
> Right. We just start with "everybody has their own disk" which is slow
> and crappy and optimize it from there.
Yep.
[SNIP]
Oren.
On Wed, 2008-08-20 at 17:54 -0400, Oren Laadan wrote:
> > Me, personally, I think I'd probably "re-link" the thing, mark it as
> > such, ship it across like a normal file, then unlink it after the
> > restore. I don't know what we'd choose when actually implementing
> it.
>
> Re-linking works well when the file system supports that - some do not
> allow this, in which case you need to silently rename instead of really
> un-linking (even with NFS), or copy the entire contents.
Yeah, it will certainly be fs-dependent.
This might be a good application for splice.
open("/tmp/linked-newfile", O_RDONLY, perms);
splice(unlinked_fd, NULL, new_fd, NULL, MAX_INT, SPLICE_F_MOVE);
I'm not sure if it can re-use the blocks on the fs for this, but it
probably doesn't matter.
-- Dave
Arnd Bergmann wrote:
> On Monday 11 August 2008, Dave Hansen wrote:
>> Thanks for all of the very interesting comments about the ABI.
>>
>> Considering that we're still *really* early in getting this concept
>> merged up into mainline, what do you all think we should do now?
>
> I think the two most important aspects here need to be security and
> simplicity. If you have to choose between the two, it probably makes
> sense to put security first, because loading untrusted data into
> the kernel puts you at a significant risk to start with. If you
> can show a restart interface that lets regular users restart their
> tasks in a way anyone can verify to be secure, that will be a
> good indication that you're on the right track.
>
> The other problem that you really need to solve is interface
> stability. What you are creating is a binary representation
> of many kernel internal data structures, so in our common
> rules, you have to make sure that you remain forward and
> backward compatible. Simply saying that you need to run
> an identical kernel when restarting from a checkpoint is not
> enough IMHO.
>
quoting:
> There could be a case for viewing sys_restore() as being a lot like
> sys_init_module() - a view into kernel internals that goes beyond the
> normal user-space ABI, and beyond the stability guarantee. It might be
> possible to create a certain amount of version portability with a
> modversions-like mechanism, but it sure seems hard to do better than
> that.
>
> jon
Extending this view in the context of security - we can require sysadmin
privilege to restart, and then sysadmin is responsible for the contents
of the file. The kernel will ensure the the data isn't corrupted. Much
like with loading a kenrel module - the admin may load any sort of crap.
Then, sysadmin may, for instance, add a signature on a checkpointed file
to verify it's integrity.
(Well, one problem with this scheme in the context of self-checkpoint
would be - who can be trusted to generate the signature in that case).
> Some more words on specific interfaces that we have discussed:
>
> The single-file-descriptor approach has the big advantage of
> keeping the complexity in one place (the kernel). To be consistent
> with other kernel interfaces, I would make the kernel hand out a
> file descriptor, not let the user open a file and pass that into
> the kernel as you do now.
>
> A new file system is a good idea for many complex interfaces that
> make their way into the kernel, but I don't think it will help
> in this case.
>
> For checkpointing a single task, or even a task with its children,
> a different interface I could imagine would be to have a new
> file in procfs per pid that you can read as a pipe giving our
> the same data that you currently save in the checkpoint file
> descriptor. It does mean that you won't be able to pass flags
> down easily (you could write to the pipe before you start reading,
> but that's not too nice).
Using a single handle (crid or a special file descriptor) to identify
the whole checkpoint is very useful - to be able to stream it (eg. over
the network, or through filters). It is also very important for future
features and optimizations. For example, to reduce downtime of the
application during checkpoint, one can use COW for dirty pages, and
only write-back the entire data after the application resumes execution.
Or imagine a use-case where one would like to keep the entire checkpoint
in memory. These are pretty hard to do if you split the handling between
multiple files or handles.
>
> On the restart side, I think the most consistent interface would
> be a new binfmt_chkpt implementation that you can use to execve
> a checkpoint, just like you execute an ELF file today. The binfmt
> can be a module (unlike a syscall), so an administrator that is
> afraid of the security implications can just disable it by not
> loading the module. In an execve model, the parent process can
> set up anything related to credentials as good as it's allowed
> to and then let the kernel do the rest.
This is an interesting idea but not without its problems. In particular,
a successful execve() by one thread destroys all the others. Also, it
isn't clear how this can work with pre-copying and live-migration; And
finally, I'm not sure how to handle shared objects in this manner.
As for kernel module - it is easy to implement most of the checkpoint
restart functionality in a kernel module, leaving only the syscall stubs
in the kernel.
Oren.
On Thursday 21 August 2008, Oren Laadan wrote:
>
> Arnd Bergmann wrote:
> Extending this view in the context of security - we can require sysadmin
> privilege to restart, and then sysadmin is responsible for the contents
> of the file. The kernel will ensure the the data isn't corrupted. Much
> like with loading a kenrel module - the admin may load any sort of crap.
> Then, sysadmin may, for instance, add a signature on a checkpointed file
> to verify it's integrity.
>
> (Well, one problem with this scheme in the context of self-checkpoint
> would be - who can be trusted to generate the signature in that case).
Sorry, I don't buy that argument. I'm convinced that an implementation
is possible where any user can load checkpoints of tasks that he could
create by starting the processes directly. If you argue that loading
a corrupted checkpoint can cause any problems, then I would assume
the restart code needs better permission and sanity checks.
> Using a single handle (crid or a special file descriptor) to identify
> the whole checkpoint is very useful - to be able to stream it (eg. over
> the network, or through filters). It is also very important for future
> features and optimizations. For example, to reduce downtime of the
> application during checkpoint, one can use COW for dirty pages, and
> only write-back the entire data after the application resumes execution.
> Or imagine a use-case where one would like to keep the entire checkpoint
> in memory. These are pretty hard to do if you split the handling between
> multiple files or handles.
right.
> > On the restart side, I think the most consistent interface would
> > be a new binfmt_chkpt implementation that you can use to execve
> > a checkpoint, just like you execute an ELF file today. The binfmt
> > can be a module (unlike a syscall), so an administrator that is
> > afraid of the security implications can just disable it by not
> > loading the module. In an execve model, the parent process can
> > set up anything related to credentials as good as it's allowed
> > to and then let the kernel do the rest.
>
> This is an interesting idea but not without its problems. In particular,
> a successful execve() by one thread destroys all the others.
Right, execve currently assumes that the new process starts up with
a single thread, but a potential binfmt_chkpt would need to potentially
start multithreaded. I guess this either requires execve to reuse
the existing threads (assuming they have been set up correctly in
advance) or to create new ones according to the context of the
checkpoint data. It may not be as easy as I thought initially, but
both seem possible.
Restarting a whole set of processes from a checkpoint would be
a relatively simple extension of that.
> Also, it isn't clear how this can work with pre-copying and live-migration;
> And finally, I'm not sure how to handle shared objects in this manner.
What do you mean with pre-copying?
How is live-migration different from restarting a previously saved
task from the same machine?
> As for kernel module - it is easy to implement most of the checkpoint
> restart functionality in a kernel module, leaving only the syscall stubs
> in the kernel.
Yeah, I've done the same in spufs, but I still think it's ugly ;-)
Arnd <><
Arnd Bergmann wrote:
> On Thursday 21 August 2008, Oren Laadan wrote:
>
>> Using a single handle (crid or a special file descriptor) to identify
>> the whole checkpoint is very useful - to be able to stream it (eg. over
>> the network, or through filters). It is also very important for future
>> features and optimizations. For example, to reduce downtime of the
>> application during checkpoint, one can use COW for dirty pages, and
>> only write-back the entire data after the application resumes execution.
>> Or imagine a use-case where one would like to keep the entire checkpoint
>> in memory. These are pretty hard to do if you split the handling between
>> multiple files or handles.
>
> right.
>
>>> On the restart side, I think the most consistent interface would
>>> be a new binfmt_chkpt implementation that you can use to execve
>>> a checkpoint, just like you execute an ELF file today. The binfmt
>>> can be a module (unlike a syscall), so an administrator that is
>>> afraid of the security implications can just disable it by not
>>> loading the module. In an execve model, the parent process can
>>> set up anything related to credentials as good as it's allowed
>>> to and then let the kernel do the rest.
>> This is an interesting idea but not without its problems. In particular,
>> a successful execve() by one thread destroys all the others.
>
> Right, execve currently assumes that the new process starts up with
> a single thread, but a potential binfmt_chkpt would need to potentially
> start multithreaded. I guess this either requires execve to reuse
> the existing threads (assuming they have been set up correctly in
> advance) or to create new ones according to the context of the
> checkpoint data. It may not be as easy as I thought initially, but
> both seem possible.
> Restarting a whole set of processes from a checkpoint would be
> a relatively simple extension of that.
>
>> Also, it isn't clear how this can work with pre-copying and live-migration;
>> And finally, I'm not sure how to handle shared objects in this manner.
>
> What do you mean with pre-copying?
> How is live-migration different from restarting a previously saved
> task from the same machine?
By pre-copying I refer to the first stage of live-migration: to reduce
down time, much of the state of a container can be saved while tasks
are still running (most notably memory, but also file system snapshot,
if need be). Since the state may change, this is repeated - to save the
what changed in the meanwhile - until the delta is small enough. During
all this time the tasks continue to execute. At this point, we freeze
the container, save the last delta, and resume (in case of snapshot) or
or kill (in case of live-migration) the container. I'm not convinced that
execve() is the best way to handle this iterative process.
Also, with multiple tasks in a container, data for consecutive tasks
will appear in order in the checkpoint image. Moreover, a future
optimization would be the have multiple threads checkpoint the container,
with data interleaved in the checkpoint image stream. Here, too, I'm
not sure how execve()-like approach plays.
Finally there is the case of shared objects: v2 demonstrates this in
checkpoint/objhash.c (see also Documentation/checkpoint.txt). Again,
I'm not sure how execve() can adapt to this need.
I definitely agree that using something like execve() is elegant and
has its advantages. It just isn't clear to me that it is truly suitable
for the needs. Suggestions are welcome.
Oren.
>
>> As for kernel module - it is easy to implement most of the checkpoint
>> restart functionality in a kernel module, leaving only the syscall stubs
>> in the kernel.
>
> Yeah, I've done the same in spufs, but I still think it's ugly ;-)
>
> Arnd <><
"Serge E. Hallyn" <[email protected]> writes:
> Quoting Peter Chubb ([email protected]):
>> Beefing up ptrace or fixing /proc to be a real debugging interface
>> would be a start ... when you can get at *all* the info you need,
>
> Except we don't really want to export all the info you need for a
> complete restartable checkpoint. And especially not make it
> generally writable.
That and unless we get a lot of synergy from authors of debuggers
and debugging code it is a more general and slower interface for
no apparent gain.
> We have also started down that path using ptrace (see cryo, at
> git://git.sr71.net/~hallyn/cryodev.git).
>
> Right before the containers mini-summit, where the general agreement was
> that a complete in-kernel solution ought to be pursued, I had tried
> a restart using a binary format that read a checkpoint file and used
> cryo (userspace using ptrace) for the rest of the restart, only
> because there was no other reasonable way to set tsk->did_exec on
> restart.
Can we please describe this as the giant syscall approach. Instead
of a complete in-kernel solution. There are things like filesystems
that should be checkpointed separately, or not checkpointed at all.
However there is a large set of processes and process state that always
goes together and if you checkpoint a container you always want.
So building something that is roughly equivalent to a binfmt module
but that can save and restore multiple tasks with a single operation
looks like the right granularity.
>> Jeremy> Lightweight filesystem checkpointing, such as btrfs provides,
>> Jeremy> would seem like a powerful mechanism for handling a lot of the
>> Jeremy> filesystem state problems. It would have been useful when we
>> Jeremy> did this...
>>
>> And how! saving bits of files was very timeconsuming.
>
> Yes, we're looking forward to using btrfs' snapshots :)
Yep. And in the case of migration we don't even need to snapshot
a filesystem just mount it from on the target machine. Except for
the unlinked files challenge.
Eric