2008-08-21 02:59:28

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 1/9] kernel based checkpoint-restart


These patches implement checkpoint-restart [CR v2]. This version adds
save and restore of open files state (regular files and directories)
which makes it more usable. Other changes address the feedback given
for the previous version. It is also refactored (along Dave's posting)
for easier reviewing.

Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Extend to handle (multiple) tasks in a container
- Security (without CAPS_SYS_ADMIN files restore may fail)

Changelog:

[2008-Aug-20] v2:
- Added dump and restore of open files (regular and directories);
see the changes in the test program (ckpt.c)
- Added basic handling of shared objects, and use 'parent tag'
- Added documentation
- Improved ABI, add 64bit padding for image data
- Improved locking when saving/restoring memory
- Added UTS information to header (release, version, machine)
- Cleanup extraction of filename from a file pointer
- Refactor to allow easier reviewing
- Remove requirement for CAPS_SYS_ADMIN until we come up with a
security policy (this means that file restore may fail)
- Other cleanup in response to comments for v1

[2008-Jul-29] v1:
- Initial version: support a single task with address space of only
private anonymous or file-mapped VMAs; syscalls ignore pid/crid
argument and act on current process.

--
(Dave Hansen's announcement)

At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach. With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data. We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing. This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs. It will also contain
copies of the actual memory that the process uses. Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internel kernel state and write
it out to a file descriptor. The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only work checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.

--
(Original announcement)

In the recent mini-summit at OLS 2008 and the following days it was
agreed to tackle the checkpoint/restart (CR) by beginning with a very
simple case: save and restore a single task, with simple memory
layout, disregarding other task state such as files, signals etc.

Following these discussions I coded a prototype that can do exactly
that, as a starter. This code adds two system calls - sys_checkpoint
and sys_restart - that a task can call to save and restore its state
respectively. It also demonstrates how the checkpoint image file can
be formatted, as well as show its nested nature (e.g. cr_write_mm()
-> cr_write_vma() nesting).

The state that is saved/restored is the following:
* some of the task_struct
* some of the thread_struct and thread_info
* the cpu state (including FPU)
* the memory address space

In the current code, sys_checkpoint will checkpoint the current task,
although the logic exists to checkpoint other tasks (not in the
checkpointee's execution context). A simple loop will extend this to
handle multiple processes. sys_restart restarts the current tasks, and
with multiple tasks each task will call the syscall independently.
(Actually, to checkpoint outside the context of a task, it is also
necessary to also handle restart-block logic when saving/restoring the
thread data).

It takes longer to describe what isn't implemented or supported by
this prototype ... basically everything that isn't as simple as the
above.

As for containers - since we still don't have a representation for a
container, this patch has no notion of a container. The tests for
consistent namespaces (and isolation) are also omitted.

Below are two example programs: one uses checkpoint (called ckpt) and
one uses restart (called rstr). Execute like this (as a superuser):

orenl:~/test$ ./ckpt > out.1
<-- ctrl-c
orenl:~/test$ cat /tmp/cr-rest.out
hello, world!
world, hello!
(ret = 1)

orenl:~/test$ ./ckpt > out.1
<-- ctrl-c
orenl:~/test$ cat /tmp/cr-rest.out
hello, world!
world, hello!
(ret = 2)

<-- now change the contents of the file
orenl:~/test$ sed -i 's/world, hello!/xxxx/' /tmp/cr-rest.out
orenl:~/test$ cat /tmp/cr-rest.out
hello, world!
xxxx
(ret = 2)

<-- and do the restart
orenl:~/test$ ./rstr < out.1
<-- ctrl-c
orenl:~/test$ cat /tmp/cr-rest.out
hello, world!
world, hello!
(ret = 0)

(if you check the output of ps, you'll see that "rstr" changed its
name to "ckpt", as expected).

Oren.


============================== ckpt.c ================================

#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd.h>
#include <sys/syscall.h>

#define OUTFILE "/tmp/cr-test.out"

int main(int argc, char *argv[])
{
pid_t pid = getpid();
FILE *file;
int ret;

close(0);
close(2);

unlink(OUTFILE);
file = fopen(OUTFILE, "w+");
if (!file) {
perror("open");
exit(1);
}

fprintf(file, "hello, world!\n");
fflush(file);

ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
if (ret < 0) {
perror("checkpoint");
exit(2);
}

fprintf(file, "world, hello!\n");
fprintf(file, "(ret = %d)\n", ret);
fflush(file);

while (1)
;

return 0;
}

============================== rstr.c ================================

#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
pid_t pid = getpid();
int ret;

ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
if (ret < 0)
perror("restart");

printf("should not reach here !\n");

return 0;
}


2008-08-21 03:03:54

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls


Create trivial sys_checkpoint and sys_restore system calls. They will
enable to checkpoint and restart an entire container, to and from a
checkpoint image file.

First create a template for both syscalls: they take a file descriptor
(for the image file) and flags as arguments. For sys_checkpoint the
first argument identifies the target container; for sys_restart it will
identify the checkpoint image.

Signed-off-by: Oren Laadan <[email protected]>
---
arch/x86/kernel/syscall_table_32.S | 2 ++
include/asm-x86/unistd_32.h | 2 ++
include/linux/syscalls.h | 2 ++
3 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
.long sys_dup3 /* 330 */
.long sys_pipe2
.long sys_inotify_init1
+ .long sys_checkpoint
+ .long sys_restart
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index d739467..88bdec4 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -338,6 +338,8 @@
#define __NR_dup3 330
#define __NR_pipe2 331
#define __NR_inotify_init1 332
+#define __NR_checkpoint 333
+#define __NR_restart 334

#ifdef __KERNEL__

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d6ff145..edc218b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
asmlinkage long sys_eventfd(unsigned int count);
asmlinkage long sys_eventfd2(unsigned int count, int flags);
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

--
1.5.4.3

2008-08-21 03:06:36

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 4/9] Memory management - dump state


For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it will be followed by the file name. The cr_vma->npages will tell
how many pages were dumped for this VMA. Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next VMA and so on.

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.c | 3 +
checkpoint/ckpt.h | 6 +
checkpoint/ckpt_arch.h | 1 +
checkpoint/ckpt_hdr.h | 31 ++++
checkpoint/ckpt_mem.c | 382 +++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/ckpt_mem.h | 30 ++++
checkpoint/ckpt_x86.c | 28 ++++
checkpoint/rstr_x86.c | 2 +
checkpoint/sys.c | 7 +-
include/asm-x86/ckpt.h | 5 +
11 files changed, 495 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/ckpt_mem.c
create mode 100644 checkpoint/ckpt_mem.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 29dbb2d..032fc9f 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,2 @@
-obj-y += sys.o checkpoint.o restart.o
+obj-y += sys.o checkpoint.o restart.o ckpt_mem.o
obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 949ed58..bf868ae 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -169,6 +169,9 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
ret = cr_write_task_struct(ctx, t);
cr_debug("task_struct: ret %d\n", ret);
if (!ret)
+ ret = cr_write_mm(ctx, t);
+ cr_debug("memory: ret %d\n", ret);
+ if (!ret)
ret = cr_write_thread(ctx, t);
cr_debug("thread: ret %d\n", ret);
if (!ret)
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index 7ecafb3..0addb63 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -29,6 +29,9 @@ struct cr_ctx {
void *hbuf; /* header: to avoid many alloc/dealloc */
int hpos;

+ struct cr_pgarr *pgarr;
+ struct cr_pgarr *pgcur;
+
struct path *vfsroot; /* container root */
};

@@ -62,6 +65,9 @@ int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
int cr_read_str(struct cr_ctx *ctx, void *str, int n);

+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+int cr_read_mm(struct cr_ctx *ctx);
+
int do_checkpoint(struct cr_ctx *ctx);
int do_restart(struct cr_ctx *ctx);

diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
index b7cc8c9..3b87a6f 100644
--- a/checkpoint/ckpt_arch.h
+++ b/checkpoint/ckpt_arch.h
@@ -2,6 +2,7 @@

int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag);

int cr_read_thread(struct cr_ctx *ctx);
int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/ckpt_hdr.h b/checkpoint/ckpt_hdr.h
index a478b7c..a3919cf 100644
--- a/checkpoint/ckpt_hdr.h
+++ b/checkpoint/ckpt_hdr.h
@@ -30,6 +30,7 @@ struct cr_hdr {
__u32 ptag;
};

+/* header types */
enum {
CR_HDR_HEAD = 1,
CR_HDR_STR,
@@ -45,6 +46,12 @@ enum {
CR_HDR_TAIL = 5001
};

+/* vma subtypes */
+enum {
+ CR_VMA_ANON = 1,
+ CR_VMA_FILE
+};
+
struct cr_hdr_head {
__u64 magic;

@@ -83,4 +90,28 @@ struct cr_hdr_task {
char comm[TASK_COMM_LEN];
} __attribute__ ((aligned (8)));

+struct cr_hdr_mm {
+ __u32 tag; /* sharing identifier */
+ __s16 map_count;
+ __s16 _padding;
+
+ __u64 start_code, end_code, start_data, end_data;
+ __u64 start_brk, brk, start_stack;
+ __u64 arg_start, arg_end, env_start, env_end;
+
+} __attribute__ ((aligned (8)));
+
+struct cr_hdr_vma {
+ __u32 how;
+ __s16 npages;
+ __u16 fname;
+
+ __u64 vm_start;
+ __u64 vm_end;
+ __u64 vm_page_prot;
+ __u64 vm_flags;
+ __u64 vm_pgoff;
+
+} __attribute__ ((aligned (8)));
+
#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..a23aa29
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,382 @@
+/*
+ * Checkpoint memory contents
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+#include "ckpt_arch.h"
+#include "ckpt_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr'
+ * (common to ckpt_mem.c and rstr_mem.c)
+ */
+
+#define CR_PGARR_ORDER 0
+#define CR_PGARR_TOTAL ((PAGE_SIZE << CR_PGARR_ORDER) / sizeof(void *))
+
+/* release pages referenced by a page-array */
+void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr)
+{
+ int n;
+
+ /* only checkpoint keeps references to pages */
+ if (ctx->flags & CR_CTX_CKPT) {
+ cr_debug("nused %d\n", pgarr->nused);
+ for (n = pgarr->nused; n--; )
+ page_cache_release(pgarr->pages[n]);
+ }
+ pgarr->nused = 0;
+ pgarr->nleft = CR_PGARR_TOTAL;
+}
+
+/* release pages referenced by chain of page-arrays */
+void cr_pgarr_release(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr;
+
+ for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next)
+ _cr_pgarr_release(ctx, pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr, *pgnxt;
+
+ for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) {
+ _cr_pgarr_release(ctx, pgarr);
+ free_pages((unsigned long) ctx->pgarr->addrs, CR_PGARR_ORDER);
+ free_pages((unsigned long) ctx->pgarr->pages, CR_PGARR_ORDER);
+ pgnxt = pgarr->next;
+ kfree(pgarr);
+ }
+}
+
+/* allocate and add a new page-array to chain */
+struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew)
+{
+ struct cr_pgarr *pgarr = ctx->pgcur;
+
+ if (pgarr && pgarr->next) {
+ ctx->pgcur = pgarr->next;
+ return pgarr->next;
+ }
+
+ if ((pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL))) {
+ pgarr->nused = 0;
+ pgarr->nleft = CR_PGARR_TOTAL;
+ pgarr->addrs = (unsigned long *)
+ __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
+ pgarr->pages = (struct page **)
+ __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
+ if (likely(pgarr->addrs && pgarr->pages)) {
+ *pgnew = pgarr;
+ ctx->pgcur = pgarr;
+ return pgarr;
+ } else if (pgarr->addrs)
+ free_pages((unsigned long) pgarr->addrs,
+ CR_PGARR_ORDER);
+ kfree(pgarr);
+ }
+
+ return NULL;
+}
+
+/* return current page-array (and allocate if needed) */
+struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx)
+{
+ struct cr_pgarr *pgarr = ctx->pgcur;
+
+ if (unlikely(!pgarr->nleft))
+ pgarr = cr_pgarr_alloc(ctx, &pgarr->next);
+ return pgarr;
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+/**
+ * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
+ * @ctx - checkpoint context
+ * @pgarr - page-array to fill
+ * @vma - vma to scan
+ * @start - start address (updated)
+ */
+static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
+ struct vm_area_struct *vma, unsigned long *start)
+{
+ unsigned long end = vma->vm_end;
+ unsigned long addr = *start;
+ struct page **pagep;
+ unsigned long *addrp;
+ int cow, nr, ret = 0;
+
+ nr = pgarr->nleft;
+ pagep = &pgarr->pages[pgarr->nused];
+ addrp = &pgarr->addrs[pgarr->nused];
+ cow = !!vma->vm_file;
+
+ while (addr < end) {
+ struct page *page;
+
+ /* simplified version of get_user_pages(): already have vma,
+ * only need FOLL_TOUCH, and (for now) ignore fault stats */
+
+ cond_resched();
+ while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
+ ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+ if (ret & VM_FAULT_ERROR) {
+ if (ret & VM_FAULT_OOM)
+ ret = -ENOMEM;
+ else if (ret & VM_FAULT_SIGBUS)
+ ret = -EFAULT;
+ else
+ BUG();
+ break;
+ }
+ cond_resched();
+ }
+
+ if (IS_ERR(page)) {
+ ret = PTR_ERR(page);
+ break;
+ }
+
+ if (page == ZERO_PAGE(0))
+ page = NULL; /* zero page: ignore */
+ else if (cow && page_mapping(page) != NULL)
+ page = NULL; /* clean cow: ignore */
+ else {
+ get_page(page);
+ *(addrp++) = addr;
+ *(pagep++) = page;
+ if (--nr == 0) {
+ addr += PAGE_SIZE;
+ break;
+ }
+ }
+
+ addr += PAGE_SIZE;
+ }
+
+ if (unlikely(ret < 0)) {
+ nr = pgarr->nleft - nr;
+ while (nr--)
+ page_cache_release(*(--pagep));
+ return ret;
+ }
+
+ *start = addr;
+ return (pgarr->nleft - nr);
+}
+
+/**
+ * cr_vma_scan_pages - scan vma for pages that will need to be dumped
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * a list of addr/page tuples is kept in ctx->pgarr page-array chain
+ */
+static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+ unsigned long addr = vma->vm_start;
+ unsigned long end = vma->vm_end;
+ struct cr_pgarr *pgarr;
+ int nr, total = 0;
+
+ while (addr < end) {
+ if (!(pgarr = cr_pgarr_prep(ctx)))
+ return -ENOMEM;
+ if ((nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr)) < 0)
+ return nr;
+ pgarr->nleft -= nr;
+ pgarr->nused += nr;
+ total += nr;
+ }
+
+ cr_debug("total %d\n", total);
+ return total;
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+ struct cr_pgarr *pgarr;
+ int ret;
+
+ if (!total)
+ return 0;
+
+ for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
+ ret = cr_kwrite(ctx, pgarr->addrs,
+ pgarr->nused * sizeof(*pgarr->addrs));
+ if (ret < 0)
+ return ret;
+ }
+
+ for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
+ struct page **pages = pgarr->pages;
+ int nr = pgarr->nused;
+ void *ptr;
+
+ while (nr--) {
+ ptr = kmap(*pages);
+ ret = cr_kwrite(ctx, ptr, PAGE_SIZE);
+ kunmap(*pages);
+ if (ret < 0)
+ return ret;
+ pages++;
+ }
+ }
+
+ return total;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+ struct cr_hdr h;
+ struct cr_hdr_vma *hh = ctx->hbuf;
+ char *fname = NULL;
+ int flen = 0, how, nr, ret;
+
+ h.type = CR_HDR_VMA;
+ h.len = sizeof(*hh);
+ h.ptag = 0;
+
+ hh->vm_start = vma->vm_start;
+ hh->vm_end = vma->vm_end;
+ hh->vm_page_prot = vma->vm_page_prot.pgprot;
+ hh->vm_flags = vma->vm_flags;
+ hh->vm_pgoff = vma->vm_pgoff;
+
+ if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
+ pr_warning("CR: unknown VMA %#lx\n", vma->vm_flags);
+ return -ETXTBSY;
+ }
+
+ /* by default assume anon memory */
+ how = CR_VMA_ANON;
+
+ /* if there is a backing file, assume private-mapped */
+ /* (NEED: check if the file is unlinked) */
+ if (vma->vm_file) {
+ flen = PAGE_SIZE;
+ fname = cr_fill_fname(&vma->vm_file->f_path,
+ ctx->vfsroot, ctx->tbuf, &flen);
+ if (IS_ERR(fname))
+ return PTR_ERR(fname);
+ how = CR_VMA_FILE;
+ }
+
+ hh->how = how;
+ hh->fname = !!fname;
+
+ /*
+ * it seems redundant now, but we do it in 3 steps for because:
+ * first, the logic is simpler when we how many pages before
+ * dumping them; second, a future optimization will defer the
+ * writeout (dump, and free) to a later step; in which case all
+ * the pages to be dumped will be aggregated on the checkpoint ctx
+ */
+
+ /* (1) scan: scan through the PTEs of the vma to count the pages
+ * to dump (and later make those pages COW), and keep the list of
+ * pages (and a reference to each page) on the checkpoint ctx */
+ nr = cr_vma_scan_pages(ctx, vma);
+ if (nr < 0)
+ return nr;
+
+ hh->npages = nr;
+ ret = cr_write_obj(ctx, &h, hh);
+
+ if (!ret && flen)
+ ret = cr_write_str(ctx, fname, flen);
+
+ if (ret < 0)
+ return ret;
+
+ /* (2) dump: write out the addresses of all pages in the list (on
+ * the checkpoint ctx) followed by the contents of all pages */
+ ret = cr_vma_dump_pages(ctx, nr);
+
+ /* (3) free: free the extra references to the pages in the list */
+ cr_pgarr_release(ctx);
+
+ return ret;
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_mm *hh = ctx->hbuf;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int ret;
+
+ h.type = CR_HDR_MM;
+ h.len = sizeof(*hh);
+ h.ptag = task_pid_vnr(t);
+
+ mm = get_task_mm(t);
+
+ hh->tag = 1; /* dummy for now; meaningful with multiple tasks */
+
+ down_read(&mm->mmap_sem);
+
+ hh->start_code = mm->start_code;
+ hh->end_code = mm->end_code;
+ hh->start_data = mm->start_data;
+ hh->end_data = mm->end_data;
+ hh->start_brk = mm->start_brk;
+ hh->brk = mm->brk;
+ hh->start_stack = mm->start_stack;
+ hh->arg_start = mm->arg_start;
+ hh->arg_end = mm->arg_end;
+ hh->env_start = mm->env_start;
+ hh->env_end = mm->env_end;
+
+ hh->map_count = mm->map_count;
+
+ /* FIX: need also mm->flags */
+
+ ret = cr_write_obj(ctx, &h, hh);
+ if (ret < 0)
+ goto out;
+
+ /* write the vma's */
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if ((ret = cr_write_vma(ctx, vma)) < 0)
+ goto out;
+ }
+
+ ret = cr_write_mm_context(ctx, mm, hh->tag);
+
+ out:
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ return ret;
+}
diff --git a/checkpoint/ckpt_mem.h b/checkpoint/ckpt_mem.h
new file mode 100644
index 0000000..83d1cfc
--- /dev/null
+++ b/checkpoint/ckpt_mem.h
@@ -0,0 +1,30 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/* page-array chains: each pgarr holds a list of <addr,page> tuples */
+struct cr_pgarr {
+ unsigned long *addrs;
+ struct page **pages;
+ struct cr_pgarr *next;
+ unsigned short nleft;
+ unsigned short nused;
+};
+
+void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr);
+void cr_pgarr_release(struct cr_ctx *ctx);
+void cr_pgarr_free(struct cr_ctx *ctx);
+struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew);
+struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx);
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/ckpt_x86.c b/checkpoint/ckpt_x86.c
index ad6c8e8..b365fff 100644
--- a/checkpoint/ckpt_x86.c
+++ b/checkpoint/ckpt_x86.c
@@ -188,3 +188,31 @@ int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)

return cr_write_obj(ctx, &h, hh);
}
+
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag)
+{
+ struct cr_hdr h;
+ struct cr_hdr_mm_context *hh = ctx->hbuf;
+ int ret;
+
+ h.type = CR_HDR_MM_CONTEXT;
+ h.len = sizeof(*hh);
+ h.ptag = ptag;
+
+ mutex_lock(&mm->context.lock);
+
+ hh->ldt_entry_size = LDT_ENTRY_SIZE;
+ hh->nldt = mm->context.size;
+
+ cr_debug("nldt %d\n", hh->nldt);
+
+ ret = cr_write_obj(ctx, &h, hh);
+ if (ret < 0)
+ return ret;
+
+ ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE);
+
+ mutex_unlock(&mm->context.lock);
+
+ return ret;
+}
diff --git a/checkpoint/rstr_x86.c b/checkpoint/rstr_x86.c
index a24a7de..86b6c83 100644
--- a/checkpoint/rstr_x86.c
+++ b/checkpoint/rstr_x86.c
@@ -8,6 +8,8 @@
* distribution for more details.
*/

+#include <asm/unistd.h>
+
#include <asm/ckpt.h>
#include <asm/desc.h>
#include <asm/i387.h>
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 2891c48..eec5032 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -15,6 +15,7 @@
#include <linux/capability.h>

#include "ckpt.h"
+#include "ckpt_mem.h"

/*
* helpers to write/read to/from the image file descriptor
@@ -112,7 +113,6 @@ static atomic_t cr_ctx_count; /* unique checkpoint identifier */

void cr_ctx_free(struct cr_ctx *ctx)
{
-
if (ctx->file)
fput(ctx->file);
if (ctx->vfsroot)
@@ -121,6 +121,8 @@ void cr_ctx_free(struct cr_ctx *ctx)
free_pages((unsigned long) ctx->tbuf, CR_TBUF_ORDER);
free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);

+ cr_pgarr_free(ctx);
+
kfree(ctx);
}

@@ -137,6 +139,9 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, struct file *file, unsigned long flags)
if (!ctx->tbuf || !ctx->hbuf)
goto nomem;

+ if (!cr_pgarr_alloc(ctx, &ctx->pgarr))
+ goto nomem;
+
ctx->pid = pid;
ctx->flags = flags;

diff --git a/include/asm-x86/ckpt.h b/include/asm-x86/ckpt.h
index cd74657..a2d98fb 100644
--- a/include/asm-x86/ckpt.h
+++ b/include/asm-x86/ckpt.h
@@ -69,4 +69,9 @@ struct cr_hdr_cpu {

} __attribute__ ((aligned (8)));

+struct cr_hdr_mm_context {
+ __s16 ldt_entry_size;
+ __s16 nldt;
+} __attribute__ ((aligned (8)));
+
#endif /* __ASM_X86_CKPT_H */
--
1.5.4.3

2008-08-21 03:07:20

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 5/9] Memory managemnet - restore state


Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
Call do_mmap_pgoffset() for each VMA and then read in the data.

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/ckpt_arch.h | 1 +
checkpoint/restart.c | 3 +
checkpoint/rstr_mem.c | 356 ++++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/rstr_x86.c | 55 ++++++++
5 files changed, 416 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/rstr_mem.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 032fc9f..41e0877 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,2 @@
-obj-y += sys.o checkpoint.o restart.o ckpt_mem.o
+obj-y += sys.o checkpoint.o restart.o ckpt_mem.o rstr_mem.o
obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
index 3b87a6f..ab7ac1c 100644
--- a/checkpoint/ckpt_arch.h
+++ b/checkpoint/ckpt_arch.h
@@ -6,3 +6,4 @@ int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag);

int cr_read_thread(struct cr_ctx *ctx);
int cr_read_cpu(struct cr_ctx *ctx);
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index a85f48b..81ce0a4 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -183,6 +183,9 @@ static int cr_read_task(struct cr_ctx *ctx)
ret = cr_read_task_struct(ctx);
cr_debug("task_struct: ret %d\n", ret);
if (!ret)
+ ret = cr_read_mm(ctx);
+ cr_debug("memory: ret %d\n", ret);
+ if (!ret)
ret = cr_read_thread(ctx);
cr_debug("thread: ret %d\n", ret);
if (!ret)
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..df602a9
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,356 @@
+/*
+ * Restart memory contents
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/uaccess.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <asm/cacheflush.h>
+
+#include "ckpt.h"
+#include "ckpt_arch.h"
+#include "ckpt_hdr.h"
+#include "ckpt_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read in directly to the address space of the current process
+ */
+
+/**
+ * cr_vma_read_pages_addr - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @npages - number of pages
+ */
+static int cr_vma_read_pages_addr(struct cr_ctx *ctx, int npages)
+{
+ struct cr_pgarr *pgarr;
+ int nr, ret;
+
+ while (npages) {
+ if (!(pgarr = cr_pgarr_prep(ctx)))
+ return -ENOMEM;
+ nr = min(npages, (int) pgarr->nleft);
+ ret = cr_kread(ctx, pgarr->addrs, nr * sizeof(unsigned long));
+ if (ret < 0)
+ return ret;
+ pgarr->nleft -= nr;
+ pgarr->nused += nr;
+ npages -= nr;
+ }
+ return 0;
+}
+
+/**
+ * cr_vma_read_pages_data - read in data of pages in page-array chain
+ * @ctx - restart context
+ * @npages - number of pages
+ */
+static int cr_vma_read_pages_data(struct cr_ctx *ctx, int npages)
+{
+ struct cr_pgarr *pgarr;
+ unsigned long *addrs;
+ int nr, ret;
+
+ for (pgarr = ctx->pgarr; npages; pgarr = pgarr->next) {
+ addrs = pgarr->addrs;
+ nr = pgarr->nused;
+ npages -= nr;
+ while (nr--) {
+ ret = cr_uread(ctx, (void *) *(addrs++), PAGE_SIZE);
+ if (ret < 0)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+/* change the protection of an address range to be writable/non-writable.
+ * this is useful when restoring the memory of a read-only vma */
+static int cr_vma_writable(struct mm_struct *mm, unsigned long start,
+ unsigned long end, int writable)
+{
+ struct vm_area_struct *vma, *prev;
+ unsigned long flags = 0;
+ int ret = -EINVAL;
+
+ cr_debug("vma %#lx-%#lx writable %d\n", start, end, writable);
+
+ down_write(&mm->mmap_sem);
+ vma = find_vma_prev(mm, start, &prev);
+ if (unlikely(!vma || vma->vm_start > end || vma->vm_end < start))
+ goto out;
+ if (writable && !(vma->vm_flags & VM_WRITE))
+ flags = vma->vm_flags | VM_WRITE;
+ else if (!writable && (vma->vm_flags & VM_WRITE))
+ flags = vma->vm_flags & ~VM_WRITE;
+ cr_debug("flags %#lx\n", flags);
+ if (flags)
+ ret = mprotect_fixup(vma, &prev, vma->vm_start,
+ vma->vm_end, flags);
+ out:
+ up_write(&mm->mmap_sem);
+ return ret;
+}
+
+/**
+ * cr_vma_read_pages - read in pages for to restore a vma
+ * @ctx - restart context
+ * @cr_vma - vma descriptor from restart
+ */
+static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *cr_vma)
+{
+ struct mm_struct *mm = current->mm;
+ int ret = 0;
+
+ if (!cr_vma->npages)
+ return 0;
+
+ /* in the unlikely case that this vma is read-only */
+ if (!(cr_vma->vm_flags & VM_WRITE))
+ ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 1);
+
+ if (!ret)
+ ret = cr_vma_read_pages_addr(ctx, cr_vma->npages);
+ if (!ret)
+ ret = cr_vma_read_pages_data(ctx, cr_vma->npages);
+ if (ret < 0)
+ return ret;
+
+ cr_pgarr_release(ctx); /* reset page-array chain */
+
+ /* restore original protection for this vma */
+ if (!(cr_vma->vm_flags & VM_WRITE))
+ ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0);
+
+ return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+ unsigned long vm_prot = 0;
+
+ if (orig_vm_flags & VM_READ)
+ vm_prot |= PROT_READ;
+ if (orig_vm_flags & VM_WRITE)
+ vm_prot |= PROT_WRITE;
+ if (orig_vm_flags & VM_EXEC)
+ vm_prot |= PROT_EXEC;
+ if (orig_vm_flags & PROT_SEM) /* only (?) with IPC-SHM */
+ vm_prot |= PROT_SEM;
+
+ return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+ unsigned long vm_flags = 0;
+
+ vm_flags = MAP_FIXED;
+ if (orig_vm_flags & VM_GROWSDOWN)
+ vm_flags |= MAP_GROWSDOWN;
+ if (orig_vm_flags & VM_DENYWRITE)
+ vm_flags |= MAP_DENYWRITE;
+ if (orig_vm_flags & VM_EXECUTABLE)
+ vm_flags |= MAP_EXECUTABLE;
+ if (orig_vm_flags & VM_MAYSHARE)
+ vm_flags |= MAP_SHARED;
+ else
+ vm_flags |= MAP_PRIVATE;
+
+ return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+ struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ unsigned long vm_size, vm_flags, vm_prot, vm_pgoff;
+ unsigned long addr;
+ unsigned long flags;
+ struct file *file = NULL;
+ char *fname = NULL;
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+ if (ret < 0)
+ return ret;
+ else if (ret != 0)
+ return -EINVAL;
+
+ cr_debug("vma %#lx-%#lx npages %d\n", (unsigned long) hh->vm_start,
+ (unsigned long) hh->vm_end, (int) hh->npages);
+
+ if (hh->vm_end < hh->vm_start || hh->npages < 0)
+ return -EINVAL;
+
+ vm_size = hh->vm_end - hh->vm_start;
+ vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+ vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+ vm_pgoff = hh->vm_pgoff;
+
+ if (hh->fname) {
+ fname = ctx->tbuf;
+ ret = cr_read_str(ctx, fname, PAGE_SIZE);
+ if (ret < 0)
+ return ret;
+ }
+
+ cr_debug("vma fname '%s' how %d\n", fname, hh->how);
+
+ switch (hh->how) {
+
+ case CR_VMA_ANON: /* anonymous private mapping */
+ if (hh->fname)
+ return -EINVAL;
+ /* vm_pgoff for anonymous mapping is the "global" page
+ offset (namely from addr 0x0), so we force a zero */
+ vm_pgoff = 0;
+ break;
+
+ case CR_VMA_FILE: /* private mapping from a file */
+ if (!hh->fname)
+ return -EINVAL;
+ /* O_RDWR only needed if both (VM_WRITE|VM_SHARED) are set */
+ flags = hh->vm_flags & (VM_WRITE | VM_SHARED);
+ flags = (flags == (VM_WRITE | VM_SHARED) ? O_RDWR : O_RDONLY);
+ file = filp_open(fname, flags, 0);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ break;
+
+ default:
+ return -EINVAL;
+
+ }
+
+ addr = do_mmap_pgoff(file, (unsigned long) hh->vm_start,
+ vm_size, vm_prot, vm_flags, vm_pgoff);
+ cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+ vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+ /* the file (if opened) is now referenced by the vma */
+ if (file)
+ filp_close(file, NULL);
+
+ if (IS_ERR((void*) addr))
+ return (PTR_ERR((void *) addr));
+
+ /*
+ * CR_VMA_ANON: read in memory as is
+ * CR_VMA_FILE: read in memory as is
+ * (more to follow ...)
+ */
+
+ switch (hh->how) {
+ case CR_VMA_ANON:
+ case CR_VMA_FILE:
+ /* standard case: read the data into the memory */
+ ret = cr_vma_read_pages(ctx, hh);
+ break;
+ }
+
+ if (ret < 0)
+ return ret;
+
+ if (vm_prot & PROT_EXEC)
+ flush_icache_range(hh->vm_start, hh->vm_end);
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ cr_debug("vma retval %d\n", ret);
+ return 0;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+ struct vm_area_struct *vmnext = mm->mmap;
+ struct vm_area_struct *vma;
+ int ret;
+
+ while (vmnext) {
+ vma = vmnext;
+ vmnext = vmnext->vm_next;
+ ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+ struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct mm_struct *mm;
+ int nr, ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+ if (ret < 0)
+ return ret;
+#if 0 /* activate when containers are used */
+ if (ret != task_pid_vnr(current))
+ return -EINVAL;
+#endif
+ cr_debug("map_count %d\n", hh->map_count);
+
+ /* XXX need more sanity checks */
+ if (hh->start_code > hh->end_code ||
+ hh->start_data > hh->end_data || hh->map_count < 0)
+ return -EINVAL;
+
+ mm = current->mm;
+
+ /* point of no return -- destruct current mm */
+ down_write(&mm->mmap_sem);
+ ret = cr_destroy_mm(mm);
+ up_write(&mm->mmap_sem);
+
+ if (ret < 0)
+ return ret;
+
+ mm->start_code = hh->start_code;
+ mm->end_code = hh->end_code;
+ mm->start_data = hh->start_data;
+ mm->end_data = hh->end_data;
+ mm->start_brk = hh->start_brk;
+ mm->brk = hh->brk;
+ mm->start_stack = hh->start_stack;
+ mm->arg_start = hh->arg_start;
+ mm->arg_end = hh->arg_end;
+ mm->env_start = hh->env_start;
+ mm->env_end = hh->env_end;
+
+ /* FIX: need also mm->flags */
+
+ for (nr = hh->map_count; nr; nr--) {
+ ret = cr_read_vma(ctx, mm);
+ if (ret < 0)
+ return ret;
+ }
+
+ ret = cr_read_mm_context(ctx, mm, hh->tag);
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
diff --git a/checkpoint/rstr_x86.c b/checkpoint/rstr_x86.c
index 86b6c83..918df5c 100644
--- a/checkpoint/rstr_x86.c
+++ b/checkpoint/rstr_x86.c
@@ -176,3 +176,58 @@ int cr_read_cpu(struct cr_ctx *ctx)

return ret;
}
+
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag)
+{
+ struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int n, ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+ cr_debug("ptag %d ret %d nldt %d\n", ptag, ret, hh->nldt);
+ if (ret < 0)
+ return ret;
+ if (ret != ptag)
+ return -EINVAL;
+
+ if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+ return -EINVAL;
+
+ /* to utilize the syscall modify_ldt() we first convert the data
+ * in the checkpoint image from 'struct desc_struct' to 'struct
+ * user_desc' with reverse logic of inclue/asm/desc.h:fill_ldt() */
+
+ for (n = 0; n < hh->nldt; n++) {
+ struct user_desc info;
+ struct desc_struct desc;
+ mm_segment_t old_fs;
+
+ ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+ if (ret < 0)
+ return ret;
+
+ info.entry_number = n;
+ info.base_addr = desc.base0 | (desc.base1 << 16);
+ info.limit = desc.limit0;
+ info.seg_32bit = desc.d;
+ info.contents = desc.type >> 2;
+ info.read_exec_only = (desc.type >> 1) ^ 1;
+ info.limit_in_pages = desc.g;
+ info.seg_not_present = desc.p ^ 1;
+ info.useable = desc.avl;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ ret = sys_modify_ldt(1, &info, sizeof(info));
+ set_fs(old_fs);
+
+ if (ret < 0)
+ return ret;
+ }
+
+ load_LDT(&mm->context);
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return 0;
+}
--
1.5.4.3

2008-08-21 03:07:46

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 3/9] x86 support for checkpoint/restart


(Following Dave Hansen's refactoring of the original post)

Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.

Currently only x86-32 is supported. Compiling on x86-64 will trigger
an explicit error.

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Makefile | 1 +
checkpoint/checkpoint.c | 9 ++-
checkpoint/ckpt_arch.h | 7 ++
checkpoint/ckpt_x86.c | 190 +++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/restart.c | 11 ++-
checkpoint/rstr_x86.c | 176 +++++++++++++++++++++++++++++++++++++++++++
include/asm-x86/ckpt.h | 72 ++++++++++++++++++
7 files changed, 463 insertions(+), 3 deletions(-)
create mode 100644 checkpoint/ckpt_arch.h
create mode 100644 checkpoint/ckpt_x86.c
create mode 100644 checkpoint/rstr_x86.c
create mode 100644 include/asm-x86/ckpt.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d129878..29dbb2d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1 +1,2 @@
obj-y += sys.o checkpoint.o restart.o
+obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 25343f5..949ed58 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,7 @@

#include "ckpt.h"
#include "ckpt_hdr.h"
+#include "ckpt_arch.h"

/**
* cr_fill_fname - return pathname of a given file
@@ -166,7 +167,13 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
}

ret = cr_write_task_struct(ctx, t);
- cr_debug("ret %d\n", ret);
+ cr_debug("task_struct: ret %d\n", ret);
+ if (!ret)
+ ret = cr_write_thread(ctx, t);
+ cr_debug("thread: ret %d\n", ret);
+ if (!ret)
+ ret = cr_write_cpu(ctx, t);
+ cr_debug("cpu: ret %d\n", ret);

return ret;
}
diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
new file mode 100644
index 0000000..b7cc8c9
--- /dev/null
+++ b/checkpoint/ckpt_arch.h
@@ -0,0 +1,7 @@
+#include "ckpt.h"
+
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+int cr_read_thread(struct cr_ctx *ctx);
+int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/ckpt_x86.c b/checkpoint/ckpt_x86.c
new file mode 100644
index 0000000..ad6c8e8
--- /dev/null
+++ b/checkpoint/ckpt_x86.c
@@ -0,0 +1,190 @@
+/*
+ * Checkpoint/restart - architecture specific support for x86
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <asm/ckpt.h>
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_thread *hh = ctx->hbuf;
+ struct thread_struct *thread;
+ struct desc_struct *desc;
+ int ntls = 0;
+ int n, ret;
+
+ h.type = CR_HDR_THREAD;
+ h.len = sizeof(*hh);
+ h.ptag = task_pid_vnr(t);
+
+ thread = &t->thread;
+
+ /* calculate no. of TLS entries that follow */
+ desc = thread->tls_array;
+ for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+ if (desc->a || desc->b)
+ ntls++;
+ }
+
+ hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+ hh->sizeof_tls_array = sizeof(thread->tls_array);
+ hh->ntls = ntls;
+
+ if ((ret = cr_write_obj(ctx, &h, hh)) < 0)
+ return ret;
+
+ /* for simplicity dump the entire array, cherry-pick upon restart */
+ ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+ cr_debug("ntls %d\n", ntls);
+
+ /* IGNORE RESTART BLOCKS FOR NOW ... */
+
+ return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else /* !CONFIG_X86_64 */
+
+void cr_write_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ struct pt_regs *regs = task_pt_regs(t);
+
+ hh->bp = regs->bp;
+ hh->bx = regs->bx;
+ hh->ax = regs->ax;
+ hh->cx = regs->cx;
+ hh->dx = regs->dx;
+ hh->si = regs->si;
+ hh->di = regs->di;
+ hh->orig_ax = regs->orig_ax;
+ hh->ip = regs->ip;
+ hh->cs = regs->cs;
+ hh->flags = regs->flags;
+ hh->sp = regs->sp;
+ hh->ss = regs->ss;
+
+ hh->ds = regs->ds;
+ hh->es = regs->es;
+
+ /* for checkpoint in process context (from within a container)
+ the GS and FS registers should be saved from the hardware;
+ otherwise they are already sabed on the thread structure */
+ if (t == current) {
+ savesegment(gs, hh->gs);
+ savesegment(fs, hh->fs);
+ } else {
+ hh->gs = thread->gs;
+ hh->fs = thread->fs;
+ }
+
+ /*
+ * for checkpoint in process context (from within a container),
+ * the actual syscall is taking place at this very moment; so
+ * we (optimistically) subtitute the future return value (0) of
+ * this syscall into the orig_eax, so that upon restart it will
+ * succeed (or it will endlessly retry checkpoint...)
+ */
+ if (t == current) {
+ BUG_ON(hh->orig_ax < 0);
+ hh->ax = 0;
+ }
+}
+
+void cr_write_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+
+ /* debug regs */
+
+ preempt_disable();
+
+ /*
+ * for checkpoint in process context (from within a container),
+ * get the actual registers; otherwise get the saved values.
+ */
+
+ if (t == current) {
+ get_debugreg(hh->debugreg0, 0);
+ get_debugreg(hh->debugreg1, 1);
+ get_debugreg(hh->debugreg2, 2);
+ get_debugreg(hh->debugreg3, 3);
+ get_debugreg(hh->debugreg6, 6);
+ get_debugreg(hh->debugreg7, 7);
+ } else {
+ hh->debugreg0 = thread->debugreg0;
+ hh->debugreg1 = thread->debugreg1;
+ hh->debugreg2 = thread->debugreg2;
+ hh->debugreg3 = thread->debugreg3;
+ hh->debugreg6 = thread->debugreg6;
+ hh->debugreg7 = thread->debugreg7;
+ }
+
+ hh->debugreg4 = 0;
+ hh->debugreg5 = 0;
+
+ hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+
+ preempt_enable();
+}
+
+void cr_write_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ struct thread_info *thread_info = task_thread_info(t);
+
+ /* i387 + MMU + SSE logic */
+
+ preempt_disable();
+
+ hh->used_math = tsk_used_math(t) ? 1 : 0;
+ if (hh->used_math) {
+ /* normally, no need to unlazy_fpu(), since TS_USEDFPU flag
+ * have been cleared when task was conexted-switched out...
+ * except if we are in process context, in which case we do */
+ if (thread_info->status & TS_USEDFPU)
+ unlazy_fpu(current);
+
+ hh->has_fxsr = cpu_has_fxsr;
+ memcpy(&hh->xstate, &thread->xstate, sizeof(thread->xstate));
+ }
+
+ preempt_enable();
+}
+
+#endif /* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_cpu *hh = ctx->hbuf;
+
+ h.type = CR_HDR_CPU;
+ h.len = sizeof(*hh);
+ h.ptag = task_pid_vnr(t);
+
+ cr_write_cpu_regs(hh, t);
+ cr_write_cpu_debug(hh, t);
+ cr_write_cpu_fpu(hh, t);
+
+ cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+ return cr_write_obj(ctx, &h, hh);
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index be7d08c..a85f48b 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -21,6 +21,7 @@

#include "ckpt.h"
#include "ckpt_hdr.h"
+#include "ckpt_arch.h"

/**
* cr_hbuf_get - reserve space on the hbuf
@@ -63,7 +64,7 @@ int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
if (ret < 0)
return ret;

- cr_debug("type %d len %d id %d (%d)\n", h->type, h->len, h->ptag, n);
+ cr_debug("type %d len %d ptag %d (%d)\n", h->type, h->len, h->ptag, n);

if (h->len < 0 || h->len > n)
return -EINVAL;
@@ -180,7 +181,13 @@ static int cr_read_task(struct cr_ctx *ctx)
int ret;

ret = cr_read_task_struct(ctx);
- cr_debug("ret %d\n", ret);
+ cr_debug("task_struct: ret %d\n", ret);
+ if (!ret)
+ ret = cr_read_thread(ctx);
+ cr_debug("thread: ret %d\n", ret);
+ if (!ret)
+ ret = cr_read_cpu(ctx);
+ cr_debug("cpu: ret %d\n", ret);

return ret;
}
diff --git a/checkpoint/rstr_x86.c b/checkpoint/rstr_x86.c
new file mode 100644
index 0000000..a24a7de
--- /dev/null
+++ b/checkpoint/rstr_x86.c
@@ -0,0 +1,176 @@
+/*
+ * Checkpoint/restart - architecture specific support for x86
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <asm/ckpt.h>
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+ struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ struct thread_struct *thread = &t->thread;
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+ if (ret < 0)
+ return ret;
+#if 0 /* activate when containers are used */
+ if (ret != task_pid_vnr(t))
+ return -EINVAL;
+#endif
+ cr_debug("ntls %d\n", hh->ntls);
+
+ if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+ hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+ hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+ return -EINVAL;
+
+ if (hh->ntls > 0) {
+
+ /* restore TLS by hand: why convert to struct user_desc if
+ * sys_set_thread_entry() will convert it back ? */
+
+ struct desc_struct *buf = ctx->tbuf;
+ int size = sizeof(*buf) * GDT_ENTRY_TLS_ENTRIES;
+ int cpu;
+
+ BUG_ON(size > CR_TBUF_TOTAL);
+
+ ret = cr_kread(ctx, buf, size);
+ if (ret < 0)
+ return ret;
+
+ /* FIX: add sanity checks (eg. that values makes sense, that
+ * that we don't overwrite old values, etc */
+
+ cpu = get_cpu();
+ memcpy(thread->tls_array, buf, size);
+ load_TLS(thread, cpu);
+ put_cpu();
+ }
+
+ return 0;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else /* !CONFIG_X86_64 */
+
+int cr_read_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+ struct pt_regs *regs = task_pt_regs(t);
+
+ regs->bx = hh->bx;
+ regs->cx = hh->cx;
+ regs->dx = hh->dx;
+ regs->si = hh->si;
+ regs->di = hh->di;
+ regs->bp = hh->bp;
+ regs->ax = hh->ax;
+ regs->ds = hh->ds;
+ regs->es = hh->es;
+ regs->orig_ax = hh->orig_ax;
+ regs->ip = hh->ip;
+ regs->cs = hh->cs;
+ regs->flags = hh->flags;
+ regs->sp = hh->sp;
+ regs->ss = hh->ss;
+
+ thread->gs = hh->gs;
+ thread->fs = hh->fs;
+ loadsegment(gs, hh->gs);
+ loadsegment(fs, hh->fs);
+
+ return 0;
+}
+
+int cr_read_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ /* debug regs */
+
+ preempt_disable();
+
+ if (hh->uses_debug) {
+ set_debugreg(hh->debugreg0, 0);
+ set_debugreg(hh->debugreg1, 1);
+ /* ignore 4, 5 */
+ set_debugreg(hh->debugreg2, 2);
+ set_debugreg(hh->debugreg3, 3);
+ set_debugreg(hh->debugreg6, 6);
+ set_debugreg(hh->debugreg7, 7);
+ }
+
+ preempt_enable();
+
+ return 0;
+}
+
+int cr_read_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+ struct thread_struct *thread = &t->thread;
+
+ /* i387 + MMU + SSE */
+
+ preempt_disable();
+
+ __clear_fpu(t); /* in case we used FPU in user mode */
+
+ if (!hh->used_math)
+ clear_used_math();
+ else {
+ if (hh->has_fxsr != cpu_has_fxsr) {
+ force_sig(SIGFPE, t);
+ return -EINVAL;
+ }
+ memcpy(&thread->xstate, &hh->xstate, sizeof(thread->xstate));
+ set_used_math();
+ }
+
+ preempt_enable();
+
+ return 0;
+}
+
+#endif /* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+ struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+ if (ret < 0)
+ return ret;
+#if 0 /* activate when containers are used */
+ if (ret != task_pid_vnr(t))
+ return -EINVAL;
+#endif
+ /* FIX: sanity check for sensitive registers (eg. eflags) */
+
+ ret = cr_read_cpu_regs(hh, t);
+ if (!ret)
+ ret = cr_read_cpu_debug(hh, t);
+ if (!ret)
+ ret = cr_read_cpu_fpu(hh, t);
+
+ cr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+ return ret;
+}
diff --git a/include/asm-x86/ckpt.h b/include/asm-x86/ckpt.h
new file mode 100644
index 0000000..cd74657
--- /dev/null
+++ b/include/asm-x86/ckpt.h
@@ -0,0 +1,72 @@
+#ifndef __ASM_X86_CKPT_H
+#define __ASM_X86_CKPT_H
+/*
+ * Checkpoint/restart - architecture specific headers x86
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <asm/processor.h>
+
+struct cr_hdr_thread {
+ /* NEED: restart blocks */
+
+ __s16 gdt_entry_tls_entries;
+ __s16 sizeof_tls_array;
+ __s16 ntls; /* number of TLS entries to follow */
+} __attribute__ ((aligned (8)));
+
+struct cr_hdr_cpu {
+ /* see struct pt_regs (x86-64) */
+ __u64 r15;
+ __u64 r14;
+ __u64 r13;
+ __u64 r12;
+ __u64 bp;
+ __u64 bx;
+ __u64 r11;
+ __u64 r10;
+ __u64 r9;
+ __u64 r8;
+ __u64 ax;
+ __u64 cx;
+ __u64 dx;
+ __u64 si;
+ __u64 di;
+ __u64 orig_ax;
+ __u64 ip;
+ __u64 cs;
+ __u64 flags;
+ __u64 sp;
+ __u64 ss;
+
+ /* segment registers */
+ __u64 ds;
+ __u64 es;
+ __u64 fs;
+ __u64 gs;
+
+ /* debug registers */
+ __u64 debugreg0;
+ __u64 debugreg1;
+ __u64 debugreg2;
+ __u64 debugreg3;
+ __u64 debugreg4;
+ __u64 debugreg5;
+ __u64 debugreg6;
+ __u64 debugreg7;
+
+ __u16 uses_debug;
+ __u16 used_math;
+ __u16 has_fxsr;
+ __u16 _padding;
+
+ union thread_xstate xstate; /* i387 */
+
+} __attribute__ ((aligned (8)));
+
+#endif /* __ASM_X86_CKPT_H */
--
1.5.4.3

2008-08-21 03:08:03

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 6/9] Checkpoint/restart: initial documentation


Covers application checkpoint/restart, overall design, interfaces
and checkpoint image format.

Signed-off-by: Oren Laadan <[email protected]>
---
Documentation/checkpoint.txt | 177 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 177 insertions(+), 0 deletions(-)
create mode 100644 Documentation/checkpoint.txt

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
new file mode 100644
index 0000000..fdc69cb
--- /dev/null
+++ b/Documentation/checkpoint.txt
@@ -0,0 +1,177 @@
+
+ === Checkpoint-Restart support in the Linux kernel ===
+
+Copyright (C) 2008 Oren Laadan
+
+Author: Oren Laadan <[email protected]>
+
+License: The GNU Free Documentation License, Version 1.2
+ (dual licensed under the GPL v2)
+Reviewers:
+
+Application checkpoint/restart [CR] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. CR can provide many potential benefits:
+
+* Failure recovery: by rolling back an to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+ instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+ intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off of faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+ hosts.
+
+* Improved service availability and administration: by migrating
+ applications before host maintenance so that they continue to run
+ with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+ any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relative opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time. The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial CR products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide CR: sys_checkpoint and
+sys_restart. The checkpoint code basically serializes internel kernel
+state and writes it out to a file descriptor, and the resulting image
+is stream-able. More specifically, it consists of 5 steps:
+ 1. Pre-dump
+ 2. Freeze the container
+ 3. Dump
+ 4. Thaw (or kill) the container
+ 5. Post-dump
+Steps 1 and 5 are an optimization to reduce application downtime:
+"pre-dump" works before freezing the container, e.g. the pre-copy for
+live migration, and "post-dump" works after the container resumes
+execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state and from a
+file descriptor, and re-creates the tasks and the resources they need
+to resume execution. The restart code is executed by each task that
+is restored in a new container to reconstruct its own state.
+
+
+=== Interfaces
+
+int sys_checkpoint(pid_t pid, int fd, unsigned long flag);
+ Checkpoint a container whose init task is identified by pid, to the
+ file designated by fd. Flags will have future meaning (should be 0
+ for now).
+ Returns: a positive integer that identifies the checkpoint image
+ (for future reference in case it is kept in memory) upon success,
+ 0 if it returns from a restart, and -1 if an error occurs.
+
+int sys_restart(int crid, int fd, unsigned long flags);
+ Restart a container from a checkpoint image identified by crid, or
+ from the blob stored in the file designated by fd. Flags will have
+ future meaning (should be 0 for now).
+ Returns: 0 on success and -1 if an error occurs.
+
+Thus, if checkpoint is initiated by a process in the container, one
+can use logic similar to fork():
+ ...
+ crid = checkpoint(...);
+ switch (crid) {
+ case -1:
+ perror("checkpoint failed");
+ break;
+ default:
+ fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);
+ /* proceed with execution after checkpoint */
+ ...
+ break;
+ case 0:
+ fprintf(stderr, "returned after restart\n");
+ /* proceed with action required following a restart */
+ ...
+ break;
+ }
+ ...
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+ ...
+ if (restart(crid, ...) < 0)
+ perror("restart failed");
+ /* only get here if restart failed */
+ ...
+
+
+=== Checkpoint image format
+
+The checkpoint image format is composed of records consistings of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+ __s16 type;
+ __s16 len;
+ __u32 id;
+};
+
+Here, 'type' field identifies the type of the payload, 'len' tells its
+length in byes. The 'id' identifies the owner object instance. The
+meaning of the 'id' field varies depending on the type. For example,
+for type CR_HDR_MM, the 'id' identifies the task to which this MM
+belongs. The payload also varies depending on the type, for instance,
+the data describing a task_struct is given by a 'struct cr_hdr_task'
+(type CR_HDR_TASK) and so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
+name. The cr_vma->npages indicated how many pages were dumped for this
+VMA. Following comes the actual data: first the addresses of all the
+dumped pages, followed by the contents of all the dumped pages (npages
+entries each). Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+ cr_hdr + cr_hdr_mm
+ cr_hdr + cr_hdr_vma + cr_hdr + string
+ addr1, addr2
+ page1, page2
+ cr_hdr + cr_hdr_vma
+ addr3, addr4, addr5
+ page3, page4, page5
+ cr_hdr + cr_mm_context
+ cr_hdr + cr_hdr_thread
+ cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+=== Changelog
+
+[2008-Jul-29] v1:
+In this incarnation, CR only works on single task. The address space
+may consist of only private, simple VMAs - anonymous or file-mapped.
+Both checkpoint and restart will ignore the first argument (pid/crid)
+and instead act on themselves.
+
+[2008-Aug-09] v2:
+* Added utsname->{release,version,machine} to checkpoint header
+* Pad header structures to 64 bits to ensure compatibility
+* Address comments from LKML and linux-containers mailing list
--
1.5.4.3

2008-08-21 03:08:33

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart


Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:

ckpt/sys.c - user/kernel data transfer, as well as setup of the
checkpoint/restart context (a per-checkpoint data structure for
housekeeping)

ckpt/checkpoint.c - output wrappers and basic checkpoint handling

ckpt/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Signed-off-by: Oren Laadan <[email protected]>
---
Makefile | 2 +-
checkpoint/Makefile | 1 +
checkpoint/checkpoint.c | 191 +++++++++++++++++++++++++++++++++++++++
checkpoint/ckpt.h | 71 +++++++++++++++
checkpoint/ckpt_hdr.h | 86 ++++++++++++++++++
checkpoint/restart.c | 199 +++++++++++++++++++++++++++++++++++++++++
checkpoint/sys.c | 227 +++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 776 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/Makefile
create mode 100644 checkpoint/checkpoint.c
create mode 100644 checkpoint/ckpt.h
create mode 100644 checkpoint/ckpt_hdr.h
create mode 100644 checkpoint/restart.c
create mode 100644 checkpoint/sys.c

diff --git a/Makefile b/Makefile
index f3e2065..584c8f9 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd


ifeq ($(KBUILD_EXTMOD),)
-core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/

vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..d129878
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1 @@
+obj-y += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..25343f5
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,191 @@
+/*
+ * Checkpoint logic and helpers
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <asm/ptrace.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_fill_fname - return pathname of a given file
+ * @file: file pointer
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+char *cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+ char *fname;
+
+ BUG_ON(!buf);
+ fname = __d_path(path, root, buf, *n);
+ if (!IS_ERR(fname))
+ *n = (buf + (*n) - fname);
+ return fname;
+}
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+ int ret;
+
+ if ((ret = cr_kwrite(ctx, h, sizeof(*h))) < 0)
+ return ret;
+ return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_str - write a string record
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @n: string length
+ */
+int cr_write_str(struct cr_ctx *ctx, char *str, int n)
+{
+ struct cr_hdr h;
+
+ h.type = CR_HDR_STR;
+ h.len = n;
+ h.ptag = 0;
+
+ return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_hdr(struct cr_ctx *ctx)
+{
+ struct cr_hdr h;
+ struct cr_hdr_head *hh = ctx->hbuf;
+ struct new_utsname *uts;
+ struct timeval ktv;
+
+ h.type = CR_HDR_HEAD;
+ h.len = sizeof(*hh);
+ h.ptag = 0;
+
+ do_gettimeofday(&ktv);
+
+ hh->magic = CR_MAGIC_HEAD;
+ hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+ hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+ hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+ hh->rev = CR_VERSION;
+
+ hh->flags = ctx->flags;
+ hh->time = ktv.tv_sec;
+
+ uts = utsname();
+ memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+ memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+ memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+ return cr_write_obj(ctx, &h, hh);
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+ struct cr_hdr h;
+ struct cr_hdr_tail *hh = ctx->hbuf;
+
+ h.type = CR_HDR_TAIL;
+ h.len = sizeof(*hh);
+ h.ptag = 0;
+
+ hh->magic = CR_MAGIC_TAIL;
+ hh->cksum = 1; /* TBD ... */
+
+ return cr_write_obj(ctx, &h, hh);
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_task *hh = ctx->hbuf;
+
+ h.type = CR_HDR_TASK;
+ h.len = sizeof(*hh);
+ h.ptag = 0;
+
+ hh->state = t->state;
+ hh->exit_state = t->exit_state;
+ hh->exit_code = t->exit_code;
+ hh->exit_signal = t->exit_signal;
+
+ hh->utime = t->utime;
+ hh->stime = t->stime;
+ hh->utimescaled = t->utimescaled;
+ hh->stimescaled = t->stimescaled;
+ hh->gtime = t->gtime;
+ hh->prev_utime = t->prev_utime;
+ hh->prev_stime = t->prev_stime;
+ hh->nvcsw = t->nvcsw;
+ hh->nivcsw = t->nivcsw;
+ hh->start_time_sec = t->start_time.tv_sec;
+ hh->start_time_nsec = t->start_time.tv_nsec;
+ hh->real_start_time_sec = t->real_start_time.tv_sec;
+ hh->real_start_time_nsec = t->real_start_time.tv_nsec;
+ hh->min_flt = t->min_flt;
+ hh->maj_flt = t->maj_flt;
+
+ hh->task_comm_len = TASK_COMM_LEN;
+ memcpy(hh->comm, t->comm, TASK_COMM_LEN);
+
+ return cr_write_obj(ctx, &h, hh);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+ int ret ;
+
+ if (t->state == TASK_DEAD) {
+ pr_warning("CR: task may not be in state TASK_DEAD\n");
+ return -EAGAIN;
+ }
+
+ ret = cr_write_task_struct(ctx, t);
+ cr_debug("ret %d\n", ret);
+
+ return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx)
+{
+ int ret;
+
+ /* FIX: need to test whether container is checkpointable */
+
+ ret = cr_write_hdr(ctx);
+ if (!ret)
+ ret = cr_write_task(ctx, current);
+ if (!ret)
+ ret = cr_write_tail(ctx);
+
+ /* on success, return (unique) checkpoint identifier */
+ if (!ret)
+ ret = ctx->crid;
+
+ return ret;
+}
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
new file mode 100644
index 0000000..7ecafb3
--- /dev/null
+++ b/checkpoint/ckpt.h
@@ -0,0 +1,71 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/path.h>
+#include <linux/fs.h>
+
+#define CR_VERSION 1
+
+struct cr_ctx {
+ pid_t pid; /* container identifier */
+ int crid; /* unique checkpoint id */
+
+ unsigned long flags;
+ unsigned long oflags; /* restart: old flags */
+
+ struct file *file;
+ int total; /* total read/written */
+
+ void *tbuf; /* temp: to avoid many alloc/dealloc */
+ void *hbuf; /* header: to avoid many alloc/dealloc */
+ int hpos;
+
+ struct path *vfsroot; /* container root */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT 0x1
+#define CR_CTX_RSTR 0x2
+
+/* allocation defaults */
+#define CR_TBUF_ORDER 1
+#define CR_TBUF_TOTAL (PAGE_SIZE << CR_TBUF_ORDER)
+
+#define CR_HBUF_ORDER 1
+#define CR_HBUF_TOTAL (PAGE_SIZE << CR_HBUF_ORDER)
+
+char *cr_fill_fname(struct path *path, struct path *root, char *buf, int *n);
+
+int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
+int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+int cr_uread(struct cr_ctx *ctx, void *buf, int count);
+int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+int cr_write_str(struct cr_ctx *ctx, char *str, int n);
+
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
+int cr_read_str(struct cr_ctx *ctx, void *str, int n);
+
+int do_checkpoint(struct cr_ctx *ctx);
+int do_restart(struct cr_ctx *ctx);
+
+#define cr_debug(fmt, args...) \
+ pr_debug("[CR:%s] " fmt, __func__, ## args)
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/checkpoint/ckpt_hdr.h b/checkpoint/ckpt_hdr.h
new file mode 100644
index 0000000..a478b7c
--- /dev/null
+++ b/checkpoint/ckpt_hdr.h
@@ -0,0 +1,86 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+#define CR_MAGIC_HEAD 0x00feed0cc0a2d200LL
+#define CR_MAGIC_TAIL 0x002d2a0cc0deef00LL
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+ __s16 type;
+ __s16 len;
+ __u32 ptag;
+};
+
+enum {
+ CR_HDR_HEAD = 1,
+ CR_HDR_STR,
+
+ CR_HDR_TASK = 101,
+ CR_HDR_THREAD,
+ CR_HDR_CPU,
+
+ CR_HDR_MM = 201,
+ CR_HDR_VMA,
+ CR_HDR_MM_CONTEXT,
+
+ CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+ __u64 magic;
+
+ __u16 major;
+ __u16 minor;
+ __u16 patch;
+ __u16 rev;
+
+ __u64 time; /* when checkpoint taken */
+ __u64 flags; /* checkpoint options */
+
+ char release[__NEW_UTS_LEN];
+ char version[__NEW_UTS_LEN];
+ char machine[__NEW_UTS_LEN];
+} __attribute__ ((aligned (8)));
+
+struct cr_hdr_tail {
+ __u64 magic;
+ __u64 cksum;
+} __attribute__ ((aligned (8)));
+
+struct cr_hdr_task {
+ __u64 state;
+ __u32 exit_state;
+ __u32 exit_code, exit_signal;
+
+ __u64 utime, stime, utimescaled, stimescaled;
+ __u64 gtime;
+ __u64 prev_utime, prev_stime;
+ __u64 nvcsw, nivcsw;
+ __u64 start_time_sec, start_time_nsec;
+ __u64 real_start_time_sec, real_start_time_nsec;
+ __u64 min_flt, maj_flt;
+
+ __s16 task_comm_len;
+ char comm[TASK_COMM_LEN];
+} __attribute__ ((aligned (8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..be7d08c
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,199 @@
+/*
+ * Restart logic and helpers
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+/*
+ * During restart the code reads in data from the chekcpoint image into a
+ * temporary buffer (ctx->hbuf). Because operations can be nested, one
+ * should call cr_hbuf_get() to reserve space in the buffer, and then
+ * cr_hbuf_put() when it no longer needs that space
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+ void *ptr;
+
+ BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+ ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
+ ctx->hpos += n;
+ return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+ BUG_ON(ctx->hpos < n);
+ ctx->hpos -= n;
+}
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @n: available buffer size
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n)
+{
+ int ret;
+
+ ret = cr_kread(ctx, h, sizeof(*h));
+ if (ret < 0)
+ return ret;
+
+ cr_debug("type %d len %d id %d (%d)\n", h->type, h->len, h->ptag, n);
+
+ if (h->len < 0 || h->len > n)
+ return -EINVAL;
+
+ return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @n: available buffer size
+ * @type: expected record type
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
+{
+ struct cr_hdr h;
+ int ret;
+
+ ret = cr_read_obj(ctx, &h, buf, n);
+ if (!ret)
+ ret = (h.type == type ? h.ptag : -EINVAL);
+ return ret;
+}
+
+/**
+ * cr_read_str - read a string record
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @n: string length
+ */
+int cr_read_str(struct cr_ctx *ctx, void *str, int n)
+{
+ return cr_read_obj_type(ctx, str, n, CR_HDR_STR);
+}
+
+/* read the checkpoint header */
+static int cr_read_hdr(struct cr_ctx *ctx)
+{
+ struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+ if (ret < 0)
+ return ret;
+ else if (ret != 0)
+ return -EINVAL;
+
+ if (hh->magic != CR_MAGIC_HEAD || hh->rev != CR_VERSION ||
+ hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+ hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+ hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+ return -EINVAL;
+
+ if (hh->flags & ~CR_CTX_CKPT)
+ return -EINVAL;
+
+ ctx->oflags = hh->flags;
+
+ /* FIX: verify compatibility of release, version and machine */
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return 0;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+ struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+ if (ret < 0)
+ return ret;
+ else if (ret != 0)
+ return -EINVAL;
+
+ if (hh->magic != CR_MAGIC_TAIL || hh->cksum != 1)
+ return -EINVAL;
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return 0;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+ struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct task_struct *t = current;
+ int ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+ if (ret < 0)
+ return ret;
+ else if (ret != 0)
+ return -EINVAL;
+
+ /* for now, only restore t->comm */
+ if (hh->task_comm_len < 0 || hh->task_comm_len > TASK_COMM_LEN)
+ return -EINVAL;
+
+ /* current should already have correct pid and tgid */
+
+ memset(t->comm, 0, TASK_COMM_LEN);
+ memcpy(t->comm, hh->comm, hh->task_comm_len);
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return 0;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+ int ret;
+
+ ret = cr_read_task_struct(ctx);
+ cr_debug("ret %d\n", ret);
+
+ return ret;
+}
+
+int do_restart(struct cr_ctx *ctx)
+{
+ int ret;
+
+ ret = cr_read_hdr(ctx);
+ if (!ret)
+ ret = cr_read_task(ctx);
+ if (!ret)
+ ret = cr_read_tail(ctx);
+
+ return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..2891c48
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,227 @@
+/*
+ * Generic container checkpoint-restart
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+
+#include "ckpt.h"
+
+/*
+ * helpers to write/read to/from the image file descriptor
+ *
+ * cr_uwrite() - write a user-space buffer to the checkpoint image
+ * cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ * cr_uread() - read from the checkpoint image to a user-space buffer
+ * cr_kread() - read from the checkpoint image to a kernel-space buffer
+ *
+ */
+
+/* (temporarily added file_pos_read() and file_pos_write() because they
+ * are static in fs/read_write.c... should cleanup and remove later) */
+static inline loff_t file_pos_read(struct file *file)
+{
+ return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+ file->f_pos = pos;
+}
+
+int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+ struct file *file = ctx->file;
+ ssize_t nwrite;
+ int nleft;
+
+ for (nleft = count; nleft; nleft -= nwrite) {
+ loff_t pos = file_pos_read(file);
+ nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
+ file_pos_write(file, pos);
+ if (unlikely(nwrite <= 0)) /* zero tolerance */
+ return (nwrite ? : -EIO);
+ buf += nwrite;
+ }
+
+ ctx->total += count;
+ return 0;
+}
+
+int cr_kwrite(struct cr_ctx *ctx, void *buf, int count)
+{
+ mm_segment_t oldfs;
+ int ret;
+
+ oldfs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = cr_uwrite(ctx, buf, count);
+ set_fs(oldfs);
+
+ return ret;
+}
+
+int cr_uread(struct cr_ctx *ctx, void *buf, int count)
+{
+ struct file *file = ctx->file;
+ ssize_t nread;
+ int nleft;
+
+ for (nleft = count; nleft; nleft -= nread) {
+ loff_t pos = file_pos_read(file);
+ nread = vfs_read(file, (char __user *) buf, nleft, &pos);
+ file_pos_write(file, pos);
+ if (unlikely(nread <= 0)) /* zero tolerance */
+ return (nread ? : -EIO);
+ buf += nread;
+ }
+
+ ctx->total += count;
+ return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *buf, int count)
+{
+ mm_segment_t oldfs;
+ int ret;
+
+ oldfs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = cr_uread(ctx, buf, count);
+ set_fs(oldfs);
+
+ return ret;
+}
+
+
+/*
+ * helpers to manage CR contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+static atomic_t cr_ctx_count; /* unique checkpoint identifier */
+
+void cr_ctx_free(struct cr_ctx *ctx)
+{
+
+ if (ctx->file)
+ fput(ctx->file);
+ if (ctx->vfsroot)
+ path_put(ctx->vfsroot);
+
+ free_pages((unsigned long) ctx->tbuf, CR_TBUF_ORDER);
+ free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
+
+ kfree(ctx);
+}
+
+struct cr_ctx *cr_ctx_alloc(pid_t pid, struct file *file, unsigned long flags)
+{
+ struct cr_ctx *ctx;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return NULL;
+
+ ctx->tbuf = (void *) __get_free_pages(GFP_KERNEL, CR_TBUF_ORDER);
+ ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
+ if (!ctx->tbuf || !ctx->hbuf)
+ goto nomem;
+
+ ctx->pid = pid;
+ ctx->flags = flags;
+
+ ctx->file = file;
+ get_file(file);
+
+ /* assume checkpointer is in container's root vfs */
+ ctx->vfsroot = &current->fs->root;
+ path_get(ctx->vfsroot);
+
+ ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+ return ctx;
+
+ nomem:
+ cr_ctx_free(ctx);
+ return NULL;
+}
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which dump the checkpoint image
+ * @flags: checkpoint operation flags
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+ struct cr_ctx *ctx;
+ struct file *file;
+ int fput_needed;
+ int ret;
+
+ file = fget_light(fd, &fput_needed);
+ if (!file)
+ return -EBADF;
+
+ /* no flags for now */
+ if (flags)
+ return -EINVAL;
+
+ ctx = cr_ctx_alloc(pid, file, flags | CR_CTX_CKPT);
+ if (!ctx) {
+ fput_light(file, fput_needed);
+ return -ENOMEM;
+ }
+
+ ret = do_checkpoint(ctx);
+
+ cr_ctx_free(ctx);
+ fput_light(file, fput_needed);
+
+ return ret;
+}
+
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which read the checkpoint image
+ * @flags: restart operation flags
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+ struct cr_ctx *ctx;
+ struct file *file;
+ int fput_needed;
+ int ret;
+
+ file = fget_light(fd, &fput_needed);
+ if (!file)
+ return -EBADF;
+
+ /* no flags for now */
+ if (flags)
+ return -EINVAL;
+
+ ctx = cr_ctx_alloc(crid, file, flags | CR_CTX_RSTR);
+ if (!ctx) {
+ fput_light(file, fput_needed);
+ return -ENOMEM;
+ }
+
+ ret = do_restart(ctx);
+
+ cr_ctx_free(ctx);
+ fput_light(file, fput_needed);
+
+ return ret;
+}
--
1.5.4.3

2008-08-21 03:09:21

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 9/9] File descriprtors (restore)


Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
and lookup tag in the hash table; if not found (first occurence), read
in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
Otherwise attach the file pointer from the hash as an FD.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.c | 3 +
checkpoint/ckpt.h | 6 +-
checkpoint/restart.c | 3 +
checkpoint/rstr_file.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 213 insertions(+), 3 deletions(-)
create mode 100644 checkpoint/rstr_file.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 179175b..fd073cd 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,3 +1,3 @@
obj-y += sys.o checkpoint.o restart.o objhash.o \
- ckpt_mem.o rstr_mem.o ckpt_file.o
+ ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index bf868ae..fe30ebb 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -172,6 +172,9 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
ret = cr_write_mm(ctx, t);
cr_debug("memory: ret %d\n", ret);
if (!ret)
+ ret = cr_write_files(ctx, t);
+ cr_debug("files: ret %d\n", ret);
+ if (!ret)
ret = cr_write_thread(ctx, t);
cr_debug("thread: ret %d\n", ret);
if (!ret)
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index ef2f74d..b83dea1 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -83,11 +83,13 @@ int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
int cr_read_str(struct cr_ctx *ctx, void *str, int n);

+int do_checkpoint(struct cr_ctx *ctx);
int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
-int cr_read_mm(struct cr_ctx *ctx);
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);

-int do_checkpoint(struct cr_ctx *ctx);
int do_restart(struct cr_ctx *ctx);
+int cr_read_mm(struct cr_ctx *ctx);
+int cr_read_files(struct cr_ctx *ctx);

#define cr_debug(fmt, args...) \
pr_debug("[CR:%s] " fmt, __func__, ## args)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 81ce0a4..4c2ef32 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -186,6 +186,9 @@ static int cr_read_task(struct cr_ctx *ctx)
ret = cr_read_mm(ctx);
cr_debug("memory: ret %d\n", ret);
if (!ret)
+ ret = cr_read_files(ctx);
+ cr_debug("files: ret %d\n", ret);
+ if (!ret)
ret = cr_read_thread(ctx);
cr_debug("thread: ret %d\n", ret);
if (!ret)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..a30d65d
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,202 @@
+/*
+ * Checkpoint file descriptors
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+#include "ckpt_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+ int *fdtable;
+ int n;
+
+ do {
+ n = cr_scan_fds(files, &fdtable);
+ if (n < 0)
+ return n;
+ while (n--)
+ sys_close(fdtable[n]);
+ kfree(fdtable);
+ } while (n != -1);
+
+ return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+ int fd = get_unused_fd_flags(0);
+
+ if (fd >= 0) {
+ fsnotify_open(file->f_path.dentry);
+ fd_install(fd, file);
+ }
+ return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_fd_data - restore the state of a given file pointer */
+static int
+cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int ptag)
+{
+ struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct file *file;
+ char *fname = NULL;
+ int fd, ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
+ cr_debug("ret %d ptag %d flags %#x mode %#x how %d\n",
+ ret, ptag, hh->f_flags, hh->f_mode, hh->how);
+ if (ret < 0)
+ return ret;
+ if (ret != ptag)
+ return -EINVAL;
+ /* FIX: more sanity checks on f_flags, f_mode etc */
+
+ switch (hh->how) {
+ case CR_FD_FILE:
+ case CR_FD_DIR:
+ case CR_FD_LINK:
+ fname = ctx->tbuf;
+ ret = cr_read_str(ctx, fname, PAGE_SIZE);
+ if (ret < 0)
+ return ret;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ cr_debug("open '%s' flags %#lx\n", fname, (unsigned long)hh->f_flags);
+ file = filp_open(fname, hh->f_flags, hh->f_mode);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ /* FIX: need to restore uid, gid, owner etc */
+
+ fd = cr_attach_file(file); /* no need to cleanup 'file' below */
+ if (fd < 0) {
+ filp_close(file, NULL);
+ return fd;
+ }
+
+ /* register new <tag, file> tuple in hash table */
+ ret = cr_obj_add_tag(ctx, (void *) file, ptag, CR_OBJ_FILE, 0);
+
+ if (!ret)
+ ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+ if (ret >= 0)
+ ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+ if (ret == -ESPIPE) /* ignore error on non-seekable files */
+ ret = 0;
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return (ret < 0 ? ret : fd);
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @ptag: parent tag
+ *
+ * Restore the state of a file descriptor; look up the tag (in the header)
+ * in the hash table, and if found pick the matching file pointer and use
+ * it; otherwise call cr_read_fd_data to restore the file pointer too.
+ */
+static int
+cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int ptag)
+{
+ struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct file *file;
+ int newfd, ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+ cr_debug("ret %d ptag %d tag %d fd %d\n", ret, ptag, hh->tag, hh->fd);
+ if (ret < 0)
+ return ret;
+ if (ret != ptag)
+ return -EINVAL;
+ cr_debug("tag %d close_on_exec %d\n", hh->tag, hh->close_on_exec);
+ if (hh->tag <= 0)
+ return -EINVAL;
+
+ file = cr_obj_get_by_tag(ctx, hh->tag, CR_OBJ_FILE);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ if (file) {
+ newfd = cr_attach_file(file);
+ if (newfd < 0)
+ return newfd;
+ get_file(file);
+ } else {
+ /* create new file pointer (and register in hash table) */
+ newfd = cr_read_fd_data(ctx, files, hh->tag);
+ if (newfd < 0)
+ return newfd;
+ }
+
+ cr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+ /* if newfd isn't desired fd, use dup2() to relocated it */
+ if (newfd != hh->fd) {
+ ret = sys_dup2(newfd, hh->fd);
+ sys_close(newfd);
+ }
+
+ if (ret >= 0 && hh->close_on_exec)
+ set_close_on_exec(hh->fd, 1);
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return (ret < 0 ? ret : 0);
+}
+
+int cr_read_files(struct cr_ctx *ctx)
+{
+ struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct files_struct *files = current->files;
+ int n, ret;
+
+ ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
+ if (ret < 0)
+ return ret;
+#if 0 /* activate when containers are used */
+ if (ret != task_pid_vnr(current))
+ return -EINVAL;
+#endif
+ cr_debug("tag %d nfds %d\n", hh->tag, hh->nfds);
+ if (hh->tag < 0 || hh->nfds < 0)
+ return -EINVAL;
+
+ /* point of no return -- close all file descriptors */
+ ret = cr_close_all_fds(files);
+ if (ret < 0)
+ return ret;
+
+ for (n = 0; n < hh->nfds; n++) {
+ ret = cr_read_fd_ent(ctx, files, hh->tag);
+ if (ret < 0)
+ break;
+ }
+
+ cr_hbuf_put(ctx, sizeof(*hh));
+ return ret;
+}
--
1.5.4.3

2008-08-21 03:09:43

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 7/9] Infrastructure for shared objects


Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e..g open files, memory address space
etc.

The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier and also
stored in a hash table (indexed by its physical kenrel address). From
then on the object will be found in the hash and only its identifier is
saved.

On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.

Signed-off-by: Oren Laadan <[email protected]>
---
Documentation/checkpoint.txt | 44 ++++++++++
checkpoint/Makefile | 2 +-
checkpoint/ckpt.h | 18 ++++
checkpoint/objhash.c | 193 ++++++++++++++++++++++++++++++++++++++++++
checkpoint/sys.c | 4 +
5 files changed, 260 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/objhash.c

diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
index fdc69cb..923ec10 100644
--- a/Documentation/checkpoint.txt
+++ b/Documentation/checkpoint.txt
@@ -163,6 +163,50 @@ cr_hdr + cr_hdr_task
cr_hdr + cr_hdr_tail


+=== Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects in the following manner.
+
+On the first encounter, the state is dumped and the object is assigned
+a unique identifier and also stored in the hash table (indexed by its
+physical kenrel address). From then on the object will be found in the
+hash and only its identifier is saved.
+
+On restart the identifier is looked up in the hash table; if not found
+then the state is read, the object is created, and added to the hash
+table (this time indexed by its identifier). Otherwise, the object in
+the hash table is used.
+
+The interface for the hash table is the following:
+
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type);
+ [checkpoint] find the unique identifier (tag) of the object that
+ is pointer to by ptr (or 0 if not found).
+
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *tag,
+ unsigned short type, unsigned short flags);
+ [checkpoint] add the object pointed to by ptr to the hash table if
+ it isn't already there, and fill its unique identifier (tag); will
+ return 0 if already found in the has, or 1 otherwise.
+
+void *cr_obj_get_by_tag(struct cr_ctx *ctx, int tag, unsigned short type);
+ [restart] return the pointer to the object whose unique identifier
+ is equal to tag.
+
+int cr_obj_add_tag(struct cr_ctx *ctx, void *ptr, int tag,
+ unsigned short type, unsigned short flags);
+ [restart] add the object with unique identifier tag, pointed to by
+ ptr to the hash table if it isn't already there; will return 0 if
+ already found in the has, or 1 otherwise.
+
+
=== Changelog

[2008-Jul-29] v1:
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 41e0877..cd57d9d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,2 @@
-obj-y += sys.o checkpoint.o restart.o ckpt_mem.o rstr_mem.o
+obj-y += sys.o checkpoint.o restart.o objhash.o ckpt_mem.o rstr_mem.o
obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index 0addb63..8b02c4c 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -29,6 +29,8 @@ struct cr_ctx {
void *hbuf; /* header: to avoid many alloc/dealloc */
int hpos;

+ struct cr_objhash *objhash;
+
struct cr_pgarr *pgarr;
struct cr_pgarr *pgcur;

@@ -56,6 +58,22 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count);
void *cr_hbuf_get(struct cr_ctx *ctx, int n);
void cr_hbuf_put(struct cr_ctx *ctx, int n);

+/* shared objects handling */
+
+enum {
+ CR_OBJ_FILE = 1,
+ CR_OBJ_MAX
+};
+
+void cr_objhash_free(struct cr_ctx *ctx);
+int cr_objhash_alloc(struct cr_ctx *ctx);
+void *cr_obj_get_by_tag(struct cr_ctx *ctx, int tag, unsigned short type);
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type);
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *tag,
+ unsigned short type, unsigned short flags);
+int cr_obj_add_tag(struct cr_ctx *ctx, void *ptr, int tag,
+ unsigned short type, unsigned short flags);
+
struct cr_hdr;

int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..aca32c6
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,193 @@
+/*
+ * Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+
+#include "ckpt.h"
+
+struct cr_obj {
+ int tag;
+ void *ptr;
+ unsigned short type;
+ unsigned short flags;
+ struct cr_obj *next;
+};
+
+struct cr_objhash {
+ struct cr_obj **hash;
+ int next_tag;
+};
+
+#define CR_OBJHASH_NBITS 10 /* 10 bits = 1K buckets */
+#define CR_OBJHASH_ORDER 0 /* 1K buckets * 4 bytes/bucket = 1 page */
+#define CR_OBJHASH_TOTAL (1 << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_obj *obj)
+{
+ switch (obj->type) {
+ case CR_OBJ_FILE:
+ fput((struct file *) obj->ptr);
+ break;
+ default:
+ BUG();
+ }
+}
+
+static void cr_obj_ref_grab(struct cr_obj *obj)
+{
+ switch (obj->type) {
+ case CR_OBJ_FILE:
+ get_file((struct file *) obj->ptr);
+ break;
+ default:
+ BUG();
+ }
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+ struct cr_obj **hash = objhash->hash;
+ struct cr_obj *obj, *next;
+ int n;
+
+ for (n = 0; n < CR_OBJHASH_TOTAL; n++) {
+ for (obj = hash[n]; obj; obj = next) {
+ next = obj->next;
+ cr_obj_ref_drop(obj);
+ kfree(obj);
+ }
+ }
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+ struct cr_objhash *objhash = ctx->objhash;
+
+ if (objhash) {
+ cr_objhash_clear(objhash);
+ free_pages((unsigned long) objhash->hash, CR_OBJHASH_ORDER);
+ kfree(ctx->objhash);
+ ctx->objhash = NULL;
+ }
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+ struct cr_objhash *objhash;
+
+ objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+ if (!objhash)
+ return -ENOMEM;
+ objhash->hash = (struct cr_obj **)
+ __get_free_pages(GFP_KERNEL, CR_OBJHASH_ORDER);
+ if (!objhash->hash) {
+ kfree(objhash);
+ return -ENOMEM;
+ }
+ memset(objhash->hash, 0, PAGE_SIZE << CR_OBJHASH_ORDER);
+ objhash->next_tag = 1;
+
+ ctx->objhash = objhash;
+ return 0;
+}
+
+static struct cr_obj *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+ struct cr_obj *obj;
+
+ obj = ctx->objhash->hash[hash_ptr(ptr, CR_OBJHASH_NBITS)];
+ while (obj && obj != ptr)
+ obj = obj->next;
+ return obj;
+}
+
+static struct cr_obj *cr_obj_find_by_tag(struct cr_ctx *ctx, int tag)
+{
+ struct cr_obj *obj;
+
+ obj = ctx->objhash->hash[hash_ptr((void *) tag, CR_OBJHASH_NBITS)];
+ while (obj && obj->tag != tag)
+ obj = obj->next;
+ return obj;
+}
+
+static struct cr_obj *cr_obj_new(struct cr_ctx *ctx, void *ptr, int tag,
+ unsigned short type, unsigned short flags)
+{
+ struct cr_obj *obj;
+ int n;
+
+ obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+ if (obj) {
+ obj->ptr = ptr;
+ obj->type = type;
+ obj->flags = flags;
+ obj->tag = (tag ? tag : ctx->objhash->next_tag++);
+
+ cr_obj_ref_grab(obj);
+
+ n = hash_ptr(ptr, CR_OBJHASH_NBITS);
+ obj->next = ctx->objhash->hash[n];
+ ctx->objhash->hash[n] = obj;
+ }
+ return obj;
+}
+
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *tag,
+ unsigned short type, unsigned short flags)
+{
+ struct cr_obj *obj;
+ int ret = 0;
+
+ obj = cr_obj_find_by_ptr(ctx, ptr);
+ if (!obj) {
+ obj = cr_obj_new(ctx, ptr, 0, type, flags);
+ if (!obj)
+ return -ENOMEM;
+ else
+ ret = 1;
+ } else if (obj->type != type) /* sanity check */
+ return -EINVAL;
+ *tag = obj->tag;
+ return ret;
+}
+
+int cr_obj_add_tag(struct cr_ctx *ctx, void *ptr, int tag,
+ unsigned short type, unsigned short flags)
+{
+ struct cr_obj *obj;
+
+ obj = cr_obj_new(ctx, ptr, tag, type, flags);
+ return (obj ? 0 : -ENOMEM);
+}
+
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+ struct cr_obj *obj;
+
+ obj = cr_obj_find_by_ptr(ctx, ptr);
+ if (obj)
+ return (obj->type == type ? obj->tag : -EINVAL);
+ else
+ return -ESRCH;
+}
+
+void *cr_obj_get_by_tag(struct cr_ctx *ctx, int tag, unsigned short type)
+{
+ struct cr_obj *obj;
+
+ obj = cr_obj_find_by_tag(ctx, tag);
+ if (obj)
+ return (obj->type == type ? obj->ptr : ERR_PTR(-EINVAL));
+ else
+ return NULL;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index eec5032..7b2670a 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -122,6 +122,7 @@ void cr_ctx_free(struct cr_ctx *ctx)
free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);

cr_pgarr_free(ctx);
+ cr_objhash_free(ctx);

kfree(ctx);
}
@@ -142,6 +143,9 @@ struct cr_ctx *cr_ctx_alloc(pid_t pid, struct file *file, unsigned long flags)
if (!cr_pgarr_alloc(ctx, &ctx->pgarr))
goto nomem;

+ if (cr_objhash_alloc(ctx) < 0)
+ goto nomem;
+
ctx->pid = pid;
ctx->flags = flags;

--
1.5.4.3

2008-08-21 03:10:58

by Oren Laadan

[permalink] [raw]
Subject: [RFC v2][PATCH 8/9] File descriprtors - dump state


Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Since FDs can be shared, they are assigned a
tag and registered in the object hash.

For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
and its close-on-exec property. If the FD is to be saved (first time)
then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
Then will come the next FD and so on.

This patch only handles basic FDs - regular files, directories and also
symbolic links.

Signed-off-by: Oren Laadan <[email protected]>
---
checkpoint/Makefile | 3 +-
checkpoint/ckpt.h | 2 +-
checkpoint/ckpt_file.c | 234 ++++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/ckpt_file.h | 17 ++++
checkpoint/ckpt_hdr.h | 31 +++++++
5 files changed, 285 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/ckpt_file.c
create mode 100644 checkpoint/ckpt_file.h

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index cd57d9d..179175b 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1,2 +1,3 @@
-obj-y += sys.o checkpoint.o restart.o objhash.o ckpt_mem.o rstr_mem.o
+obj-y += sys.o checkpoint.o restart.o objhash.o \
+ ckpt_mem.o rstr_mem.o ckpt_file.o
obj-$(CONFIG_X86) += ckpt_x86.o rstr_x86.o
diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
index 8b02c4c..ef2f74d 100644
--- a/checkpoint/ckpt.h
+++ b/checkpoint/ckpt.h
@@ -13,7 +13,7 @@
#include <linux/path.h>
#include <linux/fs.h>

-#define CR_VERSION 1
+#define CR_VERSION 2

struct cr_ctx {
pid_t pid; /* container identifier */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..18faaf1
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,234 @@
+/*
+ * Checkpoint file descriptors
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+#include "ckpt_file.h"
+
+#define CR_DEFAULT_FDTABLE 128
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ * @return: the number of open fds found
+ *
+ * Allocates the file descriptors array (*fdtable), caller should free
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+ int i, j, n, max;
+ struct fdtable *fdt;
+ int *fdlist;
+
+ max = CR_DEFAULT_FDTABLE;
+
+ repeat:
+ fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
+ if (!fdlist)
+ return -ENOMEM;
+
+ j = 0;
+ n = 0;
+
+ spin_lock(&files->file_lock);
+ fdt = files_fdtable(files);
+ for (;;) {
+ unsigned long set;
+ i = j * __NFDBITS;
+ if (i >= fdt->max_fds)
+ break;
+ set = fdt->open_fds->fds_bits[j++];
+ while (set) {
+ if (set & 1) {
+ if (unlikely(n == max)) {
+ spin_unlock(&files->file_lock);
+ kfree(fdlist);
+ max *= 2;
+ if (max < 0) /* overflow ? */
+ return -EMFILE;
+ goto repeat;
+ }
+ fdlist[n++] = i;
+ }
+ i++;
+ set >>= 1;
+ }
+ }
+ spin_unlock(&files->file_lock);
+
+ *fdtable = fdlist;
+ return n;
+}
+
+/* cr_write_fd_data - dump the state of a given file pointer */
+static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int ptag)
+{
+ struct cr_hdr h;
+ struct cr_hdr_fd_data *hh = ctx->hbuf;
+ struct dentry *dent = file->f_dentry;
+ struct inode *inode = dent->d_inode;
+ char *fname;
+ int flen, how, ret;
+
+ h.type = CR_HDR_FD_DATA;
+ h.len = sizeof(*hh);
+ h.ptag = ptag;
+
+ BUG_ON(!inode);
+
+ flen = PAGE_SIZE;
+ fname = cr_fill_fname(&file->f_path, ctx->vfsroot, ctx->tbuf, &flen);
+ if (IS_ERR(fname))
+ return PTR_ERR(fname);
+
+ hh->f_flags = file->f_flags;
+ hh->f_mode = file->f_mode;
+ hh->f_pos = file->f_pos;
+ hh->f_uid = file->f_uid;
+ hh->f_gid = file->f_gid;
+ hh->f_version = file->f_version;
+ /* FIX: need also file->f_owner */
+
+ switch(inode->i_mode & S_IFMT) {
+ case S_IFREG:
+ how = CR_FD_FILE;
+ break;
+ case S_IFDIR:
+ how = CR_FD_DIR;
+ break;
+ case S_IFLNK:
+ how = CR_FD_LINK;
+ break;
+ default:
+ return -EBADF;
+ }
+
+ /* FIX: check if the file/dir/link is unlinked */
+
+ BUG_ON(!flen);
+
+ ret = cr_write_obj(ctx, &h, hh);
+ if (!ret && flen)
+ ret = cr_write_str(ctx, fname, flen);
+
+ return ret;
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Save the state of the file descriptor; look up the actual file pointer
+ * in the hash table, and if found save the matching tag, otherwise call
+ * cr_write_fd_data to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+ struct cr_hdr h;
+ struct cr_hdr_fd_ent *hh = ctx->hbuf;
+ struct file *file = NULL;
+ struct fdtable *fdt;
+ int coe, tag, ret;
+
+ /* make sure hh->fd (that is of type __u16) doesn't overflow */
+ if (fd > USHORT_MAX) {
+ pr_warning("CR: open files table too big (%d)\n", USHORT_MAX);
+ return -EMFILE;
+ }
+
+ rcu_read_lock();
+ fdt = files_fdtable(files);
+ if (fd < fdt->max_fds)
+ file = rcu_dereference(fdt->fd[fd]);
+ if (file) {
+ coe = FD_ISSET(fd, fdt->close_on_exec);
+ get_file(file);
+ }
+ rcu_read_unlock();
+
+ /* sanity check (although this shouldn't happen) */
+ if (unlikely(!file))
+ return -EBADF;
+
+ ret = cr_obj_add_ptr(ctx, (void *) file, &tag, CR_OBJ_FILE, 0);
+ cr_debug("fd %d tag %d file %p c-o-e %d)\n", fd, tag, file, coe);
+
+ if (ret >= 0) {
+ int new = ret;
+
+ h.type = CR_HDR_FD_ENT;
+ h.len = sizeof(*hh);
+ h.ptag = 0;
+
+ hh->tag = tag;
+ hh->fd = fd;
+ hh->close_on_exec = coe;
+
+ ret = cr_write_obj(ctx, &h, hh);
+
+ /* new==1 if-and-only-if file was new and added to hash */
+ if (!ret && new)
+ ret = cr_write_fd_data(ctx, file, tag);
+ }
+
+ fput(file);
+ return ret;
+}
+
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
+{
+ struct cr_hdr h;
+ struct cr_hdr_files *hh = ctx->hbuf;
+ struct files_struct *files;
+ int *fdtable;
+ int nfds, n, ret;
+
+ h.type = CR_HDR_FILES;
+ h.len = sizeof(*hh);
+ h.ptag = task_pid_vnr(t);
+
+ files = get_files_struct(t);
+
+ nfds = cr_scan_fds(files, &fdtable);
+ if (nfds < 0) {
+ ret = nfds;
+ goto out;
+ }
+
+ hh->tag = 0;
+ hh->nfds = nfds;
+
+ ret = cr_write_obj(ctx, &h, hh);
+ if (ret < 0)
+ goto clean;
+
+ cr_debug("nfds %d\n", nfds);
+ for (n = 0; n < nfds; n++) {
+ ret = cr_write_fd_ent(ctx, files, n);
+ if (ret < 0)
+ break;
+ }
+
+ clean:
+ kfree(fdtable);
+ out:
+ put_files_struct(files);
+
+ return ret;
+}
diff --git a/checkpoint/ckpt_file.h b/checkpoint/ckpt_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/ckpt_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ * Checkpoint file descriptors
+ *
+ * Copyright (C) 2008 Oren Laadan
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file COPYING in the main directory of the Linux
+ * distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/checkpoint/ckpt_hdr.h b/checkpoint/ckpt_hdr.h
index a3919cf..a8a37db 100644
--- a/checkpoint/ckpt_hdr.h
+++ b/checkpoint/ckpt_hdr.h
@@ -43,6 +43,10 @@ enum {
CR_HDR_VMA,
CR_HDR_MM_CONTEXT,

+ CR_HDR_FILES = 301,
+ CR_HDR_FD_ENT,
+ CR_HDR_FD_DATA,
+
CR_HDR_TAIL = 5001
};

@@ -52,6 +56,13 @@ enum {
CR_VMA_FILE
};

+/* fd subtypes */
+enum {
+ CR_FD_FILE = 1,
+ CR_FD_DIR,
+ CR_FD_LINK
+};
+
struct cr_hdr_head {
__u64 magic;

@@ -114,4 +125,24 @@ struct cr_hdr_vma {

} __attribute__ ((aligned (8)));

+struct cr_hdr_files {
+ __u32 tag; /* sharing identifier */
+ __u32 nfds;
+} __attribute__ ((aligned (8)));
+
+struct cr_hdr_fd_ent {
+ __u32 tag;
+ __u16 fd;
+ __u16 close_on_exec;
+} __attribute__ ((aligned (8)));
+
+struct cr_hdr_fd_data {
+ __u16 how;
+ __u16 f_mode;
+ __u32 f_flags;
+ __u32 f_uid, f_gid;
+ __u64 f_pos;
+ __u64 f_version;
+} __attribute__ ((aligned (8)));
+
#endif /* _CHECKPOINT_CKPT_HDR_H_ */
--
1.5.4.3

2008-08-21 05:17:28

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 1/9] kernel based checkpoint-restart


Subject line should have been 's/PATCH 1/PATCH 0/' ...

I left cr_debug() in for now to provide more info than pr_debug();
eventually that will be changed back to pr_debug()

In the mini-conference we considered doing CR in a kernel module,
but decided against because we needed a system call. It is still
possible to put the bulk of the code in a module. This is useful,
besides reducing debug time (recompile, unload, reload), to reduce
the kernel memory footprint. Also, an administrator can load/unload
the module to enable/disable this feature. Any thoughts ?

Oren.

Oren Laadan wrote:
>
> These patches implement checkpoint-restart [CR v2]. This version adds
> save and restore of open files state (regular files and directories)
> which makes it more usable. Other changes address the feedback given
> for the previous version. It is also refactored (along Dave's posting)
> for easier reviewing.
>
> Todo:
> - Add support for x86-64 and improve ABI
> - Refine or change syscall interface
> - Extend to handle (multiple) tasks in a container
> - Security (without CAPS_SYS_ADMIN files restore may fail)
>
> Changelog:
>
> [2008-Aug-20] v2:
> - Added dump and restore of open files (regular and directories);
> see the changes in the test program (ckpt.c)
> - Added basic handling of shared objects, and use 'parent tag'
> - Added documentation
> - Improved ABI, add 64bit padding for image data
> - Improved locking when saving/restoring memory
> - Added UTS information to header (release, version, machine)
> - Cleanup extraction of filename from a file pointer
> - Refactor to allow easier reviewing
> - Remove requirement for CAPS_SYS_ADMIN until we come up with a
> security policy (this means that file restore may fail)
> - Other cleanup in response to comments for v1
>
> [2008-Jul-29] v1:
> - Initial version: support a single task with address space of only
> private anonymous or file-mapped VMAs; syscalls ignore pid/crid
> argument and act on current process.
>

2008-08-21 05:27:58

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 9/9] File descriprtors (restore)


>
> Restore open file descriptors: for each FD read 'struct cr_hdr_fd_ent'
> and lookup tag in the hash table; if not found (first occurence), read
> in 'struct cr_hdr_fd_data', create a new FD and register in the hash.
> Otherwise attach the file pointer from the hash as an FD.
>
> This patch only handles basic FDs - regular files, directories and also
> symbolic links.
>
> Signed-off-by: Oren Laadan <[email protected]>
> ---
> checkpoint/Makefile | 2 +-
> checkpoint/checkpoint.c | 3 +
> checkpoint/ckpt.h | 6 +-
> checkpoint/restart.c | 3 +
> checkpoint/rstr_file.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 213 insertions(+), 3 deletions(-)
> create mode 100644 checkpoint/rstr_file.c
>

To restore the open files, cr_read_files() first closes all the
existing FDs of the current process. This breaks the assumption
underlying the use of fget_light/fput_light in sys_restart(), so
use fget/fput instead.


diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 7b2670a..ec1609b 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -210,10 +210,9 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
{
struct cr_ctx *ctx;
struct file *file;
- int fput_needed;
int ret;

- file = fget_light(fd, &fput_needed);
+ file = fget(fd);
if (!file)
return -EBADF;

@@ -223,14 +222,14 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)

ctx = cr_ctx_alloc(crid, file, flags | CR_CTX_RSTR);
if (!ctx) {
- fput_light(file, fput_needed);
+ fput(file);
return -ENOMEM;
}

ret = do_restart(ctx);

cr_ctx_free(ctx);
- fput_light(file, fput_needed);
+ fput(file);

return ret;
}

2008-08-21 05:30:48

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls


this, of course, should be the last patch, not the first.

Oren Laadan wrote:
> Create trivial sys_checkpoint and sys_restore system calls. They will
> enable to checkpoint and restart an entire container, to and from a
> checkpoint image file.
>
> First create a template for both syscalls: they take a file descriptor
> (for the image file) and flags as arguments. For sys_checkpoint the
> first argument identifies the target container; for sys_restart it will
> identify the checkpoint image.
>
> Signed-off-by: Oren Laadan <[email protected]>
> ---
> arch/x86/kernel/syscall_table_32.S | 2 ++
> include/asm-x86/unistd_32.h | 2 ++
> include/linux/syscalls.h | 2 ++
> 3 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index d44395f..5543136 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -332,3 +332,5 @@ ENTRY(sys_call_table)
> .long sys_dup3 /* 330 */
> .long sys_pipe2
> .long sys_inotify_init1
> + .long sys_checkpoint
> + .long sys_restart
> diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
> index d739467..88bdec4 100644
> --- a/include/asm-x86/unistd_32.h
> +++ b/include/asm-x86/unistd_32.h
> @@ -338,6 +338,8 @@
> #define __NR_dup3 330
> #define __NR_pipe2 331
> #define __NR_inotify_init1 332
> +#define __NR_checkpoint 333
> +#define __NR_restart 334
>
> #ifdef __KERNEL__
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index d6ff145..edc218b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -622,6 +622,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
> asmlinkage long sys_eventfd(unsigned int count);
> asmlinkage long sys_eventfd2(unsigned int count, int flags);
> asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
>
> int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
>

2008-08-21 07:31:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state


* Oren Laadan <[email protected]> wrote:

> checkpoint/ckpt_x86.c | 28 ++++
> checkpoint/rstr_x86.c | 2 +

please move these into arch/x86/mm/checkpoint.c and
arch/x86/mm/restore.c. (also, please dont try to abbreviate too much in
filenames, makes it harder to follow changes later on, etc.)

Cool stuff btw! Any ETA on when more complex apps can be
checkpointed/restored? Also, shouldnt at least part of this effort be
joined with the live migration efforts of openvz? (or do you see
fundamental, unsolvable differences?)

Ingo

2008-08-21 08:01:47

by Justin P. Mattock

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

CONFIG_INGO=Y
(sorry saw andrew Morton do that
Couldn't resist);

justin P. Mattock



On Aug 21, 2008, at 12:30 AM, Ingo Molnar <[email protected]> wrote:

>
> * Oren Laadan <[email protected]> wrote:
>
>> checkpoint/ckpt_x86.c | 28 ++++
>> checkpoint/rstr_x86.c | 2 +
>
> please move these into arch/x86/mm/checkpoint.c and
> arch/x86/mm/restore.c. (also, please dont try to abbreviate too much
> in
> filenames, makes it harder to follow changes later on, etc.)
>
> Cool stuff btw! Any ETA on when more complex apps can be
> checkpointed/restored? Also, shouldnt at least part of this effort be
> joined with the live migration efforts of openvz? (or do you see
> fundamental, unsolvable differences?)
>
> Ingo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-
> kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2008-08-21 09:35:42

by Louis Rilling

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart

On Wed, Aug 20, 2008 at 11:04:13PM -0400, Oren Laadan wrote:
>
> Add those interfaces, as well as helpers needed to easily manage the
> file format. The code is roughly broken out as follows:
>
> ckpt/sys.c - user/kernel data transfer, as well as setup of the
> checkpoint/restart context (a per-checkpoint data structure for
> housekeeping)
>
> ckpt/checkpoint.c - output wrappers and basic checkpoint handling
>
> ckpt/restart.c - input wrappers and basic restart handling
>
> Patches to add the per-architecture support as well as the actual
> work to do the memory checkpoint follow in subsequent patches.
>

[...]

> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> new file mode 100644
> index 0000000..2891c48
> --- /dev/null
> +++ b/checkpoint/sys.c

[...]

> +/*
> + * helpers to manage CR contexts: allocated for each checkpoint and/or
> + * restart operation, and persists until the operation is completed.
> + */
> +
> +static atomic_t cr_ctx_count; /* unique checkpoint identifier */

I thought we agreed that this counter should be per-container. Perhaps add a
TODO here?

> +
> +void cr_ctx_free(struct cr_ctx *ctx)
> +{
> +
> + if (ctx->file)
> + fput(ctx->file);
> + if (ctx->vfsroot)
> + path_put(ctx->vfsroot);
> +
> + free_pages((unsigned long) ctx->tbuf, CR_TBUF_ORDER);
> + free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
> +
> + kfree(ctx);
> +}
> +
> +struct cr_ctx *cr_ctx_alloc(pid_t pid, struct file *file, unsigned long flags)
> +{
> + struct cr_ctx *ctx;
> +
> + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> + if (!ctx)
> + return NULL;
> +
> + ctx->tbuf = (void *) __get_free_pages(GFP_KERNEL, CR_TBUF_ORDER);
> + ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
> + if (!ctx->tbuf || !ctx->hbuf)
> + goto nomem;
> +
> + ctx->pid = pid;
> + ctx->flags = flags;
> +
> + ctx->file = file;
> + get_file(file);
> +
> + /* assume checkpointer is in container's root vfs */

I'm a bit puzzled by this assumption. I would say: either this is a
self-checkpoint (only current process), or this is a container checkpoint. In
the latter case, I expect that in the general case the checkpointer lives
outside the container (and the interface of sys_checkpoint() below confirms
this), so it's root fs is probably not the container's one.

Does it differ from what you're planning?

Thanks,

Louis

> + ctx->vfsroot = &current->fs->root;
> + path_get(ctx->vfsroot);
> +
> + ctx->crid = atomic_inc_return(&cr_ctx_count);
> +
> + return ctx;
> +
> + nomem:
> + cr_ctx_free(ctx);
> + return NULL;
> +}
> +
> +/**
> + * sys_checkpoint - checkpoint a container
> + * @pid: pid of the container init(1) process
> + * @fd: file to which dump the checkpoint image
> + * @flags: checkpoint operation flags
> + */
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> +{
> + struct cr_ctx *ctx;
> + struct file *file;
> + int fput_needed;
> + int ret;
> +
> + file = fget_light(fd, &fput_needed);
> + if (!file)
> + return -EBADF;
> +
> + /* no flags for now */
> + if (flags)
> + return -EINVAL;
> +
> + ctx = cr_ctx_alloc(pid, file, flags | CR_CTX_CKPT);
> + if (!ctx) {
> + fput_light(file, fput_needed);
> + return -ENOMEM;
> + }
> +
> + ret = do_checkpoint(ctx);
> +
> + cr_ctx_free(ctx);
> + fput_light(file, fput_needed);
> +
> + return ret;
> +}
> +
> +/**
> + * sys_restart - restart a container
> + * @crid: checkpoint image identifier
> + * @fd: file from which read the checkpoint image
> + * @flags: restart operation flags
> + */
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
> +{
> + struct cr_ctx *ctx;
> + struct file *file;
> + int fput_needed;
> + int ret;
> +
> + file = fget_light(fd, &fput_needed);
> + if (!file)
> + return -EBADF;
> +
> + /* no flags for now */
> + if (flags)
> + return -EINVAL;
> +
> + ctx = cr_ctx_alloc(crid, file, flags | CR_CTX_RSTR);
> + if (!ctx) {
> + fput_light(file, fput_needed);
> + return -ENOMEM;
> + }
> +
> + ret = do_restart(ctx);
> +
> + cr_ctx_free(ctx);
> + fput_light(file, fput_needed);
> +
> + return ret;
> +}

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


Attachments:
(No filename) (4.14 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2008-08-21 09:53:29

by Louis Rilling

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

On Wed, Aug 20, 2008 at 11:05:15PM -0400, Oren Laadan wrote:
>
> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
> it will be followed by the file name. The cr_vma->npages will tell
> how many pages were dumped for this VMA. Then it will be followed
> by the actual data: first a dump of the addresses of all dumped
> pages (npages entries) followed by a dump of the contents of all
> dumped pages (npages pages). Then will come the next VMA and so on.

[...]

> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
> new file mode 100644
> index 0000000..a23aa29
> --- /dev/null
> +++ b/checkpoint/ckpt_mem.c

[...]

> +/**
> + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
> + * @ctx - checkpoint context
> + * @pgarr - page-array to fill
> + * @vma - vma to scan
> + * @start - start address (updated)
> + */
> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
> + struct vm_area_struct *vma, unsigned long *start)
> +{
> + unsigned long end = vma->vm_end;
> + unsigned long addr = *start;
> + struct page **pagep;
> + unsigned long *addrp;
> + int cow, nr, ret = 0;
> +
> + nr = pgarr->nleft;
> + pagep = &pgarr->pages[pgarr->nused];
> + addrp = &pgarr->addrs[pgarr->nused];
> + cow = !!vma->vm_file;
> +
> + while (addr < end) {
> + struct page *page;
> +
> + /* simplified version of get_user_pages(): already have vma,
> + * only need FOLL_TOUCH, and (for now) ignore fault stats */
> +
> + cond_resched();
> + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
> + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> + if (ret & VM_FAULT_ERROR) {
> + if (ret & VM_FAULT_OOM)
> + ret = -ENOMEM;
> + else if (ret & VM_FAULT_SIGBUS)
> + ret = -EFAULT;
> + else
> + BUG();
> + break;
> + }

+ ret = 0;

> + cond_resched();
> + }
> +
> + if (IS_ERR(page)) {
> + ret = PTR_ERR(page);
> + break;
> + }

Need to check ret here:

+ if (ret)
break;

> +
> + if (page == ZERO_PAGE(0))
> + page = NULL; /* zero page: ignore */
> + else if (cow && page_mapping(page) != NULL)
> + page = NULL; /* clean cow: ignore */
> + else {
> + get_page(page);
> + *(addrp++) = addr;
> + *(pagep++) = page;
> + if (--nr == 0) {
> + addr += PAGE_SIZE;
> + break;
> + }
> + }
> +
> + addr += PAGE_SIZE;
> + }
> +
> + if (unlikely(ret < 0)) {
> + nr = pgarr->nleft - nr;
> + while (nr--)
> + page_cache_release(*(--pagep));
> + return ret;
> + }
> +
> + *start = addr;
> + return (pgarr->nleft - nr);
> +}

[...]

Thanks,

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


Attachments:
(No filename) (2.69 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2008-08-21 10:07:51

by Louis Rilling

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 5/9] Memory managemnet - restore state

On Wed, Aug 20, 2008 at 11:05:39PM -0400, Oren Laadan wrote:
>
> Restoring the memory address space begins with nuking the existing one
> of the current process, and then reading the VMA state and contents.
> Call do_mmap_pgoffset() for each VMA and then read in the data.

[...]

> diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
> new file mode 100644
> index 0000000..df602a9
> --- /dev/null
> +++ b/checkpoint/rstr_mem.c

[...]

> +static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
> +{
> + struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
> + unsigned long vm_size, vm_flags, vm_prot, vm_pgoff;
> + unsigned long addr;
> + unsigned long flags;
> + struct file *file = NULL;
> + char *fname = NULL;
> + int ret;
> +
> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
> + if (ret < 0)
> + return ret;
> + else if (ret != 0)
> + return -EINVAL;
> +
> + cr_debug("vma %#lx-%#lx npages %d\n", (unsigned long) hh->vm_start,
> + (unsigned long) hh->vm_end, (int) hh->npages);
> +
> + if (hh->vm_end < hh->vm_start || hh->npages < 0)
> + return -EINVAL;
> +
> + vm_size = hh->vm_end - hh->vm_start;
> + vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
> + vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
> + vm_pgoff = hh->vm_pgoff;
> +
> + if (hh->fname) {
> + fname = ctx->tbuf;
> + ret = cr_read_str(ctx, fname, PAGE_SIZE);
> + if (ret < 0)
> + return ret;
> + }
> +
> + cr_debug("vma fname '%s' how %d\n", fname, hh->how);
> +
> + switch (hh->how) {
> +
> + case CR_VMA_ANON: /* anonymous private mapping */
> + if (hh->fname)
> + return -EINVAL;
> + /* vm_pgoff for anonymous mapping is the "global" page
> + offset (namely from addr 0x0), so we force a zero */
> + vm_pgoff = 0;
> + break;
> +
> + case CR_VMA_FILE: /* private mapping from a file */
> + if (!hh->fname)
> + return -EINVAL;
> + /* O_RDWR only needed if both (VM_WRITE|VM_SHARED) are set */
> + flags = hh->vm_flags & (VM_WRITE | VM_SHARED);
> + flags = (flags == (VM_WRITE | VM_SHARED) ? O_RDWR : O_RDONLY);
> + file = filp_open(fname, flags, 0);
> + if (IS_ERR(file))
> + return PTR_ERR(file);
> + break;
> +
> + default:
> + return -EINVAL;
> +
> + }
> +
> + addr = do_mmap_pgoff(file, (unsigned long) hh->vm_start,
> + vm_size, vm_prot, vm_flags, vm_pgoff);
> + cr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
> + vm_size, vm_prot, vm_flags, vm_pgoff, addr);
> +
> + /* the file (if opened) is now referenced by the vma */
> + if (file)
> + filp_close(file, NULL);
> +
> + if (IS_ERR((void*) addr))
> + return (PTR_ERR((void *) addr));
> +
> + /*
> + * CR_VMA_ANON: read in memory as is
> + * CR_VMA_FILE: read in memory as is
> + * (more to follow ...)
> + */
> +
> + switch (hh->how) {
> + case CR_VMA_ANON:
> + case CR_VMA_FILE:
> + /* standard case: read the data into the memory */
> + ret = cr_vma_read_pages(ctx, hh);
> + break;
> + }
> +
> + if (ret < 0)
> + return ret;
> +
> + if (vm_prot & PROT_EXEC)
> + flush_icache_range(hh->vm_start, hh->vm_end);
> +
> + cr_hbuf_put(ctx, sizeof(*hh));
> + cr_debug("vma retval %d\n", ret);
> + return 0;
> +}
> +
> +static int cr_destroy_mm(struct mm_struct *mm)
> +{
> + struct vm_area_struct *vmnext = mm->mmap;
> + struct vm_area_struct *vma;
> + int ret;
> +
> + while (vmnext) {
> + vma = vmnext;
> + vmnext = vmnext->vm_next;
> + ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
> + if (ret < 0)
> + return ret;
> + }
> + return 0;
> +}
> +
> +int cr_read_mm(struct cr_ctx *ctx)
> +{
> + struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
> + struct mm_struct *mm;
> + int nr, ret;
> +
> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
> + if (ret < 0)
> + return ret;
> +#if 0 /* activate when containers are used */
> + if (ret != task_pid_vnr(current))
> + return -EINVAL;
> +#endif
> + cr_debug("map_count %d\n", hh->map_count);
> +
> + /* XXX need more sanity checks */
> + if (hh->start_code > hh->end_code ||
> + hh->start_data > hh->end_data || hh->map_count < 0)
> + return -EINVAL;
> +
> + mm = current->mm;
> +
> + /* point of no return -- destruct current mm */
> + down_write(&mm->mmap_sem);
> + ret = cr_destroy_mm(mm);
> + up_write(&mm->mmap_sem);
> +
> + if (ret < 0)
> + return ret;
> +

Should down_write(&mm->mmap_sem) again here, and hold it until all vmas are
restored. This means removing down_write() from cr_vma_writable(). Or perhaps
make it finer grain: release it before looping on the vmas and make
cr_read_vma() take it again before calling do_mmap_pgoff().

> + mm->start_code = hh->start_code;
> + mm->end_code = hh->end_code;
> + mm->start_data = hh->start_data;
> + mm->end_data = hh->end_data;
> + mm->start_brk = hh->start_brk;
> + mm->brk = hh->brk;
> + mm->start_stack = hh->start_stack;
> + mm->arg_start = hh->arg_start;
> + mm->arg_end = hh->arg_end;
> + mm->env_start = hh->env_start;
> + mm->env_end = hh->env_end;
> +
> + /* FIX: need also mm->flags */
> +
> + for (nr = hh->map_count; nr; nr--) {
> + ret = cr_read_vma(ctx, mm);
> + if (ret < 0)
> + return ret;
> + }
> +
> + ret = cr_read_mm_context(ctx, mm, hh->tag);
> +
> + cr_hbuf_put(ctx, sizeof(*hh));
> + return ret;
> +}

Thanks,

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


Attachments:
(No filename) (5.27 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2008-08-21 10:28:38

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

Ingo Molnar wrote:
> * Oren Laadan <[email protected]> wrote:
>
>> checkpoint/ckpt_x86.c | 28 ++++
>> checkpoint/rstr_x86.c | 2 +
>
> please move these into arch/x86/mm/checkpoint.c and
> arch/x86/mm/restore.c. (also, please dont try to abbreviate too much in
> filenames, makes it harder to follow changes later on, etc.)
>
> Cool stuff btw! Any ETA on when more complex apps can be
> checkpointed/restored? Also, shouldnt at least part of this effort be
> joined with the live migration efforts of openvz? (or do you see
> fundamental, unsolvable differences?)

My understanding is that this is a result of the joint discussion (with the
OpenVZ folks present) that happened during the containers min-summit prior to
OLS this year.

--
Balbir

2008-08-21 10:40:28

by Louis Rilling

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 7/9] Infrastructure for shared objects

On Wed, Aug 20, 2008 at 11:06:50PM -0400, Oren Laadan wrote:
>
> Infrastructure to handle objects that may be shared and referenced by
> multiple tasks or other objects, e..g open files, memory address space
> etc.
>
> The state of shared objects is saved once. On the first encounter, the
> state is dumped and the object is assigned a unique identifier and also
> stored in a hash table (indexed by its physical kenrel address). From
> then on the object will be found in the hash and only its identifier is
> saved.
>
> On restart the identifier is looked up in the hash table; if not found
> then the state is read, the object is created, and added to the hash
> table (this time indexed by its identifier). Otherwise, the object in
> the hash table is used.

[...]

> diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
> index 0addb63..8b02c4c 100644
> --- a/checkpoint/ckpt.h
> +++ b/checkpoint/ckpt.h
> @@ -29,6 +29,8 @@ struct cr_ctx {
> void *hbuf; /* header: to avoid many alloc/dealloc */
> int hpos;
>
> + struct cr_objhash *objhash;
> +
> struct cr_pgarr *pgarr;
> struct cr_pgarr *pgcur;
>
> @@ -56,6 +58,22 @@ int cr_kread(struct cr_ctx *ctx, void *buf, int count);
> void *cr_hbuf_get(struct cr_ctx *ctx, int n);
> void cr_hbuf_put(struct cr_ctx *ctx, int n);
>
> +/* shared objects handling */
> +
> +enum {
> + CR_OBJ_FILE = 1,
> + CR_OBJ_MAX
> +};
> + +void cr_objhash_free(struct cr_ctx *ctx);
^
|
Strange, isn't it?

> +int cr_objhash_alloc(struct cr_ctx *ctx);
> +void *cr_obj_get_by_tag(struct cr_ctx *ctx, int tag, unsigned short type);
> +int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type);
> +int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *tag,
> + unsigned short type, unsigned short flags);
> +int cr_obj_add_tag(struct cr_ctx *ctx, void *ptr, int tag,
> + unsigned short type, unsigned short flags);
> +
> struct cr_hdr;
>
> int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
> diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
> new file mode 100644
> index 0000000..aca32c6
> --- /dev/null
> +++ b/checkpoint/objhash.c
> @@ -0,0 +1,193 @@
> +/*
> + * Checkpoint-restart - object hash infrastructure to manage shared
> objects + *
> + * Copyright (C) 2008 Oren Laadan
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file COPYING in the main directory of the Linux
> + * distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/file.h>
> +#include <linux/hash.h>
> +
> +#include "ckpt.h"
> +
> +struct cr_obj {
> + int tag;
> + void *ptr;
> + unsigned short type;
> + unsigned short flags;
> + struct cr_obj *next;
> +};
> +
> +struct cr_objhash {
> + struct cr_obj **hash;
> + int next_tag;
> +};
> +
> +#define CR_OBJHASH_NBITS 10 /* 10 bits = 1K buckets */
> +#define CR_OBJHASH_ORDER 0 /* 1K buckets * 4 bytes/bucket = 1 page */

Only true when PAGE_SIZE == 4K and in 32bits. Perhaps like below?

#define CR_OBJHASH_BUCKET_NBITS (BITS_PER_LONG == 64 ? 3 : 2)
#define CR_MIN_OBJHASH_NBITS ((PAGE_SHIFT - CR_OBJHASH_BUCKET_NBITS)
#define CR_OBJHASH_NBITS (CR_MIN_OBJHASH_NBITS >= 10 ? CR_MIN_OBJHASH_NBITS : 10)
#define CR_OBJHASH_ORDER (CR_OBJHASH_NBITS + CR_OBJHASH_BUCKET_NBITS - PAGE_SHIFT)

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


Attachments:
(No filename) (3.42 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2008-08-21 11:06:26

by Louis Rilling

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 8/9] File descriprtors - dump state

On Wed, Aug 20, 2008 at 11:07:16PM -0400, Oren Laadan wrote:
>
> Dump the files_struct of a task with 'struct cr_hdr_files', followed by
> all open file descriptors. Since FDs can be shared, they are assigned a
> tag and registered in the object hash.
>
> For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
> and its close-on-exec property. If the FD is to be saved (first time)
> then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
> Then will come the next FD and so on.
>
> This patch only handles basic FDs - regular files, directories and also
> symbolic links.

[...]

> diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
> new file mode 100644
> index 0000000..18faaf1
> --- /dev/null
> +++ b/checkpoint/ckpt_file.c
> @@ -0,0 +1,234 @@
> +/*
> + * Checkpoint file descriptors
> + *
> + * Copyright (C) 2008 Oren Laadan
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file COPYING in the main directory of the Linux
> + * distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/file.h>
> +#include <linux/fdtable.h>
> +
> +#include "ckpt.h"
> +#include "ckpt_hdr.h"
> +#include "ckpt_file.h"
> +
> +#define CR_DEFAULT_FDTABLE 128
> +
> +/**
> + * cr_scan_fds - scan file table and construct array of open fds
> + * @files: files_struct pointer
> + * @fdtable: (output) array of open fds
> + * @return: the number of open fds found
> + *
> + * Allocates the file descriptors array (*fdtable), caller should free
> + */
> +int cr_scan_fds(struct files_struct *files, int **fdtable)
> +{
> + int i, j, n, max;
> + struct fdtable *fdt;
> + int *fdlist;
> +
> + max = CR_DEFAULT_FDTABLE;
> +
> + repeat:
> + fdlist = kmalloc(max * sizeof(*fdlist), GFP_KERNEL);
> + if (!fdlist)
> + return -ENOMEM;
> +
> + j = 0;
> + n = 0;
> +
> + spin_lock(&files->file_lock);
> + fdt = files_fdtable(files);
> + for (;;) {
> + unsigned long set;
> + i = j * __NFDBITS;
> + if (i >= fdt->max_fds)
> + break;
> + set = fdt->open_fds->fds_bits[j++];

Why not simply use fcheck_files(files, n) and check if the result is not NULL?

> + while (set) {
> + if (set & 1) {
> + if (unlikely(n == max)) {
> + spin_unlock(&files->file_lock);
> + kfree(fdlist);
> + max *= 2;
> + if (max < 0) /* overflow ? */
> + return -EMFILE;
> + goto repeat;
> + }
> + fdlist[n++] = i;
> + }
> + i++;
> + set >>= 1;
> + }
> + }
> + spin_unlock(&files->file_lock);
> +
> + *fdtable = fdlist;
> + return n;
> +}
> +
> +/* cr_write_fd_data - dump the state of a given file pointer */
> +static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int ptag)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_fd_data *hh = ctx->hbuf;
> + struct dentry *dent = file->f_dentry;
> + struct inode *inode = dent->d_inode;
> + char *fname;
> + int flen, how, ret;
> +
> + h.type = CR_HDR_FD_DATA;
> + h.len = sizeof(*hh);
> + h.ptag = ptag;
> +
> + BUG_ON(!inode);
> +
> + flen = PAGE_SIZE;
> + fname = cr_fill_fname(&file->f_path, ctx->vfsroot, ctx->tbuf, &flen);
> + if (IS_ERR(fname))
> + return PTR_ERR(fname);
> +
> + hh->f_flags = file->f_flags;
> + hh->f_mode = file->f_mode;
> + hh->f_pos = file->f_pos;
> + hh->f_uid = file->f_uid;
> + hh->f_gid = file->f_gid;
> + hh->f_version = file->f_version;
> + /* FIX: need also file->f_owner */
> +
> + switch(inode->i_mode & S_IFMT) {
> + case S_IFREG:
> + how = CR_FD_FILE;
> + break;
> + case S_IFDIR:
> + how = CR_FD_DIR;
> + break;
> + case S_IFLNK:
> + how = CR_FD_LINK;
> + break;
> + default:
> + return -EBADF;
> + }
> +
> + /* FIX: check if the file/dir/link is unlinked */
> +
> + BUG_ON(!flen);
> +
> + ret = cr_write_obj(ctx, &h, hh);
> + if (!ret && flen)
> + ret = cr_write_str(ctx, fname, flen);
> +
> + return ret;
> +}
> +
> +/**
> + * cr_write_fd_ent - dump the state of a given file descriptor
> + * @ctx: checkpoint context
> + * @files: files_struct pointer
> + * @fd: file descriptor
> + * + * Save the state of the file descriptor; look up the actual file
> pointer
> + * in the hash table, and if found save the matching tag, otherwise call
> + * cr_write_fd_data to dump the file pointer too.
> + */
> +static int
> +cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_fd_ent *hh = ctx->hbuf;
> + struct file *file = NULL;
> + struct fdtable *fdt;
> + int coe, tag, ret;
> +
> + /* make sure hh->fd (that is of type __u16) doesn't overflow */
> + if (fd > USHORT_MAX) {
> + pr_warning("CR: open files table too big (%d)\n", USHORT_MAX);
> + return -EMFILE;
> + }
> +
> + rcu_read_lock();
> + fdt = files_fdtable(files);
> + if (fd < fdt->max_fds)
> + file = rcu_dereference(fdt->fd[fd]);

You should probably use fcheck_files() instead of copying its code here. I agree
that this means dereferencing files->fdt a second time below, but is it so
performance critical?

> + if (file) {
> + coe = FD_ISSET(fd, fdt->close_on_exec);
> + get_file(file);
> + }
> + rcu_read_unlock();
> +
> + /* sanity check (although this shouldn't happen) */
> + if (unlikely(!file))
> + return -EBADF;
> +
> + ret = cr_obj_add_ptr(ctx, (void *) file, &tag, CR_OBJ_FILE, 0);
> + cr_debug("fd %d tag %d file %p c-o-e %d)\n", fd, tag, file, coe);
> +
> + if (ret >= 0) {
> + int new = ret;
> +
> + h.type = CR_HDR_FD_ENT;
> + h.len = sizeof(*hh);
> + h.ptag = 0;
> + + hh->tag = tag;
^
|

> + hh->fd = fd;
> + hh->close_on_exec = coe;
> +
> + ret = cr_write_obj(ctx, &h, hh);
> +
> + /* new==1 if-and-only-if file was new and added to hash */
> + if (!ret && new)
> + ret = cr_write_fd_data(ctx, file, tag);
> + }
> +
> + fput(file);
> + return ret;
> +}
> +
> +int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_files *hh = ctx->hbuf;
> + struct files_struct *files;
> + int *fdtable;
> + int nfds, n, ret;
> +
> + h.type = CR_HDR_FILES;
> + h.len = sizeof(*hh);
> + h.ptag = task_pid_vnr(t);
> +
> + files = get_files_struct(t);
> +
> + nfds = cr_scan_fds(files, &fdtable);
> + if (nfds < 0) {
> + ret = nfds;
> + goto out;
> + }
> + + hh->tag = 0;
^
|

> + hh->nfds = nfds;
> +
> + ret = cr_write_obj(ctx, &h, hh);
> + if (ret < 0)
> + goto clean;
> +
> + cr_debug("nfds %d\n", nfds);
> + for (n = 0; n < nfds; n++) {
> + ret = cr_write_fd_ent(ctx, files, n);
> + if (ret < 0)
> + break;
> + }
> +
> + clean:
> + kfree(fdtable);
> + out:
> + put_files_struct(files);
> +
> + return ret;
> +}

[...]

> diff --git a/checkpoint/ckpt_hdr.h b/checkpoint/ckpt_hdr.h
> index a3919cf..a8a37db 100644
> --- a/checkpoint/ckpt_hdr.h
> +++ b/checkpoint/ckpt_hdr.h

[...]

> @@ -114,4 +125,24 @@ struct cr_hdr_vma {
>
> } __attribute__ ((aligned (8)));
>
> +struct cr_hdr_files {
> + __u32 tag; /* sharing identifier */
> + __u32 nfds;
> +} __attribute__ ((aligned (8)));
> +
> +struct cr_hdr_fd_ent {
> + __u32 tag;
> + __u16 fd;
> + __u16 close_on_exec;
> +} __attribute__ ((aligned (8)));
> +
> +struct cr_hdr_fd_data {
> + __u16 how;
> + __u16 f_mode;
> + __u32 f_flags;
> + __u32 f_uid, f_gid;

We are not at a 64bits boundary here. Should add one __32 padding or reorder the
fields.

> + __u64 f_pos;
> + __u64 f_version;
> +} __attribute__ ((aligned (8)));
> +
> #endif /* _CHECKPOINT_CKPT_HDR_H_ */
> --
> 1.5.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


Attachments:
(No filename) (7.67 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2008-08-21 12:00:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state


* Balbir Singh <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Oren Laadan <[email protected]> wrote:
> >
> >> checkpoint/ckpt_x86.c | 28 ++++
> >> checkpoint/rstr_x86.c | 2 +
> >
> > please move these into arch/x86/mm/checkpoint.c and
> > arch/x86/mm/restore.c. (also, please dont try to abbreviate too much in
> > filenames, makes it harder to follow changes later on, etc.)
> >
> > Cool stuff btw! Any ETA on when more complex apps can be
> > checkpointed/restored? Also, shouldnt at least part of this effort
> > be joined with the live migration efforts of openvz? (or do you see
> > fundamental, unsolvable differences?)
>
> My understanding is that this is a result of the joint discussion
> (with the OpenVZ folks present) that happened during the containers
> min-summit prior to OLS this year.

great news :-)

This stuff should be completed ASAP, user-transparent kernel upgrades on
a generic Linux distro would be really nice :-)

Ingo

2008-08-22 20:01:47

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls

On Wed, 2008-08-20 at 23:03 -0400, Oren Laadan wrote:
> 6/unistd_32.h
> index d739467..88bdec4 100644
> --- a/include/asm-x86/unistd_32.h
> +++ b/include/asm-x86/unistd_32.h
> @@ -338,6 +338,8 @@
> #define __NR_dup3 330
> #define __NR_pipe2 331
> #define __NR_inotify_init1 332
> +#define __NR_checkpoint 333
> +#define __NR_restart 334
>
> #ifdef __KERNEL__

Uh oh. These got whitespace mangled somehow. I'll look through them,
but probably can't produce patches on top of them for now.

-- Dave

2008-08-22 20:02:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart

On Wed, 2008-08-20 at 23:04 -0400, Oren Laadan wrote:
> index f3e2065..584c8f9 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -619,7 +619,7 @@ export mod_strip_cmd
>
>
> ifeq ($(KBUILD_EXTMOD),)
> -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
> +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
>
> vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
> $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
> diff --git a/checkpoint/Makefile b/checkpoint/Makefile
> new file mode 100644
> index 0000000..d129878
> --- /dev/null
> +++ b/checkpoint/Makefile
> @@ -0,0 +1 @@
> +obj-y += sys.o checkpoint.o restart.o
> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> new file mode 100644
> index 0000000..25343f5
> --- /dev/null
> +++ b/checkpoint/checkpoint.c
> @@ -0,0 +1,191 @@
> +/*
> + * Checkpoint logic and helpers
> + *
> + * Copyright (C) 2008 Oren Laadan
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file COPYING in the main directory of the Linux
> + * distribution for more details.
> + */
> +
> +#include <linux/version.h>
> +#include <linux/sched.h>
> +#include <linux/time.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/dcache.h>
> +#include <linux/mount.h>
> +#include <linux/utsname.h>
> +#include <asm/ptrace.h>
> +
> +#include "ckpt.h"
> +#include "ckpt_hdr.h"
> +
> +/**
> + * cr_fill_fname - return pathname of a given file
> + * @file: file pointer
> + * @buf: buffer for pathname
> + * @n: buffer length (in) and pathname length (out)
> + */
> +char *cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
> +{
> + char *fname;
> +
> + BUG_ON(!buf);
> + fname = __d_path(path, root, buf, *n);
> + if (!IS_ERR(fname))
> + *n = (buf + (*n) - fname);
> + return fname;
> +}
> +
> +/**
> + * cr_write_obj - write a record described by a cr_hdr
> + * @ctx: checkpoint context
> + * @h: record descriptor
> + * @buf: record buffer
> + */
> +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
> +{
> + int ret;
> +
> + if ((ret = cr_kwrite(ctx, h, sizeof(*h))) < 0)
> + return ret;

These all need to be fixed up to Linux style:

ret = = cr_kwrite(ctx, h, sizeof(*h)));
if (ret < 0)
...;

You might want to try checkpatch.pl on these. It is pretty good at
finding this stuff.

> +/* write the checkpoint header */
> +static int cr_write_hdr(struct cr_ctx *ctx)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_head *hh = ctx->hbuf;
> + struct new_utsname *uts;
> + struct timeval ktv;
> +
> + h.type = CR_HDR_HEAD;
> + h.len = sizeof(*hh);
> + h.ptag = 0;
> +
> + do_gettimeofday(&ktv);
> +
> + hh->magic = CR_MAGIC_HEAD;
> + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
> + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
> + hh->patch = (LINUX_VERSION_CODE) & 0xff;

We at least neex EXTRAVERSION in there if we're going to even pretend
this is useful, right? is this redundant with utsname() below?

> + hh->rev = CR_VERSION;
> +
> + hh->flags = ctx->flags;
> + hh->time = ktv.tv_sec;
> +
> + uts = utsname();
> + memcpy(hh->release, uts->release, __NEW_UTS_LEN);
> + memcpy(hh->version, uts->version, __NEW_UTS_LEN);
> + memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
> +
> + return cr_write_obj(ctx, &h, hh);
> +}
> +
> +/* write the checkpoint trailer */
> +static int cr_write_tail(struct cr_ctx *ctx)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_tail *hh = ctx->hbuf;
> +
> + h.type = CR_HDR_TAIL;
> + h.len = sizeof(*hh);
> + h.ptag = 0;
> +
> + hh->magic = CR_MAGIC_TAIL;
> + hh->cksum = 1; /* TBD ... */
> +
> + return cr_write_obj(ctx, &h, hh);
> +}
> +
> +/* dump the task_struct of a given task */
> +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_task *hh = ctx->hbuf;
> +
> + h.type = CR_HDR_TASK;
> + h.len = sizeof(*hh);
> + h.ptag = 0;
> +
> + hh->state = t->state;
> + hh->exit_state = t->exit_state;
> + hh->exit_code = t->exit_code;
> + hh->exit_signal = t->exit_signal;
> +
> + hh->utime = t->utime;
> + hh->stime = t->stime;
> + hh->utimescaled = t->utimescaled;
> + hh->stimescaled = t->stimescaled;
> + hh->gtime = t->gtime;
> + hh->prev_utime = t->prev_utime;
> + hh->prev_stime = t->prev_stime;
> + hh->nvcsw = t->nvcsw;
> + hh->nivcsw = t->nivcsw;
> + hh->start_time_sec = t->start_time.tv_sec;
> + hh->start_time_nsec = t->start_time.tv_nsec;
> + hh->real_start_time_sec = t->real_start_time.tv_sec;
> + hh->real_start_time_nsec = t->real_start_time.tv_nsec;
> + hh->min_flt = t->min_flt;
> + hh->maj_flt = t->maj_flt;
> +
> + hh->task_comm_len = TASK_COMM_LEN;
> + memcpy(hh->comm, t->comm, TASK_COMM_LEN);
> +
> + return cr_write_obj(ctx, &h, hh);
> +}
> +
> +/* dump the entire state of a given task */
> +static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
> +{
> + int ret ;
> +
> + if (t->state == TASK_DEAD) {
> + pr_warning("CR: task may not be in state TASK_DEAD\n");
> + return -EAGAIN;
> + }
> +
> + ret = cr_write_task_struct(ctx, t);
> + cr_debug("ret %d\n", ret);
> +
> + return ret;
> +}
> +
> +int do_checkpoint(struct cr_ctx *ctx)
> +{
> + int ret;
> +
> + /* FIX: need to test whether container is checkpointable */
> +
> + ret = cr_write_hdr(ctx);
> + if (!ret)
> + ret = cr_write_task(ctx, current);
> + if (!ret)
> + ret = cr_write_tail(ctx);
> +
> + /* on success, return (unique) checkpoint identifier */
> + if (!ret)
> + ret = ctx->crid;
> +
> + return ret;
> +}
> diff --git a/checkpoint/ckpt.h b/checkpoint/ckpt.h
> new file mode 100644
> index 0000000..7ecafb3
> --- /dev/null
> +++ b/checkpoint/ckpt.h
> @@ -0,0 +1,71 @@
> +#ifndef _CHECKPOINT_CKPT_H_
> +#define _CHECKPOINT_CKPT_H_
> +/*
> + * Generic container checkpoint-restart
> + *
> + * Copyright (C) 2008 Oren Laadan
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file COPYING in the main directory of the Linux
> + * distribution for more details.
> + */
> +
> +#include <linux/path.h>
> +#include <linux/fs.h>
> +
> +#define CR_VERSION 1
> +
> +struct cr_ctx {
> + pid_t pid; /* container identifier */
> + int crid; /* unique checkpoint id */
> +
> + unsigned long flags;
> + unsigned long oflags; /* restart: old flags */
> +
> + struct file *file;
> + int total; /* total read/written */
> +
> + void *tbuf; /* temp: to avoid many alloc/dealloc */
> + void *hbuf; /* header: to avoid many alloc/dealloc */
> + int hpos;

alloc/delloc is not a bad thing. :) I think at least a few of these
should be moved over to stack or kmalloc()/kfree().

> + struct path *vfsroot; /* container root */
> +};

Can a container only have one vfsroot?

> +
> +/* cr_ctx: flags */
> +#define CR_CTX_CKPT 0x1
> +#define CR_CTX_RSTR 0x2

enum?

> +/* allocation defaults */
> +#define CR_TBUF_ORDER 1
> +#define CR_TBUF_TOTAL (PAGE_SIZE << CR_TBUF_ORDER)
> +
> +#define CR_HBUF_ORDER 1
> +#define CR_HBUF_TOTAL (PAGE_SIZE << CR_HBUF_ORDER)
> +
> +char *cr_fill_fname(struct path *path, struct path *root, char *buf, int *n);
> +
> +int cr_uwrite(struct cr_ctx *ctx, void *buf, int count);
> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
> +int cr_uread(struct cr_ctx *ctx, void *buf, int count);
> +int cr_kread(struct cr_ctx *ctx, void *buf, int count);
> +
> +void *cr_hbuf_get(struct cr_ctx *ctx, int n);
> +void cr_hbuf_put(struct cr_ctx *ctx, int n);
> +
> +struct cr_hdr;
> +
> +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
> +int cr_write_str(struct cr_ctx *ctx, char *str, int n);
> +
> +int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
> +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
> +int cr_read_str(struct cr_ctx *ctx, void *str, int n);
> +
> +int do_checkpoint(struct cr_ctx *ctx);
> +int do_restart(struct cr_ctx *ctx);
> +
> +#define cr_debug(fmt, args...) \
> + pr_debug("[CR:%s] " fmt, __func__, ## args)
> +
> +#endif /* _CHECKPOINT_CKPT_H_ */
> diff --git a/checkpoint/ckpt_hdr.h b/checkpoint/ckpt_hdr.h
> new file mode 100644
> index 0000000..a478b7c
> --- /dev/null
> +++ b/checkpoint/ckpt_hdr.h
> @@ -0,0 +1,86 @@
> +#ifndef _CHECKPOINT_CKPT_HDR_H_
> +#define _CHECKPOINT_CKPT_HDR_H_
> +/*
> + * Generic container checkpoint-restart
> + *
> + * Copyright (C) 2008 Oren Laadan
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file COPYING in the main directory of the Linux
> + * distribution for more details.
> + */
> +
> +#include <linux/types.h>
> +#include <linux/utsname.h>
> +
> +#define CR_MAGIC_HEAD 0x00feed0cc0a2d200LL
> +#define CR_MAGIC_TAIL 0x002d2a0cc0deef00LL

What happened to include/linux/magic.h?

> +/*
> + * To maintain compatibility between 32-bit and 64-bit architecture flavors,
> + * keep data 64-bit aligned: use padding for structure members, and use
> + * __attribute__ ((aligned (8))) for the entire structure.
> + */
> +
> +/* records: generic header */
> +
> +struct cr_hdr {
> + __s16 type;
> + __s16 len;
> + __u32 ptag;
> +};
> +
> +enum {
> + CR_HDR_HEAD = 1,
> + CR_HDR_STR,
> +
> + CR_HDR_TASK = 101,
> + CR_HDR_THREAD,
> + CR_HDR_CPU,
> +
> + CR_HDR_MM = 201,
> + CR_HDR_VMA,
> + CR_HDR_MM_CONTEXT,
> +
> + CR_HDR_TAIL = 5001
> +};
> +
> +struct cr_hdr_head {
> + __u64 magic;
> +
> + __u16 major;
> + __u16 minor;
> + __u16 patch;
> + __u16 rev;
> +
> + __u64 time; /* when checkpoint taken */
> + __u64 flags; /* checkpoint options */
> +
> + char release[__NEW_UTS_LEN];
> + char version[__NEW_UTS_LEN];
> + char machine[__NEW_UTS_LEN];
> +} __attribute__ ((aligned (8)));
> +
> +struct cr_hdr_tail {
> + __u64 magic;
> + __u64 cksum;
> +} __attribute__ ((aligned (8)));
> +
> +struct cr_hdr_task {
> + __u64 state;
> + __u32 exit_state;
> + __u32 exit_code, exit_signal;
> +
> + __u64 utime, stime, utimescaled, stimescaled;
> + __u64 gtime;
> + __u64 prev_utime, prev_stime;
> + __u64 nvcsw, nivcsw;
> + __u64 start_time_sec, start_time_nsec;
> + __u64 real_start_time_sec, real_start_time_nsec;
> + __u64 min_flt, maj_flt;
> +
> + __s16 task_comm_len;
> + char comm[TASK_COMM_LEN];
> +} __attribute__ ((aligned (8)));

TASK_COMM_LEN might be subject to changing. Can it be part of the
user<->kernel ABI?

...
> +/*
> + * During restart the code reads in data from the chekcpoint image into a
> + * temporary buffer (ctx->hbuf). Because operations can be nested, one
> + * should call cr_hbuf_get() to reserve space in the buffer, and then
> + * cr_hbuf_put() when it no longer needs that space
> + */
> +
> +#include <linux/version.h>
> +#include <linux/sched.h>
> +#include <linux/file.h>
> +
> +#include "ckpt.h"
> +#include "ckpt_hdr.h"
> +
> +/**
> + * cr_hbuf_get - reserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to reserve
> + */
> +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
> +{
> + void *ptr;
> +
> + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
> + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
> + ctx->hpos += n;
> + return ptr;
> +}
> +
> +/**
> + * cr_hbuf_put - unreserve space on the hbuf
> + * @ctx: checkpoint context
> + * @n: number of bytes to reserve
> + */
> +void cr_hbuf_put(struct cr_ctx *ctx, int n)
> +{
> + BUG_ON(ctx->hpos < n);
> + ctx->hpos -= n;
> +}

I replaced all of the uses of these with kmalloc()/kfree() and stack
allocations. I'm really not sure what these buy us since our allocators
are already so fast. tbuf, especially, worries me if it gets used in
any kind of nested manner we're going to get some really fun bugs.

> +/**
> + * cr_read_obj_type - read a whole record of expected type
> + * @ctx: checkpoint context
> + * @buf: record buffer
> + * @n: available buffer size
> + * @type: expected record type
> + */
> +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type)
> +{
> + struct cr_hdr h;
> + int ret;
> +
> + ret = cr_read_obj(ctx, &h, buf, n);
> + if (!ret)
> + ret = (h.type == type ? h.ptag : -EINVAL);
> + return ret;
> +}

That ( ? : ) needs to get broken out so it is easier to read. This
function is already nice and tight as it is, so saving a line or two
doesn't really do any good.

> +/**
> + * cr_read_str - read a string record
> + * @ctx: checkpoint context
> + * @str: string buffer
> + * @n: string length
> + */
> +int cr_read_str(struct cr_ctx *ctx, void *str, int n)
> +{
> + return cr_read_obj_type(ctx, str, n, CR_HDR_STR);
> +}
> +
> +/* read the checkpoint header */
> +static int cr_read_hdr(struct cr_ctx *ctx)
> +{
> + struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
> + int ret;
> +
> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
> + if (ret < 0)
> + return ret;
> + else if (ret != 0)
> + return -EINVAL;

How about just making cr_read_obj_type() stop returning positive values?
Is it ever valid for it to return >0?

> + if (hh->magic != CR_MAGIC_HEAD || hh->rev != CR_VERSION ||
> + hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
> + hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
> + hh->patch != ((LINUX_VERSION_CODE) & 0xff))
> + return -EINVAL;
> +
> + if (hh->flags & ~CR_CTX_CKPT)
> + return -EINVAL;
> +
> + ctx->oflags = hh->flags;
> +
> + /* FIX: verify compatibility of release, version and machine */
> +
> + cr_hbuf_put(ctx, sizeof(*hh));
> + return 0;
> +}
> +
> +/* read the checkpoint trailer */
> +static int cr_read_tail(struct cr_ctx *ctx)
> +{
> + struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
> + int ret;
> +
> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
> + if (ret < 0)
> + return ret;
> + else if (ret != 0)
> + return -EINVAL;
> +
> + if (hh->magic != CR_MAGIC_TAIL || hh->cksum != 1)
> + return -EINVAL;

If we're not going to use the cksum, can it just be pulled out for now?

> + cr_hbuf_put(ctx, sizeof(*hh));
> + return 0;
> +}
> +
> +/* read the task_struct into the current task */
> +static int cr_read_task_struct(struct cr_ctx *ctx)
> +{
> + struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
> + struct task_struct *t = current;
> + int ret;
> +
> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
> + if (ret < 0)
> + return ret;
> + else if (ret != 0)
> + return -EINVAL;
> +
> + /* for now, only restore t->comm */
> + if (hh->task_comm_len < 0 || hh->task_comm_len > TASK_COMM_LEN)
> + return -EINVAL;

Well, if TASK_COMM_LEN changed, then the size of the structure changed,
and the ABI changed. Uh oh.


> +int cr_uwrite(struct cr_ctx *ctx, void *buf, int count)
> +{
> + struct file *file = ctx->file;
> + ssize_t nwrite;
> + int nleft;
> +
> + for (nleft = count; nleft; nleft -= nwrite) {
> + loff_t pos = file_pos_read(file);
> + nwrite = vfs_write(file, (char __user *) buf, nleft, &pos);
> + file_pos_write(file, pos);
> + if (unlikely(nwrite <= 0)) /* zero tolerance */
> + return (nwrite ? : -EIO);

First of all, the unlikely() needs to die. This isn't
performance-critical code, and I don't see any justification for why it
is needed.

Also, errors like -EAGAIN are just fine and normal to come back from a
write to a file. We shoudn't be erroring out on them. (Note that this
is something that userspace would be handling by itself if we were doing
this in userspace.)

> +int cr_uread(struct cr_ctx *ctx, void *buf, int count)
> +{
> + struct file *file = ctx->file;
> + ssize_t nread;
> + int nleft;
> +
> + for (nleft = count; nleft; nleft -= nread) {
> + loff_t pos = file_pos_read(file);
> + nread = vfs_read(file, (char __user *) buf, nleft, &pos);
> + file_pos_write(file, pos);
> + if (unlikely(nread <= 0)) /* zero tolerance */
> + return (nread ? : -EIO);
> + buf += nread;
> + }
> +
> + ctx->total += count;
> + return 0;
> +}
> +
> +int cr_kread(struct cr_ctx *ctx, void *buf, int count)
> +{
> + mm_segment_t oldfs;
> + int ret;
> +
> + oldfs = get_fs();
> + set_fs(KERNEL_DS);
> + ret = cr_uread(ctx, buf, count);
> + set_fs(oldfs);
> +
> + return ret;
> +}
> +
> +
> +/*
> + * helpers to manage CR contexts: allocated for each checkpoint and/or
> + * restart operation, and persists until the operation is completed.
> + */
> +
> +static atomic_t cr_ctx_count; /* unique checkpoint identifier */
> +
> +void cr_ctx_free(struct cr_ctx *ctx)
> +{
> +
> + if (ctx->file)
> + fput(ctx->file);
> + if (ctx->vfsroot)
> + path_put(ctx->vfsroot);
> +
> + free_pages((unsigned long) ctx->tbuf, CR_TBUF_ORDER);
> + free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
> +
> + kfree(ctx);
> +}
> +
> +struct cr_ctx *cr_ctx_alloc(pid_t pid, struct file *file, unsigned long flags)
> +{
> + struct cr_ctx *ctx;
> +
> + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> + if (!ctx)
> + return NULL;
> +
> + ctx->tbuf = (void *) __get_free_pages(GFP_KERNEL, CR_TBUF_ORDER);
> + ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
> + if (!ctx->tbuf || !ctx->hbuf)
> + goto nomem;

Any particular reason you're using __get_free_pages() here? kmalloc()
would save you a cast, and having to deal with keeping "order" vs. a
plain size. Also, why don't you just stick tbuf and hbuf's allocation
into the ctx structure itself for now? It'll save some of the error
paths. Yeah, it might become an order 2 or 3 allocation, but that
should be OK. We're not in interrupt context or anything.

> +/**
> + * sys_checkpoint - checkpoint a container
> + * @pid: pid of the container init(1) process
> + * @fd: file to which dump the checkpoint image
> + * @flags: checkpoint operation flags
> + */
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> +{
> + struct cr_ctx *ctx;
> + struct file *file;
> + int fput_needed;
> + int ret;
> +
> + file = fget_light(fd, &fput_needed);
> + if (!file)
> + return -EBADF;
> +
> + /* no flags for now */
> + if (flags)
> + return -EINVAL;
> +
> + ctx = cr_ctx_alloc(pid, file, flags | CR_CTX_CKPT);
> + if (!ctx) {
> + fput_light(file, fput_needed);
> + return -ENOMEM;
> + }
> +
> + ret = do_checkpoint(ctx);
> +
> + cr_ctx_free(ctx);
> + fput_light(file, fput_needed);
> +
> + return ret;
> +}
> +
> +/**
> + * sys_restart - restart a container
> + * @crid: checkpoint image identifier
> + * @fd: file from which read the checkpoint image
> + * @flags: restart operation flags
> + */
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
> +{
> + struct cr_ctx *ctx;
> + struct file *file;
> + int fput_needed;
> + int ret;
> +
> + file = fget_light(fd, &fput_needed);
> + if (!file)
> + return -EBADF;
> +
> + /* no flags for now */
> + if (flags)
> + return -EINVAL;
> +
> + ctx = cr_ctx_alloc(crid, file, flags | CR_CTX_RSTR);
> + if (!ctx) {
> + fput_light(file, fput_needed);
> + return -ENOMEM;
> + }
> +
> + ret = do_restart(ctx);
> +
> + cr_ctx_free(ctx);
> + fput_light(file, fput_needed);
> +
> + return ret;
> +}

Should we just combine most of these two functions? If we passed a fd
into cr_ctx_alloc(), it could do the fget/put_light() bits, and we
wouldn't need the code in both sys_checkpoint() and sys_restart().

-- Dave

2008-08-22 20:11:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls

On Fri, 2008-08-22 at 12:32 -0700, Dave Hansen wrote:
> On Wed, 2008-08-20 at 23:03 -0400, Oren Laadan wrote:
> > 6/unistd_32.h
> > index d739467..88bdec4 100644
> > --- a/include/asm-x86/unistd_32.h
> > +++ b/include/asm-x86/unistd_32.h
> > @@ -338,6 +338,8 @@
> > #define __NR_dup3 330
> > #define __NR_pipe2 331
> > #define __NR_inotify_init1 332
> > +#define __NR_checkpoint 333
> > +#define __NR_restart 334
> >
> > #ifdef __KERNEL__
>
> Uh oh. These got whitespace mangled somehow. I'll look through them,
> but probably can't produce patches on top of them for now.

This patch also has to go *AFTER* you actually define the syscall
functions. Otherwise, this won't be build-bisectable when it gets
committed.

-- Dave

2008-08-22 20:18:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 3/9] x86 support for checkpoint/restart

On Wed, 2008-08-20 at 23:04 -0400, Oren Laadan wrote:+
> +#ifdef CONFIG_X86_64
> +
> +#error "CONFIG_X86_64 unsupported yet."

Well, I told you I'd write some Kconfig magic for you, but I can't since
I can't apply these patches. :)

You need this in a Kconfig file somewhere:

config CHECKPOINT_RESTART
prompt "Good prompt"
def_bool y
depends on X86_32 && EXPERIMENTAL
help
Lots of fun help...

config SELECT_ALL_NAMESPACES
prompt "Enable all kernel namespace support"
selects IPC_NS NET_NS ...
help
This option will save you having to go track down
all of the individual places that namespaces are
supported in the kernel.

If you don't turn this on, the restart porttion
of checkpoint/restart gets a lots less reliable...

probably in checkpoint/Kconfig.

Then, in all of the architectures' arch/*/Kconfig files, you need to:

source "checkpoint/Kconfig"

Then, the Makefile looks like this:

--- /dev/null 2008-04-22 10:49:52.000000000 -0700
+++ oren-cr.git-dave/checkpoint/Makefile 2008-08-22 12:30:53.000000000 -0700
@@ -0,0 +1 @@
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o

-- Dave

2008-08-22 20:37:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

On Wed, 2008-08-20 at 23:05 -0400, Oren Laadan wrote:
> index 7ecafb3..0addb63 100644
> --- a/checkpoint/ckpt.h
> +++ b/checkpoint/ckpt.h
> @@ -29,6 +29,9 @@ struct cr_ctx {
> void *hbuf; /* header: to avoid many alloc/dealloc */
> int hpos;
>
> + struct cr_pgarr *pgarr;
> + struct cr_pgarr *pgcur;
> +
> struct path *vfsroot; /* container root */
> };

These need much better variable names. From this, I have no idea what
they do. Checkpoint Restart Pirates Go ARR!

> @@ -62,6 +65,9 @@ int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
> int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
> int cr_read_str(struct cr_ctx *ctx, void *str, int n);
>
> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
> +int cr_read_mm(struct cr_ctx *ctx);
> +
> int do_checkpoint(struct cr_ctx *ctx);
> int do_restart(struct cr_ctx *ctx);
>
> diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
> index b7cc8c9..3b87a6f 100644
> --- a/checkpoint/ckpt_arch.h
> +++ b/checkpoint/ckpt_arch.h
> @@ -2,6 +2,7 @@
>
> int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
> int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag);
>
> int cr_read_thread(struct cr_ctx *ctx);
> int cr_read_cpu(struct cr_ctx *ctx);
> diff --git a/checkpoint/ckpt_hdr.h b/checkpoint/ckpt_hdr.h
> index a478b7c..a3919cf 100644
> --- a/checkpoint/ckpt_hdr.h
> +++ b/checkpoint/ckpt_hdr.h
> @@ -30,6 +30,7 @@ struct cr_hdr {
> __u32 ptag;
> };
>
> +/* header types */
> enum {
> CR_HDR_HEAD = 1,
> CR_HDR_STR,
> @@ -45,6 +46,12 @@ enum {
> CR_HDR_TAIL = 5001
> };
>
> +/* vma subtypes */
> +enum {
> + CR_VMA_ANON = 1,
> + CR_VMA_FILE
> +};

Is this really necessary, or can we infer it from the contents of the
VMA?

> struct cr_hdr_head {
> __u64 magic;
>
> @@ -83,4 +90,28 @@ struct cr_hdr_task {
> char comm[TASK_COMM_LEN];
> } __attribute__ ((aligned (8)));
>
> +struct cr_hdr_mm {
> + __u32 tag; /* sharing identifier */

If this really is a sharing identifier, we need a:

struct cr_shared_object_ref {
__u32 tag;
};

And then one of *those* in the vma struct. Make it much more idiot (aka
Dave) proof.

> + __s16 map_count;
> + __s16 _padding;
> +
> + __u64 start_code, end_code, start_data, end_data;
> + __u64 start_brk, brk, start_stack;
> + __u64 arg_start, arg_end, env_start, env_end;
> +
> +} __attribute__ ((aligned (8)));
> +
> +struct cr_hdr_vma {
> + __u32 how;

It is too bad that we can't actually use the enum type here. It would
make it *much* more obvious what this was. I actually had to go look at
the code below to figure it out.

> + __s16 npages;

Wow. Linux only supports 256MB in a single VMA? I didn't know that.
Maybe we should make this type bigger. :)

This also needs to get called something much more descriptive, like
nr_present_pages.



> #endif /* _CHECKPOINT_CKPT_HDR_H_ */
> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
> new file mode 100644
> index 0000000..a23aa29
> --- /dev/null
> +++ b/checkpoint/ckpt_mem.c
> @@ -0,0 +1,382 @@
> +/*
> + * Checkpoint memory contents
> + *
> + * Copyright (C) 2008 Oren Laadan
> + *
> + * This file is subject to the terms and conditions of the GNU General Public
> + * License. See the file COPYING in the main directory of the Linux
> + * distribution for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/pagemap.h>
> +#include <linux/mm_types.h>
> +
> +#include "ckpt.h"
> +#include "ckpt_hdr.h"
> +#include "ckpt_arch.h"
> +#include "ckpt_mem.h"
> +
> +/*
> + * utilities to alloc, free, and handle 'struct cr_pgarr'
> + * (common to ckpt_mem.c and rstr_mem.c)
> + */
> +
> +#define CR_PGARR_ORDER 0
> +#define CR_PGARR_TOTAL ((PAGE_SIZE << CR_PGARR_ORDER) / sizeof(void *))

Another allocator? Really?

> +/* release pages referenced by a page-array */
> +void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr)
> +{
> + int n;
> +
> + /* only checkpoint keeps references to pages */
> + if (ctx->flags & CR_CTX_CKPT) {
> + cr_debug("nused %d\n", pgarr->nused);
> + for (n = pgarr->nused; n--; )
> + page_cache_release(pgarr->pages[n]);
> + }
> + pgarr->nused = 0;
> + pgarr->nleft = CR_PGARR_TOTAL;
> +}
> +
> +/* release pages referenced by chain of page-arrays */
> +void cr_pgarr_release(struct cr_ctx *ctx)
> +{
> + struct cr_pgarr *pgarr;
> +
> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next)
> + _cr_pgarr_release(ctx, pgarr);
> +}
> +
> +/* free a chain of page-arrays */
> +void cr_pgarr_free(struct cr_ctx *ctx)
> +{
> + struct cr_pgarr *pgarr, *pgnxt;
> +
> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) {
> + _cr_pgarr_release(ctx, pgarr);
> + free_pages((unsigned long) ctx->pgarr->addrs, CR_PGARR_ORDER);
> + free_pages((unsigned long) ctx->pgarr->pages, CR_PGARR_ORDER);
> + pgnxt = pgarr->next;
> + kfree(pgarr);
> + }
> +}
> +
> +/* allocate and add a new page-array to chain */
> +struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew)
> +{
> + struct cr_pgarr *pgarr = ctx->pgcur;
> +
> + if (pgarr && pgarr->next) {
> + ctx->pgcur = pgarr->next;
> + return pgarr->next;
> + }
> +
> + if ((pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL))) {

This entire nested if(){} should be brought back to 1 tab. Remember, no
assignments in if() conditions.

> + pgarr->nused = 0;
> + pgarr->nleft = CR_PGARR_TOTAL;
> + pgarr->addrs = (unsigned long *)
> + __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
> + pgarr->pages = (struct page **)
> + __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
> + if (likely(pgarr->addrs && pgarr->pages)) {
> + *pgnew = pgarr;
> + ctx->pgcur = pgarr;
> + return pgarr;
> + } else if (pgarr->addrs)
> + free_pages((unsigned long) pgarr->addrs,
> + CR_PGARR_ORDER);
> + kfree(pgarr);
> + }
> +
> + return NULL;
> +}
> +
> +/* return current page-array (and allocate if needed) */
> +struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx)
> +{
> + struct cr_pgarr *pgarr = ctx->pgcur;
> +
> + if (unlikely(!pgarr->nleft))
> + pgarr = cr_pgarr_alloc(ctx, &pgarr->next);
> + return pgarr;
> +}
> +
> +/*
> + * Checkpoint is outside the context of the checkpointee, so one cannot
> + * simply read pages from user-space. Instead, we scan the address space
> + * of the target to cherry-pick pages of interest. Selected pages are
> + * enlisted in a page-array chain (attached to the checkpoint context).
> + * To save their contents, each page is mapped to kernel memory and then
> + * dumped to the file descriptor.
> + */
> +
> +/**
> + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
> + * @ctx - checkpoint context
> + * @pgarr - page-array to fill
> + * @vma - vma to scan
> + * @start - start address (updated)
> + */
> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
> + struct vm_area_struct *vma, unsigned long *start)
> +{
> + unsigned long end = vma->vm_end;
> + unsigned long addr = *start;
> + struct page **pagep;
> + unsigned long *addrp;
> + int cow, nr, ret = 0;
> +
> + nr = pgarr->nleft;
> + pagep = &pgarr->pages[pgarr->nused];
> + addrp = &pgarr->addrs[pgarr->nused];
> + cow = !!vma->vm_file;
> +
> + while (addr < end) {
> + struct page *page;
> +
> + /* simplified version of get_user_pages(): already have vma,
> + * only need FOLL_TOUCH, and (for now) ignore fault stats */
> +
> + cond_resched();
> + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
> + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> + if (ret & VM_FAULT_ERROR) {
> + if (ret & VM_FAULT_OOM)
> + ret = -ENOMEM;
> + else if (ret & VM_FAULT_SIGBUS)
> + ret = -EFAULT;
> + else
> + BUG();
> + break;
> + }
> + cond_resched();
> + }

At best this needs to get folded back into its own function. This is
pretty hard to read as-is. Also, are there truly no in-kernel functions
that can be used for this?

> + if (IS_ERR(page)) {
> + ret = PTR_ERR(page);
> + break;
> + }
> +
> + if (page == ZERO_PAGE(0))
> + page = NULL; /* zero page: ignore */
> + else if (cow && page_mapping(page) != NULL)
> + page = NULL; /* clean cow: ignore */
> + else {

Put the curly brackets in here. It doesn't take up space.

> + get_page(page);
> + *(addrp++) = addr;
> + *(pagep++) = page;
> + if (--nr == 0) {
> + addr += PAGE_SIZE;
> + break;
> + }
> + }
> +
> + addr += PAGE_SIZE;
> + }
> +
> + if (unlikely(ret < 0)) {
> + nr = pgarr->nleft - nr;
> + while (nr--)
> + page_cache_release(*(--pagep));
> + return ret;
> + }
> +
> + *start = addr;
> + return (pgarr->nleft - nr);
> +}
> +
> +/**
> + * cr_vma_scan_pages - scan vma for pages that will need to be dumped
> + * @ctx - checkpoint context
> + * @vma - vma to scan
> + *
> + * a list of addr/page tuples is kept in ctx->pgarr page-array chain
> + */
> +static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> + unsigned long addr = vma->vm_start;
> + unsigned long end = vma->vm_end;
> + struct cr_pgarr *pgarr;
> + int nr, total = 0;
> +
> + while (addr < end) {
> + if (!(pgarr = cr_pgarr_prep(ctx)))
> + return -ENOMEM;
> + if ((nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr)) < 0)
> + return nr;
> + pgarr->nleft -= nr;
> + pgarr->nused += nr;
> + total += nr;
> + }
> +
> + cr_debug("total %d\n", total);
> + return total;
> +}
> +
> +/**
> + * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
> + * @ctx - checkpoint context
> + * @total - total number of pages
> + */
> +static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
> +{
> + struct cr_pgarr *pgarr;
> + int ret;
> +
> + if (!total)
> + return 0;
> +
> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
> + ret = cr_kwrite(ctx, pgarr->addrs,
> + pgarr->nused * sizeof(*pgarr->addrs));
> + if (ret < 0)
> + return ret;
> + }
> +
> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
> + struct page **pages = pgarr->pages;
> + int nr = pgarr->nused;
> + void *ptr;
> +
> + while (nr--) {
> + ptr = kmap(*pages);
> + ret = cr_kwrite(ctx, ptr, PAGE_SIZE);
> + kunmap(*pages);

Why not use kmap_atomic() here?

> + if (ret < 0)
> + return ret;
> + pages++;
> + }
> + }
> +
> + return total;
> +}
> +
> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
> +{
> + struct cr_hdr h;
> + struct cr_hdr_vma *hh = ctx->hbuf;
> + char *fname = NULL;
> + int flen = 0, how, nr, ret;
> +
> + h.type = CR_HDR_VMA;
> + h.len = sizeof(*hh);
> + h.ptag = 0;
> +
> + hh->vm_start = vma->vm_start;
> + hh->vm_end = vma->vm_end;
> + hh->vm_page_prot = vma->vm_page_prot.pgprot;
> + hh->vm_flags = vma->vm_flags;
> + hh->vm_pgoff = vma->vm_pgoff;
> +
> + if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
> + pr_warning("CR: unknown VMA %#lx\n", vma->vm_flags);
> + return -ETXTBSY;
> + }

Hmmm. Interesting error code for VM_HUGETLB. :)

> + /* by default assume anon memory */
> + how = CR_VMA_ANON;
> +
> + /* if there is a backing file, assume private-mapped */
> + /* (NEED: check if the file is unlinked) */
> + if (vma->vm_file) {
> + flen = PAGE_SIZE;
> + fname = cr_fill_fname(&vma->vm_file->f_path,
> + ctx->vfsroot, ctx->tbuf, &flen);
> + if (IS_ERR(fname))
> + return PTR_ERR(fname);
> + how = CR_VMA_FILE;
> + }
> +
> + hh->how = how;
> + hh->fname = !!fname;
> +
> + /*
> + * it seems redundant now, but we do it in 3 steps for because:
> + * first, the logic is simpler when we how many pages before
> + * dumping them; second, a future optimization will defer the
> + * writeout (dump, and free) to a later step; in which case all
> + * the pages to be dumped will be aggregated on the checkpoint ctx
> + */
> +
> + /* (1) scan: scan through the PTEs of the vma to count the pages
> + * to dump (and later make those pages COW), and keep the list of
> + * pages (and a reference to each page) on the checkpoint ctx */
> + nr = cr_vma_scan_pages(ctx, vma);
> + if (nr < 0)
> + return nr;
> +
> + hh->npages = nr;
> + ret = cr_write_obj(ctx, &h, hh);
> +
> + if (!ret && flen)
> + ret = cr_write_str(ctx, fname, flen);
> +
> + if (ret < 0)
> + return ret;
> +
> + /* (2) dump: write out the addresses of all pages in the list (on
> + * the checkpoint ctx) followed by the contents of all pages */
> + ret = cr_vma_dump_pages(ctx, nr);
> +
> + /* (3) free: free the extra references to the pages in the list */
> + cr_pgarr_release(ctx);
> +
> + return ret;
> +}

This gets simpler-looking if you just defer the filename processing
until you actually go to write it out.

-- Dave

2008-08-22 21:20:38

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls



Dave Hansen wrote:
> On Wed, 2008-08-20 at 23:03 -0400, Oren Laadan wrote:
>> 6/unistd_32.h
>> index d739467..88bdec4 100644
>> --- a/include/asm-x86/unistd_32.h
>> +++ b/include/asm-x86/unistd_32.h
>> @@ -338,6 +338,8 @@
>> #define __NR_dup3 330
>> #define __NR_pipe2 331
>> #define __NR_inotify_init1 332
>> +#define __NR_checkpoint 333
>> +#define __NR_restart 334
>>
>> #ifdef __KERNEL__
>
> Uh oh. These got whitespace mangled somehow. I'll look through them,
> but probably can't produce patches on top of them for now.
>

Oops .. not sure what happened.
I'll re-send after addressing the recent comments.

Oren.

2008-08-22 21:26:53

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state


Thanks Louis for all the comments. Will fix in v3.

Oren.

Louis Rilling wrote:
> On Wed, Aug 20, 2008 at 11:05:15PM -0400, Oren Laadan wrote:
>> For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
>> it will be followed by the file name. The cr_vma->npages will tell
>> how many pages were dumped for this VMA. Then it will be followed
>> by the actual data: first a dump of the addresses of all dumped
>> pages (npages entries) followed by a dump of the contents of all
>> dumped pages (npages pages). Then will come the next VMA and so on.
>
> [...]
>
>> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
>> new file mode 100644
>> index 0000000..a23aa29
>> --- /dev/null
>> +++ b/checkpoint/ckpt_mem.c
>
> [...]
>
>> +/**
>> + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
>> + * @ctx - checkpoint context
>> + * @pgarr - page-array to fill
>> + * @vma - vma to scan
>> + * @start - start address (updated)
>> + */
>> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
>> + struct vm_area_struct *vma, unsigned long *start)
>> +{
>> + unsigned long end = vma->vm_end;
>> + unsigned long addr = *start;
>> + struct page **pagep;
>> + unsigned long *addrp;
>> + int cow, nr, ret = 0;
>> +
>> + nr = pgarr->nleft;
>> + pagep = &pgarr->pages[pgarr->nused];
>> + addrp = &pgarr->addrs[pgarr->nused];
>> + cow = !!vma->vm_file;
>> +
>> + while (addr < end) {
>> + struct page *page;
>> +
>> + /* simplified version of get_user_pages(): already have vma,
>> + * only need FOLL_TOUCH, and (for now) ignore fault stats */
>> +
>> + cond_resched();
>> + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
>> + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
>> + if (ret & VM_FAULT_ERROR) {
>> + if (ret & VM_FAULT_OOM)
>> + ret = -ENOMEM;
>> + else if (ret & VM_FAULT_SIGBUS)
>> + ret = -EFAULT;
>> + else
>> + BUG();
>> + break;
>> + }
>
> + ret = 0;
>
>> + cond_resched();
>> + }
>> +
>> + if (IS_ERR(page)) {
>> + ret = PTR_ERR(page);
>> + break;
>> + }
>
> Need to check ret here:
>
> + if (ret)
> break;
>
>> +
>> + if (page == ZERO_PAGE(0))
>> + page = NULL; /* zero page: ignore */
>> + else if (cow && page_mapping(page) != NULL)
>> + page = NULL; /* clean cow: ignore */
>> + else {
>> + get_page(page);
>> + *(addrp++) = addr;
>> + *(pagep++) = page;
>> + if (--nr == 0) {
>> + addr += PAGE_SIZE;
>> + break;
>> + }
>> + }
>> +
>> + addr += PAGE_SIZE;
>> + }
>> +
>> + if (unlikely(ret < 0)) {
>> + nr = pgarr->nleft - nr;
>> + while (nr--)
>> + page_cache_release(*(--pagep));
>> + return ret;
>> + }
>> +
>> + *start = addr;
>> + return (pgarr->nleft - nr);
>> +}
>
> [...]
>
> Thanks,
>
> Louis
>

2008-08-25 02:47:45

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart



Dave Hansen wrote:
> On Wed, 2008-08-20 at 23:04 -0400, Oren Laadan wrote:
>> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
>> new file mode 100644
>> index 0000000..25343f5
>> --- /dev/null
>> +++ b/checkpoint/checkpoint.c

[...]

>> +/* write the checkpoint header */
>> +static int cr_write_hdr(struct cr_ctx *ctx)
>> +{
>> + struct cr_hdr h;
>> + struct cr_hdr_head *hh = ctx->hbuf;
>> + struct new_utsname *uts;
>> + struct timeval ktv;
>> +
>> + h.type = CR_HDR_HEAD;
>> + h.len = sizeof(*hh);
>> + h.ptag = 0;
>> +
>> + do_gettimeofday(&ktv);
>> +
>> + hh->magic = CR_MAGIC_HEAD;
>> + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
>> + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
>> + hh->patch = (LINUX_VERSION_CODE) & 0xff;
>
> We at least neex EXTRAVERSION in there if we're going to even pretend
> this is useful, right? is this redundant with utsname() below?

It isn't clear exactly what we need in the header in terms of version, config,
and compile information. I put this as a starter, because it's obvious and
because this info is likely to be useful even at the user/admin level. I don't
expect this to remain as is as we continue development.

[...]

>> +struct cr_ctx {
>> + pid_t pid; /* container identifier */
>> + int crid; /* unique checkpoint id */
>> +
>> + unsigned long flags;
>> + unsigned long oflags; /* restart: old flags */
>> +
>> + struct file *file;
>> + int total; /* total read/written */
>> +
>> + void *tbuf; /* temp: to avoid many alloc/dealloc */
>> + void *hbuf; /* header: to avoid many alloc/dealloc */
>> + int hpos;
>
> alloc/delloc is not a bad thing. :) I think at least a few of these
> should be moved over to stack or kmalloc()/kfree().

Definitely the comment is misleading. See comment below.

>
>> + struct path *vfsroot; /* container root */
>> +};
>
> Can a container only have one vfsroot?

This is a temporary solution since this patchset doesn't really handle real
containers yet. Will add a FIXME comment.

>
>> +
>> +/* cr_ctx: flags */
>> +#define CR_CTX_CKPT 0x1
>> +#define CR_CTX_RSTR 0x2
>
> enum?

These are flags (more will come later).

[...]

>> +struct cr_hdr_task {
>> + __u64 state;
>> + __u32 exit_state;
>> + __u32 exit_code, exit_signal;
>> +
>> + __u64 utime, stime, utimescaled, stimescaled;
>> + __u64 gtime;
>> + __u64 prev_utime, prev_stime;
>> + __u64 nvcsw, nivcsw;
>> + __u64 start_time_sec, start_time_nsec;
>> + __u64 real_start_time_sec, real_start_time_nsec;
>> + __u64 min_flt, maj_flt;
>> +
>> + __s16 task_comm_len;
>> + char comm[TASK_COMM_LEN];
>> +} __attribute__ ((aligned (8)));
>
> TASK_COMM_LEN might be subject to changing. Can it be part of the
> user<->kernel ABI?

Good point. Having the 'task_comm_len' field there isn't enough because the
size of the entire struct changes anyways.

[...]

>
> ...
>> +/*
>> + * During restart the code reads in data from the chekcpoint image into a
>> + * temporary buffer (ctx->hbuf). Because operations can be nested, one
>> + * should call cr_hbuf_get() to reserve space in the buffer, and then
>> + * cr_hbuf_put() when it no longer needs that space
>> + */
>> +
>> +#include <linux/version.h>
>> +#include <linux/sched.h>
>> +#include <linux/file.h>
>> +
>> +#include "ckpt.h"
>> +#include "ckpt_hdr.h"
>> +
>> +/**
>> + * cr_hbuf_get - reserve space on the hbuf
>> + * @ctx: checkpoint context
>> + * @n: number of bytes to reserve
>> + */
>> +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
>> +{
>> + void *ptr;
>> +
>> + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
>> + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
>> + ctx->hpos += n;
>> + return ptr;
>> +}
>> +
>> +/**
>> + * cr_hbuf_put - unreserve space on the hbuf
>> + * @ctx: checkpoint context
>> + * @n: number of bytes to reserve
>> + */
>> +void cr_hbuf_put(struct cr_ctx *ctx, int n)
>> +{
>> + BUG_ON(ctx->hpos < n);
>> + ctx->hpos -= n;
>> +}
>
> I replaced all of the uses of these with kmalloc()/kfree() and stack
> allocations. I'm really not sure what these buy us since our allocators
> are already so fast. tbuf, especially, worries me if it gets used in
> any kind of nested manner we're going to get some really fun bugs.

I disagree with putting some of the cr_hdr_... on the stack: first, if they
grow in size, it's easy to forget to change the allocation to stack. Second,
using a standard code/handling for all cr_hdr_... makes the code more readable
and easier for others to extend by simply following existing code.

The main argument for ->hbuf is that eventually we would like to buffer
data in the kernel to avoid potentially slow writing/reading when processes
are frozen during checkpoint/restart. By using the simple cr_get_hbuf() and
cr_put_hbuf() we prepare for that: now they just allocate room in ->hbuf,
but later they will provide a pointer directly in the data buffer. Moreover,
it simplifies the error path since cleanup (of ->hbuf) is implicit. Also,
it is designed to allow checkpoint and restart function to be called in a
nested manner, again simplifying the code. Finally, my experience was that
it can impact performance if you are after very short downtimes, especially
for the checkpoint.

The use of ->tbuf is less critical and is mainly for performance and to
simplify the error path, never to be called nested. I'm ok with removing
it for now.

[...]

>> +/* read the checkpoint header */
>> +static int cr_read_hdr(struct cr_ctx *ctx)
>> +{
>> + struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> + int ret;
>> +
>> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
>> + if (ret < 0)
>> + return ret;
>> + else if (ret != 0)
>> + return -EINVAL;
>
> How about just making cr_read_obj_type() stop returning positive values?
> Is it ever valid for it to return >0?

The idea with cr_read_obj_type() is that it returns the 'ptag' - parent tag -
field of the object that it reads in. The 'ptag' is the tag of the parent
object, and is useful in several places. For instance, the 'ptag' of an MM
header identifies (is equal to) the 'tag' of TASK to which it belongs. In
this case, the 'ptag' should be zero because it has no parent object.

I'll change the variable name from 'ret' to 'ptag' to make it more obvious.

[...]

>
> -- Dave
>

Oren.

2008-08-25 03:28:34

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 8/9] File descriprtors - dump state



Louis Rilling wrote:
> On Wed, Aug 20, 2008 at 11:07:16PM -0400, Oren Laadan wrote:
>> Dump the files_struct of a task with 'struct cr_hdr_files', followed by
>> all open file descriptors. Since FDs can be shared, they are assigned a
>> tag and registered in the object hash.
>>
>> For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
>> and its close-on-exec property. If the FD is to be saved (first time)
>> then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
>> Then will come the next FD and so on.
>>
>> This patch only handles basic FDs - regular files, directories and also
>> symbolic links.
>
> [...]

[...]

>
>> @@ -114,4 +125,24 @@ struct cr_hdr_vma {
>>
>> } __attribute__ ((aligned (8)));
>>
>> +struct cr_hdr_files {
>> + __u32 tag; /* sharing identifier */
>> + __u32 nfds;
>> +} __attribute__ ((aligned (8)));
>> +
>> +struct cr_hdr_fd_ent {
>> + __u32 tag;
>> + __u16 fd;
>> + __u16 close_on_exec;
>> +} __attribute__ ((aligned (8)));
>> +
>> +struct cr_hdr_fd_data {
>> + __u16 how;
>> + __u16 f_mode;
>> + __u32 f_flags;
>> + __u32 f_uid, f_gid;
>
> We are not at a 64bits boundary here. Should add one __32 padding or reorder the
> fields.

Actually this is ok - note there are two fields on fourth row. I'll
change to one field per row so it's easily seen.

Thanks,

Oren.

2008-08-25 10:30:50

by Louis Rilling

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 8/9] File descriprtors - dump state

On Sun, Aug 24, 2008 at 11:28:14PM -0400, Oren Laadan wrote:
>
>
> Louis Rilling wrote:
>> On Wed, Aug 20, 2008 at 11:07:16PM -0400, Oren Laadan wrote:
>>> Dump the files_struct of a task with 'struct cr_hdr_files', followed by
>>> all open file descriptors. Since FDs can be shared, they are assigned a
>>> tag and registered in the object hash.
>>>
>>> For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its tag
>>> and its close-on-exec property. If the FD is to be saved (first time)
>>> then this is followed by a 'struct cr_hdr_fd_data' with the FD state.
>>> Then will come the next FD and so on.
>>>
>>> This patch only handles basic FDs - regular files, directories and also
>>> symbolic links.
>>
>> [...]
>
> [...]
>
>>
>>> @@ -114,4 +125,24 @@ struct cr_hdr_vma {
>>>
>>> } __attribute__ ((aligned (8)));
>>>
>>> +struct cr_hdr_files {
>>> + __u32 tag; /* sharing identifier */
>>> + __u32 nfds;
>>> +} __attribute__ ((aligned (8)));
>>> +
>>> +struct cr_hdr_fd_ent {
>>> + __u32 tag;
>>> + __u16 fd;
>>> + __u16 close_on_exec;
>>> +} __attribute__ ((aligned (8)));
>>> +
>>> +struct cr_hdr_fd_data {
>>> + __u16 how;
>>> + __u16 f_mode;
>>> + __u32 f_flags;
>>> + __u32 f_uid, f_gid;
>>
>> We are not at a 64bits boundary here. Should add one __32 padding or reorder the
>> fields.
>
> Actually this is ok - note there are two fields on fourth row. I'll
> change to one field per row so it's easily seen.

Have to revise my eyes, really ;) Sorry for the noise!

Louis

--
Dr Louis Rilling Kerlabs - IRISA
Skype: louis.rilling Campus Universitaire de Beaulieu
Phone: (+33|0) 2 99 84 71 52 Avenue du General Leclerc
Fax: (+33|0) 2 99 84 71 71 35042 Rennes CEDEX - France
http://www.kerlabs.com/


Attachments:
(No filename) (1.69 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2008-08-24 05:41:32

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state



Dave Hansen wrote:
> On Wed, 2008-08-20 at 23:05 -0400, Oren Laadan wrote:
>> index 7ecafb3..0addb63 100644
>> --- a/checkpoint/ckpt.h
>> +++ b/checkpoint/ckpt.h
>> @@ -29,6 +29,9 @@ struct cr_ctx {
>> void *hbuf; /* header: to avoid many alloc/dealloc */
>> int hpos;
>>
>> + struct cr_pgarr *pgarr;
>> + struct cr_pgarr *pgcur;
>> +
>> struct path *vfsroot; /* container root */
>> };
>
> These need much better variable names. From this, I have no idea what
> they do. Checkpoint Restart Pirates Go ARR!
>
>> @@ -62,6 +65,9 @@ int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
>> int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
>> int cr_read_str(struct cr_ctx *ctx, void *str, int n);
>>
>> +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
>> +int cr_read_mm(struct cr_ctx *ctx);
>> +
>> int do_checkpoint(struct cr_ctx *ctx);
>> int do_restart(struct cr_ctx *ctx);
>>
>> diff --git a/checkpoint/ckpt_arch.h b/checkpoint/ckpt_arch.h
>> index b7cc8c9..3b87a6f 100644
>> --- a/checkpoint/ckpt_arch.h
>> +++ b/checkpoint/ckpt_arch.h
>> @@ -2,6 +2,7 @@
>>
>> int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
>> int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
>> +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int ptag);
>>
>> int cr_read_thread(struct cr_ctx *ctx);
>> int cr_read_cpu(struct cr_ctx *ctx);
>> diff --git a/checkpoint/ckpt_hdr.h b/checkpoint/ckpt_hdr.h
>> index a478b7c..a3919cf 100644
>> --- a/checkpoint/ckpt_hdr.h
>> +++ b/checkpoint/ckpt_hdr.h
>> @@ -30,6 +30,7 @@ struct cr_hdr {
>> __u32 ptag;
>> };
>>
>> +/* header types */
>> enum {
>> CR_HDR_HEAD = 1,
>> CR_HDR_STR,
>> @@ -45,6 +46,12 @@ enum {
>> CR_HDR_TAIL = 5001
>> };
>>
>> +/* vma subtypes */
>> +enum {
>> + CR_VMA_ANON = 1,
>> + CR_VMA_FILE
>> +};
>
> Is this really necessary, or can we infer it from the contents of the
> VMA?

This classification eventually simplifies both dump and restore. For
instance, it decides whether a file name follows or not.

There will be more, later: CR_VMA_FILE_UNLINKED (mapped to an unlinked
file), CR_VMA_ANON_SHARED (shared anonymous), CR_VMA_ANON_SHARED_SKIP
(shared anonymous, had been sent before) and so on.

>
>> struct cr_hdr_head {
>> __u64 magic;
>>
>> @@ -83,4 +90,28 @@ struct cr_hdr_task {
>> char comm[TASK_COMM_LEN];
>> } __attribute__ ((aligned (8)));
>>
>> +struct cr_hdr_mm {
>> + __u32 tag; /* sharing identifier */
>
> If this really is a sharing identifier, we need a:
>
> struct cr_shared_object_ref {
> __u32 tag;
> };
>
> And then one of *those* in the vma struct. Make it much more idiot (aka
> Dave) proof.

I figured the use of 'tag' for the identifiers of shared objects is clear.
By using a __u32 the size of that field is immediately visible, while using
a structure will hide the actual size; in turn this is what we want visible
here (ABI), no ?

>
>> + __s16 map_count;
>> + __s16 _padding;
>> +
>> + __u64 start_code, end_code, start_data, end_data;
>> + __u64 start_brk, brk, start_stack;
>> + __u64 arg_start, arg_end, env_start, env_end;
>> +
>> +} __attribute__ ((aligned (8)));
>> +
>> +struct cr_hdr_vma {
>> + __u32 how;
>
> It is too bad that we can't actually use the enum type here. It would
> make it *much* more obvious what this was. I actually had to go look at
> the code below to figure it out.

Using __u32 instead of enum guarantees the size. I'll change the
name and move the enum nearby.

>
>> + __s16 npages;
>
> Wow. Linux only supports 256MB in a single VMA? I didn't know that.
> Maybe we should make this type bigger. :)
>
> This also needs to get called something much more descriptive, like
> nr_present_pages.
>
>
>
>> #endif /* _CHECKPOINT_CKPT_HDR_H_ */
>> diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
>> new file mode 100644
>> index 0000000..a23aa29
>> --- /dev/null
>> +++ b/checkpoint/ckpt_mem.c
>> @@ -0,0 +1,382 @@
>> +/*
>> + * Checkpoint memory contents
>> + *
>> + * Copyright (C) 2008 Oren Laadan
>> + *
>> + * This file is subject to the terms and conditions of the GNU General Public
>> + * License. See the file COPYING in the main directory of the Linux
>> + * distribution for more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/file.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/mm_types.h>
>> +
>> +#include "ckpt.h"
>> +#include "ckpt_hdr.h"
>> +#include "ckpt_arch.h"
>> +#include "ckpt_mem.h"
>> +
>> +/*
>> + * utilities to alloc, free, and handle 'struct cr_pgarr'
>> + * (common to ckpt_mem.c and rstr_mem.c)
>> + */
>> +
>> +#define CR_PGARR_ORDER 0
>> +#define CR_PGARR_TOTAL ((PAGE_SIZE << CR_PGARR_ORDER) / sizeof(void *))
>
> Another allocator? Really?
>
>> +/* release pages referenced by a page-array */
>> +void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr)
>> +{
>> + int n;
>> +
>> + /* only checkpoint keeps references to pages */
>> + if (ctx->flags & CR_CTX_CKPT) {
>> + cr_debug("nused %d\n", pgarr->nused);
>> + for (n = pgarr->nused; n--; )
>> + page_cache_release(pgarr->pages[n]);
>> + }
>> + pgarr->nused = 0;
>> + pgarr->nleft = CR_PGARR_TOTAL;
>> +}
>> +
>> +/* release pages referenced by chain of page-arrays */
>> +void cr_pgarr_release(struct cr_ctx *ctx)
>> +{
>> + struct cr_pgarr *pgarr;
>> +
>> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next)
>> + _cr_pgarr_release(ctx, pgarr);
>> +}
>> +
>> +/* free a chain of page-arrays */
>> +void cr_pgarr_free(struct cr_ctx *ctx)
>> +{
>> + struct cr_pgarr *pgarr, *pgnxt;
>> +
>> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) {
>> + _cr_pgarr_release(ctx, pgarr);
>> + free_pages((unsigned long) ctx->pgarr->addrs, CR_PGARR_ORDER);
>> + free_pages((unsigned long) ctx->pgarr->pages, CR_PGARR_ORDER);
>> + pgnxt = pgarr->next;
>> + kfree(pgarr);
>> + }
>> +}
>> +
>> +/* allocate and add a new page-array to chain */
>> +struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew)
>> +{
>> + struct cr_pgarr *pgarr = ctx->pgcur;
>> +
>> + if (pgarr && pgarr->next) {
>> + ctx->pgcur = pgarr->next;
>> + return pgarr->next;
>> + }
>> +
>> + if ((pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL))) {
>
> This entire nested if(){} should be brought back to 1 tab. Remember, no
> assignments in if() conditions.
>
>> + pgarr->nused = 0;
>> + pgarr->nleft = CR_PGARR_TOTAL;
>> + pgarr->addrs = (unsigned long *)
>> + __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
>> + pgarr->pages = (struct page **)
>> + __get_free_pages(GFP_KERNEL, CR_PGARR_ORDER);
>> + if (likely(pgarr->addrs && pgarr->pages)) {
>> + *pgnew = pgarr;
>> + ctx->pgcur = pgarr;
>> + return pgarr;
>> + } else if (pgarr->addrs)
>> + free_pages((unsigned long) pgarr->addrs,
>> + CR_PGARR_ORDER);
>> + kfree(pgarr);
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +/* return current page-array (and allocate if needed) */
>> +struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx)
>> +{
>> + struct cr_pgarr *pgarr = ctx->pgcur;
>> +
>> + if (unlikely(!pgarr->nleft))
>> + pgarr = cr_pgarr_alloc(ctx, &pgarr->next);
>> + return pgarr;
>> +}
>> +
>> +/*
>> + * Checkpoint is outside the context of the checkpointee, so one cannot
>> + * simply read pages from user-space. Instead, we scan the address space
>> + * of the target to cherry-pick pages of interest. Selected pages are
>> + * enlisted in a page-array chain (attached to the checkpoint context).
>> + * To save their contents, each page is mapped to kernel memory and then
>> + * dumped to the file descriptor.
>> + */
>> +
>> +/**
>> + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
>> + * @ctx - checkpoint context
>> + * @pgarr - page-array to fill
>> + * @vma - vma to scan
>> + * @start - start address (updated)
>> + */
>> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
>> + struct vm_area_struct *vma, unsigned long *start)
>> +{
>> + unsigned long end = vma->vm_end;
>> + unsigned long addr = *start;
>> + struct page **pagep;
>> + unsigned long *addrp;
>> + int cow, nr, ret = 0;
>> +
>> + nr = pgarr->nleft;
>> + pagep = &pgarr->pages[pgarr->nused];
>> + addrp = &pgarr->addrs[pgarr->nused];
>> + cow = !!vma->vm_file;
>> +
>> + while (addr < end) {
>> + struct page *page;
>> +
>> + /* simplified version of get_user_pages(): already have vma,
>> + * only need FOLL_TOUCH, and (for now) ignore fault stats */
>> +
>> + cond_resched();
>> + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
>> + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
>> + if (ret & VM_FAULT_ERROR) {
>> + if (ret & VM_FAULT_OOM)
>> + ret = -ENOMEM;
>> + else if (ret & VM_FAULT_SIGBUS)
>> + ret = -EFAULT;
>> + else
>> + BUG();
>> + break;
>> + }
>> + cond_resched();
>> + }
>
> At best this needs to get folded back into its own function. This is

This is almost identical to the original - see the preceding comment.

> pretty hard to read as-is. Also, are there truly no in-kernel functions
> that can be used for this?

Can you suggest one ?

>
>> + if (IS_ERR(page)) {
>> + ret = PTR_ERR(page);
>> + break;
>> + }
>> +
>> + if (page == ZERO_PAGE(0))
>> + page = NULL; /* zero page: ignore */
>> + else if (cow && page_mapping(page) != NULL)
>> + page = NULL; /* clean cow: ignore */
>> + else {
>
> Put the curly brackets in here. It doesn't take up space.
>
>> + get_page(page);
>> + *(addrp++) = addr;
>> + *(pagep++) = page;
>> + if (--nr == 0) {
>> + addr += PAGE_SIZE;
>> + break;
>> + }
>> + }
>> +
>> + addr += PAGE_SIZE;
>> + }
>> +
>> + if (unlikely(ret < 0)) {
>> + nr = pgarr->nleft - nr;
>> + while (nr--)
>> + page_cache_release(*(--pagep));
>> + return ret;
>> + }
>> +
>> + *start = addr;
>> + return (pgarr->nleft - nr);
>> +}
>> +
>> +/**
>> + * cr_vma_scan_pages - scan vma for pages that will need to be dumped
>> + * @ctx - checkpoint context
>> + * @vma - vma to scan
>> + *
>> + * a list of addr/page tuples is kept in ctx->pgarr page-array chain
>> + */
>> +static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma)
>> +{
>> + unsigned long addr = vma->vm_start;
>> + unsigned long end = vma->vm_end;
>> + struct cr_pgarr *pgarr;
>> + int nr, total = 0;
>> +
>> + while (addr < end) {
>> + if (!(pgarr = cr_pgarr_prep(ctx)))
>> + return -ENOMEM;
>> + if ((nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr)) < 0)
>> + return nr;
>> + pgarr->nleft -= nr;
>> + pgarr->nused += nr;
>> + total += nr;
>> + }
>> +
>> + cr_debug("total %d\n", total);
>> + return total;
>> +}
>> +
>> +/**
>> + * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
>> + * @ctx - checkpoint context
>> + * @total - total number of pages
>> + */
>> +static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
>> +{
>> + struct cr_pgarr *pgarr;
>> + int ret;
>> +
>> + if (!total)
>> + return 0;
>> +
>> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
>> + ret = cr_kwrite(ctx, pgarr->addrs,
>> + pgarr->nused * sizeof(*pgarr->addrs));
>> + if (ret < 0)
>> + return ret;
>> + }
>> +
>> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
>> + struct page **pages = pgarr->pages;
>> + int nr = pgarr->nused;
>> + void *ptr;
>> +
>> + while (nr--) {
>> + ptr = kmap(*pages);
>> + ret = cr_kwrite(ctx, ptr, PAGE_SIZE);
>> + kunmap(*pages);
>
> Why not use kmap_atomic() here?

It is my understanding that the code must not sleep between kmap_atomic()
and kunmap_atomic().

>
>> + if (ret < 0)
>> + return ret;
>> + pages++;
>> + }
>> + }
>> +
>> + return total;
>> +}
>> +
>> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
>> +{
>> + struct cr_hdr h;
>> + struct cr_hdr_vma *hh = ctx->hbuf;
>> + char *fname = NULL;
>> + int flen = 0, how, nr, ret;
>> +
>> + h.type = CR_HDR_VMA;
>> + h.len = sizeof(*hh);
>> + h.ptag = 0;
>> +
>> + hh->vm_start = vma->vm_start;
>> + hh->vm_end = vma->vm_end;
>> + hh->vm_page_prot = vma->vm_page_prot.pgprot;
>> + hh->vm_flags = vma->vm_flags;
>> + hh->vm_pgoff = vma->vm_pgoff;
>> +
>> + if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
>> + pr_warning("CR: unknown VMA %#lx\n", vma->vm_flags);
>> + return -ETXTBSY;
>> + }
>
> Hmmm. Interesting error code for VM_HUGETLB. :)

:) well, the usual EINVAL didn't seem suitable. Any better suggestions ?

>
>> + /* by default assume anon memory */
>> + how = CR_VMA_ANON;
>> +
>> + /* if there is a backing file, assume private-mapped */
>> + /* (NEED: check if the file is unlinked) */
>> + if (vma->vm_file) {
>> + flen = PAGE_SIZE;
>> + fname = cr_fill_fname(&vma->vm_file->f_path,
>> + ctx->vfsroot, ctx->tbuf, &flen);
>> + if (IS_ERR(fname))
>> + return PTR_ERR(fname);
>> + how = CR_VMA_FILE;
>> + }
>> +
>> + hh->how = how;
>> + hh->fname = !!fname;
>> +
>> + /*
>> + * it seems redundant now, but we do it in 3 steps for because:
>> + * first, the logic is simpler when we how many pages before
>> + * dumping them; second, a future optimization will defer the
>> + * writeout (dump, and free) to a later step; in which case all
>> + * the pages to be dumped will be aggregated on the checkpoint ctx
>> + */
>> +
>> + /* (1) scan: scan through the PTEs of the vma to count the pages
>> + * to dump (and later make those pages COW), and keep the list of
>> + * pages (and a reference to each page) on the checkpoint ctx */
>> + nr = cr_vma_scan_pages(ctx, vma);
>> + if (nr < 0)
>> + return nr;
>> +
>> + hh->npages = nr;
>> + ret = cr_write_obj(ctx, &h, hh);
>> +
>> + if (!ret && flen)
>> + ret = cr_write_str(ctx, fname, flen);
>> +
>> + if (ret < 0)
>> + return ret;
>> +
>> + /* (2) dump: write out the addresses of all pages in the list (on
>> + * the checkpoint ctx) followed by the contents of all pages */
>> + ret = cr_vma_dump_pages(ctx, nr);
>> +
>> + /* (3) free: free the extra references to the pages in the list */
>> + cr_pgarr_release(ctx);
>> +
>> + return ret;
>> +}
>
> This gets simpler-looking if you just defer the filename processing
> until you actually go to write it out.

Yes. I'll encapsulate that in it's own function.

Oren.

2008-08-24 06:03:18

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart



Louis Rilling wrote:
> On Wed, Aug 20, 2008 at 11:04:13PM -0400, Oren Laadan wrote:
>> Add those interfaces, as well as helpers needed to easily manage the
>> file format. The code is roughly broken out as follows:
>>
>> ckpt/sys.c - user/kernel data transfer, as well as setup of the
>> checkpoint/restart context (a per-checkpoint data structure for
>> housekeeping)
>>
>> ckpt/checkpoint.c - output wrappers and basic checkpoint handling
>>
>> ckpt/restart.c - input wrappers and basic restart handling
>>
>> Patches to add the per-architecture support as well as the actual
>> work to do the memory checkpoint follow in subsequent patches.
>>
>
> [...]
>
>> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
>> new file mode 100644
>> index 0000000..2891c48
>> --- /dev/null
>> +++ b/checkpoint/sys.c
>
> [...]
>
>> +/*
>> + * helpers to manage CR contexts: allocated for each checkpoint and/or
>> + * restart operation, and persists until the operation is completed.
>> + */
>> +
>> +static atomic_t cr_ctx_count; /* unique checkpoint identifier */
>
> I thought we agreed that this counter should be per-container. Perhaps add a
> TODO here?

True.

>
>> +
>> +void cr_ctx_free(struct cr_ctx *ctx)
>> +{
>> +
>> + if (ctx->file)
>> + fput(ctx->file);
>> + if (ctx->vfsroot)
>> + path_put(ctx->vfsroot);
>> +
>> + free_pages((unsigned long) ctx->tbuf, CR_TBUF_ORDER);
>> + free_pages((unsigned long) ctx->hbuf, CR_HBUF_ORDER);
>> +
>> + kfree(ctx);
>> +}
>> +
>> +struct cr_ctx *cr_ctx_alloc(pid_t pid, struct file *file, unsigned long flags)
>> +{
>> + struct cr_ctx *ctx;
>> +
>> + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>> + if (!ctx)
>> + return NULL;
>> +
>> + ctx->tbuf = (void *) __get_free_pages(GFP_KERNEL, CR_TBUF_ORDER);
>> + ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_HBUF_ORDER);
>> + if (!ctx->tbuf || !ctx->hbuf)
>> + goto nomem;
>> +
>> + ctx->pid = pid;
>> + ctx->flags = flags;
>> +
>> + ctx->file = file;
>> + get_file(file);
>> +
>> + /* assume checkpointer is in container's root vfs */
>
> I'm a bit puzzled by this assumption. I would say: either this is a
> self-checkpoint (only current process), or this is a container checkpoint. In
> the latter case, I expect that in the general case the checkpointer lives
> outside the container (and the interface of sys_checkpoint() below confirms
> this), so it's root fs is probably not the container's one.
>
> Does it differ from what you're planning?

You are correct. We lack infrastructure for what I'd call "container-object",
and the patchset does not yet tackle a container and multiple tasks, so this
is an interim solution. Will add a FIXME comment.

Thanks,

Oren.

2008-08-26 16:34:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

On Sun, 2008-08-24 at 01:40 -0400, Oren Laadan wrote:
> >> +/* vma subtypes */
> >> +enum {
> >> + CR_VMA_ANON = 1,
> >> + CR_VMA_FILE
> >> +};
> >
> > Is this really necessary, or can we infer it from the contents of the
> > VMA?
>
> This classification eventually simplifies both dump and restore. For
> instance, it decides whether a file name follows or not.
>
> There will be more, later: CR_VMA_FILE_UNLINKED (mapped to an unlinked
> file), CR_VMA_ANON_SHARED (shared anonymous), CR_VMA_ANON_SHARED_SKIP
> (shared anonymous, had been sent before) and so on.

I still don't see there being a need to explicitly specify the
distinction. Why should a VMA mapping an unlinked file be any different
from a linked one? It should point over to the 'file' checkpoint
structure and let the real work be done there.

There are no truly anonymous shared areas. They anon ones are still
file-backed as far as the kernel is concerned. If we do the file saving
correctly, I think most of these problems just fall out.

> >> struct cr_hdr_head {
> >> __u64 magic;
> >>
> >> @@ -83,4 +90,28 @@ struct cr_hdr_task {
> >> char comm[TASK_COMM_LEN];
> >> } __attribute__ ((aligned (8)));
> >>
> >> +struct cr_hdr_mm {
> >> + __u32 tag; /* sharing identifier */
> >
> > If this really is a sharing identifier, we need a:
> >
> > struct cr_shared_object_ref {
> > __u32 tag;
> > };
> >
> > And then one of *those* in the vma struct. Make it much more idiot (aka
> > Dave) proof.
>
> I figured the use of 'tag' for the identifiers of shared objects is clear.
> By using a __u32 the size of that field is immediately visible, while using
> a structure will hide the actual size; in turn this is what we want visible
> here (ABI), no ?

Either way you stack it, I think 'tag' is a horrendous variable name.
What kind of tag? What does it do? What does it tag? What does it
reference? What values are valid in there?

If we have a separate structure, we simply make it clear that the
structure will get used all over the place, and that it, too is part of
the ABI. Knowing the sizes isn't as important as knowing that the ABI
is constant.

> >> +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
> >> + struct vm_area_struct *vma, unsigned long *start)
> >> +{
> >> + unsigned long end = vma->vm_end;
> >> + unsigned long addr = *start;
> >> + struct page **pagep;
> >> + unsigned long *addrp;
> >> + int cow, nr, ret = 0;
> >> +
> >> + nr = pgarr->nleft;
> >> + pagep = &pgarr->pages[pgarr->nused];
> >> + addrp = &pgarr->addrs[pgarr->nused];
> >> + cow = !!vma->vm_file;
> >> +
> >> + while (addr < end) {
> >> + struct page *page;
> >> +
> >> + /* simplified version of get_user_pages(): already have vma,
> >> + * only need FOLL_TOUCH, and (for now) ignore fault stats */
> >> +
> >> + cond_resched();
> >> + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
> >> + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> >> + if (ret & VM_FAULT_ERROR) {
> >> + if (ret & VM_FAULT_OOM)
> >> + ret = -ENOMEM;
> >> + else if (ret & VM_FAULT_SIGBUS)
> >> + ret = -EFAULT;
> >> + else
> >> + BUG();
> >> + break;
> >> + }
> >> + cond_resched();
> >> + }
> >
> > At best this needs to get folded back into its own function. This is
>
> This is almost identical to the original - see the preceding comment.

Exactly. The code is copy-and-pasted. If there's a bug in the
original, who will change this one? Better to simply consolidate the
code into one copy.

> > pretty hard to read as-is. Also, are there truly no in-kernel functions
> > that can be used for this?
>
> Can you suggest one ?

This is where the mentality has to shift. Instead of thinking "there is
no exact in-kernel match for what I want here" we need to consider that
we can modify the in-kernel ones. They can be modified to suit both
existing and the new needs that we have.

> >> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
> >> + struct page **pages = pgarr->pages;
> >> + int nr = pgarr->nused;
> >> + void *ptr;
> >> +
> >> + while (nr--) {
> >> + ptr = kmap(*pages);
> >> + ret = cr_kwrite(ctx, ptr, PAGE_SIZE);
> >> + kunmap(*pages);
> >
> > Why not use kmap_atomic() here?
>
> It is my understanding that the code must not sleep between kmap_atomic()
> and kunmap_atomic().

Yes, but you're going to absolutely kill performance on systems which
have expensive global TLB flushes. Frequent kmaps()s should be avoided
at all costs.

The best way around this would be to change the interface to cr_kwrite()
so that it didn't have to use *mapped* kernel memory. Maybe an
interface that takes 'struct page'. Or, one where we kmap_atomic() the
buffer, kunmap_atomic(), copy to a temporary buffer, then cr_kwrite().

> >> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
> >> +{
> >> + struct cr_hdr h;
> >> + struct cr_hdr_vma *hh = ctx->hbuf;
> >> + char *fname = NULL;
> >> + int flen = 0, how, nr, ret;
> >> +
> >> + h.type = CR_HDR_VMA;
> >> + h.len = sizeof(*hh);
> >> + h.ptag = 0;
> >> +
> >> + hh->vm_start = vma->vm_start;
> >> + hh->vm_end = vma->vm_end;
> >> + hh->vm_page_prot = vma->vm_page_prot.pgprot;
> >> + hh->vm_flags = vma->vm_flags;
> >> + hh->vm_pgoff = vma->vm_pgoff;
> >> +
> >> + if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
> >> + pr_warning("CR: unknown VMA %#lx\n", vma->vm_flags);
> >> + return -ETXTBSY;
> >> + }
> >
> > Hmmm. Interesting error code for VM_HUGETLB. :)
>
> :) well, the usual EINVAL didn't seem suitable. Any better suggestions ?

-ENOSUP might be clearest for now. "Your system call tried to do
something unsupported."

-- Dave

2008-08-26 16:42:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart

On Sun, 2008-08-24 at 22:47 -0400, Oren Laadan wrote:
> > I replaced all of the uses of these with kmalloc()/kfree() and stack
> > allocations. I'm really not sure what these buy us since our allocators
> > are already so fast. tbuf, especially, worries me if it gets used in
> > any kind of nested manner we're going to get some really fun bugs.
>
> I disagree with putting some of the cr_hdr_... on the stack: first, if they
> grow in size, it's easy to forget to change the allocation to stack.

I can buy that.

> Second,
> using a standard code/handling for all cr_hdr_... makes the code more readable
> and easier for others to extend by simply following existing code.

It actually makes it harder for others to jump into because they see
something which is basically a kmalloc() to the rest of the kernel. We
don't have ext2_alloc() or usb_alloc(), so I'm not sure we should have
cr_alloc(), effectively.

> The main argument for ->hbuf is that eventually we would like to buffer
> data in the kernel to avoid potentially slow writing/reading when processes
> are frozen during checkpoint/restart.

Um, we're writing to a file descriptor and kmap()'ing. Those can both
potentially be very, very slow. I don't think that a few kmalloc()s
thrown in there are going to be noticeable.

> By using the simple cr_get_hbuf() and
> cr_put_hbuf() we prepare for that: now they just allocate room in ->hbuf,
> but later they will provide a pointer directly in the data buffer. Moreover,
> it simplifies the error path since cleanup (of ->hbuf) is implicit.

It simplifies *one* error path. But, it complicates the ctx creation
and makes *that* error path more complex. Pick your poison, I guess.

> Also,
> it is designed to allow checkpoint and restart function to be called in a
> nested manner, again simplifying the code. Finally, my experience was that
> it can impact performance if you are after very short downtimes, especially
> for the checkpoint.

Right, but I'm comparing it to kmalloc() Certainly kmalloc()s can be
used in a nested manner.

> >> +/* read the checkpoint header */
> >> +static int cr_read_hdr(struct cr_ctx *ctx)
> >> +{
> >> + struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
> >> + int ret;
> >> +
> >> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
> >> + if (ret < 0)
> >> + return ret;
> >> + else if (ret != 0)
> >> + return -EINVAL;
> >
> > How about just making cr_read_obj_type() stop returning positive values?
> > Is it ever valid for it to return >0?
>
> The idea with cr_read_obj_type() is that it returns the 'ptag' - parent tag -
> field of the object that it reads in. The 'ptag' is the tag of the parent
> object, and is useful in several places. For instance, the 'ptag' of an MM
> header identifies (is equal to) the 'tag' of TASK to which it belongs. In
> this case, the 'ptag' should be zero because it has no parent object.
>
> I'll change the variable name from 'ret' to 'ptag' to make it more obvious.

Since this ptag isn't actually used, I can't really offer a suggestion.
I don't see the whole picture.

-- Dave

2008-08-26 17:01:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 7/9] Infrastructure for shared objects

On Thu, 2008-08-21 at 12:40 +0200, Louis Rilling wrote:
>
> > +struct cr_objhash {
> > + struct cr_obj **hash;
> > + int next_tag;
> > +};
> > +
> > +#define CR_OBJHASH_NBITS 10 /* 10 bits = 1K buckets */
> > +#define CR_OBJHASH_ORDER 0 /* 1K buckets * 4 bytes/bucket = 1
> page */
>
> Only true when PAGE_SIZE == 4K and in 32bits. Perhaps like below?
>
> #define CR_OBJHASH_BUCKET_NBITS (BITS_PER_LONG == 64 ? 3 : 2)
> #define CR_MIN_OBJHASH_NBITS ((PAGE_SHIFT - CR_OBJHASH_BUCKET_NBITS)
> #define CR_OBJHASH_NBITS (CR_MIN_OBJHASH_NBITS >= 10 ?
> CR_MIN_OBJHASH_NBITS : 10)
> #define CR_OBJHASH_ORDER (CR_OBJHASH_NBITS + CR_OBJHASH_BUCKET_NBITS -
> PAGE_SHIFT)

Gah. Can't we just use the existing radix tree or hash implementations
we already have?

-- Dave

2008-08-27 00:16:30

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state



Dave Hansen wrote:
> On Sun, 2008-08-24 at 01:40 -0400, Oren Laadan wrote:
>>>> +/* vma subtypes */
>>>> +enum {
>>>> + CR_VMA_ANON = 1,
>>>> + CR_VMA_FILE
>>>> +};
>>> Is this really necessary, or can we infer it from the contents of the
>>> VMA?
>> This classification eventually simplifies both dump and restore. For
>> instance, it decides whether a file name follows or not.
>>
>> There will be more, later: CR_VMA_FILE_UNLINKED (mapped to an unlinked
>> file), CR_VMA_ANON_SHARED (shared anonymous), CR_VMA_ANON_SHARED_SKIP
>> (shared anonymous, had been sent before) and so on.
>
> I still don't see there being a need to explicitly specify the
> distinction. Why should a VMA mapping an unlinked file be any different
> from a linked one? It should point over to the 'file' checkpoint
> structure and let the real work be done there.
>
> There are no truly anonymous shared areas. They anon ones are still
> file-backed as far as the kernel is concerned. If we do the file saving
> correctly, I think most of these problems just fall out.

The classifications helps to make the code cleaner (and more readable). In
any case, you need at least to save a classifier that tells whether a shared
VMA is anonymous, or mapped to a file, or is an IPC shmem segment (and also
whether that IPC segment has been removed - a la unlinked): you simply don't
treat them equally; and on the restart side you can't re-test it on the VMA
structure itself.

Given that we need a classifier anyway, re-using it to describe all VMAs and
not only distinguish the above (and maybe more) cases not only enhances the
readability of the code, but also allows to merge common code paths based on
the value of the classifier.

Probably you won't be convinced until I add the code to support all types of
VMAs. As it is, the classifier is harmless, so please bare with it for now.

>
>>>> struct cr_hdr_head {
>>>> __u64 magic;
>>>>

[...]

>>>> +
>>>> + while (addr < end) {
>>>> + struct page *page;
>>>> +
>>>> + /* simplified version of get_user_pages(): already have vma,
>>>> + * only need FOLL_TOUCH, and (for now) ignore fault stats */
>>>> +
>>>> + cond_resched();
>>>> + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
>>>> + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
>>>> + if (ret & VM_FAULT_ERROR) {
>>>> + if (ret & VM_FAULT_OOM)
>>>> + ret = -ENOMEM;
>>>> + else if (ret & VM_FAULT_SIGBUS)
>>>> + ret = -EFAULT;
>>>> + else
>>>> + BUG();
>>>> + break;
>>>> + }
>>>> + cond_resched();
>>>> + }
>>> At best this needs to get folded back into its own function. This is
>> This is almost identical to the original - see the preceding comment.
>
> Exactly. The code is copy-and-pasted. If there's a bug in the
> original, who will change this one? Better to simply consolidate the
> code into one copy.
>
>>> pretty hard to read as-is. Also, are there truly no in-kernel functions
>>> that can be used for this?
>> Can you suggest one ?
>
> This is where the mentality has to shift. Instead of thinking "there is
> no exact in-kernel match for what I want here" we need to consider that
> we can modify the in-kernel ones. They can be modified to suit both
> existing and the new needs that we have.

I agree.

However, my main goal now is not to make this patch perfect, but to provide
a viable proof-of-concept that demonstrates how we want to do things. Unless
you feel we are near ready to send these for inclusion soon (...), I intend
to prioritize design and functionality.

I'll add a FIXME comment there. Of course, should you provide a patch to fix
this (or any other such issue) I'll merge it in for the next round.

>
>>>> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
>>>> + struct page **pages = pgarr->pages;
>>>> + int nr = pgarr->nused;
>>>> + void *ptr;
>>>> +
>>>> + while (nr--) {
>>>> + ptr = kmap(*pages);
>>>> + ret = cr_kwrite(ctx, ptr, PAGE_SIZE);
>>>> + kunmap(*pages);
>>> Why not use kmap_atomic() here?
>> It is my understanding that the code must not sleep between kmap_atomic()
>> and kunmap_atomic().
>
> Yes, but you're going to absolutely kill performance on systems which
> have expensive global TLB flushes. Frequent kmaps()s should be avoided
> at all costs.
>
> The best way around this would be to change the interface to cr_kwrite()
> so that it didn't have to use *mapped* kernel memory. Maybe an
> interface that takes 'struct page'. Or, one where we kmap_atomic() the
> buffer, kunmap_atomic(), copy to a temporary buffer, then cr_kwrite().
>
>>>> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
>>>> +{
>>>> + struct cr_hdr h;
>>>> + struct cr_hdr_vma *hh = ctx->hbuf;
>>>> + char *fname = NULL;
>>>> + int flen = 0, how, nr, ret;
>>>> +
>>>> + h.type = CR_HDR_VMA;
>>>> + h.len = sizeof(*hh);
>>>> + h.ptag = 0;
>>>> +
>>>> + hh->vm_start = vma->vm_start;
>>>> + hh->vm_end = vma->vm_end;
>>>> + hh->vm_page_prot = vma->vm_page_prot.pgprot;
>>>> + hh->vm_flags = vma->vm_flags;
>>>> + hh->vm_pgoff = vma->vm_pgoff;
>>>> +
>>>> + if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
>>>> + pr_warning("CR: unknown VMA %#lx\n", vma->vm_flags);
>>>> + return -ETXTBSY;
>>>> + }
>>> Hmmm. Interesting error code for VM_HUGETLB. :)
>> :) well, the usual EINVAL didn't seem suitable. Any better suggestions ?
>
> -ENOSUP might be clearest for now. "Your system call tried to do
> something unsupported."

Are you suggesting adding a new error code ?

Oren.

2008-08-27 00:38:19

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart



Dave Hansen wrote:
> On Sun, 2008-08-24 at 22:47 -0400, Oren Laadan wrote:
>>> I replaced all of the uses of these with kmalloc()/kfree() and stack
>>> allocations. I'm really not sure what these buy us since our allocators
>>> are already so fast. tbuf, especially, worries me if it gets used in
>>> any kind of nested manner we're going to get some really fun bugs.
>> I disagree with putting some of the cr_hdr_... on the stack: first, if they
>> grow in size, it's easy to forget to change the allocation to stack.
>
> I can buy that.
>
>> Second,
>> using a standard code/handling for all cr_hdr_... makes the code more readable
>> and easier for others to extend by simply following existing code.
>
> It actually makes it harder for others to jump into because they see
> something which is basically a kmalloc() to the rest of the kernel. We
> don't have ext2_alloc() or usb_alloc(), so I'm not sure we should have
> cr_alloc(), effectively.

I disagree. It's more than "allocate memory", it's: "give me some room on
the buffer". When we don't care about downtime (like now), it's enough to
kmalloc a temporary buffer and flush (write to the FD) immediately.

But when we will care about downtime, "the buffer" will be memory in kernel
where the entire checkpoint image data will be kept while the container is
frozen (memory contents need only be marked copy-on-write, not explicitly
buffered).

In this setting, cr_kwrite() will not really write the data to the FD, but
only record somewhere that the data exists. Only after the container resumes
execution will we eventually write the data to the FD and free the buffer.

(This is also useful in case you want to keep the checkpoint image entirely
in memory without flushing to any FD; and yes, there are use-cases for that).

So for this, instead of using kmalloc() to allocate a temp-buf, fill it with
data, and then copy that data to the actual (big) buffer, the mechanism
provides a shortcut to allocate space directly on the buffer, and save the
copy.

>
>> The main argument for ->hbuf is that eventually we would like to buffer
>> data in the kernel to avoid potentially slow writing/reading when processes
>> are frozen during checkpoint/restart.
>
> Um, we're writing to a file descriptor and kmap()'ing. Those can both
> potentially be very, very slow. I don't think that a few kmalloc()s
> thrown in there are going to be noticeable.

But you miss the point: the writing to the file descriptor in the (future)
optimized version will _not_ happen while the container is frozen. Only
much later after the container resumes, so at the end it will be:
(1) freeze container (2) dump data to in-kernel buffer (mark dirty pages
copy-on-write) (3) unfreeze container (4) write in-kernel buffer to FD.

By deferring step #4 until after the freeze-period, you can save tens and
hundreds of milliseconds, and seconds. Now the goal is to make step #2 be
as fast as possible, and every millisecond (and less) counts, e.g. to take
fast checkpoints of interactive desktops. And my experience is that in
such "workload" repetitive kmalloc/kfree inside that dump loop has a
measurable impact on performance (of course, for those objects which are
abundant).

>
>> By using the simple cr_get_hbuf() and
>> cr_put_hbuf() we prepare for that: now they just allocate room in ->hbuf,
>> but later they will provide a pointer directly in the data buffer. Moreover,
>> it simplifies the error path since cleanup (of ->hbuf) is implicit.
>
> It simplifies *one* error path. But, it complicates the ctx creation
> and makes *that* error path more complex. Pick your poison, I guess.

Actually it simplifies *many* error paths -- in *all* cr_write_...()
functions - including the future ones (and there are many to come).

>
>> Also,
>> it is designed to allow checkpoint and restart function to be called in a
>> nested manner, again simplifying the code. Finally, my experience was that
>> it can impact performance if you are after very short downtimes, especially
>> for the checkpoint.
>
> Right, but I'm comparing it to kmalloc() Certainly kmalloc()s can be
> used in a nested manner.
>
>>>> +/* read the checkpoint header */
>>>> +static int cr_read_hdr(struct cr_ctx *ctx)
>>>> +{
>>>> + struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
>>>> + int ret;
>>>> +
>>>> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
>>>> + if (ret < 0)
>>>> + return ret;
>>>> + else if (ret != 0)
>>>> + return -EINVAL;
>>> How about just making cr_read_obj_type() stop returning positive values?
>>> Is it ever valid for it to return >0?
>> The idea with cr_read_obj_type() is that it returns the 'ptag' - parent tag -
>> field of the object that it reads in. The 'ptag' is the tag of the parent
>> object, and is useful in several places. For instance, the 'ptag' of an MM
>> header identifies (is equal to) the 'tag' of TASK to which it belongs. In
>> this case, the 'ptag' should be zero because it has no parent object.
>>
>> I'll change the variable name from 'ret' to 'ptag' to make it more obvious.
>
> Since this ptag isn't actually used, I can't really offer a suggestion.
> I don't see the whole picture.

You can see how it's used in the code that dumps files. Or look at the next
incarnation of the patch.

Oren.

2008-08-27 08:27:01

by Louis Rilling

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 7/9] Infrastructure for shared objects

On Tue, Aug 26, 2008 at 10:01:04AM -0700, Dave Hansen wrote:
> On Thu, 2008-08-21 at 12:40 +0200, Louis Rilling wrote:
> >
> > > +struct cr_objhash {
> > > + struct cr_obj **hash;
> > > + int next_tag;
> > > +};
> > > +
> > > +#define CR_OBJHASH_NBITS 10 /* 10 bits = 1K buckets */
> > > +#define CR_OBJHASH_ORDER 0 /* 1K buckets * 4 bytes/bucket = 1
> > page */
> >
> > Only true when PAGE_SIZE == 4K and in 32bits. Perhaps like below?
> >
> > #define CR_OBJHASH_BUCKET_NBITS (BITS_PER_LONG == 64 ? 3 : 2)
> > #define CR_MIN_OBJHASH_NBITS ((PAGE_SHIFT - CR_OBJHASH_BUCKET_NBITS)
> > #define CR_OBJHASH_NBITS (CR_MIN_OBJHASH_NBITS >= 10 ?
> > CR_MIN_OBJHASH_NBITS : 10)
> > #define CR_OBJHASH_ORDER (CR_OBJHASH_NBITS + CR_OBJHASH_BUCKET_NBITS -
> > PAGE_SHIFT)
>
> Gah. Can't we just use the existing radix tree or hash implementations
> we already have?

AFAICS, all hash implementations work like this (I'm referring to pid hashes
and compat_ioctl hashes for instance). The only common code is the hash
functions, which Oren seems to have used.

Radix trees may be an alternative, given that object tags are sequentially
generated in some way. I have no idea about the simplicity we can gain though.

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


Attachments:
(No filename) (1.37 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2008-08-27 15:41:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

On Tue, 2008-08-26 at 20:14 -0400, Oren Laadan wrote:
> >>>> +
> >>>> + while (addr < end) {
> >>>> + struct page *page;
> >>>> +
> >>>> + /* simplified version of get_user_pages(): already have vma,
> >>>> + * only need FOLL_TOUCH, and (for now) ignore fault stats */
> >>>> +
> >>>> + cond_resched();
> >>>> + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
> >>>> + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> >>>> + if (ret & VM_FAULT_ERROR) {
> >>>> + if (ret & VM_FAULT_OOM)
> >>>> + ret = -ENOMEM;
> >>>> + else if (ret & VM_FAULT_SIGBUS)
> >>>> + ret = -EFAULT;
> >>>> + else
> >>>> + BUG();
> >>>> + break;
> >>>> + }
> >>>> + cond_resched();
> >>>> + }
> >>> At best this needs to get folded back into its own function. This is
> >> This is almost identical to the original - see the preceding comment.
> >
> > Exactly. The code is copy-and-pasted. If there's a bug in the
> > original, who will change this one? Better to simply consolidate the
> > code into one copy.
> >
> >>> pretty hard to read as-is. Also, are there truly no in-kernel functions
> >>> that can be used for this?
> >> Can you suggest one ?
> >
> > This is where the mentality has to shift. Instead of thinking "there is
> > no exact in-kernel match for what I want here" we need to consider that
> > we can modify the in-kernel ones. They can be modified to suit both
> > existing and the new needs that we have.
>
> I agree.
>
> However, my main goal now is not to make this patch perfect, but to provide
> a viable proof-of-concept that demonstrates how we want to do things. Unless
> you feel we are near ready to send these for inclusion soon (...), I intend
> to prioritize design and functionality.

So, you currently have buy-in on this basic approach from the OpenVZ
guys, and probably Ingo. If you get too far along in the "design and
functionality" and a community member comes up with a really good
objection to some basic part of the "design and functionality" you're
going to have a lot of code to rewrite.

I'm trying to get minimized the quantity of "design and functionality"
down to the bare minimum set where we can get this merged. So, yes, I
do think these should be sent for inclusion soon.

I believe that we're on course to create another large out-of-tree patch
set that does in-kernel checkpoint/restart. We have no shortage of
those. It's been done many times before.

This one will *HOPEFULLY* be different from all the rest when it gets
merged.

> >>>> + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) {
> >>>> + struct page **pages = pgarr->pages;
> >>>> + int nr = pgarr->nused;
> >>>> + void *ptr;
> >>>> +
> >>>> + while (nr--) {
> >>>> + ptr = kmap(*pages);
> >>>> + ret = cr_kwrite(ctx, ptr, PAGE_SIZE);
> >>>> + kunmap(*pages);
> >>> Why not use kmap_atomic() here?
> >> It is my understanding that the code must not sleep between kmap_atomic()
> >> and kunmap_atomic().
> >
> > Yes, but you're going to absolutely kill performance on systems which
> > have expensive global TLB flushes. Frequent kmaps()s should be avoided
> > at all costs.
> >
> > The best way around this would be to change the interface to cr_kwrite()
> > so that it didn't have to use *mapped* kernel memory. Maybe an
> > interface that takes 'struct page'. Or, one where we kmap_atomic() the
> > buffer, kunmap_atomic(), copy to a temporary buffer, then cr_kwrite().
> >
> >>>> +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
> >>>> +{
> >>>> + struct cr_hdr h;
> >>>> + struct cr_hdr_vma *hh = ctx->hbuf;
> >>>> + char *fname = NULL;
> >>>> + int flen = 0, how, nr, ret;
> >>>> +
> >>>> + h.type = CR_HDR_VMA;
> >>>> + h.len = sizeof(*hh);
> >>>> + h.ptag = 0;
> >>>> +
> >>>> + hh->vm_start = vma->vm_start;
> >>>> + hh->vm_end = vma->vm_end;
> >>>> + hh->vm_page_prot = vma->vm_page_prot.pgprot;
> >>>> + hh->vm_flags = vma->vm_flags;
> >>>> + hh->vm_pgoff = vma->vm_pgoff;
> >>>> +
> >>>> + if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) {
> >>>> + pr_warning("CR: unknown VMA %#lx\n", vma->vm_flags);
> >>>> + return -ETXTBSY;
> >>>> + }
> >>> Hmmm. Interesting error code for VM_HUGETLB. :)
> >> :) well, the usual EINVAL didn't seem suitable. Any better suggestions ?
> >
> > -ENOSUP might be clearest for now. "Your system call tried to do
> > something unsupported."
>
> Are you suggesting adding a new error code ?

Dang. What's the thing we return from system calls that are
unsupported? Did I just dream that up?

-- Dave

2008-08-27 15:57:35

by Louis Rilling

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

On Wed, Aug 27, 2008 at 08:41:36AM -0700, Dave Hansen wrote:
>
> Dang. What's the thing we return from system calls that are
> unsupported? Did I just dream that up?

ENOSYS?

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes


Attachments:
(No filename) (360.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2008-08-27 16:12:26

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

On Wed, 2008-08-27 at 17:57 +0200, Louis Rilling wrote:
> On Wed, Aug 27, 2008 at 08:41:36AM -0700, Dave Hansen wrote:
> > Dang. What's the thing we return from system calls that are
> > unsupported? Did I just dream that up?
>
> ENOSYS?

Brain not functioning, yet. Thanks.

-- Dave

2008-08-27 16:19:24

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

Dave Hansen wrote:
> On Wed, 2008-08-27 at 17:57 +0200, Louis Rilling wrote:
>
>> On Wed, Aug 27, 2008 at 08:41:36AM -0700, Dave Hansen wrote:
>>
>>> Dang. What's the thing we return from system calls that are
>>> unsupported? Did I just dream that up?
>>>
>> ENOSYS?
>>
>
> Brain not functioning, yet. Thanks.
>

Or EOPNOTSUPP - but that's documented as a network error ("Operation not
supported on transport endpoint").

J

2008-08-27 20:34:47

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

Quoting Dave Hansen ([email protected]):
> On Tue, 2008-08-26 at 20:14 -0400, Oren Laadan wrote:
> > >>>> +
> > >>>> + while (addr < end) {
> > >>>> + struct page *page;
> > >>>> +
> > >>>> + /* simplified version of get_user_pages(): already have vma,
> > >>>> + * only need FOLL_TOUCH, and (for now) ignore fault stats */
> > >>>> +
> > >>>> + cond_resched();
> > >>>> + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
> > >>>> + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
> > >>>> + if (ret & VM_FAULT_ERROR) {
> > >>>> + if (ret & VM_FAULT_OOM)
> > >>>> + ret = -ENOMEM;
> > >>>> + else if (ret & VM_FAULT_SIGBUS)
> > >>>> + ret = -EFAULT;
> > >>>> + else
> > >>>> + BUG();
> > >>>> + break;
> > >>>> + }
> > >>>> + cond_resched();
> > >>>> + }
> > >>> At best this needs to get folded back into its own function. This is
> > >> This is almost identical to the original - see the preceding comment.
> > >
> > > Exactly. The code is copy-and-pasted. If there's a bug in the
> > > original, who will change this one? Better to simply consolidate the
> > > code into one copy.
> > >
> > >>> pretty hard to read as-is. Also, are there truly no in-kernel functions
> > >>> that can be used for this?
> > >> Can you suggest one ?
> > >
> > > This is where the mentality has to shift. Instead of thinking "there is
> > > no exact in-kernel match for what I want here" we need to consider that
> > > we can modify the in-kernel ones. They can be modified to suit both
> > > existing and the new needs that we have.
> >
> > I agree.
> >
> > However, my main goal now is not to make this patch perfect, but to provide
> > a viable proof-of-concept that demonstrates how we want to do things. Unless
> > you feel we are near ready to send these for inclusion soon (...), I intend
> > to prioritize design and functionality.
>
> So, you currently have buy-in on this basic approach from the OpenVZ
> guys, and probably Ingo. If you get too far along in the "design and
> functionality" and a community member comes up with a really good
> objection to some basic part of the "design and functionality" you're
> going to have a lot of code to rewrite.
>
> I'm trying to get minimized the quantity of "design and functionality"
> down to the bare minimum set where we can get this merged. So, yes, I
> do think these should be sent for inclusion soon.
>
> I believe that we're on course to create another large out-of-tree patch
> set that does in-kernel checkpoint/restart. We have no shortage of
> those. It's been done many times before.
>
> This one will *HOPEFULLY* be different from all the rest when it gets
> merged.

It's true there are out-of-tree proofs-of-concept for in-kernel c/r, but
on the other hand it is nice seeing where Oren's code is going. Based
on the patchsets you and he have been sending, the impression I've
gotten was that Oren was fleshing out the longer-term tree, while you
were working on the upstream-acceptable minimal patchset. That seems
like a good model to me. (Was I wrong about either of your intents?)

It does mean that much of Oren's patchset may end up having to be
re-written based on how Dave's patches look when they get upstream, but
I'm sure Oren knows that's par for the course and doesn't mind.

(Or, is that too much effort required on your part to try and
cherrypick bits of Oren's changes back into your tree?)

In any case, if that *is* the model, then I'd really like to see a
repost of your (Dave's) simplified patchset. I think we need to
decide precisly which features we want to try to push first (I
thought we were not going to support fds?), throw out any extraneous
infrastructure, put the source for the one-and-only program that
it can checkpoint and restart in the patch description, and see what
feedback we get from the community. The reason it'll be good to
have Oren continue on his own path is that you know we'll get
questions like "well how are you going to handle (X)", so then
we can point them to Oren's tree.

At least, that's how I was seeing it.

-serge

2008-08-27 20:38:47

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

On Wed, 2008-08-27 at 15:34 -0500, Serge E. Hallyn wrote:
> (Or, is that too much effort required on your part to try and
> cherrypick bits of Oren's changes back into your tree?)

That would probably work as long as Oren is working on top of what I
have. It's going to be a lot harder if I have to repeat the same
break-out process for each iteration of Oren's patches, though.

-- Dave

2008-08-27 20:49:19

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

Quoting Dave Hansen ([email protected]):
> On Wed, 2008-08-27 at 15:34 -0500, Serge E. Hallyn wrote:
> > (Or, is that too much effort required on your part to try and
> > cherrypick bits of Oren's changes back into your tree?)
>
> That would probably work as long as Oren is working on top of what I
> have. It's going to be a lot harder if I have to repeat the same
> break-out process for each iteration of Oren's patches, though.
>
> -- Dave

If Oren's purpose is not quite to create a upstreamable patchset,
then it would make more sense for him to keep a git tree and
put new patches on top of his existing ones (within reaason as he
rebases). Then you'd at least be able to trivially look at the latest
deltas.

-serge

2008-08-27 20:57:52

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

On Wed, 2008-08-27 at 15:48 -0500, Serge E. Hallyn wrote:
> Quoting Dave Hansen ([email protected]):
> > On Wed, 2008-08-27 at 15:34 -0500, Serge E. Hallyn wrote:
> > > (Or, is that too much effort required on your part to try and
> > > cherrypick bits of Oren's changes back into your tree?)
> >
> > That would probably work as long as Oren is working on top of what I
> > have. It's going to be a lot harder if I have to repeat the same
> > break-out process for each iteration of Oren's patches, though.
> >
>
> If Oren's purpose is not quite to create a upstreamable patchset,
> then it would make more sense for him to keep a git tree and
> put new patches on top of his existing ones (within reaason as he
> rebases). Then you'd at least be able to trivially look at the latest
> deltas.

The trick would be compromising on things that I, for instance, think
need to be rewritten or removed.

Oren would have to rebase his work against what I do. I guess you could
think about it like me being upstream from Oren. Anything that I would
change, Oren would need to rebase on top of.

Oren, are you willing to keep a set of patches that add functionality on
top of a minimal set that I'm keeping? Mine being the set that we're
trying to push into mainline as soon as possible.

-- Dave

2008-08-31 07:17:10

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state


Dave, Serge:

I'm currently away so I must keep this short. I think we have so far
more discussion than an actual problem. I'm happy to coordinate with
every interested party to eventually see this work go into main stream.

My only concerns are twofold: first, to get more feedback I believe we
need to get the code a bit more usable; including FDs is an excellent
way to actually do that. That will add significant value to the patch.
I think it's important to demonstrate how shared resources and multiple
processes are handled. FDs demonstrate the former (with a fixed version
of the recent patchset - I will post soon). The latter will increase
the size of the patchset significantly, so perhaps can indeed wait for
now.

It should not be hard for me to add functionality on top of a more
basic patchset. The question is, what is "basic" ? Anyway, I will be
back towards the end of the week. Let's try to discuss this over IRC
then (e.g. Friday afternoon ?).

Oren.

Dave Hansen wrote:
> On Wed, 2008-08-27 at 15:48 -0500, Serge E. Hallyn wrote:
>> Quoting Dave Hansen ([email protected]):
>>> On Wed, 2008-08-27 at 15:34 -0500, Serge E. Hallyn wrote:
>>>> (Or, is that too much effort required on your part to try and
>>>> cherrypick bits of Oren's changes back into your tree?)
>>> That would probably work as long as Oren is working on top of what I
>>> have. It's going to be a lot harder if I have to repeat the same
>>> break-out process for each iteration of Oren's patches, though.
>>>
>> If Oren's purpose is not quite to create a upstreamable patchset,
>> then it would make more sense for him to keep a git tree and
>> put new patches on top of his existing ones (within reaason as he
>> rebases). Then you'd at least be able to trivially look at the latest
>> deltas.
>
> The trick would be compromising on things that I, for instance, think
> need to be rewritten or removed.
>
> Oren would have to rebase his work against what I do. I guess you could
> think about it like me being upstream from Oren. Anything that I would
> change, Oren would need to rebase on top of.
>
> Oren, are you willing to keep a set of patches that add functionality on
> top of a minimal set that I'm keeping? Mine being the set that we're
> trying to push into mainline as soon as possible.
>
> -- Dave
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

2008-08-31 17:35:03

by Cédric Le Goater

[permalink] [raw]
Subject: Re: [RFC v2][PATCH 4/9] Memory management - dump state

Oren Laadan wrote:
> Dave, Serge:
>
> I'm currently away so I must keep this short. I think we have so far
> more discussion than an actual problem. I'm happy to coordinate with
> every interested party to eventually see this work go into main stream.

thanks. We do need a moderator and federator.

> My only concerns are twofold: first, to get more feedback I believe we
> need to get the code a bit more usable; including FDs is an excellent
> way to actually do that. That will add significant value to the patch.

hmm, yes and no.

fd's are a must have but I would be more interested to see an external
checkpoint/restart and signal support first. why ? because it would be
already usable for most computational programs in HPC, like this stupid
one :

https://www.ccs.uky.edu/~bmadhu/pi/pi1.c

signals are required because it's how 'load' and/or 'system' managers
interact with the jobs they spawn. external checkpoint/restart for the
same reason.

for files, I would first only care about stdios (make sure they're
relinked to something safe on restart) and file states of regular files.
contents is generally handled externally (deleted files being an annoying
exception)

then, support for openmp application is a nice to have, so I'd probably
go on with thread support.

> I think it's important to demonstrate how shared resources and multiple
> processes are handled. FDs demonstrate the former (with a fixed version
> of the recent patchset - I will post soon).

shared resources are only useful in a multiprocess/multitask context.
I'd start working on this first. here we jump directly in the pid namespace
issues, how we start a set of process in a pidnamespace ? how do we
relink it to its parent pidnamespace ? are signals well propagated ? etc.
but hey, we'll have to solve it one day.

FD's are shared but have many types which are pain to handle. (it would
interesting to see if we can add checkpoint() and restart() operations in
fileops) So, for shared resources demonstration, I'd work on sysvipc,
there are less types to handle and they force us to think how we are going
to merge with the sysvipc namespaces.

> The latter will increase the size of the patchset significantly, so
> perhaps can indeed wait for now.

hmm, that depends how you do it.

If you restart all the hierarchy in the kernel, It will increase for sure
the patch footprint. However if you restore the hierarchy from user space
and then let each process restore itself from some binary blob, it should
not. This, of course, means that the binary blob representing the state of
the container (we call it statefile) is not totally opaque. It see it a bit
like /proc, a directory containing shared states (all namespaces) and tasks
states. That's something to discuss.

I do prefer the second option for many reasons:

. each process restarts itself from its current context, this makes it easier
to reuse kernel services depending on current.

. user tools can evaluate more precisely what they are going to restart from
the statefile. see this as a generalised 'readelf' that would be run on
the statefile, like we do on a core file today.


> It should not be hard for me to add functionality on top of a more
> basic patchset. The question is, what is "basic" ? Anyway, I will be
> back towards the end of the week. Let's try to discuss this over IRC
> then (e.g. Friday afternoon ?).

IHMO, the first one is to support a 'basic' computational program in
a real environment (under a load manager HPC). your POC nearly reaches
it but the user space API (how to launch, checkpoint, restart) needs to
be worked on.


There are some big steps in the development.

Multi-task is a big step which opens plenty of other big steps with
shared resources : mem, ipc, fds, etc. Not all have to be solved
but at least detected if we don't have the support.

Network is another one. This is an interesting step to support
distributed application using MPI over TCP. May be a priority.

there are also plenty of funky kernel resources used by misc servers,
database that will need special attention.



I'll be happy to start with the basic menu first as I know that it will
be useful for many applications !

Thanks,

C.