2008-10-17 23:12:23

by Andrey Mirkin

Subject: [PATCH 0/10] OpenVZ kernel based checkpointing/restart (v2)

This patchset introduces kernel-based checkpointing/restart as it is
implemented in the OpenVZ project. This version (v2) supports multiple
processes with simple private memory and open files (regular files).

Todo:
- Create processes with the same PID during restart
- Add support for x86-64
- Add support for shared objects

Changelog:

18 Oct 2008 (v2):
- Add support for multiple processes
- Cleanup and bug fixes

--

This patchset introduces kernel-based checkpointing/restart as it is
implemented in the OpenVZ project. It has limited functionality and is able
to checkpoint/restart only a single process. Recently Oren Laadan sent
another kernel-based implementation of checkpoint/restart. The main
differences between this patchset and Oren's patchset are:

* In this patchset checkpointing is initiated from outside the process
(right now we do not have a container, only namespaces), while Oren's
patchset performs checkpointing from the process context.

* Restart in this patchset is initiated from a process which starts a new
process (in new namespaces) with the saved state. Oren's patchset reuses the
process from which restart was initiated and restores the saved state over it.

* Checkpoint/restart functionality in this patchset is implemented as a
kernel module.


As checkpointing is initiated from outside the process whose state should be
saved, the process has to be frozen before its state is saved. Right now the
container freezer from Matt Helsley can be used for this.
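
For illustration only, a minimal userspace sketch of freezing a container's
tasks through the cgroup freezer before issuing a checkpoint request; the
cgroup mount point and group name ("/cgroups/ct1") are assumptions, not part
of this patchset:

/* Sketch: freeze every task in a freezer cgroup before checkpointing. */
#include <errno.h>
#include <stdio.h>

static int freeze_group(const char *state_file)
{
        FILE *f = fopen(state_file, "w");

        if (!f)
                return -errno;
        fputs("FROZEN\n", f);           /* ask the freezer to stop all tasks */
        fclose(f);
        return 0;
}

int main(void)
{
        /* hypothetical path; depends on where the freezer cgroup is mounted */
        return freeze_group("/cgroups/ct1/freezer.state") ? 1 : 0;
}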

This patchset only introduces a concept of how kernel-based
checkpointing/restart can be implemented and is able to checkpoint/restart
only a single process with simple VMAs.

I've tried to split my patchset into small patches to make review easier.


2008-10-17 23:11:59

by Andrey Mirkin

Subject: [PATCH 01/10] Introduce trivial sys_checkpoint and sys_restart system calls

Right now they just return -ENOSYS. Later they will provide functionality
to checkpoint and restart a container.

Both syscalls take a file descriptor and flags as arguments.
In addition, sys_checkpoint takes as its first argument the PID of the
container's init (later it will be a container ID); sys_restart takes as its
first argument a container ID (right now it is not used).
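
As a usage sketch (not part of the patch itself), a checkpoint request could
be issued from userspace roughly as below. The syscall numbers match the
32-bit x86 table added in this patch; the image path and PID are hypothetical:

/* Sketch: call the new syscalls directly through syscall(2) on 32-bit x86. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define __NR_checkpoint 334
#define __NR_restart    335

int main(void)
{
        int fd = open("/tmp/ct.img", O_CREAT | O_WRONLY | O_TRUNC, 0600);
        pid_t init_pid = 1234;  /* PID of the container's init (hypothetical) */

        if (fd < 0)
                return 1;
        if (syscall(__NR_checkpoint, init_pid, fd, 0UL) < 0)
                perror("checkpoint");   /* -ENOSYS until later patches */
        close(fd);
        return 0;
}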

Signed-off-by: Andrey Mirkin <[email protected]>
---
Makefile | 2 +-
arch/x86/kernel/syscall_table_32.S | 2 +
checkpoint/Makefile | 1 +
checkpoint/sys_core.c | 38 ++++++++++++++++++++++++++++++++++++
include/asm-x86/unistd_32.h | 2 +
5 files changed, 44 insertions(+), 1 deletions(-)
create mode 100644 checkpoint/Makefile
create mode 100644 checkpoint/sys_core.c

diff --git a/Makefile b/Makefile
index ea413fa..ce49afd 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd


ifeq ($(KBUILD_EXTMOD),)
-core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/

vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
$(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index fd9d4f4..4a0d7fb 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -333,3 +333,5 @@ ENTRY(sys_call_table)
.long sys_pipe2
.long sys_inotify_init1
.long sys_hijack
+ .long sys_checkpoint
+ .long sys_restart /* 335 */
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..2276fb1
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1 @@
+obj-y += sys_core.o
diff --git a/checkpoint/sys_core.c b/checkpoint/sys_core.c
new file mode 100644
index 0000000..1a97fb6
--- /dev/null
+++ b/checkpoint/sys_core.c
@@ -0,0 +1,38 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+
+/**
+ * sys_checkpoint - checkpoint a container from outside
+ * @pid: pid of the container init(1) process
+ * TODO: should switch to container id later
+ * @fd: file to which save the checkpoint image
+ * @flags: checkpoint operation flags
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+ return -ENOSYS;
+}
+
+/**
+ * sys_restart - restart a container
+ * @ctid: container id which should be used to restart a container
+ * @fd: file from which read the checkpoint image
+ * @flags: restart operation flags
+ */
+asmlinkage long sys_restart(int ctid, int fd, unsigned long flags)
+{
+ return -ENOSYS;
+}
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index 70280da..1a09604 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -339,6 +339,8 @@
#define __NR_pipe2 331
#define __NR_inotify_init1 332
#define __NR_hijack 333
+#define __NR_checkpoint 334
+#define __NR_restart 335

#ifdef __KERNEL__

--
1.5.6

2008-10-17 23:12:36

by Andrey Mirkin

Subject: [PATCH 02/10] Make checkpoint/restart functionality modular

A config option CONFIG_CHECKPOINT is introduced.
A new structure, cpt_operations, is introduced to store pointers to the
checkpoint/restart functions provided by the module.
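
The point of the ops table is that the syscall stubs stay built into the
kernel while the real work lives in a loadable module; try_module_get() pins
the module for the duration of the call so it cannot be unloaded underneath
us. A condensed sketch of the idiom (the full version is in the patch below):

        if (try_module_get(cpt_ops.owner)) {
                if (cpt_ops.checkpoint)
                        ret = cpt_ops.checkpoint(pid, fd, flags);
                module_put(cpt_ops.owner);
        }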

Signed-off-by: Andrey Mirkin <[email protected]>
---
checkpoint/Kconfig | 7 ++++++
checkpoint/Makefile | 4 +++
checkpoint/checkpoint.h | 19 ++++++++++++++++++
checkpoint/sys.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/sys_core.c | 29 ++++++++++++++++++++++++++-
init/Kconfig | 2 +
6 files changed, 107 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/Kconfig
create mode 100644 checkpoint/checkpoint.h
create mode 100644 checkpoint/sys.c

diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..b9bc72d
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,7 @@
+config CHECKPOINT
+ tristate "Checkpoint & restart for containers"
+ depends on EXPERIMENTAL
+ default n
+ help
+ This option adds the module "cptrst", which allows saving a running
+ container to a file and restarting it later from this image file.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 2276fb1..bfe75d5 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -1 +1,5 @@
obj-y += sys_core.o
+
+obj-$(CONFIG_CHECKPOINT) += cptrst.o
+
+cptrst-objs := sys.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
new file mode 100644
index 0000000..381a9bf
--- /dev/null
+++ b/checkpoint/checkpoint.h
@@ -0,0 +1,19 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+struct cpt_operations
+{
+ struct module * owner;
+ int (*checkpoint)(pid_t pid, int fd, unsigned long flags);
+ int (*restart)(int ctid, int fd, unsigned long flags);
+};
+extern struct cpt_operations cpt_ops;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..010e4eb
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,48 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+
+#include "checkpoint.h"
+
+MODULE_LICENSE("GPL");
+
+static int checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+ return -ENOSYS;
+}
+
+static int restart(int ctid, int fd, unsigned long flags)
+{
+ return -ENOSYS;
+}
+
+static int __init init_cptrst(void)
+{
+ cpt_ops.owner = THIS_MODULE;
+ cpt_ops.checkpoint = checkpoint;
+ cpt_ops.restart = restart;
+ return 0;
+}
+module_init(init_cptrst);
+
+static void __exit exit_cptrst(void)
+{
+ cpt_ops.checkpoint = NULL;
+ cpt_ops.restart = NULL;
+ cpt_ops.owner = NULL;
+}
+module_exit(exit_cptrst);
diff --git a/checkpoint/sys_core.c b/checkpoint/sys_core.c
index 1a97fb6..528aaec 100644
--- a/checkpoint/sys_core.c
+++ b/checkpoint/sys_core.c
@@ -13,6 +13,13 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
+#include <linux/notifier.h>
+#include <linux/module.h>
+
+#include "checkpoint.h"
+
+struct cpt_operations cpt_ops = { NULL, NULL, NULL };
+EXPORT_SYMBOL(cpt_ops);

/**
* sys_checkpoint - checkpoint a container from outside
@@ -23,7 +30,16 @@
*/
asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
{
- return -ENOSYS;
+ int ret;
+
+ ret = -ENOSYS;
+
+ if (try_module_get(cpt_ops.owner)) {
+ if (cpt_ops.checkpoint)
+ ret = cpt_ops.checkpoint(pid, fd, flags);
+ module_put(cpt_ops.owner);
+ }
+ return ret;
}

/**
@@ -34,5 +50,14 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
*/
asmlinkage long sys_restart(int ctid, int fd, unsigned long flags)
{
- return -ENOSYS;
+ int ret;
+
+ ret = -ENOSYS;
+
+ if (try_module_get(cpt_ops.owner)) {
+ if (cpt_ops.restart)
+ ret = cpt_ops.restart(ctid, fd, flags);
+ module_put(cpt_ops.owner);
+ }
+ return ret;
}
diff --git a/init/Kconfig b/init/Kconfig
index 4bd4b0c..b10f3cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -344,6 +344,8 @@ config CGROUP_FREEZER
Provides a way to freeze and unfreeze all tasks in a
cgroup

+source "checkpoint/Kconfig"
+
config FAIR_GROUP_SCHED
bool "Group scheduling for SCHED_OTHER"
depends on GROUP_SCHED
--
1.5.6

2008-10-17 23:12:52

by Andrey Mirkin

Subject: [PATCH 03/10] Introduce context structure needed during checkpointing/restart

Add functions for context allocation/destruction.
Introduce functions to read/write the image.
Introduce the image header and the object header.
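
To make the on-disk layout concrete, here is a hedged sketch of how a writer
could emit one object record using the headers introduced here; the
cpt_len = cpt_hdrlen + payload relationship is an assumption drawn from the
later dump patches, and write_object() itself is not part of the patch:

static int write_object(struct cpt_context *ctx, __u16 type,
                        const void *payload, size_t size)
{
        struct cpt_object_hdr hdr;
        int err;

        hdr.cpt_len = sizeof(hdr) + size;       /* header plus payload */
        hdr.cpt_hdrlen = sizeof(hdr);
        hdr.cpt_type = type;
        hdr.cpt_content = CPT_CONTENT_DATA;

        err = ctx->write(&hdr, sizeof(hdr), ctx);
        if (!err && size)
                err = ctx->write(payload, size, ctx);
        return err;
}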

Signed-off-by: Andrey Mirkin <[email protected]>
---
checkpoint/checkpoint.h | 40 +++++++++++++++
checkpoint/cpt_image.h | 63 ++++++++++++++++++++++++
checkpoint/sys.c | 125 +++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 225 insertions(+), 3 deletions(-)
create mode 100644 checkpoint/cpt_image.h

diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 381a9bf..8ea73f5 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -10,6 +10,8 @@
*
*/

+#include "cpt_image.h"
+
struct cpt_operations
{
struct module * owner;
@@ -17,3 +19,41 @@ struct cpt_operations
int (*restart)(int ctid, int fd, unsigned long flags);
};
extern struct cpt_operations cpt_ops;
+
+enum cpt_ctx_state
+{
+ CPT_CTX_ERROR = -1,
+ CPT_CTX_IDLE = 0,
+ CPT_CTX_DUMPING,
+ CPT_CTX_UNDUMPING
+};
+
+typedef struct cpt_context
+{
+ pid_t pid; /* should be changed to ctid later */
+ int ctx_id; /* context id */
+ struct list_head ctx_list;
+ int refcount;
+ int ctx_state;
+ struct semaphore main_sem;
+
+ int errno;
+
+ struct file *file;
+ loff_t current_object;
+
+ struct list_head object_array[CPT_OBJ_MAX];
+
+ int (*write)(const void *addr, size_t count, struct cpt_context *ctx);
+ int (*read)(void *addr, size_t count, struct cpt_context *ctx);
+} cpt_context_t;
+
+extern int debug_level;
+
+#define cpt_printk(lvl, fmt, args...) do { \
+ if (lvl <= debug_level) \
+ printk(fmt, ##args); \
+ } while (0)
+
+#define eprintk(a...) cpt_printk(1, "CPT ERR: " a)
+#define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
new file mode 100644
index 0000000..0338dd0
--- /dev/null
+++ b/checkpoint/cpt_image.h
@@ -0,0 +1,63 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#ifndef __CPT_IMAGE_H_
+#define __CPT_IMAGE_H_ 1
+
+enum _cpt_object_type
+{
+ CPT_OBJ_TASK = 0,
+ CPT_OBJ_MAX,
+ /* The objects above are stored in memory while checkpointing */
+
+ CPT_OBJ_HEAD = 1024,
+};
+
+enum _cpt_content_type {
+ CPT_CONTENT_VOID,
+ CPT_CONTENT_ARRAY,
+ CPT_CONTENT_DATA,
+ CPT_CONTENT_NAME,
+ CPT_CONTENT_REF,
+ CPT_CONTENT_MAX
+};
+
+#define CPT_SIGNATURE0 0x79
+#define CPT_SIGNATURE1 0x1c
+#define CPT_SIGNATURE2 0x01
+#define CPT_SIGNATURE3 0x63
+
+struct cpt_head
+{
+ __u8 cpt_signature[4]; /* Magic number */
+ __u32 cpt_hdrlen; /* Header length */
+ __u16 cpt_image_major; /* Format of this file */
+ __u16 cpt_image_minor; /* Format of this file */
+ __u16 cpt_image_sublevel; /* Format of this file */
+ __u16 cpt_image_extra; /* Format of this file */
+ __u16 cpt_arch; /* Architecture */
+#define CPT_ARCH_I386 0
+ __u16 cpt_pad1;
+ __u32 cpt_pad2;
+ __u64 cpt_time; /* Time */
+} __attribute__ ((aligned (8)));
+
+/* Common object header. */
+struct cpt_object_hdr
+{
+ __u64 cpt_len; /* Size of current chunk of data */
+ __u32 cpt_hdrlen; /* Size of header */
+ __u16 cpt_type; /* Type of object */
+ __u16 cpt_content; /* Content type: array, reference... */
+} __attribute__ ((aligned (8)));
+
+#endif /* __CPT_IMAGE_H_ */
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 010e4eb..a561a06 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -13,21 +13,140 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/file.h>
-#include <linux/notifier.h>
+#include <linux/uaccess.h>
#include <linux/module.h>

#include "checkpoint.h"
+#include "cpt_image.h"

MODULE_LICENSE("GPL");

+/* Debug level, constant for now */
+int debug_level = 1;
+
+static int file_write(const void *addr, size_t count, struct cpt_context *ctx)
+{
+ mm_segment_t oldfs;
+ ssize_t err = -EBADF;
+ struct file *file = ctx->file;
+
+ oldfs = get_fs(); set_fs(KERNEL_DS);
+ if (file)
+ err = file->f_op->write(file, addr, count, &file->f_pos);
+ set_fs(oldfs);
+ if (err != count)
+ return err >= 0 ? -EIO : err;
+ return 0;
+}
+
+static int file_read(void *addr, size_t count, struct cpt_context *ctx)
+{
+ mm_segment_t oldfs;
+ ssize_t err = -EBADF;
+ struct file *file = ctx->file;
+
+ oldfs = get_fs(); set_fs(KERNEL_DS);
+ if (file)
+ err = file->f_op->read(file, addr, count, &file->f_pos);
+ set_fs(oldfs);
+ if (err != count)
+ return err >= 0 ? -EIO : err;
+ return 0;
+}
+
+struct cpt_context * context_alloc(void)
+{
+ struct cpt_context *ctx;
+ int i;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return NULL;
+
+ init_MUTEX(&ctx->main_sem);
+ ctx->refcount = 1;
+
+ ctx->current_object = -1;
+ ctx->write = file_write;
+ ctx->read = file_read;
+ for (i = 0; i < CPT_OBJ_MAX; i++) {
+ INIT_LIST_HEAD(&ctx->object_array[i]);
+ }
+
+ return ctx;
+}
+
+void context_release(struct cpt_context *ctx)
+{
+ ctx->ctx_state = CPT_CTX_ERROR;
+
+ kfree(ctx);
+}
+
+static void context_put(struct cpt_context *ctx)
+{
+ if (!--ctx->refcount)
+ context_release(ctx);
+}
+
static int checkpoint(pid_t pid, int fd, unsigned long flags)
{
- return -ENOSYS;
+ struct file *file;
+ struct cpt_context *ctx;
+ int err;
+
+ err = -EBADF;
+ file = fget(fd);
+ if (!file)
+ goto out;
+
+ err = -ENOMEM;
+ ctx = context_alloc();
+ if (!ctx)
+ goto out_file;
+
+ ctx->file = file;
+ ctx->ctx_state = CPT_CTX_DUMPING;
+
+ /* checkpoint */
+ err = -ENOSYS;
+
+ context_put(ctx);
+
+out_file:
+ fput(file);
+out:
+ return err;
}

static int restart(int ctid, int fd, unsigned long flags)
{
- return -ENOSYS;
+ struct file *file;
+ struct cpt_context *ctx;
+ int err;
+
+ err = -EBADF;
+ file = fget(fd);
+ if (!file)
+ goto out;
+
+ err = -ENOMEM;
+ ctx = context_alloc();
+ if (!ctx)
+ goto out_file;
+
+ ctx->file = file;
+ ctx->ctx_state = CPT_CTX_UNDUMPING;
+
+ /* restart */
+ err = -ENOSYS;
+
+ context_put(ctx);
+
+out_file:
+ fput(file);
+out:
+ return err;
}

static int __init init_cptrst(void)
--
1.5.6

2008-10-17 23:13:43

by Andrey Mirkin

Subject: [PATCH 06/10] Introduce functions to dump mm

Functions to dump the mm struct, VMAs and mm context are added.
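
As an orientation aid, the record sequence written for one mm looks roughly
like this (reconstructed from the dump code below, so treat it as a sketch):

        cpt_mm_image                        (CPT_OBJ_MM)
            cpt_vma_image                   (CPT_OBJ_VMA)    - one per VMA
                cpt_object_hdr + file path  (CPT_OBJ_NAME)   - file-backed VMAs only
                cpt_page_block + page data  (CPT_OBJ_PAGES)  - one per page range
            ...
            cpt_obj_bits + LDT contents     (CPT_OBJ_BITS)   - only if an LDT exists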

Signed-off-by: Andrey Mirkin <[email protected]>
---
arch/x86/mm/hugetlbpage.c | 2 +
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.h | 1 +
checkpoint/cpt_image.h | 61 +++++++
checkpoint/cpt_mm.c | 434 +++++++++++++++++++++++++++++++++++++++++++++
checkpoint/cpt_process.c | 8 +-
mm/memory.c | 1 +
7 files changed, 504 insertions(+), 5 deletions(-)
create mode 100644 checkpoint/cpt_mm.c

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8f307d9..63028e7 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -12,6 +12,7 @@
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/sysctl.h>
+#include <linux/module.h>
#include <asm/mman.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
@@ -221,6 +222,7 @@ int pmd_huge(pmd_t pmd)
{
return !!(pmd_val(pmd) & _PAGE_PSE);
}
+EXPORT_SYMBOL(pmd_huge);

int pud_huge(pud_t pud)
{
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 457cc96..bbb0e37 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o

obj-$(CONFIG_CHECKPOINT) += cptrst.o

-cptrst-objs := sys.o checkpoint.o cpt_process.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 9e46b10..e3e6b66 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -61,3 +61,4 @@ extern int debug_level;

int dump_container(struct cpt_context *ctx);
int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
+int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
index cddfe37..160cf85 100644
--- a/checkpoint/cpt_image.h
+++ b/checkpoint/cpt_image.h
@@ -16,13 +16,19 @@
#include <linux/sched.h>
#include <asm/segment.h>

+#define CPT_NULL (~0ULL)
+
enum _cpt_object_type
{
CPT_OBJ_TASK = 0,
+ CPT_OBJ_MM,
CPT_OBJ_MAX,
/* The objects above are stored in memory while checkpointing */

CPT_OBJ_HEAD = 1024,
+ CPT_OBJ_VMA,
+ CPT_OBJ_PAGES,
+ CPT_OBJ_NAME,
CPT_OBJ_X86_REGS,
CPT_OBJ_BITS,
};
@@ -35,6 +41,7 @@ enum _cpt_content_type {
CPT_CONTENT_REF,
CPT_CONTENT_X86_FPUSTATE,
CPT_CONTENT_X86_FPUSTATE_OLD,
+ CPT_CONTENT_MM_CONTEXT,
CPT_CONTENT_MAX
};

@@ -123,6 +130,60 @@ struct cpt_task_image {
__u64 cpt_maj_flt;
} __attribute__ ((aligned (8)));

+struct cpt_mm_image {
+ __u64 cpt_len;
+ __u32 cpt_hdrlen;
+ __u16 cpt_type;
+ __u16 cpt_content;
+
+ __u64 cpt_start_code;
+ __u64 cpt_end_code;
+ __u64 cpt_start_data;
+ __u64 cpt_end_data;
+ __u64 cpt_start_brk;
+ __u64 cpt_brk;
+ __u64 cpt_start_stack;
+ __u64 cpt_start_arg;
+ __u64 cpt_end_arg;
+ __u64 cpt_start_env;
+ __u64 cpt_end_env;
+ __u64 cpt_def_flags;
+ __u64 cpt_flags;
+ __u64 cpt_map_count;
+} __attribute__ ((aligned (8)));
+
+struct cpt_vma_image
+{
+ __u64 cpt_len;
+ __u32 cpt_hdrlen;
+ __u16 cpt_type;
+ __u16 cpt_content;
+
+ __u64 cpt_file;
+ __u32 cpt_vma_type;
+#define CPT_VMA_TYPE_0 0
+#define CPT_VMA_FILE 1
+ __u32 cpt_pad;
+
+ __u64 cpt_start;
+ __u64 cpt_end;
+ __u64 cpt_flags;
+ __u64 cpt_pgprot;
+ __u64 cpt_pgoff;
+ __u64 cpt_page_num;
+} __attribute__ ((aligned (8)));
+
+struct cpt_page_block
+{
+ __u64 cpt_len;
+ __u32 cpt_hdrlen;
+ __u16 cpt_type;
+ __u16 cpt_content;
+
+ __u64 cpt_start;
+ __u64 cpt_end;
+} __attribute__ ((aligned (8)));
+
struct cpt_obj_bits
{
__u64 cpt_len;
diff --git a/checkpoint/cpt_mm.c b/checkpoint/cpt_mm.c
new file mode 100644
index 0000000..8a22c48
--- /dev/null
+++ b/checkpoint/cpt_mm.c
@@ -0,0 +1,434 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Authors: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/errno.h>
+#include <linux/major.h>
+#include <linux/mman.h>
+#include <linux/mnt_namespace.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
+#include <asm/ldt.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+struct page_area
+{
+ int type;
+ unsigned long start;
+ unsigned long end;
+ pgoff_t pgoff;
+ loff_t mm;
+ __u64 list[16];
+};
+
+struct page_desc
+{
+ int type;
+ pgoff_t index;
+ loff_t mm;
+ int shared;
+};
+
+enum {
+ PD_ABSENT,
+ PD_COPY,
+ PD_FUNKEY,
+};
+
+/* 0: page can be obtained from backstore, or still not mapped anonymous page,
 or something else, which does not require copy.
+ 1: page requires copy
 2: page requires copy but its content is zero. Quite useless.
+ 3: wp page is shared after fork(). It is to be COWed when modified.
+ 4: page is something unsupported... We copy it right now.
+ */
+
+static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
+ struct page_desc *pdesc, cpt_context_t * ctx)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *ptep, pte;
+ spinlock_t *ptl;
+ struct page *pg = NULL;
+ pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
+
+ pdesc->index = linear_index;
+ pdesc->shared = 0;
+ pdesc->mm = CPT_NULL;
+
+ if (vma->vm_flags & VM_IO) {
+ pdesc->type = PD_ABSENT;
+ return;
+ }
+
+ pgd = pgd_offset(mm, addr);
+ if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+ goto out_absent;
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+ goto out_absent;
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+ goto out_absent;
+#ifdef CONFIG_X86
+ if (pmd_huge(*pmd)) {
+ eprintk("page_huge\n");
+ goto out_unsupported;
+ }
+#endif
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ pte = *ptep;
+ pte_unmap(ptep);
+
+ if (pte_none(pte))
+ goto out_absent_unlock;
+
+ if ((pg = vm_normal_page(vma, addr, pte)) == NULL) {
+ pdesc->type = PD_COPY;
+ goto out_unlock;
+ }
+
+ get_page(pg);
+ spin_unlock(ptl);
+
+ if (pg->mapping && !PageAnon(pg)) {
+ if (vma->vm_file == NULL) {
+ eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr);
+ goto out_unsupported;
+ }
+ if (vma->vm_file->f_mapping != pg->mapping) {
+ eprintk("pg->mapping!=f_mapping: %08lx %p %p\n",
+ addr, vma->vm_file->f_mapping, pg->mapping);
+ goto out_unsupported;
+ }
+ pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+ /* Page is in backstore. For us it is like
+ * it is not present.
+ */
+ goto out_absent;
+ }
+
+ if (PageReserved(pg)) {
+ /* Special case: ZERO_PAGE is used, when an
+ * anonymous page is accessed but not written. */
+ if (pg == ZERO_PAGE(addr)) {
+ if (pte_write(pte)) {
+ eprintk("not funny already, writable ZERO_PAGE\n");
+ goto out_unsupported;
+ }
+ /* Just copy it for now */
+ pdesc->type = PD_COPY;
+ goto out_put;
+ }
+ eprintk("reserved page %lu at %08lx\n", pg->index, addr);
+ goto out_unsupported;
+ }
+
+ if (!pg->mapping) {
+ eprintk("page without mapping at %08lx\n", addr);
+ goto out_unsupported;
+ }
+
+ pdesc->type = PD_COPY;
+
+out_put:
+ if (pg)
+ put_page(pg);
+ return;
+
+out_unlock:
+ spin_unlock(ptl);
+ goto out_put;
+
+out_absent_unlock:
+ spin_unlock(ptl);
+
+out_absent:
+ pdesc->type = PD_ABSENT;
+ goto out_put;
+
+out_unsupported:
+ pdesc->type = PD_FUNKEY;
+ goto out_put;
+}
+
+static int count_vma_pages(struct vm_area_struct *vma, struct cpt_context *ctx)
+{
+ unsigned long addr;
+ int page_num = 0;
+
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ struct page_desc pd;
+
+ page_get_desc(vma, addr, &pd, ctx);
+
+ if (pd.type != PD_COPY) {
+ return -EINVAL;
+ } else {
+ page_num += 1;
+ }
+
+ }
+ return page_num;
+}
+
+/* ATTN: We give "current" to get_user_pages(). This is wrong, but get_user_pages()
+ * does not really need this thing. It just stores some page fault stats there.
+ *
+ * BUG: some archs (f.e. sparc64, but not Intel*) require flushing cache pages
+ * before accessing vma.
+ */
+static int dump_pages(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct cpt_context *ctx)
+{
+#define MAX_PAGE_BATCH 16
+ struct page *pg[MAX_PAGE_BATCH];
+ int npages = (end - start)/PAGE_SIZE;
+ int count = 0;
+
+ while (count < npages) {
+ int copy = npages - count;
+ int n;
+
+ if (copy > MAX_PAGE_BATCH)
+ copy = MAX_PAGE_BATCH;
+ n = get_user_pages(current, vma->vm_mm, start, copy,
+ 0, 1, pg, NULL);
+ if (n == copy) {
+ int i;
+ for (i=0; i<n; i++) {
+ char *maddr = kmap(pg[i]);
+ ctx->write(maddr, PAGE_SIZE, ctx);
+ kunmap(pg[i]);
+ }
+ } else {
+ eprintk("get_user_pages fault");
+ for ( ; n > 0; n--)
+ page_cache_release(pg[n-1]);
+ return -EFAULT;
+ }
+ start += n*PAGE_SIZE;
+ count += n;
+ for ( ; n > 0; n--)
+ page_cache_release(pg[n-1]);
+ }
+ return 0;
+}
+
+static int dump_page_block(struct vm_area_struct *vma,
+ struct cpt_page_block *pgb,
+ struct cpt_context *ctx)
+{
+ int err;
+ pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start;
+ pgb->cpt_type = CPT_OBJ_PAGES;
+ pgb->cpt_hdrlen = sizeof(*pgb);
+ pgb->cpt_content = CPT_CONTENT_DATA;
+
+ err = ctx->write(pgb, sizeof(*pgb), ctx);
+ if (!err)
+ err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx);
+
+ return err;
+}
+
+static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx)
+{
+ int len;
+ char *path;
+ char *buf;
+ struct cpt_object_hdr o;
+
+ buf = (char *)__get_free_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ path = d_path(p, buf, PAGE_SIZE);
+
+ if (IS_ERR(path)) {
+ free_page((unsigned long)buf);
+ return PTR_ERR(path);
+ }
+
+ len = buf + PAGE_SIZE - 1 - path;
+ o.cpt_len = sizeof(o) + len + 1;
+ o.cpt_type = CPT_OBJ_NAME;
+ o.cpt_hdrlen = sizeof(o);
+ o.cpt_content = CPT_CONTENT_NAME;
+ path[len] = 0;
+
+ ctx->write(&o, sizeof(o), ctx);
+ ctx->write(path, len + 1, ctx);
+ free_page((unsigned long)buf);
+
+ return 0;
+}
+
+static int dump_one_vma(struct mm_struct *mm,
+ struct vm_area_struct *vma, struct cpt_context *ctx)
+{
+ struct cpt_vma_image *v;
+ unsigned long addr;
+ int page_num;
+ int err;
+
+ v = kzalloc(sizeof(*v), GFP_KERNEL);
+ if (!v)
+ return -ENOMEM;
+
+ v->cpt_len = sizeof(*v);
+ v->cpt_type = CPT_OBJ_VMA;
+ v->cpt_hdrlen = sizeof(*v);
+ v->cpt_content = CPT_CONTENT_ARRAY;
+
+ v->cpt_start = vma->vm_start;
+ v->cpt_end = vma->vm_end;
+ v->cpt_flags = vma->vm_flags;
+ if (vma->vm_flags & VM_HUGETLB) {
+ eprintk("huge TLB VMAs are still not supported\n");
+ kfree(v);
+ return -EINVAL;
+ }
+ v->cpt_pgprot = vma->vm_page_prot.pgprot;
+ v->cpt_pgoff = vma->vm_pgoff;
+ v->cpt_file = CPT_NULL;
+ v->cpt_vma_type = CPT_VMA_TYPE_0;
+
+ page_num = count_vma_pages(vma, ctx);
+ if (page_num < 0) {
+ kfree(v);
+ return -EINVAL;
+ }
+ v->cpt_page_num = page_num;
+
+ if (vma->vm_file) {
+ v->cpt_file = 0;
+ v->cpt_vma_type = CPT_VMA_FILE;
+ }
+
+ ctx->write(v, sizeof(*v), ctx);
+ kfree(v);
+
+ if (vma->vm_file) {
+ err = cpt_dump_dentry(&vma->vm_file->f_path, ctx);
+ if (err < 0)
+ return err;
+ }
+
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ struct page_desc pd;
+ struct cpt_page_block pgb;
+
+ page_get_desc(vma, addr, &pd, ctx);
+
+ if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) {
+ eprintk("dump_one_vma: funkey page\n");
+ return -EINVAL;
+ }
+
+ pgb.cpt_start = addr;
+ pgb.cpt_end = addr + PAGE_SIZE;
+ dump_page_block(vma, &pgb, ctx);
+ }
+
+ return 0;
+}
+
+static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx)
+{
+#ifdef CONFIG_X86
+ if (mm->context.size) {
+ struct cpt_obj_bits b;
+ int size;
+
+ mutex_lock(&mm->context.lock);
+
+ b.cpt_type = CPT_OBJ_BITS;
+ b.cpt_len = sizeof(b);
+ b.cpt_content = CPT_CONTENT_MM_CONTEXT;
+ b.cpt_size = mm->context.size * LDT_ENTRY_SIZE;
+
+ ctx->write(&b, sizeof(b), ctx);
+
+ size = mm->context.size * LDT_ENTRY_SIZE;
+
+ ctx->write(mm->context.ldt, size, ctx);
+
+ mutex_unlock(&mm->context.lock);
+ }
+#endif
+ return 0;
+}
+
+int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx)
+{
+ struct mm_struct *mm = tsk->mm;
+ struct cpt_mm_image *v;
+ struct vm_area_struct *vma;
+ int err;
+
+ v = kzalloc(sizeof(*v), GFP_KERNEL);
+ if (!v)
+ return -ENOMEM;
+
+ v->cpt_len = sizeof(*v);
+ v->cpt_type = CPT_OBJ_MM;
+ v->cpt_hdrlen = sizeof(*v);
+ v->cpt_content = CPT_CONTENT_ARRAY;
+
+ down_read(&mm->mmap_sem);
+ v->cpt_start_code = mm->start_code;
+ v->cpt_end_code = mm->end_code;
+ v->cpt_start_data = mm->start_data;
+ v->cpt_end_data = mm->end_data;
+ v->cpt_start_brk = mm->start_brk;
+ v->cpt_brk = mm->brk;
+ v->cpt_start_stack = mm->start_stack;
+ v->cpt_start_arg = mm->arg_start;
+ v->cpt_end_arg = mm->arg_end;
+ v->cpt_start_env = mm->env_start;
+ v->cpt_end_env = mm->env_end;
+ v->cpt_def_flags = mm->def_flags;
+ v->cpt_flags = mm->flags;
+ v->cpt_map_count = mm->map_count;
+
+ err = ctx->write(v, sizeof(*v), ctx);
+ kfree(v);
+
+ if (err) {
+ eprintk("error during writing mm\n");
+ goto err_up;
+ }
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if ((err = dump_one_vma(mm, vma, ctx)) != 0)
+ goto err_up;
+ }
+
+ err = cpt_dump_mm_context(mm, ctx);
+
+err_up:
+ up_read(&mm->mmap_sem);
+
+ return err;
+}
+
diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
index 58f608d..1f7a54b 100644
--- a/checkpoint/cpt_process.c
+++ b/checkpoint/cpt_process.c
@@ -225,12 +225,12 @@ int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)

err = cpt_dump_task_struct(tsk, ctx);

- /* Dump task mm */
-
if (!err)
- cpt_dump_fpustate(tsk, ctx);
+ err = cpt_dump_mm(tsk, ctx);
+ if (!err)
+ err = cpt_dump_fpustate(tsk, ctx);
if (!err)
- cpt_dump_registers(tsk, ctx);
+ err = cpt_dump_registers(tsk, ctx);

return err;
}
diff --git a/mm/memory.c b/mm/memory.c
index 1002f47..479a294 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -481,6 +481,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
out:
return pfn_to_page(pfn);
}
+EXPORT_SYMBOL(vm_normal_page);

/*
* copy one vm_area from one task to the other. Assumes the page tables
--
1.5.6

2008-10-17 23:13:29

by Andrey Mirkin

Subject: [PATCH 05/10] Introduce function to dump process

Functions to dump the task struct, FPU state and registers are added.
All IDs are saved from the point of view of the process (container) namespace.
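
A short sketch of what "from the point of view of the container namespace"
means; task_pid_nr_ns() is what the dump code below actually uses, while the
helper around it is hypothetical:

/* Sketch: the same task has different PID values depending on the pid
 * namespace the number is resolved in; the dump stores the container view. */
static void show_pid_views(struct task_struct *tsk, struct pid_namespace *ct_ns)
{
        pid_t host_pid = task_pid_nr(tsk);          /* as seen by the host */
        pid_t ct_pid = task_pid_nr_ns(tsk, ct_ns);  /* as seen in the container */

        dprintk("task %s: host pid %d, container pid %d\n",
                tsk->comm, host_pid, ct_pid);
}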

Signed-off-by: Andrey Mirkin <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.c | 2 +-
checkpoint/checkpoint.h | 1 +
checkpoint/cpt_image.h | 123 ++++++++++++++++++++++++
checkpoint/cpt_process.c | 236 ++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 362 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/cpt_process.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 173346b..457cc96 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o

obj-$(CONFIG_CHECKPOINT) += cptrst.o

-cptrst-objs := sys.o checkpoint.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index c4bddce..aae198d 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -70,7 +70,7 @@ int dump_container(struct cpt_context *ctx)

/* Dump task here */
if (!err)
- err = -ENOSYS;
+ err = cpt_dump_task(root, ctx);

out:
ctx->nsproxy = NULL;
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 6926aa2..9e46b10 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -60,3 +60,4 @@ extern int debug_level;
#define dprintk(a...) cpt_printk(1, "CPT DBG: " a)

int dump_container(struct cpt_context *ctx);
+int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
index 0338dd0..cddfe37 100644
--- a/checkpoint/cpt_image.h
+++ b/checkpoint/cpt_image.h
@@ -13,6 +13,9 @@
#ifndef __CPT_IMAGE_H_
#define __CPT_IMAGE_H_ 1

+#include <linux/sched.h>
+#include <asm/segment.h>
+
enum _cpt_object_type
{
CPT_OBJ_TASK = 0,
@@ -20,6 +23,8 @@ enum _cpt_object_type
/* The objects above are stored in memory while checkpointing */

CPT_OBJ_HEAD = 1024,
+ CPT_OBJ_X86_REGS,
+ CPT_OBJ_BITS,
};

enum _cpt_content_type {
@@ -28,6 +33,8 @@ enum _cpt_content_type {
CPT_CONTENT_DATA,
CPT_CONTENT_NAME,
CPT_CONTENT_REF,
+ CPT_CONTENT_X86_FPUSTATE,
+ CPT_CONTENT_X86_FPUSTATE_OLD,
CPT_CONTENT_MAX
};

@@ -60,4 +67,120 @@ struct cpt_object_hdr
__u16 cpt_content; /* Content type: array, reference... */
} __attribute__ ((aligned (8)));

+struct cpt_task_image {
+ __u64 cpt_len;
+ __u32 cpt_hdrlen;
+ __u16 cpt_type;
+ __u16 cpt_content;
+
+ __u64 cpt_state;
+ __u64 cpt_flags;
+#define CPT_PF_EXITING 0
+#define CPT_PF_FORKNOEXEC 1
+#define CPT_PF_SUPERPRIV 2
+#define CPT_PF_DUMPCORE 3
+#define CPT_PF_SIGNALED 4
+#define CPT_PF_USED_MATH 5
+
+ __u64 cpt_thrflags;
+ __u64 cpt_thrstatus;
+ __u32 cpt_pid;
+ __u32 cpt_tgid;
+ __u32 cpt_ppid;
+ __u32 cpt_rppid;
+ __u32 cpt_pgrp;
+ __u32 cpt_session;
+ __u32 cpt_old_pgrp;
+ __u32 cpt_leader;
+ __u64 cpt_set_tid;
+ __u64 cpt_clear_tid;
+ __u32 cpt_exit_code;
+ __u32 cpt_exit_signal;
+ __u32 cpt_pdeath_signal;
+ __u32 cpt_user;
+ __u32 cpt_uid;
+ __u32 cpt_euid;
+ __u32 cpt_suid;
+ __u32 cpt_fsuid;
+ __u32 cpt_gid;
+ __u32 cpt_egid;
+ __u32 cpt_sgid;
+ __u32 cpt_fsgid;
+ __u8 cpt_comm[TASK_COMM_LEN];
+ __u64 cpt_tls[GDT_ENTRY_TLS_ENTRIES];
+ __u64 cpt_utime;
+ __u64 cpt_stime;
+ __u64 cpt_utimescaled;
+ __u64 cpt_stimescaled;
+ __u64 cpt_gtime;
+ __u64 cpt_prev_utime;
+ __u64 cpt_prev_stime;
+ __u64 cpt_start_time;
+ __u64 cpt_real_start_time;
+ __u64 cpt_nvcsw;
+ __u64 cpt_nivcsw;
+ __u64 cpt_min_flt;
+ __u64 cpt_maj_flt;
+} __attribute__ ((aligned (8)));
+
+struct cpt_obj_bits
+{
+ __u64 cpt_len;
+ __u32 cpt_hdrlen;
+ __u16 cpt_type;
+ __u16 cpt_content;
+
+ __u32 cpt_size;
+ __u32 __cpt_pad1;
+} __attribute__ ((aligned (8)));
+
+#define CPT_SEG_ZERO 0
+#define CPT_SEG_TLS1 1
+#define CPT_SEG_TLS2 2
+#define CPT_SEG_TLS3 3
+#define CPT_SEG_USER32_DS 4
+#define CPT_SEG_USER32_CS 5
+#define CPT_SEG_USER64_DS 6
+#define CPT_SEG_USER64_CS 7
+#define CPT_SEG_LDT 256
+
+struct cpt_x86_regs
+{
+ __u64 cpt_len;
+ __u32 cpt_hdrlen;
+ __u16 cpt_type;
+ __u16 cpt_content;
+
+ __u32 cpt_debugreg[8];
+ __u32 cpt_gs;
+
+ __u32 cpt_bx;
+ __u32 cpt_cx;
+ __u32 cpt_dx;
+ __u32 cpt_si;
+ __u32 cpt_di;
+ __u32 cpt_bp;
+ __u32 cpt_ax;
+ __u32 cpt_ds;
+ __u32 cpt_es;
+ __u32 cpt_fs;
+ __u32 cpt_orig_ax;
+ __u32 cpt_ip;
+ __u32 cpt_cs;
+ __u32 cpt_flags;
+ __u32 cpt_sp;
+ __u32 cpt_ss;
+} __attribute__ ((aligned (8)));
+
+static inline __u64 cpt_timespec_export(struct timespec *tv)
+{
+ return (((u64)tv->tv_sec) << 32) + tv->tv_nsec;
+}
+
+static inline void cpt_timespec_import(struct timespec *tv, __u64 val)
+{
+ tv->tv_sec = val >> 32;
+ tv->tv_nsec = (val & 0xFFFFFFFF);
+}
+
#endif /* __CPT_IMAGE_H_ */
diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
new file mode 100644
index 0000000..58f608d
--- /dev/null
+++ b/checkpoint/cpt_process.c
@@ -0,0 +1,236 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+#include <linux/nsproxy.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+static unsigned int encode_task_flags(unsigned int task_flags)
+{
+ unsigned int flags = 0;
+
+ if (task_flags & PF_EXITING)
+ flags |= (1 << CPT_PF_EXITING);
+ if (task_flags & PF_FORKNOEXEC)
+ flags |= (1 << CPT_PF_FORKNOEXEC);
+ if (task_flags & PF_SUPERPRIV)
+ flags |= (1 << CPT_PF_SUPERPRIV);
+ if (task_flags & PF_DUMPCORE)
+ flags |= (1 << CPT_PF_DUMPCORE);
+ if (task_flags & PF_SIGNALED)
+ flags |= (1 << CPT_PF_SIGNALED);
+ if (task_flags & PF_USED_MATH)
+ flags |= (1 << CPT_PF_USED_MATH);
+
+ return flags;
+
+}
+
+int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
+{
+ struct cpt_task_image *t;
+ int i;
+ int err;
+
+ t = kzalloc(sizeof(*t), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ t->cpt_len = sizeof(*t);
+ t->cpt_type = CPT_OBJ_TASK;
+ t->cpt_hdrlen = sizeof(*t);
+ t->cpt_content = CPT_CONTENT_ARRAY;
+
+ t->cpt_state = tsk->state;
+ t->cpt_flags = encode_task_flags(tsk->flags);
+ t->cpt_exit_code = tsk->exit_code;
+ t->cpt_exit_signal = tsk->exit_signal;
+ t->cpt_pdeath_signal = tsk->pdeath_signal;
+ t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns);
+ t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns);
+ t->cpt_ppid = tsk->parent ?
+ task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0;
+ t->cpt_rppid = tsk->real_parent ?
+ task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0;
+ t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns);
+ t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns);
+ t->cpt_old_pgrp = 0;
+ if (tsk->signal->tty_old_pgrp)
+ t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp);
+ t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) : 0;
+ t->cpt_utime = tsk->utime;
+ t->cpt_stime = tsk->stime;
+ t->cpt_utimescaled = tsk->utimescaled;
+ t->cpt_stimescaled = tsk->stimescaled;
+ t->cpt_gtime = tsk->gtime;
+ t->cpt_prev_utime = tsk->prev_utime;
+ t->cpt_prev_stime = tsk->prev_stime;
+ t->cpt_nvcsw = tsk->nvcsw;
+ t->cpt_nivcsw = tsk->nivcsw;
+ t->cpt_start_time = cpt_timespec_export(&tsk->start_time);
+ t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time);
+ t->cpt_min_flt = tsk->min_flt;
+ t->cpt_maj_flt = tsk->maj_flt;
+ memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN);
+ for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+ t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) +
+ tsk->thread.tls_array[i].a;
+ }
+ /* TODO: encode thread flags and status like task flags */
+ t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<<TIF_FREEZE);
+ t->cpt_thrstatus = task_thread_info(tsk)->status;
+ t->cpt_user = tsk->user->uid;
+ t->cpt_uid = tsk->uid;
+ t->cpt_euid = tsk->euid;
+ t->cpt_suid = tsk->suid;
+ t->cpt_fsuid = tsk->fsuid;
+ t->cpt_gid = tsk->gid;
+ t->cpt_egid = tsk->egid;
+ t->cpt_sgid = tsk->sgid;
+ t->cpt_fsgid = tsk->fsgid;
+
+ err = ctx->write(t, sizeof(*t), ctx);
+
+ kfree(t);
+ return err;
+}
+
+static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context *ctx)
+{
+ struct cpt_obj_bits hdr;
+ int err;
+ int content;
+ unsigned long size;
+
+ content = CPT_CONTENT_X86_FPUSTATE;
+ size = sizeof(struct i387_fxsave_struct);
+#ifndef CONFIG_X86_64
+ if (!cpu_has_fxsr) {
+ size = sizeof(struct i387_fsave_struct);
+ content = CPT_CONTENT_X86_FPUSTATE_OLD;
+ }
+#endif
+
+ hdr.cpt_len = sizeof(hdr) + size;
+ hdr.cpt_type = CPT_OBJ_BITS;
+ hdr.cpt_hdrlen = sizeof(hdr);
+ hdr.cpt_content = content;
+ hdr.cpt_size = size;
+ err = ctx->write(&hdr, sizeof(hdr), ctx);
+ if (!err)
+ ctx->write(tsk->thread.xstate, size, ctx);
+ return err;
+}
+
+static u32 encode_segment(u32 segreg)
+{
+ segreg &= 0xFFFF;
+
+ if (segreg == 0)
+ return CPT_SEG_ZERO;
+ if ((segreg & 3) != 3) {
+ eprintk("Invalid RPL of a segment reg %x\n", segreg);
+ return CPT_SEG_ZERO;
+ }
+
+ /* LDT descriptor, it is just an index to LDT array */
+ if (segreg & 4)
+ return CPT_SEG_LDT + (segreg >> 3);
+
+ /* TLS descriptor. */
+ if ((segreg >> 3) >= GDT_ENTRY_TLS_MIN &&
+ (segreg >> 3) <= GDT_ENTRY_TLS_MAX)
+ return CPT_SEG_TLS1 + ((segreg>>3) - GDT_ENTRY_TLS_MIN);
+
+ /* One of the standard descriptors */
+#ifdef CONFIG_X86_64
+ if (segreg == __USER32_DS)
+ return CPT_SEG_USER32_DS;
+ if (segreg == __USER32_CS)
+ return CPT_SEG_USER32_CS;
+ if (segreg == __USER_DS)
+ return CPT_SEG_USER64_DS;
+ if (segreg == __USER_CS)
+ return CPT_SEG_USER64_CS;
+#else
+ if (segreg == __USER_DS)
+ return CPT_SEG_USER32_DS;
+ if (segreg == __USER_CS)
+ return CPT_SEG_USER32_CS;
+#endif
+ eprintk("Invalid segment reg %x\n", segreg);
+ return CPT_SEG_ZERO;
+}
+
+static int cpt_dump_registers(struct task_struct *tsk, struct cpt_context *ctx)
+{
+ struct cpt_x86_regs ri;
+ struct pt_regs *pt_regs;
+
+ ri.cpt_len = sizeof(ri);
+ ri.cpt_type = CPT_OBJ_X86_REGS;
+ ri.cpt_hdrlen = sizeof(ri);
+ ri.cpt_content = CPT_CONTENT_VOID;
+
+ ri.cpt_debugreg[0] = tsk->thread.debugreg0;
+ ri.cpt_debugreg[1] = tsk->thread.debugreg1;
+ ri.cpt_debugreg[2] = tsk->thread.debugreg2;
+ ri.cpt_debugreg[3] = tsk->thread.debugreg3;
+ ri.cpt_debugreg[4] = 0;
+ ri.cpt_debugreg[5] = 0;
+ ri.cpt_debugreg[6] = tsk->thread.debugreg6;
+ ri.cpt_debugreg[7] = tsk->thread.debugreg7;
+
+ pt_regs = task_pt_regs(tsk);
+
+ ri.cpt_fs = encode_segment(pt_regs->fs);
+ ri.cpt_gs = encode_segment(tsk->thread.gs);
+
+ ri.cpt_bx = pt_regs->bx;
+ ri.cpt_cx = pt_regs->cx;
+ ri.cpt_dx = pt_regs->dx;
+ ri.cpt_si = pt_regs->si;
+ ri.cpt_di = pt_regs->di;
+ ri.cpt_bp = pt_regs->bp;
+ ri.cpt_ax = pt_regs->ax;
+ ri.cpt_ds = encode_segment(pt_regs->ds);
+ ri.cpt_es = encode_segment(pt_regs->es);
+ ri.cpt_orig_ax = pt_regs->orig_ax;
+ ri.cpt_ip = pt_regs->ip;
+ ri.cpt_cs = encode_segment(pt_regs->cs);
+ ri.cpt_flags = pt_regs->flags;
+ ri.cpt_sp = pt_regs->sp;
+ ri.cpt_ss = encode_segment(pt_regs->ss);
+
+ return ctx->write(&ri, sizeof(ri), ctx);
+}
+
+int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
+{
+ int err;
+
+ err = cpt_dump_task_struct(tsk, ctx);
+
+ /* Dump task mm */
+
+ if (!err)
+ cpt_dump_fpustate(tsk, ctx);
+ if (!err)
+ cpt_dump_registers(tsk, ctx);
+
+ return err;
+}
--
1.5.6

2008-10-17 23:13:08

by Andrey Mirkin

Subject: [PATCH 04/10] Introduce container dump function

Actually, right now we are going to dump only one process.
A function for dumping the head of the image file is added.
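
For completeness, a hedged userspace sketch of checking the signature bytes
that cpt_write_head() below places at the start of the image; the magic
values come from cpt_image.h (patch 03) and the file path is hypothetical:

/* Sketch: verify the four signature bytes at the start of a dump file. */
#include <stdio.h>

int main(void)
{
        unsigned char sig[4];
        FILE *f = fopen("/tmp/ct.img", "rb");   /* hypothetical image path */

        if (!f || fread(sig, 1, 4, f) != 4)
                return 1;
        fclose(f);
        return (sig[0] == 0x79 && sig[1] == 0x1c &&
                sig[2] == 0x01 && sig[3] == 0x63) ? 0 : 1;
}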

Signed-off-by: Andrey Mirkin <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/checkpoint.h | 3 ++
checkpoint/sys.c | 3 +-
kernel/fork.c | 2 +
5 files changed, 87 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/checkpoint.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index bfe75d5..173346b 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o

obj-$(CONFIG_CHECKPOINT) += cptrst.o

-cptrst-objs := sys.o
+cptrst-objs := sys.o checkpoint.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..c4bddce
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,79 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+#include <linux/nsproxy.h>
+
+#include "checkpoint.h"
+
+static int cpt_write_head(struct cpt_context *ctx)
+{
+ struct cpt_head hdr;
+
+ memset(&hdr, 0, sizeof(hdr));
+ hdr.cpt_signature[0] = CPT_SIGNATURE0;
+ hdr.cpt_signature[1] = CPT_SIGNATURE1;
+ hdr.cpt_signature[2] = CPT_SIGNATURE2;
+ hdr.cpt_signature[3] = CPT_SIGNATURE3;
+ hdr.cpt_hdrlen = sizeof(hdr);
+ hdr.cpt_image_major = (LINUX_VERSION_CODE >> 16) & 0xff;
+ hdr.cpt_image_minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+ hdr.cpt_image_sublevel = (LINUX_VERSION_CODE) & 0xff;
+ hdr.cpt_image_extra = 0;
+#if defined(CONFIG_X86_32)
+ hdr.cpt_arch = CPT_ARCH_I386;
+#else
+#error Arch is not supported
+#endif
+ return ctx->write(&hdr, sizeof(hdr), ctx);
+}
+
+int dump_container(struct cpt_context *ctx)
+{
+ int err;
+ struct task_struct *root;
+
+ read_lock(&tasklist_lock);
+ root = find_task_by_vpid(ctx->pid);
+ if (root)
+ get_task_struct(root);
+ read_unlock(&tasklist_lock);
+
+ err = -ESRCH;
+ if (!root) {
+ eprintk("can not find root task\n");
+ return err;
+ }
+ rcu_read_lock();
+ ctx->nsproxy = task_nsproxy(root);
+ if (!ctx->nsproxy) {
+ eprintk("nsproxy is null\n");
+ rcu_read_unlock();
+ goto out;
+ }
+ get_nsproxy(ctx->nsproxy);
+ rcu_read_unlock();
+
+ err = cpt_write_head(ctx);
+
+ /* Dump task here */
+ if (!err)
+ err = -ENOSYS;
+
+out:
+ ctx->nsproxy = NULL;
+ put_task_struct(root);
+ return err;
+}
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 8ea73f5..6926aa2 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -36,6 +36,7 @@ typedef struct cpt_context
int refcount;
int ctx_state;
struct semaphore main_sem;
+ struct nsproxy *nsproxy;

int errno;

@@ -57,3 +58,5 @@ extern int debug_level;

#define eprintk(a...) cpt_printk(1, "CPT ERR: " a)
#define dprintk(a...) cpt_printk(1, "CPT DBG: " a)
+
+int dump_container(struct cpt_context *ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index a561a06..1902fef 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -107,9 +107,10 @@ static int checkpoint(pid_t pid, int fd, unsigned long flags)

ctx->file = file;
ctx->ctx_state = CPT_CTX_DUMPING;
+ ctx->pid = pid;

/* checkpoint */
- err = -ENOSYS;
+ err = dump_container(ctx);

context_put(ctx);

diff --git a/kernel/fork.c b/kernel/fork.c
index 52b5037..f38b43d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -77,6 +77,7 @@ int max_threads; /* tunable limit on nr_threads */
DEFINE_PER_CPU(unsigned long, process_counts) = 0;

__cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */
+EXPORT_SYMBOL(tasklist_lock);

int nr_processes(void)
{
@@ -153,6 +154,7 @@ void __put_task_struct(struct task_struct *tsk)
if (!profile_handoff_task(tsk))
free_task(tsk);
}
+EXPORT_SYMBOL(__put_task_struct);

/*
* macro override instead of weak attribute alias, to workaround
--
1.5.6

2008-10-17 23:13:57

by Andrey Mirkin

Subject: [PATCH 07/10] Introduce function for restarting a container

Actually, right now this function will restart only one process.
A function to read the head of the dump file is introduced.

Signed-off-by: Andrey Mirkin <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.h | 1 +
checkpoint/restart.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/sys.c | 2 +-
4 files changed, 90 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/restart.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index bbb0e37..47c7852 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@ obj-y += sys_core.o

obj-$(CONFIG_CHECKPOINT) += cptrst.o

-cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index e3e6b66..0608bb9 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -62,3 +62,4 @@ extern int debug_level;
int dump_container(struct cpt_context *ctx);
int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
+int restart_container(struct cpt_context *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..acfcadb
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,87 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx)
+{
+ int err;
+ struct cpt_object_hdr *hdr = tmp;
+ err = ctx->read(hdr, sizeof(struct cpt_object_hdr), ctx);
+ if (err)
+ return err;
+ if (type > 0 && type != hdr->cpt_type)
+ return -EINVAL;
+ if (hdr->cpt_hdrlen < sizeof(struct cpt_object_hdr))
+ return -EINVAL;
+ if (size < sizeof(struct cpt_object_hdr))
+ return -EINVAL;
+ if (hdr->cpt_len < hdr->cpt_hdrlen)
+ return -EINVAL;
+ if (size > hdr->cpt_hdrlen)
+ size = hdr->cpt_hdrlen;
+ if (size > sizeof(*hdr))
+ err = ctx->read(hdr + 1, size - sizeof(*hdr), ctx);
+ return err;
+}
+
+static int rst_read_head(struct cpt_context *ctx)
+{
+ struct cpt_head hdr;
+ int err;
+
+ err = -EBADF;
+ if (!ctx->file)
+ return err;
+
+ err = ctx->read(&hdr, sizeof(hdr), ctx);
+ if (err < 0)
+ return err;
+
+ if (hdr.cpt_signature[0] != CPT_SIGNATURE0 ||
+ hdr.cpt_signature[1] != CPT_SIGNATURE1 ||
+ hdr.cpt_signature[2] != CPT_SIGNATURE2 ||
+ hdr.cpt_signature[3] != CPT_SIGNATURE3) {
+ return -EINVAL;
+ }
+ if (KERNEL_VERSION(hdr.cpt_image_major, hdr.cpt_image_minor,
+ hdr.cpt_image_sublevel) != LINUX_VERSION_CODE)
+ return -EINVAL;
+
+#if defined(CONFIG_X86_32)
+ if (hdr.cpt_arch != CPT_ARCH_I386)
+ return -ENOSYS;
+#else
+#error Arch is not supported
+#endif
+
+ return 0;
+}
+
+int restart_container(struct cpt_context *ctx)
+{
+ int err;
+
+ err = rst_read_head(ctx);
+
+ /* Restart process */
+ if (!err)
+ err = -ENOSYS;
+
+ return err;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 1902fef..b92312a 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -140,7 +140,7 @@ static int restart(int ctid, int fd, unsigned long flags)
ctx->ctx_state = CPT_CTX_UNDUMPING;

/* restart */
- err = -ENOSYS;
+ err = restart_container(ctx);

context_put(ctx);

--
1.5.6

2008-10-17 23:14:44

by Andrey Mirkin

Subject: [PATCH 09/10] Introduce functions to restore mm

Functions to restore the mm, VMAs and mm context are added.
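
One detail worth calling out from the code below: mappings that must end up
read-only are temporarily made writable so their contents can be filled in,
then flipped back with mprotect. A minimal userspace analogy of that pattern
(the anonymous mapping here is purely illustrative):

/* Sketch: populate a mapping that should end up read-only by keeping it
 * writable only for the duration of the fill, as rst_restore_one_vma()
 * does with sys_mprotect(). */
#include <string.h>
#include <sys/mman.h>

static void *restore_ro_area(const void *saved, size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        memcpy(p, saved, len);          /* fill while writable */
        mprotect(p, len, PROT_READ);    /* drop write permission again */
        return p;
}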

Signed-off-by: Andrey Mirkin <[email protected]>
---
checkpoint/Makefile | 2 +-
checkpoint/checkpoint.h | 1 +
checkpoint/cpt_image.h | 5 +
checkpoint/rst_mm.c | 320 ++++++++++++++++++++++++++++++++++++++++++++++
checkpoint/rst_process.c | 3 +-
mm/mmap.c | 1 +
mm/mprotect.c | 2 +
7 files changed, 332 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/rst_mm.c

diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 689a0eb..19ca732 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@ obj-y += sys_core.o
obj-$(CONFIG_CHECKPOINT) += cptrst.o

cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o \
- rst_process.o
+ rst_process.o rst_mm.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 1d0ca49..195fdc6 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -65,3 +65,4 @@ int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
int restart_container(struct cpt_context *ctx);
int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx);
int rst_restart_process(struct cpt_context *ctx);
+int rst_restore_mm(struct cpt_context *ctx);
diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
index 160cf85..e1fb483 100644
--- a/checkpoint/cpt_image.h
+++ b/checkpoint/cpt_image.h
@@ -233,6 +233,11 @@ struct cpt_x86_regs
__u32 cpt_ss;
} __attribute__ ((aligned (8)));

+static inline void __user * cpt_ptr_import(__u64 ptr)
+{
+ return (void*)(unsigned long)ptr;
+}
+
static inline __u64 cpt_timespec_export(struct timespec *tv)
{
return (((u64)tv->tv_sec) << 32) + tv->tv_nsec;
diff --git a/checkpoint/rst_mm.c b/checkpoint/rst_mm.c
new file mode 100644
index 0000000..fe53c45
--- /dev/null
+++ b/checkpoint/rst_mm.c
@@ -0,0 +1,320 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/vmalloc.h>
+#include <linux/syscalls.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+static unsigned long make_prot(struct cpt_vma_image *vmai)
+{
+ unsigned long prot = 0;
+
+ if (vmai->cpt_flags & VM_READ)
+ prot |= PROT_READ;
+ if (vmai->cpt_flags & VM_WRITE)
+ prot |= PROT_WRITE;
+ if (vmai->cpt_flags & VM_EXEC)
+ prot |= PROT_EXEC;
+ if (vmai->cpt_flags & VM_GROWSDOWN)
+ prot |= PROT_GROWSDOWN;
+ if (vmai->cpt_flags & VM_GROWSUP)
+ prot |= PROT_GROWSUP;
+ return prot;
+}
+
+static unsigned long make_flags(struct cpt_vma_image *vmai)
+{
+ unsigned long flags = MAP_FIXED;
+
+ if (vmai->cpt_flags&(VM_SHARED|VM_MAYSHARE))
+ flags |= MAP_SHARED;
+ else
+ flags |= MAP_PRIVATE;
+
+ if (vmai->cpt_file == CPT_NULL)
+ flags |= MAP_ANONYMOUS;
+ if (vmai->cpt_flags & VM_GROWSDOWN)
+ flags |= MAP_GROWSDOWN;
+#ifdef MAP_GROWSUP
+ if (vmai->cpt_flags & VM_GROWSUP)
+ flags |= MAP_GROWSUP;
+#endif
+ if (vmai->cpt_flags & VM_DENYWRITE)
+ flags |= MAP_DENYWRITE;
+ if (vmai->cpt_flags & VM_EXECUTABLE)
+ flags |= MAP_EXECUTABLE;
+ if (!(vmai->cpt_flags & VM_ACCOUNT))
+ flags |= MAP_NORESERVE;
+ return flags;
+}
+
+static int rst_restore_one_vma(struct cpt_context *ctx)
+{
+ int err;
+ int i;
+ unsigned long addr;
+ struct mm_struct *mm = current->mm;
+ struct cpt_vma_image vmai;
+ struct vm_area_struct *vma;
+ struct file *file = NULL;
+ unsigned long prot;
+
+ err = rst_get_object(CPT_OBJ_VMA, &vmai, sizeof(vmai), ctx);
+ if (err)
+ return err;
+
+ prot = make_prot(&vmai);
+
+ if (vmai.cpt_vma_type == CPT_VMA_FILE) {
+ struct cpt_object_hdr h;
+ int len;
+ char *path;
+
+ err = rst_get_object(CPT_OBJ_NAME, &h, sizeof(h), ctx);
+ if (err)
+ goto out;
+ len = h.cpt_len - sizeof(h);
+ if (len < 0) {
+ err = -EINVAL;
+ goto out;
+ }
+ path = kmalloc(len, GFP_KERNEL);
+ if (!path) {
+ err = -ENOMEM;
+ goto out;
+ }
+ err = ctx->read(path, len, ctx);
+ if (err) {
+ kfree(path);
+ goto out;
+ }
+
+ /* Just open file
+ TODO: open with correct flags */
+ file = filp_open(path, O_RDONLY, 0);
+ kfree(path);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ goto out;
+ }
+ }
+
+ down_write(&mm->mmap_sem);
+ addr = do_mmap_pgoff(file, vmai.cpt_start,
+ vmai.cpt_end - vmai.cpt_start,
+ prot, make_flags(&vmai),
+ vmai.cpt_pgoff);
+
+ if (addr != vmai.cpt_start) {
+ up_write(&mm->mmap_sem);
+
+ err = -EINVAL;
+ if (IS_ERR((void*)addr))
+ err = addr;
+ goto out;
+ }
+
+ vma = find_vma(mm, vmai.cpt_start);
+ if (vma == NULL) {
+ up_write(&mm->mmap_sem);
+ eprintk("cannot find mmapped vma\n");
+ err = -ESRCH;
+ goto out;
+ }
+
+ /* do_mmap_pgoff() can merge new area to previous one (not to the next,
+ * we mmap in order, the rest of mm is still unmapped). This can happen
+ * f.e. if flags are to be adjusted later, or if we had different
+ * anon_vma on two adjacent regions. Split it by brute force. */
+ if (vma->vm_start != vmai.cpt_start) {
+ err = split_vma(mm, vma, (unsigned long)vmai.cpt_start, 0);
+ if (err) {
+ up_write(&mm->mmap_sem);
+ eprintk("cannot split vma\n");
+ goto out;
+ }
+ }
+ up_write(&mm->mmap_sem);
+
+ for (i = 0; i < vmai.cpt_page_num; i++) {
+ struct cpt_page_block pb;
+
+ err = rst_get_object(CPT_OBJ_PAGES, &pb, sizeof(pb), ctx);
+ if (err)
+ goto out;
+ if (!(vmai.cpt_flags & VM_ACCOUNT) && !(prot & PROT_WRITE)) {
+ /* I guess this is get_user_pages() messed things,
+ * this happens f.e. when gdb inserts breakpoints.
+ */
+ int j;
+ for (j = 0; j < (pb.cpt_end-pb.cpt_start)/PAGE_SIZE; j++) {
+ struct page *page;
+ void *maddr;
+ err = get_user_pages(current, current->mm,
+ (unsigned long)pb.cpt_start +
+ j * PAGE_SIZE,
+ 1, 1, 1, &page, NULL);
+ if (err == 0)
+ err = -EFAULT;
+ if (err < 0) {
+ eprintk("get_user_pages: %d\n", err);
+ goto out;
+ }
+ err = 0;
+ maddr = kmap(page);
+ if (pb.cpt_content == CPT_CONTENT_VOID) {
+ memset(maddr, 0, PAGE_SIZE);
+ } else if (pb.cpt_content == CPT_CONTENT_DATA) {
+ err = ctx->read(maddr, PAGE_SIZE, ctx);
+ if (err) {
+ kunmap(page);
+ goto out;
+ }
+ } else {
+ err = -EINVAL;
+ kunmap(page);
+ goto out;
+ }
+ set_page_dirty_lock(page);
+ kunmap(page);
+ page_cache_release(page);
+ }
+ } else {
+ if (!(prot & PROT_WRITE))
+ sys_mprotect(vmai.cpt_start,
+ vmai.cpt_end - vmai.cpt_start,
+ prot | PROT_WRITE);
+ if (pb.cpt_content == CPT_CONTENT_VOID) {
+ int j;
+ for (j=0; j<(pb.cpt_end-pb.cpt_start)/sizeof(unsigned long); j++) {
+ err = __put_user(0UL, ((unsigned long __user*)(unsigned long)pb.cpt_start) + j);
+ if (err) {
+ eprintk("__put_user 2 %d\n", err);
+ goto out;
+ }
+ }
+ } else if (pb.cpt_content == CPT_CONTENT_DATA) {
+ err = ctx->read(cpt_ptr_import(pb.cpt_start),
+ pb.cpt_end - pb.cpt_start,
+ ctx);
+ if (err)
+ goto out;
+ } else {
+ err = -EINVAL;
+ goto out;
+ }
+ if (!(prot & PROT_WRITE))
+ sys_mprotect(vmai.cpt_start,
+ vmai.cpt_end - vmai.cpt_start,
+ prot);
+ }
+ }
+
+out:
+ if (file)
+ fput(file);
+ return err;
+}
+
+static int rst_restore_mm_context(struct cpt_context *ctx)
+{
+ struct cpt_obj_bits b;
+ struct mm_struct *mm = current->mm;
+ int oldsize = mm->context.size;
+ int err;
+ void *oldldt;
+ void *newldt;
+
+ err = rst_get_object(CPT_OBJ_BITS, &b, sizeof(b), ctx);
+ if (err)
+ return err;
+
+ if (b.cpt_size > PAGE_SIZE)
+ newldt = vmalloc(b.cpt_size);
+ else
+ newldt = kmalloc(b.cpt_size, GFP_KERNEL);
+
+ if (!newldt)
+ return -ENOMEM;
+
+ err = ctx->read(newldt, b.cpt_size, ctx);
+ if (err)
+ return err;
+
+ oldldt = mm->context.ldt;
+ mm->context.ldt = newldt;
+ mm->context.size = b.cpt_size / LDT_ENTRY_SIZE;
+
+ load_LDT(&mm->context);
+
+ if (oldsize) {
+ if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE)
+ vfree(oldldt);
+ else
+ kfree(oldldt);
+ }
+
+ return 0;
+}
+
+int rst_restore_mm(struct cpt_context *ctx)
+{
+ int err;
+ int i;
+ struct mm_struct *mm = current->mm;
+ struct cpt_mm_image m;
+
+ err = rst_get_object(CPT_OBJ_MM, &m, sizeof(m), ctx);
+ if (err)
+ return err;
+
+ down_write(&mm->mmap_sem);
+ do_munmap(mm, 0, TASK_SIZE);
+
+ mm->start_code = m.cpt_start_code;
+ mm->end_code = m.cpt_end_code;
+ mm->start_data = m.cpt_start_data;
+ mm->end_data = m.cpt_end_data;
+ mm->start_brk = m.cpt_start_brk;
+ mm->brk = m.cpt_brk;
+ mm->start_stack = m.cpt_start_stack;
+ mm->arg_start = m.cpt_start_arg;
+ mm->arg_end = m.cpt_end_arg;
+ mm->env_start = m.cpt_start_env;
+ mm->env_end = m.cpt_end_env;
+ mm->def_flags = m.cpt_def_flags;
+ mm->flags = m.cpt_flags;
+
+ up_write(&mm->mmap_sem);
+
+ for (i = 0; i < m.cpt_map_count; i++) {
+ err = rst_restore_one_vma(ctx);
+ if (err < 0)
+ goto out;
+ }
+
+ err = rst_restore_mm_context(ctx);
+out:
+ return err;
+}
+
diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
index b9f745e..9e448b2 100644
--- a/checkpoint/rst_process.c
+++ b/checkpoint/rst_process.c
@@ -210,7 +210,8 @@ static int restart_thread(void *arg)
err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
if (!err)
err = rst_restore_task_struct(current, ti, ctx);
- /* Restore mm here */
+ if (!err)
+ err = rst_restore_mm(ctx);
if (!err)
err = rst_restore_fpustate(current, ti, ctx);
if (!err)
diff --git a/mm/mmap.c b/mm/mmap.c
index 971d0ed..98d1ba9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1858,6 +1858,7 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,

return 0;
}
+EXPORT_SYMBOL(split_vma);

/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
diff --git a/mm/mprotect.c b/mm/mprotect.c
index fded06f..47c7d75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -22,6 +22,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>
+#include <linux/module.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -317,3 +318,4 @@ out:
up_write(&current->mm->mmap_sem);
return error;
}
+EXPORT_SYMBOL(sys_mprotect);
--
1.5.6

2008-10-17 23:14:27

by Andrey Mirkin

[permalink] [raw]
Subject: [PATCH 08/10] Introduce functions to restart a process

Functions to restart a process and restore its state, FPU state and registers are added.

Signed-off-by: Andrey Mirkin <[email protected]>
---
arch/x86/kernel/entry_32.S | 21 +++
arch/x86/kernel/process_32.c | 3 +
checkpoint/Makefile | 3 +-
checkpoint/checkpoint.h | 2 +
checkpoint/restart.c | 2 +-
checkpoint/rst_process.c | 277 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched.c | 1 +
7 files changed, 307 insertions(+), 2 deletions(-)
create mode 100644 checkpoint/rst_process.c

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 109792b..a4848a3 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
GET_THREAD_INFO(%ebp)
popl %eax
CFI_ADJUST_CFA_OFFSET -4
+ret_from_fork_tail:
pushl $0x0202 # Reset kernel eflags
CFI_ADJUST_CFA_OFFSET 4
popfl
@@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
CFI_ENDPROC
END(ret_from_fork)

+ENTRY(i386_ret_from_resume)
+ CFI_STARTPROC
+ pushl %eax
+ CFI_ADJUST_CFA_OFFSET 4
+ call schedule_tail
+ GET_THREAD_INFO(%ebp)
+ popl %eax
+ CFI_ADJUST_CFA_OFFSET -4
+ movl (%esp), %eax
+ testl %eax, %eax
+ jz 1f
+ pushl %esp
+ call *%eax
+ addl $4, %esp
+1:
+ addl $256, %esp
+ jmp ret_from_fork_tail
+ CFI_ENDPROC
+END(i386_ret_from_resume)
+
/*
* Return to user mode is not as complex as all this looks,
* but we want the default path for a system call return to
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4711eed..1bdec02 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -58,6 +58,9 @@
#include <asm/kdebug.h>

asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
+EXPORT_SYMBOL(ret_from_fork);
+asmlinkage void i386_ret_from_resume(void) __asm__("i386_ret_from_resume");
+EXPORT_SYMBOL(i386_ret_from_resume);

DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
EXPORT_PER_CPU_SYMBOL(current_task);
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 47c7852..689a0eb 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@ obj-y += sys_core.o

obj-$(CONFIG_CHECKPOINT) += cptrst.o

-cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o
+cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o \
+ rst_process.o
diff --git a/checkpoint/checkpoint.h b/checkpoint/checkpoint.h
index 0608bb9..1d0ca49 100644
--- a/checkpoint/checkpoint.h
+++ b/checkpoint/checkpoint.h
@@ -63,3 +63,5 @@ int dump_container(struct cpt_context *ctx);
int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx);
int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx);
int restart_container(struct cpt_context *ctx);
+int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx);
+int rst_restart_process(struct cpt_context *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index acfcadb..62cef28 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -81,7 +81,7 @@ int restart_container(struct cpt_context *ctx)

/* Restart process */
if (!err)
- err = -ENOSYS;
+ err = rst_restart_process(ctx);

return err;
}
diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
new file mode 100644
index 0000000..b9f745e
--- /dev/null
+++ b/checkpoint/rst_process.c
@@ -0,0 +1,277 @@
+/*
+ * Copyright (C) 2008 Parallels, Inc.
+ *
+ * Author: Andrey Mirkin <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/version.h>
+#include <linux/module.h>
+
+#include "checkpoint.h"
+#include "cpt_image.h"
+
+#define HOOK_RESERVE 256
+
+struct thr_context {
+ struct completion complete;
+ int error;
+ struct cpt_context *ctx;
+ struct task_struct *tsk;
+};
+
+int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid)
+{
+ pid_t ret;
+
+ if (current->fs == NULL) {
+ /* do_fork_pid() hates processes without fs, oopses. */
+ eprintk("local_kernel_thread: current->fs==NULL\n");
+ return -EINVAL;
+ }
+ if (!try_module_get(THIS_MODULE))
+ return -EBUSY;
+ ret = kernel_thread(fn, arg, flags);
+ if (ret < 0)
+ module_put(THIS_MODULE);
+ return ret;
+}
+
+static unsigned int decode_task_flags(unsigned int task_flags)
+{
+ unsigned int flags = 0;
+
+ if (task_flags & (1 << CPT_PF_EXITING))
+ flags |= PF_EXITING;
+ if (task_flags & (1 << CPT_PF_FORKNOEXEC))
+ flags |= PF_FORKNOEXEC;
+ if (task_flags & (1 << CPT_PF_SUPERPRIV))
+ flags |= PF_SUPERPRIV;
+ if (task_flags & (1 << CPT_PF_DUMPCORE))
+ flags |= PF_DUMPCORE;
+ if (task_flags & (1 << CPT_PF_SIGNALED))
+ flags |= PF_SIGNALED;
+
+ return flags;
+
+}
+
+int rst_restore_task_struct(struct task_struct *tsk, struct cpt_task_image *ti,
+ struct cpt_context *ctx)
+{
+ int i;
+
+ /* Restore only saved flags, comm and tls for now */
+ tsk->flags = decode_task_flags(ti->cpt_flags);
+ clear_tsk_thread_flag(tsk, TIF_FREEZE);
+ memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN);
+ for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+ tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF;
+ tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32;
+ }
+
+ return 0;
+}
+
+static int rst_restore_fpustate(struct task_struct *tsk, struct cpt_task_image *ti,
+ struct cpt_context *ctx)
+{
+ struct cpt_obj_bits hdr;
+ int err;
+ char *buf;
+
+ clear_stopped_child_used_math(tsk);
+
+ err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx);
+ if (err < 0)
+ return err;
+
+ buf = kmalloc(hdr.cpt_size, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ err = ctx->read(buf, hdr.cpt_size, ctx);
+ if (err)
+ goto out;
+
+ if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) {
+ memcpy(&tsk->thread.xstate, buf,
+ sizeof(struct i387_fxsave_struct));
+ if (ti->cpt_flags & CPT_PF_USED_MATH)
+ set_stopped_child_used_math(tsk);
+ }
+#ifndef CONFIG_X86_64
+ else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD &&
+ !cpu_has_fxsr) {
+ memcpy(&tsk->thread.xstate, buf,
+ sizeof(struct i387_fsave_struct));
+ if (ti->cpt_flags & CPT_PF_USED_MATH)
+ set_stopped_child_used_math(tsk);
+ }
+#endif
+
+out:
+ kfree(buf);
+ return err;
+}
+
+static u32 decode_segment(u32 segid)
+{
+ if (segid == CPT_SEG_ZERO)
+ return 0;
+
+ /* TLS descriptors */
+ if (segid <= CPT_SEG_TLS3)
+ return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3;
+
+ /* LDT descriptor, it is just an index to LDT array */
+ if (segid >= CPT_SEG_LDT)
+ return ((segid - CPT_SEG_LDT) << 3) | 7;
+
+ /* Check for one of standard descriptors */
+ if (segid == CPT_SEG_USER32_DS)
+ return __USER_DS;
+ if (segid == CPT_SEG_USER32_CS)
+ return __USER_CS;
+
+ eprintk("Invalid segment reg %d\n", segid);
+ return 0;
+}
+
+static int rst_restore_registers(struct task_struct *tsk, struct cpt_context *ctx)
+{
+ struct cpt_x86_regs ri;
+ struct pt_regs *regs = task_pt_regs(tsk);
+ extern char i386_ret_from_resume;
+ int err;
+
+ err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx);
+ if (err < 0)
+ return err;
+
+ tsk->thread.sp = (unsigned long) regs;
+ tsk->thread.sp0 = (unsigned long) (regs+1);
+ tsk->thread.ip = (unsigned long) &i386_ret_from_resume;
+
+ tsk->thread.gs = decode_segment(ri.cpt_gs);
+ tsk->thread.debugreg0 = ri.cpt_debugreg[0];
+ tsk->thread.debugreg1 = ri.cpt_debugreg[1];
+ tsk->thread.debugreg2 = ri.cpt_debugreg[2];
+ tsk->thread.debugreg3 = ri.cpt_debugreg[3];
+ tsk->thread.debugreg6 = ri.cpt_debugreg[6];
+ tsk->thread.debugreg7 = ri.cpt_debugreg[7];
+
+ regs->bx = ri.cpt_bx;
+ regs->cx = ri.cpt_cx;
+ regs->dx = ri.cpt_dx;
+ regs->si = ri.cpt_si;
+ regs->di = ri.cpt_di;
+ regs->bp = ri.cpt_bp;
+ regs->ax = ri.cpt_ax;
+ regs->orig_ax = ri.cpt_orig_ax;
+ regs->ip = ri.cpt_ip;
+ regs->flags = ri.cpt_flags;
+ regs->sp = ri.cpt_sp;
+
+ regs->cs = decode_segment(ri.cpt_cs);
+ regs->ss = decode_segment(ri.cpt_ss);
+ regs->ds = decode_segment(ri.cpt_ds);
+ regs->es = decode_segment(ri.cpt_es);
+ regs->fs = decode_segment(ri.cpt_fs);
+
+ tsk->thread.sp -= HOOK_RESERVE;
+ memset((void*)tsk->thread.sp, 0, HOOK_RESERVE);
+
+ return 0;
+}
+
+static int restart_thread(void *arg)
+{
+ struct thr_context *thr_ctx = arg;
+ struct cpt_context *ctx;
+ struct cpt_task_image *ti;
+ int err;
+
+ current->state = TASK_UNINTERRUPTIBLE;
+
+ ctx = thr_ctx->ctx;
+ ti = kmalloc(sizeof(*ti), GFP_KERNEL);
+ if (!ti)
+ return -ENOMEM;
+
+ err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
+ if (!err)
+ err = rst_restore_task_struct(current, ti, ctx);
+ /* Restore mm here */
+ if (!err)
+ err = rst_restore_fpustate(current, ti, ctx);
+ if (!err)
+ err = rst_restore_registers(current, ctx);
+
+ thr_ctx->error = err;
+ complete(&thr_ctx->complete);
+
+ if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) {
+ do_exit(ti->cpt_exit_code);
+ } else {
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ }
+
+ kfree(ti);
+ schedule();
+
+ eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm);
+
+ module_put(THIS_MODULE);
+ complete_and_exit(NULL, 0);
+ return 0;
+}
+static int create_root_task(struct cpt_context *ctx,
+ struct thr_context *thr_ctx)
+{
+ struct task_struct *tsk;
+ int pid;
+
+ thr_ctx->ctx = ctx;
+ thr_ctx->error = 0;
+ init_completion(&thr_ctx->complete);
+
+ /* We should also create container here */
+ pid = local_kernel_thread(restart_thread, thr_ctx,
+ CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
+ CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET, 0);
+ if (pid < 0)
+ return pid;
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_vpid(pid);
+ if (tsk)
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ if (tsk == NULL)
+ return -ESRCH;
+ thr_ctx->tsk = tsk;
+ return 0;
+}
+
+int rst_restart_process(struct cpt_context *ctx)
+{
+ struct thr_context thr_ctx_root;
+ int err;
+
+ err = create_root_task(ctx, &thr_ctx_root);
+ if (err)
+ return err;
+
+ wait_for_completion(&thr_ctx_root.complete);
+ wait_task_inactive(thr_ctx_root.tsk, 0);
+
+ return err;
+}
diff --git a/kernel/sched.c b/kernel/sched.c
index 04160d2..94a23e5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1970,6 +1970,7 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)

return ncsw;
}
+EXPORT_SYMBOL(wait_task_inactive);

/***
* kick_process - kick a running thread to enter/exit the kernel
--
1.5.6

2008-10-17 23:15:00

by Andrey Mirkin

[permalink] [raw]
Subject: [PATCH 10/10] Add support for multiple processes

The whole tree of processes can be checkpointed and restarted now.
Shared objects are not supported yet.

Signed-off-by: Andrey Mirkin <[email protected]>
---
checkpoint/cpt_image.h | 2 +
checkpoint/cpt_process.c | 24 +++++++++++++
checkpoint/rst_process.c | 85 +++++++++++++++++++++++++++-------------------
3 files changed, 76 insertions(+), 35 deletions(-)

diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
index e1fb483..f370df2 100644
--- a/checkpoint/cpt_image.h
+++ b/checkpoint/cpt_image.h
@@ -128,6 +128,8 @@ struct cpt_task_image {
__u64 cpt_nivcsw;
__u64 cpt_min_flt;
__u64 cpt_maj_flt;
+ __u32 cpt_children_num;
+ __u32 cpt_pad;
} __attribute__ ((aligned (8)));

struct cpt_mm_image {
diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
index 1f7a54b..d73ec3c 100644
--- a/checkpoint/cpt_process.c
+++ b/checkpoint/cpt_process.c
@@ -40,6 +40,19 @@ static unsigned int encode_task_flags(unsigned int task_flags)

}

+static int cpt_count_children(struct task_struct *tsk, struct cpt_context *ctx)
+{
+ int num = 0;
+ struct task_struct *child;
+
+ list_for_each_entry(child, &tsk->children, sibling) {
+ if (child->parent != tsk)
+ continue;
+ num++;
+ }
+ return num;
+}
+
int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
{
struct cpt_task_image *t;
@@ -102,6 +115,7 @@ int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
t->cpt_egid = tsk->egid;
t->cpt_sgid = tsk->sgid;
t->cpt_fsgid = tsk->fsgid;
+ t->cpt_children_num = cpt_count_children(tsk, ctx);

err = ctx->write(t, sizeof(*t), ctx);

@@ -231,6 +245,16 @@ int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
err = cpt_dump_fpustate(tsk, ctx);
if (!err)
err = cpt_dump_registers(tsk, ctx);
+ if (!err) {
+ struct task_struct *child;
+ list_for_each_entry(child, &tsk->children, sibling) {
+ if (child->parent != tsk)
+ continue;
+ err = cpt_dump_task(child, ctx);
+ if (err)
+ break;
+ }
+ }

return err;
}
diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
index 9e448b2..c088833 100644
--- a/checkpoint/rst_process.c
+++ b/checkpoint/rst_process.c
@@ -25,7 +25,7 @@ struct thr_context {
struct completion complete;
int error;
struct cpt_context *ctx;
- struct task_struct *tsk;
+ struct cpt_task_image *ti;
};

int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid)
@@ -199,17 +199,14 @@ static int restart_thread(void *arg)
struct cpt_context *ctx;
struct cpt_task_image *ti;
int err;
+ int i;

current->state = TASK_UNINTERRUPTIBLE;

ctx = thr_ctx->ctx;
- ti = kmalloc(sizeof(*ti), GFP_KERNEL);
- if (!ti)
- return -ENOMEM;
+ ti = thr_ctx->ti;

- err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
- if (!err)
- err = rst_restore_task_struct(current, ti, ctx);
+ err = rst_restore_task_struct(current, ti, ctx);
if (!err)
err = rst_restore_mm(ctx);
if (!err)
@@ -217,6 +214,12 @@ static int restart_thread(void *arg)
if (!err)
err = rst_restore_registers(current, ctx);

+ for (i = 0; i < ti->cpt_children_num; i++) {
+ err = rst_restart_process(ctx);
+ if (err)
+ break;
+ }
+
thr_ctx->error = err;
complete(&thr_ctx->complete);

@@ -226,7 +229,6 @@ static int restart_thread(void *arg)
__set_current_state(TASK_UNINTERRUPTIBLE);
}

- kfree(ti);
schedule();

eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm);
@@ -235,44 +237,57 @@ static int restart_thread(void *arg)
complete_and_exit(NULL, 0);
return 0;
}
-static int create_root_task(struct cpt_context *ctx,
- struct thr_context *thr_ctx)
+
+int rst_restart_process(struct cpt_context *ctx)
{
+ struct thr_context thr_ctx;
struct task_struct *tsk;
+ struct cpt_task_image *ti;
int pid;
+ int err;

- thr_ctx->ctx = ctx;
- thr_ctx->error = 0;
- init_completion(&thr_ctx->complete);
+ thr_ctx.ctx = ctx;
+ thr_ctx.error = 0;
+ init_completion(&thr_ctx.complete);

- /* We should also create container here */
- pid = local_kernel_thread(restart_thread, thr_ctx,
- CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET, 0);
- if (pid < 0)
- return pid;
+ ti = kmalloc(sizeof(*ti), GFP_KERNEL);
+ if (!ti)
+ return -ENOMEM;
+
+ err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
+ if (err)
+ goto err_free;
+ thr_ctx.ti = ti;
+
+ if (ti->cpt_pid == 1) {
+ /* We should also create container here */
+ pid = local_kernel_thread(restart_thread, &thr_ctx,
+ CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
+ CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET, 0);
+ } else {
+ /* We should fork here a child with the same pid and
+ correct flags */
+ pid = local_kernel_thread(restart_thread, &thr_ctx, 0, 0);
+ }
+ if (pid < 0) {
+ err = pid;
+ goto err_free;
+ }
read_lock(&tasklist_lock);
tsk = find_task_by_vpid(pid);
if (tsk)
get_task_struct(tsk);
read_unlock(&tasklist_lock);
- if (tsk == NULL)
- return -ESRCH;
- thr_ctx->tsk = tsk;
- return 0;
-}
-
-int rst_restart_process(struct cpt_context *ctx)
-{
- struct thr_context thr_ctx_root;
- int err;
-
- err = create_root_task(ctx, &thr_ctx_root);
- if (err)
- return err;
+ if (tsk == NULL) {
+ err = -ESRCH;
+ goto err_free;
+ }

- wait_for_completion(&thr_ctx_root.complete);
- wait_task_inactive(thr_ctx_root.tsk, 0);
+ wait_for_completion(&thr_ctx.complete);
+ wait_task_inactive(tsk, 0);
+ err = thr_ctx.error;

+err_free:
+ kfree(ti);
return err;
}
--
1.5.6

2008-10-20 09:25:06

by Cédric Le Goater

[permalink] [raw]
Subject: Re: [PATCH 08/10] Introduce functions to restart a process

Hello Andrey !


> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index 109792b..a4848a3 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> GET_THREAD_INFO(%ebp)
> popl %eax
> CFI_ADJUST_CFA_OFFSET -4
> +ret_from_fork_tail:
> pushl $0x0202 # Reset kernel eflags
> CFI_ADJUST_CFA_OFFSET 4
> popfl
> @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> CFI_ENDPROC
> END(ret_from_fork)
>
> +ENTRY(i386_ret_from_resume)
> + CFI_STARTPROC
> + pushl %eax
> + CFI_ADJUST_CFA_OFFSET 4
> + call schedule_tail
> + GET_THREAD_INFO(%ebp)
> + popl %eax
> + CFI_ADJUST_CFA_OFFSET -4
> + movl (%esp), %eax
> + testl %eax, %eax
> + jz 1f
> + pushl %esp
> + call *%eax
> + addl $4, %esp
> +1:
> + addl $256, %esp
> + jmp ret_from_fork_tail
> + CFI_ENDPROC
> +END(i386_ret_from_resume)

Could you explain why you need to do this:

	call *%eax

Is it related to the freezer code?

C.

2008-10-20 11:02:38

by Louis Rilling

[permalink] [raw]
Subject: Re: [PATCH 05/10] Introduce function to dump process

Hi,

On Sat, Oct 18, 2008 at 03:11:33AM +0400, Andrey Mirkin wrote:
> Functions to dump task struct, fpu state and registers are added.
> All IDs are saved from the POV of process (container) namespace.

Just a couple of little comments, in case this series should keep on living.

[...]

> diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
> new file mode 100644
> index 0000000..58f608d
> --- /dev/null
> +++ b/checkpoint/cpt_process.c
> @@ -0,0 +1,236 @@
> +/*
> + * Copyright (C) 2008 Parallels, Inc.
> + *
> + * Author: Andrey Mirkin <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation, version 2 of the
> + * License.
> + *
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/version.h>
> +#include <linux/nsproxy.h>
> +
> +#include "checkpoint.h"
> +#include "cpt_image.h"
> +
> +static unsigned int encode_task_flags(unsigned int task_flags)
> +{
> + unsigned int flags = 0;
> +
> + if (task_flags & PF_EXITING)
> + flags |= (1 << CPT_PF_EXITING);
> + if (task_flags & PF_FORKNOEXEC)
> + flags |= (1 << CPT_PF_FORKNOEXEC);
> + if (task_flags & PF_SUPERPRIV)
> + flags |= (1 << CPT_PF_SUPERPRIV);
> + if (task_flags & PF_DUMPCORE)
> + flags |= (1 << CPT_PF_DUMPCORE);
> + if (task_flags & PF_SIGNALED)
> + flags |= (1 << CPT_PF_SIGNALED);
> + if (task_flags & PF_USED_MATH)
> + flags |= (1 << CPT_PF_USED_MATH);
> +
> + return flags;
> +
> +}
> +
> +int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> + struct cpt_task_image *t;
> + int i;
> + int err;
> +
> + t = kzalloc(sizeof(*t), GFP_KERNEL);
> + if (!t)
> + return -ENOMEM;
> +
> + t->cpt_len = sizeof(*t);
> + t->cpt_type = CPT_OBJ_TASK;
> + t->cpt_hdrlen = sizeof(*t);
> + t->cpt_content = CPT_CONTENT_ARRAY;
> +
> + t->cpt_state = tsk->state;
> + t->cpt_flags = encode_task_flags(tsk->flags);
> + t->cpt_exit_code = tsk->exit_code;
> + t->cpt_exit_signal = tsk->exit_signal;
> + t->cpt_pdeath_signal = tsk->pdeath_signal;
> + t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns);
> + t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns);
> + t->cpt_ppid = tsk->parent ?
> + task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0;
> + t->cpt_rppid = tsk->real_parent ?
> + task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0;
> + t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns);
> + t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns);
> + t->cpt_old_pgrp = 0;
> + if (tsk->signal->tty_old_pgrp)
> + t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp);
> + t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) : 0;

Why pid_vnr() here, and task_*_nr_ns() above? According to the introducing
comment, I'd expect something like pid_nr_ns(tsk->signal->tty_old_pgrp,
tsk->nsproxy->pid_ns), and the same for tsk->group_leader.

IIUC, pid_vnr() is correct only if ctx->nsproxy->pid_ns == tsk->nsproxy->pid_ns
== current->nsproxy->pid_ns, and I expect current to live in a different pid_ns.

Comments?
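
For illustration, the consistent variant would look roughly like this
(just my sketch; whether ctx->nsproxy->pid_ns or tsk->nsproxy->pid_ns is
the right namespace here is exactly the open question):

	t->cpt_old_pgrp = 0;
	if (tsk->signal->tty_old_pgrp)
		t->cpt_old_pgrp = pid_nr_ns(tsk->signal->tty_old_pgrp,
					    ctx->nsproxy->pid_ns);
	t->cpt_leader = tsk->group_leader ?
		task_pid_nr_ns(tsk->group_leader, ctx->nsproxy->pid_ns) : 0;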

> + t->cpt_utime = tsk->utime;
> + t->cpt_stime = tsk->stime;
> + t->cpt_utimescaled = tsk->utimescaled;
> + t->cpt_stimescaled = tsk->stimescaled;
> + t->cpt_gtime = tsk->gtime;
> + t->cpt_prev_utime = tsk->prev_utime;
> + t->cpt_prev_stime = tsk->prev_stime;
> + t->cpt_nvcsw = tsk->nvcsw;
> + t->cpt_nivcsw = tsk->nivcsw;
> + t->cpt_start_time = cpt_timespec_export(&tsk->start_time);
> + t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time);
> + t->cpt_min_flt = tsk->min_flt;
> + t->cpt_maj_flt = tsk->maj_flt;
> + memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN);
> + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> + t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) +
> + tsk->thread.tls_array[i].a;
> + }
> + /* TODO: encode thread flags and status like task flags */
> + t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<<TIF_FREEZE);
> + t->cpt_thrstatus = task_thread_info(tsk)->status;
> + t->cpt_user = tsk->user->uid;
> + t->cpt_uid = tsk->uid;
> + t->cpt_euid = tsk->euid;
> + t->cpt_suid = tsk->suid;
> + t->cpt_fsuid = tsk->fsuid;
> + t->cpt_gid = tsk->gid;
> + t->cpt_egid = tsk->egid;
> + t->cpt_sgid = tsk->sgid;
> + t->cpt_fsgid = tsk->fsgid;
> +
> + err = ctx->write(t, sizeof(*t), ctx);
> +
> + kfree(t);
> + return err;
> +}
> +
> +static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> + struct cpt_obj_bits hdr;
> + int err;
> + int content;
> + unsigned long size;
> +
> + content = CPT_CONTENT_X86_FPUSTATE;
> + size = sizeof(struct i387_fxsave_struct);
> +#ifndef CONFIG_X86_64
> + if (!cpu_has_fxsr) {
> + size = sizeof(struct i387_fsave_struct);
> + content = CPT_CONTENT_X86_FPUSTATE_OLD;
> + }
> +#endif
> +
> + hdr.cpt_len = sizeof(hdr) + size;
> + hdr.cpt_type = CPT_OBJ_BITS;
> + hdr.cpt_hdrlen = sizeof(hdr);
> + hdr.cpt_content = content;
> + hdr.cpt_size = size;
> + err = ctx->write(&hdr, sizeof(hdr), ctx);
> + if (!err)
> + ctx->write(tsk->thread.xstate, size, ctx);

Should check the error code of the line above, right?
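
Something like this, say:

	err = ctx->write(&hdr, sizeof(hdr), ctx);
	if (!err)
		err = ctx->write(tsk->thread.xstate, size, ctx);
	return err;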

> + return err;
> +}
> +
> +static u32 encode_segment(u32 segreg)
> +{
> + segreg &= 0xFFFF;
> +
> + if (segreg == 0)
> + return CPT_SEG_ZERO;
> + if ((segreg & 3) != 3) {
> + eprintk("Invalid RPL of a segment reg %x\n", segreg);
> + return CPT_SEG_ZERO;
> + }
> +
> + /* LDT descriptor, it is just an index to LDT array */
> + if (segreg & 4)
> + return CPT_SEG_LDT + (segreg >> 3);
> +
> + /* TLS descriptor. */
> + if ((segreg >> 3) >= GDT_ENTRY_TLS_MIN &&
> + (segreg >> 3) <= GDT_ENTRY_TLS_MAX)
> + return CPT_SEG_TLS1 + ((segreg>>3) - GDT_ENTRY_TLS_MIN);
> +
> + /* One of standard desriptors */
> +#ifdef CONFIG_X86_64
> + if (segreg == __USER32_DS)
> + return CPT_SEG_USER32_DS;
> + if (segreg == __USER32_CS)
> + return CPT_SEG_USER32_CS;
> + if (segreg == __USER_DS)
> + return CPT_SEG_USER64_DS;
> + if (segreg == __USER_CS)
> + return CPT_SEG_USER64_CS;
> +#else
> + if (segreg == __USER_DS)
> + return CPT_SEG_USER32_DS;
> + if (segreg == __USER_CS)
> + return CPT_SEG_USER32_CS;
> +#endif
> + eprintk("Invalid segment reg %x\n", segreg);
> + return CPT_SEG_ZERO;
> +}
> +
> +static int cpt_dump_registers(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> + struct cpt_x86_regs ri;
> + struct pt_regs *pt_regs;
> +
> + ri.cpt_len = sizeof(ri);
> + ri.cpt_type = CPT_OBJ_X86_REGS;
> + ri.cpt_hdrlen = sizeof(ri);
> + ri.cpt_content = CPT_CONTENT_VOID;
> +
> + ri.cpt_debugreg[0] = tsk->thread.debugreg0;
> + ri.cpt_debugreg[1] = tsk->thread.debugreg1;
> + ri.cpt_debugreg[2] = tsk->thread.debugreg2;
> + ri.cpt_debugreg[3] = tsk->thread.debugreg3;
> + ri.cpt_debugreg[4] = 0;
> + ri.cpt_debugreg[5] = 0;
> + ri.cpt_debugreg[6] = tsk->thread.debugreg6;
> + ri.cpt_debugreg[7] = tsk->thread.debugreg7;
> +
> + pt_regs = task_pt_regs(tsk);
> +
> + ri.cpt_fs = encode_segment(pt_regs->fs);
> + ri.cpt_gs = encode_segment(tsk->thread.gs);
> +
> + ri.cpt_bx = pt_regs->bx;
> + ri.cpt_cx = pt_regs->cx;
> + ri.cpt_dx = pt_regs->dx;
> + ri.cpt_si = pt_regs->si;
> + ri.cpt_di = pt_regs->di;
> + ri.cpt_bp = pt_regs->bp;
> + ri.cpt_ax = pt_regs->ax;
> + ri.cpt_ds = encode_segment(pt_regs->ds);
> + ri.cpt_es = encode_segment(pt_regs->es);
> + ri.cpt_orig_ax = pt_regs->orig_ax;
> + ri.cpt_ip = pt_regs->ip;
> + ri.cpt_cs = encode_segment(pt_regs->cs);
> + ri.cpt_flags = pt_regs->flags;
> + ri.cpt_sp = pt_regs->sp;
> + ri.cpt_ss = encode_segment(pt_regs->ss);
> +
> + return ctx->write(&ri, sizeof(ri), ctx);
> +}
> +
> +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> + int err;
> +
> + err = cpt_dump_task_struct(tsk, ctx);
> +
> + /* Dump task mm */
> +
> + if (!err)
> + cpt_dump_fpustate(tsk, ctx);

error checking...

> + if (!err)
> + cpt_dump_registers(tsk, ctx);

error checking...

> +
> + return err;
> +}
> --
> 1.5.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2008-10-20 12:25:27

by Louis Rilling

[permalink] [raw]
Subject: Re: [PATCH 06/10] Introduce functions to dump mm

On Sat, Oct 18, 2008 at 03:11:34AM +0400, Andrey Mirkin wrote:
> Functions to dump mm struct, VMAs and mm context are added.

Again, a few little comments.

[...]

> diff --git a/checkpoint/cpt_mm.c b/checkpoint/cpt_mm.c
> new file mode 100644
> index 0000000..8a22c48
> --- /dev/null
> +++ b/checkpoint/cpt_mm.c
> @@ -0,0 +1,434 @@
> +/*
> + * Copyright (C) 2008 Parallels, Inc.
> + *
> + * Authors: Andrey Mirkin <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation, version 2 of the
> + * License.
> + *
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/file.h>
> +#include <linux/mm.h>
> +#include <linux/errno.h>
> +#include <linux/major.h>
> +#include <linux/mman.h>
> +#include <linux/mnt_namespace.h>
> +#include <linux/mount.h>
> +#include <linux/namei.h>
> +#include <linux/pagemap.h>
> +#include <linux/hugetlb.h>
> +#include <asm/ldt.h>
> +
> +#include "checkpoint.h"
> +#include "cpt_image.h"
> +
> +struct page_area
> +{
> + int type;
> + unsigned long start;
> + unsigned long end;
> + pgoff_t pgoff;
> + loff_t mm;
> + __u64 list[16];
> +};
> +
> +struct page_desc
> +{
> + int type;
> + pgoff_t index;
> + loff_t mm;
> + int shared;
> +};
> +
> +enum {
> + PD_ABSENT,
> + PD_COPY,
> + PD_FUNKEY,
> +};
> +
> +/* 0: page can be obtained from backstore, or still not mapped anonymous page,
> + or something else, which does not requre copy.
> + 1: page requires copy
> + 2: page requres copy but its content is zero. Quite useless.
> + 3: wp page is shared after fork(). It is to be COWed when modified.
> + 4: page is something unsupported... We copy it right now.
> + */
> +
> +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
> + struct page_desc *pdesc, cpt_context_t * ctx)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep, pte;
> + spinlock_t *ptl;
> + struct page *pg = NULL;
> + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
> +
> + pdesc->index = linear_index;
> + pdesc->shared = 0;
> + pdesc->mm = CPT_NULL;
> +
> + if (vma->vm_flags & VM_IO) {
> + pdesc->type = PD_ABSENT;
> + return;
> + }
> +
> + pgd = pgd_offset(mm, addr);
> + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> + goto out_absent;
> + pud = pud_offset(pgd, addr);
> + if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> + goto out_absent;
> + pmd = pmd_offset(pud, addr);
> + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> + goto out_absent;
> +#ifdef CONFIG_X86
> + if (pmd_huge(*pmd)) {
> + eprintk("page_huge\n");
> + goto out_unsupported;
> + }
> +#endif
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + pte = *ptep;
> + pte_unmap(ptep);
> +
> + if (pte_none(pte))
> + goto out_absent_unlock;
> +
> + if ((pg = vm_normal_page(vma, addr, pte)) == NULL) {
> + pdesc->type = PD_COPY;
> + goto out_unlock;
> + }
> +
> + get_page(pg);
> + spin_unlock(ptl);
> +
> + if (pg->mapping && !PageAnon(pg)) {
> + if (vma->vm_file == NULL) {
> + eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr);
> + goto out_unsupported;
> + }
> + if (vma->vm_file->f_mapping != pg->mapping) {
> + eprintk("pg->mapping!=f_mapping: %08lx %p %p\n",
> + addr, vma->vm_file->f_mapping, pg->mapping);
> + goto out_unsupported;
> + }
> + pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
> + /* Page is in backstore. For us it is like
> + * it is not present.
> + */
> + goto out_absent;
> + }
> +
> + if (PageReserved(pg)) {
> + /* Special case: ZERO_PAGE is used, when an
> + * anonymous page is accessed but not written. */
> + if (pg == ZERO_PAGE(addr)) {
> + if (pte_write(pte)) {
> + eprintk("not funny already, writable ZERO_PAGE\n");
> + goto out_unsupported;
> + }
> + /* Just copy it for now */
> + pdesc->type = PD_COPY;
> + goto out_put;
> + }
> + eprintk("reserved page %lu at %08lx\n", pg->index, addr);
> + goto out_unsupported;
> + }
> +
> + if (!pg->mapping) {
> + eprintk("page without mapping at %08lx\n", addr);
> + goto out_unsupported;
> + }
> +
> + pdesc->type = PD_COPY;
> +
> +out_put:
> + if (pg)
> + put_page(pg);
> + return;
> +
> +out_unlock:
> + spin_unlock(ptl);
> + goto out_put;
> +
> +out_absent_unlock:
> + spin_unlock(ptl);
> +
> +out_absent:
> + pdesc->type = PD_ABSENT;
> + goto out_put;
> +
> +out_unsupported:
> + pdesc->type = PD_FUNKEY;
> + goto out_put;
> +}
> +
> +static int count_vma_pages(struct vm_area_struct *vma, struct cpt_context *ctx)
> +{
> + unsigned long addr;
> + int page_num = 0;
> +
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> + struct page_desc pd;
> +
> + page_get_desc(vma, addr, &pd, ctx);
> +
> + if (pd.type != PD_COPY) {
> + return -EINVAL;
> + } else {
> + page_num += 1;
> + }
> +
> + }
> + return page_num;
> +}
> +
> +/* ATTN: We give "current" to get_user_pages(). This is wrong, but get_user_pages()
> + * does not really need this thing. It just stores some page fault stats there.
> + *
> + * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache pages
> + * before accessing vma.
> + */
> +static int dump_pages(struct vm_area_struct *vma, unsigned long start,
> + unsigned long end, struct cpt_context *ctx)
> +{
> +#define MAX_PAGE_BATCH 16
> + struct page *pg[MAX_PAGE_BATCH];
> + int npages = (end - start)/PAGE_SIZE;
> + int count = 0;
> +
> + while (count < npages) {
> + int copy = npages - count;
> + int n;
> +
> + if (copy > MAX_PAGE_BATCH)
> + copy = MAX_PAGE_BATCH;
> + n = get_user_pages(current, vma->vm_mm, start, copy,
> + 0, 1, pg, NULL);
> + if (n == copy) {
> + int i;
> + for (i=0; i<n; i++) {
> + char *maddr = kmap(pg[i]);
> + ctx->write(maddr, PAGE_SIZE, ctx);
> + kunmap(pg[i]);

There is no error handling in this inner loop. Should be fixed imho.
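
Something along these lines, for instance (only a sketch, assuming
ctx->write() returns 0 or a negative error like elsewhere):

	int i, err;

	for (i = 0; i < n; i++) {
		char *maddr = kmap(pg[i]);

		err = ctx->write(maddr, PAGE_SIZE, ctx);
		kunmap(pg[i]);
		if (err) {
			/* drop the references taken by get_user_pages() */
			for ( ; n > 0; n--)
				page_cache_release(pg[n-1]);
			return err;
		}
	}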

> + }
> + } else {
> + eprintk("get_user_pages fault");
> + for ( ; n > 0; n--)
> + page_cache_release(pg[n-1]);
> + return -EFAULT;
> + }
> + start += n*PAGE_SIZE;
> + count += n;
> + for ( ; n > 0; n--)
> + page_cache_release(pg[n-1]);
> + }
> + return 0;
> +}
> +
> +static int dump_page_block(struct vm_area_struct *vma,
> + struct cpt_page_block *pgb,
> + struct cpt_context *ctx)
> +{
> + int err;
> + pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start;
> + pgb->cpt_type = CPT_OBJ_PAGES;
> + pgb->cpt_hdrlen = sizeof(*pgb);
> + pgb->cpt_content = CPT_CONTENT_DATA;
> +
> + err = ctx->write(pgb, sizeof(*pgb), ctx);
> + if (!err)
> + err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx);
> +
> + return err;
> +}
> +
> +static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx)
> +{
> + int len;
> + char *path;
> + char *buf;
> + struct cpt_object_hdr o;
> +
> + buf = (char *)__get_free_page(GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + path = d_path(p, buf, PAGE_SIZE);
> +
> + if (IS_ERR(path)) {
> + free_page((unsigned long)buf);
> + return PTR_ERR(path);
> + }
> +
> + len = buf + PAGE_SIZE - 1 - path;
> + o.cpt_len = sizeof(o) + len + 1;
> + o.cpt_type = CPT_OBJ_NAME;
> + o.cpt_hdrlen = sizeof(o);
> + o.cpt_content = CPT_CONTENT_NAME;
> + path[len] = 0;
> +
> + ctx->write(&o, sizeof(o), ctx);
> + ctx->write(path, len + 1, ctx);

Error handling?

> + free_page((unsigned long)buf);
> +
> + return 0;
> +}
> +
> +static int dump_one_vma(struct mm_struct *mm,
> + struct vm_area_struct *vma, struct cpt_context *ctx)
> +{
> + struct cpt_vma_image *v;
> + unsigned long addr;
> + int page_num;
> + int err;
> +
> + v = kzalloc(sizeof(*v), GFP_KERNEL);
> + if (!v)
> + return -ENOMEM;
> +
> + v->cpt_len = sizeof(*v);
> + v->cpt_type = CPT_OBJ_VMA;
> + v->cpt_hdrlen = sizeof(*v);
> + v->cpt_content = CPT_CONTENT_ARRAY;
> +
> + v->cpt_start = vma->vm_start;
> + v->cpt_end = vma->vm_end;
> + v->cpt_flags = vma->vm_flags;
> + if (vma->vm_flags & VM_HUGETLB) {
> + eprintk("huge TLB VMAs are still not supported\n");
> + kfree(v);
> + return -EINVAL;
> + }
> + v->cpt_pgprot = vma->vm_page_prot.pgprot;
> + v->cpt_pgoff = vma->vm_pgoff;
> + v->cpt_file = CPT_NULL;
> + v->cpt_vma_type = CPT_VMA_TYPE_0;
> +
> + page_num = count_vma_pages(vma, ctx);
> + if (page_num < 0) {
> + kfree(v);
> + return -EINVAL;
> + }

AFAICS, since count_vma_pages() only supports pages with PD_COPY, and since
page_get_desc() tags text segment pages (file-mapped and not anonymous since
not written to) as PD_ABSENT, no executable is checkpointable. So, where is
the trick? Am I completely missing something about page mapping?

> + v->cpt_page_num = page_num;
> +
> + if (vma->vm_file) {
> + v->cpt_file = 0;
> + v->cpt_vma_type = CPT_VMA_FILE;
> + }
> +
> + ctx->write(v, sizeof(*v), ctx);

Error handling?

> + kfree(v);
> +
> + if (vma->vm_file) {
> + err = cpt_dump_dentry(&vma->vm_file->f_path, ctx);
> + if (err < 0)
> + return err;
> + }
> +
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> + struct page_desc pd;
> + struct cpt_page_block pgb;
> +
> + page_get_desc(vma, addr, &pd, ctx);
> +
> + if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) {
> + eprintk("dump_one_vma: funkey page\n");
> + return -EINVAL;
> + }
> +
> + pgb.cpt_start = addr;
> + pgb.cpt_end = addr + PAGE_SIZE;
> + dump_page_block(vma, &pgb, ctx);

Error handling?

> + }
> +
> + return 0;
> +}
> +
> +static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx)
> +{
> +#ifdef CONFIG_X86
> + if (mm->context.size) {
> + struct cpt_obj_bits b;
> + int size;
> +
> + mutex_lock(&mm->context.lock);
> +
> + b.cpt_type = CPT_OBJ_BITS;
> + b.cpt_len = sizeof(b);
> + b.cpt_content = CPT_CONTENT_MM_CONTEXT;
> + b.cpt_size = mm->context.size * LDT_ENTRY_SIZE;
> +
> + ctx->write(&b, sizeof(b), ctx);
> +
> + size = mm->context.size * LDT_ENTRY_SIZE;
> +
> + ctx->write(mm->context.ldt, size, ctx);

Error handling?

> +
> + mutex_unlock(&mm->context.lock);
> + }
> +#endif
> + return 0;
> +}
> +
> +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> + struct mm_struct *mm = tsk->mm;
> + struct cpt_mm_image *v;
> + struct vm_area_struct *vma;
> + int err;
> +
> + v = kzalloc(sizeof(*v), GFP_KERNEL);
> + if (!v)
> + return -ENOMEM;
> +
> + v->cpt_len = sizeof(*v);
> + v->cpt_type = CPT_OBJ_MM;
> + v->cpt_hdrlen = sizeof(*v);
> + v->cpt_content = CPT_CONTENT_ARRAY;
> +
> + down_read(&mm->mmap_sem);
> + v->cpt_start_code = mm->start_code;
> + v->cpt_end_code = mm->end_code;
> + v->cpt_start_data = mm->start_data;
> + v->cpt_end_data = mm->end_data;
> + v->cpt_start_brk = mm->start_brk;
> + v->cpt_brk = mm->brk;
> + v->cpt_start_stack = mm->start_stack;
> + v->cpt_start_arg = mm->arg_start;
> + v->cpt_end_arg = mm->arg_end;
> + v->cpt_start_env = mm->env_start;
> + v->cpt_end_env = mm->env_end;
> + v->cpt_def_flags = mm->def_flags;
> + v->cpt_flags = mm->flags;
> + v->cpt_map_count = mm->map_count;
> +
> + err = ctx->write(v, sizeof(*v), ctx);
> + kfree(v);
> +
> + if (err) {
> + eprintk("error during writing mm\n");
> + goto err_up;
> + }
> +
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + if ((err = dump_one_vma(mm, vma, ctx)) != 0)
> + goto err_up;
> + }
> +
> + err = cpt_dump_mm_context(mm, ctx);
> +
> +err_up:
> + up_read(&mm->mmap_sem);
> +
> + return err;
> +}
> +

[...]

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2008-10-20 13:25:47

by Louis Rilling

[permalink] [raw]
Subject: Re: [PATCH 08/10] Introduce functions to restart a process

On Sat, Oct 18, 2008 at 03:11:36AM +0400, Andrey Mirkin wrote:
> Functions to restart process, restore its state, fpu and registers are added.

[...]

> diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
> new file mode 100644
> index 0000000..b9f745e
> --- /dev/null
> +++ b/checkpoint/rst_process.c
> @@ -0,0 +1,277 @@
> +/*
> + * Copyright (C) 2008 Parallels, Inc.
> + *
> + * Author: Andrey Mirkin <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation, version 2 of the
> + * License.
> + *
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/version.h>
> +#include <linux/module.h>
> +
> +#include "checkpoint.h"
> +#include "cpt_image.h"
> +
> +#define HOOK_RESERVE 256
> +
> +struct thr_context {
> + struct completion complete;
> + int error;
> + struct cpt_context *ctx;
> + struct task_struct *tsk;
> +};
> +
> +int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid)
> +{
> + pid_t ret;
> +
> + if (current->fs == NULL) {
> + /* do_fork_pid() hates processes without fs, oopses. */
> + eprintk("local_kernel_thread: current->fs==NULL\n");
> + return -EINVAL;
> + }
> + if (!try_module_get(THIS_MODULE))
> + return -EBUSY;
> + ret = kernel_thread(fn, arg, flags);
> + if (ret < 0)
> + module_put(THIS_MODULE);
> + return ret;
> +}
> +
> +static unsigned int decode_task_flags(unsigned int task_flags)
> +{
> + unsigned int flags = 0;
> +
> + if (task_flags & (1 << CPT_PF_EXITING))
> + flags |= PF_EXITING;
> + if (task_flags & (1 << CPT_PF_FORKNOEXEC))
> + flags |= PF_FORKNOEXEC;
> + if (task_flags & (1 << CPT_PF_SUPERPRIV))
> + flags |= PF_SUPERPRIV;
> + if (task_flags & (1 << CPT_PF_DUMPCORE))
> + flags |= PF_DUMPCORE;
> + if (task_flags & (1 << CPT_PF_SIGNALED))
> + flags |= PF_SIGNALED;
> +
> + return flags;
> +
> +}
> +
> +int rst_restore_task_struct(struct task_struct *tsk, struct cpt_task_image *ti,
> + struct cpt_context *ctx)
> +{
> + int i;
> +
> + /* Restore only saved flags, comm and tls for now */
> + tsk->flags = decode_task_flags(ti->cpt_flags);
> + clear_tsk_thread_flag(tsk, TIF_FREEZE);
> + memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN);
> + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> + tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF;
> + tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32;
> + }
> +
> + return 0;
> +}
> +
> +static int rst_restore_fpustate(struct task_struct *tsk, struct cpt_task_image *ti,
> + struct cpt_context *ctx)
> +{
> + struct cpt_obj_bits hdr;
> + int err;
> + char *buf;
> +
> + clear_stopped_child_used_math(tsk);
> +
> + err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx);
> + if (err < 0)
> + return err;
> +
> + buf = kmalloc(hdr.cpt_size, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + err = ctx->read(buf, hdr.cpt_size, ctx);
> + if (err)
> + goto out;
> +
> + if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) {
> + memcpy(&tsk->thread.xstate, buf,
> + sizeof(struct i387_fxsave_struct));
> + if (ti->cpt_flags & CPT_PF_USED_MATH)
> + set_stopped_child_used_math(tsk);
> + }
> +#ifndef CONFIG_X86_64
> + else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD &&
> + !cpu_has_fxsr) {
> + memcpy(&tsk->thread.xstate, buf,
> + sizeof(struct i387_fsave_struct));
> + if (ti->cpt_flags & CPT_PF_USED_MATH)
> + set_stopped_child_used_math(tsk);
> + }
> +#endif
> +
> +out:
> + kfree(buf);
> + return err;
> +}
> +
> +static u32 decode_segment(u32 segid)
> +{
> + if (segid == CPT_SEG_ZERO)
> + return 0;
> +
> + /* TLS descriptors */
> + if (segid <= CPT_SEG_TLS3)
> + return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3;
> +
> + /* LDT descriptor, it is just an index to LDT array */
> + if (segid >= CPT_SEG_LDT)
> + return ((segid - CPT_SEG_LDT) << 3) | 7;
> +
> + /* Check for one of standard descriptors */
> + if (segid == CPT_SEG_USER32_DS)
> + return __USER_DS;
> + if (segid == CPT_SEG_USER32_CS)
> + return __USER_CS;
> +
> + eprintk("Invalid segment reg %d\n", segid);
> + return 0;
> +}
> +
> +static int rst_restore_registers(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> + struct cpt_x86_regs ri;
> + struct pt_regs *regs = task_pt_regs(tsk);
> + extern char i386_ret_from_resume;
> + int err;
> +
> + err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx);
> + if (err < 0)
> + return err;
> +
> + tsk->thread.sp = (unsigned long) regs;
> + tsk->thread.sp0 = (unsigned long) (regs+1);
> + tsk->thread.ip = (unsigned long) &i386_ret_from_resume;
> +
> + tsk->thread.gs = decode_segment(ri.cpt_gs);
> + tsk->thread.debugreg0 = ri.cpt_debugreg[0];
> + tsk->thread.debugreg1 = ri.cpt_debugreg[1];
> + tsk->thread.debugreg2 = ri.cpt_debugreg[2];
> + tsk->thread.debugreg3 = ri.cpt_debugreg[3];
> + tsk->thread.debugreg6 = ri.cpt_debugreg[6];
> + tsk->thread.debugreg7 = ri.cpt_debugreg[7];
> +
> + regs->bx = ri.cpt_bx;
> + regs->cx = ri.cpt_cx;
> + regs->dx = ri.cpt_dx;
> + regs->si = ri.cpt_si;
> + regs->di = ri.cpt_di;
> + regs->bp = ri.cpt_bp;
> + regs->ax = ri.cpt_ax;
> + regs->orig_ax = ri.cpt_orig_ax;
> + regs->ip = ri.cpt_ip;
> + regs->flags = ri.cpt_flags;
> + regs->sp = ri.cpt_sp;
> +
> + regs->cs = decode_segment(ri.cpt_cs);
> + regs->ss = decode_segment(ri.cpt_ss);
> + regs->ds = decode_segment(ri.cpt_ds);
> + regs->es = decode_segment(ri.cpt_es);
> + regs->fs = decode_segment(ri.cpt_fs);
> +
> + tsk->thread.sp -= HOOK_RESERVE;
> + memset((void*)tsk->thread.sp, 0, HOOK_RESERVE);
> +
> + return 0;
> +}
> +
> +static int restart_thread(void *arg)
> +{
> + struct thr_context *thr_ctx = arg;
> + struct cpt_context *ctx;
> + struct cpt_task_image *ti;
> + int err;
> +
> + current->state = TASK_UNINTERRUPTIBLE;
> +
> + ctx = thr_ctx->ctx;
> + ti = kmalloc(sizeof(*ti), GFP_KERNEL);
> + if (!ti)
> + return -ENOMEM;
> +
> + err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
> + if (!err)
> + err = rst_restore_task_struct(current, ti, ctx);
> + /* Restore mm here */
> + if (!err)
> + err = rst_restore_fpustate(current, ti, ctx);
> + if (!err)
> + err = rst_restore_registers(current, ctx);
> +
> + thr_ctx->error = err;
> + complete(&thr_ctx->complete);
> +
> + if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) {
> + do_exit(ti->cpt_exit_code);
> + } else {
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> + }
> +
> + kfree(ti);
> + schedule();
> +
> + eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm);
> +
> + module_put(THIS_MODULE);

I'm sorry, I still do not understand what you are doing with this self-module
pinning stuff. AFAICS, we should not get here unless there is a bug. So the
checkpoint module ref count is never decreased, right?

Could you detail what is this self-module pinning for? As I already told you,
this looks like a bogus solution to avoid unloading the checkpoint module during
restart.

Thanks!

Louis

[...]

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2008-10-20 16:52:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/10] Make checkpoint/restart functionality modular

On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> +struct cpt_operations
> +{
> + struct module * owner;
> + int (*checkpoint)(pid_t pid, int fd, unsigned long flags);
> + int (*restart)(int ctid, int fd, unsigned long flags);
> +};

I think this is pretty useless obfuscation. We're not going to have
pluggable checkpoint/restart implementations, are we? So, why bother
putting it in a module?

I can understand that it's easier to develop your code when it's in a
module and you don't have to reboot the machine to load a new kernel
each time. But, that's an individual developer thing, and doesn't
belong in an upstream submission.

I know people have given you a hard time for this in the past. Why is
it still here?

-- Dave

2008-10-20 17:00:42

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 02/10] Make checkpoint/restart functionality modular

Quoting Andrey Mirkin ([email protected]):
> A config option CONFIG_CHECKPOINT is introduced.
> New structure cpt_operations is introduced to store pointers to
> checkpoint/restart functions from module.

I thought we had decided not to use a kernel module?

Louis' comments on your patch 8 regarding module pinning suggests that
details about using a module will detract from proper review of the core
c/r functionality...

-serge

2008-10-20 17:02:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 03/10] Introduce context structure needed during checkpointing/restart

On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> +typedef struct cpt_context
> +{
> + pid_t pid; /* should be changed to ctid later */
> + int ctx_id; /* context id */
> + struct list_head ctx_list;
> + int refcount;
> + int ctx_state;
> + struct semaphore main_sem;

Does this really need to be a semaphore or is a mutex OK?

> + int errno;

Could you hold off on adding these things to the struct until the patch
where they're actually used? It's hard to judge this without seeing
what you do with it.

> + struct file *file;
> + loff_t current_object;
> +
> + struct list_head object_array[CPT_OBJ_MAX];
> +
> + int (*write)(const void *addr, size_t count, struct cpt_context *ctx);
> + int (*read)(void *addr, size_t count, struct cpt_context *ctx);
> +} cpt_context_t;

Man, this is hard to review. I was going to try and make sure that your
refcounting was right and atomic, but there's no use of it in this patch
except for the initialization and accessor functions. Darn.

> +extern int debug_level;

I'm going to go out on a limb here and say that "debug_level" is
probably a wee bit too generic of a variable name.

> +#define cpt_printk(lvl, fmt, args...) do { \
> + if (lvl <= debug_level) \
> + printk(fmt, ##args); \
> + } while (0)

I think you can use pr_debug() here, too, just like Oren did.

> +struct cpt_context * context_alloc(void)
> +{
> + struct cpt_context *ctx;
> + int i;
> +
> + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> + if (!ctx)
> + return NULL;
> +
> + init_MUTEX(&ctx->main_sem);
> + ctx->refcount = 1;
> +
> + ctx->current_object = -1;
> + ctx->write = file_write;
> + ctx->read = file_read;
> + for (i = 0; i < CPT_OBJ_MAX; i++) {
> + INIT_LIST_HEAD(&ctx->object_array[i]);
> + }
> +
> + return ctx;
> +}
> +
> +void context_release(struct cpt_context *ctx)
> +{
> + ctx->ctx_state = CPT_CTX_ERROR;
> +
> + kfree(ctx);
> +}
> +
> +static void context_put(struct cpt_context *ctx)
> +{
> + if (!--ctx->refcount)
> + context_release(ctx);
> +}
> +
> static int checkpoint(pid_t pid, int fd, unsigned long flags)
> {
> - return -ENOSYS;
> + struct file *file;
> + struct cpt_context *ctx;
> + int err;
> +
> + err = -EBADF;
> + file = fget(fd);
> + if (!file)
> + goto out;
> +
> + err = -ENOMEM;
> + ctx = context_alloc();
> + if (!ctx)
> + goto out_file;
> +
> + ctx->file = file;
> + ctx->ctx_state = CPT_CTX_DUMPING;
> +
> + /* checkpoint */
> + err = -ENOSYS;
> +
> + context_put(ctx);
> +
> +out_file:
> + fput(file);
> +out:
> + return err;
> }

So, where is context_get()? Is there only single-threaded access to the
refcount? If so, why do we even need it? We should probably just use
context_release() directly.

If there is multithreaded access to context_put() or the refcount, then
they're unsafe without additional locking.

-- Dave

2008-10-20 17:24:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 06/10] Introduce functions to dump mm

On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
> + struct page_desc *pdesc, cpt_context_t * ctx)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep, pte;
> + spinlock_t *ptl;
> + struct page *pg = NULL;
> + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
> +
> + pdesc->index = linear_index;
> + pdesc->shared = 0;
> + pdesc->mm = CPT_NULL;
> +
> + if (vma->vm_flags & VM_IO) {
> + pdesc->type = PD_ABSENT;
> + return;
> + }
> +
> + pgd = pgd_offset(mm, addr);
> + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> + goto out_absent;
> + pud = pud_offset(pgd, addr);
> + if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> + goto out_absent;
> + pmd = pmd_offset(pud, addr);
> + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> + goto out_absent;
> +#ifdef CONFIG_X86
> + if (pmd_huge(*pmd)) {
> + eprintk("page_huge\n");
> + goto out_unsupported;
> + }
> +#endif

I take it you know that this breaks with the 1GB (x86_64) and 16GB (ppc)
large pages.

Since you have the VMA, why not use is_vm_hugetlb_page()?
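
E.g. (my sketch), next to the existing VM_IO check at the top of
page_get_desc(), which also covers the 1GB/16GB cases:

	if (is_vm_hugetlb_page(vma)) {
		eprintk("huge page VMA\n");
		goto out_unsupported;
	}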

-- Dave

2008-10-20 17:48:28

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 05/10] Introduce function to dump process

Quoting Andrey Mirkin ([email protected]):
> + t->cpt_uid = tsk->uid;
> + t->cpt_euid = tsk->euid;
> + t->cpt_suid = tsk->suid;
> + t->cpt_fsuid = tsk->fsuid;
> + t->cpt_gid = tsk->gid;
> + t->cpt_egid = tsk->egid;
> + t->cpt_sgid = tsk->sgid;
> + t->cpt_fsgid = tsk->fsgid;

I don't see where any of these are restored. (Obviously, I wanted
to think about how you're verifying the restarter's authorization
to do so)
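
For the restore side I'd expect, very roughly, something like the sketch
below in rst_restore_task_struct() (purely illustrative; the interesting
part is which authorization check is the right one):

	/* sketch only: refuse to assume foreign ids without privilege */
	if (!capable(CAP_SETUID) || !capable(CAP_SETGID))
		return -EPERM;
	tsk->uid = ti->cpt_uid;
	tsk->euid = ti->cpt_euid;
	tsk->suid = ti->cpt_suid;
	tsk->fsuid = ti->cpt_fsuid;
	tsk->gid = ti->cpt_gid;
	tsk->egid = ti->cpt_egid;
	tsk->sgid = ti->cpt_sgid;
	tsk->fsgid = ti->cpt_fsgid;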

thanks,
-serge

2008-10-22 08:51:21

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> Hello Andrey !
>
> > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > index 109792b..a4848a3 100644
> > --- a/arch/x86/kernel/entry_32.S
> > +++ b/arch/x86/kernel/entry_32.S
> > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > GET_THREAD_INFO(%ebp)
> > popl %eax
> > CFI_ADJUST_CFA_OFFSET -4
> > +ret_from_fork_tail:
> > pushl $0x0202 # Reset kernel eflags
> > CFI_ADJUST_CFA_OFFSET 4
> > popfl
> > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > CFI_ENDPROC
> > END(ret_from_fork)
> >
> > +ENTRY(i386_ret_from_resume)
> > + CFI_STARTPROC
> > + pushl %eax
> > + CFI_ADJUST_CFA_OFFSET 4
> > + call schedule_tail
> > + GET_THREAD_INFO(%ebp)
> > + popl %eax
> > + CFI_ADJUST_CFA_OFFSET -4
> > + movl (%esp), %eax
> > + testl %eax, %eax
> > + jz 1f
> > + pushl %esp
> > + call *%eax
> > + addl $4, %esp
> > +1:
> > + addl $256, %esp
> > + jmp ret_from_fork_tail
> > + CFI_ENDPROC
> > +END(i386_ret_from_resume)
>
> Could you explain why you need to do this
>
> call *%eax
>
> is it related to the freezer code ?

It is not related to the freezer code actually.
It is needed to restart syscalls. Right now I don't have code in my
patchset which restarts a syscall, but I plan to add it later.
In OpenVZ checkpointing we restart a syscall if the process was caught in a
syscall during checkpointing.
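
Just to illustrate what the reserved area is meant for (this is only my
reading of i386_ret_from_resume, no such hook exists in the patchset yet):
the word at the new thread.sp can hold a function pointer which
i386_ret_from_resume will call, with a pointer to the reserved area as
argument, before jumping to ret_from_fork_tail. E.g.:

/* hypothetical helper, not part of this patchset */
static void install_resume_hook(struct task_struct *tsk, void (*fn)(void *))
{
	/* rst_restore_registers() left HOOK_RESERVE zeroed bytes at
	 * tsk->thread.sp; i386_ret_from_resume does
	 *	movl (%esp), %eax
	 *	call *%eax
	 * when this first word is non-zero. */
	*(unsigned long *)tsk->thread.sp = (unsigned long)fn;
}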

Andrey

2008-10-22 08:58:47

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm

On Monday 20 October 2008 16:25 Louis Rilling wrote:
> On Sat, Oct 18, 2008 at 03:11:34AM +0400, Andrey Mirkin wrote:
> > Functions to dump mm struct, VMAs and mm context are added.
>
> Again, a few little comments.
>
> [...]
>
> > diff --git a/checkpoint/cpt_mm.c b/checkpoint/cpt_mm.c
> > new file mode 100644
> > index 0000000..8a22c48
> > --- /dev/null
> > +++ b/checkpoint/cpt_mm.c
> > @@ -0,0 +1,434 @@
> > +/*
> > + * Copyright (C) 2008 Parallels, Inc.
> > + *
> > + * Authors: Andrey Mirkin <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License as
> > + * published by the Free Software Foundation, version 2 of the
> > + * License.
> > + *
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/slab.h>
> > +#include <linux/file.h>
> > +#include <linux/mm.h>
> > +#include <linux/errno.h>
> > +#include <linux/major.h>
> > +#include <linux/mman.h>
> > +#include <linux/mnt_namespace.h>
> > +#include <linux/mount.h>
> > +#include <linux/namei.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/hugetlb.h>
> > +#include <asm/ldt.h>
> > +
> > +#include "checkpoint.h"
> > +#include "cpt_image.h"
> > +
> > +struct page_area
> > +{
> > + int type;
> > + unsigned long start;
> > + unsigned long end;
> > + pgoff_t pgoff;
> > + loff_t mm;
> > + __u64 list[16];
> > +};
> > +
> > +struct page_desc
> > +{
> > + int type;
> > + pgoff_t index;
> > + loff_t mm;
> > + int shared;
> > +};
> > +
> > +enum {
> > + PD_ABSENT,
> > + PD_COPY,
> > + PD_FUNKEY,
> > +};
> > +
> > +/* 0: page can be obtained from backstore, or still not mapped anonymous page,
> > +    or something else, which does not require copy.
> > +   1: page requires copy
> > +   2: page requires copy but its content is zero. Quite useless.
> > +   3: wp page is shared after fork(). It is to be COWed when modified.
> > +   4: page is something unsupported... We copy it right now.
> > + */
> > +
> > +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
> > +		struct page_desc *pdesc, cpt_context_t * ctx)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + pgd_t *pgd;
> > + pud_t *pud;
> > + pmd_t *pmd;
> > + pte_t *ptep, pte;
> > + spinlock_t *ptl;
> > + struct page *pg = NULL;
> > + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
> > +
> > + pdesc->index = linear_index;
> > + pdesc->shared = 0;
> > + pdesc->mm = CPT_NULL;
> > +
> > + if (vma->vm_flags & VM_IO) {
> > + pdesc->type = PD_ABSENT;
> > + return;
> > + }
> > +
> > + pgd = pgd_offset(mm, addr);
> > + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> > + goto out_absent;
> > + pud = pud_offset(pgd, addr);
> > + if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> > + goto out_absent;
> > + pmd = pmd_offset(pud, addr);
> > + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> > + goto out_absent;
> > +#ifdef CONFIG_X86
> > + if (pmd_huge(*pmd)) {
> > + eprintk("page_huge\n");
> > + goto out_unsupported;
> > + }
> > +#endif
> > + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > + pte = *ptep;
> > + pte_unmap(ptep);
> > +
> > + if (pte_none(pte))
> > + goto out_absent_unlock;
> > +
> > + if ((pg = vm_normal_page(vma, addr, pte)) == NULL) {
> > + pdesc->type = PD_COPY;
> > + goto out_unlock;
> > + }
> > +
> > + get_page(pg);
> > + spin_unlock(ptl);
> > +
> > + if (pg->mapping && !PageAnon(pg)) {
> > + if (vma->vm_file == NULL) {
> > + eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr);
> > + goto out_unsupported;
> > + }
> > + if (vma->vm_file->f_mapping != pg->mapping) {
> > + eprintk("pg->mapping!=f_mapping: %08lx %p %p\n",
> > + addr, vma->vm_file->f_mapping, pg->mapping);
> > + goto out_unsupported;
> > + }
> > + pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
> > + /* Page is in backstore. For us it is like
> > + * it is not present.
> > + */
> > + goto out_absent;
> > + }
> > +
> > + if (PageReserved(pg)) {
> > + /* Special case: ZERO_PAGE is used, when an
> > + * anonymous page is accessed but not written. */
> > + if (pg == ZERO_PAGE(addr)) {
> > + if (pte_write(pte)) {
> > + eprintk("not funny already, writable ZERO_PAGE\n");
> > + goto out_unsupported;
> > + }
> > + /* Just copy it for now */
> > + pdesc->type = PD_COPY;
> > + goto out_put;
> > + }
> > + eprintk("reserved page %lu at %08lx\n", pg->index, addr);
> > + goto out_unsupported;
> > + }
> > +
> > + if (!pg->mapping) {
> > + eprintk("page without mapping at %08lx\n", addr);
> > + goto out_unsupported;
> > + }
> > +
> > + pdesc->type = PD_COPY;
> > +
> > +out_put:
> > + if (pg)
> > + put_page(pg);
> > + return;
> > +
> > +out_unlock:
> > + spin_unlock(ptl);
> > + goto out_put;
> > +
> > +out_absent_unlock:
> > + spin_unlock(ptl);
> > +
> > +out_absent:
> > + pdesc->type = PD_ABSENT;
> > + goto out_put;
> > +
> > +out_unsupported:
> > + pdesc->type = PD_FUNKEY;
> > + goto out_put;
> > +}
> > +
> > +static int count_vma_pages(struct vm_area_struct *vma,
> > +		struct cpt_context *ctx)
> > +{
> > + unsigned long addr;
> > + int page_num = 0;
> > +
> > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > + struct page_desc pd;
> > +
> > + page_get_desc(vma, addr, &pd, ctx);
> > +
> > + if (pd.type != PD_COPY) {
> > + return -EINVAL;
> > + } else {
> > + page_num += 1;
> > + }
> > +
> > + }
> > + return page_num;
> > +}
> > +
> > +/* ATTN: We give "current" to get_user_pages(). This is wrong, but
> > + * get_user_pages() does not really need this thing. It just stores some
> > + * page fault stats there.
> > + *
> > + * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache pages
> > + * before accessing vma.
> > + */
> > +static int dump_pages(struct vm_area_struct *vma, unsigned long start,
> > + unsigned long end, struct cpt_context *ctx)
> > +{
> > +#define MAX_PAGE_BATCH 16
> > + struct page *pg[MAX_PAGE_BATCH];
> > + int npages = (end - start)/PAGE_SIZE;
> > + int count = 0;
> > +
> > + while (count < npages) {
> > + int copy = npages - count;
> > + int n;
> > +
> > + if (copy > MAX_PAGE_BATCH)
> > + copy = MAX_PAGE_BATCH;
> > + n = get_user_pages(current, vma->vm_mm, start, copy,
> > + 0, 1, pg, NULL);
> > + if (n == copy) {
> > + int i;
> > + for (i=0; i<n; i++) {
> > + char *maddr = kmap(pg[i]);
> > + ctx->write(maddr, PAGE_SIZE, ctx);
> > + kunmap(pg[i]);
>
> There is no error handling in this inner loop. Should be fixed imho.

Yes, you're right. It is already fixed in the next version. I'll try to send it
out shortly.
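
Roughly, the inner loop will propagate the write error, e.g. (a sketch, not
the exact code from the next version):

			int i, err = 0;

			for (i = 0; i < n; i++) {
				char *maddr = kmap(pg[i]);

				err = ctx->write(maddr, PAGE_SIZE, ctx);
				kunmap(pg[i]);
				if (err)
					break;
			}
			if (err) {
				for ( ; n > 0; n--)
					page_cache_release(pg[n-1]);
				return err;
			}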

>
> > + }
> > + } else {
> > + eprintk("get_user_pages fault");
> > + for ( ; n > 0; n--)
> > + page_cache_release(pg[n-1]);
> > + return -EFAULT;
> > + }
> > + start += n*PAGE_SIZE;
> > + count += n;
> > + for ( ; n > 0; n--)
> > + page_cache_release(pg[n-1]);
> > + }
> > + return 0;
> > +}
> > +
> > +static int dump_page_block(struct vm_area_struct *vma,
> > + struct cpt_page_block *pgb,
> > + struct cpt_context *ctx)
> > +{
> > + int err;
> > + pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start;
> > + pgb->cpt_type = CPT_OBJ_PAGES;
> > + pgb->cpt_hdrlen = sizeof(*pgb);
> > + pgb->cpt_content = CPT_CONTENT_DATA;
> > +
> > + err = ctx->write(pgb, sizeof(*pgb), ctx);
> > + if (!err)
> > + err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx);
> > +
> > + return err;
> > +}
> > +
> > +static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx)
> > +{
> > + int len;
> > + char *path;
> > + char *buf;
> > + struct cpt_object_hdr o;
> > +
> > + buf = (char *)__get_free_page(GFP_KERNEL);
> > + if (!buf)
> > + return -ENOMEM;
> > +
> > + path = d_path(p, buf, PAGE_SIZE);
> > +
> > + if (IS_ERR(path)) {
> > + free_page((unsigned long)buf);
> > + return PTR_ERR(path);
> > + }
> > +
> > + len = buf + PAGE_SIZE - 1 - path;
> > + o.cpt_len = sizeof(o) + len + 1;
> > + o.cpt_type = CPT_OBJ_NAME;
> > + o.cpt_hdrlen = sizeof(o);
> > + o.cpt_content = CPT_CONTENT_NAME;
> > + path[len] = 0;
> > +
> > + ctx->write(&o, sizeof(o), ctx);
> > + ctx->write(path, len + 1, ctx);
>
> Error handling?
Will fix it, thanks.
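
Something along these lines (a sketch; it assumes an err local is added to
cpt_dump_dentry()):

	err = ctx->write(&o, sizeof(o), ctx);
	if (!err)
		err = ctx->write(path, len + 1, ctx);
	free_page((unsigned long)buf);

	return err;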

>
> > + free_page((unsigned long)buf);
> > +
> > + return 0;
> > +}
> > +
> > +static int dump_one_vma(struct mm_struct *mm,
> > + struct vm_area_struct *vma, struct cpt_context *ctx)
> > +{
> > + struct cpt_vma_image *v;
> > + unsigned long addr;
> > + int page_num;
> > + int err;
> > +
> > + v = kzalloc(sizeof(*v), GFP_KERNEL);
> > + if (!v)
> > + return -ENOMEM;
> > +
> > + v->cpt_len = sizeof(*v);
> > + v->cpt_type = CPT_OBJ_VMA;
> > + v->cpt_hdrlen = sizeof(*v);
> > + v->cpt_content = CPT_CONTENT_ARRAY;
> > +
> > + v->cpt_start = vma->vm_start;
> > + v->cpt_end = vma->vm_end;
> > + v->cpt_flags = vma->vm_flags;
> > + if (vma->vm_flags & VM_HUGETLB) {
> > + eprintk("huge TLB VMAs are still not supported\n");
> > + kfree(v);
> > + return -EINVAL;
> > + }
> > + v->cpt_pgprot = vma->vm_page_prot.pgprot;
> > + v->cpt_pgoff = vma->vm_pgoff;
> > + v->cpt_file = CPT_NULL;
> > + v->cpt_vma_type = CPT_VMA_TYPE_0;
> > +
> > + page_num = count_vma_pages(vma, ctx);
> > + if (page_num < 0) {
> > + kfree(v);
> > + return -EINVAL;
> > + }
>
> AFAICS, since count_vma_pages only supports pages with PD_COPY, and since
> page_get_desc() tags text segment pages (file-mapped and not anonymous
> since not written to) as PD_ABSENT, no executable is checkpointable. So,
> where is the trick? Am I completely missing something about page mapping?
Oh, that's my fault, I sent the wrong version. I will send a new patchset with
the correct page mapping today.

>
> > + v->cpt_page_num = page_num;
> > +
> > + if (vma->vm_file) {
> > + v->cpt_file = 0;
> > + v->cpt_vma_type = CPT_VMA_FILE;
> > + }
> > +
> > + ctx->write(v, sizeof(*v), ctx);
>
> Error handling?
Yes, will add it.

>
> > + kfree(v);
> > +
> > + if (vma->vm_file) {
> > + err = cpt_dump_dentry(&vma->vm_file->f_path, ctx);
> > + if (err < 0)
> > + return err;
> > + }
> > +
> > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > + struct page_desc pd;
> > + struct cpt_page_block pgb;
> > +
> > + page_get_desc(vma, addr, &pd, ctx);
> > +
> > + if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) {
> > + eprintk("dump_one_vma: funkey page\n");
> > + return -EINVAL;
> > + }
> > +
> > + pgb.cpt_start = addr;
> > + pgb.cpt_end = addr + PAGE_SIZE;
> > + dump_page_block(vma, &pgb, ctx);
>
> Error handling?
Yeap, thanks.

>
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx)
> > +{
> > +#ifdef CONFIG_X86
> > + if (mm->context.size) {
> > + struct cpt_obj_bits b;
> > + int size;
> > +
> > + mutex_lock(&mm->context.lock);
> > +
> > + b.cpt_type = CPT_OBJ_BITS;
> > + b.cpt_len = sizeof(b);
> > + b.cpt_content = CPT_CONTENT_MM_CONTEXT;
> > + b.cpt_size = mm->context.size * LDT_ENTRY_SIZE;
> > +
> > + ctx->write(&b, sizeof(b), ctx);
> > +
> > + size = mm->context.size * LDT_ENTRY_SIZE;
> > +
> > + ctx->write(mm->context.ldt, size, ctx);
>
> Error handling?
Thanks again!

>
> > +
> > + mutex_unlock(&mm->context.lock);
> > + }
> > +#endif
> > + return 0;
> > +}
> > +
> > +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx)
> > +{
> > + struct mm_struct *mm = tsk->mm;
> > + struct cpt_mm_image *v;
> > + struct vm_area_struct *vma;
> > + int err;
> > +
> > + v = kzalloc(sizeof(*v), GFP_KERNEL);
> > + if (!v)
> > + return -ENOMEM;
> > +
> > + v->cpt_len = sizeof(*v);
> > + v->cpt_type = CPT_OBJ_MM;
> > + v->cpt_hdrlen = sizeof(*v);
> > + v->cpt_content = CPT_CONTENT_ARRAY;
> > +
> > + down_read(&mm->mmap_sem);
> > + v->cpt_start_code = mm->start_code;
> > + v->cpt_end_code = mm->end_code;
> > + v->cpt_start_data = mm->start_data;
> > + v->cpt_end_data = mm->end_data;
> > + v->cpt_start_brk = mm->start_brk;
> > + v->cpt_brk = mm->brk;
> > + v->cpt_start_stack = mm->start_stack;
> > + v->cpt_start_arg = mm->arg_start;
> > + v->cpt_end_arg = mm->arg_end;
> > + v->cpt_start_env = mm->env_start;
> > + v->cpt_end_env = mm->env_end;
> > + v->cpt_def_flags = mm->def_flags;
> > + v->cpt_flags = mm->flags;
> > + v->cpt_map_count = mm->map_count;
> > +
> > + err = ctx->write(v, sizeof(*v), ctx);
> > + kfree(v);
> > +
> > + if (err) {
> > + eprintk("error during writing mm\n");
> > + goto err_up;
> > + }
> > +
> > + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> > + if ((err = dump_one_vma(mm, vma, ctx)) != 0)
> > + goto err_up;
> > + }
> > +
> > + err = cpt_dump_mm_context(mm, ctx);
> > +
> > +err_up:
> > + up_read(&mm->mmap_sem);
> > +
> > + return err;
> > +}
> > +
>
> [...]
>
> Louis

2008-10-22 09:25:20

by Louis Rilling

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> > Hello Andrey !
> >
> > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > > index 109792b..a4848a3 100644
> > > --- a/arch/x86/kernel/entry_32.S
> > > +++ b/arch/x86/kernel/entry_32.S
> > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > > GET_THREAD_INFO(%ebp)
> > > popl %eax
> > > CFI_ADJUST_CFA_OFFSET -4
> > > +ret_from_fork_tail:
> > > pushl $0x0202 # Reset kernel eflags
> > > CFI_ADJUST_CFA_OFFSET 4
> > > popfl
> > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > > CFI_ENDPROC
> > > END(ret_from_fork)
> > >
> > > +ENTRY(i386_ret_from_resume)
> > > + CFI_STARTPROC
> > > + pushl %eax
> > > + CFI_ADJUST_CFA_OFFSET 4
> > > + call schedule_tail
> > > + GET_THREAD_INFO(%ebp)
> > > + popl %eax
> > > + CFI_ADJUST_CFA_OFFSET -4
> > > + movl (%esp), %eax
> > > + testl %eax, %eax
> > > + jz 1f
> > > + pushl %esp
> > > + call *%eax
> > > + addl $4, %esp
> > > +1:
> > > + addl $256, %esp
> > > + jmp ret_from_fork_tail
> > > + CFI_ENDPROC
> > > +END(i386_ret_from_resume)
> >
> > Could you explain why you need to do this
> >
> > call *%eax
> >
> > is it related to the freezer code ?
>
> It is not related to the freezer code actually.
> That is needed to restart syscalls. Right now I don't have a code in my
> patchset which restarts a syscall, but later I plan to add it.
> In OpenVZ checkpointing we restart syscalls if process was caught in syscall
> during checkpointing.

Do you checkpoint uninterruptible syscalls as well? If only interruptible
syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
this syscall does not restart itself when interrupted, and well, this is life,
userspace just sees -EINTR, which is allowed by the syscall spec.
Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
Kerrighed and this works.
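
For reference, the mechanism I mean is the one in the arch signal-delivery
code; simplified (not an exact quote from arch/x86), it does:

	/* in do_signal(), when no user handler ends up being invoked */
	switch (regs->ax) {
	case -ERESTARTNOHAND:
	case -ERESTARTSYS:
	case -ERESTARTNOINTR:
		regs->ax = regs->orig_ax;	/* put the syscall number back */
		regs->ip -= 2;			/* re-execute int $0x80 / sysenter */
		break;
	}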

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2008-10-22 10:07:33

by Greg Kurz

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wed, 2008-10-22 at 11:25 +0200, Louis Rilling wrote:
> Do you checkpoint uninterruptible syscalls as well? If only interruptible
> syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
> ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
> this syscall does not restart itself when interrupted, and well, this is life,
> userspace just sees -EINTR, which is allowed by the syscall spec.
> Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
> Kerrighed and this works.
>
> Louis
>

I don't know Kerrighed internals, but I understand you perform the checkpoint
with a signal handler. Right? This approach has a huge benefit: the
signal handling code does all the arch-dependent stuff to save registers
in user memory.

--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2008-10-22 10:14:26

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
> On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> > > Hello Andrey !
> > >
> > > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > > > index 109792b..a4848a3 100644
> > > > --- a/arch/x86/kernel/entry_32.S
> > > > +++ b/arch/x86/kernel/entry_32.S
> > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > > > GET_THREAD_INFO(%ebp)
> > > > popl %eax
> > > > CFI_ADJUST_CFA_OFFSET -4
> > > > +ret_from_fork_tail:
> > > > pushl $0x0202 # Reset kernel eflags
> > > > CFI_ADJUST_CFA_OFFSET 4
> > > > popfl
> > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > > > CFI_ENDPROC
> > > > END(ret_from_fork)
> > > >
> > > > +ENTRY(i386_ret_from_resume)
> > > > + CFI_STARTPROC
> > > > + pushl %eax
> > > > + CFI_ADJUST_CFA_OFFSET 4
> > > > + call schedule_tail
> > > > + GET_THREAD_INFO(%ebp)
> > > > + popl %eax
> > > > + CFI_ADJUST_CFA_OFFSET -4
> > > > + movl (%esp), %eax
> > > > + testl %eax, %eax
> > > > + jz 1f
> > > > + pushl %esp
> > > > + call *%eax
> > > > + addl $4, %esp
> > > > +1:
> > > > + addl $256, %esp
> > > > + jmp ret_from_fork_tail
> > > > + CFI_ENDPROC
> > > > +END(i386_ret_from_resume)
> > >
> > > Could you explain why you need to do this
> > >
> > > call *%eax
> > >
> > > is it related to the freezer code ?
> >
> > It is not related to the freezer code actually.
> > That is needed to restart syscalls. Right now I don't have a code in my
> > patchset which restarts a syscall, but later I plan to add it.
> > In OpenVZ checkpointing we restart syscalls if process was caught in
> > syscall during checkpointing.
>
> Do you checkpoint uninterruptible syscalls as well? If only interruptible
> syscalls are checkpointed, I'd say that either this syscall uses
> ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal handling code already
> does the trick, or this syscall does not restart itself when interrupted,
> and well, this is life, userspace just sees -EINTR, which is allowed by the
> syscall spec.
> Actually this is how we checkpoint/migrate tasks in interruptible syscalls
> in Kerrighed and this works.

We checkpoint only interruptible syscalls. Some syscalls do not restart
themselves; that is why, after restarting a process, we restart the syscall to
complete it.

Andrey

2008-10-22 10:44:58

by Louis Rilling

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wed, Oct 22, 2008 at 12:06:19PM +0200, Greg Kurz wrote:
> On Wed, 2008-10-22 at 11:25 +0200, Louis Rilling wrote:
> > Do you checkpoint uninterruptible syscalls as well? If only interruptible
> > syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
> > ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
> > this syscall does not restart itself when interrupted, and well, this is life,
> > userspace just sees -EINTR, which is allowed by the syscall spec.
> > Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
> > Kerrighed and this works.
> >
> > Louis
> >
>
> I don't know Kerrighed internals but I understand you perform checkpoint
> with a signal handler. Right ?

Right. This is a kernel-internal-only signal, so all signals remain available
for userspace.

> This approach has a huge benefit: the
> signal handling code do all the arch dependant stuff to save registers
> in user memory.

Hm, I'm not sure I understand what you mean here. We just rely on the arch code
that jumps to signal handling to correctly set up struct pt_regs, which is then
passed to the checkpoint code. So yes, userspace registers are mostly saved by
existing arch code. But on x86-64, for instance, segment registers still need to
be saved by the checkpoint code (a bit like copy_thread() does), and I don't
know of arch-independent functions doing this.
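
Concretely, for the task being checkpointed we end up doing something close to
what copy_thread() does on x86-64 (a sketch; "task" is just a name here):

	savesegment(gs, task->thread.gsindex);
	savesegment(fs, task->thread.fsindex);
	savesegment(es, task->thread.es);
	savesegment(ds, task->thread.ds);
	/* plus the FS/GS base MSRs, which struct pt_regs does not carry either */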

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2008-10-22 10:46:46

by Louis Rilling

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wed, Oct 22, 2008 at 02:12:12PM +0400, Andrey Mirkin wrote:
> On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
> > On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> > > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> > > > Hello Andrey !
> > > >
> > > > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> > > > > index 109792b..a4848a3 100644
> > > > > --- a/arch/x86/kernel/entry_32.S
> > > > > +++ b/arch/x86/kernel/entry_32.S
> > > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > > > > GET_THREAD_INFO(%ebp)
> > > > > popl %eax
> > > > > CFI_ADJUST_CFA_OFFSET -4
> > > > > +ret_from_fork_tail:
> > > > > pushl $0x0202 # Reset kernel eflags
> > > > > CFI_ADJUST_CFA_OFFSET 4
> > > > > popfl
> > > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > > > > CFI_ENDPROC
> > > > > END(ret_from_fork)
> > > > >
> > > > > +ENTRY(i386_ret_from_resume)
> > > > > + CFI_STARTPROC
> > > > > + pushl %eax
> > > > > + CFI_ADJUST_CFA_OFFSET 4
> > > > > + call schedule_tail
> > > > > + GET_THREAD_INFO(%ebp)
> > > > > + popl %eax
> > > > > + CFI_ADJUST_CFA_OFFSET -4
> > > > > + movl (%esp), %eax
> > > > > + testl %eax, %eax
> > > > > + jz 1f
> > > > > + pushl %esp
> > > > > + call *%eax
> > > > > + addl $4, %esp
> > > > > +1:
> > > > > + addl $256, %esp
> > > > > + jmp ret_from_fork_tail
> > > > > + CFI_ENDPROC
> > > > > +END(i386_ret_from_resume)
> > > >
> > > > Could you explain why you need to do this
> > > >
> > > > call *%eax
> > > >
> > > > is it related to the freezer code ?
> > >
> > > It is not related to the freezer code actually.
> > > That is needed to restart syscalls. Right now I don't have a code in my
> > > patchset which restarts a syscall, but later I plan to add it.
> > > In OpenVZ checkpointing we restart syscalls if process was caught in
> > > syscall during checkpointing.
> >
> > Do you checkpoint uninterruptible syscalls as well? If only interruptible
> > syscalls are checkpointed, I'd say that either this syscall uses
> > ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal handling code already
> > does the trick, or this syscall does not restart itself when interrupted,
> > and well, this is life, userspace just sees -EINTR, which is allowed by the
> > syscall spec.
> > Actually this is how we checkpoint/migrate tasks in interruptible syscalls
> > in Kerrighed and this works.
>
> We checkpoint only interruptible syscalls. Some syscalls do not restart
> themself, that is why after restarting a process we restart syscall to
> complete it.

I guess you do that to avoid breaking applications that are badly written and do
not handle -EINTR correctly with interruptible syscalls. Right?
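
That is, applications that lack the usual retry idiom around interruptible
calls, roughly:

	ssize_t n;

	do {
		n = read(fd, buf, count);
	} while (n < 0 && errno == EINTR);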

Louis

--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes



2008-10-22 12:44:29

by Greg Kurz

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wed, 2008-10-22 at 12:44 +0200, Louis Rilling wrote:
> On Wed, Oct 22, 2008 at 12:06:19PM +0200, Greg Kurz wrote:
> > On Wed, 2008-10-22 at 11:25 +0200, Louis Rilling wrote:
> > > Do you checkpoint uninterruptible syscalls as well? If only interruptible
> > > syscalls are checkpointed, I'd say that either this syscall uses ERESTARTSYS or
> > > ERESTART_RESTARTBLOCK, and then signal handling code already does the trick, or
> > > this syscall does not restart itself when interrupted, and well, this is life,
> > > userspace just sees -EINTR, which is allowed by the syscall spec.
> > > Actually this is how we checkpoint/migrate tasks in interruptible syscalls in
> > > Kerrighed and this works.
> > >
> > > Louis
> > >
> >
> > I don't know Kerrighed internals but I understand you perform checkpoint
> > with a signal handler. Right ?
>
> Right. This is an kernel-internal-only signal, so all signals remain available
> for userspace.
>
> > This approach has a huge benefit: the
> > signal handling code do all the arch dependant stuff to save registers
> > in user memory.
>
> Hm, I'm not sure to understand what you mean here. We just rely on arch code
> that jumps to signal handling to correctly setup struct pt_regs, which is then
> passed to the checkpoint code. So yes, userspace registers are mostly saved by
> existing arch code. But in x86-64 for instance, segment registers still need to
> be saved by the checkpoint code (a bit like copy_thread() does), and I don't
> know arch-independent functions doing this.
>

You're right, some segment registers need to be saved on x86 also... I
should have written 'most of' in my previous mail.

--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2008-10-22 12:48:26

by Cédric Le Goater

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

>>> +ENTRY(i386_ret_from_resume)
>>> + CFI_STARTPROC
>>> + pushl %eax
>>> + CFI_ADJUST_CFA_OFFSET 4
>>> + call schedule_tail
>>> + GET_THREAD_INFO(%ebp)
>>> + popl %eax
>>> + CFI_ADJUST_CFA_OFFSET -4
>>> + movl (%esp), %eax
>>> + testl %eax, %eax
>>> + jz 1f
>>> + pushl %esp
>>> + call *%eax
>>> + addl $4, %esp
>>> +1:
>>> + addl $256, %esp
>>> + jmp ret_from_fork_tail
>>> + CFI_ENDPROC
>>> +END(i386_ret_from_resume)
>> Could you explain why you need to do this
>>
>> call *%eax
>>
>> is it related to the freezer code ?
>
> It is not related to the freezer code actually.
> That is needed to restart syscalls. Right now I don't have a code in my
> patchset which restarts a syscall, but later I plan to add it.
> In OpenVZ checkpointing we restart syscalls if process was caught in syscall
> during checkpointing.

OK, I get it now. Why 256 bytes of extra stack? I'm sure it's not random.

Thanks,

C.

2008-10-22 15:27:07

by Oren Laadan

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process



Andrey Mirkin wrote:
> On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
>> On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
>>> On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
>>>> Hello Andrey !
>>>>
>>>>> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
>>>>> index 109792b..a4848a3 100644
>>>>> --- a/arch/x86/kernel/entry_32.S
>>>>> +++ b/arch/x86/kernel/entry_32.S
>>>>> @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
>>>>> GET_THREAD_INFO(%ebp)
>>>>> popl %eax
>>>>> CFI_ADJUST_CFA_OFFSET -4
>>>>> +ret_from_fork_tail:
>>>>> pushl $0x0202 # Reset kernel eflags
>>>>> CFI_ADJUST_CFA_OFFSET 4
>>>>> popfl
>>>>> @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
>>>>> CFI_ENDPROC
>>>>> END(ret_from_fork)
>>>>>
>>>>> +ENTRY(i386_ret_from_resume)
>>>>> + CFI_STARTPROC
>>>>> + pushl %eax
>>>>> + CFI_ADJUST_CFA_OFFSET 4
>>>>> + call schedule_tail
>>>>> + GET_THREAD_INFO(%ebp)
>>>>> + popl %eax
>>>>> + CFI_ADJUST_CFA_OFFSET -4
>>>>> + movl (%esp), %eax
>>>>> + testl %eax, %eax
>>>>> + jz 1f
>>>>> + pushl %esp
>>>>> + call *%eax
>>>>> + addl $4, %esp
>>>>> +1:
>>>>> + addl $256, %esp
>>>>> + jmp ret_from_fork_tail
>>>>> + CFI_ENDPROC
>>>>> +END(i386_ret_from_resume)
>>>> Could you explain why you need to do this
>>>>
>>>> call *%eax
>>>>
>>>> is it related to the freezer code ?
>>> It is not related to the freezer code actually.
>>> That is needed to restart syscalls. Right now I don't have a code in my
>>> patchset which restarts a syscall, but later I plan to add it.
>>> In OpenVZ checkpointing we restart syscalls if process was caught in
>>> syscall during checkpointing.
>> Do you checkpoint uninterruptible syscalls as well? If only interruptible
>> syscalls are checkpointed, I'd say that either this syscall uses
>> ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal handling code already
>> does the trick, or this syscall does not restart itself when interrupted,
>> and well, this is life, userspace just sees -EINTR, which is allowed by the
>> syscall spec.
>> Actually this is how we checkpoint/migrate tasks in interruptible syscalls
>> in Kerrighed and this works.
>
> We checkpoint only interruptible syscalls. Some syscalls do not restart
> themself, that is why after restarting a process we restart syscall to
> complete it.

Can you please elaborate on this ? I don't recall having had issues
with that.

Thanks,

Oren.

>
> Andrey
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

2008-10-23 08:45:51

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm

On Monday 20 October 2008 21:21 Dave Hansen wrote:
> On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> > +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr,
> > +		struct page_desc *pdesc, cpt_context_t * ctx)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + pgd_t *pgd;
> > + pud_t *pud;
> > + pmd_t *pmd;
> > + pte_t *ptep, pte;
> > + spinlock_t *ptl;
> > + struct page *pg = NULL;
> > + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff;
> > +
> > + pdesc->index = linear_index;
> > + pdesc->shared = 0;
> > + pdesc->mm = CPT_NULL;
> > +
> > + if (vma->vm_flags & VM_IO) {
> > + pdesc->type = PD_ABSENT;
> > + return;
> > + }
> > +
> > + pgd = pgd_offset(mm, addr);
> > + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> > + goto out_absent;
> > + pud = pud_offset(pgd, addr);
> > + if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> > + goto out_absent;
> > + pmd = pmd_offset(pud, addr);
> > + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> > + goto out_absent;
> > +#ifdef CONFIG_X86
> > + if (pmd_huge(*pmd)) {
> > + eprintk("page_huge\n");
> > + goto out_unsupported;
> > + }
> > +#endif
>
> I take it you know that this breaks with the 1GB (x86_64) and 16GB (ppc)
> large pages.
>
> Since you have the VMA, why not use is_vm_hugetlb_page()?
Right now I'm checking the VM_HUGETLB flag on VMAs in dump_one_vma().
This check was added for sanity purposes, just to throw out all the cases that
are not supported right now.

Andrey

2008-10-23 08:54:49

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wednesday 22 October 2008 14:46 Louis Rilling wrote:
> On Wed, Oct 22, 2008 at 02:12:12PM +0400, Andrey Mirkin wrote:
> > On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
> > > On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> > > > On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> > > > > Hello Andrey !
> > > > >
> > > > > > diff --git a/arch/x86/kernel/entry_32.S
> > > > > > b/arch/x86/kernel/entry_32.S index 109792b..a4848a3 100644
> > > > > > --- a/arch/x86/kernel/entry_32.S
> > > > > > +++ b/arch/x86/kernel/entry_32.S
> > > > > > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> > > > > > GET_THREAD_INFO(%ebp)
> > > > > > popl %eax
> > > > > > CFI_ADJUST_CFA_OFFSET -4
> > > > > > +ret_from_fork_tail:
> > > > > > pushl $0x0202 # Reset kernel eflags
> > > > > > CFI_ADJUST_CFA_OFFSET 4
> > > > > > popfl
> > > > > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> > > > > > CFI_ENDPROC
> > > > > > END(ret_from_fork)
> > > > > >
> > > > > > +ENTRY(i386_ret_from_resume)
> > > > > > + CFI_STARTPROC
> > > > > > + pushl %eax
> > > > > > + CFI_ADJUST_CFA_OFFSET 4
> > > > > > + call schedule_tail
> > > > > > + GET_THREAD_INFO(%ebp)
> > > > > > + popl %eax
> > > > > > + CFI_ADJUST_CFA_OFFSET -4
> > > > > > + movl (%esp), %eax
> > > > > > + testl %eax, %eax
> > > > > > + jz 1f
> > > > > > + pushl %esp
> > > > > > + call *%eax
> > > > > > + addl $4, %esp
> > > > > > +1:
> > > > > > + addl $256, %esp
> > > > > > + jmp ret_from_fork_tail
> > > > > > + CFI_ENDPROC
> > > > > > +END(i386_ret_from_resume)
> > > > >
> > > > > Could you explain why you need to do this
> > > > >
> > > > > call *%eax
> > > > >
> > > > > is it related to the freezer code ?
> > > >
> > > > It is not related to the freezer code actually.
> > > > That is needed to restart syscalls. Right now I don't have a code in
> > > > my patchset which restarts a syscall, but later I plan to add it. In
> > > > OpenVZ checkpointing we restart syscalls if process was caught in
> > > > syscall during checkpointing.
> > >
> > > Do you checkpoint uninterruptible syscalls as well? If only
> > > interruptible syscalls are checkpointed, I'd say that either this
> > > syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> > > handling code already does the trick, or this syscall does not restart
> > > itself when interrupted, and well, this is life, userspace just sees
> > > -EINTR, which is allowed by the syscall spec.
> > > Actually this is how we checkpoint/migrate tasks in interruptible
> > > syscalls in Kerrighed and this works.
> >
> > We checkpoint only interruptible syscalls. Some syscalls do not restart
> > themself, that is why after restarting a process we restart syscall to
> > complete it.
>
> I guess you do that to avoid breaking application that are badly written
> and do not handle -EINTR correctly with interruptible syscalls. Right?

Right. It is also needed to restart some syscalls (like pause) from the kernel
without returning to user space. Let me explain in more detail. If we went back
through user space, there would be a gap while the process is in user space,
just before it enters the syscall again. During that gap a signal can be
delivered to the process and even handled, so we would miss a signal which was
supposed to interrupt the pause syscall.
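
The plan is to park a small hook in the reserved stack area, which
i386_ret_from_resume then calls before the task can reach user space. A rough
sketch with hypothetical names (this is not code from the current patchset):

/* called via "call *%eax"; the argument is the stack pointer, i.e. the
 * reserved area just below the saved pt_regs */
static void restart_pause_hook(void *frame)
{
	struct pt_regs *regs = task_pt_regs(current);

	/* re-enter the syscall directly from the kernel, so there is no
	 * window in user space where a signal could be handled and the
	 * restarted pause() would then miss it */
	regs->ax = sys_pause();
}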

Andrey

2008-10-23 09:02:35

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wednesday 22 October 2008 19:25 Oren Laadan wrote:
> Andrey Mirkin wrote:
> > On Wednesday 22 October 2008 13:25 Louis Rilling wrote:
> >> On Wed, Oct 22, 2008 at 12:49:54PM +0400, Andrey Mirkin wrote:
> >>> On Monday 20 October 2008 13:23 Cedric Le Goater wrote:
> >>>> Hello Andrey !
> >>>>
> >>>>> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> >>>>> index 109792b..a4848a3 100644
> >>>>> --- a/arch/x86/kernel/entry_32.S
> >>>>> +++ b/arch/x86/kernel/entry_32.S
> >>>>> @@ -225,6 +225,7 @@ ENTRY(ret_from_fork)
> >>>>> GET_THREAD_INFO(%ebp)
> >>>>> popl %eax
> >>>>> CFI_ADJUST_CFA_OFFSET -4
> >>>>> +ret_from_fork_tail:
> >>>>> pushl $0x0202 # Reset kernel eflags
> >>>>> CFI_ADJUST_CFA_OFFSET 4
> >>>>> popfl
> >>>>> @@ -233,6 +234,26 @@ ENTRY(ret_from_fork)
> >>>>> CFI_ENDPROC
> >>>>> END(ret_from_fork)
> >>>>>
> >>>>> +ENTRY(i386_ret_from_resume)
> >>>>> + CFI_STARTPROC
> >>>>> + pushl %eax
> >>>>> + CFI_ADJUST_CFA_OFFSET 4
> >>>>> + call schedule_tail
> >>>>> + GET_THREAD_INFO(%ebp)
> >>>>> + popl %eax
> >>>>> + CFI_ADJUST_CFA_OFFSET -4
> >>>>> + movl (%esp), %eax
> >>>>> + testl %eax, %eax
> >>>>> + jz 1f
> >>>>> + pushl %esp
> >>>>> + call *%eax
> >>>>> + addl $4, %esp
> >>>>> +1:
> >>>>> + addl $256, %esp
> >>>>> + jmp ret_from_fork_tail
> >>>>> + CFI_ENDPROC
> >>>>> +END(i386_ret_from_resume)
> >>>>
> >>>> Could you explain why you need to do this
> >>>>
> >>>> call *%eax
> >>>>
> >>>> is it related to the freezer code ?
> >>>
> >>> It is not related to the freezer code actually.
> >>> That is needed to restart syscalls. Right now I don't have a code in my
> >>> patchset which restarts a syscall, but later I plan to add it.
> >>> In OpenVZ checkpointing we restart syscalls if process was caught in
> >>> syscall during checkpointing.
> >>
> >> Do you checkpoint uninterruptible syscalls as well? If only
> >> interruptible syscalls are checkpointed, I'd say that either this
> >> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> >> handling code already does the trick, or this syscall does not restart
> >> itself when interrupted, and well, this is life, userspace just sees
> >> -EINTR, which is allowed by the syscall spec.
> >> Actually this is how we checkpoint/migrate tasks in interruptible
> >> syscalls in Kerrighed and this works.
> >
> > We checkpoint only interruptible syscalls. Some syscalls do not restart
> > themself, that is why after restarting a process we restart syscall to
> > complete it.
>
> Can you please elaborate on this ? I don't recall having had issues
> with that.

Right now, in the 2.6.18 kernel, we restart the pause, rt_sigtimedwait
and futex syscalls in this way. Recently the futex syscall was reworked, so we
will no longer need such a hook for it.

Andrey

2008-10-23 09:55:38

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Wednesday 22 October 2008 16:47 Cedric Le Goater wrote:
> >>> +ENTRY(i386_ret_from_resume)
> >>> + CFI_STARTPROC
> >>> + pushl %eax
> >>> + CFI_ADJUST_CFA_OFFSET 4
> >>> + call schedule_tail
> >>> + GET_THREAD_INFO(%ebp)
> >>> + popl %eax
> >>> + CFI_ADJUST_CFA_OFFSET -4
> >>> + movl (%esp), %eax
> >>> + testl %eax, %eax
> >>> + jz 1f
> >>> + pushl %esp
> >>> + call *%eax
> >>> + addl $4, %esp
> >>> +1:
> >>> + addl $256, %esp
> >>> + jmp ret_from_fork_tail
> >>> + CFI_ENDPROC
> >>> +END(i386_ret_from_resume)
> >>
> >> Could you explain why you need to do this
> >>
> >> call *%eax
> >>
> >> is it related to the freezer code ?
> >
> > It is not related to the freezer code actually.
> > That is needed to restart syscalls. Right now I don't have a code in my
> > patchset which restarts a syscall, but later I plan to add it.
> > In OpenVZ checkpointing we restart syscalls if process was caught in
> > syscall during checkpointing.
>
> ok. I get it now. why 256 bytes of extra stack ? I'm sure it's not random.
We put a special structure on the stack, which is used at the very end of
the whole restart procedure to restore complex state (ptrace is one such
case). Right now I don't need to use this structure, as we only deal with
simple cases, but the reservation of 256 bytes on the stack is needed for the
future.
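
In terms of rst_restore_registers(), which already does
"tsk->thread.sp -= HOOK_RESERVE", the intended layout is roughly the following
(a hypothetical structure, not part of this patchset yet):

/* lives in the HOOK_RESERVE (256) bytes just below the saved pt_regs;
 * the first word is what i386_ret_from_resume loads into %eax and calls
 * if it is non-NULL */
struct resume_hook_frame {
	void (*hook)(void *frame);			/* NULL: nothing to do */
	char data[HOOK_RESERVE - sizeof(void *)];	/* hook-private state */
};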

Andrey

2008-10-23 10:56:21

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Monday 20 October 2008 17:25 Louis Rilling wrote:
> On Sat, Oct 18, 2008 at 03:11:36AM +0400, Andrey Mirkin wrote:
> > Functions to restart process, restore its state, fpu and registers are
> > added.
>
> [...]
>
> > diff --git a/checkpoint/rst_process.c b/checkpoint/rst_process.c
> > new file mode 100644
> > index 0000000..b9f745e
> > --- /dev/null
> > +++ b/checkpoint/rst_process.c
> > @@ -0,0 +1,277 @@
> > +/*
> > + * Copyright (C) 2008 Parallels, Inc.
> > + *
> > + * Author: Andrey Mirkin <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License as
> > + * published by the Free Software Foundation, version 2 of the
> > + * License.
> > + *
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/fs.h>
> > +#include <linux/file.h>
> > +#include <linux/version.h>
> > +#include <linux/module.h>
> > +
> > +#include "checkpoint.h"
> > +#include "cpt_image.h"
> > +
> > +#define HOOK_RESERVE 256
> > +
> > +struct thr_context {
> > + struct completion complete;
> > + int error;
> > + struct cpt_context *ctx;
> > + struct task_struct *tsk;
> > +};
> > +
> > +int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags,
> > +		pid_t pid)
> > +{
> > + pid_t ret;
> > +
> > + if (current->fs == NULL) {
> > + /* do_fork_pid() hates processes without fs, oopses. */
> > + eprintk("local_kernel_thread: current->fs==NULL\n");
> > + return -EINVAL;
> > + }
> > + if (!try_module_get(THIS_MODULE))
> > + return -EBUSY;
> > + ret = kernel_thread(fn, arg, flags);
> > + if (ret < 0)
> > + module_put(THIS_MODULE);
> > + return ret;
> > +}
> > +
> > +static unsigned int decode_task_flags(unsigned int task_flags)
> > +{
> > + unsigned int flags = 0;
> > +
> > + if (task_flags & (1 << CPT_PF_EXITING))
> > + flags |= PF_EXITING;
> > + if (task_flags & (1 << CPT_PF_FORKNOEXEC))
> > + flags |= PF_FORKNOEXEC;
> > + if (task_flags & (1 << CPT_PF_SUPERPRIV))
> > + flags |= PF_SUPERPRIV;
> > + if (task_flags & (1 << CPT_PF_DUMPCORE))
> > + flags |= PF_DUMPCORE;
> > + if (task_flags & (1 << CPT_PF_SIGNALED))
> > + flags |= PF_SIGNALED;
> > +
> > + return flags;
> > +
> > +}
> > +
> > +int rst_restore_task_struct(struct task_struct *tsk, struct cpt_task_image *ti,
> > +		struct cpt_context *ctx)
> > +{
> > + int i;
> > +
> > + /* Restore only saved flags, comm and tls for now */
> > + tsk->flags = decode_task_flags(ti->cpt_flags);
> > + clear_tsk_thread_flag(tsk, TIF_FREEZE);
> > + memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN);
> > + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> > + tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF;
> > + tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int rst_restore_fpustate(struct task_struct *tsk, struct cpt_task_image *ti,
> > +		struct cpt_context *ctx)
> > +{
> > + struct cpt_obj_bits hdr;
> > + int err;
> > + char *buf;
> > +
> > + clear_stopped_child_used_math(tsk);
> > +
> > + err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx);
> > + if (err < 0)
> > + return err;
> > +
> > + buf = kmalloc(hdr.cpt_size, GFP_KERNEL);
> > + if (!buf)
> > + return -ENOMEM;
> > +
> > + err = ctx->read(buf, hdr.cpt_size, ctx);
> > + if (err)
> > + goto out;
> > +
> > + if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) {
> > + memcpy(&tsk->thread.xstate, buf,
> > + sizeof(struct i387_fxsave_struct));
> > + if (ti->cpt_flags & CPT_PF_USED_MATH)
> > + set_stopped_child_used_math(tsk);
> > + }
> > +#ifndef CONFIG_X86_64
> > + else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD &&
> > + !cpu_has_fxsr) {
> > + memcpy(&tsk->thread.xstate, buf,
> > + sizeof(struct i387_fsave_struct));
> > + if (ti->cpt_flags & CPT_PF_USED_MATH)
> > + set_stopped_child_used_math(tsk);
> > + }
> > +#endif
> > +
> > +out:
> > + kfree(buf);
> > + return err;
> > +}
> > +
> > +static u32 decode_segment(u32 segid)
> > +{
> > + if (segid == CPT_SEG_ZERO)
> > + return 0;
> > +
> > + /* TLS descriptors */
> > + if (segid <= CPT_SEG_TLS3)
> > + return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3;
> > +
> > + /* LDT descriptor, it is just an index to LDT array */
> > + if (segid >= CPT_SEG_LDT)
> > + return ((segid - CPT_SEG_LDT) << 3) | 7;
> > +
> > + /* Check for one of standard descriptors */
> > + if (segid == CPT_SEG_USER32_DS)
> > + return __USER_DS;
> > + if (segid == CPT_SEG_USER32_CS)
> > + return __USER_CS;
> > +
> > + eprintk("Invalid segment reg %d\n", segid);
> > + return 0;
> > +}
> > +
> > +static int rst_restore_registers(struct task_struct *tsk,
> > +		struct cpt_context *ctx)
> > +{
> > + struct cpt_x86_regs ri;
> > + struct pt_regs *regs = task_pt_regs(tsk);
> > + extern char i386_ret_from_resume;
> > + int err;
> > +
> > + err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx);
> > + if (err < 0)
> > + return err;
> > +
> > + tsk->thread.sp = (unsigned long) regs;
> > + tsk->thread.sp0 = (unsigned long) (regs+1);
> > + tsk->thread.ip = (unsigned long) &i386_ret_from_resume;
> > +
> > + tsk->thread.gs = decode_segment(ri.cpt_gs);
> > + tsk->thread.debugreg0 = ri.cpt_debugreg[0];
> > + tsk->thread.debugreg1 = ri.cpt_debugreg[1];
> > + tsk->thread.debugreg2 = ri.cpt_debugreg[2];
> > + tsk->thread.debugreg3 = ri.cpt_debugreg[3];
> > + tsk->thread.debugreg6 = ri.cpt_debugreg[6];
> > + tsk->thread.debugreg7 = ri.cpt_debugreg[7];
> > +
> > + regs->bx = ri.cpt_bx;
> > + regs->cx = ri.cpt_cx;
> > + regs->dx = ri.cpt_dx;
> > + regs->si = ri.cpt_si;
> > + regs->di = ri.cpt_di;
> > + regs->bp = ri.cpt_bp;
> > + regs->ax = ri.cpt_ax;
> > + regs->orig_ax = ri.cpt_orig_ax;
> > + regs->ip = ri.cpt_ip;
> > + regs->flags = ri.cpt_flags;
> > + regs->sp = ri.cpt_sp;
> > +
> > + regs->cs = decode_segment(ri.cpt_cs);
> > + regs->ss = decode_segment(ri.cpt_ss);
> > + regs->ds = decode_segment(ri.cpt_ds);
> > + regs->es = decode_segment(ri.cpt_es);
> > + regs->fs = decode_segment(ri.cpt_fs);
> > +
> > + tsk->thread.sp -= HOOK_RESERVE;
> > + memset((void*)tsk->thread.sp, 0, HOOK_RESERVE);
> > +
> > + return 0;
> > +}
> > +
> > +static int restart_thread(void *arg)
> > +{
> > + struct thr_context *thr_ctx = arg;
> > + struct cpt_context *ctx;
> > + struct cpt_task_image *ti;
> > + int err;
> > +
> > + current->state = TASK_UNINTERRUPTIBLE;
> > +
> > + ctx = thr_ctx->ctx;
> > + ti = kmalloc(sizeof(*ti), GFP_KERNEL);
> > + if (!ti)
> > + return -ENOMEM;
> > +
> > + err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx);
> > + if (!err)
> > + err = rst_restore_task_struct(current, ti, ctx);
> > + /* Restore mm here */
> > + if (!err)
> > + err = rst_restore_fpustate(current, ti, ctx);
> > + if (!err)
> > + err = rst_restore_registers(current, ctx);
> > +
> > + thr_ctx->error = err;
> > + complete(&thr_ctx->complete);
> > +
> > + if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) {
> > + do_exit(ti->cpt_exit_code);
> > + } else {
> > + __set_current_state(TASK_UNINTERRUPTIBLE);
> > + }
> > +
> > + kfree(ti);
> > + schedule();
> > +
> > + eprintk("leaked %d/%d %p\n", task_pid_nr(current),
> > task_pid_vnr(current), current->mm); +
> > + module_put(THIS_MODULE);
>
> I'm sorry, I still do not understand what you are doing with this
> self-module pinning stuff. AFAICS, we should not get here unless there is a
> bug. So the checkpoint module ref count is never decreased, right?
>
> Could you detail what is this self-module pinning for? As I already told
> you, this looks like a bogus solution to avoid unloading the checkpoint
> module during restart.

Actually, right now the module refcount increase/decrease is not needed.
But in some cases the restore work has to be done only after unfreezing the
process. In that case we should grab the refcount during process creation
and put it after this special work is done.
I will rework this place and send it in the next version to make it clearer how
it will be used in the future.
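
The intended pattern is roughly (a sketch, not in this patchset yet):

	if (!try_module_get(THIS_MODULE))	/* taken when the task is created */
		return -EBUSY;
	/*
	 * ... the restarted task runs, gets unfrozen and finishes the
	 * deferred restore work from its hook ...
	 */
	module_put(THIS_MODULE);		/* dropped only after that work */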

Andrey

2008-10-23 13:49:29

by Dave Hansen

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Thu, 2008-10-23 at 13:54 +0400, Andrey Mirkin wrote:
> We are putting special structure on stack, which is used at the very end of
> the whole restart procedure to restore complex states (ptrace is one of such
> cases). Right now I don't need to use this structure as we have a deal with
> simple cases, but reservation of 256 bytes on stack is needed for future.

Wow. So you're saying that, if this patch is accepted, we simply need
to accept that anything being checkpointed will use an extra 256 bytes
of stack? Seems like something to perhaps put in the changelog rather
than some completely undocumented assembly nugget.

-- Dave

2008-10-23 13:51:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm

On Thu, 2008-10-23 at 12:43 +0400, Andrey Mirkin wrote:
> > > +#ifdef CONFIG_X86
> > > + if (pmd_huge(*pmd)) {
> > > + eprintk("page_huge\n");
> > > + goto out_unsupported;
> > > + }
> > > +#endif
> >
> > I take it you know that this breaks with the 1GB (x86_64) and 16GB (ppc)
> > large pages.
> >
> > Since you have the VMA, why not use is_vm_hugetlb_page()?
> Right now I'm checking VM_HUGETLB flag on VMAs in dump_one_vma().
> This checks were added for sanity purpose just to throw out all unsupported
> right now cases.

I'm telling you that it's no good. Not only should this path never be
hit, but, if it is, you'll crash anyway in some cases.

It's a bad check. At best it misleads the reader into thinking that you've
covered your bases.

-- Dave

2008-10-23 13:58:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
>
> > >>> It is not related to the freezer code actually.
> > >>> That is needed to restart syscalls. Right now I don't have a code in my
> > >>> patchset which restarts a syscall, but later I plan to add it.
> > >>> In OpenVZ checkpointing we restart syscalls if process was caught in
> > >>> syscall during checkpointing.
> > >>
> > >> Do you checkpoint uninterruptible syscalls as well? If only
> > >> interruptible syscalls are checkpointed, I'd say that either this
> > >> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> > >> handling code already does the trick, or this syscall does not restart
> > >> itself when interrupted, and well, this is life, userspace just sees
> > >> -EINTR, which is allowed by the syscall spec.
> > >> Actually this is how we checkpoint/migrate tasks in interruptible
> > >> syscalls in Kerrighed and this works.
> > >
> > > We checkpoint only interruptible syscalls. Some syscalls do not restart
> > > themself, that is why after restarting a process we restart syscall to
> > > complete it.
> >
> > Can you please elaborate on this ? I don't recall having had issues
> > with that.
>
> Right now in 2.6.18 kernel we restarts in such a way pause, rt_sigtimedwait
> and futex syscalls. Recently futex syscall was reworked and we will not need
> such hooks for it.

Could you elaborate on this a bit?

If the futex syscall was reworked, perhaps we can do the same for
rt_sigtimedwait() and get rid of this code completely.

-- Dave

2008-10-24 03:57:51

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Thursday 23 October 2008 17:57 Dave Hansen wrote:
> On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
> > > >>> It is not related to the freezer code actually.
> > > >>> That is needed to restart syscalls. Right now I don't have a code
> > > >>> in my patchset which restarts a syscall, but later I plan to add
> > > >>> it. In OpenVZ checkpointing we restart syscalls if process was
> > > >>> caught in syscall during checkpointing.
> > > >>
> > > >> Do you checkpoint uninterruptible syscalls as well? If only
> > > >> interruptible syscalls are checkpointed, I'd say that either this
> > > >> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> > > >> handling code already does the trick, or this syscall does not
> > > >> restart itself when interrupted, and well, this is life, userspace
> > > >> just sees -EINTR, which is allowed by the syscall spec.
> > > >> Actually this is how we checkpoint/migrate tasks in interruptible
> > > >> syscalls in Kerrighed and this works.
> > > >
> > > > We checkpoint only interruptible syscalls. Some syscalls do not
> > > > restart themself, that is why after restarting a process we restart
> > > > syscall to complete it.
> > >
> > > Can you please elaborate on this ? I don't recall having had issues
> > > with that.
> >
> > Right now in 2.6.18 kernel we restarts in such a way pause,
> > rt_sigtimedwait and futex syscalls. Recently futex syscall was reworked
> > and we will not need such hooks for it.
>
> Could you elaborate on this a bit?
>
> If the futex syscall was reworked, perhaps we can do the same for
> rt_sigtimedwait() and get rid of this code completely.

Well, we can try to rework rt_sigtimedwait(), but we will still need this code
in the future to restart the pause syscall from the kernel without returning to
user space. This code will also be needed to restore some complex states.
As for the pause syscall, I have already written to Louis about the problem we
are trying to solve with this code: if we returned to user space, there would
be a gap while the process is in user space just before it enters the syscall
again. During that gap a signal can be delivered to the process and even
handled, so we would miss a signal which was supposed to interrupt the pause
syscall.

Andrey

2008-10-24 04:04:48

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Thursday 23 October 2008 17:49 Dave Hansen wrote:
> On Thu, 2008-10-23 at 13:54 +0400, Andrey Mirkin wrote:
> > We are putting special structure on stack, which is used at the very end
> > of the whole restart procedure to restore complex states (ptrace is one
> > of such cases). Right now I don't need to use this structure as we have a
> > deal with simple cases, but reservation of 256 bytes on stack is needed
> > for future.
>
> Wow. So you're saying that, if this patch is accepted, we simply need
> to accept that anything being checkpointed will use an extra 256 bytes
> of stack? Seems like something to perhaps put in the changelog rather
> than some completely undocumented assembly nugget.

These 256 bytes are used only during the restart procedure and only by our
module. As you can see in i386_ret_from_resume, we give them back, so by the
time the process returns to user space it no longer has the extra 256 bytes
reserved on its stack. I will add information about this to the documentation
and the changelog.

Andrey

2008-10-24 04:07:46

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 06/10] Introduce functions to dump mm

On Thursday 23 October 2008 17:51 Dave Hansen wrote:
> On Thu, 2008-10-23 at 12:43 +0400, Andrey Mirkin wrote:
> > > > +#ifdef CONFIG_X86
> > > > + if (pmd_huge(*pmd)) {
> > > > + eprintk("page_huge\n");
> > > > + goto out_unsupported;
> > > > + }
> > > > +#endif
> > >
> > > I take it you know that this breaks with the 1GB (x86_64) and 16GB
> > > (ppc) large pages.
> > >
> > > Since you have the VMA, why not use is_vm_hugetlb_page()?
> >
> > Right now I'm checking VM_HUGETLB flag on VMAs in dump_one_vma().
> > This checks were added for sanity purpose just to throw out all
> > unsupported right now cases.
>
> I'm telling you that it's no good. Not only should this path never be
> hit. But, if it is, you'll crash anyway in some cases.
>
> It's a bad check. At best it misleads the reader to think that you've
> covered your bases.

Agree, I will rework this.

Andrey

2008-10-24 04:15:01

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 05/10] Introduce function to dump process

On Monday 20 October 2008 15:02 Louis Rilling wrote:
> Hi,
>
> On Sat, Oct 18, 2008 at 03:11:33AM +0400, Andrey Mirkin wrote:
> > Functions to dump task struct, fpu state and registers are added.
> > All IDs are saved from the POV of process (container) namespace.
>
> Just a couple of little comments, in case this series should keep on
> living.
>
> [...]
>
> > diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
> > new file mode 100644
> > index 0000000..58f608d
> > --- /dev/null
> > +++ b/checkpoint/cpt_process.c
> > @@ -0,0 +1,236 @@
> > +/*
> > + * Copyright (C) 2008 Parallels, Inc.
> > + *
> > + * Author: Andrey Mirkin <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License as
> > + * published by the Free Software Foundation, version 2 of the
> > + * License.
> > + *
> > + */
> > +
> > +#include <linux/sched.h>
> > +#include <linux/fs.h>
> > +#include <linux/file.h>
> > +#include <linux/version.h>
> > +#include <linux/nsproxy.h>
> > +
> > +#include "checkpoint.h"
> > +#include "cpt_image.h"
> > +
> > +static unsigned int encode_task_flags(unsigned int task_flags)
> > +{
> > + unsigned int flags = 0;
> > +
> > + if (task_flags & PF_EXITING)
> > + flags |= (1 << CPT_PF_EXITING);
> > + if (task_flags & PF_FORKNOEXEC)
> > + flags |= (1 << CPT_PF_FORKNOEXEC);
> > + if (task_flags & PF_SUPERPRIV)
> > + flags |= (1 << CPT_PF_SUPERPRIV);
> > + if (task_flags & PF_DUMPCORE)
> > + flags |= (1 << CPT_PF_DUMPCORE);
> > + if (task_flags & PF_SIGNALED)
> > + flags |= (1 << CPT_PF_SIGNALED);
> > + if (task_flags & PF_USED_MATH)
> > + flags |= (1 << CPT_PF_USED_MATH);
> > +
> > + return flags;
> > +
> > +}
> > +
> > +int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
> > +{
> > + struct cpt_task_image *t;
> > + int i;
> > + int err;
> > +
> > + t = kzalloc(sizeof(*t), GFP_KERNEL);
> > + if (!t)
> > + return -ENOMEM;
> > +
> > + t->cpt_len = sizeof(*t);
> > + t->cpt_type = CPT_OBJ_TASK;
> > + t->cpt_hdrlen = sizeof(*t);
> > + t->cpt_content = CPT_CONTENT_ARRAY;
> > +
> > + t->cpt_state = tsk->state;
> > + t->cpt_flags = encode_task_flags(tsk->flags);
> > + t->cpt_exit_code = tsk->exit_code;
> > + t->cpt_exit_signal = tsk->exit_signal;
> > + t->cpt_pdeath_signal = tsk->pdeath_signal;
> > + t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns);
> > + t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns);
> > + t->cpt_ppid = tsk->parent ?
> > + task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0;
> > + t->cpt_rppid = tsk->real_parent ?
> > + task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0;
> > + t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns);
> > + t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns);
> > + t->cpt_old_pgrp = 0;
> > + if (tsk->signal->tty_old_pgrp)
> > + t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp);
> > + t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) : 0;
>
> Why pid_vnr() here, and task_*_nr_ns() above? According to the introducing
> comment, I'd expect something like pid_nr_ns(tsk->signal->tty_old_pgrp,
> tsk->nsproxy->pid_ns), and the same for tsk->group_leader.
>
> IIUC, pid_vnr() is correct only if ctx->nsproxy->pid_ns ==
> tsk->nsproxy->pid_ns == current->nsproxy->pid_ns, and I expect current to
> live in a different pid_ns.
>
> Comments?

Oh, you're right here. I have already fixed it in my tree, but accidentally the
wrong patch was sent.
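
For reference, the fix Louis suggests could look roughly like this (a sketch; it
assumes ctx->nsproxy->pid_ns is the container's pid namespace, as elsewhere in
the function):

	/* Sketch: resolve both IDs in the checkpointed pid namespace */
	t->cpt_old_pgrp = 0;
	if (tsk->signal->tty_old_pgrp)
		t->cpt_old_pgrp = pid_nr_ns(tsk->signal->tty_old_pgrp,
					    ctx->nsproxy->pid_ns);
	t->cpt_leader = tsk->group_leader ?
		task_pid_nr_ns(tsk->group_leader, ctx->nsproxy->pid_ns) : 0;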

>
> > + t->cpt_utime = tsk->utime;
> > + t->cpt_stime = tsk->stime;
> > + t->cpt_utimescaled = tsk->utimescaled;
> > + t->cpt_stimescaled = tsk->stimescaled;
> > + t->cpt_gtime = tsk->gtime;
> > + t->cpt_prev_utime = tsk->prev_utime;
> > + t->cpt_prev_stime = tsk->prev_stime;
> > + t->cpt_nvcsw = tsk->nvcsw;
> > + t->cpt_nivcsw = tsk->nivcsw;
> > + t->cpt_start_time = cpt_timespec_export(&tsk->start_time);
> > + t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time);
> > + t->cpt_min_flt = tsk->min_flt;
> > + t->cpt_maj_flt = tsk->maj_flt;
> > + memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN);
> > + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> > + t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) +
> > + tsk->thread.tls_array[i].a;
> > + }
> > + /* TODO: encode thread flags and status like task flags */
> > + t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<<TIF_FREEZE);
> > + t->cpt_thrstatus = task_thread_info(tsk)->status;
> > + t->cpt_user = tsk->user->uid;
> > + t->cpt_uid = tsk->uid;
> > + t->cpt_euid = tsk->euid;
> > + t->cpt_suid = tsk->suid;
> > + t->cpt_fsuid = tsk->fsuid;
> > + t->cpt_gid = tsk->gid;
> > + t->cpt_egid = tsk->egid;
> > + t->cpt_sgid = tsk->sgid;
> > + t->cpt_fsgid = tsk->fsgid;
> > +
> > + err = ctx->write(t, sizeof(*t), ctx);
> > +
> > + kfree(t);
> > + return err;
> > +}
> > +
> > +static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context *ctx)
> > +{
> > + struct cpt_obj_bits hdr;
> > + int err;
> > + int content;
> > + unsigned long size;
> > +
> > + content = CPT_CONTENT_X86_FPUSTATE;
> > + size = sizeof(struct i387_fxsave_struct);
> > +#ifndef CONFIG_X86_64
> > + if (!cpu_has_fxsr) {
> > + size = sizeof(struct i387_fsave_struct);
> > + content = CPT_CONTENT_X86_FPUSTATE_OLD;
> > + }
> > +#endif
> > +
> > + hdr.cpt_len = sizeof(hdr) + size;
> > + hdr.cpt_type = CPT_OBJ_BITS;
> > + hdr.cpt_hdrlen = sizeof(hdr);
> > + hdr.cpt_content = content;
> > + hdr.cpt_size = size;
> > + err = ctx->write(&hdr, sizeof(hdr), ctx);
> > + if (!err)
> > + ctx->write(tsk->thread.xstate, size, ctx);
>
> Should check the error code of the line above, right?
Right, thanks!

[...]

> > +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
> > +{
> > + int err;
> > +
> > + err = cpt_dump_task_struct(tsk, ctx);
> > +
> > + /* Dump task mm */
> > +
> > + if (!err)
> > + cpt_dump_fpustate(tsk, ctx);
>
> error checking...
>
> > + if (!err)
> > + cpt_dump_registers(tsk, ctx);
>
> error checking...
Thanks again, will fix it.
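
Something along these lines (a sketch of the error-checked chaining, using the
same helpers as in the patch):

	int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
	{
		int err;

		err = cpt_dump_task_struct(tsk, ctx);

		/* Dump task mm */

		if (!err)
			err = cpt_dump_fpustate(tsk, ctx);
		if (!err)
			err = cpt_dump_registers(tsk, ctx);

		return err;
	}

	/* and in cpt_dump_fpustate(): */
	err = ctx->write(&hdr, sizeof(hdr), ctx);
	if (!err)
		err = ctx->write(tsk->thread.xstate, size, ctx);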


Andrey

2008-10-24 04:49:41

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 05/10] Introduce function to dump process

On Monday 20 October 2008 21:48 Serge E. Hallyn wrote:
> Quoting Andrey Mirkin ([email protected]):
> > + t->cpt_uid = tsk->uid;
> > + t->cpt_euid = tsk->euid;
> > + t->cpt_suid = tsk->suid;
> > + t->cpt_fsuid = tsk->fsuid;
> > + t->cpt_gid = tsk->gid;
> > + t->cpt_egid = tsk->egid;
> > + t->cpt_sgid = tsk->sgid;
> > + t->cpt_fsgid = tsk->fsgid;
>
> I don't see where any of these are restored. (Obviously, I wanted
> to think about how you're verifying the restarter's authorization
> to do so)

Well, right now I don't use them during restore, to simplify the restart procedure
and make it clearer for reviewers. In OpenVZ we do the whole restart procedure
with root's privileges and rely on the fact that all such IDs will be the same
during restart (as we are restarting a container and its file system will be the
same during restart).
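
If/when these IDs are restored, the restart path will also need an authorization
check along these lines (a rough sketch against the pre-cred task fields this
patchset already uses; it is not part of the current code):

	/* Sketch: only privileged callers may restore arbitrary IDs */
	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;	/* or verify each saved ID against current's */

	tsk->uid  = t->cpt_uid;
	tsk->euid = t->cpt_euid;
	tsk->gid  = t->cpt_gid;
	tsk->egid = t->cpt_egid;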

Andrey

2008-10-25 21:12:18

by Oren Laadan

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process



Andrey Mirkin wrote:
> On Thursday 23 October 2008 17:57 Dave Hansen wrote:
>> On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
>>>>>>> It is not related to the freezer code actually.
>>>>>>> That is needed to restart syscalls. Right now I don't have a code
>>>>>>> in my patchset which restarts a syscall, but later I plan to add
>>>>>>> it. In OpenVZ checkpointing we restart syscalls if process was
>>>>>>> caught in syscall during checkpointing.
>>>>>> Do you checkpoint uninterruptible syscalls as well? If only
>>>>>> interruptible syscalls are checkpointed, I'd say that either this
>>>>>> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
>>>>>> handling code already does the trick, or this syscall does not
>>>>>> restart itself when interrupted, and well, this is life, userspace
>>>>>> just sees -EINTR, which is allowed by the syscall spec.
>>>>>> Actually this is how we checkpoint/migrate tasks in interruptible
>>>>>> syscalls in Kerrighed and this works.
>>>>> We checkpoint only interruptible syscalls. Some syscalls do not
>>>>> restart themself, that is why after restarting a process we restart
>>>>> syscall to complete it.
>>>> Can you please elaborate on this ? I don't recall having had issues
>>>> with that.
>>> Right now in 2.6.18 kernel we restarts in such a way pause,
>>> rt_sigtimedwait and futex syscalls. Recently futex syscall was reworked
>>> and we will not need such hooks for it.
>> Could you elaborate on this a bit?
>>
>> If the futex syscall was reworked, perhaps we can do the same for
>> rt_sigtimedwait() and get rid of this code completely.
>
> Well, we can try to rework rt_sigtimedwait(), but we will still need this code
> in the future to restart pause syscall from kernel without returning to user
> space. Also this code will be needed to restore some complex states.
> As concerns pause syscall I have already written to Louis about the problem we
> are trying to solve with this code. There is a gap when process will be in
> user space just before entering syscall again. At this time a signal can be
> delivered to process and it even can be handled. So, we will miss a signal
> which must interrupt pause syscall.

I'm not convinced that a real race exists, and even if it does, I'm not
convinced that hacking the assembly entry/exit code is the best way to do it.

Let me explain:

You are concerned about a race in which a signal is delivered to a task
that resumes from restart to user space and is about to (re)invoke 'pause()'
(because the restart so arranged its EIP and registers).

This almost always means that the user code is buggy and relies on specific
scheduling, because you can usually find a scheduling (without the C/R) where
the intended recipient of the signal was delayed and only calls pause() after
the signal is delivered.

For instance, if the sequence of events is:
A calls pause() -> checkpoint -> restart ->
B signals A -> A calls pause() (after restart),
then the following sequence is possible(*) without C/R:
B signals A -> A calls pause()
because normally B cannot assume anything about when A actually, really,
is suspended (which means the programmer did an imperfect job).

I said "almost always" and "usually", because there is one case where the
alternative schedule: task B could, prior to sending the signal, "ensure"
that task A is already sleeping within the 'pause()' syscall. While this
is possible, it is definitely unusual, and in fact I never code that does
that. And what if the sysadmin send SIGSTOP followed by SIGCONT ? In
short, such code is simply broken.

More importantly, if you think about the operation and semantics of the
freezer cgroup - similar behavior is to be expected when you freeze and
then thaw a container.

Specifically, when you freeze the container that has a task in sys_pause(),
then that task will abort the syscall and become frozen. As soon as it becomes
unfrozen, it will return to user space (with the EIP "rewinded") only to
re-invoke the syscall. So the same "race" remains even if you only freeze
and then thaw, regardless of C/R.

Moreover, I argue that basically when you return from a sys_restart(), the
entire container should, by default, remain in frozen state - just like it
is with sys_checkpoint(). An explicit thaw will make the container resume
execution.

Therefore, there are two options: the first is to decide that this behavior
- going back to user space to re-invoke the syscall - is valid. In this case
you don't need a special hack for returning from sys_restart(). The second
option is to decide that it is broken, in which case you need to also fix
the freezer code. Personally, I think that this behavior is valid and need
not be fixed.

Finally, even if you do want to fix the behavior for this pathological case,
I don't see why you'd want to do it in this manner. Instead, you can add a
simple test prior to returning from sys_restart(), something like this:

...
/* almost done: now handle special cases: */
if (our last syscall == __NR_pause) {
ret = sys_pause();
} else if (our last syscall == __NR_futex) {
do some stuff;
ret = sys_futex();
} else {
ret = what-we-want-to-return
}
/* finally, return to user space */
return ret;
}
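
To make that concrete, a rough C sketch of the same dispatch (ctx->last_syscall,
the futex argument names, and what_we_want_to_return are placeholders, not
anything in the patchset):

	/* Sketch only: dispatch on the syscall the task was blocked in */
	switch (ctx->last_syscall) {
	case __NR_pause:
		ret = sys_pause();
		break;
	case __NR_futex:
		/* re-derive uaddr/op/val etc. from the saved registers */
		ret = sys_futex(uaddr, op, val, timeout, uaddr2, val3);
		break;
	default:
		ret = what_we_want_to_return;
		break;
	}
	/* finally, return to user space */
	return ret;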

I'm not quite sure what other "complex states" you refer to; but I wonder
whether that code "needed to restore some complex states" could not be
implemented along the same idea.

The upside is clear: the code is less obscure, simple to debug, and not
architecture-dependent. (hehe .. it even runs faster because it saves a
whole kernel->user->kernel switch, what do you know !).

Oren.

2008-10-27 15:58:38

by Oren Laadan

[permalink] [raw]
Subject: Re: [PATCH 10/10] Add support for multiple processes



Andrey Mirkin wrote:
> The whole tree of processes can be checkpointed and restarted now.
> Shared objects are not supported yet.
>
> Signed-off-by: Andrey Mirkin <[email protected]>
> ---
> checkpoint/cpt_image.h | 2 +
> checkpoint/cpt_process.c | 24 +++++++++++++
> checkpoint/rst_process.c | 85 +++++++++++++++++++++++++++-------------------
> 3 files changed, 76 insertions(+), 35 deletions(-)
>
> diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
> index e1fb483..f370df2 100644
> --- a/checkpoint/cpt_image.h
> +++ b/checkpoint/cpt_image.h
> @@ -128,6 +128,8 @@ struct cpt_task_image {
> __u64 cpt_nivcsw;
> __u64 cpt_min_flt;
> __u64 cpt_maj_flt;
> + __u32 cpt_children_num;
> + __u32 cpt_pad;
> } __attribute__ ((aligned (8)));
>
> struct cpt_mm_image {
> diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
> index 1f7a54b..d73ec3c 100644
> --- a/checkpoint/cpt_process.c
> +++ b/checkpoint/cpt_process.c
> @@ -40,6 +40,19 @@ static unsigned int encode_task_flags(unsigned int task_flags)
>
> }
>
> +static int cpt_count_children(struct task_struct *tsk, struct cpt_context *ctx)
> +{
> + int num = 0;
> + struct task_struct *child;
> +
> + list_for_each_entry(child, &tsk->children, sibling) {
> + if (child->parent != tsk)
> + continue;
> + num++;
> + }
> + return num;
> +}
> +

I noticed that you don't take the appropriate locks when browsing through
the task lists (siblings, children, global list). Although I realize that
the container should be frozen at this time, I keep wondering if this
is indeed always safe.

For instance, are you protected against an OOM killer that might just
occur uninvited and kill one of those tasks ?

Can the administrator force an un-freeze of the container ? Or could perhaps
some error condition in the kernel cause that ?

> int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
> {
> struct cpt_task_image *t;
> @@ -102,6 +115,7 @@ int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
> t->cpt_egid = tsk->egid;
> t->cpt_sgid = tsk->sgid;
> t->cpt_fsgid = tsk->fsgid;
> + t->cpt_children_num = cpt_count_children(tsk, ctx);
>
> err = ctx->write(t, sizeof(*t), ctx);
>
> @@ -231,6 +245,16 @@ int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
> err = cpt_dump_fpustate(tsk, ctx);
> if (!err)
> err = cpt_dump_registers(tsk, ctx);
> + if (!err) {
> + struct task_struct *child;
> + list_for_each_entry(child, &tsk->children, sibling) {
> + if (child->parent != tsk)
> + continue;
> + err = cpt_dump_task(child, ctx);
> + if (err)
> + break;

Here too.

[...]

Oren.

2008-10-29 14:54:44

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process

On Sunday 26 October 2008 01:10 Oren Laadan wrote:
> Andrey Mirkin wrote:
> > On Thursday 23 October 2008 17:57 Dave Hansen wrote:
> >> On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
> >>>>>>> It is not related to the freezer code actually.
> >>>>>>> That is needed to restart syscalls. Right now I don't have a code
> >>>>>>> in my patchset which restarts a syscall, but later I plan to add
> >>>>>>> it. In OpenVZ checkpointing we restart syscalls if process was
> >>>>>>> caught in syscall during checkpointing.
> >>>>>>
> >>>>>> Do you checkpoint uninterruptible syscalls as well? If only
> >>>>>> interruptible syscalls are checkpointed, I'd say that either this
> >>>>>> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
> >>>>>> handling code already does the trick, or this syscall does not
> >>>>>> restart itself when interrupted, and well, this is life, userspace
> >>>>>> just sees -EINTR, which is allowed by the syscall spec.
> >>>>>> Actually this is how we checkpoint/migrate tasks in interruptible
> >>>>>> syscalls in Kerrighed and this works.
> >>>>>
> >>>>> We checkpoint only interruptible syscalls. Some syscalls do not
> >>>>> restart themself, that is why after restarting a process we restart
> >>>>> syscall to complete it.
> >>>>
> >>>> Can you please elaborate on this ? I don't recall having had issues
> >>>> with that.
> >>>
> >>> Right now in 2.6.18 kernel we restarts in such a way pause,
> >>> rt_sigtimedwait and futex syscalls. Recently futex syscall was reworked
> >>> and we will not need such hooks for it.
> >>
> >> Could you elaborate on this a bit?
> >>
> >> If the futex syscall was reworked, perhaps we can do the same for
> >> rt_sigtimedwait() and get rid of this code completely.
> >
> > Well, we can try to rework rt_sigtimedwait(), but we will still need this
> > code in the future to restart pause syscall from kernel without returning
> > to user space. Also this code will be needed to restore some complex
> > states. As concerns pause syscall I have already written to Louis about
> > the problem we are trying to solve with this code. There is a gap when
> > process will be in user space just before entering syscall again. At this
> > time a signal can be delivered to process and it even can be handled. So,
> > we will miss a signal which must interrupt pause syscall.
>
> I'm not convinced that a real race exists, and even if it does, I'm not
> convinced that hacking the assembly entry/exit code is the best way to do
> it.

Well, as I already said, the pause() syscall is not the only case where we need
to do some additional work in that place.

> Let me explain:
>
> You are concerned about a race in which a signal is delivered to a task
> that resumes from restart to user space and is about to (re)invoke
> 'pause()' (because the restart so arranged its EIP and registers).
>
> This almost always means that the user code is buggy and relies on specific
> scheduling, because you can usually find a scheduling (without the C/R)
> where the intended recipient of the signal was delayed and only calls
> pause() after the signal is delivered.
>
> For instance, if the sequence of events is:
> A calls pause() -> checkpoint -> restart ->
> B signals A -> A calls pause() (after restart),
> then the following sequence is possible(*) without C/R:
> B signals A -> A calls pause()
> because normally B cannot assume anything about when A is actually,
> really, is suspended (which means the programmer did an imperfect job).

You're right here. Both sequences are possible in theory. You will be surprised,
but in practice we found out that the probability of missing a signal in the case
of C/R is much higher than during ordinary execution.

> I said "almost always" and "usually", because there is one case where the
> alternative schedule: task B could, prior to sending the signal, "ensure"
> that task A is already sleeping within the 'pause()' syscall. While this
> is possible, it is definitely unusual, and in fact I never code that does
> that. And what if the sysadmin send SIGSTOP followed by SIGCONT ? In
> short, such code is simply broken.
>
> More importantly, if you think about the operation and semantics of the
> freezer cgroup - similar behavior is to be expected when you freeze and
> then thaw a container.
>
> Specifically, when you freeze the container that has a task in sys_pause(),
> then that task will abort the syscall and become frozen. As soon as it becomes
> unfrozen, it will return to user space (with the EIP "rewinded") only to
> re-invoke the syscall. So the same "race" remains even if you only freeze
> and then thaw, regardless of C/R.

Exactly. But during freeze/unfreeze the probability of hitting such a situation
is very low. In our tests we tried to checkpoint/restart the LTP tests, and this
"race" was triggered during restart in almost 100% of the tests.

> Moreover, I argue that basically when you return from a sys_restart(), the
> entire container should, by default, remain in frozen state - just like it
> is with sys_checkpoint(). An explicit thaw will make the container resume
> execution.

No doubt. That is how we do it in OpenVZ: after restart the container remains
frozen, and we need to thaw it to resume its execution.

> Therefore, there are two options: the first is to decide that this behavior
> - going back to user space to re-invoke the syscall - is valid. In this
> case you don't need a special hack for returning from sys_restart(). The
> second option is to decide that it is broken, in which case you need to
> also fix the freezer code. Personally, I think that this behavior is valid
> and need not be fixed.

I still believe that we need to fix such behaviour during restart, as in
practice it is very easy to trigger.

> Finally, even if you do want to fix the behavior for this pathologic case,
> I don't see why you'd want to do it in this manner. Instead, you can add a
> simple test prior to returning from sys_restart(), something like this:
>
> ...
> /* almost done: now handle special cases: */
> if (our last syscall == __NR_pause) {
> ret = sys_pause();
> } else if (our last syscall == __NR_futex) {
> do some stuff;
> ret = sys_futex();
> } else {
> ret = what-we-want-to-return
> }
> /* finally, return to user space */
> return ret;
> }

This only works if we do not want to stay in the frozen state after restart. Or
am I missing something?

> I'm not quite sure what other "complex states" you refer to; but I wonder
> whether that code "needed to restore some complex states" could not be
> implemented along the same idea.

In the same manner we restore, for instance, ptrace.

> The upside is clear: the code is less obscure, simple to debug, and not
> architecture-dependent. (hehe .. it even runs faster because it saves a
> whole kernel->user->kernel switch, what do you know !).
In our case we also do not need a switch, and the code is actually not very
complicated.

Andrey

2008-10-29 15:30:13

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 03/10] Introduce context structure needed during checkpointing/restart

On Monday 20 October 2008 21:02 Dave Hansen wrote:
> On Sat, 2008-10-18 at 03:11 +0400, Andrey Mirkin wrote:
> > +typedef struct cpt_context
> > +{
> > + pid_t pid; /* should be changed to ctid later */
> > + int ctx_id; /* context id */
> > + struct list_head ctx_list;
> > + int refcount;
> > + int ctx_state;
> > + struct semaphore main_sem;
>
> Does this really need to be a semaphore or is a mutex OK?
Actually a mutex is enough here.
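
Something like this (a sketch of the switch to struct mutex; everything else in
the structure stays as it is):

	#include <linux/mutex.h>

	typedef struct cpt_context {
		pid_t pid;		/* should be changed to ctid later */
		int ctx_id;		/* context id */
		struct list_head ctx_list;
		int refcount;
		int ctx_state;
		struct mutex main_mutex;	/* was: struct semaphore main_sem */
		/* remaining fields unchanged */
	} cpt_context_t;

	/* and in context_alloc(): */
	mutex_init(&ctx->main_mutex);		/* was: init_MUTEX(&ctx->main_sem) */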

> > + int errno;
>
> Could you hold off on adding these things to the struct until the patch
> where they're actually used? It's hard to judge this without seeing
> what you do with it.
I will try not to introduce variables and functions before the patches where
they are actually used.

>
> > + struct file *file;
> > + loff_t current_object;
> > +
> > + struct list_head object_array[CPT_OBJ_MAX];
> > +
> > + int (*write)(const void *addr, size_t count, struct cpt_context *ctx);
> > + int (*read)(void *addr, size_t count, struct cpt_context *ctx);
> > +} cpt_context_t;
>
> Man, this is hard to review. I was going to try and make sure that your
> refcounting was right and atomic, but there's no use of it in this patch
> except for the initialization and accessor functions. Darn.
For simplicity I will throw out all this stuff completely.

>
> > +extern int debug_level;
>
> I'm going to go out on a limb here and say that "debug_level" is
> probably a wee bit too generic of a variable name.
I will change it to something else.

>
> > +#define cpt_printk(lvl, fmt, args...) do { \
> > + if (lvl <= debug_level) \
> > + printk(fmt, ##args); \
> > + } while (0)
>
> I think you can use pr_debug() here, too, just like Oren did.
Will switch to pr_debug().
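
A sketch of that change (pr_debug() only prints when DEBUG is defined, so the
debug_level knob goes away; the "cpt: " prefix is just illustrative):

	#define cpt_debug(fmt, args...)	pr_debug("cpt: " fmt, ##args)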

>
> > +struct cpt_context * context_alloc(void)
> > +{
> > + struct cpt_context *ctx;
> > + int i;
> > +
> > + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> > + if (!ctx)
> > + return NULL;
> > +
> > + init_MUTEX(&ctx->main_sem);
> > + ctx->refcount = 1;
> > +
> > + ctx->current_object = -1;
> > + ctx->write = file_write;
> > + ctx->read = file_read;
> > + for (i = 0; i < CPT_OBJ_MAX; i++) {
> > + INIT_LIST_HEAD(&ctx->object_array[i]);
> > + }
> > +
> > + return ctx;
> > +}
> > +
> > +void context_release(struct cpt_context *ctx)
> > +{
> > + ctx->ctx_state = CPT_CTX_ERROR;
> > +
> > + kfree(ctx);
> > +}
> > +
> > +static void context_put(struct cpt_context *ctx)
> > +{
> > + if (!--ctx->refcount)
> > + context_release(ctx);
> > +}
> > +
> > static int checkpoint(pid_t pid, int fd, unsigned long flags)
> > {
> > - return -ENOSYS;
> > + struct file *file;
> > + struct cpt_context *ctx;
> > + int err;
> > +
> > + err = -EBADF;
> > + file = fget(fd);
> > + if (!file)
> > + goto out;
> > +
> > + err = -ENOMEM;
> > + ctx = context_alloc();
> > + if (!ctx)
> > + goto out_file;
> > +
> > + ctx->file = file;
> > + ctx->ctx_state = CPT_CTX_DUMPING;
> > +
> > + /* checkpoint */
> > + err = -ENOSYS;
> > +
> > + context_put(ctx);
> > +
> > +out_file:
> > + fput(file);
> > +out:
> > + return err;
> > }
>
> So, where is context_get()? Is there only single-threaded access to the
> refcount? If so, why do we even need it? We should probably just use
> context_release() directly.
The idea is that in the future we should be able to keep a context around for
incremental checkpointing. That is why we need context get/put functions. Right
now they are not used, so I will drop them.

> If there is multithreaded access to context_put() or the refcount, then
> they're unsafe without additional locking.
Access to refcount will be protected with context mutex.
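
As an aside, a kref would make get/put safe without having to take the context
mutex at all (a sketch, not what the current patch does):

	#include <linux/kref.h>

	struct cpt_context {
		struct kref ref;
		/* other fields as before */
	};

	static void context_release(struct kref *kref)
	{
		struct cpt_context *ctx = container_of(kref, struct cpt_context, ref);

		kfree(ctx);
	}

	static inline void context_get(struct cpt_context *ctx)
	{
		kref_get(&ctx->ref);
	}

	static inline void context_put(struct cpt_context *ctx)
	{
		kref_put(&ctx->ref, context_release);
	}

	/* in context_alloc(): kref_init(&ctx->ref); */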

Thanks for comments.

Actually I'm not sure if I will continue with my own patch set, but I will
take all your comments into account while porting my functionality to Oren's
tree.

Andrey

2008-10-30 04:54:44

by Andrey Mirkin

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 10/10] Add support for multiple processes

On Monday 27 October 2008 18:58 Oren Laadan wrote:
> Andrey Mirkin wrote:
> > The whole tree of processes can be checkpointed and restarted now.
> > Shared objects are not supported yet.
> >
> > Signed-off-by: Andrey Mirkin <[email protected]>
> > ---
> > checkpoint/cpt_image.h | 2 +
> > checkpoint/cpt_process.c | 24 +++++++++++++
> > checkpoint/rst_process.c | 85 +++++++++++++++++++++++++++-------------------
> > 3 files changed, 76 insertions(+), 35 deletions(-)
> >
> > diff --git a/checkpoint/cpt_image.h b/checkpoint/cpt_image.h
> > index e1fb483..f370df2 100644
> > --- a/checkpoint/cpt_image.h
> > +++ b/checkpoint/cpt_image.h
> > @@ -128,6 +128,8 @@ struct cpt_task_image {
> > __u64 cpt_nivcsw;
> > __u64 cpt_min_flt;
> > __u64 cpt_maj_flt;
> > + __u32 cpt_children_num;
> > + __u32 cpt_pad;
> > } __attribute__ ((aligned (8)));
> >
> > struct cpt_mm_image {
> > diff --git a/checkpoint/cpt_process.c b/checkpoint/cpt_process.c
> > index 1f7a54b..d73ec3c 100644
> > --- a/checkpoint/cpt_process.c
> > +++ b/checkpoint/cpt_process.c
> > @@ -40,6 +40,19 @@ static unsigned int encode_task_flags(unsigned int task_flags)
> >
> > }
> >
> > +static int cpt_count_children(struct task_struct *tsk, struct cpt_context *ctx)
> > +{
> > + int num = 0;
> > + struct task_struct *child;
> > +
> > + list_for_each_entry(child, &tsk->children, sibling) {
> > + if (child->parent != tsk)
> > + continue;
> > + num++;
> > + }
> > + return num;
> > +}
> > +
>
> I noticed that you don't take the appropriate locks when browsing through
> the task lists (siblings, children, global list). Although I realize that
> the container should be frozen at this time, I keep wondering if this
> is indeed always safe.
You're right here. We need to take tasklist_lock to be sure that everything
will be consistent.

> For instance, are you protected against an OOM killer that might just
> occur uninvited and kill one of those tasks ?

The OOM killer can't kill one of those tasks, as all of them should be frozen
and in uninterruptible state at that time. So we do not need to think about OOM
there. But anyway you're right, and we need locking around browsing the tasks
list.
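
Something like this (a sketch of the same loop under tasklist_lock):

	static int cpt_count_children(struct task_struct *tsk, struct cpt_context *ctx)
	{
		int num = 0;
		struct task_struct *child;

		read_lock(&tasklist_lock);
		list_for_each_entry(child, &tsk->children, sibling) {
			if (child->parent != tsk)
				continue;
			num++;
		}
		read_unlock(&tasklist_lock);

		return num;
	}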

> Can the administrator force an un-freeze of the container ? Or could perhaps
> some error condition in the kernel cause that ?

The main idea is that the context should be protected with a mutex, so that only
one process at a time can perform an operation (freeze, dump, unfreeze, kill) on
a container. Right now this is not implemented at all, but it will be added in
the future.

Andrey

> > int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
> > {
> > struct cpt_task_image *t;
> > @@ -102,6 +115,7 @@ int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx)
> > t->cpt_egid = tsk->egid;
> > t->cpt_sgid = tsk->sgid;
> > t->cpt_fsgid = tsk->fsgid;
> > + t->cpt_children_num = cpt_count_children(tsk, ctx);
> >
> > err = ctx->write(t, sizeof(*t), ctx);
> >
> > @@ -231,6 +245,16 @@ int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx)
> > err = cpt_dump_fpustate(tsk, ctx);
> > if (!err)
> > err = cpt_dump_registers(tsk, ctx);
> > + if (!err) {
> > + struct task_struct *child;
> > + list_for_each_entry(child, &tsk->children, sibling) {
> > + if (child->parent != tsk)
> > + continue;
> > + err = cpt_dump_task(child, ctx);
> > + if (err)
> > + break;
>
> Here too.
>
> [...]
>
> Oren.

2008-10-30 16:00:47

by Oren Laadan

[permalink] [raw]
Subject: Re: [Devel] Re: [PATCH 08/10] Introduce functions to restart a process



Andrey Mirkin wrote:
> On Sunday 26 October 2008 01:10 Oren Laadan wrote:
>> Andrey Mirkin wrote:
>>> On Thursday 23 October 2008 17:57 Dave Hansen wrote:
>>>> On Thu, 2008-10-23 at 13:00 +0400, Andrey Mirkin wrote:
>>>>>>>>> It is not related to the freezer code actually.
>>>>>>>>> That is needed to restart syscalls. Right now I don't have a code
>>>>>>>>> in my patchset which restarts a syscall, but later I plan to add
>>>>>>>>> it. In OpenVZ checkpointing we restart syscalls if process was
>>>>>>>>> caught in syscall during checkpointing.
>>>>>>>> Do you checkpoint uninterruptible syscalls as well? If only
>>>>>>>> interruptible syscalls are checkpointed, I'd say that either this
>>>>>>>> syscall uses ERESTARTSYS or ERESTART_RESTARTBLOCK, and then signal
>>>>>>>> handling code already does the trick, or this syscall does not
>>>>>>>> restart itself when interrupted, and well, this is life, userspace
>>>>>>>> just sees -EINTR, which is allowed by the syscall spec.
>>>>>>>> Actually this is how we checkpoint/migrate tasks in interruptible
>>>>>>>> syscalls in Kerrighed and this works.
>>>>>>> We checkpoint only interruptible syscalls. Some syscalls do not
>>>>>>> restart themself, that is why after restarting a process we restart
>>>>>>> syscall to complete it.
>>>>>> Can you please elaborate on this ? I don't recall having had issues
>>>>>> with that.
>>>>> Right now in 2.6.18 kernel we restarts in such a way pause,
>>>>> rt_sigtimedwait and futex syscalls. Recently futex syscall was reworked
>>>>> and we will not need such hooks for it.
>>>> Could you elaborate on this a bit?
>>>>
>>>> If the futex syscall was reworked, perhaps we can do the same for
>>>> rt_sigtimedwait() and get rid of this code completely.
>>> Well, we can try to rework rt_sigtimedwait(), but we will still need this
>>> code in the future to restart pause syscall from kernel without returning
>>> to user space. Also this code will be needed to restore some complex
>>> states. As concerns pause syscall I have already written to Louis about
>>> the problem we are trying to solve with this code. There is a gap when
>>> process will be in user space just before entering syscall again. At this
>>> time a signal can be delivered to process and it even can be handled. So,
>>> we will miss a signal which must interrupt pause syscall.
>> I'm not convinced that a real race exists, and even if it does, I'm not
>> convinced that hacking the assembly entry/exit code is the best way to do
>> it.
>
> Well, as I already said, the pause() syscall is not the only case where we need
> to do some additional work in that place.
>
>> Let me explain:
>>
>> You are concerned about a race in which a signal is delivered to a task
>> that resumes from restart to user space and is about to (re)invoke
>> 'pause()' (because the restart so arranged its EIP and registers).
>>
>> This almost always means that the user code is buggy and relies on specific
>> scheduling, because you can usually find a scheduling (without the C/R)
>> where the intended recipient of the signal was delayed and only calls
>> pause() after the signal is delivered.
>>
>> For instance, if the sequence of events is:
>> A calls pause() -> checkpoint -> restart ->
>> B signals A -> A calls pause() (after restart),
>> then the following sequence is possible(*) without C/R:
>> B signals A -> A calls pause()
>> because normally B cannot assume anything about when A is actually,
>> really, is suspended (which means the programmer did an imperfect job).
>
> You're right here. Both sequences are possible in theory. You will be surprised,
> but in practice we found out that the probability of missing a signal in the case
> of C/R is much higher than during ordinary execution.

The point is that "missing" a signal because of freeze/thaw (or stop/cont)
is perfectly acceptable behavior. It is supposed to work that way. So I
argue that we don't need a workaround.

>
>> I said "almost always" and "usually", because there is one case where the
>> alternative schedule: task B could, prior to sending the signal, "ensure"
>> that task A is already sleeping within the 'pause()' syscall. While this
>> is possible, it is definitely unusual, and in fact I never code that does
>> that. And what if the sysadmin send SIGSTOP followed by SIGCONT ? In
>> short, such code is simply broken.
>>
>> More importantly, if you think about the operation and semantics of the
>> freezer cgroup - similar behavior is to be expected when you freeze and
>> then thaw a container.
>>
>> Specifically, when you freeze the container that has a task in sys_pause(),
>> then that task will abort the syscall and become frozen. As soon as it becomes
>> unfrozen, it will return to user space (with the EIP "rewinded") only to
>> re-invoke the syscall. So the same "race" remains even if you only freeze
>> and then thaw, regardless of C/R.
>
> Exactly. But during freeze/unfreeze the probability of hitting such a situation
> is very low. In our tests we tried to checkpoint/restart the LTP tests, and this
> "race" was triggered during restart in almost 100% of the tests.

Why is that the case ?

At the end of restart, the container remains frozen until you unfreeze it.
So c/r effectively becomes freeze/unfreeze, except - possibly - for page
faults that are more likely to happen when thawing following the restart
(and these may slow down the application and allow a signal to "slip in").

If it's nearly 100% of the tests, then it should be easily reproducible
with mere freeze/thaw pairs, no ?

Still, arguing that LTP "breaks" here is like arguing that LTP "breaks"
if we were to run it while sending SIGSTOP/SIGCONT to its processes...

>
>> Moreover, I argue that basically when you return from a sys_restart(), the
>> entire container should, by default, remain in frozen state - just like it
>> is with sys_checkpoint(). An explicit thaw will make the container resume
>> execution.
>
> No doubt. That is how we do in OpenVZ, after restart the container remains
> frozen. And we need to thaw it to resume its execution.
>
>> Therefore, there are two options: the first is to decide that this behavior
>> - going back to user space to re-invoke the syscall - is valid. In this
>> case you don't need a special hack for returning from sys_restart(). The
>> second option is to decide that it is broken, in which case you need to
>> also fix the freezer code. Personally, I think that this behavior is valid
>> and need not be fixed.
>
> I still believe that we need to fix such behaviour during restart as in
> practice it is very easy to trigger it.

Did you see any problem outside LTP ? I never saw any such problems.

>
>> Finally, even if you do want to fix the behavior for this pathologic case,
>> I don't see why you'd want to do it in this manner. Instead, you can add a
>> simple test prior to returning from sys_restart(), something like this:
>>
>> ...
>> /* almost done: now handle special cases: */
>> if (our last syscall == __NR_pause) {
do_freeze();

>> ret = sys_pause();
>> } else if (our last syscall == __NR_futex) {
>> do some stuff;
do_freeze();

>> ret = sys_futex();
>> } else {
>> ret = what-we-want-to-return
>> }
>> /* finally, return to user space */
>> return ret;
>> }
>
> This only works if we do not want to stay in the frozen state after restart. Or
> am I missing something?

Sure: see addition above.

>
>> I'm not quite sure what other "complex states" you refer to; but I wonder
>> whether that code "needed to restore some complex states" could not be
>> implemented along the same idea.
>
> In the same manner we are restoring for instance ptrace.

Yes, I mentioned that in the past. Can be addressed in the same manner.

>
>> The upside is clear: the code is less obscure, simple to debug, and not
>> architecture-dependent. (hehe .. it even runs faster because it saves a
>> whole kernel->user->kernel switch, what do you know !).

> In our case we also do not need a switch, and the code is actually not very
> complicated.

Fact is that people wondered what was going on there.

I prefer the arch-independent way, because it is, well, arch-independent,
and because it makes the logic and the exception obvious to the reader,
and it is easily extensible to handle additional special cases (like
ptrace), and makes maintenance easier.

Do you object to doing it this way ?

Oren.