2009-03-05 16:40:39

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 00/11] track files for checkpointability

This takes a suggestion of Ingo's along with comments from lots of
other people. It can track whether a given file is able to be
checkpointed. It introduces a f_op to allow easy customization
like the reset of the VFS.

You can also find these patches in git:

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/daveh/linux-2.6-cr.git;a=shortlog;h=dave-v13.4

Sorry if anybody got this twice. I'm fighting with git-send-email.


2009-03-05 16:39:19

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 02/11] breakout fdinfo sprintf() into its own function


I'll be adding to this in a moment and it is in a bad place
to do that cleanly now.

Also, increase the buffer size. Most /proc files can
output up to a page, so use the same here.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/fs/proc/base.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)

diff -puN fs/proc/base.c~breakout-fdinfo-sprintf fs/proc/base.c
--- linux-2.6.git/fs/proc/base.c~breakout-fdinfo-sprintf 2009-03-05 08:37:00.000000000 -0800
+++ linux-2.6.git-dave/fs/proc/base.c 2009-03-05 08:37:00.000000000 -0800
@@ -1632,7 +1632,18 @@ out:
return ~0U;
}

-#define PROC_FDINFO_MAX 64
+#define PROC_FDINFO_MAX PAGE_SIZE
+
+static void proc_fd_write_info(struct file *file, char *info)
+{
+ int max = PROC_FDINFO_MAX;
+ int p = 0;
+ if (!info)
+ return;
+
+ p += scnprintf(info+p, max-p, "pos:\t%lli\n", (long long) file->f_pos);
+ p += scnprintf(info+p, max-p, "flags:\t0%o\n", file->f_flags);
+}

static int proc_fd_info(struct inode *inode, struct path *path, char *info)
{
@@ -1657,12 +1668,7 @@ static int proc_fd_info(struct inode *in
*path = file->f_path;
path_get(&file->f_path);
}
- if (info)
- snprintf(info, PROC_FDINFO_MAX,
- "pos:\t%lli\n"
- "flags:\t0%o\n",
- (long long) file->f_pos,
- file->f_flags);
+ proc_fd_write_info(file, info);
spin_unlock(&files->file_lock);
put_files_struct(files);
return 0;
@@ -1870,10 +1876,11 @@ static int proc_readfd(struct file *filp
static ssize_t proc_fdinfo_read(struct file *file, char __user *buf,
size_t len, loff_t *ppos)
{
- char tmp[PROC_FDINFO_MAX];
+ char *tmp = kmalloc(PROC_FDINFO_MAX, GFP_KERNEL);
int err = proc_fd_info(file->f_path.dentry->d_inode, NULL, tmp);
if (!err)
err = simple_read_from_buffer(buf, len, ppos, tmp, strlen(tmp));
+ kfree(tmp);
return err;
}

_

2009-03-05 16:39:36

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 01/11] kill '_data' in cr_hdr_fd_data name


The 'cr_hdr_fd_data' name is too long for me. Adding data
at the end doesn't clarify anything in its use which makes
it a waste of space. Kill it.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/checkpoint/ckpt_file.c | 2 +-
linux-2.6.git-dave/checkpoint/rstr_file.c | 2 +-
linux-2.6.git-dave/include/linux/checkpoint_hdr.h | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff -puN checkpoint/ckpt_file.c~f49811bdfd2cde1dfa766f38cc017a5cdd8be75e checkpoint/ckpt_file.c
--- linux-2.6.git/checkpoint/ckpt_file.c~f49811bdfd2cde1dfa766f38cc017a5cdd8be75e 2009-03-05 08:36:59.000000000 -0800
+++ linux-2.6.git-dave/checkpoint/ckpt_file.c 2009-03-05 08:36:59.000000000 -0800
@@ -76,7 +76,7 @@ int cr_scan_fds(struct files_struct *fil
static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
{
struct cr_hdr h;
- struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct cr_hdr_fd *hh = cr_hbuf_get(ctx, sizeof(*hh));
struct dentry *dent = file->f_dentry;
struct inode *inode = dent->d_inode;
enum fd_type fd_type;
diff -puN checkpoint/rstr_file.c~f49811bdfd2cde1dfa766f38cc017a5cdd8be75e checkpoint/rstr_file.c
--- linux-2.6.git/checkpoint/rstr_file.c~f49811bdfd2cde1dfa766f38cc017a5cdd8be75e 2009-03-05 08:36:59.000000000 -0800
+++ linux-2.6.git-dave/checkpoint/rstr_file.c 2009-03-05 08:36:59.000000000 -0800
@@ -71,7 +71,7 @@ static int cr_attach_get_file(struct fil
static int
cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int rparent)
{
- struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+ struct cr_hdr_fd *hh = cr_hbuf_get(ctx, sizeof(*hh));
struct file *file;
int parent, ret;
int fd = 0; /* pacify gcc warning */
diff -puN include/linux/checkpoint_hdr.h~f49811bdfd2cde1dfa766f38cc017a5cdd8be75e include/linux/checkpoint_hdr.h
--- linux-2.6.git/include/linux/checkpoint_hdr.h~f49811bdfd2cde1dfa766f38cc017a5cdd8be75e 2009-03-05 08:36:59.000000000 -0800
+++ linux-2.6.git-dave/include/linux/checkpoint_hdr.h 2009-03-05 08:36:59.000000000 -0800
@@ -137,7 +137,7 @@ enum fd_type {
CR_FD_DIR,
};

-struct cr_hdr_fd_data {
+struct cr_hdr_fd {
__u16 fd_type;
__u16 f_mode;
__u32 f_flags;
_

2009-03-05 16:39:51

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 04/11] actually use f_op in checkpoint code


Right now, we assume all normal files and directories
can be checkpointed. However, as usual in the VFS, there
are specialized places that will always need an ability
to override these defaults. We could do this completely
in the checkpoint code, but that would bitrot quickly.

This adds a new 'file_operations' function for
checkpointing a file. I did this under the assumption
that we should have a dirt-simple way to make something
(un)checkpointable that fits in with current code.

As you can see in the ext[234] and /proc patches, all
that we have to do to make something simple be
supported is add a single "generic" f_op entry.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/checkpoint/ckpt_file.c | 4 ++--
linux-2.6.git-dave/include/linux/fs.h | 9 +++++++++
2 files changed, 11 insertions(+), 2 deletions(-)

diff -puN checkpoint/ckpt_file.c~f_op-for-checkpointability checkpoint/ckpt_file.c
--- linux-2.6.git/checkpoint/ckpt_file.c~f_op-for-checkpointability 2009-03-05 08:37:01.000000000 -0800
+++ linux-2.6.git-dave/checkpoint/ckpt_file.c 2009-03-05 08:37:01.000000000 -0800
@@ -106,8 +106,8 @@ static int cr_write_fd_data(struct cr_ct

hh->fd_type = CR_FD_UNSET;
ret = -EBADF;
- if ((file->f_dentry->d_inode->i_mode & S_IFMT) == S_IFREG)
- ret = generic_file_checkpoint(file, ctx, hh);
+ if (file->f_op->checkpoint)
+ ret = file->f_op->checkpoint(file, ctx, hh);

if (ret)
goto out;
diff -puN include/linux/fs.h~f_op-for-checkpointability include/linux/fs.h
--- linux-2.6.git/include/linux/fs.h~f_op-for-checkpointability 2009-03-05 08:37:01.000000000 -0800
+++ linux-2.6.git-dave/include/linux/fs.h 2009-03-05 08:37:01.000000000 -0800
@@ -1296,6 +1296,14 @@ int generic_osync_inode(struct inode *,
typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
struct block_device_operations;

+struct cr_ctx;
+struct cr_hdr_fd;
+#ifdef CONFIG_CHECKPOINT_RESTART
+int generic_file_checkpoint(struct file *, struct cr_ctx *, struct cr_hdr_fd *);
+#else
+#define generic_file_checkpoint NULL
+#endif
+
/* These macros are for out of kernel modules to test that
* the kernel supports the unlocked_ioctl and compat_ioctl
* fields in struct file_operations. */
@@ -1334,6 +1342,7 @@ struct file_operations {
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
+ int (*checkpoint)(struct file *, struct cr_ctx *, struct cr_hdr_fd *);
};

struct inode_operations {
_

2009-03-05 16:40:12

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 05/11] add generic checkpoint f_op to ext fses


This marks ext[234] as being checkpointable. There will be many
more to do this to, but this is a start.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/fs/ext2/dir.c | 1 +
linux-2.6.git-dave/fs/ext2/file.c | 2 ++
linux-2.6.git-dave/fs/ext3/dir.c | 1 +
linux-2.6.git-dave/fs/ext3/file.c | 1 +
linux-2.6.git-dave/fs/ext4/dir.c | 1 +
linux-2.6.git-dave/fs/ext4/file.c | 1 +
6 files changed, 7 insertions(+)

diff -puN fs/ext2/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext2/dir.c
--- linux-2.6.git/fs/ext2/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
+++ linux-2.6.git-dave/fs/ext2/dir.c 2009-03-05 08:37:01.000000000 -0800
@@ -721,4 +721,5 @@ const struct file_operations ext2_dir_op
.compat_ioctl = ext2_compat_ioctl,
#endif
.fsync = ext2_sync_file,
+ .checkpoint = generic_file_checkpoint,
};
diff -puN fs/ext2/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext2/file.c
--- linux-2.6.git/fs/ext2/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
+++ linux-2.6.git-dave/fs/ext2/file.c 2009-03-05 08:37:01.000000000 -0800
@@ -58,6 +58,7 @@ const struct file_operations ext2_file_o
.fsync = ext2_sync_file,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
+ .checkpoint = generic_file_checkpoint,
};

#ifdef CONFIG_EXT2_FS_XIP
@@ -73,6 +74,7 @@ const struct file_operations ext2_xip_fi
.open = generic_file_open,
.release = ext2_release_file,
.fsync = ext2_sync_file,
+ .checkpoint = generic_file_checkpoint,
};
#endif

diff -puN fs/ext3/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext3/dir.c
--- linux-2.6.git/fs/ext3/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
+++ linux-2.6.git-dave/fs/ext3/dir.c 2009-03-05 08:37:01.000000000 -0800
@@ -48,6 +48,7 @@ const struct file_operations ext3_dir_op
#endif
.fsync = ext3_sync_file, /* BKL held */
.release = ext3_release_dir,
+ .checkpoint = generic_file_checkpoint,
};


diff -puN fs/ext3/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext3/file.c
--- linux-2.6.git/fs/ext3/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
+++ linux-2.6.git-dave/fs/ext3/file.c 2009-03-05 08:37:01.000000000 -0800
@@ -122,6 +122,7 @@ const struct file_operations ext3_file_o
.fsync = ext3_sync_file,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
+ .checkpoint = generic_file_checkpoint,
};

const struct inode_operations ext3_file_inode_operations = {
diff -puN fs/ext4/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext4/dir.c
--- linux-2.6.git/fs/ext4/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
+++ linux-2.6.git-dave/fs/ext4/dir.c 2009-03-05 08:37:01.000000000 -0800
@@ -48,6 +48,7 @@ const struct file_operations ext4_dir_op
#endif
.fsync = ext4_sync_file,
.release = ext4_release_dir,
+ .checkpoint = generic_file_checkpoint,
};


diff -puN fs/ext4/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext4/file.c
--- linux-2.6.git/fs/ext4/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
+++ linux-2.6.git-dave/fs/ext4/file.c 2009-03-05 08:37:01.000000000 -0800
@@ -156,6 +156,7 @@ const struct file_operations ext4_file_o
.fsync = ext4_sync_file,
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
+ .checkpoint = generic_file_checkpoint,
};

const struct inode_operations ext4_file_inode_operations = {
_

2009-03-05 16:40:52

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 09/11] check files for checkpointability


Introduce a files_struct counter to indicate whether a particular
file_struct has ever contained a file which can not be
checkpointed. This flag is a one-way trip; once it is set, it may
not be unset.

We assume at allocation that a new files_struct is clean and may
be checkpointed. However, as soon as it has had its files filled
from its parent's, we check it for real in __scan_files_for_cr().
At that point, we mark it if it contained any uncheckpointable
files.

We also check each 'struct file' when it is installed in a fd
slot. This way, if anyone open()s or managed to dup() an
unsuppored file, we can catch it.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/fs/file.c | 19 +++++++++++++++++++
linux-2.6.git-dave/fs/open.c | 5 +++++
linux-2.6.git-dave/include/linux/checkpoint.h | 13 +++++++++++++
linux-2.6.git-dave/include/linux/fdtable.h | 3 +++
4 files changed, 40 insertions(+)

diff -puN fs/file.c~242c7afafb1a81c0d9a41085536f2197d81146e7 fs/file.c
--- linux-2.6.git/fs/file.c~242c7afafb1a81c0d9a41085536f2197d81146e7 2009-03-05 08:37:04.000000000 -0800
+++ linux-2.6.git-dave/fs/file.c 2009-03-05 08:37:04.000000000 -0800
@@ -15,6 +15,7 @@
#include <linux/file.h>
#include <linux/fdtable.h>
#include <linux/bitops.h>
+#include <linux/checkpoint.h>
#include <linux/interrupt.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
@@ -285,6 +286,20 @@ static int count_open_files(struct fdtab
return i;
}

+static void __scan_files_for_cr(struct files_struct *files)
+{
+ int i;
+
+ for (i = 0; i < files->fdtab.max_fds; i++) {
+ struct file *f = fcheck_files(files, i);
+ if (!f)
+ continue;
+ if (cr_file_supported(f))
+ continue;
+ files_deny_checkpointing(files);
+ }
+}
+
/*
* Allocate a new files structure and copy contents from the
* passed in files structure.
@@ -303,6 +318,9 @@ struct files_struct *dup_fd(struct files
goto out;

atomic_set(&newf->count, 1);
+#ifdef CONFIG_CHECKPOINT_RESTART
+ newf->may_checkpoint = 1;
+#endif

spin_lock_init(&newf->file_lock);
newf->next_fd = 0;
@@ -396,6 +414,7 @@ struct files_struct *dup_fd(struct files

rcu_assign_pointer(newf->fdt, new_fdt);

+ __scan_files_for_cr(newf);
return newf;

out_release:
diff -puN fs/open.c~242c7afafb1a81c0d9a41085536f2197d81146e7 fs/open.c
--- linux-2.6.git/fs/open.c~242c7afafb1a81c0d9a41085536f2197d81146e7 2009-03-05 08:37:04.000000000 -0800
+++ linux-2.6.git-dave/fs/open.c 2009-03-05 08:37:04.000000000 -0800
@@ -29,6 +29,7 @@
#include <linux/rcupdate.h>
#include <linux/audit.h>
#include <linux/falloc.h>
+#include <linux/checkpoint.h>

int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
@@ -1015,6 +1016,10 @@ void fd_install(unsigned int fd, struct
{
struct files_struct *files = current->files;
struct fdtable *fdt;
+
+ if (!cr_file_supported(file))
+ files_deny_checkpointing(files);
+
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
BUG_ON(fdt->fd[fd] != NULL);
diff -puN include/linux/checkpoint.h~242c7afafb1a81c0d9a41085536f2197d81146e7 include/linux/checkpoint.h
--- linux-2.6.git/include/linux/checkpoint.h~242c7afafb1a81c0d9a41085536f2197d81146e7 2009-03-05 08:37:04.000000000 -0800
+++ linux-2.6.git-dave/include/linux/checkpoint.h 2009-03-05 08:37:04.000000000 -0800
@@ -12,6 +12,7 @@

#include <linux/path.h>
#include <linux/fs.h>
+#include <linux/fdtable.h>

#ifdef CONFIG_CHECKPOINT_RESTART

@@ -101,11 +102,23 @@ extern int cr_read_files(struct cr_ctx *

#define pr_fmt(fmt) "[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__

+static inline void __files_deny_checkpointing(struct files_struct *files,
+ char *file, int line)
+{
+ if (!test_and_clear_bit(0, &files->may_checkpoint))
+ return;
+ printk(KERN_INFO "process performed an action that can not be "
+ "checkpointed at: %s:%d\n", file, line);
+}
+#define files_deny_checkpointing(f) \
+ __files_deny_checkpointing(f, __FILE__, __LINE__)
+
int cr_explain_file(struct file *file, char *explain, int left);
int cr_file_supported(struct file *file);

#else /* !CONFIG_CHECKPOINT_RESTART */

+static inline void files_deny_checkpointing(struct files_struct *files) {}
static inline int cr_explain_file(struct file *file, char *explain, int left)
{
return 0;
diff -puN include/linux/fdtable.h~242c7afafb1a81c0d9a41085536f2197d81146e7 include/linux/fdtable.h
--- linux-2.6.git/include/linux/fdtable.h~242c7afafb1a81c0d9a41085536f2197d81146e7 2009-03-05 08:37:04.000000000 -0800
+++ linux-2.6.git-dave/include/linux/fdtable.h 2009-03-05 08:37:04.000000000 -0800
@@ -45,6 +45,9 @@ struct files_struct {
atomic_t count;
struct fdtable *fdt;
struct fdtable fdtab;
+#ifdef CONFIG_CHECKPOINT_RESTART
+ unsigned long may_checkpoint;
+#endif
/*
* written part on a separate cache line in SMP
*/
_

2009-03-05 16:41:19

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 03/11] Introduce generic_file_checkpoint()


In a bit we will introduce a f_op to do per-file
checkpointing. But, before I do that, I want to
demonstrate how it splices in here.

So, introduce generic_file_checkpoint() and use it
only for regular files.

I'm also removing the CR_FD_DIR type. We treat
normal files and directories the same way, so just
call it CR_FD_GENERIC to make it more clear that
we always follow the same generic behavior.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/checkpoint/ckpt_file.c | 50 ++++++++++++----------
linux-2.6.git-dave/checkpoint/rstr_file.c | 3 -
linux-2.6.git-dave/include/linux/checkpoint_hdr.h | 4 -
3 files changed, 32 insertions(+), 25 deletions(-)

diff -puN checkpoint/ckpt_file.c~generic_file_checkpoint checkpoint/ckpt_file.c
--- linux-2.6.git/checkpoint/ckpt_file.c~generic_file_checkpoint 2009-03-05 08:37:00.000000000 -0800
+++ linux-2.6.git-dave/checkpoint/ckpt_file.c 2009-03-05 08:37:00.000000000 -0800
@@ -72,42 +72,50 @@ int cr_scan_fds(struct files_struct *fil
return n;
}

+int generic_file_checkpoint(struct file *file, struct cr_ctx *ctx,
+ struct cr_hdr_fd *hh)
+{
+ hh->f_flags = file->f_flags;
+ hh->f_mode = file->f_mode;
+ hh->f_pos = file->f_pos;
+ hh->f_version = file->f_version;
+ /* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+ /*
+ * CR_FD_GENERIC basically means that this can simply be
+ * open()'d on restore. Nothing special. Note that this
+ * includes directories.
+ */
+ hh->fd_type = CR_FD_GENERIC;
+
+ /* FIX: check if the file/dir/link is unlinked */
+
+ return 0;
+}
+
/* cr_write_fd_data - dump the state of a given file pointer */
static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
{
struct cr_hdr h;
struct cr_hdr_fd *hh = cr_hbuf_get(ctx, sizeof(*hh));
- struct dentry *dent = file->f_dentry;
- struct inode *inode = dent->d_inode;
- enum fd_type fd_type;
int ret;

h.type = CR_HDR_FD_DATA;
h.len = sizeof(*hh);
h.parent = parent;

- hh->f_flags = file->f_flags;
- hh->f_mode = file->f_mode;
- hh->f_pos = file->f_pos;
- hh->f_version = file->f_version;
- /* FIX: need also file->uid, file->gid, file->f_owner, etc */
+ hh->fd_type = CR_FD_UNSET;
+ ret = -EBADF;
+ if ((file->f_dentry->d_inode->i_mode & S_IFMT) == S_IFREG)
+ ret = generic_file_checkpoint(file, ctx, hh);

- switch (inode->i_mode & S_IFMT) {
- case S_IFREG:
- fd_type = CR_FD_FILE;
- break;
- case S_IFDIR:
- fd_type = CR_FD_DIR;
- break;
- default:
- cr_hbuf_put(ctx, sizeof(*hh));
- return -EBADF;
- }
+ if (ret)
+ goto out;

- /* FIX: check if the file/dir/link is unlinked */
- hh->fd_type = fd_type;
+ WARN_ON(hh->fd_type == CR_FD_UNSET);

ret = cr_write_obj(ctx, &h, hh);
+out:
cr_hbuf_put(ctx, sizeof(*hh));
if (ret < 0)
return ret;
diff -puN checkpoint/rstr_file.c~generic_file_checkpoint checkpoint/rstr_file.c
--- linux-2.6.git/checkpoint/rstr_file.c~generic_file_checkpoint 2009-03-05 08:37:00.000000000 -0800
+++ linux-2.6.git-dave/checkpoint/rstr_file.c 2009-03-05 08:37:00.000000000 -0800
@@ -92,8 +92,7 @@ cr_read_fd_data(struct cr_ctx *ctx, stru
/* FIX: more sanity checks on f_flags, f_mode etc */

switch (hh->fd_type) {
- case CR_FD_FILE:
- case CR_FD_DIR:
+ case CR_FD_GENERIC:
file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
break;
default:
diff -puN include/linux/checkpoint_hdr.h~generic_file_checkpoint include/linux/checkpoint_hdr.h
--- linux-2.6.git/include/linux/checkpoint_hdr.h~generic_file_checkpoint 2009-03-05 08:37:00.000000000 -0800
+++ linux-2.6.git-dave/include/linux/checkpoint_hdr.h 2009-03-05 08:37:00.000000000 -0800
@@ -133,8 +133,8 @@ struct cr_hdr_fd_ent {

/* fd types */
enum fd_type {
- CR_FD_FILE = 1,
- CR_FD_DIR,
+ CR_FD_UNSET = 0,
+ CR_FD_GENERIC = 1,
};

struct cr_hdr_fd {
_

2009-03-05 16:41:34

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 11/11] optimize c/r check in dup_fd()


__scan_files_for_cr() should compile down to nothing
if checkpoint/restart is disabled.

But, I think the spinlocks in fcheck_files() keep
the compiler from figuring this out. So, instead
if putting in some #ifdefs, use our new cr_enabled()
helper.

I've verified that this puts .text size back to
where it was without the __scan_files_for_cr() call.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/fs/file.c | 3 +++
1 file changed, 3 insertions(+)

diff -puN fs/file.c~optimize-cr-checkk fs/file.c
--- linux-2.6.git/fs/file.c~optimize-cr-checkk 2009-03-05 08:37:05.000000000 -0800
+++ linux-2.6.git-dave/fs/file.c 2009-03-05 08:37:05.000000000 -0800
@@ -290,6 +290,9 @@ static void __scan_files_for_cr(struct f
{
int i;

+ if (!cr_enabled())
+ return;
+
for (i = 0; i < files->fdtab.max_fds; i++) {
struct file *f = fcheck_files(files, i);
if (!f)
_

2009-03-05 16:42:12

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 10/11] add checkpoint/restart compile helper


This is to aid the compiler a little bit so that we
don't have to add as many #ifdefs. Gets used in the
next patch.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/include/linux/checkpoint.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff -puN include/linux/checkpoint.h~cr-compile-helper include/linux/checkpoint.h
--- linux-2.6.git/include/linux/checkpoint.h~cr-compile-helper 2009-03-05 08:37:04.000000000 -0800
+++ linux-2.6.git-dave/include/linux/checkpoint.h 2009-03-05 08:37:04.000000000 -0800
@@ -115,6 +115,10 @@ static inline void __files_deny_checkpoi

int cr_explain_file(struct file *file, char *explain, int left);
int cr_file_supported(struct file *file);
+static inline int cr_enabled(void)
+{
+ return 1;
+}

#else /* !CONFIG_CHECKPOINT_RESTART */

@@ -128,6 +132,13 @@ static inline int cr_file_supported(stru
{
return 0;
}
+/*
+ * Use this when you can in lieu of an #ifdef
+ */
+static inline int cr_enabled(void)
+{
+ return 0;
+}

#endif /* CONFIG_CHECKPOINT_RESTART */
#endif /* _CHECKPOINT_CKPT_H_ */
_

2009-03-05 16:42:49

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 07/11] file c/r: expose functions to query fs support


This pair of functions will check to see whether a given
'struct file' can be checkpointed. If it can't be, the
"explain" function can also give a description why.

Note that we assume the presence of an f_op->checkpoint
implies that a file is always checkpointable. That may
have to change in the future, but it is a fixed rule
for now.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/checkpoint/ckpt_file.c | 26 ++++++++++++++++++++++++++
linux-2.6.git-dave/include/linux/checkpoint.h | 18 ++++++++++++++++++
2 files changed, 44 insertions(+)

diff -puN checkpoint/ckpt_file.c~cfcc333a1e665c38185a12ba119c32cf796dbd85 checkpoint/ckpt_file.c
--- linux-2.6.git/checkpoint/ckpt_file.c~cfcc333a1e665c38185a12ba119c32cf796dbd85 2009-03-05 08:37:03.000000000 -0800
+++ linux-2.6.git-dave/checkpoint/ckpt_file.c 2009-03-05 08:37:03.000000000 -0800
@@ -93,6 +93,32 @@ int generic_file_checkpoint(struct file
return 0;
}

+int cr_explain_file(struct file *file, char *explain, int left)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+
+ if (!file->f_op->checkpoint)
+ return scnprintf(explain, left, " (no checkpoint handler)");
+
+ if (special_file(inode->i_mode))
+ return scnprintf(explain, left, " (special file)");
+
+ return 0;
+}
+
+int cr_file_supported(struct file *file)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+
+ if (!file->f_op->checkpoint)
+ return 0;
+
+ if (special_file(inode->i_mode))
+ return 0;
+
+ return 1;
+}
+
/* cr_write_fd_data - dump the state of a given file pointer */
static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
{
diff -puN include/linux/checkpoint.h~cfcc333a1e665c38185a12ba119c32cf796dbd85 include/linux/checkpoint.h
--- linux-2.6.git/include/linux/checkpoint.h~cfcc333a1e665c38185a12ba119c32cf796dbd85 2009-03-05 08:37:03.000000000 -0800
+++ linux-2.6.git-dave/include/linux/checkpoint.h 2009-03-05 08:37:03.000000000 -0800
@@ -13,6 +13,8 @@
#include <linux/path.h>
#include <linux/fs.h>

+#ifdef CONFIG_CHECKPOINT_RESTART
+
#define CR_VERSION 2

struct cr_ctx {
@@ -99,4 +101,20 @@ extern int cr_read_files(struct cr_ctx *

#define pr_fmt(fmt) "[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__

+int cr_explain_file(struct file *file, char *explain, int left);
+int cr_file_supported(struct file *file);
+
+#else /* !CONFIG_CHECKPOINT_RESTART */
+
+static inline int cr_explain_file(struct file *file, char *explain, int left)
+{
+ return 0;
+}
+
+static inline int cr_file_supported(struct file *file)
+{
+ return 0;
+}
+
+#endif /* CONFIG_CHECKPOINT_RESTART */
#endif /* _CHECKPOINT_CKPT_H_ */
_

2009-03-05 16:42:33

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 06/11] add checkpoint_file_generic() to /proc


/proc gets opened a *lot* including during some things that
ld.so does (according to Serge). If we want *anything* to
be checkpointable, we need to handle at least a few things
in /proc.

My approach here was we should be conservative. We should
only mark things that we basically already know can change
at any time, so a process would not be confused if they
changed.

Things like /proc/kcore are a bit trickier and should be
left alone for now.

Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/fs/proc/base.c | 9 +++++++++
linux-2.6.git-dave/fs/proc/meminfo.c | 1 +
linux-2.6.git-dave/fs/proc/stat.c | 1 +
linux-2.6.git-dave/mm/vmstat.c | 4 ++++
4 files changed, 15 insertions(+)

diff -puN fs/proc/base.c~add-stupid-checkpoint-to-proc-0 fs/proc/base.c
--- linux-2.6.git/fs/proc/base.c~add-stupid-checkpoint-to-proc-0 2009-03-05 08:37:02.000000000 -0800
+++ linux-2.6.git-dave/fs/proc/base.c 2009-03-05 08:37:02.000000000 -0800
@@ -690,6 +690,7 @@ static const struct file_operations proc
.llseek = seq_lseek,
.release = mounts_release,
.poll = mounts_poll,
+ .checkpoint = generic_file_checkpoint,
};

static int mountinfo_open(struct inode *inode, struct file *file)
@@ -703,6 +704,7 @@ static const struct file_operations proc
.llseek = seq_lseek,
.release = mounts_release,
.poll = mounts_poll,
+ .checkpoint = generic_file_checkpoint,
};

static int mountstats_open(struct inode *inode, struct file *file)
@@ -715,6 +717,7 @@ static const struct file_operations proc
.read = seq_read,
.llseek = seq_lseek,
.release = mounts_release,
+ .checkpoint = generic_file_checkpoint,
};

#define PROC_BLOCK_SIZE (3*1024) /* 4K page size but our output routines use some slack for overruns */
@@ -751,6 +754,7 @@ out_no_task:

static const struct file_operations proc_info_file_operations = {
.read = proc_info_read,
+ .checkpoint = generic_file_checkpoint,
};

static int proc_single_show(struct seq_file *m, void *v)
@@ -790,6 +794,7 @@ static const struct file_operations proc
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
+ .checkpoint = generic_file_checkpoint,
};

static int mem_open(struct inode* inode, struct file* file)
@@ -1008,6 +1013,7 @@ out_no_task:

static const struct file_operations proc_environ_operations = {
.read = environ_read,
+ .checkpoint = generic_file_checkpoint,
};

static ssize_t oom_adjust_read(struct file *file, char __user *buf,
@@ -1063,6 +1069,7 @@ static ssize_t oom_adjust_write(struct f
static const struct file_operations proc_oom_adjust_operations = {
.read = oom_adjust_read,
.write = oom_adjust_write,
+ .checkpoint = generic_file_checkpoint,
};

#ifdef CONFIG_AUDITSYSCALL
@@ -1130,6 +1137,7 @@ out_free_page:
static const struct file_operations proc_loginuid_operations = {
.read = proc_loginuid_read,
.write = proc_loginuid_write,
+ .checkpoint = generic_file_checkpoint,
};

static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
@@ -1150,6 +1158,7 @@ static ssize_t proc_sessionid_read(struc

static const struct file_operations proc_sessionid_operations = {
.read = proc_sessionid_read,
+ .checkpoint = generic_file_checkpoint,
};
#endif

diff -puN fs/proc/meminfo.c~add-stupid-checkpoint-to-proc-0 fs/proc/meminfo.c
--- linux-2.6.git/fs/proc/meminfo.c~add-stupid-checkpoint-to-proc-0 2009-03-05 08:37:02.000000000 -0800
+++ linux-2.6.git-dave/fs/proc/meminfo.c 2009-03-05 08:37:02.000000000 -0800
@@ -164,6 +164,7 @@ static const struct file_operations memi
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
+ .checkpoint = generic_file_checkpoint,
};

static int __init proc_meminfo_init(void)
diff -puN fs/proc/stat.c~add-stupid-checkpoint-to-proc-0 fs/proc/stat.c
--- linux-2.6.git/fs/proc/stat.c~add-stupid-checkpoint-to-proc-0 2009-03-05 08:37:02.000000000 -0800
+++ linux-2.6.git-dave/fs/proc/stat.c 2009-03-05 08:37:02.000000000 -0800
@@ -142,6 +142,7 @@ static const struct file_operations proc
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
+ .checkpoint = generic_file_checkpoint,
};

static int __init proc_stat_init(void)
diff -puN mm/vmstat.c~add-stupid-checkpoint-to-proc-0 mm/vmstat.c
--- linux-2.6.git/mm/vmstat.c~add-stupid-checkpoint-to-proc-0 2009-03-05 08:37:02.000000000 -0800
+++ linux-2.6.git-dave/mm/vmstat.c 2009-03-05 08:37:02.000000000 -0800
@@ -598,6 +598,7 @@ static const struct file_operations frag
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release,
+ .checkpoint = generic_file_checkpoint,
};

static const struct seq_operations pagetypeinfo_op = {
@@ -617,6 +618,7 @@ static const struct file_operations page
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release,
+ .checkpoint = generic_file_checkpoint,
};

#ifdef CONFIG_ZONE_DMA
@@ -813,6 +815,7 @@ static const struct file_operations proc
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release,
+ .checkpoint = generic_file_checkpoint,
};

static void *vmstat_start(struct seq_file *m, loff_t *pos)
@@ -887,6 +890,7 @@ static const struct file_operations proc
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release,
+ .checkpoint = generic_file_checkpoint,
};
#endif /* CONFIG_PROC_FS */

_

2009-03-05 16:41:50

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 08/11] expose file checkpointability and reasoning in /proc


This is the first in a set of things that checked to
ensure that a process or entire container is
able to be checkpointed.


Signed-off-by: Dave Hansen <[email protected]>
---

linux-2.6.git-dave/fs/proc/base.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff -puN fs/proc/base.c~proc-part-of-f_op-conversion fs/proc/base.c
--- linux-2.6.git/fs/proc/base.c~proc-part-of-f_op-conversion 2009-03-05 08:37:03.000000000 -0800
+++ linux-2.6.git-dave/fs/proc/base.c 2009-03-05 08:37:03.000000000 -0800
@@ -80,6 +80,7 @@
#include <linux/oom.h>
#include <linux/elf.h>
#include <linux/pid_namespace.h>
+#include <linux/checkpoint.h>
#include "internal.h"

/* NOTE:
@@ -1645,6 +1646,7 @@ out:

static void proc_fd_write_info(struct file *file, char *info)
{
+ int checkpointable = 0;
int max = PROC_FDINFO_MAX;
int p = 0;
if (!info)
@@ -1652,6 +1654,12 @@ static void proc_fd_write_info(struct fi

p += scnprintf(info+p, max-p, "pos:\t%lli\n", (long long) file->f_pos);
p += scnprintf(info+p, max-p, "flags:\t0%o\n", file->f_flags);
+
+ if (file->f_op->checkpoint)
+ checkpointable = 1;
+ p += scnprintf(info+p, max-p, "checkpointable:\t%d", checkpointable);
+ if (!checkpointable)
+ p += cr_explain_file(file, info+p, max-p);
}

static int proc_fd_info(struct inode *inode, struct path *path, char *info)
_

2009-03-05 17:34:02

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Thu, Mar 05, 2009 at 08:38:57AM -0800, Dave Hansen wrote:
> This takes a suggestion of Ingo's along with comments from lots of
> other people. It can track whether a given file is able to be
> checkpointed. It introduces a f_op to allow easy customization
> like the reset of the VFS.

Here is how alternative looks like
* without touching VFS at all
* without adding default handlers
* without duplicate code every ->checkpoint hook will have
* without largely useless "special file" messages
(what's so special about it?)
* without adding userspace-visible /proc/*/checkpointable
* without recalculating "checkpointable" property on fs_struct
on every C/R=y kernel.

* with "ban by default" policy as well
* with error message immediatly understandable by developer:

cr_check_file: can't checkpoint file f61a0f40, ->f_op = socket_file_ops+0x0/0x1c0

It may lack some printk, but printks are trivial to insert including
using d_path for precise info.



static int cr_check_file(struct file *file)
{
struct inode *inode = file->f_path.dentry->d_inode;
unsigned int major, minor;

if (d_unhashed(file->f_path.dentry))
return -EINVAL;
#ifdef CONFIG_SECURITY
if (file->f_security)
return -EINVAL;
#endif
#ifdef CONFIG_EPOLL
spin_lock(&file->f_ep_lock);
if (!list_empty(&file->f_ep_links)) {
spin_unlock(&file->f_ep_lock);
return -EINVAL;
}
spin_unlock(&file->f_ep_lock);
#endif

switch (inode->i_mode & S_IFMT) {
case S_IFREG:
case S_IFDIR:
/* Likely on-disk filesystem. */
/* FIXME: FUSE, NFS, other networking filesystems */
if (inode->i_sb->s_type->fs_flags & FS_REQUIRES_DEV)
return 0;
break;
case S_IFBLK:
major = imajor(inode);
minor = iminor(inode);
printk("%s: can't checkpoint block device %u:%u, ->f_op = %pS\n", __func__, major, minor, file->f_op);
return -EINVAL;
case S_IFCHR:
major = imajor(inode);
minor = iminor(inode);
if (major == UNIX98_PTY_SLAVE_MAJOR)
return 0;
printk("%s: can't checkpoint char device %u:%u, ->f_op = %pS\n", __func__, major, minor, file->f_op);
return -EINVAL;
case S_IFIFO:
break;
case S_IFSOCK:
return 0;
case S_IFLNK:
/* One can't open symlink. */
BUG();
}
printk("%s: can't checkpoint file %p, ->f_op = %pS\n", __func__, file, file->f_op);
return -EINVAL;
}

static int __cr_collect_file(struct cr_context *ctx, struct file *file)
{
struct cr_object *obj;

obj = cr_find_obj_by_ptr(ctx, file, CR_CTX_FILE);
if (obj) {
obj->o_count++;
return 0;
}

obj = cr_object_create(file);
if (!obj)
return -ENOMEM;
list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_FILE]);
printk("collect file %p\n", file);
return 0;
}

int cr_collect_file(struct cr_context *ctx, struct file *file)
{
int rv;

rv = cr_check_file(file);
if (rv < 0)
return rv;
return __cr_collect_file(ctx, file);
}

2009-03-05 18:13:52

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

I need the following to get fdinfo checkpointability output:

>From 4002821bf068c122867c3b1819817f0b1fdfc250 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <[email protected]>
Date: Thu, 5 Mar 2009 10:12:30 -0800
Subject: [PATCH 1/1] checkpointable: add missing trailing \n for fdinfo output

Signed-off-by: Serge E. Hallyn <[email protected]>
---
fs/proc/base.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index bde92c3..12543b6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1657,7 +1657,7 @@ static void proc_fd_write_info(struct file *file, char *info)

if (file->f_op->checkpoint)
checkpointable = 1;
- p += scnprintf(info+p, max-p, "checkpointable:\t%d", checkpointable);
+ p += scnprintf(info+p, max-p, "checkpointable:\t%d\n", checkpointable);
if (!checkpointable)
p += cr_explain_file(file, info+p, max-p);
}
--
1.5.4.3

2009-03-05 18:17:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Thu, 2009-03-05 at 12:13 -0600, Serge E. Hallyn wrote:
>
> ---
> fs/proc/base.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index bde92c3..12543b6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1657,7 +1657,7 @@ static void proc_fd_write_info(struct file *file, char *info)
>
> if (file->f_op->checkpoint)
> checkpointable = 1;
> - p += scnprintf(info+p, max-p, "checkpointable:\t%d", checkpointable);
> + p += scnprintf(info+p, max-p, "checkpointable:\t%d\n", checkpointable);
> if (!checkpointable)
> p += cr_explain_file(file, info+p, max-p);
> }

I'll apply that, thanks Serge.

-- Dave

2009-03-05 19:17:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Thu, 2009-03-05 at 20:40 +0300, Alexey Dobriyan wrote:
> On Thu, Mar 05, 2009 at 08:38:57AM -0800, Dave Hansen wrote:
> > This takes a suggestion of Ingo's along with comments from lots of
> > other people. It can track whether a given file is able to be
> > checkpointed. It introduces a f_op to allow easy customization
> > like the reset of the VFS.
>
> Here is how alternative looks like
> * without touching VFS at all
> * without adding default handlers

Are these bad things? If this was harmful to the VFS, I can bet
Christoph would be speaking up. As far as the default handlers, blame
Christoph. :)

> * without duplicate code every ->checkpoint hook will have

Well, I had actually planned to break the generic function up into a
"common" function that all of the handlers can call or could be called
before each handler. This is trivially fixable, but it would look kinda
goofy without some code to use it.

> * without largely useless "special file" messages
> (what's so special about it?)

Very true. I'll improve that one.

> * without adding userspace-visible /proc/*/checkpointable

Ingo, could you weigh in on how you expected this "checkpointable" flag
to get exposed to and checked by userspace?

> * without recalculating "checkpointable" property on fs_struct
> on every C/R=y kernel.

Yeah, this is certainly less than ideal. Although, I haven't seen your
proposal for where to tie your code into the kernel. Do you suggest
that we do nothing during normal kernel runtime and all the checking at
sys_checkpoint() time?

> * with "ban by default" policy as well
> * with error message immediatly understandable by developer:
>
> cr_check_file: can't checkpoint file f61a0f40, ->f_op = socket_file_ops+0x0/0x1c0

That's a much better error message than I have. I think I'll steal it
if you don't mind.

> It may lack some printk, but printks are trivial to insert including
> using d_path for precise info.

This is definitely workable approach. However, could you show how you
would support /dev/null and, say, /proc/$$/stat? I've shown what it
takes to do that in my patches, and I think it would show a lot about
your approach.

-- Dave

2009-03-05 19:44:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Thu, 2009-03-05 at 20:40 +0300, Alexey Dobriyan wrote:
> * without largely useless "special file" messages
> (what's so special about it?)

Thanks for pointing this out, Alexey. After going and looking at these,
I realized that this changed sometime in the last 5 or 6 years. :)

My f_op approach works for these even without the special_file()
check.

-- Dave

2009-03-05 21:02:13

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Thu, Mar 05, 2009 at 11:16:07AM -0800, Dave Hansen wrote:
> On Thu, 2009-03-05 at 20:40 +0300, Alexey Dobriyan wrote:
> > On Thu, Mar 05, 2009 at 08:38:57AM -0800, Dave Hansen wrote:
> > > This takes a suggestion of Ingo's along with comments from lots of
> > > other people. It can track whether a given file is able to be
> > > checkpointed. It introduces a f_op to allow easy customization
> > > like the reset of the VFS.
> >
> > Here is how alternative looks like
> > * without touching VFS at all
> > * without adding default handlers
>
> Are these bad things? If this was harmful to the VFS, I can bet
> Christoph would be speaking up. As far as the default handlers, blame
> Christoph. :)

It's too much for too little and unreliable. See below.

> > * without duplicate code every ->checkpoint hook will have
>
> Well, I had actually planned to break the generic function up into a
> "common" function that all of the handlers can call or could be called
> before each handler. This is trivially fixable, but it would look kinda
> goofy without some code to use it.
>
> > * without largely useless "special file" messages
> > (what's so special about it?)
>
> Very true. I'll improve that one.
>
> > * without adding userspace-visible /proc/*/checkpointable
>
> Ingo, could you weigh in on how you expected this "checkpointable" flag
> to get exposed to and checked by userspace?
>
> > * without recalculating "checkpointable" property on fs_struct
> > on every C/R=y kernel.
>
> Yeah, this is certainly less than ideal. Although, I haven't seen your
> proposal for where to tie your code into the kernel. Do you suggest
> that we do nothing during normal kernel runtime and all the checking at
> sys_checkpoint() time?

Of course!

C/R won't be used by majority of users, so it shouldn't bring any
overhead. ->f_op->checkpoint (not ->checkpointable!) is probably
acceptable. Recalculating flags is not, sorry.

Imagine, unsupported file is opened between userspace checks
for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
and whatever, you stil have to do all the checks inside checkpoint(2).

> > It may lack some printk, but printks are trivial to insert including
> > using d_path for precise info.
>
> This is definitely workable approach. However, could you show how you
> would support /dev/null and, say, /proc/$$/stat? I've shown what it
> takes to do that in my patches, and I think it would show a lot about
> your approach.

I haven't yet written code for /dev/null, but it would be:
* at checkpoint(2)
** see it's block device
** see it's 1:3 => supported
** dump "1:3", dump "/dev/null" as filename
* at restore(2)
** read CR_OBJ_FILE
** open filename or -E
** if not block device return -E
** if not 1:3 return -E
** save "struct file *" where needed

(all of this is modulo unlinked case)

/proc/*/stat is much trickier (and BTW can very well ruin idea of piping
dumpfile to somewhere by introducing honest loops in restoration hierarchy)

2009-03-05 21:27:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Fri, 2009-03-06 at 00:08 +0300, Alexey Dobriyan wrote:
> On Thu, Mar 05, 2009 at 11:16:07AM -0800, Dave Hansen wrote:
> > On Thu, 2009-03-05 at 20:40 +0300, Alexey Dobriyan wrote:
> > > * without recalculating "checkpointable" property on fs_struct
> > > on every C/R=y kernel.
> >
> > Yeah, this is certainly less than ideal. Although, I haven't seen your
> > proposal for where to tie your code into the kernel. Do you suggest
> > that we do nothing during normal kernel runtime and all the checking at
> > sys_checkpoint() time?
>
> Of course!
>
> C/R won't be used by majority of users, so it shouldn't bring any
> overhead. ->f_op->checkpoint (not ->checkpointable!) is probably
> acceptable. Recalculating flags is not, sorry.

Yeah, what I'm doing in dup_fd() is certainly suboptimal. It introduces
extra overhead in fork() (with the config option turned on) which sucks
big time. But, I'm *sure* we can optimize it, especially if we can push
it out to only occurring at "container fork()" time. Whatever container
fork ends up being.

> Imagine, unsupported file is opened between userspace checks
> for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
> and whatever, you stil have to do all the checks inside checkpoint(2).

Alexey, we have two problems here. I completely agree that we have to
do complete and thorough checks of each file descriptor at
sys_checkpoint(). Any checks made at other times should not be trusted.

The other side is what Ingo has been asking for. How do we *know* when
we are checkpointable *before* we call (and without calling)
sys_checkpoint()? You are yet to acknowledge that this is a valid use
case, but it is exactly what Ingo is asking for, I believe.

If nice printk()s are sufficient to cover what Ingo wants, I'm quite
happy to remove the /proc files.

> > > It may lack some printk, but printks are trivial to insert including
> > > using d_path for precise info.
> >
> > This is definitely workable approach. However, could you show how you
> > would support /dev/null and, say, /proc/$$/stat? I've shown what it
> > takes to do that in my patches, and I think it would show a lot about
> > your approach.
>
> I haven't yet written code for /dev/null, but it would be:
> * at checkpoint(2)
> ** see it's block device
> ** see it's 1:3 => supported
> ** dump "1:3", dump "/dev/null" as filename

Can we see code, please? With my approach, it is a single line added to
a structure definition. Your approach sounds like it may be more than a
single line of code. It sounds like you would like to have some kind of
device number to c/r mapping. I'm curious what form that would take.

> * at restore(2)
> ** read CR_OBJ_FILE
> ** open filename or -E
> ** if not block device return -E
> ** if not 1:3 return -E
> ** save "struct file *" where needed
>
> (all of this is modulo unlinked case)

/dev/null is a character device, btw. :)

This sanity checking on the sys_restore() side is also definitely a good
idea. But, in the interests of keeping our patch size down, I think it
is safe to say that we require userspace to get the fs back into a state
consistent with sys_checkpoint() time.

-- Dave

2009-03-05 21:54:14

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Thu, Mar 05, 2009 at 01:27:07PM -0800, Dave Hansen wrote:
> > Imagine, unsupported file is opened between userspace checks
> > for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
> > and whatever, you stil have to do all the checks inside checkpoint(2).
>
> Alexey, we have two problems here. I completely agree that we have to
> do complete and thorough checks of each file descriptor at
> sys_checkpoint(). Any checks made at other times should not be trusted.
>
> The other side is what Ingo has been asking for. How do we *know* when
> we are checkpointable *before* we call (and without calling)

This "without calling checkpoint(2)" results in much complications
as demonstrated.

task_struct and file are not like other structures because they are exposed
in /proc. For PROC_FS=n kernels, one can't even check.

You can do checkpoint(2) without actual dump. You pass, you're most
certainly checkpointable (with inevitable race condition in mind).

With time the amount of stuff C/R won't support will approach zero,
but the infrastructure for "checkpointable" will stay constant.
If it's too much right now, it will be way too much in future.

> sys_checkpoint()? You are yet to acknowledge that this is a valid use
> case, but it is exactly what Ingo is asking for, I believe.

It's a valid requirement.

> If nice printk()s are sufficient to cover what Ingo wants, I'm quite
> happy to remove the /proc files.

2009-03-05 22:24:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Fri, 2009-03-06 at 01:00 +0300, Alexey Dobriyan wrote:
> On Thu, Mar 05, 2009 at 01:27:07PM -0800, Dave Hansen wrote:
> > > Imagine, unsupported file is opened between userspace checks
> > > for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
> > > and whatever, you stil have to do all the checks inside checkpoint(2).
> >
> > Alexey, we have two problems here. I completely agree that we have to
> > do complete and thorough checks of each file descriptor at
> > sys_checkpoint(). Any checks made at other times should not be trusted.
> >
> > The other side is what Ingo has been asking for. How do we *know* when
> > we are checkpointable *before* we call (and without calling)
>
> This "without calling checkpoint(2)" results in much complications
> as demonstrated.

I'll let you take that up with Ingo. :)

> task_struct and file are not like other structures because they are exposed
> in /proc.

Very true. But, we can always use the task as a proxy to say whether
any of this tasks's *resources* are uncheckpointable. Is this task's
ipc_namespace checkpointable, etc...

> For PROC_FS=n kernels, one can't even check.

Definitely. I'd be happy to make this check require PROC=y or even
DEBUGFS=y. I just want to make the mechanism usable for developers so
they're more motivated to find and fix checkpoint issues.

> You can do checkpoint(2) without actual dump. You pass, you're most
> certainly checkpointable (with inevitable race condition in mind).

OK, so you envision this as maybe calling sys_checkpoint() with a -1 fd
or something? I'm generally OK with that. If the /proc stuff is really
the sticking point here, I'd be happy to stick it at the end of the
series so we can throw it away more easily.

> With time the amount of stuff C/R won't support will approach zero,
> but the infrastructure for "checkpointable" will stay constant.
> If it's too much right now, it will be way too much in future.

What have you seen in OpenVZ? Do new things that are not checkpointable
pop up very often?

-- Dave

2009-03-06 14:35:08

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Dave Hansen ([email protected]):
> On Fri, 2009-03-06 at 01:00 +0300, Alexey Dobriyan wrote:
> > On Thu, Mar 05, 2009 at 01:27:07PM -0800, Dave Hansen wrote:
> > > > Imagine, unsupported file is opened between userspace checks
> > > > for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
> > > > and whatever, you stil have to do all the checks inside checkpoint(2).
> > >
> > > Alexey, we have two problems here. I completely agree that we have to
> > > do complete and thorough checks of each file descriptor at
> > > sys_checkpoint(). Any checks made at other times should not be trusted.
> > >
> > > The other side is what Ingo has been asking for. How do we *know* when
> > > we are checkpointable *before* we call (and without calling)
> >
> > This "without calling checkpoint(2)" results in much complications
> > as demonstrated.
>
> I'll let you take that up with Ingo. :)
>
> > task_struct and file are not like other structures because they are exposed
> > in /proc.
>
> Very true. But, we can always use the task as a proxy to say whether
> any of this tasks's *resources* are uncheckpointable. Is this task's
> ipc_namespace checkpointable, etc...
>
> > For PROC_FS=n kernels, one can't even check.
>
> Definitely. I'd be happy to make this check require PROC=y or even
> DEBUGFS=y. I just want to make the mechanism usable for developers so
> they're more motivated to find and fix checkpoint issues.
>
> > You can do checkpoint(2) without actual dump. You pass, you're most
> > certainly checkpointable (with inevitable race condition in mind).
>
> OK, so you envision this as maybe calling sys_checkpoint() with a -1 fd
> or something? I'm generally OK with that. If the /proc stuff is really
> the sticking point here, I'd be happy to stick it at the end of the
> series so we can throw it away more easily.

Yeah thing is I definately like what Alexey is suggesting.

The only reason for going the route of Dave's patches is to implement
the pain Ingo wants to inflict to push us to faster support the
resources which users actually want/need. As Alexey says that's
a temporary gain and therefore not worth permanent code.

Oh, right, there's the second reason:

> > With time the amount of stuff C/R won't support will approach zero,
> > but the infrastructure for "checkpointable" will stay constant.
> > If it's too much right now, it will be way too much in future.
>
> What have you seen in OpenVZ? Do new things that are not checkpointable
> pop up very often?

Realistically, do you think the uncheckpointable stuff would catch a
brand-new unsupported feature? If it has a file interface then I
suppose it would. Well, might. I wouldn't be surprised if the authors
would cut and paste enough code to paste the .checkpoint =
generic_file_checkpoint line :)

-serge

2009-03-06 15:10:00

by Greg Kurz

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Fri, 2009-03-06 at 01:00 +0300, Alexey Dobriyan wrote:
> On Thu, Mar 05, 2009 at 01:27:07PM -0800, Dave Hansen wrote:
> > > Imagine, unsupported file is opened between userspace checks
> > > for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
> > > and whatever, you stil have to do all the checks inside checkpoint(2).
> >
> > Alexey, we have two problems here. I completely agree that we have to
> > do complete and thorough checks of each file descriptor at
> > sys_checkpoint(). Any checks made at other times should not be trusted.
> >
> > The other side is what Ingo has been asking for. How do we *know* when
> > we are checkpointable *before* we call (and without calling)
>
> This "without calling checkpoint(2)" results in much complications
> as demonstrated.
>
> task_struct and file are not like other structures because they are exposed
> in /proc. For PROC_FS=n kernels, one can't even check.
>
> You can do checkpoint(2) without actual dump. You pass, you're most
> certainly checkpointable (with inevitable race condition in mind).
>

Ahhh thank you very much Alexey ! I wanted to explain this to Dave a few
monthes ago but I failed... probably because of my poor English skills.

https://lists.linux-foundation.org/pipermail/containers/2008-October/013549.html

Why would we add checking all over the place when it MUST be done on the
sys_checkpoint() path ? The checkpoint(2) dry-run is definitely the way
to go.

> With time the amount of stuff C/R won't support will approach zero,
> but the infrastructure for "checkpointable" will stay constant.
> If it's too much right now, it will be way too much in future.
>
> > sys_checkpoint()? You are yet to acknowledge that this is a valid use
> > case, but it is exactly what Ingo is asking for, I believe.
>
> It's a valid requirement.
>
> > If nice printk()s are sufficient to cover what Ingo wants, I'm quite
> > happy to remove the /proc files.
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers
--
Gregory Kurz [email protected]
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.

2009-03-06 15:36:05

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Greg Kurz ([email protected]):
> On Fri, 2009-03-06 at 01:00 +0300, Alexey Dobriyan wrote:
> > On Thu, Mar 05, 2009 at 01:27:07PM -0800, Dave Hansen wrote:
> > > > Imagine, unsupported file is opened between userspace checks
> > > > for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
> > > > and whatever, you stil have to do all the checks inside checkpoint(2).
> > >
> > > Alexey, we have two problems here. I completely agree that we have to
> > > do complete and thorough checks of each file descriptor at
> > > sys_checkpoint(). Any checks made at other times should not be trusted.
> > >
> > > The other side is what Ingo has been asking for. How do we *know* when
> > > we are checkpointable *before* we call (and without calling)
> >
> > This "without calling checkpoint(2)" results in much complications
> > as demonstrated.
> >
> > task_struct and file are not like other structures because they are exposed
> > in /proc. For PROC_FS=n kernels, one can't even check.
> >
> > You can do checkpoint(2) without actual dump. You pass, you're most
> > certainly checkpointable (with inevitable race condition in mind).
> >
>
> Ahhh thank you very much Alexey ! I wanted to explain this to Dave a few
> monthes ago but I failed... probably because of my poor English skills.
>
> https://lists.linux-foundation.org/pipermail/containers/2008-October/013549.html
>
> Why would we add checking all over the place when it MUST be done on the
> sys_checkpoint() path ? The checkpoint(2) dry-run is definitely the way
> to go.

I'm sure Dave understood that this was possible :)

But what you and Alexey are proposing does not and cannot fullfill
Ingo's requirement.

-serge

2009-03-06 15:48:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Fri, 2009-03-06 at 08:34 -0600, Serge E. Hallyn wrote:
> > > With time the amount of stuff C/R won't support will approach zero,
> > > but the infrastructure for "checkpointable" will stay constant.
> > > If it's too much right now, it will be way too much in future.
> >
> > What have you seen in OpenVZ? Do new things that are not checkpointable
> > pop up very often?
>
> Realistically, do you think the uncheckpointable stuff would catch a
> brand-new unsupported feature? If it has a file interface then I
> suppose it would. Well, might. I wouldn't be surprised if the authors
> would cut and paste enough code to paste the .checkpoint =
> generic_file_checkpoint line :)

Yeah, that's true. Us maintainers would probably need to keep an eye on
that.

-- Dave

2009-03-06 16:23:52

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Dave Hansen ([email protected]):
> On Fri, 2009-03-06 at 08:34 -0600, Serge E. Hallyn wrote:
> > > > With time the amount of stuff C/R won't support will approach zero,
> > > > but the infrastructure for "checkpointable" will stay constant.
> > > > If it's too much right now, it will be way too much in future.
> > >
> > > What have you seen in OpenVZ? Do new things that are not checkpointable
> > > pop up very often?
> >
> > Realistically, do you think the uncheckpointable stuff would catch a
> > brand-new unsupported feature? If it has a file interface then I
> > suppose it would. Well, might. I wouldn't be surprised if the authors
> > would cut and paste enough code to paste the .checkpoint =
> > generic_file_checkpoint line :)
>
> Yeah, that's true. Us maintainers would probably need to keep an eye on
> that.

Which imo is fine, but my question is whether that leaves any actual
value in the persistent per-resource uncheckpointable flag.

-serge

2009-03-06 16:46:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Fri, 2009-03-06 at 10:23 -0600, Serge E. Hallyn wrote:
> Which imo is fine, but my question is whether that leaves any actual
> value in the persistent per-resource uncheckpointable flag.

OK, let's take a look back at this discussion a little bit and how we
got here.

Ingo quotes:
> Yeah, per resource it should be. That's per task in the normal
> case - except for threaded workloads where it's shared by
> threads.

> Uncheckpointable should be a one-way flag anyway. We want this
> to become usable, so uncheckpointable functionality should be as
> painful as possible, to make sure it's getting fixed ...

> Is there any automated test that could discover C/R breakage via
> brute force? All that matters in such cases is to get the "you
> broke stuff" information as soon as possible. If it comes at an
> early stage developers can generally just fix stuff.

You add these things together and you get what I posted. My patch is:
1. per resource
2. has a one way flag
3. Gives messages to developers at an early stage (dmesg) and lets them
explore it more thoroughly (/proc)

But, these "early stage" messages are completely opposed to an approach
that uses sys_checkpoint() in some form (like with a -1 fd as an
argument).

Think of it like lockdep. We *could* have designed lockdep to simply
give us a nice message whenever we do an a/b b/a deadlock. That would
be helpful. Or, we could design it to record all lock acquisitions that
didn't deadlock to see if they ever possibly deadlock. (We did the
second one, btw). That gave an early, useful, warning that developers
could fix before we encounter an actual problem. I'm advocating such a
mechanism for c/r.

-- Dave

2009-03-06 17:37:57

by Cedric Le Goater

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Serge E. Hallyn wrote:
> Quoting Greg Kurz ([email protected]):
>> On Fri, 2009-03-06 at 01:00 +0300, Alexey Dobriyan wrote:
>>> On Thu, Mar 05, 2009 at 01:27:07PM -0800, Dave Hansen wrote:
>>>>> Imagine, unsupported file is opened between userspace checks
>>>>> for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
>>>>> and whatever, you stil have to do all the checks inside checkpoint(2).
>>>> Alexey, we have two problems here. I completely agree that we have to
>>>> do complete and thorough checks of each file descriptor at
>>>> sys_checkpoint(). Any checks made at other times should not be trusted.
>>>>
>>>> The other side is what Ingo has been asking for. How do we *know* when
>>>> we are checkpointable *before* we call (and without calling)
>>> This "without calling checkpoint(2)" results in much complications
>>> as demonstrated.
>>>
>>> task_struct and file are not like other structures because they are exposed
>>> in /proc. For PROC_FS=n kernels, one can't even check.
>>>
>>> You can do checkpoint(2) without actual dump. You pass, you're most
>>> certainly checkpointable (with inevitable race condition in mind).
>>>
>> Ahhh thank you very much Alexey ! I wanted to explain this to Dave a few
>> monthes ago but I failed... probably because of my poor English skills.
>>
>> https://lists.linux-foundation.org/pipermail/containers/2008-October/013549.html
>>
>> Why would we add checking all over the place when it MUST be done on the
>> sys_checkpoint() path ? The checkpoint(2) dry-run is definitely the way
>> to go.
>
> I'm sure Dave understood that this was possible :)
>
> But what you and Alexey are proposing does not and cannot fullfill
> Ingo's requirement.

And if Ingo's requirement is fulfilled, would any C/R patchset be acceptable ?

C.

2009-03-06 18:25:25

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Dave Hansen ([email protected]):
> On Fri, 2009-03-06 at 10:23 -0600, Serge E. Hallyn wrote:
> > Which imo is fine, but my question is whether that leaves any actual
> > value in the persistent per-resource uncheckpointable flag.
>
> OK, let's take a look back at this discussion a little bit and how we
> got here.
>
> Ingo quotes:
> > Yeah, per resource it should be. That's per task in the normal
> > case - except for threaded workloads where it's shared by
> > threads.
>
> > Uncheckpointable should be a one-way flag anyway. We want this
> > to become usable, so uncheckpointable functionality should be as
> > painful as possible, to make sure it's getting fixed ...
>
> > Is there any automated test that could discover C/R breakage via
> > brute force? All that matters in such cases is to get the "you
> > broke stuff" information as soon as possible. If it comes at an
> > early stage developers can generally just fix stuff.
>
> You add these things together and you get what I posted. My patch is:
> 1. per resource
> 2. has a one way flag
> 3. Gives messages to developers at an early stage (dmesg) and lets them
> explore it more thoroughly (/proc)
>
> But, these "early stage" messages are completely opposed to an approach
> that uses sys_checkpoint() in some form (like with a -1 fd as an
> argument).

Well I disagree with that. The 'early stage' messages could be seen as
either:

1. a short-term way to prioritize resources to support
or
2. a long-term way to catch new resources introduced
without checkpoint/restart support

I don't believe 2. would work. I think 1. would work, but that we
risk imposing permanent code changes to support a temporary goal.

In contrast, the sys_checkpoint() check will always be needed to
check whether a particular application is checkpointable. For
instance a task will never be checkpointable if it shares a mm-struct
with a task not being checkpointed.

> Think of it like lockdep. We *could* have designed lockdep to simply
> give us a nice message whenever we do an a/b b/a deadlock. That would
> be helpful. Or, we could design it to record all lock acquisitions that
> didn't deadlock to see if they ever possibly deadlock. (We did the
> second one, btw). That gave an early, useful, warning that developers
> could fix before we encounter an actual problem. I'm advocating such a
> mechanism for c/r.

If you can convince me that it'll do that you'll have me on board :)

-serge

2009-03-06 18:38:54

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Cedric Le Goater ([email protected]):
> Serge E. Hallyn wrote:
> > Quoting Greg Kurz ([email protected]):
> >> On Fri, 2009-03-06 at 01:00 +0300, Alexey Dobriyan wrote:
> >>> On Thu, Mar 05, 2009 at 01:27:07PM -0800, Dave Hansen wrote:
> >>>>> Imagine, unsupported file is opened between userspace checks
> >>>>> for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
> >>>>> and whatever, you stil have to do all the checks inside checkpoint(2).
> >>>> Alexey, we have two problems here. I completely agree that we have to
> >>>> do complete and thorough checks of each file descriptor at
> >>>> sys_checkpoint(). Any checks made at other times should not be trusted.
> >>>>
> >>>> The other side is what Ingo has been asking for. How do we *know* when
> >>>> we are checkpointable *before* we call (and without calling)
> >>> This "without calling checkpoint(2)" results in much complications
> >>> as demonstrated.
> >>>
> >>> task_struct and file are not like other structures because they are exposed
> >>> in /proc. For PROC_FS=n kernels, one can't even check.
> >>>
> >>> You can do checkpoint(2) without actual dump. You pass, you're most
> >>> certainly checkpointable (with inevitable race condition in mind).
> >>>
> >> Ahhh thank you very much Alexey ! I wanted to explain this to Dave a few
> >> monthes ago but I failed... probably because of my poor English skills.
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2008-October/013549.html
> >>
> >> Why would we add checking all over the place when it MUST be done on the
> >> sys_checkpoint() path ? The checkpoint(2) dry-run is definitely the way
> >> to go.
> >
> > I'm sure Dave understood that this was possible :)
> >
> > But what you and Alexey are proposing does not and cannot fullfill
> > Ingo's requirement.
>
> And if Ingo's requirement is fulfilled, would any C/R patchset be acceptable ?

Yup, no matter how hideous :) Ok not really.

But the point was that it wasn't Dave not understanding Alexey's
suggestion, but Greg not understanding Ingo's. If you think Ingo's
goal isn't worthwhile or achievable, then argue that (as I am), don't
keep elaborating on something we all agree will be needed (Alexey's
suggestion or some other way of doing a true may-be-checkpointed test).

-serge

2009-03-06 19:42:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Fri, 2009-03-06 at 12:24 -0600, Serge E. Hallyn wrote:
> > But, these "early stage" messages are completely opposed to an approach
> > that uses sys_checkpoint() in some form (like with a -1 fd as an
> > argument).
>
> Well I disagree with that. The 'early stage' messages could be seen as
> either:
>
> 1. a short-term way to prioritize resources to support
> or
> 2. a long-term way to catch new resources introduced
> without checkpoint/restart support
>
> I don't believe 2. would work. I think 1. would work, but that we
> risk imposing permanent code changes to support a temporary goal.

I should be a bit more clear. My goal (and I think Ingo's) here is to
come up with a mechanism that will make the checkpoint feature less
likely to break once we merge it into the tree. I'm looking for a tool
that people can utilize, even if they don't necessarily care about
checkpoint/restart.

If we *completely* depend on sys_checkpoint() as the interface for
determining if we are checkpointable, we don't have such a tool. We
have a tool that the checkpoint/restart developers and probably some
testers can and certainly will use. This is still very, very useful.
But, it probably won't ever generate a bug report from anyone who
doesn't specifically care about c/r.

As far as detecting *new* resources. Well, crap. I don't think our
little ->may_checkpoint flags can do that. My little f_op trick will
help and is better than nothing. But, as you noted, it is far from
perfect because we'll probably have people just copying the generic*
functions into new code.

-- Dave

2009-03-10 15:59:32

by Nathan Lynch

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Please consider this and a following patch.

>From a0fb96aa41c4d360559013cfd7f32f07f449c1c4 Mon Sep 17 00:00:00 2001
From: Nathan Lynch <[email protected]>
Date: Mon, 9 Mar 2009 22:23:02 -0500
Subject: [PATCH] checkpoint: make files_deny_checkpointing print task name and pid

This lets the developer know *which* task performed an action that
prevents checkpoint.


Signed-off-by: Nathan Lynch <[email protected]>
---
fs/file.c | 2 +-
fs/open.c | 2 +-
include/linux/checkpoint.h | 13 +++++++------
3 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 0501af6..fcb2803 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -299,7 +299,7 @@ static void __scan_files_for_cr(struct files_struct *files)
continue;
if (cr_file_supported(f))
continue;
- files_deny_checkpointing(files);
+ files_deny_checkpointing(current, files);
}
}

diff --git a/fs/open.c b/fs/open.c
index 7f575b0..327ff5b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1018,7 +1018,7 @@ void fd_install(unsigned int fd, struct file *file)
struct fdtable *fdt;

if (!cr_file_supported(file))
- files_deny_checkpointing(files);
+ files_deny_checkpointing(current, files);

spin_lock(&files->file_lock);
fdt = files_fdtable(files);
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index e77393f..27366ac 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,6 +13,7 @@
#include <linux/path.h>
#include <linux/fs.h>
#include <linux/fdtable.h>
+#include <linux/sched.h>

#ifdef CONFIG_CHECKPOINT_RESTART

@@ -102,16 +103,16 @@ extern int cr_read_files(struct cr_ctx *ctx);

#define pr_fmt(fmt) "[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__

-static inline void __files_deny_checkpointing(struct files_struct *files,
- char *file, int line)
+static inline void __files_deny_checkpointing(const struct task_struct *tsk,
+ struct files_struct *files, char *file, int line)
{
if (!test_and_clear_bit(0, &files->may_checkpoint))
return;
- printk(KERN_INFO "process performed an action that can not be "
- "checkpointed at: %s:%d\n", file, line);
+ printk(KERN_INFO "%s/%i performed an action that can not be "
+ "checkpointed at: %s:%d\n", tsk->comm, tsk->pid, file, line);
}
-#define files_deny_checkpointing(f) \
- __files_deny_checkpointing(f, __FILE__, __LINE__)
+#define files_deny_checkpointing(tsk,f) \
+ __files_deny_checkpointing(tsk, f, __FILE__, __LINE__)

int cr_explain_file(struct file *file, char *explain, int left);
int cr_file_supported(struct file *file);
--
1.6.0.6

2009-03-10 16:00:26

by Nathan Lynch

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

>From 3695fbda6225d2436e4af67a4bce6984df0891be Mon Sep 17 00:00:00 2001
From: Nathan Lynch <[email protected]>
Date: Mon, 9 Mar 2009 22:36:40 -0500
Subject: [PATCH] ratelimit files_deny_checkpointing output

Any common distribution's boot sequence causes thousands of
"<something> performed an action that cannot be checkpointed"
messages. Kernels are known to produce other messages of interest at
times, so ratelimit the output of files_deny_checkpointing.


Signed-off-by: Nathan Lynch <[email protected]>
---
include/linux/checkpoint.h | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 27366ac..efdb8f0 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -108,8 +108,10 @@ static inline void __files_deny_checkpointing(const struct task_struct *tsk,
{
if (!test_and_clear_bit(0, &files->may_checkpoint))
return;
- printk(KERN_INFO "%s/%i performed an action that can not be "
- "checkpointed at: %s:%d\n", tsk->comm, tsk->pid, file, line);
+ if (printk_ratelimit())
+ printk(KERN_INFO "%s/%i performed an action that can not be "
+ "checkpointed at: %s:%d\n",
+ tsk->comm, tsk->pid, file, line);
}
#define files_deny_checkpointing(tsk,f) \
__files_deny_checkpointing(tsk, f, __FILE__, __LINE__)
--
1.6.0.6

2009-03-10 16:23:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Tue, 2009-03-10 at 10:57 -0500, Nathan Lynch wrote:
> From a0fb96aa41c4d360559013cfd7f32f07f449c1c4 Mon Sep 17 00:00:00 2001
> From: Nathan Lynch <[email protected]>
> Date: Mon, 9 Mar 2009 22:23:02 -0500
> Subject: [PATCH] checkpoint: make files_deny_checkpointing print task
> name and pid
>
> This lets the developer know *which* task performed an action that
> prevents checkpoint.

Can you think of any case where we wouldn't be calling
foo_deny_checkpoint() for a task other than current? Should we just use
current in the function directly instead of passing it in?

-- Dave

2009-03-10 16:26:17

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Nathan Lynch ([email protected]):
> Please consider this and a following patch.
>
> >From a0fb96aa41c4d360559013cfd7f32f07f449c1c4 Mon Sep 17 00:00:00 2001
> From: Nathan Lynch <[email protected]>
> Date: Mon, 9 Mar 2009 22:23:02 -0500
> Subject: [PATCH] checkpoint: make files_deny_checkpointing print task name and pid
>
> This lets the developer know *which* task performed an action that
> prevents checkpoint.
>
>
> Signed-off-by: Nathan Lynch <[email protected]>
> ---
> fs/file.c | 2 +-
> fs/open.c | 2 +-
> include/linux/checkpoint.h | 13 +++++++------
> 3 files changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 0501af6..fcb2803 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -299,7 +299,7 @@ static void __scan_files_for_cr(struct files_struct *files)
> continue;
> if (cr_file_supported(f))
> continue;
> - files_deny_checkpointing(files);
> + files_deny_checkpointing(current, files);

Ah but you can't do this, because __scan_files_for_cr is called
from dupfd which is called during copy_files, right?

So you'll need to send the task into __scan_files_for_cr, right?

I do wholeheartedly agree with the idea though.

thanks,
-serge

2009-03-10 16:31:00

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Nathan Lynch ([email protected]):
> >From 3695fbda6225d2436e4af67a4bce6984df0891be Mon Sep 17 00:00:00 2001
> From: Nathan Lynch <[email protected]>
> Date: Mon, 9 Mar 2009 22:36:40 -0500
> Subject: [PATCH] ratelimit files_deny_checkpointing output
>
> Any common distribution's boot sequence causes thousands of
> "<something> performed an action that cannot be checkpointed"
> messages. Kernels are known to produce other messages of interest at
> times,

nonense...

> so ratelimit the output of files_deny_checkpointing.

Yes, please.

> Signed-off-by: Nathan Lynch <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

> ---
> include/linux/checkpoint.h | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> index 27366ac..efdb8f0 100644
> --- a/include/linux/checkpoint.h
> +++ b/include/linux/checkpoint.h
> @@ -108,8 +108,10 @@ static inline void __files_deny_checkpointing(const struct task_struct *tsk,
> {
> if (!test_and_clear_bit(0, &files->may_checkpoint))
> return;
> - printk(KERN_INFO "%s/%i performed an action that can not be "
> - "checkpointed at: %s:%d\n", tsk->comm, tsk->pid, file, line);
> + if (printk_ratelimit())
> + printk(KERN_INFO "%s/%i performed an action that can not be "
> + "checkpointed at: %s:%d\n",
> + tsk->comm, tsk->pid, file, line);
> }
> #define files_deny_checkpointing(tsk,f) \
> __files_deny_checkpointing(tsk, f, __FILE__, __LINE__)
> --
> 1.6.0.6
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

2009-03-10 17:23:41

by Nathan Lynch

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

"Serge E. Hallyn" <[email protected]> wrote:
> Quoting Nathan Lynch ([email protected]):
> > Please consider this and a following patch.
> >
> > >From a0fb96aa41c4d360559013cfd7f32f07f449c1c4 Mon Sep 17 00:00:00 2001
> > From: Nathan Lynch <[email protected]>
> > Date: Mon, 9 Mar 2009 22:23:02 -0500
> > Subject: [PATCH] checkpoint: make files_deny_checkpointing print task name and pid
> >
> > This lets the developer know *which* task performed an action that
> > prevents checkpoint.
> >
> >
> > Signed-off-by: Nathan Lynch <[email protected]>
> > ---
> > fs/file.c | 2 +-
> > fs/open.c | 2 +-
> > include/linux/checkpoint.h | 13 +++++++------
> > 3 files changed, 9 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index 0501af6..fcb2803 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -299,7 +299,7 @@ static void __scan_files_for_cr(struct files_struct *files)
> > continue;
> > if (cr_file_supported(f))
> > continue;
> > - files_deny_checkpointing(files);
> > + files_deny_checkpointing(current, files);
>
> Ah but you can't do this, because __scan_files_for_cr is called
> from dupfd which is called during copy_files, right?

Are you saying that the message should identify the child instead of
the parent as the uncheckpointable task?

2009-03-10 17:45:45

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Nathan Lynch ([email protected]):
> "Serge E. Hallyn" <[email protected]> wrote:
> > Quoting Nathan Lynch ([email protected]):
> > > Please consider this and a following patch.
> > >
> > > >From a0fb96aa41c4d360559013cfd7f32f07f449c1c4 Mon Sep 17 00:00:00 2001
> > > From: Nathan Lynch <[email protected]>
> > > Date: Mon, 9 Mar 2009 22:23:02 -0500
> > > Subject: [PATCH] checkpoint: make files_deny_checkpointing print task name and pid
> > >
> > > This lets the developer know *which* task performed an action that
> > > prevents checkpoint.
> > >
> > >
> > > Signed-off-by: Nathan Lynch <[email protected]>
> > > ---
> > > fs/file.c | 2 +-
> > > fs/open.c | 2 +-
> > > include/linux/checkpoint.h | 13 +++++++------
> > > 3 files changed, 9 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/fs/file.c b/fs/file.c
> > > index 0501af6..fcb2803 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -299,7 +299,7 @@ static void __scan_files_for_cr(struct files_struct *files)
> > > continue;
> > > if (cr_file_supported(f))
> > > continue;
> > > - files_deny_checkpointing(files);
> > > + files_deny_checkpointing(current, files);
> >
> > Ah but you can't do this, because __scan_files_for_cr is called
> > from dupfd which is called during copy_files, right?
>
> Are you saying that the message should identify the child instead of
> the parent as the uncheckpointable task?

Yes. The parent may have opened the fd (or, importantly, may NOT have)
but the child is the one now getting that 'dirty' fd and being newly
marked uncheckpointable.

-serge

2009-03-10 17:47:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

On Tue, 2009-03-10 at 12:45 -0500, Serge E. Hallyn wrote:
> > Are you saying that the message should identify the child instead of
> > the parent as the uncheckpointable task?
>
> Yes. The parent may have opened the fd (or, importantly, may NOT have)
> but the child is the one now getting that 'dirty' fd and being newly
> marked uncheckpointable.

Yeah. It is kinda the parent's *fault* but this is the spot where we've
chosen to 'taint' the child. If I were looking back in the logs, I'd be
wondering from where the child's 'taint' flag came from. This is the
spot I should be looking for.

-- Dave

2009-03-11 07:52:07

by Cedric Le Goater

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

>> And if Ingo's requirement is fulfilled, would any C/R patchset be acceptable ?
>
> Yup, no matter how hideous :) Ok not really.
>
> But the point was that it wasn't Dave not understanding Alexey's
> suggestion, but Greg not understanding Ingo's. If you think Ingo's
> goal isn't worthwhile or achievable, then argue that (as I am), don't
> keep elaborating on something we all agree will be needed (Alexey's
> suggestion or some other way of doing a true may-be-checkpointed test).

I rather spend my time on enabling things rather than forbid them.

C.

2009-03-12 15:27:52

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability

Quoting Cedric Le Goater ([email protected]):
> >> And if Ingo's requirement is fulfilled, would any C/R patchset be acceptable ?
> >
> > Yup, no matter how hideous :) Ok not really.
> >
> > But the point was that it wasn't Dave not understanding Alexey's
> > suggestion, but Greg not understanding Ingo's. If you think Ingo's
> > goal isn't worthwhile or achievable, then argue that (as I am), don't
> > keep elaborating on something we all agree will be needed (Alexey's
> > suggestion or some other way of doing a true may-be-checkpointed test).
>
> I rather spend my time on enabling things rather than forbid them.

That sure sounds productive. How could I argue with that.

But wait, haven't several teams been doing that for years? So why is
c/r not in the upstream kernel? Could it be that ignoring the
upstream maintainers' concerns about (a) treating the feature as a
toy, (b) long-term maintainability, and (c) c/r becoming an impediment
to future features, and instead hacking away at our toy feature, is
*not* always the best course?

Now I actually don't think you believe it's politically possible for
c/r to get upstream. But I do.

Getting back on track. Ingo's concern is that this turn into a toy feature
rather than something which serious apps can rely on. To address that
he wants an application to be able to tell whether or not it is
checkpointable, and, if not, then why not.

Now Alexey also has a suggestion for addressing this. It has (at least)
two shortcomings relative to Dave's:

1. it is more prone to race conditions. If an app opens an
uncheckpointable file briefly and then closes it, and later reopens
it, then it may think it is checkpointable even though it could have
already known it is not always. If you want to argue that returning
-EAGAIN is better in that case, that seems reasonable.

2. For repeatedly checking the checkpointability of large
applications it could be much more costly. For instance, if we have
to check the flags on each vma, and an application has 10s or 100s of
gigs of memory, each check for checkpointability would require walking
all those vmas each time. Dave's approach has the advantage of only
checking those when the resource is opened.

Hmm, plus

3. One of the things Ingo likes about Dave's approach, which
you may think is bogus, is that users of an uncheckpointable application
will scream more loudly if the app becomes permanently uncheckpointable
(and they know why), than if it sometimes works and sometimes doesn't
(-EAGAIN).

The funny thing is, for simplicity I actually prefer Alexey's approach.
It's easier (and therefore seems more robust) to tell if a task has a
particular sort of file at a definite point in time, than to try and
catch all the ways such an open file can be opened or received. Which
is where Ingo's LSM suggestion is seductive, but I'm convinced that
approach would be politically impossible.

-serge

2009-03-12 18:10:11

by Matt Helsley

[permalink] [raw]
Subject: Re: [RFC][PATCH 09/11] check files for checkpointability

On Thu, Mar 05, 2009 at 08:39:10AM -0800, Dave Hansen wrote:
>
> Introduce a files_struct counter to indicate whether a particular
> file_struct has ever contained a file which can not be
> checkpointed. This flag is a one-way trip; once it is set, it may
> not be unset.
>
> We assume at allocation that a new files_struct is clean and may
> be checkpointed. However, as soon as it has had its files filled
> from its parent's, we check it for real in __scan_files_for_cr().
> At that point, we mark it if it contained any uncheckpointable
> files.
>
> We also check each 'struct file' when it is installed in a fd
> slot. This way, if anyone open()s or managed to dup() an
> unsuppored file, we can catch it.
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> linux-2.6.git-dave/fs/file.c | 19 +++++++++++++++++++
> linux-2.6.git-dave/fs/open.c | 5 +++++
> linux-2.6.git-dave/include/linux/checkpoint.h | 13 +++++++++++++
> linux-2.6.git-dave/include/linux/fdtable.h | 3 +++
> 4 files changed, 40 insertions(+)
>
> diff -puN fs/file.c~242c7afafb1a81c0d9a41085536f2197d81146e7 fs/file.c
> --- linux-2.6.git/fs/file.c~242c7afafb1a81c0d9a41085536f2197d81146e7 2009-03-05 08:37:04.000000000 -0800
> +++ linux-2.6.git-dave/fs/file.c 2009-03-05 08:37:04.000000000 -0800
> @@ -15,6 +15,7 @@
> #include <linux/file.h>
> #include <linux/fdtable.h>
> #include <linux/bitops.h>
> +#include <linux/checkpoint.h>
> #include <linux/interrupt.h>
> #include <linux/spinlock.h>
> #include <linux/rcupdate.h>
> @@ -285,6 +286,20 @@ static int count_open_files(struct fdtab
> return i;
> }
>
> +static void __scan_files_for_cr(struct files_struct *files)
> +{
> + int i;
> +
> + for (i = 0; i < files->fdtab.max_fds; i++) {
> + struct file *f = fcheck_files(files, i);
> + if (!f)
> + continue;
> + if (cr_file_supported(f))
> + continue;
> + files_deny_checkpointing(files);

At this point couldn't we skip the rest of the loop iterations?

Might it also be useful to print a path to f here? So not only would
the log show the location in the kernel source but we'd also get some
idea of which file caused the problem? Of course "f" isn't always
available everywhere we call files_deny_checkpointing()..

> + }
> +}
> +
> /*
> * Allocate a new files structure and copy contents from the
> * passed in files structure.
> @@ -303,6 +318,9 @@ struct files_struct *dup_fd(struct files
> goto out;
>
> atomic_set(&newf->count, 1);
> +#ifdef CONFIG_CHECKPOINT_RESTART
> + newf->may_checkpoint = 1;
> +#endif
>
> spin_lock_init(&newf->file_lock);
> newf->next_fd = 0;
> @@ -396,6 +414,7 @@ struct files_struct *dup_fd(struct files
>
> rcu_assign_pointer(newf->fdt, new_fdt);
>
> + __scan_files_for_cr(newf);
> return newf;
>
> out_release:
> diff -puN fs/open.c~242c7afafb1a81c0d9a41085536f2197d81146e7 fs/open.c
> --- linux-2.6.git/fs/open.c~242c7afafb1a81c0d9a41085536f2197d81146e7 2009-03-05 08:37:04.000000000 -0800
> +++ linux-2.6.git-dave/fs/open.c 2009-03-05 08:37:04.000000000 -0800
> @@ -29,6 +29,7 @@
> #include <linux/rcupdate.h>
> #include <linux/audit.h>
> #include <linux/falloc.h>
> +#include <linux/checkpoint.h>
>
> int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
> {
> @@ -1015,6 +1016,10 @@ void fd_install(unsigned int fd, struct
> {
> struct files_struct *files = current->files;
> struct fdtable *fdt;
> +
> + if (!cr_file_supported(file))
> + files_deny_checkpointing(files);
> +
> spin_lock(&files->file_lock);
> fdt = files_fdtable(files);
> BUG_ON(fdt->fd[fd] != NULL);
> diff -puN include/linux/checkpoint.h~242c7afafb1a81c0d9a41085536f2197d81146e7 include/linux/checkpoint.h
> --- linux-2.6.git/include/linux/checkpoint.h~242c7afafb1a81c0d9a41085536f2197d81146e7 2009-03-05 08:37:04.000000000 -0800
> +++ linux-2.6.git-dave/include/linux/checkpoint.h 2009-03-05 08:37:04.000000000 -0800
> @@ -12,6 +12,7 @@
>
> #include <linux/path.h>
> #include <linux/fs.h>
> +#include <linux/fdtable.h>
>
> #ifdef CONFIG_CHECKPOINT_RESTART
>
> @@ -101,11 +102,23 @@ extern int cr_read_files(struct cr_ctx *
>
> #define pr_fmt(fmt) "[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__
>
> +static inline void __files_deny_checkpointing(struct files_struct *files,
> + char *file, int line)
> +{
> + if (!test_and_clear_bit(0, &files->may_checkpoint))
> + return;
> + printk(KERN_INFO "process performed an action that can not be "
> + "checkpointed at: %s:%d\n", file, line);
> +}
> +#define files_deny_checkpointing(f) \
> + __files_deny_checkpointing(f, __FILE__, __LINE__)
> +

<snip>

2009-03-12 19:14:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 09/11] check files for checkpointability

On Mon, 2009-03-09 at 10:38 -0700, Matt Helsley wrote:
> On Thu, Mar 05, 2009 at 08:39:10AM -0800, Dave Hansen wrote:
> > +static void __scan_files_for_cr(struct files_struct *files)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < files->fdtab.max_fds; i++) {
> > + struct file *f = fcheck_files(files, i);
> > + if (!f)
> > + continue;
> > + if (cr_file_supported(f))
> > + continue;
> > + files_deny_checkpointing(files);
>
> At this point couldn't we skip the rest of the loop iterations?

As it stands, yeah. That makes sense.

> Might it also be useful to print a path to f here? So not only would
> the log show the location in the kernel source but we'd also get some
> idea of which file caused the problem? Of course "f" isn't always
> available everywhere we call files_deny_checkpointing()..

Also a good suggestion. That would help the readability of the warning
a bunch.

-- Dave

2009-03-13 03:07:05

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][PATCH 00/11] track files for checkpointability



Serge E. Hallyn wrote:
> Quoting Dave Hansen ([email protected]):
>> On Fri, 2009-03-06 at 01:00 +0300, Alexey Dobriyan wrote:
>>> On Thu, Mar 05, 2009 at 01:27:07PM -0800, Dave Hansen wrote:
>>>>> Imagine, unsupported file is opened between userspace checks
>>>>> for /proc/*/checkpointable and /proc/*/fdinfo/*/checkpointable
>>>>> and whatever, you stil have to do all the checks inside checkpoint(2).
>>>> Alexey, we have two problems here. I completely agree that we have to
>>>> do complete and thorough checks of each file descriptor at
>>>> sys_checkpoint(). Any checks made at other times should not be trusted.
>>>>
>>>> The other side is what Ingo has been asking for. How do we *know* when
>>>> we are checkpointable *before* we call (and without calling)
>>> This "without calling checkpoint(2)" results in much complications
>>> as demonstrated.
>> I'll let you take that up with Ingo. :)
>>
>>> task_struct and file are not like other structures because they are exposed
>>> in /proc.
>> Very true. But, we can always use the task as a proxy to say whether
>> any of this tasks's *resources* are uncheckpointable. Is this task's
>> ipc_namespace checkpointable, etc...
>>
>>> For PROC_FS=n kernels, one can't even check.
>> Definitely. I'd be happy to make this check require PROC=y or even
>> DEBUGFS=y. I just want to make the mechanism usable for developers so
>> they're more motivated to find and fix checkpoint issues.
>>
>>> You can do checkpoint(2) without actual dump. You pass, you're most
>>> certainly checkpointable (with inevitable race condition in mind).
>> OK, so you envision this as maybe calling sys_checkpoint() with a -1 fd
>> or something? I'm generally OK with that. If the /proc stuff is really
>> the sticking point here, I'd be happy to stick it at the end of the
>> series so we can throw it away more easily.
>
> Yeah thing is I definately like what Alexey is suggesting.

I totally agree with Alexey. Use a CR_CHECKPOINT_PROBE to indicate that
you want a 'quick' test pass.

>
> The only reason for going the route of Dave's patches is to implement
> the pain Ingo wants to inflict to push us to faster support the
> resources which users actually want/need. As Alexey says that's
> a temporary gain and therefore not worth permanent code.

Not only the gain is temporary, it's also not that big to begin with.
We're talking about the file system. The basic code, e.g. without an
optimization for unlinked files, is file system agnostic. The exception
are pseudo file systems that must be handles specifically.

In other words, the "special" cases are: pseudo file systems, devices,
and aliens like epollfd. Pseudo file systems require special handling
in any implementation. Devices -- how many of these are there that in
practice we need to checkpoint and restart ? certainly not network
drivers, nor graphics cards etc. The list is short: pty, null, random,
rtc, tty, ... some of which will also require some sort of virtalization
(e.g. RTC should be per container, but that's another topic).

It isn't a "pain" to support more resources - it's the joy !

Oren

>
> Oh, right, there's the second reason:
>
>>> With time the amount of stuff C/R won't support will approach zero,
>>> but the infrastructure for "checkpointable" will stay constant.
>>> If it's too much right now, it will be way too much in future.
>> What have you seen in OpenVZ? Do new things that are not checkpointable
>> pop up very often?
>
> Realistically, do you think the uncheckpointable stuff would catch a
> brand-new unsupported feature? If it has a file interface then I
> suppose it would. Well, might. I wouldn't be surprised if the authors
> would cut and paste enough code to paste the .checkpoint =
> generic_file_checkpoint line :)
>
> -serge
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers
>

2009-03-13 03:26:21

by Oren Laadan

[permalink] [raw]
Subject: Re: [RFC][PATCH 05/11] add generic checkpoint f_op to ext fses


Note that as far as I understand, ext4 does _not_ support a "relink"
operation. By "re-link" I mean an operation that re-links an (orphan)
inode to a filename. (By "orphan" I mean a file that was opened and
unlinked, so it does not appear in any directory anymore).

This feature is very useful to efficiently checkpoint unlinked files.

Oren.


Dave Hansen wrote:
> This marks ext[234] as being checkpointable. There will be many
> more to do this to, but this is a start.
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> linux-2.6.git-dave/fs/ext2/dir.c | 1 +
> linux-2.6.git-dave/fs/ext2/file.c | 2 ++
> linux-2.6.git-dave/fs/ext3/dir.c | 1 +
> linux-2.6.git-dave/fs/ext3/file.c | 1 +
> linux-2.6.git-dave/fs/ext4/dir.c | 1 +
> linux-2.6.git-dave/fs/ext4/file.c | 1 +
> 6 files changed, 7 insertions(+)
>
> diff -puN fs/ext2/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext2/dir.c
> --- linux-2.6.git/fs/ext2/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
> +++ linux-2.6.git-dave/fs/ext2/dir.c 2009-03-05 08:37:01.000000000 -0800
> @@ -721,4 +721,5 @@ const struct file_operations ext2_dir_op
> .compat_ioctl = ext2_compat_ioctl,
> #endif
> .fsync = ext2_sync_file,
> + .checkpoint = generic_file_checkpoint,
> };
> diff -puN fs/ext2/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext2/file.c
> --- linux-2.6.git/fs/ext2/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
> +++ linux-2.6.git-dave/fs/ext2/file.c 2009-03-05 08:37:01.000000000 -0800
> @@ -58,6 +58,7 @@ const struct file_operations ext2_file_o
> .fsync = ext2_sync_file,
> .splice_read = generic_file_splice_read,
> .splice_write = generic_file_splice_write,
> + .checkpoint = generic_file_checkpoint,
> };
>
> #ifdef CONFIG_EXT2_FS_XIP
> @@ -73,6 +74,7 @@ const struct file_operations ext2_xip_fi
> .open = generic_file_open,
> .release = ext2_release_file,
> .fsync = ext2_sync_file,
> + .checkpoint = generic_file_checkpoint,
> };
> #endif
>
> diff -puN fs/ext3/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext3/dir.c
> --- linux-2.6.git/fs/ext3/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
> +++ linux-2.6.git-dave/fs/ext3/dir.c 2009-03-05 08:37:01.000000000 -0800
> @@ -48,6 +48,7 @@ const struct file_operations ext3_dir_op
> #endif
> .fsync = ext3_sync_file, /* BKL held */
> .release = ext3_release_dir,
> + .checkpoint = generic_file_checkpoint,
> };
>
>
> diff -puN fs/ext3/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext3/file.c
> --- linux-2.6.git/fs/ext3/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
> +++ linux-2.6.git-dave/fs/ext3/file.c 2009-03-05 08:37:01.000000000 -0800
> @@ -122,6 +122,7 @@ const struct file_operations ext3_file_o
> .fsync = ext3_sync_file,
> .splice_read = generic_file_splice_read,
> .splice_write = generic_file_splice_write,
> + .checkpoint = generic_file_checkpoint,
> };
>
> const struct inode_operations ext3_file_inode_operations = {
> diff -puN fs/ext4/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext4/dir.c
> --- linux-2.6.git/fs/ext4/dir.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
> +++ linux-2.6.git-dave/fs/ext4/dir.c 2009-03-05 08:37:01.000000000 -0800
> @@ -48,6 +48,7 @@ const struct file_operations ext4_dir_op
> #endif
> .fsync = ext4_sync_file,
> .release = ext4_release_dir,
> + .checkpoint = generic_file_checkpoint,
> };
>
>
> diff -puN fs/ext4/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 fs/ext4/file.c
> --- linux-2.6.git/fs/ext4/file.c~add-stupid-checkpoint-to-lots-of-fses-take0 2009-03-05 08:37:01.000000000 -0800
> +++ linux-2.6.git-dave/fs/ext4/file.c 2009-03-05 08:37:01.000000000 -0800
> @@ -156,6 +156,7 @@ const struct file_operations ext4_file_o
> .fsync = ext4_sync_file,
> .splice_read = generic_file_splice_read,
> .splice_write = generic_file_splice_write,
> + .checkpoint = generic_file_checkpoint,
> };
>
> const struct inode_operations ext4_file_inode_operations = {
> _
>

2009-03-13 06:36:25

by Matt Helsley

[permalink] [raw]
Subject: Ensuring c/r maintainability (WAS Re: [RFC][PATCH 00/11] track files for checkpointability)

On Thu, Mar 12, 2009 at 10:30:48AM -0500, Serge E. Hallyn wrote:
> Quoting Cedric Le Goater ([email protected]):
> > >> And if Ingo's requirement is fulfilled, would any C/R patchset be acceptable ?
> > >
> > > Yup, no matter how hideous :) Ok not really.
> > >
> > > But the point was that it wasn't Dave not understanding Alexey's
> > > suggestion, but Greg not understanding Ingo's. If you think Ingo's
> > > goal isn't worthwhile or achievable, then argue that (as I am), don't
> > > keep elaborating on something we all agree will be needed (Alexey's
> > > suggestion or some other way of doing a true may-be-checkpointed test).
> >
> > I rather spend my time on enabling things rather than forbid them.
>
> That sure sounds productive. How could I argue with that.
>
> But wait, haven't several teams been doing that for years? So why is
> c/r not in the upstream kernel? Could it be that ignoring the
> upstream maintainers' concerns about (a) treating the feature as a
> toy, (b) long-term maintainability, and (c) c/r becoming an impediment
> to future features, and instead hacking away at our toy feature, is
> *not* always the best course?

I've been thinking about how we could make checkpoint/restart (c/r) more
maintainable in the long-term. I've only come up with two ideas:

I. Implement sparse-like __cr struct annotations for some compile-time checking.

First we annotate structures which c/r needs to save. For example we might have:

struct mm_struct {
__cr struct vm_area_struct * mmap;
struct rb_root mm_rb;
struct vm_area_struct *mmap_cache;
...
__cr unsigned long mmap_base;
__cr unsigned long task_size;
..
};

The __cr annotations indicate fields of the mm_struct which must be
saved during checkpoint restart. In fact, for non-pointer fields these
annotations would be sufficient to generate c/r code.

Next we would need a __cr_root annotation. These mark structures which
the c/r code visits that determine the scope of c/r. If there is no path from a
__cr annotation to a __cr_root annotation then we would conclude that c/r of
this struct is broken. These path constraint checks could be done at compile
time.

Since the example so far lacks a __cr_root we would know that there's a bug
since no __cr_root struct is reachable from an mm_struct. We'd fix that with:

struct task_struct __cr_root {
__cr volatile long state;
..
__cr struct mm_struct *mm;
struct *active_mm;
...
};

Of course there are problems with this specific annotation scheme. It doesn't
follow casts -- e.g. list heads, rb_nodes, and anything that uses the important
container_of() idiom would be problematic if the containers themselves are not
uniform. What I've proposed so far doesn't check the functions that walk these
data structures during c/r to make sure each saved instance has a matching
restore (seems like this could be addressed though). I'm no sparse expert so I
don't know that sparse can check these kinds of struct constraints,
however I'm pretty sure that if sparse can't do it then we can do it
with the dwarf-2 debugging information available.

We could also save __cr-annotated struct definitions across one or more commits
and compare them to determine how structures with __cr annotations change. We
could use dwarf-2 information to detect certain types of changes which can then
be flagged for further review, emit warnings or even errors. I think once c/r
was in mainline this would ideally be run against automated compile tests of
linux-next and the output sent to c/r maintainers/lists. This output would also
highlight changes relevant to userspace checkpoint-image converters that enable
things like kernel upgrades by doing checkpoint, kexec, and restart.

The idea of generating straight-line struct assignment code from these
annotations crossed my mind but I'm pretty sure that would be less
maintainable.

II. JIT-instrumentation

WARNING: This idea is much more vaporous than the previous idea.

Confirm that, from the perspective of executing userspace code, a restarted
container does exactly the same thing as the original, checkpointed container.
Use valgrind's JIT instrumentation framework to do instruction by instruction
step-and-compare cycles between corresponding tasks in each container:

1. Compare instruction pointer (IP)
2. Compare the instruction (redundant if we check text mappings)
3. Compare all register and memory operand contents

Each comparison must match exactly otherwise we "abort" the two containers.

Normally this would break even if c/r is correctly implemented. The idea is we
insert "mirroring" code into parts of the kernel to ensure that the timing and
contents of inputs external to the userspace portion of these containers match:
1. Network
i. Need to "mirror" packets to both containers
ii. Need to merge packets from both containers
2. Time
Each corresponding call to functions like gettimeofday()
would need to return the same struct timeval.

Each timeout result from calls like ppoll() would need
to be the same.
3. Locking "order" would need to be "mirrored". This is probably
the hardest to implement. Task A and A' (original and
duplicate respectively) could try to acquire the same
lock or semaphore in the kernel. The results of a system call
could vary depending on whether trying to acquire this
lock succeeds. Suddenly the task behavior would diverge
even though c/r is correct. I haven't figured out how to
address this problem but perhaps others can make
suggestions.
4. Scheduling
Need to schedule tasks within a container in the same order.
This is probably strongly tied to both timing and locking --
perhaps so much so that taking care of those will take care
of scheduling??

This should catch any points where the restarted application differs
from the original. I'm hoping it would catch most of the problems that
struct annotation would miss and make maintenance of c/r much less
problematic. One weakness of this method as I've described it so far is there's
no mechanism for relating the divergence of the two tasks to a spot in the c/r
code. Again, I haven't thought of a way to do that and perhaps others
can help take the idea farther.

A weakened version of this check might use the same kernel changes but only
compare system calls.

Cheers,
-Matt Helsley

2009-03-13 17:53:19

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: Ensuring c/r maintainability (WAS Re: [RFC][PATCH 00/11] track files for checkpointability)

Quoting Matt Helsley ([email protected]):
> On Thu, Mar 12, 2009 at 10:30:48AM -0500, Serge E. Hallyn wrote:
> > Quoting Cedric Le Goater ([email protected]):
> > > >> And if Ingo's requirement is fulfilled, would any C/R patchset be acceptable ?
> > > >
> > > > Yup, no matter how hideous :) Ok not really.
> > > >
> > > > But the point was that it wasn't Dave not understanding Alexey's
> > > > suggestion, but Greg not understanding Ingo's. If you think Ingo's
> > > > goal isn't worthwhile or achievable, then argue that (as I am), don't
> > > > keep elaborating on something we all agree will be needed (Alexey's
> > > > suggestion or some other way of doing a true may-be-checkpointed test).
> > >
> > > I rather spend my time on enabling things rather than forbid them.
> >
> > That sure sounds productive. How could I argue with that.
> >
> > But wait, haven't several teams been doing that for years? So why is
> > c/r not in the upstream kernel? Could it be that ignoring the
> > upstream maintainers' concerns about (a) treating the feature as a
> > toy, (b) long-term maintainability, and (c) c/r becoming an impediment
> > to future features, and instead hacking away at our toy feature, is
> > *not* always the best course?
>
> I've been thinking about how we could make checkpoint/restart (c/r) more
> maintainable in the long-term. I've only come up with two ideas:
>
> I. Implement sparse-like __cr struct annotations for some compile-time checking.
>
> First we annotate structures which c/r needs to save. For example we might have:
>
> struct mm_struct {
> __cr struct vm_area_struct * mmap;
> struct rb_root mm_rb;
> struct vm_area_struct *mmap_cache;
> ...
> __cr unsigned long mmap_base;
> __cr unsigned long task_size;
> ..
> };
>
> The __cr annotations indicate fields of the mm_struct which must be
> saved during checkpoint restart. In fact, for non-pointer fields these
> annotations would be sufficient to generate c/r code.
>
> Next we would need a __cr_root annotation. These mark structures which
> the c/r code visits that determine the scope of c/r. If there is no path from a
> __cr annotation to a __cr_root annotation then we would conclude that c/r of
> this struct is broken. These path constraint checks could be done at compile
> time.

Hi Matt,

is what you're detecting here really something we're worried about?

Maybe that's something we should be doing - coming up with a list of
the things we are trying to detect or prevent. I can only think of
a few offhand:

1. checkpoint (and restart) a task which has used a resource which we
do cannot (yet, or ever) safely checkpoint/restart.

2. kernel has a new feature for which we have not considered
checkpoint/restart. Not only is it not safe to c/r a task using it,
but we haven't even implemented a check for tasks using it.

3. Some new kernel feature has an attribute which simply must be
stored away. An example would be the vdso_base in s390 as of
recent 2.6.29 rc's, which was not present in 2.6.28. So there are
two things to worry about in this one:

a. detect that this happened and handle it, so c/r continues
to work.
b. figure out a way to restart an older c/r image on a newer
kernel - or simply detect older images and call them
incompatible.

-serge