2013-05-27 15:33:24

by Miklos Szeredi

[permalink] [raw]
Subject: [RFC PATCH] vfs: add permute operation

Permute the location of files. E.g. 'permute(A, B, C)' is equivalent to A->B,
B->C and C->A. This is essentially a series of renames done as a single atomic
operation.

The current implementation allows two or three objects to be permuted. More
than three would complicate it too much without a clear gain. All objects have
to be positive and each can be of any type.

This is to serve as a helper to implement whiteouts for overlay/union, but is
expected to be useful for other things as well.

Previous approaches tended to focus on implementing whiteouts as separate
filesystem objects and specialized operations dealing with them. I think that's
the wrong approach because these objects and code are not generally usable
outside overlay/union implementations.

Whiteouts are needed to "cover" deleted entries in lower filesystems. They
maybe created in case of an unlink, rmdir or rename. Whiteouts need to be
replaced when a real object is created in place. This happens on create, mkdir,
link, rename, etc...

As usual the most complicated operation is rename: a whiteout may need to be
created and may need to be removed as well. The "empty" target directory may
not actually be empty, as it may contain whiteout files covering all the deleted
entries in that directory, which all need to be removed. And all of this needs
to be done as an atomic operation. We could add whiteout support to filesystem
ops to perform the creation or removal of whiteouts atomically, but it would
complicate many filesystem ops needlessly.

Alternatively we can add a generic permute operation and add whiteout support to
the VFS which utilizes this to perform the operations atomically. To do this
there has to be a temporary "orphan" directory where objects are prepared and
cleaned up. Hiding this orphan directory from userspace could also be done by
filesystems if necessary.

Here are some examples of doing various operations involving whiteouts.

rmdir(P) where P needs to be whiteout:
- create-whiteout(orphan/X)
- permute(orphan/X, P)
- rmdir(orphan/X)

unlink(P) where P needs to be whiteout (doesn't actually have to use permute):
- create-whiteout(orphan/X)
- rename(orphan/X, P)

mkdir(P) where P is currently a whiteout:
- mkdir(orphan/X)
- permute(orphan/X, P)
- unlink(orphan/X)

rename(P, Q) where P needs to be whiteout and Q is negative:
- create-whiteout(Q)
- permute(P, Q)

rename(P, Q) where P needs to be whiteout and Q is nondir:
- create-whiteout(orphan/X)
- permute(orphan/X, P, Q)
- unlink(orphan/X)

rename(P, Q) where Q is a directory containing only whiteouts:
- create-whiteout(orphan/X)
- permute(orphan/X, P, Q)
- recursive-remove(orphan/X)

Signed-off-by: Miklos Szeredi <[email protected]>
---
arch/x86/syscalls/syscall_64.tbl | 1
fs/dcache.c | 149 +++++++++++++++++++++++
fs/ext4/namei.c | 175 +++++++++++++++++++++++++++
fs/internal.h | 1
fs/namei.c | 249 +++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 1
include/linux/fs.h | 6
include/linux/syscalls.h | 3
8 files changed, 585 insertions(+)

--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
311 64 process_vm_writev sys_process_vm_writev
312 common kcmp sys_kcmp
313 common finit_module sys_finit_module
+314 64 permute sys_permute

#
# x32-specific system call numbers start at 512 to avoid cache impact
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2316,6 +2316,155 @@ void d_move(struct dentry *dentry, struc
}
EXPORT_SYMBOL(d_move);

+static void sort_parents3(struct dentry **p)
+{
+ unsigned i;
+
+ /*
+ * Iterate over all permutations.
+ */
+ for (i = 0; i < 3; i++) {
+ if (d_ancestor(p[1], p[2]))
+ goto found;
+ swap(p[1], p[2]);
+ if (d_ancestor(p[1], p[2]))
+ goto found;
+ swap(p[0], p[1]);
+ }
+ return;
+
+found:
+ /*
+ * Now p[1] is ancestor of p[2]. If p[0] is ancestor of p[1] then p[0]
+ * is the top ancestor. Otherwise p[1] is the top and the other two
+ * need sorting.
+ */
+ if (!d_ancestor(p[0], p[1])) {
+ swap(p[0], p[1]);
+ if (!d_ancestor(p[1], p[2]))
+ swap(p[1], p[2]);
+ }
+}
+
+void sort_parents(struct dentry **p, unsigned *nump)
+{
+ switch (*nump) {
+ case 2:
+ break;
+ case 3:
+ if (p[0] == p[1] || p[0] == p[2] || p[1] == p[2]) {
+ (*nump)--;
+ if (p[0] == p[1])
+ swap(p[1], p[2]);
+ break;
+ }
+ return sort_parents3(p);
+ default:
+ BUG();
+ }
+ if (p[0] == p[1]) {
+ (*nump)--;
+ } else {
+ if (!d_ancestor(p[0], p[1]))
+ swap(p[0], p[1]);
+ }
+}
+
+/*
+ * d_permute - permute dentries
+ * @dentry: array of dentries
+ * @num: size of array
+ *
+ * Do a cyclic permutation of the dentries in the forward direction. This
+ * is essentially a more general version of d_move().
+ */
+void d_permute(struct dentry **dentry, unsigned num)
+{
+ unsigned i;
+ unsigned next;
+ struct dentry *parent[PERMUTE_MAX];
+ unsigned numparent = num;
+ struct qstr tmp_name;
+ unsigned char tmp_iname[DNAME_INLINE_LEN];
+ struct dentry *tmp_parent;
+ bool tmp_internal = false;
+
+ BUG_ON(num > PERMUTE_MAX);
+
+ write_seqlock(&rename_lock);
+
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+ BUG_ON(!dentry[i]->d_inode);
+ /* No disconnected dentries */
+ BUG_ON(IS_ROOT(dentry[i]));
+ parent[i] = dentry[i]->d_parent;
+ }
+
+ sort_parents(parent, &numparent);
+ for (i = 0; i < numparent; i++)
+ spin_lock_nested(&parent[i]->d_lock, i);
+
+ for (i = 0; i < num; i++) {
+ spin_lock_nested(&dentry[i]->d_lock, PERMUTE_MAX + i);
+ write_seqcount_begin(&dentry[i]->d_seq);
+ }
+
+ /*
+ * Move the dentry to the target hash queue. Don't bother checking
+ * for the same hash queue because of how unlikely it is.
+ */
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+ __d_drop(dentry[i]);
+ __d_rehash(dentry[i], d_hash(dentry[next]->d_parent,
+ dentry[next]->d_name.hash));
+ list_del(&dentry[i]->d_u.d_child);
+ }
+
+ /* Switch the names and the parents */
+ tmp_name = dentry[0]->d_name;
+ tmp_parent = dentry[0]->d_parent;
+ if (!dname_external(dentry[0])) {
+ tmp_internal = true;
+ memcpy(tmp_iname, dentry[0]->d_name.name,
+ dentry[0]->d_name.len + 1);
+ }
+ for (i = 1; i < num; i++) {
+ dentry[i-1]->d_name = dentry[i]->d_name;
+ if (!dname_external(dentry[i])) {
+ memcpy(dentry[i-1]->d_iname, dentry[i]->d_name.name,
+ dentry[i]->d_name.len + 1);
+ dentry[i-1]->d_name.name = dentry[i-1]->d_iname;
+ }
+ dentry[i-1]->d_parent = dentry[i]->d_parent;
+ }
+ dentry[num-1]->d_name = tmp_name;
+ if (tmp_internal) {
+ memcpy(dentry[num-1]->d_iname, tmp_iname, tmp_name.len + 1);
+ dentry[num-1]->d_name.name = dentry[num-1]->d_iname;
+ }
+ dentry[num-1]->d_parent = tmp_parent;
+
+ /* And add them back to the (new) parent lists */
+ for (i = 0; i < num; i++) {
+ list_add(&dentry[i]->d_u.d_child,
+ &dentry[i]->d_parent->d_subdirs);
+ write_seqcount_end(&dentry[i]->d_seq);
+ }
+
+ for (i = 0; i < numparent; i++)
+ spin_unlock(&parent[i]->d_lock);
+
+ for (i = 0; i < num; i++) {
+ fsnotify_d_move(dentry[i]);
+ spin_unlock(&dentry[i]->d_lock);
+ }
+
+ write_sequnlock(&rename_lock);
+}
+
+
/**
* d_ancestor - search for an ancestor
* @p1: ancestor dentry
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3160,6 +3160,180 @@ static int ext4_rename(struct inode *old
return retval;
}

+static int ext4_permute(struct dentry **dentry, unsigned int num)
+{
+ struct inode *dir[PERMUTE_MAX];
+ struct inode *inode[PERMUTE_MAX];
+ struct buffer_head *bh[PERMUTE_MAX];
+ struct buffer_head *dir_bh[PERMUTE_MAX];
+ struct ext4_dir_entry_2 *de[PERMUTE_MAX];
+ int parent_inlined[PERMUTE_MAX];
+ int inlined[PERMUTE_MAX];
+ struct ext4_dir_entry_2 *parent_de[PERMUTE_MAX];
+ __le32 tmp_inode;
+ __u8 tmp_file_type;
+ bool sync = 0;
+ bool has_filetype;
+ handle_t *handle;
+ int retval;
+ unsigned i;
+ unsigned next;
+ struct super_block *sb = dentry[0]->d_sb;
+
+ if (WARN_ON(num > PERMUTE_MAX) ||
+ WARN_ON(num < 2))
+ return -EINVAL;
+
+ for (i = 0; i < num; i++) {
+ parent_inlined[i] = 0;
+ inlined[i] = 0;
+ bh[i] = dir_bh[i] = NULL;
+ inode[i] = dentry[i]->d_inode;
+ dir[i] = dentry[i]->d_parent->d_inode;
+ dquot_initialize(dir[i]);
+ if (IS_DIRSYNC(dir[i]))
+ sync = 1;
+ }
+
+ handle = ext4_journal_start(dir[0], EXT4_HT_DIR,
+ num * EXT4_DATA_TRANS_BLOCKS(sb) +
+ num * EXT4_INDEX_EXTRA_TRANS_BLOCKS + num);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ if (sync)
+ ext4_handle_sync(handle);
+
+ for (i = 0; i < num; i++) {
+ bh[i] = ext4_find_entry(dir[i], &dentry[i]->d_name, &de[i],
+ &inlined[i]);
+
+ retval = -ENOENT;
+ if (!bh[i] || le32_to_cpu(de[i]->inode) != inode[i]->i_ino)
+ goto end_permute;
+ }
+
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+ if (S_ISDIR(inode[i]->i_mode) && dir[i] != dir[next]) {
+ retval = -EIO;
+ dir_bh[i] = ext4_get_first_dir_block(handle, inode[i],
+ &retval, &parent_de[i],
+ &parent_inlined[i]);
+ if (!dir_bh[i])
+ goto end_permute;
+ if (le32_to_cpu(parent_de[i]->inode) != dir[i]->i_ino)
+ goto end_permute;
+ /*
+ * If moving a dir in place of a non-dir then check for
+ * the link count limit.
+ */
+ retval = -EMLINK;
+ if (!S_ISDIR(inode[next]->i_mode) &&
+ EXT4_DIR_LINK_MAX(dir[next]))
+ goto end_permute;
+
+ BUFFER_TRACE(olddir_bh, "get_write_access");
+ retval = ext4_journal_get_write_access(handle,
+ dir_bh[i]);
+ if (retval)
+ goto end_permute;
+ }
+ BUFFER_TRACE(new_bh, "get write access");
+ retval = ext4_journal_get_write_access(handle, bh[i]);
+ if (retval)
+ goto end_permute;
+ }
+
+ has_filetype =
+ EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FILETYPE);
+
+ i = num - 1;
+ tmp_inode = de[i]->inode;
+ tmp_file_type = de[i]->file_type;
+ for (; i > 0; i--) {
+ de[i]->inode = de[i - 1]->inode;
+ if (has_filetype)
+ de[i]->file_type = de[i - 1]->file_type;
+ }
+ de[0]->inode = tmp_inode;
+ if (has_filetype)
+ de[0]->file_type = tmp_file_type;
+
+
+ for (i = 0; i < num; i++) {
+ dir[i]->i_version++;
+ dir[i]->i_ctime = dir[i]->i_mtime = ext4_current_time(dir[i]);
+ ext4_mark_inode_dirty(handle, dir[i]);
+ BUFFER_TRACE(old_bh, "call ext4_handle_dirty_metadata");
+ if (!inlined[i]) {
+ retval = ext4_handle_dirty_dirent_node(handle, dir[i],
+ bh[i]);
+ if (unlikely(retval)) {
+ ext4_std_error(sb, retval);
+ goto end_permute;
+ }
+ }
+ brelse(bh[i]);
+ bh[i] = NULL;
+
+ /*
+ * Set the ctime for inodes on permute.
+ */
+ inode[i]->i_ctime = ext4_current_time(inode[i]);
+ ext4_mark_inode_dirty(handle, inode[i]);
+
+ }
+
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+
+ if (!dir_bh[i])
+ continue;
+
+ parent_de[i]->inode = cpu_to_le32(dir[next]->i_ino);
+ BUFFER_TRACE(dir_bh[i], "call ext4_handle_dirty_metadata");
+ if (!parent_inlined[i]) {
+ if (is_dx(inode[i])) {
+ retval = ext4_handle_dirty_dx_node(handle,
+ inode[i], dir_bh[i]);
+ } else {
+ retval = ext4_handle_dirty_dirent_node(handle,
+ inode[i], dir_bh[i]);
+ }
+ } else {
+ retval = ext4_mark_inode_dirty(handle, inode[i]);
+ }
+ if (retval) {
+ ext4_std_error(sb, retval);
+ goto end_permute;
+ }
+ }
+ for (i = 0; i < num; i++) {
+ if (S_ISDIR(inode[i]->i_mode)) {
+ next = (i + 1) % num;
+ ext4_dec_count(handle, dir[i]);
+ ext4_inc_count(handle, dir[next]);
+ ext4_update_dx_flag(dir[i]);
+ ext4_update_dx_flag(dir[next]);
+ ext4_mark_inode_dirty(handle, dir[i]);
+ ext4_mark_inode_dirty(handle, dir[next]);
+ }
+ }
+
+
+ retval = 0;
+
+end_permute:
+ for (i = 0; i < num; i++) {
+ brelse(dir_bh[i]);
+ brelse(bh[i]);
+ }
+ ext4_journal_stop(handle);
+
+ return retval;
+}
+
/*
* directories can handle most operations...
*/
@@ -3173,6 +3347,7 @@ const struct inode_operations ext4_dir_i
.rmdir = ext4_rmdir,
.mknod = ext4_mknod,
.rename = ext4_rename,
+ .permute = ext4_permute,
.setattr = ext4_setattr,
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -125,6 +125,7 @@ extern int invalidate_inodes(struct supe
* dcache.c
*/
extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
+extern void sort_parents(struct dentry **p, unsigned *nump);

/*
* read_write.c
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3932,6 +3932,255 @@ SYSCALL_DEFINE2(rename, const char __use
return sys_renameat(AT_FDCWD, oldname, AT_FDCWD, newname);
}

+static int vfs_permute(struct dentry **dentry, unsigned num)
+{
+ int error;
+ struct super_block *sb = dentry[0]->d_sb;
+ unsigned max_links = sb->s_max_links;
+ struct inode *dir;
+ struct inode *next_dir;
+ struct inode *inode;
+ struct inode *next_inode;
+ unsigned i;
+ unsigned j;
+ unsigned next;
+
+ for (i = 0; i < num; i++) {
+ for (j = i + 1; j < num; j++) {
+ if (dentry[i] == dentry[j])
+ return -EINVAL;
+ }
+ }
+
+ for (i = 0; i < num; i++) {
+ dir = dentry[i]->d_parent->d_inode;
+ inode = dentry[i]->d_inode;
+ error = may_delete(dir, dentry[i], S_ISDIR(inode->i_mode));
+ if (error)
+ return error;
+ }
+
+ dir = dentry[0]->d_parent->d_inode;
+ if (!dir->i_op->permute)
+ return -EPERM;
+
+ /*
+ * If we are going to change the parent - check write permissions,
+ * we'll need to flip '..'.
+ */
+
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+ dir = dentry[i]->d_parent->d_inode;
+ next_dir = dentry[next]->d_parent->d_inode;
+ inode = dentry[i]->d_inode;
+ next_inode = dentry[next]->d_inode;
+
+ if (S_ISDIR(inode->i_mode) && dir != next_dir) {
+ error = inode_permission(inode, MAY_WRITE);
+ if (error)
+ return error;
+
+ error = -EMLINK;
+ if (max_links && !S_ISDIR(next_inode->i_mode) &&
+ next_dir->i_nlink >= max_links)
+ return error;
+ }
+ }
+
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+ dir = dentry[i]->d_parent->d_inode;
+ next_dir = dentry[next]->d_parent->d_inode;
+ error = security_inode_rename(dir, dentry[i],
+ next_dir, dentry[next]);
+ if (error)
+ return error;
+ }
+
+ for (i = 0; i < num; i++) {
+ error = -EBUSY;
+ if (d_mountpoint(dentry[i]))
+ return error;
+ }
+
+ dir = dentry[0]->d_parent->d_inode;
+ error = dir->i_op->permute(dentry, num);
+ if (error)
+ return error;
+
+ d_permute(dentry, num);
+
+ if (!error) {
+ for (i = 0; i < num; i++) {
+ unsigned prev = (i + num - 1) % num;
+ const unsigned char *old_name;
+ int is_dir = S_ISDIR(dentry[i]->d_inode->i_mode);
+
+ old_name = dentry[prev]->d_name.name;
+ next = (i + 1) % num;
+ dir = dentry[i]->d_parent->d_inode;
+ next_dir = dentry[next]->d_parent->d_inode;
+ fsnotify_move(dir, next_dir, old_name, is_dir, NULL,
+ dentry[i]);
+ }
+ }
+
+ return error;
+}
+
+static inline unsigned int i_mutex_depth_class(unsigned depth)
+{
+ switch (depth) {
+ case 0:
+ return I_MUTEX_PARENT;
+ case 1:
+ return I_MUTEX_CHILD;
+ case 2:
+ return I_MUTEX_NORMAL;
+ default:
+ BUG();
+ }
+}
+
+SYSCALL_DEFINE6(permute,
+ int, dfd0, const char __user *, name0,
+ int, dfd1, const char __user *, name1,
+ int, dfd2, const char __user *, name2)
+{
+ int dfd[PERMUTE_MAX];
+ const char __user *name[PERMUTE_MAX];
+ struct dentry *dentry[PERMUTE_MAX];
+ struct dentry *dir[PERMUTE_MAX];
+ struct nameidata nd[PERMUTE_MAX];
+ struct filename *fname[PERMUTE_MAX];
+ struct super_block *sb;
+ unsigned num;
+ unsigned numparent;
+ unsigned i;
+ unsigned next;
+ int error;
+
+ dfd[0] = dfd0;
+ dfd[1] = dfd1;
+ dfd[2] = dfd2;
+ name[0] = name0;
+ name[1] = name1;
+ name[2] = name2;
+
+ num = 3;
+ if (name[2] == NULL)
+ num = 2;
+
+ for (i = 0; i < num; i++) {
+ dentry[i] = NULL;
+ fname[i] = NULL;
+ }
+
+ for (i = 0; i < num; i++) {
+ struct filename *fname_res;
+ fname_res = user_path_parent(dfd[i], name[i], &nd[i], 0);
+ if (IS_ERR(fname_res)) {
+ error = PTR_ERR(fname_res);
+ goto exit;
+ }
+ fname[i] = fname_res;
+ }
+
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+ error = -EXDEV;
+ if (nd[i].path.mnt != nd[next].path.mnt)
+ goto exit;
+ }
+
+ for (i = 0; i < num; i++) {
+ dir[i] = nd[i].path.dentry;
+ error = -EBUSY;
+ if (nd[i].last_type != LAST_NORM)
+ goto exit;
+ }
+
+ error = mnt_want_write(nd[0].path.mnt);
+ if (error)
+ goto exit;
+
+ for (i = 0; i < num; i++)
+ nd[i].flags &= ~LOOKUP_PARENT;
+
+ numparent = num;
+ sort_parents(dir, &numparent);
+ sb = dir[0]->d_inode->i_sb;
+ if (numparent > 1)
+ mutex_lock(&sb->s_vfs_rename_mutex);
+
+ for (i = 0; i < numparent; i++) {
+ mutex_lock_nested(&dir[i]->d_inode->i_mutex,
+ i_mutex_depth_class(i));
+ }
+
+ for (i = 0; i < num; i++) {
+ struct dentry *dentry_res;
+ dentry_res = lookup_hash(&nd[i]);
+ error = PTR_ERR(dentry_res);
+ if (IS_ERR(dentry_res))
+ goto exit3;
+
+ dentry[i] = dentry_res;
+
+ /* source must exist */
+ error = -ENOENT;
+ if (!dentry[i]->d_inode)
+ goto exit3;
+
+ /* unless it is a directory trailing slashes give -ENOTDIR */
+ if (!S_ISDIR(dentry[i]->d_inode->i_mode)) {
+ error = -ENOTDIR;
+ if (nd[i].last.name[nd[i].last.len])
+ goto exit3;
+ }
+ }
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+ error = -EINVAL;
+ /* source should not be ancestor or descendant of target */
+ if (d_ancestor(dentry[i], dentry[next]))
+ goto exit3;
+ if (d_ancestor(dentry[next], dentry[i]))
+ goto exit3;
+ }
+
+
+ for (i = 0; i < num; i++) {
+ next = (i + 1) % num;
+ error = security_path_rename(&nd[i].path, dentry[i],
+ &nd[next].path, dentry[next]);
+ if (error)
+ goto exit3;
+ }
+ error = vfs_permute(dentry, num);
+
+exit3:
+ for (i = 0; i < num; i++)
+ dput(dentry[i]);
+
+ for (i = 0; i < numparent; i++)
+ mutex_unlock(&dir[i]->d_inode->i_mutex);
+
+ if (numparent > 1)
+ mutex_unlock(&sb->s_vfs_rename_mutex);
+
+ mnt_drop_write(nd[0].path.mnt);
+exit:
+ for (i = 0; i < num; i++) {
+ if (fname[i]) {
+ path_put(&nd[i].path);
+ putname(fname[i]);
+ }
+ }
+ return error;
+}
+
int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen, const char *link)
{
int len;
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -293,6 +293,7 @@ extern void dentry_update_name_case(stru

/* used for rename() and baskets */
extern void d_move(struct dentry *, struct dentry *);
+extern void d_permute(struct dentry **dentry, unsigned num);
extern struct dentry *d_ancestor(struct dentry *, struct dentry *);

/* appendix may either be NULL or be used for transname suffixes */
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1545,6 +1545,11 @@ struct file_operations {
int (*show_fdinfo)(struct seq_file *m, struct file *f);
};

+/*
+ * Things get ugly with permutations involving > 3 dentries, so don't do that
+ */
+#define PERMUTE_MAX 3
+
struct inode_operations {
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
void * (*follow_link) (struct dentry *, struct nameidata *);
@@ -1563,6 +1568,7 @@ struct inode_operations {
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
+ int (*permute) (struct dentry **, unsigned int);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -841,4 +841,7 @@ asmlinkage long sys_process_vm_writev(pi
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_permute(int dfd0, const char __user *name0,
+ int dfd1, const char __user *name1,
+ int dfd2, const char __user *name2);
#endif


2013-05-28 17:37:01

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add permute operation

Some quick thoughts:

> Permute the location of files. E.g. 'permute(A, B, C)' is equivalent to A->B,
> B->C and C->A. This is essentially a series of renames done as a single atomic
> operation.

Hmm. Can we choose a more specific name than 'permute'? To me,
->permute() tells me just as much about the operation as
->do_something(). {multi,bulk,mass}_rename()? renamev()?

Maybe it's just me.

> to be done as an atomic operation. We could add whiteout support to filesystem
> ops to perform the creation or removal of whiteouts atomically, but it would
> complicate many filesystem ops needlessly.
>
> Alternatively we can add a generic permute operation and add whiteout support to
> the VFS which utilizes this to perform the operations atomically.

I certainly like the sound of this.

> +static void sort_parents3(struct dentry **p)
> +void sort_parents(struct dentry **p, unsigned *nump)

Yikes, that's a bunch of fiddly code. Is it *really* worth all that to
avoid calling the generic sort helpers?

> + if (WARN_ON(num > PERMUTE_MAX) ||
> + WARN_ON(num < 2))
> + return -EINVAL;

And in other places this is a BUG? Why not, like the syscall, limit the
arguments to three if we're serious about that limitation?

- z

2013-05-29 08:12:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add permute operation

On Tue, May 28, 2013 at 7:36 PM, Zach Brown <[email protected]> wrote:
> Some quick thoughts:
>
>> Permute the location of files. E.g. 'permute(A, B, C)' is equivalent to A->B,
>> B->C and C->A. This is essentially a series of renames done as a single atomic
>> operation.
>
> Hmm. Can we choose a more specific name than 'permute'? To me,
> ->permute() tells me just as much about the operation as
> ->do_something(). {multi,bulk,mass}_rename()? renamev()?

It's not just plain muti-rename, but a cyclic one. Maybe cyclic_rename()?

>> +static void sort_parents3(struct dentry **p)
>> +void sort_parents(struct dentry **p, unsigned *nump)
>
> Yikes, that's a bunch of fiddly code. Is it *really* worth all that to
> avoid calling the generic sort helpers?

AFAICS, I cannot make the compare function transitive, e.g.: A is
descendant of C but B is unrelated. Then what should cmp(A, B) and
cmp(B, C) return?

>
>> + if (WARN_ON(num > PERMUTE_MAX) ||
>> + WARN_ON(num < 2))
>> + return -EINVAL;
>
> And in other places this is a BUG? Why not, like the syscall, limit the
> arguments to three if we're serious about that limitation?

I could be more consistent with the BUGs. Doing it with 3 args is not
necessarily good, since then we can't do loops and the chance of a
copy-paste error is increased.

Thanks,
Miklos

2013-05-29 23:02:28

by Zach Brown

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add permute operation

> >> +static void sort_parents3(struct dentry **p)
> >> +void sort_parents(struct dentry **p, unsigned *nump)
> >
> > Yikes, that's a bunch of fiddly code. Is it *really* worth all that to
> > avoid calling the generic sort helpers?
>
> AFAICS, I cannot make the compare function transitive, e.g.: A is
> descendant of C but B is unrelated. Then what should cmp(A, B) and
> cmp(B, C) return?

Ah, of course. I wasn't reading closely enough.

- z

2013-05-30 09:11:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add permute operation

On Thu, May 30, 2013 at 5:45 PM, Miklos Szeredi <[email protected]> wrote:
>
> The third name is because of the replace-empty-directory wart in the
> rename(2) definition. With overlay/union that can become
>
> 1) check if destination directory is empty: upper directory contains a
> whiteout for each lower directory entry and nothing else
> 2) if empty then remove whiteouts in destination directory
> 3) and then go on with the normal rename procedure, replacing the empty
> destination directory with the source directory ,
>
> This is done with directory locking, so atomicity is not usually a problem.
> But in case of a crash between 2) and 3) we just seriously corrupted the
> overlay.
>
> Suggestions for fixing that?

Why not just do the NFS thing. That has worked forever - using a
sillyrename as a "pending deletion" instead of actually deleting
things.

So in between (1) and (2), silly-rename the pseudo-empty target. At
that point (2) is no longer even an atomicity requirement, because you
can do the whiteout removal later. In fact, you probably want to do it
at the end, after doing the "real" rename.

No, it's not perfect, but it works in practice. NFS may not be POSIX,
but nobody really cares. It's usable.

> We could just refuse to do the rename-over-empty-directory and see if anyone
> complains. I don't think it's often used, but if something is documented
> then people are bound to find some stupid use for it.

I'm sure there are uses for it, since it's traditional unix behavior.
And I'm sure there are good reasons for it too (eg locking over NFS or
whatever)

Linus

2013-05-30 10:15:44

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add permute operation

On Thu, May 30, 2013 at 11:11 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, May 30, 2013 at 5:45 PM, Miklos Szeredi <[email protected]> wrote:

>> 1) check if destination directory is empty: upper directory contains a
>> whiteout for each lower directory entry and nothing else
>> 2) if empty then remove whiteouts in destination directory
>> 3) and then go on with the normal rename procedure, replacing the empty
>> destination directory with the source directory ,
>>
>> This is done with directory locking, so atomicity is not usually a problem.
>> But in case of a crash between 2) and 3) we just seriously corrupted the
>> overlay.
>>
>> Suggestions for fixing that?
>
> Why not just do the NFS thing. That has worked forever - using a
> sillyrename as a "pending deletion" instead of actually deleting
> things.
>
> So in between (1) and (2), silly-rename the pseudo-empty target. At
> that point (2) is no longer even an atomicity requirement, because you
> can do the whiteout removal later. In fact, you probably want to do it
> at the end, after doing the "real" rename.

Okay, nice idea. More specifically we want to replace the directory
containing whiteouts with an opaque empty directory, which can be done
with a cross-rename.

Then we are left with basically two new variants of rename:

- cross rename - exchange two names
- plain overwriting rename but whiteout source

I'm fine with that.

As for userspace interfaces I think the cross-rename is useful enough
to justify a new syscall (rename/renameat don't have flags
unfortunately).

Thanks,
Miklos

2013-05-30 15:59:31

by J. R. Okajima

[permalink] [raw]
Subject: Re: [RFC PATCH] vfs: add permute operation


Linus Torvalds:
> On Thu, May 30, 2013 at 5:45 PM, Miklos Szeredi <[email protected]> wrote:
> >
> > The third name is because of the replace-empty-directory wart in the
> > rename(2) definition. With overlay/union that can become
> >
> > 1) check if destination directory is empty: upper directory contains a
> > whiteout for each lower directory entry and nothing else
> > 2) if empty then remove whiteouts in destination directory
> > 3) and then go on with the normal rename procedure, replacing the empty
> > destination directory with the source directory ,
> >
> > This is done with directory locking, so atomicity is not usually a problem.
> > But in case of a crash between 2) and 3) we just seriously corrupted the
> > overlay.
> >
> > Suggestions for fixing that?
>
> Why not just do the NFS thing. That has worked forever - using a
> sillyrename as a "pending deletion" instead of actually deleting
> things.
>
> So in between (1) and (2), silly-rename the pseudo-empty target. At
> that point (2) is no longer even an atomicity requirement, because you
> can do the whiteout removal later. In fact, you probably want to do it
> at the end, after doing the "real" rename.

Hmm, where was the quoted Miklos's mail posted? I cannot find it in
both of linux-kernel and linux-fsdevel.

Anyway the idea sounds very similar to the approach which aufs
implemented years ago. Is it?

(from the aufs design document)

Whiteout
----------------------------------------------------------------------
The whiteout in aufs is very similar to Unionfs's. That is represented
by its filename. UnionMount takes an approach of a file mode, but I am
afraid several utilities (find(1) or something) will have to support it.

Basically the whiteout represents "logical deletion" which stops aufs to
lookup further, but also it represents "dir is opaque" which also stop
lookup.

In aufs, rmdir(2) and rename(2) for dir uses whiteout alternatively.
In order to make several functions in a single systemcall to be
revertible, aufs adopts an approach to rename a directory to a temporary
unique whiteouted name.
For example, in rename(2) dir where the target dir already existed, aufs
renames the target dir to a temporary unique whiteouted name before the
actual rename on a branch and then handles other actions (make it opaque,
update the attributes, etc). If an error happens in these actions, aufs
simply renames the whiteouted name back and returns an error. If all are
succeeded, aufs registers a function to remove the whiteouted unique
temporary name completely and asynchronously to the system global
workqueue.


J. R. Okajima