2010-08-08 15:54:49

by Valerie Aurora

Subject: [PATCH 00/39] Union mounts - return d_ino from lower fs

After questioning the value of d_ino, I am now convinced of its
utility. This version of union mounts fills in d_ino of fallthru
directory entries with the inode number of the target. You still need
to stat() the entry to get st_dev if you want to do a file uniqueness
comparison using the inode.

See the patch introducing generic_readdir_fallthru() for the
implementation.
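
As an illustration only (this is not part of the patch set), the
uniqueness comparison mentioned above might look like the following
userspace sketch. The helper name same_file() is hypothetical. It
compares both st_dev and st_ino, which is why d_ino from readdir()
alone is not enough: inode numbers are only unique within one device.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns true if both paths resolve to the same
 * file, and false if they differ or if either stat() call fails.
 * d_ino from readdir() can act as a cheap first filter, but st_dev is
 * only available from stat().
 */
static bool same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return false;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

A caller could skip the stat() pair whenever the two d_ino values
already differ, and only pay for stat() when they match.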

-VAL

Felix Fietkau (2):
whiteout: jffs2 whiteout support
fallthru: jffs2 fallthru support

Jan Blunck (10):
VFS: Make lookup_hash() return a struct path
autofs4: Save autofs trigger's vfsmount in super block info
whiteout/NFSD: Don't return information about whiteouts to userspace
whiteout: Add vfs_whiteout() and whiteout inode operation
whiteout: Set opaque flag if new directory was previously a whiteout
whiteout: Allow removal of a directory with whiteouts
whiteout: Split of ext2_append_link() from ext2_add_link()
whiteout: ext2 whiteout support
union-mount: Introduce MNT_UNION and MS_UNION flags
union-mount: Call do_whiteout() on unlink and rmdir in unions

Valerie Aurora (27):
VFS: Comment follow_mount() and friends
VFS: Add read-only users count to superblock
whiteout: tmpfs whiteout support
fallthru: Basic fallthru definitions
union-mount: Union mounts documentation
union-mount: Introduce union_dir structure and basic operations
union-mount: Free union dirs on removal from dcache
union-mount: Support for union mounting file systems
union-mount: Implement union lookup
union-mount: Copy up directory entries on first readdir()
union-mount: Add generic_readdir_fallthru() helper
fallthru: ext2 fallthru support
fallthru: tmpfs fallthru support
VFS: Split inode_permission() and create path_permission()
VFS: Create user_path_nd() to lookup both parent and target
union-mount: In-kernel file copyup routines
union-mount: Implement union-aware access()/faccessat()
union-mount: Implement union-aware link()
union-mount: Implement union-aware rename()
union-mount: Implement union-aware writable open()
union-mount: Implement union-aware chown()
union-mount: Implement union-aware truncate()
union-mount: Implement union-aware chmod()/fchmodat()
union-mount: Implement union-aware lchown()
union-mount: Implement union-aware utimensat()
union-mount: Implement union-aware setxattr()
union-mount: Implement union-aware lsetxattr()

Documentation/filesystems/union-mounts.txt | 752 +++++++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 16 +-
fs/Kconfig | 13 +
fs/Makefile | 1 +
fs/autofs4/autofs_i.h | 1 +
fs/autofs4/init.c | 11 +-
fs/autofs4/root.c | 6 +
fs/compat.c | 9 +
fs/dcache.c | 32 ++-
fs/ext2/dir.c | 248 +++++++++-
fs/ext2/ext2.h | 4 +
fs/ext2/inode.c | 11 +-
fs/ext2/namei.c | 85 +++-
fs/ext2/super.c | 7 +
fs/jffs2/dir.c | 117 +++++-
fs/jffs2/fs.c | 4 +
fs/jffs2/super.c | 2 +-
fs/libfs.c | 23 +-
fs/namei.c | 754 ++++++++++++++++++++++++----
fs/namespace.c | 289 +++++++++++-
fs/nfsd/nfs3xdr.c | 5 +
fs/nfsd/nfs4xdr.c | 5 +
fs/nfsd/nfsxdr.c | 4 +
fs/open.c | 116 ++++-
fs/readdir.c | 18 +
fs/super.c | 24 +
fs/union.c | 752 +++++++++++++++++++++++++++
fs/union.h | 80 +++
fs/utimes.c | 14 +-
fs/xattr.c | 65 ++-
include/linux/dcache.h | 19 +-
include/linux/ext2_fs.h | 8 +
include/linux/fs.h | 28 +
include/linux/jffs2.h | 8 +
include/linux/mount.h | 6 +-
include/linux/namei.h | 2 +
mm/shmem.c | 193 +++++++-
37 files changed, 3549 insertions(+), 183 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt
create mode 100644 fs/union.c
create mode 100644 fs/union.h


2010-08-08 15:54:17

by Valerie Aurora

Subject: [PATCH 03/39] VFS: Add read-only users count to superblock

While we can check if a file system is currently read-only, we can't
guarantee that it will stay read-only. The file system can be
remounted read-write at any time; it's also conceivable that a file
system can be mounted a second time and converted to read-write if the
underlying fs allows it. This is a problem for union mounts, which
require the underlying file system to be read-only. Add a count of
read-only users to the superblock; while that count is non-zero,
refuse both remounting the file system read-write and mounting it
read-write a second time.
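
The rule can be modelled in userspace as a rough sketch. The names
toy_sb and toy_check_mount() are hypothetical stand-ins, not kernel
API; the real checks are the ones added to do_remount_sb() and
vfs_kern_mount() in the diff below.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for struct super_block; only the new counter is modelled. */
struct toy_sb {
	int hard_readonly_users;
};

/*
 * Mirrors the check this patch adds: a read-write mount or remount is
 * refused with -EROFS while any "hard" read-only user exists.
 * Read-only requests are always allowed by this check.
 */
static int toy_check_mount(const struct toy_sb *sb, bool want_readonly)
{
	if (!want_readonly && sb->hard_readonly_users > 0)
		return -EROFS;
	return 0;
}
```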

Signed-off-by: Valerie Aurora <[email protected]>
Cc: Alexander Viro <[email protected]>
---
fs/namespace.c | 13 +++++++++++++
fs/super.c | 23 +++++++++++++++++++++++
include/linux/fs.h | 8 ++++++++
3 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index b8a66db..984c331 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -200,6 +200,19 @@ int __mnt_is_readonly(struct vfsmount *mnt)
}
EXPORT_SYMBOL_GPL(__mnt_is_readonly);

+static void inc_hard_readonly_users(struct vfsmount *mnt)
+{
+ BUG_ON(!__mnt_is_readonly(mnt));
+ mnt->mnt_sb->s_hard_readonly_users++;
+}
+
+static void dec_hard_readonly_users(struct vfsmount *mnt)
+{
+ BUG_ON(!__mnt_is_readonly(mnt));
+ BUG_ON(mnt->mnt_sb->s_hard_readonly_users == 0);
+ mnt->mnt_sb->s_hard_readonly_users--;
+}
+
static inline void inc_mnt_writers(struct vfsmount *mnt)
{
#ifdef CONFIG_SMP
diff --git a/fs/super.c b/fs/super.c
index 938119a..86bdf1f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -108,6 +108,7 @@ out:
*/
static inline void destroy_super(struct super_block *s)
{
+ BUG_ON(s->s_hard_readonly_users);
security_sb_free(s);
kfree(s->s_subtype);
kfree(s->s_options);
@@ -512,6 +513,21 @@ rescan:
return NULL;
}

+/*
+ * Some uses of file systems require that they never be mounted
+ * read-write anywhere (e.g., the lower layers of union mounts must
+ * always be read-only). If there are any of these "hard" read-only
+ * mounts, don't permit a transition to read-write.
+ *
+ * Must be called while holding the namespace lock.
+ */
+
+int sb_is_hard_readonly(struct super_block *sb)
+{
+ return sb->s_hard_readonly_users ? 1 : 0;
+}
+EXPORT_SYMBOL(sb_is_hard_readonly);
+
/**
* do_remount_sb - asks filesystem to change mount options.
* @sb: superblock in question
@@ -550,6 +566,9 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
return -EBUSY;
}

+ if (!(flags & MS_RDONLY) && sb_is_hard_readonly(sb))
+ return -EROFS;
+
if (sb->s_op->remount_fs) {
retval = sb->s_op->remount_fs(sb, &flags, data);
if (retval)
@@ -924,6 +943,10 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
WARN((mnt->mnt_sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
"negative value (%lld)\n", type->name, mnt->mnt_sb->s_maxbytes);

+ error = -EROFS;
+ if (!(flags & MS_RDONLY) && sb_is_hard_readonly(mnt->mnt_sb))
+ goto out_sb;
+
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
up_write(&mnt->mnt_sb->s_umount);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 68ca1b0..eeb49d7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1382,6 +1382,13 @@ struct super_block {
* generic_show_options()
*/
char *s_options;
+
+ /*
+ * Some mounts require that the underlying file system never
+ * transition to read-write. They mark the sb itself as
+ * read-only.
+ */
+ int s_hard_readonly_users;
};

extern struct timespec current_fs_time(struct super_block *sb);
@@ -1768,6 +1775,7 @@ extern int get_sb_nodev(struct file_system_type *fs_type,
int (*fill_super)(struct super_block *, void *, int),
struct vfsmount *mnt);
void generic_shutdown_super(struct super_block *sb);
+int sb_is_hard_readonly(struct super_block *sb);
void kill_block_super(struct super_block *sb);
void kill_anon_super(struct super_block *sb);
void kill_litter_super(struct super_block *sb);
--
1.6.3.3

2010-08-08 15:54:45

by Valerie Aurora

Subject: [PATCH 07/39] whiteout: Set opaque flag if new directory was previously a whiteout

From: Jan Blunck <[email protected]>

If we mkdir() a directory on the top layer of a union, we don't want
entries from a matching directory on the lower layer to "show through"
suddenly. To prevent this, we set the opaque flag on a directory if
there was previously a white-out with the same name. (If there is no
white-out and the directory exists in a lower layer, then mkdir() will
fail with EEXIST.)
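
The cases described above can be summarised in a small userspace
model. This is only a sketch: toy_upper and toy_union_mkdir() are
hypothetical names, and the real logic lives in vfs_mkdir() and the
union lookup code.

```c
#include <errno.h>
#include <stdbool.h>

/* State of the name in the top layer before mkdir() is attempted. */
enum toy_upper {
	TOY_UPPER_ABSENT,	/* no entry at all */
	TOY_UPPER_WHITEOUT,	/* a whiteout hides the lower entry */
	TOY_UPPER_PRESENT	/* a real entry already exists */
};

/*
 * Returns 0 if mkdir() would succeed, -EEXIST otherwise.  *opaque is
 * set to true when the new directory must hide the lower directory.
 */
static int toy_union_mkdir(enum toy_upper upper, bool lower_exists,
			   bool *opaque)
{
	*opaque = false;
	if (upper == TOY_UPPER_PRESENT)
		return -EEXIST;
	if (upper == TOY_UPPER_WHITEOUT) {
		/* Replacing a whiteout: lower entries must stay hidden. */
		*opaque = true;
		return 0;
	}
	/* No whiteout: an existing lower directory is still visible. */
	if (lower_exists)
		return -EEXIST;
	return 0;
}
```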

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 11 ++++++++++-
include/linux/fs.h | 5 +++++
2 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 665d394..cd8b0d0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2108,6 +2108,7 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, int, mode, unsigned, dev)
int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
int error = may_create(dir, dentry);
+ int opaque = 0;

if (error)
return error;
@@ -2120,9 +2121,17 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
if (error)
return error;

+ if (d_is_whiteout(dentry))
+ opaque = 1;
+
error = dir->i_op->mkdir(dir, dentry, mode);
- if (!error)
+ if (!error) {
fsnotify_mkdir(dir, dentry);
+ if (opaque) {
+ dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(dentry->d_inode);
+ }
+ }
return error;
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1f80897..1dbe156 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_OPAQUE 1024 /* Directory is opaque */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -270,6 +271,7 @@ struct inodes_stat_t {
#define IS_NOCMTIME(inode) ((inode)->i_flags & S_NOCMTIME)
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
+#define IS_OPAQUE(inode) ((inode)->i_flags & S_OPAQUE)

/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
@@ -351,8 +353,11 @@ struct inodes_stat_t {
#define FS_NOTAIL_FL 0x00008000 /* file tail should not be merged */
#define FS_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
#define FS_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
+/* 0x00040000 is used by ext4 */
#define FS_EXTENT_FL 0x00080000 /* Extents */
#define FS_DIRECTIO_FL 0x00100000 /* Use direct i/o */
+/* 0x00200000 and 0x00400000 also used by ext4 */
+#define FS_OPAQUE_FL 0x00800000 /* Dir is opaque */
#define FS_RESERVED_FL 0x80000000 /* reserved for ext2 lib */

#define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
--
1.6.3.3

2010-08-08 15:55:05

by Valerie Aurora

Subject: [PATCH 16/39] union-mount: Introduce union_dir structure and basic operations

This patch adds the basic structures and operations of VFS-based union
mounts (but not the ability to mount or lookup unioned file systems).
Each directory in a unioned file system has an associated union stack
created when the directory is first looked up. The union stack is a
union_dir structure kept in a hash table indexed by mount and dentry
of the directory; thus, specific paths are unioned, not dentries
alone. The union_dir keeps a pointer to the upper path and the lower
path and can be looked up by either path. Currently only two layers
are supported, but the union_dir struct is flexible enough to allow
more than two layers.
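
As a rough userspace sketch of the list layout that fs/union.h in
this patch describes (a singly-linked list rooted in the topmost
dentry), walking a union stack could look like this. The names
toy_union_dir and toy_union_depth() are hypothetical, and a string
stands in for struct path.

```c
#include <stddef.h>

/* Toy version of struct union_dir: one node per lower layer. */
struct toy_union_dir {
	const char *u_this;		/* stands in for struct path */
	struct toy_union_dir *u_lower;	/* next layer down, or NULL */
};

/* Count the lower layers reachable from the head of a union stack. */
static size_t toy_union_depth(const struct toy_union_dir *head)
{
	size_t depth = 0;

	for (; head != NULL; head = head->u_lower)
		depth++;
	return depth;
}
```

Because each node only points downwards, a lower directory can
appear in several stacks without storing any per-stack state itself.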

This particular version of union mounts is based on ideas by Jan
Blunck, Bharata B. Rao, and many others.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/Kconfig | 13 +++++
fs/Makefile | 1 +
fs/dcache.c | 3 +
fs/union.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 66 ++++++++++++++++++++++++++
include/linux/dcache.h | 5 ++-
6 files changed, 206 insertions(+), 1 deletions(-)
create mode 100644 fs/union.c
create mode 100644 fs/union.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 5f85b59..47409c9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -59,6 +59,19 @@ source "fs/notify/Kconfig"

source "fs/quota/Kconfig"

+config UNION_MOUNT
+ bool "Union mounts (writable overlays) (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ Union mounts allow you to mount a transparent writable
+ layer over a read-only file system, for example, an ext3
+ partition on a hard drive over a CD-ROM root file system
+ image.
+
+ See <file:Documentation/filesystems/union-mounts.txt> for details.
+
+ If unsure, say N.
+
source "fs/autofs/Kconfig"
source "fs/autofs4/Kconfig"
source "fs/fuse/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index e6ec1d3..936acf0 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_NFS_COMMON) += nfs_common/
obj-$(CONFIG_GENERIC_ACL) += generic_acl.o

obj-y += quota/
+obj-$(CONFIG_UNION_MOUNT) += union.o

obj-$(CONFIG_PROC_FS) += proc/
obj-y += partitions/
diff --git a/fs/dcache.c b/fs/dcache.c
index 249d077..456030d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -959,6 +959,9 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
INIT_LIST_HEAD(&dentry->d_alias);
+#ifdef CONFIG_UNION_MOUNT
+ dentry->d_union_dir = NULL;
+#endif

if (parent) {
dentry->d_parent = dget(parent);
diff --git a/fs/union.c b/fs/union.c
new file mode 100644
index 0000000..02abb7c
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,119 @@
+ /*
+ * VFS-based union mounts for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007-2009 Novell Inc.
+ * Copyright (C) 2009-2010 Red Hat, Inc.
+ *
+ * Author(s): Jan Blunck ([email protected])
+ * Valerie Aurora <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/fs_struct.h>
+#include <linux/slab.h>
+
+#include "union.h"
+
+static struct kmem_cache *union_cache;
+
+static int __init init_union(void)
+{
+ union_cache = KMEM_CACHE(union_dir, SLAB_PANIC | SLAB_MEM_SPREAD);
+ return 0;
+}
+
+fs_initcall(init_union);
+
+/**
+ * union_alloc - allocate a union_dir
+ *
+ * @path: path of directory underneath another directory
+ *
+ * Allocate a union_dir for this directory. We only allocate
+ * union_dirs for the second and lower layers - the read-only layers.
+ * Top-level dentries don't have a union_dir, just a pointer to the
+ * union_dir of the directory in the layer below it. u_lower is
+ * initialized to NULL by default. If there is another layer below
+ * this and a matching directory in the layer, then we allocate a
+ * union_dir for it and then set u_lower of the above union_dir to
+ * point to it.
+ */
+
+static struct union_dir *union_alloc(struct path *path)
+{
+ struct union_dir *ud;
+
+ BUG_ON(!S_ISDIR(path->dentry->d_inode->i_mode));
+
+ ud = kmem_cache_alloc(union_cache, GFP_ATOMIC);
+ if (!ud)
+ return NULL;
+
+ ud->u_this = *path;
+ ud->u_lower = NULL;
+
+ return ud;
+}
+
+static void union_put(struct union_dir *ud)
+{
+ path_put(&ud->u_this);
+ kmem_cache_free(union_cache, ud);
+}
+
+/**
+ * union_add_dir - Add another layer to a unioned directory
+ *
+ * @upper - directory in the previous layer
+ * @lower - directory in the current layer
+ * @next_ud - location of pointer to this union_dir
+ *
+ * Must have a reference (i.e., call path_get()) to @lower before
+ * passing to this function.
+ */
+
+int union_add_dir(struct path *upper, struct path *lower,
+ struct union_dir **next_ud)
+{
+ struct union_dir *ud;
+
+ BUG_ON(*next_ud != NULL);
+
+ ud = union_alloc(lower);
+ if (!ud)
+ return -ENOMEM;
+ *next_ud = ud;
+
+ return 0;
+}
+
+/**
+ * d_free_unions - free all unions for this dentry
+ *
+ * @dentry - topmost dentry in the union stack to remove
+ *
+ * This must be called when freeing a dentry.
+ */
+void d_free_unions(struct dentry *dentry)
+{
+ struct union_dir *this, *next;
+
+ this = dentry->d_union_dir;
+
+ while (this != NULL) {
+ next = this->u_lower;
+ union_put(this);
+ this = next;
+ }
+ dentry->d_union_dir = NULL;
+}
diff --git a/fs/union.h b/fs/union.h
new file mode 100644
index 0000000..04efc1f
--- /dev/null
+++ b/fs/union.h
@@ -0,0 +1,66 @@
+ /*
+ * VFS-based union mounts for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007-2009 Novell Inc.
+ * Copyright (C) 2009-2010 Red Hat, Inc.
+ *
+ * Author(s): Jan Blunck ([email protected])
+ * Valerie Aurora <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#ifndef __LINUX_UNION_H
+#define __LINUX_UNION_H
+#ifdef __KERNEL__
+
+#ifdef CONFIG_UNION_MOUNT
+
+/*
+ * WARNING! Confusing terminology alert.
+ *
+ * Note that the directions "up" and "down" in union mounts are the
+ * opposite of "up" and "down" in normal VFS operation terminology.
+ * "up" in the rest of the VFS means "towards the root of the mount
+ * tree." If you mount B on top of A, following B "up" will get you
+ * A. In union mounts, "up" means "towards the most recently mounted
+ * layer of the union stack." If you union mount B on top of A,
+ * following A "up" will get you to B. Another way to put it is that
+ * "up" in the VFS means going from this mount towards the direction
+ * of its mnt->mnt_parent pointer, but "up" in union mounts means
+ * going in the opposite direction (until you run out of union
+ * layers).
+ */
+
+/*
+ * The union_dir structure. Basically just a singly-linked list with
+ * a pointer to the referenced dentry, whose head is d_union_dir in
+ * the dentry of the topmost directory. We can't link this list
+ * purely through list elements in the dentry because lower layer
+ * dentries can be part of multiple union stacks. However, the
+ * topmost dentry is only part of one union stack. So we point at the
+ * lower layer dentries through a linked list rooted in the topmost
+ * dentry.
+ */
+struct union_dir {
+ struct path u_this; /* this is me */
+ struct union_dir *u_lower; /* this is what I overlay */
+};
+
+#define IS_MNT_UNION(mnt) ((mnt)->mnt_flags & MNT_UNION)
+
+extern int union_add_dir(struct path *, struct path *, struct union_dir **);
+extern void d_free_unions(struct dentry *);
+
+#else /* CONFIG_UNION_MOUNT */
+
+#define IS_MNT_UNION(x) (0)
+#define union_add_dir(x, y, z) ({ BUG(); (NULL); })
+#define d_free_unions(x) do { } while (0)
+
+#endif /* CONFIG_UNION_MOUNT */
+#endif /* __KERNEL__ */
+#endif /* __LINUX_UNION_H */
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 0904716..84657da 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -100,7 +100,10 @@ struct dentry {
struct hlist_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;
-
+ /* XXX Changes size of dentry, perhaps should be re-tuned. */
+#ifdef CONFIG_UNION_MOUNT
+ struct union_dir *d_union_dir; /* head of union stack */
+#endif
struct list_head d_lru; /* LRU list */
/*
* d_child and d_rcu can share memory
--
1.6.3.3

2010-08-08 15:54:58

by Valerie Aurora

Subject: [PATCH 15/39] union-mount: Introduce MNT_UNION and MS_UNION flags

From: Jan Blunck <[email protected]>

Add a per-mountpoint flag for union mount support. Using it requires
additional patches to util-linux; see:

git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 5 ++++-
include/linux/fs.h | 1 +
include/linux/mount.h | 4 ++--
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 984c331..f115cb6 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -809,6 +809,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
{ MNT_STRICTATIME, ",strictatime" },
+ { MNT_UNION, ",union" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
@@ -2008,10 +2009,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+ if (flags & MS_UNION)
+ mnt_flags |= MNT_UNION;

flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
- MS_STRICTATIME);
+ MS_STRICTATIME | MS_UNION);

if (flags & MS_REMOUNT)
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 71ee74e..31cfa48 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -192,6 +192,7 @@ struct inodes_stat_t {
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_UNION 256 /* Merge namespace with FS mounted below */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 4bd0547..0302703 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -43,9 +43,9 @@ struct mnt_namespace;
*/
#define MNT_SHARED_MASK (MNT_UNBINDABLE)
#define MNT_PROPAGATION_MASK (MNT_SHARED | MNT_UNBINDABLE)
+#define MNT_UNION 0x4000 /* top layer of a union mount */

-
-#define MNT_INTERNAL 0x4000
+#define MNT_INTERNAL 0x8000

struct vfsmount {
struct list_head mnt_hash;
--
1.6.3.3

2010-08-08 15:55:10

by Valerie Aurora

Subject: [PATCH 37/39] union-mount: Implement union-aware utimensat()


Signed-off-by: Valerie Aurora <[email protected]>
---
fs/utimes.c | 14 ++++++++++++--
1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/utimes.c b/fs/utimes.c
index e4c75db..e83b6bd 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -8,8 +8,10 @@
#include <linux/stat.h>
#include <linux/utime.h>
#include <linux/syscalls.h>
+#include <linux/slab.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
+#include "union.h"

#ifdef __ARCH_WANT_SYS_UTIME

@@ -152,18 +154,26 @@ long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags
error = utimes_common(&file->f_path, times);
fput(file);
} else {
+ struct nameidata nd;
+ char *tmp;
struct path path;
int lookup_flags = 0;

if (!(flags & AT_SYMLINK_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;

- error = user_path_at(dfd, filename, lookup_flags, &path);
+ error = user_path_nd(dfd, filename, lookup_flags, &nd, &path,
+ &tmp);
if (error)
goto out;

- error = utimes_common(&path, times);
+ error = union_copyup(&nd, &path);
+
+ if (!error)
+ error = utimes_common(&path, times);
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
}

out:
--
1.6.3.3

2010-08-08 15:55:20

by Valerie Aurora

Subject: [PATCH 22/39] union-mount: Add generic_readdir_fallthru() helper

In readdir(), client file systems need to look up the target of a
fallthru in a lower layer for three reasons: (1) fill in d_ino, (2)
fill in d_type, (3) make sure there is something to fall through to
(and if not, don't return this dentry). Create a generic helper
function.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/union.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 2 +
2 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/fs/union.c b/fs/union.c
index 917248d..a91e8fc 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -373,3 +373,57 @@ out_fput:
mnt_drop_write(topmost_path->mnt);
return res;
}
+
+/* Relationship between i_mode and the DT_xxx types */
+static inline unsigned char dt_type(struct inode *inode)
+{
+ return (inode->i_mode >> 12) & 15;
+}
+
+/**
+ * generic_readdir_fallthru - Helper to lookup target of a fallthru
+ *
+ * In readdir(), client file systems need to lookup the target of a
+ * fallthru in a lower layer for three reasons: (1) fill in d_ino, (2)
+ * fill in d_type, (3) make sure there is something to fall through to
+ * (and if not, don't return this dentry). Upon detecting a fallthru
+ * dentry in readdir(), the client file system should call this function.
+ *
+ * @topmost_dentry: dentry for the topmost dentry of the dir being read
+ * @name: name of fallthru dirent
+ * @namelen: length of @name
+ * @ino: return inode number of target, if found
+ * @d_type: return directory type of target, if found
+ *
+ * Returns 0 on success and -ENOENT if no matching directory entry was
+ * found (which happens when a file system with fallthrus is mounted
+ * somewhere other than where the fallthrus were created). Any other
+ * errors are unexpected.
+ */
+
+int
+generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
+ int namlen, ino_t *ino, unsigned char *d_type)
+{
+ struct dentry *dentry, *parent;
+ struct union_dir *ud = topmost_dentry->d_union_dir;
+
+ BUG_ON(!mutex_is_locked(&topmost_dentry->d_inode->i_mutex));
+
+ for (ud = topmost_dentry->d_union_dir; ud != NULL; ud = ud->u_lower) {
+ parent = ud->u_this.dentry;
+ mutex_lock(&parent->d_inode->i_mutex);
+ dentry = lookup_one_len(name, parent, namlen);
+ mutex_unlock(&parent->d_inode->i_mutex);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+ if (dentry->d_inode) {
+ *ino = dentry->d_inode->i_ino;
+ *d_type = dt_type(dentry->d_inode);
+ dput(dentry);
+ return 0;
+ }
+ dput(dentry);
+ }
+ return -ENOENT;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b88d088..3675501 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2140,6 +2140,8 @@ extern int notify_change(struct dentry *, struct iattr *);
extern int inode_permission(struct inode *, int);
extern int generic_permission(struct inode *, int,
int (*check_acl)(struct inode *, int));
+extern int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
+ int namlen, ino_t *ino, unsigned char *d_type);

static inline bool execute_ok(struct inode *inode)
{
--
1.6.3.3

2010-08-08 15:55:24

by Valerie Aurora

Subject: [PATCH 24/39] fallthru: jffs2 fallthru support

From: Felix Fietkau <[email protected]>

Add support for fallthru dentries to jffs2.

XXX - untested changes including generic_readdir_fallthru()

Cc: David Woodhouse <[email protected]>
Cc: [email protected]
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/jffs2/dir.c | 49 +++++++++++++++++++++++++++++++++++++++++++++----
include/linux/jffs2.h | 6 ++++++
2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 4798586..453e695 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -35,6 +35,7 @@ static int jffs2_rename (struct inode *, struct dentry *,
struct inode *, struct dentry *);

static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+static int jffs2_fallthru (struct inode *, struct dentry *);

const struct file_operations jffs2_dir_operations =
{
@@ -59,6 +60,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.rename = jffs2_rename,
.check_acl = jffs2_check_acl,
.whiteout = jffs2_whiteout,
+ .fallthru = jffs2_fallthru,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
@@ -103,10 +105,14 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
}
if (fd) {
spin_lock(&target->d_lock);
- if (fd->type == DT_WHT)
+ switch (fd->type) {
+ case DT_WHT:
target->d_flags |= DCACHE_WHITEOUT;
- else
+ case JFFS2_DT_FALLTHRU:
+ target->d_flags |= DCACHE_FALLTHRU;
+ default:
ino = fd->ino;
+ }
spin_unlock(&target->d_lock);
}
mutex_unlock(&dir_f->sem);
@@ -131,6 +137,8 @@ static int jffs2_readdir(struct file *filp, void *dirent, filldir_t filldir)
struct inode *inode = filp->f_path.dentry->d_inode;
struct jffs2_full_dirent *fd;
unsigned long offset, curofs;
+ ino_t ino;
+ char d_type;

D1(printk(KERN_DEBUG "jffs2_readdir() for dir_i #%lu\n", filp->f_path.dentry->d_inode->i_ino));

@@ -164,13 +172,25 @@ static int jffs2_readdir(struct file *filp, void *dirent, filldir_t filldir)
fd->name, fd->ino, fd->type, curofs, offset));
continue;
}
- if (!fd->ino) {
+ if (fd->type == JFFS2_DT_FALLTHRU) {
+ int err;
+ err = generic_readdir_fallthru(filp->f_path.dentry, fd->name, strlen(fd->name),
+ &ino, &d_type);
+ if (err) {
+ D2(printk(KERN_DEBUG "Skipping fallthru dirent \"%s\"\n", fd->name));
+ offset++;
+ continue;
+ }
+ } else if (!fd->ino && (fd->type != DT_WHT)) {
D2(printk(KERN_DEBUG "Skipping deletion dirent \"%s\"\n", fd->name));
offset++;
continue;
+ } else {
+ ino = fd->ino;
+ d_type = fd->type;
}
D2(printk(KERN_DEBUG "Dirent %ld: \"%s\", ino #%u, type %d\n", offset, fd->name, fd->ino, fd->type));
- if (filldir(dirent, fd->name, strlen(fd->name), offset, fd->ino, fd->type) < 0)
+ if (filldir(dirent, fd->name, strlen(fd->name), offset, ino, d_type) < 0)
break;
offset++;
}
@@ -798,6 +818,26 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, int mode, de
return ret;
}

+static int jffs2_fallthru (struct inode *dir, struct dentry *dentry)
+{
+ struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
+ uint32_t now;
+ int ret;
+
+ now = get_seconds();
+ ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, DT_UNKNOWN,
+ dentry->d_name.name, dentry->d_name.len, now);
+ if (ret)
+ return ret;
+
+ d_instantiate(dentry, NULL);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+
+ return 0;
+}
+
static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
struct dentry *new_dentry)
{
@@ -830,6 +870,7 @@ static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
return ret;

spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags &= ~DCACHE_FALLTHRU;
new_dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&new_dentry->d_lock);
d_add(new_dentry, NULL);
diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
index cc6347f..f3cedf6 100644
--- a/include/linux/jffs2.h
+++ b/include/linux/jffs2.h
@@ -114,6 +114,12 @@ struct jffs2_unknown_node
jint32_t hdr_crc;
};

+/*
+ * Non-standard directory entry type(s), for on-disk use
+ */
+
+#define JFFS2_DT_FALLTHRU (DT_WHT + 1)
+
struct jffs2_raw_dirent
{
jint16_t magic;
--
1.6.3.3

2010-08-08 15:55:49

by Valerie Aurora

Subject: [PATCH 30/39] union-mount: Implement union-aware link()


Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e7b02fa..5b22cc5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2942,16 +2942,18 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
{
struct dentry *new_dentry;
struct nameidata nd;
+ struct nameidata old_nd;
struct path old_path;
int error;
char *to;
+ char *from;

if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
return -EINVAL;

- error = user_path_at(olddfd, oldname,
+ error = user_path_nd(olddfd, oldname,
flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
- &old_path);
+ &old_nd, &old_path, &from);
if (error)
return error;

@@ -2959,8 +2961,20 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (error)
goto out;
error = -EXDEV;
- if (old_path.mnt != nd.path.mnt)
- goto out_release;
+ if (old_path.mnt != nd.path.mnt) {
+ if (IS_DIR_UNIONED(old_nd.path.dentry) &&
+ (old_nd.path.mnt == nd.path.mnt)) {
+ error = mnt_want_write(old_nd.path.mnt);
+ if (error)
+ goto out_release;
+ error = union_copyup(&old_nd, &old_path);
+ mnt_drop_write(old_nd.path.mnt);
+ if (error)
+ goto out_release;
+ } else {
+ goto out_release;
+ }
+ }
new_dentry = lookup_create(&nd, 0);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
@@ -2983,6 +2997,8 @@ out_release:
putname(to);
out:
path_put(&old_path);
+ path_put(&old_nd.path);
+ putname(from);

return error;
}
--
1.6.3.3

2010-08-08 15:55:57

by Valerie Aurora

Subject: [PATCH 34/39] union-mount: Implement union-aware truncate()


Signed-off-by: Valerie Aurora <[email protected]>
---
fs/open.c | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 8588b31..e4fc8e5 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -64,14 +64,17 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
static long do_sys_truncate(const char __user *pathname, loff_t length)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int error;

error = -EINVAL;
if (length < 0) /* sorry, but loff_t says... */
goto out;

- error = user_path(pathname, &path);
+ error = user_path_nd(AT_FDCWD, pathname, 0, &nd, &path, &tmp);
if (error)
goto out;
inode = path.dentry->d_inode;
@@ -85,11 +88,16 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
if (!S_ISREG(inode->i_mode))
goto dput_and_out;

- error = mnt_want_write(path.mnt);
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto dput_and_out;

- error = inode_permission(inode, MAY_WRITE);
+ error = path_permission(&path, &nd.path, MAY_WRITE);
if (error)
goto mnt_drop_write_and_out;

@@ -97,6 +105,12 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
if (IS_APPEND(inode))
goto mnt_drop_write_and_out;

+ error = union_copyup_len(&nd, &path, length);
+ if (error)
+ goto mnt_drop_write_and_out;
+
+ /* path may have changed after copyup */
+ inode = path.dentry->d_inode;
error = get_write_access(inode);
if (error)
goto mnt_drop_write_and_out;
@@ -118,9 +132,11 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
put_write_and_out:
put_write_access(inode);
mnt_drop_write_and_out:
- mnt_drop_write(path.mnt);
+ mnt_drop_write(mnt);
dput_and_out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3

2010-08-08 15:56:05

by Valerie Aurora

Subject: [PATCH 38/39] union-mount: Implement union-aware setxattr()


Signed-off-by: Valerie Aurora <[email protected]>
---
fs/xattr.c | 34 +++++++++++++++++++++++++++-------
1 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 01bb813..7869788 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -19,7 +19,7 @@
#include <linux/fsnotify.h>
#include <linux/audit.h>
#include <asm/uaccess.h>
-
+#include "union.h"

/*
* Check permissions for extended attribute access. This is a bit complicated
@@ -281,17 +281,37 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
size_t, size, int, flags)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;

- error = user_path(pathname, &path);
+ error = user_path_nd(AT_FDCWD, pathname, LOOKUP_FOLLOW, &nd, &path,
+ &tmp);
if (error)
return error;
- error = mnt_want_write(path.mnt);
- if (!error) {
- error = setxattr(path.dentry, name, value, size, flags);
- mnt_drop_write(path.mnt);
- }
+
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
+ if (error)
+ goto out;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
+
+ error = setxattr(path.dentry, name, value, size, flags);
+
+out_drop_write:
+ mnt_drop_write(mnt);
+out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
return error;
}

--
1.6.3.3

2010-08-08 15:56:08

by Valerie Aurora

Subject: [PATCH 39/39] union-mount: Implement union-aware lsetxattr()


Signed-off-by: Valerie Aurora <[email protected]>
---
fs/xattr.c | 31 +++++++++++++++++++++++++------
1 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 7869788..67815eb 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -320,17 +320,36 @@ SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
size_t, size, int, flags)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;

- error = user_lpath(pathname, &path);
+ error = user_path_nd(AT_FDCWD, pathname, 0, &nd, &path, &tmp);
if (error)
return error;
- error = mnt_want_write(path.mnt);
- if (!error) {
- error = setxattr(path.dentry, name, value, size, flags);
- mnt_drop_write(path.mnt);
- }
+
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
+ if (error)
+ goto out;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
+
+ error = setxattr(path.dentry, name, value, size, flags);
+
+out_drop_write:
+ mnt_drop_write(mnt);
+out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
return error;
}

--
1.6.3.3

2010-08-08 15:55:55

by Valerie Aurora

Subject: [PATCH 33/39] union-mount: Implement union-aware chown()


Signed-off-by: Valerie Aurora <[email protected]>
---
fs/open.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index fc56da0..8588b31 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -552,18 +552,35 @@ static int chown_common(struct path *path, uid_t user, gid_t group)
SYSCALL_DEFINE3(chown, const char __user *, filename, uid_t, user, gid_t, group)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;

- error = user_path(filename, &path);
+ error = user_path_nd(AT_FDCWD, filename, LOOKUP_FOLLOW,
+ &nd, &path, &tmp);
if (error)
goto out;
- error = mnt_want_write(path.mnt);
+
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto out_release;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
error = chown_common(&path, user, group);
- mnt_drop_write(path.mnt);
+out_drop_write:
+ mnt_drop_write(mnt);
out_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3

2010-08-08 15:56:39

by Valerie Aurora

Subject: [PATCH 36/39] union-mount: Implement union-aware lchown()


Signed-off-by: Valerie Aurora <[email protected]>
---
fs/open.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 5c9933f..693258f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -646,18 +646,35 @@ out:
SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;

- error = user_lpath(filename, &path);
+ error = user_path_nd(AT_FDCWD, filename, 0, &nd, &path, &tmp);
if (error)
goto out;
- error = mnt_want_write(path.mnt);
+
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto out_release;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
+
error = chown_common(&path, user, group);
- mnt_drop_write(path.mnt);
+out_drop_write:
+ mnt_drop_write(mnt);
out_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3

2010-08-08 15:55:53

by Valerie Aurora

Subject: [PATCH 35/39] union-mount: Implement union-aware chmod()/fchmodat()


Signed-off-by: Valerie Aurora <[email protected]>
---
fs/open.c | 25 +++++++++++++++++++++----
1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index e4fc8e5..5c9933f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -503,18 +503,32 @@ out:
SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int error;
struct iattr newattrs;

- error = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ error = user_path_nd(dfd, filename, LOOKUP_FOLLOW, &nd,
+ &path, &tmp);
if (error)
goto out;
- inode = path.dentry->d_inode;

- error = mnt_want_write(path.mnt);
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto dput_and_out;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto mnt_drop_write_and_out;
+
+ inode = path.dentry->d_inode;
mutex_lock(&inode->i_mutex);
error = security_path_chmod(path.dentry, path.mnt, mode);
if (error)
@@ -526,9 +540,12 @@ SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
error = notify_change(path.dentry, &newattrs);
out_unlock:
mutex_unlock(&inode->i_mutex);
- mnt_drop_write(path.mnt);
+mnt_drop_write_and_out:
+ mnt_drop_write(mnt);
dput_and_out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3

2010-08-08 15:57:17

by Valerie Aurora

Subject: [PATCH 32/39] union-mount: Implement union-aware writable open()

Copy up a file when opened with write permissions. Does not copy up
the file data when O_TRUNC is specified.
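
As a rough illustration of the rule above, here is a userspace model of
the decision. It is only a sketch: needs_copyup() and copyup_data_len()
are hypothetical names, not part of this patch, and the model simplifies
the kernel's acc_mode handling by looking only at the O_WRONLY/O_RDWR
access mode.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: does this open() need the file to exist on the
 * writable top layer? Simplification: only the access mode is checked.
 */
static bool needs_copyup(int open_flags)
{
	int accmode = open_flags & O_ACCMODE;

	return accmode == O_WRONLY || accmode == O_RDWR;
}

/*
 * Hypothetical model: how many bytes of the lower file's data must be
 * copied when the file is copied up for this open().
 */
static size_t copyup_data_len(int open_flags, size_t lower_size)
{
	if (!needs_copyup(open_flags))
		return 0;
	/* O_TRUNC throws the old contents away, so copying them is wasted work. */
	if (open_flags & O_TRUNC)
		return 0;
	/* Otherwise the whole file must be present on the top layer. */
	return lower_size;
}
```

With O_TRUNC the file itself is still created on the top layer; only
the data copy is skipped.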

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 67ebf4a..88d1a79 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1911,6 +1911,24 @@ exit:
return ERR_PTR(error);
}

+static int open_union_copyup(struct nameidata *nd, struct path *path,
+ int open_flag)
+{
+ struct vfsmount *oldmnt = path->mnt;
+ int error;
+
+ if (open_flag & O_TRUNC)
+ error = union_copyup_len(nd, path, 0);
+ else
+ error = union_copyup(nd, path);
+ if (error)
+ return error;
+ if (oldmnt != path->mnt)
+ mntput(nd->path.mnt);
+
+ return error;
+}
+
static struct file *do_last(struct nameidata *nd, struct path *path,
int open_flag, int acc_mode,
int mode, const char *pathname)
@@ -1962,6 +1980,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (!path->dentry->d_inode->i_op->lookup)
goto exit_dput;
}
+ if (acc_mode & MAY_WRITE) {
+ error = open_union_copyup(nd, path, open_flag);
+ if (error)
+ goto exit_dput;
+ }
path_to_nameidata(path, nd);
audit_inode(pathname, nd->path.dentry);
goto ok;
@@ -2033,6 +2056,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (path->dentry->d_inode->i_op->follow_link)
return NULL;

+ if (acc_mode & MAY_WRITE) {
+ error = open_union_copyup(nd, path, open_flag);
+ if (error)
+ goto exit_dput;
+ }
path_to_nameidata(path, nd);
error = -EISDIR;
if (S_ISDIR(path->dentry->d_inode->i_mode))
--
1.6.3.3

2010-08-08 15:55:44

by Valerie Aurora

Subject: [PATCH 28/39] union-mount: In-kernel file copyup routines

When a file on the read-only layer of a union mount is altered, it
must be copied up to the topmost read-write layer. This patch creates
union_copyup() and its supporting routines.
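
For readers who want the idea without the VFS details, the following is
a userspace analogue of copying up the first len bytes of a file. It is
an illustration only: the patch uses do_splice_direct() in the kernel,
whereas this sketch uses a plain read()/write() loop, and copyup_len()
with its path arguments is a hypothetical helper.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the first len bytes of lower_path into a newly created
 * upper_path. Returns 0 on success, -1 on failure.
 */
static int copyup_len(const char *lower_path, const char *upper_path,
		      size_t len, mode_t mode)
{
	char buf[4096];
	int in, out, ret = -1;

	in = open(lower_path, O_RDONLY);
	if (in < 0)
		return -1;
	/* O_EXCL: if the file was already copied up, do not clobber it. */
	out = open(upper_path, O_WRONLY | O_CREAT | O_EXCL, mode);
	if (out < 0)
		goto out_close_in;

	while (len > 0) {
		size_t want = len < sizeof(buf) ? len : sizeof(buf);
		ssize_t got = read(in, buf, want);
		ssize_t off = 0;

		if (got < 0)
			goto out_unlink;
		if (got == 0)
			break;		/* lower file is shorter than len */
		while (off < got) {
			ssize_t put = write(out, buf + off, got - off);

			if (put < 0)
				goto out_unlink;
			off += put;
		}
		len -= (size_t)got;
	}
	ret = 0;
out_unlink:
	if (ret)
		unlink(upper_path);	/* remove the partial copy on error */
	close(out);
out_close_in:
	close(in);
	return ret;
}
```

Like the kernel code, the sketch removes the new file when the copy
fails, and like the kernel code it does nothing about a crash in the
middle of the copy.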

Thanks to Valdis Kletnieks for a bug fix.

XXX - Miklos Szeredi points out: What happens if we crash halfway
through the file copyup? Answer: A bug, the file is truncated. Needs
fixing.

Cc: [email protected]
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/union.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 7 +-
2 files changed, 329 insertions(+), 1 deletions(-)

diff --git a/fs/union.c b/fs/union.c
index a91e8fc..103c436 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -24,6 +24,8 @@
#include <linux/namei.h>
#include <linux/file.h>
#include <linux/security.h>
+#include <linux/splice.h>
+#include <linux/xattr.h>

#include "union.h"

@@ -191,6 +193,72 @@ int needs_lookup_union(struct path *parent_path, struct path *path)
return 1;
}

+/**
+ * union_copyup_xattr
+ *
+ * @old: dentry of original file
+ * @new: dentry of new copy
+ *
+ * Copy up extended attributes from the original file to the new one.
+ *
+ * XXX - Permissions? For now, copying up every xattr.
+ */
+
+static int union_copyup_xattr(struct dentry *old, struct dentry *new)
+{
+ ssize_t list_size, size;
+ char *buf, *name, *value;
+ int error;
+
+ /* Check for xattr support */
+ if (!old->d_inode->i_op->getxattr ||
+ !new->d_inode->i_op->getxattr)
+ return 0;
+
+ /* Find out how big the list of xattrs is */
+ list_size = vfs_listxattr(old, NULL, 0);
+ if (list_size <= 0)
+ return list_size;
+
+ /* Allocate memory for the list */
+ buf = kzalloc(list_size, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ /* Allocate memory for the xattr's value */
+ error = -ENOMEM;
+ value = kmalloc(XATTR_SIZE_MAX, GFP_KERNEL);
+ if (!value)
+ goto out;
+
+ /* Actually get the list of xattrs */
+ list_size = vfs_listxattr(old, buf, list_size);
+ if (list_size <= 0) {
+ error = list_size;
+ goto out_free_value;
+ }
+
+ for (name = buf; name < (buf + list_size); name += strlen(name) + 1) {
+ /* XXX Locking? old is on read-only fs */
+ size = vfs_getxattr(old, name, value, XATTR_SIZE_MAX);
+ if (size <= 0) {
+ error = size;
+ goto out_free_value;
+ }
+ /* XXX do we really need to check for size overflow? */
+ /* XXX locks new dentry, lock ordering problems? */
+ error = vfs_setxattr(new, name, value, size, 0);
+ if (error)
+ goto out_free_value;
+ }
+
+out_free_value:
+ kfree(value);
+out:
+ kfree(buf);
+ return error;
+}
+
/*
* union_create_topmost_dir - Create a matching dir in the topmost file system
*/
@@ -209,6 +277,13 @@ int union_create_topmost_dir(struct path *parent, struct qstr *name,

res = vfs_mkdir(parent->dentry->d_inode, topmost->dentry, mode);

+ if (res)
+ goto out;
+
+ res = union_copyup_xattr(lower->dentry, topmost->dentry);
+ if (res)
+ dput(topmost->dentry);
+out:
mnt_drop_write(parent->mnt);

return res;
@@ -427,3 +502,251 @@ generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
}
return -ENOENT;
}
+
+/**
+ * union_create_file
+ *
+ * @nd: nameidata for source file
+ * @old: path of the source file
+ * @new: path of the new file, negative dentry
+ *
+ * Must already have mnt_want_write() on the mnt and the parent's
+ * i_mutex.
+ */
+
+static int union_create_file(struct nameidata *nd, struct path *old,
+ struct dentry *new)
+{
+ struct path *parent = &nd->path;
+ BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+
+ return vfs_create(parent->dentry->d_inode, new,
+ old->dentry->d_inode->i_mode, nd);
+}
+
+/**
+ * union_create_symlink
+ *
+ * @nd: nameidata for source symlink
+ * @old: path of the source symlink
+ * @new: path of the new symlink, negative dentry
+ *
+ * Must already have mnt_want_write() on the mnt and the parent's
+ * i_mutex.
+ */
+
+static int union_create_symlink(struct nameidata *nd, struct path *old,
+ struct dentry *new)
+{
+ void *cookie;
+ int error;
+
+ BUG_ON(!mutex_is_locked(&nd->path.dentry->d_inode->i_mutex));
+ /*
+ * We want the contents of this symlink, not to follow it, so
+ * this is modeled on generic_readlink() rather than
+ * do_follow_link().
+ */
+ nd->depth = 0;
+ cookie = old->dentry->d_inode->i_op->follow_link(old->dentry, nd);
+ if (IS_ERR(cookie))
+ return PTR_ERR(cookie);
+ /* Create a copy of the link on the top layer */
+ error = vfs_symlink(nd->path.dentry->d_inode, new,
+ nd_get_link(nd));
+ if (old->dentry->d_inode->i_op->put_link)
+ old->dentry->d_inode->i_op->put_link(old->dentry, nd, cookie);
+ return error;
+}
+
+/**
+ * union_copyup_data - Copy up len bytes of old's data to new
+ *
+ * @old: path of source file
+ * @new_mnt: vfsmount of target file
+ * @new_dentry: dentry of target file
+ * @len: number of bytes to copy
+ */
+
+static int union_copyup_data(struct path *old, struct vfsmount *new_mnt,
+ struct dentry *new_dentry, size_t len)
+{
+ struct file *old_file;
+ struct file *new_file;
+ const struct cred *cred = current_cred();
+ loff_t offset = 0;
+ long bytes;
+ int error = 0;
+
+ if (len == 0)
+ return 0;
+
+ /* Get reference to balance later fput() */
+ path_get(old);
+ old_file = dentry_open(old->dentry, old->mnt, O_RDONLY, cred);
+ if (IS_ERR(old_file))
+ return PTR_ERR(old_file);
+
+ mntget(new_mnt);
+ dget(new_dentry);
+ new_file = dentry_open(new_dentry, new_mnt, O_WRONLY, cred);
+ if (IS_ERR(new_file)) {
+ error = PTR_ERR(new_file);
+ goto out_fput;
+ }
+
+ bytes = do_splice_direct(old_file, &offset, new_file, len,
+ SPLICE_F_MOVE);
+ if (bytes < 0)
+ error = bytes;
+
+ fput(new_file);
+out_fput:
+ fput(old_file);
+ return error;
+}
+
+/**
+ * __union_copyup_len - Copy up a file and len bytes of data
+ *
+ * @nd: nameidata for topmost parent dir
+ * @path: path of file to be copied up
+ * @len: number of bytes of file data to copy up
+ *
+ * Parent's i_mutex must be held by caller. Newly copied up path is
+ * returned in @path and original is path_put().
+ */
+
+static int __union_copyup_len(struct nameidata *nd, struct path *path,
+ size_t len)
+{
+ struct path *parent = &nd->path;
+ struct dentry *dentry;
+ int error;
+
+ BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+
+ dentry = lookup_one_len(path->dentry->d_name.name, parent->dentry,
+ path->dentry->d_name.len);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ if (dentry->d_inode) {
+ /*
+ * We raced with someone else and "lost." That's
+ * okay, they did all the work of copying up the file.
+ * Note that currently data copyup happens under the
+ * parent dir's i_mutex. If we move it outside that,
+ * we'll need some way of waiting for the data copyup
+ * to complete here.
+ */
+ error = 0;
+ goto out_newpath;
+ }
+ if (S_ISREG(path->dentry->d_inode->i_mode)) {
+ /* Create file */
+ error = union_create_file(nd, path, dentry);
+ if (error)
+ goto out_dput;
+ /* Copyup data */
+ error = union_copyup_data(path, parent->mnt, dentry, len);
+ } else {
+ BUG_ON(!S_ISLNK(path->dentry->d_inode->i_mode));
+ error = union_create_symlink(nd, path, dentry);
+ }
+ if (error) {
+ /* Most likely error: ENOSPC */
+ vfs_unlink(parent->dentry->d_inode, dentry);
+ goto out_dput;
+ }
+ /* XXX Copyup xattrs and any other dangly bits */
+ error = union_copyup_xattr(path->dentry, dentry);
+ if (error)
+ goto out_dput;
+out_newpath:
+ /* path_put() of original must happen before we copy in new */
+ path_put(path);
+ path->dentry = dentry;
+ path->mnt = mntget(parent->mnt);
+ return error;
+out_dput:
+ /* Don't path_put(path), let caller unwind */
+ dput(dentry);
+ return error;
+}
+
+/**
+ * do_union_copyup_len - Copy up a file given its path (and its parent's)
+ *
+ * @nd: nameidata for topmost parent dir
+ * @path: path of file to be copied up
+ * @copy_all: if set, copy all of the file's data and ignore @len
+ * @len: if @copy_all is not set, number of bytes of file data to copy up
+ *
+ * Newly copied up path is returned in @path.
+ */
+
+static int do_union_copyup_len(struct nameidata *nd, struct path *path,
+ int copy_all, size_t len)
+{
+ struct path *parent = &nd->path;
+ int error;
+
+ if (!IS_DIR_UNIONED(parent->dentry))
+ return 0;
+ if (parent->mnt == path->mnt)
+ return 0;
+ if (!S_ISREG(path->dentry->d_inode->i_mode) &&
+ !S_ISLNK(path->dentry->d_inode->i_mode))
+ return 0;
+
+ BUG_ON(!S_ISDIR(parent->dentry->d_inode->i_mode));
+
+ mutex_lock(&parent->dentry->d_inode->i_mutex);
+ error = -ENOENT;
+ if (IS_DEADDIR(parent->dentry->d_inode))
+ goto out_unlock;
+
+ if (copy_all && S_ISREG(path->dentry->d_inode->i_mode)) {
+ error = -EFBIG;
+ len = i_size_read(path->dentry->d_inode);
+ /* Check for overflow of file size */
+ if (((size_t)len != len) || ((ssize_t)len != len))
+ goto out_unlock;
+ }
+
+ error = __union_copyup_len(nd, path, len);
+
+out_unlock:
+ mutex_unlock(&parent->dentry->d_inode->i_mutex);
+ return error;
+}
+
+/*
+ * Helper function to copy up all of a file
+ */
+int union_copyup(struct nameidata *nd, struct path *path)
+{
+ return do_union_copyup_len(nd, path, 1, 0);
+}
+
+/*
+ * Unlocked helper function to copy up all of a file
+ */
+int __union_copyup(struct nameidata *nd, struct path *path)
+{
+ size_t len;
+ len = i_size_read(path->dentry->d_inode);
+ if (((size_t)len != len) || ((ssize_t)len != len))
+ return -EFBIG;
+
+ return __union_copyup_len(nd, path, len);
+}
+
+/*
+ * Helper function to copy up part of a file
+ */
+int union_copyup_len(struct nameidata *nd, struct path *path, size_t len)
+{
+ return do_union_copyup_len(nd, path, 0, len);
+}
diff --git a/fs/union.h b/fs/union.h
index 80c2421..01fa183 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -59,7 +59,9 @@ int needs_lookup_union(struct path *, struct path *);
int union_create_topmost_dir(struct path *, struct qstr *, struct path *,
struct path *);
extern int union_copyup_dir(struct path *);
-
+extern int union_copyup(struct nameidata *, struct path *);
+extern int __union_copyup(struct nameidata *, struct path *);
+extern int union_copyup_len(struct nameidata *, struct path *, size_t len);
#else /* CONFIG_UNION_MOUNT */

#define IS_MNT_UNION(x) (0)
@@ -69,6 +71,9 @@ extern int union_copyup_dir(struct path *);
#define needs_lookup_union(x, y) ({ (0); })
#define union_create_topmost_dir(w, x, y, z) ({ BUG(); (NULL); })
#define union_copyup_dir(x) ({ BUG(); (0); })
+#define union_copyup(x, y) ({ BUG(); (0); })
+#define __union_copyup(x, y) ({ BUG(); (0); })
+#define union_copyup_len(x, y, z) ({ BUG(); (0); })

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2010-08-08 15:57:59

by Valerie Aurora

Subject: [PATCH 31/39] union-mount: Implement union-aware rename()

On rename() of a file on a union mount, copy up and white out the
source file. Both are done under the rename mutex. I believe this is
actually atomic.
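
The sequence can be pictured with a toy two-layer namespace. This is a
model for illustration only, with no locking and no file data; struct
layer, layer_set(), union_exists() and union_rename() are hypothetical
names, not kernel interfaces.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_ENTRIES 8

enum entry_kind { ENT_FILE, ENT_WHITEOUT };
struct entry { char name[32]; enum entry_kind kind; };
struct layer { struct entry ents[MAX_ENTRIES]; int n; };

static struct entry *layer_find(struct layer *l, const char *name)
{
	for (int i = 0; i < l->n; i++)
		if (strcmp(l->ents[i].name, name) == 0)
			return &l->ents[i];
	return NULL;
}

/* Add an entry or change the kind of an existing one. */
static int layer_set(struct layer *l, const char *name, enum entry_kind kind)
{
	struct entry *e = layer_find(l, name);

	if (!e) {
		if (l->n == MAX_ENTRIES || strlen(name) >= sizeof(e->name))
			return -1;
		e = &l->ents[l->n++];
		strcpy(e->name, name);
	}
	e->kind = kind;
	return 0;
}

static void layer_remove(struct layer *l, const char *name)
{
	struct entry *e = layer_find(l, name);

	if (e)
		*e = l->ents[--l->n];
}

/* Top layer wins; a whiteout in the top layer hides the lower entry. */
static bool union_exists(struct layer *top, struct layer *lower,
			 const char *name)
{
	struct entry *e = layer_find(top, name);

	if (e)
		return e->kind == ENT_FILE;
	return layer_find(lower, name) != NULL;
}

static int union_rename(struct layer *top, struct layer *lower,
			const char *from, const char *to)
{
	struct entry *src = layer_find(top, from);
	bool in_lower = layer_find(lower, from) != NULL;

	if (src && src->kind == ENT_WHITEOUT)
		return -1;			/* source is not visible */
	if (!src) {
		if (!in_lower)
			return -1;		/* source does not exist */
		if (layer_set(top, from, ENT_FILE))
			return -1;		/* copy up to the top layer */
	}
	if (layer_set(top, to, ENT_FILE))	/* rename within the top layer */
		return -1;
	if (in_lower)
		layer_set(top, from, ENT_WHITEOUT);	/* hide the lower copy */
	else
		layer_remove(top, from);
	return 0;
}
```

The lower layer is never modified; the whiteout is what keeps the old
name from showing through after the rename.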

XXX - May not need to do file copyup under the lock.
XXX - Convert newly empty unioned dirs to not-unioned

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5b22cc5..67ebf4a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3159,6 +3159,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
{
struct dentry *old_dir, *new_dir;
struct path old, new;
+ struct path to_whiteout = {NULL, NULL};
struct dentry *trap;
struct nameidata oldnd, newnd;
char *from;
@@ -3174,13 +3175,9 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
goto exit1;

error = -EXDEV;
+ /* Union mounts will pass below test - dirs always on topmost */
if (oldnd.path.mnt != newnd.path.mnt)
goto exit2;
- /* Rename on union mounts not implemented yet */
- /* XXX much harsher check than necessary - can do some renames */
- if (IS_DIR_UNIONED(oldnd.path.dentry) ||
- IS_DIR_UNIONED(newnd.path.dentry))
- goto exit2;
old_dir = oldnd.path.dentry;
error = -EBUSY;
if (oldnd.last_type != LAST_NORM)
@@ -3203,7 +3200,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -ENOENT;
if (!old.dentry->d_inode)
goto exit4;
- /* unless the source is a directory trailing slashes give -ENOTDIR */
+ /* unless the source is a directory, trailing slashes give -ENOTDIR */
if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
error = -ENOTDIR;
if (oldnd.last.name[oldnd.last.len])
@@ -3215,6 +3212,11 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EINVAL;
if (old.dentry == trap)
goto exit4;
+ error = -EXDEV;
+ /* Can't rename a directory from a lower layer */
+ if (IS_DIR_UNIONED(oldnd.path.dentry) &&
+ IS_DIR_UNIONED(old.dentry))
+ goto exit4;
error = lookup_hash(&newnd, &newnd.last, &new);
if (error)
goto exit4;
@@ -3222,6 +3224,48 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -ENOTEMPTY;
if (new.dentry == trap)
goto exit5;
+ error = -EXDEV;
+ /* Can't rename over directories on the lower layer */
+ if (IS_DIR_UNIONED(newnd.path.dentry) &&
+ IS_DIR_UNIONED(new.dentry))
+ goto exit5;
+
+ /* If source is on lower layer, copy up */
+ if (IS_DIR_UNIONED(oldnd.path.dentry) &&
+ (old.mnt != oldnd.path.mnt)) {
+ /* Save the lower path to avoid a second lookup for whiteout */
+ to_whiteout.mnt = mntget(old.mnt);
+ to_whiteout.dentry = dget(old.dentry);
+ error = __union_copyup(&oldnd, &old);
+ if (error)
+ goto exit5;
+ }
+
+ /* If target is on lower layer, get negative dentry for topmost */
+ if (IS_DIR_UNIONED(newnd.path.dentry) &&
+ (new.mnt != newnd.path.mnt)) {
+ struct dentry *dentry;
+ /*
+ * At this point, source and target are both files,
+ * the source is on the topmost layer, and the target
+ * is on a lower layer. We want the target dentry to
+ * disappear from the namespace, and give vfs_rename a
+ * negative dentry from the topmost layer.
+ */
+ /* We already did lookup once, no need to check perm */
+ dentry = __lookup_hash(&newnd.last, newnd.path.dentry, &newnd);
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto exit5;
+ }
+ /* We no longer need the lower target dentry. It
+ * definitely should be removed from the hash table */
+ /* XXX what about failure case? */
+ d_delete(new.dentry);
+ mntput(new.mnt);
+ new.mnt = mntget(newnd.path.mnt);
+ new.dentry = dentry;
+ }

error = mnt_want_write(oldnd.path.mnt);
if (error)
@@ -3232,6 +3276,26 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
goto exit6;
error = vfs_rename(old_dir->d_inode, old.dentry,
new_dir->d_inode, new.dentry);
+ if (error)
+ goto exit6;
+ /* Now whiteout the source */
+ if (IS_DIR_UNIONED(oldnd.path.dentry)) {
+ if (!to_whiteout.dentry) {
+ struct dentry *dentry;
+ /* We could have exposed a lower level entry */
+ dentry = __lookup_hash(&oldnd.last, oldnd.path.dentry, &oldnd);
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto exit6;
+ }
+ to_whiteout.dentry = dentry;
+ to_whiteout.mnt = mntget(oldnd.path.mnt);
+ }
+
+ if (to_whiteout.dentry->d_inode)
+ error = do_whiteout(&oldnd, &to_whiteout, 0);
+ path_put(&to_whiteout);
+ }
exit6:
mnt_drop_write(oldnd.path.mnt);
exit5:
--
1.6.3.3

2010-08-08 15:55:39

by Valerie Aurora

Subject: [PATCH 26/39] VFS: Split inode_permission() and create path_permission()

Split inode_permission() into inode and file-system-dependent parts.
Create path_permission() to check permission based on the path to the
inode. This is for union mounts, in which an inode can be located on
a read-only lower layer file system but is still writable, since we
will copy it up to the writable top layer file system. So in that
case, we want to ignore MS_RDONLY on the lower layer. To make this
decision, we must know the path (vfsmount, dentry) of both the target
and its parent.
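
Stripped of the VFS types, the decision looks roughly like the sketch
below. The struct and function names are hypothetical and exist only
for this illustration; the real code works on struct path and
superblock flags.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two mount properties that matter here. */
struct model_mount {
	bool read_only;		/* file system is mounted read-only */
	bool is_union_top;	/* mount is the top layer of a union */
};

/*
 * Model of the read-only part of a write permission check: when the
 * parent directory is on a union's top layer, the target will be
 * copied up there, so the top layer's read-only state is the one that
 * counts, not the lower layer's.
 */
static int model_write_check(const struct model_mount *target_mnt,
			     const struct model_mount *parent_mnt)
{
	const struct model_mount *mnt;

	mnt = parent_mnt->is_union_top ? parent_mnt : target_mnt;
	return mnt->read_only ? -EROFS : 0;
}
```

A file on a read-only lower layer with a writable union parent passes
the check; the same file reached without a union does not.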

XXX - so ugly!

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--------
include/linux/fs.h | 1 +
2 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 2d30a5b..74d6852 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -241,29 +241,20 @@ int generic_permission(struct inode *inode, int mask,
}

/**
- * inode_permission - check for access rights to a given inode
+ * __inode_permission - check for access rights to a given inode
* @inode: inode to check permission on
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
*
* Used to check for read/write/execute permissions on an inode.
- * We use "fsuid" for this, letting us set arbitrary permissions
- * for filesystem access without changing the "normal" uids which
- * are used for other things.
+ *
+ * This does not check for a read-only file system. You probably want
+ * inode_permission().
*/
-int inode_permission(struct inode *inode, int mask)
+static int __inode_permission(struct inode *inode, int mask)
{
int retval;

if (mask & MAY_WRITE) {
- umode_t mode = inode->i_mode;
-
- /*
- * Nobody gets write access to a read-only fs.
- */
- if (IS_RDONLY(inode) &&
- (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
- return -EROFS;
-
/*
* Nobody gets write access to an immutable file.
*/
@@ -288,6 +279,79 @@ int inode_permission(struct inode *inode, int mask)
}

/**
+ * sb_permission - check superblock-level permissions
+ * @sb: superblock of inode to check permission on
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Separate out file-system wide checks from inode-specific permission
+ * checks. In particular, union mounts want to check the read-only
+ * status of the top-level file system, not the lower.
+ */
+int sb_permission(struct super_block *sb, struct inode *inode, int mask)
+{
+ if (mask & MAY_WRITE) {
+ umode_t mode = inode->i_mode;
+
+ /*
+ * Nobody gets write access to a read-only fs.
+ */
+ if ((sb->s_flags & MS_RDONLY) &&
+ (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
+ return -EROFS;
+ }
+ return 0;
+}
+
+/**
+ * inode_permission - check for access rights to a given inode
+ * @inode: inode to check permission on
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Used to check for read/write/execute permissions on an inode.
+ * We use "fsuid" for this, letting us set arbitrary permissions
+ * for filesystem access without changing the "normal" uids which
+ * are used for other things.
+ */
+int inode_permission(struct inode *inode, int mask)
+{
+ int retval;
+
+ retval = sb_permission(inode->i_sb, inode, mask);
+ if (retval)
+ return retval;
+ return __inode_permission(inode, mask);
+}
+
+/**
+ * path_permission - check for inode access rights depending on path
+ * @path: path of inode to check
+ * @parent_path: path of inode's parent
+ * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Like inode_permission, but used to check for permission when the
+ * file may potentially be copied up between union layers.
+ */
+
+int path_permission(struct path *path, struct path *parent_path, int mask)
+{
+ struct vfsmount *mnt;
+ int retval;
+
+ /* Catch some reversal of args */
+ BUG_ON(!S_ISDIR(parent_path->dentry->d_inode->i_mode));
+
+ if (IS_MNT_UNION(parent_path->mnt))
+ mnt = parent_path->mnt;
+ else
+ mnt = path->mnt;
+
+ retval = sb_permission(mnt->mnt_sb, path->dentry->d_inode, mask);
+ if (retval)
+ return retval;
+ return __inode_permission(path->dentry->d_inode, mask);
+}
+
+/**
* file_permission - check for additional access rights to a given file
* @file: file to check access rights for
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3675501..7b2a553 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2138,6 +2138,7 @@ extern sector_t bmap(struct inode *, sector_t);
#endif
extern int notify_change(struct dentry *, struct iattr *);
extern int inode_permission(struct inode *, int);
+extern int path_permission(struct path *, struct path *, int);
extern int generic_permission(struct inode *, int,
int (*check_acl)(struct inode *, int));
extern int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
--
1.6.3.3

2010-08-08 15:58:38

by Valerie Aurora

Subject: [PATCH 29/39] union-mount: Implement union-aware access()/faccessat()

For union mounts, an access check on a file located on the lower layer
incorrectly returns EROFS. To fix this, use the new
path_permission() call, which ignores a read-only lower layer file
system if the target will be copied up to the topmost file system.
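
From userspace the difference is visible through access(2). The helper
below is a small sketch for checking a path by hand;
write_access_errno() is a hypothetical name, and which errno a given
file returns depends on the mount setup, so no particular result is
claimed here.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Return 0 if the calling process may write to path according to
 * access(2), otherwise the errno that access() reported (for example
 * EROFS, EACCES or ENOENT).
 */
static int write_access_errno(const char *path)
{
	if (access(path, W_OK) == 0)
		return 0;
	return errno;
}
```

Running this against a file that lives only on the read-only lower
layer of a union is one way to see whether the check still reports
EROFS.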

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/open.c | 21 +++++++++++++++++----
1 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 5463266..fc56da0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -31,6 +31,7 @@
#include <linux/ima.h>

#include "internal.h"
+#include "union.h"

int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
struct file *filp)
@@ -288,7 +289,10 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
const struct cred *old_cred;
struct cred *override_cred;
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int res;

if (mode & ~S_IRWXO) /* where's F_OK, X_OK, W_OK, R_OK? */
@@ -312,10 +316,17 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)

old_cred = override_creds(override_cred);

- res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ res = user_path_nd(dfd, filename, LOOKUP_FOLLOW,
+ &nd, &path, &tmp);
if (res)
goto out;

+ /* For union mounts, use the topmost mnt's permissions */
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
inode = path.dentry->d_inode;

if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
@@ -324,11 +335,11 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* with the "noexec" flag.
*/
res = -EACCES;
- if (path.mnt->mnt_flags & MNT_NOEXEC)
+ if (mnt->mnt_flags & MNT_NOEXEC)
goto out_path_release;
}

- res = inode_permission(inode, mode | MAY_ACCESS);
+ res = path_permission(&path, &nd.path, mode | MAY_ACCESS);
/* SuS v2 requires we report a read only fs too */
if (res || !(mode & S_IWOTH) || special_file(inode->i_mode))
goto out_path_release;
@@ -342,11 +353,13 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* inherently racy and know that the fs may change
* state before we even see this result.
*/
- if (__mnt_is_readonly(path.mnt))
+ if (__mnt_is_readonly(mnt))
res = -EROFS;

out_path_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
revert_creds(old_cred);
put_cred(override_cred);
--
1.6.3.3

2010-08-08 15:55:36

by Valerie Aurora

Subject: [PATCH 27/39] VFS: Create user_path_nd() to lookup both parent and target

Proof-of-concept implementation of user_path_nd(). Look up both the
parent and the target of a user-supplied filename, to supply later to
union copyup routines.
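
The ownership contract of user_path_nd() in the diff below can be
modelled in userspace: on failure nothing is left held, on success the
caller owns the parent path, the child path and the name buffer. This
is a sketch under assumptions; struct model_refs, model_user_path_nd()
and model_release() are hypothetical, and plain counters stand in for
path references and getname() buffers.

```c
#include <errno.h>

/* Hypothetical counters for what is currently held. */
struct model_refs {
	int parent;	/* references on parent_nd->path */
	int child;	/* references on *child */
	int name;	/* getname() buffers not yet released */
};

/*
 * Models the control flow of user_path_nd(). fail_at selects the step
 * that fails: 0 = none, 1 = parent lookup, 2 = child lookup.
 */
int model_user_path_nd(struct model_refs *r, int fail_at)
{
	int error;

	r->name++;				/* getname() */

	error = (fail_at == 1) ? -ENOENT : 0;	/* parent lookup */
	if (error)
		goto out_putname;
	r->parent++;

	error = (fail_at == 2) ? -ENOENT : 0;	/* child lookup */
	if (error)
		goto out_path_put;
	r->child++;
	return 0;				/* caller owns all three */

out_path_put:
	r->parent--;				/* path_put(&parent_nd->path) */
out_putname:
	r->name--;				/* putname(s) */
	return error;
}

/* What a successful caller must release, as faccessat() does in 29/39. */
void model_release(struct model_refs *r)
{
	r->child--;	/* path_put(&path) */
	r->parent--;	/* path_put(&nd.path) */
	r->name--;	/* putname(tmp) */
}
```

Every counter returns to zero on each failure path, and after
model_release() on the success path.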

XXX - Inefficient, racy, gets the parent of the symlink instead of the
parent of the target. Al Viro would like to see something more like
this:

user_path_mumble() looks up and returns:

parent nameidata
positive topmost dentry of target
negative dentry of target from the topmost layer (if it doesn't exist on top)

Both the positive lower dentry and negative topmost dentry are passed
to the following code, like do_chown(). The tests for permissions and
such-like are performed on the positive lower dentry. When it comes
time to actually modify the target, we call union_copyup() with both
positive and negative dentries (and the parent nameidata).

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 31 +++++++++++++++++++++++++++++++
include/linux/namei.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 74d6852..e7b02fa 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1546,6 +1546,37 @@ static int user_path_parent(int dfd, const char __user *path,
return error;
}

+int user_path_nd(int dfd, const char __user *filename,
+ unsigned flags, struct nameidata *parent_nd,
+ struct path *child, char **tmp)
+{
+ struct nameidata child_nd;
+ char *s = getname(filename);
+ int error;
+
+ if (IS_ERR(s))
+ return PTR_ERR(s);
+
+ /* Lookup parent */
+ error = do_path_lookup(dfd, s, LOOKUP_PARENT, parent_nd);
+ if (error)
+ goto out_putname;
+
+ /* Lookup child - XXX optimize, racy */
+ error = do_path_lookup(dfd, s, flags, &child_nd);
+ if (error)
+ goto out_path_put;
+ *child = child_nd.path;
+ *tmp = s;
+ return 0;
+
+out_path_put:
+ path_put(&parent_nd->path);
+out_putname:
+ putname(s);
+ return error;
+}
+
/*
* It's inline, so penalty for filesystems that don't use sticky bit is
* minimal.
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 05b441d..83dc8b5 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -58,6 +58,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_RENAME_TARGET 0x0800

extern int user_path_at(int, const char __user *, unsigned, struct path *);
+extern int user_path_nd(int, const char __user *, unsigned,
+ struct nameidata *, struct path *, char **);

#define user_path(name, path) user_path_at(AT_FDCWD, name, LOOKUP_FOLLOW, path)
#define user_lpath(name, path) user_path_at(AT_FDCWD, name, 0, path)
--
1.6.3.3

2010-08-08 15:59:00

by Valerie Aurora

Subject: [PATCH 25/39] fallthru: tmpfs fallthru support

Add support for fallthru directory entries to tmpfs.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 3 +-
fs/libfs.c | 23 ++++++++++++++++++--
mm/shmem.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
3 files changed, 78 insertions(+), 12 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ed7f15a..4fe51a9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2254,7 +2254,8 @@ resume:
* we can evict it.
*/
if (d_unhashed(dentry)||(!dentry->d_inode &&
- !d_is_whiteout(dentry)))
+ !d_is_whiteout(dentry) &&
+ !d_is_fallthru(dentry)))
continue;
if (!list_empty(&dentry->d_subdirs)) {
this_parent = dentry;
diff --git a/fs/libfs.c b/fs/libfs.c
index dcaf972..1172f1a 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -130,6 +130,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
struct dentry *cursor = filp->private_data;
struct list_head *p, *q = &cursor->d_u.d_child;
ino_t ino;
+ char d_type;
int i = filp->f_pos;

switch (i) {
@@ -155,14 +156,30 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
struct dentry *next;
next = list_entry(p, struct dentry, d_u.d_child);
- if (d_unhashed(next) || !next->d_inode)
+ if (d_unhashed(next) || (!next->d_inode && !d_is_fallthru(next)))
continue;

spin_unlock(&dcache_lock);
+ if (d_is_fallthru(next)) {
+ /*
+ * Fallthru lookup should never fail on tmpfs (except
+ * ENOMEM and the like). If fallthru fails, better to
+ * fake up return values than crash.
+ */
+ ino = 1;
+ d_type = DT_UNKNOWN;
+ generic_readdir_fallthru(filp->f_path.dentry,
+ next->d_name.name,
+ next->d_name.len,
+ &ino, &d_type);
+ } else {
+ ino = next->d_inode->i_ino;
+ d_type = dt_type(next->d_inode);
+ }
+
if (filldir(dirent, next->d_name.name,
next->d_name.len, filp->f_pos,
- next->d_inode->i_ino,
- dt_type(next->d_inode)) < 0)
+ ino, d_type) < 0)
return 0;
spin_lock(&dcache_lock);
/* next is still alive */
diff --git a/mm/shmem.c b/mm/shmem.c
index a0a4fa5..eab3f27 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1809,8 +1809,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
static int shmem_unlink(struct inode *dir, struct dentry *dentry);

/*
- * This is the whiteout support for tmpfs. It uses one singleton whiteout
- * inode per superblock thus it is very similar to shmem_link().
+ * Create a dentry to signify a whiteout.
*/
static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
struct dentry *new_dentry)
@@ -1841,8 +1840,10 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
spin_unlock(&sbinfo->stat_lock);
}

- if (old_dentry->d_inode) {
- if (S_ISDIR(old_dentry->d_inode->i_mode))
+ if (old_dentry->d_inode || d_is_fallthru(old_dentry)) {
+ /* A fallthru for a dir is treated like a regular link */
+ if (old_dentry->d_inode &&
+ S_ISDIR(old_dentry->d_inode->i_mode))
shmem_rmdir(dir, old_dentry);
else
shmem_unlink(dir, old_dentry);
@@ -1859,6 +1860,48 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
}

static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
+ struct inode *inode);
+
+/*
+ * Create a dentry to signify a fallthru. A fallthru in tmpfs is the
+ * logical equivalent of an in-kernel readdir() cache. It can't be
+ * deleted until the file system is unmounted.
+ */
+static int shmem_fallthru(struct inode *dir, struct dentry *dentry)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+
+ /* FIXME: this is stupid */
+ if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+ return -EPERM;
+
+ if (dentry->d_inode || d_is_fallthru(dentry) || d_is_whiteout(dentry))
+ return -EEXIST;
+
+ /*
+ * Each new link needs a new dentry, pinning lowmem, and tmpfs
+ * dentries cannot be pruned until they are unlinked.
+ */
+ if (sbinfo->max_inodes) {
+ spin_lock(&sbinfo->stat_lock);
+ if (!sbinfo->free_inodes) {
+ spin_unlock(&sbinfo->stat_lock);
+ return -ENOSPC;
+ }
+ sbinfo->free_inodes--;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
+ shmem_d_instantiate(dir, dentry, NULL);
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+ return 0;
+}
+
+static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
struct inode *inode)
{
if (d_is_whiteout(dentry)) {
@@ -1866,14 +1909,15 @@ static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
shmem_free_inode(dir->i_sb);
if (S_ISDIR(inode->i_mode))
inode->i_mode |= S_OPAQUE;
+ } else if (d_is_fallthru(dentry)) {
+ shmem_free_inode(dir->i_sb);
} else {
/* New dentry */
dir->i_size += BOGO_DIRENT_SIZE;
dget(dentry); /* Extra count - pin the dentry in core */
}
- /* Will clear DCACHE_WHITEOUT flag */
+ /* Will clear DCACHE_WHITEOUT and DCACHE_FALLTHRU flags */
d_instantiate(dentry, inode);
-
}
/*
* File creation. Allocate an inode, and we're done..
@@ -1955,7 +1999,8 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;

- if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
+ if (d_is_whiteout(dentry) || d_is_fallthru(dentry) ||
+ (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
shmem_free_inode(dir->i_sb);

if (inode) {
@@ -2479,8 +2524,10 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
sb->s_root = root;

#ifdef CONFIG_TMPFS
- if (!(sb->s_flags & MS_NOUSER))
+ if (!(sb->s_flags & MS_NOUSER)) {
sb->s_flags |= MS_WHITEOUT;
+ sb->s_flags |= MS_FALLTHRU;
+ }
#endif

return 0;
@@ -2583,6 +2630,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.mknod = shmem_mknod,
.rename = shmem_rename,
.whiteout = shmem_whiteout,
+ .fallthru = shmem_fallthru,
#endif
#ifdef CONFIG_TMPFS_POSIX_ACL
.setattr = shmem_notify_change,
--
1.6.3.3

2010-08-08 15:55:18

by Valerie Aurora

Subject: [PATCH 21/39] union-mount: Copy up directory entries on first readdir()

readdir() in union mounts is implemented by copying up all visible
directory entries from the lower level directories to the topmost
directory. Directory entries that refer to lower level file system
objects are marked as "fallthru" in the topmost directory.
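
The copy-up rule can be sketched in userspace. This is a minimal model,
assuming a flat array in place of a directory: struct model_entry,
model_is_dot(), model_has() and model_copyup() are hypothetical names
and do not exist in the patch. It skips "." and "..", lets any topmost
entry mask a lower entry of the same name, and records everything else
as a fallthru.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum model_kind { MODEL_REGULAR, MODEL_WHITEOUT, MODEL_FALLTHRU };

/* Hypothetical stand-in for one topmost directory entry. */
struct model_entry {
	const char *name;
	enum model_kind kind;
};

/* True for "." and "..", which are never copied up. */
bool model_is_dot(const char *name, size_t len)
{
	return (len == 1 && name[0] == '.') ||
	       (len == 2 && name[0] == '.' && name[1] == '.');
}

/* Linear search; the kernel uses a dcache lookup instead. */
bool model_has(const struct model_entry *dir, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(dir[i].name, name) == 0)
			return true;
	return false;
}

/*
 * Copy up the names of one lower directory into top[], which holds n
 * entries and has room for cap. Returns the new entry count.
 */
size_t model_copyup(struct model_entry *top, size_t n, size_t cap,
		    const char *const *lower, size_t nlower)
{
	for (size_t i = 0; i < nlower; i++) {
		const char *name = lower[i];

		if (model_is_dot(name, strlen(name)))
			continue;
		/* A file, whiteout or fallthru on top masks the lower entry. */
		if (model_has(top, n, name))
			continue;
		if (n == cap)
			break;	/* model only; the kernel reports an error */
		top[n].name = name;
		top[n].kind = MODEL_FALLTHRU;
		n++;
	}
	return n;
}
```

With "a" (regular) and "b" (whiteout) on top and ".", "..", "a", "b",
"c" below, only "c" is added, as a fallthru.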

Thanks to Felix Fietkau <[email protected]> for a bug fix.

XXX - How to deal with fallthrus in lower layers?

Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: Felix Fietkau <[email protected]>
---
fs/readdir.c | 9 +++
fs/union.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 2 +
3 files changed, 173 insertions(+), 0 deletions(-)

diff --git a/fs/readdir.c b/fs/readdir.c
index 3a48491..dd3eae1 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -19,6 +19,8 @@

#include <asm/uaccess.h>

+#include "union.h"
+
int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
struct inode *inode = file->f_path.dentry->d_inode;
@@ -36,9 +38,16 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)

res = -ENOENT;
if (!IS_DEADDIR(inode)) {
+ if (IS_DIR_UNIONED(file->f_path.dentry) && !IS_OPAQUE(inode)) {
+ res = union_copyup_dir(&file->f_path);
+ if (res)
+ goto out_unlock;
+ }
+
res = file->f_op->readdir(file, buf, filler);
file_accessed(file);
}
+out_unlock:
mutex_unlock(&inode->i_mutex);
out:
return res;
diff --git a/fs/union.c b/fs/union.c
index c089c02..917248d 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -22,6 +22,8 @@
#include <linux/fs_struct.h>
#include <linux/slab.h>
#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/security.h>

#include "union.h"

@@ -211,3 +213,163 @@ int union_create_topmost_dir(struct path *parent, struct qstr *name,

return res;
}
+
+/**
+ * union_copyup_dir_one - copy up a single directory entry
+ *
+ * Individual directory entry copyup function for union_copyup_dir.
+ * We get the entries from higher level layers first.
+ */
+
+static int union_copyup_dir_one(void *buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct dentry *topmost_dentry = (struct dentry *) buf;
+ struct dentry *dentry;
+ int err = 0;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ /* Lookup this entry in the topmost directory */
+ dentry = lookup_one_len(name, topmost_dentry, namlen);
+
+ if (IS_ERR(dentry)) {
+ printk(KERN_WARNING "%s: error looking up %.*s\n", __func__,
+ namlen, name);
+ err = PTR_ERR(dentry);
+ goto out;
+ }
+
+ /*
+ * If the entry already exists, one of the following is true:
+ * it was already copied up (due to an earlier lookup), an
+ * entry with the same name already exists on the topmost file
+ * system, it is a whiteout, or it is a fallthru. In each
+ * case, the top level entry masks any entries from lower file
+ * systems, so don't copy up this entry.
+ */
+ if (dentry->d_inode || d_is_whiteout(dentry) || d_is_fallthru(dentry))
+ goto out_dput;
+
+ /*
+ * If the entry doesn't exist, create a fallthru entry in the
+ * topmost file system. All possible directory types are
+ * used, so each file system must implement its own way of
+ * storing a fallthru entry.
+ */
+ err = topmost_dentry->d_inode->i_op->fallthru(topmost_dentry->d_inode,
+ dentry);
+out_dput:
+ dput(dentry);
+out:
+ return err;
+}
+
+/**
+ * union_copyup_dir - copy up low-level directory entries to topmost dir
+ *
+ * readdir() is difficult to support on union file systems for two
+ * reasons: We must eliminate duplicates and apply whiteouts, and we
+ * must return something in f_pos that lets us restart in the same
+ * place when we return. Our solution is to, on first readdir() of
+ * the directory, copy up all visible entries from the low-level file
+ * systems and mark the entries that refer to low-level file system
+ * objects as "fallthru" entries.
+ *
+ * Locking strategy: We hold the topmost dir's i_mutex on entry. We
+ * grab the i_mutex on lower directories one by one. So the locking
+ * order is:
+ *
+ * Writable/topmost layers > Read-only/lower layers
+ *
+ * So there is no problem with lock ordering for union stacks with
+ * multiple lower layers. E.g.:
+ *
+ * (topmost) A->B->C (bottom)
+ * (topmost) D->C->B (bottom)
+ *
+ */
+
+int union_copyup_dir(struct path *topmost_path)
+{
+ struct dentry *topmost_dentry = topmost_path->dentry;
+ struct union_dir *ud;
+ int res = 0;
+
+ BUG_ON(IS_OPAQUE(topmost_dentry->d_inode));
+
+ if (!topmost_dentry->d_inode->i_op || !topmost_dentry->d_inode->i_op->fallthru)
+ return -EOPNOTSUPP;
+
+ res = mnt_want_write(topmost_path->mnt);
+ if (res)
+ return res;
+
+ for (ud = topmost_path->dentry->d_union_dir; ud != NULL; ud = ud->u_lower) {
+ struct file * ftmp;
+ struct inode * inode;
+ struct path path;
+
+ BUG_ON(ud->u_this.dentry->d_count.counter == 0);
+ path = ud->u_this;
+ /* dentry_open() doesn't get a path reference itself */
+ path_get(&path);
+ ftmp = dentry_open(path.dentry, path.mnt,
+ O_RDONLY | O_DIRECTORY | O_NOATIME,
+ current_cred());
+ if (IS_ERR(ftmp)) {
+ printk (KERN_ERR "unable to open dir %s for "
+ "directory copyup: %ld\n",
+ path.dentry->d_name.name, PTR_ERR(ftmp));
+ path_put(&path);
+ res = PTR_ERR(ftmp);
+ break;
+ }
+
+ inode = path.dentry->d_inode;
+ mutex_lock(&inode->i_mutex);
+
+ res = -ENOENT;
+ if (IS_DEADDIR(inode))
+ goto out_fput;
+ /*
+ * Read the whole directory, calling our directory
+ * entry copyup function on each entry. Pass in the
+ * topmost dentry as our private data so we can create
+ * new entries in the topmost directory.
+ */
+ res = ftmp->f_op->readdir(ftmp, topmost_dentry,
+ union_copyup_dir_one);
+out_fput:
+ mutex_unlock(&inode->i_mutex);
+ fput(ftmp);
+
+ if (res)
+ break;
+
+ /* XXX Should process directories below an opaque
+ * directory in case there are fallthrus in it */
+ if (IS_OPAQUE(path.dentry->d_inode))
+ break;
+ }
+ /*
+ * Mark this dir opaque to show that we have already copied up
+ * the lower entries. Be sure to do this AFTER the directory
+ * entries have been copied up in case of a crash.
+ */
+ if (!res) {
+ topmost_dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(topmost_dentry->d_inode);
+ }
+
+ mnt_drop_write(topmost_path->mnt);
+ return res;
+}
diff --git a/fs/union.h b/fs/union.h
index 505f132..80c2421 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -58,6 +58,7 @@ extern void d_free_unions(struct dentry *);
int needs_lookup_union(struct path *, struct path *);
int union_create_topmost_dir(struct path *, struct qstr *, struct path *,
struct path *);
+extern int union_copyup_dir(struct path *);

#else /* CONFIG_UNION_MOUNT */

@@ -67,6 +68,7 @@ int union_create_topmost_dir(struct path *, struct qstr *, struct path *,
#define d_free_unions(x) do { } while (0)
#define needs_lookup_union(x, y) ({ (0); })
#define union_create_topmost_dir(w, x, y, z) ({ BUG(); (NULL); })
+#define union_copyup_dir(x) ({ BUG(); (0); })

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2010-08-08 15:59:33

by Valerie Aurora

Subject: [PATCH 23/39] fallthru: ext2 fallthru support

Add support for fallthru directory entries to ext2.
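
On disk, a fallthru is a directory entry with inode 0 and file type
EXT2_FT_FALLTHRU, so the name match has to accept it even though inode 0
normally marks an unused slot. A userspace sketch of the updated
ext2_match() logic follows. struct model_dirent and model_match() are
hypothetical simplifications (no rec_len, fixed-size name buffer); the
two type values are taken from the ext2_fs.h hunk below.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values from the include/linux/ext2_fs.h hunk in this patch. */
enum { MODEL_FT_WHT = 8, MODEL_FT_FALLTHRU = 9 };

/* Simplified stand-in for struct ext2_dir_entry_2. */
struct model_dirent {
	uint32_t inode;		/* 0 for unused, whiteout and fallthru */
	uint8_t name_len;
	uint8_t file_type;
	char name[255];
};

/*
 * Mirrors the updated ext2_match(): an entry with inode 0 only matches
 * if its type marks it as a whiteout or a fallthru.
 */
bool model_match(size_t len, const char *name, const struct model_dirent *de)
{
	if (len != de->name_len)
		return false;
	if (!de->inode && de->file_type != MODEL_FT_WHT &&
	    de->file_type != MODEL_FT_FALLTHRU)
		return false;
	return memcmp(name, de->name, len) == 0;
}
```

An unused slot that still carries an old name does not match, while a
whiteout or fallthru with the same name does.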

Cc: Theodore Tso <[email protected]>
Cc: [email protected]
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
---
fs/ext2/dir.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--
fs/ext2/ext2.h | 1 +
fs/ext2/namei.c | 22 +++++++++++
fs/ext2/super.c | 2 +
include/linux/ext2_fs.h | 4 ++
5 files changed, 117 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 030bd46..f19651e 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,8 @@ static inline int ext2_match (int len, const char * const name,
{
if (len != de->name_len)
return 0;
- if (!de->inode && (de->file_type != EXT2_FT_WHT))
+ if (!de->inode && ((de->file_type != EXT2_FT_WHT) &&
+ (de->file_type != EXT2_FT_FALLTHRU)))
return 0;
return !memcmp(name, de->name, len);
}
@@ -256,6 +257,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
[EXT2_FT_WHT] = DT_WHT,
+ [EXT2_FT_FALLTHRU] = DT_UNKNOWN,
};

#define S_SHIFT 12
@@ -342,6 +344,24 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
ext2_put_page(page);
return 0;
}
+ } else if (de->file_type == EXT2_FT_FALLTHRU) {
+ int over;
+ unsigned char d_type = DT_UNKNOWN;
+ ino_t ino;
+ int err;
+
+ offset = (char *)de - kaddr;
+ err = generic_readdir_fallthru(filp->f_path.dentry, de->name,
+ de->name_len, &ino, &d_type);
+ if (!err) {
+ over = filldir(dirent, de->name, de->name_len,
+ (n<<PAGE_CACHE_SHIFT) | offset,
+ ino, d_type);
+ if (over) {
+ ext2_put_page(page);
+ return 0;
+ }
+ }
}
filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
}
@@ -463,6 +483,10 @@ ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
spin_lock(&dentry->d_lock);
dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&dentry->d_lock);
+ } else if(!res && de->file_type == EXT2_FT_FALLTHRU) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
}
ext2_put_page(page);
}
@@ -532,6 +556,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
de->name_len = 0;
de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
+ de->file_type = 0;
goto got_it;
}
if (de->rec_len == 0) {
@@ -545,6 +570,7 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && (de->file_type != EXT2_FT_WHT) &&
+ (de->file_type != EXT2_FT_FALLTHRU) &&
(rec_len >= reclen))
goto got_it;
if (rec_len >= name_len + reclen)
@@ -587,7 +613,8 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)

err = -EEXIST;
if (ext2_match (namelen, name, de)) {
- if (de->file_type == EXT2_FT_WHT)
+ if ((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU))
goto got_it;
goto out_unlock;
}
@@ -602,7 +629,8 @@ got_it:
&page, NULL);
if (err)
goto out_unlock;
- if (de->inode || ((de->file_type == EXT2_FT_WHT) &&
+ if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) &&
!ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
@@ -627,6 +655,60 @@ out_unlock:
}

/*
+ * Create a fallthru entry.
+ */
+int ext2_fallthru_entry (struct inode *dir, struct dentry *dentry)
+{
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ ext2_dirent * de;
+ struct page *page;
+ loff_t pos;
+ int err;
+
+ de = ext2_append_entry(dentry, &page);
+ if (IS_ERR(de))
+ return PTR_ERR(de);
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de))
+ goto out_unlock;
+
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
+ pos = page_offset(page) +
+ (char*)de - (char*)page_address(page);
+ err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
+ &page, NULL);
+ if (err)
+ goto out_unlock;
+ if (de->inode || (de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) {
+ ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
+ de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+ de->rec_len = ext2_rec_len_to_disk(name_len);
+ de = de1;
+ }
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+ de->inode = 0;
+ de->file_type = EXT2_FT_FALLTHRU;
+ err = ext2_commit_chunk(page, pos, rec_len);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
+ EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
+ mark_inode_dirty(dir);
+ /* OFFSET_CACHE */
+out_put:
+ ext2_put_page(page);
+ return err;
+out_unlock:
+ unlock_page(page);
+ goto out_put;
+}
+
+/*
* ext2_delete_entry deletes a directory entry by merging it with the
* previous entry. Page is up-to-date. Releases the page.
*/
@@ -711,7 +793,9 @@ int ext2_whiteout_entry (struct inode * dir, struct dentry * dentry,
*/
if (ext2_match (namelen, name, de))
de->inode = 0;
- if (de->inode || (de->file_type == EXT2_FT_WHT)) {
+ if (de->inode || (((de->file_type == EXT2_FT_WHT) ||
+ (de->file_type == EXT2_FT_FALLTHRU)) &&
+ !ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
de->rec_len = ext2_rec_len_to_disk(name_len);
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 89ab2f7..1504814 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -108,6 +108,7 @@ extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *,
extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
extern int ext2_whiteout_entry (struct inode *, struct dentry *,
struct ext2_dir_entry_2 *, struct page *);
+extern int ext2_fallthru_entry (struct inode *, struct dentry *);
extern int ext2_empty_dir (struct inode *);
extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 8f92dd0..af4052f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -345,6 +345,7 @@ static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
goto out;

spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags &= ~DCACHE_FALLTHRU;
new_dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&new_dentry->d_lock);
d_add(new_dentry, NULL);
@@ -363,6 +364,26 @@ out:
return err;
}

+/*
+ * Create a fallthru entry.
+ */
+static int ext2_fallthru (struct inode *dir, struct dentry *dentry)
+{
+ int err;
+
+ dquot_initialize(dir);
+
+ err = ext2_fallthru_entry(dir, dentry);
+ if (err)
+ return err;
+
+ d_instantiate(dentry, NULL);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+ return 0;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -466,6 +487,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
.whiteout = ext2_whiteout,
+ .fallthru = ext2_fallthru,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 704521b..76eba1e 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1095,6 +1095,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_WHITEOUT))
sb->s_flags |= MS_WHITEOUT;
+ if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_FALLTHRU))
+ sb->s_flags |= MS_FALLTHRU;

if (ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY))
sb->s_flags |= MS_RDONLY;
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index b0fb356..1a6f929 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -505,11 +505,14 @@ struct ext2_super_block {
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
+/* ext3/4 incompat flags take up the intervening constants */
+#define EXT2_FEATURE_INCOMPAT_FALLTHRU 0x2000
#define EXT2_FEATURE_INCOMPAT_ANY 0xffffffff

#define EXT2_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
EXT2_FEATURE_INCOMPAT_WHITEOUT| \
+ EXT2_FEATURE_INCOMPAT_FALLTHRU| \
EXT2_FEATURE_INCOMPAT_META_BG)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -577,6 +580,7 @@ enum {
EXT2_FT_SOCK = 6,
EXT2_FT_SYMLINK = 7,
EXT2_FT_WHT = 8,
+ EXT2_FT_FALLTHRU = 9,
EXT2_FT_MAX
};

--
1.6.3.3

2010-08-08 16:00:04

by Valerie Aurora

Subject: [PATCH 18/39] union-mount: Support for union mounting file systems

Create and tear down union mount structures on mount. Check
requirements for union mounts. This version clones the read-only
mounts as one big tree and points to them from the superblock of the
topmost layer file system.
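
The mount-time requirements can be summarised as a userspace model of
check_mnt_union(). This is a sketch, not the kernel function: struct
model_lower, struct model_union_req and model_check_mnt_union() are
hypothetical, booleans replace the kernel flag tests, and the checks
run in the same order and return the same error codes as the diff
below.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one lower mount (or submount). */
struct model_lower {
	bool sb_readonly;	/* models s_flags & MS_RDONLY */
	bool shared_or_slave;	/* models IS_MNT_SHARED() || IS_MNT_SLAVE() */
};

/* Hypothetical description of a union mount request. */
struct model_union_req {
	const struct model_lower *lower;	/* lower mount and submounts */
	size_t nlower;
	bool mntpnt_is_root;	/* models IS_ROOT(mntpnt->dentry) */
	bool top_readonly;	/* models mnt_flags & MNT_READONLY */
	bool top_whiteout;	/* models s_flags & MS_WHITEOUT */
	bool top_fallthru;	/* models s_flags & MS_FALLTHRU */
};

int model_check_mnt_union(const struct model_union_req *req)
{
	/* Every lower mount must be read-only and not shared or slave. */
	for (size_t i = 0; i < req->nlower; i++) {
		if (!req->lower[i].sb_readonly)
			return -EBUSY;
		if (req->lower[i].shared_or_slave)
			return -EBUSY;
	}
	/* Union only at the root of a file system. */
	if (!req->mntpnt_is_root)
		return -EINVAL;
	/* The topmost layer must be writable for readdir() copy-up. */
	if (req->top_readonly)
		return -EROFS;
	/* The topmost file system must support whiteouts and fallthrus. */
	if (!req->top_whiteout || !req->top_fallthru)
		return -EINVAL;
	return 0;
}
```

A writable lower mount yields -EBUSY, a read-only topmost mount yields
-EROFS, and a topmost file system lacking whiteouts or fallthrus yields
-EINVAL.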

Thanks to Felix Fietkau <[email protected]> for a bug fix and Miklos
Szeredi <[email protected]> for better mount error messages.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namespace.c | 255 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/super.c | 1 +
include/linux/fs.h | 7 ++
include/linux/mount.h | 2 +
4 files changed, 263 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index f115cb6..aa6a132 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -33,6 +33,7 @@
#include <asm/unistd.h>
#include "pnode.h"
#include "internal.h"
+#include "union.h"

#define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
#define HASH_SIZE (1UL << HASH_SHIFT)
@@ -1050,6 +1051,7 @@ void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
propagate_umount(kill);

list_for_each_entry(p, kill, mnt_hash) {
+ d_free_unions(p->mnt_root);
list_del_init(&p->mnt_expire);
list_del_init(&p->mnt_list);
__touch_mnt_namespace(p->mnt_ns);
@@ -1333,6 +1335,217 @@ static int invent_group_ids(struct vfsmount *mnt, bool recurse)
return 0;
}

+/**
+ * check_mnt_union - mount-time checks for union mount
+ *
+ * @mntpnt: path of the mountpoint the new mount will be on
+ * @topmost_mnt: vfsmount of the new file system to be mounted
+ * @mnt_flags: mount flags for the new file system
+ *
+ * Mount-time check of upper and lower layer file systems to see if we
+ * can union mount one on the other.
+ *
+ * The rules:
+ *
+ * Lower layer(s) and submounts read-only: We can't deal with
+ * namespace changes in the lower layers of a union, so the lower
+ * layer must be read-only. Note that we could possibly convert a
+ * read-write unioned mount into a read-only mount here.
+ *
+ * Lower layer(s) and submounts not shared: The lower layer(s) of a
+ * union mount must not have any changes to its namespace. Therefore,
+ * it must not be part of any mount event propagation group - i.e.,
+ * shared or slave.
+ *
+ * Union only at roots of file systems: Only permit unioning of file
+ * systems at their root directories. This allows us to mark entire
+ * mounts as unioned. Otherwise we must slowly and expensively work
+ * our way up a path looking for a unioned directory before we know if
+ * a path is from a unioned lower layer.
+ *
+ * Topmost layer must be writable to support our readdir()
+ * solution of copying up all lower level entries to the
+ * topmost layer.
+ *
+ * Topmost file system must support whiteouts and fallthrus.
+ *
+ * Topmost file system can't be mounted elsewhere. XXX implement some
+ * kind of marker in the superblock so subsequent mounts are not
+ * possible.
+ *
+ */
+
+static int
+check_mnt_union(struct path *mntpnt, struct vfsmount *topmost_mnt, int mnt_flags)
+{
+ struct vfsmount *p, *lower_mnt = mntpnt->mnt;
+
+ if (!(mnt_flags & MNT_UNION))
+ return 0;
+
+#ifndef CONFIG_UNION_MOUNT
+ printk(KERN_INFO "union mount: not supported by the kernel\n");
+ return -EINVAL;
+#endif
+ for (p = lower_mnt; p; p = next_mnt(p, lower_mnt)) {
+ if (!(p->mnt_sb->s_flags & MS_RDONLY))
+ return -EBUSY;
+ if (IS_MNT_SHARED(p) || IS_MNT_SLAVE(p))
+ return -EBUSY;
+ }
+
+ if (!IS_ROOT(mntpnt->dentry)) {
+ printk(KERN_INFO "union mount: mount point must be a root dir\n");
+ return -EINVAL;
+ }
+
+ if (mnt_flags & MNT_READONLY)
+ return -EROFS;
+
+ if (!(topmost_mnt->mnt_sb->s_flags & MS_WHITEOUT)) {
+ printk(KERN_INFO "union mount: whiteouts not supported by fs\n");
+ return -EINVAL;
+ }
+
+ if (!(topmost_mnt->mnt_sb->s_flags & MS_FALLTHRU)) {
+ printk(KERN_INFO "union mount: fallthrus not supported by fs\n");
+ return -EINVAL;
+ }
+
+ /* XXX top level mount should only be mounted once */
+
+ return 0;
+}
+
+void put_union_sb(struct super_block *sb)
+{
+ struct vfsmount *p, *mnt;
+ LIST_HEAD(umount_list);
+
+ if (!sb->s_ro_union_mnts)
+ return;
+ mnt = sb->s_ro_union_mnts;
+ for (p = mnt; p; p = next_mnt(p, mnt))
+ dec_hard_readonly_users(p);
+ spin_lock(&vfsmount_lock);
+ umount_tree(mnt, 0, &umount_list);
+ spin_unlock(&vfsmount_lock);
+ release_mounts(&umount_list);
+}
+
+static void cleanup_mnt_union(struct vfsmount *topmost_mnt)
+{
+ d_free_unions(topmost_mnt->mnt_root);
+ put_union_sb(topmost_mnt->mnt_sb);
+}
+
+/*
+ * find_union_root - Find the "lowest" (union low) mount to be unioned
+ */
+
+static struct vfsmount *find_union_root(struct vfsmount *topmost_mnt, struct path *mntpnt)
+{
+ struct path this_layer = *mntpnt;
+ struct vfsmount *lowest_mnt = NULL;
+
+ while(check_mnt_union(&this_layer, topmost_mnt, MNT_UNION) == 0) {
+ lowest_mnt = this_layer.mnt;
+ this_layer.dentry = this_layer.mnt->mnt_mountpoint;
+ this_layer.mnt = this_layer.mnt->mnt_parent;
+ }
+ return lowest_mnt;
+}
+
+/*
+ * Build the union stack for the root dir. Note that topmost_mnt is
+ * not connected to the mount tree yet and that the cloned tree is not
+ * either.
+ */
+
+static int build_root_union(struct vfsmount *topmost_mnt, struct vfsmount *clone_root)
+{
+ struct union_dir **next_ud;
+ struct path upper, lower;
+ struct vfsmount *p, *mnt;
+ int err = 0;
+
+ /*
+ * Find the topmost read-only mount, starting from the root
+ * of the cloned tree of read-only mounts. __lookup_mnt() and
+ * friends don't work because the cloned tree is not mounted
+ * anywhere.
+ */
+ mnt = clone_root;
+ for (p = clone_root; p; p = next_mnt(p, clone_root)) {
+ if ((p->mnt_parent == mnt) &&
+ (p->mnt_mountpoint == mnt->mnt_root))
+ mnt = p;
+ }
+
+ /* Build the root union stack */
+ upper.mnt = topmost_mnt;
+ upper.dentry = topmost_mnt->mnt_root;
+ next_ud = &upper.dentry->d_union_dir;
+
+ while (upper.mnt != clone_root) {
+ lower.mnt = mntget(mnt);
+ lower.dentry = dget(mnt->mnt_root);
+ err = union_add_dir(&upper, &lower, next_ud);
+ if (err)
+ goto out;
+ next_ud = &(*next_ud)->u_lower;
+ upper = lower;
+ mnt = mnt->mnt_parent;
+ }
+out:
+ return err;
+}
+
+/**
+ * prepare_mnt_union - do setup necessary for a union mount
+ *
+ * @topmost_mnt: vfsmount of topmost layer
+ * @mntpnt: path of requested mountpoint
+ *
+ * We union every underlying file system that is mounted on the same
+ * mountpoint (well, pathname), read-only, and not shared. We clone
+ * the entire underlying read-only mount tree and keep a pointer to it
+ * from the topmost file system's superblock.
+ *
+ * XXX - Maybe should take # of layers to go down as an argument. But
+ * how to pass this in through mount options? All solutions look ugly.
+ */
+
+static int prepare_mnt_union(struct vfsmount *topmost_mnt, struct path *mntpnt)
+{
+ struct super_block *sb = topmost_mnt->mnt_sb;
+ struct vfsmount *p, *clone_root;
+ int err;
+
+ clone_root = find_union_root(topmost_mnt, mntpnt);
+ if (!clone_root)
+ return 0; /* Nothing to union */
+
+ /* Clone the whole mount tree that we're going to union. */
+ err = -ENOMEM;
+ sb->s_ro_union_mnts = copy_tree(clone_root, clone_root->mnt_root,
+ CL_COPY_ALL | CL_PRIVATE);
+ if (!sb->s_ro_union_mnts)
+ goto out;
+
+ for (p = sb->s_ro_union_mnts; p; p = next_mnt(p, sb->s_ro_union_mnts))
+ inc_hard_readonly_users(p);
+
+ err = build_root_union(topmost_mnt, clone_root);
+ if (err)
+ goto out;
+
+ return 0;
+out:
+ cleanup_mnt_union(topmost_mnt);
+ return err;
+}
+
/*
* @source_mnt : mount tree to be attached
* @nd : place the mount tree @source_mnt is attached
@@ -1410,9 +1623,16 @@ static int attach_recursive_mnt(struct vfsmount *source_mnt,
if (err)
goto out;
}
+
+ if (!parent_path && IS_MNT_UNION(source_mnt)) {
+ err = prepare_mnt_union(source_mnt, path);
+ if (err)
+ goto out_cleanup_ids;
+ }
+
err = propagate_mnt(dest_mnt, dest_dentry, source_mnt, &tree_list);
if (err)
- goto out_cleanup_ids;
+ goto out_cleanup_union;

spin_lock(&vfsmount_lock);

@@ -1436,6 +1656,9 @@ static int attach_recursive_mnt(struct vfsmount *source_mnt,
spin_unlock(&vfsmount_lock);
return 0;

+ out_cleanup_union:
+ if (IS_MNT_UNION(source_mnt))
+ cleanup_mnt_union(source_mnt);
out_cleanup_ids:
if (IS_MNT_SHARED(dest_mnt))
cleanup_group_ids(source_mnt, NULL);
@@ -1482,6 +1705,17 @@ static int do_change_type(struct path *path, int flag)
return -EINVAL;

down_write(&namespace_sem);
+
+ /*
+ * Mounts of file systems with read-only users can't deal with
+ * mount/umount propagation events - it's the moral equivalent
+ * of rm -rf dir/ or the like.
+ */
+ if (sb_is_hard_readonly(mnt->mnt_sb)) {
+ err = -EROFS;
+ goto out_unlock;
+ }
+
if (type == MS_SHARED) {
err = invent_group_ids(mnt, recurse);
if (err)
@@ -1519,6 +1753,9 @@ static int do_loopback(struct path *path, char *old_name,
err = -EINVAL;
if (IS_MNT_UNBINDABLE(old_path.mnt))
goto out;
+ /* Mount part of a union mount elsewhere? The mind boggles. */
+ if (IS_MNT_UNION(old_path.mnt))
+ goto out;

if (!check_mnt(path->mnt) || !check_mnt(old_path.mnt))
goto out;
@@ -1540,7 +1777,6 @@ static int do_loopback(struct path *path, char *old_name,
spin_unlock(&vfsmount_lock);
release_mounts(&umount_list);
}
-
out:
up_write(&namespace_sem);
path_put(&old_path);
@@ -1581,6 +1817,17 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
if (!check_mnt(path->mnt))
return -EINVAL;

+ if (mnt_flags & MNT_UNION)
+ return -EINVAL;
+
+ if ((path->mnt->mnt_flags & MNT_UNION) &&
+ !(mnt_flags & MNT_UNION))
+ return -EINVAL;
+
+ if ((path->mnt->mnt_flags & MNT_UNION) &&
+ (mnt_flags & MNT_READONLY))
+ return -EINVAL;
+
if (path->dentry != path->mnt->mnt_root)
return -EINVAL;

@@ -1743,6 +1990,10 @@ int do_add_mount(struct vfsmount *newmnt, struct path *path,
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;

+ err = check_mnt_union(path, newmnt, mnt_flags);
+ if (err)
+ goto unlock;
+
newmnt->mnt_flags = mnt_flags;
if ((err = graft_tree(newmnt, path)))
goto unlock;
diff --git a/fs/super.c b/fs/super.c
index 86bdf1f..bdfe98f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -160,6 +160,7 @@ void deactivate_locked_super(struct super_block *s)
if (atomic_dec_and_test(&s->s_active)) {
fs->kill_sb(s);
put_filesystem(fs);
+ put_union_sb(s);
put_super(s);
} else {
up_write(&s->s_umount);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 31cfa48..b88d088 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1397,6 +1397,13 @@ struct super_block {
* read-only.
*/
int s_hard_readonly_users;
+
+ /*
+ * If this is the topmost file system in a union mount, this
+ * points to the root of the private cloned vfsmount tree of
+ * the read-only mounts in this union.
+ */
+ struct vfsmount *s_ro_union_mnts;
};

extern struct timespec current_fs_time(struct super_block *sb);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 0302703..17d3d27 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -136,4 +136,6 @@ extern void mark_mounts_for_expiry(struct list_head *mounts);

extern dev_t name_to_dev_t(char *name);

+extern void put_union_sb(struct super_block *sb);
+
#endif /* _LINUX_MOUNT_H */
--
1.6.3.3
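
Reader's note: the loop in build_root_union() appends each lower layer through a pointer to the previous entry's u_lower link. The user-space sketch below models only that idiom. The model_* names and the integer layer_id are hypothetical stand-ins, not kernel types or functions, and no reference counting or locking is modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct union_dir: only the u_lower link
 * from the patch is modelled, plus an id naming the layer. */
struct model_union_dir {
	int layer_id;
	struct model_union_dir *u_lower;
};

/* Append one layer at *next_ud. Returns 0, or -1 if allocation fails. */
int model_union_add_dir(struct model_union_dir **next_ud, int layer_id)
{
	struct model_union_dir *ud = malloc(sizeof(*ud));

	if (!ud)
		return -1;
	ud->layer_id = layer_id;
	ud->u_lower = NULL;
	*next_ud = ud;
	return 0;
}

/* Build a stack top-down from an array of layer ids. After each append,
 * next_ud is advanced to the new entry's u_lower link, as in the patch.
 * On failure the partially built stack is left in *head for the caller
 * to free. */
int model_build_stack(struct model_union_dir **head, const int *layers, int n)
{
	struct model_union_dir **next_ud = head;
	int i;

	assert(head != NULL);
	*head = NULL;
	for (i = 0; i < n; i++) {
		if (model_union_add_dir(next_ud, layers[i]))
			return -1;
		next_ud = &(*next_ud)->u_lower;
	}
	return 0;
}

/* Free every entry in the stack. */
void model_free_stack(struct model_union_dir *head)
{
	while (head) {
		struct model_union_dir *lower = head->u_lower;

		free(head);
		head = lower;
	}
}
```

Advancing a pointer-to-pointer means the first append and every later append take the same code path, so the head of the list needs no special case.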

2010-08-08 16:00:26

by Valerie Aurora

Subject: [PATCH 19/39] union-mount: Implement union lookup

Implement unioned directories, whiteouts, and fallthrus in pathname
lookup routines. do_lookup() and lookup_hash() call lookup_union()
after looking up the dentry from the top-level file system.
lookup_union() is centered around __lookup_hash(), which does cached
and/or real lookups and revalidates each dentry in the union stack.

XXX - implement negative union cache entries

XXX - handle different permissions on directories

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/union.c | 94 ++++++++++++++++++++++++++++++++
fs/union.h | 7 +++
3 files changed, 274 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 0b6378e..0821544 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -35,6 +35,7 @@
#include <asm/uaccess.h>

#include "internal.h"
+#include "union.h"

/* [Feb-1997 T. Schoebel-Theuer]
* Fundamental changes in the pathname lookup mechanisms (namei)
@@ -723,6 +724,163 @@ static __always_inline void follow_dotdot(struct nameidata *nd)
follow_mount(&nd->path);
}

+static struct dentry *__lookup_hash(struct qstr *name, struct dentry *base,
+ struct nameidata *nd);
+
+/*
+ * __lookup_union - Given a path from the topmost layer, lookup and
+ * revalidate each dentry in its union stack, building it if necessary
+ *
+ * @nd - nameidata for the parent of @topmost
+ * @name - pathname from this element on
+ * @topmost - path of the topmost matching dentry
+ *
+ * Given the nameidata and the path of the topmost dentry for this
+ * pathname, lookup, revalidate, and build the associated union stack.
+ * @topmost must be either a negative dentry or a directory, and not a
+ * whiteout.
+ *
+ * This function may stomp nd->path with the path of the parent
+ * directory of a lower layer, so the caller must save nd->path and
+ * restore it afterwards. You probably want to use lookup_union(),
+ * not __lookup_union().
+ */
+
+static int __lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct path parent = nd->path;
+ struct path lower, upper;
+ struct union_dir *ud;
+ /* next_ud is the head of the list of union dirs for this dentry */
+ struct union_dir **next_ud = &topmost->dentry->d_union_dir;
+ int err = 0;
+
+ /*
+ * upper is either a negative dentry from the top layer, or it
+ * is the most recent positive dentry for a directory that
+ * we've seen.
+ */
+ upper = *topmost;
+
+ /* Go through each dir underlying the parent, looking for a match */
+ for (ud = nd->path.dentry->d_union_dir; ud != NULL; ud = ud->u_lower) {
+ BUG_ON(ud->u_this.dentry->d_count.counter == 0);
+ /* Change the nameidata to point to this level's dir */
+ nd->path = ud->u_this;
+ /* Lookup the child in this level */
+ lower.mnt = mntget(nd->path.mnt);
+ mutex_lock(&nd->path.dentry->d_inode->i_mutex);
+ lower.dentry = __lookup_hash(name, nd->path.dentry, nd);
+ mutex_unlock(&nd->path.dentry->d_inode->i_mutex);
+
+ if (IS_ERR(lower.dentry)) {
+ mntput(lower.mnt);
+ err = PTR_ERR(lower.dentry);
+ goto out;
+ }
+
+ if (!lower.dentry->d_inode) {
+ if (d_is_whiteout(lower.dentry))
+ break;
+ if (IS_OPAQUE(nd->path.dentry->d_inode) &&
+ !d_is_fallthru(lower.dentry))
+ break;
+ /* Plain old negative! Keep looking */
+ path_put(&lower);
+ continue;
+ }
+
+ /* Finding a non-dir ends the lookup, one way or another */
+ if (!S_ISDIR(lower.dentry->d_inode->i_mode)) {
+ /* Ignore file below dir - invalid */
+ if (upper.dentry->d_inode &&
+ S_ISDIR(upper.dentry->d_inode->i_mode)) {
+ path_put(&lower);
+ break;
+ }
+ /* Bingo, found our target */
+ dput(topmost->dentry);
+ /* mntput(topmost) done in link_path_walk() */
+ *topmost = lower;
+ break;
+ }
+
+ /* Allow read-only submounts on lower layers */
+ follow_mount(&lower);
+
+ /* Found a directory. Create the topmost version if it doesn't exist */
+ if (!topmost->dentry->d_inode) {
+ err = union_create_topmost_dir(&parent, name, topmost,
+ &lower);
+ if (err) {
+ path_put(&lower);
+ return err;
+ }
+ }
+
+ err = union_add_dir(&upper, &lower, next_ud);
+ if (err)
+ break;
+
+ next_ud = &(*next_ud)->u_lower;
+ upper = lower;
+ }
+out:
+ return err;
+}
+
+/*
+ * lookup_union - revalidate and build union stack for this path
+ *
+ * We borrow the nameidata struct from the topmost layer to do the
+ * revalidation on lower dentries, replacing the topmost parent
+ * directory's path with that of the matching parent dir in each lower
+ * layer. This wrapper for __lookup_union() saves the topmost layer's
+ * path and restores it when we are done.
+ */
+static int lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct path saved_path;
+ int err;
+
+ BUG_ON(!IS_MNT_UNION(nd->path.mnt) && !IS_MNT_UNION(topmost->mnt));
+ BUG_ON(!mutex_is_locked(&nd->path.dentry->d_inode->i_mutex));
+
+ saved_path = nd->path;
+ path_get(&saved_path);
+
+ err = __lookup_union(nd, name, topmost);
+
+ nd->path = saved_path;
+ path_put(&saved_path);
+
+ return err;
+}
+
+/*
+ * do_lookup_union - union mount-aware part of do_lookup
+ *
+ * do_lookup()-style wrapper for lookup_union(). Follows mounts.
+ */
+
+static int do_lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct dentry *parent = nd->path.dentry;
+ struct inode *dir = parent->d_inode;
+ int err;
+
+ mutex_lock(&dir->i_mutex);
+ err = lookup_union(nd, name, topmost);
+ mutex_unlock(&dir->i_mutex);
+
+ __follow_mount(topmost);
+
+ return err;
+}
+
/*
* It's more convoluted than I'd like it to be, but... it's still fairly
* small and for now I'd prefer to have fast path as straight as possible.
@@ -753,6 +911,11 @@ done:
path->mnt = mnt;
path->dentry = dentry;
__follow_mount(path);
+ if (needs_lookup_union(&nd->path, path)) {
+ int err = do_lookup_union(nd, name, path);
+ if (err < 0)
+ return err;
+ }
return 0;

need_lookup:
@@ -1224,8 +1387,13 @@ static int lookup_hash(struct nameidata *nd, struct qstr *name,
err = PTR_ERR(path->dentry);
path->dentry = NULL;
path->mnt = NULL;
+ return err;
}
+
+ if (needs_lookup_union(&nd->path, path))
+ err = lookup_union(nd, name, path);
return err;
+
}

static int __lookup_one_len(const char *name, struct qstr *this,
@@ -2889,7 +3057,11 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EXDEV;
if (oldnd.path.mnt != newnd.path.mnt)
goto exit2;
-
+ /* Rename on union mounts not implemented yet */
+ /* XXX much harsher check than necessary - can do some renames */
+ if (IS_DIR_UNIONED(oldnd.path.dentry) ||
+ IS_DIR_UNIONED(newnd.path.dentry))
+ goto exit2;
old_dir = oldnd.path.dentry;
error = -EBUSY;
if (oldnd.last_type != LAST_NORM)
diff --git a/fs/union.c b/fs/union.c
index 02abb7c..c089c02 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -21,6 +21,7 @@
#include <linux/mount.h>
#include <linux/fs_struct.h>
#include <linux/slab.h>
+#include <linux/namei.h>

#include "union.h"

@@ -117,3 +118,96 @@ void d_free_unions(struct dentry *dentry)
}
dentry->d_union_dir = NULL;
}
+
+/**
+ * needs_lookup_union - Avoid union lookup when not necessary
+ *
+ * @parent_path: path of the parent directory
+ * @path: path of the lookup target
+ *
+ * Check to see if the target needs union lookup. Two cases need
+ * union lookup: the target is a directory, or the target is a
+ * negative dentry.
+ *
+ * Returns 0 if this dentry is definitely not unioned. Returns 1 if
+ * it is possible this dentry is unioned.
+ */
+
+int needs_lookup_union(struct path *parent_path, struct path *path)
+{
+ /*
+ * If the target is the root of the mount, then its union
+ * stack was already created at mount time (if this is a union
+ * mount).
+ */
+ if (IS_ROOT(path->dentry))
+ return 0;
+
+ /* Only dentries in a unioned directory need a union lookup. */
+ if (!IS_DIR_UNIONED(parent_path->dentry))
+ return 0;
+
+ /* Whiteouts cover up everything below */
+ if (d_is_whiteout(path->dentry))
+ return 0;
+
+ /* Opaque dirs cover everything below, unless this is a fallthru */
+ if (IS_OPAQUE(parent_path->dentry->d_inode) &&
+ !d_is_fallthru(path->dentry))
+ return 0;
+
+ /*
+ * XXX Negative dentries in unioned directories must always go
+ * through a full union lookup because there might be a
+ * matching entry below it. To improve performance, we should
+ * mark negative dentries in some way to show they have
+ * already been looked up in the union and nothing was found.
+ * Maybe mark it opaque?
+ */
+ if (!path->dentry->d_inode)
+ return 1;
+
+ /*
+ * If it's not a directory and it's a positive dentry, then we
+ * already have the topmost dentry and we don't need to do any
+ * lookup in lower layers.
+ */
+
+ if (!S_ISDIR(path->dentry->d_inode->i_mode))
+ return 0;
+
+ /* Is the union stack already constructed? */
+ if (IS_DIR_UNIONED(path->dentry))
+ return 0;
+
+ /*
+ * XXX This is like the negative dentry case. This directory
+ * may have no matching directories in the lower layers, or
+ * this may just be the first time we looked it up. We can't
+ * tell the difference.
+ */
+ return 1;
+}
+
+/*
+ * union_create_topmost_dir - Create a matching dir in the topmost file system
+ */
+
+int union_create_topmost_dir(struct path *parent, struct qstr *name,
+ struct path *topmost, struct path *lower)
+{
+ int mode = lower->dentry->d_inode->i_mode;
+ int res;
+
+ BUG_ON(topmost->dentry->d_inode);
+
+ res = mnt_want_write(parent->mnt);
+ if (res)
+ return res;
+
+ res = vfs_mkdir(parent->dentry->d_inode, topmost->dentry, mode);
+
+ mnt_drop_write(parent->mnt);
+
+ return res;
+}
diff --git a/fs/union.h b/fs/union.h
index 04efc1f..505f132 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -51,15 +51,22 @@ struct union_dir {
};

#define IS_MNT_UNION(mnt) ((mnt)->mnt_flags & MNT_UNION)
+#define IS_DIR_UNIONED(dentry) ((dentry)->d_union_dir)

extern int union_add_dir(struct path *, struct path *, struct union_dir **);
extern void d_free_unions(struct dentry *);
+int needs_lookup_union(struct path *, struct path *);
+int union_create_topmost_dir(struct path *, struct qstr *, struct path *,
+ struct path *);

#else /* CONFIG_UNION_MOUNT */

#define IS_MNT_UNION(x) (0)
+#define IS_DIR_UNIONED(x) (0)
#define union_add_dir(x, y, z) ({ BUG(); (NULL); })
#define d_free_unions(x) do { } while (0)
+#define needs_lookup_union(x, y) ({ (0); })
+#define union_create_topmost_dir(w, x, y, z) ({ BUG(); (NULL); })

#endif /* CONFIG_UNION_MOUNT */
#endif /* __KERNEL__ */
--
1.6.3.3

2010-08-08 16:00:45

by Valerie Aurora

Subject: [PATCH 20/39] union-mount: Call do_whiteout() on unlink and rmdir in unions

From: Jan Blunck <[email protected]>

Call do_whiteout() when removing files and directories from a union
mounted file system.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 0821544..2d30a5b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2596,6 +2596,10 @@ static long do_rmdir(int dfd, const char __user *pathname)
error = security_path_rmdir(&nd.path, path.dentry);
if (error)
goto exit4;
+ if (IS_DIR_UNIONED(nd.path.dentry)) {
+ error = do_whiteout(&nd, &path, 1);
+ goto exit4;
+ }
error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
exit4:
mnt_drop_write(nd.path.mnt);
@@ -2685,6 +2689,10 @@ static long do_unlinkat(int dfd, const char __user *pathname)
error = security_path_unlink(&nd.path, path.dentry);
if (error)
goto exit3;
+ if (IS_DIR_UNIONED(nd.path.dentry)) {
+ error = do_whiteout(&nd, &path, 0);
+ goto exit3;
+ }
error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
exit3:
mnt_drop_write(nd.path.mnt);
--
1.6.3.3

2010-08-08 16:01:00

by Valerie Aurora

Subject: [PATCH 17/39] union-mount: Free union dirs on removal from dcache

If a dentry is removed from dentry cache because its usage count drops
to zero, the union_dirs in its union stack are freed too.

Signed-off-by: Valerie Aurora <[email protected]>
---
fs/dcache.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 456030d..ed7f15a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -34,6 +34,7 @@
#include <linux/fs_struct.h>
#include <linux/hardirq.h>
#include "internal.h"
+#include "union.h"

int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
@@ -175,6 +176,7 @@ static struct dentry *d_kill(struct dentry *dentry)
dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
+ d_free_unions(dentry);
if (IS_ROOT(dentry))
parent = NULL;
else
@@ -695,6 +697,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
iput(inode);
}

+ d_free_unions(dentry);
d_free(dentry);

/* finished when we fall off the top of the tree,
@@ -1535,6 +1538,7 @@ void d_delete(struct dentry * dentry)
if (atomic_read(&dentry->d_count) == 1) {
dentry->d_flags &= ~DCACHE_CANT_MOUNT;
dentry_iput(dentry);
+ d_free_unions(dentry);
fsnotify_nameremove(dentry, isdir);
return;
}
@@ -1545,6 +1549,13 @@ void d_delete(struct dentry * dentry)
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);

+ /*
+ * Remove any associated unions. While someone still has this
+ * directory open (ref count > 0), we could not have deleted
+ * it unless it was empty, and therefore has no references to
+ * directories below it. So we don't need the unions.
+ */
+ d_free_unions(dentry);
fsnotify_nameremove(dentry, isdir);
}
EXPORT_SYMBOL(d_delete);
--
1.6.3.3

2010-08-08 15:54:56

by Valerie Aurora

Subject: [PATCH 06/39] whiteout: Add vfs_whiteout() and whiteout inode operation

From: Jan Blunck <[email protected]>

Whiteout a given directory entry. File systems that support whiteouts
must implement the new ->whiteout() directory inode operation.

XXX - Only whiteout when there is a matching entry in a lower layer.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
Documentation/filesystems/vfs.txt | 10 +++++-
fs/dcache.c | 4 ++-
fs/namei.c | 73 ++++++++++++++++++++++++++++++++++++-
include/linux/dcache.h | 7 ++++
include/linux/fs.h | 2 +
5 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 94677e7..964e0fc 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -308,7 +308,7 @@ struct inode_operations
-----------------------

This describes how the VFS can manipulate an inode in your
-filesystem. As of kernel 2.6.22, the following members are defined:
+filesystem. As of kernel 2.6.34, the following members are defined:

struct inode_operations {
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
@@ -319,6 +319,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
@@ -382,6 +383,13 @@ otherwise noted.
will probably need to call d_instantiate() just as you would
in the create() method

+ whiteout: called by the rmdir(2) and unlink(2) system calls on a
+ layered file system. Only required if you want to support
+ whiteouts. The first dentry passed in is the dentry of the
+ original entry: positive if it exists, negative otherwise. The
+ second is the dentry for the whiteout itself. This method
+ must unlink() or rmdir() the original entry if it exists.
+
rename: called by the rename(2) system call to rename the object to
have the parent and name given by the second inode and dentry.

diff --git a/fs/dcache.c b/fs/dcache.c
index 86d4db1..80f059b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -991,8 +991,10 @@ EXPORT_SYMBOL(d_alloc_name);
/* the caller must hold dcache_lock */
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
- if (inode)
+ if (inode) {
+ dentry->d_flags &= ~DCACHE_WHITEOUT;
list_add(&dentry->d_alias, &inode->i_dentry);
+ }
dentry->d_inode = inode;
fsnotify_d_instantiate(dentry, inode);
}
diff --git a/fs/namei.c b/fs/namei.c
index 7552e61..665d394 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1357,7 +1357,6 @@ static int may_delete(struct inode *dir,struct dentry *victim,int isdir)
if (!victim->d_inode)
return -ENOENT;

- BUG_ON(victim->d_parent->d_inode != dir);
audit_inode_child(victim, dir);

error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
@@ -2169,6 +2168,78 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
return sys_mkdirat(AT_FDCWD, pathname, mode);
}

+/**
+ * vfs_whiteout: create a whiteout for the given directory entry
+ * @dir: parent inode
+ * @old_dentry: directory entry to whiteout
+ *
+ * Create a whiteout for the given directory entry. A whiteout
+ * prevents lookup from dropping down to a lower layer of a union
+ * mounted file system.
+ *
+ * There are two important cases: (a) The directory entry to be
+ * whited-out may already exist, in which case it must first be
+ * deleted before we create the whiteout, and (b) no such directory
+ * entry exists and we only have to create the whiteout itself.
+ *
+ * The caller must pass in a dentry for the directory entry to be
+ * whited-out - a positive one if it exists, and a negative if not.
+ * When this function returns, the caller should dput() the old, now
+ * defunct dentry it passed in. The dentry for the whiteout itself is
+ * created inside this function.
+ */
+static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
+{
+ struct inode *old_inode = old_dentry->d_inode;
+ struct dentry *parent, *whiteout;
+ int err = 0;
+
+ BUG_ON(old_dentry->d_parent->d_inode != dir);
+
+ if (!dir->i_op || !dir->i_op->whiteout)
+ return -EOPNOTSUPP;
+
+ /*
+ * If the old dentry is positive, then we have to delete this
+ * entry before we create the whiteout. The file system
+ * ->whiteout() op does the actual delete, but we do all the
+ * VFS-level checks and changes here.
+ */
+ if (old_inode) {
+ mutex_lock(&old_inode->i_mutex);
+ if (d_mountpoint(old_dentry)) {
+ mutex_unlock(&old_inode->i_mutex);
+ return -EBUSY;
+ }
+ if (isdir) {
+ dentry_unhash(old_dentry);
+ err = security_inode_rmdir(dir, old_dentry);
+ } else {
+ err = security_inode_unlink(dir, old_dentry);
+ }
+ }
+
+ parent = dget_parent(old_dentry);
+ whiteout = d_alloc_name(parent, old_dentry->d_name.name);
+
+ if (!err)
+ err = dir->i_op->whiteout(dir, old_dentry, whiteout);
+
+ if (old_inode) {
+ mutex_unlock(&old_inode->i_mutex);
+ if (!err) {
+ fsnotify_link_count(old_inode);
+ d_delete(old_dentry);
+ }
+ if (isdir)
+ dput(old_dentry);
+ }
+
+ dput(whiteout);
+ dput(parent);
+ return err;
+}
+
/*
* We try to drop the dentry early: we should have
* a usage count of 2 if we're the only user of this
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index eebb617..7d650a2 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -188,6 +188,8 @@ d_iput: no no no yes

#define DCACHE_CANT_MOUNT 0x0100

+#define DCACHE_WHITEOUT 0x0200 /* Stop lookup in a unioned file system */
+
extern spinlock_t dcache_lock;
extern seqlock_t rename_lock;

@@ -372,6 +374,11 @@ static inline void dont_mount(struct dentry *dentry)
spin_unlock(&dentry->d_lock);
}

+static inline int d_is_whiteout(struct dentry *dentry)
+{
+ return (dentry->d_flags & DCACHE_WHITEOUT);
+}
+
static inline struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index eeb49d7..1f80897 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -209,6 +209,7 @@ struct inodes_stat_t {
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_WHITEOUT (1<<25) /* FS supports whiteout filetype */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

@@ -1527,6 +1528,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
--
1.6.3.3
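
Reader's note: this patch marks a whiteout as a flag on a negative dentry and clears the flag in __d_instantiate() once an inode is attached. The user-space sketch below models only that flag handling. struct model_dentry and the model_* functions are hypothetical. The flag value is copied from the patch, but unlike the patch's d_is_whiteout() this model normalises the result to 0 or 1.

```c
#include <assert.h>
#include <stddef.h>

/* Same value as DCACHE_WHITEOUT in the patch; the name is local to
 * this sketch. */
#define MODEL_DCACHE_WHITEOUT 0x0200

/* Hypothetical stand-in for struct dentry: flags plus an opaque inode
 * pointer, where NULL means a negative dentry. */
struct model_dentry {
	unsigned int d_flags;
	const void *d_inode;
};

/* Returns 1 if the dentry is marked as a whiteout, 0 otherwise. */
int model_d_is_whiteout(const struct model_dentry *dentry)
{
	assert(dentry != NULL);
	return (dentry->d_flags & MODEL_DCACHE_WHITEOUT) != 0;
}

/* Attach an inode. A dentry that becomes positive can no longer be a
 * whiteout, so the flag is cleared; attaching NULL leaves it alone. */
void model_d_instantiate(struct model_dentry *dentry, const void *inode)
{
	assert(dentry != NULL);
	if (inode)
		dentry->d_flags &= ~MODEL_DCACHE_WHITEOUT;
	dentry->d_inode = inode;
}
```

Clearing the flag at instantiation time keeps the invariant that only negative dentries are ever reported as whiteouts.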

2010-08-08 15:54:53

by Valerie Aurora

Subject: [PATCH 08/39] whiteout: Allow removal of a directory with whiteouts

From: Jan Blunck <[email protected]>

do_whiteout() allows removal of a directory when it has whiteouts but
is logically empty.

XXX - This patch abuses readdir() to check if the union directory is
logically empty - that is, all the entries are whiteouts (or "." or
".."). Currently, we have no clean VFS interface to ask the lower
file system if a directory is empty.

Fixes:
- Add ->is_directory_empty() op
- Add is_directory_empty flag to dentry (ugly dcache populate)
- Ask underlying fs to remove it and look for an error return
- (your idea here)

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
---
fs/namei.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index cd8b0d0..0b6378e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2250,6 +2250,90 @@ static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
}

/*
+ * XXX - We are abusing readdir to check if a union directory is
+ * logically empty.
+ */
+static int filldir_is_empty(void *__buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ int *is_empty = (int *)__buf;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ if (d_type == DT_WHT)
+ return 0;
+
+ (*is_empty) = 0;
+ return 0;
+}
+
+static int directory_is_empty(struct path *path)
+{
+ struct file *file;
+ int err;
+ int is_empty = 1;
+
+ BUG_ON(!S_ISDIR(path->dentry->d_inode->i_mode));
+
+ /* references for the file pointer */
+ path_get(path);
+
+ file = dentry_open(path->dentry, path->mnt, O_RDONLY, current_cred());
+ if (IS_ERR(file))
+ return 0;
+
+ err = vfs_readdir(file, filldir_is_empty, &is_empty);
+
+ fput(file);
+ return is_empty;
+}
+
+static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
+{
+ struct path safe = nd->path;
+ struct dentry *dentry = path->dentry;
+ int err;
+
+ path_get(&safe);
+
+ err = may_delete(nd->path.dentry->d_inode, dentry, isdir);
+ if (err)
+ goto out;
+
+ err = -ENOTEMPTY;
+ if (isdir && !directory_is_empty(path))
+ goto out;
+
+ if (nd->path.dentry != dentry->d_parent) {
+ dentry = __lookup_hash(&path->dentry->d_name, nd->path.dentry,
+ nd);
+ err = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out;
+
+ dput(path->dentry);
+ if (path->mnt != safe.mnt)
+ mntput(path->mnt);
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ }
+
+ err = vfs_whiteout(nd->path.dentry->d_inode, dentry, isdir);
+
+out:
+ path_put(&safe);
+ return err;
+}
+
+/*
* We try to drop the dentry early: we should have
* a usage count of 2 if we're the only user of this
* dentry, and if that is true (possibly after pruning
--
1.6.3.3
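
Reader's note: filldir_is_empty() treats a directory as logically empty when every entry is ".", "..", or a whiteout. The user-space sketch below restates that rule over an in-memory array instead of a readdir callback. struct model_dirent and the model_* names are hypothetical; the MODEL_DT_* constants are assumed to match the usual DT_* values from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed to match DT_DIR, DT_REG and DT_WHT; names local to this sketch. */
#define MODEL_DT_DIR 4
#define MODEL_DT_REG 8
#define MODEL_DT_WHT 14

/* Hypothetical stand-in for one directory entry as seen by filldir. */
struct model_dirent {
	const char *name;
	unsigned int d_type;
};

/* Returns 1 for exactly "." or "..", 0 for anything else. */
int model_is_dot_or_dotdot(const char *name, size_t namlen)
{
	if (namlen == 1)
		return name[0] == '.';
	if (namlen == 2)
		return name[0] == '.' && name[1] == '.';
	return 0;
}

/* Returns 1 if every entry is ".", "..", or a whiteout; 0 otherwise. */
int model_directory_is_empty(const struct model_dirent *ents, size_t n)
{
	size_t i;

	assert(n == 0 || ents != NULL);
	for (i = 0; i < n; i++) {
		if (model_is_dot_or_dotdot(ents[i].name, strlen(ents[i].name)))
			continue;
		if (ents[i].d_type == MODEL_DT_WHT)
			continue;
		return 0;	/* a real entry: not logically empty */
	}
	return 1;
}
```

The explicit length checks give the same answers as the switch fall-through in the patch: names such as ".x" or "x." are not mistaken for "..".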

2010-08-08 16:01:34

by Valerie Aurora

Subject: [PATCH 14/39] union-mount: Union mounts documentation

Document design and implementation of union mounts (a.k.a. writable
overlays).

Signed-off-by: Valerie Aurora <[email protected]>
---
Documentation/filesystems/union-mounts.txt | 752 ++++++++++++++++++++++++++++
1 files changed, 752 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..977a2b5
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,752 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one or more
+read-only file systems, with all writes going to the writable file
+system. The namespace of both file systems appears as a combined
+whole to userland, with files and directories on the writable file
+system covering up any files or directories with matching pathnames on
+the read-only file system. The read-write file system is the
+"topmost" or "upper" file system and the read-only file systems are
+the "lower" file systems. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts). However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level. COW block devices only
+increase their divergence as time goes on, and a fully coherent
+writable file system is unnecessary synchronization overhead if no
+other client needs to see the writes.
+
+What union mounts are not
+-------------------------
+
+Union mounts are not a general-purpose unioning file system. They do
+not provide a generic "union of namespaces" operation for an arbitrary
+number of file systems. Many interesting features can be implemented
+with a generic unioning facility: dynamic insertion and removal of
+branches, write policies based on space available, online upgrade,
+etc. Some unioning file systems that do this are UnionFS and AUFS.
+
+Terminology
+===========
+
+The main physical metaphor for union mounts is that a writable file
+system is mounted "on top" of a read-only file system. Lookups start
+at the "topmost" read-write file system and travel "down" to the
+"bottom" read-only file system only if no blocking entry exists on the
+top layer.
+
+Topmost layer: The read-write file system. Lookups begin here.
+
+Bottom layer: The read-only file system. Lookups end here.
+
+Path: Combination of the vfsmount and dentry structure.
+
+Follow down: Given a path from the top layer, find the corresponding
+path on the bottom layer.
+
+Follow up: Given a path from the bottom layer, find the corresponding
+path on the top layer.
+
+Whiteout: A directory entry in the top layer that prevents lookups
+from travelling down to the bottom layer. Created on unlink()/rmdir()
+if a corresponding directory entry exists in the bottom layer.
+
+Opaque flag: A flag on a directory in the top layer that prevents
+lookups of entries in this directory from travelling down to the
+bottom layer (unless there is an explicit fallthru entry allowing that
+for a particular entry). Set on creation of a directory that replaces
+a whiteout, and after a directory copyup.
+
+Fallthru: A directory entry which allows lookups to "fall through" to
+the bottom layer for that exact directory entry. This serves as a
+placeholder for directory entries from the bottom layer during
+readdir(). Fallthrus override opaque flags.
+
+File copyup: Create a file on the top layer that has the same metadata
+and contents as the file with the same pathname on the bottom layer.
+
+Directory copyup: Copy up the visible directory entries from the
+bottom layer as fallthrus in the matching top layer directory. Mark
+the directory opaque to avoid unnecessary negative lookups on the
+bottom layer.
+
+Examples
+========
+
+What happens when I...
+
+- creat() /newfile -> creates on topmost layer
+- unlink() /oldfile -> creates a whiteout on topmost layer
+- Edit /existingfile -> copies up to top layer at open(O_WR) time
+- truncate /existingfile -> copies up to topmost layer + N bytes if specified
+- touch()/chmod()/chown()/etc. -> copies up to topmost layer
+- mkdir() /newdir -> creates on topmost layer
+- rmdir() /olddir -> creates a whiteout on topmost layer
+- mkdir() /olddir after above -> creates on topmost layer w/ opaque flag
+- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
+- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on topmost layer
+- symlink() /oldfile /symlink -> nothing special
+- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
+- rename() /olddir /newdir -> EXDEV
+- rename() /topmost_only_dir /topmost_only_dir2 -> success
+
+Getting to a root file system with union mounts:
+
+- Mount the base read-only file system as the root file system
+- Mount the read-only file system again on /newroot
+- Mount the read-write layer on /newroot:
+ # mount -o union /dev/sda /newroot
+- pivot_root to /newroot
+- Start init
+
+See scripts/pivot.sh in the UML devkit linked to from:
+
+http://valerieaurora.org/union/
+
+VFS implementation
+==================
+
+Union mounts are implemented as an integral part of the VFS, rather
+than as a VFS client file system (i.e., a stacked file system like
+unionfs or ecryptfs). Implementing unioning inside the VFS eliminates
+the need for duplicate copies of VFS data structures, unnecessary
+indirection, and code duplication, but requires very maintainable,
+low-to-zero overhead code. Union mounts require no change to file
+systems serving as the read-only layer, and require some minor
+support from file systems serving as the read-write layer. File
+systems that want to be the writable layer must implement the new
+->whiteout() and ->fallthru() inode operations, which create special
+dummy directory entries.
+
+The union mounts code must accomplish the following major tasks:
+
+1) Pass lookups through to the lower level file system.
+2) Copy files and directories up to the topmost layer when written.
+3) Create whiteouts and fallthrus as necessary.
+
+VFS objects and union mounts
+----------------------------
+
+First, some VFS basics:
+
+The VFS allows multiple mounts of the same file system. For example,
+/dev/sda can be mounted at /usr and also at /mnt. The same file
+system can be mounted read-only at one point and read-write at
+another. Each of these mounts has its own vfsmount data structure in
+the kernel. However, each underlying file system has exactly one
+in-kernel superblock structure no matter how many times it is mounted.
+All the separate vfsmounts for the same file system reference the same
+superblock data structure.
+
+Directory entries are cached by the VFS in dentry structures. The VFS
+keeps one dentry structure for each file or directory in a file
+system, no matter how many times it is mounted. Each dentry
+represents only one element of a path name. When the VFS looks up a
+pathname (e.g., "/sbin/init"), the result is a combination of vfsmount
+and dentry. This <mnt,dentry> pair is usually stored in a kernel
+structure named "path", which is simply two pointers, one to the
+vfsmount and one to the dentry. A "struct path" is this structure; a
+pathname is a string like "/etc/fstab".
+
+In union mounts, a file system can only be the topmost layer for one
+union mount. A file system can be part of multiple union mounts if it
+is a read-only layer. So dentries in the read-only layers can be part
+of multiple unions, while a dentry in the read-write layer can only be
+part of one union.
+
+union_dir structure
+-------------------
+
+The first job of union mounts is to map directories from the topmost
+layer to directories with the same pathname in the lower layer. That
+is, given the <mnt,dentry> pair for a directory pathname in the
+topmost layer, we need to find all the <mnt,dentry> pairs for the
+directory with the same pathname in the lower layer. We do this with
+a singly linked list rooted in the dentry from the topmost layer. The
+linked list is the union_dir structure:
+
+/*
+ * The union_dir structure. Basically just a singly-linked list with
+ * a pointer to the referenced dentry, whose head is d_union_dir in
+ * the dentry of the topmost directory. We can't link this list
+ * purely through list elements in the dentry because lower layer
+ * dentries can be part of multiple union stacks. However, the
+ * topmost dentry is only part of one union stack. So we point at the
+ * lower layer dentries through a linked list rooted in the topmost
+ * dentry.
+ */
+struct union_dir {
+ struct path u_this; /* this is me */
+ struct union_dir *u_lower; /* this is what I overlay */
+};
+
+This structure is flexible enough to support an arbitrary number of
+layers of unioned file systems. (The current code is tested only with
+two layers but should allow more layers.) Since there can be more than
+two layers, this section will talk about mapping "upper" directories
+to "lower" directories, instead of "topmost" directories to "bottom"
+directories.
+
+At the time of a union mount, we allocate a union_dir structure to link
+the root directory of the upper layer to the root directory of the
+lower layer and put the pointer to it in the d_union_dir field of
+struct dentry:
+
+struct dentry {
+[...]
+#ifdef CONFIG_UNION_MOUNT
+ struct union_dir *d_union_dir; /* head of union stack */
+#endif
+
+
+Traversing the union stack
+--------------------------
+
+The set of union_dir structures referring to a particular pathname is
+collectively called the union stack for that directory. Only lookup
+needs to traverse the union stack - walk down the list of paths
+beginning with the topmost. This is open-coded:
+
+static int __lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+[...]
+ /* next_ud is the tail of the list of union dirs for this dentry */
+ struct union_dir **next_ud = &topmost->dentry->d_union_dir;
+[...]
+ /* Go through each dir underlying the parent, looking for a match */
+ for (ud = nd->path.dentry->d_union_dir; ud != NULL; ud = ud->u_lower) {
+[...]
+ next_ud = &(*next_ud)->u_lower;
+ }
+}
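
To experiment with the list shape outside the kernel, here is a minimal user-space sketch of a union stack. It is not the kernel code: struct path_model, union_stack_append(), and union_stack_depth() are hypothetical names introduced for illustration, and only the u_this/u_lower field names come from the structure above. The append helper uses the same pointer-to-pointer idiom as next_ud to find the tail.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for struct path; the real one holds vfsmount and dentry pointers. */
struct path_model {
	const char *mnt;	/* hypothetical: label for the mount */
	const char *dentry;	/* hypothetical: label for the directory */
};

struct union_dir {
	struct path_model u_this;	/* this is me */
	struct union_dir *u_lower;	/* this is what I overlay */
};

/*
 * Append a lower directory at the end of the stack rooted at *head.
 * Returns 0 on success, -1 if allocation fails.
 */
int union_stack_append(struct union_dir **head, struct path_model lower)
{
	struct union_dir **next_ud = head;
	struct union_dir *ud;

	/* Walk to the tail: next_ud ends up pointing at the NULL link. */
	while (*next_ud != NULL)
		next_ud = &(*next_ud)->u_lower;

	ud = malloc(sizeof(*ud));
	if (ud == NULL)
		return -1;
	ud->u_this = lower;
	ud->u_lower = NULL;
	*next_ud = ud;
	return 0;
}

/* Count the lower directories reachable from the topmost directory. */
size_t union_stack_depth(const struct union_dir *head)
{
	size_t depth = 0;

	for (; head != NULL; head = head->u_lower)
		depth++;
	return depth;
}
```

Because the list is rooted in the topmost dentry and only points downward, a lower directory can appear in several stacks without carrying any per-stack state itself.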
+
+Code paths
+----------
+
+Union mounts modify the following key code paths in the VFS:
+
+- mount()/umount()
+- Pathname lookup
+- Any path that modifies an existing file
+
+Mount
+-----
+
+Union mounts are created in two steps:
+
+1. Mount the read-only layer file systems read-only in the usual
+manner, all on the same mountpoint. Submounts are permitted as long
+as they are also read-only and not shared (part of a mount propagation
+group).
+
+2. Mount the top layer with the "-o union" option at the same
+mountpoint. All read-only file systems mounted at this mountpoint
+will be included in the union mount.
+
+The bottom layers must be read-only and the top layer must be
+read-write and support whiteouts and fallthrus. A file system that
+supports whiteouts and fallthrus indicates this by setting the
+MS_WHITEOUT flag in the superblock. Currently, the top layer is
+forced to "noatime" to avoid a copyup on every access of a file.
+Supporting atime with the current infrastructure would require a
+copyup on every open(). The "relatime" option would be equally
+efficient if the atime is the same or more recent than the mtime/ctime
+for every object on the read-only file system, and if the 24-hour
+timeout on relatime was disabled. However, this is probably not
+worthwhile for the majority of union mount use cases.
+
+File systems can only be union mounted at their root directories.
+Without this restriction, some VFS operations must always do a
+union_lookup() - requiring a global lock - in order to find out if a
+path is potentially unioned. With this restriction, we can tell if a
+path is potentially unioned by checking a flag in the vfsmount.
+
+pivot_root() to a union mounted file system is supported. The
+recommended way to get to a union mounted root file system is to boot
+with the read-only mount as the root file system, construct the union
+mount on an entirely new mount, and pivot_root() to the new union
+mount root. Attempting to union mount the root file system later in
+boot will result in covering other file systems, e.g., /proc, which
+isn't permitted in the current code and is a bad idea anyway.
+
+Hard read-only file systems
+---------------------------
+
+Union mounts require the lower layer of the file system to be
+read-only. However, in Linux, any individual file system may be
+mounted at multiple places in the namespace, and a file system can be
+changed from read-only to read-write while still mounted. Thus, simply
+checking that the bottom layer is read-only at the time the writable
+overlay is mounted over it is pointless, since at any time the bottom
+layer may become read-write.
+
+We have to guarantee that a file system will be read-only for as long
+as it is the bottom layer of a union mount. To do this, we track the
+number of hard read-only users of a file system in its VFS superblock
+structure. When we union mount a writable overlay over a file system,
+we increment its read-only user count. The file system can only be
+mounted read-write if its read-only users count is zero.
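
The counting rule above can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not the patch's implementation: struct sb_model, the hard_ro_users field name, the three helper functions, and the choice of -EBUSY as the error code are all hypothetical.

```c
#include <errno.h>

/* Hypothetical stand-in for the relevant part of a superblock. */
struct sb_model {
	int read_only;		/* nonzero while the file system is read-only */
	int hard_ro_users;	/* assumed name for the read-only users count */
};

/* Called when a writable overlay is union mounted over this file system. */
int sb_model_pin_read_only(struct sb_model *sb)
{
	if (!sb->read_only)
		return -EBUSY;	/* error code chosen for the sketch */
	sb->hard_ro_users++;
	return 0;
}

/* Called when the union mount goes away. */
void sb_model_unpin_read_only(struct sb_model *sb)
{
	if (sb->hard_ro_users > 0)
		sb->hard_ro_users--;
}

/* A read-write (re)mount is refused while any union still depends on us. */
int sb_model_remount_rw(struct sb_model *sb)
{
	if (sb->hard_ro_users > 0)
		return -EBUSY;
	sb->read_only = 0;
	return 0;
}
```

The point of keeping the count in the superblock rather than in a vfsmount is that every mount of the same file system shares it, so a remount through any mountpoint sees the same restriction.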
+
+Todo:
+
+- Support hard read-only NFS mounts. See discussion here:
+
+ http://markmail.org/message/3mkgnvo4pswxd7lp
+
+Pathname lookup
+---------------
+
+Pathname lookup in a unioned directory traverses down the union stack
+for the parent directory, looking up each pathname element in each
+layer of the file system (according to the rules of whiteouts,
+fallthrus, and opaque flags). At mount time, the union stack for the
+root directory of the file system is created, and the union stack
+creation for every other unioned directory in the file system is
+boot-strapped using the already-existing union stack of the
+directory's parent. In order to simplify the code greatly, every
+visible directory on the lower file system is required to have a
+matching directory on the upper file system. This matching directory
+is created during pathname lookup if it does not already exist.
+Therefore, each unioned directory is the child of another unioned
+directory (or is the root directory of the file system).
+
+The actual union lookup function is called in the following code
+paths:
+
+do_lookup()->do_union_lookup()->lookup_union()->__lookup_union()
+lookup_hash()->lookup_union()->__lookup_union()
+
+__lookup_union() is where the rules of whiteouts, fallthrus, and
+opaque flags are actually implemented. __lookup_union() returns
+either the first visible dentry, or a negative dentry from the topmost
+file system if no matching dentry exists. If it finds a directory, it
+looks up any potential matching lower layer directories. If it finds
+a lower layer directory, it first creates the topmost dir if necessary
+via union_create_topmost_dir(), and then calls union_add_dir() to
+append the lower directory to the end of the union stack.
+
+Note that not all directories in a union mount are unioned, only those
+with matching directories on the lower layer. The macro
+IS_DIR_UNIONED() is a cheap, constant time way to check if a directory
+is unioned, while IS_MNT_UNION() checks if the entire mount is unioned
+(and therefore whether the directory in question is potentially
+unioned).
+
+Currently, lookup of a negative dentry in a unioned directory requires
+a lookup in every directory in the union stack every time it is looked
+up. We could avoid subsequent lookups by adding a negative union
+cache entry, exactly the way negative dentries are cached.
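
The interaction of whiteouts, fallthrus, and opaque flags during lookup can be illustrated with a small user-space model. This is a sketch, not __lookup_union(): the types and union_lookup_model() are hypothetical, directories are plain arrays, and for simplicity the rules are applied uniformly to every layer even though whiteouts, fallthrus, and opaque flags only occur on the topmost layer.

```c
#include <stddef.h>
#include <string.h>

enum entry_kind { ENT_REGULAR, ENT_WHITEOUT, ENT_FALLTHRU };

struct entry_model {
	const char *name;
	enum entry_kind kind;
};

struct dir_model {
	int opaque;				/* opaque flag on this directory */
	const struct entry_model *entries;
	size_t nr_entries;
};

static const struct entry_model *find_entry(const struct dir_model *dir,
					    const char *name)
{
	for (size_t i = 0; i < dir->nr_entries; i++)
		if (strcmp(dir->entries[i].name, name) == 0)
			return &dir->entries[i];
	return NULL;
}

/*
 * Walk the layers from topmost (index 0) downward.  Returns the index
 * of the layer that provides the name, or -1 if the name is not visible.
 */
int union_lookup_model(const struct dir_model *layers, size_t nr_layers,
		       const char *name)
{
	for (size_t i = 0; i < nr_layers; i++) {
		const struct entry_model *e = find_entry(&layers[i], name);

		if (e != NULL) {
			if (e->kind == ENT_REGULAR)
				return (int)i;	/* first visible entry wins */
			if (e->kind == ENT_WHITEOUT)
				return -1;	/* whiteout blocks the lookup */
			continue;		/* fallthru overrides opaque */
		}
		if (layers[i].opaque)
			return -1;		/* opaque dir, no fallthru */
	}
	return -1;
}
```

In this model a directory that has been copied up (opaque, with one fallthru per lower entry) answers lookups for unknown names without consulting the lower layer, which is the negative-lookup saving the terminology section mentions.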
+
+File copyup
+-----------
+
+Any system call that alters the data or metadata of a file on the
+bottom layer, or creates or changes a hard link to it, will trigger a
+copyup of the target file from the lower layer to the topmost layer:
+
+ - open(O_WRONLY | O_RDWR | O_APPEND)
+ - truncate()/open(O_TRUNC)
+ - link()
+ - rename()
+ - chmod()
+ - chown()/lchown()
+ - utimes()
+ - setxattr()/lsetxattr()
+
+Copyup of a file due to open(O_WRONLY) has already occurred when:
+
+ - write()
+ - ftruncate()
+ - writable mmap()
+
+The following system calls will fail on an fd opened O_RDONLY:
+
+ - fchmod()
+ - fchown()
+ - fsetxattr()
+ - futimensat()
+
+Contrary to common sense, the above system calls are defined to
+succeed on O_RDONLY fds. The idea seems to be that the
+O_RDONLY/O_RDWR/O_WRONLY flags only apply to the actual file data, not
+to any form of metadata (times, owner, mode, or even extended
+attributes). Applications making these system calls on O_RDONLY fds
+are correct according to the standard and work on non-union-mounts.
+They will need to be rewritten (O_RDONLY -> O_RDWR) to work on union
+mounts. We suspect this usage is uncommon.
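
A user-space sketch of the rewrite suggested above: open the file O_RDWR before calling fchmod(), so that the copyup described earlier happens at open time. chmod_via_fd() is a hypothetical helper name. Note that this requires write permission on the file, which an O_RDONLY open would not.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Change the mode of a file through a file descriptor.
 * Returns 0 on success, -1 with errno set on failure.
 */
int chmod_via_fd(const char *pathname, mode_t mode)
{
	int fd, ret, saved_errno;

	/*
	 * O_RDWR instead of O_RDONLY: on a union mount the open for
	 * write triggers the copyup, so the descriptor refers to the
	 * top-layer copy by the time fchmod() runs.
	 */
	fd = open(pathname, O_RDWR);
	if (fd < 0)
		return -1;

	ret = fchmod(fd, mode);

	/* Preserve the errno from fchmod() across close(). */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return ret;
}
```

Applications that only need the metadata change and hold no descriptor can call chmod() on the pathname instead, which is in the list of calls that copy up.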
+
+This deviation from the standard is due to technical limitations of the
+union mount implementation. Specifically, we would need to replace an
+open file descriptor from the lower layer with an open file descriptor
+for a file with matching pathname and contents on the upper layer,
+which is difficult to do. We avoid this in other system calls by
+doing the copyup before the file is opened. Unionfs doesn't encounter
+this problem because it creates a dummy file struct which redirects or
+fans out operations to the struct files for the underlying file
+systems.
+
+From an application's point of view, the result of an in-kernel file
+copyup is the logical equivalent of another application updating the
+file via the rename() pattern: creat() a new file, copy the data over,
+make changes to the copy, and rename() over the old version. Any
+existing open file descriptors for that file (including those in the
+same application) refer to a now invisible object that used to have
+the same pathname. Only opens that occur after the copyup will see
+updates to the file.
+
+Permission checks
+-----------------
+
+We want to be sure we have the correct permissions to actually succeed
+in a system call before copying a file up to avoid unnecessary IO. At
+present, the permission check for a single system call may be spread
+out over many hundreds of lines of code (e.g., open()). In order to
+check permissions, we occasionally need to determine if there is a
+writable overlay on top of this inode. This requires a full path, but
+often we only have the inode at this point. In particular,
+inode_permission() returns EROFS if the inode is on a read-only file
+system, which is the wrong answer if there is a writable overlay
+mounted on top of it.
+
+The current solution is to split out the file-system-wide permission
+checks from the per-inode permission checks. inode_permission()
+becomes:
+
+sb_permission()
+__inode_permission()
+
+inode_permission() calls sb_permission() and __inode_permission() on
+the same path. We create path_permission() which calls
+sb_permission() on the parent directory from the top layer, and
+__inode_permission() on the target on the lower layer. This gets us
+the correct write permissions considering that the file will be copied
+up.
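
The effect of the split can be shown with a user-space model. This is a sketch, not the patch: the struct types and the *_model functions are hypothetical, and the per-inode check is reduced to a single "writable" bit. It only illustrates why checking the superblock of the top layer and the inode of the lower layer gives a different answer than checking both on the lower layer.

```c
#include <errno.h>

struct sb_perm_model {
	int read_only;		/* file system is mounted read-only */
};

struct inode_perm_model {
	int writable;		/* caller may write according to mode bits */
};

/* File-system-wide check: only cares about read-only mounts. */
int sb_permission_model(const struct sb_perm_model *sb, int want_write)
{
	if (want_write && sb->read_only)
		return -EROFS;
	return 0;
}

/* Per-inode check: only cares about the inode itself. */
int inode_only_permission_model(const struct inode_perm_model *inode,
				int want_write)
{
	if (want_write && !inode->writable)
		return -EACCES;
	return 0;
}

/* Classic behaviour: both checks against the same object. */
int inode_permission_model(const struct sb_perm_model *sb,
			   const struct inode_perm_model *inode,
			   int want_write)
{
	int err = sb_permission_model(sb, want_write);

	if (err)
		return err;
	return inode_only_permission_model(inode, want_write);
}

/* Union-aware: superblock of the top layer, inode of the lower layer. */
int path_permission_model(const struct sb_perm_model *top_sb,
			  const struct inode_perm_model *lower_inode,
			  int want_write)
{
	int err = sb_permission_model(top_sb, want_write);

	if (err)
		return err;
	return inode_only_permission_model(lower_inode, want_write);
}
```

With a read-only lower file system and a writable top layer, the classic check reports EROFS for a writable file while the union-aware check succeeds, which matches the problem statement above.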
+
+Todo:
+
+ - Currently, we don't deal with differing directory permissions at
+ different levels of the stack. This is a bug.
+
+Impact on non-union kernels and mounts
+--------------------------------------
+
+Union-related data structures, extra fields, and function calls are
+#ifdef'd out at the function/macro level with CONFIG_UNION_MOUNT in
+nearly all cases (see fs/union.h).
+
+Todo:
+
+ - Do performance tests
+
+Locking strategy
+================
+
+The current union mount locking strategy is based on the following
+rules:
+
+* The lower layer file system is always read-only
+* The topmost file system is always read-write
+ => A file system can never be a topmost and lower layer at the same time
+
+Additionally, the topmost layer may only be mounted exactly once.
+Don't think of the topmost layer as a separate independent file
+system; when it is part of a union mount, it is only a file system in
+conjunction with the read-only bottom layer. The read-only bottom
+layer is an independent file system in and of itself and can be
+mounted elsewhere, including as the bottom layer for another union
+mount.
+
+Thus, we may define a stable locking order in terms of top layer and
+bottom layer locks, since a top layer is never a bottom layer and a
+bottom layer is never a top layer. Another simplifying assumption is
+that all directories in a pathname exist on the top layer, as they are
+created step-by-step during lookup. This prevents us from ever having
+to walk backwards up the path creating directory entries, which can
+get complicated. By implication, parent directory paths during any
+operation (rename(), unlink(), etc.) are from the top layer. Dentries
+for directories from the bottom layer are only ever seen or used by
+the lookup code.
+
+The two major problems we avoid with the above rules are:
+
+Lock ordering: Imagine two union stacks with the same two file
+systems: A mounted over B, and B mounted over A. Sometimes locks on
+objects in both A and B will have to be held simultaneously. What
+order should they be acquired in? Simply acquiring them from top to
+bottom will create a lock-ordering problem - one thread acquires lock
+on object from A and then tries for a lock on object from B, while
+another thread grabs the lock on object from B and then waits for the
+lock on object from A. Some other lock ordering must be defined.
+
+Movement/change/disappearance of objects on multiple layers: A variety
+of nasty corner cases arise when more than one layer is changing at
+the same time. Changes in the directory topology and their effect on
+inheritance are of special concern. Al Viro's canonical email on the
+subject:
+
+http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
+
+We don't try to solve any of these cases, just avoid them in the first
+place.
+
+Todo: Prevent top layer from being mounted more than once.
+
+Cross-layer interactions
+------------------------
+
+The VFS code simultaneously holds references to and/or modifies
+objects from both the top and bottom layers in the following cases:
+
+Path lookup:
+
+Grabs i_mutex on bottom layer while holding i_mutex on top layer
+directory inode.
+
+File copyup:
+
+Holds i_mutex on the parent directory from the top layer while copying
+up file from lower layer.
+
+link():
+
+File copyup of target while holding i_mutex on parent directory on top
+layer. Followed by a normal link() operation.
+
+rename():
+
+Holds s_vfs_rename_mutex on the top layer, i_mutex of the source's
+parent dir (top layer), and i_mutex of the target's parent dir (also
+top layer) while looking up and copying the bottom layer target and
+also creating the whiteout.
+
+Notes on rename():
+
+First, renaming of directories returns EXDEV. It's not at all
+reasonable to recursively copy directory trees and userspace has to
+handle this case anyway. An exception is rename() of directories that
+exist only on the topmost layer; this succeeds.
+
+Rename involves three steps on a union mount: (1) copyup of the file
+from the bottom layer, (2) rename of the new top-layer copy to the
+target in the usual manner, (3) creation of a whiteout covering the
+source of the rename.
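
Since rename() of a unioned directory returns EXDEV, userspace needs the same fallback it already uses for cross-device renames. The following is a minimal sketch of how a caller might classify the result; try_rename() and the TRY_RENAME_* values are hypothetical names, and the copy-and-delete fallback itself is left to the caller.

```c
#include <errno.h>
#include <stdio.h>

enum try_rename_result {
	TRY_RENAME_DONE = 0,		/* rename() succeeded */
	TRY_RENAME_NEEDS_COPY = 1,	/* EXDEV: caller must copy + delete */
	TRY_RENAME_FAILED = -1		/* any other error, errno is set */
};

/* Attempt a rename and tell the caller whether a manual copy is needed. */
int try_rename(const char *oldpath, const char *newpath)
{
	if (rename(oldpath, newpath) == 0)
		return TRY_RENAME_DONE;
	if (errno == EXDEV)
		return TRY_RENAME_NEEDS_COPY;
	return TRY_RENAME_FAILED;
}
```

Tools such as mv already take this path when source and target are on different file systems, which is why returning EXDEV is a workable answer for directories.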
+
+Directory copyup:
+
+Directory entries are copied up on the first readdir(). We hold the
+top layer directory i_mutex throughout and sequentially acquire and
+drop the i_mutex for each lower layer directory.
+
+VFS-fs interface
+================
+
+Read-only layer: No support necessary other than enforcement of really
+really read-only semantics (done by VFS for local file systems).
+
+Writable layer: Must implement two new inode operations:
+
+int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+int (*fallthru) (struct inode *, struct dentry *);
+
+And set the MS_WHITEOUT flag to indicate support of these operations.
+
+Todo:
+
+- Return inode of underlying file in d_ino in readdir()
+- Implement whiteouts and fallthrus in ext3
+- Implement whiteouts and fallthrus in btrfs
+
+Supported file systems
+----------------------
+
+Any file system can be a read-only layer. File systems must
+explicitly support whiteouts and fallthrus in order to be a read-write
+layer. This patch set implements whiteouts for ext2, tmpfs, and
+jffs2. We have tested ext2, tmpfs, and iso9660 as the read-only
+layer.
+
+Todo:
+ - Test corner cases of case-insensitive/oversensitive file systems
+
+NFS interaction
+===============
+
+NFS is currently not supported as either type of layer. NFS as
+read-only layer requires support from the server to honor the
+read-only guarantee needed for the bottom layer. To do this, the
+server needs to revoke access to clients requesting read-only file
+systems if the exported file system is remounted read-write or
+unmounted (during which arbitrary changes can occur). Some recent
+discussion:
+
+http://markmail.org/message/3mkgnvo4pswxd7lp
+
+NFS as the read-write layer would require implementation of the
+->whiteout() and ->fallthru() methods. DT_WHT directory entries are
+theoretically already supported.
+
+Also, technically the requirement for a readdir() cookie that is
+stable across reboots comes only from file systems exported via NFSv2:
+
+http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html
+
+Todo:
+
+- Guarantee really really read-only on NFS exports
+- Implement whiteout()/fallthru() for NFS
+
+Userland support
+================
+
+The mount command must support the "-o union" mount option and pass
+the corresponding MS_UNION flag to the kernel. A util-linux git
+tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
+
+File system utilities must support whiteouts and fallthrus. An
+e2fsprogs git tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
+
+Currently, whiteout directory entries are not returned to userland.
+While the directory type for whiteouts, DT_WHT, has been defined for
+many years, very little userland code handles them. Userland will
+never see fallthru directory entries.
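
Although the kernel in this patch set filters whiteouts before they reach userland, a directory walker can be written defensively so that it would also behave if DT_WHT entries were ever returned. The sketch below assumes a libc whose struct dirent has a d_type field (as on Linux); count_visible_entries() is a hypothetical helper, and the fallback definition of DT_WHT is only used if <dirent.h> does not provide one.

```c
#include <dirent.h>
#include <string.h>

#ifndef DT_WHT
#define DT_WHT 14	/* d_type value for whiteouts in the BSD-derived encoding */
#endif

/*
 * Count the entries of a directory, skipping ".", "..", and whiteouts.
 * Returns -1 if the directory cannot be opened.
 */
long count_visible_entries(const char *dirpath)
{
	DIR *dir = opendir(dirpath);
	struct dirent *de;
	long count = 0;

	if (dir == NULL)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		if (de->d_type == DT_WHT)
			continue;	/* a whiteout is the absence of a file */
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		count++;
	}
	closedir(dir);
	return count;
}
```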
+
+Known non-POSIX behaviors
+-------------------------
+
+- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
+
+ Most programs are not tested and don't work well under conditions of
+ ENOSPC. The solution is to add more disk space.
+
+- Link count may be wrong for files on bottom layer with > 1 link count
+
+ A file may have more than one hard link to it. When a file with
+ multiple hard links is copied up, any other hard links pointing to
+ the same inode will remain unchanged. If the file is looked up via
+ one of the hard links on the read-only layer, it will have the
+ original link count (which is off by one at this point). An
+ example:
+
+ /bin/link1 -> inode 100
+ /etc/link2 -> inode 100
+
+ inode 100 will have link count 2.
+
+ # echo "blah" > /bin/link1
+
+ Now /bin/link1 will be copied up to the topmost layer. But
+ /etc/link2 will still point to the original inode 100, and its link
+ count will still be 2.
+
+- Link count on directories will be wrong before readdir() (fixable)
+- File copyup is the logical equivalent of an update via copy +
+ rename(). Any existing open file descriptors will continue to refer
+ to the read-only copy on the bottom layer and will not see any
+ changes that occur after the copy-up.
+- rename() of directory may fail with EXDEV
+- inode number in d_ino of struct dirent will be wrong for fallthrus
+- fchmod()/fchown()/futimensat()/fsetxattr() fail on O_RDONLY fds
+
+Status
+======
+
+The current union mounts implementation is feature-complete on local
+file systems and passes an extensive union mounts test suite,
+available in the union mounts Usermode Linux-based development kit:
+
+http://valerieaurora.org/union/union_mount_devkit.tar.gz
+
+The whiteout code has had some non-trivial level of review and
+testing, but much of the code has had no external review or testing
+outside the authors' machines.
+
+The latest version is available at:
+
+git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
+
+Check the union mounts web page for the name of the latest branch:
+
+http://valerieaurora.org/union/
+
+Todo:
+
+- Run more tests (e.g., XFS test suite)
+- Get review from VFS maintainers
+
+Non-features
+------------
+
+Features we do not currently plan to support in union mounts:
+
+Online upgrade: E.g., installing software on a file system NFS
+exported to clients while the clients are still up and running.
+Allowing the read-only bottom layer of a union mount to change
+invalidates our locking strategy.
+
+Recursive copying of directories: E.g., implementing rename() across
+layers for directories. Doing an in-kernel copy of a single file is
+bad enough. Recursively copying a directory is a big no-no.
+
+Read-only top layer: The readdir() strategy fundamentally requires the
+ability to create persistent directory entries on the top layer file
+system (which may be tmpfs). Numerous alternatives (including
+in-kernel or in-application caching) exist and are compatible with
+union mounts with its writing-readdir() implementation disabled.
+Creating a readdir() cookie that is stable across multiple readdir()s
+requires one of:
+
+- Write to stable storage (e.g., fallthru dentries)
+- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
+- Per-application caching by glibc readdir()
+
+Often these features are supported by other unioning file systems or
+by other versions of union mounts.
+
+Contributing to union mounts
+============================
+
+The union mounts web page is here:
+
+http://valerieaurora.org/union/
+
+It links to:
+
+ - All git repositories
+ - Documentation
+ - An entire self-contained UML-based dev kit with README, etc.
+
+The best mailing list for discussing union mounts is:
+
+linux-fsdevel@vger.kernel.org
+
+http://vger.kernel.org/vger-lists.html#linux-fsdevel
+
+Thank you for reading!
--
1.6.3.3

2010-08-08 15:54:51

by Valerie Aurora

Subject: [PATCH 05/39] whiteout/NFSD: Don't return information about whiteouts to userspace

From: Jan Blunck <[email protected]>

Userspace isn't ready for handling another file type, so silently drop
whiteout directory entries before they leave the kernel.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: [email protected]
Cc: "J. Bruce Fields" <[email protected]>
Cc: Neil Brown <[email protected]>
---
fs/compat.c | 9 +++++++++
fs/nfsd/nfs3xdr.c | 5 +++++
fs/nfsd/nfs4xdr.c | 5 +++++
fs/nfsd/nfsxdr.c | 4 ++++
fs/readdir.c | 9 +++++++++
5 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 6490d21..7e7b3a4 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -912,6 +912,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
struct compat_old_linux_dirent __user *dirent;
compat_ulong_t d_ino;

+ if (d_type == DT_WHT)
+ return 0;
+
if (buf->result)
return -EINVAL;
d_ino = ino;
@@ -983,6 +986,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
compat_ulong_t d_ino;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -1072,6 +1078,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
int reclen = ALIGN(jj + namlen + 1, sizeof(u64));
u64 off;

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2a533a0..9b96f5a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -885,6 +885,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
int elen; /* estimated entry length in words */
int num_entry_words = 0; /* actual number of words */

+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
+
if (cd->offset) {
u64 offset64 = offset;

diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index ac17a70..fb67254 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2279,6 +2279,11 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
return 0;
}

+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
+
if (cd->offset)
xdr_encode_hyper(cd->offset, (u64) offset);

diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index 4ce005d..0e57d4b 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -503,6 +503,10 @@ nfssvc_encode_entry(void *ccdv, const char *name,
namlen, name, offset, ino);
*/

+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
if (offset > ~((u32) 0)) {
cd->common.err = nfserr_fbig;
return -EINVAL;
diff --git a/fs/readdir.c b/fs/readdir.c
index 7723401..3a48491 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -77,6 +77,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
struct old_linux_dirent __user * dirent;
unsigned long d_ino;

+ if (d_type == DT_WHT)
+ return 0;
+
if (buf->result)
return -EINVAL;
d_ino = ino;
@@ -154,6 +157,9 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
unsigned long d_ino;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(long));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -239,6 +245,9 @@ static int filldir64(void * __buf, const char * name, int namlen, loff_t offset,
struct getdents_callback64 * buf = (struct getdents_callback64 *) __buf;
int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 1, sizeof(u64));

+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
--
1.6.3.3

2010-08-08 16:02:00

by Valerie Aurora

Subject: [PATCH 13/39] fallthru: Basic fallthru definitions

Define the fallthru dcache flag and file system op. Mask out the
DCACHE_FALLTHRU flag on dentry creation. Actual users and changes to
lookup come in later patches.

Signed-off-by: Valerie Aurora <[email protected]>
---
Documentation/filesystems/vfs.txt | 6 ++++++
fs/dcache.c | 2 +-
include/linux/dcache.h | 7 +++++++
include/linux/fs.h | 2 ++
4 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 964e0fc..bbaefa9 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -320,6 +320,7 @@ struct inode_operations {
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+ int (*fallthru) (struct inode *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
@@ -390,6 +391,11 @@ otherwise noted.
second is the dentry for the whiteout itself. This method
must unlink() or rmdir() the original entry if it exists.

+ fallthru: called by the readdir(2) system call on a layered file
+ system. Only required if you want to support fallthrus.
+ Fallthrus are place-holders for directory entries visible from
+ a lower level file system.
+
rename: called by the rename(2) system call to rename the object to
have the parent and name given by the second inode and dentry.

diff --git a/fs/dcache.c b/fs/dcache.c
index 79b9f6a..249d077 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -992,7 +992,7 @@ EXPORT_SYMBOL(d_alloc_name);
static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
if (inode) {
- dentry->d_flags &= ~DCACHE_WHITEOUT;
+ dentry->d_flags &= ~(DCACHE_WHITEOUT|DCACHE_FALLTHRU);
list_add(&dentry->d_alias, &inode->i_dentry);
}
dentry->d_inode = inode;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7d650a2..0904716 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -190,6 +190,8 @@ d_iput: no no no yes

#define DCACHE_WHITEOUT 0x0200 /* Stop lookup in a unioned file system */

+#define DCACHE_FALLTHRU 0x0400 /* Continue lookup below an opaque dir */
+
extern spinlock_t dcache_lock;
extern seqlock_t rename_lock;

@@ -379,6 +381,11 @@ static inline int d_is_whiteout(struct dentry *dentry)
return (dentry->d_flags & DCACHE_WHITEOUT);
}

+static inline int d_is_fallthru(struct dentry *dentry)
+{
+ return (dentry->d_flags & DCACHE_FALLTHRU);
+}
+
static inline struct dentry *dget_parent(struct dentry *dentry)
{
struct dentry *ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1dbe156..71ee74e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -210,6 +210,7 @@ struct inodes_stat_t {
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
#define MS_WHITEOUT (1<<25) /* FS supports whiteout filetype */
+#define MS_FALLTHRU (1<<26) /* FS supports fallthru filetype */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

@@ -1534,6 +1535,7 @@ struct inode_operations {
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+ int (*fallthru) (struct inode *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
--
1.6.3.3

2010-08-08 15:54:48

by Valerie Aurora

Subject: [PATCH 01/39] VFS: Comment follow_mount() and friends

Add comments to the VFS follow_mount() family of functions describing
what the directions "up" and "down" mean and how ref counts are handled.
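
To make the "up" and "down" directions concrete, here is a small
user-space sketch. It assumes a simplified model in which each mount has
at most one mount stacked on top of it; the toy_* names are hypothetical
and the ref counting that the real functions perform is left out.

```c
#include <stddef.h>

/* Toy stand-in for a vfsmount in a stack of mounts on one mountpoint. */
struct toy_mount {
	const char *dev;
	struct toy_mount *parent;	/* "up": towards / */
	struct toy_mount *child;	/* "down": mounted on top of us */
};

/* Stack @upper on top of @lower, as a later mount(8) on the same path would. */
static void toy_stack(struct toy_mount *lower, struct toy_mount *upper)
{
	lower->child = upper;
	upper->parent = lower;
}

/* One layer down. Returns 1 if we moved, 0 if nothing is mounted here. */
static int toy_follow_down(struct toy_mount **m)
{
	if (!(*m)->child)
		return 0;
	*m = (*m)->child;
	return 1;
}

/* All the way down: ends at the most recent (visible) mount. */
static void toy_follow_mount(struct toy_mount **m)
{
	while (toy_follow_down(m))
		;
}

/* One layer up. Returns 1 if we moved, 0 if we were already at the root. */
static int toy_follow_up(struct toy_mount **m)
{
	if (!(*m)->parent)
		return 0;
	*m = (*m)->parent;
	return 1;
}
```

With /dev/sda1, /dev/sda2 and /dev/sda3 stacked in that order on the
root mount, toy_follow_down() from the root mount reaches /dev/sda1,
while toy_follow_mount() reaches /dev/sda3.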

Signed-off-by: Valerie Aurora <[email protected]>
Cc: Alexander Viro <[email protected]>
---
fs/namei.c | 43 +++++++++++++++++++++++++++++++++++++++----
fs/namespace.c | 16 ++++++++++++++--
2 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 868d0cb..fd6df0d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -597,6 +597,17 @@ loop:
return err;
}

+/*
+ * follow_up - Find the mountpoint of path's vfsmount
+ *
+ * Given a path, find the mountpoint of its source file system.
+ * Replace @path with the path of the mountpoint in the parent mount.
+ * Up is towards /.
+ *
+ * Return 1 if we went up a level and 0 if we were already at the
+ * root.
+ */
+
int follow_up(struct path *path)
{
struct vfsmount *parent;
@@ -617,8 +628,22 @@ int follow_up(struct path *path)
return 1;
}

-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
+/*
+ * __follow_mount - Return the most recent mount at this mountpoint
+ *
+ * Given a mountpoint, find the most recently mounted file system at
+ * this mountpoint and return the path to its root dentry. This is
+ * the file system that is visible, and it is in the direction of VFS
+ * "down" - away from the root of the mount tree. See comments to
+ * lookup_mnt() for an example of "down."
+ *
+ * Does not decrement the refcount on the given mount even if it
+ * follows it to another mount and returns that path instead.
+ *
+ * Returns 0 if path was unchanged, 1 if we followed it to another mount.
+ *
+ * No need for dcache_lock, as serialization is taken care in
+ * namespace.c.
*/
static int __follow_mount(struct path *path)
{
@@ -637,6 +662,12 @@ static int __follow_mount(struct path *path)
return res;
}

+/*
+ * Like __follow_mount, but no return value and drops references to
+ * both mnt and dentry of the given path if it follows to another
+ * mount.
+ */
+
static void follow_mount(struct path *path)
{
while (d_mountpoint(path->dentry)) {
@@ -650,8 +681,12 @@ static void follow_mount(struct path *path)
}
}

-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
+/*
+ * Like follow_mount(), but traverses only one layer instead of
+ * continuing until it runs out.
+ *
+ * No need for dcache_lock, as serialization is taken care in
+ * namespace.c.
*/
int follow_down(struct path *path)
{
diff --git a/fs/namespace.c b/fs/namespace.c
index 88058de..b8a66db 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -433,8 +433,20 @@ struct vfsmount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
}

/*
- * lookup_mnt increments the ref count before returning
- * the vfsmount struct.
+ * lookup_mnt - Return the first child mount mounted at path
+ *
+ * "First" means first mounted chronologically. If you create the
+ * following mounts:
+ *
+ * mount /dev/sda1 /mnt
+ * mount /dev/sda2 /mnt
+ * mount /dev/sda3 /mnt
+ *
+ * Then lookup_mnt() on the base /mnt dentry in the root mount will
+ * return successively the root dentry and vfsmount of /dev/sda1, then
+ * /dev/sda2, then /dev/sda3, then NULL.
+ *
+ * lookup_mnt takes a reference to the found vfsmount.
*/
struct vfsmount *lookup_mnt(struct path *path)
{
--
1.6.3.3

2010-08-08 15:54:42

by Valerie Aurora

Subject: [PATCH 04/39] autofs4: Save autofs trigger's vfsmount in super block info

From: Jan Blunck <[email protected]>

XXX - This is broken and included just to make union mounts work. Ian
Kent and David Howells are working on a long-term solution that will
replace abuse of ->follow_link() to trigger an automount with a new
op.

Original commit message:

This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:

During a path walk if an autofs trigger is mounted on a dentry,
when the follow_link method is called, the nameidata struct
contains the vfsmount and mountpoint dentry of the parent mount
while the dentry that is passed in is the root of the autofs
trigger mount. I believe it is impossible to get the vfsmount of
the trigger mount, within the follow_link method, when only the
parent vfsmount and the root dentry of the trigger mount are
known.

The previous solution, in that commit, was to replace the path embedded
in the parent's nameidata with the path of the link itself in
__do_follow_link(). This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is in all other places: the path of the parent).

A cleaner and easier to understand solution is to save the necessary
vfsmount in the autofs superblock info when it is mounted. Then we
can easily update the vfsmount in autofs4_follow_link().
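
The idea can be sketched in user space as follows. This is only a toy
model under stated assumptions: the toy_* types and the plain integer
ref counts are hypothetical, and stand in for vfsmount, dentry,
autofs_sb_info and the mntget()/mntput()/dget()/dput() helpers.

```c
#include <stddef.h>

struct toy_mnt    { int refs; };
struct toy_dentry { int refs; };
struct toy_path   { struct toy_mnt *mnt; struct toy_dentry *dentry; };
struct toy_sbi    { struct toy_mnt *mnt; };	/* per-superblock info */

static struct toy_mnt *toy_mntget(struct toy_mnt *m) { m->refs++; return m; }
static void toy_mntput(struct toy_mnt *m) { m->refs--; }
static struct toy_dentry *toy_dget(struct toy_dentry *d) { d->refs++; return d; }
static void toy_dput(struct toy_dentry *d) { d->refs--; }

/* Mount time: remember which mount this superblock was mounted as. */
static void toy_get_sb(struct toy_sbi *sbi, struct toy_mnt *mnt)
{
	sbi->mnt = mnt;
}

/*
 * Path walk: swap the parent's path in the nameidata for the trigger's
 * own path, dropping the old references and taking new ones.
 */
static void toy_follow_link(struct toy_sbi *sbi, struct toy_dentry *trigger,
			    struct toy_path *nd_path)
{
	toy_dput(nd_path->dentry);
	toy_mntput(nd_path->mnt);
	nd_path->mnt = toy_mntget(sbi->mnt);
	nd_path->dentry = toy_dget(trigger);
}
```

The point of the design is that the trigger's mount no longer has to be
derived during the path walk; it is looked up from state saved at mount
time.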

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Acked-by: Ian Kent <[email protected]>
Cc: [email protected]
Cc: Alexander Viro <[email protected]>
---
fs/autofs4/autofs_i.h | 1 +
fs/autofs4/init.c | 11 ++++++++++-
fs/autofs4/root.c | 6 ++++++
fs/namei.c | 7 ++-----
4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 3d283ab..de3af64 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -133,6 +133,7 @@ struct autofs_sb_info {
int reghost_enabled;
int needs_reghost;
struct super_block *sb;
+ struct vfsmount *mnt;
struct mutex wq_mutex;
spinlock_t fs_lock;
struct autofs_wait_queue *queues; /* Wait queue pointer */
diff --git a/fs/autofs4/init.c b/fs/autofs4/init.c
index 9722e4b..5e0dcd7 100644
--- a/fs/autofs4/init.c
+++ b/fs/autofs4/init.c
@@ -17,7 +17,16 @@
static int autofs_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
- return get_sb_nodev(fs_type, flags, data, autofs4_fill_super, mnt);
+ struct autofs_sb_info *sbi;
+ int ret;
+
+ ret = get_sb_nodev(fs_type, flags, data, autofs4_fill_super, mnt);
+ if (ret)
+ return ret;
+
+ sbi = autofs4_sbi(mnt->mnt_sb);
+ sbi->mnt = mnt;
+ return 0;
}

static struct file_system_type autofs_fs_type = {
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index db4117e..e4c507d 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -220,6 +220,12 @@ static void *autofs4_follow_link(struct dentry *dentry, struct nameidata *nd)
DPRINTK("dentry=%p %.*s oz_mode=%d nd->flags=%d",
dentry, dentry->d_name.len, dentry->d_name.name, oz_mode,
nd->flags);
+
+ dput(nd->path.dentry);
+ mntput(nd->path.mnt);
+ nd->path.mnt = mntget(sbi->mnt);
+ nd->path.dentry = dget(dentry);
+
/*
* For an expire of a covered direct or offset mount we need
* to break out of follow_down() at the autofs mount trigger
diff --git a/fs/namei.c b/fs/namei.c
index 735f54b..7552e61 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -539,11 +539,8 @@ __do_follow_link(struct path *path, struct nameidata *nd, void **p)
touch_atime(path->mnt, dentry);
nd_set_link(nd, NULL);

- if (path->mnt != nd->path.mnt) {
- path_to_nameidata(path, nd);
- dget(dentry);
- }
- mntget(path->mnt);
+ if (path->mnt == nd->path.mnt)
+ mntget(nd->path.mnt);
nd->last_type = LAST_BIND;
*p = dentry->d_inode->i_op->follow_link(dentry, nd);
error = PTR_ERR(*p);
--
1.6.3.3

2010-08-08 16:02:54

by Valerie Aurora

Subject: [PATCH 12/39] whiteout: jffs2 whiteout support

From: Felix Fietkau <[email protected]>

Add support for whiteout dentries to jffs2.

XXX - David Woodhouse suggests several changes and provides an
untested patch. See:

http://patchwork.ozlabs.org/patch/50466/

XXX - Backward compatibility? Creating a whiteout on a JFFS2 file
system can only happen if it is deliberately mounted "-o union" so
there is some way to prevent creation of whiteouts on a file system
you want to later mount with an earlier (no support for whiteout) file
system. However, ext2/3 has much more robust methods (explicit fs
feature flag) to prevent such an occurrence.

Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: [email protected]
---
fs/jffs2/dir.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/jffs2/fs.c | 4 +++
fs/jffs2/super.c | 2 +-
include/linux/jffs2.h | 2 +
4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 166062a..4798586 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -34,6 +34,8 @@ static int jffs2_mknod (struct inode *,struct dentry *,int,dev_t);
static int jffs2_rename (struct inode *, struct dentry *,
struct inode *, struct dentry *);

+static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+
const struct file_operations jffs2_dir_operations =
{
.read = generic_read_dir,
@@ -56,6 +58,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.mknod = jffs2_mknod,
.rename = jffs2_rename,
.check_acl = jffs2_check_acl,
+ .whiteout = jffs2_whiteout,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
@@ -98,8 +101,14 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
fd = fd_list;
}
}
- if (fd)
- ino = fd->ino;
+ if (fd) {
+ spin_lock(&target->d_lock);
+ if (fd->type == DT_WHT)
+ target->d_flags |= DCACHE_WHITEOUT;
+ else
+ ino = fd->ino;
+ spin_unlock(&target->d_lock);
+ }
mutex_unlock(&dir_f->sem);
if (ino) {
inode = jffs2_iget(dir_i->i_sb, ino);
@@ -502,6 +511,11 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, int mode)
return PTR_ERR(inode);
}

+ if (dentry->d_flags & DCACHE_WHITEOUT) {
+ inode->i_flags |= S_OPAQUE;
+ ri->flags = cpu_to_je16(JFFS2_INO_FLAG_OPAQUE);
+ }
+
inode->i_op = &jffs2_dir_inode_operations;
inode->i_fop = &jffs2_dir_operations;

@@ -784,6 +798,60 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, int mode, de
return ret;
}

+static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
+ struct dentry *new_dentry)
+{
+ struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
+ struct jffs2_inode_info *victim_f = NULL;
+ uint32_t now;
+ int ret;
+
+ /* If it's a directory, then check whether it is really empty */
+ if (new_dentry->d_inode) {
+ victim_f = JFFS2_INODE_INFO(old_dentry->d_inode);
+ if (S_ISDIR(old_dentry->d_inode->i_mode)) {
+ struct jffs2_full_dirent *fd;
+
+ mutex_lock(&victim_f->sem);
+ for (fd = victim_f->dents; fd; fd = fd->next) {
+ if (fd->ino) {
+ mutex_unlock(&victim_f->sem);
+ return -ENOTEMPTY;
+ }
+ }
+ mutex_unlock(&victim_f->sem);
+ }
+ }
+
+ now = get_seconds();
+ ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, DT_WHT,
+ new_dentry->d_name.name, new_dentry->d_name.len, now);
+ if (ret)
+ return ret;
+
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ d_add(new_dentry, NULL);
+
+ if (victim_f) {
+ /* There was a victim. Kill it off nicely */
+ drop_nlink(old_dentry->d_inode);
+ /* Don't oops if the victim was a dirent pointing to an
+ inode which didn't exist. */
+ if (victim_f->inocache) {
+ mutex_lock(&victim_f->sem);
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ victim_f->inocache->pino_nlink = 0;
+ else
+ victim_f->inocache->pino_nlink--;
+ mutex_unlock(&victim_f->sem);
+ }
+ }
+
+ return 0;
+}
+
static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
struct inode *new_dir_i, struct dentry *new_dentry)
{
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 459d39d..cdb2667 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -301,6 +301,10 @@ struct inode *jffs2_iget(struct super_block *sb, unsigned long ino)

inode->i_op = &jffs2_dir_inode_operations;
inode->i_fop = &jffs2_dir_operations;
+
+ if (je16_to_cpu(latest_node.flags) & JFFS2_INO_FLAG_OPAQUE)
+ inode->i_flags |= S_OPAQUE;
+
break;
}
case S_IFREG:
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 511e2d6..f998679 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -170,7 +170,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)

sb->s_op = &jffs2_super_operations;
sb->s_export_op = &jffs2_export_ops;
- sb->s_flags = sb->s_flags | MS_NOATIME;
+ sb->s_flags = sb->s_flags | MS_NOATIME | MS_WHITEOUT;
sb->s_xattr = jffs2_xattr_handlers;
#ifdef CONFIG_JFFS2_FS_POSIX_ACL
sb->s_flags |= MS_POSIXACL;
diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
index 0874ab5..cc6347f 100644
--- a/include/linux/jffs2.h
+++ b/include/linux/jffs2.h
@@ -87,6 +87,8 @@
#define JFFS2_INO_FLAG_USERCOMPR 2 /* User has requested a specific
compression type */

+#define JFFS2_INO_FLAG_OPAQUE 4 /* Directory is opaque (for union mounts) */
+

/* These can go once we've made sure we've caught all uses without
byteswapping */
--
1.6.3.3

2010-08-08 15:54:38

by Valerie Aurora

Subject: [PATCH 02/39] VFS: Make lookup_hash() return a struct path

From: Jan Blunck <[email protected]>

This patch changes lookup_hash() to fill in a struct path and return
an error code, instead of returning a dentry.

XXX - Check for correctness, otherwise obvious
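
The calling-convention change can be sketched in user space like this.
The toy_err_ptr()/toy_is_err()/toy_ptr_err() helpers are hypothetical
stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(), and the lookup
itself is a placeholder that only rejects over-long names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_ERRNO 4095
#define TOY_NAME_MAX  8		/* arbitrary limit for this sketch */

/* User-space stand-ins for ERR_PTR(), PTR_ERR() and IS_ERR(). */
static void *toy_err_ptr(long err) { return (void *)(intptr_t)err; }
static long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-(long)TOY_MAX_ERRNO);
}

struct toy_dentry { const char *name; };
struct toy_path { void *mnt; struct toy_dentry *dentry; };

static struct toy_dentry toy_result = { "result" };

/* Old style: the error is encoded in the returned pointer. */
static struct toy_dentry *toy_lookup_old(const char *name)
{
	if (strlen(name) > TOY_NAME_MAX)
		return toy_err_ptr(-ENAMETOOLONG);
	return &toy_result;
}

/*
 * New style: return an int and fill in a path. On error both path
 * members are reset to NULL, as in the patched lookup_hash().
 */
static int toy_lookup_hash(void *mnt, const char *name, struct toy_path *path)
{
	int err = 0;

	path->mnt = mnt;
	path->dentry = toy_lookup_old(name);
	if (toy_is_err(path->dentry)) {
		err = (int)toy_ptr_err(path->dentry);
		path->dentry = NULL;
		path->mnt = NULL;
	}
	return err;
}
```

Callers then test the integer result directly instead of pairing
PTR_ERR() with IS_ERR(), which is the pattern the diff below removes
from do_rmdir(), do_unlinkat() and renameat().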

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Alexander Viro <[email protected]>
---
fs/namei.c | 113 ++++++++++++++++++++++++++++++-----------------------------
1 files changed, 57 insertions(+), 56 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index fd6df0d..735f54b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1156,7 +1156,7 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
}

static struct dentry *__lookup_hash(struct qstr *name,
- struct dentry *base, struct nameidata *nd)
+ struct dentry *base, struct nameidata *nd)
{
struct dentry *dentry;
struct inode *inode;
@@ -1213,14 +1213,22 @@ out:
* needs parent already locked. Doesn't follow mounts.
* SMP-safe.
*/
-static struct dentry *lookup_hash(struct nameidata *nd)
+static int lookup_hash(struct nameidata *nd, struct qstr *name,
+ struct path *path)
{
int err;

err = exec_permission(nd->path.dentry->d_inode);
if (err)
- return ERR_PTR(err);
- return __lookup_hash(&nd->last, nd->path.dentry, nd);
+ return err;
+ path->mnt = nd->path.mnt;
+ path->dentry = __lookup_hash(name, nd->path.dentry, nd);
+ if (IS_ERR(path->dentry)) {
+ err = PTR_ERR(path->dentry);
+ path->dentry = NULL;
+ path->mnt = NULL;
+ }
+ return err;
}

static int __lookup_one_len(const char *name, struct qstr *this,
@@ -1702,12 +1710,9 @@ static struct file *do_last(struct nameidata *nd, struct path *path,

/* OK, it's O_CREAT */
mutex_lock(&dir->d_inode->i_mutex);
+ error = lookup_hash(nd, &nd->last, path);

- path->dentry = lookup_hash(nd);
- path->mnt = nd->path.mnt;
-
- error = PTR_ERR(path->dentry);
- if (IS_ERR(path->dentry)) {
+ if (error) {
mutex_unlock(&dir->d_inode->i_mutex);
goto exit;
}
@@ -1959,7 +1964,8 @@ EXPORT_SYMBOL(filp_open);
*/
struct dentry *lookup_create(struct nameidata *nd, int is_dir)
{
- struct dentry *dentry = ERR_PTR(-EEXIST);
+ struct path path;
+ int err;

mutex_lock_nested(&nd->path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
/*
@@ -1967,7 +1973,7 @@ struct dentry *lookup_create(struct nameidata *nd, int is_dir)
* (foo/., foo/.., /////)
*/
if (nd->last_type != LAST_NORM)
- goto fail;
+ return ERR_PTR(-EEXIST);
nd->flags &= ~LOOKUP_PARENT;
nd->flags |= LOOKUP_CREATE | LOOKUP_EXCL;
nd->intent.open.flags = O_EXCL;
@@ -1975,11 +1981,11 @@ struct dentry *lookup_create(struct nameidata *nd, int is_dir)
/*
* Do the final lookup.
*/
- dentry = lookup_hash(nd);
- if (IS_ERR(dentry))
- goto fail;
+ err = lookup_hash(nd, &nd->last, &path);
+ if (err)
+ return ERR_PTR(err);

- if (dentry->d_inode)
+ if (path.dentry->d_inode)
goto eexist;
/*
* Special case - lookup gave negative, but... we had foo/bar/
@@ -1988,15 +1994,14 @@ struct dentry *lookup_create(struct nameidata *nd, int is_dir)
* been asking for (non-existent) directory. -ENOENT for you.
*/
if (unlikely(!is_dir && nd->last.name[nd->last.len])) {
- dput(dentry);
- dentry = ERR_PTR(-ENOENT);
+ dput(path.dentry);
+ return ERR_PTR(-ENOENT);
}
- return dentry;
+
+ return path.dentry;
eexist:
- dput(dentry);
- dentry = ERR_PTR(-EEXIST);
-fail:
- return dentry;
+ path_put_conditional(&path, nd);
+ return ERR_PTR(-EEXIST);
}
EXPORT_SYMBOL_GPL(lookup_create);

@@ -2231,7 +2236,7 @@ static long do_rmdir(int dfd, const char __user *pathname)
{
int error = 0;
char * name;
- struct dentry *dentry;
+ struct path path;
struct nameidata nd;

error = user_path_parent(dfd, pathname, &nd, &name);
@@ -2253,21 +2258,20 @@ static long do_rmdir(int dfd, const char __user *pathname)
nd.flags &= ~LOOKUP_PARENT;

mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- dentry = lookup_hash(&nd);
- error = PTR_ERR(dentry);
- if (IS_ERR(dentry))
+ error = lookup_hash(&nd, &nd.last, &path);
+ if (error)
goto exit2;
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit3;
- error = security_path_rmdir(&nd.path, dentry);
+ error = security_path_rmdir(&nd.path, path.dentry);
if (error)
goto exit4;
- error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
+ error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
exit4:
mnt_drop_write(nd.path.mnt);
exit3:
- dput(dentry);
+ path_put_conditional(&path, &nd);
exit2:
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
exit1:
@@ -2323,7 +2327,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
{
int error;
char *name;
- struct dentry *dentry;
+ struct path path;
struct nameidata nd;
struct inode *inode = NULL;

@@ -2338,26 +2342,25 @@ static long do_unlinkat(int dfd, const char __user *pathname)
nd.flags &= ~LOOKUP_PARENT;

mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- dentry = lookup_hash(&nd);
- error = PTR_ERR(dentry);
- if (!IS_ERR(dentry)) {
+ error = lookup_hash(&nd, &nd.last, &path);
+ if (!error) {
/* Why not before? Because we want correct error value */
if (nd.last.name[nd.last.len])
goto slashes;
- inode = dentry->d_inode;
+ inode = path.dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
- error = security_path_unlink(&nd.path, dentry);
+ error = security_path_unlink(&nd.path, path.dentry);
if (error)
goto exit3;
- error = vfs_unlink(nd.path.dentry->d_inode, dentry);
+ error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
exit3:
mnt_drop_write(nd.path.mnt);
exit2:
- dput(dentry);
+ path_put_conditional(&path, &nd);
}
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
if (inode)
@@ -2368,8 +2371,8 @@ exit1:
return error;

slashes:
- error = !dentry->d_inode ? -ENOENT :
- S_ISDIR(dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
+ error = !path.dentry->d_inode ? -ENOENT :
+ S_ISDIR(path.dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
goto exit2;
}

@@ -2707,7 +2710,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
int, newdfd, const char __user *, newname)
{
struct dentry *old_dir, *new_dir;
- struct dentry *old_dentry, *new_dentry;
+ struct path old, new;
struct dentry *trap;
struct nameidata oldnd, newnd;
char *from;
@@ -2741,16 +2744,15 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,

trap = lock_rename(new_dir, old_dir);

- old_dentry = lookup_hash(&oldnd);
- error = PTR_ERR(old_dentry);
- if (IS_ERR(old_dentry))
+ error = lookup_hash(&oldnd, &oldnd.last, &old);
+ if (error)
goto exit3;
/* source must exist */
error = -ENOENT;
- if (!old_dentry->d_inode)
+ if (!old.dentry->d_inode)
goto exit4;
/* unless the source is a directory trailing slashes give -ENOTDIR */
- if (!S_ISDIR(old_dentry->d_inode->i_mode)) {
+ if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
error = -ENOTDIR;
if (oldnd.last.name[oldnd.last.len])
goto exit4;
@@ -2759,32 +2761,31 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
}
/* source should not be ancestor of target */
error = -EINVAL;
- if (old_dentry == trap)
+ if (old.dentry == trap)
goto exit4;
- new_dentry = lookup_hash(&newnd);
- error = PTR_ERR(new_dentry);
- if (IS_ERR(new_dentry))
+ error = lookup_hash(&newnd, &newnd.last, &new);
+ if (error)
goto exit4;
/* target should not be an ancestor of source */
error = -ENOTEMPTY;
- if (new_dentry == trap)
+ if (new.dentry == trap)
goto exit5;

error = mnt_want_write(oldnd.path.mnt);
if (error)
goto exit5;
- error = security_path_rename(&oldnd.path, old_dentry,
- &newnd.path, new_dentry);
+ error = security_path_rename(&oldnd.path, old.dentry,
+ &newnd.path, new.dentry);
if (error)
goto exit6;
- error = vfs_rename(old_dir->d_inode, old_dentry,
- new_dir->d_inode, new_dentry);
+ error = vfs_rename(old_dir->d_inode, old.dentry,
+ new_dir->d_inode, new.dentry);
exit6:
mnt_drop_write(oldnd.path.mnt);
exit5:
- dput(new_dentry);
+ path_put_conditional(&new, &newnd);
exit4:
- dput(old_dentry);
+ path_put_conditional(&old, &oldnd);
exit3:
unlock_rename(new_dir, old_dir);
exit2:
--
1.6.3.3

2010-08-08 16:03:28

by Valerie Aurora

Subject: [PATCH 09/39] whiteout: tmpfs whiteout support

Add support for whiteout dentries to tmpfs. This includes adding
support for whiteouts to d_genocide(), which is called to tear down
pinned tmpfs dentries. Whiteouts have to be persistent, so they hold
an extra pinning ref count that needs to be dropped by d_genocide().
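
The skip condition that d_genocide() gains can be modelled in user space
as below. This is a sketch under simplifying assumptions: the toy_dentry
fields are hypothetical, the directory tree is flattened into an array,
and d_count is a plain int rather than an atomic.

```c
#include <stdbool.h>
#include <stddef.h>

#define DCACHE_WHITEOUT 0x0200	/* value from the whiteout patches */

struct toy_dentry {
	unsigned int d_flags;
	bool hashed;
	void *d_inode;		/* NULL for negative dentries and whiteouts */
	int d_count;		/* includes the extra pinning reference */
};

/*
 * Unhashed and plain negative dentries are skipped. A whiteout has no
 * inode either, but it carries a pinning reference, so it must be
 * processed like a positive dentry.
 */
static bool toy_genocide_skips(const struct toy_dentry *d)
{
	bool whiteout = (d->d_flags & DCACHE_WHITEOUT) != 0;

	return !d->hashed || (!d->d_inode && !whiteout);
}

/* Drop one reference on every dentry that is not skipped. */
static int toy_genocide(struct toy_dentry *dents, size_t n)
{
	int dropped = 0;

	for (size_t i = 0; i < n; i++) {
		if (toy_genocide_skips(&dents[i]))
			continue;
		dents[i].d_count--;
		dropped++;
	}
	return dropped;
}
```

Without the whiteout check, a whiteout would be treated as an ordinary
negative dentry and its pinning reference would never be dropped.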

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
---
fs/dcache.c | 13 +++++-
mm/shmem.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 143 insertions(+), 15 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 80f059b..79b9f6a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2229,7 +2229,18 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- if (d_unhashed(dentry)||!dentry->d_inode)
+ /*
+ * Skip unhashed and negative dentries, but process
+ * positive dentries and whiteouts. A whiteout looks
+ * kind of like a negative dentry for purposes of
+ * lookup, but it has an extra pinning ref count
+ * because it can't be evicted like a negative dentry
+ * can. What we care about here is ref counts - and
+ * we need to drop the ref count on a whiteout before
+ * we can evict it.
+ */
+ if (d_unhashed(dentry)||(!dentry->d_inode &&
+ !d_is_whiteout(dentry)))
continue;
if (!list_empty(&dentry->d_subdirs)) {
this_parent = dentry;
diff --git a/mm/shmem.c b/mm/shmem.c
index f65f840..a0a4fa5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1805,6 +1805,76 @@ static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
return 0;
}

+static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
+static int shmem_unlink(struct inode *dir, struct dentry *dentry);
+
+/*
+ * This is the whiteout support for tmpfs. It uses one singleton whiteout
+ * inode per superblock thus it is very similar to shmem_link().
+ */
+static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
+ struct dentry *new_dentry)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+ struct dentry *dentry;
+
+ if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+ return -EPERM;
+
+ /* This gives us a proper initialized negative dentry */
+ dentry = simple_lookup(dir, new_dentry, NULL);
+ if (dentry && IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ /*
+ * No ordinary (disk based) filesystem counts whiteouts as inodes;
+ * but each new link needs a new dentry, pinning lowmem, and
+ * tmpfs dentries cannot be pruned until they are unlinked.
+ */
+ if (sbinfo->max_inodes) {
+ spin_lock(&sbinfo->stat_lock);
+ if (!sbinfo->free_inodes) {
+ spin_unlock(&sbinfo->stat_lock);
+ return -ENOSPC;
+ }
+ sbinfo->free_inodes--;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
+ if (old_dentry->d_inode) {
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ shmem_rmdir(dir, old_dentry);
+ else
+ shmem_unlink(dir, old_dentry);
+ }
+
+ dir->i_size += BOGO_DIRENT_SIZE;
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ /* Extra pinning count for the created dentry */
+ dget(new_dentry);
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ return 0;
+}
+
+static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
+ struct inode *inode)
+{
+ if (d_is_whiteout(dentry)) {
+ /* Re-using an existing whiteout */
+ shmem_free_inode(dir->i_sb);
+ if (S_ISDIR(inode->i_mode))
+ inode->i_mode |= S_OPAQUE;
+ } else {
+ /* New dentry */
+ dir->i_size += BOGO_DIRENT_SIZE;
+ dget(dentry); /* Extra count - pin the dentry in core */
+ }
+ /* Will clear DCACHE_WHITEOUT flag */
+ d_instantiate(dentry, inode);
+
+}
/*
* File creation. Allocate an inode, and we're done..
*/
@@ -1833,10 +1903,8 @@ shmem_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
#else
error = 0;
#endif
- dir->i_size += BOGO_DIRENT_SIZE;
+ shmem_d_instantiate(dir, dentry, inode);
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- d_instantiate(dentry, inode);
- dget(dentry); /* Extra count - pin the dentry in core */
}
return error;
}
@@ -1874,12 +1942,11 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
if (ret)
goto out;

- dir->i_size += BOGO_DIRENT_SIZE;
+ shmem_d_instantiate(dir, dentry, inode);
+
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
atomic_inc(&inode->i_count); /* New dentry reference */
- dget(dentry); /* Extra pinning count for the created dentry */
- d_instantiate(dentry, inode);
out:
return ret;
}
@@ -1888,21 +1955,61 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;

- if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode))
- shmem_free_inode(inode->i_sb);
+ if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
+ shmem_free_inode(dir->i_sb);

+ if (inode) {
+ inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ drop_nlink(inode);
+ }
dir->i_size -= BOGO_DIRENT_SIZE;
- inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- drop_nlink(inode);
dput(dentry); /* Undo the count from "create" - this does all the work */
return 0;
}

+static void shmem_dir_unlink_whiteouts(struct inode *dir, struct dentry *dentry)
+{
+ if (!dentry->d_inode)
+ return;
+
+ /* Remove whiteouts from logical empty directory */
+ if (S_ISDIR(dentry->d_inode->i_mode) &&
+ dentry->d_inode->i_sb->s_flags & MS_WHITEOUT) {
+ struct dentry *child, *next;
+ LIST_HEAD(list);
+
+ spin_lock(&dcache_lock);
+ list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+ spin_lock(&child->d_lock);
+ if (d_is_whiteout(child)) {
+ __d_drop(child);
+ if (!list_empty(&child->d_lru)) {
+ list_del(&child->d_lru);
+ dentry_stat.nr_unused--;
+ }
+ list_add(&child->d_lru, &list);
+ }
+ spin_unlock(&child->d_lock);
+ }
+ spin_unlock(&dcache_lock);
+
+ list_for_each_entry_safe(child, next, &list, d_lru) {
+ spin_lock(&child->d_lock);
+ list_del_init(&child->d_lru);
+ spin_unlock(&child->d_lock);
+
+ shmem_unlink(dentry->d_inode, child);
+ }
+ }
+}
+
static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
{
if (!simple_empty(dentry))
return -ENOTEMPTY;

+ /* Remove whiteouts from logical empty directory */
+ shmem_dir_unlink_whiteouts(dir, dentry);
drop_nlink(dentry->d_inode);
drop_nlink(dir);
return shmem_unlink(dir, dentry);
@@ -1911,7 +2018,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
/*
* The VFS layer already does all the dentry stuff for rename,
* we just have to decrement the usage count for the target if
- * it exists so that the VFS layer correctly free's it when it
+ * it exists so that the VFS layer correctly frees it when it
* gets overwritten.
*/
static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
@@ -1922,7 +2029,12 @@ static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct
if (!simple_empty(new_dentry))
return -ENOTEMPTY;

+ if (d_is_whiteout(new_dentry))
+ shmem_unlink(new_dir, new_dentry);
+
if (new_dentry->d_inode) {
+ /* Remove whiteouts from logical empty directory */
+ shmem_dir_unlink_whiteouts(new_dir, new_dentry);
(void) shmem_unlink(new_dir, new_dentry);
if (they_are_dirs)
drop_nlink(old_dir);
@@ -1987,10 +2099,8 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s
unlock_page(page);
page_cache_release(page);
}
- dir->i_size += BOGO_DIRENT_SIZE;
+ shmem_d_instantiate(dir, dentry, inode);
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- d_instantiate(dentry, inode);
- dget(dentry);
return 0;
}

@@ -2367,6 +2477,12 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
if (!root)
goto failed_iput;
sb->s_root = root;
+
+#ifdef CONFIG_TMPFS
+ if (!(sb->s_flags & MS_NOUSER))
+ sb->s_flags |= MS_WHITEOUT;
+#endif
+
return 0;

failed_iput:
@@ -2466,6 +2582,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.rmdir = shmem_rmdir,
.mknod = shmem_mknod,
.rename = shmem_rename,
+ .whiteout = shmem_whiteout,
#endif
#ifdef CONFIG_TMPFS_POSIX_ACL
.setattr = shmem_notify_change,
--
1.6.3.3

2010-08-08 16:03:45

by Valerie Aurora

Subject: [PATCH 11/39] whiteout: ext2 whiteout support

From: Jan Blunck <[email protected]>

This patch adds whiteout support to EXT2. A whiteout is an empty directory
entry (inode == 0) with the file type set to EXT2_FT_WHT. Therefore it
allocates space in directories. Due to being implemented as a filetype it is
necessary to have the EXT2_FEATURE_INCOMPAT_FILETYPE flag set.

XXX - Needs serious review. Al wonders: What happens with a delete at
the beginning of a block? Will we find the matching dentry or the
first empty space?
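
As an illustration only, here is a user-space sketch of the representation
described above. The struct is a simplified, hypothetical stand-in for
struct ext2_dir_entry_2 (not the real on-disk layout), and the sketch_*
names are made up for this example. The matching rule follows the patched
ext2_match() in this patch: a slot with inode == 0 matches a name only if
its file type marks it as a whiteout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* SKETCH_FT_WHT mirrors EXT2_FT_WHT = 8 from this patch. */
enum { SKETCH_FT_REG = 1, SKETCH_FT_WHT = 8 };

/* Simplified, hypothetical stand-in for struct ext2_dir_entry_2. */
struct sketch_dirent {
	uint32_t inode;		/* 0 for an unused slot and for a whiteout */
	uint8_t name_len;
	uint8_t file_type;
	char name[255];
};

/*
 * Same rule as the patched ext2_match(): an entry with inode == 0 is
 * skipped unless its file type says it is a whiteout.
 */
static int sketch_match(int len, const char *name,
			const struct sketch_dirent *de)
{
	if (len != de->name_len)
		return 0;
	if (!de->inode && de->file_type != SKETCH_FT_WHT)
		return 0;
	return !memcmp(name, de->name, len);
}

/* A whiteout is an entry with no inode and the whiteout file type. */
static int sketch_is_whiteout(const struct sketch_dirent *de)
{
	return de->inode == 0 && de->file_type == SKETCH_FT_WHT;
}
```

With this rule a deleted (unused) slot that still carries an old name never
matches, while a whiteout for the same name does, which is why the whiteout
keeps occupying space in the directory.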

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Theodore Tso <[email protected]>
Cc: [email protected]
---
fs/ext2/dir.c | 96 +++++++++++++++++++++++++++++++++++++++++++++--
fs/ext2/ext2.h | 3 +
fs/ext2/inode.c | 11 ++++-
fs/ext2/namei.c | 63 +++++++++++++++++++++++++++++-
fs/ext2/super.c | 5 ++
include/linux/ext2_fs.h | 4 ++
6 files changed, 172 insertions(+), 10 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 57207a9..030bd46 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,7 @@ static inline int ext2_match (int len, const char * const name,
{
if (len != de->name_len)
return 0;
- if (!de->inode)
+ if (!de->inode && (de->file_type != EXT2_FT_WHT))
return 0;
return !memcmp(name, de->name, len);
}
@@ -255,6 +255,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_FIFO] = DT_FIFO,
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
+ [EXT2_FT_WHT] = DT_WHT,
};

#define S_SHIFT 12
@@ -448,6 +449,26 @@ ino_t ext2_inode_by_name(struct inode *dir, struct qstr *child)
return res;
}

+/* Special version for filetype based whiteout support */
+ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
+{
+ ino_t res = 0;
+ struct ext2_dir_entry_2 *de;
+ struct page *page;
+
+ de = ext2_find_entry (dir, &dentry->d_name, &page);
+ if (de) {
+ res = le32_to_cpu(de->inode);
+ if (!res && de->file_type == EXT2_FT_WHT) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&dentry->d_lock);
+ }
+ ext2_put_page(page);
+ }
+ return res;
+}
+
/* Releases the page */
void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
struct page *page, struct inode *inode, int update_times)
@@ -523,7 +544,8 @@ static ext2_dirent * ext2_append_entry(struct dentry * dentry,
goto got_it;
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
- if (!de->inode && rec_len >= reclen)
+ if (!de->inode && (de->file_type != EXT2_FT_WHT) &&
+ (rec_len >= reclen))
goto got_it;
if (rec_len >= name_len + reclen)
goto got_it;
@@ -564,8 +586,11 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
return PTR_ERR(de);

err = -EEXIST;
- if (ext2_match (namelen, name, de))
+ if (ext2_match (namelen, name, de)) {
+ if (de->file_type == EXT2_FT_WHT)
+ goto got_it;
goto out_unlock;
+ }

got_it:
name_len = EXT2_DIR_REC_LEN(de->name_len);
@@ -577,7 +602,8 @@ got_it:
&page, NULL);
if (err)
goto out_unlock;
- if (de->inode) {
+ if (de->inode || ((de->file_type == EXT2_FT_WHT) &&
+ !ext2_match (namelen, name, de))) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
de->rec_len = ext2_rec_len_to_disk(name_len);
@@ -646,6 +672,68 @@ out:
return err;
}

+int ext2_whiteout_entry (struct inode * dir, struct dentry * dentry,
+ struct ext2_dir_entry_2 * de, struct page * page)
+{
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ loff_t pos;
+ int err;
+
+ if (!de) {
+ de = ext2_append_entry(dentry, &page);
+ BUG_ON(!de);
+ }
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de) &&
+ (de->file_type == EXT2_FT_WHT)) {
+ ext2_error(dir->i_sb, __func__,
+ "entry is already a whiteout in directory #%lu",
+ dir->i_ino);
+ goto out_unlock;
+ }
+
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
+ pos = page_offset(page) +
+ (char*)de - (char*)page_address(page);
+ err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
+ &page, NULL);
+ if (err)
+ goto out_unlock;
+ /*
+ * We whiteout an existing entry. Do what ext2_delete_entry() would do,
+ * except that we don't need to merge with the previous entry since
+ * we are going to reuse it.
+ */
+ if (ext2_match (namelen, name, de))
+ de->inode = 0;
+ if (de->inode || (de->file_type == EXT2_FT_WHT)) {
+ ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
+ de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+ de->rec_len = ext2_rec_len_to_disk(name_len);
+ de = de1;
+ }
+ de->name_len = namelen;
+ memcpy(de->name, name, namelen);
+ de->inode = 0;
+ de->file_type = EXT2_FT_WHT;
+ err = ext2_commit_chunk(page, pos, rec_len);
+ dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
+ EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
+ mark_inode_dirty(dir);
+ /* OFFSET_CACHE */
+out_put:
+ ext2_put_page(page);
+ return err;
+out_unlock:
+ unlock_page(page);
+ goto out_put;
+}
+
/*
* Set the first fragment of directory.
*/
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 52b34f1..89ab2f7 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -102,9 +102,12 @@ extern void ext2_rsv_window_add(struct super_block *sb, struct ext2_reserve_wind
/* dir.c */
extern int ext2_add_link (struct dentry *, struct inode *);
extern ino_t ext2_inode_by_name(struct inode *, struct qstr *);
+extern ino_t ext2_inode_by_dentry(struct inode *, struct dentry *);
extern int ext2_make_empty(struct inode *, struct inode *);
extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *, struct page **);
extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
+extern int ext2_whiteout_entry (struct inode *, struct dentry *,
+ struct ext2_dir_entry_2 *, struct page *);
extern int ext2_empty_dir (struct inode *);
extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3675088..f31b872 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1261,7 +1261,8 @@ void ext2_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT2_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|
+ S_OPAQUE);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1272,6 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (flags & EXT2_OPAQUE_FL)
+ inode->i_flags |= S_OPAQUE;
}

/* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
@@ -1279,8 +1282,8 @@ void ext2_get_inode_flags(struct ext2_inode_info *ei)
{
unsigned int flags = ei->vfs_inode.i_flags;

- ei->i_flags &= ~(EXT2_SYNC_FL|EXT2_APPEND_FL|
- EXT2_IMMUTABLE_FL|EXT2_NOATIME_FL|EXT2_DIRSYNC_FL);
+ ei->i_flags &= ~(EXT2_SYNC_FL|EXT2_APPEND_FL|EXT2_IMMUTABLE_FL|
+ EXT2_NOATIME_FL|EXT2_DIRSYNC_FL|EXT2_OPAQUE_FL);
if (flags & S_SYNC)
ei->i_flags |= EXT2_SYNC_FL;
if (flags & S_APPEND)
@@ -1291,6 +1294,8 @@ void ext2_get_inode_flags(struct ext2_inode_info *ei)
ei->i_flags |= EXT2_NOATIME_FL;
if (flags & S_DIRSYNC)
ei->i_flags |= EXT2_DIRSYNC_FL;
+ if (flags & S_OPAQUE)
+ ei->i_flags |= EXT2_OPAQUE_FL;
}

struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 71efb0e..8f92dd0 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -55,15 +55,16 @@ static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
* Methods themselves.
*/

-static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry, struct nameidata *nd)
+static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry,
+ struct nameidata *nd)
{
struct inode * inode;
ino_t ino;
-
+
if (dentry->d_name.len > EXT2_NAME_LEN)
return ERR_PTR(-ENAMETOOLONG);

- ino = ext2_inode_by_name(dir, &dentry->d_name);
+ ino = ext2_inode_by_dentry(dir, dentry);
inode = NULL;
if (ino) {
inode = ext2_iget(dir->i_sb, ino);
@@ -307,6 +308,61 @@ static int ext2_rmdir (struct inode * dir, struct dentry *dentry)
return err;
}

+/*
+ * Create a whiteout for the dentry
+ */
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
+ struct dentry *new_dentry)
+{
+ struct inode * inode = dentry->d_inode;
+ struct ext2_dir_entry_2 * de = NULL;
+ struct page * page;
+ int err = -ENOTEMPTY;
+
+ if (!EXT2_HAS_INCOMPAT_FEATURE(dir->i_sb,
+ EXT2_FEATURE_INCOMPAT_FILETYPE)) {
+ ext2_error (dir->i_sb, "ext2_whiteout",
+ "can't set whiteout filetype");
+ err = -EPERM;
+ goto out;
+ }
+
+ dquot_initialize(dir);
+
+ if (inode) {
+ if (S_ISDIR(inode->i_mode) && !ext2_empty_dir(inode))
+ goto out;
+
+ err = -ENOENT;
+ de = ext2_find_entry (dir, &dentry->d_name, &page);
+ if (!de)
+ goto out;
+ lock_page(page);
+ }
+
+ err = ext2_whiteout_entry (dir, dentry, de, page);
+ if (err)
+ goto out;
+
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ d_add(new_dentry, NULL);
+
+ if (inode) {
+ inode->i_ctime = dir->i_ctime;
+ inode_dec_link_count(inode);
+ if (S_ISDIR(inode->i_mode)) {
+ inode->i_size = 0;
+ inode_dec_link_count(inode);
+ inode_dec_link_count(dir);
+ }
+ }
+ err = 0;
+out:
+ return err;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -409,6 +465,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.mkdir = ext2_mkdir,
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
+ .whiteout = ext2_whiteout,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7ff43f4..704521b 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1092,9 +1092,14 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
if (EXT2_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL))
ext2_msg(sb, KERN_WARNING,
"warning: mounting ext3 filesystem as ext2");
+
+ if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_WHITEOUT))
+ sb->s_flags |= MS_WHITEOUT;
+
if (ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY))
sb->s_flags |= MS_RDONLY;
ext2_write_super(sb);
+
return 0;

cantfind_ext2:
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 2dfa707..b0fb356 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -189,6 +189,7 @@ struct ext2_group_desc
#define EXT2_NOTAIL_FL FS_NOTAIL_FL /* file tail should not be merged */
#define EXT2_DIRSYNC_FL FS_DIRSYNC_FL /* dirsync behaviour (directories only) */
#define EXT2_TOPDIR_FL FS_TOPDIR_FL /* Top of directory hierarchies*/
+#define EXT2_OPAQUE_FL FS_OPAQUE_FL /* Dir is opaque */
#define EXT2_RESERVED_FL FS_RESERVED_FL /* reserved for ext2 lib */

#define EXT2_FL_USER_VISIBLE FS_FL_USER_VISIBLE /* User visible flags */
@@ -503,10 +504,12 @@ struct ext2_super_block {
#define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
#define EXT2_FEATURE_INCOMPAT_ANY 0xffffffff

#define EXT2_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+ EXT2_FEATURE_INCOMPAT_WHITEOUT| \
EXT2_FEATURE_INCOMPAT_META_BG)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -573,6 +576,7 @@ enum {
EXT2_FT_FIFO = 5,
EXT2_FT_SOCK = 6,
EXT2_FT_SYMLINK = 7,
+ EXT2_FT_WHT = 8,
EXT2_FT_MAX
};

--
1.6.3.3

2010-08-08 16:03:59

by Valerie Aurora

Subject: [PATCH 10/39] whiteout: Split of ext2_append_link() from ext2_add_link()

From: Jan Blunck <[email protected]>

ext2_append_entry() is later used to find or append a directory
entry to white out.

Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Theodore Tso <[email protected]>
Cc: [email protected]
---
fs/ext2/dir.c | 70 ++++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..57207a9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -472,9 +472,10 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
}

/*
- * Parent is locked.
+ * Find or append a given dentry to the parent directory
*/
-int ext2_add_link (struct dentry *dentry, struct inode *inode)
+static ext2_dirent * ext2_append_entry(struct dentry * dentry,
+ struct page ** page)
{
struct inode *dir = dentry->d_parent->d_inode;
const char *name = dentry->d_name.name;
@@ -482,13 +483,10 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
unsigned chunk_size = ext2_chunk_size(dir);
unsigned reclen = EXT2_DIR_REC_LEN(namelen);
unsigned short rec_len, name_len;
- struct page *page = NULL;
- ext2_dirent * de;
+ ext2_dirent * de = NULL;
unsigned long npages = dir_pages(dir);
unsigned long n;
char *kaddr;
- loff_t pos;
- int err;

/*
* We take care of directory expansion in the same loop.
@@ -498,20 +496,19 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
for (n = 0; n <= npages; n++) {
char *dir_end;

- page = ext2_get_page(dir, n, 0);
- err = PTR_ERR(page);
- if (IS_ERR(page))
+ *page = ext2_get_page(dir, n, 0);
+ de = ERR_PTR(PTR_ERR(*page));
+ if (IS_ERR(*page))
goto out;
- lock_page(page);
- kaddr = page_address(page);
+ lock_page(*page);
+ kaddr = page_address(*page);
dir_end = kaddr + ext2_last_byte(dir, n);
de = (ext2_dirent *)kaddr;
kaddr += PAGE_CACHE_SIZE - reclen;
while ((char *)de <= kaddr) {
if ((char *)de == dir_end) {
/* We hit i_size */
- name_len = 0;
- rec_len = chunk_size;
+ de->name_len = 0;
de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
goto got_it;
@@ -519,12 +516,11 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
if (de->rec_len == 0) {
ext2_error(dir->i_sb, __func__,
"zero-length directory entry");
- err = -EIO;
+ de = ERR_PTR(-EIO);
goto out_unlock;
}
- err = -EEXIST;
if (ext2_match (namelen, name, de))
- goto out_unlock;
+ goto got_it;
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && rec_len >= reclen)
@@ -533,13 +529,48 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
goto got_it;
de = (ext2_dirent *) ((char *) de + rec_len);
}
- unlock_page(page);
- ext2_put_page(page);
+ unlock_page(*page);
+ ext2_put_page(*page);
}
+
BUG();
- return -EINVAL;

got_it:
+ return de;
+ /* OFFSET_CACHE */
+out_unlock:
+ unlock_page(*page);
+ ext2_put_page(*page);
+out:
+ return de;
+}
+
+/*
+ * Parent is locked.
+ */
+int ext2_add_link (struct dentry *dentry, struct inode *inode)
+{
+ struct inode *dir = dentry->d_parent->d_inode;
+ const char *name = dentry->d_name.name;
+ int namelen = dentry->d_name.len;
+ unsigned short rec_len, name_len;
+ ext2_dirent * de;
+ struct page *page;
+ loff_t pos;
+ int err;
+
+ de = ext2_append_entry(dentry, &page);
+ if (IS_ERR(de))
+ return PTR_ERR(de);
+
+ err = -EEXIST;
+ if (ext2_match (namelen, name, de))
+ goto out_unlock;
+
+got_it:
+ name_len = EXT2_DIR_REC_LEN(de->name_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
+
pos = page_offset(page) +
(char*)de - (char*)page_address(page);
err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
@@ -563,7 +594,6 @@ got_it:
/* OFFSET_CACHE */
out_put:
ext2_put_page(page);
-out:
return err;
out_unlock:
unlock_page(page);
--
1.6.3.3

2010-08-09 22:56:55

by NeilBrown

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

On Sun, 8 Aug 2010 11:52:31 -0400
Valerie Aurora <[email protected]> wrote:


> +A union mount layers one read-write file system over one or more
> +read-only file systems, with all writes going to the writable file
> +system. The namespace of both file systems appears as a combined
> +whole to userland, with files and directories on the writable file
> +system covering up any files or directories with matching pathnames on
> +the read-only file system. The read-write file system is the
> +"topmost" or "upper" file system and the read-only file systems are
> +the "lower" file systems. A few use cases:
> +
> +- Root file system on CD with writes saved to hard drive (LiveCD)
> +- Multiple virtual machines with the same starting root file system
> +- Cluster with NFS mounted root on clients
> +
> +Most if not all of these problems could be solved with a COW block
> +device or a clustered file system (include NFS mounts). However, for
> +some use cases, sharing is more efficient and better performing if
> +done at the file system namespace level. COW block devices only
> +increase their divergence as time goes on, and a fully coherent
> +writable file system is unnecessary synchronization overhead if no
> +other client needs to see the writes.

Thanks for including lots of documentation!
Given how intrusive this patch set is, I would really like to see the
justification above fleshed out a bit more.

What would be particularly valuable would be real-life use cases where
someone has put this to work and found that it genuinely meets a need.
I realise there can be a bit of a chicken/egg issue there, but if you do have
anything it would be good to include it.
A particular reason for this is the fact that a number of standard features
are not going to be supported and it would be good to be sure that there are
real cases that don't need those.

...

> +Non-features
> +------------
> +
> +Features we do not currently plan to support in union mounts:
> +
> +Online upgrade: E.g., installing software on a file system NFS
> +exported to clients while the clients are still up and running.
> +Allowing the read-only bottom layer of a union mount to change
> +invalidates our locking strategy.

I wonder if the restriction is not more serious than this.
Given the prevalence of "copy-up", particularly of directories, I would think
that even off-line upgrade would not be supported.
If the upgrade adds a file in a directory that has already been read (and
hence copied-up), or changes a file that has been chmodded, then the upgrade
will not be completely visible, which sounds dangerous.

Don't you have to require (or strongly recommend) that the underlying
filesystem remain unchanged while the on-top filesystem exists, not just
while it is mounted ??



As a counter-position for you or others to write cogent arguments against,
and to then include those arguments in the justification section, I would
like to present my preferred approach, which is essentially that the problem
is better solved at the block layer or the distro layer.

A distro-layer solution would be appropriate when you want a common root
filesystem with per-host configuration, whether in an NFS cluster or a
virtual-machine cluster.
This involves every file that might need configuration being made a symlink
to e.g. /local, and every instance mounts some local directory on /local.
e.g. mount --bind /local-`hostname` /local

This is obviously less transparent, but it is also more predictable (you
know exactly what can and cannot be changed by an upgrade on the shared
filesystem).

A convincing use case that required NFS sharing and required significantly
more customisation than just some config file would be a good
counter-argument to this.

I see two block-layer solutions. The obvious is a COW block device as you
have mentioned. I am not convinced that it is as bad as you think.
Particularly if the COW device could advertise that it handles small
'discard' requests efficiently, and if filesystems could then send small
discard requests whenever appropriate, the wastage due to divergence need not
be too great.
In any case, some hard numbers like "Performing a kernel compile on a COW
device requires N meg of space while using a union-mounted filesystem it
requires M ( << N) meg of space" would help a lot. (of course that is a silly
test as we would use "make O=/somewhere/else", not COW or Union for that
task).

The second solution would be filesystem specific, and hence a good selling
point of a new up-and-coming filesystem.
If a filesystem was comfortable with data on multiple devices, and was able
to copy-on-write files, then it should be relatively easy to give it a
read-only device and a clean read-write device, and tell it to write all changes
only to the second device (and never update even the filesystem metadata on
the first device). The filesystem could then make effective use of any space
available in the second device, without wastage.


> +Thank you for reading!

Thank you for writing!

NeilBrown

2010-08-11 02:03:06

by J. R. Okajima

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation


Neil Brown:
> I wonder if the restriction is not more serious than this.
> Given the prevalence of "copy-up", particularly of directories, I would think
> that even off-line upgrade would not be supported.
> If the upgrade adds a file in a directory that has already been read (and
> hence copied-up), or changes a file that has been chmodded, then the upgrade
> will not be completely visible, which sounds dangerous.
:::
> I see two block-layer solutions. The obvious is a COW block device as you
> have mentioned. I am not convinced that it is as bad as you think.
:::

DM snapshot provides the COW block feature, and it will match your idea
since the COW device is generally much smaller. But it doesn't
support off-line upgrade either. If you do one, the DM snapshot device
sees the equivalent of a corrupted filesystem.

Here are the pros and cons of DM snapshot compared to a union.
- the number of bytes to be copied between devices is much smaller.

- the type of filesystem must be one and only.
- the fs must be writable, no readonly fs, even for the lower original
device. so the compression fs will not be usable. but if we use
loopback mount, we may address this issue.
for instance,
mount /cdrom/squashfs.img /sq
losetup /sq/ext2.img
losetup /somewhere/cow
dmsetup "snapshot /dev/loop0 /dev/loop1 ..."

- it will be difficult (or needs more operations) to extract the
difference between the original device and COW.

- DM snapshot-merge may help a lot when users try merging. in the
fs-layer union, users will use rsync(1).

- in fs-based union, users can add/remove members (layers) dynamically
without unmounting. of course, no file on the layer being removed should
be busy.


Also, here are my concerns about UnionMount. All these issues have been
reported before.
- for users, the inode number may change silently. eg. copy-up.
- link(2) may break by copy-up.
- read(2) may return obsolete file data (fstat(2) too).
- fcntl(F_SETLK) may be broken by copy-up.
- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after
open(O_RDWR).


J. R. Okajima

2010-08-13 13:49:22

by Miklos Szeredi

Subject: Re: [PATCH 19/39] union-mount: Implement union lookup

On Sun, 8 Aug 2010, Valerie Aurora wrote:
> Implement unioned directories, whiteouts, and fallthrus in pathname
> lookup routines. do_lookup() and lookup_hash() call lookup_union()
> after looking up the dentry from the top-level file system.
> lookup_union() is centered around __lookup_hash(), which does cached
> and/or real lookups and revalidates each dentry in the union stack.
>
> XXX - implement negative union cache entries
>
> XXX - handle different permissions on directories

If the process doing the lookup doesn't have write permission on the top
level directory then the lookup will fail. This is not intended, is
it?

Thanks,
Miklos

2010-08-13 13:52:13

by Miklos Szeredi

Subject: Re: [PATCH 23/39] fallthru: ext2 fallthru support

On Sun, 8 Aug 2010, Valerie Aurora wrote:
> Add support for fallthru directory entries to ext2.

This still doesn't work correctly, only now after unmounting/mounting
the top layer the directory appears empty:

uml:~# mount -oloop -r ext3.img /mnt/img/
uml:~# losetup /dev/loop2 ovl.img
uml:~# /host/store/git/e2fsprogs/misc/mke2fs -O whiteout,fallthru /dev/loop2
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
1280 inodes, 10240 blocks
512 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=10485760
2 block groups
8192 blocks per group, 8192 fragments per group
640 inodes per group
Superblock backups stored on blocks:
8193

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 24 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
uml:~# "ls" /mnt/img
lost+found union
uml:~# "ls" /mnt/img/union
1 2 3
uml:~# umount /mnt/img
uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
uml:~# ls -l /mnt/img/
total 13
drwx------ 2 root root 12288 Aug 13 13:42 lost+found
drwxr-xr-x 2 root root 1024 Aug 13 13:42 union
uml:~# ls -la /mnt/img/union/
total 2
drwxr-xr-x 2 root root 1024 Aug 13 13:42 .
drwxr-xr-x 4 root root 1024 Aug 13 13:42 ..
uml:~# grep /mnt/img /proc/self/mountinfo
21 12 7:0 / /mnt/img ro,relatime - ext3 /dev/loop0 ro,errors=continue,barrier=0,data=writeback
22 21 7:2 / /mnt/img rw,relatime,union - ext2 /dev/loop2 rw,errors=continue

Thanks,
Miklos

2010-08-17 20:44:57

by Valerie Aurora

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

On Tue, Aug 10, 2010 at 08:56:41AM +1000, Neil Brown wrote:
> On Sun, 8 Aug 2010 11:52:31 -0400
> Valerie Aurora <[email protected]> wrote:
>
>
> > +A union mount layers one read-write file system over one or more
> > +read-only file systems, with all writes going to the writable file
> > +system. The namespace of both file systems appears as a combined
> > +whole to userland, with files and directories on the writable file
> > +system covering up any files or directories with matching pathnames on
> > +the read-only file system. The read-write file system is the
> > +"topmost" or "upper" file system and the read-only file systems are
> > +the "lower" file systems. A few use cases:
> > +
> > +- Root file system on CD with writes saved to hard drive (LiveCD)
> > +- Multiple virtual machines with the same starting root file system
> > +- Cluster with NFS mounted root on clients
> > +
> > +Most if not all of these problems could be solved with a COW block
> > +device or a clustered file system (include NFS mounts). However, for
> > +some use cases, sharing is more efficient and better performing if
> > +done at the file system namespace level. COW block devices only
> > +increase their divergence as time goes on, and a fully coherent
> > +writable file system is unnecessary synchronization overhead if no
> > +other client needs to see the writes.
>
> Thanks for including lots of documentation!
> Given how intrusive this patch set is, I would really like to see the
> justification above fleshed out a bit more.
>
> What would be particularly valuable would be real-life use cases where
> someone has put this to work and found that it genuinely meets a need.
> I realise there can be a bit of a chicken/egg issue there, but if you do have
> anything it would be good to include it.

I felt the way you did until I talked to several users who explained
to me why none of the existing solutions worked well for their use
case. The real-life use cases are those where people are currently
using unionfs and aufs, which include many live CDs, Linux appliances,
and at least three national lab computer clusters. The best argument
for their need for a union file system is that they are using unionfs
and aufs despite the pain of using out-of-mainline code and (according
to the users I have spoken to) frequent crashes. Union mounts is
intended as an in-mainline replacement for the existing users of
unionfs and aufs.

I'm not sure this needs to be in Documentation/ - at the point it is
merged into mainline, we will have already agreed on whether it is
necessary. :)

> > +Non-features
> > +------------
> > +
> > +Features we do not currently plan to support in union mounts:
> > +
> > +Online upgrade: E.g., installing software on a file system NFS
> > +exported to clients while the clients are still up and running.
> > +Allowing the read-only bottom layer of a union mount to change
> > +invalidates our locking strategy.
>
> I wonder if the restriction is not more serious than this.
> Given the prevalence of "copy-up", particularly of directories, I would think
> that even off-line upgrade would not be supported.
> If the upgrade adds a file in a directory that has already been read (and
> hence copied-up), or changes a file that has been chmodded, then the upgrade
> will not be completely visible, which sounds dangerous.
>
> Don't you have to require (or strongly recommend) that the underlying
> filesystem remain unchanged while the on-top filesystem exists, not just
> while it is mounted ??

It is true, you have to know what you are doing and carefully groom
both file systems if you want to change the lower file system and get
the effect you intended. Just updating the lower file system and
slapping the overlay back on will probably not accomplish what you
want.

But frankly, this is an impossible problem to solve generically at the
file system level. When a user says, "Show the changes to the lower
file system in my overlaid file system," they are actually saying,
"Replace everything in /bin, but not /etc/hostname, and merge the
lower package database with the upper package database, and update
/etc/resolv.conf, unless it's the mailserver..." If you look into the
problems with merging after running in disconnected mode in Coda, it's
exactly the same set of problems. They "solved" it by proposing
application-specific merging programs that you run one by one for each
file that was modified in two places during the time the client was
disconnected. Here's the first quote I found:

http://www.coda.cs.cmu.edu/ljpaper/lj.html

"The second issue is that during reintegration it may appear that
during the disconnection another client has modified the file too and
have shipped it to the server. This is called a local/global conflict
(viz. Client/Server) which needs repair. Repairs can sometimes be done
automatically by application specific resolvers (which know that one
client inserting an appointment into a calendar file for Monday and
another client inserting one for Tuesday have not created an
irresolvable conflict). Sometimes, but quite infrequently, human
intervention is needed to repair the conflict."

Union mounts doesn't solve the problem of how to resolve conflicts
between two versions of a file system. All I can do is give you tools
to clear opaque flags, delete fallthrus and whiteouts, and things like
that. You can, for example, clear all opaque directory flags and
fallthrus in the overlay, so that new files will show up but deleted
files will continue to be whited-out - which may be what you want,
unless it's not.
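
To make that last example concrete, here is a toy user-space model of how
one directory in the overlay decides whether a name is visible. It is only
a sketch of the semantics as described in this thread, not the kernel code:
all names in it (union_lookup(), struct upper_dir, and so on) are
hypothetical, and it assumes a single lower layer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Kinds of directory entry the upper (writable) layer can hold. */
enum entry_kind { ENT_FILE, ENT_WHITEOUT, ENT_FALLTHRU };

/* Where a lookup ends up finding the name. */
enum where { FOUND_NOWHERE, FOUND_UPPER, FOUND_LOWER };

struct upper_entry {
	const char *name;
	enum entry_kind kind;
};

struct upper_dir {
	int opaque;			/* set: never consult the lower layer */
	const struct upper_entry *entries;
	size_t n;
};

static int in_lower(const char *const *lower, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(lower[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Toy lookup: an upper entry always decides first. A whiteout hides the
 * name, a fallthru redirects to the lower layer, and a name missing from
 * the upper layer reaches the lower layer only if the directory is not
 * opaque.
 */
static enum where union_lookup(const struct upper_dir *up,
			       const char *const *lower, size_t nlower,
			       const char *name)
{
	for (size_t i = 0; i < up->n; i++) {
		if (strcmp(up->entries[i].name, name) != 0)
			continue;
		switch (up->entries[i].kind) {
		case ENT_FILE:
			return FOUND_UPPER;
		case ENT_WHITEOUT:
			return FOUND_NOWHERE;
		case ENT_FALLTHRU:
			return in_lower(lower, nlower, name) ?
				FOUND_LOWER : FOUND_NOWHERE;
		}
	}
	if (up->opaque)
		return FOUND_NOWHERE;
	return in_lower(lower, nlower, name) ? FOUND_LOWER : FOUND_NOWHERE;
}
```

In this model, an opaque directory holding a fallthru for "a" and a
whiteout for "b" will not show a file "c" that an upgrade later adds to the
lower layer. After clearing the opaque flag and deleting the fallthru, "a"
and "c" both resolve to the lower layer while "b" stays hidden.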

Another thing you can do is do the upgrade on the union mounted fs,
unmount it, remount the fs's separately, and then do a comparison and
delete all files on the overlay that are identical on both fs's. All
this only makes sense to do in userspace, it's way too complicated and
policy-ridden to do in-kernel and online.
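A userspace comparison pass of that kind could look roughly like the following Python sketch. It assumes both file systems are mounted separately at ordinary paths; the function name prune_identical() and the dry_run parameter are hypothetical, and it compares file contents only (not ownership, modes, or xattrs), so treat it as a starting point rather than a finished tool.

```python
import filecmp
import os


def prune_identical(overlay_root, lower_root, dry_run=True):
    """Find (and optionally delete) overlay files identical to the lower copy.

    Both file systems are assumed to be mounted separately, not unioned.
    Returns the sorted relative paths that were (or would be) removed.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(overlay_root):
        for name in filenames:
            upper_path = os.path.join(dirpath, name)
            rel = os.path.relpath(upper_path, overlay_root)
            lower_path = os.path.join(lower_root, rel)
            # Only regular files on both sides are compared; symlinks
            # and special files are left alone.
            if os.path.islink(upper_path) or os.path.islink(lower_path):
                continue
            if not os.path.isfile(lower_path):
                continue
            # shallow=False forces a byte-by-byte content comparison.
            if filecmp.cmp(upper_path, lower_path, shallow=False):
                removed.append(rel)
                if not dry_run:
                    os.remove(upper_path)
    return sorted(removed)
```

Running it first with dry_run=True gives a list to review before anything is deleted, which seems prudent given how much policy is hidden in "identical".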

To solve the upgrade problem, you don't need a file system, you need
to use a tool like Puppet, which will automatically upgrade and
configure thousands of hosts using recipes:

http://www.puppetlabs.com/puppet/introduction/

> As a counter-position for you or others to write cogent arguments against,
> and to then include those arguments in the justification section, I would
> like to present my preferred approach, which is essentially that the problem
> is better solved at the block layer or the distro layer.

I personally like the block layer solution better and would be
happiest if all unionfs and aufs users switched to it and no one
needed union mounts. :) This is one case where the author is not in
love with the solution. I'm not going to argue for the need for it
beyond noting the existing unionfs and aufs user base.

As to whether union mounts will work for the same cases as unionfs and
aufs, all I can say is that union mounts offers as much functionality
as I can figure out how to give without crashing the kernel. At that
point userspace will have to either rewrite to work around problems or
else keep using out-of-mainline code.

-VAL

2010-08-17 21:08:52

by Valerie Aurora

Subject: Re: [PATCH 23/39] fallthru: ext2 fallthru support

On Fri, Aug 13, 2010 at 03:52:02PM +0200, Miklos Szeredi wrote:
> On Sun, 8 Aug 2010, Valerie Aurora wrote:
> > Add support for fallthru directory entries to ext2.
>
> This still doesn't work correctly, only now after unmounting/mounting
> the top layer the directory appears empty:

Thanks, I will look into this.

-VAL

> uml:~# mount -oloop -r ext3.img /mnt/img/
> uml:~# losetup /dev/loop2 ovl.img
> uml:~# /host/store/git/e2fsprogs/misc/mke2fs -O whiteout,fallthru /dev/loop2
> mke2fs 1.41.12 (17-May-2010)
> Filesystem label=
> OS type: Linux
> Block size=1024 (log=0)
> Fragment size=1024 (log=0)
> Stride=0 blocks, Stripe width=0 blocks
> 1280 inodes, 10240 blocks
> 512 blocks (5.00%) reserved for the super user
> First data block=1
> Maximum filesystem blocks=10485760
> 2 block groups
> 8192 blocks per group, 8192 fragments per group
> 640 inodes per group
> Superblock backups stored on blocks:
> 8193
>
> Allocating group tables: done
> Writing inode tables: done
> Writing superblocks and filesystem accounting information: done
>
> This filesystem will be automatically checked every 24 mounts or
> 180 days, whichever comes first. Use tune2fs -c or -i to override.
> uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
> uml:~# "ls" /mnt/img
> lost+found union
> uml:~# "ls" /mnt/img/union
> 1 2 3
> uml:~# umount /mnt/img
> uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
> uml:~# ls -l /mnt/img/
> total 13
> drwx------ 2 root root 12288 Aug 13 13:42 lost+found
> drwxr-xr-x 2 root root 1024 Aug 13 13:42 union
> uml:~# ls -la /mnt/img/union/
> total 2
> drwxr-xr-x 2 root root 1024 Aug 13 13:42 .
> drwxr-xr-x 4 root root 1024 Aug 13 13:42 ..
> uml:~# grep /mnt/img /proc/self/mountinfo
> 21 12 7:0 / /mnt/img ro,relatime - ext3 /dev/loop0 ro,errors=continue,barrier=0,data=writeback
> 22 21 7:2 / /mnt/img rw,relatime,union - ext2 /dev/loop2 rw,errors=continue
>
> Thanks,
> Miklos

2010-08-17 21:44:28

by Valerie Aurora

Subject: Re: [PATCH 19/39] union-mount: Implement union lookup

On Fri, Aug 13, 2010 at 03:49:04PM +0200, Miklos Szeredi wrote:
> On Sun, 8 Aug 2010, Valerie Aurora wrote:
> > Implement unioned directories, whiteouts, and fallthrus in pathname
> > lookup routines. do_lookup() and lookup_hash() call lookup_union()
> > after looking up the dentry from the top-level file system.
> > lookup_union() is centered around __lookup_hash(), which does cached
> > and/or real lookups and revalidates each dentry in the union stack.
> >
> > XXX - implement negative union cache entries
> >
> > XXX - handle different permissions on directories
>
> If process doing the lookup doesn't have write permission on the top
> level directory then the lookup will fail. This is not intended, is
> it?

Does it fail? I'm not checking permissions before calling
->fallthru(). But I can't test this because the code doesn't set the
owner of the copied up directory correctly. :)

Don't bother doing any permission testing on this version - it's known
buggy and I will fix it in the next release.

Thanks,

-VAL

2010-08-17 22:29:07

by Valerie Aurora

Subject: Re: [PATCH 23/39] fallthru: ext2 fallthru support

On Fri, Aug 13, 2010 at 03:52:02PM +0200, Miklos Szeredi wrote:
> On Sun, 8 Aug 2010, Valerie Aurora wrote:
> > Add support for fallthru directory entries to ext2.
>
> This still doesn't work correctly, only now after unmounting/mounting
> the top layer the directory appears empty:

Thanks, I will look into this.

-VAL

> uml:~# mount -oloop -r ext3.img /mnt/img/
> uml:~# losetup /dev/loop2 ovl.img
> uml:~# /host/store/git/e2fsprogs/misc/mke2fs -O whiteout,fallthru /dev/loop2
> mke2fs 1.41.12 (17-May-2010)
> Filesystem label=
> OS type: Linux
> Block size=1024 (log=0)
> Fragment size=1024 (log=0)
> Stride=0 blocks, Stripe width=0 blocks
> 1280 inodes, 10240 blocks
> 512 blocks (5.00%) reserved for the super user
> First data block=1
> Maximum filesystem blocks=10485760
> 2 block groups
> 8192 blocks per group, 8192 fragments per group
> 640 inodes per group
> Superblock backups stored on blocks:
> 8193
>
> Allocating group tables: done
> Writing inode tables: done
> Writing superblocks and filesystem accounting information: done
>
> This filesystem will be automatically checked every 24 mounts or
> 180 days, whichever comes first. Use tune2fs -c or -i to override.
> uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
> uml:~# "ls" /mnt/img
> lost+found union
> uml:~# "ls" /mnt/img/union
> 1 2 3
> uml:~# umount /mnt/img
> uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
> uml:~# ls -l /mnt/img/
> total 13
> drwx------ 2 root root 12288 Aug 13 13:42 lost+found
> drwxr-xr-x 2 root root 1024 Aug 13 13:42 union
> uml:~# ls -la /mnt/img/union/
> total 2
> drwxr-xr-x 2 root root 1024 Aug 13 13:42 .
> drwxr-xr-x 4 root root 1024 Aug 13 13:42 ..
> uml:~# grep /mnt/img /proc/self/mountinfo
> 21 12 7:0 / /mnt/img ro,relatime - ext3 /dev/loop0 ro,errors=continue,barrier=0,data=writeback
> 22 21 7:2 / /mnt/img rw,relatime,union - ext2 /dev/loop2 rw,errors=continue
>
> Thanks,
> Miklos

2010-08-17 22:53:49

by NeilBrown

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

On Tue, 17 Aug 2010 16:44:30 -0400
Valerie Aurora wrote:

> On Tue, Aug 10, 2010 at 08:56:41AM +1000, Neil Brown wrote:
> > On Sun, 8 Aug 2010 11:52:31 -0400
> > Valerie Aurora wrote:
> >
> >
> > > +A union mount layers one read-write file system over one or more
> > > +read-only file systems, with all writes going to the writable file
> > > +system. The namespace of both file systems appears as a combined
> > > +whole to userland, with files and directories on the writable file
> > > +system covering up any files or directories with matching pathnames on
> > > +the read-only file system. The read-write file system is the
> > > +"topmost" or "upper" file system and the read-only file systems are
> > > +the "lower" file systems. A few use cases:
> > > +
> > > +- Root file system on CD with writes saved to hard drive (LiveCD)
> > > +- Multiple virtual machines with the same starting root file system
> > > +- Cluster with NFS mounted root on clients
> > > +
> > > +Most if not all of these problems could be solved with a COW block
> > > +device or a clustered file system (include NFS mounts). However, for
> > > +some use cases, sharing is more efficient and better performing if
> > > +done at the file system namespace level. COW block devices only
> > > +increase their divergence as time goes on, and a fully coherent
> > > +writable file system is unnecessary synchronization overhead if no
> > > +other client needs to see the writes.
> >
> > Thanks for including lots of documentation!
> > Given how intrusive this patch set is, I would really like to see the
> > justification above fleshed out a bit more.
> >
> > What would be particularly valuable would be real-life use cases where
> > someone has put this to work and found that it genuinely meets a need.
> > I realise there can be a bit of a chicken/egg issue there, but if you do have
> > anything it would be good to include it.
>
> I felt the way you did until I talked to several users who explained
> to me why none of the existing solutions worked well for their use
> case. The real-life use cases are those where people are currently
> using unionfs and aufs, which include many live CDs, Linux appliances,
> and at least three national lab computer clusters. The best argument
> for their need for a union file system is that they are using unionfs
> and aufs despite the pain of using out-of-mainline code and (according
> to the users I have spoken to) frequent crashes. Union mounts is
> intended as an in-mainline replacement for the existing users of
> unionfs and aufs.

You present a good argument that "something must be done", but it gives no
pointers to what that something should be.
I don't suppose it is possible to get that explanation you mention in writing?

>
> I'm not sure this needs to be in Documentation/ - at the point it is
> merged into mainline, we will have already agreed on whether it is
> necessary. :)

However, until it is merged into mainline it would be good to keep the
justification of this change well documented so you don't have to repeat the
same argument to every bozo who pops up and thinks they know better.
Ultimately the git commit log (or even an lwn.net article) could well be a
better place to store this rather than Documentation/, but I think there is
still value in it being written.


>
> > > +Non-features
> > > +------------
> > > +
> > > +Features we do not currently plan to support in union mounts:
> > > +
> > > +Online upgrade: E.g., installing software on a file system NFS
> > > +exported to clients while the clients are still up and running.
> > > +Allowing the read-only bottom layer of a union mount to change
> > > +invalidates our locking strategy.
> >
> > I wonder if the restriction is not more serious than this.
> > Given the prevalence of "copy-up", particularly of directories, I would think
> > that even off-line upgrade would not be supported.
> > If the upgrade adds a file in a directory that has already been read (and
> > hence copied-up), or changes a file that has been chmodded, then the upgrade
> > will not be completely visible, which sounds dangerous.
> >
> > Don't you have to require (or strongly recommend) that the underlying
> > filesystem remain unchanged while the on-top filesystem exists, not just
> > while it is mounted ??
>
> It is true, you have to know what you are doing and carefully groom
> both file systems if you want to change the lower file system and get
> the effect you intended. Just updating the lower file system and
> slapping the overlay back on will probably not accomplish what you
> want.
>
> But frankly, this is an impossible problem to solve generically at the
> file system level.

Absolutely right - no argument about that.
I just think that should be explicit in the documentation.
Right after the "Online upgrade" paragraph:

Even off-line upgrade - e.g. installing software on an exported filesystem
and then remounting that on the client and union-mounting a pre-existing
overlay on top of it - is significantly non-trivial and would require
significant extra management software to create a working solution.
(or something like that, but more than just one long sentence).


>
> > As a counter-position for you or others to write cogent arguments against,
> > and to then include those arguments in the justification section, I would
> > like to present my preferred approach, which is essentially that the problem
> > is better solved at the block layer or the distro layer.
>
> I personally like the block layer solution better and would be
> happiest if all unionfs and aufs users switched to it and no one
> needed union mounts. :) This is one case where the author is not in
> love with the solution. I'm not going to argue for the need for it
> beyond noting the existing unionfs and aufs user base.

That may be enough justification to work on this as a research project, but I
don't think it is enough justification to merge it into mainline.

Just because aufs might be the best available solution to a particular problem
doesn't mean that making a better aufs (aka VFS union mounts) will be the best
possible solution. That can only be determined if the key needs, and the
problems with all available solutions, are publicly known.

Thanks,
NeilBrown

2010-08-18 00:23:42

by Luca Barbieri

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

>> I personally like the block layer solution better and would be
>> happiest if all unionfs and aufs users switched to it and no one
>> needed union mounts. :)

I think that the safety of personal data and the ability to make
changes to the layers independently are very important features that
justify union mounts.

If you do block-level COW and then lose the lower filesystem layer
(e.g. lose the LiveCD or lose network access to the NFS master), then
you have no guarantee of being able to access the data you added (e.g.
your /home) since you'll only have a corrupted "filesystem piece" that
fsck may or may not be able to fix.

Also, you can't modify the lower layer at all (without rebuilding the
upper layer from scratch), while with a union mount minor changes can
be done with no issues (e.g. replacing the LiveCD with a new minor
update, or applying a security update to the NFS master), and major
ones can be done with some care.

Hence, in any case where the layers are even slightly separated, or
where you need to modify them independently, or extract the changes,
union mounts/unionfs are much better, and often actually the only
viable solution.

This includes the LiveCD case, the NFS mount case and some use cases
with virtual machines.

This would be the case even more strongly if additional features like
online modification of lower layers, or path resolution to the most
recent file instead of the one in the highest layer, were added.

Of course, this is why people currently use unionfs or aufs, and a
VFS-based solution seems just better, since it is going to be more
efficient and guaranteed to be relatively bug-free once it satisfies
the high quality requirements for inclusion in the core kernel.

2010-08-18 01:24:42

by J. R. Okajima

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation


Valerie Aurora:
> and at least three national lab computer clusters. The best argument
> for their need for a union file system is that they are using unionfs
> and aufs despite the pain of using out-of-mainline code and (according
> to the users I have spoken to) frequent crashes. Union mounts is

Hmm, anyone who hits a crash in aufs, please let me know.
While I would never say aufs is bug-free, I have not received such a
report recently. I always try to fix a bug as soon as possible when I
get a report.

A reply I have to write repeatedly to those who have hit a problem in
aufs and reported it to the aufs-users ML is "your aufs version is too
old. please get the latest one."
That is because aufs is released every week, and some linux distributions
keep using a very old aufs version, sometimes more than one year older
than their own release date.

By the way, I have no objection to merging Val's UnionMount into
mainline, as I have heard it is already decided.


> But frankly, this is an impossible problem to solve generically at the
> file system level. When a user says, "Show the changes to the lower
> file system in my overlaid file system," they are actually saying,

Is it (mostly) possible by receiving a notification via fsnotify?
For remote FS, their ->d_revalidate() will tell us something is changed.


J. R. Okajima

2010-08-18 08:11:37

by Miklos Szeredi

Subject: Re: [PATCH 19/39] union-mount: Implement union lookup

On Tue, 17 Aug 2010, Valerie Aurora wrote:
> On Fri, Aug 13, 2010 at 03:49:04PM +0200, Miklos Szeredi wrote:
> > On Sun, 8 Aug 2010, Valerie Aurora wrote:
> > > Implement unioned directories, whiteouts, and fallthrus in pathname
> > > lookup routines. do_lookup() and lookup_hash() call lookup_union()
> > > after looking up the dentry from the top-level file system.
> > > lookup_union() is centered around __lookup_hash(), which does cached
> > > and/or real lookups and revalidates each dentry in the union stack.
> > >
> > > XXX - implement negative union cache entries
> > >
> > > XXX - handle different permissions on directories
> >
> > If process doing the lookup doesn't have write permission on the top
> > level directory then the lookup will fail. This is not intended, is
> > it?
>
> Does it fail? I'm not checking permissions before calling
> ->fallthru(). But I can't test this because the code doesn't set the
> owner of the copied up directory correctly. :)

It fails because everything, including copyup, is done with the
credentials of the user doing the lookup/copyup. This is wrong, for
the time of the copyup the credentials need to be upgraded to be able
to create and copy the lower file or directory into the upper
filesystem even when the current process doesn't have enough
privileges for that.

Thanks,
Miklos

2010-08-18 18:56:04

by Valerie Aurora

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

On Wed, Aug 18, 2010 at 10:23:52AM +0900, J. R. Okajima wrote:
>
> Valerie Aurora:
> > and at least three national lab computer clusters. The best argument
> > for their need for a union file system is that they are using unionfs
> > and aufs despite the pain of using out-of-mainline code and (according
> > to the users I have spoken to) frequent crashes. Union mounts is
>
> Hmm, anyone who hits a crash in aufs, please let me know.
> While I would never say aufs is bug-free, I have not received such a
> report recently. I always try to fix a bug as soon as possible when I
> get a report.

According to Al Viro, unionfs has some fundamental architectural problems
that prevent it from being correct and lead to crashes:

http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html

The main question for me is whether aufs has fixed these problems. If
it hasn't, then it can't be bug-free.

> A reply I have to write repeatedly to those who have hit a problem in
> aufs and reported it to the aufs-users ML is "your aufs version is too
> old. please get the latest one."
> That is because aufs is released every week, and some linux distributions
> keep using a very old aufs version, sometimes more than one year older
> than their own release date.
>
> By the way, I have no objection to merging Val's UnionMount into
> mainline, as I have heard it is already decided.

I wish I had your confidence. :)

> > But frankly, this is an impossible problem to solve generically at the
> > file system level. When a user says, "Show the changes to the lower
> > file system in my overlaid file system," they are actually saying,
>
> Is it (mostly) possible by receiving a notification via fsnotify?
> For remote FS, their ->d_revalidate() will tell us something is changed.

Think about the case of two different RPM package database files. One
contains the info from newly installed packages on the top layer file
system. The lower layer contains info from packages newly installed
on the lower file system. You don't want either file; you want the
merged package database showing the info for all packages installed
on both layers. Any practical file system based system is only going
to be able to pick one file or the other, and it's going to be wrong
in some cases.
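As a rough illustration of why this needs an application-level resolver, the Python sketch below models each package database as a plain {name: version} dict. That representation, the function name merge_package_dbs(), and the "higher version wins" rule are all assumptions made for the example; a real RPM database is far more involved, and the right conflict policy depends on the site.

```python
def merge_package_dbs(lower_db, upper_db):
    """Merge two {package_name: version_tuple} maps, application-style.

    Packages present in only one layer are kept. When both layers list
    the same package, the higher version wins (a policy choice that a
    real tool would need to make configurable).
    """
    merged = dict(lower_db)
    for name, version in upper_db.items():
        if name not in merged or version > merged[name]:
            merged[name] = version
    return merged


lower_db = {"bash": (4, 1), "openssl": (1, 0, 1)}   # upgraded lower layer
upper_db = {"bash": (4, 0), "vim": (7, 2)}          # installed on the top layer

merged = merge_package_dbs(lower_db, upper_db)
print(merged)
# A file-level union can only expose upper_db or lower_db, never this merge.
```

The merged result differs from both inputs, so no rule of the form "pick the upper file" or "pick the lower file" can produce it.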

-VAL

2010-08-18 19:04:53

by Valerie Aurora

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

On Wed, Aug 18, 2010 at 08:53:33AM +1000, Neil Brown wrote:
> On Tue, 17 Aug 2010 16:44:30 -0400
> Valerie Aurora wrote:
> >
> > I felt the way you did until I talked to several users who explained
> > to me why none of the existing solutions worked well for their use
> > case. The real-life use cases are those where people are currently
> > using unionfs and aufs, which include many live CDs, Linux appliances,
> > and at least three national lab computer clusters. The best argument
> > for their need for a union file system is that they are using unionfs
> > and aufs despite the pain of using out-of-mainline code and (according
> > to the users I have spoken to) frequent crashes. Union mounts is
> > intended as an in-mainline replacement for the existing users of
> > unionfs and aufs.
>
> You present a good argument that "something must be done", but it gives no
> pointers to what that something should be.
> I don't suppose it is possible to get that explanation you mention in writing?

Sorry, all the documentation I have about union mounts is publicly
available. I'll announce any new documentation in the usual way.

If you are willing to do the research personally, you can start with
the list of projects using unionfs:

http://www.fsl.cs.sunysb.edu/project-unionfs.html

> > It is true, you have to know what you are doing and carefully groom
> > both file systems if you want to change the lower file system and get
> > the effect you intended. Just updating the lower file system and
> > slapping the overlay back on will probably not accomplish what you
> > want.
> >
> > But frankly, this is an impossible problem to solve generically at the
> > file system level.
>
> Absolutely right - no argument about that.
> I just think that should be explicit in the documentation.
> Right after the "Online upgrade" paragraph:
>
> Even off-line upgrade - e.g. installing software on an exported filesystem
> and then remounting that on the client and union-mounting a pre-existing
> overlay on top of it - is significantly non-trivial and would require
> significant extra management software to create a working solution.
> (or something like that, but more than just one long sentence).

Okay, I'll put something in the next time I update the docs.

> > > As a counter-position for you or others to write cogent arguments against,
> > > and to then include those arguments in the justification section, I would
> > > like to present my preferred approach, which is essentially that the problem
> > > is better solved at the block layer or the distro layer.
> >
> > I personally like the block layer solution better and would be
> > happiest if all unionfs and aufs users switched to it and no one
> > needed union mounts. :) This is one case where the author is not in
> > love with the solution. I'm not going to argue for the need for it
> > beyond noting the existing unionfs and aufs user base.
>
> That may be enough justification to work on this as a research project, but I
> don't think it is enough justification to merge it into mainline.
>
> Just because aufs might be the best available solution to a particular problem
> doesn't mean that making a better aufs (aka VFS union mounts) will be the best
> possible solution. That can only be determined if the key needs, and the
> problems with all available solutions, are publicly known.

The problems with all available solutions, including union mounts, are
thoroughly documented in my four LWN articles on union mounts:

http://lwn.net/Articles/324291/
http://lwn.net/Articles/325369/
http://lwn.net/Articles/327738/
http://lwn.net/Articles/396020/

I understand your desire for better documentation. But contrary to
popular conception, I hate writing and do it as seldom as possible. :)

Thanks,

-VAL

2010-08-19 01:35:50

by J. R. Okajima

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation


Valerie Aurora:
> According to Al Viro, unionfs has some fundamental architectural problems
> that prevent it from being correct and lead to crashes:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
>
> The main question for me is whether aufs has fixed these problems. If
> it hasn't, then it can't be bug-free.

Although I don't fully understand your question, aufs actually verifies
the parent-child relationship after lock_rename() on the writable layer.
Such verification is done in other operations too.
And aufs provides three options to specify the level of
verification. When the highest (most strict) level is given, aufs_rename
looks up again after lock_rename() and compares the parent it got with
the given (cached) parent.
Does this answer your question correctly?


> Think about the case of two different RPM package database files. One
> contains the info from newly installed packages on the top layer file
> system. The lower layer contains info from packages newly installed
> on the lower file system. You don't want either file; you want the
> merged package database showing the info for all packages installed
> on both layers. Any practical file system based system is only going
> to be able to pick one file or the other, and it's going to be wrong
> in some cases.

Let me make sure.
Do you mean something like this?
- a user makes a union
- fileA exists on the lower layer but not on the upper
- modify fileA in the union
--> the file is copied-up and updated on the upper layer.
- modify fileA on the lower layer directly (by-passing union)
--> the file on the lower is updated.
- and the user will not see the up-to-date fileA in the union; it lacks
the modification made directly on the lower layer.

Then I'd say it is an expected behaviour. Simply the upper file hides
the lower.
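The sequence above can be replayed with a small Python model. This is only a sketch under simplifying assumptions: each layer is a dict of file contents, and the class name ToyUnion and its methods are invented here; neither aufs nor union mounts is implemented this way.

```python
class ToyUnion:
    """Two-layer union where the first write copies a file up."""

    def __init__(self, lower):
        self.lower = lower      # dict: name -> contents (never written via union)
        self.upper = {}         # dict: name -> contents (all writes land here)

    def read(self, name):
        # The upper layer always wins if it has the name.
        if name in self.upper:
            return self.upper[name]
        return self.lower[name]

    def append(self, name, data):
        # Copy-up on first modification, then change only the upper copy.
        if name not in self.upper:
            self.upper[name] = self.lower[name]
        self.upper[name] += data


lower = {"fileA": "v1"}
u = ToyUnion(lower)
u.append("fileA", "+local")      # modify fileA through the union
lower["fileA"] = "v2"            # modify the lower layer directly
result = u.read("fileA")
print(result)                    # v1+local
```

The direct change to "v2" never becomes visible through the union, because the copied-up file on the upper layer hides it.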

While UnionMount takes a block device as a parameter of the union
operation, aufs takes a directory.
# mount /dev/sda /u
# mount -o union /dev/sdb /u

# mount /dev/sda /ro
# mount /dev/sdb /rw
# mount -t aufs -o br:/rw:/ro none /u

It means sda is hidden in UnionMount (generally) and users cannot access
it directly. But in aufs, it is possible via /ro. For those who want to
hide /ro and stop accessing it directly, the aufs documentation suggests
mounting another thing onto /ro. It can be an empty directory if you use
"mount -o bind".


J. R. Okajima

2010-08-24 00:05:38

by Valerie Aurora

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

On Thu, Aug 19, 2010 at 10:34:59AM +0900, J. R. Okajima wrote:
>
> Valerie Aurora:
> > According to Al Viro, unionfs has some fundamental architectural problems
> > that prevent it from being correct and lead to crashes:
> >
> > http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
> >
> > The main question for me is whether aufs has fixed these problems. If
> > it hasn't, then it can't be bug-free.
>
> Although I don't fully understand your question, aufs actually verifies
> the parent-child relationship after lock_rename() on the writable layer.
> Such verification is done in other operations too.
> And aufs provides three options to specify the level of
> verification. When the highest (most strict) level is given, aufs_rename
> looks up again after lock_rename() and compares the parent it got with
> the given (cached) parent.
> Does this answer your question correctly?

First, my theory when writing any file system code is that whenever Al
Viro says, "You can deadlock easily" or "It violates the locking
rules", I have to understand the problem and fix it. I understand
why union mounts doesn't have the problems unionfs had when Al wrote
this email (because lower layers are not writable). But since aufs
allows directories on lower layers to be renamed in the way that
creates the problems Al describes, I assume it has this same problem
until the author understands the unionfs problem and can describe why
aufs didn't inherit it (or fixed it, or whatever).

Second, why isn't the most strict level of lookup the only option? It
seems like anything else is a bug.

Third, you have this odd circular inheritance problem that comes from
moving a child directory on the lower layer to the path of its parent,
and vice versa. From Al's email:

> If you allow a mix of old and new mappings, you can easily run into the
> situations when at some moment X1 covers Y1, X2 covers Y2, X2 is a descendent
> of X1 and Y1 is a descendent of Y2. You *really* don't want to go there -
> if nothing else, defining behaviour of copyup in face of that insanity
> will be very painful.

I understand the circular inheritance problem but find this hard to
explain better than Al does above. But here's an example of how you
get there:

Start with parent_dir1/child_dir1 covering parent_dir2/child_dir2
thread 1 does a union lookup and gets:
parent_dir1 covering parent_dir2
child_dir1 covering child_dir2
parent_dir1 parent of child_dir1
parent_dir2 parent of child_dir2
thread 2 swaps parent_dir2 with child_dir2 (using rename and a tmp dir)
now lower fs looks like: child_dir2/parent_dir2

Who inherits what? Does thread 1 see parent_dir2 as a descendant of
child_dir2 which is a descendant of parent_dir2 through the union with
parent_dir1? Can you sanely define the behavior here?

Fourth, you have a potential deadlock now. Say thread 1 is operating
with the belief that parent_dir1/child_dir1 covers
parent_dir2/child_dir2. parent_dir2/child_dir2 gets renamed such that
the two switch places, as described above. And thread 2 is directly
accessing the lower file system, now with child_dir2/parent_dir2. The
locking order for thread 1 is:

parent_dir2 -> parent_dir1 -> child_dir1 -> child_dir2

For thread 2, it is:

child_dir2 -> parent_dir2

So if thread 1 gets a lock on parent_dir2, and then thread 2 gets a
lock on child_dir2, they will deadlock. In general, this situation
violates the fundamental assumptions of correct directory locking,
described in Documentation/filesystems/directory-locking.
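The conflict between those two orders can be checked mechanically. The Python sketch below builds a "held before wanted" graph from the per-thread lock orders given above and searches it for a cycle; a cycle means the orders can deadlock. The helper name find_lock_cycle() is hypothetical, and this is a generic lock-order check, not something the kernel or the patch set does in this form.

```python
def find_lock_cycle(orders):
    """Return a list of locks forming a cycle, or None.

    orders: one sequence per thread listing locks in acquisition order.
    An edge a -> b means some thread takes b while holding a. Consecutive
    pairs are enough, because reachability in the graph is transitive.
    """
    graph = {}
    for order in orders:
        for held, wanted in zip(order, order[1:]):
            graph.setdefault(held, set()).add(wanted)
            graph.setdefault(wanted, set())

    visiting, done = [], set()

    def dfs(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: the path from the first visit to here is a cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in sorted(graph[node]):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(graph):
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


thread1 = ["parent_dir2", "parent_dir1", "child_dir1", "child_dir2"]
thread2 = ["child_dir2", "parent_dir2"]
both = find_lock_cycle([thread1, thread2])
alone = find_lock_cycle([thread1])
print(both)    # a cycle through all four directories
print(alone)   # None
```

Thread 1's order on its own is cycle-free; it is only the combination with thread 2's reversed view of the lower file system that closes the loop.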

That's my attempt to explain Al's email, anyway. :) All errors are my
own.

> > Think about the case of two different RPM package database files. One
> > contains the info from newly installed packages on the top layer file
> > system. The lower layer contains info from packages newly installed
> > on the lower file system. You don't want either file; you want the
> > merged package database showing the info for all packages installed
> > on both layers. Any practical file system based system is only going
> > to be able to pick one file or the other, and it's going to be wrong
> > in some cases.
>
> Let me make sure.
> Do you mean something like this?
> - a user makes a union
> - fileA exists on the lower layer but not on the upper
> - modify fileA in the union
> --> the file is copied-up and updated on the upper layer.
> - modify fileA on the lower layer directly (by-passing union)
> --> the file on the lower is updated.
> - and the user will not see the up-to-date fileA in the union; it lacks
> the modification made directly on the lower layer.
>
> Then I'd say it is an expected behaviour. Simply the upper file hides
> the lower.

I am not arguing with you and I agree that this is the expected
behavior. I wrote about this case just to show that there is a case
in which what the user "wants" in an upgrade situation is impossible
to do automatically in the file system. So you need to have a smart
tool to do an upgrade of the lower layer file system. And I argue
that smart tool should deal with all cases of a file copied up to the
topmost file system that covers an updated file on the lower file
system, instead of putting this policy decision into the VFS.
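
A minimal sketch of the scenario discussed above, assuming each layer is modelled as a plain dictionary from file name to contents. The Union class is hypothetical and only illustrates the "upper file hides the lower" rule, not any real union file system:

```python
class Union:
    """Toy two-layer union: each layer is a dict of file name -> contents."""

    def __init__(self, upper, lower):
        self.upper = upper
        self.lower = lower

    def read(self, name):
        # The topmost layer that has the name wins.
        if name in self.upper:
            return self.upper[name]
        return self.lower[name]

    def write(self, name, data):
        # Writes always land on the upper layer (copy-up is implied).
        self.upper[name] = data


lower = {"fileA": "v1"}
upper = {}
union = Union(upper, lower)

before = union.read("fileA")          # served from the lower layer
union.write("fileA", "v1 + union edit")
lower["fileA"] = "v2 (edited directly on the lower layer)"
after = union.read("fileA")           # still the copied-up upper file
```

After the copy-up, the direct edit to the lower layer is never visible through the union, and nothing in the lookup rule can merge the two versions.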

-VAL

2010-08-24 02:29:23

by J. R. Okajima

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation


Thank you very much for the explanation.

Valerie Aurora:
> First, my theory when writing any file system code is that whenever Al
> Viro says, "You can deadlock easily" or "It violates the locking
> rules" that I have to understand the problem and fix it. I understand
> why union mounts doesn't have the problems unionfs had when Al wrote
> this email (because lower layers are not writable). But since aufs
> allows directories on lower layers to be renamed in the way that
> creates the problems Al describes, I assume it has this same problem
> until the author understands the unionfs problem and can describe why
> aufs didn't inherit it (or fixed it, or whatever).

Basically agreed.


> Second, why isn't the most strict level of lookup the only option? It
> seems like anything else is a bug.

Because users can hide the layers (as union mounts do) if they want,
which completely prevents bypassing aufs. Additionally, they modify a
layer directly (bypassing aufs) only when it is really necessary. So the
default value of the option is not the strict one, and users can change
the option dynamically.


> Start with parent_dir1/child_dir1 covering parent_dir2/child_dir2
> thread 1 does a union lookup and gets:
> parent_dir1 covering parent_dir2
> child_dir1 covering child_dir2
> parent_dir1 parent of child_dir1
> parent_dir2 parent of child_dir2
> thread 2 swaps parent_dir2 with child_dir2 (using rename and a tmp dir)
> now lower fs looks like: child_dir2/parent_dir2
>
> Who inherits what? Does thread 1 see parent_dir2 as a descendant of
> child_dir2 which is a descendant of parent_dir2 through the union with
> parent_dir1? Can you sanely define the behavior here?

When a rename happens on a layer directly, aufs receives an
inotify/fsnotify event. Depending on the event type, aufs marks the
cached dentry/inode as obsolete, and they are looked up again on the
next access. In the end aufs learns that the upper parent_dir1 no longer
covers the lower parent_dir2.
This notification is the main purpose of the strict option, which is
called "udba=notify" (User's Direct Branch Access).
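
A minimal sketch of the invalidate-then-relookup idea described above, assuming the layer is modelled as a dictionary from directory name to parent name. CachedLookup and on_notify are hypothetical names; this is not aufs code and it ignores all locking:

```python
class CachedLookup:
    """Caches parent lookups and drops an entry when the layer reports a change."""

    def __init__(self, layer):
        self.layer = layer   # dict: directory name -> parent name on the layer
        self.cache = {}
        self.lookups = 0     # number of real lookups done on the layer

    def parent_of(self, name):
        if name not in self.cache:
            self.lookups += 1
            self.cache[name] = self.layer[name]
        return self.cache[name]

    def on_notify(self, name):
        # Stand-in for handling a notification about a direct change.
        self.cache.pop(name, None)


layer = {"parent_dir2": "lower_root", "child_dir2": "parent_dir2"}
cached = CachedLookup(layer)
first = cached.parent_of("child_dir2")

# Someone swaps the two directories directly on the layer.
layer["child_dir2"] = "lower_root"
layer["parent_dir2"] = "child_dir2"

stale = cached.parent_of("child_dir2")   # cache not yet invalidated
cached.on_notify("child_dir2")
fresh = cached.parent_of("child_dir2")   # looked up again after the event
```

The cached answer stays stale until the event is processed, which is why the ordering between event delivery and other operations matters in this discussion.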


> Fourth, you have a potential deadlock now. Say thread 1 is operating
:::

No, a deadlock will not happen, since aufs knows the new parent-child
relationship through the inotify/hinotify mechanism described in the
answer above. I hope you would agree with that.


> > Then I'd say it is an expected behaviour. Simply the upper file hides
> > the lower.
>
> I am not arguing with you and I agree that this is the expected
> behavior. I wrote about this case just to show that there is a case
> in which what the user "wants" in an upgrade situation is impossible
> to do automatically in the file system. So you need to have a smart
> tool to do an upgrade of the lower layer file system. And I argue
> that smart tool should deal with all cases of a file copied up to the
> topmost file system that covers an updated file on the lower file
> system, instead of putting this policy decision into the VFS.

I am afraid I still may not fully understand what you wrote.
Do you mean that upgrading a package involves updating several files
whose versions have to match each other within the package, and that
upgrading different packages directly on both the upper and lower layers
causes a mismatch among those files?

Although I don't think you are talking about the aufs utility aubrsync,
which runs rsync between layers, I don't understand "putting this
policy decision into the VFS". The simple rule "the upper file hides the
lower" is outside the VFS.


J. R. Okajima

2010-08-24 20:49:04

by Valerie Aurora

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

On Tue, Aug 24, 2010 at 11:28:37AM +0900, J. R. Okajima wrote:
>
> Thank you for explanation, very much.

You are welcome!

> When a rename happens on a layer directly, aufs receives a
> inotify/fsnotify event. Following the event type, aufs makes the cached
> dentry/inode obsoleted and they will be lookup-ed again in the
> succeeding access. Finally aufs will know the upper parent_dir1 is not
> covering the lower parent_dir2 anymore.
> This notification is the main purpose of the strict option which is
> called "udba=notify" (User's Direct Branch Access).

No, that's not a sufficient description and leaves open questions
about all sorts of deadlocks and race conditions. For example,
inotify events occur while holding locks only on one layer. You
obviously need to lock the top layer to update the inheritance and
parent-child relationships. Now you are locking the lower layer first
and the top layer second, which is the reverse of the usual order.
Also, it should not be an option.

If Al Viro says it's wrong, you need a very detailed explanation of
why it is right. See Documentation/filesystems/directory-locking for
an example of the argument you have to make to show that moving things
around on the lower layer is safe. In general, your first task is to
show a global lock ordering to prove lack of deadlocks (which I don't
think you should spend time on because most VFS experts think it is
impossible to do with two read-write layers).

I'm not going to explain any more how aufs is wrong; it's the
maintainer's job to convince Al Viro and other maintainers that aufs
is right. But I hope this gave you a start and showed why union
mounts is a preferred approach for many people.

Thanks,

-VAL

2010-08-25 02:59:00

by Christian Stroetmann

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation

Aloha Everybody;

On the 24.08.2010 22:48, Valerie Aurora wrote:
> On Tue, Aug 24, 2010 at 11:28:37AM +0900, J. R. Okajima wrote:
>> Thank you for explanation, very much.

Me too

> You are welcome!
>
>> When a rename happens on a layer directly, aufs receives a
>> inotify/fsnotify event. Following the event type, aufs makes the cached
>> dentry/inode obsoleted and they will be lookup-ed again in the
>> succeeding access. Finally aufs will know the upper parent_dir1 is not
>> covering the lower parent_dir2 anymore.
>> This notification is the main purpose of the strict option which is
>> called "udba=notify" (User's Direct Branch Access).
> No, that's not a sufficient description and leaves open questions
> about all sorts of deadlocks and race conditions. For example,
> inotify events occur while holding locks only on one layer. You
> obviously need to lock the top layer to update the inheritance and
> parent-child relationships. Now you are locking the lower layer first
> and the top layer second, which is the reverse of the usual order.
> Also, it should not be an option.
>
> If Al Viro says it's wrong, you need a very detailed explanation of
> why it is right. See Documentation/filesystem/directory-locking for
> an example of the argument you have to make to show that moving things
> around on the lower layer is safe. In general, your first task is to
> show a global lock ordering to prove lack of deadlocks (which I don't
> think you should spend time on because most VFS experts think it is
> impossible to do with two read-write layers).

This all reminds me of the dining philosophers problem and its
solutions, especially the waiter and the resource hierarchy solutions
(see [1]).
And I do think that such problems can always be solved in a real-world
context, but the solutions are often very expensive in time and/or space.
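
A minimal sketch of the resource hierarchy solution mentioned above, assuming every lock carries a fixed rank and all threads acquire locks in ascending rank regardless of the order they asked for them. RankedLock and with_locks are hypothetical names:

```python
import threading


class RankedLock:
    """A lock with a fixed position in a global ordering."""

    def __init__(self, rank, name):
        self.rank = rank
        self.name = name
        self.lock = threading.Lock()


def with_locks(locks, action):
    """Acquire locks in ascending rank, run action, release in reverse order."""
    ordered = sorted(locks, key=lambda l: l.rank)
    for l in ordered:
        l.lock.acquire()
    try:
        return action()
    finally:
        for l in reversed(ordered):
            l.lock.release()


parent = RankedLock(0, "parent_dir2")
child = RankedLock(1, "child_dir2")
counter = {"value": 0}


def bump():
    counter["value"] += 1


def worker(first, second, rounds):
    for _ in range(rounds):
        with_locks([first, second], bump)


# The two threads request the locks in opposite orders, yet both finish.
t1 = threading.Thread(target=worker, args=(parent, child, 1000))
t2 = threading.Thread(target=worker, args=(child, parent, 1000))
t1.start()
t2.start()
t1.join()
t2.join()
```

The sketch relies on ranks that never change. In the scenario discussed in this thread, a rename changes which directory is the ancestor, so a static rank like this does not carry over directly.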

> I'm not going to explain any more how aufs is wrong; it's the
> maintainer's job to convince Al Viro and other maintainers that aufs
> is right. But I hope this gave you a start and showed why union
> mounts is a preferred approach for many people.
>
> Thanks,
>
> -VAL

[1] http://en.wikipedia.org/wiki/Dining_philosophers_problem

Have fun
Christian

2010-08-25 05:04:43

by J. R. Okajima

Subject: Re: [PATCH 14/39] union-mount: Union mounts documentation


Valerie Aurora:
> No, that's not a sufficient description and leaves open questions
> about all sorts of deadlocks and race conditions. For example,
> inotify events occur while holding locks only on one layer. You
> obviously need to lock the top layer to update the inheritance and
> parent-child relationships. Now you are locking the lower layer first
> and the top layer second, which is the reverse of the usual order.

I don't agree about the deadlock and the race condition.
When a user modifies the dir hierarchy on the layer directly while
aufs_rename() is running, aufs will detect it after lock_rename().
It behaves like this:
- decide the layer where the actual rename operates, and create the dir
  hierarchy on it if necessary.
- lock_rename() for the layer
- call ->rename()
or
- if the file being renamed exists on the lower read-only layer, aufs
  will copy it up to the upper writable layer under the rename target
  name. In this case, ->rename() is not called.

If a user changes the dir hierarchy directly on the layer before
aufs_rename(), then the notify event tells aufs about it and aufs gets
the latest hierarchy.

If it happens before lock_rename() in aufs_rename(), aufs verifies the
relationship between the target child and the locked dir. If it differs,
aufs returns EBUSY. Of course, lock_rename() follows the "ancestors first"
order described in Documentation/filesystems/directory-locking.
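
A minimal sketch of the "lock, then verify the relationship, else EBUSY" pattern described above, assuming a directory is modelled as an object with a parent pointer and a mutex. Dir and locked_op_on_child are hypothetical names, and this is a user-space model rather than aufs code:

```python
import errno
import threading


class Dir:
    """Toy directory with a parent pointer and its own mutex."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.mutex = threading.Lock()


def locked_op_on_child(expected_parent, child):
    """Lock the expected parent, then re-check it is still the child's parent."""
    with expected_parent.mutex:
        if child.parent is not expected_parent:
            return -errno.EBUSY  # hierarchy changed underneath us
        return 0                 # safe to operate on child here


dir_a = Dir("dir_a")
dir_b = Dir("dir_b")
child = Dir("child", parent=dir_a)

ok = locked_op_on_child(dir_a, child)

# Model a direct rename on the layer that moves child under dir_b.
child.parent = dir_b
busy = locked_op_on_child(dir_a, child)
```

The check only has meaning because it runs after the lock is held; a check made before taking the lock could be invalidated by a concurrent rename.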


> around on the lower layer is safe. In general, your first task is to
> show a global lock ordering to prove lack of deadlocks (which I don't
> think you should spend time on because most VFS experts think it is
> impossible to do with two read-write layers).

Since you may not read this anymore and other people don't seem to
be interested in aufs, it may not be meaningful to write down the
locking in aufs. But I will try.

First,
- since aufs is an FS, it has its own super_block, dentry and inode.
- the super_block, dentry and inode in aufs have private data which
  contains an rwsem.
- the locking order for these rwsems is child-first.
- aufs specifies FS_RENAME_DOES_D_MOVE.

Locking order in aufs_rename():
+ down_read() for the aufs sb
  protects the sb from branch add/delete.
+ two down_write()s for the src and dest children
  protect them from other processes in aufs.
+ down_write() for the dst_parent.
+ decide the layer where we will operate, by comparing the index of the
  layers where the targets exist and the layer attributes (ro, rw).
+ copy up the dest dir hierarchy if necessary, by repeating
  - dget_parent(), down/up_read() for the parent (in aufs)
  - mutex_lock() for the dir (on the layer) to mkdir the non-existing
    child dir on the layer and verify the parent-child relationship.
  - mkdir and setattr on the layer.
  - mutex_unlock() the dir on the layer.
+ test that they are rename-able:
  if it is a dir, it must be empty (logically) or must not have children
  on multiple branches.
+ if src_parent and dst_parent differ, down_write() both. An up_write()
  for dst_parent may be necessary to keep the "child-first" rule in aufs.

(from here the "sub-VFS" characteristic of aufs appears)
+ lock_rename() on the layer
  and verify every relationship between child and parent.
+ test that the src_child is deletable.
+ test that the dst_child is add-able, or deletable if it exists.
+ vfs_rename() on the layer, or copy up src_child under the dst_child name.
+ unlock_rename() on the layer

(return to aufs world)
+ d_drop() dst_child if necessary.
+ d_move()
+ up_write() for src_parent and dst_parent
+ up_write() for src_child and dst_child
+ up_read() for aufs sb
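
A minimal sketch of the nesting in the list above, assuming each lock is modelled as a context manager that only records acquire and release events. The held helper and the ORDER list are hypothetical simplifications: the model leaves out the possible drop and re-take of dst_parent and all copy-up work:

```python
from contextlib import ExitStack, contextmanager

trace = []


@contextmanager
def held(name):
    """Record when a (pretend) lock is taken and when it is dropped."""
    trace.append(("acquire", name))
    try:
        yield
    finally:
        trace.append(("release", name))


# Simplified acquisition order, following the steps listed above.
ORDER = [
    "aufs sb (read)",
    "src_child",
    "dst_child",
    "dst_parent",
    "src_parent",
    "layer lock_rename()",
]


def rename_model():
    with ExitStack() as stack:
        for name in ORDER:
            stack.enter_context(held(name))
        trace.append(("work", "vfs_rename() on the layer"))
    # ExitStack releases everything in reverse order on exit.


rename_model()
acquired = [name for op, name in trace if op == "acquire"]
released = [name for op, name in trace if op == "release"]
```

In this model the release order is the exact reverse of the acquisition order, which matches the shape of the list: the layer lock is innermost and the aufs sb lock is outermost.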

Strictly speaking, there are more things which aufs_rename() handles,
such as inode attributes, whiteouts, opaque dirs, internal pointers to
the object on the layer, and temporary dir names. But they are
essentially unrelated to the locking order, so I didn't describe them.


Thank you for reading this long mail.


J. R. Okajima