2007-06-20 05:43:50

by Bharata B Rao

[permalink] [raw]
Subject: [RFC PATCH 0/4] New approach to VFS based union mount

Hi,

The earlier approach to VFS based union mounts posted here didn't work for
all cases of mounts (specially bind mounts). Hence, I have worked on this
new approach which is more generic and hopefully should work in most cases.
This approach fundamentally changes the way union stacks are maintained.

Details of the approach can be found in the accompanying documentation patch.
Since this is still a work in progress, there are many loose ends. Apart
from that, I have some questions about the design and implementation aspects,
which I have raised in the documenation patch. It will be very helpful
if people who are familiar with VFS and stackable filesystems could review
this.

This series (on 2622rc4mm2) has the following 4 patches:

1/4 - Union mount documentation.
2/4 - Mount changes to support union mount.
3/4 - Lookup changes to support union mount.
4/4 - Directory listing support for union mounted directories.

Kindly review and send me your feedback.

Regards,
Bharata.


2007-06-20 05:44:45

by Bharata B Rao

[permalink] [raw]
Subject: [RFC PATCH 1/4] Union mount documentation.

From: Bharata B Rao <[email protected]>
Subject: Union mount documentation.

Adds union mount documentation.

Signed-off-by: Bharata B Rao <[email protected]>
---
Documentation/union-mounts.txt | 232 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 232 insertions(+)

--- /dev/null
+++ b/Documentation/union-mounts.txt
@@ -0,0 +1,232 @@
+VFS BASED UNION MOUNT
+=====================
+
+1. What is Union Mount ?
+2. Recap
+3. The new approach
+4. Union stack: building and traversal
+5. Union stack: destroying
+6. Directory lising
+7. What works and what doesn't ?
+8. Usage
+9. References
+
+1. What is Union Mount ?
+------------------------
+Union mount allows mounting of two or more filesystems transparently on
+a single mount point. Contents (files or directories) of all the
+filesystems become visible at the mount point after a union mount. If
+there are files with the same name in multiple layers of the union, only
+the topmost files remain visible. Contents of common named directories are
+merged again to present a unified view at the subdirectory level.
+
+In this approach of filesystem namespace unification, the layering or
+stacking information of different components (filesystems) of the union
+mount are maintained at the VFS layer. Hence, this is referred to as VFS
+based union mount.
+
+2. Recap
+--------
+Jan Blunck had developed a version of VFS based union mount in 2003-4.
+This version was cleaned up and ported to later kernels. Early in year
+2007, two iterations of these patches were posted for review (Ref 1, Ref 2).
+But, this approach had a few shortcomings:
+
+- It wasn't designed to work with shared subtree additions to mount.
+- It didn't work well when same filesystem was mounted from different
+ namespaces, as it maintained the union stack at dentry level.
+- It made dget() sleep.
+- The union stack was built using dentries and this was too fragile. This
+ made the code complex and the locking ugly.
+
+3. The new approach
+-------------------
+In this new approach, the way union stack is built and traversed has been
+changed. Instead of dentry-to-dentry links forming the stack between
+different layers, we now have (vfsmount, dentry) pairs as the building
+blocks of the union stack. Since this (vfsmount, dentry) combination is
+unique across all namespaces, we should be able to maintain the union stack
+sanely even if the filesystem is union mounted privately in different
+namespaces or if it appears under different mounts due to various types
+of bind mounts.
+
+4. Union stack: building and traversal
+--------------------------------------
+Union stack needs to be built from two places: during an explicit union
+mount (or mount propagation) and during the lookup of a directory that
+appears in more than one layer of the union.
+
+The link between two layers of union stack is maintained using the
+union_mount structure:
+
+struct union_mount {
+ /* vfsmount and dentry of this layer */
+ struct vfsmount *src_mnt;
+ struct dentry *src_dentry;
+
+ /* vfsmount and dentry of the next lower layer */
+ struct vfsmount *dst_mnt;
+ struct dentry *dst_dentry;
+
+ /*
+ * This list_head hashes this union_mount based on this layer's
+ * vfsmount and dentry. This is used to get to the next layer of
+ * the stack (dst_mnt, dst_dentry) given the (src_mnt, src_dentry)
+ * and is used for stack traversal.
+ */
+ struct list_head hash;
+
+ /*
+ * All union_mounts under a vfsmount(src_mnt) are linked together
+ * at mnt->mnt_union using this list_head. This is needed to destroy
+ * all the union_mounts when the mnt goes away.
+ */
+ struct list_head list;
+};
+
+These union mount structures are stored in a hash table(union_mount_hashtable)
+which uses the same hash as used for mount_hashtable since both of them use
+(vfsmount, dentry) pairs to calculate the hash.
+
+During a new mount (or mount propagation), a new union_mount structure is
+created. A reference to the mountpoint's vfsmount and dentry is taken and
+stored in the union_mount (as dst_mnt, dst_dentry). And this union_mount
+is inserted in the union_mount_hashtable based on the hash generated by
+the mount root's vfsmount and dentry.
+
+Similar method is employed to create a union stack during first time lookup
+of a common named directory within a union mount point. But here, the top
+level directory's vfsmount and dentry are hashed to get to the lower level
+directory's vfsmount and dentry.
+
+The insertion, deletion and lookup of union_mounts in the
+union_mount_hashtable is protected by vfsmount_lock. While traversing the
+stack, we hold this spinlock only briefly during lookup time and release
+it as soon as we get the next union stack member. The top level of the
+stack holds a reference to the next level (via union_mount structure) and
+so on. Therefore, as long as we hold a reference to a union stack member,
+its lower layers can't go away. And since we don't do the complete
+traversal under any lock, it is possible for the stack to change over the
+level from where we started traversing. For eg. when traversing the stack
+downwards, a new filesystem can be mounted on top of it. When this happens,
+the user who had a reference to the old top wouldn't have visibility to
+the new top and would continue as if the new top didn't exist for him.
+I believe this is fine as long as members of the stack don't go away from
+under us(CHECK). And to be sure of this, we need to hold a reference to the
+level from where we start the traversal and should continue to hold it
+till we are done with the traversal.
+
+5. Union stack: destroying
+--------------------------
+In addition to storing the union_mounts in a hash table for quick lookups,
+they are also stored as a list, headed at vsmount->mnt_union. So, all
+union_mounts that occur under a vfsmount (starting from the mountpoint
+followed by the subdir unions) are stored within the vfsmount. During
+umount (specifically, during the last mntput()), this list is traversed
+to destroy all union stacks under this vfsmount.
+
+Hence, all union stacks under a vfsmount continue to exist until the
+vfsmount is unmounted. It may be noted that the union_mount structure
+holds a reference to the current dentry also. Becasue of this, for
+subdir unions, both the top and bottom level dentries become pinned
+till the upper layer filesystem is unmounted. Is this behaviour
+acceptable ? Would this lead to a lot of pinned dentries over a period
+of time ? (CHECK) If we don't do this, the top layer dentry might go
+out of cache, during which time we have no means to release the
+corresponding union_mount and the union_mount becomes stale. Would it
+be necessary and worthwhile to add intelligence to prune_dcache() to
+prune unused union_mounts thereby releasing the dentries ?
+
+As noted above, we hold the refernce to current dentry from union_mount
+but don't get a reference to the corresponding vfsmount. We depend on
+the user of the union stack to hold the reference to the topmost vfsmount
+until he is done with the stack traversal. Not holding a reference to the
+top vfsmount from within union_mount allows us to free all the union_mounts
+from last mntput of the top vfsmount. Is this approach acceptable ?
+
+NOTE: union_mount structures are part of two lists: the hash list for
+quick lookups and a linked list to aid the freeing of these structures
+during unmount.
+
+6. Directory lising
+-------------------
+The merged view of directories is obtained by reading the directory
+entries of all the layers (starting from topmost) and merging the result.
+To aid this, the directory entries are stored in a cache as and when they
+are read and the newly read entries are compared against this for duplicate
+elimination before being passed to user space. This cache is a simple linked
+list at the moment.
+
+If getdents() returns to user space before completely reading the directory,
+the state at which it left reading the union mounted directory is stored
+in the rdstate structure.
+
+struct rdstate {
+ /* vfsmount and dentry of the directory from which we were reading */
+ struct vfsmount *mnt;
+ struct dentry *dentry;
+
+ /* the file offset of directory file at which we stopped reading */
+ loff_t off;
+
+ /* cache of directory entries */
+ struct list_head dirent_cache;
+};
+
+A pointer to this structure is stored in the file structure for the topmost
+directory and initialized during the first readdir()/getdents() of this
+directory. This readdir state information is destroyed during the last
+fput() of the file. For every subsequent readdir()/getdents(), the file
+offset of the directory determined by rdstate->{mnt, dentry} is set to
+the rdstate->off, before continuing with readdir()/getdents() on that
+directory.
+
+Since readdir()/getdents() is issued on the topmost directory for union
+mounted directories, it is possible for the file->f_pos of the topmost
+directory to reach its end while we are still reading the contents of
+the stacked bottom directories. So, file->f_pos is not clearly defined
+for union mounted directories. And because of this lseek doesn't work
+as it works normally for other directories. If this approach of directory
+listing is acceptable, we need to fix the meaning of file offset for
+union mounted directories and accordingly get lseek to behave sanely.
+
+7. What works and what doesn't ?
+-------------------------------
+These work:
+ - mount/umount :)
+ - A simple case of union mount propagation to slave and shared
+ mounts.
+ - /bin/ls on a union mounted directory.
+
+These don't:
+ - lseek on union mounted directory.
+
+Not tried:
+ - move mounts
+ - pivot_root
+ - Other cases of bind mounts, specifically recursive binds.
+ - etc :(
+
+Not yet implemented:
+ - copyup and whiteout features. So, as of now we can only
+ do a union mount and directory listing on it. Other operations,
+ specifically write to a lower layer file are not supported.
+
+8. Usage
+--------
+To union mount a device /dev/sda1 on a mount point /mnt, we do this:
+
+# mount --union /dev/sda1 /mnt
+
+This results in the union mount getting created at /mnt which will contain
+the merged view of /mnt's original content and the contents of /dev/sda1.
+
+The mount(8) command from util-linux has to be modified to support
+--union option.
+
+9. References
+-------------
+1. http://lkml.org/lkml/2007/4/17/150 - First post of original union mount.
+2. http://lkml.org/lkml/2007/5/14/69 - Next (v1) post of original union mount.
+
+- June 2007

2007-06-20 05:45:32

by Bharata B Rao

[permalink] [raw]
Subject: [RFC PATCH 2/4] Mount changes to support union mount.

From: Bharata B Rao <[email protected]>
Subject: Mount changes to support union mount.

Adds union mount support.

This patch adds a new mount type for union mount (MNT_UNION) and changes
the mount path to build a union stack during mount. The routines for
supporting the creation, traversal and destruction of union stacks are
also included here.

Signed-off-by: Bharata B Rao <[email protected]>
---
fs/namespace.c | 164 ++++++++++++++++++++++++++++++++++++++++++++++----
include/linux/fs.h | 1
include/linux/mount.h | 17 +++++
3 files changed, 172 insertions(+), 10 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -35,6 +35,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLO
static int event;

static struct list_head *mount_hashtable __read_mostly;
+static struct list_head *union_mount_hashtable;
static int hash_mask __read_mostly, hash_bits __read_mostly;
static struct kmem_cache *mnt_cache __read_mostly;
static struct rw_semaphore namespace_sem;
@@ -54,6 +55,89 @@ static inline unsigned long hash(struct
return tmp & hash_mask;
}

+/* Must be called with vfsmount_lock held */
+static struct union_mount *find_union_mount(struct vfsmount *mnt,
+ struct dentry *dentry)
+{
+ struct list_head *head;
+ struct union_mount *u;
+
+ if (!IS_MNT_UNION(mnt))
+ return NULL;
+
+ head = union_mount_hashtable + hash(mnt, dentry);
+ list_for_each_entry(u, head, hash)
+ if (u->src_mnt == mnt && u->src_dentry == dentry)
+ return u;
+ return NULL;
+}
+
+/*
+ * When propagating mount events to peer group, this is called under
+ * vfsmount_lock. Hence using GFP_ATOMIC for kmalloc here.
+ * TODO: Can we use a separate kmem cache for union_mount ?
+ */
+struct union_mount *alloc_union_mount(struct vfsmount *src_mnt,
+ struct dentry *src_dentry, struct vfsmount *dst_mnt,
+ struct dentry *dst_dentry)
+{
+ struct union_mount *u;
+ u = kmalloc(sizeof(struct union_mount), GFP_ATOMIC);
+ if (!u)
+ return u;
+ u->dst_mnt = mntget(dst_mnt);
+ u->dst_dentry = dget(dst_dentry);
+ u->src_mnt = src_mnt;
+ u->src_dentry = dget(src_dentry);
+ INIT_LIST_HEAD(&u->hash);
+ INIT_LIST_HEAD(&u->list);
+ return u;
+}
+
+/* Must be called with vfsmount_lock held */
+void attach_mnt_union(struct union_mount *u)
+{
+ if (!u)
+ return;
+
+ list_add_tail(&u->hash, union_mount_hashtable +
+ hash(u->src_mnt, u->src_dentry));
+ list_add_tail(&u->list, &u->src_mnt->mnt_union);
+}
+
+/*
+ * Finds the next (vfsmount, dentry) in the union stack. If found, returns
+ * it via @nd and returns true. Else doesn't modify @nd, but returns false.
+ */
+int next_union_mount(struct nameidata *nd)
+{
+ struct union_mount *u;
+
+ spin_lock(&vfsmount_lock);
+ u = find_union_mount(nd->mnt, nd->dentry);
+ spin_unlock(&vfsmount_lock);
+ if (u) {
+ nd->mnt = u->dst_mnt;
+ nd->dentry = u->dst_dentry;
+ return 1;
+ }
+ return 0;
+}
+
+/* Check if next element of the union stack exists. @nd isn't modified. */
+int next_union_mount_exists(struct vfsmount *mnt, struct dentry *dentry)
+{
+ struct union_mount *u;
+
+ spin_lock(&vfsmount_lock);
+ u = find_union_mount(mnt, dentry);
+ spin_unlock(&vfsmount_lock);
+ if (u)
+ return 1;
+ else
+ return 0;
+}
+
struct vfsmount *alloc_vfsmnt(const char *name)
{
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -67,6 +151,7 @@ struct vfsmount *alloc_vfsmnt(const char
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ INIT_LIST_HEAD(&mnt->mnt_union);
if (name) {
int size = strlen(name) + 1;
char *newname = kmalloc(size, GFP_KERNEL);
@@ -173,18 +258,20 @@ void mnt_set_mountpoint(struct vfsmount
dentry->d_mounted++;
}

-static void attach_mnt(struct vfsmount *mnt, struct nameidata *nd)
+static void attach_mnt(struct vfsmount *mnt, struct nameidata *nd,
+ struct union_mount *u)
{
mnt_set_mountpoint(nd->mnt, nd->dentry, mnt);
list_add_tail(&mnt->mnt_hash, mount_hashtable +
hash(nd->mnt, nd->dentry));
list_add_tail(&mnt->mnt_child, &nd->mnt->mnt_mounts);
+ attach_mnt_union(u);
}

/*
* the caller must hold vfsmount_lock
*/
-static void commit_tree(struct vfsmount *mnt)
+static void commit_tree(struct vfsmount *mnt, struct union_mount *u)
{
struct vfsmount *parent = mnt->mnt_parent;
struct vfsmount *m;
@@ -201,6 +288,7 @@ static void commit_tree(struct vfsmount
list_add_tail(&mnt->mnt_hash, mount_hashtable +
hash(parent, mnt->mnt_mountpoint));
list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+ attach_mnt_union(u);
touch_mnt_namespace(n);
}

@@ -342,8 +430,18 @@ static struct vfsmount *clone_mnt(struct
static inline void __mntput(struct vfsmount *mnt)
{
struct super_block *sb = mnt->mnt_sb;
+ struct union_mount *u, *next;
+
dput(mnt->mnt_root);
clear_mnt_user(mnt);
+
+ list_for_each_entry_safe(u, next, &mnt->mnt_union, list) {
+ list_del_init(&u->list);
+ dput(u->src_dentry);
+ mntput(u->dst_mnt);
+ dput(u->dst_dentry);
+ kfree(u);
+ }
free_vfsmnt(mnt);
deactivate_super(sb);
}
@@ -352,6 +450,17 @@ void mntput_no_expire(struct vfsmount *m
{
repeat:
if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
+ struct union_mount *u;
+
+ /*
+ * Remove all union_mounts under this mnt from the
+ * union_mount_hashtable. This needs to be be done with
+ * vfsmount_lock held. The rest of the cleanup is done
+ * outside of the lock.
+ */
+ list_for_each_entry(u, &mnt->mnt_union, list)
+ list_del_init(&u->hash);
+
if (likely(!mnt->mnt_pinned)) {
spin_unlock(&vfsmount_lock);
__mntput(mnt);
@@ -436,6 +545,7 @@ static int show_vfsmnt(struct seq_file *
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
{ MNT_NOMNT, ",nomnt" },
+ { MNT_UNION, ",union" },
{ 0, NULL }
};
struct proc_fs_info *fs_infop;
@@ -839,7 +949,11 @@ struct vfsmount *copy_tree(struct vfsmou
goto error;
spin_lock(&vfsmount_lock);
list_add_tail(&q->mnt_list, &res->mnt_list);
- attach_mnt(q, &nd);
+ /*
+ * TODO: Understand and pass appropriate union_mount
+ * argument here.
+ */
+ attach_mnt(q, &nd, NULL);
spin_unlock(&vfsmount_lock);
}
}
@@ -925,10 +1039,16 @@ static int attach_recursive_mnt(struct v
struct vfsmount *dest_mnt = nd->mnt;
struct dentry *dest_dentry = nd->dentry;
struct vfsmount *child, *p;
+ struct union_mount *u = NULL;

if (propagate_mnt(dest_mnt, dest_dentry, source_mnt, &tree_list))
return -EINVAL;

+ if (IS_MNT_UNION(source_mnt))
+ if (!(u = alloc_union_mount(source_mnt, source_mnt->mnt_root,
+ dest_mnt, dest_dentry)))
+ return -ENOMEM;
+
if (IS_MNT_SHARED(dest_mnt)) {
for (p = source_mnt; p; p = next_mnt(p, source_mnt))
set_mnt_shared(p);
@@ -937,18 +1057,26 @@ static int attach_recursive_mnt(struct v
spin_lock(&vfsmount_lock);
if (parent_nd) {
detach_mnt(source_mnt, parent_nd);
- attach_mnt(source_mnt, nd);
+ attach_mnt(source_mnt, nd, u);
touch_mnt_namespace(current->nsproxy->mnt_ns);
} else {
mnt_set_mountpoint(dest_mnt, dest_dentry, source_mnt);
- commit_tree(source_mnt);
+ commit_tree(source_mnt, u);
}

list_for_each_entry_safe(child, p, &tree_list, mnt_hash) {
list_del_init(&child->mnt_hash);
- commit_tree(child);
+ if (IS_MNT_UNION(child)) {
+ u = alloc_union_mount(child, child->mnt_root,
+ child->mnt_parent, child->mnt_mountpoint);
+ /* FIXME: It is too late to fail from here */
+ if (!u)
+ printk(KERN_ERR "attach_recursive_mnt: ENOMEM\n");
+ }
+ commit_tree(child, u);
}
spin_unlock(&vfsmount_lock);
+
return 0;
}

@@ -1556,9 +1684,12 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_RELATIME;
if (flags & MS_NOMNT)
mnt_flags |= MNT_NOMNT;
+ if (flags & MS_UNION)
+ mnt_flags |= MNT_UNION;

flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
- MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT);
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME | MS_NOMNT |
+ MS_UNION);

/* ... and get the mountpoint */
retval = path_lookup(dir_name, LOOKUP_FOLLOW, &nd);
@@ -1888,8 +2019,9 @@ asmlinkage long sys_pivot_root(const cha
goto out3;
detach_mnt(new_nd.mnt, &parent_nd);
detach_mnt(user_nd.mnt, &root_parent);
- attach_mnt(user_nd.mnt, &old_nd); /* mount old root on put_old */
- attach_mnt(new_nd.mnt, &root_parent); /* mount new_root on / */
+ /* TODO: Understand and pass appropriate union_mount argument here. */
+ attach_mnt(user_nd.mnt, &old_nd, NULL); /* mount old root on put_old */
+ attach_mnt(new_nd.mnt, &root_parent, NULL); /* mount new_root on / */
touch_mnt_namespace(current->nsproxy->mnt_ns);
spin_unlock(&vfsmount_lock);
chroot_fs_refs(&user_nd, &new_nd);
@@ -1940,7 +2072,7 @@ static void __init init_mount_tree(void)

void __init mnt_init(unsigned long mempages)
{
- struct list_head *d;
+ struct list_head *d, *e;
unsigned int nr_hash;
int i;
int err;
@@ -1976,12 +2108,24 @@ void __init mnt_init(unsigned long mempa

printk("Mount-cache hash table entries: %d\n", nr_hash);

+ /*
+ * Use the same nr_hash for union mount hashtable also.
+ * TODO: This might need a bigger hash table.
+ */
+ union_mount_hashtable = (struct list_head *)__get_free_page(GFP_ATOMIC);
+
+ if (!union_mount_hashtable)
+ panic("Failed to allocate union mount hash table\n");
+
/* And initialize the newly allocated array */
d = mount_hashtable;
+ e = union_mount_hashtable;
i = nr_hash;
do {
INIT_LIST_HEAD(d);
+ INIT_LIST_HEAD(e);
d++;
+ e++;
i--;
} while (i);
err = sysfs_init();
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -113,6 +113,7 @@ extern int dir_notify_enable;
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_UNION 256 /* Union mount */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -36,6 +36,7 @@ struct mnt_namespace;
#define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
#define MNT_PNODE_MASK 0x3000 /* propagation flag mask */
+#define MNT_UNION 0x4000 /* if the vfsmount is a union mount */

struct vfsmount {
struct list_head mnt_hash;
@@ -53,6 +54,7 @@ struct vfsmount {
struct list_head mnt_share; /* circular list of shared mounts */
struct list_head mnt_slave_list;/* list of slave mounts */
struct list_head mnt_slave; /* slave list entry */
+ struct list_head mnt_union; /* list of union_mounts */
struct vfsmount *mnt_master; /* slave is on master->mnt_slave_list */
struct mnt_namespace *mnt_ns; /* containing namespace */
/*
@@ -107,5 +109,20 @@ extern void shrink_submounts(struct vfsm
extern spinlock_t vfsmount_lock;
extern dev_t name_to_dev_t(char *name);

+#define IS_MNT_UNION(mnt) (mnt->mnt_flags & MNT_UNION)
+
+struct union_mount {
+ struct vfsmount *src_mnt, *dst_mnt;
+ struct dentry *src_dentry, *dst_dentry;
+ struct list_head hash, list;
+};
+
+extern void attach_mnt_union(struct union_mount *u);
+extern struct union_mount *alloc_union_mount(struct vfsmount *src_mnt,
+ struct dentry *src_dentry, struct vfsmount *dst_mnt,
+ struct dentry *dst_dentry);
+extern int next_union_mount(struct nameidata *nd);
+extern int next_union_mount_exists(struct vfsmount *mnt, struct dentry *dentry);
+
#endif
#endif /* _LINUX_MOUNT_H */

2007-06-20 05:46:20

by Bharata B Rao

[permalink] [raw]
Subject: [RFC PATCH 3/4] Lookup changes to support union mount.

From: Bharata B Rao <[email protected]>
Subject: Lookup changes to support union mount.

Adds support for looking up dentries inside a union mount point.

This patch modifies the do_lookup() routine to look beyond the top layer for
union mount points/directories. Union mount versions of dcache and real lookup
routines are added to support this. The lookup routines are modified to build a
new union stack when looking up a common named directory in the union stack.

TODO: Check if lookup_hash() and friends also need to know about union mount.

Signed-off-by: Bharata B Rao <[email protected]>
---
fs/dcache.c | 53 ++++++++++++++++++
fs/namei.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/dcache.h | 1
3 files changed, 187 insertions(+), 4 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1312,6 +1312,59 @@ next:
return found;
}

+/*
+ * Looks for the given @name in dcache by walking through all the layers
+ * of the union stack, starting from the top.
+ * FIXME: If we don't find the dentry in a upper layer, we descend to the
+ * next layer. So there is a chance to miss this dentry in the top layer
+ * if this is the _first_ time lookup of the dentry in this layer. A real
+ * lookup might have fetched a valid dentry in this layer itself, while we
+ * chose to descend to the next lower layer. One solution is not have this
+ * function itself, do the toplevel lookup in dcache and if it fails proceed
+ * to real_lookup_union() directly.
+ */
+struct dentry *__d_lookup_union(struct nameidata *nd, struct qstr *name)
+{
+ struct dentry *dentry;
+ struct nameidata nd_tmp;
+ struct vfsmount *mnt = mntget(nd->mnt);
+ struct qstr this;
+ int err;
+
+ nd_tmp.mnt = nd->mnt;
+ nd_tmp.dentry = nd->dentry;
+
+ this.name = name->name;
+ this.len = name->len;
+ this.hash = name->hash;
+
+ do {
+ /* d_hash() is a repetition for the top layer. */
+ if (nd_tmp.dentry->d_op && nd_tmp.dentry->d_op->d_hash) {
+ err = nd_tmp.dentry->d_op->d_hash(nd_tmp.dentry, &this);
+ if (err < 0)
+ goto out;
+ }
+
+ dentry = __d_lookup(nd_tmp.dentry, &this);
+ if (dentry) {
+ if (dentry->d_inode) {
+ if (nd->mnt != nd_tmp.mnt) {
+ mntput(nd->mnt);
+ nd->mnt = mntget(nd_tmp.mnt);
+ }
+ mntput(mnt);
+ return dentry;
+ } else {
+ dput(dentry);
+ }
+ }
+ } while (next_union_mount(&nd_tmp));
+out:
+ mntput(mnt);
+ return NULL;
+}
+
/**
* d_hash_and_lookup - hash the qstr then search for a dentry
* @dir: Directory to search in
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -515,6 +515,127 @@ static struct dentry * real_lookup(struc
return result;
}

+/*
+ * Walks down the parent's union stack looking for the given @name.
+ * Builds unions at subdirectory levels if needed.
+ */
+static struct dentry *real_lookup_union(struct nameidata *nd, struct qstr *name)
+{
+ struct dentry *dentry, *topmost, *prev = NULL;
+ struct nameidata nd_tmp;
+ struct vfsmount *mnt_prev = NULL;
+ struct vfsmount *mnt = mntget(nd->mnt);
+
+ topmost = real_lookup(nd->dentry, name, nd);
+ if (IS_ERR(topmost))
+ goto out;
+
+ /* Found a vaild non-directory dentry in the topmost layer, return. */
+ if (topmost->d_inode && !S_ISDIR(topmost->d_inode->i_mode))
+ goto out;
+
+ nd_tmp.mnt = nd->mnt;
+ nd_tmp.dentry = nd->dentry;
+
+ /*
+ * At this point we either have a negative dentry or a directory
+ * as the topmost dentry. Continue to lookup in the bottom layers.
+ */
+ while (next_union_mount(&nd_tmp)) {
+ dentry = real_lookup(nd_tmp.dentry, name, &nd_tmp);
+ /*
+ * If there is an error, return the dentry from
+ * the topmost layer.
+ */
+ if (IS_ERR(dentry))
+ goto out;
+
+ /*
+ * Found a negative dentry in this layer, release it and
+ * continue with the next layer.
+ */
+ if (!dentry->d_inode) {
+ dput(dentry);
+ continue;
+ }
+
+ /*
+ * If we find a vaild non-directory dentry in this layer, And
+ * - if the topmost dentry is negative, return this.
+ * - or if the topmost dentry is valid (dir), return topmost.
+ */
+ if (!S_ISDIR(dentry->d_inode->i_mode)) {
+ if (!topmost->d_inode) {
+ mntput(nd->mnt);
+ nd->mnt = mntget(nd_tmp.mnt);
+ dput(topmost);
+ topmost = dentry;
+ } else {
+ dput(dentry);
+ }
+ goto out;
+ }
+
+ /*
+ * At this point, we have a directory in this layer. And
+ * - if the topmost dentry is negative, make this directory
+ * the topmost dentry.
+ * - or if topmost dentry is valid, union the two directories.
+ */
+ if (!topmost->d_inode) {
+ mntput(nd->mnt);
+ nd->mnt = mntget(nd_tmp.mnt);
+ dput(topmost);
+ topmost = dget(dentry);
+ } else {
+ struct union_mount *u;
+ struct dentry *top_dentry = prev ? prev : topmost;
+ struct vfsmount *top_mnt = mnt_prev ? mnt_prev :
+ nd->mnt;
+ u = alloc_union_mount(top_mnt, top_dentry, nd_tmp.mnt,
+ dentry);
+ if (!u) {
+ dput(dentry);
+ goto out;
+ }
+ spin_lock(&vfsmount_lock);
+ attach_mnt_union(u);
+ spin_unlock(&vfsmount_lock);
+ /*
+ * If the lower layer we found has a layer below
+ * it, no need to continue the lookup, return
+ * with the topmost we found.
+ * Else remember this layer so that this becomes
+ * the top layer for the next union.
+ */
+ if (next_union_mount_exists(nd_tmp.mnt, dentry)) {
+ dput(dentry);
+ goto out;
+ } else {
+ if (prev)
+ dput(prev);
+ prev = dget(dentry);
+ if (mnt_prev)
+ mntput(mnt_prev);
+ mnt_prev = mntget(nd_tmp.mnt);
+ }
+ }
+
+ /*
+ * Release the dentry from this layer and continue
+ * the lookup in the next lower layer.
+ */
+ dput(dentry);
+ }
+out:
+ if (prev)
+ dput(prev);
+ if (mnt_prev)
+ mntput(mnt_prev);
+ mntput(mnt);
+ return topmost;
+}
+
static int __emul_lookup_dentry(const char *, struct nameidata *);

/* SMP-safe */
@@ -769,21 +890,29 @@ static __always_inline void follow_dotdo
static int do_lookup(struct nameidata *nd, struct qstr *name,
struct path *path)
{
- struct vfsmount *mnt = nd->mnt;
- struct dentry *dentry = __d_lookup(nd->dentry, name);
+ struct dentry *dentry;
+
+ if (IS_MNT_UNION(nd->mnt))
+ dentry = __d_lookup_union(nd, name);
+ else
+ dentry = __d_lookup(nd->dentry, name);

if (!dentry)
goto need_lookup;
if (dentry->d_op && dentry->d_op->d_revalidate)
goto need_revalidate;
done:
- path->mnt = mnt;
+ path->mnt = nd->mnt;
path->dentry = dentry;
__follow_mount(path);
return 0;

need_lookup:
- dentry = real_lookup(nd->dentry, name, nd);
+ if (IS_MNT_UNION(nd->mnt))
+ dentry = real_lookup_union(nd, name);
+ else
+ dentry = real_lookup(nd->dentry, name, nd);
+
if (IS_ERR(dentry))
goto fail;
goto done;
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -289,6 +289,7 @@ extern void d_move(struct dentry *, stru
/* appendix may either be NULL or be used for transname suffixes */
extern struct dentry * d_lookup(struct dentry *, struct qstr *);
extern struct dentry * __d_lookup(struct dentry *, struct qstr *);
+extern struct dentry *__d_lookup_union(struct nameidata *nd, struct qstr *name);
extern struct dentry * d_hash_and_lookup(struct dentry *, struct qstr *);

/* validate "insecure" dentry pointer */

2007-06-20 05:47:14

by Bharata B Rao

[permalink] [raw]
Subject: [RFC PATCH 4/4] Directory listing support for union mounted directories.

From: Bharata B Rao <[email protected]>
Subject: Directory listing support for union mounted directories.

Modifies readdir()/getdents() to support union mounted directories.

This patch adds support to readdir()/getdents() to read directory entries
from all the directories of the union stack, and present a merged view to
the userspace. This is achieved by maintaining a cache of read entries.

TODO: Breaks lseek on union mounted directories, Fix this.

Signed-off-by: Bharata B Rao <[email protected]>
---
fs/file_table.c | 1
fs/readdir.c | 242 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 12 ++
3 files changed, 254 insertions(+), 1 deletion(-)

--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -174,6 +174,7 @@ void fastcall __fput(struct file *file)
if (file->f_mode & FMODE_WRITE)
put_write_access(inode);
put_pid(file->f_owner.pid);
+ put_rdstate(file->f_rdstate);
file_kill(file);
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -19,6 +19,8 @@

#include <asm/uaccess.h>

+int readdir_union(struct file *file, void *buf, filldir_t filler);
+
int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
struct inode *inode = file->f_path.dentry->d_inode;
@@ -33,7 +35,10 @@ int vfs_readdir(struct file *file, filld
mutex_lock(&inode->i_mutex);
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
- res = file->f_op->readdir(file, buf, filler);
+ if (IS_MNT_UNION(file->f_path.mnt))
+ res = readdir_union(file, buf, filler);
+ else
+ res = file->f_op->readdir(file, buf, filler);
file_accessed(file);
}
mutex_unlock(&inode->i_mutex);
@@ -303,3 +308,238 @@ out_putf:
out:
return error;
}
+
+/*
+ * NOTE: Some of the code below which maintains a list of dirents as a cache
+ * is from Jan Blunck's original union mount readdir patch.
+ */
+
+/* The readdir cache object */
+struct rdcache_entry {
+ struct list_head list;
+ struct qstr name;
+};
+
+struct rdcache_callback {
+ void *buf; /* original callback buffer */
+ filldir_t filldir; /* the filldir() we should call */
+ int error; /* stores filldir error */
+ struct rdstate *rdstate; /* readdir state */
+};
+
+static int rdcache_add_entry(struct list_head *list,
+ const char *name, int namelen)
+{
+ struct rdcache_entry *this;
+ char *tmp_name;
+
+ this = kmalloc(sizeof(*this), GFP_KERNEL);
+ if (!this) {
+ printk(KERN_CRIT "rdcache_add_entry(): out of kernel memory\n");
+ return -ENOMEM;
+ }
+
+ tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
+ if (!tmp_name) {
+ printk(KERN_CRIT "rdcache_add_entry(): out of kernel memory\n");
+ kfree(this);
+ return -ENOMEM;
+ }
+
+ this->name.name = tmp_name;
+ this->name.len = namelen;
+ this->name.hash = 0;
+ memcpy(tmp_name, name, namelen);
+ tmp_name[namelen] = 0;
+ INIT_LIST_HEAD(&this->list);
+ list_add(&this->list, list);
+ return 0;
+}
+
+static void rdcache_free(struct list_head *list)
+{
+ struct list_head *p;
+ struct list_head *ptmp;
+ int count = 0;
+
+ list_for_each_safe(p, ptmp, list) {
+ struct rdcache_entry *this;
+
+ this = list_entry(p, struct rdcache_entry, list);
+ list_del_init(&this->list);
+ kfree(this->name.name);
+ kfree(this);
+ count++;
+ }
+ return;
+}
+
+static int rdcache_find_entry(struct list_head *uc_list,
+ const char *name, int namelen)
+{
+ struct rdcache_entry *p;
+ int ret = 0;
+
+ list_for_each_entry(p, uc_list, list) {
+ if (p->name.len != namelen)
+ continue;
+ if (strncmp(p->name.name, name, namelen) == 0) {
+ ret = 1;
+ break;
+ }
+ }
+ return ret;
+}
+
+/*
+ * filldir routine for union mounted directories.
+ * Handles duplicate elimination by building a readdir cache.
+ * TODO: readdir cache is a linked list. Check if this can be designed
+ * more efficiently.
+ */
+static int filldir_union(void *buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct rdcache_callback *cb = buf;
+ int err;
+
+ if (rdcache_find_entry(&cb->rdstate->dirent_cache, name, namlen))
+ return 0;
+
+ err = cb->filldir(cb->buf, name, namlen, offset, ino, d_type);
+ if (err >= 0)
+ rdcache_add_entry(&cb->rdstate->dirent_cache, name, namlen);
+ cb->error = err;
+ return err;
+}
+
+/* Called from last fput() */
+void put_rdstate(struct rdstate *rdstate)
+{
+ if (!rdstate)
+ return;
+
+ mntput(rdstate->mnt);
+ dput(rdstate->dentry);
+ rdcache_free(&rdstate->dirent_cache);
+ kfree(rdstate);
+}
+
+static struct rdstate *get_rdstate(struct file *file)
+{
+ if (file->f_rdstate)
+ return file->f_rdstate;
+
+ /*
+ * We have read the dirents from this earlier but now don't have a
+ * corresponding rdstate. This shouldn't happen.
+ */
+ if (file->f_pos)
+ return ERR_PTR(-EINVAL);
+
+ file->f_rdstate = kmalloc(sizeof(struct rdstate), GFP_KERNEL);
+ if (!file->f_rdstate)
+ return ERR_PTR(-ENOMEM);
+
+ file->f_rdstate->off = 0;
+ file->f_rdstate->mnt = mntget(file->f_path.mnt);
+ file->f_rdstate->dentry = dget(file->f_path.dentry);
+ INIT_LIST_HEAD(&file->f_rdstate->dirent_cache);
+ return file->f_rdstate;
+}
+
+int readdir_union(struct file *file, void *buf, filldir_t filler)
+{
+ struct dentry *topmost = file->f_path.dentry;
+ struct rdstate *rdstate;
+ struct nameidata nd;
+ loff_t offset = 0;
+ struct rdcache_callback cb;
+ int err = 0;
+
+ /*
+ * If this is not a union mount point or not a unioned directory
+ * within the union mount point, do readdir in the normal way.
+ */
+ nd.mnt = file->f_path.mnt;
+ nd.dentry = file->f_path.dentry;
+ if (!next_union_mount_exists(nd.mnt, nd.dentry))
+ return file->f_op->readdir(file, buf, filler);
+
+ rdstate = get_rdstate(file);
+ if (IS_ERR(rdstate)) {
+ err = PTR_ERR(rdstate);
+ return err;
+ }
+
+ cb.buf = buf;
+ cb.filldir = filler;
+ cb.rdstate = rdstate;
+
+ offset = rdstate->off;
+
+ /* Read from the topmost directory */
+ if (rdstate->dentry == topmost) {
+ file->f_pos = offset;
+ err = file->f_op->readdir(file, &cb, filldir_union);
+ rdstate->off = file->f_pos;
+ if (err < 0 || cb.error < 0)
+ goto out;
+
+ /*
+ * Reading from topmost dir complete, start reading the lower
+ * dir from the beginning.
+ */
+ offset = 0;
+ } else {
+ nd.mnt = rdstate->mnt;
+ nd.dentry = rdstate->dentry;
+ goto read_lower;
+ }
+
+ if (!next_union_mount(&nd))
+ return err;
+
+read_lower:
+ do {
+ struct vfsmount *mnt_tmp = mntget(nd.mnt);
+ struct dentry *dentry_tmp = dget(nd.dentry);
+ struct file *ftmp = dentry_open(nd.dentry, nd.mnt,
+ file->f_flags);
+ if (IS_ERR(ftmp)) {
+ mntput(mnt_tmp);
+ dput(dentry_tmp);
+ err = PTR_ERR(ftmp);
+ goto out;
+ }
+
+ mutex_lock(&nd.dentry->d_inode->i_mutex);
+ ftmp->f_pos = offset;
+
+ err = ftmp->f_op->readdir(ftmp, &cb, filldir_union);
+ file_accessed(ftmp);
+ rdstate->off = ftmp->f_pos;
+ mutex_unlock(&nd.dentry->d_inode->i_mutex);
+ if (nd.mnt != rdstate->mnt) {
+ mntput(rdstate->mnt);
+ rdstate->mnt = mntget(nd.mnt);
+ }
+ if (nd.dentry != rdstate->dentry) {
+ dput(rdstate->dentry);
+ rdstate->dentry = dget(nd.dentry);
+ }
+ fput(ftmp);
+ if (err < 0 || cb.error < 0)
+ goto out;
+
+ /*
+ * Reading from a lower dir complete, start reading the
+ * next lower dir from the beginning.
+ */
+ offset = 0;
+ } while (next_union_mount(&nd));
+
+ return 0;
+out:
+ return err;
+}
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -841,6 +841,14 @@ static inline void ra_set_size(struct fi
unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping, struct file *filp);

+/* Direct cache for readdir, used for union mounted directories. */
+struct rdstate {
+ struct vfsmount *mnt;
+ struct dentry *dentry;
+ loff_t off;
+ struct list_head dirent_cache;
+};
+
struct file {
/*
* fu_list becomes invalid after file_free is called and queued via
@@ -875,6 +883,7 @@ struct file {
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
+ struct rdstate *f_rdstate;
};
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
@@ -2119,5 +2128,8 @@ static inline void free_secdata(void *se
{ }
#endif /* CONFIG_SECURITY */

+/* fs/readdir.c */
+void put_rdstate(struct rdstate *rdstate);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_FS_H */

2007-06-20 06:03:19

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On Wed, 2007-06-20 at 11:21 +0530, Bharata B Rao wrote:
> From: Bharata B Rao <[email protected]>
> Subject: Union mount documentation.
Hi,

first of all I'm happy to see that people are still working on unionfs;
I'd love to have functionality like this show up in Linux.

I'll not claim to have any VFS knowledge whatsoever, but I was just
wondering what happens in the following scenario:

FS A is mounted twice, in /mnt/A and /mnt/union

FS B is mounted twice, in /mnt/B and as topmost union mount
on /mnt/union

lets for simplicity say both filesystems are entirely empty

user does on FS A:
mkdir /mnt/A/somedir
touch /mnt/A/somedir/somefile

and then 2 things happen in parallel
1) touch /mnt/B/somefile
2) mv /mnt/union/somedir /mnt/union/somefile

since the underlying FS for 2) is FS A... how will this work out locking
wise? Will the VS lock the union directory only? Or will this operate
only on the underlying FS? How is dcache consistency guaranteed for
scenarios like this?


Greetings,
Arjan van de Ven

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-06-20 07:55:31

by Jan Blunck

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On Tue, 19 Jun 2007 22:59:51 -0700, Arjan van de Ven wrote:

> first of all I'm happy to see that people are still working on unionfs;
> I'd love to have functionality like this show up in Linux.

This has nothing to do with unionfs. This is about doing a VFS based
approach to union mounts. Unification is a name-based construct so it
belongs into VFS and not into a separate file system.

> I'll not claim to have any VFS knowledge whatsoever, but I was just
> wondering what happens in the following scenario:
>
> FS A is mounted twice, in /mnt/A and /mnt/union
>
> FS B is mounted twice, in /mnt/B and as topmost union mount
> on /mnt/union
>
> lets for simplicity say both filesystems are entirely empty
>
> user does on FS A:
> mkdir /mnt/A/somedir
> touch /mnt/A/somedir/somefile
>
> and then 2 things happen in parallel
> 1) touch /mnt/B/somefile
> 2) mv /mnt/union/somedir /mnt/union/somefile
>
> since the underlying FS for 2) is FS A... how will this work out locking
> wise? Will the VS lock the union directory only? Or will this operate
> only on the underlying FS? How is dcache consistency guaranteed for
> scenarios like this?

Mounting a file system twice is bad in the first place. This should be
done by using bind mounts and bind a mounted file system into a union.
After that the normal locking rules apply (and hopefully work ;).

Jan

2007-06-20 07:58:12

by Jan Blunck

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] Mount changes to support union mount.

On Wed, 20 Jun 2007 11:22:41 +0530, Bharata B Rao wrote:

> +/*
> + * When propagating mount events to peer group, this is called under
> + * vfsmount_lock. Hence using GFP_ATOMIC for kmalloc here.
> + * TODO: Can we use a separate kmem cache for union_mount ?
> + */
> +struct union_mount *alloc_union_mount(struct vfsmount *src_mnt,
> + struct dentry *src_dentry, struct vfsmount *dst_mnt,
> + struct dentry *dst_dentry)
> +{
> + struct union_mount *u;
> + u = kmalloc(sizeof(struct union_mount), GFP_ATOMIC);
> + if (!u)
> + return u;
> + u->dst_mnt = mntget(dst_mnt);
> + u->dst_dentry = dget(dst_dentry);
> + u->src_mnt = src_mnt;
> + u->src_dentry = dget(src_dentry);
> + INIT_LIST_HEAD(&u->hash);
> + INIT_LIST_HEAD(&u->list);
> + return u;
> +}

Hmm, you pin the dentries in memory until umount. This isn't good. Besides
that this doesn't work with file systems that do invalidate their
dentries. The file system must have a chance to replace the dentry in the
union structure.

Jan

2007-06-20 08:01:01

by Jan Blunck

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] Lookup changes to support union mount.

On Wed, 20 Jun 2007 11:23:26 +0530, Bharata B Rao wrote:

> +/*
> + * Looks for the given @name in dcache by walking through all the layers
> + * of the union stack, starting from the top.
> + * FIXME: If we don't find the dentry in a upper layer, we descend to the
> + * next layer. So there is a chance to miss this dentry in the top layer
> + * if this is the _first_ time lookup of the dentry in this layer. A real
> + * lookup might have fetched a valid dentry in this layer itself, while we
> + * chose to descend to the next lower layer. One solution is not have this
> + * function itself, do the toplevel lookup in dcache and if it fails proceed
> + * to real_lookup_union() directly.
> + */
> +struct dentry *__d_lookup_union(struct nameidata *nd, struct qstr *name)
> +{
> + struct dentry *dentry;
> + struct nameidata nd_tmp;
> + struct vfsmount *mnt = mntget(nd->mnt);
> + struct qstr this;
> + int err;
> +
> + nd_tmp.mnt = nd->mnt;
> + nd_tmp.dentry = nd->dentry;
> +
> + this.name = name->name;
> + this.len = name->len;
> + this.hash = name->hash;
> +
> + do {
> + /* d_hash() is a repetition for the top layer. */
> + if (nd_tmp.dentry->d_op && nd_tmp.dentry->d_op->d_hash) {
> + err = nd_tmp.dentry->d_op->d_hash(nd_tmp.dentry, &this);
> + if (err < 0)
> + goto out;
> + }
> +
> + dentry = __d_lookup(nd_tmp.dentry, &this);
> + if (dentry) {
> + if (dentry->d_inode) {
> + if (nd->mnt != nd_tmp.mnt) {
> + mntput(nd->mnt);
> + nd->mnt = mntget(nd_tmp.mnt);
> + }
> + mntput(mnt);
> + return dentry;
> + } else {
> + dput(dentry);
> + }
> + }
> + } while (next_union_mount(&nd_tmp));
> +out:
> + mntput(mnt);
> + return NULL;
> +}
> +

The reference counting for vfsmount is wrong. next_union_mount() should be
something similar to follow_down(). You should grab the reference to the
underlying mount before doing the lookup. Ok ok, you already have a valid
reference in struct union_mount but anyway.

Jan

2007-06-20 08:12:37

by Jan Blunck

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On Wed, 20 Jun 2007 11:21:57 +0530, Bharata B Rao wrote:

> +4. Union stack: building and traversal
> +-------------------------------------- +Union stack needs to be built
> from two places: during an explicit union +mount (or mount propagation)
> and during the lookup of a directory that +appears in more than one
> layer of the union. +
> +The link between two layers of union stack is maintained using the
> +union_mount structure:
> +
> +struct union_mount {
> + /* vfsmount and dentry of this layer */
> + struct vfsmount *src_mnt;
> + struct dentry *src_dentry;
> +
> + /* vfsmount and dentry of the next lower layer */
> + struct vfsmount *dst_mnt;
> + struct dentry *dst_dentry;
> +
> + /*
> + * This list_head hashes this union_mount based on this layer's + *
> vfsmount and dentry. This is used to get to the next layer of + * the
> stack (dst_mnt, dst_dentry) given the (src_mnt, src_dentry) + * and is
> used for stack traversal. + */
> + struct list_head hash;
> +
> + /*
> + * All union_mounts under a vfsmount(src_mnt) are linked together + *
> at mnt->mnt_union using this list_head. This is needed to destroy + *
> all the union_mounts when the mnt goes away. + */
> + struct list_head list;
> +};
> +
> +These union mount structures are stored in a hash
> table(union_mount_hashtable) +which uses the same hash as used for
> mount_hashtable since both of them use +(vfsmount, dentry) pairs to
> calculate the hash. +
> +During a new mount (or mount propagation), a new union_mount structure
> is +created. A reference to the mountpoint's vfsmount and dentry is
> taken and +stored in the union_mount (as dst_mnt, dst_dentry). And this
> union_mount +is inserted in the union_mount_hashtable based on the hash
> generated by +the mount root's vfsmount and dentry. +
> +Similar method is employed to create a union stack during first time
> lookup +of a common named directory within a union mount point. But
> here, the top +level directory's vfsmount and dentry are hashed to get
> to the lower level +directory's vfsmount and dentry.
> +
> +The insertion, deletion and lookup of union_mounts in the
> +union_mount_hashtable is protected by vfsmount_lock. While traversing
> the +stack, we hold this spinlock only briefly during lookup time and
> release +it as soon as we get the next union stack member. The top level
> of the +stack holds a reference to the next level (via union_mount
> structure) and +so on. Therefore, as long as we hold a reference to a
> union stack member, +its lower layers can't go away. And since we don't
> do the complete +traversal under any lock, it is possible for the stack
> to change over the +level from where we started traversing. For eg. when
> traversing the stack +downwards, a new filesystem can be mounted on top
> of it. When this happens, +the user who had a reference to the old top
> wouldn't have visibility to +the new top and would continue as if the
> new top didn't exist for him. +I believe this is fine as long as members
> of the stack don't go away from +under us(CHECK). And to be sure of
> this, we need to hold a reference to the +level from where we start the
> traversal and should continue to hold it +till we are done with the
> traversal.

Well done. I like your approach much more than the simple chaining of
dentries. When I told you about the idea of maintaining a list of
<dentry,vfsmount> objects I always though about one big structure for all
the layers of an union. Smaller objects that only point to the next layer
seem to be better but make the search for the topmost layer impossible.
You should maintain a reference to the topmost struct union_mount though.

> +5. Union stack: destroying
> +--------------------------
> +In addition to storing the union_mounts in a hash table for quick
> lookups, +they are also stored as a list, headed at vsmount->mnt_union.
> So, all +union_mounts that occur under a vfsmount (starting from the
> mountpoint +followed by the subdir unions) are stored within the
> vfsmount. During +umount (specifically, during the last mntput()), this
> list is traversed +to destroy all union stacks under this vfsmount. +
> +Hence, all union stacks under a vfsmount continue to exist until the
> +vfsmount is unmounted. It may be noted that the union_mount structure
> +holds a reference to the current dentry also. Becasue of this, for
> +subdir unions, both the top and bottom level dentries become pinned
> +till the upper layer filesystem is unmounted. Is this behaviour
> +acceptable ? Would this lead to a lot of pinned dentries over a period
> +of time ? (CHECK) If we don't do this, the top layer dentry might go
> +out of cache, during which time we have no means to release the
> +corresponding union_mount and the union_mount becomes stale. Would it
> +be necessary and worthwhile to add intelligence to prune_dcache() to
> +prune unused union_mounts thereby releasing the dentries ? +
> +As noted above, we hold the refernce to current dentry from union_mount
> +but don't get a reference to the corresponding vfsmount. We depend on
> +the user of the union stack to hold the reference to the topmost
> vfsmount +until he is done with the stack traversal. Not holding a
> reference to the +top vfsmount from within union_mount allows us to free
> all the union_mounts +from last mntput of the top vfsmount. Is this
> approach acceptable ? +
> +NOTE: union_mount structures are part of two lists: the hash list for
> +quick lookups and a linked list to aid the freeing of these structures
> +during unmount.

This must changed. This is the only reason why the dentry chaining
approach was so complex. You need a way to get rid of all unused dentries
in a union.

Besides that, I wonder why you left out the rest of my code? The readdir,
whiteout and copyup parts are orthogonal to the code for maintaining the
union structure itself. I just rewrote most of it myself to use functions
like follow_union_down() etc to get rid of the dentry chaining in the long
run.

Jan

2007-06-20 08:53:41

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] Mount changes to support union mount.

(replying from a different ID as you didn't copy me on reply)

On 6/20/07, Jan Blunck <[email protected]> wrote:
> On Wed, 20 Jun 2007 11:22:41 +0530, Bharata B Rao wrote:
>
> > +/*
> > + * When propagating mount events to peer group, this is called under
> > + * vfsmount_lock. Hence using GFP_ATOMIC for kmalloc here.
> > + * TODO: Can we use a separate kmem cache for union_mount ?
> > + */
> > +struct union_mount *alloc_union_mount(struct vfsmount *src_mnt,
> > + struct dentry *src_dentry, struct vfsmount *dst_mnt,
> > + struct dentry *dst_dentry)
> > +{
> > + struct union_mount *u;
> > + u = kmalloc(sizeof(struct union_mount), GFP_ATOMIC);
> > + if (!u)
> > + return u;
> > + u->dst_mnt = mntget(dst_mnt);
> > + u->dst_dentry = dget(dst_dentry);
> > + u->src_mnt = src_mnt;
> > + u->src_dentry = dget(src_dentry);
> > + INIT_LIST_HEAD(&u->hash);
> > + INIT_LIST_HEAD(&u->list);
> > + return u;
> > +}
>
> Hmm, you pin the dentries in memory until umount. This isn't good. Besides
> that this doesn't work with file systems that do invalidate their
> dentries. The file system must have a chance to replace the dentry in the
> union structure.

Yes, both top level and next level dentries are pinned until umount of
the upper layer. I was thinking if we could prune these from
prune_dcache(). What do you think ?

Ok, I haven't thought about filesystem invalidating the dentries. Yet
to understand the dentry invalidation, but would filesystem invalidate
an inuse dentry ?

Regards,
Bharata.
--
"Men come and go but mountains remain" -- Ruskin Bond.

2007-06-20 08:57:05

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] Lookup changes to support union mount.

On 6/20/07, Jan Blunck <[email protected]> wrote:
> On Wed, 20 Jun 2007 11:23:26 +0530, Bharata B Rao wrote:
>
> > +/*
> > + * Looks for the given @name in dcache by walking through all the layers
> > + * of the union stack, starting from the top.
> > + * FIXME: If we don't find the dentry in a upper layer, we descend to the
> > + * next layer. So there is a chance to miss this dentry in the top layer
> > + * if this is the _first_ time lookup of the dentry in this layer. A real
> > + * lookup might have fetched a valid dentry in this layer itself, while we
> > + * chose to descend to the next lower layer. One solution is not have this
> > + * function itself, do the toplevel lookup in dcache and if it fails proceed
> > + * to real_lookup_union() directly.
> > + */
> > +struct dentry *__d_lookup_union(struct nameidata *nd, struct qstr *name)
> > +{
> > + struct dentry *dentry;
> > + struct nameidata nd_tmp;
> > + struct vfsmount *mnt = mntget(nd->mnt);
> > + struct qstr this;
> > + int err;
> > +
> > + nd_tmp.mnt = nd->mnt;
> > + nd_tmp.dentry = nd->dentry;
> > +
> > + this.name = name->name;
> > + this.len = name->len;
> > + this.hash = name->hash;
> > +
> > + do {
> > + /* d_hash() is a repetition for the top layer. */
> > + if (nd_tmp.dentry->d_op && nd_tmp.dentry->d_op->d_hash) {
> > + err = nd_tmp.dentry->d_op->d_hash(nd_tmp.dentry, &this);
> > + if (err < 0)
> > + goto out;
> > + }
> > +
> > + dentry = __d_lookup(nd_tmp.dentry, &this);
> > + if (dentry) {
> > + if (dentry->d_inode) {
> > + if (nd->mnt != nd_tmp.mnt) {
> > + mntput(nd->mnt);
> > + nd->mnt = mntget(nd_tmp.mnt);
> > + }
> > + mntput(mnt);
> > + return dentry;
> > + } else {
> > + dput(dentry);
> > + }
> > + }
> > + } while (next_union_mount(&nd_tmp));
> > +out:
> > + mntput(mnt);
> > + return NULL;
> > +}
> > +
>
> The reference counting for vfsmount is wrong. next_union_mount() should be
> something similar to follow_down(). You should grab the reference to the
> underlying mount before doing the lookup. Ok ok, you already have a valid
> reference in struct union_mount but anyway.

The idea is that a reference to the top layer's vfsmount and dentry is
taken by the user of the union mount point/directory (eg. lookup,
readdir etc) and that should ensure that all the bottom layers are
valid until the top level reference is held.

Regards,
Bharata.
--
"Men come and go but mountains remain" -- Ruskin Bond.

2007-06-20 09:09:46

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On 6/20/07, Jan Blunck <[email protected]> wrote:
> On Wed, 20 Jun 2007 11:21:57 +0530, Bharata B Rao wrote:
> <snip>
> Well done. I like your approach much more than the simple chaining of
> dentries. When I told you about the idea of maintaining a list of
> <dentry,vfsmount> objects I always though about one big structure for all
> the layers of an union. Smaller objects that only point to the next layer
> seem to be better but make the search for the topmost layer impossible.
> You should maintain a reference to the topmost struct union_mount though.

Even in our last version I didn't understand clearly why you had
pointers from the bottom layers to the topmost layer. Could you please
explain under what circumstances there needs to be a bottom to top
traversal ?

>
> > +5. Union stack: destroying
> > +--------------------------
> > +In addition to storing the union_mounts in a hash table for quick
> > lookups, +they are also stored as a list, headed at vsmount->mnt_union.
> > So, all +union_mounts that occur under a vfsmount (starting from the
> > mountpoint +followed by the subdir unions) are stored within the
> > vfsmount. During +umount (specifically, during the last mntput()), this
> > list is traversed +to destroy all union stacks under this vfsmount. +
> > +Hence, all union stacks under a vfsmount continue to exist until the
> > +vfsmount is unmounted. It may be noted that the union_mount structure
> > +holds a reference to the current dentry also. Becasue of this, for
> > +subdir unions, both the top and bottom level dentries become pinned
> > +till the upper layer filesystem is unmounted. Is this behaviour
> > +acceptable ? Would this lead to a lot of pinned dentries over a period
> > +of time ? (CHECK) If we don't do this, the top layer dentry might go
> > +out of cache, during which time we have no means to release the
> > +corresponding union_mount and the union_mount becomes stale. Would it
> > +be necessary and worthwhile to add intelligence to prune_dcache() to
> > +prune unused union_mounts thereby releasing the dentries ? +
> > +As noted above, we hold the refernce to current dentry from union_mount
> > +but don't get a reference to the corresponding vfsmount. We depend on
> > +the user of the union stack to hold the reference to the topmost
> > vfsmount +until he is done with the stack traversal. Not holding a
> > reference to the +top vfsmount from within union_mount allows us to free
> > all the union_mounts +from last mntput of the top vfsmount. Is this
> > approach acceptable ? +
> > +NOTE: union_mount structures are part of two lists: the hash list for
> > +quick lookups and a linked list to aid the freeing of these structures
> > +during unmount.
>
> This must changed. This is the only reason why the dentry chaining
> approach was so complex. You need a way to get rid of all unused dentries
> in a union.

The second list headed at mnt->mnt_union was added precisely to get
rid of all the union_mounts under a vfsmount at umount time. So umount
is the time to destroy the union stack.
>
> Besides that, I wonder why you left out the rest of my code? The readdir,
> whiteout and copyup parts are orthogonal to the code for maintaining the
> union structure itself. I just rewrote most of it myself to use functions
> like follow_union_down() etc to get rid of the dentry chaining in the long
> run.

The idea was to start simple, get some feedback and concensus and add
features after that. Some of the feedback I got from our last two
posts was that the code was too complex and big to review and we had
so many patches. So this time I have started with the bare minimum so
that it becomes easier for the reviewers. I plan to add copyup and
whiteout only when there is an agreement that this approach of
unioning is acceptable.

And about readdir, I digressed from your approach a bit and made
readdir cache persistant across readdir()/getdents() calls. Also, made
readdir on union mounted directories filesystem independent unlike our
earlier approach. But again this breaks lseek as I have noted, which
needs to be fixed.

Regards,
Bharata.
--
"Men come and go but mountains remain" -- Ruskin Bond.

2007-06-20 12:09:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] Directory listing support for union mounted directories.

On Wed, Jun 20, 2007 at 11:24:18AM +0530, Bharata B Rao wrote:
> From: Bharata B Rao <[email protected]>
> Subject: Directory listing support for union mounted directories.
>
> Modifies readdir()/getdents() to support union mounted directories.
>
> This patch adds support to readdir()/getdents() to read directory entries
> from all the directories of the union stack, and present a merged view to
> the userspace. This is achieved by maintaining a cache of read entries.
>
> TODO: Breaks lseek on union mounted directories, Fix this.

Btw, to avoid dealing with file structs representing multiple objects
->readdir should be changed to an inode_operation and not take a struct
file * but rather struct inode + loff_t *ppos. This sounds like a nice
preparatory patch for any kind of union mount implementation and is
acceptable standing on it's own.

>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---
> fs/file_table.c | 1
> fs/readdir.c | 242 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/fs.h | 12 ++
> 3 files changed, 254 insertions(+), 1 deletion(-)
>
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -174,6 +174,7 @@ void fastcall __fput(struct file *file)
> if (file->f_mode & FMODE_WRITE)
> put_write_access(inode);
> put_pid(file->f_owner.pid);
> + put_rdstate(file->f_rdstate);
> file_kill(file);
> file->f_path.dentry = NULL;
> file->f_path.mnt = NULL;
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -19,6 +19,8 @@
>
> #include <asm/uaccess.h>
>
> +int readdir_union(struct file *file, void *buf, filldir_t filler);
> +
> int vfs_readdir(struct file *file, filldir_t filler, void *buf)
> {
> struct inode *inode = file->f_path.dentry->d_inode;
> @@ -33,7 +35,10 @@ int vfs_readdir(struct file *file, filld
> mutex_lock(&inode->i_mutex);
> res = -ENOENT;
> if (!IS_DEADDIR(inode)) {
> - res = file->f_op->readdir(file, buf, filler);
> + if (IS_MNT_UNION(file->f_path.mnt))
> + res = readdir_union(file, buf, filler);
> + else
> + res = file->f_op->readdir(file, buf, filler);
> file_accessed(file);
> }
> mutex_unlock(&inode->i_mutex);
> @@ -303,3 +308,238 @@ out_putf:
> out:
> return error;
> }
> +
> +/*
> + * NOTE: Some of the code below which maintains a list of dirents as a cache
> + * is from Jan Blunck's original union mount readdir patch.
> + */
> +
> +/* The readdir cache object */
> +struct rdcache_entry {
> + struct list_head list;
> + struct qstr name;
> +};
> +
> +struct rdcache_callback {
> + void *buf; /* original callback buffer */
> + filldir_t filldir; /* the filldir() we should call */
> + int error; /* stores filldir error */
> + struct rdstate *rdstate; /* readdir state */
> +};
> +
> +static int rdcache_add_entry(struct list_head *list,
> + const char *name, int namelen)
> +{
> + struct rdcache_entry *this;
> + char *tmp_name;
> +
> + this = kmalloc(sizeof(*this), GFP_KERNEL);
> + if (!this) {
> + printk(KERN_CRIT "rdcache_add_entry(): out of kernel memory\n");
> + return -ENOMEM;
> + }
> +
> + tmp_name = kmalloc(namelen + 1, GFP_KERNEL);
> + if (!tmp_name) {
> + printk(KERN_CRIT "rdcache_add_entry(): out of kernel memory\n");
> + kfree(this);
> + return -ENOMEM;
> + }
> +
> + this->name.name = tmp_name;
> + this->name.len = namelen;
> + this->name.hash = 0;
> + memcpy(tmp_name, name, namelen);
> + tmp_name[namelen] = 0;
> + INIT_LIST_HEAD(&this->list);
> + list_add(&this->list, list);
> + return 0;
> +}
> +
> +static void rdcache_free(struct list_head *list)
> +{
> + struct list_head *p;
> + struct list_head *ptmp;
> + int count = 0;
> +
> + list_for_each_safe(p, ptmp, list) {
> + struct rdcache_entry *this;
> +
> + this = list_entry(p, struct rdcache_entry, list);
> + list_del_init(&this->list);
> + kfree(this->name.name);
> + kfree(this);
> + count++;
> + }
> + return;
> +}
> +
> +static int rdcache_find_entry(struct list_head *uc_list,
> + const char *name, int namelen)
> +{
> + struct rdcache_entry *p;
> + int ret = 0;
> +
> + list_for_each_entry(p, uc_list, list) {
> + if (p->name.len != namelen)
> + continue;
> + if (strncmp(p->name.name, name, namelen) == 0) {
> + ret = 1;
> + break;
> + }
> + }
> + return ret;
> +}
> +
> +/*
> + * filldir routine for union mounted directories.
> + * Handles duplicate elimination by building a readdir cache.
> + * TODO: readdir cache is a linked list. Check if this can be designed
> + * more efficiently.
> + */
> +static int filldir_union(void *buf, const char *name, int namlen,
> + loff_t offset, u64 ino, unsigned int d_type)
> +{
> + struct rdcache_callback *cb = buf;
> + int err;
> +
> + if (rdcache_find_entry(&cb->rdstate->dirent_cache, name, namlen))
> + return 0;
> +
> + err = cb->filldir(cb->buf, name, namlen, offset, ino, d_type);
> + if (err >= 0)
> + rdcache_add_entry(&cb->rdstate->dirent_cache, name, namlen);
> + cb->error = err;
> + return err;
> +}
> +
> +/* Called from last fput() */
> +void put_rdstate(struct rdstate *rdstate)
> +{
> + if (!rdstate)
> + return;
> +
> + mntput(rdstate->mnt);
> + dput(rdstate->dentry);
> + rdcache_free(&rdstate->dirent_cache);
> + kfree(rdstate);
> +}
> +
> +static struct rdstate *get_rdstate(struct file *file)
> +{
> + if (file->f_rdstate)
> + return file->f_rdstate;
> +
> + /*
> + * We have read the dirents from this earlier but now don't have a
> + * corresponding rdstate. This shouldn't happen.
> + */
> + if (file->f_pos)
> + return ERR_PTR(-EINVAL);
> +
> + file->f_rdstate = kmalloc(sizeof(struct rdstate), GFP_KERNEL);
> + if (!file->f_rdstate)
> + return ERR_PTR(-ENOMEM);
> +
> + file->f_rdstate->off = 0;
> + file->f_rdstate->mnt = mntget(file->f_path.mnt);
> + file->f_rdstate->dentry = dget(file->f_path.dentry);
> + INIT_LIST_HEAD(&file->f_rdstate->dirent_cache);
> + return file->f_rdstate;
> +}
> +
> +int readdir_union(struct file *file, void *buf, filldir_t filler)
> +{
> + struct dentry *topmost = file->f_path.dentry;
> + struct rdstate *rdstate;
> + struct nameidata nd;
> + loff_t offset = 0;
> + struct rdcache_callback cb;
> + int err = 0;
> +
> + /*
> + * If this is not a union mount point or not a unioned directory
> + * within the union mount point, do readdir in the normal way.
> + */
> + nd.mnt = file->f_path.mnt;
> + nd.dentry = file->f_path.dentry;
> + if (!next_union_mount_exists(nd.mnt, nd.dentry))
> + return file->f_op->readdir(file, buf, filler);
> +
> + rdstate = get_rdstate(file);
> + if (IS_ERR(rdstate)) {
> + err = PTR_ERR(rdstate);
> + return err;
> + }
> +
> + cb.buf = buf;
> + cb.filldir = filler;
> + cb.rdstate = rdstate;
> +
> + offset = rdstate->off;
> +
> + /* Read from the topmost directory */
> + if (rdstate->dentry == topmost) {
> + file->f_pos = offset;
> + err = file->f_op->readdir(file, &cb, filldir_union);
> + rdstate->off = file->f_pos;
> + if (err < 0 || cb.error < 0)
> + goto out;
> +
> + /*
> + * Reading from topmost dir complete, start reading the lower
> + * dir from the beginning.
> + */
> + offset = 0;
> + } else {
> + nd.mnt = rdstate->mnt;
> + nd.dentry = rdstate->dentry;
> + goto read_lower;
> + }
> +
> + if (!next_union_mount(&nd))
> + return err;
> +
> +read_lower:
> + do {
> + struct vfsmount *mnt_tmp = mntget(nd.mnt);
> + struct dentry *dentry_tmp = dget(nd.dentry);
> + struct file *ftmp = dentry_open(nd.dentry, nd.mnt,
> + file->f_flags);
> + if (IS_ERR(ftmp)) {
> + mntput(mnt_tmp);
> + dput(dentry_tmp);
> + err = PTR_ERR(ftmp);
> + goto out;
> + }
> +
> + mutex_lock(&nd.dentry->d_inode->i_mutex);
> + ftmp->f_pos = offset;
> +
> + err = ftmp->f_op->readdir(ftmp, &cb, filldir_union);
> + file_accessed(ftmp);
> + rdstate->off = ftmp->f_pos;
> + mutex_unlock(&nd.dentry->d_inode->i_mutex);
> + if (nd.mnt != rdstate->mnt) {
> + mntput(rdstate->mnt);
> + rdstate->mnt = mntget(nd.mnt);
> + }
> + if (nd.dentry != rdstate->dentry) {
> + dput(rdstate->dentry);
> + rdstate->dentry = dget(nd.dentry);
> + }
> + fput(ftmp);
> + if (err < 0 || cb.error < 0)
> + goto out;
> +
> + /*
> + * Reading from a lower dir complete, start reading the
> + * next lower dir from the beginning.
> + */
> + offset = 0;
> + } while (next_union_mount(&nd));
> +
> + return 0;
> +out:
> + return err;
> +}
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -841,6 +841,14 @@ static inline void ra_set_size(struct fi
> unsigned long ra_submit(struct file_ra_state *ra,
> struct address_space *mapping, struct file *filp);
>
> +/* Direct cache for readdir, used for union mounted directories. */
> +struct rdstate {
> + struct vfsmount *mnt;
> + struct dentry *dentry;
> + loff_t off;
> + struct list_head dirent_cache;
> +};
> +
> struct file {
> /*
> * fu_list becomes invalid after file_free is called and queued via
> @@ -875,6 +883,7 @@ struct file {
> spinlock_t f_ep_lock;
> #endif /* #ifdef CONFIG_EPOLL */
> struct address_space *f_mapping;
> + struct rdstate *f_rdstate;
> };
> extern spinlock_t files_lock;
> #define file_list_lock() spin_lock(&files_lock);
> @@ -2119,5 +2128,8 @@ static inline void free_secdata(void *se
> { }
> #endif /* CONFIG_SECURITY */
>
> +/* fs/readdir.c */
> +void put_rdstate(struct rdstate *rdstate);
> +
> #endif /* __KERNEL__ */
> #endif /* _LINUX_FS_H */
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
---end quoted text---

2007-06-20 12:32:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On Wed, Jun 20, 2007 at 07:29:55AM +0000, Jan Blunck wrote:
> Mounting a file system twice is bad in the first place. This should be
> done by using bind mounts and bind a mounted file system into a union.
> After that the normal locking rules apply (and hopefully work ;).

>From the kernel POV mounting a filesystem twice is the same as doing
a bind mount.

2007-06-20 12:52:45

by Jan Blunck

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On Wed, 20 Jun 2007 13:32:23 +0100, Christoph Hellwig wrote:

> On Wed, Jun 20, 2007 at 07:29:55AM +0000, Jan Blunck wrote:
>> Mounting a file system twice is bad in the first place. This should be
>> done by using bind mounts and bind a mounted file system into a union.
>> After that the normal locking rules apply (and hopefully work ;).
>
> From the kernel POV mounting a filesystem twice is the same as doing
> a bind mount.

Somehow I thought about doing this:

mount /dev/dasda1 /mnt/A
mount /dev/dasda1 /mnt/B

... which doesn't result in a bind mount.

2007-06-20 13:05:27

by Jan Blunck

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On Tue, 19 Jun 2007 22:59:51 -0700, Arjan van de Ven wrote:

> user does on FS A:
> mkdir /mnt/A/somedir
> touch /mnt/A/somedir/somefile
>
> and then 2 things happen in parallel
> 1) touch /mnt/B/somefile
> 2) mv /mnt/union/somedir /mnt/union/somefile
>
> since the underlying FS for 2) is FS A... how will this work out locking
> wise? Will the VS lock the union directory only? Or will this operate
> only on the underlying FS? How is dcache consistency guaranteed for
> scenarios like this?

Ok, with Christophs help I guess I know now what the question is :)

touch /mnt/B/somefile is doing a lookup in "B" for "somefile". Therefore it
locks B->i_mutex for that. When it gets a negative dentry it creates the
file.

mv /mnt/union/somedir /mnt/union/somefile is doing a lookup in "union" for
"somefile". Therefore it first locks the i_mutex of the topmost directory
in the union of "/mnt/union" (which happens to be "B"). When it gets a
negative dentry it than follows the union down to the next layer (with the
topmost directory still locked). Lookup is repeated until a filled dentry
is found or the topmost dentry negative dentry is used as a target for the
move. Thats it.

Did that answer your question?

Cheers,
Jan

2007-06-20 13:25:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On Wed, Jun 20, 2007 at 12:43:56PM +0000, Jan Blunck wrote:
> On Wed, 20 Jun 2007 13:32:23 +0100, Christoph Hellwig wrote:
>
> > On Wed, Jun 20, 2007 at 07:29:55AM +0000, Jan Blunck wrote:
> >> Mounting a file system twice is bad in the first place. This should be
> >> done by using bind mounts and bind a mounted file system into a union.
> >> After that the normal locking rules apply (and hopefully work ;).
> >
> > From the kernel POV mounting a filesystem twice is the same as doing
> > a bind mount.
>
> Somehow I thought about doing this:
>
> mount /dev/dasda1 /mnt/A
> mount /dev/dasda1 /mnt/B
>
> ... which doesn't result in a bind mount.

But the kernel internal effect is exactly the same. One superblock instance,
two vfsmounts referring to it.

2007-06-20 14:23:07

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] Directory listing support for union mounted directories.

On Wed, 2007-06-20 at 13:09 +0100, Christoph Hellwig wrote:
> On Wed, Jun 20, 2007 at 11:24:18AM +0530, Bharata B Rao wrote:
> > From: Bharata B Rao <[email protected]>
> > Subject: Directory listing support for union mounted directories.
> >
> > Modifies readdir()/getdents() to support union mounted directories.
> >
> > This patch adds support to readdir()/getdents() to read directory entries
> > from all the directories of the union stack, and present a merged view to
> > the userspace. This is achieved by maintaining a cache of read entries.
> >
> > TODO: Breaks lseek on union mounted directories, Fix this.
>
> Btw, to avoid dealing with file structs representing multiple objects
> ->readdir should be changed to an inode_operation and not take a struct
> file * but rather struct inode + loff_t *ppos. This sounds like a nice
> preparatory patch for any kind of union mount implementation and is
> acceptable standing on it's own.

No it shouldn't. The struct file contains other stateful information
from the open() call (such as authentication info) that needs to be
passed into readdir.

Trond

2007-06-20 17:02:37

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] Directory listing support for union mounted directories.

On Wed, Jun 20, 2007 at 10:22:28AM -0400, Trond Myklebust wrote:
> No it shouldn't. The struct file contains other stateful information
> from the open() call (such as authentication info) that needs to be
> passed into readdir.

Which is exactly that problem this tries to solve. Once you have
union mounts you'll have a single open file descriptor for multiple
actual directories. Beause of that you can't simply attach to the
state to the struct file but have to keep it in a different way.

2007-06-20 17:28:37

by Erez Zadok

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

In message <[email protected]>, Jan Blunck writes:
> On Tue, 19 Jun 2007 22:59:51 -0700, Arjan van de Ven wrote:
>
> > first of all I'm happy to see that people are still working on unionfs;
> > I'd love to have functionality like this show up in Linux.
>
> This has nothing to do with unionfs. This is about doing a VFS based
> approach to union mounts. Unification is a name-based construct so it
> belongs into VFS and not into a separate file system.

Jan, while I agree with you in principle that unification is a VFS-level
namespace construct, I disagree with you that unioning doesn't belong in a
separate f/s.

As someone whose group developed three generations of the stackable file
system Unionfs (see http://unionfs.filesystems.org/), I can tell you from my
experience and the experience of numerous users, that the devil is the
details -- or the so-called orthogonal issues. To get a fully working
unioning implementation, one that the many current users of Unionfs could
use, you'll have to deal with many issues and corner cases: cache coherency,
inode persistency (and network f/s exports), copyups, whiteouts and opaque
dirs, how to deal with "odd" file systems which don't support native
whiteouts and such, directory reading (seekdir), and more. Our third
generation Unionfs, the one with On-Disk Format (ODF), handles all of these.
Rather than reproduce all that discussion here, I'll point people to read
more info here: <http://www.filesystems.org/unionfs-odf.txt>

So, to have a fully usable union mounts implementation, you're going to have
to support a lot of existing features; but if you were to support them all
at the VFS level, you will have bloated the VFS considerably with stuff that
many would argue does not belong in the VFS.

Sincerely,
Erez Zadok.

2007-06-20 17:45:14

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] Directory listing support for union mounted directories.

On Wed, 2007-06-20 at 18:02 +0100, Christoph Hellwig wrote:
> On Wed, Jun 20, 2007 at 10:22:28AM -0400, Trond Myklebust wrote:
> > No it shouldn't. The struct file contains other stateful information
> > from the open() call (such as authentication info) that needs to be
> > passed into readdir.
>
> Which is exactly that problem this tries to solve. Once you have
> union mounts you'll have a single open file descriptor for multiple
> actual directories. Beause of that you can't simply attach to the
> state to the struct file but have to keep it in a different way.

Which creates another, much WORSE problem.

Authentication information is part of a series of things that POSIX
requires you to keep on per-descriptor basis (because POSIX assumes that
you can suid/sgid a process without any security implications for file
descriptors that are already open). It is quite natural to pass it
around by means of the struct file.

If you don't want to pass the struct file around, then you at least need
to come up with an alternative mechanism that allows filesystems to
provide correct semantics in the standard non-union case.

Trond

2007-06-21 05:25:54

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On 6/20/07, Erez Zadok <[email protected]> wrote:
> In message <[email protected]>, Jan Blunck writes:
> > On Tue, 19 Jun 2007 22:59:51 -0700, Arjan van de Ven wrote:
> >
> > > first of all I'm happy to see that people are still working on unionfs;
> > > I'd love to have functionality like this show up in Linux.
> >
> > This has nothing to do with unionfs. This is about doing a VFS based
> > approach to union mounts. Unification is a name-based construct so it
> > belongs into VFS and not into a separate file system.
>
> Jan, while I agree with you in principle that unification is a VFS-level
> namespace construct, I disagree with you that unioning doesn't belong in a
> separate f/s.
>
> As someone whose group developed three generations of the stackable file
> system Unionfs (see http://unionfs.filesystems.org/), I can tell you from my
> experience and the experience of numerous users, that the devil is the
> details -- or the so-called orthogonal issues. To get a fully working
> unioning implementation, one that the many current users of Unionfs could
> use, you'll have to deal with many issues and corner cases: cache coherency,
> inode persistency (and network f/s exports), copyups, whiteouts and opaque
> dirs, how to deal with "odd" file systems which don't support native
> whiteouts and such, directory reading (seekdir), and more. Our third
> generation Unionfs, the one with On-Disk Format (ODF), handles all of these.
> Rather than reproduce all that discussion here, I'll point people to read
> more info here: <http://www.filesystems.org/unionfs-odf.txt>

Erez, thanks, will definetely have to look at that to understand how
unionfs is addressing all these corner cases.

Though I don't understand all the issues involved with cache coherency
atm, one of the things you said during unionfs 2.0 release is that it
is now possible to make modifications to the lower layer directly and
they will be visible from the union. Note that since we do unioning at
VFS layer, we don't explicitly address this. Direct
modifications/additions to the lower layer will automatically get
reflected in the union. Anyway before commenting anything more on
this, let me get back and study the coherency issues more closely :)

>
> So, to have a fully usable union mounts implementation, you're going to have
> to support a lot of existing features; but if you were to support them all
> at the VFS level, you will have bloated the VFS considerably with stuff that
> many would argue does not belong in the VFS.

Talking about copyup and whiteout at VFS layer, we have already
demonstrated what complexity it takes to have these within VFS. Please
take a look at the copyup and whiteout patches in our previous
releases at:

http://lkml.org/lkml/2007/4/17/150
http://lkml.org/lkml/2007/5/14/69

Or may be wait till I clean all those up to work with the new union
new stack infrastructure which I have posted here.

Regards,
Bharata.
--
"Men come and go but mountains remain" -- Ruskin Bond.

2007-06-21 16:29:24

by Josef Sipek

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

On Thu, Jun 21, 2007 at 10:55:45AM +0530, Bharata B Rao wrote:
...
> Talking about copyup and whiteout at VFS layer, we have already
> demonstrated what complexity it takes to have these within VFS. Please
> take a look at the copyup and whiteout patches in our previous
> releases at:
>
> http://lkml.org/lkml/2007/4/17/150
> http://lkml.org/lkml/2007/5/14/69
>
> Or may be wait till I clean all those up to work with the new union
> new stack infrastructure which I have posted here.

Really, the problem for both, union mounts and unionfs, is that the concept
of unioning spans the two layers. You have the unification part - which is
very VFS-level concept, but at the same time, you got whiteouts, copyup,
(semi-?)persistent inode numbers, and a bunch of other details that just
don't belong in the VFS at all.

Josef "Jeff" Sipek.

--
Evolution, n.:
A hypothetical process whereby infinitely improbable events occur with
alarming frequency, order arises from chaos, and no one is given credit.

2007-06-21 16:40:01

by Erez Zadok

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] Union mount documentation.

In message <[email protected]>, Josef Sipek writes:
> On Thu, Jun 21, 2007 at 10:55:45AM +0530, Bharata B Rao wrote:
> ...
> > Talking about copyup and whiteout at VFS layer, we have already
> > demonstrated what complexity it takes to have these within VFS. Please
> > take a look at the copyup and whiteout patches in our previous
> > releases at:
> >
> > http://lkml.org/lkml/2007/4/17/150
> > http://lkml.org/lkml/2007/5/14/69
> >
> > Or may be wait till I clean all those up to work with the new union
> > new stack infrastructure which I have posted here.
>
> Really, the problem for both, union mounts and unionfs, is that the concept
> of unioning spans the two layers. You have the unification part - which is
> very VFS-level concept, but at the same time, you got whiteouts, copyup,
> (semi-?)persistent inode numbers, and a bunch of other details that just
> don't belong in the VFS at all.
>
> Josef "Jeff" Sipek.

Yup, which is why I feel that the eventual solution may involve a hybrid
solution: a file system "driver" plus ample VFS support. The question will
always be how much should go into the VFS and how much should go into the
f/s driver? Our approach w/ unionfs had been to keep as much of it into the
f/s driver, and slowly offer VFS-level support, so as not to perturb the VFS
too much all at once. That way we can offer users something that works now,
and internally change the implementation with minimal user-visible changes.

Erez.

2007-06-21 16:40:37

by Josef Sipek

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] Mount changes to support union mount.

On Wed, Jun 20, 2007 at 02:23:30PM +0530, Bharata B Rao wrote:
> (replying from a different ID as you didn't copy me on reply)
>
> On 6/20/07, Jan Blunck <[email protected]> wrote:
> >On Wed, 20 Jun 2007 11:22:41 +0530, Bharata B Rao wrote:
> >
> >> +/*
> >> + * When propagating mount events to peer group, this is called under
> >> + * vfsmount_lock. Hence using GFP_ATOMIC for kmalloc here.
> >> + * TODO: Can we use a separate kmem cache for union_mount ?
> >> + */
> >> +struct union_mount *alloc_union_mount(struct vfsmount *src_mnt,
> >> + struct dentry *src_dentry, struct vfsmount *dst_mnt,
> >> + struct dentry *dst_dentry)
> >> +{
> >> + struct union_mount *u;
> >> + u = kmalloc(sizeof(struct union_mount), GFP_ATOMIC);
> >> + if (!u)
> >> + return u;
> >> + u->dst_mnt = mntget(dst_mnt);
> >> + u->dst_dentry = dget(dst_dentry);
> >> + u->src_mnt = src_mnt;
> >> + u->src_dentry = dget(src_dentry);
> >> + INIT_LIST_HEAD(&u->hash);
> >> + INIT_LIST_HEAD(&u->list);
> >> + return u;
> >> +}
> >
> >Hmm, you pin the dentries in memory until umount. This isn't good. Besides
> >that this doesn't work with file systems that do invalidate their
> >dentries. The file system must have a chance to replace the dentry in the
> >union structure.
>
> Yes, both top level and next level dentries are pinned until umount of
> the upper layer. I was thinking if we could prune these from
> prune_dcache(). What do you think ?
>
> Ok, I haven't thought about filesystem invalidating the dentries. Yet
> to understand the dentry invalidation, but would filesystem invalidate
> an inuse dentry ?

NFS, and quite possibly other network/cluster filesystems can encounter all
sort of odd conditions where the dentry might change, no?

Josef "Jeff" Sipek.

--
Hegh QaQ law'
quvHa'ghach QaQ puS

2007-06-30 09:43:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] Directory listing support for union mounted directories.

On Wed, Jun 20, 2007 at 01:44:52PM -0400, Trond Myklebust wrote:
> > Which is exactly that problem this tries to solve. Once you have
> > union mounts you'll have a single open file descriptor for multiple
> > actual directories. Beause of that you can't simply attach to the
> > state to the struct file but have to keep it in a different way.
>
> Which creates another, much WORSE problem.
>
> Authentication information is part of a series of things that POSIX
> requires you to keep on per-descriptor basis (because POSIX assumes that
> you can suid/sgid a process without any security implications for file
> descriptors that are already open). It is quite natural to pass it
> around by means of the struct file.
>
> If you don't want to pass the struct file around, then you at least need
> to come up with an alternative mechanism that allows filesystems to
> provide correct semantics in the standard non-union case.

That'd be struct ucred..

Anyway, to get back to the original problem - currently filesystems
are perfectly fine to keep whatever state they want in struct file
(mostly ->private), with union mounts we will have a single file descriptor
and struct file for two directories, so we need to restrict this.
Getting this right is the key to any unioning work that actually works
on an arbitrary filesystem and not just on disk filesystems that don't
do anything fancy.