2007-05-22 18:51:21

by Miklos Szeredi

Subject: [RFC PATCH] file as directory

Why do we want this?
--------------------

That depends on who you ask. My answer is this:

'foo.tar.gz/foo/bar' or
'foo.tar.gz/contents/foo/bar'

or something similar.

Others might suggest accessing streams, resource forks or extended
attributes through such an interface. However this patch only deals
with the non-directory case, so directories would be excluded from
that interface.

But otherwise this patch doesn't limit the uses of the "file as
directory" concept in any way. It just adds the infrastructure to
support these whacky beasts.

How is it done?
---------------

(See this thread [1] for more discussion on the subject.)

When a non-directory object is accessed without a trailing slash, then
path resolution returns the object itself as usual.

If a non-directory object is accessed with a trailing slash, then the
filesystem may opt to let the file be accessed as a directory. In
this case "something" (as supplied by the filesystem) is mounted on
top of the non-directory object.

This mount will have special properties:

- If there's no trailing slash after the file name, the mount
won't be followed, even if the path resolution would otherwise
follow mounts.

- The mount only stays there while it is referenced by some external
object, like a pwd or an open file. When it is no longer
referenced, it is automatically unmounted.

- Unlike "real" mounts, this won't block unlink(2) or rename(2) on
the underlying object.
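
The three properties above can be pictured with a small userspace
model. This is only a sketch, not code from the patch: the toy_* names
and the simplified reference counting are invented for illustration;
only the MNT_DIRONFILE value and the d_mounted/d_dironfile counter
names mirror what the patch uses.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MNT_DIRONFILE 0x10000 /* same value the patch gives MNT_DIRONFILE */

/* Hypothetical stand-in for struct dentry: only the two counters. */
struct toy_dentry {
    int d_mounted;   /* all mounts on this object */
    int d_dironfile; /* the subset that are "directory on file" mounts */
};

/* Hypothetical stand-in for struct vfsmount. */
struct toy_mount {
    unsigned int flags;
    int refcount; /* external references only (pwd, open files) */
    bool attached;
    struct toy_dentry *mountpoint;
};

static void toy_attach(struct toy_mount *m, struct toy_dentry *d,
                       unsigned int flags)
{
    m->flags = flags;
    m->refcount = 0; /* the attachment itself holds no reference */
    m->attached = true;
    m->mountpoint = d;
    d->d_mounted++;
    if (flags & TOY_MNT_DIRONFILE)
        d->d_dironfile++;
}

/* Property 1: without a trailing slash (enter == false) the mount is
 * invisible to path resolution. */
static struct toy_mount *toy_lookup_mnt(struct toy_mount *child, bool enter)
{
    if (!child || !child->attached)
        return NULL;
    if ((child->flags & TOY_MNT_DIRONFILE) && !enter)
        return NULL;
    child->refcount++;
    return child;
}

/* Property 2: dropping the last external reference unmounts. */
static void toy_mntput(struct toy_mount *m)
{
    if (--m->refcount == 0 && (m->flags & TOY_MNT_DIRONFILE)) {
        m->attached = false;
        m->mountpoint->d_mounted--;
        m->mountpoint->d_dironfile--;
        m->flags &= ~TOY_MNT_DIRONFILE;
    }
}

/* Property 3: only "real" mounts make the object busy for
 * unlink(2)/rename(2). */
static int toy_real_mountpoint(const struct toy_dentry *d)
{
    return d->d_mounted - d->d_dironfile;
}
```

In this model a file with only a "directory on file" mount over it
reports zero real mounts, so an unlink check based on
toy_real_mountpoint() would not return busy.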


Compatibility with existing systems
-----------------------------------

Filesystems which enable "file as directory" semantics might break
existing applications. For example, an app could conceivably check
whether an object is a directory by appending a slash to the name and
trying some filesystem operation. Such an application might be
confused when these operations succeed on non-directory objects.

However, in practice this sort of behavior seems to be rare.
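
Such a check might look roughly like the sketch below. The helper name
resolves_with_slash() is hypothetical, and stat(2) merely stands in for
"some filesystem operation". On a kernel without this patch the call
fails with ENOTDIR for a non-directory, which is what such an
application relies on.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Append a slash to 'path' and try to stat() the result.
 *
 * Returns 1 if "path/" resolves, 0 if it fails with ENOTDIR, and -1 on
 * any other error (errno is left set).
 */
static int resolves_with_slash(const char *path)
{
    char buf[4096];
    struct stat st;
    int n = snprintf(buf, sizeof(buf), "%s/", path);

    if (n < 0 || (size_t)n >= sizeof(buf)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    if (stat(buf, &st) == 0)
        return 1;
    return errno == ENOTDIR ? 0 : -1;
}
```

With the patch applied and a filesystem that implements ->enter(), the
same call could return 1 for a regular file, and an application that
reads this result as "is a directory" would draw the wrong conclusion.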

The other question is how well unmodified applications cope with
user-supplied paths which have a slash after the name of a
non-directory object.

Command line utilities seem to cope very well, since they don't do
much path "sanitization". Bash also seems perfectly capable of
dealing with such beasts, with filename completion and everything.

More complex apps like emacs and file browsers have more problems, but
in some cases they do actually work as expected, notably when the
supplied path has at least one additional component below the
non-directory object.

So while this doesn't work in emacs etc.:

foo.tar.gz/

this usually does:

foo.tar.gz/foo

It is probably trivial to teach these programs not to be too clever
with path names. It should also be possible to make apps aware of
files as directories and have them support these explicitly.

Implementation details
----------------------

See comments and Documentation/* in the patch.

The patch is careful not to touch the fastpaths in the path
resolution:

- Only check ->enter() if ->lookup() is not defined and there's a
trailing slash. This happens very infrequently, since most apps
check the file type before trying to enter a directory.

- Since the "directory on file" mount is removed on leave, most files
won't have anything mounted over them. In these cases
follow_mount() and friends will be just as fast as before this
patch. There's only a negligible slowdown for crossing a
mountpoint and a very minor slowdown for accessing files that
currently have a "directory on file" mount over them.
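
The first point can be pictured as the following decision, written as
a standalone sketch. The toy_* types are illustrative stand-ins for
struct inode_operations and are not kernel code; the point is only that
objects with ->lookup() never reach the ->enter() check, and that a
missing ->enter() yields -ENOTDIR as before.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant inode_operations members. */
struct toy_inode_ops {
    bool has_lookup; /* stands in for i_op->lookup != NULL */
    bool has_enter;  /* stands in for i_op->enter != NULL */
};

enum toy_action { TOY_WALK_ON, TOY_CALL_ENTER };

/*
 * Decide what to do after a path component that is followed by a slash.
 * Returns 0 and sets *action, or returns -ENOTDIR.
 */
static int toy_after_slash(const struct toy_inode_ops *ops,
                           enum toy_action *action)
{
    if (ops->has_lookup) { /* fast path: ordinary directory */
        *action = TOY_WALK_ON;
        return 0;
    }
    if (!ops->has_enter) /* plain file, filesystem did not opt in */
        return -ENOTDIR;
    *action = TOY_CALL_ENTER; /* slow path: mount "directory on file" */
    return 0;
}
```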

How to try it out
-----------------

This needs quite a bit of fiddling. First get the files from

http://www.kernel.org/pub/linux/kernel/people/mszeredi/file-as-directory/

- Get the CVS version of fuse and patch it with fuse-enter.patch.

- Compile avfs-enter.c as instructed at the top of that file.

- Get the CVS version of AVFS and compile it; you should get a
working avfsd daemon. After mounting with "./avfsd /avfs", try

ls -l /avfs/usr/src/linux-2.6.21.tar.gz#/

- Patch a kernel with the below patch. This is against
2.6.22-rc1-mm1, but with some effort should apply to other recent
kernels.

- Reboot and look for "app/pid enter name/" lines in dmesg. These
are logged when an app attempts to access a non-directory with a
trailing slash and fails, of course, because no filesystem supports
this yet.

- Mount the avfs filesystem. It is important to mount it on /avfs.

- Mount the avfs-enter filesystem somewhere, e.g. /tmp/avfs.

- Try

ls -l /tmp/avfs/usr/src/linux-2.6.21.tar.gz
ls -l /tmp/avfs/usr/src/linux-2.6.21.tar.gz/
cd /tmp/avfs/usr/src/linux-2.6.21.tar.gz/

[1] http://article.gmane.org/gmane.comp.file-systems.reiserfs.general/10861

Signed-off-by: Miklos Szeredi <[email protected]>
---

 Documentation/filesystems/Locking |    7 +
 Documentation/filesystems/vfs.txt |   12 +-
 fs/dcache.c                       |    2
 fs/namei.c                        |  121 ++++++++++++++++----
 fs/namespace.c                    |  223 +++++++++++++++++++++++++++++++++-----
 include/linux/dcache.h            |   20 +++
 include/linux/fs.h                |    2
 include/linux/mount.h             |    4
 include/linux/namei.h             |    3
 9 files changed, 338 insertions(+), 56 deletions(-)

Index: linux/fs/namei.c
===================================================================
--- linux.orig/fs/namei.c 2007-05-22 18:06:24.000000000 +0200
+++ linux/fs/namei.c 2007-05-22 18:06:32.000000000 +0200
@@ -669,14 +669,22 @@ int follow_up(struct vfsmount **mnt, str
return 1;
}

-/* no need for dcache_lock, as serialization is taken care in
+/*
+ * Follows mounts on the given struct path. Assumes that no extra
+ * reference is held for the supplied vfsmount.
+ *
+ * If 'enter' is false, does not follow "directory on file" mounts.
+ *
+ * No need for dcache_lock, as serialization is taken care of in
* namespace.c
*/
-static int __follow_mount(struct path *path)
+static int __follow_mount(struct path *path, bool enter)
{
int res = 0;
while (d_mountpoint(path->dentry)) {
- struct vfsmount *mounted = lookup_mnt(path->mnt, path->dentry);
+ struct vfsmount *mounted =
+ lookup_mnt(path->mnt, path->dentry, enter);
+
if (!mounted)
break;
dput(path->dentry);
@@ -689,27 +697,37 @@ static int __follow_mount(struct path *p
return res;
}

-static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
+/*
+ * Follows mounts on the given nameidata.
+ *
+ * Only follows "directory on file" mounts if LOOKUP_ENTER is set.
+ */
+void follow_mount(struct nameidata *nd)
{
- while (d_mountpoint(*dentry)) {
- struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
+ while (d_mountpoint(nd->dentry)) {
+ bool enter = nd->flags & LOOKUP_ENTER;
+ struct vfsmount *mounted =
+ lookup_mnt(nd->mnt, nd->dentry, enter);
+
if (!mounted)
break;
- dput(*dentry);
- mntput(*mnt);
- *mnt = mounted;
- *dentry = dget(mounted->mnt_root);
+ dput(nd->dentry);
+ mntput(nd->mnt);
+ nd->mnt = mounted;
+ nd->dentry = dget(mounted->mnt_root);
}
}

-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
+/*
+ * Follows mounts on the given vfsmount/dentry.
+ *
+ * Does not follow "directory on file" mounts.
*/
int follow_down(struct vfsmount **mnt, struct dentry **dentry)
{
struct vfsmount *mounted;

- mounted = lookup_mnt(*mnt, *dentry);
+ mounted = lookup_mnt(*mnt, *dentry, false);
if (mounted) {
dput(*dentry);
mntput(*mnt);
@@ -756,7 +774,7 @@ static __always_inline void follow_dotdo
mntput(nd->mnt);
nd->mnt = parent;
}
- follow_mount(&nd->mnt, &nd->dentry);
+ follow_mount(nd);
}

/*
@@ -777,7 +795,7 @@ static int do_lookup(struct nameidata *n
done:
path->mnt = mnt;
path->dentry = dentry;
- __follow_mount(path);
+ __follow_mount(path, nd->flags & LOOKUP_ENTER);
return 0;

need_lookup:
@@ -799,6 +817,40 @@ fail:
}

/*
+ * Try to enter a non-directory object.
+ *
+ * This is called if the object has no ->lookup() defined, yet the
+ * path contains a slash after the object name.
+ *
+ * If the filesystem defines an ->enter() method, this will be called,
+ * and the filesystem shall fill the supplied struct path or return an
+ * error.
+ *
+ * The returned path will be bind mounted on top of the object with
+ * the MNT_DIRONFILE flag, and the nameidata will descend into the
+ * mount.
+ */
+static int enter_file(struct inode *inode, struct nameidata *nd)
+{
+ int err;
+ struct path newpath;
+
+ printk(KERN_DEBUG "%s/%d enter %s/\n", current->comm, current->pid,
+ nd->dentry->d_name.name);
+ if (!inode->i_op->enter)
+ return -ENOTDIR;
+
+ newpath.mnt = NULL;
+ newpath.dentry = NULL;
+ err = inode->i_op->enter(nd, &newpath);
+ if (!err) {
+ err = mount_dironfile(nd, &newpath);
+ pathput(&newpath);
+ }
+ return err;
+}
+
+/*
* Name resolution.
* This is the basic name resolution function, turning a pathname into
* the final dentry. We expect 'base' to be positive and a directory.
@@ -820,7 +872,8 @@ static fastcall int __link_path_walk(con

inode = nd->dentry->d_inode;
if (nd->depth)
- lookup_flags = LOOKUP_FOLLOW | (nd->flags & LOOKUP_CONTINUE);
+ lookup_flags = LOOKUP_FOLLOW |
+ (nd->flags & (LOOKUP_CONTINUE | LOOKUP_ENTER));

/* At this point we know we have a real path component. */
for(;;) {
@@ -828,7 +881,7 @@ static fastcall int __link_path_walk(con
struct qstr this;
unsigned int c;

- nd->flags |= LOOKUP_CONTINUE;
+ nd->flags |= LOOKUP_CONTINUE | LOOKUP_ENTER;
err = exec_permission_lite(inode, nd);
if (err == -EAGAIN)
err = vfs_permission(nd, MAY_EXEC);
@@ -906,17 +959,22 @@ static fastcall int __link_path_walk(con
break;
} else
path_to_nameidata(&next, nd);
- err = -ENOTDIR;
- if (!inode->i_op->lookup)
- break;
+ if (unlikely(!inode->i_op->lookup)) {
+ err = enter_file(inode, nd);
+ if (err)
+ break;
+ inode = nd->dentry->d_inode;
+ }
continue;
/* here ends the main loop */

last_with_slashes:
- lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
+ lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY |
+ LOOKUP_ENTER;
last_component:
/* Clear LOOKUP_CONTINUE iff it was previously unset */
- nd->flags &= lookup_flags | ~LOOKUP_CONTINUE;
+ nd->flags &= lookup_flags |
+ ~(LOOKUP_CONTINUE | LOOKUP_ENTER);
if (lookup_flags & LOOKUP_PARENT)
goto lookup_parent;
if (this.name[0] == '.') switch (this.len) {
@@ -951,10 +1009,19 @@ last_component:
err = -ENOENT;
if (!inode)
break;
- if (lookup_flags & LOOKUP_DIRECTORY) {
+ if (lookup_flags & (LOOKUP_DIRECTORY | LOOKUP_ENTER)) {
err = -ENOTDIR;
- if (!inode->i_op || !inode->i_op->lookup)
+ if (!inode->i_op)
break;
+
+ if (!inode->i_op->lookup) {
+ if (!(lookup_flags & LOOKUP_ENTER))
+ break;
+
+ err = enter_file(inode, nd);
+ if (err)
+ break;
+ }
}
goto return_base;
lookup_parent:
@@ -1726,7 +1793,7 @@ do_last:
if (flag & O_EXCL)
goto exit_dput;

- if (__follow_mount(&path)) {
+ if (__follow_mount(&path, false)) {
error = -ELOOP;
if (flag & O_NOFOLLOW)
goto exit_dput;
@@ -2114,7 +2181,7 @@ int vfs_unlink(struct inode *dir, struct
DQUOT_INIT(dir);

mutex_lock(&dentry->d_inode->i_mutex);
- if (d_mountpoint(dentry))
+ if (d_real_mountpoint(dentry))
error = -EBUSY;
else {
error = security_inode_unlink(dir, dentry);
@@ -2449,7 +2516,7 @@ static int vfs_rename_other(struct inode
target = new_dentry->d_inode;
if (target)
mutex_lock(&target->i_mutex);
- if (d_mountpoint(old_dentry)||d_mountpoint(new_dentry))
+ if (d_real_mountpoint(old_dentry)||d_real_mountpoint(new_dentry))
error = -EBUSY;
else
error = old_dir->i_op->rename(old_dir, old_dentry, new_dir, new_dentry);
Index: linux/fs/namespace.c
===================================================================
--- linux.orig/fs/namespace.c 2007-05-22 18:06:24.000000000 +0200
+++ linux/fs/namespace.c 2007-05-22 18:06:32.000000000 +0200
@@ -122,13 +122,21 @@ struct vfsmount *__lookup_mnt(struct vfs
/*
* lookup_mnt increments the ref count before returning
* the vfsmount struct.
+ *
+ * If 'enter' is false, ignore "directory on file" mounts
*/
-struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
+ bool enter)
{
struct vfsmount *child_mnt;
spin_lock(&vfsmount_lock);
- if ((child_mnt = __lookup_mnt(mnt, dentry, 1)))
- mntget(child_mnt);
+ child_mnt = __lookup_mnt(mnt, dentry, 1);
+ if (child_mnt) {
+ if (IS_MNT_DIRONFILE(child_mnt) && !enter)
+ child_mnt = NULL;
+ else
+ mntget(child_mnt);
+ }
spin_unlock(&vfsmount_lock);
return child_mnt;
}
@@ -156,6 +164,7 @@ static void __touch_mnt_namespace(struct

static void detach_mnt(struct vfsmount *mnt, struct nameidata *old_nd)
{
+ BUG_ON(IS_MNT_DIRONFILE(mnt));
old_nd->dentry = mnt->mnt_mountpoint;
old_nd->mnt = mnt->mnt_parent;
mnt->mnt_parent = mnt;
@@ -301,8 +310,8 @@ static struct vfsmount *clone_mnt(struct
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;

- /* don't copy the MNT_USER flag */
- mnt->mnt_flags &= ~MNT_USER;
+ /* don't copy some flags */
+ mnt->mnt_flags &= ~(MNT_USER | MNT_DIRONFILE);
if (flag & CL_SETUSER)
__set_mnt_user(mnt, owner);

@@ -339,6 +348,48 @@ static struct vfsmount *clone_mnt(struct
return ERR_PTR(-ENOMEM);
}

+/*
+ * Automatically umount "directory on file" mounts when the last
+ * reference to the mount is released.
+ *
+ * This is tricky, because for namespace modification we must take the
+ * namespace semaphore. But mntput() is called from various places,
+ * sometimes with namespace_sem held. Fortunately in those places the
+ * mount cannot yet have MNT_DIRONFILE, or at least that's what I
+ * hope...
+ *
+ * The umounting is done in two stages, first the mount is removed
+ * from the hashes. This is done atomically wrt other mount lookups,
+ * so it's not possible to acquire a new ref to this dead mount that
+ * way.
+ *
+ * Then after having locked namespace_sem and relocked vfsmount_lock,
+ * the mount is properly detached.
+ */
+static void umount_dironfile(struct vfsmount *mnt)
+ __releases(vfsmount_lock)
+{
+ struct nameidata nd;
+
+ printk(KERN_DEBUG "umount dir-on-file %p\n", mnt);
+ BUG_ON(!IS_MNT_DIRONFILE(mnt));
+ list_del_init(&mnt->mnt_hash);
+ spin_unlock(&vfsmount_lock);
+
+ down_write(&namespace_sem);
+ spin_lock(&vfsmount_lock);
+ mnt->mnt_flags &= ~MNT_DIRONFILE;
+ mnt->mnt_mountpoint->d_dironfile--;
+ detach_mnt(mnt, &nd);
+ list_del_init(&mnt->mnt_expire);
+ list_del_init(&mnt->mnt_list);
+ change_mnt_propagation(mnt, MS_PRIVATE);
+ spin_unlock(&vfsmount_lock);
+ up_write(&namespace_sem);
+
+ path_release(&nd);
+}
+
static inline void __mntput(struct vfsmount *mnt)
{
struct super_block *sb = mnt->mnt_sb;
@@ -348,12 +399,21 @@ static inline void __mntput(struct vfsmo
deactivate_super(sb);
}

+/*
+ * Decrement mount refcount, without resetting the expiry flag
+ *
+ * On final mntput, pinned (process accounting) and still attached
+ * (directory on file) mounts require special treatment.
+ */
void mntput_no_expire(struct vfsmount *mnt)
{
repeat:
if (atomic_dec_and_lock(&mnt->mnt_count, &vfsmount_lock)) {
if (likely(!mnt->mnt_pinned)) {
- spin_unlock(&vfsmount_lock);
+ if (mnt->mnt_parent != mnt)
+ umount_dironfile(mnt);
+ else
+ spin_unlock(&vfsmount_lock);
__mntput(mnt);
return;
}
@@ -442,6 +502,7 @@ static int show_vfsmnt(struct seq_file *
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
{ MNT_NOMNT, ",nomnt" },
+ { MNT_DIRONFILE, ",dironfile" },
{ 0, NULL }
};
struct proc_fs_info *fs_infop;
@@ -609,12 +670,33 @@ void umount_tree(struct vfsmount *mnt, i
__touch_mnt_namespace(p->mnt_ns);
p->mnt_ns = NULL;
list_del_init(&p->mnt_child);
+ /*
+ * When a "directory on file" mount is detached, clear the
+ * flag and acquire the missing attachment reference, which
+ * will be dropped later during the umount.
+ */
+ if (IS_MNT_DIRONFILE(p)) {
+ printk(KERN_DEBUG "detach dir-on-file %p\n", p);
+ mntget(p);
+ p->mnt_mountpoint->d_dironfile--;
+ p->mnt_flags &= ~MNT_DIRONFILE;
+ }
if (p->mnt_parent != p)
p->mnt_mountpoint->d_mounted--;
change_mnt_propagation(p, MS_PRIVATE);
}
}

+void release_tree(struct vfsmount *mnt)
+{
+ LIST_HEAD(umount_list);
+
+ spin_lock(&vfsmount_lock);
+ umount_tree(mnt, 0, &umount_list);
+ spin_unlock(&vfsmount_lock);
+ release_mounts(&umount_list);
+}
+
static int do_umount(struct vfsmount *mnt, int flags)
{
struct super_block *sb = mnt->mnt_sb;
@@ -809,6 +891,12 @@ static int lives_below_in_same_fs(struct
}
}

+/*
+ * Recursively clone a mount tree
+ *
+ * Unless CL_COPY_ALL clone flag is given, skip unbindable and
+ * "directory on file" mounts
+ */
struct vfsmount *copy_tree(struct vfsmount *mnt, struct dentry *dentry,
int flag, uid_t owner)
{
@@ -829,7 +917,8 @@ struct vfsmount *copy_tree(struct vfsmou
continue;

for (s = r; s; s = next_mnt(s, r)) {
- if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(s)) {
+ if (!(flag & CL_COPY_ALL) &&
+ (IS_MNT_UNBINDABLE(s) || IS_MNT_DIRONFILE(s))) {
s = skip_mnt_tree(s);
continue;
}
@@ -851,13 +940,8 @@ struct vfsmount *copy_tree(struct vfsmou
}
return res;
error:
- if (res) {
- LIST_HEAD(umount_list);
- spin_lock(&vfsmount_lock);
- umount_tree(res, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
- release_mounts(&umount_list);
- }
+ if (res)
+ release_tree(res);
return q;
}

@@ -1032,7 +1116,7 @@ static int do_loopback(struct nameidata

down_write(&namespace_sem);
err = -EINVAL;
- if (IS_MNT_UNBINDABLE(old_nd.mnt))
+ if (IS_MNT_UNBINDABLE(old_nd.mnt) || IS_MNT_DIRONFILE(old_nd.mnt))
goto out;

if (!check_mnt(nd->mnt) || !check_mnt(old_nd.mnt))
@@ -1053,14 +1137,8 @@ static int do_loopback(struct nameidata
goto out;

err = graft_tree(mnt, nd);
- if (err) {
- LIST_HEAD(umount_list);
- spin_lock(&vfsmount_lock);
- umount_tree(mnt, 0, &umount_list);
- spin_unlock(&vfsmount_lock);
- release_mounts(&umount_list);
- }
-
+ if (err)
+ release_tree(mnt);
out:
up_write(&namespace_sem);
path_release(&old_nd);
@@ -1068,6 +1146,62 @@ out:
}

/*
+ * Bind mount the "old" path onto the supplied nameidata. The new
+ * mount is created with the MNT_DIRONFILE flag.
+ *
+ * If successful, descend immediately into the newly created mount.
+ *
+ * No reference is retained for the attachment, so when the mount is
+ * released it will be automatically umounted.
+ */
+int mount_dironfile(struct nameidata *nd, struct path *old)
+{
+ int err;
+ struct vfsmount *mnt;
+
+ down_write(&namespace_sem);
+
+ err = -ENOMEM;
+ mnt = clone_mnt(old->mnt, old->dentry, 0, 0);
+ if (!mnt)
+ goto out;
+
+ change_mnt_propagation(mnt, MS_PRIVATE);
+ mnt->mnt_flags |= MNT_DIRONFILE;
+
+ mutex_lock(&nd->dentry->d_inode->i_mutex);
+ err = -EINVAL;
+ if (S_ISDIR(nd->dentry->d_inode->i_mode))
+ goto out_unlock;
+
+ err = -ENOENT;
+ if (d_unhashed(nd->dentry))
+ goto out_unlock;
+
+ err = 0;
+ spin_lock(&vfsmount_lock);
+ attach_mnt(mnt, nd);
+ mnt->mnt_mountpoint->d_dironfile++;
+ mnt->mnt_ns = mnt->mnt_parent->mnt_ns;
+ list_add_tail(&mnt->mnt_list, &mnt->mnt_ns->list);
+ printk(KERN_DEBUG "mount dir-on-file %p\n", mnt);
+ spin_unlock(&vfsmount_lock);
+
+out_unlock:
+ mutex_unlock(&nd->dentry->d_inode->i_mutex);
+out:
+ up_write(&namespace_sem);
+ if (err)
+ mnt->mnt_flags &= ~MNT_DIRONFILE;
+ else
+ follow_mount(nd);
+
+ mntput(mnt);
+
+ return err;
+}
+
+/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
* on it - tough luck.
@@ -1125,8 +1259,7 @@ static int do_move_mount(struct nameidat
return err;

down_write(&namespace_sem);
- while (d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
- ;
+ follow_mount(nd);
err = -EINVAL;
if (!check_mnt(nd->mnt) || !check_mnt(old_nd.mnt))
goto out;
@@ -1146,6 +1279,9 @@ static int do_move_mount(struct nameidat
if (old_nd.mnt == old_nd.mnt->mnt_parent)
goto out1;

+ if (IS_MNT_DIRONFILE(old_nd.mnt))
+ goto out1;
+
if (S_ISDIR(nd->dentry->d_inode->i_mode) !=
S_ISDIR(old_nd.dentry->d_inode->i_mode))
goto out1;
@@ -1240,8 +1376,7 @@ int do_add_mount(struct vfsmount *newmnt

down_write(&namespace_sem);
/* Something was mounted here while we slept */
- while (d_mountpoint(nd->dentry) && follow_down(&nd->mnt, &nd->dentry))
- ;
+ follow_mount(nd);
err = -EINVAL;
if (!check_mnt(nd->mnt))
goto unlock;
@@ -1602,6 +1737,7 @@ static struct mnt_namespace *dup_mnt_ns(
struct mnt_namespace *new_ns;
struct vfsmount *rootmnt = NULL, *pwdmnt = NULL, *altrootmnt = NULL;
struct vfsmount *p, *q;
+ LIST_HEAD(dironfile_list);

new_ns = kmalloc(sizeof(struct mnt_namespace), GFP_KERNEL);
if (!new_ns)
@@ -1629,6 +1765,9 @@ static struct mnt_namespace *dup_mnt_ns(
* Second pass: switch the tsk->fs->* elements and mark new vfsmounts
* as belonging to new namespace. We have already acquired a private
* fs_struct, so tsk->fs->lock is not needed.
+ *
+ * Move "directory on file" mounts to a dedicated list for special
+ * treatment.
*/
p = mnt_ns->root;
q = new_ns->root;
@@ -1648,6 +1787,9 @@ static struct mnt_namespace *dup_mnt_ns(
fs->altrootmnt = mntget(q);
}
}
+ if (IS_MNT_DIRONFILE(p))
+ list_move(&q->mnt_list, &dironfile_list);
+
p = next_mnt(p, mnt_ns->root);
q = next_mnt(q, new_ns->root);
}
@@ -1660,6 +1802,31 @@ static struct mnt_namespace *dup_mnt_ns(
if (altrootmnt)
mntput(altrootmnt);

+ /*
+ * Mounts on dironfile_list were cloned from "directory on
+ * file" mounts. So mark the new ones as such and drop the
+ * attachment reference.
+ *
+ * This will have the effect of immediately unmounting all
+ * those which have not been claimed above by a root, altroot
+ * or pwd pointer.
+ *
+ * No lock needs to be held, because only the brand new
+ * mnt_namespace is touched.
+ */
+ while (!list_empty(&dironfile_list)) {
+ struct vfsmount *q;
+
+ q = list_entry(dironfile_list.next, struct vfsmount, mnt_list);
+ q->mnt_flags |= MNT_DIRONFILE;
+ spin_lock(&vfsmount_lock);
+ q->mnt_mountpoint->d_dironfile++;
+ printk(KERN_DEBUG "clone dir-on-file %p\n", q);
+ spin_unlock(&vfsmount_lock);
+ list_move_tail(&q->mnt_list, &new_ns->list);
+ mntput(q);
+ }
+
return new_ns;
}

@@ -1853,6 +2020,8 @@ asmlinkage long sys_pivot_root(const cha
down_write(&namespace_sem);
mutex_lock(&old_nd.dentry->d_inode->i_mutex);
error = -EINVAL;
+ if (IS_MNT_DIRONFILE(new_nd.mnt) || IS_MNT_DIRONFILE(user_nd.mnt))
+ goto out2;
if (IS_MNT_SHARED(old_nd.mnt) ||
IS_MNT_SHARED(new_nd.mnt->mnt_parent) ||
IS_MNT_SHARED(user_nd.mnt->mnt_parent))
Index: linux/include/linux/namei.h
===================================================================
--- linux.orig/include/linux/namei.h 2007-05-22 18:06:24.000000000 +0200
+++ linux/include/linux/namei.h 2007-05-22 18:06:32.000000000 +0200
@@ -55,6 +55,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LA
#define LOOKUP_PARENT 16
#define LOOKUP_NOALT 32
#define LOOKUP_REVAL 64
+#define LOOKUP_ENTER 128
/*
* Intent data
*/
@@ -95,6 +96,8 @@ static inline struct dentry *lookup_one_
extern int follow_down(struct vfsmount **, struct dentry **);
extern int follow_up(struct vfsmount **, struct dentry **);

+extern void follow_mount(struct nameidata *nd);
+
extern struct dentry *lock_rename(struct dentry *, struct dentry *);
extern void unlock_rename(struct dentry *, struct dentry *);
extern int kernel_readlink(struct dentry *dentry, char **buffer, int *buflen);
Index: linux/include/linux/mount.h
===================================================================
--- linux.orig/include/linux/mount.h 2007-05-22 18:06:24.000000000 +0200
+++ linux/include/linux/mount.h 2007-05-22 18:06:32.000000000 +0200
@@ -37,6 +37,10 @@ struct mnt_namespace;
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
#define MNT_PNODE_MASK 0x3000 /* propagation flag mask */

+#define MNT_DIRONFILE 0x10000
+
+#define IS_MNT_DIRONFILE(mnt) ((mnt)->mnt_flags & MNT_DIRONFILE)
+
struct vfsmount {
struct list_head mnt_hash;
struct vfsmount *mnt_parent; /* fs we are mounted on */
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h 2007-05-22 18:06:24.000000000 +0200
+++ linux/include/linux/fs.h 2007-05-22 18:06:32.000000000 +0200
@@ -1149,6 +1149,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*enter) (struct nameidata *, struct path *);
};

struct seq_file;
@@ -1319,6 +1320,7 @@ extern void release_mounts(struct list_h
extern long do_mount(char *, char *, char *, unsigned long, void *);
extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *,
struct vfsmount *);
+extern int mount_dironfile(struct nameidata *nd, struct path *old);

extern int vfs_statfs(struct dentry *, struct kstatfs *);

Index: linux/include/linux/dcache.h
===================================================================
--- linux.orig/include/linux/dcache.h 2007-05-22 18:06:24.000000000 +0200
+++ linux/include/linux/dcache.h 2007-05-22 18:06:32.000000000 +0200
@@ -111,6 +111,7 @@ struct dentry {
struct dcookie_struct *d_cookie; /* cookie, if any */
#endif
int d_mounted;
+ int d_dironfile;
unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
};

@@ -351,12 +352,29 @@ static inline struct dentry *dget_parent

extern void dput(struct dentry *);

+/**
+ * d_mountpoint - check if dentry is mounted
+ * @dentry: dentry to check
+ */
static inline int d_mountpoint(struct dentry *dentry)
{
return dentry->d_mounted;
}

-extern struct vfsmount *lookup_mnt(struct vfsmount *, struct dentry *);
+/**
+ * d_real_mountpoint - check if dentry has real mounts over it
+ * @dentry: dentry to check
+ *
+ * Returns true if dentry is mounted and not all of those mounts are
+ * "directory on file" mounts
+ */
+static inline int d_real_mountpoint(struct dentry *dentry)
+{
+ BUG_ON(dentry->d_mounted < dentry->d_dironfile);
+ return dentry->d_mounted - dentry->d_dironfile;
+}
+
+extern struct vfsmount *lookup_mnt(struct vfsmount *, struct dentry *, bool);
extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int);
extern struct dentry *lookup_create(struct nameidata *nd, int is_dir);

Index: linux/Documentation/filesystems/Locking
===================================================================
--- linux.orig/Documentation/filesystems/Locking 2007-05-22 18:06:24.000000000 +0200
+++ linux/Documentation/filesystems/Locking 2007-05-22 18:06:32.000000000 +0200
@@ -51,6 +51,8 @@ ata *);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
+ void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*enter) (struct nameidata *, struct path *);

locking rules:
all may block, none have BKL
@@ -74,6 +76,9 @@ setxattr: yes
getxattr: no
listxattr: no
removexattr: yes
+truncate_range: yes
+enter: no
+
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
@@ -82,6 +87,8 @@ method. It's called by vmtruncate() - li
->setattr(). Locking information above applies to that call (i.e. is
inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
passed).
+ ->truncate_range() and ->setattr() with ATTR_SIZE also hold
+i_alloc_sem for write.

See Documentation/filesystems/directory-locking for more detailed discussion
of the locking scheme for directory operations.
Index: linux/Documentation/filesystems/vfs.txt
===================================================================
--- linux.orig/Documentation/filesystems/vfs.txt 2007-05-22 18:06:24.000000000 +0200
+++ linux/Documentation/filesystems/vfs.txt 2007-05-22 18:06:32.000000000 +0200
@@ -324,7 +324,7 @@ struct inode_operations
-----------------------

This describes how the VFS can manipulate an inode in your
-filesystem. As of kernel 2.6.13, the following members are defined:
+filesystem. As of kernel 2.6.21, the following members are defined:

struct inode_operations {
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
@@ -348,6 +348,8 @@ struct inode_operations {
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
+ void (*truncate_range)(struct inode *, loff_t, loff_t);
+ int (*enter) (struct nameidata *, struct path *);
};

Again, all methods are called without any locks being held, unless
@@ -444,6 +446,14 @@ otherwise noted.
removexattr: called by the VFS to remove an extended attribute from
a file. This method is called by removexattr(2) system call.

+ truncate_range: punch a hole in the middle of the file
+
+ enter: if a non-directory is suffixed with a slash, this method (if
+ defined) will be called. The filesystem shall return a
+ vfsmount/dentry pair in the struct path argument which will be
+ bind mounted on the object. The mount will be marked
+ with a special "directory on file" flag, which will only be
+ followed when the path contains a slash after the file name.

The Address Space Object
========================
Index: linux/fs/dcache.c
===================================================================
--- linux.orig/fs/dcache.c 2007-05-22 18:06:24.000000000 +0200
+++ linux/fs/dcache.c 2007-05-22 18:06:32.000000000 +0200
@@ -87,6 +87,7 @@ static void d_callback(struct rcu_head *
*/
static void d_free(struct dentry *dentry)
{
+ BUG_ON(dentry->d_dironfile);
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
/* if dentry was never inserted into hash, immediate free is OK */
@@ -933,6 +934,7 @@ struct dentry *d_alloc(struct dentry * p
dentry->d_op = NULL;
dentry->d_fsdata = NULL;
dentry->d_mounted = 0;
+ dentry->d_dironfile = 0;
#ifdef CONFIG_PROFILING
dentry->d_cookie = NULL;
#endif


2007-05-22 22:10:58

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Tue, May 22, 2007 at 08:48:49PM +0200, Miklos Szeredi wrote:
> Why do we want this?
> --------------------
>
> That depends on who you ask. My answer is this:
>
> 'foo.tar.gz/foo/bar' or
> 'foo.tar.gz/contents/foo/bar'
>
> or something similar.
>
> Others might suggest accessing streams, resource forks or extended
> attributes through such an interface. However this patch only deals
> with the non-directory case, so directories would be excluded from
> that interface.
>
> But otherwise this patch doesn't limit the uses of the "file as
> directory" concept in any way. It just adds the infrastructure to
> support these whacky beasts.
>
> How is it done?
> ---------------
>
> (See this [1] thread for more discussion on the subject)
>
> When a non-directory object is accessed without a trailing slash, then
> path resolution returns the object itself as usual.
>
> If a non-directory object is accessed with a trailing slash, then the
> filesystem may opt to let the file be accessed as a directory. In
> this case "something" (as supplied by the filesystem) is mounted on
> top of the non-directory object.
>
> This mount will have special properties:
>
> - If there's no trailing slash after the file name, the mount
> won't be followed, even if the path resolution would otherwise
> follow mounts.
>
> - The mount only stays there while it is referenced by some external
> object, like a pwd or an open file. When it is no longer
> referenced, it is automatically unmounted.
>
> - Unlike "real" mounts, this won't block unlink(2) or rename(2) on
> the underlying object.

Interesting... How do you deal with mount propagation and things like
mount --move? As for unlink... How do you deal with having that thing
mounted, mounting something _under_ it (so that vfsmount would be kept
busy) and then unlinking that sucker?

I'll look through the patch tonight; it sounds interesting, assuming that
we don't run into serious crap with locking and <shudder> revalidation
logics.

2007-05-22 23:29:26

by Shaya Potter

Subject: Re: [RFC PATCH] file as directory

Miklos Szeredi wrote:
> Why do we want this?
> --------------------
>
> That depends on who you ask. My answer is this:
>
> 'foo.tar.gz/foo/bar' or
> 'foo.tar.gz/contents/foo/bar'
>
> or something similar.
>
> Others might suggest accessing streams, resource forks or extended
> attributes through such an interface. However this patch only deals
> with the non-directory case, so directories would be excluded from
> that interface.

Here's a possibly stupid question: what about symlinks to dirs? Shells
tend to treat them differently depending on whether they are postfixed
with a slash or not.

2007-05-23 06:36:53

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > When a non-directory object is accessed without a trailing slash, then
> > path resolution returns the object itself as usual.
> >
> > If a non-directory object is accessed with a trailing slash, then the
> > filesystem may opt to let the file be accessed as a directory. In
> > this case "something" (as supplied by the filesystem) is mounted on
> > top of the non-directory object.
> >
> > This mount will have special properties:
> >
> > - If there's no trailing slash after the file name, the mount
> > won't be followed, even if the path resolution would otherwise
> > follow mounts.
> >
> > - The mount only stays there while it is referenced by some external
> > object, like a pwd or an open file. When it is no longer
> > referenced, it is automatically unmounted.
> >
> > - Unlike "real" mounts, this won't block unlink(2) or rename(2) on
> > the underlying object.
>
> Interesting... How do you deal with mount propagation and things like
> mount --move?

Moving (or doing other mount operations on) an ancestor shouldn't be a
problem. Moving this mount itself is not allowed, and neither is
doing bind or pivot_root. Maybe bind could be allowed...

When doing recursive bind on ancestor, these mounts are skipped.

> As for unlink... How do you deal with having that thing
> mounted, mounting something _under_ it (so that vfsmount would be kept
> busy) and then unlinking that sucker?

Yeah, that's a good point. Current patch doesn't deal with that.
Simplest solution could be to disallow submounting these. Don't think
it makes much sense anyway.

> I'll look through the patch tonight; it sounds interesting, assuming that
> we don't run into serious crap with locking and <shudder> revalidation
> logics.

Revalidation shouldn't be a problem. We'll just end up with an
unhashed dentry with a mount over it, which will be detached when the
vfsmount ref is dropped.

Miklos

2007-05-23 06:40:34

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > Others might suggest accessing streams, resource forks or extended
> > attributes through such an interface. However this patch only deals
> > with the non-directory case, so directories would be excluded from
> > that interface.
>
> here's a possibly stupid question. What about symlinks to dirs? namely
> the shells tend to treat them differently if postfixed with a slash or not.

Right. So it only works on non-directory, non-symlink objects.

Miklos

2007-05-23 07:03:19

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 08:36:04AM +0200, Miklos Szeredi wrote:
> > Interesting... How do you deal with mount propagation and things like
> > mount --move?
>
> Moving (or doing other mount operations on) an ancestor shouldn't be a
> problem. Moving this mount itself is not allowed, and neither is
> doing bind or pivot_root. Maybe bind could be allowed...

Eh... Arbitrary limitations are fun, aren't they?

> When doing recursive bind on ancestor, these mounts are skipped.

What about clone copying your namespace? What about MNT_SLAVE stuff being
set up prior to that lookup? More interesting question: should independent
lookups of that sucker on different paths end up with the same superblock
(and vfsmount for each) or should we get fully independent mount on each?
The latter would be interesting wrt cache coherency...

> > As for unlink... How do you deal with having that thing
> > mounted, mounting something _under_ it (so that vfsmount would be kept
> > busy) and then unlinking that sucker?
>
> Yeah, that's a good point. Current patch doesn't deal with that.
> Simplest solution could be to disallow submounting these. Don't think
> it makes much sense anyway.

Arbitrary limitations... (and that's where revalidate horrors come in, BTW).
BTW^2: what if fs mounted that way will happen to have such node itself?

I'm not saying that it's unfeasible or won't lead to interesting things,
but it really needs semantics done right...

2007-05-23 07:20:19

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > > Interesting... How do you deal with mount propagation and things like
> > > mount --move?
> >
> > Moving (or doing other mount operations on) an ancestor shouldn't be a
> > problem. Moving this mount itself is not allowed, and neither is
> > doing bind or pivot_root. Maybe bind could be allowed...
>
> Eh... Arbitrary limitations are fun, aren't they?

But these mounts _are_ special. There is really no point in moving or
pivoting them.

> > When doing recursive bind on ancestor, these mounts are skipped.
>
> What about clone copying your namespace?

In that case they are cloned, but only those survive which have refs
in the new namespace.

> What about MNT_SLAVE stuff being set up prior to that lookup?

These mounts are not propagated. Or at least I hope so. Propagation
stuff is a bit too complicated for my poor little brain.

> More interesting question: should independent lookups of that sucker
> on different paths end up with the same superblock (and vfsmount for
> each) or should we get fully independent mount on each? The latter
> would be interesting wrt cache coherency...

I think they should be the same superblock, same dentry. What would
be the advantage of doing otherwise?

> > > As for unlink... How do you deal with having that thing
> > > mounted, mounting something _under_ it (so that vfsmount would be kept
> > > busy) and then unlinking that sucker?
> >
> > Yeah, that's a good point. Current patch doesn't deal with that.
> > Simplest solution could be to disallow submounting these. Don't think
> > it makes much sense anyway.
>
> Arbitrary limitations... (and that's where revalidate horrors come in, BTW).
> BTW^2: what if fs mounted that way will happen to have such node itself?

I think doing this recursively should be allowed. "Releasing last ref
cleans up the mess" should work in that case.

> I'm not saying that it's unfeasible or won't lead to interesting things,
> but it really needs semantics done right...

Agreed :)

Miklos

2007-05-23 07:37:13

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote:
> > Eh... Arbitrary limitations are fun, aren't they?
>
> But these mounts _are_ special. There is really no point in moving or
> pivoting them.

pivoting - probably true, moving... why not?

> > What about MNT_SLAVE stuff being set up prior to that lookup?
>
> These mounts are not propagated. Or at least I hope so. Propagation
> stuff is a bit too complicated for my poor little brain.

Er... These mounts might not be propagated, but what about a bind
over another instance of such file in master tree?

> I think they should be the same superblock, same dentry. What would
> be the advantage of doing otherwise?

Then you are going to have interesting time with locking in final mntput().
BTW, what about having several links to the same file? You have i_mutex
on the inode, so serialization of those is not a problem, but...

> I think doing this recursively should be allowed. "Releasing last ref
> cleans up the mess" should work in that case.

Releasing the last reference will lead to cascade of umounts in that
case... IOW, need to be careful with locking.

2007-05-23 08:06:20

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > > Eh... Arbitrary limitations are fun, aren't they?
> >
> > But these mounts _are_ special. There is really no point in moving or
> > pivoting them.
>
> pivoting - probably true, moving... why not?

I don't see any use for that. But indeed, it should not be too hard
to do.

> > > What about MNT_SLAVE stuff being set up prior to that lookup?
> >
> > These mounts are not propagated. Or at least I hope so. Propagation
> > stuff is a bit too complicated for my poor little brain.
>
> Er... These mounts might not be propagated, but what about a bind
> over another instance of such file in master tree?

So your question is, which mount takes priority on the lookup? It
probably should be the propagated real mount, rather than the
dir-on-file one, shouldn't it?

> > I think they should be the same superblock, same dentry. What would
> > be the advantage of doing otherwise?
>
> Then you are going to have interesting time with locking in final mntput().

Final mntput of what?

> BTW, what about having several links to the same file? You have i_mutex
> on the inode, so serialization of those is not a problem, but...

Sorry, I lost it...

> > I think doing this recursively should be allowed. "Releasing last ref
> > cleans up the mess" should work in that case.
>
> Releasing the last reference will lead to cascade of umounts in that
> case... IOW, need to be careful with locking.

I think it's done right: detach_mnt() with namespace_sem and
vfsmount_lock, then release locks, and path_release(&old_nd).

If the recursion is extremely deep we could have stack overflow
problems though, aargh...

Miklos

2007-05-23 08:29:29

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 10:05:21AM +0200, Miklos Szeredi wrote:
> > Er... These mounts might not be propagated, but what about a bind
> > over another instance of such file in master tree?
>
> So your question is, which mount takes priority on the lookup? It
> probably should be the propagated real mount, rather than the
> dir-on-file one, shouldn't it?

There might be dragons in that area...

> > > I think they should be the same superblock, same dentry. What would
> > > be the advantage of doing otherwise?
> >
> > Then you are going to have interesting time with locking in final mntput().
>
> Final mntput of what?

When the last reference to your mount goes away.

> > BTW, what about having several links to the same file? You have i_mutex
> > on the inode, so serialization of those is not a problem, but...
>
> Sorry, I lost it...

Say /foo/bar/a is such a file.

cd /foo/bar
ln a b

now do lookups on a/ and b/

What happens?

2007-05-23 09:04:12

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> On Wed, May 23, 2007 at 10:05:21AM +0200, Miklos Szeredi wrote:
> > > Er... These mounts might not be propagated, but what about a bind
> > > over another instance of such file in master tree?
> >
> > So your question is, which mount takes priority on the lookup? It
> > probably should be the propagated real mount, rather than the
> > dir-on-file one, shouldn't it?
>
> There might be dragons in that area...
>
> > > > I think they should be the same superblock, same dentry. What would
> > > > be the advantage of doing otherwise?
> > >
> > > Then you are going to have interesting time with locking in final mntput().
> >
> > Final mntput of what?
>
> When the last reference to your mount goes away.

I still don't get where the superblock comes in. The locking is
"interesting" in there, yes. And I haven't completely convinced
myself it's right, let alone something that won't easily be screwed up
in the future. So there's definitely room for thought there.

But how does it matter if two different paths have the same sb or a
different sb mounted over them?

> > > BTW, what about having several links to the same file? You have i_mutex
> > > on the inode, so serialization of those is not a problem, but...
> >
> > Sorry, I lost it...
>
> Say /foo/bar/a is such a file.
>
> cd /foo/bar
> ln a b
>
> now do lookups on a/ and b/
>
> What happens?

The same dentry is mounted over each one. The contents of the
directory should only depend on the contents of the underlying inode.
The path leading up to it is completely irrelevant.
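
For reference, a minimal userspace sketch of how the two names in the a/b
example can be recognized as one object: hard links share st_dev and
st_ino. The helper name same_inode() is made up here for illustration; it
is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns 1 if both paths resolve to the same
 * inode on the same device, 0 if they differ, -1 if stat() fails.
 */
int same_inode(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a != NULL && b != NULL);

	if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
		return -1;

	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After "ln a b", same_inode("a", "b") would be expected to return 1, which
is the property relied on when saying the directory contents depend only
on the underlying inode.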

Miklos

2007-05-23 09:16:33

by Jan Blunck

Subject: Re: [RFC PATCH] file as directory

On 5/23/07, Miklos Szeredi wrote:
>
> So your question is, which mount takes priority on the lookup? It
> probably should be the propagated real mount, rather than the
> dir-on-file one, shouldn't it?
>

Maybe this belongs in __link_path_walk(), similar to the handling of
symbolic links. If the real mount always has higher priority, why do
we bother about it in follow_mount()?

Jan

2007-05-23 09:21:22

by Jan Blunck

Subject: Re: [RFC PATCH] file as directory

On 5/23/07, Miklos Szeredi wrote:
>
> > As for unlink... How do you deal with having that thing
> > mounted, mounting something _under_ it (so that vfsmount would be kept
> > busy) and then unlinking that sucker?
>
> Yeah, that's a good point. Current patch doesn't deal with that.
> Simplest solution could be to disallow submounting these. Don't think
> it makes much sense anyway.
>

Hmm, think about /your/path/qemu-disk1.img/boot ,
/your/path/qemu-disk2.img/usr , ...

Jan

2007-05-23 09:30:37

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > So your question is, which mount takes priority on the lookup? It
> > probably should be the propagated real mount, rather than the
> > dir-on-file one, shouldn't it?
> >
>
> Maybe this belongs in __link_path_walk(), similar to the handling of
> symbolic links. If the real mount always has higher priority, why do
> we bother about it in follow_mount()?

Do you mean that follow_mount() should never descend into the
dir-on-file mount, and that this should always be done by
__link_path_walk()?

This could make sense.

__lookup_mnt() currently returns the first matching mount in the hash
list. With your suggestion, we'd need two __lookup_mnt() variants (or
a parameter). One, that only matches normal mounts, and one that only
matches dir-on-file mounts. Is that it?
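
A toy userspace model of the "parameter" variant, just to make the idea
concrete. Everything here (struct toy_mount, toy_lookup_mnt(), the integer
key) is invented for illustration and does not reflect the real vfsmount
hash or the __lookup_mnt() signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one entry on a mount hash chain; not the kernel struct. */
struct toy_mount {
	int mountpoint_id;	/* stands in for the (parent mnt, dentry) key */
	bool dironfile;		/* stands in for MNT_DIRONFILE */
	struct toy_mount *next;
};

/*
 * Return the first mount on the chain that matches the key AND is of
 * the requested kind: normal mounts when want_dironfile is false,
 * dir-on-file mounts when it is true. Returns NULL if none matches.
 */
struct toy_mount *toy_lookup_mnt(struct toy_mount *head, int mountpoint_id,
				 bool want_dironfile)
{
	struct toy_mount *m;

	assert(mountpoint_id >= 0);	/* ids are non-negative in this model */

	for (m = head; m != NULL; m = m->next) {
		if (m->mountpoint_id == mountpoint_id &&
		    m->dironfile == want_dironfile)
			return m;
	}
	return NULL;
}
```

In this model follow_mount() would pass false and the trailing-slash
handling in __link_path_walk() would pass true, so neither caller ever
sees the other kind of mount.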

Miklos

2007-05-23 09:36:48

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > > As for unlink... How do you deal with having that thing
> > > mounted, mounting something _under_ it (so that vfsmount would be kept
> > > busy) and then unlinking that sucker?
> >
> > Yeah, that's a good point. Current patch doesn't deal with that.
> > Simplest solution could be to disallow submounting these. Don't think
> > it makes much sense anyway.
> >
>
> Hmm, think about /your/path/qemu-disk1.img/boot ,
> /your/path/qemu-disk2.img/usr , ...

I get it.

It could probably be done with a little added complexity. For example
when a real mount is attached onto a dir-on-file mount, the
"mountedness" is propagated up to the dentry on the next real mount.

So in that case unlink won't be allowed, even if the immediate
attachment is a dir-on-file mount.

This is tricky to do right though.

The other possibility is to detach all mount trees attached to the
dentry on unlink.

Miklos

2007-05-23 09:51:39

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Tue, May 22, 2007 at 08:48:49PM +0200, Miklos Szeredi wrote:
> */
> -static int __follow_mount(struct path *path)
> +static int __follow_mount(struct path *path, bool enter)
> {
> int res = 0;
> while (d_mountpoint(path->dentry)) {
> - struct vfsmount *mounted = lookup_mnt(path->mnt, path->dentry);
> + struct vfsmount *mounted =
> + lookup_mnt(path->mnt, path->dentry, enter);
> +
> if (!mounted)
> break;
> dput(path->dentry);
> @@ -689,27 +697,37 @@ static int __follow_mount(struct path *p
> return res;
> }
>
> -static void follow_mount(struct vfsmount **mnt, struct dentry **dentry)
> +/*
> + * Follows mounts on the given nameidata.
> + *
> + * Only follows "directory on file" mounts if LOOKUP_ENTER is set.
> + */
> +void follow_mount(struct nameidata *nd)

BTW, I'd split that (and matching updates in callers) into separate
patch.

> {
> - while (d_mountpoint(*dentry)) {
> - struct vfsmount *mounted = lookup_mnt(*mnt, *dentry);
> + while (d_mountpoint(nd->dentry)) {
> + bool enter = nd->flags & LOOKUP_ENTER;

int, surely?

> + * This is called if the object has no ->lookup() defined, yet the
> + * path contains a slash after the object name.
> + *
> + * If the filesystem defines an ->enter() method, this will be called,
> + * and the filesystem shall fill the supplied struct path or return an
> + * error.
> + *
> + * The returned path will be bind mounted on top of the object with
> + * the MNT_DIRONFILE flag, and the nameidata will descend into the
> + * mount.
> + */
> +static int enter_file(struct inode *inode, struct nameidata *nd)
> +{
> +	int err;
> +	struct path newpath;
> +
> +	printk(KERN_DEBUG "%s/%d enter %s/\n", current->comm, current->pid,
> +	       nd->dentry->d_name.name);
> +	if (!inode->i_op->enter)
> +		return -ENOTDIR;
> +
> +	newpath.mnt = NULL;
> +	newpath.dentry = NULL;
> +	err = inode->i_op->enter(nd, &newpath);
> +	if (!err) {
> +		err = mount_dironfile(nd, &newpath);
> +		pathput(&newpath);
> +	}
> +	return err;

Ouch. What guarantees that two lookups won't race right here? You are
not holding any locks at that point, AFAICS...

BTW, why newpath? What's wrong with simply returning a new vfsmount
with right ->mnt_root/->mnt_sb (instead of creating it inside
mount_dironfile())? ERR_PTR() for error, struct vfsmount * for success...

> @@ -301,8 +310,8 @@ static struct vfsmount *clone_mnt(struct
> mnt->mnt_mountpoint = mnt->mnt_root;
> mnt->mnt_parent = mnt;
>
> - /* don't copy the MNT_USER flag */
> - mnt->mnt_flags &= ~MNT_USER;
> + /* don't copy some flags */
> + mnt->mnt_flags &= ~(MNT_USER | MNT_DIRONFILE);
> if (flag & CL_SETUSER)
> __set_mnt_user(mnt, owner);

Hmm? So you do copy them and strip your MNT_DIRONFILE from copies?

> + * This is tricky, because for namespace modification we must take the
> + * namespace semaphore. But mntput() is called from various places,
> + * sometimes with namespace_sem held. Fortunately in those places the
> + * mount cannot yet have MNT_DIRONFILE, or at least that's what I
> + * hope...
> + *
> + * The umounting is done in two stages, first the mount is removed
> + * from the hashes. This is done atomically wrt other mount lookups,
> + * so it's not possible to acquire a new ref to this dead mount that
> + * way.
> + *
> + * Then after having locked namespace_sem and relocked vfsmount_lock,
> + * the mount is properly detached.
> + */
> +static void umount_dironfile(struct vfsmount *mnt)
> + __releases(vfsmount_lock)
> +{
> + struct nameidata nd;

You've got to be kidding. nameidata is *big*. If anything, we want
to make detach_mnt() take struct path * instead, but even that is
lousy due to recursion.

I really don't like what's going on here. The thing is, current code
is based on assumption that presence in the mount tree => holding a
reference. We _might_ deal with that (there was an old plan to change
refcounting logics for vfsmounts), but that sort of games with locks
spells trouble. What happens, for example, if namespace gets cloned
before you grab namespace_sem?

There's another problem, BTW - a lot of stuff does stat + open + fstat +
compare kind of sequence. You'll end up mounting/umounting between stat
and open, which opens you to race with somebody else. Get a different
st_dev, eat a nice unreproducible error from application...
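
A minimal userspace sketch of the stat + open + fstat + compare sequence
referred to above. The function name open_checked() is invented for
illustration; real applications vary in which fields they compare and in
how they react to a mismatch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open `path` read-only and verify that the opened object is the one
 * stat() saw a moment earlier, by comparing st_dev and st_ino.
 * Returns an open fd on success, -1 on any failure or mismatch.
 */
int open_checked(const char *path)
{
	struct stat before, after;
	int fd;

	assert(path != NULL);

	if (stat(path, &before) < 0)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	if (fstat(fd, &after) < 0 ||
	    before.st_dev != after.st_dev ||
	    before.st_ino != after.st_ino) {
		/* The object changed between stat() and open(): give up. */
		close(fd);
		return -1;
	}
	return fd;
}
```

If a dir-on-file mount were torn down and recreated with a different
st_dev between the stat() and the open(), a check of this shape would
fail even though nothing visible changed on disk.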

2007-05-23 09:58:36

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 11:03:08AM +0200, Miklos Szeredi wrote:
> I still don't get where the superblock comes in. The locking is
> "interesting" in there, yes. And I haven't completely convinced
> myself it's right, let alone something that won't easily be screwed up
> in the future. So there's definitely room for thought there.
>
> But how does it matter if two different paths have the same sb or a
> different sb mounted over them?

Because then you get a slew of fun issues with dropping the final reference
to vfsmount vs. lookup on another place. What hold do you have on that
superblock and when do you switch from "oh, called ->enter() on the same
inode again, return vfsmount over the same superblock" to "need to
initialize that damn superblock, all mounts are gone"?

> The same dentry is mounted over each one. The contents of the
> directory should only depend on the contents of the underlying inode.
> The path leading up to it is completely irrelevant.

So what kind of exclusion do you have for ->enter()? None?

2007-05-23 10:10:26

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > + * This is called if the object has no ->lookup() defined, yet the
> > + * path contains a slash after the object name.
> > + *
> > + * If the filesystem defines an ->enter() method, this will be called,
> > + * and the filesystem shall fill the supplied struct path or return an
> > + * error.
> > + *
> > + * The returned path will be bind mounted on top of the object with
> > + * the MNT_DIRONFILE flag, and the nameidata will descend into the
> > + * mount.
> > + */
> > +static int enter_file(struct inode *inode, struct nameidata *nd)
> > +{
> > +	int err;
> > +	struct path newpath;
> > +
> > +	printk(KERN_DEBUG "%s/%d enter %s/\n", current->comm, current->pid,
> > +	       nd->dentry->d_name.name);
> > +	if (!inode->i_op->enter)
> > +		return -ENOTDIR;
> > +
> > +	newpath.mnt = NULL;
> > +	newpath.dentry = NULL;
> > +	err = inode->i_op->enter(nd, &newpath);
> > +	if (!err) {
> > +		err = mount_dironfile(nd, &newpath);
> > +		pathput(&newpath);
> > +	}
> > +	return err;
>
> Ouch. What guarantees that two lookups won't race right here? You are
> not holding any locks at that point, AFAICS...

Right. After locking vfsmount_lock, mount_dironfile() should recheck
if there was a race and bail out.

> BTW, why newpath? What's wrong with simply returning a new vfsmount
> with right ->mnt_root/->mnt_sb (instead of creating it inside
> mount_dironfile())? ERR_PTR() for error, struct vfsmount * for success...

I don't think the filesystem ought to try _creating_ a vfsmount. I
imagine that the fs already has a kernel-internal mount for this
kind of stuff, and it just supplies a dentry from that. The vfsmount
isn't actually important, but it should be readily available, and it's
easier to clone from a vfsmount/dentry pair.

> > @@ -301,8 +310,8 @@ static struct vfsmount *clone_mnt(struct
> > mnt->mnt_mountpoint = mnt->mnt_root;
> > mnt->mnt_parent = mnt;
> >
> > - /* don't copy the MNT_USER flag */
> > - mnt->mnt_flags &= ~MNT_USER;
> > + /* don't copy some flags */
> > + mnt->mnt_flags &= ~(MNT_USER | MNT_DIRONFILE);
> > if (flag & CL_SETUSER)
> > __set_mnt_user(mnt, owner);
>
> Hmm? So you do copy them and strip your MNT_DIRONFILE from copies?

Yes. On namespace cloning the MNT_DIRONFILE will be re-added later.
Otherwise we shouldn't even get here with MNT_DIRONFILE.

> > + * This is tricky, because for namespace modification we must take the
> > + * namespace semaphore. But mntput() is called from various places,
> > + * sometimes with namespace_sem held. Fortunately in those places the
> > + * mount cannot yet have MNT_DIRONFILE, or at least that's what I
> > + * hope...
> > + *
> > + * The umounting is done in two stages, first the mount is removed
> > + * from the hashes. This is done atomically wrt other mount lookups,
> > + * so it's not possible to acquire a new ref to this dead mount that
> > + * way.
> > + *
> > + * Then after having locked namespace_sem and relocked vfsmount_lock,
> > + * the mount is properly detached.
> > + */
> > +static void umount_dironfile(struct vfsmount *mnt)
> > + __releases(vfsmount_lock)
> > +{
> > + struct nameidata nd;
>
> You've got to be kidding. nameidata is *big*. If anything, we want
> to make detach_mnt() take struct path * instead, but even that is
> lousy due to recursion.
>
> I really don't like what's going on here. The thing is, current code
> is based on assumption that presence in the mount tree => holding a
> reference. We _might_ deal with that (there was an old plan to change
> refcounting logics for vfsmounts), but that sort of games with locks
> spells trouble. What happens, for example, if namespace gets cloned
> before you grab namespace_sem?

It _should_ work. The mount in the new namespace will be created
(with namespace_sem held, so we can't yet free this mount), and then
dropped, because there are no refs to it.

> There's another problem, BTW - a lot of stuff does stat + open + fstat +
> compare kind of sequence. You'll end up mounting/umounting between stat
> and open, which opens you to race with somebody else. Get a different
> st_dev, eat a nice unreproducible error from application...

As I said, the superblock should be persistent, so we'll get a stable
st_dev for multiple mounts.

Miklos

2007-05-23 10:15:23

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> On Wed, May 23, 2007 at 11:03:08AM +0200, Miklos Szeredi wrote:
> > I still don't get where the superblock comes in. The locking is
> > "interesting" in there, yes. And I haven't completely convinced
> > myself it's right, let alone something that won't easily be screwed up
> > in the future. So there's definitely room for thought there.
> >
> > But how does it matter if two different paths have the same sb or a
> > different sb mounted over them?
>
> Because then you get a slew of fun issues with dropping the final reference
> to vfsmount vs. lookup on another place. What hold do you have on that
> superblock and when do you switch from "oh, called ->enter() on the same
> inode again, return vfsmount over the same superblock" to "need to
> initialize that damn superblock, all mounts are gone"?
>
> > The same dentry is mounted over each one. The contents of the
> > directory should only depend on the contents of the underlying inode.
> > The path leading up to it is completely irrelevant.
>
> So what kind of exclusion do you have for ->enter()? None?
>

So really these issues are about how we get hold of the superblock
to mount.

I think that should be a filesystem internal problem, and I suspect
the easiest solution is to just have a permanent meta superblock for
these dir-on-file mounts.

Miklos

2007-05-23 10:24:48

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 12:09:19PM +0200, Miklos Szeredi wrote:
> Right. After locking vfsmount_lock, mount_dironfile() should recheck
> if there was a race and bail out.

Owww... Not pretty, that...

> I don't think the filesystem ought to try _creating_ a vfsmount. I
> imagine that the fs already has a kernel-internal mount for this
> kind of stuff, and it just supplies a dentry from that. The vfsmount
> isn't actually important, but it should be readily available, and it's
> easier to clone from a vfsmount/dentry pair.

I don't get it. What's the point of that exercise, then? When do you
create that kernel-internal mount?

> As I said, the superblock should be persistent, so we'll get a stable
> st_dev for multiple mounts.

OK, but then I guess I don't understand the intended use.

2007-05-23 10:25:05

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > > + * This is tricky, because for namespace modification we must take the
> > > + * namespace semaphore. But mntput() is called from various places,
> > > + * sometimes with namespace_sem held. Fortunately in those places the
> > > + * mount cannot yet have MNT_DIRONFILE, or at least that's what I
> > > + * hope...
> > > + *
> > > + * The umounting is done in two stages, first the mount is removed
> > > + * from the hashes. This is done atomically wrt other mount lookups,
> > > + * so it's not possible to acquire a new ref to this dead mount that
> > > + * way.
> > > + *
> > > + * Then after having locked namespace_sem and relocked vfsmount_lock,
> > > + * the mount is properly detached.
> > > + */
> > > +static void umount_dironfile(struct vfsmount *mnt)
> > > + __releases(vfsmount_lock)
> > > +{
> > > + struct nameidata nd;
> >
> > You've got to be kidding. nameidata is *big*. If anything, we want
> > to make detach_mnt() take struct path * instead, but even that is
> > lousy due to recursion.
> >
> > I really don't like what's going on here. The thing is, current code
> > is based on assumption that presence in the mount tree => holding a
> > reference. We _might_ deal with that (there was an old plan to change
> > refcounting logics for vfsmounts), but that sort of games with locks
> > spells trouble. What happens, for example, if namespace gets cloned
> > before you grab namespace_sem?
>
> It _should_ work. The mount in the new namespace will be created
> (with namespace_sem held, so we can't yet free this mount), and then
> dropped, because there are no refs to it.

BTW, I'm not saying I like this. It's pretty ugly and fragile. But
it's damn convenient to get rid of these mounts from mntput().

Is there a better alternative?

Miklos

2007-05-23 10:41:32

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > Right. After locking vfsmount_lock, mount_dironfile() should recheck
> > if there was a race and bail out.
>
> Owww... Not pretty, that...

If the cost of ->enter() is low, then it shouldn't really be a problem.
We can't use ->i_mutex for locking, and introducing a new lock for
this doesn't sound right either.

> > I don't think the filesystem ought to try _creating_ a vfsmount. I
> > imagine that the fs already has a kernel-internal mount for this
> > kind of stuff, and it just supplies a dentry from that. The vfsmount
> > isn't actually important, but it should be readily available, and it's
> > easier to clone from a vfsmount/dentry pair.
>
> I don't get it. What's the point of that exercise, then? When do you
> create that kernel-internal mount?

When the real superblock is created. It could even be the _same_
super block as the real one. There'd be just the problem of anchoring
the dir-on-file dentries somewhere...

Or with fuse the dir-on-file mount can just come from any mounted
filesystem, again possibly the same one as the parent. I do actually
test with this. The userspace filesystem supplies a file descriptor,
from which the struct path is extracted and returned from ->enter().

Miklos

2007-05-23 11:39:35

by Al Viro

Subject: Re: [RFC PATCH] file as directory

> When the real superblock is created. It could even be the _same_
> super block as the real one. There'd be just the problem of anchoring
> the dir-on-file dentries somewhere...
>
> Or with fuse the dir-on-file mount can just come from any mounted
> filesystem, again possibly the same one as the parent. I do actually
> test with this. The userspace filesystem supplies a file descriptor,
> from which the struct path is extracted and returned from ->enter().

Then I do not understand what this mechanism could be used for, other
than an odd way to twist POSIX behaviour and see how much of the userland
would survive that. Certainly not useful for your "look into tarball
as a tree", unless you seriously want to scan the entire damn fs for
tarballs at mount time and set up a superblock for each. And for per-file
extended attributes/forks/whatever-you-call-that-abomination it also
obviously doesn't help, since you lose them for directories.

IOW, what uses do you have in mind? Complete scenario, please...

2007-05-23 12:05:28

by Jan Engelhardt

Subject: Re: [RFC PATCH] file as directory


On May 22 2007 20:48, Miklos Szeredi wrote:
>Why do we want this?
>--------------------
>
>That depends on who you ask. My answer is this:
>
> 'foo.tar.gz/foo/bar' or
> 'foo.tar.gz/contents/foo/bar'
>
>or something similar.

This steals an idea from reiser4.
These semantics are quite fragile. Until now, chdir is only possible
for directories (otherwise, -ENOTDIR), and opening a directory without
O_DIRECTORY gives -EISDIR. You can't just change semantics.
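
The behaviour that existing programs may depend on can be probed from
userspace. Below is a minimal sketch, assuming a Linux system and a
writable temp directory; the helper name `probe` is made up for
illustration. On Linux a read-only open() of a directory succeeds, and
EISDIR shows up when a directory is opened for writing, so that is the
case probed here.

```python
import errno
import os
import tempfile


def probe(func, *args):
    """Call func(*args); return the errno name on OSError, else 'ok'."""
    try:
        result = func(*args)
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    # Close the file descriptor if os.open() handed one back.
    if func is os.open:
        os.close(result)
    return "ok"


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "foo.tar.gz")
    with open(path, "w") as f:
        f.write("not really a tarball")

    results = {
        # Plain name: resolves to the file itself.
        "stat file": probe(os.stat, path),
        # Trailing slash on a non-directory.
        "stat file/": probe(os.stat, path + "/"),
        # O_DIRECTORY on a non-directory.
        "open file O_DIRECTORY": probe(os.open, path,
                                       os.O_RDONLY | os.O_DIRECTORY),
        # Opening a directory for writing.
        "open dir O_WRONLY": probe(os.open, d, os.O_WRONLY),
    }

for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

On a filesystem without "file as directory" support, the two probes on
the non-directory are expected to fail with ENOTDIR; those are the
results an application might be using as an "is it a directory?" test.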

That said, with FUSE, something like this should already be possible,
shouldn't it?

And looking at your example of foo.tar.gz/foo/bar, the tar.gz needs to
be read at least once to get at foo/bar.


Jan
--

2007-05-23 12:16:35

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 12:39:25PM +0100, Al Viro wrote:
> Then I do not understand what this mechanism could be used for, other
> than an odd way to twist POSIX behaviour and see how much of the userland
> would survive that. Certainly not useful for your "look into tarball
> as a tree", unless you seriously want to scan the entire damn fs for
> tarballs at mount time and set up a superblock for each. And for per-file
> extended attributes/forks/whatever-you-call-that-abomination it also
> obviously doesn't help, since you lose them for directories.
>
> IOW, what uses do you have in mind? Complete scenario, please...

Ah... After rereading the thread you've mentioned in the very beginning,
I think I understand what you are driving at. However, in that case
* I really don't see why bother with returning vfsmount at all.
dentry alone is enough to create a new vfsmount, all in fs/namei.c.
* the lifetime rules look fscking scary. You call that ->enter()
on nearly every damn lookup. OK, so you'll recreate equivalent vfsmount,
but... That's a lot of allocations/freeing. Can we do some caching and
deal with it on memory pressure?
* invalidation on unlink is still an open problem.
* locking in final mntput() doesn't look nice; we probably need
a new refcounting scheme for vfsmounts to make that work. I have a variant
that might work here (and make life much easier for expiry logics in
automount/shared trees, which is what it had been initially proposed for),
but it still doesn't kill the need to deal with invalidation. And yes,
NFS still needs it (and so do all network filesystems, really). The question
of caching is related to that.

2007-05-23 12:35:00

by Trond Myklebust

Subject: Re: [RFC PATCH] file as directory

On Wed, 2007-05-23 at 08:36 +0100, Al Viro wrote:
> On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote:
> > > Eh... Arbitrary limitations are fun, aren't they?
> >
> > But these mounts _are_ special. There is really no point in moving or
> > pivoting them.
>
> pivoting - probably true, moving... why not?

Moving would be an implementation artefact that doesn't really
correspond to any useful operation on the filesystem.

AFAIK, most filesystems that have implemented subfiles (excepting
Reiser4 of course) do not allow you to rename or move the subfile
directory or its contents from one parent file to another.

Trond

2007-05-23 12:40:28

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 08:34:42AM -0400, Trond Myklebust wrote:
> On Wed, 2007-05-23 at 08:36 +0100, Al Viro wrote:
> > On Wed, May 23, 2007 at 09:19:17AM +0200, Miklos Szeredi wrote:
> > > > Eh... Arbitrary limitations are fun, aren't they?
> > >
> > > But these mounts _are_ special. There is really no point in moving or
> > > pivoting them.
> >
> > pivoting - probably true, moving... why not?
>
> Moving would be an implementation artefact that doesn't really
> correspond to any useful operation on the filesystem.
>
> AFAIK, most filesystems that have implemented subfiles (excepting
> Reiser4 of course) do not allow you to rename or move the subfile
> directory or its contents from one parent file to another.

If that's about xattr and nothing else, colour me thoroughly uninterested.
If it might have other interesting uses, OTOH...

2007-05-23 13:02:30

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> On Wed, May 23, 2007 at 12:39:25PM +0100, Al Viro wrote:
> > Then I do not understand what this mechanism could be used for, other
> > than an odd way to twist POSIX behaviour and see how much of the userland
> > would survive that. Certainly not useful for your "look into tarball
> > as a tree", unless you seriously want to scan the entire damn fs for
> > tarballs at mount time and set up a superblock for each. And for per-file
> > extended attributes/forks/whatever-you-call-that-abomination it also
> > obviously doesn't help, since you lose them for directories.

Someone might think of a way to make those work with directories.
Invisible directory entries, anyone? </me ducks>

> > IOW, what uses do you have in mind? Complete scenario, please...
>
> Ah... After rereading the thread you've mentioned in the very beginning,
> I think I understand what you are driving at. However, in that case
> * I really don't see why bother with returning vfsmount at all.
> dentry alone is enough to create a new vfsmount, all in fs/namei.c.
> * the lifetime rules look fscking scary. You call that ->enter()
> on nearly every damn lookup. OK, so you'll recreate equivalent vfsmount,
> but... That's a lot of allocations/freeing. Can we do some caching and
> deal with it on memory pressure?

Hmm.

> * invalidation on unlink is still an open problem.
> * locking in final mntput() doesn't look nice; we probably need
> a new refcounting scheme for vfsmounts to make that work. I have a variant
> that might work here (and make life much easier for expiry logics in
> automount/shared trees, which is what it had been initially proposed for),

Which variant? We had that "detached subtrees" thing, is that it?

> but it still doesn't kill the need to deal with invalidation. And
> yes, NFS still needs it (and so do all network filesystems, really).
> The question of caching is related to that.

So what's so special about invalidation? Why not just treat
dir-on-file mounts the same as any other ref on the dentry?

Miklos

2007-05-23 13:24:17

by Ph. Marek

Subject: Re: [RFC PATCH] file as directory

On Wednesday, 23 May 2007, Al Viro wrote:
> Then I do not understand what this mechanism could be used for, other
> than an odd way to twist POSIX behaviour and see how much of the userland
> would survive that.
I have some similar considerations about how userspace should deal with that.

The behaviour of simply "cd file/" is not that robust, I fear ...


> Certainly not useful for your "look into tarball
> as a tree", unless you seriously want to scan the entire damn fs for
> tarballs at mount time and set up a superblock for each. And for per-file
> extended attributes/forks/whatever-you-call-that-abomination it also
> obviously doesn't help, since you lose them for directories.
Well, *use cases* I can see. I'd like to use that - for loop mounting,
archives, possibly using symlinks to remote filesystems "symlink1 =>
ssh:user@ip" (although that's possible with FUSE anyway - but would be
possible within a .zip, too), ...


But I'm not sure how to do the presentation to userspace *right*.


How about some special node in eg. /proc (or a new filesystem)?
Eg.
/fileAsDir/etc/passwd/owner ...
would work for all *files*. For directories we do not know whether we're still
climbing the hierarchy or would like to see meta-data.

Some way like a ".this" entry is not the Right Way IMO ...
Well, I cannot imagine a real good way to tell where I'd like to stop
following the "normal" filesystem and go into the "generated" hierarchy ...

/fileAsDir/level-3/usr/local/bin/owner
is not nice.


Regards,

Phil

2007-05-23 13:47:16

by Jaroslav Sykora

Subject: Re: [RFC PATCH] file as directory

Hello,

On Tuesday May 22, 2007, Miklos Szeredi wrote:
> Why do we want this?
> --------------------
>
> That depends on who you ask. My answer is this:
>
> 'foo.tar.gz/foo/bar' or
> 'foo.tar.gz/contents/foo/bar'
>
> or something similar.
>

I am working toward a similar goal in my bachelor's thesis. But my approach is
a little bit different. Instead of:

> 'foo.tar.gz/foo/bar' or
> 'foo.tar.gz/contents/foo/bar'

I do:
'foo.zip^/foo/bar' or
'foo.zip^/contents/foo/bar'

where foo.zip is a ZIP file. See the little '^' in the pathname: it's an
escape character. I have a kernel patch which modifies a lookup
resolution function and when a normal lookup fails ('foo.zip^/foo/bar'
doesn't exist) and the pathname contains '^' it *redirects* the lookup to
a FUSE mount.

So say we have a FUSE vfs server (called 'RheaVFS') on '/tmp/shadow'.
When a process tries to access '/home/xx/foo.zip^/foo/bar'
it is in-kernel transparently redirected to
'/tmp/shadow/home/xx/foo.zip^/foo/bar' and the vfs server handles all the
extraction/compression/semi-mounting/semi-umounting/whatsoever...
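
The redirect rule described above can be modelled in a few lines of
userspace Python. This is only a sketch of the rule as stated in this
mail, not the code from the kernel patch: the function name `redirect`
and the injectable `exists` callback are assumptions made for
illustration.

```python
import os.path

SHADOW_ROOT = "/tmp/shadow"   # FUSE mount point from the example above
ESCAPE = "^"                  # escape character marking a virtual path


def redirect(path, exists=os.path.lexists):
    """Return the path a lookup of `path` would finally resolve through.

    Toy model: the lookup is redirected only if the ordinary lookup
    fails *and* the pathname contains the escape character.
    """
    if exists(path) or ESCAPE not in path:
        return path
    # Prefix the absolute path with the shadow mount point.
    return SHADOW_ROOT + os.path.abspath(path)


# Pretend nothing exists on disk, so every normal lookup fails.
print(redirect("/home/xx/foo.zip^/foo/bar", exists=lambda p: False))
```

Because the normal lookup is tried first, a real file whose name happens
to contain '^' still resolves to itself in this model.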

Advantages:
* 99.9% imho backward compatible. No problems with clever programs
doing stat() before open()/opendir().
* you can easily and transparently stack filesystems one on top of another
with a clear semantic. Say we have 'foo.tar.gz'; then:
'foo.tar.gz^' is a decompressed TAR *file*;
'foo.tar.gz^^' is a directory
* you can pass additional parameters to the vfs server after the '^',
eg. 'foo.zip^compresslevel=1/foo/bar'
* works with symlinks too

Drawbacks:
* users must/should be aware of the special escape char '^'
* usually only a single vfs server per user handles all "virtual"
directories --> single point of failure. (But I implemented a quirk
which allows restarting the FUSE vfs server with only minor
problems)
* probably tons of others I don't know....

The project tarball is at:

http://veverka.sh.cvut.cz/~sykora/prj/rheavfs-20070523-1239.tar.gz

The kernel patch is in the tarball and for your viewing pleasure
I've attached it to this email.
The patch is against 2.6.20.1 and works with 2.6.21.1 too.
There are two minor failed hunks for 2.6.22-rc2 which I haven't had time to correct.

My project is not completed, there's almost no documentation etc.
Maybe I will put together some simple README/HOWTO in a few days.
I wouldn't present the project at this time, but seeing your post
I've thought my approach might be interesting for the discussion.


Jara

--
I find television very educating. Every time somebody turns on the set,
I go into the other room and read a book.


Attachments:
(No filename) (2.55 kB)
shdw-2.6.20.1.patch (32.74 kB)

2007-05-23 13:51:51

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 03:01:38PM +0200, Miklos Szeredi wrote:
> Someone might think of a way to make those work with directories.
> Invisible directory entries, anyone? </me ducks>

Not unless you manage to get working union-mount [*NOT* unionfs]

> > * invalidation on unlink is still an open problem.
> > * locking in final mntput() doesn't look nice; we probably need
> > a new refcounting scheme for vfsmounts to make that work. I have a variant
> > that might work here (and make life much easier for expiry logics in
> > automount/shared trees, which is what it had been initially proposed for),
>
> Which variant? We had that "detached subtrees" thing, is that it?

Umm... It is related to detached subtrees, but I'm not sure if it is what
you are thinking about.

Short version of the story: new counter (mnt_busy) that would be defined
in the following way: the number of external references (not due to the
vfsmount tree structure or from namespace to root) + the number of
children that have non-zero ->mnt_busy. And a per-vfsmount flag ("goner").

The rules for handling ->mnt_busy:
* duplicating external reference: increment m->mnt_busy
* getting from m to child: increment child->mnt_busy, if it went
from 0 to non-zero - increment m->mnt_busy as well (that's done under
vfsmount_lock, so we can safely check for zero here).
* getting from m to parent: increment parent->mnt_busy.
* dropping external reference: decrement m->mnt_busy; if it's still
non-zero, we are done. If it's zero, we are in for some work (and had
acquired vfsmount_lock by atomic_dec_and_lock()). Here's what we do:
* go through ancestors, decrementing ->mnt_busy, until we
hit the root or get to one with ->mnt_busy staying
non-zero.
* find the most remote ancestor that has zero ->mnt_busy
and is marked as goner (might be m itself).
* if no such beast exists, we are done.
* otherwise, detach the subtree rooted in that ancestor
from its parent (if any) and unhash its root (if hashed).
Now there are no external references to any vfsmount in that
subtree.
* now we can kill all vfsmounts in that subtree.
* detaching m from parent: nothing; we trade a busy child of parent
for new external reference to parent.
* lazy umount: in addition to detaching everything from parents
and dropping resulting external references to parents, mark everything
in the subtree as goners.
* normal umount: check ->mnt_busy *and* lack of children, detach,
mark as goner, drop resulting external reference to parent.
* fun new stuff - umount of intact subtree: detach the subtree from
parent, do *not* dissolve it, mark everything in subtree as goners. If
something we mark as goner is not busy, we can kill it and all its descendents.
The subtree will be shrinking as its pieces lose external references.
* check for expirability: "we hold an external reference to m and
m->mnt_busy is 1". No need to look into children, etc.
* your vfsmounts: simply mark them goners from the very beginning.
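
The counting rules above can be played through in a small userspace
model. This is a toy sketch, not kernel code: the class `Mount` and the
helpers `get`, `get_child`, `get_parent`, `put` and `kill_subtree` are
names invented here, there is no locking, and umount is left out. It
only shows how ->mnt_busy and the "goner" flag interact when the last
external reference is dropped.

```python
class Mount:
    def __init__(self, name, parent=None, goner=False):
        self.name = name
        self.parent = parent
        self.children = []
        self.busy = 0          # external refs + number of busy children
        self.goner = goner
        self.alive = True
        if parent is not None:
            parent.children.append(self)


def get(m):
    """Duplicate an external reference to m."""
    m.busy += 1


def get_child(m, child):
    """Holder of a reference to m takes a reference to its child."""
    child.busy += 1
    if child.busy == 1:        # child went from 0 to non-zero
        m.busy += 1


def get_parent(m):
    """Holder of a reference to m takes a reference to its parent."""
    m.parent.busy += 1
    return m.parent


def kill_subtree(root):
    for child in root.children:
        kill_subtree(child)
    root.alive = False


def put(m):
    """Drop an external reference to m."""
    m.busy -= 1
    if m.busy:
        return
    # m went idle: each ancestor loses a busy child, until one stays busy.
    victim = m if m.goner else None
    node = m
    while node.parent is not None:
        node = node.parent
        node.busy -= 1
        if node.busy:
            break
        if node.goner:
            victim = node      # most remote idle goner seen so far
    if victim is None:
        return
    # Detach the idle subtree from its parent and kill everything in it.
    if victim.parent is not None:
        victim.parent.children.remove(victim)
        victim.parent = None
    kill_subtree(victim)


root = Mount("root")
a = Mount("a", parent=root)
b = Mount("b", parent=a, goner=True)   # stands in for a dir-on-file mount

get(root)                  # e.g. somebody's cwd
get_child(root, a)
get_child(a, b)
put(a)
put(root)                  # now only b is referenced externally
before = (root.busy, a.busy, b.busy)
put(b)                     # last reference to the goner goes away
print(before, b.alive, a.alive)
```

While only b is held, a and root each stay at 1 purely because they have
one busy child, which is what keeps the chain above a referenced mount
pinned without any explicit reference on the ancestors.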

> > but it still doesn't kill the need to deal with invalidation. And
> > yes, NFS still needs it (and so do all network filesystems, really).
> > The question of caching is related to that.
>
> So what's so special about invalidation? Why not just treat
> dir-on-file mounts the same as any other ref on the dentry?

Because of the case of having something mounted in that subtree. The
current code doesn't even try to evict such stuff. NFS *does*, but
it's not in a position to do that decently (not NFS's fault, it's just that
we don't have the data needed for it).

Note that one problem we used to have back then is gone - namely, per-namespace
semaphores. It's a global semaphore now, so we *can* do cross-namespace
rogering of mount trees without that kind of locking horrors.

What we really need is "go through dentry subtree, try to evict everything
we can, for anything that has stuff mounted on it go through all such
vfsmounts and kick them and all their descendents out". That's what should
happen on invalidation. From generic code, so that NFS wouldn't have to
bother.

And _that_ is what we could call from ->unlink() on your inode - would take
care of submounts.
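
The generic invalidation walk described above might look roughly like
the following toy model. Everything here is hypothetical: `Dentry`,
`Vfsmount`, `kick_out` and `invalidate` are illustration-only names,
"in use" is reduced to a boolean, and no locking or namespaces are
modelled.

```python
class Vfsmount:
    def __init__(self, name):
        self.name = name
        self.children = []     # vfsmounts mounted somewhere inside this one
        self.mounted = True


class Dentry:
    def __init__(self, name, in_use=False):
        self.name = name
        self.children = []
        self.mounts = []       # every vfsmount mounted on this dentry
        self.in_use = in_use
        self.hashed = True


def kick_out(mnt):
    """Unmount mnt together with all of its descendants."""
    for child in mnt.children:
        kick_out(child)
    mnt.children.clear()
    mnt.mounted = False


def invalidate(dentry, evicted):
    """Walk the subtree; return True if `dentry` itself was evicted."""
    all_gone = True
    for child in dentry.children:
        if not invalidate(child, evicted):
            all_gone = False
    # Anything mounted here goes away, along with its submounts.
    for mnt in dentry.mounts:
        kick_out(mnt)
    dentry.mounts.clear()
    # Evict only what we can: not in use and no surviving children.
    if dentry.in_use or not all_gone:
        return False
    dentry.hashed = False
    evicted.append(dentry.name)
    return True


top = Dentry("dir")
tarball = Dentry("foo.tar.gz")
held = Dentry("open-file", in_use=True)
top.children = [tarball, held]

m = Vfsmount("dir-on-file")
sub = Vfsmount("submount")
m.children.append(sub)
tarball.mounts.append(m)

evicted = []
invalidate(top, evicted)
print(evicted, m.mounted, sub.mounted, top.hashed)
```

In this model the mounts on foo.tar.gz are kicked out regardless of
whether the dentry itself can be evicted, which matches the "kick them
and all their descendants out" part of the description.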

Note that I'm not all that happy about this scheme; we might make it work,
but I still want to see a good use scenario for that kind of stuff.
Invalidation logics is a separate story - it's simply needed for existing
stuff; that area sucks *badly*, regardless of adding these hybrid objects.

2007-05-23 13:54:42

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 03:23:54PM +0200, Ph. Marek wrote:
> How about some special node in eg. /proc (or a new filesystem)?
> Eg.
> /fileAsDir/etc/passwd/owner ...
> would work for all *files*. For directories we do not know whether we're still
> climbing the hierarchy or would like to see meta-data.

So we need to make *anything* done anywhere in the namespace to modify
the dentry tree on that fs. Could you spell "fuck, NO"?

2007-05-23 14:33:27

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > > * invalidation on unlink is still an open problem.
> > > * locking in final mntput() doesn't look nice; we probably need
> > > a new refcounting scheme for vfsmounts to make that work. I have a variant
> > > that might work here (and make life much easier for expiry logics in
> > > automount/shared trees, which is what it had been initially proposed for),
> >
> > Which variant? We had that "detached subtrees" thing, is that it?
>
> Umm... It is related to detached subtrees, but I'm not sure if it is what
> you are thinking about.

I was thinking of a similar one by Mike Waychison. It had the problem
of requiring a spinlock for mntget/mntput. It was also different in
that it did not gradually dissolve detached trees, but kept them as
whole blobs until the last ref went away.

> Short version of the story: new counter (mnt_busy) that would be defined
> in the following way: the number of external references (not due to the
> vfsmount tree structure or from namespace to root) + the number of
> children that have non-zero ->mnt_busy. And a per-vfsmount flag ("goner").
>
> The rules for handling ->mnt_busy:
> * duplicating external reference: increment m->mnt_busy
> * getting from m to child: increment child->mnt_busy, if it went
> from 0 to non-zero - increment m->mnt_busy as well (that's done under
> vfsmount_lock, so we can safely check for zero here).
> * getting from m to parent: increment parent->mnt_busy.
> * dropping external reference: decrement m->mnt_busy; if it's still
> non-zero, we are done. If it's zero, we are in for some work (and had
> acquired vfsmount_lock by atomic_dec_and_lock()). Here's what we do:
> * go through ancestors, decrementing ->mnt_busy, until we
> hit the root or get to one with ->mnt_busy staying
> non-zero.
> * find the most remote ancestor that has zero ->mnt_busy
> and is marked as goner (might be m itself).
> * if no such beast exists, we are done.
> * otherwise, detach the subtree rooted in that ancestor
> from its parent (if any) and unhash its root (if hashed).

How will this work with copy_tree() and namespace duplication, which
currently walk the tree with only namespace_sem held?

> Now there are no external references to any vfsmount in that
> subtree.
> * now we can kill all vfsmounts in that subtree.
> * detaching m from parent: nothing; we trade a busy child of parent
> for new external reference to parent.
> * lazy umount: in addition to detaching everything from parents
> and dropping resulting external references to parents, mark everything
> in the subtree as goners.
> * normal umount: check ->mnt_busy *and* lack of children, detach,
> mark as goner, drop resulting external reference to parent.
> * fun new stuff - umount of intact subtree: detach the subtree from
> parent, do *not* dissolve it, mark everything in subtree as goners. If
> something we mark as goner is not busy, we can kill it and all its descendents.
> The subtree will be shrinking as its pieces lose external references.
> * check for expirability: "we hold an external reference to m and
> m->mnt_busy is 1". No need to look into children, etc.
> * your vfsmounts: simply mark them goners from the very beginning.
>
> > > but it still doesn't kill the need to deal with invalidation. And
> > > yes, NFS still needs it (and so do all network filesystems, really).
> > > The question of caching is related to that.
> >
> > So what's so special about invalidation? Why not just treat
> > dir-on-file mounts the same as any other ref on the dentry?
>
> Because of the case of having something mounted in that subtree. The
> current code doesn't even try to evict such stuff. NFS *does*, but
> it's not in a position to do that decently (not NFS's fault, it's just that
> we don't have the data needed for it).
>
> Note that one problem we used to have back then is gone - namely, per-namespace
> semaphores. It's a global semaphore now, so we *can* do cross-namespace
> rogering of mount trees without that kind of locking horrors.
>
> What we really need is "go through dentry subtree, try to evict everything
> we can, for anything that has stuff mounted on it go through all such
> vfsmounts and kick them and all their descendents out". That's what should
> happen on invalidation. From generic code, so that NFS wouldn't have to
> bother.
>
> And _that_ is what we could call from ->unlink() on your inode - would take
> care of submounts.

OK, I'll digest this info.

Miklos

2007-05-23 15:06:19

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 04:32:37PM +0200, Miklos Szeredi wrote:
> > Umm... It is related to detached subtrees, but I'm not sure if it is what
> > you are thinking about.
>
> I was thinking of a similar one by Mike Waychison. It had the problem
> of requiring a spinlock for mntget/mntput. It was also different in
> that it did not gradually dissolve detached trees, but kept them as
> whole blobs until the last ref went away.

Here the spinlock is needed only when mnt_busy goes to 0, so presumably
it won't be a serious problem on more or less common setups; however,
it certainly would need serious profiling.

> How will this work with copy_tree() and namespace duplication, which
> currently walk the tree with only namespace_sem held?

Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump
mnt_busy on everything (by 1 + number of non-busy children). Then drop
vfsmount_lock and do as usual, dropping references in tree being copied
as you go. Nothing will get attached or detached due to namespace_sem,
nothing will get evicted by anybody other than you since you've got all
that stuff pinned down. End of story...
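
A toy model of that pinning step, to show where "1 + number of non-busy
children" comes from. The names `Mount` and `pin_subtree` are made up
for this sketch, and the two locks are only mentioned in comments; in
the kernel the walk would run with namespace_sem and vfsmount_lock held.

```python
class Mount:
    def __init__(self, name, busy=0):
        self.name = name
        self.children = []
        self.busy = busy       # external refs + number of busy children


def pin_subtree(root):
    """Take one external reference on every vfsmount in the subtree.

    (Kernel version: under namespace_sem and vfsmount_lock.)
    """
    # Count idle children *before* pinning them: each of them is about
    # to become busy, which adds one busy child to this mount.
    idle_children = sum(1 for c in root.children if c.busy == 0)
    for child in root.children:
        pin_subtree(child)
    # +1 for our own new external reference to this mount.
    root.busy += 1 + idle_children


# root has a busy child `a` (busy because of `b`) and an idle child `c`.
root = Mount("root", busy=1)
a = Mount("a", busy=1)
b = Mount("b", busy=2)         # two external references
c = Mount("c", busy=0)
root.children = [a, c]
a.children = [b]

pin_subtree(root)
print(root.busy, a.busy, b.busy, c.busy)
```

After the walk every mount in the subtree is busy, so the definition of
->mnt_busy (external references plus busy children) still holds for each
of them with exactly one extra external reference.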

2007-05-23 15:26:41

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > How will this work with copy_tree() and namespace duplication, which
> > currently walk the tree with only namespace_sem held?
>
> Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump
> mnt_busy on everything (by 1 + number of non-busy children). Then drop
> vfsmount_lock and do as usual, dropping references in tree being copied
> as you go. Nothing will get attached or detached due to namespace_sem,
> nothing will get evicted by anybody other than you since you've got all
> that stuff pinned down. End of story...

Right.

Do you have some code?

Should I try to code something up?

Miklos

2007-05-23 15:37:30

by Al Viro

Subject: Re: [RFC PATCH] file as directory

On Wed, May 23, 2007 at 05:25:49PM +0200, Miklos Szeredi wrote:
> > > How will this work with copy_tree() and namespace duplication, which
> > > currently walk the tree with only namespace_sem held?
> >
> > Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump
> > mnt_busy on everything (by 1 + number of non-busy children). Then drop
> > vfsmount_lock and do as usual, dropping references in tree being copied
> > as you go. Nothing will get attached or detached due to namespace_sem,
> > nothing will get evicted by anybody other than you since you've got all
> > that stuff pinned down. End of story...
>
> Right.
>
> Do you have some code?
>
> Should I try to code something up?

I hope to get some breathing space next week, then I'll get back to
VFS work. I'd rather do that one myself, since it'll be a long series
of equivalent transformations - debugging such rewrite of refcounting
done as a single patch is going to be hell. And yes, refcounting rewrite
is near the top of the list (another thing is wading through several
threads from hell and reviewing unionfs ;-/)

2007-05-23 15:56:32

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> I hope to get some breathing space next week, then I'll get back to
> VFS work.

Great.

> I'd rather do that one myself,

Sure, don't want to rob you of any fun stuff ;)

Miklos

2007-05-26 09:31:37

by Pavel Machek

Subject: Re: [RFC PATCH] file as directory

Hi!

> > As for unlink... How do you deal with having that thing
> > mounted, mounting something _under_ it (so that vfsmount would be kept
> > busy) and then unlinking that sucker?
>
> Yeah, that's a good point. Current patch doesn't deal with that.
> Simplest solution could be to disallow submounting these. Don't think
> it makes much sense anyway.

Hmmm, cd foo.tgz/bar/baz.tgz/xyzzy makes sense, and it is implemented
as a submount, no?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-05-28 14:44:51

by Miklos Szeredi

Subject: Re: [RFC PATCH] file as directory

> > > As for unlink... How do you deal with having that thing
> > > mounted, mounting something _under_ it (so that vfsmount would be kept
> > > busy) and then unlinking that sucker?
> >
> > Yeah, that's a good point. Current patch doesn't deal with that.
> > Simplest solution could be to disallow submounting these. Don't think
> > it makes much sense anyway.
>
> Hmmm, cd foo.tgz/bar/baz.tgz/xyzzy makes sense, and it is implemented
> as a submount, no?

Yes, that certainly makes sense, but it's the same "special" mount,
which goes away automatically, so there isn't any problem with
unlinking with any number of such submounts.

But I don't want to explicitly prohibit submounting by normal mounts
either, if it's not too hard to handle, and Al's new vfsmount
refcounting scheme should take care of the difficult part of that.

Miklos