Here's the current state of the unionmount patches.
Significant changes made include:
(1) Brought up to date with the current changes in the VFS (e.g. RCU pathwalk,
mount/vfsmount struct split).
(2) Moved the copy-up for namei metadata syscalls that modify an inode
(e.g. link(), utimes(), setxattr()) into pathwalk. These syscalls now
merely need to pass LOOKUP_COPY_UP to pathwalk and the copy-up is done
there.
One thing I'm not certain about: this (and the original patches) will do
the copy-up even if the caller doesn't have permission to alter the
inode, so the system call can fail its permission check after the
copy-up has already taken place.
Truncate and open-write don't suffer that issue.
(3) Merged the in-development ext2 patches in and completed them. At least I
think I did; one of them didn't say what it did.
(4) Added some code to override the credentials around upper inode creation
to make sure the inode gets the right UID/GID. This doesn't help if the
lower inode has some sort of foreign user identifier.
Also, I'm not sure whether the LSM xattrs should be blindly copied up.
Should the LSM policies applicable to the lower fs's apply to the upper
fs too?
(5) Added a marker flag for mounts on the lower fs. There is the possibility
of having a file mounted over a file on the upperfs. I suspect some of
the logic will malfunction in such a case as the previous component will
be seen to be on the upperfs and the current component not on a unioned
fs - and may trigger a copyup attempt.
(6) Added a patch to pass the mount flags to sget() and thence to the compare
routine. sget() installs them in the superblock before returning it.
(7) Added a patch to combine the multiple chown syscalls.
(8) Moved the xattr copyup code earlier so that it can be used for directories
too (it is called there, but isn't introduced till a later patch).
(9) Added a patch to have a second lock class for use by unionmounts for
i_mutex and i_dir_mutex to stop complaints when a union is made of two
filesystems of the same type.
Unionmount over a unionmount will need special handling, and possibly
rejecting. It should just work; however, if the two upper filesystems are
the same type, lockdep will incorrectly moan a lot.
Some issues:
(1) Need to handle automount points and managed directories. Probably simply
ignoring them is best.
(2) Need to better handle mountpoints. Currently it just calls
follow_mount(), which is probably wrong.
(3) do_lookup() needs to come up with the correct inode after
needs_lookup_union() is called. I think I have this right, but it could
do with checking.
(4) Should d_revalidate() be called on the lower fs objects under some
circumstances? I assume not, since we don't want to see the lower fs
changing.
David
---
David Howells (21):
fallthru: ext2 support for lookup of d_type/d_ino in fallthrus
ext2: Add whiteout and opaque directory support
ext2: Remove target inode pointer from ext2_add_entry()
union-mount: Implement union-aware truncate()
union-mount: Implement union-aware rename()
union-mount: Make various syscalls aware (link, chmod, chown, utimes & setxattr)
unionmount: Override creds when copying up a file to correctly set ownership
unionmount: Add LOOKUP_COPY_UP
union-mount: In-kernel file copyup routines
union-mount: Add wrapper for lookup_union_locked() and RCU hook
union-mount: Implement union mount
union-mount: Duplicate the i_{,dir_}mutex lock classes and use for upper layer
unionmount: Mark lower layers in union
union-mount: Add union_create_topmost_dir()
whiteout: Add vfs_whiteout() and whiteout inode operation
VFS: Split inode_permission()
VFS: Pass mount flags to sget()
VFS: Make lookup_hash() return a struct path
VFS: Comment mount following code
VFS: Make clone_mnt()/copy_tree()/collect_mounts() return errors
VFS: Make chown() and lchown() call fchownat()
Felix Fietkau (2):
jffs2: Add fallthru support
jffs2: Add whiteout support
Jan Blunck (6):
union-mount: Create IS_MNT_UNION()
union-mount: Free union stack on removal of topmost dentry from dcache
union-mount: Introduce MNT_UNION and MS_UNION flags
tmpfs: Add whiteout support
whiteout: Allow removal of a directory with whiteouts
whiteout/NFSD: Don't return information about whiteouts to userspace
Valerie Aurora (44):
fallthru: jffs2 support for lookup of d_type/d_ino in fallthrus
ext2: Add fallthru support
ext2: Split ext2_add_entry() from ext2_add_link()
ext2: Add ext2_dirent_in_use()
union-mount: Implement union-aware writable open()
union-mount: Implement union-aware access()/faccessat()
VFS: Create user_path_nd() to lookup both parent and target
fallthru: tmpfs support for lookup of d_type/d_ino in fallthrus
union-mount: Add generic_readdir_fallthru() helper
union-mount: Copy up directory entries on first readdir()
union-mount: Set opaque flag on new directories in unioned file systems
union-mount: Create whiteout on rmdir()
union-mount: Create whiteout on unlink()
union-mount: Call union lookup functions in lookup path
union-mount: Add lookup_union_locked()
union-mount: Follow mount in __lookup_union()
union-mount: Build union stack in __lookup_union()
union-mount: Return files found in lower layers in __lookup_union()
union-mount: Process negative dentries in __lookup_union()
union-mount: Basic infrastructure of __lookup_union()
union-mount: Temporarily disable some syscalls
union-mount: Prevent bind mounts of union mounts
union-mount: Prevent topmost file system from being mounted elsewhere
union-mount: Prevent improper union-related remounts
union-mount: Create prepare_mnt_union() and cleanup_mnt_union()
union-mount: Create build_root_union()
union-mount: Add clone_union_tree() and put_union_sb()
union-mount: Create check_topmost_union_mnt()
union-mount: Create needs_lookup_union()
union-mount: Create union_add_dir()
union-mount: Create d_free_unions()
union-mount: Add union_find_dir()
union-mount: Add union_alloc()
union-mount: Add two superblock fields for union mounts
union-mount: Create union_stack structure
union-mount: Add CONFIG_UNION_MOUNT option
union-mount: Union mounts documentation
tmpfs: Add fallthru support
VFS: Basic fallthru definitions
whiteout: Define flags and operations for opaque inodes
VFS: Add CL_MAKE_HARD_READONLY flag to clone_mnt()/copy_tree()
VFS: Add CL_NO_SLAVE flag to clone_mnt()/copy_tree()
VFS: Add CL_NO_SHARED flag to clone_mnt()/copy_tree()
VFS: Add hard read-only users count to superblock
Documentation/filesystems/union-mounts.txt | 712 ++++++++++++++++++++++
Documentation/filesystems/vfs.txt | 16
drivers/mtd/mtdsuper.c | 4
fs/9p/vfs_super.c | 4
fs/Kconfig | 12
fs/Makefile | 1
fs/afs/super.c | 3
fs/btrfs/super.c | 4
fs/ceph/super.c | 2
fs/cifs/cifsfs.c | 9
fs/compat.c | 9
fs/dcache.c | 28 +
fs/devpts/inode.c | 6
fs/ecryptfs/main.c | 3
fs/ext2/dir.c | 181 +++++-
fs/ext2/ext2.h | 3
fs/ext2/inode.c | 11
fs/ext2/namei.c | 73 ++
fs/ext2/super.c | 6
fs/gfs2/ops_fstype.c | 5
fs/inode.c | 48 +
fs/internal.h | 5
fs/jffs2/dir.c | 117 +++-
fs/jffs2/fs.c | 4
fs/jffs2/super.c | 2
fs/libfs.c | 25 +
fs/logfs/super.c | 3
fs/namei.c | 906 +++++++++++++++++++++++++---
fs/namespace.c | 391 ++++++++++--
fs/nfs/super.c | 10
fs/nfsd/nfs3xdr.c | 5
fs/nfsd/nfs4xdr.c | 5
fs/nfsd/nfsxdr.c | 4
fs/nilfs2/super.c | 4
fs/open.c | 131 +++-
fs/pnode.c | 5
fs/pnode.h | 4
fs/proc/root.c | 3
fs/proc_namespace.c | 1
fs/readdir.c | 18 +
fs/reiserfs/procfs.c | 2
fs/super.c | 42 +
fs/sysfs/mount.c | 3
fs/ubifs/super.c | 3
fs/union.c | 721 ++++++++++++++++++++++
fs/union.h | 189 ++++++
fs/utimes.c | 2
fs/xattr.c | 10
include/linux/dcache.h | 40 +
include/linux/ext2_fs.h | 8
include/linux/fs.h | 47 +
include/linux/jffs2.h | 8
include/linux/mount.h | 5
include/linux/namei.h | 3
kernel/audit_tree.c | 10
kernel/cgroup.c | 2
mm/shmem.c | 192 ++++++
57 files changed, 3750 insertions(+), 320 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt
create mode 100644 fs/union.c
create mode 100644 fs/union.h
copy_tree() can theoretically fail for reasons other than ENOMEM, but it
always returns NULL on failure, which callers interpret as -ENOMEM. Change it
to return an explicit error.
Also change clone_mnt() for consistency and because union mounts will add new
error cases.
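As a rough userspace illustration of the convention the callers move to,
here is a minimal sketch. The kernel's real helpers are ERR_PTR(), IS_ERR()
and PTR_ERR(); every model_* name below is a hypothetical stand-in written
for this example and is not kernel code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace model of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers. */
#define MODEL_MAX_ERRNO 4095

static inline void *model_err_ptr(long err)
{
	/* Encode a negative errno in the top page of the address space. */
	return (void *)(intptr_t)err;
}

static inline int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

static inline long model_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Hypothetical, heavily simplified stand-in for struct mount. */
struct model_mount {
	int unbindable;
};

/*
 * Hypothetical stand-in for copy_tree(): each failure reports its own
 * errno rather than a bare NULL that the caller must guess about.
 */
static struct model_mount *model_copy_tree(const struct model_mount *old)
{
	struct model_mount *new;

	if (old->unbindable)
		return model_err_ptr(-EINVAL);

	new = malloc(sizeof(*new));
	if (!new)
		return model_err_ptr(-ENOMEM);

	*new = *old;
	return new;
}
```

A caller checks model_is_err() on the result and propagates
model_ptr_err() unchanged, so -EINVAL is no longer misreported as -ENOMEM.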
Thanks to Andreas Gruenbacher <[email protected]> for a bug fix.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Valerie Aurora <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
---
fs/namespace.c | 116 +++++++++++++++++++++++++++------------------------
fs/pnode.c | 5 +-
kernel/audit_tree.c | 10 ++--
3 files changed, 70 insertions(+), 61 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index e608199..baedd0b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -725,56 +725,60 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
int flag)
{
struct super_block *sb = old->mnt.mnt_sb;
- struct mount *mnt = alloc_vfsmnt(old->mnt_devname);
+ struct mount *mnt;
+ int err;
- if (mnt) {
- if (flag & (CL_SLAVE | CL_PRIVATE))
- mnt->mnt_group_id = 0; /* not a peer of original */
- else
- mnt->mnt_group_id = old->mnt_group_id;
-
- if ((flag & CL_MAKE_SHARED) && !mnt->mnt_group_id) {
- int err = mnt_alloc_group_id(mnt);
- if (err)
- goto out_free;
- }
+ mnt = alloc_vfsmnt(old->mnt_devname);
+ if (!mnt)
+ return ERR_PTR(-ENOMEM);
- mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~MNT_WRITE_HOLD;
- atomic_inc(&sb->s_active);
- mnt->mnt.mnt_sb = sb;
- mnt->mnt.mnt_root = dget(root);
- mnt->mnt_mountpoint = mnt->mnt.mnt_root;
- mnt->mnt_parent = mnt;
- br_write_lock(vfsmount_lock);
- list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
- br_write_unlock(vfsmount_lock);
+ if (flag & (CL_SLAVE | CL_PRIVATE))
+ mnt->mnt_group_id = 0; /* not a peer of original */
+ else
+ mnt->mnt_group_id = old->mnt_group_id;
- if (flag & CL_SLAVE) {
- list_add(&mnt->mnt_slave, &old->mnt_slave_list);
- mnt->mnt_master = old;
- CLEAR_MNT_SHARED(mnt);
- } else if (!(flag & CL_PRIVATE)) {
- if ((flag & CL_MAKE_SHARED) || IS_MNT_SHARED(old))
- list_add(&mnt->mnt_share, &old->mnt_share);
- if (IS_MNT_SLAVE(old))
- list_add(&mnt->mnt_slave, &old->mnt_slave);
- mnt->mnt_master = old->mnt_master;
- }
- if (flag & CL_MAKE_SHARED)
- set_mnt_shared(mnt);
-
- /* stick the duplicate mount on the same expiry list
- * as the original if that was on one */
- if (flag & CL_EXPIRE) {
- if (!list_empty(&old->mnt_expire))
- list_add(&mnt->mnt_expire, &old->mnt_expire);
- }
+ if ((flag & CL_MAKE_SHARED) && !mnt->mnt_group_id) {
+ err = mnt_alloc_group_id(mnt);
+ if (err)
+ goto out_free;
}
+
+ mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~MNT_WRITE_HOLD;
+ atomic_inc(&sb->s_active);
+ mnt->mnt.mnt_sb = sb;
+ mnt->mnt.mnt_root = dget(root);
+ mnt->mnt_mountpoint = mnt->mnt.mnt_root;
+ mnt->mnt_parent = mnt;
+ br_write_lock(vfsmount_lock);
+ list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
+ br_write_unlock(vfsmount_lock);
+
+ if (flag & CL_SLAVE) {
+ list_add(&mnt->mnt_slave, &old->mnt_slave_list);
+ mnt->mnt_master = old;
+ CLEAR_MNT_SHARED(mnt);
+ } else if (!(flag & CL_PRIVATE)) {
+ if ((flag & CL_MAKE_SHARED) || IS_MNT_SHARED(old))
+ list_add(&mnt->mnt_share, &old->mnt_share);
+ if (IS_MNT_SLAVE(old))
+ list_add(&mnt->mnt_slave, &old->mnt_slave);
+ mnt->mnt_master = old->mnt_master;
+ }
+ if (flag & CL_MAKE_SHARED)
+ set_mnt_shared(mnt);
+
+ /* stick the duplicate mount on the same expiry list as the
+ * original if that was on one */
+ if (flag & CL_EXPIRE) {
+ if (!list_empty(&old->mnt_expire))
+ list_add(&mnt->mnt_expire, &old->mnt_expire);
+ }
+
return mnt;
out_free:
free_vfsmnt(mnt);
- return NULL;
+ return ERR_PTR(err);
}
static inline void mntfree(struct mount *mnt)
@@ -1258,11 +1262,12 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
struct path path;
if (!(flag & CL_COPY_ALL) && IS_MNT_UNBINDABLE(mnt))
- return NULL;
+ return ERR_PTR(-EINVAL);
res = q = clone_mnt(mnt, dentry, flag);
- if (!q)
- goto Enomem;
+ if (IS_ERR(q))
+ return q;
+
q->mnt_mountpoint = mnt->mnt_mountpoint;
p = mnt;
@@ -1284,8 +1289,8 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
path.mnt = &q->mnt;
path.dentry = p->mnt_mountpoint;
q = clone_mnt(p, p->mnt.mnt_root, flag);
- if (!q)
- goto Enomem;
+ if (IS_ERR(q))
+ goto out;
br_write_lock(vfsmount_lock);
list_add_tail(&q->mnt_list, &res->mnt_list);
attach_mnt(q, &path);
@@ -1293,7 +1298,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
}
}
return res;
-Enomem:
+out:
if (res) {
LIST_HEAD(umount_list);
br_write_lock(vfsmount_lock);
@@ -1301,9 +1306,11 @@ Enomem:
br_write_unlock(vfsmount_lock);
release_mounts(&umount_list);
}
- return NULL;
+ return q;
}
+/* Caller should check returned pointer for errors */
+
struct vfsmount *collect_mounts(struct path *path)
{
struct mount *tree;
@@ -1606,14 +1613,15 @@ static int do_loopback(struct path *path, char *old_name,
if (!check_mnt(real_mount(path->mnt)) || !check_mnt(old))
goto out2;
- err = -ENOMEM;
if (recurse)
mnt = copy_tree(old, old_path.dentry, 0);
else
mnt = clone_mnt(old, old_path.dentry, 0);
- if (!mnt)
- goto out2;
+ if (IS_ERR(mnt)) {
+ err = PTR_ERR(mnt);
+ goto out2;
+ }
err = graft_tree(mnt, path);
if (err) {
@@ -2244,10 +2252,10 @@ static struct mnt_namespace *dup_mnt_ns(struct mnt_namespace *mnt_ns,
down_write(&namespace_sem);
/* First pass: copy the tree topology */
new = copy_tree(old, old->mnt.mnt_root, CL_COPY_ALL | CL_EXPIRE);
- if (!new) {
+ if (IS_ERR(new)) {
up_write(&namespace_sem);
kfree(new_ns);
- return ERR_PTR(-ENOMEM);
+ return ERR_CAST(new);
}
new_ns->root = new;
br_write_lock(vfsmount_lock);
diff --git a/fs/pnode.c b/fs/pnode.c
index ab5fa9e..b79d27c 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -237,8 +237,9 @@ int propagate_mnt(struct mount *dest_mnt, struct dentry *dest_dentry,
source = get_source(m, prev_dest_mnt, prev_src_mnt, &type);
- if (!(child = copy_tree(source, source->mnt.mnt_root, type))) {
- ret = -ENOMEM;
+ child = copy_tree(source, source->mnt.mnt_root, type);
+ if (IS_ERR(child)) {
+ ret = PTR_ERR(child);
list_splice(tree_list, tmp_list.prev);
goto out;
}
diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
index 5bf0790..3a5ca58 100644
--- a/kernel/audit_tree.c
+++ b/kernel/audit_tree.c
@@ -595,7 +595,7 @@ void audit_trim_trees(void)
root_mnt = collect_mounts(&path);
path_put(&path);
- if (!root_mnt)
+ if (IS_ERR(root_mnt))
goto skip_it;
spin_lock(&hash_lock);
@@ -669,8 +669,8 @@ int audit_add_tree_rule(struct audit_krule *rule)
goto Err;
mnt = collect_mounts(&path);
path_put(&path);
- if (!mnt) {
- err = -ENOMEM;
+ if (IS_ERR(mnt)) {
+ err = PTR_ERR(mnt);
goto Err;
}
@@ -719,8 +719,8 @@ int audit_tag_tree(char *old, char *new)
return err;
tagged = collect_mounts(&path2);
path_put(&path2);
- if (!tagged)
- return -ENOMEM;
+ if (IS_ERR(tagged))
+ return PTR_ERR(tagged);
err = kern_path(old, 0, &path1);
if (err) {
From: Valerie Aurora <[email protected]>
While we can check if a file system is currently read-only, we can't
guarantee that it will stay read-only. The file system can be mounted
or remounted read-write at any time. This is a problem for union
mounts, which require the underlying file system be read-only for the
entire duration of the union mount.
Add a hard read-only users count to the superblock. When this count
is non-zero, don't allow any read-write mounts of this super, or any
read-write remounts of existing mounts.
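The rule can be sketched in userspace as follows. This is only a model
under stated assumptions: struct model_super and model_remount() are
hypothetical names, and only the counter check mirrors what the patch adds
to do_remount_sb() and mount_fs().

```c
#include <errno.h>

#define MODEL_MS_RDONLY 1	/* same value as the kernel's MS_RDONLY */

/* Hypothetical, heavily simplified stand-in for struct super_block. */
struct model_super {
	int flags;			/* current mount flags */
	int hard_readonly_users;	/* models s_hard_readonly_users */
};

/*
 * Refuse any transition to read-write while at least one user requires
 * the superblock to stay read-only.  Returns 0 or -EROFS.
 */
static int model_remount(struct model_super *sb, int flags)
{
	if (!(flags & MODEL_MS_RDONLY) && sb->hard_readonly_users)
		return -EROFS;

	sb->flags = flags;
	return 0;
}
```

Read-only remounts always pass the check; read-write remounts succeed
again only once the count has dropped back to zero.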
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/super.c | 11 +++++++++++
include/linux/fs.h | 6 ++++++
2 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index f343eda..732e19b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -202,6 +202,7 @@ static inline void destroy_super(struct super_block *s)
#ifdef CONFIG_SMP
free_percpu(s->s_files);
#endif
+ BUG_ON(s->s_hard_readonly_users);
security_sb_free(s);
WARN_ON(!list_empty(&s->s_mounts));
kfree(s->s_subtype);
@@ -758,6 +759,9 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
}
}
+ if (!(flags & MS_RDONLY) && sb->s_hard_readonly_users)
+ return -EROFS;
+
if (sb->s_op->remount_fs) {
retval = sb->s_op->remount_fs(sb, &flags, data);
if (retval) {
@@ -1150,9 +1154,16 @@ mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
WARN((sb->s_maxbytes < 0), "%s set sb->s_maxbytes to "
"negative value (%lld)\n", type->name, sb->s_maxbytes);
+ if (!(flags & MS_RDONLY) && sb->s_hard_readonly_users)
+ goto out_sb_is_hard_ro;
+
up_write(&sb->s_umount);
free_secdata(secdata);
return root;
+
+out_sb_is_hard_ro:
+ up_write(&sb->s_umount);
+ error = -EROFS;
out_sb:
dput(root);
deactivate_locked_super(sb);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d851be9..b8276c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1496,6 +1496,12 @@ struct super_block {
/* Being remounted read-only */
int s_readonly_remount;
+
+ /* Number of mounts requiring that the underlying file system never
+ * transition to read-write. Protected by s_umount. Decremented by
+ * free_vfsmnt() if MNT_HARD_READONLY is set.
+ */
+ int s_hard_readonly_users;
};
/* superblock cache pruning functions */
From: Jan Blunck <[email protected]>
do_whiteout() allows removal of a directory when it has whiteouts but
is logically empty.
XXX - This patch abuses readdir() to check if the union directory is
logically empty - that is, all the entries are whiteouts (or "." or
".."). Currently, we have no clean VFS interface to ask the lower
file system if a directory is empty.
Possible fixes:
- Add ->is_directory_empty() op
- Add is_directory_empty flag to dentry (ugly dcache populate)
- Ask underlying fs to remove it and look for an error return
- (your idea here)
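For reference, the "logically empty" test itself can be modelled in
userspace as below. This is a sketch only: struct model_dirent and
model_dir_is_logically_empty() are hypothetical names, and the entries are
supplied as an array rather than through a readdir callback.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_DT_DIR 4		/* same values as DT_DIR, DT_REG, DT_WHT */
#define MODEL_DT_REG 8
#define MODEL_DT_WHT 14

/* Hypothetical stand-in for one directory entry as seen by filldir. */
struct model_dirent {
	const char *name;
	unsigned char d_type;
};

/*
 * A directory is logically empty if every entry is ".", ".." or a
 * whiteout.  Returns 1 if logically empty, 0 otherwise.
 */
static int model_dir_is_logically_empty(const struct model_dirent *ents,
					size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!strcmp(ents[i].name, ".") || !strcmp(ents[i].name, ".."))
			continue;
		if (ents[i].d_type == MODEL_DT_WHT)
			continue;
		return 0;	/* a real entry: no point scanning further */
	}
	return 1;
}
```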
Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 85 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 3d396fd..991a32c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -33,6 +33,7 @@
#include <linux/device_cgroup.h>
#include <linux/fs_struct.h>
#include <linux/posix_acl.h>
+#include <linux/init_task.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -2735,6 +2736,90 @@ error_unlock:
}
/*
+ * XXX - We are abusing readdir to check if a union directory is
+ * logically empty.
+ */
+static int filldir_is_empty(void *__buf, const char *name, int namelen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ int *is_empty = __buf;
+
+ switch (namelen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ if (d_type == DT_WHT)
+ return 0;
+
+ *is_empty = 0;
+ return 1; /* no point scanning further */
+}
+
+static int directory_is_empty(struct path *path)
+{
+ struct file *file;
+ int err;
+ int is_empty = 1;
+
+ BUG_ON(!S_ISDIR(path->dentry->d_inode->i_mode));
+
+ /* references for the file pointer */
+ path_get(path);
+
+ file = dentry_open(path->dentry, path->mnt, O_RDONLY, &init_cred);
+ if (IS_ERR(file))
+ return 0;
+
+ err = vfs_readdir(file, filldir_is_empty, &is_empty);
+
+ fput(file);
+ return is_empty;
+}
+
+static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
+{
+ struct path safe = nd->path;
+ struct dentry *dentry = path->dentry;
+ int err;
+
+ path_get(&safe);
+
+ err = may_delete(nd->path.dentry->d_inode, dentry, isdir);
+ if (err)
+ goto out;
+
+ err = -ENOTEMPTY;
+ if (isdir && !directory_is_empty(path))
+ goto out;
+
+ if (nd->path.dentry != dentry->d_parent) {
+ dentry = __lookup_hash(&path->dentry->d_name, nd->path.dentry,
+ nd);
+ err = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out;
+
+ dput(path->dentry);
+ if (path->mnt != safe.mnt)
+ mntput(path->mnt);
+ path->mnt = nd->path.mnt;
+ path->dentry = dentry;
+ }
+
+ err = vfs_whiteout(nd->path.dentry->d_inode, dentry, isdir);
+
+out:
+ path_put(&safe);
+ return err;
+}
+
+/*
* The dentry_unhash() helper will try to drop the dentry early: we
* should have a usage count of 2 if we're the only user of this
* dentry, and if that is true (possibly after pruning the dcache),
From: Felix Fietkau <[email protected]>
Add support for whiteout dentries to jffs2.
XXX - David Woodhouse suggests several changes and provides an untested patch.
See:
http://patchwork.ozlabs.org/patch/50466/
XXX - Backward compatibility? Creating a whiteout on a JFFS2 file system can
only happen if it is deliberately mounted "-o union", so there is some way to
prevent creation of whiteouts on a file system you want to later mount with an
earlier (no support for whiteout) file system. However, ext2/3 has much more
robust methods (explicit fs feature flag) to prevent such an occurrence.
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: [email protected]
---
fs/jffs2/dir.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/jffs2/fs.c | 4 +++
fs/jffs2/super.c | 2 +
include/linux/jffs2.h | 2 +
4 files changed, 77 insertions(+), 3 deletions(-)
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 973ac58..fe7468d 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -35,6 +35,8 @@ static int jffs2_mknod (struct inode *,struct dentry *,umode_t,dev_t);
static int jffs2_rename (struct inode *, struct dentry *,
struct inode *, struct dentry *);
+static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+
const struct file_operations jffs2_dir_operations =
{
.read = generic_read_dir,
@@ -57,6 +59,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.mknod = jffs2_mknod,
.rename = jffs2_rename,
.get_acl = jffs2_get_acl,
+ .whiteout = jffs2_whiteout,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
@@ -97,8 +100,14 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
fd = fd_list;
}
}
- if (fd)
- ino = fd->ino;
+ if (fd) {
+ spin_lock(&target->d_lock);
+ if (fd->type == DT_WHT)
+ target->d_flags |= DCACHE_WHITEOUT;
+ else
+ ino = fd->ino;
+ spin_unlock(&target->d_lock);
+ }
mutex_unlock(&dir_f->sem);
if (ino) {
inode = jffs2_iget(dir_i->i_sb, ino);
@@ -491,6 +500,11 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, umode_t mode
return PTR_ERR(inode);
}
+ if (dentry->d_flags & DCACHE_WHITEOUT) {
+ inode->i_flags |= S_OPAQUE;
+ ri->flags = cpu_to_je16(JFFS2_INO_FLAG_OPAQUE);
+ }
+
inode->i_op = &jffs2_dir_inode_operations;
inode->i_fop = &jffs2_dir_operations;
@@ -769,6 +783,60 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, umode_t mode
return ret;
}
+static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
+ struct dentry *new_dentry)
+{
+ struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
+ struct jffs2_inode_info *victim_f = NULL;
+ uint32_t now;
+ int ret;
+
+ /* If it's a directory, then check whether it is really empty */
+ if (new_dentry->d_inode) {
+ victim_f = JFFS2_INODE_INFO(old_dentry->d_inode);
+ if (S_ISDIR(old_dentry->d_inode->i_mode)) {
+ struct jffs2_full_dirent *fd;
+
+ mutex_lock(&victim_f->sem);
+ for (fd = victim_f->dents; fd; fd = fd->next) {
+ if (fd->ino) {
+ mutex_unlock(&victim_f->sem);
+ return -ENOTEMPTY;
+ }
+ }
+ mutex_unlock(&victim_f->sem);
+ }
+ }
+
+ now = get_seconds();
+ ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, DT_WHT,
+ new_dentry->d_name.name, new_dentry->d_name.len, now);
+ if (ret)
+ return ret;
+
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ d_add(new_dentry, NULL);
+
+ if (victim_f) {
+ /* There was a victim. Kill it off nicely */
+ drop_nlink(old_dentry->d_inode);
+ /* Don't oops if the victim was a dirent pointing to an
+ inode which didn't exist. */
+ if (victim_f->inocache) {
+ mutex_lock(&victim_f->sem);
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ victim_f->inocache->pino_nlink = 0;
+ else
+ victim_f->inocache->pino_nlink--;
+ mutex_unlock(&victim_f->sem);
+ }
+ }
+
+ return 0;
+}
+
static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
struct inode *new_dir_i, struct dentry *new_dentry)
{
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 2e01238..b286ce5 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -303,6 +303,10 @@ struct inode *jffs2_iget(struct super_block *sb, unsigned long ino)
inode->i_op = &jffs2_dir_inode_operations;
inode->i_fop = &jffs2_dir_operations;
+
+ if (je16_to_cpu(latest_node.flags) & JFFS2_INO_FLAG_OPAQUE)
+ inode->i_flags |= S_OPAQUE;
+
break;
}
case S_IFREG:
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index f2d96b5..9fb059a 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -295,7 +295,7 @@ static int jffs2_fill_super(struct super_block *sb, void *data, int silent)
sb->s_op = &jffs2_super_operations;
sb->s_export_op = &jffs2_export_ops;
- sb->s_flags = sb->s_flags | MS_NOATIME;
+ sb->s_flags = sb->s_flags | MS_NOATIME | MS_WHITEOUT;
sb->s_xattr = jffs2_xattr_handlers;
#ifdef CONFIG_JFFS2_FS_POSIX_ACL
sb->s_flags |= MS_POSIXACL;
diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
index a18b719..6404e01 100644
--- a/include/linux/jffs2.h
+++ b/include/linux/jffs2.h
@@ -88,6 +88,8 @@
#define JFFS2_INO_FLAG_USERCOMPR 2 /* User has requested a specific
compression type */
+#define JFFS2_INO_FLAG_OPAQUE 4 /* Directory is opaque (for union mounts) */
+
/* These can go once we've made sure we've caught all uses without
byteswapping */
From: Valerie Aurora <[email protected]>
Whiteout an unlinked directory entry in a union mounted file system.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 9 ++++-----
1 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 586913f..ce941ac 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3333,11 +3333,6 @@ static long do_unlinkat(int dfd, const char __user *pathname)
if (nd.last_type != LAST_NORM)
goto exit1;
- /* unlink() on union mounts not implemented yet */
- error = -EINVAL;
- if (IS_DIR_UNIONED(nd.path.dentry))
- goto exit1;
-
nd.flags &= ~LOOKUP_PARENT;
mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
@@ -3356,6 +3351,10 @@ static long do_unlinkat(int dfd, const char __user *pathname)
error = security_path_unlink(&nd.path, path.dentry);
if (error)
goto exit3;
+ if (IS_DIR_UNIONED(nd.path.dentry)) {
+ error = do_whiteout(&nd, &path, 0);
+ goto exit3;
+ }
error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
exit3:
mnt_drop_write(nd.path.mnt);
From: Valerie Aurora <[email protected]>
Whiteouts end a union lookup. So do opaque directories, unless a
specific fallthru entry exists for this name.
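A simplified userspace model of these rules might look like the sketch
below. It assumes each layer reports one of four states for the name plus
an opaque flag for the containing directory; enum model_entry, struct
model_layer and model_union_lookup() are hypothetical names, and the real
__lookup_union() works on dentries and the union stack instead.

```c
/* What one layer holds for the name being looked up. */
enum model_entry {
	MODEL_NEGATIVE,		/* plain negative dentry */
	MODEL_WHITEOUT,		/* name explicitly deleted in this layer */
	MODEL_FALLTHRU,		/* name explicitly passed to lower layers */
	MODEL_POSITIVE,		/* name exists in this layer */
};

/* Hypothetical per-layer lookup result. */
struct model_layer {
	enum model_entry entry;
	int dir_opaque;		/* is the containing directory opaque? */
};

/*
 * Walk the layers from topmost (index 0) downwards.  Returns the index
 * of the layer that supplies the name, or -1 if it is absent or covered.
 */
static int model_union_lookup(const struct model_layer *layers, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		const struct model_layer *l = &layers[i];

		if (l->entry == MODEL_POSITIVE)
			return i;
		if (l->entry == MODEL_WHITEOUT)
			return -1;	/* whiteout covers everything below */
		if (l->dir_opaque && l->entry != MODEL_FALLTHRU)
			return -1;	/* opaque dir hides lower layers */
		/* plain negative or fallthru: try the next layer down */
	}
	return -1;
}
```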
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 8caed86..009d9b5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1163,11 +1163,31 @@ static int __lookup_union(struct nameidata *nd, struct qstr *name,
err = PTR_ERR(lower.dentry);
goto out_err;
}
- /* XXX - do nothing, lookup rule processing in later patches */
+
+ /* A negative dentry can mean several things: a plain negative
+ * dentry is ignored and lookup continues to the next layer,
+ * but a whiteout or a non-fallthru in an opaque dir covers
+ * everything below it.
+ */
+ if (!lower.dentry->d_inode) {
+ if (d_is_whiteout(lower.dentry))
+ goto out_lookup_done;
+ if (IS_OPAQUE(nd->path.dentry->d_inode) &&
+ !d_is_fallthru(lower.dentry))
+ goto out_lookup_done;
+ path_put(&lower);
+ continue;
+ }
+
+ /* XXX - do nothing, more in later patches */
path_put(&lower);
}
return 0;
+out_lookup_done:
+ path_put(&lower);
+ return 0;
+
out_err:
d_free_unions(topmost->dentry);
path_put(&lower);
From: Jan Blunck <[email protected]>
Userspace isn't ready for handling another file type, so silently drop
whiteout directory entries before they leave the kernel.
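The filtering pattern applied to each filldir-style callback can be
sketched in userspace as below. The names struct model_buf and
model_filldir() are hypothetical and the fixed-size buffer is an assumption
made for the example; the real callbacks copy entries to a user buffer.

```c
#include <stddef.h>

#define MODEL_DT_REG 8		/* same values as DT_REG and DT_WHT */
#define MODEL_DT_WHT 14
#define MODEL_BUF_SLOTS 8

/* Hypothetical stand-in for the buffer handed back to userspace. */
struct model_buf {
	const char *names[MODEL_BUF_SLOTS];
	size_t count;
};

/*
 * filldir-style callback: returns 0 to keep iterating, negative to stop.
 * Whiteouts are skipped silently, so the walk continues but the entry
 * never reaches the buffer.
 */
static int model_filldir(void *ctx, const char *name, unsigned int d_type)
{
	struct model_buf *buf = ctx;

	if (d_type == MODEL_DT_WHT)
		return 0;

	if (buf->count == MODEL_BUF_SLOTS)
		return -1;	/* buffer full */

	buf->names[buf->count++] = name;
	return 0;
}
```

Returning 0 rather than an error for a whiteout matters here: an error
would end the directory walk early instead of just hiding one entry.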
Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Acked-by: J. Bruce Fields <[email protected]>
Cc: [email protected]
Cc: Neil Brown <[email protected]>
---
fs/compat.c | 9 +++++++++
fs/nfsd/nfs3xdr.c | 5 +++++
fs/nfsd/nfs4xdr.c | 5 +++++
fs/nfsd/nfsxdr.c | 4 ++++
fs/readdir.c | 9 +++++++++
5 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 07880ba..9ed826e 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -842,6 +842,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
struct compat_old_linux_dirent __user *dirent;
compat_ulong_t d_ino;
+ if (d_type == DT_WHT)
+ return 0;
+
if (buf->result)
return -EINVAL;
d_ino = ino;
@@ -914,6 +917,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
int reclen = ALIGN(offsetof(struct compat_linux_dirent, d_name) +
namlen + 2, sizeof(compat_long_t));
+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -1003,6 +1009,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
sizeof(u64));
u64 off;
+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 08c6e36..69a60b6 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -887,6 +887,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
int elen; /* estimated entry length in words */
int num_entry_words = 0; /* actual number of words */
+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
+
if (cd->offset) {
u64 offset64 = offset;
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 0ec5a1b..6124f78 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -2553,6 +2553,11 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
return 0;
}
+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
+
if (cd->offset)
xdr_encode_hyper(cd->offset, (u64) offset);
diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
index 65ec595..2af74b2 100644
--- a/fs/nfsd/nfsxdr.c
+++ b/fs/nfsd/nfsxdr.c
@@ -503,6 +503,10 @@ nfssvc_encode_entry(void *ccdv, const char *name,
namlen, name, offset, ino);
*/
+ if (d_type == DT_WHT) {
+ cd->common.err = nfs_ok;
+ return 0;
+ }
if (offset > ~((u32) 0)) {
cd->common.err = nfserr_fbig;
return -EINVAL;
diff --git a/fs/readdir.c b/fs/readdir.c
index 356f715..de703d6 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -77,6 +77,9 @@ static int fillonedir(void * __buf, const char * name, int namlen, loff_t offset
struct old_linux_dirent __user * dirent;
unsigned long d_ino;
+ if (d_type == DT_WHT)
+ return 0;
+
if (buf->result)
return -EINVAL;
d_ino = ino;
@@ -155,6 +158,9 @@ static int filldir(void * __buf, const char * name, int namlen, loff_t offset,
int reclen = ALIGN(offsetof(struct linux_dirent, d_name) + namlen + 2,
sizeof(long));
+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
@@ -241,6 +247,9 @@ static int filldir64(void * __buf, const char * name, int namlen, loff_t offset,
int reclen = ALIGN(offsetof(struct linux_dirent64, d_name) + namlen + 1,
sizeof(u64));
+ if (d_type == DT_WHT)
+ return 0;
+
buf->error = -EINVAL; /* only used if we fail.. */
if (reclen > buf->count)
return -EINVAL;
From: Valerie Aurora <[email protected]>
Allow future code to use the guts of ext2_add_link().
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]
---
fs/ext2/dir.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 89015f1..d8382dc 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -489,10 +489,7 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
mark_inode_dirty(dir);
}
-/*
- * Parent is locked.
- */
-int ext2_add_link (struct dentry *dentry, struct inode *inode)
+int ext2_add_entry(struct dentry *dentry, struct inode *inode)
{
struct inode *dir = dentry->d_parent->d_inode;
const char *name = dentry->d_name.name;
@@ -588,6 +585,11 @@ out_unlock:
goto out_put;
}
+int ext2_add_link(struct dentry *dentry, struct inode *inode)
+{
+ return ext2_add_entry(dentry, inode);
+}
+
/*
* ext2_delete_entry deletes a directory entry by merging it with the
* previous entry. Page is up-to-date. Releases the page.
Add support for whiteouts and opaque directories to ext2.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Further development)
cc: [email protected]
---
fs/ext2/dir.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext2/ext2.h | 2 ++
fs/ext2/inode.c | 11 ++++++--
fs/ext2/namei.c | 51 +++++++++++++++++++++++++++++++++++++--
fs/ext2/super.c | 4 +++
include/linux/ext2_fs.h | 4 +++
6 files changed, 128 insertions(+), 6 deletions(-)
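
The name-collision rule that this patch adds to ext2_add_entry() can be sketched as a small user-space model. This is an illustration only: `enum ft_type` and `may_replace()` are hypothetical stand-ins rather than ext2 definitions, and the model covers just the whiteout/non-whiteout check visible in the hunk below.

```c
#include <errno.h>

/* Hypothetical stand-ins for directory entry file types. */
enum ft_type { FT_REG = 1, FT_DIR = 2, FT_WHT = 8 };

/*
 * Model of the check made when an entry with the same name already
 * exists: the replacement is allowed only if exactly one of the old and
 * new entries is a whiteout.  Returns 0 if allowed, -EEXIST otherwise.
 */
static int may_replace(enum ft_type existing, enum ft_type incoming)
{
	int old_is_wht = (existing == FT_WHT);
	int new_is_wht = (incoming == FT_WHT);

	if (old_is_wht == new_is_wht)
		return -EEXIST;
	return 0;
}
```

Under this model, unlinking a regular file replaces its entry with a whiteout, and creating a file over a whiteout replaces the whiteout with a regular entry; the other combinations are refused.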
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index dcb2d64..df4d6b1 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -220,7 +220,7 @@ fail:
static inline int ext2_dirent_in_use(struct ext2_dir_entry_2 *de)
{
- return de->inode != 0;
+ return de->inode != 0 || de->file_type == EXT2_FT_WHT;
}
/*
@@ -269,6 +269,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_FIFO] = DT_FIFO,
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
+ [EXT2_FT_WHT] = DT_WHT,
};
#define S_SHIFT 12
@@ -472,6 +473,26 @@ static int ext2_prepare_chunk(struct page *page, loff_t pos, unsigned len)
return __block_write_begin(page, pos, len, ext2_get_block);
}
+/* Special version for filetype based whiteout support */
+ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
+{
+ ino_t res = 0;
+ struct ext2_dir_entry_2 *de;
+ struct page *page;
+
+ de = ext2_find_entry(dir, &dentry->d_name, &page);
+ if (de) {
+ res = le32_to_cpu(de->inode);
+ if (!res && de->file_type == EXT2_FT_WHT) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&dentry->d_lock);
+ }
+ ext2_put_page(page);
+ }
+ return res;
+}
+
/* Releases the page */
void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
struct page *page, struct inode *inode, int update_times)
@@ -580,6 +601,40 @@ int ext2_add_entry(struct dentry *dentry, ino_t ino, umode_t mode,
return -EINVAL;
got_it:
+ /* Pre-existing entries with the same name are allowable depending on
+ * the type of the entry being created, so we need to sanity check what
+ * we got.
+ *
+ * - Fallthru entries may be replaced by regular entries (copyup or
+ * rename) or whiteouts (unlink, rmdir or rename), but not by another
+ * fallthru.
+ * - Whiteout entries may be replaced by a regular entry (open, link,
+ * symlink, mkdir, mknod, socket or rename).
+ * - Regular entries may be replaced by a whiteout (unlink, rmdir or
+ * rename).
+ *
+ * Fallthru entries may only be created during directory copyup and
+ * should not already exist in the top directory at that time.
+ */
+ err = -EEXIST;
+ if (ext2_match(namelen, name, de)) {
+ switch (de->file_type) {
+ case EXT2_FT_WHT:
+ if (new_file_type == EXT2_FT_WHT) {
+ WARN(1, "Ext2: Can't turn whiteout into whiteout: %s\n",
+ dentry->d_name.name);
+ goto out_unlock;
+ }
+ break;
+ default:
+ if (new_file_type != EXT2_FT_WHT) {
+ WARN(1, "Ext2: Can't turn dirent into non-whiteout: %s\n",
+ dentry->d_name.name);
+ goto out_unlock;
+ }
+ break;
+ }
+ }
pos = page_offset(page) +
(char*)de - (char*)page_address(page);
err = ext2_prepare_chunk(page, pos, rec_len);
@@ -614,6 +669,11 @@ int ext2_add_link(struct dentry *dentry, struct inode *inode)
return ext2_add_entry(dentry, inode->i_ino, inode->i_mode, 0);
}
+int ext2_whiteout_entry(struct dentry *dentry)
+{
+ return ext2_add_entry(dentry, 0, 0, EXT2_FT_WHT);
+}
+
/*
* ext2_delete_entry deletes a directory entry by merging it with the
* previous entry. Page is up-to-date. Releases the page.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 75ad433..b285d9a 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -102,9 +102,11 @@ extern void ext2_rsv_window_add(struct super_block *sb, struct ext2_reserve_wind
/* dir.c */
extern int ext2_add_link (struct dentry *, struct inode *);
extern ino_t ext2_inode_by_name(struct inode *, struct qstr *);
+extern ino_t ext2_inode_by_dentry(struct inode *, struct dentry *);
extern int ext2_make_empty(struct inode *, struct inode *);
extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *, struct page **);
extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
+extern int ext2_whiteout_entry(struct dentry *);
extern int ext2_empty_dir (struct inode *);
extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 740cad8..2ffc91a 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1253,7 +1253,8 @@ void ext2_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT2_I(inode)->i_flags;
- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|
+ S_OPAQUE);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1264,6 +1265,8 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (flags & EXT2_OPAQUE_FL)
+ inode->i_flags |= S_OPAQUE;
}
/* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
@@ -1271,8 +1274,8 @@ void ext2_get_inode_flags(struct ext2_inode_info *ei)
{
unsigned int flags = ei->vfs_inode.i_flags;
- ei->i_flags &= ~(EXT2_SYNC_FL|EXT2_APPEND_FL|
- EXT2_IMMUTABLE_FL|EXT2_NOATIME_FL|EXT2_DIRSYNC_FL);
+ ei->i_flags &= ~(EXT2_SYNC_FL|EXT2_APPEND_FL|EXT2_IMMUTABLE_FL|
+ EXT2_NOATIME_FL|EXT2_DIRSYNC_FL|EXT2_OPAQUE_FL);
if (flags & S_SYNC)
ei->i_flags |= EXT2_SYNC_FL;
if (flags & S_APPEND)
@@ -1283,6 +1286,8 @@ void ext2_get_inode_flags(struct ext2_inode_info *ei)
ei->i_flags |= EXT2_NOATIME_FL;
if (flags & S_DIRSYNC)
ei->i_flags |= EXT2_DIRSYNC_FL;
+ if (flags & S_OPAQUE)
+ ei->i_flags |= EXT2_OPAQUE_FL;
}
struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0804198..8227267 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -55,7 +55,8 @@ static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
* Methods themselves.
*/
-static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry, struct nameidata *nd)
+static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry,
+ struct nameidata *nd)
{
struct inode * inode;
ino_t ino;
@@ -63,7 +64,7 @@ static struct dentry *ext2_lookup(struct inode * dir, struct dentry *dentry, str
if (dentry->d_name.len > EXT2_NAME_LEN)
return ERR_PTR(-ENAMETOOLONG);
- ino = ext2_inode_by_name(dir, &dentry->d_name);
+ ino = ext2_inode_by_dentry(dir, dentry);
inode = NULL;
if (ino) {
inode = ext2_iget(dir->i_sb, ino);
@@ -303,6 +304,51 @@ static int ext2_rmdir (struct inode * dir, struct dentry *dentry)
return err;
}
+/*
+ * Create a whiteout for the dentry
+ */
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
+ struct dentry *new_dentry)
+{
+ struct inode *inode = dentry->d_inode;
+ int err = -ENOTEMPTY;
+
+ if (!EXT2_HAS_INCOMPAT_FEATURE(dir->i_sb,
+ EXT2_FEATURE_INCOMPAT_FILETYPE)) {
+ ext2_error(dir->i_sb, "ext2_whiteout",
+ "can't set whiteout filetype");
+ err = -EPERM;
+ goto out;
+ }
+
+ dquot_initialize(dir);
+
+ if (inode && S_ISDIR(inode->i_mode) && !ext2_empty_dir(inode))
+ goto out;
+
+ err = ext2_whiteout_entry(dentry);
+ if (err)
+ goto out;
+
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ d_add(new_dentry, NULL);
+
+ if (inode) {
+ inode->i_ctime = dir->i_ctime;
+ inode_dec_link_count(inode);
+ if (S_ISDIR(inode->i_mode)) {
+ inode->i_size = 0;
+ inode_dec_link_count(inode);
+ inode_dec_link_count(dir);
+ }
+ }
+ err = 0;
+out:
+ return err;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -400,6 +446,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.mkdir = ext2_mkdir,
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
+ .whiteout = ext2_whiteout,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 0090595..8869794 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1097,6 +1097,10 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
if (EXT2_HAS_COMPAT_FEATURE(sb, EXT3_FEATURE_COMPAT_HAS_JOURNAL))
ext2_msg(sb, KERN_WARNING,
"warning: mounting ext3 filesystem as ext2");
+
+ if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_WHITEOUT))
+ sb->s_flags |= MS_WHITEOUT;
+
if (ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY))
sb->s_flags |= MS_RDONLY;
ext2_write_super(sb);
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index ce1b719..2202faa 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -190,6 +190,7 @@ struct ext2_group_desc
#define EXT2_NOTAIL_FL FS_NOTAIL_FL /* file tail should not be merged */
#define EXT2_DIRSYNC_FL FS_DIRSYNC_FL /* dirsync behaviour (directories only) */
#define EXT2_TOPDIR_FL FS_TOPDIR_FL /* Top of directory hierarchies*/
+#define EXT2_OPAQUE_FL FS_OPAQUE_FL /* Dir is opaque */
#define EXT2_RESERVED_FL FS_RESERVED_FL /* reserved for ext2 lib */
#define EXT2_FL_USER_VISIBLE FS_FL_USER_VISIBLE /* User visible flags */
@@ -504,10 +505,12 @@ struct ext2_super_block {
#define EXT3_FEATURE_INCOMPAT_RECOVER 0x0004
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
+#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
#define EXT2_FEATURE_INCOMPAT_ANY 0xffffffff
#define EXT2_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+ EXT2_FEATURE_INCOMPAT_WHITEOUT| \
EXT2_FEATURE_INCOMPAT_META_BG)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -574,6 +577,7 @@ enum {
EXT2_FT_FIFO = 5,
EXT2_FT_SOCK = 6,
EXT2_FT_SYMLINK = 7,
+ EXT2_FT_WHT = 8,
EXT2_FT_MAX
};
From: Jan Blunck <[email protected]>
Add a per-mountpoint flag (MNT_UNION) for Union Mount support. Setting it from
userspace requires additional patches to util-linux - see:
git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 4 +++-
fs/proc_namespace.c | 1 +
include/linux/fs.h | 1 +
include/linux/mount.h | 1 +
4 files changed, 6 insertions(+), 1 deletions(-)
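
The flag handling added to do_mount() can be sketched in user space as follows. The two flag values are the ones defined by this patch; the `MODEL_` names and the helper `model_take_union_flag()` are hypothetical and exist only for illustration.

```c
/* Values as defined in this patch; the MODEL_ names are illustrative. */
#define MODEL_MS_UNION  256UL    /* mount(2) flag passed in by userspace */
#define MODEL_MNT_UNION 0x10000  /* per-mountpoint flag kept on the mount */

/*
 * Translate the userspace mount flag into the per-mountpoint flag and
 * strip it from the flags that are passed on to the superblock.
 */
static int model_take_union_flag(unsigned long *ms_flags)
{
	int mnt_flags = 0;

	if (*ms_flags & MODEL_MS_UNION)
		mnt_flags |= MODEL_MNT_UNION;
	*ms_flags &= ~MODEL_MS_UNION;
	return mnt_flags;
}
```

The point of the split is that the union property belongs to the mountpoint rather than to the superblock, so the flag is consumed here and not handed down to the file system.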
diff --git a/fs/namespace.c b/fs/namespace.c
index c01aff2..33aa310 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2214,10 +2214,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+ if (flags & MS_UNION)
+ mnt_flags |= MNT_UNION;
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
- MS_STRICTATIME);
+ MS_STRICTATIME | MS_UNION);
if (flags & MS_REMOUNT)
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 1241285..4609740 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -65,6 +65,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NOATIME, ",noatime" },
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
+ { MNT_UNION, ",union" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5e6658..a014f0f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -193,6 +193,7 @@ struct inodes_stat_t {
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_UNION 256 /* Merge namespace with FS mounted below */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 41c7c84..0ba1def 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -47,6 +47,7 @@ struct mnt_namespace;
#define MNT_INTERNAL 0x4000
#define MNT_HARD_READONLY 0x8000 /* has a hard read-only ref on the sb */
+#define MNT_UNION 0x10000 /* top layer of a union mount */
struct vfsmount {
struct dentry *mnt_root; /* root of the mounted tree */
Split inode_permission() into inode- and superblock-dependent parts.
This is aimed at unionmounts, where the read-only check has to be made against
the superblock of the upper layer rather than that of the lower layer: the
upper layer may be writable, in which case an unwritable file from the lower
layer can still be copied up and modified.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Further development)
---
fs/internal.h | 5 ++++
fs/namei.c | 66 ++++++++++++++++++++++++++++++++++++++++++---------------
2 files changed, 54 insertions(+), 17 deletions(-)
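
A minimal user-space model of the split, assuming a later patch checks a lower-layer inode against the upper layer's superblock. All `model_` names are hypothetical, the structures are drastically simplified, and `model_union_permission()` is an assumed usage, not something this patch adds.

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified, hypothetical stand-ins for struct super_block / struct inode. */
struct model_sb { bool read_only; };
struct model_inode { struct model_sb *sb; bool immutable; bool is_regular; };

#define MODEL_MAY_WRITE 0x2

/* Superblock-wide part: nobody gets write access to a read-only fs. */
static int model_sb_permission(const struct model_sb *sb,
			       const struct model_inode *inode, int mask)
{
	if ((mask & MODEL_MAY_WRITE) && sb->read_only && inode->is_regular)
		return -EROFS;
	return 0;
}

/* Inode-specific part: deliberately has no read-only-fs check. */
static int model_inode_permission_nosb(const struct model_inode *inode, int mask)
{
	if ((mask & MODEL_MAY_WRITE) && inode->immutable)
		return -EACCES;
	return 0;
}

/* Ordinary case: check the inode against its own superblock. */
static int model_inode_permission(const struct model_inode *inode, int mask)
{
	int ret = model_sb_permission(inode->sb, inode, mask);

	return ret ? ret : model_inode_permission_nosb(inode, mask);
}

/* Assumed union case: lower inode checked against the upper superblock. */
static int model_union_permission(const struct model_sb *upper_sb,
				  const struct model_inode *lower, int mask)
{
	int ret = model_sb_permission(upper_sb, lower, mask);

	return ret ? ret : model_inode_permission_nosb(lower, mask);
}
```

With a read-only lower layer and a writable upper layer, the ordinary check refuses write access with -EROFS while the union-style check lets it through, which is what copy-up needs.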
diff --git a/fs/internal.h b/fs/internal.h
index 9962c59..043a937 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -42,6 +42,11 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
extern void __init chrdev_init(void);
/*
+ * namei.c
+ */
+extern int __inode_permission(struct inode *, int);
+
+/*
* namespace.c
*/
extern int copy_mount_options(const void __user *, unsigned long *);
diff --git a/fs/namei.c b/fs/namei.c
index 2d983f7..7f9df02 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -328,31 +328,22 @@ static inline int do_inode_permission(struct inode *inode, int mask)
}
/**
- * inode_permission - check for access rights to a given inode
- * @inode: inode to check permission on
- * @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC, ...)
+ * __inode_permission - Check for access rights to a given inode
+ * @inode: Inode to check permission on
+ * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
*
- * Used to check for read/write/execute permissions on an inode.
- * We use "fsuid" for this, letting us set arbitrary permissions
- * for filesystem access without changing the "normal" uids which
- * are used for other things.
+ * Check for read/write/execute permissions on an inode.
*
* When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
+ *
+ * This does not check for a read-only file system. You probably want
+ * inode_permission().
*/
-int inode_permission(struct inode *inode, int mask)
+int __inode_permission(struct inode *inode, int mask)
{
int retval;
if (unlikely(mask & MAY_WRITE)) {
- umode_t mode = inode->i_mode;
-
- /*
- * Nobody gets write access to a read-only fs.
- */
- if (IS_RDONLY(inode) &&
- (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
- return -EROFS;
-
/*
* Nobody gets write access to an immutable file.
*/
@@ -372,6 +363,47 @@ int inode_permission(struct inode *inode, int mask)
}
/**
+ * sb_permission - Check superblock-level permissions
+ * @sb: Superblock of inode to check permission on
+ * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Separate out file-system wide checks from inode-specific permission checks.
+ */
+static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
+{
+ if (unlikely(mask & MAY_WRITE)) {
+ umode_t mode = inode->i_mode;
+
+ /* Nobody gets write access to a read-only fs. */
+ if ((sb->s_flags & MS_RDONLY) &&
+ (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
+ return -EROFS;
+ }
+ return 0;
+}
+
+/**
+ * inode_permission - Check for access rights to a given inode
+ * @inode: Inode to check permission on
+ * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
+ *
+ * Check for read/write/execute permissions on an inode. We use fs[ug]id for
+ * this, letting us set arbitrary permissions for filesystem access without
+ * changing the "normal" UIDs which are used for other things.
+ *
+ * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask.
+ */
+int inode_permission(struct inode *inode, int mask)
+{
+ int retval;
+
+ retval = sb_permission(inode->i_sb, inode, mask);
+ if (retval)
+ return retval;
+ return __inode_permission(inode, mask);
+}
+
+/**
* path_get - get a reference to a path
* @path: path to get the reference to
*
From: Valerie Aurora <[email protected]>
Union mounts hook into the lookup path in two places: do_lookup() and
lookup_hash().
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index c0adf4c..586913f 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1587,6 +1587,14 @@ retry:
}
if (err)
nd->flags |= LOOKUP_JUMPED;
+
+ if (needs_lookup_union(&nd->path, path)) {
+ int err = lookup_union(nd, name, path);
+ if (err < 0)
+ return err;
+#warning which inode?
+ }
+
*inode = path->dentry->d_inode;
return 0;
}
@@ -2135,8 +2143,12 @@ static int lookup_hash(struct nameidata *nd, struct qstr *name,
path->dentry = NULL;
return PTR_ERR(result);
}
+
path->mnt = nd->path.mnt;
path->dentry = result;
+
+ if (needs_lookup_union(&nd->path, path))
+ return lookup_union_locked(nd, name, path);
return 0;
}
From: Valerie Aurora <[email protected]>
d_free_unions() frees the union stack associated with a directory.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/union.c | 24 ++++++++++++++++++++++++
fs/union.h | 10 ++++++++++
2 files changed, 34 insertions(+), 0 deletions(-)
diff --git a/fs/union.c b/fs/union.c
index c8d7766..77d6a74 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -19,6 +19,7 @@
#include <linux/mount.h>
#include <linux/fs_struct.h>
#include <linux/slab.h>
+#include <linux/namei.h>
#include "union.h"
@@ -36,3 +37,26 @@ static struct union_stack *union_alloc(struct path *topmost)
return kcalloc(sizeof(struct path), layers, GFP_KERNEL);
}
+
+/**
+ * d_free_unions - free all unions for this dentry
+ * @topmost: topmost dentry in the union stack to remove
+ *
+ * This must be called when freeing a dentry.
+ */
+void d_free_unions(struct dentry *topmost)
+{
+ struct path *path;
+ unsigned int i, layers = topmost->d_sb->s_union_count;
+
+ if (!IS_DIR_UNIONED(topmost))
+ return;
+
+ for (i = 0; i < layers; i++) {
+ path = union_find_dir(topmost, i);
+ if (path->mnt)
+ path_put(path);
+ }
+ kfree(topmost->d_union_stack);
+ topmost->d_union_stack = NULL;
+}
diff --git a/fs/union.h b/fs/union.h
index f90d037..04a02ec 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -51,6 +51,13 @@ struct union_stack {
struct path u_dirs[0];
};
+static inline bool IS_DIR_UNIONED(struct dentry *dentry)
+{
+ return !!dentry->d_union_stack;
+}
+
+extern void d_free_unions(struct dentry *);
+
static inline
struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
{
@@ -67,4 +74,7 @@ struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
return NULL;
}
+static inline bool IS_DIR_UNIONED(struct dentry *dentry) { return false; }
+static inline void d_free_unions(struct dentry *dentry) {}
+
#endif /* CONFIG_UNION_MOUNT */
From: Valerie Aurora <[email protected]>
Define the fallthru dcache flag and file system op. Clear the
DCACHE_FALLTHRU flag when a dentry is instantiated with an inode.
Actual users and changes to lookup come in later patches.
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
Documentation/filesystems/vfs.txt | 6 ++++++
fs/dcache.c | 2 +-
include/linux/dcache.h | 7 +++++++
include/linux/fs.h | 2 ++
4 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8575c5b..6d9c108 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -350,6 +350,7 @@ struct inode_operations {
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+ int (*fallthru) (struct inode *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
@@ -421,6 +422,11 @@ otherwise noted.
second is the dentry for the whiteout itself. This method
must unlink() or rmdir() the original entry if it exists.
+ fallthru: called by the readdir(2) system call on a layered file
+ system. Only required if you want to support fallthrus.
+ Fallthrus are place-holders for directory entries visible from
+ a lower level file system.
+
rename: called by the rename(2) system call to rename the object to
have the parent and name given by the second inode and dentry.
diff --git a/fs/dcache.c b/fs/dcache.c
index 60af7b1..b1ce8d1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1311,7 +1311,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
spin_lock(&dentry->d_lock);
if (inode) {
- dentry->d_flags &= ~DCACHE_WHITEOUT;
+ dentry->d_flags &= ~(DCACHE_WHITEOUT | DCACHE_FALLTHRU);
if (unlikely(IS_AUTOMOUNT(inode)))
dentry->d_flags |= DCACHE_NEED_AUTOMOUNT;
list_add(&dentry->d_alias, &inode->i_dentry);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f22f530..cc0181b 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -213,6 +213,8 @@ struct dentry_operations {
#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x4000
/* Parent inode is watched by some fsnotify listener */
+#define DCACHE_FALLTHRU 0x8000 /* Continue lookup below an opaque dir */
+
#define DCACHE_MOUNTED 0x10000 /* is a mountpoint */
#define DCACHE_NEED_AUTOMOUNT 0x20000 /* handle automount on this dir */
#define DCACHE_MANAGE_TRANSIT 0x40000 /* manage transit from this dirent */
@@ -420,6 +422,11 @@ static inline int d_is_whiteout(struct dentry *dentry)
return dentry->d_flags & DCACHE_WHITEOUT;
}
+static inline int d_is_fallthru(struct dentry *dentry)
+{
+ return dentry->d_flags & DCACHE_FALLTHRU;
+}
+
static inline bool d_mountpoint(struct dentry *dentry)
{
return dentry->d_flags & DCACHE_MOUNTED;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1e4ae06..b5e6658 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -211,6 +211,7 @@ struct inodes_stat_t {
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
#define MS_WHITEOUT (1<<25) /* FS supports whiteout filetype */
+#define MS_FALLTHRU (1<<26) /* FS supports fallthru filetype */
#define MS_NOSEC (1<<28)
#define MS_BORN (1<<29)
#define MS_ACTIVE (1<<30)
@@ -1653,6 +1654,7 @@ struct inode_operations {
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+ int (*fallthru) (struct inode *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
void (*truncate) (struct inode *);
From: Jan Blunck <[email protected]>
If a dentry is removed from the dentry cache because its usage count drops
to zero, its union stack is freed too.
Original-author: Jan Blunck <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/dcache.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 326a432..e450890 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -39,6 +39,7 @@
#include <linux/ratelimit.h>
#include "internal.h"
#include "mount.h"
+#include "union.h"
/*
* Usage:
@@ -316,6 +317,7 @@ static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
if (parent)
spin_unlock(&parent->d_lock);
dentry_iput(dentry);
+ d_free_unions(dentry);
/*
* dentry_iput drops the locks, at which point nobody (except
* transient RCU lookups) can reach this dentry.
@@ -907,6 +909,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
iput(inode);
}
+ d_free_unions(dentry);
d_free(dentry);
/* finished when we fall off the top of the tree,
@@ -2009,6 +2012,7 @@ again:
}
dentry->d_flags &= ~DCACHE_CANT_MOUNT;
dentry_unlink_inode(dentry);
+ d_free_unions(dentry);
fsnotify_nameremove(dentry, isdir);
return;
}
@@ -2018,6 +2022,12 @@ again:
spin_unlock(&dentry->d_lock);
+ /* Remove any associated unions. Someone may still have this directory
+ * open (ref count > 0), but we could not have deleted it unless it was
+ * empty, so it has no references to directories below it and we no
+ * longer need the unions.
+ */
+ d_free_unions(dentry);
fsnotify_nameremove(dentry, isdir);
}
EXPORT_SYMBOL(d_delete);
Make the chown() and lchown() syscalls call the fchownat() syscall with the
appropriate extra arguments (AT_FDCWD, plus AT_SYMLINK_NOFOLLOW for lchown()).
Signed-off-by: David Howells <[email protected]>
---
fs/open.c | 41 +++++++----------------------------------
1 files changed, 7 insertions(+), 34 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 77becc0..3c44148 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -527,25 +527,6 @@ static int chown_common(struct path *path, uid_t user, gid_t group)
return error;
}
-SYSCALL_DEFINE3(chown, const char __user *, filename, uid_t, user, gid_t, group)
-{
- struct path path;
- int error;
-
- error = user_path(filename, &path);
- if (error)
- goto out;
- error = mnt_want_write(path.mnt);
- if (error)
- goto out_release;
- error = chown_common(&path, user, group);
- mnt_drop_write(path.mnt);
-out_release:
- path_put(&path);
-out:
- return error;
-}
-
SYSCALL_DEFINE5(fchownat, int, dfd, const char __user *, filename, uid_t, user,
gid_t, group, int, flag)
{
@@ -573,23 +554,15 @@ out:
return error;
}
-SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group)
+SYSCALL_DEFINE3(chown, const char __user *, filename, uid_t, user, gid_t, group)
{
- struct path path;
- int error;
+ return sys_fchownat(AT_FDCWD, filename, user, group, 0);
+}
- error = user_lpath(filename, &path);
- if (error)
- goto out;
- error = mnt_want_write(path.mnt);
- if (error)
- goto out_release;
- error = chown_common(&path, user, group);
- mnt_drop_write(path.mnt);
-out_release:
- path_put(&path);
-out:
- return error;
+SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group)
+{
+ return sys_fchownat(AT_FDCWD, filename, user, group,
+ AT_SYMLINK_NOFOLLOW);
}
SYSCALL_DEFINE3(fchown, unsigned int, fd, uid_t, user, gid_t, group)
From: Valerie Aurora <[email protected]>
readdir() in union mounts is implemented by copying up all visible
directory entries from the lower level directories to the topmost
directory. Directory entries that refer to lower level file system
objects are marked as "fallthru" in the topmost directory.
Thanks to Felix Fietkau <[email protected]> for a bug fix.
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/readdir.c | 9 +++
fs/union.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 7 ++
3 files changed, 200 insertions(+), 0 deletions(-)
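
The directory copyup described above can be modelled in user space roughly as follows. This is a sketch under simplifying assumptions: directories are fixed-size in-memory arrays, there is a single lower layer, and every `model_` name is hypothetical rather than part of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum entry_kind { ENT_REGULAR, ENT_WHITEOUT, ENT_FALLTHRU };

struct model_entry { const char *name; enum entry_kind kind; };

#define MODEL_DIR_MAX 16

struct model_dir {
	struct model_entry entries[MODEL_DIR_MAX];
	size_t count;
	int opaque;	/* set once the lower entries have been copied up */
};

static const struct model_entry *model_lookup(const struct model_dir *dir,
					      const char *name)
{
	size_t i;

	for (i = 0; i < dir->count; i++)
		if (strcmp(dir->entries[i].name, name) == 0)
			return &dir->entries[i];
	return NULL;
}

/* Copy up one lower entry unless the topmost directory already masks it. */
static int model_copyup_one(struct model_dir *top, const char *name)
{
	if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
		return 0;
	/* A regular entry, whiteout or fallthru in the top layer wins. */
	if (model_lookup(top, name))
		return 0;
	if (top->count == MODEL_DIR_MAX)
		return -ENOSPC;
	top->entries[top->count].name = name;
	top->entries[top->count].kind = ENT_FALLTHRU;
	top->count++;
	return 0;
}

/* Copy up all visible lower entries, then mark the directory opaque. */
static int model_copyup_dir(struct model_dir *top,
			    const char *const *lower, size_t n)
{
	size_t i;
	int err;

	if (top->opaque)
		return 0;
	for (i = 0; i < n; i++) {
		err = model_copyup_one(top, lower[i]);
		if (err)
			return err;
	}
	top->opaque = 1;
	return 0;
}
```

As in the patch, the opaque marker is set only after all entries have been copied, so an interrupted copyup is simply retried on the next read of the directory.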
diff --git a/fs/readdir.c b/fs/readdir.c
index de703d6..a87ae02 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -20,6 +20,8 @@
#include <asm/uaccess.h>
+#include "union.h"
+
int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
struct inode *inode = file->f_path.dentry->d_inode;
@@ -37,9 +39,16 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
+ if (IS_DIR_UNIONED(file->f_path.dentry) && !IS_OPAQUE(inode)) {
+ res = union_copyup_dir(&file->f_path);
+ if (res)
+ goto out_unlock;
+ }
+
res = file->f_op->readdir(file, buf, filler);
file_accessed(file);
}
+out_unlock:
mutex_unlock(&inode->i_mutex);
out:
return res;
diff --git a/fs/union.c b/fs/union.c
index f183051..0c0490f 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -22,6 +22,8 @@
#include <linux/namei.h>
#include <linux/fsnotify.h>
#include <linux/xattr.h>
+#include <linux/file.h>
+#include <linux/security.h>
#include "union.h"
@@ -202,3 +204,185 @@ out:
mnt_drop_write(parent->mnt);
return error;
}
+
+struct union_filldir_info {
+ struct dentry *topmost_dentry;
+ int error;
+};
+
+/**
+ * union_copyup_dir_one - copy up a single directory entry
+ *
+ * Individual directory entry copyup function for union_copyup_dir.
+ * We get the entries from higher level layers first.
+ */
+static int union_copyup_dir_one(void *buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct union_filldir_info *ufi = (struct union_filldir_info *) buf;
+ struct dentry *topmost_dentry = ufi->topmost_dentry;
+ struct dentry *dentry;
+ int err = 0;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ /* Lookup this entry in the topmost directory */
+ dentry = lookup_one_len(name, topmost_dentry, namlen);
+
+ if (IS_ERR(dentry)) {
+ printk(KERN_WARNING "%s: error looking up %.*s\n", __func__,
+ namlen, name);
+ err = PTR_ERR(dentry);
+ goto out;
+ }
+
+ /* XXX do we need to revalidate on readdir anyway? think NFS */
+ if (dentry->d_op && dentry->d_op->d_revalidate)
+ goto fallthru;
+
+ /* If the entry already exists, one of the following is true: it was
+ * already copied up (due to an earlier lookup), an entry with the same
+ * name already exists on the topmost file system, it is a whiteout, or
+ * it is a fallthru. In each case, the top level entry masks any
+ * entries from lower file systems, so don't copy up this entry.
+ */
+ if (dentry->d_inode || d_is_whiteout(dentry) || d_is_fallthru(dentry))
+ goto out_dput;
+
+ /* If the entry doesn't exist, create a fallthru entry in the topmost
+ * file system. All possible directory entry types are already in use,
+ * so each file system must implement its own way of storing a fallthru.
+ */
+fallthru:
+ err = topmost_dentry->d_inode->i_op->fallthru(topmost_dentry->d_inode,
+ dentry);
+
+ /* It's okay if it exists, ultimate responsibility rests with
+ * ->fallthru() */
+ if (err == -EEXIST)
+ err = 0;
+out_dput:
+ dput(dentry);
+out:
+ if (err)
+ ufi->error = err;
+ return err;
+}
+
+/**
+ * union_copyup_dir - copy up low-level directory entries to topmost dir
+ *
+ * readdir() is difficult to support on union file systems for two reasons: We
+ * must eliminate duplicates and apply whiteouts, and we must return something
+ * in f_pos that lets us restart in the same place when we return. Our
+ * solution is to, on first readdir() of the directory, copy up all visible
+ * entries from the low-level file systems and mark the entries that refer to
+ * low-level file system objects as "fallthru" entries.
+ *
+ * Locking strategy: We hold the topmost dir's i_mutex on entry. We grab the
+ * i_mutex on lower directories one by one. So the locking order is:
+ *
+ * Writable/topmost layers > Read-only/lower layers
+ *
+ * So there is no problem with lock ordering for union stacks with
+ * multiple lower layers. E.g.:
+ *
+ * (topmost) A->B->C (bottom)
+ * (topmost) D->C->B (bottom)
+ *
+ */
+int union_copyup_dir(struct path *topmost_path)
+{
+ struct union_filldir_info ufi;
+ struct dentry *topmost_dentry = topmost_path->dentry;
+ unsigned int i, layers = topmost_dentry->d_sb->s_union_count;
+ int error = 0;
+
+ BUG_ON(IS_OPAQUE(topmost_dentry->d_inode));
+
+ if (!topmost_dentry->d_inode->i_op ||
+ !topmost_dentry->d_inode->i_op->fallthru)
+ return -EOPNOTSUPP;
+
+ error = mnt_want_write(topmost_path->mnt);
+ if (error)
+ return error;
+
+ for (i = 0; i < layers; i++) {
+ struct file *ftmp;
+ struct inode *inode;
+ struct path *path;
+
+ path = union_find_dir(topmost_dentry, i);
+ if (!path->mnt)
+ continue;
+
+ /* dentry_open() doesn't get a path reference itself */
+ path_get(path);
+ ftmp = dentry_open(path->dentry, path->mnt,
+ O_RDONLY | O_DIRECTORY | O_NOATIME,
+ current_cred());
+ if (IS_ERR(ftmp)) {
+ printk(KERN_ERR "unable to open dir %s for "
+ "directory copyup: %ld\n",
+ path->dentry->d_name.name, PTR_ERR(ftmp));
+ path_put(path);
+ error = PTR_ERR(ftmp);
+ break;
+ }
+
+ inode = path->dentry->d_inode;
+ mutex_lock(&inode->i_mutex);
+
+ ufi.topmost_dentry = topmost_dentry;
+ ufi.error = 0;
+ error = -ENOENT;
+ if (IS_DEADDIR(inode))
+ goto out_fput;
+
+ /* Read the whole directory, calling our directory entry copyup
+ * function on each entry.
+ */
+ error = ftmp->f_op->readdir(ftmp, &ufi, union_copyup_dir_one);
+out_fput:
+ mutex_unlock(&inode->i_mutex);
+ fput(ftmp);
+
+ if (ufi.error)
+ error = ufi.error;
+ if (error)
+ break;
+
+ /* XXX Should process directories below an opaque directory in
+ * case there are fallthrus in it
+ */
+ if (IS_OPAQUE(path->dentry->d_inode))
+ break;
+ }
+
+ /* Mark this dir opaque to show that we have already copied up the
+ * lower entries. Be sure to do this AFTER the directory entries have
+ * been copied up so that if we crash in the middle of copyup, we will
+ * try to copyup the dir next time we read it.
+ *
+ * XXX - Could leave directory non-opaque, and force reread/copyup of
+ * directory each time it is read in from disk. That would make it
+ * easy to update lower file systems (when not union mounted) and have
+ * the changes show up when union mounted again.
+ */
+ if (!error) {
+ topmost_dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(topmost_dentry->d_inode);
+ }
+
+ mnt_drop_write(topmost_path->mnt);
+ return error;
+}
diff --git a/fs/union.h b/fs/union.h
index 48a9277..a77bd5f 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -70,6 +70,7 @@ extern void d_free_unions(struct dentry *);
extern int union_add_dir(struct path *, struct path *, unsigned int);
extern int union_create_topmost_dir(struct path *, struct qstr *, struct path *,
struct path *);
+extern int union_copyup_dir(struct path *);
static inline
struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
@@ -130,4 +131,10 @@ static inline bool needs_lookup_union(struct path *parent_path, struct path *pat
return false;
}
+static inline int union_copyup_dir(struct path *topmost_path)
+{
+ BUG();
+ return 0;
+}
+
#endif /* CONFIG_UNION_MOUNT */
From: Valerie Aurora <[email protected]>
If we mkdir() a directory on the top layer of a union, we don't want
entries from a matching directory on the lower layer to "show through"
suddenly. To prevent this, we set the opaque flag on a directory in a
union mount if there is no matching directory on the lower layers.
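The effect of the opaque flag on visibility can be sketched in userspace as
follows. This is only an illustrative model, not the kernel code: struct
model_dir, model_has() and model_visible() are hypothetical names, and a
single lower layer is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a directory in one layer of the union. */
struct model_dir {
	bool opaque;			/* models S_OPAQUE on the inode */
	const char *const *names;	/* entries stored in this layer */
	size_t n;
};

/* Does this layer's directory itself contain @name? */
static bool model_has(const struct model_dir *d, const char *name)
{
	assert(d->n == 0 || d->names != NULL);
	for (size_t i = 0; i < d->n; i++)
		if (strcmp(d->names[i], name) == 0)
			return true;
	return false;
}

/* Is @name visible when @top is unioned over @lower? */
static bool model_visible(const struct model_dir *top,
			  const struct model_dir *lower, const char *name)
{
	if (model_has(top, name))
		return true;
	if (top->opaque)
		return false;	/* opaque: nothing shows through */
	return model_has(lower, name);
}
```

In this model, a directory freshly created on the top layer and marked opaque
hides any same-named lower directory's entries, which is the behaviour the
patch is after.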
Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index f9e0d68..d52377d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2946,8 +2946,17 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
return error;
error = dir->i_op->mkdir(dir, dentry, mode);
- if (!error)
- fsnotify_mkdir(dir, dentry);
+ if (error)
+ return error;
+
+ /* XXX racy - crash now and dir isn't opaque */
+ if (IS_DIR_UNIONED(dentry->d_parent)) {
+ dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(dentry->d_inode);
+ }
+
+ fsnotify_mkdir(dir, dentry);
+
return error;
}
From: Valerie Aurora <[email protected]>
Now that we have full union lookup support, look up the true d_type and
d_ino of a fallthru.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/jffs2/dir.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index e294f1d..ce4c393 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -168,9 +168,14 @@ static int jffs2_readdir(struct file *filp, void *dirent, filldir_t filldir)
continue;
}
if (fd->type == JFFS2_DT_FALLTHRU) {
- /* XXX placeholder until generic_readdir_fallthru() arrives */
- ino = 1;
- d_type = DT_UNKNOWN;
+ int err;
+ err = generic_readdir_fallthru(filp->f_path.dentry, fd->name, strlen(fd->name),
+ &ino, &d_type);
+ if (err) {
+ D2(printk(KERN_DEBUG "Skipping fallthru dirent \"%s\"\n", fd->name));
+ offset++;
+ continue;
+ }
} else if (!fd->ino && (fd->type != DT_WHT)) {
D2(printk(KERN_DEBUG "Skipping deletion dirent \"%s\"\n", fd->name));
offset++;
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called. They could also be passed to the
compare function.
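As a rough userspace sketch of the ordering this gives (a simplified model
only; struct model_sb and the model_*() functions are hypothetical and cover
just the "no existing superblock matched" path), the flags are installed at
allocation, so the set callback can already see them:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified model of a superblock; not the kernel struct. */
struct model_sb {
	int flags;
	void *fs_info;
};

typedef int (*model_set_fn)(struct model_sb *, void *);

/* Models alloc_super(type, flags): flags are installed at allocation. */
static struct model_sb *model_alloc_super(int flags)
{
	struct model_sb *s = calloc(1, sizeof(*s));

	if (s)
		s->flags = flags;
	return s;
}

/* Models the path through sget() where no existing superblock matched. */
static struct model_sb *model_sget(model_set_fn set, int flags, void *data)
{
	struct model_sb *s;

	assert(set != NULL);
	s = model_alloc_super(flags);
	if (!s)
		return NULL;
	if (set(s, data)) {	/* set() can already inspect s->flags */
		free(s);
		return NULL;
	}
	return s;
}

/* Example set() callback that records the flags it observed. */
static int model_record_flags(struct model_sb *s, void *data)
{
	*(int *)data = s->flags;
	s->fs_info = data;
	return 0;
}
```

Callers then no longer need to assign s_flags themselves after sget()
returns a new superblock.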
Signed-off-by: David Howells <[email protected]>
---
drivers/mtd/mtdsuper.c | 4 +---
fs/9p/vfs_super.c | 4 ++--
fs/afs/super.c | 3 +--
fs/btrfs/super.c | 4 ++--
fs/ceph/super.c | 2 +-
fs/cifs/cifsfs.c | 9 ++++-----
fs/devpts/inode.c | 6 +++---
fs/ecryptfs/main.c | 3 +--
fs/gfs2/ops_fstype.c | 5 ++---
fs/libfs.c | 4 ++--
fs/logfs/super.c | 3 +--
fs/nfs/super.c | 10 +++++-----
fs/nilfs2/super.c | 4 ++--
fs/proc/root.c | 3 +--
fs/reiserfs/procfs.c | 2 +-
fs/super.c | 22 +++++++++++-----------
fs/sysfs/mount.c | 3 +--
fs/ubifs/super.c | 3 +--
include/linux/fs.h | 2 +-
kernel/cgroup.c | 2 +-
20 files changed, 44 insertions(+), 54 deletions(-)
diff --git a/drivers/mtd/mtdsuper.c b/drivers/mtd/mtdsuper.c
index a90bfe7..334da5f 100644
--- a/drivers/mtd/mtdsuper.c
+++ b/drivers/mtd/mtdsuper.c
@@ -63,7 +63,7 @@ static struct dentry *mount_mtd_aux(struct file_system_type *fs_type, int flags,
struct super_block *sb;
int ret;
- sb = sget(fs_type, get_sb_mtd_compare, get_sb_mtd_set, mtd);
+ sb = sget(fs_type, get_sb_mtd_compare, get_sb_mtd_set, flags, mtd);
if (IS_ERR(sb))
goto out_error;
@@ -74,8 +74,6 @@ static struct dentry *mount_mtd_aux(struct file_system_type *fs_type, int flags,
pr_debug("MTDSB: New superblock for device %d (\"%s\")\n",
mtd->index, mtd->name);
- sb->s_flags = flags;
-
ret = fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
if (ret < 0) {
deactivate_locked_super(sb);
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 7b0cd87..347034c 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -89,7 +89,7 @@ v9fs_fill_super(struct super_block *sb, struct v9fs_session_info *v9ses,
if (v9ses->cache)
sb->s_bdi->ra_pages = (VM_MAX_READAHEAD * 1024)/PAGE_CACHE_SIZE;
- sb->s_flags = flags | MS_ACTIVE | MS_DIRSYNC | MS_NOATIME;
+ sb->s_flags |= MS_ACTIVE | MS_DIRSYNC | MS_NOATIME;
if (!v9ses->cache)
sb->s_flags |= MS_SYNCHRONOUS;
@@ -137,7 +137,7 @@ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,
goto close_session;
}
- sb = sget(fs_type, NULL, v9fs_set_super, v9ses);
+ sb = sget(fs_type, NULL, v9fs_set_super, flags, v9ses);
if (IS_ERR(sb)) {
retval = PTR_ERR(sb);
goto clunk_fid;
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 983ec59..be87270 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -398,7 +398,7 @@ static struct dentry *afs_mount(struct file_system_type *fs_type,
as->volume = vol;
/* allocate a deviceless superblock */
- sb = sget(fs_type, afs_test_super, afs_set_super, as);
+ sb = sget(fs_type, afs_test_super, afs_set_super, flags, as);
if (IS_ERR(sb)) {
ret = PTR_ERR(sb);
afs_put_volume(vol);
@@ -409,7 +409,6 @@ static struct dentry *afs_mount(struct file_system_type *fs_type,
if (!sb->s_root) {
/* initial superblock/root creation */
_debug("create");
- sb->s_flags = flags;
ret = afs_fill_super(sb, &params);
if (ret < 0) {
deactivate_locked_super(sb);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3ce97b2..650b5ca 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -961,7 +961,8 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
}
bdev = fs_devices->latest_bdev;
- s = sget(fs_type, btrfs_test_super, btrfs_set_super, fs_info);
+ s = sget(fs_type, btrfs_test_super, btrfs_set_super, flags | MS_NOSEC,
+ fs_info);
if (IS_ERR(s)) {
error = PTR_ERR(s);
goto error_close_devices;
@@ -975,7 +976,6 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
} else {
char b[BDEVNAME_SIZE];
- s->s_flags = flags | MS_NOSEC;
strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
btrfs_sb(s)->bdev_holder = fs_type;
error = btrfs_fill_super(s, fs_devices, data,
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 00de2c9..2bc74f5 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -860,7 +860,7 @@ static struct dentry *ceph_mount(struct file_system_type *fs_type,
if (ceph_test_opt(fsc->client, NOSHARE))
compare_super = NULL;
- sb = sget(fs_type, compare_super, ceph_set_super, fsc);
+ sb = sget(fs_type, compare_super, ceph_set_super, flags, fsc);
if (IS_ERR(sb)) {
res = ERR_CAST(sb);
goto out;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index b1fd382..64a15b7 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -632,7 +632,10 @@ cifs_do_mount(struct file_system_type *fs_type,
mnt_data.cifs_sb = cifs_sb;
mnt_data.flags = flags;
- sb = sget(fs_type, cifs_match_super, cifs_set_super, &mnt_data);
+ /* BB should we make this contingent on mount parm? */
+ flags |= MS_NODIRATIME | MS_NOATIME;
+
+ sb = sget(fs_type, cifs_match_super, cifs_set_super, flags, &mnt_data);
if (IS_ERR(sb)) {
root = ERR_CAST(sb);
cifs_umount(cifs_sb);
@@ -643,10 +646,6 @@ cifs_do_mount(struct file_system_type *fs_type,
cFYI(1, "Use existing superblock");
cifs_umount(cifs_sb);
} else {
- sb->s_flags = flags;
- /* BB should we make this contingent on mount parm? */
- sb->s_flags |= MS_NODIRATIME | MS_NOATIME;
-
rc = cifs_read_super(sb);
if (rc) {
root = ERR_PTR(rc);
diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c4e2a58..b596698 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -367,15 +367,15 @@ static struct dentry *devpts_mount(struct file_system_type *fs_type,
return ERR_PTR(error);
if (opts.newinstance)
- s = sget(fs_type, NULL, set_anon_super, NULL);
+ s = sget(fs_type, NULL, set_anon_super, flags, NULL);
else
- s = sget(fs_type, compare_init_pts_sb, set_anon_super, NULL);
+ s = sget(fs_type, compare_init_pts_sb, set_anon_super, flags,
+ NULL);
if (IS_ERR(s))
return ERR_CAST(s);
if (!s->s_root) {
- s->s_flags = flags;
error = devpts_fill_super(s, data, flags & MS_SILENT ? 1 : 0);
if (error)
goto out_undo_sget;
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index b4a6bef..99c273d 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -499,13 +499,12 @@ static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
goto out;
}
- s = sget(fs_type, NULL, set_anon_super, NULL);
+ s = sget(fs_type, NULL, set_anon_super, flags, NULL);
if (IS_ERR(s)) {
rc = PTR_ERR(s);
goto out;
}
- s->s_flags = flags;
rc = bdi_setup_and_register(&sbi->bdi, "ecryptfs", BDI_CAP_MAP_COPY);
if (rc)
goto out1;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 6aacf3f..0528e2e 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1282,7 +1282,7 @@ static struct dentry *gfs2_mount(struct file_system_type *fs_type, int flags,
error = -EBUSY;
goto error_bdev;
}
- s = sget(fs_type, test_gfs2_super, set_gfs2_super, bdev);
+ s = sget(fs_type, test_gfs2_super, set_gfs2_super, flags, bdev);
mutex_unlock(&bdev->bd_fsfreeze_mutex);
error = PTR_ERR(s);
if (IS_ERR(s))
@@ -1312,7 +1312,6 @@ static struct dentry *gfs2_mount(struct file_system_type *fs_type, int flags,
} else {
char b[BDEVNAME_SIZE];
- s->s_flags = flags;
s->s_mode = mode;
strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
sb_set_blocksize(s, block_size(bdev));
@@ -1356,7 +1355,7 @@ static struct dentry *gfs2_mount_meta(struct file_system_type *fs_type,
dev_name, error);
return ERR_PTR(error);
}
- s = sget(&gfs2_fs_type, test_gfs2_super, set_meta_super,
+ s = sget(&gfs2_fs_type, test_gfs2_super, set_meta_super, flags,
path.dentry->d_inode->i_sb->s_bdev);
path_put(&path);
if (IS_ERR(s)) {
diff --git a/fs/libfs.c b/fs/libfs.c
index 5b2dbb3..38eb46d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -222,15 +222,15 @@ struct dentry *mount_pseudo(struct file_system_type *fs_type, char *name,
const struct super_operations *ops,
const struct dentry_operations *dops, unsigned long magic)
{
- struct super_block *s = sget(fs_type, NULL, set_anon_super, NULL);
+ struct super_block *s;
struct dentry *dentry;
struct inode *root;
struct qstr d_name = {.name = name, .len = strlen(name)};
+ s = sget(fs_type, NULL, set_anon_super, MS_NOUSER, NULL);
if (IS_ERR(s))
return ERR_CAST(s);
- s->s_flags = MS_NOUSER;
s->s_maxbytes = MAX_LFS_FILESIZE;
s->s_blocksize = PAGE_SIZE;
s->s_blocksize_bits = PAGE_SHIFT;
diff --git a/fs/logfs/super.c b/fs/logfs/super.c
index c9ee7f5..deb2798 100644
--- a/fs/logfs/super.c
+++ b/fs/logfs/super.c
@@ -521,7 +521,7 @@ static struct dentry *logfs_get_sb_device(struct logfs_super *super,
log_super("LogFS: Start mount %x\n", mount_count++);
err = -EINVAL;
- sb = sget(type, logfs_sb_test, logfs_sb_set, super);
+ sb = sget(type, logfs_sb_test, logfs_sb_set, flags | MS_NOATIME, super);
if (IS_ERR(sb)) {
super->s_devops->put_device(super);
kfree(super);
@@ -543,7 +543,6 @@ static struct dentry *logfs_get_sb_device(struct logfs_super *super,
*/
sb->s_maxbytes = (1ull << 43) - 1;
sb->s_op = &logfs_super_operations;
- sb->s_flags = flags | MS_NOATIME;
err = logfs_read_sb(sb, sb->s_flags & MS_RDONLY);
if (err)
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 3dfa4f1..925ef28 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2265,7 +2265,7 @@ static struct dentry *nfs_fs_mount(struct file_system_type *fs_type,
sb_mntdata.mntflags |= MS_SYNCHRONOUS;
/* Get a superblock - note that we may end up sharing one that already exists */
- s = sget(fs_type, compare_super, nfs_set_super, &sb_mntdata);
+ s = sget(fs_type, compare_super, nfs_set_super, flags, &sb_mntdata);
if (IS_ERR(s)) {
mntroot = ERR_CAST(s);
goto out_err_nosb;
@@ -2376,7 +2376,7 @@ nfs_xdev_mount(struct file_system_type *fs_type, int flags,
sb_mntdata.mntflags |= MS_SYNCHRONOUS;
/* Get a superblock - note that we may end up sharing one that already exists */
- s = sget(&nfs_fs_type, compare_super, nfs_set_super, &sb_mntdata);
+ s = sget(&nfs_fs_type, compare_super, nfs_set_super, flags, &sb_mntdata);
if (IS_ERR(s)) {
error = PTR_ERR(s);
goto out_err_nosb;
@@ -2645,7 +2645,7 @@ nfs4_remote_mount(struct file_system_type *fs_type, int flags,
sb_mntdata.mntflags |= MS_SYNCHRONOUS;
/* Get a superblock - note that we may end up sharing one that already exists */
- s = sget(&nfs4_fs_type, compare_super, nfs_set_super, &sb_mntdata);
+ s = sget(&nfs4_fs_type, compare_super, nfs_set_super, flags, &sb_mntdata);
if (IS_ERR(s)) {
error = PTR_ERR(s);
goto out_free;
@@ -2906,7 +2906,7 @@ nfs4_xdev_mount(struct file_system_type *fs_type, int flags,
sb_mntdata.mntflags |= MS_SYNCHRONOUS;
/* Get a superblock - note that we may end up sharing one that already exists */
- s = sget(&nfs4_fs_type, compare_super, nfs_set_super, &sb_mntdata);
+ s = sget(&nfs4_fs_type, compare_super, nfs_set_super, flags, &sb_mntdata);
if (IS_ERR(s)) {
error = PTR_ERR(s);
goto out_err_nosb;
@@ -2997,7 +2997,7 @@ nfs4_remote_referral_mount(struct file_system_type *fs_type, int flags,
sb_mntdata.mntflags |= MS_SYNCHRONOUS;
/* Get a superblock - note that we may end up sharing one that already exists */
- s = sget(&nfs4_fs_type, compare_super, nfs_set_super, &sb_mntdata);
+ s = sget(&nfs4_fs_type, compare_super, nfs_set_super, flags, &sb_mntdata);
if (IS_ERR(s)) {
error = PTR_ERR(s);
goto out_err_nosb;
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 08e3d4f..4c81adb 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1288,7 +1288,8 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
err = -EBUSY;
goto failed;
}
- s = sget(fs_type, nilfs_test_bdev_super, nilfs_set_bdev_super, sd.bdev);
+ s = sget(fs_type, nilfs_test_bdev_super, nilfs_set_bdev_super, flags,
+ sd.bdev);
mutex_unlock(&sd.bdev->bd_fsfreeze_mutex);
if (IS_ERR(s)) {
err = PTR_ERR(s);
@@ -1301,7 +1302,6 @@ nilfs_mount(struct file_system_type *fs_type, int flags,
s_new = true;
/* New superblock instance created */
- s->s_flags = flags;
s->s_mode = mode;
strlcpy(s->s_id, bdevname(sd.bdev, b), sizeof(s->s_id));
sb_set_blocksize(s, block_size(sd.bdev));
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 46a15d8..eaa5164 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -111,12 +111,11 @@ static struct dentry *proc_mount(struct file_system_type *fs_type,
options = data;
}
- sb = sget(fs_type, proc_test_super, proc_set_super, ns);
+ sb = sget(fs_type, proc_test_super, proc_set_super, flags, ns);
if (IS_ERR(sb))
return ERR_CAST(sb);
if (!sb->s_root) {
- sb->s_flags = flags;
if (!proc_parse_options(options, ns)) {
deactivate_locked_super(sb);
return ERR_PTR(-EINVAL);
diff --git a/fs/reiserfs/procfs.c b/fs/reiserfs/procfs.c
index 7a99811..de42b03 100644
--- a/fs/reiserfs/procfs.c
+++ b/fs/reiserfs/procfs.c
@@ -404,7 +404,7 @@ static void *r_start(struct seq_file *m, loff_t * pos)
if (l)
return NULL;
- if (IS_ERR(sget(&reiserfs_fs_type, test_sb, set_sb, s)))
+ if (IS_ERR(sget(&reiserfs_fs_type, test_sb, set_sb, 0, s)))
return NULL;
up_write(&s->s_umount);
diff --git a/fs/super.c b/fs/super.c
index 6277ec6..f343eda 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -104,11 +104,12 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
/**
* alloc_super - create new superblock
* @type: filesystem type superblock should belong to
+ * @flags: the mount flags
*
* Allocates and initializes a new &struct super_block. alloc_super()
* returns a pointer new superblock or %NULL if allocation had failed.
*/
-static struct super_block *alloc_super(struct file_system_type *type)
+static struct super_block *alloc_super(struct file_system_type *type, int flags)
{
struct super_block *s = kzalloc(sizeof(struct super_block), GFP_USER);
static const struct super_operations default_op;
@@ -135,6 +136,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
#else
INIT_LIST_HEAD(&s->s_files);
#endif
+ s->s_flags = flags;
s->s_bdi = &default_backing_dev_info;
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
@@ -414,11 +416,13 @@ EXPORT_SYMBOL(generic_shutdown_super);
* @type: filesystem type superblock should belong to
* @test: comparison callback
* @set: setup callback
+ * @flags: mount flags
* @data: argument to each of them
*/
struct super_block *sget(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
+ int flags,
void *data)
{
struct super_block *s = NULL;
@@ -449,7 +453,7 @@ retry:
}
if (!s) {
spin_unlock(&sb_lock);
- s = alloc_super(type);
+ s = alloc_super(type, flags);
if (!s)
return ERR_PTR(-ENOMEM);
goto retry;
@@ -924,13 +928,12 @@ struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
{
struct super_block *sb;
- sb = sget(fs_type, ns_test_super, ns_set_super, data);
+ sb = sget(fs_type, ns_test_super, ns_set_super, flags, data);
if (IS_ERR(sb))
return ERR_CAST(sb);
if (!sb->s_root) {
int err;
- sb->s_flags = flags;
err = fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
if (err) {
deactivate_locked_super(sb);
@@ -991,7 +994,8 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
error = -EBUSY;
goto error_bdev;
}
- s = sget(fs_type, test_bdev_super, set_bdev_super, bdev);
+ s = sget(fs_type, test_bdev_super, set_bdev_super, flags | MS_NOSEC,
+ bdev);
mutex_unlock(&bdev->bd_fsfreeze_mutex);
if (IS_ERR(s))
goto error_s;
@@ -1016,7 +1020,6 @@ struct dentry *mount_bdev(struct file_system_type *fs_type,
} else {
char b[BDEVNAME_SIZE];
- s->s_flags = flags | MS_NOSEC;
s->s_mode = mode;
strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
sb_set_blocksize(s, block_size(bdev));
@@ -1061,13 +1064,11 @@ struct dentry *mount_nodev(struct file_system_type *fs_type,
int (*fill_super)(struct super_block *, void *, int))
{
int error;
- struct super_block *s = sget(fs_type, NULL, set_anon_super, NULL);
+ struct super_block *s = sget(fs_type, NULL, set_anon_super, flags, NULL);
if (IS_ERR(s))
return ERR_CAST(s);
- s->s_flags = flags;
-
error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
if (error) {
deactivate_locked_super(s);
@@ -1090,11 +1091,10 @@ struct dentry *mount_single(struct file_system_type *fs_type,
struct super_block *s;
int error;
- s = sget(fs_type, compare_single, set_anon_super, NULL);
+ s = sget(fs_type, compare_single, set_anon_super, flags, NULL);
if (IS_ERR(s))
return ERR_CAST(s);
if (!s->s_root) {
- s->s_flags = flags;
error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
if (error) {
deactivate_locked_super(s);
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index e34f0d9..a6674da 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -118,13 +118,12 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
for (type = KOBJ_NS_TYPE_NONE; type < KOBJ_NS_TYPES; type++)
info->ns[type] = kobj_ns_grab_current(type);
- sb = sget(fs_type, sysfs_test_super, sysfs_set_super, info);
+ sb = sget(fs_type, sysfs_test_super, sysfs_set_super, flags, info);
if (IS_ERR(sb) || sb->s_fs_info != info)
free_sysfs_super_info(info);
if (IS_ERR(sb))
return ERR_CAST(sb);
if (!sb->s_root) {
- sb->s_flags = flags;
error = sysfs_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
if (error) {
deactivate_locked_super(sb);
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 63765d5..c3488d7 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -2141,7 +2141,7 @@ static struct dentry *ubifs_mount(struct file_system_type *fs_type, int flags,
dbg_gen("opened ubi%d_%d", c->vi.ubi_num, c->vi.vol_id);
- sb = sget(fs_type, sb_test, sb_set, c);
+ sb = sget(fs_type, sb_test, sb_set, flags, c);
if (IS_ERR(sb)) {
err = PTR_ERR(sb);
kfree(c);
@@ -2158,7 +2158,6 @@ static struct dentry *ubifs_mount(struct file_system_type *fs_type, int flags,
goto out_deact;
}
} else {
- sb->s_flags = flags;
err = ubifs_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
if (err)
goto out_deact;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 69cd5bb..d851be9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1864,7 +1864,7 @@ void free_anon_bdev(dev_t);
struct super_block *sget(struct file_system_type *type,
int (*test)(struct super_block *,void *),
int (*set)(struct super_block *,void *),
- void *data);
+ int flags, void *data);
extern struct dentry *mount_pseudo(struct file_system_type *, char *,
const struct super_operations *ops,
const struct dentry_operations *dops,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a5d3b53..d51bcce 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1522,7 +1522,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
opts.new_root = new_root;
/* Locate an existing or new sb for this hierarchy */
- sb = sget(fs_type, cgroup_test_super, cgroup_set_super, &opts);
+ sb = sget(fs_type, cgroup_test_super, cgroup_set_super, 0, &opts);
if (IS_ERR(sb)) {
ret = PTR_ERR(sb);
cgroup_drop_root(opts.new_root);
From: Valerie Aurora <[email protected]>
White out a deleted directory in a union-mounted file system.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 9 ++++-----
1 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index ce941ac..f9e0d68 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3228,11 +3228,6 @@ static long do_rmdir(int dfd, const char __user *pathname)
if (error)
return error;
- /* rmdir() on union mounts not implemented yet */
- error = -EINVAL;
- if (IS_DIR_UNIONED(nd.path.dentry))
- goto exit1;
-
switch(nd.last_type) {
case LAST_DOTDOT:
error = -ENOTEMPTY;
@@ -3261,6 +3256,10 @@ static long do_rmdir(int dfd, const char __user *pathname)
error = security_path_rmdir(&nd.path, path.dentry);
if (error)
goto exit4;
+ if (IS_DIR_UNIONED(nd.path.dentry)) {
+ error = do_whiteout(&nd, &path, 1);
+ goto exit4;
+ }
error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
exit4:
mnt_drop_write(nd.path.mnt);
From: Valerie Aurora <[email protected]>
union_find_dir() returns the path of the directory at the specified
layer in a unioned directory.
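A minimal userspace sketch of the lookup, assuming a per-dentry array of
paths sized by the number of lower layers. The model_* names are
hypothetical stand-ins, assert() plays the role of BUG_ON(), and the
convention that an unused slot has a NULL mnt is an assumption of the model:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the kernel types. */
struct model_path {
	const char *mnt;	/* NULL models "no directory at this layer" */
	const char *dentry;
};

struct model_union_stack {
	unsigned int count;		/* models sb->s_union_count */
	struct model_path dirs[];	/* models u_dirs[] */
};

/* Allocate a zeroed stack with room for @count lower layers. */
static struct model_union_stack *model_union_alloc(unsigned int count)
{
	struct model_union_stack *us;

	us = calloc(1, sizeof(*us) + count * sizeof(us->dirs[0]));
	if (us)
		us->count = count;
	return us;
}

/* Bounds-checked lookup of the slot for @layer. */
static struct model_path *model_find_dir(struct model_union_stack *us,
					 unsigned int layer)
{
	assert(layer < us->count);
	return &us->dirs[layer];
}
```

The returned pointer refers to the slot itself, so a caller can either read
the path stored there or fill it in.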
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/union.h | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/fs/union.h b/fs/union.h
index d42dc09..f90d037 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -19,6 +19,7 @@
#include <linux/mount.h>
#include <linux/dcache.h>
#include <linux/path.h>
+#include <linux/bug.h>
/*
* WARNING! Confusing terminology alert.
@@ -50,4 +51,20 @@ struct union_stack {
struct path u_dirs[0];
};
+static inline
+struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
+{
+ BUG_ON(layer >= dentry->d_sb->s_union_count);
+ return &dentry->d_union_stack->u_dirs[layer];
+}
+
+#else /* CONFIG_UNION_MOUNT */
+
+static inline
+struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
+{
+ BUG();
+ return NULL;
+}
+
#endif /* CONFIG_UNION_MOUNT */
From: Valerie Aurora <[email protected]>
union_add_dir() fills out the union stack for the topmost dentry with
the path of the directory in this layer of the union.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/union.c | 28 ++++++++++++++++++++++++++++
fs/union.h | 8 ++++++++
2 files changed, 36 insertions(+), 0 deletions(-)
diff --git a/fs/union.c b/fs/union.c
index 77d6a74..1e459b0 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -60,3 +60,31 @@ void d_free_unions(struct dentry *topmost)
kfree(topmost->d_union_stack);
topmost->d_union_stack = NULL;
}
+
+/**
+ * union_add_dir - Add another layer to a unioned directory
+ * @topmost: topmost directory
+ * @lower: directory in the current layer
+ * @layer: index of layer to add this at
+ *
+ * @layer counts starting at 0 for the dir below the topmost dir.
+ *
+ * This transfers the caller's references to the constituents of *lower to the
+ * union stack.
+ */
+int union_add_dir(struct path *topmost, struct path *lower, unsigned layer)
+{
+ struct dentry *dentry = topmost->dentry;
+ struct path *path;
+
+ BUG_ON(layer >= dentry->d_sb->s_union_count);
+
+ if (!dentry->d_union_stack)
+ dentry->d_union_stack = union_alloc(topmost);
+ if (!dentry->d_union_stack)
+ return -ENOMEM;
+
+ path = union_find_dir(dentry, layer);
+ *path = *lower;
+ return 0;
+}
diff --git a/fs/union.h b/fs/union.h
index 04a02ec..f39c88d 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -57,6 +57,7 @@ static inline bool IS_DIR_UNIONED(struct dentry *dentry)
}
extern void d_free_unions(struct dentry *);
+extern int union_add_dir(struct path *, struct path *, unsigned int);
static inline
struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
@@ -77,4 +78,11 @@ struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
static inline bool IS_DIR_UNIONED(struct dentry *dentry) { return false; }
static inline void d_free_unions(struct dentry *dentry) {}
+static inline
+int union_add_dir(struct path *topmost, struct path *lower, unsigned layer)
+{
+ BUG();
+ return 0;
+}
+
#endif /* CONFIG_UNION_MOUNT */
Make the link, chmod, chown, utimes, setxattr and removexattr syscalls and their
variants aware of unionmounts by passing LOOKUP_COPY_UP to pathwalk.
This has the downside that the copyup will occur even if permission is later
denied to perform the operation.
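The downside can be illustrated with a small userspace model. Everything
here is hypothetical (struct model_file, the MODEL_LOOKUP_COPY_UP value and
the simplified ownership check); it only shows the order of events, with the
copy up happening during lookup and the permission check coming afterwards:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LOOKUP_COPY_UP 0x1u	/* hypothetical flag value */

/* Hypothetical stand-in for a file reached through a union mount. */
struct model_file {
	bool on_upper;		/* true once the file has been copied up */
	unsigned int owner;
};

/* Models pathwalk: the copy up is a side effect of the lookup itself. */
static void model_pathwalk(struct model_file *f, unsigned int flags)
{
	assert(f != NULL);
	if ((flags & MODEL_LOOKUP_COPY_UP) && !f->on_upper)
		f->on_upper = true;
}

/* Models a chmod-like syscall: look up first, check permission second. */
static int model_chmod(struct model_file *f, unsigned int caller_uid)
{
	model_pathwalk(f, MODEL_LOOKUP_COPY_UP);
	if (caller_uid != 0 && caller_uid != f->owner)
		return -EPERM;
	return 0;
}
```

In the model, a caller who does not own the file gets -EPERM, yet the file
has already been copied up by the time the error is returned.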
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 2 +-
fs/open.c | 5 +++--
fs/utimes.c | 2 +-
fs/xattr.c | 10 ++++++----
4 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6ec5725..efad85e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3583,7 +3583,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (flags & AT_SYMLINK_FOLLOW)
how |= LOOKUP_FOLLOW;
- error = user_path_at(olddfd, oldname, how, &old_path);
+ error = user_path_at(olddfd, oldname, how | LOOKUP_COPY_UP, &old_path);
if (error)
return error;
diff --git a/fs/open.c b/fs/open.c
index d3be9e3..bce645b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -516,7 +516,8 @@ SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, umode_t, mode
struct path path;
int error;
- error = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ error = user_path_at(dfd, filename, LOOKUP_FOLLOW | LOOKUP_COPY_UP,
+ &path);
if (!error) {
error = chmod_common(&path, mode);
path_put(&path);
@@ -569,7 +570,7 @@ SYSCALL_DEFINE5(fchownat, int, dfd, const char __user *, filename, uid_t, user,
lookup_flags = (flag & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW;
if (flag & AT_EMPTY_PATH)
lookup_flags |= LOOKUP_EMPTY;
- error = user_path_at(dfd, filename, lookup_flags, &path);
+ error = user_path_at(dfd, filename, lookup_flags | LOOKUP_COPY_UP, &path);
if (error)
goto out;
error = mnt_want_write(path.mnt);
diff --git a/fs/utimes.c b/fs/utimes.c
index ba653f3..5fe9ed5 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -154,7 +154,7 @@ long do_utimes(int dfd, const char __user *filename, struct timespec *times,
fput(file);
} else {
struct path path;
- int lookup_flags = 0;
+ int lookup_flags = LOOKUP_COPY_UP;
if (!(flags & AT_SYMLINK_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
diff --git a/fs/xattr.c b/fs/xattr.c
index 82f4337..b1d8b4c 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -351,7 +351,8 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
struct path path;
int error;
- error = user_path(pathname, &path);
+ error = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW | LOOKUP_COPY_UP,
+ &path);
if (error)
return error;
error = mnt_want_write(path.mnt);
@@ -370,7 +371,7 @@ SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
struct path path;
int error;
- error = user_lpath(pathname, &path);
+ error = user_path_at(AT_FDCWD, pathname, LOOKUP_COPY_UP, &path);
if (error)
return error;
error = mnt_want_write(path.mnt);
@@ -580,7 +581,8 @@ SYSCALL_DEFINE2(removexattr, const char __user *, pathname,
struct path path;
int error;
- error = user_path(pathname, &path);
+ error = user_path_at(AT_FDCWD, pathname, LOOKUP_FOLLOW | LOOKUP_COPY_UP,
+ &path);
if (error)
return error;
error = mnt_want_write(path.mnt);
@@ -598,7 +600,7 @@ SYSCALL_DEFINE2(lremovexattr, const char __user *, pathname,
struct path path;
int error;
- error = user_lpath(pathname, &path);
+ error = user_path_at(AT_FDCWD, pathname, LOOKUP_COPY_UP, &path);
if (error)
return error;
error = mnt_want_write(path.mnt);
Override the current creds when creating a file for copy up purposes so that
the ownership of that file is set correctly.
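The override/revert pattern can be sketched in userspace like this. It is
only a model under stated assumptions: struct model_cred and the model_*()
functions are hypothetical stand-ins for the kernel's credential handling,
and file creation is reduced to reporting the owner that would be assigned:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cred; only the fields used here. */
struct model_cred {
	unsigned int fsuid;
	unsigned int fsgid;
};

/* Models the current task's credentials. */
static const struct model_cred *model_current_cred;

/* Models override_creds(): install @new, hand back the previous creds. */
static const struct model_cred *
model_override_creds(const struct model_cred *new)
{
	const struct model_cred *old = model_current_cred;

	model_current_cred = new;
	return old;
}

/* Models revert_creds(): restore what the override returned. */
static void model_revert_creds(const struct model_cred *old)
{
	model_current_cred = old;
}

/* "Create" a file; the owner comes from whichever creds are in force. */
static unsigned int model_create_file(void)
{
	assert(model_current_cred != NULL);
	return model_current_cred->fsuid;
}

/* Copy up: create with the lower file's owner, then restore the caller. */
static unsigned int model_copyup(unsigned int lower_uid,
				 unsigned int lower_gid)
{
	struct model_cred override = { .fsuid = lower_uid, .fsgid = lower_gid };
	const struct model_cred *saved = model_override_creds(&override);
	unsigned int owner = model_create_file();

	model_revert_creds(saved);
	return owner;
}
```

The point of funnelling every exit through a single revert is that the
caller's own credentials are back in place whether or not the creation
succeeded.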
Signed-off-by: David Howells <[email protected]>
---
fs/union.c | 22 +++++++++++++++++++---
1 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/union.c b/fs/union.c
index b8ee42d..264da49 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -556,28 +556,44 @@ int union_copyup_file(struct nameidata *nd, struct path *lower,
struct dentry *dentry, size_t len)
{
struct path *parent = &nd->path;
+ const struct cred *saved_cred;
+ struct cred *override_cred;
int error;
BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+ override_cred = prepare_kernel_cred(NULL);
+ if (!override_cred)
+ return -ENOMEM;
+
+ override_cred->fsuid = lower->dentry->d_inode->i_uid;
+ override_cred->fsgid = lower->dentry->d_inode->i_gid;
+
+ saved_cred = override_creds(override_cred);
+
if (S_ISREG(lower->dentry->d_inode->i_mode)) {
error = union_create_file(nd, lower, dentry);
if (error)
- return error;
+ goto out;
error = union_copyup_data(lower, parent->mnt, dentry, len);
} else if (S_ISLNK(lower->dentry->d_inode->i_mode)) {
- return union_create_symlink(nd, lower, dentry);
+ error = union_create_symlink(nd, lower, dentry);
+ goto out;
} else {
/* Don't currently support copyup of special files, though in
* theory there's no reason we couldn't at least copy up
* blockdev, chrdev and FIFO files
*/
- return -EXDEV;
+ error = -EXDEV;
+ goto out;
}
if (error)
/* Most likely error: ENOSPC */
vfs_unlink(parent->dentry->d_inode, dentry);
+out:
+ revert_creds(saved_cred);
+ put_cred(override_cred);
return error;
}
The union mount design requires that, by the time lookup completes, every
directory on the path exists in the topmost layer. This means we never have
to double back and create a whole path's worth of directories when we first
copy up a file in a directory, which greatly simplifies locking and error
handling.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
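The xattr copy-up in this patch walks the buffer filled in by vfs_listxattr(),
which holds the attribute names packed back to back, each NUL terminated. A
user-space sketch of that walk, with hypothetical names (walk_xattr_names(),
sum_name_len()) and a callback standing in for the getxattr/setxattr pair:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callback invoked once per attribute name; non-zero stops the walk. */
typedef int (*xattr_name_fn)(const char *name, void *ctx);

/* Walk a listxattr()-style buffer of NUL-terminated names.
 * Returns the number of names visited, or the callback's non-zero result.
 */
int walk_xattr_names(const char *buf, size_t list_size,
		     xattr_name_fn fn, void *ctx)
{
	const char *name;
	int count = 0;

	for (name = buf; name < buf + list_size; name += strlen(name) + 1) {
		int err = fn(name, ctx);

		if (err)
			return err;
		count++;
	}
	return count;
}

/* Example callback: accumulate the total length of all names. */
int sum_name_len(const char *name, void *ctx)
{
	*(size_t *)ctx += strlen(name);
	return 0;
}
```

The loop relies on list_size being exactly the length the list call returned,
so the final NUL ends the walk without a separate sentinel.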
fs/union.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 9 +++++
2 files changed, 123 insertions(+), 0 deletions(-)
diff --git a/fs/union.c b/fs/union.c
index 1e459b0..f183051 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -20,6 +20,8 @@
#include <linux/fs_struct.h>
#include <linux/slab.h>
#include <linux/namei.h>
+#include <linux/fsnotify.h>
+#include <linux/xattr.h>
#include "union.h"
@@ -88,3 +90,115 @@ int union_add_dir(struct path *topmost, struct path *lower, unsigned layer)
*path = *lower;
return 0;
}
+
+/**
+ * union_copyup_xattr
+ * @old: dentry of original file
+ * @new: dentry of new copy
+ *
+ * Copy up extended attributes from the original file to the new one.
+ *
+ * XXX - Permissions? For now, copying up every xattr.
+ */
+static int union_copyup_xattr(struct dentry *old, struct dentry *new)
+{
+ ssize_t list_size, size;
+ char *buf, *name, *value;
+ int error;
+
+ /* Check for xattr support */
+ if (!old->d_inode->i_op->getxattr ||
+ !new->d_inode->i_op->getxattr)
+ return 0;
+
+ /* Find out how big the list of xattrs is */
+ list_size = vfs_listxattr(old, NULL, 0);
+ if (list_size <= 0)
+ return list_size;
+
+ /* Allocate memory for the list */
+ buf = kzalloc(list_size, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ /* Allocate memory for the xattr's value */
+ error = -ENOMEM;
+ value = kmalloc(XATTR_SIZE_MAX, GFP_KERNEL);
+ if (!value)
+ goto out;
+
+ /* Actually get the list of xattrs */
+ list_size = vfs_listxattr(old, buf, list_size);
+ if (list_size <= 0) {
+ error = list_size;
+ goto out_free_value;
+ }
+
+ for (name = buf; name < (buf + list_size); name += strlen(name) + 1) {
+ /* XXX Locking? old is on read-only fs */
+ size = vfs_getxattr(old, name, value, XATTR_SIZE_MAX);
+ if (size <= 0) {
+ error = size;
+ goto out_free_value;
+ }
+ /* XXX do we really need to check for size overflow? */
+ /* XXX locks new dentry, lock ordering problems? */
+ error = vfs_setxattr(new, name, value, size, 0);
+ if (error)
+ goto out_free_value;
+ }
+
+out_free_value:
+ kfree(value);
+out:
+ kfree(buf);
+ return error;
+}
+
+/**
+ * union_create_topmost_dir - Create a matching dir in the topmost file system
+ * @parent - parent of target on topmost layer
+ * @name - name of target
+ * @topmost - path of target on topmost layer
+ * @lower - path of source on lower layer
+ *
+ * As we lookup each directory on the lower layer of a union, we create a
+ * matching directory on the topmost layer if it does not already exist.
+ *
+ * We don't use vfs_mkdir() for a few reasons: don't want to do the security
+ * check, don't want to make the dir opaque, don't need to sanitize the mode.
+ *
+ * XXX - owner is wrong, set credentials properly
+ * XXX - rmdir() directory on failure of xattr copyup
+ * XXX - not atomic w/ respect to crash
+ */
+int union_create_topmost_dir(struct path *parent, struct qstr *name,
+ struct path *topmost, struct path *lower)
+{
+ struct inode *dir = parent->dentry->d_inode;
+ int mode = lower->dentry->d_inode->i_mode;
+ int error;
+
+ BUG_ON(topmost->dentry->d_inode);
+
+ /* XXX - Do we even need to check this? */
+ if (!dir->i_op->mkdir)
+ return -EPERM;
+
+ error = mnt_want_write(parent->mnt);
+ if (error)
+ return error;
+
+ error = dir->i_op->mkdir(dir, topmost->dentry, mode);
+ if (error)
+ goto out;
+
+ error = union_copyup_xattr(lower->dentry, topmost->dentry);
+ if (error)
+ dput(topmost->dentry);
+
+ fsnotify_mkdir(dir, topmost->dentry);
+out:
+ mnt_drop_write(parent->mnt);
+ return error;
+}
diff --git a/fs/union.h b/fs/union.h
index f39c88d..e918a04 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -58,6 +58,8 @@ static inline bool IS_DIR_UNIONED(struct dentry *dentry)
extern void d_free_unions(struct dentry *);
extern int union_add_dir(struct path *, struct path *, unsigned int);
+extern int union_create_topmost_dir(struct path *, struct qstr *, struct path *,
+ struct path *);
static inline
struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
@@ -85,4 +87,11 @@ int union_add_dir(struct path *topmost, struct path *lower, unsigned layer)
return 0;
}
+static inline int union_create_topmost_dir(struct path *parent, struct qstr *name,
+ struct path *topmost, struct path *lower)
+{
+ BUG();
+ return 0;
+}
+
#endif /* CONFIG_UNION_MOUNT */
From: Valerie Aurora <[email protected]>
Passing the CL_NO_SHARED flag to clone_mnt() causes the clone to fail
if the source mnt is shared.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Ram Pai <[email protected]>
---
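A toy model of the check, for illustration only. CL_NO_SHARED uses the value
from this patch; struct toy_mount, TOY_MNT_SHARED and toy_clone_mnt() are
hypothetical and return an errno directly rather than an ERR_PTR().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define CL_NO_SHARED	0x20	/* value taken from the patch */
#define TOY_MNT_SHARED	0x01	/* hypothetical "this mount is shared" bit */

struct toy_mount {
	int flags;
};

/* Clone @old into *@out. Returns 0 on success or a negative errno. */
int toy_clone_mnt(const struct toy_mount *old, int flag,
		  struct toy_mount **out)
{
	struct toy_mount *mnt;

	/* Refuse before allocating anything, as the patch does. */
	if ((flag & CL_NO_SHARED) && (old->flags & TOY_MNT_SHARED))
		return -EINVAL;

	mnt = malloc(sizeof(*mnt));
	if (!mnt)
		return -ENOMEM;
	mnt->flags = old->flags;
	*out = mnt;
	return 0;
}
```

Doing the test first means the failure path has nothing to unwind.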
fs/namespace.c | 3 +++
fs/pnode.h | 1 +
2 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 35c3b80..f92f574 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -740,6 +740,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
struct mount *mnt;
int err;
+ if ((flag & CL_NO_SHARED) && IS_MNT_SHARED(old))
+ return ERR_PTR(-EINVAL);
+
mnt = alloc_vfsmnt(old->mnt_devname);
if (!mnt)
return ERR_PTR(-ENOMEM);
diff --git a/fs/pnode.h b/fs/pnode.h
index 65c6097..c7089dd 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -22,6 +22,7 @@
#define CL_COPY_ALL 0x04
#define CL_MAKE_SHARED 0x08
#define CL_PRIVATE 0x10
+#define CL_NO_SHARED 0x20
static inline void set_mnt_shared(struct mount *mnt)
{
From: Valerie Aurora <[email protected]>
Add support for fallthru directory entries to ext2.
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]
---
fs/ext2/dir.c | 49 +++++++++++++++++++++++++++++++++++++++++------
fs/ext2/ext2.h | 1 +
fs/ext2/namei.c | 22 +++++++++++++++++++++
fs/ext2/super.c | 2 ++
include/linux/ext2_fs.h | 4 ++++
5 files changed, 72 insertions(+), 6 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index df4d6b1..5fd6bbe 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -220,7 +220,9 @@ fail:
static inline int ext2_dirent_in_use(struct ext2_dir_entry_2 *de)
{
- return de->inode != 0 || de->file_type == EXT2_FT_WHT;
+ return de->inode != 0 ||
+ de->file_type == EXT2_FT_WHT ||
+ de->file_type == EXT2_FT_FALLTHRU;
}
/*
@@ -270,6 +272,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
[EXT2_FT_WHT] = DT_WHT,
+ [EXT2_FT_FALLTHRU] = DT_UNKNOWN,
};
#define S_SHIFT 12
@@ -355,8 +358,20 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
offset = (char *)de - kaddr;
over = filldir(dirent, de->name, de->name_len,
- (n<<PAGE_CACHE_SHIFT) | offset,
- le32_to_cpu(de->inode), d_type);
+ (n << PAGE_CACHE_SHIFT) | offset,
+ le32_to_cpu(de->inode), d_type);
+ if (over) {
+ ext2_put_page(page);
+ return 0;
+ }
+ } else if (de->file_type == EXT2_FT_FALLTHRU) {
+ int over;
+
+ offset = (char *)de - kaddr;
+ /* XXX placeholder until generic_readdir_fallthru() arrives */
+ over = filldir(dirent, de->name, de->name_len,
+ (n<<PAGE_CACHE_SHIFT) | offset,
+ 1, DT_UNKNOWN); /* XXX */
if (over) {
ext2_put_page(page);
return 0;
@@ -487,6 +502,10 @@ ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
spin_lock(&dentry->d_lock);
dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&dentry->d_lock);
+ } else if (!res && de->file_type == EXT2_FT_FALLTHRU) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
}
ext2_put_page(page);
}
@@ -580,7 +599,9 @@ int ext2_add_entry(struct dentry *dentry, ino_t ino, umode_t mode,
rec_len = ext2_rec_len_from_disk(de->rec_len);
if (ext2_match(namelen, name, de)) {
err = -EEXIST;
- /* XXX handle whiteouts and fallthroughs here */
+ /* XXX handle whiteouts here too */
+ if (de->file_type != EXT2_FT_FALLTHRU)
+ goto out_unlock;
printk("%s: found existing de\n", dentry->d_name.name);
goto got_it;
}
@@ -619,9 +640,17 @@ got_it:
err = -EEXIST;
if (ext2_match(namelen, name, de)) {
switch (de->file_type) {
+ case EXT2_FT_FALLTHRU:
+ if (new_file_type == EXT2_FT_FALLTHRU) {
+ WARN(1, "Ext2: Can't turn fallthru into fallthru: %s\n",
+ dentry->d_name.name);
+ goto out_unlock;
+ }
+ break;
case EXT2_FT_WHT:
- if (new_file_type == EXT2_FT_WHT) {
- WARN(1, "Ext2: Can't turn whiteout into whiteout: %s\n",
+ if (new_file_type == EXT2_FT_WHT ||
+ new_file_type == EXT2_FT_FALLTHRU) {
+ WARN(1, "Ext2: Can't turn whiteout into fallthru/whiteout: %s\n",
dentry->d_name.name);
goto out_unlock;
}
@@ -675,6 +704,14 @@ int ext2_whiteout_entry(struct dentry *dentry)
}
/*
+ * Create a fallthru entry.
+ */
+int ext2_fallthru_entry(struct dentry *dentry)
+{
+ return ext2_add_entry(dentry, 0, 0, EXT2_FT_FALLTHRU);
+}
+
+/*
* ext2_delete_entry deletes a directory entry by merging it with the
* previous entry. Page is up-to-date. Releases the page.
*/
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b285d9a..dd82bc6 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -107,6 +107,7 @@ extern int ext2_make_empty(struct inode *, struct inode *);
extern struct ext2_dir_entry_2 * ext2_find_entry (struct inode *,struct qstr *, struct page **);
extern int ext2_delete_entry (struct ext2_dir_entry_2 *, struct page *);
extern int ext2_whiteout_entry(struct dentry *);
+extern int ext2_fallthru_entry(struct dentry *);
extern int ext2_empty_dir (struct inode *);
extern struct ext2_dir_entry_2 * ext2_dotdot (struct inode *, struct page **);
extern void ext2_set_link(struct inode *, struct ext2_dir_entry_2 *, struct page *, struct inode *, int);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 8227267..957c4b9 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -331,6 +331,7 @@ static int ext2_whiteout(struct inode *dir, struct dentry *dentry,
goto out;
spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags &= ~DCACHE_FALLTHRU;
new_dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&new_dentry->d_lock);
d_add(new_dentry, NULL);
@@ -349,6 +350,26 @@ out:
return err;
}
+/*
+ * Create a fallthru entry.
+ */
+static int ext2_fallthru(struct inode *dir, struct dentry *dentry)
+{
+ int err;
+
+ dquot_initialize(dir);
+
+ err = ext2_fallthru_entry(dentry);
+ if (err)
+ return err;
+
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+ d_instantiate(dentry, NULL);
+ return 0;
+}
+
static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
struct inode * new_dir, struct dentry * new_dentry )
{
@@ -447,6 +468,7 @@ const struct inode_operations ext2_dir_inode_operations = {
.rmdir = ext2_rmdir,
.mknod = ext2_mknod,
.whiteout = ext2_whiteout,
+ .fallthru = ext2_fallthru,
.rename = ext2_rename,
#ifdef CONFIG_EXT2_FS_XATTR
.setxattr = generic_setxattr,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 8869794..bafd421 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1100,6 +1100,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_WHITEOUT))
sb->s_flags |= MS_WHITEOUT;
+ if (EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_FALLTHRU))
+ sb->s_flags |= MS_FALLTHRU;
if (ext2_setup_super (sb, es, sb->s_flags & MS_RDONLY))
sb->s_flags |= MS_RDONLY;
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 2202faa..cd6d533 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -506,11 +506,14 @@ struct ext2_super_block {
#define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008
#define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
#define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
+/* ext3/4 incompat flags take up the intervening constants */
+#define EXT2_FEATURE_INCOMPAT_FALLTHRU 0x2000
#define EXT2_FEATURE_INCOMPAT_ANY 0xffffffff
#define EXT2_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT2_FEATURE_INCOMPAT_FILETYPE| \
EXT2_FEATURE_INCOMPAT_WHITEOUT| \
+ EXT2_FEATURE_INCOMPAT_FALLTHRU| \
EXT2_FEATURE_INCOMPAT_META_BG)
#define EXT2_FEATURE_RO_COMPAT_SUPP (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
@@ -578,6 +581,7 @@ enum {
EXT2_FT_SOCK = 6,
EXT2_FT_SYMLINK = 7,
EXT2_FT_WHT = 8,
+ EXT2_FT_FALLTHRU = 9,
EXT2_FT_MAX
};
From: Valerie Aurora <[email protected]>
Prevent bind mounts of parts of union mounts.
XXX - Bind mounting parts of union mounts is probably easy to
implement, but requires some careful thought about corner cases,
extensive testing, and some refactoring of the code.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 3c950fa..c990f69 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1845,6 +1845,12 @@ static int do_loopback(struct path *path, char *old_name,
if (IS_MNT_UNBINDABLE(old))
goto out2;
+ /* XXX - Mounting a subtree of a union mount elsewhere requires careful
+ * thought and some refactoring.
+ */
+ if (IS_MNT_UNION(old_path.mnt))
+ goto out2;
+
if (!check_mnt(real_mount(path->mnt)) || !check_mnt(old))
goto out2;
From: Valerie Aurora <[email protected]>
Currently ext2 checks if a directory entry is in-use by checking if the inode
is non-zero. Fallthrus and whiteouts will have zero inode but be in-use. Add
a function to abstract out the directory entry in-use test.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
cc: [email protected]
---
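To show where the helper is headed once whiteouts and fallthrus are wired in,
here is a user-space sketch of the extended test. The struct and the TOY_*
names are hypothetical; the numeric file types (8 for whiteout, 9 for
fallthru) are the ones used elsewhere in this series.

```c
#include <assert.h>
#include <stdint.h>

enum {
	TOY_FT_UNKNOWN	= 0,
	TOY_FT_REG_FILE	= 1,
	TOY_FT_WHT	= 8,	/* whiteout, as in this series */
	TOY_FT_FALLTHRU	= 9,	/* fallthru, as in this series */
};

/* Cut-down directory entry: only the fields the test looks at. */
struct toy_dirent {
	uint32_t inode;
	uint8_t file_type;
};

/* An entry is in use if it names an inode, or if it is one of the
 * inode-less placeholder types.
 */
int toy_dirent_in_use(const struct toy_dirent *de)
{
	return de->inode != 0 ||
	       de->file_type == TOY_FT_WHT ||
	       de->file_type == TOY_FT_FALLTHRU;
}
```

This is also why the patch clears de->file_type when it initialises a fresh
chunk: with the extended test, a zero inode alone no longer marks a slot as
free.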
fs/ext2/dir.c | 12 +++++++++---
1 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d37df35..89015f1 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -218,6 +218,11 @@ fail:
return ERR_PTR(-EIO);
}
+static inline int ext2_dirent_in_use(struct ext2_dir_entry_2 *de)
+{
+ return de->inode != 0;
+}
+
/*
* NOTE! unlike strncmp, ext2_match returns 1 for success, 0 for failure.
*
@@ -228,7 +233,7 @@ static inline int ext2_match (int len, const char * const name,
{
if (len != de->name_len)
return 0;
- if (!de->inode)
+ if (!ext2_dirent_in_use(de))
return 0;
return !memcmp(name, de->name, len);
}
@@ -527,6 +532,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
rec_len = chunk_size;
de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
+ de->file_type = 0;
goto got_it;
}
if (de->rec_len == 0) {
@@ -540,7 +546,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
goto out_unlock;
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
- if (!de->inode && rec_len >= reclen)
+ if (!ext2_dirent_in_use(de) && rec_len >= reclen)
goto got_it;
if (rec_len >= name_len + reclen)
goto got_it;
@@ -558,7 +564,7 @@ got_it:
err = ext2_prepare_chunk(page, pos, rec_len);
if (err)
goto out_unlock;
- if (de->inode) {
+ if (ext2_dirent_in_use(de)) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
de->rec_len = ext2_rec_len_to_disk(name_len);
Add a wrapper function for lookup_union_locked() that locks the parent
directory and follows the mount after lookup. This is appropriate for calling
from do_lookup() when in refwalk mode.
Also add an RCU-mode pathwalk lookup function. This need not leave RCU-mode if
the upper dentry is appropriately assembled or the lower dentry can be validly
used.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Further development)
---
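The decision made by the RCU-mode lookup can be summarised with a small
user-space model. It is a simplification under stated assumptions: dentries
are reduced to an enum, seqcount validation and d_revalidate() are left out,
and all toy_* names are hypothetical. It returns the lower layer index to
switch to, TOY_USE_UPPER to keep the upper dentry, or TOY_UNLAZY to drop to
refwalk mode.

```c
#include <assert.h>

enum toy_dentry {
	TOY_UNCACHED,	/* not in the dcache: RCU-walk cannot look it up */
	TOY_NEGATIVE,
	TOY_WHITEOUT,
	TOY_FALLTHRU,
	TOY_FILE,
	TOY_DIR,
};

#define TOY_USE_UPPER	(-1)
#define TOY_UNLAZY	(-2)

/* A whiteout always blocks; in an opaque dir anything but a fallthru does. */
static int toy_blocks(enum toy_dentry d, int parent_opaque)
{
	return d == TOY_WHITEOUT || (parent_opaque && d != TOY_FALLTHRU);
}

int toy_lookup_union_rcu(enum toy_dentry upper, int parent_opaque,
			 const enum toy_dentry *lower, int layers)
{
	int layer;

	/* Positive upper dentry: already copied up, nothing to do. */
	if (upper == TOY_FILE || upper == TOY_DIR)
		return TOY_USE_UPPER;
	if (toy_blocks(upper, parent_opaque))
		return TOY_USE_UPPER;

	for (layer = 0; layer < layers; layer++) {
		switch (lower[layer]) {
		case TOY_UNCACHED:	/* needs a real lookup */
		case TOY_DIR:		/* needs copying up */
			return TOY_UNLAZY;
		case TOY_FILE:
			return layer;
		default:
			if (toy_blocks(lower[layer], parent_opaque))
				return TOY_USE_UPPER;
			break;		/* plain negative: try next layer */
		}
	}
	return TOY_USE_UPPER;
}
```

Anything that would require creating a dentry or copying up ends in
TOY_UNLAZY, which corresponds to the function returning false.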
fs/namei.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 147 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 2d69ce1..c0adf4c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1195,6 +1195,9 @@ static int __lookup_union(struct nameidata *nd, struct qstr *name,
* layer's directory to the union stack for the topmost
* directory.
*/
+#warning what if the directory is managed?
+#warning should we d_revalidate the lower dentry?
+#warning how to handle automounts?
follow_mount(&lower);
if (!topmost->dentry->d_inode) {
@@ -1277,6 +1280,144 @@ static int lookup_union_locked(struct nameidata *nd, struct qstr *name,
}
/*
+ * lookup_union - union mount-aware part of do_lookup()
+ *
+ * do_lookup()-style wrapper for lookup_union_locked(). Follows mounts.
+ */
+static int lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct dentry *parent = nd->path.dentry;
+ struct inode *dir = parent->d_inode;
+ int err;
+
+ mutex_lock(&dir->i_mutex);
+ err = lookup_union_locked(nd, name, topmost);
+ mutex_unlock(&dir->i_mutex);
+ if (err)
+ return err;
+
+ return follow_managed(topmost, nd->flags);
+}
+
+/*
+ * lookup_union_rcu - Handle union mounted dentries in RCU-walk mode
+ * @nd: The current pathwalk state (refers to @parent currently)
+ * @parent: The parent directory (holds the union stack)
+ * @path: The point just looked up in @parent
+ * @parent_seq: The d_seq of @parent at the point of lookup
+ * @inode: The inode at @path (*@inode is NULL if negative dentry)
+ *
+ * Handle a dentry that represents a non-directory file or a hole/reference in
+ * a union mount upperfs. This involves transiting to the lower file, provided
+ * we aren't going to open the lower file for writing - otherwise we have to
+ * copy the file up (which we can't do in rcuwalk mode).
+ *
+ * Directories are handled differently: they're unconditionally and completely
+ * mirrored from the lowerfs to the upperfs as soon as we encounter them in a
+ * lookup. However, since we don't create dentries in rcuwalk mode, this will
+ * be handled automatically by refwalk mode.
+ *
+ * We return true if we don't need to do anything or if we've successfully
+ * updated the path. If we need to drop out of RCU-walk and go to refwalk
+ * mode, we return false.
+ */
+static bool lookup_union_rcu(struct nameidata *nd,
+ struct dentry *parent,
+ struct path *path,
+ unsigned parent_seq,
+ struct inode **inode)
+{
+ struct dentry *dentry = path->dentry;
+ struct inode *parent_inode = nd->inode;
+ unsigned layer, layers;
+
+ /* Handle non-unionmount dentries first. The union stack will have
+ * been built during the initial lookup of the parent dir, so if it's
+ * not there, it's not unioned.
+ */
+ if (!IS_DIR_UNIONED(parent))
+ return true;
+
+ /* If it's positive then no further lookup is needed: the file or
+ * directory has been copied up and the user gets to play with that.
+ */
+ if (*inode)
+ return true;
+
+ /* If this dentry is a blocker, then stop here. */
+ if (d_is_whiteout(dentry) ||
+ (IS_OPAQUE(parent_inode) && !d_is_fallthru(dentry)))
+ return true;
+
+ /* At this point we have a negative dentry in the unionmount that may
+ * be overlaying a non-directory file in a lower filesystem, so we loop
+ * through the union stack of the parent directory to try to find a
+ * usable dentry further down.
+ */
+ layers = parent->d_sb->s_union_count;
+ for (layer = 0; layer < layers; layer++) {
+ /* Look for a matching dentry in this layer, assuming it's
+ * still valid. Since the lower fs is hard locked R/O,
+ * revalidation ought to be unnecessary.
+ */
+ unsigned ldseq, seq;
+ struct dentry *lower_dir, *lower;
+ struct path *lower_path = union_find_dir(parent, layer);
+ if (!lower_path->mnt)
+ continue;
+
+ lower_dir = lower_path->dentry;
+ ldseq = read_seqcount_begin(&lower_dir->d_seq);
+
+ if (unlikely(lower_dir->d_flags & DCACHE_OP_REVALIDATE)) {
+ if (unlikely(d_revalidate(lower_dir, nd) <= 0) ||
+ __read_seqcount_retry(&lower_dir->d_seq, ldseq))
+ return false;
+ }
+
+ lower = __d_lookup_rcu(lower_dir, &dentry->d_name, &seq, inode);
+ if (!lower)
+ return false;
+
+ /* We've got a negative dentry which can mean several things: a
+ * plain negative dentry is ignored and lookup continues to the
+ * next layer; but a whiteout or a non-fallthru in an opaque
+ * dir covers everything below it.
+ */
+ if (!*inode) {
+ if (d_is_whiteout(lower) ||
+ (IS_OPAQUE(parent_inode) && !d_is_fallthru(lower))) {
+ if (read_seqcount_retry(&lower_dir->d_seq,
+ ldseq))
+ return false;
+ return true;
+ }
+ continue;
+ }
+
+ /* If the lower dentry is a directory then it will need copying
+ * up before we can make use of it.
+ */
+ if (S_ISDIR((*inode)->i_mode))
+ return false;
+
+ /* We have a file in a lower fs that we can use */
+ if (read_seqcount_retry(&lower_dir->d_seq, ldseq) ||
+ __read_seqcount_retry(&parent->d_seq, parent_seq))
+ return false;
+
+ path->mnt = lower_path->mnt;
+ path->dentry = lower;
+ nd->seq = seq;
+ return true;
+ }
+
+ /* Found nothing, so just use the top negative dentry */
+ return true;
+}
+
+/*
* Allocate a dentry with name and parent, and perform a parent
* directory ->lookup on it. Returns the new dentry, or ERR_PTR
* on error. parent->d_inode->i_mutex must be held. d_lookup must
@@ -1351,14 +1492,15 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
* do the non-racy lookup, below.
*/
if (nd->flags & LOOKUP_RCU) {
- unsigned seq;
+ unsigned seq, pseq;
*inode = nd->inode;
dentry = __d_lookup_rcu(parent, name, &seq, inode);
if (!dentry)
goto unlazy;
/* Memory barrier in read_seqcount_begin of child is enough */
- if (__read_seqcount_retry(&parent->d_seq, nd->seq))
+ pseq = nd->seq;
+ if (__read_seqcount_retry(&parent->d_seq, pseq))
return -ECHILD;
nd->seq = seq;
@@ -1372,8 +1514,11 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
}
if (unlikely(d_need_lookup(dentry)))
goto unlazy;
+
path->mnt = mnt;
path->dentry = dentry;
+ if (unlikely(!lookup_union_rcu(nd, parent, path, pseq, inode)))
+ goto unlazy;
if (unlikely(!__follow_mount_rcu(nd, path, inode)))
goto unlazy;
if (unlikely(path->dentry->d_flags & DCACHE_NEED_AUTOMOUNT))
From: Valerie Aurora <[email protected]>
If we find a file during union lookup, don't look in any lower layers
and replace the topmost path with the file's path.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
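As a reading aid, the rule can be modelled in user space as below. This
reflects only the state of the code as of this patch (lower directories are
still skipped over, per the XXX comment), and the toy_* names are
hypothetical. The function returns the index of the lower layer whose path
replaces the topmost one, or TOY_TOPMOST if the topmost path is kept.

```c
#include <assert.h>

enum toy_type { TOY_NONE, TOY_REG, TOY_DIRECTORY };

#define TOY_TOPMOST	(-1)

int toy_lookup_union(enum toy_type topmost, const enum toy_type *lower,
		     int layers)
{
	int layer;

	for (layer = 0; layer < layers; layer++) {
		if (lower[layer] == TOY_NONE)
			continue;	/* negative: keep looking further down */

		if (lower[layer] != TOY_DIRECTORY) {
			/* A file below a directory is ignored. */
			if (topmost == TOY_DIRECTORY)
				return TOY_TOPMOST;
			/* Otherwise the file blocks everything below it. */
			return layer;
		}
		/* Directory: nothing to do yet at this point in the series. */
	}
	return TOY_TOPMOST;
}
```

Either way the search stops at the first non-directory found, so nothing
beneath it is ever consulted.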
fs/namei.c | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 009d9b5..f81f24e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1179,11 +1179,31 @@ static int __lookup_union(struct nameidata *nd, struct qstr *name,
continue;
}
+ /* Files block everything below them. Special case: If we find
+ * a file below a directory (which makes no sense), just ignore
+ * the file and return the directory above it.
+ */
+ if (!S_ISDIR(lower.dentry->d_inode->i_mode)) {
+ if (topmost->dentry->d_inode &&
+ S_ISDIR(topmost->dentry->d_inode->i_mode))
+ goto out_lookup_done;
+ goto out_found_file;
+ }
+
/* XXX - do nothing, more in later patches */
path_put(&lower);
}
return 0;
+out_found_file:
+ /* Swap out the positive lower dentry with the negative upper
+ * dentry for this file. Note that the matching mntput() is done
+ * in link_path_walk().
+ */
+ dput(topmost->dentry);
+ *topmost = lower;
+ return 0;
+
out_lookup_done:
path_put(&lower);
return 0;
From: Valerie Aurora <[email protected]>
Document design and implementation of union mounts (a.k.a. writable overlays).
With corrections from Andreas Gruenbacher <[email protected]>.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
Documentation/filesystems/union-mounts.txt | 712 ++++++++++++++++++++++++++++
1 files changed, 712 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt
diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..596bfe6
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,712 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union mounts,
+also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one or more read-only file
+systems, with all writes going to the writable file system. The namespace of
+both file systems appears as a combined whole to userland, with files and
+directories on the writable file system covering up any files or directories
+with matching pathnames on the read-only file system. The read-write file
+system is the "topmost" or "upper" file system and the read-only file systems
+are the "lower" file systems. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block device or a
+clustered file system (including NFS mounts). However, for some use cases,
+sharing is more efficient and better performing if done at the file system
+namespace level. COW block devices only increase their divergence as time goes
+on, and a fully coherent writable file system is unnecessary synchronization
+overhead if no other client needs to see the writes.
+
+What union mounts are not
+-------------------------
+
+Union mounts are not a general-purpose unioning file system. They do not
+provide a generic "union of namespaces" operation for an arbitrary number of
+file systems. Many interesting features can be implemented with a generic
+unioning facility: dynamic insertion and removal of branches, write policies
+based on space available, online upgrade, etc. Some unioning file systems that
+do this are UnionFS and AUFS.
+
+Terminology
+===========
+
+The main physical metaphor for union mounts is that a writable file system is
+mounted "on top" of a read-only file system. Lookups start at the "topmost"
+read-write file system and travel "down" to the "bottom" read-only file system
+only if no blocking entry exists on the top layer.
+
+Topmost layer: The read-write file system. Lookups begin here.
+
+Bottom layer: The read-only file system. Lookups end here.
+
+Path: Combination of the vfsmount and dentry structure.
+
+Follow down: Given a path from the top layer, find the corresponding path on
+the bottom layer.
+
+Follow up: Given a path from the bottom layer, find the corresponding path on
+the top layer.
+
+Whiteout: A directory entry in the top layer that prevents lookups from
+travelling down to the bottom layer. Created on unlink()/rmdir() if a
+corresponding directory entry exists in the bottom layer.
+
+Opaque flag: A flag on a directory in the top layer that prevents lookups of
+entries in this directory from travelling down to the bottom layer (unless
+there is an explicit fallthru entry allowing that for a particular entry). Set
+on creation of any new directory in the topmost layer (that is, a directory
+that does not have any matching visible directory below it).
+
+Fallthru: A directory entry which allows lookups to "fall through" to the
+bottom layer for that exact directory entry. This serves as a placeholder for
+directory entries from the bottom layer during readdir(). Fallthrus override
+opaque flags.
+
+File copyup: Create a file on the top layer that has the same metadata and
+contents as the file with the same pathname on the bottom layer.
+
+Directory copyup: Copy up the visible directory entries from the bottom layer
+as fallthrus in the matching top layer directory. Mark the directory opaque to
+avoid unnecessary negative lookups on the bottom layer.
+
+Examples
+========
+
+What happens when I...
+
+- creat() /newfile -> creates on topmost layer
+- unlink() /oldfile -> creates a whiteout on topmost layer
+- Edit /existingfile -> copies up to top layer when opened for writing
+- truncate /existingfile -> copies up to topmost layer + N bytes if specified
+- touch()/chmod()/chown()/etc. -> copies up to topmost layer
+- mkdir() /newdir -> creates opaque dir on topmost layer
+- rmdir() /olddir -> creates a whiteout on topmost layer
+- mkdir() /olddir after above -> creates opaque dir on topmost layer
+- readdir() /shareddir -> copies up entries from bottom layer as
+ fallthrus, processes duplicates and whiteouts
+- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on
+ topmost layer
+- symlink() /oldfile /symlink -> nothing special
+- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer,
+ whiteouts /oldfile
+- rename() /olddir /newdir -> EXDEV
+- rename() /topmost_only_dir /topmost_only_dir2 -> success
+- stat() /oldfile - inode & dev from lower layer
+- stat() /newfile - inode & dev from topmost layer
+- readdir() /shareddir - d_ino & d_type from lower layer on fallthrus
+
+Getting to a root file system with union mounts:
+
+- Mount the base read-only file system as the root file system
+- Mount the read-only file system again on /newroot
+- Mount the read-write layer on /newroot:
+ # mount -o union /dev/sda /newroot
+- pivot_root to /newroot
+- Start init
+
+See scripts/pivot.sh in the UML devkit linked to from:
+
+http://valerieaurora.org/union/
+
+VFS implementation
+==================
+
+Union mounts are implemented as an integral part of the VFS, rather than as a
+VFS client file system (i.e., a stacked file system like unionfs or ecryptfs).
+Implementing unioning inside the VFS eliminates the need for duplicate copies
+of VFS data structures, unnecessary indirection, and code duplication, but
+requires very maintainable, low overhead code. Union mounts require no change
+to file systems serving as the read-only layer, and require some minor support
+from file systems serving as the read-write layer. File systems that want to
+be the writable layer must implement the new ->whiteout() and ->fallthru()
+inode operations, which create special dummy directory entries.
+
+The union mounts code must accomplish the following major tasks:
+
+1) Pass lookups through to the lower level file system.
+2) Copy files and directories up to the topmost layer when written.
+3) Create whiteouts and fallthrus as necessary.
+
+VFS objects and union mounts
+----------------------------
+
+First, some VFS basics:
+
+The VFS allows multiple mounts of the same file system. For example, /dev/sda
+can be mounted at /usr and also at /mnt. The same file system can be mounted
+read-only at one point and read-write at another. Each of these mounts has its
+own vfsmount data structure in the kernel. However, each underlying file
+system has exactly one in-kernel superblock structure no matter how many times
+it is mounted. All the separate vfsmounts for the same file system reference
+the same superblock data structure.
+
+Directory entries are cached by the VFS in dentry structures. The VFS keeps
+one dentry structure for each file or directory in a file system, no matter how
+many times it is mounted. Each dentry represents only one element of a path
+name. When the VFS looks up a pathname (e.g., "/sbin/init"), the result is a
+combination of vfsmount and dentry. This <mnt,dentry> pair is usually stored
+in a kernel structure named "path", which is simply two pointers, one to the
+vfsmount and one to the dentry. A "struct path" is this structure; a pathname
+is a string like "/etc/fstab".
+
+In union mounts, a file system can only be the topmost layer for one union
+mount. A file system can be part of multiple union mounts if it is a read-only
+layer. So dentries in the read-only layers can be part of multiple unions,
+while a dentry in the read-write layer can only be part of one union.
+
+union_dir structure
+---------------------
+
+The first job of union mounts is to map directories from the topmost layer to
+directories with the same pathname in the lower layer. That is, given the
+<mnt,dentry> pair for a directory pathname in the topmost layer, we need to
+find all the <mnt,dentry> pairs for the directory with the same pathname in the
+lower layer. We do this with the union_dir structure, which is an array
+containing struct paths (mnt, dentry pointer pairs) for each directory unioned
+with the topmost union. The array is pointed to from the new d_union_stack
+member of struct dentry.
+
+/*
+ * The union_stack structure. It is an array of struct paths of
+ * directories below the topmost directory in a unioned directory. The
+ * topmost dentry has a pointer to this structure. The topmost dentry
+ * can only be part of one union, so we can reference it from the
+ * dentry, but lower dentries can be part of multiple union stacks.
+ *
+ * The number of dirs actually allocated is kept in the superblock,
+ * s_union_count.
+ */
+struct union_stack {
+ struct path u_dirs[0];
+};
+
+This structure is flexible enough to support an arbitrary number of layers of
+unioned file systems. Since there can be more than two layers, this section
+will talk about mapping "upper" directories to "lower" directories, instead of
+"topmost" directories to "bottom" directories.
+
+Traversing the union stack
+--------------------------
+
+The set of union_dir structures referring to a particular pathname are called
+collectively the union stack for that directory. To traverse the union stack,
+iterate through the number of layers in the union (stored in sb->s_union_count)
+with union_find_dir(). Example: freeing the union stack:
+
+void d_free_unions(struct dentry *topmost)
+{
+ struct path *path;
+ unsigned int i, layers = topmost->d_sb->s_union_count;
+
+ if (!IS_DIR_UNIONED(topmost))
+ return;
+
+ for (i = 0; i < layers; i++) {
+ path = union_find_dir(topmost, i);
+ if (path->mnt)
+ path_put(path);
+ }
+ kfree(topmost->d_union_stack);
+ topmost->d_union_stack = NULL;
+}
+
+Code paths
+----------
+
+Union mounts modify the following key code paths in the VFS:
+
+- mount()/umount()
+- Pathname lookup
+- Any path that modifies an existing file
+
+Mount
+-----
+
+Union mounts are created in two steps:
+
+1. Mount the read-only layer file systems read-only in the usual manner, all on
+the same mountpoint. Submounts are permitted as long as they are also
+read-only and not shared (part of a mount propagation group).
+
+2. Mount the top layer with the "-o union" option at the same mountpoint. All
+read-only file systems mounted at this mountpoint will be included in the union
+mount.
+
+The bottom layers must be read-only and the top layer must be read-write and
+support whiteouts and fallthrus. A file system that supports whiteouts and
+fallthrus indicates this by setting the MS_WHITEOUT and MS_FALLTHRU flags in
+the superblock. Currently, the top layer is forced to "noatime" to avoid a
+copyup on every access of a file. Supporting atime with the current
+infrastructure would require a copyup on every open(). The "relatime" option
+would be equally efficient if the atime is the same or more recent than the
+mtime/ctime for every object on the read-only file system, and if the 24-hour
+timeout on relatime was disabled. However, this is probably not worthwhile for
+the majority of union mount use cases.
+
+File systems can only be union mounted at their root directories, for
+simplicity and performance.
+
+pivot_root() to a union mounted file system is supported. The recommended way
+to get to a union mounted root file system is to boot with the read-only mount
+as the root file system, construct the union mount on an entirely new mount,
+and pivot_root() to the new union mount root. Attempting to union mount the
+root file system later in boot will result in covering other file systems,
+e.g., /proc, which isn't permitted in the current code and is a bad idea
+anyway.
+
+Hard read-only file systems
+---------------------------
+
+Union mounts require the lower layer of the file system to be read-only.
+However, in Linux, any individual file system may be mounted at multiple places
+in the namespace, and a file system can be changed from read-only to read-write
+while still mounted. Thus, simply checking that the bottom layer is read-only
+at the time the writable overlay is mounted over it is pointless, since at any
+time the bottom layer may become read-write.
+
+We have to guarantee that a file system will be read-only for as long as it is
+the bottom layer of a union mount. To do this, we track the number of hard
+read-only users of a file system in its VFS superblock structure. When we
+union mount a writable overlay over a file system, we increment its read-only
+user count. The file system can only be mounted read-write if its read-only
+users count is zero.
+
+Todo:
+
+- Support hard read-only NFS mounts. See discussion here:
+
+ http://markmail.org/message/3mkgnvo4pswxd7lp
+
+Pathname lookup
+---------------
+
+Pathname lookup in a unioned directory traverses down the union stack for the
+parent directory, looking up each pathname element in each layer of the file
+system (according to the rules of whiteouts, fallthrus, and opaque flags). At
+mount time, the union stack for the root directory of the file system is
+created, and the union stack creation for every other unioned directory in the
+file system is boot-strapped using the already-existing union stack of the
+directory's parent. In order to simplify the code greatly, every visible
+directory on the lower file system is required to have a matching directory on
+the upper file system. If this matching directory does not already exist, it
+is created during pathname lookup. Therefore, each unioned directory is the
+child of another unioned directory (or is the root directory of the file
+system).
+
+The actual union lookup function is called in the following code paths:
+
+do_lookup()->do_union_lookup()->lookup_union()->__lookup_union()
+lookup_hash()->lookup_union()->__lookup_union()
+
+__lookup_union() is where the rules of whiteouts, fallthrus, and opaque flags
+are actually implemented. __lookup_union() returns either the first visible
+dentry, or a negative dentry from the topmost file system if no matching dentry
+exists. If it finds a directory, it looks up any potential matching lower
+layer directories. If it finds a lower layer directory, it first creates the
+topmost dir if necessary via union_create_topmost_dir(), and then calls
+union_add_dir() to append the lower directory to the end of the union stack.
+
+Note that not all directories in a union mount are unioned, only those with
+matching directories on the lower layer. The macro IS_DIR_UNIONED() is a
+cheap, constant time way to check if a directory is unioned, while
+IS_MNT_UNION() checks if the entire mount is unioned (and therefore whether the
+directory in question is potentially unioned).
+
+Currently, lookup of a negative dentry or a directory with no matching
+directories below it requires a lookup in every directory in the union stack
+every time it is looked up. We could avoid subsequent lookups by adding the
+equivalent of a negative dcache entry.
+
+File copyup
+-----------
+
+Any system call that alters the data or metadata of a file on the bottom layer,
+or creates or changes a hard link to it, will trigger a copyup of the target
+file from the lower layer to the topmost layer:
+
+ - open(O_WRONLY | O_RDWR | O_APPEND)
+ - truncate()/open(O_TRUNC)
+ - link()
+ - rename()
+ - chmod()
+ - chown()/lchown()
+ - utimes()
+ - setxattr()/lsetxattr()
+
+Copyup of a file due to open(O_WRONLY) has already occurred when:
+
+ - write()
+ - ftruncate()
+ - writable mmap()
+
+The following system calls will fail on an fd opened O_RDONLY:
+
+ - fchmod()
+ - fchown()
+ - fsetxattr()
+ - futimensat()
+
+Contrary to common sense, the above system calls are defined to succeed on
+O_RDONLY fds. The idea seems to be that the O_RDONLY/O_RDWR/O_WRONLY flags only
+apply to the actual file data, not to any form of metadata (times, owner, mode,
+or even extended attributes). Applications making these system calls on
+O_RDONLY fds are correct according to the standard and work on non-union
+mounts. They will need to be rewritten (O_RDONLY -> O_RDWR) to work on union
+mounts. We suspect this usage is uncommon.
+
+This deviation from standard is due to technical limitations of the union mount
+implementation. Specifically, we would need to replace an open file descriptor
+from the lower layer with an open file descriptor for a file with matching
+pathname and contents on the upper layer, which is difficult to do. We avoid
+this in other system calls by doing the copyup before the file is opened.
+Unionfs doesn't encounter this problem because it creates a dummy file struct
+which redirects or fans out operations to the struct files for the underlying
+file systems.
+
+From an application's point of view, the result of an in-kernel file copyup is
+the logical equivalent of another application updating the file via the
+rename() pattern: creat() a new file, copy the data over, make changes to the
+copy, and rename() over the old version. Any existing open file descriptors
+for that file (including those in the same application) refer to a now
+invisible object that used to have the same pathname. Only opens that occur
+after the copyup will see updates to the file.
+
+Permission checks
+-----------------
+
+We want to be sure we have the correct permissions to actually succeed in a
+system call before copying a file up to avoid unnecessary IO. At present, the
+permission check for a single system call may be spread out over many hundreds
+of lines of code (e.g., open()). In order to check permissions, we
+occasionally need to determine if there is a writable overlay on top of this
+inode. This requires a full path, but often we only have the inode at this
+point. In particular, inode_permission() returns EROFS if the inode is on a
+read-only file system, which is the wrong answer if there is a writable overlay
+mounted on top of it.
+
+The current solution is to split out the file-system-wide permission checks
+from the per-inode permission checks. inode_permission() becomes:
+
+sb_permission()
+__inode_permission()
+
+inode_permission() calls sb_permission() and __inode_permission() on the same
+path. We create path_permission() which calls sb_permission() on the parent
+directory from the top layer, and __inode_permission() on the target on the
+lower layer. This gets us the correct write permissions considering that the
+file will be copied up.
+
+Todo:
+
+ - Currently, we don't deal with differing directory permissions at
+ different levels of the stack. This is a bug.
+
+Impact on non-union kernels and mounts
+--------------------------------------
+
+Union-related data structures, extra fields, and function calls are #ifdef'd
+out at the function/macro level with CONFIG_UNION_MOUNT in nearly all cases
+(see fs/union.h). When CONFIG_UNION_MOUNT is enabled, struct dentry has one
+more pointer, reducing the size of dentry names stored in the dentry itself by
+4 to 8 bytes.
+
+Todo:
+
+ - Do performance tests
+
+Locking strategy
+================
+
+The current union mount locking strategy is based on the following
+rules:
+
+* The lower layer file system is always read-only
+* The topmost file system is always read-write
+ => A file system can never be a topmost and lower layer at the same time
+
+Additionally, the topmost layer may only be mounted exactly once. Don't think
+of the topmost layer as a separate independent file system; when it is part of
+a union mount, it is only a file system in conjunction with the read-only
+bottom layer. The read-only bottom layer is an independent file system in and
+of itself and can be mounted elsewhere, including as the bottom layer for
+another union mount.
+
+Thus, we may define a stable locking order in terms of top layer and bottom
+layer locks, since a top layer is never a bottom layer and a bottom layer is
+never a top layer. Another simplifying assumption is that all directories in a
+pathname exist on the top layer, as they are created step-by-step during
+lookup. This prevents us from ever having to walk backwards up the path
+creating directory entries, which can get complicated. By implication, parent
+directory paths during any operation (rename(), unlink(), etc.) are from the
+top layer. Dentries for directories from the bottom layer are only ever seen
+or used by the lookup code.
+
+The two major problems we avoid with the above rules are:
+
+Lock ordering: Imagine two union stacks with the same two file systems: A
+mounted over B, and B mounted over A. Sometimes locks on objects in both A and
+B will have to be held simultaneously. What order should they be acquired in?
+Simply acquiring them from top to bottom will create a lock-ordering problem -
+one thread acquires lock on object from A and then tries for a lock on object
+from B, while another thread grabs the lock on object from B and then waits for
+the lock on object from A. Some other lock ordering must be defined.
+
+Movement/change/disappearance of objects on multiple layers: A variety of nasty
+corner cases arise when more than one layer is changing at the same time.
+Changes in the directory topology and their effect on inheritance are of
+special concern. Al Viro's canonical email on the subject:
+
+http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
+
+We don't try to solve any of these cases, just avoid them in the first place.
+
+Todo: Prevent top layer from being mounted more than once.
+
+Cross-layer interactions
+------------------------
+
+The VFS code simultaneously holds references to and/or modifies objects from
+both the top and bottom layers in the following cases:
+
+Path lookup:
+
+Grabs i_mutex on bottom layer while holding i_mutex on top layer directory
+inode.
+
+File copyup:
+
+Holds i_mutex on the parent directory from the top layer while copying up file
+from lower layer.
+
+link():
+
+File copyup of target while holding i_mutex on parent directory on top layer.
+Followed by a normal link() operation.
+
+rename():
+
+Holds s_vfs_rename_mutex on the top layer, i_mutex of the source's parent dir
+(top layer), and i_mutex of the target's parent dir (also top layer) while
+looking up and copying the bottom layer target and also creating the whiteout.
+
+Notes on rename():
+
+First, renaming of directories returns EXDEV. It's not at all reasonable to
+recursively copy directory trees and userspace has to handle this case anyway.
+An exception is rename() of directories that exist only on the topmost layer;
+this succeeds.
+
+Rename involves three steps on a union mount: (1) copyup of the file from the
+bottom layer, (2) rename of the new top-layer copy to the target in the usual
+manner, (3) creation of a whiteout covering the source of the rename.
+
+Directory copyup:
+
+Directory entries are copied up on the first readdir(). We hold the top layer
+directory i_mutex throughout and sequentially acquire and drop the i_mutex for
+each lower layer directory.
+
+VFS-fs interface
+================
+
+Read-only layer: No support necessary other than enforcement of really really
+read-only semantics (done by VFS for local file systems).
+
+Writable layer: Must implement two new inode operations:
+
+int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+int (*fallthru) (struct inode *, struct dentry *);
+
+And set the MS_WHITEOUT and MS_FALLTHRU flags to indicate support of
+these operations.
+
+Todo:
+
+- Implement whiteouts and fallthrus in ext3
+- Implement whiteouts and fallthrus in btrfs
+
+Supported file systems
+----------------------
+
+Any file system can be a read-only layer. File systems must explicitly support
+whiteouts and fallthrus in order to be a read-write layer. This patch set
+implements whiteouts for ext2, tmpfs, and jffs2. We have tested ext2, tmpfs,
+and iso9660 as the read-only layer.
+
+Todo:
+ - Test corner cases of case-insensitive/oversensitive file systems
+
+NFS interaction
+===============
+
+NFS is currently not supported as either type of layer. NFS as read-only layer
+requires support from the server to honor the read-only guarantee needed for
+the bottom layer. To do this, the server needs to revoke access to clients
+requesting read-only file systems if the exported file system is remounted
+read-write or unmounted (during which arbitrary changes can occur). Some
+recent discussion:
+
+http://markmail.org/message/3mkgnvo4pswxd7lp
+
+NFS as the read-write layer would require implementation of the ->whiteout()
+and ->fallthru() methods. DT_WHT directory entries are theoretically already
+supported.
+
+Also, technically the requirement for a readdir() cookie that is stable across
+reboots comes only from file systems exported via NFSv2:
+
+http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html
+
+Todo:
+
+- Guarantee really really read-only on NFS exports
+- Implement whiteout()/fallthru() for NFS
+
+Userland support
+================
+
+The mount command must support the "-o union" mount option and pass the
+corresponding MS_UNION flag to the kernel. A util-linux git tree with union
+mount support is here:
+
+git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
+
+File system utilities must support whiteouts and fallthrus. An e2fsprogs git
+tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
+
+Currently, whiteout directory entries are not returned to userland. While the
+directory type for whiteouts, DT_WHT, has been defined for many years, very
+little userland code handles them. Userland will never see fallthru directory
+entries.
+
+Known non-POSIX behaviors
+-------------------------
+
+- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
+
+ Most programs are not tested and don't work well under conditions of ENOSPC.
+ The solution is to add more disk space.
+
+- Link count may be wrong for files on bottom layer with > 1 link count
+
+ A file may have more than one hard link to it. When a file with multiple
+ hard links is copied up, any other hard links pointing to the same inode will
+ remain unchanged. If the file is looked up via one of the hard links on the
+ read-only layer, it will have the original link count (which is off by one at
+ this point). An example:
+
+ /bin/link1 -> inode 100
+ /etc/link2 -> inode 100
+
+ inode 100 will have link count 2.
+
+ # echo "blah" > /bin/link1
+
+ Now /bin/link1 will be copied up to the topmost layer. But /etc/link2 will
+ still point to the original inode 100, and its link count will still be 2.
+
+- Link count on directories will be wrong before readdir() (fixable)
+- File copyup is the logical equivalent of an update via copy +
+ rename(). Any existing open file descriptors will continue to refer
+ to the read-only copy on the bottom layer and will not see any
+ changes that occur after the copy-up.
+- rename() of directory may fail with EXDEV
+- fchmod()/fchown()/futimensat()/fsetxattr() fail on O_RDONLY fds
+
+Status
+======
+
+The current union mounts implementation is feature-complete on local file
+systems and passes an extensive union mounts test suite, available in the union
+mounts Usermode Linux-based development kit:
+
+http://valerieaurora.org/union/union_mount_devkit.tar.gz
+
+The whiteout code has had some non-trivial level of review and testing, but
+much of the code has had no external review or testing outside the authors'
+machines.
+
+The latest version is available at:
+
+git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
+
+Check the union mounts web page for the name of the latest branch:
+
+http://valerieaurora.org/union/
+
+Todo:
+
+- Run more tests (e.g., XFS test suite)
+- Get review from VFS maintainers
+
+Non-features
+------------
+
+Features we do not currently plan to support in union mounts:
+
+Online upgrade: E.g., installing software on a file system NFS exported to
+clients while the clients are still up and running. Allowing the read-only
+bottom layer of a union mount to change invalidates our locking strategy.
+
+Recursive copying of directories: E.g., implementing rename() across layers for
+directories. Doing an in-kernel copy of a single file is bad enough.
+Recursively copying a directory is a big no-no.
+
+Read-only top layer: The readdir() strategy fundamentally requires the ability
+to create persistent directory entries on the top layer file system (which may
+be tmpfs). However, you can union two read-only file systems by union mounting
+a third file system (such as tmpfs) over the two read-only file systems.
+Numerous alternatives to this readdir() strategy (including in-kernel or
+in-application caching) exist and are compatible with union mounts with its
+writing-readdir() implementation disabled. Creating a readdir() cookie that is
+stable across multiple readdir()s requires one of:
+
+- Write to stable storage (e.g., fallthru dentries)
+- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
+- Per-application caching by glibc readdir()
+
+Often these features are supported by other unioning file systems or by other
+versions of union mounts.
+
+Contributing to union mounts
+============================
+
+The union mounts web page is here:
+
+http://valerieaurora.org/union/
+
+It links to:
+
+ - All git repositories
+ - Documentation
+ - An entire self-contained UML-based dev kit with README, etc.
+
+The best mailing list for discussing union mounts is:
+
[email protected]
+
+http://vger.kernel.org/vger-lists.html#linux-fsdevel
+
+Thank you for reading!
Whiteout a given directory entry. File systems that support whiteouts
must implement the new ->whiteout() directory inode operation.
XXX - Only whiteout when there is a matching entry in a lower layer.
Signed-off-by: Jan Blunck <[email protected]> (Original author)
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Forward port)
---
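Reviewer note (below the cut line, so not part of the commit): the lookup
rule that a whiteout enforces can be modelled in userspace. The sketch below
is illustrative only. struct model_entry, struct model_layer, model_lookup()
and the enum values are hypothetical names that exist nowhere in the kernel,
and the model assumes layers are searched from topmost (index 0) downwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum model_kind { MODEL_FILE, MODEL_WHITEOUT };

struct model_entry {
	const char *name;
	enum model_kind kind;
};

struct model_layer {
	const struct model_entry *entries;
	size_t count;
};

/* Find @name in a single layer; NULL means the layer has no such entry. */
static const struct model_entry *
model_layer_find(const struct model_layer *layer, const char *name)
{
	size_t i;

	for (i = 0; i < layer->count; i++)
		if (strcmp(layer->entries[i].name, name) == 0)
			return &layer->entries[i];
	return NULL;
}

/*
 * Walk the layers from topmost (index 0) downwards.  Returns the index of
 * the layer that provides @name, or -1 if the name is not visible.  A
 * whiteout ends the walk, so a same-named entry further down stays hidden.
 */
static int model_lookup(const struct model_layer *layers, size_t nlayers,
			const char *name)
{
	size_t i;

	assert(layers != NULL || nlayers == 0);

	for (i = 0; i < nlayers; i++) {
		const struct model_entry *e = model_layer_find(&layers[i], name);

		if (!e)
			continue;	/* nothing here, drop to the next layer */
		if (e->kind == MODEL_WHITEOUT)
			return -1;	/* whited-out: do not look any lower */
		return (int)i;
	}
	return -1;
}
```

With a whiteout for "passwd" in the top layer and a regular "passwd" in the
bottom layer, model_lookup() reports the name as absent, which is the effect
unlink() on a union is expected to have once the whiteout is in place.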
Documentation/filesystems/vfs.txt | 10 ++++
fs/dcache.c | 1
fs/namei.c | 89 +++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 6 ++
include/linux/fs.h | 2 +
5 files changed, 106 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3d9393b..8575c5b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -338,7 +338,7 @@ struct inode_operations
-----------------------
This describes how the VFS can manipulate an inode in your
-filesystem. As of kernel 2.6.22, the following members are defined:
+filesystem. As of kernel 2.6.34, the following members are defined:
struct inode_operations {
int (*create) (struct inode *,struct dentry *, umode_t, struct nameidata *);
@@ -349,6 +349,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
@@ -413,6 +414,13 @@ otherwise noted.
will probably need to call d_instantiate() just as you would
in the create() method
+ whiteout: called by the rmdir(2) and unlink(2) system calls on a
+ layered file system. Only required if you want to support
+ whiteouts. The first dentry passed in is the dentry for the
+ existing entry: positive if it exists, negative otherwise. The
+ second is the dentry for the whiteout itself. This method
+ must unlink() or rmdir() the original entry if it exists.
+
rename: called by the rename(2) system call to rename the object to
have the parent and name given by the second inode and dentry.
diff --git a/fs/dcache.c b/fs/dcache.c
index fe19ac1..a8355d5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1311,6 +1311,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
{
spin_lock(&dentry->d_lock);
if (inode) {
+ dentry->d_flags &= ~DCACHE_WHITEOUT;
if (unlikely(IS_AUTOMOUNT(inode)))
dentry->d_flags |= DCACHE_NEED_AUTOMOUNT;
list_add(&dentry->d_alias, &inode->i_dentry);
diff --git a/fs/namei.c b/fs/namei.c
index 7f9df02..3d396fd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1948,7 +1948,6 @@ static int may_delete(struct inode *dir,struct dentry *victim,int isdir)
if (!victim->d_inode)
return -ENOENT;
- BUG_ON(victim->d_parent->d_inode != dir);
audit_inode_child(victim, dir);
error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
@@ -2647,6 +2646,94 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
return sys_mkdirat(AT_FDCWD, pathname, mode);
}
+/**
+ * vfs_whiteout: Create a whiteout for the given directory entry
+ * @dir: Parent inode
+ * @old_dentry: Directory entry to whiteout
+ *
+ * Create a whiteout for the given directory entry. A whiteout prevents lookup
+ * from dropping down to a lower layer of a union mounted file system.
+ *
+ * There are two important cases: (a) The directory entry to be whited-out may
+ * already exist, in which case it must first be deleted before we create the
+ * whiteout, and (b) no such directory entry exists and we only have to create
+ * the whiteout itself.
+ *
+ * The caller must pass in a dentry for the directory entry to be whited-out -
+ * a positive one if it exists, and a negative if not. When this function
+ * returns, the caller should dput() the old, now defunct dentry it passed in.
+ * The dentry for the whiteout itself is created inside this function.
+ */
+static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
+{
+ struct inode *old_inode = old_dentry->d_inode;
+ struct dentry *parent, *whiteout;
+ bool do_dput = false;
+ int err = 0;
+
+ BUG_ON(old_dentry->d_parent->d_inode != dir);
+
+ if (!dir->i_op || !dir->i_op->whiteout)
+ return -EOPNOTSUPP;
+
+ /* If the old dentry is positive, then we have to delete this entry
+ * before we create the whiteout. The file system ->whiteout() op does
+ * the actual delete, but we do all the VFS-level checks and changes
+ * here.
+ */
+ if (old_inode) {
+ mutex_lock(&old_inode->i_mutex);
+ if (d_mountpoint(old_dentry)) {
+ mutex_unlock(&old_inode->i_mutex);
+ return -EBUSY;
+ }
+ if (isdir)
+ err = security_inode_rmdir(dir, old_dentry);
+ else
+ err = security_inode_unlink(dir, old_dentry);
+ if (err)
+ goto error_unlock;
+ }
+
+ parent = dget_parent(old_dentry);
+ err = -ENOMEM;
+ whiteout = d_alloc_name(parent, old_dentry->d_name.name);
+ if (!whiteout)
+ goto error_put_parent;
+
+ if (old_inode && isdir) {
+ dentry_unhash(old_dentry);
+ do_dput = true;
+ }
+
+ err = dir->i_op->whiteout(dir, old_dentry, whiteout);
+ if (err)
+ goto error_put_whiteout;
+
+ if (old_inode) {
+ mutex_unlock(&old_inode->i_mutex);
+ fsnotify_link_count(old_inode);
+ d_delete(old_dentry);
+ if (do_dput)
+ dput(old_dentry);
+ }
+
+ dput(whiteout);
+ dput(parent);
+ return err;
+
+error_put_whiteout:
+ dput(whiteout);
+error_put_parent:
+ dput(parent);
+error_unlock:
+ if (old_inode)
+ mutex_unlock(&old_inode->i_mutex);
+ if (do_dput)
+ dput(old_dentry);
+ return err;
+}
+
/*
* The dentry_unhash() helper will try to drop the dentry early: we
* should have a usage count of 2 if we're the only user of this
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d64a55b..f22f530 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -204,6 +204,7 @@ struct dentry_operations {
#define DCACHE_CANT_MOUNT 0x0100
#define DCACHE_GENOCIDE 0x0200
#define DCACHE_SHRINK_LIST 0x0400
+#define DCACHE_WHITEOUT 0x0800 /* Stop lookup in a unioned file system */
#define DCACHE_NFSFS_RENAMED 0x1000
/* this dentry has been "silly renamed" and has to be deleted on the last
@@ -414,6 +415,11 @@ static inline bool d_managed(struct dentry *dentry)
return dentry->d_flags & DCACHE_MANAGED_DENTRY;
}
+static inline int d_is_whiteout(struct dentry *dentry)
+{
+ return dentry->d_flags & DCACHE_WHITEOUT;
+}
+
static inline bool d_mountpoint(struct dentry *dentry)
{
return dentry->d_flags & DCACHE_MOUNTED;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ab36080..1e4ae06 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -210,6 +210,7 @@ struct inodes_stat_t {
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
#define MS_STRICTATIME (1<<24) /* Always perform atime updates */
+#define MS_WHITEOUT (1<<25) /* FS supports whiteout filetype */
#define MS_NOSEC (1<<28)
#define MS_BORN (1<<29)
#define MS_ACTIVE (1<<30)
@@ -1651,6 +1652,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
void (*truncate) (struct inode *);
From: Felix Fietkau <[email protected]>
Add support for fallthru dentries to jffs2.
XXX - untested changes from David Woodhouse and Valerie Aurora.
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: [email protected]
---
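Reviewer note (below the cut line, so not part of the commit): a minimal
userspace sketch of the dirent-type mapping that the lookup hunk performs,
assuming the three cases (whiteout, fallthru, ordinary entry) are meant to be
mutually exclusive. All MODEL_* names, struct model_result and
model_classify() are hypothetical; MODEL_DT_WHT mirrors the value of DT_WHT
and MODEL_DT_FALLTHRU mirrors JFFS2_DT_FALLTHRU as defined in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for DT_WHT, JFFS2_DT_FALLTHRU and the dcache flags. */
#define MODEL_DT_WHT		14
#define MODEL_DT_FALLTHRU	(MODEL_DT_WHT + 1)
#define MODEL_WHITEOUT		0x1u
#define MODEL_FALLTHRU		0x2u

struct model_result {
	unsigned int flags;	/* MODEL_WHITEOUT, MODEL_FALLTHRU or 0 */
	uint32_t ino;		/* only set for ordinary entries */
};

/* Map an on-disk dirent type to dentry flags or an inode number. */
static struct model_result model_classify(unsigned char type,
					  uint32_t dirent_ino)
{
	struct model_result res = { 0, 0 };

	switch (type) {
	case MODEL_DT_WHT:
		res.flags |= MODEL_WHITEOUT;
		break;		/* without this, a whiteout also becomes a fallthru */
	case MODEL_DT_FALLTHRU:
		res.flags |= MODEL_FALLTHRU;
		break;
	default:
		res.ino = dirent_ino;
		break;
	}

	/* A dentry is never both a whiteout and a fallthru. */
	assert((res.flags & (MODEL_WHITEOUT | MODEL_FALLTHRU)) !=
	       (MODEL_WHITEOUT | MODEL_FALLTHRU));
	return res;
}
```

Each case ends in a break so that exactly one of the three outcomes applies
to any given dirent type.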
fs/jffs2/dir.c | 44 ++++++++++++++++++++++++++++++++++++++++----
include/linux/jffs2.h | 6 ++++++
2 files changed, 46 insertions(+), 4 deletions(-)
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index fe7468d..e294f1d 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -36,6 +36,7 @@ static int jffs2_rename (struct inode *, struct dentry *,
struct inode *, struct dentry *);
static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+static int jffs2_fallthru (struct inode *, struct dentry *);
const struct file_operations jffs2_dir_operations =
{
@@ -60,6 +61,7 @@ const struct inode_operations jffs2_dir_inode_operations =
.rename = jffs2_rename,
.get_acl = jffs2_get_acl,
.whiteout = jffs2_whiteout,
+ .fallthru = jffs2_fallthru,
.setattr = jffs2_setattr,
.setxattr = jffs2_setxattr,
.getxattr = jffs2_getxattr,
@@ -102,10 +104,16 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
}
if (fd) {
spin_lock(&target->d_lock);
- if (fd->type == DT_WHT)
+ switch (fd->type) {
+ case DT_WHT:
target->d_flags |= DCACHE_WHITEOUT;
- else
+ break;
+ case JFFS2_DT_FALLTHRU:
+ target->d_flags |= DCACHE_FALLTHRU;
+ break;
+ default:
ino = fd->ino;
+ }
spin_unlock(&target->d_lock);
}
mutex_unlock(&dir_f->sem);
@@ -127,6 +133,8 @@ static int jffs2_readdir(struct file *filp, void *dirent, filldir_t filldir)
struct inode *inode = filp->f_path.dentry->d_inode;
struct jffs2_full_dirent *fd;
unsigned long offset, curofs;
+ ino_t ino;
+ char d_type;
D1(printk(KERN_DEBUG "jffs2_readdir() for dir_i #%lu\n", filp->f_path.dentry->d_inode->i_ino));
@@ -159,13 +167,20 @@ static int jffs2_readdir(struct file *filp, void *dirent, filldir_t filldir)
fd->name, fd->ino, fd->type, curofs, offset));
continue;
}
- if (!fd->ino) {
+ if (fd->type == JFFS2_DT_FALLTHRU) {
+ /* XXX placeholder until generic_readdir_fallthru() arrives */
+ ino = 1;
+ d_type = DT_UNKNOWN;
+ } else if (!fd->ino && (fd->type != DT_WHT)) {
D2(printk(KERN_DEBUG "Skipping deletion dirent \"%s\"\n", fd->name));
offset++;
continue;
+ } else {
+ ino = fd->ino;
+ d_type = fd->type;
}
D2(printk(KERN_DEBUG "Dirent %ld: \"%s\", ino #%u, type %d\n", offset, fd->name, fd->ino, fd->type));
- if (filldir(dirent, fd->name, strlen(fd->name), offset, fd->ino, fd->type) < 0)
+ if (filldir(dirent, fd->name, strlen(fd->name), offset, ino, d_type) < 0)
break;
offset++;
}
@@ -783,6 +798,26 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, umode_t mode
return ret;
}
+static int jffs2_fallthru (struct inode *dir, struct dentry *dentry)
+{
+ struct jffs2_sb_info *c = JFFS2_SB_INFO(dir->i_sb);
+ uint32_t now;
+ int ret;
+
+ now = get_seconds();
+ ret = jffs2_do_link(c, JFFS2_INODE_INFO(dir), 0, JFFS2_DT_FALLTHRU,
+ dentry->d_name.name, dentry->d_name.len, now);
+ if (ret)
+ return ret;
+
+ d_instantiate(dentry, NULL);
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+
+ return 0;
+}
+
static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
struct dentry *new_dentry)
{
@@ -815,6 +850,7 @@ static int jffs2_whiteout (struct inode *dir, struct dentry *old_dentry,
return ret;
spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags &= ~DCACHE_FALLTHRU;
new_dentry->d_flags |= DCACHE_WHITEOUT;
spin_unlock(&new_dentry->d_lock);
d_add(new_dentry, NULL);
diff --git a/include/linux/jffs2.h b/include/linux/jffs2.h
index 6404e01..1749127 100644
--- a/include/linux/jffs2.h
+++ b/include/linux/jffs2.h
@@ -115,6 +115,12 @@ struct jffs2_unknown_node
jint32_t hdr_crc;
};
+/*
+ * Non-standard directory entry type(s), for on-disk use
+ */
+
+#define JFFS2_DT_FALLTHRU (DT_WHT + 1)
+
struct jffs2_raw_dirent
{
jint16_t magic;
From: Valerie Aurora <[email protected]>
Now that we have full union lookup support, look up the true d_type and
d_ino of a fallthru.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
---
fs/libfs.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index 43f1ac2..bd9388f 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -143,6 +143,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
ino_t ino;
char d_type;
int i = filp->f_pos;
+ int err = 0;
switch (i) {
case 0:
@@ -177,9 +178,13 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
spin_unlock(&next->d_lock);
spin_unlock(&dentry->d_lock);
if (d_is_fallthru(next)) {
- /* XXX placeholder until generic_readdir_fallthru() arrives */
- ino = 1;
- d_type = DT_UNKNOWN;
+ /* On tmpfs, should only fail with ENOMEM, EIO, etc. */
+ err = generic_readdir_fallthru(filp->f_path.dentry,
+ next->d_name.name,
+ next->d_name.len,
+ &ino, &d_type);
+ if (err)
+ return err;
} else {
ino = next->d_inode->i_ino;
d_type = dt_type(next->d_inode);
From: Valerie Aurora <[email protected]>
Passing the CL_NO_SLAVE flag to clone_mnt() causes the clone
to fail if the source mnt is a slave.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Ram Pai <[email protected]>
---
fs/namespace.c | 3 +++
fs/pnode.h | 1 +
2 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index f92f574..96f43f2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -743,6 +743,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
if ((flag & CL_NO_SHARED) && IS_MNT_SHARED(old))
return ERR_PTR(-EINVAL);
+ if ((flag & CL_NO_SLAVE) && IS_MNT_SLAVE(old))
+ return ERR_PTR(-EINVAL);
+
mnt = alloc_vfsmnt(old->mnt_devname);
if (!mnt)
return ERR_PTR(-ENOMEM);
diff --git a/fs/pnode.h b/fs/pnode.h
index c7089dd..f7ae149 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -23,6 +23,7 @@
#define CL_MAKE_SHARED 0x08
#define CL_PRIVATE 0x10
#define CL_NO_SHARED 0x20
+#define CL_NO_SLAVE 0x40
static inline void set_mnt_shared(struct mount *mnt)
{
From: Valerie Aurora <[email protected]>
Passing the CL_MAKE_HARD_READONLY flag to clone_mnt() causes the clone
to fail if the source superblock is not read-only. If it is
read-only, clone_mnt() increments the superblock's count of hard
read-only users and sets the MNT_HARD_READONLY flag in the new vfsmount.
When the mount is freed via free_vfsmnt(), the hard read-only users
count is decremented automatically.
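The counting rule described above can be modelled outside the kernel.
What follows is only a user-space sketch under stated assumptions, not
the kernel code: struct model_sb, struct model_mnt,
model_clone_hard_ro() and model_free_hard_ro() are hypothetical names
invented for illustration, and the s_umount locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the superblock state the patch relies on. */
struct model_sb {
	bool read_only;           /* models sb->s_flags & MS_RDONLY */
	int hard_readonly_users;  /* models sb->s_hard_readonly_users */
};

/* Hypothetical stand-in for a cloned mount. */
struct model_mnt {
	struct model_sb *sb;
	bool hard_readonly;       /* models MNT_HARD_READONLY */
};

/*
 * Models the CL_MAKE_HARD_READONLY step: refuse a writable superblock,
 * otherwise take a hard read-only reference and mark the clone.
 */
int model_clone_hard_ro(struct model_sb *sb, struct model_mnt *mnt)
{
	if (!sb->read_only)
		return -EBUSY;
	sb->hard_readonly_users++;
	mnt->sb = sb;
	mnt->hard_readonly = true;
	return 0;
}

/* Models the free_vfsmnt() side: drop the reference exactly once. */
void model_free_hard_ro(struct model_mnt *mnt)
{
	if (!mnt->hard_readonly)
		return;
	assert(mnt->sb->hard_readonly_users > 0);
	mnt->sb->hard_readonly_users--;
	mnt->hard_readonly = false;
}
```

In this model a clone of a writable superblock fails with -EBUSY and
leaves the count untouched, while each successful clone is balanced by
exactly one decrement when the mount is freed.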
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Ram Pai <[email protected]>
---
fs/namespace.c | 18 ++++++++++++++++++
fs/pnode.h | 1 +
include/linux/mount.h | 1 +
3 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 96f43f2..c01aff2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -481,6 +481,12 @@ int sb_prepare_remount_readonly(struct super_block *sb)
static void free_vfsmnt(struct mount *mnt)
{
kfree(mnt->mnt_devname);
+ if (mnt->mnt.mnt_flags & MNT_HARD_READONLY) {
+ BUG_ON(mnt->mnt.mnt_sb->s_hard_readonly_users <= 0);
+ down_write(&mnt->mnt.mnt_sb->s_umount);
+ mnt->mnt.mnt_sb->s_hard_readonly_users--;
+ up_write(&mnt->mnt.mnt_sb->s_umount);
+ }
mnt_free_id(mnt);
#ifdef CONFIG_SMP
free_percpu(mnt->mnt_pcp);
@@ -746,6 +752,16 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
if ((flag & CL_NO_SLAVE) && IS_MNT_SLAVE(old))
return ERR_PTR(-EINVAL);
+ if (flag & CL_MAKE_HARD_READONLY) {
+ down_write(&sb->s_umount);
+ if (!(sb->s_flags & MS_RDONLY)) {
+ up_write(&sb->s_umount);
+ return ERR_PTR(-EBUSY);
+ }
+ sb->s_hard_readonly_users++;
+ up_write(&sb->s_umount);
+ }
+
mnt = alloc_vfsmnt(old->mnt_devname);
if (!mnt)
return ERR_PTR(-ENOMEM);
@@ -784,6 +800,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
}
if (flag & CL_MAKE_SHARED)
set_mnt_shared(mnt);
+ if (flag & CL_MAKE_HARD_READONLY)
+ mnt->mnt.mnt_flags |= MNT_HARD_READONLY;
/* stick the duplicate mount on the same expiry list as the
* original if that was on one */
diff --git a/fs/pnode.h b/fs/pnode.h
index f7ae149..321d7ab 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -24,6 +24,7 @@
#define CL_PRIVATE 0x10
#define CL_NO_SHARED 0x20
#define CL_NO_SLAVE 0x40
+#define CL_MAKE_HARD_READONLY 0x80
static inline void set_mnt_shared(struct mount *mnt)
{
diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..41c7c84 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -46,6 +46,7 @@ struct mnt_namespace;
#define MNT_INTERNAL 0x4000
+#define MNT_HARD_READONLY 0x8000 /* has a hard read-only ref on the sb */
struct vfsmount {
struct dentry *mnt_root; /* root of the mounted tree */
Add comments to the VFS mount-following family of functions describing
what the directions "up" and "down" mean and how ref counts are handled.
Signed-off-by: Valerie Aurora <[email protected]> (Original author)
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 10 ++++++++++
fs/namespace.c | 16 ++++++++++++++--
2 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index a780ea5..4dc0e1d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -688,6 +688,16 @@ static int follow_up_rcu(struct path *path)
return 1;
}
+/*
+ * follow_up - Find the mountpoint of path's vfsmount
+ *
+ * Given a path, find the mountpoint of its source file system.
+ * Replace @path with the path of the mountpoint in the parent mount.
+ * Up is towards /.
+ *
+ * Return 1 if we went up a level and 0 if we were already at the
+ * root.
+ */
int follow_up(struct path *path)
{
struct mount *mnt = real_mount(path->mnt);
diff --git a/fs/namespace.c b/fs/namespace.c
index baedd0b..35c3b80 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -515,8 +515,20 @@ struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry,
}
/*
- * lookup_mnt increments the ref count before returning
- * the vfsmount struct.
+ * lookup_mnt - Return the first child mount mounted at path
+ *
+ * "First" means first mounted chronologically. If you create the
+ * following mounts:
+ *
+ * mount /dev/sda1 /mnt
+ * mount /dev/sda2 /mnt
+ * mount /dev/sda3 /mnt
+ *
+ * Then lookup_mnt() on the base /mnt dentry in the root mount will
+ * return successively the root dentry and vfsmount of /dev/sda1, then
+ * /dev/sda2, then /dev/sda3, then NULL.
+ *
+ * lookup_mnt takes a reference to the found vfsmount.
*/
struct vfsmount *lookup_mnt(struct path *path)
{
From: Valerie Aurora <[email protected]>
check_topmost_union_mnt() checks that the topmost layer of a proposed
union mount is read-write, supports fallthrus and whiteouts, and isn't
mounted elsewhere.
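The order of the checks can be summarised with a small user-space model.
This is a sketch under stated assumptions rather than the kernel
function: the MODEL_* flag values and model_check_topmost() are
hypothetical names made up for illustration, and the active reference
count is passed in as a plain integer.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits; the real kernel flags have other values. */
enum {
	MODEL_MNT_READONLY = 0x1, /* models MNT_READONLY in mnt_flags */
	MODEL_SB_WHITEOUT  = 0x2, /* models MS_WHITEOUT in s_flags */
	MODEL_SB_FALLTHRU  = 0x4, /* models MS_FALLTHRU in s_flags */
};

/*
 * Models the mount-time checks on the topmost layer: it must be
 * writable, not mounted elsewhere, and support whiteouts and fallthrus.
 */
int model_check_topmost(unsigned int mnt_flags, unsigned int sb_flags,
			int active_refs)
{
	assert(active_refs >= 0); /* a reference count is never negative */

	if (mnt_flags & MODEL_MNT_READONLY)
		return -EROFS;
	if (active_refs != 1)
		return -EBUSY;
	if (!(sb_flags & MODEL_SB_WHITEOUT))
		return -EINVAL;
	if (!(sb_flags & MODEL_SB_FALLTHRU))
		return -EINVAL;
	return 0;
}
```

Only a writable, singly referenced file system with both feature flags
set passes; each failure maps to a distinct errno except the two feature
checks, which both yield -EINVAL.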
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 33aa310..bed9ccd 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1410,6 +1410,46 @@ static int invent_group_ids(struct mount *mnt, bool recurse)
return 0;
}
+/**
+ * check_topmost_union_mnt - mount-time checks for union mount
+ * @topmost_mnt: vfsmount of the topmost union file system
+ * @mnt_flags: mount flags for the topmost mount
+ *
+ * Our readdir() solution of copying up directory entries requires
+ * that the topmost layer be writeable and support whiteouts and
+ * fallthrus. The topmost file system can't be mounted elsewhere
+ * because it's Too Hard(tm).
+ */
+static int check_topmost_union_mnt(struct vfsmount *topmost_mnt, int mnt_flags)
+{
+ struct super_block *sb = topmost_mnt->mnt_sb;
+
+#ifndef CONFIG_UNION_MOUNT
+ printk(KERN_INFO "union mount: not supported by the kernel\n");
+ return -EINVAL;
+#else
+ if (mnt_flags & MNT_READONLY)
+ return -EROFS;
+
+ if (atomic_read(&sb->s_active) != 1) {
+ printk(KERN_INFO "union mount: topmost fs mounted elsewhere\n");
+ return -EBUSY;
+ }
+
+ if (!(sb->s_flags & MS_WHITEOUT)) {
+ printk(KERN_INFO "union mount: whiteouts not supported by fs\n");
+ return -EINVAL;
+ }
+
+ if (!(sb->s_flags & MS_FALLTHRU)) {
+ printk(KERN_INFO "union mount: fallthrus not supported by fs\n");
+ return -EINVAL;
+ }
+
+ return 0;
+#endif
+}
+
/*
* @source_mnt : mount tree to be attached
* @nd : place the mount tree @source_mnt is attached
From: Jan Blunck <[email protected]>
Add support for whiteout dentries to tmpfs. This includes adding support for
whiteouts to d_genocide(), which is called to tear down pinned tmpfs dentries.
Whiteouts have to be persistent, so they have a pinning extra ref count that
needs to be dropped by d_genocide().
Signed-off-by: Jan Blunck <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
---
fs/dcache.c | 12 +++++
mm/shmem.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 141 insertions(+), 15 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index a8355d5..60af7b1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2886,7 +2886,17 @@ resume:
next = tmp->next;
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
- if (d_unhashed(dentry) || !dentry->d_inode) {
+
+ /* Skip unhashed and negative dentries, but process positive
+ * dentries and whiteouts. A whiteout looks kind of like a
+ * negative dentry for purposes of lookup, but it has an extra
+ * pinning ref count because it can't be evicted like a
+ * negative dentry can. What we care about here is ref counts
+ * - and we need to drop the ref count on a whiteout before we
+ * can evict it.
+ */
+ if (d_unhashed(dentry) ||
+ (!dentry->d_inode && !d_is_whiteout(dentry))) {
spin_unlock(&dentry->d_lock);
continue;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 269d049..ca0bd30 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1477,6 +1477,76 @@ static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
return 0;
}
+static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
+static int shmem_unlink(struct inode *dir, struct dentry *dentry);
+
+/*
+ * This is the whiteout support for tmpfs. It uses one singleton whiteout
+ * inode per superblock thus it is very similar to shmem_link().
+ */
+static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
+ struct dentry *new_dentry)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+ struct dentry *dentry;
+
+ if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+ return -EPERM;
+
+ /* This gives us a properly initialized negative dentry */
+ dentry = simple_lookup(dir, new_dentry, NULL);
+ if (dentry && IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ /*
+ * No ordinary (disk based) filesystem counts whiteouts as inodes;
+ * but each new link needs a new dentry, pinning lowmem, and
+ * tmpfs dentries cannot be pruned until they are unlinked.
+ */
+ if (sbinfo->max_inodes) {
+ spin_lock(&sbinfo->stat_lock);
+ if (!sbinfo->free_inodes) {
+ spin_unlock(&sbinfo->stat_lock);
+ return -ENOSPC;
+ }
+ sbinfo->free_inodes--;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
+ if (old_dentry->d_inode) {
+ if (S_ISDIR(old_dentry->d_inode->i_mode))
+ shmem_rmdir(dir, old_dentry);
+ else
+ shmem_unlink(dir, old_dentry);
+ }
+
+ dir->i_size += BOGO_DIRENT_SIZE;
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ /* Extra pinning count for the created dentry */
+ dget(new_dentry);
+ spin_lock(&new_dentry->d_lock);
+ new_dentry->d_flags |= DCACHE_WHITEOUT;
+ spin_unlock(&new_dentry->d_lock);
+ return 0;
+}
+
+static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
+ struct inode *inode)
+{
+ if (d_is_whiteout(dentry)) {
+ /* Re-using an existing whiteout */
+ shmem_free_inode(dir->i_sb);
+ if (S_ISDIR(inode->i_mode))
+ inode->i_mode |= S_OPAQUE;
+ } else {
+ /* New dentry */
+ dir->i_size += BOGO_DIRENT_SIZE;
+ dget(dentry); /* Extra count - pin the dentry in core */
+ }
+ /* Will clear DCACHE_WHITEOUT flag */
+ d_instantiate(dentry, inode);
+
+}
/*
* File creation. Allocate an inode, and we're done..
*/
@@ -1506,10 +1576,8 @@ shmem_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
#else
error = 0;
#endif
- dir->i_size += BOGO_DIRENT_SIZE;
+ shmem_d_instantiate(dir, dentry, inode);
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- d_instantiate(dentry, inode);
- dget(dentry); /* Extra count - pin the dentry in core */
}
return error;
}
@@ -1547,12 +1615,10 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
if (ret)
goto out;
- dir->i_size += BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
inc_nlink(inode);
ihold(inode); /* New dentry reference */
- dget(dentry); /* Extra pinning count for the created dentry */
- d_instantiate(dentry, inode);
+ shmem_d_instantiate(dir, dentry, inode);
out:
return ret;
}
@@ -1561,21 +1627,61 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode))
- shmem_free_inode(inode->i_sb);
+ if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
+ shmem_free_inode(dir->i_sb);
+ if (inode) {
+ inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+ drop_nlink(inode);
+ }
dir->i_size -= BOGO_DIRENT_SIZE;
- inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- drop_nlink(inode);
dput(dentry); /* Undo the count from "create" - this does all the work */
return 0;
}
+static void shmem_dir_unlink_whiteouts(struct inode *dir, struct dentry *dentry)
+{
+ if (!dentry->d_inode)
+ return;
+
+ /* Remove whiteouts from logically empty directory */
+ if (S_ISDIR(dentry->d_inode->i_mode) &&
+ dentry->d_inode->i_sb->s_flags & MS_WHITEOUT) {
+ struct dentry *child, *next;
+ LIST_HEAD(list);
+
+ spin_lock(&dentry->d_lock);
+ list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child) {
+ spin_lock(&child->d_lock);
+ if (d_is_whiteout(child)) {
+ __d_drop(child);
+ if (!list_empty(&child->d_lru)) {
+ list_del(&child->d_lru);
+ dentry_stat.nr_unused--;
+ }
+ list_add(&child->d_lru, &list);
+ }
+ spin_unlock(&child->d_lock);
+ }
+ spin_unlock(&dentry->d_lock);
+
+ list_for_each_entry_safe(child, next, &list, d_lru) {
+ spin_lock(&child->d_lock);
+ list_del_init(&child->d_lru);
+ spin_unlock(&child->d_lock);
+
+ shmem_unlink(dentry->d_inode, child);
+ }
+ }
+}
+
static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
{
if (!simple_empty(dentry))
return -ENOTEMPTY;
+ /* Remove whiteouts from logically empty directory */
+ shmem_dir_unlink_whiteouts(dir, dentry);
drop_nlink(dentry->d_inode);
drop_nlink(dir);
return shmem_unlink(dir, dentry);
@@ -1584,7 +1690,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
/*
* The VFS layer already does all the dentry stuff for rename,
* we just have to decrement the usage count for the target if
- * it exists so that the VFS layer correctly free's it when it
+ * it exists so that the VFS layer correctly frees it when it
* gets overwritten.
*/
static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
@@ -1595,7 +1701,12 @@ static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct
if (!simple_empty(new_dentry))
return -ENOTEMPTY;
+ if (d_is_whiteout(new_dentry))
+ shmem_unlink(new_dir, new_dentry);
+
if (new_dentry->d_inode) {
+ /* Remove whiteouts from logically empty directory */
+ shmem_dir_unlink_whiteouts(new_dir, new_dentry);
(void) shmem_unlink(new_dir, new_dentry);
if (they_are_dirs)
drop_nlink(old_dir);
@@ -1663,10 +1774,8 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s
unlock_page(page);
page_cache_release(page);
}
- dir->i_size += BOGO_DIRENT_SIZE;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- d_instantiate(dentry, inode);
- dget(dentry);
+ shmem_d_instantiate(dir, dentry, inode);
return 0;
}
@@ -2236,6 +2345,12 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
if (!root)
goto failed_iput;
sb->s_root = root;
+
+#ifdef CONFIG_TMPFS
+ if (!(sb->s_flags & MS_NOUSER))
+ sb->s_flags |= MS_WHITEOUT;
+#endif
+
return 0;
failed_iput:
@@ -2335,6 +2450,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.rmdir = shmem_rmdir,
.mknod = shmem_mknod,
.rename = shmem_rename,
+ .whiteout = shmem_whiteout,
#endif
#ifdef CONFIG_TMPFS_XATTR
.setxattr = shmem_setxattr,
From: Valerie Aurora <[email protected]>
Copy up a file when opened with write permissions. Does not copy up
the file data when O_TRUNC is specified.
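The data-length rule can be stated as a tiny user-space model. This is a
sketch only: model_copyup_len() is a hypothetical helper invented for
illustration, and it does not model the write-access test that decides
whether a copy-up happens at all.

```c
#include <assert.h> /* for the usage checks mentioned below */
#include <fcntl.h>
#include <stddef.h>

/*
 * Models how much file data a copy-up on open would transfer: nothing
 * when the caller is about to truncate the file anyway, otherwise the
 * whole lower file.
 */
size_t model_copyup_len(int open_flags, size_t lower_size)
{
	if (open_flags & O_TRUNC)
		return 0;
	return lower_size;
}
```

For example, model_copyup_len(O_WRONLY | O_TRUNC, 4096) yields 0, while
model_copyup_len(O_WRONLY, 4096) yields 4096. The file itself is still
created on the topmost layer in both cases; only the data copy is
skipped.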
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 31 +++++++++++++++++++++++++++++++
1 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index dad7bef..4fe8f4c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2521,6 +2521,24 @@ static inline int open_to_namei_flags(int flag)
return flag;
}
+static int open_union_copyup(struct nameidata *nd, struct path *path,
+ int open_flag)
+{
+ struct vfsmount *oldmnt = path->mnt;
+ int error;
+
+ if (open_flag & O_TRUNC)
+ error = union_copyup_len(nd, path, 0);
+ else
+ error = union_copyup(nd, path);
+ if (error)
+ return error;
+ if (oldmnt != path->mnt)
+ mntput(nd->path.mnt);
+
+ return error;
+}
+
/*
* Handle the last step of open()
*/
@@ -2586,6 +2604,13 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (!nd->inode->i_op->lookup)
goto exit;
}
+
+ if (acc_mode & MAY_WRITE) {
+ error = open_union_copyup(nd, &nd->path, open_flag);
+ if (error)
+ goto exit;
+ }
+
audit_inode(pathname, nd->path.dentry);
goto ok;
}
@@ -2669,6 +2694,12 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (path->dentry->d_inode->i_op->follow_link)
return NULL;
+ if (acc_mode & MAY_WRITE) {
+ error = open_union_copyup(nd, path, open_flag);
+ if (error)
+ goto exit_dput;
+ }
+
path_to_nameidata(path, nd);
nd->inode = path->dentry->d_inode;
/* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */
From: Jan Blunck <[email protected]>
IS_MNT_UNION() tests whether a vfsmount is a union. Note that a
directory in a union mounted file system is not necessarily unioned.
Use IS_DIR_UNIONED() to test that.
Original-author: Jan Blunck <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/union.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/fs/union.h b/fs/union.h
index e918a04..990dd16 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -51,6 +51,11 @@ struct union_stack {
struct path u_dirs[0];
};
+static inline bool IS_MNT_UNION(struct vfsmount *mnt)
+{
+ return mnt->mnt_flags & MNT_UNION;
+}
+
static inline bool IS_DIR_UNIONED(struct dentry *dentry)
{
return !!dentry->d_union_stack;
@@ -77,6 +82,7 @@ struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
return NULL;
}
+static inline bool IS_MNT_UNION(struct vfsmount *mnt) { return false; }
static inline bool IS_DIR_UNIONED(struct dentry *dentry) { return false; }
static inline void d_free_unions(struct dentry *dentry) {}
Mark lower layers in union to make it easier to tell when they're being
accessed - just in case a file gets mounted over a unioned lower file.
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 5 ++++-
fs/pnode.h | 1 +
fs/union.h | 5 +++++
include/linux/mount.h | 1 +
4 files changed, 11 insertions(+), 1 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 5fbe3b0..1f24a6b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -787,6 +787,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
br_write_unlock(vfsmount_lock);
+ if ((flag & CL_MAKE_UNION))
+ mnt->mnt.mnt_flags |= MNT_UNION_LOWER;
+
if (flag & CL_SLAVE) {
list_add(&mnt->mnt_slave, &old->mnt_slave_list);
mnt->mnt_master = old;
@@ -1509,7 +1512,7 @@ static int clone_union_tree(struct mount *topmost, struct path *mntpnt)
cloned_tree = copy_tree(mnt, mnt->mnt.mnt_root,
CL_COPY_ALL | CL_PRIVATE |
CL_NO_SHARED | CL_NO_SLAVE |
- CL_MAKE_HARD_READONLY);
+ CL_MAKE_HARD_READONLY | CL_MAKE_UNION);
if (IS_ERR(cloned_tree))
return PTR_ERR(cloned_tree);
diff --git a/fs/pnode.h b/fs/pnode.h
index 321d7ab..3fd2afe 100644
--- a/fs/pnode.h
+++ b/fs/pnode.h
@@ -25,6 +25,7 @@
#define CL_NO_SHARED 0x20
#define CL_NO_SLAVE 0x40
#define CL_MAKE_HARD_READONLY 0x80
+#define CL_MAKE_UNION 0x100
static inline void set_mnt_shared(struct mount *mnt)
{
diff --git a/fs/union.h b/fs/union.h
index 757f28c..48a9277 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -56,6 +56,11 @@ static inline bool IS_MNT_UNION(struct vfsmount *mnt)
return mnt->mnt_flags & MNT_UNION;
}
+static inline bool IS_MNT_LOWER(struct vfsmount *mnt)
+{
+ return mnt->mnt_flags & MNT_UNION_LOWER;
+}
+
static inline bool IS_DIR_UNIONED(struct dentry *dentry)
{
return !!dentry->d_union_stack;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 67f46fa..bd21196 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -48,6 +48,7 @@ struct mnt_namespace;
#define MNT_INTERNAL 0x4000
#define MNT_HARD_READONLY 0x8000 /* has a hard read-only ref on the sb */
#define MNT_UNION 0x10000 /* top layer of a union mount */
+#define MNT_UNION_LOWER 0x20000 /* lower layer of a union mount */
struct vfsmount {
struct dentry *mnt_root; /* root of the mounted tree */
From: Valerie Aurora <[email protected]>
Create a very simple version of union lookup. This patch only looks
up the target in each layer of the union but does not process it in
any way. Patches to do whiteouts, etc. follow.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 77 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index f53c0bc..8caed86 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1097,6 +1097,83 @@ static void follow_dotdot(struct nameidata *nd)
nd->inode = nd->path.dentry->d_inode;
}
+static struct dentry *__lookup_hash(struct qstr *name, struct dentry *base,
+ struct nameidata *nd);
+
+/**
+ * __lookup_union - Look up and build union stack
+ * @nd: nameidata for the parent of @topmost
+ * @name: name of target
+ * @topmost: path of the target on the topmost file system
+ *
+ * Do the "union" part of lookup for @topmost - that is, look it up in the
+ * lower layers of its parent directory's union stack. If @topmost is a
+ * directory, build its union stack. @topmost is the path of the target in the
+ * topmost layer of the union file system. It is either a directory or a
+ * negative (non-whiteout) dentry.
+ *
+ * This function may stomp nd->path with the path of the parent directory of
+ * the lower layers, so the caller must save nd->path and restore it
+ * afterwards.
+ */
+static int __lookup_union(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct path lower, parent = nd->path;
+ struct path *path;
+ unsigned i, layers = parent.dentry->d_sb->s_union_count;
+ int err;
+
+ if (!topmost->dentry->d_inode) {
+ if (d_is_whiteout(topmost->dentry))
+ return 0;
+ if (IS_OPAQUE(parent.dentry->d_inode) &&
+ !d_is_fallthru(topmost->dentry))
+ return 0;
+ }
+
+ /* If it's positive and not a dir, no lookup needed */
+ if (topmost->dentry->d_inode &&
+ !S_ISDIR(topmost->dentry->d_inode->i_mode))
+ return 0;
+
+ /* At this point we have either a negative fallthru dentry or we have a
+ * positive directory dentry.
+ *
+ * Loop through the union stack of the parent of the target, building
+ * the union stack of the target (if applicable). Note that the union
+ * stack of the root directory is built at mount.
+ */
+ for (i = 0; i < layers; i++) {
+ /* Get the parent directory for this layer and lookup
+ * the target in it.
+ */
+ path = union_find_dir(parent.dentry, i);
+ if (!path->mnt)
+ continue;
+
+ nd->path = *path;
+ lower.mnt = mntget(nd->path.mnt);
+ mutex_lock(&nd->path.dentry->d_inode->i_mutex);
+ lower.dentry = __lookup_hash(name, nd->path.dentry, nd);
+ mutex_unlock(&nd->path.dentry->d_inode->i_mutex);
+
+ if (IS_ERR(lower.dentry)) {
+ mntput(lower.mnt);
+ err = PTR_ERR(lower.dentry);
+ goto out_err;
+ }
+ /* XXX - do nothing, lookup rule processing in later patches */
+ path_put(&lower);
+ }
+ return 0;
+
+out_err:
+ d_free_unions(topmost->dentry);
+ path_put(&lower);
+ return err;
+}
+
/*
* Allocate a dentry with name and parent, and perform a parent
* directory ->lookup on it. Returns the new dentry, or ERR_PTR
When a file on the read-only layer of a union mount is altered, it
must be copied up to the topmost read-write layer. This patch creates
union_copyup() and its supporting routines.
Thanks to Valdis Kletnieks <[email protected]> for a bug fix.
XXX - split up
XXX - If dir xattr copyup fails, delete the newly created dir
XXX - set correct owner after copyup
XXX - reimplement with get_unlinked_inode()
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Further development)
---
fs/union.c | 266 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 31 +++++++
2 files changed, 296 insertions(+), 1 deletions(-)
diff --git a/fs/union.c b/fs/union.c
index 8efed50..b8ee42d 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -24,6 +24,8 @@
#include <linux/xattr.h>
#include <linux/file.h>
#include <linux/security.h>
+#include <linux/splice.h>
+#include <linux/xattr.h>
#include "union.h"
@@ -197,9 +199,16 @@ int union_create_topmost_dir(struct path *parent, struct qstr *name,
error = union_copyup_xattr(lower->dentry, topmost->dentry);
if (error)
- dput(topmost->dentry);
+ goto out_rmdir;
fsnotify_mkdir(dir, topmost->dentry);
+
+ mnt_drop_write(parent->mnt);
+
+ return 0;
+out_rmdir:
+ /* XXX rm created dir */
+ dput(topmost->dentry);
out:
mnt_drop_write(parent->mnt);
return error;
@@ -439,3 +448,258 @@ int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
return -ENOENT;
}
EXPORT_SYMBOL(generic_readdir_fallthru);
+
+/**
+ * union_create_file
+ * @nd: nameidata for source file
+ * @lower: path of the source file
+ * @new: path of the new file, negative dentry
+ *
+ * Must already have mnt_want_write() on the mnt and the parent's i_mutex.
+ */
+static int union_create_file(struct nameidata *nd, struct path *lower,
+ struct dentry *new)
+{
+ struct path *parent = &nd->path;
+
+ BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+
+ return vfs_create(parent->dentry->d_inode, new,
+ lower->dentry->d_inode->i_mode, nd);
+}
+
+/**
+ * union_create_symlink
+ * @nd: nameidata for source symlink
+ * @lower: path of the source symlink
+ * @new: path of the new symlink, negative dentry
+ *
+ * Must already have mnt_want_write() on the mnt and the parent's i_mutex.
+ */
+static int union_create_symlink(struct nameidata *nd, struct path *lower,
+ struct dentry *new)
+{
+ struct path *parent = &nd->path;
+ void *cookie;
+ int error;
+
+ BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+
+ /* We want the contents of this symlink, not to follow it, so this is
+ * modeled on generic_readlink() rather than do_follow_link().
+ */
+ nd->depth = 0;
+ cookie = lower->dentry->d_inode->i_op->follow_link(lower->dentry, nd);
+ if (IS_ERR(cookie))
+ return PTR_ERR(cookie);
+
+ /* Create a copy of the link on the top layer */
+ error = vfs_symlink(parent->dentry->d_inode, new, nd_get_link(nd));
+ if (lower->dentry->d_inode->i_op->put_link)
+ lower->dentry->d_inode->i_op->put_link(lower->dentry, nd, cookie);
+ return error;
+}
+
+/**
+ * union_copyup_data - Copy up @len bytes of @lower's data to the new file
+ * @lower: path of source file in lower layer
+ * @new_mnt: vfsmount of target file
+ * @new_dentry: dentry of target file
+ * @len: number of bytes to copy
+ */
+static int union_copyup_data(struct path *lower, struct vfsmount *new_mnt,
+ struct dentry *new_dentry, size_t len)
+{
+ const struct cred *cred = current_cred();
+ struct file *lower_file;
+ struct file *new_file;
+ loff_t offset = 0;
+ long bytes;
+ int error = 0;
+
+ if (len == 0)
+ return 0;
+
+ /* Get reference to balance later fput() */
+ path_get(lower);
+ lower_file = dentry_open(lower->dentry, lower->mnt, O_RDONLY, cred);
+ if (IS_ERR(lower_file))
+ return PTR_ERR(lower_file);
+
+ mntget(new_mnt);
+ dget(new_dentry);
+ new_file = dentry_open(new_dentry, new_mnt, O_WRONLY, cred);
+ if (IS_ERR(new_file)) {
+ error = PTR_ERR(new_file);
+ goto out_fput;
+ }
+
+ bytes = do_splice_direct(lower_file, &offset, new_file, len,
+ SPLICE_F_MOVE);
+ if (bytes < 0)
+ error = bytes;
+
+ fput(new_file);
+out_fput:
+ fput(lower_file);
+ return error;
+}
+
+/**
+ * union_copyup_file - Copy up a regular file, symlink or special file
+ * @nd: nameidata for topmost parent dir
+ * @lower: path of file to be copied up
+ * @dentry: dentry to copy up to
+ * @len: number of bytes of file data to copy up
+ */
+int union_copyup_file(struct nameidata *nd, struct path *lower,
+ struct dentry *dentry, size_t len)
+{
+ struct path *parent = &nd->path;
+ int error;
+
+ BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+
+ if (S_ISREG(lower->dentry->d_inode->i_mode)) {
+ error = union_create_file(nd, lower, dentry);
+ if (error)
+ return error;
+ error = union_copyup_data(lower, parent->mnt, dentry, len);
+ } else if (S_ISLNK(lower->dentry->d_inode->i_mode)) {
+ return union_create_symlink(nd, lower, dentry);
+ } else {
+ /* Don't currently support copyup of special files, though in
+ * theory there's no reason we couldn't at least copy up
+ * blockdev, chrdev and FIFO files
+ */
+ return -EXDEV;
+ }
+ if (error)
+ /* Most likely error: ENOSPC */
+ vfs_unlink(parent->dentry->d_inode, dentry);
+
+ return error;
+}
+
+/**
+ * __union_copyup_len - Copy up a file and len bytes of data
+ * @nd: nameidata for topmost parent dir
+ * @path: path of file to be copied up
+ * @len: number of bytes of file data to copy up
+ *
+ * Parent's i_mutex must be held by caller. Newly copied up path is
+ * returned in @path and original is path_put().
+ */
+static int __union_copyup_len(struct nameidata *nd, struct path *path,
+ size_t len)
+{
+ struct dentry *dentry;
+ struct path *parent = &nd->path;
+ int error;
+
+ BUG_ON(!mutex_is_locked(&parent->dentry->d_inode->i_mutex));
+
+ dentry = lookup_one_len(path->dentry->d_name.name, parent->dentry,
+ path->dentry->d_name.len);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+
+ if (dentry->d_inode) {
+ /* We raced with someone else and "lost". That's okay, they
+ * did all the work of copying up the file.
+ *
+ * Note that currently data copyup happens under the parent
+ * dir's i_mutex. If we move it outside that, we'll need some
+ * way of waiting for the data copyup to complete here.
+ */
+ error = 0;
+ } else {
+ error = union_copyup_file(nd, path, dentry, len);
+ if (error < 0)
+ goto out_dput;
+ }
+
+ /* Move to the new dentry */
+ path_put(path);
+ path->dentry = dentry;
+ path->mnt = mntget(parent->mnt);
+ return error;
+
+out_dput:
+ /* Don't path_put(path), let caller unwind */
+ dput(dentry);
+ return error;
+}
+
+/**
+ * do_union_copyup_len - Copy up a file given its path (and its parent's)
+ * @nd: nameidata for topmost parent dir
+ * @path: path of file to be copied up
+ * @copy_all: if set, copy all of the file's data and ignore @len
+ * @len: if @copy_all is not set, number of bytes of file data to copy up
+ *
+ * Newly copied up path is returned in @path.
+ */
+static int do_union_copyup_len(struct nameidata *nd, struct path *path,
+ int copy_all, size_t len)
+{
+ struct path *parent = &nd->path;
+ int error;
+
+ if (!IS_DIR_UNIONED(parent->dentry) ||
+ parent->mnt == path->mnt)
+ return 0;
+ if (!S_ISREG(path->dentry->d_inode->i_mode) &&
+ !S_ISLNK(path->dentry->d_inode->i_mode))
+ return 0;
+
+ BUG_ON(!S_ISDIR(parent->dentry->d_inode->i_mode));
+
+ mutex_lock(&parent->dentry->d_inode->i_mutex);
+ error = -ENOENT;
+ if (IS_DEADDIR(parent->dentry->d_inode))
+ goto out_unlock;
+
+ if (copy_all && S_ISREG(path->dentry->d_inode->i_mode)) {
+ error = -EFBIG;
+ len = i_size_read(path->dentry->d_inode);
+ /* Check for overflow of file size */
+ if ((ssize_t)len != len)
+ goto out_unlock;
+ }
+
+ error = __union_copyup_len(nd, path, len);
+
+out_unlock:
+ mutex_unlock(&parent->dentry->d_inode->i_mutex);
+ return error;
+}
+
+/*
+ * Helper function to copy up all of a file
+ */
+int union_copyup(struct nameidata *nd, struct path *path)
+{
+ return do_union_copyup_len(nd, path, 1, 0);
+}
+
+/*
+ * Unlocked helper function to copy up all of a file
+ */
+int __union_copyup(struct nameidata *nd, struct path *path)
+{
+ loff_t len;
+ len = i_size_read(path->dentry->d_inode);
+ if ((ssize_t)len != len)
+ return -EFBIG;
+
+ return __union_copyup_len(nd, path, len);
+}
+
+/*
+ * Helper function to copy up part of a file for truncate and O_TRUNC.
+ */
+int union_copyup_len(struct nameidata *nd, struct path *path, size_t len)
+{
+ return do_union_copyup_len(nd, path, 0, len);
+}
diff --git a/fs/union.h b/fs/union.h
index 46944b9..62d8ef5 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -19,6 +19,7 @@
#include <linux/mount.h>
#include <linux/dcache.h>
#include <linux/path.h>
+#include <linux/namei.h>
#include <linux/bug.h>
/*
@@ -73,6 +74,11 @@ extern int union_create_topmost_dir(struct path *, struct qstr *, struct path *,
extern int union_copyup_dir(struct path *);
extern int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
int namlen, ino_t *ino, unsigned char *d_type);
+extern int union_copyup(struct nameidata *, struct path *);
+extern int __union_copyup(struct nameidata *, struct path *);
+extern int union_copyup_len(struct nameidata *, struct path *, size_t len);
+extern int union_copyup_file(struct nameidata *nd, struct path *lower,
+ struct dentry *dentry, size_t len);
static inline
struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
@@ -147,4 +153,29 @@ int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
return 0;
}
+static inline int union_copyup(struct nameidata *nd, struct path *path)
+{
+ BUG();
+ return 0;
+}
+
+static inline int __union_copyup(struct nameidata *nd, struct path *path)
+{
+ BUG();
+ return 0;
+}
+
+static inline int union_copyup_len(struct nameidata *nd, struct path *path,
+ size_t len)
+{
+ BUG();
+ return 0;
+}
+
+static inline int union_copyup_file(struct nameidata *nd, struct path *lower,
+ struct dentry *dentry, size_t len)
+{
+ return 0;
+}
+
#endif /* CONFIG_UNION_MOUNT */
Add LOOKUP_COPY_UP as a pathwalk flag to indicate that if we hit a file on a
lower layer in the union, we definitely want it copied up.
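The decision this flag drives can be modelled in userspace roughly as follows.
This is only a sketch: the flag value is the one added to namei.h below, but
enum found_layer and wants_copy_up() are hypothetical names invented for
illustration and do not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define LOOKUP_COPY_UP 0x8000u	/* value taken from the namei.h hunk */

/* Hypothetical stand-in for where a union lookup found the file. */
enum found_layer { FOUND_TOPMOST, FOUND_LOWER, FOUND_NOTHING };

/*
 * Return true when pathwalk would have to copy the file up: the file
 * was found on a lower layer and the caller passed LOOKUP_COPY_UP.
 */
bool wants_copy_up(unsigned int lookup_flags, enum found_layer where)
{
	assert(where <= FOUND_NOTHING);

	if (where != FOUND_LOWER)
		return false;
	return (lookup_flags & LOOKUP_COPY_UP) != 0;
}
```

A file already on the topmost layer never needs a copy up, whatever flags the
caller passes.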
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 23 +++++++++++++++++++++--
fs/union.h | 12 ++++++++++--
include/linux/namei.h | 1 +
3 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index be505cd..6ec5725 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1214,6 +1214,16 @@ static int __lookup_union(struct nameidata *nd, struct qstr *name,
return 0;
out_found_file:
+ /* If the caller demands a top-level dentry then we have to copy up. */
+ if (nd->flags & LOOKUP_COPY_UP) {
+ nd->path = parent;
+ err = union_copyup_file(nd, &lower, topmost->dentry,
+ i_size_read(lower.dentry->d_inode));
+ if (err)
+ goto out_err;
+ goto out_lookup_done;
+ }
+
/* Swap out the positive lower dentry with the negative upper
* dentry for this file. Note that the matching mntput() is done
* in link_path_walk().
@@ -1350,6 +1360,15 @@ static bool lookup_union_rcu(struct nameidata *nd,
(IS_OPAQUE(parent_inode) && !d_is_fallthru(dentry)))
return true;
+ /* The dentry is a fallthru in an opaque unioned directory.
+ *
+ * If the caller demands that the terminal dentry be instantiated in
+ * the top layer of the union (copied up) immediately, that will
+ * require a mutex.
+ */
+ if (nd->flags & LOOKUP_COPY_UP)
+ return false;
+
/* At this point we have a negative dentry in the unionmount that may
* be overlaying a non-directory file in a lower filesystem, so we loop
* through the union stack of the parent directory to try to find a
@@ -1588,7 +1607,7 @@ retry:
if (err)
nd->flags |= LOOKUP_JUMPED;
- if (needs_lookup_union(&nd->path, path)) {
+ if (needs_lookup_union(nd, &nd->path, path)) {
int err = lookup_union(nd, name, path);
if (err < 0)
return err;
@@ -2147,7 +2166,7 @@ static int lookup_hash(struct nameidata *nd, struct qstr *name,
path->mnt = nd->path.mnt;
path->dentry = result;
- if (needs_lookup_union(&nd->path, path))
+ if (needs_lookup_union(nd, &nd->path, path))
return lookup_union_locked(nd, name, path);
return 0;
}
diff --git a/fs/union.h b/fs/union.h
index 62d8ef5..5c4db67 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -92,7 +92,8 @@ struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
* dentry.
*/
static inline
-bool needs_lookup_union(struct path *parent_path, struct path *path)
+bool needs_lookup_union(struct nameidata *nd,
+ struct path *parent_path, struct path *path)
{
if (!IS_DIR_UNIONED(parent_path->dentry))
return false;
@@ -102,6 +103,12 @@ bool needs_lookup_union(struct path *parent_path, struct path *path)
if (IS_ROOT(path->dentry))
return false;
+ /* If this is a fallthru dentry and the caller requires the underlying
+ * inode to be copied up, then do so.
+ */
+ if (nd->flags & LOOKUP_COPY_UP && d_is_fallthru(path->dentry))
+ return true;
+
/* It's okay not to have the lock; will recheck in lookup_union() */
/* XXX set for root dentry at mount? */
return !(path->dentry->d_flags & DCACHE_UNION_LOOKUP_DONE);
@@ -134,7 +141,8 @@ static inline int union_create_topmost_dir(struct path *parent, struct qstr *nam
return 0;
}
-static inline bool needs_lookup_union(struct path *parent_path, struct path *path)
+static inline bool needs_lookup_union(struct nameidata *nd,
+ struct path *parent_path, struct path *path)
{
return false;
}
diff --git a/include/linux/namei.h b/include/linux/namei.h
index e273639..1736ece 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -65,6 +65,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_JUMPED 0x1000
#define LOOKUP_ROOT 0x2000
#define LOOKUP_EMPTY 0x4000
+#define LOOKUP_COPY_UP 0x8000 /* Copy up from lower mount if unionmounted */
extern int user_path_at(int, const char __user *, unsigned, struct path *);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
From: Valerie Aurora <[email protected]>
A union mount clones the vfsmount tree of all of the read-only layers
of the union and keeps a reference to it in the vfsmount of the
topmost layer of the union.
clone_union_tree() takes the path of the proposed union mountpoint and
attempts to clone every vfsmount mounted at that same pathname, as
well as their submounts. All these mounts must be read-only, not
slave, and not shared.
put_union_sb() unwinds everything clone_union_tree() does. It is
called when the superblock is deactivated. Thus, you can lazy unmount
a union mount and when the last reference goes away, the union will be
torn down.
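The search for the "lowest" layer in clone_union_tree() can be sketched in
userspace like this. struct toy_mount is a hypothetical stand-in for struct
mount that uses integer ids where the kernel compares dentry pointers; only
the loop shape is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of a mount; not the kernel's struct mount. */
struct toy_mount {
	struct toy_mount *parent;	/* mount we sit on; self for the root */
	int root;			/* id of this mount's root dentry */
	int mountpoint;			/* id of the dentry we are mounted on */
};

/*
 * Walk down the stack of mounts that share one pathname: keep going
 * while this mount sits on the root dentry of its parent, and stop at
 * the root of the mount tree.
 */
struct toy_mount *lowest_union_layer(struct toy_mount *mnt)
{
	assert(mnt != NULL);

	while (mnt->parent->root == mnt->mountpoint) {
		/* Reached the root of the mount tree? */
		if (mnt->parent == mnt)
			break;
		mnt = mnt->parent;
	}
	return mnt;
}
```

In this model, three filesystems stacked on one mountpoint resolve to the
first one mounted there, which is the mount whose tree is then cloned.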
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mount.h | 2 +
2 files changed, 69 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index bed9ccd..5fbe3b0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1450,6 +1450,73 @@ static int check_topmost_union_mnt(struct vfsmount *topmost_mnt, int mnt_flags)
#endif
}
+void put_union_sb(struct super_block *sb)
+{
+ if (unlikely(sb->s_union_lower_mnts)) {
+ struct mount *mnt = real_mount(sb->s_union_lower_mnts);
+ LIST_HEAD(umount_list);
+
+ br_write_lock(vfsmount_lock);
+ umount_tree(mnt, 0, &umount_list);
+ br_write_unlock(vfsmount_lock);
+ release_mounts(&umount_list);
+ sb->s_union_lower_mnts = NULL;
+ sb->s_union_count = 0;
+ }
+}
+
+/**
+ * clone_union_tree - Clone all union-able mounts at this mountpoint
+ * @topmost: vfsmount of topmost layer
+ * @mntpnt: target of union mount
+ *
+ * Given the target mountpoint of a union mount, clone all the mounts at that
+ * mountpoint (well, pathname) that qualify as a union lower layer. Increment
+ * the hard readonly count of the lower layer superblocks.
+ *
+ * Returns error if any of the mounts or submounts mounted on or below this
+ * pathname are unsuitable for union mounting. This means you can't construct
+ * a union mount at the root of an existing mount without unioning it.
+ *
+ * XXX - Maybe should take # of layers to go down as an argument. But how to
+ * pass this in through mount options? All solutions look ugly. Currently you
+ * express your intention through mounting file systems on the same mountpoint,
+ * which is pretty elegant.
+ */
+static int clone_union_tree(struct mount *topmost, struct path *mntpnt)
+{
+ struct mount *mnt, *cloned_tree;
+
+ if (!IS_ROOT(mntpnt->dentry)) {
+ printk(KERN_INFO "union mount: mount point must be a root dir\n");
+ return -EINVAL;
+ }
+
+ /* Look for the "lowest" layer to union. */
+ mnt = real_mount(mntpnt->mnt);
+ while (mnt->mnt_parent->mnt.mnt_root == mnt->mnt_mountpoint) {
+ /* Got root (mnt)? */
+ if (mnt->mnt_parent == mnt)
+ break;
+ mnt = mnt->mnt_parent;
+ }
+
+ /* Clone all the read-only mounts and submounts, only if they
+ * are not shared or slave, and increment the hard read-only
+ * users count on each one. If this can't be done for every
+ * mount and submount below this one, fail.
+ */
+ cloned_tree = copy_tree(mnt, mnt->mnt.mnt_root,
+ CL_COPY_ALL | CL_PRIVATE |
+ CL_NO_SHARED | CL_NO_SLAVE |
+ CL_MAKE_HARD_READONLY);
+ if (IS_ERR(cloned_tree))
+ return PTR_ERR(cloned_tree);
+
+ topmost->mnt.mnt_sb->s_union_lower_mnts = &cloned_tree->mnt;
+ return 0;
+}
+
/*
* @source_mnt : mount tree to be attached
* @nd : place the mount tree @source_mnt is attached
diff --git a/include/linux/mount.h b/include/linux/mount.h
index 0ba1def..67f46fa 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -78,4 +78,6 @@ extern void mark_mounts_for_expiry(struct list_head *mounts);
extern dev_t name_to_dev_t(char *name);
+extern void put_union_sb(struct super_block *sb);
+
#endif /* _LINUX_MOUNT_H */
Signed-off-by: Jan Blunck <[email protected]> (Original author)
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Forward porting)
---
fs/namei.c | 121 +++++++++++++++++++++++++++++++-----------------------------
1 files changed, 62 insertions(+), 59 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 4dc0e1d..2d983f7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1766,9 +1766,20 @@ static struct dentry *__lookup_hash(struct qstr *name,
* needs parent already locked. Doesn't follow mounts.
* SMP-safe.
*/
-static struct dentry *lookup_hash(struct nameidata *nd)
+static int lookup_hash(struct nameidata *nd, struct qstr *name,
+ struct path *path)
{
- return __lookup_hash(&nd->last, nd->path.dentry, nd);
+ struct dentry *result;
+
+ result = __lookup_hash(name, nd->path.dentry, nd);
+ if (IS_ERR(result)) {
+ path->mnt = NULL;
+ path->dentry = NULL;
+ return PTR_ERR(result);
+ }
+ path->mnt = nd->path.mnt;
+ path->dentry = result;
+ return 0;
}
/**
@@ -2098,7 +2109,6 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
const struct open_flags *op, const char *pathname)
{
struct dentry *dir = nd->path.dentry;
- struct dentry *dentry;
int open_flag = op->open_flag;
int will_truncate = open_flag & O_TRUNC;
int want_write = 0;
@@ -2178,18 +2188,14 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
mutex_lock(&dir->d_inode->i_mutex);
- dentry = lookup_hash(nd);
- error = PTR_ERR(dentry);
- if (IS_ERR(dentry)) {
+ error = lookup_hash(nd, &nd->last, path);
+ if (error) {
mutex_unlock(&dir->d_inode->i_mutex);
goto exit;
}
- path->dentry = dentry;
- path->mnt = nd->path.mnt;
-
/* Negative dentry, just create the file */
- if (!dentry->d_inode) {
+ if (!path->dentry->d_inode) {
umode_t mode = op->mode;
if (!IS_POSIXACL(dir->d_inode))
mode &= ~current_umask();
@@ -2208,15 +2214,15 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
open_flag &= ~O_TRUNC;
will_truncate = 0;
acc_mode = MAY_OPEN;
- error = security_path_mknod(&nd->path, dentry, mode, 0);
+ error = security_path_mknod(&nd->path, path->dentry, mode, 0);
if (error)
goto exit_mutex_unlock;
- error = vfs_create(dir->d_inode, dentry, mode, nd);
+ error = vfs_create(dir->d_inode, path->dentry, mode, nd);
if (error)
goto exit_mutex_unlock;
mutex_unlock(&dir->d_inode->i_mutex);
dput(nd->path.dentry);
- nd->path.dentry = dentry;
+ nd->path.dentry = path->dentry;
goto common;
}
@@ -2395,8 +2401,8 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
{
- struct dentry *dentry = ERR_PTR(-EEXIST);
struct nameidata nd;
+ struct path new_path;
int error = do_path_lookup(dfd, pathname, LOOKUP_PARENT, &nd);
if (error)
return ERR_PTR(error);
@@ -2405,6 +2411,7 @@ struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path
* Yucky last component or no last component at all?
* (foo/., foo/.., /////)
*/
+ error = -EEXIST;
if (nd.last_type != LAST_NORM)
goto out;
nd.flags &= ~LOOKUP_PARENT;
@@ -2415,11 +2422,11 @@ struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path
* Do the final lookup.
*/
mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- dentry = lookup_hash(&nd);
- if (IS_ERR(dentry))
+ error = lookup_hash(&nd, &nd.last, &new_path);
+ if (error)
goto fail;
- if (dentry->d_inode)
+ if (new_path.dentry->d_inode)
goto eexist;
/*
* Special case - lookup gave negative, but... we had foo/bar/
@@ -2428,20 +2435,20 @@ struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path
* been asking for (non-existent) directory. -ENOENT for you.
*/
if (unlikely(!is_dir && nd.last.name[nd.last.len])) {
- dput(dentry);
- dentry = ERR_PTR(-ENOENT);
- goto fail;
+ error = -ENOENT;
+ goto fail_do_put;
}
*path = nd.path;
- return dentry;
+ return new_path.dentry;
eexist:
- dput(dentry);
- dentry = ERR_PTR(-EEXIST);
+ error = -EEXIST;
+fail_do_put:
+ dput(new_path.dentry);
fail:
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
out:
path_put(&nd.path);
- return dentry;
+ return ERR_PTR(error);
}
EXPORT_SYMBOL(kern_path_create);
@@ -2673,7 +2680,7 @@ static long do_rmdir(int dfd, const char __user *pathname)
{
int error = 0;
char * name;
- struct dentry *dentry;
+ struct path path;
struct nameidata nd;
error = user_path_parent(dfd, pathname, &nd, &name);
@@ -2695,25 +2702,24 @@ static long do_rmdir(int dfd, const char __user *pathname)
nd.flags &= ~LOOKUP_PARENT;
mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- dentry = lookup_hash(&nd);
- error = PTR_ERR(dentry);
- if (IS_ERR(dentry))
+ error = lookup_hash(&nd, &nd.last, &path);
+ if (error)
goto exit2;
- if (!dentry->d_inode) {
+ if (!path.dentry->d_inode) {
error = -ENOENT;
goto exit3;
}
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit3;
- error = security_path_rmdir(&nd.path, dentry);
+ error = security_path_rmdir(&nd.path, path.dentry);
if (error)
goto exit4;
- error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
+ error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
exit4:
mnt_drop_write(nd.path.mnt);
exit3:
- dput(dentry);
+ path_put_conditional(&path, &nd);
exit2:
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
exit1:
@@ -2769,7 +2775,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
{
int error;
char *name;
- struct dentry *dentry;
+ struct path path;
struct nameidata nd;
struct inode *inode = NULL;
@@ -2784,27 +2790,26 @@ static long do_unlinkat(int dfd, const char __user *pathname)
nd.flags &= ~LOOKUP_PARENT;
mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
- dentry = lookup_hash(&nd);
- error = PTR_ERR(dentry);
- if (!IS_ERR(dentry)) {
+ error = lookup_hash(&nd, &nd.last, &path);
+ if (!error) {
/* Why not before? Because we want correct error value */
if (nd.last.name[nd.last.len])
goto slashes;
- inode = dentry->d_inode;
+ inode = path.dentry->d_inode;
if (!inode)
goto slashes;
ihold(inode);
error = mnt_want_write(nd.path.mnt);
if (error)
goto exit2;
- error = security_path_unlink(&nd.path, dentry);
+ error = security_path_unlink(&nd.path, path.dentry);
if (error)
goto exit3;
- error = vfs_unlink(nd.path.dentry->d_inode, dentry);
+ error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
exit3:
mnt_drop_write(nd.path.mnt);
exit2:
- dput(dentry);
+ path_put_conditional(&path, &nd);
}
mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
if (inode)
@@ -2815,8 +2820,8 @@ exit1:
return error;
slashes:
- error = !dentry->d_inode ? -ENOENT :
- S_ISDIR(dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
+ error = !path.dentry->d_inode ? -ENOENT :
+ S_ISDIR(path.dentry->d_inode->i_mode) ? -EISDIR : -ENOTDIR;
goto exit2;
}
@@ -3156,7 +3161,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
int, newdfd, const char __user *, newname)
{
struct dentry *old_dir, *new_dir;
- struct dentry *old_dentry, *new_dentry;
+ struct path old, new;
struct dentry *trap;
struct nameidata oldnd, newnd;
char *from;
@@ -3190,16 +3195,15 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
trap = lock_rename(new_dir, old_dir);
- old_dentry = lookup_hash(&oldnd);
- error = PTR_ERR(old_dentry);
- if (IS_ERR(old_dentry))
+ error = lookup_hash(&oldnd, &oldnd.last, &old);
+ if (error)
goto exit3;
/* source must exist */
error = -ENOENT;
- if (!old_dentry->d_inode)
+ if (!old.dentry->d_inode)
goto exit4;
/* unless the source is a directory trailing slashes give -ENOTDIR */
- if (!S_ISDIR(old_dentry->d_inode->i_mode)) {
+ if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
error = -ENOTDIR;
if (oldnd.last.name[oldnd.last.len])
goto exit4;
@@ -3208,32 +3212,31 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
}
/* source should not be ancestor of target */
error = -EINVAL;
- if (old_dentry == trap)
+ if (old.dentry == trap)
goto exit4;
- new_dentry = lookup_hash(&newnd);
- error = PTR_ERR(new_dentry);
- if (IS_ERR(new_dentry))
+ error = lookup_hash(&newnd, &newnd.last, &new);
+ if (error)
goto exit4;
/* target should not be an ancestor of source */
error = -ENOTEMPTY;
- if (new_dentry == trap)
+ if (new.dentry == trap)
goto exit5;
error = mnt_want_write(oldnd.path.mnt);
if (error)
goto exit5;
- error = security_path_rename(&oldnd.path, old_dentry,
- &newnd.path, new_dentry);
+ error = security_path_rename(&oldnd.path, old.dentry,
+ &newnd.path, new.dentry);
if (error)
goto exit6;
- error = vfs_rename(old_dir->d_inode, old_dentry,
- new_dir->d_inode, new_dentry);
+ error = vfs_rename(old_dir->d_inode, old.dentry,
+ new_dir->d_inode, new.dentry);
exit6:
mnt_drop_write(oldnd.path.mnt);
exit5:
- dput(new_dentry);
+ path_put_conditional(&new, &newnd);
exit4:
- dput(old_dentry);
+ path_put_conditional(&old, &oldnd);
exit3:
unlock_rename(new_dir, old_dir);
exit2:
Now that we have full union lookup support, look up the true d_type and
d_ino of a fallthru.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Further development)
---
fs/ext2/dir.c | 20 +++++++++++++-------
1 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 5fd6bbe..a509096 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -366,15 +366,21 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
}
} else if (de->file_type == EXT2_FT_FALLTHRU) {
int over;
+ unsigned char d_type = DT_UNKNOWN;
+ ino_t ino;
+ int err;
offset = (char *)de - kaddr;
- /* XXX placeholder until generic_readdir_fallthru() arrives */
- over = filldir(dirent, de->name, de->name_len,
- (n<<PAGE_CACHE_SHIFT) | offset,
- 1, DT_UNKNOWN); /* XXX */
- if (over) {
- ext2_put_page(page);
- return 0;
+ err = generic_readdir_fallthru(filp->f_path.dentry, de->name,
+ de->name_len, &ino, &d_type);
+ if (!err) {
+ over = filldir(dirent, de->name, de->name_len,
+ (n<<PAGE_CACHE_SHIFT) | offset,
+ ino, d_type);
+ if (over) {
+ ext2_put_page(page);
+ return 0;
+ }
}
}
filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
From: Valerie Aurora <[email protected]>
Add support for fallthru directory entries to tmpfs.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/dcache.c | 4 +++-
fs/libfs.c | 16 ++++++++++++---
mm/shmem.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
3 files changed, 72 insertions(+), 12 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index b1ce8d1..238684a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2896,7 +2896,9 @@ resume:
* can evict it.
*/
if (d_unhashed(dentry) ||
- (!dentry->d_inode && !d_is_whiteout(dentry))) {
+ (!dentry->d_inode &&
+ !d_is_whiteout(dentry) &&
+ !d_is_fallthru(dentry))) {
spin_unlock(&dentry->d_lock);
continue;
}
diff --git a/fs/libfs.c b/fs/libfs.c
index 38eb46d..43f1ac2 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -141,6 +141,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
struct dentry *cursor = filp->private_data;
struct list_head *p, *q = &cursor->d_u.d_child;
ino_t ino;
+ char d_type;
int i = filp->f_pos;
switch (i) {
@@ -167,17 +168,26 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
struct dentry *next;
next = list_entry(p, struct dentry, d_u.d_child);
spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
- if (!simple_positive(next)) {
+ if (d_unhashed(next) ||
+ (!next->d_inode && !d_is_fallthru(next))) {
spin_unlock(&next->d_lock);
continue;
}
spin_unlock(&next->d_lock);
spin_unlock(&dentry->d_lock);
+ if (d_is_fallthru(next)) {
+ /* XXX placeholder until generic_readdir_fallthru() arrives */
+ ino = 1;
+ d_type = DT_UNKNOWN;
+ } else {
+ ino = next->d_inode->i_ino;
+ d_type = dt_type(next->d_inode);
+ }
+
if (filldir(dirent, next->d_name.name,
next->d_name.len, filp->f_pos,
- next->d_inode->i_ino,
- dt_type(next->d_inode)) < 0)
+ ino, d_type) < 0)
return 0;
spin_lock(&dentry->d_lock);
spin_lock_nested(&next->d_lock, DENTRY_D_LOCK_NESTED);
diff --git a/mm/shmem.c b/mm/shmem.c
index ca0bd30..bfd9ac8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1481,8 +1481,7 @@ static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
static int shmem_unlink(struct inode *dir, struct dentry *dentry);
/*
- * This is the whiteout support for tmpfs. It uses one singleton whiteout
- * inode per superblock thus it is very similar to shmem_link().
+ * Create a dentry to signify a whiteout.
*/
static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
struct dentry *new_dentry)
@@ -1513,8 +1512,10 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
spin_unlock(&sbinfo->stat_lock);
}
- if (old_dentry->d_inode) {
- if (S_ISDIR(old_dentry->d_inode->i_mode))
+ if (old_dentry->d_inode || d_is_fallthru(old_dentry)) {
+ /* A fallthru for a dir is treated like a regular link */
+ if (old_dentry->d_inode &&
+ S_ISDIR(old_dentry->d_inode->i_mode))
shmem_rmdir(dir, old_dentry);
else
shmem_unlink(dir, old_dentry);
@@ -1531,6 +1532,48 @@ static int shmem_whiteout(struct inode *dir, struct dentry *old_dentry,
}
static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
+ struct inode *inode);
+
+/*
+ * Create a dentry to signify a fallthru. A fallthru in tmpfs is the
+ * logical equivalent of an in-kernel readdir() cache. It can't be
+ * deleted until the file system is unmounted.
+ */
+static int shmem_fallthru(struct inode *dir, struct dentry *dentry)
+{
+ struct shmem_sb_info *sbinfo = SHMEM_SB(dir->i_sb);
+
+ /* FIXME: this is stupid */
+ if (!(dir->i_sb->s_flags & MS_WHITEOUT))
+ return -EPERM;
+
+ if (dentry->d_inode || d_is_fallthru(dentry) || d_is_whiteout(dentry))
+ return -EEXIST;
+
+ /*
+ * Each new link needs a new dentry, pinning lowmem, and tmpfs
+ * dentries cannot be pruned until they are unlinked.
+ */
+ if (sbinfo->max_inodes) {
+ spin_lock(&sbinfo->stat_lock);
+ if (!sbinfo->free_inodes) {
+ spin_unlock(&sbinfo->stat_lock);
+ return -ENOSPC;
+ }
+ sbinfo->free_inodes--;
+ spin_unlock(&sbinfo->stat_lock);
+ }
+
+ shmem_d_instantiate(dir, dentry, NULL);
+ dir->i_ctime = dir->i_mtime = CURRENT_TIME;
+
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_FALLTHRU;
+ spin_unlock(&dentry->d_lock);
+ return 0;
+}
+
+static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
struct inode *inode)
{
if (d_is_whiteout(dentry)) {
@@ -1538,14 +1581,15 @@ static void shmem_d_instantiate(struct inode *dir, struct dentry *dentry,
shmem_free_inode(dir->i_sb);
if (S_ISDIR(inode->i_mode))
inode->i_mode |= S_OPAQUE;
+ } else if (d_is_fallthru(dentry)) {
+ shmem_free_inode(dir->i_sb);
} else {
/* New dentry */
dir->i_size += BOGO_DIRENT_SIZE;
dget(dentry); /* Extra count - pin the dentry in core */
}
- /* Will clear DCACHE_WHITEOUT flag */
+ /* Will clear DCACHE_WHITEOUT and DCACHE_FALLTHRU flags */
d_instantiate(dentry, inode);
-
}
/*
* File creation. Allocate an inode, and we're done..
@@ -1627,7 +1671,8 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
- if (d_is_whiteout(dentry) || (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
+ if (d_is_whiteout(dentry) || d_is_fallthru(dentry) ||
+ (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)))
shmem_free_inode(dir->i_sb);
if (inode) {
@@ -2347,8 +2392,10 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent)
sb->s_root = root;
#ifdef CONFIG_TMPFS
- if (!(sb->s_flags & MS_NOUSER))
+ if (!(sb->s_flags & MS_NOUSER)) {
sb->s_flags |= MS_WHITEOUT;
+ sb->s_flags |= MS_FALLTHRU;
+ }
#endif
return 0;
@@ -2451,6 +2498,7 @@ static const struct inode_operations shmem_dir_inode_operations = {
.mknod = shmem_mknod,
.rename = shmem_rename,
.whiteout = shmem_whiteout,
+ .fallthru = shmem_fallthru,
#endif
#ifdef CONFIG_TMPFS_XATTR
.setxattr = shmem_setxattr,
From: Valerie Aurora <[email protected]>
A remount request must not (a) convert a union to a non-union (or vice
versa), or (b) make the topmost layer of a union read-only.
Note that we only have to worry about attempts to remount the vfsmount
of the topmost read-write layer of the union (the one with MNT_UNION set).
The vfsmounts of the read-only layers are hidden in a cloned tree
hanging off the superblock of the topmost layer and aren't visible to
userspace.
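The three checks added to do_remount() amount to the following predicate.
This is a userspace sketch only: the TOY_* bit values and the function name
union_remount_check() are assumptions made for illustration, not the kernel's
definitions.

```c
#include <assert.h>	/* for the checks mentioned in the note below */
#include <errno.h>

#define TOY_MNT_UNION		0x1	/* hypothetical bit values */
#define TOY_MNT_READONLY	0x2

/*
 * cur_flags: flags of the existing mount; new_flags: flags requested
 * by the remount.  Returns 0 if the remount is allowed, else -EINVAL.
 */
int union_remount_check(int cur_flags, int new_flags)
{
	/* (a) A union may not become a non-union... */
	if ((cur_flags & TOY_MNT_UNION) && !(new_flags & TOY_MNT_UNION))
		return -EINVAL;

	/* ...and a non-union may not become a union. */
	if ((new_flags & TOY_MNT_UNION) && !(cur_flags & TOY_MNT_UNION))
		return -EINVAL;

	/* (b) The topmost layer of a union must stay writable. */
	if ((cur_flags & TOY_MNT_UNION) && (new_flags & TOY_MNT_READONLY))
		return -EINVAL;

	return 0;
}
```

For example, assert(union_remount_check(TOY_MNT_UNION, TOY_MNT_UNION) == 0)
holds, whereas asking for TOY_MNT_UNION | TOY_MNT_READONLY on the same mount
yields -EINVAL.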
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 261944d..aa6b1ef 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1907,6 +1907,18 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
if (!check_mnt(mnt))
return -EINVAL;
+ if ((path->mnt->mnt_flags & MNT_UNION) &&
+ !(mnt_flags & MNT_UNION))
+ return -EINVAL;
+
+ if ((mnt_flags & MNT_UNION) &&
+ !(path->mnt->mnt_flags & MNT_UNION))
+ return -EINVAL;
+
+ if ((path->mnt->mnt_flags & MNT_UNION) &&
+ (mnt_flags & MNT_READONLY))
+ return -EINVAL;
+
if (path->dentry != path->mnt->mnt_root)
return -EINVAL;
Duplicate the i_mutex and i_dir_mutex lock classes and use the duplicates for
the unionmount upper layer superblock instead of the normal lock classes. This
silences some of the lockdep noise when the VFS tries to hold locks on inodes
in both layers at the same time. Note that this noise only occurs if both
layers are of the same filesystem type.
As far as I can tell, most of the lockdep warnings are false positives since
the inodes being locked are part of different superblocks; however, because
lockdep works on lock *classes*, it can't determine this.
I suspect that giving each superblock its own lock class would overextend
lockdep.
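The class selection can be pictured with the userspace sketch below. The toy_*
types and functions are hypothetical and merely mirror the shape of the
change: a per-superblock index, derived from MS_UNION, picks one of two keys
per filesystem type.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MS_UNION 0x1UL	/* hypothetical bit value */

/* Hypothetical stand-in for struct lock_class_key. */
struct toy_lock_key { int dummy; };

/* Each filesystem type carries two keys per mutex kind instead of one. */
struct toy_fs_type {
	struct toy_lock_key i_mutex_key[2];
	struct toy_lock_key i_mutex_dir_key[2];
};

/* Mirrors the alloc_super() hunk: union superblocks get class 1. */
unsigned char toy_lock_class(unsigned long sb_flags)
{
	return (sb_flags & TOY_MS_UNION) ? 1 : 0;
}

/* Pick the key an inode's i_mutex would be annotated with. */
const struct toy_lock_key *
toy_inode_mutex_key(const struct toy_fs_type *type,
		    unsigned char lock_class, int is_dir)
{
	assert(type != NULL && lock_class < 2);

	return is_dir ? &type->i_mutex_dir_key[lock_class]
		      : &type->i_mutex_key[lock_class];
}
```

Two superblocks of the same filesystem type thus end up with distinct keys
when exactly one of them is the union's upper layer, which is the case the
lockdep warnings come from.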
Signed-off-by: David Howells <[email protected]>
---
fs/inode.c | 48 ++++++++++++++++++++++++++++++++++++------------
fs/namespace.c | 2 +-
fs/super.c | 8 ++++++++
include/linux/fs.h | 5 +++--
4 files changed, 48 insertions(+), 15 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..95f926b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -166,8 +166,14 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
spin_lock_init(&inode->i_lock);
lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key);
+ /* Duplicate the code with separate indices so that when lockdep prints
+ * a warning, the numeric index is seen.
+ */
mutex_init(&inode->i_mutex);
- lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
+ if (sb->s_lock_class == 0)
+ lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key[0]);
+ else
+ lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key[1]);
atomic_set(&inode->i_dio_count, 0);
@@ -935,18 +941,36 @@ EXPORT_SYMBOL(new_inode);
void lockdep_annotate_inode_mutex_key(struct inode *inode)
{
if (S_ISDIR(inode->i_mode)) {
- struct file_system_type *type = inode->i_sb->s_type;
+ struct super_block *sb = inode->i_sb;
+ struct file_system_type *type = sb->s_type;
- /* Set new key only if filesystem hasn't already changed it */
- if (!lockdep_match_class(&inode->i_mutex,
- &type->i_mutex_key)) {
- /*
- * ensure nobody is actually holding i_mutex
- */
- mutex_destroy(&inode->i_mutex);
- mutex_init(&inode->i_mutex);
- lockdep_set_class(&inode->i_mutex,
- &type->i_mutex_dir_key);
+ /* Set new key only if filesystem hasn't already changed it
+ *
+ * Duplicate the code with separate indices so that when
+ * lockdep prints a warning, the numeric index is seen.
+ */
+ if (sb->s_lock_class == 0) {
+ if (!lockdep_match_class(&inode->i_mutex,
+ &type->i_mutex_key[0])) {
+ /*
+ * ensure nobody is actually holding i_mutex
+ */
+ mutex_destroy(&inode->i_mutex);
+ mutex_init(&inode->i_mutex);
+ lockdep_set_class(&inode->i_mutex,
+ &type->i_mutex_dir_key[0]);
+ }
+ } else {
+ if (!lockdep_match_class(&inode->i_mutex,
+ &type->i_mutex_key[1])) {
+ /*
+ * ensure nobody is actually holding i_mutex
+ */
+ mutex_destroy(&inode->i_mutex);
+ mutex_init(&inode->i_mutex);
+ lockdep_set_class(&inode->i_mutex,
+ &type->i_mutex_dir_key[1]);
+ }
}
}
}
diff --git a/fs/namespace.c b/fs/namespace.c
index c990f69..5e8328e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2441,7 +2441,7 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
- MS_STRICTATIME | MS_UNION);
+ MS_STRICTATIME);
if (flags & MS_REMOUNT)
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/fs/super.c b/fs/super.c
index 732e19b..4d24f05 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -137,6 +137,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
INIT_LIST_HEAD(&s->s_files);
#endif
s->s_flags = flags;
+ s->s_lock_class = (flags & MS_UNION) ? 1 : 0;
s->s_bdi = &default_backing_dev_info;
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
@@ -449,6 +450,13 @@ retry:
deactivate_locked_super(old);
goto retry;
}
+#ifdef CONFIG_UNION_MOUNT
+ if (unlikely((old->s_flags | flags) & MS_UNION)) {
+ up_write(&old->s_umount);
+ deactivate_locked_super(old);
+ return ERR_PTR(-EINVAL);
+ }
+#endif
return old;
}
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f19772c..e130d00 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1417,6 +1417,7 @@ struct super_block {
dev_t s_dev; /* search index; _not_ kdev_t */
unsigned char s_dirt;
unsigned char s_blocksize_bits;
+ u8 s_lock_class; /* Set of lock classes to use */
unsigned long s_blocksize;
loff_t s_maxbytes; /* Max file size */
struct file_system_type *s_type;
@@ -1861,8 +1862,8 @@ struct file_system_type {
struct lock_class_key s_vfs_rename_key;
struct lock_class_key i_lock_key;
- struct lock_class_key i_mutex_key;
- struct lock_class_key i_mutex_dir_key;
+ struct lock_class_key i_mutex_key[2];
+ struct lock_class_key i_mutex_dir_key[2];
};
extern struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
From: Valerie Aurora <[email protected]>
The device underlying the topmost read-write layer of a file system
cannot be mounted anywhere else on the system. We keep a pointer to
the union stack in the dentry of the topmost directory, so that dentry
can't be part of a different mount, since dentries are shared between
different mounts of the same device.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index aa6b1ef..3c950fa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2091,6 +2091,11 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
if (S_ISLNK(newmnt->mnt.mnt_root->d_inode->i_mode))
goto unlock;
+ /* Top layers of union mounts can't be mounted elsewhere */
+ err = -EBUSY;
+ if (newmnt->mnt.mnt_sb->s_union_lower_mnts)
+ goto unlock;
+
newmnt->mnt.mnt_flags = mnt_flags;
err = graft_tree(newmnt, path);
From: Valerie Aurora <[email protected]>
In readdir(), client file systems need to lookup the target of a
fallthru in a lower layer for three reasons: (1) fill in d_ino, (2)
fill in d_type, (3) make sure there is something to fall through to
(and if not, don't return this dentry). Create a generic helper
function.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
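As a rough userspace illustration of what the helper does (not the kernel code itself), the sketch below searches an ordered list of toy lower layers for a name and reports the inode number and DT_xxx type of the first match. The `toy_` types are assumptions made for the example; the shift-and-mask is the same one the patch's dt_type() uses, and it relies on the Linux encoding of the file type in the mode bits.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Same shift-and-mask as the patch's dt_type(): the file-type bits of
 * the mode, shifted down by 12, give the DT_xxx value. */
static unsigned char toy_dt_type(unsigned int mode)
{
	return (mode >> 12) & 15;
}

/* Toy directory entry and layer (assumptions for illustration). */
struct toy_entry {
	const char *name;
	unsigned long ino;
	unsigned int mode;
};

struct toy_layer {
	const struct toy_entry *entries;
	size_t count;
};

/* Search the lower layers in order and report the first match. */
static int toy_readdir_fallthru(const struct toy_layer *layers, size_t nlayers,
				const char *name, unsigned long *ino,
				unsigned char *d_type)
{
	size_t i, j;

	for (i = 0; i < nlayers; i++) {
		for (j = 0; j < layers[i].count; j++) {
			const struct toy_entry *e = &layers[i].entries[j];

			if (strcmp(e->name, name) == 0) {
				*ino = e->ino;
				*d_type = toy_dt_type(e->mode);
				return 0;
			}
		}
	}
	return -ENOENT;	/* nothing to fall through to */
}
```

A caller that gets -ENOENT back would simply not emit the fallthru entry, matching the third reason given above.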
fs/union.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 10 ++++++++++
include/linux/fs.h | 15 +++++++++++++++
3 files changed, 78 insertions(+), 0 deletions(-)
diff --git a/fs/union.c b/fs/union.c
index 0c0490f..8efed50 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -386,3 +386,56 @@ out_fput:
mnt_drop_write(topmost_path->mnt);
return error;
}
+
+/* Relationship between i_mode and the DT_xxx types */
+static inline unsigned char dt_type(struct inode *inode)
+{
+ return (inode->i_mode >> 12) & 15;
+}
+
+/**
+ * generic_readdir_fallthru - Helper to lookup target of a fallthru
+ * @topmost_dentry: dentry for the topmost dentry of the dir being read
+ * @name: name of fallthru dirent
+ * @namlen: length of @name
+ * @ino: return inode number of target, if found
+ * @d_type: return directory type of target, if found
+ *
+ * In readdir(), client file systems need to lookup the target of a
+ * fallthru in a lower layer for three reasons: (1) fill in d_ino, (2)
+ * fill in d_type, (3) make sure there is something to fall through to
+ * (and if not, don't return this dentry). Upon detecting a fallthru
+ * dentry in readdir(), the client file system should call this function.
+ *
+ * Returns 0 on success and -ENOENT if no matching directory entry was
+ * found (which can happen when the topmost file system is unmounted
+ * and remounted over a different file system). Any other errors
+ * are unexpected.
+ */
+int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
+ int namlen, ino_t *ino, unsigned char *d_type)
+{
+ struct path *parent;
+ struct dentry *dentry;
+ unsigned int i, layers = topmost_dentry->d_sb->s_union_count;
+
+ BUG_ON(!mutex_is_locked(&topmost_dentry->d_inode->i_mutex));
+
+ for (i = 0; i < layers; i++) {
+ parent = union_find_dir(topmost_dentry, i);
+ mutex_lock(&parent->dentry->d_inode->i_mutex);
+ dentry = lookup_one_len(name, parent->dentry, namlen);
+ mutex_unlock(&parent->dentry->d_inode->i_mutex);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+ if (dentry->d_inode) {
+ *ino = dentry->d_inode->i_ino;
+ *d_type = dt_type(dentry->d_inode);
+ dput(dentry);
+ return 0;
+ }
+ dput(dentry);
+ }
+ return -ENOENT;
+}
+EXPORT_SYMBOL(generic_readdir_fallthru);
diff --git a/fs/union.h b/fs/union.h
index a77bd5f..46944b9 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -71,6 +71,8 @@ extern int union_add_dir(struct path *, struct path *, unsigned int);
extern int union_create_topmost_dir(struct path *, struct qstr *, struct path *,
struct path *);
extern int union_copyup_dir(struct path *);
+extern int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
+ int namlen, ino_t *ino, unsigned char *d_type);
static inline
struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
@@ -137,4 +139,12 @@ static inline int union_copyup_dir(struct path *topmost_path)
return 0;
}
+static inline
+int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
+ int namlen, ino_t *ino, unsigned char *d_type)
+{
+ BUG();
+ return 0;
+}
+
#endif /* CONFIG_UNION_MOUNT */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e130d00..46b35ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2569,6 +2569,21 @@ extern int generic_file_fsync(struct file *, loff_t, loff_t, int);
extern int generic_check_addressable(unsigned, u64);
+#ifdef CONFIG_UNION_MOUNT
+extern int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
+ int namlen, ino_t *ino, unsigned char *d_type);
+#else
+static inline int generic_readdir_fallthru(struct dentry *topmost_dentry, const char *name,
+ int namlen, ino_t *ino, unsigned char *d_type)
+{
+ /*
+ * Found a fallthru on a kernel without union support.
+ * There's nothing to fall through to, so return -ENOENT.
+ */
+ return -ENOENT;
+}
+#endif
+
#ifdef CONFIG_MIGRATION
extern int buffer_migrate_page(struct address_space *,
struct page *, struct page *,
From: Valerie Aurora <[email protected]>
struct union_stack records the stack of directories unioned at this
directory. A union_stack is an array of struct paths, dynamically
allocated when the dentry for the topmost directory is created. The
topmost dentry contains a pointer to the union_stack.
Signed-off-by: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/dcache.c | 3 +++
fs/union.h | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dcache.h | 25 ++++++++++++++++++++++-
3 files changed, 80 insertions(+), 1 deletions(-)
create mode 100644 fs/union.h
diff --git a/fs/dcache.c b/fs/dcache.c
index 238684a..326a432 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1222,6 +1222,9 @@ struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
dentry->d_sb = sb;
dentry->d_op = NULL;
dentry->d_fsdata = NULL;
+#ifdef CONFIG_UNION_MOUNT
+ dentry->d_union_stack = NULL;
+#endif
INIT_HLIST_BL_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_LIST_HEAD(&dentry->d_subdirs);
diff --git a/fs/union.h b/fs/union.h
new file mode 100644
index 0000000..d42dc09
--- /dev/null
+++ b/fs/union.h
@@ -0,0 +1,53 @@
+/*
+ * VFS-based union mounts for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007-2009 Novell Inc.
+ * Copyright (C) 2009-2010 Red Hat, Inc.
+ *
+ * Author(s): Jan Blunck ([email protected])
+ * Valerie Aurora <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifdef CONFIG_UNION_MOUNT
+
+#include <linux/mount.h>
+#include <linux/dcache.h>
+#include <linux/path.h>
+
+/*
+ * WARNING! Confusing terminology alert.
+ *
+ * Note that the directions "up" and "down" in union mounts are the
+ * opposite of "up" and "down" in normal VFS operation terminology.
+ * "up" in the rest of the VFS means "towards the root of the mount
+ * tree." If you mount B on top of A, following B "up" will get you
+ * A. In union mounts, "up" means "towards the most recently mounted
+ * layer of the union stack." If you union mount B on top of A,
+ * following A "up" will get you to B. Another way to put it is that
+ * "up" in the VFS means going from this mount towards the direction
+ * of its mnt->mnt_parent pointer, but "up" in union mounts means
+ * going in the opposite direction (until you run out of union
+ * layers).
+ */
+
+/*
+ * The union_stack structure. It is an array of struct paths of
+ * directories below the topmost directory in a unioned directory. The
+ * topmost dentry has a pointer to this structure. The topmost dentry
+ * can only be part of one union, so we can reference it from the
+ * dentry, but lower dentries can be part of multiple union stacks.
+ *
+ * The number of dirs actually allocated is kept in the superblock,
+ * s_union_count.
+ */
+struct union_stack {
+ struct path u_dirs[0];
+};
+
+#endif /* CONFIG_UNION_MOUNT */
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index cc0181b..e2d44e1 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -102,16 +102,36 @@ full_name_hash(const unsigned char *name, unsigned int len)
* Try to keep struct dentry aligned on 64 byte cachelines (this will
* give reasonable cacheline footprint with larger lines without the
* large memory footprint increase).
+ *
+ * XXX DNAME_INLINE_LEN_MIN is kind of pitiful on 64bit + union
+ * mounts. May be worth tuning up, but either we go to 256 bytes and
+ * a wasteful 88 bytes of d_iname, or we lose 64-byte alignment.
*/
#ifdef CONFIG_64BIT
+
+#ifdef CONFIG_UNION_MOUNT
+# define DNAME_INLINE_LEN 24 /* 192 bytes */
+#else
# define DNAME_INLINE_LEN 32 /* 192 bytes */
+#endif /* CONFIG_UNION_MOUNT */
+
+#else
+
+#ifdef CONFIG_UNION_MOUNT
+# ifdef CONFIG_SMP
+# define DNAME_INLINE_LEN 32 /* 128 bytes */
+# else
+# define DNAME_INLINE_LEN 36 /* 128 bytes */
+# endif
#else
# ifdef CONFIG_SMP
# define DNAME_INLINE_LEN 36 /* 128 bytes */
# else
# define DNAME_INLINE_LEN 40 /* 128 bytes */
# endif
-#endif
+#endif /* CONFIG_UNION_MOUNT */
+
+#endif /* CONFIG_64BIT */
struct dentry {
/* RCU lookup touched fields */
@@ -132,6 +152,9 @@ struct dentry {
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */
+#ifdef CONFIG_UNION_MOUNT
+ struct union_stack *d_union_stack; /* dirs in union stack */
+#endif
struct list_head d_lru; /* LRU list */
/*
* d_child and d_rcu can share memory
From: Valerie Aurora <[email protected]>
lookup_union_locked() checks if union lookup is actually necessary for this
dentry, and marks the dentry to show that the union lookup has been performed -
all whilst the caller holds the directory i_mutex.
__lookup_union() may overwrite the parent's path in the nameidata
struct for the entry being looked up. This is because it reuses the
same nameidata to do lookups in each of the lower layer directories.
lookup_union() saves and restores the original parent's path in the
nameidata.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
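The save/restore wrapper described above boils down to the pattern sketched here. This is a simplified userspace model under stated assumptions: locking is left out entirely, the `toy_` structures are invented for the example, and toy_lookup_layers() is a stand-in for __lookup_union() that does nothing but clobber the path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for struct path, struct nameidata and the dentry flag. */
struct toy_path { int mnt; int dentry; };
struct toy_nameidata { struct toy_path path; };
struct toy_dentry { bool union_lookup_done; };

/* Stand-in for __lookup_union(): reuses nd->path for each lower layer
 * and so leaves it pointing somewhere else when it returns. */
static int toy_lookup_layers(struct toy_nameidata *nd, int nlayers)
{
	int i;

	for (i = 1; i <= nlayers; i++) {
		nd->path.mnt = 100 + i;
		nd->path.dentry = 200 + i;
	}
	return 0;
}

/* Wrapper: skip if already done, else save, look up, restore, mark. */
static int toy_lookup_union_locked(struct toy_nameidata *nd,
				   struct toy_dentry *topmost, int nlayers)
{
	struct toy_path saved;
	int err;

	if (topmost->union_lookup_done)
		return 0;

	saved = nd->path;
	err = toy_lookup_layers(nd, nlayers);
	nd->path = saved;

	topmost->union_lookup_done = true;
	return err;
}
```

As in the patch, the done flag is set whether or not the lookup succeeded, so a second call returns early without touching the nameidata.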
fs/namei.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 47 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 37e32b4..2d69ce1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1229,6 +1229,53 @@ out_err:
return err;
}
+/**
+ * lookup_union_locked - Lookup and/or build union stack if needed
+ * @nd: nameidata for the parent of @topmost
+ * @name: name of target
+ * @topmost: path of the target on the topmost file system
+ *
+ * Check if we need to do a union lookup on this target. Mark dentry
+ * to show that the union lookup has been performed.
+ *
+ * We borrow the nameidata struct from the topmost layer to do the
+ * revalidation on lower dentries, replacing the topmost parent
+ * directory's path with that of the matching parent dir in each lower
+ * layer. This wrapper for __lookup_union() saves the topmost layer's
+ * path and restores it when we are done.
+ *
+ * Caller must hold parent i_mutex.
+ */
+static int lookup_union_locked(struct nameidata *nd, struct qstr *name,
+ struct path *topmost)
+{
+ struct path saved_path;
+ int err;
+
+ BUG_ON(!IS_MNT_UNION(nd->path.mnt) && !IS_MNT_UNION(topmost->mnt));
+ BUG_ON(!mutex_is_locked(&nd->path.dentry->d_inode->i_mutex));
+
+ /* Initial test done outside of parent i_mutex lock, recheck it. We
+ * only set this flag inside parent i_mutex so it's safe to check it
+ * here (only need d_lock when setting to avoid squashing other flags).
+ */
+ if (topmost->dentry->d_flags & DCACHE_UNION_LOOKUP_DONE)
+ return 0;
+
+ saved_path = nd->path;
+
+ err = __lookup_union(nd, name, topmost);
+
+ nd->path = saved_path;
+
+ /* XXX move into dcache.h */
+ spin_lock(&topmost->dentry->d_lock);
+ topmost->dentry->d_flags |= DCACHE_UNION_LOOKUP_DONE;
+ spin_unlock(&topmost->dentry->d_lock);
+
+ return err;
+}
+
/*
* Allocate a dentry with name and parent, and perform a parent
* directory ->lookup on it. Returns the new dentry, or ERR_PTR
From: Valerie Aurora <[email protected]>
Proof-of-concept implementation of user_path_nd(). Lookup both the
parent and the target of a user-supplied filename, to supply later to
union copyup routines.
XXX - Inefficient, racy, gets the parent of the symlink instead of the
parent of the target. Al Viro would like to see something more like
this:
user_path_mumble() looks up and returns:
    parent nameidata
    positive topmost dentry of target
    negative dentry of target from the topmost layer (if it doesn't exist on top)
Both the positive lower dentry and negative topmost dentry are passed
to the following code, like do_chown(). The tests for permissions and
such-like are performed on the positive lower dentry. When it comes
time to actually modify the target, we call union_copyup() with both
positive and negative dentries (and the parent nameidata).
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 31 +++++++++++++++++++++++++++++++
include/linux/namei.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index d52377d..be505cd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2239,6 +2239,37 @@ static int user_path_parent(int dfd, const char __user *path,
return error;
}
+int user_path_nd(int dfd, const char __user *filename,
+ unsigned flags, struct nameidata *parent_nd,
+ struct path *child, char **tmp)
+{
+ struct nameidata child_nd;
+ char *s = getname(filename);
+ int error;
+
+ if (IS_ERR(s))
+ return PTR_ERR(s);
+
+ /* Lookup parent */
+ error = do_path_lookup(dfd, s, LOOKUP_PARENT, parent_nd);
+ if (error)
+ goto out_putname;
+
+ /* Lookup child - XXX optimize, racy */
+ error = do_path_lookup(dfd, s, flags, &child_nd);
+ if (error)
+ goto out_path_put;
+ *child = child_nd.path;
+ *tmp = s;
+ return 0;
+
+out_path_put:
+ path_put(&parent_nd->path);
+out_putname:
+ putname(s);
+ return error;
+}
+
/*
* It's inline, so penalty for filesystems that don't use sticky bit is
* minimal.
diff --git a/include/linux/namei.h b/include/linux/namei.h
index ffc0213..e273639 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -68,6 +68,8 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
extern int user_path_at(int, const char __user *, unsigned, struct path *);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
+extern int user_path_nd(int, const char __user *, unsigned,
+ struct nameidata *, struct path *, char **);
#define user_path(name, path) user_path_at(AT_FDCWD, name, LOOKUP_FOLLOW, path)
#define user_lpath(name, path) user_path_at(AT_FDCWD, name, 0, path)
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Further development)
---
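The mount selection and the ordering of the checks in this patch's do_sys_truncate() changes can be summarised by the following sketch. It is a loose model, not the kernel logic: the `toy_mount` booleans stand in for IS_MNT_LOWER(), IS_MNT_UNION() and the superblock's MS_RDONLY flag, the outcome of the inode permission check is collapsed into a single boolean, and -EACCES as the failure value is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy mount: just the properties the decision depends on. */
struct toy_mount {
	bool is_lower;	/* lower (read-only) layer of a union */
	bool is_union;	/* topmost layer of a union */
	bool sb_rdonly;	/* superblock mounted read-only */
};

/* If the target was found on a lower layer, write access must be taken
 * on the parent's mount (the upper fs); otherwise on the target's own. */
static const struct toy_mount *toy_write_mount(const struct toy_mount *target,
					       const struct toy_mount *parent)
{
	return target->is_lower ? parent : target;
}

/* Order of checks: the upper fs must be writable first, then the inode
 * must grant write permission. */
static int toy_truncate_checks(const struct toy_mount *mnt,
			       bool inode_grants_write)
{
	if (mnt->is_union && mnt->sb_rdonly)
		return -EROFS;
	return inode_grants_write ? 0 : -EACCES;
}
```

Copy-up only happens after both checks pass, which is why truncate does not share the copy-up-before-permission-check concern raised for the other metadata syscalls.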
fs/open.c | 44 ++++++++++++++++++++++++++++++++++++++------
1 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index bce645b..f61183b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -65,14 +65,17 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
static long do_sys_truncate(const char __user *pathname, loff_t length)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int error;
error = -EINVAL;
if (length < 0) /* sorry, but loff_t says... */
goto out;
- error = user_path(pathname, &path);
+ error = user_path_nd(AT_FDCWD, pathname, 0, &nd, &path, &tmp);
if (error)
goto out;
inode = path.dentry->d_inode;
@@ -86,18 +89,45 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
if (!S_ISREG(inode->i_mode))
goto dput_and_out;
- error = mnt_want_write(path.mnt);
+ /* If we're looking at the lower layer of a union mount, then we need
+ * to create the file on the upperfs and truncate that.
+ */
+ if (IS_MNT_LOWER(path.mnt))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto dput_and_out;
- error = inode_permission(inode, MAY_WRITE);
- if (error)
- goto mnt_drop_write_and_out;
+ if (unlikely(IS_MNT_UNION(mnt))) {
+ /* We have to be able to write to the upperfs. */
+ error = -EROFS;
+ if (mnt->mnt_sb->s_flags & MS_RDONLY)
+ goto mnt_drop_write_and_out;
+
+ /* But the lowerfs inode must offer write permission - if the
+ * lowerfs was mounted writably. */
+ error = __inode_permission(inode, MAY_WRITE);
+ if (error)
+ goto mnt_drop_write_and_out;
+ } else {
+ error = inode_permission(inode, MAY_WRITE);
+ if (error)
+ goto mnt_drop_write_and_out;
+ }
error = -EPERM;
if (IS_APPEND(inode))
goto mnt_drop_write_and_out;
+ error = union_copyup_len(&nd, &path, length);
+ if (error)
+ goto mnt_drop_write_and_out;
+
+ /* path may have changed after copyup */
+ inode = path.dentry->d_inode;
error = get_write_access(inode);
if (error)
goto mnt_drop_write_and_out;
@@ -119,9 +149,11 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
put_write_and_out:
put_write_access(inode);
mnt_drop_write_and_out:
- mnt_drop_write(path.mnt);
+ mnt_drop_write(mnt);
dput_and_out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
From: Valerie Aurora <[email protected]>
Opaque directories are the directory equivalent of whiteouts. Define the
generic opaque inode flags and operations.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
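To show how the two flags are meant to fit together, here is a small userspace sketch. The flag values are the ones this patch defines; `struct toy_inode` and the two helpers are hypothetical, and the assumption that an opaque directory stops lookup from descending into lower layers follows from the whiteout analogy rather than from code in this patch.

```c
#include <assert.h>

#define S_OPAQUE	8192u		/* in-core inode flag, value from the patch */
#define FS_OPAQUE_FL	0x04000000u	/* on-disk/ioctl flag, value from the patch */

/* Toy inode carrying only the flags word. */
struct toy_inode { unsigned int i_flags; };

#define IS_OPAQUE(inode)	((inode)->i_flags & S_OPAQUE)

/* Translate the on-disk flag into the in-core one, as a file system's
 * set-inode-flags routine might do (hypothetical helper). */
static void toy_set_inode_flags(struct toy_inode *inode,
				unsigned int disk_flags)
{
	inode->i_flags &= ~S_OPAQUE;
	if (disk_flags & FS_OPAQUE_FL)
		inode->i_flags |= S_OPAQUE;
}

/* Should a union lookup continue into the layers below this directory? */
static int toy_lookup_lower(const struct toy_inode *dir)
{
	return !IS_OPAQUE(dir);
}
```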
include/linux/fs.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b8276c0..ab36080 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -241,6 +241,7 @@ struct inodes_stat_t {
#define S_IMA 1024 /* Inode has an associated IMA struct */
#define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
#define S_NOSEC 4096 /* no suid or xattr security attributes */
+#define S_OPAQUE 8192 /* Directory is opaque */
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -278,6 +279,7 @@ struct inodes_stat_t {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
+#define IS_OPAQUE(inode) ((inode)->i_flags & S_OPAQUE)
/* the read-only stuff doesn't really belong here, but any other place is
probably as bad and I don't want to create yet another include file. */
@@ -362,9 +364,11 @@ struct inodes_stat_t {
#define FS_NOTAIL_FL 0x00008000 /* file tail should not be merged */
#define FS_DIRSYNC_FL 0x00010000 /* dirsync behaviour (directories only) */
#define FS_TOPDIR_FL 0x00020000 /* Top of directory hierarchies*/
+/* 0x00040000 is used by ext4 */
#define FS_EXTENT_FL 0x00080000 /* Extents */
#define FS_DIRECTIO_FL 0x00100000 /* Use direct i/o */
#define FS_NOCOW_FL 0x00800000 /* Do not cow file */
+#define FS_OPAQUE_FL 0x04000000 /* Dir is opaque */
#define FS_RESERVED_FL 0x80000000 /* reserved for ext2 lib */
#define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
From: Valerie Aurora <[email protected]>
During mount(), build_root_union() creates the union stack for the
root directory. All other directory union stacks are bootstrapped
from their parents' union stacks during path lookup.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
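The two-pass traversal that build_root_union() performs can be pictured with the toy model below. It is a sketch under simplifying assumptions: each toy mount has at most one mount stacked on its root (the `on_root` pointer is invented for the example, the kernel walks the mount tree instead), reference counting is ignored, and the stack is a fixed-size array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LAYERS 8

/* Toy mount with just enough linkage for the walk. */
struct toy_mount {
	const char *name;
	struct toy_mount *parent;	/* mount this one sits on (VFS "up") */
	struct toy_mount *on_root;	/* mount stacked on our root, if any */
};

/* Walk union-"up" from the bottom lower mount to find the topmost
 * read-only layer, then walk back "down" via the parent pointers,
 * filling the stack from the top. Returns the number of layers. */
static unsigned int toy_build_root_union(struct toy_mount *bottom,
					 struct toy_mount *stack[TOY_MAX_LAYERS])
{
	struct toy_mount *top = bottom;
	unsigned int i, layers = 1;

	while (top->on_root && layers < TOY_MAX_LAYERS) {
		top = top->on_root;
		layers++;
	}

	for (i = 0; i < layers; i++) {
		stack[i] = top;
		top = top->parent;
	}
	return layers;
}
```

Slot 0 therefore ends up holding the topmost read-only layer and the last slot the bottom one, which matches the order in which the patch calls union_add_dir().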
fs/namespace.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 49 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 1f24a6b..3355b99 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -22,6 +22,7 @@
#include <linux/uaccess.h>
#include "pnode.h"
#include "internal.h"
+#include "union.h"
#define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
#define HASH_SIZE (1UL << HASH_SHIFT)
@@ -1520,6 +1521,54 @@ static int clone_union_tree(struct mount *topmost, struct path *mntpnt)
return 0;
}
+/**
+ * build_root_union - Create the union stack for the root dir
+ * @topmost_mnt: vfsmount of topmost mount
+ *
+ * Build the union stack for the root dir. Annoyingly, we have to traverse
+ * union "up" from the root of the cloned tree to find the topmost read-only
+ * mount, and then traverse back "down" to build the stack.
+ */
+static int build_root_union(struct vfsmount *topmost_mnt)
+{
+ struct path lower, topmost_path;
+ struct mount *mnt, *topmost_ro_mnt;
+ unsigned int i, layers = 1;
+ int err = 0;
+
+ /* Find the topmost read-only mount */
+ topmost_ro_mnt = real_mount(topmost_mnt->mnt_sb->s_union_lower_mnts);
+ for (mnt = topmost_ro_mnt; mnt; mnt = next_mnt(mnt, topmost_ro_mnt)) {
+ if (mnt->mnt_parent == topmost_ro_mnt &&
+ mnt->mnt_mountpoint == topmost_ro_mnt->mnt.mnt_root) {
+ topmost_ro_mnt = mnt;
+ layers++;
+ }
+ }
+ topmost_mnt->mnt_sb->s_union_count = layers;
+
+ // SHOULD USE collect_mounts() here rather than merely mntgetting
+
+ /* Build the root dir's union stack from the top down */
+ topmost_path.mnt = topmost_mnt;
+ topmost_path.dentry = topmost_mnt->mnt_root;
+ mnt = topmost_ro_mnt;
+ for (i = 0; i < layers; i++) {
+ lower.mnt = mntget(&mnt->mnt); // !!!!!!!!!! TODO: FIX
+ lower.dentry = dget(mnt->mnt.mnt_root);
+ err = union_add_dir(&topmost_path, &lower, i);
+ if (err)
+ goto out;
+ mnt = mnt->mnt_parent;
+ }
+ return 0;
+
+out:
+ d_free_unions(topmost_path.dentry);
+ topmost_mnt->mnt_sb->s_union_count = 0;
+ return err;
+}
+
/*
* @source_mnt : mount tree to be attached
* @nd : place the mount tree @source_mnt is attached
From: Valerie Aurora <[email protected]>
union_alloc() allocates a union stack with enough entries for the
maximum possible number of directories that might be unioned at this
point.
The union_stack may be larger than strictly necessary if this
directory does not exist on all layers, but allocating exactly the
right number would require keeping the number of layers in the
union_stack structure. We optimize for the case of unioning two file
systems and keep the count of layers in the superblock.
Original-author: Jan Blunck <[email protected]>
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
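The sizing trade-off described above looks roughly like this in userspace. This is a sketch only: `struct toy_path` stands in for `struct path`, the layer count is passed in directly instead of being read from the superblock, and calloc() plays the role of kcalloc(). Because the count lives outside the array, a directory that is missing on some layers simply leaves those slots zeroed.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct path. */
struct toy_path { const void *mnt; const void *dentry; };

/* Allocate one zeroed slot per possible layer. Note the argument order
 * of calloc(): element count first, element size second. */
static struct toy_path *toy_union_alloc(unsigned int s_union_count)
{
	assert(s_union_count > 0);
	return calloc(s_union_count, sizeof(struct toy_path));
}

/* Count the slots that actually hold a directory; the rest are the
 * over-allocation the commit message mentions. */
static unsigned int toy_union_used(const struct toy_path *stack,
				   unsigned int s_union_count)
{
	unsigned int i, used = 0;

	for (i = 0; i < s_union_count; i++)
		if (stack[i].dentry)
			used++;
	return used;
}
```

For the common two-file-system union the count is 1, so nothing is wasted; the slack only appears with deeper stacks.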
fs/Makefile | 1 +
fs/union.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+), 0 deletions(-)
create mode 100644 fs/union.c
diff --git a/fs/Makefile b/fs/Makefile
index 93804d4..768fbf3 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_GENERIC_ACL) += generic_acl.o
obj-$(CONFIG_FHANDLE) += fhandle.o
obj-y += quota/
+obj-$(CONFIG_UNION_MOUNT) += union.o
obj-$(CONFIG_PROC_FS) += proc/
obj-$(CONFIG_SYSFS) += sysfs/
diff --git a/fs/union.c b/fs/union.c
new file mode 100644
index 0000000..c8d7766
--- /dev/null
+++ b/fs/union.c
@@ -0,0 +1,38 @@
+/*
+ * VFS-based union mounts for Linux
+ *
+ * Copyright (C) 2004-2007 IBM Corporation, IBM Deutschland Entwicklung GmbH.
+ * Copyright (C) 2007-2009 Novell Inc.
+ * Copyright (C) 2009-2010 Red Hat, Inc.
+ *
+ * Author(s): Jan Blunck ([email protected])
+ * Valerie Aurora <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/fs_struct.h>
+#include <linux/slab.h>
+
+#include "union.h"
+
+/**
+ * union_alloc - allocate a union stack
+ * @topmost: path of topmost directory
+ *
+ * Allocate a union_stack large enough to contain the maximum number
+ * of layers in this union mount.
+ */
+static struct union_stack *union_alloc(struct path *topmost)
+{
+ unsigned int layers = topmost->dentry->d_sb->s_union_count;
+ BUG_ON(!S_ISDIR(topmost->dentry->d_inode->i_mode));
+
+ return kcalloc(layers, sizeof(struct path), GFP_KERNEL);
+}
From: Valerie Aurora <[email protected]>
In order for read-only layers of a union to have submounts, we have to
follow mounts on directories in union lookup.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 3ac07be..37e32b4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1195,6 +1195,8 @@ static int __lookup_union(struct nameidata *nd, struct qstr *name,
* layer's directory to the union stack for the topmost
* directory.
*/
+ follow_mount(&lower);
+
if (!topmost->dentry->d_inode) {
err = union_create_topmost_dir(&parent, name, topmost,
&lower);
From: Valerie Aurora <[email protected]>
Add two fields to struct super_block to support union mounts:
(*) s_union_lower_mnts
A pointer to a cloned vfsmount tree of all the lower (read-only) mounts
unioned with the topmost (read-write) vfsmount. These mounts may have
submounts which will also be unioned; hence we copy the entire vfsmount
tree, not just the root vfsmounts.
(*) s_union_count
The number of lower mounts unioned at the root of the file system. This
count is the maximum number of directories that will ever be unioned with
a single directory. We use it to allocate a union stack of the correct
size for each directory.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
include/linux/fs.h | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a014f0f..f19772c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1509,6 +1509,16 @@ struct super_block {
* free_vfsmnt() if MNT_HARD_READONLY is set.
*/
int s_hard_readonly_users;
+
+ /* Root of the private cloned vfsmount tree of the read-only
+ * mounts in this union (set in topmost vfsmount only)
+ */
+ struct vfsmount *s_union_lower_mnts;
+
+ /* Number of layers in this union, not counting the topmost or
+ * submounts.
+ */
+ unsigned int s_union_count;
};
/* superblock cache pruning functions */
ext2_add_entry() does not really need an inode pointer except to get at the
inode number and file mode that it contains. Instead of passing in an inode
pointer, pass in the inode number and type to be recorded in the directory
entry.
ext2_add_link() can then calculate the directory entry type from the file mode
and pass it to ext2_add_entry().
Original-author: David Howells <[email protected]>
Signed-off-by: David Howells <[email protected]> (Further development)
---
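The selection order that the reworked ext2_set_de_type() applies can be sketched in plain C as follows. The `TOY_FT_` constants mirror the standard ext2 directory entry type numbers; `TOY_FT_SPECIAL` is a hypothetical placeholder for an explicitly requested type (such as a whiteout or fallthru, whose real values are not shown in this patch), and the feature test is reduced to an int argument.

```c
#include <assert.h>

#define TOY_S_IFMT	0170000
#define TOY_S_SHIFT	12

/* Standard ext2 directory entry types. */
enum {
	TOY_FT_UNKNOWN = 0, TOY_FT_REG_FILE = 1, TOY_FT_DIR = 2,
	TOY_FT_CHRDEV = 3, TOY_FT_BLKDEV = 4, TOY_FT_FIFO = 5,
	TOY_FT_SOCK = 6, TOY_FT_SYMLINK = 7,
};

#define TOY_FT_SPECIAL	42	/* hypothetical explicit type, illustration only */

/* Mode-to-type table, indexed by the file type bits of the mode. */
static const unsigned char toy_type_by_mode[TOY_S_IFMT >> TOY_S_SHIFT] = {
	[0100000 >> TOY_S_SHIFT] = TOY_FT_REG_FILE,
	[0040000 >> TOY_S_SHIFT] = TOY_FT_DIR,
	[0020000 >> TOY_S_SHIFT] = TOY_FT_CHRDEV,
	[0060000 >> TOY_S_SHIFT] = TOY_FT_BLKDEV,
	[0010000 >> TOY_S_SHIFT] = TOY_FT_FIFO,
	[0140000 >> TOY_S_SHIFT] = TOY_FT_SOCK,
	[0120000 >> TOY_S_SHIFT] = TOY_FT_SYMLINK,
};

/* Feature off: always 0. Explicit type given: use it. Otherwise derive
 * the type from the mode. */
static unsigned char toy_de_type(int has_filetype_feature, unsigned int mode,
				 unsigned char file_type)
{
	if (!has_filetype_feature)
		return 0;
	if (file_type)
		return file_type;
	return toy_type_by_mode[(mode & TOY_S_IFMT) >> TOY_S_SHIFT];
}
```

Passing 0 as the explicit type, as ext2_add_link() and ext2_set_link() do, keeps the old mode-derived behaviour.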
fs/ext2/dir.c | 64 +++++++++++++++++++++++++++++++++++++++------------------
1 files changed, 44 insertions(+), 20 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d8382dc..dcb2d64 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -282,13 +282,18 @@ static unsigned char ext2_type_by_mode[S_IFMT >> S_SHIFT] = {
[S_IFLNK >> S_SHIFT] = EXT2_FT_SYMLINK,
};
-static inline void ext2_set_de_type(ext2_dirent *de, struct inode *inode)
+static inline void ext2_set_de_type(ext2_dirent *de, struct super_block *sb,
+ umode_t mode, unsigned char file_type)
{
- umode_t mode = inode->i_mode;
- if (EXT2_HAS_INCOMPAT_FEATURE(inode->i_sb, EXT2_FEATURE_INCOMPAT_FILETYPE))
- de->file_type = ext2_type_by_mode[(mode & S_IFMT)>>S_SHIFT];
- else
+ if (!EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_FILETYPE)) {
de->file_type = 0;
+ return;
+ }
+
+ if (file_type)
+ de->file_type = file_type;
+ else
+ de->file_type = ext2_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
}
static int
@@ -480,7 +485,7 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
err = ext2_prepare_chunk(page, pos, len);
BUG_ON(err);
de->inode = cpu_to_le32(inode->i_ino);
- ext2_set_de_type(de, inode);
+ ext2_set_de_type(de, inode->i_sb, inode->i_mode, 0);
err = ext2_commit_chunk(page, pos, len);
ext2_put_page(page);
if (update_times)
@@ -489,7 +494,18 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
mark_inode_dirty(dir);
}
-int ext2_add_entry(struct dentry *dentry, struct inode *inode)
+/*
+ * Called from three settings:
+ *
+ * - Creating a regular entry - de/page NULL, doesn't exist
+ * - Creating a fallthru - de/page NULL, doesn't exist
+ * - Creating a whiteout - de/page set if it exists
+ *
+ * @new_file_type is either EXT2_FT_WHT, EXT2_FT_FALLTHRU, or 0. If
+ * 0, the file type is determined from @mode.
+ */
+int ext2_add_entry(struct dentry *dentry, ino_t ino, umode_t mode,
+ unsigned char new_file_type)
{
struct inode *dir = dentry->d_parent->d_inode;
const char *name = dentry->d_name.name;
@@ -497,10 +513,10 @@ int ext2_add_entry(struct dentry *dentry, struct inode *inode)
unsigned chunk_size = ext2_chunk_size(dir);
unsigned reclen = EXT2_DIR_REC_LEN(namelen);
unsigned short rec_len, name_len;
- struct page *page = NULL;
- ext2_dirent * de;
unsigned long npages = dir_pages(dir);
unsigned long n;
+ ext2_dirent *de;
+ struct page *page;
char *kaddr;
loff_t pos;
int err;
@@ -530,6 +546,7 @@ int ext2_add_entry(struct dentry *dentry, struct inode *inode)
de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
de->file_type = 0;
+ printk("%s: allocated new de\n", dentry->d_name.name);
goto got_it;
}
if (de->rec_len == 0) {
@@ -538,15 +555,22 @@ int ext2_add_entry(struct dentry *dentry, struct inode *inode)
err = -EIO;
goto out_unlock;
}
- err = -EEXIST;
- if (ext2_match (namelen, name, de))
- goto out_unlock;
name_len = EXT2_DIR_REC_LEN(de->name_len);
rec_len = ext2_rec_len_from_disk(de->rec_len);
- if (!ext2_dirent_in_use(de) && rec_len >= reclen)
+ if (ext2_match(namelen, name, de)) {
+ err = -EEXIST;
+ /* XXX handle whiteouts and fallthroughs here */
+ printk("%s: found existing de\n", dentry->d_name.name);
+ goto got_it;
+ }
+ if (!ext2_dirent_in_use(de) && rec_len >= reclen) {
+ printk("%s: reusing empty de\n", dentry->d_name.name);
goto got_it;
- if (rec_len >= name_len + reclen)
+ }
+ if (rec_len >= name_len + reclen) {
+ printk("%s: carving off end of in-use de\n", dentry->d_name.name);
goto got_it;
+ }
de = (ext2_dirent *) ((char *) de + rec_len);
}
unlock_page(page);
@@ -561,7 +585,7 @@ got_it:
err = ext2_prepare_chunk(page, pos, rec_len);
if (err)
goto out_unlock;
- if (ext2_dirent_in_use(de)) {
+ if (ext2_dirent_in_use(de) && !ext2_match (namelen, name, de)) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
de->rec_len = ext2_rec_len_to_disk(name_len);
@@ -569,8 +593,8 @@ got_it:
}
de->name_len = namelen;
memcpy(de->name, name, namelen);
- de->inode = cpu_to_le32(inode->i_ino);
- ext2_set_de_type (de, inode);
+ de->inode = cpu_to_le32(ino);
+ ext2_set_de_type(de, dir->i_sb, mode, new_file_type);
err = ext2_commit_chunk(page, pos, rec_len);
dir->i_mtime = dir->i_ctime = CURRENT_TIME_SEC;
EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
@@ -587,7 +611,7 @@ out_unlock:
int ext2_add_link(struct dentry *dentry, struct inode *inode)
{
- return ext2_add_entry(dentry, inode);
+ return ext2_add_entry(dentry, inode->i_ino, inode->i_mode, 0);
}
/*
@@ -660,14 +684,14 @@ int ext2_make_empty(struct inode *inode, struct inode *parent)
de->rec_len = ext2_rec_len_to_disk(EXT2_DIR_REC_LEN(1));
memcpy (de->name, ".\0\0", 4);
de->inode = cpu_to_le32(inode->i_ino);
- ext2_set_de_type (de, inode);
+ ext2_set_de_type (de, inode->i_sb, inode->i_mode, 0);
de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
de->name_len = 2;
de->rec_len = ext2_rec_len_to_disk(chunk_size - EXT2_DIR_REC_LEN(1));
de->inode = cpu_to_le32(parent->i_ino);
memcpy (de->name, "..\0", 4);
- ext2_set_de_type (de, inode);
+ ext2_set_de_type (de, inode->i_sb, inode->i_mode, 0);
kunmap_atomic(kaddr, KM_USER0);
err = ext2_commit_chunk(page, 0, chunk_size);
fail:
On rename() of a file on a union mount, copy up and white out the source
file.
XXX - fix comments and make more readable
XXX - Convert newly empty unioned dirs to not-unioned
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Further development)
---
fs/namei.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 101 insertions(+), 19 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index efad85e..dad7bef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3045,7 +3045,7 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
/**
* vfs_whiteout: Create a whiteout for the given directory entry
- * @dir: Parent inode
+ * @parent: Parent directory
* @dentry: Directory entry to whiteout
*
* Create a whiteout for the given directory entry. A whiteout prevents lookup
@@ -3060,15 +3060,17 @@ SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
* a positive one if it exists, and a negative if not. When this function
* returns, the caller should dput() the old, now defunct dentry it passed in.
* The dentry for the whiteout itself is created inside this function.
+ *
+ * The caller must hold the i_mutex lock on the parent directory.
*/
-static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
+static int vfs_whiteout(struct dentry *parent, struct dentry *old_dentry, int isdir)
{
- struct inode *old_inode = old_dentry->d_inode;
- struct dentry *parent, *whiteout;
+ struct inode *dir = parent->d_inode, *old_inode = old_dentry->d_inode;
+ struct dentry *whiteout;
bool do_dput = false;
int err = 0;
- BUG_ON(old_dentry->d_parent->d_inode != dir);
+ BUG_ON(old_dentry->d_parent != parent);
if (!dir->i_op || !dir->i_op->whiteout)
return -EOPNOTSUPP;
@@ -3092,11 +3094,10 @@ static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
goto error_unlock;
}
- parent = dget_parent(old_dentry);
err = -ENOMEM;
- whiteout = d_alloc_name(parent, old_dentry->d_name.name);
+ whiteout = d_alloc(parent, &old_dentry->d_name);
if (!whiteout)
- goto error_put_parent;
+ goto error_unlock;
if (old_inode && isdir) {
dentry_unhash(old_dentry);
@@ -3116,13 +3117,10 @@ static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
}
dput(whiteout);
- dput(parent);
return err;
error_put_whiteout:
dput(whiteout);
-error_put_parent:
- dput(parent);
error_unlock:
if (old_inode)
mutex_unlock(&old_inode->i_mutex);
@@ -3208,7 +3206,7 @@ static int do_whiteout(struct nameidata *nd, struct path *path, int isdir)
path->dentry = dentry;
}
- err = vfs_whiteout(nd->path.dentry->d_inode, dentry, isdir);
+ err = vfs_whiteout(nd->path.dentry, dentry, isdir);
out:
path_put(&safe);
@@ -3216,6 +3214,40 @@ out:
}
/*
+ * Create a whiteout to finish off a rename from a unionmounted directory.
+ * This prevents any file of the same name in the lowerfs from showing through.
+ */
+static int vfs_whiteout_after_rename(struct dentry *parent,
+ const struct qstr *name)
+{
+ struct inode *dir = parent->d_inode;
+ struct dentry *whiteout;
+ int err;
+
+ if (!dir->i_op || !dir->i_op->whiteout)
+ return -EOPNOTSUPP;
+
+ /* Rename moved the old dentry somewhere else, so there can't be one
+ * here now (the caller's locks see to that) and so there's no need to
+ * call lookup, especially as the ->whiteout() op is expected to add
+ * the new dentry into the tree.
+ */
+ whiteout = d_alloc(parent, name);
+ if (!whiteout)
+ return -ENOMEM;
+
+ /* I think it's okay to pass the new whiteout as the old dentry here.
+ * What it seems to want is the name, the parent dentry and the inode.
+ * However, we know the inode no longer resides there and d_inode will
+ * be NULL.
+ */
+ err = dir->i_op->whiteout(dir, whiteout, whiteout);
+
+ dput(whiteout);
+ return err;
+}
+
+/*
* The dentry_unhash() helper will try to drop the dentry early: we
* should have a usage count of 2 if we're the only user of this
* dentry, and if that is true (possibly after pruning the dcache),
@@ -3787,13 +3819,6 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EXDEV;
if (oldnd.path.mnt != newnd.path.mnt)
goto exit2;
-
- /* rename() on union mounts not implemented yet */
- error = -EXDEV;
- if (IS_DIR_UNIONED(oldnd.path.dentry) ||
- IS_DIR_UNIONED(newnd.path.dentry))
- goto exit2;
-
old_dir = oldnd.path.dentry;
error = -EBUSY;
if (oldnd.last_type != LAST_NORM)
@@ -3804,6 +3829,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
goto exit2;
oldnd.flags &= ~LOOKUP_PARENT;
+ oldnd.flags |= LOOKUP_COPY_UP;
newnd.flags &= ~LOOKUP_PARENT;
newnd.flags |= LOOKUP_RENAME_TARGET;
@@ -3828,6 +3854,11 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EINVAL;
if (old.dentry == trap)
goto exit4;
+ error = -EXDEV;
+ /* Can't rename a directory from a lower layer */
+ if (IS_DIR_UNIONED(oldnd.path.dentry) &&
+ IS_DIR_UNIONED(old.dentry))
+ goto exit4;
error = lookup_hash(&newnd, &newnd.last, &new);
if (error)
goto exit4;
@@ -3835,6 +3866,42 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -ENOTEMPTY;
if (new.dentry == trap)
goto exit5;
+ error = -EXDEV;
+ /* Can't rename over directories on the lower layer */
+ if (IS_DIR_UNIONED(newnd.path.dentry) &&
+ IS_DIR_UNIONED(new.dentry))
+ goto exit5;
+
+ /* If source should've been copied up by lookup_hash() */
+ if (IS_DIR_UNIONED(oldnd.path.dentry))
+ BUG_ON(old.mnt != oldnd.path.mnt);
+
+ /* If target is on lower layer, get negative dentry for topmost */
+ if (IS_DIR_UNIONED(newnd.path.dentry) &&
+ new.mnt != newnd.path.mnt) {
+ /* At this point, source and target are both files, the source
+ * is on the topmost layer and the target is on a lower layer.
+ * We want the target dentry to disappear from the namespace
+ * and give vfs_rename a negative dentry from the topmost
+ * layer.
+ *
+ * Note: We already did lookup once, so no need to recheck perm
+ */
+ struct dentry *dentry =
+ __lookup_hash(&newnd.last, newnd.path.dentry, &newnd);
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto exit5;
+ }
+
+ /* We no longer need the lower target dentry. It definitely
+ * should be removed from the hash table */
+ /* XXX what about failure case? */
+ d_delete(new.dentry);
+ mntput(new.mnt);
+ new.mnt = mntget(newnd.path.mnt);
+ new.dentry = dentry;
+ }
error = mnt_want_write(oldnd.path.mnt);
if (error)
@@ -3845,6 +3912,21 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
goto exit6;
error = vfs_rename(old_dir->d_inode, old.dentry,
new_dir->d_inode, new.dentry);
+ if (error)
+ goto exit6;
+
+ /* Now whiteout the source. We may have exposed a positive lower level
+ * dentry, so we have to make sure it doesn't get resurrected. We
+ * could probe the lower levels at this point to find out whether there
+ * is actually anything that needs whiting out.
+ *
+ * Note that if this fails, it may leave the lower dentry exposed, and
+ * we may not be able to recover by simply renaming back (say we
+ * encountered ENOMEM or ENOSPC conditions).
+ */
+ if (IS_DIR_UNIONED(oldnd.path.dentry))
+ error = vfs_whiteout_after_rename(old_dir, &oldnd.last);
+
exit6:
mnt_drop_write(oldnd.path.mnt);
exit5:
From: Valerie Aurora <[email protected]>
Build the union stack for directories as we look them up. Create the
topmost directory if it doesn't exist.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 17 +++++++++++++++--
1 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index f81f24e..3ac07be 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1190,8 +1190,21 @@ static int __lookup_union(struct nameidata *nd, struct qstr *name,
goto out_found_file;
}
- /* XXX - do nothing, more in later patches */
- path_put(&lower);
+ /* Now we know the target is a directory. Create a matching
+ * topmost directory if one doesn't already exist, and add this
+ * layer's directory to the union stack for the topmost
+ * directory.
+ */
+ if (!topmost->dentry->d_inode) {
+ err = union_create_topmost_dir(&parent, name, topmost,
+ &lower);
+ if (err)
+ goto out_err;
+ }
+
+ err = union_add_dir(topmost, &lower, i);
+ if (err)
+ goto out_err;
}
return 0;
From: Valerie Aurora <[email protected]>
Add CONFIG_UNION_MOUNT option.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/Kconfig | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/fs/Kconfig b/fs/Kconfig
index d621f02..6fc3c69 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -61,6 +61,18 @@ source "fs/notify/Kconfig"
source "fs/quota/Kconfig"
+config UNION_MOUNT
+ bool "Union mounts (writable overlays) (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ Union mounts allow you to mount a transparent writable layer over a
+ read-only file system, for example, an ext3 partition on a hard drive
+ over a CD-ROM root file system image.
+
+ See <file:Documentation/filesystems/union-mounts.txt> for details.
+
+ If unsure, say N.
+
source "fs/autofs4/Kconfig"
source "fs/fuse/Kconfig"
From: Valerie Aurora <[email protected]>
For union mounts, a file located on the lower layer will incorrectly
return EROFS on an access check. To fix this, use the new
path_permission() call, which ignores a read-only lower layer file
system if the target will be copied up to the topmost file system.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/open.c | 41 +++++++++++++++++++++++++++++++++++------
1 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 3c44148..d3be9e3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -32,6 +32,7 @@
#include <linux/dnotify.h>
#include "internal.h"
+#include "union.h"
int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
struct file *filp)
@@ -301,7 +302,11 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
const struct cred *old_cred;
struct cred *override_cred;
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ umode_t i_mode;
+ char *tmp;
int res;
if (mode & ~S_IRWXO) /* where's F_OK, X_OK, W_OK, R_OK? */
@@ -325,25 +330,47 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
old_cred = override_creds(override_cred);
- res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ res = user_path_nd(dfd, filename, LOOKUP_FOLLOW, &nd, &path, &tmp);
if (res)
goto out;
+ /* For union mounts, use the topmost mnt's permissions */
+ mnt = path.mnt;
+ if (IS_MNT_LOWER(mnt))
+ mnt = nd.path.mnt;
+
inode = path.dentry->d_inode;
+ i_mode = inode->i_mode;
- if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
+ if ((mode & MAY_EXEC) && S_ISREG(i_mode)) {
/*
* MAY_EXEC on regular files is denied if the fs is mounted
* with the "noexec" flag.
*/
res = -EACCES;
- if (path.mnt->mnt_flags & MNT_NOEXEC)
+ if (mnt->mnt_flags & MNT_NOEXEC)
+ goto out_path_release;
+ }
+
+ mode |= MAY_ACCESS;
+ if ((mode & MAY_WRITE) && unlikely(IS_MNT_LOWER(path.mnt))) {
+ /* If we need to copy up, then the upperfs of a union must be
+ * writable. The lowerfs must be mounted read-only for the
+ * union to exist, but we don't care about that.
+ */
+ res = -EROFS;
+ if ((mnt->mnt_sb->s_flags & MS_RDONLY) &&
+ (S_ISREG(i_mode) || S_ISDIR(i_mode) || S_ISLNK(i_mode)))
goto out_path_release;
+
+ /* We do need write permission on the lower inode, however */
+ res = __inode_permission(inode, mode);
+ } else {
+ res = inode_permission(inode, mode);
}
- res = inode_permission(inode, mode | MAY_ACCESS);
/* SuS v2 requires we report a read only fs too */
- if (res || !(mode & S_IWOTH) || special_file(inode->i_mode))
+ if (res || !(mode & MAY_WRITE) || special_file(inode->i_mode))
goto out_path_release;
/*
* This is a rare case where using __mnt_is_readonly()
@@ -355,11 +382,13 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* inherently racy and know that the fs may change
* state before we even see this result.
*/
- if (__mnt_is_readonly(path.mnt))
+ if (__mnt_is_readonly(mnt))
res = -EROFS;
out_path_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
revert_creds(old_cred);
put_cred(override_cred);
Up till this commit, mount with MS_UNION flag succeeded but didn't
actually union the file systems. Now call the functions to check
the source mounts and create/destroy the per-vfsmount union structures.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]> (Forward port)
---
fs/namespace.c | 43 +++++++++++++++++++++++++++----------------
fs/super.c | 1 +
2 files changed, 28 insertions(+), 16 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 5e8328e..306565b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1424,9 +1424,9 @@ static int invent_group_ids(struct mount *mnt, bool recurse)
* fallthrus. The topmost file system can't be mounted elsewhere
* because it's Too Hard(tm).
*/
-static int check_topmost_union_mnt(struct vfsmount *topmost_mnt, int mnt_flags)
+static int check_topmost_union_mnt(struct mount *topmost_mnt, int mnt_flags)
{
- struct super_block *sb = topmost_mnt->mnt_sb;
+ struct super_block *sb = topmost_mnt->mnt.mnt_sb;
#ifndef CONFIG_UNION_MOUNT
printk(KERN_INFO "union mount: not supported by the kernel\n");
@@ -1529,7 +1529,7 @@ static int clone_union_tree(struct mount *topmost, struct path *mntpnt)
* union "up" from the root of the cloned tree to find the topmost read-only
* mount, and then traverse back "down" to build the stack.
*/
-static int build_root_union(struct vfsmount *topmost_mnt)
+static int build_root_union(struct mount *topmost_mnt)
{
struct path lower, topmost_path;
struct mount *mnt, *topmost_ro_mnt;
@@ -1537,7 +1537,7 @@ static int build_root_union(struct vfsmount *topmost_mnt)
int err = 0;
/* Find the topmost read-only mount */
- topmost_ro_mnt = real_mount(topmost_mnt->mnt_sb->s_union_lower_mnts);
+ topmost_ro_mnt = real_mount(topmost_mnt->mnt.mnt_sb->s_union_lower_mnts);
for (mnt = topmost_ro_mnt; mnt; mnt = next_mnt(mnt, topmost_ro_mnt)) {
if (mnt->mnt_parent == topmost_ro_mnt &&
mnt->mnt_mountpoint == topmost_ro_mnt->mnt.mnt_root) {
@@ -1545,13 +1545,13 @@ static int build_root_union(struct vfsmount *topmost_mnt)
layers++;
}
}
- topmost_mnt->mnt_sb->s_union_count = layers;
+ topmost_mnt->mnt.mnt_sb->s_union_count = layers;
// SHOULD USE collect_mounts() here rather than merely mntgetting
/* Build the root dir's union stack from the top down */
- topmost_path.mnt = topmost_mnt;
- topmost_path.dentry = topmost_mnt->mnt_root;
+ topmost_path.mnt = &topmost_mnt->mnt;
+ topmost_path.dentry = topmost_mnt->mnt.mnt_root;
mnt = topmost_ro_mnt;
for (i = 0; i < layers; i++) {
lower.mnt = mntget(&mnt->mnt); // !!!!!!!!!! TODO: FIX
@@ -1565,7 +1565,7 @@ static int build_root_union(struct vfsmount *topmost_mnt)
out:
d_free_unions(topmost_path.dentry);
- topmost_mnt->mnt_sb->s_union_count = 0;
+ topmost_mnt->mnt.mnt_sb->s_union_count = 0;
return err;
}
@@ -1581,15 +1581,15 @@ out:
*
* Caller needs namespace_sem, but can't have vfsmount_lock.
*/
-static int prepare_mnt_union(struct vfsmount *topmost_mnt, struct path *mntpnt)
+static int prepare_mnt_union(struct mount *topmost_mnt, struct path *mntpnt)
{
int err;
- err = check_topmost_union_mnt(topmost_mnt, topmost_mnt->mnt_flags);
+ err = check_topmost_union_mnt(topmost_mnt, topmost_mnt->mnt.mnt_flags);
if (err)
return err;
- err = clone_union_tree(real_mount(topmost_mnt), mntpnt);
+ err = clone_union_tree(topmost_mnt, mntpnt);
if (err)
return err;
@@ -1599,14 +1599,14 @@ static int prepare_mnt_union(struct vfsmount *topmost_mnt, struct path *mntpnt)
return 0;
out:
- put_union_sb(topmost_mnt->mnt_sb);
+ put_union_sb(topmost_mnt->mnt.mnt_sb);
return err;
}
-static void cleanup_mnt_union(struct vfsmount *topmost_mnt)
+static void cleanup_mnt_union(struct mount *topmost_mnt)
{
- d_free_unions(topmost_mnt->mnt_root);
- put_union_sb(topmost_mnt->mnt_sb);
+ d_free_unions(topmost_mnt->mnt.mnt_root);
+ put_union_sb(topmost_mnt->mnt.mnt_sb);
}
/*
@@ -1686,9 +1686,17 @@ static int attach_recursive_mnt(struct mount *source_mnt,
if (err)
goto out;
}
+
+ /* parent_path means we are moving an existing unioned mount */
+ if (!parent_path && IS_MNT_UNION(&source_mnt->mnt)) {
+ err = prepare_mnt_union(source_mnt, path);
+ if (err)
+ goto out_cleanup_ids;
+ }
+
err = propagate_mnt(dest_mnt, dest_dentry, source_mnt, &tree_list);
if (err)
- goto out_cleanup_ids;
+ goto out_cleanup_union;
br_write_lock(vfsmount_lock);
@@ -1713,6 +1721,9 @@ static int attach_recursive_mnt(struct mount *source_mnt,
return 0;
+ out_cleanup_union:
+ if (!parent_path && IS_MNT_UNION(&source_mnt->mnt))
+ cleanup_mnt_union(source_mnt);
out_cleanup_ids:
if (IS_MNT_SHARED(dest_mnt))
cleanup_group_ids(source_mnt, NULL);
diff --git a/fs/super.c b/fs/super.c
index 4d24f05..992e2b0 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -266,6 +266,7 @@ void deactivate_locked_super(struct super_block *s)
*/
rcu_barrier();
put_filesystem(fs);
+ put_union_sb(s);
put_super(s);
} else {
up_write(&s->s_umount);
From: Valerie Aurora <[email protected]>
prepare_mnt_union() ties together all the mount-time checks and setup
for union mounts. It tests the layers for suitability and builds the
root union stack.
cleanup_mnt_union() unwinds everything prepare_mnt_union() does.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namespace.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 3355b99..261944d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1569,6 +1569,46 @@ out:
return err;
}
+/**
+ * prepare_mnt_union - do setup necessary for a union mount
+ * @topmost_mnt: vfsmount of topmost layer
+ * @mntpnt: path of requested mountpoint
+ *
+ * We union every underlying file system that is mounted on the same mountpoint
+ * (well, pathname), read-only, and not shared. If we get at least one layer,
+ * we don't return an error, although we will complain in the kernel log if we
+ * hit a mount that can't be unioned.
+ *
+ * Caller needs namespace_sem, but can't have vfsmount_lock.
+ */
+static int prepare_mnt_union(struct vfsmount *topmost_mnt, struct path *mntpnt)
+{
+ int err;
+
+ err = check_topmost_union_mnt(topmost_mnt, topmost_mnt->mnt_flags);
+ if (err)
+ return err;
+
+ err = clone_union_tree(real_mount(topmost_mnt), mntpnt);
+ if (err)
+ return err;
+
+ err = build_root_union(topmost_mnt);
+ if (err)
+ goto out;
+ return 0;
+
+out:
+ put_union_sb(topmost_mnt->mnt_sb);
+ return err;
+}
+
+static void cleanup_mnt_union(struct vfsmount *topmost_mnt)
+{
+ d_free_unions(topmost_mnt->mnt_root);
+ put_union_sb(topmost_mnt->mnt_sb);
+}
+
/*
* @source_mnt : mount tree to be attached
* @nd : place the mount tree @source_mnt is attached
From: Valerie Aurora <[email protected]>
After some of the following patches in this series, a few system calls
will crash the kernel if called on union-mounted file systems.
Temporarily disable rename(), unlink(), and rmdir() on unioned file
systems until they are correctly implemented by later patches.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/namei.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 991a32c..f53c0bc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -38,6 +38,7 @@
#include "internal.h"
#include "mount.h"
+#include "union.h"
/* [Feb-1997 T. Schoebel-Theuer]
* Fundamental changes in the pathname lookup mechanisms (namei)
@@ -2891,6 +2892,11 @@ static long do_rmdir(int dfd, const char __user *pathname)
if (error)
return error;
+ /* rmdir() on union mounts not implemented yet */
+ error = -EINVAL;
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ goto exit1;
+
switch(nd.last_type) {
case LAST_DOTDOT:
error = -ENOTEMPTY;
@@ -2991,6 +2997,11 @@ static long do_unlinkat(int dfd, const char __user *pathname)
if (nd.last_type != LAST_NORM)
goto exit1;
+ /* unlink() on union mounts not implemented yet */
+ error = -EINVAL;
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ goto exit1;
+
nd.flags &= ~LOOKUP_PARENT;
mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
@@ -3384,6 +3395,12 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
if (oldnd.path.mnt != newnd.path.mnt)
goto exit2;
+ /* rename() on union mounts not implemented yet */
+ error = -EXDEV;
+ if (IS_DIR_UNIONED(oldnd.path.dentry) ||
+ IS_DIR_UNIONED(newnd.path.dentry))
+ goto exit2;
+
old_dir = oldnd.path.dentry;
error = -EBUSY;
if (oldnd.last_type != LAST_NORM)
From: Valerie Aurora <[email protected]>
needs_lookup_union() tests if a path could possibly require a union
lookup.
Original-author: Valerie Aurora <[email protected]>
Signed-off-by: David Howells <[email protected]>
---
fs/union.h | 25 +++++++++++++++++++++++++
include/linux/dcache.h | 2 ++
2 files changed, 27 insertions(+), 0 deletions(-)
diff --git a/fs/union.h b/fs/union.h
index 990dd16..757f28c 100644
--- a/fs/union.h
+++ b/fs/union.h
@@ -73,6 +73,26 @@ struct path *union_find_dir(struct dentry *dentry, unsigned int layer)
return &dentry->d_union_stack->u_dirs[layer];
}
+/*
+ * Determine whether we need to perform unionmount traversal or the copyup of a
+ * dentry.
+ */
+static inline
+bool needs_lookup_union(struct path *parent_path, struct path *path)
+{
+ if (!IS_DIR_UNIONED(parent_path->dentry))
+ return false;
+
+ /* Either already built or crossed a mountpoint to not-unioned mnt */
+ /* XXX are bind mounts root? think not */
+ if (IS_ROOT(path->dentry))
+ return false;
+
+ /* It's okay not to have the lock; will recheck in lookup_union() */
+ /* XXX set for root dentry at mount? */
+ return !(path->dentry->d_flags & DCACHE_UNION_LOOKUP_DONE);
+}
+
#else /* CONFIG_UNION_MOUNT */
static inline
@@ -100,4 +120,9 @@ static inline int union_create_topmost_dir(struct path *parent, struct qstr *nam
return 0;
}
+static inline bool needs_lookup_union(struct path *parent_path, struct path *path)
+{
+ return false;
+}
+
#endif /* CONFIG_UNION_MOUNT */
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index e2d44e1..79833ae 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -245,6 +245,8 @@ struct dentry_operations {
#define DCACHE_MANAGED_DENTRY \
(DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
+#define DCACHE_UNION_LOOKUP_DONE 0x100000 /* Union lookup was called on this dentry */
+
extern seqlock_t rename_lock;
static inline int dname_external(struct dentry *dentry)
David Howells:
> (4) Added some code to override the credentials around upper inode creation
> to make sure the inode gets the right UID/GID. This doesn't help if the
> lower inode has some sort of foreign user identifier.
>
> Also, I'm not sure whether the LSM xattrs should be blindly copied up.
> Should the LSM policies applicable to the lower fs's apply to the upper
> fs too?
Obviously an xattr entry may not have the same meaning on the upper fs, or
the upper fs may return an error when setting the xattr. Additionally,
the returned errno may not follow the generic semantics (ENOTSUP,
ENOSPC, or EDQUOT), since the fs may return an fs-specific error.
On the other hand, users may expect all xattrs to be copied up,
particularly when they know that the xattrs work well on the upper fs
too.
It will be hard for copy-up to support all cases.
To let users decide how the xattrs are handled, I'd suggest
introducing some mount options similar to those of cp(1).
cp(1) has several options
--preserve=mode,ownership,timestamps,context,links,xattr,all
('mode' includes ACLs, which are based upon xattrs)
Since the mode (without ACLs), ownership and timestamps should always be
copied up, the new mount option would be something like
cpup-xattr=acl,context,all
The xattrs are copied up only when the option is specified. No
special error handling is necessary; all errors should be returned
to users unconditionally.
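
To make the proposal concrete, here is a minimal user-space sketch of how the value of such an option might be turned into a flag mask. Everything in it is an assumption for illustration: the option name is only the one suggested above, and the CPUP_XATTR_* names and parse_cpup_xattr() do not exist in the posted patches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits; none of these names exist in the posted patches. */
#define CPUP_XATTR_ACL     0x1u
#define CPUP_XATTR_CONTEXT 0x2u
#define CPUP_XATTR_ALL     0x4u

/*
 * Parse the value of a hypothetical "cpup-xattr=" mount option, e.g.
 * "acl,context".  Returns 0 and stores the flag mask in *mask, or -1 on
 * an unrecognised keyword.  An empty value yields an empty mask, i.e.
 * no xattrs are copied up.
 */
static int parse_cpup_xattr(const char *value, unsigned int *mask)
{
	unsigned int m = 0;
	const char *p = value;

	assert(value != NULL && mask != NULL);

	while (*p) {
		/* Length of the keyword up to the next comma or end of string */
		size_t len = strcspn(p, ",");

		if (len == 3 && strncmp(p, "acl", len) == 0)
			m |= CPUP_XATTR_ACL;
		else if (len == 7 && strncmp(p, "context", len) == 0)
			m |= CPUP_XATTR_CONTEXT;
		else if (len == 3 && strncmp(p, "all", len) == 0)
			m |= CPUP_XATTR_ALL;
		else
			return -1;

		p += len;
		if (*p == ',')
			p++;
	}

	*mask = m;
	return 0;
}
```

With a mask like this, the copy-up path would only need to test one bit per xattr class, and any error from setting an xattr could be passed straight back to the caller as proposed.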
Does union mount preserve mtime? If not, I am afraid that is critical for
some applications such as "make".
J. R. Okajima
On Tue, Feb 21, 2012 at 06:05:54PM +0000, David Howells wrote:
>
> -static inline void ext2_set_de_type(ext2_dirent *de, struct inode *inode)
> +static inline void ext2_set_de_type(ext2_dirent *de, struct super_block *sb,
> + umode_t mode, unsigned char file_type)
> {
> - umode_t mode = inode->i_mode;
> - if (EXT2_HAS_INCOMPAT_FEATURE(inode->i_sb, EXT2_FEATURE_INCOMPAT_FILETYPE))
> - de->file_type = ext2_type_by_mode[(mode & S_IFMT)>>S_SHIFT];
> - else
> + if (!EXT2_HAS_INCOMPAT_FEATURE(sb, EXT2_FEATURE_INCOMPAT_FILETYPE)) {
> de->file_type = 0;
> + return;
> + }
> +
> + if (file_type)
> + de->file_type = file_type;
> + else
> + de->file_type = ext2_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
> }
It would be simpler to drop the umode_t mode parameter, and just
always make the caller pass in the correct file_type. In fact, this
manual calculation only needs to happen in one place, ext2_set_link(),
so might as well move "ext2_type_by_mode[(mode & S_IFMT) >> S_SHIFT]"
to ext2_set_link().
See below....
> -int ext2_add_entry(struct dentry *dentry, struct inode *inode)
> +/*
> + * Called from three settings:
> + *
> + * - Creating a regular entry - de/page NULL, doesn't exist
> + * - Creating a fallthru - de/page NULL, doesn't exist
> + * - Creating a whiteout - de/page set if it exists
> + *
> + * @new_file_type is either EXT2_FT_WHT, EXT2_FT_FALLTHRU, or 0. If
> + * 0, file type is determined by inode->i_mode.
> + */
One of the things that confused me at first when I reviewed this
patch series is that the patches that introduce whiteouts and
fallthroughs haven't been applied yet; they come later. Yet in
this patch and in others, the comments sometimes mention fallthrus and
whiteouts earlier.
It might be simpler just to fold the commits that add fallthrus and
whiteouts into a single patch, and then make sure all of the comments
that mention fallthrus and whiteouts are included there. I'll
mention this later, but it's also not clear it makes sense to have
separate INCOMPAT options for whiteouts and fallthrus. Is there any time
when you would have one, but not the other? And why can't it just be an
RO_COMPAT option? As long as the inode number field is zero, older
kernels will simply assume those directory entries represent deleted
entries, which should be fine.
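
The point about older kernels can be illustrated with a small user-space sketch. It assumes a simplified directory entry (struct mock_dirent) and a made-up file type value for whiteouts; neither is taken from the patches. The only behaviour it relies on is the one stated above: a reader that does not know about whiteouts treats any entry whose inode number is zero as unused.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_FT_REG_FILE 1	/* same value as EXT2_FT_REG_FILE */
#define MOCK_FT_WHT	 0xf0	/* hypothetical; the real value is whatever the patch defines */

/* Simplified stand-in for an on-disk ext2 directory entry. */
struct mock_dirent {
	uint32_t inode;		/* 0 means "not in use" to an old reader */
	uint8_t file_type;
	const char *name;
};

/*
 * Count the entries an older reader would show: it only looks at the
 * inode number and never interprets the file type of an unused entry.
 */
static size_t count_visible_old_kernel(const struct mock_dirent *de, size_t n)
{
	size_t i, visible = 0;

	for (i = 0; i < n; i++)
		if (de[i].inode != 0)
			visible++;
	return visible;
}
```

Under these assumptions a whiteout stored with inode 0 simply disappears from the listing on an older kernel, which is the behaviour argued for above when the file system is only read.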
> @@ -660,14 +684,14 @@ int ext2_make_empty(struct inode *inode, struct inode *parent)
> de->rec_len = ext2_rec_len_to_disk(EXT2_DIR_REC_LEN(1));
> memcpy (de->name, ".\0\0", 4);
> de->inode = cpu_to_le32(inode->i_ino);
> - ext2_set_de_type (de, inode);
> + ext2_set_de_type (de, inode->i_sb, inode->i_mode, 0);
In this and the following ext2_set_de_type() (for adding the "." and
".." entries in a new directory), the type can *only* be EXT2_FT_DIR.
So why not pass it in explicitly, and save the table lookup? This is
also why we should be able to drop passing in inode->i_mode to
ext2_set_de_type entirely --- the only place where we really need to
calculate file_type is ext2_set_link().
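
A rough user-space sketch of the shape being suggested, for illustration only: the setter takes the final file type, and the mode-to-type translation lives in a separate helper that only the ext2_set_link()-style caller would use. The mock_ names, the bool standing in for the FILETYPE feature test, and the octal mode constants are assumptions; the FT_ values mirror the usual ext2 numbering.

```c
#include <stdbool.h>

/* Directory entry file types, using the usual ext2 numbering. */
enum { FT_UNKNOWN, FT_REG_FILE, FT_DIR, FT_CHRDEV, FT_BLKDEV,
       FT_FIFO, FT_SOCK, FT_SYMLINK };

/* Local copies of the traditional mode bits, to keep the sketch standalone. */
#define MOCK_S_IFMT  0170000u
#define MOCK_S_IFREG 0100000u
#define MOCK_S_IFDIR 0040000u
#define MOCK_S_IFLNK 0120000u
#define MOCK_S_SHIFT 12

struct mock_dirent {
	unsigned char file_type;
};

/* Same table-lookup idea as ext2_type_by_mode[]; unlisted slots are 0. */
static const unsigned char mock_type_by_mode[MOCK_S_IFMT >> MOCK_S_SHIFT | 1] = {
	[MOCK_S_IFREG >> MOCK_S_SHIFT] = FT_REG_FILE,
	[MOCK_S_IFDIR >> MOCK_S_SHIFT] = FT_DIR,
	[MOCK_S_IFLNK >> MOCK_S_SHIFT] = FT_SYMLINK,
};

/* Only the caller that really has to derive the type from a mode uses this. */
static unsigned char mock_type_from_mode(unsigned int mode)
{
	return mock_type_by_mode[(mode & MOCK_S_IFMT) >> MOCK_S_SHIFT];
}

/* The setter no longer needs a mode: callers pass the final type. */
static void mock_set_de_type(struct mock_dirent *de, bool has_filetype,
			     unsigned char file_type)
{
	de->file_type = has_filetype ? file_type : 0;
}
```

In this shape the "." and ".." callers would pass FT_DIR directly, and whiteout or fallthru callers would pass their own type, so no caller needs the "0 means look at the mode" convention.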
- Ted
On Tue, Feb 21, 2012 at 06:05:46PM +0000, David Howells wrote:
> From: Valerie Aurora <[email protected]>
>
> Allow future code to use the guts of ext2_add_link().
>
> Original-author: Valerie Aurora <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: [email protected]
I'd suggest folding this in with the following patch (67/73). It's
not clear from this patch why it makes sense to rename ext2_add_link() to
ext2_add_entry() and then add a new ext2_add_link()
which calls ext2_add_entry(). It doesn't seem to clarify much....
I won't insist on it, but this seems to be unnecessary complication.
- Ted
On Tue, Feb 21, 2012 at 06:06:11PM +0000, David Howells wrote:
> From: Valerie Aurora <[email protected]>
>
> Add support for fallthru directory entries to ext2.
As I mentioned, I wonder if it makes sense to combine the patches for
whiteout and fallthrough directory entries into a single patch. Given
that the two patches modify the same functions, and in some cases the
second modifies lines added or modified by the first, it just makes life
easier if the two are folded together.
> --- a/include/linux/ext2_fs.h
> +++ b/include/linux/ext2_fs.h
> @@ -506,11 +506,14 @@ struct ext2_super_block {
> #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008
> #define EXT2_FEATURE_INCOMPAT_META_BG 0x0010
> #define EXT2_FEATURE_INCOMPAT_WHITEOUT 0x0020
> +/* ext3/4 incompat flags take up the intervening constants */
> +#define EXT2_FEATURE_INCOMPAT_FALLTHRU 0x2000
... and the codepoint 0x2000 in the INCOMPAT mask has since already
been assigned.
As I mentioned in a comment to the previous patch, any objections if
you combine these two flags into a single RO_COMPAT feature?
#define EXT2_FEATURE_RO_COMPAT_UNION_MOUNT 0x0800
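
For readers less familiar with the feature masks, the practical difference can be sketched in user space as follows. The function and enum names are made up for illustration; the rule itself is the standard ext2 one: an unknown INCOMPAT bit refuses the mount outright, while an unknown RO_COMPAT bit only refuses a read-write mount.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch. */
enum mount_result { MOUNT_OK, MOUNT_RW_REFUSED, MOUNT_REFUSED };

/*
 * Decide whether a kernel that understands the feature bits in
 * known_incompat / known_ro_compat may mount a file system whose
 * superblock advertises incompat / ro_compat.
 */
static enum mount_result check_features(uint32_t incompat, uint32_t ro_compat,
					uint32_t known_incompat,
					uint32_t known_ro_compat,
					bool read_only)
{
	/* Any unknown INCOMPAT bit: the on-disk format cannot be trusted. */
	if (incompat & ~known_incompat)
		return MOUNT_REFUSED;

	/* Unknown RO_COMPAT bit: reading is safe, writing is not. */
	if (!read_only && (ro_compat & ~known_ro_compat))
		return MOUNT_RW_REFUSED;

	return MOUNT_OK;
}
```

With the proposed 0x0800 RO_COMPAT bit, a kernel without union mount support could still mount the file system read-only, whereas the 0x0020 INCOMPAT whiteout bit from the quoted patch would make such a kernel refuse the mount entirely.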
- Ted
On 02/21/2012 09:59 AM, David Howells wrote:
> From: Valerie Aurora <[email protected]>
>
> Document design and implementation of union mounts (a.k.a. writable overlays).
>
> With corrections from Andreas Gruenbacher <[email protected]>.
>
> Original-author: Valerie Aurora <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> ---
>
> Documentation/filesystems/union-mounts.txt | 712 ++++++++++++++++++++++++++++
> 1 files changed, 712 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/filesystems/union-mounts.txt
>
> diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
> new file mode 100644
> index 0000000..596bfe6
> --- /dev/null
> +++ b/Documentation/filesystems/union-mounts.txt
> @@ -0,0 +1,712 @@
> +Union mounts (a.k.a. writable overlays)
> +=======================================
> +
> +This document describes the architecture and current status of union mounts,
> +also known as writable overlays.
> +
> +In this document:
> + - Overview of union mounts
> + - Terminology
> + - VFS implementation
> + - Locking strategy
> + - VFS/file system interface
> + - Userland interface
> + - NFS interaction
> + - Status
> + - Contributing to union mounts
> +
> +Overview
> +========
> +
> +A union mount layers one read-write file system over one or more read-only file
> +systems, with all writes going to the writable file system. The namespace of
> +both file systems appears as a combined whole to userland, with files and
> +directories on the writable file system covering up any files or directories
> +with matching pathnames on the read-only file system. The read-write file
> +system is the "topmost" or "upper" file system and the read-only file systems
> +are the "lower" file systems. A few use cases:
> +
> +- Root file system on CD with writes saved to hard drive (LiveCD)
> +- Multiple virtual machines with the same starting root file system
> +- Cluster with NFS mounted root on clients
> +
> +Most if not all of these problems could be solved with a COW block device or a
problems? use cases?
> +clustered file system (include NFS mounts). However, for some use cases,
> +sharing is more efficient and better performing if done at the file system
> +namespace level. COW block devices only increase their divergence as time goes
> +on, and a fully coherent writable file system is unnecessary synchronization
> +overhead if no other client needs to see the writes.
> +
> +What union mounts are not
> +-------------------------
> +
...
> +
> +Terminology
> +===========
> +
...
> +VFS objects and union mounts
> +----------------------------
> +
...
> +
> +In union mounts, a file system can only be the topmost layer for one union
> +mount. A file system can be part of multiple union mounts if it is a read-only
> +layer. So dentries in the read-only layers can be part of multiple unions,
> +while a dentry in the read-write layer can only be part of one unin.
typo: union.
> +
> +union_dir structure
> +---------------------
> +
...
> +/*
> + * The union_stack structure. It is an array of struct paths of
> + * directories below the topmost directory in a unioned directory, The
directory.
> + * topmost dentry has a pointer to this structure. The topmost dentry
> + * can only be part of one union, so we can reference it from the
> + * dentry, but lower dentries can be part of multiple union stacks.
> + *
> + * The number of dirs actually allocated is kept in the superblock,
> + * s_union_count.
> + */
> +struct union_stack {
> + struct path u_dirs[0];
> +};
> +
> +This structure is flexible enough to support an arbitrary number of layers of
> +unioned file systems. Since there can be more than two layers, this section
> +will talk about mapping "upper" directories to "lower" directories, instead of
> +"topmost" directories to "bottom" directories.
> +
> +Traversing the union stack
> +--------------------------
> +
...
> +Permission checks
> +-----------------
> +
...
> +
> +inode_permission() calls sb_permission() and __inode_permission() on the same
> +path. We create path_permission() which calls sb_permission() on the parent
> +directory from the top layer, and __inode_permission() on the target on the
> +lower layer. This gets us the correct write permissions consdering that the
considering
> +file will be copied up.
> +
> +Locking strategy
> +================
> +
> +The current union mount locking strategy is based on the following
> +rules:
> +
> +* The lower layer file system is always read-only
> +* The topmost file system is always read-write
> + => A file system can never a topmost and lower layer at the same time
can never be topmost and a lower layer at the same time
> +
> +Additionally, the topmost layer may only be mounted exactly once. Don't think
> +of the topmost layer as a separate independent file system; when it is part of
> +a union mount, it is only a file system in conjunction with the read-only
> +bottom layer. The read-only bottom layer is an independent file system in and
> +of itself and can be mounted elsewhere, including as the bottom layer for
> +another union mount.
> +
> +Thus, we may define a stable locking order in terms of top layer and bottom
> +layer locks, since a top layer is never a bottom layer and a bottom layer is
> +never a top layer. Another simplifying assumption is that all directories in a
> +pathname exist on the top layer, as they are created step-by-step during
> +lookup. This prevents us from ever having to walk backwards up the path
> +creating directory entries, which can get complicated. By implication, parent
> +directories paths during any operation (rename(), unlink(),etc.) are from the
directory paths
> +top layer. Dentries for directories from the bottom layer are only ever seen
> +or used by the lookup code.
> +
> +The two major problems we avoid with the above rules are:
> +
> +Lock ordering: Imagine two union stacks with the same two file systems: A
> +mounted over B, and B mounted over A. Sometimes locks on objects in both A and
> +B will have to be held simultanously. What order should they be acquired in?
simultaneously.
> +Simply acquiring them from top to bottom will create a lock-ordering problem -
> +one thread acquires lock on object from A and then tries for a lock on object
> +from B, while another thread grabs the lock on object from B and then waits for
> +the lock on object from A. Some other lock ordering must be defined.
> +
> +Movement/change/disappearance of objects on multiple layers: A variety of nasty
> +corner cases arise when more than one layer is changing at the same time.
> +Changes in the directory topology and their effect on inheritance are of
> +special concern. Al Viro's canonical email on the subject:
> +
> +http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
> +
> +We don't try to solve any of these cases, just avoid them in the first place.
> +
> +Todo: Prevent top layer from being mounted more than once.
> +
...
> +Userland support
> +================
> +
> +The mount command must support the "-o union" mount option and pass the
> +corresponding MS_UNION flag to the kerel. A util-linux git tree with union
kernel.
> +mount support is here:
> +
> +git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
> +
> +File system utilities must support whiteouts and fallthrus. An e2fsprogs git
> +tree with union mount support is here:
> +
> +git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
> +
> +Currently, whiteout directory entries are not returned to userland. While the
> +directory type for whiteouts, DT_WHT, has been defined for many years, very
> +little userland code handles them. Userland will never see fallthru directory
> +entries.
...
> +Non-features
> +------------
> +
...
> +Read-only top layer: The readdir() strategy fundamentally requires the ability
> +to create persistent directory entries on the top layer file system (which may
> +be tmpfs). However, you can union two read-only file systems by union mounting
> +a third file system (such as tmpfs) over the two read-onlly file systems.
read-only
> +Numerous alternatives to this readdir() strategy (including in-kernel or
> +in-application caching) exist and are compatible with union mounts with its
> +writing-readdir() implementation disabled. Creating a readdir() cookie that is
> +stable across multiple readdir()s requires one of:
> +
> +- Write to stable storage (e.g., fallthru dentries)
> +- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
> +- Per-application caching by glibc readdir()
> +
> +Often these features are supported by other unioning file systems or by other
> +versions of union mounts.
--
~Randy
On 2012-02-26, at 17:04, Ted Ts'o <[email protected]> wrote:
> On Tue, Feb 21, 2012 at 06:05:46PM +0000, David Howells wrote:
>> From: Valerie Aurora <[email protected]>
>>
>> Allow future code to use the guts of ext2_add_link().
>>
>> Original-author: Valerie Aurora <[email protected]>
>> Signed-off-by: David Howells <[email protected]>
>> Cc: Jan Kara <[email protected]>
>> Cc: [email protected]
>
> I'd suggest folding this in with the following patch (67/73). It's
> not clear from this patch why renaming ext2_add_link to
> ext2_add_entry() makes sense and then adding a new ext2_add_link()
> which calls ext2_add_entry(). It doesn't seem to clarify much....
Also, why is this being done in ext2, when it should only be done in ext4?
Fedora is already using ext4 for ext2- and ext3-formatted filesystems, to allow us to finally deprecate and then delete both of those trees and their ongoing duplicate maintenance. Adding new features to ext2 doesn't help that goal at all.
Cheers, Andreas
On Sun, Feb 26, 2012 at 08:30:34PM -0700, Andreas Dilger wrote:
> > I'd suggest folding this in with the following patch (67/73). It's
> > not clear from this patch why renaming ext2_add_link to
> > ext2_add_entry() makes sense and then adding a new ext2_add_link()
> > which calls ext2_add_entry(). It doesn't seem to clarify much....
>
> Also, why is this being done in ext2, when it should only be done in ext4?
I believe Val used ext2 as a proof-of-concept, because the codebase
was stable (and Union Mounts has been in the oven a loooong time, so
that was probably a good choice). I agree that if union mounts is
finally going to make it upstream, this would be a good time to get
support implemented for ext4, and to get the support into e2fsprogs.
BTW, one thing that I think would be a good thing to do while we're
making this change is to mask off the low 4 bits when looking at the
filetype field so eventually we can use the high 4 bits for some
future extension.
- Ted
On 2012-02-27, at 12:09 PM, Ted Ts'o wrote:
> On Sun, Feb 26, 2012 at 08:30:34PM -0700, Andreas Dilger wrote:
>>> I'd suggest folding this in with the following patch (67/73). It's
>>> not clear from this patch why renaming ext2_add_link to
>>> ext2_add_entry() makes sense and then adding a new ext2_add_link()
>>> which calls ext2_add_entry(). It doesn't seem to clarify much....
>>
>> Also, why is this being done in ext2, when it should only be done in ext4?
>
> I believe Val used ext2 as a proof-of-concept, because the codebase
> was stable (and Union Mounts has been in the oven a loooong time, so
> that was probably a good choice). I agree that if union mounts is
> finally going to make it upstream, this would be a good time to get
> support implemented for ext4, and to get the support into e2fsprogs.
>
> BTW, one thing that I think would be a good thing to do while we're
> making this change is to mask off the low 4 bits when looking at the
> filetype field so eventually we can use the high 4 bits for some
> future extension.
Umm, we already DO use the high 4 bits for a future extension in the
EXT4_FEATURE_INCOMPAT_DIRDATA feature. The bare minimum for this is
extracted from a larger patch that allows storing extra data in the
dirent. We use it to store a filesystem-wide 128-bit identifier in
the dirent, and it could also be used to store the high 32 bits of the
inode number in a compatible way.
I haven't pushed this upstream as I don't think anyone else is interested
in this yet, but masking off the file type is definitely simple and could
be accepted upstream.
Index: linux-stage/fs/ext4/ext4.h
===================================================================
--- linux-stage.orig/fs/ext4/ext4.h
+++ linux-stage/fs/ext4/ext4.h
@@ -1262,6 +1265,24 @@ struct ext4_dir_entry_2 {
#define EXT4_FT_SYMLINK 7
#define EXT4_FT_MAX 8
+#define EXT4_FT_MASK 0xf
+
+#if EXT4_FT_MAX > EXT4_FT_MASK
+#error "conflicting EXT4_FT_MAX and EXT4_FT_MASK"
+#endif
+
+/*
+ * d_type has 4 unused bits, so it can hold four types of data. These different
+ * types of data (e.g. Lustre file ID, high 32 bits of 64-bit inode number)
+ * can be stored, in flag order, after the file name in the ext4 dirent.
+ */
+/*
+ * This flag is added to d_type if ext4 dirent has extra data after filename.
+ * This data length is variable and length is stored in first byte of data.
+ * Data starts after filename NUL byte. This is used by Lustre FS.
+ */
+#define EXT4_DIRENT_LUFID 0x10
/*
* EXT4_DIR_PAD defines the directory entries boundaries
Index: linux-stage/fs/ext4/dir.c
===================================================================
--- linux-stage.orig/fs/ext4/dir.c
+++ linux-stage/fs/ext4/dir.c
@@ -53,11 +53,14 @@ const struct file_operations ext4_dir_op
static unsigned char get_dtype(struct super_block *sb, int filetype)
{
+ int fl_index = filetype & EXT4_FT_MASK;
+
if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FILETYPE) ||
- (filetype >= EXT4_FT_MAX))
+ (fl_index >= EXT4_FT_MAX))
return DT_UNKNOWN;
- return (ext4_filetype_table[filetype]);
+ return ext4_filetype_table[fl_index];
+
}
Cheers, Andreas
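
The effect of the masking in get_dtype() can be tried out in userspace with
a rough sketch like the one below. It is not the kernel code. The table
order follows the ext4 on-disk file type values (0 unknown, 1 regular,
2 directory, ... 7 symlink), the SK_DT_* constants are local copies of the
usual DT_* values so that the sketch compiles without <dirent.h>, and
sketch_get_dtype() with its has_filetype argument is a hypothetical stand-in
for the superblock feature test.

```c
/* Values mirror the patch above. */
#define SK_FT_MAX	8
#define SK_FT_MASK	0xf
#define SK_DIRENT_LUFID	0x10	/* extra-data flag in the high 4 bits */

/* Local copies of the DT_* values normally found in <dirent.h>. */
enum {
	SK_DT_UNKNOWN = 0, SK_DT_FIFO = 1, SK_DT_CHR = 2, SK_DT_DIR = 4,
	SK_DT_BLK = 6, SK_DT_REG = 8, SK_DT_LNK = 10, SK_DT_SOCK = 12
};

/* Indexed by on-disk file type (low 4 bits of the dirent file_type byte). */
static const unsigned char sk_filetype_table[SK_FT_MAX] = {
	SK_DT_UNKNOWN, SK_DT_REG, SK_DT_DIR, SK_DT_CHR,
	SK_DT_BLK, SK_DT_FIFO, SK_DT_SOCK, SK_DT_LNK
};

/*
 * Strip the high flag bits before the range check and the table lookup,
 * so a dirent carrying SK_DIRENT_LUFID still reports its real type.
 */
static unsigned char sketch_get_dtype(int has_filetype, int filetype)
{
	int fl_index = filetype & SK_FT_MASK;

	if (!has_filetype || fl_index >= SK_FT_MAX)
		return SK_DT_UNKNOWN;
	return sk_filetype_table[fl_index];
}
```

Without the mask, a directory entry with the flag set (2 | 0x10 = 0x12)
would fail the range check and be reported as DT_UNKNOWN.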
On Tue, Feb 21, 2012 at 05:58:25PM +0000, David Howells wrote:
> From: Valerie Aurora <[email protected]>
>
> Passing the CL_NO_SHARED flag to clone_mnt() causes the clone to fail
> if the source mnt is shared.
>
> Original-author: Valerie Aurora <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> Cc: Ram Pai <[email protected]>
Reviewed-by: Ram Pai <[email protected]>
> ---
>
> fs/namespace.c | 3 +++
> fs/pnode.h | 1 +
> 2 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 35c3b80..f92f574 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -740,6 +740,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
> struct mount *mnt;
> int err;
>
> + if ((flag & CL_NO_SHARED) && IS_MNT_SHARED(old))
> + return ERR_PTR(-EINVAL);
> +
> mnt = alloc_vfsmnt(old->mnt_devname);
> if (!mnt)
> return ERR_PTR(-ENOMEM);
> diff --git a/fs/pnode.h b/fs/pnode.h
> index 65c6097..c7089dd 100644
> --- a/fs/pnode.h
> +++ b/fs/pnode.h
> @@ -22,6 +22,7 @@
> #define CL_COPY_ALL 0x04
> #define CL_MAKE_SHARED 0x08
> #define CL_PRIVATE 0x10
> +#define CL_NO_SHARED 0x20
>
> static inline void set_mnt_shared(struct mount *mnt)
> {
On Tue, Feb 21, 2012 at 05:58:32PM +0000, David Howells wrote:
> From: Valerie Aurora <[email protected]>
>
> Passing the CL_NO_SLAVE flag to clone_mnt() causes the clone
> to fail if the source mnt is a slave.
>
> Original-author: Valerie Aurora <[email protected]>
> Signed-off-by: David Howells <[email protected]>
> Cc: Ram Pai <[email protected]>
Reviewed-by: Ram Pai <[email protected]>
> ---
>
> fs/namespace.c | 3 +++
> fs/pnode.h | 1 +
> 2 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index f92f574..96f43f2 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -743,6 +743,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
> if ((flag & CL_NO_SHARED) && IS_MNT_SHARED(old))
> return ERR_PTR(-EINVAL);
>
> + if ((flag & CL_NO_SLAVE) && IS_MNT_SLAVE(old))
> + return ERR_PTR(-EINVAL);
> +
> mnt = alloc_vfsmnt(old->mnt_devname);
> if (!mnt)
> return ERR_PTR(-ENOMEM);
> diff --git a/fs/pnode.h b/fs/pnode.h
> index c7089dd..f7ae149 100644
> --- a/fs/pnode.h
> +++ b/fs/pnode.h
> @@ -23,6 +23,7 @@
> #define CL_MAKE_SHARED 0x08
> #define CL_PRIVATE 0x10
> #define CL_NO_SHARED 0x20
> +#define CL_NO_SLAVE 0x40
>
> static inline void set_mnt_shared(struct mount *mnt)
> {
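
Taken together, the two patches make clone_mnt() reject shared and slave
source mounts up front when the caller asks for it. A small userspace model
of the combined checks might look like the sketch below. The flag values
are the ones from the fs/pnode.h hunks above. struct sketch_mount and
sketch_clone_check() are hypothetical stand-ins for struct mount and for
the IS_MNT_SHARED()/IS_MNT_SLAVE() tests, which in the kernel inspect the
mount's propagation state rather than two booleans.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag values as added to fs/pnode.h by the two patches above. */
#define CL_NO_SHARED	0x20
#define CL_NO_SLAVE	0x40

/* Hypothetical stand-in for the propagation state of a struct mount. */
struct sketch_mount {
	bool shared;
	bool slave;
};

/*
 * Models the early-exit checks at the top of clone_mnt():
 * returns -EINVAL if the caller forbids the source mount's
 * propagation type, 0 if cloning could proceed.
 */
static int sketch_clone_check(const struct sketch_mount *old, int flag)
{
	if ((flag & CL_NO_SHARED) && old->shared)
		return -EINVAL;
	if ((flag & CL_NO_SLAVE) && old->slave)
		return -EINVAL;
	return 0;
}
```

A caller that wants a purely private source would pass
CL_NO_SHARED | CL_NO_SLAVE, and each flag only rejects its own case.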