Al and Linus,
Please consider overlayfs for inclusion into 3.10.
It's included in Ubuntu and openSUSE, used by OpenWrt and various other
projects. I regularly get emails asking when it will be included in mainline.
Git tree is here:
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.v16
Thanks,
Miklos
---
Andy Whitcroft (3):
overlayfs-add-statfs-support
ovl-switch-to-inode_permission
overlayfs-copy-up-i_uid-i_gid-from-the-underlying-inode
Erez Zadok (1):
overlayfs-implement-show_options
Miklos Szeredi (6):
vfs-add-i_op-dentry_open
vfs-export-do_splice_direct-to-modules
vfs-introduce-clone_private_mount
overlay filesystem
fs-limit-filesystem-stacking-depth
vfs-export-inode_permission-to-modules
Neil Brown (1):
overlay-overlay-filesystem-documentation
Robin Dong (2):
overlayfs-fix-possible-leak-in-ovl_new_inode
overlayfs-create-new-inode-in-ovl_link
---
Documentation/filesystems/Locking | 2 +
Documentation/filesystems/overlayfs.txt | 199 +++++++++
Documentation/filesystems/vfs.txt | 7 +
MAINTAINERS | 7 +
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/ecryptfs/main.c | 7 +
fs/internal.h | 5 -
fs/namei.c | 10 +-
fs/namespace.c | 18 +
fs/open.c | 23 +-
fs/overlayfs/Kconfig | 4 +
fs/overlayfs/Makefile | 7 +
fs/overlayfs/copy_up.c | 385 +++++++++++++++++
fs/overlayfs/dir.c | 604 +++++++++++++++++++++++++++
fs/overlayfs/inode.c | 372 +++++++++++++++++
fs/overlayfs/overlayfs.h | 70 ++++
fs/overlayfs/readdir.c | 566 +++++++++++++++++++++++++
fs/overlayfs/super.c | 685 +++++++++++++++++++++++++++++++
fs/splice.c | 1 +
include/linux/fs.h | 14 +
include/linux/mount.h | 3 +
22 files changed, 2981 insertions(+), 10 deletions(-)
create mode 100644 Documentation/filesystems/overlayfs.txt
create mode 100644 fs/overlayfs/Kconfig
create mode 100644 fs/overlayfs/Makefile
create mode 100644 fs/overlayfs/copy_up.c
create mode 100644 fs/overlayfs/dir.c
create mode 100644 fs/overlayfs/inode.c
create mode 100644 fs/overlayfs/overlayfs.h
create mode 100644 fs/overlayfs/readdir.c
create mode 100644 fs/overlayfs/super.c
From: Miklos Szeredi <[email protected]>
Overlayfs needs a private clone of the mount, so create a function for
this and export to modules.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/namespace.c | 18 ++++++++++++++++++
include/linux/mount.h | 3 +++
2 files changed, 21 insertions(+)
diff --git a/fs/namespace.c b/fs/namespace.c
index 50ca17d..9791b4e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1399,6 +1399,24 @@ void drop_collected_mounts(struct vfsmount *mnt)
release_mounts(&umount_list);
}
+struct vfsmount *clone_private_mount(struct path *path)
+{
+ struct mount *old_mnt = real_mount(path->mnt);
+ struct mount *new_mnt;
+
+ if (IS_MNT_UNBINDABLE(old_mnt))
+ return ERR_PTR(-EINVAL);
+
+ down_read(&namespace_sem);
+ new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE);
+ up_read(&namespace_sem);
+ if (!new_mnt)
+ return ERR_PTR(-ENOMEM);
+
+ return &new_mnt->mnt;
+}
+EXPORT_SYMBOL_GPL(clone_private_mount);
+
int iterate_mounts(int (*f)(struct vfsmount *, void *), void *arg,
struct vfsmount *root)
{
diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..344a262 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -66,6 +66,9 @@ extern void mnt_pin(struct vfsmount *mnt);
extern void mnt_unpin(struct vfsmount *mnt);
extern int __mnt_is_readonly(struct vfsmount *mnt);
+struct path;
+extern struct vfsmount *clone_private_mount(struct path *path);
+
struct file_system_type;
extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
int flags, const char *name,
--
1.7.10.4
From: Miklos Szeredi <[email protected]>
We need to be able to check inode permissions (but not filesystem implied
permissions) for stackable filesystems. Expose this interface for overlayfs.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/internal.h | 5 -----
fs/namei.c | 1 +
include/linux/fs.h | 1 +
3 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h
index 507141f..89481ac 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -42,11 +42,6 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
extern void __init chrdev_init(void);
/*
- * namei.c
- */
-extern int __inode_permission(struct inode *, int);
-
-/*
* namespace.c
*/
extern int copy_mount_options(const void __user *, unsigned long *);
diff --git a/fs/namei.c b/fs/namei.c
index c1c0c15..e1cffbd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -402,6 +402,7 @@ int __inode_permission(struct inode *inode, int mask)
return security_inode_permission(inode, mask);
}
+EXPORT_SYMBOL(__inode_permission);
/**
* sb_permission - Check superblock-level permissions
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e4e5fbd..3353de6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2224,6 +2224,7 @@ extern sector_t bmap(struct inode *, sector_t);
#endif
extern int notify_change(struct dentry *, struct iattr *);
extern int inode_permission(struct inode *, int);
+extern int __inode_permission(struct inode *, int);
extern int generic_permission(struct inode *, int);
static inline bool execute_ok(struct inode *inode)
--
1.7.10.4
From: Andy Whitcroft <[email protected]>
Add support for statfs to the overlayfs filesystem. As the upper layer
is the target of all write operations assume that the space in that
filesystem is the space in the overlayfs. There will be some inaccuracy as
overwriting a file will copy it up and consume space we were not expecting,
but it is better than nothing.
Use the upper layer dentry and mount from the overlayfs root inode,
passing the statfs call to that filesystem.
Signed-off-by: Andy Whitcroft <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/super.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 02deecd..928b1b1 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -17,15 +17,19 @@
#include <linux/module.h>
#include <linux/cred.h>
#include <linux/sched.h>
+#include <linux/statfs.h>
#include "overlayfs.h"
MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
MODULE_DESCRIPTION("Overlay filesystem");
MODULE_LICENSE("GPL");
+#define OVERLAYFS_SUPER_MAGIC 0x794c764f
+
struct ovl_fs {
struct vfsmount *upper_mnt;
struct vfsmount *lower_mnt;
+ long lower_namelen;
};
struct ovl_entry {
@@ -406,9 +410,36 @@ static int ovl_remount_fs(struct super_block *sb, int *flagsp, char *data)
return mnt_want_write(ufs->upper_mnt);
}
+/**
+ * ovl_statfs
+ * @sb: The overlayfs super block
+ * @buf: The struct kstatfs to fill in with stats
+ *
+ * Get the filesystem statistics. As writes always target the upper layer
+ * filesystem pass the statfs to the same filesystem.
+ */
+static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+ struct dentry *root_dentry = dentry->d_sb->s_root;
+ struct path path;
+ int err;
+
+ ovl_path_upper(root_dentry, &path);
+
+ err = vfs_statfs(&path, buf);
+ if (!err) {
+ buf->f_namelen = max(buf->f_namelen, ofs->lower_namelen);
+ buf->f_type = OVERLAYFS_SUPER_MAGIC;
+ }
+
+ return err;
+}
+
static const struct super_operations ovl_super_operations = {
.put_super = ovl_put_super,
.remount_fs = ovl_remount_fs,
+ .statfs = ovl_statfs,
};
struct ovl_config {
@@ -474,6 +505,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
struct ovl_entry *oe;
struct ovl_fs *ufs;
struct ovl_config config;
+ struct kstatfs statfs;
int err;
err = ovl_parse_opt((char *) data, &config);
@@ -508,6 +540,13 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
!S_ISDIR(lowerpath.dentry->d_inode->i_mode))
goto out_put_lowerpath;
+ err = vfs_statfs(&lowerpath, &statfs);
+ if (err) {
+ printk(KERN_ERR "overlayfs: statfs failed on lowerpath\n");
+ goto out_put_lowerpath;
+ }
+ ufs->lower_namelen = statfs.f_namelen;
+
ufs->upper_mnt = clone_private_mount(&upperpath);
err = PTR_ERR(ufs->upper_mnt);
if (IS_ERR(ufs->upper_mnt)) {
@@ -556,6 +595,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
root_dentry->d_fsdata = oe;
root_dentry->d_op = &ovl_dentry_operations;
+ sb->s_magic = OVERLAYFS_SUPER_MAGIC;
sb->s_op = &ovl_super_operations;
sb->s_root = root_dentry;
sb->s_fs_info = ufs;
--
1.7.10.4
From: Robin Dong <[email protected]>
After allocating a new inode, if the mode of inode is incorrect, we should
release it by iput().
Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/inode.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 0dfa8a4..033de6f 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -371,6 +371,7 @@ struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
default:
WARN(1, "illegal file type: %i\n", mode);
+ iput(inode);
inode = NULL;
}
--
1.7.10.4
From: Andy Whitcroft <[email protected]>
YAMA et al rely on on i_uid/i_gid to be populated in order to perform
their checks. While these really cannot be guarenteed as the underlying
filesystem may not even have the concept, they are expected to be filled
when possible. To quote Al Viro:
"Ideally, yes, we'd want to have ->i_uid used only by fs-specific
code and helpers used by that fs (including those that are
implicit defaults). [...] In practice we have enough places
where uid/gid is used directly to make setting them practically
a requirement - places like /proc/<pid>/ can get away with
not doing that, but only because shitloads of syscalls are
not allowed on those anyway, permissions or no permissions.
In anything general-purpose you really need to set it."
Copy up the underlying filesystem information into the overlayfs inode
when we create it.
BugLink: http://bugs.launchpad.net/bugs/944386
Signed-off-by: Andy Whitcroft <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/dir.c | 2 ++
fs/overlayfs/inode.c | 2 ++
fs/overlayfs/overlayfs.h | 6 ++++++
fs/overlayfs/super.c | 1 +
4 files changed, 11 insertions(+)
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index d8f6ee0..b530e56 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -304,6 +304,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
}
}
ovl_dentry_update(dentry, newdentry);
+ ovl_copyattr(newdentry->d_inode, inode);
d_instantiate(dentry, inode);
inode = NULL;
newdentry = NULL;
@@ -446,6 +447,7 @@ static int ovl_link(struct dentry *old, struct inode *newdir,
new->d_fsdata);
if (!newinode)
goto link_fail;
+ ovl_copyattr(upperdir->d_inode, newinode);
ovl_dentry_version_inc(new->d_parent);
ovl_dentry_update(new, newdentry);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 3218a38..ee37e92 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -31,6 +31,8 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
mutex_lock(&upperdentry->d_inode->i_mutex);
err = notify_change(upperdentry, attr);
+ if (!err)
+ ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
mutex_unlock(&upperdentry->d_inode->i_mutex);
return err;
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index fe1241d..1cba38f 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -56,6 +56,12 @@ int ovl_removexattr(struct dentry *dentry, const char *name);
struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
struct ovl_entry *oe);
+static inline void ovl_copyattr(struct inode *from, struct inode *to)
+{
+ to->i_uid = from->i_uid;
+ to->i_gid = from->i_gid;
+}
+
/* dir.c */
extern const struct inode_operations ovl_dir_inode_operations;
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 8cfe8df..357d6e8 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -347,6 +347,7 @@ static int ovl_do_lookup(struct dentry *dentry)
oe);
if (!inode)
goto out_dput;
+ ovl_copyattr(realdentry->d_inode, inode);
}
if (upperdentry)
--
1.7.10.4
From: Andy Whitcroft <[email protected]>
When checking permissions on an overlayfs inode we do not take into
account either device cgroup restrictions nor security permissions.
This allows a user to mount an overlayfs layer over a restricted device
directory and by pass those permissions to open otherwise restricted
files.
Switch over to __inode_permissions.
Signed-off-by: Andy Whitcroft <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/inode.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 033de6f..3218a38 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -100,19 +100,9 @@ int ovl_permission(struct inode *inode, int mask)
if (is_upper && !IS_RDONLY(inode) && IS_RDONLY(realinode) &&
(S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
goto out_dput;
-
- /*
- * Nobody gets write access to an immutable file.
- */
- err = -EACCES;
- if (IS_IMMUTABLE(realinode))
- goto out_dput;
}
- if (realinode->i_op->permission)
- err = realinode->i_op->permission(realinode, mask);
- else
- err = generic_permission(realinode, mask);
+ err = __inode_permission(realinode, mask);
out_dput:
dput(alias);
return err;
--
1.7.10.4
From: Robin Dong <[email protected]>
Imaging using ext4 as upperdir which has a file "hello" and lowdir is
totally empty.
1. mount -t overlayfs overlayfs -o lowerdir=/lower,upperdir=/upper /overlay
2. cd /overlay
3. ln hello bye
then the overlayfs code will call vfs_link to create a real ext4
dentry for "bye" and create
a new overlayfs dentry point to overlayfs inode (which standed for
"hello"). That means:
two overlayfs dentries and only one overlayfs inode.
and then
4. umount /overlay
5. mount -t overlayfs overlayfs -o lowerdir=/lower,upperdir=/upper
/overlay (again)
6. cd /overlay
7. ls hello bye
the overlayfs will create two inodes(one for the "hello", another
for the "bye") and two dentries (each point a inode).That means:
two dentries and two inodes.
As above, with different order of "create link" and "mount", the
result is not the same.
In order to make the behavior coherent, we need to create inode in ovl_link.
Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/dir.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 6fc5aa5..d8f6ee0 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -417,6 +417,7 @@ static int ovl_link(struct dentry *old, struct inode *newdir,
struct dentry *olddentry;
struct dentry *newdentry;
struct dentry *upperdir;
+ struct inode *newinode;
err = ovl_copy_up(old);
if (err)
@@ -441,13 +442,17 @@ static int ovl_link(struct dentry *old, struct inode *newdir,
err = -ENOENT;
goto out_unlock;
}
+ newinode = ovl_new_inode(old->d_sb, newdentry->d_inode->i_mode,
+ new->d_fsdata);
+ if (!newinode)
+ goto link_fail;
ovl_dentry_version_inc(new->d_parent);
ovl_dentry_update(new, newdentry);
- ihold(old->d_inode);
- d_instantiate(new, old->d_inode);
+ d_instantiate(new, newinode);
} else {
+link_fail:
if (ovl_dentry_is_opaque(new))
ovl_whiteout(upperdir, new);
dput(newdentry);
--
1.7.10.4
From: Miklos Szeredi <[email protected]>
Add a simple read-only counter to super_block that indicates deep this
is in the stack of filesystems. Previously ecryptfs was the only
stackable filesystem and it explicitly disallowed multiple layers of
itself.
Overlayfs, however, can be stacked recursively and also may be stacked
on top of ecryptfs or vice versa.
To limit the kernel stack usage we must limit the depth of the
filesystem stack. Initially the limit is set to 2.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/ecryptfs/main.c | 7 +++++++
fs/overlayfs/super.c | 10 ++++++++++
include/linux/fs.h | 11 +++++++++++
3 files changed, 28 insertions(+)
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index e924cf4..8f7551e 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -567,6 +567,13 @@ static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
s->s_maxbytes = path.dentry->d_sb->s_maxbytes;
s->s_blocksize = path.dentry->d_sb->s_blocksize;
s->s_magic = ECRYPTFS_SUPER_MAGIC;
+ s->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
+
+ rc = -EINVAL;
+ if (s->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ printk(KERN_ERR "eCryptfs: maximum fs stacking depth exceeded\n");
+ goto out_free;
+ }
inode = ecryptfs_get_inode(path.dentry->d_inode, s);
rc = PTR_ERR(inode);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 20e5ecf..8cfe8df 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -570,6 +570,16 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
}
ufs->lower_namelen = statfs.f_namelen;
+ sb->s_stack_depth = max(upperpath.mnt->mnt_sb->s_stack_depth,
+ lowerpath.mnt->mnt_sb->s_stack_depth) + 1;
+
+ err = -EINVAL;
+ if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ printk(KERN_ERR "overlayfs: maximum fs stacking depth exceeded\n");
+ goto out_put_lowerpath;
+ }
+
+
ufs->upper_mnt = clone_private_mount(&upperpath);
err = PTR_ERR(ufs->upper_mnt);
if (IS_ERR(ufs->upper_mnt)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 891c93d..e4e5fbd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -244,6 +244,12 @@ struct iattr {
*/
#include <linux/quota.h>
+/*
+ * Maximum number of layers of fs stack. Needs to be limited to
+ * prevent kernel stack overflow
+ */
+#define FILESYSTEM_MAX_STACK_DEPTH 2
+
/**
* enum positive_aop_returns - aop return codes with specific semantics
*
@@ -1320,6 +1326,11 @@ struct super_block {
/* Being remounted read-only */
int s_readonly_remount;
+
+ /*
+ * Indicates how deep in a filesystem stack this SB is
+ */
+ int s_stack_depth;
};
/* superblock cache pruning functions */
--
1.7.10.4
From: Erez Zadok <[email protected]>
This is useful because of the stacking nature of overlayfs. Users like to
find out (via /proc/mounts) which lower/upper directory were used at mount
time.
Signed-off-by: Erez Zadok <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/super.c | 63 ++++++++++++++++++++++++++++++++++----------------
1 file changed, 43 insertions(+), 20 deletions(-)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 928b1b1..20e5ecf 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -18,6 +18,7 @@
#include <linux/cred.h>
#include <linux/sched.h>
#include <linux/statfs.h>
+#include <linux/seq_file.h>
#include "overlayfs.h"
MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
@@ -26,12 +27,21 @@ MODULE_LICENSE("GPL");
#define OVERLAYFS_SUPER_MAGIC 0x794c764f
+struct ovl_config {
+ char *lowerdir;
+ char *upperdir;
+};
+
+/* private information held for overlayfs's superblock */
struct ovl_fs {
struct vfsmount *upper_mnt;
struct vfsmount *lower_mnt;
long lower_namelen;
+ /* pathnames of lower and upper dirs, for show_options */
+ struct ovl_config config;
};
+/* private information held for every overlayfs dentry */
struct ovl_entry {
/*
* Keep "double reference" on upper dentries, so that
@@ -388,6 +398,8 @@ static void ovl_put_super(struct super_block *sb)
mntput(ufs->upper_mnt);
mntput(ufs->lower_mnt);
+ kfree(ufs->config.lowerdir);
+ kfree(ufs->config.upperdir);
kfree(ufs);
}
@@ -436,15 +448,27 @@ static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
return err;
}
+/**
+ * ovl_show_options
+ *
+ * Prints the mount options for a given superblock.
+ * Returns zero; does not fail.
+ */
+static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct ovl_fs *ufs = sb->s_fs_info;
+
+ seq_printf(m, ",lowerdir=%s", ufs->config.lowerdir);
+ seq_printf(m, ",upperdir=%s", ufs->config.upperdir);
+ return 0;
+}
+
static const struct super_operations ovl_super_operations = {
.put_super = ovl_put_super,
.remount_fs = ovl_remount_fs,
.statfs = ovl_statfs,
-};
-
-struct ovl_config {
- char *lowerdir;
- char *upperdir;
+ .show_options = ovl_show_options,
};
enum {
@@ -504,34 +528,33 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
struct dentry *root_dentry;
struct ovl_entry *oe;
struct ovl_fs *ufs;
- struct ovl_config config;
struct kstatfs statfs;
int err;
- err = ovl_parse_opt((char *) data, &config);
- if (err)
+ err = -ENOMEM;
+ ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
+ if (!ufs)
goto out;
+ err = ovl_parse_opt((char *) data, &ufs->config);
+ if (err)
+ goto out_free_ufs;
+
err = -EINVAL;
- if (!config.upperdir || !config.lowerdir) {
+ if (!ufs->config.upperdir || !ufs->config.lowerdir) {
printk(KERN_ERR "overlayfs: missing upperdir or lowerdir\n");
goto out_free_config;
}
- err = -ENOMEM;
- ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
- if (!ufs)
- goto out_free_config;
-
oe = ovl_alloc_entry();
if (oe == NULL)
- goto out_free_ufs;
+ goto out_free_config;
- err = kern_path(config.upperdir, LOOKUP_FOLLOW, &upperpath);
+ err = kern_path(ufs->config.upperdir, LOOKUP_FOLLOW, &upperpath);
if (err)
goto out_free_oe;
- err = kern_path(config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
+ err = kern_path(ufs->config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
if (err)
goto out_put_upperpath;
@@ -615,11 +638,11 @@ out_put_upperpath:
path_put(&upperpath);
out_free_oe:
kfree(oe);
+out_free_config:
+ kfree(ufs->config.lowerdir);
+ kfree(ufs->config.upperdir);
out_free_ufs:
kfree(ufs);
-out_free_config:
- kfree(config.lowerdir);
- kfree(config.upperdir);
out:
return err;
}
--
1.7.10.4
From: Neil Brown <[email protected]>
Document the overlay filesystem.
Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/overlayfs.txt | 199 +++++++++++++++++++++++++++++++
MAINTAINERS | 7 ++
2 files changed, 206 insertions(+)
create mode 100644 Documentation/filesystems/overlayfs.txt
diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
new file mode 100644
index 0000000..00dbab0
--- /dev/null
+++ b/Documentation/filesystems/overlayfs.txt
@@ -0,0 +1,199 @@
+Written by: Neil Brown <[email protected]>
+
+Overlay Filesystem
+==================
+
+This document describes a prototype for a new approach to providing
+overlay-filesystem functionality in Linux (sometimes referred to as
+union-filesystems). An overlay-filesystem tries to present a
+filesystem which is the result over overlaying one filesystem on top
+of the other.
+
+The result will inevitably fail to look exactly like a normal
+filesystem for various technical reasons. The expectation is that
+many use cases will be able to ignore these differences.
+
+This approach is 'hybrid' because the objects that appear in the
+filesystem do not all appear to belong to that filesystem. In many
+cases an object accessed in the union will be indistinguishable
+from accessing the corresponding object from the original filesystem.
+This is most obvious from the 'st_dev' field returned by stat(2).
+
+While directories will report an st_dev from the overlay-filesystem,
+all non-directory objects will report an st_dev from the lower or
+upper filesystem that is providing the object. Similarly st_ino will
+only be unique when combined with st_dev, and both of these can change
+over the lifetime of a non-directory object. Many applications and
+tools ignore these values and will not be affected.
+
+Upper and Lower
+---------------
+
+An overlay filesystem combines two filesystems - an 'upper' filesystem
+and a 'lower' filesystem. When a name exists in both filesystems, the
+object in the 'upper' filesystem is visible while the object in the
+'lower' filesystem is either hidden or, in the case of directories,
+merged with the 'upper' object.
+
+It would be more correct to refer to an upper and lower 'directory
+tree' rather than 'filesystem' as it is quite possible for both
+directory trees to be in the same filesystem and there is no
+requirement that the root of a filesystem be given for either upper or
+lower.
+
+The lower filesystem can be any filesystem supported by Linux and does
+not need to be writable. The lower filesystem can even be another
+overlayfs. The upper filesystem will normally be writable and if it
+is it must support the creation of trusted.* extended attributes, and
+must provide valid d_type in readdir responses, at least for symbolic
+links - so NFS is not suitable.
+
+A read-only overlay of two read-only filesystems may use any
+filesystem type.
+
+Directories
+-----------
+
+Overlaying mainly involves directories. If a given name appears in both
+upper and lower filesystems and refers to a non-directory in either,
+then the lower object is hidden - the name refers only to the upper
+object.
+
+Where both upper and lower objects are directories, a merged directory
+is formed.
+
+At mount time, the two directories given as mount options are combined
+into a merged directory:
+
+ mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper /overlay
+
+Then whenever a lookup is requested in such a merged directory, the
+lookup is performed in each actual directory and the combined result
+is cached in the dentry belonging to the overlay filesystem. If both
+actual lookups find directories, both are stored and a merged
+directory is created, otherwise only one is stored: the upper if it
+exists, else the lower.
+
+Only the lists of names from directories are merged. Other content
+such as metadata and extended attributes are reported for the upper
+directory only. These attributes of the lower directory are hidden.
+
+whiteouts and opaque directories
+--------------------------------
+
+In order to support rm and rmdir without changing the lower
+filesystem, an overlay filesystem needs to record in the upper filesystem
+that files have been removed. This is done using whiteouts and opaque
+directories (non-directories are always opaque).
+
+The overlay filesystem uses extended attributes with a
+"trusted.overlay." prefix to record these details.
+
+A whiteout is created as a symbolic link with target
+"(overlay-whiteout)" and with xattr "trusted.overlay.whiteout" set to "y".
+When a whiteout is found in the upper level of a merged directory, any
+matching name in the lower level is ignored, and the whiteout itself
+is also hidden.
+
+A directory is made opaque by setting the xattr "trusted.overlay.opaque"
+to "y". Where the upper filesystem contains an opaque directory, any
+directory in the lower filesystem with the same name is ignored.
+
+readdir
+-------
+
+When a 'readdir' request is made on a merged directory, the upper and
+lower directories are each read and the name lists merged in the
+obvious way (upper is read first, then lower - entries that already
+exist are not re-added). This merged name list is cached in the
+'struct file' and so remains as long as the file is kept open. If the
+directory is opened and read by two processes at the same time, they
+will each have separate caches. A seekdir to the start of the
+directory (offset 0) followed by a readdir will cause the cache to be
+discarded and rebuilt.
+
+This means that changes to the merged directory do not appear while a
+directory is being read. This is unlikely to be noticed by many
+programs.
+
+seek offsets are assigned sequentially when the directories are read.
+Thus if
+ - read part of a directory
+ - remember an offset, and close the directory
+ - re-open the directory some time later
+ - seek to the remembered offset
+
+there may be little correlation between the old and new locations in
+the list of filenames, particularly if anything has changed in the
+directory.
+
+Readdir on directories that are not merged is simply handled by the
+underlying directory (upper or lower).
+
+
+Non-directories
+---------------
+
+Objects that are not directories (files, symlinks, device-special
+files etc.) are presented either from the upper or lower filesystem as
+appropriate. When a file in the lower filesystem is accessed in a way
+the requires write-access, such as opening for write access, changing
+some metadata etc., the file is first copied from the lower filesystem
+to the upper filesystem (copy_up). Note that creating a hard-link
+also requires copy_up, though of course creation of a symlink does
+not.
+
+The copy_up may turn out to be unnecessary, for example if the file is
+opened for read-write but the data is not modified.
+
+The copy_up process first makes sure that the containing directory
+exists in the upper filesystem - creating it and any parents as
+necessary. It then creates the object with the same metadata (owner,
+mode, mtime, symlink-target etc.) and then if the object is a file, the
+data is copied from the lower to the upper filesystem. Finally any
+extended attributes are copied up.
+
+Once the copy_up is complete, the overlay filesystem simply
+provides direct access to the newly created file in the upper
+filesystem - future operations on the file are barely noticed by the
+overlay filesystem (though an operation on the name of the file such as
+rename or unlink will of course be noticed and handled).
+
+
+Non-standard behavior
+---------------------
+
+The copy_up operation essentially creates a new, identical file and
+moves it over to the old name. The new file may be on a different
+filesystem, so both st_dev and st_ino of the file may change.
+
+Any open files referring to this inode will access the old data and
+metadata. Similarly any file locks obtained before copy_up will not
+apply to the copied up file.
+
+On a file opened with O_RDONLY fchmod(2), fchown(2), futimesat(2) and
+fsetxattr(2) will fail with EROFS.
+
+If a file with multiple hard links is copied up, then this will
+"break" the link. Changes will not be propagated to other names
+referring to the same inode.
+
+Symlinks in /proc/PID/ and /proc/PID/fd which point to a non-directory
+object in overlayfs will not contain valid absolute paths, only
+relative paths leading up to the filesystem's root. This will be
+fixed in the future.
+
+Some operations are not atomic, for example a crash during copy_up or
+rename will leave the filesystem in an inconsistent state. This will
+be addressed in the future.
+
+Changes to underlying filesystems
+---------------------------------
+
+Offline changes, when the overlay is not mounted, are allowed to either
+the upper or the lower trees.
+
+Changes to the underlying filesystems while part of a mounted overlay
+filesystem are not allowed. If the underlying filesystem is changed,
+the behavior of the overlay is undefined, though it will not result in
+a crash or deadlock.
diff --git a/MAINTAINERS b/MAINTAINERS
index 9561658..9ea89b8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5872,6 +5872,13 @@ F: drivers/scsi/osd/
F: include/scsi/osd_*
F: fs/exofs/
+OVERLAYFS FILESYSTEM
+M: Miklos Szeredi <[email protected]>
+L: [email protected]
+S: Supported
+F: fs/overlayfs/*
+F: Documentation/filesystems/overlayfs.txt
+
P54 WIRELESS DRIVER
M: Christian Lamparter <[email protected]>
L: [email protected]
--
1.7.10.4
From: Miklos Szeredi <[email protected]>
Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree. All modifications
go to the upper, writable layer.
This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.
The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems. This
simplifies the implementation and allows native performance in these
cases.
The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS. This uses slightly more memory than union mounts, but dentries
are relatively small.
Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.
Opening non directories results in the open forwarded to the
underlying filesystem. This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).
Usage:
mount -t overlay -olowerdir=/lower,upperdir=/upper overlay /mnt
Supported:
- all operations
Missing:
- Currently a crash in the middle of copy-up, rename, unlink, rmdir or create
over a whiteout may result in filesystem corruption on the overlay level.
IOW these operations need to become atomic or at least the corruption needs
to be detected.
The following cotributions have been folded into this patch:
Neil Brown <[email protected]>:
- minimal remount support
- use correct seek function for directories
- initialise is_real before use
- rename ovl_fill_cache to ovl_dir_read
Felix Fietkau <[email protected]>:
- fix a deadlock in ovl_dir_read_merged
- fix a deadlock in ovl_remove_whiteouts
Erez Zadok <[email protected]>
- fix cleanup after WARN_ON
Sedat Dilek <[email protected]>
- fix up permission to confirm to new API
Also thanks to the following people for testing and reporting bugs:
Jordi Pujol <[email protected]>
Andy Whitcroft <[email protected]>
Michal Suchanek <[email protected]>
Felix Fietkau <[email protected]>
Erez Zadok <[email protected]>
Randy Dunlap <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/overlayfs/Kconfig | 4 +
fs/overlayfs/Makefile | 7 +
fs/overlayfs/copy_up.c | 385 +++++++++++++++++++++++++++++
fs/overlayfs/dir.c | 597 ++++++++++++++++++++++++++++++++++++++++++++
fs/overlayfs/inode.c | 379 ++++++++++++++++++++++++++++
fs/overlayfs/overlayfs.h | 64 +++++
fs/overlayfs/readdir.c | 566 ++++++++++++++++++++++++++++++++++++++++++
fs/overlayfs/super.c | 611 ++++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 2615 insertions(+)
create mode 100644 fs/overlayfs/Kconfig
create mode 100644 fs/overlayfs/Makefile
create mode 100644 fs/overlayfs/copy_up.c
create mode 100644 fs/overlayfs/dir.c
create mode 100644 fs/overlayfs/inode.c
create mode 100644 fs/overlayfs/overlayfs.h
create mode 100644 fs/overlayfs/readdir.c
create mode 100644 fs/overlayfs/super.c
diff --git a/fs/Kconfig b/fs/Kconfig
index 780725a..9e2ccd5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -67,6 +67,7 @@ source "fs/quota/Kconfig"
source "fs/autofs4/Kconfig"
source "fs/fuse/Kconfig"
+source "fs/overlayfs/Kconfig"
config GENERIC_ACL
bool
diff --git a/fs/Makefile b/fs/Makefile
index 9d53192..479a720 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -107,6 +107,7 @@ obj-$(CONFIG_QNX6FS_FS) += qnx6/
obj-$(CONFIG_AUTOFS4_FS) += autofs4/
obj-$(CONFIG_ADFS_FS) += adfs/
obj-$(CONFIG_FUSE_FS) += fuse/
+obj-$(CONFIG_OVERLAYFS_FS) += overlayfs/
obj-$(CONFIG_UDF_FS) += udf/
obj-$(CONFIG_SUN_OPENPROMFS) += openpromfs/
obj-$(CONFIG_OMFS_FS) += omfs/
diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
new file mode 100644
index 0000000..c4517da
--- /dev/null
+++ b/fs/overlayfs/Kconfig
@@ -0,0 +1,4 @@
+config OVERLAYFS_FS
+ tristate "Overlay filesystem support"
+ help
+ Add support for overlay filesystem.
diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
new file mode 100644
index 0000000..8f91889
--- /dev/null
+++ b/fs/overlayfs/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for the overlay filesystem.
+#
+
+obj-$(CONFIG_OVERLAYFS_FS) += overlayfs.o
+
+overlayfs-objs := super.o inode.o dir.o readdir.o copy_up.o
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
new file mode 100644
index 0000000..eef85e0
--- /dev/null
+++ b/fs/overlayfs/copy_up.c
@@ -0,0 +1,385 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/splice.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/uaccess.h>
+#include <linux/sched.h>
+#include "overlayfs.h"
+
+#define OVL_COPY_UP_CHUNK_SIZE (1 << 20)
+
+static int ovl_copy_up_xattr(struct dentry *old, struct dentry *new)
+{
+ ssize_t list_size, size;
+ char *buf, *name, *value;
+ int error;
+
+ if (!old->d_inode->i_op->getxattr ||
+ !new->d_inode->i_op->getxattr)
+ return 0;
+
+ list_size = vfs_listxattr(old, NULL, 0);
+ if (list_size <= 0) {
+ if (list_size == -EOPNOTSUPP)
+ return 0;
+ return list_size;
+ }
+
+ buf = kzalloc(list_size, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ error = -ENOMEM;
+ value = kmalloc(XATTR_SIZE_MAX, GFP_KERNEL);
+ if (!value)
+ goto out;
+
+ list_size = vfs_listxattr(old, buf, list_size);
+ if (list_size <= 0) {
+ error = list_size;
+ goto out_free_value;
+ }
+
+ for (name = buf; name < (buf + list_size); name += strlen(name) + 1) {
+ size = vfs_getxattr(old, name, value, XATTR_SIZE_MAX);
+ if (size <= 0) {
+ error = size;
+ goto out_free_value;
+ }
+ error = vfs_setxattr(new, name, value, size, 0);
+ if (error)
+ goto out_free_value;
+ }
+
+out_free_value:
+ kfree(value);
+out:
+ kfree(buf);
+ return error;
+}
+
+static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
+{
+ struct file *old_file;
+ struct file *new_file;
+ int error = 0;
+
+ if (len == 0)
+ return 0;
+
+ old_file = ovl_path_open(old, O_RDONLY);
+ if (IS_ERR(old_file))
+ return PTR_ERR(old_file);
+
+ new_file = ovl_path_open(new, O_WRONLY);
+ if (IS_ERR(new_file)) {
+ error = PTR_ERR(new_file);
+ goto out_fput;
+ }
+
+ /* FIXME: copy up sparse files efficiently */
+ while (len) {
+ loff_t offset = new_file->f_pos;
+ size_t this_len = OVL_COPY_UP_CHUNK_SIZE;
+ long bytes;
+
+ if (len < this_len)
+ this_len = len;
+
+ if (signal_pending_state(TASK_KILLABLE, current)) {
+ error = -EINTR;
+ break;
+ }
+
+ bytes = do_splice_direct(old_file, &offset, new_file, this_len,
+ SPLICE_F_MOVE);
+ if (bytes <= 0) {
+ error = bytes;
+ break;
+ }
+
+ len -= bytes;
+ }
+
+ fput(new_file);
+out_fput:
+ fput(old_file);
+ return error;
+}
+
+static char *ovl_read_symlink(struct dentry *realdentry)
+{
+ int res;
+ char *buf;
+ struct inode *inode = realdentry->d_inode;
+ mm_segment_t old_fs;
+
+ res = -EINVAL;
+ if (!inode->i_op->readlink)
+ goto err;
+
+ res = -ENOMEM;
+ buf = (char *) __get_free_page(GFP_KERNEL);
+ if (!buf)
+ goto err;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ /* The cast to a user pointer is valid due to the set_fs() */
+ res = inode->i_op->readlink(realdentry,
+ (char __user *)buf, PAGE_SIZE - 1);
+ set_fs(old_fs);
+ if (res < 0) {
+ free_page((unsigned long) buf);
+ goto err;
+ }
+ buf[res] = '\0';
+
+ return buf;
+
+err:
+ return ERR_PTR(res);
+}
+
+static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
+{
+ struct iattr attr = {
+ .ia_valid =
+ ATTR_ATIME | ATTR_MTIME | ATTR_ATIME_SET | ATTR_MTIME_SET,
+ .ia_atime = stat->atime,
+ .ia_mtime = stat->mtime,
+ };
+
+ return notify_change(upperdentry, &attr);
+}
+
+static int ovl_set_mode(struct dentry *upperdentry, umode_t mode)
+{
+ struct iattr attr = {
+ .ia_valid = ATTR_MODE,
+ .ia_mode = mode,
+ };
+
+ return notify_change(upperdentry, &attr);
+}
+
+static int ovl_copy_up_locked(struct dentry *upperdir, struct dentry *dentry,
+ struct path *lowerpath, struct kstat *stat,
+ const char *link)
+{
+ int err;
+ struct path newpath;
+ umode_t mode = stat->mode;
+
+ /* Can't properly set mode on creation because of the umask */
+ stat->mode &= S_IFMT;
+
+ ovl_path_upper(dentry, &newpath);
+ WARN_ON(newpath.dentry);
+ newpath.dentry = ovl_upper_create(upperdir, dentry, stat, link);
+ if (IS_ERR(newpath.dentry))
+ return PTR_ERR(newpath.dentry);
+
+ if (S_ISREG(stat->mode)) {
+ err = ovl_copy_up_data(lowerpath, &newpath, stat->size);
+ if (err)
+ goto err_remove;
+ }
+
+ err = ovl_copy_up_xattr(lowerpath->dentry, newpath.dentry);
+ if (err)
+ goto err_remove;
+
+ mutex_lock(&newpath.dentry->d_inode->i_mutex);
+ if (!S_ISLNK(stat->mode))
+ err = ovl_set_mode(newpath.dentry, mode);
+ if (!err)
+ err = ovl_set_timestamps(newpath.dentry, stat);
+ mutex_unlock(&newpath.dentry->d_inode->i_mutex);
+ if (err)
+ goto err_remove;
+
+ ovl_dentry_update(dentry, newpath.dentry);
+
+ /*
+ * Easiest way to get rid of the lower dentry reference is to
+ * drop this dentry. This is neither needed nor possible for
+ * directories.
+ */
+ if (!S_ISDIR(stat->mode))
+ d_drop(dentry);
+
+ return 0;
+
+err_remove:
+ if (S_ISDIR(stat->mode))
+ vfs_rmdir(upperdir->d_inode, newpath.dentry);
+ else
+ vfs_unlink(upperdir->d_inode, newpath.dentry);
+
+ dput(newpath.dentry);
+
+ return err;
+}
+
+/*
+ * Copy up a single dentry
+ *
+ * Directory renames only allowed on "pure upper" (already created on
+ * upper filesystem, never copied up). Directories which are on lower or
+ * are merged may not be renamed. For these -EXDEV is returned and
+ * userspace has to deal with it. This means, when copying up a
+ * directory we can rely on it and ancestors being stable.
+ *
+ * Non-directory renames start with copy up of source if necessary. The
+ * actual rename will only proceed once the copy up was successful. Copy
+ * up uses upper parent i_mutex for exclusion. Since rename can change
+ * d_parent it is possible that the copy up will lock the old parent. At
+ * that point the file will have already been copied up anyway.
+ */
+static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
+ struct path *lowerpath, struct kstat *stat)
+{
+ int err;
+ struct kstat pstat;
+ struct path parentpath;
+ struct dentry *upperdir;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+ char *link = NULL;
+
+ ovl_path_upper(parent, &parentpath);
+ upperdir = parentpath.dentry;
+
+ err = vfs_getattr(&parentpath, &pstat);
+ if (err)
+ return err;
+
+ if (S_ISLNK(stat->mode)) {
+ link = ovl_read_symlink(lowerpath->dentry);
+ if (IS_ERR(link))
+ return PTR_ERR(link);
+ }
+
+ err = -ENOMEM;
+ override_cred = prepare_creds();
+ if (!override_cred)
+ goto out_free_link;
+
+ override_cred->fsuid = stat->uid;
+ override_cred->fsgid = stat->gid;
+ /*
+ * CAP_SYS_ADMIN for copying up extended attributes
+ * CAP_DAC_OVERRIDE for create
+ * CAP_FOWNER for chmod, timestamp update
+ * CAP_FSETID for chmod
+ * CAP_MKNOD for mknod
+ */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+ cap_raise(override_cred->cap_effective, CAP_FOWNER);
+ cap_raise(override_cred->cap_effective, CAP_FSETID);
+ cap_raise(override_cred->cap_effective, CAP_MKNOD);
+ old_cred = override_creds(override_cred);
+
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+ if (ovl_path_type(dentry) != OVL_PATH_LOWER) {
+ err = 0;
+ } else {
+ err = ovl_copy_up_locked(upperdir, dentry, lowerpath,
+ stat, link);
+ if (!err) {
+ /* Restore timestamps on parent (best effort) */
+ ovl_set_timestamps(upperdir, &pstat);
+ }
+ }
+
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+out_free_link:
+ if (link)
+ free_page((unsigned long) link);
+
+ return err;
+}
+
+int ovl_copy_up(struct dentry *dentry)
+{
+ int err;
+
+ err = 0;
+ while (!err) {
+ struct dentry *next;
+ struct dentry *parent;
+ struct path lowerpath;
+ struct kstat stat;
+ enum ovl_path_type type = ovl_path_type(dentry);
+
+ if (type != OVL_PATH_LOWER)
+ break;
+
+ next = dget(dentry);
+ /* find the topmost dentry not yet copied up */
+ for (;;) {
+ parent = dget_parent(next);
+
+ type = ovl_path_type(parent);
+ if (type != OVL_PATH_LOWER)
+ break;
+
+ dput(next);
+ next = parent;
+ }
+
+ ovl_path_lower(next, &lowerpath);
+ err = vfs_getattr(&lowerpath, &stat);
+ if (!err)
+ err = ovl_copy_up_one(parent, next, &lowerpath, &stat);
+
+ dput(parent);
+ dput(next);
+ }
+
+ return err;
+}
+
+/* Optimize by not copying up the file first and truncating later */
+int ovl_copy_up_truncate(struct dentry *dentry, loff_t size)
+{
+ int err;
+ struct kstat stat;
+ struct path lowerpath;
+ struct dentry *parent = dget_parent(dentry);
+
+ err = ovl_copy_up(parent);
+ if (err)
+ goto out_dput_parent;
+
+ ovl_path_lower(dentry, &lowerpath);
+ err = vfs_getattr(&lowerpath, &stat);
+ if (err)
+ goto out_dput_parent;
+
+ if (size < stat.size)
+ stat.size = size;
+
+ err = ovl_copy_up_one(parent, dentry, &lowerpath, &stat);
+
+out_dput_parent:
+ dput(parent);
+ return err;
+}
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
new file mode 100644
index 0000000..6fc5aa5
--- /dev/null
+++ b/fs/overlayfs/dir.c
@@ -0,0 +1,597 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/cred.h>
+#include "overlayfs.h"
+
+static const char *ovl_whiteout_symlink = "(overlay-whiteout)";
+
+static int ovl_whiteout(struct dentry *upperdir, struct dentry *dentry)
+{
+ int err;
+ struct dentry *newdentry;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ /* FIXME: recheck lower dentry to see if whiteout is really needed */
+
+ err = -ENOMEM;
+ override_cred = prepare_creds();
+ if (!override_cred)
+ goto out;
+
+ /*
+ * CAP_SYS_ADMIN for setxattr
+ * CAP_DAC_OVERRIDE for symlink creation
+ * CAP_FOWNER for unlink in sticky directory
+ */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+ cap_raise(override_cred->cap_effective, CAP_FOWNER);
+ override_cred->fsuid = GLOBAL_ROOT_UID;
+ override_cred->fsgid = GLOBAL_ROOT_GID;
+ old_cred = override_creds(override_cred);
+
+ newdentry = lookup_one_len(dentry->d_name.name, upperdir,
+ dentry->d_name.len);
+ err = PTR_ERR(newdentry);
+ if (IS_ERR(newdentry))
+ goto out_put_cred;
+
+ /* Just been removed within the same locked region */
+ WARN_ON(newdentry->d_inode);
+
+ err = vfs_symlink(upperdir->d_inode, newdentry, ovl_whiteout_symlink);
+ if (err)
+ goto out_dput;
+
+ ovl_dentry_version_inc(dentry->d_parent);
+
+ err = vfs_setxattr(newdentry, ovl_whiteout_xattr, "y", 1, 0);
+ if (err)
+ vfs_unlink(upperdir->d_inode, newdentry);
+
+out_dput:
+ dput(newdentry);
+out_put_cred:
+ revert_creds(old_cred);
+ put_cred(override_cred);
+out:
+ if (err) {
+ /*
+ * There's no way to recover from failure to whiteout.
+ * What should we do? Log a big fat error and... ?
+ */
+ printk(KERN_ERR "overlayfs: ERROR - failed to whiteout '%s'\n",
+ dentry->d_name.name);
+ }
+
+ return err;
+}
+
+static struct dentry *ovl_lookup_create(struct dentry *upperdir,
+ struct dentry *template)
+{
+ int err;
+ struct dentry *newdentry;
+ struct qstr *name = &template->d_name;
+
+ newdentry = lookup_one_len(name->name, upperdir, name->len);
+ if (IS_ERR(newdentry))
+ return newdentry;
+
+ if (newdentry->d_inode) {
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ /* No need to check whiteout if lower parent is non-existent */
+ err = -EEXIST;
+ if (!ovl_dentry_lower(template->d_parent))
+ goto out_dput;
+
+ if (!S_ISLNK(newdentry->d_inode->i_mode))
+ goto out_dput;
+
+ err = -ENOMEM;
+ override_cred = prepare_creds();
+ if (!override_cred)
+ goto out_dput;
+
+ /*
+ * CAP_SYS_ADMIN for getxattr
+ * CAP_FOWNER for unlink in sticky directory
+ */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_FOWNER);
+ old_cred = override_creds(override_cred);
+
+ err = -EEXIST;
+ if (ovl_is_whiteout(newdentry))
+ err = vfs_unlink(upperdir->d_inode, newdentry);
+
+ revert_creds(old_cred);
+ put_cred(override_cred);
+ if (err)
+ goto out_dput;
+
+ dput(newdentry);
+ newdentry = lookup_one_len(name->name, upperdir, name->len);
+ if (IS_ERR(newdentry)) {
+ ovl_whiteout(upperdir, template);
+ return newdentry;
+ }
+
+ /*
+ * Whiteout just been successfully removed, parent
+ * i_mutex is still held, there's no way the lookup
+ * could return positive.
+ */
+ WARN_ON(newdentry->d_inode);
+ }
+
+ return newdentry;
+
+out_dput:
+ dput(newdentry);
+ return ERR_PTR(err);
+}
+
+struct dentry *ovl_upper_create(struct dentry *upperdir, struct dentry *dentry,
+ struct kstat *stat, const char *link)
+{
+ int err;
+ struct dentry *newdentry;
+ struct inode *dir = upperdir->d_inode;
+
+ newdentry = ovl_lookup_create(upperdir, dentry);
+ if (IS_ERR(newdentry))
+ goto out;
+
+ switch (stat->mode & S_IFMT) {
+ case S_IFREG:
+ err = vfs_create(dir, newdentry, stat->mode, NULL);
+ break;
+
+ case S_IFDIR:
+ err = vfs_mkdir(dir, newdentry, stat->mode);
+ break;
+
+ case S_IFCHR:
+ case S_IFBLK:
+ case S_IFIFO:
+ case S_IFSOCK:
+ err = vfs_mknod(dir, newdentry, stat->mode, stat->rdev);
+ break;
+
+ case S_IFLNK:
+ err = vfs_symlink(dir, newdentry, link);
+ break;
+
+ default:
+ err = -EPERM;
+ }
+ if (err) {
+ if (ovl_dentry_is_opaque(dentry))
+ ovl_whiteout(upperdir, dentry);
+ dput(newdentry);
+ newdentry = ERR_PTR(err);
+ } else if (WARN_ON(!newdentry->d_inode)) {
+ /*
+ * Not quite sure if non-instantiated dentry is legal or not.
+ * VFS doesn't seem to care so check and warn here.
+ */
+ dput(newdentry);
+ newdentry = ERR_PTR(-ENOENT);
+ }
+
+out:
+ return newdentry;
+
+}
+
+static int ovl_set_opaque(struct dentry *upperdentry)
+{
+ int err;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ override_cred = prepare_creds();
+ if (!override_cred)
+ return -ENOMEM;
+
+ /* CAP_SYS_ADMIN for setxattr of "trusted" namespace */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ old_cred = override_creds(override_cred);
+ err = vfs_setxattr(upperdentry, ovl_opaque_xattr, "y", 1, 0);
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+ return err;
+}
+
+static int ovl_remove_opaque(struct dentry *upperdentry)
+{
+ int err;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ override_cred = prepare_creds();
+ if (!override_cred)
+ return -ENOMEM;
+
+ /* CAP_SYS_ADMIN for removexattr of "trusted" namespace */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ old_cred = override_creds(override_cred);
+ err = vfs_removexattr(upperdentry, ovl_opaque_xattr);
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+ return err;
+}
+
+static int ovl_dir_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat)
+{
+ int err;
+ enum ovl_path_type type;
+ struct path realpath;
+
+ type = ovl_path_real(dentry, &realpath);
+ err = vfs_getattr(&realpath, stat);
+ if (err)
+ return err;
+
+ stat->dev = dentry->d_sb->s_dev;
+ stat->ino = dentry->d_inode->i_ino;
+
+ /*
+ * It's probably not worth it to count subdirs to get the
+ * correct link count. nlink=1 seems to pacify 'find' and
+ * other utilities.
+ */
+ if (type == OVL_PATH_MERGE)
+ stat->nlink = 1;
+
+ return 0;
+}
+
+static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
+ const char *link)
+{
+ int err;
+ struct dentry *newdentry;
+ struct dentry *upperdir;
+ struct inode *inode;
+ struct kstat stat = {
+ .mode = mode,
+ .rdev = rdev,
+ };
+
+ err = -ENOMEM;
+ inode = ovl_new_inode(dentry->d_sb, mode, dentry->d_fsdata);
+ if (!inode)
+ goto out;
+
+ err = ovl_copy_up(dentry->d_parent);
+ if (err)
+ goto out_iput;
+
+ upperdir = ovl_dentry_upper(dentry->d_parent);
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+
+ newdentry = ovl_upper_create(upperdir, dentry, &stat, link);
+ err = PTR_ERR(newdentry);
+ if (IS_ERR(newdentry))
+ goto out_unlock;
+
+ ovl_dentry_version_inc(dentry->d_parent);
+ if (ovl_dentry_is_opaque(dentry) && S_ISDIR(mode)) {
+ err = ovl_set_opaque(newdentry);
+ if (err) {
+ vfs_rmdir(upperdir->d_inode, newdentry);
+ ovl_whiteout(upperdir, dentry);
+ goto out_dput;
+ }
+ }
+ ovl_dentry_update(dentry, newdentry);
+ d_instantiate(dentry, inode);
+ inode = NULL;
+ newdentry = NULL;
+ err = 0;
+
+out_dput:
+ dput(newdentry);
+out_unlock:
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+out_iput:
+ iput(inode);
+out:
+ return err;
+}
+
+static int ovl_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+ bool excl)
+{
+ return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL);
+}
+
+static int ovl_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+ return ovl_create_object(dentry, (mode & 07777) | S_IFDIR, 0, NULL);
+}
+
+static int ovl_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+ dev_t rdev)
+{
+ return ovl_create_object(dentry, mode, rdev, NULL);
+}
+
+static int ovl_symlink(struct inode *dir, struct dentry *dentry,
+ const char *link)
+{
+ return ovl_create_object(dentry, S_IFLNK, 0, link);
+}
+
+static int ovl_do_remove(struct dentry *dentry, bool is_dir)
+{
+ int err;
+ enum ovl_path_type type;
+ struct path realpath;
+ struct dentry *upperdir;
+
+ err = ovl_copy_up(dentry->d_parent);
+ if (err)
+ return err;
+
+ upperdir = ovl_dentry_upper(dentry->d_parent);
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+ type = ovl_path_real(dentry, &realpath);
+ if (type != OVL_PATH_LOWER) {
+ err = -ESTALE;
+ if (realpath.dentry->d_parent != upperdir)
+ goto out_d_drop;
+
+ /* FIXME: create whiteout up front and rename to target */
+
+ if (is_dir)
+ err = vfs_rmdir(upperdir->d_inode, realpath.dentry);
+ else
+ err = vfs_unlink(upperdir->d_inode, realpath.dentry);
+ if (err)
+ goto out_d_drop;
+
+ ovl_dentry_version_inc(dentry->d_parent);
+ }
+
+ if (type != OVL_PATH_UPPER || ovl_dentry_is_opaque(dentry))
+ err = ovl_whiteout(upperdir, dentry);
+
+ /*
+ * Keeping this dentry hashed would mean having to release
+ * upperpath/lowerpath, which could only be done if we are the
+ * sole user of this dentry. Too tricky... Just unhash for
+ * now.
+ */
+out_d_drop:
+ d_drop(dentry);
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+
+ return err;
+}
+
+static int ovl_unlink(struct inode *dir, struct dentry *dentry)
+{
+ return ovl_do_remove(dentry, false);
+}
+
+
+static int ovl_rmdir(struct inode *dir, struct dentry *dentry)
+{
+ int err;
+ enum ovl_path_type type;
+
+ type = ovl_path_type(dentry);
+ if (type != OVL_PATH_UPPER) {
+ err = ovl_check_empty_and_clear(dentry, type);
+ if (err)
+ return err;
+ }
+
+ return ovl_do_remove(dentry, true);
+}
+
+static int ovl_link(struct dentry *old, struct inode *newdir,
+ struct dentry *new)
+{
+ int err;
+ struct dentry *olddentry;
+ struct dentry *newdentry;
+ struct dentry *upperdir;
+
+ err = ovl_copy_up(old);
+ if (err)
+ goto out;
+
+ err = ovl_copy_up(new->d_parent);
+ if (err)
+ goto out;
+
+ upperdir = ovl_dentry_upper(new->d_parent);
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+ newdentry = ovl_lookup_create(upperdir, new);
+ err = PTR_ERR(newdentry);
+ if (IS_ERR(newdentry))
+ goto out_unlock;
+
+ olddentry = ovl_dentry_upper(old);
+ err = vfs_link(olddentry, upperdir->d_inode, newdentry);
+ if (!err) {
+ if (WARN_ON(!newdentry->d_inode)) {
+ dput(newdentry);
+ err = -ENOENT;
+ goto out_unlock;
+ }
+
+ ovl_dentry_version_inc(new->d_parent);
+ ovl_dentry_update(new, newdentry);
+
+ ihold(old->d_inode);
+ d_instantiate(new, old->d_inode);
+ } else {
+ if (ovl_dentry_is_opaque(new))
+ ovl_whiteout(upperdir, new);
+ dput(newdentry);
+ }
+out_unlock:
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+out:
+ return err;
+
+}
+
+static int ovl_rename(struct inode *olddir, struct dentry *old,
+ struct inode *newdir, struct dentry *new)
+{
+ int err;
+ enum ovl_path_type old_type;
+ enum ovl_path_type new_type;
+ struct dentry *old_upperdir;
+ struct dentry *new_upperdir;
+ struct dentry *olddentry;
+ struct dentry *newdentry;
+ struct dentry *trap;
+ bool old_opaque;
+ bool new_opaque;
+ bool new_create = false;
+ bool is_dir = S_ISDIR(old->d_inode->i_mode);
+
+ /* Don't copy up directory trees */
+ old_type = ovl_path_type(old);
+ if (old_type != OVL_PATH_UPPER && is_dir)
+ return -EXDEV;
+
+ if (new->d_inode) {
+ new_type = ovl_path_type(new);
+
+ if (new_type == OVL_PATH_LOWER && old_type == OVL_PATH_LOWER) {
+ if (ovl_dentry_lower(old)->d_inode ==
+ ovl_dentry_lower(new)->d_inode)
+ return 0;
+ }
+ if (new_type != OVL_PATH_LOWER && old_type != OVL_PATH_LOWER) {
+ if (ovl_dentry_upper(old)->d_inode ==
+ ovl_dentry_upper(new)->d_inode)
+ return 0;
+ }
+
+ if (new_type != OVL_PATH_UPPER &&
+ S_ISDIR(new->d_inode->i_mode)) {
+ err = ovl_check_empty_and_clear(new, new_type);
+ if (err)
+ return err;
+ }
+ } else {
+ new_type = OVL_PATH_UPPER;
+ }
+
+ err = ovl_copy_up(old);
+ if (err)
+ return err;
+
+ err = ovl_copy_up(new->d_parent);
+ if (err)
+ return err;
+
+ old_upperdir = ovl_dentry_upper(old->d_parent);
+ new_upperdir = ovl_dentry_upper(new->d_parent);
+
+ trap = lock_rename(new_upperdir, old_upperdir);
+
+ olddentry = ovl_dentry_upper(old);
+ newdentry = ovl_dentry_upper(new);
+ if (newdentry) {
+ dget(newdentry);
+ } else {
+ new_create = true;
+ newdentry = ovl_lookup_create(new_upperdir, new);
+ err = PTR_ERR(newdentry);
+ if (IS_ERR(newdentry))
+ goto out_unlock;
+ }
+
+ err = -ESTALE;
+ if (olddentry->d_parent != old_upperdir)
+ goto out_dput;
+ if (newdentry->d_parent != new_upperdir)
+ goto out_dput;
+ if (olddentry == trap)
+ goto out_dput;
+ if (newdentry == trap)
+ goto out_dput;
+
+ old_opaque = ovl_dentry_is_opaque(old);
+ new_opaque = ovl_dentry_is_opaque(new) || new_type != OVL_PATH_UPPER;
+
+ if (is_dir && !old_opaque && new_opaque) {
+ err = ovl_set_opaque(olddentry);
+ if (err)
+ goto out_dput;
+ }
+
+ err = vfs_rename(old_upperdir->d_inode, olddentry,
+ new_upperdir->d_inode, newdentry);
+
+ if (err) {
+ if (new_create && ovl_dentry_is_opaque(new))
+ ovl_whiteout(new_upperdir, new);
+ if (is_dir && !old_opaque && new_opaque)
+ ovl_remove_opaque(olddentry);
+ goto out_dput;
+ }
+
+ if (old_type != OVL_PATH_UPPER || old_opaque)
+ err = ovl_whiteout(old_upperdir, old);
+ if (is_dir && old_opaque && !new_opaque)
+ ovl_remove_opaque(olddentry);
+
+ if (old_opaque != new_opaque)
+ ovl_dentry_set_opaque(old, new_opaque);
+
+ ovl_dentry_version_inc(old->d_parent);
+ ovl_dentry_version_inc(new->d_parent);
+
+out_dput:
+ dput(newdentry);
+out_unlock:
+ unlock_rename(new_upperdir, old_upperdir);
+ return err;
+}
+
+const struct inode_operations ovl_dir_inode_operations = {
+ .lookup = ovl_lookup,
+ .mkdir = ovl_mkdir,
+ .symlink = ovl_symlink,
+ .unlink = ovl_unlink,
+ .rmdir = ovl_rmdir,
+ .rename = ovl_rename,
+ .link = ovl_link,
+ .setattr = ovl_setattr,
+ .create = ovl_create,
+ .mknod = ovl_mknod,
+ .permission = ovl_permission,
+ .getattr = ovl_dir_getattr,
+ .setxattr = ovl_setxattr,
+ .getxattr = ovl_getxattr,
+ .listxattr = ovl_listxattr,
+ .removexattr = ovl_removexattr,
+};
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
new file mode 100644
index 0000000..0dfa8a4
--- /dev/null
+++ b/fs/overlayfs/inode.c
@@ -0,0 +1,379 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/xattr.h>
+#include "overlayfs.h"
+
+int ovl_setattr(struct dentry *dentry, struct iattr *attr)
+{
+ struct dentry *upperdentry;
+ int err;
+
+ if ((attr->ia_valid & ATTR_SIZE) && !ovl_dentry_upper(dentry))
+ err = ovl_copy_up_truncate(dentry, attr->ia_size);
+ else
+ err = ovl_copy_up(dentry);
+ if (err)
+ return err;
+
+ upperdentry = ovl_dentry_upper(dentry);
+
+ if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID))
+ attr->ia_valid &= ~ATTR_MODE;
+
+ mutex_lock(&upperdentry->d_inode->i_mutex);
+ err = notify_change(upperdentry, attr);
+ mutex_unlock(&upperdentry->d_inode->i_mutex);
+
+ return err;
+}
+
+static int ovl_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat)
+{
+ struct path realpath;
+
+ ovl_path_real(dentry, &realpath);
+ return vfs_getattr(&realpath, stat);
+}
+
+int ovl_permission(struct inode *inode, int mask)
+{
+ struct ovl_entry *oe;
+ struct dentry *alias = NULL;
+ struct inode *realinode;
+ struct dentry *realdentry;
+ bool is_upper;
+ int err;
+
+ if (S_ISDIR(inode->i_mode)) {
+ oe = inode->i_private;
+ } else if (mask & MAY_NOT_BLOCK) {
+ return -ECHILD;
+ } else {
+ /*
+ * For non-directories find an alias and get the info
+ * from there.
+ */
+ alias = d_find_any_alias(inode);
+ if (WARN_ON(!alias))
+ return -ENOENT;
+
+ oe = alias->d_fsdata;
+ }
+
+ realdentry = ovl_entry_real(oe, &is_upper);
+
+ /* Careful in RCU walk mode */
+ realinode = ACCESS_ONCE(realdentry->d_inode);
+ if (!realinode) {
+ WARN_ON(!(mask & MAY_NOT_BLOCK));
+ err = -ENOENT;
+ goto out_dput;
+ }
+
+ if (mask & MAY_WRITE) {
+ umode_t mode = realinode->i_mode;
+
+ /*
+ * Writes will always be redirected to upper layer, so
+ * ignore lower layer being read-only.
+ *
+ * If the overlay itself is read-only then proceed
+ * with the permission check, don't return EROFS.
+ * This will only happen if this is the lower layer of
+ * another overlayfs.
+ *
+ * If upper fs becomes read-only after the overlay was
+ * constructed return EROFS to prevent modification of
+ * upper layer.
+ */
+ err = -EROFS;
+ if (is_upper && !IS_RDONLY(inode) && IS_RDONLY(realinode) &&
+ (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
+ goto out_dput;
+
+ /*
+ * Nobody gets write access to an immutable file.
+ */
+ err = -EACCES;
+ if (IS_IMMUTABLE(realinode))
+ goto out_dput;
+ }
+
+ if (realinode->i_op->permission)
+ err = realinode->i_op->permission(realinode, mask);
+ else
+ err = generic_permission(realinode, mask);
+out_dput:
+ dput(alias);
+ return err;
+}
+
+
+struct ovl_link_data {
+ struct dentry *realdentry;
+ void *cookie;
+};
+
+static void *ovl_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+ void *ret;
+ struct dentry *realdentry;
+ struct inode *realinode;
+
+ realdentry = ovl_dentry_real(dentry);
+ realinode = realdentry->d_inode;
+
+ if (WARN_ON(!realinode->i_op->follow_link))
+ return ERR_PTR(-EPERM);
+
+ ret = realinode->i_op->follow_link(realdentry, nd);
+ if (IS_ERR(ret))
+ return ret;
+
+ if (realinode->i_op->put_link) {
+ struct ovl_link_data *data;
+
+ data = kmalloc(sizeof(struct ovl_link_data), GFP_KERNEL);
+ if (!data) {
+ realinode->i_op->put_link(realdentry, nd, ret);
+ return ERR_PTR(-ENOMEM);
+ }
+ data->realdentry = realdentry;
+ data->cookie = ret;
+
+ return data;
+ } else {
+ return NULL;
+ }
+}
+
+static void ovl_put_link(struct dentry *dentry, struct nameidata *nd, void *c)
+{
+ struct inode *realinode;
+ struct ovl_link_data *data = c;
+
+ if (!data)
+ return;
+
+ realinode = data->realdentry->d_inode;
+ realinode->i_op->put_link(data->realdentry, nd, data->cookie);
+ kfree(data);
+}
+
+static int ovl_readlink(struct dentry *dentry, char __user *buf, int bufsiz)
+{
+ struct path realpath;
+ struct inode *realinode;
+
+ ovl_path_real(dentry, &realpath);
+ realinode = realpath.dentry->d_inode;
+
+ if (!realinode->i_op->readlink)
+ return -EINVAL;
+
+ touch_atime(&realpath);
+
+ return realinode->i_op->readlink(realpath.dentry, buf, bufsiz);
+}
+
+
+static bool ovl_is_private_xattr(const char *name)
+{
+ return strncmp(name, "trusted.overlay.", 14) == 0;
+}
+
+int ovl_setxattr(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags)
+{
+ int err;
+ struct dentry *upperdentry;
+
+ if (ovl_is_private_xattr(name))
+ return -EPERM;
+
+ err = ovl_copy_up(dentry);
+ if (err)
+ return err;
+
+ upperdentry = ovl_dentry_upper(dentry);
+ return vfs_setxattr(upperdentry, name, value, size, flags);
+}
+
+ssize_t ovl_getxattr(struct dentry *dentry, const char *name,
+ void *value, size_t size)
+{
+ if (ovl_path_type(dentry->d_parent) == OVL_PATH_MERGE &&
+ ovl_is_private_xattr(name))
+ return -ENODATA;
+
+ return vfs_getxattr(ovl_dentry_real(dentry), name, value, size);
+}
+
+ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size)
+{
+ ssize_t res;
+ int off;
+
+ res = vfs_listxattr(ovl_dentry_real(dentry), list, size);
+ if (res <= 0 || size == 0)
+ return res;
+
+ if (ovl_path_type(dentry->d_parent) != OVL_PATH_MERGE)
+ return res;
+
+ /* filter out private xattrs */
+ for (off = 0; off < res;) {
+ char *s = list + off;
+ size_t slen = strlen(s) + 1;
+
+ BUG_ON(off + slen > res);
+
+ if (ovl_is_private_xattr(s)) {
+ res -= slen;
+ memmove(s, s + slen, res - off);
+ } else {
+ off += slen;
+ }
+ }
+
+ return res;
+}
+
+int ovl_removexattr(struct dentry *dentry, const char *name)
+{
+ int err;
+ struct path realpath;
+ enum ovl_path_type type;
+
+ if (ovl_path_type(dentry->d_parent) == OVL_PATH_MERGE &&
+ ovl_is_private_xattr(name))
+ return -ENODATA;
+
+ type = ovl_path_real(dentry, &realpath);
+ if (type == OVL_PATH_LOWER) {
+ err = vfs_getxattr(realpath.dentry, name, NULL, 0);
+ if (err < 0)
+ return err;
+
+ err = ovl_copy_up(dentry);
+ if (err)
+ return err;
+
+ ovl_path_upper(dentry, &realpath);
+ }
+
+ return vfs_removexattr(realpath.dentry, name);
+}
+
+static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
+ struct dentry *realdentry)
+{
+ if (type != OVL_PATH_LOWER)
+ return false;
+
+ if (special_file(realdentry->d_inode->i_mode))
+ return false;
+
+ if (!(OPEN_FMODE(flags) & FMODE_WRITE) && !(flags & O_TRUNC))
+ return false;
+
+ return true;
+}
+
+static int ovl_dentry_open(struct dentry *dentry, struct file *file,
+ const struct cred *cred)
+{
+ int err;
+ struct path realpath;
+ enum ovl_path_type type;
+
+ type = ovl_path_real(dentry, &realpath);
+ if (ovl_open_need_copy_up(file->f_flags, type, realpath.dentry)) {
+ if (file->f_flags & O_TRUNC)
+ err = ovl_copy_up_truncate(dentry, 0);
+ else
+ err = ovl_copy_up(dentry);
+ if (err)
+ return err;
+
+ ovl_path_upper(dentry, &realpath);
+ }
+
+ return vfs_open(&realpath, file, cred);
+}
+
+static const struct inode_operations ovl_file_inode_operations = {
+ .setattr = ovl_setattr,
+ .permission = ovl_permission,
+ .getattr = ovl_getattr,
+ .setxattr = ovl_setxattr,
+ .getxattr = ovl_getxattr,
+ .listxattr = ovl_listxattr,
+ .removexattr = ovl_removexattr,
+ .dentry_open = ovl_dentry_open,
+};
+
+static const struct inode_operations ovl_symlink_inode_operations = {
+ .setattr = ovl_setattr,
+ .follow_link = ovl_follow_link,
+ .put_link = ovl_put_link,
+ .readlink = ovl_readlink,
+ .getattr = ovl_getattr,
+ .setxattr = ovl_setxattr,
+ .getxattr = ovl_getxattr,
+ .listxattr = ovl_listxattr,
+ .removexattr = ovl_removexattr,
+};
+
+struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
+ struct ovl_entry *oe)
+{
+ struct inode *inode;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return NULL;
+
+ mode &= S_IFMT;
+
+ inode->i_ino = get_next_ino();
+ inode->i_mode = mode;
+ inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+ switch (mode) {
+ case S_IFDIR:
+ inode->i_private = oe;
+ inode->i_op = &ovl_dir_inode_operations;
+ inode->i_fop = &ovl_dir_operations;
+ break;
+
+ case S_IFLNK:
+ inode->i_op = &ovl_symlink_inode_operations;
+ break;
+
+ case S_IFREG:
+ case S_IFSOCK:
+ case S_IFBLK:
+ case S_IFCHR:
+ case S_IFIFO:
+ inode->i_op = &ovl_file_inode_operations;
+ break;
+
+ default:
+ WARN(1, "illegal file type: %i\n", mode);
+ inode = NULL;
+ }
+
+ return inode;
+
+}
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
new file mode 100644
index 0000000..fe1241d
--- /dev/null
+++ b/fs/overlayfs/overlayfs.h
@@ -0,0 +1,64 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+struct ovl_entry;
+
+enum ovl_path_type {
+ OVL_PATH_UPPER,
+ OVL_PATH_MERGE,
+ OVL_PATH_LOWER,
+};
+
+extern const char *ovl_opaque_xattr;
+extern const char *ovl_whiteout_xattr;
+extern const struct dentry_operations ovl_dentry_operations;
+
+enum ovl_path_type ovl_path_type(struct dentry *dentry);
+u64 ovl_dentry_version_get(struct dentry *dentry);
+void ovl_dentry_version_inc(struct dentry *dentry);
+void ovl_path_upper(struct dentry *dentry, struct path *path);
+void ovl_path_lower(struct dentry *dentry, struct path *path);
+enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path);
+struct dentry *ovl_dentry_upper(struct dentry *dentry);
+struct dentry *ovl_dentry_lower(struct dentry *dentry);
+struct dentry *ovl_dentry_real(struct dentry *dentry);
+struct dentry *ovl_entry_real(struct ovl_entry *oe, bool *is_upper);
+bool ovl_dentry_is_opaque(struct dentry *dentry);
+void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque);
+bool ovl_is_whiteout(struct dentry *dentry);
+void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry);
+struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags);
+struct file *ovl_path_open(struct path *path, int flags);
+
+struct dentry *ovl_upper_create(struct dentry *upperdir, struct dentry *dentry,
+ struct kstat *stat, const char *link);
+
+/* readdir.c */
+extern const struct file_operations ovl_dir_operations;
+int ovl_check_empty_and_clear(struct dentry *dentry, enum ovl_path_type type);
+
+/* inode.c */
+int ovl_setattr(struct dentry *dentry, struct iattr *attr);
+int ovl_permission(struct inode *inode, int mask);
+int ovl_setxattr(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags);
+ssize_t ovl_getxattr(struct dentry *dentry, const char *name,
+ void *value, size_t size);
+ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
+int ovl_removexattr(struct dentry *dentry, const char *name);
+
+struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
+ struct ovl_entry *oe);
+/* dir.c */
+extern const struct inode_operations ovl_dir_inode_operations;
+
+/* copy_up.c */
+int ovl_copy_up(struct dentry *dentry);
+int ovl_copy_up_truncate(struct dentry *dentry, loff_t size);
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
new file mode 100644
index 0000000..0797efb
--- /dev/null
+++ b/fs/overlayfs/readdir.c
@@ -0,0 +1,566 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/xattr.h>
+#include <linux/rbtree.h>
+#include <linux/security.h>
+#include <linux/cred.h>
+#include "overlayfs.h"
+
+struct ovl_cache_entry {
+ const char *name;
+ unsigned int len;
+ unsigned int type;
+ u64 ino;
+ bool is_whiteout;
+ struct list_head l_node;
+ struct rb_node node;
+};
+
+struct ovl_readdir_data {
+ struct rb_root *root;
+ struct list_head *list;
+ struct list_head *middle;
+ struct dentry *dir;
+ int count;
+ int err;
+};
+
+struct ovl_dir_file {
+ bool is_real;
+ bool is_cached;
+ struct list_head cursor;
+ u64 cache_version;
+ struct list_head cache;
+ struct file *realfile;
+};
+
+static struct ovl_cache_entry *ovl_cache_entry_from_node(struct rb_node *n)
+{
+ return container_of(n, struct ovl_cache_entry, node);
+}
+
+static struct ovl_cache_entry *ovl_cache_entry_find(struct rb_root *root,
+ const char *name, int len)
+{
+ struct rb_node *node = root->rb_node;
+ int cmp;
+
+ while (node) {
+ struct ovl_cache_entry *p = ovl_cache_entry_from_node(node);
+
+ cmp = strncmp(name, p->name, len);
+ if (cmp > 0)
+ node = p->node.rb_right;
+ else if (cmp < 0 || len < p->len)
+ node = p->node.rb_left;
+ else
+ return p;
+ }
+
+ return NULL;
+}
+
+static struct ovl_cache_entry *ovl_cache_entry_new(const char *name, int len,
+ u64 ino, unsigned int d_type)
+{
+ struct ovl_cache_entry *p;
+
+ p = kmalloc(sizeof(*p) + len + 1, GFP_KERNEL);
+ if (p) {
+ char *name_copy = (char *) (p + 1);
+ memcpy(name_copy, name, len);
+ name_copy[len] = '\0';
+ p->name = name_copy;
+ p->len = len;
+ p->type = d_type;
+ p->ino = ino;
+ p->is_whiteout = false;
+ }
+
+ return p;
+}
+
+static int ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
+ const char *name, int len, u64 ino,
+ unsigned int d_type)
+{
+ struct rb_node **newp = &rdd->root->rb_node;
+ struct rb_node *parent = NULL;
+ struct ovl_cache_entry *p;
+
+ while (*newp) {
+ int cmp;
+ struct ovl_cache_entry *tmp;
+
+ parent = *newp;
+ tmp = ovl_cache_entry_from_node(*newp);
+ cmp = strncmp(name, tmp->name, len);
+ if (cmp > 0)
+ newp = &tmp->node.rb_right;
+ else if (cmp < 0 || len < tmp->len)
+ newp = &tmp->node.rb_left;
+ else
+ return 0;
+ }
+
+ p = ovl_cache_entry_new(name, len, ino, d_type);
+ if (p == NULL)
+ return -ENOMEM;
+
+ list_add_tail(&p->l_node, rdd->list);
+ rb_link_node(&p->node, parent, newp);
+ rb_insert_color(&p->node, rdd->root);
+
+ return 0;
+}
+
+static int ovl_fill_lower(void *buf, const char *name, int namelen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct ovl_readdir_data *rdd = buf;
+ struct ovl_cache_entry *p;
+
+ rdd->count++;
+ p = ovl_cache_entry_find(rdd->root, name, namelen);
+ if (p) {
+ list_move_tail(&p->l_node, rdd->middle);
+ } else {
+ p = ovl_cache_entry_new(name, namelen, ino, d_type);
+ if (p == NULL)
+ rdd->err = -ENOMEM;
+ else
+ list_add_tail(&p->l_node, rdd->middle);
+ }
+
+ return rdd->err;
+}
+
+static void ovl_cache_free(struct list_head *list)
+{
+ struct ovl_cache_entry *p;
+ struct ovl_cache_entry *n;
+
+ list_for_each_entry_safe(p, n, list, l_node)
+ kfree(p);
+
+ INIT_LIST_HEAD(list);
+}
+
+static int ovl_fill_upper(void *buf, const char *name, int namelen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct ovl_readdir_data *rdd = buf;
+
+ rdd->count++;
+ return ovl_cache_entry_add_rb(rdd, name, namelen, ino, d_type);
+}
+
+static inline int ovl_dir_read(struct path *realpath,
+ struct ovl_readdir_data *rdd, filldir_t filler)
+{
+ struct file *realfile;
+ int err;
+
+ realfile = ovl_path_open(realpath, O_RDONLY | O_DIRECTORY);
+ if (IS_ERR(realfile))
+ return PTR_ERR(realfile);
+
+ do {
+ rdd->count = 0;
+ rdd->err = 0;
+ err = vfs_readdir(realfile, filler, rdd);
+ if (err >= 0)
+ err = rdd->err;
+ } while (!err && rdd->count);
+ fput(realfile);
+
+ return 0;
+}
+
+static void ovl_dir_reset(struct file *file)
+{
+ struct ovl_dir_file *od = file->private_data;
+ enum ovl_path_type type = ovl_path_type(file->f_path.dentry);
+
+ if (ovl_dentry_version_get(file->f_path.dentry) != od->cache_version) {
+ list_del_init(&od->cursor);
+ ovl_cache_free(&od->cache);
+ od->is_cached = false;
+ }
+ WARN_ON(!od->is_real && type != OVL_PATH_MERGE);
+ if (od->is_real && type == OVL_PATH_MERGE) {
+ fput(od->realfile);
+ od->realfile = NULL;
+ od->is_real = false;
+ }
+}
+
+static int ovl_dir_mark_whiteouts(struct ovl_readdir_data *rdd)
+{
+ struct ovl_cache_entry *p;
+ struct dentry *dentry;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ override_cred = prepare_creds();
+ if (!override_cred) {
+ ovl_cache_free(rdd->list);
+ return -ENOMEM;
+ }
+
+ /*
+ * CAP_SYS_ADMIN for getxattr
+ * CAP_DAC_OVERRIDE for lookup
+ */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+ old_cred = override_creds(override_cred);
+
+ mutex_lock(&rdd->dir->d_inode->i_mutex);
+ list_for_each_entry(p, rdd->list, l_node) {
+ if (p->type != DT_LNK)
+ continue;
+
+ dentry = lookup_one_len(p->name, rdd->dir, p->len);
+ if (IS_ERR(dentry))
+ continue;
+
+ p->is_whiteout = ovl_is_whiteout(dentry);
+ dput(dentry);
+ }
+ mutex_unlock(&rdd->dir->d_inode->i_mutex);
+
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+ return 0;
+}
+
+static inline int ovl_dir_read_merged(struct path *upperpath,
+ struct path *lowerpath,
+ struct ovl_readdir_data *rdd)
+{
+ int err;
+ struct rb_root root = RB_ROOT;
+ struct list_head middle;
+
+ rdd->root = &root;
+ if (upperpath->dentry) {
+ rdd->dir = upperpath->dentry;
+ err = ovl_dir_read(upperpath, rdd, ovl_fill_upper);
+ if (err)
+ goto out;
+
+ err = ovl_dir_mark_whiteouts(rdd);
+ if (err)
+ goto out;
+ }
+ /*
+ * Insert lowerpath entries before upperpath ones, this allows
+ * offsets to be reasonably constant
+ */
+ list_add(&middle, rdd->list);
+ rdd->middle = &middle;
+ err = ovl_dir_read(lowerpath, rdd, ovl_fill_lower);
+ list_del(&middle);
+out:
+ rdd->root = NULL;
+
+ return err;
+}
+
+static void ovl_seek_cursor(struct ovl_dir_file *od, loff_t pos)
+{
+ struct list_head *l;
+ loff_t off;
+
+ l = od->cache.next;
+ for (off = 0; off < pos; off++) {
+ if (l == &od->cache)
+ break;
+ l = l->next;
+ }
+ list_move_tail(&od->cursor, l);
+}
+
+static int ovl_readdir(struct file *file, void *buf, filldir_t filler)
+{
+ struct ovl_dir_file *od = file->private_data;
+ int res;
+
+ if (!file->f_pos)
+ ovl_dir_reset(file);
+
+ if (od->is_real) {
+ res = vfs_readdir(od->realfile, filler, buf);
+ file->f_pos = od->realfile->f_pos;
+
+ return res;
+ }
+
+ if (!od->is_cached) {
+ struct path lowerpath;
+ struct path upperpath;
+ struct ovl_readdir_data rdd = { .list = &od->cache };
+
+ ovl_path_lower(file->f_path.dentry, &lowerpath);
+ ovl_path_upper(file->f_path.dentry, &upperpath);
+
+ res = ovl_dir_read_merged(&upperpath, &lowerpath, &rdd);
+ if (res) {
+ ovl_cache_free(rdd.list);
+ return res;
+ }
+
+ od->cache_version = ovl_dentry_version_get(file->f_path.dentry);
+ od->is_cached = true;
+
+ ovl_seek_cursor(od, file->f_pos);
+ }
+
+ while (od->cursor.next != &od->cache) {
+ int over;
+ loff_t off;
+ struct ovl_cache_entry *p;
+
+ p = list_entry(od->cursor.next, struct ovl_cache_entry, l_node);
+ off = file->f_pos;
+ if (!p->is_whiteout) {
+ over = filler(buf, p->name, p->len, off, p->ino,
+ p->type);
+ if (over)
+ break;
+ }
+ file->f_pos++;
+ list_move(&od->cursor, &p->l_node);
+ }
+
+ return 0;
+}
+
+static loff_t ovl_dir_llseek(struct file *file, loff_t offset, int origin)
+{
+ loff_t res;
+ struct ovl_dir_file *od = file->private_data;
+
+ mutex_lock(&file->f_dentry->d_inode->i_mutex);
+ if (!file->f_pos)
+ ovl_dir_reset(file);
+
+ if (od->is_real) {
+ res = vfs_llseek(od->realfile, offset, origin);
+ file->f_pos = od->realfile->f_pos;
+ } else {
+ res = -EINVAL;
+
+ switch (origin) {
+ case SEEK_CUR:
+ offset += file->f_pos;
+ break;
+ case SEEK_SET:
+ break;
+ default:
+ goto out_unlock;
+ }
+ if (offset < 0)
+ goto out_unlock;
+
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ if (od->is_cached)
+ ovl_seek_cursor(od, offset);
+ }
+ res = offset;
+ }
+out_unlock:
+ mutex_unlock(&file->f_dentry->d_inode->i_mutex);
+
+ return res;
+}
+
+static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
+ int datasync)
+{
+ struct ovl_dir_file *od = file->private_data;
+
+ /* May need to reopen directory if it got copied up */
+ if (!od->realfile) {
+ struct path upperpath;
+
+ ovl_path_upper(file->f_path.dentry, &upperpath);
+ od->realfile = ovl_path_open(&upperpath, O_RDONLY);
+ if (IS_ERR(od->realfile))
+ return PTR_ERR(od->realfile);
+ }
+
+ return vfs_fsync_range(od->realfile, start, end, datasync);
+}
+
+static int ovl_dir_release(struct inode *inode, struct file *file)
+{
+ struct ovl_dir_file *od = file->private_data;
+
+ list_del(&od->cursor);
+ ovl_cache_free(&od->cache);
+ if (od->realfile)
+ fput(od->realfile);
+ kfree(od);
+
+ return 0;
+}
+
+static int ovl_dir_open(struct inode *inode, struct file *file)
+{
+ struct path realpath;
+ struct file *realfile;
+ struct ovl_dir_file *od;
+ enum ovl_path_type type;
+
+ od = kzalloc(sizeof(struct ovl_dir_file), GFP_KERNEL);
+ if (!od)
+ return -ENOMEM;
+
+ type = ovl_path_real(file->f_path.dentry, &realpath);
+ realfile = ovl_path_open(&realpath, file->f_flags);
+ if (IS_ERR(realfile)) {
+ kfree(od);
+ return PTR_ERR(realfile);
+ }
+ INIT_LIST_HEAD(&od->cache);
+ INIT_LIST_HEAD(&od->cursor);
+ od->is_cached = false;
+ od->realfile = realfile;
+ od->is_real = (type != OVL_PATH_MERGE);
+ file->private_data = od;
+
+ return 0;
+}
+
+const struct file_operations ovl_dir_operations = {
+ .read = generic_read_dir,
+ .open = ovl_dir_open,
+ .readdir = ovl_readdir,
+ .llseek = ovl_dir_llseek,
+ .fsync = ovl_dir_fsync,
+ .release = ovl_dir_release,
+};
+
+static int ovl_check_empty_dir(struct dentry *dentry, struct list_head *list)
+{
+ int err;
+ struct path lowerpath;
+ struct path upperpath;
+ struct ovl_cache_entry *p;
+ struct ovl_readdir_data rdd = { .list = list };
+
+ ovl_path_upper(dentry, &upperpath);
+ ovl_path_lower(dentry, &lowerpath);
+
+ err = ovl_dir_read_merged(&upperpath, &lowerpath, &rdd);
+ if (err)
+ return err;
+
+ err = 0;
+
+ list_for_each_entry(p, list, l_node) {
+ if (p->is_whiteout)
+ continue;
+
+ if (p->name[0] == '.') {
+ if (p->len == 1)
+ continue;
+ if (p->len == 2 && p->name[1] == '.')
+ continue;
+ }
+ err = -ENOTEMPTY;
+ break;
+ }
+
+ return err;
+}
+
+static int ovl_remove_whiteouts(struct dentry *dir, struct list_head *list)
+{
+ struct path upperpath;
+ struct dentry *upperdir;
+ struct ovl_cache_entry *p;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+ int err;
+
+ ovl_path_upper(dir, &upperpath);
+ upperdir = upperpath.dentry;
+
+ override_cred = prepare_creds();
+ if (!override_cred)
+ return -ENOMEM;
+
+ /*
+ * CAP_DAC_OVERRIDE for lookup and unlink
+ * CAP_SYS_ADMIN for setxattr of "trusted" namespace
+ * CAP_FOWNER for unlink in sticky directory
+ */
+ cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_FOWNER);
+ old_cred = override_creds(override_cred);
+
+ err = vfs_setxattr(upperdir, ovl_opaque_xattr, "y", 1, 0);
+ if (err)
+ goto out_revert_creds;
+
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+ list_for_each_entry(p, list, l_node) {
+ struct dentry *dentry;
+ int ret;
+
+ if (!p->is_whiteout)
+ continue;
+
+ dentry = lookup_one_len(p->name, upperdir, p->len);
+ if (IS_ERR(dentry)) {
+ printk(KERN_WARNING
+ "overlayfs: failed to lookup whiteout %.*s: %li\n",
+ p->len, p->name, PTR_ERR(dentry));
+ continue;
+ }
+ ret = vfs_unlink(upperdir->d_inode, dentry);
+ dput(dentry);
+ if (ret)
+ printk(KERN_WARNING
+ "overlayfs: failed to unlink whiteout %.*s: %i\n",
+ p->len, p->name, ret);
+ }
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+
+out_revert_creds:
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+ return err;
+}
+
+int ovl_check_empty_and_clear(struct dentry *dentry, enum ovl_path_type type)
+{
+ int err;
+ LIST_HEAD(list);
+
+ err = ovl_check_empty_dir(dentry, &list);
+ if (!err && type == OVL_PATH_MERGE)
+ err = ovl_remove_whiteouts(dentry, &list);
+
+ ovl_cache_free(&list);
+
+ return err;
+}
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
new file mode 100644
index 0000000..02deecd
--- /dev/null
+++ b/fs/overlayfs/super.c
@@ -0,0 +1,611 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/mount.h>
+#include <linux/slab.h>
+#include <linux/parser.h>
+#include <linux/module.h>
+#include <linux/cred.h>
+#include <linux/sched.h>
+#include "overlayfs.h"
+
+MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
+MODULE_DESCRIPTION("Overlay filesystem");
+MODULE_LICENSE("GPL");
+
+struct ovl_fs {
+ struct vfsmount *upper_mnt;
+ struct vfsmount *lower_mnt;
+};
+
+struct ovl_entry {
+ /*
+ * Keep "double reference" on upper dentries, so that
+ * d_delete() doesn't think it's OK to reset d_inode to NULL.
+ */
+ struct dentry *__upperdentry;
+ struct dentry *lowerdentry;
+ union {
+ struct {
+ u64 version;
+ bool opaque;
+ };
+ struct rcu_head rcu;
+ };
+};
+
+const char *ovl_whiteout_xattr = "trusted.overlay.whiteout";
+const char *ovl_opaque_xattr = "trusted.overlay.opaque";
+
+
+enum ovl_path_type ovl_path_type(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ if (oe->__upperdentry) {
+ if (oe->lowerdentry && S_ISDIR(dentry->d_inode->i_mode))
+ return OVL_PATH_MERGE;
+ else
+ return OVL_PATH_UPPER;
+ } else {
+ return OVL_PATH_LOWER;
+ }
+}
+
+static struct dentry *ovl_upperdentry_dereference(struct ovl_entry *oe)
+{
+ struct dentry *upperdentry = ACCESS_ONCE(oe->__upperdentry);
+ smp_read_barrier_depends();
+ return upperdentry;
+}
+
+void ovl_path_upper(struct dentry *dentry, struct path *path)
+{
+ struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ path->mnt = ofs->upper_mnt;
+ path->dentry = ovl_upperdentry_dereference(oe);
+}
+
+void ovl_path_lower(struct dentry *dentry, struct path *path)
+{
+ struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ path->mnt = ofs->lower_mnt;
+ path->dentry = oe->lowerdentry;
+}
+
+enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path)
+{
+
+ enum ovl_path_type type = ovl_path_type(dentry);
+
+ if (type == OVL_PATH_LOWER)
+ ovl_path_lower(dentry, path);
+ else
+ ovl_path_upper(dentry, path);
+
+ return type;
+}
+
+struct dentry *ovl_dentry_upper(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ return ovl_upperdentry_dereference(oe);
+}
+
+struct dentry *ovl_dentry_lower(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ return oe->lowerdentry;
+}
+
+struct dentry *ovl_dentry_real(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+ struct dentry *realdentry;
+
+ realdentry = ovl_upperdentry_dereference(oe);
+ if (!realdentry)
+ realdentry = oe->lowerdentry;
+
+ return realdentry;
+}
+
+struct dentry *ovl_entry_real(struct ovl_entry *oe, bool *is_upper)
+{
+ struct dentry *realdentry;
+
+ realdentry = ovl_upperdentry_dereference(oe);
+ if (realdentry) {
+ *is_upper = true;
+ } else {
+ realdentry = oe->lowerdentry;
+ *is_upper = false;
+ }
+ return realdentry;
+}
+
+bool ovl_dentry_is_opaque(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+ return oe->opaque;
+}
+
+void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+ oe->opaque = opaque;
+}
+
+void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ WARN_ON(!mutex_is_locked(&upperdentry->d_parent->d_inode->i_mutex));
+ WARN_ON(oe->__upperdentry);
+ BUG_ON(!upperdentry->d_inode);
+ smp_wmb();
+ oe->__upperdentry = dget(upperdentry);
+}
+
+void ovl_dentry_version_inc(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ WARN_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
+ oe->version++;
+}
+
+u64 ovl_dentry_version_get(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ WARN_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
+ return oe->version;
+}
+
+bool ovl_is_whiteout(struct dentry *dentry)
+{
+ int res;
+ char val;
+
+ if (!dentry)
+ return false;
+ if (!dentry->d_inode)
+ return false;
+ if (!S_ISLNK(dentry->d_inode->i_mode))
+ return false;
+
+ res = vfs_getxattr(dentry, ovl_whiteout_xattr, &val, 1);
+ if (res == 1 && val == 'y')
+ return true;
+
+ return false;
+}
+
+static bool ovl_is_opaquedir(struct dentry *dentry)
+{
+ int res;
+ char val;
+
+ if (!S_ISDIR(dentry->d_inode->i_mode))
+ return false;
+
+ res = vfs_getxattr(dentry, ovl_opaque_xattr, &val, 1);
+ if (res == 1 && val == 'y')
+ return true;
+
+ return false;
+}
+
+static void ovl_entry_free(struct rcu_head *head)
+{
+ struct ovl_entry *oe = container_of(head, struct ovl_entry, rcu);
+ kfree(oe);
+}
+
+static void ovl_dentry_release(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ if (oe) {
+ dput(oe->__upperdentry);
+ dput(oe->__upperdentry);
+ dput(oe->lowerdentry);
+ call_rcu(&oe->rcu, ovl_entry_free);
+ }
+}
+
+const struct dentry_operations ovl_dentry_operations = {
+ .d_release = ovl_dentry_release,
+};
+
+static struct ovl_entry *ovl_alloc_entry(void)
+{
+ return kzalloc(sizeof(struct ovl_entry), GFP_KERNEL);
+}
+
+static inline struct dentry *ovl_lookup_real(struct dentry *dir,
+ struct qstr *name)
+{
+ struct dentry *dentry;
+
+ mutex_lock(&dir->d_inode->i_mutex);
+ dentry = lookup_one_len(name->name, dir, name->len);
+ mutex_unlock(&dir->d_inode->i_mutex);
+
+ if (IS_ERR(dentry)) {
+ if (PTR_ERR(dentry) == -ENOENT)
+ dentry = NULL;
+ } else if (!dentry->d_inode) {
+ dput(dentry);
+ dentry = NULL;
+ }
+ return dentry;
+}
+
+static int ovl_do_lookup(struct dentry *dentry)
+{
+ struct ovl_entry *oe;
+ struct dentry *upperdir;
+ struct dentry *lowerdir;
+ struct dentry *upperdentry = NULL;
+ struct dentry *lowerdentry = NULL;
+ struct inode *inode = NULL;
+ int err;
+
+ err = -ENOMEM;
+ oe = ovl_alloc_entry();
+ if (!oe)
+ goto out;
+
+ upperdir = ovl_dentry_upper(dentry->d_parent);
+ lowerdir = ovl_dentry_lower(dentry->d_parent);
+
+ if (upperdir) {
+ upperdentry = ovl_lookup_real(upperdir, &dentry->d_name);
+ err = PTR_ERR(upperdentry);
+ if (IS_ERR(upperdentry))
+ goto out_put_dir;
+
+ if (lowerdir && upperdentry &&
+ (S_ISLNK(upperdentry->d_inode->i_mode) ||
+ S_ISDIR(upperdentry->d_inode->i_mode))) {
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ err = -ENOMEM;
+ override_cred = prepare_creds();
+ if (!override_cred)
+ goto out_dput_upper;
+
+ /* CAP_SYS_ADMIN needed for getxattr */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ old_cred = override_creds(override_cred);
+
+ if (ovl_is_opaquedir(upperdentry)) {
+ oe->opaque = true;
+ } else if (ovl_is_whiteout(upperdentry)) {
+ dput(upperdentry);
+ upperdentry = NULL;
+ oe->opaque = true;
+ }
+ revert_creds(old_cred);
+ put_cred(override_cred);
+ }
+ }
+ if (lowerdir && !oe->opaque) {
+ lowerdentry = ovl_lookup_real(lowerdir, &dentry->d_name);
+ err = PTR_ERR(lowerdentry);
+ if (IS_ERR(lowerdentry))
+ goto out_dput_upper;
+ }
+
+ if (lowerdentry && upperdentry &&
+ (!S_ISDIR(upperdentry->d_inode->i_mode) ||
+ !S_ISDIR(lowerdentry->d_inode->i_mode))) {
+ dput(lowerdentry);
+ lowerdentry = NULL;
+ oe->opaque = true;
+ }
+
+ if (lowerdentry || upperdentry) {
+ struct dentry *realdentry;
+
+ realdentry = upperdentry ? upperdentry : lowerdentry;
+ err = -ENOMEM;
+ inode = ovl_new_inode(dentry->d_sb, realdentry->d_inode->i_mode,
+ oe);
+ if (!inode)
+ goto out_dput;
+ }
+
+ if (upperdentry)
+ oe->__upperdentry = dget(upperdentry);
+
+ if (lowerdentry)
+ oe->lowerdentry = lowerdentry;
+
+ dentry->d_fsdata = oe;
+ dentry->d_op = &ovl_dentry_operations;
+ d_add(dentry, inode);
+
+ return 0;
+
+out_dput:
+ dput(lowerdentry);
+out_dput_upper:
+ dput(upperdentry);
+out_put_dir:
+ kfree(oe);
+out:
+ return err;
+}
+
+struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ int err = ovl_do_lookup(dentry);
+
+ if (err)
+ return ERR_PTR(err);
+
+ return NULL;
+}
+
+struct file *ovl_path_open(struct path *path, int flags)
+{
+ path_get(path);
+ return dentry_open(path, flags, current_cred());
+}
+
+static void ovl_put_super(struct super_block *sb)
+{
+ struct ovl_fs *ufs = sb->s_fs_info;
+
+ if (!(sb->s_flags & MS_RDONLY))
+ mnt_drop_write(ufs->upper_mnt);
+
+ mntput(ufs->upper_mnt);
+ mntput(ufs->lower_mnt);
+
+ kfree(ufs);
+}
+
+static int ovl_remount_fs(struct super_block *sb, int *flagsp, char *data)
+{
+ int flags = *flagsp;
+ struct ovl_fs *ufs = sb->s_fs_info;
+
+ /* When remounting rw or ro, we need to adjust the write access to the
+ * upper fs.
+ */
+ if (((flags ^ sb->s_flags) & MS_RDONLY) == 0)
+ /* No change to readonly status */
+ return 0;
+
+ if (flags & MS_RDONLY) {
+ mnt_drop_write(ufs->upper_mnt);
+ return 0;
+ } else
+ return mnt_want_write(ufs->upper_mnt);
+}
+
+static const struct super_operations ovl_super_operations = {
+ .put_super = ovl_put_super,
+ .remount_fs = ovl_remount_fs,
+};
+
+struct ovl_config {
+ char *lowerdir;
+ char *upperdir;
+};
+
+enum {
+ Opt_lowerdir,
+ Opt_upperdir,
+ Opt_err,
+};
+
+static const match_table_t ovl_tokens = {
+ {Opt_lowerdir, "lowerdir=%s"},
+ {Opt_upperdir, "upperdir=%s"},
+ {Opt_err, NULL}
+};
+
+static int ovl_parse_opt(char *opt, struct ovl_config *config)
+{
+ char *p;
+
+ config->upperdir = NULL;
+ config->lowerdir = NULL;
+
+ while ((p = strsep(&opt, ",")) != NULL) {
+ int token;
+ substring_t args[MAX_OPT_ARGS];
+
+ if (!*p)
+ continue;
+
+ token = match_token(p, ovl_tokens, args);
+ switch (token) {
+ case Opt_upperdir:
+ kfree(config->upperdir);
+ config->upperdir = match_strdup(&args[0]);
+ if (!config->upperdir)
+ return -ENOMEM;
+ break;
+
+ case Opt_lowerdir:
+ kfree(config->lowerdir);
+ config->lowerdir = match_strdup(&args[0]);
+ if (!config->lowerdir)
+ return -ENOMEM;
+ break;
+
+ default:
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
+static int ovl_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct path lowerpath;
+ struct path upperpath;
+ struct inode *root_inode;
+ struct dentry *root_dentry;
+ struct ovl_entry *oe;
+ struct ovl_fs *ufs;
+ struct ovl_config config;
+ int err;
+
+ err = ovl_parse_opt((char *) data, &config);
+ if (err)
+ goto out;
+
+ err = -EINVAL;
+ if (!config.upperdir || !config.lowerdir) {
+ printk(KERN_ERR "overlayfs: missing upperdir or lowerdir\n");
+ goto out_free_config;
+ }
+
+ err = -ENOMEM;
+ ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
+ if (!ufs)
+ goto out_free_config;
+
+ oe = ovl_alloc_entry();
+ if (oe == NULL)
+ goto out_free_ufs;
+
+ err = kern_path(config.upperdir, LOOKUP_FOLLOW, &upperpath);
+ if (err)
+ goto out_free_oe;
+
+ err = kern_path(config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
+ if (err)
+ goto out_put_upperpath;
+
+ err = -ENOTDIR;
+ if (!S_ISDIR(upperpath.dentry->d_inode->i_mode) ||
+ !S_ISDIR(lowerpath.dentry->d_inode->i_mode))
+ goto out_put_lowerpath;
+
+ ufs->upper_mnt = clone_private_mount(&upperpath);
+ err = PTR_ERR(ufs->upper_mnt);
+ if (IS_ERR(ufs->upper_mnt)) {
+ printk(KERN_ERR "overlayfs: failed to clone upperpath\n");
+ goto out_put_lowerpath;
+ }
+
+ ufs->lower_mnt = clone_private_mount(&lowerpath);
+ err = PTR_ERR(ufs->lower_mnt);
+ if (IS_ERR(ufs->lower_mnt)) {
+ printk(KERN_ERR "overlayfs: failed to clone lowerpath\n");
+ goto out_put_upper_mnt;
+ }
+
+ /*
+ * Make lower_mnt R/O. That way fchmod/fchown on lower file
+ * will fail instead of modifying lower fs.
+ */
+ ufs->lower_mnt->mnt_flags |= MNT_READONLY;
+
+ /* If the upper fs is r/o, we mark overlayfs r/o too */
+ if (ufs->upper_mnt->mnt_sb->s_flags & MS_RDONLY)
+ sb->s_flags |= MS_RDONLY;
+
+ if (!(sb->s_flags & MS_RDONLY)) {
+ err = mnt_want_write(ufs->upper_mnt);
+ if (err)
+ goto out_put_lower_mnt;
+ }
+
+ err = -ENOMEM;
+ root_inode = ovl_new_inode(sb, S_IFDIR, oe);
+ if (!root_inode)
+ goto out_drop_write;
+
+ root_dentry = d_make_root(root_inode);
+ if (!root_dentry)
+ goto out_drop_write;
+
+ mntput(upperpath.mnt);
+ mntput(lowerpath.mnt);
+
+ oe->__upperdentry = dget(upperpath.dentry);
+ oe->lowerdentry = lowerpath.dentry;
+
+ root_dentry->d_fsdata = oe;
+ root_dentry->d_op = &ovl_dentry_operations;
+
+ sb->s_op = &ovl_super_operations;
+ sb->s_root = root_dentry;
+ sb->s_fs_info = ufs;
+
+ return 0;
+
+out_drop_write:
+ if (!(sb->s_flags & MS_RDONLY))
+ mnt_drop_write(ufs->upper_mnt);
+out_put_lower_mnt:
+ mntput(ufs->lower_mnt);
+out_put_upper_mnt:
+ mntput(ufs->upper_mnt);
+out_put_lowerpath:
+ path_put(&lowerpath);
+out_put_upperpath:
+ path_put(&upperpath);
+out_free_oe:
+ kfree(oe);
+out_free_ufs:
+ kfree(ufs);
+out_free_config:
+ kfree(config.lowerdir);
+ kfree(config.upperdir);
+out:
+ return err;
+}
+
+static struct dentry *ovl_mount(struct file_system_type *fs_type, int flags,
+ const char *dev_name, void *raw_data)
+{
+ return mount_nodev(fs_type, flags, raw_data, ovl_fill_super);
+}
+
+static struct file_system_type ovl_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "overlayfs",
+ .mount = ovl_mount,
+ .kill_sb = kill_anon_super,
+};
+
+static int __init ovl_init(void)
+{
+ return register_filesystem(&ovl_fs_type);
+}
+
+static void __exit ovl_exit(void)
+{
+ unregister_filesystem(&ovl_fs_type);
+}
+
+module_init(ovl_init);
+module_exit(ovl_exit);
--
1.7.10.4
From: Miklos Szeredi <[email protected]>
Export do_splice_direct() to modules. Needed by overlay filesystem.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/splice.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/splice.c b/fs/splice.c
index 718bd00..0e8f44a 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1308,6 +1308,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
return ret;
}
+EXPORT_SYMBOL(do_splice_direct);
static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
struct pipe_inode_info *opipe,
--
1.7.10.4
From: Miklos Szeredi <[email protected]>
Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
that want to return a struct file from a different filesystem.
Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/Locking | 2 ++
Documentation/filesystems/vfs.txt | 7 +++++++
fs/namei.c | 9 ++++++---
fs/open.c | 23 +++++++++++++++++++++--
include/linux/fs.h | 2 ++
5 files changed, 38 insertions(+), 5 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 0706d32..4331290 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -66,6 +66,7 @@ prototypes:
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
+ int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
locking rules:
all may block
@@ -93,6 +94,7 @@ removexattr: yes
fiemap: no
update_time: no
atomic_open: yes
+dentry_open: no
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
victim.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index bc4b06b..f64a4d1 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -362,6 +362,7 @@ struct inode_operations {
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
+ int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
};
Again, all methods are called without any locks being held, unless
@@ -681,6 +682,12 @@ struct address_space_operations {
but instead uses bmap to find out where the blocks in the file
are and uses those addresses directly.
+ dentry_open: this is an alternative to f_op->open(), the difference is that
+ this method may open a file not necessarily originating from the same
+ filesystem as the one i_op->open() was called on. It may be
+ useful for stacking filesystems which want to allow native I/O directly
+ on underlying files.
+
invalidatepage: If a page has PagePrivate set, then invalidatepage
will be called when part or all of the page is to be removed
diff --git a/fs/namei.c b/fs/namei.c
index 57ae9c8..c1c0c15 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2867,9 +2867,12 @@ finish_open_created:
error = may_open(&nd->path, acc_mode, open_flag);
if (error)
goto out;
- file->f_path.mnt = nd->path.mnt;
- error = finish_open(file, nd->path.dentry, NULL, opened);
- if (error) {
+
+ BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */
+ error = vfs_open(&nd->path, file, current_cred());
+ if (!error) {
+ *opened |= FILE_OPENED;
+ } else {
if (error == -EOPENSTALE)
goto stale_open;
goto out;
diff --git a/fs/open.c b/fs/open.c
index 6835446..b9d9f9e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -828,8 +828,7 @@ struct file *dentry_open(const struct path *path, int flags,
f = get_empty_filp();
if (!IS_ERR(f)) {
f->f_flags = flags;
- f->f_path = *path;
- error = do_dentry_open(f, NULL, cred);
+ error = vfs_open(path, f, cred);
if (!error) {
/* from now on we need fput() to dispose of f */
error = open_check_o_direct(f);
@@ -846,6 +845,26 @@ struct file *dentry_open(const struct path *path, int flags,
}
EXPORT_SYMBOL(dentry_open);
+/**
+ * vfs_open - open the file at the given path
+ * @path: path to open
+ * @filp: newly allocated file with f_flag initialized
+ * @cred: credentials to use
+ */
+int vfs_open(const struct path *path, struct file *filp,
+ const struct cred *cred)
+{
+ struct inode *inode = path->dentry->d_inode;
+
+ if (inode->i_op->dentry_open)
+ return inode->i_op->dentry_open(path->dentry, filp, cred);
+ else {
+ filp->f_path = *path;
+ return do_dentry_open(filp, NULL, cred);
+ }
+}
+EXPORT_SYMBOL(vfs_open);
+
static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
{
int lookup_flags = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..891c93d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1573,6 +1573,7 @@ struct inode_operations {
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
+ int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
} ____cacheline_aligned;
ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
@@ -2006,6 +2007,7 @@ extern struct file *file_open_name(struct filename *, int, umode_t);
extern struct file *filp_open(const char *, int, umode_t);
extern struct file *file_open_root(struct dentry *, struct vfsmount *,
const char *, int);
+extern int vfs_open(const struct path *, struct file *, const struct cred *);
extern struct file * dentry_open(const struct path *, int, const struct cred *);
extern int filp_close(struct file *, fl_owner_t id);
--
1.7.10.4
On Tue, Mar 12, 2013 at 4:41 PM, Miklos Szeredi <[email protected]> wrote:
> Al and Linus,
>
> Please consider overlayfs for inclusion into 3.10.
>
> It's included in Ubuntu and openSUSE, used by OpenWrt and various other
> projects. I regularly get emails asking when it will be included in mainline.
>
> Git tree is here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.v16
>
> Thanks,
> Miklos
>
> ---
> Andy Whitcroft (3):
> overlayfs-add-statfs-support
> ovl-switch-to-inode_permission
> overlayfs-copy-up-i_uid-i_gid-from-the-underlying-inode
>
> Erez Zadok (1):
> overlayfs-implement-show_options
>
> Miklos Szeredi (6):
> vfs-add-i_op-dentry_open
> vfs-export-do_splice_direct-to-modules
> vfs-introduce-clone_private_mount
> overlay filesystem
> fs-limit-filesystem-stacking-depth
> vfs-export-inode_permission-to-modules
>
> Neil Brown (1):
> overlay-overlay-filesystem-documentation
>
> Robin Dong (2):
> overlayfs-fix-possible-leak-in-ovl_new_inode
> overlayfs-create-new-inode-in-ovl_link
>
What happened to the subject lines (see v15)?
- Sedat -
[1] http://git.kernel.org/cgit/linux/kernel/git/mszeredi/vfs.git/log/?h=overlayfs.v15
> ---
> Documentation/filesystems/Locking | 2 +
> Documentation/filesystems/overlayfs.txt | 199 +++++++++
> Documentation/filesystems/vfs.txt | 7 +
> MAINTAINERS | 7 +
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/ecryptfs/main.c | 7 +
> fs/internal.h | 5 -
> fs/namei.c | 10 +-
> fs/namespace.c | 18 +
> fs/open.c | 23 +-
> fs/overlayfs/Kconfig | 4 +
> fs/overlayfs/Makefile | 7 +
> fs/overlayfs/copy_up.c | 385 +++++++++++++++++
> fs/overlayfs/dir.c | 604 +++++++++++++++++++++++++++
> fs/overlayfs/inode.c | 372 +++++++++++++++++
> fs/overlayfs/overlayfs.h | 70 ++++
> fs/overlayfs/readdir.c | 566 +++++++++++++++++++++++++
> fs/overlayfs/super.c | 685 +++++++++++++++++++++++++++++++
> fs/splice.c | 1 +
> include/linux/fs.h | 14 +
> include/linux/mount.h | 3 +
> 22 files changed, 2981 insertions(+), 10 deletions(-)
> create mode 100644 Documentation/filesystems/overlayfs.txt
> create mode 100644 fs/overlayfs/Kconfig
> create mode 100644 fs/overlayfs/Makefile
> create mode 100644 fs/overlayfs/copy_up.c
> create mode 100644 fs/overlayfs/dir.c
> create mode 100644 fs/overlayfs/inode.c
> create mode 100644 fs/overlayfs/overlayfs.h
> create mode 100644 fs/overlayfs/readdir.c
> create mode 100644 fs/overlayfs/super.c
>
Miklos Szeredi:
> Please consider overlayfs for inclusion into 3.10.
Thank you for CCing me.
First, I'd suggest you to follow some recent activities in mainline
kernel such as
- MODULE_ALIAS_FS
- file_inode()
- d_weak_revalidate() which may not be necessary for overlayfs as long
as it prohibit users the direct-access to layers (bypassing
overlayfs).
> It's included in Ubuntu and openSUSE, used by OpenWrt and various other
> projects. I regularly get emails asking when it will be included in mainline.
Such situation is very similar which AUFS had a few years ago. At that
time, AUFS was rejected since UnionMount was chosen. Years passed, the
development of UnionMount seems to stop and linux mainline doesn't have
generic stackable filesytem yet. I had pointed out some defects in
overlayfs (and UnionMount too). They are all based upon the "name-based"
behaviour instead of "inode-based" one. Other than "Non-standard
behavior" in the overlayfs document, the are, for example,
- read(2) may get an obsoleted filedata (fstat(2) for metadata too).
- fcntl(F_SETLK) may be broken by copy-up.
- inotify may not work when it refers to the file before being
copied-up.
- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after
open(O_RDWR).
AUFS is an "inode-based" stackable filesystem and solved them many years
ago. But I have to admit that AUFS is big. Yes it is grown up.
I don't stop including overlayfs into mainline, but if the development
of UnionMount is really stopped, then I'd ask people to consider merging
aufs as well as overlayfs.
J. R. Okajima
On Tue, Mar 12, 2013 at 5:49 PM, Sedat Dilek <[email protected]> wrote:
>
> What happened to the subject lines (see v15)?
>
Oops, will fix. Thanks.
Miklos
On Tue, Mar 12, 2013 at 6:22 PM, J. R. Okajima <[email protected]> wrote:
>
> Miklos Szeredi:
>> Please consider overlayfs for inclusion into 3.10.
>
> Thank you for CCing me.
>
> First, I'd suggest you to follow some recent activities in mainline
> kernel such as
> - MODULE_ALIAS_FS
> - file_inode()
Okay, thanks for the comments. Will fix.
Miklos
On Tue, Mar 12, 2013 at 8:41 AM, Miklos Szeredi <[email protected]> wrote:
> Al and Linus,
>
> Please consider overlayfs for inclusion into 3.10.
Yes, I think we should just do it. It's in use, it's pretty small, and
the other alternatives are worse. Let's just plan on getting this
thing done with.
Al, I realize you may not love this, but can you please give this a
look? People clearly want to use it. In particular the new interfaces,
like the inode ops open function with dentry passed in or whatever?
The changes outside of overlayfs looked fine to me.
Miklos, I hate how you have patches like 09/13 that fix bugs
introduced in earlier versions. Any particular reasons for that kind
of thing?
Linus
On Tue, Mar 12, 2013 at 02:50:02PM -0700, Linus Torvalds wrote:
> On Tue, Mar 12, 2013 at 8:41 AM, Miklos Szeredi <[email protected]> wrote:
> > Al and Linus,
> >
> > Please consider overlayfs for inclusion into 3.10.
>
> Yes, I think we should just do it. It's in use, it's pretty small, and
> the other alternatives are worse. Let's just plan on getting this
> thing done with.
>
> Al, I realize you may not love this, but can you please give this a
> look? People clearly want to use it. In particular the new interfaces,
> like the inode ops open function with dentry passed in or whatever?
> The changes outside of overlayfs looked fine to me.
I'll post a review tonight or tomorrow. FWIW, I was not too happy with
it the last time I looked, but I'll obviously need to reread the whole
thing.
I *have* looked at unionmount lately, and the recent modifications dhowells
is doing there are closing most of my problems with that; on the other hand,
there's no fundamental reason why both can't get merged. Hell, might as
well resurrect aufs, while we are at it...
union-like things are actually on top of my "things to deal with this cycle"
list, closely folowed by rework of ->readdir().
Miklos, two points:
* I would very much prefer to deal with that (as well as unionmount and
aufs) as git branches _expected_ to be reordered/rebased/folded/mutilated/etc.
while we are sorting all that stuff out. For now, let's base them on -rc1.
I expect that vfs.git will grow common stem, with bits and pieces of those
guys getting gradually pulled into it, at which point(s) the rest will be
rebased.
* what Linus just said about bisectablity
Oh, and the third one - I still owe you a bottle of your choice for sorting
the atomic_open shite out. Is there any chance you'll be able to attend
LSFS this year?
On Tue, Mar 12, 2013 at 11:23 PM, Al Viro <[email protected]> wrote:
> On Tue, Mar 12, 2013 at 02:50:02PM -0700, Linus Torvalds wrote:
>> On Tue, Mar 12, 2013 at 8:41 AM, Miklos Szeredi <[email protected]> wrote:
>> > Al and Linus,
>> >
>> > Please consider overlayfs for inclusion into 3.10.
>>
>> Yes, I think we should just do it. It's in use, it's pretty small, and
>> the other alternatives are worse. Let's just plan on getting this
>> thing done with.
>>
>> Al, I realize you may not love this, but can you please give this a
>> look? People clearly want to use it. In particular the new interfaces,
>> like the inode ops open function with dentry passed in or whatever?
>> The changes outside of overlayfs looked fine to me.
>
> I'll post a review tonight or tomorrow. FWIW, I was not too happy with
> it the last time I looked, but I'll obviously need to reread the whole
> thing.
>
> I *have* looked at unionmount lately, and the recent modifications dhowells
> is doing there are closing most of my problems with that; on the other hand,
> there's no fundamental reason why both can't get merged. Hell, might as
> well resurrect aufs, while we are at it...
>
> union-like things are actually on top of my "things to deal with this cycle"
> list, closely folowed by rework of ->readdir().
>
> Miklos, two points:
> * I would very much prefer to deal with that (as well as unionmount and
> aufs) as git branches _expected_ to be reordered/rebased/folded/mutilated/etc.
> while we are sorting all that stuff out. For now, let's base them on -rc1.
> I expect that vfs.git will grow common stem, with bits and pieces of those
> guys getting gradually pulled into it, at which point(s) the rest will be
> rebased.
> * what Linus just said about bisectablity
> Oh, and the third one - I still owe you a bottle of your choice for sorting
> the atomic_open shite out. Is there any chance you'll be able to attend
> LSFS this year?
>
Hi all,
I only say: Al-lelujah!
Congrats Miklos!
- Sedat -
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Mar 12, 2013 at 4:41 PM, Miklos Szeredi <[email protected]> wrote:
> Al and Linus,
>
> Please consider overlayfs for inclusion into 3.10.
>
> It's included in Ubuntu and openSUSE, used by OpenWrt and various other
> projects. I regularly get emails asking when it will be included in mainline.
>
> Git tree is here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.v16
>
> Thanks,
> Miklos
>
Hi Miklos,
Is it possible to get a backport of OverlayFS-v16 for Linux-stable v3.8.y?
Thanks in advance.
- Sedat -
On Wed, Mar 13, 2013 at 11:09 AM, Sedat Dilek <[email protected]> wrote:
> On Tue, Mar 12, 2013 at 4:41 PM, Miklos Szeredi <[email protected]> wrote:
>> Al and Linus,
>>
>> Please consider overlayfs for inclusion into 3.10.
>>
>> It's included in Ubuntu and openSUSE, used by OpenWrt and various other
>> projects. I regularly get emails asking when it will be included in mainline.
>>
>> Git tree is here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.v16
>>
>> Thanks,
>> Miklos
>>
>
> Hi Miklos,
>
> Is it possible to get a backport of OverlayFS-v16 for Linux-stable v3.8.y?
> Thanks in advance.
v15 is the backport, or rather, v16 is a forward port. I changed
nothing else. I'll do a .v17 with the comments addressed.
Thanks,
Miklos
On Tue, Mar 12, 2013 at 11:23 PM, Al Viro <[email protected]> wrote:
> I'll post a review tonight or tomorrow. FWIW, I was not too happy with
> it the last time I looked, but I'll obviously need to reread the whole
> thing.
>
> I *have* looked at unionmount lately, and the recent modifications dhowells
> is doing there are closing most of my problems with that; on the other hand,
> there's no fundamental reason why both can't get merged. Hell, might as
> well resurrect aufs, while we are at it...
I haven't seen the union mount patchset for a long time. David, do
you plan to send out the series for review sometime?
> Oh, and the third one - I still owe you a bottle of your choice for sorting
> the atomic_open shite out. Is there any chance you'll be able to attend
> LSFS this year?
Can't make it this year. But I do plan to go to KS in Edinburgh.
Thanks,
Miklos
On Tue, Mar 12, 2013 at 10:23:50PM +0000, Al Viro wrote:
> I'll post a review tonight or tomorrow. FWIW, I was not too happy with
> it the last time I looked, but I'll obviously need to reread the whole
> thing.
OK... Here's the first pass at that:
* use of xattrs for whiteouts/opaque is a Bloody Bad Idea(tm). That's one
thing you definitely can share with unionmount. In particular, the games
with creds you have to pull off in ovl_do_lookup() are very clear indications
that xattr is simply a wrong interface for that.
* I don't see anything that would protect you from attacker playing silly
buggers with upper layer; mount it r/w elsewhere and do some renames...
Note that your ->lookup() relies on having the result of ovl_lookup_real()
remain the child of dentry we'd passed it as the first argument. What's
there to guarantee that it will remain such? The similar question goes for
malicious modifications of xattrs... For that matter, what's to prevent
the same sucker mounted as upper layer in two places, with two unrelated lower
layers? AFAICS, things will break rather badly if that happens, and I'm not
sure if you avoid deadlocks in such scenario... Interfering with copyup
in progress is also possible.
* I think you might have an unpleasant problem in your ->setattr(); suppose
you've got through the checks in notify_change() and ovl_setattr() got called.
With ATTR_SIZE present. OK, you do a truncated copyup; fair enough. But
then you do notify_change() on upper layer dentry to do the rest of the job.
What happens if that fails? Moreover, what's to prevent it being e.g. opened
by another process *before* you get around to that notify_change() part?
* ->follow_link(): Why the hell do you bother with struct ovl_link_data???
Just to avoid calling ovl_dentry_real() in ovl_put_link()?
BTW, a random note:
if (err)
return ERR_PTR(err);
return NULL;
is a weird way to spell return ERR_PTR(err) - ERR_PTR(0) *is* NULL, TYVM,
and we rely on that in a lot of places.
On Wed, Mar 13, 2013 at 7:52 PM, Al Viro <[email protected]> wrote:
> On Tue, Mar 12, 2013 at 10:23:50PM +0000, Al Viro wrote:
>
>> I'll post a review tonight or tomorrow. FWIW, I was not too happy with
>> it the last time I looked, but I'll obviously need to reread the whole
>> thing.
>
> OK... Here's the first pass at that:
>
> * use of xattrs for whiteouts/opaque is a Bloody Bad Idea(tm). That's one
> thing you definitely can share with unionmount. In particular, the games
> with creds you have to pull off in ovl_do_lookup() are very clear indications
> that xattr is simply a wrong interface for that.
I'm not so sure. The advantages of using xattrs are:
- already available on lots of filesystems
- no new, backward incompatible disk-format needed
- no new userspace interfaces needed to view/manipluate them
Just looked, union mounts supports patchset has whiteout support for
tmpfs, ext2 and jffs2. That means I can't try it out on my root
filesystem (and couldn't even if ext4 were supported, because of the
incompatible changes required).
>
> * I don't see anything that would protect you from attacker playing silly
> buggers with upper layer; mount it r/w elsewhere and do some renames...
> Note that your ->lookup() relies on having the result of ovl_lookup_real()
> remain the child of dentry we'd passed it as the first argument. What's
> there to guarantee that it will remain such? The similar question goes for
> malicious modifications of xattrs...
All of the above are privileged (xattrs are in the "trusted."
namespace). Behavior is unspecified in those cases, and "attacker"
can do no worse than confuse itself.
BTW. I had plans of adding infrastructure to make sure that only one
instance of "upper" is writable and that instance is only visible
through the overlay. But then decided to postpone this as it didn't
seem to be very important.
> For that matter, what's to prevent
> the same sucker mounted as upper layer in two places, with two unrelated lower
> layers? AFAICS, things will break rather badly if that happens, and I'm not
> sure if you avoid deadlocks in such scenario... Interfering with copyup
> in progress is also possible.
I don't see how that would deadlock. We follow VFS locking rules on
upper and lower filesystem and never lock both at the same time. And
we
only lock overlay first and then upper *or* lower.
As for same upper on unrelated lower: just don't do it. As I said, we
could enforce this, but I don't think this is top priority.
>
> * I think you might have an unpleasant problem in your ->setattr(); suppose
> you've got through the checks in notify_change() and ovl_setattr() got called.
> With ATTR_SIZE present. OK, you do a truncated copyup; fair enough. But
> then you do notify_change() on upper layer dentry to do the rest of the job.
> What happens if that fails? Moreover, what's to prevent it being e.g. opened
> by another process *before* you get around to that notify_change() part?
Okay, the truncated copyup doesn't seem to be a good idea. Or rather,
it needs to be done more carefully, only actually revealing the copied
up dentry once the setattr operation has completed successfully.
>
> * ->follow_link(): Why the hell do you bother with struct ovl_link_data???
> Just to avoid calling ovl_dentry_real() in ovl_put_link()?
Yes, a copy-up between ovl_follow_link and ovl_put_link will break that.
Thanks,
Miklos
On Wed, Mar 13, 2013 at 11:09:07PM +0100, Miklos Szeredi wrote:
> I don't see how that would deadlock. We follow VFS locking rules on
> upper and lower filesystem and never lock both at the same time. And
> we
> only lock overlay first and then upper *or* lower.
>
> As for same upper on unrelated lower: just don't do it. As I said, we
> could enforce this, but I don't think this is top priority.
Tell that to container crowd - they seem to be hell-bent on making everything
mount-related non-priveleged ;-/
FWIW, the thing that worries me is the fun involved in situations when
topology of dentry tree of your fs becomes completely unrelated to that
of one (or both) layers; if somebody can start playing with cross-directory
renames in said layers, things can get really nasty. Another thing is,
you cache some properties of underlying directories in your tree; are you
sure that this cached information becoming stale will _not_ do anything
bad?
> > * ->follow_link(): Why the hell do you bother with struct ovl_link_data???
> > Just to avoid calling ovl_dentry_real() in ovl_put_link()?
>
> Yes, a copy-up between ovl_follow_link and ovl_put_link will break that.
*blink*
Er... What's wrong with simply unhashing the sucker on copyup if it's
a symlink?
BTW, looking at your ovl_copy_up() - you do realize that dget_parent(d)
does *not* guarantee that returned dentry will remain the parent of d?
rename() can very well move it away just as dget_parent() is returning
to caller. As the result, you are not guaranteed that ovl_copy_up_one()
arguments will be anywhere near each other in the tree. Your locking
and rechecks might be enough to prevent trouble there, but it's not
obvious, to put it mildly.
I'm _very_ sceptical about the idea of delaying copyups that much, BTW;
there's a damn good reason why all implementations starting with Sun's
one in 80s did copy directories up as soon as they got looked up. Saves
a lot of headache...
As for whiteouts... I think we ought to pull these bits of unionmount
queue into the common stem and add the missing filesystems to them;
ext* and ufs are trivial (keep in mind that FFS derivatives, including
ext*, have d_type in directory entry and type 14 (DT_WHT) is there
precisely for that purpose). btrfs also has "dir_type" thing - 8bit
field...
Note that upper layer in *any* union would better not change unpredictably
under you - anybody trying to do e.g. NFS as top (either, actually) layer
is welcome to all kinds of PITA when it comes to need to revalidate stale
dentries. IOW, realistically the upper layer is going to be local,
read-write and not something like sysfs or procfs, where things can disappear
at will. What does "having xattr is enough" really buy?
On Thu, Mar 14, 2013 at 12:19 AM, Al Viro <[email protected]> wrote:
> On Wed, Mar 13, 2013 at 11:09:07PM +0100, Miklos Szeredi wrote:
>> As for same upper on unrelated lower: just don't do it. As I said, we
>> could enforce this, but I don't think this is top priority.
>
> Tell that to container crowd - they seem to be hell-bent on making everything
> mount-related non-priveleged ;-/
Which is good, but it does need some care. I'm happy to review those changes.
>> > * ->follow_link(): Why the hell do you bother with struct ovl_link_data???
>> > Just to avoid calling ovl_dentry_real() in ovl_put_link()?
>>
>> Yes, a copy-up between ovl_follow_link and ovl_put_link will break that.
>
> *blink*
>
> Er... What's wrong with simply unhashing the sucker on copyup if it's
> a symlink?
Nothing, so I'll do that. Actually we can do that for all except
directory dentries and save some worry.
> BTW, looking at your ovl_copy_up() - you do realize that dget_parent(d)
> does *not* guarantee that returned dentry will remain the parent of d?
> rename() can very well move it away just as dget_parent() is returning
> to caller. As the result, you are not guaranteed that ovl_copy_up_one()
> arguments will be anywhere near each other in the tree. Your locking
> and rechecks might be enough to prevent trouble there, but it's not
> obvious, to put it mildly.
This issue is documented above ovl_copy_up_one(). It's not all that
complicated, I think.
> I'm _very_ sceptical about the idea of delaying copyups that much, BTW;
> there's a damn good reason why all implementations starting with Sun's
> one in 80s did copy directories up as soon as they got looked up. Saves
> a lot of headache...
Maybe. If we find not trivially fixable holes in the current
implementation I'm open to that direction.
Delayed copy up has the advantage of allowing pure read-only overlays.
> As for whiteouts... I think we ought to pull these bits of unionmoun
> queue into the common stem and add the missing filesystems to them;
> ext* and ufs are trivial (keep in mind that FFS derivatives, including
> ext*, have d_type in directory entry and type 14 (DT_WHT) is there
> precisely for that purpose). btrfs also has "dir_type" thing - 8bit
> field...
What about userspace interfaces? Are we allowed to extend d_type and
st_mode without breaking things?
Thanks,
Miklos
Miklos Szeredi:
> On Thu, Mar 14, 2013 at 12:19 AM, Al Viro <[email protected]> wrote:
:::
> > As for whiteouts... I think we ought to pull these bits of unionmoun
> > queue into the common stem and add the missing filesystems to them;
> > ext* and ufs are trivial (keep in mind that FFS derivatives, including
> > ext*, have d_type in directory entry and type 14 (DT_WHT) is there
> > precisely for that purpose). btrfs also has "dir_type" thing - 8bit
> > field...
>
> What about userspace interfaces? Are we allowed to extend d_type and
> st_mode without breaking things?
Introducing a new d_type value can be a headache for other filesystems
or other OSs. Whiteout and opaque by xattr is an idea, but I prefer more
primitive way (special hidden filename) since xattr may force users to
change their configuration and it consumes disk space a little.
Filesystems may have a limit for the consumed size by xattr. If xattr
reaches the limit, then several operation in overlayfs will be unusable.
J. R. Okajima
On Thu, Mar 14, 2013 at 11:37:50AM +0100, Miklos Szeredi wrote:
> > As for whiteouts... I think we ought to pull these bits of unionmoun
> > queue into the common stem and add the missing filesystems to them;
> > ext* and ufs are trivial (keep in mind that FFS derivatives, including
> > ext*, have d_type in directory entry and type 14 (DT_WHT) is there
> > precisely for that purpose). btrfs also has "dir_type" thing - 8bit
> > field...
>
> What about userspace interfaces? Are we allowed to extend d_type and
> st_mode without breaking things?
Huh?
* from st_mode point of view, it's not going to conflict with
anything; FFS "entry type" matches bits 12..15 of mode_t, and the value
picked by whoever had first implemented whiteouts had been chosen so
that it would not clash with any existing values. We have
#define S_IFMT 00170000
#define S_IFSOCK 0140000
#define S_IFLNK 0120000
#define S_IFREG 0100000
#define S_IFBLK 0060000
#define S_IFDIR 0040000
#define S_IFCHR 0020000
#define S_IFIFO 0010000
and this sucker would've been 0160000; new filesystem object types are not
frequently introduced, to put it mildly, so I wouldn't expect clashes.
Especially since any such clash would hit Solaris and *BSD (including
MacOS X) as well - all of them use that value for whiteouts.
* for practically all syscalls these guys are treated as file not
being there; e.g. mkdir() on such name results in whiteout silently
replaced with references to newly created subdirectory, etc.
* BSDs have one major exposure of those guys to userland - their
getdents(2) shows whiteouts (with d_type equal to DT_WHT and d_fileno - to 1).
Other than that (and we obviously do _not_ want that as default behaviour;
maybe a new O_<something> to produce that), there's practically nothing -
undelete(2) removes a whiteout in attempt to expose the object masked by
it, mknod(2) with S_IFWHT as type does create one. That's it.
IOW, I don't see how we'd get any kind of userland problems with those. And
this is not a new API - a bunch of BSD-derived Unices starting with, IIRC,
SunOS 3 and 4.3BSD, as well as their bastard progeny (Solaris, MacOS X)
had that since mid-80s...
On Thu, Mar 14, 2013 at 11:59 PM, Al Viro <[email protected]> wrote:
> Huh?
> * from st_mode point of view, it's not going to conflict with
> anything; FFS "entry type" matches bits 12..15 of mode_t, and the value
> picked by whoever had first implemented whiteouts had been chosen so
> that it would not clash with any existing values. We have
> #define S_IFMT 00170000
> #define S_IFSOCK 0140000
> #define S_IFLNK 0120000
> #define S_IFREG 0100000
> #define S_IFBLK 0060000
> #define S_IFDIR 0040000
> #define S_IFCHR 0020000
> #define S_IFIFO 0010000
> and this sucker would've been 0160000; new filesystem object types are not
> frequently introduced, to put it mildly, so I wouldn't expect clashes.
I'm worried exactly because new filetypes are introduced so
infrequently. No such thing happened during the lifetime of Linux,
AFAICT. Backup/restore tools are not going to handle it. File
managers are not going to show anything sane (and quite possibly some
will simply crash).
Yes, all that can be fixed, but it will be a slow and painful process,
since union/overlay type filesystems are themselves quite specialized
and problems are not going to be shaken out quickly with userspace
interaction. On the other hand if a whiteout is exported to userspace
as a symlink with xattrs, there will be much less of those problems.
BTW I'm not against adding whiteout support to the VFS and possibly to
filesystems to clean up the security uglilness and optimize storage.
But I think the most important aspect is the userspace interface and
I'm far from sure that a new filetype is the best solution here.
Thanks,
Miklos
Sorry for being so very late to the party, but rather than messing
with xattrs, why not just have a specific file (say, default /.whiteout,
but selectable via a mount option) and links to it are counted as
whiteout entries?
All you need to do is resolve the link (it's probably a good idea
to allow symlinks to avoid hard-link count limits) and compare the
fs and inode number.
It's kind of hackish, but it could be done on pretty much *any* Unix
file system. VFAT would be SOL, but that's probably acceptable.
(Any security options on mount point crossing or consistent ownership
of symlinks that you want to impose would probably be of general VFS
use, and again, not FS-specific.)
On Wed, Mar 20, 2013 at 03:41:08PM -0400, George Spelvin wrote:
> Sorry for being so very late to the party, but rather than messing
> with xattrs, why not just have a specific file (say, default /.whiteout,
> but selectable via a mount option) and links to it are counted as
> whiteout entries?
>
> All you need to do is resolve the link (it's probably a good idea
> to allow symlinks to avoid hard-link count limits) and compare the
> fs and inode number.
>
> It's kind of hackish, but it could be done on pretty much *any* Unix
> file system. VFAT would be SOL, but that's probably acceptable.
>
> (Any security options on mount point crossing or consistent ownership
> of symlinks that you want to impose would probably be of general VFS
> use, and again, not FS-specific.)
Yeah... Now, think what rm -rf foo/ would be doing. We read the
underlying directory. For each file in it we create a link to that
magical file of yours in covering one. _Then_ we do rmdir(), and what a joy
it turns out to be - we
* lock the covering directory
* read the covering directory and stat everything in it, checking that
it's a link to your file; we also read the underlying directory and verify
that everything in it has a matching whiteout in the covering one.
* once we are through, we read it *again*, this time doing unlinks
and praying real hard we won't crash in the meanwhile
* once that is finished, we finally can rmdir the fucker and unlock
its inode.
I'm sorry, but this is insane.
On 03/20/2013 02:41:08 PM, George Spelvin wrote:
> Sorry for being so very late to the party, but rather than messing
> with xattrs, why not just have a specific file (say, default
> /.whiteout,
> but selectable via a mount option) and links to it are counted as
> whiteout entries?
Don't the ext2 descendants have an array of reserved inodes at the
beginning (one of which is the journal for ext3, most of which were
still unused last I checked)?
Maybe ext4 optimized this away, but that would mean it doesn't have to
be anywhere in the namespace.
(Of course you'd have to teach fsck about it, and probably not update
the link count or you'll have contention and integer overflow...)
Rob-