2013-03-13 14:16:54

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

Here's another version with the comments addressed plus a small bugfix and some
checkpatch cleanups.

Changes in v17:

- fix wrong return value in a failure path in ovl_link()
- fix subjects
- use file_inode() and MODULE_ALIAS_FS()
- fold bugfix patches
- checkpatch cleanups

Git tree is here:

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.current

Thanks,
Miklos

---
Andy Whitcroft (1):
overlayfs: add statfs support

Erez Zadok (1):
overlayfs: implement show_options

Miklos Szeredi (6):
vfs: add i_op->dentry_open()
vfs: export do_splice_direct() to modules
vfs: export __inode_permission() to modules
vfs: introduce clone_private_mount()
overlay filesystem
fs: limit filesystem stacking depth

Neil Brown (1):
overlay: overlay filesystem documentation

---
Documentation/filesystems/Locking | 2 +
Documentation/filesystems/overlayfs.txt | 199 +++++++++
Documentation/filesystems/vfs.txt | 7 +
MAINTAINERS | 7 +
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/ecryptfs/main.c | 7 +
fs/internal.h | 5 -
fs/namei.c | 10 +-
fs/namespace.c | 18 +
fs/open.c | 23 +-
fs/overlayfs/Kconfig | 10 +
fs/overlayfs/Makefile | 7 +
fs/overlayfs/copy_up.c | 385 +++++++++++++++++
fs/overlayfs/dir.c | 605 +++++++++++++++++++++++++++
fs/overlayfs/inode.c | 372 +++++++++++++++++
fs/overlayfs/overlayfs.h | 70 ++++
fs/overlayfs/readdir.c | 566 +++++++++++++++++++++++++
fs/overlayfs/super.c | 686 +++++++++++++++++++++++++++++++
fs/splice.c | 1 +
include/linux/fs.h | 14 +
include/linux/mount.h | 3 +
22 files changed, 2989 insertions(+), 10 deletions(-)
create mode 100644 Documentation/filesystems/overlayfs.txt
create mode 100644 fs/overlayfs/Kconfig
create mode 100644 fs/overlayfs/Makefile
create mode 100644 fs/overlayfs/copy_up.c
create mode 100644 fs/overlayfs/dir.c
create mode 100644 fs/overlayfs/inode.c
create mode 100644 fs/overlayfs/overlayfs.h
create mode 100644 fs/overlayfs/readdir.c
create mode 100644 fs/overlayfs/super.c


2013-03-13 14:16:57

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 1/9] vfs: add i_op->dentry_open()

From: Miklos Szeredi <[email protected]>

Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
that want to return a struct file from a different filesystem.

Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/Locking | 2 ++
Documentation/filesystems/vfs.txt | 7 +++++++
fs/namei.c | 9 ++++++---
fs/open.c | 23 +++++++++++++++++++++--
include/linux/fs.h | 2 ++
5 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 0706d32..4331290 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -66,6 +66,7 @@ prototypes:
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
+ int (*dentry_open)(struct dentry *, struct file *, const struct cred *);

locking rules:
all may block
@@ -93,6 +94,7 @@ removexattr: yes
fiemap: no
update_time: no
atomic_open: yes
+dentry_open: no

Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
victim.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index bc4b06b..f64a4d1 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -362,6 +362,7 @@ struct inode_operations {
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
+ int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
};

Again, all methods are called without any locks being held, unless
@@ -681,6 +682,12 @@ struct address_space_operations {
but instead uses bmap to find out where the blocks in the file
are and uses those addresses directly.

+ dentry_open: this is an alternative to f_op->open(), the difference is that
+ this method may open a file not necessarily originating from the same
+ filesystem as the one i_op->open() was called on. It may be
+ useful for stacking filesystems which want to allow native I/O directly
+ on underlying files.
+

invalidatepage: If a page has PagePrivate set, then invalidatepage
will be called when part or all of the page is to be removed
diff --git a/fs/namei.c b/fs/namei.c
index 57ae9c8..c1c0c15 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2867,9 +2867,12 @@ finish_open_created:
error = may_open(&nd->path, acc_mode, open_flag);
if (error)
goto out;
- file->f_path.mnt = nd->path.mnt;
- error = finish_open(file, nd->path.dentry, NULL, opened);
- if (error) {
+
+ BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */
+ error = vfs_open(&nd->path, file, current_cred());
+ if (!error) {
+ *opened |= FILE_OPENED;
+ } else {
if (error == -EOPENSTALE)
goto stale_open;
goto out;
diff --git a/fs/open.c b/fs/open.c
index 6835446..b9d9f9e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -828,8 +828,7 @@ struct file *dentry_open(const struct path *path, int flags,
f = get_empty_filp();
if (!IS_ERR(f)) {
f->f_flags = flags;
- f->f_path = *path;
- error = do_dentry_open(f, NULL, cred);
+ error = vfs_open(path, f, cred);
if (!error) {
/* from now on we need fput() to dispose of f */
error = open_check_o_direct(f);
@@ -846,6 +845,26 @@ struct file *dentry_open(const struct path *path, int flags,
}
EXPORT_SYMBOL(dentry_open);

+/**
+ * vfs_open - open the file at the given path
+ * @path: path to open
+ * @filp: newly allocated file with f_flag initialized
+ * @cred: credentials to use
+ */
+int vfs_open(const struct path *path, struct file *filp,
+ const struct cred *cred)
+{
+ struct inode *inode = path->dentry->d_inode;
+
+ if (inode->i_op->dentry_open)
+ return inode->i_op->dentry_open(path->dentry, filp, cred);
+ else {
+ filp->f_path = *path;
+ return do_dentry_open(filp, NULL, cred);
+ }
+}
+EXPORT_SYMBOL(vfs_open);
+
static inline int build_open_flags(int flags, umode_t mode, struct open_flags *op)
{
int lookup_flags = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..891c93d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1573,6 +1573,7 @@ struct inode_operations {
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
+ int (*dentry_open)(struct dentry *, struct file *, const struct cred *);
} ____cacheline_aligned;

ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
@@ -2006,6 +2007,7 @@ extern struct file *file_open_name(struct filename *, int, umode_t);
extern struct file *filp_open(const char *, int, umode_t);
extern struct file *file_open_root(struct dentry *, struct vfsmount *,
const char *, int);
+extern int vfs_open(const struct path *, struct file *, const struct cred *);
extern struct file * dentry_open(const struct path *, int, const struct cred *);
extern int filp_close(struct file *, fl_owner_t id);

--
1.7.10.4

2013-03-13 14:17:01

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 3/9] vfs: export __inode_permission() to modules

From: Miklos Szeredi <[email protected]>

We need to be able to check inode permissions (but not filesystem implied
permissions) for stackable filesystems. Expose this interface for overlayfs.

Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/internal.h | 5 -----
fs/namei.c | 1 +
include/linux/fs.h | 1 +
3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 507141f..89481ac 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -42,11 +42,6 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
extern void __init chrdev_init(void);

/*
- * namei.c
- */
-extern int __inode_permission(struct inode *, int);
-
-/*
* namespace.c
*/
extern int copy_mount_options(const void __user *, unsigned long *);
diff --git a/fs/namei.c b/fs/namei.c
index c1c0c15..e1cffbd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -402,6 +402,7 @@ int __inode_permission(struct inode *inode, int mask)

return security_inode_permission(inode, mask);
}
+EXPORT_SYMBOL(__inode_permission);

/**
* sb_permission - Check superblock-level permissions
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 891c93d..fd98a86 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2213,6 +2213,7 @@ extern sector_t bmap(struct inode *, sector_t);
#endif
extern int notify_change(struct dentry *, struct iattr *);
extern int inode_permission(struct inode *, int);
+extern int __inode_permission(struct inode *, int);
extern int generic_permission(struct inode *, int);

static inline bool execute_ok(struct inode *inode)
--
1.7.10.4

2013-03-13 14:17:16

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 5/9] overlay filesystem

From: Miklos Szeredi <[email protected]>

Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree. All modifications
go to the upper, writable layer.

This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.

The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems. This
simplifies the implementation and allows native performance in these
cases.

The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS. This uses slightly more memory than union mounts, but dentries
are relatively small.

Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.

Opening non directories results in the open forwarded to the
underlying filesystem. This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).

Usage:

mount -t overlay -olowerdir=/lower,upperdir=/upper overlay /mnt

Supported:

- all operations

Missing:

- Currently a crash in the middle of copy-up, rename, unlink, rmdir or create
over a whiteout may result in filesystem corruption on the overlay level.
IOW these operations need to become atomic or at least the corruption needs
to be detected.


The following cotributions have been folded into this patch:

Neil Brown <[email protected]>:
- minimal remount support
- use correct seek function for directories
- initialise is_real before use
- rename ovl_fill_cache to ovl_dir_read

Felix Fietkau <[email protected]>:
- fix a deadlock in ovl_dir_read_merged
- fix a deadlock in ovl_remove_whiteouts

Erez Zadok <[email protected]>
- fix cleanup after WARN_ON

Sedat Dilek <[email protected]>
- fix up permission to confirm to new API

Robin Dong <[email protected]>
- fix possible leak in ovl_new_inode
- create new inode in ovl_link

Andy Whitcroft <[email protected]>
- switch to __inode_permission()
- copy up i_uid/i_gid from the underlying inode

Also thanks to the following people for testing and reporting bugs:

Jordi Pujol <[email protected]>
Andy Whitcroft <[email protected]>
Michal Suchanek <[email protected]>
Felix Fietkau <[email protected]>
Erez Zadok <[email protected]>
Randy Dunlap <[email protected]>

Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/overlayfs/Kconfig | 10 +
fs/overlayfs/Makefile | 7 +
fs/overlayfs/copy_up.c | 385 +++++++++++++++++++++++++++++
fs/overlayfs/dir.c | 603 +++++++++++++++++++++++++++++++++++++++++++++
fs/overlayfs/inode.c | 370 ++++++++++++++++++++++++++++
fs/overlayfs/overlayfs.h | 64 +++++
fs/overlayfs/readdir.c | 566 ++++++++++++++++++++++++++++++++++++++++++
fs/overlayfs/super.c | 612 ++++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 2619 insertions(+)
create mode 100644 fs/overlayfs/Kconfig
create mode 100644 fs/overlayfs/Makefile
create mode 100644 fs/overlayfs/copy_up.c
create mode 100644 fs/overlayfs/dir.c
create mode 100644 fs/overlayfs/inode.c
create mode 100644 fs/overlayfs/overlayfs.h
create mode 100644 fs/overlayfs/readdir.c
create mode 100644 fs/overlayfs/super.c

diff --git a/fs/Kconfig b/fs/Kconfig
index 780725a..9e2ccd5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -67,6 +67,7 @@ source "fs/quota/Kconfig"

source "fs/autofs4/Kconfig"
source "fs/fuse/Kconfig"
+source "fs/overlayfs/Kconfig"

config GENERIC_ACL
bool
diff --git a/fs/Makefile b/fs/Makefile
index 9d53192..479a720 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -107,6 +107,7 @@ obj-$(CONFIG_QNX6FS_FS) += qnx6/
obj-$(CONFIG_AUTOFS4_FS) += autofs4/
obj-$(CONFIG_ADFS_FS) += adfs/
obj-$(CONFIG_FUSE_FS) += fuse/
+obj-$(CONFIG_OVERLAYFS_FS) += overlayfs/
obj-$(CONFIG_UDF_FS) += udf/
obj-$(CONFIG_SUN_OPENPROMFS) += openpromfs/
obj-$(CONFIG_OMFS_FS) += omfs/
diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
new file mode 100644
index 0000000..e601259
--- /dev/null
+++ b/fs/overlayfs/Kconfig
@@ -0,0 +1,10 @@
+config OVERLAYFS_FS
+ tristate "Overlay filesystem support"
+ help
+ An overlay filesystem combines two filesystems - an 'upper' filesystem
+ and a 'lower' filesystem. When a name exists in both filesystems, the
+ object in the 'upper' filesystem is visible while the object in the
+ 'lower' filesystem is either hidden or, in the case of directories,
+ merged with the 'upper' object.
+
+ For more information see Documentation/filesystems/overlayfs.txt
diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
new file mode 100644
index 0000000..8f91889
--- /dev/null
+++ b/fs/overlayfs/Makefile
@@ -0,0 +1,7 @@
+#
+# Makefile for the overlay filesystem.
+#
+
+obj-$(CONFIG_OVERLAYFS_FS) += overlayfs.o
+
+overlayfs-objs := super.o inode.o dir.o readdir.o copy_up.o
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
new file mode 100644
index 0000000..eef85e0
--- /dev/null
+++ b/fs/overlayfs/copy_up.c
@@ -0,0 +1,385 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/splice.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/uaccess.h>
+#include <linux/sched.h>
+#include "overlayfs.h"
+
+#define OVL_COPY_UP_CHUNK_SIZE (1 << 20)
+
+static int ovl_copy_up_xattr(struct dentry *old, struct dentry *new)
+{
+ ssize_t list_size, size;
+ char *buf, *name, *value;
+ int error;
+
+ if (!old->d_inode->i_op->getxattr ||
+ !new->d_inode->i_op->getxattr)
+ return 0;
+
+ list_size = vfs_listxattr(old, NULL, 0);
+ if (list_size <= 0) {
+ if (list_size == -EOPNOTSUPP)
+ return 0;
+ return list_size;
+ }
+
+ buf = kzalloc(list_size, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ error = -ENOMEM;
+ value = kmalloc(XATTR_SIZE_MAX, GFP_KERNEL);
+ if (!value)
+ goto out;
+
+ list_size = vfs_listxattr(old, buf, list_size);
+ if (list_size <= 0) {
+ error = list_size;
+ goto out_free_value;
+ }
+
+ for (name = buf; name < (buf + list_size); name += strlen(name) + 1) {
+ size = vfs_getxattr(old, name, value, XATTR_SIZE_MAX);
+ if (size <= 0) {
+ error = size;
+ goto out_free_value;
+ }
+ error = vfs_setxattr(new, name, value, size, 0);
+ if (error)
+ goto out_free_value;
+ }
+
+out_free_value:
+ kfree(value);
+out:
+ kfree(buf);
+ return error;
+}
+
+static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
+{
+ struct file *old_file;
+ struct file *new_file;
+ int error = 0;
+
+ if (len == 0)
+ return 0;
+
+ old_file = ovl_path_open(old, O_RDONLY);
+ if (IS_ERR(old_file))
+ return PTR_ERR(old_file);
+
+ new_file = ovl_path_open(new, O_WRONLY);
+ if (IS_ERR(new_file)) {
+ error = PTR_ERR(new_file);
+ goto out_fput;
+ }
+
+ /* FIXME: copy up sparse files efficiently */
+ while (len) {
+ loff_t offset = new_file->f_pos;
+ size_t this_len = OVL_COPY_UP_CHUNK_SIZE;
+ long bytes;
+
+ if (len < this_len)
+ this_len = len;
+
+ if (signal_pending_state(TASK_KILLABLE, current)) {
+ error = -EINTR;
+ break;
+ }
+
+ bytes = do_splice_direct(old_file, &offset, new_file, this_len,
+ SPLICE_F_MOVE);
+ if (bytes <= 0) {
+ error = bytes;
+ break;
+ }
+
+ len -= bytes;
+ }
+
+ fput(new_file);
+out_fput:
+ fput(old_file);
+ return error;
+}
+
+static char *ovl_read_symlink(struct dentry *realdentry)
+{
+ int res;
+ char *buf;
+ struct inode *inode = realdentry->d_inode;
+ mm_segment_t old_fs;
+
+ res = -EINVAL;
+ if (!inode->i_op->readlink)
+ goto err;
+
+ res = -ENOMEM;
+ buf = (char *) __get_free_page(GFP_KERNEL);
+ if (!buf)
+ goto err;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ /* The cast to a user pointer is valid due to the set_fs() */
+ res = inode->i_op->readlink(realdentry,
+ (char __user *)buf, PAGE_SIZE - 1);
+ set_fs(old_fs);
+ if (res < 0) {
+ free_page((unsigned long) buf);
+ goto err;
+ }
+ buf[res] = '\0';
+
+ return buf;
+
+err:
+ return ERR_PTR(res);
+}
+
+static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
+{
+ struct iattr attr = {
+ .ia_valid =
+ ATTR_ATIME | ATTR_MTIME | ATTR_ATIME_SET | ATTR_MTIME_SET,
+ .ia_atime = stat->atime,
+ .ia_mtime = stat->mtime,
+ };
+
+ return notify_change(upperdentry, &attr);
+}
+
+static int ovl_set_mode(struct dentry *upperdentry, umode_t mode)
+{
+ struct iattr attr = {
+ .ia_valid = ATTR_MODE,
+ .ia_mode = mode,
+ };
+
+ return notify_change(upperdentry, &attr);
+}
+
+static int ovl_copy_up_locked(struct dentry *upperdir, struct dentry *dentry,
+ struct path *lowerpath, struct kstat *stat,
+ const char *link)
+{
+ int err;
+ struct path newpath;
+ umode_t mode = stat->mode;
+
+ /* Can't properly set mode on creation because of the umask */
+ stat->mode &= S_IFMT;
+
+ ovl_path_upper(dentry, &newpath);
+ WARN_ON(newpath.dentry);
+ newpath.dentry = ovl_upper_create(upperdir, dentry, stat, link);
+ if (IS_ERR(newpath.dentry))
+ return PTR_ERR(newpath.dentry);
+
+ if (S_ISREG(stat->mode)) {
+ err = ovl_copy_up_data(lowerpath, &newpath, stat->size);
+ if (err)
+ goto err_remove;
+ }
+
+ err = ovl_copy_up_xattr(lowerpath->dentry, newpath.dentry);
+ if (err)
+ goto err_remove;
+
+ mutex_lock(&newpath.dentry->d_inode->i_mutex);
+ if (!S_ISLNK(stat->mode))
+ err = ovl_set_mode(newpath.dentry, mode);
+ if (!err)
+ err = ovl_set_timestamps(newpath.dentry, stat);
+ mutex_unlock(&newpath.dentry->d_inode->i_mutex);
+ if (err)
+ goto err_remove;
+
+ ovl_dentry_update(dentry, newpath.dentry);
+
+ /*
+ * Easiest way to get rid of the lower dentry reference is to
+ * drop this dentry. This is neither needed nor possible for
+ * directories.
+ */
+ if (!S_ISDIR(stat->mode))
+ d_drop(dentry);
+
+ return 0;
+
+err_remove:
+ if (S_ISDIR(stat->mode))
+ vfs_rmdir(upperdir->d_inode, newpath.dentry);
+ else
+ vfs_unlink(upperdir->d_inode, newpath.dentry);
+
+ dput(newpath.dentry);
+
+ return err;
+}
+
+/*
+ * Copy up a single dentry
+ *
+ * Directory renames only allowed on "pure upper" (already created on
+ * upper filesystem, never copied up). Directories which are on lower or
+ * are merged may not be renamed. For these -EXDEV is returned and
+ * userspace has to deal with it. This means, when copying up a
+ * directory we can rely on it and ancestors being stable.
+ *
+ * Non-directory renames start with copy up of source if necessary. The
+ * actual rename will only proceed once the copy up was successful. Copy
+ * up uses upper parent i_mutex for exclusion. Since rename can change
+ * d_parent it is possible that the copy up will lock the old parent. At
+ * that point the file will have already been copied up anyway.
+ */
+static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
+ struct path *lowerpath, struct kstat *stat)
+{
+ int err;
+ struct kstat pstat;
+ struct path parentpath;
+ struct dentry *upperdir;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+ char *link = NULL;
+
+ ovl_path_upper(parent, &parentpath);
+ upperdir = parentpath.dentry;
+
+ err = vfs_getattr(&parentpath, &pstat);
+ if (err)
+ return err;
+
+ if (S_ISLNK(stat->mode)) {
+ link = ovl_read_symlink(lowerpath->dentry);
+ if (IS_ERR(link))
+ return PTR_ERR(link);
+ }
+
+ err = -ENOMEM;
+ override_cred = prepare_creds();
+ if (!override_cred)
+ goto out_free_link;
+
+ override_cred->fsuid = stat->uid;
+ override_cred->fsgid = stat->gid;
+ /*
+ * CAP_SYS_ADMIN for copying up extended attributes
+ * CAP_DAC_OVERRIDE for create
+ * CAP_FOWNER for chmod, timestamp update
+ * CAP_FSETID for chmod
+ * CAP_MKNOD for mknod
+ */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+ cap_raise(override_cred->cap_effective, CAP_FOWNER);
+ cap_raise(override_cred->cap_effective, CAP_FSETID);
+ cap_raise(override_cred->cap_effective, CAP_MKNOD);
+ old_cred = override_creds(override_cred);
+
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+ if (ovl_path_type(dentry) != OVL_PATH_LOWER) {
+ err = 0;
+ } else {
+ err = ovl_copy_up_locked(upperdir, dentry, lowerpath,
+ stat, link);
+ if (!err) {
+ /* Restore timestamps on parent (best effort) */
+ ovl_set_timestamps(upperdir, &pstat);
+ }
+ }
+
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+out_free_link:
+ if (link)
+ free_page((unsigned long) link);
+
+ return err;
+}
+
+int ovl_copy_up(struct dentry *dentry)
+{
+ int err;
+
+ err = 0;
+ while (!err) {
+ struct dentry *next;
+ struct dentry *parent;
+ struct path lowerpath;
+ struct kstat stat;
+ enum ovl_path_type type = ovl_path_type(dentry);
+
+ if (type != OVL_PATH_LOWER)
+ break;
+
+ next = dget(dentry);
+ /* find the topmost dentry not yet copied up */
+ for (;;) {
+ parent = dget_parent(next);
+
+ type = ovl_path_type(parent);
+ if (type != OVL_PATH_LOWER)
+ break;
+
+ dput(next);
+ next = parent;
+ }
+
+ ovl_path_lower(next, &lowerpath);
+ err = vfs_getattr(&lowerpath, &stat);
+ if (!err)
+ err = ovl_copy_up_one(parent, next, &lowerpath, &stat);
+
+ dput(parent);
+ dput(next);
+ }
+
+ return err;
+}
+
+/* Optimize by not copying up the file first and truncating later */
+int ovl_copy_up_truncate(struct dentry *dentry, loff_t size)
+{
+ int err;
+ struct kstat stat;
+ struct path lowerpath;
+ struct dentry *parent = dget_parent(dentry);
+
+ err = ovl_copy_up(parent);
+ if (err)
+ goto out_dput_parent;
+
+ ovl_path_lower(dentry, &lowerpath);
+ err = vfs_getattr(&lowerpath, &stat);
+ if (err)
+ goto out_dput_parent;
+
+ if (size < stat.size)
+ stat.size = size;
+
+ err = ovl_copy_up_one(parent, dentry, &lowerpath, &stat);
+
+out_dput_parent:
+ dput(parent);
+ return err;
+}
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
new file mode 100644
index 0000000..5da5be3
--- /dev/null
+++ b/fs/overlayfs/dir.c
@@ -0,0 +1,603 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/cred.h>
+#include "overlayfs.h"
+
+static const char *ovl_whiteout_symlink = "(overlay-whiteout)";
+
+static int ovl_whiteout(struct dentry *upperdir, struct dentry *dentry)
+{
+ int err;
+ struct dentry *newdentry;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ /* FIXME: recheck lower dentry to see if whiteout is really needed */
+
+ err = -ENOMEM;
+ override_cred = prepare_creds();
+ if (!override_cred)
+ goto out;
+
+ /*
+ * CAP_SYS_ADMIN for setxattr
+ * CAP_DAC_OVERRIDE for symlink creation
+ * CAP_FOWNER for unlink in sticky directory
+ */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+ cap_raise(override_cred->cap_effective, CAP_FOWNER);
+ override_cred->fsuid = GLOBAL_ROOT_UID;
+ override_cred->fsgid = GLOBAL_ROOT_GID;
+ old_cred = override_creds(override_cred);
+
+ newdentry = lookup_one_len(dentry->d_name.name, upperdir,
+ dentry->d_name.len);
+ err = PTR_ERR(newdentry);
+ if (IS_ERR(newdentry))
+ goto out_put_cred;
+
+ /* Just been removed within the same locked region */
+ WARN_ON(newdentry->d_inode);
+
+ err = vfs_symlink(upperdir->d_inode, newdentry, ovl_whiteout_symlink);
+ if (err)
+ goto out_dput;
+
+ ovl_dentry_version_inc(dentry->d_parent);
+
+ err = vfs_setxattr(newdentry, ovl_whiteout_xattr, "y", 1, 0);
+ if (err)
+ vfs_unlink(upperdir->d_inode, newdentry);
+
+out_dput:
+ dput(newdentry);
+out_put_cred:
+ revert_creds(old_cred);
+ put_cred(override_cred);
+out:
+ if (err) {
+ /*
+ * There's no way to recover from failure to whiteout.
+ * What should we do? Log a big fat error and... ?
+ */
+ pr_err("overlayfs: ERROR - failed to whiteout '%s'\n",
+ dentry->d_name.name);
+ }
+
+ return err;
+}
+
+static struct dentry *ovl_lookup_create(struct dentry *upperdir,
+ struct dentry *template)
+{
+ int err;
+ struct dentry *newdentry;
+ struct qstr *name = &template->d_name;
+
+ newdentry = lookup_one_len(name->name, upperdir, name->len);
+ if (IS_ERR(newdentry))
+ return newdentry;
+
+ if (newdentry->d_inode) {
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ /* No need to check whiteout if lower parent is non-existent */
+ err = -EEXIST;
+ if (!ovl_dentry_lower(template->d_parent))
+ goto out_dput;
+
+ if (!S_ISLNK(newdentry->d_inode->i_mode))
+ goto out_dput;
+
+ err = -ENOMEM;
+ override_cred = prepare_creds();
+ if (!override_cred)
+ goto out_dput;
+
+ /*
+ * CAP_SYS_ADMIN for getxattr
+ * CAP_FOWNER for unlink in sticky directory
+ */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_FOWNER);
+ old_cred = override_creds(override_cred);
+
+ err = -EEXIST;
+ if (ovl_is_whiteout(newdentry))
+ err = vfs_unlink(upperdir->d_inode, newdentry);
+
+ revert_creds(old_cred);
+ put_cred(override_cred);
+ if (err)
+ goto out_dput;
+
+ dput(newdentry);
+ newdentry = lookup_one_len(name->name, upperdir, name->len);
+ if (IS_ERR(newdentry)) {
+ ovl_whiteout(upperdir, template);
+ return newdentry;
+ }
+
+ /*
+ * Whiteout just been successfully removed, parent
+ * i_mutex is still held, there's no way the lookup
+ * could return positive.
+ */
+ WARN_ON(newdentry->d_inode);
+ }
+
+ return newdentry;
+
+out_dput:
+ dput(newdentry);
+ return ERR_PTR(err);
+}
+
+struct dentry *ovl_upper_create(struct dentry *upperdir, struct dentry *dentry,
+ struct kstat *stat, const char *link)
+{
+ int err;
+ struct dentry *newdentry;
+ struct inode *dir = upperdir->d_inode;
+
+ newdentry = ovl_lookup_create(upperdir, dentry);
+ if (IS_ERR(newdentry))
+ goto out;
+
+ switch (stat->mode & S_IFMT) {
+ case S_IFREG:
+ err = vfs_create(dir, newdentry, stat->mode, NULL);
+ break;
+
+ case S_IFDIR:
+ err = vfs_mkdir(dir, newdentry, stat->mode);
+ break;
+
+ case S_IFCHR:
+ case S_IFBLK:
+ case S_IFIFO:
+ case S_IFSOCK:
+ err = vfs_mknod(dir, newdentry, stat->mode, stat->rdev);
+ break;
+
+ case S_IFLNK:
+ err = vfs_symlink(dir, newdentry, link);
+ break;
+
+ default:
+ err = -EPERM;
+ }
+ if (err) {
+ if (ovl_dentry_is_opaque(dentry))
+ ovl_whiteout(upperdir, dentry);
+ dput(newdentry);
+ newdentry = ERR_PTR(err);
+ } else if (WARN_ON(!newdentry->d_inode)) {
+ /*
+ * Not quite sure if non-instantiated dentry is legal or not.
+ * VFS doesn't seem to care so check and warn here.
+ */
+ dput(newdentry);
+ newdentry = ERR_PTR(-ENOENT);
+ }
+
+out:
+ return newdentry;
+
+}
+
+static int ovl_set_opaque(struct dentry *upperdentry)
+{
+ int err;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ override_cred = prepare_creds();
+ if (!override_cred)
+ return -ENOMEM;
+
+ /* CAP_SYS_ADMIN for setxattr of "trusted" namespace */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ old_cred = override_creds(override_cred);
+ err = vfs_setxattr(upperdentry, ovl_opaque_xattr, "y", 1, 0);
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+ return err;
+}
+
+static int ovl_remove_opaque(struct dentry *upperdentry)
+{
+ int err;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ override_cred = prepare_creds();
+ if (!override_cred)
+ return -ENOMEM;
+
+ /* CAP_SYS_ADMIN for removexattr of "trusted" namespace */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ old_cred = override_creds(override_cred);
+ err = vfs_removexattr(upperdentry, ovl_opaque_xattr);
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+ return err;
+}
+
+static int ovl_dir_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat)
+{
+ int err;
+ enum ovl_path_type type;
+ struct path realpath;
+
+ type = ovl_path_real(dentry, &realpath);
+ err = vfs_getattr(&realpath, stat);
+ if (err)
+ return err;
+
+ stat->dev = dentry->d_sb->s_dev;
+ stat->ino = dentry->d_inode->i_ino;
+
+ /*
+ * It's probably not worth it to count subdirs to get the
+ * correct link count. nlink=1 seems to pacify 'find' and
+ * other utilities.
+ */
+ if (type == OVL_PATH_MERGE)
+ stat->nlink = 1;
+
+ return 0;
+}
+
+static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
+ const char *link)
+{
+ int err;
+ struct dentry *newdentry;
+ struct dentry *upperdir;
+ struct inode *inode;
+ struct kstat stat = {
+ .mode = mode,
+ .rdev = rdev,
+ };
+
+ err = -ENOMEM;
+ inode = ovl_new_inode(dentry->d_sb, mode, dentry->d_fsdata);
+ if (!inode)
+ goto out;
+
+ err = ovl_copy_up(dentry->d_parent);
+ if (err)
+ goto out_iput;
+
+ upperdir = ovl_dentry_upper(dentry->d_parent);
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+
+ newdentry = ovl_upper_create(upperdir, dentry, &stat, link);
+ err = PTR_ERR(newdentry);
+ if (IS_ERR(newdentry))
+ goto out_unlock;
+
+ ovl_dentry_version_inc(dentry->d_parent);
+ if (ovl_dentry_is_opaque(dentry) && S_ISDIR(mode)) {
+ err = ovl_set_opaque(newdentry);
+ if (err) {
+ vfs_rmdir(upperdir->d_inode, newdentry);
+ ovl_whiteout(upperdir, dentry);
+ goto out_dput;
+ }
+ }
+ ovl_dentry_update(dentry, newdentry);
+ d_instantiate(dentry, inode);
+ inode = NULL;
+ newdentry = NULL;
+ err = 0;
+
+out_dput:
+ dput(newdentry);
+out_unlock:
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+out_iput:
+ iput(inode);
+out:
+ return err;
+}
+
+static int ovl_create(struct inode *dir, struct dentry *dentry, umode_t mode,
+ bool excl)
+{
+ return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL);
+}
+
+static int ovl_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+ return ovl_create_object(dentry, (mode & 07777) | S_IFDIR, 0, NULL);
+}
+
+static int ovl_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
+ dev_t rdev)
+{
+ return ovl_create_object(dentry, mode, rdev, NULL);
+}
+
+static int ovl_symlink(struct inode *dir, struct dentry *dentry,
+ const char *link)
+{
+ return ovl_create_object(dentry, S_IFLNK, 0, link);
+}
+
+static int ovl_do_remove(struct dentry *dentry, bool is_dir)
+{
+ int err;
+ enum ovl_path_type type;
+ struct path realpath;
+ struct dentry *upperdir;
+
+ err = ovl_copy_up(dentry->d_parent);
+ if (err)
+ return err;
+
+ upperdir = ovl_dentry_upper(dentry->d_parent);
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+ type = ovl_path_real(dentry, &realpath);
+ if (type != OVL_PATH_LOWER) {
+ err = -ESTALE;
+ if (realpath.dentry->d_parent != upperdir)
+ goto out_d_drop;
+
+ /* FIXME: create whiteout up front and rename to target */
+
+ if (is_dir)
+ err = vfs_rmdir(upperdir->d_inode, realpath.dentry);
+ else
+ err = vfs_unlink(upperdir->d_inode, realpath.dentry);
+ if (err)
+ goto out_d_drop;
+
+ ovl_dentry_version_inc(dentry->d_parent);
+ }
+
+ if (type != OVL_PATH_UPPER || ovl_dentry_is_opaque(dentry))
+ err = ovl_whiteout(upperdir, dentry);
+
+ /*
+ * Keeping this dentry hashed would mean having to release
+ * upperpath/lowerpath, which could only be done if we are the
+ * sole user of this dentry. Too tricky... Just unhash for
+ * now.
+ */
+out_d_drop:
+ d_drop(dentry);
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+
+ return err;
+}
+
+static int ovl_unlink(struct inode *dir, struct dentry *dentry)
+{
+ return ovl_do_remove(dentry, false);
+}
+
+
+static int ovl_rmdir(struct inode *dir, struct dentry *dentry)
+{
+ int err;
+ enum ovl_path_type type;
+
+ type = ovl_path_type(dentry);
+ if (type != OVL_PATH_UPPER) {
+ err = ovl_check_empty_and_clear(dentry, type);
+ if (err)
+ return err;
+ }
+
+ return ovl_do_remove(dentry, true);
+}
+
+static int ovl_link(struct dentry *old, struct inode *newdir,
+ struct dentry *new)
+{
+ int err;
+ struct dentry *olddentry;
+ struct dentry *newdentry;
+ struct dentry *upperdir;
+ struct inode *newinode;
+
+ err = ovl_copy_up(old);
+ if (err)
+ goto out;
+
+ err = ovl_copy_up(new->d_parent);
+ if (err)
+ goto out;
+
+ upperdir = ovl_dentry_upper(new->d_parent);
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+ newdentry = ovl_lookup_create(upperdir, new);
+ err = PTR_ERR(newdentry);
+ if (IS_ERR(newdentry))
+ goto out_unlock;
+
+ olddentry = ovl_dentry_upper(old);
+ err = vfs_link(olddentry, upperdir->d_inode, newdentry);
+ if (!err) {
+ if (WARN_ON(!newdentry->d_inode)) {
+ dput(newdentry);
+ err = -ENOENT;
+ goto out_unlock;
+ }
+ newinode = ovl_new_inode(old->d_sb, newdentry->d_inode->i_mode,
+ new->d_fsdata);
+ if (!newinode) {
+ err = -ENOMEM;
+ goto link_fail;
+ }
+
+ ovl_dentry_version_inc(new->d_parent);
+ ovl_dentry_update(new, newdentry);
+
+ d_instantiate(new, newinode);
+ } else {
+link_fail:
+ if (ovl_dentry_is_opaque(new))
+ ovl_whiteout(upperdir, new);
+ dput(newdentry);
+ }
+out_unlock:
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+out:
+ return err;
+}
+
+static int ovl_rename(struct inode *olddir, struct dentry *old,
+ struct inode *newdir, struct dentry *new)
+{
+ int err;
+ enum ovl_path_type old_type;
+ enum ovl_path_type new_type;
+ struct dentry *old_upperdir;
+ struct dentry *new_upperdir;
+ struct dentry *olddentry;
+ struct dentry *newdentry;
+ struct dentry *trap;
+ bool old_opaque;
+ bool new_opaque;
+ bool new_create = false;
+ bool is_dir = S_ISDIR(old->d_inode->i_mode);
+
+ /* Don't copy up directory trees */
+ old_type = ovl_path_type(old);
+ if (old_type != OVL_PATH_UPPER && is_dir)
+ return -EXDEV;
+
+ if (new->d_inode) {
+ new_type = ovl_path_type(new);
+
+ if (new_type == OVL_PATH_LOWER && old_type == OVL_PATH_LOWER) {
+ if (ovl_dentry_lower(old)->d_inode ==
+ ovl_dentry_lower(new)->d_inode)
+ return 0;
+ }
+ if (new_type != OVL_PATH_LOWER && old_type != OVL_PATH_LOWER) {
+ if (ovl_dentry_upper(old)->d_inode ==
+ ovl_dentry_upper(new)->d_inode)
+ return 0;
+ }
+
+ if (new_type != OVL_PATH_UPPER &&
+ S_ISDIR(new->d_inode->i_mode)) {
+ err = ovl_check_empty_and_clear(new, new_type);
+ if (err)
+ return err;
+ }
+ } else {
+ new_type = OVL_PATH_UPPER;
+ }
+
+ err = ovl_copy_up(old);
+ if (err)
+ return err;
+
+ err = ovl_copy_up(new->d_parent);
+ if (err)
+ return err;
+
+ old_upperdir = ovl_dentry_upper(old->d_parent);
+ new_upperdir = ovl_dentry_upper(new->d_parent);
+
+ trap = lock_rename(new_upperdir, old_upperdir);
+
+ olddentry = ovl_dentry_upper(old);
+ newdentry = ovl_dentry_upper(new);
+ if (newdentry) {
+ dget(newdentry);
+ } else {
+ new_create = true;
+ newdentry = ovl_lookup_create(new_upperdir, new);
+ err = PTR_ERR(newdentry);
+ if (IS_ERR(newdentry))
+ goto out_unlock;
+ }
+
+ err = -ESTALE;
+ if (olddentry->d_parent != old_upperdir)
+ goto out_dput;
+ if (newdentry->d_parent != new_upperdir)
+ goto out_dput;
+ if (olddentry == trap)
+ goto out_dput;
+ if (newdentry == trap)
+ goto out_dput;
+
+ old_opaque = ovl_dentry_is_opaque(old);
+ new_opaque = ovl_dentry_is_opaque(new) || new_type != OVL_PATH_UPPER;
+
+ if (is_dir && !old_opaque && new_opaque) {
+ err = ovl_set_opaque(olddentry);
+ if (err)
+ goto out_dput;
+ }
+
+ err = vfs_rename(old_upperdir->d_inode, olddentry,
+ new_upperdir->d_inode, newdentry);
+
+ if (err) {
+ if (new_create && ovl_dentry_is_opaque(new))
+ ovl_whiteout(new_upperdir, new);
+ if (is_dir && !old_opaque && new_opaque)
+ ovl_remove_opaque(olddentry);
+ goto out_dput;
+ }
+
+ if (old_type != OVL_PATH_UPPER || old_opaque)
+ err = ovl_whiteout(old_upperdir, old);
+ if (is_dir && old_opaque && !new_opaque)
+ ovl_remove_opaque(olddentry);
+
+ if (old_opaque != new_opaque)
+ ovl_dentry_set_opaque(old, new_opaque);
+
+ ovl_dentry_version_inc(old->d_parent);
+ ovl_dentry_version_inc(new->d_parent);
+
+out_dput:
+ dput(newdentry);
+out_unlock:
+ unlock_rename(new_upperdir, old_upperdir);
+ return err;
+}
+
+const struct inode_operations ovl_dir_inode_operations = {
+ .lookup = ovl_lookup,
+ .mkdir = ovl_mkdir,
+ .symlink = ovl_symlink,
+ .unlink = ovl_unlink,
+ .rmdir = ovl_rmdir,
+ .rename = ovl_rename,
+ .link = ovl_link,
+ .setattr = ovl_setattr,
+ .create = ovl_create,
+ .mknod = ovl_mknod,
+ .permission = ovl_permission,
+ .getattr = ovl_dir_getattr,
+ .setxattr = ovl_setxattr,
+ .getxattr = ovl_getxattr,
+ .listxattr = ovl_listxattr,
+ .removexattr = ovl_removexattr,
+};
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
new file mode 100644
index 0000000..3218a38
--- /dev/null
+++ b/fs/overlayfs/inode.c
@@ -0,0 +1,370 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/xattr.h>
+#include "overlayfs.h"
+
+int ovl_setattr(struct dentry *dentry, struct iattr *attr)
+{
+ struct dentry *upperdentry;
+ int err;
+
+ if ((attr->ia_valid & ATTR_SIZE) && !ovl_dentry_upper(dentry))
+ err = ovl_copy_up_truncate(dentry, attr->ia_size);
+ else
+ err = ovl_copy_up(dentry);
+ if (err)
+ return err;
+
+ upperdentry = ovl_dentry_upper(dentry);
+
+ if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID))
+ attr->ia_valid &= ~ATTR_MODE;
+
+ mutex_lock(&upperdentry->d_inode->i_mutex);
+ err = notify_change(upperdentry, attr);
+ mutex_unlock(&upperdentry->d_inode->i_mutex);
+
+ return err;
+}
+
+static int ovl_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat)
+{
+ struct path realpath;
+
+ ovl_path_real(dentry, &realpath);
+ return vfs_getattr(&realpath, stat);
+}
+
+int ovl_permission(struct inode *inode, int mask)
+{
+ struct ovl_entry *oe;
+ struct dentry *alias = NULL;
+ struct inode *realinode;
+ struct dentry *realdentry;
+ bool is_upper;
+ int err;
+
+ if (S_ISDIR(inode->i_mode)) {
+ oe = inode->i_private;
+ } else if (mask & MAY_NOT_BLOCK) {
+ return -ECHILD;
+ } else {
+ /*
+ * For non-directories find an alias and get the info
+ * from there.
+ */
+ alias = d_find_any_alias(inode);
+ if (WARN_ON(!alias))
+ return -ENOENT;
+
+ oe = alias->d_fsdata;
+ }
+
+ realdentry = ovl_entry_real(oe, &is_upper);
+
+ /* Careful in RCU walk mode */
+ realinode = ACCESS_ONCE(realdentry->d_inode);
+ if (!realinode) {
+ WARN_ON(!(mask & MAY_NOT_BLOCK));
+ err = -ENOENT;
+ goto out_dput;
+ }
+
+ if (mask & MAY_WRITE) {
+ umode_t mode = realinode->i_mode;
+
+ /*
+ * Writes will always be redirected to upper layer, so
+ * ignore lower layer being read-only.
+ *
+ * If the overlay itself is read-only then proceed
+ * with the permission check, don't return EROFS.
+ * This will only happen if this is the lower layer of
+ * another overlayfs.
+ *
+ * If upper fs becomes read-only after the overlay was
+ * constructed return EROFS to prevent modification of
+ * upper layer.
+ */
+ err = -EROFS;
+ if (is_upper && !IS_RDONLY(inode) && IS_RDONLY(realinode) &&
+ (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
+ goto out_dput;
+ }
+
+ err = __inode_permission(realinode, mask);
+out_dput:
+ dput(alias);
+ return err;
+}
+
+
+struct ovl_link_data {
+ struct dentry *realdentry;
+ void *cookie;
+};
+
+static void *ovl_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+ void *ret;
+ struct dentry *realdentry;
+ struct inode *realinode;
+
+ realdentry = ovl_dentry_real(dentry);
+ realinode = realdentry->d_inode;
+
+ if (WARN_ON(!realinode->i_op->follow_link))
+ return ERR_PTR(-EPERM);
+
+ ret = realinode->i_op->follow_link(realdentry, nd);
+ if (IS_ERR(ret))
+ return ret;
+
+ if (realinode->i_op->put_link) {
+ struct ovl_link_data *data;
+
+ data = kmalloc(sizeof(struct ovl_link_data), GFP_KERNEL);
+ if (!data) {
+ realinode->i_op->put_link(realdentry, nd, ret);
+ return ERR_PTR(-ENOMEM);
+ }
+ data->realdentry = realdentry;
+ data->cookie = ret;
+
+ return data;
+ } else {
+ return NULL;
+ }
+}
+
+static void ovl_put_link(struct dentry *dentry, struct nameidata *nd, void *c)
+{
+ struct inode *realinode;
+ struct ovl_link_data *data = c;
+
+ if (!data)
+ return;
+
+ realinode = data->realdentry->d_inode;
+ realinode->i_op->put_link(data->realdentry, nd, data->cookie);
+ kfree(data);
+}
+
+static int ovl_readlink(struct dentry *dentry, char __user *buf, int bufsiz)
+{
+ struct path realpath;
+ struct inode *realinode;
+
+ ovl_path_real(dentry, &realpath);
+ realinode = realpath.dentry->d_inode;
+
+ if (!realinode->i_op->readlink)
+ return -EINVAL;
+
+ touch_atime(&realpath);
+
+ return realinode->i_op->readlink(realpath.dentry, buf, bufsiz);
+}
+
+
+static bool ovl_is_private_xattr(const char *name)
+{
+ return strncmp(name, "trusted.overlay.", 14) == 0;
+}
+
+int ovl_setxattr(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags)
+{
+ int err;
+ struct dentry *upperdentry;
+
+ if (ovl_is_private_xattr(name))
+ return -EPERM;
+
+ err = ovl_copy_up(dentry);
+ if (err)
+ return err;
+
+ upperdentry = ovl_dentry_upper(dentry);
+ return vfs_setxattr(upperdentry, name, value, size, flags);
+}
+
+ssize_t ovl_getxattr(struct dentry *dentry, const char *name,
+ void *value, size_t size)
+{
+ if (ovl_path_type(dentry->d_parent) == OVL_PATH_MERGE &&
+ ovl_is_private_xattr(name))
+ return -ENODATA;
+
+ return vfs_getxattr(ovl_dentry_real(dentry), name, value, size);
+}
+
+ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size)
+{
+ ssize_t res;
+ int off;
+
+ res = vfs_listxattr(ovl_dentry_real(dentry), list, size);
+ if (res <= 0 || size == 0)
+ return res;
+
+ if (ovl_path_type(dentry->d_parent) != OVL_PATH_MERGE)
+ return res;
+
+ /* filter out private xattrs */
+ for (off = 0; off < res;) {
+ char *s = list + off;
+ size_t slen = strlen(s) + 1;
+
+ BUG_ON(off + slen > res);
+
+ if (ovl_is_private_xattr(s)) {
+ res -= slen;
+ memmove(s, s + slen, res - off);
+ } else {
+ off += slen;
+ }
+ }
+
+ return res;
+}
+
+int ovl_removexattr(struct dentry *dentry, const char *name)
+{
+ int err;
+ struct path realpath;
+ enum ovl_path_type type;
+
+ if (ovl_path_type(dentry->d_parent) == OVL_PATH_MERGE &&
+ ovl_is_private_xattr(name))
+ return -ENODATA;
+
+ type = ovl_path_real(dentry, &realpath);
+ if (type == OVL_PATH_LOWER) {
+ err = vfs_getxattr(realpath.dentry, name, NULL, 0);
+ if (err < 0)
+ return err;
+
+ err = ovl_copy_up(dentry);
+ if (err)
+ return err;
+
+ ovl_path_upper(dentry, &realpath);
+ }
+
+ return vfs_removexattr(realpath.dentry, name);
+}
+
+static bool ovl_open_need_copy_up(int flags, enum ovl_path_type type,
+ struct dentry *realdentry)
+{
+ if (type != OVL_PATH_LOWER)
+ return false;
+
+ if (special_file(realdentry->d_inode->i_mode))
+ return false;
+
+ if (!(OPEN_FMODE(flags) & FMODE_WRITE) && !(flags & O_TRUNC))
+ return false;
+
+ return true;
+}
+
+static int ovl_dentry_open(struct dentry *dentry, struct file *file,
+ const struct cred *cred)
+{
+ int err;
+ struct path realpath;
+ enum ovl_path_type type;
+
+ type = ovl_path_real(dentry, &realpath);
+ if (ovl_open_need_copy_up(file->f_flags, type, realpath.dentry)) {
+ if (file->f_flags & O_TRUNC)
+ err = ovl_copy_up_truncate(dentry, 0);
+ else
+ err = ovl_copy_up(dentry);
+ if (err)
+ return err;
+
+ ovl_path_upper(dentry, &realpath);
+ }
+
+ return vfs_open(&realpath, file, cred);
+}
+
+static const struct inode_operations ovl_file_inode_operations = {
+ .setattr = ovl_setattr,
+ .permission = ovl_permission,
+ .getattr = ovl_getattr,
+ .setxattr = ovl_setxattr,
+ .getxattr = ovl_getxattr,
+ .listxattr = ovl_listxattr,
+ .removexattr = ovl_removexattr,
+ .dentry_open = ovl_dentry_open,
+};
+
+static const struct inode_operations ovl_symlink_inode_operations = {
+ .setattr = ovl_setattr,
+ .follow_link = ovl_follow_link,
+ .put_link = ovl_put_link,
+ .readlink = ovl_readlink,
+ .getattr = ovl_getattr,
+ .setxattr = ovl_setxattr,
+ .getxattr = ovl_getxattr,
+ .listxattr = ovl_listxattr,
+ .removexattr = ovl_removexattr,
+};
+
+struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
+ struct ovl_entry *oe)
+{
+ struct inode *inode;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return NULL;
+
+ mode &= S_IFMT;
+
+ inode->i_ino = get_next_ino();
+ inode->i_mode = mode;
+ inode->i_flags |= S_NOATIME | S_NOCMTIME;
+
+ switch (mode) {
+ case S_IFDIR:
+ inode->i_private = oe;
+ inode->i_op = &ovl_dir_inode_operations;
+ inode->i_fop = &ovl_dir_operations;
+ break;
+
+ case S_IFLNK:
+ inode->i_op = &ovl_symlink_inode_operations;
+ break;
+
+ case S_IFREG:
+ case S_IFSOCK:
+ case S_IFBLK:
+ case S_IFCHR:
+ case S_IFIFO:
+ inode->i_op = &ovl_file_inode_operations;
+ break;
+
+ default:
+ WARN(1, "illegal file type: %i\n", mode);
+ iput(inode);
+ inode = NULL;
+ }
+
+ return inode;
+
+}
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
new file mode 100644
index 0000000..fe1241d
--- /dev/null
+++ b/fs/overlayfs/overlayfs.h
@@ -0,0 +1,64 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+struct ovl_entry;
+
+enum ovl_path_type {
+ OVL_PATH_UPPER,
+ OVL_PATH_MERGE,
+ OVL_PATH_LOWER,
+};
+
+extern const char *ovl_opaque_xattr;
+extern const char *ovl_whiteout_xattr;
+extern const struct dentry_operations ovl_dentry_operations;
+
+enum ovl_path_type ovl_path_type(struct dentry *dentry);
+u64 ovl_dentry_version_get(struct dentry *dentry);
+void ovl_dentry_version_inc(struct dentry *dentry);
+void ovl_path_upper(struct dentry *dentry, struct path *path);
+void ovl_path_lower(struct dentry *dentry, struct path *path);
+enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path);
+struct dentry *ovl_dentry_upper(struct dentry *dentry);
+struct dentry *ovl_dentry_lower(struct dentry *dentry);
+struct dentry *ovl_dentry_real(struct dentry *dentry);
+struct dentry *ovl_entry_real(struct ovl_entry *oe, bool *is_upper);
+bool ovl_dentry_is_opaque(struct dentry *dentry);
+void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque);
+bool ovl_is_whiteout(struct dentry *dentry);
+void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry);
+struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags);
+struct file *ovl_path_open(struct path *path, int flags);
+
+struct dentry *ovl_upper_create(struct dentry *upperdir, struct dentry *dentry,
+ struct kstat *stat, const char *link);
+
+/* readdir.c */
+extern const struct file_operations ovl_dir_operations;
+int ovl_check_empty_and_clear(struct dentry *dentry, enum ovl_path_type type);
+
+/* inode.c */
+int ovl_setattr(struct dentry *dentry, struct iattr *attr);
+int ovl_permission(struct inode *inode, int mask);
+int ovl_setxattr(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags);
+ssize_t ovl_getxattr(struct dentry *dentry, const char *name,
+ void *value, size_t size);
+ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size);
+int ovl_removexattr(struct dentry *dentry, const char *name);
+
+struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
+ struct ovl_entry *oe);
+/* dir.c */
+extern const struct inode_operations ovl_dir_inode_operations;
+
+/* copy_up.c */
+int ovl_copy_up(struct dentry *dentry);
+int ovl_copy_up_truncate(struct dentry *dentry, loff_t size);
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
new file mode 100644
index 0000000..4c18abf
--- /dev/null
+++ b/fs/overlayfs/readdir.c
@@ -0,0 +1,566 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/namei.h>
+#include <linux/file.h>
+#include <linux/xattr.h>
+#include <linux/rbtree.h>
+#include <linux/security.h>
+#include <linux/cred.h>
+#include "overlayfs.h"
+
+struct ovl_cache_entry {
+ const char *name;
+ unsigned int len;
+ unsigned int type;
+ u64 ino;
+ bool is_whiteout;
+ struct list_head l_node;
+ struct rb_node node;
+};
+
+struct ovl_readdir_data {
+ struct rb_root *root;
+ struct list_head *list;
+ struct list_head *middle;
+ struct dentry *dir;
+ int count;
+ int err;
+};
+
+struct ovl_dir_file {
+ bool is_real;
+ bool is_cached;
+ struct list_head cursor;
+ u64 cache_version;
+ struct list_head cache;
+ struct file *realfile;
+};
+
+static struct ovl_cache_entry *ovl_cache_entry_from_node(struct rb_node *n)
+{
+ return container_of(n, struct ovl_cache_entry, node);
+}
+
+static struct ovl_cache_entry *ovl_cache_entry_find(struct rb_root *root,
+ const char *name, int len)
+{
+ struct rb_node *node = root->rb_node;
+ int cmp;
+
+ while (node) {
+ struct ovl_cache_entry *p = ovl_cache_entry_from_node(node);
+
+ cmp = strncmp(name, p->name, len);
+ if (cmp > 0)
+ node = p->node.rb_right;
+ else if (cmp < 0 || len < p->len)
+ node = p->node.rb_left;
+ else
+ return p;
+ }
+
+ return NULL;
+}
+
+static struct ovl_cache_entry *ovl_cache_entry_new(const char *name, int len,
+ u64 ino, unsigned int d_type)
+{
+ struct ovl_cache_entry *p;
+
+ p = kmalloc(sizeof(*p) + len + 1, GFP_KERNEL);
+ if (p) {
+ char *name_copy = (char *) (p + 1);
+ memcpy(name_copy, name, len);
+ name_copy[len] = '\0';
+ p->name = name_copy;
+ p->len = len;
+ p->type = d_type;
+ p->ino = ino;
+ p->is_whiteout = false;
+ }
+
+ return p;
+}
+
+static int ovl_cache_entry_add_rb(struct ovl_readdir_data *rdd,
+ const char *name, int len, u64 ino,
+ unsigned int d_type)
+{
+ struct rb_node **newp = &rdd->root->rb_node;
+ struct rb_node *parent = NULL;
+ struct ovl_cache_entry *p;
+
+ while (*newp) {
+ int cmp;
+ struct ovl_cache_entry *tmp;
+
+ parent = *newp;
+ tmp = ovl_cache_entry_from_node(*newp);
+ cmp = strncmp(name, tmp->name, len);
+ if (cmp > 0)
+ newp = &tmp->node.rb_right;
+ else if (cmp < 0 || len < tmp->len)
+ newp = &tmp->node.rb_left;
+ else
+ return 0;
+ }
+
+ p = ovl_cache_entry_new(name, len, ino, d_type);
+ if (p == NULL)
+ return -ENOMEM;
+
+ list_add_tail(&p->l_node, rdd->list);
+ rb_link_node(&p->node, parent, newp);
+ rb_insert_color(&p->node, rdd->root);
+
+ return 0;
+}
+
+static int ovl_fill_lower(void *buf, const char *name, int namelen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct ovl_readdir_data *rdd = buf;
+ struct ovl_cache_entry *p;
+
+ rdd->count++;
+ p = ovl_cache_entry_find(rdd->root, name, namelen);
+ if (p) {
+ list_move_tail(&p->l_node, rdd->middle);
+ } else {
+ p = ovl_cache_entry_new(name, namelen, ino, d_type);
+ if (p == NULL)
+ rdd->err = -ENOMEM;
+ else
+ list_add_tail(&p->l_node, rdd->middle);
+ }
+
+ return rdd->err;
+}
+
+static void ovl_cache_free(struct list_head *list)
+{
+ struct ovl_cache_entry *p;
+ struct ovl_cache_entry *n;
+
+ list_for_each_entry_safe(p, n, list, l_node)
+ kfree(p);
+
+ INIT_LIST_HEAD(list);
+}
+
+static int ovl_fill_upper(void *buf, const char *name, int namelen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct ovl_readdir_data *rdd = buf;
+
+ rdd->count++;
+ return ovl_cache_entry_add_rb(rdd, name, namelen, ino, d_type);
+}
+
+static inline int ovl_dir_read(struct path *realpath,
+ struct ovl_readdir_data *rdd, filldir_t filler)
+{
+ struct file *realfile;
+ int err;
+
+ realfile = ovl_path_open(realpath, O_RDONLY | O_DIRECTORY);
+ if (IS_ERR(realfile))
+ return PTR_ERR(realfile);
+
+ do {
+ rdd->count = 0;
+ rdd->err = 0;
+ err = vfs_readdir(realfile, filler, rdd);
+ if (err >= 0)
+ err = rdd->err;
+ } while (!err && rdd->count);
+ fput(realfile);
+
+ return 0;
+}
+
+static void ovl_dir_reset(struct file *file)
+{
+ struct ovl_dir_file *od = file->private_data;
+ enum ovl_path_type type = ovl_path_type(file->f_path.dentry);
+
+ if (ovl_dentry_version_get(file->f_path.dentry) != od->cache_version) {
+ list_del_init(&od->cursor);
+ ovl_cache_free(&od->cache);
+ od->is_cached = false;
+ }
+ WARN_ON(!od->is_real && type != OVL_PATH_MERGE);
+ if (od->is_real && type == OVL_PATH_MERGE) {
+ fput(od->realfile);
+ od->realfile = NULL;
+ od->is_real = false;
+ }
+}
+
+static int ovl_dir_mark_whiteouts(struct ovl_readdir_data *rdd)
+{
+ struct ovl_cache_entry *p;
+ struct dentry *dentry;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ override_cred = prepare_creds();
+ if (!override_cred) {
+ ovl_cache_free(rdd->list);
+ return -ENOMEM;
+ }
+
+ /*
+ * CAP_SYS_ADMIN for getxattr
+ * CAP_DAC_OVERRIDE for lookup
+ */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+ old_cred = override_creds(override_cred);
+
+ mutex_lock(&rdd->dir->d_inode->i_mutex);
+ list_for_each_entry(p, rdd->list, l_node) {
+ if (p->type != DT_LNK)
+ continue;
+
+ dentry = lookup_one_len(p->name, rdd->dir, p->len);
+ if (IS_ERR(dentry))
+ continue;
+
+ p->is_whiteout = ovl_is_whiteout(dentry);
+ dput(dentry);
+ }
+ mutex_unlock(&rdd->dir->d_inode->i_mutex);
+
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+ return 0;
+}
+
+static inline int ovl_dir_read_merged(struct path *upperpath,
+ struct path *lowerpath,
+ struct ovl_readdir_data *rdd)
+{
+ int err;
+ struct rb_root root = RB_ROOT;
+ struct list_head middle;
+
+ rdd->root = &root;
+ if (upperpath->dentry) {
+ rdd->dir = upperpath->dentry;
+ err = ovl_dir_read(upperpath, rdd, ovl_fill_upper);
+ if (err)
+ goto out;
+
+ err = ovl_dir_mark_whiteouts(rdd);
+ if (err)
+ goto out;
+ }
+ /*
+ * Insert lowerpath entries before upperpath ones, this allows
+ * offsets to be reasonably constant
+ */
+ list_add(&middle, rdd->list);
+ rdd->middle = &middle;
+ err = ovl_dir_read(lowerpath, rdd, ovl_fill_lower);
+ list_del(&middle);
+out:
+ rdd->root = NULL;
+
+ return err;
+}
+
+static void ovl_seek_cursor(struct ovl_dir_file *od, loff_t pos)
+{
+ struct list_head *l;
+ loff_t off;
+
+ l = od->cache.next;
+ for (off = 0; off < pos; off++) {
+ if (l == &od->cache)
+ break;
+ l = l->next;
+ }
+ list_move_tail(&od->cursor, l);
+}
+
+static int ovl_readdir(struct file *file, void *buf, filldir_t filler)
+{
+ struct ovl_dir_file *od = file->private_data;
+ int res;
+
+ if (!file->f_pos)
+ ovl_dir_reset(file);
+
+ if (od->is_real) {
+ res = vfs_readdir(od->realfile, filler, buf);
+ file->f_pos = od->realfile->f_pos;
+
+ return res;
+ }
+
+ if (!od->is_cached) {
+ struct path lowerpath;
+ struct path upperpath;
+ struct ovl_readdir_data rdd = { .list = &od->cache };
+
+ ovl_path_lower(file->f_path.dentry, &lowerpath);
+ ovl_path_upper(file->f_path.dentry, &upperpath);
+
+ res = ovl_dir_read_merged(&upperpath, &lowerpath, &rdd);
+ if (res) {
+ ovl_cache_free(rdd.list);
+ return res;
+ }
+
+ od->cache_version = ovl_dentry_version_get(file->f_path.dentry);
+ od->is_cached = true;
+
+ ovl_seek_cursor(od, file->f_pos);
+ }
+
+ while (od->cursor.next != &od->cache) {
+ int over;
+ loff_t off;
+ struct ovl_cache_entry *p;
+
+ p = list_entry(od->cursor.next, struct ovl_cache_entry, l_node);
+ off = file->f_pos;
+ if (!p->is_whiteout) {
+ over = filler(buf, p->name, p->len, off, p->ino,
+ p->type);
+ if (over)
+ break;
+ }
+ file->f_pos++;
+ list_move(&od->cursor, &p->l_node);
+ }
+
+ return 0;
+}
+
+static loff_t ovl_dir_llseek(struct file *file, loff_t offset, int origin)
+{
+ loff_t res;
+ struct ovl_dir_file *od = file->private_data;
+
+ mutex_lock(&file_inode(file)->i_mutex);
+ if (!file->f_pos)
+ ovl_dir_reset(file);
+
+ if (od->is_real) {
+ res = vfs_llseek(od->realfile, offset, origin);
+ file->f_pos = od->realfile->f_pos;
+ } else {
+ res = -EINVAL;
+
+ switch (origin) {
+ case SEEK_CUR:
+ offset += file->f_pos;
+ break;
+ case SEEK_SET:
+ break;
+ default:
+ goto out_unlock;
+ }
+ if (offset < 0)
+ goto out_unlock;
+
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ if (od->is_cached)
+ ovl_seek_cursor(od, offset);
+ }
+ res = offset;
+ }
+out_unlock:
+ mutex_unlock(&file_inode(file)->i_mutex);
+
+ return res;
+}
+
+static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
+ int datasync)
+{
+ struct ovl_dir_file *od = file->private_data;
+
+ /* May need to reopen directory if it got copied up */
+ if (!od->realfile) {
+ struct path upperpath;
+
+ ovl_path_upper(file->f_path.dentry, &upperpath);
+ od->realfile = ovl_path_open(&upperpath, O_RDONLY);
+ if (IS_ERR(od->realfile))
+ return PTR_ERR(od->realfile);
+ }
+
+ return vfs_fsync_range(od->realfile, start, end, datasync);
+}
+
+static int ovl_dir_release(struct inode *inode, struct file *file)
+{
+ struct ovl_dir_file *od = file->private_data;
+
+ list_del(&od->cursor);
+ ovl_cache_free(&od->cache);
+ if (od->realfile)
+ fput(od->realfile);
+ kfree(od);
+
+ return 0;
+}
+
+static int ovl_dir_open(struct inode *inode, struct file *file)
+{
+ struct path realpath;
+ struct file *realfile;
+ struct ovl_dir_file *od;
+ enum ovl_path_type type;
+
+ od = kzalloc(sizeof(struct ovl_dir_file), GFP_KERNEL);
+ if (!od)
+ return -ENOMEM;
+
+ type = ovl_path_real(file->f_path.dentry, &realpath);
+ realfile = ovl_path_open(&realpath, file->f_flags);
+ if (IS_ERR(realfile)) {
+ kfree(od);
+ return PTR_ERR(realfile);
+ }
+ INIT_LIST_HEAD(&od->cache);
+ INIT_LIST_HEAD(&od->cursor);
+ od->is_cached = false;
+ od->realfile = realfile;
+ od->is_real = (type != OVL_PATH_MERGE);
+ file->private_data = od;
+
+ return 0;
+}
+
+const struct file_operations ovl_dir_operations = {
+ .read = generic_read_dir,
+ .open = ovl_dir_open,
+ .readdir = ovl_readdir,
+ .llseek = ovl_dir_llseek,
+ .fsync = ovl_dir_fsync,
+ .release = ovl_dir_release,
+};
+
+static int ovl_check_empty_dir(struct dentry *dentry, struct list_head *list)
+{
+ int err;
+ struct path lowerpath;
+ struct path upperpath;
+ struct ovl_cache_entry *p;
+ struct ovl_readdir_data rdd = { .list = list };
+
+ ovl_path_upper(dentry, &upperpath);
+ ovl_path_lower(dentry, &lowerpath);
+
+ err = ovl_dir_read_merged(&upperpath, &lowerpath, &rdd);
+ if (err)
+ return err;
+
+ err = 0;
+
+ list_for_each_entry(p, list, l_node) {
+ if (p->is_whiteout)
+ continue;
+
+ if (p->name[0] == '.') {
+ if (p->len == 1)
+ continue;
+ if (p->len == 2 && p->name[1] == '.')
+ continue;
+ }
+ err = -ENOTEMPTY;
+ break;
+ }
+
+ return err;
+}
+
+static int ovl_remove_whiteouts(struct dentry *dir, struct list_head *list)
+{
+ struct path upperpath;
+ struct dentry *upperdir;
+ struct ovl_cache_entry *p;
+ const struct cred *old_cred;
+ struct cred *override_cred;
+ int err;
+
+ ovl_path_upper(dir, &upperpath);
+ upperdir = upperpath.dentry;
+
+ override_cred = prepare_creds();
+ if (!override_cred)
+ return -ENOMEM;
+
+ /*
+ * CAP_DAC_OVERRIDE for lookup and unlink
+ * CAP_SYS_ADMIN for setxattr of "trusted" namespace
+ * CAP_FOWNER for unlink in sticky directory
+ */
+ cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ cap_raise(override_cred->cap_effective, CAP_FOWNER);
+ old_cred = override_creds(override_cred);
+
+ err = vfs_setxattr(upperdir, ovl_opaque_xattr, "y", 1, 0);
+ if (err)
+ goto out_revert_creds;
+
+ mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
+ list_for_each_entry(p, list, l_node) {
+ struct dentry *dentry;
+ int ret;
+
+ if (!p->is_whiteout)
+ continue;
+
+ dentry = lookup_one_len(p->name, upperdir, p->len);
+ if (IS_ERR(dentry)) {
+ pr_warn(
+ "overlayfs: failed to lookup whiteout %.*s: %li\n",
+ p->len, p->name, PTR_ERR(dentry));
+ continue;
+ }
+ ret = vfs_unlink(upperdir->d_inode, dentry);
+ dput(dentry);
+ if (ret)
+ pr_warn(
+ "overlayfs: failed to unlink whiteout %.*s: %i\n",
+ p->len, p->name, ret);
+ }
+ mutex_unlock(&upperdir->d_inode->i_mutex);
+
+out_revert_creds:
+ revert_creds(old_cred);
+ put_cred(override_cred);
+
+ return err;
+}
+
+int ovl_check_empty_and_clear(struct dentry *dentry, enum ovl_path_type type)
+{
+ int err;
+ LIST_HEAD(list);
+
+ err = ovl_check_empty_dir(dentry, &list);
+ if (!err && type == OVL_PATH_MERGE)
+ err = ovl_remove_whiteouts(dentry, &list);
+
+ ovl_cache_free(&list);
+
+ return err;
+}
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
new file mode 100644
index 0000000..4bcb98f
--- /dev/null
+++ b/fs/overlayfs/super.c
@@ -0,0 +1,612 @@
+/*
+ *
+ * Copyright (C) 2011 Novell Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/xattr.h>
+#include <linux/security.h>
+#include <linux/mount.h>
+#include <linux/slab.h>
+#include <linux/parser.h>
+#include <linux/module.h>
+#include <linux/cred.h>
+#include <linux/sched.h>
+#include "overlayfs.h"
+
+MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
+MODULE_DESCRIPTION("Overlay filesystem");
+MODULE_LICENSE("GPL");
+
+struct ovl_fs {
+ struct vfsmount *upper_mnt;
+ struct vfsmount *lower_mnt;
+};
+
+struct ovl_entry {
+ /*
+ * Keep "double reference" on upper dentries, so that
+ * d_delete() doesn't think it's OK to reset d_inode to NULL.
+ */
+ struct dentry *__upperdentry;
+ struct dentry *lowerdentry;
+ union {
+ struct {
+ u64 version;
+ bool opaque;
+ };
+ struct rcu_head rcu;
+ };
+};
+
+const char *ovl_whiteout_xattr = "trusted.overlay.whiteout";
+const char *ovl_opaque_xattr = "trusted.overlay.opaque";
+
+
+enum ovl_path_type ovl_path_type(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ if (oe->__upperdentry) {
+ if (oe->lowerdentry && S_ISDIR(dentry->d_inode->i_mode))
+ return OVL_PATH_MERGE;
+ else
+ return OVL_PATH_UPPER;
+ } else {
+ return OVL_PATH_LOWER;
+ }
+}
+
+static struct dentry *ovl_upperdentry_dereference(struct ovl_entry *oe)
+{
+ struct dentry *upperdentry = ACCESS_ONCE(oe->__upperdentry);
+ smp_read_barrier_depends();
+ return upperdentry;
+}
+
+void ovl_path_upper(struct dentry *dentry, struct path *path)
+{
+ struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ path->mnt = ofs->upper_mnt;
+ path->dentry = ovl_upperdentry_dereference(oe);
+}
+
+void ovl_path_lower(struct dentry *dentry, struct path *path)
+{
+ struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ path->mnt = ofs->lower_mnt;
+ path->dentry = oe->lowerdentry;
+}
+
+enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path)
+{
+
+ enum ovl_path_type type = ovl_path_type(dentry);
+
+ if (type == OVL_PATH_LOWER)
+ ovl_path_lower(dentry, path);
+ else
+ ovl_path_upper(dentry, path);
+
+ return type;
+}
+
+struct dentry *ovl_dentry_upper(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ return ovl_upperdentry_dereference(oe);
+}
+
+struct dentry *ovl_dentry_lower(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ return oe->lowerdentry;
+}
+
+struct dentry *ovl_dentry_real(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+ struct dentry *realdentry;
+
+ realdentry = ovl_upperdentry_dereference(oe);
+ if (!realdentry)
+ realdentry = oe->lowerdentry;
+
+ return realdentry;
+}
+
+struct dentry *ovl_entry_real(struct ovl_entry *oe, bool *is_upper)
+{
+ struct dentry *realdentry;
+
+ realdentry = ovl_upperdentry_dereference(oe);
+ if (realdentry) {
+ *is_upper = true;
+ } else {
+ realdentry = oe->lowerdentry;
+ *is_upper = false;
+ }
+ return realdentry;
+}
+
+bool ovl_dentry_is_opaque(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+ return oe->opaque;
+}
+
+void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+ oe->opaque = opaque;
+}
+
+void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ WARN_ON(!mutex_is_locked(&upperdentry->d_parent->d_inode->i_mutex));
+ WARN_ON(oe->__upperdentry);
+ BUG_ON(!upperdentry->d_inode);
+ smp_wmb();
+ oe->__upperdentry = dget(upperdentry);
+}
+
+void ovl_dentry_version_inc(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ WARN_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
+ oe->version++;
+}
+
+u64 ovl_dentry_version_get(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ WARN_ON(!mutex_is_locked(&dentry->d_inode->i_mutex));
+ return oe->version;
+}
+
+bool ovl_is_whiteout(struct dentry *dentry)
+{
+ int res;
+ char val;
+
+ if (!dentry)
+ return false;
+ if (!dentry->d_inode)
+ return false;
+ if (!S_ISLNK(dentry->d_inode->i_mode))
+ return false;
+
+ res = vfs_getxattr(dentry, ovl_whiteout_xattr, &val, 1);
+ if (res == 1 && val == 'y')
+ return true;
+
+ return false;
+}
+
+static bool ovl_is_opaquedir(struct dentry *dentry)
+{
+ int res;
+ char val;
+
+ if (!S_ISDIR(dentry->d_inode->i_mode))
+ return false;
+
+ res = vfs_getxattr(dentry, ovl_opaque_xattr, &val, 1);
+ if (res == 1 && val == 'y')
+ return true;
+
+ return false;
+}
+
+static void ovl_entry_free(struct rcu_head *head)
+{
+ struct ovl_entry *oe = container_of(head, struct ovl_entry, rcu);
+ kfree(oe);
+}
+
+static void ovl_dentry_release(struct dentry *dentry)
+{
+ struct ovl_entry *oe = dentry->d_fsdata;
+
+ if (oe) {
+ dput(oe->__upperdentry);
+ dput(oe->__upperdentry);
+ dput(oe->lowerdentry);
+ call_rcu(&oe->rcu, ovl_entry_free);
+ }
+}
+
+const struct dentry_operations ovl_dentry_operations = {
+ .d_release = ovl_dentry_release,
+};
+
+static struct ovl_entry *ovl_alloc_entry(void)
+{
+ return kzalloc(sizeof(struct ovl_entry), GFP_KERNEL);
+}
+
+static inline struct dentry *ovl_lookup_real(struct dentry *dir,
+ struct qstr *name)
+{
+ struct dentry *dentry;
+
+ mutex_lock(&dir->d_inode->i_mutex);
+ dentry = lookup_one_len(name->name, dir, name->len);
+ mutex_unlock(&dir->d_inode->i_mutex);
+
+ if (IS_ERR(dentry)) {
+ if (PTR_ERR(dentry) == -ENOENT)
+ dentry = NULL;
+ } else if (!dentry->d_inode) {
+ dput(dentry);
+ dentry = NULL;
+ }
+ return dentry;
+}
+
+static int ovl_do_lookup(struct dentry *dentry)
+{
+ struct ovl_entry *oe;
+ struct dentry *upperdir;
+ struct dentry *lowerdir;
+ struct dentry *upperdentry = NULL;
+ struct dentry *lowerdentry = NULL;
+ struct inode *inode = NULL;
+ int err;
+
+ err = -ENOMEM;
+ oe = ovl_alloc_entry();
+ if (!oe)
+ goto out;
+
+ upperdir = ovl_dentry_upper(dentry->d_parent);
+ lowerdir = ovl_dentry_lower(dentry->d_parent);
+
+ if (upperdir) {
+ upperdentry = ovl_lookup_real(upperdir, &dentry->d_name);
+ err = PTR_ERR(upperdentry);
+ if (IS_ERR(upperdentry))
+ goto out_put_dir;
+
+ if (lowerdir && upperdentry &&
+ (S_ISLNK(upperdentry->d_inode->i_mode) ||
+ S_ISDIR(upperdentry->d_inode->i_mode))) {
+ const struct cred *old_cred;
+ struct cred *override_cred;
+
+ err = -ENOMEM;
+ override_cred = prepare_creds();
+ if (!override_cred)
+ goto out_dput_upper;
+
+ /* CAP_SYS_ADMIN needed for getxattr */
+ cap_raise(override_cred->cap_effective, CAP_SYS_ADMIN);
+ old_cred = override_creds(override_cred);
+
+ if (ovl_is_opaquedir(upperdentry)) {
+ oe->opaque = true;
+ } else if (ovl_is_whiteout(upperdentry)) {
+ dput(upperdentry);
+ upperdentry = NULL;
+ oe->opaque = true;
+ }
+ revert_creds(old_cred);
+ put_cred(override_cred);
+ }
+ }
+ if (lowerdir && !oe->opaque) {
+ lowerdentry = ovl_lookup_real(lowerdir, &dentry->d_name);
+ err = PTR_ERR(lowerdentry);
+ if (IS_ERR(lowerdentry))
+ goto out_dput_upper;
+ }
+
+ if (lowerdentry && upperdentry &&
+ (!S_ISDIR(upperdentry->d_inode->i_mode) ||
+ !S_ISDIR(lowerdentry->d_inode->i_mode))) {
+ dput(lowerdentry);
+ lowerdentry = NULL;
+ oe->opaque = true;
+ }
+
+ if (lowerdentry || upperdentry) {
+ struct dentry *realdentry;
+
+ realdentry = upperdentry ? upperdentry : lowerdentry;
+ err = -ENOMEM;
+ inode = ovl_new_inode(dentry->d_sb, realdentry->d_inode->i_mode,
+ oe);
+ if (!inode)
+ goto out_dput;
+ }
+
+ if (upperdentry)
+ oe->__upperdentry = dget(upperdentry);
+
+ if (lowerdentry)
+ oe->lowerdentry = lowerdentry;
+
+ dentry->d_fsdata = oe;
+ dentry->d_op = &ovl_dentry_operations;
+ d_add(dentry, inode);
+
+ return 0;
+
+out_dput:
+ dput(lowerdentry);
+out_dput_upper:
+ dput(upperdentry);
+out_put_dir:
+ kfree(oe);
+out:
+ return err;
+}
+
+struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ int err = ovl_do_lookup(dentry);
+
+ if (err)
+ return ERR_PTR(err);
+
+ return NULL;
+}
+
+struct file *ovl_path_open(struct path *path, int flags)
+{
+ path_get(path);
+ return dentry_open(path, flags, current_cred());
+}
+
+static void ovl_put_super(struct super_block *sb)
+{
+ struct ovl_fs *ufs = sb->s_fs_info;
+
+ if (!(sb->s_flags & MS_RDONLY))
+ mnt_drop_write(ufs->upper_mnt);
+
+ mntput(ufs->upper_mnt);
+ mntput(ufs->lower_mnt);
+
+ kfree(ufs);
+}
+
+static int ovl_remount_fs(struct super_block *sb, int *flagsp, char *data)
+{
+ int flags = *flagsp;
+ struct ovl_fs *ufs = sb->s_fs_info;
+
+ /* When remounting rw or ro, we need to adjust the write access to the
+ * upper fs.
+ */
+ if (((flags ^ sb->s_flags) & MS_RDONLY) == 0)
+ /* No change to readonly status */
+ return 0;
+
+ if (flags & MS_RDONLY) {
+ mnt_drop_write(ufs->upper_mnt);
+ return 0;
+ } else
+ return mnt_want_write(ufs->upper_mnt);
+}
+
+static const struct super_operations ovl_super_operations = {
+ .put_super = ovl_put_super,
+ .remount_fs = ovl_remount_fs,
+};
+
+struct ovl_config {
+ char *lowerdir;
+ char *upperdir;
+};
+
+enum {
+ OPT_LOWERDIR,
+ OPT_UPPERDIR,
+ OPT_ERR,
+};
+
+static const match_table_t ovl_tokens = {
+ {OPT_LOWERDIR, "lowerdir=%s"},
+ {OPT_UPPERDIR, "upperdir=%s"},
+ {OPT_ERR, NULL}
+};
+
+static int ovl_parse_opt(char *opt, struct ovl_config *config)
+{
+ char *p;
+
+ config->upperdir = NULL;
+ config->lowerdir = NULL;
+
+ while ((p = strsep(&opt, ",")) != NULL) {
+ int token;
+ substring_t args[MAX_OPT_ARGS];
+
+ if (!*p)
+ continue;
+
+ token = match_token(p, ovl_tokens, args);
+ switch (token) {
+ case OPT_UPPERDIR:
+ kfree(config->upperdir);
+ config->upperdir = match_strdup(&args[0]);
+ if (!config->upperdir)
+ return -ENOMEM;
+ break;
+
+ case OPT_LOWERDIR:
+ kfree(config->lowerdir);
+ config->lowerdir = match_strdup(&args[0]);
+ if (!config->lowerdir)
+ return -ENOMEM;
+ break;
+
+ default:
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+
+static int ovl_fill_super(struct super_block *sb, void *data, int silent)
+{
+ struct path lowerpath;
+ struct path upperpath;
+ struct inode *root_inode;
+ struct dentry *root_dentry;
+ struct ovl_entry *oe;
+ struct ovl_fs *ufs;
+ struct ovl_config config;
+ int err;
+
+ err = ovl_parse_opt((char *) data, &config);
+ if (err)
+ goto out;
+
+ err = -EINVAL;
+ if (!config.upperdir || !config.lowerdir) {
+ pr_err("overlayfs: missing upperdir or lowerdir\n");
+ goto out_free_config;
+ }
+
+ err = -ENOMEM;
+ ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
+ if (!ufs)
+ goto out_free_config;
+
+ oe = ovl_alloc_entry();
+ if (oe == NULL)
+ goto out_free_ufs;
+
+ err = kern_path(config.upperdir, LOOKUP_FOLLOW, &upperpath);
+ if (err)
+ goto out_free_oe;
+
+ err = kern_path(config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
+ if (err)
+ goto out_put_upperpath;
+
+ err = -ENOTDIR;
+ if (!S_ISDIR(upperpath.dentry->d_inode->i_mode) ||
+ !S_ISDIR(lowerpath.dentry->d_inode->i_mode))
+ goto out_put_lowerpath;
+
+ ufs->upper_mnt = clone_private_mount(&upperpath);
+ err = PTR_ERR(ufs->upper_mnt);
+ if (IS_ERR(ufs->upper_mnt)) {
+ pr_err("overlayfs: failed to clone upperpath\n");
+ goto out_put_lowerpath;
+ }
+
+ ufs->lower_mnt = clone_private_mount(&lowerpath);
+ err = PTR_ERR(ufs->lower_mnt);
+ if (IS_ERR(ufs->lower_mnt)) {
+ pr_err("overlayfs: failed to clone lowerpath\n");
+ goto out_put_upper_mnt;
+ }
+
+ /*
+ * Make lower_mnt R/O. That way fchmod/fchown on lower file
+ * will fail instead of modifying lower fs.
+ */
+ ufs->lower_mnt->mnt_flags |= MNT_READONLY;
+
+ /* If the upper fs is r/o, we mark overlayfs r/o too */
+ if (ufs->upper_mnt->mnt_sb->s_flags & MS_RDONLY)
+ sb->s_flags |= MS_RDONLY;
+
+ if (!(sb->s_flags & MS_RDONLY)) {
+ err = mnt_want_write(ufs->upper_mnt);
+ if (err)
+ goto out_put_lower_mnt;
+ }
+
+ err = -ENOMEM;
+ root_inode = ovl_new_inode(sb, S_IFDIR, oe);
+ if (!root_inode)
+ goto out_drop_write;
+
+ root_dentry = d_make_root(root_inode);
+ if (!root_dentry)
+ goto out_drop_write;
+
+ mntput(upperpath.mnt);
+ mntput(lowerpath.mnt);
+
+ oe->__upperdentry = dget(upperpath.dentry);
+ oe->lowerdentry = lowerpath.dentry;
+
+ root_dentry->d_fsdata = oe;
+ root_dentry->d_op = &ovl_dentry_operations;
+
+ sb->s_op = &ovl_super_operations;
+ sb->s_root = root_dentry;
+ sb->s_fs_info = ufs;
+
+ return 0;
+
+out_drop_write:
+ if (!(sb->s_flags & MS_RDONLY))
+ mnt_drop_write(ufs->upper_mnt);
+out_put_lower_mnt:
+ mntput(ufs->lower_mnt);
+out_put_upper_mnt:
+ mntput(ufs->upper_mnt);
+out_put_lowerpath:
+ path_put(&lowerpath);
+out_put_upperpath:
+ path_put(&upperpath);
+out_free_oe:
+ kfree(oe);
+out_free_ufs:
+ kfree(ufs);
+out_free_config:
+ kfree(config.lowerdir);
+ kfree(config.upperdir);
+out:
+ return err;
+}
+
+static struct dentry *ovl_mount(struct file_system_type *fs_type, int flags,
+ const char *dev_name, void *raw_data)
+{
+ return mount_nodev(fs_type, flags, raw_data, ovl_fill_super);
+}
+
+static struct file_system_type ovl_fs_type = {
+ .owner = THIS_MODULE,
+ .name = "overlayfs",
+ .mount = ovl_mount,
+ .kill_sb = kill_anon_super,
+};
+MODULE_ALIAS_FS("overlayfs");
+
+static int __init ovl_init(void)
+{
+ return register_filesystem(&ovl_fs_type);
+}
+
+static void __exit ovl_exit(void)
+{
+ unregister_filesystem(&ovl_fs_type);
+}
+
+module_init(ovl_init);
+module_exit(ovl_exit);
--
1.7.10.4

2013-03-13 14:17:13

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 6/9] overlayfs: add statfs support

From: Andy Whitcroft <[email protected]>

Add support for statfs to the overlayfs filesystem. As the upper layer
is the target of all write operations assume that the space in that
filesystem is the space in the overlayfs. There will be some inaccuracy as
overwriting a file will copy it up and consume space we were not expecting,
but it is better than nothing.

Use the upper layer dentry and mount from the overlayfs root inode,
passing the statfs call to that filesystem.

Signed-off-by: Andy Whitcroft <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/dir.c | 2 ++
fs/overlayfs/inode.c | 2 ++
fs/overlayfs/overlayfs.h | 6 ++++++
fs/overlayfs/super.c | 41 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 51 insertions(+)

diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 5da5be3..f4969c1 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -304,6 +304,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
}
}
ovl_dentry_update(dentry, newdentry);
+ ovl_copyattr(newdentry->d_inode, inode);
d_instantiate(dentry, inode);
inode = NULL;
newdentry = NULL;
@@ -448,6 +449,7 @@ static int ovl_link(struct dentry *old, struct inode *newdir,
err = -ENOMEM;
goto link_fail;
}
+ ovl_copyattr(upperdir->d_inode, newinode);

ovl_dentry_version_inc(new->d_parent);
ovl_dentry_update(new, newdentry);
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 3218a38..ee37e92 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -31,6 +31,8 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)

mutex_lock(&upperdentry->d_inode->i_mutex);
err = notify_change(upperdentry, attr);
+ if (!err)
+ ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
mutex_unlock(&upperdentry->d_inode->i_mutex);

return err;
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index fe1241d..1cba38f 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -56,6 +56,12 @@ int ovl_removexattr(struct dentry *dentry, const char *name);

struct inode *ovl_new_inode(struct super_block *sb, umode_t mode,
struct ovl_entry *oe);
+static inline void ovl_copyattr(struct inode *from, struct inode *to)
+{
+ to->i_uid = from->i_uid;
+ to->i_gid = from->i_gid;
+}
+
/* dir.c */
extern const struct inode_operations ovl_dir_inode_operations;

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 4bcb98f..bc22c2d 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -17,15 +17,19 @@
#include <linux/module.h>
#include <linux/cred.h>
#include <linux/sched.h>
+#include <linux/statfs.h>
#include "overlayfs.h"

MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
MODULE_DESCRIPTION("Overlay filesystem");
MODULE_LICENSE("GPL");

+#define OVERLAYFS_SUPER_MAGIC 0x794c764f
+
struct ovl_fs {
struct vfsmount *upper_mnt;
struct vfsmount *lower_mnt;
+ long lower_namelen;
};

struct ovl_entry {
@@ -333,6 +337,7 @@ static int ovl_do_lookup(struct dentry *dentry)
oe);
if (!inode)
goto out_dput;
+ ovl_copyattr(realdentry->d_inode, inode);
}

if (upperdentry)
@@ -406,9 +411,36 @@ static int ovl_remount_fs(struct super_block *sb, int *flagsp, char *data)
return mnt_want_write(ufs->upper_mnt);
}

+/**
+ * ovl_statfs
+ * @sb: The overlayfs super block
+ * @buf: The struct kstatfs to fill in with stats
+ *
+ * Get the filesystem statistics. As writes always target the upper layer
+ * filesystem pass the statfs to the same filesystem.
+ */
+static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+ struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+ struct dentry *root_dentry = dentry->d_sb->s_root;
+ struct path path;
+ int err;
+
+ ovl_path_upper(root_dentry, &path);
+
+ err = vfs_statfs(&path, buf);
+ if (!err) {
+ buf->f_namelen = max(buf->f_namelen, ofs->lower_namelen);
+ buf->f_type = OVERLAYFS_SUPER_MAGIC;
+ }
+
+ return err;
+}
+
static const struct super_operations ovl_super_operations = {
.put_super = ovl_put_super,
.remount_fs = ovl_remount_fs,
+ .statfs = ovl_statfs,
};

struct ovl_config {
@@ -474,6 +506,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
struct ovl_entry *oe;
struct ovl_fs *ufs;
struct ovl_config config;
+ struct kstatfs statfs;
int err;

err = ovl_parse_opt((char *) data, &config);
@@ -508,6 +541,13 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
!S_ISDIR(lowerpath.dentry->d_inode->i_mode))
goto out_put_lowerpath;

+ err = vfs_statfs(&lowerpath, &statfs);
+ if (err) {
+ pr_err("overlayfs: statfs failed on lowerpath\n");
+ goto out_put_lowerpath;
+ }
+ ufs->lower_namelen = statfs.f_namelen;
+
ufs->upper_mnt = clone_private_mount(&upperpath);
err = PTR_ERR(ufs->upper_mnt);
if (IS_ERR(ufs->upper_mnt)) {
@@ -556,6 +596,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
root_dentry->d_fsdata = oe;
root_dentry->d_op = &ovl_dentry_operations;

+ sb->s_magic = OVERLAYFS_SUPER_MAGIC;
sb->s_op = &ovl_super_operations;
sb->s_root = root_dentry;
sb->s_fs_info = ufs;
--
1.7.10.4

2013-03-13 14:18:23

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 9/9] fs: limit filesystem stacking depth

From: Miklos Szeredi <[email protected]>

Add a simple read-only counter to super_block that indicates deep this
is in the stack of filesystems. Previously ecryptfs was the only
stackable filesystem and it explicitly disallowed multiple layers of
itself.

Overlayfs, however, can be stacked recursively and also may be stacked
on top of ecryptfs or vice versa.

To limit the kernel stack usage we must limit the depth of the
filesystem stack. Initially the limit is set to 2.

Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/ecryptfs/main.c | 7 +++++++
fs/overlayfs/super.c | 10 ++++++++++
include/linux/fs.h | 11 +++++++++++
3 files changed, 28 insertions(+)

diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index e924cf4..8b0957e 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -567,6 +567,13 @@ static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
s->s_maxbytes = path.dentry->d_sb->s_maxbytes;
s->s_blocksize = path.dentry->d_sb->s_blocksize;
s->s_magic = ECRYPTFS_SUPER_MAGIC;
+ s->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;
+
+ rc = -EINVAL;
+ if (s->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ pr_err("eCryptfs: maximum fs stacking depth exceeded\n");
+ goto out_free;
+ }

inode = ecryptfs_get_inode(path.dentry->d_inode, s);
rc = PTR_ERR(inode);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 6e0c0c2..482c26f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -571,6 +571,16 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
}
ufs->lower_namelen = statfs.f_namelen;

+ sb->s_stack_depth = max(upperpath.mnt->mnt_sb->s_stack_depth,
+ lowerpath.mnt->mnt_sb->s_stack_depth) + 1;
+
+ err = -EINVAL;
+ if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
+ pr_err("overlayfs: maximum fs stacking depth exceeded\n");
+ goto out_put_lowerpath;
+ }
+
+
ufs->upper_mnt = clone_private_mount(&upperpath);
err = PTR_ERR(ufs->upper_mnt);
if (IS_ERR(ufs->upper_mnt)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd98a86..3353de6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -244,6 +244,12 @@ struct iattr {
*/
#include <linux/quota.h>

+/*
+ * Maximum number of layers of fs stack. Needs to be limited to
+ * prevent kernel stack overflow
+ */
+#define FILESYSTEM_MAX_STACK_DEPTH 2
+
/**
* enum positive_aop_returns - aop return codes with specific semantics
*
@@ -1320,6 +1326,11 @@ struct super_block {

/* Being remounted read-only */
int s_readonly_remount;
+
+ /*
+ * Indicates how deep in a filesystem stack this SB is
+ */
+ int s_stack_depth;
};

/* superblock cache pruning functions */
--
1.7.10.4

2013-03-13 14:17:10

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 7/9] overlayfs: implement show_options

From: Erez Zadok <[email protected]>

This is useful because of the stacking nature of overlayfs. Users like to
find out (via /proc/mounts) which lower/upper directory were used at mount
time.

Signed-off-by: Erez Zadok <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/super.c | 63 ++++++++++++++++++++++++++++++++++----------------
1 file changed, 43 insertions(+), 20 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index bc22c2d..6e0c0c2 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -18,6 +18,7 @@
#include <linux/cred.h>
#include <linux/sched.h>
#include <linux/statfs.h>
+#include <linux/seq_file.h>
#include "overlayfs.h"

MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
@@ -26,12 +27,21 @@ MODULE_LICENSE("GPL");

#define OVERLAYFS_SUPER_MAGIC 0x794c764f

+struct ovl_config {
+ char *lowerdir;
+ char *upperdir;
+};
+
+/* private information held for overlayfs's superblock */
struct ovl_fs {
struct vfsmount *upper_mnt;
struct vfsmount *lower_mnt;
long lower_namelen;
+ /* pathnames of lower and upper dirs, for show_options */
+ struct ovl_config config;
};

+/* private information held for every overlayfs dentry */
struct ovl_entry {
/*
* Keep "double reference" on upper dentries, so that
@@ -389,6 +399,8 @@ static void ovl_put_super(struct super_block *sb)
mntput(ufs->upper_mnt);
mntput(ufs->lower_mnt);

+ kfree(ufs->config.lowerdir);
+ kfree(ufs->config.upperdir);
kfree(ufs);
}

@@ -437,15 +449,27 @@ static int ovl_statfs(struct dentry *dentry, struct kstatfs *buf)
return err;
}

+/**
+ * ovl_show_options
+ *
+ * Prints the mount options for a given superblock.
+ * Returns zero; does not fail.
+ */
+static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
+{
+ struct super_block *sb = dentry->d_sb;
+ struct ovl_fs *ufs = sb->s_fs_info;
+
+ seq_printf(m, ",lowerdir=%s", ufs->config.lowerdir);
+ seq_printf(m, ",upperdir=%s", ufs->config.upperdir);
+ return 0;
+}
+
static const struct super_operations ovl_super_operations = {
.put_super = ovl_put_super,
.remount_fs = ovl_remount_fs,
.statfs = ovl_statfs,
-};
-
-struct ovl_config {
- char *lowerdir;
- char *upperdir;
+ .show_options = ovl_show_options,
};

enum {
@@ -505,34 +529,33 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
struct dentry *root_dentry;
struct ovl_entry *oe;
struct ovl_fs *ufs;
- struct ovl_config config;
struct kstatfs statfs;
int err;

- err = ovl_parse_opt((char *) data, &config);
- if (err)
+ err = -ENOMEM;
+ ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
+ if (!ufs)
goto out;

+ err = ovl_parse_opt((char *) data, &ufs->config);
+ if (err)
+ goto out_free_ufs;
+
err = -EINVAL;
- if (!config.upperdir || !config.lowerdir) {
+ if (!ufs->config.upperdir || !ufs->config.lowerdir) {
pr_err("overlayfs: missing upperdir or lowerdir\n");
goto out_free_config;
}

- err = -ENOMEM;
- ufs = kmalloc(sizeof(struct ovl_fs), GFP_KERNEL);
- if (!ufs)
- goto out_free_config;
-
oe = ovl_alloc_entry();
if (oe == NULL)
- goto out_free_ufs;
+ goto out_free_config;

- err = kern_path(config.upperdir, LOOKUP_FOLLOW, &upperpath);
+ err = kern_path(ufs->config.upperdir, LOOKUP_FOLLOW, &upperpath);
if (err)
goto out_free_oe;

- err = kern_path(config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
+ err = kern_path(ufs->config.lowerdir, LOOKUP_FOLLOW, &lowerpath);
if (err)
goto out_put_upperpath;

@@ -616,11 +639,11 @@ out_put_upperpath:
path_put(&upperpath);
out_free_oe:
kfree(oe);
+out_free_config:
+ kfree(ufs->config.lowerdir);
+ kfree(ufs->config.upperdir);
out_free_ufs:
kfree(ufs);
-out_free_config:
- kfree(config.lowerdir);
- kfree(config.upperdir);
out:
return err;
}
--
1.7.10.4

2013-03-13 14:18:53

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 8/9] overlay: overlay filesystem documentation

From: Neil Brown <[email protected]>

Document the overlay filesystem.

Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/overlayfs.txt | 199 +++++++++++++++++++++++++++++++
MAINTAINERS | 7 ++
2 files changed, 206 insertions(+)
create mode 100644 Documentation/filesystems/overlayfs.txt

diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
new file mode 100644
index 0000000..00dbab0
--- /dev/null
+++ b/Documentation/filesystems/overlayfs.txt
@@ -0,0 +1,199 @@
+Written by: Neil Brown <[email protected]>
+
+Overlay Filesystem
+==================
+
+This document describes a prototype for a new approach to providing
+overlay-filesystem functionality in Linux (sometimes referred to as
+union-filesystems). An overlay-filesystem tries to present a
+filesystem which is the result over overlaying one filesystem on top
+of the other.
+
+The result will inevitably fail to look exactly like a normal
+filesystem for various technical reasons. The expectation is that
+many use cases will be able to ignore these differences.
+
+This approach is 'hybrid' because the objects that appear in the
+filesystem do not all appear to belong to that filesystem. In many
+cases an object accessed in the union will be indistinguishable
+from accessing the corresponding object from the original filesystem.
+This is most obvious from the 'st_dev' field returned by stat(2).
+
+While directories will report an st_dev from the overlay-filesystem,
+all non-directory objects will report an st_dev from the lower or
+upper filesystem that is providing the object. Similarly st_ino will
+only be unique when combined with st_dev, and both of these can change
+over the lifetime of a non-directory object. Many applications and
+tools ignore these values and will not be affected.
+
+Upper and Lower
+---------------
+
+An overlay filesystem combines two filesystems - an 'upper' filesystem
+and a 'lower' filesystem. When a name exists in both filesystems, the
+object in the 'upper' filesystem is visible while the object in the
+'lower' filesystem is either hidden or, in the case of directories,
+merged with the 'upper' object.
+
+It would be more correct to refer to an upper and lower 'directory
+tree' rather than 'filesystem' as it is quite possible for both
+directory trees to be in the same filesystem and there is no
+requirement that the root of a filesystem be given for either upper or
+lower.
+
+The lower filesystem can be any filesystem supported by Linux and does
+not need to be writable. The lower filesystem can even be another
+overlayfs. The upper filesystem will normally be writable and if it
+is it must support the creation of trusted.* extended attributes, and
+must provide valid d_type in readdir responses, at least for symbolic
+links - so NFS is not suitable.
+
+A read-only overlay of two read-only filesystems may use any
+filesystem type.
+
+Directories
+-----------
+
+Overlaying mainly involves directories. If a given name appears in both
+upper and lower filesystems and refers to a non-directory in either,
+then the lower object is hidden - the name refers only to the upper
+object.
+
+Where both upper and lower objects are directories, a merged directory
+is formed.
+
+At mount time, the two directories given as mount options are combined
+into a merged directory:
+
+ mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper /overlay
+
+Then whenever a lookup is requested in such a merged directory, the
+lookup is performed in each actual directory and the combined result
+is cached in the dentry belonging to the overlay filesystem. If both
+actual lookups find directories, both are stored and a merged
+directory is created, otherwise only one is stored: the upper if it
+exists, else the lower.
+
+Only the lists of names from directories are merged. Other content
+such as metadata and extended attributes are reported for the upper
+directory only. These attributes of the lower directory are hidden.
+
+whiteouts and opaque directories
+--------------------------------
+
+In order to support rm and rmdir without changing the lower
+filesystem, an overlay filesystem needs to record in the upper filesystem
+that files have been removed. This is done using whiteouts and opaque
+directories (non-directories are always opaque).
+
+The overlay filesystem uses extended attributes with a
+"trusted.overlay." prefix to record these details.
+
+A whiteout is created as a symbolic link with target
+"(overlay-whiteout)" and with xattr "trusted.overlay.whiteout" set to "y".
+When a whiteout is found in the upper level of a merged directory, any
+matching name in the lower level is ignored, and the whiteout itself
+is also hidden.
+
+A directory is made opaque by setting the xattr "trusted.overlay.opaque"
+to "y". Where the upper filesystem contains an opaque directory, any
+directory in the lower filesystem with the same name is ignored.
+
+readdir
+-------
+
+When a 'readdir' request is made on a merged directory, the upper and
+lower directories are each read and the name lists merged in the
+obvious way (upper is read first, then lower - entries that already
+exist are not re-added). This merged name list is cached in the
+'struct file' and so remains as long as the file is kept open. If the
+directory is opened and read by two processes at the same time, they
+will each have separate caches. A seekdir to the start of the
+directory (offset 0) followed by a readdir will cause the cache to be
+discarded and rebuilt.
+
+This means that changes to the merged directory do not appear while a
+directory is being read. This is unlikely to be noticed by many
+programs.
+
+seek offsets are assigned sequentially when the directories are read.
+Thus if
+ - read part of a directory
+ - remember an offset, and close the directory
+ - re-open the directory some time later
+ - seek to the remembered offset
+
+there may be little correlation between the old and new locations in
+the list of filenames, particularly if anything has changed in the
+directory.
+
+Readdir on directories that are not merged is simply handled by the
+underlying directory (upper or lower).
+
+
+Non-directories
+---------------
+
+Objects that are not directories (files, symlinks, device-special
+files etc.) are presented either from the upper or lower filesystem as
+appropriate. When a file in the lower filesystem is accessed in a way
+the requires write-access, such as opening for write access, changing
+some metadata etc., the file is first copied from the lower filesystem
+to the upper filesystem (copy_up). Note that creating a hard-link
+also requires copy_up, though of course creation of a symlink does
+not.
+
+The copy_up may turn out to be unnecessary, for example if the file is
+opened for read-write but the data is not modified.
+
+The copy_up process first makes sure that the containing directory
+exists in the upper filesystem - creating it and any parents as
+necessary. It then creates the object with the same metadata (owner,
+mode, mtime, symlink-target etc.) and then if the object is a file, the
+data is copied from the lower to the upper filesystem. Finally any
+extended attributes are copied up.
+
+Once the copy_up is complete, the overlay filesystem simply
+provides direct access to the newly created file in the upper
+filesystem - future operations on the file are barely noticed by the
+overlay filesystem (though an operation on the name of the file such as
+rename or unlink will of course be noticed and handled).
+
+
+Non-standard behavior
+---------------------
+
+The copy_up operation essentially creates a new, identical file and
+moves it over to the old name. The new file may be on a different
+filesystem, so both st_dev and st_ino of the file may change.
+
+Any open files referring to this inode will access the old data and
+metadata. Similarly any file locks obtained before copy_up will not
+apply to the copied up file.
+
+On a file opened with O_RDONLY fchmod(2), fchown(2), futimesat(2) and
+fsetxattr(2) will fail with EROFS.
+
+If a file with multiple hard links is copied up, then this will
+"break" the link. Changes will not be propagated to other names
+referring to the same inode.
+
+Symlinks in /proc/PID/ and /proc/PID/fd which point to a non-directory
+object in overlayfs will not contain valid absolute paths, only
+relative paths leading up to the filesystem's root. This will be
+fixed in the future.
+
+Some operations are not atomic, for example a crash during copy_up or
+rename will leave the filesystem in an inconsistent state. This will
+be addressed in the future.
+
+Changes to underlying filesystems
+---------------------------------
+
+Offline changes, when the overlay is not mounted, are allowed to either
+the upper or the lower trees.
+
+Changes to the underlying filesystems while part of a mounted overlay
+filesystem are not allowed. If the underlying filesystem is changed,
+the behavior of the overlay is undefined, though it will not result in
+a crash or deadlock.
diff --git a/MAINTAINERS b/MAINTAINERS
index 9561658..9ea89b8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5872,6 +5872,13 @@ F: drivers/scsi/osd/
F: include/scsi/osd_*
F: fs/exofs/

+OVERLAYFS FILESYSTEM
+M: Miklos Szeredi <[email protected]>
+L: [email protected]
+S: Supported
+F: fs/overlayfs/*
+F: Documentation/filesystems/overlayfs.txt
+
P54 WIRELESS DRIVER
M: Christian Lamparter <[email protected]>
L: [email protected]
--
1.7.10.4

2013-03-13 14:19:30

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 4/9] vfs: introduce clone_private_mount()

From: Miklos Szeredi <[email protected]>

Overlayfs needs a private clone of the mount, so create a function for
this and export to modules.

Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/namespace.c | 18 ++++++++++++++++++
include/linux/mount.h | 3 +++
2 files changed, 21 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 50ca17d..9791b4e 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1399,6 +1399,24 @@ void drop_collected_mounts(struct vfsmount *mnt)
release_mounts(&umount_list);
}

+struct vfsmount *clone_private_mount(struct path *path)
+{
+ struct mount *old_mnt = real_mount(path->mnt);
+ struct mount *new_mnt;
+
+ if (IS_MNT_UNBINDABLE(old_mnt))
+ return ERR_PTR(-EINVAL);
+
+ down_read(&namespace_sem);
+ new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE);
+ up_read(&namespace_sem);
+ if (!new_mnt)
+ return ERR_PTR(-ENOMEM);
+
+ return &new_mnt->mnt;
+}
+EXPORT_SYMBOL_GPL(clone_private_mount);
+
int iterate_mounts(int (*f)(struct vfsmount *, void *), void *arg,
struct vfsmount *root)
{
diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..344a262 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -66,6 +66,9 @@ extern void mnt_pin(struct vfsmount *mnt);
extern void mnt_unpin(struct vfsmount *mnt);
extern int __mnt_is_readonly(struct vfsmount *mnt);

+struct path;
+extern struct vfsmount *clone_private_mount(struct path *path);
+
struct file_system_type;
extern struct vfsmount *vfs_kern_mount(struct file_system_type *type,
int flags, const char *name,
--
1.7.10.4

2013-03-13 14:19:55

by Miklos Szeredi

[permalink] [raw]
Subject: [PATCH 2/9] vfs: export do_splice_direct() to modules

From: Miklos Szeredi <[email protected]>

Export do_splice_direct() to modules. Needed by overlay filesystem.

Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/splice.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/splice.c b/fs/splice.c
index 718bd00..0e8f44a 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1308,6 +1308,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,

return ret;
}
+EXPORT_SYMBOL(do_splice_direct);

static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
struct pipe_inode_info *opipe,
--
1.7.10.4

2013-03-13 14:31:35

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 3:16 PM, Miklos Szeredi <[email protected]> wrote:
> Here's another version with the comments addressed plus a small bugfix and some
> checkpatch cleanups.
>
> Changes in v17:
>
> - fix wrong return value in a failure path in ovl_link()
> - fix subjects
> - use file_inode() and MODULE_ALIAS_FS()
> - fold bugfix patches
> - checkpatch cleanups
>
> Git tree is here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.current
>

Hi,

I pulled in v17 (current) into Linux-Next (next-20130313) and built
OverlayFS as a module.

Unfortunately, I do not see any success message on loading it.

--- dmesg_3.9.0-rc2-next20130313-3-iniza-small.txt 2013-03-13
15:21:19.578712536 +0100
+++ dmesg_3.9.0-rc2-next20130313-3-iniza-small_after-overlayfs-test.txt
2013-03-13 15:22:14.658238998 +0100
@@ -806,3 +806,8 @@
[ 25.517154] wlan0: associated
[ 25.517214] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[ 54.149536] usb 2-1.5: USB disconnect, device number 3
+[ 86.502215] squashfs: version 4.0 (2009/01/31) Phillip Lougher
+[ 87.007082] EXT4-fs (loop4): mounted filesystem with ordered data
mode. Opts: (null)
+[ 87.311998] EXT4-fs (loop4): re-mounted. Opts: data=ordered
+[ 87.657291] EXT4-fs (loop4): re-mounted. Opts: data=ordered
+[ 88.057251] EXT4-fs (loop4): re-mounted. Opts: data=ordered

Highly appreciated to see such a message!

Test-case script attached (needs an additional patch on top of
overlayfs.current, see attachments).

- Sedat -

> Thanks,
> Miklos
>
> ---
> Andy Whitcroft (1):
> overlayfs: add statfs support
>
> Erez Zadok (1):
> overlayfs: implement show_options
>
> Miklos Szeredi (6):
> vfs: add i_op->dentry_open()
> vfs: export do_splice_direct() to modules
> vfs: export __inode_permission() to modules
> vfs: introduce clone_private_mount()
> overlay filesystem
> fs: limit filesystem stacking depth
>
> Neil Brown (1):
> overlay: overlay filesystem documentation
>
> ---
> Documentation/filesystems/Locking | 2 +
> Documentation/filesystems/overlayfs.txt | 199 +++++++++
> Documentation/filesystems/vfs.txt | 7 +
> MAINTAINERS | 7 +
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/ecryptfs/main.c | 7 +
> fs/internal.h | 5 -
> fs/namei.c | 10 +-
> fs/namespace.c | 18 +
> fs/open.c | 23 +-
> fs/overlayfs/Kconfig | 10 +
> fs/overlayfs/Makefile | 7 +
> fs/overlayfs/copy_up.c | 385 +++++++++++++++++
> fs/overlayfs/dir.c | 605 +++++++++++++++++++++++++++
> fs/overlayfs/inode.c | 372 +++++++++++++++++
> fs/overlayfs/overlayfs.h | 70 ++++
> fs/overlayfs/readdir.c | 566 +++++++++++++++++++++++++
> fs/overlayfs/super.c | 686 +++++++++++++++++++++++++++++++
> fs/splice.c | 1 +
> include/linux/fs.h | 14 +
> include/linux/mount.h | 3 +
> 22 files changed, 2989 insertions(+), 10 deletions(-)
> create mode 100644 Documentation/filesystems/overlayfs.txt
> create mode 100644 fs/overlayfs/Kconfig
> create mode 100644 fs/overlayfs/Makefile
> create mode 100644 fs/overlayfs/copy_up.c
> create mode 100644 fs/overlayfs/dir.c
> create mode 100644 fs/overlayfs/inode.c
> create mode 100644 fs/overlayfs/overlayfs.h
> create mode 100644 fs/overlayfs/readdir.c
> create mode 100644 fs/overlayfs/super.c
>


Attachments:
run_overlayfs-test.sh (4.81 kB)
3.9.0-rc2-next20130313-3-iniza-small.patch (85.27 kB)
Download all attachments

2013-03-13 15:13:10

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 3:31 PM, Sedat Dilek <[email protected]> wrote:
> On Wed, Mar 13, 2013 at 3:16 PM, Miklos Szeredi <[email protected]> wrote:
>> Here's another version with the comments addressed plus a small bugfix and some
>> checkpatch cleanups.
>>
>> Changes in v17:
>>
>> - fix wrong return value in a failure path in ovl_link()
>> - fix subjects
>> - use file_inode() and MODULE_ALIAS_FS()
>> - fold bugfix patches
>> - checkpatch cleanups
>>
>> Git tree is here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.current
>>
>
> Hi,
>
> I pulled in v17 (current) into Linux-Next (next-20130313) and built
> OverlayFS as a module.
>
> Unfortunately, I do not see any success message on loading it.
>
> --- dmesg_3.9.0-rc2-next20130313-3-iniza-small.txt 2013-03-13
> 15:21:19.578712536 +0100
> +++ dmesg_3.9.0-rc2-next20130313-3-iniza-small_after-overlayfs-test.txt
> 2013-03-13 15:22:14.658238998 +0100
> @@ -806,3 +806,8 @@
> [ 25.517154] wlan0: associated
> [ 25.517214] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
> [ 54.149536] usb 2-1.5: USB disconnect, device number 3
> +[ 86.502215] squashfs: version 4.0 (2009/01/31) Phillip Lougher
> +[ 87.007082] EXT4-fs (loop4): mounted filesystem with ordered data
> mode. Opts: (null)
> +[ 87.311998] EXT4-fs (loop4): re-mounted. Opts: data=ordered
> +[ 87.657291] EXT4-fs (loop4): re-mounted. Opts: data=ordered
> +[ 88.057251] EXT4-fs (loop4): re-mounted. Opts: data=ordered
>
> Highly appreciated to see such a message!
>
> Test-case script attached (needs an additional patch on top of
> overlayfs.current, see attachments).
>

Looks like this is missing (or intended?):

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 482c26f..f23ebfc 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -684,3 +684,6 @@ static void __exit ovl_exit(void)

module_init(ovl_init);
module_exit(ovl_exit);
+MODULE_DESCRIPTION("overlayfs v17: provides overlay-filesystem functionality");
+MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
+MODULE_LICENSE("GPL");

Feel free to correct if I fill in bullshit...

- Sedat -

> - Sedat -
>
>> Thanks,
>> Miklos
>>
>> ---
>> Andy Whitcroft (1):
>> overlayfs: add statfs support
>>
>> Erez Zadok (1):
>> overlayfs: implement show_options
>>
>> Miklos Szeredi (6):
>> vfs: add i_op->dentry_open()
>> vfs: export do_splice_direct() to modules
>> vfs: export __inode_permission() to modules
>> vfs: introduce clone_private_mount()
>> overlay filesystem
>> fs: limit filesystem stacking depth
>>
>> Neil Brown (1):
>> overlay: overlay filesystem documentation
>>
>> ---
>> Documentation/filesystems/Locking | 2 +
>> Documentation/filesystems/overlayfs.txt | 199 +++++++++
>> Documentation/filesystems/vfs.txt | 7 +
>> MAINTAINERS | 7 +
>> fs/Kconfig | 1 +
>> fs/Makefile | 1 +
>> fs/ecryptfs/main.c | 7 +
>> fs/internal.h | 5 -
>> fs/namei.c | 10 +-
>> fs/namespace.c | 18 +
>> fs/open.c | 23 +-
>> fs/overlayfs/Kconfig | 10 +
>> fs/overlayfs/Makefile | 7 +
>> fs/overlayfs/copy_up.c | 385 +++++++++++++++++
>> fs/overlayfs/dir.c | 605 +++++++++++++++++++++++++++
>> fs/overlayfs/inode.c | 372 +++++++++++++++++
>> fs/overlayfs/overlayfs.h | 70 ++++
>> fs/overlayfs/readdir.c | 566 +++++++++++++++++++++++++
>> fs/overlayfs/super.c | 686 +++++++++++++++++++++++++++++++
>> fs/splice.c | 1 +
>> include/linux/fs.h | 14 +
>> include/linux/mount.h | 3 +
>> 22 files changed, 2989 insertions(+), 10 deletions(-)
>> create mode 100644 Documentation/filesystems/overlayfs.txt
>> create mode 100644 fs/overlayfs/Kconfig
>> create mode 100644 fs/overlayfs/Makefile
>> create mode 100644 fs/overlayfs/copy_up.c
>> create mode 100644 fs/overlayfs/dir.c
>> create mode 100644 fs/overlayfs/inode.c
>> create mode 100644 fs/overlayfs/overlayfs.h
>> create mode 100644 fs/overlayfs/readdir.c
>> create mode 100644 fs/overlayfs/super.c
>>

2013-03-13 15:18:57

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

> Looks like this is missing (or intended?):
>
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 482c26f..f23ebfc 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -684,3 +684,6 @@ static void __exit ovl_exit(void)
>
> module_init(ovl_init);
> module_exit(ovl_exit);
> +MODULE_DESCRIPTION("overlayfs v17: provides overlay-filesystem functionality");
> +MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
> +MODULE_LICENSE("GPL");

No, it *is* already in there.

Thanks,
Miklos

2013-03-13 15:26:28

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 4:18 PM, Miklos Szeredi <[email protected]> wrote:
>> Looks like this is missing (or intended?):
>>
>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>> index 482c26f..f23ebfc 100644
>> --- a/fs/overlayfs/super.c
>> +++ b/fs/overlayfs/super.c
>> @@ -684,3 +684,6 @@ static void __exit ovl_exit(void)
>>
>> module_init(ovl_init);
>> module_exit(ovl_exit);
>> +MODULE_DESCRIPTION("overlayfs v17: provides overlay-filesystem functionality");
>> +MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
>> +MODULE_LICENSE("GPL");
>
> No, it *is* already in there.
>

Where is it?

Last lines in [1] are:

685 module_init(ovl_init);
686 module_exit(ovl_exit);

- Sedat -

[1] http://git.kernel.org/cgit/linux/kernel/git/mszeredi/vfs.git/tree/fs/overlayfs/super.c?h=overlayfs.current

> Thanks,
> Miklos

2013-03-13 15:53:11

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 4:26 PM, Sedat Dilek <[email protected]> wrote:
> On Wed, Mar 13, 2013 at 4:18 PM, Miklos Szeredi <[email protected]> wrote:
>>> Looks like this is missing (or intended?):
>>>
>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>> index 482c26f..f23ebfc 100644
>>> --- a/fs/overlayfs/super.c
>>> +++ b/fs/overlayfs/super.c
>>> @@ -684,3 +684,6 @@ static void __exit ovl_exit(void)
>>>
>>> module_init(ovl_init);
>>> module_exit(ovl_exit);
>>> +MODULE_DESCRIPTION("overlayfs v17: provides overlay-filesystem functionality");
>>> +MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
>>> +MODULE_LICENSE("GPL");
>>
>> No, it *is* already in there.
>>
>
> Where is it?
>
> Last lines in [1] are:
>
> 685 module_init(ovl_init);
> 686 module_exit(ovl_exit);
>

OK, I looked at SquashFS which is not converted to use MODULE_ALIAS_FS.

Hehe, with my patch that looks now funny :-).

$ sudo modinfo overlayfs
filename:
/lib/modules/3.9.0-rc2-next20130313-4-iniza-small/kernel/fs/overlayfs/overlayfs.ko
license: GPL
author: Miklos Szeredi <[email protected]>
description: overlayfs v17, provides overlay-filesystem functionality
alias: fs-overlayfs
license: GPL
description: Overlay filesystem
author: Miklos Szeredi <[email protected]>
srcversion: 4332BB91603829A85CCEA59
depends:
intree: Y
vermagic: 3.9.0-rc2-next20130313-4-iniza-small SMP mod_unload modversions


$ sudo modinfo squashfs
filename:
/lib/modules/3.9.0-rc2-next20130313-4-iniza-small/kernel/fs/squashfs/squashfs.ko
license: GPL
author: Phillip Lougher <[email protected]>
description: squashfs 4.0, a compressed read-only filesystem
alias: fs-squashfs
srcversion: 752DB671D8E8DFB606BFC88
depends:
intree: Y
vermagic: 3.9.0-rc2-next20130313-4-iniza-small SMP mod_unload modversions

- Sedat -

> - Sedat -
>
> [1] http://git.kernel.org/cgit/linux/kernel/git/mszeredi/vfs.git/tree/fs/overlayfs/super.c?h=overlayfs.current
>
>> Thanks,
>> Miklos

2013-03-13 16:17:27

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 4:53 PM, Sedat Dilek <[email protected]> wrote:
> On Wed, Mar 13, 2013 at 4:26 PM, Sedat Dilek <[email protected]> wrote:
>> On Wed, Mar 13, 2013 at 4:18 PM, Miklos Szeredi <[email protected]> wrote:
>>>> Looks like this is missing (or intended?):
>>>>
>>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>>> index 482c26f..f23ebfc 100644
>>>> --- a/fs/overlayfs/super.c
>>>> +++ b/fs/overlayfs/super.c
>>>> @@ -684,3 +684,6 @@ static void __exit ovl_exit(void)
>>>>
>>>> module_init(ovl_init);
>>>> module_exit(ovl_exit);
>>>> +MODULE_DESCRIPTION("overlayfs v17: provides overlay-filesystem functionality");
>>>> +MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
>>>> +MODULE_LICENSE("GPL");
>>>
>>> No, it *is* already in there.
>>>
>>
>> Where is it?
>>
>> Last lines in [1] are:
>>
>> 685 module_init(ovl_init);
>> 686 module_exit(ovl_exit);
>>
>
> OK, I looked at SquashFS which is not converted to use MODULE_ALIAS_FS.
>
> Hehe, with my patch that looks now funny :-).
>
> $ sudo modinfo overlayfs
> filename:
> /lib/modules/3.9.0-rc2-next20130313-4-iniza-small/kernel/fs/overlayfs/overlayfs.ko
> license: GPL
> author: Miklos Szeredi <[email protected]>
> description: overlayfs v17, provides overlay-filesystem functionality
> alias: fs-overlayfs
> license: GPL
> description: Overlay filesystem
> author: Miklos Szeredi <[email protected]>
> srcversion: 4332BB91603829A85CCEA59
> depends:
> intree: Y
> vermagic: 3.9.0-rc2-next20130313-4-iniza-small SMP mod_unload modversions
>
>
> $ sudo modinfo squashfs
> filename:
> /lib/modules/3.9.0-rc2-next20130313-4-iniza-small/kernel/fs/squashfs/squashfs.ko
> license: GPL
> author: Phillip Lougher <[email protected]>
> description: squashfs 4.0, a compressed read-only filesystem
> alias: fs-squashfs
> srcversion: 752DB671D8E8DFB606BFC88
> depends:
> intree: Y
> vermagic: 3.9.0-rc2-next20130313-4-iniza-small SMP mod_unload modversions
>

Nah, SquashFS has MODULE_ALIAS_FS in Linux-Next, but /me looked into Linus-tree.

You are right, you have those MODULE_XXX at the beginning of
fs/overlayfs/super.c

Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
when the kernel-module is loaded.
So, is this intended?
SquashFS prints into the logs, so what is it doing differently?

- Sedat -


> - Sedat -
>
>> - Sedat -
>>
>> [1] http://git.kernel.org/cgit/linux/kernel/git/mszeredi/vfs.git/tree/fs/overlayfs/super.c?h=overlayfs.current
>>
>>> Thanks,
>>> Miklos

2013-03-13 16:21:34

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 5:10 PM, Sedat Dilek <[email protected]> wrote:

> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
> when the kernel-module is loaded.
> So, is this intended?
> SquashFS prints into the logs, so what is it doing differently?

Some modules do, some don't. It's not compulsory and not very useful.

Thanks,
Miklos

2013-03-13 16:35:20

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 5:21 PM, Miklos Szeredi <[email protected]> wrote:
> On Wed, Mar 13, 2013 at 5:10 PM, Sedat Dilek <[email protected]> wrote:
>
>> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
>> when the kernel-module is loaded.
>> So, is this intended?
>> SquashFS prints into the logs, so what is it doing differently?
>
> Some modules do, some don't. It's not compulsory and not very useful.
>

That's some hardcoded printk-stuff which is IMHO wrong implemented and outdated.
I hoped that this is handled in al FS-modules the same way.
Do not assume...

- Sedat -

[1] http://marc.info/?l=linux-kernel&m=136319193916686&w=2

> Thanks,
> Miklos

2013-03-13 16:51:36

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 5:21 PM, Miklos Szeredi <[email protected]> wrote:
> On Wed, Mar 13, 2013 at 5:10 PM, Sedat Dilek <[email protected]> wrote:
>
>> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
>> when the kernel-module is loaded.
>> So, is this intended?
>> SquashFS prints into the logs, so what is it doing differently?
>
> Some modules do, some don't. It's not compulsory and not very useful.
>

What about...

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 482c26f..92b9ad5 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -675,11 +675,13 @@ MODULE_ALIAS_FS("overlayfs");
static int __init ovl_init(void)
{
return register_filesystem(&ovl_fs_type);
+ printk(KERN_INFO "overlayfs: loaded\n");
}

static void __exit ovl_exit(void)
{
unregister_filesystem(&ovl_fs_type);
+ printk(KERN_INFO "overlayfs: unloaded\n");
}

module_init(ovl_init);

Maybe "loaded" is enough.

- Sedat -

> Thanks,
> Miklos

2013-03-13 18:12:56

by Robin Holt

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 05:51:33PM +0100, Sedat Dilek wrote:
> On Wed, Mar 13, 2013 at 5:21 PM, Miklos Szeredi <[email protected]> wrote:
> > On Wed, Mar 13, 2013 at 5:10 PM, Sedat Dilek <[email protected]> wrote:
> >
> >> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
> >> when the kernel-module is loaded.
> >> So, is this intended?
> >> SquashFS prints into the logs, so what is it doing differently?
> >
> > Some modules do, some don't. It's not compulsory and not very useful.
> >
>
> What about...
>
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 482c26f..92b9ad5 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -675,11 +675,13 @@ MODULE_ALIAS_FS("overlayfs");
> static int __init ovl_init(void)
> {
> return register_filesystem(&ovl_fs_type);
> + printk(KERN_INFO "overlayfs: loaded\n");

How about pr_debug(). That makes things quite for most of us but gives
you info when looking for things like boot problems.

> }
>
> static void __exit ovl_exit(void)
> {
> unregister_filesystem(&ovl_fs_type);
> + printk(KERN_INFO "overlayfs: unloaded\n");
> }
>
> module_init(ovl_init);
>
> Maybe "loaded" is enough.
>
> - Sedat -
>
> > Thanks,
> > Miklos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-03-13 18:38:05

by Felix Fietkau

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On 2013-03-13 7:12 PM, Robin Holt wrote:
> On Wed, Mar 13, 2013 at 05:51:33PM +0100, Sedat Dilek wrote:
>> On Wed, Mar 13, 2013 at 5:21 PM, Miklos Szeredi <[email protected]> wrote:
>> > On Wed, Mar 13, 2013 at 5:10 PM, Sedat Dilek <[email protected]> wrote:
>> >
>> >> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
>> >> when the kernel-module is loaded.
>> >> So, is this intended?
>> >> SquashFS prints into the logs, so what is it doing differently?
>> >
>> > Some modules do, some don't. It's not compulsory and not very useful.
>> >
>>
>> What about...
>>
>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>> index 482c26f..92b9ad5 100644
>> --- a/fs/overlayfs/super.c
>> +++ b/fs/overlayfs/super.c
>> @@ -675,11 +675,13 @@ MODULE_ALIAS_FS("overlayfs");
>> static int __init ovl_init(void)
>> {
>> return register_filesystem(&ovl_fs_type);
>> + printk(KERN_INFO "overlayfs: loaded\n");
>
> How about pr_debug(). That makes things quite for most of us but gives
> you info when looking for things like boot problems.
How is having a printk/pr_debug useful here at all? The fact that other
filesystems do this as well is not a good reason...
Also, putting it *after* the return statement doesn't make much sense ;)

- Felix

2013-03-13 19:10:29

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 7:37 PM, Felix Fietkau <[email protected]> wrote:
> On 2013-03-13 7:12 PM, Robin Holt wrote:
>> On Wed, Mar 13, 2013 at 05:51:33PM +0100, Sedat Dilek wrote:
>>> On Wed, Mar 13, 2013 at 5:21 PM, Miklos Szeredi <[email protected]> wrote:
>>> > On Wed, Mar 13, 2013 at 5:10 PM, Sedat Dilek <[email protected]> wrote:
>>> >
>>> >> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
>>> >> when the kernel-module is loaded.
>>> >> So, is this intended?
>>> >> SquashFS prints into the logs, so what is it doing differently?
>>> >
>>> > Some modules do, some don't. It's not compulsory and not very useful.
>>> >
>>>
>>> What about...
>>>
>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>> index 482c26f..92b9ad5 100644
>>> --- a/fs/overlayfs/super.c
>>> +++ b/fs/overlayfs/super.c
>>> @@ -675,11 +675,13 @@ MODULE_ALIAS_FS("overlayfs");
>>> static int __init ovl_init(void)
>>> {
>>> return register_filesystem(&ovl_fs_type);
>>> + printk(KERN_INFO "overlayfs: loaded\n");
>>
>> How about pr_debug(). That makes things quite for most of us but gives
>> you info when looking for things like boot problems.
> How is having a printk/pr_debug useful here at all? The fact that other
> filesystems do this as well is not a good reason...
> Also, putting it *after* the return statement doesn't make much sense ;)
>

Hehe, I just checked my new kernel... that does not work (nothing in the logs).
But I think it's good to see if the filesystem is registered/loaded.

- Sedat -

> - Felix
>

2013-03-13 19:55:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

Sedat Dilek <[email protected]> writes:

> On Wed, Mar 13, 2013 at 7:37 PM, Felix Fietkau <[email protected]> wrote:
>> On 2013-03-13 7:12 PM, Robin Holt wrote:
>>> On Wed, Mar 13, 2013 at 05:51:33PM +0100, Sedat Dilek wrote:
>>>> On Wed, Mar 13, 2013 at 5:21 PM, Miklos Szeredi <[email protected]> wrote:
>>>> > On Wed, Mar 13, 2013 at 5:10 PM, Sedat Dilek <[email protected]> wrote:
>>>> >
>>>> >> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
>>>> >> when the kernel-module is loaded.
>>>> >> So, is this intended?
>>>> >> SquashFS prints into the logs, so what is it doing differently?
>>>> >
>>>> > Some modules do, some don't. It's not compulsory and not very useful.
>>>> >
>>>>
>>>> What about...
>>>>
>>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>>> index 482c26f..92b9ad5 100644
>>>> --- a/fs/overlayfs/super.c
>>>> +++ b/fs/overlayfs/super.c
>>>> @@ -675,11 +675,13 @@ MODULE_ALIAS_FS("overlayfs");
>>>> static int __init ovl_init(void)
>>>> {
>>>> return register_filesystem(&ovl_fs_type);
>>>> + printk(KERN_INFO "overlayfs: loaded\n");
>>>
>>> How about pr_debug(). That makes things quite for most of us but gives
>>> you info when looking for things like boot problems.
>> How is having a printk/pr_debug useful here at all? The fact that other
>> filesystems do this as well is not a good reason...
>> Also, putting it *after* the return statement doesn't make much sense ;)
>>
>
> Hehe, I just checked my new kernel... that does not work (nothing in the logs).
> But I think it's good to see if the filesystem is registered/loaded.

lsmod | grep overlayfs

Eric

2013-03-13 19:59:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 12:54 PM, Eric W. Biederman
<[email protected]> wrote:
>>
>> Hehe, I just checked my new kernel... that does not work (nothing in the logs).
>> But I think it's good to see if the filesystem is registered/loaded.
>
> lsmod | grep overlayfs

How about the compiled-in case? What's wrong with just using the
obvious /proc/filesystems?

Linus

2013-03-13 20:27:25

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 8:58 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Mar 13, 2013 at 12:54 PM, Eric W. Biederman
> <[email protected]> wrote:
>>>
>>> Hehe, I just checked my new kernel... that does not work (nothing in the logs).
>>> But I think it's good to see if the filesystem is registered/loaded.
>>
>> lsmod | grep overlayfs
>
> How about the compiled-in case? What's wrong with just using the
> obvious /proc/filesystems?
>

As I don't need OverlayFS in my daily life, I have it here built as a module.

Nevertheless, as module or built-in /me as an end-user wants to know
what's going on.
So, dmesg is normally my first weapon.

I grep-ed trrough a bit through all the filesystems in fs/ and it is
not handled unique.

"Loaded" BTW is not the right term in FS-folks-jargon: The functions
are called (un)register_filesystem().

$ dmesg | grep -i loaded
[ 0.241037] vgaarb: loaded
[ 0.241243] libata version 3.00 loaded.
[ 0.358574] Block layer SCSI generic (bsg) driver version 0.4
loaded (major 252)
[ 0.368711] brd: module loaded
[ 0.369832] loop: module loaded
[ 0.650086] PM: Hibernation image not present or could not be loaded.
[ 0.651150] Loaded X.509 cert 'Magrathea: Glacier signing key:
e75e3a59e59cb48db975ac12e21a905289ada56f'
[ 1.788947] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[ 8.617549] lp: driver loaded but no devices found
[ 10.490721] wmi: Mapper loaded
[ 11.994985] iwlwifi 0000:01:00.0: loaded firmware version 18.168.6.1
[ 73.885056] squashfs: loaded
[ 81.345070] overlayfs: loaded

I am not sure whether those lines "blabla: module loaded" is true for
loop (block) it is a hardcoded printk-line.
I doubt...

CONFIG_BLK_DEV_LOOP=y

Last but not least, user-space apps like GRUB trigger fs loaded.
For me this is helpful.
You might be bored or whatever reason I can't or don't want to understand.

And just to have some fun about status informations:
Check your drm (kms) driver with date and version :-).
Stuff that is not maintained should be thrown out (yes, I had sent patches...).

- Sedat -

> Linus

2013-03-13 20:32:47

by Sedat Dilek

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 9:13 PM, Phillip Lougher
<[email protected]> wrote:
>
>
> On Wed, Mar 13, 2013 at 4:10 PM, Sedat Dilek <[email protected]> wrote:
>>
>> On Wed, Mar 13, 2013 at 4:53 PM, Sedat Dilek <[email protected]>
>> wrote:
>> > On Wed, Mar 13, 2013 at 4:26 PM, Sedat Dilek <[email protected]>
>> > wrote:
>> >> On Wed, Mar 13, 2013 at 4:18 PM, Miklos Szeredi <[email protected]>
>> >> wrote:
>> >>>> Looks like this is missing (or intended?):
>> >>>>
>> >>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>> >>>> index 482c26f..f23ebfc 100644
>> >>>> --- a/fs/overlayfs/super.c
>> >>>> +++ b/fs/overlayfs/super.c
>> >>>> @@ -684,3 +684,6 @@ static void __exit ovl_exit(void)
>> >>>>
>> >>>> module_init(ovl_init);
>> >>>> module_exit(ovl_exit);
>> >>>> +MODULE_DESCRIPTION("overlayfs v17: provides overlay-filesystem
>> >>>> functionality");
>> >>>> +MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
>> >>>> +MODULE_LICENSE("GPL");
>> >>>
>> >>> No, it *is* already in there.
>> >>>
>> >>
>> >> Where is it?
>> >>
>> >> Last lines in [1] are:
>> >>
>> >> 685 module_init(ovl_init);
>> >> 686 module_exit(ovl_exit);
>> >>
>> >
>> > OK, I looked at SquashFS which is not converted to use MODULE_ALIAS_FS.
>> >
>> > Hehe, with my patch that looks now funny :-).
>> >
>> > $ sudo modinfo overlayfs
>> > filename:
>> >
>> > /lib/modules/3.9.0-rc2-next20130313-4-iniza-small/kernel/fs/overlayfs/overlayfs.ko
>> > license: GPL
>> > author: Miklos Szeredi <[email protected]>
>> > description: overlayfs v17, provides overlay-filesystem functionality
>> > alias: fs-overlayfs
>> > license: GPL
>> > description: Overlay filesystem
>> > author: Miklos Szeredi <[email protected]>
>> > srcversion: 4332BB91603829A85CCEA59
>> > depends:
>> > intree: Y
>> > vermagic: 3.9.0-rc2-next20130313-4-iniza-small SMP mod_unload
>> > modversions
>> >
>> >
>> > $ sudo modinfo squashfs
>> > filename:
>> >
>> > /lib/modules/3.9.0-rc2-next20130313-4-iniza-small/kernel/fs/squashfs/squashfs.ko
>> > license: GPL
>> > author: Phillip Lougher <[email protected]>
>> > description: squashfs 4.0, a compressed read-only filesystem
>> > alias: fs-squashfs
>> > srcversion: 752DB671D8E8DFB606BFC88
>> > depends:
>> > intree: Y
>> > vermagic: 3.9.0-rc2-next20130313-4-iniza-small SMP mod_unload
>> > modversions
>> >
>>
>> Nah, SquashFS has MODULE_ALIAS_FS in Linux-Next, but /me looked into
>> Linus-tree.
>>
>> You are right, you have those MODULE_XXX at the beginning of
>> fs/overlayfs/super.c
>>
>> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
>> when the kernel-module is loaded.
>> So, is this intended?
>> SquashFS prints into the logs, so what is it doing differently?
>
>
> SquashFS did it because it was out of tree for a long time, and you couldn't
> use the kernel version to tell what version of Squashfs you had patched in.
>
> When people dug about in their embedded system (router, STB etc.) they often
> got kernels without modules, without source and no idea of the Squashfs
> version... When they emailed me to ask why xyz Squashfs filesystem
> wouldn't mount, which version of squashfs-tools they should use, I often had
> no way of knowing.
>

Hi Phillip,

thanks for your explanations.

I can imagine that sometimes to have version and date is helpful for
the developer.
But /me thinks the way it was implemented is not very cool.
See "version.h" in fs/btrfs and appropriate printk-line in fs/btrfs/super.c.
That solution pleased me well.
There you can put all the relevant informations (even you insist to
display your $author :-)).

But as said in another email that status information stuff is not
handled unique/consequently in fs/.
I just fell over it... not more not less. Asking should be allowed.

- Sedat -


> Phillip
>
>>
>> - Sedat -
>>
>>
>> > - Sedat -
>> >
>> >> - Sedat -
>> >>
>> >> [1]
>> >> http://git.kernel.org/cgit/linux/kernel/git/mszeredi/vfs.git/tree/fs/overlayfs/super.c?h=overlayfs.current
>> >>
>> >>> Thanks,
>> >>> Miklos
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
>> in
>>
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2013-03-13 20:36:30

by Phillip Lougher

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, Mar 13, 2013 at 4:10 PM, Sedat Dilek <[email protected]> wrote:
> On Wed, Mar 13, 2013 at 4:53 PM, Sedat Dilek <[email protected]> wrote:
>> On Wed, Mar 13, 2013 at 4:26 PM, Sedat Dilek <[email protected]> wrote:
>>> On Wed, Mar 13, 2013 at 4:18 PM, Miklos Szeredi <[email protected]> wrote:
>>>>> Looks like this is missing (or intended?):
>>>>>
>>>>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>>>>> index 482c26f..f23ebfc 100644
>>>>> --- a/fs/overlayfs/super.c
>>>>> +++ b/fs/overlayfs/super.c
>>>>> @@ -684,3 +684,6 @@ static void __exit ovl_exit(void)
>>>>>
>>>>> module_init(ovl_init);
>>>>> module_exit(ovl_exit);
>>>>> +MODULE_DESCRIPTION("overlayfs v17: provides overlay-filesystem functionality");
>>>>> +MODULE_AUTHOR("Miklos Szeredi <[email protected]>");
>>>>> +MODULE_LICENSE("GPL");
>>>>
>>>> No, it *is* already in there.
>>>>
>>>
>>> Where is it?
>>>
>>> Last lines in [1] are:
>>>
>>> 685 module_init(ovl_init);
>>> 686 module_exit(ovl_exit);
>>>
>>
>> OK, I looked at SquashFS which is not converted to use MODULE_ALIAS_FS.
>>
>> Hehe, with my patch that looks now funny :-).
>>
>> $ sudo modinfo overlayfs
>> filename:
>> /lib/modules/3.9.0-rc2-next20130313-4-iniza-small/kernel/fs/overlayfs/overlayfs.ko
>> license: GPL
>> author: Miklos Szeredi <[email protected]>
>> description: overlayfs v17, provides overlay-filesystem functionality
>> alias: fs-overlayfs
>> license: GPL
>> description: Overlay filesystem
>> author: Miklos Szeredi <[email protected]>
>> srcversion: 4332BB91603829A85CCEA59
>> depends:
>> intree: Y
>> vermagic: 3.9.0-rc2-next20130313-4-iniza-small SMP mod_unload modversions
>>
>>
>> $ sudo modinfo squashfs
>> filename:
>> /lib/modules/3.9.0-rc2-next20130313-4-iniza-small/kernel/fs/squashfs/squashfs.ko
>> license: GPL
>> author: Phillip Lougher <[email protected]>
>> description: squashfs 4.0, a compressed read-only filesystem
>> alias: fs-squashfs
>> srcversion: 752DB671D8E8DFB606BFC88
>> depends:
>> intree: Y
>> vermagic: 3.9.0-rc2-next20130313-4-iniza-small SMP mod_unload modversions
>>
>
> Nah, SquashFS has MODULE_ALIAS_FS in Linux-Next, but /me looked into Linus-tree.
>
> You are right, you have those MODULE_XXX at the beginning of
> fs/overlayfs/super.c
>
> Anyway, with CONFIG_OVERLAYFS_FS=m I do not see any related messages
> when the kernel-module is loaded.
> So, is this intended?
> SquashFS prints into the logs, so what is it doing differently?

SquashFS did it because it was out of tree for a long time, and you
couldn't use the kernel version to tell what version of Squashfs you
had patched in.

When people dug about in their embedded system (router, STB etc.) they
often got kernels without modules, without source and no idea of the
Squashfs version... When they emailed me to ask why xyz Squashfs
filesystem wouldn't mount, which version of squashfs-tools they should
use, I often had no way of knowing.

Phillip

(re-sending to mailing lists because the original got bounced - I
thought I'd told gmail never to send HTML)

>
> - Sedat -
>
>
>> - Sedat -
>>
>>> - Sedat -
>>>
>>> [1] http://git.kernel.org/cgit/linux/kernel/git/mszeredi/vfs.git/tree/fs/overlayfs/super.c?h=overlayfs.current
>>>
>>>> Thanks,
>>>> Miklos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-03-13 22:44:38

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/9] vfs: add i_op->dentry_open()

On Wed, 13 Mar 2013 15:16:25 +0100 Miklos Szeredi <[email protected]> wrote:

> From: Miklos Szeredi <[email protected]>
>
> Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
> that want to return a struct file from a different filesystem.
>
> ...
>
> +/**
> + * vfs_open - open the file at the given path
> + * @path: path to open
> + * @filp: newly allocated file with f_flag initialized
> + * @cred: credentials to use
> + */
> +int vfs_open(const struct path *path, struct file *filp,
> + const struct cred *cred)
> +{
> + struct inode *inode = path->dentry->d_inode;
> +
> + if (inode->i_op->dentry_open)
> + return inode->i_op->dentry_open(path->dentry, filp, cred);
> + else {
> + filp->f_path = *path;
> + return do_dentry_open(filp, NULL, cred);
> + }
> +}

This looks like it will add some overhead to dentry_open(). That long
walk path->dentry->d_inode->i_op->dentry_open, particularly. Can we do
anything? Using filp->f_inode might save a tad.

It's unfortunate and a bit ugly that ->dentry_open() and
do_dentry_open() don't have the same arguments. One would expect and
like them to do so, and this difference raises suspicions about design
warts.

If they _did_ have the same signature then we could perhaps populate
->dentry_open with do_dentry_open by default and avoid the `if'.

The test of inode->i_op->dentry_open could do with an unlikely(), but
that won't save us :(

> +EXPORT_SYMBOL(vfs_open);

Did you consider EXPORT_SYMBOL_GPL()?

2013-03-13 22:45:08

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Wed, 13 Mar 2013 15:16:26 +0100 Miklos Szeredi <[email protected]> wrote:

> From: Miklos Szeredi <[email protected]>
>
> Export do_splice_direct() to modules. Needed by overlay filesystem.
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/splice.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 718bd00..0e8f44a 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1308,6 +1308,7 @@ long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
>
> return ret;
> }
> +EXPORT_SYMBOL(do_splice_direct);

_GPL?

2013-03-13 22:48:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 4/9] vfs: introduce clone_private_mount()

On Wed, 13 Mar 2013 15:16:28 +0100 Miklos Szeredi <[email protected]> wrote:

> From: Miklos Szeredi <[email protected]>
>
> Overlayfs needs a private clone of the mount, so create a function for
> this and export to modules.
>
> ...
>
> +struct vfsmount *clone_private_mount(struct path *path)
> +{
> + struct mount *old_mnt = real_mount(path->mnt);
> + struct mount *new_mnt;
> +
> + if (IS_MNT_UNBINDABLE(old_mnt))
> + return ERR_PTR(-EINVAL);
> +
> + down_read(&namespace_sem);
> + new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE);
> + up_read(&namespace_sem);
> + if (!new_mnt)
> + return ERR_PTR(-ENOMEM);
> +
> + return &new_mnt->mnt;
> +}
> +EXPORT_SYMBOL_GPL(clone_private_mount);

So this one gets the _GPL?

This is a new, exported-to-modules kernel interface function. And it
is undocumented?

2013-03-13 22:53:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 5/9] overlay filesystem

On Wed, 13 Mar 2013 15:16:29 +0100 Miklos Szeredi <[email protected]> wrote:

> Overlayfs allows one, usually read-write, directory tree to be
> overlaid onto another, read-only directory tree. All modifications
> go to the upper, writable layer.
>
> This type of mechanism is most often used for live CDs but there's a
> wide variety of other uses.
>
> The implementation differs from other "union filesystem"
> implementations in that after a file is opened all operations go
> directly to the underlying, lower or upper, filesystems. This
> simplifies the implementation and allows native performance in these
> cases.

Well that's an implementation detail. I'm sure that people would like
to see a longer description of how this fs differs from others and the
pros and cons of the various approaches.

Is overlayfs equal or superior to all other union filesystems in all
respects? If not, what are its shortcomings and why do you expect that
users will find them tolerable?

> The dentry tree is duplicated from the underlying filesystems, this
> enables fast cached lookups without adding special support into the
> VFS. This uses slightly more memory than union mounts, but dentries
> are relatively small.
>
> Currently inodes are duplicated as well, but it is a possible
> optimization to share inodes for non-directories.
>
> Opening non directories results in the open forwarded to the
> underlying filesystem. This makes the behavior very similar to union
> mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
> descriptors).
>
> Usage:
>
> mount -t overlay -olowerdir=/lower,upperdir=/upper overlay /mnt
>
> Supported:
>
> - all operations
>
> Missing:
>
> - Currently a crash in the middle of copy-up, rename, unlink, rmdir or create
> over a whiteout may result in filesystem corruption on the overlay level.
> IOW these operations need to become atomic or at least the corruption needs
> to be detected.

I for one have forgotten what "whiteout" means in this context.
Perhaps a lengthier description of this fs would be helpful...

2013-03-13 23:06:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 8/9] overlay: overlay filesystem documentation

On Wed, 13 Mar 2013 15:16:32 +0100 Miklos Szeredi <[email protected]> wrote:

> From: Neil Brown <[email protected]>
>
> Document the overlay filesystem.
>

Damn, I did it again. This is good, thanks.

> +Changes to the underlying filesystems while part of a mounted overlay
> +filesystem are not allowed. If the underlying filesystem is changed,
> +the behavior of the overlay is undefined, though it will not result in
> +a crash or deadlock.

ah-hah, there it is. I was wondering how this can-o-worms would be
handled ;)

Is there anything in place to enforce this at runtime? Checks for
read-onlyness, code to prevent rw remounts, etc?

2013-03-13 23:08:58

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Wed, 13 Mar 2013 15:16:24 +0100 Miklos Szeredi <[email protected]> wrote:

> Here's another version with the comments addressed plus a small bugfix and some
> checkpatch cleanups.

It all looks nice to my rusty eye. But then, I like anything which has
comments and passes checkpatch ;)

I see that quite a few people have contributed - have you need getting
enough review and testing feedback?

Either way, I suggest the next step is to ask Stephen to line this up
in linux-next.

2013-03-14 11:15:09

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 1/9] vfs: add i_op->dentry_open()

On Wed, Mar 13, 2013 at 11:44 PM, Andrew Morton
<[email protected]> wrote:
> On Wed, 13 Mar 2013 15:16:25 +0100 Miklos Szeredi <[email protected]> wrote:
>
>> From: Miklos Szeredi <[email protected]>
>>
>> Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
>> that want to return a struct file from a different filesystem.
>>
>> ...
>>
>> +/**
>> + * vfs_open - open the file at the given path
>> + * @path: path to open
>> + * @filp: newly allocated file with f_flag initialized
>> + * @cred: credentials to use
>> + */
>> +int vfs_open(const struct path *path, struct file *filp,
>> + const struct cred *cred)
>> +{
>> + struct inode *inode = path->dentry->d_inode;
>> +
>> + if (inode->i_op->dentry_open)
>> + return inode->i_op->dentry_open(path->dentry, filp, cred);
>> + else {
>> + filp->f_path = *path;
>> + return do_dentry_open(filp, NULL, cred);
>> + }
>> +}
>
> This looks like it will add some overhead to dentry_open(). That long
> walk path->dentry->d_inode->i_op->dentry_open, particularly. Can we do
> anything? Using filp->f_inode might save a tad.

filp->f_inode is initialized in do_dentry_open. But we can move that
to callers.

Adding an IOP_DENTRY_OPEN flag should take care of the rest.

> It's unfortunate and a bit ugly that ->dentry_open() and
> do_dentry_open() don't have the same arguments. One would expect and
> like them to do so, and this difference raises suspicions about design
> warts.

Hmm, the basic reason for the difference is that filesystems should
not have access to the vfsmount part of the path.

Otherwise it would be trivial to make them match up.

>
> If they _did_ have the same signature then we could perhaps populate
> ->dentry_open with do_dentry_open by default and avoid the `if'.

Except there's no way to add a default i_op, AFAIK.

>
> The test of inode->i_op->dentry_open could do with an unlikely(), but
> that won't save us :(
>
>> +EXPORT_SYMBOL(vfs_open);
>
> Did you consider EXPORT_SYMBOL_GPL()?

It is not clear to me what to base that decision on. Either one is fine by me.

Thanks,
Miklos

2013-03-14 13:28:50

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 4/9] vfs: introduce clone_private_mount()

On Wed, Mar 13, 2013 at 11:48 PM, Andrew Morton
<[email protected]> wrote:
> On Wed, 13 Mar 2013 15:16:28 +0100 Miklos Szeredi <[email protected]> wrote:
>
>> From: Miklos Szeredi <[email protected]>
>>
>> Overlayfs needs a private clone of the mount, so create a function for
>> this and export to modules.
>>
>> ...
>>
>> +struct vfsmount *clone_private_mount(struct path *path)
>> +{
>> + struct mount *old_mnt = real_mount(path->mnt);
>> + struct mount *new_mnt;
>> +
>> + if (IS_MNT_UNBINDABLE(old_mnt))
>> + return ERR_PTR(-EINVAL);
>> +
>> + down_read(&namespace_sem);
>> + new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE);
>> + up_read(&namespace_sem);
>> + if (!new_mnt)
>> + return ERR_PTR(-ENOMEM);
>> +
>> + return &new_mnt->mnt;
>> +}
>> +EXPORT_SYMBOL_GPL(clone_private_mount);
>
> So this one gets the _GPL?
>
> This is a new, exported-to-modules kernel interface function. And it
> is undocumented?

Following documentation added:

/**
* clone_private_mount - create a private clone of a path
*
* This creates a new vfsmount, which will be the clone of @path. The new will
* not be attached anywhere in the namespace and will be private (i.e. changes
* to the originating mount won't be propagated into this).
*
* Release with mntput().
*/

2013-03-14 13:35:59

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 8/9] overlay: overlay filesystem documentation

On Thu, Mar 14, 2013 at 12:06 AM, Andrew Morton
<[email protected]> wrote:
> On Wed, 13 Mar 2013 15:16:32 +0100 Miklos Szeredi <[email protected]> wrote:
>
>> From: Neil Brown <[email protected]>
>>
>> Document the overlay filesystem.
>>
>
> Damn, I did it again. This is good, thanks.
>
>> +Changes to the underlying filesystems while part of a mounted overlay
>> +filesystem are not allowed. If the underlying filesystem is changed,
>> +the behavior of the overlay is undefined, though it will not result in
>> +a crash or deadlock.
>
> ah-hah, there it is. I was wondering how this can-o-worms would be
> handled ;)
>
> Is there anything in place to enforce this at runtime? Checks for
> read-onlyness, code to prevent rw remounts, etc?

No, currently it is not enforced. I didn't think this was important
enough, and there was no complaint about this from users. I think
most people are going to be careful, especially if lots of warnings
are given in the documentation.

Thanks,
Miklos

2013-03-14 13:43:36

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Thu, Mar 14, 2013 at 12:08 AM, Andrew Morton
<[email protected]> wrote:
> On Wed, 13 Mar 2013 15:16:24 +0100 Miklos Szeredi <[email protected]> wrote:
>
>> Here's another version with the comments addressed plus a small bugfix and some
>> checkpatch cleanups.
>
> It all looks nice to my rusty eye. But then, I like anything which has
> comments and passes checkpatch ;)
>
> I see that quite a few people have contributed - have you need getting
> enough review and testing feedback?

Well, testing feedback I got plenty when it was all new and full of
bugs. Now I don't get much. All I usually get nowdays is people
asking why is it not in mainline yet.

> Either way, I suggest the next step is to ask Stephen to line this up
> in linux-next.

Al, are you okay with that? Or do you want it to go though -vfs?

Thanks,
Miklos

2013-03-15 01:25:51

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Thu, Mar 14, 2013 at 02:43:32PM +0100, Miklos Szeredi wrote:

> > Either way, I suggest the next step is to ask Stephen to line this up
> > in linux-next.
>
> Al, are you okay with that? Or do you want it to go though -vfs?

Umm... I would prefer it to go through vfs.git, with serious modifications.
I really don't like the idea of xattr-based whiteouts, for example.

2013-03-15 04:15:44

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)


Al Viro:
> Umm... I would prefer it to go through vfs.git, with serious modifications.
> I really don't like the idea of xattr-based whiteouts, for example.

I'd ask you how do you think another whiteout approach?
Here is a part of a posts which describes about aufs2 in 2009
<http://marc.info/?l=linux-kernel&m=123934927611907&w=2>

:::
+- whiteout is hardlinked in order to reduce the consumption of inodes
+ on branch
:::
+- kernel thread for removing the dir who has a plenty of whiteouts
:::
+The whiteout is for hiding files on lower branches. Also it is applied
+to stop readdir going lower branches.
+The latter case is called \[oq]opaque directory.\[cq] Any
+whiteout is an empty file, it means whiteout is just an mark.
+In the case of hiding lower files, the name of whiteout is
+\[oq]\*[AUFS_WH_PFX]<filename>.\[cq]
+And in the case of stopping readdir, the name is
+\[oq]\*[AUFS_WH_PFX]\*[AUFS_WH_PFX].opq\[cq] or
+\[oq]\*[AUFS_WH_PFX]__dir_opaque.\[cq] The name depends upon your compile
+configuration
+CONFIG_AUFS_COMPAT.
+.\" All of newly created or renamed directory will be opaque.
+All whiteouts are hardlinked,
+including \[oq]<writable branch top dir>/\*[AUFS_WH_BASE].\[cq]
:::
+The whiteout in aufs is very similar to Unionfs's. That is represented
+by its filename. UnionMount takes an approach of a file mode, but I am
+afraid several utilities (find(1) or something) will have to support it.
+
+Basically the whiteout represents "logical deletion" which stops aufs to
+lookup further, but also it represents "dir is opaque" which also stop
+lookup.
+
+In aufs, rmdir(2) and rename(2) for dir uses whiteout alternatively.
+In order to make several functions in a single systemcall to be
+revertible, aufs adopts an approach to rename a directory to a temporary
+unique whiteouted name.
+For example, in rename(2) dir where the target dir already existed, aufs
+renames the target dir to a temporary unique whiteouted name before the
+actual rename on a branch and then handles other actions (make it opaque,
+update the attributes, etc). If an error happens in these actions, aufs
+simply renames the whiteouted name back and returns an error. If all are
+succeeded, aufs registers a function to remove the whiteouted unique
+temporary name completely and asynchronously to the system global
+workqueue.
:::

Latest version is aufs3 (for linux-3.x series), there is no essential
changes about whiteout since aufs2.


J. R. Okajima

2013-03-15 04:44:19

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Fri, Mar 15, 2013 at 01:15:36PM +0900, J. R. Okajima wrote:

> +- whiteout is hardlinked in order to reduce the consumption of inodes
> + on branch

*blink* Whiteouts have no inodes at all. Filesystem has an additional
kind of directory entries, recognizable as whiteouts. How they are
done is up to filesystem in question.

> +- kernel thread for removing the dir who has a plenty of whiteouts

Again, implementation detail for individual filesystem.

> +The whiteout in aufs is very similar to Unionfs's. That is represented
> +by its filename. UnionMount takes an approach of a file mode, but I am
> +afraid several utilities (find(1) or something) will have to support it.

Why the devil should find(1) even see them?

I really don't believe that it's a good idea to try making them fs-agnostic;
it's an implementation detail of filesystem, with things like "is this
directory empty?" very much belonging to the fs in question...

I don't know; maybe it's my experience of dealing with umsdos that has badly
soured me on that kind of approach, but IME this kind of schemes tend to
be brittle as hell ;-/

2013-03-15 05:09:21

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)


Al Viro:
> > +- whiteout is hardlinked in order to reduce the consumption of inodes
> > + on branch
>
> *blink* Whiteouts have no inodes at all. Filesystem has an additional
> kind of directory entries, recognizable as whiteouts. How they are
> done is up to filesystem in question.

"no inodes at all"?
Are you assuming the implementation in dcache only (with a new d_type
flag)? And it is up to the real fs (layer or branch) whether it consumes
inode or not?
If so, it has a big disadvantage for the layer-fs (or branch-fs) to have
to implement a new method for whiteout.

Overlayfs implements whiteout as symlink+xattr which consumes an
inode. And you don't like it, right?
What I showed is another generic approach without xattr where the new
method to whiteout is unnecessary.


> > +The whiteout in aufs is very similar to Unionfs's. That is represented
> > +by its filename. UnionMount takes an approach of a file mode, but I am
> > +afraid several utilities (find(1) or something) will have to support it.
>
> Why the devil should find(1) even see them?

It is the case when find(1) for the layer-fs/branch-fs directly.


J. R. Okajima

2013-03-15 05:13:30

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Fri, Mar 15, 2013 at 02:09:14PM +0900, J. R. Okajima wrote:

> If so, it has a big disadvantage for the layer-fs (or branch-fs) to have
> to implement a new method for whiteout.
>
> Overlayfs implements whiteout as symlink+xattr which consumes an
> inode. And you don't like it, right?
> What I showed is another generic approach without xattr where the new
> method to whiteout is unnecessary.

I'm yet to see the reason that would make implementing that method a big
disadvantage, TBH...

2013-03-15 08:15:32

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Fri, 2013-03-15 at 05:13 +0000, Al Viro wrote:
> On Fri, Mar 15, 2013 at 02:09:14PM +0900, J. R. Okajima wrote:
>
> > If so, it has a big disadvantage for the layer-fs (or branch-fs) to have
> > to implement a new method for whiteout.
> >
> > Overlayfs implements whiteout as symlink+xattr which consumes an
> > inode. And you don't like it, right?
> > What I showed is another generic approach without xattr where the new
> > method to whiteout is unnecessary.
>
> I'm yet to see the reason that would make implementing that method a big
> disadvantage, TBH...

It's the fact that a directory entry based whiteout limits the amount of
change to the VFS, but has to be supported by underlying filesystems.
The generic_dirent_fallthrough() mechanism is a nice way of hiding it,
but there are still quite a few fs specific mods in the union mount tree
because of this. Having to modify filesystems to me indicates the
mechanism is a bit fragile. If we could do whiteouts purely in the VFS,
so it would work for any filesystem (without needing filesystem
modifications) that would seem to be a more robust approach. I'm not
saying we can definitely do this in an elegant way ... I'm just saying
that if someone comes up with it, it's obviously preferable.

James


2013-03-15 12:12:32

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Fri, Mar 15, 2013 at 08:15:18AM +0000, James Bottomley wrote:

> It's the fact that a directory entry based whiteout limits the amount of
> change to the VFS, but has to be supported by underlying filesystems.
> The generic_dirent_fallthrough() mechanism is a nice way of hiding it,
> but there are still quite a few fs specific mods in the union mount tree
> because of this. Having to modify filesystems to me indicates the
> mechanism is a bit fragile. If we could do whiteouts purely in the VFS,
> so it would work for any filesystem (without needing filesystem
> modifications) that would seem to be a more robust approach. I'm not
> saying we can definitely do this in an elegant way ... I'm just saying
> that if someone comes up with it, it's obviously preferable.

The trouble with such mechanisms is that they tend to end up depending on
fairly non-trivial properties of underlying fs. Try aufs one on btrfs,
see how soon you spot the problem. It's nice when a method turns out
to be really redundant and implementable in uniform way via other methods
present; see fh_to_dentry history for example of situation where it hadn't...

2013-03-15 18:57:27

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)


Al Viro:
> The trouble with such mechanisms is that they tend to end up depending on
> fairly non-trivial properties of underlying fs. Try aufs one on btrfs,
> see how soon you spot the problem. It's nice when a method turns out
> to be really redundant and implementable in uniform way via other methods
> present; see fh_to_dentry history for example of situation where it hadn't...

Hmm, I could not see problem around aufs using btrfs as the upper RW
branch, tested on linux-3.9-rc2.
Would you describe more specifically?
Some of my local tests didn't pass due to an error return from
btrfs. For example, repeat "ln fileA $very_loooong_filename", then btrfs
returns EMLINK so soon.

And aufs refers the common set of inode attribute. I mean attrs declared
in VFS. Also I know some filesystems don't maintain attrs on each
operation. For such fs, aufs calls vfs_getattr() internally.


J. R. Okajima

2013-03-15 19:11:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Thu, Mar 14, 2013 at 10:09 PM, J. R. Okajima <[email protected]> wrote:
>
> "no inodes at all"?
> Are you assuming the implementation in dcache only (with a new d_type
> flag)? And it is up to the real fs (layer or branch) whether it consumes
> inode or not?

Yes. That would be lovely. And trivial for most filesystems to support.

Sure, you could have an inode if you need to (not all filesystems may
have a flag in the directory entry), so it would look like "mknod()"
for the filesystem. But the filesystem might decide to never actually
create the inode at all if it reconizes the node as a whiteout node.

I think we should do this. Yes, it requires filesystem work, but in
the long run it's the right thing to do, and the filesystem work is
likely very very simple. Besides, we probably only need to support a
few filesystems for it to be already useful. What filesystems do the
people who use unionfs actually use today?

Also note that it's only the "upper layer" filesystem that needs
whiteout nodes. So it's not "all filesystems involved with overlayfs",
it's only the upper ones. And they have to already support xattr, so
I'm assuming in practice we're talking only a few cases, right? Just
tmpfs, ext3, ext4 probably covers most cases..

And it would get rid of all the horrible security check changes for
xattrs, no? So we'd really end up with a noticeably cleaner model.

Linus

2013-03-15 19:40:03

by Erez Zadok

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

I tend to agree with Al's and Linus's POV regarding whiteouts. There are three general techniques to implementing whiteouts:

1. namespace: special file names, hard/symlinks, or special "hidden" dot files.
2. extended attributes.
3. DT_WHT dirent flags.
(there's actually a 4th method I've tried before that I won't discuss below: implementing your own data structures on a raw partition ? way too cumbersome.)

The namespace techniques require lower file systems to support hard/symlinks and sometimes need long names. A plus: they work on most file systems (but not all). But they cause all sorts of namespace ugliness, where you have to hide the special file names, avoid them, ensure atomic updates for ops that involve whiteouts. It's all doable, as had been demonstrated by several implementations. But it's still icky in terms of namespace pollution.

The extended attributes technique, I think, is better than the namespace one in that you don't pollute the namespace; plus, I think the EA technique minimizes atomicity issues that show up with the namespace method. Yet, it still requires EA support in lower file systems, so it won't work unless lower file systems support xattr ops. Plus it could fail for file systems that have limited xattr support (e.g., number of EAs per inode).

The DT_WHT technique is the cleanest in the long run, and the best of the three IMHO. It's well understood and has been done in BSD a long time ago. It doesn't have the namespace pollution as seen in technique #1 above. And I believe it also minimize atomicity issues. Plus you won't have issues running out of EAs.

A while back I've looked at the unionmount code for DT_WHT support for ext2/tmpfs, and it was small, clean, and mostly additive. I even had a prototype port of unionfs using unionmount's DT_WHT support: it was relatively easy to port unionfs to use DT_WHT instead of namespace techniques. Plus, I was able to reduce the amount of code devoted to whiteout support by quite a bit.

So I think it'll be easy to port overlayfs to use DT_WHT. Given that most people use unioning with ext* and tmpfs, minimal DT_WHT support in those would get most users happy initially. And we can then let other file systems support DT_WHT on their own, in whatever way they deem suitable (as Al suggested, this is really best deferred to the F/S to implement).

Lastly, what I'm not sure is what API to use for whiteouts: should every f/s implement some new methods to add/remove/query a whiteout, or should the upper f/s and VFS directly check DT_WHT flags with S_ISWHT. The generic f/s methods may allow file systems to implement whiteouts in arbitrary ways, not necessarily as a dirent flag.

Cheers,
Erez.

2013-03-15 20:30:32

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)

On Sat, Mar 16, 2013 at 03:57:18AM +0900, J. R. Okajima wrote:
>
> Al Viro:
> > The trouble with such mechanisms is that they tend to end up depending on
> > fairly non-trivial properties of underlying fs. Try aufs one on btrfs,
> > see how soon you spot the problem. It's nice when a method turns out
> > to be really redundant and implementable in uniform way via other methods
> > present; see fh_to_dentry history for example of situation where it hadn't...
>
> Hmm, I could not see problem around aufs using btrfs as the upper RW
> branch, tested on linux-3.9-rc2.
> Would you describe more specifically?

Sure - btrfs happens to have an interesting limit on the number of
links to the same object located in one directory.

The thing is, you are trying to retrofit a new primitive into many
filesystems and do it in the same way. Doesn't work well...

And yes, it is an independent primitive. What I really don't understand
is WTF is so attractive about not having to touch individual filesystems;
it's not particulary hard to do for any fs we might care about...

2013-03-16 13:56:00

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)


Al Viro:
> Sure - btrfs happens to have an interesting limit on the number of
> links to the same object located in one directory.

It doesn't matter.
On every filesystem, the link count has its upper limit eventually. When
vfs_link() for whiteout returns EMLINK, aufs removes the whiteout-src
file and creates a new one internally and asynchronously. Completing a
new whiteout-src, aufs will go back to the link approach. The actual
link count when EMLINK returned does not matter.

Some filesystems don't support link(2). For such fs, user should tell
aufs that the layer-fs has no link, and aufs will always create whiteout
as a size-zeroed regular file instead of hardlink. Otherwise user cannot
remove a file logically (rm(1) will return ENOTSUP or EPERM.).


> And yes, it is an independent primitive. What I really don't understand
> is WTF is so attractive about not having to touch individual filesystems;
> it's not particulary hard to do for any fs we might care about...

Hmm, aufs might have lived too long time as outside mainline. The
whiteout approach aufs took is non-modifying the underlying fs. And I
thought it is essentially important for new fs, otherwise it would be
harder to be merged.


J. R. Okajima

2013-03-16 13:57:59

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 0/9] overlay filesystem: request for inclusion (v17)


Linus Torvalds:
> Yes. That would be lovely. And trivial for most filesystems to support.
>
> Sure, you could have an inode if you need to (not all filesystems may
> have a flag in the directory entry), so it would look like "mknod()"
> for the filesystem. But the filesystem might decide to never actually
> create the inode at all if it reconizes the node as a whiteout node.

I am not sure whether it is trivial for FS developers. But I agree that
it is the maintainer who can decide the most suitable method to whiteout
for that FS.


> Also note that it's only the "upper layer" filesystem that needs
> whiteout nodes. So it's not "all filesystems involved with overlayfs",
> it's only the upper ones. And they have to already support xattr, so
> I'm assuming in practice we're talking only a few cases, right? Just
> tmpfs, ext3, ext4 probably covers most cases..

I agree that tmpfs is majority.
In real world, people uses various fs as the upper RW layer. NFS is
popular too. Sometimes non-mainlined FSs are used such as glusterfs. In
the long history of aufs1 to aufs3, there have been xfs, smbfs, fuse,
ubifs and fat/vfat are used as far as I remember. (though I am not sure
users still use them)
It means that the unioning is used in various way of long-live systems,
not only on the short-live LiveCD systems.

And there are a use-case which uses aufs like a version control
system. For example,
- /aufs = /rw1 + /ro0
- all the changes are stored under /rw1.
- prepend a new top layer and set /rw1 readonly,
/aufs = /rw2 + /ro1 (which was /rw1) + /ro0
- now /ro1 holds the past changes and later changes are all go to /rw2.
- on the next day/week/month,
/aufs = /rw3 + /ro2 (which was /rw2) + /ro1 (which as /rw1) + /ro0

As a variation, there is a merging approach. It means moving files from
the upper RW to the lower RO and creates a new RO. In this variation,
the number of layers will not change.

Ah, I should have mentioned the difference between overlayfs and aufs.
- aufs supports muliple layers more than 2.
- direct access to the layers (bypassing aufs) is allowed.

If the underlying FS chooses the way to introduce a new d_type to
implement whiteout, then many applications have to support it. And the
above merging approach (or backup) will not be used until the
application completes supporting. When the FS is a network filesystem,
the new d_type has to be available on both of server and client. Of
course, the server may be other than linux. Even if FS is local, users
may setup dual-boot. And the other OS has to recognize the new d_type
value.
I still doubt it is a good approach.


J. R. Okajima

2013-03-17 13:08:13

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

Miklos Szeredi <[email protected]> wrote:

> Export do_splice_direct() to modules. Needed by overlay filesystem.

Apparently you cannot call this from any function that is holding an i_mutex
if the target of the splice uses generic_file_splice_write().

The problem is a potential deadlock situation:

We have places already that do:

mnt_want_write()
mutex_lock()

This can be found in do_last() for example.

However, mnt_want_write() calls sb_start_write() as does
generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding:

mutex_lock()
sb_start_write()

which lockdep reports as a potential ABBA deadlock.

Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
might operate, so it's possible that this is a false alarm. Maybe Jan Kara can
illuminate further, so I've added him to the cc list.

I've attached the report I got with unionmount.

David
---
[ INFO: possible recursive locking detected ]
3.9.0-rc1-fsdevel+ #934 Not tainted
---------------------------------------------
fs-op/4476 is trying to acquire lock:
(sb_writers#4){.+.+.+}, at: [<ffffffff811087a4>] generic_file_splice_write+0x5d/0x14b
but task is already holding lock:
(sb_writers#4){.+.+.+}, at: [<ffffffff810ff97c>] mnt_want_write+0x1f/0x46
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sb_writers#4);
lock(sb_writers#4);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by fs-op/4476:
#0: (sb_writers#4){.+.+.+}, at: [<ffffffff810ff97c>] mnt_want_write+0x1f/0x46
#1: (&type->i_mutex_dir_key[1]){+.+.+.}, at: [<ffffffff81131c74>] __union_copy_up+0x9a/0x132
stack backtrace:
Pid: 4476, comm: fs-op Not tainted 3.9.0-rc1-fsdevel+ #934
Call Trace:
[<ffffffff81070398>] __lock_acquire+0x86a/0x16cf
[<ffffffff811081cc>] ? page_cache_pipe_buf_release+0x1b/0x1b
[<ffffffff810715e2>] lock_acquire+0x57/0x6d
[<ffffffff811087a4>] ? generic_file_splice_write+0x5d/0x14b
[<ffffffff810e3314>] __sb_start_write+0x10d/0x15d
[<ffffffff811087a4>] ? generic_file_splice_write+0x5d/0x14b
[<ffffffff811087a4>] generic_file_splice_write+0x5d/0x14b
[<ffffffff811083d5>] do_splice_from+0x74/0x91
[<ffffffff81108410>] direct_splice_actor+0x1e/0x20
[<ffffffff8110868b>] splice_direct_to_actor+0xc2/0x17e
[<ffffffff811083f2>] ? do_splice_from+0x91/0x91
[<ffffffff8110999d>] do_splice_direct+0x47/0x5a
[<ffffffff81131a99>] __union_copy_up_locked+0x171/0x2b2
[<ffffffff81131cc4>] __union_copy_up+0xea/0x132
[<ffffffff810e02ca>] vfs_truncate+0x15e/0x289
[<ffffffff810e043b>] do_sys_truncate+0x46/0x83
[<ffffffff810e05cf>] sys_truncate+0x9/0xb
[<ffffffff81456f92>] system_call_fastpath+0x16/0x1b

2013-03-18 02:32:08

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Sun, Mar 17, 2013 at 01:06:59PM +0000, David Howells wrote:
> Miklos Szeredi <[email protected]> wrote:
>
> > Export do_splice_direct() to modules. Needed by overlay filesystem.
>
> Apparently you cannot call this from any function that is holding an i_mutex
> if the target of the splice uses generic_file_splice_write().
>
> The problem is a potential deadlock situation:
>
> We have places already that do:
>
> mnt_want_write()
> mutex_lock()
>
> This can be found in do_last() for example.
>
> However, mnt_want_write() calls sb_start_write() as does
> generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding:
>
> mutex_lock()
> sb_start_write()
>
> which lockdep reports as a potential ABBA deadlock.
>
> Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> might operate, so it's possible that this is a false alarm. Maybe Jan Kara can
> illuminate further, so I've added him to the cc list.
>
> I've attached the report I got with unionmount.

There's plenty of problems with splice locking that can lead to
deadlocks. Here's another that's been known for ages:

http://oss.sgi.com/archives/xfs/2011-08/msg00168.html

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-03-18 15:39:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Sun 17-03-13 13:06:59, David Howells wrote:
> Miklos Szeredi <[email protected]> wrote:
>
> > Export do_splice_direct() to modules. Needed by overlay filesystem.
>
> Apparently you cannot call this from any function that is holding an i_mutex
> if the target of the splice uses generic_file_splice_write().
>
> The problem is a potential deadlock situation:
>
> We have places already that do:
>
> mnt_want_write()
> mutex_lock()
>
> This can be found in do_last() for example.
>
> However, mnt_want_write() calls sb_start_write() as does
> generic_file_splice_write(). So now in ovl_copy_up_locked() you're adding:
>
> mutex_lock()
> sb_start_write()
>
> which lockdep reports as a potential ABBA deadlock.
>
> Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> might operate, so it's possible that this is a false alarm. Maybe Jan Kara can
> illuminate further, so I've added him to the cc list.
IMO the deadlock is real. In freeze_super() we wait for all writers to
the filesystem to finish while blocking beginning of any further writes. So
we have a deadlock scenario like:

THREAD1 THREAD2 THREAD3
mnt_want_write() mutex_lock(&inode->i_mutex);
... freeze_super()
block on mutex_lock(&inode->i_mutex)
sb_wait_write(sb, SB_FREEZE_WRITE);
block in sb_start_write()

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-18 21:53:46

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
> IMO the deadlock is real. In freeze_super() we wait for all writers to
> the filesystem to finish while blocking beginning of any further writes. So
> we have a deadlock scenario like:
>
> THREAD1 THREAD2 THREAD3
> mnt_want_write() mutex_lock(&inode->i_mutex);
> ... freeze_super()
> block on mutex_lock(&inode->i_mutex)
> sb_wait_write(sb, SB_FREEZE_WRITE);
> block in sb_start_write()

The bug is on fsfreeze side and this is not the only problem related to it.
I've missed the implications when I applied "fs: Add freezing handling
to mnt_want_write() / mnt_drop_write()" last June ;-/

The thing is, until then mnt_want_write() used to be a counter; it could be
nested. Now any such nesting is a deadlock you've just described. This
is seriously wrong, IMO.

BTW, having sb_start_write() buried in individual ->splice_write() is
asking for trouble; could you describe the rules for that? E.g. where
does it nest wrt filesystem-private locks? XFS iolock, for example...

2013-03-18 23:01:15

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Mon, Mar 18, 2013 at 09:53:34PM +0000, Al Viro wrote:
> On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
> > IMO the deadlock is real. In freeze_super() we wait for all writers to
> > the filesystem to finish while blocking beginning of any further writes. So
> > we have a deadlock scenario like:
> >
> > THREAD1 THREAD2 THREAD3
> > mnt_want_write() mutex_lock(&inode->i_mutex);
> > ... freeze_super()
> > block on mutex_lock(&inode->i_mutex)
> > sb_wait_write(sb, SB_FREEZE_WRITE);
> > block in sb_start_write()
>
> The bug is on fsfreeze side and this is not the only problem related to it.
> I've missed the implications when I applied "fs: Add freezing handling
> to mnt_want_write() / mnt_drop_write()" last June ;-/
>
> The thing is, until then mnt_want_write() used to be a counter; it could be
> nested. Now any such nesting is a deadlock you've just described. This
> is seriously wrong, IMO.
>
> BTW, having sb_start_write() buried in individual ->splice_write() is
> asking for trouble; could you describe the rules for that? E.g. where
> does it nest wrt filesystem-private locks? XFS iolock, for example...

I'm looking at the existing callers and I really wonder if we ought to
push sb_start_write() from ->splice_write()/->aio_write()/etc. into the
callers.

Something like file_start_write()/file_end_write(), with check for file
being regular one might be a good starting point. As it is, copyup is
really fucked both in unionmount and overlayfs...

2013-03-19 01:38:15

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Mon, Mar 18, 2013 at 11:01:03PM +0000, Al Viro wrote:

> I'm looking at the existing callers and I really wonder if we ought to
> push sb_start_write() from ->splice_write()/->aio_write()/etc. into the
> callers.
>
> Something like file_start_write()/file_end_write(), with check for file
> being regular one might be a good starting point. As it is, copyup is
> really fucked both in unionmount and overlayfs...

BTW, I wonder what's the right locking for that sucker; overlayfs is probably
too heavy - we are talking about copying a file from one fs to another, which
can obviously take quite a while, so holding ->i_mutex on _parent_ all along
is asking for very serious contention. OTOH, there's a pile of unpleasant
races that need to be dealt with; consider e.g. chmod("lower_file", 0644) vs.
unlink("lower_file"). The former means creating a copy of lower_file in the
covering layer (and chmod of that copy once it's finished). The latter
means creating a whiteout in the covering layer. No matter which comes first,
the result *must* be whiteout in directory + no stray copies left in covering
layer. chmod() may legitimately return -ENOENT or 0, but resulting state of
fs is unambiguous. Holding ->i_mutex on parent would, of course, suffice to
serialize them, but it's not particulary nice thing to do wrt contention.

Another fun issue is copyup vs. copyup - we want to sit and wait for copyup
attempt in progress to complete, rather than start another one in parallel.
And whoever comes the second must check if copyup has succeeded, obviously -
it's possible to have user run into 5% limit and fail copyup, followed by
root doing it successfully.

Another one: overwriting rename() vs. copyup. Similar to unlink() vs. copyup().

Another one: read-only open() vs. copyup(). Hell knows - we obviously don't
want to open a partially copied file; might want to wait or pretend that this
open() has come first and give it the underlying file. The same goes for
stat() vs. copyup().

FWIW, something like "lock parent, ->create(), ->unlink(), unlock parent,
copy data and metadata, lock parent, allocate a new dentry in covering layer
and do ->lookup() on it, verify that it is negative and not a whiteout, lock
child, use ->link() to put it into directory, unlock everything" would
probably DTRT for unlink/copyup and rename/copyup. The rest... hell knows;
->i_mutex on child is a non-starter, since we couldn't use the normal ->write()
or ->splice_write() under it. One possibility is to allocate a structure for
copyup in progress, make it easily located (hash by lower dentry/upper sb
pair) and serialize on that. Hell knows...

One potential unpleasantness here is dcache lookup coming between ->create()
and ->unlink(); OTOH, there had been fairly reasonable requests for something
like atomic combination of open and unlink and it's not like _that_ would be
hard to implement as a new method - pretty much everything local could just
take part of ->create() and that would be it. Which might make sense on its
own - open() flag creating a new temporary file, unlinked from the very
beginning and thus killed when we close the damn thing. Objections?

2013-03-19 09:00:44

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules


Al Viro:
> BTW, I wonder what's the right locking for that sucker; overlayfs is probably
> too heavy - we are talking about copying a file from one fs to another, which
> can obviously take quite a while, so holding ->i_mutex on _parent_ all along
> is asking for very serious contention. OTOH, there's a pile of unpleasant

Yes, holding parent->i_mutex can be longer.
Using splice function for copy-up is simple and (probably) fastest. But
it doesn't support a "hole" in the file (sparse file). All holes are
filled with NUL byte and consumes a disk block on the upper layer. It is
a problem, especially for users who have smaller tmpfs as his upper
layer.
The copy-up with considering a hole may cost more, but it can save the
storage consumtion.


> Another fun issue is copyup vs. copyup - we want to sit and wait for copyup
> attempt in progress to complete, rather than start another one in parallel.
> And whoever comes the second must check if copyup has succeeded, obviously -
> it's possible to have user run into 5% limit and fail copyup, followed by
> root doing it successfully.

"5% limit" means the reserved are for a superuser on a filesystem,
right? As far as I know, overlayfs (UnionMount too?) solves this problem
as changing the task credential and the capability. But I am not sure
whether it solves the similar problem around the resource limit like
RLIMIT_CPU, RLIMIT_FSIZE or something.


> Another one: overwriting rename() vs. copyup. Similar to unlink() vs. copyup().

Hmm, do you mean that, just after the parent dir lock on the underlying
fs, the copyup routine should confirm whether the target is still alive?
If so, I agree.


> Another one: read-only open() vs. copyup(). Hell knows - we obviously don't
> want to open a partially copied file; might want to wait or pretend that this
> open() has come first and give it the underlying file. The same goes for
> stat() vs. copyup().

Exactly.
Moreover users don't want to refer to the lower file which is obsoleted
by copyup.


> FWIW, something like "lock parent, ->create(), ->unlink(), unlock parent,
> copy data and metadata, lock parent, allocate a new dentry in covering layer
> and do ->lookup() on it, verify that it is negative and not a whiteout, lock
> child, use ->link() to put it into directory, unlock everything" would
> probably DTRT for unlink/copyup and rename/copyup. The rest... hell knows;

Please let me make sure.
You are saying,
- create the file on the upper layer
- get the "struct file" object
- hide it from users
- before copying the file, unlock the parent in order to stop the long
period locking
- copy the file without the parent lock. it doesn't matter since the
file is invisible to users.
- confirm that the target name is still available for copyup
- make the completed file visible by ->link()

It is ineteresting to ->link() with the unlinked file. While
vfs_unlink() rejects such case, it may not matter for the underlying fs.
Need to verify FS including jounals.


J. R. Okajima

2013-03-19 10:29:46

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 2:38 AM, Al Viro <[email protected]> wrote:
> On Mon, Mar 18, 2013 at 11:01:03PM +0000, Al Viro wrote:
>
>> I'm looking at the existing callers and I really wonder if we ought to
>> push sb_start_write() from ->splice_write()/->aio_write()/etc. into the
>> callers.
>>
>> Something like file_start_write()/file_end_write(), with check for file
>> being regular one might be a good starting point. As it is, copyup is
>> really fucked both in unionmount and overlayfs...
>
> BTW, I wonder what's the right locking for that sucker; overlayfs is probably
> too heavy - we are talking about copying a file from one fs to another, which
> can obviously take quite a while, so holding ->i_mutex on _parent_ all along
> is asking for very serious contention.

Copy up is a once-in-a-lifetime event for an object. Optimizing it is
way down in the list of things to do. I'd drop splice in a jiffy if
it's in the way.

Much more interesting question: what happens if we crash during a
rename? Whiteout implemented in the filesystem won't save us. And
the results are interesting: old versions of files become visible and
similar fun. Far from likely to happen, but ...

Add a rename-with-whiteout primitive on filesystems? That one is not
going to be as simple as plain whiteout. Or?

Thanks,
Miklos

2013-03-19 11:05:19

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

Miklos Szeredi <[email protected]> wrote:

> > BTW, I wonder what's the right locking for that sucker; overlayfs is
> > probably too heavy - we are talking about copying a file from one fs to
> > another, which can obviously take quite a while, so holding ->i_mutex on
> > _parent_ all along is asking for very serious contention.
>
> Copy up is a once-in-a-lifetime event for an object. Optimizing it is
> way down in the list of things to do. I'd drop splice in a jiffy if
> it's in the way.

Yes, but it could block the parent directory for a long time. I suspect it's
fine if you can RCU walk through the parent, but if you have to grab a lock on
it...

David

2013-03-19 11:40:22

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 12:04 PM, David Howells <[email protected]> wrote:
> Miklos Szeredi <[email protected]> wrote:
>
>> > BTW, I wonder what's the right locking for that sucker; overlayfs is
>> > probably too heavy - we are talking about copying a file from one fs to
>> > another, which can obviously take quite a while, so holding ->i_mutex on
>> > _parent_ all along is asking for very serious contention.
>>
>> Copy up is a once-in-a-lifetime event for an object. Optimizing it is
>> way down in the list of things to do. I'd drop splice in a jiffy if
>> it's in the way.
>
> Yes, but it could block the parent directory for a long time. I suspect it's
> fine if you can RCU walk through the parent, but if you have to grab a lock on
> it...

Right.

Lets look at it this way: users of an overlay accept that an
operation X can take T time, where T is much longer than would be on a
normal filesystem. Then why would they complain that operation Y
(which happens to bump into the parent lock of X) also takes T?

If copy up of huge files happens more then very very occasionally,
then the overlay will be basically unusable anyway. It's just not
what it is designed for, so why try to optimize this case?

Thanks,
Miklos

2013-03-19 17:03:37

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote:

> Copy up is a once-in-a-lifetime event for an object. Optimizing it is
> way down in the list of things to do. I'd drop splice in a jiffy if
> it's in the way.

What makes you think that write is any better? Same deadlock there - check
generic_file_aio_write(), it calls the same sb_start_write()... IOW,
switching from splice to write won't help at all.

> Much more interesting question: what happens if we crash during a
> rename? Whiteout implemented in the filesystem won't save us. And
> the results are interesting: old versions of files become visible and
> similar fun. Far from likely to happen, but ...
>
> Add a rename-with-whiteout primitive on filesystems? That one is not
> going to be as simple as plain whiteout. Or?

Umm... If/when we start caring about that kind of atomicity (and I agree
that we ought to) overlayfs approach to whiteouts will actually have much
harder time - it doesn't take much to teach a journalling fs how to do that
kind of ->rename() in a single transaction; the only question is how to tell
it that we want to leave a whiteout behind us. Hell knows; one variant is
to add a flag, of course. Another might be more interesting - we want some
kind of "directory is opaque" flag, so if we start reshuffling the methods,
we might try to merge unlink/rmdir/whiteout. Rules:
* victim is negative => create a whiteout
* victim is a directory, parent opaque => rmdir
* victim is a non-directory, parent opaque => unlink
* victim is positive, parent _not_ opaque => replace with whiteout
* old_dir in case of ->rename() is opaque => normal rename
* old_dir in case of ->rename() is not opaque => leave whiteout behind
Non-unioned => opaque, of course (nothing showing through it).

Getting good behaviour on rename interrupted by crash is going to be _very_
tricky with any strategy other than whiteouts-in-fs, AFAICS.

Again, I have no problem whatsoever with changing the set of directory
methods, as long as the replacement is sane. We'd done that kind of thing
before and it's not a problem.

2013-03-19 18:32:47

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 6:03 PM, Al Viro <[email protected]> wrote:
> On Tue, Mar 19, 2013 at 11:29:41AM +0100, Miklos Szeredi wrote:
>
>> Copy up is a once-in-a-lifetime event for an object. Optimizing it is
>> way down in the list of things to do. I'd drop splice in a jiffy if
>> it's in the way.
>
> What makes you think that write is any better? Same deadlock there - check
> generic_file_aio_write(), it calls the same sb_start_write()... IOW,
> switching from splice to write won't help at all.

Okay, I missed that. Yeah, that needs fixing...

>> Much more interesting question: what happens if we crash during a
>> rename? Whiteout implemented in the filesystem won't save us. And
>> the results are interesting: old versions of files become visible and
>> similar fun. Far from likely to happen, but ...
>>
>> Add a rename-with-whiteout primitive on filesystems? That one is not
>> going to be as simple as plain whiteout. Or?
>
> Umm... If/when we start caring about that kind of atomicity (and I agree
> that we ought to) overlayfs approach to whiteouts will actually have much
> harder time - it doesn't take much to teach a journalling fs how to do that
> kind of ->rename() in a single transaction; the only question is how to tell
> it that we want to leave a whiteout behind us. Hell knows; one variant is
> to add a flag, of course. Another might be more interesting - we want some
> kind of "directory is opaque" flag, so if we start reshuffling the methods,
> we might try to merge unlink/rmdir/whiteout. Rules:
> * victim is negative => create a whiteout
> * victim is a directory, parent opaque => rmdir
> * victim is a non-directory, parent opaque => unlink
> * victim is positive, parent _not_ opaque => replace with whiteout
> * old_dir in case of ->rename() is opaque => normal rename
> * old_dir in case of ->rename() is not opaque => leave whiteout behind
> Non-unioned => opaque, of course (nothing showing through it).
>

I dunnow. Overloading common paths with overlay/union specific things
doesn't look very clean to me.

I have a similar problem with union-mounts: it's hooking into lots of
common paths in the VFS for the sake of a very specialized feature.

> Getting good behaviour on rename interrupted by crash is going to be _very_
> tricky with any strategy other than whiteouts-in-fs, AFAICS.
>

One idea is to add a journal to the overlay itself (yeah, namespace issues).

Thanks,
Miklos

2013-03-19 20:25:48

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Mon 18-03-13 21:53:34, Al Viro wrote:
> On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
> > IMO the deadlock is real. In freeze_super() we wait for all writers to
> > the filesystem to finish while blocking beginning of any further writes. So
> > we have a deadlock scenario like:
> >
> > THREAD1 THREAD2 THREAD3
> > mnt_want_write() mutex_lock(&inode->i_mutex);
> > ... freeze_super()
> > block on mutex_lock(&inode->i_mutex)
> > sb_wait_write(sb, SB_FREEZE_WRITE);
> > block in sb_start_write()
>
> The bug is on fsfreeze side and this is not the only problem related to it.
> I've missed the implications when I applied "fs: Add freezing handling
> to mnt_want_write() / mnt_drop_write()" last June ;-/
>
> The thing is, until then mnt_want_write() used to be a counter; it could be
> nested. Now any such nesting is a deadlock you've just described. This
> is seriously wrong, IMO.
Well, but sb_start_write() has to be blocking (blocks when fs is frozen)
and you have to get it somewhere. It seems only natural to get the counter
from original mnt_want_write() at the same place and use one function for
that. Whether I should have changed the name from mnt_want_write() to
something else is questionable...

> BTW, having sb_start_write() buried in individual ->splice_write() is
> asking for trouble; could you describe the rules for that? E.g. where
> does it nest wrt filesystem-private locks? XFS iolock, for example...
Generally, the freeze protection should be the outermost lock taken (so
that we mitigate possibility of blocking readers when waiting for fs to
unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.

It seems that I screwed this up for ->splice_write() :-| If we are going to
move out sb_start_write() out of filesystems' hands into do_splice_from()
then we should likely do the same with ->aio_write(). Hmm?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-19 20:54:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Mon 18-03-13 23:01:03, Al Viro wrote:
> On Mon, Mar 18, 2013 at 09:53:34PM +0000, Al Viro wrote:
> > On Mon, Mar 18, 2013 at 04:39:36PM +0100, Jan Kara wrote:
> > > IMO the deadlock is real. In freeze_super() we wait for all writers to
> > > the filesystem to finish while blocking beginning of any further writes. So
> > > we have a deadlock scenario like:
> > >
> > > THREAD1 THREAD2 THREAD3
> > > mnt_want_write() mutex_lock(&inode->i_mutex);
> > > ... freeze_super()
> > > block on mutex_lock(&inode->i_mutex)
> > > sb_wait_write(sb, SB_FREEZE_WRITE);
> > > block in sb_start_write()
> >
> > The bug is on fsfreeze side and this is not the only problem related to it.
> > I've missed the implications when I applied "fs: Add freezing handling
> > to mnt_want_write() / mnt_drop_write()" last June ;-/
> >
> > The thing is, until then mnt_want_write() used to be a counter; it could be
> > nested. Now any such nesting is a deadlock you've just described. This
> > is seriously wrong, IMO.
> >
> > BTW, having sb_start_write() buried in individual ->splice_write() is
> > asking for trouble; could you describe the rules for that? E.g. where
> > does it nest wrt filesystem-private locks? XFS iolock, for example...
>
> I'm looking at the existing callers and I really wonder if we ought to
> push sb_start_write() from ->splice_write()/->aio_write()/etc. into the
> callers.
Yeah, that should be OK.

> Something like file_start_write()/file_end_write(), with check for file
> being regular one might be a good starting point. As it is, copyup is
> really fucked both in unionmount and overlayfs...
Makes sense. I can do the patch. BTW, for months I'm trying to push to you
a patch which creates a function like file_start_write() which returns
EAGAIN if the file is open with O_NONBLOCK and fs is frozen (this allows me
to solve a deadlock with bsd process accounting to frozen fs). After this
change the patch will become trivial so I'll add it to the series and
hopefully it won't get ignored.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-19 21:24:11

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 07:32:42PM +0100, Miklos Szeredi wrote:

> > * victim is negative => create a whiteout
> > * victim is a directory, parent opaque => rmdir
> > * victim is a non-directory, parent opaque => unlink
> > * victim is positive, parent _not_ opaque => replace with whiteout
> > * old_dir in case of ->rename() is opaque => normal rename
> > * old_dir in case of ->rename() is not opaque => leave whiteout behind
> > Non-unioned => opaque, of course (nothing showing through it).
> >
>
> I dunnow. Overloading common paths with overlay/union specific things
> doesn't look very clean to me.

We certainly can add a couple of new methods; the question is, would that
buy us anything? This ->rename_and_whiteout would be practically identical
to ->rename on most of filesystems; the only difference for something like
ext* is that instead of deleting the old directory entry in the end we
would change its inumber to 0 and type to 14. Everything else would be
the same. Moreover, turning a positive entry into a whiteout differs from
unlink/rmdir resp. in exactly the same way. FWIW, creating a whiteout from
scratch might be better off as a separate method - it's closer to ->mknod()
and friends. But whiteout from positive and whiteout-on-rename probably
isn't worth separating from the normal methods. All we end up "overloading"
is actual removal of directory entry, which is a rather minor part of the
logics in those.

And merging ->unlink/->rmdir is definitely a good idea, regardless of anything
union-related; just look at the existing instances of both.

> > Getting good behaviour on rename interrupted by crash is going to be _very_
> > tricky with any strategy other than whiteouts-in-fs, AFAICS.
> >
>
> One idea is to add a journal to the overlay itself (yeah, namespace issues).

... and extra complexity from hell. If the underlying fs has a journal of
its own, we are getting all kinds of interesting interplay between those;
if not, replaying the journal in overlay will buy you nothing, since the
damage to underlying objects makes it all moot.

Layering is a very nice tool for quick prototyping; no arguments about that.
It allows to avoid modifying $BIGNUM filesystems and for something that
lives out of tree it's a very big benefit. Once the thing is merged, though,
that benefit becomes much smaller; it still might make sense to implement
something as a layer, but some parts of that sucker may be better off as
fs primitives. Hell, we could, in theory, implement xattrs as a layer;
just look at how reiserfs had done them. We could do the same for hardlinks
(look at qnx4, if you wish to see that hack). Of for symlinks (sysvfs).
Or for opened-and-unlinked files (sillyrename could be done as a generic
layer). Or for permissions/ownership/arbitrary names (umsdos, and that
_was_ very similar to layering). It's just that often an underlying fs
has a better way of doing that. IMO whiteouts are in that class.

2013-03-19 21:38:40

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote:
> > BTW, having sb_start_write() buried in individual ->splice_write() is
> > asking for trouble; could you describe the rules for that? E.g. where
> > does it nest wrt filesystem-private locks? XFS iolock, for example...
> Generally, the freeze protection should be the outermost lock taken (so
> that we mitigate possibility of blocking readers when waiting for fs to
> unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.

Welcome to deadlock, then:
xfs_file_splice_write()
...
xfs_ilock(ip, XFS_IOLOCK_EXCL);
...
ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);

> It seems that I screwed this up for ->splice_write() :-| If we are going to
> move out sb_start_write() out of filesystems' hands into do_splice_from()
> then we should likely do the same with ->aio_write(). Hmm?

Yes, I've a tentative patch doing just that; will push tonight.

2013-03-19 22:10:43

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 09:38:31PM +0000, Al Viro wrote:
> On Tue, Mar 19, 2013 at 09:25:43PM +0100, Jan Kara wrote:
> > > BTW, having sb_start_write() buried in individual ->splice_write() is
> > > asking for trouble; could you describe the rules for that? E.g. where
> > > does it nest wrt filesystem-private locks? XFS iolock, for example...
> > Generally, the freeze protection should be the outermost lock taken (so
> > that we mitigate possibility of blocking readers when waiting for fs to
> > unfreeze). So it ranks above i_mutex, or XFS' ilock and iolock.
>
> Welcome to deadlock, then:
> xfs_file_splice_write()
> ...
> xfs_ilock(ip, XFS_IOLOCK_EXCL);
> ...
> ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);
>
> > It seems that I screwed this up for ->splice_write() :-| If we are going to
> > move out sb_start_write() out of filesystems' hands into do_splice_from()
> > then we should likely do the same with ->aio_write(). Hmm?

... and then there's ext4_file_dio_write(), which cheerfully ignores freeze.
... and udf expanding from inline files to separately allocated before it
gets to sb_start_write()
... and a bunch of guys doing generic_write_sync() after generic_aio_file_write()
... and a lot of other fun stuff. Ouch...

OK, it's going to be an interesting series - aforementioned tentative patch
was badly incomplete ;-/

2013-03-20 02:33:24

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 10:10:32PM +0000, Al Viro wrote:

> OK, it's going to be an interesting series - aforementioned tentative patch
> was badly incomplete ;-/

The interesting question is how far do we want to lift that. ->aio_write()
part is trivial - see vfs.git#experimental; the trouble begins with
->splice_write(). For *everything* except default_file_splice_write(),
lifting into the caller (do_splice_from()) is the right thing to do.

default_file_splice_write(), however, it trickier; there we end up calling
vfs_write() (via an ugly callchain). And _that_ is a real bitch. Granted,
vfs_write() is somewhat an overkill there (we'd already done rw_verify_area()
and access_ok() is pointless due to set_fs() we do around vfs_write()
there) and we'd already lifted it up to do_sync_write(). But if we lift
it any further, we'll need to deal with ->write() callers in the tree.
Current situation:

fs/coredump.c:662: return access_ok(VERIFY_READ, addr, nr) && file->f_op->write(file, addr, nr, &file->f_pos) == nr;
arch/powerpc/platforms/cell/spufs/coredump.c:63: written = file->f_op->write(file, addr, nr, &file->f_pos);

for these guys we might actually want to lift all way up to do_coredump()

drivers/staging/comedi/drivers/serial2002.c:91: result = f->f_op->write(f, buf, count, &f->f_pos);
fs/autofs4/waitq.c:73: (wr = file->f_op->write(file,data,bytes,&file->f_pos)) > 0) {

not regular files, unless I'm seriously misreading the code.

kernel/acct.c:553: file->f_op->write(file, (char *)&ac,

BTW, this is probably where we want to deal with your acct deadlock.

fs/compat.c:1103: fn = (io_fn_t)file->f_op->write;
fs/read_write.c:435: ret = file->f_op->write(file, buf, count, pos);
fs/read_write.c:732: fn = (io_fn_t)file->f_op->write;

syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/
compat_writev, or actually take it further.

fs/cachefiles/rdwr.c:967: ret = file->f_op->write(

cachefiles_write_page(); no fucking idea what locks might be held by caller
and potentially that's a rather nasty source of PITA

fs/coda/file.c:84: ret = host_file->f_op->write(host_file, buf, count, ppos);

coda writing to file in cache on local fs. Potentially a nasty bugger, since
it's hard to lift any further - the caller has no idea that the thing is
on CODA, let alone what happens to hold the local cache.

drivers/block/loop.c:234: bw = file->f_op->write(file, buf, len, &pos);

do_bio_filebacked(), with some ugliness between that and callsite. Note,
BTW, that we have a pair of possible vfs_fsync() calls in there; how do those
interact with freeze?

This does *not* touch the current callers of vfs_write()/vfs_writev(); any of
those called while holding ->i_mutex on a directory (or mnt_want_write(), for
that matter) is a deadlock right now.

And we'd better start thinking about how we'll backport that crap - deadlock
in e.g. xfs ->splice_write() had been there since last summer ;-/

2013-03-20 09:15:32

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Tue, Mar 19, 2013 at 10:24 PM, Al Viro <[email protected]> wrote:
> it still might make sense to implement
> something as a layer, but some parts of that sucker may be better off as
> fs primitives. Hell, we could, in theory, implement xattrs as a layer;
> just look at how reiserfs had done them. We could do the same for hardlinks
> (look at qnx4, if you wish to see that hack). Of for symlinks (sysvfs).
> Or for opened-and-unlinked files (sillyrename could be done as a generic
> layer). Or for permissions/ownership/arbitrary names (umsdos, and that
> _was_ very similar to layering). It's just that often an underlying fs
> has a better way of doing that. IMO whiteouts are in that class.

My gut feeling is that whiteouts (and all that's related) are just too
specialized. All the examples you listed are more general purpose.

I know one that could be useful for a variety of things, an "exchange"
operation. While similar to rename, the semantics are much cleaner
and so the implementation becomes easier as well.

But that one doesn't quite solve the rename-with-whiteout thing. For
that a three way permutation would be needed where one of the three is
an unlinked "floating" object. That one is a really generic op but
we'd need to introduce linking unlinked objects into the tree which
comes with its own problems.

But I think it may be worth trying to come up with something more
generally useful before adding whiteouts and various overlay-specific
flags to filesystem code.

Thanks,
Miklos

2013-03-20 12:32:04

by David Howells

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

Al Viro <[email protected]> wrote:

> fs/cachefiles/rdwr.c:967: ret = file->f_op->write(
>
> cachefiles_write_page(); no fucking idea what locks might be held by caller
> and potentially that's a rather nasty source of PITA

The caller of cachefiles_write_page() (ie. fscache_write_op()) holds no locks
over the call.

__fscache_write_page() queues the pages for writing to the cache from the
netfs's read-side routines and returns immediately.

fscache_write_op() then picks them up in the background and passes them over
to the backend (eg. cachefiles_write_page()) which writes them out to the
cache.

So I think it should be safe to call file_start/end_write() or whatever around
the file->f_op->write() calls.

David

2013-03-20 19:52:29

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Wed 20-03-13 02:33:08, Al Viro wrote:
> On Tue, Mar 19, 2013 at 10:10:32PM +0000, Al Viro wrote:
>
> > OK, it's going to be an interesting series - aforementioned tentative patch
> > was badly incomplete ;-/
>
> The interesting question is how far do we want to lift that. ->aio_write()
> part is trivial - see vfs.git#experimental; the trouble begins with
> ->splice_write(). For *everything* except default_file_splice_write(),
> lifting into the caller (do_splice_from()) is the right thing to do.
>
> default_file_splice_write(), however, it trickier; there we end up calling
> vfs_write() (via an ugly callchain). And _that_ is a real bitch. Granted,
> vfs_write() is somewhat an overkill there (we'd already done rw_verify_area()
> and access_ok() is pointless due to set_fs() we do around vfs_write()
> there) and we'd already lifted it up to do_sync_write(). But if we lift
> it any further, we'll need to deal with ->write() callers in the tree.
> Current situation:
>
> fs/coredump.c:662: return access_ok(VERIFY_READ, addr, nr) && file->f_op->write(file, addr, nr, &file->f_pos) == nr;
> arch/powerpc/platforms/cell/spufs/coredump.c:63: written = file->f_op->write(file, addr, nr, &file->f_pos);
>
> for these guys we might actually want to lift all way up to do_coredump()
>
> drivers/staging/comedi/drivers/serial2002.c:91: result = f->f_op->write(f, buf, count, &f->f_pos);
> fs/autofs4/waitq.c:73: (wr = file->f_op->write(file,data,bytes,&file->f_pos)) > 0) {
>
> not regular files, unless I'm seriously misreading the code.
>
> kernel/acct.c:553: file->f_op->write(file, (char *)&ac,
>
> BTW, this is probably where we want to deal with your acct deadlock.
>
> fs/compat.c:1103: fn = (io_fn_t)file->f_op->write;
> fs/read_write.c:435: ret = file->f_op->write(file, buf, count, pos);
> fs/read_write.c:732: fn = (io_fn_t)file->f_op->write;
>
> syscalls - the question here is whether we lift it up to vfs_write/vfs_writev/
> compat_writev, or actually take it further.
>
> fs/cachefiles/rdwr.c:967: ret = file->f_op->write(
>
> cachefiles_write_page(); no fucking idea what locks might be held by caller
> and potentially that's a rather nasty source of PITA
>
> fs/coda/file.c:84: ret = host_file->f_op->write(host_file, buf, count, ppos);
>
> coda writing to file in cache on local fs. Potentially a nasty bugger, since
> it's hard to lift any further - the caller has no idea that the thing is
> on CODA, let alone what happens to hold the local cache.
>
> drivers/block/loop.c:234: bw = file->f_op->write(file, buf, len, &pos);
>
> do_bio_filebacked(), with some ugliness between that and callsite. Note,
> BTW, that we have a pair of possible vfs_fsync() calls in there; how do those
> interact with freeze?
Freezing code takes care that all dirty data is synced before fs is
frozen and no new dirty data can be created before fs is thawed. So
vfs_fsync() should just return without doing anything on frozen filesystem.

> This does *not* touch the current callers of vfs_write()/vfs_writev(); any of
> those called while holding ->i_mutex on a directory (or mnt_want_write(), for
> that matter) is a deadlock right now.
>
> And we'd better start thinking about how we'll backport that crap - deadlock
> in e.g. xfs ->splice_write() had been there since last summer ;-/
Yeah but noone really noticed because in practice the code isn't
stressed much. Much bigger problems had been there for years before they
were fixed last summer without anybody complaining... So I'm not sure how
hard do we want to try to backport this.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-20 21:48:23

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote:
> > do_bio_filebacked(), with some ugliness between that and callsite. Note,
> > BTW, that we have a pair of possible vfs_fsync() calls in there; how do those
> > interact with freeze?
> Freezing code takes care that all dirty data is synced before fs is
> frozen and no new dirty data can be created before fs is thawed. So
> vfs_fsync() should just return without doing anything on frozen filesystem.

Um... How does it interact with vfs_fsync() already in progress when you
ask to freeze it?

Anyway, I've pulled the fscker out of ->aio_write, ->write and ->splice_write;
on that pathway it's in the do_splice_from() (see vfs.git#experimental).

... and now, for something *really* nasty: where do mandatory file locks
belong in the locking hierarchy? Relative to fsfreeze one, for starters,
but both for unionmount and overlayfs we need to decide where they live
relative to ->i_mutex on directories.

And that, BTW, may be the strongest argument so far in favour of the scheme
I'd suggested for copyup-via-opened-but-unlinked...

2013-03-20 22:19:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Wed 20-03-13 21:48:13, Al Viro wrote:
> On Wed, Mar 20, 2013 at 08:52:22PM +0100, Jan Kara wrote:
> > > do_bio_filebacked(), with some ugliness between that and callsite. Note,
> > > BTW, that we have a pair of possible vfs_fsync() calls in there; how do those
> > > interact with freeze?
> > Freezing code takes care that all dirty data is synced before fs is
> > frozen and no new dirty data can be created before fs is thawed. So
> > vfs_fsync() should just return without doing anything on frozen filesystem.
>
> Um... How does it interact with vfs_fsync() already in progress when you
> ask to freeze it?
So the exact sequence of freezing is:
sb->s_writers.frozen = SB_FREEZE_WRITE;
smp_wmb();
sb_wait_write(sb, SB_FREEZE_WRITE);
Now there are no processes in sb_start_write() - sb_end_write() section.
Then we do the same for SB_FREEZE_PAGEFAULT. After this noone should be
able to dirty a page or inode. Writeback or vfs_fsync() may be still
running (so fs can be creating new transactions in the journal for
writeback etc.).
sync_filesystem(sb);
After this there should be no dirty data so although we can still be
somewhere inside vfs_fsync() it should have nothing to do.

Now we freeze to state SB_FREEZE_FS (nop for ext4, but for XFS it may
interact e.g. with inode reclaim trimming preallocated blocks) and we are
done.

> Anyway, I've pulled the fscker out of ->aio_write, ->write and ->splice_write;
> on that pathway it's in the do_splice_from() (see vfs.git#experimental).
>
> ... and now, for something *really* nasty: where do mandatory file locks
> belong in the locking hierarchy? Relative to fsfreeze one, for starters,
> but both for unionmount and overlayfs we need to decide where they live
> relative to ->i_mutex on directories.
Hum, interesting question :). Relative to fsfreeze, it doesn't seem to
matter much, does it? We don't actually lock / unlock these from write
paths needing freeze protection, we only wait for them. But maybe I miss
some ugly case you have in mind.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-22 17:38:03

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules


David Howells:
> Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> might operate, so it's possible that this is a false alarm. Maybe Jan Kara can
> illuminate further, so I've added him to the cc list.

It is related to the design of UnionMount, isn't it?
UnionMount is not a filesystem and doen't have its own superblock.
If it was a fs, then
- vfs_truncate() acquires sb_writers for the unioning-fs.
- the unioning-fs may call vfs_truncate() again for the underlying fs.
- this time, sb_writers is for the underlying fs which is a different
sb_writers object from the already acquired one.
So there would be no deadlock.

Still lockdep will produce the message since sb_writers doesn't have the
lockdep class. Of course, we can introduce the lock class for it, or call
lockdep_off()/on() simply in order to stop the message. But, as long as
the unioning feature is not implemented as a fs, the solution will not
be so easy. I am afraid UnionMount will need to introduce a new counter
(or a new flag) to indicate the task entered the union, and adjust the
lock class or decide to call lockdep_off() for sb_writers. I don't think
it is a good idea.


J. R. Okajima

2013-03-22 18:11:28

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote:
>
> David Howells:
> > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> > might operate, so it's possible that this is a false alarm. Maybe Jan Kara can
> > illuminate further, so I've added him to the cc list.
>
> It is related to the design of UnionMount, isn't it?
> UnionMount is not a filesystem and doen't have its own superblock.
> If it was a fs, then
> - vfs_truncate() acquires sb_writers for the unioning-fs.
> - the unioning-fs may call vfs_truncate() again for the underlying fs.
> - this time, sb_writers is for the underlying fs which is a different
> sb_writers object from the already acquired one.
> So there would be no deadlock.

Doesn't help the situation with copyup - witness overlayfs stepping into the
same deadlock on copyup. It wants ->i_mutex held on directory in upper layer
and it tries to write to file it has created in there. The problem is
with the upper layer superblock getting frozen; having a separate one for
union is irrelevant. Let me check how aufs does... Aha. Your
au_do_copy_file() ends up calling vfs_write() on the file opened in
upper layer. And AFAICS it's called with ->i_mutex held on the directory
in upper layer, so you've got the same deadlock, sorry.

2013-03-22 18:21:21

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Fri, Mar 22, 2013 at 06:11:11PM +0000, Al Viro wrote:
> On Sat, Mar 23, 2013 at 02:37:55AM +0900, J. R. Okajima wrote:
> >
> > David Howells:
> > > Now, looking at __sb_start_write(), I'm not entirely sure how the deadlock
> > > might operate, so it's possible that this is a false alarm. Maybe Jan Kara can
> > > illuminate further, so I've added him to the cc list.
> >
> > It is related to the design of UnionMount, isn't it?
> > UnionMount is not a filesystem and doen't have its own superblock.
> > If it was a fs, then
> > - vfs_truncate() acquires sb_writers for the unioning-fs.
> > - the unioning-fs may call vfs_truncate() again for the underlying fs.
> > - this time, sb_writers is for the underlying fs which is a different
> > sb_writers object from the already acquired one.
> > So there would be no deadlock.
>
> Doesn't help the situation with copyup - witness overlayfs stepping into the
> same deadlock on copyup. It wants ->i_mutex held on directory in upper layer
> and it tries to write to file it has created in there. The problem is
> with the upper layer superblock getting frozen; having a separate one for
> union is irrelevant. Let me check how aufs does... Aha. Your
> au_do_copy_file() ends up calling vfs_write() on the file opened in
> upper layer. And AFAICS it's called with ->i_mutex held on the directory
> in upper layer, so you've got the same deadlock, sorry.

The scenario, BTW, looks so:
process A does sb_start_write() (on your upper layer)
process B tries to freeze said upper layer and blocks, waiting for A to finish
process C grabs ->i_mutex in your upper layer
process C does vfs_write(), which blocks, since there's a pending attempt to
freeze
process A tries to grab ->i_mutex held by C and blocks

2013-03-23 02:49:17

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules


Al Viro:
> The scenario, BTW, looks so:
> process A does sb_start_write() (on your upper layer)
> process B tries to freeze said upper layer and blocks, waiting for A to finish
> process C grabs ->i_mutex in your upper layer
> process C does vfs_write(), which blocks, since there's a pending attempt to
> freeze
> process A tries to grab ->i_mutex held by C and blocks

According to latest mm/filemap.c:generic_file_aio_write(),
sb_start_write(inode->i_sb);
mutex_lock(&inode->i_mutex);
ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
mutex_unlock(&inode->i_mutex);
:::
sb_end_write(inode->i_sb);

Process C would block *BEFORE* i_mutex by sb_start_write()? No?
Honestly speaking I didn't pay attention about the freeze feature since
I don't use it. I am making aufs to support it now. But I don't know how
to test it...


J. R. Okajima

2013-03-23 04:41:58

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules

On Sat, Mar 23, 2013 at 11:49:11AM +0900, J. R. Okajima wrote:
>
> Al Viro:
> > The scenario, BTW, looks so:
> > process A does sb_start_write() (on your upper layer)
> > process B tries to freeze said upper layer and blocks, waiting for A to finish
> > process C grabs ->i_mutex in your upper layer
> > process C does vfs_write(), which blocks, since there's a pending attempt to
> > freeze
> > process A tries to grab ->i_mutex held by C and blocks
>
> According to latest mm/filemap.c:generic_file_aio_write(),
> sb_start_write(inode->i_sb);
> mutex_lock(&inode->i_mutex);
> ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
> mutex_unlock(&inode->i_mutex);
> :::
> sb_end_write(inode->i_sb);
>
> Process C would block *BEFORE* i_mutex by sb_start_write()? No?

Different ->i_mutex; you are holding one on the parent directory already.

That's the problem - you have ->i_mutex nested both inside that sucker (as
it ought to) and outside. Which tends to do bad things, obviously, in
particular because something like mkdir(2) will do sb_start_write() (from
mnt_want_write(), called by kern_path_create()) before grabbing directory
->i_mutex.

Thus the activity with lifting the bastard out of ->aio_write(), etc. in
vfs.git#experimental - *any* union-like variant will need the ability to
pull sb_start_write() outside of locking the parent directory on copyup.
And yes, it's a common prerequisite to anything doing copyups - aufs is
in the same boat as overlayfs and unionmount. Same deadlock for all three
of them.

2013-03-23 05:38:08

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH 2/9] vfs: export do_splice_direct() to modules


Al Viro:
> Different ->i_mutex; you are holding one on the parent directory already.

Let me make sure. In your scenario,
- processA writes something into the union, and the unioning fs operates
the writable layer. After sb_start_write() succeeds, processA should
not block by the reason of fsfreeze.
- processC causes the copyup.
The current aufs implementation holds parent->i_mutex on the writable
layer during the copyup. The parent->i_mutex can make processA
blocking.

Now I am considering the copyup approach you suggested in another mail,
and I am going to replace your
"unlink the target, re-link later, no dir lock during copyup"
by
"make it hidden instead of unlinking, rename the correct name
later, no dir lock during copyup"
since I am not sure all FSs can operate "->link with the unlinked
one". I guess most FS can handle it, but I don't want to make sure
everything particulary remote fs, journals.

To make a file "hidden", I guess I can use the aufs "doubley whiteouted"
approach. As you might know, aufs prepends the ".wh." prefix to the
filename as whiteout. With one more prefix, the name loses the role of
whiteout. Aufs simply ignores such doubly whiteouted name.
The demerit is that aufs has to limit the length of the name.


J. R. Okajima